Imagine having an effortless way to convert your favorite podcasts, YouTube videos, or conference recordings into readable text—or instantly translate them from one language to another without lifting a finger. Whether you’re a content creator, researcher, or simply someone looking to save time, OpenAI’s Whisper is a game-changer.
Whisper excels at automatic translation and transcription of audio files, thanks to its advanced neural architecture and extensive multilingual support. It can not only transcribe audio into text with impressive accuracy but also translate it into another language in one seamless step. From generating subtitles for your global audience to simplifying your post-production workflows, Whisper is built to deliver results right out of the box.
Whisper is a general-purpose speech recognition model released by OpenAI. It can:
- Transcribe audio files in various languages.
- Translate them into English (Whisper's built-in translate task targets English).
It offers a broad range of pre-trained model sizes (from tiny to large). The larger the model, the more accurate it tends to be but also the more computational resources it requires.
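Once Whisper is installed (covered in the next section), you can list the available model sizes directly from Python; available_models() is part of the standard openai-whisper package:

import whisper

# Prints the model names you can pass to load_model(),
# e.g. ['tiny', 'base', 'small', 'medium', 'large', ...]
print(whisper.available_models())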
In this guide, we’ll dive deep into setting up Whisper on Ubuntu using my custom installation script, then walk you through the process of transcribing and translating audio tracks. Whether you’re working with local files or video URLs, this solution has you covered. The new interface now includes two processing tabs—one for local files and another for video URLs—making it even more versatile and user-friendly.
The picture below shows my Gradio Web-App, which I’ve written to use OpenAI’s Whisper models to transcribe and translate audio files.
1. Prerequisites
At home I run an Ubuntu server with an NVIDIA RTX A6000, which is why I always describe my setup for Ubuntu.
Hardware and System Requirements
- Ubuntu (Tested on Ubuntu 20.04 and above; other Linux distros may also work).
- Python 3.7+ installed.
- Sufficient disk space to store the Whisper model files (the large-v2 model is about 2GB).
- An NVIDIA GPU (optional but recommended) with CUDA support for faster inference. If you don't have a GPU, Whisper will still work on CPU, albeit more slowly.
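If you are unsure whether PyTorch can see your GPU, a quick check from Python (assuming torch is already installed) looks like this:

import torch

# True means Whisper can run on the GPU; False means it falls back to CPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # Shows the detected card, e.g. the RTX A6000 used in this guide
    print(torch.cuda.get_device_name(0))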
Additional Dependencies
- FFmpeg for handling various audio/video formats.
- yt-dlp is a feature-rich command-line audio/video downloader with support for thousands of sites.
- Pip for installing Python packages.
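After the installation (section 2), you can verify that all three are available on your system:

ffmpeg -version
yt-dlp --version
pip --version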
2. Installation Using the Script
I’ve created a custom script to streamline Whisper installation on Ubuntu. The script is available in my GitHub repository for Installation Scripts for Generative AI Tools.
2.1 Cloning the Repository
Open your terminal and clone the repository:
git clone https://github.com/custom-build-robots/Installation-Scripts-for-Generative-AI-Tools.git
cd Installation-Scripts-for-Generative-AI-Tools
2.2 Running the Whisper Installation Script
Inside the cloned repository, you'll find install_whisper.sh. Make sure it's executable, then run it:
chmod +x install_whisper.sh
./install_whisper.sh
What Does the Script Do?
- Installs FFmpeg – This is crucial for audio/video processing.
- Installs Python dependencies – Ensures pip, torch, and whisper are available.
- Creates a folder to house your Whisper models.
Once the script completes, you should have a working Whisper environment on your Ubuntu system.
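If you would rather install everything by hand, the equivalent manual steps look roughly like this (a sketch only; the exact contents of install_whisper.sh may differ, and the models folder path is illustrative):

sudo apt update && sudo apt install -y ffmpeg
pip install -U openai-whisper torch yt-dlp
mkdir -p ~/whisper_offline/models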
3. Using Whisper: Two Processing Options
3.1 Local File Processing
You can upload audio or video files directly from your local machine. The Web-App will transcribe or translate the content based on your chosen settings. Supported formats include MP3, WAV, and MP4.
3.2 Video URL Processing
With the video URL tab, you can paste a YouTube or other video link. The app will download the video, process the audio, and output a transcription or translation. This is made possible by integrating yt-dlp and FFmpeg.
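For readers curious about what happens under the hood in this tab, a minimal sketch of the download step using yt-dlp's Python API could look like this (the option keys are standard yt-dlp; the URL and output name are placeholders):

import yt_dlp

def download_audio(url, out_name="audio"):
    # Grab the best audio stream and let FFmpeg convert it to MP3
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": out_name + ".%(ext)s",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
        }],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

download_audio("https://www.youtube.com/watch?v=...")

The resulting MP3 can then be fed straight into Whisper, exactly as with a locally uploaded file.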
3.3 Running the Sample Python Script
Below is the script I use to run Whisper, specifying a custom download directory for the model. This example demonstrates how to translate Japanese audio into English text. You can also transcribe audio if you don’t need language conversion.
Download: OpenAI-whisper-transcribe-or-translate-locally
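If you just want to see the core logic before downloading, here is a condensed sketch of how such a script typically looks with the openai-whisper Python API (variable names mirror the key points below; the input file name is illustrative):

import os
import whisper

model_path = "./models"       # custom download directory for the model files
model_name = "large-v2"       # base, small, medium, large-v2, ...
audio_file = "interview_japanese.mp3"

# Load the model; the .pt file is stored in (or read from) model_path
model = whisper.load_model(model_name, download_root=model_path)

# task="translate" produces English text; task="transcribe" keeps the source language
result = model.transcribe(audio_file, language="ja", task="translate")

# Save the text next to the audio file
txt_file = os.path.splitext(audio_file)[0] + ".txt"
with open(txt_file, "w", encoding="utf-8") as f:
    f.write(result["text"])
print("Saved:", txt_file)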
Key Points in This Script
- model_path and model_name: You can choose which Whisper model you want to use (e.g., base, small, medium, large-v2). Larger models = better accuracy but higher memory usage.
- language: Set this if you know the source language; it helps speed up processing and ensures better results when translating.
- task: If set to "transcribe", you get text in the original language. If set to "translate", you get text translated into English (the only target language Whisper supports).
- Saved Transcriptions: The transcribed/translated text is stored in a .txt file next to your audio file.
4. Verification and Troubleshooting
- Model Storage: Check that the model files (.pt) are actually in the models/ folder. If not, Whisper downloads them automatically.
- GPU Support: By default, Whisper will attempt to use a GPU if PyTorch detects one. If you want to force CPU usage, set the environment variable export CUDA_VISIBLE_DEVICES="" before running your script.
- Performance: If you're transcribing/translating large files or using big models, you might experience high memory usage. Consider smaller models (like medium or small) if resources are limited.
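As an alternative to the environment variable, you can also pin the device in code; the device parameter is part of whisper.load_model:

import torch
import whisper

# Use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("small", device=device)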
5. Autostart of the Local Whisper Service
Please create the following file as described below.
Command: sudo nano /etc/systemd/system/whisper.service
Now insert the following lines. Be sure to adjust the paths to where your Python program is located. Also note that I am using a virtual environment named venv_whisper in which my local Whisper service runs.
[Unit]
Description=Whisper Transcribe & Translate Gradio App
After=network.target
[Service]
User=ingmar
WorkingDirectory=/home/ingmar/whisper_offline
ExecStart=/bin/bash -c 'source /home/ingmar/whisper_offline/venv_whisper/bin/activate && python3 /home/ingmar/whisper_offline/whisper_gradio_app.py'
Restart=always
Environment=PYTHONUNBUFFERED=1
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Now save the whisper.service file after customizing it for your needs.
Set the correct permissions on the whisper.service file with the following command (unit files only need to be readable, not executable):
Command: sudo chmod 644 /etc/systemd/system/whisper.service
Next, execute the following three commands to set up the service in the Ubuntu system:
Command: sudo systemctl daemon-reload
Command: sudo systemctl enable whisper.service
Command: sudo systemctl start whisper.service
You can now check whether the service is running with the following command:
Command: sudo systemctl status whisper.service
From now on, the local Whisper service should be online, and you can access it in your browser.
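If the service does not come up as expected, the systemd journal is the first place to look:
Command: sudo journalctl -u whisper.service -f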
6. Conclusion
Whisper is a powerful tool for speech recognition and translation. By leveraging my installation script on Ubuntu, you can simplify the setup process. With the new two-tab Web-App interface, processing audio from local files or video URLs has never been easier. Explore Whisper’s capabilities today and unlock a world of transcription and translation possibilities!