Imagine having an effortless way to convert your favorite podcasts, YouTube videos, or conference recordings into readable text—or instantly translate them from one language to another without lifting a finger. Whether you’re a content creator, researcher, or simply someone looking to save time, OpenAI’s Whisper is a game-changer.
Whisper excels at automatic translation and transcription of audio files, thanks to its advanced neural architecture and extensive multilingual support. It can not only transcribe audio into text with impressive accuracy but also translate it into another language in one seamless step. From generating subtitles for your global audience to simplifying your post-production workflows, Whisper is built to deliver results right out of the box.
Whisper is a general-purpose speech recognition model released by OpenAI. It can:
- Transcribe audio files in various languages.
- Translate them into English (Whisper's built-in translate task targets English).
It offers a broad range of pre-trained model sizes (from tiny to large). The larger the model, the more accurate it tends to be but also the more computational resources it requires.
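Once Whisper is installed (covered in the next section), you can list the available model sizes directly from Python; available_models() is part of the standard openai-whisper package:

import whisper

# Prints the model names you can pass to load_model(),
# e.g. ['tiny', 'base', 'small', 'medium', 'large', ...]
print(whisper.available_models())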
In this guide, we’ll dive deep into setting up Whisper on Ubuntu using my custom installation script, then walk you through the process of transcribing and translating audio tracks. Whether you’re working with local files or video URLs, this solution has you covered. The new interface now includes two processing tabs—one for local files and another for video URLs—making it even more versatile and user-friendly.
The picture below shows my Gradio Web-App, which I’ve written to use OpenAI’s Whisper models to transcribe and translate audio files.
1. Prerequisites
At home I run an Ubuntu server with an NVIDIA RTX A6000, which is why I always describe my setup for Ubuntu.
Hardware and System Requirements
- Ubuntu (Tested on Ubuntu 20.04 and above; other Linux distros may also work).
- Python 3.7+ installed.
- Sufficient disk space to store the Whisper model files (the large-v2 model is about 2GB).
- An NVIDIA GPU (optional but recommended) with CUDA support for faster inference. If you don't have a GPU, Whisper will still work on CPU, albeit more slowly.
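If you are unsure whether PyTorch can see your GPU, a quick check from Python (assuming torch is already installed) looks like this:

import torch

# True means Whisper can run on the GPU; False means it falls back to CPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # Shows the detected card, e.g. the RTX A6000 used in this guide
    print(torch.cuda.get_device_name(0))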
Additional Dependencies
- FFmpeg for handling various audio/video formats.
- yt-dlp is a feature-rich command-line audio/video downloader with support for thousands of sites.
- Pip for installing Python packages.
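After the installation (section 2), you can verify that all three are available on your system:

ffmpeg -version
yt-dlp --version
pip --version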
2. Installation Using the Script
I’ve created a custom script to streamline Whisper installation on Ubuntu. The script is available in my GitHub repository for Installation Scripts for Generative AI Tools.
2.1 Cloning the Repository
Open your terminal and clone the repository:
git clone https://github.com/custom-build-robots/Installation-Scripts-for-Generative-AI-Tools.git
cd Installation-Scripts-for-Generative-AI-Tools
2.2 Running the Whisper Installation Script
Inside the cloned repository, you'll find install_whisper.sh. Make sure it's executable, then run it:
chmod +x install_whisper.sh
./install_whisper.sh
What Does the Script Do?
- Installs FFmpeg – This is crucial for audio/video processing.
- Installs Python dependencies – Ensures pip, torch, and whisper are available.
- Creates a folder to house your Whisper models.
Once the script completes, you should have a working Whisper environment on your Ubuntu system.
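If you would rather install everything by hand, the equivalent manual steps look roughly like this (a sketch only; the exact contents of install_whisper.sh may differ, and the models folder path is illustrative):

sudo apt update && sudo apt install -y ffmpeg
pip install -U openai-whisper torch yt-dlp
mkdir -p ~/whisper_offline/models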
3. Using Whisper: Two Processing Options
3.1 Local File Processing
You can upload audio or video files directly from your local machine. The Web-App will transcribe or translate the content based on your chosen settings. Supported formats include MP3, WAV, and MP4.
3.2 Video URL Processing
With the video URL tab, you can paste a YouTube or other video link. The app will download the video, process the audio, and output a transcription or translation. This is made possible by integrating yt-dlp and FFmpeg.
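For readers curious about what happens under the hood in this tab, a minimal sketch of the download step using yt-dlp's Python API could look like this (the option keys are standard yt-dlp; the URL and output name are placeholders):

import yt_dlp

def download_audio(url, out_name="audio"):
    # Grab the best audio stream and let FFmpeg convert it to MP3
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": out_name + ".%(ext)s",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
        }],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

download_audio("https://www.youtube.com/watch?v=...")

The resulting MP3 can then be fed straight into Whisper, exactly as with a locally uploaded file.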
3.3 Running the Sample Python Script
Below is the script I use to run Whisper, specifying a custom download directory for the model. This example demonstrates how to translate Japanese audio into English text. You can also transcribe audio if you don’t need language conversion.
Download: OpenAI-whisper-transcribe-or-translate-locally
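If you just want to see the core logic before downloading, here is a condensed sketch of how such a script typically looks with the openai-whisper Python API (variable names mirror the key points below; the input file name is illustrative):

import os
import whisper

model_path = "./models"       # custom download directory for the model files
model_name = "large-v2"       # base, small, medium, large-v2, ...
audio_file = "interview_japanese.mp3"

# Load the model; the .pt file is stored in (or read from) model_path
model = whisper.load_model(model_name, download_root=model_path)

# task="translate" produces English text; task="transcribe" keeps the source language
result = model.transcribe(audio_file, language="ja", task="translate")

# Save the text next to the audio file
txt_file = os.path.splitext(audio_file)[0] + ".txt"
with open(txt_file, "w", encoding="utf-8") as f:
    f.write(result["text"])
print("Saved:", txt_file)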
Key Points in This Script
- model_path and model_name: You can choose which Whisper model you want to use (e.g., base, small, medium, large-v2). Larger models = better accuracy but higher memory usage.
- language: Set this if you know the source language; it helps speed up processing and ensures better results when translating.
- task: If set to "transcribe", you get text in the original language. If set to "translate", you get text translated into English (the only target language Whisper supports).
- Saved Transcriptions: The transcribed/translated text is stored in a .txt file next to your audio file.
4. Verification and Troubleshooting
- Model Storage: Check that the model files (.pt) are actually in the models/ folder. If not, Whisper downloads them automatically.
- GPU Support: By default, Whisper will attempt to use a GPU if PyTorch detects one. If you want to force CPU usage, set the environment variable export CUDA_VISIBLE_DEVICES="" before running your script.
- Performance: If you're transcribing/translating large files or using big models, you might experience high memory usage. Consider smaller models (like medium or small) if resources are limited.
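As an alternative to the environment variable, you can also pin the device in code; the device parameter is part of whisper.load_model:

import torch
import whisper

# Use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("small", device=device)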
5. Autostart of the Local Whisper Service
Please create the following file as described below.
Command: sudo nano /etc/systemd/system/whisper.service
Now insert the following lines. Be sure to adjust the paths to where your Python program is located. Also note that I am using a virtual environment named venv_whisper in which my local Whisper service runs.
[Unit]
Description=Whisper Transcribe & Translate Gradio App
After=network.target
[Service]
User=ingmar
WorkingDirectory=/home/ingmar/whisper_offline
ExecStart=/bin/bash -c 'source /home/ingmar/whisper_offline/venv_whisper/bin/activate && python3 /home/ingmar/whisper_offline/whisper_gradio_app.py'
Restart=always
Environment=PYTHONUNBUFFERED=1
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Now save the whisper.service file after customizing it for your needs.
Set the correct permissions on the whisper.service file with the following command (unit files only need to be readable, not executable):
Command: sudo chmod 644 /etc/systemd/system/whisper.service
Next, execute the following three commands to set up the service in the Ubuntu system:
Command: sudo systemctl daemon-reload
Command: sudo systemctl enable whisper.service
Command: sudo systemctl start whisper.service
You can now check whether the service is running with the following command:
Command: sudo systemctl status whisper.service
From now on, the local Whisper service should be online, and you can access it in your browser.
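If the service does not come up as expected, the systemd journal is the first place to look:
Command: sudo journalctl -u whisper.service -f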
6. Conclusion
Whisper is a powerful tool for speech recognition and translation. By leveraging my installation script on Ubuntu, you can simplify the setup process. With the new two-tab Web-App interface, processing audio from local files or video URLs has never been easier. Explore Whisper’s capabilities today and unlock a world of transcription and translation possibilities!