Easy video transcription and subtitling with Whisper, FFmpeg, and Python
Updated: April 4th, 2024
Videos are not just a source of entertainment; they are a crucial tool for content creators, educators, and businesses alike. Enhancing your videos with accurate transcriptions and subtitles can significantly improve accessibility and viewer engagement. This guide will walk you through transcribing your video with the OpenAI Whisper model and adding subtitles with the powerful FFmpeg tool.
Required tools:
Before we set sail, let's ensure your toolkit is ready:
- Python: Ensure it's installed on your machine for the coding magic.
- FFmpeg: A cornerstone for handling video files. If it's missing from your toolbox, it's time for a quick setup.
Set up your workspace
- First, you’ll want to create a dedicated workspace:
mkdir open-ai-whisper-ffmpeg
- Navigate into your new project domain and conjure a virtual environment to keep things neat:
cd open-ai-whisper-ffmpeg
python3 -m venv .venv
source .venv/bin/activate
- Install the required packages for OpenAI’s Whisper:
pip install git+https://github.com/m-bain/whisperx.git
Transcribe your video
- First, create a new Python file, main.py:
touch main.py
- Paste the code below into main.py:
from datetime import timedelta
import os

import whisperx


def transcribe_video(input_video):
    batch_size = 32
    compute_type = "float32"
    device = "cpu"

    model = whisperx.load_model("large-v2", device=device, compute_type=compute_type)
    audio = whisperx.load_audio(input_video)
    result = model.transcribe(audio, batch_size=batch_size, language="en")

    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
    segments = result["segments"]

    srtFilename = "subtitles.srt"
    # If an SRT file already exists, delete it so we start fresh
    if os.path.exists(srtFilename):
        os.remove(srtFilename)

    for index, segment in enumerate(segments):
        # SRT timestamps use the form HH:MM:SS,mmm
        startTime = "0" + str(timedelta(seconds=int(segment["start"]))) + ",000"
        endTime = "0" + str(timedelta(seconds=int(segment["end"]))) + ",000"
        text = segment["text"].strip()
        print(text)

        srtEntry = f"{index + 1}\n{startTime} --> {endTime}\n{text}\n\n"
        with open(srtFilename, "a", encoding="utf-8") as srtFile:
            srtFile.write(srtEntry)

    return srtFilename


def main():
    input_video_path = "input.mp4"
    transcribe_video(input_video_path)


main()
Let’s examine what we’re doing in the code above. In these lines, we import the required packages: whisperx to load the Whisper model, os to manage the subtitles file path, and timedelta to format the timestamps:
from datetime import timedelta
import os

import whisperx
Here, we define a function that takes an input video, loads the "large-v2" Whisper model, specifies a compute type, and configures the model to use the CPU instead of a GPU. After that, the function loads the video's audio and transcribes it. Finally, it aligns the model's results and returns the text with timestamps:
def transcribe_video(input_video):
    batch_size = 32
    compute_type = "float32"
    device = "cpu"

    model = whisperx.load_model("large-v2", device=device, compute_type=compute_type)
    audio = whisperx.load_audio(input_video)
    result = model.transcribe(audio, batch_size=batch_size, language="en")

    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
    segments = result["segments"]
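The code above hard-codes the CPU and float32 for portability. If a CUDA GPU is available, whisperx runs much faster with float16. A minimal sketch of picking the device at runtime (the pick_device helper is our own, not part of whisperx, and it assumes torch is installed as a whisperx dependency):

```python
def pick_device():
    """Return a (device, compute_type) pair for whisperx.load_model.

    Prefers CUDA with float16 when a GPU is visible; falls back to
    CPU with float32 otherwise.
    """
    try:
        import torch  # pulled in as a whisperx dependency
        if torch.cuda.is_available():
            return "cuda", "float16"
    except ImportError:
        pass
    return "cpu", "float32"


device, compute_type = pick_device()
```

You could then pass these two values straight into whisperx.load_model in place of the hard-coded strings.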
After that, the function loops through the segments in the model's results, converts each one into the .srt format, and appends it to a subtitles.srt file:
srtFilename = "subtitles.srt"
# If an SRT file already exists, delete it so we start fresh
if os.path.exists(srtFilename):
    os.remove(srtFilename)

for index, segment in enumerate(segments):
    # SRT timestamps use the form HH:MM:SS,mmm
    startTime = "0" + str(timedelta(seconds=int(segment["start"]))) + ",000"
    endTime = "0" + str(timedelta(seconds=int(segment["end"]))) + ",000"
    text = segment["text"].strip()
    print(text)

    srtEntry = f"{index + 1}\n{startTime} --> {endTime}\n{text}\n\n"
    with open(srtFilename, "a", encoding="utf-8") as srtFile:
        srtFile.write(srtEntry)

return srtFilename
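The timestamp construction above is worth unpacking: str(timedelta(seconds=...)) yields a string like 0:01:15, the leading "0" pads it to SRT's HH:MM:SS form, and ",000" appends the milliseconds field (always zero here, since int() drops the fractional seconds). A small standalone check of the same logic:

```python
from datetime import timedelta


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm),
    truncating fractional seconds exactly as the loop above does."""
    return "0" + str(timedelta(seconds=int(seconds))) + ",000"


print(srt_timestamp(75.4))    # 00:01:15,000
print(srt_timestamp(3671.9))  # 01:01:11,000
```

Note that this approach loses sub-second precision; for tighter subtitle timing you could format the fractional part of segment["start"] into the milliseconds field instead of hard-coding ",000".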
Adding subtitles to a video
- Now, update main.py with the following code:
from datetime import timedelta
import os
import subprocess

import whisperx


def transcribe_video(input_video):
    batch_size = 32
    compute_type = "float32"
    device = "cpu"

    model = whisperx.load_model("large-v2", device=device, compute_type=compute_type)
    audio = whisperx.load_audio(input_video)
    result = model.transcribe(audio, batch_size=batch_size, language="en")

    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
    segments = result["segments"]

    srtFilename = "subtitles.srt"
    # If an SRT file already exists, delete it so we start fresh
    if os.path.exists(srtFilename):
        os.remove(srtFilename)

    for index, segment in enumerate(segments):
        # SRT timestamps use the form HH:MM:SS,mmm
        startTime = "0" + str(timedelta(seconds=int(segment["start"]))) + ",000"
        endTime = "0" + str(timedelta(seconds=int(segment["end"]))) + ",000"
        text = segment["text"].strip()
        print(text)

        srtEntry = f"{index + 1}\n{startTime} --> {endTime}\n{text}\n\n"
        with open(srtFilename, "a", encoding="utf-8") as srtFile:
            srtFile.write(srtEntry)

    return srtFilename


def add_srt_to_video(input_video, output_file):
    subtitles_file = "subtitles.srt"

    # Burn the subtitles into the video, styling them via force_style
    ffmpeg_command = (
        f"ffmpeg -i {input_video} "
        f"-vf \"subtitles={subtitles_file}:force_style='FontName=Arial,FontSize=10,"
        "PrimaryColour=&HFFFFFF,OutlineColour=&H000000,BorderStyle=3,Outline=1,"
        "Shadow=1,Alignment=2,MarginV=10'\" "
        f"-c:a copy {output_file} -y"
    )

    # Run the FFmpeg command
    subprocess.run(ffmpeg_command, shell=True)


def main():
    input_video_path = "input.mp4"
    output_file = "output.mp4"

    transcribe_video(input_video_path)
    add_srt_to_video(input_video_path, output_file)


main()
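One note on the FFmpeg invocation above: building the command as a single string with shell=True is fragile when file names contain spaces or quotes. A sketch of an alternative that builds an argument list and skips the shell entirely (build_ffmpeg_args is our own helper, with the same flags as the command above):

```python
FORCE_STYLE = (
    "FontName=Arial,FontSize=10,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,"
    "BorderStyle=3,Outline=1,Shadow=1,Alignment=2,MarginV=10"
)


def build_ffmpeg_args(input_video, output_file, subtitles_file="subtitles.srt"):
    """Build the FFmpeg argument list for burning subtitles into a video."""
    return [
        "ffmpeg",
        "-i", input_video,
        "-vf", f"subtitles={subtitles_file}:force_style='{FORCE_STYLE}'",
        "-c:a", "copy",
        output_file,
        "-y",
    ]


# To execute: subprocess.run(build_ffmpeg_args("input.mp4", "output.mp4"), check=True)
```

Passing check=True also raises an exception if FFmpeg exits with an error, instead of failing silently.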
Finally, we load subtitles.srt into the video using FFmpeg, which burns the subtitles into the video as text.
Here’s a sample video of this project:
And there you have it, a step-by-step guide to transforming your video into a masterpiece of clarity and engagement. Whether you're aiming to make your content more accessible or simply looking to add a professional touch, these tools empower you to achieve your goals.