Opening Problem Statement
Meet Anna, a documentary filmmaker who spends countless hours manually scripting voiceovers for her nature videos. For her latest project, she has a captivating 3-minute video of Indian street scenes. But writing a narration that truly matches the visual story requires carefully watching the footage multiple times and crafting the script line by line. This process is time-consuming, error-prone, and often delayed because she juggles many projects simultaneously.
Anna estimates she spends over 6 hours per video on scripting alone, with frequent rewrites adding to the frustration. What if she could automate the entire narration process and get a ready-to-use script and voiceover audio with minimal effort? This is exactly the scenario our n8n workflow tackles: turning a video into a narrated audio clip quickly and accurately.
What This Automation Does
When you run this workflow, several clearly defined processes occur that transform a raw video into a narrated audio file ready for use or sharing:
- Download a video file automatically from a public URL (in the demo case, a sample from Pixabay).
- Extract up to 90 evenly distributed video frames using Python and OpenCV to capture the essence of the visual story.
- Batch these frames in groups of 15 to manage token limits for language model processing.
- Generate a continuous narration script in the style of David Attenborough using OpenAI’s GPT-4o multimodal large language model (LLM) that understands images.
- Convert the full script into an MP3 voiceover using OpenAI’s text-to-speech (TTS) capabilities.
- Upload the resulting audio file to Google Drive for easy access and sharing.
This process saves hours of manual scripting and recording, allowing creators like Anna to focus on editing and creative direction.
Prerequisites ⚙️
- n8n account — set up and ready to build workflows.
- OpenAI API credentials — to use GPT-4o for script generation and TTS for audio creation.
- Google Drive account with API credentials — for uploading the final audio clip.
- Python environment in n8n Code node — for running OpenCV frame extraction.
- Ability to download videos via HTTP Request node.
Step-by-Step Guide
Step 1: Trigger the Workflow Manually
In n8n, add the Manual Trigger node labeled “When clicking ‘Test workflow’”. This node lets you kick off the process whenever you want.
After configuring, test the trigger by clicking “Execute Workflow”. You should see it successfully activate in the n8n editor.
Common mistake: Forgetting to connect this node to the next step will halt execution.
Step 2: Download the Source Video
Add the HTTP Request node named “Download Video”. Configure the URL to https://cdn.pixabay.com/video/2016/05/12/3175-166339863_small.mp4. Leave the method as GET.
This downloads the video to be processed. Run this node alone to verify it downloads without error.
Common mistake: Using an unsupported video format or URL will cause downstream errors in frame extraction.
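If you want to sanity-check the URL before wiring it into the node, a quick standalone script like the one below (a sketch using Python’s `requests` library, run outside n8n, not part of the workflow) confirms the link actually serves a video:

```python
# Standalone sanity check (not part of the workflow): verify the source URL
# serves an MP4 before pointing the HTTP Request node at it.
import requests

url = "https://cdn.pixabay.com/video/2016/05/12/3175-166339863_small.mp4"
response = requests.get(url, timeout=60)
response.raise_for_status()

# An HTML error page here would explain downstream frame-extraction failures.
print(response.headers.get("Content-Type"))            # expect video/mp4
print(f"{len(response.content) / 1_000_000:.1f} MB")   # rough size check
```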
Step 3: Extract Evenly Distributed Frames from Video
Insert a Code node (Python) named “Capture Frames” connected to the “Download Video” node. Here, we run a Python script using OpenCV to:
- Decode the base64 video data
- Load the video into OpenCV
- Calculate frame count to take up to 90 frames evenly spaced
- Convert each selected frame to a base64-encoded JPEG image string
- Output a list of these base64 frames
Copy/Paste the Python code provided in the workflow:
```python
import cv2
import base64

def extract_evenly_distributed_frames_from_base64(base64_string, max_frames=90):
    # Decode the base64 payload and write it to a temp file, since OpenCV
    # reads video from disk rather than from memory.
    video_bytes = base64.b64decode(base64_string)
    video_path = '/tmp/temp_video.mp4'
    with open(video_path, 'wb') as video_file:
        video_file.write(video_bytes)

    video_capture = cv2.VideoCapture(video_path)
    total_frames = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))

    # Choose a stride so that up to max_frames frames are evenly spaced
    # across the whole video.
    step_size = max(1, total_frames // (max_frames - 1))
    selected_frames_base64 = []

    for i in range(0, total_frames, step_size):
        video_capture.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = video_capture.read()
        if ret:
            frame_base64 = convert_frame_to_base64(frame)
            selected_frames_base64.append(frame_base64)
        if len(selected_frames_base64) >= max_frames:
            break

    video_capture.release()
    return selected_frames_base64

def convert_frame_to_base64(frame):
    # Encode a single frame as JPEG, then base64-encode it for JSON transport.
    ret, buffer = cv2.imencode('.jpg', frame)
    if not ret:
        return None
    return base64.b64encode(buffer).decode('utf-8')

# n8n's Python Code node exposes the current item as _input.item; the
# downloaded video arrives as base64 in the binary "data" property.
base64_video = _input.item.binary.data.data
frames_base64 = extract_evenly_distributed_frames_from_base64(base64_video, max_frames=90)
return {"output": frames_base64}
```

Execute this node. It may take 1-2 minutes for a 3 MB video.
Common mistake: Not having OpenCV installed in the Python environment, or not setting the node’s “Mode” to “Run Once for Each Item” (runOnceForEachItem), will cause errors.
Step 4: Split Extracted Frames into Individual Items
Add the Split Out node named “Split Out Frames” to separate the array of base64 frames into individual outputs for batch processing.
Run this node and check outputs show individual frames.
Step 5: Batch Frames for LLM Processing
Add a Split In Batches node “For Every 15 Frames” and set the batch size to 15. This allows sending chunks of images to OpenAI to adhere to token limits.
Common mistake: Setting batch size too large may hit LLM’s limits; too small will increase calls and cost.
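For intuition, the batching this node performs is equivalent to the following plain-Python chunking (illustrative only; you don’t add this code anywhere):

```python
# Illustration of what Split In Batches does: slice a list of frames into
# consecutive chunks of 15. "frames" is placeholder data.
def batch(items, size=15):
    for start in range(0, len(items), size):
        yield items[start:start + size]

frames = [f"frame_{i}" for i in range(90)]
batches = list(batch(frames))
print(len(batches), len(batches[0]))  # 6 batches of 15 frames each
```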
Step 6: Convert Base64 Frames to Binary Images
Use Convert To Binary node “Convert to Binary” to transform base64 strings into binary image files for resizing and LLM input.
Step 7: Resize Frames for Optimal Input
Add Edit Image node “Resize Frame”. Set width and height to 768px, format JPEG. This optimizes image size for OpenAI models.
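Under the hood, this is a plain image resize. If you wanted to replicate it outside n8n, a Pillow sketch would look like this (file names are hypothetical):

```python
# Approximation of the Edit Image node using Pillow. Note that forcing both
# dimensions to 768px ignores aspect ratio, matching the node's settings.
from PIL import Image

with Image.open("frame.jpg") as img:
    img.resize((768, 768)).save("frame_resized.jpg", format="JPEG")
```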
Step 8: Aggregate Resized Frames
Connect an Aggregate node “Aggregate Frames” to combine resized frame binaries into one payload for the OpenAI model.
Step 9: Generate Narration Script Using Multimodal LLM
Configure the Chain LLM node “Generate Narration Script” with a prompt asking for a short voiceover script in the style of David Attenborough. It takes the batched binary images as input and carries forward the previously generated parts to produce a cohesive script.
Important prompt snippet:
These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.
The node runs once per batch, generating a partial script each time while maintaining context, so the parts combine into a single continuous narration.
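To make the node’s behavior concrete, here is roughly the equivalent call sketched with the OpenAI Python SDK; `frame_batch` (a list of base64 JPEG strings) and `script_so_far` are hypothetical names standing in for what n8n passes between nodes:

```python
# Sketch of one batch's LLM call, assuming OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

def narrate_batch(frame_batch, script_so_far=""):
    # One text part carrying the prompt, plus one image part per frame.
    content = [{
        "type": "text",
        "text": (
            "These are frames of a video. Create a short voiceover script "
            "in the style of David Attenborough. Only include the narration. "
            "Continue seamlessly from this earlier narration: " + script_so_far
        ),
    }]
    for b64 in frame_batch:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```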
Step 10: Wait Node to Manage Rate Limits
Include a Wait node “Stay Within Service Limits” to prevent hitting API rate limits. Adjust or remove based on your OpenAI plan.
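A fixed Wait node is the simplest n8n-native approach. If you were orchestrating the same calls in code, exponential backoff is the usual alternative (illustrative sketch):

```python
# Retry a callable with exponential backoff: wait 1s, 2s, 4s, ... on failure.
import time

def call_with_backoff(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # in practice, catch the SDK's RateLimitError
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
```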
Step 11: Combine Text Scripts into Full Narration
Use an Aggregate node “Combine Script” to merge all partial scripts into one complete text output.
Step 12: Generate Voice Over Audio from Script
Add the OpenAI node “Use Text-to-Speech” configured to use the text-to-speech resource. Input is the combined text, output format is MP3.
Example input:
{{ $json.data.map(item => item.text).join('\n') }}
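Outside n8n, the same step looks like this with the OpenAI Python SDK (a sketch; `full_script` is a placeholder for the combined narration, and the voice is whichever one you configure in the node):

```python
# Sketch of the text-to-speech step, assuming OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

full_script = "In the bustling streets of India, life unfolds..."  # placeholder
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",   # pick any supported voice
    input=full_script,
)

# Write the MP3 bytes to disk.
with open("narration.mp3", "wb") as f:
    f.write(response.read())
```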
Step 13: Upload Audio File to Google Drive
Finally, add the Google Drive node “Upload to GDrive” to save the MP3 audio clip. Use dynamic naming like “narrating-video-using-vision-ai-20240725123000.mp3” and specify the target folder ID.
You can then access or share the audio file easily.
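For reference, the upload the node performs is roughly this call to the Google Drive v3 API via the official Python client (a sketch; `creds` and the folder ID are placeholders for your own OAuth credentials and target folder):

```python
# Sketch of a Drive v3 upload with google-api-python-client. Obtaining
# "creds" (OAuth credentials) is omitted; n8n handles this for you.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

drive = build("drive", "v3", credentials=creds)  # creds: your OAuth credentials

file_metadata = {
    "name": "narrating-video-using-vision-ai-20240725123000.mp3",
    "parents": ["YOUR_FOLDER_ID"],  # placeholder folder ID
}
media = MediaFileUpload("narration.mp3", mimetype="audio/mpeg")
uploaded = drive.files().create(
    body=file_metadata, media_body=media, fields="id"
).execute()
print("Uploaded file ID:", uploaded["id"])
```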
Customizations ✏️
- Change Narration Style: In the “Generate Narration Script” node, modify the prompt text from “David Attenborough” to any other narrator or style for different voiceover tone.
- Adjust Frame Count: In the “Capture Frames” Python node, change `max_frames=90` to a higher or lower value to trade detail against processing time.
- Batch Size Tuning: In “For Every 15 Frames”, modify `batchSize` for different token management. Larger batches reduce API calls but can hit limits.
- Change Upload Destination: Alter the “Upload to GDrive” node’s `folderId` to save audio in a different Google Drive folder, or use an alternate cloud storage integration.
- Skip Frame Resizing: Remove or adjust the “Resize Frame” node if you want full resolution frames processed by the model, noting increased payload size.
Troubleshooting 🔧
Problem: “Video frame extraction fails with OpenCV error.”
Cause: Video format unsupported or missing OpenCV Python dependencies.
Solution: Ensure the video URL links to an MP4 or another supported format, and verify that the Python environment has OpenCV installed.
Problem: “OpenAI API rate limit exceeded.”
Cause: Too many requests sent in quick succession.
Solution: Use the “Stay Within Service Limits” Wait node or increase your OpenAI plan.
Problem: “Google Drive upload fails with authorization error.”
Cause: Expired or wrong Google Drive OAuth credentials.
Solution: Re-authenticate Google Drive credentials in n8n and check folder permissions.
Pre-Production Checklist ✅
- Test video download URL for accessibility and format compatibility.
- Run Python code in Capture Frames node standalone to confirm frame extraction.
- Validate OpenAI API credentials and quota.
- Check Google Drive connection and folder ID correctness.
- Execute the workflow with a small video sample first to verify timing and outputs.
- Back up existing generated files before batch runs.
Deployment Guide
Once tested, activate the workflow in n8n. Use manual trigger or schedule as needed for batch processing projects.
Monitor the execution logs in n8n’s editor for any API errors or performance issues. Adjust the Wait node timing for rate compliance.
You can self-host n8n using platforms like Hostinger for more control and scalability.
FAQs
Can I use a different video source?
Yes, any publicly accessible URL to an MP4 (or another format OpenCV supports) should work.
Does this consume a lot of OpenAI credits?
Frame batching and script generation use tokens in proportion to video length and batch size, so monitor usage accordingly.
Is the audio file secure?
Files are uploaded to your Google Drive account under your control, ensuring privacy and security.
Can I increase the number of frames for better narration?
Yes, but be mindful of processing time and memory usage. The default of 90 frames balances detail and performance.
Conclusion
By following this detailed tutorial, you’ve created an automated video narration pipeline that downloads a video, extracts key frames, leverages OpenAI’s multimodal GPT-4o to write an engaging script, and converts it to a professional voiceover audio clip — all deployed with n8n workflow automation.
This saves filmmakers, marketers, and content creators hours of tedious scripting and recording, accelerating project turnaround and freeing time for creative work.
Next steps? Try customizing narration style, adding subtitles from the script, or incorporating different AI voices for varied effects. Automation with n8n and OpenAI opens many creative doors!