STEP 1. Understand what you’re building
You are creating a system where:
→ Text, images, videos, documents live in one database
→ All data gets embedded into the same vector space
→ AI retrieves the most relevant pieces before answering
This is RAG (Retrieval-Augmented Generation)
✦ Key shift
Old systems handled text only
New systems handle meaning across formats
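The core idea above can be sketched in a few lines: every item, whatever its format, becomes a vector in one shared space, and retrieval is just nearest-neighbor search. This toy version uses hand-made 3-d vectors in place of real Gemini embeddings, purely for illustration:

```python
# Minimal sketch of the RAG retrieval step, with toy 3-d vectors
# standing in for real embeddings (illustration only).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

# One shared vector space for every modality: each entry is
# (embedding, payload), regardless of original format.
index = [
    ([0.9, 0.1, 0.0], "text: how to clean the filter"),
    ([0.1, 0.9, 0.0], "image: exploded parts diagram"),
    ([0.0, 0.1, 0.9], "video: 60s maintenance walkthrough"),
]

def retrieve(query_vec, k=1):
    # Rank all stored items by similarity to the query vector.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [payload for _, payload in ranked[:k]]

# A query about cleaning embeds near the first item, so that item is
# retrieved and handed to the chat model as context before answering.
print(retrieve([0.8, 0.2, 0.1]))  # → ['text: how to clean the filter']
```

In the real system, Gemini produces the vectors and Pinecone does the search, but the flow is the same: embed, rank, retrieve, then generate.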
STEP 2. Set up your tools
You need these tools:
- Gemini API → Used for embeddings → Get from Google AI Studio
- Pinecone → Your vector database → Stores embeddings
- OpenRouter or model provider → Used for chat responses
- Visual Studio Code → Your working environment
- Claude Code → Builds everything for you
STEP 3. Create your project
→ Open VS Code
→ Install Claude Code extension
→ Open a new folder
Now open Claude Code panel
Switch to plan mode
Paste the documentation link for Gemini embeddings
Then prompt:
“Build a multimodal RAG system using Gemini Embedding 2 and Pinecone.
Create env file placeholders for API keys.
Support text, images, and videos.”
Claude Code will generate:
→ Project structure
→ Dependencies
→ Step-by-step plan
Accept it
STEP 4. Add API keys
In your env file, add:
→ Gemini API key
→ Pinecone API key
→ OpenRouter or model key
Save the file
That’s it for setup
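The env file can be as simple as this (variable names are placeholders; match whatever names Claude Code actually generates):

```
GEMINI_API_KEY=your-gemini-key
PINECONE_API_KEY=your-pinecone-key
OPENROUTER_API_KEY=your-openrouter-key
```

Keep this file out of version control, since it holds live credentials.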
STEP 5. Add your data
Create a “data” folder
Drop in anything:
→ PDFs
→ Images
→ Videos
→ Text files
No need to organize perfectly
The system handles classification
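Under the hood, classification can be as simple as routing by file extension. This is a hypothetical sketch, not the generated code; the mapping and function name are assumptions:

```python
# Hypothetical sketch of how ingestion might route files by type
# before picking an embedding strategy per modality.
from pathlib import Path

MODALITY = {
    ".pdf": "document", ".txt": "document", ".md": "document",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".mp4": "video", ".mov": "video",
}

def classify(path: str) -> str:
    # Unknown extensions fall through so they can be skipped or logged.
    return MODALITY.get(Path(path).suffix.lower(), "unknown")

print(classify("data/manual.pdf"))  # → document
```

This is why the data folder doesn't need to be organized: each file declares its own type.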
STEP 6. Run ingestion
Prompt Claude Code:
“Process all files and store embeddings in Pinecone.
Then build a simple chat app.”
What happens behind the scenes:
→ Files get chunked
→ Gemini creates embeddings
→ Data stored in Pinecone
→ Metadata added
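The chunking step above can be sketched as overlapping word windows; each chunk is then embedded and upserted to Pinecone with its metadata. Sizes here are illustrative, not what the generated pipeline uses:

```python
# Toy version of the chunking step: split text into overlapping
# word windows before embedding (window sizes are illustrative).

def chunk_words(text, size=50, overlap=10):
    words = text.split()
    step = size - overlap  # overlap keeps context across chunk edges
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(120))
chunks = chunk_words(doc)
print(len(chunks))  # → 3
```

Overlap matters: without it, a sentence cut at a chunk boundary loses half its meaning in both halves.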
✦ This is where older tools like n8n get messy
Manual chunking
Separate pipelines
Frequent failures
Here, it runs in one flow
STEP 7. Test your system
Claude Code builds a local app
You open localhost
Now test queries:
→ “How do I clean the filter?”
↳ Returns steps + images from PDF
→ “What are the parts?”
↳ Pulls multiple sections + diagrams
→ Upload an image
↳ Finds similar entries in database
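At query time, each of those answers follows the same path: embed the question, fetch the top matches, and pack them into the prompt the chat model sees. A sketch of the prompt-assembly step, where the retrieved matches are stand-ins for real Pinecone results:

```python
# Sketch of the answer path: retrieved matches become grounded
# context for the chat model. `matches` stands in for real
# Pinecone query results.

def build_prompt(question, matches):
    context = "\n".join(f"- {m}" for m in matches)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

matches = [
    "manual.pdf step 3: rinse the filter under cold water",
    "parts-diagram.png: filter housing, item 7",
]
prompt = build_prompt("How do I clean the filter?", matches)
print(prompt.splitlines()[0])  # → Answer using only the context below.
```

Grounding the model in retrieved context is what keeps answers tied to your files instead of the model's general knowledge.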
STEP 8. Improve retrieval quality
By default:
→ Images and videos are stored as plain text descriptions
To improve:
Ask Claude Code:
“Add richer metadata descriptions for images and videos.
Update the app to display media inline.”
Now your system:
→ Shows images
→ Plays videos
→ Gives richer results
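One way the upserted records might look once descriptions and media links are added. Field names here are illustrative, not the schema Claude Code generates:

```python
# Illustrative record shape: the description is what retrieval
# matches on; the uri is what lets the app render media inline.

def make_record(item_id, vector, modality, description, uri):
    return {
        "id": item_id,
        "values": vector,  # the embedding
        "metadata": {
            "modality": modality,        # "text" | "image" | "video"
            "description": description,  # richer text = better recall
            "uri": uri,                  # path the chat app displays
        },
    }

rec = make_record("img-001", [0.1, 0.2], "image",
                  "Exploded diagram of the filter assembly",
                  "data/diagram.png")
print(rec["metadata"]["modality"])  # → image
```

Because retrieval matches on the description, rewriting a vague caption like "diagram" into "exploded diagram of the filter assembly" directly improves results.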
STEP 9. Understand limitations
Current constraints:
→ Video length limit around 120 seconds
→ Image batch limits
→ Quality depends on metadata
✦ Important
Better descriptions = better retrieval
STEP 10. Real use cases
- Instruction manuals → Chat with complex PDFs → Get visual answers
- Service businesses → Upload project images → Retrieve similar jobs with pricing
- Internal knowledge bases → Mix documents, videos, images → One unified search
STEP 11. What changed
Before:
→ Complex n8n pipelines
→ Manual configuration
→ Fragile systems
Now:
→ Describe system in plain language
→ AI builds it
→ You refine outputs
Mini insight
This build took under 30 minutes
Earlier versions took hours or days
