Analyze images, videos, documents & audio with Gemini Tools and Qwen LLM Agent
Important notice
This workflow is provided as-is. Please review and test before using in production.
Overview
Analyze uploaded images, videos, audio, and documents with specialized tools, powered by a lightweight language-only agent.
What It Does
This workflow enables multimodal file analysis using Google Gemini tools connected to a text-only LLM agent. Users can upload images, videos, audio files, or documents via a chat interface. The workflow will:
- Upload each file to Google Gemini and obtain an accessible URL.
- Dynamically generate contextual prompts based on the file(s) and user message.
- Allow the agent to invoke Gemini tools for specific media types as needed.
- Return a concise, helpful response based on the analysis.
Use Cases
- Customer support: Let users upload screenshots, documents, or recordings and get helpful insights or summaries.
- Multimedia QA: Review visual, audio, or video content for correctness or compliance.
- Educational agents: Interpret content from PDFs, diagrams, or audio recordings on the fly.
- Low-cost multimodal assistants: Achieve multimodal functionality without relying on large vision-language models.
Why This Architecture Matters
Unlike end-to-end multimodal LLMs (like Gemini 1.5 or GPT-4o), this template:
- Uses a text-only LLM (Qwen 32B via Groq) for reasoning.
- Delegates media analysis to specialized Gemini tools.
Advantages
| Feature | Benefit |
|---|---|
| Modular | LLM and tools are decoupled and can be updated independently |
| Cost-Efficient | No need to pay for full multimodal models; tools are used only when needed |
| Tool-based Reasoning | Agent invokes tools on demand, in the spirit of Toolformer-style tool use |
| Fast | Groq-hosted LLMs respond with low latency |
| Memory | Includes context buffer for multi-turn chats (15 messages) |
How It Works
Input via Chat
- Users submit a message and (optionally) files via the `chatTrigger`.
File Handling
If no files are attached, the prompt is passed directly to the agent.
If files are included:
- Files are split and uploaded to Gemini one at a time, which returns a URL for each file.
- Metadata (name, type, URL) is collected and embedded into the prompt.
Prompt Construction
A new `chatInput` is dynamically generated:
- User message
- Media: [array of file data]
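As a rough illustration of this step, the sketch below assembles the enriched prompt from the user's message and the collected file metadata. It is not the workflow's actual node code: the function name `build_chat_input`, the dictionary keys, and the JSON serialization of the media array are assumptions made for the example.

```python
import json


def build_chat_input(user_message: str, files: list) -> str:
    """Return the prompt text handed to the agent.

    `files` is a list of dicts with the keys "name", "type", and "url"
    (hypothetical key names chosen for this sketch).
    """
    # No files: the user's message goes to the agent unchanged.
    if not files:
        return user_message

    # Keep only the three metadata fields the workflow description mentions.
    media = [
        {"name": f["name"], "type": f["type"], "url": f["url"]}
        for f in files
    ]

    # Append the metadata as a JSON array under a "Media:" label.
    return f"{user_message}\n\nMedia: {json.dumps(media)}"
```

Keeping the metadata as structured JSON inside the prompt lets a text-only model copy a file URL verbatim into a tool call instead of paraphrasing it.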
Agent Reasoning
The `Langchain Agent` receives:
- The enriched prompt
- File URLs
- Memory context (15 turns)
- Access to 4 Gemini tools:
  - `IMG`: analyze image
  - `VIDEO`: analyze video
  - `AUDIO`: analyze audio
  - `DOCUMENT`: analyze document
The agent autonomously decides whether and how to use tools, then responds with concise output.
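The memory context behaves like a sliding window over the conversation. The sketch below shows that idea with a fixed-size buffer; the class name `WindowMemory` is hypothetical, and it counts individual entries, which may differ from how the `memoryBufferWindow` node counts its 15-item window.

```python
from collections import deque


class WindowMemory:
    """Keep only the most recent `size` chat entries (illustrative only)."""

    def __init__(self, size: int = 15):
        # A deque with maxlen silently drops the oldest entry when full.
        self._entries = deque(maxlen=size)

    def add(self, role: str, text: str) -> None:
        """Record one chat entry, e.g. role "user" or "assistant"."""
        self._entries.append((role, text))

    def context(self) -> list:
        """Return the retained entries, oldest first."""
        return list(self._entries)
```

A bounded window keeps the prompt size predictable on long chats, at the cost of the agent forgetting files mentioned earlier than the window reaches.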
Nodes & Services
| Category | Node / Tool | Purpose |
|---|---|---|
| Chat Input | `chatTrigger` | User interface with file support |
| File Processing | `splitOut`, `splitInBatches` | Process each uploaded file |
| Upload | `googleGemini` | Uploads each file to Gemini, gets URL |
| Metadata | `set`, `aggregate` | Builds structured file info |
| AI Agent | `Langchain Agent` | Receives context + file data |
| Tools | `googleGeminiTool` | Analyze media with Gemini |
| LLM | `lmChatGroq` (Qwen 32B) | Text reasoning, high-speed |
| Memory | `memoryBufferWindow` | Maintains session context |
Setup Instructions
1. Required Credentials
- Groq API key (for Qwen 32B model)
- Google Gemini API key (PaLM / Gemini 1.5 tools)
2. Nodes That Need Setup
Replace existing credentials on:
- `Upload a file`
- Each `GeminiTool` (IMG, VIDEO, AUDIO, DOCUMENT)
- `lmChatGroq`
3. File Size & Format Considerations
- Some Gemini tools have file size or format restrictions.
- You may add validation nodes before uploading if needed.
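A pre-upload validation step might look like the sketch below. The 20 MB limit and the accepted MIME types are placeholder assumptions, not documented Gemini limits, so check the current Gemini documentation before relying on any specific value. The function name `validate_file` is also hypothetical.

```python
# Placeholder values for illustration; not official Gemini limits.
MAX_BYTES = 20 * 1024 * 1024
ALLOWED_PREFIXES = ("image/", "video/", "audio/")
ALLOWED_TYPES = {"application/pdf", "text/plain"}


def validate_file(mime_type: str, size_bytes: int) -> tuple:
    """Return (ok, reason) for a file about to be uploaded."""
    # Reject oversized files before spending an upload call on them.
    if size_bytes > MAX_BYTES:
        return (False, "file too large")

    # Accept whole media families by prefix, plus a few exact document types.
    if mime_type.startswith(ALLOWED_PREFIXES) or mime_type in ALLOWED_TYPES:
        return (True, "ok")

    return (False, f"unsupported type: {mime_type}")
```

Returning a reason string alongside the boolean lets the chat reply tell the user why a file was skipped.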
Optional Improvements
- Add logging and error handling (e.g., for upload failures).
- Add MIME-type filtering to choose the right tool explicitly.
- Extend to include OCR or transcription services pre-analysis.
- Integrate with Slack, Telegram, or WhatsApp for chat delivery.
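The MIME-type filtering idea above could be sketched as a small lookup that picks one of the four tool names (IMG, VIDEO, AUDIO, DOCUMENT) instead of leaving the choice to the agent. The function `pick_tool` and the set of types treated as documents are assumptions for illustration.

```python
from typing import Optional

# Top-level MIME families mapped to the workflow's tool names.
TOOL_BY_FAMILY = {"image": "IMG", "video": "VIDEO", "audio": "AUDIO"}

# Which exact types count as documents is an assumption of this sketch.
DOCUMENT_TYPES = {"application/pdf"}


def pick_tool(mime_type: str) -> Optional[str]:
    """Return the tool name for a MIME type, or None if nothing fits."""
    mime_type = mime_type.lower()
    family = mime_type.split("/", 1)[0]

    if family in TOOL_BY_FAMILY:
        return TOOL_BY_FAMILY[family]

    # Treat plain-text families and known document types as documents.
    if family == "text" or mime_type in DOCUMENT_TYPES:
        return "DOCUMENT"

    return None
```

Explicit routing trades some flexibility for predictability: the agent can no longer send a file to the wrong tool, but unusual file types need a deliberate entry in the mapping.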
Example Use Case
> "Hola, ¿qué dice este PDF?" ("Hi, what does this PDF say?")
Uploads a document → Agent routes it to Gemini DOCUMENT tool → Receives extracted content → LLM summarizes it in Spanish.
Tags
multimodal, agent, langchain, groq, gemini, image analysis, audio analysis, document parsing, video analysis, file uploader, chat assistant, LLM tools, memory, AI tools
Files
- This template is ready to use as-is in n8n.
- No external webhooks or integrations required.