Ingest and search Cloudflare R2 media with Gemini, Groq Whisper, and Supabase

1. Workflow Overview

Quick overview: This workflow ingests images, PDFs, and videos from a Cloudflare R2 folder, uses Google Gemini to analyze the PDFs, images, and videos, and uses Groq STT (Whisper) for video transcripts to generate ...

Best for

Document Extraction automation workflows
AI RAG automation workflows
advanced n8n builders looking for reusable templates

Tools used

n8n-nodes-base.stickynote, n8n-nodes-base.set, @n8n/n8n-nodes-langchain.embeddingsgooglegemini, @n8n/n8n-nodes-langchain.documentdefaultdataloader, @n8n/n8n-nodes-langchain.textsplittercharactertextsplitter, @n8n/n8n-nodes-langchain.vectorstoresupabase, n8n-nodes-base.httprequest, n8n-nodes-base.webhook

Source and attribution

This workflow is cataloged by N8N Workflows and links back to its original n8n.io source page by Dave Sartori.

1.1 Workflow description

Title: Ingest and search Cloudflare R2 media with Gemini, Groq Whisper, and Supabase
Workflow name: Ingest and search Cloudflare R2 media with Gemini, Groq Whisper, and Supabase

Quick overview

This workflow ingests images, PDFs, and videos from a Cloudflare R2 folder. It uses Google Gemini to analyze the PDFs, images, and videos and Groq speech-to-text (Whisper) to transcribe video audio, generates searchable descriptions and tags, and stores the embeddings in a Supabase pgvector table.

How it works

Receives a webhook request containing a Cloudflare R2 bucket and folder URL, then lists the objects in that folder.
Filters to supported file types, builds public CDN URLs and timestamps, and routes each item as an image, PDF, or video.
For images, calls Google Gemini with the image URL to generate structured metadata (summary, detailed description, tags, and scores).
For PDFs, calls Google Gemini to analyze the document URL and return the same structured metadata.
For videos, downloads each file locally, extracts representative frames with FFmpeg for Google Gemini visual analysis, extracts audio, transcribes it with Groq Whisper, and tags transcript chunks with Groq Llama.
Normalizes results into a single text “content” field plus JSON metadata, generates Google Gemini embeddings, and inserts the vectors into Supabase (pgvector).
Receives a separate webhook query, retrieves the most similar items from Supabase using embeddings, and returns ranked matches in the webhook response.
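The filter-and-route step above can be sketched in a few lines of Python. This is only an illustration, not the workflow's own code: the extension list, the `route_objects` helper, the output field names, and the `cdn.example.com` base URL are all assumptions made for the example.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import quote

# Assumed extension-to-route mapping; the workflow's real filter list may differ.
ROUTES = {
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".webp": "image",
    ".pdf": "pdf",
    ".mp4": "video", ".mov": "video", ".webm": "video",
}


def route_objects(keys, cdn_base):
    """Keep supported files and attach a public URL, a timestamp, and a route."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    items = []
    for key in keys:
        route = ROUTES.get(PurePosixPath(key).suffix.lower())
        if route is None:
            continue  # unsupported file type: dropped, as in the filter step
        items.append({
            "key": key,
            # Percent-encode the object key but keep "/" as a path separator.
            "url": f"{cdn_base.rstrip('/')}/{quote(key)}",
            "route": route,
            "ingested_at": ingested_at,
        })
    return items


if __name__ == "__main__":
    sample = ["media/cat.PNG", "media/report.pdf", "media/clip one.mp4", "media/notes.txt"]
    for item in route_objects(sample, "https://cdn.example.com/"):
        print(item["route"], item["url"])
```

Each returned item carries a `route` value that a downstream switch could use to pick the image, PDF, or video branch.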

Setup

Create a Cloudflare R2 bucket with publicly accessible object URLs, and add Cloudflare R2 credentials in n8n.
Set up a Supabase project with pgvector enabled and a table named vec10, then add Supabase credentials in n8n.
Add Google Gemini credentials (Google PaLM/Gemini API) for embeddings and provide an HTTP Header Auth credential for the Gemini HTTP requests.
Set the GROQ_API_KEY environment variable for the Groq Whisper transcription and Llama tag extraction calls.
If you enable video processing, install curl, ffmpeg, and ffprobe on the n8n host and update the local directory paths (temp root, frames directory, and video directory) in the workflow inputs.
Copy the ingest webhook (/vector-ingest) and query webhook (/vector-query) URLs and configure your upstream app to send the expected JSON payloads.
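Before enabling video processing, it may help to confirm that the host prerequisites from the setup list are in place. The sketch below is a hypothetical helper, not part of the workflow; it only checks that `curl`, `ffmpeg`, and `ffprobe` are on the `PATH` and that `GROQ_API_KEY` is set, and it should be run in the same environment as the n8n process to be meaningful.

```python
import os
import shutil

REQUIRED_TOOLS = ("curl", "ffmpeg", "ffprobe")  # needed only for video processing
REQUIRED_ENV = ("GROQ_API_KEY",)                # used by the Groq Whisper and Llama calls


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of missing tools and environment variables (empty if all present)."""
    env = os.environ if env is None else env
    missing = [f"tool:{tool}" for tool in REQUIRED_TOOLS if which(tool) is None]
    missing += [f"env:{name}" for name in REQUIRED_ENV if not env.get(name)]
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("All prerequisites found." if not problems else f"Missing: {', '.join(problems)}")
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real host.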

Additional info

Video: The FFmpeg code nodes split each video into "video_frames" items and "video_transcripts" items for easier handling and pgvector storage. The exposed vector-query webhook lets a Voice Agent find and display the full video, pulled from the Cloudflare bucket, based on the matching video_frames or video_transcripts items returned by the vector query.
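One way a client might turn frame-level and transcript-level matches back into whole videos is sketched below. The match shape is an assumption: the `similarity` score and the `metadata.type` and `metadata.video_url` fields are hypothetical names chosen for illustration, and the real query response may be structured differently.

```python
def collapse_to_videos(matches):
    """Group frame/transcript matches by parent video, keeping the best score per video."""
    best = {}
    for match in matches:
        meta = match.get("metadata", {})
        if meta.get("type") not in ("video_frames", "video_transcripts"):
            continue  # images and PDFs are not part of a parent video
        url = meta.get("video_url")  # hypothetical field pointing at the full video
        if not url:
            continue
        if url not in best or match["similarity"] > best[url]["similarity"]:
            best[url] = {
                "video_url": url,
                "similarity": match["similarity"],
                "matched_on": meta["type"],
            }
    # Highest-scoring video first.
    return sorted(best.values(), key=lambda v: v["similarity"], reverse=True)


if __name__ == "__main__":
    sample = [
        {"similarity": 0.71, "metadata": {"type": "video_frames", "video_url": "https://cdn.example.com/a.mp4"}},
        {"similarity": 0.83, "metadata": {"type": "video_transcripts", "video_url": "https://cdn.example.com/a.mp4"}},
        {"similarity": 0.78, "metadata": {"type": "video_frames", "video_url": "https://cdn.example.com/b.mp4"}},
        {"similarity": 0.90, "metadata": {"type": "image"}},
    ]
    for video in collapse_to_videos(sample):
        print(video)
```

Keeping only the best-scoring item per video avoids showing the same video several times when many of its frames or transcript chunks match a query.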

1.2 Logical Blocks

This catalog entry is organized from the workflow JSON. The node-level section below shows the executable blocks available for review before importing the template.

2. Block-by-Block Analysis

Block 1 - Sticky Note3

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 2 - Sticky Note5

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 3 - Sticky Note6

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 4 - Sticky Note8

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 5 - Sticky Note9

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 6 - Sticky Note10

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 7 - Sticky Note11

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 8 - Sticky Note12

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 9 - Sticky Note13

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 10 - Sticky Note14

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 11 - Sticky Note15

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 12 - Sticky Note16

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 13 - Sticky Note17

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 14 - Sticky Note18

Type / Role: n8n-nodes-base.stickyNote - stickyNote
Config choices: Version 1

Block 15 - Set Content and Metadata

Type / Role: n8n-nodes-base.set - set
Config choices: Version 3.4

Block 16 - Embedding with Gemini Model

Type / Role: @n8n/n8n-nodes-langchain.embeddingsGoogleGemini - embeddingsGoogleGemini
Config choices: Version 1

Block 17 - Load Default Data

Type / Role: @n8n/n8n-nodes-langchain.documentDefaultDataLoader - documentDefaultDataLoader
Config choices: Version 1.1

Block 18 - Split Text by Character

Type / Role: @n8n/n8n-nodes-langchain.textSplitterCharacterTextSplitter - textSplitterCharacterTextSplitter
Config choices: Version 1

Block 19 - Ingest to Vector Store

Type / Role: @n8n/n8n-nodes-langchain.vectorStoreSupabase - vectorStoreSupabase
Config choices: Version 1

Block 20 - Post PDF to API

Type / Role: n8n-nodes-base.httpRequest - httpRequest
Config choices: Version 4.4

Block 21 - Post Image to API

Type / Role: n8n-nodes-base.httpRequest - httpRequest
Config choices: Version 4.4

Block 22 - Image Webhook Trigger

Type / Role: n8n-nodes-base.webhook - webhook
Config choices: Version 2.1

Block 23 - PDF Webhook Trigger

Type / Role: n8n-nodes-base.webhook - webhook
Config choices: Version 2.1

Block 24 - Post New Image to API

Type / Role: n8n-nodes-base.httpRequest - httpRequest
Config choices: Version 4.4

Showing the first 24 of 68 workflow blocks. Download the JSON for the full node graph.

3. Summary Table

Workflow	Ingest and search Cloudflare R2 media with Gemini, Groq Whisper, and Supabase
Complexity	advanced
Nodes	68
Categories	Document Extraction, AI RAG
Author	Dave Sartori
Published	20 Jun 2026

4. Reproducing the Workflow from Scratch

1. Download the workflow JSON

Use the JSON export at /data/workflows/16528/16528.json as the source template for this automation.

2. Import the template into n8n

Open n8n, import the downloaded JSON, and review each node before activating the workflow.

3. Configure credentials and variables

Replace placeholder credentials, API keys, webhook URLs, account IDs, and environment-specific values with your own settings.

4. Test with sample data

Run the workflow manually or in a staging workspace, inspect node output, and confirm downstream systems receive the expected data.

5. Activate and monitor

Enable the workflow only after testing, then monitor executions, errors, and rate limits during the first production runs.
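To support the node review in step 2, the downloaded JSON can be summarized before import. The following is a minimal sketch that assumes the export has a top-level `nodes` array whose entries carry `name`, `type`, and, where credentials are attached, a `credentials` object; the `summarize_workflow` helper and the inline sample are illustrative only.

```python
import json
from collections import Counter


def summarize_workflow(workflow):
    """Count nodes by type and list the nodes that reference credentials."""
    nodes = workflow.get("nodes", [])
    by_type = Counter(node.get("type", "unknown") for node in nodes)
    needs_credentials = sorted(
        node.get("name", "") for node in nodes if node.get("credentials")
    )
    return {
        "total": len(nodes),
        "by_type": dict(by_type),
        "needs_credentials": needs_credentials,
    }


if __name__ == "__main__":
    # Tiny inline sample standing in for the downloaded export.
    sample = json.loads("""
    {"nodes": [
      {"name": "Sticky Note3", "type": "n8n-nodes-base.stickyNote"},
      {"name": "Post PDF to API", "type": "n8n-nodes-base.httpRequest",
       "credentials": {"httpHeaderAuth": {"name": "placeholder"}}},
      {"name": "Image Webhook Trigger", "type": "n8n-nodes-base.webhook"}
    ]}
    """)
    print(summarize_workflow(sample))
```

For the real file, the sample could be replaced with `json.load(open(path))`, where `path` points at the downloaded export.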

5. General Notes & Resources

Review imported nodes carefully before activation. This catalog entry is intended to help you inspect the workflow structure, understand required services, and find related templates faster.

Node names, credentials, schedules, webhook paths, and external service limits may need adjustment for your workspace.

Frequently asked questions

What does Ingest and search Cloudflare R2 media with Gemini, Groq Whisper, and Supabase do?

What do I need before importing this workflow?

Review the workflow JSON, configure any required credentials in n8n, and test the automation in a safe workspace before using it in production.

Can I customize this workflow?

Yes. Use the block-by-block analysis and the downloadable JSON to inspect each node, then adjust credentials, prompts, schedules, filters, or destinations for your Document Extraction, AI RAG use case.
