Detect semantic duplicate website pages with Google Drive, Postgres and Ollama

1. Workflow Overview

Quick overview This workflow scans HTML files in a Google Drive folder, extracts and stores page text in Postgres, generates local vector embeddings with Ollama, and uses PGVector similarity search...

Best for

Document Extraction automation workflows
AI RAG automation workflows
advanced n8n builders looking for reusable templates

Tools used

n8n-nodes-base.splitinbatches, n8n-nodes-base.googledrive, n8n-nodes-base.html, n8n-nodes-base.postgres, @n8n/n8n-nodes-langchain.vectorstorepgvector, @n8n/n8n-nodes-langchain.embeddingsollama, @n8n/n8n-nodes-langchain.documentdefaultdataloader, @n8n/n8n-nodes-langchain.textsplitterrecursivecharactertextsplitter

Source and attribution

This workflow is cataloged by N8N Workflows and links back to its original n8n.io source page by Siddharth Gupta.

Original n8n.io source

1.1 Workflow description

Title: Detect semantic duplicate website pages with Google Drive, Postgres and Ollama
Workflow name: Detect semantic duplicate website pages with Google Drive, Postgres and Ollama

Quick overview

This workflow scans HTML files in a Google Drive folder, extracts and stores page text in Postgres, generates local vector embeddings with Ollama, and uses PGVector similarity searches to produce CSV reports that flag semantically duplicate website pages.

How it works

Starts manually and clears the existing PGVector embeddings table and the scraped page text table in Postgres.
Lists files in a specified Google Drive folder, filters to the target documents, and processes them in batches.
Downloads each HTML file from Google Drive, extracts the main body text, cleans it, and upserts the results into a Postgres table for scraped pages.
Reads the scraped page text back from Postgres in batches, splits it into overlapping chunks, and attaches page metadata (sheet_id, file_name, file_url) to each chunk.
Generates embeddings locally with Ollama and inserts the chunk vectors and metadata into Postgres (PGVector), deduplicating already-processed pages.
Builds an HNSW index in Postgres, computes chunk-to-chunk similarity matches and a pairwise page report, and exports the results as a CSV file.
Computes page-level centroid embeddings, finds highly similar page pairs, and exports a page-level duplicate report as a CSV file.
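
The chunking step above can be sketched outside n8n as follows. This is a minimal illustration, not the workflow's own splitter: it uses a fixed-width sliding window instead of recursive character splitting, and the chunk_size and overlap values, the Chunk class, and the function names are assumptions made for the example. Only the metadata keys (sheet_id, file_name, file_url) come from the description.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-width sliding window; a simplification of recursive splitting."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces


def chunk_page(page: dict, chunk_size: int = 1000, overlap: int = 200) -> list[Chunk]:
    """Attach the page-level metadata to every chunk of the page text."""
    meta = {key: page[key] for key in ("sheet_id", "file_name", "file_url")}
    return [
        Chunk(text=piece, metadata=dict(meta))
        for piece in split_with_overlap(page["text"], chunk_size, overlap)
    ]
```

Carrying the page metadata on every chunk is what later allows chunk-level matches to be rolled up into page-level pairs.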

Setup

Add Google Drive OAuth2 credentials and set the Google Drive folder URL/ID used to scan for your HTML files.
Add Postgres credentials for a database with the pgvector extension enabled and permissions to create/alter tables and indexes (including HNSW indexes).
Add an Ollama credential and ensure the embedding model mxbai-embed-large:latest is available on your Ollama instance.
Confirm your source files are HTML documents and that the workflow’s text extraction and similarity thresholds match your content and desired duplicate sensitivity.
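
Before running the workflow, it may help to confirm that the embedding model is actually present on the Ollama instance. The sketch below checks the model list returned by Ollama's /api/tags endpoint. The base URL is an assumption (Ollama's usual local default), and the helper names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address
MODEL = "mxbai-embed-large:latest"     # model name taken from the setup notes


def model_in_tags(tags_payload: dict, model: str = MODEL) -> bool:
    """Return True if the model name appears in an /api/tags response."""
    return any(entry.get("name") == model for entry in tags_payload.get("models", []))


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of locally available models from a running Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)
```

With Ollama running, model_in_tags(fetch_tags()) should return True once the model has been pulled.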

Requirements

Working instance of n8n, either self-hosted or on the cloud. Remember, this workflow can be computationally expensive.
Google Drive API (with OAuth setup in n8n credentials section)
Ollama (for open source models) or any Embedding model API
PostgreSQL with PGVector or any other vector database
pgAdmin (for PostgreSQL) or another SQL interface to the database tables, for troubleshooting (optional).

Additional info

Limitations and Enhancements:

Physical system memory

Running mxbai-embed-large through Ollama is free and private, but the embedding generation speed depends entirely on your hardware. The more system memory you have, the more data you can process per batch in the loop node.

Similarity threshold and boilerplate content

The workflow uses a cosine distance threshold of 0.15 for chunk-level matching and 0.05 (similarity above 95%) for page-level centroid matching. These values are only a starting point. Once you have results, and especially if your pages contain a lot of noise such as repeated boilerplate, you may need to adjust these thresholds for better matching.
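
To make the two thresholds concrete, here is a small pure-Python sketch of cosine distance and page-level centroid comparison. In the workflow itself these computations run in Postgres with PGVector; the function names here are illustrative, and whether the workflow compares with a strict or non-strict inequality is an assumption.

```python
import math

CHUNK_DISTANCE_MAX = 0.15  # chunk-level threshold from the text
PAGE_DISTANCE_MAX = 0.05   # page-level threshold (similarity above 95%)


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a page's chunk embeddings."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def is_page_duplicate(chunks_a, chunks_b, max_distance=PAGE_DISTANCE_MAX):
    """Assumption: a pair is flagged when centroid distance is below the threshold."""
    return cosine_distance(centroid(chunks_a), centroid(chunks_b)) < max_distance
```

Raising PAGE_DISTANCE_MAX flags more pairs as duplicates; lowering it makes the report stricter.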

This workflow needs HTML files to extract text

This workflow doesn't crawl a website or fetch pages from a URL. You need to download the HTML files (rendered or source) yourself so the workflow can consume them.
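
Because the input is saved HTML, the text-extraction step can be approximated with Python's standard html.parser module. This is a rough sketch, not the n8n HTML node's behavior: it keeps only text inside the body element, skips script, style, and noscript content, and collapses whitespace. The class and function names are invented for this example.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text inside <body>, ignoring script/style content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_body and self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join fragments and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._parts).split())


def extract_body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

Navigation menus and footers still end up in the output, which is one reason the similarity thresholds may need tuning on boilerplate-heavy sites.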

Use parallel processing and Cloud APIs

Two sub-processes take the most time:

Downloading HTML files from Google Drive
Creating vector embeddings

If you can run these sub-processes in parallel in n8n, the workflow will finish much faster. Additionally, using a cloud API for embeddings may save you some processing time as well.
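
As a general illustration of the idea (outside n8n, and not a description of how n8n schedules nodes), I/O-bound steps such as file downloads can be spread across a thread pool. The worker passed in is a placeholder for whatever download or embedding call you use; max_workers=4 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_parallel(items, worker, max_workers=4):
    """Apply worker to every item concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))
```

Threads mainly help with waiting on the network. Local embedding generation with Ollama is bound by your hardware, so parallel requests there may bring little or no gain.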

Use efficient SQL queries

Since I am from a non-tech background and not a coder, I used a mix of Gemini, Perplexity and Claude to write the SQL queries for this workflow. If you're better at SQL, you can write more efficient queries that achieve better results with less computation and time.
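
One possible direction, sketched here as Python helpers that build SQL strings, is to let an HNSW index serve a nearest-neighbor lookup per chunk instead of comparing every chunk with every other chunk. These are not the workflow's actual queries. The table name n8n_vectors, the id, embedding, and metadata columns, and the limit of 5 neighbors are all assumptions; the hnsw index type, the vector_cosine_ops operator class, and the <=> cosine distance operator are standard pgvector features.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_identifier(name: str) -> str:
    """Reject anything that is not a plain SQL identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def build_hnsw_index_sql(table: str = "n8n_vectors", column: str = "embedding") -> str:
    table, column = _check_identifier(table), _check_identifier(column)
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_hnsw "
        f"ON {table} USING hnsw ({column} vector_cosine_ops);"
    )


def build_neighbor_query(table: str = "n8n_vectors", k: int = 5,
                         max_distance: float = 0.15) -> str:
    """Top-k neighbors per chunk from other files, filtered by cosine distance."""
    table = _check_identifier(table)
    k, max_distance = int(k), float(max_distance)
    return f"""
SELECT a.id AS chunk_a, nn.id AS chunk_b, nn.distance
FROM {table} AS a
CROSS JOIN LATERAL (
    SELECT b.id, a.embedding <=> b.embedding AS distance
    FROM {table} AS b
    WHERE b.metadata->>'file_name' <> a.metadata->>'file_name'
    ORDER BY a.embedding <=> b.embedding
    LIMIT {k}
) AS nn
WHERE nn.distance < {max_distance};
""".strip()
```

The ORDER BY ... LIMIT shape inside the lateral subquery is what gives the planner a chance to use the HNSW index; check the plan with EXPLAIN on your own data before relying on it.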

1.2 Logical Blocks

This catalog entry is organized from the workflow JSON. The node-level section below shows the executable blocks available for review before importing the template.

2. Block-by-Block Analysis

Block 1 - Loop Source File

Type / Role: n8n-nodes-base.splitInBatches - splitInBatches
Config choices: Version 3

Block 2 - GDrive Download Document

Type / Role: n8n-nodes-base.googleDrive - googleDrive
Config choices: Version 3

Block 3 - Extract Raw Text Content

Type / Role: n8n-nodes-base.html - html
Config choices: Version 1.2

Block 4 - Save Scraped Page Text

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Block 5 - Batches for Embedding

Type / Role: n8n-nodes-base.splitInBatches - splitInBatches
Config choices: Version 3

Block 6 - Get Unprocessed Scraped Text

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Block 7 - Save Document Embeddings

Type / Role: @n8n/n8n-nodes-langchain.vectorStorePGVector - vectorStorePGVector
Config choices: Version 1.3

Block 8 - Generate Local Embeddings

Type / Role: @n8n/n8n-nodes-langchain.embeddingsOllama - embeddingsOllama
Config choices: Version 1

Block 9 - Context Injector

Type / Role: @n8n/n8n-nodes-langchain.documentDefaultDataLoader - documentDefaultDataLoader
Config choices: Version 1.1

Block 10 - Chunk Text Recursively

Type / Role: @n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter - textSplitterRecursiveCharacterTextSplitter
Config choices: Version 1

Block 11 - Dedup Processed Items

Type / Role: n8n-nodes-base.removeDuplicates - removeDuplicates
Config choices: Version 2

Block 12 - Start Duplicate Check

Type / Role: n8n-nodes-base.manualTrigger - manualTrigger
Config choices: Version 1

Block 13 - Clear Vector Table

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Block 14 - Clear Scraped Pages Table

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Block 15 - Scan Source Directory

Type / Role: n8n-nodes-base.googleDrive - googleDrive
Config choices: Version 3

Block 16 - Isolate Target Documents

Type / Role: n8n-nodes-base.filter - filter
Config choices: Version 2.3

Block 17 - Create HNSW Index

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Block 18 - Compute Chunk Similarities

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Block 19 - Generate Pairwise Report

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Block 20 - Fetch Chunk Similarity Data

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Block 21 - Export Chunk Similarity Report

Type / Role: n8n-nodes-base.convertToFile - convertToFile
Config choices: Version 1.1

Block 22 - Calculate Page Centroids

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Block 23 - Compute Centroid Distances

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Block 24 - Aggregate Page-Level Metrics

Type / Role: n8n-nodes-base.postgres - postgres
Config choices: Version 2.6

Showing the first 24 of 38 workflow blocks. Download the JSON for the full node graph.

3. Summary Table

Workflow	Detect semantic duplicate website pages with Google Drive, Postgres and Ollama
Complexity	advanced
Nodes	38
Categories	Document Extraction, AI RAG
Author	Siddharth Gupta
Published	21 Jun 2026

4. Reproducing the Workflow from Scratch

1. Download the workflow JSON

Use the JSON export at /data/workflows/16540/16540.json as the source template for this automation.

2. Import the template into n8n

Open n8n, import the downloaded JSON, and review each node before activating the workflow.

3. Configure credentials and variables

Replace placeholder credentials, API keys, webhook URLs, account IDs, and environment-specific values with your own settings.

4. Test with sample data

Run the workflow manually or in a staging workspace, inspect node output, and confirm downstream systems receive the expected data.

5. Activate and monitor

Enable the workflow only after testing, then monitor executions, errors, and rate limits during the first production runs.
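
To support the review step before import, a short script can summarize the downloaded JSON. This sketch assumes the usual n8n export layout, with a top-level "nodes" array whose entries carry "name", "type", and an optional "credentials" object; the function names are illustrative.

```python
import json
from collections import Counter


def load_workflow(path: str) -> dict:
    """Read an exported n8n workflow JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_node_types(workflow: dict) -> Counter:
    """Count how many nodes of each type the workflow contains."""
    return Counter(node["type"] for node in workflow.get("nodes", []))


def nodes_with_credentials(workflow: dict) -> list[str]:
    """Names of nodes that reference credentials you will need to replace."""
    return [node["name"] for node in workflow.get("nodes", []) if node.get("credentials")]
```

Running summarize_node_types(load_workflow(path)) on the export gives a quick view of which services the workflow touches before anything is activated.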

5. General Notes & Resources

Review imported nodes carefully before activation. This catalog entry is intended to help you inspect the workflow structure, understand required services, and find related templates faster.

Node names, credentials, schedules, webhook paths, and external service limits may need adjustment for your workspace.

Frequently asked questions

What does Detect semantic duplicate website pages with Google Drive, Postgres and Ollama do?

What do I need before importing this workflow?

Review the workflow JSON, configure any required credentials in n8n, and test the automation in a safe workspace before using it in production.

Can I customize this workflow?

Yes. Use the block-by-block analysis and the downloadable JSON to inspect each node, then adjust credentials, prompts, schedules, filters, or destinations for your Document Extraction, AI RAG use case.


