Detect semantic duplicate website pages with Google Drive, Postgres and Ollama

1. Workflow Overview

Best for

  • Document Extraction automation workflows
  • AI RAG automation workflows
  • advanced n8n builders looking for reusable templates

Tools used

n8n-nodes-base.splitinbatches, n8n-nodes-base.googledrive, n8n-nodes-base.html, n8n-nodes-base.postgres, @n8n/n8n-nodes-langchain.vectorstorepgvector, @n8n/n8n-nodes-langchain.embeddingsollama, @n8n/n8n-nodes-langchain.documentdefaultdataloader, @n8n/n8n-nodes-langchain.textsplitterrecursivecharactertextsplitter

Source and attribution

This workflow is cataloged by N8N Workflows and links back to its original n8n.io source page by Siddharth Gupta.

Original n8n.io source

1.1 Workflow description

Title
Detect semantic duplicate website pages with Google Drive, Postgres and Ollama
Workflow name
Detect semantic duplicate website pages with Google Drive, Postgres and Ollama

Quick overview

This workflow scans HTML files in a Google Drive folder, extracts and stores page text in Postgres, generates local vector embeddings with Ollama, and uses PGVector similarity searches to produce CSV reports that flag semantically duplicate website pages.

How it works

  1. Starts manually and clears the existing PGVector embeddings table and the scraped page text table in Postgres.
  2. Lists files in a specified Google Drive folder, filters to the target documents, and processes them in batches.
  3. Downloads each HTML file from Google Drive, extracts the main body text, cleans it, and upserts the results into a Postgres table for scraped pages.
  4. Reads the scraped page text back from Postgres in batches, splits it into overlapping chunks, and attaches page metadata (sheet_id, file_name, file_url) to each chunk.
  5. Generates embeddings locally with Ollama and inserts the chunk vectors and metadata into Postgres (PGVector), deduplicating already-processed pages.
  6. Builds an HNSW index in Postgres, computes chunk-to-chunk similarity matches and a pairwise page report, and exports the results as a CSV file.
  7. Computes page-level centroid embeddings, finds highly similar page pairs, and exports a page-level duplicate report as a CSV file.
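Step 4 above (overlapping chunks that each carry page metadata) can be sketched in plain Python. This is a minimal illustration, not the workflow's implementation: the workflow uses n8n's recursive character text splitter, whereas this sketch uses a simpler fixed-width sliding window. The names `Chunk` and `chunk_page` and the `size` and `overlap` defaults are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def chunk_page(text: str, metadata: dict, size: int = 500, overlap: int = 50) -> list[Chunk]:
    """Split text into fixed-size windows that overlap by `overlap` characters.

    Every chunk gets its own copy of the page metadata
    (for example sheet_id, file_name, file_url).
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(Chunk(text=text[start:start + size], metadata=dict(metadata)))
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Because each chunk carries the page metadata, a chunk-level match can later be traced back to the page it came from.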

Setup

  1. Add Google Drive OAuth2 credentials and set the Google Drive folder URL/ID used to scan for your HTML files.
  2. Add Postgres credentials for a database with the pgvector extension enabled and permissions to create/alter tables and indexes (including HNSW indexes).
  3. Add an Ollama credential and ensure the embedding model mxbai-embed-large:latest is available on your Ollama instance.
  4. Confirm your source files are HTML documents and that the workflow’s text extraction and similarity thresholds match your content and desired duplicate sensitivity.
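As a rough guide to what the Postgres permissions in step 2 are needed for, the sketch below builds the kind of SQL statements involved: enabling pgvector, creating an HNSW index for cosine distance, and listing chunk pairs under a distance threshold. The workflow's actual queries are not shown on this page, so the table name `n8n_vectors`, the columns `id`, `embedding` and `metadata`, and the helper `build_statements` are all assumptions.

```python
import re


def build_statements(table: str = "n8n_vectors", max_distance: float = 0.15) -> dict[str, str]:
    """Return illustrative pgvector SQL statements (table and column names are assumptions)."""
    # Only allow plain identifiers, because the table name is interpolated into SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unsafe table name: {table!r}")
    return {
        # Requires permission to create extensions in the database.
        "extension": "CREATE EXTENSION IF NOT EXISTS vector;",
        # HNSW index using pgvector's cosine-distance operator class.
        "index": (
            f"CREATE INDEX IF NOT EXISTS {table}_embedding_hnsw "
            f"ON {table} USING hnsw (embedding vector_cosine_ops);"
        ),
        # `<=>` is pgvector's cosine distance operator; this pairwise
        # self-join compares chunks that belong to different pages.
        "pairs": (
            "SELECT a.id AS chunk_a, b.id AS chunk_b, "
            "a.embedding <=> b.embedding AS cosine_distance "
            f"FROM {table} a JOIN {table} b ON a.id < b.id "
            "WHERE a.metadata->>'sheet_id' <> b.metadata->>'sheet_id' "
            f"AND a.embedding <=> b.embedding <= {max_distance};"
        ),
    }


for name, sql in build_statements().items():
    print(f"-- {name}\n{sql}\n")
```

The `max_distance` value is the knob that step 4 refers to: lowering it makes duplicate detection stricter.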

Requirements

  • Working instance of n8n, either self-hosted or on the cloud. Remember, this workflow can be computationally expensive.
  • Google Drive API (with OAuth setup in n8n credentials section)
  • Ollama (for open-source models) or any embedding model API
  • PostgreSQL with pgvector or any other vector database
  • pgAdmin (for PostgreSQL) or another SQL interface for inspecting database tables when troubleshooting (optional)

Additional info

Limitations and Enhancements:

Physical system memory

Running mxbai-embed-large through Ollama is free and private, but embedding generation speed depends entirely on your hardware. The more system memory you have, the more data you can process per batch in the loop node.

Similarity threshold and boilerplate content

The workflow uses a cosine distance threshold of 0.15 (similarity above 85%) for chunk-level matching and 0.05 (similarity above 95%) for page-level centroid matching. These values are only a starting point. Once you have results, and especially if your data is noisy, you may need to adjust the thresholds to get better matches.
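To make the thresholds concrete, here is a minimal sketch of cosine distance (1 minus cosine similarity) and a page-level centroid check. It assumes a page centroid is the element-wise mean of that page's chunk embeddings; the function names are hypothetical and the workflow itself does this in SQL rather than Python.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def is_duplicate_page(chunks_a, chunks_b, max_distance=0.05):
    """Flag two pages as duplicates when their centroids are within max_distance."""
    return cosine_distance(centroid(chunks_a), centroid(chunks_b)) <= max_distance
```

A distance of 0.05 corresponds to a similarity of 0.95, so raising `max_distance` flags more page pairs and lowering it flags fewer.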

This workflow needs HTML files to extract text

This workflow doesn't crawl a website or fetch pages from a URL. You need to download the HTML files (rendered or source) yourself so the workflow can consume them.
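For readers who want to preview what text an HTML file yields before uploading it, the following sketch extracts and cleans body text using only the Python standard library. It is an approximation under assumptions: the workflow uses n8n's HTML node, whose exact extraction and cleaning rules are not shown here, and `BodyTextExtractor` and `extract_body_text` are hypothetical names.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text inside <body>, ignoring script, style and noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.parts.append(data)


def extract_body_text(html: str) -> str:
    """Return the visible body text with runs of whitespace collapsed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Running `extract_body_text` on a saved page shows roughly how much navigation and footer boilerplate ends up in the text, which is useful context when tuning the similarity thresholds.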

Use parallel processing and cloud APIs

Two sub-processes take the most time:

  • Downloading HTML files from Google Drive
  • Creating vector embeddings

If you can run these sub-processes in parallel in n8n, the workflow will finish much faster. Using a cloud API for embeddings may also save you some processing time.
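The general idea of running embedding calls concurrently can be sketched outside n8n as follows. This is only an illustration of the pattern: `embed_all` and `fake_embed` are hypothetical names, the embedding function is injected so that no particular API is assumed, and whether a local Ollama instance actually gets faster with concurrent requests depends on your hardware and configuration.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_all(texts, embed_fn, max_workers=4):
    """Apply embed_fn to every text concurrently, preserving input order.

    embed_fn is any callable that takes one string and returns one vector,
    for example a wrapper around a local or cloud embedding endpoint.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed_fn, texts))


def fake_embed(text):
    # Placeholder standing in for a real embedding call (assumption for the demo).
    return [float(len(text))]


vectors = embed_all(["first page", "second page"], fake_embed, max_workers=2)
print(vectors)
```

Threads suit this case because each call spends most of its time waiting on a network or model response rather than running Python code.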

Use efficient SQL queries

Since I come from a non-technical background and am not a coder, I used a mix of Gemini, Perplexity and Claude to write the SQL queries for this workflow. If you're more experienced with SQL, more efficient queries could give you better results with less computation and time.

1.2 Logical Blocks

This catalog entry is organized from the workflow JSON. The node-level section below shows the executable blocks available for review before importing the template.

2. Block-by-Block Analysis

Block 1 - Loop Source File

Type / Role
n8n-nodes-base.splitInBatches - splitInBatches
Config choices
Version 3

Block 2 - GDrive Download Document

Type / Role
n8n-nodes-base.googleDrive - googleDrive
Config choices
Version 3

Block 3 - Extract Raw Text Content

Type / Role
n8n-nodes-base.html - html
Config choices
Version 1.2

Block 4 - Save Scraped Page Text

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Block 5 - Batches for Embedding

Type / Role
n8n-nodes-base.splitInBatches - splitInBatches
Config choices
Version 3

Block 6 - Get Unprocessed Scraped Text

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Block 7 - Save Document Embeddings

Type / Role
@n8n/n8n-nodes-langchain.vectorStorePGVector - vectorStorePGVector
Config choices
Version 1.3

Block 8 - Generate Local Embeddings

Type / Role
@n8n/n8n-nodes-langchain.embeddingsOllama - embeddingsOllama
Config choices
Version 1

Block 9 - Context Injector

Type / Role
@n8n/n8n-nodes-langchain.documentDefaultDataLoader - documentDefaultDataLoader
Config choices
Version 1.1

Block 10 - Chunk Text Recursively

Type / Role
@n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter - textSplitterRecursiveCharacterTextSplitter
Config choices
Version 1

Block 11 - Dedup Processed Items

Type / Role
n8n-nodes-base.removeDuplicates - removeDuplicates
Config choices
Version 2

Block 12 - Start Duplicate Check

Type / Role
n8n-nodes-base.manualTrigger - manualTrigger
Config choices
Version 1

Block 13 - Clear Vector Table

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Block 14 - Clear Scraped Pages Table

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Block 15 - Scan Source Directory

Type / Role
n8n-nodes-base.googleDrive - googleDrive
Config choices
Version 3

Block 16 - Isolate Target Documents

Type / Role
n8n-nodes-base.filter - filter
Config choices
Version 2.3

Block 17 - Create HNSW Index

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Block 18 - Compute Chunk Similarities

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Block 19 - Generate Pairwise Report

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Block 20 - Fetch Chunk Similarity Data

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Block 21 - Export Chunk Similarity Report

Type / Role
n8n-nodes-base.convertToFile - convertToFile
Config choices
Version 1.1

Block 22 - Calculate Page Centroids

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Block 23 - Compute Centroid Distances

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Block 24 - Aggregate Page-Level Metrics

Type / Role
n8n-nodes-base.postgres - postgres
Config choices
Version 2.6

Showing the first 24 of 38 workflow blocks. Download the JSON for the full node graph.

3. Summary Table

Workflow: Detect semantic duplicate website pages with Google Drive, Postgres and Ollama
Complexity: advanced
Nodes: 38
Categories: Document Extraction, AI RAG
Author: Siddharth Gupta
Published: 21 Jun 2026

4. Reproducing the Workflow from Scratch

  1. Download the workflow JSON

    Use the JSON export at /data/workflows/16540/16540.json as the source template for this automation.

  2. Import the template into n8n

    Open n8n, import the downloaded JSON, and review each node before activating the workflow.

  3. Configure credentials and variables

    Replace placeholder credentials, API keys, webhook URLs, account IDs, and environment-specific values with your own settings.

  4. Test with sample data

    Run the workflow manually or in a staging workspace, inspect node output, and confirm downstream systems receive the expected data.

  5. Activate and monitor

    Enable the workflow only after testing, then monitor executions, errors, and rate limits during the first production runs.

5. General Notes & Resources

Review imported nodes carefully before activation. This catalog entry is intended to help you inspect the workflow structure, understand required services, and find related templates faster.

Node names, credentials, schedules, webhook paths, and external service limits may need adjustment for your workspace.

Frequently asked questions

What does Detect semantic duplicate website pages with Google Drive, Postgres and Ollama do?

This workflow scans HTML files in a Google Drive folder, extracts and stores page text in Postgres, generates local vector embeddings with Ollama, and uses PGVector similarity searches to produce CSV reports that flag semantically duplicate website pages.

What do I need before importing this workflow?

Review the workflow JSON, configure any required credentials in n8n, and test the automation in a safe workspace before using it in production.

Can I customize this workflow?

Yes. Use the block-by-block analysis and the downloadable JSON to inspect each node, then adjust credentials, prompts, schedules, filters, or destinations for your Document Extraction, AI RAG use case.