Block 1 - Loop Source File
- Type / Role
- n8n-nodes-base.splitInBatches - splitInBatches
- Config choices
- Version 3
Quick overview This workflow scans HTML files in a Google Drive folder, extracts and stores page text in Postgres, generates local vector embeddings with Ollama, and uses PGVector similarity search...
n8n-nodes-base.splitinbatches, n8n-nodes-base.googledrive, n8n-nodes-base.html, n8n-nodes-base.postgres, @n8n/n8n-nodes-langchain.vectorstorepgvector, @n8n/n8n-nodes-langchain.embeddingsollama, @n8n/n8n-nodes-langchain.documentdefaultdataloader, @n8n/n8n-nodes-langchain.textsplitterrecursivecharactertextsplitter
This workflow is cataloged by N8N Workflows and links back to its original n8n.io source page by Siddharth Gupta.
This workflow scans HTML files in a Google Drive folder, extracts and stores page text in Postgres, generates local vector embeddings with Ollama, and uses PGVector similarity searches to produce CSV reports that flag semantically duplicate website pages.
Limitations and Enhancements:
Physical system memory
Running mxbai-embed-large through Ollama is free and private, but embedding generation speed depends entirely on your hardware. The more system memory you have, the more data you can process per batch in the loop node.
Similarity threshold and boilerplate content
This workflow uses a cosine distance threshold of 0.15 for chunk-level matching and 0.05 (similarity above 95%) for page-level centroid matching. These values are only a starting point. Once you have results, and especially if your data is noisy, you may need to adjust the thresholds for better matching.
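To make the two thresholds concrete, here is a minimal Python sketch of cosine distance and page-level centroid matching. It is not the workflow's actual SQL. The function names, the toy two-dimensional vectors, and the choice of `<=` for the comparison are assumptions made for illustration; only the values 0.15 and 0.05 come from the description above.

```python
import math

# Thresholds quoted in the workflow description (cosine distance = 1 - cosine similarity).
CHUNK_DISTANCE_THRESHOLD = 0.15  # chunk-level matching
PAGE_DISTANCE_THRESHOLD = 0.05   # page-level centroid matching (similarity above 95%)


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def is_duplicate_page(chunks_a, chunks_b, threshold=PAGE_DISTANCE_THRESHOLD):
    """Hypothetical helper: compare two pages by the centroids of their chunk embeddings."""
    return cosine_distance(centroid(chunks_a), centroid(chunks_b)) <= threshold


# Toy chunk embeddings (real mxbai-embed-large vectors have far more dimensions).
page_a = [[1.0, 0.0], [0.9, 0.1]]
page_b = [[1.0, 0.05], [0.95, 0.1]]
page_c = [[0.0, 1.0], [0.1, 0.9]]

print(is_duplicate_page(page_a, page_b))
print(is_duplicate_page(page_a, page_c))
```

A lower threshold flags fewer pages. If shared boilerplate such as headers and footers dominates the page text, centroids move closer together, so a stricter value may be needed.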
This workflow needs HTML files to extract text
This workflow doesn't crawl a website or fetch pages from a URL. You need to download the HTML files (rendered or source) yourself and place them in the Google Drive folder for the workflow to process.
Use parallel processing and cloud APIs
Two sub-processes take the most time:
- Downloading HTML files from Google Drive
- Creating vector embeddings
If you can run these sub-processes in parallel in n8n, the workflow will finish much faster. Using a cloud API for embeddings may also save some processing time.
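The general idea of running slow, independent tasks in parallel can be sketched in plain Python with a thread pool. This illustrates the concept only and is not an n8n setting or API. `download_html` is a hypothetical stand-in for a Google Drive download, and `max_workers=4` is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def download_html(file_id: str) -> str:
    """Hypothetical stand-in for downloading one HTML file from Google Drive."""
    return f"<html><body>page {file_id}</body></html>"


def process_in_parallel(items, worker, max_workers=4):
    """Apply `worker` to every item using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))


pages = process_in_parallel(["a", "b", "c"], download_html)
print(len(pages))
```

Threads suit I/O-bound steps such as downloads. Local embedding generation is limited by hardware, so adding parallelism there may help less.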
Use efficient SQL queries
Since I come from a non-technical background and am not a coder, I used a mix of Gemini, Perplexity and Claude to write the SQL queries for this workflow. If you are more experienced with SQL, you can write more efficient queries that achieve better results with less computation and time.
This catalog entry is organized from the workflow JSON. The node-level section below shows the executable blocks available for review before importing the template.
| Workflow | Detect semantic duplicate website pages with Google Drive, Postgres and Ollama |
|---|---|
| Complexity | advanced |
| Nodes | 38 |
| Categories | Document Extraction, AI RAG |
| Author | Siddharth Gupta |
| Published | 21 Jun 2026 |
Use the JSON export at /data/workflows/16540/16540.json as the source template for this automation.
Open n8n, import the downloaded JSON, and review each node before activating the workflow.
Replace placeholder credentials, API keys, webhook URLs, account IDs, and environment-specific values with your own settings.
Run the workflow manually or in a staging workspace, inspect node output, and confirm downstream systems receive the expected data.
Enable the workflow only after testing, then monitor executions, errors, and rate limits during the first production runs.
Review imported nodes carefully before activation. This catalog entry is intended to help you inspect the workflow structure, understand required services, and find related templates faster.
Node names, credentials, schedules, webhook paths, and external service limits may need adjustment for your workspace.
Review the workflow JSON, configure any required credentials in n8n, and test the automation in a safe workspace before using it in production.
Yes. Use the block-by-block analysis and the downloadable JSON to inspect each node, then adjust credentials, prompts, schedules, filters, or destinations for your Document Extraction, AI RAG use case.