Migrate large Hugging Face datasets to MongoDB with a looping subworkflow
Overview
This n8n template provides a production-ready, memory-safe pipeline for ingesting large Hugging Face datasets into MongoDB using batch pagination. It is designed as a reusable data ingestion layer for RAG systems, recommendation engines, analytics pipelines, and ML workflows.
The template includes:
- A main workflow that orchestrates pagination and looping
- A subworkflow that fetches dataset rows, sanitizes them, and inserts them into MongoDB safely
What This Template Does
- Fetches rows from a Hugging Face dataset using the `datasets-server` API
- Processes data in configurable batches (offset + length)
- Removes Hugging Face `_id` fields to avoid MongoDB duplicate key errors
- Inserts clean documents into MongoDB
- Automatically loops until all dataset rows are ingested
- Handles large datasets without memory overflow
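Outside n8n, the batch-pagination pattern the template implements can be sketched in a few lines of Python. Here `fetch_batch` and `insert_batch` are hypothetical stand-ins for the HTTP Request and MongoDB nodes; they are not names used in the workflow itself:

```python
def ingest_all(fetch_batch, insert_batch, batch_size=100):
    """Pull pages of `batch_size` rows until an empty page comes back.

    fetch_batch(offset, length) -> list of row dicts (stands in for the
    HTTP Request node); insert_batch(rows) stands in for the MongoDB node.
    Returns the total number of rows ingested.
    """
    offset = 0
    total = 0
    while True:
        rows = fetch_batch(offset, batch_size)
        if not rows:          # no rows returned -> dataset exhausted
            break
        insert_batch(rows)
        total += len(rows)
        offset += batch_size  # advance to the next page
    return total
```

Because only one page is ever held in memory, the loop stays memory-safe regardless of dataset size, which is the same reason the n8n version processes batches through a subworkflow instead of loading the dataset in one request.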
Architecture Overview
Main Workflow (Orchestrator)
- Starts the ingestion process
- Defines dataset, batch size, and MongoDB collection
- Repeatedly calls the subworkflow until no rows remain
Subworkflow (Batch Processor)
- Fetches a single batch of rows from Hugging Face
- Splits rows into individual items
- Removes `_id` fields
- Inserts documents into MongoDB
- Returns batch statistics to the main workflow
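The item-splitting step corresponds to unwrapping the `row` key of each entry returned by the `datasets-server` `/rows` endpoint. A minimal sketch, assuming that documented response shape (the helper name is illustrative):

```python
def rows_to_documents(payload):
    """Flatten a datasets-server /rows response into insert-ready documents.

    Each entry in payload["rows"] wraps the actual record under "row";
    unwrapping it mirrors the item-splitting step in the subworkflow.
    """
    return [entry["row"] for entry in payload.get("rows", [])]
```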
Workflow Logic (High-Level)
- Set initial configuration:
  - Dataset name
  - Split (`train`, `test`, etc.)
  - Batch size
  - Offset
- Fetch rows from Hugging Face
- If rows exist:
  - Split rows into items
  - Remove `_id`
  - Insert into MongoDB
  - Increase offset
- Repeat until no rows are returned
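The fetch step targets the Hugging Face `datasets-server` `/rows` endpoint, which pages through a split via `offset` and `length` query parameters. A small helper (the function name `rows_url` is an illustration, not part of the workflow) sketches the URL the HTTP Request node would issue:

```python
from urllib.parse import urlencode

BASE = "https://datasets-server.huggingface.co/rows"

def rows_url(dataset, config="default", split="train", offset=0, length=100):
    """Build one paginated /rows request URL for the datasets-server API."""
    query = urlencode({
        "dataset": dataset,   # e.g. "MongoDB/airbnb_embeddings"
        "config": config,
        "split": split,
        "offset": offset,     # incremented by the main workflow each loop
        "length": length,     # the batch size
    })
    return f"{BASE}?{query}"
```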
Default Configuration
| Parameter | Default Value |
|---|---|
| Dataset | MongoDB/airbnb_embeddings |
| Config | default |
| Split | train |
| Batch Size | 100 |
| MongoDB Collection | airbnb |
All values can be changed in the `Config_Start` node.
Prerequisites
- n8n (self-hosted or cloud)
- MongoDB (local or hosted)
- MongoDB credentials configured in n8n
- Internet access to `datasets-server.huggingface.co`
How to Use
- Import the workflow JSON into n8n
- Configure MongoDB credentials in the MongoDB node
- Update dataset parameters if needed:
- Dataset name
- Split
- Batch size
- Collection name
- Run the workflow using the Manual Trigger
- Monitor execution until completion
Why `_id` Is Removed
Hugging Face dataset rows often include an `_id` field.
MongoDB requires `_id` values to be unique, so reusing these values can cause insertion failures.
This template:
- Removes the Hugging Face `_id`
- Lets MongoDB generate its own `ObjectId`
- Prevents duplicate key errors
- Allows safe re-runs and incremental ingestion
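The sanitization step amounts to dropping the `_id` key before insertion. A minimal sketch (the helper name is illustrative):

```python
def strip_id(doc):
    """Drop a source-supplied _id so MongoDB assigns a fresh ObjectId.

    Returns a new dict; the original document is left untouched.
    """
    return {k: v for k, v in doc.items() if k != "_id"}
```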
Ideal Use Cases
RAG (Retrieval-Augmented Generation)
- Store dataset content as source documents
- Add embeddings later using OpenAI, Mistral, or local models
- Connect MongoDB to a vector database or hybrid search
Recommendation Systems
- Build item catalogs from public datasets
- Use embeddings or metadata for similarity search
- Combine with user behavior data downstream
ML & Analytics Pipelines
- Centralize dataset ingestion
- Normalize data before training or analysis
Recommended Enhancements
You can easily extend this template with:
- Upsert logic using a deterministic hash (idempotent ingestion)
- Embedding generation before or after insertion
- Schema validation or field filtering
- Rate-limit handling & backoff
- Parallel ingestion for faster processing
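As a sketch of the first enhancement, a deterministic content hash can serve as the document key for idempotent upserts: the same row always hashes to the same value, so re-running ingestion cannot create duplicates. The helper name and the canonical-JSON serialization choice are assumptions, not part of the template:

```python
import hashlib
import json

def content_hash(doc):
    """Deterministic upsert key: identical documents always hash alike.

    Serializes the document as canonical JSON (sorted keys, no spaces)
    and returns the SHA-256 hex digest.
    """
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With a MongoDB driver such as pymongo, the hash could then be used in an upsert, e.g. `collection.update_one({"_id": content_hash(doc)}, {"$set": doc}, upsert=True)`, replacing the plain insert in the subworkflow.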
Notes & Best Practices
- Reduce batch size if you encounter memory limits
- Verify dataset license before production use
- Add indexes in MongoDB for faster downstream querying
- Use upserts if you plan to re-run ingestion frequently
License & Disclaimer
This workflow template is provided as-is. You are responsible for:
- Dataset licensing compliance
- Infrastructure costs
- Downstream data usage
Hugging Face datasets are subject to their respective licenses.
Template Summary
- Category: Data Ingestion
- Complexity: Intermediate
- Scalability: High
- Memory Safe: Yes
- Production Ready: Yes