
Migrate large Hugging Face datasets to MongoDB with a looping subworkflow


Overview

This n8n template provides a production-ready, memory-safe pipeline for ingesting large Hugging Face datasets into MongoDB using batch pagination. It is designed as a reusable data ingestion layer for RAG systems, recommendation engines, analytics pipelines, and ML workflows.

The template includes:

  • A main workflow that orchestrates pagination and looping
  • A subworkflow that fetches dataset rows, sanitizes them, and inserts them into MongoDB safely

🚀 What This Template Does

  • Fetches rows from a Hugging Face dataset using the datasets-server API
  • Processes data in configurable batches (offset + length)
  • Removes Hugging Face _id fields to avoid MongoDB duplicate key errors
  • Inserts clean documents into MongoDB
  • Automatically loops until all dataset rows are ingested
  • Handles large datasets without memory overflow
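The fetch step above targets the public datasets-server `/rows` endpoint, which pages through a dataset via `offset` and `length` query parameters (the API caps `length` at 100 rows per request). A minimal sketch of how the request URL is built:

```javascript
// Build a datasets-server /rows URL for one batch. Parameter names
// (dataset, config, split, offset, length) follow the public API.
function buildRowsUrl({ dataset, config, split, offset, length }) {
  const params = new URLSearchParams({
    dataset,
    config,
    split,
    offset: String(offset),
    length: String(length),
  });
  return `https://datasets-server.huggingface.co/rows?${params}`;
}

// First batch of the template's default dataset:
const url = buildRowsUrl({
  dataset: "MongoDB/airbnb_embeddings",
  config: "default",
  split: "train",
  offset: 0,
  length: 100,
});
console.log(url);
```

In the workflow itself this URL is assembled by an HTTP Request node; the function above just makes the pagination parameters explicit.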

🧩 Architecture Overview

Main Workflow (Orchestrator)

  • Starts the ingestion process
  • Defines dataset, batch size, and MongoDB collection
  • Repeatedly calls the subworkflow until no rows remain

Subworkflow (Batch Processor)

  • Fetches a single batch of rows from Hugging Face
  • Splits rows into individual items
  • Removes _id fields
  • Inserts documents into MongoDB
  • Returns batch statistics to the main workflow
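The "split rows, remove `_id`" steps can be sketched as the kind of logic an n8n Code node would run. The response shape (`{ rows: [{ row: {...} }] }`) follows the datasets-server `/rows` endpoint; the wrapping of each document in a `json` key is how n8n items are structured:

```javascript
// Unwrap one API batch into n8n items, dropping the Hugging Face _id
// so MongoDB can assign its own ObjectId on insert.
function batchToItems(apiResponse) {
  return (apiResponse.rows ?? []).map(({ row }) => {
    const { _id, ...doc } = row; // strip _id; everything else passes through
    return { json: doc };        // n8n items wrap data in a `json` key
  });
}

const sample = {
  rows: [
    { row_idx: 0, row: { _id: "abc123", name: "Cozy loft", price: 120 } },
    { row_idx: 1, row: { name: "Beach house", price: 300 } }, // no _id: unchanged
  ],
};
console.log(batchToItems(sample)); // two items, neither carrying an _id
```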

๐Ÿ” Workflow Logic (High-Level)

  1. Set initial configuration:
     • Dataset name
     • Split (train, test, etc.)
     • Batch size
     • Offset
  2. Fetch rows from Hugging Face
  3. If rows exist:
     • Split rows into items
     • Remove _id
     • Insert into MongoDB
  4. Increase offset
  5. Repeat until no rows are returned
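The loop above can be sketched as follows. `fetchBatch` and `insertBatch` are injected here so the control flow can be run without network access; in the real workflow they correspond to the subworkflow call and the MongoDB insert:

```javascript
// Orchestrator loop: request batches at increasing offsets until a
// batch comes back empty, inserting each batch as it arrives.
async function ingestAll({ batchSize, fetchBatch, insertBatch }) {
  let offset = 0;
  let total = 0;
  for (;;) {
    const rows = await fetchBatch(offset, batchSize);
    if (rows.length === 0) break; // no rows returned: dataset exhausted
    await insertBatch(rows);
    total += rows.length;
    offset += batchSize;          // advance to the next page
  }
  return total;
}

// Simulate a 250-row dataset with batch size 100 → batches of 100, 100, 50.
const data = Array.from({ length: 250 }, (_, i) => ({ idx: i }));
const inserted = [];
ingestAll({
  batchSize: 100,
  fetchBatch: async (offset, length) => data.slice(offset, offset + length),
  insertBatch: async (rows) => inserted.push(...rows),
}).then((total) => console.log(total)); // prints 250
```

Because only one batch is in memory at a time, memory usage stays flat regardless of dataset size.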

📦 Default Configuration

Parameter           Default Value
Dataset             MongoDB/airbnb_embeddings
Config              default
Split               train
Batch Size          100
MongoDB Collection  airbnb

All values can be changed easily from the Config_Start node.
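As an illustration, the Config_Start node's values map to an object like the one below. The field names here are assumptions for the sketch; match them to the actual node in your copy of the workflow:

```javascript
// Illustrative shape of the values Config_Start sets. Changing the
// ingestion target means editing only these fields.
const config = {
  dataset: "MongoDB/airbnb_embeddings", // Hugging Face dataset id
  config: "default",                    // dataset config name
  split: "train",                       // which split to ingest
  batchSize: 100,                       // rows fetched per subworkflow call
  collection: "airbnb",                 // target MongoDB collection
  offset: 0,                            // starting row (0 = from the beginning)
};
console.log(config.batchSize);
```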


🛠 Prerequisites

  • n8n (self-hosted or cloud)
  • MongoDB (local or hosted)
  • MongoDB credentials configured in n8n
  • Internet access to datasets-server.huggingface.co

โ–ถ๏ธ How to Use

  1. Import the workflow JSON into n8n
  2. Configure MongoDB credentials in the MongoDB node
  3. Update dataset parameters if needed:
     • Dataset name
     • Split
     • Batch size
     • Collection name
  4. Run the workflow using the Manual Trigger
  5. Monitor execution until completion

🧠 Why _id Is Removed

Hugging Face dataset rows often include an _id field. MongoDB requires _id values to be unique, so reusing these values can cause insertion failures.

This template:

  • Removes the Hugging Face _id
  • Lets MongoDB generate its own ObjectId
  • Prevents duplicate key errors
  • Allows safe re-runs and incremental ingestion

๐Ÿ” Ideal Use Cases

✅ RAG (Retrieval-Augmented Generation)

  • Store dataset content as source documents
  • Add embeddings later using OpenAI, Mistral, or local models
  • Connect MongoDB to a vector database or hybrid search

✅ Recommendation Systems

  • Build item catalogs from public datasets
  • Use embeddings or metadata for similarity search
  • Combine with user behavior data downstream

✅ ML & Analytics Pipelines

  • Centralize dataset ingestion
  • Normalize data before training or analysis

โš™๏ธ Recommended Enhancements

You can easily extend this template with:

  • Upsert logic using a deterministic hash (idempotent ingestion)
  • Embedding generation before or after insertion
  • Schema validation or field filtering
  • Rate-limit handling & backoff
  • Parallel ingestion for faster processing

โš ๏ธ Notes & Best Practices

  • Reduce batch size if you encounter memory limits
  • Verify dataset license before production use
  • Add indexes in MongoDB for faster downstream querying
  • Use upserts if you plan to re-run ingestion frequently

📄 License & Disclaimer

This workflow template is provided as-is. You are responsible for:

  • Dataset licensing compliance
  • Infrastructure costs
  • Downstream data usage

Hugging Face datasets are subject to their respective licenses.


โญ Template Summary

Category: Data Ingestion
Complexity: Intermediate
Scalability: High
Memory Safe: Yes
Production Ready: Yes


If you want a version with:

  • Upserts instead of inserts
  • Built-in embeddings
  • Vector database support
  • Logging & monitoring

Just say the word and I'll generate the enhanced workflow JSON.