dongou

Workflow

Workflows by dongou

Sort by:

Daily RAG research paper hub with arXiv, Gemini AI, and Notion

Fetch user-specific research papers from **arXiv** on a daily schedule, process and structure the data, and create or update entries in a Notion database, with support for data delivery - **Paper Topic**: single query keyword - **Update Frequency**: Daily updates, with fewer than 20 entries expected per day - **Tools**: - **Platform**: n8n, for end-to-end workflow configuration - **AI Model**: Gemini-2.5-Flash, for daily paper summarization and data processing - **Database**: Notion, with two tables — Daily Paper Summary and Paper Details - **Message**: Feishu (IM bot notifications), Gmail (email notifications) ## 1. Data Retrieval ### arXiv API The arXiv provides a public API that allows users to query research papers by topic or by predefined categories. [arXiv API User Manual](https://info.arxiv.org/help/api/user-manual.html#arxiv-api-users-manual) **Key Notes:** 1. **Response Format**: The API returns data as a typical *Atom Response*. 2. **Timezone & Update Frequency**: - The arXiv submission process operates on a 24-hour cycle. - Newly submitted articles become available in the API only at midnight *after* they have been processed. - Feeds are updated daily at midnight Eastern Standard Time (EST). - Therefore, a single request per day is sufficient. 3. **Request Limits**: - The maximum number of results per call (`max_results`) is **30,000**, - Results must be retrieved in slices of at most **2,000** at a time, using the `max_results` and `start` query parameters. 4. **Time Format**: - The expected format is `[YYYYMMDDTTTT+TO+YYYYMMDDTTTT]`, - `TTTT` is provided in 24-hour time to the minute, in GMT. ### Scheduled Task - **Execution Frequency**: Daily - **Execution Time**: 6:00 AM - **Time Parameter Handling (JS)**: According to arXiv’s update rules, the scheduled task should query the **previous day’s (T-1)** `submittedDate` data. ## 2. **Data Extraction** ### Data Cleaning Rules (Convert to Standard JSON) 1. **Remove Header** - Keep only the 【entry】【/entry】 blocks representing paper items. 2. **Single Item** - Each 【entry】【/entry】 represents a single item. 3. **Field Processing Rules** - 【id】【/id】 ➡️ id Extract content. Example: 【id】http://arxiv.org/abs/2409.06062v1【/id】 → http://arxiv.org/abs/2409.06062v1 - 【updated】【/updated】 ➡️ updated Convert timestamp to `yyyy-mm-dd hh:mm:ss` - 【published】【/published】 ➡️ published Convert timestamp to `yyyy-mm-dd hh:mm:ss` - 【title】【/title】 ➡️ title Extract text content - 【summary】【/summary】 ➡️ summary Keep text, remove line breaks - 【author】【/author】 ➡️ author Combine all authors into an array Example: `[ "Ernest Pusateri", "Anmol Walia" ]` (for Notion multi-select field) - 【arxiv:comment】【/arxiv:comment】 ➡️ Ignore / discard - 【link type="text/html"】 ➡️ html_url Extract URL - 【link type="application/pdf"】 ➡️ pdf_url Extract URL - 【arxiv:primary_category term="cs.CL"】 ➡️ primary_category Extract `term` value - 【category】 ➡️ category Merge all 【category】 values into an array Example: `[ "eess.AS", "cs.SD" ]` (for Notion multi-select field) 4. **Add Empty Fields** - `github` - `huggingface` ## 3. Data Processing Analyze and summarize paper data using AI, then standardize output as JSON. - Single Paper Basic Information Analysis and Enhancement - Daily Paper Summary and Multilingual Translation ## 4. Data Storage: Notion Database - Create a corresponding database in Notion with the same predefined field names. - In Notion, create an integration under **Integrations** and grant access to the database. Obtain the corresponding **Secret Key**. - Use the Notion **"Create a database page"** node to configure the field mapping and store the data. **Notes** - **"Create a database page"** only adds new entries; data will not be updated. - The `updated` and `published` timestamps of arXiv papers are in **UTC**. - Notion **single-select** and **multi-select** fields only accept arrays. They do not automatically parse comma-separated strings. You need to format them as proper arrays. - Notion does not accept `null` values, which causes a **400 error**. ## 5. Data Delivery Set up two channels for message delivery: **EMAIL** and **IM**, and define the message format and content. ### Email: Gmail **GMAIL OAuth 2.0 – Official Documentation** [Configure your OAuth consent screen](https://docs.n8n.io/integrations/builtin/credentials/google/oauth-single-service/?utm_source=n8n_app&utm_medium=credential_settings&utm_campaign=create_new_credentials_modal#configure-your-oauth-consent-screen) **Steps:** - Enable Gmail API - Create OAuth consent screen - Create OAuth client credentials - Audience: Add **Test users** under Testing status **Message format**: HTML (Model: OpenAI GPT — used to design an HTML email template) ### IM: Feishu (LARK) **Bots in groups** [Use bots in groups](https://www.larksuite.com/hc/en-US/articles/360048487736-use-bots-in-groups)