# Domain-specific web content crawler with depth control & text extraction
This template implements a recursive web crawler inside n8n. Starting from a given URL, it crawls linked pages up to a maximum depth (default: 3), extracts text and links, and returns the collected content via webhook.
## 🔍 How It Works

1. **Webhook Trigger**
   Accepts a JSON body with a `url` field.
   Example payload: `{ "url": "https://example.com" }`
2. **Initialization**
   - Sets crawl parameters: `url`, `domain`, `maxDepth = 3`, and `depth = 0`.
   - Initializes global static data (`pending`, `visited`, `queued`, `pages`).
3. **Recursive Crawling**
   - Fetches each page (HTTP Request).
   - Extracts body text and links (HTML node).
   - Cleans and deduplicates links.
   - Filters out:
     - External domains (only same-site links are followed)
     - Anchors (`#`) and `mailto:`/`tel:`/`javascript:` links
     - Non-HTML files (`.pdf`, `.docx`, `.xlsx`, `.pptx`)
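The filtering rules above can be sketched in plain JavaScript. This is only an illustration of the described behavior, not the template's actual Code-node contents; the function and constant names are invented here.

```javascript
// Sketch of the link-filtering rules described above (illustrative names,
// not the workflow's literal code).
const SKIP_SCHEMES = /^(mailto:|tel:|javascript:|#)/i;
const SKIP_EXTENSIONS = /\.(pdf|docx|xlsx|pptx)(\?.*)?$/i;

function normalizeHost(hostname) {
  // Treat apex and www as the same site.
  return hostname.replace(/^www\./i, "").toLowerCase();
}

function shouldFollow(href, baseUrl) {
  if (!href || SKIP_SCHEMES.test(href)) return false;
  let resolved;
  try {
    resolved = new URL(href, baseUrl); // resolves relative links too
  } catch (err) {
    return false; // unparsable URL
  }
  if (!/^https?:$/.test(resolved.protocol)) return false;
  if (SKIP_EXTENSIONS.test(resolved.pathname)) return false;
  // Only follow same-site links.
  return normalizeHost(resolved.hostname) === normalizeHost(new URL(baseUrl).hostname);
}
```

A relative link like `/about` resolves against the base URL and is followed, while `mailto:`, anchor, cross-domain, and `.pdf` links are dropped.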
4. **Depth Control & Queue**
   - Tracks visited URLs
   - Stops at `maxDepth` to prevent infinite loops
   - Uses SplitInBatches to loop through the queue
5. **Data Collection**
   - Saves each crawled page (`url`, `depth`, `content`) into `pages[]`
   - When `pending = 0`, combines the results
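The `pending`/`visited`/`queued`/`pages` bookkeeping can be modeled outside n8n as a small in-memory class. In the template itself this state lives in n8n's global static data; the class below is only an illustration of the mechanism under that assumption.

```javascript
// Minimal in-memory model of the crawl bookkeeping (pending/visited/queued/pages).
// Illustrative only; the template keeps this state in n8n workflow static data.
class CrawlState {
  constructor(maxDepth = 3) {
    this.maxDepth = maxDepth;
    this.visited = new Set();
    this.queue = [];   // { url, depth } items awaiting fetch
    this.pending = 0;  // requests currently in flight
    this.pages = [];   // { url, depth, content } results
  }

  enqueue(url, depth) {
    // Deduplicate and enforce the depth limit before queueing.
    if (depth > this.maxDepth || this.visited.has(url)) return false;
    this.visited.add(url);
    this.queue.push({ url, depth });
    return true;
  }

  // Called once a page has been fetched and parsed.
  record(url, depth, content, links) {
    this.pages.push({ url, depth, content });
    for (const link of links) this.enqueue(link, depth + 1);
  }

  get done() {
    // Crawl finishes when nothing is queued and nothing is in flight.
    return this.queue.length === 0 && this.pending === 0;
  }
}
```

When `done` becomes true, the results in `pages` are combined, mirroring the "when `pending = 0`, combine results" step above.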
6. **Output**
   - Responds via the Webhook node with:
     - `combinedContent` (all pages concatenated)
     - `pages[]` (array of individual results)
   - Large results are chunked when they exceed ~12,000 characters
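The ~12,000-character chunking described for the output step can be sketched as follows. The threshold comes from the description above; the function name and exact splitting logic are assumptions, as the template's shipped code may differ.

```javascript
// Split a long combinedContent string into chunks of at most `size` characters.
// Sketch only; the template's actual chunking code may differ in detail.
function chunkContent(text, size = 12000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}
```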
## 🛠️ Setup Instructions

1. **Import Template**
   Load the template from n8n Community Templates.
2. **Configure Webhook**
   - Open the Webhook node
   - Copy the Test URL (for development) or the Production URL (after deployment)
   - You'll POST crawl requests to this endpoint
3. **Run a Test**
   Send a POST request with JSON:

   ```bash
   curl -X POST https://<your-n8n>/webhook/<id> \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com"}'
   ```

4. **View Response**
   The crawler returns a JSON object containing `combinedContent` and `pages[]`.
## ⚙️ Configuration

- **maxDepth**
  Default: 3. Adjust it in the Init Crawl Params (Set) node.
- **Timeouts**
  The HTTP Request node timeout is 5 seconds per request; increase it if needed.
- **Filtering Rules**
  - Only same-domain links are followed (apex and `www` are treated as same-site)
  - Skips anchors and `mailto:`, `tel:`, and `javascript:` links
  - Skips document links (`.pdf`, `.docx`, `.xlsx`, `.pptx`)
  - You can tweak the regex and logic in the Queue & Dedup Links (Code) node
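As one example of such a tweak, the extension skip-list could be widened to also ignore archives and images. This is a hypothetical modification, not the regex shipped in the node:

```javascript
// Hypothetical extension of the document-link filter in Queue & Dedup Links:
// also skip archives and common image formats, not just Office/PDF documents.
const SKIP_EXTENSIONS = /\.(pdf|docx?|xlsx?|pptx?|zip|tar\.gz|png|jpe?g|gif)(\?.*)?$/i;

function isSkippableFile(pathname) {
  return SKIP_EXTENSIONS.test(pathname);
}
```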
## 🚧 Limitations
- No JavaScript rendering (static HTML only)
- No authentication/cookies/session handling
- Large sites can be slow or hit timeouts; chunking mitigates response size
## ✅ Example Use Cases
- Extract text across your site for AI ingestion / embeddings
- SEO/content audit and internal link checks
- Build a lightweight page corpus for downstream processing in n8n
## ⏱️ Estimated Setup Time

~10 minutes (import → set webhook → test request)
# Domain-Specific Web Content Crawler with Depth Control and Text Extraction
This n8n workflow provides a powerful and flexible solution for crawling web content, extracting text, and controlling the crawling depth. It's ideal for gathering specific information from websites, building datasets for analysis, or monitoring changes on particular pages.
## What it does
This workflow automates the following steps:
- Receives a Webhook Trigger: The workflow starts when it receives an HTTP POST request to its webhook URL. The request should include the `url` to start crawling from and an optional `depth` parameter to control how many levels deep the crawler should go.
- Initializes Crawling Parameters: Sets default values for `depth` (if not provided) and `crawledUrls` (to keep track of visited URLs).
- Loops Through URLs: It enters a loop that processes URLs in batches.
- Checks Crawling Depth: For each URL, it verifies that the current crawling depth is within the specified limit.
- Fetches Web Page Content: If the depth limit is not exceeded, it makes an HTTP GET request to fetch the HTML content of the URL.
- Extracts Text and Links: It then uses an HTML node to extract all visible text content and all `href` attributes (links) from the fetched page.
- Filters and Prepares New URLs: It filters out invalid or already-crawled links and prepares the new URLs for the next iteration of the loop, incrementing the depth for each.
- Merges Results: After processing a batch, it merges the extracted text and new URLs back into the main flow.
- Responds to Webhook: Once all crawling is complete, it responds to the initial webhook with the collected text content.
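The "Filters and Prepares New URLs" step above can be sketched like this. The function name and exact shape of the queued items are illustrative, not the workflow's literal Code-node contents:

```javascript
// Sketch of filtering out already-crawled links and preparing the next
// iteration's URLs with an incremented depth (illustrative names only).
function prepareNextUrls(links, crawledUrls, currentDepth, maxDepth) {
  const next = [];
  for (const link of links) {
    if (crawledUrls.includes(link)) continue;      // already visited
    if (currentDepth + 1 > maxDepth) continue;     // would exceed the depth limit
    crawledUrls.push(link);                        // mark as queued, dropping duplicates
    next.push({ url: link, depth: currentDepth + 1 });
  }
  return next;
}
```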
## Prerequisites/Requirements
- n8n Instance: A running n8n instance to import and execute the workflow.
- Basic Understanding of Webhooks: Knowledge of how to send HTTP POST requests to trigger the workflow.
## Setup/Usage
- Import the Workflow:
  - Copy the provided JSON code.
  - In your n8n instance, click "New" to create a new workflow.
  - Go to "File" > "Import from JSON" and paste the JSON code.
  - Click "Import".
- Activate the Workflow:
  - Toggle the "Active" switch in the top-right corner of the n8n editor.
- Get the Webhook URL:
  - The "Webhook" node (Node ID: 47) displays a unique URL. Copy this URL.
- Trigger the Workflow:
  - Send an HTTP POST request to the copied webhook URL.
  - The request body should be JSON and contain at least a `url` field.
  - Optionally, include a `depth` field to control the crawling depth (e.g., `{"url": "https://example.com", "depth": 2}`). If `depth` is omitted, it defaults to 1.
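The defaulting behavior for the request parameters can be sketched as below. The field names follow this documentation; the actual Set-node expressions in the workflow may differ.

```javascript
// Sketch of parameter initialization with the documented defaults
// (illustrative; the workflow's Set node may implement this differently).
function initCrawlParams(body) {
  if (!body || typeof body.url !== "string") {
    throw new Error("Request body must contain a 'url' field");
  }
  return {
    url: body.url,
    depth: Number.isInteger(body.depth) ? body.depth : 1, // depth defaults to 1
    crawledUrls: [],                                      // visited-URL tracker
  };
}
```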
Example cURL command to trigger the workflow:
```bash
curl -X POST \
  YOUR_WEBHOOK_URL \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://n8n.io/blog/",
    "depth": 2
  }'
```
The workflow will return the extracted text content from the crawled pages in its response.