# Domain-specific web content crawler with depth control & text extraction
This template implements a recursive web crawler inside n8n. Starting from a given URL, it crawls linked pages up to a maximum depth (default: 3), extracts text and links, and returns the collected content via webhook.
## 🔍 How It Works

1. **Webhook Trigger**
   Accepts a JSON body with a `url` field.
   Example payload: `{ "url": "https://example.com" }`
2. **Initialization**
   - Sets crawl parameters: `url`, `domain`, `maxDepth = 3`, and `depth = 0`.
   - Initializes global static data (`pending`, `visited`, `queued`, `pages`).
3. **Recursive Crawling**
   - Fetches each page (HTTP Request).
   - Extracts body text and links (HTML node).
   - Cleans and deduplicates links.
   - Filters out:
     - External domains (only same-site links are followed)
     - Anchors (`#`) and `mailto:`/`tel:`/`javascript:` links
     - Non-HTML files (`.pdf`, `.docx`, `.xlsx`, `.pptx`)
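The filtering rules above can be sketched in plain JavaScript. This is only an illustration of the described behavior, not the template's actual Code-node contents; the function and constant names are invented here.

```javascript
// Sketch of the link-filtering rules described above (illustrative names,
// not the workflow's literal code).
const SKIP_SCHEMES = /^(mailto:|tel:|javascript:|#)/i;
const SKIP_EXTENSIONS = /\.(pdf|docx|xlsx|pptx)(\?.*)?$/i;

function normalizeHost(hostname) {
  // Treat apex and www as the same site.
  return hostname.replace(/^www\./i, "").toLowerCase();
}

function shouldFollow(href, baseUrl) {
  if (!href || SKIP_SCHEMES.test(href)) return false;
  let resolved;
  try {
    resolved = new URL(href, baseUrl); // resolves relative links too
  } catch (err) {
    return false; // unparsable URL
  }
  if (!/^https?:$/.test(resolved.protocol)) return false;
  if (SKIP_EXTENSIONS.test(resolved.pathname)) return false;
  // Only follow same-site links.
  return normalizeHost(resolved.hostname) === normalizeHost(new URL(baseUrl).hostname);
}
```

A relative link like `/about` resolves against the base URL and is followed, while `mailto:`, anchor, cross-domain, and `.pdf` links are dropped.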
4. **Depth Control & Queue**
   - Tracks visited URLs
   - Stops at `maxDepth` to prevent infinite loops
   - Uses SplitInBatches to loop through the queue
5. **Data Collection**
   - Saves each crawled page (`url`, `depth`, `content`) into `pages[]`
   - When `pending = 0`, combines the results
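The `pending`/`visited`/`queued`/`pages` bookkeeping can be modeled outside n8n as a small in-memory class. In the template itself this state lives in n8n's global static data; the class below is only an illustration of the mechanism under that assumption.

```javascript
// Minimal in-memory model of the crawl bookkeeping (pending/visited/queued/pages).
// Illustrative only; the template keeps this state in n8n workflow static data.
class CrawlState {
  constructor(maxDepth = 3) {
    this.maxDepth = maxDepth;
    this.visited = new Set();
    this.queue = [];   // { url, depth } items awaiting fetch
    this.pending = 0;  // requests currently in flight
    this.pages = [];   // { url, depth, content } results
  }

  enqueue(url, depth) {
    // Deduplicate and enforce the depth limit before queueing.
    if (depth > this.maxDepth || this.visited.has(url)) return false;
    this.visited.add(url);
    this.queue.push({ url, depth });
    return true;
  }

  // Called once a page has been fetched and parsed.
  record(url, depth, content, links) {
    this.pages.push({ url, depth, content });
    for (const link of links) this.enqueue(link, depth + 1);
  }

  get done() {
    // Crawl finishes when nothing is queued and nothing is in flight.
    return this.queue.length === 0 && this.pending === 0;
  }
}
```

When `done` becomes true, the results in `pages` are combined, mirroring the "when `pending = 0`, combine results" step above.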
6. **Output**
   - Responds via the Webhook node with:
     - `combinedContent` (all pages concatenated)
     - `pages[]` (array of individual results)
   - Large results are chunked when they exceed ~12,000 characters
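The ~12,000-character chunking described for the output step can be sketched as follows. The threshold comes from the description above; the function name and exact splitting logic are assumptions, as the template's shipped code may differ.

```javascript
// Split a long combinedContent string into chunks of at most `size` characters.
// Sketch only; the template's actual chunking code may differ in detail.
function chunkContent(text, size = 12000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}
```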
## 🛠️ Setup Instructions

1. **Import Template**
   Load the template from n8n Community Templates.
2. **Configure Webhook**
   - Open the Webhook node
   - Copy the Test URL (for development) or the Production URL (after deployment)
   - You'll POST crawl requests to this endpoint
3. **Run a Test**
   Send a POST request with JSON:

   ```bash
   curl -X POST https://<your-n8n>/webhook/<id> \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com"}'
   ```

4. **View Response**
   The crawler returns a JSON object containing `combinedContent` and `pages[]`.
## ⚙️ Configuration

- **maxDepth**
  Default: 3. Adjust it in the Init Crawl Params (Set) node.
- **Timeouts**
  The HTTP Request node timeout is 5 seconds per request; increase it if needed.
- **Filtering Rules**
  - Only same-domain links are followed (apex and `www` are treated as same-site)
  - Skips anchors and `mailto:`, `tel:`, and `javascript:` links
  - Skips document links (`.pdf`, `.docx`, `.xlsx`, `.pptx`)
  - You can tweak the regex and logic in the Queue & Dedup Links (Code) node
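As one example of such a tweak, the extension skip-list could be widened to also ignore archives and images. This is a hypothetical modification, not the regex shipped in the node:

```javascript
// Hypothetical extension of the document-link filter in Queue & Dedup Links:
// also skip archives and common image formats, not just Office/PDF documents.
const SKIP_EXTENSIONS = /\.(pdf|docx?|xlsx?|pptx?|zip|tar\.gz|png|jpe?g|gif)(\?.*)?$/i;

function isSkippableFile(pathname) {
  return SKIP_EXTENSIONS.test(pathname);
}
```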
## 🚧 Limitations
- No JavaScript rendering (static HTML only)
- No authentication/cookies/session handling
- Large sites can be slow or hit timeouts; chunking mitigates response size
## ✅ Example Use Cases
- Extract text across your site for AI ingestion / embeddings
- SEO/content audit and internal link checks
- Build a lightweight page corpus for downstream processing in n8n
## ⏱️ Estimated Setup Time

~10 minutes (import → set webhook → test request)
# Domain-Specific Web Content Crawler with Depth Control and Text Extraction
This n8n workflow provides a powerful and flexible solution for crawling web content, extracting text, and controlling the crawling depth. It's ideal for gathering specific information from websites, building datasets for analysis, or monitoring changes on particular pages.
## What it does
This workflow automates the following steps:
- Receives a Webhook Trigger: The workflow starts when it receives an HTTP POST request to its webhook URL. The request should include the `url` to start crawling from and an optional `depth` parameter to control how many levels deep the crawler should go.
- Initializes Crawling Parameters: Sets default values for `depth` (if not provided) and `crawledUrls` (to keep track of visited URLs).
- Loops Through URLs: It enters a loop that processes URLs in batches.
- Checks Crawling Depth: For each URL, it verifies that the current crawling depth is within the specified limit.
- Fetches Web Page Content: If the depth limit is not exceeded, it makes an HTTP GET request to fetch the HTML content of the URL.
- Extracts Text and Links: It then uses an HTML node to extract all visible text content and all `href` attributes (links) from the fetched page.
- Filters and Prepares New URLs: It filters out invalid or already-crawled links and prepares the new URLs for the next iteration of the loop, incrementing the depth for each.
- Merges Results: After processing a batch, it merges the extracted text and new URLs back into the main flow.
- Responds to Webhook: Once all crawling is complete, it responds to the initial webhook with the collected text content.
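The "Filters and Prepares New URLs" step above can be sketched like this. The function name and exact shape of the queued items are illustrative, not the workflow's literal Code-node contents:

```javascript
// Sketch of filtering out already-crawled links and preparing the next
// iteration's URLs with an incremented depth (illustrative names only).
function prepareNextUrls(links, crawledUrls, currentDepth, maxDepth) {
  const next = [];
  for (const link of links) {
    if (crawledUrls.includes(link)) continue;      // already visited
    if (currentDepth + 1 > maxDepth) continue;     // would exceed the depth limit
    crawledUrls.push(link);                        // mark as queued, dropping duplicates
    next.push({ url: link, depth: currentDepth + 1 });
  }
  return next;
}
```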
## Prerequisites/Requirements
- n8n Instance: A running n8n instance to import and execute the workflow.
- Basic Understanding of Webhooks: Knowledge of how to send HTTP POST requests to trigger the workflow.
## Setup/Usage
- Import the Workflow:
  - Copy the provided JSON code.
  - In your n8n instance, click "New" to create a new workflow.
  - Go to "File" > "Import from JSON" and paste the JSON code.
  - Click "Import".
- Activate the Workflow:
  - Toggle the "Active" switch in the top-right corner of the n8n editor.
- Get the Webhook URL:
  - The "Webhook" node (Node ID: 47) displays a unique URL. Copy this URL.
- Trigger the Workflow:
  - Send an HTTP POST request to the copied webhook URL.
  - The request body should be JSON and contain at least a `url` field.
  - Optionally, include a `depth` field to control the crawling depth (e.g., `{"url": "https://example.com", "depth": 2}`). If `depth` is omitted, it defaults to 1.
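The defaulting behavior for the request parameters can be sketched as below. The field names follow this documentation; the actual Set-node expressions in the workflow may differ.

```javascript
// Sketch of parameter initialization with the documented defaults
// (illustrative; the workflow's Set node may implement this differently).
function initCrawlParams(body) {
  if (!body || typeof body.url !== "string") {
    throw new Error("Request body must contain a 'url' field");
  }
  return {
    url: body.url,
    depth: Number.isInteger(body.depth) ? body.depth : 1, // depth defaults to 1
    crawledUrls: [],                                      // visited-URL tracker
  };
}
```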
Example cURL command to trigger the workflow:
```bash
curl -X POST \
  YOUR_WEBHOOK_URL \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://n8n.io/blog/",
    "depth": 2
  }'
```
The workflow will return the extracted text content from the crawled pages in its response.