3 templates found

Web scraper: extract website content from sitemaps to Google Drive

A reliable, no-frills web scraper that extracts content directly from websites using their sitemaps. Perfect for content audits, migrations, and research when you need straightforward HTML extraction without external dependencies.

How It Works

This streamlined workflow takes a practical approach to web scraping by leveraging XML sitemaps and direct HTTP requests. Here's how it delivers consistent results:

- Direct Sitemap Processing: The workflow starts by fetching your target website's XML sitemap and parsing it to extract all available page URLs. This eliminates guesswork and ensures comprehensive coverage of the site's content structure.
- Robust HTTP Scraping: Each page is scraped using direct HTTP requests with realistic browser headers that mimic legitimate web traffic. The scraper includes comprehensive error handling and timeout protection to handle various website configurations gracefully.
- Intelligent Content Extraction: The workflow uses JavaScript parsing to extract meaningful content from raw HTML. It identifies page titles through multiple methods (title tags, Open Graph metadata, H1 headers) and converts HTML structure into readable text.
- Framework Detection: Built-in detection identifies whether sites use WordPress, Divi themes, or heavy JavaScript frameworks. This helps explain content extraction quality and provides insight into the site's technical architecture.
- Rich Metadata Collection: Each scraped page includes detailed metadata such as word count, HTML size, response codes, and technical indicators. This data is formatted into markdown files with YAML frontmatter for easy analysis and organization.
- Respectful Rate Limiting: The workflow waits 3 seconds between page requests to respect server resources and avoid overwhelming target websites. Processing is sequential and controlled to maintain ethical scraping practices.
- Detailed Success Reporting: Every scraped page generates a report showing extraction success, potential issues (such as JavaScript dependencies), and technical details about the site's structure and framework.

Setup Steps

1. Configure Google Drive Integration
   - Connect your Google Drive account in the "Save to Google Drive" node.
   - Replace YOURGOOGLEDRIVECREDENTIALID with your actual Google Drive credential ID.
   - Create a dedicated folder for your scraped content in Google Drive.
   - Copy the folder ID from the Google Drive URL (the long string after /folders/).
   - Replace YOURGOOGLEDRIVEFOLDERID_HERE with your actual folder ID in both the folderId field and cachedResultUrl.
   - Update YOURFOLDERNAME_HERE with your folder's actual name.
2. Set Your Target Website
   - In the "Set Sitemap URL" node, replace https://yourwebsitehere.com/page-sitemap.xml with your target website's sitemap URL.
   - Common sitemap locations include /sitemap.xml, /page-sitemap.xml, or /sitemap_index.xml.
   - Tip: Not sure where your sitemap is? Use a free online tool like https://seomator.com/sitemap-finder.
   - Verify the sitemap URL loads correctly in your browser before running the workflow.
3. Update Workflow IDs (Automatic)
   - When you import this workflow, n8n automatically generates new IDs for YOURWORKFLOWIDHERE, YOURVERSIONIDHERE, YOURINSTANCEIDHERE, and YOURWEBHOOKIDHERE.
   - No manual changes are needed for these placeholders.
4. Adjust Processing Limits (Optional)
   - The "Limit URLs (Optional)" node is disabled by default for full-site scraping.
   - Enable this node and set a smaller number (5-10) for initial testing.
   - For large websites, consider running in batches to manage processing time and storage.
5. Customize Rate Limiting (Optional)
   - The "Wait Between Pages" node is set to 3 seconds by default.
   - Increase the delay for more respectful scraping of busy sites.
   - Decrease it only if you have permission and the target site can handle faster requests.
6. Test Your Configuration
   - Enable the "Limit URLs (Optional)" node and set it to 3-5 pages for testing.
   - Click "Test workflow" to verify the setup works correctly.
   - Check your Google Drive folder to confirm files are being created with proper content.
   - Review the generated markdown files to assess content extraction quality.
7. Run Full Extraction
   - Disable the "Limit URLs (Optional)" node for complete site scraping.
   - Execute the workflow and monitor the execution log for errors.
   - Large websites may take considerable time to process (plan for several hours for sites with hundreds of pages).
8. Review Results
   - Each generated file includes technical metadata to help you assess extraction quality.
   - Look for indicators like "Limited Content" warnings on JavaScript-heavy pages.
   - Files include word counts and framework detection to help you understand the site's structure.

Framework Compatibility: This scraper is designed to work well with WordPress sites, Divi themes, and many JavaScript-heavy frameworks. The content extraction handles dynamic content effectively and reports framework detection in detail. Some single-page applications (SPAs) that render entirely through JavaScript may yield limited content, but most modern websites, including those built with popular CMS platforms, work well with this scraper.

Important Notes: Always ensure you have permission to scrape your target website and respect its robots.txt guidelines. The workflow includes respectful delays and error handling, but monitor your usage to maintain ethical scraping practices.
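The markdown-with-YAML-frontmatter output described above can be sketched like this. A minimal Python sketch; the exact frontmatter field names the workflow emits are assumptions for illustration:

```python
def build_markdown(url: str, title: str, content: str, word_count: int,
                   status_code: int, framework: str) -> str:
    """Assemble a markdown document with YAML frontmatter.

    Field names (url, word_count, status_code, framework) are illustrative,
    not the workflow's exact schema.
    """
    frontmatter = "\n".join([
        "---",
        f"url: {url}",
        f"title: {title}",
        f"word_count: {word_count}",
        f"status_code: {status_code}",
        f"framework: {framework}",
        "---",
    ])
    return f"{frontmatter}\n\n# {title}\n\n{content}\n"
```

Keeping the technical metadata in frontmatter (rather than mixed into the body) is what makes the "Review Results" step practical: any frontmatter-aware tool can filter scraped pages by word count or framework.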

By Wolf Bishop
4251

AI-powered domain & IP security check automation

Description

This workflow automates security reputation checks of domains and IP addresses using multiple APIs, including VirusTotal, AbuseIPDB, and Google DNS. It assesses potential threats, including malicious and suspicious scores, as well as email security configurations (SPF, DKIM, DMARC). The analysis results are processed by AI into a concise assessment, then automatically written back to Google Sheets for documentation and follow-up.

How It Works

1. Automatic Trigger – The workflow runs periodically via a Schedule Trigger.
2. Data Retrieval – Fetches the list of domains from Google Sheets with status "To do".
3. Domain Analysis – Uses the VirusTotal API to get the domain report, perform a rescan, and check IP resolutions.
4. IP Analysis – Checks IP reputation using AbuseIPDB.
5. Email Security Validation – Verifies SPF, DKIM, and DMARC configurations via Google DNS.
6. AI Assessment – Analysis data is processed by AI into a short summary in Indonesian.
7. Data Update – Results are automatically written to Google Sheets, changing the status to "Done" or adding notes if potential threats are found.

How to Set Up

1. Prepare API Keys – Sign up and obtain API keys from VirusTotal and AbuseIPDB, and set up access to the Google Sheets API.
2. Configure Credentials in n8n – Add VirusTotal API, AbuseIPDB API, and Google Sheets OAuth credentials in n8n.
3. Prepare Google Sheets – Create a sheet with columns No, Domain, Customer, Keterangan, Status. Ensure initial rows have the status "To do".
4. Import Workflow – Upload the workflow JSON file into n8n.
5. Set Schedule Trigger – Define the checking interval as needed (e.g., every hour).
6. Test Run – Run the workflow manually to confirm all API connections and the Google Sheets output work properly.
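The email security validation step queries Google's DNS-over-HTTPS API (e.g. https://dns.google/resolve?name=example.com&type=TXT) and inspects the TXT answers. A minimal Python sketch of the parsing side, operating on an already-fetched `Answer` list; the function name is illustrative, and note that DMARC records live on the `_dmarc.` subdomain, so they require a separate query:

```python
def check_email_security(txt_answers: list[dict]) -> dict:
    """Flag SPF and DMARC records in the 'Answer' list of a Google DNS
    TXT response. Google DNS wraps TXT data in literal double quotes,
    so strip them before matching the record prefixes."""
    records = [answer.get("data", "").strip('"') for answer in txt_answers]
    return {
        "spf": any(r.startswith("v=spf1") for r in records),
        "dmarc": any(r.startswith("v=DMARC1") for r in records),
    }
```

In the workflow this kind of check is what feeds the AI assessment: a domain with no SPF or DMARC record is worth a note in the Keterangan column even when its VirusTotal score is clean.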

By Garri
1181

Automated WhatsApp group weekly team reports with Gemini AI summarization

This n8n template automatically summarizes your WhatsApp group activity from the past week and generates a team report.

Why use this?

Remote teams rely on chat for communication, but important discussions, decisions, and ideas get buried in message threads and forgotten by Monday. This workflow ensures nothing falls through the cracks.

How it works

- Runs every Monday at 6am to collect the previous week's group messages
- Groups conversations by participant and analyzes message threads
- AI summarizes each member's activity into a personal report
- Combines all individual reports into one comprehensive team overview
- Posts the final report back to your WhatsApp group to kick off the new week

Setup requirements

- WhatsApp via whapAround.pro (no Meta API needed)
- Gemini AI (or an alternative LLM of your choice)

Best practices

- Use one workflow per WhatsApp group for focused results
- Filter for specific team members if needed
- Customize the report tone to match your team culture
- Adjust the schedule if weekly reports don't suit your team's pace

Customization ideas

- Send reports via email instead of posting to busy groups
- Include project metrics alongside message summaries
- Connect to knowledge bases or ticket systems for additional context

Perfect for project managers who want to keep distributed teams aligned and ensure important conversations don't get lost in the chat noise.
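The grouping step, collecting each participant's messages before the per-member AI summary, can be sketched as follows. A standalone Python sketch; the message field names (`sender`, `text`) are assumptions, not whapAround.pro's actual payload schema:

```python
from collections import defaultdict

def group_by_participant(messages: list[dict]) -> dict[str, list[str]]:
    """Group raw chat messages (each {'sender': ..., 'text': ...}) by sender,
    preserving message order within each participant's thread."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for msg in messages:
        grouped[msg["sender"]].append(msg["text"])
    return dict(grouped)
```

Each value in the returned dict becomes the context for one per-member summarization prompt, and those summaries are then concatenated into the team-wide report.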

By Jamot
1007