
URL Scraper Node Documentation

Overview

The URL Scraper node automatically extracts content from web pages and documents by visiting URLs in your workflow data. It can scrape text, process PDFs, handle images with OCR, and even execute custom JavaScript on web pages, with no coding required for the standard options.

Perfect for content research, competitive analysis, lead generation, and data collection workflows where you need to automatically gather information from websites at scale.

Key Features

  • Multiple Content Types: Extract text from web pages, PDFs, and images
  • Batch Processing: Process multiple URLs automatically with error handling
  • Smart Extraction: Choose between clean text or HTML-preserved content
  • Browser Automation: Use Chrome browser for JavaScript-heavy sites
  • OCR Capabilities: Extract text from images automatically
  • Custom JavaScript: Execute advanced scraping logic when needed

Configuration Parameters

URL Source Section

Property Path

  • Field Name: urlSourceProp
  • Type: Smart text field with data path suggestions
  • Default Value: Empty
  • Simple Description: Specifies which field in your data contains the URLs to scrape
  • When to Change This: Point to different data columns containing URLs (e.g., "website_url", "product_links", "competitor_pages")
  • Business Impact: Correctly mapping this field ensures your workflow scrapes the right web pages from your data
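Conceptually, the Property Path is a dot-separated lookup into each record in your data. The sketch below is a hypothetical Python model of that resolution; the node's actual lookup logic is internal, and `resolve_path` is an illustrative name, not part of the product.

```python
# Hypothetical sketch of how a Property Path like "company.website_url"
# could be resolved against a workflow record. Illustrative only.

def resolve_path(record, path):
    """Walk a dot-separated path through nested dictionaries."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # path does not exist in this record
        value = value[key]
    return value

row = {"company": {"name": "Acme", "website_url": "https://acme.example"}}
print(resolve_path(row, "company.website_url"))  # https://acme.example
print(resolve_path(row, "missing.field"))        # None
```

If a record is missing the field, the URL simply isn't found for that row, which is why the batch settings below matter for large datasets.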

Process PDF Files

  • Field Name: processPdfs
  • Type: Toggle switch (On/Off)
  • Default Value: Off
  • Simple Description: Automatically extracts text content from PDF documents when URLs point to PDF files
  • When to Change This:
    • On: When your URLs include PDF documents, reports, or whitepapers
    • Off: When you only need to scrape regular web pages
  • Business Impact: Enables automatic processing of PDF content without manual downloads, saving hours of document handling time
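One simple way a PDF URL can be recognized ahead of time is by the extension on the URL path. The sketch below shows that idea with the standard library; it is an assumption about the mechanism, and the node may also inspect the response's Content-Type header.

```python
# Hedged sketch: recognize PDF URLs by path extension, ignoring
# query strings. The node's real detection logic is internal.
from urllib.parse import urlparse

def looks_like_pdf(url):
    """Return True if the URL path ends in .pdf (case-insensitive)."""
    return urlparse(url).path.lower().endswith(".pdf")

print(looks_like_pdf("https://example.com/reports/q3.pdf?download=1"))  # True
print(looks_like_pdf("https://example.com/reports/q3"))                  # False
```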

Batch Settings Section

Batch Processing Mode

  • Field Name: batchOption
  • Type: Dropdown menu with options:
    • None: Process URLs individually without special error handling
    • Iterate to next (if response error): Skip failed URLs and continue with remaining ones
    • Iterate to next (always): Process all URLs in sequence regardless of individual results
  • Default Value: None
  • Simple Description: Controls how the node handles multiple URLs and errors during processing
  • When to Change This:
    • Use "Iterate to next (if response error)" when processing large lists where some URLs might be broken
    • Use "Iterate to next (always)" for systematic processing of URL collections
  • Business Impact: Prevents entire workflows from stopping due to single URL failures, ensuring maximum data collection
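The three modes differ only in what happens when a single URL fails. The Python sketch below models that behavior with a stand-in `fetch` function; the mode names and the `scrape_batch` helper are illustrative, since the node handles all of this internally.

```python
# Illustrative model of the three batch modes' error semantics.

def scrape_batch(urls, fetch, mode="none"):
    """mode: 'none' (fail fast), 'iterate_on_error' (skip failures),
    'iterate_always' (record every outcome, errors included)."""
    results = []
    for url in urls:
        try:
            results.append((url, fetch(url)))
        except Exception as err:
            if mode == "none":
                raise                                   # one bad URL stops the run
            if mode == "iterate_on_error":
                continue                                # skip and move on
            results.append((url, f"error: {err}"))      # iterate_always
    return results

def fake_fetch(url):
    if "broken" in url:
        raise ValueError("404")
    return "content"

urls = ["https://a.example", "https://broken.example", "https://b.example"]
print(scrape_batch(urls, fake_fetch, mode="iterate_on_error"))
```

With "iterate_on_error" the broken URL is dropped and two results come back; with "iterate_always" all three rows are returned, the failure recorded as an error entry.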

Scrape Operation Section

Scraping Engine

  • Field Name: engineId
  • Type: Dropdown menu with options:
    • HTTP Client: Fast, lightweight scraping for standard web pages
    • Chrome Browser (Selenium): Full browser automation for JavaScript-heavy sites
  • Default Value: HTTP Client
  • Simple Description: Chooses the method used to access and scrape web pages
  • When to Change This:
    • Use HTTP Client for news sites, blogs, product pages, and static content
    • Use Chrome Browser for social media, SPAs, e-commerce sites with dynamic loading
  • Business Impact: Chrome Browser can capture content that HTTP Client misses on modern JavaScript-driven sites, but it is slower and uses more resources

HTTP Client Options (when HTTP Client is selected)

Content Extraction Method

  • Field Name: scrapeOp
  • Type: Dropdown menu with options:
    • Extract Text (Remove HTML): Clean text content only, no formatting
    • Extract Text (Keep HTML): Preserves HTML tags and structure
  • Default Value: Extract Text (Remove HTML)
  • Simple Description: Determines whether to keep HTML formatting in scraped content
  • When to Change This:
    • Use "Remove HTML" for analysis, AI processing, or clean data storage
    • Use "Keep HTML" when you need links, formatting, or page structure
  • Business Impact: Clean text takes far less storage and is easier for AI models and analysis tools to work with

Apply OCR on Images

  • Field Name: applyOcrOnImages
  • Type: Toggle switch (On/Off)
  • Default Value: Off
  • Simple Description: Automatically extracts text from images found on web pages
  • When to Change This:
    • On: When scraping sites with text in images, infographics, or scanned documents
    • Off: For faster processing when images don't contain relevant text
  • Business Impact: Captures text that would otherwise be lost on image-heavy websites, but increases processing time

Remove JavaScript and CSS

  • Field Name: stripJsCss
  • Type: Toggle switch (On/Off)
  • Default Value: On
  • Simple Description: Removes code and styling elements to focus on actual content
  • When to Change This:
    • On: For cleaner content extraction and faster processing
    • Off: When you need to preserve all page elements
  • Business Impact: Substantially reduces extracted content size and improves text quality
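This kind of extraction can be done with nothing more than an HTML parser. Below is a minimal standard-library sketch, not the node's actual implementation, that mimics "Extract Text (Remove HTML)" with "Remove JavaScript and CSS" turned on: tags are dropped entirely, and the contents of `<script>`/`<style>` blocks are skipped.

```python
# Minimal stdlib sketch of tag removal plus script/style stripping.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # nonzero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

html = """<html><head><style>h1{color:red}</style></head>
<body><h1>Price: $19</h1><script>track()</script><p>In stock</p></body></html>"""
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))  # Price: $19 In stock
```

The style rule and the tracking script never reach the output, which is exactly the "cleaner content extraction" behavior described above.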

Chrome Browser Options (when Chrome Browser is selected)

Wait Condition

  • Field Name: webDriverWaitId
  • Type: Dropdown menu with options:
    • None: Start scraping immediately after page loads
    • Presence of element located: Wait until specific element appears on page
    • Visibility of element located: Wait until element becomes visible
    • Element to be clickable: Wait until element can be interacted with
  • Default Value: None
  • Simple Description: Tells the browser what to wait for before starting to scrape content
  • When to Change This: Use specific wait conditions for sites that load content dynamically or require user interaction
  • Business Impact: Proper wait conditions ensure dynamic content has actually loaded, so you capture far more of the page than an immediate scrape would

Maximum Wait Time

  • Field Name: webDriverWaitSeconds
  • Type: Number input
  • Default Value: 10
  • Valid Range: 1 to 300 seconds
  • Simple Description: How long to wait for the wait condition before giving up
  • When to Change This:
    • Use 5-15 seconds for most websites
    • Use 30+ seconds for very slow-loading sites
    • Use 60+ seconds for complex applications
  • Business Impact: Longer waits capture more content but slow down processing; optimize based on your specific websites
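All of the wait conditions follow the same pattern: poll for a condition until it holds or the maximum wait time elapses. The node uses Selenium's explicit waits for this; the `wait_until` helper below is a simplified pure-Python stand-in to show the timeout trade-off.

```python
# Generic sketch of the wait-and-timeout pattern behind Wait Condition
# and Maximum Wait Time. Illustrative only; not the node's internals.
import time

def wait_until(condition, timeout_seconds=10, poll_interval=0.5):
    """Return True once condition() is truthy, False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    return False

# Example: a "page element" that appears after a short delay.
start = time.monotonic()
appeared = wait_until(lambda: time.monotonic() - start > 0.2,
                      timeout_seconds=2, poll_interval=0.05)
print(appeared)  # True
```

A condition that never holds simply burns the full timeout before returning, which is why overly long waits on broken pages slow a batch down.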

Wait For Element

  • Field Name: webDriverWaitSelector
  • Type: Smart text field
  • Default Value: Empty
  • Simple Description: CSS selector identifying which page element to wait for
  • When to Change This: Enter selectors like ".content", "#main-article", or ".product-details" based on the websites you're scraping
  • Business Impact: Accurate selectors ensure content is fully loaded before scraping, preventing incomplete data collection

JavaScript Execution Mode

  • Field Name: webDriverJavaScriptAsync
  • Type: Toggle switch (On/Off)
  • Default Value: Off
  • Simple Description: Controls whether custom JavaScript runs synchronously or asynchronously
  • When to Change This:
    • Off: For simple JavaScript that runs quickly
    • On: For complex operations that need time to complete
  • Business Impact: Async mode handles complex scraping scenarios but requires more advanced JavaScript knowledge
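The two modes mirror Selenium's `execute_script` (the script's return value is the result) and `execute_async_script` (the script receives a callback as its final argument and must call it to deliver a result). The plain-Python model below illustrates that contract; `run_sync` and `run_async` are hypothetical names, not part of the node.

```python
# Plain-Python model of sync vs. async script execution contracts.

def run_sync(script):
    """Sync mode: the script's return value is the result."""
    return script()

def run_async(script):
    """Async mode: the script receives a 'done' callback and must call
    it to deliver its result (mirrors execute_async_script's contract)."""
    result = {}
    script(lambda value: result.setdefault("value", value))
    if "value" not in result:
        raise RuntimeError("script never called the 'done' callback")
    return result["value"]

# Sync: quick DOM reads return immediately.
print(run_sync(lambda: "page title"))
# Async: the script decides when it is finished.
print(run_async(lambda done: done("loaded after AJAX")))
```

This is why async mode suits scripts that wait on AJAX or animations: the browser holds the result until your code explicitly signals completion, and forgetting to call the callback produces a timeout rather than a result.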

Output Section

Output Format

  • Field Name: outTransformId
  • Type: Dropdown menu with options:
    • Original with appended result column: Keeps all original data and adds scraped content as new column
    • Return result column only: Returns only the scraped content, discarding original data
  • Default Value: Original with appended result column
  • Simple Description: Determines what data your workflow receives after scraping
  • When to Change This:
    • Use "Original with appended" to maintain context and original data
    • Use "Result column only" when you only need the scraped content
  • Business Impact: Appended results preserve data relationships, while result-only output minimizes data volume
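Illustratively, the two formats transform a batch of rows like this (the field names and helper functions here are examples only, not the node's API):

```python
# Sketch of the two output formats applied to a batch of rows.

def append_result(rows, results, column="url_scrape_result"):
    """'Original with appended result column': keep input, add one field."""
    return [{**row, column: res} for row, res in zip(rows, results)]

def result_only(results, column="url_scrape_result"):
    """'Return result column only': discard the original fields."""
    return [{column: res} for res in results]

rows = [{"product": "Widget", "competitor_urls": "https://rival.example/w"}]
scraped = ["Widget Pro - $24.99"]
print(append_result(rows, scraped))
print(result_only(scraped))
```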

Result Property Name

  • Field Name: outColumnName
  • Type: Text field
  • Default Value: "url_scrape_result"
  • Simple Description: Name of the new column that will contain your scraped content
  • When to Change This: Use descriptive names like "product_description", "article_content", or "competitor_pricing"
  • Business Impact: Clear column names make data analysis and workflow debugging much easier

Real-World Use Cases

E-commerce Competitive Analysis

Business Situation: An online retailer wants to monitor competitor pricing and product descriptions across 500+ products daily.

What You'll Configure:

  • Set Property Path to "competitor_urls" (your spreadsheet column with product URLs)
  • Choose "HTTP Client" engine for fast processing
  • Select "Extract Text (Remove HTML)" for clean price data
  • Enable "Iterate to next (if response error)" to handle unavailable products
  • Set Result Property Name to "competitor_data"

What Happens: The workflow visits each competitor URL, extracts pricing and product information, and creates a daily competitive intelligence report.

Business Value: Eliminates many hours per week of manual price checking and enables dynamic pricing strategies that can improve profit margins.

Content Research Automation

Business Situation: A marketing agency needs to gather article content from industry blogs and news sites for client research reports.

What You'll Configure:

  • Set Property Path to "article_links"
  • Choose "Chrome Browser" engine for modern news sites
  • Select "Presence of element located" wait condition
  • Enter ".article-content" in Wait For Element field
  • Choose "Original with appended result column" to keep source URLs
  • Set Result Property Name to "article_text"

What Happens: The workflow loads each article URL, waits for content to fully load, then extracts the main article text while preserving the source information.

Business Value: Dramatically reduces research time and ensures comprehensive coverage of industry trends for client reports.

Lead Generation from Business Directories

Business Situation: A B2B sales team wants to extract contact information and company descriptions from business directory listings.

What You'll Configure:

  • Set Property Path to "directory_urls"
  • Choose "Chrome Browser" engine for dynamic directory sites
  • Select "Visibility of element located" wait condition
  • Enter ".company-details" in Wait For Element field
  • Enable "Apply OCR on Images" to capture contact info in images
  • Set Result Property Name to "company_info"

What Happens: The workflow visits each directory listing, waits for company details to load, extracts text content including any text in images, and compiles comprehensive company profiles.

Business Value: Generates substantially more qualified leads per week and improves sales team efficiency by providing detailed prospect information automatically.

Document Processing Workflow

Business Situation: A legal firm needs to extract key information from PDF contracts and documents stored on their client portal.

What You'll Configure:

  • Set Property Path to "document_urls"
  • Enable "Process PDF files" toggle
  • Choose "HTTP Client" engine for direct PDF access
  • Select "Return result column only" to focus on document content
  • Set Result Property Name to "contract_text"

What Happens: The workflow accesses each PDF URL, extracts all text content from the documents, and provides clean text for further analysis or AI processing.

Business Value: Processes documents many times faster than manual review, enabling faster contract analysis and reducing legal research costs.

Step-by-Step Configuration

Setting Up Basic Web Scraping

  1. Add the Node:

    • Drag the URL Scraper node from the left panel onto your workflow canvas
    • Connect it to your data source node using the arrow connector
  2. Configure URL Source:

    • Click on the URL Scraper node to open the settings panel
    • In the "URL Source" section, click the "Property Path" field
    • Select the column containing your URLs from the dropdown suggestions
    • If processing PDFs, turn on the "Process PDF Files" toggle
  3. Set Up Batch Processing:

    • In the "Batch Settings" section, choose your processing mode:
      • Select "None" for single URL processing
      • Select "Iterate to next (if response error)" for robust batch processing
      • Select "Iterate to next (always)" for systematic processing
  4. Configure Scraping Method:

    • In the "Scrape Operation" section, choose your engine:
      • Select "HTTP Client" for fast, standard web page scraping
      • Select "Chrome Browser" for JavaScript-heavy or dynamic sites
  5. Set Output Preferences:

    • In the "Output" section, choose your output format
    • Enter a descriptive name in "Result Property Name"
    • Click "Save Configuration"

Advanced Chrome Browser Setup

  1. Enable Browser Mode:

    • In "Scrape Operation", select "Chrome Browser (Selenium)"
    • Choose appropriate wait condition from dropdown
    • Set maximum wait time (10-30 seconds recommended)
  2. Configure Wait Conditions:

    • If using element-based waits, enter CSS selector in "Wait For Element"
    • Test selectors using browser developer tools first
    • Common selectors: ".content", "#main", ".article-body"
  3. Add Custom JavaScript (Optional):

    • Scroll to JavaScript section
    • Replace default code with your custom extraction logic
    • Enable "Async" toggle if your code needs time to complete
    • Test thoroughly before deploying
  4. Test Your Configuration:

    • Use the "Test Configuration" button
    • Enter sample URLs in the test panel
    • Verify extracted content appears correctly
    • Adjust settings based on test results

Industry Applications

Digital Marketing Agencies

Common Challenge: Manually tracking competitor content, pricing, and marketing campaigns across dozens of client industries.

How This Node Helps: Automatically scrapes competitor websites, social media pages, and marketing materials to create comprehensive competitive intelligence reports.

Configuration Recommendations:

  • Use "Chrome Browser" engine for social media and dynamic sites
  • Enable "Apply OCR on Images" for social media graphics
  • Set "Iterate to next (if response error)" for reliable batch processing
  • Use descriptive result column names like "competitor_content"

Results: Agencies save many hours per week on competitive research and deliver more comprehensive client reports, supporting stronger client retention.

Real Estate Companies

Common Challenge: Gathering property details, pricing, and descriptions from multiple listing services and competitor websites.

How This Node Helps: Automatically extracts property information, photos, and market data from real estate websites and MLS systems.

Configuration Recommendations:

  • Use "Chrome Browser" for MLS and dynamic property sites
  • Enable "Process PDF files" for property documents
  • Set wait conditions for "Visibility of element located" with ".property-details"
  • Choose "Original with appended result column" to maintain property IDs

Results: Real estate teams process far more property listings per day and identify market opportunities much faster than manual research allows.

Financial Services

Common Challenge: Monitoring regulatory updates, market news, and competitor announcements across hundreds of financial websites.

How This Node Helps: Automatically scrapes financial news sites, regulatory portals, and competitor press releases for compliance and market intelligence.

Configuration Recommendations:

  • Use "HTTP Client" for news sites and regulatory portals
  • Select "Extract Text (Remove HTML)" for clean analysis
  • Enable "Iterate to next (if response error)" for reliable news monitoring
  • Set Result Property Name to "financial_news_content"

Results: Compliance teams catch far more regulatory updates and identify market trends weeks earlier than manual monitoring allows.

Healthcare Organizations

Common Challenge: Researching medical literature, treatment protocols, and pharmaceutical information from various online sources.

How This Node Helps: Automatically extracts content from medical journals, research databases, and pharmaceutical websites for clinical research.

Configuration Recommendations:

  • Use "Chrome Browser" for research databases requiring authentication
  • Enable "Process PDF files" for research papers
  • Set longer wait times (30+ seconds) for database queries
  • Choose "Original with appended result column" to maintain source citations

Results: Medical researchers surface many more relevant studies per week and reduce literature review time from days to hours.

Best Practices

Performance Optimization

  • Use HTTP Client engine when possible for faster processing
  • Set appropriate wait times - longer isn't always better
  • Process URLs in smaller batches (100-500 at a time) for better reliability
  • Use specific CSS selectors to reduce wait times
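The batching suggestion above can be handled upstream of the workflow with a few lines of Python; the 200-row batch size below is just an example.

```python
# Pre-chunk a large URL list into smaller batches before feeding it
# to the workflow. Batch size is illustrative.

def chunked(items, size=200):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

urls = [f"https://example.com/page/{i}" for i in range(450)]
batches = list(chunked(urls, size=200))
print([len(batch) for batch in batches])  # [200, 200, 50]
```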

Error Handling

  • Always enable "Iterate to next (if response error)" for batch processing
  • Test with sample URLs before processing large datasets
  • Monitor workflow logs for common failure patterns
  • Keep backup copies of original URL lists

Data Quality

  • Use "Extract Text (Remove HTML)" for AI analysis and data processing
  • Enable OCR only when necessary to balance speed and completeness
  • Choose descriptive result column names for easier data management
  • Regularly validate scraped content quality with spot checks

Compliance and Ethics

  • Respect website robots.txt files