
URL Scraper Node Documentation

Overview

The URL Scraper node automatically extracts content from web pages and documents by visiting URLs in your workflow data. It can scrape text, process PDFs, handle images with OCR, and even execute custom JavaScript on web pages, with no coding required for the standard options.

Perfect for content research, competitive analysis, lead generation, and data collection workflows where you need to automatically gather information from websites at scale.

Key Features

  • Multiple Content Types: Extract text from web pages, PDFs, and images
  • Batch Processing: Process multiple URLs automatically with error handling
  • Smart Extraction: Choose between clean text or HTML-preserved content
  • Browser Automation: Use Chrome browser for JavaScript-heavy sites
  • OCR Capabilities: Extract text from images automatically
  • Custom JavaScript: Execute advanced scraping logic when needed

Configuration Parameters

URL Source Section

Property Path

  • Field Name: urlSourceProp
  • Type: Smart text field with data path suggestions
  • Default Value: Empty
  • Simple Description: Specifies which field in your data contains the URLs to scrape
  • When to Change This: Point to different data columns containing URLs (e.g., "website_url", "product_links", "competitor_pages")
  • Business Impact: Correctly mapping this field ensures your workflow scrapes the right web pages from your data
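Conceptually, the Property Path is a dot-separated lookup into each record in your data. The sketch below is a hypothetical Python model of that resolution; the node's actual lookup logic is internal, and `resolve_path` is an illustrative name, not part of the product.

```python
# Hypothetical sketch of how a Property Path like "company.website_url"
# could be resolved against a workflow record. Illustrative only.

def resolve_path(record, path):
    """Walk a dot-separated path through nested dictionaries."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # path does not exist in this record
        value = value[key]
    return value

row = {"company": {"name": "Acme", "website_url": "https://acme.example"}}
print(resolve_path(row, "company.website_url"))  # https://acme.example
print(resolve_path(row, "missing.field"))        # None
```

If a record is missing the field, the URL simply isn't found for that row, which is why the batch settings below matter for large datasets.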

Process PDF Files

  • Field Name: processPdfs
  • Type: Toggle switch (On/Off)
  • Default Value: Off
  • Simple Description: Automatically extracts text content from PDF documents when URLs point to PDF files
  • When to Change This:
    • On: When your URLs include PDF documents, reports, or whitepapers
    • Off: When you only need to scrape regular web pages
  • Business Impact: Enables automatic processing of PDF content without manual downloads, saving hours of document handling time
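One simple way a PDF URL can be recognized ahead of time is by the extension on the URL path. The sketch below shows that idea with the standard library; it is an assumption about the mechanism, and the node may also inspect the response's Content-Type header.

```python
# Hedged sketch: recognize PDF URLs by path extension, ignoring
# query strings. The node's real detection logic is internal.
from urllib.parse import urlparse

def looks_like_pdf(url):
    """Return True if the URL path ends in .pdf (case-insensitive)."""
    return urlparse(url).path.lower().endswith(".pdf")

print(looks_like_pdf("https://example.com/reports/q3.pdf?download=1"))  # True
print(looks_like_pdf("https://example.com/reports/q3"))                  # False
```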

Batch Settings Section

Batch Processing Mode

  • Field Name: batchOption
  • Type: Dropdown menu with options:
    • None: Process URLs individually without special error handling
    • Iterate to next (if response error): Skip failed URLs and continue with remaining ones
    • Iterate to next (always): Process all URLs in sequence regardless of individual results
  • Default Value: None
  • Simple Description: Controls how the node handles multiple URLs and errors during processing
  • When to Change This:
    • Use "Iterate to next (if response error)" when processing large lists where some URLs might be broken
    • Use "Iterate to next (always)" for systematic processing of URL collections
  • Business Impact: Prevents entire workflows from stopping due to single URL failures, ensuring maximum data collection
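The three modes differ only in what happens when a single URL fails. The Python sketch below models that behavior with a stand-in `fetch` function; the mode names and the `scrape_batch` helper are illustrative, since the node handles all of this internally.

```python
# Illustrative model of the three batch modes' error semantics.

def scrape_batch(urls, fetch, mode="none"):
    """mode: 'none' (fail fast), 'iterate_on_error' (skip failures),
    'iterate_always' (record every outcome, errors included)."""
    results = []
    for url in urls:
        try:
            results.append((url, fetch(url)))
        except Exception as err:
            if mode == "none":
                raise                                   # one bad URL stops the run
            if mode == "iterate_on_error":
                continue                                # skip and move on
            results.append((url, f"error: {err}"))      # iterate_always
    return results

def fake_fetch(url):
    if "broken" in url:
        raise ValueError("404")
    return "content"

urls = ["https://a.example", "https://broken.example", "https://b.example"]
print(scrape_batch(urls, fake_fetch, mode="iterate_on_error"))
```

With "iterate_on_error" the broken URL is dropped and two results come back; with "iterate_always" all three rows are returned, the failure recorded as an error entry.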

Scrape Operation Section

Scraping Engine

  • Field Name: engineId
  • Type: Dropdown menu with options:
    • HTTP Client: Fast, lightweight scraping for standard web pages
    • Chrome Browser (Selenium): Full browser automation for JavaScript-heavy sites
  • Default Value: HTTP Client
  • Simple Description: Chooses the method used to access and scrape web pages
  • When to Change This:
    • Use HTTP Client for news sites, blogs, product pages, and static content
    • Use Chrome Browser for social media, SPAs, e-commerce sites with dynamic loading
  • Business Impact: Chrome Browser can capture content that HTTP Client misses on modern JavaScript-driven sites, but it is slower and uses more resources

HTTP Client Options (when HTTP Client is selected)

Content Extraction Method

  • Field Name: scrapeOp
  • Type: Dropdown menu with options:
    • Extract Text (Remove HTML): Clean text content only, no formatting
    • Extract Text (Keep HTML): Preserves HTML tags and structure
  • Default Value: Extract Text (Remove HTML)
  • Simple Description: Determines whether to keep HTML formatting in scraped content
  • When to Change This:
    • Use "Remove HTML" for analysis, AI processing, or clean data storage
    • Use "Keep HTML" when you need links, formatting, or page structure
  • Business Impact: Clean text takes far less storage and is easier for AI models and analysis tools to work with

Apply OCR on Images

  • Field Name: applyOcrOnImages
  • Type: Toggle switch (On/Off)
  • Default Value: Off
  • Simple Description: Automatically extracts text from images found on web pages
  • When to Change This:
    • On: When scraping sites with text in images, infographics, or scanned documents
    • Off: For faster processing when images don't contain relevant text
  • Business Impact: Captures text that would otherwise be lost on image-heavy websites, but increases processing time

Remove JavaScript and CSS

  • Field Name: stripJsCss
  • Type: Toggle switch (On/Off)
  • Default Value: On
  • Simple Description: Removes code and styling elements to focus on actual content
  • When to Change This:
    • On: For cleaner content extraction and faster processing
    • Off: When you need to preserve all page elements
  • Business Impact: Substantially reduces extracted content size and improves text quality
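This kind of extraction can be done with nothing more than an HTML parser. Below is a minimal standard-library sketch, not the node's actual implementation, that mimics "Extract Text (Remove HTML)" with "Remove JavaScript and CSS" turned on: tags are dropped entirely, and the contents of `<script>`/`<style>` blocks are skipped.

```python
# Minimal stdlib sketch of tag removal plus script/style stripping.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # nonzero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

html = """<html><head><style>h1{color:red}</style></head>
<body><h1>Price: $19</h1><script>track()</script><p>In stock</p></body></html>"""
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))  # Price: $19 In stock
```

The style rule and the tracking script never reach the output, which is exactly the "cleaner content extraction" behavior described above.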

Chrome Browser Options (when Chrome Browser is selected)

Wait Condition

  • Field Name: webDriverWaitId
  • Type: Dropdown menu with options:
    • None: Start scraping immediately after page loads
    • Presence of element located: Wait until specific element appears on page
    • Visibility of element located: Wait until element becomes visible
    • Element to be clickable: Wait until element can be interacted with
  • Default Value: None
  • Simple Description: Tells the browser what to wait for before starting to scrape content
  • When to Change This: Use specific wait conditions for sites that load content dynamically or require user interaction
  • Business Impact: Proper wait conditions ensure dynamic content has actually loaded, so you capture far more of the page than an immediate scrape would

Maximum Wait Time

  • Field Name: webDriverWaitSeconds
  • Type: Number input
  • Default Value: 10
  • Valid Range: 1 to 300 seconds
  • Simple Description: How long to wait for the wait condition before giving up
  • When to Change This:
    • Use 5-15 seconds for most websites
    • Use 30+ seconds for very slow-loading sites
    • Use 60+ seconds for complex applications
  • Business Impact: Longer waits capture more content but slow down processing; optimize based on your specific websites
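All of the wait conditions follow the same pattern: poll for a condition until it holds or the maximum wait time elapses. The node uses Selenium's explicit waits for this; the `wait_until` helper below is a simplified pure-Python stand-in to show the timeout trade-off.

```python
# Generic sketch of the wait-and-timeout pattern behind Wait Condition
# and Maximum Wait Time. Illustrative only; not the node's internals.
import time

def wait_until(condition, timeout_seconds=10, poll_interval=0.5):
    """Return True once condition() is truthy, False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    return False

# Example: a "page element" that appears after a short delay.
start = time.monotonic()
appeared = wait_until(lambda: time.monotonic() - start > 0.2,
                      timeout_seconds=2, poll_interval=0.05)
print(appeared)  # True
```

A condition that never holds simply burns the full timeout before returning, which is why overly long waits on broken pages slow a batch down.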

Wait For Element

  • Field Name: webDriverWaitSelector
  • Type: Smart text field
  • Default Value: Empty
  • Simple Description: CSS selector identifying which page element to wait for
  • When to Change This: Enter selectors like ".content", "#main-article", or ".product-details" based on the websites you're scraping
  • Business Impact: Accurate selectors ensure content is fully loaded before scraping, preventing incomplete data collection

JavaScript Execution Mode

  • Field Name: webDriverJavaScriptAsync
  • Type: Toggle switch (On/Off)
  • Default Value: Off
  • Simple Description: Controls whether custom JavaScript runs synchronously or asynchronously
  • When to Change This:
    • Off: For simple JavaScript that runs quickly
    • On: For complex operations that need time to complete
  • Business Impact: Async mode handles complex scraping scenarios but requires more advanced JavaScript knowledge
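The two modes mirror Selenium's `execute_script` (the script's return value is the result) and `execute_async_script` (the script receives a callback as its final argument and must call it to deliver a result). The plain-Python model below illustrates that contract; `run_sync` and `run_async` are hypothetical names, not part of the node.

```python
# Plain-Python model of sync vs. async script execution contracts.

def run_sync(script):
    """Sync mode: the script's return value is the result."""
    return script()

def run_async(script):
    """Async mode: the script receives a 'done' callback and must call
    it to deliver its result (mirrors execute_async_script's contract)."""
    result = {}
    script(lambda value: result.setdefault("value", value))
    if "value" not in result:
        raise RuntimeError("script never called the 'done' callback")
    return result["value"]

# Sync: quick DOM reads return immediately.
print(run_sync(lambda: "page title"))
# Async: the script decides when it is finished.
print(run_async(lambda done: done("loaded after AJAX")))
```

This is why async mode suits scripts that wait on AJAX or animations: the browser holds the result until your code explicitly signals completion, and forgetting to call the callback produces a timeout rather than a result.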

Output Section

Output Format

  • Field Name: outTransformId
  • Type: Dropdown menu with options:
    • Original with appended result column: Keeps all original data and adds scraped content as new column
    • Return result column only: Returns only the scraped content, discarding original data
  • Default Value: Original with appended result column
  • Simple Description: Determines what data your workflow receives after scraping
  • When to Change This:
    • Use "Original with appended" to maintain context and original data
    • Use "Result column only" when you only need the scraped content
  • Business Impact: Appended results preserve data relationships, while result-only output minimizes data volume
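Illustratively, the two formats transform a batch of rows like this (the field names and helper functions here are examples only, not the node's API):

```python
# Sketch of the two output formats applied to a batch of rows.

def append_result(rows, results, column="url_scrape_result"):
    """'Original with appended result column': keep input, add one field."""
    return [{**row, column: res} for row, res in zip(rows, results)]

def result_only(results, column="url_scrape_result"):
    """'Return result column only': discard the original fields."""
    return [{column: res} for res in results]

rows = [{"product": "Widget", "competitor_urls": "https://rival.example/w"}]
scraped = ["Widget Pro - $24.99"]
print(append_result(rows, scraped))
print(result_only(scraped))
```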

Result Property Name

  • Field Name: outColumnName
  • Type: Text field
  • Default Value: "url_scrape_result"
  • Simple Description: Name of the new column that will contain your scraped content
  • When to Change This: Use descriptive names like "product_description", "article_content", or "competitor_pricing"
  • Business Impact: Clear column names make data analysis and workflow debugging much easier

Real-World Use Cases

E-commerce Competitive Analysis

Business Situation: An online retailer wants to monitor competitor pricing and product descriptions across 500+ products daily.

What You'll Configure:

  • Set Property Path to "competitor_urls" (your spreadsheet column with product URLs)
  • Choose "HTTP Client" engine for fast processing
  • Select "Extract Text (Remove HTML)" for clean price data
  • Enable "Iterate to next (if response error)" to handle unavailable products
  • Set Result Property Name to "competitor_data"

What Happens: The workflow visits each competitor URL, extracts pricing and product information, and creates a daily competitive intelligence report.

Business Value: Eliminates many hours per week of manual price checking and enables dynamic pricing strategies that can improve profit margins.

Content Research Automation

Business Situation: A marketing agency needs to gather article content from industry blogs and news sites for client research reports.

What You'll Configure:

  • Set Property Path to "article_links"
  • Choose "Chrome Browser" engine for modern news sites
  • Select "Presence of element located" wait condition
  • Enter ".article-content" in Wait For Element field
  • Choose "Original with appended result column" to keep source URLs
  • Set Result Property Name to "article_text"

What Happens: The workflow loads each article URL, waits for content to fully load, then extracts the main article text while preserving the source information.

Business Value: Dramatically reduces research time and ensures comprehensive coverage of industry trends for client reports.

Lead Generation from Business Directories

Business Situation: A B2B sales team wants to extract contact information and company descriptions from business directory listings.

What You'll Configure:

  • Set Property Path to "directory_urls"
  • Choose "Chrome Browser" engine for dynamic directory sites
  • Select "Visibility of element located" wait condition
  • Enter ".company-details" in Wait For Element field
  • Enable "Apply OCR on Images" to capture contact info in images
  • Set Result Property Name to "company_info"

What Happens: The workflow visits each directory listing, waits for company details to load, extracts text content including any text in images, and compiles comprehensive company profiles.

Business Value: Generates substantially more qualified leads per week and improves sales team efficiency by providing detailed prospect information automatically.

Document Processing Workflow

Business Situation: A legal firm needs to extract key information from PDF contracts and documents stored on their client portal.

What You'll Configure:

  • Set Property Path to "document_urls"
  • Enable "Process PDF files" toggle
  • Choose "HTTP Client" engine for direct PDF access
  • Select "Return result column only" to focus on document content
  • Set Result Property Name to "contract_text"

What Happens: The workflow accesses each PDF URL, extracts all text content from the documents, and provides clean text for further analysis or AI processing.

Business Value: Processes documents many times faster than manual review, enabling faster contract analysis and reducing legal research costs.

Step-by-Step Configuration

Setting Up Basic Web Scraping

  1. Add the Node:

    • Drag the URL Scraper node from the left panel onto your workflow canvas
    • Connect it to your data source node using the arrow connector
  2. Configure URL Source:

    • Click on the URL Scraper node to open the settings panel
    • In the "URL Source" section, click the "Property Path" field
    • Select the column containing your URLs from the dropdown suggestions
    • If processing PDFs, turn on the "Process PDF Files" toggle
  3. Set Up Batch Processing:

    • In the "Batch Settings" section, choose your processing mode:
      • Select "None" for single URL processing
      • Select "Iterate to next (if response error)" for robust batch processing
      • Select "Iterate to next (always)" for systematic processing
  4. Configure Scraping Method:

    • In the "Scrape Operation" section, choose your engine:
      • Select "HTTP Client" for fast, standard web page scraping
      • Select "Chrome Browser" for JavaScript-heavy or dynamic sites
  5. Set Output Preferences:

    • In the "Output" section, choose your output format
    • Enter a descriptive name in "Result Property Name"
    • Click "Save Configuration"

Advanced Chrome Browser Setup

  1. Enable Browser Mode:

    • In "Scrape Operation", select "Chrome Browser (Selenium)"
    • Choose appropriate wait condition from dropdown
    • Set maximum wait time (10-30 seconds recommended)
  2. Configure Wait Conditions:

    • If using element-based waits, enter CSS selector in "Wait For Element"
    • Test selectors using browser developer tools first
    • Common selectors: ".content", "#main", ".article-body"
  3. Add Custom JavaScript (Optional):

    • Scroll to JavaScript section
    • Replace default code with your custom extraction logic
    • Enable "Async" toggle if your code needs time to complete
    • Test thoroughly before deploying
  4. Test Your Configuration:

    • Use the "Test Configuration" button
    • Enter sample URLs in the test panel
    • Verify extracted content appears correctly
    • Adjust settings based on test results

Industry Applications

Digital Marketing Agencies

Common Challenge: Manually tracking competitor content, pricing, and marketing campaigns across dozens of client industries.

How This Node Helps: Automatically scrapes competitor websites, social media pages, and marketing materials to create comprehensive competitive intelligence reports.

Configuration Recommendations:

  • Use "Chrome Browser" engine for social media and dynamic sites
  • Enable "Apply OCR on Images" for social media graphics
  • Set "Iterate to next (if response error)" for reliable batch processing
  • Use descriptive result column names like "competitor_content"

Results: Agencies save many hours per week on competitive research and deliver more comprehensive client reports, supporting stronger client retention.

Real Estate Companies

Common Challenge: Gathering property details, pricing, and descriptions from multiple listing services and competitor websites.

How This Node Helps: Automatically extracts property information, photos, and market data from real estate websites and MLS systems.

Configuration Recommendations:

  • Use "Chrome Browser" for MLS and dynamic property sites
  • Enable "Process PDF files" for property documents
  • Set wait conditions for "Visibility of element located" with ".property-details"
  • Choose "Original with appended result column" to maintain property IDs

Results: Real estate teams process far more property listings per day and identify market opportunities much faster than manual research allows.

Financial Services

Common Challenge: Monitoring regulatory updates, market news, and competitor announcements across hundreds of financial websites.

How This Node Helps: Automatically scrapes financial news sites, regulatory portals, and competitor press releases for compliance and market intelligence.

Configuration Recommendations:

  • Use "HTTP Client" for news sites and regulatory portals
  • Select "Extract Text (Remove HTML)" for clean analysis
  • Enable "Iterate to next (if response error)" for reliable news monitoring
  • Set Result Property Name to "financial_news_content"

Results: Compliance teams catch far more regulatory updates and identify market trends weeks earlier than manual monitoring allows.

Healthcare Organizations

Common Challenge: Researching medical literature, treatment protocols, and pharmaceutical information from various online sources.

How This Node Helps: Automatically extracts content from medical journals, research databases, and pharmaceutical websites for clinical research.

Configuration Recommendations:

  • Use "Chrome Browser" for research databases requiring authentication
  • Enable "Process PDF files" for research papers
  • Set longer wait times (30+ seconds) for database queries
  • Choose "Original with appended result column" to maintain source citations

Results: Medical researchers surface many more relevant studies per week and reduce literature review time from days to hours.

Best Practices

Performance Optimization

  • Use HTTP Client engine when possible for faster processing
  • Set appropriate wait times - longer isn't always better
  • Process URLs in smaller batches (100-500 at a time) for better reliability
  • Use specific CSS selectors to reduce wait times
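The batching suggestion above can be handled upstream of the workflow with a few lines of Python; the 200-row batch size below is just an example.

```python
# Pre-chunk a large URL list into smaller batches before feeding it
# to the workflow. Batch size is illustrative.

def chunked(items, size=200):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

urls = [f"https://example.com/page/{i}" for i in range(450)]
batches = list(chunked(urls, size=200))
print([len(batch) for batch in batches])  # [200, 200, 50]
```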

Error Handling

  • Always enable "Iterate to next (if response error)" for batch processing
  • Test with sample URLs before processing large datasets
  • Monitor workflow logs for common failure patterns
  • Keep backup copies of original URL lists

Data Quality

  • Use "Extract Text (Remove HTML)" for AI analysis and data processing
  • Enable OCR only when necessary to balance speed and completeness
  • Choose descriptive result column names for easier data management
  • Regularly validate scraped content quality with spot checks

Compliance and Ethics

  • Respect website robots.txt files