Web Scrape Component
Use the Web Scrape component to extract content from one or more webpages. This powerful tool can parse HTML, extract clean text, convert content to Markdown or JSON, and even handle modern websites that rely heavily on JavaScript.
Why this matters
Webpages are one of the richest sources of live information for your flows, but raw HTML is noisy. This component turns real pages into clean, structured text your agents can actually use.
What You’ll Configure
- Configure Scraper Settings
- Provide Target URLs
- Handle the Results
- Usage and Abuse Policy
- Best Practices
- Troubleshooting Tips
- What to Try Next
Step 1: Configure Scraper Settings
Fine-tune the scraper's behavior to match the structure of your target websites and the format you need.
Setting | Description |
---|---|
Format | Choose how the scraped content should be structured. Options include Markdown (great for LLMs), Text (clean text extraction), Raw HTML, or structured JSON. You can also choose to exclude images or links from Markdown output. |
Bypass Anti-Bot Protection | Enable this if the target website uses protections such as Cloudflare or CAPTCHA-based systems to block scrapers. |
JavaScript Rendering | Enable this for Single Page Applications (SPAs) or any site that loads its content dynamically using JavaScript. The component will render the page in a headless browser before extracting content. |
Auto Scroll | Enable this for pages that use infinite scrolling to load more content as the user scrolls down. The component will automatically scroll to the bottom to ensure all content is loaded before extraction. |
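To make the last two settings concrete, here is a rough sketch of what headless rendering with auto-scroll involves, written with the open-source Playwright library. This is an illustration of the technique, not the component's actual implementation; when you enable the toggles, the component does all of this for you.

```python
# Illustrative sketch of headless rendering plus auto-scroll.
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def render_with_scroll(url: str, max_scrolls: int = 10) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let client-side JS finish

        # Auto Scroll: keep scrolling until the page height stops growing.
        last_height = 0
        for _ in range(max_scrolls):
            page.mouse.wheel(0, 10_000)
            page.wait_for_timeout(1_000)  # give lazy-loaded content time
            height = page.evaluate("document.body.scrollHeight")
            if height == last_height:
                break
            last_height = height

        html = page.content()  # fully rendered DOM, not the bare server HTML
        browser.close()
        return html
```

The height check is the key idea: scrolling stops once the page stops growing, which is how an infinite-scroll feed signals it is exhausted.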
Which Format Should I Choose?
Markdown is usually the best choice when the content will be fed to an LLM, since it preserves headings and links in a compact form. Pick Text when you only need clean prose, Raw HTML when you plan to parse the markup yourself, and JSON when a downstream step expects structured data.
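To get a feel for the difference between the Text and Markdown formats, the sketch below converts the same HTML snippet both ways using the open-source beautifulsoup4 and html2text packages. These libraries are stand-ins for illustration; the component performs the conversion internally.

```python
# Illustrative only: the component does this conversion for you.
# Requires: pip install beautifulsoup4 html2text
from bs4 import BeautifulSoup
import html2text

raw_html = """
<article>
  <h1>Release Notes</h1>
  <p>Version 2.0 adds <a href="/docs">new docs</a> and <img src="logo.png" alt="logo">.</p>
</article>
"""

# "Text" format: strip all markup, keep readable prose.
text = BeautifulSoup(raw_html, "html.parser").get_text(separator="\n", strip=True)
print(text)

# "Markdown" format: keep structure (headings, links) as Markdown.
converter = html2text.HTML2Text()
converter.ignore_links = False   # set True to mirror the "exclude links" option
converter.ignore_images = True   # mirrors the "exclude images" option
print(converter.handle(raw_html))
```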
Step 2: Provide Target URLs
The component needs to know which webpage(s) to scrape.
Input | Description |
---|---|
URLs | The default input. It accepts a single URL string, a comma-separated list of URLs, or an array of URL strings. |
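Conceptually, all three shapes collapse to one list of URLs. A minimal sketch of that normalization, assuming nothing about the component's internal code:

```python
# Sketch: normalize the three accepted input shapes into a list of URLs.
# (Assumed behavior based on the input description, not the component's code.)
def normalize_urls(value: str | list[str]) -> list[str]:
    if isinstance(value, list):
        urls = value
    else:
        urls = value.split(",")  # handles a single URL and a comma-separated list
    return [u.strip() for u in urls if u.strip()]

print(normalize_urls("https://a.com, https://b.com"))
# ['https://a.com', 'https://b.com']
print(normalize_urls(["https://a.com"]))
# ['https://a.com']
```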
Step 3: Handle the Results
The component provides two outputs to separate successful scrapes from failures.
Output | Description |
---|---|
Results | The primary output containing the scraped data from the successfully processed webpages. This is typically an array of objects, one for each successful URL. |
Failed URLs | An array containing any URLs that could not be successfully processed due to errors (e.g., 404 Not Found, timeouts, or being blocked). |
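A downstream step might consume the two outputs like this. Note that the exact fields on each result object (`url`, `content`) are assumptions for illustration; check the actual output of your flow.

```python
# Sketch of consuming the two outputs. The shape of each result object
# (url/content fields) is an assumption, not documented behavior.
results = [
    {"url": "https://a.com", "content": "# Page A\n..."},
    {"url": "https://b.com", "content": "# Page B\n..."},
]
failed_urls = ["https://c.com"]  # e.g. 404, timeout, or blocked

for item in results:
    print(f"Scraped {item['url']}: {len(item['content'])} characters")

if failed_urls:
    # Surface failures instead of silently dropping them,
    # e.g. log them or retry with different settings.
    print(f"{len(failed_urls)} URL(s) failed: {', '.join(failed_urls)}")
```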
Step 4: Usage and Abuse Policy
Prohibited Usage
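The platform defines its own list of prohibited uses; independent of those, a widely expected courtesy is honoring a site's robots.txt. A minimal check using only the Python standard library:

```python
# General responsible-scraping habit (not the platform's policy text):
# check a site's robots.txt before fetching, using only the standard library.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    parts = urlsplit(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # downloads and parses robots.txt
    return parser.can_fetch(user_agent, url)

if allowed_by_robots("https://example.com/page"):
    print("OK to fetch")
else:
    print("Disallowed by robots.txt; pick another source")
```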
Best Practices
- Start with Simple Extraction: Begin by scraping a single page with the simple `Text` format to confirm the basic connection works before adding advanced settings.
- Enable JS Rendering for Modern Sites: Most modern websites are built with frameworks like React, Vue, or Angular. You will almost always need to enable `JavaScript Rendering` to get the full content. A sketch of this start-simple, escalate-when-needed approach follows this list.
- Use `Auto Scroll` for Social Media and Blogs: Feeds on sites like Twitter, LinkedIn, or long blog pages often use infinite scroll. Enable `Auto Scroll` to capture all the content.
- Process Results with a Loop: If you scrape multiple URLs, connect the `Results` output to a ForEach Loop to process each scraped page individually.
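As a concrete version of the first two tips, this sketch fetches a page the cheap way first and flags when the result looks like an empty JavaScript shell, so you know to re-run with `JavaScript Rendering` enabled. The heuristic is an assumption for illustration, not the component's logic.

```python
# Sketch of the "start simple, escalate" tip. Requires: pip install requests
import requests

def needs_js_rendering(html: str) -> bool:
    # Crude, assumed heuristic: a near-empty body or a bare SPA mount
    # point usually means the real content is built client-side.
    return len(html) < 2_000 or 'id="root"></div>' in html

def scrape_simple(url: str) -> str | None:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    if needs_js_rendering(response.text):
        return None  # signal: re-run with JavaScript Rendering enabled
    return response.text
```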
Troubleshooting Tips
If your scraping fails, work back through the settings from Step 1:
- The URL lands in `Failed URLs`: confirm it is reachable in a regular browser and is not returning a 404 or timing out.
- The scrape succeeds but the content is empty or incomplete: the site probably builds its page with JavaScript, so enable `JavaScript Rendering`.
- Only the top of a long feed is captured: enable `Auto Scroll` so lazy-loaded content appears before extraction.
- Requests are blocked or return CAPTCHA pages: enable `Bypass Anti-Bot Protection`.
What to Try Next
- Scrape an article and feed the `Results` into a GenAI LLM Component to automatically generate a summary.
- Create a research agent that takes a topic, uses Web Search to find a list of relevant URLs, and then uses Web Scrape to extract the content from each one.
- Build a knowledge base ingestion pipeline: scrape a list of your company's blog posts and use RAG Remember to add the content of each post to your agent's memory (see the chunking sketch after this list).
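Most memory stores work better with small, overlapping chunks than with whole pages. RAG Remember may handle this for you; the sketch below is only an assumed, minimal word-based chunker to illustrate the idea.

```python
# Minimal sketch: split scraped text into overlapping word chunks
# before storing it in a knowledge base. Sizes are arbitrary examples.
def chunk_words(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

page = "word " * 1000  # stand-in for a scraped blog post
for i, chunk in enumerate(chunk_words(page)):
    print(i, len(chunk.split()))  # 300, 300, 300, 250 words
```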