Skip to main content

Content Extractor Agent

The Content Extractor Agent allows you to rapidly extract and organize text content from any website, transforming unstructured web pages into clean, organized Markdown format.

You provide a URL, and the agent will:

  • Scrape the target URL for all visible text content
  • Automatically clean up HTML, ads, and navigation elements
  • Convert the extracted content into a structured Markdown output
  • Optionally process multiple URLs in a single batch
Why use this agent?

This agent is essential for analysts, researchers, and content teams who need to gather large volumes of clean web data quickly for studies, competitive analysis, or content repurposing, bypassing manual copy-pasting and formatting.

Use Cases

  • Research & Analysis Researchers, journalists, and analysts can quickly extract and compile text content from multiple websites for market research, competitive analysis, or academic studies. The clean markdown format makes it easy to organize and analyze large volumes of web content without manual copying and formatting.

  • Content Curation & Repurposing Content marketers and bloggers can gather text from industry websites, news articles, or competitor pages to inform their own content creation. The extracted content serves as source material for creating original articles, newsletters, or social media posts while ensuring proper research and citation.

  • Documentation & Archiving Businesses and organizations can preserve important web content before it changes or disappears. The agent helps create text backups of policies, announcements, product descriptions, or knowledge base articles for internal documentation and compliance purposes.

Pro tip

Use this agent in conjunction with a GenAILLM component to instantly summarize the extracted text or translate it into another language, speeding up your research pipeline.

Testing the Agent

Step 1: Access the Agent

Content Extractor Agent template in SmythOS Templates section
  • Go to the Templates section in the sidebar
  • Navigate to the Marketing Tab
  • Find the Content Extractor Agent and click Remix
Content Extractor Agent template Remix in SmythOS
  • The full workflow will open in Agent Studio

Step 2: Run the Agent

You can test the agent in two ways:

Option 1: From the top toolbar

Testing Content Extractor Agent from Form Preview
  1. Click Test (top-right)
  2. Switch to Form Preview
  3. Fill in the input fields
  4. Click Run to execute

Option 2: Form Preview from the Canvas

  1. Click the Form Preview button on the APIEndpoint block labeled ExtractTextContent
  2. Enter the input field details. For example:
    • url: https://www.smythos.com
  3. Click Run to test and view the results.

Deploying the Agent

Step 1: Start Deployment

Deploy button in SmythOS Agent Studio
  • Click Deploy (top-right corner of Agent Studio)
  • Pick your environment:
    • Agent Cloud (SmythOS-hosted, recommended)
    • Enterprise (self-managed, secure)
    • Local Runtime (for development and offline use)

Step 2: Choose Your Deployment Type

Pick how users will interact with your agent.

  • Custom GPT – Add instructions, behaviors, or tools
  • Chatbot – Deploy as a chat interface
  • LLM – Connect to large language models with API keys
  • API – Call your agent programmatically
  • MCP – Use Model Context Protocol for structured workflows
  • Alexa – Launch as a voice assistant skill

You can find detailed guides to them by reading the Deploy Your Agent As... page.

Best option

For research and data gathering, deploy to API to integrate the agent's scraping capabilities into your data warehouse, content management system, or custom backend application.

Customization Tips

Example customization

To gather financial data, customize the workflow by adding a Code component after the WebScrape to extract specific numeric patterns (like stock prices or revenue figures) and clean the data before final output.

  • Web Scraping Settings – Enable or disable antiScrapingProtection, javascriptRendering, and autoScroll based on target websites
  • Output Format – Change WebScrape format from markdown to raw, markdown:no_images, or markdown:no_links
  • Error Handling – Add custom outputs to WebScrape component to access FailedURLs for better error reporting
  • Content Processing – Add GenAILLM component to summarize or clean extracted content before output
  • Multiple URLs – Modify input to accept URL arrays for batch processing multiple pages at once
  • Authentication – Add headers to WebScrape for accessing protected content with login credentials
  • Content Filtering – Insert Classifier component to categorize extracted content by type or topic
  • API Output Format – Switch from full to minimal or raw format based on integration needs
  • Rate Limiting – Add FSleep component between requests when processing multiple URLs
  • Custom Parsing – Add Code component to extract specific elements like emails, phone numbers, or dates