Sitemap Crawling in Data Spaces

Data Spaces let you automatically ingest website content from sitemap URLs without writing any custom crawler logic. Agents can then run semantic search across the site's content, with page discovery driven by standard sitemap formats.

Why use sitemap crawling?

This built-in feature eliminates manual URL collection. Simply provide a sitemap URL and let SmythOS handle discovery, crawling, and indexing.

How it works

When you add a sitemap URL to a Data Space, SmythOS:

  1. Fetches the sitemap XML from the specified URL
  2. Parses all listed URLs (including nested sitemaps)
  3. Crawls each page and extracts relevant content
  4. Indexes that content into the selected Data Space
  5. Makes it searchable using RAG Search

No manual upload required

Sitemaps must be available at an HTTP or HTTPS address. File uploads of sitemaps are not supported.
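
Conceptually, the discovery step (1 and 2 above) works like a standard sitemap parser. The sketch below is not SmythOS code; it is a minimal Python illustration of fetching a sitemap and collecting its URLs, recursing into nested index files (gzip handling and error checks omitted):

```python
# Minimal sketch of sitemap discovery -- illustrative only, not SmythOS internals.
import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def discover_urls(sitemap_url: str) -> list[str]:
    """Return every page URL listed in a sitemap, recursing into index files."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    if root.tag.endswith("sitemapindex"):
        # An index file lists child sitemaps; gather URLs from each one.
        urls: list[str] = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(discover_urls(loc.text.strip()))
        return urls
    # A regular sitemap lists page URLs directly.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

print(discover_urls("https://example.com/sitemap.xml"))
```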

Setup Instructions

Step 1: Create a Data Space

Follow the standard instructions in the Data Spaces guide to create a new container.

Step 2: Add a sitemap URL

  1. Open your Data Space
  2. Click Add Data Source
  3. Choose URL as the source type
  4. Enter your sitemap URL (e.g., https://example.com/sitemap.xml)
  5. Save

SmythOS begins crawling and indexing immediately. You can monitor progress from the Data Pool dashboard.

Supported sitemap formats

SmythOS supports multiple types of sitemaps:

  • Standard: https://example.com/sitemap.xml, .xml.gz, or sitemap index files
  • News: https://example.com/news-sitemap.xml
  • Images: https://example.com/image-sitemap.xml
  • Videos: https://example.com/video-sitemap.xml
  • Mobile: https://example.com/mobile-sitemap.xml

Sitemap index files are fully supported. SmythOS will crawl each child sitemap listed within them.
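
For reference, a sitemap index follows the standard sitemaps.org format: a list of <loc> entries, each pointing at a child sitemap. SmythOS crawls every child listed (the URLs below are examples):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```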

Only publicly accessible URLs

Private, password-protected, or file-based sitemap uploads are not supported. Sitemap URLs must be accessible over the public internet.

Working with crawled data

Once indexing is complete, the content is available to agents via these components:

RAG Search(/docs/agent-studio/components/rag-data/rag-search)

Use this to retrieve semantically relevant results from the crawled sitemap content. Configure it to query the correct Data Space, adjust the result count, and toggle metadata inclusion.

RAG Remember(/docs/agent-studio/components/rag-data/rag-remember)

Use this to add supplementary content, such as manually written summaries or notes, to a sitemap-based Data Space. This is useful when the crawled pages need extra context.

RAG Forget(/docs/agent-studio/components/rag-data/rag-forget)

Only manually added content can be removed using this component. Sitemap-crawled content is persistent and cannot be deleted through workflows.

Combine crawled and manual data

You can add custom inputs to enhance sitemap-based knowledge without altering the original crawled material.

Organising your Data Spaces

Single domain

Use one Data Space for the entire domain and add its sitemap. Best for simple websites or small teams.

Multi-domain

Use a separate Data Space for each domain. This keeps content cleanly separated and improves targeted search.

By content type

Segment by purpose (e.g., product pages vs. blog posts) using content-specific sitemaps.

Use consistent naming

Name each Data Space clearly to simplify long-term management. For example: product-sitemap, blog-sitemap, support-content.

Advanced: Custom Storage

For large sitemaps, enable Custom Storage using Pinecone. This lets you scale indexing, run advanced vector searches, and handle datasets with tens of thousands of pages.

Feature         | SmythOS Default  | Custom Storage
Setup           | No config needed | Pinecone required
Scalability     | Medium           | High
Advanced search | Basic            | Vector ops
Manual tuning   | Limited          | Full API control

Large sitemaps

If your sitemap includes more than 50,000 pages (the per-file limit in the sitemap protocol, which is why large sites publish sitemap index files), use Custom Storage for better performance and indexing reliability.
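
As an illustration of the Pinecone side, the snippet below creates a serverless index of the kind a large crawl would target. The index name, dimension, and region are placeholders, not values SmythOS prescribes; check your Custom Storage settings for what it expects:

```python
# Hypothetical Pinecone index setup for a large sitemap crawl (all values are placeholders).
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
    name="sitemap-content",  # placeholder index name
    dimension=1536,          # must match the embedding model's output size
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
```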

Best practices

  • Choose the main sitemap or index if available
  • Avoid sitemap URLs that require login
  • Split very large sitemaps into category-specific ones
  • Confirm the sitemap URL is publicly accessible and returns valid XML (a quick pre-flight check is sketched below)
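
A minimal Python sketch of that last check, assuming the requests library is installed:

```python
# Quick pre-flight check: is the sitemap public and well-formed XML?
import requests
import xml.etree.ElementTree as ET

def check_sitemap(url: str) -> None:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()              # fails on 401/403/404, i.e. not public
    root = ET.fromstring(resp.content)   # fails on invalid XML
    kind = "index" if root.tag.endswith("sitemapindex") else "sitemap"
    print(f"OK: {url} is a valid {kind}")

check_sitemap("https://example.com/sitemap.xml")
```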

Troubleshooting

Issue                  | Steps to Resolve
Sitemap not processing | Verify the URL is public and returns valid XML
Missing pages          | Wait for full indexing, or check the Data Space selection
Poor query results     | Add context via RAG Remember
Large sitemap slow     | Use Custom Storage or split into smaller Data Spaces

Using Sitemap Data in Agents

Agents can access crawled content through RAG workflows:
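
The exact wiring happens in Agent Studio, but the pattern looks like the sketch below. The two helpers are stand-ins for the RAG Search component and the agent's LLM step; they are not a real SmythOS API:

```python
# Hypothetical sketch of a RAG workflow over a sitemap-backed Data Space.
# rag_search() and generate_answer() are stand-ins, not a real SmythOS API.

def rag_search(data_space: str, query: str, top_k: int = 5) -> list[str]:
    """Stand-in for the RAG Search component: return relevant crawled passages."""
    return [f"(passage {i} from '{data_space}' matching '{query}')" for i in range(top_k)]

def generate_answer(question: str, context: list[str]) -> str:
    """Stand-in for the agent's LLM call, answering from retrieved context."""
    return f"Answer to '{question}' grounded in {len(context)} retrieved passages."

def answer_from_site(question: str, data_space: str) -> str:
    passages = rag_search(data_space, question)   # 1. retrieve relevant crawled content
    return generate_answer(question, passages)    # 2. answer from that content only

print(answer_from_site("What is your refund policy?", data_space="support-content"))
```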

For more, see how to connect Data Spaces to agents.

Multi-source agents

Combine sitemap data with file uploads, URLs, and custom notes in a single workflow.

What's Next?