How do AI crawlers work?
AI crawlers work fundamentally like classic web crawlers: they visit URLs, download the content and store it for further processing. The difference lies in purpose – while Googlebot prepares pages for the search index, AI crawlers collect content for two different goals:
- Training: Crawled content flows into the training data of language models. This happens in large batches, not in real time.
- Live search: For AI search engines like Perplexity, pages are crawled in real time to deliver current answers.
All reputable AI crawlers respect robots.txt, provided it is correctly configured. Note that AI crawlers generally have shorter timeouts than Googlebot and are less tolerant of technical errors.
Important: Anyone who blocks AI crawlers will generally not be cited in AI-generated answers. Anyone who allows them accepts that their content may be used for AI training. The decision lies with the website operator, and robots.txt lets them make it per crawler, although compliance is voluntary on the crawler's side.
High-relevance AI crawlers
These crawlers have the greatest influence on a website's AI visibility:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
ClaudeBot/1.0; +https://anthropic.com/claudebot
PerplexityBot/1.0; +https://perplexity.ai/perplexitybot
Google-Extended
OAI-SearchBot/1.0; +https://openai.com/searchbot
Meta-ExternalAgent/1.1; +https://llama.meta.com/llama-web-access/
Further AI crawlers
These crawlers have growing relevance and should not be ignored:
Applebot-Extended/1.0; +https://support.apple.com/en-us/111900
Amazonbot/0.1; +https://developer.amazon.com/amazonbot
YouBot; +https://about.you.com/youbot/
Bytespider; +https://zhanzhang.toutiao.com/
cohere-ai/1.0
ImagesiftBot; +imagesift.com
Classic search engine crawlers
These crawlers are primarily responsible for classic search, but increasingly relevant for AI functions too:
Googlebot/2.1; +http://www.google.com/bot.html
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
Complete reference table
| Crawler | Operator | robots.txt Name | Type | Relevance |
|---|---|---|---|---|
| GPTBot | OpenAI | GPTBot | Training + Live | ⭐⭐⭐⭐⭐ |
| OAI-SearchBot | OpenAI | OAI-SearchBot | Live search | ⭐⭐⭐⭐⭐ |
| ClaudeBot | Anthropic | ClaudeBot | Training + Live | ⭐⭐⭐⭐⭐ |
| PerplexityBot | Perplexity AI | PerplexityBot | Live search | ⭐⭐⭐⭐⭐ |
| Google-Extended | Google | Google-Extended | Training + Live | ⭐⭐⭐⭐⭐ |
| Googlebot | Google | Googlebot | Search + AI base | ⭐⭐⭐⭐⭐ |
| Meta-ExternalAgent | Meta | Meta-ExternalAgent | Training + Live | ⭐⭐⭐⭐ |
| Bingbot | Microsoft | bingbot | Search + Copilot | ⭐⭐⭐⭐ |
| Applebot-Extended | Apple | Applebot-Extended | Training + Live | ⭐⭐⭐⭐ |
| Amazonbot | Amazon | Amazonbot | Training | ⭐⭐⭐ |
| YouBot | You.com | YouBot | Live search | ⭐⭐⭐ |
| Bytespider | ByteDance | Bytespider | Training | ⭐⭐⭐ |
| cohere-ai | Cohere | cohere-ai | Training | ⭐⭐⭐ |
| YandexBot | Yandex | YandexBot | Search + AI | ⭐⭐ |
| DuckDuckBot | DuckDuckGo | DuckDuckBot | Search | ⭐⭐ |
robots.txt configuration
robots.txt enables granular control – each crawler can be individually allowed or blocked.
Allow all AI crawlers (recommended)
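A minimal sketch of such a configuration, assuming the robots.txt names from the reference table and a site that wants to open all paths to these crawlers. The list is illustrative, not exhaustive:

```txt
# Sketch only: crawler names taken from the reference table.
# Several User-agent lines may share one rule group.
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: Meta-ExternalAgent
User-agent: Applebot-Extended
User-agent: Amazonbot
Allow: /
```

If the file contains no Disallow rules at all, these crawlers are already allowed by default. The explicit group mainly matters when a general `User-agent: *` group restricts access, because a crawler follows the group that names it specifically.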
Block training, allow live search
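A minimal sketch of this split. Which crawlers count as training crawlers is an assumption here: the block list uses the crawlers the reference table marks as "Training", plus GPTBot, because OpenAI runs OAI-SearchBot separately for search. The grouping should be checked against each operator's current documentation:

```txt
# Assumed training crawlers: blocked site-wide.
User-agent: GPTBot
User-agent: Amazonbot
User-agent: Bytespider
User-agent: cohere-ai
Disallow: /

# Live-search crawlers: allowed.
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: YouBot
Allow: /
```

Crawlers the table lists as "Training + Live" (for example ClaudeBot or Meta-ExternalAgent) are left out of both groups on purpose. Blocking them would, by the table's classification, also cut off live use, so where to place them is a judgment call for each site.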
Warning: Anyone who blocks all AI crawlers will not appear in AI-generated answers – not on Perplexity, ChatGPT or Claude. This can lead to significant loss of visibility in the medium term.
Recommendation
For most websites: allow all AI crawlers and additionally create an llms.txt – as an optional step once the technical foundations are already in place.
Those concerned about training data can specifically block only training crawlers while continuing to allow live-search crawlers (PerplexityBot, OAI-SearchBot).
All AI crawlers configured correctly?
Use the free AI-Ready Check to verify whether your robots.txt is correctly configured and all relevant AI crawlers have access.