How do AI crawlers work?
AI crawlers work fundamentally like classic web crawlers: they visit URLs, download the content, and store it for further processing. The difference lies in the purpose: while Googlebot prepares pages for the search index, AI crawlers collect content for two different goals:
- Training: Crawled content flows into the training data of language models. This happens in large batches, not in real time.
- Live search: For AI search engines like Perplexity, pages are crawled in real time to deliver current answers.
All reputable AI crawlers respect robots.txt, but only when it is correctly configured. AI crawlers also generally have shorter timeouts than Googlebot and react more sensitively to technical errors.
Important: Anyone who blocks AI crawlers will not be cited in AI-generated answers. Anyone who allows them risks their content being used for AI training. The decision lies with the website operator, and robots.txt lets them make it per crawler, at least for crawlers that honour the file.
High-relevance AI crawlers
These crawlers have the greatest influence on a website's AI visibility:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
ClaudeBot/1.0; +https://anthropic.com/claudebot
PerplexityBot/1.0; +https://perplexity.ai/perplexitybot
Google-Extended
OAI-SearchBot/1.0; +https://openai.com/searchbot
Meta-ExternalAgent/1.1; +https://llama.meta.com/llama-web-access/
Further AI crawlers
These crawlers have growing relevance and should not be ignored:
Applebot-Extended/1.0; +https://support.apple.com/en-us/111900
Amazonbot/0.1; +https://developer.amazon.com/amazonbot
YouBot; +https://about.you.com/youbot/
Bytespider; +https://zhanzhang.toutiao.com/
cohere-ai/1.0
ImagesiftBot; +https://www.microsoft.com/en-us/bing/imagesiftbot
Classic search engine crawlers
These crawlers are primarily responsible for classic search, but increasingly relevant for AI functions too:
Googlebot/2.1; +http://www.google.com/bot.html
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
YandexGPT/1.0
DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
Baiduspider/2.0; +http://www.baidu.com/search/spider.html
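To see which of these crawlers actually visit a site, the User-Agent strings above can be matched against server log entries. The following is a minimal sketch, not a definitive detector: the function name `identify_crawler`, the grouping dictionary, and the substring matching are assumptions made for illustration. Google-Extended is left out because the list above gives only its robots.txt name, not a full User-Agent string.

```python
from typing import Optional, Tuple

# robots.txt tokens grouped the way this article groups them.
# Matching is by case-insensitive substring, which is an assumption of this
# sketch; a User-Agent header can be spoofed, so treat the result as a hint.
CRAWLER_GROUPS = {
    "high-relevance": ["GPTBot", "ClaudeBot", "PerplexityBot",
                       "OAI-SearchBot", "Meta-ExternalAgent"],
    "further": ["Applebot-Extended", "Amazonbot", "YouBot",
                "Bytespider", "cohere-ai", "ImagesiftBot"],
    "classic search": ["Googlebot", "bingbot", "YandexBot", "YandexGPT",
                       "DuckDuckBot", "Baiduspider"],
}


def identify_crawler(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (token, group) for a known crawler, or None for anything else."""
    ua = user_agent.lower()
    for group, tokens in CRAWLER_GROUPS.items():
        for token in tokens:
            if token.lower() in ua:
                return token, group
    return None


print(identify_crawler(
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
))  # ('bingbot', 'classic search')
print(identify_crawler(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0"
))  # None
```

Because the header is supplied by the client, a match only shows what a request claims to be, not that it really comes from that operator.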
Complete reference table
| Crawler | Operator | robots.txt Name | Type | Relevance |
|---|---|---|---|---|
| GPTBot | OpenAI | GPTBot | Training + Live | ⭐⭐⭐⭐⭐ |
| OAI-SearchBot | OpenAI | OAI-SearchBot | Live search | ⭐⭐⭐⭐⭐ |
| ClaudeBot | Anthropic | ClaudeBot | Training + Live | ⭐⭐⭐⭐⭐ |
| PerplexityBot | Perplexity AI | PerplexityBot | Live search | ⭐⭐⭐⭐⭐ |
| Google-Extended | Google | Google-Extended | Training + Live | ⭐⭐⭐⭐⭐ |
| Googlebot | Google | Googlebot | Search + AI base | ⭐⭐⭐⭐⭐ |
| Meta-ExternalAgent | Meta | Meta-ExternalAgent | Training + Live | ⭐⭐⭐⭐ |
| Bingbot | Microsoft | bingbot | Search + Copilot | ⭐⭐⭐⭐ |
| Applebot-Extended | Apple | Applebot-Extended | Training + Live | ⭐⭐⭐⭐ |
| Amazonbot | Amazon | Amazonbot | Training | ⭐⭐⭐ |
| YouBot | You.com | YouBot | Live search | ⭐⭐⭐ |
| Bytespider | ByteDance | Bytespider | Training | ⭐⭐⭐ |
| cohere-ai | Cohere | cohere-ai | Training | ⭐⭐⭐ |
| YandexBot | Yandex | YandexBot | Search + AI | ⭐⭐ |
| YandexGPT | Yandex | YandexGPT | Training + Live | ⭐⭐ |
| DuckDuckBot | DuckDuckGo | DuckDuckBot | Search | ⭐⭐ |
| Baiduspider | Baidu | Baiduspider | Search + AI | ⭐ |
robots.txt configuration
robots.txt enables granular control — each crawler can be individually allowed or blocked.
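Per-crawler control can be checked locally before a file goes live. The sketch below uses Python's standard `urllib.robotparser`. The sample rules, the `/private/` path, and the helper name `access_report` are assumptions for illustration, not a recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: one crawler blocked entirely, one kept out of a
# single directory, everyone else allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def access_report(robots_txt, tokens, url="/"):
    """Map each crawler token to True/False for fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


report = access_report(SAMPLE_ROBOTS_TXT, ["GPTBot", "Bytespider", "PerplexityBot"])
print(report)  # {'GPTBot': True, 'Bytespider': False, 'PerplexityBot': True}
```

Python's parser is only one implementation of the robots.txt rules. Individual crawlers may resolve overlapping Allow and Disallow lines differently, so the result is best read as a sanity check.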
Allow all AI crawlers (recommended)
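A minimal sketch of such a configuration follows. The robots.txt text is embedded in a Python string so that it can be verified with the standard `urllib.robotparser`. The selection of crawlers (the high-relevance group from this article) and the variable names are assumptions, and the list can be extended with any other robots.txt name from the reference table.

```python
from urllib.robotparser import RobotFileParser

# One group that names several crawlers and allows the whole site.
ALLOW_ALL_AI = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: Meta-ExternalAgent
Allow: /
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL_AI.splitlines())

ai_tokens = ["GPTBot", "OAI-SearchBot", "ClaudeBot",
             "PerplexityBot", "Google-Extended", "Meta-ExternalAgent"]
allowed = {token: parser.can_fetch(token, "/") for token in ai_tokens}
print(all(allowed.values()))  # True
```

A crawler that is not named in any group falls back to the `User-agent: *` group if one exists and is otherwise allowed by default, so the explicit group mainly matters when the wildcard rules are stricter.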
Block training, allow live search
For those who do not want their content used for AI training, but still want to appear in live search results:
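One way this split could look is sketched below, again as robots.txt text verified with Python's `urllib.robotparser`. Which crawlers count as training crawlers is an assumption here: the list takes GPTBot, ClaudeBot, Google-Extended, Amazonbot, Bytespider, and cohere-ai, and should be adjusted against each operator's current documentation.

```python
from urllib.robotparser import RobotFileParser

BLOCK_TRAINING_ALLOW_LIVE = """\
# Training crawlers (assumed list, adjust as needed)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Amazonbot
User-agent: Bytespider
User-agent: cohere-ai
Disallow: /

# Live-search crawlers
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_TRAINING_ALLOW_LIVE.splitlines())

for token in ["GPTBot", "Bytespider", "OAI-SearchBot", "PerplexityBot", "Googlebot"]:
    status = "allowed" if parser.can_fetch(token, "/") else "blocked"
    print(f"{token}: {status}")
```

Googlebot is not named in either group, so it stays allowed and classic search indexing is unaffected.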
Block all AI crawlers
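A blocking group can be generated from a list of robots.txt names rather than typed by hand. The sketch below assumes a hypothetical helper `build_block_group` and a token list taken from the reference table, then checks the result with Python's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# robots.txt names of the AI crawlers to block (assumed selection
# from the reference table; extend as needed).
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
    "Google-Extended", "Meta-ExternalAgent", "Applebot-Extended",
    "Amazonbot", "YouBot", "Bytespider", "cohere-ai",
]


def build_block_group(tokens):
    """Return one robots.txt group that disallows the whole site."""
    lines = [f"User-agent: {token}" for token in tokens]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


robots_txt = build_block_group(AI_TOKENS)
print(robots_txt)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
blocked = [token for token in AI_TOKENS if not parser.can_fetch(token, "/")]
print(len(blocked) == len(AI_TOKENS))  # True
```

Classic search crawlers such as Googlebot and bingbot are not in the list, so they remain allowed under this sketch.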
Warning: Anyone who blocks all AI crawlers will not appear in AI-generated answers — not on Perplexity, ChatGPT or Claude. This can lead to significant loss of visibility in the medium term.
Recommendation
For most websites: allow all AI crawlers and additionally create an llms.txt. This maximises AI visibility and gives crawlers the context they need for accurate answers.
Those concerned about training data can specifically block only training crawlers while continuing to allow live-search crawlers (PerplexityBot, OAI-SearchBot).
All AI crawlers configured correctly?
Use the free AI-Ready Check to verify whether your robots.txt is correctly configured and all relevant AI crawlers have access.