AI Crawler Overview 2026: All AI Bots, User Agents and robots.txt Settings

What AI crawlers exist, what do they do, and how do you control them via robots.txt? This overview lists all relevant AI bots – from GPTBot to ClaudeBot to Yandex – with user agent, purpose and concrete configuration examples.

How do AI crawlers work?

AI crawlers work fundamentally like classic web crawlers: they visit URLs, download the content and store it for further processing. The difference lies in purpose – while Googlebot prepares pages for the search index, AI crawlers collect content for two different goals:

  • Training: Crawled content flows into the training data of language models. This happens in large batches, not in real time.
  • Live search: For AI search engines like Perplexity, pages are crawled in real time to deliver current answers.

All reputable AI crawlers respect robots.txt – but only when it is correctly configured. Important: AI crawlers generally have shorter timeouts than Googlebot and react more sensitively to technical errors.

Important: Anyone who blocks AI crawlers will generally not be cited in AI-generated answers. Anyone who allows them accepts that their content may be used for AI training. The decision lies with the website operator, and robots.txt is the main control for every crawler that honours it.

High-relevance AI crawlers

These crawlers have the greatest influence on a website's AI visibility:

GPTBot (relevance: High)
OpenAI – ChatGPT
Crawls content for training OpenAI's models. Live search and user-triggered browsing in ChatGPT use separate agents (OAI-SearchBot and ChatGPT-User). One of the most important AI crawlers – ChatGPT has over 100 million active users.
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)

ClaudeBot (relevance: High)
Anthropic – Claude
Crawler from Anthropic for Claude. Analyses web content for context and answer generation. Anthropic has its own llms.txt on anthropic.com.
User-Agent: ClaudeBot/1.0; +https://anthropic.com/claudebot

PerplexityBot (relevance: High)
Perplexity AI
Crawls for Perplexity's real-time search. Specialises in fact-based answers with source references.
User-Agent: PerplexityBot/1.0; +https://perplexity.ai/perplexitybot

Google-Extended (relevance: High)
Google – Gemini
A robots.txt control token rather than a separate crawler: the crawling itself is done by Google's existing user agents, and the token decides whether that content may be used for Gemini training and grounding. It can be set independently of Googlebot in robots.txt and does not affect inclusion in Google Search. AI Overviews are part of Search and follow the Googlebot rules.
User-Agent: Google-Extended

OAI-SearchBot (relevance: High)
OpenAI – Search
Newer crawler from OpenAI specifically for the real-time search function in ChatGPT. Complements GPTBot for live search queries.
User-Agent: OAI-SearchBot/1.0; +https://openai.com/searchbot

Meta-ExternalAgent (relevance: High)
Meta – Llama & Meta AI
Crawler from Meta for Meta AI products (Facebook, Instagram, WhatsApp AI) and Llama model training.
User-Agent: Meta-ExternalAgent/1.1; +https://llama.meta.com/llama-web-access/

Further AI crawlers

These crawlers have growing relevance and should not be ignored:

Applebot-Extended (relevance: Medium)
Apple – Apple Intelligence
Crawler for Apple Intelligence and Siri. Has become significantly more relevant with iOS 18 and macOS Sequoia.
User-Agent: Applebot-Extended/1.0; +https://support.apple.com/en-us/111900

Amazonbot (relevance: Medium)
Amazon – Alexa & AWS AI
Crawler from Amazon for Alexa and Amazon AI services.
User-Agent: Amazonbot/0.1; +https://developer.amazon.com/amazonbot

YouBot (relevance: Medium)
You.com
Crawler for the AI search engine You.com – a direct Perplexity alternative with a growing user base.
User-Agent: YouBot; +https://about.you.com/youbot/

Bytespider (relevance: Medium)
ByteDance – TikTok
Crawler from ByteDance for AI products. Particularly relevant in Asian markets.
User-Agent: Bytespider; +https://zhanzhang.toutiao.com/

cohere-ai (relevance: Medium)
Cohere
Crawler from Cohere – one of the leading B2B AI providers.
User-Agent: cohere-ai/1.0

ImagesiftBot (relevance: Medium)
Hive – ImageSift
Crawler operated by Hive (imagesift.com) that collects publicly available images. It is not a Microsoft, Bing or Copilot crawler.
User-Agent: Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com)

Classic search engine crawlers

These crawlers are primarily responsible for classic search, but increasingly relevant for AI functions too:

Googlebot (relevance: Very high)
Google
The most important crawler – for Google Search and as the basis for many AI functions.
User-Agent: Googlebot/2.1; +http://www.google.com/bot.html

Bingbot (relevance: High)
Microsoft – Bing & Copilot
Crawler for Bing Search and Microsoft Copilot.
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

YandexBot (relevance: Medium)
Yandex
Russian search engine with its own AI products. Relevant for Russian-speaking markets.
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

DuckDuckBot (relevance: Low)
DuckDuckGo
Crawler for DuckDuckGo with its own AI functions (DuckAssist).
User-Agent: DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)

Complete reference table

Crawler            | Operator      | robots.txt name    | Type             | Relevance
-------------------|---------------|--------------------|------------------|----------
GPTBot             | OpenAI        | GPTBot             | Training         | ⭐⭐⭐⭐⭐
OAI-SearchBot      | OpenAI        | OAI-SearchBot      | Live search      | ⭐⭐⭐⭐⭐
ClaudeBot          | Anthropic     | ClaudeBot          | Training + Live  | ⭐⭐⭐⭐⭐
PerplexityBot      | Perplexity AI | PerplexityBot      | Live search      | ⭐⭐⭐⭐⭐
Google-Extended    | Google        | Google-Extended    | Training + Live  | ⭐⭐⭐⭐⭐
Googlebot          | Google        | Googlebot          | Search + AI base | ⭐⭐⭐⭐⭐
Meta-ExternalAgent | Meta          | Meta-ExternalAgent | Training + Live  | ⭐⭐⭐⭐
Bingbot            | Microsoft     | bingbot            | Search + Copilot | ⭐⭐⭐⭐
Applebot-Extended  | Apple         | Applebot-Extended  | Training + Live  | ⭐⭐⭐⭐
Amazonbot          | Amazon        | Amazonbot          | Training         | ⭐⭐⭐
YouBot             | You.com       | YouBot             | Live search      | ⭐⭐⭐
Bytespider         | ByteDance     | Bytespider         | Training         | ⭐⭐⭐
cohere-ai          | Cohere        | cohere-ai          | Training         | ⭐⭐⭐
YandexBot          | Yandex        | YandexBot          | Search + AI      | ⭐⭐
DuckDuckBot        | DuckDuckGo    | DuckDuckBot        | Search           | ⭐⭐
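
The split between training crawlers and live-search crawlers in the table can also drive robots.txt generation. The following is a minimal Python sketch, not a definitive tool. The names `HAS_TRAINING_ROLE` and `build_robots_txt` are hypothetical and introduced only for illustration, and the True/False values are assumptions that follow this article's classification of six crawlers, not an official source.

```python
# Hypothetical mapping: robots.txt name -> whether this article lists a
# training role for the crawler. The values are assumptions, not official data.
HAS_TRAINING_ROLE = {
    "GPTBot": True,
    "OAI-SearchBot": False,
    "ClaudeBot": True,
    "PerplexityBot": False,
    "Google-Extended": True,
    "Meta-ExternalAgent": True,
}


def build_robots_txt(training_roles, block_training=True):
    """Return robots.txt text with one group per crawler."""
    groups = []
    for name, trains in training_roles.items():
        # Block a crawler only if blocking is requested and it has a training role.
        blocked = block_training and trains
        rule = "Disallow: /" if blocked else "Allow: /"
        groups.append(f"User-agent: {name}\n{rule}")
    # Groups are separated by a blank line, as in a hand-written robots.txt.
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(HAS_TRAINING_ROLE)
print(robots_txt)
```

Calling `build_robots_txt(HAS_TRAINING_ROLE, block_training=False)` produces an allow-all variant for the same crawlers.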

robots.txt configuration

robots.txt enables granular control – each crawler can be individually allowed or blocked.

Allow all AI crawlers (recommended)

# Allow all bots
User-agent: *
Allow: /

# Explicitly allow AI crawlers (recommended for safety)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Block training, allow live search

# Block training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Continue to allow live search
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
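
Before deploying a configuration like the one above, it can be checked locally. This is a small sketch using Python's standard `urllib.robotparser`. The helper name `is_allowed` is hypothetical, and `yourdomain.com` is the article's placeholder domain. Python's parser is only an approximation of how each real crawler interprets the file, so treat the result as a sanity check, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The "block training, allow live search" rules from this section.
ROBOTS_TXT = """\
# Block training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Continue to allow live search
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def is_allowed(robots_txt, user_agent, url="https://yourdomain.com/"):
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, agent) else "blocked"
    print(f"{agent}: {verdict}")
```

Crawlers that have no group of their own and no `User-agent: *` group to fall back on are treated as allowed, which matches the usual robots.txt default.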

Warning: Anyone who blocks all AI crawlers will not appear in AI-generated answers – not on Perplexity, ChatGPT or Claude. This can lead to significant loss of visibility in the medium term.

Recommendation

For most websites: allow all AI crawlers and additionally create an llms.txt – as an optional step once the technical foundations are already in place.

Those concerned about training data can specifically block only training crawlers while continuing to allow live-search crawlers (PerplexityBot, OAI-SearchBot).

All AI crawlers configured correctly?

Use the free AI-Ready Check to verify whether your robots.txt is correctly configured and all relevant AI crawlers have access.


Frequently asked questions about AI crawlers

Do I have to list every AI crawler individually in robots.txt?

No – User-agent: * with Allow: / allows all crawlers at once. Explicitly listing individual AI crawlers is only recommended if you want to specifically block certain crawlers or set separate rules.

Do all AI crawlers respect robots.txt?

All reputable AI crawlers from major providers respect robots.txt. However, less reputable bots exist that ignore robots.txt – against these only server-side blocking by IP or user agent helps.
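
Server-side blocking by user agent can be sketched as a small WSGI middleware in Python. This is an illustration under stated assumptions, not a hardened implementation: `BLOCKED_AGENT_TOKENS`, `is_blocked`, `BlockAgentsMiddleware` and `hello_app` are hypothetical names, and the two bot tokens are invented placeholders. Because the User-Agent header can be spoofed, this kind of filter is usually combined with IP-based rules at the web server or firewall.

```python
# Hypothetical tokens of bots that ignore robots.txt (invented placeholders).
BLOCKED_AGENT_TOKENS = ("badbot", "examplescraper")


def is_blocked(user_agent):
    """Return True if the User-Agent header contains a blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


class BlockAgentsMiddleware:
    """WSGI middleware that answers blocked user agents with 403."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # All other requests are passed through to the wrapped application.
        return self.app(environ, start_response)


def hello_app(environ, start_response):
    """Tiny demo application used only to show the middleware in action."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = BlockAgentsMiddleware(hello_app)
```

Any WSGI server can serve `application`. In production setups the same check is often placed in the web server configuration instead, so blocked requests never reach the application.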

What happens if I block GPTBot?

Your website will no longer be crawled for OpenAI's training data. ChatGPT's live search relies on OAI-SearchBot, so your current content can still be used in answers as long as that crawler remains allowed. If you block both, you potentially lose visibility on one of the most widely used AI platforms.

How do I know if an AI crawler is visiting my website?

In the server log files – the user agent of every visitor is recorded there. Tools like GoAccess or AWStats can be used to analyse logs and filter by specific user agents.
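
For a quick look without a dedicated tool, the user agent field can be counted directly. The sketch below assumes the common "combined" access log format, in which the user agent is the last quoted field. The sample lines are invented for illustration, and `AI_CRAWLER_TOKENS` and `count_ai_crawler_hits` are hypothetical names. Google-Extended is left out on purpose, since it is a robots.txt token and not a user agent that shows up in logs.

```python
import re
from collections import Counter

# Substrings to look for in the user agent field (a subset of this article's list).
AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
    "Meta-ExternalAgent", "Amazonbot", "Bytespider",
]

# In the combined log format the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawler_hits(log_lines):
    """Count requests per AI crawler based on the user agent field."""
    hits = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # Line does not end with a quoted field; skip it.
        user_agent = match.group(1).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 1234 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [10/Jan/2026:10:00:05 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"',
    '203.0.113.9 - - [10/Jan/2026:10:00:09 +0000] "GET /blog/post HTTP/1.1" 200 4096 "-" '
    '"PerplexityBot/1.0; +https://perplexity.ai/perplexitybot"',
    '203.0.113.5 - - [10/Jan/2026:10:00:12 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
]

hits = count_ai_crawler_hits(SAMPLE_LOG)
for name, count in hits.most_common():
    print(name, count)
```

To analyse a real log, pass an open file object instead of `SAMPLE_LOG`, since the function only needs an iterable of lines.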

Can I restrict crawling to specific directories?

Yes – with Allow and Disallow rules you can granularly control which areas a crawler may visit. For example: allow AI training only on blog posts but not on product or pricing pages.
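
A directory-level rule set for the blog example could look like the sketch below, again checked with Python's standard `urllib.robotparser`. The paths `/blog/`, `/pricing` and `/products/` and the helper `may_crawl` are assumptions chosen for illustration. Python's parser applies the first matching rule in file order, so `Allow` is listed before `Disallow`. Crawlers that use longest-match precedence reach the same result for these rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: GPTBot may read blog posts only, nothing else.
DIRECTORY_RULES = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /
"""

_parser = RobotFileParser()
_parser.parse(DIRECTORY_RULES.splitlines())


def may_crawl(user_agent, path, site="https://yourdomain.com"):
    """Report whether user_agent may fetch the given path under these rules."""
    return _parser.can_fetch(user_agent, site + path)


for path in ("/blog/ai-crawlers", "/pricing", "/products/widget"):
    verdict = "allowed" if may_crawl("GPTBot", path) else "blocked"
    print(f"GPTBot {path}: {verdict}")
```

Crawlers without their own group are not affected by these rules and remain allowed everywhere.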