AI Crawler Overview 2026: All AI Bots, User Agents and robots.txt Settings

What AI crawlers exist, what do they do, and how do you control them via robots.txt? This overview lists all relevant AI bots — from GPTBot to ClaudeBot to Yandex — with user agent, purpose and concrete configuration examples.

How do AI crawlers work?

AI crawlers work fundamentally like classic web crawlers: they visit URLs, download the content and store it for further processing. The difference lies in the purpose. While Googlebot prepares pages for the search index, AI crawlers collect content for two different goals:

  • Training: Crawled content flows into the training data of language models. This happens in large batches, not in real time.
  • Live search: For AI search engines like Perplexity, pages are crawled in real time to deliver current answers.

All reputable AI crawlers respect robots.txt — but only when it is correctly configured. Important: AI crawlers generally have shorter timeouts than Googlebot and react more sensitively to technical errors.

Important: Anyone who blocks AI crawlers will not be cited in AI-generated answers. Anyone who allows them risks their content being used for AI training. The decision lies with the website operator — robots.txt gives full control.
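To illustrate that control, here is a minimal sketch using Python's standard-library urllib.robotparser, which applies robots.txt rules the same way a well-behaved crawler does. The robots.txt content and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks GPTBot but allows everyone else
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Per-agent decisions for the same URL
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

The same check works against a live site by calling `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `parse()`.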

High-relevance AI crawlers

These crawlers have the greatest influence on a website's AI visibility:

GPTBot (Relevance: High)
OpenAI — ChatGPT
Crawls for training and browsing function of ChatGPT. One of the most important AI crawlers — ChatGPT has over 100 million active users.
User-Agent:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
ClaudeBot (Relevance: High)
Anthropic — Claude
Crawler from Anthropic for Claude. Analyses web content for context and answer generation. Anthropic actively supports the llms.txt standard.
User-Agent:
ClaudeBot/1.0; +https://anthropic.com/claudebot
PerplexityBot (Relevance: High)
Perplexity AI
Crawls for Perplexity's real-time search. Specialises in fact-based answers with source references. Actively reads llms.txt.
User-Agent:
PerplexityBot/1.0; +https://perplexity.ai/perplexitybot
Google-Extended (Relevance: High)
Google — Gemini & AI Overviews
Separate crawler for Google Gemini and AI Overviews. Can be controlled separately from Googlebot in robots.txt.
User-Agent:
Google-Extended
OAI-SearchBot (Relevance: High)
OpenAI — Search
Newer crawler from OpenAI specifically for the real-time search function in ChatGPT. Complements GPTBot for live search queries.
User-Agent:
OAI-SearchBot/1.0; +https://openai.com/searchbot
Meta-ExternalAgent (Relevance: High)
Meta — Llama & Meta AI
Crawler from Meta for Meta AI products (Facebook, Instagram, WhatsApp AI) and Llama model training.
User-Agent:
Meta-ExternalAgent/1.1; +https://llama.meta.com/llama-web-access/

Further AI crawlers

These crawlers have growing relevance and should not be ignored:

Applebot-Extended (Relevance: Medium)
Apple — Apple Intelligence
Crawler for Apple Intelligence and Siri. Has become significantly more relevant with iOS 18 and macOS Sequoia. Can be controlled separately from the normal Applebot.
User-Agent:
Applebot-Extended/1.0; +https://support.apple.com/en-us/111900
Amazonbot (Relevance: Medium)
Amazon — Alexa & AWS AI
Crawler from Amazon for Alexa and Amazon AI services. Collects data for voice assistants and AWS AI products.
User-Agent:
Amazonbot/0.1; +https://developer.amazon.com/amazonbot
YouBot (Relevance: Medium)
You.com
Crawler for the AI search engine You.com — a direct Perplexity alternative with a growing user base.
User-Agent:
YouBot; +https://about.you.com/youbot/
Bytespider (Relevance: Medium)
ByteDance — TikTok
Crawler from ByteDance (TikTok parent company) for AI products. Particularly relevant in Asian markets and through TikTok reach.
User-Agent:
Bytespider; +https://zhanzhang.toutiao.com/
cohere-ai (Relevance: Medium)
Cohere
Crawler from Cohere — one of the leading B2B AI providers. Particularly relevant for companies using Cohere enterprise products.
User-Agent:
cohere-ai/1.0
ImagesiftBot (Relevance: Medium)
The Hive — ImageSift
Image crawler operated by The Hive (ImageSift), frequently misattributed to Microsoft. Collects image data that can also flow into AI products; it is not a Bing or Copilot bot.
User-Agent:
ImagesiftBot; +https://imagesift.com/about

Classic search engine crawlers

These crawlers are primarily responsible for classic search, but increasingly relevant for AI functions too:

Googlebot (Relevance: Very high)
Google
The most important crawler of all — for Google Search and as the basis for many AI functions. Separate control via Google-Extended for AI-specific use.
User-Agent:
Googlebot/2.1; +http://www.google.com/bot.html
Bingbot (Relevance: High)
Microsoft — Bing & Copilot
Crawler for Bing Search and Microsoft Copilot. Very relevant for AI visibility due to GPT-4 integration in Copilot.
User-Agent:
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
YandexBot (Relevance: Medium)
Yandex
Russian search engine with its own AI products (YandexGPT, Alice). Relevant for Russian-speaking markets and Eastern European audiences.
User-Agent:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
YandexGPT (Relevance: Medium)
Yandex — YandexGPT
Separate crawler from Yandex specifically for YandexGPT. Can be controlled independently from YandexBot.
User-Agent:
YandexGPT/1.0
DuckDuckBot (Relevance: Low)
DuckDuckGo
Crawler for DuckDuckGo. The search engine partly uses Bing results but has its own AI functions (DuckAssist).
User-Agent:
DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
Baiduspider (Relevance: Low)
Baidu
Chinese search engine with its own AI products (ERNIE Bot). Only relevant for websites targeting Chinese audiences.
User-Agent:
Baiduspider/2.0; +http://www.baidu.com/search/spider.html

Complete reference table

Crawler | Operator | robots.txt name | Type | Relevance
GPTBot | OpenAI | GPTBot | Training + Live | ⭐⭐⭐⭐⭐
OAI-SearchBot | OpenAI | OAI-SearchBot | Live search | ⭐⭐⭐⭐⭐
ClaudeBot | Anthropic | ClaudeBot | Training + Live | ⭐⭐⭐⭐⭐
PerplexityBot | Perplexity AI | PerplexityBot | Live search | ⭐⭐⭐⭐⭐
Google-Extended | Google | Google-Extended | Training + Live | ⭐⭐⭐⭐⭐
Googlebot | Google | Googlebot | Search + AI base | ⭐⭐⭐⭐⭐
Meta-ExternalAgent | Meta | Meta-ExternalAgent | Training + Live | ⭐⭐⭐⭐
Bingbot | Microsoft | bingbot | Search + Copilot | ⭐⭐⭐⭐
Applebot-Extended | Apple | Applebot-Extended | Training + Live | ⭐⭐⭐⭐
Amazonbot | Amazon | Amazonbot | Training | ⭐⭐⭐
YouBot | You.com | YouBot | Live search | ⭐⭐⭐
Bytespider | ByteDance | Bytespider | Training | ⭐⭐⭐
cohere-ai | Cohere | cohere-ai | Training | ⭐⭐⭐
YandexBot | Yandex | YandexBot | Search + AI | ⭐⭐
YandexGPT | Yandex | YandexGPT | Training + Live | ⭐⭐
DuckDuckBot | DuckDuckGo | DuckDuckBot | Search | ⭐⭐
Baiduspider | Baidu | Baiduspider | Search + AI |

robots.txt configuration

robots.txt enables granular control — each crawler can be individually allowed or blocked.

Allow all AI crawlers (recommended)

# Allow all bots
User-agent: *
Allow: /

# Explicitly allow AI crawlers (recommended for safety)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

Sitemap: https://deinedomain.de/sitemap.xml

# Non-standard directive; parsers that do not support llms.txt ignore it
LLMs: https://deinedomain.de/llms.txt

Block training, allow live search

For those who do not want their content used for AI training, but still want to appear in live search results:

# Block training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Continue to allow live search
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Block all AI crawlers

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: cohere-ai
Disallow: /

# Continue to allow classic search engines
User-agent: *
Allow: /
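A configuration like this can be sanity-checked before deployment. The sketch below (hypothetical file content, stdlib only) feeds a block-all style robots.txt into urllib.robotparser and confirms that the listed AI crawlers are refused while a classic search crawler falls back to the wildcard group:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt of a "block all AI crawlers" robots.txt
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Each explicitly listed AI crawler should be denied
for bot in ["GPTBot", "ClaudeBot", "PerplexityBot"]:
    assert not rp.can_fetch(bot, "https://example.com/")

# Classic search engine crawlers match the * group instead
assert rp.can_fetch("Googlebot", "https://example.com/")
print("robots.txt behaves as intended")
```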

Warning: Anyone who blocks all AI crawlers will not appear in AI-generated answers — not on Perplexity, ChatGPT or Claude. This can lead to significant loss of visibility in the medium term.

Recommendation

For most websites: allow all AI crawlers and additionally create an llms.txt. This maximises AI visibility and gives crawlers the context they need for accurate answers.

Those concerned about training data can specifically block only training crawlers while continuing to allow live-search crawlers (PerplexityBot, OAI-SearchBot).

All AI crawlers configured correctly?

Use the free AI-Ready Check to verify whether your robots.txt is correctly configured and all relevant AI crawlers have access.

Check for free now →

Frequently asked questions about AI crawlers

Do I have to list every AI crawler individually in robots.txt?

No — User-agent: * with Allow: / allows all crawlers at once. Explicitly listing individual AI crawlers is only recommended if you want to specifically block certain crawlers or set separate rules.

Do all AI crawlers respect robots.txt?

All reputable AI crawlers from major providers (OpenAI, Anthropic, Google, Meta etc.) respect robots.txt. However, less reputable bots exist that ignore robots.txt — against these only server-side blocking by IP or user agent helps.
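One common form of server-side blocking is matching the user agent at the web server. A hedged sketch for an nginx server block follows; the bot names in the pattern are examples only, so verify against your own access logs before deploying:

```nginx
# Return 403 to selected user agents that ignore robots.txt
# ("SomeRogueBot" is a placeholder; adjust the list to the bots you actually see)
if ($http_user_agent ~* "(Bytespider|SomeRogueBot)") {
    return 403;
}
```

Note that user-agent strings can be spoofed, so for persistent offenders an IP- or ASN-based block is the more reliable option.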

What happens if I block GPTBot?

Your website will no longer be included in ChatGPT training data and ChatGPT will no longer be able to use your current content for answers. You potentially lose visibility on one of the most widely used AI platforms worldwide.

How do I know if an AI crawler is visiting my website?

In the server log files — the user agent of every visitor is recorded there. Tools like GoAccess or AWStats can be used to analyse logs and filter by specific user agents.
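This can also be scripted. The sketch below scans access-log lines in common/combined format for known AI crawler names; the log excerpt and the exact bot list are assumptions, so adapt both to your server:

```python
import re
from collections import Counter

# AI crawler names to look for in the user-agent field (extend as needed)
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "Meta-ExternalAgent", "Amazonbot", "Bytespider"]

def count_ai_crawler_hits(log_lines):
    """Count hits per AI crawler based on user-agent substrings."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if re.search(re.escape(bot), line, re.IGNORECASE):
                hits[bot] += 1
    return hits

# Hypothetical log excerpt in combined log format
sample = [
    '1.2.3.4 - - [01/Mar/2026:10:00:00 +0000] "GET /blog HTTP/1.1" 200 1234 '
    '"-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [01/Mar/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "PerplexityBot/1.0; +https://perplexity.ai/perplexitybot"',
]
print(count_ai_crawler_hits(sample))  # Counter({'GPTBot': 1, 'PerplexityBot': 1})
```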

Can I restrict crawling to specific directories?

Yes — with Allow and Disallow rules you can granularly control which areas a crawler may visit. For example: allow AI training only on blog posts but not on product or pricing pages.
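A hedged sketch of such a directory-scoped configuration (the paths are placeholders; use the directory names your site actually has):

```
# Hypothetical: allow GPTBot on the blog, keep it off product and pricing pages
User-agent: GPTBot
Allow: /blog/
Disallow: /products/
Disallow: /pricing/
```

With no trailing `Disallow: /`, everything not explicitly disallowed stays crawlable for GPTBot; add `Disallow: /` as a final rule in that group if the blog should be the only permitted area.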