Guide April 2026

AI Crawler Overview 2026: All AI Bots, User Agents and robots.txt Settings

What AI crawlers exist, what do they do, and how do you control them via robots.txt? This overview lists all relevant AI bots – from GPTBot to ClaudeBot to Yandex – with user agent, purpose and concrete configuration examples.

How do AI crawlers work?

AI crawlers work fundamentally like classic web crawlers: they visit URLs, download the content and store it for further processing. The difference lies in purpose – while Googlebot prepares pages for the search index, AI crawlers collect content for two different goals:

Training: Crawled content flows into the training data of language models. This happens in large batches, not in real time.
Live search: For AI search engines like Perplexity, pages are crawled in real time to deliver current answers.

All reputable AI crawlers respect robots.txt, but the file only has the intended effect when it is correctly configured. Note that AI crawlers generally have shorter timeouts than Googlebot and react more sensitively to technical errors.

Important: Blocking AI crawlers makes it unlikely that your content will be cited in AI-generated answers. Allowing them means your content may be used for AI training. The decision lies with the website operator, and robots.txt lets you make it per crawler for every bot that honours the file.

High-relevance AI crawlers

These crawlers have the greatest influence on a website's AI visibility:

GPTBot (relevance: High)

OpenAI – ChatGPT

Crawls content that may be used to train the OpenAI models behind ChatGPT; live search queries are handled by the separate OAI-SearchBot. One of the most important AI crawlers: ChatGPT has over 100 million active users.

User-Agent:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)

ClaudeBot (relevance: High)

Anthropic – Claude

Crawler from Anthropic for Claude. Analyses web content for context and answer generation. Anthropic has its own llms.txt on anthropic.com.

User-Agent:

ClaudeBot/1.0; +https://anthropic.com/claudebot

PerplexityBot (relevance: High)

Perplexity AI

Crawls for Perplexity's real-time search. Specialises in fact-based answers with source references.

User-Agent:

PerplexityBot/1.0; +https://perplexity.ai/perplexitybot

Google-Extended (relevance: High)

Google – Gemini & AI Overviews

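Not a separate crawler but a robots.txt control token. Crawling is still done by Google's existing user agents; the token governs whether the crawled content may be used for Google's AI products, and it can be set separately from Googlebot in robots.txt.

robots.txt token (it has no HTTP user agent string of its own):

Google-Extended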

OAI-SearchBot (relevance: High)

OpenAI – Search

Newer crawler from OpenAI specifically for the real-time search function in ChatGPT. Complements GPTBot for live search queries.

User-Agent:

OAI-SearchBot/1.0; +https://openai.com/searchbot

Meta-ExternalAgent (relevance: High)

Meta – Llama & Meta AI

Crawler from Meta for Meta AI products (Facebook, Instagram, WhatsApp AI) and Llama model training.

User-Agent:

Meta-ExternalAgent/1.1; +https://llama.meta.com/llama-web-access/

Further AI crawlers

These crawlers have growing relevance and should not be ignored:

Applebot-Extended (relevance: Medium)

Apple – Apple Intelligence

Crawler for Apple Intelligence and Siri. Has become significantly more relevant with iOS 18 and macOS Sequoia.

User-Agent:

Applebot-Extended/1.0; +https://support.apple.com/en-us/111900

Amazonbot (relevance: Medium)

Amazon – Alexa & AWS AI

Crawler from Amazon for Alexa and Amazon AI services.

User-Agent:

Amazonbot/0.1; +https://developer.amazon.com/amazonbot

YouBot (relevance: Medium)

You.com

Crawler for the AI search engine You.com – a direct Perplexity alternative with a growing user base.

User-Agent:

YouBot; +https://about.you.com/youbot/

Bytespider (relevance: Medium)

ByteDance – TikTok

Crawler from ByteDance for AI products. Particularly relevant in Asian markets.

User-Agent:

Bytespider; +https://zhanzhang.toutiao.com/

cohere-ai (relevance: Medium)

Cohere

Crawler from Cohere – one of the leading B2B AI providers.

User-Agent:

cohere-ai/1.0

ImagesiftBot (relevance: Medium)

Hive – ImageSift

Image crawler operated by Hive for its ImageSift reverse image search. It is not a Microsoft, Bing or Copilot crawler.

User-Agent:

Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com)

Classic search engine crawlers

These crawlers are primarily responsible for classic search, but increasingly relevant for AI functions too:

Googlebot (relevance: Very high)

Google

The most important crawler – for Google Search and as the basis for many AI functions.

User-Agent:

Googlebot/2.1; +http://www.google.com/bot.html

Bingbot (relevance: High)

Microsoft – Bing & Copilot

Crawler for Bing Search and Microsoft Copilot.

User-Agent:

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

YandexBot (relevance: Medium)

Yandex

Russian search engine with its own AI products. Relevant for Russian-speaking markets.

User-Agent:

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

DuckDuckBot (relevance: Low)

DuckDuckGo

Crawler for DuckDuckGo with its own AI functions (DuckAssist).

User-Agent:

DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)

Complete reference table

Crawler	Operator	robots.txt Name	Type	Relevance
GPTBot	OpenAI	`GPTBot`	Training	⭐⭐⭐⭐⭐
OAI-SearchBot	OpenAI	`OAI-SearchBot`	Live search	⭐⭐⭐⭐⭐
ClaudeBot	Anthropic	`ClaudeBot`	Training + Live	⭐⭐⭐⭐⭐
PerplexityBot	Perplexity AI	`PerplexityBot`	Live search	⭐⭐⭐⭐⭐
Google-Extended	Google	`Google-Extended`	Training + Live	⭐⭐⭐⭐⭐
Googlebot	Google	`Googlebot`	Search + AI base	⭐⭐⭐⭐⭐
Meta-ExternalAgent	Meta	`Meta-ExternalAgent`	Training + Live	⭐⭐⭐⭐
Bingbot	Microsoft	`bingbot`	Search + Copilot	⭐⭐⭐⭐
Applebot-Extended	Apple	`Applebot-Extended`	Training + Live	⭐⭐⭐⭐
Amazonbot	Amazon	`Amazonbot`	Training	⭐⭐⭐
YouBot	You.com	`YouBot`	Live search	⭐⭐⭐
Bytespider	ByteDance	`Bytespider`	Training	⭐⭐⭐
cohere-ai	Cohere	`cohere-ai`	Training	⭐⭐⭐
YandexBot	Yandex	`YandexBot`	Search + AI	⭐⭐
DuckDuckBot	DuckDuckGo	`DuckDuckBot`	Search	⭐⭐
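
The robots.txt names in the table can also be used to recognise these crawlers in incoming requests. The following is a minimal sketch, assuming that a case-insensitive substring match on the User-Agent header is sufficient. The function name `identify_crawler` and the constant `CRAWLER_TOKENS` are illustrative, not part of any library. User agent strings can be spoofed, so a match is a hint rather than proof of identity.

```python
from typing import Optional

# robots.txt names from the reference table.
# Google-Extended is omitted: it is a robots.txt control token
# and does not appear in request headers.
CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
    "Googlebot", "Meta-ExternalAgent", "bingbot", "Applebot-Extended",
    "Amazonbot", "YouBot", "Bytespider", "cohere-ai",
    "YandexBot", "DuckDuckBot",
]


def identify_crawler(user_agent: str) -> Optional[str]:
    """Return the first known crawler token found in the header, else None."""
    ua = user_agent.lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None


if __name__ == "__main__":
    sample = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
              "compatible; GPTBot/1.2; +https://openai.com/gptbot)")
    print(identify_crawler(sample))
```

Matching on the short token rather than the full string keeps the check stable when a crawler changes its version number.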

robots.txt configuration

robots.txt enables granular control – each crawler can be individually allowed or blocked.

Allow all AI crawlers (recommended)

# Allow all bots
User-agent: *
Allow: /

# Explicitly allow AI crawlers (recommended for safety)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Block training, allow live search

# Block training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Continue to allow live search
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
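
Before deploying a configuration like the one above, it can be checked locally. The sketch below uses Python's standard `urllib.robotparser`; the helper name `is_allowed` and the `example.com` URL are assumptions made for illustration. Individual crawlers may interpret edge cases differently from this parser, so treat the result as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# The "block training, allow live search" rules from this guide.
ROBOTS_TXT = """\
# Block training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Continue to allow live search
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/")
        print(f"{agent}: {'allowed' if verdict else 'blocked'}")
```

Crawlers that are not named in the file, such as Googlebot here, fall through to the default and remain allowed.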

Warning: Anyone who blocks all AI crawlers will not appear in AI-generated answers – not on Perplexity, ChatGPT or Claude. This can lead to significant loss of visibility in the medium term.

Recommendation

For most websites: allow all AI crawlers. Once the technical foundations are in place, creating an llms.txt is an optional additional step.

Those concerned about training data can specifically block only training crawlers while continuing to allow live-search crawlers (PerplexityBot, OAI-SearchBot).

All AI crawlers configured correctly?

Use the free AI-Ready Check to verify whether your robots.txt is correctly configured and all relevant AI crawlers have access.


Frequently asked questions about AI crawlers

Do I have to list every AI crawler individually in robots.txt?

No – User-agent: * with Allow: / allows all crawlers at once. Explicitly listing individual AI crawlers is only recommended if you want to specifically block certain crawlers or set separate rules.

Do all AI crawlers respect robots.txt?

All reputable AI crawlers from major providers respect robots.txt. However, less reputable bots exist that ignore it. Against these, only server-side blocking by IP address or user agent helps.
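
Server-side blocking by user agent is usually configured in the web server or CDN. Purely to illustrate the idea, here is a minimal sketch as a WSGI middleware in Python. The names `block_user_agents` and `BLOCKED_TOKENS` and the bot names in the list are hypothetical. A bot that fakes its user agent passes this check, which is why blocking by IP address is often used alongside it.

```python
from typing import Callable, Iterable, List

# Hypothetical tokens of bots that ignore robots.txt.
BLOCKED_TOKENS = ["ExampleScraper", "BadBot"]


def block_user_agents(app: Callable,
                      blocked_tokens: Iterable[str] = BLOCKED_TOKENS) -> Callable:
    """Wrap a WSGI app and answer 403 for blocked user agents."""
    lowered: List[str] = [token.lower() for token in blocked_tokens]

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware
```

Any WSGI application can be wrapped this way, for example `app = block_user_agents(app)`.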

What happens if I block GPTBot?

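GPTBot is OpenAI's training crawler, so blocking it keeps your pages out of future ChatGPT training data. Live answers in ChatGPT rely on OAI-SearchBot; if you block that crawler as well, ChatGPT can no longer use your current content for answers, and you potentially lose visibility on one of the most widely used AI platforms.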

How do I know if an AI crawler is visiting my website?

In the server log files – the user agent of every visitor is recorded there. Tools like GoAccess or AWStats can be used to analyse logs and filter by specific user agents.
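
For a quick look without a dedicated tool, a few lines of Python are enough. This is a minimal sketch, assuming access logs in the common "combined" format, where the user agent is the last quoted field on each line. The function name `count_crawler_hits` and the sample log lines are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

CRAWLER_TOKENS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]

# In the combined log format the user agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_crawler_hits(lines: Iterable[str]) -> Counter:
    """Count requests per known crawler token in access log lines."""
    hits: Counter = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Made-up sample lines using documentation IP ranges.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Apr/2026:12:00:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [10/Apr/2026:12:00:02 +0000] "GET / HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/125.0"',
    '203.0.113.9 - - [10/Apr/2026:12:00:05 +0000] "GET /pricing HTTP/1.1" 200 1024 "-" '
    '"PerplexityBot/1.0; +https://perplexity.ai/perplexitybot"',
    '203.0.113.5 - - [10/Apr/2026:12:00:09 +0000] "GET /robots.txt HTTP/1.1" 200 310 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
]

if __name__ == "__main__":
    for token, count in count_crawler_hits(SAMPLE_LOG).most_common():
        print(f"{token}: {count}")
```

To analyse a real log, pass an open file object instead of the sample list; the file path depends on your server setup.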

Can I restrict crawling to specific directories?

Yes – with Allow and Disallow rules you can granularly control which areas a crawler may visit. For example: allow AI training only on blog posts but not on product or pricing pages.
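
A directory-level rule set can be tried out in the same way as a site-wide one. The sketch below assumes a blog under `/blog/` and uses Python's standard `urllib.robotparser`; the paths, the helper name `may_fetch` and the `example.com` domain are placeholders. Python's parser applies rules in file order, whereas the robots.txt standard (RFC 9309) prefers the most specific match. Putting the Allow line first gives the same result under both.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: GPTBot may read blog posts but nothing else.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /
"""

BASE_URL = "https://example.com"  # placeholder domain


def may_fetch(agent: str, path: str) -> bool:
    """Check a path on the placeholder domain against ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, BASE_URL + path)


if __name__ == "__main__":
    for path in ("/blog/ai-crawlers", "/pricing", "/products/widget"):
        verdict = "allowed" if may_fetch("GPTBot", path) else "blocked"
        print(f"GPTBot {path}: {verdict}")
```

Crawlers without a group of their own in this file, such as PerplexityBot, are not restricted by these rules.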