AI Crawler Overview 2026: All AI Bots, User Agents and robots.txt Settings

What AI crawlers exist, what do they do, and how do you control them via robots.txt? This overview lists all relevant AI bots — from GPTBot to ClaudeBot to Yandex — with user agent, purpose and concrete configuration examples.

How do AI crawlers work?

AI crawlers work fundamentally like classic web crawlers: they visit URLs, download the content and store it for further processing. The difference lies in the purpose. While Googlebot prepares pages for the search index, AI crawlers collect content for two different goals:

  • Training: Crawled content flows into the training data of language models. This happens in large batches, not in real time.
  • Live search: For AI search engines like Perplexity, pages are crawled in real time to deliver current answers.

All reputable AI crawlers respect robots.txt — but only when it is correctly configured. Important: AI crawlers generally have shorter timeouts than Googlebot and react more sensitively to technical errors.

Important: Anyone who blocks AI crawlers will not be cited in AI-generated answers. Anyone who allows them risks their content being used for AI training. The decision lies with the website operator — robots.txt gives full control.
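To illustrate that control, here is a minimal sketch using Python's standard-library urllib.robotparser, which applies robots.txt rules the same way a well-behaved crawler does. The robots.txt content and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks GPTBot but allows everyone else
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Per-agent decisions for the same URL
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

The same check works against a live site by calling `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `parse()`.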

High-relevance AI crawlers

These crawlers have the greatest influence on a website's AI visibility:

GPTBot (Relevance: High)
OpenAI — ChatGPT
Crawls for training and browsing function of ChatGPT. One of the most important AI crawlers — ChatGPT has over 100 million active users.
User-Agent:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
ClaudeBot (Relevance: High)
Anthropic — Claude
Crawler from Anthropic for Claude. Analyses web content for context and answer generation. Anthropic actively supports the llms.txt standard.
User-Agent:
ClaudeBot/1.0; +https://anthropic.com/claudebot
PerplexityBot (Relevance: High)
Perplexity AI
Crawls for Perplexity's real-time search. Specialises in fact-based answers with source references. Actively reads llms.txt.
User-Agent:
PerplexityBot/1.0; +https://perplexity.ai/perplexitybot
Google-Extended (Relevance: High)
Google — Gemini & AI Overviews
Separate crawler for Google Gemini and AI Overviews. Can be controlled separately from Googlebot in robots.txt.
User-Agent:
Google-Extended
OAI-SearchBot (Relevance: High)
OpenAI — Search
Newer crawler from OpenAI specifically for the real-time search function in ChatGPT. Complements GPTBot for live search queries.
User-Agent:
OAI-SearchBot/1.0; +https://openai.com/searchbot
Meta-ExternalAgent (Relevance: High)
Meta — Llama & Meta AI
Crawler from Meta for Meta AI products (Facebook, Instagram, WhatsApp AI) and Llama model training.
User-Agent:
Meta-ExternalAgent/1.1; +https://llama.meta.com/llama-web-access/

Further AI crawlers

These crawlers have growing relevance and should not be ignored:

Applebot-Extended (Relevance: Medium)
Apple — Apple Intelligence
Crawler for Apple Intelligence and Siri. Has become significantly more relevant with iOS 18 and macOS Sequoia. Can be controlled separately from the normal Applebot.
User-Agent:
Applebot-Extended/1.0; +https://support.apple.com/en-us/111900
Amazonbot (Relevance: Medium)
Amazon — Alexa & AWS AI
Crawler from Amazon for Alexa and Amazon AI services. Collects data for voice assistants and AWS AI products.
User-Agent:
Amazonbot/0.1; +https://developer.amazon.com/amazonbot
YouBot (Relevance: Medium)
You.com
Crawler for the AI search engine You.com — a direct Perplexity alternative with a growing user base.
User-Agent:
YouBot; +https://about.you.com/youbot/
Bytespider (Relevance: Medium)
ByteDance — TikTok
Crawler from ByteDance (TikTok parent company) for AI products. Particularly relevant in Asian markets and through TikTok reach.
User-Agent:
Bytespider; +https://zhanzhang.toutiao.com/
cohere-ai (Relevance: Medium)
Cohere
Crawler from Cohere — one of the leading B2B AI providers. Particularly relevant for companies using Cohere enterprise products.
User-Agent:
cohere-ai/1.0
ImagesiftBot (Relevance: Medium)
The Hive — ImageSift
Image crawler operated by The Hive (ImageSift), frequently misattributed to Microsoft. Collects image data that can also flow into AI products; it is not a Bing or Copilot bot.
User-Agent:
ImagesiftBot; +https://imagesift.com/about

Classic search engine crawlers

These crawlers are primarily responsible for classic search, but increasingly relevant for AI functions too:

Googlebot (Relevance: Very high)
Google
The most important crawler of all — for Google Search and as the basis for many AI functions. Separate control via Google-Extended for AI-specific use.
User-Agent:
Googlebot/2.1; +http://www.google.com/bot.html
Bingbot (Relevance: High)
Microsoft — Bing & Copilot
Crawler for Bing Search and Microsoft Copilot. Very relevant for AI visibility due to GPT-4 integration in Copilot.
User-Agent:
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
YandexBot (Relevance: Medium)
Yandex
Russian search engine with its own AI products (YandexGPT, Alice). Relevant for Russian-speaking markets and Eastern European audiences.
User-Agent:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
YandexGPT (Relevance: Medium)
Yandex — YandexGPT
Separate crawler from Yandex specifically for YandexGPT. Can be controlled independently from YandexBot.
User-Agent:
YandexGPT/1.0
DuckDuckBot (Relevance: Low)
DuckDuckGo
Crawler for DuckDuckGo. The search engine partly uses Bing results but has its own AI functions (DuckAssist).
User-Agent:
DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
Baiduspider (Relevance: Low)
Baidu
Chinese search engine with its own AI products (ERNIE Bot). Only relevant for websites targeting Chinese audiences.
User-Agent:
Baiduspider/2.0; +http://www.baidu.com/search/spider.html

Complete reference table

Crawler | Operator | robots.txt name | Type | Relevance
GPTBot | OpenAI | GPTBot | Training + Live | ⭐⭐⭐⭐⭐
OAI-SearchBot | OpenAI | OAI-SearchBot | Live search | ⭐⭐⭐⭐⭐
ClaudeBot | Anthropic | ClaudeBot | Training + Live | ⭐⭐⭐⭐⭐
PerplexityBot | Perplexity AI | PerplexityBot | Live search | ⭐⭐⭐⭐⭐
Google-Extended | Google | Google-Extended | Training + Live | ⭐⭐⭐⭐⭐
Googlebot | Google | Googlebot | Search + AI base | ⭐⭐⭐⭐⭐
Meta-ExternalAgent | Meta | Meta-ExternalAgent | Training + Live | ⭐⭐⭐⭐
Bingbot | Microsoft | bingbot | Search + Copilot | ⭐⭐⭐⭐
Applebot-Extended | Apple | Applebot-Extended | Training + Live | ⭐⭐⭐⭐
Amazonbot | Amazon | Amazonbot | Training | ⭐⭐⭐
YouBot | You.com | YouBot | Live search | ⭐⭐⭐
Bytespider | ByteDance | Bytespider | Training | ⭐⭐⭐
cohere-ai | Cohere | cohere-ai | Training | ⭐⭐⭐
YandexBot | Yandex | YandexBot | Search + AI | ⭐⭐
YandexGPT | Yandex | YandexGPT | Training + Live | ⭐⭐
DuckDuckBot | DuckDuckGo | DuckDuckBot | Search | ⭐⭐
Baiduspider | Baidu | Baiduspider | Search + AI |

robots.txt configuration

robots.txt enables granular control — each crawler can be individually allowed or blocked.

Allow all AI crawlers (recommended)

# Allow all bots
User-agent: *
Allow: /

# Explicitly allow AI crawlers (recommended for safety)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

Sitemap: https://deinedomain.de/sitemap.xml

# Non-standard directive; parsers that do not support llms.txt ignore it
LLMs: https://deinedomain.de/llms.txt

Block training, allow live search

For those who do not want their content used for AI training, but still want to appear in live search results:

# Block training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Continue to allow live search
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Block all AI crawlers

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: cohere-ai
Disallow: /

# Continue to allow classic search engines
User-agent: *
Allow: /
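A configuration like this can be sanity-checked before deployment. The sketch below (hypothetical file content, stdlib only) feeds a block-all style robots.txt into urllib.robotparser and confirms that the listed AI crawlers are refused while a classic search crawler falls back to the wildcard group:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt of a "block all AI crawlers" robots.txt
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Each explicitly listed AI crawler should be denied
for bot in ["GPTBot", "ClaudeBot", "PerplexityBot"]:
    assert not rp.can_fetch(bot, "https://example.com/")

# Classic search engine crawlers match the * group instead
assert rp.can_fetch("Googlebot", "https://example.com/")
print("robots.txt behaves as intended")
```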

Warning: Anyone who blocks all AI crawlers will not appear in AI-generated answers — not on Perplexity, ChatGPT or Claude. This can lead to significant loss of visibility in the medium term.

Recommendation

For most websites: allow all AI crawlers and additionally create an llms.txt. This maximises AI visibility and gives crawlers the context they need for accurate answers.

Those concerned about training data can specifically block only training crawlers while continuing to allow live-search crawlers (PerplexityBot, OAI-SearchBot).

All AI crawlers configured correctly?

Use the free AI-Ready Check to verify whether your robots.txt is correctly configured and all relevant AI crawlers have access.

Check for free now →

Frequently asked questions about AI crawlers

Do I have to list every AI crawler individually in robots.txt?

No — User-agent: * with Allow: / allows all crawlers at once. Explicitly listing individual AI crawlers is only recommended if you want to specifically block certain crawlers or set separate rules.

Do all AI crawlers respect robots.txt?

All reputable AI crawlers from major providers (OpenAI, Anthropic, Google, Meta etc.) respect robots.txt. However, less reputable bots exist that ignore robots.txt — against these only server-side blocking by IP or user agent helps.
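One common form of server-side blocking is matching the user agent at the web server. A hedged sketch for an nginx server block follows; the bot names in the pattern are examples only, so verify against your own access logs before deploying:

```nginx
# Return 403 to selected user agents that ignore robots.txt
# ("SomeRogueBot" is a placeholder; adjust the list to the bots you actually see)
if ($http_user_agent ~* "(Bytespider|SomeRogueBot)") {
    return 403;
}
```

Note that user-agent strings can be spoofed, so for persistent offenders an IP- or ASN-based block is the more reliable option.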

What happens if I block GPTBot?

Your website will no longer be included in ChatGPT training data and ChatGPT will no longer be able to use your current content for answers. You potentially lose visibility on one of the most widely used AI platforms worldwide.

How do I know if an AI crawler is visiting my website?

In the server log files — the user agent of every visitor is recorded there. Tools like GoAccess or AWStats can be used to analyse logs and filter by specific user agents.
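This can also be scripted. The sketch below scans access-log lines in common/combined format for known AI crawler names; the log excerpt and the exact bot list are assumptions, so adapt both to your server:

```python
import re
from collections import Counter

# AI crawler names to look for in the user-agent field (extend as needed)
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "Meta-ExternalAgent", "Amazonbot", "Bytespider"]

def count_ai_crawler_hits(log_lines):
    """Count hits per AI crawler based on user-agent substrings."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if re.search(re.escape(bot), line, re.IGNORECASE):
                hits[bot] += 1
    return hits

# Hypothetical log excerpt in combined log format
sample = [
    '1.2.3.4 - - [01/Mar/2026:10:00:00 +0000] "GET /blog HTTP/1.1" 200 1234 '
    '"-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [01/Mar/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "PerplexityBot/1.0; +https://perplexity.ai/perplexitybot"',
]
print(count_ai_crawler_hits(sample))  # Counter({'GPTBot': 1, 'PerplexityBot': 1})
```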

Can I restrict crawling to specific directories?

Yes — with Allow and Disallow rules you can granularly control which areas a crawler may visit. For example: allow AI training only on blog posts but not on product or pricing pages.
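A hedged sketch of such a directory-scoped configuration (the paths are placeholders; use the directory names your site actually has):

```
# Hypothetical: allow GPTBot on the blog, keep it off product and pricing pages
User-agent: GPTBot
Allow: /blog/
Disallow: /products/
Disallow: /pricing/
```

With no trailing `Disallow: /`, everything not explicitly disallowed stays crawlable for GPTBot; add `Disallow: /` as a final rule in that group if the blog should be the only permitted area.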