
GPTBot / ClaudeBot / PerplexityBot: Complete robots.txt Configuration Guide for AI Crawlers (2026)


Category: Technical Implementation
Date: April 22, 2026
Reading time: ~8 min


Your robots.txt file may be quietly blocking the AI crawlers that would otherwise send you qualified buyers.

This isn't hypothetical. When we run GEO audits on client websites, more than 60% have a misconfiguration: either the site is fully blocking AI crawlers, or specific rules are accidentally preventing access to the most valuable pages. This guide covers the current crawler names for every major AI platform, the correct configuration template, and the mistakes we've made so you don't have to.


What Happened When We Debugged a Client's Blocked AI Crawlers

We should have caught this faster than we did.

A Shenzhen-based professional services client came to us with a robots.txt that looked correct on the surface. All the right User-agent entries, no blanket Disallow: /. We verified it manually in the browser. Everything looked fine.

Then we checked their actual AI mention count after 10 weeks of content publishing: still zero across all platforms. Something was blocking the crawlers, and it wasn't the robots.txt file.

The culprit was Cloudflare. They had Bot Fight Mode enabled, which was intercepting requests from GPTBot and ClaudeBot and returning 403 errors before those crawlers ever got to read the robots.txt file. The robots.txt said "come in," but the CDN was turning them away at the door. Crawlers that receive a 403 don't retry — they move on and don't re-attempt that domain for an extended period.
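The general test for this failure mode: request the same URL twice, once with a crawler User-Agent and once with a browser User-Agent. A 403 only for the crawler points at the CDN or WAF layer rather than robots.txt. A minimal sketch of that check (the User-Agent strings and URL are illustrative; each vendor documents its real token):

```python
import urllib.error
import urllib.request

def fetch_status(url: str, user_agent: str) -> int:
    """HTTP status for `url` when requested with `user_agent`."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

def diagnose(crawler_status: int, browser_status: int) -> str:
    """A crawler-only 403 points at the CDN/WAF, not at robots.txt."""
    if crawler_status == 403 and browser_status == 200:
        return "edge-block"    # bot protection rejects the crawler before robots.txt is read
    if crawler_status == 200 and browser_status == 200:
        return "ok"            # no edge-level block; audit the robots.txt rules themselves
    return "inconclusive"      # outage, auth wall, redirect loop, etc.

# Example with no network call: the status pair we saw under Bot Fight Mode.
print(diagnose(403, 200))  # edge-block
```

One caveat: some bot-protection layers also return 403 to a spoofed crawler User-Agent coming from an ordinary IP even when the real crawler is allowed, so a failure here is a prompt to open the CDN dashboard, not conclusive proof.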

The fix: creating Cloudflare allow rules for the GPTBot and ClaudeBot IP ranges so Bot Fight Mode stops intercepting them. Both OpenAI and Anthropic publish their crawler IP ranges in their documentation; work from those published lists and re-check them periodically, because the ranges change.

After the fix, we waited roughly 6 weeks before we saw ChatGPT begin citing this client's content. Time spent diagnosing the root cause: approximately 4 hours. Knowing to check the CDN layer first would have saved most of that.


Configuration Strategy by Situation

Not every site should be fully open. Here's how to think about different scenarios:

Scenario A: Fully open (recommended for GEO)
Allow all major AI crawlers. Restrict only admin panels, private areas, and non-public content. Best for: B2B company websites, content sites, knowledge bases. This is the right approach for most businesses that want to appear in AI answers.
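A minimal fully-open file for this scenario. The User-agent tokens below reflect our understanding of current crawler names, and the restricted paths are placeholders; verify both against each platform's own crawler documentation before deploying:

User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Disallow: /admin/
Disallow: /private/

User-agent: *
Disallow: /admin/
Disallow: /private/

Sitemap: https://yoursite.com/sitemap.xml

Naming the AI crawlers explicitly looks redundant next to the * group, but a crawler follows the most specific group that matches it, so a later tightening of the * rules cannot silently lock these crawlers out.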

Scenario B: Selective access (for media companies with copyright-sensitive content)
Allow Googlebot for SEO, but restrict specific AI crawlers to prevent unlicensed content use for training. Approach: per-User-agent Disallow: /. Be aware: blocking GPTBot directly reduces ChatGPT's ability to index and cite your content, which has a direct negative GEO impact.
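The pattern in file form, keeping Googlebot while refusing a specific AI crawler (GPTBot here stands in for whichever crawlers you choose to exclude):

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /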

Scenario C: Staged protection (beta products, commercially sensitive pages)
Use path-level blocking (Disallow: /internal/) rather than site-wide Disallow: /. Path-level restrictions are more precise — they protect the pages you mean to protect without accidentally blocking the product pages and case study pages that are GEO-valuable.
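Scenario C in file form; the /beta/ path is illustrative:

User-agent: *
Disallow: /internal/
Disallow: /beta/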


Counter-Consensus: Configuring robots.txt Does Not Mean AI Will Recommend You

This is the most common misunderstanding we encounter: someone spends a day getting robots.txt right, then waits for AI mentions to start appearing.

Three months later: still nothing.

The reason: robots.txt only tells crawlers they're allowed in. What they find once they're inside determines whether they'll cite you. If the content has no clear structured information about who you are, what you do, and what evidence supports your claims, an open door makes no difference.

Most GEO content talks about robots.txt and sitemaps as the whole solution. These are necessary baseline conditions. But what determines whether AI systems recommend you is whether your content clearly answers "who are you, what do you do, and what's the proof" — with structured, extractable information that AI systems can summarize and reference. robots.txt is table stakes. It's not the answer.


Verifying Your robots.txt Configuration

After making changes, use these methods to confirm:

  1. Google Search Console → Settings → robots.txt report. This shows the version of the file Google last fetched and flags parse errors, but it only speaks for Google's own crawlers; it cannot simulate GPTBot or ClaudeBot. Direct link: https://search.google.com/search-console

  2. Direct file check: Visit https://yoursite.com/robots.txt in a browser. Confirm the file content matches what you intended — CDN caching can sometimes serve a stale version.
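To test a specific crawler against your rules without guessing, Python's standard urllib.robotparser evaluates the same Allow/Disallow logic most crawlers follow (exact matching details can vary slightly by crawler). A sketch that parses a sample file inline; for a live check, point it at your real robots.txt URL instead:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Live check: rp.set_url("https://yoursite.com/robots.txt"); rp.read()
# Here we parse a sample file inline to show the mechanics.
rp.parse("""
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
""".splitlines())

print(rp.can_fetch("GPTBot", "https://yoursite.com/private/report"))  # False
print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/post"))       # True
```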

  3. OpenAI crawler status: OpenAI's platform settings provide GPTBot crawl status visibility for sites you've verified (requires an OpenAI API account).


Our Mistake Log

  • Mistake 1: Relying on plugin-generated robots.txt without manual verification. Yoast and Rank Math both have their own robots.txt generation logic, and configurations can be overwritten silently after plugin updates. Our current standard: robots.txt is tracked in version control, and an automated check runs after every plugin update.

  • Mistake 2: Updating robots.txt without also submitting a fresh sitemap. robots.txt tells crawlers which pages are accessible. The sitemap tells crawlers which pages exist. Publishing new content without updating the sitemap means AI crawlers may not discover those pages until the next natural crawl cycle — typically 4–8 weeks behind a manual submission.

  • Mistake 3: Setting Crawl-delay too high. We've seen configurations with Crawl-delay: 30 — a 30-second delay between requests. That's not protecting server capacity; it's discouraging crawlers from completing a full site pass. Most B2B sites have fewer than 200 pages. Either no Crawl-delay at all or a 2–3 second value is sufficient for any reasonable server load, and note that several major crawlers, including Googlebot, ignore the directive entirely.
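If you keep a delay at all, the whole directive is two lines, and only crawlers that honor it are affected:

User-agent: *
Crawl-delay: 2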


Do This Today (10 Minutes)

Open your site's robots.txt: https://yoursite.com/robots.txt

Look for either of these patterns:

User-agent: *
Disallow: /

or:

User-agent: GPTBot
Disallow: /

If either is present, this is your first GEO emergency fix. Remove the unnecessary Disallow rules, apply the template from this guide, and upload the updated file.

Verification tool: https://search.google.com/search-console (free, enter your domain to get started)


PONT AI | Shenzhen, China | https://pontai.cloud
Full-cycle GEO optimization covering technical setup, content creation, and AI platform coverage across DeepSeek, ChatGPT, Kimi, Doubao, ERNIE Bot (Wenxin), and 5 other platforms.

