Your robots.txt Might Be Blocking AI Crawlers: A Quick Audit Guide
The file nobody reads until something breaks
Your robots.txt is a plain text file that lives at the root of your domain. It tells web crawlers which parts of your site they are allowed to visit. Most websites have one, and most owners have never opened it.
That is fine until AI crawlers show up. A robots.txt file written in 2019 does not know that GPTBot, PerplexityBot, or ClaudeBot exist. But a robots.txt file copied from a template or generated by a plugin might still contain rules that apply to them, sometimes by accident.
If an AI crawler hits your robots.txt and finds a rule that blocks it, that crawler walks away. Your content never enters the index. You never get cited. And you never see an error, because robots.txt blocks are silent by design.
How to find and open your robots.txt
Type your domain into your browser's address bar and add /robots.txt at the end, for example yoursite.com/robots.txt. If a file exists, you will see its contents. If you get a 404, you do not have one.
Both outcomes matter. A missing robots.txt means crawlers can access everything by default, which is usually fine. A present robots.txt means you have rules in effect, and you need to know what they are.
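If you prefer to check from a script, the short sketch below fetches the file and reports what came back. It assumes Python 3 and uses only the standard library; swap yoursite.com for your own domain.

import urllib.request
import urllib.error

# Fetch robots.txt and report whether it exists (yoursite.com is a placeholder)
url = "https://yoursite.com/robots.txt"
try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        print(f"Found robots.txt (HTTP {resp.status}):")
        print(resp.read().decode("utf-8", errors="replace"))
except urllib.error.HTTPError as err:
    if err.code == 404:
        print("No robots.txt (HTTP 404): crawlers can access everything by default.")
    else:
        print(f"Unexpected response: HTTP {err.code}")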
The AI crawlers you need to know about
These are the user-agents most commonly used by AI systems to fetch web content. If your robots.txt blocks any of them, that AI system cannot read your site. A short script for testing all of them at once follows the list.
- GPTBot: OpenAI's crawler. Used to gather training data and, in some cases, fetch content for ChatGPT responses.
- OAI-SearchBot: OpenAI's search-specific crawler, used for ChatGPT's browsing and search features.
- ChatGPT-User: Fetches live pages when a ChatGPT user clicks or requests a URL during a conversation.
- PerplexityBot: Perplexity's crawler. Gathers content for citations in Perplexity answers.
- ClaudeBot: Anthropic's crawler for Claude.
- Google-Extended: Controls whether Google can use your content for Gemini (formerly Bard) and Vertex AI, separate from regular Googlebot indexing.
- CCBot: Common Crawl. Not an AI company itself, but its dataset is used by many AI model trainers.
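Python's built-in robotparser gives a quick way to test all of these against your live file. The sketch below assumes yoursite.com is your domain and checks whether each bot may fetch your homepage; the standard-library parser does not understand every wildcard extension, so treat a blocked result as a prompt to open the file and read the rule yourself.

import urllib.robotparser

# User-agents to test (the AI crawlers listed above)
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
           "ClaudeBot", "Google-Extended", "CCBot"]

rp = urllib.robotparser.RobotFileParser("https://yoursite.com/robots.txt")
rp.read()  # download and parse the live robots.txt

for bot in AI_BOTS:
    allowed = rp.can_fetch(bot, "https://yoursite.com/")
    print(f"{bot:16} {'allowed' if allowed else 'BLOCKED'}")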
Five common rules that accidentally block AI
Here are the patterns we see most often when auditing sites that are missing from AI results.
1. Explicit AI crawler blocks
Someone added these on purpose, often after reading an article about AI scraping. The block looks like this:
User-agent: GPTBot
Disallow: /
This does exactly what it says. GPTBot cannot access any page on the site. If you want ChatGPT to cite you, this rule needs to go.
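If you decide the block should go, delete that group entirely, or replace it with an explicit allow, something like:

User-agent: GPTBot
Allow: /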
2. Blanket blocks that catch everything
A Disallow rule under "User-agent: *" applies to every crawler that does not have its own named group, and that includes every AI crawler listed above. If your robots.txt blocks /blog or /content globally, every AI crawler is blocked from those paths too.
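A typical example, shown here only as an illustration of the pattern:

User-agent: *
Disallow: /blog
Disallow: /content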
3. Staging leftovers
Sites that were once on staging or under development sometimes still have "Disallow: /" rules left over from launch. This blocks every crawler from the entire site. It happens more often than you would expect.
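The leftover usually looks like these two lines, which shut every crawler out of every page:

User-agent: *
Disallow: /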
4. Sitemap missing or pointing to a 404
Your robots.txt should reference your sitemap so crawlers can find all your pages. If the sitemap line is missing, or points to a URL that returns a 404, AI crawlers have a harder time discovering content beyond your homepage.
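One way to verify it is to request the sitemap URL yourself. A minimal sketch, assuming Python 3 and treating yoursite.com/sitemap.xml as a placeholder for whatever your Sitemap: line points to:

import urllib.request
import urllib.error

# Confirm the sitemap referenced in robots.txt actually resolves
sitemap = "https://yoursite.com/sitemap.xml"
try:
    with urllib.request.urlopen(sitemap, timeout=10) as resp:
        print(f"Sitemap OK (HTTP {resp.status})")
except urllib.error.HTTPError as err:
    print(f"Sitemap problem: HTTP {err.code}")  # a 404 here means crawlers cannot use it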
5. Blocking /wp-content or /uploads
Some WordPress templates block these paths, usually following old advice about hiding WordPress internals, but they also hold images, PDFs, and other content that AI models may want to reference. Review these blocks carefully.
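The block in question often looks something like this (a common old template pattern, not necessarily what your plugin generated):

User-agent: *
Disallow: /wp-content/
Disallow: /wp-includes/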
A safe starting point for most sites
If you want AI crawlers to have full access to your public content and you have nothing to hide, a permissive robots.txt is the best default. Something like this works for most businesses:
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
This tells every crawler, AI or otherwise, that they may access any page, and points them to your sitemap so they can find everything efficiently.
When you might want to block AI
There are legitimate reasons to restrict AI access. Publishers with paywalled content, sites with proprietary research, and businesses that do not want their copy used for training data all have valid reasons to block specific AI bots.
The key is to make that choice deliberately, not by accident. Blocking GPTBot while wondering why ChatGPT never cites you is a contradiction that costs traffic and authority.
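If you do make that choice, say it explicitly, one group per bot, so the intent is visible in the file itself. A sketch of a deliberate split that blocks two training-oriented crawlers while leaving everything else open:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /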
How to verify your robots.txt is working correctly
DidItIndex checks your robots.txt as part of the SEO Readiness and Technical Trust modules. It flags missing sitemaps, overly broad Disallow rules, and blocks against known AI user-agents. If anything in your robots.txt is preventing AI models from reaching your content, the scan report calls it out with a plain-English explanation and a suggested fix.
Audit your robots.txt once. Fix what needs fixing. Then move on to the parts of AI visibility that actually require ongoing attention, like content depth and schema markup.
Check your own AI visibility
Scan any URL across 5 AI visibility modules in minutes. Free credits on signup.
Scan Your Site Free