/// SYSTEM TOOL V.4.0

ROBOTS.TXT SCANNER

/// KNOWLEDGE BASE

THE ULTIMATE GUIDE TO ROBOTS.TXT

Master the "Gatekeeper" of your website. Control traffic, manage crawl budgets, and defend against AI scraping.

What is Robots.txt?

The robots.txt file is a simple text file that sits at the root of your website. It acts as the first point of contact for any bot (or "crawler") visiting your site, including Googlebot, Bingbot, and modern AI scrapers.

Think of it as the Rule of Law for your server. While polite bots (like Google) respect these laws, malicious bots may ignore them. It dictates:

  • Which pages bots can visit.
  • Which pages bots cannot visit.
  • Where your Sitemap is located.
  • How fast they should crawl (Crawl-delay).
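You can check these rules programmatically. Here is a minimal sketch using Python's standard-library `urllib.robotparser` (the rules and `example.com` domain are illustrative; `site_maps()` requires Python 3.8+):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt; example.com is a placeholder domain.
rules = """
User-agent: *
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Blocked path is refused; everything else is allowed.
print(parser.can_fetch("MyBot", "https://example.com/wp-admin/options.php"))
print(parser.can_fetch("MyBot", "https://example.com/blog/hello"))
print(parser.site_maps())
```

In production you would point the parser at the live file with `set_url()` and `read()` instead of parsing an inline string.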

Why is it Critical for SEO?

An optimized robots.txt file is the foundation of technical SEO. Without it, you are leaving your site's indexability to chance.

Crawl Budget Preservation

Search engines have a limited "budget" of time/resources to crawl your site. Blocking useless pages (admin panels, internal search, filters) ensures they spend that budget on your high-value content.

Prevents Duplicate Content

By disallowing print versions, session IDs, or checkout pages, you stop Google from indexing multiple versions of the same page, which dilutes your ranking power.

Anatomy of a Directive

Understanding the syntax is crucial to avoid catastrophic SEO errors.

The Target

User-agent:

Defines WHO the rule applies to.

# Applies to everyone
User-agent: *

# Applies only to Google
User-agent: Googlebot

The Block

Disallow:

Tells the bot NOT to access a specific path.

# Block admin area
Disallow: /wp-admin/

# Block everything (Careful!)
Disallow: /

The Exception

Allow:

Overrides a Disallow for a child path.

# Block folder, but allow image
Disallow: /private/
Allow: /private/logo.png
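The interplay of Allow and Disallow can be tested locally with Python's standard-library `urllib.robotparser`. One caveat: Python's parser applies the first matching rule, while Google picks the most specific (longest) match, so the Allow line is placed first in this illustrative snippet (`example.com` is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; Allow comes first because urllib.robotparser
# uses first-match order, unlike Google's longest-match rule.
rules = """
User-agent: *
Allow: /private/logo.png
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The exception wins for the image; the block wins for everything else.
print(parser.can_fetch("MyBot", "https://example.com/private/logo.png"))
print(parser.can_fetch("MyBot", "https://example.com/private/notes.txt"))
```
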

New Threat Vector

Blocking AI Scrapers & LLMs

With the rise of Large Language Models (LLMs), companies like OpenAI, Google, and Anthropic are aggressively scraping the web to train their AI. This consumes your server resources and uses your content without attribution.

The Nexus Scanner specifically checks for these modern directives. You can protect your data sovereignty by explicitly blocking these bots.

# Block ChatGPT
User-agent: GPTBot
Disallow: /

# Block Common Crawl (Used by many AIs)
User-agent: CCBot
Disallow: /

# Block Google Gemini
User-agent: Google-Extended
Disallow: /
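Before deploying rules like these, you can verify that an AI crawler is blocked while search bots stay welcome. A quick sketch with Python's standard-library `urllib.robotparser` (`example.com` is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block GPTBot, allow everyone else.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))
print(parser.can_fetch("Googlebot", "https://example.com/article"))
```
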

Frequently Asked Questions

Does Robots.txt stop hackers?

No. Robots.txt is a "gentleman's agreement." Legitimate bots (Google, Bing) respect it. Hackers, bad bots, and email scrapers ignore it completely. Do not use it to hide sensitive files—use password protection or server-side rules (.htaccess) instead.

Can I block specific pages but not the folder?

Yes. If you have a folder `/products/` and want to block only `/products/confidential-item`, you can target that specific URL. You do not need to block the entire parent directory.
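For example, a rule for that scenario looks like this (the paths are illustrative):

```
User-agent: *
Disallow: /products/confidential-item
```

All other URLs under /products/ remain crawlable.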

What is the "Crawl-delay" directive?

This tells bots to wait a certain number of seconds between requests to avoid crashing your server. Note: Googlebot ignores Crawl-delay. It is mostly respected by Bing and Yandex.
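A typical directive, here asking Bing's crawler to wait 10 seconds between requests (the value is illustrative):

```
User-agent: Bingbot
Crawl-delay: 10
```

For Googlebot, crawl rate is managed through Google Search Console instead.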

How do I fix "Indexed, though blocked by robots.txt"?

This Google Search Console error means Google found the page via a link but couldn't read the content because you blocked it. To remove it from the index entirely, allow crawling but add a noindex meta tag to the page header.
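The tag goes in the page's `<head>`:

```html
<meta name="robots" content="noindex">
```

Remember: if the page stays blocked in robots.txt, Google can never see this tag, so the rule and the tag must not be combined.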

Should I block CSS and JS files?

Absolutely not. Google renders pages like a modern browser. If you block `.js` or `.css` files, Google cannot see your layout, responsive design, or mobile-friendliness, which will severely hurt your SEO rankings.

Is the file case-sensitive?

Yes, for paths. /Admin/ is different from /admin/. Directive names (Disallow, Allow) are case-insensitive, but the URL paths they match are case-sensitive, so ensure your rules match your actual URL structure exactly.
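You can see the case-sensitivity of paths directly with Python's standard-library `urllib.robotparser` (the rule and `example.com` are illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Path matching is case-sensitive: only the lowercase path is blocked.
print(parser.can_fetch("MyBot", "https://example.com/admin/panel"))
print(parser.can_fetch("MyBot", "https://example.com/Admin/panel"))
```
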

NEXUS SYSTEM TOOLS V4.0 // END OF REPORT