TL;DR

Robots.txt is a plain-text file at the root of a site that tells crawlers which URLs to access and which to skip. A misconfigured robots.txt file is a common cause of pages not appearing in search results.

Key Points

✓ Robots.txt controls crawl access, not indexing: blocking a URL in robots.txt doesn't guarantee it stays out of the index if external links point to it

✓ Under 'User-agent: *', the 'Disallow: /' directive blocks all compliant crawlers from every page, a common misconfiguration that can cause an entire site to drop out of search results

✓ Different crawlers can be targeted with different rules using the User-agent directive (e.g., Googlebot, Bingbot)

✓ Robots.txt can specify the location of your XML sitemap to help crawlers discover your content structure
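
The key points above can be checked with a few lines of Python. The sketch below uses the standard library's urllib.robotparser. The file contents, the /drafts/ path, and the example.com URLs are made up for illustration, and Python's parser only approximates how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot gets its own group, and every other
# crawler falls through to the catch-all group that blocks the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group, so only /drafts/ is off limits.
googlebot_blog = parser.can_fetch("Googlebot", "https://example.com/blog/post")
googlebot_draft = parser.can_fetch("Googlebot", "https://example.com/drafts/post")

# Bingbot has no group of its own, so "Disallow: /" applies to it.
bingbot_home = parser.can_fetch("Bingbot", "https://example.com/")

# Sitemap lines are collected separately from the user-agent groups
# (site_maps() is available in Python 3.8+).
sitemaps = parser.site_maps()

print(googlebot_blog, googlebot_draft, bingbot_home, sitemaps)
```

A check like this is a cheap way to notice that a catch-all 'Disallow: /' has slipped into a file before it goes live.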

Robots.txt Syntax and Structure

A robots.txt file consists of one or more groups, each starting with a User-agent line followed by directives^[1]. User-agent specifies which bot the rules apply to (* means all bots). Disallow specifies paths the bot should not crawl. Allow re-opens specific sub-paths under a disallowed path; in Google's implementation, the most specific (longest) matching rule wins. Sitemap specifies the location of your XML sitemap. A minimal example:

    User-agent: *
    Disallow: /admin/
    Sitemap: https://example.com/sitemap.xml

Google Search Console's robots.txt report, which replaced the retired robots.txt tester, shows which robots.txt files Google found for your site and flags parsing errors or warnings. Always test changes before deploying them to avoid accidentally blocking crawlers.
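
How an Allow rule overrides a Disallow rule for a sub-path can be sketched in a few lines. This is a simplified, hand-rolled illustration of the "most specific match wins" precedence described in Google's robots.txt documentation and RFC 9309, not a full parser: it assumes plain prefix matching with no * or $ wildcards, and the function name and rule paths are hypothetical.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs for one user-agent
    group, e.g. ("disallow", "/admin/"). Simplified: plain prefix
    matching only, no * or $ wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is crawlable
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty prefix matches nothing
        is_allow = directive.lower() == "allow"
        # Longest prefix wins; on a tie, prefer the less restrictive Allow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical rule group for "User-agent: *"
rules = [
    ("disallow", "/admin/"),
    ("allow", "/admin/help/"),
]

print(is_allowed(rules, "/admin/settings"))  # False: only the Disallow matches
print(is_allowed(rules, "/admin/help/faq"))  # True: the longer Allow wins
print(is_allowed(rules, "/blog/"))           # True: no rule matches
```

Because the decision depends on prefix length rather than on the order of lines in the file, the Allow line can appear before or after the Disallow line it refines.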

What to Block in Robots.txt

Good candidates for blocking include internal search result pages (which can generate near-infinite URL variations that waste crawl budget), admin and login pages, duplicate URLs generated by filter and sorting parameters, and confirmation pages^[1]. Blocking these preserves crawl budget for your most important pages and keeps crawlers from spending time on low-value URLs. Critically, avoid blocking CSS, JavaScript, or image files: Googlebot needs them to render pages the way users see them and evaluate their content.
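
As a rough illustration, the sketch below writes a robots.txt that blocks a few such candidates and then confirms, using Python's urllib.robotparser, that content pages and static assets stay crawlable. Every path here is a hypothetical placeholder. Filter and sorting URLs usually need wildcard patterns, which are left out because the standard-library parser matches by plain prefix only.

```python
from urllib.robotparser import RobotFileParser

# All paths below are hypothetical; substitute the ones your site uses.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /admin/
Disallow: /login
Disallow: /checkout/confirmation
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

SITE = "https://example.com"

# Expected crawlability for each path: False = blocked, True = crawlable.
checks = {
    "/search?q=shoes": False,         # internal search results
    "/admin/users": False,            # admin area
    "/login": False,                  # login page
    "/checkout/confirmation": False,  # confirmation page
    "/products/blue-shoes": True,     # real content must stay open
    "/assets/css/site.css": True,     # CSS needed for rendering
    "/assets/js/app.js": True,        # JavaScript needed for rendering
}

results = {path: parser.can_fetch("Googlebot", SITE + path) for path in checks}

for path, expected in checks.items():
    status = "ok" if results[path] == expected else "UNEXPECTED"
    print(f"{status:10} {path} crawlable={results[path]}")
```

Keeping a table of expected outcomes like this next to the file makes it easier to spot a rule that is broader than intended, such as a prefix that also catches asset URLs.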

The Robots.txt vs. Noindex Distinction

A critical distinction: robots.txt blocks crawling, while a noindex meta tag controls indexing. If you block a page in robots.txt, Googlebot can't fetch the page and so never sees its noindex tag, meaning the page may still appear in search results if other sites link to it (Google will just know less about its content)^[1]. For pages you want excluded from search results, use a noindex meta tag and leave the page crawlable in robots.txt. Use robots.txt to manage crawl budget, not to hide content from search results.
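
The conflict described above can be detected mechanically. The sketch below is one possible check, assuming you already know which pages carry a noindex meta tag: it lists pages where the tag is present but the URL is blocked in robots.txt, so a crawler would never read it. The function name, the page data, and the /private/ path are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def find_hidden_noindex(robots_txt, pages, user_agent="Googlebot"):
    """Return URLs whose noindex tag a crawler can never see.

    `pages` maps each URL to True if the page carries a noindex meta tag.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, has_noindex in pages.items()
        if has_noindex and not parser.can_fetch(user_agent, url)
    ]


# Hypothetical site data for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""
pages = {
    "https://example.com/private/report": True,   # noindex, but blocked
    "https://example.com/thank-you": True,        # noindex and crawlable
    "https://example.com/private/notes": False,   # blocked, no noindex
    "https://example.com/blog/": False,           # ordinary page
}

conflicts = find_hidden_noindex(robots_txt, pages)
print(conflicts)
```

Any URL this returns needs one of two fixes: remove the Disallow rule so the noindex tag can be read, or accept that the page is only blocked from crawling and may still be indexed.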

SOURCES

Google Search Central — Robots Meta Tag Specifications

Google Search Central — Crawling and Indexing Overview

Last updated: June 8, 2026

Related Terms

Crawlability

The ability of search engine bots to access, navigate, and read the pages on your website without encountering technical barriers.

Indexing

The process by which a search engine stores and organizes crawled web pages in its database so they can be retrieved and displayed in search results.

Canonical URL

The preferred, authoritative URL for a page when multiple URLs serve the same or very similar content, usually declared to search engines with a rel="canonical" link tag.

XML Sitemap

A file (typically in XML format) that lists all the important URLs on a website, helping search engines discover and crawl content more efficiently.
