Robots.txt is a plain text file placed in the root directory of a website (accessible at yourdomain.com/robots.txt) that tells search engine crawlers which pages or sections of the site they may or may not crawl. It follows the Robots Exclusion Protocol (REP), a convention dating back to 1994 that was formalized as RFC 9309 in 2022. Robots.txt is a crawl directive tool, not a security measure: it guides compliant crawlers but cannot enforce access restrictions against malicious bots.

How Robots.txt Works

The file contains directives written in a simple syntax. "User-agent" specifies which crawler the rule applies to (use * for all crawlers, or a specific bot name like Googlebot). "Disallow" tells the specified crawler not to access a given path. "Allow" explicitly permits crawling of a path that would otherwise be blocked by a broader Disallow rule. "Sitemap" specifies the location of the XML sitemap. For example:

User-agent: *
Disallow: /admin/
Allow: /admin/public/

This blocks all bots from /admin/ except the /admin/public/ subfolder.

Important: Blocking a URL in robots.txt prevents crawling but does not prevent indexing. Google can still index a disallowed URL if it discovers the URL through links — it just won't see the page's content. To prevent indexing, use a noindex meta tag on the page itself.
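The crawl rules described above can be exercised with Python's standard-library urllib.robotparser, which is a quick way to sanity-check directives before deploying them. One caveat: this parser applies rules in file order (first match wins) rather than Google's longest-match precedence, so the more specific Allow line is placed first in this sketch; the paths and domain are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The Allow line comes first because
# urllib.robotparser uses first-match-in-file-order, unlike Google's
# longest-match rule; this ordering makes both interpretations agree.
rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# /admin/settings matches Disallow: /admin/  -> blocked
print(rp.can_fetch("*", "https://example.com/admin/settings"))
# /admin/public/faq matches the Allow rule   -> crawlable
print(rp.can_fetch("*", "https://example.com/admin/public/faq"))
# /blog/post matches no rule                 -> crawlable by default
print(rp.can_fetch("*", "https://example.com/blog/post"))
```

Remember that can_fetch only answers "may this be crawled?" — as noted above, a blocked URL can still end up indexed if other sites link to it.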

Why It Matters for SEO

Robots.txt is a fundamental crawl management tool with significant SEO implications:

  • Blocking unnecessary pages (admin areas, search result pages, staging URLs) conserves crawl budget
  • Accidentally blocking important pages is a common and serious SEO mistake
  • Search Console's Page indexing report flags URLs as "Indexed, though blocked by robots.txt" when a blocked page gets indexed anyway, typically via external links
  • The Sitemap directive in robots.txt helps search engines discover your XML sitemap
  • Search Console's robots.txt report (which replaced the older robots.txt Tester) shows which robots.txt files Google has fetched and surfaces fetch or parsing errors
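Putting these points together, a typical crawl-budget-oriented robots.txt might look like the following sketch (the paths and sitemap URL are hypothetical placeholders, not recommendations for any particular site):

```
User-agent: *
Disallow: /admin/
Disallow: /search
Disallow: /staging/

Sitemap: https://www.example.com/sitemap.xml
```

As noted earlier, these Disallow lines only ask compliant crawlers not to fetch those paths; genuinely sensitive areas still need real access controls such as authentication.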