Robots.txt is a plain text file placed in the root directory of a website (accessible at yourdomain.com/robots.txt) that tells search engine crawlers which pages or sections of the site they may or may not crawl. It follows the Robots Exclusion Protocol (REP), a convention dating back to 1994 and formalized as RFC 9309 in 2022. Robots.txt is a crawl directive tool, not a security measure: it guides compliant crawlers but cannot enforce access restrictions against malicious bots.
How Robots.txt Works
The file contains directives written in a simple syntax, one per line. "User-agent" specifies which crawler the following rules apply to (use * for all crawlers, or a specific bot name like Googlebot). "Disallow" tells the specified crawler not to access a given path. "Allow" explicitly permits crawling of a path that would otherwise be blocked by a broader Disallow rule. "Sitemap" can specify the location of the XML sitemap. For example, the rules "User-agent: *", "Disallow: /admin/", and "Allow: /admin/public/" together block all bots from /admin/ except the /admin/public/ subfolder.
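Put together as a complete file, the example above might look like this (example.com, the /admin/ paths, and the sitemap URL are illustrative placeholders):

```
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
```

Compliant crawlers fetch this file before crawling other URLs on the host; each group of rules applies to the User-agent line(s) directly above it.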
Why It Matters for SEO
Robots.txt is a fundamental crawl management tool with significant SEO implications:
- Blocking unnecessary pages (admin areas, search result pages, staging URLs) conserves crawl budget
- Accidentally blocking important pages is a common and serious SEO mistake
- Search Console flags URLs that are "Indexed, though blocked by robots.txt", typically because external links point to them; blocking crawling does not guarantee a page stays out of the index
- The Sitemap directive in robots.txt helps search engines discover your XML sitemap
- Search Console's robots.txt report (which replaced the legacy robots.txt Tester) shows which robots.txt files Google found, when they were last fetched, and any parsing errors or warnings
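Beyond Google's own tooling, directives can be sanity-checked locally before deploying. A minimal sketch using Python's standard-library urllib.robotparser, with the example rules and illustrative URLs from earlier (note one caveat: urllib.robotparser applies the first matching rule, so the narrower Allow line is listed before the broader Disallow, whereas Google uses longest-path precedence):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content mirroring the article's example.
# Allow is listed first because urllib.robotparser uses first-match
# rule order, unlike Google's longest-match precedence.
robots_txt = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a generic crawler ("*") may fetch each URL.
print(parser.can_fetch("*", "https://example.com/admin/settings"))     # False (blocked)
print(parser.can_fetch("*", "https://example.com/admin/public/page"))  # True (allowed)
print(parser.can_fetch("*", "https://example.com/blog/post"))          # True (no rule applies)
```

Running a script like this against every important URL pattern is a cheap way to catch the "accidentally blocked an important page" mistake before it reaches production.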