Robots.txt is a plain text file placed in the root directory of a website (accessible at yourdomain.com/robots.txt) that tells search engine crawlers which pages or sections of the site they may or may not crawl. It follows the Robots Exclusion Protocol (REP), a convention dating back to 1994 that was formalized as RFC 9309 in 2022. Robots.txt is a crawl directive tool, not a security measure: it guides compliant crawlers but cannot enforce access restrictions against malicious bots.

How Robots.txt Works

The file contains directives written in a simple syntax. "User-agent" specifies which crawler the rule applies to (use * for all crawlers, or a specific bot name like Googlebot). "Disallow" tells the specified crawler not to access a given path. "Allow" explicitly permits crawling of a path that would otherwise be blocked by a broader Disallow rule. "Sitemap" specifies the location of the XML sitemap. For example:

User-agent: *
Disallow: /admin/
Allow: /admin/public/

This blocks all bots from /admin/ except the /admin/public/ subfolder.

Important: Blocking a URL in robots.txt prevents crawling but does not prevent indexing. Google can still index a disallowed URL if it discovers the URL through links — it just won't see the page's content. To prevent indexing, use a noindex meta tag on the page itself.
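The crawl rules described above can be exercised with Python's standard-library urllib.robotparser, which is a quick way to sanity-check directives before deploying them. One caveat: this parser applies rules in file order (first match wins) rather than Google's longest-match precedence, so the more specific Allow line is placed first in this sketch; the paths and domain are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The Allow line comes first because
# urllib.robotparser uses first-match-in-file-order, unlike Google's
# longest-match rule; this ordering makes both interpretations agree.
rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# /admin/settings matches Disallow: /admin/  -> blocked
print(rp.can_fetch("*", "https://example.com/admin/settings"))
# /admin/public/faq matches the Allow rule   -> crawlable
print(rp.can_fetch("*", "https://example.com/admin/public/faq"))
# /blog/post matches no rule                 -> crawlable by default
print(rp.can_fetch("*", "https://example.com/blog/post"))
```

Remember that can_fetch only answers "may this be crawled?" — as noted above, a blocked URL can still end up indexed if other sites link to it.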

Why It Matters for SEO

Robots.txt is a fundamental crawl management tool with significant SEO implications:

  • Blocking unnecessary pages (admin areas, search result pages, staging URLs) conserves crawl budget
  • Accidentally blocking important pages is a common and serious SEO mistake
  • Search Console's Page indexing report flags URLs as "Indexed, though blocked by robots.txt" when a blocked page gets indexed anyway, typically via external links
  • The Sitemap directive in robots.txt helps search engines discover your XML sitemap
  • Search Console's robots.txt report (which replaced the older robots.txt Tester) shows which robots.txt files Google has fetched and surfaces fetch or parsing errors
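Putting these points together, a typical crawl-budget-oriented robots.txt might look like the following sketch (the paths and sitemap URL are hypothetical placeholders, not recommendations for any particular site):

```
User-agent: *
Disallow: /admin/
Disallow: /search
Disallow: /staging/

Sitemap: https://www.example.com/sitemap.xml
```

As noted earlier, these Disallow lines only ask compliant crawlers not to fetch those paths; genuinely sensitive areas still need real access controls such as authentication.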