Understanding robots.txt

Author: hwahyeon
robots.txt is a text file placed in the root directory of a website that tells search engine crawlers (bots) which pages or directories of the site they may or may not access. The file follows the Robots Exclusion Protocol and serves as a set of guidelines for crawlers visiting the site.
Key Components
User-agent
Specifies which search engine crawler the rules that follow apply to.
Example: User-agent: Googlebot
To apply the rule to all crawlers, use User-agent: *.
Disallow
Specifies paths that are restricted from crawling.
Example: Disallow: /admin/ prevents the /admin/ directory from being crawled, while Disallow: / blocks the entire site.
Allow
Explicitly permits crawling of specific paths, often used to specify exceptions within restricted paths.
Example: Allow: /public/ allows crawling of the /public/ directory.
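As a small sketch of this exception pattern (the /private/ and /private/help/ paths are placeholders), the two directives can be combined like this:
User-agent: *
Disallow: /private/
Allow: /private/help/
Here every crawler is asked to skip /private/, except for the /private/help/ subdirectory, which remains crawlable.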
Sitemap
Specifies the URL of an XML sitemap, helping search engines better understand the structure of the site.
Example: Sitemap: https://example.com/sitemap.xml
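Putting the directives above together, a minimal robots.txt using the same example paths might read:
User-agent: *
Disallow: /admin/
Allow: /public/
Sitemap: https://example.com/sitemap.xml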
Example
User-agent: Google-Extended
Disallow: /
This rule instructs the Google-Extended crawler not to crawl any page on the site.
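Other crawlers are unaffected by that group. To make this explicit, the file could add a second group for all remaining crawlers (an empty Disallow value permits everything):
User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: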
Note
robots.txt does not enforce anything; it only provides guidelines that well-behaved crawlers choose to follow. Malicious crawlers may simply ignore it, so it should not be relied on to protect sensitive content.