Robots.txt is a file that resides in the root directory of a website. It acts as an instruction manual that tells search engines which pages or files they may request from the site, and it helps prevent the site from being overloaded with crawler requests.
The first thing search engines do when visiting a site is look for and check the contents of the robots.txt file. Based on the instructions specified in the file, they build a list of URLs they can crawl and index for that website. Using an asterisk (*) wildcard makes it easy to assign directives to all user agents at once.
The directives used in robots.txt are:
- User-agent – The specific search engine bot (crawler) to which the crawl instructions that follow apply.
- Disallow – A command that instructs the bot not to crawl a particular URL.
- Allow – A command that tells the bot to crawl a particular URL, even in an otherwise disallowed directory.
- Sitemap – Specifies the location of the site's sitemap(s) to the bot. Best practice is to place sitemap directives at the beginning or end of the robots.txt file.
- Crawl-delay – Specifies the number of seconds a crawler should wait before crawling a page. Google no longer honors it, but Yahoo and Bing do.
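Putting these directives together, here is a minimal sketch of a robots.txt file; the domain, paths, and delay value are placeholders chosen for illustration:

```
# Apply the following rules to all crawlers via the * wildcard
User-agent: *
Disallow: /admin/            # do not crawl anything under /admin/
Allow: /admin/public/        # except this otherwise disallowed subdirectory
Crawl-delay: 10              # honored by Bing and Yahoo; ignored by Google

# Sitemap location, placed at the end per common best practice
Sitemap: https://www.example.com/sitemap.xml
```

Note how the Allow rule carves a specific path out of the broader Disallow, which is useful when only part of a directory should remain crawlable.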
With a robots.txt file, you can control how search engine spiders interact with and see your website's pages. When Googlebot arrives at your site, it first checks whether a robots.txt file exists. If it does, make sure it is not blocking content that could boost your rankings. It is prudent to use Google's robots.txt testing tool in Search Console, as it will tell you whether essential pages or information have been blocked. With robots.txt, you can control the following:
- Image files – Robots.txt can block images that you do not want to appear in search results.
- Non-image files – It is best to use robots.txt only to control crawl traffic for these. This is especially important if you have similar pages on the site or pages deemed unimportant for search engine ranking.
- Resource files – You can block these if their absence does not significantly affect the pages. Such resource files could be unimportant style, script, or image files.
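For instance, here is a sketch of rules along these lines, with hypothetical paths chosen purely for illustration:

```
# Keep a directory of images out of Google Images results
User-agent: Googlebot-Image
Disallow: /images/private/

# Keep ordinary crawlers away from low-value pages and resource files
User-agent: *
Disallow: /print-versions/   # near-duplicate, printer-friendly pages
Disallow: /assets/legacy/    # old style and script files the pages no longer need
```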
Whether your website is small or large, it is important to have a robots.txt file, as it gives you more control over how search engines move through your website. Be careful, though: a single accidental disallow directive can prevent Googlebot from crawling your entire site. Below are some common cases where the file comes in handy.
- Prevents server overload.
- Prevents sensitive information from being exposed.
- Prevents the crawl budget from being wasted.
- Prevents crawling of duplicate content.
- Prevents indexing of unnecessary files on your website (e.g., images, videos, PDFs).
- Helps keep sections of your website private (e.g., a staging site).
- Prevents crawling of internal search result pages.
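A short sketch covering a few of these cases; the staging and search paths are assumptions for illustration:

```
User-agent: *
Disallow: /staging/          # keep the private staging area out of crawls
Disallow: /search            # internal search result pages
Disallow: /*.pdf$            # PDFs; the * and $ wildcards are supported by Google and Bing
```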
However, be careful when changing your robots.txt file, as a mistake there can make significant parts of your website inaccessible to search engines.