The robots.txt file is a powerful addition to your website that helps control which areas of your site search engines should crawl and which should be ignored. It is important to review your robots.txt file regularly to make sure it is up to date and, if possible, to use a monitoring tool that alerts you when changes occur.
At Semetrical, as part of our technical SEO service offering, we audit a client’s robots.txt file when undertaking a technical audit of their website to check that the paths being blocked should be. Additionally, if the SEO team comes across issues such as duplication during the audit, new robots.txt rules may be written and added to the file.
As the robots.txt is an important file, we have put together a guide covering what it is, why you might use it and the common pitfalls that can occur when writing rules.
The robots.txt file is the first port of call for a crawler visiting your website. It is a text file listing instructions for different user agents, telling web crawlers which parts of a site should be crawled and which should be ignored. The main instructions used in a robots.txt file are specified by “allow” and “disallow” rules.
Historically, a “noindex” rule would also work; however, in 2019 Google stopped supporting the noindex directive as it was an unpublished rule.
If the file is not used properly it can be detrimental to your website and could cause a huge drop in traffic and rankings. For example, a whole website, or a section of a site, can be blocked from search engines by mistake. When this happens, the rankings connected to that part of the site will gradually drop and traffic will fall with them.
No, it is not compulsory to have a robots.txt file on your website, especially for small websites with minimal URLs, but it is highly recommended for medium to large websites. On large sites it makes it easier to control which parts of your site are accessible and which sections should be blocked from crawlers. If the file does not exist, your website will generally be crawled and indexed as normal.
The robots.txt file has many use cases and at Semetrical we have used it for the below scenarios:
A robots.txt file needs to be placed at the root of your website and must be named robots.txt; for example, on Semetrical’s site it sits at www.semetrical.com/robots.txt. A website can only have one robots.txt file and it needs to be a UTF-8 encoded text file (which includes ASCII).
If you have subdomains such as blog.example.com, then a separate robots.txt file can sit at the root of each subdomain, for example blog.example.com/robots.txt.
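As a small illustration, the location a crawler checks can be derived from any page URL with a few lines of Python using only the standard library; the page URLs below are just the examples mentioned above.
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # The robots.txt file always lives at the root of the exact host (including subdomain) being crawled
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://www.semetrical.com/services/"))  # https://www.semetrical.com/robots.txt
print(robots_txt_url("https://blog.example.com/some-post/"))   # https://blog.example.com/robots.txt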
A typical robots.txt file is made up of several components and elements, each of which is covered below.
Below is an example of Semetrical’s robots.txt that includes a user-agent, disallow rules and a sitemap.
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /comments/feed/
Disallow: /trackback/
Disallow: /index.php/
Disallow: /xmlrpc.php
Disallow: /blog-documentation/
Disallow: /test/
Disallow: /hpcontent/
Sitemap: https://devsemetrical.wpengine.com/sitemap.xml
The user-agent defines the start of a group of directives. It is often represented with a wildcard (*), which signals that the instructions below it are for all bots visiting the website. An example of this would be:
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
There will be occasions when you may want to block certain bots, or only allow certain bots to access certain pages. In order to do this you need to specify the bot’s name as the user-agent. An example of this would be:
User-agent: AdsBot-Google
Disallow: /checkout/reserve
Disallow: /resale/checkout/order
Disallow: /checkout/reserve_search
Common user-agents to be aware of include:
You can also block specific software from crawling your website, or delay how many URLs it can crawl a second, as each tool has its own user agent. For example, if you wanted to block Semrush or Ahrefs from crawling your website, the below would be added to your file:
User-agent: SemrushBot
Disallow: *
User-agent: AhrefsBot
Disallow: *
If you wanted to slow down the rate at which these tools crawl your site, the below rules would be added to your file:
User-agent: AhrefsBot
Crawl-Delay: [value]
User-agent: SemrushBot
Crawl-Delay: [value]
The disallow directive is a rule you can put in the robots.txt file to tell a search engine not to crawl a specific path or set of URLs, depending on the rule created. There can be one or many disallow lines in the file, as you may want to block multiple sections of a website.
If a disallow directive is empty and does not specify anything, bots can crawl the whole website, so in order to block certain paths or your whole website you need to specify a URL prefix or a forward slash “/”. In the example below, we are blocking any URL that runs off the path /cgi-bin/ or /wp-admin/.
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
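As a rough illustration of how these prefix rules behave, Python’s built-in urllib.robotparser can check individual URLs against them. It handles simple prefix rules like these, although it does not replicate Google’s full wildcard matching covered later in this guide, and the test URLs below are only illustrative.
import urllib.robotparser

robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Any URL whose path starts with a disallowed prefix is blocked for all bots
print(parser.can_fetch("*", "https://www.example.com/wp-admin/options.php"))  # False
print(parser.can_fetch("*", "https://www.example.com/blog/robots-guide/"))    # True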
If you wanted to block your whole website from bots such as Googlebot, you would need to add a disallow directive followed by a forward slash. Typically you would only need to do this on a staging environment, when you do not want the staging website to be found or indexed. An example would look like:
User-agent: *
Disallow: /
Most search engines will abide by the allow directive, which essentially counteracts a disallow directive. For example, blocking /wp-admin/ would usually block all the URLs that run off that path; however, if there is an allow rule for /wp-admin/admin-ajax.php, then bots will crawl /wp-admin/admin-ajax.php but still be blocked from any other path that runs off /wp-admin/. See the example below:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
The crawl-delay directive helps slow down the rate at which a bot crawls your website. Not all search engines follow the crawl-delay directive, as it is an unofficial rule:
– Google will not follow this directive
– Baidu will not follow this directive
– Bing and Yahoo support the crawl-delay directive, where the rule instructs the bot to wait “n” seconds after a crawl action
– Yandex also supports the crawl-delay directive but interprets the rule slightly differently, only accessing your site once every “n” seconds
An example of a crawl-delay directive is below:
User-agent: BingBot
Disallow: /wp-admin/
Crawl-delay: 5
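As a sketch of how a well-behaved crawler might honour this, Python’s urllib.robotparser can read the crawl-delay value for a matching user agent so the crawler can pause between requests; the actual page fetching is left out here and the paths are only placeholders.
import time
import urllib.robotparser

robots_txt = """\
User-agent: BingBot
Disallow: /wp-admin/
Crawl-delay: 5
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

delay = parser.crawl_delay("BingBot") or 0  # 5 seconds for this user agent
for path in ["/page-one/", "/page-two/"]:
    # fetch the page here, then wait before making the next request
    time.sleep(delay)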
The sitemap directive tells search engines where to find your XML sitemap, making it easy for them to discover the URLs on your website. The main search engines that follow this directive include Google, Bing, Yandex and Yahoo.
It is advised to place the sitemap directive at the bottom of your robots.txt file. An example of this is below:
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /comments/feed/
Sitemap: https://devsemetrical.wpengine.com/sitemap.xml
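As a small aside, scripts and crawlers can read the sitemap location straight out of the file; for example, Python’s urllib.robotparser (Python 3.8+) exposes any Sitemap lines via site_maps().
import urllib.robotparser

robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
Sitemap: https://devsemetrical.wpengine.com/sitemap.xml
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.site_maps())  # ['https://devsemetrical.wpengine.com/sitemap.xml']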
A robots.txt file can include comments, but comments are only for humans and not bots, as anything after a hash (#) will be ignored. Comments can be useful for multiple reasons, which include:
– Provides a reason why certain rules are present
– References who added the rules
– References which parts of a site the rules are for
– Explains what the rules are doing
Below are examples of comments in different robots.txt files:
#Student
Disallow: /student/*-bed-flats-*
Disallow: /student/*-bed-houses*
Disallow: /comments/feed/
#Added by Semetrical
Disallow: /jobs*/full-time/*
Disallow: /jobs*/permanent/*
#International
Disallow: */company/fr/*
Disallow: */company/de/*
The ordering of rules is not important; however, when several allow and disallow rules apply to a URL, the longest matching path is the one that is applied and takes precedence over a less specific, shorter rule. If both paths are the same length, the less restrictive rule is used. If you need a specific URL path to be allowed or disallowed, you can make the rule longer by utilising “*” to lengthen the string, for example: Disallow: ********/make-longer
Google’s own documentation lists a sample set of situations showing which rule takes precedence in each case.
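To make the precedence logic concrete, below is a minimal Python sketch of the longest-match behaviour described above. It uses plain path prefixes only (wildcards are ignored for brevity) and is an illustration of the logic, not Google’s own parser.
def is_allowed(path, rules):
    # rules is a list of (directive, path_prefix) tuples, e.g. ("disallow", "/wp-admin/")
    best_length = -1
    best_allow = True  # if no rule matches, the URL may be crawled
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            allow = (directive == "allow")
            # the longest matching path wins; on a tie, the less restrictive (allow) rule wins
            if len(prefix) > best_length or (len(prefix) == best_length and allow):
                best_length = len(prefix)
                best_allow = allow
    return best_allow

rules = [("disallow", "/wp-admin/"), ("allow", "/wp-admin/admin-ajax.php")]
print(is_allowed("/wp-admin/admin-ajax.php", rules))  # True  - the longer allow rule wins
print(is_allowed("/wp-admin/options.php", rules))     # False - only the disallow rule matches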
It is always important to check and validate your robots.txt file before pushing it live, as incorrect rules can have a big impact on your website.
The best way to test is to use the robots.txt tester tool in Search Console and check the URLs that should be blocked against the rules that are in place. This is also a great way to test any new rules you want to add to the file.
When creating rules in your robots.txt file, you can use pattern matching to block a range of URLs with a single disallow rule. Full regular expressions are not supported, but the two main pattern-matching characters that both Google and Bing abide by are the asterisk (*), which matches any sequence of characters, and the dollar sign ($), which marks the end of the URL.
Examples of pattern matching at Semetrical:
Disallow: */searchjobs/*
This will block any URL that includes the path /searchjobs/, such as www.example.com/searchjobs/construction. This was needed for a client whose search section had to be blocked so that search engines would not crawl and index that part of the site.
Disallow: /jobs*/full-time/*
This will block URLs that include a path after /jobs/ followed by /full-time/, such as www.example.com/jobs/admin-secretarial-and-pa/full-time/. In this scenario full time is needed as a filter for UX, but for search engines there is no need to index a page for every “job title” + “full time” combination.
Disallow: /jobs*/*-000-*-999/*
This will block URLs that include salary filters, such as www.example.com/jobs/city-of-bristol/-50-000-59-999/. In this scenario salary filters are needed for users, but there was no need for search engines to crawl and index the salary pages.
Disallow: /jobs/*/*/flexible-hours/
This will block URLs that include flexible-hours with two facet paths in between. In this scenario we found via keyword research that users may search for location + flexible hours or job + flexible hours, but not for “job title” + “location” + “flexible hours”. An example URL looks like www.example.com/jobs/admin-secretarial-and-pa/united-kingdom/flexible-hours/.
Disallow: */company/*/*/*/people$
This will block URLs that include three path segments between company and people, where the URL also ends with people. An example would be www.example.com/company/gb/04905417/company-check-ltd/people.
Disallow: *?CostLowerAsNumber=*
This rule would block a parameter filter used to order results by price.
Disallow: *?Radius=*
Disallow: *?radius=*
These two rules blocked bots from crawling a parameter URL that changed the radius of a user’s search. Both an uppercase and a lowercase rule were added because robots.txt rules are case-sensitive and the site included both versions of the parameter.
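To show how the asterisk and dollar sign behave in practice, the short Python sketch below expands a rule into a regular expression and tests it against example paths. It mirrors the matching behaviour described in this guide but is not an official parser; the rules are taken from the patterns above and the test paths are only illustrative.
import re

def rule_to_regex(rule_path):
    # "$" at the end of a rule anchors the match to the end of the URL
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # "*" matches any sequence of characters; everything else is treated literally
    pattern = re.escape(rule_path).replace(r"\*", ".*")
    return re.compile("^" + pattern + ("$" if anchored else ""))

tests = [
    ("/jobs*/full-time/*", "/jobs/admin-secretarial-and-pa/full-time/"),
    ("*/company/*/*/*/people$", "/company/gb/04905417/company-check-ltd/people"),
    ("*?radius=*", "/search?radius=10"),
]
for rule, path in tests:
    print(rule, "matches", path, ":", bool(rule_to_regex(rule).match(path)))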
If you would like to speak with one of our technical SEO specialists at Semetrical please visit our technical SEO services page for more information.