Robots.txt SEO – How to Optimize and Validate Your Robots.txt


One of the first things you need to check and optimize when working on your technical SEO is the robots.txt file. A problem or misconfiguration in your robots.txt can cause critical SEO issues that can negatively impact your rankings and traffic.

In this post, you will learn what a robots.txt file is, why you need it, how to optimize it for SEO, and how to test that search engines can access it without any problems.

If you are on WordPress, you will find specific information about the WordPress virtual robots.txt file towards the end of this article.

What is robots.txt?

A robots.txt is a text file that resides in the root directory of your website and gives search engine crawlers instructions as to which pages they can crawl and index during the crawling and indexing process.

If you have read my previous article on how search engines work, you know that during the crawling and indexing stage, search engines try to find pages available on the public web that they can include in their index.

When visiting a website, the first thing they do is to look for and check the contents of the robots.txt file. Depending on the rules specified in the file, they create a list of the URLs they can crawl and later index for the particular website.

The contents of a robots.txt file are publicly available on the Internet. Unless protected otherwise, anyone can view your robots.txt file, so this is not the place to add content that you don't want others to see.

What happens if you don’t have a robots.txt file? If a robots.txt file is missing, search engine crawlers assume that all publicly available pages of the particular website can be crawled and added to their index.

What happens if the robots.txt is not well formatted? It depends on the issue. If search engines cannot understand the contents of the file because it is misconfigured, they will still access the website and ignore whatever is in robots.txt.

What happens if I accidentally block search engines from accessing my website? That’s a big problem. For starters, they will not crawl and index pages from your website and gradually they will remove any pages that are already available in their index.

Do you need a robots.txt file?

Yes, you definitely need to have a robots.txt file even if you don't want to exclude any pages or directories of your website from appearing in search engine results.

Why use a robots.txt?

The most common use cases of robots.txt are the following:

#1 – To block search engines from accessing specific pages or directories of your website. For example, look at the robots.txt below and notice the disallow rules.

Example of a robots.txt file

These statements instruct search engine crawlers not to crawl the specified directories. Notice that you can use an * as a wildcard character.
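As an illustration, a hypothetical set of disallow rules might look like this. The directory names are examples only, and the last rule uses the * wildcard to block any URL that contains ?print=:

User-agent: *
Disallow: /admin/
Disallow: /tmp/
Disallow: /*?print=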

#2 – When you have a big website, crawling and indexing can be a very resource intensive process. Crawlers from various search engines will be trying to crawl and index your whole site and this can create serious performance problems.

In this case, you can make use of the robots.txt to restrict access to certain parts of your website that are not important for SEO or rankings. This way, you not only reduce the load on your server, but you also make the whole indexing process faster.
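For instance, on a large site you might keep crawlers away from internal search result pages or filtered listing pages. The paths below are hypothetical and depend entirely on how your own site is structured:

User-agent: *
Disallow: /search/
Disallow: /*?sort=
Disallow: /*?filter=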

#3 – When you decide to use URL cloaking for your affiliate links. This is not the same as cloaking your content or URLs to trick users or search engines, but it's a valid process for making your affiliate links easier to manage.
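For example, if your affiliate redirects live in a dedicated folder (here a hypothetical /go/ directory), you could keep crawlers out of it like this:

User-agent: *
Disallow: /go/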

Two important things to know about robots.txt

The first thing is that any rules you add to the robots.txt are directives only. This means that it’s up to search engines to obey and follow the rules.

In most cases they do, but if you have content that you don't want to be included in their index, the best way is to password protect the particular directory or page.

The second thing is that even if you block a page or directory in robots.txt, it can still appear in the search results if it has links from other pages that are already indexed. In other words, adding a page to the robots.txt does not guarantee that it will be removed or not appear on the web.

Besides password protecting the page or directory, another way is to use page directives. These are added to the <head> of every page and they look like the example below:

<meta name="robots" content="noindex">

How does robots.txt work?

The robots.txt file has a very simple structure. There are some predefined keyword/value combinations you can use.

The most common are: User-agent, Disallow, Allow, Crawl-delay, Sitemap.

User-agent: Specifies which crawlers should take the directives into account. You can use an * to reference all crawlers or specify the name of a crawler; see the examples below.

You can view all the available names and values for the user-agent directive here.
User-agent: * – includes all crawlers.
User-agent: Googlebot – instructions are for Google bot only.

Disallow: The directive that instructs a user-agent (specified above) not to crawl a URL or part of a website.

The value of disallow can be a specific file, URL or directory. Look at the example below taken from Google support.

Example of disallow rules in robots.txt
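For illustration, a hypothetical set of disallow rules could look like the following. The paths are examples only; the last rule uses the $ character (supported by Google) to match URLs that end in .pdf:

User-agent: *
Disallow: /calendar/
Disallow: /junk/private-file.html
Disallow: /*.pdf$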

Allow: The directive that explicitly tells which pages or subfolders can be accessed. This is applicable for Googlebot only.

You can use the allow to give access to a specific sub-folder on your website, even though the parent directory is disallowed.

For example, you can disallow access to your Photos directory but allow access to your BMW sub-folder which is located under Photos.

User-agent: *
Disallow: /photos
Allow: /photos/bmw/

Crawl-delay: You can specify a crawl-delay value to force search engine crawlers to wait for a specific amount of time before crawling the next page from your website. The value you enter is in seconds.

It should be noted that the crawl-delay is not taken into account by Googlebot.
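For crawlers that have historically supported it (Bing, for example), a minimal sketch would look like this, asking the crawler to wait 10 seconds between requests:

User-agent: Bingbot
Crawl-delay: 10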

You can use Google Search Console to control the crawl rate for Google (the option is found under Site Settings).

Google Crawl rate setting in Google Search Console

You can use the crawl rate setting in cases where you have a website with thousands of pages and you don't want to overload your server with continuous requests.

In the majority of cases, you shouldn’t make use of the crawl-delay directive.

Sitemap: The sitemap directive is supported by the major search engines including Google and it is used to specify the location of your XML Sitemap.

Even if you don't specify the location of the XML sitemap in the robots.txt, search engines are still able to find it.

For example, you can use this:

Sitemap: https://example.com/sitemap.xml

Important: Robots.txt is case-sensitive. This means that if you add this directive, Disallow: /File.html will not block file.html.

How to create a robots.txt?

Creating a robots.txt file is easy. All you need is a text editor (like Brackets or Notepad) and access to your website's files (via FTP or your hosting control panel).

Before getting into the process of creating a robots file, the first thing to do is to check if you already have one.

The easiest way to do this is to open a new browser window and navigate to https://www.yourdomain.com/robots.txt

If you see something similar to the example below, it means that you already have a robots.txt file and you can edit the existing file instead of creating a new one.

User-agent: *
Allow: /
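If you prefer the command line, you can also fetch the file with curl (assuming curl is installed and you replace yourdomain.com with your actual domain):

curl https://www.yourdomain.com/robots.txt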

How to edit your robots.txt

Use your favorite FTP client and connect to your website’s root directory.

Robots.txt is always located in the root folder (www or public_html, depending on your server).

Download the file to your PC and open it with a text editor.

Make the necessary changes and upload the file back to your server.

How to create a new robots.txt

If you don’t already have a robots.txt then create a new .txt file using a text editor, add your directives, save it and upload it to the root directory of your website.

Important: Make sure that your file name is robots.txt and not anything else. Also, have in mind that the file name is case-sensitive so it should be all lowercase.

Where do you put robots.txt? robots.txt should always reside in the root of your website and not in any subfolder.

Example of a robots.txt

In a typical scenario, your robots.txt file should have the following contents:

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

This allows all bots to access your website without blocking anything. It also specifies the sitemap location to make it easier for search engines to locate it.

How to test and validate your robots.txt?

While you can view the contents of your robots.txt by navigating to the robots.txt URL, the best way to test and validate it is through the robots.txt Tester option of Google Search Console.

Login to your Google Search Console Account.

Click on robots.txt Tester, found under Crawl options.

Click the Test button.

If everything is ok, the Test button will turn green and the label will change to ALLOWED. If there is a problem, the line that causes a disallow will be highlighted.

Robots.txt Tester Tool

A few more things to know about the robots.txt tester tool:

You can use the URL Tester (bottom of the tool) to enter a URL from your website and test if it is blocked or not.

You can make any changes in the editor and check new rules, BUT in order for these to be applied to your live robots.txt, you need to EDIT your file with a text editor and upload it to your website's root folder (as explained above).

To inform Google that you have made changes to your robots.txt, click the SUBMIT button (from the screen above) and click the SUBMIT button again from the popup window (option 3 as shown below).

Robots.txt Submit Updates

Robots.txt and WordPress

Everything that you have read so far about robots.txt is applicable to WordPress websites as well.

The only things you need to know about robots.txt and WordPress are the following:

In the past, it was recommended for WordPress websites to block access to wp-admin and wp-includes folders via robots.txt.

As of 2012 this is no longer needed, since WordPress provides an @header( 'X-Robots-Tag: noindex' ); header, which does the same job as adding a disallow in robots.txt.

What is a virtual robots.txt file?

WordPress by default uses a virtual robots.txt file. This means that you cannot directly edit the file or find it in the root directory of your website.

The only way to view the contents of the file is to type https://www.yourdomain.com/robots.txt in your browser.

The default values of WordPress robots.txt are:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

When you enable the “Discourage search engines from indexing this site” option under Search Engine Visibility Settings, the robots.txt becomes:

Search Engine Visibility Settings in WordPress


User-agent: *
Disallow: /

This basically blocks all crawlers from accessing the website.

How do I edit robots.txt in WordPress?

Since you cannot directly edit the virtual robots.txt file provided by WordPress, the only way to edit it is to create a new one and add it to the root directory of your website.

When a physical file is present in the root directory, the virtual WordPress file is not taken into account.
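As a starting point, a physical file that mirrors the WordPress defaults plus a sitemap line might look like this (the sitemap URL is just a placeholder for your own):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.yourdomain.com/sitemap.xml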

Robots.txt SEO Best Practices

Test your robots.txt and make sure that you are not blocking any parts of your website that you want to appear in search engines.

Do not block CSS or JS folders. During the crawling and indexing process, Google is able to view a website like a real user, and if your pages need the JS and CSS to function properly, they should not be blocked.

If you are on WordPress, there is no need to block access to your wp-admin and wp-includes folders. WordPress does a great job using the meta robots tag.

Don't try to specify different rules per search engine bot; it can get confusing and difficult to keep up to date. It's better to use User-agent: * and provide one set of rules for all bots.

If you want to exclude pages from being indexed by search engines, it's better to do it using the meta robots tag in the header of each page and not through the robots.txt.

Conclusion

You don't have to spend too much time configuring or testing your robots.txt. What is important is to have one and to test through Google Search Console that you are not blocking search engine crawlers from accessing your website.

It’s a task you need to do once when you first create your website or as part of your technical SEO audit.


