
Robots.txt is a simple and effective tool for instructing search engine crawlers on how to crawl your website.
It is not all-powerful (according to Google, “it is not a mechanism for keeping a web page out of Google”), but it can help to keep your site or server from becoming overloaded by crawler requests.
If you have this crawl block on your site, you must ensure that it is being used correctly.
This is especially important if you use dynamic URLs or other methods that can generate an infinite number of pages.
In this guide, we’ll look at some of the most common issues with the robots.txt file, the impact they can have on your website and search presence, and how to fix them if you suspect they’ve occurred.
But first, let’s go over robots.txt and its alternatives.
What Exactly Is Robots.txt?
Robots.txt is a plain text file that is placed in your website’s root directory.
It must be in your site’s root directory; if you put it in a subdirectory, search engines will simply ignore it.
Despite its immense power, robots.txt is frequently a simple document, and a basic robots.txt file can be created in a matter of seconds using an editor such as Notepad.
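As an illustration, a minimal robots.txt file (with a hypothetical directory path) might look like this:

```text
User-agent: *
Disallow: /admin/
```

This tells all crawlers not to crawl anything under /admin/ while leaving the rest of the site open.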
There are other ways to accomplish some of the same objectives that robots.txt is commonly used for.
Individual pages can contain a robots meta tag within the page code.
You can also use the X-Robots-Tag HTTP header to control how (and if) content appears in search results.
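For instance, a page-level noindex can be expressed either way; the meta tag goes in the page's HTML, while the header is set in the server's response (useful for PDFs and other non-HTML files):

```text
# Robots meta tag, placed in the page's <head>:
<meta name="robots" content="noindex">

# Equivalent X-Robots-Tag HTTP response header:
X-Robots-Tag: noindex
```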
What Functions Does Robots.txt Have?
Robots.txt can produce a variety of results across a wide range of content types:
Crawling of web pages can be prevented.
They may continue to appear in search results but without a text description. Non-HTML content on the page will also not be crawled.
It is possible to prevent media files from appearing in Google search results.
Images, video, and audio files are all included.
If the file is public, it will still ‘exist’ online and can be viewed and linked to, but it will not appear in Google searches.
Unwanted external scripts and resource files can be blocked.
However, if Google crawls a page that relies on such a resource to load, Googlebot will see a version of the page as if that resource did not exist, which may affect indexing.
You cannot use robots.txt to prevent a website from appearing in Google’s search results.
To accomplish this, you must use a different method, such as adding a noindex meta tag to the page’s head.
Robots.txt Errors: How Dangerous Are They?
A typo in robots.txt can have unintended consequences, but it’s not always the end of the world.
The good news is that by repairing your robots.txt file, you can recover quickly and (usually) completely from any errors.
On the subject of robots.txt errors, Google advises web developers as follows:
“Web crawlers are generally very flexible and typically will not be swayed by minor mistakes in the robots.txt file. In general, the worst that can happen is that incorrect [or] unsupported directives will be ignored.
Bear in mind though that Google can’t read minds when interpreting a robots.txt file; we have to interpret the robots.txt file we fetched. That said, if you are aware of problems in your robots.txt file, they’re usually easy to fix.”
6 Common Robots.txt Errors
- There is no Robots.txt file in the root directory.
- Inadequate use of wildcards.
- In Robots.txt, noindex is specified.
- Scripts and stylesheets are being blocked.
- There is no Sitemap URL.
- Access to development sites.
If your website is behaving strangely in search results, check your robots.txt file for any typos, syntax errors, or overreaching rules.
Let’s take a closer look at each of the above mistakes and how to ensure you have a valid robots.txt file.
1. The Robots.txt file is not in the root directory.
The file can only be found by search robots if it is in your root folder.
That is why, in the URL of your robots.txt file, there should be only a forward slash between the .com (or equivalent domain) of your website and the ‘robots.txt’ filename.
If there’s a subfolder in there, your robots.txt file is probably invisible to search robots, and your website is likely to behave as if there isn’t a robots.txt file at all.
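For example (with a hypothetical domain), the difference looks like this:

```text
https://www.example.com/robots.txt        <- found by search robots
https://www.example.com/blog/robots.txt   <- ignored by search robots
```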
To resolve this problem, copy your robots.txt file to your root directory.
It’s important to note that you’ll need root access to your server to do this.
Some content management systems will automatically upload files to the media subdirectory (or something similar), so you may need to work around this to get your robots.txt file in the right place.
2. Inadequate Use of Wildcards
Robots.txt allows for two wildcard characters:
The asterisk * matches any instance of a valid character, like a Joker in a deck of cards.
The dollar sign $ denotes the end of a URL, allowing you to apply rules only to the final part of the URL, such as the filetype extension.
It’s best to use wildcards sparingly, as they have the potential to restrict access to a much larger portion of your website.
It’s also relatively easy to accidentally block robot access to your entire site with a carelessly placed asterisk.
To resolve a wildcard issue, locate the incorrect wildcard and move or remove it so that your robots.txt file functions properly.
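As a sketch (the paths are hypothetical), the two wildcards might be used like this:

```text
User-agent: *
# Block any URL containing a query string
Disallow: /*?

# Block PDF files only: $ anchors the rule to the end of the URL
Disallow: /*.pdf$
```

Note how narrowly each rule is scoped; a bare `Disallow: /*` would block the entire site.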
3. In Robots.txt, noindex
This is more common in websites that are older than a few years.
As of September 1, 2019, Google stopped obeying noindex rules in robots.txt files.
If your robots.txt file was created before that date and contains noindex instructions, those pages are likely to be indexed in Google’s search results.
The solution to this problem is to use a different ‘noindex’ method.
One option is to use the robots meta tag, which you can add to the head of any web page that you don’t want Google to index.
4. Scripts and Stylesheets That Are Blocked
It may appear logical to prevent crawlers from accessing external JavaScripts and cascading stylesheets (CSS).
However, keep in mind that Googlebot requires access to CSS and JS files in order to properly “see” your HTML and PHP pages.
If your pages are behaving strangely in Google’s results, or it appears that Google is not seeing them correctly, check to see if you are blocking crawler access to necessary external files.
The simplest solution is to remove the line from your robots.txt file that is preventing access.
Alternatively, if you do need to block some files, insert an exception that restores access to the necessary CSS and JavaScripts.
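One way to express such an exception, assuming a hypothetical /assets/ directory, is:

```text
User-agent: *
Disallow: /assets/
# Exceptions so Googlebot can still fetch the files it needs to render pages
Allow: /assets/css/
Allow: /assets/js/
```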
5. There is no Sitemap URL.
This is about SEO more than anything else.
In your robots.txt file, include the URL to your sitemap.
Because this is the first place Googlebot looks when crawling your website, listing your sitemap there gives the crawler a head start in understanding your site’s structure and main pages.
While this isn’t strictly an error because omitting a sitemap should have no effect on the actual core functionality and appearance of your website in search results, it’s still worth adding your sitemap URL to robots.txt if you want to boost your SEO efforts.
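Adding the sitemap reference is a single line, which can go anywhere in the file (the URL here is hypothetical):

```text
Sitemap: https://www.example.com/sitemap.xml
```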
6. Development Site Access
Blocking crawlers from your live website is a no-no, but so is allowing them to crawl and index your under-construction pages.
It’s best practice to add a disallow instruction to a website’s robots.txt file so that the general public doesn’t see it until it’s finished.
When you launch a finished website, it’s also critical to remove the disallow instruction.
One of the most common mistakes made by web developers is failing to remove this line from robots.txt, which can prevent your entire website from being crawled and indexed correctly.
If your development site appears to be receiving real-world traffic, or if your newly launched website is not performing well in search, check your robots.txt file for a universal user agent disallow rule:
User-agent: *
Disallow: /
If you see this when you shouldn’t (or don’t see it when you should), make the necessary changes to your robots.txt file and double-check that your website’s search appearance changes accordingly.
How to Repair a Robots.txt Error
If a mistake in robots.txt is having an unfavorable effect on the search appearance of your website, the most important first step is to correct robots.txt and ensure that the new rules have the desired effect.
Some SEO crawling tools can assist you with this so that you don’t have to wait for the search engines to crawl your site again.
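Before resubmitting, you can also sanity-check your corrected rules locally. As a minimal sketch, Python’s standard urllib.robotparser module can evaluate which URLs a given set of rules blocks (the rules and URLs below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical corrected robots.txt content, line by line
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# Verify that public pages stay crawlable and private ones stay blocked
print(parser.can_fetch("*", "https://example.com/index.html"))         # True
print(parser.can_fetch("*", "https://example.com/private/page.html"))  # False
```

Note that this checker follows the original robots exclusion draft and does not support Google’s wildcard extensions, so it is best suited to plain path rules like the ones above.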
When you are certain that robots.txt is behaving properly, you can attempt to have your site re-crawled as soon as possible.
Platforms such as Google Search Console and Bing Webmaster Tools can be beneficial.
Submit an updated sitemap and request that any pages that have been incorrectly delisted be re-crawled.
Unfortunately, you are at the mercy of Googlebot – there is no way of knowing how long it will take for any missing pages to reappear in the Google search index.
All you can do is take the appropriate action to reduce that time as much as possible and keep checking until Googlebot implements the corrected robots.txt.
Last Thoughts
When it comes to robots.txt errors, prevention is unquestionably preferable to cure.
On a large revenue-generating website, a stray wildcard that removes the entire site from Google can have an immediate impact on earnings.
Robots.txt changes should be made carefully by experienced developers, double-checked, and – if necessary – subject to a second opinion.
To avoid inadvertently causing availability issues, test in a sandbox editor before pushing live on your real-world server.
Remember that when the worst happens, it is critical not to panic.
Determine the issue, make the necessary changes to robots.txt, and resubmit your sitemap for a new crawl.
Your position in the search rankings should be restored within a few days.
To learn more about SEO, read What You Should Know About Page Speed As A Google Ranking Factor.