No one is perfect.
And neither is the Drupal robots.txt file. In fact, there are several problems with it. If you test your default robots.txt file line by line using Google Webmaster Tools’ robots.txt testing utility, you will find that many paths that look like they are blocked can actually be crawled.
The reason is that Drupal does not require the trailing slash ( / ) after the path to show you the content. Because of the way robots.txt files are parsed, Googlebot will avoid the page with the slash but crawl the page without it. For example, /admin/ is listed as disallowed. As you would expect, the testing utility shows that http://www.yourDrupalsite.com/admin/ is disallowed.
Not so fast.
Put in http://www.yourDrupalsite.com/admin (without the slash) and you’ll see that it is allowed. “It’s a trap!” Not really, but fortunately it is relatively easy to fix.
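If you want to see the trap for yourself without waiting on Google's testing utility, here is a minimal sketch that feeds one of the default rules to Python's standard-library robots.txt parser. The rule is copied from Drupal's default file; the hostname is just the placeholder used throughout this article:

from urllib.robotparser import RobotFileParser

# One rule copied straight from Drupal's default robots.txt.
default_rules = """
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(default_rules.splitlines())

# With the trailing slash the page is blocked, as the file intends...
print(parser.can_fetch("Googlebot", "http://www.yourDrupalsite.com/admin/"))  # False
# ...but drop the slash and the very same page is allowed to be crawled.
print(parser.can_fetch("Googlebot", "http://www.yourDrupalsite.com/admin"))   # True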
Do you want to know how to fix the problems with Drupal’s default robots.txt file in ten easy steps? Please read on.
What in Tarnation is a Googlebot?
Huh? Google what?! Googlebot!
Google and other search engines use server systems (commonly referred to as spiders, crawlers, or robots) to travel the expanse of the Internet and find each and every website. Google’s system is also referred to as Googlebot to distinguish it from all the other search engine robots.
While Google does not reveal how many sites it crawls every week, its overall search index contains hundreds of billions of webpages and is over 100 million gigabytes in size - and those figures date from 2016. Google hasn't published updated numbers since then, so the real totals are undoubtedly much higher now.
"It’s like the index in the back of a book — with an entry for every word seen on every web page we index. When we index a web page, we add it to the entries for all of the words it contains."
Fixing the Drupal Robots.txt File
Like I said earlier, fixing Drupal’s default robots.txt file is relatively easy. Carry out the following steps in order to fix the file:
- Make a backup of the robots.txt file.
- Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor.
- Find the Paths (clean URLs) section and the Paths (no clean URLs) section. Note that both sections appear whether you've turned on clean URLs or not. Drupal covers you either way. They look like this, although yours may be slightly different:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

- Duplicate the two sections (simply copy and paste them) so that you have four sections: two "# Paths (clean URLs)" sections and two "# Paths (no clean URLs)" sections.
- Add 'fixed!' to the comment line of each new section so that you can tell them apart.
- Delete the trailing / after each Disallow line in the fixed! sections. You should end up with four sections that look like this (if you would rather script this edit, there is a sketch right after these steps):
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

# Paths (clean URLs) – fixed!
Disallow: /admin
Disallow: /comment/reply
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

# Paths (no clean URLs) – fixed!
Disallow: /?q=admin
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout
Disallow: /?q=node/add
Disallow: /?q=search
Disallow: /?q=user/password
Disallow: /?q=user/register
Disallow: /?q=user/login

- Add the path to your site's sitemap. Use the following format, making sure to use your site's canonical domain name:
sitemap: https://www.yourDrupalsite.com/sitemap.xml
- If your sitemap is not at this location, change the URL accordingly so that search engine bots know where to find it for crawling purposes.
- Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?).
- Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to clear the Drupal cache or refresh your browser to see the changes.
- Now your robots.txt file should be working as you expect it to.
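Doing the copy, paste, and slash-trimming by hand is easy enough, but if you look after several Drupal sites you may prefer to script it. The helper below is a hypothetical sketch of my own (it is not part of Drupal or any module) and it assumes the stock file shown above: it copies each "# Paths" section, strips the trailing slash from every Disallow line in the copy, labels the copy "fixed!", and appends the copies to the end of robots.txt. Appending at the end works for the stock file because every rule lives under the single User-agent: * group; if you have added other User-agent groups, paste the fixed sections next to the originals instead. Back up the file before running it.

def add_fixed_sections(robots_txt: str) -> str:
    """Duplicate each '# Paths' section with the trailing slashes removed."""
    fixed_blocks = []
    header, rules = None, []

    def flush():
        # Turn the section collected so far into a 'fixed!' copy.
        if header and rules:
            block = [header + " - fixed!"]
            block += [rule.rstrip("/") for rule in rules]
            fixed_blocks.append("\n".join(block))

    for line in robots_txt.splitlines():
        if line.startswith("# Paths"):
            flush()
            header, rules = line, []
        elif header and line.startswith("Disallow:"):
            rules.append(line.rstrip())
        elif header and not line.strip():
            flush()
            header, rules = None, []
    flush()

    return robots_txt.rstrip("\n") + "\n\n" + "\n\n".join(fixed_blocks) + "\n"

# Rewrite robots.txt in place (after you have made your backup).
with open("robots.txt") as fh:
    original = fh.read()
with open("robots.txt", "w") as fh:
    fh.write(add_fixed_sections(original))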
Additional Changes You Can Make for SEO
Now that you have fixed your default robots.txt file, there are a few additional changes you can make. Using directives and pattern-matching rules, the robots.txt file can keep crawlers out of entire sections of the site (like the admin pages), individual files (like cron.php), and whole directories (like /scripts and /modules).
In many cases, though, you should tweak your robots.txt file for optimal SEO results. Here are several changes you can make to the file to meet your needs in certain situations:
- You are developing a new site and you don’t want it to show up in any search engine until you’re ready to launch it. Add Disallow: / just after the User-agent: * line to block everything. Just make sure to change it back after the site goes live or your site will never get crawled.
- The server you are running is very slow and you don’t want the crawlers to slow the site down for your visitors. Adjust the Crawl-delay by changing it from 10 to 20.
- If you're on a super-fast server (and you should be, right?) you can tell the bots to bring it on! Change the Crawl-delay to 5 or even 1 second. Monitor your server closely for a few days to make sure it can handle the extra load.
- You're running a site which allows users to upload their own images but you don't necessarily want those images to show up in Google. Add these lines at the bottom of your robots.txt file:
User-agent: Googlebot-Image
Disallow: /path/to/visitor/jpg/files/*.jpg$
Disallow: /path/to/visitor/gif/files/*.gif$
Disallow: /path/to/visitor/png/files/*.png$

- If all of the files were in the /files/users/images/ directory, you could do this:
User-agent: Googlebot-Image
Disallow: /files/users/images/

- Say you noticed in your server logs that there was a bad robot out there scraping all your content. You can try to prevent this by adding the following to the bottom of your robots.txt file:
User-agent: Bad-Robot
Disallow: /

- If you have installed the XML Sitemap module, then you've got a great tool that you should send out to all of the search engines. However, it's tedious to go to each engine and upload your URL. Instead, you can add the same simple sitemap: line shown in the steps above to the bottom of your robots.txt file to let all search engines know where your site's XML Sitemap can be found. A quick way to confirm that crawlers will pick up your Crawl-delay and sitemap directives is sketched below.
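Once you have made any of these changes, a quick sanity check never hurts. The sketch below is again my own, not a Drupal feature: it points Python's standard-library parser at your live file (placeholder domain as before) and reports the Crawl-delay and sitemap lines it finds, plus one of the paths fixed earlier. Note that site_maps() needs Python 3.8 or newer, and this parser does not understand the wildcard image rules above, so verify those in Google's own robots.txt testing tool.

from urllib.robotparser import RobotFileParser

site = "http://www.yourDrupalsite.com"  # placeholder domain from this article

parser = RobotFileParser()
parser.set_url(site + "/robots.txt")
parser.read()  # downloads and parses the live file

# The Crawl-delay that applies to ordinary crawlers (None if the line is missing).
print("Crawl-delay:", parser.crawl_delay("*"))

# Every sitemap: line declared in the file (None if there are none; Python 3.8+).
print("Sitemaps:", parser.site_maps())

# After the trailing-slash fix, /admin without the slash should be blocked too.
print("/admin allowed?", parser.can_fetch("Googlebot", site + "/admin"))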