Writing A Proper Robots.txt For WordPress


If you search the internet for a properly configured robots.txt file, you will find a lot of different guides. Many of them have not been updated in years, and many give false information or simply fail to block what should be blocked.

For instance, the Yoast approach is very minimal and has not been updated in years, while other websites suggest that you can rely on the virtual file WordPress provides. Both of these approaches lead to a poor experience for the crawler.

WordPress does include a virtual robots.txt file, which blocks the bare minimum, such as wp-admin, but it misses many of the important items that should be included. The virtual robots.txt exists for webmasters who would otherwise be unsure how to create or edit one. It is the bare minimum, not something intended for more complex websites with hundreds or thousands of pages of content. If you run a more complex website, you will want to block every page that is useless to the crawler so that it focuses on your content instead.
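For comparison, here is roughly what the virtual file serves on a recent install; the exact output varies by version, and older releases disallowed /wp-includes/ instead of allowing admin-ajax.php:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php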

The fact is WordPress produces a lot of pages that are either useless to the search engine or will spit out a ton of errors if it tries to access them. Remember, we want things to be simple for the crawler when it is on our website, and we want it to focus on our content.

Below is what Yoast uses:
User-Agent: *
Disallow: /wp-content/plugins/
Disallow: /out/
Disallow: /bugs/
Disallow: /suggest/
Allow: /wp-content/plugins/vipers-video-quicktags/resources/jw-flv-player/player.swf
Here is what we are using:

User-agent: *
Disallow: /wp-login.php
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /*?s
Disallow: /search/
Allow: /wp-content/uploads/

Sitemap: http://thedailyexposition.com/sitemap_index.xml
Sitemap: http://thedailyexposition.com/post-sitemap.xml
Sitemap: http://thedailyexposition.com/page-sitemap.xml
Sitemap: http://thedailyexposition.com/category-sitemap.xml
Sitemap: http://thedailyexposition.com/author-sitemap.xml
Sitemap: http://thedailyexposition.com/forum-sitemap.xml
Sitemap: http://thedailyexposition.com/topic-sitemap.xml
Sitemap: http://thedailyexposition.com/product_cat-sitemap.xml
Sitemap: http://thedailyexposition.com/product_tag-sitemap.xml
Sitemap: http://thedailyexposition.com/post_tag-sitemap.xml
Sitemap: http://thedailyexposition.com/topic_tag-sitemap.xml
Sitemap: http://thedailyexposition.com/product-sitemap.xml

We block general access to the wp-content folder but allow the crawler to reach our uploads, which include all of the images and other files used in posts or by WooCommerce. All forms of login are blocked, and search is blocked as well: every time a query is typed, WordPress generates a results page, and these pages are typically worthless and do nothing but waste the crawler's time. Trackbacks also serve no real purpose for the crawler, and most WordPress installations disable the feature anyway, because it both slows your website down and wastes server resources.
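To sanity-check rules like these, it helps to remember how Google resolves conflicts: the longest matching pattern wins, and Allow beats Disallow on a tie, which is why Allow: /wp-content/uploads/ overrides Disallow: /wp-content/. Below is a rough Python sketch of that matching logic. It is a simplification (it ignores the $ end-of-URL anchor and percent-encoding details), and the test paths are made up for illustration:

import re

# Rules transcribed from the robots.txt above.
RULES = [
    ("disallow", "/wp-login.php"),
    ("disallow", "/cgi-bin/"),
    ("disallow", "/wp-admin/"),
    ("disallow", "/wp-content/"),
    ("disallow", "/wp-includes/"),
    ("disallow", "/trackback/"),
    ("disallow", "/xmlrpc.php"),
    ("disallow", "/*?s"),
    ("disallow", "/search/"),
    ("allow", "/wp-content/uploads/"),
]

def matches(pattern, path):
    # '*' matches any run of characters; the pattern is anchored
    # at the start of the path.
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex, path) is not None

def can_fetch(path):
    # Longest matching pattern wins; Allow beats Disallow on a tie.
    # No match at all means the URL is crawlable.
    best_kind, best_len = "allow", -1
    for kind, pattern in RULES:
        if matches(pattern, path):
            if len(pattern) > best_len or (len(pattern) == best_len and kind == "allow"):
                best_kind, best_len = kind, len(pattern)
    return best_kind == "allow"

for path in ["/wp-admin/options.php",
             "/?s=example",
             "/wp-content/plugins/foo/bar.js",
             "/wp-content/uploads/header.png",
             "/2016/some-post/"]:
    print("crawl" if can_fetch(path) else "skip ", path)

Running this shows the wp-admin, search, and plugin paths being skipped while the upload and the ordinary post remain crawlable, which is exactly the behavior we want.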

I have also taken the liberty of listing not only the sitemap index but the individual sitemaps it contains. This makes it easier for the crawler to identify the sitemaps, since it can check them against the index. That way, if a crawler overlooks one of the individual files, the index still points it there, and nothing gets missed.
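For reference, the index file Yoast generates is an ordinary sitemaps.org sitemap index that points at each individual sitemap. Trimmed to two entries for illustration, the structure looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://thedailyexposition.com/post-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://thedailyexposition.com/page-sitemap.xml</loc>
  </sitemap>
</sitemapindex>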
