
Last week, I learned that WordPress doesn’t ship with a default robots.txt. This is the file that search engine crawlers parse to see which resources and URL patterns they are allowed and not allowed to crawl; it’s step 1 in every search engine optimization (SEO) guide.
I guess I just stupidly assumed that it was included in WP. Anyway, I thought it only fair to tell everyone: if you are using WordPress and you care how your site shows up in search results, you should generate a robots.txt and a sitemap.xml.
Robots.txt?
Know that it’s important for search engines. Read this:
**NOTE: Not all web crawlers are guaranteed to read or honor example.com/robots.txt; it serves as a guideline.
I Feel Dumb…
I feel like an idiot, and I should. The other day I just happened to search for “engfers” on Google, and the result that came back was my site with an indented sub-result that was some error from a file in the WP-Super-Cache plugin. I thought to myself, why is the plugins/ directory being crawled?
Needless to say, I shortly thereafter found Google’s Webmaster Tools to help rectify my situation. It’s a pretty nice web-app that allows you to remove content from Google’s search (which I then used).
I also noticed that the webmaster tools had sections for analyzing your robots.txt and sitemap.xml. Well, I was surprised to find out that this site didn’t have a robots.txt.
Most of you are probably thinking that I’m an idiot because that’s SEO 101. Well yes, it is; however, I didn’t realize that WordPress doesn’t ship with a default robots.txt! Don’t ask me why I didn’t see that before, because I don’t know. Nevertheless, I think WP should ship with a robots.txt that AT LEAST excludes wp-content/plugins/ and wp-includes/ from being crawled.
Our Shiny, New robots.txt
There seem to be a billion and one SEO blogs out there; however, I was looking specifically for a robots.txt optimized for WordPress.
I found a couple of articles and examples at askapache.com and an example from the WordPress.org Codex.
The final version of our robots.txt (http://www.engfers.com/robots.txt) was pulled from the WordPress Codex page.
    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /trackback
    Disallow: /feed
    Disallow: /comments
    Disallow: /category/*/*
    Disallow: */trackback
    Disallow: */feed
    Disallow: */comments
    Disallow: /*?*
    Disallow: /*?
    Allow: /wp-content/uploads

    # Google Image
    User-agent: Googlebot-Image
    Disallow:
    Allow: /*

    # Google AdSense
    User-agent: Mediapartners-Google*
    Disallow:
    Allow: /*

    # Internet Archiver Wayback Machine
    User-agent: ia_archiver
    Disallow: /

    # digg mirror
    User-agent: duggmirror
    Disallow: /

    # Sitemap
    Sitemap: http://www.engfers.com/sitemap.xml

**NOTE: This file must be at the ROOT of your web server!
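If you want to sanity-check rules like these before uploading them, here is a rough sketch using Python’s standard-library urllib.robotparser. It only covers a subset of the rules above: as far as I know, that parser treats each path as a plain prefix and does not understand wildcard patterns such as /*?*, so those are left out. The example.com host, the sample paths, and the allowed() helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A prefix-only subset of the robots.txt shown above (wildcard rules omitted
# because the standard-library parser matches paths as plain prefixes).
RULES = """\
User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Allow: /wp-content/uploads

User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path):
    """Hypothetical helper: may `agent` fetch `path` on a placeholder host?"""
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for agent, path in [
        ("Googlebot", "/wp-content/plugins/some-plugin/file.php"),
        ("Googlebot", "/wp-content/uploads/photo.png"),
        ("Googlebot", "/2009/some-post/"),
        ("ia_archiver", "/2009/some-post/"),
    ]:
        print(agent, path, "allowed" if allowed(agent, path) else "blocked")
```

Real crawlers may interpret the wildcard rules differently from this parser, so treat the output as a rough check rather than a guarantee of what any search engine will do.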
Final Note: sitemap.xml
The big-daddy search engines like Google, Yahoo, Microsoft, etc. use your site’s sitemap.xml (example.com/sitemap.xml) to make it easier to crawl your website. It’s also a very important part of SEO; just do a bit of searching on it.
The final line in our robots.txt points to the sitemap:
Sitemap: http://www.engfers.com/sitemap.xml
For WordPress, use a plugin like the Google Sitemap Generator to have it automatically generate the sitemap for you.
+1: It will also automatically regenerate the sitemap.xml whenever you publish a new article or page or edit an existing one. =)
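For the curious, a sitemap.xml is just a list of URLs in a small XML format. Below is a minimal sketch of building one by hand in Python, assuming the standard sitemaps.org 0.9 namespace; the build_sitemap() helper, the example.com URLs, and the dates are invented for illustration, and this is not how the plugin itself works.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Hypothetical helper: build sitemap XML from (url, 'YYYY-MM-DD') pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Placeholder pages; a real site would list its actual posts and pages.
    print(build_sitemap([
        ("http://www.example.com/", "2009-01-15"),
        ("http://www.example.com/2009/some-post/", "2009-01-10"),
    ]))
```

In practice the plugin is the easier route, since it keeps the file in sync with your posts for you.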

Great article engfer! Most people and bloggers have never heard about robots.txt files, and that isn’t good for anyone.
The newer WordPress versions show a default robots.txt “file” by using internal rewrites, which IMHO is not nearly as good as using an actual file. Keep it up..