Abivia Web Hosting

 



Blocking spiders that ignore robots.txt (eg. Bing)

Posted 6 years, 2 months ago by instance (Administrator) #763
This month one of our sites suddenly reported a 1,000% increase in traffic, as measured by data transfer. Normally that would be a good thing, but in this case it seemed a little odd. We took a look at our traffic statistics and saw no significant increase in visits.

The last time this happened, a very popular site with a .tw top-level domain had a bad image link pointing at our .com of the same name. Just serving up 404 errors was costing us hundreds of megabytes per day and putting unwanted load on the server. So this time we went to the web server logs to see what was going on. This is what we found (some details removed):

<code>
x.x.x.x - - [] "GET /administrator/ HTTP/1.1" 301 377 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +www.bing.com/bingbot.htm)"
x.x.x.x - - [] "GET /administrator/ HTTP/1.1" 200 4185 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +www.bing.com/bingbot.htm)"
x.x.x.x - - [] "POST /administrator/index.php HTTP/1.1" 301 386 "x.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +www.bing.com/bingbot.htm)"
x.x.x.x - - [] "GET /administrator/index.php HTTP/1.1" 200 4185 "x.com/administrator/index.php" "Mozilla/5.0 (compatible; bingbot/2.0; +www.bing.com/bingbot.htm)"
x.x.x.x - - [] "GET /administrator/ HTTP/1.1" 301 377 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +www.bing.com/bingbot.htm)"
</code>

There are two problems with this:
1) /administrator is explicitly disallowed in robots.txt. Bingbot had no business making these requests.
2) There are on the order of 30,000 similar requests, one after the other!
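For anyone who wants to size up the same problem in their own logs, here is a minimal sketch of the kind of count we mean. It assumes Apache's combined log format as in the excerpt above; the `count_hits` helper, the `SAMPLE` list, and the `y.y.y.y` line are made up for illustration and are not taken from our server.

```python
import re
from collections import Counter

# Loose match for the combined log format. The timestamp may be empty
# ("[]"), as in the redacted excerpt.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]*)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_hits(lines, prefix="/administrator"):
    """Count requests under `prefix`, keyed by (ip, user agent)."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path").startswith(prefix):
            hits[(m.group("ip"), m.group("agent"))] += 1
    return hits


BINGBOT = "Mozilla/5.0 (compatible; bingbot/2.0; +www.bing.com/bingbot.htm)"

# Two lines from the excerpt plus one invented line for contrast.
SAMPLE = [
    'x.x.x.x - - [] "GET /administrator/ HTTP/1.1" 301 377 "-" "%s"' % BINGBOT,
    'x.x.x.x - - [] "POST /administrator/index.php HTTP/1.1" 301 386 '
    '"x.com/administrator/index.php" "%s"' % BINGBOT,
    'y.y.y.y - - [] "GET /index.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for (ip, agent), n in count_hits(SAMPLE).most_common():
        print(n, ip, agent)
```

In practice you would pass an open log file instead of `SAMPLE`, since the function only needs an iterable of lines.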

Our first reaction was: Bing? Really? This is amateur-night stuff, and you should be embarrassed by it. Of course, there's a chance this is some other bot masquerading as Bing. The originating IP address is in the US but isn't identified as belonging to Microsoft. It does, however, belong to an ISP that provides corporate rather than retail services. The jury is out.
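One way to settle the masquerading question is a forward-confirmed reverse DNS check: look up the host name for the IP, check that it belongs to the crawler's domain, then resolve that name and confirm it points back to the same IP. The sketch below is not something we ran at the time. It assumes that genuine Bingbot hosts reverse-resolve to names ending in `.search.msn.com`, which is our understanding of Microsoft's guidance and should be confirmed against their current documentation. The function name and the stub resolvers are hypothetical.

```python
import socket

# Assumption: genuine Bingbot hosts reverse-resolve to names under this
# domain. Check Microsoft's current guidance before relying on it.
BING_SUFFIX = ".search.msn.com"


def is_verified_bingbot(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Return True if `ip` passes a forward-confirmed reverse DNS check."""
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:  # no PTR record, or the lookup failed
        return False
    if not hostname.endswith(BING_SUFFIX):
        return False
    try:
        addresses = forward(hostname)[2]  # IPv4 addresses for the name
    except OSError:
        return False
    # The name must resolve back to the IP that made the request.
    return ip in addresses


if __name__ == "__main__":
    # Stub resolvers with made-up data so the example runs offline.
    def fake_reverse(ip):
        return ("crawler.search.msn.com", [], [ip])

    def fake_forward(hostname):
        return (hostname, [], ["192.0.2.10"])

    print(is_verified_bingbot("192.0.2.10", fake_reverse, fake_forward))
    print(is_verified_bingbot("198.51.100.7", fake_reverse, fake_forward))
```

Called with only an IP address, the function uses the real resolvers from the `socket` module. Note that `gethostbyname_ex` handles IPv4 only, so this sketch does not cover IPv6 crawlers.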

The second issue is more general. How do we keep bots that ignore robots.txt out of the places they shouldn't be?

The answer lies in the Apache .htaccess file, mod_rewrite (Apache's URL rewriting module), and the power of regular expressions. Here's our solution, tailored for Joomla and WordPress:

<code>
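# For crawlers that ignore robots.txt
# (needs mod_rewrite enabled and "RewriteEngine On" earlier in the file)
#
# Match the major crawlers by User-Agent, ignoring case
RewriteCond %{HTTP_USER_AGENT} (alltheweb|baidu|bingbot|googlebot|msnbot|slurp) [NC]
# Answer 403 Forbidden for paths we never want crawled
RewriteRule ^(administrator|cli|includes|installation|language|libraries|logs|tmp|wp-admin) - [F,L]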
</code>

These directives only work where mod_rewrite is installed, enabled, and switched on with RewriteEngine On. The first line, the RewriteCond, matches the User-Agent strings of the major web crawlers: Google, Microsoft, Yahoo, Overture, and Baidu. The NC flag makes the match case-insensitive.

The RewriteRule matches URLs that start with parts of the site we never want indexed. The dash means the URL itself is left unchanged; the real work is done in the flags. The F tells mod_rewrite to send a "403 Forbidden" response, and the L tells it that we're done with rewriting (F already implies L, so the L is only there for clarity). Apache then tells the crawler to go away, hopefully in a way that doesn't need to be repeated 30,000 times!
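To sanity-check which requests the condition and rule catch together, the pair can be approximated outside Apache. This is only a sketch: it uses Python's `re` module in place of mod_rewrite's PCRE engine, which should behave the same for patterns this simple, and it assumes the rules live in the .htaccess file at the site root, where mod_rewrite matches the path without its leading slash. The `would_forbid` helper and the sample requests are made up for illustration.

```python
import re

# Patterns copied from the RewriteCond and RewriteRule in this post.
AGENT_RE = re.compile(
    r"(alltheweb|baidu|bingbot|googlebot|msnbot|slurp)", re.IGNORECASE  # [NC]
)
PATH_RE = re.compile(
    r"^(administrator|cli|includes|installation|language"
    r"|libraries|logs|tmp|wp-admin)"
)


def would_forbid(url_path, user_agent):
    """Approximate the rule pair: True means the request would get a 403."""
    # In a root .htaccess the pattern sees "administrator/...", not
    # "/administrator/...".
    relative = url_path[1:] if url_path.startswith("/") else url_path
    return bool(AGENT_RE.search(user_agent)) and bool(PATH_RE.search(relative))


if __name__ == "__main__":
    bingbot = "Mozilla/5.0 (compatible; bingbot/2.0; +www.bing.com/bingbot.htm)"
    browser = "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"
    for path, agent in [
        ("/administrator/index.php", bingbot),
        ("/administrator/index.php", browser),
        ("/components/com_content/", bingbot),
    ]:
        print(path, would_forbid(path, agent))
```

Because the rule pattern is anchored only at the start, it is a prefix match: a hypothetical public path such as /clients would also be refused to crawlers. If that matters on your site, ending the pattern with `(/|$)` is one way to tighten it.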

Your robots.txt file probably lists more directories than this rule does. That is deliberate: some of the directories disallowed in robots.txt hold files that crawlers need to be able to fetch when they load public pages (a Joomla example is /components; in WordPress it might be /wp-includes). Leaving those out of the rule lets a crawler fetch images and other media in the context of a page, while robots.txt still keeps "good" crawlers from indexing the image sub-directories on their own.
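The result is a policy with two tiers, which the sketch below tries to make concrete. The robots.txt content is a made-up three-line example rather than our real file, and `build_parser` and `classify` are hypothetical helpers. The rule pattern is the one from the RewriteRule in this post, and `urllib.robotparser` is Python's standard robots.txt parser.

```python
import re
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration; a real one will list more.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /components/
Disallow: /tmp/
"""

# Pattern copied from the RewriteRule in this post.
RULE_RE = re.compile(
    r"^(administrator|cli|includes|installation|language"
    r"|libraries|logs|tmp|wp-admin)"
)


def build_parser(text):
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def classify(parser, path, agent="bingbot"):
    """Say which tier, if any, keeps a crawler away from `path`."""
    relative = path[1:] if path.startswith("/") else path
    if RULE_RE.search(relative):
        return "forbidden by rewrite rule"
    if not parser.can_fetch(agent, path):
        return "disallowed by robots.txt only"
    return "open"


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for path in ("/administrator/index.php",
                 "/components/com_banners/banner.png",
                 "/index.php"):
        print(path, "->", classify(parser, path))
```

Paths in the middle tier are the ones a well-behaved crawler skips on its own but can still be served if a public page needs them.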

We've implemented this on the affected site and are monitoring the situation to make sure there are no unexpected side effects.
Last Edit: 6 years, 2 months ago by instance.