Web Robot Detection

part of the ArsDigita Community System by Michael Yoon

The Big Picture

Many of the pages on an ACS-based website are hidden from robots (a.k.a. search engines) because login is required to access them. A generic way to expose login-required content to robots is to redirect all robot requests to a special URL designed to give the robot what at least appear to be linked .html files.

You might also want to use this software when public (not password-protected) pages aren't getting indexed by a specific robot. Many robots won't visit pages that look like CGI scripts, e.g., URLs with question marks and form variables (this is discussed in Chapter 7 of Philip and Alex's Guide to Web Publishing).

The Medium-sized Picture

For this to work, we need a way to distinguish robots from human beings. Fortunately, the Web Robots Database maintains a list of active robots and kindly publishes it as a text file. By loading this list into the database, we can implement the following algorithm:
  1. Check the User-Agent of each HTTP request against those of known robots (which are stored in the robot_useragent column of the robots table).
  2. If there is a match, redirect to the special URL.
  3. Serve the special URL, which can be either a static page or a dynamic script that dumps lots of juicy text from the database, for the robot's indexing pleasure.

This algorithm is implemented by a postauth filter proc: ad_robot_filter.
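
The matching and redirect steps can be sketched in a few lines of Python. This is only an illustration, not the ad_robot_filter source: the function name robot_redirect is hypothetical, the list of known agents stands in for the robot_useragent column, and the case-insensitive exact comparison is an assumption, since how the filter actually compares strings is not described here.

```python
from typing import Iterable, Optional


def robot_redirect(user_agent: Optional[str],
                   known_agents: Iterable[str],
                   redirect_url: str = "/robot-heaven/") -> Optional[str]:
    """Return redirect_url if user_agent matches a known robot, else None.

    Hypothetical helper: known_agents stands in for the values stored in
    robots.robot_useragent. Exact, case-insensitive matching is an assumption.
    """
    if not user_agent:
        # No User-Agent header: treat the request as coming from a human.
        return None
    ua = user_agent.strip().lower()
    for known in known_agents:
        known = known.strip().lower()
        # Skip empty entries so a blank row never matches a request.
        if known and ua == known:
            return redirect_url
    return None
```

A caller would issue an HTTP redirect when the function returns a URL and let the request proceed normally when it returns None.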

(Note: For now, we are only storing the minimum number of fields needed to detect robots, so many of the columns in the robots table will be empty. Later, if the need presents itself, we can enhance the code to parse out and store all fields.)
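
As a rough illustration of storing only the minimum fields, the sketch below pulls two fields out of each record and ignores the rest. It assumes the text file consists of blank-line-separated records of "name: value" lines, with field names such as robot-id and robot-useragent and with continuation lines indented; that layout and the function name parse_robot_records are assumptions made for this example, not a description of the ACS loader.

```python
from typing import Dict, List

# Fields kept per record; every other field in the file is ignored.
WANTED = ("robot-id", "robot-useragent")


def parse_robot_records(text: str) -> List[Dict[str, str]]:
    """Parse blank-line-separated 'name: value' records, keeping WANTED fields."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        if line[0].isspace():
            # Indented line: continuation of a field we do not store.
            continue
        name, sep, value = line.partition(":")
        name = name.strip().lower()
        if sep and name in WANTED:
            current[name] = value.strip()
    if current:
        records.append(current)
    return records
```

Each resulting dictionary corresponds to one row to insert, with the remaining columns left empty.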

Configuration Parameters

[ns/server/yourservername/acs/robot-detection]
; the URL of the Web Robots DB text file
WebRobotsDB=http://info.webcrawler.com/mak/projects/robots/active/all.txt
; which URLs should ad_robot_filter check (uncomment to turn system on)
; FilterPattern=/members-only-stuff/*.html
; FilterPattern=/members-only-stuff/*.tcl
; the URL where robots should be sent
RedirectURL=/robot-heaven/
; How frequently (in days) the robots table
; should be refreshed from the Web Robots DB
RefreshIntervalDays=30
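
To make the effect of FilterPattern and RefreshIntervalDays concrete, here is a small Python sketch. It assumes that the patterns are glob-style and that a refresh is due once the configured number of days has elapsed since the last load; the helper names url_is_filtered and needs_refresh are hypothetical, and Python's fnmatch is used only as a stand-in for whatever pattern matching the server applies.

```python
from datetime import datetime, timedelta
from fnmatch import fnmatchcase
from typing import Optional, Sequence

# Values mirroring the sample configuration (with the patterns uncommented).
FILTER_PATTERNS = ["/members-only-stuff/*.html", "/members-only-stuff/*.tcl"]
REFRESH_INTERVAL_DAYS = 30


def url_is_filtered(path: str,
                    patterns: Sequence[str] = FILTER_PATTERNS) -> bool:
    """True if the request path matches any configured glob pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def needs_refresh(last_refreshed: Optional[datetime],
                  now: datetime,
                  interval_days: int = REFRESH_INTERVAL_DAYS) -> bool:
    """True if the robots table was never loaded or the interval has elapsed."""
    if last_refreshed is None:
        return True
    return now - last_refreshed >= timedelta(days=interval_days)
```

With no FilterPattern lines enabled, the pattern list would be empty and url_is_filtered would return False for every path, which corresponds to the system being turned off.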

Notes for the Site Administrator

Set Up

Testing

See the ACS Acceptance Test.


michael@arsdigita.com