You might want to use this software for situations where public (not password-protected) pages aren't getting indexed by a specific robot. Many robots won't visit pages that look like CGI scripts, e.g., with question marks and form vars (this is discussed in Chapter 7 of Philip and Alex's Guide to Web Publishing).
The package works by comparing the User-Agent header of each HTTP request against those of known robots (which are stored in the robot_useragent column of the robots table). This check is implemented by a postauth filter proc: ad_robot_filter.
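To make the mechanism concrete, here is a minimal sketch of such a postauth filter. It is written against the plain AOLserver Tcl API rather than copied from the ACS source; the nsv array robot_useragents (assumed to be loaded from the robot_useragent column at startup), the prefix-matching rule, and the proc name are all assumptions for illustration.

# Minimal sketch (not the actual ad_robot_filter source) of a postauth
# filter that compares the request's User-Agent against known robots.
# Assumes the robot_useragent values were copied into the nsv array
# "robot_useragents" at startup; the real filter consults the robots table.
proc robot_filter_sketch { why } {
    set useragent [ns_set iget [ns_conn headers] "User-Agent"]
    set redirect_url [ns_config \
        "ns/server/[ns_info server]/acs/robot-detection" RedirectURL "/robot-heaven/"]
    foreach robot_ua [nsv_array names robot_useragents] {
        # treat each stored robot User-Agent as a prefix of the request's
        if { [string match "${robot_ua}*" $useragent] } {
            ns_returnredirect $redirect_url
            # tell AOLserver the response has already been sent
            return filter_return
        }
    }
    # not a known robot: let the request proceed normally
    return filter_ok
}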
(Note: For now, we are only storing the minimum number of fields needed to detect robots, so many of the columns in the robots table will be empty. Later, if the need presents itself, we can enhance the code to parse out and store all fields.)
The package reads its settings from the following section of the AOLserver configuration file:

[ns/server/yourservername/acs/robot-detection]
; the URL of the Web Robots DB text file
WebRobotsDB=http://info.webcrawler.com/mak/projects/robots/active/all.txt
; which URLs should ad_robot_filter check (uncomment to turn system on)
; FilterPattern=/members-only-stuff/*.html
; FilterPattern=/members-only-stuff/*.tcl
; the URL where robots should be sent
RedirectURL=/robot-heaven/
; How frequently (in days) the robots table
; should be refreshed from the Web Robots DB
RefreshIntervalDays=30
At server startup, the package refreshes the robots table if it is empty or if its data is older than the number of days specified by the RefreshIntervalDays configuration parameter (shown above).
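As a sketch of that startup decision (robot_table_row_count, robot_table_age_in_days, and robot_refresh_table are hypothetical stand-ins, not the package's actual procs):

# Sketch of the refresh-at-startup logic; helper procs are hypothetical.
proc robot_maybe_refresh_table {} {
    set interval [ns_config \
        "ns/server/[ns_info server]/acs/robot-detection" RefreshIntervalDays 30]
    if { [robot_table_row_count] == 0 || [robot_table_age_in_days] > $interval } {
        # re-fetch the Web Robots DB (the WebRobotsDB URL above)
        # and repopulate the robots table
        robot_refresh_table
    }
}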
If no FilterPattern values are specified in the configuration, then the robot detection filter will not be installed.
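A sketch of that conditional registration, reading the FilterPattern values straight out of the configuration section shown above (it registers the robot_filter_sketch proc from the earlier example; the variable names are illustrative):

# Collect every FilterPattern from the package's configuration section
# and register the robot filter only if at least one is present.
set section [ns_configsection "ns/server/[ns_info server]/acs/robot-detection"]
set patterns [list]
if { ![string equal "" $section] } {
    for { set i 0 } { $i < [ns_set size $section] } { incr i } {
        if { [string equal -nocase "FilterPattern" [ns_set key $section $i]] } {
            lappend patterns [ns_set value $section $i]
        }
    }
}
if { [llength $patterns] == 0 } {
    ns_log Notice "robot-detection: no FilterPattern configured; filter not installed"
} else {
    foreach pattern $patterns {
        # postauth filter, as described above
        ns_register_filter postauth GET $pattern robot_filter_sketch
    }
}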
If necessary, ns_register_proc can be used to give dynamically generated pages a pseudo-static HTML file appearance, so that robots that avoid CGI-looking URLs will still index them.
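For example, a dynamically generated page can be registered under a static-looking URL roughly like this (the URL and proc name are made up for illustration):

# Serve a dynamically generated page under a .html URL so that robots
# see what looks like a plain static file. URL and proc name are
# illustrative only.
ns_register_proc GET /robot-heaven/index.html robot_heaven_index
proc robot_heaven_index { args } {
    # a real page would pull its content from the database
    ns_return 200 text/html "<html><body>
<h2>Robot Heaven</h2>
<p>A static-looking index of this site's public content.</p>
</body></html>"
}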