2.2. Web Robot Detection Design Documentation

Table of Contents
2.2.1. Essentials
2.2.2. Introduction
2.2.3. Historical Considerations
2.2.4. Competitive Analysis
2.2.5. Design Tradeoffs
2.2.6. API
2.2.7. Data Model Discussion
2.2.8. User Interface
2.2.9. Configuration/Parameters
2.2.10. Future Improvements/Areas of Likely Change
2.2.11. Authors
2.2.12. Revision History

By Roger Hsueh

2.2.1. Essentials

2.2.2. Introduction

The goal of this package is to expose some parts of the site that are normally not indexable by search engines. The package accomplishes this by performing the following functions:

- Detecting robots by comparing the User-Agent header of an incoming request against a list of known robots' User-Agent values.
- Redirecting recognized robots to a special area of the site.

This special area can be either a static page or a dynamic script that generates text from the database suitable for search-engine indexing.

2.2.3. Historical Considerations

Previous versions of the robot detection package relied on the AOLserver procedure ns_register_filter, which is deprecated as of ACS 4. The robot detection package now uses the analogous ad_register_filter provided by the ACS 4 request processor.

The queries have been rewritten to use the ACS 4 Database Access API. Other than that, this version is a straightforward port.

2.2.4. Competitive Analysis

On Apache, it's possible to accomplish more or less the same thing with mod_rewrite, but mod_rewrite is a low-level module that requires considerable configuration to match what the robot detection package does out of the box.
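For illustration, a minimal mod_rewrite sketch of the same idea might look like the following (the User-Agent patterns and paths are made-up examples; maintaining such a pattern list by hand is exactly the work this package automates):

    RewriteEngine On
    # Made-up robot User-Agent patterns -- a real list needs constant upkeep
    RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|Scooter) [NC]
    # Only rewrite requests for the members-only area
    RewriteCond %{REQUEST_URI} ^/members-only-stuff/
    RewriteRule .* /robot-heaven/ [R,L]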

2.2.5. Design Tradeoffs

2.2.6. API

The API for robot detection is quite simple. (You can also view it in the API Browser.)

ad_cache_robot_useragents

Caches "User-Agent" values for known robots.
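A minimal sketch of how this caching might work in AOLserver Tcl, assuming an illustrative helper robot_useragents_list that reads the values from the robots table (neither proc body below is the package's actual implementation):

    # Illustrative: fetch the known robot User-Agent values from the
    # database; db_list is the ACS 4 Database Access API call for
    # one-column queries.
    proc robot_useragents_list {} {
        return [db_list get_robot_useragents {
            select robot_useragent from robots
        }]
    }

    # Illustrative: keep the list in an nsv array, AOLserver's
    # server-wide shared storage, for cheap per-request lookups.
    proc ad_cache_robot_useragents {} {
        nsv_set robot_detection useragents [robot_useragents_list]
    }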

ad_replicate_web_robots_db

Replicates data from the Web Robots Database into a table in the database. The data is published on the Web as a flat file, whose format is specified in http://info.webcrawler.com/mak/projects/robots/active/schema.txt. Basically, each non-blank line of the database corresponds to one field (name-value pair) of a record that defines the characteristics of a registered robot. Each record has a "robot-id" field as a unique identifier. (There are many fields in the schema, but, for now, the only ones we care about are: robot-id, robot-name, robot-details-url, and robot-useragent.) Returns the number of rows replicated. May raise a Tcl error that should be caught by the caller.
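To make the flat-file format concrete, here is an illustrative Tcl sketch of the parsing such a proc might perform. The proc name parse_robots_db and the record-splitting heuristic are assumptions; the real proc additionally inserts each record into the robots table and returns the row count.

    # Illustrative parser for the Web Robots Database flat file: each
    # non-blank line is one "field-name: value" pair, and a fresh
    # robot-id field signals the start of a new record.
    proc parse_robots_db {text} {
        set records [list]
        array unset record
        foreach line [split $text \n] {
            set line [string trim $line]
            if {$line eq ""} {
                continue
            }
            # split on the first colon into field name and value
            if {[regexp {^([^:]+):\s*(.*)$} $line -> field value]} {
                set field [string trim $field]
                if {$field eq "robot-id" && [info exists record(robot-id)]} {
                    lappend records [array get record]
                    array unset record
                }
                set record($field) $value
            }
        }
        if {[info exists record(robot-id)]} {
            lappend records [array get record]
        }
        return $records
    }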

ad_robot_filter conn args why

A filter to redirect any recognized robot to a specified page.
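A sketch of what the filter might do and how it could be wired into the ACS 4 request processor. The ad_parameter and ad_register_filter calls below are schematic rather than the package's literal code:

    # Illustrative filter body: send recognized robots to RedirectURL.
    proc ad_robot_filter {conn args why} {
        set useragent [ns_set get [ns_conn headers] User-Agent]
        if {[robot_p $useragent]} {
            ad_returnredirect [ad_parameter RedirectURL robot-detection "/robot-heaven/"]
            # tell AOLserver the filter has handled the request
            return filter_return
        }
        return filter_ok
    }

    # Register as a preauth filter on a URL pattern from FilterPattern.
    ad_register_filter preauth GET /members-only-stuff/* ad_robot_filter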

ad_update_robot_list

Updates the robots table if it is empty or if its age in days exceeds the RefreshIntervalDays configuration parameter.
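A hedged sketch of the refresh logic this implies; the last_modified column and the query are assumptions:

    # Illustrative: re-replicate when the robots table is empty or
    # older than RefreshIntervalDays.
    proc ad_update_robot_list {} {
        set refresh_days [ad_parameter RefreshIntervalDays robot-detection 30]
        # age in days of the newest row; -1 when the table is empty
        # (last_modified is an assumed bookkeeping column)
        set age [db_string robots_age {
            select nvl(trunc(sysdate - max(last_modified)), -1) from robots
        } -default -1]
        if {$age == -1 || $age > $refresh_days} {
            ad_replicate_web_robots_db
        }
    }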

robot_exists_p robot_id

Returns 1 if a row already exists in the robots table with the specified "robot_id", 0 otherwise.
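For illustration, this can be a one-query proc using a bind variable (a sketch, not necessarily the package's exact query):

    # Illustrative: robot_id is the table's unique identifier, so
    # count(*) is either 0 or 1.
    proc robot_exists_p {robot_id} {
        return [db_string robot_exists {
            select count(*) from robots where robot_id = :robot_id
        }]
    }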

robot_p useragent

Returns 1 if the useragent is recognized as a search engine, 0 otherwise.
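A sketch of the check, assuming the nsv cache populated by ad_cache_robot_useragents above; the substring-matching semantics are an assumption:

    # Illustrative: case-insensitive substring match of the request's
    # User-Agent against the cached robot User-Agent values.
    proc robot_p {useragent} {
        set ua [string tolower $useragent]
        foreach robot_ua [nsv_get robot_detection useragents] {
            if {[string first [string tolower $robot_ua] $ua] >= 0} {
                return 1
            }
        }
        return 0
    }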

2.2.7. Data Model Discussion

The data model has one table, robots, which mirrors a few critical fields of the schema laid out in the Web Robots Database schema. Only robot_id, robot_name, robot_details_url, and robot_useragent are used in the current implementation.
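A hedged sketch of what that table might look like in Oracle; the column sizes and types are assumptions, and the authoritative definition is the package's data model file:

    -- Illustrative sketch only, not the package's actual DDL.
    create table robots (
        robot_id           varchar2(30) primary key,
        robot_name         varchar2(500),
        robot_details_url  varchar2(500),
        robot_useragent    varchar2(500) not null
    );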

2.2.7.1. Transactions

On server restarts, the robots table is refreshed with the raw data from the Web Robots Database if the table's age exceeds the number of days specified by the RefreshIntervalDays parameter. ACS administrators can also refresh the robots table manually through the Robot Detection Administration page.
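In ACS 4, this kind of on-restart work typically lives in a package *-init.tcl file. A sketch, with the file name and error handling as assumptions:

    # robot-detection-init.tcl (illustrative)
    # Refresh the robots table at startup when it is stale; trap errors
    # so a failed fetch of the Web Robots DB cannot abort server startup.
    if {[catch {ad_update_robot_list} errmsg]} {
        ns_log Error "robot-detection: robot list update failed: $errmsg"
    }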

2.2.8. User Interface

There is one page for ACS Administrators. It displays the current parameters and a list of robots, plus a link to manually refresh the robots table.

2.2.9. Configuration/Parameters

Name                 Description                                            Default
WebRobotsDB          The URL of the Web Robots DB text file                 http://info.webcrawler.com/mak/projects/robots/active/all.txt
FilterPattern        A CSV string containing the URLs for ad_robot_filter
                     to check                                               /members-only-stuff/*
RedirectURL          The URL where robots should be sent                    /robot-heaven/
RefreshIntervalDays  How frequently (in days) the robots table should be
                     refreshed from the Web Robots DB                       30

2.2.10. Future Improvements/Areas of Likely Change

2.2.11. Authors

Roger Hsueh

2.2.12. Revision History

Document Revision #  Action Taken, Notes                                        When?       By Whom?
0.3                  Revised document based on comments from Kevin Scaldeferri  2000-12-13  Roger Hsueh
0.2                  Revised document based on comments from Michael Bryzek     2000-12-07  Roger Hsueh
0.1                  Created initial version                                    2000-12-06  Roger Hsueh