Chapter 2. Developer's guide

2.1. Web Robot Detection Requirements

2.1.1. Introduction

Search engines use web robots to periodically retrieve pages from sites for indexing. However, robots cannot access areas that require users to log in, even though those areas often hold database content that should be open to public searches. The site administrator can set up a dedicated area on the site to serve content to robots; it is then up to robot detection to send robots to the right place.

2.1.2. Vision Statement

Without search engines, people would be lost on the Internet. However, personalized systems like ACS keep much of their content behind login pages, where the software robot crawlers used by search engines cannot reach it. To increase a site's visibility, site owners need a tool to identify visiting robots and present them with content to be indexed. The Web Robot Detection package fulfills the role of such a tool.

2.1.3. Web Robot Detection Overview

Web Robot Detection is an application package that defines a data model and some code to handle traffic from search-engine robots. It has the following components:

  • A data model for storing information about known search-engine robots on the web.

  • A mechanism to maintain the list of robots and keep it in sync with the database.

  • A mechanism based on the ACS Kernel to specify the paths from which to redirect robots and the target to direct the robots to.

  • Code that makes use of the request processor filter provided by ACS Kernel 4.0.

2.1.4. Use-cases and User-scenarios

The Web Robot Detection package is not meant to be used by regular users. Instead, the site-wide administrator is responsible for mapping directories that search engines cannot access to a "robot heaven", which has been set up to provide content suitable for indexing.

The site-wide administrator would typically download the robot-detection package, install it using the APM (ArsDigita Package Manager), and set up the parameters. The administrator would then check the administration page to see the current parameters and the list of identifiable robots, build the "robot heaven", and verify that the whole setup works. Afterward, this package requires no additional maintenance.

A software robot making an HTTP request to a part of the site that requires login would be automatically redirected to the "robot heaven".
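The redirect scenario above can be sketched as a small decision function. This is only an illustration in Python, not the package's own code: the path prefixes, the "robot heaven" location, the function names, and the choice of case-insensitive substring matching on the User-Agent value are all assumptions made for the example.

```python
from typing import Iterable, Optional

# Hypothetical configuration: path prefixes that require login, and the
# "robot heaven" path that robots are sent to instead.
PROTECTED_PREFIXES = ("/members/", "/private/")
ROBOT_HEAVEN = "/robot-heaven/"


def is_robot(user_agent: str, known_agents: Iterable[str]) -> bool:
    """Return True if the User-Agent value contains a known robot signature.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    ua = user_agent.lower()
    return any(sig.lower() in ua for sig in known_agents if sig)


def redirect_target(path: str, user_agent: str,
                    known_agents: Iterable[str]) -> Optional[str]:
    """Return the path to redirect to, or None to serve the request normally."""
    if not path.startswith(PROTECTED_PREFIXES):
        return None  # public area: nothing to do
    if is_robot(user_agent, known_agents):
        return ROBOT_HEAVEN  # identified robot asking for a login-only area
    return None  # regular user: the normal login flow applies
```

Only requests that are both from an identified robot and aimed at a protected path are redirected; everything else passes through untouched.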

2.1.6. Requirements: Data Model

  • 10.10.0 Store information about robots

  • 10.10.5 A primary key to identify each individual robot

  • 10.10.7 Fields for the UI: the name of the robot and a URL for more information about the robot

  • 10.10.10 The User-Agent header value, which is necessary to find out which connections come from a robot

  • 10.10.15 Insertion date, to keep track of when the robots table was last refreshed
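As a rough illustration of the record these requirements describe, the fields could be modelled as below. This is a Python sketch rather than the package's actual database schema; the class name, field names, and the helper function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Robot:
    """One known robot; field names are illustrative, not the real schema."""
    robot_id: int              # 10.10.5: primary key
    name: str                  # 10.10.7: display name for the UI
    details_url: str           # 10.10.7: where to read more about the robot
    user_agent: str            # 10.10.10: value matched against User-Agent
    insertion_date: datetime   # 10.10.15: when this row was loaded


def last_refreshed(robots: Iterable[Robot]) -> Optional[datetime]:
    """Return the most recent insertion date, or None if there are no rows."""
    return max((r.insertion_date for r in robots), default=None)
```

Keeping the insertion date on each row is what lets the system answer "when was the table last refreshed?" without a separate bookkeeping table.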

2.1.7. Requirements: API

  • 20.10.10 On server restart, check when the robot information was last refreshed; if that was longer ago than the value specified in the package's parameter, run the procedure (20.20.0) to refresh the robot information in the database

  • 20.10.15 On server restart, if the robot information is not present in the database, run the refresh procedure (20.20.0) to obtain it

  • 20.20.0 A way to automatically gather information about web robots from a website that keeps such data

  • 20.30.0 A way to detect robots based on the User-Agent field of the robot's HTTP request headers

  • 20.40.0 A way to redirect an identified robot to another path on the same site
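The startup checks in 20.10.10 and 20.10.15 amount to one decision: refresh if there is no data, or if the data is too old. A minimal Python sketch follows; the function name, the `max_age_days` parameter, and the choice of days as the unit are assumptions, since the requirements only say the limit comes from a package parameter.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_refresh(last_refresh: Optional[datetime], now: datetime,
                  max_age_days: int) -> bool:
    """Decide at server start whether to run the refresh procedure (20.20.0)."""
    if last_refresh is None:
        # 20.10.15: no robot information in the database yet.
        return True
    # 20.10.10: the information is older than the configured limit.
    return now - last_refresh > timedelta(days=max_age_days)
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check easy to test.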

2.1.8. Requirements: Site Administrator Interface

  • 30.10.0 Display current parameters

  • 30.20.0 Display the list of robots known to the system

  • 30.30.0 A way to refresh the robot list on demand by calling the procedure (20.20.0) to refresh the robot information in the database

2.1.9. Revision History

Document Revision #  Action Taken, Notes                                                 When?       By Whom?
0.4                  Added more detailed requirements, based on suggestions from Kai Wu  2001-01-23  Roger Hsueh
0.3                  Revised document based on comments from Kevin Scaldeferri           2000-12-13  Roger Hsueh
0.2                  Revised document based on comments from Michael Bryzek              2000-12-07  Roger Hsueh
0.1                  Create initial version                                              2000-12-06  Roger Hsueh

Last modified: $Date: 2001/04/20 20:51:22 $