User Profiling

part of the ArsDigita Community System by Philip Greenspun, Tracy Adams, and Michael Yoon

The Categories Themselves

The profiling system provides an unlimited number of dimensions along which to place a user's interest. Each dimension is represented in the following table:

create table categories (
	category_id	integer not null primary key,
	category	varchar(50) not null,
	category_description    varchar(4000),
	-- e.g., for a travel site, 'country', or 'activity' 
	-- could also be 'language'
	category_type	varchar(50),
	-- language probably would weight higher than activity 
	profiling_weight	number default 1 check(profiling_weight >= 0),
	enabled_p	char(1) default 't' check(enabled_p in ('t','f')),
	mailing_list_info	varchar(4000)
);
Note that this is the same table that holds the dimensions for classifying content on a site. If this were not so, it would be painful to match up users and content.

The category_type column bears mentioning. It isn't used for score computation but only for user interface when querying the user or giving options for the site administrator. The profiling_weight column may be used to disable scoring along a certain dimension or to make language preference much more important than other preferences.

The mailing_list_info column is used by the software underneath /mailing-list/. It can contain information such as "We expect to send you email once every month if you sign up." The category_description can be used in other places where a user simply wants clarification about a category.

Optional Hierarchy

Suppose that the flat category_type system doesn't work for you. You are running a travel site and want to present the "Appalachian Trail" category to folks who have selected one of the states through which the AT runs (Georgia, Tennessee, North Carolina, Virginia, West Virginia, Maryland, Pennsylvania, New Jersey, New York, Connecticut, Massachusetts, Vermont, New Hampshire, Maine).

You represent this hierarchy with the following table:


create table category_hierarchy (
   parent_category_id     integer references categories,
   child_category_id      integer regerences categories
   unique (parent_category_id, child_category_id)
);
Notice that this isn't a strict hierarchy; a category can have more than one parent. Also notice that we use UNIQUE rather than PRIMARY KEY. That is because we signify the top-level hierarchies with a NULL parent_category_id column.

Mapping content to categories

Similar to the approach taken in the site_wide_search and general_comments module, one table stores all the mappings of content to categories. These content items will be in stored in disparate Oracle tables. Note that this replaces some older tables such as static_categories (mapped static pages to categories).

create sequence site_wide_cat_map_id_seq;

create table site_wide_category_map (
             map_id                  integer primary key,
	     category_id             not null references categories,
	     -- We are mapping a category in the categories table
	     -- to another row in the database.  Which table contains
	     -- the row?
             on_which_table          varchar(30) not null,
	     -- What is the primary key of the item we are mapping to?
	     -- With the bboard this is a varchar so we can't make this
	     -- and integer
             on_what_id              varchar(500) not null,
	     mapping_date	     date not null,
	     -- how strong is this relationship?
	     -- (we can even map anti-relationships with negative numbers)
	     mapping_weight          integer default 5 
				     check(mapping_weight between -10 and 10),
	     -- A short description of the item we are mapping
	     -- this enables us to avoid joining with every table
	     -- in the ACS when looking for the most relevant content 
	     -- to a users' interests
	     -- (maintain one_line_item_desc with triggers.)
             one_line_item_desc      varchar(200) not null,
	     mapping_comment         varchar(200),
	     -- only map a category to an item once
             unique(category_id, on_which_table, on_what_id)
);
To build user and admin interfaces when querying site_wide_category_map, we use the central table_acs_properties, which is shared by side-wide index and the general comments facility.

create table table_acs_properties (
             table_name      varchar(30) primary key,
             section_name    varchar(100) not null,
             user_url_stub   varchar(200) not null,
             admin_url_stub  varchar(200) not null
);
Here is an example entry for the bboard table: You're probably thinking that it would be nice to have table_name reference the Oracle data dictionary view user_tables but this doesn't seem to work.

To build admin pages for inserting, updating, and deleting data in the site_wide_category_map table, you can use the Tcl procs:

ad_categorization_widget

Call ad_categorization_widget within an HTML form to create a user interface for categorizing a specific row in the database. It returns a selection widget that contains options for each category; already mapped categories are pre-selected, and, for each category, the category_type (if one exists) is included in parentheses. If a category hierarchy is defined, then indentation makes it visually apparent.

Suppose that you're running a restaurant-rating site like ZAGAT.COM and you want to categorize restaurants geographically. The ad_categorization_widget for a restaurant with multiple locations like Nobuyuki Matsuhisa's excellent Nobu would look something like this:

ad_categorization_widget takes the following parameters: ad_categorization_widget is declared with ad_proc so you must call it with named arguments, e.g.:
ad_categorization_widget -db $db -which_table "restaurants" -what_id 123

ad_categorize_row

Given a specific row in the database (identified by the which_table and what_id parameters) and the list of categories to which that row should be mapped (the category_id_list parameter), ad_categorize_row will update site_wide_category_map appropriately, inserting any new mappings and deleting any previously existing mappings to categories that are not in category_id_list. Thus, if category_id_list is empty, then all mappings to the row will be deleted.

In addition to which_table, what_id, and category_id_list, ad_categorize_row also takes the following parameters:

Like ad_categorization_widget, you must call ad_categorize_row with named arguments, e.g.:
set category_id_list [util_GetCheckboxValues [ns_conn form] "category_id_list"]
if { $category_id_list != 0 } {
     ad_categorize_row -db $db -which_table "restaurants" -what_id 123 \
          -category_id_list $category_id_list -one_line_item_desc "Nobu"
}
Note that ad_categorize_row will use the same mapping_weight (as well as one_line_item_desc and mapping_comment) for all the categories in category_id_list, so you will need to call it multiple times if you have differently weighted mappings.

For instance, suppose you want to categorize a travel article about surfing near Santa Cruz with a mapping_weight of 10 for "Surfing" (because its great pictures and stories will be very interesting to all surfers, even if the reader is nowhere near California) but a mapping_weight of 2 for "Santa Cruz, CA" (because the average tourist will not find it interesting, not to mention the fact that it's only near Santa Cruz, not in Santa Cruz). You will have to call ad_categorize_row twice:

ad_categorize_row -db $db -which_table "travel_articles" -what_id 456 \
     -category_id_list surfing_category_id -mapping_weight 10 \
     -one_line_item_desc "Surfin' Near Santa Cruz"

ad_categorize_row -db $db -which_table "travel_articles" -what_id 456 \
     -category_id_list Santa_Cruz_CA_category_id -mapping_weight 2 \
     -one_line_item_desc "Surfin' Near Santa Cruz"

Recording User Interests

For a specific user, the level of interest (or lack thereof) is recorded in

create table users_interests (
	user_id		integer not null references users,
	category_id	integer not null references categories,
	-- 0 is same as NULL; -10 is "hate this kind of stuff"; 
	-- 5 is "said I liked it", 10 is "love this kind of stuff"
	interest_level	integer default 5 check(interest_level between -10 and 10),
	interest_date	date,
	unique(user_id, category_id)
);
As the comment above notes, the default behavior when a user signs up for a mailing list or whatever is to assign an interest level of 5. This means "moderately interested". If we ever find out that a user doesn't like something, we can record that with a negative interest_level.

Putting It All Together (Example Queries)

A query to produce links to the most relevant content (to a particular user's interests) on the site in descending order:

select one_line_item_description, user_url_stub, section_name
from site_wide_category_map, categories, users_interests, table_acs_properties
where users_interests.user_id = $user_id
  and categories.category_id = users_interests.category_id
  and site_wide_category_map.category_id = categories.category_id 
  and site_wide_category_map.table_name = table_acs_properties.table_name
group by one_line_item_description, user_url_stub, section_name
order by (sum(categories.profiling_weight * users_interests.interest_level)) desc
Combining keyword and categorical searches:

select 
    the_key, 
    sum(search_score)+sum(category_score) as total_score, 
    section_name, 
    user_url_stub, 
    one_line_description 
from (select 
        site_wide_index.one_line_description, 
        site_wide_index.the_key,
        site_wide_index.table_name, 
        score(10) as search_score, 
        0 as category_score
      from site_wide_index
      where (contains(indexed_stuff,'about($user_entered_query)',10) > 0) 
      union
      select 
        site_wide_category_map.one_line_item_desc, 
        on_what_id,
        on_which_table, 
        0 as search_score, 
        sum(profiling_weight) as category_score
      from site_wide_category_map, categories
      where site_wide_category_map.category_id in ('[join $category_id_list "','"]')
      and site_wide_category_map.category_id = categories.category_id
      group by on_what_id, one_line_item_desc, on_which_table
    ) site_wide, table_acs_properties
where site_wide.table_name = table_acs_properties.table_name
group by the_key, one_line_description, section_name, user_url_stub
order by total_score desc

philg@mit.edu