Working With Intermedia

part of ACS lessons by Michael Bryzek

If you're like me, you love the idea of Intermedia - full text indexing, decent performance, tight integration with sql. But, if you're like me, Intermedia is the part of Oracle that doesn't let you sleep at night.

While working on GuideStar with a lot of other really good programmers, we decided to use Intermedia to add a keyword search box to our site. Following Phil's recommendation, we built a Site-wide Search, that we called search. We were going to use Intermedia to index a columm, keywords, in that table. Sounds simple, and it would have been, except that our data set contained approximately 700,000 rows, 3,000-5,000 of which were updated daily. (Note: Alpha, one of the guys who wrote Intermedia, says 700,000 rows is nothing for Intermedia. I, however, disagree.)

Creating Intermedia Indexes

Once we had populated the table search with a pl/sql procedure, we ran
SQL> create index search_im_keyword_idx on search(keywords) indextype is ctxsys.context;
to create an intermedia index on search(keywords). Note that the "im" in the name of the index is just there to help me identify Intermedia indexes more easily. About 2 1/2 hours later (on a Sun E450 with 4 processors and 4gb of RAM) we saw:
Index created.
And we were in business. This one line sql command was enough to give us the performance we wanted:
SQL> set timing on
SQL> select count(1) 
   from search 
   where contains(keywords, 'guidestar', 1) > 1
   order by score(1);

  COUNT(1)
----------
	 3

Elapsed: 00:00:00.27

Updating Intermedia Indexes

Unlike normal oracle indexes, intermedia indexes are not updated after the execution of dml statements. This means that we manually have to update the Intermedia indexes we are using. There are at least two ways to update these indexes:
  1. Run the context server to automatically update your indexes. To start context server, get yourself a unix shell and run:
     > ctxsrv -user ctxsys/ctxsyspwd -personality M -log $ORACLE_HOME/ctx/ctxsys.log &
    where ctxsyspwd is the password for the oracle user ctxsys. As it good as it sounds, running the context server is not the recommend method to keep the indexes up-to-date and the context server is already deprecated.

  2. Manually update the index on a regular basis. To update an intermedia index, open a sqlplus session as the user who created the index. Then:
    SQL> alter index search_im_keyword_idx rebuild online parameters('sync memory 45M');
    Index altered.
    
    The parameters of this call:
    • rebuild online: Rebuild the index incrementally so that it is still accessible. If you forget to specify "online," users searching the website would not be able to access the index.
    • sync: Just update the index (rather than rebuild it entirely)
    • memory 45M: Specify the amount of memory, in megabytes, that Intermedia can use while rebuilding the index. Using more memory decreases the amount of disk I/O and thus increases the speed with which the index can be rebuilt. The default is 12 megabytes and the acceptable range of inputs for memory is 1 KB to about 50 MB. To increase the maximum allowable index memory:
       SQL> exec ctx_adm.set_parameter('max_index_memory', '1G');
    • More information available in the Oracle Documentation
Since the context server is deprecated, we use the second method. From start to finish, the update of our Intermedia index looks like:
  1. We have triggers on all the tables that feed into the site wide index. Whenever a row in one of these tables is updated, the "dirty bit" in the site wide search table is set
  2. Every hour we run a tcl procedure that updates the dirty rows in the search table by calling a pl/sql procedure. The aolserver call looks like:
        ns_ora exec_plsql $db "begin :1 := 1; keyword_datastore_gasket('$ein'); end;" 
    where the :1 := 1 is just a hack to work around exec_plsql behavior
  3. Every hour we also run a tcl procedure that executes an "alter index" statement to update the intermedia indexes:
        ns_db dml $db "alter index search_im_keyword_idx rebuild online parameters('sync memory 45M')"
When manually altering intermedia indexes, you have to make sure that no more than 1 of these alter index statements is running at any given time. More than 1 statement could result in a deadlock inside intermedia that can only be solved by dropping and recreating the index.

Useful Intermedia Trivia

If you ever wanted to know how many rows are waiting to be synced with each of your Intermedia indexes, you can do:
    SQL> select PND_INDEX_NAME, count(1) from ctx_user_pending group by PND_INDEX_NAME;
You can also look at Intermedia errors to find out what went wrong by opening a sqlplus session as the ctxsys user and running:
    SQL> select err_timestamp, err_index_name, err_text from ctx_index_errors order by err_timestamp;
to get a list of all the errors on intermedia indexes, sorted by date.

Where the Problems Start

A few times the update of the intermedia index failed. One night as I was headed to bed, I decided to update the index overnight.
 
   SQL> alter index search_im_keyword_idx rebuild online parameters('sync memory 45M');
About two hours into my dreams, the phone rang to wake me up - all searches were failing on the website. Users were simply getting back a page with zero results, regardless of their query. About fifteen minutes later, I'm back at the computer where I ran the update index statement. I saw that my index was in a FAILED/LOADING state. Had I been smart, I would have logged what was going on in context by executing:
   SQL> exec ctx_output.start_log('better_days');           
before running my query to log its progress. This starts context logging to $ORACLE_HOME/ctx/log/better_days.

Whenever an index update fails, try a resume statement that should finish up the work from before:

 
   SQL> alter index search_im_keyword_idx rebuild online parameters('resume memory 45M');
The resume statement has never worked for me but maybe you'll have better luck. Probably, you'll just see the same error message you had before. Somehow the index has become corrupted:
 
  SQL> drop index search_im_keyword_idx force;
  SQL> create index search_im_keyword_idx on search(keywords) indextype is ctxsys.context;
And 2 1/2 hours later, the index was re-created and back on-line. Meanwhile, our website was not serving search results for over three hours.

What Went Wrong?

The hourly procedure running inside AOLServer to update the Intermedia index started up in parallel to the one I manually executed in sqlplus. These two threads competed for some resource, and ended up in a deadlock.

I have also seen the same error while running context server and executing the an alter index statement in sqlplus. Alpha tells me that the deadlock shouldn't happen, and I agree with him. However, the reality of the deadlock gives us the following Intermedia law: Use exactly one method to update Intermedia indexes. It is also worth mentioning that an uncommitted sql session can keep the alter index statement from completing.

Note that the two threads were able to become deadlocked simply because we had a lot of data. On a small data set, it is much less likely that two threads would be running at once since each thread would complete much more quickly.

Using Intermedia on the web

The queries humans write will most certainly not be Intermedia friendly. The following procedure provided by some folks at Intermedia takes text written by a human and returns a string suitable for feeding into Intermedia:
- -- We will assume that the users are not going to use any intermedia text syntax
- -- So all intermedia text keywords will be braced to escape them

create or replace function im_convert(
        query in varchar2 default null
        ) return varchar2
is
  i   number :=0;
  len number :=0;
  char varchar2(1);
  minusString varchar2(256);
  plusString varchar2(256); 
  mainString varchar2(256);
  mainAboutString varchar2(256);
  finalString varchar2(256);
  hasMain number :=0;
  hasPlus number :=0;
  hasMinus number :=0;
  token varchar2(256);
  tokenStart number :=1;
  tokenFinish number :=0;
  inPhrase number :=0;
  inPlus number :=0;
  inWord number :=0;
  inMinus number :=0;
  completePhrase number :=0;
  completeWord number :=0;
  code number :=0;  
begin
  
  len := length(query);

- -- we iterate over the string to find special web operators
  for i in 1..len loop
    char := substr(query,i,1);
    if(char = '"') then
      if(inPhrase = 0) then
        inPhrase := 1;
        tokenStart := i;
      else
        inPhrase := 0;
        completePhrase := 1;
        tokenFinish := i-1;
      end if;
    elsif(char = ' ') then
      if(inPhrase = 0) then
        completeWord := 1;
        tokenFinish := i-1;
      end if;
    elsif(char = '+') then
      inPlus := 1;
      tokenStart := i+1;
    elsif((char = '-') and (i = tokenStart)) then
      inMinus :=1;
      tokenStart := i+1;
    end if;

    if(completeWord=1) then
      token := '{ '||substr(query,tokenStart,tokenFinish-tokenStart+1)||' }';      
      if(inPlus=1) then
        plusString := plusString||','||token||'*10';
        hasPlus :=1;    
      elsif(inMinus=1) then
        minusString := minusString||'OR '||token||' ';
        hasMinus :=1;
      else
        mainString := mainString||' NEAR '||token;
        mainAboutString := mainAboutString||' '||token; 
        hasMain :=1;
      end if;
      tokenStart  :=i+1;
      tokenFinish :=0;
      inPlus := 0;
      inMinus :=0;
    end if;
    completePhrase := 0;
    completeWord :=0;
  end loop;

  -- find the last token
  token := '{ '||substr(query,tokenStart,len-tokenStart+1)||' }';
  if(inPlus=1) then
    plusString := plusString||','||token||'*10';
    hasPlus :=1;        
  elsif(inMinus=1) then
    minusString := minusString||'OR '||token||' ';
    hasMinus :=1;
  else
    mainString := mainString||' NEAR '||token;
    mainAboutString := mainAboutString||' '||token; 
    hasMain :=1;
  end if;

  
  mainString := substr(mainString,6,length(mainString)-5);
  mainAboutString := replace(mainAboutString,'{',' ');
  mainAboutString := replace(mainAboutString,'}',' ');
  plusString := substr(plusString,2,length(plusString)-1);
  minusString := substr(minusString,4,length(minusString)-4);

  -- we find the components present and then process them based on the specific combinations
  code := hasMain*4+hasPlus*2+hasMinus;
  if(code = 7) then
    finalString := '('||plusString||','||mainString||'*0.1,about('||mainAboutString||')*0.5) NOT ('||minusString||')';
  elsif (code = 6) then  
    finalString := plusString||','||mainString||'*0.1'||',about('||mainAboutString||')*0.5';
  elsif (code = 5) then  
    finalString := '('||mainString||',about('||mainAboutString||')) NOT ('||minusString||')';
  elsif (code = 4) then  
    finalString := mainString;
    finalString := replace(finalString,'*1,',NULL); 
    finalString := '('||finalString||')*0.1,about('||mainAboutString||')';
  elsif (code = 3) then  
    finalString := '('||plusString||') NOT ('||minusString||')';
  elsif (code = 2) then  
    finalString := plusString;
  elsif (code = 1) then  
    -- not is a binary operator for intermedia text
    finalString := 'totallyImpossibleString'||' NOT ('||minusString||')';
  elsif (code = 0) then  
    finalString := '';
  end if;

  return finalString;
end;
/ 
Words of Caution: Oracle 8.1.5.0 has a bug in the parser for the about clause. The bug seems to incorrectly map multi word queries to their corresponding tokens in the Intermedia tables. For example, about('massachusetts bay'), which is tokenized as 'Massachusetts Bay', is parsed as 'Massachusetts bay', not matching any row in the Intermedia tokens table and thereby throwing an error. If you find yourself in this situation, simply get rid of the about clauses in the above pl/sql function.

Optimizing Intermedia

As you continue to update your intermedia index, the tables Intermedia maintains will become more and more fragmented. Once in awhile, you might want to optimize your index by executing the following from sqlplus:
 
  SQL> ALTER INDEX newsindex rebuild parameters('optimize fast');
Optimization is an extremely slow process and it's not clear that the performance gain of an optimized index will be noticeable. If you are going to optimize, you should limit the amount of time that the optimization will take (180 = 180 minutes):
 
  SQL> ALTER INDEX newsindex rebuild parameters('optimize full maxtime 180');
The Intermedia documentation has more information about optimizing indexes.

Optimizing Web Searches with Intermedia

Unlike "normal" SQL queries, Intermedia actually handles the FIRST_ROWS hint very well. This means that you can sort large data sets incrementally (as opposed to sorting the entire data set to return the first 25 rows in sorted order):
SQL> set timing on
SQL> select /*+ FIRST_ROWS */ org_name
          from search 
          where contains(keywords, 'philanthropic', 1) > 0
          order by score(1);

250 rows selected.

Elapsed: 00:00:00.42
The search is optimized to return the first rows from the cursor. We could now display multi-page search results for organization names that matched the query for "philanthropic"... in .42 seconds, including the time my sqlplus session took to print out the org names. It is important to know that Intermedia does a better job with FIRST_ROWS than Oracle which leads us to second Intermedia law: When sorting through a large dataset, order by score(n) rather than a column not indexed by Intermedia.

mbryzek@arsdigita.com