Using AIAT
Volume Number: 14 (1998)
Issue Number: 4
Column Tag: Emerging Technologies
Using Apple Information Access Toolkit
by Mark Holtz
Apple's new indexing technology provides powerful ways to search and retrieve your data, regardless of how it's stored
Exactly What is AIAT, and What Can It Do For You?
The Apple Information Access Toolkit (AIAT) is a library of routines designed to distill, index, and query collections of textual data. The elegance of the technology stems from its complete independence from the actual data source it is indexing. It can be used in a fully-interfaced, stand-alone application, or in a memory-constrained plug-in. You can feed it everything from your next-generation database to a catalog of your Compact Disc collection. Best of all, rumor has it that AIAT will be a part of Rhapsody, so your hard work now will pay off in the future.
Sounds great! How do I use it?
If you haven't guessed by now, you're going to have to get your hands dirty and write some code. AIAT provides a set of C++ objects that can be sub-classed to provide the desired functionality. Strange runtime quirks notwithstanding, the overall process is actually easier than it may first appear. In a recent development project, I created an AIAT-based plug-in that indexed and queried data from a third-party database, and building it took less than two days. This article will focus on that development project and some of the lessons I learned. The most important lesson was that AIAT will do a lot of the work for you, but you have to tell it exactly what you want.
AIAT: A Technology Overview
Upon unwrapping the AIAT package, you will find a nice, clean set of components to speed you on your way. Documentation is provided in Adobe PDF format, and is relatively comprehensive in scope, but occasionally lacks the necessary depth to help you fully understand a particular object. Next comes a set of 68K and PowerPC libraries for Metrowerks' CodeWarrior 11 and Pro 1. Where there are libraries, there are also headers. AIAT's headers comprise what I consider to be the missing portions of the documentation. Not only are they slightly more current than the documentation, but they also provide the implementation details one needs to avoid MacsBug. The C++ classes in AIAT are robust, but you only need to deal with a small number of methods to create a working application. Finally, there are two example applications, one that indexes files in a folder and another that lets you query against the generated indices. They are fairly complete in their coverage of basic AIAT concepts from a client perspective, but give only marginal insight into how new data sources can be interfaced to AIAT.
The AIAT documentation stresses thorough design and analysis. Having a clear picture of which objects deal with which data and where that data goes will make your coding efforts much more enjoyable. Now that you've done that, let's get down to architecture. AIAT functionality is broken up into six major categories: Index, Analysis, Corpus, Accessor, Storage, and Storable. The Index classes handle creation and maintenance of keyword lists and document references. The Analysis classes are responsible for generating and filtering keywords based on various criteria. The Corpus classes are an abstraction layer for obtaining data from various sources. The Accessor classes provide query and statistical information from indices. Finally, the Storage and Storable classes provide a flexible mechanism for the storage and retrieval of arbitrary sets of data. A quick browse through the headers shows an abstract C++ object for each of these categories, such as the class IACorpus.
You will also find various utility functions, including a set of memory management calls. You can substitute your own allocator and deallocator routines easily, and AIAT will happily use them for block and object allocations. There are a few robust concrete subclasses provided, which you can use directly or subclass to enhance their functionality. The HFSCorpus and HFSTextFolderCorpus classes allow AIAT to index collections of text documents. The EnglishAnalysis class allows AIAT to filter document keywords with arbitrary stop-word and word-stem lists. AIAT makes use of its own exception handling mechanism, based on C++ exceptions with a few additions. A little exploration through all the components of the AIAT distribution is time well spent, as there are a number of classes and functions not specifically called out in the documentation.
The Task at Hand
I was first introduced to AIAT early in 1997 during a conversation with a good friend of mine. He suggested that I look into it as a possible full-text indexing solution for a Mac-based Web site we were creating. As the months passed, we settled on Purity Software's WebSiphon product as the back-end scripting engine for this Web site. WebSiphon includes a fast flat-file database called Verona. It had basic search capabilities, but they were not robust enough to support the kind of ranked queries you'd find on more powerful Full-Text Indexing Systems. WebSiphon's language is extensible via Code Fragment libraries, so the idea for an AIAT library for WebSiphon became an appealing solution. We could collect information from a variety of sources with WebSiphon, store it in Verona, index it using AIAT, and then query against that data using WebSiphon scripts. Since AIAT is simply a set of libraries with little runtime environment dependency, the task was not daunting. The project was distilled into a few discrete tasks: Create the scripting interface for WebSiphon, write the C++ code to interface with the AIAT accessor and index classes, write the C++ code for interfacing AIAT to Verona, and make the entire thing work in a multi-threaded environment. Each phase of the project involved one of AIAT's major areas of functionality, so I was able to focus on one set of concepts at a time, a tribute to AIAT's modular design.
Habeas Corpus
The first task to tackle was to provide AIAT with an interface to Verona so it could access the reams of data our site would produce. This involved getting familiar with AIAT's Corpus classes. In AIAT, the Corpus provides access to a collection of "documents" and the data they contain. The IADoc class is the abstract representation for these documents, and it consists of a name for the document and access methods for the data. In this case, our documents were records in the Verona database, and the data would be accessed via an API to the Verona application instead of reading it directly from a file. After a little bit of work with the Code Fragment Manager, I had a functional C++ interface to Verona, so it was time to tell AIAT how to deal with it. First, I created the CVeronaCorpus class, and defined the two pure virtual methods required to make it work: GetProtoDoc and GetDocText.
Listing 1: CVeronaCorpus.h
CVeronaCorpus
Class definition and required methods for our Corpus interface to a Verona database.
class CVeronaCorpus : public IACorpus
{
public:
    CVeronaCorpus(CVeronaGlue* inVeronaGlue,
                  const char* inDBName);
    virtual ~CVeronaCorpus();

    virtual IADoc* GetProtoDoc();
    virtual IADocText* GetDocText(const IADoc* doc);
    virtual IADocIterator* GetDocIterator();
    .
    .
    .
};

IADoc* CVeronaCorpus::GetProtoDoc()
{
    return new CVeronaDoc(this, 0);
}

IADocText* CVeronaCorpus::GetDocText(const IADoc* doc)
{
    return new CVeronaDocText(this, (CVeronaDoc*) doc);
}

IADocIterator* CVeronaCorpus::GetDocIterator()
{
    return new CVeronaDocIterator(this);
}
Phew! That wasn't so bad. GetProtoDoc() returns a new CVeronaDoc object, and GetDocText() returns a new CVeronaDocText object. But what are these objects? CVeronaDoc is a subclass of IADoc, AIAT's abstract representation of a document within a Corpus. CVeronaDocText is a subclass of IADocText, which is responsible for providing the actual text of the document to AIAT's indexing functions. The third method, which is not required by AIAT to make a valid Corpus mechanism, is GetDocIterator(). The IADocIterator class is used to implement a Corpus that consists of multiple documents, and to provide each of those documents to AIAT in a consistent, ordered fashion. You may notice as you peruse the AIAT documentation that many functions deal with IADoc objects as a fundamental unit of data. It is up to the Corpus to determine what that unit of data is and how to return it to AIAT when it's requested. Here are the definitions of the other Corpus subclasses I created:
Listing 2: CVeronaCorpus.h (cont'd.)
CVeronaDoc, CVeronaDocText, CVeronaDocIterator
Classes for representing the various sub-elements of CVeronaCorpus.
class CVeronaDocIterator : public IADocIterator
{
public:
    CVeronaDocIterator(CVeronaCorpus* inCorpus);
    virtual ~CVeronaDocIterator();

    virtual IADoc* GetNextDoc();

private:
    CVeronaCorpus* mCorpus;
    unsigned long mCurrentIndex;
};

class CVeronaDoc : public IADoc
{
public:
    CVeronaDoc(CVeronaCorpus* inCorpus,
               unsigned long inRecRef);
    virtual ~CVeronaDoc();

    IAStorable* DeepCopy() const;
    IABlockSize StoreSize() const;
    void Store(IAOutputBlock* output) const;
    IAStorable* Restore(IAInputBlock* input) const;
    bool LessThan(const IAOrderedStorable* neighbor) const;
    bool Equal(const IAOrderedStorable* neighbor) const;

    virtual byte* GetName(uint32 *length) const;
    unsigned long GetRecRef(void) { return mRecRef; }

protected:
    virtual void DeepCopying(const IAStorable* source);
    virtual void Restoring(IAInputBlock* input,
                           const IAStorable* proto);
    .
    .
    .

private:
    CVeronaCorpus* mCorpus;
    unsigned long mRecRef;
};

class CVeronaDocText : public IADocText
{
public:
    CVeronaDocText(CVeronaCorpus* inCorpus,
                   CVeronaDoc* inDoc);
    virtual ~CVeronaDocText();

    virtual uint32 GetNextBuffer(byte* buffer,
                                 uint32 bufferLen);
    virtual IADocText* DeepCopy() const;

protected:

private:
    CVeronaCorpus* mCorpus;
    CVeronaDoc* mDoc;
    byte* mBuffer;
    unsigned long mAmtRead;
    unsigned long mBufSize;
};
At this point, we've got all the elements for our Corpus implementation. However, it may still be unclear how these items work together. When AIAT receives a request to update a particular index, it starts a dialog with the Corpus object that is tied to that index. It starts by asking, "What sort of documents do you contain?" By calling GetProtoDoc(), the Corpus can supply AIAT with a "sample" document. AIAT then asks for an object that can iterate through all the documents in the Corpus' collection. If one is available, it is returned via the GetDocIterator() method. Since AIAT knows nothing about the particular data set it's indexing, the Corpus needs to provide these mechanisms. If a document iterator is available (which is true in this case), AIAT begins asking the iterator for successive documents in the collection.
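The dialog described above can be sketched with a few stand-in classes. This is only an illustration of the call sequence, not AIAT code: SketchDoc, SketchDocIterator, SketchCorpus, and WalkCorpus are hypothetical names, and the real IACorpus interface has different signatures and ownership rules.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the AIAT classes.
struct SketchDoc {
    unsigned long recRef;  // record reference within the data source
};

class SketchDocIterator {
public:
    explicit SketchDocIterator(const std::vector<unsigned long>& refs)
        : mRefs(refs), mNext(0) {}

    // Returns the next document in a fixed order, or null when none remain.
    std::unique_ptr<SketchDoc> GetNextDoc() {
        if (mNext >= mRefs.size()) return std::unique_ptr<SketchDoc>();
        std::unique_ptr<SketchDoc> doc(new SketchDoc());
        doc->recRef = mRefs[mNext++];
        return doc;
    }

private:
    std::vector<unsigned long> mRefs;
    std::size_t mNext;
};

class SketchCorpus {
public:
    explicit SketchCorpus(const std::vector<unsigned long>& refs) : mRefs(refs) {}

    // A "sample" document that tells the indexer what kind of document to expect.
    std::unique_ptr<SketchDoc> GetProtoDoc() const {
        std::unique_ptr<SketchDoc> doc(new SketchDoc());
        doc->recRef = 0;
        return doc;
    }

    std::unique_ptr<SketchDocIterator> GetDocIterator() const {
        return std::unique_ptr<SketchDocIterator>(new SketchDocIterator(mRefs));
    }

private:
    std::vector<unsigned long> mRefs;
};

// The indexer's half of the dialog: prototype first, then iterator, then documents.
std::vector<unsigned long> WalkCorpus(const SketchCorpus& corpus) {
    std::vector<unsigned long> visited;
    std::unique_ptr<SketchDoc> proto = corpus.GetProtoDoc();
    assert(proto.get() != nullptr);
    std::unique_ptr<SketchDocIterator> it = corpus.GetDocIterator();
    if (!it) return visited;  // this sketch treats a missing iterator as "nothing to walk"
    while (std::unique_ptr<SketchDoc> doc = it->GetNextDoc()) {
        visited.push_back(doc->recRef);
    }
    return visited;
}
```

Walking a corpus built from record references 3, 7, and 9 visits them in exactly that order, which is the property the indexer relies on.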
AIAT makes two assumptions about documents that you must keep in mind. First, all documents in a particular Corpus are of the same type (in our case, CVeronaDoc). Second, the order of the documents is always the same for a particular set of documents. The latter is important because AIAT uses the document sequence for the indexing mechanism. Hence, the notion of a document being "Less than" another document really has to do with its order in this sequence. Be sure to be consistent for whatever kind of data you're delivering, and this should not be a problem. Getting back to the dialog, AIAT now has an IADoc it can work with. It starts by asking the Corpus, through GetDocText(), for an IADocText object that contains the document's data. In our example, the CVeronaDocText class knows how to access the text in a Verona database record, so our CVeronaCorpus object hands a fresh CVeronaDocText object back to AIAT. This object can access its parent CVeronaDoc object, and uses that link to obtain the record number in the database that the CVeronaDoc represents.
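As a small illustration of the ordering requirement, here is a sketch of a document type compared by record reference, plus a check that a sequence of documents is strictly ordered. SketchDoc and IsConsistentlyOrdered are hypothetical names; the real methods take IAOrderedStorable pointers, as shown in Listing 2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an IADoc subclass; ordering is by record reference.
struct SketchDoc {
    unsigned long recRef;

    bool LessThan(const SketchDoc& other) const { return recRef < other.recRef; }
    bool Equal(const SketchDoc& other) const { return recRef == other.recRef; }
};

// True when every document sorts strictly before its successor, i.e. the
// sequence an iterator hands out agrees with the LessThan() ordering.
bool IsConsistentlyOrdered(const std::vector<SketchDoc>& docs) {
    for (std::size_t i = 1; i < docs.size(); ++i) {
        if (!docs[i - 1].LessThan(docs[i])) return false;
    }
    return true;
}
```

A check like this is cheap to run in a debug build against whatever your iterator produces.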
Finally, it is time for AIAT to get the text of the document. It does this by calling CVeronaDocText's GetNextBuffer() method. This method returns the specified number of bytes from the document. Note that the CVeronaDocText object must maintain its own information about what data AIAT has already requested. It may help to think of CVeronaDocText as a one-way stream of data that is read in chunks of arbitrary size. AIAT will continue to call GetNextBuffer() until the method returns zero, indicating the end of the data.
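The "one-way stream read in chunks" idea can be sketched as follows. SketchDocText and ReadAll are hypothetical names, and an in-memory string stands in for the Verona record; the point is only that the object tracks how much it has already handed out and returns zero at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for an IADocText subclass, backed by a string.
class SketchDocText {
public:
    explicit SketchDocText(const std::string& text) : mText(text), mAmtRead(0) {}

    // Copies up to bufferLen bytes into buffer and returns the number copied.
    // Returns 0 once all of the text has been delivered.
    std::size_t GetNextBuffer(unsigned char* buffer, std::size_t bufferLen) {
        const std::size_t remaining = mText.size() - mAmtRead;
        const std::size_t n = remaining < bufferLen ? remaining : bufferLen;
        if (n > 0) std::memcpy(buffer, mText.data() + mAmtRead, n);
        mAmtRead += n;  // remember what the caller has already seen
        return n;
    }

private:
    std::string mText;
    std::size_t mAmtRead;
};

// The caller's side: keep asking for chunks until the stream reports zero.
std::string ReadAll(SketchDocText& text, std::size_t chunkSize) {
    assert(chunkSize > 0);
    std::string out;
    std::vector<unsigned char> chunk(chunkSize);
    std::size_t n;
    while ((n = text.GetNextBuffer(chunk.data(), chunk.size())) != 0) {
        out.append(reinterpret_cast<const char*>(chunk.data()), n);
    }
    return out;
}
```

Because the chunk size is the caller's choice, the stream should not assume any particular buffer length from one call to the next.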
AIAT will continue to call the iterator and resulting document and document text objects to obtain the complete set of data in the Corpus' collection. The other methods of IADoc are used to determine how a particular document should be placed in the index. It is necessary to override these in your subclasses so that they are meaningful to the data you're representing. In my case, the LessThan() and Equal() methods compare documents based on their specific Verona index number. Also amongst the methods of IADoc you'll have to override are Store(), Restore() and DeepCopy(). These methods handle converting the object into a data stream, creating an object based on a data stream, and making a complete and independent copy of an object. These functions are nicely explained in the AIAT documentation, except for the DeepCopying() and Restoring() methods, which are used to construct the superclasses of your document class in a DeepCopy or Restore situation. The only place these functions seem to be documented is within the header file of IADoc itself. This may be a perfect time to go back and browse the AIAT headers again.
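To make the three roles concrete, here is a minimal sketch of a document that serializes its record reference, rebuilds itself from bytes, and copies itself. It is an assumption-laden illustration: SketchDoc is a hypothetical class, a std::vector of bytes stands in for AIAT's output and input blocks, Restore() is a static function here rather than a method called on a prototype, and the record reference is assumed to fit in 32 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an IADoc subclass holding one record reference.
class SketchDoc {
public:
    explicit SketchDoc(unsigned long recRef) : mRecRef(recRef) {}

    // Number of bytes Store() will append.
    std::size_t StoreSize() const { return 4; }

    // Object -> data stream: append the reference as 4 big-endian bytes.
    void Store(std::vector<unsigned char>& out) const {
        for (int shift = 24; shift >= 0; shift -= 8) {
            out.push_back(static_cast<unsigned char>((mRecRef >> shift) & 0xFF));
        }
    }

    // Data stream -> object: read 4 big-endian bytes starting at offset.
    static SketchDoc Restore(const std::vector<unsigned char>& in,
                             std::size_t offset = 0) {
        assert(offset + 4 <= in.size());
        unsigned long ref = 0;
        for (std::size_t i = 0; i < 4; ++i) {
            ref = (ref << 8) | in[offset + i];
        }
        return SketchDoc(ref);
    }

    // A complete, independent copy.
    SketchDoc DeepCopy() const { return SketchDoc(mRecRef); }

    unsigned long GetRecRef() const { return mRecRef; }

private:
    unsigned long mRecRef;
};
```

Writing the bytes in a fixed order keeps the stored form independent of the host's byte order, which matters if an index might ever be read on a different machine.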
Now we have a complete Corpus structure for AIAT to handle Verona Databases. A bit of implementation here and there, and we're ready to start indexing and querying the data. However, before moving on to the next section, it is important to keep in mind that there is a lot of possible functionality within the Corpus classes that I have not covered here. For instance, one can filter documents selectively within their IADocIterator subclass, providing only the documents they want to AIAT. As an example, the HFSTextFolderCorpus class filters out files that aren't of type 'TEXT'. For the Verona interface, it was not necessary to implement a very complex Corpus interface, but be sure to explore the section fully and determine which features can help you in your application.
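Selective filtering inside an iterator might look like the sketch below, which skips anything that is not of type 'TEXT'. SketchFile and TextOnlyIterator are hypothetical names and a string stands in for the four-character file type; this is not how HFSTextFolderCorpus is implemented, only the general shape of the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a file in a folder.
struct SketchFile {
    std::string name;
    std::string type;  // stands in for a four-character file type
};

// Hands out only the files whose type is "TEXT", in their original order.
class TextOnlyIterator {
public:
    explicit TextOnlyIterator(const std::vector<SketchFile>& files)
        : mFiles(files), mNext(0) {}

    // Returns the next matching file, or nullptr when none remain.
    // The pointer stays valid for as long as the iterator is alive.
    const SketchFile* GetNextDoc() {
        while (mNext < mFiles.size()) {
            const SketchFile& candidate = mFiles[mNext++];
            if (candidate.type == "TEXT") return &candidate;
        }
        return nullptr;
    }

private:
    std::vector<SketchFile> mFiles;
    std::size_t mNext;
};
```

Since the filter only skips entries, the relative order of the surviving documents is unchanged, which keeps the iterator consistent with the ordering assumption discussed earlier.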
Indexing and You
The ultimate goal of AIAT is to generate an index of terms that can be used to satisfy queries about the database. AIAT provides two different indexing mechanisms to accomplish this goal. An Inverted Index contains a list of terms and references to documents that contain the term. A Vector Index contains a list of documents and the terms that each document contains. These two types of indices can be used separately or together to satisfy various types of queries. In AIAT, the two types of indices are combined to create a Ranked Index, capable of providing ranked results to keyword, boolean, and example-document queries. For the purposes of this project, we chose to use a Ranked Index because we would be searching our data in a variety of ways, and wanted to provide as many search methods as we could.
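The difference between the two index shapes is easy to see in a toy version built from standard containers. This sketch says nothing about AIAT's actual storage format; SketchIndexes, DocId, and AddDocument are hypothetical names used only to show "term to documents" next to "document to terms".

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

typedef unsigned long DocId;  // hypothetical document identifier

// Toy versions of the two index shapes, kept side by side.
struct SketchIndexes {
    // Inverted index: term -> set of documents containing that term.
    std::map<std::string, std::set<DocId> > inverted;
    // Vector index: document -> terms it contains, with occurrence counts.
    std::map<DocId, std::map<std::string, unsigned> > vectors;

    void AddDocument(DocId doc, const std::vector<std::string>& terms) {
        for (std::size_t i = 0; i < terms.size(); ++i) {
            inverted[terms[i]].insert(doc);
            ++vectors[doc][terms[i]];
        }
    }
};
```

A keyword lookup reads the inverted map directly, while a query-by-example needs the vector map to find out which terms characterize the example document.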
The next task in our development project was to actually index the data using AIAT. This is accomplished by using the Index, Corpus, Storage, and Analysis classes in concert. The code is simple and straightforward, and the examples are an excellent place to start. Here's a snippet of our indexing code, which is a slightly modified version of the example code:
IATry
{
    // GetIndex() makes calls to:
    //   MakeHFSStorage(), new CVeronaCorpus(dbName),
    //   new SimpleAnalysis(), and new InVecIndex()
    tIndex = mDataSource->GetIndex();
    tStorage = mDataSource->GetStorage();

    // Starting from scratch: initialize the storage, then the index.
    tStorage->Initialize();
    tIndex->Initialize();

    tIndex->SetFlushProgressFn(
        &CSTwinDSUpdateThread::sFlushUpdateFunc);
    tIndex->SetFlushProgressFreq(10);

    // Index the Corpus, then flush the results to storage.
    tIndex->Update();
    tStorage->Commit();
}
IACatch (const IAException& exception)
{
    ::SysBeep(0);
    WSAppendLog((char*) exception.What());
}
To create an index, a few other objects need to be created. First, we need to store the Index data somewhere. This is accomplished with the IAStorage class and its subclass, HFSStorage. AIAT provides a built-in mechanism for creating disk-based IAStorage objects, so one call to MakeHFSStorage() takes care of this task. Next, we need to establish a Corpus to index. Creating a new CVeronaCorpus object that points to the desired Verona database satisfies this requirement. The third part of the puzzle involves the method AIAT should use to analyze the Corpus' data. The IAAnalysis class is used for this operation, and the provided SimpleAnalysis class was fine for our immediate needs. It eliminates words under three characters in length, and changes all terms it finds to lower case letters. Finally, we create the Index object itself. Since we are using the Inverted Vector index, a call to new InVecIndex() with the pointers to the storage, corpus, and analysis objects will suffice nicely. In this project, I've wrapped most of the specific creation into the GetIndex() method, and it takes care of instantiating anything it may need.
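The two filtering rules just described (drop words under three characters, fold everything to lower case) can be sketched in a few lines. SketchAnalyze is a hypothetical function, not the SimpleAnalysis class, and its rule for where a word begins and ends (runs of ASCII letters and digits) is an assumption made for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits text into lower-cased terms and drops any term shorter than minLen.
// Word boundaries are any non-alphanumeric characters (an assumption here).
std::vector<std::string> SketchAnalyze(const std::string& text,
                                       std::size_t minLen = 3) {
    std::vector<std::string> terms;
    std::string word;
    // Iterate one past the end so the final word is flushed like any other.
    for (std::size_t i = 0; i <= text.size(); ++i) {
        const bool isWordChar =
            i < text.size() &&
            std::isalnum(static_cast<unsigned char>(text[i])) != 0;
        if (isWordChar) {
            word += static_cast<char>(
                std::tolower(static_cast<unsigned char>(text[i])));
        } else {
            if (word.size() >= minLen) terms.push_back(word);
            word.clear();
        }
    }
    return terms;
}
```

Given "Apple is on the Internet", this yields the terms "apple", "the", and "internet"; the two-letter words are discarded. A stop-word list, as mentioned for EnglishAnalysis, would be the natural next filter to remove "the".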
Initializing the index is our next task. Since we're creating this index from scratch, we call the Initialize() method of our IAIndex object. If we wanted to update an existing index, we would call the Open() method instead. AIAT does support incremental updates of an index, but it requires you to write some more logic in your Corpus classes. Due to limitations in the Verona database structure, we couldn't provide a clean method for determining when records had changed, so we didn't write the extra code to support incremental updating. To properly initialize a new index, you must first initialize the Storage, and then initialize the Index, using their respective Initialize() methods. The index is now cleared and open, awaiting new terms and data from the Corpus.
At this point, I will make quick mention of the various progress functions in AIAT. They are available at most places in the code where procedures may take some time, such as during indexing and long queries. These functions are well documented, and behave exactly as you would expect. Following the progress function setup calls, we find a call to the Update() method of our index object. This sets the entire AIAT indexing mechanism into motion. AIAT conducts the dialog with the Corpus that I outlined earlier, then takes the resulting data, processes it with the IAAnalysis object we provided, and stores the information in our Storage object. Once the process is complete, the data is flushed to the storage object via IAStorage::Commit().
One last thing to note in this code snippet is the use of the IATry and IACatch macros. These expand to standard C++ exception constructs (or to other exception-handling devices if C++ exceptions aren't available). AIAT uses the try, catch, throw model for almost all error reporting. If anything goes wrong, such as an index that can't be opened or a disk that runs out of space, the program's execution jumps immediately to the IACatch block of code. In this example, we just beep and make a note in the log, but you will probably want to be more robust in your handling of errors. If you are unfamiliar with exception handling in general, I would strongly recommend reading up on it before you start your AIAT-related project.
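For readers new to the pattern, here is one way such macros could map onto ordinary C++ exceptions. This is a guess at the general shape, not AIAT's actual definitions: SKETCH_TRY, SKETCH_CATCH, OpenIndexOrThrow, and TryOpen are hypothetical names, and std::exception stands in for IAException.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical macros: on a compiler with exception support they can simply
// forward to the built-in keywords.
#define SKETCH_TRY         try
#define SKETCH_CATCH(decl) catch (decl)

// Stand-in for a library call that reports failure by throwing.
void OpenIndexOrThrow(bool indexExists) {
    if (!indexExists) throw std::runtime_error("index could not be opened");
}

// Control jumps straight to the catch block as soon as anything throws.
std::string TryOpen(bool indexExists) {
    std::string result;
    SKETCH_TRY
    {
        OpenIndexOrThrow(indexExists);
        result = "ok";  // skipped entirely when the call above throws
    }
    SKETCH_CATCH (const std::exception& exception)
    {
        result = std::string("Caught exception: ") + exception.what();
    }
    return result;
}
```

Hiding the keywords behind macros is what lets a library swap in a different mechanism on compilers that lack exception support, without touching client code.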
Live to Search, Search to Live
Whenever I develop a new piece of software, I like to set my sights on an end result that I can work towards, and when it happens, I'll know I've made it. In this case, the end result was to see a Web page that showed database records ranked in relevance to my query string. We've made it through interfacing a new data source to AIAT, and asking AIAT to index it for us. Now we need to make that index produce something useful. AIAT refers to a document that matches a query as a "hit".
To generate a list of hits in response to a query, we bring a new class, IAAccessor, into play alongside IAIndex and IAStorage. The Accessor class ties one or more indexes together and provides methods for posting a query and generating lists of hits. It also contains several small classes for describing hits, including the RankedHit class which contains a document reference and a percentile ranking, and the RankedQueryDoc class which can be used to formulate Query-by-Example functions in AIAT. One of the more interesting features of the Accessor class is that it allows you to search several indices at the same time, and the results can be a mix of documents from several different types of data sources. In our project, we support Verona databases, FileMaker databases, folders of text documents, and a few other types of data sources. With AIAT, we can search any combination of data source types, and provide document references for each hit, along with specific information about the data source that contains the particular document.
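To picture what a mixed result list looks like, here is a sketch that merges per-index hit lists into one list ordered by score. The article does not describe how the Accessor combines results internally, so this is not AIAT's algorithm; SketchHit and MergeRanked are hypothetical names, and the sketch assumes scores from different indices are directly comparable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical hit record: which index it came from, which document, how relevant.
struct SketchHit {
    std::size_t indexNo;
    unsigned long docRef;
    double score;
};

static bool HigherScore(const SketchHit& a, const SketchHit& b) {
    return a.score > b.score;
}

// Combines the hits from every index and keeps the best maxHits overall.
std::vector<SketchHit> MergeRanked(
        const std::vector<std::vector<SketchHit> >& perIndex,
        std::size_t maxHits) {
    std::vector<SketchHit> all;
    for (std::size_t i = 0; i < perIndex.size(); ++i) {
        all.insert(all.end(), perIndex[i].begin(), perIndex[i].end());
    }
    // stable_sort keeps equal-scored hits in their original relative order.
    std::stable_sort(all.begin(), all.end(), HigherScore);
    if (all.size() > maxHits) {
        all.erase(all.begin() + static_cast<std::ptrdiff_t>(maxHits), all.end());
    }
    return all;
}
```

Carrying the index number in each hit is what lets the display code decide, per result, which kind of data source it is looking at.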
The following snippet is from the code to generate a ranked hit list in response to a basic keyword query.
Listing 3: Perform a ranked query on multiple data sources
rankedQuery_fn
const char* tQuery = "apple internet";  // query string
unsigned long count = 25;               // max # of hits
unsigned long resultCount = 0;
unsigned long i, tCt;                   // tCt = number of indices to search
HFSDoc tDoc;

// NOTE: tCt must be set, and sources[] declared and filled, by the
// setup code that was removed from this listing for clarity.
RankedHit** results = IAMallocArray(RankedHit*, count);
InVecIndex** indices = IAMallocArray(InVecIndex*, tCt);

// Setup indices[] with list of indexes to search.

IATry
{
    for (i = 0; i < tCt; i++)
    {
        indices[i]->Open(true);
    }

    CSTwinAccessor accessor(indices, tCt);

    // MacroInitialize takes care of various init states properly
    accessor.MacroInitialize();

    resultCount = accessor.RankedSearch(
        (unsigned char*) tQuery,
        strlen(tQuery),
        NULL, 0,
        results, count, 4,
        NULL, 30, NULL);

    DisplayResults(sources, tCt,
                   results, resultCount);
}
IACatch (const IAException& exception)
{
    char tErr[255];
    sprintf(tErr, "Caught exception: %s\n",
            exception.What());
    WSAppendLog(tErr);
}

for (i = 0; i < tCt; i++)
    sources[i]->Close();

for (i = 0; i < resultCount; i++)
{
    delete results[i];
}

IAFreeArray(results);
IAFreeArray(sources);
IAFreeArray(indices);
return noErr;
In this snippet, we find that the logic required to generate hits with AIAT is very straightforward. We create arrays for our results and our source indices using IAMallocArray(), which is just a macro for AIAT's allocator. We can use our own memory allocator if we want to, and can have AIAT use it by setting one of AIAT's public variables to the desired function. This is outlined in the documentation and the example code. It is important that you allocate and deallocate AIAT classes and structures with the same allocation code, or you may find yourself in trouble. However, AIAT subclasses all of its objects from one root class, IAObject, which has provisions within it to always allocate and deallocate using AIAT's allocator functions. Since you can set them to whatever you want, AIAT will behave itself in whatever memory allocation environment you may have.
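The "one root class that always uses the library's allocator" idea can be sketched like this. It is not AIAT's implementation: SketchObject, gSketchAlloc, gSketchFree, and the counting functions are hypothetical names, and std::malloc and std::free stand in for whatever default allocator the library ships with.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

typedef void* (*SketchAllocFn)(std::size_t);
typedef void (*SketchFreeFn)(void*);

static void* DefaultAlloc(std::size_t size) { return std::malloc(size); }
static void DefaultFree(void* block) { std::free(block); }

// Replaceable hooks (hypothetical names, not AIAT's public variables).
SketchAllocFn gSketchAlloc = DefaultAlloc;
SketchFreeFn gSketchFree = DefaultFree;

// Root class: every subclass inherits these operators, so objects are always
// allocated and released through the same pair of hooks.
class SketchObject {
public:
    virtual ~SketchObject() {}

    static void* operator new(std::size_t size) {
        void* block = gSketchAlloc(size);
        if (block == nullptr) throw std::bad_alloc();
        return block;
    }
    static void operator delete(void* block) {
        if (block != nullptr) gSketchFree(block);
    }
};

// An example replacement allocator that simply counts calls.
unsigned long gAllocCount = 0;
unsigned long gFreeCount = 0;
void* CountingAlloc(std::size_t size) { ++gAllocCount; return std::malloc(size); }
void CountingFree(void* block) { ++gFreeCount; std::free(block); }

// Any subclass picks up the root class's allocation behavior automatically.
class SketchHit : public SketchObject {
public:
    SketchHit() : score(0.0) {}
    double score;
};
```

To use the counting allocator, assign CountingAlloc and CountingFree to the two hooks before creating any objects. Swapping the hooks while objects are still alive would free memory with a different routine than the one that allocated it, which is exactly the mismatch the paragraph above warns about.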
Next, we find ourselves inside of another IATry block. This is because any AIAT calls from this point on may generate an exception, and we need to handle them appropriately. We run through the list of indices we want to search, and open them. The code for gathering the indices was removed for clarity, and it's up to you as to when and how you create the index objects for use in a search.
The next few lines were the most problematic in this project. Although it looked simple in the example application, I had overlooked an important piece of information when I read the documentation. The IAAccessor class requires a certain amount of information to be properly initialized. The example application tested for the presence of this information in the index's storage, and if it wasn't there, it would create it. If it was there, it would simply load it up and the accessor would be ready to use. I missed this line of code, and subsequently, all my searches would crash my machine in strange and wonderful ways. So, I created my own subclass of InVecAccessor that simply had this additional function to properly initialize an accessor in any situation.
Listing 4: Proper initialization of an accessor regardless of previous state
CSTwinAccessor.cp, CSTwinAccessor::MacroInitialize()
void CSTwinAccessor::MacroInitialize()
{
    IAIndex** tIndices = ::GetIndices();
    unsigned long i;

    if (!IsInitializationValid())
    {
        StoreInitialization();
        for (i = 0; i < GetIndexCount(); i++)
        {
            IAStorage* tStorage = tIndices[i]->GetStorage();
            tStorage->Commit();
        }
    }
    else
        Initialize();
}
This function uses IAAccessor::IsInitializationValid() and IAAccessor::StoreInitialization() to figure out the current state of the Accessor object and, when needed, store it in each of the indices' Storage objects. Don't forget to call Commit() for each storage object to flush the changes.
Following the successful initialization of the accessor, we are ready to make our query. We ask the accessor object to return hits that match tQuery in the array of RankedHit objects we created earlier. The resultCount variable is set to the number of actual results in the array, and I then call the DisplayResults function to display the resulting hits. The AIAT example code shows how to access the various items within a RankedHit object. Using the Index and Doc components of a RankedHit, you can figure out which index a hit comes from, and thus what type of document it is. You can then cast the IADoc* to an appropriate IADoc subclass, and access additional information when presenting your results.
Finally, we close up the indices and delete the RankedHit objects that the accessor created for us. With a few IAFreeArray() calls to go, we have completed our first query against our custom data set using AIAT. That didn't hurt so much, did it?
What to Do for an Encore
It would be hard to encompass all the facets of AIAT within the scope of a single article. With just a small amount of implementation, however, you can interface AIAT to just about anything. As with any technology, the best thing you can do is explore and try different things. (Isn't that what the Macintosh is all about?) Once you're comfortable with the basic concepts in AIAT, there are several concrete classes for you to try out, such as English Analysis (or Korean Analysis!). The IAAnalysis class and its subclasses contain a great deal of interesting functionality, and since they control which terms get generated by AIAT, they are extremely important for ensuring that once you've got your search engine running, the data it returns is actually useful.
To echo the words of the AIAT manual, be sure to spend a good amount of time understanding both the AIAT architecture and the problem you're actually trying to solve. Whatever may be lacking in the documentation about implementation details is more than made up for by the walk-through examples that are provided. The AIAT architecture is fairly orthogonal, so once you understand the behavior of one class of objects, you will find that the other classes follow suit. AIAT delivers a lot of power without a great deal of implementation effort.
Resources
- Apple Developer CD Series, Tool Chest/Mac OS 8 Edition, November 1997.
- http://www.research.apple.com/research/tech/V-Twin/.
- Apple Developer World, http://www.devworld.apple.com/.
- Purity Software, Inc. at http://www.purity.com/.
When he isn't messing about in MacsBug, Mark Holtz is one of the founders of MacISP, helping people and organizations provide Internet services using the Mac OS. E-Mail to mholtz@intermag.com should garner a quick response, unless he's out having coffee.