Kasun’s Blog

Kasun Indrasiri

  • Kasun Indrasiri

  • Info

    Department of CSE
    University of Moratuwa,
    Sri Lanka

  • Archives

  • Categories

B-Trees and Inverted Index – De facto standard for file organization (I)

Posted by kasun04 on September 27, 2009

‘Secondary Storage’ is still a nightmare in achieving higher performance in modern computer systems and often the power of multi-core CPU is more or less negate due secondary storage low performance IO. Of course the performance of secondary storage IO has improved in the recent past but it is still inferior relative to CPU performances.

A computer must retrieve an item and place it in main memory before it can be processed. In order to overcome the low performances of the system one must organize the files intelligently and making the retrieval efficient. The file organization depends on the retrieval method; sequential or random. Particularly secondary storage IO is a huge overhead in the context of random access method. Therefore associated with a large, randomly accessed file in a computer system is an index. An index is often a file that stored in the disk and it contains a mapping between the terms and content.

Inverted Index

An index if often used in Information Retrieval systems, Database systems, file indexer for user and general purpose access methods. The important thing here is that how we physically represent index in a computer system. That’s where B-Trees come in to play. So, before we going deep into B-Trees it’s worthy to have a close look at index and its structure.

‘Inverted Index’ is the more specific name given to the index as it’s the most common data structure used in Database Management Systems and Information Retrieval Systems (Search Engines etc.). It consists of three different basic files,

  • Document File
  • Inversion list (posting files)
  • Dictionary

The name “inverted file” comes from its underlying methodology of storing an inversion of the documents: inversion of the document from the perspective that, for each word, a list of documents in which the word is found in is stored (the inversion list for that word). Each document in the system is given a unique numerical identifier (Document ID). It is that identifier that is stored in the inversion list. The way to locate the inversion list for a particular word is via the Dictionary. The Dictionary is typically a sorted list of all unique words (processing tokens/terms) in the system and a pointer to the location of its inversion list. Dictionaries can also store other information used in query optimization such as the length of inversion lists.

The above inverted index is the simplest representation of the inverted index but for modern IR systems there can be more attributes in the inversion list itself. For example, to support proximity, contiguous word phrases and term weighting algorithms the inversion list contains the offset of each word in the document. So for Document ID = 1, the term bit is the 4th, 8th and 30th term in the document, then the inversion list would look as follows.

  • bit -1(4), 1(8), 1(30)

So once we create the inverted index, the search process or system is the actual component that uses the created inverted index. So, when a search is performed for a given query, the inversion lists for the terms in the query are located and the appropriate logic is applied on the inversion list sets. So the outcome is the set of hits for a given query (inversion list may ordered based on ranks), and the hits will be given in the form of document IDs. These document ID will be used to retrieve the accrual documents.

So where are the B-trees ?.. Well that comes in to play when we implement the above conceptual models. Although we stored a set of normalized terms in the form of a ‘dictionary’, it actually uses a B-Tree to hold the normalized set of terms and make Secondary Storage IO more efficient. We should keep in mind that, for an enormous set of document we must get an extremely huge set of terms and the inverted index would be equally huge. If we are going to store these dictionaries in the memory, it would be very costly if not impossible. So we tend to store in the secondary storage and of course we have the performance nightmare which is always bounded to secondary storage. So we’ll see how B-Trees solve our problem in this context from part II of this post.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: