Kasun’s Blog

Kasun Indrasiri

  • Kasun Indrasiri

  • Info

    Department of CSE
    University of Moratuwa,
    Sri Lanka

  • Archives

  • Categories

B-Trees and Inverted Index – De facto standard for file organization (II)

Posted by kasun04 on September 27, 2009

In the context of modern IR Systems, using a dictionary or hash table to represent inversion list would be kind of tedious as it request huge amount of memory. There for its obvious that we need to get the help of secondary storage to store the content and retrieve when required. Then again there is a huge overhead of using secondary storage to store and read a dictionary. So the solution that we came across was B-Tree, which is a used almost each and every implementation of inverted index.

Definition of a B-Tree of order ‘m’

  • A root node with 2 ~ 2m keys
  • All the other internal nodes have between m ~ 2m keys
  • All keys are kept in ascending order
  • All levels have the same level of differ (at most 1)

B-Tree node of order ‘d’ as shown below.

e.g.:

Inversion lists structures are used because they provide optimum performance in searching large databases. The optimality comes from the minimization of data flow in resolving a query. Only data directly related to the query are retrieved from secondary storage. The beauty of B-Trees lies in the methods for insertion and deletion of records which always leaves the tree balanced.

In the most command secondary storage ; the hard disk, the number of disk accesses is measured in terms of the number of pages of information that need to be read from or written to the disk. And the disk access time is not constant-it depends on the distance between the current track and the desired track and also on the initial rotational state of the disk. We shall nonetheless use the number of pages read or written as a first-order approximation of the total time spent accessing the disk.

In a typical IR System B-tree, the amount of data handled is so large that all the data do not fit into main memory at once. The B-tree algorithms copy selected pages from disk into main memory as needed and write back onto disk the pages that have changed. B-tree algorithms are designed so that only a constant number of pages are in main memory at any time; thus, the size of main memory does not limit the size of B-trees that can be handled.

It can be prove that the cost of processing a ‘find’ operation in a B-Tree grows as the logarithm of the file size. For example a B-Tree of order 50 which can index a file of 1 million records can be searched with in 4 disk accesses (worst case). Also for insertion and deletion, a B-Tree of order d, for a file of n records, insertion/deletion time is proportional to log(d) n.

However, for B-Trees there are some practical limitations where as the amount of data that can be transferred to with one secondary storage access is limited as well as the track size of hardware should be taken in to the account. So, in practice, optimum node size is depends critically on the characteristics of hardware.

Further Information on B-Trees

http://www.cl.cam.ac.uk/~smh22/docs/ubiquitous_btree.pdf

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: