Archive for the ‘Uncategorized’ Category
B-Trees and Inverted Index – De facto standard for file organization (II)
In the context of modern IR Systems, using a dictionary or hash table to represent inversion list would be kind of tedious as it request huge amount of memory. There for its obvious that we need to get the help of secondary storage to store the content and retrieve when required. Then again there is a huge overhead of using secondary storage to store and read a dictionary. So the solution that we came across was B-Tree, which is a used almost each and every implementation of inverted index.
Definition of a B-Tree of order ‘m’
- A root node with 2 ~ 2m keys
- All the other internal nodes have between m ~ 2m keys
- All keys are kept in ascending order
- All levels have the same level of differ (at most 1)
B-Tree node of order ‘d’ as shown below.
e.g.:
Inversion lists structures are used because they provide optimum performance in searching large databases. The optimality comes from the minimization of data flow in resolving a query. Only data directly related to the query are retrieved from secondary storage. The beauty of B-Trees lies in the methods for insertion and deletion of records which always leaves the tree balanced.
In the most command secondary storage ; the hard disk, the number of disk accesses is measured in terms of the number of pages of information that need to be read from or written to the disk. And the disk access time is not constant-it depends on the distance between the current track and the desired track and also on the initial rotational state of the disk. We shall nonetheless use the number of pages read or written as a first-order approximation of the total time spent accessing the disk.
In a typical IR System B-tree, the amount of data handled is so large that all the data do not fit into main memory at once. The B-tree algorithms copy selected pages from disk into main memory as needed and write back onto disk the pages that have changed. B-tree algorithms are designed so that only a constant number of pages are in main memory at any time; thus, the size of main memory does not limit the size of B-trees that can be handled.
It can be prove that the cost of processing a ‘find’ operation in a B-Tree grows as the logarithm of the file size. For example a B-Tree of order 50 which can index a file of 1 million records can be searched with in 4 disk accesses (worst case). Also for insertion and deletion, a B-Tree of order d, for a file of n records, insertion/deletion time is proportional to log(d) n.
However, for B-Trees there are some practical limitations where as the amount of data that can be transferred to with one secondary storage access is limited as well as the track size of hardware should be taken in to the account. So, in practice, optimum node size is depends critically on the characteristics of hardware.
Further Information on B-Trees
http://www.cl.cam.ac.uk/~smh22/docs/ubiquitous_btree.pdf
B-Trees and Inverted Index – De facto standard for file organization (I)
‘Secondary Storage’ is still a nightmare in achieving higher performance in modern computer systems and often the power of multi-core CPU is more or less negate due secondary storage low performance IO. Of course the performance of secondary storage IO has improved in the recent past but it is still inferior relative to CPU performances.
A computer must retrieve an item and place it in main memory before it can be processed. In order to overcome the low performances of the system one must organize the files intelligently and making the retrieval efficient. The file organization depends on the retrieval method; sequential or random. Particularly secondary storage IO is a huge overhead in the context of random access method. Therefore associated with a large, randomly accessed file in a computer system is an index. An index is often a file that stored in the disk and it contains a mapping between the terms and content.
Inverted Index
An index if often used in Information Retrieval systems, Database systems, file indexer for user and general purpose access methods. The important thing here is that how we physically represent index in a computer system. That’s where B-Trees come in to play. So, before we going deep into B-Trees it’s worthy to have a close look at index and its structure.
‘Inverted Index’ is the more specific name given to the index as it’s the most common data structure used in Database Management Systems and Information Retrieval Systems (Search Engines etc.). It consists of three different basic files,
- Document File
- Inversion list (posting files)
- Dictionary
The name “inverted file” comes from its underlying methodology of storing an inversion of the documents: inversion of the document from the perspective that, for each word, a list of documents in which the word is found in is stored (the inversion list for that word). Each document in the system is given a unique numerical identifier (Document ID). It is that identifier that is stored in the inversion list. The way to locate the inversion list for a particular word is via the Dictionary. The Dictionary is typically a sorted list of all unique words (processing tokens/terms) in the system and a pointer to the location of its inversion list. Dictionaries can also store other information used in query optimization such as the length of inversion lists.
The above inverted index is the simplest representation of the inverted index but for modern IR systems there can be more attributes in the inversion list itself. For example, to support proximity, contiguous word phrases and term weighting algorithms the inversion list contains the offset of each word in the document. So for Document ID = 1, the term bit is the 4th, 8th and 30th term in the document, then the inversion list would look as follows.
- bit -1(4), 1(8), 1(30)
So once we create the inverted index, the search process or system is the actual component that uses the created inverted index. So, when a search is performed for a given query, the inversion lists for the terms in the query are located and the appropriate logic is applied on the inversion list sets. So the outcome is the set of hits for a given query (inversion list may ordered based on ranks), and the hits will be given in the form of document IDs. These document ID will be used to retrieve the accrual documents.
So where are the B-trees ?.. Well that comes in to play when we implement the above conceptual models. Although we stored a set of normalized terms in the form of a ‘dictionary’, it actually uses a B-Tree to hold the normalized set of terms and make Secondary Storage IO more efficient. We should keep in mind that, for an enormous set of document we must get an extremely huge set of terms and the inverted index would be equally huge. If we are going to store these dictionaries in the memory, it would be very costly if not impossible. So we tend to store in the secondary storage and of course we have the performance nightmare which is always bounded to secondary storage. So we’ll see how B-Trees solve our problem in this context from part II of this post.
Powered by MS Word 2007
This is my first blog post through MS Word 2007. And I’m really happy about the fact that wordpress can interact with MS products. This is a big achievement by MS Word as well as WordPress as MS Word is considered to be the best word processing software.

Unique Perspective on C++
Its not worthy to write a blog post on the impacts of the Software Development on economy, society and life style. Its one of the major driving forces in the modern world. When it comes to software development, the programming languages has an immense impact on software engineering. As an incontestably candidate of programming languages, I would like to talk about C++ a bit.

Well, the most common questions arises at this particular point are, “Why C++ ?”, “Isn’t it dying ?”.
Yes.. C++ is not used in the same frequency as Java or C#. But, its not dying, its not obsolete.. In fact it’s the heart of modern software development process.
‘Choosing the right tool’ is so important in the Software Development process. So in that context, C++ is a tool; a ‘complicate tool’. And it’s a tool that is difficult to learn, how to use properly.
The complexity of C++ has encouraged most software developers to go for alternatives like Java, C# etc. Then again, the complexity is a consequence of its programming power and the performance.
C++ -Complexity
The uniqueness of a proficient C++ programmers can be observed in many distinct ways
- For a proficient C++ programmer, it’s essential to have a solid understanding of, What compiler does, what linker does, what run time system does and what OS does.
- Memory Management. No garbage collection, so the developer needs to taken care of memory allocation and de-allocation. (‘pointers’ is always a nightmare to a C++ developer
)
- OS level Concurrency. No pseudo processes or threads (like threads provided by run time systems). Should use processes and threads directly provided by the operating system.
- Generic Programming. (Templates)
- Interfacing with hardware. (C++ is the best programming language to communicate with hardware)
- Generic Programming. (Templates)
Why C++ ?
Nowadays, selecting C++ for the development of the conventional software system is simply a waste of time and money. We don’t need C++ to develop a traditional business management software system or a web application.
The famous programming languages like Java or C#.Net are often humiliated by some of the requirements of the software system. These applications are often termed “Demanding Applications”.
Demanding Applications
Performance is a major and a crucial requirement of a demanding application. Also, robustness, responsiveness and fault tolerance is also critical in the context of demanding systems. If you still confused about the demanding applications or systems, just now you are working on one of the most complicated demanding system. It’s the operating system. Linux or Windows; both use C/C++. The role that played by C++ in the modern era of Computer Science is obvious when we consider the most common demanding systems .
- Operating Systems – Windows, Mac and Linux is mostly (if not totally) developed using C/C++.
- Embedded Systems – Every embedded system is a real mess of constrained resources and unlimited requirements. So, to maximize the throughput and minimize the resource consumption, C++ immerges as the best development methodology. In fact, embedded systems is the next era of C++ development.
- Hardware Drivers - Almost all the hardware drivers are implemented by C++
- ”The’ most popular windows application. – Well, everybody knows about it. It’s modern success story of Microsoft. Yes.. Its “Microsoft Office suit”. Each and every module of MS-Office is developed using C++.
- Search Engine Core – Google Search Engines is a best example of a demanding system. The performance is the most crucial factor of a search engine. Therefore C++ is the consentaneous candidate to develop the core (or heart) of a search engine.
- Trading Systems – Again this is for the sake of performance and handling the immense amount of load. C++ is a must.
- Real Time Systems - This also a well known case where C++ is mandatory.
So, this is the showstopper for the jerks who claimed that ‘C++ is dying..its obsolete’ . (inspired from a interview with C++ genius, Scott Meyers)
True Colors of Patriotism
Last few weeks or so, the whole Colombo city was covered with Sri Lankan national flag. Sri Lankans are still celebrating not the victory over LTTE but the true independence. As well as enjoying the independence I also enchanted with the beauty of our national flag. To me.. It’s the most beautiful flag in the world (I even heard some Cricket commentators claim that it’s a very colorful flag.). Now we can hoist our national flag with the pride of being able to wipe out the terrorism from our country. It almost took three decades to bring the true independence to the island nation. And we should keep in mind that, the independence that we achieved is formed with the blood and tears of our true patriots. The unanimous credit should goes to HE the president, Armed forces and to the whole Sri Lankan community which is the sheer power behind everything.
However, there are so many people who acts as co-possessor of Victory and Independence of Sri Lanka. Patriotism is suddenly in vogue. Its everywhere.. Nowadays in SL anybody that you meet in a street is a ‘lion hearted patriot’.. anybody who migrated to US or Europe is a ‘true nationalist’ .. There are some people whose ‘patriotism’ is often ignited by alcohols… It’s a dilemma that how a man who migrated to some other country can boast about the sovereignty and the independency of his motherland. All these ‘Non-returning Sri Lankans are the genuine illustrations of ‘Pseudo Patriotism’.

Defeating Terrorism in Sri Lanka is a sacred and a precious thing for all its citizens. As we enjoy the independency we should salute and show our gratitude to its contributor; armed forces, the president, families of all the armed forces, government, local and international community (India, Pakistan, China and Russia in particular). (But not those bloody ‘pseudo patriots’)
Finally, the courage and the dauntlessness of our heroes cannot be explain through a couple of sentences.. But this is a nice poem(translation) written during the Indian revolution around 1940′ when they fight against the British governors.
This day, we walk along with death, and laugh at its pale spectre..
We will not fear those cruel swords, our courage is far sharper..
Mistake not our silence for submission. For beneath lies lava, molten..
O Martyre, O men of valour.. One day the enemy will sing your praises…
We will show our mettle .. When the moment of truth arrives….
For courage lives in deeds, not boastful lies.
We’ve gathered in the enemy’s lair, my friend
In the hope of dying for our motherland…
We will not fear those cruel swords, our courage is far sharper.
This day, we walk along with death and laugh at its pale spectre…..
i Love .. iTunes
Recently I’ve started using iTunes as my music player. Latest version of iTunes (8.1.0.52) looks really cool and GUI is splendid as any other apple application. As I noticed that the sound quality is also improved with ACC format though it’s a slight difference from mp3.
You can organize your music repository with the ease of use with iTunes and playing an album is just like fetching a CD and playing it. And you can search any ‘artist’, ‘album’, ‘Genre’, ’song’ etc using the embedded quick search. Albums are displayed on a 3-D pane where you can select the album by just clicking on album cover. (see the screen shot)

Also you can listen to any type of radio streaming through iTunes and you’ll nearly 1000+ radio streams.
You have ‘alternative’, ‘ambient’, ‘blues’, ‘Folk’, ‘Jazz’, ‘Latino’, ‘Pop’ (you name it.. iTunes got it)and many more streams.

Looks soo.. ’sympathetic’ to Vista too…

Use it once and enjoy forever….
The Romantic Opera – Kasun Kalhara


Nowadays Sri Lankan music stream is flooded with a lot RnB and Hiphop music and many people merely want to listen to such styles of music. And all the radio stations are backing such music and ruining the natural music taste of Sri Lankans. All those commercial oriented artists tend to release their 5-6 tracks per a month. This is something simply related to the quantity and they simply want to create a music track using computers and other equipments. Once they create the track there are people who can write some strange words to tally with the track and the beat. Finally they’ll come up with a music video and all the TV stations eagerly waiting to play their music video.This is the story of Iraj, BnS, .. (you name it)
But .. They have forgotten the fact that there are some (I think its few
) people in Sri Lanka who still loves the instinctive slow music (mood songs) which is a blend of soothing voices and acoustic music.

Kasun Kalhara is one of the exceptions thrown from the Sri Lankan modern music stream. As I see he is the best production by Dr. Premasiri Kemadasaa (Kemadasa master) and doubtless selection of the best voice in the modern generation. Not only his voice but he is a composer too.. But not a synthetic composer who solitary depends on Computer and keyboard.
‘The Romantic Opera’ is his latest album and in my opinion it is the best music release of him. The titled rack, ‘The Romantic Opera’ is inspired from Opera style music.
Opera is an art form in which singers and musicians perform a dramatic work (called an opera) which combines a text (called a libretto) and a musical score -wikipedia
And this album is a combination of a set of distinct music tracks. It has some ‘Latin’ music, some of ‘Chinese or Japanese’ music, some of ‘Indian’ music and a lot of ‘Western’ and ‘Sri Lankan’ music. Most of the songs are composed using acoustic music and very less number of music instruments were used.
You’ll not able to listen to this sort of music when you switch on the radio or TV.. You’ll never hear any of these songs when you are travelling on a bus… Soo… you need to find/buy it and listen to it… because that’s the way that most true Sri Lankan music lovers used to do…

‘The Romantic Opera’ is a must listen music album of true Sri Lankan music fan.. Believe me .. It awesome.
We always call that the ‘music’ is an universal thing… yes it is.. That’s why we love A.R Rahman’s … that’s why we love Josh Groban.. and that why we should love Kasun Kalhara.
Chrome, Firefox, Safari and Opera – CPU and Memory usage
I perfromed a CPU usage and Memory usage test on Google Chrome, Firefox, Apple Safari and Opera. I simply ignore IE7 because the real battle comes when IE8 is officially relesed.
Here are my results. (or proofs) as screen shots.
IDLE State
Same Tabs Opened (various site were selected. Gmail, Apple, BBC, Youtube vedio streaming)
So then, we can analyze the resource usage.
Chrome vs Opera
This is another comparison between Opera and Chrome.
Bottom Line
You have to decide which browser to be used.
Search Engine Technology-An In depth Analysis (I)
A search engine is an information retrieval system (IR Systems-See my prior post) designed to help find information stored on a computer systems. Search engines minimize the time required to find information and the amount of information which must be consulted.
The search engine technology is evolved over the past decade or so, owing to the continuous and rapid grow of World Wide Web. As the information systems grow bigger and bigger the amount information stored in the information systems became enormous. So as an information retrieval system, the search engine has to be evolved in order to be sync. with the growing stored data.
The most public, visible form of a search engine is a Web search engine which searches for information on the World Wide Web. And more or less that the most important form of the search engine, because the www is the world’s fastest growing, distributed and fault tolerant information system. As an important business product the Enterprise Search Engines also play a critical role as most of the companies need to have their own customizable information retrieval system which is beyond a conventional web search engine.
Regardless of web or enterprise search, the technology used by the both forms are quite similar. in fact from the core (or heart) of the both engine looks almost the same. Here is a brief overview of different forms of Search Engines.
Web Search Engines
A Web search engine is a search engine designed to search for information on the World Wide Web. Information may consist of web pages, images and other types of files (including pdf, doc etc.). Web Search products are typically free to use and the vendors generate revenue via advertising.
Challenges:
- The number of documents to be indexed (tens of billions)
- The number of users (hundreds of millions)
Major Vendors:
- Microsoft
- Yahoo
Enterprise search Engines
Enterprise search is the practice of identifying and enabling specific content across the enterprise to be indexed, searched, and displayed to authorized users. Enterprise Search Engine facilitates the application of search technology to information within an organization. And most of ES vendors tend to focus exclusively on Enterprise Search and do not also offer Web Search or Desktop Search products.
Challenges:
- Index documents from a variety of sources such as: File systems, Intranets, Document Management Systems, E-mail, Databases.
- Present a consolidated list of relevance ranked documents from these various sources
- Many applications require the integration of structured data as part of the search criteria and when presenting results back to the users controls are vital
- if users are to be restricted to documents to which they are granted access by the various document repositories within the enterprise
Major Vendors:
- Microsoft (FAST Search & Transfer, Convera)
- Autonomy
- Dieselpoint
- Endeca
- Exalead
- IBM
- ORACLE
Desktop search
Desktop search is about searching the documents personal to a single user (i.e. a local hard drive and email folders).Most of the leading Desktop Search products are provided as free downloads, or for a small fee per user.
Challenges:
- To offer the search on the user’s local machine without impacting the performance of other applications running locally.
Vendors:
- Microsoft
- Google & Copernic/Coveo.
Leave a Comment
Leave a Comment
Leave a Comment










