Pages

Wednesday, August 18, 2010

Indexing data with the STL

When working with data, one often needs to process information in an ordered fashion: for instance, you might want to display results in some specific order, or ensure (for efficiency) that the table where you are looking up a value is in order (as this would speed the lookups).

In C++, the Standard Template Library (STL) provides a std::sort() function (in the algorithm header), which implements a variant of quicksort, one of the fastest general purposes sorting algorithms. It is usually wise to use it, unless you know very well why you don't. The calling convention for sort() is sort(begin, end), where begin and end are two iterators pointing to the first and one after last element in the table to be sorted.

For a vector you would say

vector<int> v;
sort(v.begin(),v.end());

(note that this allows you to sort only a part of the vector, just provide a different set of iterators)

But sort() is not limited to STL creatures, and will work on plain C style malloc'ed or new'ed tables, once you remember pointer can be used as iterators, for instance

int *mytable=new int[size];
sort(mytable, mytable+size);

How does sort() sort, then? By default it uses the < operator of the type you use. This means int, strings, doubles, are sorted as you expect they would (and you should never use it on pointers, which will be sorted according to memory adresses...). For basic agregate types, likes pair<int,int>, they will compare the first element, and then the second (in that order...). For user-defined types, you may want to redefine your order, through something like

bool MyClass::operator<(const MyClass &lhs) {...}

and then call sort() on MyClass containers..

But you can also provide sort() with a third parameter, which is the comparison function you want to use. This can be a function (which takes two constant references to elements in the container and return a bool), a function object (aka Functor, we'll talk of that another time), or one of the many predefined functors in the stl...

For instance, to sort a vector of integer in decreasing order, you'd say

vector<int> v;
sort(v.begin(),v.end(),greater<int>);

But you could also roll your own, say

bool mycomp(const string &ls,const string &rs) {
  return ls.length()<rs.length();
}

vector<string> v;
sort(v.begin(),v.end(),mycomp);

Would sort strings by length (ie shortest words first...)


Now, sorting, as its name implies moves data around. Although this is often just what you want, it is not always necessary to sort data when you want to process it in order.

Here are typical cases when you don't want to sort :

1- while the keys you are sorting on are relatively small, the data is huge, so moving it around while sorting would waste a lot of time... (this is the typical case in databases)
2- the data you are working is read only, eg in an external file, its original order must be preserved, and you don't want (or cannot) make a copy of it
3- you are using several orderings, and cannot sort the data all the time.

The last case is very frequent, just think of some basic data, made of items, which have a code, a name, and a price. In C++, you'd have something like

struct Item {
int Code;
string Name;
double Price;
... other data
};
vector<Item> ListOfItems;

Now, in a typical application, you will want to display items sorted by name or price, but look them up (from a sorted table) according to code... You'd need the three sorts at the same time.

One obvious solution would be to have three copies of the container, each sorted in different order, and use the one you need each time... When the ListOfItems gets large, or "other data" in the item is big, this is quite a waste...

Indexes are the typical solution to this problem, an index is an "indirection vector", which allows one to process the data in the ListOfItems in some specific order. Basically, you have a integer vector, the same size as ListOfItem, such that idx[0] holds the index of the smallest element of ListOfItems, idx[1] the second smallest, etc...

So, the smallest element of ListOf Items is
ListOfItems[idx[0]]

and the largest
ListOfItems[idx[ListOfItems.size()-1]]

Now you probably see what indexes can do for us... In the previous case, we need three indexes, one for each sort order, and we can then traverse the data in any order, at very little cost (one indirection).

Now how do we build those indexes? It would seem that the STL should provide us with some nifty "index" algorithm... But there are none, and none in Boost either...

Actually, sort() is all you need. To build an index file, you just need to define the appropriate comparison. Here's how, there are three steps :

first, initialize the index, so that each element contains its rank idx[0]=0, idx[1]=1, etc... say

for(int i=0;i<n;i++) idx[i]=i;

(if you have SGI STL, there is a adhoc function for this, called iota(), from a APL function, you can also use generate(), if you are stl minded)

\Then define a comparison operator (if you are comparing the Price tags of Items in the above example):

bool mycomp(const int &lh, const int &rh) {
return ListOfItems[lh].Price<ListOfItems[rh].Price;
}

this is the "trick", this comparison function will compare elements of idx, eg idx[i] and idx[j], by comparing MyTable[idx[i]].Price and MyTable[idx[j]].Price. As a result, when you sort idx, the smallest element will be the one which gives the smallest value of MyTable[idx[i]].Price... Bingo!

And the indexing is done by

sort(idx.begin(),idx.end(),mycomp);

To summarize, the following program

int val[5]={3,2,4,1,5};

bool mycomp(const int &lh, const int &rh) {
return val[lh]<val[rh];
}

int main(void) {
vector<int> idx(5);
for(int i=0;i<5;i++)  idx[i]=i;
sort(idx.begin(),idx.end(),mycomp);

for(int i=0;i<5;i++) cout<<val[idx[i]]<<endl;
return 0;
}

should print 1 2 3 4 5, ie index the val data...

No comments:

Post a Comment