Search Algorithms

2011 Nov 06

Search Criteria

The search method that is chosen depends on multiple factors:

Can there be at most one or more than one hit for a key value?
If more than one is it constant or variable and is there an upper bound?
How fast does the lookup have to be?
How often does or can a search happen?
What is the dynamic range of the search key values?
What are the maximum number of possible search key values?
What is the statistics and density of possible or actual search key values in the search key value space?
How much is speed worth (i.e. cost in money and time) to spend on:

Extra memory (RAM or disk) - e.g. for lookup tables?
Special hardware (e.g. CAM)?

Is there a constant (static) or variable (dynamic) number of key values?

If the number is variable is there an actual or artificial constraint on the maximum number of key values?
If the interval between changes in the number of key values is very long, then it may be possible to recompile or regenerate the tables or other data structures.
Is the search code and the key data changes or does the code could change too? If only the data changes then there may be a way to trigger the code to reload the new data from a file or communications channel?

Search Method	Time Order: O(...)			Memory Order: O(...)
Search Method	min	avg	max	min	avg	max
associative	1	k/2	k	n	n	n
binary	1	log₂(n)/2	log₂(n)	n	n	n
linear	1	n/2	n	n	n	n
hash	h	h+k	h+n	n	c*n	c*n
table	k	k	k	n	n*log_b(n)	n*log_b(n)
tree	1	log_b(n)/2	log_b(n)	n	n*log_b(n)	n*log_b(n)

h: cost of hash calculation.

Depending on the nature of the search problem (see: Search Criteria) the above methods can be combined into a linear sequence or tree of decision/search methods.

Depending on the size of the data, i.e. number of key values, all of the data can be in memory. If the data set is very large (>1GB) then the data may have to be on disk or some similar I/O media. For speed reasons there may be a tradeoff where part of the lookup data is in memory and the rest on disk.

Associative Lookup

Associative lookup requires special hardware, e.g. a Content Addressable memory (CAM).

This hardware does parallel lookup. If the entire set of key values fits in the CAM, then lookup time is O(1). If the CAM is of size 's' which is too small then the time is proprortional to k which is proprtional to n/s.

Binary Search

This method is a tradeoff of more speed but no extra memory. At any given moment when searching the table is of a known size 'num'. See example code:

int find_bin(int num, int *a, int key) {
// Returns -1 if not found else returns index in table a[].
// Requiers that table a[] be in ascending order of key values.
int bot, mid, top;
    bot = 0;
    top = num-1;
    mid = top>>1;
    while(1) {
        if(key == a[mid])
            return mid;
        if(key < a[mid]) {
            if(bot >= top)
                return -1; // not found
            top = mid-1;
            mid = (bot+top)>>1;
        } else {
            if(bot >= top)
                return -1; // not found
            bot = mid+1;
            mid = (bot+top)>>1;
        }
    }
    return -1;
} // find_bin

Linear Search

This is the slowest and most naive method (assuming someone does not go out of their way to be even slower :) since it can take time O(n). If the number of key values is constant then a fixed size table can be made in memory or disk. If the number is variable then the table can be of variable, i.e. dynamic size, or be implemented as a (singly or doubly) linked list can be used (which costs more memory - due to one or two link pointers).

Table Search

When the table is small or the maximum size of the table space (e.g. less than 65Kb) and sufficient memory is available, the entire table can be implemented allowing an answer in time of O(1). If the key value space usage is dense then a full tree of the space could be implemented, all or some in memory and all or some on disk. If the ley value space is sparse then a sparse tree of tables can be used.

Tree Search

There are many kinds of trees: binary trees, trees with a maximum number of children, balanced trees, clustered trees (e.g. rtrees). For parsing, for example, a tree of states based on input characters could be implemented, where each state adds to the a parse tree created as the input is read.

If the key value is a string, with characters values limited to ASCII (numbers in the range 0..127) then a tree of tables, each of size 128, can implemented giving an effective speed of O(k*log₁₂₈(n)), where k is proportional to the maximum length of the key string.

2007-2011