ALGORITHM ALLEY

Generating Sequential Keys in an Arbitrary Radix

Gene Callahan

Gene is president of St. George Technologies and the developer of software such as Managing Your Money and the HyperWriter Autolinker. He can be contacted at ecallah@ibm.net.

Recently I had to expand the range of 4-byte keys that could be sent to a client database, as we were on the verge of running out of keys. Our existing internal library functions dealt adequately with things like incrementing a key to generate the next key in sequence. But these functions were hardwired to operate in base-36, using all 10 digits and 26 lowercase letters to represent the key. Obviously, I could have "changed the wiring" to use base-62 keys (by including uppercase letters, which the client system could accept). But I hesitated to recast an algorithm hardcoded to handle a radix of 36 into one hardcoded to handle a radix of 62.

I set about looking for one. I consulted my algorithm books and found nothing. I asked half a dozen other programmers if they had ever needed to write code to handle this situation. Five said that they had. When asked what they had done, it turned out that each had written an ad hoc solution suitable only for the radix and character representation used in that specific situation. I wanted one that could handle any reasonable radix, any length key, and any ordering of characters within the base representation.

The solution I present here achieves these goals and also illustrates a couple of important programming principles, to which I was first alerted by Jon Bentley's book, Programming Pearls (Addison-Wesley, 1986):

Often, solving a general problem is easier than solving a specific instance of that problem. The algorithm employed here is easier to understand than a hardwired one, and shouldn't take any longer to get right.

The key to solving many problems is getting the right data structures--here, given the data structures, the code follows naturally.

The Data

First, I looked at the data I had to deal with. Many languages have built-in routines for processing strings into numbers. C, however, has no routine that, when fed a string in an arbitrary base, will convert it to an internal representation suitable for numeric processing. Clearly, such a facility would make the problem as trivial as it would be if we were using octal, decimal, or hexadecimal numbers in C. In such cases, we simply use standard library routines such as sprintf(), atoi(), and sscanf().

Because the programs in the system I was working on had to perform these conversions thousands of times a day, they had to be fast. Space, however, was not at a premium--our machine had 512 megabytes of RAM. This immediately suggested two table structures, one mapping from characters in a string to their numeric value in a particular base, and the other, back again, by simple array lookup.

In creating the data structure for the conversion from string to number, I took advantage of the happy circumstance that C considers characters valid array indexes. What could be simpler than to use the character itself as the index to find its value in a base? Since the characters are merely signals in an arbitrary code (the key), I didn't have to worry about such niceties as whether some user will want a different, larger alphabet appearing on screen. My strings were for internal consumption only!

To convert from native numeric representation back to a string representation in base-X, a string laying out the ordering of characters in base-X is the natural mapping mechanism. The base-X digits of the native number itself, "peeled off" modulus the powers of the radix X, can serve as indexes into the mapping string. This takes advantage of the relationship between a base and the modulus and division operations--for any base X and number N, repeated modulus and division operations on N by X will produce the value of each base-X digit of N from right to left. The structures I came up with, along with a representative instance, appear in Listing One.

The base representation in the listing has long runs of consecutive values, but this need not be the case. If the application involved something like encryption, the values assigned to different characters could just as well be randomly generated.

The Functions

To map from string to number, BaseXToLong() (see Listing Two) moves through the string representing the number from right to left, indexing by the character at each position into the structure NumValues. That character's value in the current base is stored in NumValues and then multiplied by the power of the radix equivalent to that position in the string. In other words, the numeric value of the last character in the string is multiplied by the 0th power of the radix; the next to last, by the first; the next, by the square of the radix; and so on.

LongToBaseX() maps in the other direction, using the structure CharValues. Again, the function operates from right to left, filling the last character of the string first. To determine the character in the last place, I take the long argument (named "number") modulus the base, and index CharValues with the result. For the next character, I divide number by the radix, essentially dropping off the last digit, and repeat the modulus step; see Listing Three.

Finally, I had to implement the increment function itself. I wrote it as a single statement using the primitives discussed earlier. The code has now gained a level of abstraction. The caller of LongToBaseX() has to worry about details like the length of the return value and where to store it. But at the level of increment abstraction, I knew the answers to these questions--the length is the length of the original key, and it is stored back in the original buffer. I left the low-level routines flexible about the answer to these questions, and built another layer of abstraction to hide these implementation details from the layers above it; see Example 1.

I wrote this function as a single statement because it concisely states my algorithm. The calls to BaseXToLong() and strlen() are made only for their return values, not for any side effect such as I/O or assignment; hence their position inside the call to LongXToBase(). Since I won't otherwise use any intermediate results, storing them would be deceptive. Also, side effects complicate the proof of a function's correctness and impede operations such as shipping the calculation of each of a function's arguments off to separate processors--one function may depend on the side effect of another. By making explicit the precedence of calls to BaseXToLong(), strlen(), and LongXToBase(), this form suggests a macro (or template, in C++) implementation of a family of BaseXIncr functions. These functions add one to a macro argument, rather than the return of BaseXToLong(). Example 2 is a function that allows key comparison, even of keys stored in different bases.

This basic idea can be implemented in several different ways. For instance, the radix could be passed as an argument instead of being in the BaseRep structure; thus, one BaseRep structure could work for many different radices. The base representation could be read in from a file for further flexibility. C++ users should easily be able to turn this into a class, packaging together the BaseRep pointer and the functions operating on it. You might write the class so that the radix could be reset, and perhaps BaseXToLong() and LongToBaseX() would be protected virtual members, so that you could override them in descendant classes.

Conclusion

In working out this problem, I found that once I discovered adequate data structures, I was nearly done. The rest of the implementation consists of only a few lines of code--it is much shorter, in fact, than the version hardcoded to base-36, and it expresses the solution to the problem in a more straightforward way. Changing the radix or the character representation of the base requires only filling in tables, not error-prone recoding. And by isolating general-purpose primitives, I was able to keep all "messy" details out of my application-level functions.

Example 1: Hiding details from the above layers.

char* BaseXIncr(char* number, BaseRep* br)
{
     return LongToBaseX(BaseXToLong
(number, br) + 1,
          number, strlen(number), br);
}

Example 2: Function that allows key comparison, even of keys stored in different bases.

int LessThan(char* num1, BaseRep* br1,
             char* num2, BaseRep* br2)
{
     return BaseXToLong(num1, br1)
          < BaseXToLong(num2, br2);
}

Listing One

/* given an ASCII character, what is its value in the base? */
#define MAX_BASE     256
typedef struct baserep
{
     int  Radix;
     int  NumValues[MAX_BASE];
     char CharValues[MAX_BASE + 1];
} BaseRep;
BaseRep Base36DigitsLower = {
     { 36 },
/* 0 - 47 */
     {-1, -1, -1, /* ...and so on, until 48 negative ones in all*/
/* 48 ('0') - 57 ('9') */
     0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
/* 58 - 96: another run of -1 */
/* 97 ('a') - 122 ('z') */
     10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,         
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 
/* 123 - 255: whole bunch o' negative ones go here */},
     {"0123456789abcdefghijklmnopqrstuvwxyz"}
};

Listing Two

long BaseXToLong(char* number, BaseRep* br)
{
     long PowerOfRadix;
     long ret = 0;
     int  i;
     int  digits = strlen(number);
     for(i = digits - 1, PowerOfRadix = 1; 
          i >= 0; 
          i--, PowerOfRadix *= br->Radix)
          ret += (br->NumValues[(int)number[i]] * PowerOfRadix);
     return ret;
}

Listing Three

char* LongToBaseX(long number, char* buf, 
                  int buf_len, BaseRep* br)
{
     int i;
     for(i = buf_len - 1; i >= 0; i--)
     {
          buf[i] = br->CharValues[number % br->Radix];
          number /= br->Radix;
     }
     if(number) return 0;  /* null: error return */
     return buf;  /* so function will generate rvalue */
}

Back to Table of Contents