[NCLUG] [Python-Dev] Judy for replacing internal dictionaries?

Douglas Baskins dougbaskins at frii.com
Tue Jul 30 15:13:53 MDT 2002


>> [Sean Reifschneider]
>> Recently at a Hacking Society meeting someone was working on
>> packaging Judy for Debian.  Apparently, Judy is a data-structure
>> designed by some researchers at Hewlett-Packard.  Its goal is to
>> be a very fast implementation of an associative array or
>> (possibly sparse) integer indexed array.
>>
>> Judy has recently been released under the LGPL.
>>
>> After reading the FAQ and 10-minute introduction, I started wondering
>> about whether it could improve the overall performance of Python by
>> replacing dictionaries used for namespaces, classes, etc...

> [Tim Peters] 
> Sorry, almost certainly not.  In a typical Python namespace lookup, the pure
> overheads of calling and returning from the lookup function cost more than
> doing the lookup.  Python dicts are more optimized for this use than you
> realize.  Judy looks like it would be faster than Python dicts for large
> mappings, though (and given the boggling complexity of Judy's data
> structures, it damn well better be <wink>).
>
> As a general replacement for Python dicts, it wouldn't fly because it
> requires a total ordering on keys, and an ordering explicitly given by
> bitstrings, not implicitly via calls to an opaque ordering function.
>
> Looks like it may be an excellent alternative to in-memory B-Trees keyed by
> manifest bitstrings (like ints and character strings or even addresses).

One of my favorite benchmarks is to take a good hash method and replace
the synonym (collision) handling code with the appropriate Judy array.
Since an empty Judy array is just a NULL pointer, this is typically an
easy change to make.  Next, compare speed and space performance versus
population.
Then change the size of the hash table (in the code using Judy) and
re-measure.  I have found that a hash table size of 1 is usually the
preferred choice.
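The benchmark described above can be sketched in Python terms (the
thread's language).  Note that libJudy is a C library; the ordered dict
used per slot here is only a hypothetical stand-in for a JudyL array,
not Judy's actual API.  Each slot starts out as None, mirroring how an
empty Judy array is just a NULL pointer, and a table size of 1
degenerates to a single Judy-style structure:

```python
class HybridHash:
    """Fixed-size hash table whose per-slot synonym storage is an
    ordered mapping (stand-in for a JudyL array).  With table_size=1,
    all keys land in one slot and the structure is pure "Judy"."""

    def __init__(self, table_size=1):
        self.table_size = table_size
        # Each slot is None until its first insert, like a NULL
        # Judy array pointer.
        self.slots = [None] * table_size

    def _slot(self, key):
        return hash(key) % self.table_size

    def insert(self, key, value):
        i = self._slot(key)
        if self.slots[i] is None:
            self.slots[i] = {}      # "allocate" the per-slot array
        self.slots[i][key] = value

    def lookup(self, key):
        bucket = self.slots[self._slot(key)]
        if bucket is None:
            return None
        return bucket.get(key)
```

Timing this insert/lookup loop at various table sizes and populations,
against a conventional chained hash, is the speed/space comparison the
paragraph describes.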

However, there are applications where the working-set (WS) is small 
enough to reside in the CPU's data cache.  In this case a hash table
size of 256 and a hash method of a simple shift/mask may be the 
preferred choice.  A hash table must fit in the data cache to maintain
good performance.  Using a large hash table when the WS outgrows the
data cache can negate the cache's effectiveness for everything else
(including the very important stack).  A good hash method, by spreading
accesses uniformly, works in direct opposition to the "locality" a data
cache needs.  The hierarchical use of memory
in Judy is well matched to the requirements of an effective data cache.
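For the cache-resident case, the simple shift/mask hash mentioned above
might look like the sketch below.  The shift amount and key alignment
are assumptions for illustration (e.g., dropping low bits that are
constant for 8-byte-aligned addresses); the table size of 256 matches
the paragraph:

```python
TABLE_SIZE = 256   # power of two, so a mask replaces the modulo
SHIFT = 3          # assumed: keys are 8-byte-aligned, low 3 bits constant

def shift_mask_hash(key):
    """Map an integer key to one of TABLE_SIZE slots with one shift
    and one mask -- far cheaper than a scrambling hash, at the cost
    of relying on the keys' own distribution."""
    return (key >> SHIFT) & (TABLE_SIZE - 1)
```

Because nearby keys map to nearby slots, this hash preserves exactly
the locality that a stronger, scrambling hash would destroy.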

Doug Baskins  doug at sourcejudy.com


