[NCLUG] RE: [Python-Dev] Judy for replacing internal dictionaries?

Tim Peters tim.one at comcast.net
Sat Aug 3 20:36:32 MDT 2002


[Douglas Baskins, on www.sourcejudy.com/benchmarks/JudySLcompare.tar.gz]
> ...
> Remarkably, JudySL was the fastest "ordered mapping" method by a slight
> margin, but was more notable for its memory efficiency.  A hashing
> method that scales requires awkward re-sizing of the hash table.  This
> would likely make the stores slower and uneven in performance.  Hash
> lookup times were generally the fastest by a slim margin over JudySL.

But the store and lookup times for hash were significantly better than for
JudySL in the two tables at the top of that file, where anything over 10% is
"significant" in my business, anything over 20% is "highly significant", and
30% is in "no-brainer" territory:

store/line  lookup/line  ADT
  2.847 µs     2.118 µs HASH
  3.845 µs     2.407 µs JUDY
   35.1%        13.6%   speedup if switching from Judy to hash

store/line  lookup/line  ADT
  2.236 µs     1.863 µs HASH
  3.439 µs     2.350 µs JUDY
   53.8%        26.1%   speedup if switching from Judy to hash
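
(The "speedup" rows are just the ratio of the two times, minus one:  in the
first table, 3.845/2.847 is about 1.351, so stores run about 35.1% faster
under hashing, and likewise for the other columns.)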

> ...
> I believe any of the methods in this benchmark (except hashing) would be
> an excellent match for use in an interpreter such as Python.

I don't follow this.  Hashing has served us extremely well, and as I said
before, namespace hashtables are generally tiny in Python.  People have
optimized the hashtable implementation over a decade, and it screams.  BTW,
I note that the hash tables you used had chained collision lists; Python
uses open addressing instead, partly for better locality and partly to save
memory (hidden malloc overheads can kill you when allocating gazillions of
tiny records, and

typedef struct hashrec
{
    char *word;             /* the stored key string */
    struct hashrec *next;   /* next record on this bucket's collision chain */
} HASHREC;

likely burns at least 4 bytes per hashrec (50% overhead on a 32-bit box)
just for hidden malloc bookkeeping; Python avoids both the per-record malloc
overhead and the per-record 4-byte pointer to "the next" record).
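
For contrast, here is roughly what one slot of Python's open-addressed table
looks like -- a simplified sketch after the dictobject.c of that era, not a
literal excerpt.  All the slots live in one contiguous array, so there's no
per-entry malloc and no chain pointer; collisions just probe other slots in
the same array:

#include <Python.h>

/* Simplified sketch of an open-addressed dict slot (after CPython's
   dictobject.c; not a literal excerpt). */
typedef struct {
    long me_hash;        /* cached hash of me_key, so probes can skip
                            most full string comparisons */
    PyObject *me_key;    /* key, or NULL for an empty slot */
    PyObject *me_value;  /* value, or NULL if no entry lives here */
} dictentry;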

> (Of course, I would prefer you use Judy.)  I think memory efficiency and
> scalability are the most important requirements of an interpreter.

We worry about all seven of those <wink>, but the tradeoffs are complex.

>>
>> ...del
>> lookdict_string(dictobject *mp, PyObject *key, register long hash)
>> ...del
>>     if (ep->me_key == NULL || ep->me_key == key) return ep;

> I suspect the performance 'shortcut' by comparing the string pointer
> is 'buried' under the expensive subroutine call to lookdict_string().
> Does removing this code affect performance?  (If it does, I would be
> curious by how much, why and what was the test case?)

Oren Tirosh did some experiments with inlining this on key paths and posted
results to the Python-Dev mailing list.  The function-call overhead is the
bulk of the expense, and after inlining global namespace lookups they were
within 10% of the speed of local namespace lookups.  The jargon there isn't
obvious, so I'll spell that out a little:  a global namespace lookup is done
via the code I posted, looking up an interned string in a Python dict.  A
local namespace lookup doesn't involve dicts at all.  Instead local names
are mapped to a range of little integers at compile-time, and the code
generated for a local namespace lookup simply uses that integer to index a
contiguous vector containing the locals' values.
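
In C terms, a local lookup amounts to something like the sketch below (a
simplified rendering of what the eval loop does for a LOAD_FAST, not a
literal excerpt from ceval.c, with error handling omitted):

#include <Python.h>

/* Simplified sketch: a LOAD_FAST-style local lookup.  'fastlocals' is
   the frame's contiguous vector of local values, and 'i' is the little
   integer the compiler assigned to the name. */
static PyObject *
load_local(PyObject **fastlocals, int i)
{
    PyObject *value = fastlocals[i];  /* plain array indexing: no hashing,
                                         no string compare, no dict probe */
    Py_XINCREF(value);                /* NULL here means the local is unbound */
    return value;
}

Compare that to the global case, which has to hash the (interned) name and
probe the dict even with the pointer-comparison shortcut.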

OTOH, there are other schemes on the table to make global lookups more like
local lookups, mapping to integer indices at compile-time.  This is
difficult in Python because the language is highly dynamic, and the *set* of
global names can mutate at runtime in ways static analysis can't determine.

> I am considering a "persistent" Judy project in the near future.  My
> goal would be to obsolete the b-tree.

I maintain Zope's B-Tree code, so I'd sure welcome that <wink>.  Note that
B+ Trees are ubiquitous in the database world, and they haven't historically
been concerned much with in-memory efficiency.  Instead the nodes are
carefully laid out to match physical disk characteristics, and the thrust is
to guarantee low worst-case bounds on the number of disk seeks needed.  I
expect that *could* be right up Judy's alley too.

> Also on the plate is a more scalable malloc() (just waiting to be done)
> and a fast, multi-cpu version of Judy.

If you need anyone in court, I'll be happy to testify that many platform
mallocs suck <wink>.
