[NCLUG] Annoying post-meeting SQL curiosities

Bob Proulx bob at proulx.com
Wed Dec 10 19:52:37 MST 2008


Shelley, Robert - Fort Collins, CO wrote:
> Aaron D. Johnson wrote:
> > During the very pleasant post-meeting dinner discussions last night, I
> > mentioned I had been let down by a number of SQL databases over the
> > last week.  Paul suggested I share with the list, so here's my cause
> > of consternation.  Consider the following:
> > 
> >     CREATE TABLE test (a INTEGER);
> >     CREATE UNIQUE INDEX test_a_idx ON test (a);
> >     INSERT INTO test (a) VALUES (1);
> >     INSERT INTO test (a) VALUES (2);
> > 
> > So far, all is good.  I'll skip the SELECT showing that it worked as
> > expected.
> > 
> > Here's where the fun starts:
> > 
> >     UPDATE test SET a = a + 1;

I am not a DBA.  But that hasn't stopped me from working on projects
which required one.  :-) I think the root of your problem is that you
are doing something that is frowned upon.

Why are you changing the value of a key column?  Let the database
assign a surrogate primary key and never change it.  You will avoid
this turmoil and consternation.

If you have a case that genuinely seems to require changing the key
yourself, that would be an interesting discussion.
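
For example, something like this (PostgreSQL syntax; the SERIAL id
column is my illustration, not part of the original schema):

    CREATE TABLE test (
        id SERIAL PRIMARY KEY,  -- the database picks this value
        a  INTEGER              -- application data, free to change
    );

Then a is just data, and renumbering it never touches the key the
database actually uses to identify the rows.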

> > This should leave us with two rows, with values 2 and 3, right?

I think this is quite a hard problem.  The failure happens because
the engine checks uniqueness row by row: as soon as the row holding 1
is updated to 2, it collides with the row still holding 2.  To
guarantee no transient duplicates, the database would need to make a
complete copy of the table (or at least of the index) while passing
through the operation.  If your table is very large then this would
consume a very large amount of either memory or disk for the
temporary storage.  Imagine this operation on a table holding a very
large amount of data.
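
If you really must renumber in place, one old trick (assuming all of
the values are positive, so the negated range is free) is to move
everything out of the way first:

    UPDATE test SET a = -(a + 1);  -- no new value collides with an old one
    UPDATE test SET a = -a;        -- flip back to the intended values

Two passes, but no transient duplicates and no temporary copy of the
table.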

Being able to do this might be cool, but people concerned with
high-performance databases will complain that always making a
temporary copy is slower than not making one.  I would imagine that
the choice of database engine strongly affects the result.  Most
people would choose the faster engine and avoid asking it to do
things which hurt.  (Doctor, it hurts when I do this.  Response:
then don't do that.)

> > Am I wrong to think that an UPDATE is supposed to be atomic (not
> > divisible into smaller pieces) and that all rows matching the
> > (non-existent) predicates specified are to be changed simultaneously?
> > And would it not follow that there should be no duplicates in the
> > index when the UPDATE is done?

Wasn't it atomic?  When it failed, didn't the data remain the old
data?  If instead it left you with corrupted tables, then that would
be the more serious bug.  I assume that when it failed the data
simply wasn't updated, which is exactly what should happen when an
atomic operation fails.
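
A quick way to convince yourself (wrapped in an explicit transaction
just to be thorough):

    BEGIN;
    UPDATE test SET a = a + 1;  -- fails with a uniqueness violation
    ROLLBACK;                   -- the failed transaction is discarded
    SELECT a FROM test;         -- should still show the original 1 and 2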

> BTW, I wish I could have made last night's meeting, too. I wanted to
> hear the discussion, so let me second the request for notes and
> documentation!

OT: I was also under the weather and unable to go.  Actually, I am
still not feeling well.  So I would also be interested in any notes
from last night.

> Following is a boiled-down version of his comments since we went pretty

I pretty much agreed with all of your comments.

> * With his large tables of millions of rows, data coming in likely uses
> a file load instead of row-by-row inserts.

Doing a full database load doesn't require fine-grained locking,
which is one of the things that makes it faster.  Of course, doing an
offline database load requires being offline, and the performance of
offline operations usually isn't as critical.
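
In PostgreSQL, for example, a bulk load looks something like this
(the file path is made up):

    COPY test (a) FROM '/tmp/test_data.txt';

One statement streams the whole file in, instead of paying the
per-statement overhead of millions of individual INSERTs.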

> * He noted that if he is doing a load, he deletes the indices, then
> loads several million records from file, then recreates the indices.
> This is because the load is MUCH slower if the indices are turned on.

Yes.  A basic insertion into a table without an index is O(1) per
row.  But maintaining an index costs extra, depending upon the type
of index; for the usual B-tree it is O(log N) per row.  O(log N) is
still quite good, though.
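
With the table from the start of the thread, that recipe would be
roughly:

    DROP INDEX test_a_idx;
    -- ... bulk load several million rows here ...
    CREATE UNIQUE INDEX test_a_idx ON test (a);

Building the index once over the full data set lets the engine sort
the keys in a single pass instead of rebalancing the tree a few
million times.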

Also, the worst-case performance for inserting data one by one into
a sorted collection is often when the keys are already sorted.  Doing
so may cause the maximum number of internal rebalancing rotations as
the index is kept up to date.  Reloading a database dump is typically
exactly this case, since the primary key is usually auto-incremented,
therefore stored in ascending order, and therefore dumped that way
too.  Worst-case performance may be *significantly* worse than
typical performance.
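
Back of the envelope, just to put numbers on it: for N = 10,000,000
rows, log2(N) is about 23, so keeping a B-tree index updated during
the load costs on the order of 230,000,000 comparisons (plus whatever
rebalancing they trigger), versus on the order of N appends for the
unindexed load.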

> Y'know, over the years, I've found that sometimes just explaining the
> problem to someone else helps me see a basic issue -- often, as in this
> case, where I've made a potentially invalid assumption and they
> challenge it.

Excellent advice!

Bob


