[NCLUG] Annoying post-meeting SQL curiosities

Aaron D. Johnson adj at fnord.greeley.co.us
Wed Dec 10 23:14:44 MST 2008


Robert Shelley writes:
> BTW, I wish I could have made last night's meeting, too. I wanted to
> hear the discussion, so let me second the request for notes and
> documentation!  OK, the rest of this is for Aaron and anyone else
> interested in SQL:

I only caught the last couple of minutes of the presentation.  It
looked like it was very cool stuff.  Wish I had caught the whole thing
myself.

> Although I haven't been a dba for maybe 20 years, I talked about your
> SQL question with one of our experienced SQL people in the shop.  He
> works with extremely large production databases (mega-millions of rows
> in complex production situations) using SQL Server.  Before I start, let
> me say that I've seen many comments over the years that SQL Server isn't
> exactly like Oracle isn't exactly like MySQL -- so it isn't unexpected
> that you might encounter different errors from the exact same situation!
> In any case, though, this means that his comments below are specific to
> our environment, interpreted by me, and YMMV!

It's certainly true that Oracle != SQL Server != MySQL != SQLite !=
Informix != PostgreSQL != your_favorite_database.  The SQL statements
I shared in the previous post were meant to be as simple-minded,
transparent, ANSI-standard, and generic as possible, not necessarily
representative of the actual tables and actual updates in the actual
application.  I should have pointed that out, I guess.

> * With his large tables of millions of rows, data coming in likely
> uses a file load instead of row-by-row inserts.  Look ahead in your
> design to the production future to see if you are really going to be
> doing small-scale INSERTS or if you really will be doing big LOAD
> processes instead.  It may side-step your current issue entirely and
> show it to be a design artifact due to very small-scale production
> and design concepts.
>
> * He noted that if he is doing a load, he deletes the indices, then
> loads several million records from file, then recreates the indices.
> This is because the load is MUCH slower if the indices are turned
> on.  (Myself, I would interpret that based on out-of-date experience
> as the indices are being rebuilt on the fly and that slows the load
> down! If so, that would seem to argue that the update is NOT atomic,
> but row-by-row. This would be contrary to your stated assumption.)

Yep, common practice for OLAP-type stuff is to do massive numbers of
inserts without the indices in place.  For my purposes, there will not
be any huge loads happening.  (Tens of INSERTs a minute would be heavy
usage, and I care more about the integrity of the data model than
about it being fast.)
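
For anyone who hasn't seen that pattern, it looks roughly like this.
(A sketch only; DROP INDEX syntax and the bulk-load statement vary by
product -- COPY in PostgreSQL, LOAD DATA INFILE in MySQL, BULK INSERT
in SQL Server -- and the file path here is made up.)

    DROP INDEX test_a_idx;                       -- drop before the load
    -- bulk-load millions of rows; e.g. in PostgreSQL:
    -- COPY test (a) FROM '/tmp/bulk_data.csv' WITH CSV;
    CREATE UNIQUE INDEX test_a_idx ON test (a);  -- rebuild once at the end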

As for an UPDATE being atomic or not, it's not a matter of my assuming
anything, but that ANSI SQL says an UPDATE statement is atomic and
affects all rows that match the predicates given or none of them.  If
all of the rows matching the predicate are updated as a unit, the
column constraint is never violated, so what's up with the error?
(I'm not trying to imply that doing so in this case would be easy.
It's obvious that implementation details would make this hard in
the vast majority of cases.)

The brief discussion of this at dinner was along the lines of "I can't
believe something that simple wouldn't work right, why don't you share
it with the list?"  I just brought it up as a (perhaps interesting)
annoyance.  It seems to have piqued your interest.

> * With regard to the specific situation you posed...  He asked why
> would you index at all if you were going to update every row anyway
> as you did in the example?  Again, thinking about the final
> production design and process may take it from an intellectual
> question to reality-based.

I went the other way: I discovered something in an application I was
building and turned it into an intellectual problem.  The INSERTs and
UPDATEs I shared were a degenerate test case designed to clearly
illustrate what I saw as an oddity, not what the actual application
does.

> * He thinks what you are seeing in the errors reported is that the
> different database software apps are reporting a problem with
> non-uniqueness of the index, but not using the same terminology.  In
> your example, you have forced a violation of uniqueness with your
> first INSERT statement.  There are now two 2's.  He said he thought
> your supposition that simultaneous changes to all rows is not
> realistic -- for instance, for 10M rows, how would that work as an
> atomic process anyway?  Simultaneously is asking a lot!

I'm only asking for the behaviour ANSI SQL specifies.  Is that too
much to hope for?  (Yes, quite probably.  And I may be
misunderstanding what ANSI SQL actually says, too.  It has been
criticised for insufficiently specifying the behaviour of an
implementation; see the third bullet item here:
http://en.wikipedia.org/wiki/SQL#Reasons_for_lack_of_portability)

Also, I'm not following you here on having "forced a violation of
uniqueness."  After doing the following:
  -- There is no 'test' table at this point.
  CREATE TABLE test (a INTEGER);
  CREATE UNIQUE INDEX test_a_idx ON test (a);
  INSERT INTO test (a) VALUES (1);
  INSERT INTO test (a) VALUES (2);
There cannot have been such a violation.
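
The UPDATE from my earlier post that then trips the error on some
products is just the increment:

  UPDATE test SET a = a + 1;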

And there is no uniqueness constraint violation when the two rows'
values are incremented to 2 and 3 by the UPDATE.  The only way the
uniqueness constraint is violated is if the first row's new value is
checked against the second row's _original_ value.  (Which is
certainly what is happening on those products where my UPDATE
statement fails.)

> * He suggested that instead of taking an index unique approach that
> you consider using a unique field (an assigned, guaranteed unique
> value taken from a separate table where the code removes the value
> from the table once it is used). Alternatively, back in my dba days,
> we often made a hopefully unique index by concatenating several
> fields to ensure uniqueness -- for a simple example, every value in
> lastname+last4SSN is very likely to be unique (at least once the
> lastname field is truncated and the data is well-cleaned!). That could
> even become a field in your table for add-match purposes. That
> particular example has its own problems with keeping the Lastname
> field really under control because there are many ways last names
> get changed in the real world, but it's one of the challenges for
> the DBA to find a workable way to do indexing!  :) Digression: did
> you realize that Social Security numbers are not truly unique?

I fall into the "database keys should be opaque, unique, meaningless
values" (aka surrogate keys) camp of thinking.  On the actual project,
just for grins, the primary keys on the tables are random UUID4s.
(http://en.wikipedia.org/wiki/UUID#Version_4_.28random.29)
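
Just to illustrate (a hypothetical sketch, not the real schema; the
CHAR(36) column is one common way to store a UUID as text, and the
literal below is a made-up UUID4 of the sort Python's uuid.uuid4()
would hand back):

    -- Opaque surrogate key: a random UUID4 stored as text.
    CREATE TABLE item (
        id CHAR(36) PRIMARY KEY,
        a  INTEGER
    );
    INSERT INTO item (id, a)
      VALUES ('5f2b7c8e-1d3a-4e9b-8c6d-2a1b3c4d5e6f', 1);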

> * Finally, if your ultimate design forces you to continue to use
> small-scale INSERTs and a unique index approach, then he thinks you
> might want to sidestep the problem you report by using a WHERE clause in
> the INSERT. It might side-step the violation of uniqueness, but test it
> out first.

I assume you mean "UPDATE ... WHERE ...;" and, yeah, the real project
does that.  Let's say I have a few more rows in the table so that the
values go 1 through 8 and I want to open up two slots between 4 and 5
by adding 2 to every value greater than 4...

    CREATE TABLE test (a INTEGER, other_col INTEGER, another_col TEXT);
    CREATE UNIQUE INDEX test_a_idx ON test (a);
    INSERT INTO test (a) VALUES (1);
     [ ... ]
    INSERT INTO test (a) VALUES (8);
    UPDATE test SET a = a + 2 WHERE a > 4;  -- 5..8 become 7..10

This still works as I'd expect on some implementations, and reports a
constraint violation on others.
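
For completeness, the usual dodge for products that check row-by-row
(a common trick, not something from your colleague) is to hop the
affected values clear of the existing range first, then bring them
back down:

    UPDATE test SET a = a + 1000 WHERE a > 4;    -- jump past everything
    UPDATE test SET a = a - 998 WHERE a > 1000;  -- land on the final values

That only works if the offset (1000 here) is safely above anything
already in the column, which is its own little race in a concurrent
system.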

> Y'know, over the years, I've found that sometimes just explaining
> the problem to someone else helps me see a basic issue -- often, as
> in this case, where I've made a potentially invalid assumption and
> they challenge it.

True, that.

> Well, it was fun for me to try to recall the terminology and see
> some of the same old issues from so many years ago while relaying
> his comments to you.  Hope it helps you out, Aaron!

It's been an interesting diversion looking into this during the
daytime, yes.

Is anyone interested in the scorecard so far?  I added gadfly to the
list of implementations I tried today.

- Aaron


