[NCLUG] Annoying post-meeting SQL curiosities

Shelley, Robert - Fort Collins, CO Robert.Shelley at ftc.usda.gov
Wed Dec 10 16:27:56 MST 2008


Hey Aaron

BTW, I wish I could have made last night's meeting, too. I wanted to
hear the discussion, so let me second the request for notes and
documentation! OK, the rest of this is for Aaron and anyone else
interested in SQL:

Although I haven't been a dba for maybe 20 years, I talked about your
SQL question with one of our experienced SQL people in the shop. He
works with extremely large production databases (mega-millions of rows
in complex production situations) using SQL Server.  Before I start, let
me say that I've seen many comments over the years that SQL Server isn't
exactly like Oracle isn't exactly like MySQL -- so it isn't unexpected
that you might encounter different errors from the exact same situation!
In any case, though, this means that his comments below are specific to
our environment, interpreted by me, and YMMV!

Following is a boiled-down version of his comments since we went pretty
far afield and rambled as we talked... If I sound confused in what I
write here, it would only be because of ignorance and bad transcribing
on my part.

* With his large tables of millions of rows, incoming data likely uses
a file load instead of row-by-row inserts. Look ahead to the production
future of your design to see whether you will really be doing
small-scale INSERTs or big LOAD processes instead. That may side-step
your current issue entirely and show it to be an artifact of very
small-scale production and design concepts.
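The contrast between the two ingest styles can be sketched like this (my
illustration, not from the thread, using Python's built-in sqlite3
module as a small stand-in; a real bulk load would use the engine's own
loader utility):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (a INTEGER)")

# Row-by-row style: one INSERT statement per value.
for value in (1, 2):
    conn.execute("INSERT INTO test (a) VALUES (?)", (value,))

# Bulk-load style: hand the driver the whole batch at once, closer in
# spirit to a file load than to per-row INSERTs.
conn.executemany("INSERT INTO test (a) VALUES (?)",
                 [(v,) for v in range(3, 1003)])
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM test").fetchone()[0]
```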

* He noted that when he does a load, he deletes the indices, then loads
several million records from file, then recreates the indices, because
the load is MUCH slower with the indices turned on. (Myself, based on
admittedly out-of-date experience, I would interpret that as the indices
being updated on the fly, which slows the load down. If so, that would
seem to argue that the update is NOT atomic but row-by-row, contrary to
your stated assumption.)
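The drop-load-recreate pattern he describes might look like this in
miniature (again my sketch with sqlite3, not his actual SQL Server
procedure):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (a INTEGER)")
conn.execute("CREATE UNIQUE INDEX test_a_idx ON test (a)")

# Before the big load: drop the index so it is not maintained row by row.
conn.execute("DROP INDEX test_a_idx")

conn.executemany("INSERT INTO test (a) VALUES (?)",
                 [(v,) for v in range(1, 10001)])

# After the load: rebuild the index in a single pass over the table.
# If the loaded data is not actually unique, this CREATE is where the
# error surfaces.
conn.execute("CREATE UNIQUE INDEX test_a_idx ON test (a)")
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM test").fetchone()[0]
```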

* With regard to the specific situation you posed... He asked why you
would index at all if you are going to update every row anyway, as you
did in the example. Again, thinking about the final production design
and process may take it from an intellectual question to a reality-based
one.

* He thinks what you are seeing in the errors reported is that the
different database packages are reporting a problem with non-uniqueness
of the index, just not in the same terminology. In your example, the
UPDATE forces a transient violation of uniqueness: the first row's new
value is 2 while the second row still holds its original 2, so for a
moment there are two 2's. He also thought your supposition of
simultaneous changes to all rows is not realistic -- for instance, with
10M rows, how would that work as an atomic process anyway?
Simultaneously is asking a lot!

* He suggested that instead of taking an index unique approach that you
consider using a unique field (an assigned, guaranteed unique value
taken from a separate table where the code removes the value from the
table once it is used). Alternatively, back in my dba days, we often
made a hopefully unique index by concatenating several fields to ensure
uniqueness -- for a simple example, every value in lastname+last4SSN is
very likely to be unique (at least once the lastname field is truncated
and the data well-cleaned!). That could even become a field in
your table for add-match purposes. That particular example has its own
problems with keeping the Lastname field really under control because
there are many ways last names get changed in the real world, but it's
one of the challenges for the DBA to find a workable way to do indexing!
:)  Digression: did you realize that Social Security numbers are not
truly unique?
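The concatenated-fields idea can be sketched as a multi-column unique
index (my illustration; the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (lastname TEXT, last4ssn TEXT)")
# The lastname+last4SSN idea, expressed as a two-column unique index
# rather than a literal concatenated field.
conn.execute("CREATE UNIQUE INDEX people_key ON people (lastname, last4ssn)")

conn.execute("INSERT INTO people VALUES ('Shelley', '1234')")
conn.execute("INSERT INTO people VALUES ('Shelley', '5678')")  # ok: last4ssn differs

duplicate_rejected = False
try:
    # Same lastname AND same last4ssn: the composite key catches it.
    conn.execute("INSERT INTO people VALUES ('Shelley', '1234')")
except sqlite3.IntegrityError:
    duplicate_rejected = True
```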

* Finally, if your ultimate design forces you to continue using
small-scale INSERTs and a unique-index approach, then he thinks you
might want to sidestep the problem you report by using a WHERE clause in
the INSERT. It might avoid the violation of uniqueness, but test it out
first.
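One reading of the WHERE-clause suggestion (my interpretation, not
tested against his environment) is an INSERT ... SELECT ... WHERE NOT
EXISTS form, which only inserts values that are not already present:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (a INTEGER)")
conn.execute("CREATE UNIQUE INDEX test_a_idx ON test (a)")
conn.execute("INSERT INTO test (a) VALUES (1)")

# Attempt to insert 1 (already there) and 2 (new); the WHERE clause
# filters out any value that would violate the unique index.
for v in (1, 2):
    conn.execute(
        "INSERT INTO test (a) "
        "SELECT ? WHERE NOT EXISTS (SELECT 1 FROM test WHERE a = ?)",
        (v, v))
conn.commit()

values = sorted(row[0] for row in conn.execute("SELECT a FROM test"))
```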

Y'know, over the years, I've found that sometimes just explaining the
problem to someone else helps me see a basic issue -- often, as in this
case, where I've made a potentially invalid assumption and they
challenge it.

Well, it was fun for me to try to recall the terminology and see some of
the same old issues from so many years ago while relaying his comments
to you. Hope it helps you out, Aaron!



- Bob Shelley, an Antique Person 
who remembers dBase II and FoxPro for smaller computers on CP/M and
MSDOS


-----Original Message-----
From: nclug-bounces at nclug.org [mailto:nclug-bounces at nclug.org] On Behalf
Of Aaron D. Johnson
Sent: Wednesday, December 10, 2008 1:58 PM
To: nclug at nclug.org
Subject: [NCLUG] Annoying post-meeting SQL curiosities

During the very pleasant post-meeting dinner discussions last night, I
mentioned I had been let down by a number of SQL databases over the
last week.  Paul suggested I share with the list, so here's my cause
of consternation.  Consider the following:

    CREATE TABLE test (a INTEGER);
    CREATE UNIQUE INDEX test_a_idx ON test (a);
    INSERT INTO test (a) VALUES (1);
    INSERT INTO test (a) VALUES (2);

So far, all is good.  I'll skip the SELECT showing that it worked as
expected.

Here's where the fun starts:

    UPDATE test SET a = a + 1;

This should leave us with two rows, with values 2 and 3, right?  Well,
not so fast.  Some databases do work that way, and others report that
the unique index constraint is being violated.  For some, the result
is dependent on the order of the INSERTs.  Neat, huh?

Tested so far are:
- INFORMIX-SQL Version 7.20.UD1
- Microsoft SQL Server 2000 (8.00.761)
- MySQL 5.0.51a (With the default storage engine in Debian Lenny)
- Oracle 9i 9.2.0.7.0
- PostgreSQL 8.3.5
- SQLite 3.5.9

Just for grins, I'm working on getting Ingres built and should have
access to a DB2 instance running on an IBM AS400/"i Series"/"System
i5"/whatever-they-call-it-this-week to try it on in a few weeks.
Anyone care to report results on an SQL database platform not listed
here?

I'll save the what-works-where report for later.

Am I wrong to think that an UPDATE is supposed to be atomic (not
divisible into smaller pieces) and that all rows matching the
(non-existent) predicates specified are to be changed simultaneously?
And would it not follow that there should be no duplicates in the
index when the UPDATE is done?

(And, yeah, bug reports to the failing projects are pending.)

Thoughts?

- Aaron
_______________________________________________
NCLUG mailing list       NCLUG at nclug.org

To unsubscribe, subscribe, or modify 
your settings, go to: 
http://www.nclug.org/mailman/listinfo/nclug


