[NCLUG] Annoying post-meeting SQL curiosities

Sun Dec 14 03:55:18 MST 2008

Aaron D. Johnson wrote:
>
> This should leave us with two rows, with values 2 and 3, right? 
No, maybe, and yes.

Database engines have an optimizer, this will attempt to do things in
the order it thinks will run fastest. An update is actually two
different operations, first an insert and then a delete.  The actually
row is not changed in the sense that you think it is.   So if a=3 is
inserted first then  a=2 is deleted ... wow presto no error.  With just
two rows in a table, you may/would expect consistent results with any
given OS with any given DB engine. Because the optimizer for each given
DB engine has the same algorithms:  it may decide consistently to start
with a=1,  resulting in a consistent failure or start it may start with
row a=2 and consistently succeed.   So your test is telling you a bit
about the optimizer, namely how it behaves under these very limited
conditions.

How to get it to fail:

With SQL Server 2000 / 2005, Informix, DB2 (9.1/9.5), Oracle, and MySQL
a unique index does not reorder the data unless it is created as a
clustered index.  The command you used would not do this.  Clustering
put the rows on the disk physically in order of the index.  So a=1 would
physically be first on the page and a=2 would be second.  By the way,
you can only create a clustered index on MySQL if you are using InnDB
(will not work with MyISAM)

Let's say you create a clustered index with these two rows. If the db
engine uses the index for your update, it would fail. Because it would
start with a=1. However, with only two rows, most likely the engine
would NOT use your index!  Why? A full table scan would be faster then
using the overhead of the index.  So, you are still left with a chance
that it might work.

Now, let's add many many rows.  Now, the database optimizer will decide
that even with the overhead of using the index it would faster than a
full table scan.  The result would be a failure.  There are commands to
force/hint the optimizer to use an index, but different engines behave
differently and some, like SQL Server,  will still ignore an index even
in hinted if a table scan is faster in an UPDATE statement.
>
> Am I wrong to think that an UPDATE is supposed to be atomic (not
> divisible into smaller pieces) and that all rows matching the
> (non-existant) predicates specified are to be changed simultaneously?
> And would it not follow that there should be no duplicates in the
> index when the UPDATE is done?
>   
The update statement affects each row one by one.  This is a good thing,
because this allows you to write triggers that actually will do some
task on each row. But these changes are only logged and staged  Once all
the rows are inserted and deleted they can be committed.   If
auto-commit is on, then your done. That is, the commit happens as part
of the update.(Use auto-commit carefully, with great convenience comes
great responsibility.)  Otherwise a commit command needs to be sent by
the user. So, in this sense it is atomic.  That is, all the changes are
committed at once or rolled back at one.

Hope that helps

William