[NCLUG] Annoying post-meeting SQL curiosities

Michael Milligan milli at acmeps.com
Tue Dec 16 19:38:27 MST 2008


Aaron D. Johnson wrote:
> Michael Milligan writes:
>> I.e., the correct handling of this is:
>>
>> BEGIN;
>> DROP INDEX test_a_idx;
>> UPDATE test SET a = a + 1;
>> CREATE UNIQUE INDEX test_a_idx ON test (a);
>> COMMIT;
> 
> I've got to say I find that just leaves a bad taste in my mouth.  Any
> operations that could make use of the index are reduced to full table
> scans.  Re-creating the index incurs a second full table scan.  And,

All database planners I know of would ignore the index for this
operation anyway, as you have no WHERE clause constraining the results
on the indexed column.
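For example, a quick sketch against the test table from the quoted
example above (assuming PostgreSQL):

    EXPLAIN UPDATE test SET a = a + 1;

You will almost certainly see a Seq Scan node rather than an Index
Scan, since every row has to be visited no matter what the index says.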

With a disk buffer cache, the second pass to re-create the index would
almost certainly take place in memory, unless you are talking about
millions of rows, enough to exceed the disk buffer cache.

> worse, the uniqueness constraint on the data in column a is removed
> for the course of the transaction.

No big deal.  As I said, the update provably (in a mathematical sense)
preserves uniqueness: if all values of a are distinct, then all values
of a + 1 are distinct as well, so re-creating the unique index is
guaranteed to succeed.

The additional bonus of re-creating the index is that it will be as
compact as possible on disk and thus in the best shape for future
performance (i.e., the highest likelihood that it'll fit into shared
memory and/or the disk buffer cache).
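(As an aside, and assuming PostgreSQL here: if you only want that
compaction and aren't changing any data, you can rebuild an existing
index in place with REINDEX, e.g.

    REINDEX INDEX test_a_idx;

though that obviously doesn't help with the unique-constraint problem
being discussed.)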

> Since you bring up PostgreSQL specifically, the development team seems
> to regard the current behaviour as a bug.  There's a discussion on
> exactly this and what an impact it would have on the code starting
> here:
> http://archives.postgresql.org/pgsql-hackers/2006-09/msg01202.php

Interesting, though I wouldn't go so far as to say Tom Lane agreed it
was a bug.  I would certainly agree that it violates the principle of
least surprise anyway.

> The nicest workaround I've found so far goes like this.  We'll just
> assume that my table does actually have more than just the "A" column
> in it and I'm ignoring those for now.  And I want to make some space
> between rows where A is 1 and A is 2.
> 
>     CREATE TABLE test (a INTEGER UNIQUE, irrelevantdata VARCHAR(100));
>     -- This index is more-or-less implicit in the UNIQUE column
>     -- constraint above.  Making it explicit.
>     CREATE UNIQUE INDEX test_a_idx ON test (a);
>     -- stuff some sample rows into the table
>     BEGIN;
>     INSERT INTO test (a, irrelevantdata) VALUES
>         (1, 'foo');
>     INSERT INTO test (a, irrelevantdata) VALUES
>         (2, 'bar');
>     INSERT INTO test (a, irrelevantdata) VALUES
>         (3, 'baz');
>     COMMIT;
>     -- stick a new value in after the row (1, 'foo'),
>     -- increasing A in all rows that follow
>     BEGIN;
>     UPDATE test SET a = 0 - a - 1 WHERE a >= 2;
>     INSERT INTO test (a, irrelevantdata) VALUES
>         (2, 'xyzzy');
>     UPDATE test SET a = -a WHERE a < 0;
>     COMMIT;
> 
> I found this to be quite clever.  It maintains the column constraint,

That is clever.  But what jumps out at me is that this will likely take
more than twice as long to run as just dropping the index and
re-creating it, since you have two UPDATE commands and both of those are
incrementally updating the index for each row updated.  Perhaps
performance isn't a concern.  But if you're using PostgreSQL, you can
prepend the UPDATE commands with EXPLAIN to see what the planner would
do and what the cost would be for the updates...  compare and contrast.
(I do it all the time...)
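For example, something like this (a sketch against the sample table
above; the cost estimates will vary with your data and configuration):

    EXPLAIN UPDATE test SET a = 0 - a - 1 WHERE a >= 2;
    EXPLAIN UPDATE test SET a = -a WHERE a < 0;

Add up the estimated costs from both plans and compare against EXPLAIN
on the single unconstrained UPDATE from my earlier example.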

> doesn't depend on any vaguely specified implementation-specific
> behaviour, doesn't do a full table scan for every single query, and it

Careful though... the planner _could_ still decide to do a seq scan.
It depends on the table size, index size, memory configuration, cost
parameters, etc.

> is reasonably clear about what is happening.  (Credit where it is due:
> http://web.archive.org/web/20070818213714/http://people.planetpostgresql.org/greg/index.php?/archives/46-Updating-unique-columns.html)

Good stuff.

FYI, I'm used to dealing with databases that contain millions and
millions of rows and hundreds of gigabytes on disk, so indices are key
to good performance.  I always _have_ to minimize sequential table scans
whenever possible with tables that large, and I do have (maintenance)
cases where a column has to be updated on every row.  A full sequential
scan is unavoidable in those cases (again, an index is useless when you
don't constrain on the indexed column in the WHERE clause).

Regards,
Mike

-- 
Michael Milligan                                   -> milli at acmeps.com


