Thread: Getting sequence-generated IDs from multiple row insert

Getting sequence-generated IDs from multiple row insert

From
Ben Hoyt
Date:
Hi folks,

I've just run into a subtle but fairly serious race condition while using web.py's SQL library to insert multiple rows into the database and return their IDs (a "serial primary key" column). Specifically I'm using the multiple_insert() function that web.py defines here:


This function runs a query like this:

    INSERT INTO table (col1, col2) VALUES (col1_1, col2_1), (col1_2, col2_2), ...; SELECT currval('table_id_seq');

Using the output of the currval(), web.py tries to build a list of the most recent IDs by creating a range from "currval - num_rows_inserted + 1" through "currval". In Python:

    out = range(out-len(values)+1, out+1)

This *looks* nice, and must have seemed fine to the developers who implemented it, but I've just hit a case where two sessions each doing a multiple insert don't use sequential IDs. For example, the range code above for the first insert gave 2117552...2117829. And the second insert gave 2117625...2117818. Which are obviously overlapping and is a nasty bug waiting to happen. Thankfully it caused an IntegrityError further down in my code so I didn't screw things up.

First of all, I presume this is expected, and is how the sequence with a multi-row insert is supposed to work? In other words, the sequence guarantees the IDs will be unique, but with multi-row insert, they won't necessarily be consecutive? If so, it's a fairly serious bug in web.py's multiple_insert(), which probably shouldn't return anything due to this issue.

Second, what's the right thing to do here? The first thing I found was PostgreSQL's "RETURNING" clause, but somewhat frustratingly for this use case, even that's not guaranteed to return the results in the order you specified. I need the IDs in insertion order so I can do further processing. Tom Lane and others in this thread indicate that this is not a guarantee of the RETURNING clause, for future optimization reasons and due to how SQL handles sets:


So currently I've changed my code to use RETURNING and then I'm ordering the results based on a secondary column that I know the order of. This works, but seems clunky, so I'm wondering if there's a nicer way.

Thanks,
Ben

Re: Getting sequence-generated IDs from multiple row insert

From
Andrew Sullivan
Date:
On Mon, Mar 31, 2014 at 03:28:14PM -0400, Ben Hoyt wrote:

> , but I've just hit a case where two sessions each doing a
> multiple insert don't use sequential IDs. For example, the range code above
> for the first insert gave 2117552...2117829. And the second insert gave
> 2117625...2117818. Which are obviously overlapping and is a nasty bug
> waiting to happen. Thankfully it caused an IntegrityError further down in
> my code so I didn't screw things up.

Good thing you caught it.  But yes, just from the description it
looked like an obvious race to me.  Concurrency is hard.

> Second, what's the right thing to do here? The first thing I found was
> PostgreSQL's "RETURNING" clause, but somewhat frustratingly for this use
> case, even that's not guaranteed to return the results in the order you
> specified.

In SQL, _nothing_ is guaranteed to return in the order you specified.
This isn't really a Postgres thing; unless you use ORDER BY, SQL's
sets are not ordered.

> I need the IDs in insertion order so I can do further processing.

This sets off alarm bells for me.  What further processing are you
doing?  Is it possible that you could move that into a single step in
the database (maybe with a function or even a trigger) so that the
result of your RETURNING really would provide you with what you need?

> So currently I've changed my code to use RETURNING and then I'm ordering
> the results based on a secondary column that I know the order of. This
> works, but seems clunky, so I'm wondering if there's a nicer way.

This is probably what I'd do, assuming that "further processing" isn't
more data transformation.  If it _is_, then I'd do the whole thing in
a single step (in the database, once I inserted).

A

--
Andrew Sullivan
ajs@crankycanuck.ca


Re: Getting sequence-generated IDs from multiple row insert

From
David Johnston
Date:
Andrew Sullivan-8 wrote
>> So currently I've changed my code to use RETURNING and then I'm ordering
>> the results based on a secondary column that I know the order of. This
>> works, but seems clunky, so I'm wondering if there's a nicer way.
>
> This is probably what I'd do, assuming that "further processing" isn't
> more data transformation.  If it _is_, then I'd do the whole thing in
> a single step (in the database, once I inserted).

If order is an implicit property of the source data then you need to
explicitly encode that order during (or before) import.  There are numerous
ways to implement such but except for extremely simple cases PostgreSQL will
not do the appropriate thing automatically in the face of concurrency.

Also, do you need sequential IDs or just IDs that are ever increasing?  And
if the later then tagging the input source will let you distinguish between
two different datasets even if their sequences are overlapping.

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Getting-sequence-generated-IDs-from-multiple-row-insert-tp5798092p5798107.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Getting sequence-generated IDs from multiple row insert

From
Andrew Sullivan
Date:
On Mon, Mar 31, 2014 at 01:34:04PM -0700, David Johnston wrote:
>
> If order is an implicit property of the source data then you need to
> explicitly encode that order during (or before) import.

Sure, but the problem the OP had I thought was that the RETURNING
clause doesn't guarantee that the rows coming back are in the order
they were inserted.  This is just a SQL thing.  (I guess you could
ORDER BY the RETURNING clause, right?)

> There are numerous
> ways to implement such but except for extremely simple cases PostgreSQL will
> not do the appropriate thing automatically in the face of concurrency.

It _is_ doing the appropriate thing, though: this is SQL.  The rows
aren't ordered unless you tell them to be.

A

--
Andrew Sullivan
ajs@crankycanuck.ca