From: Andres Freund
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Msg-id: 20130831183426.GA6989@awork2.anarazel.de
In response to: Re: INSERT...ON DUPLICATE KEY IGNORE (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers

Hi,

On 2013-08-30 19:38:24 -0700, Peter Geoghegan wrote:
> On Fri, Aug 30, 2013 at 5:47 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > While I ponder on it some more, could you explain whether I understood
> > something correctly? Consider the following scenario:
> >
> > CREATE TABLE foo(a int UNIQUE, b int UNIQUE);
> >
> > S1: INSERT INTO foo VALUES (0, 0);
> > S1: BEGIN;
> > S1: INSERT INTO foo VALUES (1, 1);
> > S1: SELECT pg_sleep(3600);
> > S2: BEGIN;
> > S2: INSERT INTO foo VALUES (2, 1);
> > S3: SELECT * FROM foo WHERE a = 0;
> >
> > Won't S2 in this case exclusively lock a page in foo_a_key (the one
> > where 2 will reside) for 3600s while it waits for S1 to finish during
> > the speculative insertion into foo_b_key?
> 
> Yes. The way the patch currently handles that case is unacceptable. It
> needs to be more careful about holding an exclusive lock on
> earlier-locked indexes when locking the second and subsequent indexes,
> iff it blocks on another transaction in the course of doing so.

After some thinking, I don't believe any solution primarily based on
holding page-level locks across other index operations is going to
scale acceptably.

What I had in mind when playing around with this is the following trick:
just as in your patch, split index insertion into two phases, and perform
the first of them before inserting the heap tuple. In what you called
"speculative insertion", don't stop at locking the page, but actually
insert an index tuple into the index. As we haven't yet inserted the heap
tuple, we obviously can't insert a tuple pointing to it. Instead, set a
flag on the item and reuse the item pointer to store the xid of the
inserting process. I coined those items "insertion promises"...
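
To make the encoding concrete, here is a minimal C sketch of the idea -
this is not PostgreSQL source; the struct layout is simplified (the real
ItemPointerData looks different) and all the names are invented:

/*
 * Hypothetical sketch: encode an "insertion promise" by reusing the
 * item pointer's bytes to carry the inserting transaction's xid.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef uint32_t TransactionId;

typedef struct
{
    uint32_t block;             /* normally: heap block number */
    uint16_t offset;            /* normally: heap line pointer offset */
} ItemPointerData;

#define PROMISE_OFFSET 0xFFFF   /* sentinel: never a valid heap offset */

/* Turn an index item's heap pointer into a promise carrying our xid. */
static void
MarkAsPromise(ItemPointerData *tid, TransactionId xid)
{
    tid->block = xid;           /* reuse the four block-number bytes */
    tid->offset = PROMISE_OFFSET;
}

static bool
IsPromise(const ItemPointerData *tid)
{
    return tid->offset == PROMISE_OFFSET;
}

/* Once the heap tuple exists, replace the promise with the real TID. */
static void
FulfillPromise(ItemPointerData *tid, uint32_t blk, uint16_t off)
{
    tid->block = blk;
    tid->offset = off;
}

int
main(void)
{
    ItemPointerData tid;

    MarkAsPromise(&tid, 1234);
    printf("promise? %d, xid %u\n", IsPromise(&tid), tid.block);
    FulfillPromise(&tid, 42, 7);
    printf("promise? %d, tid (%u,%u)\n", IsPromise(&tid), tid.block, tid.offset);
    return 0;
}

The point is just that a promise stays distinguishable from a normal
item while carrying enough information - the xid - for others to wait
on.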

Index searches will ignore promises they see, but concurrent index
insertions will notice it's a promise and wait on that xid (or not, if
it has already aborted). That fits pretty well into the existing btree
code. Once the insertion of an index promise has succeeded for all
unique indexes and the heap tuple has been inserted, convert those
promises into actual item pointers pointing to the heap tuple. That
probably requires a second descent of each index and some cleverness to
find the correct item in the face of concurrent page splits, but that's
not too bad (possibly we could do away with the second search if we
continue to hold a pin).
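
In rough C, the decision a concurrent unique-index inserter makes upon
hitting a promise might look like the sketch below - again a toy:
LookupXact() and WaitForXact() merely stand in for the real
transaction-status and lock-table machinery.

#include <stdio.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactStatus;

/* Toy stand-ins for TransactionIdIsInProgress(), XactLockTableWait()... */
static XactStatus LookupXact(TransactionId xid) { (void) xid; return XACT_ABORTED; }
static void WaitForXact(TransactionId xid) { (void) xid; }

typedef enum
{
    UNIQUE_OK,                  /* promise is dead: slot can be recycled */
    UNIQUE_VIOLATION,           /* promiser committed: duplicate key */
    UNIQUE_RECHECK              /* we slept: rescan the page, decide again */
} UniqueOutcome;

/* What a concurrent unique-index inserter does on finding a promise. */
static UniqueOutcome
CheckPromiseForUnique(TransactionId promise_xid)
{
    switch (LookupXact(promise_xid))
    {
        case XACT_ABORTED:
            return UNIQUE_OK;
        case XACT_COMMITTED:
            return UNIQUE_VIOLATION;
        default:
            WaitForXact(promise_xid);   /* block until the inserter ends */
            return UNIQUE_RECHECK;
    }
}

int
main(void)
{
    printf("%d\n", (int) CheckPromiseForUnique(1234));  /* 0: UNIQUE_OK */
    return 0;
}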

The fact that xids can now appear in the index creates some slight
complications, because those promises need to be vacuumed away - but
that's relatively easy to fit into the bulkdelete API, and I don't think
the contortions to the vacuum code will be too bad.
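
A bulkdelete-style pass could then reap promises whose inserting
transaction aborted, along the lines of this simplified sketch (the
item representation and XactAborted() are stand-ins, not real APIs):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Grossly simplified index item; a real btree line item differs. */
typedef struct
{
    bool          is_promise;
    TransactionId xid;          /* only meaningful for promises */
    bool          dead;         /* marked for removal */
} IndexItem;

/* Toy stand-in for "did this xid abort?" */
static bool
XactAborted(TransactionId xid)
{
    return xid % 2 == 0;        /* pretend even xids aborted */
}

/* Reap promises left behind by aborted transactions. */
static int
ReapDeadPromises(IndexItem *items, int nitems)
{
    int reaped = 0;

    for (int i = 0; i < nitems; i++)
    {
        if (items[i].is_promise && XactAborted(items[i].xid))
        {
            items[i].dead = true;   /* real code would delete the item */
            reaped++;
        }
    }
    return reaped;
}

int
main(void)
{
    IndexItem page[] = {{true, 10, false}, {false, 0, false}, {true, 11, false}};

    printf("reaped %d promise(s)\n", ReapDeadPromises(page, 3));
    return 0;
}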

Imo that solution is fairly elegant: it doesn't cause any heap bloat,
and it causes only the same amount of index bloat that
"insert-into-heap-first" solutions cause. It also has pretty good
concurrency behaviour, because you never need to retain a page lock for
longer than a normal index insertion does.
It's certainly a bit more complex than your solution, though...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


