Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: INSERT ... ON CONFLICT {UPDATE | IGNORE}
Date
Msg-id CAM3SWZRdNXyo_wpdQJacjpBPTpN7ONJtS3nniHMAFsGoudU2YA@mail.gmail.com
Whole thread Raw
In response to Re: INSERT ... ON CONFLICT {UPDATE | IGNORE}  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On Mon, Dec 29, 2014 at 11:52 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Correct, I haven't seen any problems with approach #1

That helps with debugging #2, then. That's very helpful.

> Generally the problem will occur early on in the process, and if not then it
> will not occur at all.  I think that is because the table starts out empty,
> and so a lot of insertions collide with each other.  Once the table is more
> thoroughly populated, most query takes the CONFLICT branch and therefore two
> insertion-branches are unlikely to collide.
>
> At its simplest, I just use the count_upsert.pl script and your patch and
> forget all the rest of the stuff from my test platform.

I can reproduce this on my laptop now. I think that building at -O2
and without assertions helps. I'm starting to work through debugging
it.

I threw together a quick script for getting pg_xlogdump into a
Postgres table (a nice use of the new pg_lsn type). It's here:

https://github.com/petergeoghegan/jjanes_upsert/blob/master/pg_xlogdump2csv.py

It tells a story. Looking at the last segment before shutdown when the
problem occurred, I see:

postgres=# select count(*), tx from my_xlogdump group by tx having
count(*) > 4 order by 1;count |   tx
-------+---------    5 | 1917836    5 | 1902576    5 | 1909746    5 | 1901586    5 | 1916971    6 | 1870077   39 |
1918004 119 | 1918003 2246 |       0
 
(9 rows)

postgres=# select max(tx::text::int4) from my_xlogdump ;  max
---------1918004
(1 row)

So the last two transactions (1918003 and 1918004) get into some kind
of live-lock situation, it looks like. Or at least something that
causes them to produce significant more WAL records than other xacts
due to some kind of persistent problem with conflicts.

Here is where the earlier of the two problematic transactions has its
first record:

postgres=# select * from my_xlogdump where tx = '1918003' order by
r_lsn asc limit 1;rmgr | len_rec | len_tot |   tx    |   r_lsn    |  prev_lsn  |              descr
------+---------+---------+---------+------------+------------+----------------------------------------------------Heap
|      3 |     203 | 1918003 | 0/1783BB70 | 0/1783BB48 | INSERT
 
off 33 blkref #0: rel 1663/16471/12502 blk
(1 row)

After and including that record, until the trouble spot up to and
including shutdown, here is the rmgr breakdown:

postgres=# select count(*),  rmgr from my_xlogdump where r_lsn >=
'0/1783BB70' group by rmgr order by 1;count |    rmgr
-------+-------------    1 | XLOG         -- 1 CHECKPOINT_SHUTDOWN record    2 | Transaction -- commit records for the
twoxacts   20 | Heap2        -- all are CLEAN remxid records, tx is 0   76 | Heap          -- All from our two xacts...
 80 | Btree          -- All from XID 1918003 only
 
(5 rows)

So looks like a bad interaction with VACUUM. Maybe it's a problem with
VACUUM interlocking. That was my first suspicion, FWIW.

I'll need to do more investigating, but I can provide a custom format
dump of the table, in case anyone wants to look at what I have here in
detail. I've uploaded it to:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/files/my_xlogdump.custom.dump.table
-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Failure on markhor with CLOBBER_CACHE_ALWAYS for test brin
Next
From: Amit Kapila
Date:
Subject: Re: Additional role attributes && superuser review