Make HeapTupleSatisfiesMVCC more concurrent - Mailing list pgsql-hackers

From Jeff Janes
Subject Make HeapTupleSatisfiesMVCC more concurrent
Date
Msg-id CAMkU=1xVyQ0BC2ChEBAk+PGGJEwfrK0Qe9KWi6NJwBVOvW=C_g@mail.gmail.com
Responses Re: Make HeapTupleSatisfiesMVCC more concurrent  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
When we check a tuple's visibility under MVCC, it has to pass checks that the inserting transaction has committed, and that it committed before our snapshot was taken; and similarly that the deleting transaction has not committed, or did so after our snapshot was taken.

XidInMVCCSnapshot is (or can be) very much cheaper than TransactionIdIsInProgress, because the former touches only backend-local memory while the latter takes a highly contended lock and inspects shared memory.  We currently do the slow one first, but we could do the fast one first and sometimes short-circuit the slow one: if the transaction is in our snapshot, it doesn't matter whether it is still in progress or not.
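
To make the idea concrete, here is a rough sketch of the reordered xmin handling (simplified from the real code in tqual.c, and not the patch itself; the local variable name xmin is just for illustration):

    TransactionId xmin = HeapTupleHeaderGetRawXmin(tuple);

    /* Cheap check first: only reads the backend-local snapshot. */
    if (XidInMVCCSnapshot(xmin, snapshot))
        return false;        /* inserter is in our snapshot, so the tuple is
                              * not visible; no need to touch the proc array */

    /* Only now pay for the shared-memory checks. */
    if (TransactionIdIsInProgress(xmin))
        return false;        /* still running, not visible */
    else if (TransactionIdDidCommit(xmin))
        SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, xmin);
    else
    {
        /* inserter must have aborted or crashed */
        SetHintBits(tuple, buffer, HEAP_XMIN_INVALID, InvalidTransactionId);
        return false;
    }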

This was discussed back in 2013 (http://www.postgresql.org/message-id/CAMkU=1yy-YEQVvqj2xJitT1EFkyuFk7uTV_hrOMGyGMxpU=N+Q@mail.gmail.com), and I wanted to revive it. The recent lwlock atomic changes haven't made the problem irrelevant.

This patch swaps the order of the checks under some conditions.  So that hackers can readily do testing without juggling binaries, I've added an experimental GUC which controls the behavior: JJ_SNAP=0 gives the original (git HEAD) behavior, while JJ_SNAP=1 turns on the new behavior.

I've added some flag variables to record whether XidInMVCCSnapshot has already been called.  XidInMVCCSnapshot is cheap, but not so cheap that we want to call it twice if we can avoid it.  Those flags would probably stay in some form or another when the experimental GUC goes away.
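
As a rough illustration of the flag idea (again just a sketch with made-up variable names, not the actual patch), the early result can be remembered so the pre-existing snapshot test further down the function doesn't call XidInMVCCSnapshot a second time:

    bool        xmin_snap_checked = false;  /* tested xmin against the snapshot yet? */
    bool        xmin_in_snapshot = false;

    /* early, before the shared-memory checks (new ordering) */
    xmin_in_snapshot = XidInMVCCSnapshot(xmin, snapshot);
    xmin_snap_checked = true;
    if (xmin_in_snapshot)
        return false;

    /* later, at the spot where the test originally lived: reuse the answer */
    if (!xmin_snap_checked)
        xmin_in_snapshot = XidInMVCCSnapshot(xmin, snapshot);
    if (xmin_in_snapshot)
        return false;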

We might be able to rearrange the series of "if" tests to get rid of the flag variables, but I didn't want to touch the HEAP_MOVED_OFF and HEAP_MOVED_IN parts of the code, since those presumably get about zero coverage from regression testing.

The situation where the performance of this really shows up is when there are tuples that remain in an unresolved state while highly concurrent processes keep stumbling over them.

I set that up using the pgbench tables at scale factor 1, running a custom query at high concurrency which seq-scans the accounts table:

pgbench -f <(echo 'select sum(abalance) from pgbench_accounts') -T 30 \
    -n -c32 -j32 --startup='set JJ_SNAP=1'

While the test is contrived, it reproduces complaints I've seen on several forums.

To create the burden of unresolved tuples, I open psql and run:
begin; update pgbench_accounts set abalance = 1 - abalance;

...and leave it uncommitted for a while.

Representative numbers for test runs of the above custom query on an 8-CPU machine:

tps = 542   regardless of JJ_SNAP, when there are no in-progress tuples
tps =  30   JJ_SNAP=0, with the uncommitted bulk update pending
tps = 364   JJ_SNAP=1, with the uncommitted bulk update pending


A side effect of making this change is that a query which finds a tuple inserted or deleted by a transaction still in the query's snapshot never checks whether that transaction committed, and so it doesn't set the hint bit even if the transaction has by then committed or aborted.  Some future query with a newer snapshot will have to do that instead.  It is at least theoretically possible that many hint bits could thereby fail to get set while the buffer is still dirty in shared_buffers, meaning the buffer would have to be dirtied again once they are set.  I doubt this would be significant, but if anyone has a test case they think could turn up a problem in this area, please try it out or describe it.

There are other places in tqual.c which could probably use similar re-ordering tricks, but this is the one for which I have a reproducible test case.

Cheers

Jeff

Attachment
