Re: Speedup twophase transactions - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Speedup twophase transactions
Date
Msg-id CAB7nPqT99Ur6ORqLhXTeYf8XD49E8eO4HZu8ZQShNZxNecd0NA@mail.gmail.com
Whole thread Raw
In response to Re: Speedup twophase transactions  (Stas Kelvich <s.kelvich@postgrespro.ru>)
Responses Re: Speedup twophase transactions  (Stas Kelvich <s.kelvich@postgrespro.ru>)
List pgsql-hackers
On Wed, Mar 30, 2016 at 10:19 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Hm, it’s hard to create descriptive names because test changes master/slave
> roles for that nodes several times during test.

Really? the names used in the patch help less then.

> It’s possible to call them
> “node1” and “node2” but I’m not sure that improves something. But anyway I’m
> not insisting on that particular names and will agree with any reasonable
> suggestion.

I would suggest the following name modifications, node names have been
introduced to help tracking of each node's log:
- Candie => master
- Django => slave or just standby
There is no need for complication! And each node's log file is
prefixed by the test number. Note that in other tests there are no
promotions, or fallbacks done, but we stick with a node name that
represents the initial state of the cluster.

> +# Switch to synchronous replication
> +$node_master->append_conf('postgresql.conf', qq(
> +synchronous_standby_names = '*'
> +));
> +$node_master->restart;
> Reloading would be fine.
>
> Good catch, done.

+$node_master->psql('postgres', "select pg_reload_conf()");

It would be cleaner to introduce a new routine in PostgresNode that
calls pg_ctl reload (mentioned in the N-sync patch as well, that would
be useful for many purposes).

> All accesses to 2pc dummies in ProcArray are covered with TwoPhaseStateLock,
> so I thick that’s safe. Also I’ve deleted comment above that block, probably
> it’s more confusing than descriptive.

OK, you removed the use to allProcs. Though by reading again the code
just holding TwoPhaseStateLock that's actually fine.

The patch needs a small cleanup:
$ git diff master --check
src/test/recovery/t/006_twophase.pl:224: new blank line at EOF.

006_twophase.pl should be renamed to 007. It keeps its license to
kill, and gains in being famous.

- * Scan the pg_twophase directory and setup all the required information to
- * allow standby queries to treat prepared transactions as still active.
- * This is never called at the end of recovery - we use
- * RecoverPreparedTransactions() at that point.
+ * It's a caller responsibility to call MarkAsPrepared() on returned gxact. *
Wouldn't it be more simple to just call MarkAsPrepared at the end of
RecoverPreparedFromBuffer?

While testing the patch, I found a bug in the recovery conflict code
path. You can do the following to reproduce it:
1) Start a master with a standby
2) prepare a transaction on master
3) Stop immediate on standby to force replay
4) commit prepare transaction on master
5) When starting the standby, it remains stuck here:
* thread #1: tid = 0x229b4, 0x00007fff8e2d4f96
libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread',
stop reason = signal SIGSTOP * frame #0: 0x00007fff8e2d4f96 libsystem_kernel.dylib`poll + 10   frame #1:
0x0000000107e5e043
postgres`WaitEventSetWaitBlock(set=0x00007f90c20596a8, cur_timeout=-1,
occurred_events=0x00007fff581efd28, nevents=1) + 51 at latch.c:1102   frame #2: 0x0000000107e5da26
postgres`WaitEventSetWait(set=0x00007f90c20596a8, timeout=-1,
occurred_events=0x00007fff581efd28, nevents=1) + 390 at latch.c:935   frame #3: 0x0000000107e5d4c7
postgres`WaitLatchOrSocket(latch=0x0000000111432464, wakeEvents=1,
sock=-1, timeout=-1) + 343 at latch.c:347   frame #4: 0x0000000107e5d36a
postgres`WaitLatch(latch=0x0000000111432464, wakeEvents=1, timeout=0)
+ 42 at latch.c:302   frame #5: 0x0000000107e7b5a6 postgres`ProcWaitForSignal + 38 at proc.c:1731   frame #6:
0x0000000107e6a4eb
postgres`ResolveRecoveryConflictWithLock(locktag=LOCKTAG at
0x00007fff581efde8) + 187 at standby.c:391   frame #7: 0x0000000107e7a6a8
postgres`ProcSleep(locallock=0x00007f90c203dac8,
lockMethodTable=0x00000001082f6218) + 1128 at proc.c:1215   frame #8: 0x0000000107e72886
postgres`WaitOnLock(locallock=0x00007f90c203dac8,
owner=0x0000000000000000) + 358 at lock.c:1703   frame #9: 0x0000000107e70f93
postgres`LockAcquireExtended(locktag=0x00007fff581f0238, lockmode=8,
sessionLock='\x01', dontWait='\0', reportMemoryError='\0') + 2819 at
lock.c:998   frame #10: 0x0000000107e6a9a6
postgres`StandbyAcquireAccessExclusiveLock(xid=863, dbOid=16384,
relOid=16385) + 358 at standby.c:627   frame #11: 0x0000000107e6af0b
postgres`standby_redo(record=0x00007f90c2041e38) + 251 at
standby.c:809   frame #12: 0x0000000107b0e227 postgres`StartupXLOG + 9351 at xlog.c:6871
It seems that the replay on on-memory state of the PREPARE transaction
is conflicting directly with replay.
--
Michael



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Incorrect format in error message
Next
From: Amit Kapila
Date:
Subject: Re: OOM in libpq and infinite loop with getCopyStart()