Re: Speedup twophase transactions - Mailing list pgsql-hackers

From Stas Kelvich
Subject Re: Speedup twophase transactions
Date
Msg-id E7497864-DE11-4099-83F5-89FB97AF5073@postgrespro.ru
In response to Re: Speedup twophase transactions  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Speedup twophase transactions  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: Speedup twophase transactions  (Jesper Pedersen <jesper.pedersen@redhat.com>)
Re: Speedup twophase transactions  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

Thanks for the reviews and the commit!

  As Simon and Andres already mentioned in this thread, replay of twophase transactions is significantly slower than the
same operations in normal mode. The major reason is that each state file is fsynced during replay, and while that is not
a problem for crash recovery, it is a problem for replication. Under a high 2pc update load the lag between master and
async replica increases constantly (see graph below).

  One way to improve things is to move the fsyncs to restartpoints, but as we saw previously that is a half-measure, and
frequent calls to fopen alone can become a bottleneck.

  The other option is to use the same scenario for replay that is already used in non-recovery mode: read state data into
memory during replay of the prepare record, and if a checkpoint/restartpoint occurs between prepare and commit, move that
data to files. On commit we can read either the xlog or the files. So here is a patch that implements this scenario for
replay.
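In outline, the replay-side flow this describes looks roughly like the following (pseudocode summarizing the scenario, not the patch itself):

```text
on replay of a PREPARE record:
    RecoverPreparedFromXLOG(record)       -- build in-memory GXACT/PGPROC state

on checkpoint / restartpoint:
    for each prepared xact older than the redo horizon still held in memory:
        write its state to pg_twophase/<xid>   -- the only fsync, done here
                                               -- instead of once per replayed record

on replay of a COMMIT PREPARED / ABORT PREPARED record:
    if the state is still in memory:
        read the 2PC data back from WAL at the prepare LSN
    else:
        read pg_twophase/<xid>
    XlogRedoFinishPrepared(xid)           -- clean up the in-memory state
```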

  The patch is quite straightforward. During replay of prepare records, RecoverPreparedFromXLOG() is called to create the
in-memory state in GXACT, PROC, PGPROC; on commit, XlogRedoFinishPrepared() is called to clean that state up. There are
also several functions (PrescanPreparedTransactions, StandbyTransactionIdIsPrepared) that assumed that during replay all
prepared xacts have files in pg_twophase, so I have extended them to check GXACT too.
  A side effect of that behaviour is that we can now see prepared xacts in the pg_prepared_xacts view on the slave.

While this patch touches quite a sensitive part of postgres replay, and some of the code paths involved are rarely used, I
wrote a shell script to set up master/slave replication and test different failure scenarios that can happen with the
instances. I'm attaching this file to show which test scenarios I have covered and, more importantly, which I have not.
In particular, I failed to reproduce a situation where StandbyTransactionIdIsPrepared() is called; maybe somebody can
suggest a way to force its usage. I'm also not too sure about the necessity of calling the cache invalidation callbacks
during XlogRedoFinishPrepared(); I've marked this place in the patch with a 2REVIEWER comment.
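For illustration, one of the failure scenarios such a script can exercise is crashing the primary between PREPARE TRANSACTION and COMMIT PREPARED, then checking that the standby has replayed the prepared xact and can finish it after promotion. Host names, data directories and the table are hypothetical here, and the psql/pg_ctl commands need a live primary/standby pair, so this sketch only prints the scenario:

```shell
#!/bin/sh
# Sketch of one crash/promotion scenario (hypothetical hosts and paths;
# a real test script would run these against a live primary/standby pair).
scenario() {
    echo "psql -h primary -c \"BEGIN; UPDATE t SET v = v + 1 WHERE id = 1; PREPARE TRANSACTION 'tx1';\""
    echo "pg_ctl -D \$PRIMARY_DIR stop -m immediate    # simulate a primary crash"
    echo "pg_ctl -D \$STANDBY_DIR promote              # standby finishes replay, promotes"
    echo "psql -h standby -c \"SELECT gid FROM pg_prepared_xacts;\"   # 'tx1' should be listed"
    echo "psql -h standby -c \"COMMIT PREPARED 'tx1';\""
}
scenario
```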

Tests show that this patch increases the speed of 2pc replay to the point where the replica can keep pace with the master.

Graph: replica lag under a pgbench run for 200 seconds with 2pc update transactions (80 connections, one update per
2pc tx, two servers with 12 cores each, 10GbE interconnect) on current master and with the suggested patch. Replica lag
was measured once per second with "select sent_location-replay_location as delay from pg_stat_replication;".
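Roughly, the measurement can be reproduced as below. Host names and the custom pgbench script are hypothetical, and the live commands need a running primary/standby pair, so they are shown commented out and only the sampling query itself is emitted:

```shell
#!/bin/sh
# Sample replica lag (in WAL bytes) once per second during a 2pc
# pgbench run.  "primary" and twophase.sql are placeholders.
LAG_SQL="select sent_location - replay_location as delay from pg_stat_replication;"

# 200-second 2pc workload: 80 connections, one UPDATE per prepared tx.
#   pgbench -c 80 -j 8 -T 200 -f twophase.sql -h primary postgres &
# Sample the lag once per second while the run is in progress:
#   while sleep 1; do psql -At -h primary -c "$LAG_SQL"; done > lag.log

printf '%s\n' "$LAG_SQL"
```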



---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

> On 12 Jan 2016, at 22:57, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 12 January 2016 at 18:14, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> Thank you for the additional review.
>
> On 2016-01-11 19:39:14 +0000, Simon Riggs wrote:
> > Currently, the patch reuses all of the code related to reading/write state
> > files, so it is the minimal patch that can implement the important things
> > for performance. The current patch succeeds in its goal to improve
> > performance, so I personally see no further need for code churn.
>
> Sorry, I don't buy that argument. This approach leaves us with a bunch
> of code related to statefiles that's barely ever going to be exercised,
> and leaves the performance bottleneck on WAL replay in place.
>
> I raised the issue of WAL replay performance before you were involved, as has been mentioned already. I don't see it as a blocker for this patch. I have already requested it from Stas and he has already agreed to write that.
>
> Anyway, we know the statefile code works, so I'd prefer to keep it, rather than write a whole load of new code that would almost certainly fail. Whatever the code looks like, the frequency of usage is the same. As I already said, you can submit a patch for the new way if you wish; the reality is that this code works and there's no additional performance gain from doing it a different way.
>
> > As you suggest, we could also completely redesign the state file mechanism
> > and/or put it in WAL at checkpoint. That's all very nice but is much more
> > code and doesn't do anything more for performance, since the current mainline
> > path writes ZERO files at checkpoint.
>
> Well, on the primary, yes.
>
> Your changes proposed earlier wouldn't change performance on the standby.
>
> > If you want that for some other reason or refactoring, I won't stop
> > you, but its a separate patch for a separate purpose.
>
> Maintainability/complexity very much has to be considered during review
> and can't just be argued away with "but this is what we implemented".
>
> ;-) ehem, please don't make the mistake of thinking that because your judgement differs to mine that you can claim that you are the only one that has thought about maintainability and complexity.
>
> I'm happy to do some refactoring if you and Michael think it necessary.
>
> > - *           In order to survive crashes and shutdowns, all prepared
> > - *           transactions must be stored in permanent storage. This includes
> > - *           locking information, pending notifications etc. All that state
> > - *           information is written to the per-transaction state file in
> > - *           the pg_twophase directory.
> > + *           Information to recover prepared transactions in case of crash is
> > + *           now stored in WAL for the common case. In some cases there will be
> > + *           an extended period between preparing a GXACT and commit/abort, in
>
> Absolutely minor: The previous lines were differently indented (two tabs
> before, one space + two tabs now), which will probably mean pgindent
> will yank all of it around, besides looking confusing with different tab
> settings.
>
>
> > *             * In case of crash replay will move data from xlog to files, if that
> > *               hasn't happened before. XXX TODO - move to shmem in replay also
>
> This is a bit confusing - causing my earlier confusion about using
> XlogReadTwoPhaseData in recovery - because what this actually means is
> that we get the data from normal WAL replay, not our new way of getting
> things from the WAL.
>
>
> > @@ -772,7 +769,7 @@ TwoPhaseGetGXact(TransactionId xid)
> >        * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
> >        * repeatedly for the same XID.  We can save work with a simple cache.
> >        */
> > -     if (xid == cached_xid)
> > +     if (xid == cached_xid && cached_gxact)
> >               return cached_gxact;
>
> What's that about? When can cached_xid be equal to xid and cached_gxact
> not be set? And why did that change with this patch?
>
>
> > /*
> >  * Finish preparing state file.
> >  *
> >  * Calculates CRC and writes state file to WAL and in pg_twophase directory.
> >  */
> > void
> > EndPrepare(GlobalTransaction gxact)
>
> In contrast to that comment we're not writing to pg_twophase anymore.
>
>
> >       /*
> >        * If the file size exceeds MaxAllocSize, we won't be able to read it in
> >        * ReadTwoPhaseFile. Check for that now, rather than fail at commit time.
> >        */
> >       if (hdr->total_len > MaxAllocSize)
> >               ereport(ERROR,
> >                               (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> >                                errmsg("two-phase state file maximum length exceeded")));
> >
>
> Outdated comment.
>
> Ack all above.
>
> > +/*
> > + * Reads 2PC data from xlog. During checkpoint this data will be moved to
> > + * twophase files and ReadTwoPhaseFile should be used instead.
> > + */
> > +static void
> > +XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
> > +{
> > +     XLogRecord *record;
> > +     XLogReaderState *xlogreader;
> > +     char       *errormsg;
> > +
> > +     xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page, NULL);
> > +     if (!xlogreader)
> > +             ereport(ERROR,
> > +                             (errcode(ERRCODE_OUT_OF_MEMORY),
> > +                              errmsg("out of memory"),
> > +                              errdetail("Failed while allocating an XLog reading processor.")));
>
> Creating and deleting an xlogreader for every 2pc transaction isn't
> particularly efficient.
>
> Is keeping an xlogreader around in a backend for potentially long periods a better solution? I'd be happy to hear that a statically allocated one would be better.
>
> Reading the 2pc state from WAL will often also
> mean hitting disk if there's significant WAL volume (we even hint that
> we want the cache to be throw away for low wal_level!).
>
> Nobody has yet proposed an alternative to this design (reading the WAL at commit prepared).
>
> It's better than the last one and I haven't thought of anything better.
>
> If we really go this way, we really need a) a comment here explaining
> why timelines are never an issue b) an assert, preventing to be called
> during recovery.
>
> Sure
>
> > +     record = XLogReadRecord(xlogreader, lsn, &errormsg);
> > +     if (record == NULL ||
> > +             XLogRecGetRmid(xlogreader) != RM_XACT_ID ||
> > +             (XLogRecGetInfo(xlogreader) & XLOG_XACT_OPMASK) != XLOG_XACT_PREPARE)
> > +             ereport(ERROR,
> > +                             (errcode_for_file_access(),
> > +                              errmsg("could not read two-phase state from xlog at %X/%X",
> > +                                                     (uint32) (lsn >> 32),
> > +                                                     (uint32) lsn)));
>
> I think the record == NULL case should be handled separately (printing
> ->errormsg), and XLogRecGetRmid(xlogreader) != RM_XACT_ID &
> (XLogRecGetInfo(xlogreader) & XLOG_XACT_OPMASK) != XLOG_XACT_PREPARE)
> should get a more descriptive error message.
>
> OK
>
> > /*
> >  * Scan a 2PC state file (already read into memory by ReadTwoPhaseFile)
> >  * and call the indicated callbacks for each 2PC record.
> >  */
> > static void
> > ProcessRecords(char *bufptr, TransactionId xid,
> >                          const TwoPhaseCallback callbacks[])
> >
>
> The data isn't necessarily coming from a statefile anymore.
>
>
> >  void
> >  CheckPointTwoPhase(XLogRecPtr redo_horizon)
> >  {
> > -     TransactionId *xids;
> > -     int                     nxids;
> > -     char            path[MAXPGPATH];
> >       int                     i;
> > +     int                     n = 0;
>
> s/n/serialized_xacts/?
>
>
> Maybe also add a quick exit for when this is called during recovery?
>
> OK
>
> > +     /*
> > +      * We are expecting there to be zero GXACTs that need to be
> > +      * copied to disk, so we perform all I/O while holding
> > +      * TwoPhaseStateLock for simplicity. This prevents any new xacts
> > +      * from preparing while this occurs, which shouldn't be a problem
> > +      * since the presence of long-lived prepared xacts indicates the
> > +      * transaction manager isn't active.
>
> It's not *that* unlikely. Depending on settings the time between the
> computation of the redo pointer and CheckPointTwoPhase() isn't
> necessarily that large.
>
> CheckPointTwoPhase() deliberately happens after CheckPointBuffers()
>
> Default settings would make that gap 2.5 minutes. Common tuning parameters would take that to >9 minutes.
>
> That is much, much longer than acceptable transaction response times. So in normal circumstances there will be zero transactions, and I concur with the decision not to bother with complex locking to avoid longer lock times, robustness being a consideration for seldom-executed code.
>
> I wonder if we can address the replay performance issue significantly
> enough by simply not fsyncing in RecreateTwoPhaseFile() during WAL
> replay. If we make CheckPointTwoPhase() do that for the relevant 2pc
> state files, we ought to be good, no?
>
> That was the design I was thinking for simplicity, but we could do better.
>
> Now that'd still not get close to
> the performance on the primary (we do many more file creations!), but
> it'd remove the most expensive part, the fsync.
>
> Which is why I asked Stas to consider it. The moment I realised the potential timeline issues was the point where I said "separate patch".
>
> This is a good performance patch with some subtle code that, after much thought, I agree with. I'd like to see more from Stas, and I trust that he will progress to the next performance patch after this.
>
> So, I will make some refactoring changes, fix your code suggestions above and commit.
>
> --
> Simon Riggs                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment
