Re: Speedup twophase transactions - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Speedup twophase transactions
Date
Msg-id CANP8+j+YEDxObOREgDC4O7cJPOEa8iKV9wEhOHoy8d16+ZaNKg@mail.gmail.com
Whole thread Raw
In response to Re: Speedup twophase transactions  (Andres Freund <andres@anarazel.de>)
Responses Re: Speedup twophase transactions  (Stas Kelvich <s.kelvich@postgrespro.ru>)
List pgsql-hackers
On 12 January 2016 at 18:14, Andres Freund <andres@anarazel.de> wrote:
Hi,

Thank you for the additional review.
 
On 2016-01-11 19:39:14 +0000, Simon Riggs wrote:
> Currently, the patch reuses all of the code related to reading/write state
> files, so it is the minimal patch that can implement the important things
> for performance. The current patch succeeds in its goal to improve
> performance, so I personally see no further need for code churn.

Sorry, I don't buy that argument. This approach leaves us with a bunch
of code related to statefiles that's barely ever going to be exercised,
and leaves the performance bottleneck on WAL replay in place.

I raised the issue of WAL replay performance before you were involved, as has been mentioned already. I don't see it as a blocker for this patch. I have already requested it from Stas and he has already agreed to write that.

Anyway, we know the statefile code works, so I'd prefer to keep it, rather than write a whole load of new code that would almost certainly fail. Whatever the code looks like, the frequency of usage is the same. As I already said, you can submit a patch for the new way if you wish; the reality is that this code works and there's no additional performance gain from doing it a different way.
 
> As you suggest, we could also completely redesign the state file mechanism
> and/or put it in WAL at checkpoint. That's all very nice but is much more
> code and doesn't anything more for performance, since the current mainline
> path writes ZERO files at checkpoint.

Well, on the primary, yes.

Your changes proposed earlier wouldn't change performance on the standby.
 
> If you want that for some other reason or refactoring, I won't stop
> you, but its a separate patch for a separate purpose.

Maintainability/complexity very much has to be considered during review
and can't just be argued away with "but this is what we implemented".

;-) ehem, please don't make the mistake of thinking that because your judgement differs to mine that you can claim that you are the only one that has thought about maintainability and complexity.

I'm happy to do some refactoring if you and Michael think it necessary.
 
> - *           In order to survive crashes and shutdowns, all prepared
> - *           transactions must be stored in permanent storage. This includes
> - *           locking information, pending notifications etc. All that state
> - *           information is written to the per-transaction state file in
> - *           the pg_twophase directory.
> + *           Information to recover prepared transactions in case of crash is
> + *           now stored in WAL for the common case. In some cases there will be
> + *           an extended period between preparing a GXACT and commit/abort, in

Absolutely minor: The previous lines were differently indented (two tabs
before, one space + two tabs now), which will probably mean pgindent
will yank all of it around, besides looking confusing with different tab
settings.


> *             * In case of crash replay will move data from xlog to files, if that
> *               hasn't happened before. XXX TODO - move to shmem in replay also

This is a bit confusing - causing my earlier confusion about using
XlogReadTwoPhaseData in recovery - because what this actually means is
that we get the data from normal WAL replay, not our new way of getting
things from the WAL.


> @@ -772,7 +769,7 @@ TwoPhaseGetGXact(TransactionId xid)
>        * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
>        * repeatedly for the same XID.  We can save work with a simple cache.
>        */
> -     if (xid == cached_xid)
> +     if (xid == cached_xid && cached_gxact)
>               return cached_gxact;

What's that about? When can cached_xid be be equal xid and cached_gxact
not set? And why did that change with this patch?


> /*
>  * Finish preparing state file.
>  *
>  * Calculates CRC and writes state file to WAL and in pg_twophase directory.
>  */
> void
> EndPrepare(GlobalTransaction gxact)

In contrast to that comment we're not writing to pg_twophase anymore.


>       /*
>        * If the file size exceeds MaxAllocSize, we won't be able to read it in
>        * ReadTwoPhaseFile. Check for that now, rather than fail at commit time.
>        */
>       if (hdr->total_len > MaxAllocSize)
>               ereport(ERROR,
>                               (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
>                                errmsg("two-phase state file maximum length exceeded")));
>

Outdated comment.

Ack all above.
 
> +/*
> + * Reads 2PC data from xlog. During checkpoint this data will be moved to
> + * twophase files and ReadTwoPhaseFile should be used instead.
> + */
> +static void
> +XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
> +{
> +     XLogRecord *record;
> +     XLogReaderState *xlogreader;
> +     char       *errormsg;
> +
> +     xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page, NULL);
> +     if (!xlogreader)
> +             ereport(ERROR,
> +                             (errcode(ERRCODE_OUT_OF_MEMORY),
> +                              errmsg("out of memory"),
> +                              errdetail("Failed while allocating an XLog reading processor.")));

Creating and deleting an xlogreader for every 2pc transaction isn't
particularly efficient.

Is keeping an xlogreader around in a backend for potentially long periods a better solution? I'd be happy to hear that a statically allocated one would be better.
 
Reading the 2pc state from WAL will often also
mean hitting disk if there's significant WAL volume (we even hint that
we want the cache to be throw away for low wal_level!).

Nobody has yet proposed an alternative to this design (reading the WAL at commit prepared). 

It's better than the last one and I haven't thought of anything better.
 
If we really go this way, we really need a) a comment here explaining
why timelines are never an issue b) an assert, preventing to be called
during recovery.

Sure
 
> +     record = XLogReadRecord(xlogreader, lsn, &errormsg);
> +     if (record == NULL ||
> +             XLogRecGetRmid(xlogreader) != RM_XACT_ID ||
> +             (XLogRecGetInfo(xlogreader) & XLOG_XACT_OPMASK) != XLOG_XACT_PREPARE)
> +             ereport(ERROR,
> +                             (errcode_for_file_access(),
> +                              errmsg("could not read two-phase state from xlog at %X/%X",
> +                                                     (uint32) (lsn >> 32),
> +                                                     (uint32) lsn)));

I think the record == NULL case should be handled separately (printing
->errormsg), and XLogRecGetRmid(xlogreader) != RM_XACT_ID &
(XLogRecGetInfo(xlogreader) & XLOG_XACT_OPMASK) != XLOG_XACT_PREPARE)
should get a more descriptive error message.

OK
 
> /*
>  * Scan a 2PC state file (already read into memory by ReadTwoPhaseFile)
>  * and call the indicated callbacks for each 2PC record.
>  */
> static void
> ProcessRecords(char *bufptr, TransactionId xid,
>                          const TwoPhaseCallback callbacks[])
>

The data isn't neccesarily coming from a statefile anymore.


>  void
>  CheckPointTwoPhase(XLogRecPtr redo_horizon)
>  {
> -     TransactionId *xids;
> -     int                     nxids;
> -     char            path[MAXPGPATH];
>       int                     i;
> +     int                     n = 0;

s/n/serialized_xacts/?


Maybe also add a quick exit for when this is called during recovery?

OK
 
> +     /*
> +      * We are expecting there to be zero GXACTs that need to be
> +      * copied to disk, so we perform all I/O while holding
> +      * TwoPhaseStateLock for simplicity. This prevents any new xacts
> +      * from preparing while this occurs, which shouldn't be a problem
> +      * since the presence of long-lived prepared xacts indicates the
> +      * transaction manager isn't active.

It's not *that* unlikely. Depending on settings the time between the
computation of the redo pointer and CheckPointTwoPhase() isn't
necessarily that large.

CheckPointTwoPhase() deliberately happens after CheckPointBuffers()

Default settings would make that gap 2.5 minutes. Common tuning parameters would take that to >9 minutes.

That is much, much longer than acceptable transaction response times. So in normal circumstances there will be zero transactions and I concur with the decision not to bother with complex locking to avoid longer lock times, robustness being a consideration for seldom executed code.
 
I wonder if we can address the replay performance issue significantly
enough by simply not fsyncing in RecreateTwoPhaseFile() during WAL
replay. If we make CheckPointTwoPhase() do that for the relevant 2pc
state files, we ought to be good, no?

That was the design I was thinking for simplicity, but we could do better.
 
Now that'd still not get close to
the performance on the primary (we do many more file creations!), but
it'd remove the most expensive part, the fsync.

Which is why I asked Stas to consider it. As soon as I realised the potential timeline issues was the point where I say "separate patch".

This is a good performance patch with some subtle code that after much thought I agree with. I'd like to see more from Stas and I trust that he will progress to the next performance patch after this.

So, I will make some refactoring changes, fix your code suggestions above and commit.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

Previous
From: Stas Kelvich
Date:
Subject: Re: Speedup twophase transactions
Next
From: Simon Riggs
Date:
Subject: Re: Speedup twophase transactions