Thread: Bug in StartupSUBTRANS

Bug in StartupSUBTRANS

From

Jeff Janes

Date:

09 February 2016, 17:42:25

While testing the crash resilience of the recent 2-part-commit
improvements, I've run into a problem where sometimes after a crash
the recovery process creates zeroed files in pg_subtrans until it
exhausts all disk space.

Looking at the code, it looks like it does not anticipate that the xid
might wrap around, meaning startPage/endPage might also wrap around.
But obviously should not do so at int_max but rather at some much
smaller other value.

Here is the state near the time of disaster:

(gdb) print startPage
$1 = 2813758
(gdb) print endPage
$2 = 179
(gdb) p oldestActiveXID
$3 = 4293679649
(gdb) p ShmemVariableCache->nextXid
$4 = 367568


Attached is my attempt at a fix.  I've tested it for the ability to
start up the crashed server again, but have not tested the full stack
from initdb to crash with this in place.

Assuming I'm right, I am curious how this problem has been around so
long without being discovered previously.  So maybe I'm not right. I
found this with some code to accelerate the consumption of xids, but I
don't see how that would lead to a false positive here.

I think I found it testing 2-part-commit because that inherently means
leaving an active XID hanging around for a few checkpoint cycles,
which is something I've never intentionally tested before.

Cheers,

Jeff

Attachment

StartupSUB.patch

Re: Bug in StartupSUBTRANS

From

Simon Riggs

Date:

09 February 2016, 18:35:16

On 9 February 2016 at 18:42, Jeff Janes <jeff.janes@gmail.com> wrote:

While testing the crash resilience of the recent 2-part-commit
improvements, I've run into a problem where sometimes after a crash
the recovery process creates zeroed files in pg_subtrans until it
exhausts all disk space.

Not sure which patch you're talking about there (2-part-commit).

Looking at the code, it looks like it does not anticipate that the xid
might wrap around, meaning startPage/endPage might also wrap around.
But obviously should not do so at int_max but rather at some much
smaller other value.

Hmm, looks like the != part attempted to wrap, but just didn't get it right.

Your patch looks right to me, so I will commit, barring objections... with backpatch. Likely to 9.0, AFAICS.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Bug in StartupSUBTRANS

From

Tom Lane

Date:

09 February 2016, 18:45:23

Simon Riggs <simon@2ndQuadrant.com> writes:
> Your patch looks right to me, so I will commit, barring objections... with
> backpatch. Likely to 9.0, AFAICS.

9.0 is out of support and should not be patched anymore.

I agree that the patch is basically correct, though I'd personally
write it without bothering with the extra variable:

+    /* must account for wraparound */
+    if (startPage > TransactionIdToPage(0xFFFFFFFF))
+        startPage = 0;

Also, the comment at line 45 is now wrong and needs an addition.
        regards, tom lane

Re: Bug in StartupSUBTRANS

From

Michael Paquier

Date:

10 February 2016, 05:06:57

On Wed, Feb 10, 2016 at 3:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> Your patch looks right to me, so I will commit, barring objections... with
>> backpatch. Likely to 9.0, AFAICS.
>
> 9.0 is out of support and should not be patched anymore.
>
> I agree that the patch is basically correct, though I'd personally
> write it without bothering with the extra variable:
>
> +       /* must account for wraparound */
> +       if (startPage > TransactionIdToPage(0xFFFFFFFF))
> +               startPage = 0;
>
> Also, the comment at line 45 is now wrong and needs an addition.

Instead of using a hardcoded value, wouldn't it be better to use
something based on MaxTransactionId?
-- 
Michael

Re: Bug in StartupSUBTRANS

From

Jeff Janes

Date:

14 February 2016, 00:03:48

On Tue, Feb 9, 2016 at 10:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 9 February 2016 at 18:42, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> While testing the crash resilience of the recent 2-part-commit
>> improvements, I've run into a problem where sometimes after a crash
>> the recovery process creates zeroed files in pg_subtrans until it
>> exhausts all disk space.
>
>
> Not sure which patch you're talking about there (2-part-commit).

The one here:

commit 978b2f65aa1262eb4ecbf8b3785cb1b9cf4db78e
Author: Simon Riggs <simon@2ndQuadrant.com>
Date:   Wed Jan 20 18:40:44 2016 -0800

    Speedup 2PC by skipping two phase state files in normal path

I thought there was a second commit in the series, but I guess that is
still just proposed.  I've haven't tested that one, just the committed
one.

I've repeated the crash/recovery/wrap-around testing with the proposed
fix in place, and this pre-existing problem seems to be fixed now, and
I have found no other problems.

I've attached a new version, incorporating comments from Tom and Michael.

Cheers,

Jeff

Attachment

StartupSUB_v2.patch

Re: Bug in StartupSUBTRANS

From

Michael Paquier

Date:

17 February 2016, 06:13:03

On Sun, Feb 14, 2016 at 9:03 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I've repeated the crash/recovery/wrap-around testing with the proposed
> fix in place, and this pre-existing problem seems to be fixed now, and
> I have found no other problems.
>
> I've attached a new version, incorporating comments from Tom and Michael.

Thanks, this patch looks good to me this way.

I have added an entry in the CF app to not lose track of this bug:
https://commitfest.postgresql.org/9/526/
That's the most I can do.
-- 
Michael

Re: Bug in StartupSUBTRANS

From

Simon Riggs

Date:

19 February 2016, 09:26:23

On 14 February 2016 at 00:03, Jeff Janes <jeff.janes@gmail.com> wrote:

I've attached a new version, incorporating comments from Tom and Michael.

Applied, thanks.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services