Thread: Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

I have a database where a a couple rows don't appear in index scans
but do appear in sequential scans. It looks like the same problem as
Peter reported but this is a different database. I've extracted all
the xlogdump records and below are the ones I think are relevant. You
can see that lp 2 gets a few HOT updates and concurrently has someone
create a MultiXact NO KEY UPDATE lock while one of those HOT updates
is pending but not committed. The net result seems to be that the ctid
update chain got broken. The index of course points to the head of the
HOT chain so it doesn't find the live tail whereas the sequential scan
picks it up.

I don't see any evidence of MultiXactId wraparound, the members run to
001F and the offsets run to 000B. This is on a standby that's been
activated but afaik that shouldn't change these files any more right?

rmgr: Heap        len (rec/tot):    235/   267, tx:    5943845, lsn:
FD/2F0A3640, prev FD/2F0A3600, bkp: 0000, desc: insert: rel
1663/16385/212653; tid 13065/2
rmgr: Transaction len (rec/tot):     12/    44, tx:    5943845, lsn:
FD/2F0A8178, prev FD/2F0A8148, bkp: 0000, desc: commit: 2014-02-19
20:41:23.698513 UTC
rmgr: Heap        len (rec/tot):     25/    57, tx:    5943847, lsn:
FD/2F0AA440, prev FD/2F0AA3F8, bkp: 0000, desc: lock 5943847: rel
1663/16385/212653; tid 13065/2 LOCK_ONLY KEYSHR_LOCK
rmgr: Transaction len (rec/tot):     12/    44, tx:    5943847, lsn:
FD/2F0AA480, prev FD/2F0AA440, bkp: 0000, desc: commit: 2014-02-19
20:41:23.713969 UTC
rmgr: Heap        len (rec/tot):    291/   323, tx:    5943849, lsn:
FD/2F0ADFC0, prev FD/2F0ADF90, bkp: 0000, desc: hot_update: rel
1663/16385/212653; tid 13065/2 xmax 5943849 ; new tid 13065/3 xmax 0
rmgr: Heap2       len (rec/tot):     25/    57, tx:    5943851, lsn:
FD/2F0AE450, prev FD/2F0AE408, bkp: 0000, desc: lock updated: xmax
5943851 msk 000a; rel 1663/16385/212653; tid 13065/3
rmgr: MultiXact   len (rec/tot):     28/    60, tx:    5943851, lsn:
FD/2F0AE490, prev FD/2F0AE450, bkp: 0000, desc: create mxid 728896
offset 1632045 nmembers 2: 5943849 (nokeyupd) 5943851 (keysh)
rmgr: Heap        len (rec/tot):     25/    57, tx:    5943851, lsn:
FD/2F0AE4D0, prev FD/2F0AE490, bkp: 0000, desc: lock 728896: rel
1663/16385/212653; tid 13065/2 IS_MULTI EXCL_LOCK
rmgr: Transaction len (rec/tot):     12/    44, tx:    5943849, lsn:
FD/2F0AE510, prev FD/2F0AE4D0, bkp: 0000, desc: commit: 2014-02-19
20:41:23.744989 UTC
rmgr: Transaction len (rec/tot):     12/    44, tx:    5943851, lsn:
FD/2F0AE570, prev FD/2F0AE540, bkp: 0000, desc: commit: 2014-02-19
20:41:23.746820 UTC
rmgr: Heap        len (rec/tot):    306/   338, tx:    5943879, lsn:
FD/2F103788, prev FD/2F103758, bkp: 0000, desc: hot_update: rel
1663/16385/212653; tid 13065/3 xmax 5943879 ; new tid 13065/4 xmax 0
rmgr: Transaction len (rec/tot):     12/    44, tx:    5943879, lsn:
FD/2F1038E0, prev FD/2F103788, bkp: 0000, desc: commit: 2014-02-19
20:41:24.580827 UTC
rmgr: Heap        len (rec/tot):    306/   338, tx:    5943880, lsn:
FD/2F103910, prev FD/2F1038E0, bkp: 0000, desc: hot_update: rel
1663/16385/212653; tid 13065/4 xmax 5943880 ; new tid 13065/7 xmax 0
rmgr: Transaction len (rec/tot):     12/    44, tx:    5943880, lsn:
FD/2F106070, prev FD/2F106030, bkp: 0000, desc: commit: 2014-02-19
20:41:24.617048 UTC

lp | lp_off | lp_flags | lp_len | t_xmin  | t_xmax  | t_field3 |
t_ctid   | t_infomask2 | t_infomask | t_hoff |
----+--------+----------+--------+---------+---------+----------+------------+-------------+------------+--------+- 2 |
 3424 |        1 |    232 | 5943845 |  728896 |        0 |
 
(13065,2)  |          32 |       4419 |     32 | 3 |   3152 |        1 |    272 | 5943849 | 5943879 |        0 |
(13065,4)  |       49184 |       9475 |     32 | 4 |   2864 |        1 |    287 | 5943879 | 5943880 |        0 |
(13065,7)  |       49184 |       9475 |     32 | 7 |   2576 |        1 |    287 | 5943880 |       0 |        0 |
(13065,7)  |       32800 |      10499 |     32 |


-- 
greg



Hi,

On 2014-02-20 13:25:35 +0000, Greg Stark wrote:
> I have a database where a a couple rows don't appear in index scans
> but do appear in sequential scans. It looks like the same problem as
> Peter reported but this is a different database. I've extracted all
> the xlogdump records and below are the ones I think are relevant. You
> can see that lp 2 gets a few HOT updates and concurrently has someone
> create a MultiXact NO KEY UPDATE lock while one of those HOT updates
> is pending but not committed. The net result seems to be that the ctid
> update chain got broken. The index of course points to the head of the
> HOT chain so it doesn't find the live tail whereas the sequential scan
> picks it up.
> 
> I don't see any evidence of MultiXactId wraparound, the members run to
> 001F and the offsets run to 000B. This is on a standby that's been
> activated but afaik that shouldn't change these files any more right?

I think this might actually be
c6cd27e36b9c58ceda8582ba81e37b6f9ad87d59,
2dcc48c35af5305fba0d8cb5e31fa0c25f52d13f might also be involved.

Hard to say at my current tiredness/caffeination level.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



On 2014-02-20 13:25:35 +0000, Greg Stark wrote:
> rmgr: Heap        len (rec/tot):    235/   267, tx:    5943845, lsn:
> FD/2F0A3640, prev FD/2F0A3600, bkp: 0000, desc: insert: rel
> 1663/16385/212653; tid 13065/2

>  lp | lp_off | lp_flags | lp_len | t_xmin  | t_xmax  | t_field3 |
> t_ctid   | t_infomask2 | t_infomask | t_hoff |
> ----+--------+----------+--------+---------+---------+----------+------------+-------------+------------+--------+-
>   2 |   3424 |        1 |    232 | 5943845 |  728896 |        0 |
> (13065,2)  |          32 |       4419 |     32 |
>   3 |   3152 |        1 |    272 | 5943849 | 5943879 |        0 |
> (13065,4)  |       49184 |       9475 |     32 |
>   4 |   2864 |        1 |    287 | 5943879 | 5943880 |        0 |
> (13065,7)  |       49184 |       9475 |     32 |
>   7 |   2576 |        1 |    287 | 5943880 |       0 |        0 |
> (13065,7)  |       32800 |      10499 |     32 |

Any chance you could send those again, with less linebreak damage? They
are really hard to read this way...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

From
Alvaro Herrera
Date:
Here's a reformatted copy.  I think this is the same bug as Peter G.
reported in 
http://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com

I have a hunch that this is related to the heap_lock_updated business.
I haven't investigated yet.


Greg Stark wrote:
> I have a database where a a couple rows don't appear in index scans
> but do appear in sequential scans. It looks like the same problem as
> Peter reported but this is a different database. I've extracted all
> the xlogdump records and below are the ones I think are relevant. You
> can see that lp 2 gets a few HOT updates and concurrently has someone
> create a MultiXact NO KEY UPDATE lock while one of those HOT updates
> is pending but not committed. The net result seems to be that the ctid
> update chain got broken. The index of course points to the head of the
> HOT chain so it doesn't find the live tail whereas the sequential scan
> picks it up.
> 
> I don't see any evidence of MultiXactId wraparound, the members run to
> 001F and the offsets run to 000B. This is on a standby that's been
> activated but afaik that shouldn't change these files any more right?
> 
> rmgr: Heap        len (rec/tot):    235/   267, tx:    5943845, lsn: FD/2F0A3640, prev FD/2F0A3600, bkp: 0000, desc:
insert:rel 1663/16385/212653; tid 13065/2
 
> rmgr: Transaction len (rec/tot):     12/    44, tx:    5943845, lsn: FD/2F0A8178, prev FD/2F0A8148, bkp: 0000, desc:
commit:2014-02-19 20:41:23.698513 UTC
 
> rmgr: Heap        len (rec/tot):     25/    57, tx:    5943847, lsn: FD/2F0AA440, prev FD/2F0AA3F8, bkp: 0000, desc:
lock5943847: rel 1663/16385/212653; tid 13065/2 LOCK_ONLY KEYSHR_LOCK
 
> rmgr: Transaction len (rec/tot):     12/    44, tx:    5943847, lsn: FD/2F0AA480, prev FD/2F0AA440, bkp: 0000, desc:
commit:2014-02-19 20:41:23.713969 UTC
 
> rmgr: Heap        len (rec/tot):    291/   323, tx:    5943849, lsn: FD/2F0ADFC0, prev FD/2F0ADF90, bkp: 0000, desc:
hot_update:rel 1663/16385/212653; tid 13065/2 xmax 5943849 ; new tid 13065/3 xmax 0
 
> rmgr: Heap2       len (rec/tot):     25/    57, tx:    5943851, lsn: FD/2F0AE450, prev FD/2F0AE408, bkp: 0000, desc:
lockupdated: xmax 5943851 msk 000a; rel 1663/16385/212653; tid 13065/3
 
> rmgr: MultiXact   len (rec/tot):     28/    60, tx:    5943851, lsn: FD/2F0AE490, prev FD/2F0AE450, bkp: 0000, desc:
createmxid 728896 offset 1632045 nmembers 2: 5943849 (nokeyupd) 5943851 (keysh)
 
> rmgr: Heap        len (rec/tot):     25/    57, tx:    5943851, lsn: FD/2F0AE4D0, prev FD/2F0AE490, bkp: 0000, desc:
lock728896: rel 1663/16385/212653; tid 13065/2 IS_MULTI EXCL_LOCK
 
> rmgr: Transaction len (rec/tot):     12/    44, tx:    5943849, lsn: FD/2F0AE510, prev FD/2F0AE4D0, bkp: 0000, desc:
commit:2014-02-19 20:41:23.744989 UTC
 
> rmgr: Transaction len (rec/tot):     12/    44, tx:    5943851, lsn: FD/2F0AE570, prev FD/2F0AE540, bkp: 0000, desc:
commit:2014-02-19 20:41:23.746820 UTC
 
> rmgr: Heap        len (rec/tot):    306/   338, tx:    5943879, lsn: FD/2F103788, prev FD/2F103758, bkp: 0000, desc:
hot_update:rel 1663/16385/212653; tid 13065/3 xmax 5943879 ; new tid 13065/4 xmax 0
 
> rmgr: Transaction len (rec/tot):     12/    44, tx:    5943879, lsn: FD/2F1038E0, prev FD/2F103788, bkp: 0000, desc:
commit:2014-02-19 20:41:24.580827 UTC
 
> rmgr: Heap        len (rec/tot):    306/   338, tx:    5943880, lsn: FD/2F103910, prev FD/2F1038E0, bkp: 0000, desc:
hot_update:rel 1663/16385/212653; tid 13065/4 xmax 5943880 ; new tid 13065/7 xmax 0
 
> rmgr: Transaction len (rec/tot):     12/    44, tx:    5943880, lsn: FD/2F106070, prev FD/2F106030, bkp: 0000, desc:
commit:2014-02-19 20:41:24.617048 UTC
 
> 
> 
>  lp | lp_off | lp_flags | lp_len | t_xmin  | t_xmax  | t_field3 |   t_ctid   | t_infomask2 | t_infomask | t_hoff |
> ----+--------+----------+--------+---------+---------+----------+------------+-------------+------------+--------+-
>   2 |   3424 |        1 |    232 | 5943845 |  728896 |        0 | (13065,2)  |          32 |       4419 |     32 |
>   3 |   3152 |        1 |    272 | 5943849 | 5943879 |        0 | (13065,4)  |       49184 |       9475 |     32 |
>   4 |   2864 |        1 |    287 | 5943879 | 5943880 |        0 | (13065,7)  |       49184 |       9475 |     32 |
>   7 |   2576 |        1 |    287 | 5943880 |       0 |        0 | (13065,7)  |       32800 |      10499 |     32 |
> 
> 
> -- 
> greg
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Hi,

On 2014-02-24 17:55:14 -0300, Alvaro Herrera wrote:
> Greg Stark wrote:
> > I have a database where a a couple rows don't appear in index scans
> > but do appear in sequential scans. It looks like the same problem as
> > Peter reported but this is a different database. I've extracted all
> > the xlogdump records and below are the ones I think are relevant. You
> > can see that lp 2 gets a few HOT updates and concurrently has someone
> > create a MultiXact NO KEY UPDATE lock while one of those HOT updates
> > is pending but not committed.

Per se the sequence of records doesn't look bad (even though I am not
happy that we log intermediate and final rows first, and only then the
start of the chain).

> > The net result seems to be that the ctid
> > update chain got broken. The index of course points to the head of the
> > HOT chain so it doesn't find the live tail whereas the sequential scan
> > picks it up.

Yea, that's the problem.

> > rmgr: Heap        len (rec/tot):    291/   323, tx:    5943849, lsn: FD/2F0ADFC0, prev FD/2F0ADF90, bkp: 0000,
desc:hot_update: rel 1663/16385/212653; tid 13065/2 xmax 5943849 ; new tid 13065/3 xmax 0
 
> > rmgr: Heap2       len (rec/tot):     25/    57, tx:    5943851, lsn: FD/2F0AE450, prev FD/2F0AE408, bkp: 0000,
desc:lock updated: xmax 5943851 msk 000a; rel 1663/16385/212653; tid 13065/3
 
> > rmgr: MultiXact   len (rec/tot):     28/    60, tx:    5943851, lsn: FD/2F0AE490, prev FD/2F0AE450, bkp: 0000,
desc:create mxid 728896 offset 1632045 nmembers 2: 5943849 (nokeyupd) 5943851 (keysh)
 
> > rmgr: Heap        len (rec/tot):     25/    57, tx:    5943851, lsn: FD/2F0AE4D0, prev FD/2F0AE490, bkp: 0000,
desc:lock 728896: rel 1663/16385/212653; tid 13065/2 IS_MULTI EXCL_LOCK
 

> >  lp | lp_off | lp_flags | lp_len | t_xmin  | t_xmax  | t_field3 |   t_ctid   | t_infomask2 | t_infomask | t_hoff |
> >
----+--------+----------+--------+---------+---------+----------+------------+-------------+------------+--------+-
> >   2 |   3424 |        1 |    232 | 5943845 |  728896 |        0 | (13065,2)  |          32 |       4419 |     32 |
> >   3 |   3152 |        1 |    272 | 5943849 | 5943879 |        0 | (13065,4)  |       49184 |       9475 |     32 |
> >   4 |   2864 |        1 |    287 | 5943879 | 5943880 |        0 | (13065,7)  |       49184 |       9475 |     32 |
> >   7 |   2576 |        1 |    287 | 5943880 |       0 |        0 | (13065,7)  |       32800 |      10499 |     32 |

Those together explain the story. Note this bit:

static void
heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record)
{
...   HeapTupleHeaderClearHotUpdated(htup);   HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
HeapTupleHeaderSetCmax(htup,FirstCommandId, false);   /* Make sure there is no forward chain link in t_ctid */
htup->t_ctid= xlrec->target.tid;
 
...
}

So, the replay of FD/2F0AE4D0 breaks the ctid chain *and* unsets the
HOT_UPDATED flag.

Which means fkey locks have never properly worked across SR/crash
recovery.

Haven't thought about how to fix it yet, I hope won't have to (hint hint).

We somehow need to have a policy of testing changes to the WAL format
without full_page_writes. They hide bugs in replay far, far too often.


Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



On 2014-02-24 22:17:31 +0100, Andres Freund wrote:
> Those together explain the story. Note this bit:
> 
> static void
> heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record)
> {
> ...
>     HeapTupleHeaderClearHotUpdated(htup);
>     HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
>     HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
>     /* Make sure there is no forward chain link in t_ctid */
>     htup->t_ctid = xlrec->target.tid;
> ...
> }
> 
> So, the replay of FD/2F0AE4D0 breaks the ctid chain *and* unsets the
> HOT_UPDATED flag.

Some quick archeology shows that the HeapTupleHeaderClearHotUpdated()
was in the original HOT commit (282d2a03d) and clearing of t_ctid was in
the original commit implementing FOR SHARE (bedb78d38). Both look like
they are copied from other places, I don't immediately see any need for
them here...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

From
Peter Geoghegan
Date:
On Mon, Feb 24, 2014 at 1:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> We somehow need to have a policy of testing changes to the WAL format
> without full_page_writes. They hide bugs in replay far, far too often.

What's the easiest way to get atomic page writes at the FS level on
your laptop? ZFS or some data journaling FS, I suppose.


-- 
Peter Geoghegan



On 2014-02-24 15:05:37 -0800, Peter Geoghegan wrote:
> On Mon, Feb 24, 2014 at 1:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > We somehow need to have a policy of testing changes to the WAL format
> > without full_page_writes. They hide bugs in replay far, far too often.
> 
> What's the easiest way to get atomic page writes at the FS level on
> your laptop? ZFS or some data journaling FS, I suppose.

TBH I don't care about torn pages during normal testing. I don't want to
suggest disabling it for real workloads with real data, just that it's
important to do so during development/testing of WAL related code,
because otherwise it will hide/fixup many errors.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

From
Peter Geoghegan
Date:
On Mon, Feb 24, 2014 at 3:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> TBH I don't care about torn pages during normal testing. I don't want to
> suggest disabling it for real workloads with real data, just that it's
> important to do so during development/testing of WAL related code,
> because otherwise it will hide/fixup many errors.

Sure, but you might want to account for torn pages anyway.
Particularly if you're interested in some degree of automation, as we
all seem to be.


-- 
Peter Geoghegan



On 2014-02-24 15:20:13 -0800, Peter Geoghegan wrote:
> On Mon, Feb 24, 2014 at 3:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > TBH I don't care about torn pages during normal testing. I don't want to
> > suggest disabling it for real workloads with real data, just that it's
> > important to do so during development/testing of WAL related code,
> > because otherwise it will hide/fixup many errors.
> 
> Sure, but you might want to account for torn pages anyway.
> Particularly if you're interested in some degree of automation, as we
> all seem to be.

Hm, well. I have to admit, if a test machine crashes, I'd just rebuild
the cluster. Unless I am working on crash safety testing in which case
I'd probably not have full_page_writes disabled...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

From
Alvaro Herrera
Date:
Andres Freund wrote:

> static void
> heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record)
> {
> ...
>     HeapTupleHeaderClearHotUpdated(htup);
>     HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
>     HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
>     /* Make sure there is no forward chain link in t_ctid */
>     htup->t_ctid = xlrec->target.tid;
> ...
> }

I think the fix is to reset HOT_UPDATED and t_ctid only if the infomask
says the tuple is LOCKED_ONLY, per the attached patch.  This matches
what heap_lock_tuple is doing originally:

    if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
        HeapTupleHeaderClearHotUpdated(tuple->t_data);
    HeapTupleHeaderSetXmax(tuple->t_data, xid);

    /*
     * Make sure there is no forward chain link in t_ctid.    Note that in the
     * cases where the tuple has been updated, we must not overwrite t_ctid,
     * because it was set by the updater.  Moreover, if the tuple has been
     * updated, we need to follow the update chain to lock the new versions of
     * the tuple as well.
     */
    if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask))
        tuple->t_data->t_ctid = *tid;

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

From
Alvaro Herrera
Date:
I forgot to mention that the bug can be reproduced in a hot-standby
setup with the attached isolation spec.  Note that full_page_writes must
be turned off (otherwise, the updates use full-page writes and then the
bogus code is not run).  Once the spec is executed, in the replica run

SET enable_seqscan TO off;
SELECT * FROM f WHERE a=1;

and you get no rows back.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment
On 2014-02-26 19:03:42 -0300, Alvaro Herrera wrote:
> I forgot to mention that the bug can be reproduced in a hot-standby
> setup with the attached isolation spec.  Note that full_page_writes must
> be turned off (otherwise, the updates use full-page writes and then the
> bogus code is not run).  Once the spec is executed, in the replica run

Should be "fixable" by just issuing a independent write (another lock,
update) in a previous step...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



On 2014-02-26 18:18:05 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > static void
> > heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record)
> > {
> > ...
> >     HeapTupleHeaderClearHotUpdated(htup);
> >     HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
> >     HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
> >     /* Make sure there is no forward chain link in t_ctid */
> >     htup->t_ctid = xlrec->target.tid;
> > ...
> > }
> 
> I think the fix is to reset HOT_UPDATED and t_ctid only if the infomask
> says the tuple is LOCKED_ONLY, per the attached patch.

Looks good to me.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2014-02-26 18:18:05 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> > 
> > > static void
> > > heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record)
> > > {
> > > ...
> > >     HeapTupleHeaderClearHotUpdated(htup);
> > >     HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
> > >     HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
> > >     /* Make sure there is no forward chain link in t_ctid */
> > >     htup->t_ctid = xlrec->target.tid;
> > > ...
> > > }
> > 
> > I think the fix is to reset HOT_UPDATED and t_ctid only if the infomask
> > says the tuple is LOCKED_ONLY, per the attached patch.
> 
> Looks good to me.

Thanks, pushed.

Greg, Peter, if you could update your standbys to the current HEAD of
REL9_3_STABLE for the affected apps and verify the problem no longer
shows up in a reasonable timeframe, it would be great.  (I'm assuming
you saw this happen repeatedly.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On Thu, Feb 27, 2014 at 2:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Greg, Peter, if you could update your standbys to the current HEAD of
> REL9_3_STABLE for the affected apps and verify the problem no longer
> shows up in a reasonable timeframe, it would be great.  (I'm assuming
> you saw this happen repeatedly.)

Actually I can do better. I restored the same base backup and applied
the same period of logs and can confirm that the rows in question now
appear normally:

=# select * from (select
(heap_page_items(get_raw_page('users',13065))).*) as x where lp
in(2,3,4,7);
lp | lp_off | lp_flags | lp_len | t_xmin  | t_xmax  | t_field3 |
t_ctid   | t_infomask2 | t_infomask | t_hoff |
----+--------+----------+--------+---------+---------+----------+-----------+-------------+------------+--------+--- 2
|  3424 |        1 |    232 | 5943845 |  728896 |        0 |
 
(13065,3) |       16416 |       4419 |     32 | 3 |   3152 |        1 |    272 | 5943849 | 5943879 |        0 |
(13065,4) |       49184 |       9475 |     32 | 4 |   2864 |        1 |    287 | 5943879 | 5943880 |        0 |
(13065,7) |       49184 |       9475 |     32 | 7 |   2576 |        1 |    287 | 5943880 |       0 |        0 |
(13065,7) |       32800 |      10499 |     32 |



-- 
greg



Though I notice something I can't understand here.

After activating the new clone subsequent attempts to select rows from
the page bump the LSN, presumably due to touching hint bits (since the
prune xid hasn't changed). But the checksum hasn't changed even after
running CHECKPOINT.

How is it possible for the LSN to get updated without changing the checksum?

postgres=# select (page_header(get_raw_page('users',13065))).* ;    lsn     | checksum | flags | lower | upper |
special| pagesize |
 
version | prune_xid
-------------+----------+-------+-------+-------+---------+----------+---------+-----------FD/330EC998 |   -25547 |
1|   152 |  2576 |    8192 |     8192 |     4 |   5638282
 
(1 row)

postgres=# select (page_header(get_raw_page('users',13065))).* ;    lsn     | checksum | flags | lower | upper |
special| pagesize |
 
version | prune_xid
-------------+----------+-------+-------+-------+---------+----------+---------+-----------FD/33140160 |   -25547 |
1|   152 |  2576 |    8192 |     8192 |     4 |   5638282
 
(1 row)

postgres=# select (page_header(get_raw_page('users',13065))).* ;    lsn     | checksum | flags | lower | upper |
special| pagesize |
 
version | prune_xid
-------------+----------+-------+-------+-------+---------+----------+---------+-----------FD/350016E8 |   -25547 |
1|   152 |  2576 |    8192 |     8192 |     4 |   5638282
 
(1 row)



On 2014-02-27 23:41:08 +0000, Greg Stark wrote:
> Though I notice something I can't understand here.
> 
> After activating the new clone subsequent attempts to select rows from
> the page bump the LSN, presumably due to touching hint bits (since the
> prune xid hasn't changed). But the checksum hasn't changed even after
> running CHECKPOINT.

Are you running with full_page_writes=off?

Only delete and update do a PageSetPrunable(), so prune_xid not being
changed doesn't say much...

> How is it possible for the LSN to get updated without changing the checksum?

Generally the LSN is computed when writing, not when a buffer is
modified, so that's not particularly surprising. It'd be interesting to
see what the records are that end on those LSNs.

It'd probably nice to add the capability to dump records that end in a
particular location to pg_xlogdump...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



<p dir="ltr"><br /> On 28 Feb 2014 06:19, "Andres Freund" <<a
href="mailto:andres@2ndquadrant.com">andres@2ndquadrant.com</a>>wrote:<br /> ><br /> > On 2014-02-27 23:41:08
+0000,Greg Stark wrote:<br /> > > Though I notice something I can't understand here.<br /> > ><br /> >
>After activating the new clone subsequent attempts to select rows from<br /> > > the page bump the LSN,
presumablydue to touching hint bits (since the<br /> > > prune xid hasn't changed). But the checksum hasn't
changedeven after<br /> > > running CHECKPOINT.<br /> ><br /> > Are you running with
full_page_writes=off?<pdir="ltr">No<br /><p dir="ltr">> Only delete and update do a PageSetPrunable(), so prune_xid
notbeing<br /> > changed doesn't say much...<br /> ><br /> > > How is it possible for the LSN to get
updatedwithout changing the checksum?<br /> ><br /> > Generally the LSN is computed when writing, not when a
bufferis<br /> > modified, so that's not particularly surprising. It'd be interesting to<br /> > see what the
recordsare that end on those LSNs.<p dir="ltr">The checksum you mean? But that's why I ran checkpoint. <br /><p
dir="ltr">>It'd probably nice to add the capability to dump records that end in a<br /> > particular location to
pg_xlogdump...<pdir="ltr">I have this crazy idea about combining xlogdump and pg_receivexlog to archive all xlog to a
postgresdatabase for querying..... 
On 2014-02-28 10:44:14 +0000, Greg Stark wrote:
> On 28 Feb 2014 06:19, "Andres Freund" <andres@2ndquadrant.com> wrote:
> > Generally the LSN is computed when writing, not when a buffer is
> > modified, so that's not particularly surprising. It'd be interesting to
> > see what the records are that end on those LSNs.
> 
> The checksum you mean? But that's why I ran checkpoint.

The checksum is computed in a copy of the buffer, not in the
original. C.f. PageSetChecksumCopy().

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services