Re: Duplicated LSN in ReorderBuffer - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Duplicated LSN in ReorderBuffer
Date
Msg-id CAD21AoCav3kOsGoEpRvm9KFZ6nJ4LPqyLhT13ab4jvFbFf8txg@mail.gmail.com
Whole thread Raw
In response to Re: Duplicated LSN in ReorderBuffer  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Duplicated LSN in ReorderBuffer  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Mon, Jul 8, 2019 at 9:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jun 26, 2019 at 2:46 AM Ildar Musin <ildar@adjust.com> wrote:
> > Attached is a simple patch that uses subxid instead of top-level xid
> > in ReorderBufferAddNewTupleCids() call. It seems to fix the bug, but
> > i'm not sure that this is a valid change. Can someone please verify it
> > or maybe suggest a better solution for the issue?
>

I've reproduced this issue with script Ildar provided. I don't find
out the root cause yet and I'm not sure the patch takes a correct way
to fix this.

In my environment, I got the following pg_waldump output and the
logical decoding failed at 0/019FA058 when processing NEW_CID.

  90489 rmgr: Transaction len (rec/tot):     38/    38, tx:
1999, lsn: 0/019F9E80, prev 0/019F9E38, desc: ASSIGNMENT xtop 1998:
subxacts: 1999
  90490 rmgr: Standby     len (rec/tot):    405/   405, tx:
0, lsn: 0/019F9EA8, prev 0/019F9E80, desc: RUNNING_XACTS nextXid 2000
latestCompletedXid 1949 oldestRunningXid 1836; 48 xacts: 1990 1954
1978 1850 1944 1972 1940 1924 1906 1970 1985 1998 1966 1987 1975 1858
1914 1982 1958 1840 1920 1926 1992 1962 1
  90490 910 1950 1874 1928 1974 1968 1946 1912 1918 1996 1922 1930
1964 1952 1994 1934 1980 1836 1984 1960 1956 1916 1908 1938
  90491 rmgr: Heap2       len (rec/tot):     60/    60, tx:
1999, lsn: 0/019FA058, prev 0/019F9EA8, desc: NEW_CID rel
1663/12678/2615; tid 11/59; cmin: 0, cmax: 4294967295, combo:
4294967295
  90492 rmgr: Heap        len (rec/tot):    127/   127, tx:
1999, lsn: 0/019FA098, prev 0/019FA058, desc: INSERT off 59 flags
0x00, blkref #0: rel 1663/12678/2615 blk 11

I thought that the logical decoding doesn't create ReorderBufferTXN of
xid=1999 when processing NEW_CID since it decodes ASSIGNMENT of
xid=1999 beforehand. But what actually happen is that it skips NEW_CID
since the state of snapshot builder is SNAPBUILD_BUILDING_SNAPSHOT yet
and then the state becomes SNAPBUILD_FULL_SNAPSHOT when processing
RUNNING_XACTS , and therefore it creates two ReorderBufferTXN entries
for xid = 1999 and xid = 1998 as top-level transactions when
processing NEW_CID (ReorderBufferXidSetCatalogChanges creates xid=1999
and ReorderBufferAddNewTupleCids creates xid = 1998). And therefore it
got the assertion failure when adding ReorderBufferTXN of xid = 1998.

I'll look into this more deeply tomorrow.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: James Coleman
Date:
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Next
From: Robert Haas
Date:
Subject: Re: POC: Cleaning up orphaned files using undo logs