Thread: could not access status of transaction

could not access status of transaction

From: chenhj
In our PG 10.2 (CentOS 7.3) database, the following error is reported when querying a table. We have already restored the production data from a backup, but I want to confirm what the cause may be and how to avoid it in the future.

    lma=# select count(*) from bi_dm.tdm_ttk_site_on_way_rt;
    ERROR:  could not access status of transaction 3250922107
    DETAIL:  Could not open file "pg_xact/0C1C": No such file or directory.
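
For reference, CLOG keeps two status bits per transaction (8192 * 4 = 32768 transactions per 8 kB page, and 32 pages per segment file, i.e. 1048576 transactions per segment), so the segment file holding a given XID can be computed directly. A quick check confirms that the failing XID maps exactly to the missing file:

    -- CLOG segment = XID / (32768 xacts per page * 32 pages per segment)
    SELECT upper(lpad(to_hex(3250922107 / (8192 * 4 * 32)), 4, '0'));
    -- returns 0C1C, the file named in the DETAIL line above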

Here is some related information.

The CLOG files in the pg_xact directory are as follows:

    0C4A (last update: 2018/12/29)
    ...
    0D09 (last update: 2019/01/13)

The stack when the error occurs is as follows:

    (gdb) bt
    #0  SlruReportIOError (ctl=ctl@entry=0xca98e0 <ClogCtlData>, pageno=pageno@entry=99210, xid=xid@entry=3250922107) at slru.c:901
    #1  0x00000000004e53a8 in SimpleLruReadPage (ctl=ctl@entry=0xca98e0 <ClogCtlData>, pageno=pageno@entry=99210, write_ok=write_ok@entry=1 '\001', xid=xid@entry=3250922107) at slru.c:445
    #2  0x00000000004e5460 in SimpleLruReadPage_ReadOnly (ctl=ctl@entry=0xca98e0 <ClogCtlData>, pageno=pageno@entry=99210, xid=xid@entry=3250922107) at slru.c:492
    #3  0x00000000004dde4d in TransactionIdGetStatus (xid=xid@entry=3250922107, lsn=lsn@entry=0x7ffeb4472660) at clog.c:411
    #4  0x00000000004e6e28 in TransactionLogFetch (transactionId=3250922107) at transam.c:79
    #5  0x00000000004e6e7f in TransactionIdDidCommit (transactionId=<optimized out>) at transam.c:129
    #6  0x000000000084ce6a in HeapTupleSatisfiesMVCC (htup=<optimized out>, snapshot=0xf7f0c8, buffer=837) at tqual.c:1124
    #7  0x00000000004b106e in heapgetpage (scan=scan@entry=0x10bb1b0, page=page@entry=1078209) at heapam.c:439
    #8  0x00000000004b23ab in heapgettup_pagemode (key=0x0, nkeys=0, dir=1078209, scan=0x10bb1b0) at heapam.c:1034
    #9  heap_getnext (scan=scan@entry=0x10bb1b0, direction=direction@entry=ForwardScanDirection) at heapam.c:1801
    #10 0x000000000060fa93 in SeqNext (node=node@entry=0x108c360) at nodeSeqscan.c:81
    #11 0x00000000005f51ca in ExecScanFetch (recheckMtd=0x60fa40 <SeqRecheck>, accessMtd=0x60fa60 <SeqNext>, node=0x108c360) at execScan.c:97
    #12 ExecScan (node=0x108c360, accessMtd=0x60fa60 <SeqNext>, recheckMtd=0x60fa40 <SeqRecheck>) at execScan.c:147
    #13 0x00000000005fad89 in ExecProcNode (node=0x108c360) at ../../../src/include/executor/executor.h:250
    #14 fetch_input_tuple (aggstate=aggstate@entry=0x108bcb8) at nodeAgg.c:695
    #15 0x00000000005fcdbf in agg_retrieve_direct (aggstate=0x108bcb8) at nodeAgg.c:2448
    #16 ExecAgg (pstate=0x108bcb8) at nodeAgg.c:2158
    #17 0x00000000005ef612 in ExecProcNode (node=0x108bcb8) at ../../../src/include/executor/executor.h:250
    #18 ExecutePlan (execute_once=<optimized out>, dest=0x10c40e8, direction=<optimized out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT,
        use_parallel_mode=<optimized out>, planstate=0x108bcb8, estate=0x108baa8) at execMain.c:1722
    #19 standard_ExecutorRun (queryDesc=0x10c7168, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
    #20 0x00007f766427915d in pgss_ExecutorRun (queryDesc=0x10c7168, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:889
    #21 0x000000000071a28b in PortalRunSelect (portal=portal@entry=0xfac708, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x10c40e8)
        at pquery.c:932
    #22 0x000000000071b63f in PortalRun (portal=portal@entry=0xfac708, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001',
        run_once=run_once@entry=1 '\001', dest=dest@entry=0x10c40e8, altdest=altdest@entry=0x10c40e8, completionTag=completionTag@entry=0x7ffeb4472c80 "") at pquery.c:773
    #23 0x00000000007175b3 in exec_simple_query (query_string=0x102a4e8 "select count(*) from tdm_ttk_site_on_way_rt ;") at postgres.c:1099
    #24 0x00000000007188ac in PostgresMain (argc=<optimized out>, argv=argv@entry=0xfa2d68, dbname=0xfa2c50 "lma", username=<optimized out>) at postgres.c:4088
    #25 0x000000000047ad3c in BackendRun (port=0xfa49b0) at postmaster.c:4405
    #26 BackendStartup (port=0xfa49b0) at postmaster.c:4077
    #27 ServerLoop () at postmaster.c:1755
    #28 0x00000000006afacf in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf5fa20) at postmaster.c:1363
    #29 0x000000000047bb6f in main (argc=3, argv=0xf5fa20) at main.c:228
    (gdb) f 6
    #6  0x000000000084ce6a in HeapTupleSatisfiesMVCC (htup=<optimized out>, snapshot=0xf7f0c8, buffer=837) at tqual.c:1124
    1124                    if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
    (gdb) p *tuple
    $3 = {t_choice = {t_heap = {t_xmin = 3238223060, t_xmax = 3250922107, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = -1056744236, datum_typmod = -1044045189,
          datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 16, bi_lo = 29633}, ip_posid = 3}, t_infomask2 = 8211, t_infomask = 258, t_hoff = 24 '\030',
      t_bits = 0x7f666297141f ""}

That is, the error occurs when reading the record with ctid (1078209,3) (block = bi_hi * 65536 + bi_lo = 16 * 65536 + 29633 = 1078209). The tuples in page 1078209 are as follows:

    lma=# select * from heap_page_items(get_raw_page('bi_dm.tdm_ttk_site_on_way_rt', 1078209));
     lp | lp_off | lp_flags | lp_len |   t_xmin   |   t_xmax   | t_field3 |    t_ctid    | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |  t_data
    ----+--------+----------+--------+------------+------------+----------+--------------+-------------+------------+--------+--------+-------+----------
      1 |   8024 |        1 |    168 | 3238223060 | 3250922107 |        0 | (1078209,1)  |        8211 |       1282 |     24 |        |       | \x......
      2 |   7864 |        1 |    156 | 3238223060 | 3250922107 |        0 | (1078209,2)  |        8211 |       1282 |     24 |        |       | \x......
      3 |   7688 |        1 |    176 | 3238223060 | 3250922107 |        0 | (1078209,3)  |        8211 |        258 |     24 |        |       | \x......
      4 |   7536 |        1 |    152 | 3238223060 | 3250922107 |        0 | (1078209,4)  |        8211 |        258 |     24 |        |       | \x......
      5 |   7368 |        1 |    164 | 3238223060 | 3250922107 |        0 | (1078209,5)  |        8211 |        258 |     24 |        |       | \x......
      6 |   7192 |        1 |    176 | 3238223060 | 3250922107 |        0 | (1078209,6)  |        8211 |        258 |     24 |        |       | \x......
      7 |   7016 |        1 |    176 | 3238223060 | 3250922107 |        0 | (1078209,7)  |        8211 |        258 |     24 |        |       | \x......
    ...

The t_infomask of the records from the third one onward differs from the first two: the HEAP_XMAX_COMMITTED flag is set in the first two t_infomasks (1282) but not in the rest (258). This means that transaction 3250922107 has committed.
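
For cross-checking, the hint bits can be decoded directly from the mask values (in htup_details.h, HEAP_XMIN_COMMITTED is 0x0100 and HEAP_XMAX_COMMITTED is 0x0400), which shows why only the tuples from the third one onward force a CLOG lookup for xmax:

    -- 1282 = 0x0502: both hint bits set, no CLOG access needed;
    -- 258 = 0x0102: HEAP_XMAX_COMMITTED unset, so the scan must
    -- consult pg_xact for xmax 3250922107
    SELECT mask,
           mask & 256  <> 0 AS xmin_committed,  -- 0x0100
           mask & 1024 <> 0 AS xmax_committed   -- 0x0400
    FROM (VALUES (1282), (258)) AS t(mask);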


The relevant configuration of the database is as follows:

    lma=# show autovacuum_freeze_max_age ;
     autovacuum_freeze_max_age
    ---------------------------
     200000000
    (1 row)

    lma=# select relfrozenxid from pg_class where relname='tdm_ttk_site_on_way_rt';
     relfrozenxid
    --------------
       3424813835
    (1 row)

    lma=# select txid_current();
     txid_current
    --------------
      24957915753
    (1 row)

    lma=#
    lma=# create table tb1(id int);
    CREATE TABLE
    lma=# insert into tb1 values(1);
    INSERT 0 1
    lma=# select xmin from tb1;
        xmin
    ------------
     3483079275
    (1 row)
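
Note that the failing xmax 3250922107 is already older than the table's relfrozenxid (3424813835), so VACUUM claims to have frozen or removed everything that old, and CLOG truncation up to that point would be considered safe. For reference, the values that bound how far vac_truncate_clog() may truncate pg_xact can be inspected like this (a generic sketch, not output from the failed instance):

    -- the oldest datfrozenxid across all databases bounds pg_xact truncation
    SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;

    -- per table, the oldest relfrozenxid within the current database
    SELECT oid::regclass AS relname, relfrozenxid, age(relfrozenxid)
    FROM pg_class
    WHERE relkind IN ('r', 't', 'm')
    ORDER BY age(relfrozenxid) DESC
    LIMIT 10;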


A similar problem was reported on 9.0, but no cause was identified there:

https://www.postgresql.org/message-id/flat/1300970362.2349.27.camel%40stevie


Currently I suspect it may be the same problem as the bug below. Is that possible?

That bug causes some sessions to cache a wrong relfrozenxid for the table. A session that calls vac_truncate_clog() may then truncate CLOG segments that are still needed, because it reads the wrong (too new) relfrozenxid instead of the actual one.

https://www.postgresql.org/message-id/flat/20180809161148.GB22623%40momjian.us#a7cc4d41464064b7752a5574eb74a06d

Regards,
Chen Huajun

Re: could not access status of transaction

From: Tomas Vondra
On 1/20/19 5:07 PM, chenhj wrote:
> In our PG 10.2 (CentOS 7.3) database, the following error is reported when querying a table. We have already restored the production data from a backup, but I want to confirm what the cause may be and how to avoid it in the future.
> 
>     lma=# select count(*) from bi_dm.tdm_ttk_site_on_way_rt;
>     ERROR:  could not access status of transaction 3250922107
>     DETAIL:  Could not open file "pg_xact/0C1C": No such file or directory.
> 
> Here are some related information
> 
> The CLOG files in the pg_xact directory are as follows:
> 
>     0C4A (last update: 2018/12/29)
>     ...
>     0D09 (last update: 2019/01/13)
> 

Yes, that very much looks like data corruption, probably due to
truncating the clog too early or something like that.

> ...
> 
> A similar problem was reported on 9.0, but no cause was identified there:
> 
> https://www.postgresql.org/message-id/flat/1300970362.2349.27.camel%40stevie
> 

The symptoms are the same, but that's far from sufficient to conclude
it's the same root cause.

> Currently I suspect it may be the same problem as the bug below. Is that possible?
> 
> That bug causes some sessions to cache a wrong relfrozenxid for the table. A session that calls vac_truncate_clog() may then truncate CLOG segments that are still needed, because it reads the wrong (too new) relfrozenxid instead of the actual one.
> 
> https://www.postgresql.org/message-id/flat/20180809161148.GB22623%40momjian.us#a7cc4d41464064b7752a5574eb74a06d
> 

Maybe. But it'll be hard to confirm that's what happened. It also shows
why it's important to keep up with minor updates (you're running 10.2,
which is almost 1 year old).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Re: could not access status of transaction

From: chenhj
Hi all,

Another of our systems has encountered the same failure, but this time on PostgreSQL 10.7 (RHEL 6.3).

Details are as follows:

Phenomenon:

app_db=# select count(*) from loba_sp_cost_xcd_104561;
ERROR:  could not access status of transaction 35153545
DETAIL:  Could not open file "pg_xact/0021": No such file or directory.


Related Information:
The CLOG files in the pg_xact directory are 00CE~0179; pg_xact/0021 has already been removed.

gdb analysis shows that the tuple causing the error is in block 163363.

(gdb) bt
#0  SlruReportIOError (ctl=0xca6b60, pageno=1072, xid=35153545) at slru.c:894
#1  0x00000000004e2060 in SimpleLruReadPage (ctl=0xca6b60, pageno=1072, write_ok=1 '\001', xid=35153545) at slru.c:445
#2  0x00000000004da431 in TransactionIdGetStatus (xid=35153545, lsn=0x7fffd07b1e88) at clog.c:411
#3  0x00000000004e385a in TransactionLogFetch (transactionId=35153545) at transam.c:79
#4  0x00000000004e39a5 in TransactionIdDidCommit (transactionId=<value optimized out>) at transam.c:129
#5  0x000000000085e959 in HeapTupleSatisfiesMVCC (htup=<value optimized out>, snapshot=0x2c21740, buffer=320) at tqual.c:1057
#6  0x00000000004b49e7 in heapgetpage (scan=0x2c8a5d8, page=163363) at heapam.c:439
#7  0x00000000004b5091 in heapgettup_pagemode (scan=0x2c8a5d8, direction=<value optimized out>) at heapam.c:1034
#8  heap_getnext (scan=0x2c8a5d8, direction=<value optimized out>) at heapam.c:1801
#9  0x0000000000618b21 in SeqNext (node=0x2c63708) at nodeSeqscan.c:81
#10 0x00000000005fe132 in ExecScanFetch (node=0x2c63708, accessMtd=0x618ae0 <SeqNext>, recheckMtd=0x618840 <SeqRecheck>) at execScan.c:97
#11 ExecScan (node=0x2c63708, accessMtd=0x618ae0 <SeqNext>, recheckMtd=0x618840 <SeqRecheck>) at execScan.c:164
#12 0x00000000006061ee in ExecProcNode (aggstate=0x2c630c0) at ../../../src/include/executor/executor.h:250
#13 fetch_input_tuple (aggstate=0x2c630c0) at nodeAgg.c:695
#14 0x0000000000607f89 in agg_retrieve_direct (pstate=0x2c630c0) at nodeAgg.c:2448
#15 ExecAgg (pstate=0x2c630c0) at nodeAgg.c:2158
#16 0x000000000060b240 in ExecProcNode (pstate=0x2c62da0) at ../../../src/include/executor/executor.h:250
#17 gather_getnext (pstate=0x2c62da0) at nodeGather.c:286
#18 ExecGather (pstate=0x2c62da0) at nodeGather.c:215
#19 0x00000000006061ee in ExecProcNode (aggstate=0x2c626f8) at ../../../src/include/executor/executor.h:250
#20 fetch_input_tuple (aggstate=0x2c626f8) at nodeAgg.c:695
#21 0x00000000006080ec in agg_retrieve_direct (pstate=0x2c626f8) at nodeAgg.c:2347
#22 ExecAgg (pstate=0x2c626f8) at nodeAgg.c:2158
#23 0x00000000005fb6bf in ExecProcNode (queryDesc=0x2c81658, direction=<value optimized out>, count=0, execute_once=-8 '\370') at ../../../src/include/executor/executor.h:250
#24 ExecutePlan (queryDesc=0x2c81658, direction=<value optimized out>, count=0, execute_once=-8 '\370') at execMain.c:1723
#25 standard_ExecutorRun (queryDesc=0x2c81658, direction=<value optimized out>, count=0, execute_once=-8 '\370') at execMain.c:364
#26 0x00007fd6f161c6f8 in pgss_ExecutorRun (queryDesc=0x2c81658, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at pg_stat_statements.c:889
#27 0x00007fd6f141674d in explain_ExecutorRun (queryDesc=0x2c81658, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at auto_explain.c:267
#28 0x000000000072986b in PortalRunSelect (portal=0x2c85f98, forward=<value optimized out>, count=0, dest=<value optimized out>) at pquery.c:932
#29 0x000000000072aa01 in PortalRun (portal=0x2c85f98, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x2c5fb58, altdest=0x2c5fb58, completionTag=0x7fffd07b26d0 "")
at pquery.c:773
#30 0x0000000000727051 in exec_simple_query (query_string=0x2c26358 "select count(*) from loba_sp_cost_xcd_104561;") at postgres.c:1122
#31 0x0000000000728039 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x2bcc2f0 "app_db", username=<value optimized out>) at postgres.c:4117
#32 0x00000000006bb43a in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4405
#33 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4077
#34 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1755
#35 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1363
#36 0x000000000063b4d0 in main (argc=1, argv=0x2b8e9f0) at main.c:228
(gdb) f 5
#5  0x000000000085e959 in HeapTupleSatisfiesMVCC (htup=<value optimized out>, snapshot=0x2c21740, buffer=320) at tqual.c:1057
1057 tqual.c: No such file or directory.
in tqual.c
(gdb) p *tuple
$1 = {t_choice = {t_heap = {t_xmin = 35153545, t_xmax = 0, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 35153545, datum_typmod = 0, datum_typeid = 0}}, t_ctid = {ip_blkid = {
  bi_hi = 2, bi_lo = 32291}, ip_posid = 9}, t_infomask2 = 32788, t_infomask = 10243, t_hoff = 32 ' ', t_bits = 0x7fd6d97f0440 "\211f\030\002"}
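
For reference, the same block could also be inspected from SQL as in the first incident (a sketch, assuming the pageinspect extension is installed on this instance):

-- bi_hi = 2, bi_lo = 32291 => block 2 * 65536 + 32291 = 163363
SELECT lp, t_xmin, t_xmax, to_hex(t_infomask) AS infomask
FROM heap_page_items(get_raw_page('loba_sp_cost_xcd_104561', 163363));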
  
  
The problematic tuple data parsed using `pg_filedump -i` is as follows (pg_filedump reports block 32291 within segment file .1, i.e. global block 131072 + 32291 = 163363):


Block 32291 ********************************************************
<Header> -----
Block Offset: 0x0fc46000         Offsets: Lower     236 (0x00ec)
Block: Size 8192  Version    4            Upper    2112 (0x0840)
LSN:  logid    254 recoff 0xa3a598c8      Special  8192 (0x2000)
Items:   53                      Free Space: 1876
Checksum: 0x8355  Prune XID: 0x02186566  Flags: 0x0001 (HAS_FREE_LINES)
Length (including item array): 236
...
Item   8 -- Length:  151  Offset: 4696 (0x1258)  Flags: NORMAL
  XMIN: 35153480  XMAX: 35153545  CID|XVAC: 0
  Block Id: 163363  linp Index: 9   Attributes: 20   Size: 32
  infomask: 0x2503 (HASNULL|HASVARWIDTH|XMIN_COMMITTED|XMAX_COMMITTED|UPDATED|HOT_UPDATED|HEAP_ONLY)
  t_bits: [0]: 0xff [1]: 0xff [2]: 0x07

Item   9 -- Length:  151  Offset: 4544 (0x11c0)  Flags: NORMAL
  XMIN: 35153545  XMAX: 0  CID|XVAC: 0
  Block Id: 163363  linp Index: 9   Attributes: 20   Size: 32
  infomask: 0x2803 (HASNULL|HASVARWIDTH|XMAX_INVALID|UPDATED|HEAP_ONLY)
  t_bits: [0]: 0xff [1]: 0xff [2]: 0x07

The XMIN of tuple (163363, 9) above does not have XMIN_COMMITTED set, nor has it been frozen.

Checking this table's data files further, we found that in addition to tuples with an infomask of 0x2803, there are many tuples with an infomask of 0x0803; XMIN_COMMITTED is not set in either, and these also trigger this fault. One example follows, with a SQL sketch for locating such tuples after it.

Block 32438 ********************************************************
<Header> -----
Block Offset: 0x0fd6c000         Offsets: Lower     204 (0x00cc)
Block: Size 8192  Version    4            Upper     216 (0x00d8)
LSN:  logid    122 recoff 0x41303260      Special  8192 (0x2000)
Items:   45                      Free Space:   12
Checksum: 0xd960  Prune XID: 0x00000000  Flags: 0x0000 ()
Length (including item array): 204
...
Item  23 -- Length:  174  Offset: 4072 (0x0fe8)  Flags: NORMAL
  XMIN: 35292368  XMAX: 0  CID|XVAC: 0
  Block Id: 163510  linp Index: 23   Attributes: 20   Size: 32
  infomask: 0x0803 (HASNULL|HASVARWIDTH|XMAX_INVALID)
  t_bits: [0]: 0xff [1]: 0xff [2]: 0x00
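
Such tuples can also be located from SQL (a sketch; 0x0100 = HEAP_XMIN_COMMITTED and 0x0200 = HEAP_XMIN_INVALID, and pg_filedump block 32438 in segment file .1 is global block 131072 + 32438 = 163510):

-- tuples on the block whose xmin hint bits are both unset,
-- forcing a CLOG lookup for t_xmin
SELECT lp, t_xmin, to_hex(t_infomask) AS infomask
FROM heap_page_items(get_raw_page('loba_sp_cost_xcd_104561', 163510))
WHERE t_infomask & x'0300'::int = 0;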

Regards,
Chen Huajun

Re: could not access status of transaction

From: chenhj
Hi,

This database has not had the same failure again since 2019/09/16 (reported on 2019/09/29), so this is a very low-probability failure, but it is uncertain whether it will happen again in the future. Here is some additional information on the 2019/09/16 incident that may be useful for analyzing the cause of this problem.


> Block 32291 ********************************************************
> <Header> -----
>  Block Offset: 0x0fc46000         Offsets: Lower     236 (0x00ec)
>  Block: Size 8192  Version    4            Upper    2112 (0x0840)
>  LSN:  logid    254 recoff 0xa3a598c8      Special  8192 (0x2000)
>  Items:   53                      Free Space: 1876
>  Checksum: 0x8355  Prune XID: 0x02186566  Flags: 0x0001 (HAS_FREE_LINES)
>  Length (including item array): 236
> ...
>  Item   8 -- Length:  151  Offset: 4696 (0x1258)  Flags: NORMAL
>   XMIN: 35153480  XMAX: 35153545  CID|XVAC: 0
>   Block Id: 163363  linp Index: 9   Attributes: 20   Size: 32
>   infomask: 0x2503 (HASNULL|HASVARWIDTH|XMIN_COMMITTED|XMAX_COMMITTED|UPDATED|HOT_UPDATED|HEAP_ONLY)
>   t_bits: [0]: 0xff [1]: 0xff [2]: 0x07
>  Item   9 -- Length:  151  Offset: 4544 (0x11c0)  Flags: NORMAL
>   XMIN: 35153545  XMAX: 0  CID|XVAC: 0
>   Block Id: 163363  linp Index: 9   Attributes: 20   Size: 32
>   infomask: 0x2803 (HASNULL|HASVARWIDTH|XMAX_INVALID|UPDATED|HEAP_ONLY)
>   t_bits: [0]: 0xff [1]: 0xff [2]: 0x07

According to the above information, the flags of the heap page (163363) containing the problem tuple (163363, 9) are 0x0001 (HAS_FREE_LINES); that is, the page-level ALL_VISIBLE flag is not set.

However, according to the hexdump of the corresponding VM file, that block has the VISIBILITYMAP_ALL_FROZEN and VISIBILITYMAP_ALL_VISIBLE flags set (each heap block has 2 VM bits, so block 163363 maps to byte 163363 * 2 / 8 = 40840 = 0x9F88 of the map, bit offset 6, which falls in an all-0xff region below). That is, the heap file and the VM file are inconsistent.

# vi 19548_vm.hexdump

0000000 0215 0000 0858 857c 8cee 0000 0018 2000
0000010 2000 2004 0000 0000 ffff ffff ffff ffff
0000020 ffff ffff ffff ffff ffff ffff ffff ffff
*
0002000 0215 0000 1680 857c 3bb1 0000 0018 2000
0002010 2000 2004 0000 0000 ffff ffff ffff ffff
0002020 ffff ffff ffff ffff ffff ffff ffff ffff
*
0004000 0215 0000 20e8 857c 07f7 0000 0018 2000
0004010 2000 2004 0000 0000 ffff ffff ffff ffff
0004020 ffff ffff ffff ffff ffff ffff ffff ffff
*
0006000 0215 0000 3618 857c 4792 0000 0018 2000
0006010 2000 2004 0000 0000 ffff ffff ffff ffff
0006020 ffff ffff ffff ffff ffff ffff ffff ffff
*
0008000 0215 0000 17a8 8580 3d7e 0000 0018 2000
0008010 2000 2004 0000 0000 ffff ffff ffff ffff
0008020 ffff ffff ffff ffff ffff ffff ffff ffff
*
000a000 0215 0000 3558 8585 1239 0000 0018 2000
000a010 2000 2004 0000 0000 ffff ffff ffff ffff
000a020 ffff ffff ffff ffff ffff ffff ffff ffff
*
000c000 0215 0000 72a8 871b 1a23 0000 0018 2000
000c010 2000 2004 0000 0000 ffff ffff ffff ffff
000c020 ffff ffff ffff ffff ffff ffff ffff ffff
...
000e000 0215 0000 93d0 8794 506a 0000 0018 2000
000e010 2000 2004 0000 0000 fc03 ffff ffff ffff
000e020 ffff f3ff cccf fffc ffff ffff f3cf ff3f
...
000f6c0 3f0f 3303 c33f 00f0 00c3 0303 0003 0000
000f6d0 0000 0000 0000 0000 0000 0000 0000 0000
*
0010000
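
The heap/VM disagreement can also be cross-checked with the pg_visibility extension (available since 9.6; a sketch, assuming it can be installed and that the relation is the same table as above):

-- VM bits recorded for the suspect heap block
SELECT all_visible, all_frozen
FROM pg_visibility_map('loba_sp_cost_xcd_104561', 163363);

-- ctids of tuples that contradict VM pages claiming all-frozen / all-visible
SELECT * FROM pg_check_frozen('loba_sp_cost_xcd_104561');
SELECT * FROM pg_check_visible('loba_sp_cost_xcd_104561');
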
Regards,
Chen Huajun

Re: could not access status of transaction

From: Robert Haas
On Sun, Jan 5, 2020 at 11:00 PM chenhj <chjischj@163.com> wrote:
> According to the above information, the flags of the heap page (163363) containing the problem tuple (163363, 9) are 0x0001 (HAS_FREE_LINES); that is, the page-level ALL_VISIBLE flag is not set.
>
> However, according to the hexdump of the corresponding VM file, that block has the VISIBILITYMAP_ALL_FROZEN and VISIBILITYMAP_ALL_VISIBLE flags set. That is, the heap file and the VM file are inconsistent.

That's not supposed to happen, and represents data corruption. Your
previous report of a too-old xmin surviving in the heap is also
corruption.  There is no guarantee that both problems have the same
cause, but suppose they do. One possibility is that a write to the
heap page may have gotten lost or undone. Suppose that, while this
page was in shared_buffers, VACUUM came through and froze it, setting
the bits in the VM and later truncating CLOG. Then, suppose that when
that page was evicted from shared_buffers, it didn't really get
written back to disk, or alternatively it did, but then later somehow
the old version reappeared. I think that would produce these symptoms.

I think that bad hardware could cause this, or running two copies of
the server on the same data files at the same time, or maybe some kind
of filesystem-related flakiness, especially if, for example, you are
using a network filesystem like NFS, or maybe a broken iSCSI stack.
There is also no reason it couldn't be a bug in PostgreSQL itself,
although if we lost page writes routinely somebody would surely have
noticed by now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company