could not access status of transaction - Mailing list pgsql-hackers

From chenhj
Subject could not access status of transaction
Date
Msg-id 18d5bb99.8e49.1686c03e4e3.Coremail.chjischj@163.com
Whole thread Raw
Responses Re: could not access status of transaction  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re:could not access status of transaction  (chenhj <chjischj@163.com>)
List pgsql-hackers
In our PG 10.2(CentOS 7.3) database, the following error is reported when querying a table. We have already restored
theproduction data through backup, but i want to confirm what may be the reason and how to avoid it in the future. 

    lma=# select count(*) from bi_dm.tdm_ttk_site_on_way_rt;
    ERROR:  could not access status of transaction 3250922107
    DETAIL:  Could not open file "pg_xact/0C1C": No such file or directory.

Here are some related information

The CLOG files in pg_xact diractory is as follow:

    0C4A(Last update date: 2018/12/29)
        ...
        0D09(Last update date: 2019/01/13)

The stack when the error occurs is as follows

    (gdb) bt
    #0  SlruReportIOError (ctl=ctl@entry=0xca98e0 <ClogCtlData>, pageno=pageno@entry=99210, xid=xid@entry=3250922107)
atslru.c:901 
    #1  0x00000000004e53a8 in SimpleLruReadPage (ctl=ctl@entry=0xca98e0 <ClogCtlData>, pageno=pageno@entry=99210,
write_ok=write_ok@entry=1'\001', xid=xid@entry=3250922107) 
        at slru.c:445
    #2  0x00000000004e5460 in SimpleLruReadPage_ReadOnly (ctl=ctl@entry=0xca98e0 <ClogCtlData>,
pageno=pageno@entry=99210,xid=xid@entry=3250922107) at slru.c:492 
    #3  0x00000000004dde4d in TransactionIdGetStatus (xid=xid@entry=3250922107, lsn=lsn@entry=0x7ffeb4472660) at
clog.c:411
    #4  0x00000000004e6e28 in TransactionLogFetch (transactionId=3250922107) at transam.c:79
    #5  0x00000000004e6e7f in TransactionIdDidCommit (transactionId=<optimized out>) at transam.c:129
    #6  0x000000000084ce6a in HeapTupleSatisfiesMVCC (htup=<optimized out>, snapshot=0xf7f0c8, buffer=837) at
tqual.c:1124
    #7  0x00000000004b106e in heapgetpage (scan=scan@entry=0x10bb1b0, page=page@entry=1078209) at heapam.c:439
    #8  0x00000000004b23ab in heapgettup_pagemode (key=0x0, nkeys=0, dir=1078209, scan=0x10bb1b0) at heapam.c:1034
    #9  heap_getnext (scan=scan@entry=0x10bb1b0, direction=direction@entry=ForwardScanDirection) at heapam.c:1801
    #10 0x000000000060fa93 in SeqNext (node=node@entry=0x108c360) at nodeSeqscan.c:81
    #11 0x00000000005f51ca in ExecScanFetch (recheckMtd=0x60fa40 <SeqRecheck>, accessMtd=0x60fa60 <SeqNext>,
node=0x108c360)at execScan.c:97 
    #12 ExecScan (node=0x108c360, accessMtd=0x60fa60 <SeqNext>, recheckMtd=0x60fa40 <SeqRecheck>) at execScan.c:147
    #13 0x00000000005fad89 in ExecProcNode (node=0x108c360) at ../../../src/include/executor/executor.h:250
    #14 fetch_input_tuple (aggstate=aggstate@entry=0x108bcb8) at nodeAgg.c:695
    #15 0x00000000005fcdbf in agg_retrieve_direct (aggstate=0x108bcb8) at nodeAgg.c:2448
    #16 ExecAgg (pstate=0x108bcb8) at nodeAgg.c:2158
    #17 0x00000000005ef612 in ExecProcNode (node=0x108bcb8) at ../../../src/include/executor/executor.h:250
    #18 ExecutePlan (execute_once=<optimized out>, dest=0x10c40e8, direction=<optimized out>, numberTuples=0,
sendTuples=1'\001', operation=CMD_SELECT,  
        use_parallel_mode=<optimized out>, planstate=0x108bcb8, estate=0x108baa8) at execMain.c:1722
    #19 standard_ExecutorRun (queryDesc=0x10c7168, direction=<optimized out>, count=0, execute_once=<optimized out>) at
execMain.c:363
    #20 0x00007f766427915d in pgss_ExecutorRun (queryDesc=0x10c7168, direction=ForwardScanDirection, count=0,
execute_once=<optimizedout>) at pg_stat_statements.c:889 
    #21 0x000000000071a28b in PortalRunSelect (portal=portal@entry=0xfac708, forward=forward@entry=1 '\001', count=0,
count@entry=9223372036854775807,dest=dest@entry=0x10c40e8) 
        at pquery.c:932
    #22 0x000000000071b63f in PortalRun (portal=portal@entry=0xfac708, count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=1'\001',  
        run_once=run_once@entry=1 '\001', dest=dest@entry=0x10c40e8, altdest=altdest@entry=0x10c40e8,
completionTag=completionTag@entry=0x7ffeb4472c80"") at pquery.c:773 
    #23 0x00000000007175b3 in exec_simple_query (query_string=0x102a4e8 "select count(*) from tdm_ttk_site_on_way_rt
;")at postgres.c:1099 
    #24 0x00000000007188ac in PostgresMain (argc=<optimized out>, argv=argv@entry=0xfa2d68, dbname=0xfa2c50 "lma",
username=<optimizedout>) at postgres.c:4088 
    #25 0x000000000047ad3c in BackendRun (port=0xfa49b0) at postmaster.c:4405
    #26 BackendStartup (port=0xfa49b0) at postmaster.c:4077
    #27 ServerLoop () at postmaster.c:1755
    #28 0x00000000006afacf in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf5fa20) at postmaster.c:1363
    #29 0x000000000047bb6f in main (argc=3, argv=0xf5fa20) at main.c:228
    (gdb) f 6
    #6  0x000000000084ce6a in HeapTupleSatisfiesMVCC (htup=<optimized out>, snapshot=0xf7f0c8, buffer=837) at
tqual.c:1124
    1124                    if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
    (gdb) p *tuple
    $3 = {t_choice = {t_heap = {t_xmin = 3238223060, t_xmax = 3250922107, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum
={datum_len_ = -1056744236, datum_typmod = -1044045189,  
          datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 16, bi_lo = 29633}, ip_posid = 3}, t_infomask2 = 8211,
t_infomask= 258, t_hoff = 24 '\030',  
      t_bits = 0x7f666297141f ""}

That is, when reading a record with ctid of (1078209, 3), the tuple in page 1078209 is as follows:

    lma=# select * from  heap_page_items(get_raw_page('bi_dm.tdm_ttk_site_on_way_rt',1078209));
     lp | lp_off | lp_flags | lp_len |   t_xmin   |   t_xmax   | t_field3 |    t_ctid    | t_infomask2 | t_infomask |
t_hoff| t_bits | t_oid | t_data 

----+--------+----------+--------+------------+------------+----------+--------------+-------------+------------+--------+--------+-------+-----------
      1 |   8024 |        1 |    168 | 3238223060 | 3250922107 |        0 | (1078209,1)  |        8211 |       1282 |
 24 |        |       | \x...... 
      2 |   7864 |        1 |    156 | 3238223060 | 3250922107 |        0 | (1078209,2)  |        8211 |       1282 |
 24 |        |       | \x...... 
      3 |   7688 |        1 |    176 | 3238223060 | 3250922107 |        0 | (1078209,3)  |        8211 |        258 |
 24 |        |       | \x...... 
      4 |   7536 |        1 |    152 | 3238223060 | 3250922107 |        0 | (1078209,4)  |        8211 |        258 |
 24 |        |       | \x...... 
      5 |   7368 |        1 |    164 | 3238223060 | 3250922107 |        0 | (1078209,5)  |        8211 |        258 |
 24 |        |       | \x...... 
      6 |   7192 |        1 |    176 | 3238223060 | 3250922107 |        0 | (1078209,6)  |        8211 |        258 |
 24 |        |       | \x...... 
      7 |   7016 |        1 |    176 | 3238223060 | 3250922107 |        0 | (1078209,7)  |        8211 |        258 |
 24 |        |       | \x...... 
    ...

t_infomask of the record after the third one above is different from the first two. The HEAP_XMAX_COMMITTED flag is set
inthe first two t_infomasks (1282).This means that transaction 3250922107 has been submitted. 


The relevant configuration of the database is as follows:

    lma=# show autovacuum_freeze_max_age ;
     autovacuum_freeze_max_age
    ---------------------------
     200000000
    (1 row)

    lma=# select relfrozenxid from pg_class where relname='tdm_ttk_site_on_way_rt';
    relfrozenxid
    --------------
       3424813835
    (1 row)

    lma=# select txid_current();
    txid_current
    --------------
      24957915753
    (1 row)

    lma=#
    lma=# create table tb1(id int);
    CREATE TABLE
    lma=# insert into tb1 values(1);
    INSERT 0 1
    lma=# select xmin from tb1;
        xmin
    ------------
    3483079275
    (1 row)


A similar problem has been reported in 9.0, but there is no reason to mention it.

https://www.postgresql.org/message-id/flat/1300970362.2349.27.camel%40stevie


Currently I suspect that it may be the same problem as the bug below. is it possible?

The bug will cause some sessions to cache the wrong relfrozenxid of the table. The session that may call
vac_truncate_clog()will clean up the clog after the actual relfrozenxid due to reading the wrong relfrozenxid. 

https://www.postgresql.org/message-id/flat/20180809161148.GB22623%40momjian.us#a7cc4d41464064b7752a5574eb74a06d

Regards,
Chen Huajun

pgsql-hackers by date:

Previous
From: Chapman Flack
Date:
Subject: Re: PostgreSQL vs SQL/XML Standards
Next
From: Tomas Vondra
Date:
Subject: Re: could not access status of transaction