Re: 10.5 but not 10.4: backend startup during reindex system: couldnot read block 0 in file "base/16400/..": read only 0 of 8192 bytes - Mailing list pgsql-hackers

On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > I've seen this message now a handful of times recently.  It seems to happen
> > overnight, during a maintenance job which reindex things, including system
> > catalog indices.
> > It's easy to reproduce error under 10.5, but not under 10.3 nor 10.4.
> > while :; do for a in pg_class_oid_index pg_class_relname_nsp_index pg_class_tblspc_relfilenode_index; do psql ts
-qc"REINDEX INDEX $a"; done; done&
 
> > time while psql ts -qc ''; do :; done
> 
> This looks suspiciously like the issue under discussion in
> 
> https://www.postgresql.org/message-id/12259.1532117714%40sss.pgh.pa.us
> 
> As far as we can tell, that bug is a dozen years old, so it's not clear
> why you find that you can reproduce it only in 10.5.  But there might be
> some subtle timing change accounting for that.

I confirmed the stack appears to be same as here, modulo differences between
9.6 to 10.
https://www.postgresql.org/message-id/fcab55ec-660a-bc43-7fe1-09af3665e89a%402ndquadrant.com

It seems to me there's one root problem occurring in (at least) two slightly
different ways.  The issue/symptom that I've been seeing occurs in 10.5 but not
10.4, and specifically at commit 2ce64ca, but not before. 

|commit 2ce64caaf793615d0f7a7ce4380e7255077d130c
|Author: Andres Freund <andres@anarazel.de>
|Date:   Tue Jun 12 11:13:21 2018 -0700
|
|Fix bugs in vacuum of shared rels, by keeping their relcache entries current.

However, the same message CAN happen under 10.4, though less easily, due to
autovacuum but (afaict) not user queries.  Maybe that's a coincidence, but my
experience mirrors the build logs: the problem since 10.5 is easily tickled in
a few seconds rather than an hour.

The issue I see frequently under 10.5 has this stack:

Core was generated by `postgres: pryzbyj postgres [local] startup                                    '.
#2  0x0000000000709425 in mdread ()
#3  0x00000000006e2bc8 in ReadBuffer_common ()
#4  0x00000000006e3511 in ReadBufferExtended ()
#5  0x00000000004c2da9 in _bt_getbuf ()
#6  0x00000000004c32b2 in _bt_getroot ()
#7  0x00000000004c799b in _bt_search ()
#8  0x00000000004c8568 in _bt_first ()
#9  0x00000000004c52c7 in btgettuple ()
#10 0x00000000004bf936 in index_getnext_tid ()
#11 0x00000000004bfb43 in index_getnext ()
#12 0x00000000004beeae in systable_getnext ()
#13 0x00000000007fa074 in ScanPgRelation ()
#14 0x00000000007fe371 in RelationClearRelation ()
#15 0x00000000007fe84b in RelationCacheInvalidateEntry ()
#16 0x00000000007f7e15 in LocalExecuteInvalidationMessage ()
#17 0x00000000006f3b38 in ReceiveSharedInvalidMessages ()
#18 0x00000000006f6eb5 in LockRelationOid ()
#19 0x00000000004ae495 in relation_open ()
#20 0x00000000004bf336 in index_open ()
#21 0x00000000004bee3c in systable_beginscan ()
#22 0x00000000007fa069 in ScanPgRelation ()
#23 0x00000000007fce84 in RelationBuildDesc ()
#24 0x00000000007fe6f8 in RelationIdGetRelation ()
#25 0x00000000004ae463 in relation_open ()
#26 0x00000000004ae6a6 in heap_open ()
#27 0x0000000000819237 in InitPostgres ()
#28 0x000000000070e2e9 in PostgresMain ()
#29 0x0000000000477f36 in ServerLoop ()
#30 0x00000000006a5e3e in PostmasterMain ()
#31 0x0000000000478a08 in main ()

From autovacum running REL_10_4 I've now seen (since trying to tickle the
problem yesterday):

#1  0x00007fbfb5b9d01a in __GI_abort () at abort.c:89
#2  0x0000000000708489 in mdread ()
#3  0x00000000006e1c48 in ReadBuffer_common ()
#4  0x00000000006e2591 in ReadBufferExtended ()
#5  0x00000000004c2369 in _bt_getbuf ()
#6  0x00000000004c2810 in _bt_getroot ()
#7  0x00000000004c6f5b in _bt_search ()
#8  0x00000000004c7b28 in _bt_first ()
#9  0x00000000004c4887 in btgettuple ()
#10 0x00000000004bf00e in index_getnext_tid ()
#11 0x00000000004bf1f3 in index_getnext ()
#12 0x00000000004be679 in systable_getnext ()
#13 0x00000000007f9084 in ScanPgRelation ()
#14 0x00000000007fbe54 in RelationBuildDesc ()
#15 0x00000000007fcc86 in RelationClearRelation ()
#16 0x00000000007fd7bb in RelationCacheInvalidateEntry ()
#17 0x00000000007f6e25 in LocalExecuteInvalidationMessage ()
#18 0x00000000006f2bb8 in ReceiveSharedInvalidMessages ()
#19 0x00000000006f5f35 in LockRelationOid ()
#20 0x00000000004ade05 in relation_open ()
#21 0x00000000004bea96 in index_open ()
#22 0x00000000004be606 in systable_beginscan ()
#23 0x00000000007f9079 in ScanPgRelation ()
#24 0x00000000007fbe54 in RelationBuildDesc ()
#25 0x00000000007fd668 in RelationIdGetRelation ()
#26 0x00000000004addd3 in relation_open ()
#27 0x00000000004bea96 in index_open ()
#28 0x00000000004be606 in systable_beginscan ()
#29 0x00000000007fef60 in RelationGetIndexList ()
#30 0x00000000005dba6c in vac_open_indexes ()
#31 0x00000000005dc652 in lazy_vacuum_rel ()
#32 0x00000000005da7b4 in vacuum_rel ()
#33 0x00000000005db661 in vacuum ()
#34 0x00000000004770db in do_autovacuum ()
#35 0x0000000000696242 in AutoVacWorkerMain.isra.6 ()
#36 0x0000000000696a49 in StartAutoVacWorker ()
#37 0x00000000006a3f0a in sigusr1_handler ()
#38 <signal handler called>
#39 0x00007fbfb5c62cf3 in __select_nocancel ()
    at ../sysdeps/unix/syscall-template.S:84
#40 0x000000000047760e in ServerLoop ()
#41 0x00000000006a4ebe in PostmasterMain ()
#42 0x0000000000478a78 in main ()

Justin


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Use C99 designated initializers for some structs
Next
From: Magnus Hagander
Date:
Subject: Re: pg_verify_checksums and -fno-strict-aliasing