Thread: 10.5 but not 10.4: backend startup during reindex system: could notread block 0 in file "base/16400/..": read only 0 of 8192 bytes

I've seen this message now a handful of times recently.  It seems to happen
overnight, during a maintenance job which reindex things, including system
catalog indices.

It's easy to reproduce error under 10.5, but not under 10.3 nor 10.4.

while :; do for a in pg_class_oid_index pg_class_relname_nsp_index pg_class_tblspc_relfilenode_index; do psql ts -qc
"REINDEXINDEX $a"; done; done&
 

[pryzbyj@database ~]$ while :; do psql ts -qc ''; done
psql: FATAL:  could not read block 1 in file "base/16400/313430687": read only 0 of 8192 bytes
psql: FATAL:  could not read block 0 in file "base/16400/313430708": read only 0 of 8192 bytes
psql: FATAL:  could not read block 0 in file "base/16400/313430711": read only 0 of 8192 bytes
^C

postgres=# SELECT * FROM postgres_log WHERE error_severity ='FATAL';
log_time               | 2018-08-28 22:19:58.822-05
user_name              | telsasoft
database               | ts
pid                    | 22445
connection_from        | 192.168.122.12:58318
session_id             | 5b8610de.57ad
session_line           | 1
command_tag            | startup
session_start_time     | 2018-08-28 22:19:58-05
virtual_transaction_id | 26/280967
transaction_id         | 0
error_severity         | FATAL
sql_state_code         | XX001
message                | could not read block 0 in file "base/16400/313316300": read only 0 of 8192 bytes

Justin


Justin Pryzby <pryzby@telsasoft.com> writes:
> I've seen this message now a handful of times recently.  It seems to happen
> overnight, during a maintenance job which reindex things, including system
> catalog indices.
> It's easy to reproduce error under 10.5, but not under 10.3 nor 10.4.
> while :; do for a in pg_class_oid_index pg_class_relname_nsp_index pg_class_tblspc_relfilenode_index; do psql ts -qc
"REINDEXINDEX $a"; done; done& 

This looks suspiciously like the issue under discussion in

https://www.postgresql.org/message-id/12259.1532117714%40sss.pgh.pa.us

As far as we can tell, that bug is a dozen years old, so it's not clear
why you find that you can reproduce it only in 10.5.  But there might be
some subtle timing change accounting for that.

            regards, tom lane


On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > I've seen this message now a handful of times recently.  It seems to happen
> > overnight, during a maintenance job which reindex things, including system
> > catalog indices.
> > It's easy to reproduce error under 10.5, but not under 10.3 nor 10.4.
> > while :; do for a in pg_class_oid_index pg_class_relname_nsp_index pg_class_tblspc_relfilenode_index; do psql ts
-qc"REINDEX INDEX $a"; done; done&
 
> > time while psql ts -qc ''; do :; done
> 
> This looks suspiciously like the issue under discussion in
> 
> https://www.postgresql.org/message-id/12259.1532117714%40sss.pgh.pa.us
> 
> As far as we can tell, that bug is a dozen years old, so it's not clear
> why you find that you can reproduce it only in 10.5.  But there might be
> some subtle timing change accounting for that.

I confirmed the stack appears to be same as here, modulo differences between
9.6 to 10.
https://www.postgresql.org/message-id/fcab55ec-660a-bc43-7fe1-09af3665e89a%402ndquadrant.com

It seems to me there's one root problem occurring in (at least) two slightly
different ways.  The issue/symptom that I've been seeing occurs in 10.5 but not
10.4, and specifically at commit 2ce64ca, but not before. 

|commit 2ce64caaf793615d0f7a7ce4380e7255077d130c
|Author: Andres Freund <andres@anarazel.de>
|Date:   Tue Jun 12 11:13:21 2018 -0700
|
|Fix bugs in vacuum of shared rels, by keeping their relcache entries current.

However, the same message CAN happen under 10.4, though less easily, due to
autovacuum but (afaict) not user queries.  Maybe that's a coincidence, but my
experience mirrors the build logs: the problem since 10.5 is easily tickled in
a few seconds rather than an hour.

The issue I see frequently under 10.5 has this stack:

Core was generated by `postgres: pryzbyj postgres [local] startup                                    '.
#2  0x0000000000709425 in mdread ()
#3  0x00000000006e2bc8 in ReadBuffer_common ()
#4  0x00000000006e3511 in ReadBufferExtended ()
#5  0x00000000004c2da9 in _bt_getbuf ()
#6  0x00000000004c32b2 in _bt_getroot ()
#7  0x00000000004c799b in _bt_search ()
#8  0x00000000004c8568 in _bt_first ()
#9  0x00000000004c52c7 in btgettuple ()
#10 0x00000000004bf936 in index_getnext_tid ()
#11 0x00000000004bfb43 in index_getnext ()
#12 0x00000000004beeae in systable_getnext ()
#13 0x00000000007fa074 in ScanPgRelation ()
#14 0x00000000007fe371 in RelationClearRelation ()
#15 0x00000000007fe84b in RelationCacheInvalidateEntry ()
#16 0x00000000007f7e15 in LocalExecuteInvalidationMessage ()
#17 0x00000000006f3b38 in ReceiveSharedInvalidMessages ()
#18 0x00000000006f6eb5 in LockRelationOid ()
#19 0x00000000004ae495 in relation_open ()
#20 0x00000000004bf336 in index_open ()
#21 0x00000000004bee3c in systable_beginscan ()
#22 0x00000000007fa069 in ScanPgRelation ()
#23 0x00000000007fce84 in RelationBuildDesc ()
#24 0x00000000007fe6f8 in RelationIdGetRelation ()
#25 0x00000000004ae463 in relation_open ()
#26 0x00000000004ae6a6 in heap_open ()
#27 0x0000000000819237 in InitPostgres ()
#28 0x000000000070e2e9 in PostgresMain ()
#29 0x0000000000477f36 in ServerLoop ()
#30 0x00000000006a5e3e in PostmasterMain ()
#31 0x0000000000478a08 in main ()

From autovacum running REL_10_4 I've now seen (since trying to tickle the
problem yesterday):

#1  0x00007fbfb5b9d01a in __GI_abort () at abort.c:89
#2  0x0000000000708489 in mdread ()
#3  0x00000000006e1c48 in ReadBuffer_common ()
#4  0x00000000006e2591 in ReadBufferExtended ()
#5  0x00000000004c2369 in _bt_getbuf ()
#6  0x00000000004c2810 in _bt_getroot ()
#7  0x00000000004c6f5b in _bt_search ()
#8  0x00000000004c7b28 in _bt_first ()
#9  0x00000000004c4887 in btgettuple ()
#10 0x00000000004bf00e in index_getnext_tid ()
#11 0x00000000004bf1f3 in index_getnext ()
#12 0x00000000004be679 in systable_getnext ()
#13 0x00000000007f9084 in ScanPgRelation ()
#14 0x00000000007fbe54 in RelationBuildDesc ()
#15 0x00000000007fcc86 in RelationClearRelation ()
#16 0x00000000007fd7bb in RelationCacheInvalidateEntry ()
#17 0x00000000007f6e25 in LocalExecuteInvalidationMessage ()
#18 0x00000000006f2bb8 in ReceiveSharedInvalidMessages ()
#19 0x00000000006f5f35 in LockRelationOid ()
#20 0x00000000004ade05 in relation_open ()
#21 0x00000000004bea96 in index_open ()
#22 0x00000000004be606 in systable_beginscan ()
#23 0x00000000007f9079 in ScanPgRelation ()
#24 0x00000000007fbe54 in RelationBuildDesc ()
#25 0x00000000007fd668 in RelationIdGetRelation ()
#26 0x00000000004addd3 in relation_open ()
#27 0x00000000004bea96 in index_open ()
#28 0x00000000004be606 in systable_beginscan ()
#29 0x00000000007fef60 in RelationGetIndexList ()
#30 0x00000000005dba6c in vac_open_indexes ()
#31 0x00000000005dc652 in lazy_vacuum_rel ()
#32 0x00000000005da7b4 in vacuum_rel ()
#33 0x00000000005db661 in vacuum ()
#34 0x00000000004770db in do_autovacuum ()
#35 0x0000000000696242 in AutoVacWorkerMain.isra.6 ()
#36 0x0000000000696a49 in StartAutoVacWorker ()
#37 0x00000000006a3f0a in sigusr1_handler ()
#38 <signal handler called>
#39 0x00007fbfb5c62cf3 in __select_nocancel ()
    at ../sysdeps/unix/syscall-template.S:84
#40 0x000000000047760e in ServerLoop ()
#41 0x00000000006a4ebe in PostmasterMain ()
#42 0x0000000000478a78 in main ()

Justin


Hi,

On 2018-08-30 15:00:28 -0500, Justin Pryzby wrote:
> On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> > This looks suspiciously like the issue under discussion in
> > 
> > https://www.postgresql.org/message-id/12259.1532117714%40sss.pgh.pa.us
> > 
> > As far as we can tell, that bug is a dozen years old, so it's not clear
> > why you find that you can reproduce it only in 10.5.  But there might be
> > some subtle timing change accounting for that.
> 
> I confirmed the stack appears to be same as here, modulo differences between
> 9.6 to 10.

Could you check if both of the proposed attempts at fixing the issue
also solves your problem?

Thanks,

Andres Freund


Justin Pryzby <pryzby@telsasoft.com> writes:
> On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
>> As far as we can tell, that bug is a dozen years old, so it's not clear
>> why you find that you can reproduce it only in 10.5.  But there might be
>> some subtle timing change accounting for that.

> It seems to me there's one root problem occurring in (at least) two slightly
> different ways.  The issue/symptom that I've been seeing occurs in 10.5 but not
> 10.4, and specifically at commit 2ce64ca, but not before.

Yeah, as you probably saw in the other thread, we later realized that
2ce64ca created an additional pathway for ScanPgRelation to recurse;
a pathway that's evidently easier to hit than the pre-existing ones.
I note that both of your stack traces display ScanPgRelation recursion,
so I'm feeling pretty confident that what you're seeing is the same
thing.

But, as Andres says, it'd be great if you could confirm whether the
draft patches fix it for you.

            regards, tom lane


On Thu, Aug 30, 2018 at 05:30:30PM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> >> As far as we can tell, that bug is a dozen years old, so it's not clear
> >> why you find that you can reproduce it only in 10.5.  But there might be
> >> some subtle timing change accounting for that.
> 
> > It seems to me there's one root problem occurring in (at least) two slightly
> > different ways.  The issue/symptom that I've been seeing occurs in 10.5 but not
> > 10.4, and specifically at commit 2ce64ca, but not before. 
> 
> Yeah, as you probably saw in the other thread, we later realized that
> 2ce64ca created an additional pathway for ScanPgRelation to recurse;
> a pathway that's evidently easier to hit than the pre-existing ones.
> I note that both of your stack traces display ScanPgRelation recursion,
> so I'm feeling pretty confident that what you're seeing is the same
> thing.
> 
> But, as Andres says, it'd be great if you could confirm whether the
> draft patches fix it for you.

I tested with relcache-rebuild.diff which hasn't broken in 15min, so I'm
confident that doesn't hit the additional recusive pathway, but have to wait
awhile and see if autovacuum survives, too.

I tried to apply fix-missed-inval-msg-accepts-1.patch on top of PG10.5 but
patch didn't apply, so I can test HEAD after the first patch soaks awhile.

Just curious, is there really any difficulty in reproducing this?  Once I
realized this was a continuing issue and started to suspect pg10.5, it takes
just about nothing to reproduce anywhere I've tried.  I just tested 5 servers,
and only one took more than a handful of seconds to fail.  I gave up waiting
for a 6th server, because I found it was waiting on a pre-existing lock.

[pryzbyj@database ~]$ while :; do for a in pg_class_oid_index pg_class_relname_nsp_index
pg_class_tblspc_relfilenode_index;do psql ts -qc "REINDEX INDEX $a"; done; done&
 
[pryzbyj@database ~]$ a=0; time while psql ts -qc ''; do a=$((1+a)); done ; echo "$a"
psql: FATAL:  could not read block 0 in file "base/16400/313581263": read only 0 of 8192 bytes

real    0m1.772s
user    0m0.076s
sys     0m0.116s
47

Justin


Justin Pryzby <pryzby@telsasoft.com> writes:
> Just curious, is there really any difficulty in reproducing this?

Once you have the right test case, it's not hard.  But it took us two
months to find one ...

            regards, tom lane


On Thu, Aug 30, 2018 at 01:25:00PM -0700, Andres Freund wrote:
> On August 30, 2018 1:24:12 PM PDT, Justin Pryzby <pryzby@telsasoft.com> wrote:
> >On Thu, Aug 30, 2018 at 01:18:59PM -0700, Andres Freund wrote:
> >> Could you check if both of the proposed attempts at fixing the issue
> >> also solves your problem?
> >
> >Just checking that you're referring to:
> >
> >1)  relcache-rebuild.diff
> >https://www.postgresql.org/message-id/20180829083730.n645apqhb2gyih3g%40alap3.anarazel.de
> >2)  fix-missed-inval-msg-accepts-1.patch
> >https://www.postgresql.org/message-id/25024.1535579899%40sss.pgh.pa.us
> 
> Right. Either should fix the issue.

Confirmed that relcache-rebuild.diff seems to fix/avoid the issue under 10.5,
and fix-missed-inval-msg-accepts-1.patch does so for HEAD (but doesn't apply
cleanly to 10.5 so not tested).  That's for both the the new easy-to-hit
symptom and the pre-existing hard-to-hit issue affecting pg10.4 and earlier.

Thanks,
Justin