Could not read block at end of the relation - Mailing list pgsql-bugs

From Ronan Dunklau
Subject Could not read block at end of the relation
Date
Msg-id 1878547.tdWV9SEqCh@aivenlaptop
Responses FSM Corruption (was: Could not read block at end of the relation)
List pgsql-bugs
Hello,

I'm sorry, as this will be a very poor bug report. On PG16, I am experiencing
random errors which share the same characteristics:

- happens during heavy system load
- lots of concurrent writes happening on a table
- often (though I haven't been able to confirm it is necessary), a vacuum is
running on the table at the time the error is triggered

Then, several backends all get the same error at once: "ERROR:  could not read 
block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with different 
block numbers. The relation is always a table (regular or TOAST). The blocks 
are past the end of the relation, and each backend is trying to read a 
different block. The offending queries are INSERTs, UPDATEs, or COPYs.

I've seen that several bugs have been fixed in 16.1 and 16.2 regarding the new 
relation extension infrastructure, involving partitioned tables in one case 
and temp tables in the other, so I suspect some other corner case remains 
uncovered there.

I suspected the FSM could be corrupted in some way, but taking a look at it 
just after the errors were triggered, the offending (non-existing) blocks 
are simply not present in the FSM either.
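For anyone wanting to run a similar check, it can be done along these lines (a sketch, assuming the pg_freespacemap contrib extension is available; 'mytable' stands in for the affected relation):

```sql
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- Relation size in blocks, for reference.
SELECT pg_relation_size('mytable') / current_setting('block_size')::int
       AS nblocks;

-- FSM entries for the last blocks of the table; an entry advertising
-- free space for a block that does not actually exist on disk would
-- point to FSM corruption.
SELECT blkno, avail
FROM pg_freespace('mytable')
ORDER BY blkno DESC
LIMIT 20;
```

Note that pg_freespace() only reports blocks up to the relation's current length, so comparing its highest blkno against the reported error block is the relevant check.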

I'm desperately trying to reproduce the issue in a test environment, without 
any luck so far... I suspected a race condition with VACUUM trying to reclaim 
the space at the end of the relation, but running a custom build trying to 
reproduce that (by always trying to truncate the relation during VACUUM 
regardless of the amount of possibly-freeable-space) hasn't led me anywhere.
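In case it helps anyone else attempt a reproduction, the load pattern I've been trying looks roughly like this (a sketch only, not a confirmed reproducer; the table name, durations, and client counts are illustrative):

```shell
# Concurrent writers churning one table while VACUUM races against them.
psql -c 'CREATE TABLE IF NOT EXISTS bugtest (id bigserial, payload text)'

cat > insert.sql <<'EOF'
INSERT INTO bugtest (payload)
SELECT repeat('x', 1000) FROM generate_series(1, 100);
DELETE FROM bugtest WHERE random() < 0.5;
EOF

# 32 concurrent writers for 10 minutes...
pgbench -n -f insert.sql -c 32 -j 4 -T 600 &

# ...while vacuuming in a loop, hoping to hit the truncation race.
while true; do psql -q -c 'VACUUM bugtest'; sleep 1; done
```

So far this has not triggered the error for me, presumably because the production workload has some timing characteristic this doesn't capture.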

My current hypothesis is that, for some reason, a backend extends the relation 
on behalf of the other waiting ones, but the newly allocated blocks don't end 
up pinned in shared_buffers. They could then be evicted, and a waiting backend 
ends up trying to read a block from disk that was never marked dirty and never 
persisted. I don't have anything to back that hypothesis though...

Once again, I'm sorry that this report is so vague. I'll follow up if I manage 
to reproduce the issue or gather more information, but in the meantime, has 
anybody witnessed something similar? And more importantly, do you have any 
pointers on how to investigate, or how to try to trigger the issue manually?

Best regards,

--
Ronan Dunklau
