Hello,
I'm sorry in advance, as this will be a rather vague bug report. On PG16, I am
experiencing random errors which all share the same characteristics:
- happens during heavy system load
- lots of concurrent writes happening on a table
- often (though I haven't been able to confirm it is necessary), a VACUUM is
running on the table at the time the error is triggered
Then, several backends get the same error at once: "ERROR: could not read
block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", each with a
different block number. The relation is always a table (regular or TOAST). The
blocks are past the end of the relation, and each backend is trying to read a
different block. The offending queries are INSERT, UPDATE, or COPY statements.
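To illustrate what I mean by "past the end": the block number reported in the
error is always greater than or equal to the relation's actual size in blocks,
which can be checked with something along these lines ('mytable' being a
placeholder for the affected relation):

    -- Number of blocks actually in the relation; the block number from
    -- the error message is >= this value.
    SELECT pg_relation_size('mytable') /
           current_setting('block_size')::int AS nblocks;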
I've seen that several bugs have been fixed in 16.1 and 16.2 regarding the new
relation extension infrastructure, involving partitioned tables in one case
and temp tables in the other, so I suspect some other corner cases may be
lurking in there.
I suspected the FSM could be corrupted in some way, but looking at it just
after the errors have been triggered, the offending (non-existing) blocks are
simply not present in the FSM either.
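For reference, the FSM can be inspected with the pg_freespacemap extension; a
check along these lines ('mytable' and 12345 being placeholders for the
affected relation and an offending block number) returns 0 for the offending
blocks, consistent with them not being advertised in the FSM:

    CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
    -- Recorded free space for the offending block; 0 means the FSM
    -- has nothing registered for it.
    SELECT pg_freespace('mytable', 12345);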
I'm desperately trying to reproduce the issue in a test environment, without
any luck so far... I suspected a race condition with VACUUM trying to reclaim
the space at the end of the relation, but running a custom build meant to make
that more likely (by always attempting to truncate the relation during VACUUM,
regardless of the amount of possibly-freeable space) hasn't led me anywhere.
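To give an idea, a minimal sketch of the kind of workload I have in mind (the
schema and sizes here are made up for illustration):

    CREATE TABLE mytable (id bigserial PRIMARY KEY, payload text);

    -- Many concurrent sessions, each inserting in a loop:
    INSERT INTO mytable (payload)
    SELECT repeat('x', 500) FROM generate_series(1, 1000);

    -- Meanwhile, one session repeatedly frees the tail of the relation
    -- and vacuums it, so VACUUM's truncation path gets exercised:
    DELETE FROM mytable WHERE id > (SELECT max(id) FROM mytable) - 50000;
    VACUUM mytable;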
My current hypothesis is that, for some reason, a backend extends the relation
on behalf of the other waiting ones, and the newly allocated blocks don't end
up pinned in shared_buffers. They could then be evicted, and a waiting backend
would end up trying to read a block from disk that has never been marked dirty
and never persisted. I don't have anything to back that hypothesis though...
Once again, I'm sorry that this report is so vague. I'll follow up if I manage
to reproduce the issue or gather more information, but in the meantime, has
anybody witnessed something similar? And more importantly, do you have any
pointers on how to investigate or trigger the issue manually?
Best regards,
--
Ronan Dunklau