Bernd Helmle wrote:
> A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
> instance.
>
> The database crashed with the following log messages:
>
> 2015-09-08 00:49:16 CEST [2912] PANIC: could not access status of
> transaction 1068235595
> 2015-09-08 00:49:16 CEST [2912] DETAIL: Could not open file
> "pg_multixact/members/FFFF5FC4": No such file or directory.
> 2015-09-08 00:49:16 CEST [2912] STATEMENT: delete from StockTransfer
> where oid = $1 and tanum = $2
I wonder if these bogus page and offset numbers are just SlruReportIOError being confused because pg_multixact/members is so weird (I don't think that should be the case, since this stuff uses page numbers only, not anything related to how each page is laid out).
But SlruReportIOError uses the same macro to build the filename as SlruReadPhysicalPage and other functions, namely SlruFileName which uses sprintf with %04X (unsigned integer uppercase hex) and gives it segno (which is always an int), so I don't think the problem is in error reporting only.
Assuming default block size, to get FFFF5FC4 from SlruFileName you need segno == -41020.
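As a quick sanity check of that segment number, here is a small Python sketch that mimics what sprintf with %04X does in practice when handed a negative 32-bit int. The helper names are mine, not PostgreSQL's, and the directory prefix is left out.

```python
def slru_file_name(segno: int) -> str:
    """Mimic sprintf("%04X", segno) for a 32-bit C int (hypothetical helper)."""
    # %X reads its argument as unsigned int, so a negative segno shows up
    # as its two's-complement bit pattern.
    return "%04X" % (segno & 0xFFFFFFFF)


def segno_from_file_name(name: str) -> int:
    """Invert slru_file_name, reinterpreting the 32-bit pattern as signed."""
    value = int(name, 16)
    return value - (1 << 32) if value >= (1 << 31) else value


print(slru_file_name(-41020))
print(segno_from_file_name("FFFF5FC4"))
```

So the file name in the log is what a segno of -41020 would print as.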
Oops, I meant to attach the proviso "Assuming default block size" to the assumption further down that MULTIXACT_MEMBERS_PER_PAGE == 1636.
We have int segno = pageno / 32 (that's SLRU_PAGES_PER_SEGMENT), and C integer division truncates toward zero, so to get segno == -41020 you need pageno between -1312671 and -1312640 (whose bit patterns reinterpreted as unsigned are 4293654625 and 4293654656).
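To double-check which pageno values land in segment -41020, the sketch below emulates C integer division, assumed to truncate toward zero as C99 requires, and searches the neighbourhood of segno * 32 by brute force. The function names are illustrative only.

```python
SLRU_PAGES_PER_SEGMENT = 32


def c_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, like C99 '/'."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def pages_in_segment(segno: int) -> list:
    """Collect every pageno near segno * 32 whose C-style quotient is segno."""
    center = segno * SLRU_PAGES_PER_SEGMENT
    return [p for p in range(center - 64, center + 65)
            if c_div(p, SLRU_PAGES_PER_SEGMENT) == segno]


def as_uint32(value: int) -> int:
    """Reinterpret a 32-bit signed value as unsigned."""
    return value & 0xFFFFFFFF


pages = pages_in_segment(-41020)
print(pages[0], pages[-1])                        # lowest and highest pageno
print(as_uint32(pages[0]), as_uint32(pages[-1]))  # same values as uint32
```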
In various places we have int pageno = offset / (uint32) 1636, expanded from this macro (which calls the offset an xid):
I don't really see how any uint32 value could produce such a pageno via that macro. Even if it were called in an environment where (xid) is accidentally an int, the int / unsigned expression would convert it to unsigned first (unless (xid) is a bigger signed type like int64_t: under the usual arithmetic conversions you'd get signed division in that case, hmm...). But it's always called with a MultiXactOffset AKA uint32 variable.
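To make the difference between those two conversion cases concrete, here is a rough Python emulation. It assumes a 32-bit int, 1636 members per page (the default block size), and truncating signed division. The function names are made up for illustration, and the input value is arbitrary.

```python
MULTIXACT_MEMBERS_PER_PAGE = 1636  # assumes the default block size


def div_int32_by_uint32(xid: int, divisor: int) -> int:
    """int / unsigned int: the int operand is converted to unsigned first."""
    return (xid & 0xFFFFFFFF) // divisor


def div_int64_by_uint32(xid: int, divisor: int) -> int:
    """int64 / uint32: the divisor is widened to int64, so division is signed."""
    q = abs(xid) // divisor
    return q if xid >= 0 else -q  # truncate toward zero, like C99


xid = -5000  # arbitrary negative value, only to show the two behaviours
print(div_int32_by_uint32(xid, MULTIXACT_MEMBERS_PER_PAGE))  # non-negative page
print(div_int64_by_uint32(xid, MULTIXACT_MEMBERS_PER_PAGE))  # negative page
```

Only the second form can yield a negative page number, which is why the type of (xid) matters here.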
So via that route, there is no MultiXactOffset value that can't be mapped to a segment in the range "0000" to "14078". Famously, it wraps after that.
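The upper end of that range can be checked with a few lines of Python, again assuming 1636 members per page (the default block size) and 32 pages per segment. member_segment is a hypothetical helper, not a PostgreSQL function.

```python
MULTIXACT_MEMBERS_PER_PAGE = 1636  # assumes the default block size
SLRU_PAGES_PER_SEGMENT = 32
MAX_MULTIXACT_OFFSET = 0xFFFFFFFF  # largest uint32 value


def member_segment(offset: int) -> int:
    """Map an unsigned 32-bit offset to its members segment number."""
    pageno = offset // MULTIXACT_MEMBERS_PER_PAGE
    return pageno // SLRU_PAGES_PER_SEGMENT


print("%04X" % member_segment(0))
print("%04X" % member_segment(MAX_MULTIXACT_OFFSET))
```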
Maybe the negative pageno came from somewhere else. Where? Inside SLRU code we can see pageno = shared->page_number[slotno]... maybe the SLRU slots got corrupted somehow?