Re: [HACKERS] emergency outage requiring database restart - Mailing list pgsql-hackers
From: Ants Aasma
Subject: Re: [HACKERS] emergency outage requiring database restart
Date:
Msg-id: CA+CSw_seDunLPXqczV_5NO1YaOq-89r0fqsCX7zsEba8cmyeOg@mail.gmail.com
In response to: Re: emergency outage requiring database restart (Merlin Moncure <mmoncure@gmail.com>)
Responses: Re: [HACKERS] emergency outage requiring database restart
List: pgsql-hackers
On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> Still getting checksum failures. Over the last 30 days, I see the
>>> following. Since enabling checksums FWICT none of the damage is
>>> permanent and rolls back with the transaction. So creepy!
>>
>> The checksums still only differ in least significant digits which
>> pretty much means that there is a block number mismatch. So if you
>> rule out filesystem not doing its job correctly and transposing
>> blocks, it could be something else that is resulting in blocks getting
>> read from a location that happens to differ by a small multiple of
>> page size. Maybe somebody is racily mucking with table fd's between
>> seeking and reading. That would explain the issue disappearing after a
>> retry.
>>
>> Maybe you can arrange for the RelFileNode and block number to be
>> logged for the checksum failures and check what the actual checksums
>> are in data files surrounding the failed page. If the requested block
>> number contains something completely else, but the page that follows
>> contains the expected checksum value, then it would support this
>> theory.
>
> will do. Main challenge is getting hand compiled server to swap in
> so that libdir continues to work. Getting access to the server is
> difficult as is getting a maintenance window. I'll post back ASAP.

As a new data point, we just had a customer with an issue that I think
might be related. The issue was reasonably repeatable by running a
report on the standby system. It first manifested as "could not open
relation" and/or "column is not in index" errors, followed a few
minutes later by a PANIC from the startup process due to "specified
item offset is too large", "invalid max offset number" or "page X of
relation base/16384/1259 is uninitialized".

I took a look at the xlog dump and it was completely fine. For
instance, in the "specified item offset is too large" case there was
an INSERT_LEAF redo record inserting the preceding offset just a
couple hundred kilobytes back. Restarting the server sometimes
successfully applied the offending WAL, and sometimes it failed with
other corruption errors. The offending relations were always pg_class
or pg_class_oid_index. Replacing the plsh functions with dummy plpgsql
functions made the problem go away; reintroducing the plsh functions
made it reappear.

The only explanation I have come up with that is consistent with the
symptoms is that a page got thrown out of shared_buffers between the
two xlog records referencing it (shared_buffers was the default
128MB) and was then read back by a backend process where, in the time
between the FileSeek and FileRead calls in mdread, a subprocess mucked
with the fd's offset so that a different page than intended got read
in. Or basically the same race condition, but on the write side. Maybe
somebody else has a better imagination than I do...

Regards,
Ants Aasma
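To make the suspected mechanism concrete: a forked child inherits its
parent's file descriptors and shares their file offsets, so a seek done
in the child changes where the parent's next read() lands. Below is a
minimal standalone sketch of that effect in plain POSIX C (this is not
PostgreSQL source; the file name "data_file", the block numbers and the
8 kB block size are illustrative only, and the child's seek is forced
deterministically here rather than actually raced).

/*
 * fd_offset_race.c -- sketch of how a forked child that inherits a file
 * descriptor shares its offset, so a seek in the child moves where the
 * parent's next read() lands.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

int
main(void)
{
    char    block[BLOCK_SIZE];
    int     fd = open("data_file", O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Parent positions the descriptor at block 10 ... */
    if (lseek(fd, (off_t) 10 * BLOCK_SIZE, SEEK_SET) < 0)
    {
        perror("lseek");
        return 1;
    }

    if (fork() == 0)
    {
        /*
         * ... but the child shares the same open file description, so
         * this seek also moves the parent's offset, here to block 11.
         */
        lseek(fd, (off_t) 11 * BLOCK_SIZE, SEEK_SET);
        _exit(0);
    }
    wait(NULL);

    /*
     * The parent now reads the contents of block 11 while believing it
     * read block 10 -- the same effect as a subprocess touching the fd
     * between the seek and the read.
     */
    ssize_t nread = read(fd, block, BLOCK_SIZE);

    printf("read %zd bytes from the offset the child left behind\n", nread);
    close(fd);
    return 0;
}

Since pread() does not use or modify the shared file offset, a
seek-free pread() in the parent would be immune to this particular
interference, which is one way the window described above could be
closed.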