On Fri, Aug 11, 2017 at 1:33 PM, Greg Stark <stark@mit.edu> wrote:
> On 10 August 2017 at 15:26, Chris Travers <chris.travers@gmail.com> wrote:
> >
> > The bitwise comparison is interesting. Remember the error was:
> >
> > pg_xlogdump: FATAL: error in WAL record at 1E39C/E1117FB8: unexpected
> > pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset
> > 1146880
> ...
> > Since this didn't throw a checksum error (we have data checksums disabled
> > but WAL records ISTR have a separate CRC check), would this perhaps
> > indicate that the checksum operated over incorrect data?
>
> There is no checksum error, and this "unexpected pageaddr" doesn't
> necessarily mean data corruption. It could mean that when the database
> stopped logging it was reusing a WAL file, and the old WAL stream had a
> record boundary at the same byte position. So the previous record's
> checksum passes and the following record's checksum passes, but the page
> header is for a different WAL stream position.
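
The addresses in the error message can be checked against this theory with a little arithmetic. The sketch below is only that: it assumes the default 16 MB segment size and 8 kB WAL page size, neither of which is stated in this thread, and the helper names are made up for illustration. It works out which page header address the reader should have found after the record at 1E39C/E1117FB8, and where in its own segment the address that was actually found would sit.

```python
SEG_SIZE = 16 * 1024 * 1024   # assumption: default WAL segment size
PAGE_SIZE = 8192              # assumption: default WAL page size (XLOG_BLCKSZ)

def parse_lsn(text):
    """Turn 'HI/LO' hex notation into a single 64-bit position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(pos):
    """Inverse of parse_lsn."""
    return "%X/%X" % (pos >> 32, pos & 0xFFFFFFFF)

def locate(pos):
    """Return (segment number, byte offset inside that segment)."""
    return pos // SEG_SIZE, pos % SEG_SIZE

record = parse_lsn("1E39C/E1117FB8")   # last record read, from the error
found = parse_lsn("1E375/61118000")    # pageaddr actually found

rec_seg, rec_off = locate(record)
# The page that follows the one containing the record.
next_page_off = (rec_off // PAGE_SIZE + 1) * PAGE_SIZE
expected = rec_seg * SEG_SIZE + next_page_off
found_seg, found_off = locate(found)

print("offset of next page in segment:", next_page_off)
print("expected pageaddr:             ", format_lsn(expected))
print("found pageaddr:                ", format_lsn(found))
print("found page's offset in segment:", found_off)
print("segments between the two:      ", rec_seg - found_seg)
```

If the found page sits at the same in-segment offset as the expected one but belongs to a much older segment, that is what a recycled file with leftover contents would look like. It does not prove the theory, but it is consistent with it.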
I expect to test this theory shortly.
Assuming it is correct, what can we do to prevent restarts of slaves from running into it?
> I think you could actually hack xlogdump to ignore this condition and
> keep outputting, and you'll see whether the records that follow appear to
> be old WAL data. I haven't actually tried this, though.
>
> --
> greg
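
Short of patching xlogdump, a rougher check is to read the page headers straight out of the segment file and see where the addresses stop matching the file name. This is a different technique from the one suggested above, and the sketch rests on assumptions that should be checked against the server's own headers: that each 8 kB page starts with xlp_magic (uint16), xlp_info (uint16), xlp_tli (uint32) and xlp_pageaddr (uint64), that the file was written on a little-endian machine, and that segments are the default 16 MB. The function names are hypothetical.

```python
import struct

SEG_SIZE = 16 * 1024 * 1024   # assumption: default WAL segment size
PAGE_SIZE = 8192              # assumption: default WAL page size
# Assumed start of the page header, little-endian:
# xlp_magic (uint16), xlp_info (uint16), xlp_tli (uint32), xlp_pageaddr (uint64)
PAGE_HEADER = struct.Struct("<HHIQ")

def segment_start_from_name(name):
    """WAL file names are timeline, log id, segment id: 8 hex digits each."""
    log_id = int(name[8:16], 16)
    seg_id = int(name[16:24], 16)
    return (log_id << 32) + seg_id * SEG_SIZE

def stale_pages(segment_bytes, segment_start):
    """Yield (offset, pageaddr) for pages whose header address does not
    match their position in this segment. Never-written (zeroed) pages
    show up too, with a pageaddr of 0."""
    last = len(segment_bytes) - PAGE_HEADER.size
    for offset in range(0, last + 1, PAGE_SIZE):
        _magic, _info, _tli, pageaddr = PAGE_HEADER.unpack_from(
            segment_bytes, offset)
        if pageaddr != segment_start + offset:
            yield offset, pageaddr

def scan_file(path, name):
    """Read one segment file and print every mismatching page."""
    with open(path, "rb") as f:
        data = f.read()
    start = segment_start_from_name(name)
    for offset, pageaddr in stale_pages(data, start):
        print("offset %d: pageaddr %X/%X" % (
            offset, pageaddr >> 32, pageaddr & 0xFFFFFFFF))
```

Calling scan_file with the path to a copy of the segment and the name 000000000001E39C000000E1 would list the mismatching pages. If everything from offset 1146880 onward carries addresses from one older segment, that points to leftover data in a recycled file rather than corruption.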
--
Best Wishes,
Chris Travers
Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in.