
[bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary - Mailing list pgsql-hackers

From	Tsunakawa, Takayuki
Subject	[bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary
Date	March 1, 2018 04:26:32
Msg-id	0A3221C70F24FB45833433255569204D1F8DAAA2@G01JPEXMBYT05
List	pgsql-hackers


Hello,

Our customer hit another bug in pg_rewind with PG 9.5.  The attached patch fixes it.


PROBLEM
========================================

After a long, successful run of pg_rewind, the synchronized standby could never catch up with the primary, and emitted
the following message repeatedly:
 

LOG:  XX000: could not read from log segment 000000060000028A00000031, offset 16384: No error


CAUSE
========================================

If the primary removes WAL files that pg_rewind is about to fetch, pg_rewind leaves 0-byte WAL files in the target
directory here:
 

[libpq_fetch.c]
            case FILE_ACTION_COPY:
                /* Truncate the old file out of the way, if any */
                open_target_file(entry->path, true);
                fetch_file_range(entry->path, 0, entry->newsize);
                break;

pg_rewind completes successfully; recovery.conf is then created and the standby is started in the target cluster.  The
walreceiver receives WAL records and writes them to the 0-byte WAL files.  Finally, the xlog reader complains that it
cannot read a WAL page.
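One plausible reading of the "No error" text, which the log itself does not confirm: the read of the WAL page came back
short rather than failing, so errno was never set.  The sketch below only illustrates that POSIX behaviour.  The names
make_file, read_page_at and PAGE_SIZE_BYTES are hypothetical and are not taken from the PostgreSQL source.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define PAGE_SIZE_BYTES 8192    /* hypothetical stand-in for the WAL page size */

/* Create (or overwrite) a file holding `size` zero bytes.  Returns 0 on success. */
int make_file(const char *path, long size)
{
    char zeros[PAGE_SIZE_BYTES] = {0};
    int fd;

    assert(path != NULL && size >= 0);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    while (size > 0)
    {
        long chunk = size < PAGE_SIZE_BYTES ? size : PAGE_SIZE_BYTES;

        if (write(fd, zeros, (size_t) chunk) != chunk)
        {
            close(fd);
            return -1;
        }
        size -= chunk;
    }
    return close(fd);
}

/*
 * Try to read one page at `offset`.  Returns the byte count from read()
 * (possibly less than a page), or -1 if open, lseek or read failed.
 * *saved_errno receives errno as it stood right after the last call made.
 */
long read_page_at(const char *path, long offset, char *buf, int *saved_errno)
{
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
    {
        *saved_errno = errno;
        return -1;
    }
    if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
    {
        *saved_errno = errno;
        close(fd);
        return -1;
    }
    errno = 0;
    n = read(fd, buf, PAGE_SIZE_BYTES);
    *saved_errno = errno;       /* stays 0 when read() merely hits end of file */
    close(fd);
    return (long) n;
}
```

For a file that is only 16384 bytes long, read_page_at(path, 16384, buf, &e) returns 0 and leaves e at 0.  A caller
that treats "fewer bytes than one page" as an error and then prints the errno text would report a failure with no
error code behind it.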


FIX
========================================

With the patch, pg_rewind deletes the target file when it finds that the primary has already deleted the source file.
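
The idea can be sketched with plain local files.  This is only an illustration under simplifying assumptions, not the
attached patch and not pg_rewind's code: copy_or_remove is a hypothetical helper, and the real tool fetches the file
over libpq rather than opening it locally.  The point is the ordering: the target is truncated only once the source is
known to exist, and it is removed if the source is gone.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy src to dst.  If src has vanished, remove dst
 * instead of leaving a truncated 0-byte file behind.
 * Returns 1 if copied, 0 if dst was removed because src is gone, -1 on error.
 */
int copy_or_remove(const char *src, const char *dst)
{
    char buf[8192];
    ssize_t n = 0;
    int in, out;

    assert(src != NULL && dst != NULL);

    in = open(src, O_RDONLY);
    if (in < 0)
    {
        if (errno != ENOENT)
            return -1;
        /* Source is gone: drop any stale copy in the target as well. */
        if (unlink(dst) != 0 && errno != ENOENT)
            return -1;
        return 0;
    }

    /* Source exists: only now truncate the target and copy into it. */
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out < 0)
    {
        close(in);
        return -1;
    }
    while ((n = read(in, buf, sizeof(buf))) > 0)
    {
        if (write(out, buf, (size_t) n) != n)
        {
            n = -1;
            break;
        }
    }
    close(in);
    if (close(out) != 0)
        n = -1;
    return (n < 0) ? -1 : 1;
}
```

With this ordering, a source file that disappears before the copy starts no longer leaves an empty file in the target
for the walreceiver to write into later.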


OTHER THOUGHTS
========================================

BTW, should pg_rewind really copy WAL files from the primary?  If the sole purpose of pg_rewind is to recover an
instance to use as a standby, can't pg_rewind just remove all WAL files in the target directory, since the standby can
get WAL files from the primary and/or the archive?
 

Related to this, shouldn't pg_rewind, like pg_basebackup, avoid copying more files and directories?  Currently, pg_rewind
doesn't copy postmaster.pid, postmaster.opts, and temporary files/directories (pg_sql_tmp/).
 

Regards
Takayuki Tsunakawa

Attachment

pg_rewind_corrupt_wal.patch
