Thread: Why standby restores some WALs many times from archive?
Hi,

I tried to ask a question in pgsql-general@ but got no response.

I have a postgres 9.4 standby with archive-based replication (via restore_command).

PostgreSQL 9.4.12 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit

One day I noticed strange behavior in the postgres logs: the same WAL has been restored many times:

"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"unexpected pageaddr 64A81/F9000000 in log segment 0000000200064AA80000005E, offset 0",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005E"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005F"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA800000060"" from archive",,,,,,,,,""

I enabled logging of successful executions in the 'restore_command' script and confirmed that it really happens: some WALs are restored again and again, and this is always connected with an 'unexpected pageaddr' log message.

Some statistics on how many times the same WALs were restored within 3 minutes:

[vyagofarov(at)myserver:~]
$ cut -f 2 -d '-' wals_restored.log | uniq -c -d | sort -r
      9 0000000200064AA9000000A4
      9 0000000200064AA80000005D
      8 0000000200064AAA00000044
      7 0000000200064AAA000000AB
      6 0000000200064AAA000000ED
      3 0000000200064AA7000000E8
      2 0000000200064AA80000003D
      2 0000000200064AA800000033
      2 0000000200064AA800000028
      2 0000000200064AA800000018
      2 0000000200064AA800000004
      2 0000000200064AA7000000FD

Is this a bug? It can add up to quite a lot of traffic when the archive server is used by >10 clusters.

--
Best regards,
Victor Yagofarov
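For reference, a restore_command wrapper that logs each successfully restored segment, as described above, might look roughly like the sketch below. The script name, paths, and log line format are illustrative assumptions, not the poster's actual script; the log format is only chosen so that the `cut -f 2 -d '-'` pipeline above makes sense.

    #!/bin/sh
    # restore_and_log.sh %f %p  -- hypothetical logging wrapper, not the original script
    WAL="$1"       # %f: WAL file name requested by the server
    TARGET="$2"    # %p: path where the server wants the file copied
    ARCHIVE=/var/lib/postgresql/wals

    cp "$ARCHIVE/$WAL" "$TARGET" || exit 1
    # record each successful restore; an epoch timestamp keeps '-' free as a separator
    echo "$(date +%s) - $WAL" >> /var/lib/postgresql/wals_restored.log
    exit 0

recovery.conf would then point at it with restore_command = '/usr/local/bin/restore_and_log.sh %f %p'.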
On Thu, Dec 28, 2017 at 7:02 AM, Victor Yagofarov <xnasx@yandex.ru> wrote:
> Hi,
> I tried to ask a question in pgsql-general@ but got no response.
> I have a postgres 9.4 standby with archive-based replication (via restore_command).
Can you show us both your archive_command and your restore_command?
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Thu, Dec 28, 2017 at 7:02 AM, Victor Yagofarov <xnasx@yandex.ru> wrote:
>> I have a postgres 9.4 standby with archive-based replication (via
>> restore_command).
> Can you show us both your archive_command and your restore_command?

We use these scripts:
https://github.com/avito-tech/dba-utils/tree/master/pg_archive

But I can reproduce the problem with a simple cp & mv:

archive_command:
 test ! -f /var/lib/postgresql/wals/%f && \
 test ! -f /var/lib/postgresql/wals/%f.tmp && \
 cp %p /var/lib/postgresql/wals/%f.tmp && \
 mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f

recovery.conf:
 restore_command = 'cp /var/lib/postgresql/wals/%f %p'
 archive_cleanup_command = 'pg_archivecleanup /var/lib/postgresql/wals/ %r'
 standby_mode = on

I use docker for tests and run the standby with DEBUG3 (full log in attachments).

I run these commands in the test:

 PGVER=9.4 docker-compose up

 docker-compose exec -u postgres master psql \
  -c "create database test"

 docker-compose exec -u postgres master psql -d test \
  -c "create table xxx (id serial, v text)"

 # we need some recycled WALs
 docker-compose exec -u postgres master psql -d test \
  -c "insert into xxx (v) select 'this is long line for fill space ' || n from generate_series(1, 5000000) n"

 docker-compose exec -u postgres standby psql -c "checkpoint"

 docker-compose exec -u postgres master psql -d test \
  -c "insert into xxx (v) select 'this is long line for fill space ' || n from generate_series(1, 1000000) n; truncate xxx;"

 docker-compose logs standby | grep 'restored log file' | sort | uniq -c | awk '$1 > 1 { print }'

 docker-compose down -v

The log has two duplicated WALs:
 4 | 000000010000000000000024
 3 | 00000001000000000000002B

========================================
DEBUG: switched WAL source from stream to archive after failure
DEBUG: executing restore command "cp /var/lib/postgresql/wals/000000010000000000000024 pg_xlog/RECOVERYXLOG"
LOG: restored log file "000000010000000000000024" from archive
DEBUG: got WAL segment from archive
DEBUG: skipping restartpoint, already performed at 0/22258058
DEBUG: executing restore command "cp /var/lib/postgresql/wals/000000010000000000000025 pg_xlog/RECOVERYXLOG"
cp: cannot stat ‘/var/lib/postgresql/wals/000000010000000000000025’: No such file or directory
DEBUG: could not restore file "000000010000000000000025" from archive: child process exited with exit code 1
DEBUG: unexpected pageaddr 0/10000000 in log segment 000000010000000000000025, offset 0
DEBUG: switched WAL source from archive to stream after failure
DEBUG: switched WAL source from stream to archive after failure
DEBUG: executing restore command "cp /var/lib/postgresql/wals/000000010000000000000024 pg_xlog/RECOVERYXLOG"
LOG: restored log file "000000010000000000000024" from archive
DEBUG: got WAL segment from archive
DEBUG: skipping restartpoint, already performed at 0/22258058
DEBUG: executing restore command "cp /var/lib/postgresql/wals/000000010000000000000025 pg_xlog/RECOVERYXLOG"
cp: cannot stat ‘/var/lib/postgresql/wals/000000010000000000000025’: No such file or directory
DEBUG: could not restore file "000000010000000000000025" from archive: child process exited with exit code 1
DEBUG: unexpected pageaddr 0/10000000 in log segment 000000010000000000000025, offset 0
DEBUG: switched WAL source from archive to stream after failure
DEBUG: switched WAL source from stream to archive after failure
DEBUG: executing restore command "cp /var/lib/postgresql/wals/000000010000000000000024 pg_xlog/RECOVERYXLOG"
LOG: restored log file "000000010000000000000024" from archive
DEBUG: got WAL segment from archive
DEBUG: skipping restartpoint, already performed at 0/22258058
========================================

========================================
LOG: restartpoint starting: xlog
DEBUG: performing replication slot checkpoint
DEBUG: switched WAL source from stream to archive after failure
DEBUG: executing restore command "cp /var/lib/postgresql/wals/00000001000000000000002C pg_xlog/RECOVERYXLOG"
cp: cannot stat ‘/var/lib/postgresql/wals/00000001000000000000002C’: No such file or directory
DEBUG: could not restore file "00000001000000000000002C" from archive: child process exited with exit code 1
LOG: unexpected pageaddr 0/21000000 in log segment 00000001000000000000002C, offset 0
DEBUG: switched WAL source from archive to stream after failure
DEBUG: switched WAL source from stream to archive after failure
DEBUG: executing restore command "cp /var/lib/postgresql/wals/00000001000000000000002B pg_xlog/RECOVERYXLOG"
LOG: restored log file "00000001000000000000002B" from archive
DEBUG: got WAL segment from archive
DEBUG: executing restore command "cp /var/lib/postgresql/wals/00000001000000000000002C pg_xlog/RECOVERYXLOG"
cp: cannot stat ‘/var/lib/postgresql/wals/00000001000000000000002C’: No such file or directory
DEBUG: could not restore file "00000001000000000000002C" from archive: child process exited with exit code 1
DEBUG: unexpected pageaddr 0/21000000 in log segment 00000001000000000000002C, offset 0
DEBUG: switched WAL source from archive to stream after failure
DEBUG: switched WAL source from stream to archive after failure
DEBUG: executing restore command "cp /var/lib/postgresql/wals/00000001000000000000002B pg_xlog/RECOVERYXLOG"
LOG: restored log file "00000001000000000000002B" from archive
DEBUG: got WAL segment from archive
DEBUG: executing restore command "cp /var/lib/postgresql/wals/00000001000000000000002C pg_xlog/RECOVERYXLOG"
LOG: restored log file "00000001000000000000002C" from archive
DEBUG: got WAL segment from archive
========================================

--
Sergey Burladyan
On Sat, Dec 30, 2017 at 04:30:07AM +0300, Sergey Burladyan wrote:
> We use these scripts:
> https://github.com/avito-tech/dba-utils/tree/master/pg_archive
>
> But I can reproduce the problem with a simple cp & mv:
> archive_command:
>  test ! -f /var/lib/postgresql/wals/%f && \
>  test ! -f /var/lib/postgresql/wals/%f.tmp && \
>  cp %p /var/lib/postgresql/wals/%f.tmp && \
>  mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f

This is unsafe. PostgreSQL expects the archived WAL segment to be flushed
to disk once the archive command has returned its result to the backend.
Don't be surprised if you get corrupted instances or if you are not able to
recover up to a consistent point when you need to roll in a backup. Note
that the PostgreSQL documentation provides a simple example of an archive
command, which is itself not reliable enough to use as-is.

--
Michael
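As an illustration of that requirement, a variant of the cp & mv example above that flushes the segment data to disk before the final rename might look like the sketch below (using GNU dd's conv=fsync). This is only a sketch: it still does not fsync the archive directory after the mv, which a fully durable archiver, such as the updated pg_archive script mentioned in the reply below, should also take care of.

    archive_command:
     test ! -f /var/lib/postgresql/wals/%f && \
     dd if=%p of=/var/lib/postgresql/wals/%f.tmp conv=fsync 2>/dev/null && \
     mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f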
Michael Paquier <michael.paquier@gmail.com> writes:
> On Sat, Dec 30, 2017 at 04:30:07AM +0300, Sergey Burladyan wrote:
>> We use these scripts:
>> https://github.com/avito-tech/dba-utils/tree/master/pg_archive
>>
>> But I can reproduce the problem with a simple cp & mv:
>> archive_command:
>>  test ! -f /var/lib/postgresql/wals/%f && \
>>  test ! -f /var/lib/postgresql/wals/%f.tmp && \
>>  cp %p /var/lib/postgresql/wals/%f.tmp && \
>>  mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f
>
> This is unsafe. PostgreSQL expects the archived WAL segment to be flushed
> to disk once the archive command has returned its result to the backend.

Yes, you are right, thank you for pointing that out! I uploaded a new
version with sync to GitHub.

> Don't be surprised if you get corrupted instances or if you are not able to
> recover up to a consistent point when you need to roll in a backup.

But only if the archive server was rebooted unexpectedly, am I right?

--
Sergey Burladyan
On Sun, Dec 31, 2017 at 10:22:37AM +0300, Sergey Burladyan wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Don't be surprised if you get corrupted instances or if you are not able to
>> recover up to a consistent point when you need to roll in a backup.
>
> But only if the archive server was rebooted unexpectedly, am I right?

Or unplugged, which happens quite often because of human mistakes.
Maintaining scripts that do backup and restore requires quite a lot of
specialized knowledge. Solutions like barman, WAL-E or pgBackRest, or even
what core provides with, say, pg_receivewal, are battle-proven and have
hundreds of man-hours behind them from many people. I suspect that there
are other gems hidden in your scripts, so be careful.

--
Michael
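As a rough illustration of the in-core option mentioned above, streaming WAL from the master into an archive directory could look like this (pg_receivexlog on 9.4, renamed pg_receivewal in PostgreSQL 10); the host, user, directory, and slot name here are placeholders:

    # create a physical replication slot on the master so no segment is missed
    psql -h master.example.com -U postgres \
         -c "SELECT pg_create_physical_replication_slot('wal_archive')"

    # stream WAL from the master into the archive directory
    pg_receivexlog -h master.example.com -U replication_user \
        -D /var/lib/postgresql/wals --slot=wal_archive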
I think I found what happened here.

One WAL record can be split between WAL files. In XLogReadRecord, if the
last WAL record is incomplete, it tries to get the next WAL:

        /* Copy the first fragment of the record from the first page. */
        memcpy(state->readRecordBuf,
               state->readBuf + RecPtr % XLOG_BLCKSZ, len);
        buffer = state->readRecordBuf + len;
        gotlen = len;

        do
        {
            /* Calculate pointer to beginning of next page */
            targetPagePtr += XLOG_BLCKSZ;

            /* Wait for the next page to become available */
            readOff = ReadPageInternal(state, targetPagePtr,
                                       Min(total_len - gotlen + SizeOfXLogShortPHD,
                                           XLOG_BLCKSZ));

            if (readOff < 0)
                goto err;

but in my case the next WAL is not yet in the archive (archive_timeout=0 on
the master) and the reader invalidates its cached page:

err:

    /*
     * Invalidate the xlog page we've cached. We might read from a different
     * source after failure.
     */
    state->readSegNo = 0;
    state->readOff = 0;
    state->readLen = 0;

PG switches to streaming while the last WAL (00000001000000000000002B, for
example) is still not completely replayed. We do not use streaming, so it
switches back to archive:

LOG: restored log file "00000001000000000000002B" from archive
...
DEBUG: could not restore file "00000001000000000000002C" from archive: child process exited with exit code 1
DEBUG: switched WAL source from archive to stream after failure
DEBUG: switched WAL source from stream to archive after failure

Now it must reread the first part of that last WAL record from
00000001000000000000002B, but XLogFileReadAnyTLI _always_ reads from the
archive first, even if this file is already in pg_xlog:

        if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
        {
            fd = XLogFileRead(segno, emode, tli,
                              XLOG_FROM_ARCHIVE, true);
            if (fd != -1)
            {
                elog(DEBUG1, "got WAL segment from archive");
                if (!expectedTLEs)
                    expectedTLEs = tles;
                return fd;
            }
        }

        if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_XLOG)
        {
            fd = XLogFileRead(segno, emode, tli,
                              XLOG_FROM_PG_XLOG, true);
            if (fd != -1)
            {
                if (!expectedTLEs)
                    expectedTLEs = tles;
                return fd;
            }
        }

Well, I think we'll be able to cache the last WAL file locally in
restore_command if needed :-)

--
Sergey Burladyan
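A rough sketch of that local-cache idea: a restore_command wrapper that keeps a copy of the most recently restored segment and serves a repeated request for the same segment without contacting the archive again. The script name, paths, and cache directory are hypothetical, and the real pg_archive scripts fetch from a remote archive server rather than a local directory:

    #!/bin/sh
    # restore_cached.sh %f %p  -- hypothetical wrapper illustrating the caching idea
    WAL="$1"; TARGET="$2"
    ARCHIVE=/var/lib/postgresql/wals
    CACHE=/var/lib/postgresql/last_wal_cache

    mkdir -p "$CACHE"

    # serve a repeated request for the same segment from the local copy
    if [ -f "$CACHE/$WAL" ]; then
        exec cp "$CACHE/$WAL" "$TARGET"
    fi

    # normal path: fetch from the archive, then remember this segment
    cp "$ARCHIVE/$WAL" "$TARGET" || exit 1
    rm -f "$CACHE"/*            # keep only the most recent segment
    cp "$TARGET" "$CACHE/$WAL"
    exit 0

recovery.conf would then use restore_command = '/usr/local/bin/restore_cached.sh %f %p'.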
On Sat, Dec 30, 2017 at 4:20 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Dec 30, 2017 at 04:30:07AM +0300, Sergey Burladyan wrote:
>> We use these scripts:
>> https://github.com/avito-tech/dba-utils/tree/master/pg_archive
>>
>> But I can reproduce the problem with a simple cp & mv:
>> archive_command:
>>  test ! -f /var/lib/postgresql/wals/%f && \
>>  test ! -f /var/lib/postgresql/wals/%f.tmp && \
>>  cp %p /var/lib/postgresql/wals/%f.tmp && \
>>  mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f
> This is unsafe. PostgreSQL expects the archived WAL segment to be flushed
> to disk once the archive command has returned its result to the backend.
> Don't be surprised if you get corrupted instances or if you are not able to
> recover up to a consistent point when you need to roll in a backup. Note
> that the PostgreSQL documentation provides a simple example of an archive
> command, which is itself not reliable enough to use as-is.
True, but that doesn't explain the current situation, as it reproduces without an OS-level crash, so a missing sync would not be relevant. (And on some systems, mv'ing a file will force it to be synced under some conditions, so it might be safe anyway.)
I thought I'd seen something recently in the mailing lists or commit log about an off-by-one error which causes it to re-fetch the previous file rather than the current file if the previous file ends with just the right type of record and amount of padding. But now I can't find it.
Cheers,
Jeff