Thread: Why standby restores some WALs many times from archive?

Why standby restores some WALs many times from archive?

From
Victor Yagofarov
Date:
Hi,

I tried to ask a question in pgsql-general@ but got no response.

I have postgres 9.4 standby with archive-based replication (via restore_command).
PostgreSQL 9.4.12 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit

One day I noticed strange behavior in postgres logs: the same WAL has been restored many times:

"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"unexpected pageaddr 64A81/F9000000 in log segment 0000000200064AA80000005E, offset 0",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005D"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005E"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA80000005F"" from archive",,,,,,,,,""
"restored log file ""0000000200064AA800000060"" from archive",,,,,,,,,""


I enabled logging of successful execution in 'restore_command' script and realized that it's really happens: some WALs
arerecovering again and again.
 
And this is always connected with 'unexpected pageaddr' log message.

Some statistics of how many 'same WALs' have been restored in 3 minutes:
[vyagofarov(at)myserver:~] $  cut -f 2 -d '-' wals_restored.log | uniq -c -d | sort -r
      9  0000000200064AA9000000A4
      9  0000000200064AA80000005D
      8  0000000200064AAA00000044
      7  0000000200064AAA000000AB
      6  0000000200064AAA000000ED
      3  0000000200064AA7000000E8
      2  0000000200064AA80000003D
      2  0000000200064AA800000033
      2  0000000200064AA800000028
      2  0000000200064AA800000018
      2  0000000200064AA800000004
      2  0000000200064AA7000000FD

Is this a bug? It can be quite high traffic if the archive server is used by >10 clusters.


-- 
Best regards,
Victor Yagofarov


Re: Why standby restores some WALs many times from archive?

From
Jeff Janes
Date:
On Thu, Dec 28, 2017 at 7:02 AM, Victor Yagofarov <xnasx@yandex.ru> wrote:
Hi,

I tried to ask a question in pgsql-general@ but got no response.

I have postgres 9.4 standby with archive-based replication (via restore_command).

Can you show us both your archive_command and your restore_command?

Cheers,

Jeff

Re: Why standby restores some WALs many times from archive?

From
Sergey Burladyan
Date:
Jeff Janes <jeff.janes@gmail.com> writes:

> On Thu, Dec 28, 2017 at 7:02 AM, Victor Yagofarov <xnasx@yandex.ru> wrote:
> > I have postgres 9.4 standby with archive-based replication (via
> > restore_command).

> Can you show us both your archive_command and your restore_command?

We use this scripts:
https://github.com/avito-tech/dba-utils/tree/master/pg_archive

But I can reproduce problem with simple cp & mv:
archive_command:
  test ! -f /var/lib/postgresql/wals/%f && \
  test ! -f /var/lib/postgresql/wals/%f.tmp && \
  cp %p /var/lib/postgresql/wals/%f.tmp && \
  mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f

recovery.conf:
  restore_command = 'cp /var/lib/postgresql/wals/%f %p'
  archive_cleanup_command = 'pg_archivecleanup /var/lib/postgresql/wals/ %r'
  standby_mode = on

I use docker for tests and run standby with DEBUG3 (full log in attachments).

I run this commands in test:
PGVER=9.4 docker-compose up
    docker-compose exec -u postgres master psql \
        -c "create database test"
    docker-compose exec -u postgres master psql -d test \
        -c "create table xxx (id serial, v text)"
    # we need some recycled WALs
    docker-compose exec -u postgres master psql -d test \
        -c "insert into xxx (v) select 'this is long line for fill space ' || n from generate_series(1, 5000000) n"
    docker-compose exec -u postgres standby psql -c "checkpoint"
    docker-compose exec -u postgres master psql -d test \
        -c "insert into xxx (v) select 'this is long line for fill space ' || n from generate_series(1, 1000000) n;
truncatexxx;" 
    docker-compose logs standby | grep 'restored log file' | sort | uniq -c | awk '$1 > 1 { print }'
docker-compose down -v

Log have two duplicated WALS:
4 | 000000010000000000000024
3 | 00000001000000000000002B

========================================
DEBUG:  switched WAL source from stream to archive after failure
DEBUG:  executing restore command "cp /var/lib/postgresql/wals/000000010000000000000024 pg_xlog/RECOVERYXLOG"
LOG:  restored log file "000000010000000000000024" from archive
DEBUG:  got WAL segment from archive
DEBUG:  skipping restartpoint, already performed at 0/22258058
DEBUG:  executing restore command "cp /var/lib/postgresql/wals/000000010000000000000025 pg_xlog/RECOVERYXLOG"
cp: cannot stat ‘/var/lib/postgresql/wals/000000010000000000000025’: No such file or directory
DEBUG:  could not restore file "000000010000000000000025" from archive: child process exited with exit code 1
DEBUG:  unexpected pageaddr 0/10000000 in log segment 000000010000000000000025, offset 0
DEBUG:  switched WAL source from archive to stream after failure
DEBUG:  switched WAL source from stream to archive after failure
DEBUG:  executing restore command "cp /var/lib/postgresql/wals/000000010000000000000024 pg_xlog/RECOVERYXLOG"
LOG:  restored log file "000000010000000000000024" from archive
DEBUG:  got WAL segment from archive
DEBUG:  skipping restartpoint, already performed at 0/22258058
DEBUG:  executing restore command "cp /var/lib/postgresql/wals/000000010000000000000025 pg_xlog/RECOVERYXLOG"
cp: cannot stat ‘/var/lib/postgresql/wals/000000010000000000000025’: No such file or directory
DEBUG:  could not restore file "000000010000000000000025" from archive: child process exited with exit code 1
DEBUG:  unexpected pageaddr 0/10000000 in log segment 000000010000000000000025, offset 0
DEBUG:  switched WAL source from archive to stream after failure
DEBUG:  switched WAL source from stream to archive after failure
DEBUG:  executing restore command "cp /var/lib/postgresql/wals/000000010000000000000024 pg_xlog/RECOVERYXLOG"
LOG:  restored log file "000000010000000000000024" from archive
DEBUG:  got WAL segment from archive
DEBUG:  skipping restartpoint, already performed at 0/22258058
========================================

========================================
LOG:  restartpoint starting: xlog
DEBUG:  performing replication slot checkpoint
DEBUG:  switched WAL source from stream to archive after failure
DEBUG:  executing restore command "cp /var/lib/postgresql/wals/00000001000000000000002C pg_xlog/RECOVERYXLOG"
cp: cannot stat ‘/var/lib/postgresql/wals/00000001000000000000002C’: No such file or directory
DEBUG:  could not restore file "00000001000000000000002C" from archive: child process exited with exit code 1
LOG:  unexpected pageaddr 0/21000000 in log segment 00000001000000000000002C, offset 0
DEBUG:  switched WAL source from archive to stream after failure
DEBUG:  switched WAL source from stream to archive after failure
DEBUG:  executing restore command "cp /var/lib/postgresql/wals/00000001000000000000002B pg_xlog/RECOVERYXLOG"
LOG:  restored log file "00000001000000000000002B" from archive
DEBUG:  got WAL segment from archive
DEBUG:  executing restore command "cp /var/lib/postgresql/wals/00000001000000000000002C pg_xlog/RECOVERYXLOG"
cp: cannot stat ‘/var/lib/postgresql/wals/00000001000000000000002C’: No such file or directory
DEBUG:  could not restore file "00000001000000000000002C" from archive: child process exited with exit code 1
DEBUG:  unexpected pageaddr 0/21000000 in log segment 00000001000000000000002C, offset 0
DEBUG:  switched WAL source from archive to stream after failure
DEBUG:  switched WAL source from stream to archive after failure
DEBUG:  executing restore command "cp /var/lib/postgresql/wals/00000001000000000000002B pg_xlog/RECOVERYXLOG"
LOG:  restored log file "00000001000000000000002B" from archive
DEBUG:  got WAL segment from archive
DEBUG:  executing restore command "cp /var/lib/postgresql/wals/00000001000000000000002C pg_xlog/RECOVERYXLOG"
LOG:  restored log file "00000001000000000000002C" from archive
DEBUG:  got WAL segment from archive
========================================

--
Sergey Burladyan


Attachment

Re: Why standby restores some WALs many times from archive?

From
Michael Paquier
Date:
On Sat, Dec 30, 2017 at 04:30:07AM +0300, Sergey Burladyan wrote:
> We use this scripts:
> https://github.com/avito-tech/dba-utils/tree/master/pg_archive
>
> But I can reproduce problem with simple cp & mv:
> archive_command:
>   test ! -f /var/lib/postgresql/wals/%f && \
>   test ! -f /var/lib/postgresql/wals/%f.tmp && \
>   cp %p /var/lib/postgresql/wals/%f.tmp && \
>   mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f

This is unsafe. PostgreSQL expects the WAL segment archived to be
flushed to disk once the archive command has returned its result to the
backend. Don't be surprised if you get corrupted instances or that you
are not able to recover up to a consistent point if you need to roll in
a backup. Note that the documentation of PostgreSQL provides a simple
example of archive command, which is itself bad enough not to use.
--
Michael

Attachment

Re: Why standby restores some WALs many times from archive?

From
Sergey Burladyan
Date:
Michael Paquier <michael.paquier@gmail.com> writes:

> On Sat, Dec 30, 2017 at 04:30:07AM +0300, Sergey Burladyan wrote:
> > We use this scripts:
> > https://github.com/avito-tech/dba-utils/tree/master/pg_archive
> > 
> > But I can reproduce problem with simple cp & mv:
> > archive_command:
> >   test ! -f /var/lib/postgresql/wals/%f && \
> >   test ! -f /var/lib/postgresql/wals/%f.tmp && \
> >   cp %p /var/lib/postgresql/wals/%f.tmp && \
> >   mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f
>
> This is unsafe. PostgreSQL expects the WAL segment archived to be
> flushed to disk once the archive command has returned its result to the
> backend.

Yes, you are right, thank you for pointing that out! I upload new
version with sync to github.

> Don't be surprised if you get corrupted instances or that you
> are not able to recover up to a consistent point if you need to roll in
> a backup.

But only if archive was reboot unexpectedly, am I right?

-- 
Sergey Burladyan


Re: Why standby restores some WALs many times from archive?

From
Michael Paquier
Date:
On Sun, Dec 31, 2017 at 10:22:37AM +0300, Sergey Burladyan wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Don't be surprised if you get corrupted instances or that you
>> are not able to recover up to a consistent point if you need to roll in
>> a backup.
>
> But only if archive was reboot unexpectedly, am I right?

Or unplugged, which happens quite often because of human mistakes.
Maintaining scripts that do backup and restore require quite a lot of
specialized knowledge. Solutions like barman, WAL-E or pgBackRest, or
even what in-core has with say pg_receivewal are battle-proven and have
hundreds of man-hours behind from many people. I suspect that there are
other gems hidden in your scripts. So be careful.
--
Michael

Attachment

Re: Why standby restores some WALs many times from archive?

From
Sergey Burladyan
Date:
I think I found what happened here.

One WAL record can be split between WAL files.

In XLogReadRecord, if last WAL record is incomplete, it try to get next WAL:
        /* Copy the first fragment of the record from the first page. */
        memcpy(state->readRecordBuf,
               state->readBuf + RecPtr % XLOG_BLCKSZ, len);
        buffer = state->readRecordBuf + len;
        gotlen = len;

        do
        {
            /* Calculate pointer to beginning of next page */
            targetPagePtr += XLOG_BLCKSZ;

            /* Wait for the next page to become available */
            readOff = ReadPageInternal(state, targetPagePtr,
                                 Min(total_len - gotlen + SizeOfXLogShortPHD,
                                     XLOG_BLCKSZ));

            if (readOff < 0)
                goto err;

but in my case next WAL not yet in archive (archive_timeout=0 in master)
and it clean cache:
err:

    /*
     * Invalidate the xlog page we've cached. We might read from a different
     * source after failure.
     */
    state->readSegNo = 0;
    state->readOff = 0;
    state->readLen = 0;

PG switch to streaming and last WAL (00000001000000000000002B for
example) still not replayed completely. We do not use streaming and it
switch back to archive:
LOG:  restored log file "00000001000000000000002B" from archive
...
DEBUG:  could not restore file "00000001000000000000002C" from archive: child process exited with exit code 1
DEBUG:  switched WAL source from archive to stream after failure
DEBUG:  switched WAL source from stream to archive after failure

Now it must reread first part of last WAL record from
00000001000000000000002B, but XLogFileReadAnyTLI is _always_ read
from the archive first, even if this file is already in pg_xlog:
        if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
        {
            fd = XLogFileRead(segno, emode, tli,
                              XLOG_FROM_ARCHIVE, true);
            if (fd != -1)
            {
                elog(DEBUG1, "got WAL segment from archive");
                if (!expectedTLEs)
                    expectedTLEs = tles;
                return fd;
            }
        }

        if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_XLOG)
        {
            fd = XLogFileRead(segno, emode, tli,
                              XLOG_FROM_PG_XLOG, true);
            if (fd != -1)
            {
                if (!expectedTLEs)
                    expectedTLEs = tles;
                return fd;
            }
        }

Well, I think we'll be able to cache locally the last WAL file in restore_command
if needed :-)

-- 
Sergey Burladyan


Re: Why standby restores some WALs many times from archive?

From
Jeff Janes
Date:
On Sat, Dec 30, 2017 at 4:20 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sat, Dec 30, 2017 at 04:30:07AM +0300, Sergey Burladyan wrote:
> We use this scripts:
> https://github.com/avito-tech/dba-utils/tree/master/pg_archive
>
> But I can reproduce problem with simple cp & mv:
> archive_command:
>   test ! -f /var/lib/postgresql/wals/%f && \
>   test ! -f /var/lib/postgresql/wals/%f.tmp && \
>   cp %p /var/lib/postgresql/wals/%f.tmp && \
>   mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f

This is unsafe. PostgreSQL expects the WAL segment archived to be
flushed to disk once the archive command has returned its result to the
backend. Don't be surprised if you get corrupted instances or that you
are not able to recover up to a consistent point if you need to roll in
a backup. Note that the documentation of PostgreSQL provides a simple
example of archive command, which is itself bad enough not to use.

True, but that but doesn't explain the current situation, as it reproduces without an OS level crash so a missing sync would not be relevant. (and on some systems, mv'ing a file will force it to be synced under some conditions, so it might be safe anyway)

I thought I'd seen something recently in the mail lists or commit log about an off-by-one error which causes it to re-fetch the previous file rather than the current file if the previous file ends with just the right type of record and amount of padding.  But now I can't find it. 

Cheers,

Jeff