Re: A failure of standby to follow timeline switch - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: A failure of standby to follow timeline switch
Date
Msg-id 20210108200843.GA26309@alvherre.pgsql
In response to Re: A failure of standby to follow timeline switch  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: A failure of standby to follow timeline switch
List pgsql-hackers
Masao-san: Are you intending to act as committer for these?  Since the
bug is mine I can look into it, but since you already did all the
reviewing work, I'm good with you giving it the final push.

0001 looks good to me; let's get that one committed quickly so that we
can focus on the interesting stuff.  While the implementation of
find_in_log is quite dumb (not this patch's fault), it seems sufficient
to deal with small log files.  We can improve the implementation later,
if needed, but we have to get the API right on the first try.
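
For reference, the slurp-and-match approach amounts to roughly the following. This is only a standalone sketch, not the patch's code: the name find_in_log_file and the plain file-path argument are assumptions made for illustration, whereas the real find_in_log is a method on the node object.

```perl
use strict;
use warnings;

# Hypothetical standalone helper (not the patch's method): returns 1 if
# $pattern matches anywhere in the file at or after byte offset $offset.
sub find_in_log_file
{
    my ($path, $pattern, $offset) = @_;
    $offset //= 0;

    # Read the whole file into memory; fine for small logs only.
    open(my $fh, '<', $path) or die "could not open $path: $!";
    local $/;    # slurp mode
    my $contents = <$fh>;
    close($fh);

    # Nothing new has been written past the offset yet.
    return 0 if length($contents) <= $offset;

    return substr($contents, $offset) =~ m/$pattern/ ? 1 : 0;
}
```

Rereading the whole file on every call is what makes this wasteful for large logs, but the (pattern, offset) interface would stay the same if the body were later changed to seek instead.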

0003: The fix looks good to me.  I verified that the test fails without
the fix, and it passes with the fix.


The test added in 0002 is a bit optimistic regarding timing, as well as
potentially slow; it loops 1000 times and sleeps 100 milliseconds each
time.  On a very slow server (valgrind or clobber_cache animals) this
might not be enough time, while on fast servers it may end up
waiting longer than needed.  Maybe we can do something like this:

for (my $i = 0 ; $i < 1000; $i++)
{
    my $current_log_size = determine_current_log_size();

    if ($node_standby_3->find_in_log(
            "requested WAL segment [0-9A-F]+ has already been removed",
            $logstart))
    {
        last;
    }
    elsif ($node_standby_3->find_in_log(
            "End of WAL reached on timeline",
            $logstart))
    {
        $success = 1;
        last;
    }
    $logstart = $current_log_size;

    while (determine_current_log_size() == $current_log_size)
    {
        usleep(10_000);
        # with a retry count?
    }
}
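
To make the inner wait loop concrete, here is a minimal sketch of the two missing pieces, including the retry count. The assumptions are all mine: determine_current_log_size takes a file path here (in the loop above it takes no arguments), wait_for_log_growth is a made-up name, and the size is read with the -s file test.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Hypothetical helper: size in bytes of the log file, 0 if it is
# missing or empty.
sub determine_current_log_size
{
    my ($path) = @_;
    my $size = -s $path;
    return defined $size ? $size : 0;
}

# Hypothetical helper: poll every 10 ms until the file grows past
# $old_size.  Returns the new size, or undef if the file has not grown
# after $max_retries polls.
sub wait_for_log_growth
{
    my ($path, $old_size, $max_retries) = @_;

    for (my $i = 0; $i < $max_retries; $i++)
    {
        my $size = determine_current_log_size($path);
        return $size if $size > $old_size;
        usleep(10_000);
    }
    return;
}
```

With a bounded retry count the test fails after a known time if the server writes nothing more, instead of spinning forever; the right bound for the slow animals is something I have not worked out.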

With the test patch, make check PROVE_FLAGS="--timer" PROVE_TESTS=t/001_stream_rep.pl gives:

ok     6386 ms ( 0.00 usr  0.00 sys +  1.14 cusr  0.93 csys =  2.07 CPU)
ok     6352 ms ( 0.00 usr  0.00 sys +  1.10 cusr  0.94 csys =  2.04 CPU)
ok     6255 ms ( 0.01 usr  0.00 sys +  0.99 cusr  0.97 csys =  1.97 CPU)

Without the test patch:

ok     4954 ms ( 0.00 usr  0.00 sys +  0.71 cusr  0.64 csys =  1.35 CPU)
ok     5033 ms ( 0.01 usr  0.00 sys +  0.71 cusr  0.73 csys =  1.45 CPU)
ok     4991 ms ( 0.01 usr  0.00 sys +  0.73 cusr  0.59 csys =  1.33 CPU)
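
The added cost can be read off the wall-clock figures above; a throwaway script that averages the three runs of each (the numbers are copied from the timings, the variable names are arbitrary):

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Wall-clock times in ms, copied from the prove --timer output above.
my @with_patch    = (6386, 6352, 6255);
my @without_patch = (4954, 5033, 4991);

sub mean { return sum(@_) / scalar(@_); }

my $overhead_ms = mean(@with_patch) - mean(@without_patch);
printf "mean overhead: %.0f ms\n", $overhead_ms;
```

That comes to about 1.3 seconds of extra run time for the new test on this machine.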

-- 
Álvaro Herrera


