archive recovery fetching wrong segments - Mailing list pgsql-hackers

From Grigory Smolkin
Subject archive recovery fetching wrong segments
Date
Msg-id dd6690b0-ec03-6b3c-6fac-c963f91f87a7@postgrespro.ru
Whole thread Raw
Responses Re: archive recovery fetching wrong segments
List pgsql-hackers
Hello, hackers!

I`m investigating a complains from our clients about archive recovery 
speed been very slow, and I`ve noticed a really strange and, I think, a 
very dangerous recovery behavior.

When running multi-timeline archive recovery, for every requested segno 
startup process iterates through every timeline in restore target 
timeline history, starting from highest timeline and ending in current, 
and tries to fetch the segno in question from this timeline.

Consider the following example.
Timelines:
ARCHIVE INSTANCE 'node'

================================================================================================================================


  TLI  Parent TLI  Switchpoint  Min Segno                 Max 
Segno                 N segments  Size   Zratio  N backups  Status

================================================================================================================================


  3    2           0/AEFFEDE0   0000000300000000000000AE 
0000000300000000000000D5  40          41MB   15.47   0 OK
  2    1           0/A08768D0   0000000200000000000000A0 
0000000200000000000000AE  15          14MB   17.24   0 OK
  1    0           0/0          000000010000000000000001 
0000000100000000000000BB  187         159MB  18.77   1          OK


Backup:

================================================================================================================================


  Instance  Version  ID      Recovery Time           Mode  WAL Mode  
TLI  Time  Data   WAL  Zratio  Start LSN  Stop LSN   Status

================================================================================================================================


  node      11       Q8C8IH  2020-04-06 02:13:31+03  FULL ARCHIVE 1/0    
3s  23MB  16MB    1.00  0/2000028  0/30000B8  OK


So when we are trying to restore this backup, located on Timeline 1, to 
the restore target on Timeline 3, we are getting this in the PostgreSQL 
log:
....
2020-04-05 23:24:36 GMT [28508]: [5-1] LOG:  restored log file 
"00000003.history" from archive
INFO: PID [28511]: pg_probackup archive-get WAL file: 
000000030000000000000002, remote: none, threads: 1/1, batch: 20
ERROR: PID [28511]: pg_probackup archive-get failed to deliver WAL file 
000000030000000000000002, prefetched: 0/20, time elapsed: 0ms
INFO: PID [28512]: pg_probackup archive-get WAL file: 
000000020000000000000002, remote: none, threads: 1/1, batch: 20
ERROR: PID [28512]: pg_probackup archive-get failed to deliver WAL file 
000000020000000000000002, prefetched: 0/20, time elapsed: 0ms
INFO: PID [28513]: pg_probackup archive-get WAL file: 
000000010000000000000002, remote: none, threads: 1/1, batch: 20
INFO: PID [28513]: pg_probackup archive-get copied WAL file 
000000010000000000000002
2020-04-05 23:24:36 GMT [28508]: [6-1] LOG:  restored log file 
"000000010000000000000002" from archive
...

Before requesting 000000010000000000000002 recovery tries to fetch 
000000030000000000000002 and 000000020000000000000002 and that goes for 
every segment, restored from the archive.
This tremendously slows down recovery speed, especially if archive is 
located on remote machine with high latency network.
And it also may lead to feeding recovery with wrong WAL segment, located 
on the next timeline.

Is there a reason behind this behavior?

Also I`ve  attached a patch, which fixed this issue for me, but I`m not 
sure, that chosen approach is sound and didn`t break something.

-- 
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: CLOBBER_CACHE_ALWAYS regression instability
Next
From: Andres Freund
Date:
Subject: Re: CLOBBER_CACHE_ALWAYS regression instability