Home > mailing lists

Re: Hot Backup with rsync fails at pg_clog if under load - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: Hot Backup with rsync fails at pg_clog if under load
Date	September 23, 2011 06:05:49
Msg-id	4E7C4BE0.4060701@enterprisedb.com Whole thread Raw
In response to	Re: Hot Backup with rsync fails at pg_clog if under load (Florian Pflug <fgp@phlo.org>)
Responses	Re: Hot Backup with rsync fails at pg_clog if under load
List	pgsql-hackers

Tree view

On 23.09.2011 11:48, Florian Pflug wrote:
> On Sep23, 2011, at 10:41 , Heikki Linnakangas wrote:
>> On 23.09.2011 11:02, Linas Virbalas wrote:
>>> On 9/22/11 6:59 PM, "Euler Taveira de Oliveira"<euler@timbira.com>   wrote:
>>>
>>>>> If needed, I could do that, if I had the exact procedure... Currently,
>>>>> during the start of the backup I take the following information:
>>>>>
>>>> Just show us the output of pg_start_backup and part of the standby log with
>>>> the following message 'redo starts at' and the subsequent messages up to the
>>>> failure.
>>>
>>> Unfortunately, it's impossible, because the error message "Could not read
>>> from file "pg_clog/0001" at offset 32768: Success" is shown (and startup
>>> aborted) before the turn for "redo starts at" message arrives.
>>
>> It looks to me that pg_clog/0001 exists, but it shorter than recovery expects. Which shouldn't happen, of course,
becausethe start-backup checkpoint should flush all the clog that's needed by recovery to disk before the backup
procedurebegins to them.

>
> Yeah. What confuses me though is that we fail while trying to read from the clog. When do we do that during normal
(non-standby)recovery?

I believe the OP is setting up a standby. He didn't say if it's a hot 
standby, though. That changes the startup sequence a bit.

At the end of recovery, we read the "last" clog page, containing the 
next XID that will be assigned to a transaction, and zero out the future 
part of that clog file (StartupCLOG). In hot standby, we do that 
earlier, after reading the checkpoint record but before we start 
replaying records.

> One other possibility would be that the problem is somehow triggered by vacuum running while the start-backup
checkpointis commencing. That'd explain why the problem goes away with immediate checkpoints, because those make the
windowsmach smaller. But I haven't got a concrete theory of whats happening..

Hmm, that's a good theory. The logic in SimpleLruPhysicakReadPage() 
tolerates non-existent files, because the old clog files might've been 
truncated away by a vacuum after the backup started. That's OK, because 
they will be recreated, and later deleted again, during the WAL replay. 
But what if something like this happens:

0. pg_start_backup().
1. rsync starts. It gets a list of all files. It notes that pg_clog/0001 
exists.
2. vacuum runs, and deletes pg_clog/0001
3. rsync starts to copy pg_clog/0001. Oops, it's gone. Instead of 
leaving it out altogether, it includes it as an empty file.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

pgsql-hackers by date:

From: Florian Pflug
Date: 23 September 2011, 05:48:19
Subject: Re: Hot Backup with rsync fails at pg_clog if under load

From: Kyotaro HORIGUCHI
Date: 23 September 2011, 06:17:30
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases

Re: Hot Backup with rsync fails at pg_clog if under load - Mailing list pgsql-hackers

Previous

Next