Home > mailing lists

Re: [patch] Fix pg_checksums to allow checking of offline basebackup directories - Mailing list pgsql-hackers

From	Michael Banck
Subject	Re: [patch] Fix pg_checksums to allow checking of offline basebackup directories
Date	April 7, 2020 12:02:23
Msg-id	24abd557a1a18d664683369249345dd703085cd8.camel@credativ.de Whole thread Raw
In response to	Re: [patch] Fix pg_checksums to allow checking of offline basebackup directories (Michael Paquier <michael@paquier.xyz>)
List	pgsql-hackers

Tree view

Hi,

Am Dienstag, den 07.04.2020, 17:07 +0900 schrieb Michael Paquier:
> On Mon, Apr 06, 2020 at 01:26:17PM +0200, Michael Banck wrote:
> > I think we can allow checking of base backups if we make sure
> > backup_label exists in the data directory or am I missing something?
> > I think we need to have similar checks about pages changed during base
> > backup, so this patch ignores checksum failures between the checkpoint
> > LSN and (as a reasonable upper bound) the last LSN of the last existing
> > transaction log file. If no xlog files exist (the --wal-method=none
> > case), the last LSN of the checkpoint WAL segment is taken.
> 
> Have you considered that backup_label files can exist in the data
> directory of a live cluster?  That's not the case with pg_basebackup
> or non-exclusive backups with the SQL interface, but that's possible
> with the SQL interface and an exclusive backup running.

I see, that's what I was missing. I think it is unfortunate that
pg_control does not record an ongoing (base)backup in the state or
elsewhere. Maybe one could look at the `BACKUP METHOD' field in
backup_label, which is (always?) `pg_start_backup' for the non-exclusive 
backup and `streamed' for pg_basebackup.

> FWIW, my take on this matter is that you should consider checksum
> verification as one step to check the sanity of a base backup, meaning
> that you have to restore the base backup first, then let it reach its
> consistent LSN position, and finally stop the cluster cleanly to make
> sure that everything is safely flushed on disk and consistent.

That's a full restore and it should certainly be encouraged that
organizations do full restore tests regularly, but (not only) if you
have lots of big instances, that is often not the case.

So I think making it easier to check plain base backups would be
helpful, even if some part of  recently changed data might not get
checked.

> Attempting to verify checksums from a raw base backup would most
> likely lead to false positives, and my guess is that your patch has
> issues in this area.  Hint at quick glance: the code path setting
> insertLimitLSN where you actually don't use any APIs from
> xlogreader.h.

I evaluated using xlogreader to fetch the BACKUP STOP position from the
WAL but then discarded that for now as possibly being overkill and went
the route of a slightly larger upper bound by taking the following WAL
segment and not the BACKUP STOP position. But I can take a look at
implementing the more fine-grained method if needed.

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

pgsql-hackers by date:

From: Peter Eisentraut
Date: 07 April 2020, 12:01:02
Subject: Re: adding partitioned tables to publications

From: Julien Rouhaud
Date: 07 April 2020, 12:18:24
Subject: Re: WAL usage calculation patch

Re: [patch] Fix pg_checksums to allow checking of offline basebackup directories - Mailing list pgsql-hackers

Previous

Next