Re: spurious(?) warnings in archive recovery - Mailing list pgsql-hackers

From Vik Fearing
Subject Re: spurious(?) warnings in archive recovery
Msg-id 153eb917-df68-ec4f-c4ee-51c0c8f45608@2ndquadrant.com
In response to spurious(?) warnings in archive recovery  (Andrew Gierth <andrew@tao11.riddles.org.uk>)
Responses Re: spurious(?) warnings in archive recovery  (Andrew Gierth <andrew@tao11.riddles.org.uk>)
List pgsql-hackers
On 13/11/2018 16:34, Andrew Gierth wrote:
> So while investigating a case of this warning (in
> UpdateMinRecoveryPoint):
> 
> "xlog min recovery request %X/%X is past current point %X/%X"
> 
> I noticed that it is issued even in cases where we know that
> minRecoveryPoint is not yet valid, for example because we're waiting to
> see XLOG_BACKUP_END before declaring consistency.
> 
> But, you'd think, you shouldn't get this error because any page we
> modify during recovery should have been restored from an FPI with a
> suitably early LSN? For data pages that is correct, but not for VM or
> (iff wal_log_hints or checksums are enabled) FSM pages.
> 
> When we replay an operation that, for example, clears a bit in the VM,
> the redo code will read in that VM page from disk, and because we're not
> yet consistent and because _clearing_ a VM bit is not in itself
> wal-logged and doesn't result in any FPI being generated for the VM
> page, it could well read a VM page that has a far-future LSN from the
> point of view of replay, and dirty it, causing a later eviction to try
> and do UpdateMinRecoveryPoint with that future LSN.
> 
> (I haven't investigated this aspect, but there also appears to be no
> protection against torn pages in the VM when checksums are enabled? am I
> missing something somewhere?)
> 
> I'm less clear on the exact mechanisms, but when wal_log_hints (or
> checksums) is on, FSM pages also get LSNs, sometimes, thanks to
> MarkBufferDirtyHint, and at least some code paths can also do
> MarkBufferDirty on FSM pages during recovery, which would cause their
> eviction with possible future LSNs as with VM pages.
> 
> This means that if you simply do an old-style base backup using
> pg_start_backup/rsync/pg_stop_backup (on a sufficiently active system
> and taking long enough) and then recover from it, you're likely to get a
> log spammed with these errors for no very good reason.
> 
> So it seems to me that issuing this error is a bug if the conditions
> described are actually harmless, while if they're not harmless, then
> obviously that is a bug. So _something_ needs fixing here, but I'm not
> yet sufficiently confident of my analysis to say what.
> 
> Opinions?
> 
> (as a further point, it seems to me that backupEndRequired is a rather
> misleadingly named variable, since what _actually_ determines whether an
> XLOG_BACKUP_END record is expected is whether backupStartPoint is set.
> backupEndRequired seems to change one error message and, questionably,
> one decision about whether to do crash recovery before entering archive
> recovery, but nothing else.)
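
[Illustration of the mechanism described in the quoted message. This is a toy model and not PostgreSQL source: the function names, the simplified check, and all LSN values are assumptions chosen for the example. It shows how evicting a dirtied VM page that carries a "future" LSN could trigger the warning while replay is still short of that LSN.]

```python
def format_lsn(lsn: int) -> str:
    """Render a 64-bit LSN in the %X/%X style used by the warning text."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def update_min_recovery_point(min_recovery_point: int,
                              request_lsn: int,
                              replay_lsn: int):
    """Toy stand-in for the check discussed above (hypothetical, simplified).

    request_lsn is the LSN of the page being flushed; replay_lsn is how far
    WAL replay has progressed. Returns (new_min_recovery_point, warning).
    """
    warning = None
    if request_lsn > replay_lsn:
        warning = ("xlog min recovery request %s is past current point %s"
                   % (format_lsn(request_lsn), format_lsn(replay_lsn)))
    # The minimum recovery point can only advance as far as replay has reached.
    return max(min_recovery_point, replay_lsn), warning


# Assumed scenario: the VM page was copied late in the base backup, so its
# on-disk LSN is ahead of the point replay has reached. Redo clears a bit,
# dirties the page without restoring it from an FPI, and a later eviction
# asks for the min recovery point to cover the page's LSN.
replay_lsn = 0x3000000
vm_page_lsn = 0x9000000
new_min, warning = update_min_recovery_point(0x2000000, vm_page_lsn, replay_lsn)
print(warning)

# A data page restored from an FPI carries an LSN at or before the replay
# point, so the same call stays silent.
_, data_warning = update_min_recovery_point(new_min, 0x2800000, replay_lsn)
print(data_warning)
```

[In this model the first call produces the warning and the second does not, which matches the distinction drawn above between VM/FSM pages and data pages restored from an FPI.]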


Bump.

I was the originator of this report.  I work with Postgres every single
day and I was spooked by these warnings.  People with much less
involvement would probably be terrified.
-- 
Vik Fearing                                          +33 6 46 75 15 36
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

