Re: Full page images in WAL & Cache Invalidation - Mailing list pgsql-hackers

From Florian G. Pflug
Subject Re: Full page images in WAL & Cache Invalidation
Date
Msg-id 46A4ADD0.8040500@phlo.org
In response to Re: Full page images in WAL & Cache Invalidation  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Full page images in WAL & Cache Invalidation  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>>> Anyway, if you believe that DDL is infrequent, why are you resistant
>>> to the idea of WAL-logging cache flushes?
> 
>> First, cache invalidations are not the only problem caused by replaying 
>> system-table updates. The whole SnapshotNow
business doesn't exactly make things easier either. So it feels like a
>> lot of added complexity and code for little gain - unless a *lot*
>> more things (like locking requests) are logged too.
> 
> The mention of locking requests brought to mind the following
> gedankenexperiment:
> 
> 1. slave server backend is running some long-running query on table X.
> 
> 2. WAL-reading process receives and executes DROP TABLE X.
> 
> (It doesn't even have to be a DROP; most varieties of ALTER are enough
> to create problems for a concurrently-running query.)
> 
> It's really hard to see how to defend against that without a fairly
> complete simulation of locking on the slave side.

Well, locking on the slave is a bit easier than on the master, for
two reasons:

1) Queries running on the slave only lock in AccessShareLock mode -
   therefore, you only need to know whether a certain mode conflicts
   with AccessShareLock - and there seems to be only one that does,
   namely AccessExclusiveLock. So we really only need to care about
   AccessExclusiveLock locks on the master (see the sketch after this
   list).

2) As far as I can see, the point of an AccessExclusiveLock is *not*
   actually to prevent queries from running while a DDL statement is
   *executed*, but rather to prevent queries from running while the
   statement is *committed*. This fits the fact that system tables are
   read using SnapshotNow (not SnapshotDirty) - while the DDL is
   running, everybody is happily using the old information; the
   trouble only starts after the commit, because with SnapshotNow you
   suddenly see the new state.
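To make (1) concrete, here is a minimal sketch of the conflict test
the replay process would need. The enum is a simplified stand-in for
PostgreSQL's real lock-mode definitions, and the function name is
made up:

    #include <stdbool.h>

    /* Simplified stand-ins for PostgreSQL's eight lock modes, in
       increasing strength. */
    typedef enum LockMode
    {
        AccessShareLock = 1,
        RowShareLock,
        RowExclusiveLock,
        ShareUpdateExclusiveLock,
        ShareLock,
        ShareRowExclusiveLock,
        ExclusiveLock,
        AccessExclusiveLock
    } LockMode;

    /* Slave queries only ever hold AccessShareLock, and the only mode
       that conflicts with AccessShareLock is AccessExclusiveLock - so
       this is the entire conflict check the replay process needs. */
    static bool
    ConflictsWithSlaveQueries(LockMode masterMode)
    {
        return masterMode == AccessExclusiveLock;
    }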
 

I'm not yet 100% sure that (2) holds (with the exception of VACUUM
FULL) - but I'm fairly confident, because if (2) were wrong, then how
would the system survive a crash during the execution of a DDL
statement?

So after a bit more thought (and reading - thanks for all the
replies, guys! They are greatly appreciated), I came up with the
following plan for both inval events and locks (a rough sketch
follows the list):

.) Store two flags in the commit record of a transaction, for
   "transaction generated inval events" and "transaction held an
   access exclusive lock".

.) Upon replay, block until no transactions are running (for
   "transaction held an exclusive lock") before replaying the record,
   or flush the caches after replaying it (for "transaction generated
   inval events").
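As a rough illustration, the replay side could look like the sketch
below. All names are hypothetical - in reality the two flags would
have to be squeezed into the existing commit-record layout:

    #include <stdbool.h>

    /* Hypothetical commit record carrying just the two new flags. */
    typedef struct CommitRecord
    {
        bool generated_inval_events;  /* txn generated inval events */
        bool held_access_exclusive;   /* txn held AccessExclusiveLock */
    } CommitRecord;

    /* Stand-ins for the slave-side machinery. */
    static void WaitUntilNoQueriesRunning(void) { }
    static void ApplyCommit(const CommitRecord *rec) { (void) rec; }
    static void FlushCatalogCaches(void) { }

    static void
    ReplayCommit(const CommitRecord *rec)
    {
        /* A committing AccessExclusiveLock holder must not overtake
           running slave queries - block until they are gone. */
        if (rec->held_access_exclusive)
            WaitUntilNoQueriesRunning();

        ApplyCommit(rec);

        /* Caches only go stale once the commit becomes visible, so
           flush after replaying the record. */
        if (rec->generated_inval_events)
            FlushCatalogCaches();
    }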
 

This scheme has two really nice properties:
First, it can be extended fairly easily to store not just a simple
flag but a list of OIDs, and to use that to make the flushing and
blocking more fine-grained - without changing the basic way in which
things work.
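For instance, the fine-grained variant might carry per-relation OID
arrays instead of the flags (a purely hypothetical layout):

    #include <stdint.h>

    typedef uint32_t Oid;   /* matches PostgreSQL's Oid */

    /* Hypothetical fine-grained commit record: list the affected
       relations so the slave can block or flush per relation
       instead of globally. */
    typedef struct CommitRecordOids
    {
        uint32_t num_locked;       /* rels locked AccessExclusive */
        uint32_t num_invalidated;  /* rels with inval events */
        Oid      oids[];           /* locked OIDs, then invalidated */
    } CommitRecordOids;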

And second, it stores all the information needed in the *commit*
record. That avoids problems with transactions aborted by a crash,
for which the WAL never contains an abort record.

VACUUM FULL will need some special care - but at least VACUUM FULL is
already such a disruptive operation that it probably won't surprise
anyone if it's disruptive on the slave too. (And now that CLUSTER is
MVCC-safe, from what I read, the use case for VACUUM FULL seems to be
pretty slim.)

The btree metapage caching will need special treatment too - probably
some flags on the WAL records that change the metapage, instructing
the slave to synthesize a suitable inval event, as sketched below.
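Something along these lines, perhaps (the flag, the record layout and
the function names are all invented):

    #include <stdint.h>

    typedef uint32_t Oid;

    /* Hypothetical flag on btree WAL records touching the metapage. */
    #define BTREE_META_CHANGED 0x01

    typedef struct BtreeWalRecord
    {
        uint8_t flags;
        Oid     index_oid;
    } BtreeWalRecord;

    /* Stand-ins for the real replay and cache machinery. */
    static void ApplyBtreeRecord(const BtreeWalRecord *rec) { (void) rec; }
    static void DiscardCachedMetapage(Oid indexOid) { (void) indexOid; }

    static void
    ReplayBtreeRecord(const BtreeWalRecord *rec)
    {
        ApplyBtreeRecord(rec);

        /* Synthesize the inval event the master never logged: drop
           the cached metapage for this index. */
        if (rec->flags & BTREE_META_CHANGED)
            DiscardCachedMetapage(rec->index_oid);
    }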

What stays open is regular VACUUM (and maybe inline vacuuming - is
that part of the latest version of HOT, or did it get removed?).
Here, the key is logging the return value of GetOldestXmin(), I
think. Since that value is what ultimately decides whether a tuple
can be killed, having it available on the slave should allow the
slave to block replay until no slave query depends on the tuple
anymore.
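A sketch of that blocking rule, ignoring transaction-ID wraparound
and using invented helper names:

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Stand-ins: the oldest xmin any running slave query still
       needs, and a way to wait for a query to finish. */
    static TransactionId SlaveOldestRequiredXmin(void) { return 0; }
    static void WaitForSomeQueryToFinish(void) { }

    /* Before replaying a VACUUM record, wait until no slave query
       depends on tuples that the master's logged GetOldestXmin()
       value allowed to be killed. */
    static void
    WaitBeforeVacuumReplay(TransactionId masterOldestXmin)
    {
        while (SlaveOldestRequiredXmin() < masterOldestXmin)
            WaitForSomeQueryToFinish();
    }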

greetings, Florian Pflug



