Thread: Full page images in WAL & Cache Invalidation

Full page images in WAL & Cache Invalidation

From: "Florian G. Pflug"
Hi

I'm currently working on correctly flushing the
catalog/relation/smgr caches on a read-only PITR
slave during recovery. These are the things that
currently cause me headaches.

1) It seems that the btree code sends out relcache
    invalidation msgs during normal operation
    (No DDL statements are executed). This lets any
    simple flush-all-caches-if-ddl-was-executed scheme
    fail.

2) When a full page image is written to the WAL, the
    information about what tuple was updated is lost.
    So synthesizing cache invalidation msgs from the
    WAL records would need to reverse-engineer a full
    page image, which seems hard and error-prone.

3) Most cache invalidations seem to be generated by
    heap_insert (via PrepareForTupleInvalidation). Those
    seem to be reconstructable from the WAL quite easily.
    Those sent out via CacheInvalidateRelcache*, however,
    seem to leave no trace in the WAL.

What I'm wondering is how much performance is lost if
I just let the slave flush all its caches whenever it
replays a commit record of a transaction that executed
DDL. To me it looks like that would only seriously harm
performance if a lot of temporary tables are created on
the master. Since there seem to be quite a few people
who are unhappy about the current temptable implementation,
optimizing for that case might prove worthless if 8.4 or
8.5 changes the way that temptables are handled.

If this brute-force approach turns out to perform really
badly, does anyone see an elegant way around (2) and (3)?
(2) seems solvable by writing both logical and physical
records to the WAL - similar to what that xlog compression
idea needs (I, however, lost track of what came out of that
discussion). But (3) seems to be messy.
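
To illustrate the brute-force idea, roughly (the XACT_HAD_DDL flag
and the hook into commit replay are hypothetical - nothing like them
exists yet; ResetCatalogCaches() and RelationCacheInvalidate() are
the existing flush routines from catcache.c / relcache.c):

/* Sketch only - flag and hook are hypothetical */
static void
slave_redo_commit(xl_xact_commit *xlrec)
{
    /* ... usual commit replay: record the xids as committed ... */

    if (xlrec->xinfo & XACT_HAD_DDL)    /* hypothetical flag */
    {
        ResetCatalogCaches();           /* drop all catcache entries */
        RelationCacheInvalidate();      /* force relcache rebuilds */
        /* the smgr cache would need similar treatment */
    }
}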

greetings, Florian Pflug



Re: Full page images in WAL & Cache Invalidation

From: Tom Lane
"Florian G. Pflug" <fgp@phlo.org> writes:
> I'm currently working on correctly flushing the
> catalog/relation/smgr caches on a read-only PITR
> slave during recovery.

I don't believe there is any workable solution to that short of logging
cache-flush operations in WAL.
        regards, tom lane


Re: Full page images in WAL & Cache Invalidation

From: "Florian G. Pflug"
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> I'm currently working on correctly flushing the
>> catalog/relation/smgr caches on a read-only PITR
>> slave during recovery.
> 
> I don't believe there is any workable solution to that short of logging
> cache-flush operations in WAL.

I still don't fully understand if and what problems are caused by overly
aggressive cache flushing - what puzzles me is that DDL statements seem
to be considered something rare on a production system by most people on
this list, yet the caches seem to be highly optimized to avoid
unnecessary invalidations.

Leaving aside the btree issues, are you worried about performance
problems, or can aggressive cache flushing hurt correctness?

The reason that I dislike WAL-logging of the flush operations so much is
that, since people are concerned about the amount of WAL traffic
postgres generates, such a solution would introduce yet another GUC.
And to make this reasonably foolproof, the slave would need a way to
detect if that GUC is set correctly on the master. All in all, that
seems to be quite hackish...

greetings, Florian Pflug


Re: Full page images in WAL & Cache Invalidation

From: Tom Lane
"Florian G. Pflug" <fgp@phlo.org> writes:
> Leaving aside the btree issues, are you worried about performance
> problems, or can aggressive cache flushing hurt correctness?

It shouldn't hurt correctness, but I don't think you entirely grasp
the magnitude of the performance hit you'll take.  The last time I
tried running the regression tests with CLOBBER_CACHE_ALWAYS enabled,
they took about one hundred times longer than normal.  Now you are
evidently hoping not to disable the caches entirely, but I don't
understand how you are going to handle the various random
CacheInvalidateRelcache calls that are here and there in the system
(not only in btree).
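
(For reference, CLOBBER_CACHE_ALWAYS roughly amounts to this in
AcceptInvalidationMessages() - quoting from memory, so details may
be off:

void
AcceptInvalidationMessages(void)
{
    /* process whatever sinval messages other backends have sent */
    ReceiveSharedInvalidMessages(LocalExecuteInvalidationMessage,
                                 InvalidateSystemCaches);

#ifdef CLOBBER_CACHE_ALWAYS
    /* pretend we always received a cache-reset message */
    InvalidateSystemCaches();
#endif
}

so every single cache lookup gets to rebuild its entry from the
catalogs.)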
        regards, tom lane


Re: Full page images in WAL & Cache Invalidation

From: "Florian G. Pflug"
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> Leaving aside the btree issues, are you worried about performance
>> problems, or can aggressive cache flushing hurt correctness?
> 
> It shouldn't hurt correctness, but I don't think you entirely grasp
> the magnitude of the performance hit you'll take.  The last time I
> tried running the regression tests with CLOBBER_CACHE_ALWAYS enabled,
> they took about one hundred times longer than normal.  Now you are
> evidently hoping not to disable the caches entirely, but I don't
> understand how you are going to handle the various random
> CacheInvalidateRelcache calls that are here and there in the system
> (not only in btree).

I must be missing something... A quick grep for CacheInvalidateRelcache
turned up these places:
src/backend/rewrite/rewriteDefine.c (EnableDisableRule)
src/backend/rewrite/rewriteSupport.c (SetRelationRuleStatus)
src/backend/access/nbtree/nbtinsert.c (_bt_insertonpg, _bt_newroot)
src/backend/access/nbtree/nbtpage.c (_bt_getroot, _bt_pagedel)
src/backend/commands/trigger.c (renametrig, EnableDisableTrigger)
src/backend/commands/cluster.c (mark_index_clustered)
src/backend/commands/indexcmds.c (DefineIndex)
src/backend/commands/tablecmds.c (setRelhassubclassInRelation)
src/backend/commands/vacuum.c (vac_update_relstats)
src/backend/catalog/heap.c (SetRelationNumChecks)
src/backend/catalog/index.c (index_drop, index_update_stats)

For CacheInvalidateHeapTuple, there is an additional hit in
src/backend/commands/vacuum.c (move_chain_tuple, move_plain_tuple).
Note that move_chain_tuple and move_plain_tuple are only called
in repair_frag, which in turn is only used in full_vacuum_rel.

Now, to me all of these - with the exception of the btree functions,
vac_update_relstats and move_chain/plain_tuple - look like they are
only called during DDL statements.

My basic assumption is that DDL is something quite uncommon on a
production system. This is obviously *totally* *wrong* for the
regression tests, and I don't doubt that my scheme will show quite
bad performance if you use that as a benchmark. But if you, say,
use pgbench for testing, then the picture will be quite different,
I imagine.

My strategy would be the following (a rough code sketch follows the
list):
1) Mark the commit record if a transaction generated any invalidation
    events apart from the btree ones. The only other source of inval
    events seems to be "VACUUM FULL" on a system relation, which won't
    happen on a modestly well-tuned system I think - any VACUUM FULL
    will need special treatment anyway.

2) At replay time, the caches are flushed after that record was
    replayed.
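
As a sketch, with hypothetical names (xl_xact_commit has no such flag
field today; InvalidateSystemCaches() is the existing flush-everything
routine from inval.c):

/* (1) master side, while assembling the commit record: */
xlrec.xinfo = 0;
if (transaction_generated_inval_events)     /* hypothetical bookkeeping */
    xlrec.xinfo |= XACT_COMPLETION_INVAL;   /* hypothetical flag */

/* (2) slave side, after redoing the commit record: */
if (xlrec->xinfo & XACT_COMPLETION_INVAL)
    InvalidateSystemCaches();   /* flush catcache and relcache wholesale */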

greetings, Florian Pflug


Re: Full page images in WAL & Cache Invalidation

From: Tom Lane
"Florian G. Pflug" <fgp@phlo.org> writes:
> My basic assumption is that DDL is something quite uncommon on a
> production system.

I'm not sure I believe that, because of temp tables.  There's also
the problem that plain VACUUM (or ANALYZE) causes a relcache flush
to update the relation-size statistics.

The real problem with the scheme you propose is that it turns a
cache flush on one table into a system-wide cache flush.

We might be able to do something about the temp-table case upstream:
AFAICS there's no reason for backends to broadcast cache flushes for
their own temp tables to other backends.  But that's just a sketch
of a thought at the moment.

Anyway, if you believe that DDL is infrequent, why are you resistant
to the idea of WAL-logging cache flushes?
        regards, tom lane


Re: Full page images in WAL & Cache Invalidation

From: "Florian G. Pflug"
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> My basic assumption is that DDL is something quite uncommon on a
>> production system.
> 
> I'm not sure I believe that, because of temp tables.  There's also
> the problem that plain VACUUM (or ANALYZE) causes a relcache flush
> to update the relation-size statistics.
>
> The real problem with the scheme you propose is that it turns a
> cache flush on one table into a system-wide cache flush.

Yes.. It really builds on the idea that those flushes happen not
too frequently.

> We might be able to do something about the temp-table case upstream:
> AFAICS there's no reason for backends to broadcast cache flushes for
> their own temp tables to other backends.  But that's just a sketch
> of a thought at the moment.

I was actually hoping that some day temptables wouldn't be stored in
pg_class and friends at all. I was wondering if it
wouldn't be possible to keep the information about them solely in
the catcache and relcache (making the word cache a big lie). I didn't
check if that is feasible at all, though - just an idea I got at
one point.

> Anyway, if you believe that DDL is infrequent, why are you resistant
> to the idea of WAL-logging cache flushes?
For multiple reasons.

First, cache invalidations are not the only problem caused by replaying 
system-table updates. The whole SnapshotNow
business doesn't exactly make things easier either. So it feels like a
lot of added complexity and code for little gain - unless a *lot*
more things (like locking requests) are logged too.

Second, I'm sure that people would insist on a GUC to turn logging
those records off if they don't need them in their setup - similar to
that make_wal_compressable GUC that was proposed a few weeks ago.
And if it's a GUC, the slave should be able to verify that it was
set correctly on the master, otherwise this becomes a huge footgun.

Third, I try to keep the changes necessary on the master at a
minimum - I feel that this will make merging the code at some point
easier, because the risk of breaking something is smaller. Bugs
in the slave code will maybe cause crashes and wrong results, but
at least they won't cause data corruption on the master.

And last but not least, I have only limited time for this project -
so I try to find the simplest workable solution, and maybe tune things
later when practical experience shows where the real bottlenecks are.

greetings, Florian Pflug


Re: Full page images in WAL & Cache Invalidation

From: Tom Lane
"Florian G. Pflug" <fgp@phlo.org> writes:
> Tom Lane wrote:
>> The real problem with the scheme you propose is that it turns a
>> cache flush on one table into a system-wide cache flush.

> Yes.. It really builds on the idea that those flushes happen not
> too frequently.

The problem is that flushes may be infrequent for any one table,
but if you use an implementation that converts every per-table
flush to a system-wide flush, it's not so infrequent anymore.
        regards, tom lane


Re: Full page images in WAL & Cache Invalidation

From: Tom Lane
"Florian G. Pflug" <fgp@phlo.org> writes:
>> Anyway, if you believe that DDL is infrequent, why are you resistant
>> to the idea of WAL-logging cache flushes?

> First, cache invalidations are not the only problem caused by replaying 
> system-table updates. The whole SnapshotNow
> business doesn't exactly make things easier either. So it feels like a
> lot of added complexity and code for little gain - unless a *lot*
> more things (like locking requests) are logged too.

The mention of locking requests brought to mind the following
gedankenexperiment:

1. slave server backend is running some long-running query on table X.

2. WAL-reading process receives and executes DROP TABLE X.

(It doesn't even have to be a DROP; most varieties of ALTER are enough
to create problems for a concurrently-running query.)

It's really hard to see how to defend against that without a fairly
complete simulation of locking on the slave side.
        regards, tom lane


Re: Full page images in WAL & Cache Invalidation

From: "Simon Riggs"
On Sun, 2007-07-22 at 19:58 +0200, Florian G. Pflug wrote:
> Tom Lane wrote:
> > "Florian G. Pflug" <fgp@phlo.org> writes:
> >> I'm currently working on correctly flushing the
>>> catalog/relation/smgr caches on a read-only PITR
> >> slave during recovery.
> > 
> > I don't believe there is any workable solution to that short of logging
> > cache-flush operations in WAL.

> The reason that I dislike WAL-logging of the flush operations so much is
> that, since people are concerned about the amount of WAL traffic
> postgres generates, such a solution would introduce yet another GUC.
> And to make this reasonably foolproof, the slave would need a way to
> detect if that GUC is set correctly on the master. All in all, that
> seems to be quite hackish...

Seems like we should WAL log flush operations first. It's fairly
straightforward to do that and we can then measure its effect on the
primary easily enough. Your other suggestions seem much more complex.

I think we have a reasonable tolerance for increases in WAL and as you
said earlier, we may balance that out with other optimisations. Or we
may find a more efficient way of doing it later.

Let's aim to get that first query running, then go back and tune it
later.
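
As a sketch, such a record could simply carry the same sinval messages
the committing backend sends through shared memory. The record type
and redo routine below are hypothetical; SharedInvalidationMessage and
LocalExecuteInvalidationMessage exist today:

/* Hypothetical WAL record carrying cache invalidation messages */
typedef struct xl_inval
{
    int                       nmsgs;    /* number of messages */
    SharedInvalidationMessage msgs[1];  /* variable-length array */
} xl_inval;

/* redo on the slave: apply each message to the local caches */
static void
inval_redo(xl_inval *xlrec)
{
    int         i;

    for (i = 0; i < xlrec->nmsgs; i++)
        LocalExecuteInvalidationMessage(&xlrec->msgs[i]);
}

That keeps the decision of *what* to flush on the master, where the
information is cheap to get at.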

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com



Re: Full page images in WAL & Cache Invalidation

From: "Simon Riggs"
On Sun, 2007-07-22 at 17:54 +0200, Florian G. Pflug wrote:

> 1) It seems that the btree code sends out relcache
>     invalidation msgs during normal operation
>     (No DDL statements are executed). This lets any
>     simple flush-all-caches-if-ddl-was-executed scheme
>     fail.

Cache invalidation is uniquely associated with these WAL record types:

XLOG_BTREE_NEWROOT
XLOG_BTREE_INSERT_META
XLOG_BTREE_DELETE_PAGE_META

None of those things happen with any real frequency in most use cases.
We don't increase or reduce the number of levels of an index very
frequently.
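
So a recovery-side filter could synthesize the inval itself. A sketch,
assuming the slave can queue an inval for the index's relid - the
record types are real, but calling the inval routine from a redo
context is not something that works today:

switch (info)
{
    case XLOG_BTREE_NEWROOT:
    case XLOG_BTREE_INSERT_META:
    case XLOG_BTREE_DELETE_PAGE_META:
        /* metapage changed: make slave backends drop their
         * cached copy of the metapage data */
        CacheInvalidateRelcacheByRelid(index_relid);
        break;
    default:
        break;      /* no other btree record affects the relcache */
}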

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com



Re: Full page images in WAL & Cache Invalidation

From: Gregory Stark
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> 2. WAL-reading process receives and executes DROP TABLE X.
>
> (It doesn't even have to be a DROP; most varieties of ALTER are enough
> to create problems for a concurrently-running query.)
>
> It's really hard to see how to defend against that without a fairly
> complete simulation of locking on the slave side.

Well, it's not hard to see ways to do it. It may be hard to see *good* ways to
do it. In the limit you can just block WAL replay of such records (I think
vacuum page defragmenting is also in this category) until there are no queries
running. This means a single long-running query can block log replay for a
long time though.

There's an additional problem from commands like CREATE INDEX
CONCURRENTLY and vacuum's deleting of dead tuples, which aren't safe to
execute if there are any standby queries running unless we can prove
they're not "too old". CLUSTER and most varieties of ALTER suffer from
this problem as well.

It does seem like the shortest path solution is just to get things working
first with a "one big lock" implementation and then work on refining it after
that.

--
 Gregory Stark
 EnterpriseDB   http://www.enterprisedb.com



Re: Full page images in WAL & Cache Invalidation

From: "Florian G. Pflug"
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>>> Anyway, if you believe that DDL is infrequent, why are you resistant
>>> to the idea of WAL-logging cache flushes?
> 
>> First, cache invalidations are not the only problem caused by replaying 
>> system-table updates. The whole SnapshotNow
>> business doesn't exactly make things easier either. So it feels like a
>> lot of added complexity and code for little gain - unless a *lot*
>> more things (like locking requests) are logged too.
> 
> The mention of locking requests brought to mind the following
> gedankenexperiment:
> 
> 1. slave server backend is running some long-running query on table X.
> 
> 2. WAL-reading process receives and executes DROP TABLE X.
> 
> (It doesn't even have to be a DROP; most varieties of ALTER are enough
> to create problems for a concurrently-running query.)
> 
> It's really hard to see how to defend against that without a fairly
> complete simulation of locking on the slave side.

Well, locking on the slave is a bit easier than on the master, for
two reasons:

1) Queries running on the slave only lock in AccessShareLock mode -
    therefore, you only need to know if a certain mode conflicts with
    AccessShareLock - and there seems to be only one that does, namely
    AccessExclusiveLock (see the conflict-table excerpt after this
    list). So we really only need to care about AccessExclusiveLock
    locks on the master.

2) As far as I can see, the point of an AccessExclusiveLock is *not*
    actually preventing queries from running while a DDL statement is
    *executed*, but rather preventing queries from running while the
    statement is *committed*. This fits the fact that system tables are
    read using SnapshotNow (not SnapshotDirty) - while the DDL is
    running, everybody is happily using the old information; the trouble
    would only start after the commit, because with SnapshotNow you
    suddenly see the new state.
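
To back up (1), a paraphrase of the conflict table from
src/backend/storage/lmgr/lock.c (abridged - don't take the layout
literally):

static const LOCKMASK LockConflicts[] = {
    0,                              /* NoLock */
    (1 << AccessExclusiveLock),     /* AccessShareLock */
    /* ... stronger modes omitted ... */
};

AccessShareLock conflicts with AccessExclusiveLock and nothing else,
so tracking AccessExclusiveLock alone should do.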
 

I'm not yet 100% sure that (2) holds (with the exception of VACUUM
FULL) - but I'm fairly confident, because if (2) were wrong, then how
would the system survive a crash during the execution of a DDL
statement?

So after a bit more thought (and reading - thanks for all the replies,
guys! They are greatly appreciated), I came up with the following plan
for both inval events and locks:

.) Store two flags in the commit record of a transaction, for
    "transaction generated inval events" and "transaction held an
    access exclusive lock".

.) Upon replay, block until no transactions are running (for
    "transaction held an exclusive lock") before replaying the
    record, or flush the caches after replaying it (for
    "transaction generated inval events").

This scheme has two really nice properties:
First, it can be extended fairly easily to not store a simple flag, but
a list of OIDs, and use that to make the flushing and blocking more
fine-grained - without changing the basic way in which things work.

And second, it stores all information needed in the *commit* record.
That removes problems with transactions that are aborted due to a
crash, where the WAL therefore doesn't record that they were aborted.

VACUUM FULL will need some special care - but at least VACUUM FULL is
already such a disruptive operation that it probably won't surprise
anyone if it's disruptive on the slave too. (And now that CLUSTER is
MVCC-safe, from what I read, the use case for VACUUM FULL seems to
be pretty slim.)

The btree metapage caching will need special treatment too - probably
some flags in the WAL records that change the metapage, instructing
the slave to synthesize a suitable inval event.

What stays open is regular VACUUM (and maybe inline vacuuming - is that
part of the latest version of HOT, or did it get removed?). Here, the
key is logging the return value of GetOldestXmin(), I think. Since that
value is what decides in the end whether a tuple can be killed or not,
having it available on the slave should allow the slave to block replay
until no slave query depends on the tuple anymore.
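
Roughly (everything here is hypothetical except GetOldestXmin()
itself, whose signature I may be misremembering):

/* Sketch: before replaying a record that removes tuples, wait
 * until no slave query might still need to see them. */
static void
WaitUntilSafeToRemove(TransactionId master_oldest_xmin)
{
    /* the slave's own oldest xmin reflects its running queries */
    while (TransactionIdPrecedes(GetOldestXmin(true, true),
                                 master_oldest_xmin))
        pg_usleep(100000L);     /* 100ms poll is fine for a sketch */
}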

greetings, Florian Pflug




Re: Full page images in WAL & Cache Invalidation

From: Tom Lane
"Florian G. Pflug" <fgp@phlo.org> writes:
> I came up with the following plan for both inval events and locks
> .) Store two flags in the commit record of a transaction, for
>     "transaction generated inval events" and "transaction held an
>     access exclusive lock".
> .) Upon replay, block until no transactions are running (for
>     "transaction held an exclusive lock") before replaying the
>     record, or flush the caches after replaying it (for
>     "transaction generated inval events").

This does not work; the lock has to be taken earlier than that.
(See for instance VACUUM's truncate calls.)  Not to mention that
you have converted "exclusive lock on one table" to "exclusive lock
on every table", which is even worse than the idea of converting
per-table cache flushes to system-wide ones.
        regards, tom lane


Re: Full page images in WAL & Cache Invalidation

From: "Florian G. Pflug"
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> I came up with the following plan for both inval events and locks
>> .) Store two flags in the commit record of a transaction, for
>>     "transaction generated inval events" and "transaction held an
>>     access exclusive lock".
>> .) Upon replay, block until no transactions are running (for
>>     "transaction held an exclusive lock") before replaying the
>>     record, or flush the caches after replaying it (for
>>     "transaction generated inval events").
> 
> This does not work; the lock has to be taken earlier than that.
> (See for instance VACUUM's truncate calls.)  Not to mention that
> you have converted "exclusive lock on one table" to "exclusive lock
> on every table", which is even worse than the idea of converting
> per-table cache flushes to system-wide ones.

I'll check what VACUUM is doing... I primarily had CLUSTER and TRUNCATE
in mind.

That "exclusive lock on one table becomes exclusive lock on all tables"
issue can (as I wrote in the part of my mail that you snipped) be
solved, I think, by storing a list of OIDs instead of a flag for the
locks and inval events.

greetings, Florian Pflug


Re: Full page images in WAL & Cache Invalidation

From: "Florian G. Pflug"
Simon Riggs wrote:
> On Sun, 2007-07-22 at 19:58 +0200, Florian G. Pflug wrote:
>> Tom Lane wrote:
>>> "Florian G. Pflug" <fgp@phlo.org> writes:
>>>> I'm currently working on correctly flushing the
>>>> catalog/relation/smgr caches on a read-only PITR
>>>> slave during recovery.
>>> I don't believe there is any workable solution to that short of logging
>>> cache-flush operations in WAL.
> 
>> The reason that I dislike WAL-logging of the flush operations so much is
>> that, since people are concerned about the amount of WAL traffic
>> postgres generates, such a solution would introduce yet another GUC.
>> And to make this reasonably foolproof, the slave would need a way to
>> detect if that GUC is set correctly on the master. All in all, that
>> seems to be quite hackish...
> 
> Seems like we should WAL log flush operations first. It's fairly
> straightforward to do that and we can then measure its effect on the
> primary easily enough. Your other suggestions seem much more complex.
> 
> I think we have a reasonable tolerance for increases in WAL and as you
> said earlier, we may balance that out with other optimisations. Or we
> may find a more efficient way of doing it later.
> 
> Let's aim to get that first query running, then go back and tune it
> later.

I've so far added an LWLock that makes replay and queries mutually
exclusive. Simple test cases seem to work, but I haven't really
beaten the system yet...
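
Roughly like this (ReplayLock is the new lock, everything else is
the existing machinery):

/* replay side, around each WAL record: */
LWLockAcquire(ReplayLock, LW_EXCLUSIVE);
RmgrTable[record->xl_rmid].rm_redo(lsn, record);
LWLockRelease(ReplayLock);

/* query side holds it in shared mode for the duration: */
LWLockAcquire(ReplayLock, LW_SHARED);
/* ... execute the query ... */
LWLockRelease(ReplayLock);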

Of course, my current version falls over as soon as you do
DDL on the master - working on fixing that, and on
subsequently removing that lock again :-)

greetings, Florian Pflug