Thread: Full page images in WAL & Cache Invalidation
Hi,

I'm currently working on correctly flushing the catalog/relation/smgr caches on a readonly PITR slave during recovery. These are the things that currently cause me headaches.

1) It seems that the btree code sends out relcache invalidation msgs during normal operation (no DDL statements are executed). This makes any simple flush-all-caches-if-ddl-was-executed scheme fail.

2) When a full page image is written to the WAL, the information about what tuple was updated is lost. So synthesizing cache invalidation msgs from the WAL records would need to reverse-engineer a full page image, which seems hard and error-prone.

3) Most cache invalidations seem to be generated by heap_insert (via PrepareForTupleInvalidation). Those seem to be reconstructable from the WAL quite easily. Those sent out via CacheInvalidateRelcache*, however, seem to leave no trace in the WAL.

What I'm wondering is how much performance is lost if I just let the slave flush all its caches whenever it replays a commit record of a transaction that executed DDL. To me it looks like that would only seriously harm performance if a lot of temporary tables are created on the master. Since there seem to be quite a few people who are unhappy about the current temptable implementation, optimizing for that case might prove worthless if 8.4 or 8.5 changes the way that temptables are handled.

If this brute-force approach turns out to perform really badly, does anyone see an elegant way around (2) and (3)? (2) seems solvable by writing logical as well as physical records to the WAL - similar to what that xlog compression idea needs (I, however, lost track of what came out of that discussion). But (3) seems to be messy...

greetings, Florian Pflug
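P.S.: To make the brute-force idea concrete, the replay side would be about this simple - a sketch only. The xinfo flag is made up and would have to be added to the commit record by the master; InvalidateSystemCaches() is the existing flush-everything routine in inval.c (it might need to be exported):

    /* In xact_redo_commit() on the slave (sketch) */
    static void
    xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid)
    {
        /* ... existing commit-replay work ... */

        if (xlrec->xinfo & XACT_COMPLETION_HAD_DDL)   /* made-up flag */
        {
            /*
             * Brute force: drop all cached catalog, relation and smgr
             * state instead of reconstructing individual inval
             * messages from full page images.
             */
            InvalidateSystemCaches();
        }
    }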
"Florian G. Pflug" <fgp@phlo.org> writes: > I'm currently working on correctly flushing the > catalog/relation/sgmr caches on a readonly PITR > slave during recovery. I don't believe there is any workable solution to that short of logging cache-flush operations in WAL. regards, tom lane
"Florian G. Pflug" <fgp@phlo.org> writes: > I'm currently working on correctly flushing the > catalog/relation/sgmr caches on a readonly PITR > slave during recovery. I don't believe there is any workable solution to that short of logging cache-flush operations in WAL. regards, tom lane
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> I'm currently working on correctly flushing the
>> catalog/relation/smgr caches on a readonly PITR
>> slave during recovery.
>
> I don't believe there is any workable solution to that short of logging
> cache-flush operations in WAL.

I still don't fully understand if and what problems are caused by overly aggressive cache flushing - what puzzles me is that DDL statements seem to be considered something rare on a production system by most people on this list, yet the caches seem to be highly optimized to avoid unnecessary invalidates.

Leaving aside the btree issues, are you worried about performance problems, or can aggressive cache flushing hurt correctness?

The reason that I dislike WAL-logging of the flush operations so much is that, since people are concerned about the amount of WAL traffic postgres generates, such a solution would introduce yet another GUC. And to make this reasonably foolproof, the slave would need a way to detect if that GUC is set correctly on the master. All in all, that seems to be quite hackish...

greetings, Florian Pflug
"Florian G. Pflug" <fgp@phlo.org> writes: > Leaving aside the btree issues, are you worried about performance > problems, or can aggressive cache flushing hurt correctness? It shouldn't hurt correctness, but I don't think you entirely grasp the magnitude of the performance hit you'll take. The last time I tried running the regression tests with CLOBBER_CACHE_ALWAYS enabled, they took about one hundred times longer than normal. Now you are evidently hoping not to disable the caches entirely, but I don't understand how you are going to handle the various random CacheInvalidateRelcache calls that are here and there in the system (not only in btree). regards, tom lane
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> Leaving aside the btree issues, are you worried about performance
>> problems, or can aggressive cache flushing hurt correctness?
>
> It shouldn't hurt correctness, but I don't think you entirely grasp
> the magnitude of the performance hit you'll take. The last time I
> tried running the regression tests with CLOBBER_CACHE_ALWAYS enabled,
> they took about one hundred times longer than normal. Now you are
> evidently hoping not to disable the caches entirely, but I don't
> understand how you are going to handle the various random
> CacheInvalidateRelcache calls that are here and there in the system
> (not only in btree).

I must be missing something... A quick grep for CacheInvalidateRelcache turned up these places:

src/backend/rewrite/rewriteDefine.c (EnableDisableRule)
src/backend/rewrite/rewriteSupport.c (SetRelationRuleStatus)
src/backend/access/nbtree/nbtinsert.c (_bt_insertonpg, _bt_newroot)
src/backend/access/nbtree/nbtpage.c (_bt_getroot, _bt_pagedel)
src/backend/commands/trigger.c (renametrig, EnableDisableTrigger)
src/backend/commands/cluster.c (mark_index_clustered)
src/backend/commands/indexcmds.c (DefineIndex)
src/backend/commands/tablecmds.c (setRelhassubclassInRelation)
src/backend/commands/vacuum.c (vac_update_relstats)
src/backend/catalog/heap.c (SetRelationNumChecks)
src/backend/catalog/index.c (index_drop, index_update_stats)

For CacheInvalidateHeapTuple, there is an additional hit in src/backend/commands/vacuum.c (move_chain_tuple, move_plain_tuple). Note that move_chain_tuple and move_plain_tuple are only called in repair_frag, which in turn is only used in full_vacuum_rel.

Now, to me all of these, with the exception of the btree functions, vac_update_relstats and move_chain/plain_tuple, look like they are only called during DDL statements. My basic assumption is that DDL is something quite uncommon on a production system. This is obviously *totally* *wrong* for the regression tests, and I don't doubt that my scheme will show quite bad performance if you use those as a benchmark. But if you, say, use pgbench for testing, then the picture will be quite different, I imagine.

My strategy would be the following:

1) Mark the commit record if a transaction generated any invalidation events apart from the btree ones. The only other source of inval events seems to be "VACUUM FULL" on a system relation, which won't happen on a modestly well-tuned system I think - any VACUUM FULL will need special treatment anyway.

2) At replay time, the caches are flushed after that record was replayed.

greetings, Florian Pflug
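P.S.: The master-side half of that strategy would amount to something like this - a sketch, in which both the flag and the test are made up. The idea is that inval.c would remember whether any invalidation event other than the btree metapage ones was queued during the transaction:

    /* In RecordTransactionCommit() on the master (sketch) */
    if (TransactionQueuedInvalEvents())         /* hypothetical, in inval.c */
        xlrec.xinfo |= XACT_COMPLETION_HAD_DDL; /* made-up flag */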
"Florian G. Pflug" <fgp@phlo.org> writes: > My basic assumption is that DDL is something quite uncommon on a > production system. I'm not sure I believe that, because of temp tables. There's also the problem that plain VACUUM (or ANALYZE) causes a relcache flush to update the relation-size statistics. The real problem with the scheme you propose is that it turns a cache flush on one table into a system-wide cache flush. We might be able to do something about the temp-table case upstream: AFAICS there's no reason for backends to broadcast cache flushes for their own temp tables to other backends. But that's just a sketch of a thought at the moment. Anyway, if you believe that DDL is infrequent, why are you resistant to the idea of WAL-logging cache flushes? regards, tom lane
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> My basic assumption is that DDL is something quite uncommon on a
>> production system.
>
> I'm not sure I believe that, because of temp tables. There's also
> the problem that plain VACUUM (or ANALYZE) causes a relcache flush
> to update the relation-size statistics.
>
> The real problem with the scheme you propose is that it turns a
> cache flush on one table into a system-wide cache flush.

Yes... It really builds on the idea that those flushes don't happen too frequently.

> We might be able to do something about the temp-table case upstream:
> AFAICS there's no reason for backends to broadcast cache flushes for
> their own temp tables to other backends. But that's just a sketch
> of a thought at the moment.

I was actually hoping that some day temptables wouldn't be stored in pg_class and friends at all. I was wondering if it wouldn't be possible to keep the information about them solely in the catcache and relcache (making the word cache a big lie). I didn't check if that is feasible at all, though - just an idea I got at one point.

> Anyway, if you believe that DDL is infrequent, why are you resistant
> to the idea of WAL-logging cache flushes?

For multiple reasons.

First, cache invalidations are not the only problem caused by replaying system-table updates. The whole SnapshotNow business doesn't exactly make things easier either. So it feels like a lot of added complexity and code for little gain - unless a *lot* more things (like locking requests) are logged too.

Second, I'm sure that people would insist on a GUC to turn logging those records off if they don't need them in their setup - similar to that make_wal_compressable GUC that was proposed a few weeks ago. And if it's a GUC, the slave should be able to verify that it was set correctly on the master; otherwise this becomes a huge footgun.

Third, I try to keep the changes necessary on the master at a minimum - I feel that this will make merging the code at some point easier, because the risk of breaking something is smaller. Bugs in the slave code will maybe cause crashes and wrong results, but at least they won't cause data corruption on the master.

And last but not least, I have only limited time for this project - so I try to find the simplest workable solution, and maybe tune things later when practical experience shows where the real bottlenecks are.

greetings, Florian Pflug
"Florian G. Pflug" <fgp@phlo.org> writes: > Tom Lane wrote: >> The real problem with the scheme you propose is that it turns a >> cache flush on one table into a system-wide cache flush. > Yes.. It really builds on the idea that those flushes happen not > too frequently. The problem is that flushes may be infrequent for any one table, but if you use an implementation that converts every per-table flush to a system-wide flush, it's not so infrequent anymore. regards, tom lane
"Florian G. Pflug" <fgp@phlo.org> writes: >> Anyway, if you believe that DDL is infrequent, why are you resistant >> to the idea of WAL-logging cache flushes? > First, cache invalidations are not the only problem caused by replaying > system-table updates. The whole SnapshotNow > business doesn't exactly make things easier too. So it feels like a > lot of added complexity and code for little gain - unless a *lot* > more things (like locking requests) are logged too. The mention of locking requests brought to mind the following gedankenexperiment: 1. slave server backend is running some long-running query on table X. 2. WAL-reading process receives and executes DROP TABLE X. (It doesn't even have to be a DROP; most varieties of ALTER are enough to create problems for a concurrently-running query.) It's really hard to see how to defend against that without a fairly complete simulation of locking on the slave side. regards, tom lane
On Sun, 2007-07-22 at 19:58 +0200, Florian G. Pflug wrote:
> Tom Lane wrote:
>> "Florian G. Pflug" <fgp@phlo.org> writes:
>>> I'm currently working on correctly flushing the
>>> catalog/relation/smgr caches on a readonly PITR
>>> slave during recovery.
>>
>> I don't believe there is any workable solution to that short of logging
>> cache-flush operations in WAL.

> The reason that I dislike WAL-logging of the flush operations so much is
> that, since people are concerned about the amount of WAL traffic
> postgres generates, such a solution would introduce yet another GUC.
> And to make this reasonably foolproof, the slave would need a way to
> detect if that GUC is set correctly on the master. All in all, that
> seems to be quite hackish...

Seems like we should WAL-log flush operations first. It's fairly straightforward to do that, and we can then measure its effect on the primary easily enough. Your other suggestions seem much more complex.

I think we have a reasonable tolerance for increases in WAL and, as you said earlier, we may balance that out with other optimisations. Or we may find a more efficient way of doing it later.

Let's aim to get that first query running, then go back and tune it later.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Sun, 2007-07-22 at 17:54 +0200, Florian G. Pflug wrote:
> 1) It seems that the btree code sends out relcache
> invalidation msgs during normal operation
> (no DDL statements are executed). This makes any
> simple flush-all-caches-if-ddl-was-executed scheme
> fail.

Cache invalidation is uniquely associated with these WAL record types:

XLOG_BTREE_NEWROOT
XLOG_BTREE_INSERT_META
XLOG_BTREE_DELETE_PAGE_META

None of those things happen with any real frequency in most use cases. We don't increase or reduce the number of levels of an index very frequently.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
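P.S.: So on the slave, btree_redo() could synthesize the inval itself for just those cases. A sketch - the record types are real; the RelFileNode-to-OID lookup is the hand-wavy part, since the WAL record doesn't carry the index's OID:

    /* In btree_redo() on the slave (sketch) */
    switch (record->xl_info & ~XLR_INFO_MASK)
    {
        case XLOG_BTREE_NEWROOT:
        case XLOG_BTREE_INSERT_META:
        case XLOG_BTREE_DELETE_PAGE_META:
            /* metapage changed: flush the cached copy of it */
            CacheInvalidateRelcacheByRelid(
                relid_from_relfilenode(...));   /* hypothetical helper */
            break;
        default:
            break;
    }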
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > 2. WAL-reading process receives and executes DROP TABLE X. > > (It doesn't even have to be a DROP; most varieties of ALTER are enough > to create problems for a concurrently-running query.) > > It's really hard to see how to defend against that without a fairly > complete simulation of locking on the slave side. Well, it's not hard to see ways to do it. It may be hard to see *good* ways to do it. In the limit you can just block WAL replay of such records (I think vacuum page defragmenting is also in this category) until there are no queries running. This means a single long-running query can block log replay for a long time though. There's an additional problem from commands like create index concurrently and vacuum's deleting of dead tuples which aren't safe to execute if there are any standby queries running unless we can prove they're not "too old". CLUSTER and most varieties of ALTER suffer from this problem as well. It does seem like the shortest path solution is just to get things working first with a "one big lock" implementation and then work on refining it after that. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>>> Anyway, if you believe that DDL is infrequent, why are you resistant
>>> to the idea of WAL-logging cache flushes?
>
>> First, cache invalidations are not the only problem caused by replaying
>> system-table updates. The whole SnapshotNow business doesn't exactly
>> make things easier either. So it feels like a lot of added complexity
>> and code for little gain - unless a *lot* more things (like locking
>> requests) are logged too.
>
> The mention of locking requests brought to mind the following
> gedankenexperiment:
>
> 1. Slave server backend is running some long-running query on table X.
>
> 2. WAL-reading process receives and executes DROP TABLE X.
>
> (It doesn't even have to be a DROP; most varieties of ALTER are enough
> to create problems for a concurrently-running query.)
>
> It's really hard to see how to defend against that without a fairly
> complete simulation of locking on the slave side.

Well, locking on the slave is a bit easier than on the master, for two reasons:

1) Queries running on the slave only lock in AccessShareLock mode - therefore, you only need to know if a certain mode conflicts with AccessShareLock, and there seems to be only one that does, namely AccessExclusiveLock. So we really only need to care about AccessExclusiveLock locks on the master.

2) As far as I can see, the point of an AccessExclusiveLock is *not* actually preventing queries from running while a DDL statement is *executed*, but rather preventing queries from running while the statement is *committed*. This fits the fact that system tables are read using SnapshotNow (not SnapshotDirty) - while the DDL is running, everybody is happily using the old information; the trouble would only start after the commit, because with SnapshotNow you suddenly see the new state.

I'm not yet 100% sure that (2) holds (with the exception of VACUUM FULL), but I'm fairly confident, because if (2) were wrong, then how would the system survive a crash during the execution of a DDL statement?

So after a bit more thought (and reading - thanks for all the replies, guys! They are greatly appreciated), I came up with the following plan for both inval events and locks:

.) Store two flags in the commit record of a transaction, for "transaction generated inval events" and "transaction held an access exclusive lock".

.) Upon replay, block until no transactions are running (for "transaction held an exclusive lock") before replaying the record, or flush the caches after replaying it (for "transaction generated inval events").

This scheme has two really nice properties:

First, it can be extended fairly easily to not store a simple flag, but a list of OIDs, and use that to make the flushing and blocking more fine-grained - without changing the basic way in which things work. And second, it stores all information needed in the *commit* record. That removes problems with transactions that are aborted due to a crash, where the WAL therefore never tells us that they were aborted.

VACUUM FULL will need some special care - but at least VACUUM FULL is already such a disruptive operation that it probably won't surprise anyone if it's disruptive on the slave too. (And now that CLUSTER is MVCC-safe, from what I read, the use case for VACUUM FULL seems to be pretty slim.)

The btree metapage caching will need special treatment too - probably some flags in the WAL records that change the metapage, instructing the slave to synthesize a suitable inval event.
What stays open is regular VACUUM (and maybe inline vacuuming - is that part of the latest version of HOT, or did it get removed?). Here, the key is logging the return value of GetOldestXmin(), I think. Since that value is what decides in the end whether a tuple can be killed or not, having it available on the slave should allow the slave to block replay until no slave query depends on the tuple anymore.

greetings, Florian Pflug
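P.S.: In (pseudo-)code, the replay side of that plan would look roughly like this. Every name below except InvalidateSystemCaches() and TransactionIdPrecedes() is made up:

    /* Replaying a commit record on the slave (sketch) */
    if (xlrec->xinfo & XACT_COMPLETION_HELD_EXCLUSIVE_LOCK)
        WaitUntilNoSlaveQueriesRunning();          /* made up */

    /* ... replay the commit record itself ... */

    if (xlrec->xinfo & XACT_COMPLETION_HAD_INVAL_EVENTS)
        InvalidateSystemCaches();

    /*
     * For plain VACUUM the idea is analogous: WAL-log the master's
     * GetOldestXmin() result and block replay while some running slave
     * query could still see an older tuple.
     */
    while (TransactionIdPrecedes(GetSlaveOldestQueryXmin(),    /* made up */
                                 xlrec_vacuum->oldest_xmin))   /* made up */
        WaitForSomeSlaveQueryToFinish();                       /* made up */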
"Florian G. Pflug" <fgp@phlo.org> writes: > I came up with the following plan for both inval events and locks > .) Store two flags in the commit record of a transaction, for > "transaction generated inval events" and "transaction held an > access exlusive lock". > .) Upon replay, block until no transactions are running (for > "transaction held an exclusive lock") before replaying the > record, or flush the caches after replaying it (for > "transaction generated inval events"). This does not work; the lock has to be taken earlier than that. (See for instance VACUUM's truncate calls.) Not to mention that you have converted "exclusive lock on one table" to "exclusive lock on every table", which is even worse than the idea of converting per-table cache flushes to system-wide ones. regards, tom lane
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> I came up with the following plan for both inval events and locks:
>> .) Store two flags in the commit record of a transaction, for
>> "transaction generated inval events" and "transaction held an
>> access exclusive lock".
>> .) Upon replay, block until no transactions are running (for
>> "transaction held an exclusive lock") before replaying the
>> record, or flush the caches after replaying it (for
>> "transaction generated inval events").
>
> This does not work; the lock has to be taken earlier than that.
> (See for instance VACUUM's truncate calls.) Not to mention that
> you have converted "exclusive lock on one table" to "exclusive lock
> on every table", which is even worse than the idea of converting
> per-table cache flushes to system-wide ones.

I'll check what VACUUM is doing - I primarily had CLUSTER and TRUNCATE in mind.

That "exclusive lock on one table becomes exclusive lock on all tables" issue can (as I wrote in the part of my mail that you snipped) be solved, I think, by storing a list of OIDs instead of a flag for the locks and inval events.

greetings, Florian Pflug
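P.S.: Concretely, the commit record could carry those OID lists the same way it already carries the list of relfilenodes to drop. A sketch - the two new fields are made up; xtime/nrels/nsubxacts really do work this way in xl_xact_commit:

    /* Extended commit record (sketch) */
    typedef struct xl_xact_commit
    {
        time_t  xtime;
        int     nrels;        /* number of RelFileNodes to drop */
        int     nsubxacts;    /* number of subtransaction XIDs */
        int     ninvalrels;   /* NEW: rels that had inval events (made up) */
        int     nlockedrels;  /* NEW: rels held AccessExclusive (made up) */
        /* variable-length arrays follow, in the order of the counts */
    } xl_xact_commit;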
Simon Riggs wrote:
> On Sun, 2007-07-22 at 19:58 +0200, Florian G. Pflug wrote:
>> Tom Lane wrote:
>>> "Florian G. Pflug" <fgp@phlo.org> writes:
>>>> I'm currently working on correctly flushing the
>>>> catalog/relation/smgr caches on a readonly PITR
>>>> slave during recovery.
>>> I don't believe there is any workable solution to that short of logging
>>> cache-flush operations in WAL.
>
>> The reason that I dislike WAL-logging of the flush operations so much is
>> that, since people are concerned about the amount of WAL traffic
>> postgres generates, such a solution would introduce yet another GUC.
>> And to make this reasonably foolproof, the slave would need a way to
>> detect if that GUC is set correctly on the master. All in all, that
>> seems to be quite hackish...
>
> Seems like we should WAL-log flush operations first. It's fairly
> straightforward to do that, and we can then measure its effect on the
> primary easily enough. Your other suggestions seem much more complex.
>
> I think we have a reasonable tolerance for increases in WAL and, as you
> said earlier, we may balance that out with other optimisations. Or we
> may find a more efficient way of doing it later.
>
> Let's aim to get that first query running, then go back and tune it
> later.

I've so far added an LWLock that makes replay and queries mutually exclusive. Simple test cases seem to work, but I haven't really beaten the system yet... Of course, my current version falls over as soon as you do DDL on the master - I'm working on fixing that, and on subsequently removing that lock again :-)

greetings, Florian Pflug
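P.S.: The interlock is as dumb as it sounds - a sketch (RecoveryLock is a lock I added, not an existing one; LWLockAcquire/LWLockRelease and the rmgr redo call are the real APIs):

    /* In the WAL replay loop (sketch): replay excludes all queries */
    LWLockAcquire(RecoveryLock, LW_EXCLUSIVE);
    RmgrTable[record->xl_rmid].rm_redo(lsn, record);  /* redo one record */
    LWLockRelease(RecoveryLock);

    /* Around query execution on the slave (sketch): queries share it */
    LWLockAcquire(RecoveryLock, LW_SHARED);
    /* ... run the query ... */
    LWLockRelease(RecoveryLock);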