Re: Temporary tables under hot standby - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Temporary tables under hot standby |
Msg-id | CA+Tgmob=L1k0cpXRcipdsaE07ok+On=tTjRiw7FtD_D2T=Jwhg@mail.gmail.com |
In response to | Temporary tables under hot standby (Noah Misch <noah@leadboat.com>) |
Responses | Re: Temporary tables under hot standby |
List | pgsql-hackers |
On Tue, Apr 24, 2012 at 11:55 PM, Noah Misch <noah@leadboat.com> wrote:
> A key barrier to migrations from trigger-based replication to WAL-based
> replication is the lack of temporary tables under hot standby. I'd like to
> close that gap; the changes needed will also reduce the master-side cost of
> temporary table usage. Here is a high-level design for your advice and
> comments. Much of this builds on ideas from past postings noted below.
>
> Our temporary tables are cataloged and filled like permanent tables. This has
> the major advantage of making most code operate on tables with minimal regard
> for their relpersistence. It also brings disadvantages:
>
> 1. High catalog turnover in rapid create/drop workloads. Heavy temporary
> table users often need to REINDEX relation-oriented catalogs. Hot standby
> cannot assign OIDs or modify system catalogs at all.
> 2. Consumption of permanent XIDs for DML on the table. This increases COMMIT
> cost on the master and is a non-starter under hot standby.
> 3. autovacuum cannot VACUUM temporary tables, but their relfrozenxid values
> delay pg_clog truncation and can trigger a wraparound-prevention shutdown.
> 4. sinval traffic from every CREATE TEMP TABLE et al.
> 5. CREATE TEMP TABLE records AccessExclusiveLock in the global lock table.
> 6. We don't automatically drop temporary tables that existed at the point of a
> crash, because they look much like permanent tables.

#6 is already fixed in 9.1. Temporary tables now have names like
tBACKENDID_RELFILENODE, and are cleaned up during start-up in exactly the
same way you propose to do it further down.

> To resolve points 2 and 3, let's change the XID values stored in temporary
> tables from regular TransactionId to LocalTransactionId. This entails an lxid
> counterpart for clog.c. Functions like GetCurrentTransactionId() and
> HeapTupleSatisfiesVisibility() will take a Relation argument to identify the
> XID type. One open question is whether to add conditional logic to functions
> like HeapTupleSatisfiesMVCC() or to have parallel implementations like
> HeapTupleSatisfiesMVCCLocal(). I lean toward the latter, perhaps with the
> help of some code generation. I don't think a counterpart for pg_subtrans
> will be necessary; the backend knows its own XID tree, and the
> TRANSACTION_STATUS_SUB_COMMITTED interlock is superfluous with only one
> backend as reader and writer. I'm also thinking the local clog can live
> strictly in memory; a session that retains a temporary table across 2B local
> transactions can afford 512 MiB of RAM. With this change, VACUUM can ignore
> relfrozenxid of temporary tables when calculating a new datfrozenxid. This
> change can form an independent patch.

Agreed. If you can pull it off, this will be a nice improvement regardless
of what happens with the rest of this, and it makes sense to do it as a
separate patch.

I don't yet have a strong opinion on what to do with the
HeapTupleSatisfies* functions, but I suspect you're correct in thinking
that separate functions are better. For one thing, those functions are
very much performance-critical, so introducing extra branches is something
to avoid.
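To make the 512 MiB estimate concrete: at the regular clog's two status
bits per transaction, 2^31 (about 2B) local transactions need 2^32 bits =
2^29 bytes = 512 MiB. Below is a minimal standalone sketch of what a
strictly-in-memory local clog could look like, assuming the same
two-bits-per-xid encoding. All names here (LocalClog, lclog_get_status,
lclog_set_status) are hypothetical, not existing PostgreSQL APIs, and a
real patch would presumably build on slru.c-style pages rather than one
flat array. With only one backend as reader and writer, no locking is
needed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LCLOG_XACTS_PER_BYTE 4          /* 2 status bits per transaction */

typedef uint32_t LocalTransactionId;    /* hypothetical stand-in */

typedef enum LocalXidStatus
{
    LCLOG_IN_PROGRESS = 0,
    LCLOG_COMMITTED = 1,
    LCLOG_ABORTED = 2
} LocalXidStatus;

typedef struct LocalClog
{
    uint8_t    *bits;                   /* backend-local, never WAL-logged */
    size_t      nbytes;
} LocalClog;

static LocalXidStatus
lclog_get_status(const LocalClog *clog, LocalTransactionId lxid)
{
    size_t      byteno = lxid / LCLOG_XACTS_PER_BYTE;
    unsigned    shift = (lxid % LCLOG_XACTS_PER_BYTE) * 2;

    /* Bytes we never extended into hold only in-progress/unused slots. */
    if (byteno >= clog->nbytes)
        return LCLOG_IN_PROGRESS;
    return (LocalXidStatus) ((clog->bits[byteno] >> shift) & 0x03);
}

static void
lclog_set_status(LocalClog *clog, LocalTransactionId lxid, LocalXidStatus st)
{
    size_t      byteno = lxid / LCLOG_XACTS_PER_BYTE;
    unsigned    shift = (lxid % LCLOG_XACTS_PER_BYTE) * 2;

    if (byteno >= clog->nbytes)
    {
        /* Grow lazily; a real implementation would allocate in pages. */
        size_t      newsize = byteno + 8192;
        uint8_t    *newbits = realloc(clog->bits, newsize);

        if (newbits == NULL)
            abort();                    /* a backend would elog(ERROR) */
        memset(newbits + clog->nbytes, 0, newsize - clog->nbytes);
        clog->bits = newbits;
        clog->nbytes = newsize;
    }
    clog->bits[byteno] &= (uint8_t) ~(0x03u << shift);
    clog->bits[byteno] |= (uint8_t) (st << shift);
}
```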
Maybe this is a silly idea, but if you're thinking about creating a local
XID space and a global XID space, it might be a good idea to also make
allowance for an "unlogged" XID space - that is, an XID space that is
global to all backends but need not survive crashes. This would
potentially allow unlogged tables to be used in HS mode. I would expect
that you probably don't want to go as far as actually trying to make this
work as part of your current project, but maybe as you're structuring the
code it would be worth considering the possibility that we'll eventually
want >2 XID spaces, either for this or other reasons.

> I do not see a clean behind-the-scenes fix for points 1, 4 and 5. We can
> resolve those by adding a new variety of temporary table, one coincidentally
> matching the SQL standard's notion of a temporary table. The developer will
> declare it once, after which all sessions observe it as an initially-empty
> table whose contents remain local to the session. Most relation catalog
> entries, including all OIDs, are readily sharable among sessions. The
> exceptions are relpages, reltuples, relallvisible, relfrozenxid, and
> pg_statistic rows. I will handle the pg_class columns by introducing new
> backend APIs abstracting them. Those APIs will consult the relcache for
> permanent tables and a local-memory hash for temporary tables. For
> statistics, add a new catalog pg_temp_statistic, an inheritance child of
> pg_statistic and itself one of these new-variety temporary tables.
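As a rough illustration of the local-memory hash described above, the
per-backend overrides for the non-sharable pg_class columns could be
shaped like the sketch below. Everything here (LocalRelStats,
get_rel_pages, the is_global_temp flag) is hypothetical naming for
illustration only; an actual patch would hang this off the relcache and
use dynahash rather than a toy linked list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;                   /* stand-in for PostgreSQL's Oid */

/* Per-backend values of the pg_class columns that cannot be shared. */
typedef struct LocalRelStats
{
    Oid         relid;
    int32_t     relpages;
    double      reltuples;
    int32_t     relallvisible;
    uint32_t    relfrozenlxid;          /* local-xid analog of relfrozenxid */
    struct LocalRelStats *next;         /* toy chaining; real code: HTAB */
} LocalRelStats;

static LocalRelStats *local_rel_stats = NULL;   /* one list per backend */

static LocalRelStats *
local_stats_lookup(Oid relid)
{
    for (LocalRelStats *s = local_rel_stats; s != NULL; s = s->next)
        if (s->relid == relid)
            return s;
    return NULL;
}

/*
 * The proposed abstraction: callers go through an accessor instead of
 * reading pg_class directly.  For a global temp table the answer comes
 * from backend-local memory; otherwise it is whatever the shared catalog
 * says, passed in here only to keep the example standalone.
 */
static int32_t
get_rel_pages(Oid relid, bool is_global_temp, int32_t catalog_relpages)
{
    if (is_global_temp)
    {
        LocalRelStats *s = local_stats_lookup(relid);

        /* No entry yet: this session has not touched the table. */
        return s ? s->relpages : 0;
    }
    return catalog_relpages;
}
```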
With respect to problem #5, I've been wondering if we couldn't just forget
about taking AccessExclusiveLock when first creating a table (temporary or
permanent). Unless and until the transaction commits, nobody can see the
catalog entry anyway, so nobody else is going to attempt to take a
conflicting lock. So why bother? Maybe there's some reason here that's not
obvious to me. If not, it might be worth doing on general principle
independent of this project.

Problem #4 is a little stickier, but I think also solvable. Basically all
cross-backend access to temp tables ought to be prohibited anyway, but it
currently isn't, because there's at least one cross-backend operation that
we categorically need to support: DROP. Separating the global and local
XID spaces would help with that, but not really all that much, since if a
session manages to exit without cleaning up after itself, we'd like
someone to be able to do that later (and maybe more aggressively than what
we do now, which is to only clean things up after 2B txns or so).
Nevertheless, it's currently possible for backend A to build a relcache
entry for backend B's temporary relation, and it might be that if we
prohibited that and maybe a few other things we could see our way clear to
removing most or all of the sinval traffic here. Still, it may not really
help that much without a solution to problem #1.

On that note, I had a thought in the past that it might be possible to
solve problem #1 by using global temp tables as system catalogs - that is,
for each type of system catalog that relates to table creation, you'd have
a permanent catalog and a global temp catalog. So if someone wants to
create a temporary table of the existing variety on the standby, you can
make all the entries in the global-temp version of pg_class, pg_attribute,
etc. However, this seems extremely difficult to manage in general -
there's a lot of code churn involved, and also possible temporary ->
permanent dependencies; for example, the temporary table might have a
pg_attrdef entry that needs to depend on a non-temporary pg_proc entry.
That's tricky to solve on the master and even trickier to solve in HS
operation.

So I'm inclined to agree with you that it makes more sense to just aim to
support global temp tables in HS mode, and if we want to beat our head
against the brick wall of making regular temp tables work there
eventually, that can be a later project.

I would suggest just not worrying about the statistical stuff for the
first version of the patch. Assume that it's adequate to have one set of
statistics for all copies of the table, both pg_statistic entries and the
stats-related stuff in pg_class (relpages, reltuples, relallvisible). It's
not unreasonable to think that the table will be used in broadly similar
ways across all backends, so in some cases this might actually give better
performance than what you're proposing to do. If not, you can solve that
problem in a future patch. This project is hard enough without throwing
that problem onto the pile, and I really think that if you do throw it on
the pile you're going to be adding a lot of complexity and code churn that
isn't really necessary for a v1.

> Past discussions have raised the issue of interaction between commands like
> ALTER TABLE and sessions using the new-variety temporary table. As a first
> cut, let's keep this simple and have ongoing use of the table block operations
> requiring AccessExclusiveLock. Note that you can always just make a new
> temporary table with a different name to deploy a change quickly. Implement
> this with a heavyweight lock having a novel life cycle. When a session first
> takes an ordinary relation lock on the table, acquire the longer-term lock and
> schedule it for release on transaction abort. On TRUNCATE, schedule a release
> on transaction commit. Of course, also release the lock at session end.

I'm not sure I believe this will work, but maybe I'm just not
understanding it.

> For new-variety temporary tables, change file names from "relfilenode[_fork]"
> to "relfilenode[_fork].pid.localnode". During crash recovery, delete all files
> conforming to that pattern for relfilenodes of known temporary tables. This
> also lets backends share pg_class.relfilenode. The "localnode" field would
> distinguish multiple generations of a table across VACUUM FULL, CLUSTER, and
> TRUNCATE. We could build on this strategy to safely resolve point 6 for the
> existing kind of temporary table, but that entails enough other details to
> probably not mix it into the same patch.

In lieu of including localnode in the filename, I would suggest that for a
global temp table, we set relfilenode = 0, and let each backend assign one
(and reassign new ones) whenever it feels like it. The mapping from reloid
-> relfilenode can be kept in backend-local memory, or you can keep a
<backend id, rel oid> -> relfilenode mapping in a separate relation fork.
The latter would have the additional advantage of simplifying cleanup and
might also be helpful in detecting when the table is or is not in use by
multiple backends.
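A sketch of the backend-local variant of that suggestion, under the
assumption of a simple per-backend counter for filenode assignment (the
names get_local_relfilenode and reassign_local_relfilenode are
hypothetical, invented for this example): pg_class permanently stores
relfilenode = 0 for a global temp table, each backend hands out its own
filenode the first time it touches the table, and TRUNCATE, CLUSTER, or
VACUUM FULL just issue a fresh one.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Oid;                   /* stand-in for PostgreSQL's Oid */

/* Backend-local reloid -> relfilenode mapping for global temp tables. */
typedef struct LocalFilenodeEntry
{
    Oid         relid;
    Oid         relfilenode;
    struct LocalFilenodeEntry *next;    /* toy chaining; real code: HTAB */
} LocalFilenodeEntry;

static LocalFilenodeEntry *filenode_map = NULL;
static Oid  next_local_filenode = 1;    /* per-backend counter */

/*
 * Return this backend's relfilenode for a global temp table whose
 * pg_class entry says relfilenode = 0, assigning one on first use.
 */
static Oid
get_local_relfilenode(Oid relid)
{
    LocalFilenodeEntry *e;

    for (e = filenode_map; e != NULL; e = e->next)
        if (e->relid == relid)
            return e->relfilenode;

    e = malloc(sizeof(LocalFilenodeEntry));
    if (e == NULL)
        abort();                        /* a backend would elog(ERROR) */
    e->relid = relid;
    e->relfilenode = next_local_filenode++;
    e->next = filenode_map;
    filenode_map = e;
    return e->relfilenode;
}

/* TRUNCATE et al. assign a fresh filenode; the old file gets unlinked. */
static Oid
reassign_local_relfilenode(Oid relid)
{
    for (LocalFilenodeEntry *e = filenode_map; e != NULL; e = e->next)
    {
        if (e->relid == relid)
        {
            e->relfilenode = next_local_filenode++;
            return e->relfilenode;
        }
    }
    return get_local_relfilenode(relid);
}
```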
> How important is support for VACUUM on these tables under hot standby? The
> alternative is to fail when a session retains a temporary table across 2B
> local transactions. I do not currently see any challenges sufficient to
> motivate not supporting VACUUM, but it might be a useful simplification to
> keep in mind. What about ANALYZE support; how important is the ability to
> collect statistics on temporary tables? Again, I tentatively expect to
> support it regardless of the answer.

I think it's probably pretty important to support VACUUM, because even
ignoring wraparound considerations, not vacuuming tends to cause
performance to suck. I think ANALYZE is less important for the reasons
stated above.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company