Thread: Re: [PATCHES] Non-transactional pg_class, try 2
[ moving to -hackers to get some more eyeballs on the question ] Simon Riggs <simon@2ndquadrant.com> writes: > On Sun, 2006-06-11 at 17:53 -0400, Alvaro Herrera wrote: >> Here I repost the patch to implement non-transactional catalogs, the >> first of which is pg_ntclass, intended to hold the non-transactional >> info about pg_class (reltuples, relpages). > Will a user be able to update reltuples and relpages manually? No, which is a tad annoying now that you mention it. I'm not sure that there's any very good reason for users to want to do that, though. Once or twice I've hacked those fields manually to set up test cases for the planner, which is why I'd be annoyed to lose the ability --- but does it really matter to users? (Especially in view of the fact that the planner no longer trusts relpages anyway.) It does seem like rather a lot of mechanism and overhead though, especially in view of Alvaro's worries about the non-cacheability of pg_class_nt rows. I wonder whether we shouldn't take two steps back and rethink. The main thing we are trying to accomplish here is to decouple transactional and nontransactional updates to a pg_class row. Is there another way to do that? Do we need complete decoupling? It strikes me that the only case where we absolutely must not lose a nontransactional update is where we are un-freezing a frozen rel. If we could guarantee that un-freezing happens before any transactional update within a particular transaction, then maybe we could have that. Manual updates to pg_class seem like they'd risk breaking such a guarantee, but maybe there's a way around that. Personally I'd be willing to live with commands that try to modify a frozen rel erroring out if they see the current pg_class row is uncommitted. regards, tom lane
On Mon, 2006-06-12 at 19:15 -0400, Tom Lane wrote: > [ moving to -hackers to get some more eyeballs on the question ] > > Simon Riggs <simon@2ndquadrant.com> writes: > > On Sun, 2006-06-11 at 17:53 -0400, Alvaro Herrera wrote: > >> Here I repost the patch to implement non-transactional catalogs, the > >> first of which is pg_ntclass, intended to hold the non-transactional > >> info about pg_class (reltuples, relpages). > > > Will a user be able to update reltuples and relpages manually? > > No, which is a tad annoying now that you mention it. I'm not sure that > there's any very good reason for users to want to do that, though. Once > or twice I've hacked those fields manually to set up test cases for the > planner, which is why I'd be annoyed to lose the ability --- but does it > really matter to users? (Especially in view of the fact that the > planner no longer trusts relpages anyway.) No need to have an SQL route. A special function call would suffice. I'd like to be able to set up a test database that has the statistics copied from the live system. A schema only pg_dump with mods is all I need, but it sounds like we're moving away from that. We can then perform various what-ifs on the design. Elsewhere, it has been discussed that we might hold the number of blocks in a relation in shared memory. Does that idea now fall down, or is it complementary to this? i.e. would we replace ANALYZE's relpages with an accurate relpages for the planner? > It does seem like rather a lot of mechanism and overhead though, > especially in view of Alvaro's worries about the non-cacheability of > pg_class_nt rows. I wonder whether we shouldn't take two steps back > and rethink. Review, yes. Could still be the best way. > The main thing we are trying to accomplish here is to decouple > transactional and nontransactional updates to a pg_class row. With the goal being avoiding table bloat?? > Is there another way to do that? Do we need complete decoupling? 
> It strikes me that the only case where we absolutely must not lose a > nontransactional update is where we are un-freezing a frozen rel. Not sure why you'd want to do that, assuming I've understood you. For me, freezing is last step before writing to WORM media, so there is never an unfreeze step. > If we could guarantee that un-freezing happens before any transactional > update within a particular transaction, then maybe we could have that. > Manual updates to pg_class seem like they'd risk breaking such a > guarantee, but maybe there's a way around that. Personally I'd be > willing to live with commands that try to modify a frozen rel erroring > out if they see the current pg_class row is uncommitted. Sounds OK. It's a major state change after all. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > Elsewhere, it has been discussed that we might hold the number of blocks > in a relation in shared memory. Does that idea now fall down, or is it > complementary to this? It's been the case for some time that the planner uses RelationGetNumberOfBlocks() to determine true rel size. The only reason relpages is still stored at all is that it's used to approximate true number of tuples via true_ntuples = (reltuples/relpages) * true_npages, i.e., assuming that the tuple density is still what it was at the last VACUUM or ANALYZE. So you can't fool the system with a totally made-up relation size anyway. (This too is moderately annoying for planner testing, but it seems the only way to get the planner to react when a table's been filled without an immediate vacuum/analyze.) The only point of tracking rel size in shared memory would be to avoid the costs of lseek() kernel calls in RelationGetNumberOfBlocks. >> The main thing we are trying to accomplish here is to decouple >> transactional and nontransactional updates to a pg_class row. > With the goal being avoiding table bloat?? No, with the goal being correctness. If you have a freeze/unfreeze mechanism then unfreezing a relation is an action that must NOT be rolled back if your transaction (or any other one for that matter) later aborts. The tuples you put into it meanwhile need to be vacuumed anyway. So you can't mark it unfrozen in an uncommitted pg_class entry that might never become committed. > For me, freezing is last step before writing to WORM media, so there is > never an unfreeze step. That is not what Alvaro is after. Nor anyone else here. I have not heard anyone mention WORM media for Postgres in *years*. It strikes me though that automatic UNFREEZE isn't necessarily the requirement. What if VACUUM FREEZE causes the table to become effectively read-only, and you need an explicit UNFREEZE command to put it back into a read-write state? 
Then UNFREEZE could be a transactional operation, and most of these issues go away. The case where this doesn't work conveniently is copying a frozen database (viz template0), but maybe biting the bullet and finding a way to do prep work in a freshly made database is the answer for that. We've certainly seen plenty of other possible uses for post-CREATE processing in a new database. Another reason for not doing unfreeze automatically is that as the patch stands, any database user can force unfreezing of any table, whether he has any access rights on it or not (because the LockTable will happen before we check access rights, I believe). This is probably Not Good. Ideally I think FREEZE/UNFREEZE would be owner-permission-required. regards, tom lane
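The density extrapolation Tom describes — reltuples/relpages from the last VACUUM or ANALYZE, scaled by the relation's current block count — can be sketched as follows. This is an illustrative function, not the actual planner code:

```python
def estimate_ntuples(reltuples, relpages, true_npages):
    """Approximate the current tuple count from the tuple density
    observed at the last VACUUM/ANALYZE, scaled by the relation's
    true size (as RelationGetNumberOfBlocks() would report it)."""
    if relpages == 0:
        # no recorded pages to derive a density from; nothing to scale
        return float(reltuples)
    return (reltuples / relpages) * true_npages
```

For example, a table last analyzed at 1000 tuples in 10 pages that has since grown to 25 pages is estimated at 2500 tuples — which is why a made-up relpages alone cannot fool the planner about relation size, only about density.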
On Tue, 2006-06-13 at 10:02 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Elsewhere, it has been discussed that we might hold the number of blocks > > in a relation in shared memory. Does that idea now fall down, or is it > > complementary to this? > > It's been the case for some time that the planner uses > RelationGetNumberOfBlocks() to determine true rel size. The only reason > relpages is still stored at all is that it's used to approximate true > number of tuples via > true_ntuples = (reltuples/relpages) * true_npages > ie, assuming that the tuple density is still what it was at the last > VACUUM or ANALYZE. So you can't fool the system with a totally made-up > relation size anyway. (This too is moderately annoying for planner > testing, but it seems the only way to get the planner to react when a > table's been filled without an immediate vacuum/analyze.) > > The only point of tracking rel size in shared memory would be to avoid > the costs of lseek() kernel calls in RelationGetNumberOfBlocks. Yes, understood. With the second point to allow them to be separately set for PGSQL developer testing of optimizer, and application dev testing of SQL and/or what/if scenarios. > >> The main thing we are trying to accomplish here is to decouple > >> transactional and nontransactional updates to a pg_class row. > > > With the goal being avoiding table bloat?? > > No, with the goal being correctness. If you have a freeze/unfreeze > mechanism then unfreezing a relation is an action that must NOT be > rolled back if your transaction (or any other one for that matter) later > aborts. The tuples you put into it meanwhile need to be vacuumed anyway. > So you can't mark it unfrozen in an uncommitted pg_class entry that > might never become committed. > > > For me, freezing is last step before writing to WORM media, so there is > > never an unfreeze step. > > That is not what Alvaro is after. Nor anyone else here. So what is unfreeze for again? 
> I have not > heard anyone mention WORM media for Postgres in *years*. Oh? Big requirements for archive these days, much more so than before. This will allow years of data in a seamless on-line/near-line partitioned table set. Lots of people want that: .gov, .mil, .com More modern equivalent: a MAID archive system for WORO data > It strikes me though that automatic UNFREEZE isn't necessarily the > requirement. What if VACUUM FREEZE causes the table to become > effectively read-only, and you need an explicit UNFREEZE command to > put it back into a read-write state? Then UNFREEZE could be a > transactional operation, and most of these issues go away. That works for me. Very much preferred. > The case > where this doesn't work conveniently is copying a frozen database > (viz template0), but maybe biting the bullet and finding a way to do > prep work in a freshly made database is the answer for that. We've > certainly seen plenty of other possible uses for post-CREATE processing > in a new database. > > Another reason for not doing unfreeze automatically is that as the patch > stands, any database user can force unfreezing of any table, whether he > has any access rights on it or not (because the LockTable will happen > before we check access rights, I believe). This is probably Not Good. > Ideally I think FREEZE/UNFREEZE would be owner-permission-required. Seems like a plan. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Ok, let's step back to discuss this again. Sorry for the length -- this is a description of the problem I'm trying to solve, the issues I found, and how I tried to solve them. The relminxid Patch =================== What I'm after is not freezing for read-only media, nor archive, nor read-only tables. What I'm after is removing the requirement that all databases must be vacuumed wholly every 2 billion transactions. Now, why do we need to vacuum whole databases at a time? The Transaction Id Counter ========================== We know that the Xid counter is weird; it cycles, for starters, and also there are special values at the "start" of the cycle that are lesser than all other values (BootstrapXid, FrozenXid). The idea here is to allow the counter to wrap around and old tuples not be affected, i.e., appear like they were committed in some distant past. So we use the special Xid values to mark special stuff, like tuples created by the bootstrap processing (which are always known to be good) or tuples in template databases that are not connectable ("frozen" databases). We also use FrozenXid to mark tuples that are very old, i.e. were committed a long time ago and never deleted. Any such tuple is unaffected by the status of the Xid counter. It should be clear that we must ensure that after a suitable amount of "time" (measured in advancement of the Xid counter) has passed, we should change the old Xids in tuples to the special FrozenXid value. The requirement for whole-database vacuuming is there because we need to ensure that this is done in all the tables in the database. We keep track of a "minimum Xid", call it minxid. The Xid generator refuses to assign a new Xid counter if this minxid is too far in the past, because we'd risk causing Xid-wraparound data loss if we did; the Xid comparison semantics would start behaving funny, and some tuples that appeared to be alive not many transactions ago now suddenly appear dead. 
Clearly, it's important that before we advance this minxid we ensure that all tables in the database have gone through the process of changing all regular Xids into FrozenXid. Currently the only way to ensure that all tables have gone through this process is processing them in a single VACUUM pass. Skip even one table, and you can forget about advancing the minxid. Even if the skipped table was vacuumed in the transaction just before this one. Even if the table is fully frozen, i.e., all tuples on it are marked with FrozenXid. Even if the table is empty. Tracking minxid Per Table ========================= So, my idea is to track this minxid per table. To do this, I added a column to pg_class called relminxid. The minimum of it across a database is used to determine each database's minimum, datminxid. The minimum of all databases is used to advance the global minimum Xid counter. So, if a table has 3 tuples whose Xmins are 42, 512 and FrozenXid, the relminxid is 42. If we keep track of all these religiously during vacuum, we know exactly what is the minxid we should apply to this particular table. It is obvious that vacuuming one table can set the minimum for that table. So when the vacuuming is done, we can recalculate the database minimum; and using the minima of all databases, we can advance the global minimum Xid counter and truncate pg_clog. We can do this on each single-table vacuum -- so, no more need for database-wide vacuuming. If a table is empty, or all tuples on it are frozen, then we must mark the table with relminxid = RecentXmin. This is because there could be an open transaction that writes a new tuple to the table after the vacuum is finished. A newly created table must also be created with relminxid = RecentXmin. Because of this, we never mark a table with relminxid = FrozenXid. Template Databases ================== Up to this point everything is relatively simple. Here is where the strange problems appear. The main issue is template databases. 
Why are template databases special? Because they are never vacuumed. More generally, we assume that every database that is marked as "datallowconn = false" is fully frozen, i.e. all tables on it are frozen. Autovacuum skips them. VACUUM ignores them. The minxid calculations ignore them. They are fully frozen so they don't matter and they don't harm anybody. That's fine and dandy until you realize what happens when you freeze a database, let a couple billion transactions pass, and then create a database using that as a template (or just "reallow connections" to a database). Because all the tables were frozen 2 billion transactions ago, they are marked with an old relminxid, so as soon as you vacuum any table, the minxid computations go to hell, and we have a DoS condition. So, we have to do something to cope with frozen databases. I see two ways: 1. Remove the special case, i.e., process frozen databases in VACUUM like every other database. This is the easiest, because no extra logic is needed. Just make sure they are vacuumed in time. The only problem would be that we'd need to uselessly vacuum tables that we know are frozen, from time to time. But then, those tables are probably small, so what's the problem with that? 2. Mark frozen databases specially somehow. To mark databases frozen, we need a way to mark tables as frozen. How do we do that? As I explain below, this allows some nice optimizations, but it's a very tiny can full of a huge amount of worms. Marking a Table Frozen ====================== Marking a table frozen is as simple as setting relminxid = FrozenXid for a table. As explained above, this cannot be done in a regular postmaster environment, because a concurrent transaction could be doing nasty stuff to a table. So we can do it only in a standalone backend. On the other hand, a "frozen" table must be marked with relminxid = a-regular-Xid as soon as a transaction writes some tuples on it. 
Note that this "unfreezing" must take place even if the offending transaction is aborted, because the Xid is written in the table nevertheless and thus it would be incorrect to lose the unfreezing. This is how pg_class_nt came into existence -- it would be a place where information about a table would be stored and not subject to the rolling back of the transaction that wrote it. So if you find that a table is frozen, you write an unfreezing into its pg_class_nt tuple, and that's it. Nice optimization: if we detect that a table is fully frozen, then VACUUM is a no-op (not VACUUM FULL), because by definition there are no tuples to remove. Another optimization: if we are sure that unfreezing works, we can even mark a table as frozen in a postmaster environment, as long as we take an ExclusiveLock on the table. Thus we know that the vacuum is the sole transaction concurrently accessing the table; and if another transaction comes about and writes something after we're finished, it'll correctly unfreeze the table and all is well. Where are the problems in this approach? 1. Performance. We'll need to keep a cache of pg_class_nt tuples. This cache must be independent of the current relcache, because the relcache is properly transactional while the pg_class_nt cache must not be. 2. The current implementation puts the unfreezing in LockRelation. This is a problem, because any user can cause a LockRelation on any table, even if the user does not have access to that table. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
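The "funny" comparison semantics Alvaro alludes to come from xids living on a ring. A minimal sketch of wraparound-aware comparison in the style of PostgreSQL's TransactionIdPrecedes (special-xid handling simplified for illustration):

```python
FIRST_NORMAL_XID = 3  # 0 = Invalid, 1 = Bootstrap, 2 = Frozen

def xid_precedes(a, b):
    """True if xid a logically precedes xid b on the 2^32 ring."""
    if a < FIRST_NORMAL_XID or b < FIRST_NORMAL_XID:
        return a < b  # special xids compare by raw value
    diff = (a - b) & 0xFFFFFFFF
    # negative when interpreted as signed 32-bit means a precedes b
    return diff >= 0x80000000
```

Under this scheme each normal xid sees roughly half the ring as "past" and half as "future", which is exactly why minxid must be advanced before the counter gets about 2 billion xids ahead of the oldest unfrozen tuple.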
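Alvaro's per-table tracking can be sketched as below. Plain integer min is used for illustration only — real code must compare xids with the wraparound-aware semantics — and recent_xmin is just a parameter standing in for RecentXmin:

```python
FROZEN_XID = 2

def relminxid(tuple_xmins, recent_xmin):
    """Per-table minimum: the oldest non-frozen xmin in the table, or
    RecentXmin when the table is empty or fully frozen (an open
    transaction could still write to it after the vacuum finishes)."""
    normal = [x for x in tuple_xmins if x != FROZEN_XID]
    return min(normal) if normal else recent_xmin

def datminxid(relminxids):
    """Database minimum: the oldest relminxid across its tables; the
    minimum across all databases bounds how far pg_clog can be
    truncated and the global minimum Xid advanced."""
    return min(relminxids)
```

With Alvaro's example — a table whose tuples have Xmins 42, 512 and FrozenXid — the relminxid comes out as 42; an empty or fully frozen table gets recent_xmin, never FrozenXid.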
On Mon, 2006-06-26 at 13:58 -0400, Alvaro Herrera wrote: > Ok, let's step back to discuss this again. Sorry for the length -- this > is a description of the problem I'm trying to solve, the issues I found, > and how I tried to solve them. Thanks. This is good. > The relminxid Patch > =================== > > What I'm after is not freezing for read-only media, nor archive, nor > read-only tables. OK, but I am... but I'm happy not to confuse the discussion. > Now, why do we need to vacuum whole databases at a time? > So, we have to do something to cope with frozen databases. I see two > ways: > > 1. Remove the special case, i.e., process frozen databases in VACUUM > like every other database. > This is the easiest, because no extra logic is needed. Just make > sure they are vacuumed in time. The only problem would be that we'd > need to uselessly vacuum tables that we know are frozen, from time to > time. But then, those tables are probably small, so what's the > problem with that? > 2. Mark frozen databases specially somehow. > To mark databases frozen, we need a way to mark tables as frozen. > How do we do that? As I explain below, this allows some nice > optimizations, but it's a very tiny can full of a huge amount of > worms. At this stage you talk about databases, yet below we switch to discussing tables. Not sure why we switched from one to the other. > Marking a Table Frozen > ====================== > > Marking a table frozen is simple as setting relminxid = FrozenXid for a > table. As explained above, this cannot be done in a regular postmaster > environment, because a concurrent transaction could be doing nasty stuff > to a table. So we can do it only in a standalone backend. Surely we just lock the table? No concurrent transactions? > On the other hand, a "frozen" table must be marked with relminxid = > a-regular-Xid as soon as a transaction writes some tuples on it. 
Note > that this "unfreezing" must take place even if the offending transaction > is aborted, because the Xid is written in the table nevertheless and > thus it would be incorrect to lose the unfreezing. > > This is how pg_class_nt came into existence -- it would be a place where > information about a table would be stored and not subject to the rolling > back of the transaction that wrote it. So if you find that a table is > frozen, you write an unfreezing into its pg_class_nt tuple, and that's > it. > > Nice optimization: if we detect that a table is fully frozen, then > VACUUM is a no-op (not VACUUM FULL), because by definition there are no > tuples to remove. Yes please, but we don't need it anymore do we? Guess we need it for backwards compatibility? VACUUM still needs to vacuum every table. > Another optimization: if we are sure that unfreezing works, we can even > mark a table as frozen in a postmaster environment, as long as we take > an ExclusiveLock on the table. Thus we know that the vacuum is the sole > transaction concurrently accessing the table; and if another transaction > comes about and writes something after we're finished, it'll correctly > unfreeze the table and all is well. Why not just have a command to FREEZE and UNFREEZE an object? It can hold an ExclusiveLock, avoiding all issues. Presumably FREEZE and UNFREEZE are rare commands? > Where are the problems in this approach? > > 1. Performance. We'll need to keep a cache of pg_class_nt tuples. This > cache must be independent of the current relcache, because the relcache > is properly transactional while the pg_class_nt cache must not be. > > 2. The current implementation puts the unfreezing in LockRelation. This > is a problem, because any user can cause a LockRelation on any table, > even if the user does not have access to that table. That last bit just sounds horrible to me. But thinking about it: how come any user can lock a relation they shouldn't even be allowed to know exists? 
Possibly OT. I can see other reasons for having pg_class_nt, so having table info cached in shared memory does make sense to me (yet not being part of the strict definitions of the relcache). -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Mon, 2006-06-26 at 13:58 -0400, Alvaro Herrera wrote: > > The relminxid Patch > > =================== > > > > What I'm after is not freezing for read-only media, nor archive, nor > > read-only tables. > > OK, but I am... but I'm happy to not to confuse the discussion. Ok :-) I think I put a note about this but removed it while restructuring the text so it would be clearer. The note is that while I don't care about read-only stuff in this proposal, it may be that read-only tables come as a "side effect" of implementing this. But I agree we should not make the discussion more complex than it already is. > > 2. Mark frozen databases specially somehow. > > To mark databases frozen, we need a way to mark tables as frozen. > > How do we do that? As I explain below, this allows some nice > > optimizations, but it's a very tiny can full of a huge amount of > > worms. > > At this stage you talk about databases, yet below we switch to > discussing tables. Not sure why we switched from one to the other. Sorry, I forgot one step. To mark a database frozen, we must make sure that all tables within that database are frozen as well. So the first step to freezing a database is freezing all its tables. > > Marking a Table Frozen > > ====================== > > > > Marking a table frozen is simple as setting relminxid = FrozenXid for a > > table. As explained above, this cannot be done in a regular postmaster > > environment, because a concurrent transaction could be doing nasty stuff > > to a table. So we can do it only in a standalone backend. > > Surely we just lock the table? No concurrent transactions? No, because a transaction can have been started previously and yet not hold any lock on the table, and write on the table after the vacuum finishes. Or write on an earlier page of the table, after the vacuuming already processed it. 
But here comes one of the "nice points" below, which was that if we acquire a suitable exclusive lock on the table, we _can_ mark it frozen. Of course, this cannot be done by plain vacuum, because we want the table to be still accessible by other transactions. This is where VACUUM FREEZE comes in -- it does the same processing as lazy vacuum, except that it locks the table exclusively and marks it with FrozenXid. > > Nice optimization: if we detect that a table is fully frozen, then > > VACUUM is a no-op (not VACUUM FULL), because by definition there are no > > tuples to remove. > > Yes please, but we don't need it anymore do we? Guess we need it for > backwards compatibility? VACUUM still needs to vacuum every table. Sorry, I don't understand what you mean here. We don't need what anymore? > > Another optimization: if we are sure that unfreezing works, we can even > > mark a table as frozen in a postmaster environment, as long as we take > > an ExclusiveLock on the table. Thus we know that the vacuum is the sole > > transaction concurrently accessing the table; and if another transaction > > comes about and writes something after we're finished, it'll correctly > > unfreeze the table and all is well. > > Why not just have a command to FREEZE and UNFREEZE an object? It can > hold an ExclusiveLock, avoiding all issues. Presumably FREEZE and > UNFREEZE are rare commands? Ok, if I'm following you here, your point is that FREEZE'ing a table sets the relminxid to FrozenXid, and UNFREEZE removes that; and also, in between, no one can write to the table? This seems to make sense. However, I'm not very sure about the FREEZE'ing operation, because we need to make sure the table is really frozen. So we either scan it, or we make sure something else already scanned it; to me what makes the most sense is having a VACUUM option that would do the freezing (and a separate command to do the unfreezing). > > Where are the problems in this approach? > > > > 2. 
> > The current implementation puts the unfreezing in LockRelation.  This
> > is a problem, because any user can cause a LockRelation on any table,
> > even if the user does not have access to that table.
>
> That last bit just sounds horrible to me.  But thinking about it: how
> come any user can lock a relation they shouldn't even be allowed to
> know exists?  Possibly OT.

Hmm, I guess there must be several commands that open the relation and lock it, and then check permissions.  I haven't checked the code, but you shouldn't check permissions before acquiring some kind of lock, and we shouldn't be upgrading locks either.

> I can see other reasons for having pg_class_nt, so having table info
> cached in shared memory does make sense to me (yet not being part of
> the strict definitions of the relcache).

Yeah.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> What I'm after is not freezing for read-only media, nor archive, nor
> read-only tables.  What I'm after is removing the requirement that all
> databases must be vacuumed wholly every 2 billion transactions.

Well, if that's the only goal then I hardly think we need to have a discussion, because your alternative #1 is *obviously* the winner:

> 1. Remove the special case, i.e., process frozen databases in VACUUM
>    like every other database.
>    This is the easiest, because no extra logic is needed.  Just make
>    sure they are vacuumed in time.  The only problem would be that we'd
>    need to uselessly vacuum tables that we know are frozen, from time
>    to time.  But then, those tables are probably small, so what's the
>    problem with that?

> 2. Mark frozen databases specially somehow.
>    To mark databases frozen, we need a way to mark tables as frozen.
>    How do we do that?  As I explain below, this allows some nice
>    optimizations, but it's a very tiny can full of a huge amount of
>    worms.

Avoiding a vacuum pass over template0 once every 2 billion transactions cannot be thought worthy of the amount of complexity and risk entailed in the nontransactional-catalog thing.  Especially since in the normal case those would be read-only passes (the tuples all being frozen already).

So if you want to bring in the other goals that you're trying to pretend aren't there, step right up and do it.  You have not here made a case that would convince anyone that we shouldn't just do #1 and be done with it.

			regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > What I'm after is not freezing for read-only media, nor archive, nor
> > read-only tables.  What I'm after is removing the requirement that
> > all databases must be vacuumed wholly every 2 billion transactions.
>
> Well, if that's the only goal then I hardly think we need to have a
> discussion, because your alternative #1 is *obviously* the winner:
>
> > 1. Remove the special case, i.e., process frozen databases in VACUUM
> >    like every other database.
> >    This is the easiest, because no extra logic is needed.  Just make
> >    sure they are vacuumed in time.  The only problem would be that
> >    we'd need to uselessly vacuum tables that we know are frozen, from
> >    time to time.  But then, those tables are probably small, so
> >    what's the problem with that?

I'm happy to do at least this for 8.2.  We can still try to do the non-transactional catalog later, either in this release or the next; the code is almost there, and it'll be easier to discuss/design because we'll have taken the relminxid stuff out of the way.

So if everyone agrees, I'll do this now.  Beware -- this may make you vacuum databases that you previously weren't vacuuming.  (I really doubt anyone is setting datallowconn=false just to skip vacuuming big databases, but there are people with strange ideas out there.)

> So if you want to bring in the other goals that you're trying to
> pretend aren't there, step right up and do it.  You have not here made
> a case that would convince anyone that we shouldn't just do #1 and be
> done with it.

We can do it in a separate discussion.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Mon, 2006-06-26 at 16:54 -0400, Alvaro Herrera wrote:
> > > Another optimization: if we are sure that unfreezing works, we can
> > > even mark a table as frozen in a postmaster environment, as long as
> > > we take an ExclusiveLock on the table.  Thus we know that the
> > > vacuum is the sole transaction concurrently accessing the table;
> > > and if another transaction comes about and writes something after
> > > we're finished, it'll correctly unfreeze the table and all is well.
> >
> > Why not just have a command to FREEZE and UNFREEZE an object?  It can
> > hold an ExclusiveLock, avoiding all issues.  Presumably FREEZE and
> > UNFREEZE are rare commands?
>
> Ok, if I'm following you here, your point is that FREEZE'ing a table
> sets the relminxid to FrozenXid, and UNFREEZE removes that; and also,
> in between, no one can write to the table?
>
> This seems to make sense.  However, I'm not very sure about the
> FREEZE'ing operation, because we need to make sure the table is really
> frozen.  So we either scan it, or we make sure something else already
> scanned it; to me what makes the most sense is having a VACUUM option
> that would do the freezing (and a separate command to do the
> unfreezing).

Sounds like we're in step here:

VACUUM FREEZE
-- works at either table or database level
-- takes ExclusiveLock, reads all blocks of a table, freezing rows
-- once complete, all write operations are prevented until...

ALTER TABLE xxx UNFREEZE;
ALTER DATABASE xxx UNFREEZE;
-- takes AccessExclusiveLock, allows writes again

CREATE DATABASE automatically does unfreeze after template db copy

Suggest that we prevent write operations on frozen tables by revoking all INSERT, UPDATE or DELETE rights held, then enforcing a check during GRANT to prevent them being re-enabled.  Superusers would need to check every time.  If we don't do this, then we will have two contradictory states marked in the catalog: privileges saying Yes and freezing saying No.
Not sure where pg_class_nt comes in here though, even though I think I still want it.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Very nice explanation, thanks Alvaro.

> 2. Mark frozen databases specially somehow.
>    To mark databases frozen, we need a way to mark tables as frozen.
>    How do we do that?  As I explain below, this allows some nice
>    optimizations, but it's a very tiny can full of a huge amount of
>    worms.
>
> Marking a Table Frozen
> ======================
>
> Marking a table frozen is as simple as setting relminxid =
> FrozenXid for a table.  As explained above, this cannot be
> done in a regular postmaster environment, because a
> concurrent transaction could be doing nasty stuff to a table.
> So we can do it only in a standalone backend.

Unless you lock the table exclusively during vacuum, that could be done with vacuum freeze.  I like that more than changing stuff that is otherwise completely frozen/static.  (I see you wrote that below.)

> On the other hand, a "frozen" table must be marked with
> relminxid = a-regular-Xid as soon as a transaction writes
> some tuples on it.  Note that this "unfreezing" must take
> place even if the offending transaction is aborted, because
> the Xid is written in the table nevertheless and thus it
> would be incorrect to lose the unfreezing.

The other idea was to need a special unfreeze command ...

> This is how pg_class_nt came into existence -- it would be a
> place where information about a table would be stored and not
> subject to the rolling back of the transaction that wrote it.

Oh, that puts it in another league, since it must guarantee commit.  I am not sure we can do that.  The previous discussion was about concurrency and data that was not so important, like tuple count.

In short:
- I'd start with #1 (no relminxid = FrozenXid) like Tom suggested
- and then implement FREEZE/UNFREEZE with exclusive locks like Simon wrote (so it does not need pg_class_nt)
- and use that for the templates.
Simon wrote:
> Suggest that we prevent write operations on frozen tables by revoking
> all INSERT, UPDATE or DELETE rights held, then enforcing a check during
> GRANT to prevent them being re-enabled.  Superusers would need to check
> every time.  If we don't do this, then we will have two contradictory
> states marked in the catalog: privileges saying Yes and freezing saying
> No.

No, I'd not mess with the permissions, and instead return a different error when trying to modify a frozen table.  (It would also be complicated to unfreeze after create database.)  We should make it clear that freezing is no replacement for revoke.

Andreas
On Tue, 2006-06-27 at 10:04 +0200, Zeugswetter Andreas DCP SD wrote:
> Simon wrote:
> > Suggest that we prevent write operations on frozen tables by revoking
> > all INSERT, UPDATE or DELETE rights held, then enforcing a check
> > during GRANT to prevent them being re-enabled.  Superusers would need
> > to check every time.  If we don't do this, then we will have two
> > contradictory states marked in the catalog: privileges saying Yes and
> > freezing saying No.
>
> No, I'd not mess with the permissions and return a different error when
> trying to modify a frozen table.  (It would also be complicated to
> unfreeze after create database.)
> We should make it clear that freezing is no replacement for revoke.

That was with a mind to performance.  Checking every INSERT, UPDATE and DELETE statement to see if it is being done against a frozen table seems like a waste.

There would still be a specific error message for frozen tables, just on the GRANT rather than the actual DML statements.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
> > > Suggest that we prevent write operations on frozen tables by
> > > revoking all INSERT, UPDATE or DELETE rights held, then enforcing a
> > > check during GRANT to prevent them being re-enabled.  Superusers
> > > would need to check every time.  If we don't do this, then we will
> > > have two contradictory states marked in the catalog: privileges
> > > saying Yes and freezing saying No.
> >
> > No, I'd not mess with the permissions and return a different error
> > when trying to modify a frozen table.  (It would also be complicated
> > to unfreeze after create database.)  We should make it clear that
> > freezing is no replacement for revoke.
>
> That was with a mind to performance.  Checking every INSERT,
> UPDATE and DELETE statement to see if they are being done
> against a frozen table seems like a waste.

I'd think we would have relminxid in the relcache, so I don't buy the performance argument :-)  (You could still do the actual check in the same place where the permission is checked.)

> There would still be a specific error message for frozen
> tables, just on the GRANT rather than the actual DML statements.

I'd still prefer to see the error on modify.  Those that don't can revoke.

Andreas
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> writes:
>> That was with a mind to performance.  Checking every INSERT,
>> UPDATE and DELETE statement to see if they are being done
>> against a frozen table seems like a waste.

> I'd think we would have relminxid in the relcache, so I don't buy the
> performance argument :-)

Me either.  Further, auto-revoking permissions loses information.  I think that idea is an ugly kluge.

Anyway, the bottom line here seems to be that we should forget about pg_class_nt and just keep the info in pg_class; there's not sufficient justification to build the infrastructure needed for a nontransactional auxiliary catalog.  This implies the following conclusions:

* template0 has to be vacuumed against wraparound, same as any other database.

* To support frozen tables, "VACUUM FREEZE" and "ALTER TABLE UNFREEZE" would need to be explicit commands taking ExclusiveLock, and can't be nested inside transaction blocks either.  Automatic unfreeze upon an updating command isn't possible.

Neither of these is bad enough to justify pg_class_nt --- in fact, I'd argue that explicit unfreeze is better than automatic anyway.  So it was a cute idea, but its time hasn't come.

			regards, tom lane