Thread: [WIP] Add relminxid column to pg_class
Hi, I attach my patch adding the relminxid column to pg_class; this turns vacuum/freeze Xid tracking a per-table issue, and so it removes the need to vacuum whole databases. Transaction Id wraparound can thus be dealt with one table at a time. I refer you to the old thread for a more complete description of the patch: http://archives.postgresql.org/pgsql-patches/2005-11/msg00074.php I just noticed a problem, which is why I labeled it a WIP: the current implementation turns pg_database.datminxid to InvalidTransactionId when the table that has the minimum relminxid is dropped. The problem is that this could cause "vacuum starvation" if autovacuum is using datminxid to decide what database to vacuum, and the minimum Xid table is being dropped continuously. I had refrained from calculating pg_database.datminxid each time said table is dropped, because doing it means scanning pg_class and locking pg_database -- I'm wary of deadlock problems. Not sure what to do here. (Maybe the answer is to do nothing -- this is a very low probability scenario anyway. Opinions?) I dropped the controversial (and broken) idea of setting a table's relminxid to FreezeTransactionId (which would allow a table to survive a transaction wraparound without needing vacuum at all.) This will need more work; probably a switch to mark a table strictly read-only. Unless there are objections I plan to apply this probably by the next weekend, so that there is plenty of time for review and comment. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Attachment
Alvaro Herrera <alvherre@commandprompt.com> writes: > I just noticed a problem, which is why I labeled it a WIP: the current > implementation turns pg_database.datminxid to InvalidTransactionId when > the table that has the minimum relminxid is dropped. The problem is > that this could cause "vacuum starvation" if autovacuum is using > datminxid to decide what database to vacuum, and the minimum Xid table > is being dropped continuously. I had refrained from calculating > pg_database.datminxid each time said table is dropped, because doing it > means scanning pg_class and locking pg_database -- I'm wary of deadlock > problems. Not sure what to do here. (Maybe the answer is to do nothing > -- this is a very low probability scenario anyway. Opinions?) I'd argue that you should do nothing, ie, dropping a table should never affect datminxid. The proper interpretation of the pg_database columns is that we guarantee that all XID's in the database are *at least* thus- and-so, not that the minimum is exact. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > I just noticed a problem, which is why I labeled it a WIP: the current > > implementation turns pg_database.datminxid to InvalidTransactionId when > > the table that has the minimum relminxid is dropped. > > I'd argue that you should do nothing, ie, dropping a table should never > affect datminxid. The proper interpretation of the pg_database columns > is that we guarantee that all XID's in the database are *at least* thus- > and-so, not that the minimum is exact. Ah-ha, an easier approach. This would mean either: a) we need to seqscan pg_class each time to discover the minimum b) we need a partial index on pg_class (relminxid) WHERE relkind = 'r' to quickly discover the minimum (Is the bootstrap mode able to create partial indexes?) We need to do this after each vacuum. We could arrange things so that autovacuum manages to do it only once after processing a database instead of once per table, but this could be fragile if the vacuuming dies partway through. This reminds me of an unrelated problem. On pgsql-bugs or pgsql-es-ayuda there was a report recently that autovacuum was dying because of corrupt data on a database. The problem was that this prevented vacuuming of all other databases as well, because on the next autovacuum iteration the same database would be chosen and the same error would ensue. Probably we should do something about this, so that autovac is allowed to process other databases before failing again on the offending database. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> The proper interpretation of the pg_database columns >> is that we guarantee that all XID's in the database are *at least* thus- >> and-so, not that the minimum is exact. > Ah-ha, an easier approach. This would mean either: > a) we need to seqscan pg_class each time to discover the minimum We'd have to do that during each VACUUM to see if we could change the pg_database entry, I think. > b) we need a partial index on pg_class (relminxid) WHERE relkind = 'r' > to quickly discover the minimum > (Is the bootstrap mode able to create partial indexes?) No, and we can't update them on system catalogs either, so it's not gonna work. (I think plain tables are a large enough fraction of pg_class that you'd not save anything anyway...) > We need to do this after each vacuum. > We could arrange things so that autovacuum manages to do it only once > after processing a database instead of once per table, but this could be > fragile if the vacuuming dies partway through. No, it'd just mean that the pg_database entry is smaller than it could be, but this is not a failure case. Note that a plain manual full-database VACUUM could have the same optimization; it's not only autovac. > This reminds me of an unrelated problem. On pgsql-bugs or > pgsql-es-ayuda there was a report recently that autovacuum was dying > because of corrupt data on a database. The problem was that this > prevented vacuuming of all other databases as well, because on the next > autovacuum iteration the same database would be chosen and the same > error would ensue. Probably we should do something about this, so that > autovac is allowed to process other databases before failing again on > the offending database. Hm, I kinda thought we had already got some provision in there to discourage autovac from choosing the same database over and over. regards, tom lane
Tom Lane wrote: > I'd argue that you should do nothing, ie, dropping a table should never > affect datminxid. The proper interpretation of the pg_database columns > is that we guarantee that all XID's in the database are *at least* thus- > and-so, not that the minimum is exact. Ok, this new patch does this. It allowed to simplify some code a bit, and works wonderfully. However I spotted another problem. Suppose I initdb; then I use the system for some 2 billion-minus-delta transactions. At this point, I create a new database using template0 as template. When this is done, the logic in createdb() puts the current TransactionId as pg_database.datminxid and datvacuumxid, which is fine because we assume that template0 is fully frozen and thus it doesn't need vacuuming right away. However, pg_class entries all contain values close to 500 (the Xid at which the initial vacuum is run by initdb). Thus if you vacuum only one table, the cluster-wide limit will be set at that low value, and suddenly the server will refuse to generate TransactionIds; the user will be forced to start a standalone postgres to vacuum. The solution seems to be to vacuum the whole database right after cloning. Or to forcibly set the pg_class value to the current TransactionId, without vacuuming (which should be fine, because the template database was frozen). Thoughts? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Attachment
Alvaro Herrera wrote: > Tom Lane wrote: > > > I'd argue that you should do nothing, ie, dropping a table should never > > affect datminxid. The proper interpretation of the pg_database columns > > is that we guarantee that all XID's in the database are *at least* thus- > > and-so, not that the minimum is exact. > > Ok, this new patch does this. It allowed to simplify some code a bit, > and works wonderfully. BTW, I forgot to mention that I intend to apply this patch later today, regardless of whatever solution we may decide for the problem below; we can add it later, and it certainly is a corner case. > The solution seems to be to vacuum the whole database right after > cloning. Or to forcibly set the pg_class value to the current > TransactionId, without vacuuming (which should be fine, because the > template database was frozen). Another possibility is to have autovac vacuum all databases, including those marked not connectable. While this won't solve the problem for those not running autovac, chances are that people doing so will be less with each release. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > BTW, I forgot to mention that I intend to apply this patch later today, > regardless of whatever solution we may decide for the problem below; we > can add it later, and it certainly is a corner case. Please don't. If you don't have a solution for that, then it doesn't work. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > BTW, I forgot to mention that I intend to apply this patch later today, > > regardless of whatever solution we may decide for the problem below; we > > can add it later, and it certainly is a corner case. > > Please don't. If you don't have a solution for that, then it doesn't > work. Ok. Let's see what possible solutions we have. 1. automatically vacuum the newly created database after creation, before returning from createdb() to the user. This doesn't work because we are not connected to it. 2. "Fix" pg_class in the new database after creation. Same problem as above. 2. force the user to issue a database-wide vacuum as first thing in the newly created database. Not sure how to implement the restriction (needs an additional pg_database column?). Doesn't seem very user friendly. 3. automatically change the first single-table VACUUM in a database into a database-wide vacuum. Additionally to the problems from the precedent proposal, this one breaks principle of least surprise. 4. mandate that not connectable databases must be vacuumed regularly, just like any other. This follows the KISS principle. The last one is the only one that seems relatively reasonable to me ... Plus it doesn't seem too onerous, since a template database is probably very small and vacuuming should be cheap. There is one problem: if a database is not connectable, how would the user vacuum it? Autovacuum would do it, but not everybody uses it. We could hack vacuumdb so that it connects elsewhere, changes the datallowconn bit, vacuum the db, and change the datallowconn bit back. I'm outta ideas for now ... any other? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Ok. Let's see what possible solutions we have. > 1. automatically vacuum the newly created database after creation, > before returning from createdb() to the user. This doesn't work because > we are not connected to it. > 2. "Fix" pg_class in the new database after creation. Same problem as > above. > 2. force the user to issue a database-wide vacuum as first thing in the > newly created database. Not sure how to implement the restriction > (needs an additional pg_database column?). Doesn't seem very user > friendly. I agree, all of those are a bit ugly. > 3. automatically change the first single-table VACUUM in a database into > a database-wide vacuum. Additionally to the problems from the precedent > proposal, this one breaks principle of least surprise. This isn't as bad as all that. Suppose we make a database-wide VACUUM FREEZE set relminxid = FrozenXid (or maybe InvalidXid would be better) for all the tables, and also put this into datminxid. This value can *not* be combined with other relminxid values, and therefore the rule is that all other forms of VACUUM forcibly process every rel with this relminxid, setting relminxid to current XID counter; and then they can validly compute a minimum database-wide value of relminxid to put into datminxid. > 4. mandate that not connectable databases must be vacuumed regularly, > just like any other. This follows the KISS principle. If we went down that path, I think we'd essentially abandon the notion of VACUUM FREEZE altogether. Which would certainly simplify the system, but I'm not sure it's a step forward. The big objection to either of these approaches is that they give up the possibility of suppressing vacuums permanently on large read-only tables, which is one of the goals that we had in mind for VACUUM FREEZE. The data warehousing guys will be on our case if we put that idea permanently off the table. I remember we had decided against the idea of having the first modification of a frozen table change its relminxid, but I've forgotten what the rationale was ... do you remember? It's starting to look like not such a bad idea after all. If we added a hack to do that then the relminxid bookkeeping rules would get a lot simpler. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > 3. automatically change the first single-table VACUUM in a database into > > a database-wide vacuum. Additionally to the problems from the precedent > > proposal, this one breaks principle of least surprise. > > This isn't as bad as all that. Suppose we make a database-wide VACUUM > FREEZE set relminxid = FrozenXid (or maybe InvalidXid would be better) > for all the tables, and also put this into datminxid. This value can > *not* be combined with other relminxid values, and therefore the rule is > that all other forms of VACUUM forcibly process every rel with this > relminxid, setting relminxid to current XID counter; and then they can > validly compute a minimum database-wide value of relminxid to put into > datminxid. Hmm ... interesting idea. > > 4. mandate that not connectable databases must be vacuumed regularly, > > just like any other. This follows the KISS principle. > > If we went down that path, I think we'd essentially abandon the notion > of VACUUM FREEZE altogether. Which would certainly simplify the system, > but I'm not sure it's a step forward. > > The big objection to either of these approaches is that they give up the > possibility of suppressing vacuums permanently on large read-only > tables, which is one of the goals that we had in mind for VACUUM FREEZE. I think a better idea is to have a separate "is read only" bit in pg_class. A table with that bit set doesn't need vacuuming at all, and needs not participate in the datminxid calculations. So we abandon the notion of VACUUM FREEZE; all databases need constant vacuuming; but big tables can be set read only. One problem I see with this new bit is that a table can only have it set if it's correctly frozen; and how do we know if it is? We would need to set the relminxid to FrozenXid anyway :-( Or maybe the flag can only be set by a VACUUM command, say "VACUUM AND SET READ ONLY" ;-) Or we could dictate that VACUUM FREEZE sets the flag, and in order to write to the table again you need to "ALTER TABLE SET WRITABLE" ... On further though: we can't do it on VACUUM FREEZE, because it's not a full vacuum and thus it doesn't have an exclusive lock on the table, so someone else could be modifying it. But what about creating a new VACUUM mode which would lock the table and set the read-only flag? > The data warehousing guys will be on our case if we put that idea > permanently off the table. We can serve them better with this new read only flag ;-) > I remember we had decided against the idea of having the first > modification of a frozen table change its relminxid, but I've forgotten > what the rationale was ... do you remember? No, we didn't -- what happened was that my first implementation was so bad that I decided to silently crawl away from it ;-) Actually in the end I decided not to explore that route further because I wasn't sure how to deal with it from WAL. If you freeze a table, then "unfreeze" it and the system crashes, how does the modification reach pg_class? Do we emit another WAL entry; one for the heap_insert (or whatever) and another for the pg_class modification? Or does heap_redo() take care of it with the heap_insert WAL entry? Seriously, if this can be done, then I see it as the simplest route, because we wouldn't need special flags, nor special commands, and it would be automatically unset if somebody writes to the table. (On the other hand, if we did this, it would be trivial to add a "read only" flag, changeable with a simple ALTER TABLE that would only work if the table is in frozen state.) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > I think a better idea is to have a separate "is read only" bit in > pg_class. A table with that bit set doesn't need vacuuming at all, and > needs not participate in the datminxid calculations. I think that just complicates matters. > One problem I see with this new bit is that a table can only have it set > if it's correctly frozen; and how do we know if it is? We would need to > set the relminxid to FrozenXid anyway :-( Right. > On further though: we can't do it on VACUUM FREEZE, because it's not a > full vacuum and thus it doesn't have an exclusive lock on the table, so > someone else could be modifying it. But what about creating a new > VACUUM mode which would lock the table and set the read-only flag? I was thinking it might be acceptable for VACUUM FREEZE to take ExclusiveLock (not AccessExclusiveLock). That would still allow concurrent readers, and it's not clear why you'd want to do VACUUM FREEZE on a table that has active writers. >> I remember we had decided against the idea of having the first >> modification of a frozen table change its relminxid, but I've forgotten >> what the rationale was ... do you remember? > Actually in the end I decided not to explore that route further because > I wasn't sure how to deal with it from WAL. If you freeze a table, then > "unfreeze" it and the system crashes, how does the modification reach > pg_class? You certainly must emit a WAL entry for the action of unfreezing, but I don't see any particular reason why that's a bad idea. Transitioning a table between frozen and unfrozen states should be rare enough that emitting a WAL entry for it is not a performance problem. Here's a sketch of the idea as it's developing in my mind: 1. We consider that relminxid = FrozenXid means that the table is frozen, ie is guaranteed to contain no valid XIDs except FrozenXid. Otherwise, relminxid must be a lower bound on the non-frozen XIDs in the table. 2. VACUUM FREEZE acquires ExclusiveLock, vacuums the table replacing all XIDs with FrozenXid, and if successful sets relminxid = FrozenXid. (It might not be successful, eg it might see recently-dead tuples it can't remove; we can't replace their xmin/xmax obviously.) In all other cases, VACUUM sets relminxid = Min(oldest unfrozen xid in table, transaction xmin) (or some other convenient lower-bound computation, eg maybe just use the cutoff instead of actively figuring the min XID). I think we have to XLOG the setting of relminxid to be safe. 3. Any modification of a table (that inserts an XID into it) must check to see if relminxid = FrozenXid, and if so change it to transaction xmin (or some other lower bound on the oldest running XID). This action has to be WAL-logged. 4. VACUUM has to recompute datminxid to be the oldest non-frozen relminxid in the database (but not more than transaction xmin, to cover case where someone else is creating a table concurrently). We might be able to go back to your idea of not having to do this work unless the prior value of relminxid matches datminxid. I think plain VACUUM could skip tables having relminxid = FrozenXid altogether. I'm tempted to say that the "unfreezing" action (#3) could just be done at the point where we open a rel and take a stronger-than-AccessShare lock on it. This would minimize the overhead needed, and give us pretty good confidence we'd not missed any places. It'd mean that, say, an UPDATE that changed no rows would still mark the table unfrozen, but I see no great downside to that. Comments? regards, tom lane
Tom Lane wrote: > Here's a sketch of the idea as it's developing in my mind: > > 1. We consider that relminxid = FrozenXid means that the table is > frozen, ie is guaranteed to contain no valid XIDs except FrozenXid. > Otherwise, relminxid must be a lower bound on the non-frozen XIDs in > the table. > > 2. VACUUM FREEZE acquires ExclusiveLock, vacuums the table replacing > all XIDs with FrozenXid, and if successful sets relminxid = FrozenXid. > (It might not be successful, eg it might see recently-dead tuples it > can't remove; we can't replace their xmin/xmax obviously.) In all other > cases, VACUUM sets relminxid = Min(oldest unfrozen xid in table, > transaction xmin) (or some other convenient lower-bound computation, > eg maybe just use the cutoff instead of actively figuring the min XID). > I think we have to XLOG the setting of relminxid to be safe. > > 3. Any modification of a table (that inserts an XID into it) must check > to see if relminxid = FrozenXid, and if so change it to transaction xmin > (or some other lower bound on the oldest running XID). This action has > to be WAL-logged. > > 4. VACUUM has to recompute datminxid to be the oldest non-frozen > relminxid in the database (but not more than transaction xmin, to cover > case where someone else is creating a table concurrently). We might be > able to go back to your idea of not having to do this work unless the > prior value of relminxid matches datminxid. I think plain VACUUM could > skip tables having relminxid = FrozenXid altogether. > > I'm tempted to say that the "unfreezing" action (#3) could just be done > at the point where we open a rel and take a stronger-than-AccessShare > lock on it. This would minimize the overhead needed, and give us pretty > good confidence we'd not missed any places. It'd mean that, say, an > UPDATE that changed no rows would still mark the table unfrozen, but I > see no great downside to that. Seems fine. I'll devote some time to this. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
[ dept of further thoughts ] I wrote: >> 4. VACUUM has to recompute datminxid to be the oldest non-frozen >> relminxid in the database (but not more than transaction xmin, to cover >> case where someone else is creating a table concurrently). Strictly speaking, VACUUM could never set datminxid = FrozenXid given the above rule --- and this is correct, because someone else could be creating a new table or unfreezing an existing table concurrently with the VACUUM. However, we'd really like to set datminxid = FrozenXid for template0, because otherwise we can't write any sort of hardline guarantee that we know the database system is safe against wraparound. I see a workaround though: it *is* OK for VACUUM FREEZE to set datminxid = FrozenXid if (1) it's been able to freeze all the tables and (2) it's running in a standalone backend. This rule would let us freeze template0 during initdb. There's still a question of where/when to worry about marking a database unfrozen. Perhaps a suitably cheap, conservative approximation would be to cause any new connection to a frozen database to immediately mark it unfrozen in pg_database. (I'm not sure if this works conveniently for initdb's processing though --- we might want to fudge a bit depending on whether we're running standalone.) regards, tom lane
Tom Lane wrote: > I see a workaround though: it *is* OK for VACUUM FREEZE to set datminxid > = FrozenXid if (1) it's been able to freeze all the tables and (2) it's > running in a standalone backend. This rule would let us freeze > template0 during initdb. > > There's still a question of where/when to worry about marking a database > unfrozen. Perhaps a suitably cheap, conservative approximation would be > to cause any new connection to a frozen database to immediately mark it > unfrozen in pg_database. (I'm not sure if this works conveniently for > initdb's processing though --- we might want to fudge a bit depending on > whether we're running standalone.) initdb never connects to template0, so I think it should work. We just need to add an additional step to initdb to run the vacuum freeze. What do you think of making vacuum a no-op on a table that has datminxid = FrozenXid? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > What do you think of making vacuum a no-op on a table that has datminxid > = FrozenXid? I think this is probably OK for plain VACUUM and VACUUM FREEZE, but not for VACUUM FULL, since a frozen table isn't necessarily fully packed. Also, for VACUUM ANALYZE, the ANALYZE step is likely still needed? regards, tom lane