Thread: [WIP] Add relminxid column to pg_class

[WIP] Add relminxid column to pg_class

From
Alvaro Herrera
Date:
Hi,

I attach my patch adding the relminxid column to pg_class; this turns
vacuum/freeze Xid tracking a per-table issue, and so it removes the
need to vacuum whole databases.  Transaction Id wraparound can thus be
dealt with one table at a time.  I refer you to the old thread for a
more complete description of the patch:
http://archives.postgresql.org/pgsql-patches/2005-11/msg00074.php

I just noticed a problem, which is why I labeled it a WIP: the current
implementation turns pg_database.datminxid to InvalidTransactionId when
the table that has the minimum relminxid is dropped.  The problem is
that this could cause "vacuum starvation" if autovacuum is using
datminxid to decide what database to vacuum, and the minimum Xid table
is being dropped continuously.  I had refrained from calculating
pg_database.datminxid each time said table is dropped, because doing it
means scanning pg_class and locking pg_database -- I'm wary of deadlock
problems.  Not sure what to do here.  (Maybe the answer is to do nothing
-- this is a very low probability scenario anyway.  Opinions?)

I dropped the controversial (and broken) idea of setting a table's
relminxid to FreezeTransactionId (which would allow a table to survive a
transaction wraparound without needing vacuum at all.)  This will need
more work; probably a switch to mark a table strictly read-only.

Unless there are objections I plan to apply this probably by the next
weekend, so that there is plenty of time for review and comment.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Attachment

Re: [WIP] Add relminxid column to pg_class

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I just noticed a problem, which is why I labeled it a WIP: the current
> implementation turns pg_database.datminxid to InvalidTransactionId when
> the table that has the minimum relminxid is dropped.  The problem is
> that this could cause "vacuum starvation" if autovacuum is using
> datminxid to decide what database to vacuum, and the minimum Xid table
> is being dropped continuously.  I had refrained from calculating
> pg_database.datminxid each time said table is dropped, because doing it
> means scanning pg_class and locking pg_database -- I'm wary of deadlock
> problems.  Not sure what to do here.  (Maybe the answer is to do nothing
> -- this is a very low probability scenario anyway.  Opinions?)

I'd argue that you should do nothing, ie, dropping a table should never
affect datminxid.  The proper interpretation of the pg_database columns
is that we guarantee that all XID's in the database are *at least* thus-
and-so, not that the minimum is exact.

            regards, tom lane

Re: [WIP] Add relminxid column to pg_class

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I just noticed a problem, which is why I labeled it a WIP: the current
> > implementation turns pg_database.datminxid to InvalidTransactionId when
> > the table that has the minimum relminxid is dropped.
>
> I'd argue that you should do nothing, ie, dropping a table should never
> affect datminxid.  The proper interpretation of the pg_database columns
> is that we guarantee that all XID's in the database are *at least* thus-
> and-so, not that the minimum is exact.

Ah-ha, an easier approach.  This would mean either:

a) we need to seqscan pg_class each time to discover the minimum

b) we need a partial index on pg_class (relminxid) WHERE relkind = 'r'
to quickly discover the minimum
(Is the bootstrap mode able to create partial indexes?)

We need to do this after each vacuum.

We could arrange things so that autovacuum manages to do it only once
after processing a database instead of once per table, but this could be
fragile if the vacuuming dies partway through.

This reminds me of an unrelated problem.  On pgsql-bugs or
pgsql-es-ayuda there was a report recently that autovacuum was dying
because of corrupt data on a database.  The problem was that this
prevented vacuuming of all other databases as well, because on the next
autovacuum iteration the same database would be chosen and the same
error would ensue.  Probably we should do something about this, so that
autovac is allowed to process other databases before failing again on
the offending database.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: [WIP] Add relminxid column to pg_class

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> The proper interpretation of the pg_database columns
>> is that we guarantee that all XID's in the database are *at least* thus-
>> and-so, not that the minimum is exact.

> Ah-ha, an easier approach.  This would mean either:
> a) we need to seqscan pg_class each time to discover the minimum

We'd have to do that during each VACUUM to see if we could change the
pg_database entry, I think.

> b) we need a partial index on pg_class (relminxid) WHERE relkind = 'r'
> to quickly discover the minimum
> (Is the bootstrap mode able to create partial indexes?)

No, and we can't update them on system catalogs either, so it's not
gonna work.  (I think plain tables are a large enough fraction of
pg_class that you'd not save anything anyway...)

> We need to do this after each vacuum.

> We could arrange things so that autovacuum manages to do it only once
> after processing a database instead of once per table, but this could be
> fragile if the vacuuming dies partway through.

No, it'd just mean that the pg_database entry is smaller than it could
be, but this is not a failure case.  Note that a plain manual
full-database VACUUM could have the same optimization; it's not only
autovac.

> This reminds me of an unrelated problem.  On pgsql-bugs or
> pgsql-es-ayuda there was a report recently that autovacuum was dying
> because of corrupt data on a database.  The problem was that this
> prevented vacuuming of all other databases as well, because on the next
> autovacuum iteration the same database would be chosen and the same
> error would ensue.  Probably we should do something about this, so that
> autovac is allowed to process other databases before failing again on
> the offending database.

Hm, I kinda thought we had already got some provision in there to
discourage autovac from choosing the same database over and over.

            regards, tom lane

Re: [WIP] Add relminxid column to pg_class

From
Alvaro Herrera
Date:
Tom Lane wrote:

> I'd argue that you should do nothing, ie, dropping a table should never
> affect datminxid.  The proper interpretation of the pg_database columns
> is that we guarantee that all XID's in the database are *at least* thus-
> and-so, not that the minimum is exact.

Ok, this new patch does this.  It allowed to simplify some code a bit,
and works wonderfully.

However I spotted another problem.  Suppose I initdb; then I use the
system for some 2 billion-minus-delta transactions.  At this point, I
create a new database using template0 as template.  When this is done,
the logic in createdb() puts the current TransactionId as
pg_database.datminxid and datvacuumxid, which is fine because we assume
that template0 is fully frozen and thus it doesn't need vacuuming right
away.

However, pg_class entries all contain values close to 500 (the Xid at
which the initial vacuum is run by initdb).  Thus if you vacuum only one
table, the cluster-wide limit will be set at that low value, and
suddenly the server will refuse to generate TransactionIds; the user
will be forced to start a standalone postgres to vacuum.

The solution seems to be to vacuum the whole database right after
cloning.  Or to forcibly set the pg_class value to the current
TransactionId, without vacuuming (which should be fine, because the
template database was frozen).

Thoughts?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Attachment

Re: [WIP] Add relminxid column to pg_class

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
>
> > I'd argue that you should do nothing, ie, dropping a table should never
> > affect datminxid.  The proper interpretation of the pg_database columns
> > is that we guarantee that all XID's in the database are *at least* thus-
> > and-so, not that the minimum is exact.
>
> Ok, this new patch does this.  It allowed to simplify some code a bit,
> and works wonderfully.

BTW, I forgot to mention that I intend to apply this patch later today,
regardless of whatever solution we may decide for the problem below; we
can add it later, and it certainly is a corner case.

> The solution seems to be to vacuum the whole database right after
> cloning.  Or to forcibly set the pg_class value to the current
> TransactionId, without vacuuming (which should be fine, because the
> template database was frozen).

Another possibility is to have autovac vacuum all databases, including
those marked not connectable.  While this won't solve the problem for
those not running autovac, chances are that people doing so will be less
with each release.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: [WIP] Add relminxid column to pg_class

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> BTW, I forgot to mention that I intend to apply this patch later today,
> regardless of whatever solution we may decide for the problem below; we
> can add it later, and it certainly is a corner case.

Please don't.  If you don't have a solution for that, then it doesn't
work.

            regards, tom lane

Re: [WIP] Add relminxid column to pg_class

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > BTW, I forgot to mention that I intend to apply this patch later today,
> > regardless of whatever solution we may decide for the problem below; we
> > can add it later, and it certainly is a corner case.
>
> Please don't.  If you don't have a solution for that, then it doesn't
> work.

Ok.  Let's see what possible solutions we have.

1. automatically vacuum the newly created database after creation,
before returning from createdb() to the user.  This doesn't work because
we are not connected to it.

2. "Fix" pg_class in the new database after creation.  Same problem as
above.

2. force the user to issue a database-wide vacuum as first thing in the
newly created database.  Not sure how to implement the restriction
(needs an additional pg_database column?).  Doesn't seem very user
friendly.

3. automatically change the first single-table VACUUM in a database into
a database-wide vacuum.  Additionally to the problems from the precedent
proposal, this one breaks principle of least surprise.

4. mandate that not connectable databases must be vacuumed regularly,
just like any other.  This follows the KISS principle.


The last one is the only one that seems relatively reasonable to me ...
Plus it doesn't seem too onerous, since a template database is probably
very small and vacuuming should be cheap.  There is one problem: if a
database is not connectable, how would the user vacuum it?  Autovacuum
would do it, but not everybody uses it.  We could hack vacuumdb so that
it connects elsewhere, changes the datallowconn bit, vacuum the db, and
change the datallowconn bit back.

I'm outta ideas for now ... any other?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [WIP] Add relminxid column to pg_class

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Ok.  Let's see what possible solutions we have.

> 1. automatically vacuum the newly created database after creation,
> before returning from createdb() to the user.  This doesn't work because
> we are not connected to it.

> 2. "Fix" pg_class in the new database after creation.  Same problem as
> above.

> 2. force the user to issue a database-wide vacuum as first thing in the
> newly created database.  Not sure how to implement the restriction
> (needs an additional pg_database column?).  Doesn't seem very user
> friendly.

I agree, all of those are a bit ugly.

> 3. automatically change the first single-table VACUUM in a database into
> a database-wide vacuum.  Additionally to the problems from the precedent
> proposal, this one breaks principle of least surprise.

This isn't as bad as all that.  Suppose we make a database-wide VACUUM
FREEZE set relminxid = FrozenXid (or maybe InvalidXid would be better)
for all the tables, and also put this into datminxid.  This value can
*not* be combined with other relminxid values, and therefore the rule is
that all other forms of VACUUM forcibly process every rel with this
relminxid, setting relminxid to current XID counter; and then they can
validly compute a minimum database-wide value of relminxid to put into
datminxid.

> 4. mandate that not connectable databases must be vacuumed regularly,
> just like any other.  This follows the KISS principle.

If we went down that path, I think we'd essentially abandon the notion
of VACUUM FREEZE altogether.  Which would certainly simplify the system,
but I'm not sure it's a step forward.

The big objection to either of these approaches is that they give up the
possibility of suppressing vacuums permanently on large read-only
tables, which is one of the goals that we had in mind for VACUUM FREEZE.
The data warehousing guys will be on our case if we put that idea
permanently off the table.

I remember we had decided against the idea of having the first
modification of a frozen table change its relminxid, but I've forgotten
what the rationale was ... do you remember?  It's starting to look like
not such a bad idea after all.  If we added a hack to do that then the
relminxid bookkeeping rules would get a lot simpler.

            regards, tom lane

Re: [WIP] Add relminxid column to pg_class

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:

> > 3. automatically change the first single-table VACUUM in a database into
> > a database-wide vacuum.  Additionally to the problems from the precedent
> > proposal, this one breaks principle of least surprise.
>
> This isn't as bad as all that.  Suppose we make a database-wide VACUUM
> FREEZE set relminxid = FrozenXid (or maybe InvalidXid would be better)
> for all the tables, and also put this into datminxid.  This value can
> *not* be combined with other relminxid values, and therefore the rule is
> that all other forms of VACUUM forcibly process every rel with this
> relminxid, setting relminxid to current XID counter; and then they can
> validly compute a minimum database-wide value of relminxid to put into
> datminxid.

Hmm ... interesting idea.

> > 4. mandate that not connectable databases must be vacuumed regularly,
> > just like any other.  This follows the KISS principle.
>
> If we went down that path, I think we'd essentially abandon the notion
> of VACUUM FREEZE altogether.  Which would certainly simplify the system,
> but I'm not sure it's a step forward.
>
> The big objection to either of these approaches is that they give up the
> possibility of suppressing vacuums permanently on large read-only
> tables, which is one of the goals that we had in mind for VACUUM FREEZE.

I think a better idea is to have a separate "is read only" bit in
pg_class.  A table with that bit set doesn't need vacuuming at all, and
needs not participate in the datminxid calculations.  So we abandon the
notion of VACUUM FREEZE; all databases need constant vacuuming; but big
tables can be set read only.

One problem I see with this new bit is that a table can only have it set
if it's correctly frozen; and how do we know if it is?  We would need to
set the relminxid to FrozenXid anyway :-(  Or maybe the flag can only be
set by a VACUUM command, say "VACUUM AND SET READ ONLY" ;-)  Or we could
dictate that VACUUM FREEZE sets the flag, and in order to write to the
table again you need to "ALTER TABLE SET WRITABLE" ...

On further though: we can't do it on VACUUM FREEZE, because it's not a
full vacuum and thus it doesn't have an exclusive lock on the table, so
someone else could be modifying it.  But what about creating a new
VACUUM mode which would lock the table and set the read-only flag?


> The data warehousing guys will be on our case if we put that idea
> permanently off the table.

We can serve them better with this new read only flag ;-)


> I remember we had decided against the idea of having the first
> modification of a frozen table change its relminxid, but I've forgotten
> what the rationale was ... do you remember?

No, we didn't -- what happened was that my first implementation was so
bad that I decided to silently crawl away from it ;-)

Actually in the end I decided not to explore that route further because
I wasn't sure how to deal with it from WAL.  If you freeze a table, then
"unfreeze" it and the system crashes, how does the modification reach
pg_class?  Do we emit another WAL entry; one for the heap_insert (or
whatever) and another for the pg_class modification?  Or does
heap_redo() take care of it with the heap_insert WAL entry?

Seriously, if this can be done, then I see it as the simplest route,
because we wouldn't need special flags, nor special commands, and it
would be automatically unset if somebody writes to the table.

(On the other hand, if we did this, it would be trivial to add a "read
only" flag, changeable with a simple ALTER TABLE that would only work if
the table is in frozen state.)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [WIP] Add relminxid column to pg_class

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I think a better idea is to have a separate "is read only" bit in
> pg_class.  A table with that bit set doesn't need vacuuming at all, and
> needs not participate in the datminxid calculations.

I think that just complicates matters.

> One problem I see with this new bit is that a table can only have it set
> if it's correctly frozen; and how do we know if it is?  We would need to
> set the relminxid to FrozenXid anyway :-(

Right.

> On further though: we can't do it on VACUUM FREEZE, because it's not a
> full vacuum and thus it doesn't have an exclusive lock on the table, so
> someone else could be modifying it.  But what about creating a new
> VACUUM mode which would lock the table and set the read-only flag?

I was thinking it might be acceptable for VACUUM FREEZE to take
ExclusiveLock (not AccessExclusiveLock).  That would still allow
concurrent readers, and it's not clear why you'd want to do VACUUM
FREEZE on a table that has active writers.

>> I remember we had decided against the idea of having the first
>> modification of a frozen table change its relminxid, but I've forgotten
>> what the rationale was ... do you remember?

> Actually in the end I decided not to explore that route further because
> I wasn't sure how to deal with it from WAL.  If you freeze a table, then
> "unfreeze" it and the system crashes, how does the modification reach
> pg_class?

You certainly must emit a WAL entry for the action of unfreezing, but I
don't see any particular reason why that's a bad idea.  Transitioning a
table between frozen and unfrozen states should be rare enough that
emitting a WAL entry for it is not a performance problem.

Here's a sketch of the idea as it's developing in my mind:

1. We consider that relminxid = FrozenXid means that the table is
frozen, ie is guaranteed to contain no valid XIDs except FrozenXid.
Otherwise, relminxid must be a lower bound on the non-frozen XIDs in
the table.

2. VACUUM FREEZE acquires ExclusiveLock, vacuums the table replacing
all XIDs with FrozenXid, and if successful sets relminxid = FrozenXid.
(It might not be successful, eg it might see recently-dead tuples it
can't remove; we can't replace their xmin/xmax obviously.)  In all other
cases, VACUUM sets relminxid = Min(oldest unfrozen xid in table,
transaction xmin) (or some other convenient lower-bound computation,
eg maybe just use the cutoff instead of actively figuring the min XID).
I think we have to XLOG the setting of relminxid to be safe.

3. Any modification of a table (that inserts an XID into it) must check
to see if relminxid = FrozenXid, and if so change it to transaction xmin
(or some other lower bound on the oldest running XID).  This action has
to be WAL-logged.

4. VACUUM has to recompute datminxid to be the oldest non-frozen
relminxid in the database (but not more than transaction xmin, to cover
case where someone else is creating a table concurrently).  We might be
able to go back to your idea of not having to do this work unless the
prior value of relminxid matches datminxid.  I think plain VACUUM could
skip tables having relminxid = FrozenXid altogether.

I'm tempted to say that the "unfreezing" action (#3) could just be done
at the point where we open a rel and take a stronger-than-AccessShare
lock on it.  This would minimize the overhead needed, and give us pretty
good confidence we'd not missed any places.  It'd mean that, say, an
UPDATE that changed no rows would still mark the table unfrozen, but I
see no great downside to that.

Comments?

            regards, tom lane

Re: [WIP] Add relminxid column to pg_class

From
Alvaro Herrera
Date:
Tom Lane wrote:

> Here's a sketch of the idea as it's developing in my mind:
>
> 1. We consider that relminxid = FrozenXid means that the table is
> frozen, ie is guaranteed to contain no valid XIDs except FrozenXid.
> Otherwise, relminxid must be a lower bound on the non-frozen XIDs in
> the table.
>
> 2. VACUUM FREEZE acquires ExclusiveLock, vacuums the table replacing
> all XIDs with FrozenXid, and if successful sets relminxid = FrozenXid.
> (It might not be successful, eg it might see recently-dead tuples it
> can't remove; we can't replace their xmin/xmax obviously.)  In all other
> cases, VACUUM sets relminxid = Min(oldest unfrozen xid in table,
> transaction xmin) (or some other convenient lower-bound computation,
> eg maybe just use the cutoff instead of actively figuring the min XID).
> I think we have to XLOG the setting of relminxid to be safe.
>
> 3. Any modification of a table (that inserts an XID into it) must check
> to see if relminxid = FrozenXid, and if so change it to transaction xmin
> (or some other lower bound on the oldest running XID).  This action has
> to be WAL-logged.
>
> 4. VACUUM has to recompute datminxid to be the oldest non-frozen
> relminxid in the database (but not more than transaction xmin, to cover
> case where someone else is creating a table concurrently).  We might be
> able to go back to your idea of not having to do this work unless the
> prior value of relminxid matches datminxid.  I think plain VACUUM could
> skip tables having relminxid = FrozenXid altogether.
>
> I'm tempted to say that the "unfreezing" action (#3) could just be done
> at the point where we open a rel and take a stronger-than-AccessShare
> lock on it.  This would minimize the overhead needed, and give us pretty
> good confidence we'd not missed any places.  It'd mean that, say, an
> UPDATE that changed no rows would still mark the table unfrozen, but I
> see no great downside to that.

Seems fine.  I'll devote some time to this.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [WIP] Add relminxid column to pg_class

From
Tom Lane
Date:
[ dept of further thoughts ]

I wrote:
>> 4. VACUUM has to recompute datminxid to be the oldest non-frozen
>> relminxid in the database (but not more than transaction xmin, to cover
>> case where someone else is creating a table concurrently).

Strictly speaking, VACUUM could never set datminxid = FrozenXid given
the above rule --- and this is correct, because someone else could be
creating a new table or unfreezing an existing table concurrently with
the VACUUM.  However, we'd really like to set datminxid = FrozenXid for
template0, because otherwise we can't write any sort of hardline
guarantee that we know the database system is safe against wraparound.

I see a workaround though: it *is* OK for VACUUM FREEZE to set datminxid
= FrozenXid if (1) it's been able to freeze all the tables and (2) it's
running in a standalone backend.  This rule would let us freeze
template0 during initdb.

There's still a question of where/when to worry about marking a database
unfrozen.  Perhaps a suitably cheap, conservative approximation would be
to cause any new connection to a frozen database to immediately mark it
unfrozen in pg_database.  (I'm not sure if this works conveniently for
initdb's processing though --- we might want to fudge a bit depending on
whether we're running standalone.)

            regards, tom lane

Re: [WIP] Add relminxid column to pg_class

From
Alvaro Herrera
Date:
Tom Lane wrote:

> I see a workaround though: it *is* OK for VACUUM FREEZE to set datminxid
> = FrozenXid if (1) it's been able to freeze all the tables and (2) it's
> running in a standalone backend.  This rule would let us freeze
> template0 during initdb.
>
> There's still a question of where/when to worry about marking a database
> unfrozen.  Perhaps a suitably cheap, conservative approximation would be
> to cause any new connection to a frozen database to immediately mark it
> unfrozen in pg_database.  (I'm not sure if this works conveniently for
> initdb's processing though --- we might want to fudge a bit depending on
> whether we're running standalone.)

initdb never connects to template0, so I think it should work.  We just
need to add an additional step to initdb to run the vacuum freeze.

What do you think of making vacuum a no-op on a table that has datminxid
= FrozenXid?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [WIP] Add relminxid column to pg_class

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> What do you think of making vacuum a no-op on a table that has datminxid
> = FrozenXid?

I think this is probably OK for plain VACUUM and VACUUM FREEZE, but not
for VACUUM FULL, since a frozen table isn't necessarily fully packed.

Also, for VACUUM ANALYZE, the ANALYZE step is likely still needed?

            regards, tom lane