Thread: Vacuum dead tuples that are "between" transactions
The topic of improving vacuum for use in heavy-update environments seems to come up frequently on the list. Has anyone weighed the costs of allowing VACUUM to reclaim tuples that are not older than the oldest transaction but are nonetheless invisible to all running transactions? It seems that it's not that hard....

Currently, a tuple is not eligible to be reclaimed by vacuum unless it was deleted by a transaction that committed before the oldest currently running transaction started (i.e., its xmax is known to have committed before the oldest currently running xid was started). Right?

However, it seems like under certain scenarios (heavy updates to small tables while a long-running transaction is in progress) there might be a lot of tuples that are invisible to all transactions but not able to be vacuumed under the current method. Example: updating a single row over and over again while pg_dump is running.

Suppose that in the system, we have a serializable transaction with xid 1000 and a read committed transaction with xid 1001. Other than these two, the oldest running xid is 2000.

Suppose we consider a tuple with xmin 1200 and xmax 1201. We will assume that xid 1201 committed before xid 2000 began to run.

So:

(A) This tuple is invisible to the serializable transaction, since its snapshot can't ever advance.

(B) The read committed transaction might be able to see it. However, if its current command started AFTER xid 1201 committed, it can't.

Unless I'm missing something, it seems that when vacuuming you can leave serializable transactions (like pg_dump) out of the calculation of the "oldest running transaction" so long as you keep a list of them and check each tuple T against each serializable transaction X to make sure that T's xmin is greater than X's xid, or else T's xmax committed before X started to run. Of course this is "a lot" of work, but it should mitigate the effect of long-running serializable transactions until such time as processor power becomes your limiting factor.

The read committed ones are a more difficult matter, but I think you can treat a tuple as dead if it was inserted after the read committed transaction started to run AND the tuple was deleted before that transaction's currently running command started to run. I suppose the major difficulty here is that currently a transaction has no way of knowing when another backend's command started to run?

Is this too difficult to do, or is it a good idea that no one has enough 'round tuits for?

Regards,
Paul Tillotson
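To make the proposed per-tuple test concrete, here is a rough standalone sketch of the rule described above. Every name in it is invented for illustration (the real visibility checks live in tqual.c and track far more state), and it assumes commit order follows xid order, which the follow-up below explains is not safe to assume.

/*
 * Rough sketch of the proposed check, NOT PostgreSQL code.  All types and
 * functions here are hypothetical, and the sketch assumes that commit
 * order follows xid order, which is not true in general.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Xid;

typedef struct
{
    Xid xmin;   /* inserting transaction, assumed committed */
    Xid xmax;   /* deleting transaction, assumed committed */
} DeadTuple;

typedef struct
{
    Xid xid;    /* the long-running serializable transaction */
} SerializableXact;

/* Placeholder: did xact a commit before xact b started to run? */
static bool committed_before_started(Xid a, Xid b)
{
    return a < b;   /* gross simplification for the sketch */
}

/*
 * The rule proposed above: an already-deleted tuple is invisible to
 * serializable transaction X if it was inserted after X, or deleted
 * before X started to run.
 */
static bool invisible_to_st(const DeadTuple *t, const SerializableXact *x)
{
    return t->xmin > x->xid || committed_before_started(t->xmax, x->xid);
}

/*
 * VACUUM would compute its usual cutoff while ignoring serializable
 * transactions, then additionally require invisibility to each of them.
 */
static bool reclaimable(const DeadTuple *t, Xid nonserializable_cutoff,
                        const SerializableXact *sts, int nsts)
{
    if (!committed_before_started(t->xmax, nonserializable_cutoff))
        return false;
    for (int i = 0; i < nsts; i++)
        if (!invisible_to_st(t, &sts[i]))
            return false;
    return true;
}

int main(void)
{
    /* The example above: tuple (xmin 1200, xmax 1201), serializable xid
     * 1000, oldest other running xid 2000.  Prints 1: dead to everyone. */
    DeadTuple t = { 1200, 1201 };
    SerializableXact st = { 1000 };
    printf("reclaimable: %d\n", reclaimable(&t, 2000, &st, 1));
    return 0;
}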
Paul Tillotson <spam1011@adelphia.net> writes:
> The topic of improving vacuum for use in heavy-update environments seems
> to come up frequently on the list. Has anyone weighed the costs of
> allowing VACUUM to reclaim tuples that are not older than the oldest
> transaction but are nonetheless invisible to all running transactions?
> It seems that it's not that hard....

It's not that easy either --- you are assuming that every process advertises far more of its internal state than it actually does.

> Suppose that in the system, we have a serializable transaction with xid
> 1000 and a read committed transaction with xid 1001. Other than these
> two, the oldest running xid is 2000.

> Suppose we consider a tuple with xmin 1200 and xmax 1201. We will
> assume that xid 1201 committed before xid 2000 began to run.

> So:

> (A) This tuple is invisible to the serializable transaction, since its
> snapshot can't ever advance.

Wrong --- you can't assume that simply from the transaction numbering, even assuming that you know that 1000 is serializable. 1000 might not have set its snapshot until quite some time after it started. (This is even pretty likely, if it waited for some locks before setting the snapshot.) You'd need access to the snapshot 1000 is actually using to be sure which "later" transactions are invisible to it.

While advertising whole snapshots (rather than just xmin) in shared memory is at least theoretically possible, the costs of doing that seem nontrivial to me ... and they'd have to be paid whether any savings ensued or not.

> (B) The read committed transaction might be able to see it. However, if
> its current command started AFTER xid 1201 committed, it can't.

Another issue is that there's not just "one single snapshot" to worry about per backend. Cursors for instance capture their own snaps. So a backend would have to somehow keep track of the oldest live snapshot it has internally.

> The read committed ones are a more difficult matter, but I think you can
> treat a tuple as dead if it was inserted after the read committed
> transaction started to run AND the tuple was deleted before the
> transaction's currently running command started to run.

To do that requires not just that you have access to a backend's oldest snapshot, but that you have access to *all* its active snapshots; because such a transient tuple might be visible in some newer snap even though it's too new for the oldest snap. Doing that will create very significant problems of shared memory management, as well as performance and locking issues.

There's been some talk of distinguishing "global" and "within database" xmin values, so that a long-lived transaction wouldn't interfere with vacuuming tables in other databases that that xact couldn't possibly access. That seems doable to me, but I think any finer-grained analysis is probably going to be a net loss because of the distributed overhead it'd impose on every transaction.

			regards, tom lane
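For a sense of what "advertising whole snapshots in shared memory" would involve, here is a hypothetical sketch of the extra per-backend state. Nothing like these structures exists in PostgreSQL, where the proc array only advertises the xid and xmin; the point is only to show the bookkeeping every snapshot-taking operation would have to pay for.

/*
 * Hypothetical per-backend shared-memory state if whole snapshots were
 * advertised.  All names are invented for illustration.
 */
#include <stdint.h>

typedef uint32_t Xid;

#define MAX_ADVERTISED_XIP 64          /* arbitrary cap for the sketch */

typedef struct
{
    Xid xmin;                          /* every xid < xmin had finished */
    Xid xmax;                          /* every xid >= xmax had not started */
    int nxip;                          /* in-progress xids in between */
    Xid xip[MAX_ADVERTISED_XIP];
} AdvertisedSnapshot;

typedef struct
{
    Xid                xid;            /* already advertised today */
    Xid                xmin;           /* likewise */
    AdvertisedSnapshot oldest_snap;    /* would have to be kept current as
                                        * cursors and new commands create and
                                        * destroy snapshots, under a lock,
                                        * whether or not any VACUUM ever
                                        * looks at it */
} AdvertisedBackendState;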
On Tue, 2006-02-28 at 01:32 -0500, Tom Lane wrote:
> Paul Tillotson <spam1011@adelphia.net> writes:
> > The topic of improving vacuum for use in heavy-update environments seems
> > to come up frequently on the list. Has anyone weighed the costs of
> > allowing VACUUM to reclaim tuples that are not older than the oldest
> > transaction but are nonetheless invisible to all running transactions?
> > It seems that it's not that hard....
>
> It's not that easy either --- you are assuming that every process
> advertises far more of its internal state than it actually does.

Paul raises some thoughts that are worth considering, even with the usual minefield of difficulties.

Paul, you mention serializable transactions, but your root issue seems to be that VACUUM clears up fewer rows when pg_dump is running, yes? Have you tried using an on-line hot backup with archive_command set (PITR)? That doesn't suffer from the same issue and is faster too.
OTOH a few hackers discussed this recently and found that nobody used serializable transactions (ST) except during pg_dump. It seems a reasonable observation that *most* STs are pg_dumps, or at the very least that the longest-running STs are pg_dumps. So rather than changing all transaction modes, or even special-casing STs, why not put in some infrastructure to cope specifically with the problems that pg_dump can cause?

A general facility that would allow STs to identify which tables they would/would not touch again could be used by pg_dump to advertise useful information. That information could then be picked up by a VACUUM: when locking to get xmin it would see an ST, then retrieve the information to allow it to work out a per-table xmin. Non-VACUUM transactions would ignore any special ST information, causing very low overhead for normal operation (checking whether each current transaction was an ST, which mostly will be predicted correctly as "no" by the CPU).

You could take that further and get pg_dump to use a list file like pg_restore. You would then be able to *where possible* alter the sequence of data dumping so that heavily updated tables were dumped first, so the dumping ST could then advertise "no further access" to particular tables. VACUUMs could then proceed as if the ST were not there at all.

Or maybe at least the idea of some special-case ST behaviour might be worthy of some thought.

I've no intention of working on this myself, especially since PITR provides an alternate backup solution anyway (even in combination with other techniques), but the idea seems worth recording for others to discuss.

Best Regards, Simon Riggs
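A very rough sketch of the per-table xmin idea, with invented names (no such advertisement mechanism exists in PostgreSQL): VACUUM would skip, for a given table, any serializable transaction that has promised not to touch that table again.

/*
 * Sketch only: a serializable transaction such as pg_dump advertises the
 * tables it is finished with, and VACUUM derives a per-table xmin that
 * ignores those transactions.  All names are invented for illustration.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;
typedef uint32_t TableOid;

#define MAX_DONE_TABLES 128

typedef struct
{
    Xid      xmin;                          /* what the proc array advertises */
    bool     is_serializable;
    int      ndone;
    TableOid done_tables[MAX_DONE_TABLES];  /* "no further access" promises */
} XactAdvertisement;

static bool done_with_table(const XactAdvertisement *x, TableOid table)
{
    for (int i = 0; i < x->ndone; i++)
        if (x->done_tables[i] == table)
            return true;
    return false;
}

/*
 * Per-table xmin: the usual minimum over all running transactions, except
 * that serializable transactions which have promised not to touch this
 * table again are ignored.
 */
static Xid per_table_xmin(const XactAdvertisement *xacts, int nxacts,
                          TableOid table, Xid next_xid)
{
    Xid result = next_xid;
    for (int i = 0; i < nxacts; i++)
    {
        if (xacts[i].is_serializable && done_with_table(&xacts[i], table))
            continue;                       /* pg_dump is finished with it */
        if (xacts[i].xmin < result)
            result = xacts[i].xmin;
    }
    return result;
}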
On Tue, Feb 28, 2006 at 01:22:35PM +0000, Simon Riggs wrote:
> Paul, you mention serializable transactions, but your root issue seems
> to be that VACUUM clears up fewer rows when pg_dump is running, yes? Have
> you tried using an on-line hot backup with archive_command set (PITR)?
> That doesn't suffer from the same issue and is faster too.
>
> OTOH a few hackers discussed this recently and found that nobody used
> serializable transactions (ST) except during pg_dump. It seems a
> reasonable observation that *most* STs are pg_dumps, or at the very
> least that the longest-running STs are pg_dumps. So rather than changing
> all transaction modes, or even special-casing STs, why not put in some
> infrastructure to cope specifically with the problems that pg_dump can
> cause?

While it's not currently serializable, another big candidate IMO is vacuum itself. Vacuuming a large table in a database that also sees heavy update activity can be a real nightmare, because dead space piles up in the updated tables while the long vacuum is running. Although there's probably any number of ways that this problem could be addressed, making vacuum a serializable transaction (which shouldn't be an issue, afaik) and creating a generic framework that optimizes for that case would win in more than one place.

Also, does this really only apply to serializable transactions? As the OP stated, if a row couldn't possibly be visible to a specific (old) transaction, it should be safe to vacuum it...

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com
On Tue, Feb 28, 2006 at 01:32:19AM -0500, Tom Lane wrote:
> To do that requires not just that you have access to a backend's oldest
> snapshot, but that you have access to *all* its active snapshots;
> because such a transient tuple might be visible in some newer snap even
> though it's too new for the oldest snap. Doing that will create very
> significant problems of shared memory management, as well as performance
> and locking issues.
>
> There's been some talk of distinguishing "global" and "within database"
> xmin values, so that a long-lived transaction wouldn't interfere with
> vacuuming tables in other databases that that xact couldn't possibly
> access. That seems doable to me, but I think any finer-grained analysis
> is probably going to be a net loss because of the distributed overhead
> it'd impose on every transaction.

True, but we don't need this for every transaction, only long-running ones. And in most cases, it'd probably be safe to define 'long-running' in terms of minutes. Presumably, a mechanism similar to statement_timeout could be used to 'publish' the required state information after a given period of time.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com
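One way the "publish only for long-running transactions" idea could look, sketched with an invented setting name (no such GUC or function exists); the point is that short transactions never pay the publishing cost.

/*
 * Sketch of publishing extra snapshot state only after a transaction has
 * been running for a while, by analogy with statement_timeout.  The
 * setting name and functions are invented; nothing like this exists today.
 */
#include <stdbool.h>
#include <time.h>

/* hypothetical setting, in seconds; 0 disables publishing entirely */
static int snapshot_publish_timeout = 60;

typedef struct
{
    time_t xact_start;   /* when this backend's transaction began */
    bool   published;    /* have we copied our snapshots to shared memory? */
} LocalXactState;

/* Would be called periodically, e.g. from an existing timeout hook. */
static void maybe_publish_snapshots(LocalXactState *st, time_t now)
{
    if (st->published || snapshot_publish_timeout <= 0)
        return;

    if (difftime(now, st->xact_start) >= snapshot_publish_timeout)
    {
        /* Here the backend would copy all of its active snapshots into
         * shared memory for VACUUM to inspect (omitted).  Short-lived
         * transactions never reach this point, so they pay nothing. */
        st->published = true;
    }
}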
>>> On Tue, Feb 28, 2006 at 7:22 am, in message
<1141132955.27729.119.camel@localhost.localdomain>, Simon Riggs
<simon@2ndquadrant.com> wrote:
>
> OTOH a few hackers discussed this recently and found that nobody used
> serializable transactions (ST) except during pg_dump.

I've not been able to keep up with all messages on these lists, and I missed that discussion. We use serializable transactions heavily; our whole middle tier architecture depends on having that transaction isolation level for all requests which modify data. (You probably don't want to hear the details.)

It would be OK (although a little disappointing) if VACUUM enhancements weren't as beneficial to us as a result; it would render PostgreSQL entirely unusable for us if the integrity of serializable transactions was broken unless we added some other, non-standard steps to run them.

We only use pg_dump for version upgrades and other special cases. PITR is our main backup technique.

-Kevin
On Wed, 2006-03-01 at 10:22 -0600, Kevin Grittner wrote:
> >>> On Tue, Feb 28, 2006 at 7:22 am, in message
> <1141132955.27729.119.camel@localhost.localdomain>, Simon Riggs
> <simon@2ndquadrant.com> wrote:
> >
> > OTOH a few hackers discussed this recently and found that nobody used
> > serializable transactions (ST) except during pg_dump.
>
> I've not been able to keep up with all messages on these lists, and I
> missed that discussion.

It was a verbal discussion, hence not recorded on list. I should have said "nobody in that discussion"; I had no doubt somebody used them. My mention of that wasn't to add weight to the thought, just to mention that a quick straw poll had been taken...

> We use serializable transactions heavily; our whole middle tier
> architecture depends on having that transaction isolation level for all
> requests which modify data. (You probably don't want to hear the
> details.)

*I* would, but others may not. ;-)

> It would be OK (although a little disappointing) if VACUUM
> enhancements weren't as beneficial to us as a result; it would render
> PostgreSQL entirely unusable for us if the integrity of serializable
> transactions was broken unless we added some other, non-standard steps
> to run them.

I would never suggest breaking STs; they are part of the SQL standard. I merely suggested an extra, optional API by which ST users could provide additional information that could help others avoid pessimal decisions made in order to preserve correctness.

> We only use pg_dump for version upgrades and other special cases. PITR
> is our main backup technique.

Cool.

Best Regards, Simon Riggs
>>> On Wed, Mar 1, 2006 at 11:02 am, in message
<1141232565.27729.379.camel@localhost.localdomain>, Simon Riggs
<simon@2ndquadrant.com> wrote:
> On Wed, 2006-03-01 at 10:22 -0600, Kevin Grittner wrote:
> >>> On Tue, Feb 28, 2006 at 7:22 am, in message
> <1141132955.27729.119.camel@localhost.localdomain>, Simon Riggs
> <simon@2ndquadrant.com> wrote:
>
>> We use serializable transactions heavily; our whole middle tier
>> architecture depends on having that transaction isolation level for all
>> requests which modify data. (You probably don't want to hear the
>> details.)
>
> *I* would, but others may not. ;-)

An executive overview of our environment, with enough detail to constitute more than vague hand waving, would probably be at least 4K of text. If you're interested, I could write something up and post it somewhere, but this list doesn't seem to be the appropriate place. Where would be?

The general hand-waving overview: We've got about 100 databases with a lot of fancy portability features which allow real-time data replication in a heterogeneous environment: 3,000 directly connected users, dozens of queue-based interfaces (in both directions) with business partner agencies, and a web site with about 2 million hits per day that query the databases (when you count both SOAP and browser traffic). So far four of our databases are on PostgreSQL, with another four being converted over the next few days. The servers spread around the state will be converted more gradually, over the course of the next year or so.

Software is almost entirely Java, and mostly home-grown. All database access is done through JDBC from a middle tier database service which treats each client request as one (and only one) database transaction, so that serialization errors can be handled automatically. Queries are written in an approved subset of standard ANSI SQL, with our query tool parsing that and turning it into Java classes which use "lowest common denominator" SQL code, with Java code to handle all procedural steps. Since stored procedures and triggers are implemented in our Java code, we can see all data flow, allowing us to copy the transactions to audit databases and replicate the data from multiple sources to multiple targets. It is easy to show a number of ways that we will have data integrity problems if the JDBC requests from the trigger code aren't in a SERIALIZABLE transaction with the triggering data modification.

Now, aren't you sorry you asked? ;-)

-Kevin