Thread: Vacuum dead tuples that are "between" transactions
The topic of improving vacuum for use in heavy-update environments seems to come up frequently on the list. Has anyone weighed the costs of allowing VACUUM to reclaim tuples that are not older than the oldest transaction but are nonetheless invisible to all running transactions? It seems that it's not that hard....

Currently, a tuple is not eligible to be reclaimed by vacuum unless it was deleted by a transaction that committed before the oldest currently running transaction started (i.e., its xmax is known to have committed before the oldest currently running xid was started). Right?

However, it seems like under certain scenarios (heavy updates to small tables while a long-running transaction is in progress) there might be a lot of tuples that are invisible to all transactions but not able to be vacuumed under the current method. Example: updating a single row over and over again while pg_dump is running.

Suppose that in the system, we have a serializable transaction with xid 1000 and a read committed transaction with xid 1001. Other than these two, the oldest running xid is 2000.

Suppose we consider a tuple with xmin 1200 and xmax 1201. We will assume that xid 1201 committed before xid 2000 began to run.

So:

(A) This tuple is invisible to the serializable transaction, since its snapshot can't ever advance.

(B) The read committed transaction might be able to see it. However, if its current command started AFTER xid 1201 committed, it can't.

Unless I'm missing something, it seems that when vacuuming you can leave serializable transactions (like pg_dump) out of the calculation of the "oldest running transaction" so long as you keep a list of them and check each tuple T against each serializable transaction X to make sure that T's xmin is greater than X's xid, or else T's xmax committed before X started to run. Of course this is "a lot" of work, but it should mitigate the effect of long-running serializable transactions until such time as processor power becomes your limiting factor.

The read committed ones are a more difficult matter, but I think you can treat a tuple as dead if it was inserted after the read committed transaction started to run AND the tuple was deleted before that transaction's currently running command started to run. I suppose the major difficulty here is that currently a transaction has no way of knowing when another backend's command started to run?

Is this too difficult to do, or is it a good idea that no one has enough 'round tuits for?

Regards,
Paul Tillotson
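To make the proposed per-tuple test concrete, here is a rough standalone sketch of the rule described above. Every name in it is invented for illustration (the real visibility checks live in tqual.c and track far more state), and it assumes commit order follows xid order, which the follow-up below explains is not safe to assume.

/*
 * Rough sketch of the proposed check, NOT PostgreSQL code.  All types and
 * functions here are hypothetical, and the sketch assumes that commit
 * order follows xid order, which is not true in general.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Xid;

typedef struct
{
    Xid xmin;   /* inserting transaction, assumed committed */
    Xid xmax;   /* deleting transaction, assumed committed */
} DeadTuple;

typedef struct
{
    Xid xid;    /* the long-running serializable transaction */
} SerializableXact;

/* Placeholder: did xact a commit before xact b started to run? */
static bool committed_before_started(Xid a, Xid b)
{
    return a < b;   /* gross simplification for the sketch */
}

/*
 * The rule proposed above: an already-deleted tuple is invisible to
 * serializable transaction X if it was inserted after X, or deleted
 * before X started to run.
 */
static bool invisible_to_st(const DeadTuple *t, const SerializableXact *x)
{
    return t->xmin > x->xid || committed_before_started(t->xmax, x->xid);
}

/*
 * VACUUM would compute its usual cutoff while ignoring serializable
 * transactions, then additionally require invisibility to each of them.
 */
static bool reclaimable(const DeadTuple *t, Xid nonserializable_cutoff,
                        const SerializableXact *sts, int nsts)
{
    if (!committed_before_started(t->xmax, nonserializable_cutoff))
        return false;
    for (int i = 0; i < nsts; i++)
        if (!invisible_to_st(t, &sts[i]))
            return false;
    return true;
}

int main(void)
{
    /* The example above: tuple (xmin 1200, xmax 1201), serializable xid
     * 1000, oldest other running xid 2000.  Prints 1: dead to everyone. */
    DeadTuple t = { 1200, 1201 };
    SerializableXact st = { 1000 };
    printf("reclaimable: %d\n", reclaimable(&t, 2000, &st, 1));
    return 0;
}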
Paul Tillotson <spam1011@adelphia.net> writes:
> The topic of improving vacuum for use in heavy-update environments seems
> to come up frequently on the list. Has anyone weighed the costs of
> allowing VACUUM to reclaim tuples that are not older than the oldest
> transaction but are nonetheless invisible to all running transactions?
> It seems that it's not that hard....

It's not that easy either --- you are assuming that every process advertises far more of its internal state than it actually does.

> Suppose that in the system, we have a serializable transaction with xid
> 1000 and a read committed transaction with xid 1001. Other than these
> two, the oldest running xid is 2000.

> Suppose we consider a tuple with xmin 1200 and xmax 1201. We will
> assume that xid 1201 committed before xid 2000 began to run.

> So:

> (A) This tuple is invisible to the serializable transaction, since its
> snapshot can't ever advance.

Wrong --- you can't assume that simply from the transaction numbering, even assuming that you know that 1000 is serializable. 1000 might not have set its snapshot until quite some time after it started. (This is even pretty likely, if it waited for some locks before setting the snapshot.) You'd need access to the snapshot 1000 is actually using to be sure which "later" transactions are invisible to it.

While advertising whole snapshots (rather than just xmin) in shared memory is at least theoretically possible, the costs of doing that seem nontrivial to me ... and they'd have to be paid whether any savings ensued or not.

> (B) The read committed transaction might be able to see it. However, if
> its current command started AFTER xid 1201 committed, it can't.

Another issue is that there's not just "one single snapshot" to worry about per backend. Cursors for instance capture their own snaps. So a backend would have to somehow keep track of the oldest live snapshot it has internally.

> The read committed ones are a more difficult matter, but I think you can
> treat a tuple as dead if it was inserted after the read committed
> transaction started to run AND the tuple was deleted before the
> transaction's currently running command started to run.

To do that requires not just that you have access to a backend's oldest snapshot, but that you have access to *all* its active snapshots; because such a transient tuple might be visible in some newer snap even though it's too new for the oldest snap. Doing that will create very significant problems of shared memory management, as well as performance and locking issues.

There's been some talk of distinguishing "global" and "within database" xmin values, so that a long-lived transaction wouldn't interfere with vacuuming tables in other databases that that xact couldn't possibly access. That seems doable to me, but I think any finer-grained analysis is probably going to be a net loss because of the distributed overhead it'd impose on every transaction.

			regards, tom lane
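For a sense of what "advertising whole snapshots in shared memory" would involve, here is a hypothetical sketch of the extra per-backend state. Nothing like these structures exists in PostgreSQL, where the proc array only advertises the xid and xmin; the point is only to show the bookkeeping every snapshot-taking operation would have to pay for.

/*
 * Hypothetical per-backend shared-memory state if whole snapshots were
 * advertised.  All names are invented for illustration.
 */
#include <stdint.h>

typedef uint32_t Xid;

#define MAX_ADVERTISED_XIP 64          /* arbitrary cap for the sketch */

typedef struct
{
    Xid xmin;                          /* every xid < xmin had finished */
    Xid xmax;                          /* every xid >= xmax had not started */
    int nxip;                          /* in-progress xids in between */
    Xid xip[MAX_ADVERTISED_XIP];
} AdvertisedSnapshot;

typedef struct
{
    Xid                xid;            /* already advertised today */
    Xid                xmin;           /* likewise */
    AdvertisedSnapshot oldest_snap;    /* would have to be kept current as
                                        * cursors and new commands create and
                                        * destroy snapshots, under a lock,
                                        * whether or not any VACUUM ever
                                        * looks at it */
} AdvertisedBackendState;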
On Tue, 2006-02-28 at 01:32 -0500, Tom Lane wrote:
> Paul Tillotson <spam1011@adelphia.net> writes:
> > The topic of improving vacuum for use in heavy-update environments seems
> > to come up frequently on the list. Has anyone weighed the costs of
> > allowing VACUUM to reclaim tuples that are not older than the oldest
> > transaction but are nonetheless invisible to all running transactions?
> > It seems that it's not that hard....
>
> It's not that easy either --- you are assuming that every process
> advertises far more of its internal state than it actually does.

Paul raises some thoughts that are worth considering, even with the usual minefield of difficulties.

Paul, you mention serializable transactions, but your root issue seems to be that VACUUM clears up fewer rows when pg_dump is running, yes? Have you tried using an on-line hot backup with archive_command set (PITR)? That doesn't suffer from the same issue and is faster too.
OTOH a few hackers discussed this recently and found that nobody used serializable transactions (ST) except during pg_dump. It seems a reasonable observation that *most* STs are pg_dumps, or at the very least that the longest-running STs are pg_dumps. So rather than changing all transaction modes, or even special-casing STs, why not put in some infrastructure to cope specifically with the problems that pg_dump can cause?

A general facility that would allow STs to identify which tables they would/would not touch again could be used by pg_dump to advertise useful information. That information could then be picked up by a VACUUM: when locking to get xmin it would see an ST, then retrieve the information to allow it to work out a per-table xmin. Non-VACUUM transactions would ignore any special ST information, causing very low overhead for normal operation (checking whether each current transaction was an ST, which mostly will be predicted correctly as "no" by the CPU).

You could take that further and get pg_dump to use a list file like pg_restore. You would then be able to *where possible* alter the sequence of data dumping so that heavily updated tables were dumped first, so the dumping ST could then advertise "no further access" to particular tables. VACUUMs could then proceed as if the ST were not there at all.

Or maybe at least the idea of some special-case ST behaviour might be worthy of some thought.

I've no intention of working on this myself, especially since PITR provides an alternate backup solution anyway (even in combination with other techniques), but the idea seems worth recording for others to discuss.

Best Regards, Simon Riggs
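A very rough sketch of the per-table xmin idea, with invented names (no such advertisement mechanism exists in PostgreSQL): VACUUM would skip, for a given table, any serializable transaction that has promised not to touch that table again.

/*
 * Sketch only: a serializable transaction such as pg_dump advertises the
 * tables it is finished with, and VACUUM derives a per-table xmin that
 * ignores those transactions.  All names are invented for illustration.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;
typedef uint32_t TableOid;

#define MAX_DONE_TABLES 128

typedef struct
{
    Xid      xmin;                          /* what the proc array advertises */
    bool     is_serializable;
    int      ndone;
    TableOid done_tables[MAX_DONE_TABLES];  /* "no further access" promises */
} XactAdvertisement;

static bool done_with_table(const XactAdvertisement *x, TableOid table)
{
    for (int i = 0; i < x->ndone; i++)
        if (x->done_tables[i] == table)
            return true;
    return false;
}

/*
 * Per-table xmin: the usual minimum over all running transactions, except
 * that serializable transactions which have promised not to touch this
 * table again are ignored.
 */
static Xid per_table_xmin(const XactAdvertisement *xacts, int nxacts,
                          TableOid table, Xid next_xid)
{
    Xid result = next_xid;
    for (int i = 0; i < nxacts; i++)
    {
        if (xacts[i].is_serializable && done_with_table(&xacts[i], table))
            continue;                       /* pg_dump is finished with it */
        if (xacts[i].xmin < result)
            result = xacts[i].xmin;
    }
    return result;
}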
On Tue, Feb 28, 2006 at 01:22:35PM +0000, Simon Riggs wrote:
> Paul, you mention serializable transactions, but your root issue seems
> to be that VACUUM clears up fewer rows when pg_dump is running, yes? Have
> you tried using an on-line hot backup with archive_command set (PITR)?
> That doesn't suffer from the same issue and is faster too.
>
> OTOH a few hackers discussed this recently and found that nobody used
> serializable transactions (ST) except during pg_dump. It seems a
> reasonable observation that *most* STs are pg_dumps, or at the very
> least that the longest-running STs are pg_dumps. So rather than changing
> all transaction modes, or even special-casing STs, why not put in some
> infrastructure to cope specifically with the problems that pg_dump can
> cause?

While it's not currently serializable, another big candidate IMO is vacuum itself. Vacuuming a large table in a database that also sees heavy update activity can be a real nightmare, because dead space piles up in the updated tables while the long vacuum is running. Although there's probably any number of ways that this problem could be addressed, making vacuum a serializable transaction (which shouldn't be an issue, afaik) and creating a generic framework that optimizes for that case would win in more than one place.

Also, does this really only apply to serializable transactions? As the OP stated, if a row couldn't possibly be visible to a specific (old) transaction, it should be safe to vacuum it...

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com
On Tue, Feb 28, 2006 at 01:32:19AM -0500, Tom Lane wrote:
> To do that requires not just that you have access to a backend's oldest
> snapshot, but that you have access to *all* its active snapshots;
> because such a transient tuple might be visible in some newer snap even
> though it's too new for the oldest snap. Doing that will create very
> significant problems of shared memory management, as well as performance
> and locking issues.
>
> There's been some talk of distinguishing "global" and "within database"
> xmin values, so that a long-lived transaction wouldn't interfere with
> vacuuming tables in other databases that that xact couldn't possibly
> access. That seems doable to me, but I think any finer-grained analysis
> is probably going to be a net loss because of the distributed overhead
> it'd impose on every transaction.

True, but we don't need this for every transaction, only long-running ones. And in most cases, it'd probably be safe to define 'long-running' in terms of minutes. Presumably, a mechanism similar to statement_timeout could be used to 'publish' the required state information after a given period of time.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com
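One way the "publish only for long-running transactions" idea could look, sketched with an invented setting name (no such GUC or function exists); the point is that short transactions never pay the publishing cost.

/*
 * Sketch of publishing extra snapshot state only after a transaction has
 * been running for a while, by analogy with statement_timeout.  The
 * setting name and functions are invented; nothing like this exists today.
 */
#include <stdbool.h>
#include <time.h>

/* hypothetical setting, in seconds; 0 disables publishing entirely */
static int snapshot_publish_timeout = 60;

typedef struct
{
    time_t xact_start;   /* when this backend's transaction began */
    bool   published;    /* have we copied our snapshots to shared memory? */
} LocalXactState;

/* Would be called periodically, e.g. from an existing timeout hook. */
static void maybe_publish_snapshots(LocalXactState *st, time_t now)
{
    if (st->published || snapshot_publish_timeout <= 0)
        return;

    if (difftime(now, st->xact_start) >= snapshot_publish_timeout)
    {
        /* Here the backend would copy all of its active snapshots into
         * shared memory for VACUUM to inspect (omitted).  Short-lived
         * transactions never reach this point, so they pay nothing. */
        st->published = true;
    }
}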
>>> On Tue, Feb 28, 2006 at 7:22 am, in message
<1141132955.27729.119.camel@localhost.localdomain>, Simon Riggs
<simon@2ndquadrant.com> wrote:
>
> OTOH a few hackers discussed this recently and found that nobody used
> serializable transactions (ST) except during pg_dump.

I've not been able to keep up with all messages on these lists, and I missed that discussion. We use serializable transactions heavily; our whole middle tier architecture depends on having that transaction isolation level for all requests which modify data. (You probably don't want to hear the details.)

It would be OK (although a little disappointing) if VACUUM enhancements weren't as beneficial to us as a result; it would render PostgreSQL entirely unusable for us if the integrity of serializable transactions was broken unless we added some other, non-standard steps to run them.

We only use pg_dump for version upgrades and other special cases. PITR is our main backup technique.

-Kevin
On Wed, 2006-03-01 at 10:22 -0600, Kevin Grittner wrote:
> >>> On Tue, Feb 28, 2006 at 7:22 am, in message
> <1141132955.27729.119.camel@localhost.localdomain>, Simon Riggs
> <simon@2ndquadrant.com> wrote:
> >
> > OTOH a few hackers discussed this recently and found that nobody used
> > serializable transactions (ST) except during pg_dump.
>
> I've not been able to keep up with all messages on these lists, and I
> missed that discussion.

It was a verbal discussion, hence not recorded on list. I should have said "nobody in that discussion"; I had no doubt somebody used them. My mention of that wasn't to add weight to the thought, just to mention that a quick straw poll had been taken...

> We use serializable transactions heavily; our whole middle tier
> architecture depends on having that transaction isolation level for all
> requests which modify data. (You probably don't want to hear the
> details.)

*I* would, but others may not. ;-)

> It would be OK (although a little disappointing) if VACUUM
> enhancements weren't as beneficial to us as a result; it would render
> PostgreSQL entirely unusable for us if the integrity of serializable
> transactions was broken unless we added some other, non-standard steps
> to run them.

I would never suggest breaking STs; they are part of the SQL standard. I merely suggested an extra, optional API by which ST users could provide additional information that could help others avoid pessimal decisions made in order to preserve correctness.

> We only use pg_dump for version upgrades and other special cases. PITR
> is our main backup technique.

Cool.

Best Regards, Simon Riggs
>>> On Wed, Mar 1, 2006 at 11:02 am, in message
<1141232565.27729.379.camel@localhost.localdomain>, Simon Riggs
<simon@2ndquadrant.com> wrote:
> On Wed, 2006-03-01 at 10:22 -0600, Kevin Grittner wrote:
> >>> On Tue, Feb 28, 2006 at 7:22 am, in message
> <1141132955.27729.119.camel@localhost.localdomain>, Simon Riggs
> <simon@2ndquadrant.com> wrote:
>
>> We use serializable transactions heavily; our whole middle tier
>> architecture depends on having that transaction isolation level for all
>> requests which modify data. (You probably don't want to hear the
>> details.)
>
> *I* would, but others may not. ;-)

An executive overview of our environment, with enough detail to constitute more than vague hand waving, would probably be at least 4K of text. If you're interested, I could write something up and post it somewhere, but this list doesn't seem to be the appropriate place. Where would be?

The general hand-waving overview: We've got about 100 databases with a lot of fancy portability features which allow real-time data replication in a heterogeneous environment: 3,000 directly connected users, dozens of queue-based interfaces (in both directions) with business partner agencies, and a web site with about 2 million hits per day that query the databases (when you count both SOAP and browser traffic). So far four of our databases are on PostgreSQL, with another four being converted over the next few days. The servers spread around the state will be converted more gradually, over the course of the next year or so.

Software is almost entirely Java, and mostly home-grown. All database access is done through JDBC from a middle tier database service which treats each client request as one (and only one) database transaction, so that serialization errors can be handled automatically. Queries are written in an approved subset of standard ANSI SQL, with our query tool parsing that and turning it into Java classes which use "lowest common denominator" SQL code, with Java code to handle all procedural steps. Since stored procedures and triggers are implemented in our Java code, we can see all data flow, allowing us to copy the transactions to audit databases and replicate the data from multiple sources to multiple targets. It is easy to show a number of ways that we will have data integrity problems if the JDBC requests from the trigger code aren't in a SERIALIZABLE transaction with the triggering data modification.

Now, aren't you sorry you asked? ;-)

-Kevin