Thread: Idea for getting rid of VACUUM FREEZE on cold pages
From a discussion at dinner at pgcon, I wanted to send this to the list for people to poke holes in it:

Problem: currently, if your database has a large amount of "cold" data, such as 350GB of 3-year-old sales transactions, in 8.4 vacuum no longer needs to touch it thanks to the visibility map. However, every freeze_age transactions, very old pages need to be sucked into memory and rewritten just in order to freeze those pages. This can have a huge impact on system performance, and seems unjustified because the pages are not actually being used.

Suggested resolution: we would add a 4-byte field to the *page* header which would track the XID wraparound count. Any page whose wraparound count was not equal to the current one would be considered to have all frozen tuples. This would remove the necessity to read and write old pages just to freeze them, a humongous gain for databases with long data retention horizons, let alone data warehouses.

All xids on the page would, necessarily, need to belong to the same wraparound; if a page gets updated and its wraparound count (hereafter WCID) is lower than the current one, all tuples on the page would be frozen before any data is written to it. XIDs which were before the max_freeze horizon on a page that was being written anyway would be frozen as they are now.

Obvious issues:

(1) For rows written close to the wraparound point, this would cause a set of tuples to be frozen sooner than they would be in the current system.

(2) It's not clear what to do with a page that holds still-visible XIDs from just before wraparound (like XID #2.4b) and then receives a write with a new-cycle xid (#1).

(3) This will require changing the page structure, with all that entails, so it should probably be done when we're making another change (like adding CRCs).

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
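To make the proposal concrete, here is a minimal sketch of the page-level test it implies. The field name pd_xid_epoch and the function are invented for illustration only; they are not existing PostgreSQL code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical page header carrying the proposed wraparound count (WCID). */
    typedef struct PageHeaderSketch
    {
        /* ... the existing page header fields would go here ... */
        uint32_t pd_xid_epoch;  /* XID wraparound count when the page was last written */
    } PageHeaderSketch;

    /*
     * Under the proposal, a page whose recorded wraparound count differs from
     * the current one is treated as if every tuple on it were already frozen,
     * so vacuum never needs to read or rewrite it just to freeze tuples.
     */
    static bool
    page_treated_as_frozen(const PageHeaderSketch *page, uint32_t current_epoch)
    {
        return page->pd_xid_epoch != current_epoch;
    }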
On May 21, 2010, at 23:57 , Josh Berkus wrote:

> From a discussion at dinner at pgcon, I wanted to send this to the list for people to poke holes in it:
>
> Problem: currently, if your database has a large amount of "cold" data, such as 350GB of 3-year-old sales transactions, in 8.4 vacuum no longer needs to touch it thanks to the visibility map. However, every freeze_age transactions, very old pages need to be sucked into memory and rewritten just in order to freeze those pages. This can have a huge impact on system performance, and seems unjustified because the pages are not actually being used.
>
> Suggested resolution: we would add a 4-byte field to the *page* header which would track the XID wraparound count. Any page whose wraparound count was not equal to the current one would be considered to have all frozen tuples. This would remove the necessity to read and write old pages just to freeze them, a humongous gain for databases with long data retention horizons, let alone data warehouses.

If I understand this correctly, VACUUM usually only frees old tuples, but never increases the oldest xid in the pg_class record. Once that value becomes older than freeze_age, VACUUM needs to scan the whole relation to freeze old tuples. That results in most of the pages being marked dirty and subsequently being written out, causing an IO storm. If, OTOH, the wraparound count was stored in the page header, VACUUM would still need to read those pages, but wouldn't need to write them out.

Alternatively, VACUUM could freeze a few pages on each run, even if the xids are below freeze_age. It could pick those pages randomly, or maybe even prefer pages whose tuples have older xmin/xmax values. That would spread the load out more evenly, much like we try to spread checkpoints out over the whole checkpoint interval.

best regards, Florian Pflug
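Florian's "spread the freezing out" alternative could look roughly like the following sketch. The per-run budget and the two callbacks are assumptions made for illustration; nothing here is an existing vacuum API.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-run budget: freeze at most this many pages per pass,
     * so the work never piles up into one huge anti-wraparound scan. */
    #define FREEZE_PAGES_PER_RUN 128

    typedef struct VacuumRunSketch
    {
        uint32_t nblocks;                                 /* pages in the relation */
        bool     (*page_has_unfrozen_tuples)(uint32_t blkno);
        void     (*freeze_page)(uint32_t blkno);
    } VacuumRunSketch;

    /* Freeze a small batch of not-yet-frozen pages even though their xids are
     * still well below freeze_age, spreading the cost over many vacuum runs. */
    static void
    opportunistic_freeze_pass(const VacuumRunSketch *run)
    {
        int budget = FREEZE_PAGES_PER_RUN;

        for (uint32_t blk = 0; blk < run->nblocks && budget > 0; blk++)
        {
            if (run->page_has_unfrozen_tuples(blk))
            {
                run->freeze_page(blk);
                budget--;
            }
        }
    }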
Josh Berkus <josh@agliodbs.com> writes: > From a discussion at dinner at pgcon, I wanted to send this to the list > for people to poke holes in it: Somebody (I think Joe or Heikki) poked a big hole in this last night at the Royal Oak. Although the scheme would get rid of the need to replace old XIDs with FrozenXid, it does not get rid of the need to set hint bits before you can truncate CLOG. So in your example of an insert-only table that's probably never read again, there's still a minimum of one update visit required on every old page. Now that's still better than two update visits ... but we could manage that already, just by tweaking vacuum's heuristics about when to freeze vs when to set hint bits. regards, tom lane
> Somebody (I think Joe or Heikki) poked a big hole in this last night at > the Royal Oak. Although the scheme would get rid of the need to replace > old XIDs with FrozenXid, it does not get rid of the need to set hint > bits before you can truncate CLOG. So in your example of an insert-only > table that's probably never read again, there's still a minimum of one > update visit required on every old page. Now that's still better than > two update visits ... but we could manage that already, just by tweaking > vacuum's heuristics about when to freeze vs when to set hint bits. Yeah, someone pointed that out to me too and suggested that a freeze map was the better solution. I still think there's something we can do with pages on the visibility map but I'll have to think about it some more. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: >> Somebody (I think Joe or Heikki) poked a big hole in this last night at >> the Royal Oak. Although the scheme would get rid of the need to replace >> old XIDs with FrozenXid, it does not get rid of the need to set hint >> bits before you can truncate CLOG. So in your example of an insert-only >> table that's probably never read again, there's still a minimum of one >> update visit required on every old page. Now that's still better than >> two update visits ... but we could manage that already, just by tweaking >> vacuum's heuristics about when to freeze vs when to set hint bits. > Yeah, someone pointed that out to me too and suggested that a freeze map > was the better solution. I still think there's something we can do with > pages on the visibility map but I'll have to think about it some more. It occurred to me on the flight home that maybe we could salvage something from this if there were some mechanism that caused hint bits to get set before the page got written out from shared buffers the first time. This assumes that you have enough slack in shared-buffer space that the transactions that touched a particular page all commit or abort before the page first gets flushed to disk. regards, tom lane
On 5/22/2010 9:16 PM, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: >>> Somebody (I think Joe or Heikki) poked a big hole in this last night at >>> the Royal Oak. Although the scheme would get rid of the need to replace >>> old XIDs with FrozenXid, it does not get rid of the need to set hint >>> bits before you can truncate CLOG. So in your example of an insert-only >>> table that's probably never read again, there's still a minimum of one >>> update visit required on every old page. Now that's still better than >>> two update visits ... but we could manage that already, just by tweaking >>> vacuum's heuristics about when to freeze vs when to set hint bits. > >> Yeah, someone pointed that out to me too and suggested that a freeze map >> was the better solution. I still think there's something we can do with >> pages on the visibility map but I'll have to think about it some more. > > It occurred to me on the flight home that maybe we could salvage > something from this if there were some mechanism that caused hint bits > to get set before the page got written out from shared buffers the first > time. This assumes that you have enough slack in shared-buffer space > that the transactions that touched a particular page all commit or abort > before the page first gets flushed to disk. At least the background writer should have a few spare cycles to look over a "to be flushed" page before writing it. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 22/05/10 16:35, Tom Lane wrote: > Josh Berkus<josh@agliodbs.com> writes: >> From a discussion at dinner at pgcon, I wanted to send this to the list >> for people to poke holes in it: > > Somebody (I think Joe or Heikki) poked a big hole in this last night at > the Royal Oak. Me. > Although the scheme would get rid of the need to replace > old XIDs with FrozenXid, it does not get rid of the need to set hint > bits before you can truncate CLOG. Hmm, we don't rely on setting hint bits to truncate CLOG anymore (http://archives.postgresql.org/pgsql-committers/2006-11/msg00026.php). It's the replacement of xids with FrozenXid that matters, the hint bits are really just hints. Doesn't change the conclusion, though: you still need to replace XIDs with FrozenXids to truncate the clog. Conceivably we could keep around more than 2^32 transactions in clog with this scheme, but then you need a lot more space for the clog. But perhaps it would be better to do that than to launch anti-wraparound vacuums, or to refuse more updates in the extreme cases. > So in your example of an insert-only > table that's probably never read again, there's still a minimum of one > update visit required on every old page. Now that's still better than > two update visits ... but we could manage that already, just by tweaking > vacuum's heuristics about when to freeze vs when to set hint bits. (As also discussed in the Royal Oak) I think we should simply not dirty a page when a hint bit is updated. Reading a page from disk is expensive, setting hint bits on the access is generally cheap compared to that. But that is orthogonal to the idea of a per-page XID epoch. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
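A minimal sketch of the "don't dirty the page on a hint-bit set" idea, using invented structures rather than the real buffer-manager API: the bit is updated in the shared copy so later readers can skip the clog lookup, but the buffer's dirty flag is left alone, so losing the hint on eviction costs only a repeated lookup later.

    #include <stdbool.h>
    #include <stdint.h>

    #define HINT_XMIN_COMMITTED 0x0100   /* invented bit value, for illustration */

    typedef struct BufferSketch
    {
        uint16_t *tuple_infomask;   /* per-tuple status bits in the shared copy */
        bool      dirty;            /* the buffer manager's dirty flag          */
    } BufferSketch;

    static void
    set_hint_bit_without_dirtying(BufferSketch *buf, int tupno)
    {
        /* Record the hint in the in-memory copy so later readers can skip the
         * clog lookup ... */
        buf->tuple_infomask[tupno] |= HINT_XMIN_COMMITTED;

        /* ... but deliberately do NOT set buf->dirty: if the page is evicted
         * without being written for some other reason, the hint is simply
         * lost, and the only cost is another clog lookup later. */
    }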
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > (As also discussed in the Royal Oak) I think we should simply not dirty > a page when a hint bit is updated. Reading a page from disk is > expensive, setting hint bits on the access is generally cheap compared > to that. But that is orthogonal to the idea of a per-page XID epoch. I'm not sure it's cheap. What you suggest would result in a substantial increase in clog accesses, which means (1) more I/O and (2) more contention. Certainly it's worth experimenting with, but it's no guaranteed win. regards, tom lane
> I'm not sure it's cheap. What you suggest would result in a substantial > increase in clog accesses, which means (1) more I/O and (2) more > contention. Certainly it's worth experimenting with, but it's no > guaranteed win. It seems like there's a number of issues we could fix by making the CLOG more efficient somehow -- from the elimination of hint bits to the ability to freeze pages without writing them. Not, of course, that I have any idea how to do that. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Excerpts from Josh Berkus's message of vie may 21 17:57:35 -0400 2010: > Problem: currently, if your database has a large amount of "cold" data, > such as 350GB of 3-year-old sales transactions, in 8.4 vacuum no longer > needs to touch it thanks to the visibility map. However, every > freeze_age transactions, very old pages need to be sucked into memory > and rewritten just in order to freeze those pages. This can have a huge > impact on system performance, and seems unjustified because the pages > are not actually being used. I think this is nonsense. If you have 3-year-old sales transactions, and your database has any interesting churn, the tuples on those pages have been frozen for a very long time *already*. The problem is vacuum reading them in so that it can verify there's nothing to do. If we want to avoid *reading* those pages, this solution is useless: > Suggested resolution: we would add a 4-byte field to the *page* header > which would track the XID wraparound count. because you still have to read the page. I think what you're looking for is for this Xid wraparound count to be stored elsewhere, not inside the page. That way vacuum can check it and skip the page without reading it at all. I think a "freeze map" has been mentioned downthread. I remember mentioning some time ago that we could declare some tables as frozen, i.e. "not needing vacuum". This strikes me as similar, except at the page level rather than table level. -- Álvaro Herrera <alvherre@alvh.no-ip.org>
On 24/05/10 22:49, Alvaro Herrera wrote: > Excerpts from Josh Berkus's message of vie may 21 17:57:35 -0400 2010: > >> Problem: currently, if your database has a large amount of "cold" data, >> such as 350GB of 3-year-old sales transactions, in 8.4 vacuum no longer >> needs to touch it thanks to the visibility map. However, every >> freeze_age transactions, very old pages need to be sucked into memory >> and rewritten just in order to freeze those pages. This can have a huge >> impact on system performance, and seems unjustified because the pages >> are not actually being used. > > I think this is nonsense. If you have 3-years-old sales transactions, > and your database has any interesting churn, tuples those pages have > been frozen for a very long time *already*. The problem is vacuum > reading them in so that it can verify there's nothing to do. If we want > to avoid *reading* those pages, this solution is useless: > >> Suggested resolution: we would add a 4-byte field to the *page* header >> which would track the XID wraparound count. > > because you still have to read the page. What's missing from the suggestion is that relfrozenxid and datfrozenxid also need to be expanded to 8-bytes. That way you effectively have 8-byte XIDs, which means that you never need to vacuum to avoid XID wraparound. You still need to freeze to truncate clog, though, but if you have the disk space, you can now do that every 100 billion transactions for example if you wish. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
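In other words, the per-page epoch plus 8-byte horizons amount to 64-bit transaction ids for freezing purposes. The arithmetic is roughly the following; the helper names are invented for illustration, not functions PostgreSQL provides.

    #include <stdint.h>

    /* Combine a 32-bit wraparound epoch and a 32-bit xid into one 64-bit value. */
    static inline uint64_t
    make_full_xid(uint32_t epoch, uint32_t xid)
    {
        return ((uint64_t) epoch << 32) | xid;
    }

    /* With 64-bit horizons, "is this xid older than that one?" becomes a plain
     * integer comparison -- no modulo-2^31 wraparound arithmetic is needed. */
    static inline int
    full_xid_precedes(uint64_t a, uint64_t b)
    {
        return a < b;
    }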
Excerpts from Heikki Linnakangas's message of mar may 25 04:41:30 -0400 2010: > On 24/05/10 22:49, Alvaro Herrera wrote: > > I think this is nonsense. If you have 3-year-old sales transactions, > > and your database has any interesting churn, the tuples on those pages have > > been frozen for a very long time *already*. > What's missing from the suggestion is that relfrozenxid and datfrozenxid > also need to be expanded to 8-bytes. That way you effectively have > 8-byte XIDs, which means that you never need to vacuum to avoid XID > wraparound. Hmm, so are we going to use the "xid epoch" more officially? That's an entirely new line of development; perhaps it opens new possibilities. This sounds like extending Xid to 64 bits, without having to store the high bits everywhere. Was this discussed in the PGCon devs meeting? -- Álvaro Herrera <alvherre@alvh.no-ip.org>
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > This sounds like extending Xid to 64 bits, without having to store the > high bits everywhere. Was this discussed in the PGCon devs meeting? Yeah, that's what it would amount to. It was not discussed at the dev meeting --- it was an idea that came up one evening at PGCon. I'm not sure whether this would imply having to widen xid to 64 bits internally. That could be a bit unpleasant as far as CPU and shared memory space go, although every year that goes by makes 32-bit machines less interesting as DB servers. regards, tom lane
Alvaro, >> This sounds like extending Xid to 64 bits, without having to store the >> high bits everywhere. Was this discussed in the PGCon devs meeting? Essentially, yes. One of the main objections to raising XID to 64-bit has been the per-row overhead. But adding 4 bytes per page wouldn't be much of an impact. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 5/24/2010 9:30 AM, Heikki Linnakangas wrote: > On 22/05/10 16:35, Tom Lane wrote: >> Josh Berkus<josh@agliodbs.com> writes: >>> From a discussion at dinner at pgcon, I wanted to send this to the list >>> for people to poke holes in it: >> >> Somebody (I think Joe or Heikki) poked a big hole in this last night at >> the Royal Oak. > > Me. > >> Although the scheme would get rid of the need to replace >> old XIDs with FrozenXid, it does not get rid of the need to set hint >> bits before you can truncate CLOG. > > Hmm, we don't rely on setting hint bits to truncate CLOG anymore > (http://archives.postgresql.org/pgsql-committers/2006-11/msg00026.php). > It's the replacement of xids with FrozenXid that matters, the hint bits > are really just hints. > > Doesn't change the conclusion, though: you still need to replace XIDs > with FrozenXids to truncate the clog. Conceivably we could keep around > more than 2^32 transactions in clog with this scheme, but then you need > a lot more space for the clog. But perhaps it would be better to do that > than to launch anti-wraparound vacuums, or to refuse more updates in the > extreme cases. Correct. The real problem is aborted transactions. Just because an XID is really old doesn't mean it was committed. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
> Correct. The real problem is aborted transactions. Just because an > XID is really old doesn't mean it was committed. Yes, that's the main issue with my idea; XIDs which fell off the CLOG would become visible even if they'd aborted. Do we get a bit in the visibility map for a page which has aborted transaction rows on it? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 25/05/10 23:56, Josh Berkus wrote: > Do we get a bit in the visibility map for a page which has aborted > transaction rows on it? If there's a tuple with an aborted xmin on a page, the bit in the visibility map is not set. A tuple with aborted xmax doesn't matter. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 5/25/10 10:04 PM, Heikki Linnakangas wrote: > On 25/05/10 23:56, Josh Berkus wrote: >> Do we get a bit in the visibility map for a page which has aborted >> transaction rows on it? > > If there's a tuple with an aborted xmin on a page, the bit in the > visibility map is not set. A tuple with aborted xmax doesn't matter. Then it seems like pages in the visibility map, at least, would not need to be vacuumed or frozen. Do pages persist in the visibility map indefinitely? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 26/05/10 21:35, Josh Berkus wrote: > On 5/25/10 10:04 PM, Heikki Linnakangas wrote: >> On 25/05/10 23:56, Josh Berkus wrote: >>> Do we get a bit in the visibility map for a page which has aborted >>> transaction rows on it? >> >> If there's a tuple with an aborted xmin on a page, the bit in the >> visibility map is not set. A tuple with aborted xmax doesn't matter. > > Then it seems like pages in the visibility map, at least, would not need > to be vacuumed or frozen. Do pages persist in the visibility map > indefinitely? In theory, until any tuple on the page is inserted/updated/deleted again. However, we've been operating on the assumption that it's always safe to clear any bit in the visibility map, without affecting correctness. I would not like to give up that assumption, it makes life easier. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> In theory, until any tuple on the page is inserted/updated/deleted > again. However, we've been operating on the assumption that it's always > safe to clear any bit in the visibility map, without affecting > correctness. I would not like to give up that assumption, it makes life > easier. It wouldn't affect correctness, it would just force that page to be vacuumed-and-frozen. I think I can make this work, let me just hammer it out. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Wed, May 26, 2010 at 2:44 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 26/05/10 21:35, Josh Berkus wrote: >> On 5/25/10 10:04 PM, Heikki Linnakangas wrote: >>> On 25/05/10 23:56, Josh Berkus wrote: >>>> Do we get a bit in the visibility map for a page which has aborted >>>> transaction rows on it? >>> >>> If there's a tuple with an aborted xmin on a page, the bit in the >>> visibility map is not set. A tuple with aborted xmax doesn't matter. >> >> Then it seems like pages in the visibility map, at least, would not need >> to be vacuumed or frozen. Do pages persist in the visibility map >> indefinitely? > > In theory, until any tuple on the page is inserted/updated/deleted again. > However, we've been operating on the assumption that it's always safe to > clear any bit in the visibility map, without affecting correctness. I would > not like to give up that assumption, it makes life easier. What if we drove it off of the PD_ALL_VISIBLE bit on the page itself, rather than the visibility map bit? It would be safe to clear the visibility map bit without touching the page, but if you clear the PD_ALL_VISIBLE bit on the page itself then you set all the hint bits and freeze all the tuples. In the case where the visibility map bit gets cleared but the page-level bit is still set, a future vacuum can notice and reset the visibility map bit. But whenever the visibility map bit is set, you know that the page-level bit MUST be set, so you needn't vacuum those pages, even for anti-wraparound: you know they'll be frozen when and if they ever get written again. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
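A rough sketch of the invariant Robert describes, with invented types (PD_ALL_VISIBLE is real; everything else here is illustration, not PostgreSQL source): to clear the page-level all-visible flag because a new tuple is about to be written, first freeze everything already on the page; the map bit can then be cleared lazily, since the page bit alone guarantees that anything a vacuum skipped is already frozen.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_TUPLES_SKETCH 64
    #define FROZEN_XID        2     /* PostgreSQL's FrozenTransactionId is 2 */

    typedef struct PageSketch
    {
        bool     all_visible;                  /* stands in for PD_ALL_VISIBLE */
        int      ntuples;
        uint32_t xmin[MAX_TUPLES_SKETCH];      /* per-tuple xmin, illustration */
    } PageSketch;

    static void
    prepare_all_visible_page_for_write(PageSketch *page)
    {
        if (!page->all_visible)
            return;

        /* Every tuple is already known visible to everyone, so replacing the
         * xmins with the frozen id loses nothing but forensic information. */
        for (int i = 0; i < page->ntuples; i++)
            page->xmin[i] = FROZEN_XID;

        page->all_visible = false;
    }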
> What if we drove it off of the PD_ALL_VISIBLE bit on the page itself, > rather than the visibility map bit? It would be safe to clear the > visibility map bit without touching the page, but if you clear the > PD_ALL_VISIBLE bit on the page itself then you set all the hint bits > and freeze all the tuples. In the case where the visibility map bit > gets cleared but the page-level bit is still set, a future vacuum can > notice and reset the visibility map bit. But whenever the visibility > map bit is set, you know that the page-level bit MUST be set, so you > needn't vacuum those pages, even for anti-wraparound: you know they'll > be frozen when and if they ever get written again. How does that get us out of reading and writing old pages, though? If we're going to set a bit on them, we might as well freeze them. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > How does that get us out of reading and writing old pages, though? Yeah. Neither PD_ALL_VISIBLE nor the visibility map are going to solve your problem, because they cannot become set without having visited the page. regards, tom lane
On Wed, May 26, 2010 at 8:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >> How does that get us out of reading and writing old pages, though? > > Yeah. Neither PD_ALL_VISIBLE nor the visibility map are going to solve > your problem, because they cannot become set without having visited the > page. Well, maybe I'm confused here, but arranging things so that we NEVER have to visit the page after initially writing it seems like it's setting the bar almost impossibly high. Consider a table that is regularly written but append-only. Every time autovacuum kicks in, we'll go and remove any dead tuples and then mark the pages PD_ALL_VISIBLE and set the visibility map bits, which will cause subsequent vacuums to ignore the all-visible portions of the table... until anti-wraparound kicks in, at which point we'll vacuum the entire table and freeze everything. If, however, we decree that you can't write a new tuple into a PD_ALL_VISIBLE page without freezing the existing tuples, then you'll still have the small, incremental vacuums but those are pretty cheap, and in any event, I don't see any way to get rid of them unless someone can devise a scheme to do away with vacuum entirely. But you won't need the full-table vacuum to freeze tuples, because you can freeze them opportunistically the next time those pages are written (at which point freezing will be very cheap because the page has to be written to disk at that point no matter what). Maybe I'm confused. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, May 26, 2010 at 8:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yeah. �Neither PD_ALL_VISIBLE nor the visibility map are going to solve >> your problem, because they cannot become set without having visited the >> page. > Well, maybe I'm confused here, but arranging things so that we NEVER > have to visit the page after initially writing it seems like it's > setting the bar almost impossibly high. Well, that was the use-case that Josh was on about when this idea came up: high-volume append-only log tables that in most cases will never be read, so his client wants to get rid of the extra I/O for maintenance visits to once-written pages. If you're willing to allow one visit and rewrite of each page, then we can do that today with maybe a bit of rejiggering of vacuum's when-to-freeze heuristics. regards, tom lane
On Wed, May 26, 2010 at 8:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, May 26, 2010 at 8:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Yeah. Neither PD_ALL_VISIBLE nor the visibility map are going to solve >>> your problem, because they cannot become set without having visited the >>> page. > >> Well, maybe I'm confused here, but arranging things so that we NEVER >> have to visit the page after initially writing it seems like it's >> setting the bar almost impossibly high. > > Well, that was the use-case that Josh was on about when this idea came > up: high-volume append-only log tables that in most cases will never be > read, so his client wants to get rid of the extra I/O for maintenance > visits to once-written pages. Well, I'll just note that using PD_ALL_VISIBLE as I'm proposing is basically equivalent to Josh's original proposal of using an XID epoch except that it addresses all three of the "obvious issues" which he noted in his original email; plus it doesn't prevent truncating CLOG (on the assumption that we rejigger things not to consult clog when the page is marked PD_ALL_VISIBLE). > If you're willing to allow one visit and rewrite of each page, then > we can do that today with maybe a bit of rejiggering of vacuum's > when-to-freeze heuristics. Hmm, yeah. Maybe we should freeze when we set PD_ALL_VISIBLE; that might be just as good, and simpler. Assuming the visibility map is sufficiently crash-safe/non-buggy, we could then teach VACUUM that it's OK to advance relfrozenxid even when doing just a partial vacuum - because any pages that were skipped must contain only frozen tuples. Previously you've objected to proposals in this direction because they might destroy forensic information, but maybe we should do it anyway. Either way, I think if we do this it *basically* gets rid of anti-wraparound vacuum. Yeah, we'll still do routine partial vacuums, but what you won't have is... write the table, vacuum, vacuum, vacuum, vacuum, OK, everything's visible to everyone, don't need to vacuum any more... months pass... boom, unexpected full-table vacuum. The beginning part is the same, but you get rid of the boom at the end. The only way I see to cut down vacuum activity even further is to freeze them (and set the visibility map bit) before evicting them from shared_buffers. That's really the only way to get "write once and only once", but it's pretty hit or miss, because the xmin horizon might not advance fast enough to make it actually work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
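A sketch of the relfrozenxid bookkeeping that a partial vacuum would need under this scheme: pages skipped because a trusted visibility map says they are all visible (and, under this proposal, therefore already frozen) cannot hold the horizon back, so only scanned pages contribute. The 64-bit xids keep the comparison simple here, and all names are invented.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct PartialVacuumSketch
    {
        uint64_t candidate_relfrozenxid;   /* oldest xid that may remain after this vacuum */
    } PartialVacuumSketch;

    static void
    observe_page(PartialVacuumSketch *vac, bool skipped_via_map,
                 uint64_t oldest_xid_on_page)
    {
        if (skipped_via_map)
            return;                 /* nothing unfrozen can be hiding there */

        if (oldest_xid_on_page < vac->candidate_relfrozenxid)
            vac->candidate_relfrozenxid = oldest_xid_on_page;
    }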
On 27/05/2010, at 02.48, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 26, 2010 at 8:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Josh Berkus <josh@agliodbs.com> writes: >>> How does that get us out of reading and writing old pages, though? >> >> Yeah. Neither PD_ALL_VISIBLE nor the visibility map are going to >> solve >> your problem, because they cannot become set without having visited >> the >> page. > > Well, maybe I'm confused here, but arranging things so that we NEVER > have to visit the page after initially writing it seems like it's > setting the bar almost impossibly high. Consider a table that is > regularly written but append-only. Every time autovacuum kicks in, > we'll go and remove any dead tuples and then mark the pages > PD_ALL_VISIBLE and set the visibility map bits, which will cause > subsequent vacuums to ignore the all-visible portions of the table... > until anti-wraparound kicks in, at which point we'll vacuum the entire > table and freeze everything. Just a thought. Wouldn't a All-visible bit also enable index only scans to some degree? Jesper
On 27/05/10 08:56, Jesper Krogh wrote: > Just a thought. Wouldn't a All-visible bit also enable index only scans > to some degree? Yes. In fact, that's one reason I implemented the visibility map in the first place. I started working on index-only scans based on that last year, if you search the archives for index-only scans you'll find those discussions. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> Well, maybe I'm confused here, but arranging things so that we NEVER > have to visit the page after initially writing it seems like it's > setting the bar almost impossibly high. That is the use case, though. What I've encountered so far at 3 client sites is tables which are largely append-only, with a few selects and very few updates (< 2%) on recent data. In general, once data gets flushed out of memory, it goes to disk and never gets recalled, and certainly not written. Things are hunky-dory until we reach max_freeze_age, at which point the server has to chew through hundreds of gigabytes of old data just to freeze them, sometimes bringing the application to a halt in the process. The user's perspective on this is quite reasonable: if I haven't selected these pages, and I haven't written to them, why does autovacuum need to visit them and screw up my server performance? > Consider a table that is > regularly written but append-only. Every time autovacuum kicks in, > we'll go and remove any dead tuples and then mark the pages > PD_ALL_VISIBLE and set the visibility map bits, which will cause > subsequent vacuums to ignore the all-visible portions of the table... > until anti-wraparound kicks in, at which point we'll vacuum the entire > table and freeze everything. > > If, however, we decree that you can't write a new tuple into a > PD_ALL_VISIBLE page without freezing the existing tuples, then you'll > still have the small, incremental vacuums but those are pretty cheap, That only works if those pages were going to be autovacuumed anyway. In the case outlined above (which I've seen at 3 different production sites this year), they wouldn't be; a table with less than 2% updates and deletes does not get vacuumed until max_freeze_age for any reason. For that matter, pages which are getting autovacuumed are not a problem, period; they're being read and written and freezing them is not an issue. I'm not seeing a way of fixing this common issue short of overhauling CLOG, or of creating a freeze_map. Darn. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 5/26/10 6:32 PM, Robert Haas wrote: > Hmm, yeah. Maybe we should freeze when we set PD_ALL_VISIBLE; that > might be just as good, and simpler. Assuming the visibility map is > sufficiently crash-safe/non-buggy, we could then teach VACUUM that > it's OK to advance relfrozenxid even when doing just a partial vacuum > - because any pages that were skipped must contain only frozen tuples. > Previously you've objected to proposals in this direction because > they might destroy forensic information, but maybe we should do it > anyway. It would be an improvement, and easier than the various ways of never having to visit the pages, which are all fairly intensive. Given the destruction of rollback information, though, we'd probably want a way to switch this behaviour on and off as an autovacuum setting. Does this send us down the wrong path, though? I thought we wanted to think about removing hint bits so that we could implement things like CRCs. No? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Thu, May 27, 2010 at 2:17 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 5/26/10 6:32 PM, Robert Haas wrote: >> Hmm, yeah. Maybe we should freeze when we set PD_ALL_VISIBLE; that >> might be just as good, and simpler. Assuming the visibility map is >> sufficiently crash-safe/non-buggy, we could then teach VACUUM that >> it's OK to advance relfrozenxid even when doing just a partial vacuum >> - because any pages that were skipped must contain only frozen tuples. >> Previously you've objected to proposals in this direction because >> they might destroy forensic information, but maybe we should do it >> anyway. > > It would be an improvement, and easier than the various ways of never > having to visit the pages, which are all fairly intensive. Given the > destruction of rollback information, though, we'd probably want a way to > switch this behaviour on and off as an autovacuum setting. It's not going to destroy anything that is needed for rollback unless there's a bug - PD_ALL_VISIBLE only gets set when all tuples on the page are visible to all backends. That can't happen until all transactions that wrote the page, and all others that have a lower xmin, have committed. That having been said, if making it a GUC makes people less nervous about doing it, then +1 from me. > Does this send us down the wrong path, though? I thought we wanted to > think about removing hint bits so that we could implement things like > CRCs. No? PD_ALL_VISIBLE is a page-level bit, not a tuple-level bit, and I strongly suspect it's not going anywhere. It's critical infrastructure for index-only scans, among other things. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Thu, May 27, 2010 at 2:00 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Well, maybe I'm confused here, but arranging things so that we NEVER >> have to visit the page after initially writing it seems like it's >> setting the bar almost impossibly high. > > That is the use case, though. What I've encountered so far at 3 client > sites is tables which are largely append-only, with a few selects and > very few updates (< 2%) on recent data. In general, once data gets > flushed out of memory, it goes to disk and never gets recalled, and > certainly not written. We might be able to optimize this case if the transactions are small, such that they commit before dirtying too large a fraction of shared_buffers. We could - at least in theory - teach the bgwriter or some other process to freeze them before writing them to disk the first time. But if the blocks have to be written to disk before transaction commit it seems to me we're DOA, unless we're willing to retain arbitrarily large amounts of CLOG. What might be more practical is to try to find ways to spread out the I/O so that it doesn't happen all at once in a huge ornery spike. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
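A sketch of the test that "freeze before the first write" would have to pass: every xmin on the page must belong to a transaction that has already committed and is older than the oldest xmin any live snapshot could still care about. The types are invented and the committed-check is left as a caller-supplied callback; this is illustration, not PostgreSQL source.

    #include <stdbool.h>
    #include <stdint.h>

    static bool
    freezable_before_first_write(const uint64_t *xmins, int ntuples,
                                 uint64_t oldest_running_xmin,
                                 bool (*xid_committed)(uint64_t xid))
    {
        for (int i = 0; i < ntuples; i++)
        {
            if (xmins[i] >= oldest_running_xmin)
                return false;       /* some snapshot may still need the old value */
            if (!xid_committed(xmins[i]))
                return false;       /* in-progress or aborted writer */
        }
        return true;
    }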
On 27/05/2010, at 20.00, Josh Berkus <josh@agliodbs.com> wrote: > >> Well, maybe I'm confused here, but arranging things so that we NEVER >> have to visit the page after initially writing it seems like it's >> setting the bar almost impossibly high. > > That is the use case, though. What I've encountered so far at 3 > client > sites is tables which are largely append-only, with a few selects and > very few updates (< 2%) on recent data. In general, once data gets > flushed out of memory, it goes to disk and never gets recalled, and > certainly not written. Things are hunky-dory until we reach > max_freeze_age, at which point the server has to chew through hundreds > of gigabytes of old data just to freeze them, sometimes bringing the > application to a halt in the process. The data doesn't get in there in "no time"; if autovacuum were aware of inserts too, it would incrementally freeze the table as it grows. It would still cause the data to be read in again, but not in one big chunk. Couldn't pages that are totally filled by the same transaction be frozen on the initial write? Jesper - given my limited knowledge about how it works.
Jesper Krogh <jesper@krogh.cc> wrote: > Couldn't pages that are totally filled by the same transaction, be > frozen on the initial write? As far as I'm aware, that can only be done if: (a) The tuples were written within the same transaction which created or truncated the table. *or* (b) The writing transaction and all transactions concurrent to it have completed by the time the page is about to be written. -Kevin
On Thu, May 27, 2010 at 3:15 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Jesper Krogh <jesper@krogh.cc> wrote: > >> Couldn't pages that are totally filled by the same transaction, be > >> frozen on the initial write? > > As far as I'm aware, that can only be done if: > > (a) The tuples were written within the same transaction which > created or truncated the table. > > *or* > > (b) The writing transaction and all transactions concurrent to it > have completed by the time the page is about to be written. Actually, I think this is true only in case (b). In case (a), you mess up visibility with respect to other command-IDs within the transaction. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, May 27, 2010 at 3:15 PM, Kevin Grittner >> (a) The tuples were written within the same transaction which >> created or truncated the table. > In case (a), you mess up visibility with respect to other > command-IDs within the transaction. Surely that problem is surmountable? -Kevin
On Thu, May 27, 2010 at 3:52 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, May 27, 2010 at 3:15 PM, Kevin Grittner > >>> (a) The tuples were written within the same transaction which >>> created or truncated the table. > >> In case (a), you mess up visibility with respect to other >> command-IDs within the transaction. > > Surely that problem is surmountable? I proposed an idea at PGCon, but I believe Tom and Heikki thought it was far too grotty to consider. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> wrote: > I proposed an idea at PGCon, but I believe Tom and Heikki thought > it was far too grotty to consider. Well, as an alternative -- don't we have some information about the relation pinned which could hold the xid of its creator? If the tuple is frozen check to see if your transaction is the creator and behave like you created the tuple (which, in fact, you did)? -Kevin
On 05/27/2010 12:39 PM, Robert Haas wrote: > On Thu, May 27, 2010 at 3:15 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Jesper Krogh <jesper@krogh.cc> wrote: >> >>> Couldn't pages that are totally filled by the same transaction, be >> >>> frozen on the initial write? >> >> As far as I'm aware, that can only be done if: >> >> (a) The tuples were written within the same transaction which >> created or truncated the table. >> >> *or* >> >> (b) The writing transaction and all transactions concurrent to it >> have completed by the time the page is about to be written. > > Actually, I think this is true only in case (b). In case (a), you > mess up visibility with respect to other command-IDs within the > transaction. > (a) can work if it is all in one command, CREATE TABLE AS SELECT... Additionally we were discussing COPY in the FROM clause, which means you could CREATE TABLE AS SELECT ... FROM (COPY ...). That would allow bulk loading with hint bits already set (and tuples frozen?). Joe
>>>Joe Conway <mail@joeconway.com> wrote: > (a) can work if it is all in one command, CREATE TABLE AS > SELECT... > Additionally we were discussing COPY in the FROM clause, which > means you could CREATE TABLE AS SELECT ... FROM (COPY ...). That > would allow bulk loading with hint bits already set (and tuples > frozen?). As long as it's hinted and frozen after a pg_dump -1 | psql I'll be happy. -Kevin
On 27/05/10 22:56, Robert Haas wrote: > On Thu, May 27, 2010 at 3:52 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Robert Haas<robertmhaas@gmail.com> wrote: >>> On Thu, May 27, 2010 at 3:15 PM, Kevin Grittner >> >>>> (a) The tuples were written within the same transaction which >>>> created or truncated the table. >> >>> In case (a), you mess up visibility with respect to other >>> command-IDs within the transaction. >> >> Surely that problem is surmountable? > > I proposed an idea at PGCon, but I believe Tom and Heikki thought it > was far too grotty to consider. No, I think it's surmountable too. We discussed hacks to teach the MVCC checks that all frozen tuples on a table that was created in the same transaction (i.e. the same cases where we skip WAL logging) were actually created by the running transaction, and check commandid accordingly. Or detect simple DML commands where we know that the command doesn't read the table. COPY would usually fall into that category, though non-immutable input functions make that a bit iffy. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
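A sketch of the special-case visibility rule being discussed, as hypothetical pseudocode rather than PostgreSQL's actual visibility-check logic: if the table was created (or truncated) by the current transaction and its tuples were frozen at load time, treat them as if they carried the current transaction's xid and check only the command id.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct FrozenLoadContext
    {
        bool     created_in_current_xact;  /* rel created/truncated by us     */
        uint32_t creating_cid;             /* command id that loaded the data */
    } FrozenLoadContext;

    static bool
    frozen_tuple_visible(const FrozenLoadContext *ctx, uint32_t current_cid)
    {
        if (!ctx->created_in_current_xact)
            return true;                   /* ordinary frozen tuple: visible */

        /* Same transaction: visible only to later commands, exactly as if
         * xmin were the current transaction's xid. */
        return ctx->creating_cid < current_cid;
    }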
On 28/05/10 04:00, Josh Berkus wrote: >> Consider a table that is >> regularly written but append-only. Every time autovacuum kicks in, >> we'll go and remove any dead tuples and then mark the pages >> PD_ALL_VISIBLE and set the visibility map bits, which will cause >> subsequent vacuums to ignore the all-visible portions of the table... >> until anti-wraparound kicks in, at which point we'll vacuum the entire >> table and freeze everything. >> >> If, however, we decree that you can't write a new tuple into a >> PD_ALL_VISIBLE page without freezing the existing tuples, then you'll >> still have the small, incremental vacuums but those are pretty cheap, >> > That only works if those pages were going to be autovacuumed anyway. In > the case outlined above (which I've seen at 3 different production sites > this year), they wouldn't be; a table with less than 2% updates and > deletes does not get vacuumed until max_freeze_age for any reason. For > that matter, pages which are getting autovacuumed are not a problem, > period; they're being read and written and freezing them is not an issue. > > I'm not seeing a way of fixing this common issue short of overhauling > CLOG, or of creating a freeze_map. Darn. > Don't you get a positive enough effect by adjusting the table's autovacuum_min_freeze_age and autovacuum_max_freeze_age? If you set those numbers small, it appears to me that you would get very quickly to a state where the vacuum would examine only the most recent part of the table rather than the whole thing. Does that give you enough of a win that it stops the scanning and writing of the whole table, reducing the performance problem being experienced? It's not a complete solution, but does it go some way? Regards Russell
Excerpts from Russell Smith's message of mié jun 02 06:38:35 -0400 2010: > Don't you not get a positive enough effect by adjusting the table's > autovacuum_min_freeze_age and autovacuum_max_freeze_age. If you set > those numbers small, it appears to me that you would get very quickly to > a state where the vacuum would example only the most recent part of the > table rather than the whole thing. The problem is that vacuum doesn't know that a certain part of the table is already frozen. It needs to scan it completely anyways. If we had a "frozen" map, we could mark pages that are completely frozen and thus do not need any vacuuming; but we don't (I don't recall the reasons for this. Maybe it's just that no one has gotten around to it, or maybe there's something else). -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > The problem is that vacuum doesn't know that a certain part of the table > is already frozen. It needs to scan it completely anyways. If we had a > "frozen" map, we could mark pages that are completely frozen and thus do > not need any vacuuming; but we don't (I don't recall the reasons for > this. Maybe it's just that no one has gotten around to it, or maybe > there's something else). Offhand I think the reason is that you'd have to trust the frozen bit to be 100% correct (or at least never set to 1 in error). Currently, both the FSM and visibility forks are just hints, and we won't suffer data corruption if they're wrong; so we don't get too tense about WAL logging or fsync'ing updates. I believe Heikki is looking into what it'd take to make the visibility map 100% reliable, in connection with the desire for index-only scans. If we get that and the overhead isn't too terrible maybe we could build a frozen-status map the same way. regards, tom lane
On Wed, Jun 2, 2010 at 2:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> The problem is that vacuum doesn't know that a certain part of the table >> is already frozen. It needs to scan it completely anyways. If we had a >> "frozen" map, we could mark pages that are completely frozen and thus do >> not need any vacuuming; but we don't (I don't recall the reasons for >> this. Maybe it's just that no one has gotten around to it, or maybe >> there's something else). > > Offhand I think the reason is that you'd have to trust the frozen bit > to be 100% correct (or at least never set to 1 in error). Currently, > both the FSM and visibility forks are just hints, and we won't suffer > data corruption if they're wrong; so we don't get too tense about WAL > logging or fsync'ing updates. I believe Heikki is looking into what > it'd take to make the visibility map 100% reliable, in connection with > the desire for index-only scans. If we get that and the overhead isn't > too terrible maybe we could build a frozen-status map the same way. We could, but I think we'd be better off just freezing at the time we mark the page PD_ALL_VISIBLE and then using the visibility map for both purposes. Keeping around the old xmin values after every tuple on the page is visible to every running transaction is useful only for forensics, and building a whole new freeze map just to retain that information longer (and eventually force a massive anti-wraparound vacuum) seems like overkill. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Excerpts from Robert Haas's message of mié jun 02 14:16:33 -0400 2010: > We could, but I think we'd be better off just freezing at the time we > mark the page PD_ALL_VISIBLE and then using the visibility map for > both purposes. Keeping around the old xmin values after every tuple > on the page is visible to every running transaction is useful only for > forensics, and building a whole new freeze map just to retain that > information longer (and eventually force a massive anti-wraparound > vacuum) seems like overkill. Reducing the xid wraparound horizon "a bit" is reasonable, but moving it all the way forward to OldestXmin is a bit much, methinks. Besides, there's another argument for not freezing tuples immediately: they may be updated shortly thereafter, causing extra churn for no gain. I'd prefer a setting that would tell the system to freeze all tuples that fall within a safety range whenever any tuple in the page is frozen -- weren't you working on a patch to do this? (was it Jeff Davis?) (BTW maybe instead of separate visibility and freeze maps we could have two bits in the visibility map?) -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
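The "two bits in the visibility map" idea might be laid out like the following sketch; the macro names and helper are invented for illustration. With two bits per heap page, one map byte covers four heap pages, a separate freeze fork becomes unnecessary, and vacuum can skip pages with both bits set even for wraparound purposes.

    #include <stdint.h>

    #define VM_ALL_VISIBLE          0x01
    #define VM_ALL_FROZEN           0x02
    #define VM_BITS_PER_HEAP_PAGE   2

    static inline uint8_t
    vm_get_status(const uint8_t *map, uint32_t heap_blkno)
    {
        uint32_t byteno = heap_blkno / 4;                        /* 4 heap pages per map byte */
        uint32_t shift  = (heap_blkno % 4) * VM_BITS_PER_HEAP_PAGE;

        return (uint8_t) ((map[byteno] >> shift) & (VM_ALL_VISIBLE | VM_ALL_FROZEN));
    }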
Alvaro Herrera wrote: > Excerpts from Robert Haas's message of mié jun 02 14:16:33 -0400 2010: > > > We could, but I think we'd be better off just freezing at the time we > > mark the page PD_ALL_VISIBLE and then using the visibility map for > > both purposes. Keeping around the old xmin values after every tuple > > on the page is visible to every running transaction is useful only for > > forensics, and building a whole new freeze map just to retain that > > information longer (and eventually force a massive anti-wraparound > > vacuum) seems like overkill. > > Reducing the xid wraparound horizon "a bit" is reasonable, but moving it > all the way forward to OldestXmin is a bit much, methinks. > > Besides, there's another argument for not freezing tuples immediately: > they may be updated shortly thereafter, causing extra churn for no gain. > > I'd prefer a setting that would tell the system to freeze all tuples > that fall within a safety range whenever any tuple in the page is frozen > -- weren't you working on a patch to do this? (was it Jeff Davis?) > > (BTW maybe instead of separate visibility and freeze maps we could have > two bits in the visibility map?) Yeah, the two-bits idea was suggested during the conversation core had about the issue. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
On Wed, Jun 2, 2010 at 3:10 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Excerpts from Robert Haas's message of mié jun 02 14:16:33 -0400 2010: > >> We could, but I think we'd be better off just freezing at the time we >> mark the page PD_ALL_VISIBLE and then using the visibility map for >> both purposes. Keeping around the old xmin values after every tuple >> on the page is visible to every running transaction is useful only for >> forensics, and building a whole new freeze map just to retain that >> information longer (and eventually force a massive anti-wraparound >> vacuum) seems like overkill. > > Reducing the xid wraparound horizon "a bit" is reasonable, but moving it > all the way forward to OldestXmin is a bit much, methinks. Why? If it's just for forensics, those are some pretty expensive forensics - it eventually costs you an additional complete rewrite of every page. > Besides, there's another argument for not freezing tuples immediately: > they may be updated shortly thereafter, causing extra churn for no gain. But if you were going to update PD_ALL_VISIBLE, then you were going to write the page anyway. You might as well freeze everything at the same time so you don't have to come back. Alternatively, you could do what I suggested upthread and just believe PD_ALL_VISIBLE over the individual tuple xmins. Then you don't have to freeze the page until it's next written, but you still get to keep your forensic info. > I'd prefer a setting that would tell the system to freeze all tuples > that fall within a safety range whenever any tuple in the page is frozen > -- weren't you working on a patch to do this? (was it Jeff Davis?) Not me. I don't think that's going to help a whole lot, though. In many of the painful scenarios, every tuple on the page will have the same XID, and therefore they'll all be frozen at the same time anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
So I think the scheme in the original post of this thread is workable. Not as described but could be made to work. In which case I think it's preferable to a freeze map -- which I had previously assumed we would need eventually. The problem with the scheme originally described is that it assumed you could have a cycle counter and then arrange that all the xids on the page are within that cycle. That doesn't work because you could easily have two live xids on the page that belong to two cycles -- one FirstNormalTransactionId and one MaxTransactionId. I think to make it work you need to store a whole 64-bit reference transaction id consisting of both a cycle counter and a transaction id. The invariant for the page is that every xid on the page can be compared to that reference transaction id using normal transactionid semantics. Actually I think the easiest way to do that is to set it to the oldest xid on the page. The first thing to do before comparing any transaction id on the page with a real transaction id would be to figure out whether the reference xid is comparable to the live xid, which if it's the oldest xid on the page implies they'll all be comparable. The way to maintain that invariant would be that any xid insertion on the page must advance the reference xid if it's not comparable to the newly inserted xid. It has to be advanced to the oldest xid that's still comparable with the newly inserted xid. Any xids on the page that are older than the new reference xid have to be frozen or removed. I'm not sure how to do that without keeping clog forever though.
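A sketch of the reference-xid invariant described above, folding the cycle counter and xid into one 64-bit value; all of this is illustrative rather than an existing implementation. Before a new xid is inserted, the reference may have to move forward, and the caller must first freeze or remove every tuple whose xid falls below the new reference.

    #include <stdbool.h>
    #include <stdint.h>

    #define XID_CYCLE   (UINT64_C(1) << 32)
    #define HALF_CYCLE  (XID_CYCLE / 2)

    /* Two full (cycle + xid) transaction ids are "comparable" when their low
     * 32 bits can still be ordered with ordinary wraparound arithmetic, i.e.
     * they lie within half a cycle of each other. */
    static bool
    comparable(uint64_t a, uint64_t b)
    {
        uint64_t diff = (a > b) ? a - b : b - a;

        return diff < HALF_CYCLE;
    }

    /* Return the value the page's reference xid must be advanced to so that
     * the newly inserted xid satisfies the invariant. */
    static uint64_t
    advance_reference_xid(uint64_t reference_xid, uint64_t new_xid)
    {
        if (comparable(reference_xid, new_xid))
            return reference_xid;            /* invariant already holds */

        if (new_xid < HALF_CYCLE)
            return 0;                        /* everything so far is comparable */

        return new_xid - (HALF_CYCLE - 1);   /* oldest xid still comparable to new_xid */
    }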
On 6/2/2010 2:16 PM, Robert Haas wrote: > On Wed, Jun 2, 2010 at 2:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Alvaro Herrera <alvherre@commandprompt.com> writes: >>> The problem is that vacuum doesn't know that a certain part of the table >>> is already frozen. It needs to scan it completely anyways. If we had a >>> "frozen" map, we could mark pages that are completely frozen and thus do >>> not need any vacuuming; but we don't (I don't recall the reasons for >>> this. Maybe it's just that no one has gotten around to it, or maybe >>> there's something else). >> >> Offhand I think the reason is that you'd have to trust the frozen bit >> to be 100% correct (or at least never set to 1 in error). Currently, >> both the FSM and visibility forks are just hints, and we won't suffer >> data corruption if they're wrong; so we don't get too tense about WAL >> logging or fsync'ing updates. I believe Heikki is looking into what >> it'd take to make the visibility map 100% reliable, in connection with >> the desire for index-only scans. If we get that and the overhead isn't >> too terrible maybe we could build a frozen-status map the same way. > > We could, but I think we'd be better off just freezing at the time we > mark the page PD_ALL_VISIBLE and then using the visibility map for > both purposes. Keeping around the old xmin values after every tuple > on the page is visible to every running transaction is useful only for > forensics, and building a whole new freeze map just to retain that > information longer (and eventually force a massive anti-wraparound > vacuum) seems like overkill. Agreed. The whole business of minimum freeze age always struck me as leaving bread crumbs behind. Other than forensics, what is the actual value of that overhead? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 6/2/2010 3:10 PM, Alvaro Herrera wrote: > Excerpts from Robert Haas's message of mié jun 02 14:16:33 -0400 2010: > >> We could, but I think we'd be better off just freezing at the time we >> mark the page PD_ALL_VISIBLE and then using the visibility map for >> both purposes. Keeping around the old xmin values after every tuple >> on the page is visible to every running transaction is useful only for >> forensics, and building a whole new freeze map just to retain that >> information longer (and eventually force a massive anti-wraparound >> vacuum) seems like overkill. > > Reducing the xid wraparound horizon "a bit" is reasonable, but moving it > all the way forward to OldestXmin is a bit much, methinks. Why? > > Besides, there's another argument for not freezing tuples immediately: > they may be updated shortly thereafter, causing extra churn for no gain. What extra churn does it create if the tuple can be frozen before the bgwriter ever writes the page in the first place? > > I'd prefer a setting that would tell the system to freeze all tuples > that fall within a safety range whenever any tuple in the page is frozen > -- weren't you working on a patch to do this? (was it Jeff Davis?) I just see a lot of cost caused by this "safety range". I yet have to see its real value, other than "feel good". Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Jan Wieck <JanWieck@Yahoo.com> writes: > On 6/2/2010 3:10 PM, Alvaro Herrera wrote: >> I'd prefer a setting that would tell the system to freeze all tuples >> that fall within a safety range whenever any tuple in the page is frozen >> -- weren't you working on a patch to do this? (was it Jeff Davis?) > I just see a lot of cost caused by this "safety range". I yet have to > see its real value, other than "feel good". Jan, you don't know what you're talking about. I have repeatedly had cases where being able to look at xmin was critical to understanding a bug. I *will not* hold still for a solution that effectively reduces min_freeze_age to zero. regards, tom lane
On Fri, Jun 4, 2010 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: >> On 6/2/2010 3:10 PM, Alvaro Herrera wrote: >>> I'd prefer a setting that would tell the system to freeze all tuples >>> that fall within a safety range whenever any tuple in the page is frozen >>> -- weren't you working on a patch to do this? (was it Jeff Davis?) > >> I just see a lot of cost caused by this "safety range". I yet have to >> see its real value, other than "feel good". > > Jan, you don't know what you're talking about. I have repeatedly had > cases where being able to look at xmin was critical to understanding > a bug. I *will not* hold still for a solution that effectively reduces > min_freeze_age to zero. So, we're talking in circles here. I've already proposed a method that would avoid the need to wipe out the xmins: http://archives.postgresql.org/pgsql-hackers/2010-05/msg01485.php And you said that if we were going to do that we might as well just freeze sooner: http://archives.postgresql.org/pgsql-hackers/2010-05/msg01548.php If you don't want to freeze sooner, let's go back to the method described in the first email. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: >> I just see a lot of cost caused by this "safety range". I yet >> have to see its real value, other than "feel good". > > Jan, you don't know what you're talking about. I have repeatedly > had cases where being able to look at xmin was critical to > understanding a bug. I *will not* hold still for a solution that > effectively reduces min_freeze_age to zero. In my experience with my own environment, I can honestly say that it's clear that not freezing tuples quickly adds more cost than running with cassert on. If we had to run in production with one or the other, I would definitely choose cassert from a performance perspective; which one would do more to find bugs? Why do we view them so differently? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > In my experience with my own environment, I can honestly say that > it's clear that not freezing tuples quickly adds more cost than > running with cassert on. If we had to run in production with one or > the other, I would definitely choose cassert from a performance > perspective; which one would do more to find bugs? Why do we view > them so differently? The reason for not recommending cassert in production builds is not cost but stability. Per the fine manual: Also, having the tests turned on won't necessarily enhance the stability of your server! The assertion checksare not categorized for severity, and so what might be a relatively harmless bug will still lead to serverrestarts if it triggers an assertion failure. This option is not recommended for production use, but you should have it on for development work or when running a beta version. regards, tom lane
On Fri, Jun 4, 2010 at 11:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> In my experience with my own environment, I can honestly say that >> it's clear that not freezing tuples quickly adds more cost than >> running with cassert on. If we had to run in production with one or >> the other, I would definitely choose cassert from a performance >> perspective; which one would do more to find bugs? Why do we view >> them so differently? > > The reason for not recommending cassert in production builds is not > cost but stability. Per the fine manual: > > Also, having the tests turned on won't necessarily enhance the > stability of your server! The assertion checks are not categorized > for severity, and so what might be a relatively harmless bug will > still lead to server restarts if it triggers an assertion > failure. This option is not recommended for production use, but > you should have it on for development work or when running a beta > version. We routinely castigate people for benchmarking done with cassert turned on, and tell them their numbers are meaningless. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jun 4, 2010 at 11:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The reason for not recommending cassert in production builds is not >> cost but stability. > We routinely castigate people for benchmarking done with cassert > turned on, and tell them their numbers are meaningless. I didn't say it wasn't expensive ;-). But Kevin's question seemed to be based on the assumption that runtime cost was the only negative. It wouldn't be terribly hard to make a variant of cassert that skips two or three of the most expensive things (particularly memory context checking and CLOBBER_FREED_MEMORY), and from a cost perspective that would be totally reasonable to run in production. We haven't done it because of the stability issue. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > But Kevin's question seemed to be based on the assumption that > runtime cost was the only negative. It wouldn't be terribly hard > to make a variant of cassert that skips two or three of the most > expensive things (particularly memory context checking and > CLOBBER_FREED_MEMORY), and from a cost perspective that would be > totally reasonable to run in production. We haven't done it > because of the stability issue. Fair enough. I was thinking of them both as debugging features, which set various ideas roiling around in my head. Having run hundreds of databases 24/7 for years without ever needing this information, but paying the cost for it one way or another every day, my perspective is that it would be A Good Thing if it could just be turned on when needed. If you have a recurring bug that can be arranged, but in those cases you have other options; so I'm assuming you want this kept because it is primarily of forensic value after a non-repeatable bug has munged something? Another thought bouncing around was that these breadcrumbs are expensive; I was trying to think of some other way to capture the information which would be cheaper, but I haven't thought of anything, and I'm far from certain that cheaper breadcrumbs to answer the need can be developed. The best thought I've had so far is that if someone kept WAL files long enough the evidence might be in there somewhere.... -Kevin
Kevin Grittner wrote: > Fair enough. I was thinking of them both as debugging features, > which had various ideas roiling around in my head. Having run > hundreds of databases 24/7 for years without ever needing this > information, but paying the cost for it one way or another every > day, my perspective is that it would be A Good Thing if it could > just be turned on when needed. If you have recurring bug that can > be arranged, but in those cases you have other options; so I'm > assuming you want this kept because it is primarily of forensic > value after a non-repeatable bug has munged something? > > Another thought bouncing around was that these breadcrumbs are > expensive; I was trying to think of some other way to capture the > information which would be cheaper, but I haven't thought of > anything, and I'm far from certain that cheaper breadcrumbs to > answer the need can be developed. The best thought I've had so far > is that if someone kept WAL files long enough the evidence might be > in there somewhere.... The idea that thousands of Postgres installations are slower just so we can occasionally debug xmin/xmax issues seems way off balance to me. If people want debugging, let them modify the freeze age settings; the defaults should not favor debugging when there is a measurable cost involved. How many times in the past five years have we even needed such debugging information, and in how many of those cases could we not have told the user to change freeze settings to get us that info? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Bruce Momjian <bruce@momjian.us> writes: > The idea that thousands of Postgres installations are slower just so we > can occasionally debug xmin/xmax issues seems way off balance to me. There's no evidence whatsoever that the scope of the problem is that large. > If people want debugging, let them modify the freeze age settings; the > defaults should not favor debugging when there is a measurable cost > involved. How many times in the past five years have we even needed > such debugging information, and also are cases where we could not have > told the user to change freeze settings to get us that info? You're missing the point here: this is something we need when trying to make sense of cases that are hard or impossible to reproduce. Retroactively changing the freeze policy isn't possible. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Bruce Momjian <bruce@momjian.us> writes: >> The idea that thousands of Postgres installations are slower just >> so we can occasionally debug xmin/xmax issues seems way off >> balance to me. > > There's no evidence whatsoever that the scope of the problem is > that large. Well, are we agreed that the current approach means that insertion of a heap tuple normally requires it to be written to disk three times, with two of those WAL-logged? And that deletion of a tuple generally requires the same? I'd say that constitutes prima facie evidence that any PostgreSQL installation doing any significant number of writes is slower because of this. Are you suggesting there aren't thousands of such installations, or that the repeated disk writes are generally free? -Kevin
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > The idea that thousands of Postgres installations are slower just so we > > can occasionally debug xmin/xmax issues seems way off balance to me. > > There's no evidence whatsoever that the scope of the problem is that large. > > > If people want debugging, let them modify the freeze age settings; the > > defaults should not favor debugging when there is a measurable cost > > involved. How many times in the past five years have we even needed > > such debugging information, and also are cases where we could not have > > told the user to change freeze settings to get us that info? > > You're missing the point here: this is something we need when trying > to make sense of cases that are hard or impossible to reproduce. > Retroactively changing the freeze policy isn't possible. With in-place VACUUM FULL gone in 9.0, will there be as much need for xmin/xmax forensics? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Bruce Momjian <bruce@momjian.us> writes: > With in-place VACUUM FULL gone in 9.0, will there be as much need for > xmin/xmax forensics? You know perfectly well that no one could answer that question. (Or at least not answer it on the basis of facts available today.) regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > With in-place VACUUM FULL gone in 9.0, will there be as much need for > > xmin/xmax forensics? > > You know perfectly well that no one could answer that question. > (Or at least not answer it on the basis of facts available today.) Well, guess then. In the past, how many forensic cases were needed for in-place VACUUM FULL bugs, vs. other cases? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Bruce Momjian <bruce@momjian.us> writes: > Tom Lane wrote: >> Bruce Momjian <bruce@momjian.us> writes: >>> With in-place VACUUM FULL gone in 9.0, will there be as much need for >>> xmin/xmax forensics? >> >> You know perfectly well that no one could answer that question. >> (Or at least not answer it on the basis of facts available today.) > Well, guess then. I already told you my opinion on this matter. Since you're prepared to discount that, I don't see why you'd put any credence in my evidence-free guesses. regards, tom lane
Excerpts from Bruce Momjian's message of vie jun 04 15:39:07 -0400 2010: > Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > With in-place VACUUM FULL gone in 9.0, will there be as much need for > > > xmin/xmax forensics? > > > > You know perfectly well that no one could answer that question. > > (Or at least not answer it on the basis of facts available today.) > > Well, guess then. In the past, how many forensic cases were needed for > in-place VACUUM FULL bugs, vs. other cases? I don't understand the question. I know I have debugged a bunch of cases of data corruption, and having xmin/xmax around has been truly useful. VACUUM FULL has never been involved (that I know of -- most of our customers tend not to run it AFAIK), so why would I care about whether it's gone in 9.0? Note that it's not always about PG bugs; but in the cases where xmin=FrozenXid for all/most involved tuples, the problems are more difficult to track down. Yes, VACUUM FULL had bugs too -- I, for one, welcome our new not-in-place VACUUM FULL overlord. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Excerpts from Bruce Momjian's message of vie jun 04 15:39:07 -0400 2010: > > Tom Lane wrote: > > > Bruce Momjian <bruce@momjian.us> writes: > > > > With in-place VACUUM FULL gone in 9.0, will there be as much need for > > > > xmin/xmax forensics? > > > > > > You know perfectly well that no one could answer that question. > > > (Or at least not answer it on the basis of facts available today.) > > > > Well, guess then. In the past, how many forensic cases were needed for > > in-place VACUUM FULL bugs, vs. other cases? > > I don't understand the question. I know I have debugged a bunch of > cases of data corruption, and having xmin/xmax around has been truly > useful. VACUUM FULL has never been involved (that I know of -- most of > our customers tend not to run it AFAIK), so why would I care about > whether it's gone in 9.0? Note that it's not always about PG bugs; but > in the cases where xmin=FrozenXid for all/most involved tuples, the > problems are more difficult to track down. > > Yes, VACUUM FULL had bugs too -- I, for one, welcome our new > not-in-place VACUUM FULL overlord. OK, so we had lots of forensics that didn't involve VACUUM FULL. That's what I wanted to know. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > ... my perspective is that it would be A Good Thing if it could > just be turned on when needed. If you have recurring bug that can > be arranged, but in those cases you have other options; so I'm > assuming you want this kept because it is primarily of forensic > value after a non-repeatable bug has munged something? Yeah, that's exactly the problem. When you realize you need it, it's too late. > The best thought I've had so far > is that if someone kept WAL files long enough the evidence might be > in there somewhere.... Hm, that is an excellent point. The WAL trace would actually be a lot superior in terms of being able to figure out what went wrong. But I don't quite see how we tell people "either keep xmin or keep your old WAL". Also, for production sites the amount of WAL you'd have to hang onto seems a bit daunting. Other problems are the cost of shipping it to a developer, and the impracticality of sanitizing private data in it before you show it to somebody. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> The best thought I've had so far is that if someone kept WAL >> files long enough the evidence might be in there somewhere.... > > Hm, that is an excellent point. The WAL trace would actually be a > lot superior in terms of being able to figure out what went wrong. > But I don't quite see how we tell people "either keep xmin or keep > your old WAL". Also, for production sites the amount of WAL you'd > have to hang onto seems a bit daunting. Any thoughts on how far back the WAL would need to go to deal with the issues where such information has been useful? (For example, we always have at least two weeks worth, but I don't know if that's a useful range or not.) > Other problems are the cost of shipping it to a developer, and the > impracticality of sanitizing private data in it before you show it > to somebody. Yeah, this wouldn't be a practical answer to the need unless PostgreSQL shipped with a tool which could scan WAL and extract the relevant information (probably under direction of someone from the list or a private support organization). Is the required information predictable enough to make developing such a tool a tractable problem? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hm, that is an excellent point. The WAL trace would actually be a >> lot superior in terms of being able to figure out what went wrong. >> But I don't quite see how we tell people "either keep xmin or keep >> your old WAL". Also, for production sites the amount of WAL you'd >> have to hang onto seems a bit daunting. > Any thoughts on how far back the WAL would need to go to deal with > the issues where such information has been useful? (For example, we > always have at least two weeks worth, but I don't know if that's a > useful range or not.) Well, it's a "how long does it take you to notice data corruption" kind of issue. The most recent case I can think of where xmin was helpful was in trying to sort out a problem with an index being inconsistent with the heap, which manifested as wrong query answers for the user. I don't know how long it took him to recognize and report the problem. (We never did locate the bug-if-any, IIRC... it would have been much more helpful to have the WAL trace. xmin did let me rule out some theories, though.) >> Other problems are the cost of shipping it to a developer, and the >> impracticality of sanitizing private data in it before you show it >> to somebody. > Yeah, this wouldn't be a practical answer to the need unless > PostgreSQL shipped with a tool which could scan WAL and extract the > relevant information (probably under direction of someone from the > list or a private support organization). Is the required > information predictable enough to make developing such a tool a > tractable problem? Hard to tell. If we were actually going in this direction we'd want to write a much better WAL-text-dump tool than we have, and then in principle somebody could sanitize the text output before shipping it off. But going through a large volume of data that way could be pretty impractical. Also, we (or at least I) have nearly zip experience with trying to debug problems by examining WAL, so it's not real clear to me which details might be important. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > If we were actually going in this direction we'd want to write a > much better WAL-text-dump tool than we have, and then in principle > somebody could sanitize the text output before shipping it off. I wouldn't think this would be practical unless there was a way to scan the WAL files and dump only the bits related to the affected pages, and probably with at least an option (possibly default) to just mention the data type and length, rather than showing the actual values -- how often are the values relevant, anyway? (Not rhetorical; I really don't know.) -Kevin
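For concreteness, a filter of the kind Kevin describes might look roughly like the sketch below. Every type and helper here (WalRecSummary, read_next_summary(), the field names) is invented for illustration and is not the real xlog-reader interface; the point is only that a dump restricted to record type, xid and payload length for the affected relation/block exposes structure without exposing user data.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct WalRecSummary
{
    uint32_t    xid;        /* transaction that produced the record */
    uint8_t     rmgr_id;    /* resource manager (heap, btree, ...) */
    uint8_t     info;       /* record subtype */
    uint32_t    data_len;   /* length of the record payload in bytes */
    uint32_t    rel_node;   /* relation the record touches */
    uint32_t    block_no;   /* block within that relation */
} WalRecSummary;

/* hypothetical decoder: returns false when the WAL stream is exhausted */
bool read_next_summary(WalRecSummary *rec);

void
dump_records_for_block(uint32_t rel_node, uint32_t block_no)
{
    WalRecSummary rec;

    while (read_next_summary(&rec))
    {
        if (rec.rel_node != rel_node || rec.block_no != block_no)
            continue;       /* not one of the affected pages */

        /* structure only: record type, xid and length, never the payload */
        printf("xid=%u rmgr=%u info=0x%02x len=%u\n",
               (unsigned) rec.xid, (unsigned) rec.rmgr_id,
               (unsigned) rec.info, (unsigned) rec.data_len);
    }
}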
Tom Lane wrote: > If we were actually going in this direction we'd > want to write a much better WAL-text-dump tool than we have, and then > in principle somebody could sanitize the text output before shipping > it off. But going through a large volume of data that way could be > pretty impractical. Also, we (or at least I) have nearly zip experience > with trying to debug problems by examining WAL, so it's not real clear > to me which details might be important. > There's another interesting thing about moving in this direction too. Systems that have WAL archiving setup that run into a problem could end up being a much richer source for historical analysis of how the system got into the bad state than is available right now. Typically those can have longer histories available than you'll find on a primary that's recycling segments all the time. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > If we were actually going in this direction we'd > want to write a much better WAL-text-dump tool than we have, and then > in principle somebody could sanitize the text output before shipping > it off. But going through a large volume of data that way could be > pretty impractical. Also, we (or at least I) have nearly zip experience > with trying to debug problems by examining WAL, so it's not real clear > to me which details might be important. Just an off-the-wall thought, but, would it be possible to have a tool which read WAL backwards and compared entries in the WAL against entries on disk? I realize that you'd only see one version of a particular block and then have to skip any updates which are earlier than it, but it seems like you could cover a pretty large chunk of the recent changes to the database using this approach.. Thanks, Stephen
On Tue, Jun 8, 2010 at 2:26 AM, Stephen Frost <sfrost@snowman.net> wrote: > Just an off-the-wall thought, but, would it be possible to have a tool > which read WAL backwards and compared entries in the WAL against entries > on disk? I realize that you'd only see one version of a particular > block and then have to skip any updates which are earlier than it, but > it seems like you could cover a pretty large chunk of the recent changes > to the database using this approach.. I assume you mean back out the changes incrementally until you find a full_page_write and see if it matches? And continue comparing with full_page_writes once per checkpoint? I don't think the WAL has enough information to replay backwards though. For example vacuum cleanup records just list the tids to remove. They don't have the contents to replace there. -- greg
* Greg Stark (gsstark@mit.edu) wrote: > I assume you mean back out the changes incrementally until you find a > full_page_write and see if it matches? To be honest, you're already assuming I know more about how this all works than I do. :) The gist of my thought was simply- we write out block changes to the WAL, including data in many cases. If we were to look at the very end of the WAL, at the last piece of data written there, and the data files have supposedly been flushed, then what's in the WAL at that point should match what's in the data files, right? If it doesn't, that'd be bad. > And continue comparing with > full_page_writes once per checkpoint? If we could only do it when there's a full page write, then perhaps that would work as well, but I thought we tracked them at a lower level. In any case, the idea is the same- compare what's in WAL to what's supposed to be on disk, and alarm whenever there's a clear error. > I don't think the WAL has enough > information to replay backwards though. For example vacuum cleanup > records just list the tids to remove. They don't have the contents to > replace there. Right, you couldn't actually move the database backwards in time using this tool (because we only write out new data, we don't write out what was in that block/page before the write)- that isn't the idea or intent. It would just be a tool that someone could run against a database where they've detected corruption (or, I dunno, more frequently, to perhaps catch corruption faster?), to see if the problem is a PG bug or a hardware/kernel/etc issue. In fact, if you could somehow do this against a backup that's been taken using pg_start_backup/pg_stop_backup, that would be pretty awesome. I know that if such a tool existed, I'd be happy to run it as part of my regular backup routines- I *always* have all the WALs from my last backup to my next backup (and typically farther back than that, eg: if I run full backups weekly, I'll have 4 full backups + all 4 weeks of WALs, to be able to replay back to any point in the month..). The big question that I have is- would this actually be productive? Would it actually be able to catch hardware corruption or help at all with PG bugs? Those are the things I'm not really sure about. Thanks, Stephen
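A very rough sketch of the comparison Stephen describes, assuming a hypothetical helper that can hand back the newest full-page image for a given block from a backwards WAL scan. One caveat such a tool would have to deal with: hint-bit-only changes are not WAL-logged, so a byte-for-byte comparison would need to mask those bits (and the unused hole in the page) before calling a mismatch corruption.

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

/* hypothetical: fill buf with the newest full-page image for this block,
 * found by scanning WAL backwards; returns false if none was found */
bool find_newest_fpi(const char *wal_dir, unsigned rel_node,
                     unsigned block_no, char *buf);

bool
block_matches_wal(const char *wal_dir, const char *datafile,
                  unsigned rel_node, unsigned block_no)
{
    char    wal_img[BLCKSZ];
    char    disk_img[BLCKSZ];
    FILE   *fp;
    bool    ok;

    if (!find_newest_fpi(wal_dir, rel_node, block_no, wal_img))
        return true;            /* nothing in WAL to compare against */

    fp = fopen(datafile, "rb");
    if (fp == NULL)
        return false;
    if (fseek(fp, (long) block_no * BLCKSZ, SEEK_SET) != 0 ||
        fread(disk_img, 1, BLCKSZ, fp) != BLCKSZ)
    {
        fclose(fp);
        return false;
    }
    fclose(fp);

    /* naive comparison; a real tool would mask hint bits and the page hole */
    ok = (memcmp(wal_img, disk_img, BLCKSZ) == 0);
    return ok;
}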
On 6/8/2010 8:27 AM, Greg Stark wrote: > On Tue, Jun 8, 2010 at 2:26 AM, Stephen Frost <sfrost@snowman.net> wrote: >> Just an off-the-wall thought, but, would it be possible to have a tool >> which read WAL backwards and compared entries in the WAL against entries >> on disk? I realize that you'd only see one version of a particular >> block and then have to skip any updates which are earlier than it, but >> it seems like you could cover a pretty large chunk of the recent changes >> to the database using this approach.. > > I assume you mean back out the changes incrementally until you find a > full_page_write and see if it matches? And continue comparing with > full_page_writes once per checkpoint? I don't think the WAL has enough > information to replay backwards though. For example vacuum cleanup > records just list the tids to remove. They don't have the contents to > replace there. You can't back out changes. WAL does not contain before images. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On Fri, 2010-06-04 at 10:18 -0400, Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: > > On 6/2/2010 3:10 PM, Alvaro Herrera wrote: > >> I'd prefer a setting that would tell the system to freeze all tuples > >> that fall within a safety range whenever any tuple in the page is frozen > >> -- weren't you working on a patch to do this? (was it Jeff Davis?) > > > I just see a lot of cost caused by this "safety range". I yet have to > > see its real value, other than "feel good". > > Jan, you don't know what you're talking about. I have repeatedly had > cases where being able to look at xmin was critical to understanding > a bug. I *will not* hold still for a solution that effectively reduces > min_freeze_age to zero. Recent history shows Tom's view to be the most useful one: it's useful to keep seeing the xmin. The last time we altered the way we set hint bits we caused multiple data loss bugs doing it. We will need to debug things and the WAL is always long gone (great idea though). Why not just have a separate flag for HEAP_XMIN_FROZEN, that way we can keep the xmin but also can see it is frozen? We already WAL-log certain flag settings, so why not this one also? -- Simon Riggs www.2ndQuadrant.com
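A minimal sketch of what such a flag could look like, using a stand-in tuple header instead of the real HeapTupleHeaderData. The bit value is invented; whether a spare infomask bit actually exists, or whether it would have to live in t_infomask2 or be encoded some other way, is exactly the kind of detail this glosses over.

#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

typedef struct DemoTupleHeader
{
    TransactionId xmin;         /* inserting transaction, kept for forensics */
    TransactionId xmax;
    uint16_t      infomask;     /* per-tuple status flags */
} DemoTupleHeader;

#define DEMO_XMIN_COMMITTED 0x0001  /* xmin known committed (existing hint) */
#define DEMO_XMIN_FROZEN    0x0002  /* xmin visible to all, now and forever */

/*
 * "Freeze" without destroying evidence: set the flag, leave xmin alone.
 * The traditional approach overwrites xmin with FrozenTransactionId (2),
 * which is what loses the forensic trail.
 */
static inline void
demo_freeze_tuple(DemoTupleHeader *tup)
{
    tup->infomask |= (DEMO_XMIN_COMMITTED | DEMO_XMIN_FROZEN);
    /* tup->xmin deliberately untouched */
}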
On Tue, Jun 8, 2010 at 4:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Fri, 2010-06-04 at 10:18 -0400, Tom Lane wrote: >> Jan Wieck <JanWieck@Yahoo.com> writes: >> > On 6/2/2010 3:10 PM, Alvaro Herrera wrote: >> >> I'd prefer a setting that would tell the system to freeze all tuples >> >> that fall within a safety range whenever any tuple in the page is frozen >> >> -- weren't you working on a patch to do this? (was it Jeff Davis?) >> >> > I just see a lot of cost caused by this "safety range". I yet have to >> > see its real value, other than "feel good". >> >> Jan, you don't know what you're talking about. I have repeatedly had >> cases where being able to look at xmin was critical to understanding >> a bug. I *will not* hold still for a solution that effectively reduces >> min_freeze_age to zero. > > Recent history shows Tom's view to be the most useful one: its useful to > keep seeing the xmin. The last time we altered the way we set hint bits > we caused multiple data loss bugs doing it. We will need to debug things > and the WAL is always long gone (great idea though). > > Why not just have a separate flag for HEAP_XMIN_FROZEN, that way we can > keep the xmin but also can see it is frozen? We could do that, but I think the point of this exercise is to reduce I/O - specifically, I/O caused by anti-wraparound vacuums - and I'm not clear how such a flag would help with that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, 2010-06-08 at 16:58 -0400, Robert Haas wrote: > On Tue, Jun 8, 2010 at 4:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Fri, 2010-06-04 at 10:18 -0400, Tom Lane wrote: > >> Jan Wieck <JanWieck@Yahoo.com> writes: > >> > On 6/2/2010 3:10 PM, Alvaro Herrera wrote: > >> >> I'd prefer a setting that would tell the system to freeze all tuples > >> >> that fall within a safety range whenever any tuple in the page is frozen > >> >> -- weren't you working on a patch to do this? (was it Jeff Davis?) > >> > >> > I just see a lot of cost caused by this "safety range". I yet have to > >> > see its real value, other than "feel good". > >> > >> Jan, you don't know what you're talking about. I have repeatedly had > >> cases where being able to look at xmin was critical to understanding > >> a bug. I *will not* hold still for a solution that effectively reduces > >> min_freeze_age to zero. > > > > Recent history shows Tom's view to be the most useful one: its useful to > > keep seeing the xmin. The last time we altered the way we set hint bits > > we caused multiple data loss bugs doing it. We will need to debug things > > and the WAL is always long gone (great idea though). > > > > Why not just have a separate flag for HEAP_XMIN_FROZEN, that way we can > > keep the xmin but also can see it is frozen? > > We could do that, but I think the point of this exercise is to reduce > I/O - specifically, I/O caused by anti-wraparound vacuums - and I'm > not clear how such a flag would help with that. Hmmm: You suggested a variant of this idea, so whatever reasoning was behind your suggestion would be shared here, surely? Tom has been saying we cannot freeze early because we need to keep xmins. I agree with that. This suggestion shows it is possible to freeze a tuple AND keep its xmin. So that removes the argument that we should freeze more aggressively (whenever we write the block) and can thus reduce longer term I/O costs. -- Simon Riggs www.2ndQuadrant.com
On Tue, Jun 8, 2010 at 5:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> > Why not just have a separate flag for HEAP_XMIN_FROZEN, that way we can >> > keep the xmin but also can see it is frozen? >> >> We could do that, but I think the point of this exercise is to reduce >> I/O - specifically, I/O caused by anti-wraparound vacuums - and I'm >> not clear how such a flag would help with that. > > Hmmm: You suggested a variant of this idea, so whatever reasoning was > behind your suggestion would be shared here, surely? > > Tom has been saying we cannot freeze early because we need to keep > xmins. I agree with that. This suggestion shows it is possible to freeze > a tuple AND keep its xmin. So that removes the argument that we should > freeze more aggressively (whenever we write the block) and can thus > reduce longer term I/O costs. OK, yes, I see what you're getting at now. There are two possible ways to freeze the tuples and keep the xmin: we can either rely on the PD_ALL_VISIBLE page-level bit (as I previously proposed) or we can additionally have a HEAP_XMIN_FROZEN bit as you propose here. I am not sure which way is better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, 2010-06-08 at 18:03 -0400, Robert Haas wrote: > OK, yes, I see what you're getting at now. There are two possible > ways to do freeze the tuples and keep the xmin: we can either rely on > the PD_ALL_VISIBLE page-level bit (as I previously proposed) or we can > additionally have a HEAP_XMIN_FROZEN bit as you propose here. I am > not sure which way is better. Doing it at tuple level is more flexible and allows more aggressive freezing. It also works better with existing tuple visibility code. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2010-06-08 at 18:03 -0400, Robert Haas wrote: >> OK, yes, I see what you're getting at now. There are two possible >> ways to do freeze the tuples and keep the xmin: we can either rely on >> the PD_ALL_VISIBLE page-level bit (as I previously proposed) or we can >> additionally have a HEAP_XMIN_FROZEN bit as you propose here. I am >> not sure which way is better. > Doing it at tuple level is more flexible and allows more aggressive > freezing. It also works better with existing tuple visibility code. I agree, relying on a page-level bit (or field) is unpleasant in a number of ways. But none of this accomplishes a damn thing towards the original goal, which was to avoid an extra disk write associated with freezing (not to mention an extra write for setting the transaction-committed hint bit). Setting a bit is no cheaper from that standpoint than changing the xmin field. regards, tom lane
Excerpts from Tom Lane's message of mar jun 08 18:35:00 -0400 2010: > But none of this accomplishes a damn thing towards the original goal, > which was to avoid an extra disk write associated with freezing (not > to mention an extra write for setting the transaction-committed hint > bit). Setting a bit is no cheaper from that standpoint than changing > the xmin field. ... unless the bit is outside the page itself -- so we get back to the idea of a freeze map. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
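For reference, the structure being gestured at is tiny: like the visibility map, one bit per heap page kept outside the heap, so recording "everything on this page is frozen" never dirties the heap page itself. The sketch below uses a flat in-memory byte array purely for illustration; a real map would live in its own relation fork and be read through the buffer manager, and its bits would have to be set under WAL logging before they could be trusted.

#include <stdint.h>
#include <stdbool.h>

typedef uint32_t BlockNumber;

/* one bit per heap page; map[] would really be pages in a separate fork */
static inline void
freezemap_set(uint8_t *map, BlockNumber heap_blk)
{
    map[heap_blk / 8] |= (uint8_t) (1 << (heap_blk % 8));
}

static inline bool
freezemap_test(const uint8_t *map, BlockNumber heap_blk)
{
    return (map[heap_blk / 8] & (1 << (heap_blk % 8))) != 0;
}

/*
 * An anti-wraparound VACUUM could then skip any heap page whose bit is set,
 * just as the visibility map already lets a normal VACUUM skip all-visible
 * pages.
 */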
On Tue, Jun 8, 2010 at 6:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Tue, 2010-06-08 at 18:03 -0400, Robert Haas wrote: >>> OK, yes, I see what you're getting at now. There are two possible >>> ways to do freeze the tuples and keep the xmin: we can either rely on >>> the PD_ALL_VISIBLE page-level bit (as I previously proposed) or we can >>> additionally have a HEAP_XMIN_FROZEN bit as you propose here. I am >>> not sure which way is better. > >> Doing it at tuple level is more flexible and allows more aggressive >> freezing. It also works better with existing tuple visibility code. > > I agree, relying on a page-level bit (or field) is unpleasant in a > number of ways. > > But none of this accomplishes a damn thing towards the original goal, > which was to avoid an extra disk write associated with freezing (not > to mention an extra write for setting the transaction-committed hint > bit). Setting a bit is no cheaper from that standpoint than changing > the xmin field. Except for insert-only tables, I don't believe this is true. If you freeze all tuples by the time the pages are marked all-visible, perhaps via the xmin-preserving mechanism Simon suggested, then you can use the visibility map to skip anti-wraparound vacuum as well as regular vacuum. That sounds to me like it's accomplishing something. Is it a complete solution? No. Is it better than what we have now? Yes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 8, 2010 at 6:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> But none of this accomplishes a damn thing towards the original goal, >> which was to avoid an extra disk write associated with freezing (not >> to mention an extra write for setting the transaction-committed hint >> bit). Setting a bit is no cheaper from that standpoint than changing >> the xmin field. > Except for insert-only tables, I don't believe this is true. But insert-only tables are exactly the case that Josh complained about to start with. > If you > freeze all tuples by the time the pages are marked all-visible, > perhaps via the xmin-preserving mechanism Simon suggested, then you > can use the visibility map to skip anti-wraparound vacuum as well as > regular vacuum. That sounds to me like it's accomplishing something. > Is it a complete solution? No. Is it better than what we have now? > Yes. I do like the idea of using a status bit rather than FrozenXid to mark a frozen tuple, because that eliminates the conflict between wanting to freeze aggressively for performance reasons and wanting to preserve Xids for forensic reasons. But it doesn't seem to do much for Josh's original problem. regards, tom lane
On Wed, Jun 9, 2010 at 12:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Tue, 2010-06-08 at 18:03 -0400, Robert Haas wrote: >>> OK, yes, I see what you're getting at now. There are two possible >>> ways to do freeze the tuples and keep the xmin: we can either rely on >>> the PD_ALL_VISIBLE page-level bit (as I previously proposed) or we can >>> additionally have a HEAP_XMIN_FROZEN bit as you propose here. I am >>> not sure which way is better. > >> Doing it at tuple level is more flexible and allows more aggressive >> freezing. It also works better with existing tuple visibility code. > > I agree, relying on a page-level bit (or field) is unpleasant in a > number of ways. > > But none of this accomplishes a damn thing towards the original goal, > which was to avoid an extra disk write associated with freezing (not > to mention an extra write for setting the transaction-committed hint > bit). Setting a bit is no cheaper from that standpoint than changing > the xmin field. > Could a tuple with the bit set be considered frozen already? Would we actually ever need to rewrite the xmin, even for anti-wraparound reasons? Greetings Marcin
On Tue, 2010-06-08 at 18:35 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Tue, 2010-06-08 at 18:03 -0400, Robert Haas wrote: > >> OK, yes, I see what you're getting at now. There are two possible > >> ways to do freeze the tuples and keep the xmin: we can either rely on > >> the PD_ALL_VISIBLE page-level bit (as I previously proposed) or we can > >> additionally have a HEAP_XMIN_FROZEN bit as you propose here. I am > >> not sure which way is better. > > > Doing it at tuple level is more flexible and allows more aggressive > > freezing. It also works better with existing tuple visibility code. > > I agree, relying on a page-level bit (or field) is unpleasant in a > number of ways. > > But none of this accomplishes a damn thing towards the original goal, > which was to avoid an extra disk write associated with freezing (not > to mention an extra write for setting the transaction-committed hint > bit). Setting a bit is no cheaper from that standpoint than changing > the xmin field. No, it doesn't of itself, but if you raise a complaint then we must first address the complaint as a sub-topic before we continue the main discussion around $TOPIC. My proposal removes the barrier that early freezing would overwrite xmin values, so early freezing need not have any negative effects. The general idea is to hide the "third write" (freezing) on a tuple by making it happen at the same time as another tuple's "second write" (hint bit setting). I'm happy to let that continue by the OPs. -- Simon Riggs www.2ndQuadrant.com
On Wed, Jun 9, 2010 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If you >> freeze all tuples by the time the pages are marked all-visible, >> perhaps via the xmin-preserving mechanism Simon suggested, then you >> can use the visibility map to skip anti-wraparound vacuum as well as >> regular vacuum. That sounds to me like it's accomplishing something. >> Is it a complete solution? No. Is it better than what we have now? >> Yes. > > I do like the idea of using a status bit rather than FrozenXid to mark a > frozen tuple, because that eliminates the conflict between wanting to > freeze aggressively for performance reasons and wanting to preserve Xids > for forensic reasons. But it doesn't seem to do much for Josh's > original problem. OK, I see. So maybe we add a Todo to implement that, and then keep thinking about how to fix Josh's problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
marcin mank <marcin.mank@gmail.com> writes: > Could a tuple wih the bit set be considered frozen already? Would we > actually ever need to rewrite the xmin, even for anti-wraparound > reasons? That's exactly what Simon is suggesting: if we had a tuple status flag with the semantics of "this xmin is known visible to all current and future transactions", we could consider setting that bit to be the moral equivalent of freezing the tuple. The tuple visibility tests would never actually consult clog for such an xmin and thus we'd never have to replace it with FrozenXid. But this doesn't in itself save us any work: we'd still need to treat setting that bit as a WAL-logged operation, and we'd still need to have VACUUM track the oldest not-thus-hinted xmins. What it does do is eliminate the conflict between wanting to freeze tuples aggressively for various performance reasons and wanting to preserve original xmin values for forensic reasons. I wonder how this might play into Heikki's ideas about making the visibility map trustworthy. If we WAL-logged the operation of "set all the per-tuple VISIBLE-TO-ALL bits on this page, as well as the page's bit in the visibility map", then that end of things would be trustworthy. And all the operations that have to unset the map bit are already WAL-logged. regards, tom lane
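A sketch of how a visibility test could honor such a flag; the flag value is invented and clog_says_committed() merely stands in for the usual pg_clog lookup and snapshot checks that a real test performs.

#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

#define XMIN_VISIBLE_TO_ALL 0x0004  /* invented bit: "frozen", xmin retained */

/* stand-in for the clog lookup (and snapshot tests) a real check performs */
bool clog_says_committed(TransactionId xid);

bool
xmin_passes(uint16_t infomask, TransactionId xmin)
{
    /*
     * A tuple carrying the flag never reaches clog at all, so its xmin can
     * be older than the clog truncation horizon, or even from a previous
     * xid epoch, without doing any harm: the flag is the moral equivalent
     * of xmin == FrozenXid.
     */
    if (infomask & XMIN_VISIBLE_TO_ALL)
        return true;

    return clog_says_committed(xmin);
}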
Seems I underestimated the importance of forensic breadcrumbs. On 6/9/2010 12:09 PM, Tom Lane wrote: > I do like the idea of using a status bit rather than FrozenXid to mark a > frozen tuple, because that eliminates the conflict between wanting to > freeze aggressively for performance reasons and wanting to preserve Xids > for forensic reasons. But it doesn't seem to do much for Josh's > original problem. Depends. Specifically on transaction profiles and how long the blocks linger around before being written. If you can set the all visible bit by the time the page is written the first time, what bit, including the is-frozen one, cannot be set at that time too? Maybe some analysis on the typical behavior of such a system is in order. Especially the case Josh was mentioning seems to be a typical single insert logging style application, with little else going on on that particular database. I can't reveal specifics about that particular case, but think of something like taking frequent sensor readings, that need to be kept for years for forensics in case there is a product recall some day. And even if some cases still required another page write because those frozen bits cannot be set on first write, this seems to be a win-win. We would get rid of the FrozenXid completely and shift to a bit, so we can effectively have a min_freeze_age of zero while keeping the xid's forever. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Jan Wieck <JanWieck@Yahoo.com> writes: > Depends. Specifically on transaction profiles and how long the blocks > linger around before being written. If you can set the all visible bit > by the time, the page is written the first time, what bit including the > is-frozen one cannot be set at that time too? All-visible and is-frozen would be the same bit ... > And even if some cases still required another page write because those > frozen bits cannot be set on first write, this seems to be a win-win. We > would get rid of the FrozenXid completely and shift to a bit, so we can > effectively have a min_freeze_age of zero while keeping the xid's forever. Right. I don't see any downside, other than eating another status bit per tuple, which we can afford. regards, tom lane
On Thu, Jun 3, 2010 at 11:41 AM, Greg Stark <gsstark@mit.edu> wrote: > I think to make it work you need to store a whole 64-bit reference > transaction id consisting of both a cycle counter and a transaction > id. The invariant for the page is that every xid on the page can be > compared to that reference transaction id using normal transactionid > semantics. Actually I think the easiest way to do that is to set it to > the oldest xid on the page. The first thing to do before comparing any > transaction id on the page with a real transaction id would be to > figure out whether the reference xid is comparable to the live xid, > which if it's the oldest xid on the page implies they'll all be > comparable. > > The way to maintain that invariant would be that any xid insertion on > the page must advance the reference xid if it's not comparable to the > newly inserted xid. It has to be advanced to the oldest xid that's > still comparable with the newly inserted xid. Any xids on the page > that are older than the new reference xid have to be frozen or removed. > I'm not sure how to do that without keeping clog forever though. So the more I think about this the more I think it's unavoidable that we would need to retain clog forever. I think the goal here is to be able to load data into the database and then never write the data ever again. Even if you visit the page years later after the transaction ids have wrapped around several times. In that case there's no avoiding that you'll need to know whether that transaction committed or aborted. Now we could make a bet that most transactions commit and therefore we could keep a list of aborted transactions only, which we might be able to keep "forever" in very little space if very few transactions abort. Presumably we would only use this form once the transaction was about to be truncated out of clog. I'm not too happy with the assumption that there aren't many aborts though. Someone could come along with a use case where they have lots of aborts and run into strange limitations and performance characteristics. Alternatively we could do something like keeping a list of tables touched by any transaction. Then vacuum could look for any non-committed transactions old enough to be in danger of aging out of clog and ensure those tables are frozen. But any tables which have never been touched by any such old transaction could be left alone. When we read in the page we'll be able to recognize the old transactions as committed if they're beyond the end of the clog horizon. I don't really like that idea either because it leaves performance really quite unpredictable. I could have a large table that goes unvacuumed for a long time -- then when I come along with some tiny query where I hit C-c and cause an abort I suddenly set a trap which causes a huge vacuum freeze to fire off. -- greg
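To make the quoted reference-xid scheme concrete: if the page header carried a full cycle-count-plus-xid value that is no newer than any xid on the page, and every on-page xid is within one wraparound of it, then each stored 32-bit xid can be widened back to 64 bits and compared with plain integer arithmetic. A sketch, with names and layout invented, and ignoring the special xids (Invalid, Bootstrap, Frozen) that a real implementation would have to exclude:

#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct PageRefXid
{
    uint32_t      epoch;        /* xid wraparound (cycle) count */
    TransactionId xid;          /* oldest xid on the page */
} PageRefXid;

/*
 * Widen an on-page 32-bit xid to 64 bits using the page's reference xid.
 * Relies on the page invariant described above: no on-page xid precedes the
 * reference xid, and none is more than one wraparound ahead of it.
 */
static inline uint64_t
widen_xid(PageRefXid ref, TransactionId onpage_xid)
{
    uint32_t epoch = ref.epoch;

    /*
     * A numerically smaller value can only mean the counter wrapped after
     * the reference xid was assigned, so the xid belongs to the next epoch.
     */
    if (onpage_xid < ref.xid)
        epoch++;

    return ((uint64_t) epoch << 32) | onpage_xid;
}

/* Two widened xids can then be compared with ordinary integer comparison. */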
Greg Stark wrote: > On Thu, Jun 3, 2010 at 11:41 AM, Greg Stark <gsstark@mit.edu> wrote: > > I think to make it work you need to store a whole 64-bit reference > > transaction id consisting of both a cycle counter and a transaction > > id. The invariant for the page is that every xid on the page can be > > compared to that reference transaction id using normal transactionid > > semantics. Actually I think the easiest way to do that is to set it to > > the oldest xid on the page. The first thing to do before comparing any > > transaction id on the page with a real transaction id would be to > > figure out whether the reference xid is comparable to the live xid, > > which if it's the oldest xid on the page implies they'll all be > > comparable. > > > > The way to maintain that invariant would be that any xid insertion on > > the page must advance the reference xid if it's not comparable to the > > newly inserted xid. It has to be advanced to the oldest xid that's > > still comparable with the newly inserted xid. Any xids on the page > > that are older than the new reference xid have to be frozen or removed. > > I'm not sure how to do that without keeping clog forever though. > > So the more I think about this the more I think it's unavoidable that > we would need to retain clog forever. > > I think the goal here is to be able to load data into the database and > then never write the data ever again. Even if you visit the page years > later after the transaction ids have wrapped around several times. In > that case there's no avoiding that you'll need to know whether that > transaction committed or aborted. I think we might need two bits, one committed and all visible, and another aborted and all visible. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Bruce Momjian <bruce@momjian.us> writes: > I think we might need two bits, one commited and all visible, and > another aborted and all vislble. Huh? The latter means "vacuumable". regards, tom lane