Thread: Re: 64 bit transaction id
Hi
pá 1. 11. 2019 v 10:11 odesílatel Павел Ерёмин <shnoor111gmail@yandex.ru> napsal:
> Hi.
> Sorry for my English.
> I want to once again open the topic of 64-bit transaction ids. I did not
> manage to find in the archives the option that I want to discuss, so I am
> writing. If I searched poorly, then please forgive me.
> The idea is not very original and has probably already been considered;
> again, I repeat - I did not find it. Therefore, please do not scold me
> severely.
> In the discussions of 64-bit transaction ids I did not find any mention
> of an algorithm for storing them the way it is done, for example, in
> MS SQL Server.
> What if, instead of 2 fields (xmin and xmax) with a total length of 64
> bits, we used 1 field (let's call it xid) with a length of 64 bits in the
> tuple header? This field would store the xid of the transaction that
> created the version. In that case a new transaction, in order to
> understand whether the version it has read is suitable for it or not,
> would have to read the next version as well. I.e. the downside of such a
> decision is of course an increase in I/O - transactions will have to read
> one more version. On the plus side, the tuple header remains the same
> length.
Is a 32-bit xid really a problem? Why do you need to know the state of the
last 2^31 transactions? Isn't the problem rather in too low usage (or maybe
too high overhead) of VACUUM FREEZE?

I am not sure whether increasing this range could have much more fatal
problems (maybe showing up only later)
Pavel
> regards, Eremin Pavel.
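For readers who want to picture the fields being debated here, the relevant
parts of the heap tuple header, and what the proposal would replace them
with, look roughly like this. This is a simplified paraphrase (union fields,
infomask bits and alignment are omitted), not the actual definition in
src/include/access/htup_details.h, and the "proposed" struct is just one
reading of the idea above:

#include <stdint.h>

typedef uint32_t TransactionId;     /* today: 32-bit transaction ids */
typedef uint32_t CommandId;

typedef struct ItemPointerData      /* 6-byte (block, offset) tuple address */
{
    uint16_t    bi_hi;              /* block number, high half */
    uint16_t    bi_lo;              /* block number, low half */
    uint16_t    ip_posid;           /* line pointer offset within the page */
} ItemPointerData;

/* Roughly what the current heap keeps per tuple (simplified). */
typedef struct CurrentHeapTupleHeader
{
    TransactionId   t_xmin;         /* xid that inserted this version */
    TransactionId   t_xmax;         /* xid that deleted/updated/locked it */
    CommandId       t_cid;          /* command id within that transaction */
    ItemPointerData t_ctid;         /* link to the newer version, or self */
    /* ... infomask bits, null bitmap, etc. omitted ... */
} CurrentHeapTupleHeader;

/* The proposal, as described above: one 64-bit creator xid and no xmax. */
typedef struct ProposedTupleHeader
{
    uint64_t        t_xid;          /* 64-bit xid that created this version */
    ItemPointerData t_ctid;         /* still needs a link to the next version */
    /* a reader has to fetch the next version to learn when this one "died" */
} ProposedTupleHeader;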
On Fri, Nov 01, 2019 at 10:25:17AM +0100, Pavel Stehule wrote:
>Hi
>
>pá 1. 11. 2019 v 10:11 odesílatel Павел Ерёмин <shnoor111gmail@yandex.ru>
>napsal:
>
>> Hi.
>> Sorry for my English.
>> I want to once again open the topic of 64-bit transaction ids. I did not
>> manage to find in the archives the option that I want to discuss, so I am
>> writing. If I searched poorly, then please forgive me.
>> The idea is not very original and has probably already been considered;
>> again, I repeat - I did not find it. Therefore, please do not scold me
>> severely.
>> In the discussions of 64-bit transaction ids I did not find any mention
>> of an algorithm for storing them the way it is done, for example, in
>> MS SQL Server.
>> What if, instead of 2 fields (xmin and xmax) with a total length of 64
>> bits, we used 1 field (let's call it xid) with a length of 64 bits in the
>> tuple header? This field would store the xid of the transaction that
>> created the version. In that case a new transaction, in order to
>> understand whether the version it has read is suitable for it or not,
>> would have to read the next version as well. I.e. the downside of such a
>> decision is of course an increase in I/O - transactions will have to read
>> one more version. On the plus side, the tuple header remains the same
>> length.
>>
>
>Is a 32-bit xid really a problem? Why do you need to know the state of the
>last 2^31 transactions? Isn't the problem rather in too low usage (or maybe
>too high overhead) of VACUUM FREEZE?
>

It certainly can be an issue for large and busy systems, that may need
anti-wraparound vacuum every couple of days. If that requires rewriting a
couple of TB of data, it's not particularly nice. That's why 64-bit XIDs
were discussed repeatedly in the past, and it's likely to get even more
pressing as the systems get larger.

>I am not sure whether increasing this range could have much more fatal
>problems (maybe showing up only later)
>

Well, not fatal, but naive approaches can increase per-tuple overhead. And
we already have plenty of that, hence there were proposals to use page
epochs and so on.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
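For reference, the "page epochs" family of proposals Tomas alludes to keeps
per-tuple xids at 32 bits and stores a 64-bit base per page, so the tuple
header does not grow. A very rough sketch with hypothetical names; real
proposals differ in details (separate bases for xmin and xmax, shifting the
base when a page accumulates too wide a range of xids, and so on):

#include <stdint.h>

/* Hypothetical per-page epoch: full xid = xid_base + per-tuple value. */
typedef struct PageEpoch
{
    uint64_t    xid_base;
} PageEpoch;

/* Per-tuple storage stays at 32 bits per xid, as today. */
typedef struct TupleXids
{
    uint32_t    xmin;
    uint32_t    xmax;
} TupleXids;

static inline uint64_t
full_xmin(const PageEpoch *page, const TupleXids *tup)
{
    return page->xid_base + tup->xmin;
}

static inline uint64_t
full_xmax(const PageEpoch *page, const TupleXids *tup)
{
    return page->xid_base + tup->xmax;
}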
On Sat, Nov 02, 2019 at 07:07:17PM +0300, Павел Ерёмин wrote:
>
> The proposed option does not need to change the length of either the page
> header or tuple header. Therefore, you will not need to physically change
> the data.
>

So how do you link the tuple versions together? Clearly, that has to be
stored somewhere ...

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Nov 02, 2019 at 11:35:09PM +0300, Павел Ерёмин wrote:
> The proposed option is not much different from what it is now.
> We are not trying to save some space - we will reuse the existing one. We
> just work with 64-bit transaction counters. Correct me if I'm wrong - the
> address of the next version of the row is stored in the 6-byte field
> t_cid in the tuple header - which is not attached to the current page in
> any way - and can be stored anywhere in the table. Nothing changes.

I think you mean t_ctid, not t_cid (which is a 4-byte CommandId, not any
sort of item pointer). I think this comment from htup_details.h explains
the issue:

 * ... Beware however that VACUUM might
 * erase the pointed-to (newer) tuple before erasing the pointing (older)
 * tuple.  Hence, when following a t_ctid link, it is necessary to check
 * to see if the referenced slot is empty or contains an unrelated tuple.
 * Check that the referenced tuple has XMIN equal to the referencing tuple's
 * XMAX to verify that it is actually the descendant version and not an
 * unrelated tuple stored into a slot recently freed by VACUUM.  If either
 * check fails, one may assume that there is no live descendant version.

Now, imagine you have a tuple that gets updated repeatedly (say, 3x) and
each version gets to a different page. Say, pages #1, #2, #3. And then
VACUUM happens on the "middle" page (this may happen when trying to fit a
new row onto a page to allow HOT, but it might happen even during regular
VACUUM).

So we started with 3 tuples on pages #1, #2, #3, but now we have this

  #1 - tuple exists, points to tuple on page #2
  #2 - tuple no longer exists, cleaned up by vacuum
  #3 - tuple exists

The scheme you proposed requires the existence of all the tuples in the
chain to determine visibility. When tuple #2 no longer exists, it's
impossible to decide whether the tuple on page #1 is visible or not.

This also significantly increases the amount of random I/O, pretty much by
a factor of 2, because whenever you look at a row, you also have to look at
the "next version" which may be on another page. That's pretty bad, both
for I/O and cache hit ratio. I don't think that's a reasonable trade-off
(at least compared to simply making the XIDs 64-bit).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
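The check described in that header comment boils down to something like the
following sketch. The struct and helper names here are hypothetical, for
illustration only; this is not the actual heapam code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef struct { uint32_t block; uint16_t offset; } TupleId;   /* like ctid */

typedef struct TupleHeaderStub
{
    TransactionId xmin;             /* inserter */
    TransactionId xmax;             /* deleter/updater, 0 if none */
    TupleId       ctid;             /* next version, or self if none */
} TupleHeaderStub;

/* Hypothetical accessor: fetch a tuple header by tid, NULL if slot empty. */
extern TupleHeaderStub *fetch_tuple(TupleId tid);

/*
 * Follow the update chain one step, per the htup_details.h comment quoted
 * above: the referenced slot may have been emptied or reused by VACUUM, so
 * the newer tuple only counts as the descendant if its xmin matches our
 * xmax.
 */
static TupleHeaderStub *
fetch_descendant(const TupleHeaderStub *older, TupleId self_tid)
{
    TupleHeaderStub *newer;

    if (older->xmax == 0)
        return NULL;                /* never updated */
    if (older->ctid.block == self_tid.block &&
        older->ctid.offset == self_tid.offset)
        return NULL;                /* points to itself: no newer version */

    newer = fetch_tuple(older->ctid);
    if (newer == NULL)
        return NULL;                /* slot emptied by VACUUM */
    if (newer->xmin != older->xmax)
        return NULL;                /* slot reused for an unrelated tuple */

    return newer;
}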
On Sun, Nov 03, 2019 at 02:17:15PM +0300, Павел Ерёмин wrote:
> I completely agree with all of the above. Therefore, the proposed
> mechanism may entail larger improvements (and not only to VACUUM).

I think the best thing you can do is try implementing this ...

I'm afraid the "improvements" essentially mean making various important
parts of the system much more complicated and expensive. There's a
trade-off between saving 8B per row and the additional overhead (during
vacuum etc.), and it does not seem like a winning strategy.

What started as "we can simply look at the next row version" is clearly way
more complicated and expensive. The trouble here is that it adds a
dependency between pages in the data file. That for example means that
during cleanup of a page it may be necessary to modify the other page, when
originally that would be read-only in that checkpoint interval. That's
essentially write amplification, and may significantly increase the amount
of WAL due to generating FPWs for the other page.

> I can offer the following solution.
> For VACUUM, create a hash table.
> When VACUUM, scanning the table, sees that a version (tuple1) has t_ctid
> filled in and pointing to the address of tuple2, it creates a structure
> into which it writes the address of tuple1, tuple1.xid, the length of
> tuple1 (well, and other information that is needed), and puts this
> structure into the hash table keyed by the address of tuple2.
> When VACUUM reaches tuple2, it looks up the address of tuple2 in the hash
> table - if it finds it, it evaluates the connection between the two and
> makes a decision about cleaning.

We know VACUUM is already pretty expensive, so making it even more
expensive seems pretty awful. And the proposed solution seems damn
expensive. We already do something similar for indexes - we track pointers
for removed rows, so that we can remove them from indexes. And it's damn
expensive because we don't know where in the index the tuples are - so we
have to scan the whole indexes.

This would mean we have to do the same thing for the table, because we
don't know where in the table the older versions of those rows are, because
we don't know where the other rows are. That seems mighty expensive.

Not to mention that this does nothing for page-level vacuum, which we do
when trying to fit another row on a page (e.g. for HOT). This has to be
absolutely cheap, we certainly are not going to do lookups of other pages
or looking for older versions of the row, and so on.

Being able to do visibility decisions based on the tuple alone (or possibly
page-level + tuple information) has a lot of value, and I don't think we
want to make this more complicated.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
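For concreteness, the bookkeeping described in the quoted proposal might
look roughly like this. All names are hypothetical; this is the structure
Tomas argues is too expensive, because chained versions can live on
arbitrary pages and nothing here helps single-page (HOT-style) cleanup:

#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t block; uint16_t offset; } TupleId;

/* One entry per forward t_ctid link seen so far during the VACUUM scan. */
typedef struct ChainEntry
{
    TupleId     older_tid;          /* where the pointing (older) tuple lives */
    uint64_t    older_xid;          /* its creating xid, under the proposal */
    uint32_t    older_len;          /* ... plus whatever else is needed */
} ChainEntry;

/* Hypothetical hash table keyed by the *newer* tuple's address. */
extern void chain_table_insert(TupleId newer_tid, ChainEntry entry);
extern ChainEntry *chain_table_lookup(TupleId tid);

/* Called when the scan sees a version with a forward t_ctid link. */
static void
on_tuple_with_forward_link(TupleId self, TupleId next,
                           uint64_t xid, uint32_t len)
{
    ChainEntry e = { self, xid, len };
    chain_table_insert(next, e);
}

/* Called for every tuple reached later in the scan. */
static void
on_any_tuple(TupleId self)
{
    ChainEntry *older = chain_table_lookup(self);
    if (older != NULL)
    {
        /* evaluate the pair and decide whether the older version is dead */
    }
}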
On Mon, Nov 04, 2019 at 04:39:44PM +0300, Павел Ерёмин wrote:
> And yet, if I try to implement a similar mechanism, if successful, will my
> revision be considered?
>

Why wouldn't it be considered? If you submit a patch that demonstrably
improves the behavior (in this case reduces per-tuple overhead without
causing significant issues elsewhere), we'd be crazy not to consider it.

The bar is pretty high, though, because this touches one of the core pieces
of the database.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

(I've not read the rest of this thread yet)

On 2019-11-04 16:07:23 +0100, Tomas Vondra wrote:
> On Mon, Nov 04, 2019 at 04:39:44PM +0300, Павел Ерёмин wrote:
> > And yet, if I try to implement a similar mechanism, if successful, will my
> > revision be considered?
> >
>
> Why wouldn't it be considered? If you submit a patch that demonstrably
> improves the behavior (in this case reduces per-tuple overhead without
> causing significant issues elsewhere), we'd be crazy not to consider it.

And "without causing significant issues elsewhere" unfortunately includes
continuing to allow pg_upgrade to work.

Greetings,

Andres Freund
On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:
>Hi,
>
>(I've not read the rest of this thread yet)
>
>On 2019-11-04 16:07:23 +0100, Tomas Vondra wrote:
>> On Mon, Nov 04, 2019 at 04:39:44PM +0300, Павел Ерёмин wrote:
>> > And yet, if I try to implement a similar mechanism, if successful, will my
>> > revision be considered?
>> >
>>
>> Why wouldn't it be considered? If you submit a patch that demonstrably
>> improves the behavior (in this case reduces per-tuple overhead without
>> causing significant issues elsewhere), we'd be crazy not to consider it.
>
>And "without causing significant issues elsewhere" unfortunately
>includes continuing to allow pg_upgrade to work.
>

Yeah. I suppose we could have a different AM implementing this, but maybe
that's not possible ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2019-11-04 19:39:18 +0100, Tomas Vondra wrote:
> On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:
> > And "without causing significant issues elsewhere" unfortunately
> > includes continuing to allow pg_upgrade to work.

> Yeah. I suppose we could have a different AM implementing this, but
> maybe that's not possible ...

Entirely possible. But the amount of code duplication / unnecessary
branching and the user confusion from two very similar AMs, would have to
be weighed against the benefits.

Greetings,

Andres Freund
On Mon, Nov 04, 2019 at 10:44:53AM -0800, Andres Freund wrote:
>Hi,
>
>On 2019-11-04 19:39:18 +0100, Tomas Vondra wrote:
>> On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:
>> > And "without causing significant issues elsewhere" unfortunately
>> > includes continuing to allow pg_upgrade to work.
>
>> Yeah. I suppose we could have a different AM implementing this, but
>> maybe that's not possible ...
>
>Entirely possible. But the amount of code duplication / unnecessary
>branching and the user confusion from two very similar AMs, would have
>to be weighed against the benefits.
>

Agreed. I think code complexity is part of the trade-off. IMO it's fine to
hack the existing heap AM initially, and only explore a separate AM if that
turns out to be promising.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 5, 2019 at 8:45 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Mon, Nov 04, 2019 at 10:44:53AM -0800, Andres Freund wrote:
> >On 2019-11-04 19:39:18 +0100, Tomas Vondra wrote:
> >> On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:
> >> > And "without causing significant issues elsewhere" unfortunately
> >> > includes continuing to allow pg_upgrade to work.
> >
> >> Yeah. I suppose we could have a different AM implementing this, but
> >> maybe that's not possible ...
> >
> >Entirely possible. But the amount of code duplication / unnecessary
> >branching and the user confusion from two very similar AMs, would have
> >to be weighed against the benefits.
> >
>
> Agreed. I think code complexity is part of the trade-off. IMO it's fine
> to hack the existing heap AM initially, and only explore a separate AM
> if that turns out to be promising.

I thought a bit about how to make a minimally-different-from-heap
non-freezing table AM using 64 bit xids, as a thought experiment when
trying to understand or explain to others what zheap is about. Committed
transactions are easy (you don't have to freeze fxid references from the
ancient past because they don't wrap around, so they always look old), but
how do you deal with *aborted* transactions when truncating the CLOG (given
that our current rule is "if it's before the CLOG begins, it must be
committed")? I see three possibilities:

(1) don't truncate the CLOG anymore (use 64 bit addressing and let it leak
disk forever, like we did before commit 2589735d and later work),

(2) freeze aborted transactions only, using a wraparound vacuum (and now
you have failed, if the goal was to avoid having to scan all tuples
periodically to freeze stuff, though admittedly it will require less IO to
freeze only the aborted transactions),

(3) go and remove aborted fxid references eagerly, when you roll back (this
could be done using the undo technology that we have been developing to
support zheap).

Another way to explain (3) is that this hypothetical table AM, let's call
it "yheap", takes the minimum parts of the zheap technology stack required
to get rid of vacuum-for-wraparound, without doing in-place updates or any
of that hard stuff. To make this really work you'd also have to deal with
multixacts, which also require freezing. If that all sounds too
complicated, you're back to (2), which seems a bit weak to me. Or perhaps
I'm missing something?
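To illustrate the "before the CLOG begins, it must be committed" rule
referred to above, here is a simplified, hypothetical status lookup (these
are not the actual clog.c interfaces) showing why aborted xid references
have to be dealt with before the status bits are truncated away:

#include <stdint.h>

typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactStatus;

/* Hypothetical stand-in for the commit log after truncation. */
typedef struct ClogStub
{
    uint64_t    oldest_xid_on_disk;     /* everything older was truncated */
    /* ... status bitmap for xids >= oldest_xid_on_disk ... */
} ClogStub;

static XactStatus
lookup_status(const ClogStub *clog, uint64_t xid)
{
    if (xid < clog->oldest_xid_on_disk)
    {
        /*
         * The status bits are gone.  The current heap gets away with
         * assuming "committed" because freezing guarantees that no aborted
         * xid references survive this long; a non-freezing 64-bit-xid AM
         * would need option (1), (2) or (3) from the mail above to keep
         * this assumption valid.
         */
        return XACT_COMMITTED;
    }
    /* otherwise consult the on-disk status bitmap (omitted here) */
    return XACT_IN_PROGRESS;
}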
On Tue, Nov 5, 2019 at 09:34:52AM +1300, Thomas Munro wrote:
> On Tue, Nov 5, 2019 at 8:45 AM Tomas Vondra
> > Agreed. I think code complexity is part of the trade-off. IMO it's fine
> > to hack the existing heap AM initially, and only explore a separate AM
> > if that turns out to be promising.
>
> I thought a bit about how to make a minimally-different-from-heap
> non-freezing table AM using 64 bit xids, as a thought experiment when
> trying to understand or explain to others what zheap is about. Committed
> transactions are easy (you don't have to freeze fxid references from the
> ancient past because they don't wrap around, so they always look old),
> but how do you deal with *aborted* transactions when truncating the CLOG
> (given that our current rule is "if it's before the CLOG begins, it must
> be committed")? I see three possibilities:
>
> (1) don't truncate the CLOG anymore (use 64 bit addressing and let it
> leak disk forever, like we did before commit 2589735d and later work),
>
> (2) freeze aborted transactions only, using a wraparound vacuum (and now
> you have failed, if the goal was to avoid having to scan all tuples
> periodically to freeze stuff, though admittedly it will require less IO
> to freeze only the aborted transactions),
>
> (3) go and remove aborted fxid references eagerly, when you roll back
> (this could be done using the undo technology that we have been
> developing to support zheap).
>
> Another way to explain (3) is that this hypothetical table AM, let's call
> it "yheap", takes the minimum parts of the zheap technology stack
> required to get rid of vacuum-for-wraparound, without doing in-place
> updates or any of that hard stuff. To make this really work you'd also
> have to deal with multixacts, which also require freezing. If that all
> sounds too complicated, you're back to (2), which seems a bit weak to me.
> Or perhaps I'm missing something?

The above is a very good summary of the constraints that have led to our
current handling of XID wraparound. If we are concerned about excessive
vacuum freeze overhead, why is the default autovacuum_freeze_max_age =
200000000 so low? That causes freezing when the pg_xact directory holds 200
million xids, or 50 megabytes of xid status?

As far as I understand it, we cause the database to stop writes when the
xid counter gets within 2 billion xids of the current transaction counter,
so 200 million is only 1/10th of the way to that limit, and even then, I am
not sure why we couldn't make it stop writes at 3 billion or something. My
point is that increasing the default autovacuum_freeze_max_age value seems
like an easy way to reduce vacuum freeze overhead. (While the visibility
map helps vacuum freeze avoid reading all heap pages, we still need to read
all index pages.)

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
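As a sanity check of the numbers above: pg_xact stores two status bits per
transaction (four transactions per byte), so the default
autovacuum_freeze_max_age of 200 million xids does correspond to roughly
50 MB of commit-status data. A trivial check:

#include <stdio.h>

int main(void)
{
    long long xids = 200000000;     /* default autovacuum_freeze_max_age */
    long long bytes = xids / 4;     /* 2 bits of commit status per xid */

    printf("%lld xids -> %lld bytes (~%.0f MB) of pg_xact\n",
           xids, bytes, bytes / 1000000.0);
    return 0;
}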
On Thu, Nov 7, 2019 at 10:28 AM Bruce Momjian <bruce@momjian.us> wrote:
> The above is a very good summary of the constraints that have led to our
> current handling of XID wraparound. If we are concerned about excessive
> vacuum freeze overhead, why is the default autovacuum_freeze_max_age =
> 200000000 so low? That causes freezing when the pg_xact directory holds
> 200 million xids, or 50 megabytes of xid status?
>
> As far as I understand it, we cause the database to stop writes when the
> xid counter gets within 2 billion xids of the current transaction
> counter, so 200 million is only 1/10th of the way to that limit, and even
> then, I am not sure why we couldn't make it stop writes at 3 billion or
> something. My point is that increasing the default
> autovacuum_freeze_max_age value seems like an easy way to reduce vacuum
> freeze overhead. (While the visibility map helps vacuum freeze avoid
> reading all heap pages, we still need to read all index pages.)

Yeah, I've also wondered why this isn't higher by default, but it's a
somewhat tricky topic.

Three billion won't work, because it's deeply baked into PostgreSQL's
architecture that at most two billion XIDs are in use at one time. For
comparison purposes, the four billion XIDs form a ring, so that from the
perspective of any individual XID, half of the XIDs are in the future and
the other half are in the past. If three billion XIDs were in use
simultaneously, say starting with XID 5 and ending with XID 3,000,000,004,
then XID 5 would see XID 3,000,000,004 as being in the past rather than the
future, while XID 1,500,000,000 would (correctly) see XID 5 as in the past
and XID 3,000,000,004 as in the future. So XID comparison would not be
transitive, which would break a lot of code. Allowing at most two billion
XIDs to be in use at one time fixes this problem.

That doesn't mean we couldn't raise the setting. It just means that the
hard limit is two billion, not four billion. But, how high should we raise
it? The highest safe value depends on how many XIDs you'll burn while the
freezing vacuums are running, which depends on both the size of the
database and the rate of XID consumption, and those values can be very
different on different systems. I think most people could get by with a
significantly higher value, but even with the current value I think there
are probably some people who run out of XIDs, at which point they can no
longer write to the database. The higher we make the default, the more
people are going to have that problem.

It's true that a lot of people only hit the limit because something has
gone wrong, like they've forgotten about a prepared transaction or an
unused replication slot, but still, on high-velocity systems you can't
afford to cut it too close because you're still going to be burning through
XIDs while vacuum is running.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
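The ring-style comparison Robert describes is implemented with modular
32-bit arithmetic; the logic is along these lines (a simplified rendering
of TransactionIdPrecedes() in src/backend/access/transam/transam.c, using
portable typedefs rather than the real ones):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId   ((TransactionId) 3)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

/*
 * Is id1 logically older than id2?  Normal XIDs compare modulo 2^32: the
 * signed 32-bit difference is negative exactly when id1 lies in the
 * 2-billion-XID half of the ring that precedes id2.  This only gives a
 * consistent ordering if no more than ~2 billion XIDs are in use at once,
 * which is the constraint Robert explains above.
 */
static bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
    int32_t     diff;

    /* Special XIDs (Invalid/Bootstrap/Frozen) always sort first. */
    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return (id1 < id2);

    diff = (int32_t) (id1 - id2);
    return (diff < 0);
}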