Thread: Re: 64 bit transaction id

Re: 64 bit transaction id

From: Pavel Stehule

Hi

On Fri, Nov 1, 2019 at 10:11, Павел Ерёмин <shnoor111gmail@yandex.ru> wrote:
Hi.
Sorry for my English.
I want to once again open the topic of 64-bit transaction ids. I did not
manage to find in the archives the option that I want to discuss, so I am
writing; if I searched poorly, then please forgive me.
The idea is not very original and has probably already been considered;
again, I repeat, I did not find it, so please do not scold me too severely.
In the discussions of 64-bit transaction ids I did not find any mention of
an algorithm for storing them the way it is done, for example, in MS SQL
Server.
What if, instead of two fields (xmin and xmax) with a total length of 64
bits, we used a single 64-bit field (let's call it xid) in the tuple
header, storing the xid of the transaction that created the version? In
that case a new transaction, in order to understand whether the version it
has read is suitable for it or not, would have to read the next version as
well. I.e. the downside of such a decision is of course an increase in
I/O: transactions would have to read one extra version. On the plus side,
the tuple header stays the same length.
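
A rough illustration of what I mean (hypothetical names and helpers, not
actual PostgreSQL structures):

    /* Hypothetical sketch of the single-xid tuple header; all names are
     * made up for illustration. */
    typedef struct TupleVersionHeader
    {
        uint64          xid;    /* 64-bit xid of the creating transaction */
        ItemPointerData next;   /* link to the next (newer) version, if any */
        /* ... other header fields unchanged ... */
    } TupleVersionHeader;

    /*
     * A version is visible to a snapshot if its creating transaction is
     * visible and the creating transaction of the next version (if any)
     * is not -- the next version's xid plays the role xmax plays today,
     * at the cost of reading one extra version.
     * XidVisibleInSnapshot64() and FetchVersion() are hypothetical.
     */
    static bool
    VersionIsVisible(TupleVersionHeader *tup, Snapshot snapshot)
    {
        if (!XidVisibleInSnapshot64(tup->xid, snapshot))
            return false;
        if (ItemPointerIsValid(&tup->next))
        {
            TupleVersionHeader *next = FetchVersion(&tup->next); /* extra I/O */

            if (next != NULL && XidVisibleInSnapshot64(next->xid, snapshot))
                return false;   /* superseded by a visible newer version */
        }
        return true;
    }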

Is a 32-bit xid really a problem? Why do you need to know the state of the last 2^31 transactions? Isn't the real problem too low usage (or maybe too high overhead) of VACUUM FREEZE?

I am not sure that increasing this range would not have much more fatal problems (maybe later).

Pavel


regards, Eremin Pavel.

Re: 64 bit transaction id

From: Tomas Vondra

On Fri, Nov 01, 2019 at 10:25:17AM +0100, Pavel Stehule wrote:
>Hi
>
>On Fri, Nov 1, 2019 at 10:11, Павел Ерёмин <shnoor111gmail@yandex.ru>
>wrote:
>
>> Hi.
>> Sorry for my English.
>> I want to once again open the topic of 64-bit transaction ids. I did not
>> manage to find in the archives the option that I want to discuss, so I am
>> writing; if I searched poorly, then please forgive me.
>> The idea is not very original and has probably already been considered;
>> again, I repeat, I did not find it, so please do not scold me too severely.
>> In the discussions of 64-bit transaction ids I did not find any mention of
>> an algorithm for storing them the way it is done, for example, in MS SQL
>> Server.
>> What if, instead of two fields (xmin and xmax) with a total length of 64
>> bits, we used a single 64-bit field (let's call it xid) in the tuple
>> header, storing the xid of the transaction that created the version? In
>> that case a new transaction, in order to understand whether the version it
>> has read is suitable for it or not, would have to read the next version as
>> well. I.e. the downside of such a decision is of course an increase in
>> I/O: transactions would have to read one extra version. On the plus side,
>> the tuple header stays the same length.
>>
>
>Is a 32-bit xid really a problem? Why do you need to know the state of the
>last 2^31 transactions? Isn't the real problem too low usage (or maybe too
>high overhead) of VACUUM FREEZE?
>

It certainly can be an issue for large and busy systems that may need an
anti-wraparound vacuum every couple of days. If that requires rewriting a
couple of TB of data, it's not particularly nice. That's why 64-bit XIDs
were discussed repeatedly in the past, and it's likely to get even more
pressing as the systems get larger.

>I am not sure that increasing this range would not have much more fatal
>problems (maybe later).
>

Well, not fatal, but naive approaches can increase per-tuple overhead.
And we already have plenty of that, hence there were proposals to use
page epochs and so on.
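
To illustrate the page-epoch idea: keep one 64-bit base per heap page and
treat the 32-bit per-tuple XIDs as offsets from it. A hypothetical sketch
(not any of the actual proposals; assumes PostgreSQL's uint64 and
TransactionId typedefs):

    /* Each heap page stores a 64-bit xid base; per-tuple xmin/xmax stay
     * 32-bit and are interpreted as offsets from that base. */
    typedef struct PageXidBase
    {
        uint64      pd_xid_base;    /* hypothetical per-page field */
    } PageXidBase;

    static inline uint64
    TupleFullXid(const PageXidBase *base, TransactionId xid32)
    {
        /* reconstruct the full 64-bit xid: page base + 32-bit offset */
        return base->pd_xid_base + (uint64) xid32;
    }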


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: 64 bit transaction id

From: Tomas Vondra

On Sat, Nov 02, 2019 at 07:07:17PM +0300, Павел Ерёмин wrote:
>   The proposed option does not need to change the length of either the page
>   header or the tuple header. Therefore, you would not need to physically
>   change the data.

So how do you link the tuple versions together? Clearly, that has to be
stored somewhere ...


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: 64 bit transaction id

From: Tomas Vondra

On Sat, Nov 02, 2019 at 11:35:09PM +0300, Павел Ерёмин wrote:
>   The proposed option is not much different from what it is now.
>   We are not trying to save some space - we will reuse the existing space.
>   We just work with 64-bit transaction counters. Correct me if I'm wrong -
>   the address of the next version of the row is stored in the 6-byte field
>   t_cid in the tuple header - which is not tied to the current page in any
>   way - and can point anywhere in the table. Nothing changes.

I think you mean t_ctid, not t_cid (which is a 4-byte CommandId, not any
sort of item pointer).

I think this comment from htup_details.h explains the issue:

 * ... Beware however that VACUUM might
 * erase the pointed-to (newer) tuple before erasing the pointing (older)
 * tuple.  Hence, when following a t_ctid link, it is necessary to check
 * to see if the referenced slot is empty or contains an unrelated tuple.
 * Check that the referenced tuple has XMIN equal to the referencing tuple's
 * XMAX to verify that it is actually the descendant version and not an
 * unrelated tuple stored into a slot recently freed by VACUUM.  If either
 * check fails, one may assume that there is no live descendant version.
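
In code terms, that check is roughly this (a sketch using the macros from
htup_details.h; fetching and locking the referenced slot is omitted):

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "access/transam.h"

    /* Is "newtup", found at oldtup->t_ctid, really the descendant version
     * of "oldtup", or an unrelated tuple in a slot freed by VACUUM and
     * since reused? */
    static bool
    IsDescendantVersion(HeapTupleHeader oldtup, HeapTupleHeader newtup)
    {
        return TransactionIdEquals(HeapTupleHeaderGetXmin(newtup),
                                   HeapTupleHeaderGetUpdateXid(oldtup));
    }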

Now, imagine you have a tuple that gets updated repeatedly (say, 3x), and
each version ends up on a different page. Say, pages #1, #2, #3. And then
VACUUM happens on the "middle" page (this may happen when trying to fit a
new row onto a page to allow HOT, but it might happen even during regular
VACUUM).

So we started with 3 tuples on pages #1, #2, #3, but now we have this

  #1 - tuple exists, points to tuple on page #2
  #2 - tuple no longer exists, cleaned up by vacuum
  #3 - tuple exists

The scheme you proposed requires the existence of all the tuples in the
chain to determine visibility. When tuple #2 no longer exists, it's
impossible to decide whether the tuple on page #1 is visible or not.

This also significantly increases the amount of random I/O, pretty much
by a factor of 2, because whenever you look at a row, you also have to
look at the "next version", which may be on another page. That's pretty
bad, both for I/O and for the cache hit ratio. I don't think that's a
reasonable trade-off (at least compared to simply making the XIDs 64-bit).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: 64 bit transaction id

From: Tomas Vondra

On Sun, Nov 03, 2019 at 02:17:15PM +0300, Павел Ерёмин wrote:
>   I completely agree with all of the above. Therefore, the proposed
>   mechanism may entail larger improvements (and not only VACUUM).

I think the best thing you can do is try implementing this ...

I'm afraid the "improvements" essentially mean making various important
parts of the system much more complicated and expensive. There's a
trade-off between saving 8B per row and additional overhead (during
vacuum etc.), and it does not seem like a winning strategy. What started
as "we can simply look at the next row version" is clearly way more
complicated and expensive.

The trouble here is that it adds a dependency between pages in the data
file. That for example means that during cleanup of a page it may be
necessary to modify another page that would otherwise have remained
read-only in that checkpoint interval. That's essentially write
amplification, and it may significantly increase the amount of WAL by
generating an FPW for the other page.

>   I can offer the following solution.
>   For VACUUM, create a hash table.
>   When VACUUM, while scanning the table, sees that a version (tuple1) has
>   t_ctid filled in, referring to the address of tuple2, it creates a
>   structure into which it writes the address of tuple1, tuple1.xid, the
>   length of tuple1 (and other information that is needed), and puts this
>   structure into the hash table keyed by tuple2's address.
>   When VACUUM reaches tuple2, it looks up tuple2's address in the hash
>   table - if it finds it, it evaluates the relationship between them and
>   makes a decision about cleanup.
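
For concreteness, I read that as building something like the following
with the backend's dynahash; the entry layout is my guess and the names
are hypothetical:

    #include "postgres.h"
    #include "storage/itemptr.h"
    #include "utils/hsearch.h"

    /* Hypothetical bookkeeping entry for one t_ctid link. */
    typedef struct ChainEntry
    {
        ItemPointerData newer_tid;   /* hash key: address of tuple2 */
        ItemPointerData older_tid;   /* address of tuple1 */
        uint64          older_xid;   /* tuple1's 64-bit xid */
    } ChainEntry;

    static HTAB *
    create_chain_map(void)
    {
        HASHCTL     ctl;

        memset(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(ItemPointerData);
        ctl.entrysize = sizeof(ChainEntry);
        return hash_create("vacuum chain map", 1024, &ctl,
                           HASH_ELEM | HASH_BLOBS);
    }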

We know VACUUM is already pretty expensive, so making it even more
expensive seems pretty awful, and the proposed solution seems damn
expensive. We already do something similar for indexes - we track
pointers of removed rows so that we can remove them from the indexes.
And it's damn expensive, because we don't know where in the index the
tuples are - so we have to scan the indexes in full.

This would mean we have to do the same thing for the table, because we
don't know where in the table the older versions of those rows are. That
seems mighty expensive.

Not to mention that this does nothing for page-level vacuum, which we
do when trying to fit another row onto a page (e.g. for HOT). That has to
be absolutely cheap; we certainly are not going to do lookups of other
pages or go looking for older versions of the row, and so on.

Being able to make visibility decisions based on the tuple alone (or
possibly page-level + tuple information) has a lot of value, and I don't
think we want to make this more complicated.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: 64 bit transaction id

From: Tomas Vondra

On Mon, Nov 04, 2019 at 04:39:44PM +0300, Павел Ерёмин wrote:
>   And yet, if I try to implement a similar mechanism, and if it is
>   successful, will my revision be considered?

Why wouldn't it be considered? If you submit a patch that demonstrably
improves the behavior (in this case reduces per-tuple overhead without
causing significant issues elsewhere), we'd be crazy not to consider it.

The bar is pretty high, though, because this touches one of the core
pieces of the database.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: 64 bit transaction id

From: Andres Freund

Hi,

(I've not read the rest of this thread yet)

On 2019-11-04 16:07:23 +0100, Tomas Vondra wrote:
> On Mon, Nov 04, 2019 at 04:39:44PM +0300, Павел Ерёмин wrote:
> >   And yet, if I try to implement a similar mechanism, and if it is
> >   successful, will my revision be considered?
> 
> Why wouldn't it be considered? If you submit a patch that demonstrably
> improves the behavior (in this case reduces per-tuple overhead without
> causing significant issues elsewhere), we'd be crazy not to consider it.

And "without causing significant issues elsewhere" unfortunately
includes continuing to allow pg_upgrade to work.

Greetings,

Andres Freund



Re: 64 bit transaction id

From: Tomas Vondra

On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:
>Hi,
>
>(I've not read the rest of this thread yet)
>
>On 2019-11-04 16:07:23 +0100, Tomas Vondra wrote:
>> On Mon, Nov 04, 2019 at 04:39:44PM +0300, Павел Ерёмин wrote:
>> >   And yet, if I try to implement a similar mechanism, and if it is
>> >   successful, will my revision be considered?
>>
>> Why wouldn't it be considered? If you submit a patch that demonstrably
>> improves the behavior (in this case reduces per-tuple overhead without
>> causing significant issues elsewhere), we'd be crazy not to consider it.
>
>And "without causing significant issues elsewhere" unfortunately
>includes continuing to allow pg_upgrade to work.
>

Yeah. I suppose we could have a different AM implementing this, but
maybe that's not possible ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: 64 bit transaction id

From: Andres Freund

Hi,

On 2019-11-04 19:39:18 +0100, Tomas Vondra wrote:
> On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:
> > And "without causing significant issues elsewhere" unfortunately
> > includes continuing to allow pg_upgrade to work.

> Yeah. I suppose we could have a different AM implementing this, but
> maybe that's not possible ...

Entirely possible. But the amount of code duplication / unnecessary
branching, and the user confusion from two very similar AMs, would have
to be weighed against the benefits.

Greetings,

Andres Freund



Re: 64 bit transaction id

From: Tomas Vondra

On Mon, Nov 04, 2019 at 10:44:53AM -0800, Andres Freund wrote:
>Hi,
>
>On 2019-11-04 19:39:18 +0100, Tomas Vondra wrote:
>> On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:
>> > And "without causing significant issues elsewhere" unfortunately
>> > includes continuing to allow pg_upgrade to work.
>
>> Yeah. I suppose we could have a different AM implementing this, but
>> maybe that's not possible ...
>
>Entirely possible. But the amount of code duplication / unnecessary
>branching, and the user confusion from two very similar AMs, would have
>to be weighed against the benefits.
>

Agreed. I think code complexity is part of the trade-off. IMO it's fine
to hack the existing heap AM initially, and only explore the separate AM
if that turns out to be promising.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: 64 bit transaction id

From: Thomas Munro

On Tue, Nov 5, 2019 at 8:45 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Mon, Nov 04, 2019 at 10:44:53AM -0800, Andres Freund wrote:
> >On 2019-11-04 19:39:18 +0100, Tomas Vondra wrote:
> >> On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:
> >> > And "without causing significant issues elsewhere" unfortunately
> >> > includes continuing to allow pg_upgrade to work.
> >
> >> Yeah. I suppose we could have a different AM implementing this, but
> >> maybe that's not possible ...
> >
> >Entirely possible. But the amount of code duplication / unnecessary
> >branching, and the user confusion from two very similar AMs, would have
> >to be weighed against the benefits.
> >
>
> Agreed. I think code complexity is part of the trade-off. IMO it's fine
> to hack the existing heap AM initially, and only explore the separate AM
> if that turns out to be promising.

I thought a bit about how to make a minimally-different-from-heap
non-freezing table AM using 64 bit xids, as a thought experiment when
trying to understand or explain to others what zheap is about.
Committed transactions are easy (you don't have to freeze fxid
references from the ancient past because they don't wrap around so
they always look old), but how do you deal with *aborted* transactions
when truncating the CLOG (given that our current rule is "if it's
before the CLOG begins, it must be committed")?  I see three
possibilities: (1) don't truncate the CLOG anymore (use 64 bit
addressing and let it leak disk forever, like we did before commit
2589735d and later work), (2) freeze aborted transactions only, using
a wraparound vacuum (and now you have failed, if the goal was to avoid
having to scan all tuples periodically to freeze stuff, though
admittedly it will require less IO to freeze only the aborted
transactions), (3) go and remove aborted fxid references eagerly, when
you roll back (this could be done using the undo technology that we
have been developing to support zheap).  Another way to explain (3) is
that this hypothetical table AM, let's call it "yheap", takes the
minimum parts of the zheap technology stack required to get rid of
vacuum-for-wraparound, without doing in-place updates or any of that
hard stuff.  To make this really work you'd also have to deal with
multixacts, which also require freezing.  If that all sounds too
complicated, you're back to (2) which seems a bit weak to me.  Or
perhaps I'm missing something?
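
To spell out the invariant I mean, the status lookup would have to assume
something like this once the clog is truncated (a sketch; clog_lookup()
and the parameter names are hypothetical, the real code lives in clog.c
and transam.c):

    static XidStatus
    FxidGetStatus(FullTransactionId fxid, FullTransactionId oldest_clog_fxid)
    {
        if (FullTransactionIdPrecedes(fxid, oldest_clog_fxid))
            return TRANSACTION_STATUS_COMMITTED;    /* truncated range: safe
                                                     * only if no aborted
                                                     * fxids survive there */
        return clog_lookup(fxid);                   /* hypothetical helper */
    }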



Re: 64 bit transaction id

From: Bruce Momjian

On Tue, Nov  5, 2019 at 09:34:52AM +1300, Thomas Munro wrote:
> On Tue, Nov 5, 2019 at 8:45 AM Tomas Vondra
> > Agreed. I think code complexity is part of the trade-off. IMO it's fine
> > to hack the existing heap AM initially, and only explore the separate AM
> > if that turns out to be promising.
> 
> I thought a bit about how to make a minimally-different-from-heap
> non-freezing table AM using 64 bit xids, as a thought experiment when
> trying to understand or explain to others what zheap is about.
> Committed transactions are easy (you don't have to freeze fxid
> references from the ancient past because they don't wrap around so
> they always look old), but how do you deal with *aborted* transactions
> when truncating the CLOG (given that our current rule is "if it's
> before the CLOG begins, it must be committed")?  I see three
> possibilities: (1) don't truncate the CLOG anymore (use 64 bit
> addressing and let it leak disk forever, like we did before commit
> 2589735d and later work), (2) freeze aborted transactions only, using
> a wraparound vacuum (and now you have failed, if the goal was to avoid
> having to scan all tuples periodically to freeze stuff, though
> admittedly it will require less IO to freeze only the aborted
> transactions), (3) go and remove aborted fxid references eagerly, when
> you roll back (this could be done using the undo technology that we
> have been developing to support zheap).  Another way to explain (3) is
> that this hypothetical table AM, let's call it "yheap", takes the
> minimum parts of the zheap technology stack required to get rid of
> vacuum-for-wraparound, without doing in-place updates or any of that
> hard stuff.  To make this really work you'd also have to deal with
> multixacts, which also require freezing.  If that all sounds too
> complicated, you're back to (2) which seems a bit weak to me.  Or
> perhaps I'm missing something?

The above is a very good summary of the constraints that have led to our
current handling of XID wraparound.  If we are concerned about excessive
vacuum freeze overhead, why is the default autovacuum_freeze_max_age =
200000000 so low?  That causes freezing when the pg_xact directory holds
200 million xids, i.e. 50 megabytes of xid status.
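
(The arithmetic: pg_xact stores 2 status bits per xid, so 200,000,000
xids * 2 bits / 8 bits-per-byte = 50,000,000 bytes, i.e. 50 megabytes.)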

As far as I understand it, we cause the database to stop writes when the
xid counter gets within 2 billion xids of the current transaction
counter, so 200 million is only 1/10th of that limit, and even then, I
am not sure why we couldn't make it stop writes at 3 billion or
something.  My point is that increasing the default
autovacuum_freeze_max_age value seems like an easy way to reduce vacuum
freeze.  (While the visibility map helps vacuum freeze avoid reading all
heap pages, we still need to read all index pages.)
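
For example, something like this in postgresql.conf (illustrative values
only, not a tested recommendation):

    # allow ~1 billion xids before anti-wraparound autovacuum kicks in,
    # up from the default of 200 million
    autovacuum_freeze_max_age = 1000000000
    # let regular vacuums do aggressive freezing somewhat earlier
    vacuum_freeze_table_age = 800000000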

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: 64 bit transaction id

From: Robert Haas

On Thu, Nov 7, 2019 at 10:28 AM Bruce Momjian <bruce@momjian.us> wrote:
> The above is a very good summary of the constraints that have led to our
> current handling of XID wraparound.  If we are concerned about excessive
> vacuum freeze overhead, why is the default autovacuum_freeze_max_age =
> 200000000 so low?  That causes freezing when the pg_xact directory holds
> 200 million xids, i.e. 50 megabytes of xid status.
>
> As far as I understand it, we cause the database to stop writes when the
> xid counter gets within 2 billion xids of the current transaction
> counter, so 200 million is only 1/10th of that limit, and even then, I
> am not sure why we couldn't make it stop writes at 3 billion or
> something.  My point is that increasing the default
> autovacuum_freeze_max_age value seems like an easy way to reduce vacuum
> freeze.  (While the visibility map helps vacuum freeze avoid reading all
> heap pages, we still need to read all index pages.)

Yeah, I've also wondered why this isn't higher by default, but it's a
somewhat tricky topic.

Three billion won't work, because it's deeply baked into PostgreSQL's
architecture that at most two billion XIDs are used at one time. For
comparison purposes, the four billion XIDs form a ring, so that from
the perspective of any individual XID, half of the XIDs are in the
future and the other half are in the past. If three billion XIDs were
in use simultaneously, say starting with XID 5 and ending with XID
3,000,000,004, then XID 5 would see XID 3,000,000,004 as being the
past rather than the future, while XID 1,500,000,000 would (correctly)
see XID 5 as in the past and XID 3,000,000,004 as in the future. So
XID comparison would not be transitive, which would break a lot of
code. Allowing at most two billion XIDs to be in use at one time fixes
this problem.
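
The comparison itself is just a signed 32-bit subtraction, something like
the following (simplified from TransactionIdPrecedes in access/transam.c,
which also special-cases the permanent XIDs):

    static bool
    xid_precedes(TransactionId id1, TransactionId id2)
    {
        int32   diff = (int32) (id1 - id2);     /* wraps modulo 2^32 */

        /* meaningful only while id1 and id2 are less than 2^31 apart,
         * which is exactly the two-billion-XID rule described above */
        return diff < 0;
    }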

That doesn't mean we couldn't raise the setting. It just means that
the hard limit is two billion, not four billion. But, how high should
we raise it? The highest safe value depends on how many XIDs you'll
burn while the freezing vacuums are running, which depends on both the
size of the database and the rate of XID consumption, and those values
can be very different on different systems. I think most people could
get by with a significantly higher value, but even with the current
value I think there are probably some people who run out of XIDs, at
which point they can no longer write to the database.  The higher we
make the default, the more people are going to have that problem. It's
true that a lot of people only hit the limit because something has
gone wrong, like they've forgotten about a prepared transaction or an
unused replication slot, but still, on high-velocity systems you can't
afford to cut it too close because you're still going to be burning
through XIDs while vacuum is running.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company