Re: 64 bit transaction id - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: 64 bit transaction id |
| Date | |
| Msg-id | 20191103191522.jkrv6eik73r3ulzo@development |
| In response to | Re: 64 bit transaction id (Pavel Stehule <pavel.stehule@gmail.com>) |
| List | pgsql-hackers |
On Sun, Nov 03, 2019 at 02:17:15PM +0300, Павел Ерёмин wrote:

> I completely agree with all of the above. Therefore, the proposed
> mechanism may entail larger improvements (and not only VACUUM).

I think the best thing you can do is try implementing this ...

I'm afraid the "improvements" essentially mean making various important
parts of the system much more complicated and expensive. There's a
trade-off between saving 8B per row and the additional overhead (during
vacuum etc.), and it does not seem like a winning strategy. What started
as "we can simply look at the next row version" is clearly way more
complicated and expensive.

The trouble here is that it adds a dependency between pages in the data
file. That for example means that during cleanup of a page it may be
necessary to modify the other page, when originally that would be
read-only in that checkpoint interval. That's essentially write
amplification, and it may significantly increase the amount of WAL due to
generating FPWs for the other page.

> I can offer the following solution.
> For VACUUM, create a hash table.
> VACUUM scanning the table sees that the version (tuple1) has t_ctid
> filled and refers to the address tuple2, it creates a structure into
> which it writes the address tuple1, tuple1.xid, length tuple1 (well,
> and other information that is needed), puts this structure in the hash
> table by key tuple2 addresses.
> VACUUM reaches tuple2, checks the address of tuple2 in the hash table -
> if it finds it, it evaluates the connection between them and makes a
> decision on cleaning.

We know VACUUM is already pretty expensive, so making it even more
expensive seems pretty awful. And the proposed solution seems damn
expensive. (A standalone sketch of what such a hash table might look like
follows this message.)

We already do something similar for indexes - we track pointers for
removed rows, so that we can remove them from indexes. And it's damn
expensive, because we don't know where in the index the tuples are - so
we have to scan the whole indexes.

This would mean we have to do the same thing for the table, because we
don't know where in the table the older versions of those rows are. That
seems mighty expensive.

Not to mention that this does nothing for page-level vacuum, which we do
when trying to fit another row on a page (e.g. for HOT). This has to be
absolutely cheap - we certainly are not going to do lookups of other
pages or look for older versions of the row, and so on.

Being able to make visibility decisions based on the tuple alone (or
possibly page-level + tuple information) has a lot of value, and I don't
think we want to make this more complicated. (The second sketch below
illustrates why a tuple-local check is cheap.)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
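[Editor's note: to make the quoted proposal concrete, here is a minimal
standalone sketch of the hash table it describes, assuming a
(block, offset) TID as the key. This is not PostgreSQL code; every name
here (ChainTid, chain_remember, chain_lookup, etc.) is hypothetical, and
error handling is omitted.]

```c
/*
 * Sketch of the proposed VACUUM chain-tracking hash table.
 * Key: the TID of the *successor* version (what tuple1's t_ctid points
 * at). Value: where the predecessor lives, plus the fields the proposal
 * mentions (xid, length), so that when VACUUM later reaches the
 * successor it can find the predecessor and decide whether to clean.
 */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct ChainTid         /* (block, offset), like an ItemPointer */
{
    uint32_t    block;
    uint16_t    offset;
} ChainTid;

typedef struct ChainEntry
{
    ChainTid    successor;       /* key: target of predecessor's t_ctid */
    ChainTid    predecessor;     /* where the older version lives */
    uint32_t    predecessor_xid; /* xid of the older version */
    uint16_t    predecessor_len; /* length of the older version */
    struct ChainEntry *next;     /* collision chain */
} ChainEntry;

#define NBUCKETS 8192

typedef struct ChainTable
{
    ChainEntry *buckets[NBUCKETS];
} ChainTable;

static uint32_t
chain_hash(ChainTid tid)
{
    return (tid.block * 2654435761u + tid.offset) % NBUCKETS;
}

/* Called when VACUUM sees tuple1 whose t_ctid points at "successor". */
static void
chain_remember(ChainTable *ht, ChainTid successor, ChainTid predecessor,
               uint32_t xid, uint16_t len)
{
    uint32_t    h = chain_hash(successor);
    ChainEntry *e = malloc(sizeof(ChainEntry));

    e->successor = successor;
    e->predecessor = predecessor;
    e->predecessor_xid = xid;
    e->predecessor_len = len;
    e->next = ht->buckets[h];
    ht->buckets[h] = e;
}

/* Called when VACUUM reaches a tuple: is it the successor of something? */
static ChainEntry *
chain_lookup(ChainTable *ht, ChainTid tid)
{
    for (ChainEntry *e = ht->buckets[chain_hash(tid)]; e; e = e->next)
        if (e->successor.block == tid.block &&
            e->successor.offset == tid.offset)
            return e;
    return NULL;
}

int
main(void)
{
    static ChainTable ht;       /* zero-initialized buckets */
    ChainTid    t1 = {10, 3};   /* older version */
    ChainTid    t2 = {42, 7};   /* what t1's t_ctid points at */

    chain_remember(&ht, t2, t1, 12345, 60);

    ChainEntry *e = chain_lookup(&ht, t2);
    if (e)
        printf("predecessor at (%u,%u), xid %u\n",
               (unsigned) e->predecessor.block,
               (unsigned) e->predecessor.offset,
               (unsigned) e->predecessor_xid);
    return 0;
}
```

Even in this toy form, the cost Tomas describes is visible: the table
needs one entry per chained row version, and every entry must be retained
until the scan happens to reach the successor's page.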
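[Editor's note: for contrast, a deliberately simplified, hypothetical
visibility check illustrating the last point. This is NOT PostgreSQL's
HeapTupleSatisfiesMVCC; it ignores hint bits, commit/abort status,
in-progress transactions, subtransactions, and command ids. It only shows
why a tuple-local decision stays cheap: nothing but the tuple's own
header and the snapshot is consulted, so no other tuple or page is ever
read.]

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct TupleHeaderSketch
{
    uint32_t    xmin;   /* inserting transaction */
    uint32_t    xmax;   /* deleting/updating transaction, 0 if none */
} TupleHeaderSketch;

typedef struct SnapshotSketch
{
    uint32_t    xmax;   /* first XID not yet started at snapshot time */
    /* a real snapshot also carries xmin and an in-progress XID array */
} SnapshotSketch;

static bool
xid_visible(uint32_t xid, const SnapshotSketch *snap)
{
    /* simplified: treat any XID that started before the snapshot as
     * committed (a real check also consults commit/abort status) */
    return xid != 0 && xid < snap->xmax;
}

static bool
tuple_visible(const TupleHeaderSketch *tup, const SnapshotSketch *snap)
{
    if (!xid_visible(tup->xmin, snap))
        return false;           /* inserter not visible to this snapshot */
    if (xid_visible(tup->xmax, snap))
        return false;           /* deleter already visible */
    return true;
}

int
main(void)
{
    SnapshotSketch    snap = { .xmax = 100 };
    TupleHeaderSketch tup = { .xmin = 50, .xmax = 0 };

    printf("visible: %s\n", tuple_visible(&tup, &snap) ? "yes" : "no");
    return 0;
}
```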