9.3: summary of corruption detection / checksums / CRCs discussion

From: Jeff Davis

A lot of discussion took place regarding corruption detection, and I am
attempting to summarize it in a useful way. Please excuse the lack of
references; I'm hoping to agree on the basic problem space and the
nature of the solutions offered, and then turn it into a wiki where we
can get into the details. Also, almost every discussion touched on
several of these issues.

Please help me fill in the parts of the problem and directions in the
high-level solution that are missing. This thread is not intended to get
into low-level design or decision making.

First, I'll get a few of the smaller issues out of the way:

* In addition to data pages and slru/clog, it also may be useful to
detect problems with temp files. It's also much easier, because we don't
need to worry about upgrade or crash scenarios. Performance impact
unknown, but could easily be turned on/off at runtime.

* In addition to detecting random garbage, we also need to be able to
detect zeroing of pages. Right now, a zero page is not considered
corrupt, so that's a problem. We'll need to WAL table extension
operations, and we'll need to mitigate the performance impact of doing
so. I think we can do that by extending larger tables by many pages
(say, 16 at a time) so we can amortize the cost of WAL and avoid
contention.

* Utilities, like those for taking a base backup, should also verify the
checksum.

* In addition to detecting random garbage and zeros, we need to detect
entire pages being transposed into different parts of the same file or
different files. To do this we can include the database ID, tablespace,
relfilenode, and page number in the CRC calculation. Perhaps only
include relfilenode and page# to make it easier for utilities to check.
None of this information needs to actually be stored on the page, so it
doesn't affect the header. However, if we are going to expand the page
header anyway, it would be useful to include this information so that
the CRC can be calculated without any external context.
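
To make the transposition idea a bit more concrete, here is a minimal
sketch (not a worked-out design: crc32_update is a stand-in for whatever
CRC/hash routine we end up choosing, and the struct layout is purely
illustrative) of a checksum that mixes the page's identity into the
calculation:

    #include <stdint.h>
    #include <stddef.h>

    #define BLCKSZ 8192                 /* assumed 8kB page size */

    /* stand-in for whatever CRC/hash routine is ultimately chosen */
    extern uint32_t crc32_update(uint32_t crc, const void *data, size_t len);

    typedef struct PageIdentity
    {
        uint32_t    dboid;              /* database OID */
        uint32_t    spcoid;             /* tablespace OID */
        uint32_t    relfilenode;        /* relation file node */
        uint32_t    blkno;              /* block number within the file */
    } PageIdentity;

    static uint16_t
    compute_page_checksum(const char *page, const PageIdentity *ident)
    {
        uint32_t    crc = 0xFFFFFFFF;

        /*
         * Mix in the page's identity first, then the page contents; the
         * on-page checksum field itself would be zeroed/skipped here.
         */
        crc = crc32_update(crc, ident, sizeof(PageIdentity));
        crc = crc32_update(crc, page, BLCKSZ);

        /* fold to 16 bits if that's all the header space we have */
        return (uint16_t) ((crc >> 16) ^ (crc & 0xFFFF));
    }

A page copied byte-for-byte to a different block or file then fails
verification even though its contents are internally consistent.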


Now, onto the two big problems, upgrade and torn pages:

-----------------------------------------------
UPGRADE (and on/off)
-----------------------------------------------

* Should we try to use existing space in header? It's very appealing to
be able to avoid the upgrade work by using existing space in the header.
There was a surprising amount of discussion about which space to use:
pd_pagesize_version or pd_tli. There was also some discussion of using a
few bits from various places.

* Table-level, or system level? Table-level would be appealing if there
turns out to be a significant performance impact. But there are
challenges during recovery, because no relcache is available. It seems
like a relatively minor problem, because pages would indicate whether
they have a checksum or not, but there are some details to be worked
out.

* If we do expand the header, we need an upgrade path. One proposed
approach is to start reserving the necessary space in the previous
version (with a simple point release), and have some way to verify that
all of the pages have the required free space to upgrade. Then, the new
version can update pages gradually, with some final VACUUM to ensure
that all pages are the new version. That sounds easy, except that we
need some way to free up space on the old pages in the old version,
which is non-trivial. For heap pages, that could be like an update; but
for index pages, it would need to be something like a page split, which
is specific to the index type.

* Also, if we expand the page header, we need to figure out something
for the SLRU/CLOG as well.

* We'll need some variant of VACUUM to turn checksums on/off (either
per-table or system wide).


-----------------------------------------------
TORN PAGES
-----------------------------------------------

We don't want torn pages to falsely indicate a checksum failure. Many
page writes are already protected from this with full-page images in the
WAL; but hint bit updates (including the index dead tuple markers) are
not.

* Just pay the price -- WAL all hint bit updates, including FPIs.

* Double-Write buffer -- this attacks the problem most directly. Don't
make any changes to the way hint bits are done; instead, push all page
writes through a double-write buffer. There are numerous performance
implications here, some of which may improve performance and some of
which may hurt it. It's hard to say, in the end, whether this will be a
good solution for everyone (particularly those without battery-backed
caches), but it seems like an accepted approach that can be very good
for the people who need performance the most. (A rough sketch of the
write path appears after this list.)

* Bulk Load -- this is more indirect. The idea is that, during normal
OLTP operation, using the WAL for hints might not be so bad, because the
page is likely to need a FPI for some other reason. The worst case is
when bulk loading, so see if we can set hint bits during the bulk load
in an MVCC-safe way.
http://archives.postgresql.org/message-id/CABRT9RBRMdsoz8KxgeHfb4LG-ev9u67-6DLqvoiibpkKhTLQfw@mail.gmail.com

* Some way of caching CLOG information or making the access faster.
IIRC, there were some vague ideas about mmapping() the CLOG, or caching
a very small representation of the CLOG.

* Something else -- There are a few other lines of thought here. For
instance, can we use WAL for hint bits without a FPI, and still protect
against torn pages causing CRC failures? This is related to a comment
during the 2011 developer meeting, where someone brought up the idea of
idempotent WAL changes, and how that could help us avoid FPIs. It seems
possible after reading the discussions, but not clear enough on the
challenges to summarize here.
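
Returning to the double-write idea above, here is a very rough sketch of
the write path, assuming the simplest possible scheme (one shared
double-write file, with each batch made durable there before any page
goes to its home location); none of this is taken from an actual patch:

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /*
     * Flush a batch of dirty pages. Every page is first made durable in
     * the double-write file; only then is it written in place. After a
     * crash, a torn page in a data file can be restored from the
     * double-write file, so a checksum mismatch there is never a false
     * positive caused by tearing.
     */
    static void
    flush_batch(char **pages, int npages, int dw_fd,
                const int *data_fds, const off_t *offsets)
    {
        int     i;

        /* 1. append the whole batch to the double-write file, then fsync */
        for (i = 0; i < npages; i++)
            (void) write(dw_fd, pages[i], BLCKSZ);
        (void) fsync(dw_fd);

        /*
         * 2. now it is safe to write each page to its real location; the
         *    data files are fsync'd later, at checkpoint time, as today.
         */
        for (i = 0; i < npages; i++)
            (void) pwrite(data_fds[i], pages[i], BLCKSZ, offsets[i]);
    }

The extra fsync per batch is where the performance concerns come from,
and why batching (and fast storage for the double-write file) matters so
much.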

If we do use WAL for hint bit updates, that has an impact on Hot
Standby, because HS can't write WAL. So, it would seem that HS could not
set hint bits.


Comments?

Regards,
Jeff Davis




Re: 9.3: summary of corruption detection / checksums / CRCs discussion

From: Greg Stark

On Sat, Apr 21, 2012 at 10:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> * In addition to detecting random garbage, we also need to be able to
> detect zeroing of pages. Right now, a zero page is not considered
> corrupt, so that's a problem. We'll need to WAL table extension
> operations, and we'll need to mitigate the performance impact of doing
> so. I think we can do that by extending larger tables by many pages
> (say, 16 at a time) so we can amortize the cost of WAL and avoid
> contention.

I haven't seen this come up in discussion. WAL logging table
extensions wouldn't by itself work because currently we treat the file
size on disk as the size of the table. So you would have to do the
extension in the critical section or else different backends might see
the wrong file size and write out conflicting wal entries.

> -----------------------------------------------
> TORN PAGES
> -----------------------------------------------
>
> We don't want torn pages to falsely indicate a checksum failure. Many
> page writes are already protected from this with full-page images in the
> WAL; but hint bit updates (including the index dead tuple markers) are
> not.
>
> * Just pay the price -- WAL all hint bit updates, including FPIs.
>
> * Double-Write buffer -- this attacks the problem most directly. Don't

The earlier consensus was to move all the hint bits to a dedicated
area and exclude them from the checksum. I think double-write buffers
seem to have become more fashionable but a summary that doesn't
describe the former is definitely incomplete.

Fwiw the tradeoff here is at least partly between small and large
systems. For double writes to be at all efficient you need either
flash or a dedicated third spindle in addition to the logs and data.
For smaller systems that would be a huge cost but for larger ones
that's not really a problem at all.


> * Bulk Load -- this is more indirect. The idea is that, during normal
> OLTP operation, using the WAL for hints might not be so bad, because the
> page is likely to need a FPI for some other reason. The worst case is
> when bulk loading, so see if we can set hint bits during the bulk load
> in an MVCC-safe way.
> http://archives.postgresql.org/message-id/CABRT9RBRMdsoz8KxgeHfb4LG-ev9u67-6DLqvoiibpkKhTLQfw@mail.gmail.com

That link points to the MVCC-safe truncate patch. I don't follow how
optimizations in bulk loads are relevant to wal logging hint bit
updates.


-- 
greg


Re: 9.3: summary of corruption detection / checksums / CRCs discussion

From: Jeff Davis

On Sun, 2012-04-22 at 00:08 +0100, Greg Stark wrote:
> On Sat, Apr 21, 2012 at 10:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > * In addition to detecting random garbage, we also need to be able to
> > detect zeroing of pages. Right now, a zero page is not considered
> > corrupt, so that's a problem. We'll need to WAL table extension
> > operations, and we'll need to mitigate the performance impact of doing
> > so. I think we can do that by extending larger tables by many pages
> > (say, 16 at a time) so we can amortize the cost of WAL and avoid
> > contention.
> 
> I haven't seen this come up in discussion.

I don't have any links, and it might just be based on in-person
discussions. I think it's just being left as a loose end for later, but
it will eventually need to be solved.

> WAL logging table
> extensions wouldn't by itself work because currently we treat the file
> size on disk as the size of the table. So you would have to do the
> extension in the critical section or else different backends might see
> the wrong file size and write out conflicting wal entries.

By "critical section", I assume you mean "while holding the relation
extension lock" not "while inside a CRITICAL_SECTION()", right?

There would be some synchronization overhead, to be sure, but I think it
can be done. Ideally, we'd be able to do large enough extensions that,
if there is a parallel bulk load on a single table or something, the
overhead could be made insignificant.

I didn't intend to get too much into the detail in this thread, but if
it's a totally ridiculous or impossible idea, I'll remove it.

> The earlier consensus was to move all the hint bits to a dedicated
> area and exclude them from the checksum. I think double-write buffers
> seem to have become more fashionable but a summary that doesn't
> describe the former is definitely incomplete.

Thank you, that's the kind of omission I was looking to catch.

> That link points to the MVCC-safe truncate patch. I don't follow how
> optimizations in bulk loads are relevant to wal logging hint bit
> updates.

I should have linked to these messages:
http://archives.postgresql.org/message-id/CA+TgmoYLOzDezzJKyJ8_x2bPeEerAo5dJ-OMvS1fLQOQSQP5jg@mail.gmail.com
http://archives.postgresql.org/message-id/CA+Tgmoa4Xs1jbZhm=pb9Xi4AGMJXRB2a4GSE9EJtLo=70Zne=g@mail.gmail.com

Though perhaps I'm reading too much into Robert's comments.

Regards,
Jeff Davis




Re: 9.3: summary of corruption detection / checksums / CRCs discussion

From: Robert Haas

On Sat, Apr 21, 2012 at 5:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> * In addition to detecting random garbage, we also need to be able to
> detect zeroing of pages. Right now, a zero page is not considered
> corrupt, so that's a problem. We'll need to WAL table extension
> operations, and we'll need to mitigate the performance impact of doing
> so. I think we can do that by extending larger tables by many pages
> (say, 16 at a time) so we can amortize the cost of WAL and avoid
> contention.

I think that extending tables in larger chunks is probably a very good
idea for performance reasons completely apart from checksums.
However, I don't think that WAL-logging relation extension will allow
you to treat a zero page as an error condition unless you not only
WAL-log the operation but also *flush WAL* before performing the
actual table-extension operation.  Otherwise, we might crash after the
actual extension and before the WAL record hits the disk, and now
we're back to having a zero'd page in the file.  And the impact of
writing *and flushing* WAL for every extension seems likely to be more
than we're willing to pay.  If we extended 16 pages at a time, that
means waiting for 8 WAL fsyncs per MB of relation extension.  On my
MacBook Pro, which is admittedly a pessimal case for fsyncs, that
would work out to an extra half second of elapsed time per MB written,
so pgbench -i -s 100 would probably take about an extra 640 seconds.
If that's a correct analysis, that sounds pretty bad, because right
now it's taking 143 seconds.
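
Spelling out the arithmetic behind those numbers (8kB pages assumed; the
fsync latency is whatever this laptop delivers, call it roughly 60ms):

    16 pages * 8kB             = 128kB extended per WAL flush
    1MB / 128kB                = 8 WAL fsyncs per MB of extension
    8 fsyncs * ~60ms           ~ 0.5s of extra elapsed time per MB
    640s extra / 0.5s per MB   ~ 1280MB, about what pgbench -i -s 100 writes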

> * Should we try to use existing space in header? It's very appealing to
> be able to avoid the upgrade work by using existing space in the header.
> There was a surprising amount of discussion about which space to use:
> pd_pagesize_version or pd_tli. There was also some discussion of using a
> few bits from various places.

It still seems to me that pd_tli is the obvious choice, since there is
no code anywhere in the system that relies on that field having any
particular value, so we can pretty much whack it around at will
without breaking anything.  AFAICS, the only reason to bump the page
format is if we want a longer checksum - 32 bits, say, instead of 16.
But I am inclined to think that 16 ought to be enough to detect the
vast majority of cases of corruption.
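
Concretely, that change amounts to renaming one 16-bit field; a sketch of
the resulting header, abridged from bufpage.h (everything except
pd_checksum is the existing layout):

    typedef struct PageHeaderData
    {
        XLogRecPtr    pd_lsn;           /* LSN of last change to this page */
        uint16        pd_checksum;      /* was pd_tli; nothing reads it today */
        uint16        pd_flags;         /* flag bits */
        LocationIndex pd_lower;         /* offset to start of free space */
        LocationIndex pd_upper;         /* offset to end of free space */
        LocationIndex pd_special;       /* offset to start of special space */
        uint16        pd_pagesize_version;  /* page size and layout version */
        TransactionId pd_prune_xid;     /* oldest prunable XID, or zero */
        ItemIdData    pd_linp[1];       /* line pointer array */
    } PageHeaderData;

A uniformly random corruption slips past a 16-bit checksum only about
1 time in 65,536, which seems consistent with "the vast majority of
cases" above.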

> * Table-level, or system level? Table-level would be appealing if there
> turns out to be a significant performance impact. But there are
> challenges during recovery, because no relcache is available. It seems
> like a relatively minor problem, because pages would indicate whether
> they have a checksum or not, but there are some details to be worked
> out.

I think the root of the issue here is that it's hard to turn checksums
on and off *online*.  If checksums were an initdb-time option, the
design here would be pretty simple: store the flag in the control
file.  And it wouldn't even be hard to allow the flag to be flipped
while the cluster is shut down: just write a utility to checksum and
rewrite all the blocks, fsync everything, and then flip the flag in
the control file and fsync that; also, force the user to rebuild all
their standbys.  This might not be very convenient, but it would be
comparatively simple to implement.
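
The per-file part of such a utility would not be much more than the
following sketch (set_page_checksum() is a hypothetical helper that
computes and stores the checksum; the control-file flip and the standby
handling are not shown):

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdint.h>

    #define BLCKSZ 8192

    /* hypothetical helper: compute the checksum for block blkno of this
     * relation file and store it in the page header */
    extern void set_page_checksum(char *page, uint32_t relfilenode,
                                  uint32_t blkno);

    static int
    checksum_relation_file(const char *path, uint32_t relfilenode)
    {
        char        page[BLCKSZ];
        uint32_t    blkno = 0;
        int         fd = open(path, O_RDWR);

        if (fd < 0)
            return -1;

        /* rewrite every block with its checksum set */
        while (read(fd, page, BLCKSZ) == BLCKSZ)
        {
            set_page_checksum(page, relfilenode, blkno);
            if (pwrite(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
            {
                close(fd);
                return -1;
            }
            blkno++;
        }

        /* make it durable before the control-file flag is flipped */
        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        return close(fd);
    }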

However, I've been assuming that we do want to handle turning
checksums on and off without shutting down the cluster, and that
definitely makes things more complicated.  In theory it could work
more or less the same way as in the off-line case: you launch some
command or function that visits every data block in the cluster and
rewrites them all, xlogging as it goes.  However, I'm a little
concerned that it won't be very usable in that form for some of the
same reasons that the off-line solution might not be very usable: you
essentially force the user to do a very large data-rewriting operation
all at once.  We already know that kicking off a big vacuum
in the background can impact the performance of the whole cluster, and
that's just one table; this would be the whole cluster all at once.

Moreover, if it's based on the existing vacuum code, which seems
generally desirable, it seems like it's going to have to be something
you run in one database at a time until you've hit them all.  In that
case, you're going to need some way to indicate which tables or
databases have been processed already and which haven't yet ... and if
you have that, then it's not a very big step to think that maybe we
ought to just control it on a per-table or per-database level to begin
with, since that would also have some of the collateral benefits you
mention.  The alternative is to have a new command, or a new form of
vacuum, that processes every table in every database regardless of
which DB you're connected to at the moment.  That's a lot of new,
fairly special-purpose code and it doesn't seem very convenient from a
management perspective, but it is probably simpler otherwise.

I'm not sure what the right decision is here.  I like the idea of
being able to activate and deactivate the feature without shutting
down the cluster, and I really like the idea of being able to do it
incrementally, one table at a time, instead of all at once.  But it's
surely more complex.  If you store the flag on a table level, then
it's not available during recovery (unless we invent some new trick to
make that work, like storing it in some other format that is simple
enough to be interpreted by the recovery code); if you store it on a
page level, then you can miss corruption if that corruption has the
side effect of clearing the bit.  It's not dissimilar to your
complaint upthread about empty pages: if you see an empty page (a page
that says it hasn't been checksummed), how do you know whether it's
really supposed to be empty (have no checksum) or whether it ended up
empty (unchecksummed) because of exactly the sort of error we're
trying to detect?

> We don't want torn pages to falsely indicate a checksum failure. Many
> page writes are already protected from this with full-page images in the
> WAL; but hint bit updates (including the index dead tuple markers) are
> not.
>
> * Just pay the price -- WAL all hint bit updates, including FPIs.

Simon's patch contains a pretty good idea on this point: we don't
actually need to WAL log all hint bit updates.  In the worst case, we
need to log an FPI once per checkpoint cycle, and then only if the
page is hint-bit dirtied before it's dirtied by a fully WAL-logged
operation.  Moreover, if we WAL-log an FPI when the page is
hint-bit-dirtied, and a fully WAL-logged operation comes along later
in the same checkpoint cycle, we can skip the FPI for the second
operation, which probably recovers most of the cost of the previous
FPI.
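
In pseudo-code the idea is roughly the following (the function names and
the checksums_enabled flag are illustrative, not Simon's actual patch):
when a buffer is dirtied only for hint bits, emit an FPI only if the page
has not already been covered by one in this checkpoint cycle.

    static void
    MarkBufferDirtyForHints(Buffer buffer)
    {
        Page        page = BufferGetPage(buffer);

        /* checksums_enabled is a placeholder for whatever flag we use */
        if (checksums_enabled && PageGetLSN(page) <= GetRedoRecPtr())
        {
            /*
             * No FPI has covered this page since the last checkpoint
             * began, so write one now, and advance the page LSN so that a
             * later fully WAL-logged change in the same cycle can skip
             * its own FPI.
             */
            XLogRecPtr  lsn = XLogWriteFullPageImage(buffer);  /* hypothetical */

            PageSetLSN(page, lsn);
        }

        MarkBufferDirty(buffer);
    }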

The real problem here is: what happens when we're dirtying a large
number of pages only for hint bits?  In that case, we're potentially
generating a huge volume of WAL.  We can probably optimize the
bulk-loading case, but there are going to be other cases that stink.
Apart from bulk-loading, pgbench doesn't take much of a hit because
it's basically not going to touch a pgbench_accounts page again until
it comes back to update another row on the same page, so the overhead
shouldn't be that bad.  But it might be much worse on some other
workload where there are lots of reads mixed in with the writes: now
you've got a good chance of kicking out a bunch of extra FPIs.  There
are some things we could do about that - e.g. make sure to set all
possible hint bits before evicting an already-dirty buffer, so that
the chances of needing to dirty it again the next time we visit are
minimized.  However, I don't think all of that is work that has to be
done up front.  I suspect we will need to do some mitigation of the
bulk-loading case before committing even an initial version of
checksums, but I think beyond that we could probably postpone most of
the other optimization in this area to a future patch.

> * Double-Write buffer -- this attacks the problem most directly. Don't
> make any changes to the way hint bits are done; instead, push all page
> writes through a double-write buffer. There are numerous performance
> implications here, some of which may improve performance and some which
> may hurt performance. It's hard to say, at the end, whether this will be
> a good solution for everyone (particularly those without battery-backed
> caches), but it seems like an accepted approach that can be very good
> for the people who need performance the most.

I'm pretty skeptical of this approach.  It may work out well with
specialized hardware, but none of the hardware I have is going to cope
with that volume of fsyncs anything like gracefully.  Unless of course
we make the buffer really big, but that hasn't seemed to be a popular
approach.

> * Some way of caching CLOG information or making the access faster.
> IIRC, there were some vague ideas about mmapping() the CLOG, or caching
> a very small representation of the CLOG.

I'm interested in pursuing this in general, independent of checksums.
Not sure how far I will get, but reducing CLOG contention is something
I do plan to spend more time on.

> * Something else -- There are a few other lines of thought here. For
> instance, can we use WAL for hint bits without a FPI, and still protect
> against torn pages causing CRC failures? This is related to a comment
> during the 2011 developer meeting, where someone brought up the idea of
> idempotent WAL changes, and how that could help us avoid FPIs. It seems
> possible after reading the discussions, but not clear enough on the
> challenges to summarize here.

There are some possibilities here, but overall I have my doubts about
backing up checksums behind another complex project with uncertain
prospects.

> If we do use WAL for hint bit updates, that has an impact on Hot
> Standby, because HS can't write WAL. So, it would seem that HS could not
> set hint bits.

Yeah.  The good news is that WAL-logging hint bits on the master would
propagate those hint bits to the slave, which would reduce the need to
set them on the slave.  But there's definitely some potential for
performance regression here; I'm just not sure how much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.3: summary of corruption detection / checksums / CRCs discussion

From: Robert Haas

On Sat, Apr 21, 2012 at 7:08 PM, Greg Stark <stark@mit.edu> wrote:
> The earlier consensus was to move all the hint bits to a dedicated
> area and exclude them from the checksum. I think double-write buffers
> seem to have become more fashionable but a summary that doesn't
> describe the former is definitely incomplete.

I don't think we ever had any consensus that moving the hint bits
around was a good idea.  For one thing, they are only hints in one
direction.  It's OK to clear them by accident, but it's not OK to set
them by accident.  For two things, it's not exactly clear how we'd
rearrange the page to make this work at all: where are those hint bits
gonna go, if not in the tuple headers?  For three things, index pages
have hint-type changes that are not single-bit changes.

> That link points to the MVCC-safe truncate patch. I don't follow how
> optimizations in bulk loads are relevant to wal logging hint bit
> updates.

That patch actually has more than one optimization in it, I think, but
the basic idea is that if we could figure out a way to set
HEAP_XMIN_COMMITTED when loading data into a table created or
truncated within the same transaction, the need to set hint bits on
first scan of the table would be eliminated.  Writing the xmin as
FrozenTransactionId would save even more, though it introduces some
additional complexity.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.3: summary of corruption detection / checksums / CRCs discussion

From: Greg Stark

On Tue, Apr 24, 2012 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>  For three things, index pages
> have hint-type changes that are not single-bit changes.

? Just how big are these? Part of the reason hint bit updates are safe
is because one bit definitely absolutely has to be entirely in one
page. You can't tear a page in the middle of a bit. In reality the
size is much larger, probably 4k and almost certainly at least 512
bytes. But the postgres block layout doesn't really offer many
guarantees about the location of anything relative to those 512-byte
blocks so probably anything larger than a word is unsafe to update.

The main problem with the approach was that we kept finding more hint
bits we had forgotten about. Once the coding idiom was established it
seems it was a handy hammer for a lot of problems.
--
greg


Re: 9.3: summary of corruption detection / checksums / CRCs discussion

From: Josh Berkus

On 4/21/12 2:40 PM, Jeff Davis wrote:
> If we do use WAL for hint bit updates, that has an impact on Hot
> Standby, because HS can't write WAL. So, it would seem that HS could not
> set hint bits.

If we're WAL-logging hint bits, then the standby would be receiving
them, so it doesn't *need* to write them.

However, I suspect that WAL-logging hint bits would be prohibitively
expensive.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: 9.3: summary of corruption detection / checksums / CRCs discussion

From: Merlin Moncure

On Tue, Apr 24, 2012 at 3:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Apr 21, 2012 at 7:08 PM, Greg Stark <stark@mit.edu> wrote:
>> The earlier consensus was to move all the hint bits to a dedicated
>> area and exclude them from the checksum. I think double-write buffers
>> seem to have become more fashionable but a summary that doesn't
>> describe the former is definitely incomplete.
>
> I don't think we ever had any consensus that moving the hint bits
> around was a good idea.  For one thing, they are only hints in one
> direction.  It's OK to clear them by accident, but it's not OK to set
> them by accident.  For two things, it's not exactly clear how we'd
> rearrange the page to make this work at all: where are those hint bits
> gonna go, if not in the tuple headers?  For three things, index pages
> have hint-type changes that are not single-bit changes.
>
>> That link points to the MVCC-safe truncate patch. I don't follow how
>> optimizations in bulk loads are relevant to wal logging hint bit
>> updates.
>
> That patch actually has more than one optimization in it, I think, but
> the basic idea is that if we could figure out a way to set
> HEAP_XMIN_COMMITTED when loading data into a table created or
> truncated within the same transaction, the need to set hint bits on
> first scan of the table would be eliminated.  Writing the xmin as
> FrozenTransactionId would save even more, though it introduces some
> additional complexity.

This would be great but it's only a corner case.  A pretty common
application flow is to write a large number of records, scan them,
update them, scan them again, delete them, etc. in a table that's
already established and possibly pretty large.  Unfortunately this
type of work doesn't get a lot of coverage with the common benchmarks.

Also, wouldn't the extra out of band wal traffic from hint bits
exacerbate contention issues on the wal insert lock?

merlin


Re: 9.3: summary of corruption detection / checksums / CRCs discussion

From: Robert Haas

On Tue, Apr 24, 2012 at 8:52 PM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Apr 24, 2012 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>  For three things, index pages
>> have hint-type changes that are not single-bit changes.
>
> ? Just how big are these? Part of the reason hint bit updates are safe
> is because one bit definitely absolutely has to be entirely in one
> page. You can't tear a page in the middle of a bit. In reality the
> size is much larger, probably 4k and almost certainly at least 512
> bytes. But the postgres block layout doesn't really offer much
> guarantees about the location of anything relative those 512 byte
> blocks so probably anything larger than a word is unsafe to update.

See _bt_killitems.  It uses ItemIdMarkDead, which looks like it will
turn into a 4-byte store.
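
(For reference, abridged from itemid.h: a line pointer is a packed 4-byte
bitfield, so there is no way to store just the two flag bits.)

    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                    lp_flags:2,     /* state of line pointer */
                    lp_len:15;      /* byte length of tuple */
    } ItemIdData;

    #define ItemIdMarkDead(itemId) \
        ((itemId)->lp_flags = LP_DEAD)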

> The main problem with the approach was that we kept finding more hint
> bits we had forgotten about. Once the coding idiom was established it
> seems it was a handy hammer for a lot of problems.

It is.  And I think we shouldn't be lulled into the trap of thinking
hint bits are bad.  They do cause some problems, but they exist
because they solve even worse problems.  It's fundamentally pretty
useful to be able to cache the results of expensive calculations in
data pages, which is what hints allow us to do, and they let us do it
without incurring the overhead of WAL-logging.  Even if we could find
a way of making CLOG access cheap enough that we didn't need
HEAP_XMIN/XMAX_COMMITTED, it wouldn't clear the way to getting rid of
hinting entirely.  I strongly suspect that the btree item-is-dead
hinting is actually MORE valuable than the heap hint bits.  CLOG
probes are expensive, but there is room for optimization there through
caching, and simply because the data set is relatively limited in size.
OTOH, the btree hints potentially save you a heap fetch on the next
trip through, which potentially means a random I/O into a huge table.
That's nothing to sneeze at.  It also means that the next index
insertion in the page can potentially prune that item away completely,
allowing faster space re-use.  That's nothing to sneeze at, either.

To put that another way, the reason why WAL-logging all hints seems
expensive is because NOT WAL-logging hints is a huge performance
optimization.  If we can come up with an even better performance
optimization that also reduces the need to write out hinted pages,
then of course we should do that, but we shouldn't hate the
optimization we have because it's not as good as the one we wish we
had.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 9.3: summary of corruption detection / checksums / CRCs discussion

From: Merlin Moncure

On Wed, Apr 25, 2012 at 9:28 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> That patch actually has more than one optimization in it, I think, but
>> the basic idea is that if we could figure out a way to set
>> HEAP_XMIN_COMMITTED when loading data into a table created or
>> truncated within the same transaction, the need to set hint bits on
>> first scan of the table would be eliminated.  Writing the xmin as
>> FrozenTransactionId would save even more, though it introduces some
>> additional complexity.
>
> This would be great but it's only a corner case.  A pretty common
> application flow is to write a large number of records, scan them,
> update them, scan them again, delete them, etc. in a table that's
> already established and possibly pretty large.  Unfortunately this
> type of work doesn't get a lot of coverage with the common benchmarks.
>
> Also, wouldn't the extra out of band wal traffic from hint bits
> exacerbate contention issues on the wal insert lock?

Hm, you probably remember the process-local hint bit cache.  While my
implementation had a lot of issues and it's arguable whether I'm a
patient enough C programmer to come up with one that passes muster, it
did work as advertised: it reduced hint bit traffic to the point of
eliminating it in cases like this, including the bulk load case.  Some
variant of the approach would probably take at least some of the sting
out of logged hint bits if that's the direction things have to go.

That doesn't help high concurrency cases where the #tuples/xid is low
though.  And even a small increase in contention on the wal insert
lock could mean a measurable reduction in scaling unless that problem
is solved independently.

merlin