Thread: Writing Commit Status hint bits (was Re: [HACKERS] Constant WAL replay)

Writing Commit Status hint bits (was Re: [HACKERS] Constant WAL replay)

From

Simon Riggs

Date:

19 July 2005, 19:57:45

On Sun, 2005-04-24 at 02:28 -0400, Tom Lane wrote:
> In the current code there is no such thing as a hard read-only behavior
> --- for example we will try to update commit-status hint bits no matter
> what.  Allowing that to be turned off would be interesting for a number
> of purposes, such as burning a database onto CD.

Short patch enclosed to turn off writing of commit-status hint bits.

I've called this "cache_txn_status_with_data" but I'm open to other
suggestions...

This should allow migration of older child tables to hierarchical
storage when using a large historical table design.

Best Regards, Simon Riggs

Attachment

hintbits.patch

Re: Writing Commit Status hint bits (was Re: [HACKERS] Constant WAL replay)

From

Tom Lane

Date:

19 July 2005, 23:24:50

Simon Riggs <simon@2ndquadrant.com> writes:
> Short patch enclosed to turn off writing of commit-status hint bits.

Doesn't this entirely destroy the ability to truncate clog, and
therefore the ability to survive XID wraparound?

It probably also breaks subxact and multixact logging, but I haven't
looked closely...

            regards, tom lane

Re: Writing Commit Status hint bits (was Re: [HACKERS] Constant

From

Simon Riggs

Date:

20 July 2005, 03:27:49

On Tue, 2005-07-19 at 22:24 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Short patch enclosed to turn off writing of commit-status hint bits.
>
> Doesn't this entirely destroy the ability to truncate clog, and
> therefore the ability to survive XID wraparound?

I hope not for all our sakes, since the hint bits are not WAL logged and
anything that relies upon them would be fragile. If we were going to
move a table to WORM storage, then we'd have done VACUUM FREEZE first
anyway, which would be required to avoid XID wraparound. This patch
would then guarantee that no further system-initiated writes take place.

Not AFAICS.... comments from vacuum.c included:

/*
* Now scan all the pages that we moved tuples onto and update tuple
* status bits.  This is not really necessary, but will save time for
* future transactions examining these tuples.
*/
update_hint_bits(onerel, fraged_pages, num_fraged_pages,
last_move_dest_block, num_moved);

then again on the function itself...

/*
* update_hint_bits() -- update hint bits in destination pages
*
* Scan all the pages that we moved tuples onto and update tuple status
bits.
* This is normally not really necessary, but it will save time for
future
* transactions examining these tuples.
*

*
* For the non-freeze case, one wonders whether it wouldn't be better to
skip
* this work entirely, and let the tuple status updates happen someplace
* that's not holding an exclusive lock on the relation.
*/

> It probably also breaks subxact and multixact logging, but I haven't
> looked closely...

AFAIK this has nothing to do with that, since those locks are not
persistent across a crash,,,but conceivably something in 2PC might be
effected.

Best Regards, Simon Riggs

Re: Writing Commit Status hint bits (was Re: [HACKERS] Constant WAL replay)

From

Tom Lane

Date:

20 July 2005, 10:24:44

Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2005-07-19 at 22:24 -0400, Tom Lane wrote:
>> Simon Riggs <simon@2ndquadrant.com> writes:
>>> Short patch enclosed to turn off writing of commit-status hint bits.
>>
>> Doesn't this entirely destroy the ability to truncate clog, and
>> therefore the ability to survive XID wraparound?

> I hope not for all our sakes, since the hint bits are not WAL logged and
> anything that relies upon them would be fragile.

We don't rely on any one write of them to work, but that doesn't mean
that we can indefinitely postpone writing them.

            regards, tom lane

Re: Writing Commit Status hint bits (was Re: [HACKERS] Constant

From

Simon Riggs

Date:

20 July 2005, 13:26:57

On Wed, 2005-07-20 at 09:24 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2005-07-19 at 22:24 -0400, Tom Lane wrote:
> >> Simon Riggs <simon@2ndquadrant.com> writes:
> >>> Short patch enclosed to turn off writing of commit-status hint bits.
> >>
> >> Doesn't this entirely destroy the ability to truncate clog, and
> >> therefore the ability to survive XID wraparound?
>
> > I hope not for all our sakes, since the hint bits are not WAL logged and
> > anything that relies upon them would be fragile.
>
> We don't rely on any one write of them to work, but that doesn't mean
> that we can indefinitely postpone writing them.

OK, I think I understand where you're coming from now.

A table can't be migrated to read-only media until all of its tuples
have an xmin of FrozenTransactionId; I said the following, which was
wrong:
"This should allow migration of older child tables to hierarchical
storage when using a large historical table design."
The patch doesn't directly contribute to that goal, though is of value
in a large historical table design with many read only child tables (and
other situations).

But that comment wasn't the only inspiration for the patch.

When VACUUM freezes the xid, it *does* make sense at that point to
update the hint bits as a performance optimization. That isn't required
though, and *can* be indefinitely postponed, AFAICS. All of the tqual
routines will still work just as well without the hint bits set. If you
know different, I'll need a deeper explanation before I understand.

IMHO, the direction the patch is going in is still worthwhile because of
these issues:

1. Any block read may attempt to set hint bits, which dirties the block
and must be written out. So *reads* can result in heavier write activity
at checkpoint time. That effects both OLTP and DW systems: Random read
transactions against a large table will be worst effected, since we may
end up writing the block once for each read.

2. A lazy vacuum may also dirty a block, even when it has done nothing
else useful to that block. Worse, if we vacuum a table that is bigger
than shared_buffers (or close), then we will end up having to evict
dirty buffers that the vacuum itself has written in order to continue
the vacuum. Since Vacuum is two-pass, we may end up writing a block
*twice*, once where we set the hint bits and then again later where we
remove the tuples and re-write. So this patch will allow a normal VACUUM
to perform better on larger tables.

(1) is a pain, but there's no point solving it without also solving (2).

The patch would fail an Assert test during a VACUUM, since the info bits
are not actually set if cache_txn_status_with_data = false during
VACUUM.

As a result of (2), perhaps we should remove all of the
SetBufferCommitInfoNeedsSave calls in HeapTupleSatisfiesVacuum, and add
a call to SetBufferCommitInfoNeedsSave that overrides
cache_txn_status_with_data when we actually freeze a row. (Or perhaps
that should be a VACUUM FAST command?) That way we would set the hint
bits *only* when we freeze a row and not at any other time.

If we further reduced the number of times we dirty the block *at all* on
the first pass of a VACUUM, we would reduce the chance of writing twice.
We could save the setting of frozen transactions until the second phase,
i.e. only dirty the block if (pgchanged && vacrelstats->num_dead_tuples
> prev_dead_count)

The patch sets cache_txn_status_with_data as a USERSET, with the
intention that particular read-only users would not wish to have their
read-only transactions turn into write transactions. There was no
intention to prevent VACUUM, not to avoid the optimisation of writing
hint bits on a VACUUM FREEZE nor to set full read only status - which we
discussed previously but is a much longer project.

Sorry for any confusion caused in my initial patch submission. Does my
longer explanation make sense of what the patch is trying to achieve.

Best Regards, Simon Riggs

Re: Writing Commit Status hint bits (was Re: [HACKERS] Constant WAL replay)

From

Tom Lane

Date:

20 July 2005, 14:20:40

Simon Riggs <simon@2ndquadrant.com> writes:
> On Wed, 2005-07-20 at 09:24 -0400, Tom Lane wrote:
>> We don't rely on any one write of them to work, but that doesn't mean
>> that we can indefinitely postpone writing them.

> OK, I think I understand where you're coming from now.

Apparently not :-(

> When VACUUM freezes the xid, it *does* make sense at that point to
> update the hint bits as a performance optimization.

The hint bits are not really relevant when xmin = FrozenTransactionId,
since any examiner of the tuple would consider that XID committed anyway.
Besides, the hint bit is guaranteed set in that scenario; note the
Assert where vacuum is setting it:

                    HeapTupleHeaderSetXmin(tuple.t_data, FrozenTransactionId);
                    /* infomask should be okay already */
                    Assert(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED);

The scenario in which the hint bit *must* be set is where it is for
an XID for which we have deleted the relevant section of pg_clog,
which we are willing to do well before freeze occurs, if we know that
all the relevant XIDs have been hinted.  See TruncateCLOG.  Your patch
breaks that logic by not considering hint-bit updates as changes that
must be flushed to disk by checkpoint.

> 1. Any block read may attempt to set hint bits, which dirties the block
> and must be written out. So *reads* can result in heavier write activity
> at checkpoint time.

Sure, but the alternative is heavier activity in repeated checks of
pg_clog to find out commit state that a previous examiner of the tuple
already found out.  The patch supposes that one write is worse than N
reads, which is clearly a loss at some not-exceedingly-large value of N.
If we thought that was a good tradeoff, we might as well not have the
hint bits at all.

            regards, tom lane

Re: Writing Commit Status hint bits (was Re: [HACKERS] Constant

From

Simon Riggs

Date:

22 July 2005, 04:51:45

On Wed, 2005-07-20 at 13:20 -0400, Tom Lane wrote:

> > When VACUUM freezes the xid, it *does* make sense at that point to
> > update the hint bits as a performance optimization.
>
> The hint bits are not really relevant when xmin = FrozenTransactionId,
> since any examiner of the tuple would consider that XID committed anyway.
> Besides, the hint bit is guaranteed set in that scenario; note the
> Assert where vacuum is setting it:
>
>                     HeapTupleHeaderSetXmin(tuple.t_data, FrozenTransactionId);
>                     /* infomask should be okay already */
>                     Assert(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED);
> The scenario in which the hint bit *must* be set is where it is for
> an XID for which we have deleted the relevant section of pg_clog,
> which we are willing to do well before freeze occurs, if we know that
> all the relevant XIDs have been hinted.  See TruncateCLOG.  Your patch
> breaks that logic by not considering hint-bit updates as changes that
> must be flushed to disk by checkpoint.

OK, I see that now. Thank you for showing me that.

I will change the patch so that this does not prevent VACUUM from
setting hint bits.

All of this does raise other questions, but thats probably sensible to
raise those on other threads.

> > 1. Any block read may attempt to set hint bits, which dirties the block
> > and must be written out. So *reads* can result in heavier write activity
> > at checkpoint time.
>
> Sure, but the alternative is heavier activity in repeated checks of
> pg_clog to find out commit state that a previous examiner of the tuple
> already found out.  The patch supposes that one write is worse than N
> reads, which is clearly a loss at some not-exceedingly-large value of N.
> If we thought that was a good tradeoff, we might as well not have the
> hint bits at all.

That is not a decision we can make without knowledge of the application.
In general, the hint bits are good. In *some* cases, not. I still seek
control over that as a designer.

Specifically, the scenario I want to optimize is this:
- we load a table with lots of real time measurement data, as one child
out of a large number of similar child tables
- we then immediately create summary tables from the measurements
- after this the detailed data is only sporadically accessed, if ever
- detail data is dropped from the database after a few weeks
- the majority of the database is detail data, so those tables are never
vacuumed since no rows are ever deleted from those tables (the tables
are dropped) nor is access sufficiently frequent to make it sensible to
set hint bits - hence *no* complete database vacuum is run on a regular
basis

In this design, the second step causes the whole detailed data table to
be written out to the database (again) immediately after loading. I
would like to be able to prevent that.

That design is *not* uncommon, since we might describe it as a
"regulatory/monitoring data archiving" application. I have two current
clients with exactly this design, plus have worked on similar apps in
the past. The PostgreSQL licence cost makes it particularly suitable for
that kind of application.

Best Regards, Simon Riggs

Re: Writing Commit Status hint bits (was Re: [HACKERS] Constant WAL replay)

From

Tom Lane

Date:

22 July 2005, 10:40:19

Simon Riggs <simon@2ndquadrant.com> writes:
> In general, the hint bits are good. In *some* cases, not. I still seek
> control over that as a designer.

> Specifically, the scenario I want to optimize is this:
> - we load a table with lots of real time measurement data, as one child
> out of a large number of similar child tables
> - we then immediately create summary tables from the measurements
> - after this the detailed data is only sporadically accessed, if ever
> - detail data is dropped from the database after a few weeks
> - the majority of the database is detail data, so those tables are never
> vacuumed since no rows are ever deleted from those tables (the tables
> are dropped) nor is access sufficiently frequent to make it sensible to
> set hint bits - hence *no* complete database vacuum is run on a regular
> basis

> In this design, the second step causes the whole detailed data table to
> be written out to the database (again) immediately after loading. I
> would like to be able to prevent that.

This application seems sufficiently off the beaten path to not be worth
solving with a knob as klugy and potentially dangerous as
suppress_hint_bits.

A better way to avoid the write pass, if you must, is to do the summary
table creation in the same transaction that loads the data.

            regards, tom lane

Re: Writing Commit Status hint bits (was Re: [HACKERS] Constant

From

Simon Riggs

Date:

22 July 2005, 13:23:36

On Fri, 2005-07-22 at 09:40 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > In general, the hint bits are good. In *some* cases, not. I still seek
> > control over that as a designer.

...

> ... not be worth
> solving with a knob as klugy and potentially dangerous as
> suppress_hint_bits.

I note that heap_get_latest_tid() relies upon the setting of
HEAP_XMIN_COMMITTED in the infomask in lieu of checking any return codes
from the HeapTupleSatisfies check. Sequences also touch on that.

I agree with you that it is potentially dangerous to mess with that, for
now.

I'll look at Read-Only Tables for 8.2 rather than this quick-fix.

Best Regards, Simon Riggs