Thread: Writing Commit Status hint bits (was Re: [HACKERS] Constant WAL replay)
On Sun, 2005-04-24 at 02:28 -0400, Tom Lane wrote: > In the current code there is no such thing as a hard read-only behavior > --- for example we will try to update commit-status hint bits no matter > what. Allowing that to be turned off would be interesting for a number > of purposes, such as burning a database onto CD. Short patch enclosed to turn off writing of commit-status hint bits. I've called this "cache_txn_status_with_data" but I'm open to other suggestions... This should allow migration of older child tables to hierarchical storage when using a large historical table design. Best Regards, Simon Riggs
Attachment
Simon Riggs <simon@2ndquadrant.com> writes: > Short patch enclosed to turn off writing of commit-status hint bits. Doesn't this entirely destroy the ability to truncate clog, and therefore the ability to survive XID wraparound? It probably also breaks subxact and multixact logging, but I haven't looked closely... regards, tom lane
On Tue, 2005-07-19 at 22:24 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Short patch enclosed to turn off writing of commit-status hint bits. > > Doesn't this entirely destroy the ability to truncate clog, and > therefore the ability to survive XID wraparound? I hope not for all our sakes, since the hint bits are not WAL logged and anything that relies upon them would be fragile. If we were going to move a table to WORM storage, then we'd have done VACUUM FREEZE first anyway, which would be required to avoid XID wraparound. This patch would then guarantee that no further system-initiated writes take place. Not AFAICS.... comments from vacuum.c included: /* * Now scan all the pages that we moved tuples onto and update tuple * status bits. This is not really necessary, but will save time for * future transactions examining these tuples. */ update_hint_bits(onerel, fraged_pages, num_fraged_pages, last_move_dest_block, num_moved); then again on the function itself... /* * update_hint_bits() -- update hint bits in destination pages * * Scan all the pages that we moved tuples onto and update tuple status bits. * This is normally not really necessary, but it will save time for future * transactions examining these tuples. * * * For the non-freeze case, one wonders whether it wouldn't be better to skip * this work entirely, and let the tuple status updates happen someplace * that's not holding an exclusive lock on the relation. */ > It probably also breaks subxact and multixact logging, but I haven't > looked closely... AFAIK this has nothing to do with that, since those locks are not persistent across a crash,,,but conceivably something in 2PC might be effected. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > On Tue, 2005-07-19 at 22:24 -0400, Tom Lane wrote: >> Simon Riggs <simon@2ndquadrant.com> writes: >>> Short patch enclosed to turn off writing of commit-status hint bits. >> >> Doesn't this entirely destroy the ability to truncate clog, and >> therefore the ability to survive XID wraparound? > I hope not for all our sakes, since the hint bits are not WAL logged and > anything that relies upon them would be fragile. We don't rely on any one write of them to work, but that doesn't mean that we can indefinitely postpone writing them. regards, tom lane
On Wed, 2005-07-20 at 09:24 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Tue, 2005-07-19 at 22:24 -0400, Tom Lane wrote: > >> Simon Riggs <simon@2ndquadrant.com> writes: > >>> Short patch enclosed to turn off writing of commit-status hint bits. > >> > >> Doesn't this entirely destroy the ability to truncate clog, and > >> therefore the ability to survive XID wraparound? > > > I hope not for all our sakes, since the hint bits are not WAL logged and > > anything that relies upon them would be fragile. > > We don't rely on any one write of them to work, but that doesn't mean > that we can indefinitely postpone writing them. OK, I think I understand where you're coming from now. A table can't be migrated to read-only media until all of its tuples have an xmin of FrozenTransactionId; I said the following, which was wrong: "This should allow migration of older child tables to hierarchical storage when using a large historical table design." The patch doesn't directly contribute to that goal, though is of value in a large historical table design with many read only child tables (and other situations). But that comment wasn't the only inspiration for the patch. When VACUUM freezes the xid, it *does* make sense at that point to update the hint bits as a performance optimization. That isn't required though, and *can* be indefinitely postponed, AFAICS. All of the tqual routines will still work just as well without the hint bits set. If you know different, I'll need a deeper explanation before I understand. IMHO, the direction the patch is going in is still worthwhile because of these issues: 1. Any block read may attempt to set hint bits, which dirties the block and must be written out. So *reads* can result in heavier write activity at checkpoint time. That effects both OLTP and DW systems: Random read transactions against a large table will be worst effected, since we may end up writing the block once for each read. 2. A lazy vacuum may also dirty a block, even when it has done nothing else useful to that block. Worse, if we vacuum a table that is bigger than shared_buffers (or close), then we will end up having to evict dirty buffers that the vacuum itself has written in order to continue the vacuum. Since Vacuum is two-pass, we may end up writing a block *twice*, once where we set the hint bits and then again later where we remove the tuples and re-write. So this patch will allow a normal VACUUM to perform better on larger tables. (1) is a pain, but there's no point solving it without also solving (2). The patch would fail an Assert test during a VACUUM, since the info bits are not actually set if cache_txn_status_with_data = false during VACUUM. As a result of (2), perhaps we should remove all of the SetBufferCommitInfoNeedsSave calls in HeapTupleSatisfiesVacuum, and add a call to SetBufferCommitInfoNeedsSave that overrides cache_txn_status_with_data when we actually freeze a row. (Or perhaps that should be a VACUUM FAST command?) That way we would set the hint bits *only* when we freeze a row and not at any other time. If we further reduced the number of times we dirty the block *at all* on the first pass of a VACUUM, we would reduce the chance of writing twice. We could save the setting of frozen transactions until the second phase, i.e. only dirty the block if (pgchanged && vacrelstats->num_dead_tuples > prev_dead_count) The patch sets cache_txn_status_with_data as a USERSET, with the intention that particular read-only users would not wish to have their read-only transactions turn into write transactions. There was no intention to prevent VACUUM, not to avoid the optimisation of writing hint bits on a VACUUM FREEZE nor to set full read only status - which we discussed previously but is a much longer project. Sorry for any confusion caused in my initial patch submission. Does my longer explanation make sense of what the patch is trying to achieve. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2005-07-20 at 09:24 -0400, Tom Lane wrote: >> We don't rely on any one write of them to work, but that doesn't mean >> that we can indefinitely postpone writing them. > OK, I think I understand where you're coming from now. Apparently not :-( > When VACUUM freezes the xid, it *does* make sense at that point to > update the hint bits as a performance optimization. The hint bits are not really relevant when xmin = FrozenTransactionId, since any examiner of the tuple would consider that XID committed anyway. Besides, the hint bit is guaranteed set in that scenario; note the Assert where vacuum is setting it: HeapTupleHeaderSetXmin(tuple.t_data, FrozenTransactionId); /* infomask should be okay already */ Assert(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED); The scenario in which the hint bit *must* be set is where it is for an XID for which we have deleted the relevant section of pg_clog, which we are willing to do well before freeze occurs, if we know that all the relevant XIDs have been hinted. See TruncateCLOG. Your patch breaks that logic by not considering hint-bit updates as changes that must be flushed to disk by checkpoint. > 1. Any block read may attempt to set hint bits, which dirties the block > and must be written out. So *reads* can result in heavier write activity > at checkpoint time. Sure, but the alternative is heavier activity in repeated checks of pg_clog to find out commit state that a previous examiner of the tuple already found out. The patch supposes that one write is worse than N reads, which is clearly a loss at some not-exceedingly-large value of N. If we thought that was a good tradeoff, we might as well not have the hint bits at all. regards, tom lane
On Wed, 2005-07-20 at 13:20 -0400, Tom Lane wrote: > > When VACUUM freezes the xid, it *does* make sense at that point to > > update the hint bits as a performance optimization. > > The hint bits are not really relevant when xmin = FrozenTransactionId, > since any examiner of the tuple would consider that XID committed anyway. > Besides, the hint bit is guaranteed set in that scenario; note the > Assert where vacuum is setting it: > > HeapTupleHeaderSetXmin(tuple.t_data, FrozenTransactionId); > /* infomask should be okay already */ > Assert(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED); > The scenario in which the hint bit *must* be set is where it is for > an XID for which we have deleted the relevant section of pg_clog, > which we are willing to do well before freeze occurs, if we know that > all the relevant XIDs have been hinted. See TruncateCLOG. Your patch > breaks that logic by not considering hint-bit updates as changes that > must be flushed to disk by checkpoint. OK, I see that now. Thank you for showing me that. I will change the patch so that this does not prevent VACUUM from setting hint bits. All of this does raise other questions, but thats probably sensible to raise those on other threads. > > 1. Any block read may attempt to set hint bits, which dirties the block > > and must be written out. So *reads* can result in heavier write activity > > at checkpoint time. > > Sure, but the alternative is heavier activity in repeated checks of > pg_clog to find out commit state that a previous examiner of the tuple > already found out. The patch supposes that one write is worse than N > reads, which is clearly a loss at some not-exceedingly-large value of N. > If we thought that was a good tradeoff, we might as well not have the > hint bits at all. That is not a decision we can make without knowledge of the application. In general, the hint bits are good. In *some* cases, not. I still seek control over that as a designer. Specifically, the scenario I want to optimize is this: - we load a table with lots of real time measurement data, as one child out of a large number of similar child tables - we then immediately create summary tables from the measurements - after this the detailed data is only sporadically accessed, if ever - detail data is dropped from the database after a few weeks - the majority of the database is detail data, so those tables are never vacuumed since no rows are ever deleted from those tables (the tables are dropped) nor is access sufficiently frequent to make it sensible to set hint bits - hence *no* complete database vacuum is run on a regular basis In this design, the second step causes the whole detailed data table to be written out to the database (again) immediately after loading. I would like to be able to prevent that. That design is *not* uncommon, since we might describe it as a "regulatory/monitoring data archiving" application. I have two current clients with exactly this design, plus have worked on similar apps in the past. The PostgreSQL licence cost makes it particularly suitable for that kind of application. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > In general, the hint bits are good. In *some* cases, not. I still seek > control over that as a designer. > Specifically, the scenario I want to optimize is this: > - we load a table with lots of real time measurement data, as one child > out of a large number of similar child tables > - we then immediately create summary tables from the measurements > - after this the detailed data is only sporadically accessed, if ever > - detail data is dropped from the database after a few weeks > - the majority of the database is detail data, so those tables are never > vacuumed since no rows are ever deleted from those tables (the tables > are dropped) nor is access sufficiently frequent to make it sensible to > set hint bits - hence *no* complete database vacuum is run on a regular > basis > In this design, the second step causes the whole detailed data table to > be written out to the database (again) immediately after loading. I > would like to be able to prevent that. This application seems sufficiently off the beaten path to not be worth solving with a knob as klugy and potentially dangerous as suppress_hint_bits. A better way to avoid the write pass, if you must, is to do the summary table creation in the same transaction that loads the data. regards, tom lane
On Fri, 2005-07-22 at 09:40 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > In general, the hint bits are good. In *some* cases, not. I still seek > > control over that as a designer. ... > ... not be worth > solving with a knob as klugy and potentially dangerous as > suppress_hint_bits. I note that heap_get_latest_tid() relies upon the setting of HEAP_XMIN_COMMITTED in the infomask in lieu of checking any return codes from the HeapTupleSatisfies check. Sequences also touch on that. I agree with you that it is potentially dangerous to mess with that, for now. I'll look at Read-Only Tables for 8.2 rather than this quick-fix. Best Regards, Simon Riggs