Re: [HACKERS] Hint Bits and Write I/O - Mailing list pgsql-patches
From | Simon Riggs |
---|---|
Subject | Re: [HACKERS] Hint Bits and Write I/O |
Date | |
Msg-id | 1213797209.9468.126.camel@ebony.site Whole thread Raw |
Responses |
Re: [HACKERS] Hint Bits and Write I/O
Re: [HACKERS] Hint Bits and Write I/O Re: [HACKERS] Hint Bits and Write I/O |
List | pgsql-patches |
On Tue, 2008-05-27 at 19:32 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > My proposal is to have this as a two-stage process. When we set the hint > > on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not > > already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY > > then we mark it BM_DIRTY. > > I wonder if it is worth actually counting the number of newly set hint > bits, rather than just having a counter that saturates at two. We could > steal a byte from usage_count without making the buffer headers bigger. > > > If the bgwriter has time, it will write out BM_DIRTY_HINTONLY buffers, > > though on a consistently busy server this should not occur. > > What do you mean by "if it has time"? How would it know that? > > > This won't change the behaviour of first-read-after-copy. To improve > > that behaviour, I suggest that we only move from BM_DIRTY_HINTONLY to > > BM_DIRTY when we are setting the hint for a new xid. If we are just > > setting the same xid over-and-over again then we should avoid setting > > the page dirty. So when data has been loaded via COPY, we will just > > check the status of the xid once, then scan the whole page using the > > single-item transaction cache. > > This doesn't make any sense to me. What is a "new xid"? And what is > "setting the same xid over and over"? If a page is full of occurrences > of the same xid, that doesn't really mean that it's less useful to > correctly hint each occurrence. > > > The whole proposal seems a bit overly complicated. What we talked about > at PGCon was simply not setting the dirtybit when setting a hint bit. > There's a certain amount of self-optimization there: if a page > continually receives hint bit updates, that also means it is getting > pinned and hence its usage_count stays high, thus it will tend to stay > in shared buffers until something happens to make it really dirty. > (Although that argument might not hold water for a bulk seqscan: you'll > have hinted all the tuples and then very possibly throw the page away > immediately. So counting the hints and eventually deciding we did > enough to justify dirtying the page might be worth doing.) WIP patch, for discussion and performance evaluation. This patch changes the way that buffer hints are managed. We can split the patch conceptually into two parts: how we set buffer hints and what happens to buffers for which hints have been set. With this patch, when we set a hint on a buffer we don't dirty the buffer, though we keep track of hint_count for each buffer. Note that there is no additional block state, no BM_SET_HINT_BITS etc.. When running a VACUUM command we always dirty the block when setting hint bits, for a number of reasons: * VACUUM FULL expects all hint bits to be set prior to moving tuples * Setting all hint bits allows us to truncate the clog * it forces the VACUUM to write out its own dirty buffers, which is OK, since it is a background process. Other commands call HeapTupleSatisfiesVacuum(), yet these tasks can be more flexible with hint bit setting. These include ANALYZE, CREATE INDEX, CLUSTER, HOT pruning and index scan marking deleted tuples (with changes in all index AMs). This means we have to differentiate between VACUUM and other callers of HeapTupleSatisfiesVacuum(). So the patch changes the APIs of HeapTupleSatisfiesVacuum(), SetBufferCommitInfoNeedsSave() and SetHintBits() with changes to 13 AM and command files. There are many changes in tqual.c, which seems the right way because SetHintBits() is inlined. These make the patch fairly large, though most of it is simple changes. Hints are set on buffers in various ways, most of these ways relate to tuples, though there are other hints which touch the block itself. This patch takes the simply assumes that all hints are equal. We might later choose, for example, that index LP_DEAD hints are worth more than heap visibility hints. We might also argue that CLUSTER hints should *never* set the page dirty, since those will be wasted writes if the CLUSTER's transaction commits. We don't yet do anything like that in buffer usage_count, so no need to get that complex yet in this case. So what happens to buffers that have hint_count > 0? During the bgwriter's normal LRU scan we write out non-dirty buffers when the number of buffer hints is greater than or equal to a new GUC parameter called bgwriter_hints_force_write. The default and minimum value for this parameter is 1, so very similar to existing behaviour. Expected settings would be 2-5, possibly as high as 20, though those are just educated guesses. So the maximum is set arbitrarily as 100. During a checkpoint we write out only fully dirty buffers, which are the only ones required for correctness. The bgwriter does interrupt itself to re-commence the LRU scan, so the bgwriter may still write hinted blocks during the time a checkpoint is in progress. Doing it this way ensures we don't confuse the two separate goals of correctness and performance, which currently contributes to the peak write workload during checkpoint. When buffer replacement takes place in a backend, we only write out a victim buffer if it is dirty. We *never* write out non-dirty victim buffers, whatever their hint count. This speeds up the path for backends if/when the bgwriter is ineffective and is a change from the current behaviour. It also avoids adding to the most frequent code path when blocks aren't hinted at all. Temp buffers are never dirtied by hint bit setting. Most temp tables are written in a single command, so that re-accessing clog for temp tuples is seldom costly. This also changes current behaviour. We might imagine a slightly smarter bgwriter algorithm or an auto-tuning mechanism, but this patch is intended to allow the basic principles to be tested and proven before we worry about such niceties. I've got a few ideas, just as you probably have by now while reading this. We set hint_count = 0 only when buffer allocated or assigned from freelist, minimising additional management overhead. hint_count is ignored once the buffer has been dirtied. All of this should have the effect that a table scan on a newly loaded table will avoid re-writing the table. The scan sets the hints but doesn't dirty the block, so when it loops around to reuse buffers it will not need to write them out. The bgwriter may clean some of the blocks, if it has time, but that's it. There is one minor strangeness in the patch, which is the change of initdb's command order when "vacuuming database template1". With the previous ordering of ANALYZE; VACUUM FULL; VACUUM; the flexible hint bit setting of the ANALYZE on a freshly bootstrapped database caused a *consistent* error during the VACUUM FULL which follows it. That took, (cough, splutter), a little while to resolve. I've added that as a test to the vacuum regression tests and not found another error (yet?). An interesting mystery though. :-) Some instrumentation can be added - I'll add what people think is useful for attempting to prove this is effective. I am expecting the patch to reduce I/O on both large and small databases, though possibly at the expense of scalability. Let's see. Comments and performance test results, please. I expect to continue working on the patch for a while yet. backend/access/gist/gistget.c | 4 backend/access/hash/hash.c | 6 ! backend/access/heap/heapam.c | 6 ! backend/access/heap/pruneheap.c | 10 !! backend/access/index/indexam.c | 7 ! backend/access/nbtree/nbtinsert.c | 6 ! backend/access/nbtree/nbtree.c | 2 backend/access/nbtree/nbtutils.c | 6 ! backend/catalog/index.c | 3 backend/commands/analyze.c | 9 ! backend/commands/cluster.c | 6 ! backend/commands/vacuum.c | 10 !! backend/commands/vacuumlazy.c | 6 ! backend/storage/buffer/buf_init.c | 1 backend/storage/buffer/bufmgr.c | 69 +!!!!!!!!!!!!! backend/storage/buffer/localbuf.c | 1 backend/utils/misc/guc.c | 9 + backend/utils/misc/postgresql.conf.sample | 2 backend/utils/time/tqual.c | 140 !!!!!!!!!!!!.... bin/initdb/initdb.c | 2 include/storage/buf_internals.h | 3 include/storage/bufmgr.h | 3 include/utils/tqual.h | 4 test/regress/expected/vacuum.out | 2 test/regress/sql/vacuum.sql | 2 25 files changed, 39 insertions(+), 280 modifications(!) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Attachment
pgsql-patches by date: