Re: The Free Space Map: Problems and Opportunities - Mailing list pgsql-hackers
|  |  |
|---|---|
| From | Peter Geoghegan |
| Subject | Re: The Free Space Map: Problems and Opportunities |
| Date |  |
| Msg-id | CAH2-Wz=sSvMX5HWo8d=DFznid0BiTySCm-zhHUsnC4UP-OaXbA@mail.gmail.com |
| In response to | Re: The Free Space Map: Problems and Opportunities (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: The Free Space Map: Problems and Opportunities |
| List | pgsql-hackers |
On Tue, Sep 7, 2021 at 12:31 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Doing work in the background has some advantages, though. In
> particular, it has the possibly-large advantage of not slowing down
> foreground work.

What I really like about the idea of doing the work in the foreground (when it's clearly not going to be too much of a latency hit) is that it would act as a natural form of backpressure. Sometimes slowing things down is a good thing -- fundamentally inefficient and unsustainable use of the system's resources ought to be slowed down by imposing the cost on the transaction that performs the relevant insert queries.

To me this seems like a problem of negative externalities. With small transactions we can pretty much know for sure that the benefit of not leaving pages in good shape at the moment they become "closed" is fixed and low, while the cost to the system as a whole over time is potentially far higher. It seems almost unbounded, in a way, so we're bound to lose in the long run, regardless of workload. (The work can be piggybacked across transactions to make it more efficient, but that's mostly an implementation detail.)

Updaters might actually be much less of a problem than inserters. They already keep costs down by pruning opportunistically -- they at least try. Most individual logical rows that are inserted have exactly zero chance of ever being updated, but a ~100% chance of needing to have hint bits set sooner or later. We know very little about workload characteristics, but these are some of the few things that we really do know for sure.

> For me the key insight here is that HOT-pruning a heap page is a lot
> cheaper if you do it before you write the page. Once you've written
> the page, the eventual HOT-prune is going to need to dirty it, which
> will cause it to be written again. If you prune before writing it the
> first time, that cost is avoided.

Right. The difficulty with applying that insight before now has been making the prune operation take place at the level of whole pages systematically. I believe that it's quite valuable to set all the hint bits on a page as we "close it out" for the first time. But setting hint bits for all of the heap tuples on the page except one or two is pretty much useless. You might as well not bother at all.

It follows that concurrent inserters clobbering the same page senselessly are more or less not okay if we're going to pursue this technique. You might still have a concurrent update of one of the inserted rows shortly after the original inserting transaction commits, but before the page is closed -- that might make the idea not work out for that one page (obviously it depends on the specifics). That seems like something we can live with, though.

> I'm not sure that it really matters
> whether the space consumed by aborted tuples gets reused by "the very
> next transaction" or, say, 10 transactions after that, or even 1000
> transactions after that. If you wait for a million transactions,
> you've quite possibly lost enough temporality to matter, but at 10 or
> 1000 that's much less likely.

I didn't mean to suggest that it had to happen in perfect lockstep. But I do think that it should be pretty close to perfect. What I actually do right now is prune an open page when it *appears* to be full inside the loop in RelationGetBufferForTuple(). As crude as that is, it sets hint bits and prunes away aborted transactions in mostly the right way, at least as far as TPC-C is concerned. (This only works because it fits into the wider framework of my much larger patch, which introduces ownership semantics, open and closed pages, empty page freelists, etc.)
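To show where that hook sits, here is a minimal sketch in C. It is not the actual patch: heap_page_close_out() is an assumed placeholder name for the real prune-plus-hint-bit work, and the locking and page-ownership details that RelationGetBufferForTuple() really has to handle are omitted.

```c
/*
 * Hypothetical sketch (not the actual patch): close out an "open" heap page
 * at the moment the buffer-selection loop would otherwise give up on it for
 * lack of space.  heap_page_close_out() is an assumed helper standing in for
 * the real work: prune away aborted tuples and set hint bits for every tuple
 * on the page.
 */
#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* Assumed helper: prune the page and set hint bits for all of its tuples. */
extern void heap_page_close_out(Relation relation, Buffer buffer);

static bool
try_page_before_closing(Relation relation, Buffer buffer, Size needed)
{
	Page		page = BufferGetPage(buffer);

	/* Page still has room: caller can place the new tuple here. */
	if (PageGetHeapFreeSpace(page) >= needed)
		return true;

	/*
	 * Page *appears* full.  Close it out now, while it is still dirty in
	 * shared buffers: prune aborted tuples and set hint bits, so that the
	 * page's eventual write is the only write it needs.
	 */
	heap_page_close_out(relation, buffer);

	/* Pruning may have freed enough space after all. */
	return PageGetHeapFreeSpace(page) >= needed;
}
```

The only point of the sketch is the timing: the close-out happens before the page is ever written out, which is exactly what makes the prune cheap.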
> The exact threshold is fuzzy: every
> moment you wait makes it less likely that you have sufficient
> locality, but you can always find a workload where even a very long
> wait is acceptable, and another one where even a tiny delay is
> catastrophic, and it's hard to say what the "real world" looks like.

I agree that offloading work to the background can be very useful, and that technique needs to be possible as future work (I actually really want this myself). But most individual apps probably won't really notice the latency hit from pruning if it's well managed -- as with pruning by updates today. We can make a decision about which strategy to use (lightweight foreground processing with a potential latency hit, or background batch cleanup?) at a very late point in the cycle, based on the specifics of each case.

Offhand, I don't think that we need to characterize this fuzzy threshold in too much detail. I suspect that the main determinative question can be simple enough. Something like: does this transaction (or this group of transactions) involve only trivial changes to maybe 3 - 5 heap pages at the most, or not?

There is a separate question about what happens if our background cleanup process cannot keep up over time. I think that we can handle that by falling back on foreground processing (within reason). Let's say we have transactions that each insert onto dozens or perhaps even hundreds of heap pages -- these are "medium sized" write transactions. If we're already doing way too much background processing for similar transactions from earlier on, and it's clear that we just can't keep up, maybe we can fall back and force foreground processing, to keep the debt under control. (In general I think that this backpressure, negative-externalities business will be important.)

It seems to me that this leaves one harder question unanswered: at what point does a "medium sized" transaction become so large that it just doesn't make sense to do either? What's the crossover point at which background processing and foreground processing like this should both be assumed to be not worth it? That I won't speculate about just yet. I suspect that at some point it really does make sense to leave it all up to a true table-level batch operation, like a conventional VACUUM.
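As a rough illustration of that kind of policy -- my own sketch, not anything in the patch -- the choice could reduce to a tiered heuristic keyed on how many heap pages the transaction (or group of transactions) touched, with the backpressure fallback folded in. The enum, function, and thresholds here are invented:

```c
/*
 * Rough sketch of a tiered cleanup policy, keyed on how many heap pages a
 * (group of) transaction(s) dirtied.  The names and thresholds are invented
 * for illustration only; a real policy would also have to look at how far
 * behind the background worker already is, and much more.
 */
#include "postgres.h"

typedef enum CleanupStrategy
{
	CLEANUP_FOREGROUND,			/* prune and set hint bits in the foreground */
	CLEANUP_BACKGROUND,			/* hand the pages to a background worker */
	CLEANUP_LEAVE_TO_VACUUM		/* too large; leave it to table-level VACUUM */
} CleanupStrategy;

static CleanupStrategy
choose_cleanup_strategy(int npages_touched, bool background_falling_behind)
{
	/* Trivial transactions: the latency hit is negligible, so pay it now. */
	if (npages_touched <= 5)
		return CLEANUP_FOREGROUND;

	/* "Medium sized" transactions: dozens to hundreds of pages. */
	if (npages_touched <= 1000)
	{
		/*
		 * Backpressure: if background processing cannot keep up with the
		 * debt created by similar transactions, force the cost back onto
		 * the foreground transaction that creates it.
		 */
		if (background_falling_behind)
			return CLEANUP_FOREGROUND;
		return CLEANUP_BACKGROUND;
	}

	/* Beyond some crossover point, neither approach is worth it. */
	return CLEANUP_LEAVE_TO_VACUUM;
}
```

Where the last cutoff belongs is exactly the crossover question left open above; the sketch just picks a number to show the structure.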
--
Peter Geoghegan