Re: The Free Space Map: Problems and Opportunities - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: The Free Space Map: Problems and Opportunities |
Msg-id | CA+TgmoaHWLXkwiXqyAEXPbfRuOKrHtMtsV1Sr-77nTzhHJkLXg@mail.gmail.com |
In response to | Re: The Free Space Map: Problems and Opportunities (Andres Freund &lt;andres@anarazel.de&gt;) |
Responses | Re: The Free Space Map: Problems and Opportunities |
List | pgsql-hackers |
On Tue, Aug 17, 2021 at 9:18 AM Andres Freund <andres@anarazel.de> wrote:
> > Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
> > ever end up inserting *any* new logical rows on a heap page after the
> > page first fills -- it is initially "open" when first allocated, and
> > then quickly becomes "closed" to inserts of new logical rows, once it
> > crosses a certain almost-empty space threshold. The heap page usually
> > stays "closed" forever. While the design does not *completely* ignore
> > the possibility that the page won't eventually get so empty that reuse
> > really does look like a good idea, it makes it rare. A heap page needs
> > to have rather a lot of the original logical rows deleted before that
> > becomes possible again. It's rare for it to go backwards because the
> > threshold that makes a closed page become open again is much lower
> > (i.e. involves much more free space) than the threshold that initially
> > made a (newly allocated) heap page closed. This one-way stickiness
> > quality is important.
>
> I suspect that we'd get a *LOT* of complaints if we introduced that degree of
> stickiness.

I don't know what the right degree of stickiness is, but I can easily believe that we want to have more than none. Part of Peter's point, at least as I understand it, is that if a page has 100 tuples and you delete 1 or 2 or 3 of them, it is not smart to fill up the space thus created with 1 or 2 or 3 new tuples. You should instead hope that space will optimize future HOT updates, or else wait for the page to lose some larger number of tuples and then fill up all the space at the same time with tuples that are, hopefully, related to each other in some useful way. Now what's the threshold? 20 out of 100? 50? 80? That's not really clear to me. But it's probably not 1 out of 100.
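To make the asymmetry concrete, here is a toy sketch of that kind of two-threshold rule. The state names and constants are invented purely for illustration; nothing like this exists in the current FSM code:

```c
/*
 * Illustrative sketch of one-way "open"/"closed" hysteresis for a heap
 * page.  The thresholds are made up: a page closes to new-row inserts
 * once it is nearly full, and only reopens after far more space has
 * been freed than the amount that caused it to close, so noise-level
 * changes never flip the state back and forth.
 */
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE           8192
#define CLOSE_FREE_BYTES    (PAGE_SIZE / 10)    /* close when <10% is free */
#define REOPEN_FREE_BYTES   (PAGE_SIZE / 2)     /* reopen only when >50% is free */

typedef struct PageState
{
    bool        is_open;        /* eligible for inserts of new logical rows? */
} PageState;

static void
update_page_state(PageState *state, uint32_t free_bytes)
{
    if (state->is_open)
    {
        /* An open page closes as soon as it crosses the "almost full" line. */
        if (free_bytes < CLOSE_FREE_BYTES)
            state->is_open = false;
    }
    else
    {
        /*
         * A closed page stays closed until deletes free up much more space
         * than CLOSE_FREE_BYTES.  The gap between the two thresholds is the
         * hysteresis: deleting 1 or 2 tuples out of 100 never reopens it.
         */
        if (free_bytes > REOPEN_FREE_BYTES)
            state->is_open = true;
    }
}
```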
> > This stickiness concept is called "hysteresis" by some DB researchers,
> > often when discussing UNDO stuff [8]. Having *far less* granularity
> > than FSM_CATEGORIES/255 seems essential to make that work as intended.
> > Pages need to be able to settle without being disturbed by noise-level
> > differences. That's all that super fine grained values buy you: more
> > noise, more confusion.
>
> Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
> from a fairly granular FSM_CATEGORIES, while still achieving better
> locality. You could e.g. search for free space for an out-of-page update in
> nearby pages with a lower free-percentage threshold, while having a different
> threshold and selection criteria for placement of inserts.

Yeah, I don't think that reducing FSM_CATEGORIES is likely to have much value for its own sake. But it might be useful as a way of accomplishing some other goal. For example, if we decided that we need a bit per page to track "open" vs. "closed" status or something of that sort, I don't think we'd lose much by having fewer categories.

> The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
> merging FSM and VM is a good direction to go to. Additionally we need to make
> sure that the VM is accurately durable, which isn't the case for the FSM
> (which I think we should use to maintain it more accurately).
>
> But perhaps I'm just missing something here. What is the strong interlink
> between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM
> that the main linkage is on the side of inserts/update that choose a target
> page from the FSM. Isn't it better to make that side smarter?

I don't have a well-formed opinion on this point yet. I am also skeptical of the idea of merging the FSM and VM, because it seems to me that which pages are all-visible and which pages are full are two different things. However, there's something to Peter's point too. An empty page is all-visible yet still valid as an insertion target, but this is not as much of a contradiction to the idea of merging them as it might seem, because marking the empty page as all-visible is not nearly as useful as marking a full page all-visible. An index-only scan can't care about the all-visible status of a page that doesn't contain any tuples, and we're also likely to make the page non-empty in the near future, in which case the work we did to set the all-visible bit and log the change was wasted. The only thing we're accomplishing by making the page all-visible is letting VACUUM skip it, which will work out to a win if it stays empty for multiple VACUUM cycles, but not otherwise.

Conceptually, you might think of the merged data structure as a "quiescence map." Pages that aren't being actively updated are probably all-visible and are probably not great insertion targets. Those that are being actively updated are probably not all-visible and have better chances of being good insertion targets. However, there are a lot of gremlins buried in the word "probably." A page could get full but still not be all-visible for a long time, due to long-running transactions, and it doesn't seem likely to me that we can just ignore that possibility. Nor, on the other hand, does the difference between an old-but-mostly-full page and an old-but-mostly-empty page seem like something we just want to ignore. So I don't know how a merged data structure would actually end up being an improvement, exactly.

> To me this seems like it'd be better addressed by a shared, per-relfilenode,
> in-memory datastructure. Thomas Munro has been working on keeping accurate
> per-relfilenode relation size information. ISTM that that'd be a better
> place to hook in for this.

+1. I had this same thought reading Peter's email. I'm not sure if it makes sense to hook that into Thomas's work, but I think it makes a ton of sense to use shared memory to coordinate transient state like "hey, right now I'm inserting into this block, you guys leave it alone" while using the disk for durable state like "here's how much space this block has left."
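To illustrate the kind of split I mean: transient per-block claims would live only in shared memory, while the on-disk FSM stays the durable record of how much space each block has left. Here is a hypothetical sketch; the structure, field names, and sizes are all invented, and locking is omitted:

```c
/*
 * Hypothetical per-relation table of transient "insertion leases",
 * imagined as living in shared memory.  Nothing durable is kept here;
 * the on-disk FSM still records how much free space each block has.
 * Locking is omitted for brevity.
 */
#include <stdbool.h>
#include <stdint.h>

#define MAX_INSERT_LEASES 64    /* arbitrary size for this sketch */

typedef struct InsertLease
{
    uint32_t    block_number;   /* heap block currently being filled */
    int         backend_id;     /* backend inserting into it */
    bool        in_use;
} InsertLease;

typedef struct RelationInsertState
{
    InsertLease leases[MAX_INSERT_LEASES];  /* one table per relfilenode */
} RelationInsertState;

/*
 * Try to claim a block for this backend's inserts.  Other backends that
 * consult the table skip blocks somebody else has claimed, instead of
 * piling onto whatever target the on-disk FSM happens to suggest.
 */
static bool
claim_block_for_insert(RelationInsertState *state, uint32_t block, int backend_id)
{
    for (int i = 0; i < MAX_INSERT_LEASES; i++)
    {
        if (state->leases[i].in_use && state->leases[i].block_number == block)
            return state->leases[i].backend_id == backend_id;   /* already taken */
    }
    for (int i = 0; i < MAX_INSERT_LEASES; i++)
    {
        if (!state->leases[i].in_use)
        {
            state->leases[i].block_number = block;
            state->leases[i].backend_id = backend_id;
            state->leases[i].in_use = true;
            return true;
        }
    }
    return false;   /* table full; fall back to the plain FSM lookup */
}
```

--
Robert Haas
EDB: http://www.enterprisedb.com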