RE: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers
From: k.jamison@fujitsu.com
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist
Msg-id: OSBPR01MB2341AA2D8F567522456CC018EF490@OSBPR01MB2341.jpnprd01.prod.outlook.com
In response to: Re: [Patch] Optimize dropping of relation buffers using dlist (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Friday, August 7, 2020 12:38 PM, Amit Kapila wrote:

Hi,

> On Thu, Aug 6, 2020 at 6:53 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
> >
> > On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:
> >
> > Hi,
> > Thank you for your constructive review and comments.
> > Sorry for the late reply.
> >
> > > Hi,
> > >
> > > On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > > > Andres Freund <andres@anarazel.de> writes:
> > > > > Indeed. The buffer mapping hashtable already is visible as a
> > > > > major bottleneck in a number of workloads. Even in readonly
> > > > > pgbench if s_b is large enough (so the hashtable is larger than
> > > > > the cache). Not to speak of things like a cached sequential scan
> > > > > with a cheap qual and wide rows.
> > > >
> > > > To be fair, the added overhead is in buffer allocation, not buffer lookup.
> > > > So it shouldn't add cost to fully-cached cases. As Tomas noted
> > > > upthread, the potential trouble spot is where the working set is
> > > > bigger than shared buffers but still fits in RAM (so there's no
> > > > actual I/O needed, but we do still have to shuffle buffers a lot).
> > >
> > > Oh, right, not sure what I was thinking.
> > >
> > > > > Wonder if the temporary fix is just to do explicit hashtable
> > > > > probes for all pages iff the size of the relation is < s_b / 500 or so.
> > > > > That'll address the case where small tables are frequently
> > > > > dropped - and dropping large relations is more expensive from
> > > > > the OS and data loading perspective, so it's not gonna happen as often.
> > > >
> > > > Oooh, interesting idea. We'd need a reliable idea of how large the
> > > > relation is (preferably without adding an lseek call), but
> > > > maybe that's do-able.
> > >
> > > IIRC we already do smgrnblocks nearby, when doing the truncation (to
> > > figure out which segments we need to remove). Perhaps we can arrange
> > > to combine the two?
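[Editorial sketch: the quoted "explicit hashtable probes iff the relation is < s_b / 500" idea, made concrete as toy C. This is illustrative only, not PostgreSQL source: the pool size, BufTag layout, lookup function, and drop_rel_buffers are all invented for the example, and the linear lookup stands in for the real BufTable hash probe.]

```c
#include <assert.h>
#include <stdint.h>

#define NBUFFERS 10000                        /* toy stand-in for shared_buffers (s_b) */
#define BUF_DROP_THRESHOLD (NBUFFERS / 500)   /* Andres's "< s_b / 500" cutoff */

typedef struct BufTag
{
    uint32_t relnode;   /* which relation the cached page belongs to */
    uint32_t blocknum;  /* which block of that relation */
    int      valid;     /* is this buffer slot occupied? */
} BufTag;

BufTag buffers[NBUFFERS];   /* toy buffer pool */

/* Toy stand-in for a BufTable hash probe (linear scan here for brevity). */
int lookup_buffer(uint32_t relnode, uint32_t blocknum)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (buffers[i].valid && buffers[i].relnode == relnode &&
            buffers[i].blocknum == blocknum)
            return i;
    return -1;
}

/* Invalidate all cached blocks of one relation; returns how many were dropped. */
int drop_rel_buffers(uint32_t relnode, uint32_t nblocks)
{
    int dropped = 0;

    if (nblocks < BUF_DROP_THRESHOLD)
    {
        /* Small relation: one targeted probe per block, O(nblocks) work. */
        for (uint32_t blk = 0; blk < nblocks; blk++)
        {
            int i = lookup_buffer(relnode, blk);
            if (i >= 0)
            {
                buffers[i].valid = 0;
                dropped++;
            }
        }
    }
    else
    {
        /* Large relation: fall back to scanning every buffer header, O(NBUFFERS). */
        for (int i = 0; i < NBUFFERS; i++)
        {
            if (buffers[i].valid && buffers[i].relnode == relnode)
            {
                buffers[i].valid = 0;
                dropped++;
            }
        }
    }
    return dropped;
}
```

[The point is the trade the thread describes: the probe path bounds the work for small, frequently dropped relations, while large drops eat the full scan, which is acceptable since they are rare and dominated by OS-level cost anyway.]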
> > > The layering probably makes that somewhat ugly :(
> > >
> > > We could also just use pg_class.relpages. It'll probably mostly be
> > > accurate enough?
> > >
> > > Or we could just cache the result of the last smgrnblocks call...
> > >
> > > One of the cases where this type of strategy is most interesting to
> > > me is the partial truncations that autovacuum does... There we even
> > > know the range of tables ahead of time.
> >
> > Konstantin tested it on various workloads and saw no regression.
> > But I understand the sentiment on the added overhead on BufferAlloc.
> > Regarding the case where the patch would potentially affect workloads
> > that fit into RAM but not into shared buffers, could one of Andres'
> > suggested ideas above address that, in addition to this patch's
> > possible shared invalidation fix? Could that settle the added overhead
> > in BufferAlloc() as a temporary fix?
>
> Yes, I think so, because as far as I can understand he is suggesting to do
> changes only in the Truncate/Vacuum code path. Basically, I think you need
> to change DropRelFileNodeBuffers or similar functions.
> There shouldn't be any change in BufferAlloc or the common code path, so
> there is no question of regression in such cases. I am not sure what you
> have in mind for this, but feel free to clarify your understanding before
> implementation.
>
> > Thomas Munro is also working on caching relation sizes [1]; maybe that
> > way we could get the latest known relation size. Currently, it's
> > possible only during recovery in smgrnblocks.
> >
> > Tom Lane wrote:
> > > But aside from that, I noted a number of things I didn't like a bit:
> > >
> > > * The amount of new shared memory this needs seems several orders of
> > > magnitude higher than what I'd call acceptable: according to my
> > > measurements it's over 10KB per shared buffer!
> > > Most of that is
> > > going into the CachedBufTableLock data structure, which seems
> > > fundamentally misdesigned --- how could we be needing a lock per map
> > > partition *per buffer*? For comparison, the space used by
> > > buf_table.c is about 56 bytes per shared buffer; I think this needs
> > > to stay at least within hailing distance of there.
> > >
> > > * It is fairly suspicious that the new data structure is manipulated
> > > while holding per-partition locks for the existing buffer hashtable.
> > > At best that seems bad for concurrency, and at worst it could result
> > > in deadlocks, because I doubt we can assume that the new hash table
> > > has partition boundaries identical to the old one.
> > >
> > > * More generally, it seems like really poor design that this has
> > > been written completely independently of the existing buffer hash
> > > table. Can't we get any benefit by merging them somehow?
> >
> > The original aim is to just shorten the recovery process, and the
> > speedups of both the vacuum and truncate processes are just an added
> > bonus. Given that we don't have a shared invalidation mechanism in
> > place yet, like radix tree buffer mapping, which is complex, I thought
> > a patch like mine could be an alternative approach to that. So I want
> > to improve the patch further.
> > I hope you can help me clarify the direction, so that I can avoid
> > going farther away from what the community wants:
> > 1. Both normal operations and recovery process
> > 2. Improve recovery process only
>
> I feel Andres's suggestion will help in both cases.
>
> > > I wonder if you have considered the case of a local hash (maintained
> > > only during recovery)?
> > > If there is after-crash recovery, then there will be no concurrent
> > > access to shared buffers and this hash will be up-to-date.
> > > In case of a hot-standby replica we can use some simple invalidation
> > > (just one flag or counter which indicates that the buffer cache was
> > > updated).
> > > This hash also can be constructed on demand when
> > > DropRelFileNodeBuffers is called the first time (so we have to scan
> > > all buffers once, but subsequent drop operations will be fast).
> >
> > I'm examining this, but I am not sure if I got the correct
> > understanding. Please correct me if I'm wrong.
> > I think the above is a suggestion wherein the postgres startup process
> > uses a local hash table to keep track of the buffers of relations.
> > Since there may be other read-only sessions which read from disk,
> > evict cached blocks, and modify the shared_buffers, the flag would be
> > updated. We could do it during recovery, then release it as recovery
> > completes.
> >
> > I haven't looked deeply yet into the source code, but maybe we can
> > modify the REDO (main redo do-while loop) in StartupXLOG() once the
> > read-only connections are consistent.
> > It would also be beneficial to construct this local hash when
> > DropRelFileNodeBuffers() is called for the first time, so the whole
> > shared buffers is scanned initially, then as you mentioned subsequent
> > dropping will be fast. (similar behavior to what the patch does)
> >
> > Do you think this is feasible to be implemented? Or should we explore
> > another approach?
>
> I think we should try what Andres is suggesting, as that seems like a
> promising idea and can address most of the common problems in this area,
> but if you feel otherwise, then do let us know.
>
> --
> With Regards,
> Amit Kapila.

Hi, thank you for the review.

I just wanted to confirm the direction so that I can hopefully cover it in
the patch revision. Basically, we don't want the added overhead in
BufferAlloc(), so I'll just make a way to get both the last known relation
size and nblocks, and modify the operations for dropping of relation
buffers, based on the comments and suggestions of the reviewers.
Hopefully I can also provide performance test results by the next CF.

Regards,
Kirk Jamison
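[Editorial sketch: Kirk's stated plan, getting the last known relation size without an extra lseek, amounts to caching the result of the last smgrnblocks call per relation, as Andres suggested upthread. Everything below is an assumption for illustration, not the actual patch: the struct layout, the smgr_cached_nblocks field, and both function names are invented.]

```c
#include <assert.h>
#include <stdint.h>

#define InvalidBlockNumber ((uint32_t) 0xFFFFFFFF)  /* "size not known yet" */

typedef struct SMgrRelationData
{
    uint32_t smgr_cached_nblocks;  /* hypothetical cached size, in blocks */
    uint32_t phys_nblocks;         /* stand-in for what lseek would report */
} SMgrRelationData;

/* Return the cached size if known; otherwise "ask the kernel" once and remember it. */
uint32_t smgrnblocks_cached(SMgrRelationData *reln)
{
    if (reln->smgr_cached_nblocks != InvalidBlockNumber)
        return reln->smgr_cached_nblocks;
    reln->smgr_cached_nblocks = reln->phys_nblocks;  /* the lseek stand-in */
    return reln->smgr_cached_nblocks;
}

/* Truncation (e.g. by vacuum) must keep the cached value coherent. */
void smgrtruncate_cached(SMgrRelationData *reln, uint32_t nblocks)
{
    reln->phys_nblocks = nblocks;
    reln->smgr_cached_nblocks = nblocks;
}
```

[The catch is coherence: any operation that extends or truncates the relation must update or invalidate the cached value, which is why the thread notes this is currently reliable only during recovery, where the startup process is the sole writer.]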