Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: [PoC] Improve dead tuple storage for lazy vacuum |
Date | |
Msg-id | CAFBsxsGOOuf1Sfi=ByO4WBUhqSUWj+P7xWZpmrMK3BA19pOuUg@mail.gmail.com Whole thread Raw |
In response to | Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: [PoC] Improve dead tuple storage for lazy vacuum
Re: [PoC] Improve dead tuple storage for lazy vacuum Re: [PoC] Improve dead tuple storage for lazy vacuum |
List | pgsql-hackers |
On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: [v3 patch] Hi Masahiko, Since there are new files, and they are pretty large, I've attached most specific review comments and questions as a diff rather than in the email body. This is not a full review, which will take more time -- this is a first pass mostly to aid my understanding, and discuss some of the design and performance implications. I tend to think it's a good idea to avoid most cosmetic review until it's close to commit, but I did mention a couple things that might enhance readability during review. As I mentioned to you off-list, I have some thoughts on the nodes using SIMD: > On Thu, Jun 16, 2022 at 4:30 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > For now, though, I'd like to question > > why we even need to use 32-byte registers in the first place. For one, > > the paper referenced has 16-pointer nodes, but none for 32 (next level > > is 48 and uses a different method to find the index of the next > > pointer). Andres' prototype has 32-pointer nodes, but in a quick read > > of his patch a couple weeks ago I don't recall a reason mentioned for > > it. > > I might be wrong but since AVX2 instruction set is introduced in > Haswell microarchitecture in 2013 and the referenced paper is > published in the same year, the art didn't use AVX2 instruction set. Sure, but with a bit of work the same technique could be done on that node size with two 16-byte registers. > 32-pointer nodes are better from a memory perspective as you > mentioned. Andres' prototype supports both 16-pointer nodes and > 32-pointer nodes (out of 6 node types). This would provide better > memory usage but on the other hand, it would also bring overhead of > switching the node type. Right, using more node types provides smaller increments of node size. Just changing node type can be better or worse, depending on the input. > Anyway, it's an important design decision to > support which size of node to support. It should be done based on > experiment results and documented. Agreed. I would add that in the first step, we want something straightforward to read and easy to integrate into our codebase. I suspect other optimizations would be worth a lot more than using AVX2: - collapsing inner nodes - taking care when constructing the key (more on this when we integrate with VACUUM) ...and a couple Andres mentioned: - memory management: in https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de - node dispatch: https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de Therefore, I would suggest that we use SSE2 only, because: - portability is very easy - to avoid a performance hit from indirecting through a function pointer When the PG16 cycle opens, I will work separately on ensuring the portability of using SSE2, so you can focus on other aspects. I think it would be a good idea to have both node16 and node32 for testing. During benchmarking we can delete one or the other and play with the other thresholds a bit. Ideally, node16 and node32 would have the same code with a different loop count (1 or 2). More generally, there is too much duplication of code (noted by Andres in his PoC), and there are many variable names with the node size embedded. This is a bit tricky to make more general, so we don't need to try it yet, but ideally we would have something similar to: switch (node->kind) // todo: inspect tagged pointer { case RADIX_TREE_NODE_KIND_4: idx = node_search_eq(node, chunk, 4); do_action(node, idx, 4, ...); break; case RADIX_TREE_NODE_KIND_32: idx = node_search_eq(node, chunk, 32); do_action(node, idx, 32, ...); ... } static pg_alwaysinline void node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout) { if (node_fanout <= SIMPLE_LOOP_THRESHOLD) // do simple loop with (node_simple *) node; else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD) // do vectorized loop where available with (node_vec *) node; ... } ...and let the compiler do loop unrolling and branch removal. Not sure how difficult this is to do, but something to think about. Another thought: for non-x86 platforms, the SIMD nodes degenerate to "simple loop", and looping over up to 32 elements is not great (although possibly okay). We could do binary search, but that has bad branch prediction. -- John Naylor EDB: http://www.enterprisedb.com
Attachment
pgsql-hackers by date: