Re: Parallel CREATE INDEX for BRIN indexes - Mailing list pgsql-hackers
From: Matthias van de Meent
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date:
Msg-id: CAEze2Wg=6w42JiQtGwQvM-nPdgMYNVK-Z85r=_QZ1EPC02E9dQ@mail.gmail.com
In response to: Re: Parallel CREATE INDEX for BRIN indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Parallel CREATE INDEX for BRIN indexes
List: pgsql-hackers
Hi,

On Wed, 22 Nov 2023 at 20:16, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/20/23 20:48, Matthias van de Meent wrote:
>> On Wed, 8 Nov 2023 at 12:03, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> Hi,
>>>
>>> here's an updated patch, addressing the review comments, and reworking
>>> how the work is divided between the workers & leader etc.
>>>
>>
>> After code-only review, here are some comments:
>>
>>> +++ b/src/backend/access/brin/brin.c
>>> [...]
>>> +/* Magic numbers for parallel state sharing */
>>> +#define PARALLEL_KEY_BRIN_SHARED    UINT64CONST(0xA000000000000001)
>>> +#define PARALLEL_KEY_TUPLESORT      UINT64CONST(0xA000000000000002)
>>
>> These shm keys use the same constants also in use in
>> access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
>> operations, I'd prefer if we didn't actively introduce conflicting
>> identifiers when we still have significant amounts of unused values
>> remaining.
>>
>
> Hmmm. Is there some rule of thumb how to pick these key values?

None that I know of. There is a warning in the various places that
define these constants that they must take care not to conflict with
plan nodes' node_id: parallel plan execution uses plain plan node IDs
as keys, and as node_id is int-sized, any key that's created manually
with a value < 2^32 is only safe if the code registering it can never
run during parallel plan execution. But apart from that one case, I
can't find a convention, no.

>>> +#define PARALLEL_KEY_QUERY_TEXT     UINT64CONST(0xA000000000000003)
>>
>> This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
>> others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
>> more than brin's), backend/executor/execParallel.c
>> (0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
>> I've not checked that their uses are exactly the same, I'd expect at
>> least btree to match mostly, if not fully, 1:1).
>> I think we could probably benefit from a less ad-hoc sharing of query
>> texts. I don't think that needs to happen specifically in this patch,
>> but I think it's something to keep in mind in future efforts.
>>
>
> I'm afraid I don't quite get what you mean by "ad hoc sharing of query
> texts". Are you saying we shouldn't propagate the query text to the
> parallel workers? Why? Or what's the proper solution?

What I mean is that we have several different keys that all look like
they contain the debug query string, and always for the same debugging
purposes. For debugging, I think it'd be useful to use one well-known
key, rather than N well-known keys in each of the N parallel
subsystems. But as mentioned, it doesn't need to happen in this patch,
as that'd increase scope beyond brin/index AMs.

>>> +    state->bs_numtuples = brinshared->indtuples;
>>
>> My IDE complains about bs_numtuples being an integer. This is a
>> pre-existing issue, but still valid: we can hit an overflow on tables
>> with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
>> but problematic nonetheless.
>>
>
> True. I think I've been hesitant to make this a double because it seems
> a bit weird to do +1 with a double, and at some point (d == d+1). But
> this seems safe, we're guaranteed to be far away from that threshold.

Yes, ignoring practical constraints like page space, we "only" have
bitspace for 2^48 tuples in each (non-partitioned) relation, so
double's 53 significant bits should be more than enough to count
tuples.
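As a standalone illustration of that threshold (my own sketch, not
code from the patch): a double counts exactly by +1 all the way up to
2^53, comfortably above the ~2^48 addressable tuple positions.

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      d = ldexp(1.0, 53);     /* 2^53 */

    /* below 2^53, adding 1 still produces a distinct value */
    printf("%d\n", (d - 1) + 1 != (d - 1));     /* prints 1 */
    /* at 2^53, d + 1 rounds back to d, so counting would stall */
    printf("%d\n", d + 1 != d);                 /* prints 0 */

    return 0;
}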
>> I also noticed that this is likely to execute `union_tuples` many
>> times when pages_per_range is coprime with the parallel table scan's
>> block stride (or when we for other reasons have many tuples returned
>> for each range); and this `union_tuples` internally allocates and
>> deletes its own memory context for its deserialization of the 'b'
>> tuple. I think we should just pass a scratch context instead, so that
>> we don't have the overhead of continuously creating and then deleting
>> the same memory context.
>
> Perhaps. Looking at the code, isn't it a bit strange how union_tuples
> uses the context? It creates the context, calls brin_deform_tuple in
> that context, but then the rest of the function (including datumCopy
> and similar stuff) happens in the caller's context ...

The union operator may leak (lots of) memory, so I think it makes
sense to keep a context around that can be reset after we've extracted
the merge result (see the sketch at the end of this mail).

> However, I don't think the number of union_tuples calls is likely to be
> very high, especially for large tables. Because we split the table into
> 2048 chunks, and then cap the chunk size by 8192. For large tables
> (where this matters) we're likely close to 8192.

I agree that merging is the last part of the index creation and
usually has little impact on the total performance of the reindex
operation, but in memory-constrained environments releasing and then
requesting the same chunk of memory over and over again just isn't
great.

Also note that parallel scan chunk sizes decrease when we're about to
hit the end of the table, and that a different AM may have different
ideas about scanning a table in parallel; it could very well decide to
use striped assignments exclusively, as opposed to on-demand chunk
allocations. Both increase the chance that brin's page ranges are
processed by more than one backend.

>> As a project PostgreSQL seems to be trying to move away from
>> hardcoding heap into everything in favour of the more AM-agnostic
>> 'table'. I suggest replacing all mentions of "heap" in the arguments
>> with "table", to reduce the work future maintainers need to do to fix
>> this.
>
> I'm not against doing that, but I'd prefer to do that in a separate
> patch. There's a bunch of preexisting heap references, and I don't
> want to introduce inconsistency (patch using table, old code heap),
> nor do I want to tweak unrelated code.

Sure.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
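P.S. Here's a rough sketch of the scratch-context pattern I have in
mind for union_tuples. It is illustrative only, not the actual brin.c
code: union_tuples_with_context and the loop condition are made up for
the example, but the memory context calls are the standard ones.

/* caller sets up one scratch context for the whole merge phase */
MemoryContext scratch = AllocSetContextCreate(CurrentMemoryContext,
                                              "brin union scratch",
                                              ALLOCSET_DEFAULT_SIZES);

while (have_more_tuples())          /* hypothetical loop condition */
{
    /*
     * Deserialization of the 'b' tuple (and any leakage from the
     * union support proc) goes into the scratch context; the merged
     * result is copied back into the caller's context.
     */
    union_tuples_with_context(bdesc, a, b, scratch);

    /* discard per-call allocations, but keep the context itself */
    MemoryContextReset(scratch);
}

MemoryContextDelete(scratch);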