Re: Parallel CREATE INDEX for BRIN indexes - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Parallel CREATE INDEX for BRIN indexes
Date
Msg-id CAEze2Wg=6w42JiQtGwQvM-nPdgMYNVK-Z85r=_QZ1EPC02E9dQ@mail.gmail.com
Whole thread Raw
In response to Re: Parallel CREATE INDEX for BRIN indexes  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Parallel CREATE INDEX for BRIN indexes
List pgsql-hackers
Hi,

On Wed, 22 Nov 2023 at 20:16, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/20/23 20:48, Matthias van de Meent wrote:
>> On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> Hi,
>>>
>>> here's an updated patch, addressing the review comments, and reworking
>>> how the work is divided between the workers & leader etc.
>>>
>>
>> After code-only review, here are some comments:
>>
>>> +++ b/src/backend/access/brin/brin.c
>>> [...]
>>> +/* Magic numbers for parallel state sharing */
>>> +#define PARALLEL_KEY_BRIN_SHARED        UINT64CONST(0xA000000000000001)
>>> +#define PARALLEL_KEY_TUPLESORT            UINT64CONST(0xA000000000000002)
>>
>> These shm keys use the same constants also in use in
>> access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
>> operations, I'd prefer if we didn't actively introduce conflicting
>> identifiers when we still have significant amounts of unused values
>> remaining.
>>
>
> Hmmm. Is there some rule of thumb how to pick these key values?

None that I know of.
There is a warning in various places that define these constants that
they take care to not conflict with plan node's node_id: parallel plan
execution uses plain plan node IDs as keys, and as node_id is
int-sized, any other key value that's created manually of value < 2^32
should be sure that it can't be executed in a parallel backend.
But apart from that one case, I can't find a convention, no.

>>> +#define PARALLEL_KEY_QUERY_TEXT            UINT64CONST(0xA000000000000003)
>>
>> This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
>> others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
>> more than brin's), backend/executor/execParallel.c
>> (0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
>> I've not checked that their uses are exactly the same, I'd expect at
>> least btree to match mostly, if not fully, 1:1).
>> I think we could probably benefit from a less ad-hoc sharing of query
>> texts. I don't think that needs to happen specifically in this patch,
>> but I think it's something to keep in mind in future efforts.
>>
>
> I'm afraid I don't quite get what you mean by "ad hoc sharing of query
> texts". Are you saying we shouldn't propagate the query text to the
> parallel workers? Why? Or what's the proper solution?

What I mean is that we have several different keys that all look like
they contain the debug query string, and always for the same debugging
purposes. For debugging, I think it'd be useful to use one well-known
key, rather than N well-known keys in each of the N parallel
subsystems.

But as mentioned, it doesn't need to happen in this patch, as that'd
increase scope beyond brin/index ams.

>>> +    state->bs_numtuples = brinshared->indtuples;
>>
>> My IDE complains about bs_numtuples being an integer. This is a
>> pre-existing issue, but still valid: we can hit an overflow on tables
>> with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
>> but problematic nonetheless.
>>
>
> True. I think I've been hesitant to make this a double because it seems
> a bit weird to do +1 with a double, and at some point (d == d+1). But
> this seems safe, we're guaranteed to be far away from that threshold.

Yes, ignoring practical constraints like page space, we "only" have
bitspace for 2^48 tuples in each (non-partitioned) relation, so
double's 56 significant bits should be more than enough to count
tuples.

>> I also noticed that this is likely to execute `union_tuples` many
>> times when pages_per_range is coprime with the parallel table scan's
>> block stride (or when we for other reasons have many tuples returned
>> for each range); and this `union_tuples` internally allocates and
>> deletes its own memory context for its deserialization of the 'b'
>> tuple. I think we should just pass a scratch context instead, so that
>> we don't have the overhead of continously creating then deleting the
>> same memory context
>
> Perhaps. Looking at the code, isn't it a bit strange how union_tuples
> uses the context? It creates the context, calls brin_deform_tuple in
> that context, but then the rest of the function (including datumCopy and
> similar stuff) happens in the caller's context ...

The union operator may leak (lots of) memory, so I think it makes
sense to keep a context around that can be reset after we've extracted
the merge result.

> However, I don't think the number of union_tuples calls is likely to be
> very high, especially for large tables. Because we split the table into
> 2048 chunks, and then cap the chunk size by 8192. For large tables
> (where this matters) we're likely close to 8192.

I agree that the merging part of the index creation is the last part,
and usually has no high impact on the total performance of the reindex
operation, but in memory-constrained environments releasing and then
requesting the same chunk of memory over and over again just isn't
great.
Also note that parallel scan chunk sizes decrease when we're about to
hit the end of the table, and that a different AM may have different
ideas about scanning a table in parallel; it could very well decide to
use striped assignments exclusively, as opposed to on-demand chunk
allocations; both increasing the chance that brin's page ranges are
processed by more than one backend.

>> As a project PostgreSQL seems to be trying to move away from
>> hardcoding heap into everything in favour of the more AM-agnostic
>> 'table'. I suggest replacing all mentions of "heap" in the arguments
>> with "table", to reduce the work future maintainers need to do to fix
>> this.
>
> I'm not against doing that, but I'd prefer to do that in a separate
> patch. There's a bunch of preexisting heap references, so and I don't
> want to introduce inconsistency (patch using table, old code heap) nor
> do I want to tweak unrelated code.

Sure.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



pgsql-hackers by date:

Previous
From: Quan Zongliang
Date:
Subject: Re: PL/pgSQL: Incomplete item Allow handling of %TYPE arrays, e.g. tab.col%TYPE[]
Next
From: Alexander Korotkov
Date:
Subject: Table AM Interface Enhancements