
From: Tomas Vondra
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date:
Msg-id: b1fb43f8-e9a0-9e34-fbce-590463520506@enterprisedb.com
In response to: Re: Parallel CREATE INDEX for BRIN indexes  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses: Re: Parallel CREATE INDEX for BRIN indexes
List: pgsql-hackers
On 11/29/23 15:42, Matthias van de Meent wrote:
> On Tue, 28 Nov 2023 at 18:59, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/28/23 16:39, Matthias van de Meent wrote:
>>> On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> On 11/23/23 13:33, Matthias van de Meent wrote:
>>>>> The union operator may leak (lots of) memory, so I think it makes
>>>>> sense to keep a context around that can be reset after we've extracted
>>>>> the merge result.
>>>>>
>>>>
>>>> But does the current code actually achieve that? It does create a "brin
>>>> union" context, but then it only does this:
>>>>
>>>>     /* Use our own memory context to avoid retail pfree */
>>>>     cxt = AllocSetContextCreate(CurrentMemoryContext,
>>>>                                 "brin union",
>>>>                                 ALLOCSET_DEFAULT_SIZES);
>>>>     oldcxt = MemoryContextSwitchTo(cxt);
>>>>     db = brin_deform_tuple(bdesc, b, NULL);
>>>>     MemoryContextSwitchTo(oldcxt);
>>>>
>>>> Surely that does not limit the amount of memory used by the actual union
>>>> functions in any way?
>>>
>>> Oh, yes, of course. For some reason I thought that covered the calls
>>> to the union operator function too, but it indeed only covers
>>> deserialization. I do think it is still worthwhile to not do the
>>> create/delete cycle, but won't hold the patch back for that.
>>>
>>
>> I think the union_tuples() changes are better left for a separate patch.
>>
>>>>>> However, I don't think the number of union_tuples calls is likely to be
>>>>>> very high, especially for large tables. Because we split the table into
>>>>>> 2048 chunks, and then cap the chunk size at 8192. For large tables
>>>>>> (where this matters) we're likely close to 8192.
>>>>>
>>>>> I agree that the merging part of the index creation is the last part,
>>>>> and usually has little impact on the total performance of the reindex
>>>>> operation, but in memory-constrained environments releasing and then
>>>>> requesting the same chunk of memory over and over again just isn't
>>>>> great.
>>>>
>>>> OK, I'll take a look at the scratch context you suggested.
>>>>
>>>> My point, however, was that we won't actually do that very often, because
>>>> on large tables the BRIN ranges are likely smaller than the parallel scan
>>>> chunk size, so there are few overlaps. OTOH if the table is small, or if
>>>> the BRIN ranges are large, there'll be few of them.
>>>
>>> That's true, so maybe I'm concerned about something that amounts to
>>> only marginal gains.
>>>
>>
>> However, after thinking about this a bit more, I think we actually do
>> need to do something about the memory management when merging tuples.
>> AFAIK the general assumption was that union_tuple() only runs for a
>> single range, and then the whole context gets freed.
> 
> Correct, but it also is (or should be) assumed that union_tuple
> will be called several times in the same context to handle repeated
> concurrent updates. Presumably, that only happens rarely, but it's
> something that should be kept in mind regardless.
> 

In theory, yes. But union_tuples() is used only in summarize_range(),
and that only processes a single page range.
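
For the archives, the shape of the fix I describe below is roughly the
following. This is a minimal sketch, not the actual patch; the loop structure
and the helper names (next_range_has_tuples, merge_tuples_for_range,
emit_merged_tuple) are placeholders I made up:

    /*
     * Per-range scratch context: everything the deform and union steps
     * allocate (or leak) goes in here, and is freed by the reset at the
     * end of each iteration instead of piling up in the parent context.
     */
    MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
                                              "brin union",
                                              ALLOCSET_DEFAULT_SIZES);

    while (next_range_has_tuples())         /* placeholder */
    {
        MemoryContext oldcxt = MemoryContextSwitchTo(cxt);

        /* deform the tuples and call the opclass union function(s) */
        merge_tuples_for_range();           /* placeholder */

        MemoryContextSwitchTo(oldcxt);

        /* copy the merged tuple out of cxt before resetting it */
        emit_merged_tuple();                /* placeholder */

        MemoryContextReset(cxt);
    }

    MemoryContextDelete(cxt);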

>> But the way the
>> merging was implemented, it all runs in a single context. And while a
>> single union_tuple() may not need a lot of memory, in total it may be
>> annoying. I just added a palloc(1MB) into union_tuples and ended up with
>> ~160MB allocated in the PortalContext on just a 2GB table. In practice the
>> memory will grow more slowly, but it's still not great :-/
>>
>> The attached 0003 patch adds a memory context that's reset after
>> producing a merged BRIN tuple for each page range.
> 
> Looks good.
> 
> This also made me think a bit more about how we're working with the
> tuples. With your latest patch, we always deserialize and re-serialize
> the sorted BRIN tuples, just in case the next tuple will also be a
> BRIN tuple of the same page range. Could we save some of that
> deserialization time by optimistically expecting that we're not going
> to need to merge the tuple, and only storing a local copy of it?
> See attached 0002; this saves some cycles in common cases.
> 

Good idea!
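
To spell the idea out for anyone skimming the archives: keep the previous
tuple from the sort in serialized form, and only deform it if a second tuple
for the same page range actually shows up. A rough sketch with invented
helper names (get_next_sorted_tuple, union_and_reform, write_index_tuple,
copy_tuple_local), not the actual 0002 code:

    BrinTuple  *prev = NULL;    /* serialized copy of the previous tuple */
    BrinTuple  *tup;
    Size        prevsz = 0;
    Size        tupsz;

    while ((tup = get_next_sorted_tuple(&tupsz)) != NULL)   /* placeholder */
    {
        if (prev != NULL && prev->bt_blkno == tup->bt_blkno)
        {
            /*
             * Rare case: another tuple for the same page range, so we
             * really do have to deform both, union them, and re-form
             * the merged tuple.
             */
            prev = union_and_reform(prev, tup, &prevsz);    /* placeholder */
        }
        else
        {
            /*
             * Common case: a new page range. The previous tuple goes
             * out exactly as it came from the sort, with no
             * deform/re-form cycle at all.
             */
            if (prev != NULL)
                write_index_tuple(prev, prevsz);            /* placeholder */
            prev = copy_tuple_local(tup, &tupsz);           /* placeholder */
            prevsz = tupsz;
        }
    }

    if (prev != NULL)
        write_index_tuple(prev, prevsz);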

> The v20231128 version of the patchset (as squashed, attached v5-0001)
> looks good to me.
> 

Cool. I'll put this through a bit more stress testing, and then I'll get
it pushed.

Thanks for the reviews!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


