Re: Parallel CREATE INDEX for BRIN indexes - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date:
Msg-id: c990c65c-93e9-f9c3-6792-02f929ce308f@enterprisedb.com
In response to: Re: Parallel CREATE INDEX for BRIN indexes  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses: Re: Parallel CREATE INDEX for BRIN indexes
List: pgsql-hackers
On 11/29/23 21:30, Matthias van de Meent wrote:
> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/29/23 15:52, Tomas Vondra wrote:
>>>> ...
>>>>
>>>> This also made me think a bit more about how we're working with the
>>>> tuples. With your latest patch, we always deserialize and re-serialize
>>>> the sorted brin tuples, just in case the next tuple will also be a
>>>> BRIN tuple of the same page range. Could we save some of that
>>>> deserialization time by optimistically expecting that we're not going
>>>> to need to merge the tuple, and only store a local copy of it?
>>>> See attached 0002; this saves some cycles in common cases.
>>>>
>>>
>>> Good idea!
>>>
>>
>> FWIW there's a bug, in this part of the optimization:
>>
>> ------------------
>> +    if (memtuple == NULL)
>> +        memtuple = brin_deform_tuple(state->bs_bdesc, btup,
>> +                                     memtup_holder);
>> +
>>      union_tuples(state->bs_bdesc, memtuple, btup);
>>      continue;
>> ------------------
>>
>> The deforming should use prevbtup, otherwise union_tuples() just combines
>> two copies of the same tuple.
> 
> Good point. There were some more issues as well, fixes are attached.
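
For the archives: the fix is simply to deform the *previous* tuple before
merging, i.e. something like this (same identifiers as the snippet above,
shown only as an illustration, not the exact patch code):

------------------
    if (memtuple == NULL)
        memtuple = brin_deform_tuple(state->bs_bdesc, prevbtup,
                                     memtup_holder);

    union_tuples(state->bs_bdesc, memtuple, btup);
    continue;
------------------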
> 
>> Which however brings me to the bigger issue with this - my stress test
>> found this issue pretty quickly, but then I spent quite a bit of time
>> trying to find what went wrong. I find this reworked code pretty hard to
>> understand, and not necessarily because of how it's written. The problem
>> is that the same loop tries to juggle multiple pieces of information with
>> different lifespans, and so on. I find it really hard to reason about
>> how it behaves ...
> 
> Yeah, it'd be nice if we had a peek option for sortsupport, that'd
> improve context handling.
> 
>> I did try to measure how much it actually saves, but none of the tests I
>> ran actually found any measurable improvement. So I'm tempted to just not
>> include this part, and accept that we may deserialize some of the tuples
>> unnecessarily.
>>
>> Did you actually observe measurable improvements in some cases?
> 
> The improvements would mostly stem from brin indexes with multiple
> (potentially compressed) by-ref types, as they go through more complex
> and expensive code to deserialize, requiring separate palloc() and
> memcpy() calls each.
> For single-column and by-value types the improvements are expected to
> be negligible, because there is no meaningful difference between
> copying a single by-ref value and copying its container; the
> additional work done for each tuple is marginal for those.
> 
> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
> (sha256((id+1)::text::bytea)::text),
> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
> measured a difference of 10x less time spent in the main loop of
> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
> ranges. It's not a lot, but worth at least something, I guess?
> 

It is something, but I can't really convince myself it's worth the extra
code complexity. It's a somewhat extreme example, and the parallelism
certainly saves much more than this.

> The attached patch fixes the issue that you called out.
> It also further updates _brin_end_parallel: the final 'write empty
> tuples' loop is never hit, so it is removed. If there were any tuples
> in the spool, we'd have filled the empty ranges at the end of the main
> loop; if there were no tuples in the spool, memtuple would still be at
> its initialized value of 0, making the loop condition constantly false.
> I also updated some comments.
> 

Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
insert the empty ranges in the main loop, because we're already looking
at the *next* summary.

But I think the idea was to insert empty ranges if there's a chunk of
empty ranges at the end of the table, after the last tuple the index
build reads. But I'm not sure that can actually happen ...
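
To make that concrete, the loop structure I have in mind is roughly this
(a simplified sketch only; the helper names are made up and don't
necessarily match what the patch actually does):

------------------
    prevblkno = 0;
    while ((btup = next_spooled_tuple(spool, &tuplen)) != NULL)
    {
        /* ranges before this one with no spooled tuple are "empty",
         * so they get summarized here, inside the main loop */
        fill_empty_ranges(state, prevblkno, btup->bt_blkno);

        /* ... merge btup with the previous tuple if it's for the same
         * page range, otherwise insert the previous summary ... */

        prevblkno = btup->bt_blkno;
    }

    /* all-empty ranges *after* the last spooled tuple would have to be
     * handled here (if that can happen at all) */
------------------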


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


