Re: PATCH: Using BRIN indexes for sorted output - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: PATCH: Using BRIN indexes for sorted output
Msg-id: fdebb803-f9e7-4806-20b8-c55fc18ab790@enterprisedb.com
In response to: Re: PATCH: Using BRIN indexes for sorted output (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List: pgsql-hackers

On 2/23/23 17:44, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>> Comments on 0001, mostly comments and patch design:
> 
> One more comment:
> 
>>>> + range->min_value = bval->bv_values[0];
>>>> + range->max_value = bval->bv_values[1];
> 
> This produces dangling pointers for minmax indexes on by-ref types
> such as text and bytea, due to the memory context of the decoding
> tuple being reset every iteration. :-(
> 

Yeah, that sounds like a bug. It's also a sign the tests should include
some by-ref data types (presumably there are none at the moment, as a
by-ref type would certainly trip some asserts etc.).
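
Presumably the fix is to copy the datums out of the decoding context
before it gets reset, roughly like this (longlived_cxt and attr stand
in for whatever memory context / attribute info is available at that
point):

    /*
     * Copy by-ref values out of the per-tuple (decoding) context,
     * which is reset every iteration, into a longer-lived context.
     * datumCopy() from utils/datum.h handles by-val types too.
     */
    MemoryContext oldcxt = MemoryContextSwitchTo(longlived_cxt);

    range->min_value = datumCopy(bval->bv_values[0],
                                 attr->attbyval, attr->attlen);
    range->max_value = datumCopy(bval->bv_values[1],
                                 attr->attbyval, attr->attlen);

    MemoryContextSwitchTo(oldcxt);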

>>>> +range_values_cmp(const void *a, const void *b, void *arg)
>>>
>>> Can the arguments of these functions be modified into the types they
>>> are expected to receive? If not, could you add a comment on why that's
>>> not possible?
>>
>> The reason is that that's what qsort() expects. If you change that to
>> actual data types, you'll get compile-time warnings. I agree this may
>> need better comments, though.
> 
> Thanks in advance.
> 
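Right. For the archives: the constraint is the qsort_arg() callback
signature, so the casts have to happen inside the comparator. A sketch
(the Datum element type and FmgrInfo context here are placeholders for
illustration, not necessarily what the patch uses):

    /*
     * qsort_arg() requires exactly this signature, so the parameters
     * cannot be declared with the concrete element types - cast them
     * inside the function instead.
     */
    static int
    range_values_cmp(const void *a, const void *b, void *arg)
    {
        Datum       da = *(const Datum *) a;   /* placeholder element type */
        Datum       db = *(const Datum *) b;
        FmgrInfo   *cmpFn = (FmgrInfo *) arg;  /* placeholder compare proc */

        return DatumGetInt32(FunctionCall2(cmpFn, da, db));
    }
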
>>>> +     * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
>>>
>>> Could you please expand on this? We do have GIST support for ranges, too.
>>>
>>
>> Expand in what way? This is meant to be AM-specific, so if GiST wants to
>> collect some additional stats, it's free to do so - perhaps some of the
>> ideas from the stats collected for BRIN would be applicable, but it's
>> also bound to the index structure.
> 
> I don't quite understand the flow of the comment, as I don't clearly
> see what the "BRIN for ranges" tries to refer to. In my view, that
> makes it a bad example which needs further explanation or rewriting,
> aka "expanding on".
> 

Ah, right. Yeah, the "BRIN for ranges" wording is a bit misleading. It
should really say just BRIN; I was focused on the minmax use case, which
is why I mentioned ranges.
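
Maybe rewording it along these lines would be clearer (just a
suggestion):

    /*
     * Statistics calculated by the index AM (e.g. BRIN). What gets
     * collected is AM-specific, i.e. other AMs are free to collect
     * whatever is useful for their index structure.
     */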

>>>> + * brin_minmax_stats
>>>> + *        Calculate custom statistics for a BRIN minmax index.
>>>> + *
>>>> + * At the moment this calculates:
>>>> + *
>>>> + *  - number of summarized/not-summarized and all/has nulls ranges
>>>
>>> I think statistics gathering of an index should be done at the AM
>>> level, not attribute level. The docs currently suggest that the user
>>> builds one BRIN index with 16 columns instead of 16 BRIN indexes with
>>> one column each, which would make the statistics gathering use 16x
>>> more IO if the scanned data cannot be reused.
>>>
>>
>> Why? The row sample is collected only once and used for building all the
>> index AM stats - it doesn't really matter if we analyze 16 single-column
>> indexes or 1 index with 16 columns. Yes, we'll need to scan more
>> indexes, but with 16 columns the summaries will be larger, so the
>> total amount of I/O will be almost the same, I think.
>>
>> Or maybe I don't understand what I/O you're talking about?
> 
> With the proposed patch, we do O(ncols_statsenabled) scans over the
> BRIN index. Each scan reads all ncol columns of all block ranges from
> disk, so in effect the data scan does on the order of
> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> all columns have statistics enabled.
> 

I don't think that's the number of I/O operations we'll do, because we
always read the whole BRIN tuple at once. So I believe it should rather
be something like

  O(ncols_statsenabled * nranges)

where nranges is the number of page ranges (e.g. roughly a million
ranges for a 1TB table with the default pages_per_range = 128). Even
that is likely a significant overestimate, because most of the tuples
will be served from shared buffers.

Considering how tiny BRIN indexes are, this is likely orders of
magnitude less I/O than we expend on sampling rows from the table. I
mean, with the default statistics target (100) we sample 300 * 100 =
30,000 rows, which means reading ~30000 pages (~240MB) or more - and
those reads are effectively random, while the BRIN index is likely
scanned mostly sequentially.

Maybe there are cases where this would be an issue, but I haven't seen
one when working on this patch (and I did a lot of experiments). I'd
like to see one before we start optimizing it ...

This also reminds me that the issues I actually saw (e.g. memory
consumption) would be made worse by processing all columns at once,
because then you need to keep more columns in memory.


>>> It is possible to build BRIN indexes on more than one column with more
>>> than one opclass family like `USING brin (id int8_minmax_ops, id
>>> int8_bloom_ops)`. This would mean various duplicate statistics fields,
>>> no?
>>> It seems to me that it's more useful to do the null- and n_summarized
>>> on the index level instead of duplicating that inside the opclass.
>>
>> I don't think it's worth it. The amount of data this would save is tiny,
>> and it'd only apply to cases where the index includes the same attribute
>> multiple times, and that's pretty rare I think. I don't think it's worth
>> the extra complexity.
> 
> Not necessarily, it was just an example of where we'd save IO.
> Note that the current gathering method already retrieves all tuple
> attribute data, so from a basic processing perspective we'd save some
> time decoding as well.
> 

[shrug] I still think it's a negligible fraction of the time.

>>>
>>> I'm planning on reviewing the other patches, and noticed that a lot of
>>> the patches are marked WIP. Could you share a status on those, because
>>> currently that status is unknown: Are these patches you don't plan on
>>> including, or are these patches only (or mostly) included for
>>> debugging?
>>>
>>
>> I think the WIP label is a bit misleading; I used it mostly to mark
>> patches that are not meant to be committed on their own. A quick overview:
>>
>> [...]
> 
> Thanks for the explanation, that's quite helpful. I'll see to further
> reviewing 0004 and 0005 when I have additional time.
> 

Cool, thank you!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


