Re: PATCH: Using BRIN indexes for sorted output - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: PATCH: Using BRIN indexes for sorted output
Date
Msg-id CAEze2Wi=DmXRzrOfvWoFqiSjT+arvXFUU=X2rEXkbZFoKUUqVQ@mail.gmail.com
In response to Re: PATCH: Using BRIN indexes for sorted output  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: PATCH: Using BRIN indexes for sorted output
List pgsql-hackers
On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/23/23 17:44, Matthias van de Meent wrote:
> > On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 2/23/23 15:19, Matthias van de Meent wrote:
> >>> Comments on 0001, mostly comments and patch design:
> >
> > One more comment:
> >
> >>>> + range->min_value = bval->bv_values[0];
> >>>> + range->max_value = bval->bv_values[1];
> >
> > This produces dangling pointers for minmax indexes on by-ref types
> > such as text and bytea, due to the memory context of the decoding
> > tuple being reset every iteration. :-(
> >
>
> Yeah, that sounds like a bug. Also a sign the tests should have some
> by-ref data types (presumably there are none, as that would certainly
> trip some asserts etc.).

I'm not sure we currently trip asserts, as the data we store in the
memory context is very limited, making it unlikely we actually release
the memory region back to the OS.
I did get assertion failures by adding the attached patch, but I don't
think that's something we can do in the final release.

> > With the proposed patch, we do O(ncols_statsenabled) scans over the
> > BRIN index. Each scan reads all ncol columns of all block ranges from
> > disk, so in effect the data scan does on the order of
> > O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> > all columns have statistics enabled.
> >
>
> I don't think that's the number of I/O operations we'll do, because we
> always read the whole BRIN tuple at once. So I believe it should rather
> be something like
>
>   O(ncols_statsenabled * nranges)
>
> assuming nranges is the number of page ranges. But even that's likely a
> significant overestimate because most of the tuples will be served from
> shared buffers.

We store some data per index attribute, which makes the IO required
for a single indexed range proportional to the number of index
attributes.
If we then scan the index a number of times proportional to the number
of attributes, the cumulative IO load of statistics gathering for that
index is quadratic in the number of attributes, not linear, right?

> Considering how tiny BRIN indexes are, this is likely orders of
> magnitude less I/O than we expend on sampling rows from the table. I
> mean, with the default statistics target we read ~30000 pages (~240MB)
> or more just to sample the rows. Randomly, while the BRIN index is
> likely scanned mostly sequentially.

Mostly agreed; except I think it's not too difficult to imagine a BRIN
index on that scale: take, for example, the bloom filters, which
easily take up several hundred bytes each.

With the default configuration of 128 pages_per_range,
n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
bloom filter size is 4.36 KiB - each indexed item on its own page. It
is still only 1% of the original table's size, but there are enough
tables that are larger than 24GB that this could be a significant
issue.
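For reference, the 4.36 KiB figure follows from the standard optimal-size formula for a bloom filter, m = -n * ln(p) / (ln 2)^2 bits, with the per-range tuple capacity derived from MaxHeapTuplesPerPage (291 on 8 KiB heap pages). This is my own sketch of the arithmetic, not code from brin_bloom.c, and it ignores rounding and per-filter header overhead:

```python
import math

# 291 heap tuples fit on one 8 KiB page at most (MaxHeapTuplesPerPage).
MAX_HEAP_TUPLES_PER_PAGE = 291

def bloom_filter_kib(pages_per_range=128,
                     n_distinct_per_range=-0.1,
                     false_positive_rate=0.01):
    # A negative ndistinct setting means a fraction of the per-range
    # tuple capacity, mirroring the reloption's documented meaning.
    ndistinct = (-n_distinct_per_range
                 * pages_per_range * MAX_HEAP_TUPLES_PER_PAGE)
    # Optimal bloom filter size: m = -n * ln(p) / (ln 2)^2 bits.
    m_bits = -ndistinct * math.log(false_positive_rate) / math.log(2) ** 2
    return m_bits / 8 / 1024

print(round(bloom_filter_kib(), 2))  # roughly 4.36
```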

Note that most of my concerns are related to our current
documentation's statement that there are no demerits to multi-column
BRIN indexes:

"""
11.3. Multicolumn Indexes

[...] The only reason to have multiple BRIN indexes instead of one
multicolumn BRIN index on a single table is to have a different
pages_per_range storage parameter.
"""

Wide BRIN indexes add IO overhead for single-attribute scans when
compared to single-attribute indexes, so doing N single-attribute
index scans to build the statistics on an N-attribute index is not
great.

> Maybe there are cases where this would be an issue, but I haven't seen
> one when working on this patch (and I did a lot of experiments). I'd
> like to see one before we start optimizing it ...

I'm not only worried about optimizing it, I'm also worried that we're
putting this abstraction at the wrong level in a way that is difficult
to modify.

> This also reminds me that the issues I actually saw (e.g. memory
> consumption) would be made worse by processing all columns at once,
> because then you need to keep more columns in memory.

Yes, I think that can be a valid concern, but don't we already do
similar things in the current table statistics gathering?

Kind regards,

Matthias van de Meent


