Re: PATCH: Using BRIN indexes for sorted output - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: PATCH: Using BRIN indexes for sorted output
Msg-id: fdebb803-f9e7-4806-20b8-c55fc18ab790@enterprisedb.com
In response to: Re: PATCH: Using BRIN indexes for sorted output (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses: Re: PATCH: Using BRIN indexes for sorted output
List: pgsql-hackers
On 2/23/23 17:44, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>> Comments on 0001, mostly comments and patch design:
>
> One more comment:
>
>>>> +    range->min_value = bval->bv_values[0];
>>>> +    range->max_value = bval->bv_values[1];
>
> This produces dangling pointers for minmax indexes on by-ref types
> such as text and bytea, due to the memory context of the decoding
> tuple being reset every iteration. :-(
>

Yeah, that sounds like a bug. Also a sign the tests should have some
by-ref data types (presumably there are none, as that would certainly
trip some asserts etc.).
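For illustration, a minimal sketch of one possible fix - copying the
datums into a longer-lived memory context before the per-tuple context
gets reset. The identifiers (rangeCxt, range, bval, attr) are
illustrative, not the actual patch code:

    #include "utils/datum.h"
    #include "utils/memutils.h"

    /*
     * Inside the per-range loop: datumCopy() allocates in the current
     * memory context, so switch to a context (rangeCxt here) that
     * lives at least as long as the collected ranges. For by-value
     * types datumCopy() simply returns the value, so this is safe
     * (if redundant) for those as well.
     */
    MemoryContext oldcxt = MemoryContextSwitchTo(rangeCxt);

    range->min_value = datumCopy(bval->bv_values[0],
                                 attr->attbyval, attr->attlen);
    range->max_value = datumCopy(bval->bv_values[1],
                                 attr->attbyval, attr->attlen);

    MemoryContextSwitchTo(oldcxt);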
>>>> +range_values_cmp(const void *a, const void *b, void *arg)
>>>
>>> Can the arguments of these functions be modified into the types they
>>> are expected to receive? If not, could you add a comment on why
>>> that's not possible?
>>
>> The reason is that that's what qsort() expects. If you change that to
>> actual data types, you'll get compile-time warnings. I agree this may
>> need better comments, though.
>
> Thanks in advance.
>
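To illustrate what such a comment would say: qsort() and qsort_arg()
expect a comparator of exactly this pointer type, so the casts have to
happen inside the function. A sketch with a made-up element type and a
SortSupport argument (both illustrative, not the actual patch code):

    #include "postgres.h"
    #include "utils/sortsupport.h"

    /* illustrative element type */
    typedef struct RangeValue
    {
        Datum       value;
    } RangeValue;

    /*
     * qsort_arg() expects int (*)(const void *, const void *, void *);
     * declaring the parameters as RangeValue * would make the function
     * pointer type incompatible and cause compile-time warnings, hence
     * the casts inside the comparator.
     */
    static int
    range_values_cmp(const void *a, const void *b, void *arg)
    {
        const RangeValue *ra = (const RangeValue *) a;
        const RangeValue *rb = (const RangeValue *) b;
        SortSupport ssup = (SortSupport) arg;

        return ApplySortComparator(ra->value, false,
                                   rb->value, false, ssup);
    }

The call site can then pass the comparator without any cast, e.g.
qsort_arg(values, nvalues, sizeof(RangeValue), range_values_cmp, ssup).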
>>>> + * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
>>>
>>> Could you please expand on this? We do have GiST support for ranges,
>>> too.
>>>
>>
>> Expand in what way? This is meant to be AM-specific, so if GiST wants
>> to collect some additional stats, it's free to do so - perhaps some
>> of the ideas from the stats collected for BRIN would be applicable,
>> but it's also bound to the index structure.
>
> I don't quite understand the flow of the comment, as I don't clearly
> see what the "BRIN for ranges" tries to refer to. In my view, that
> makes it a bad example which needs further explanation or rewriting,
> aka "expanding on".
>

Ah, right. Yeah, the "BRIN for ranges" wording is a bit misleading. It
should really say only BRIN, but I was focused on the minmax use case,
so I mentioned the ranges.

>>>> + * brin_minmax_stats
>>>> + *      Calculate custom statistics for a BRIN minmax index.
>>>> + *
>>>> + * At the moment this calculates:
>>>> + *
>>>> + * - number of summarized/not-summarized and all/has nulls ranges
>>>
>>> I think statistics gathering of an index should be done at the AM
>>> level, not attribute level. The docs currently suggest that the user
>>> builds one BRIN index with 16 columns instead of 16 BRIN indexes
>>> with one column each, which would make the statistics gathering use
>>> 16x more IO if the scanned data cannot be reused.
>>>
>>
>> Why? The row sample is collected only once and used for building all
>> the index AM stats - it doesn't really matter if we analyze 16
>> single-column indexes or 1 index with 16 columns. Yes, we'll need to
>> scan more indexes, but with 16 columns the summaries will be larger,
>> so the total amount of I/O will be almost the same, I think.
>>
>> Or maybe I don't understand what I/O you're talking about?
>
> With the proposed patch, we do O(ncols_statsenabled) scans over the
> BRIN index. Each scan reads all ncol columns of all block ranges from
> disk, so in effect the data scan does on the order of
> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> all columns have statistics enabled.
>

I don't think that's the number of I/O operations we'll do, because we
always read the whole BRIN tuple at once. So I believe it should rather
be something like O(ncols_statsenabled * nranges), assuming nranges is
the number of page ranges.

But even that's likely a significant overestimate, because most of the
tuples will be served from shared buffers. Considering how tiny BRIN
indexes are, this is likely orders of magnitude less I/O than we expend
on sampling rows from the table. I mean, with the default statistics
target we read ~30000 pages (~240MB) or more just to sample the rows -
randomly, while the BRIN index is likely scanned mostly sequentially.

Maybe there are cases where this would be an issue, but I haven't seen
one when working on this patch (and I did a lot of experiments). I'd
like to see one before we start optimizing it ...

This also reminds me that the issues I actually saw (e.g. memory
consumption) would be made worse by processing all columns at once,
because then you need to keep more columns in memory.

>>> It is possible to build BRIN indexes on more than one column with
>>> more than one opclass family like `USING brin (id int8_minmax_ops,
>>> id int8_bloom_ops)`. This would mean various duplicate statistics
>>> fields, no?
>>>
>>> It seems to me that it's more useful to do the null- and
>>> n_summarized on the index level instead of duplicating that inside
>>> the opclass.
>>
>> I don't think it's worth it. The amount of data this would save is
>> tiny, and it'd only apply to cases where the index includes the same
>> attribute multiple times, and that's pretty rare, I think. I don't
>> think it's worth the extra complexity.
>
> Not necessarily, it was just an example of where we'd save IO.
> Note that the current gathering method already retrieves all tuple
> attribute data, so from a basic processing perspective we'd save some
> time decoding as well.
>

[shrug] I still think it's a negligible fraction of the time.

>>> I'm planning on reviewing the other patches, and noticed that a lot
>>> of the patches are marked WIP. Could you share a status on those,
>>> because currently that status is unknown: Are these patches you
>>> don't plan on including, or are these patches only (or mostly)
>>> included for debugging?
>>>
>>
>> I think the WIP label is a bit misleading, I used it mostly to mark
>> patches that are not meant to be committed on their own. A quick
>> overview:
>>
>> [...]
>
> Thanks for the explanation, that's quite helpful. I'll see to further
> reviewing 0004 and 0005 when I have additional time.
>

Cool, thank you!

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company