Re: PATCH: Using BRIN indexes for sorted output - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: PATCH: Using BRIN indexes for sorted output
Msg-id: 6319e69d-6207-e543-2f5d-3f45b666aca2@enterprisedb.com
In response to: Re: PATCH: Using BRIN indexes for sorted output (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
Hi, here's an updated/reworked version of the patch, on top of the "BRIN statistics" patch as 0001 (because some of that stuff is useful here, but we can ignore that part in this thread).

Warning: I realized the new node is somewhat broken when it comes to projection and matching the indexed column, most likely because the targetlists are wired up / processed incorrectly, or something like that. So when experimenting with this, just index the first column of the table and don't do anything requiring a projection. I'll get this fixed, but I've been focusing on the other stuff. I'm not particularly familiar with this tlist/projection stuff, so any help is welcome.

The main change in this version is the adoption of multiple ideas suggested by Matthias in his earlier responses.

Firstly, this changes how the index opclass passes information to the executor node. Instead of using a plain array, we now use a tuplesort. This addresses the memory consumption issues with a large number of ranges, and it also simplifies the sorting etc., which is now handled by the tuplesort. The support procedure simply fills a tuplesort and then hands it over to the caller (more or less).

Secondly, instead of ordering the ranges by maxval, this orders them by minval (as suggested by Matthias), which greatly simplifies the code because we don't need to detect overlapping ranges etc. More precisely, the ranges are sorted to get this ordering:

- not-yet-summarized ranges
- regular ranges, sorted by (minval, blkno)
- all-nulls ranges

This particular ordering is beneficial for the algorithm, which does two passes over the ranges. For the NULLS LAST case (i.e. the default), we do this:

- produce tuples with non-NULL values, ordered by the value
- produce tuples with NULL values (arbitrary order)

Each of these phases does a separate pass over the ranges (I'll get to that in a minute), and the ordering is tailored to this.
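To make the ordering concrete, here's a toy Python sketch (mine, not the patch code - the actual implementation is C inside the BRIN opclass); the tuple layout and the `range_sort_key` name are invented for illustration:

```python
# Each range is modeled as (blkno, minval, summarized, all_nulls);
# minval is None when unknown (not summarized, or all-nulls).
def range_sort_key(r):
    blkno, minval, summarized, all_nulls = r
    if not summarized:
        group = 0              # not-yet-summarized ranges first
    elif all_nulls:
        group = 2              # all-nulls ranges last (NULLS LAST)
    else:
        group = 1              # regular ranges, by (minval, blkno)
    return (group, minval if group == 1 else 0, blkno)

ranges = [
    (4, 10, True, False),
    (1, None, False, False),   # not summarized yet
    (3, None, True, True),     # all-nulls range
    (2, 10, True, False),      # same minval as blkno 4, blkno decides
    (0, 5, True, False),
]
ordered = sorted(ranges, key=range_sort_key)
```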
Note: For DESC we'd sort by maxval, and for NULLS FIRST the phases would happen in the opposite order, but those are details. Let's assume ASC ordering with NULLS LAST, unless stated otherwise.

The idea here is that all not-summarized ranges always need to be processed, both when producing NULLs and non-NULL values, which happens as two separate passes over the ranges. The all-nulls ranges don't need to be processed during the non-NULL pass, and we can terminate that pass early once we hit the first all-nulls range - so placing them last helps with this. The regular ranges are ordered by minval, as dictated by the algorithm (which is now described in the comment in nodeBrinSort.c), but we also sort them by blkno to make the scan a bit more sequential (this only matters for ranges with the same minval, which is probably rare, but the extra sort key is cheap, so why not).

I mentioned we do two separate passes - one for non-NULL values, one for NULL values. That may be somewhat inefficient, because in extreme cases we might end up scanning the whole table twice (imagine BRIN ranges where each range has both regular values and NULLs). It might be possible to do all of this in a single pass, at least in some cases - for example, while scanning ranges we might stash NULL rows into a tuplestore, so that the second pass is not needed. That assumes there are not too many such rows (otherwise we might need to write and then read many rows, outweighing the cost of just doing two passes). This should be possible to estimate/cost fairly well, I think, and the comment in nodeBrinSort actually presents some ideas about this. And we can't do that for the NULLS FIRST case, because if we stash the non-NULL rows somewhere, we won't be able to do the "incremental" sort, i.e. we might as well just do a regular Sort right away. So I've kept the simple approach with two passes for now.
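The two-pass structure (and the optional "stash NULLs in a tuplestore" shortcut) can be sketched like this - a hypothetical Python toy, with the per-range row lists and the `brinsort_passes` name invented for illustration, and a plain `sort()` standing in for the incremental tuplesort:

```python
# Ranges are (summarized, all_nulls, rows), already ordered so that
# all-nulls ranges come last; None models a SQL NULL.
def brinsort_passes(ranges, stash_nulls=False):
    nonnull, nulls = [], []
    i = 0
    # Pass 1: non-NULL values. We may stop at the first all-nulls
    # range, because the ordering placed those at the end.
    while i < len(ranges):
        summarized, all_nulls, rows = ranges[i]
        if summarized and all_nulls:
            break
        for v in rows:
            if v is None:
                if stash_nulls:
                    nulls.append(v)    # stash into the "tuplestore"
            else:
                nonnull.append(v)
        i += 1
    nonnull.sort()                     # stand-in for the tuplesort
    if stash_nulls:
        # No second pass: only the trailing all-nulls ranges remain.
        for _, _, rows in ranges[i:]:
            nulls.extend(rows)
    else:
        # Pass 2: rescan all ranges, producing the NULLs.
        nulls = [v for _, _, rows in ranges for v in rows if v is None]
    return nonnull + nulls
```

Either way the output is the same; stashing just trades a second scan for extra tuplestore writes.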
This still uses the approach of spilling tuples to a tuplestore, and only sorting rows that we know are safe to output. I still think this is a good approach, for the reasons I explained before, but changing it is not hard, so we can experiment.

There's however a related question - how quickly should we increment the minval value serving as a watermark? One option is to go to the next distinct minval value - but that may result in an excessive number of tiny sorts, because the number of ranges and rows between the old and new minval values tends to be small. Another negative consequence is that this may cause a lot of spilling (and re-spilling), because we only consume a tiny number of rows from the tuplestore after incrementing the watermark. Or we can do larger steps, skipping some of the minval values, so that more rows qualify for the sort. Of course, too large a step means we'll exceed work_mem and switch to an on-disk sort, which we probably don't want. Also, this may be the wrong thing to do for LIMIT queries, which only need a couple of rows, and for which a tiny sort is fine (because we won't do too many of them).

Patch 0004 introduces a new GUC called brinsort_watermark_step, which can be used to experiment with this. By default it's set to '1', which means we simply progress to the next minval value.

Then 0005 tries to tune the step based on statistics - we estimate the number of rows each minval increment is expected to "add", and then pick a step value so as not to overflow work_mem. This happens in create_brinsort_plan, and the comment explains the main weakness - the way the number of rows is estimated is somewhat naive, as it just divides reltuples by the number of ranges. But I have a couple of ideas about what statistics we might collect, explained in 0001 in the comment at brin_minmax_stats.

But there's another option - we can tune the step based on past sorts. If we see the sorts spilling to disk, maybe try doing smaller steps.
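To illustrate why rows at or below the watermark are safe to emit, and what a larger step changes, here's a rough Python model of the incremental sort loop (my own toy sketch - the `brin_sort` name and data layout are made up; the real logic lives in nodeBrinSort.c):

```python
# Each range is (minval, rows), sorted by minval; the watermark walks
# the distinct minvals, advancing `step` values at a time.
def brin_sort(ranges, step=1):
    minvals = sorted({mv for mv, _ in ranges})
    out, spill = [], []          # spill models the tuplestore
    loaded = 0                   # index of the next unread range
    for i in range(step - 1, len(minvals) + step, step):
        # The final round uses +inf to flush whatever is left over.
        watermark = minvals[i] if i < len(minvals) else float("inf")
        pending, spill = spill, []
        # Read every range whose minval is <= watermark.
        while loaded < len(ranges) and ranges[loaded][0] <= watermark:
            pending.extend(ranges[loaded][1])
            loaded += 1
        # Rows <= watermark are final: all unread ranges have
        # minval > watermark, so they can't contain smaller values.
        batch = [v for v in pending if v <= watermark]
        spill = [v for v in pending if v > watermark]
        batch.sort()             # one (hopefully small) sort per round
        out.extend(batch)
    return out
```

A larger step means fewer, bigger sorts (risking on-disk sorts once a batch exceeds work_mem) but less re-spilling; step=1 produces many tiny sorts.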
Patch 0006 implements a very simple variant of this. There are a couple of ideas about how it might be improved, mentioned in the comment at brinsort_adjust_watermark_step.

There's also patch 0003, which extends the EXPLAIN output with counters tracking the number of sorts, counts of on-disk/in-memory sorts, space used, number of rows sorted/spilled, and so on. This is useful when analyzing e.g. the effect of higher/lower watermark steps, discussed in the preceding paragraphs.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- 0001-Allow-index-AMs-to-build-and-use-custom-sta-20221022.patch
- 0002-Allow-BRIN-indexes-to-produce-sorted-output-20221022.patch
- 0003-wip-brinsort-explain-stats-20221022.patch
- 0004-wip-multiple-watermark-steps-20221022.patch
- 0005-wip-adjust-watermark-step-20221022.patch
- 0006-wip-adaptive-watermark-step-20221022.patch