Re: PATCH: Using BRIN indexes for sorted output - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: PATCH: Using BRIN indexes for sorted output |
Date | |
Msg-id | 0695af03-8bd3-f032-47aa-0657135ce45d@enterprisedb.com |
In response to | Re: PATCH: Using BRIN indexes for sorted output (Matthias van de Meent <boekewurm+postgres@gmail.com>) |
Responses | Re: PATCH: Using BRIN indexes for sorted output |
List | pgsql-hackers |
On 10/16/22 22:17, Matthias van de Meent wrote:
> First of all, it's really great to see that this is being worked on.
>
> On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> Try to formulate the whole algorithm. Maybe I'm missing something.
>>
>> The current algorithm is something like this:
>>
>> 1. request info about ranges from the BRIN opclass
>> 2. sort them by maxval and minval
>
> Why sort on maxval and minval? That seems wasteful for effectively all
> sorts, where range sort on minval should suffice: If you find a range
> that starts at 100 in a list of ranges sorted at minval, you've
> processed all values <100. You can't make a similar comparison when
> that range is sorted on maxvals.
>

Because that allows us to identify overlapping ranges quickly.

Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's no range
(possibly with an arbitrarily high maxval) that nevertheless overlaps
with the range we're currently processing?

Consider these ranges sorted by maxval

  range #1        [0,100]
  range #2        [101,200]
  range #3        [150,250]
  ...
  range #1000000  [190,1000000000]

Processing range #1 is simple, because there are no overlapping
ranges. When processing range #2, that's not the case - the following
range #3 overlaps too, so we need to load its tuples too. But there may
be other ranges (at an arbitrary distance) that also overlap.

So we either have to cross-check everything with everything - that's
O(N^2), so not great - or we can invent a way to eliminate ranges that
can't overlap.

The patch does that by having two arrays - one sorted by maxval, one
sorted by minval. After proceeding to the next range by maxval (using
the first array), the minval-sorted array is used to detect overlaps.
This can be done quickly, because we only care about new matches since
the previous range, so we can remember the index into the array and
start from it. And we can stop once the minval exceeds the maxval of the
range from the first step, because we'll only sort tuples up to that
point.

>> 3. NULLS FIRST: read all ranges that might have NULLs => output
>> 4. read the next range (by maxval) into tuplesort
>>    (if no more ranges, go to (9))
>> 5. load all tuples from "spill" tuplestore, compare to maxval
>
> Instead of this, shouldn't an update to tuplesort that allows for
> restarting the sort be better than this? Moving tuples that we've
> accepted into BRINsort state but not yet returned around seems like a
> waste of cycles, and I can't think of a reason why it can't work.
>

I don't understand what you mean by "update to tuplesort". Can you
elaborate?

The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced yet, because their
value exceeds the current maxval. Consider ranges sorted by maxval

  [0,1000]
  [500,1500]
  [1001,2000]
  ...

We load tuples from [0,1000] and use 1000 as the "threshold" up to which
we can sort. But we have to load tuples from the overlapping range(s)
too, e.g. from [500,1500], except that tuples with values > 1000 can't
be produced yet (because there might be yet more ranges intersecting
with that part).

So why sort these tuples at all? Imagine an imperfectly correlated table
where each range overlaps with ~10 other ranges. If we feed all of that
into the tuplesort, we're now sorting 11x the amount of data.

Or maybe I just don't understand what you mean.

>> 6. load all tuples from non-summarized ranges (first range only)
>>    (into tuplesort/tuplestore, depending on maxval comparison)
>> 7. load all intersecting ranges (with minval < current maxval)
>>    (into tuplesort/tuplestore, depending on maxval comparison)
>> 8. sort the tuplesort, output all tuples, then back to (4)
>> 9. NULLS LAST: read all ranges that might have NULLs => output
>> 10. done
>>
>> For "DESC" ordering the process is almost the same, except that we swap
>> minval/maxval in most places.
>
> When I was thinking about this feature at the PgCon unconference, I
> was thinking about it more along the lines of the following system
> (for ORDER BY col ASC NULLS FIRST):
>
> 1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
>    [has_nulls, min ASC]
> 2. scan info about ranges from BRIN, store them in Rs.
> 3. Finalize the sorting of Rs.
> 4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
>    column ordering.
> 5. load all tuples from non-summarized ranges into Ts'
> 6. while Rs has a block range Rs' with has_nulls:
>    - Remove Rs' from Rs
>    - store the tuples of Rs' range in Ts.
>    We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
> 7. Finalize the Ts sorted set.
> 8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
>    - Remove Ts' from Ts
>    - Yield Ts'
>    Now, all tuples up to and including max_sorted are yielded.
> 9. If there are no more ranges in Rs:
>    - Yield all remaining tuples from Ts, then return.
> 10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
>     This is different from Tomas' implementation, as he loads the
>     tuples into a new tuplestore.
> 11. get the next item from Rs: Rs'
>     - remove Rs' from Rs
>     - assign Rs' min value to max_sorted
>     - store the tuples of Rs' range in Ts

I don't think this works, because we may get a range (Rs') with a very
high maxval (thus read very late from Rs), but with a very low minval.
AFAICS max_sorted must never go back, and this breaks it.

> 12. while the next item Rs' from Rs has a min value of max_sorted:
>     - remove Rs' from Rs
>     - store the tuples of Rs' range in Ts
> 13. The 'new' value from the next item from Rs is stored in
>     max_sorted. If no such item exists, max_sorted is assigned a sentinel
>     value (+INF)
> 14. Go to Step 7
>
> This set of operations requires a restarting tuplesort for Ts, but I
> don't think that would result in many API changes for tuplesort. It
> reduces the overhead of large overlapping ranges, as it doesn't need
> to copy all tuples that have been read from disk but have not yet been
> returned.
>
> The maximum cost of this tuplesort would be the cost of sorting a
> seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
> extra compare per tuple and range that are needed to determine whether
> the range or tuple should be extracted from the tuplesort. The minimum
> cost would be the cost of sorting all BRIN ranges, plus sorting all
> tuples in one of the index's ranges.
>

I'm not a tuplesort expert, but my assumption is that it's better to
sort smaller amounts of rows - which is why the patch sorts only the
rows it knows it can actually output.

> Kind regards,
>
> Matthias van de Meent
>
> PS. Are you still planning on giving the HOT optimization for BRIN a
> second try? I'm fairly confident that my patch at [0] would fix the
> issue that led to the revert of that feature, but it introduced ABI
> changes after the feature freeze and thus it didn't get in. The patch
> might need some polishing, but I think it shouldn't take too much
> extra effort to get into PG16.
>

Thanks for reminding me, I'll take a look before the next CF.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
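
For readers following the thread, the two-array approach described in the message (walk the ranges by maxval, use a remembered position in the minval-sorted array to pick up overlapping ranges, and spill values above the current maxval) can be sketched roughly as below. This is only an illustration and not code from the patch: `BlockRange`, `brin_sorted_scan`, `batch` and `spill` are hypothetical names, plain Python lists stand in for the tuplesort and the tuplestore, and NULL handling and non-summarized ranges are left out. The sketch loads ranges with minval <= threshold, a slightly conservative variant of the "minval < current maxval" condition in step 7.

```python
from dataclasses import dataclass


@dataclass
class BlockRange:
    """Hypothetical stand-in for one summarized BRIN page range."""
    minval: int
    maxval: int
    values: list  # stand-in for the heap tuples stored in this range


def brin_sorted_scan(ranges):
    """Yield all values in ascending order, one maxval step at a time."""
    by_max = sorted(ranges, key=lambda r: r.maxval)  # drives the steps
    by_min = sorted(ranges, key=lambda r: r.minval)  # detects overlaps
    next_min = 0  # remembered position in the minval-sorted array
    spill = []    # stand-in for the "spill" tuplestore

    for current in by_max:
        threshold = current.maxval
        batch = []  # stand-in for the tuplesort

        # Re-check previously spilled values against the new threshold.
        still_spilled = []
        for v in spill:
            (batch if v <= threshold else still_spilled).append(v)
        spill = still_spilled

        # Load every not-yet-read range that may hold values <= threshold.
        # Only ranges past the remembered position can be new matches, and
        # we stop as soon as a minval exceeds the threshold.
        while next_min < len(by_min) and by_min[next_min].minval <= threshold:
            r = by_min[next_min]
            next_min += 1
            for v in r.values:
                (batch if v <= threshold else spill).append(v)

        # Sort only what can be produced now; the rest stays unsorted.
        batch.sort()
        yield from batch
```

With the [0,1000], [500,1500], [1001,2000] example from the message, the first step sorts only values up to 1000 and parks anything larger from [500,1500] in the spill list until a later step raises the threshold.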