Re: PATCH: Using BRIN indexes for sorted output - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PATCH: Using BRIN indexes for sorted output
Msg-id 87964cda-e440-8759-2838-bc4628072a18@enterprisedb.com
In response to Re: PATCH: Using BRIN indexes for sorted output  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: PATCH: Using BRIN indexes for sorted output  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: PATCH: Using BRIN indexes for sorted output  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On 10/17/22 16:00, Matthias van de Meent wrote:
> On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 10/16/22 22:17, Matthias van de Meent wrote:
>>> On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> Try to formulate the whole algorithm. Maybe I'm missing something.
>>>>
>>>> The current algorithm is something like this:
>>>>
>>>> 1. request info about ranges from the BRIN opclass
>>>> 2. sort them by maxval and minval
>>>
>>> Why sort on maxval and minval? That seems wasteful for effectively all
>>> sorts, where range sort on minval should suffice: If you find a range
>>> that starts at 100 in a list of ranges sorted at minval, you've
>>> processed all values <100. You can't make a similar comparison when
>>> that range is sorted on maxvals.
>>
>> Because that allows us to identify overlapping ranges quickly.
>>
>> Imagine you have the ranges sorted by maxval, which allows you to add
>> tuples in small increments. But how do you know there's not a range
>> (possibly with arbitrarily high maxval), that however overlaps with the
>> range we're currently processing?
> 
> Why do we need to identify overlapping ranges specifically? If you
> sort by minval, it becomes obvious that any subsequent range cannot
> contain values < the minval of the next range in the list, allowing
> you to emit any values less than the next, unprocessed, minmax range's
> minval.
> 

D'oh! I think you're right, it should be possible to do this with only a
sort by minval. And that might actually be a better way to do it.

I think I chose the "maxval" ordering because looking at the current
range and using its maxval as the threshold seemed natural. But that
leads to a bunch of complexity with intersecting ranges, and I never
reconsidered the choice. Silly me.
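For the archives, here's a minimal sketch of the minval-only approach
Matthias describes: load ranges in minval order, and after each load emit
everything below the next unloaded range's minval. This is Python with
made-up helper names (fetch_tuples etc.), purely for illustration - not
the patch's actual C code:

```python
import heapq

def brin_sort_minval(ranges, fetch_tuples):
    # ranges: list of (minval, maxval) block-range summaries
    # fetch_tuples(rng): returns the heap values summarized by that
    # range (hypothetical helper standing in for the heap scan)
    ranges = sorted(ranges, key=lambda r: r[0])  # sort by minval only
    heap = []                                    # current working set
    for i, rng in enumerate(ranges):
        for v in fetch_tuples(rng):
            heapq.heappush(heap, v)
        # values below the next (not-yet-loaded) minval are final
        bound = ranges[i + 1][0] if i + 1 < len(ranges) else None
        while heap and (bound is None or heap[0] < bound):
            yield heapq.heappop(heap)
```

Note the strict "<" against the bound: values equal to the next minval may
still appear in later ranges, so they can't be emitted yet.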

>>>> 3. NULLS FIRST: read all ranges that might have NULLs => output
>>>> 4. read the next range (by maxval) into tuplesort
>>>>    (if no more ranges, go to (9))
>>>> 5. load all tuples from "spill" tuplestore, compare to maxval
>>>
>>> Instead of this, shouldn't an update to tuplesort that allows for
>>> restarting the sort be better than this? Moving tuples that we've
>>> accepted into BRINsort state but not yet returned around seems like a
>>> waste of cycles, and I can't think of a reason why it can't work.
>>>
>>
>> I don't understand what you mean by "update to tuplesort". Can you
>> elaborate?
> 
> Tuplesort currently only allows the following workflow: you load
> tuples, then call finalize, then extract tuples. There is currently no
> way to add tuples once you've started extracting them.
> 
> For my design to work efficiently or without hacking into the
> internals of tuplesort, we'd need a way to restart or 'un-finalize'
> the tuplesort so that it returns to the 'load tuples' phase. Because
> all data of the previous iteration is already sorted, adding more data
> shouldn't be too expensive.
> 
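A toy model of that "un-finalize" workflow might look like the following.
This is plain Python, emphatically not tuplesort's actual API - just an
illustration of why adding a batch to already-sorted data is cheap:

```python
import heapq

class RestartableSort:
    """Toy model of a sort that can return to its 'load tuples' phase.
    Not PostgreSQL's tuplesort API - just an illustration of the idea."""
    def __init__(self):
        self.pending = []  # sorted, not yet extracted
        self.batch = []    # added since the last finalize

    def add(self, v):
        self.batch.append(v)

    def finalize(self):
        # the pending part is already sorted, so only the new batch
        # needs sorting before a cheap linear merge
        self.batch.sort()
        self.pending = list(heapq.merge(self.pending, self.batch))
        self.batch = []

    def extract_below(self, bound):
        # pop everything strictly below the current boundary
        out = []
        while self.pending and self.pending[0] < bound:
            out.append(self.pending.pop(0))
        return out
```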

Not sure. I still think it's better to limit the amount of data we have
in the tuplesort. Even if the tuplesort can efficiently skip the already
sorted part, it'll still occupy disk space, possibly even force the data
to disk etc. (We'll still have to write that into a tuplestore, but that
should be relatively small and short-lived/recycled).

FWIW I wonder if the assumption that tuplesort can quickly skip already
sorted data holds e.g. for tuplesorts much larger than work_mem, but I
haven't checked that.

I'd also like to include some more info in the explain output, like how
many times we sorted, and the largest amount of data we sorted. Although
maybe that could be obtained by tracking the tuplesort size of the last
sort.

Considering the tuplesort does not currently support this, I'll probably
stick to the existing approach with separate tuplestore. There's enough
complexity in the patch already, I think. The only thing we'll need with
the minval ordering is the ability to "peek ahead" to the next minval
(which is going to be the threshold used to route values either to
tuplesort or tuplestore).
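With the minval ordering, that routing is a one-comparison filter per
tuple. Roughly (Python sketch with hypothetical names, not the patch's
actual code):

```python
def route_tuples(values, next_minval, tuplesort, tuplestore):
    # values: tuples from the range just read
    # next_minval: minval of the next unprocessed range (the peeked-at
    # threshold), or None if this is the last range
    for v in values:
        if next_minval is None or v < next_minval:
            tuplesort.append(v)   # can be emitted in this round
        else:
            tuplestore.append(v)  # spill; reconsidered in a later round
```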

>>  The point of spilling them into a tuplestore is to make the sort cheaper
>> by not sorting tuples that can't possibly be produced, because the value
>> exceeds the current maxval. Consider ranges sorted by maxval
>> [...]
>>
>> Or maybe I just don't understand what you mean.
> 
> If we sort the ranges by minval like this:
> 
> 1. [0,1000]
> 2. [0,999]
> 3. [50,998]
> 4. [100,997]
> 5. [100,996]
> 6. [150,995]
> 
> Then we can load and sort the values for range 1 and 2, and emit all
> values up to (not including) 50 - the minval of the next,
> not-yet-loaded range in the ordered list of ranges. Then add the
> values from range 3 to the set of tuples we have yet to output; sort;
> and then emit values up to 100 (range 4's minval), etc. This reduces
> the amount of tuples in the tuplesort to the minimum amount needed to
> output any specific value.
> 
> If the ranges are sorted and loaded by maxval, like your algorithm expects:
> 
> 1. [150,995]
> 2. [100,996]
> 3. [100,997]
> 4. [50,998]
> 5. [0,999]
> 6. [0,1000]
> 
> We need to load all ranges into the sort before it could start
> emitting any tuples, as all ranges overlap with the first range.
> 

Right, thanks - I get this now.
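Walking through that example: after loading range i, the emit boundary is
simply range i+1's minval. A quick Python check on the ranges quoted
above:

```python
# Ranges from the example, already sorted by minval
ranges = [(0, 1000), (0, 999), (50, 998), (100, 997), (100, 996), (150, 995)]

# emit boundary after loading range i = minval of range i+1
# (None means the last range is loaded and everything can be emitted)
bounds = [ranges[i + 1][0] if i + 1 < len(ranges) else None
          for i in range(len(ranges))]
# i.e. nothing after range 1, everything below 50 after range 2, and so on
```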

>>> [algo]
>>
>> I don't think this works, because we may get a range (Rs') with very
>> high maxval (thus read very late from Rs), but with very low minval.
>> AFAICS max_sorted must never go back, and this breaks it.
> 
> max_sorted cannot go back, because it is the min value of the next
> range in the list of ranges sorted by min value; see also above.
> 
> There is a small issue in my algorithm where I use <= for yielding
> values where it should be <, and the initialization of max_value to
> NULL is then incorrect, but apart from that I don't think there are
> any issues with the base algorithm.
> 
>>> The maximum cost of this tuplesort would be the cost of sorting a
>>> seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
>>> extra compare per tuple and range that are needed to determine whether
>>> the range or tuple should be extracted from the tuplesort. The minimum
>>> cost would be the cost of sorting all BRIN ranges, plus sorting all
>>> tuples in one of the index's ranges.
>>>
>>
>> I'm not a tuplesort expert, but my assumption is that it's better to
>> sort smaller amounts of rows - which is why the patch sorts only the
>> rows it knows it can actually output.
> 
> I see that the two main differences between our designs are in
> answering these questions:
> 
> - How do we select table ranges for processing?
> - How do we handle tuples that we know we can't output yet?
> 
> For the first, I think the differences are explained above. The main
> drawback of your selection algorithm seems to be that your algorithm's
> worst-case is "all ranges overlap", whereas my algorithm's worst case
> is "all ranges start at the same value", which is only a subset of
> your worst case.
> 

Right, those are very good points.

> For the second, the difference is whether we choose to sort the tuples
> that are out-of-bounds, but are already in the working set due to
> being returned from a range overlapping with the current bound.
> My algorithm tries to reduce the overhead of increasing the sort
> boundaries by also sorting the out-of-bound data, allowing for
> O(n-less-than-newbound) overhead of extending the bounds (total
> complexity for whole sort O(n-out-of-bound)), and O(n log n)
> processing of all tuples during insertion.
> Your algorithm - if I understand it correctly - seems to optimize for
> faster results within the current bound by not sorting the
> out-of-bounds data with O(1) processing when out-of-bounds, at the
> cost of needing O(n-out-of-bound-tuples) operations when the maxval /
> max_sorted boundary is increased, with a complexity of O(n*m) for an
> average of n out-of-bound tuples and m bound updates.
> 

Right. I wonder if these are actually complementary approaches, and we
could/should pick between them depending on how many rows we expect to
consume.

My focus was LIMIT queries, so I favored the approach with the lowest
startup cost. I hadn't quite expected this to work so well even in
full-sort cases. That kinda surprised me (I wonder if very large
tuplesorts - compared to work_mem - would hurt this, though).

> Lastly, there is the small difference in how the ranges are extracted
> from BRIN: I prefer and mention an iterative approach where the tuples
> are extracted from the index and loaded into a tuplesort in some
> iterative fashion (which spills to disk and does not need all tuples
> to reside in memory), whereas your current approach was mentioned as
> (paraphrasing) 'allocate all this data in one chunk and hope that
> there is enough memory available'. I think this is not so much a
> disagreement in best approach, but mostly a case of what could be made
> to work; so in later updates I hope we'll see improvements here.
> 

Right. I think I mentioned this in my post [1], where I also envisioned
some sort of iterative approach. And I think you're right that the
approach ordering by minval is naturally more suitable, because it just
consumes a single sequence of ranges.


regards

[1]
https://www.postgresql.org/message-id/1a7c2ff5-a855-64e9-0272-1f9947f8a558%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


