
From: Matthias van de Meent
Subject: Re: PATCH: Using BRIN indexes for sorted output
Date:
Msg-id: CAEze2Wi7S7bMeUpMyBbnGLrFgcWQTDN1edSXYcd8WixeBj9B_w@mail.gmail.com
In response to: Re: PATCH: Using BRIN indexes for sorted output (Tomas Vondra <tomas.vondra@enterprisedb.com>)
On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 10/16/22 22:17, Matthias van de Meent wrote:
> > On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >> Try to formulate the whole algorithm. Maybe I'm missing something.
> >>
> >> The current algorithm is something like this:
> >>
> >> 1. request info about ranges from the BRIN opclass
> >> 2. sort them by maxval and minval
> >
> > Why sort on maxval and minval? That seems wasteful for effectively all
> > sorts, where a range sort on minval should suffice: if you find a range
> > that starts at 100 in a list of ranges sorted on minval, you've
> > processed all values < 100. You can't make a similar deduction when
> > the ranges are sorted on maxval.
>
> Because that allows us to identify overlapping ranges quickly.
>
> Imagine you have the ranges sorted by maxval, which allows you to add
> tuples in small increments. But how do you know there's not a range
> (possibly with an arbitrarily high maxval) that nevertheless overlaps
> with the range we're currently processing?

Why do we need to identify overlapping ranges specifically? If you
sort by minval, no subsequent range can contain values below the
minval of the next range in the list, so you can safely emit any
value less than the minval of the next, unprocessed, minmax range.
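
As a minimal sketch of that invariant (illustrative C; Range and
emit_bound are made-up names for this email, not patch code):

#include <limits.h>

typedef struct
{
    int     min;
    int     max;
} Range;

/*
 * Ranges are sorted ascending on minval.  After loading ranges[0..i],
 * any loaded value strictly below the next range's minval is final:
 * nothing loaded later can sort below it, so it may be emitted.
 */
static int
emit_bound(const Range *ranges, int nranges, int i)
{
    return (i + 1 < nranges) ? ranges[i + 1].min : INT_MAX;
}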

> >> 3. NULLS FIRST: read all ranges that might have NULLs => output
> >> 4. read the next range (by maxval) into tuplesort
> >>    (if no more ranges, go to (9))
> >> 5. load all tuples from "spill" tuplestore, compare to maxval
> >
> > Instead of this, shouldn't an update to tuplesort that allows for
> > restarting the sort be better than this? Moving tuples that we've
> > accepted into BRINsort state but not yet returned around seems like a
> > waste of cycles, and I can't think of a reason why it can't work.
> >
>
> I don't understand what you mean by "update to tuplesort". Can you
> elaborate?

Tuplesort currently only allows the following workflow: you load
tuples, then finalize the sort, then extract the sorted tuples. There
is currently no way to add tuples once you've started extracting them.

For my design to work efficiently, and without hacking into the
internals of tuplesort, we'd need a way to restart or 'un-finalize'
the tuplesort so that it returns to the 'load tuples' phase. Because
all data from the previous iteration is already sorted, adding more
data shouldn't be too expensive.
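
Roughly, with today's API (tuplesort setup elided), followed by the
hypothetical 'un-finalize' step; tuplesort_reset_to_load is a made-up
name for illustration, not an existing function:

/* Today's lifecycle: strictly load -> finalize -> extract. */
tuplesort_puttupleslot(state, slot);    /* load phase, once per tuple */
tuplesort_performsort(state);           /* finalize: no further input */
while (tuplesort_gettupleslot(state, true, false, slot, NULL))
{
    /* extract phase: return tuples in sorted order */
}

/* Desired addition: return to the load phase, keeping sorted data. */
tuplesort_reset_to_load(state);         /* hypothetical 'un-finalize' */
tuplesort_puttupleslot(state, slot);    /* add the next range's tuples */
tuplesort_performsort(state);           /* cheaper: old data is sorted */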

>  The point of spilling them into a tuplestore is to make the sort cheaper
> by not sorting tuples that can't possibly be produced, because the value
> exceeds the current maxval. Consider ranges sorted by maxval
> [...]
>
> Or maybe I just don't understand what you mean.

If we sort the ranges by minval like this:

1. [0,1000]
2. [0,999]
3. [50,998]
4. [100,997]
5. [100,996]
6. [150,995]

Then we can load and sort the values from ranges 1 and 2, and emit
all values up to (but not including) 50 - the minval of the next,
not-yet-loaded range in the ordered list of ranges. Then we add the
values from range 3 to the set of tuples we have yet to output; sort;
and emit values up to 100 (range 4's minval), etc. This keeps the
number of tuples in the tuplesort to the minimum needed to output any
given value.

If the ranges are sorted and loaded by maxval, like your algorithm expects:

1. [150,995]
2. [100,996]
3. [100,997]
4. [50,998]
5. [0,999]
6. [0,1000]

We would need to load all ranges into the sort before we could start
emitting any tuples, as all ranges overlap with the first range.
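
To make the minval walkthrough above concrete, here is a small
self-contained simulation (plain C; integers stand in for tuples, and
each range contributes two synthesized values, so this is a sketch of
the algorithm rather than patch code):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int     min;
    int     max;
} Range;

static int
cmp_int(const void *a, const void *b)
{
    int     ia = *(const int *) a,
            ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

static int
cmp_range_min(const void *a, const void *b)
{
    const Range *ra = a, *rb = b;

    return (ra->min > rb->min) - (ra->min < rb->min);
}

int
main(void)
{
    /* The six example ranges; qsort puts them in minval order. */
    Range   ranges[] = {
        {150, 995}, {100, 996}, {100, 997},
        {50, 998}, {0, 999}, {0, 1000},
    };
    int     nranges = 6;
    int     work[64];       /* stand-in for the tuplesort workspace */
    int     nwork = 0,
            emitted = 0;

    qsort(ranges, nranges, sizeof(Range), cmp_range_min);

    for (int i = 0; i < nranges; i++)
    {
        /* "Scan" the range: two synthetic tuples (its two bounds). */
        work[nwork++] = ranges[i].min;
        work[nwork++] = ranges[i].max;
        qsort(work, nwork, sizeof(int), cmp_int);

        /* Safe to emit strictly below the next unloaded minval. */
        int     bound = (i + 1 < nranges) ? ranges[i + 1].min : INT_MAX;

        while (emitted < nwork && work[emitted] < bound)
            printf("emit %d\n", work[emitted++]);
    }
    /* Emitted values always stay a sorted prefix: anything loaded
     * later is >= the bound they were emitted under. */
    return 0;
}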

> > [algo]
>
> I don't think this works, because we may get a range (Rs') with very
> high maxval (thus read very late from Rs), but with very low minval.
> AFAICS max_sorted must never go back, and this breaks it.

max_sorted cannot go back, because it is the min value of the next
range in the list of ranges sorted by min value; see also above.

There is a small issue in my algorithm: I use <= for yielding values
where it should be <, and the initialization of max_value to NULL is
then incorrect; but apart from that I don't think there are any
issues with the base algorithm.

> > The maximum cost of this tuplesort would be the cost of sorting a
> > seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
> > extra compare per tuple and range that are needed to determine whether
> > the range or tuple should be extracted from the tuplesort. The minimum
> > cost would be the cost of sorting all BRIN ranges, plus sorting all
> > tuples in one of the index's ranges.
> >
>
> I'm not a tuplesort expert, but my assumption is that it's better to
> sort smaller numbers of rows - which is why the patch sorts only the
> rows it knows it can actually output.

I see that the two main differences between our designs are in
answering these questions:

- How do we select table ranges for processing?
- How do we handle tuples that we know we can't output yet?

For the first, I think the differences are explained above. The main
drawback of your selection algorithm seems to be that its worst case
is "all ranges overlap", whereas my algorithm's worst case is "all
ranges start at the same value", which is only a subset of your worst
case.

For the second, the difference is whether we sort the tuples that are
out of bounds but already in the working set, because they were
returned from a range overlapping with the current bound.
My algorithm reduces the overhead of advancing the sort boundary by
also sorting the out-of-bound data: extending the bound then costs
O(n-below-new-bound), for a total of O(n-out-of-bound) across the
whole sort, with O(n log n) processing of all tuples during
insertion.
Your algorithm - if I understand it correctly - optimizes for faster
results within the current bound by not sorting the out-of-bounds
data (O(1) processing when a tuple is out of bounds), at the cost of
O(n-out-of-bound) operations each time the maxval / max_sorted
boundary is advanced: O(n*m) in total for an average of n
out-of-bound tuples and m bound updates.
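
To put illustrative numbers on that (assumed figures, not
measurements): with an average of n = 1,000 out-of-bound tuples and
m = 100 boundary advances, re-checking the spilled tuples costs on
the order of n*m = 100,000 operations, while sorting those same
tuples once is roughly n*log2(n) ~= 10,000 comparisons.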

Lastly, there is the small difference in how the ranges are extracted
from BRIN: I prefer an iterative approach, where the range summaries
are read from the index and loaded into a tuplesort incrementally
(which spills to disk and does not need everything to reside in
memory), whereas your current approach was described as
(paraphrasing) 'allocate all this data in one chunk and hope that
there is enough memory available'. I think this is not so much a
disagreement about the best approach as a matter of what could be
made to work for now, so I hope we'll see improvements here in later
updates.

Kind regards,

Matthias van de Meent


