Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT - Mailing list pgsql-hackers

From James Coleman
Subject Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT
Date
Msg-id CAAaqYe-N2gfZ8EWHTr9kjnKOf1KjpgRztgBVEvHPUCeaKKc6xQ@mail.gmail.com
Whole thread Raw
In response to Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Thu, Jan 10, 2019 at 6:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sat, Dec 29, 2018 at 6:50 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > > Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
> > > (...) order by c` and `a in (...) and b = 1 order by c` (and further
> > > similar derivations with increasing numbers of equality quals).
> >
> > I don't quite understand this the above paragraph, but I assume this
> > would be possible to do with some new index am routine which allowed
> > multiple different values for the initial key.
>
> I'm confused about James' use of the term "new index am". I guess he
> just meant new support function or something?

Thanks for responding.

I was wondering if a new index access method would be the preferable
way to do this such as how the skip scan patch does (I was taking some
ideas from that one).

> > > Note: Another (loosely) related problem is that sorting can't
> > > currently take advantage of cases where an index provides a partial
> > > (prefix of requested pathkeys) ordering.
> >
> > There has been a patch [2] around for about 4 years now that does
> > this.  I'm unsure of the current status, other than not yet committed.
> >
> > [1] https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489%40BlueTreble.com
> > [2] https://www.postgresql.org/message-id/flat/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
>
> I can see why you'd mention these two, but I also expected you to
> mention the skip scan project, since that involves pushing down
> knowledge about how the index is to be accessed to the index am (at
> least, I assume that it does), and skipping leading attributes to use
> the sort order from a suffix attribute. Actually, the partial sort
> idea that you linked to is more or less a dual of skip scan, at least
> to my mind (the former *extends* the sort order by adding a suffix
> tie-breaker, while the latter *skips* a leading attribute to get to an
> interesting suffix attribute).

Yes, I'd been looking at the skip scan patch but didn't mention it. A
lot of my initial email was from my initial brainstorming notes on the
topic, and I should have cleaned it up a bit better before sending it.

> The way James constructed his example suggested that there'd be some
> kind of natural locality, that we'd expect to be able to take
> advantage of at execution time when the new strategy is favorable. I'm
> not sure if that was intended -- James? ...

I'm not sure what you mean by "natural locality"; I believe that the
artificial data I've constructed is actually somewhat close to worst
case for what I'm proposing. Evenly distributed (this is random, but
in this case I think that's close enough) data will realize the
smallest possible gains (and be the most likely to represent a
regression with few enough rows in each group) because it is the case
where we'd have have the most overhead of rotating among scan keys.

> ... I think it might help James to
> construct a more obviously realistic/practical motivating example. I'm
> perfectly willing to believe that this idea would help his real world
> queries, and having an example that can easily be played with is
> helpful in other ways. But I'd like to know why this idea is important
> is in detail, since I think that it would help me to place it in the
> wider landscape of ideas that are like this. Placing it in that wider
> landscape, and figuring out next steps at a high level seem to be the
> problem right now.

I'll attempt to describe a more real world scenario: suppose we have a
schema like:

users(id serial primary key)
orders(id serial primary key, user_id integer, created_at timestamp)

And wanted to find the most recent N orders for a specific group of
users (e.g., in a report or search). Your query might look like:

SELECT *
FROM orders
WHERE orders.user_id IN (1, 2, 3)
ORDER BY orders.created_at DESC
LIMIT 25

Currently an index on orders(user_id, created_at) will be used for
this query, but only to satisfy the scalar array op qual. Then all
matching orders (say, years worth) will be fetched, a sort node will
sort all of those results, and then a limit node will take the top N.

Generalized the problem is something like "find the top N rows across
a group of foreign keys" (though saying foreign keys probably is too
specific).

But under the scheme I'm proposing that same index would be able to
provide both the filter and guarantee ordering as well.

Does that more real-world-ish example help place the usefulness of this?

I think this goes beyond increasing the usefulness of indexes by
requiring less specific indexes (incremental sort does this), but
rather allows the index to support a kind of query you can't currently
(as far as I'm aware) can't express in a performant way at call
currently (other than a complex recursive cte or in some subset of
cases a bunch of union statements -- one per array entry).

James Coleman


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: New vacuum option to do only freezing
Next
From: Tom Lane
Date:
Subject: Re: [PATCH] pgbench tap tests fail if the path contains a perl special character