From: Tomas Vondra
Subject: PATCH: Using BRIN indexes for sorted output
Msg-id: e70fa091-e338-1598-9de4-6d0ef6b693e2@enterprisedb.com
List: pgsql-hackers

Hi,

There have been a couple of discussions about using BRIN indexes for
sorting - in fact this was mentioned in the "Improving Indexing
Performance" unconference session this year (I don't remember by whom).
But I haven't seen any patches, so here's one.

The idea is that we can use the information about ranges to split the
table into smaller parts and sort those one at a time. For example, if
you have a tiny 2MB table with two ranges, with values in the [0,100]
and [101,200] intervals, it's clear we can sort the first range, output
its tuples, and then sort/output the second range.

The attached patch builds "BRIN Sort" paths/plans, closely resembling
index scans, except on BRIN indexes. This special type of index scan
does what was described above - it incrementally sorts the data. It's a
bit more complicated in practice because of overlapping ranges,
ASC/DESC, NULLs etc.

This is disabled by default, using a GUC enable_brinsort (you may need
to tweak other GUCs to disable parallel plans etc.).
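
To get the BRIN Sort plans shown below, something like this should be
enough (the exact GUCs needed to keep the planner from picking other
plans may differ on your setup):

  set enable_brinsort = on;
  -- possibly also this, to rule out parallel plans
  set max_parallel_workers_per_gather = 0;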

A trivial example, demonstrating the benefits:

  create table t (a int) with (fillfactor = 10);
  insert into t select i from generate_series(1,10000000) s(i);
  create index t_a_idx on t using brin (a);


First, a simple LIMIT query:

explain (analyze, costs off) select * from t order by a limit 10;

                              QUERY PLAN
------------------------------------------------------------------------
 Limit (actual time=1879.768..1879.770 rows=10 loops=1)
   ->  Sort (actual time=1879.767..1879.768 rows=10 loops=1)
         Sort Key: a
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on t
             (actual time=0.007..1353.110 rows=10000000 loops=1)
 Planning Time: 0.083 ms
 Execution Time: 1879.786 ms
(7 rows)

And the same query with enable_brinsort = on:

                              QUERY PLAN
------------------------------------------------------------------------
 Limit (actual time=1.217..1.219 rows=10 loops=1)
   ->  BRIN Sort using t_a_idx on t
       (actual time=1.216..1.217 rows=10 loops=1)
         Sort Key: a
 Planning Time: 0.084 ms
 Execution Time: 1.234 ms
(5 rows)

That's a pretty nice improvement - of course, this is thanks to the
data being perfectly sequential, and the difference can be made almost
arbitrarily large by making the table smaller/larger. Similarly, if the
table gets less sequential (making the ranges overlap), the BRIN plan
gets more expensive. Feel free to experiment with other data sets.
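
For example, something like this should produce heavily overlapping
ranges (the pageinspect part is just one convenient way to look at the
per-range min/max summaries - it assumes the default layout, where
page 2 is the first regular BRIN page):

  create table t2 (a int) with (fillfactor = 10);
  insert into t2 select (10000000 * random())::int
    from generate_series(1,10000000) s(i);
  create index t2_a_idx on t2 using brin (a);

  create extension if not exists pageinspect;
  select blknum, value
    from brin_page_items(get_raw_page('t2_a_idx', 2), 't2_a_idx')
   order by blknum limit 10;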

However, not only the LIMIT queries can improve - consider a sort of the
whole table:

test=# explain (analyze, costs off) select * from t order by a;

                               QUERY PLAN
-------------------------------------------------------------------------
 Sort (actual time=2806.468..3487.213 rows=10000000 loops=1)
   Sort Key: a
   Sort Method: external merge  Disk: 117528kB
   ->  Seq Scan on t (actual time=0.018..1498.754 rows=10000000 loops=1)
 Planning Time: 0.110 ms
 Execution Time: 3766.825 ms
(6 rows)

And with enable_brinsort = on:

test=# explain (analyze, costs off) select * from t order by a;

                               QUERY PLAN
-------------------------------------------------------------------------
 BRIN Sort using t_a_idx on t
     (actual time=1.210..2670.875 rows=10000000 loops=1)
   Sort Key: a
 Planning Time: 0.073 ms
 Execution Time: 2939.324 ms
(4 rows)

Right - not a huge difference, but still a nice 25% speedup, mostly due
to not having to spill data to disk and sorting smaller amounts of data.
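
Part of that is the external sort, of course, so it may be interesting
to also compare against a fully in-memory sort, e.g. by bumping
work_mem high enough (the value here is just a guess, pick whatever
avoids the external merge):

  set work_mem = '1GB';
  explain (analyze, costs off) select * from t order by a;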

There's a bunch of issues with this initial version of the patch,
mostly described in XXX comments in the relevant places.

1) The paths are created in build_index_paths() because that's what
creates index scans (which the new path resembles). But that is expected
to produce IndexPath, not BrinSortPath, so it's not quite correct.
Should be somewhere "higher" I guess.

2) BRIN indexes don't have internal ordering, i.e. ASC/DESC and NULLS
FIRST/LAST does not really matter for them. The patch just generates
paths for all 4 combinations (or tries to). Maybe there's a better way.

3) I'm not quite sure the separation of responsibilities between
opfamily and opclass is optimal. I added a new amproc, but maybe this
should be split differently. At the moment only minmax indexes have
this, but adding this to minmax-multi should be trivial.

4) The state changes in nodeBrinSort are a bit confusing. It works, but
may need cleanup and refactoring. Ideas welcome.

5) The costing is essentially just plain cost_index. I have some ideas
about BRIN costing in general, which I'll post in a separate thread (as
it's not specific to this patch).

6) At the moment this only picks one of the index keys, specified in the
ORDER BY clause. I think we can generalize this to multiple keys, but
thinking about multi-key ranges was a bit too much for me. The good
thing is this nicely combines with IncrementalSort.

7) Only plain index keys for the ORDER BY keys, no expressions. Should
not be hard to fix, though.

8) Parallel version is not supported, but I think it should be
possible. Just make the leader build the range info, and then let the
workers acquire/sort ranges and merge them by Gather Merge.

9) I was also thinking about leveraging other indexes to quickly
eliminate ranges that need to be sorted. The node does evaluate the
filter, of course, but only after reading the tuple from the range. But imagine
we allow BrinSort to utilize BRIN indexes to evaluate the filter - in
that case we might skip many ranges entirely. Essentially like a bitmap
index scan does, except that building the bitmap incrementally with BRIN
is trivial - you can quickly check if a particular range matches or not.
With other indexes (e.g. btree) you essentially need to evaluate the
filter completely, and only then you can look at the bitmap. Which seems
rather against the idea of this patch, which is about low startup cost.
Of course, the condition might be very selective, but then you probably
can just fetch the matching tuples and do a Sort.
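
To illustrate (a made-up example), the kind of query I have in mind is

  select * from t where a between 1000000 and 2000000
   order by a limit 10;

where today this pretty much means scanning and filtering everything
(or a bitmap scan) and then sorting, while BrinSort could consult the
BRIN summaries, skip ranges that can't possibly match, and still stream
the output in order with a low startup cost.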


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company