PATCH: Using BRIN indexes for sorted output - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: PATCH: Using BRIN indexes for sorted output
Date:
Msg-id: e70fa091-e338-1598-9de4-6d0ef6b693e2@enterprisedb.com
Responses:
  Re: PATCH: Using BRIN indexes for sorted output
  Re: PATCH: Using BRIN indexes for sorted output
  Re: PATCH: Using BRIN indexes for sorted output
List: pgsql-hackers
Hi,

There have been a couple of discussions about using BRIN indexes for sorting - in fact this was mentioned even in the "Improving Indexing Performance" unconference session this year (don't remember by whom). But I haven't seen any patches, so here's one.

The idea is that we can use the information about ranges to split the table into smaller parts that can be sorted in smaller chunks. For example, if you have a tiny 2MB table with two ranges, with values in the [0,100] and [101,200] intervals, then it's clear we can sort the first range, output its tuples, and then sort/output the second range.

The attached patch builds "BRIN Sort" paths/plans, closely resembling index scans, only for BRIN indexes. And this special type of index scan does what was mentioned above - it incrementally sorts the data. It's a bit more complicated because of overlapping ranges, ASC/DESC, NULLs etc.

This is disabled by default, using a GUC enable_brinsort (you may need to tweak other GUCs to disable parallel plans etc.).

A trivial example, demonstrating the benefits:

  create table t (a int) with (fillfactor = 10);
  insert into t select i from generate_series(1,10000000) s(i);
  create index on t using brin (a);

First, a simple LIMIT query:

  explain (analyze, costs off) select * from t order by a limit 10;

                                QUERY PLAN
  ------------------------------------------------------------------------
   Limit  (actual time=1879.768..1879.770 rows=10 loops=1)
     ->  Sort  (actual time=1879.767..1879.768 rows=10 loops=1)
           Sort Key: a
           Sort Method: top-N heapsort  Memory: 25kB
           ->  Seq Scan on t  (actual time=0.007..1353.110 rows=10000000 loops=1)
   Planning Time: 0.083 ms
   Execution Time: 1879.786 ms
  (7 rows)

And the same query with enable_brinsort = on:

                                QUERY PLAN
  ------------------------------------------------------------------------
   Limit  (actual time=1.217..1.219 rows=10 loops=1)
     ->  BRIN Sort using t_a_idx on t  (actual time=1.216..1.217 rows=10 loops=1)
           Sort Key: a
   Planning Time: 0.084 ms
   Execution Time: 1.234 ms
  (5 rows)

That's a pretty nice improvement - of course, this is thanks to the data being perfectly sequential, and the difference can be made almost arbitrary by making the table smaller/larger. Similarly, as the table gets less sequential (making the ranges overlap), the BRIN plan gets more expensive. Feel free to experiment with other data sets.
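For instance, loading the same amount of data in random order produces ranges that all overlap (a quick sketch; t_random is just an illustrative name):

  -- randomly ordered data: each range summary spans roughly the whole
  -- [0, 10000000] interval, so the ranges all overlap and BRIN Sort
  -- ends up sorting almost everything in a single chunk
  create table t_random (a int) with (fillfactor = 10);
  insert into t_random
    select (random() * 10000000)::int from generate_series(1,10000000);
  create index on t_random using brin (a);

With data like this, the BRIN Sort plan should be costed (and perform) no better than a plain Sort.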
However, it's not only LIMIT queries that can improve - consider a sort of the whole table:

  test=# explain (analyze, costs off) select * from t order by a;

                                QUERY PLAN
  -------------------------------------------------------------------------
   Sort  (actual time=2806.468..3487.213 rows=10000000 loops=1)
     Sort Key: a
     Sort Method: external merge  Disk: 117528kB
     ->  Seq Scan on t  (actual time=0.018..1498.754 rows=10000000 loops=1)
   Planning Time: 0.110 ms
   Execution Time: 3766.825 ms
  (6 rows)

  test=# set enable_brinsort = on;
  test=# explain (analyze, costs off) select * from t order by a;

                                QUERY PLAN
  ----------------------------------------------------------------------------------
   BRIN Sort using t_a_idx on t  (actual time=1.210..2670.875 rows=10000000 loops=1)
     Sort Key: a
   Planning Time: 0.073 ms
   Execution Time: 2939.324 ms
  (4 rows)

Right - not a huge difference, but still a nice 25% speedup, mostly due to not having to spill data to disk and to sorting smaller amounts of data.

There's a bunch of issues with this initial version of the patch, usually described in XXX comments in the relevant places.

1) The paths are created in build_index_paths() because that's what creates index scans (which the new path resembles). But that is expected to produce IndexPath, not BrinSortPath, so it's not quite correct. Should be somewhere "higher" I guess.

2) BRIN indexes don't have internal ordering, i.e. ASC/DESC and NULLS FIRST/LAST do not really matter for them. The patch just generates paths for all 4 combinations (or tries to). Maybe there's a better way.

3) I'm not quite sure the separation of responsibilities between opfamily and opclass is optimal. I added a new amproc, but maybe this should be split differently. At the moment only minmax indexes have this, but adding it to minmax-multi should be trivial (see the first sketch after this list).

4) The state changes in nodeBrinSort are a bit confusing. Works, but may need cleanup and refactoring. Ideas welcome.

5) The costing is essentially just plain cost_index. I have some ideas about BRIN costing in general, which I'll post in a separate thread (as it's not specific to this patch).

6) At the moment this only picks one of the index keys, specified in the ORDER BY clause. I think we can generalize this to multiple keys, but thinking about multi-key ranges was a bit too much for me. The good thing is this nicely combines with IncrementalSort.

7) Only plain index keys for the ORDER BY keys, no expressions. Should not be hard to fix, though.

8) Parallel version is not supported, but I think it shouldn't be impossible. Just make the leader build the range info, and then let the workers acquire/sort ranges and merge them by Gather Merge.

9) I was also thinking about leveraging other indexes to quickly eliminate ranges that need to be sorted. The node does evaluate the filter, of course, but only after reading the tuple from the range. But imagine we allow BrinSort to utilize BRIN indexes to evaluate the filter - in that case we might skip many ranges entirely. Essentially like a bitmap index scan does, except that building the bitmap incrementally with BRIN is trivial - you can quickly check if a particular range matches or not (see the second sketch below). With other indexes (e.g. btree) you essentially need to evaluate the filter completely, and only then you can look at the bitmap. Which seems rather against the idea of this patch, which is about low startup cost. Of course, the condition might be very selective, but then you probably can just fetch the matching tuples and do a Sort.
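To illustrate 3): a minmax-multi opclass already keeps a per-range summary (a small set of values/intervals instead of a single [min, max] pair), so it can derive the per-range minimum/maximum the new amproc needs just as cheaply. A plain-SQL sketch of such an index - the opclass and its values_per_range parameter are in core, nothing here depends on the patch:

  -- same table as above, but each range is summarized by up to 16
  -- values/intervals rather than one [min, max] pair
  create index t_a_multi_idx on t
    using brin (a int4_minmax_multi_ops (values_per_range = 16));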
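And to illustrate the range check from 9): with BRIN, deciding whether a range can possibly match a condition needs only the stored summary. A rough approximation in plain SQL using pageinspect - it reads just the first data page of the index and parses the text form of the minmax summary, so it's merely a demonstration of the principle (in the patch this check would happen inside the node, and the 5000000 cut-off is arbitrary):

  create extension if not exists pageinspect;

  -- ranges whose minimum already exceeds the constant can never
  -- satisfy "a < 5000000" and could be skipped entirely
  select count(*) as skippable
    from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx')
   where value is not null
     and split_part(btrim(value, '{}'), ' .. ', 1)::int >= 5000000;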
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company