Thread: BRIN index worse than sequential scan for large search set
Hello everyone,
I'm playing around with BRIN indexes so as to get a feel for the feature.
During my tests, I was unable to make BRIN indexes perform better than a sequential scan for queries searching for large value sets (20K values in the example below).
Creating the table with one single BIGINT required column:
CREATE TABLE
test_brin
(
idx BIGINT NOT NULL
)
WITH
(
fillfactor = 100
)
;
Filling the table with 20 million sorted random BIGINT values:
INSERT INTO
test_brin
(
idx
)
SELECT
CAST(FLOOR(RANDOM() * 9223372036854775806) AS BIGINT)
AS idx
FROM
GENERATE_SERIES(1, 20 * 1000 * 1000, 1)
AS g
ORDER BY
idx ASC
;
Now we cluster the table (even though this shouldn't be needed):
CREATE UNIQUE INDEX test_brin_idx_uniq ON test_brin (idx);
CLUSTER test_brin USING test_brin_idx_uniq;
DROP INDEX test_brin_idx_uniq;
Now we create the BRIN index on what should be a perfectly ordered and dense table:
CREATE INDEX
test_brin_idx
ON
test_brin
USING
BRIN
(
idx
)
;
Let's VACUUM the table just to be safe:
VACUUM test_brin;
VACUUM ANALYSE test_brin;
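As an optional sanity check (not part of the original steps), after the CLUSTER and ANALYZE the planner statistics should report a physical/logical correlation of roughly 1.0 for the column, confirming the table really is stored in sorted order:

```sql
-- Optional sanity check: after CLUSTER + ANALYZE, pg_stats should show
-- a correlation of ~1.0 for "idx", i.e. the heap is physically sorted.
SELECT
    attname,
    correlation
FROM
    pg_stats
WHERE
    tablename = 'test_brin'
    AND attname = 'idx';
```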
And now let's query 20K random rows from our 20M total rows:
EXPLAIN (
ANALYZE,
VERBOSE,
COSTS,
BUFFERS,
TIMING
)
SELECT
tb.idx
FROM
test_brin
AS tb
WHERE
EXISTS (
SELECT
FROM
(
SELECT
idx
FROM
(
SELECT
-- Just trying to break the optimisation that would recognize "idx" as being an indexed column
(idx + (CEIL(RANDOM()) - 1))::BIGINT
AS idx
FROM
test_brin
ORDER BY
RANDOM()
LIMIT
20000
)
AS t
ORDER BY
idx ASC
)
AS q
WHERE
tb.idx = q.idx
)
;
By default, this query will not use the BRIN index and will simply run a 1.5s sequential scan (hitting 700 MB) and a 2.47s hash join, for a total query time of 8.7s:
https://explain.dalibo.com/plan/46c3191g8a6c1bc7
If we force the use of the BRIN index with `SET LOCAL enable_seqscan = OFF;`, the same query now takes 50s, with 2.5s spent on the bitmap index scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan (hitting 20 GB of data!):
https://explain.dalibo.com/plan/7f73bg9172a8b226
Since the bitmap heap scan is taking a long time, let's reduce the `pages_per_range` parameter from its default value of 128 to 32:
CREATE INDEX
test_brin_idx
ON
test_brin
USING
BRIN
(
idx
)
WITH
(
pages_per_range = 32
)
;
The query now takes 25s, half the time we had before, with 9.7s (up from 2.5s) spent on the bitmap index scan (hitting 2.6 GB of data, up from 470 MB) and 10s (down from 42s) on the bitmap heap scan (hitting 4.9 GB of data, down from 20 GB):
https://explain.dalibo.com/plan/64fh5h1daaheeab3
We go a step further and lower the `pages_per_range` parameter to 8 (the other extreme).
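The original message does not show the corresponding DDL for this step; mirroring the earlier index definition, it would presumably look like this (assuming the previous index is dropped first):

```sql
-- Not shown in the original message: recreate the index with the
-- smaller range size (assumes dropping the pages_per_range = 32 index).
DROP INDEX test_brin_idx;

CREATE INDEX
test_brin_idx
ON
test_brin
USING
BRIN
(
idx
)
WITH
(
pages_per_range = 8
)
;
```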
The query now takes 45s, close-ish to the initial time, with 38.5s (up from 2.5s) spent on the bitmap index scan (hitting 9.8 GB of data, up from 470 MB) and 2.6s (down from 42s) on the bitmap heap scan (hitting 1.2 GB of data, down from 20 GB):
https://explain.dalibo.com/plan/431fbde7gb19g6g6
So I had the following two questions:
- Why is the BRIN index systematically worse than a sequential scan, even when the table is 1000x larger than the search set, physically pre-sorted, dense (fillfactor at 100%), and the search rows are themselves sorted? This setup would seem to provide the ideal conditions for a BRIN index to perform well.
- Since we only select the "idx" column, why does the BRIN index not simply return the searched value if it is included in one of its ranges? Hitting the actual row data stored in the table seems unnecessary, no?
- PostgreSQL version: v14
- Memory: 32GB
- CPUs: 8
Mickael
On Fri, Feb 24, 2023 at 05:40:55PM +0100, Mickael van der Beek wrote:
> Hello everyone,
>
> I'm playing around with BRIN indexes so as to get a feel for the feature.
> During my tests, I was unable to make BRIN indexes perform better than a
> sequential scan for queries searching for large value sets (20K values in
> the example below).
> And now let's query 20K random rows from our 20M total rows:

I didn't try your test, but I think *random* is the problem/explanation.

> By default, this query will not use the BRIN index and simply run a 1.5s
> long sequential scan (hitting 700 MB) and a 2.47s hash join for a total
> 8.7s query time:
> https://explain.dalibo.com/plan/46c3191g8a6c1bc7
> If we force the use of the BRIN index using (`SET LOCAL enable_seqscan =
> OFF;`) the same query will now take 50s with 2.5s spent on the bitmap index
> scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan
> (hitting 20 GB of data!):
> https://explain.dalibo.com/plan/7f73bg9172a8b226

That means the planner's cost model correctly preferred a seq scan.

> So I had the following two questions:
>
> 1. Why is the BRIN index systematically worse than a sequential scan,
> even when the table is 1000x larger than the search set, physically
> pre-sorted, dense (fillfactor at 100%) and the search rows are themselves
> sorted?

The table may be dense, but the tuples aren't. You're asking to return
1/1000th of the tuples, across the entire table. Suppose there are ~100
tuples per page, and you need to read about every 10th page. It makes
sense that it's slow to read a large amount of data nonsequentially.
That's why random_page_cost is several times higher than seq_page_cost.

I would expect brin to win if the pages to be accessed were dense rather
than distributed across the whole table.

> 2. Since we only select the "idx" column, why does the BRIN index not
> simply return the searched value if included in one of its ranges?
> Hitting the actual row data stored in the table seems to be unnecessary no?

Because it's necessary to check if the tuple is visible to the current
transaction. It might be from an uncommitted/aborted transaction.

--
Justin
Hello Justin,
Thanks for the quick response!
The table may be dense, but the tuples aren't. You're asking to return
1/1000th of the tuples, across the entire table. Suppose there are ~100
tuples per page, and you need to read about every 10th page. It makes
sense that it's slow to read a large amount of data nonsequentially.
Ah, of course, you're right!
I forgot that BRIN indexes store ranges that are not fully covered by the actual row values, so PostgreSQL has to double-check the tuples (hence the bitmap heap scan)...
Would you thus advise only using BRIN indexes for columns whose values are (1) monotonically increasing but also (2) close to each other?
I guess that in my use case, something like a roaring bitmap would have been perfect, but I don't believe those exist in PostgreSQL by default.
B-trees perform well but are simply too large for my use case (> 400 MB for 20M rows, or roughly 380 MB even with a 100% index fillfactor): I need to index around 6B rows partitioned into roughly 3K buckets, which would result in ~120 GB of B-tree indexes. That seems a bit much for simple existence checks.
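For context, the size comparison above can be reproduced on the test table with something like the following sketch (the B-tree index name is illustrative, not from the thread):

```sql
-- Build a throwaway B-tree for comparison (hypothetical name), then
-- compare the on-disk sizes of both indexes on test_brin.idx.
CREATE INDEX test_btree_idx ON test_brin USING BTREE (idx);

SELECT
    relname,
    pg_size_pretty(pg_relation_size(oid)) AS index_size
FROM
    pg_class
WHERE
    relname IN ('test_btree_idx', 'test_brin_idx');

DROP INDEX test_btree_idx;
```

The BRIN index should come out orders of magnitude smaller, since it stores one summary per block range rather than one entry per tuple.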
Because it's necessary to check if the tuple is visible to the current
transaction. It might be from an uncommited/aborted transaction.
Interesting, that's something I did not consider indeed.
I'm guessing that this is a cost brought by MVCC that you can't get around, no matter the isolation level?
Mickael
On Fri, Feb 24, 2023 at 6:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Fri, Feb 24, 2023 at 05:40:55PM +0100, Mickael van der Beek wrote:
> Hello everyone,
>
> I'm playing around with BRIN indexes so as to get a feel for the feature.
> During my tests, I was unable to make BRIN indexes perform better than a
> sequential scan for queries searching for large value sets (20K values in
> the example down below).
> And now let's query 20K random rows from our 20M total rows:
I didn't try your test, but I think *random* is the problem/explanation.
> By default, this query will not use the BRIN index and simply run a 1.5s
> long sequential scan (hitting 700 MB) and a 2.47s hash join for a total
> 8.7s query time:
> https://explain.dalibo.com/plan/46c3191g8a6c1bc7
> If we force the use of the BRIN index using (`SET LOCAL enable_seqscan =
> OFF;`) the same query will now take 50s with 2.5s spent on the bitmap index
> scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan
> (hitting 20 GB of data!):
> https://explain.dalibo.com/plan/7f73bg9172a8b226
That means the planner's cost model correctly preferred a seq scan.
> So I had the following two questions:
>
> 1. Why is the BRIN index systematically worse than a sequential scan,
> even when the table is x1000 larger than the search set, physically
> pre-sorted, dense (fillfactor at 100%) and the search rows are themselves
> sorted?
The table may be dense, but the tuples aren't. You're asking to return
1/1000th of the tuples, across the entire table. Suppose there are ~100
tuples per page, and you need to read about every 10th page. It makes
sense that it's slow to read a large amount of data nonsequentially.
That's why random_page_cost is several times higher than seq_page_cost.
I would expect brin to win if the pages to be accessed were dense rather
than distributed across the whole table.
> 2. Since we only select the "idx" column, why does the BRIN index not
> simply return the searched value if included in one of its ranges?
> Hitting the actual row data stored in the table seems to be unnecessary no?
Because it's necessary to check if the tuple is visible to the current
transaction. It might be from an uncommitted/aborted transaction.
--
Justin
On Fri, Feb 24, 2023 at 06:51:00PM +0100, Mickael van der Beek wrote:
> Hello Justin,
>
> Thanks for the quick response!
>
> > The table may be dense, but the tuples aren't. You're asking to return
> > 1/1000th of the tuples, across the entire table. Suppose there are ~100
> > tuples per page, and you need to read about every 10th page. It makes
> > sense that it's slow to read a large amount of data nonsequentially.
>
> Ah, of course, you're right!
> I forgot that BRIN indexes store ranges that are not fully covered by
> the row values and that PostgreSQL has to double-check (bitmap heap scan)
> ...
> Would you thus advise only using BRIN indexes for columns whose values are
> (1) monotonically increasing but also (2) close to each other?

It's not important whether they're "rigidly" monotonic (nor "strictly").
What's important is that a query doesn't need to access a large number of
pages.

For example, some of the BRIN indexes that I'm familiar with are created on
a column called "start time", but the table's data tends to be naturally
sorted by "end time", and that's good enough. If someone queries for data
between 12pm and 1pm, there's surely no data for the first 12 hours of the
day's table (because it hadn't happened yet), and there's probably no data
for the last 9+ hours of the day, either, so it's only got to read data for
a 1-2h interval in the middle.

This assumes that the column's data is typically correlated. If the tuples
aren't clustered/"close to each other" then it probably doesn't work well.
I haven't played with brin "multi minmax", though.

> > > 2. Since we only select the "idx" column, why does the BRIN index not
> > > simply return the searched value if included in one of its ranges?
> > > Hitting the actual row data stored in the table seems to be unnecessary no?
> >
> > Because it's necessary to check if the tuple is visible to the current
> > transaction. It might be from an uncommitted/aborted transaction.

Actually, a better explanation is that all the brin scan returns is the
page, and not the tuples.

"BRIN indexes can satisfy queries via regular bitmap index scans, and will
return all tuples in all pages within each range if the summary info stored
by the index is consistent with the query conditions. The query executor is
in charge of rechecking these tuples and discarding those that do not match
the query conditions — in other words, these indexes are lossy."

The index is returning pages where matching tuples *might* be found, after
excluding those pages where it's certain that no tuples are found.

--
Justin
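(Editor's aside: the per-range summaries Justin describes can be inspected directly with the `pageinspect` contrib extension. This sketch assumes superuser access and that page 2 is the first regular BRIN data page, pages 0 and 1 being the metapage and range map.)

```sql
-- Peek at the (block range -> min/max) summaries stored by the BRIN index.
-- Requires the pageinspect contrib extension and superuser privileges.
CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT
    itemoffset,
    blknum,   -- first heap block of the summarized range
    value     -- the {min .. max} summary for "idx" in that range
FROM
    brin_page_items(
        get_raw_page('test_brin_idx', 2),
        'test_brin_idx'
    )
LIMIT 5;
```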
FWIW I don't think the randomness per se is the main issue. The main
problem is that this results in a loop of bitmap index scans, with 20k
loops. This is made worse by the block-range nature of BRIN indexes,
resulting in many more heap block accesses. The table has ~80k pages, but
the bitmapscan plan reads ~2559362. That can easily happen if each loop
matches 100 ranges (out of 700), each having 128 pages.

Making the ranges smaller should help to minimize the amount of pages read
unnecessarily, but the loops are an issue. And the same range may be
scanned again and again, if the range is consistent with multiple values.

Interestingly enough, this is the kind of queries / plans I thought about a
couple weeks ago, which made me look at SK_SEARCHARRAY support for BRIN
indexes. Imagine you rewrote the query to something like:

SELECT * FROM test_brin WHERE idx IN (... literals ...);

That would only scan the table once, reducing the number of heap pages it
has to access. The drawback is that we still have to build the bitmap, and
without the SK_SEARCHARRAY support we just scan the index for each element
again. So better. With the SK_SEARCHARRAY patch [1] this is further
optimized, and for me the query runs a bit faster than a seqscan (not by
much, and it depends on how the data is cached). At least with
pages_per_range=1.

Of course, this won't help unless the query can be rewritten like this. At
least not currently, but I wonder if we could invent some sort of
"pushdown" that'd derive an array of values and push it down into a
parameterized path at once (instead of doing that for each value in a
loop).

regards

[1] https://commitfest.postgresql.org/42/4187/

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
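(Editor's aside: a concrete sketch of the rewrite Tomas suggests, against the test table from earlier. The literal values are placeholders standing in for the 20K-value search set; the timings he reports require his SK_SEARCHARRAY patch and are not reproducible with this alone.)

```sql
-- Single bitmap index scan over an IN list instead of 20K looped scans.
-- The three literals are illustrative placeholders for the real search set.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    idx
FROM
    test_brin
WHERE
    idx IN (1234567890123, 42, 9876543210987);
```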