Thread: BRIN index worse than sequential scan for large search set
Hello everyone,
I'm playing around with BRIN indexes so as to get a feel for the feature.
During my tests, I was unable to make BRIN indexes perform better than a sequential scan for queries searching for large value sets (20K values in the example below).
Creating the table with one single BIGINT required column:
CREATE TABLE
test_brin
(
idx BIGINT NOT NULL
)
WITH
(
fillfactor = 100
)
;
Filling the table with 20 million sorted random BIGINT values:
INSERT INTO
test_brin
(
idx
)
SELECT
CAST(FLOOR(RANDOM() * 9223372036854775806) AS BIGINT)
AS idx
FROM
GENERATE_SERIES(1, 20 * 1000 * 1000, 1)
AS g
ORDER BY
idx ASC
;
Now we cluster the table (even though this shouldn't be needed):
CREATE UNIQUE INDEX test_brin_idx_uniq ON test_brin (idx);
CLUSTER test_brin USING test_brin_idx_uniq;
DROP INDEX test_brin_idx_uniq;
Now we create the BRIN index on what should be a perfectly ordered and dense table:
CREATE INDEX
test_brin_idx
ON
test_brin
USING
BRIN
(
idx
)
;
Let's VACUUM the table just to be safe:
VACUUM test_brin;
VACUUM ANALYSE test_brin;
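As an optional sanity check (not part of the original steps), after the CLUSTER and ANALYZE the planner statistics should report a physical/logical correlation of roughly 1.0 for the column, confirming the table really is stored in sorted order:

```sql
-- Optional sanity check: after CLUSTER + ANALYZE, pg_stats should show
-- a correlation of ~1.0 for "idx", i.e. the heap is physically sorted.
SELECT
    attname,
    correlation
FROM
    pg_stats
WHERE
    tablename = 'test_brin'
    AND attname = 'idx';
```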
And now let's query 20K random rows from our 20M total rows:
EXPLAIN (
ANALYZE,
VERBOSE,
COSTS,
BUFFERS,
TIMING
)
SELECT
tb.idx
FROM
test_brin
AS tb
WHERE
EXISTS (
SELECT
FROM
(
SELECT
idx
FROM
(
SELECT
-- Just trying to break the optimisation that would recognize "idx" as being an indexed column
(idx + (CEIL(RANDOM()) - 1))::BIGINT
AS idx
FROM
test_brin
ORDER BY
RANDOM()
LIMIT
20000
)
AS t
ORDER BY
idx ASC
)
AS q
WHERE
tb.idx = q.idx
)
;
By default, this query will not use the BRIN index and will simply run a 1.5s sequential scan (hitting 700 MB) and a 2.47s hash join, for a total query time of 8.7s:
https://explain.dalibo.com/plan/46c3191g8a6c1bc7
If we force the use of the BRIN index with `SET LOCAL enable_seqscan = OFF;`, the same query now takes 50s, with 2.5s spent on the bitmap index scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan (hitting 20 GB of data!):
https://explain.dalibo.com/plan/7f73bg9172a8b226
Since the bitmap heap scan is taking a long time, let's reduce the `pages_per_range` parameter from its default value of 128 to 32:
CREATE INDEX
test_brin_idx
ON
test_brin
USING
BRIN
(
idx
)
WITH
(
pages_per_range = 32
)
;
The query now takes 25s, half the time we had before, with 9.7s (up from 2.5s) spent on the bitmap index scan (hitting 2.6 GB of data, up from 470 MB) and 10s (down from 42s) on the bitmap heap scan (hitting 4.9 GB of data, down from 20 GB):
https://explain.dalibo.com/plan/64fh5h1daaheeab3
We go a step further and lower the `pages_per_range` parameter to 8 (the other extreme).
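The original message does not show the corresponding DDL for this step; mirroring the earlier index definition, it would presumably look like this (assuming the previous index is dropped first):

```sql
-- Not shown in the original message: recreate the index with the
-- smaller range size (assumes dropping the pages_per_range = 32 index).
DROP INDEX test_brin_idx;

CREATE INDEX
test_brin_idx
ON
test_brin
USING
BRIN
(
idx
)
WITH
(
pages_per_range = 8
)
;
```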
The query now takes 45s, close-ish to the initial time, with 38.5s (up from 2.5s) spent on the bitmap index scan (hitting 9.8 GB of data, up from 470 MB) and 2.6s (down from 42s) on the bitmap heap scan (hitting 1.2 GB of data, down from 20 GB):
https://explain.dalibo.com/plan/431fbde7gb19g6g6
So I had the following two questions:
- Why is the BRIN index systematically worse than a sequential scan, even when the table is 1000x larger than the search set, physically pre-sorted, dense (fillfactor at 100%), and the search rows are themselves sorted? This setup would seem to provide the ideal conditions for a BRIN index to perform well.
- Since we only select the "idx" column, why does the BRIN index not simply return the searched value if it is included in one of its ranges? Hitting the actual row data stored in the table seems unnecessary, no?
- PostgreSQL version: v14
- Memory: 32GB
- CPUs: 8
Mickael
On Fri, Feb 24, 2023 at 05:40:55PM +0100, Mickael van der Beek wrote:
> Hello everyone,
>
> I'm playing around with BRIN indexes so as to get a feel for the feature.
> During my tests, I was unable to make BRIN indexes perform better than a
> sequential scan for queries searching for large value sets (20K values in
> the example below).
> And now let's query 20K random rows from our 20M total rows:

I didn't try your test, but I think *random* is the problem/explanation.

> By default, this query will not use the BRIN index and simply run a 1.5s
> long sequential scan (hitting 700 MB) and a 2.47s hash join for a total
> 8.7s query time:
> https://explain.dalibo.com/plan/46c3191g8a6c1bc7
> If we force the use of the BRIN index using (`SET LOCAL enable_seqscan =
> OFF;`) the same query will now take 50s with 2.5s spent on the bitmap index
> scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan
> (hitting 20 GB of data!):
> https://explain.dalibo.com/plan/7f73bg9172a8b226

That means the planner's cost model correctly preferred a seq scan.

> So I had the following two questions:
>
> 1. Why is the BRIN index systematically worse than a sequential scan,
> even when the table is 1000x larger than the search set, physically
> pre-sorted, dense (fillfactor at 100%) and the search rows are themselves
> sorted?

The table may be dense, but the tuples aren't. You're asking to return
1/1000th of the tuples, across the entire table. Suppose there are ~100
tuples per page, and you need to read about every 10th page. It makes
sense that it's slow to read a large amount of data nonsequentially.
That's why random_page_cost is several times higher than seq_page_cost.

I would expect brin to win if the pages to be accessed were dense rather
than distributed across the whole table.

> 2. Since we only select the "idx" column, why does the BRIN index not
> simply return the searched value if included in one of its ranges?
> Hitting the actual row data stored in the table seems to be unnecessary no?

Because it's necessary to check if the tuple is visible to the current
transaction. It might be from an uncommitted/aborted transaction.

--
Justin
Hello Justin,
Thanks for the quick response!
The table may be dense, but the tuples aren't. You're asking to return
1/1000th of the tuples, across the entire table. Suppose there are ~100
tuples per page, and you need to read about every 10th page. It makes
sense that it's slow to read a large amount of data nonsequentially.
Ah, of course, you're right!
I forgot that BRIN indexes store ranges that are not fully covered by the actual row values, so PostgreSQL has to double-check the tuples (hence the bitmap heap scan)...
Would you thus advise only using BRIN indexes for columns whose values are (1) monotonically increasing but also (2) close to each other?
I guess that in my use case, something like a roaring bitmap would have been perfect, but I don't believe those exist in PostgreSQL by default.
B-trees perform well but are simply too large for my use case (> 400 MB for 20M rows, or roughly 380 MB even with a 100% index fillfactor): I need to index around 6B rows partitioned into roughly 3K buckets, which would result in ~120 GB of B-tree indexes. That seems a bit much for simple existence checks.
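For context, the size comparison above can be reproduced on the test table with something like the following sketch (the B-tree index name is illustrative, not from the thread):

```sql
-- Build a throwaway B-tree for comparison (hypothetical name), then
-- compare the on-disk sizes of both indexes on test_brin.idx.
CREATE INDEX test_btree_idx ON test_brin USING BTREE (idx);

SELECT
    relname,
    pg_size_pretty(pg_relation_size(oid)) AS index_size
FROM
    pg_class
WHERE
    relname IN ('test_btree_idx', 'test_brin_idx');

DROP INDEX test_btree_idx;
```

The BRIN index should come out orders of magnitude smaller, since it stores one summary per block range rather than one entry per tuple.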
Because it's necessary to check if the tuple is visible to the current
transaction. It might be from an uncommited/aborted transaction.
Interesting, that's something I did not consider indeed.
I'm guessing that this is a cost brought by MVCC that you can't get around, no matter the isolation level?
Mickael
On Fri, Feb 24, 2023 at 6:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Fri, Feb 24, 2023 at 05:40:55PM +0100, Mickael van der Beek wrote:
> Hello everyone,
>
> I'm playing around with BRIN indexes so as to get a feel for the feature.
> During my tests, I was unable to make BRIN indexes perform better than a
> sequential scan for queries searching for large value sets (20K values in
> the example down below).
> And now let's query 20K random rows from our 20M total rows:
I didn't try your test, but I think *random* is the problem/explanation.
> By default, this query will not use the BRIN index and simply run a 1.5s
> long sequential scan (hitting 700 MB) and a 2.47s hash join for a total
> 8.7s query time:
> https://explain.dalibo.com/plan/46c3191g8a6c1bc7
> If we force the use of the BRIN index using (`SET LOCAL enable_seqscan =
> OFF;`) the same query will now take 50s with 2.5s spent on the bitmap index
> scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan
> (hitting 20 GB of data!):
> https://explain.dalibo.com/plan/7f73bg9172a8b226
That means the planner's cost model correctly preferred a seq scan.
> So I had the following two questions:
>
> 1. Why is the BRIN index systematically worse than a sequential scan,
> even when the table is x1000 larger than the search set, physically
> pre-sorted, dense (fillfactor at 100%) and the search rows are themselves
> sorted?
The table may be dense, but the tuples aren't. You're asking to return
1/1000th of the tuples, across the entire table. Suppose there are ~100
tuples per page, and you need to read about every 10th page. It makes
sense that it's slow to read a large amount of data nonsequentially.
That's why random_page_cost is several times higher than seq_page_cost.
I would expect brin to win if the pages to be accessed were dense rather
than distributed across the whole table.
> 2. Since we only select the "idx" column, why does the BRIN index not
> simply return the searched value if included in one of its ranges?
> Hitting the actual row data stored in the table seems to be unnecessary no?
Because it's necessary to check if the tuple is visible to the current
transaction. It might be from an uncommitted/aborted transaction.
--
Justin
On Fri, Feb 24, 2023 at 06:51:00PM +0100, Mickael van der Beek wrote:
> Hello Justin,
>
> Thanks for the quick response!
>
> > The table may be dense, but the tuples aren't. You're asking to return
> > 1/1000th of the tuples, across the entire table. Suppose there are ~100
> > tuples per page, and you need to read about every 10th page. It makes
> > sense that it's slow to read a large amount of data nonsequentially.
>
> Ah, of course, you're right!
> I forgot that BRIN indexes store ranges that are not fully covered by
> the row values and that PostgreSQL has to double-check (bitmap heap scan)
> ...
> Would you thus advise only using BRIN indexes for columns whose values are
> (1) monotonically increasing but also (2) close to each other?

It's not important whether they're "rigidly" monotonic (nor "strictly").
What's important is that a query doesn't need to access a large number of
pages.

For example, some of the BRIN indexes that I'm familiar with are created on
a column called "start time", but the table's data tends to be naturally
sorted by "end time", and that's good enough. If someone queries for data
between 12pm and 1pm, there's surely no data for the first 12 hours of the
day's table (because it hadn't happened yet), and there's probably no data
for the last 9+ hours of the day, either, so it's only got to read data for
a 1-2h interval in the middle.

This assumes that the column's data is typically correlated. If the tuples
aren't clustered/"close to each other" then it probably doesn't work well.
I haven't played with brin "multi minmax", though.

> > > 2. Since we only select the "idx" column, why does the BRIN index not
> > > simply return the searched value if included in one of its ranges?
> > > Hitting the actual row data stored in the table seems to be unnecessary no?
> >
> > Because it's necessary to check if the tuple is visible to the current
> > transaction. It might be from an uncommitted/aborted transaction.

Actually, a better explanation is that all the brin scan returns is the
page, and not the tuples.

"BRIN indexes can satisfy queries via regular bitmap index scans, and will
return all tuples in all pages within each range if the summary info stored
by the index is consistent with the query conditions. The query executor is
in charge of rechecking these tuples and discarding those that do not match
the query conditions — in other words, these indexes are lossy."

The index is returning pages where matching tuples *might* be found, after
excluding those pages where it's certain that no tuples are found.

--
Justin
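(Editor's aside: the per-range summaries Justin describes can be inspected directly with the `pageinspect` contrib extension. This sketch assumes superuser access and that page 2 is the first regular BRIN data page, pages 0 and 1 being the metapage and range map.)

```sql
-- Peek at the (block range -> min/max) summaries stored by the BRIN index.
-- Requires the pageinspect contrib extension and superuser privileges.
CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT
    itemoffset,
    blknum,   -- first heap block of the summarized range
    value     -- the {min .. max} summary for "idx" in that range
FROM
    brin_page_items(
        get_raw_page('test_brin_idx', 2),
        'test_brin_idx'
    )
LIMIT 5;
```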
FWIW I don't think the randomness per se is the main issue. The main
problem is that this results in a loop of bitmap index scans, with 20k
loops. This is made worse by the block-range nature of BRIN indexes,
resulting in many more heap block accesses. The table has ~80k pages, but
the bitmapscan plan reads ~2559362. That can easily happen if each loop
matches 100 ranges (out of 700), each having 128 pages.

Making the ranges smaller should help to minimize the amount of pages read
unnecessarily, but the loops are an issue. And the same range may be
scanned again and again, if the range is consistent with multiple values.

Interestingly enough, this is the kind of queries / plans I thought about a
couple weeks ago, which made me look at SK_SEARCHARRAY support for BRIN
indexes. Imagine you rewrote the query to something like:

SELECT * FROM test_brin WHERE idx IN (... literals ...);

That would only scan the table once, reducing the number of heap pages it
has to access. The drawback is that we still have to build the bitmap, and
without the SK_SEARCHARRAY support we just scan the index for each element
again. So better. With the SK_SEARCHARRAY patch [1] this is further
optimized, and for me the query runs a bit faster than a seqscan (not by
much, and it depends on how the data is cached). At least with
pages_per_range=1.

Of course, this won't help unless the query can be rewritten like this. At
least not currently, but I wonder if we could invent some sort of
"pushdown" that'd derive an array of values and push it down into a
parameterized path at once (instead of doing that for each value in a
loop).

regards

[1] https://commitfest.postgresql.org/42/4187/

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
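(Editor's aside: a concrete sketch of the rewrite Tomas suggests, against the test table from earlier. The literal values are placeholders standing in for the 20K-value search set; the timings he reports require his SK_SEARCHARRAY patch and are not reproducible with this alone.)

```sql
-- Single bitmap index scan over an IN list instead of 20K looped scans.
-- The three literals are illustrative placeholders for the real search set.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    idx
FROM
    test_brin
WHERE
    idx IN (1234567890123, 42, 9876543210987);
```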