Bloom filters and the planner / parallel execution - Mailing list pgsql-performance

From Daniel Westermann (DWE)
Subject Bloom filters and the planner / parallel execution
Date
Msg-id GV0P278MB04197233E85D50BB67300DECD27E2@GV0P278MB0419.CHEP278.PROD.OUTLOOK.COM
Whole thread Raw
List pgsql-performance
Hi,

previous discussion at [1].

Initially I wanted to send this to the docs mailing list, but while testing this further I think this is more an issue
withthe planner. 

The example(s) in the docs for bloom filters (https://www.postgresql.org/docs/current/bloom.html) cannot be reproduced
atall (and the size of the indexes are far off the truth as well). 

postgres=# select version();
                                           version
----------------------------------------------------------------------------------------------
 PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)

postgres=# CREATE TABLE tbloom AS
   SELECT
     (random() * 1000000)::int as i1,
     (random() * 1000000)::int as i2,
     (random() * 1000000)::int as i3,
     (random() * 1000000)::int as i4,
     (random() * 1000000)::int as i5,
     (random() * 1000000)::int as i6
   FROM
  generate_series(1,10000000);
SELECT 10000000
postgres=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
CREATE INDEX
postgres=# analyze tbloom ;
ANALYZE
postgres=# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
 pg_size_pretty
----------------
 153 MB
(1 row)

The following statement (from the docs as well) will never use the bloom index but will go for parallel execution:

postgres=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
                                                       QUERY PLAN
 

-------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..127244.10 rows=1 width=24) (actual time=155.618..155.652 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tbloom  (cost=0.00..126244.00 rows=1 width=24) (actual time=145.353..145.354 rows=0
loops=3)
         Filter: ((i2 = 898732) AND (i5 = 123451))
         Rows Removed by Filter: 3333333
 Planning Time: 0.241 ms
 Execution Time: 155.673 ms
(8 rows)

The index will be used if we make parallel execution a lot more costly:

postgres=# set parallel_setup_cost = 100000;
SET
postgres=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
                                                        QUERY PLAN
   

---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbloom  (cost=178436.00..178440.02 rows=1 width=24) (actual time=31.364..31.365 rows=0 loops=1)
   Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
   Rows Removed by Index Recheck: 2424
   Heap Blocks: exact=2372
   ->  Bitmap Index Scan on bloomidx  (cost=0.00..178436.00 rows=1 width=0) (actual time=26.387..26.387 rows=2424
loops=1)
         Index Cond: ((i2 = 898732) AND (i5 = 123451))
 Planning Time: 0.210 ms
 Execution Time: 31.511 ms
(8 rows)

postgres=# reset parallel_setup_cost ;
RESET
postgres=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
                                                       QUERY PLAN
 

-------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..127244.10 rows=1 width=24) (actual time=150.802..150.835 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tbloom  (cost=0.00..126244.00 rows=1 width=24) (actual time=143.902..143.903 rows=0
loops=3)
         Filter: ((i2 = 898732) AND (i5 = 123451))
         Rows Removed by Filter: 3333333
 Planning Time: 0.172 ms
 Execution Time: 150.865 ms
(8 rows)

The reason I am sending this to this list is the execution time. The plan using the index is faster than the plan using
parallelexecution, but it is not the one executed in the default configuration. Am I missing something? 

Here is another example where the index should clearly win:

postgres=# create table tdummy as select i as a, i as b, i as c, i as d, i as e, i as f, i as g, i as h from
generate_series(1, 20000000) i; 
SELECT 20000000
postgres=# CREATE INDEX bloomidx2 ON tdummy USING bloom (a,b,c,d,e,f,g,h);
CREATE INDEX
postgres=# analyze tdummy ;
ANALYZE
postgres=# explain analyze select * from tdummy where a = 1 and b = 1 and c = 1 and d = 1 and e = 1;
                                                       QUERY PLAN
 

-------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..335576.41 rows=1 width=32) (actual time=0.573..437.776 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tdummy  (cost=0.00..334576.31 rows=1 width=32) (actual time=285.408..430.800 rows=0
loops=3)
         Filter: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
         Rows Removed by Filter: 6666666
 Planning Time: 0.552 ms
 Execution Time: 437.817 ms
(8 rows)

postgres=# set parallel_setup_cost = 10000;
SET
postgres=# explain analyze select * from tdummy where a = 1 and b = 1 and c = 1 and d = 1 and e = 1;
                                                       QUERY PLAN
 

-------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=10000.00..344576.41 rows=1 width=32) (actual time=0.439..316.780 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tdummy  (cost=0.00..334576.31 rows=1 width=32) (actual time=206.356..311.511 rows=0
loops=3)
         Filter: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
         Rows Removed by Filter: 6666666
 Planning Time: 0.197 ms
 Execution Time: 316.807 ms
(8 rows)

postgres=# explain analyze select * from tdummy where a = 1 and b = 1 and c = 1 and d = 1 and e = 1;
                                                       QUERY PLAN
 

-------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=10000.00..344572.10 rows=1 width=32) (actual time=1.284..291.015 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tdummy  (cost=0.00..334572.00 rows=1 width=32) (actual time=188.019..284.259 rows=0
loops=3)
         Filter: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
         Rows Removed by Filter: 6666666
 Planning Time: 0.311 ms
 Execution Time: 291.055 ms
(8 rows)

postgres=# set parallel_setup_cost = 100000;
SET
postgres=# explain analyze select * from tdummy where a = 1 and b = 1 and c = 1 and d = 1 and e = 1;
                                                       QUERY PLAN
 

-------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=100000.00..434572.10 rows=1 width=32) (actual time=0.547..292.466 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tdummy  (cost=0.00..334572.00 rows=1 width=32) (actual time=188.160..285.308 rows=0
loops=3)
         Filter: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
         Rows Removed by Filter: 6666666
 Planning Time: 0.275 ms
 Execution Time: 292.500 ms
(8 rows)

postgres=# set parallel_setup_cost = 1000000;
SET
postgres=# explain analyze select * from tdummy where a = 1 and b = 1 and c = 1 and d = 1 and e = 1;
                                                       QUERY PLAN
 

-------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tdummy  (cost=506868.00..506872.02 rows=1 width=32) (actual time=54.085..54.092 rows=1 loops=1)
   Recheck Cond: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
   Rows Removed by Index Recheck: 1
   Heap Blocks: exact=2
   ->  Bitmap Index Scan on bloomidx2  (cost=0.00..506868.00 rows=1 width=0) (actual time=54.047..54.047 rows=2
loops=1)
         Index Cond: ((a = 1) AND (b = 1) AND (c = 1) AND (d = 1) AND (e = 1))
 Planning Time: 0.165 ms
 Execution Time: 54.847 ms
(8 rows)

Regards
Daniel

[1]
https://www.postgresql.org/message-id/flat/ZR0P278MB0122119FAE78721A694C30C8D2340%40ZR0P278MB0122.CHEP278.PROD.OUTLOOK.COM



pgsql-performance by date:

Previous
From: Andrey Stikheev
Date:
Subject: Re: Performance degradation in Index searches with special characters
Next
From: Marcelo Fernandes
Date:
Subject: Adding exclusion constraint in a big table