Thread: Bitmap scan is undercosted?
Hi,

We recently had an issue in production, where a bitmap scan was chosen instead
of an index scan. Despite being 30x slower, the bitmap scan had about the same
cost as the index scan.

I've found some cases where similar issues with bitmap scans were reported before:

https://www.postgresql.org/message-id/flat/1456154321.976561.528310154.6A623C0E%40webmail.messagingengine.com
https://www.postgresql.org/message-id/flat/CA%2BwPC0MRMhF_8fD9dc8%2BQWZQzUvHahPRSv%3DxMtCmsVLSsy-p0w%40mail.gmail.com

I've made a synthetic test, which kind of reproduces the issue:

shared_buffers = 512MB
effective_cache_size = 512MB
work_mem = 100MB

set seq_page_cost = 1.0;
set random_page_cost = 1.5;
set cpu_tuple_cost = 0.01;
set cpu_index_tuple_cost = 0.005;
set cpu_operator_cost = 0.0025;

drop table if exists aaa;
create table aaa as select (id%100)::int num, (id%10=1)::bool flag from generate_series(1, 10000000) id;
create index i1 on aaa (num);
create index i2 on aaa (flag);
analyze aaa;

select relname, reltuples::bigint, relpages::bigint, (reltuples/relpages)::bigint tpp
from pg_class where relname in ('aaa','i1','i2') order by relname;
"aaa";9999985;44248;226
"i1";9999985;27422;365
"i2";9999985;27422;365

I've been running the same query while enabling and disabling different kinds of scans:

1) set enable_bitmapscan = on;  set enable_indexscan = off; set enable_seqscan = off;
2) set enable_bitmapscan = off; set enable_indexscan = on;  set enable_seqscan = off;
3) set enable_bitmapscan = off; set enable_indexscan = off; set enable_seqscan = on;

The query was:
explain (analyze,verbose,costs,buffers)
select count(*) from aaa where num = 1 and flag = true;

Here are the results for PostgreSQL 9.6 (for 9.3 and 10.1 the results are very similar):

1)
Aggregate (cost=24821.70..24821.71 rows=1 width=8) (actual time=184.591..184.591 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=47259
  -> Bitmap Heap Scan on public.aaa (cost=13038.21..24796.22 rows=10189 width=0) (actual time=122.492..178.006 rows=100000 loops=1)
       Output: num, flag
       Recheck Cond: (aaa.num = 1)
       Filter: aaa.flag
       Heap Blocks: exact=44248
       Buffers: shared hit=47259
       -> BitmapAnd (cost=13038.21..13038.21 rows=10189 width=0) (actual time=110.699..110.699 rows=0 loops=1)
            Buffers: shared hit=3011
            -> Bitmap Index Scan on i1 (cost=0.00..1158.94 rows=99667 width=0) (actual time=19.600..19.600 rows=100000 loops=1)
                 Index Cond: (aaa.num = 1)
                 Buffers: shared hit=276
            -> Bitmap Index Scan on i2 (cost=0.00..11873.92 rows=1022332 width=0) (actual time=81.676..81.676 rows=1000000 loops=1)
                 Index Cond: (aaa.flag = true)
                 Buffers: shared hit=2735
Planning time: 0.104 ms
Execution time: 184.988 ms

2)
Aggregate (cost=67939.09..67939.10 rows=1 width=8) (actual time=67.510..67.510 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=44524
  -> Index Scan using i1 on public.aaa (cost=0.44..67910.95 rows=11256 width=0) (actual time=0.020..61.180 rows=100000 loops=1)
       Output: num, flag
       Index Cond: (aaa.num = 1)
       Filter: aaa.flag
       Buffers: shared hit=44524
Planning time: 0.096 ms
Execution time: 67.543 ms

3)
Aggregate (cost=169276.49..169276.50 rows=1 width=8) (actual time=977.063..977.063 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=44248
  -> Seq Scan on public.aaa (cost=0.00..169248.35 rows=11256 width=0) (actual time=0.018..969.294 rows=100000 loops=1)
       Output: num, flag
       Filter: (aaa.flag AND (aaa.num = 1))
       Rows Removed by Filter: 9900000
       Buffers: shared hit=44248
Planning time: 0.099 ms
Execution time: 977.094 ms

The bitmap scan version runs more than twice as slow as the index scan
version, while being costed at less than half as much. I've tried increasing
cpu_tuple_cost and cpu_index_tuple_cost; this behavior remains even after a 6x
increase in the values, although the difference in costs becomes much smaller.
After increasing the settings more than 6x, PostgreSQL switches to a different
plan for bitmap scans, so it's hard to draw conclusions at that point.
Could such cases be fixed by tuning cost settings, or is this just how
PostgreSQL estimates bitmap scans, something that can't be fixed without
modifying the optimizer? Or am I missing something, and this is the expected
behavior?

Thoughts?

Regards,
Vitaliy
On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
> We recently had an issue in production, where a bitmap scan was chosen
> instead of an index scan. Despite being 30x slower, the bitmap scan had
> about the same cost as the index scan.

Me too, see also:
https://www.postgresql.org/message-id/flat/CAH2-WzkRTggiy_LKQUu-oViyp6y_Hhz-a1yWacPy4tcYWV1HoA%40mail.gmail.com#CAH2-WzkRTggiy_LKQUu-oViyp6y_Hhz-a1yWacPy4tcYWV1HoA@mail.gmail.com

> drop table if exists aaa;
> create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
> generate_series(1, 10000000) id;
> create index i1 on aaa (num);
> create index i2 on aaa (flag);
> analyze aaa;
>
> select relname, reltuples::bigint, relpages::bigint,
> (reltuples/relpages)::bigint tpp from pg_class where relname
> in('aaa','i1','i2') order by relname;
> "aaa";9999985;44248;226
> "i1";9999985;27422;365
> "i2";9999985;27422;365
>
> The query was:
> explain (analyze,verbose,costs,buffers)
> select count(*) from aaa where num = 1 and flag = true;

Note that id%100==1 implies flag='t', so the planner anticipates retrieving
fewer rows than it will ultimately read, probably by 2x. It makes sense that
that causes the index scan to be more expensive than expected, but that's only
somewhat important, since there's no joins involved.

The reason why it's more than a bit slower is due to the "density" [0] of the
heap pages read. num=1 is more selective than flag=true, so it scans i1,
reading 1% of the whole table. But it's not reading the first 1% or some other
contiguous 1% of the table; it reads tuples evenly distributed across the
entire table (226*0.01 = ~2 rows of each page). Since the index was created
after the INSERT, the repeated keys (logical value: id%100) are read in
physical order on the heap, so this is basically doing a seq scan, but with
the additional overhead of reading the index, and maybe doing an fseek()
before each/some read()s, too. You could confirm that by connecting strace to
the backend before starting the query.

Since you did that using % and with indices added after the INSERT, you can't
improve it by reindexing (as I was able to for my case). That's an elegant
test case, so thanks.

I think shared_buffers=512MB is just small enough for this test to be painful
for 1e7 rows. I see the table+index is 559MB.

I don't know if that's really similar to your production use case, but I would
recommend trying BRIN indices, which always require a bitmap scan. Note that
some things (like max()) that can use a btree index cannot use BRIN. PG10.1
has WITH autosummarize, which was important for our use, since we rarely do
UPDATEs or DELETEs so tables are rarely vacuumed (only analyzed).

Justin

[0] I'm borrowing Jeff's language from here:
https://www.postgresql.org/message-id/CAMkU%3D1xwGn%2BO0jhKsvrUrbW9MQp1YX0iB4Y-6h1mEz0ffBxK-Q%40mail.gmail.com
"density" wasn't our problem, but it's a perfect description of this issue.
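As a quick sanity check of the density argument above, one can push the
~2-rows-per-page arithmetic one step further (numbers taken from the pg_class
output earlier in the thread; the uniform-spread formula here is a textbook
approximation, not anything PostgreSQL computes verbatim):

```python
total_pages = 44248   # relpages of aaa
rows_per_page = 226   # reltuples / relpages
selectivity = 0.01    # num = 1 matches 1% of the rows

# If matches are spread uniformly across the heap, the probability that a
# given page contains at least one matching tuple is:
p_page_touched = 1 - (1 - selectivity) ** rows_per_page

pages_touched = total_pages * p_page_touched
# Roughly 90% of the heap pages hold at least one match, so fetching just
# 1% of the rows still visits almost the whole table -- which is why the
# "1% index scan" behaves like a sequential scan with extra overhead.
```

(In the actual EXPLAIN output it is even worse: Heap Blocks: exact=44248,
i.e. every page, since ~2.26 matches per page leave essentially no page
empty.)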
On 01/12/2017 20:34, Justin Pryzby wrote:
> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
>> We recently had an issue in production, where a bitmap scan was chosen
>> instead of an index scan. Despite being 30x slower, the bitmap scan had
>> about the same cost as the index scan.
> Me too, see also:
> https://www.postgresql.org/message-id/flat/CAH2-WzkRTggiy_LKQUu-oViyp6y_Hhz-a1yWacPy4tcYWV1HoA%40mail.gmail.com#CAH2-WzkRTggiy_LKQUu-oViyp6y_Hhz-a1yWacPy4tcYWV1HoA@mail.gmail.com
>
>> drop table if exists aaa;
>> create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
>> generate_series(1, 10000000) id;
>> create index i1 on aaa (num);
>> create index i2 on aaa (flag);
>> analyze aaa;
>>
>> select relname, reltuples::bigint, relpages::bigint,
>> (reltuples/relpages)::bigint tpp from pg_class where relname
>> in('aaa','i1','i2') order by relname;
>> "aaa";9999985;44248;226
>> "i1";9999985;27422;365
>> "i2";9999985;27422;365
>>
>> The query was:
>> explain (analyze,verbose,costs,buffers)
>> select count(*) from aaa where num = 1 and flag = true;
> Note that id%100==1 implies flag='t', so the planner anticipates retrieving
> fewer rows than it will ultimately read, probably by 2x. It makes sense that
> causes the index scan to be more expensive than expected, but that's only
> somewhat important, since there's no joins involved.

I don't think the planner is smart enough to account for correlation between
values in different columns. When different values are used in the filter
(num=2, num=39, num=74), the query actually runs faster, while still being
about twice as slow as with an index scan. But the cost does not change much.
It jumps up and down for different values, but it's still close to the
initial value.
Aggregate (cost=24239.02..24239.03 rows=1 width=8) (actual time=105.239..105.239 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=3011
  -> Bitmap Heap Scan on public.aaa (cost=12812.05..24214.48 rows=9816 width=0) (actual time=105.236..105.236 rows=0 loops=1)
       Output: num, flag
       Recheck Cond: (aaa.num = 39)
       Filter: aaa.flag
       Buffers: shared hit=3011
       -> BitmapAnd (cost=12812.05..12812.05 rows=9816 width=0) (actual time=105.157..105.157 rows=0 loops=1)
            Buffers: shared hit=3011
            -> Bitmap Index Scan on i1 (cost=0.00..1134.94 rows=97667 width=0) (actual time=15.725..15.725 rows=100000 loops=1)
                 Index Cond: (aaa.num = 39)
                 Buffers: shared hit=276
            -> Bitmap Index Scan on i2 (cost=0.00..11671.96 rows=1005003 width=0) (actual time=77.920..77.920 rows=1000000 loops=1)
                 Index Cond: (aaa.flag = true)
                 Buffers: shared hit=2735
Planning time: 0.104 ms
Execution time: 105.553 ms

Aggregate (cost=65785.99..65786.00 rows=1 width=8) (actual time=48.587..48.587 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=44524
  -> Index Scan using i1 on public.aaa (cost=0.44..65761.45 rows=9816 width=0) (actual time=48.583..48.583 rows=0 loops=1)
       Output: num, flag
       Index Cond: (aaa.num = 39)
       Filter: aaa.flag
       Rows Removed by Filter: 100000
       Buffers: shared hit=44524
Planning time: 0.097 ms
Execution time: 48.620 ms

> The reason why it's more than a bit slower is due to the "density" [0] of the
> heap pages read. num=1 is more selective than flag=true, so it scans i1,
> reading 1% of the whole table. But it's not reading the first 1% or
> some other 1% of the table, it reads tuples evenly distributed across the
> entire table (226*0.01 = ~2 rows of each page). Since the index was created
> after the INSERT, the repeated keys (logical value: id%100) are read in
> physical order on the heap, so this is basically doing a seq scan, but with the
> additional overhead of reading the index, and maybe doing an fseek() before
> each/some read()s, too. You could confirm that by connecting strace to the
> backend before starting the query.
>
> Since you did that using % and with indices added after the INSERT, you can't
> improve it by reindexing (as I was able to for my case). That's an elegant
> test case, so thanks.
>
> I think shared_buffers=512MB is just small enough for this test to be painful
> for 1e7 rows. I see the table+index is 559MB.

 table | ~count  |  size  |  toast  |  idx   | size + toast + idx
-------+---------+--------+---------+--------+--------------------
 aaa   | 9999994 | 346 MB | 0 bytes | 428 MB | 774 MB

But the plan says all buffers are "shared hit", and none "read", so that's
probably not an issue.

> I don't know if that's really similar to your production use case, but I would
> recommend trying BRIN indices, which always require a bitmap scan. Note that
> some things (like max()) that can use an btree index cannot use brin. PG10.1
> has WITH autosummarize, which was important for our use, since we rarely do
> UPDATEs or DELETEs so tables are rarely vacuumed (only analyzed).

Yes, BRIN indexes should be beneficial for our case, because there is a date
column which grows over time (not strictly regularly, but still).
Unfortunately, we're still migrating our databases from 9.3 to 9.6. Anyway,
thanks for the advice.

> Justin
>
> [0] I'm borrowing Jeff's language from here:
> https://www.postgresql.org/message-id/CAMkU%3D1xwGn%2BO0jhKsvrUrbW9MQp1YX0iB4Y-6h1mEz0ffBxK-Q%40mail.gmail.com
> "density" wasn't our problem, but it's a perfect description of this issue.
I tried to reproduce this issue and couldn't, under PG95 and 10.1:

On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
> > We recently had an issue in production, where a bitmap scan was chosen
> > instead of an index scan. Despite being 30x slower, the bitmap scan had
> > about the same cost as the index scan.
>
> > drop table if exists aaa;
> > create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
> > generate_series(1, 10000000) id;
> > create index i1 on aaa (num);
> > create index i2 on aaa (flag);
> > analyze aaa;

What is:
effective_io_concurrency
max_parallel_workers_per_gather (I gather you don't have this)

Note:
postgres=# SELECT correlation FROM pg_stats WHERE tablename='aaa' AND attname='num';
correlation | 0.00710112

..so this is different from the issue corrected by the patch I created while
testing.

> Note that id%100==1 implies flag='t', so the planner anticipates retrieving
> fewer rows than it will ultimately read, probably by 2x. It makes sense that
> causes the index scan to be more expensive than expected, but that's only
> somewhat important, since there's no joins involved.
I changed the query from COUNT(*) to * for easier-to-read explain:

CREATE TABLE aaa AS SELECT (id%100)::int num, (id%10=1)::bool flag FROM generate_series(1, 10000000) id;
CREATE INDEX i1 ON aaa(num);
CREATE INDEX i2 ON aaa (flag);
ANALYZE VERBOSE aaa;
EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;

Bitmap Heap Scan on public.aaa (cost=20652.98..45751.75 rows=10754 width=5) (actual time=85.314..185.107 rows=100000 loops=1)
  -> BitmapAnd (cost=20652.98..20652.98 rows=10754 width=0) (actual time=163.220..163.220 rows=0 loops=1)
       -> Bitmap Index Scan on i1 (cost=0.00..1965.93 rows=106333 width=0) (actual time=26.943..26.943 rows=100000 loops=1)
       -> Bitmap Index Scan on i2 (cost=0.00..18681.42 rows=1011332 width=0) (actual time=133.804..133.804 rows=1000000 loops=1)

..which is what's wanted with no planner hints (PG10.1 here).

Same on PG95:

postgres=# EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;

Bitmap Heap Scan on public.aaa (cost=19755.64..43640.32 rows=9979 width=5) (actual time=230.017..336.583 rows=100000 loops=1)
  -> BitmapAnd (cost=19755.64..19755.64 rows=9979 width=0) (actual time=205.242..205.242 rows=0 loops=1)
       -> Bitmap Index Scan on i1 (cost=0.00..1911.44 rows=103334 width=0) (actual time=24.911..24.911 rows=100000 loops=1)
       -> Bitmap Index Scan on i2 (cost=0.00..17838.96 rows=965670 width=0) (actual time=154.237..154.237 rows=1000000 loops=1)

The rowcount is off, but not a critical issue without a join.

Justin
On 02/12/2017 01:11, Justin Pryzby wrote:
> I tried to reproduce this issue and couldn't, under PG95 and 10.1:
>
> On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
>> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
>>> We recently had an issue in production, where a bitmap scan was chosen
>>> instead of an index scan. Despite being 30x slower, the bitmap scan had
>>> about the same cost as the index scan.
>>> drop table if exists aaa;
>>> create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
>>> generate_series(1, 10000000) id;
>>> create index i1 on aaa (num);
>>> create index i2 on aaa (flag);
>>> analyze aaa;
> What is:
> effective_io_concurrency
> max_parallel_workers_per_gather (I gather you don't have this)

effective_io_concurrency = 0
max_parallel_workers_per_gather = 0

Did you notice random_page_cost = 1.5 ?

For this test I'm using SSD and Windows (if that matters). On production we
also use SSD, hence the lower random_page_cost. But with the default
random_page_cost=4.0, the difference in cost between the index scan plan and
the bitmap scan plan is even bigger.

> Note:
> postgres=# SELECT correlation FROM pg_stats WHERE tablename='aaa' AND attname='num';
> correlation | 0.00710112
>
> ..so this is different from the issue corrected by the patch I created while
> testing.
>
>> Note that id%100==1 implies flag='t', so the planner anticipates retrieving
>> fewer rows than it will ultimately read, probably by 2x. It makes sense that
>> causes the index scan to be more expensive than expected, but that's only
>> somewhat important, since there's no joins involved.
> I changed the query from COUNT(*) TO * for easier to read explain:
>
> CREATE TABLE aaa AS SELECT (id%100)::int num, (id%10=1)::bool flag FROM generate_series(1, 10000000) id;
> CREATE INDEX i1 ON aaa(num);
> CREATE INDEX i2 ON aaa (flag);
> ANALYZE VERBOSE aaa;
> EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;
> Bitmap Heap Scan on public.aaa (cost=20652.98..45751.75 rows=10754 width=5) (actual time=85.314..185.107 rows=100000 loops=1)
>   -> BitmapAnd (cost=20652.98..20652.98 rows=10754 width=0) (actual time=163.220..163.220 rows=0 loops=1)
>        -> Bitmap Index Scan on i1 (cost=0.00..1965.93 rows=106333 width=0) (actual time=26.943..26.943 rows=100000 loops=1)
>        -> Bitmap Index Scan on i2 (cost=0.00..18681.42 rows=1011332 width=0) (actual time=133.804..133.804 rows=1000000 loops=1)
>
> ..which is what's wanted with no planner hints (PG10.1 here).

Yes, that's what you get without planner hints, but it's strange to get this
plan when there is another one which runs 2-3 times faster, yet is estimated
to cost more than twice as much as the one with bitmap scans:

# set enable_bitmapscan = off; set enable_indexscan = on; set enable_seqscan = off;
# explain analyze select * from aaa where num = 1 and flag = true;
Index Scan using i1 on aaa (cost=0.44..66369.81 rows=10428 width=5) (actual time=0.020..57.765 rows=100000 loops=1)

vs.
# set enable_bitmapscan = on; set enable_indexscan = off; set enable_seqscan = off;
# explain analyze select * from aaa where num = 1 and flag = true;
Bitmap Heap Scan on aaa (cost=13099.33..25081.40 rows=10428 width=5) (actual time=122.137..182.811 rows=100000 loops=1)
  -> BitmapAnd (cost=13099.33..13099.33 rows=10428 width=0) (actual time=110.168..110.168 rows=0 loops=1)
       -> Bitmap Index Scan on i1 (cost=0.00..1181.44 rows=101667 width=0) (actual time=20.845..20.845 rows=100000 loops=1)
       -> Bitmap Index Scan on i2 (cost=0.00..11912.43 rows=1025666 width=0) (actual time=80.323..80.323 rows=1000000 loops=1)

> Same on PG95:
> postgres=# EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;
> Bitmap Heap Scan on public.aaa (cost=19755.64..43640.32 rows=9979 width=5) (actual time=230.017..336.583 rows=100000 loops=1)
>   -> BitmapAnd (cost=19755.64..19755.64 rows=9979 width=0) (actual time=205.242..205.242 rows=0 loops=1)
>        -> Bitmap Index Scan on i1 (cost=0.00..1911.44 rows=103334 width=0) (actual time=24.911..24.911 rows=100000 loops=1)
>        -> Bitmap Index Scan on i2 (cost=0.00..17838.96 rows=965670 width=0) (actual time=154.237..154.237 rows=1000000 loops=1)
>
> The rowcount is off, but not a critical issue without a join.
>
> Justin
On Fri, Dec 01, 2017 at 05:11:04PM -0600, Justin Pryzby wrote:
> I tried to reproduce this issue and couldn't, under PG95 and 10.1:

I'm embarrassed to say that I misread your message, despite your amply clear
subject. You're getting a bitmap scan but you'd prefer to get an index scan.
I anticipated the opposite problem (which is what I've had issues with myself).

> On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
> > On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
> > > We recently had an issue in production, where a bitmap scan was chosen
> > > instead of an index scan. Despite being 30x slower, the bitmap scan had
> > > about the same cost as the index scan.
>
> Note:
> postgres=# SELECT correlation FROM pg_stats WHERE tablename='aaa' AND attname='num';
> correlation | 0.00710112
>
> ..so this is different from the issue corrected by the patch I created while
> testing.

Actually, the fact that the table is "not correlated" on the "num" column is
maybe the primary reason why PG avoids using an index scan. It (more or less
correctly) deduces that it's going to have to "read" a large fraction of the
pages (even if only to process a small fraction of the rows), which is costly,
except it's all cached.. In your case, that overly penalizes the index scan.

This is cost_index() and cost_bitmap_heap_scan() in costsize.c. Since the
index is uncorrelated, it's returning something close to max_IO_cost. It
looks like effective_cache_size only affects index_pages_fetched().

I'm going to try to dig some more into it. Maybe there's evidence to
re-evaluate one of these:

cost_index()
| run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

or

cost_bitmap_heap_scan()
| cost_per_page = spc_random_page_cost -
|                 (spc_random_page_cost - spc_seq_page_cost)
|                 * sqrt(pages_fetched / T);

Justin
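The effect of the two formulas quoted above can be sketched numerically. The
helper names below are invented for illustration; the real code in costsize.c
has many more terms, and the min/max IO costs here are made-up inputs:

```python
import math

def index_io_cost(min_io_cost, max_io_cost, correlation):
    # cost_index(): interpolate between the best case (perfectly correlated
    # index, min_IO_cost) and the worst case (uncorrelated, max_IO_cost)
    # using the squared index correlation.
    csquared = correlation * correlation
    return max_io_cost + csquared * (min_io_cost - max_io_cost)

def bitmap_cost_per_page(random_page_cost, seq_page_cost, pages_fetched, T):
    # cost_bitmap_heap_scan(): per-page charge slides from random_page_cost
    # toward seq_page_cost as the fetched fraction of the table grows.
    return random_page_cost - (random_page_cost - seq_page_cost) \
        * math.sqrt(pages_fetched / T)

# With correlation ~0.0071 (as reported for aaa.num), csquared ~5e-5, so the
# index scan is charged essentially the full worst-case IO cost:
io = index_io_cost(min_io_cost=1000.0, max_io_cost=60000.0, correlation=0.0071)

# Meanwhile the bitmap heap scan, fetching every page (44248 of 44248), is
# charged only seq_page_cost per page:
per_page = bitmap_cost_per_page(1.5, 1.0, pages_fetched=44248, T=44248)
```

So when everything is actually cached, the index scan is penalized with
near-worst-case random IO that never happens, while the bitmap scan's heap
access is costed like a cheap sequential pass.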
On Fri, Dec 1, 2017 at 3:54 PM, Vitaliy Garnashevich <
vgarnashevich@gmail.com> wrote:
> On 02/12/2017 01:11, Justin Pryzby wrote:
>
>> I tried to reproduce this issue and couldn't, under PG95 and 10.1:
>>
>> On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
>>
>>> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
>>>
>>>> We recently had an issue in production, where a bitmap scan was chosen
>>>> instead of an index scan. Despite being 30x slower, the bitmap scan had
>>>> about the same cost as the index scan.
>>>> drop table if exists aaa;
>>>> create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
>>>> generate_series(1, 10000000) id;
>>>> create index i1 on aaa (num);
>>>> create index i2 on aaa (flag);
>>>> analyze aaa;
>>>>
>>> What is:
>> effective_io_concurrency
>> max_parallel_workers_per_gather (I gather you don't have this)
>>
> effective_io_concurrency = 0
> max_parallel_workers_per_gather = 0
>
> Did you notice random_page_cost = 1.5 ?
>
For the aaa.num = 39 case, the faster index scan actually does hit 15 times
more buffers than the bitmap scan does. While 1.5 is a lot lower than 4.0,
it is still much higher than the true cost of reading a page from the
buffer cache. This is why the index scan is getting punished. You could
lower random_page_cost and seq_page_cost to 0, to remove those
considerations. (I'm not saying you should do this on your production
system, but rather you should do it as a way to investigate the issue. But
it might make sense on production as well)
> For this test I'm using SSD and Windows (if that matters). On production
> we also use SSD, hence lower random_page_cost. But with the default
> random_page_cost=4.0, the difference in cost between the index scan plan
> and the bitmap scan plan is even bigger.
Since it is all shared buffers hits, it doesn't matter if you have SSD for
this particular test case.
Cheers,
Jeff
On Sat, Dec 02, 2017 at 01:54:09AM +0200, Vitaliy Garnashevich wrote:
> On 02/12/2017 01:11, Justin Pryzby wrote:
> > ..which is what's wanted with no planner hints (PG10.1 here).
> Yes, that's what you get without planner hints, but it's strange to get this
> plan, when there is another one, which runs 2-3 times faster, but happens to
> be estimated to be twice more costly than the one with bitmap scans:
>
> # set enable_bitmapscan = off; set enable_indexscan = on; set enable_seqscan = off;
> # explain analyze select * from aaa where num = 1 and flag = true;
> Index Scan using i1 on aaa (cost=0.44..66369.81 rows=10428 width=5) (actual time=0.020..57.765 rows=100000 loops=1)
>
> vs.
>
> # set enable_bitmapscan = on; set enable_indexscan = off; set enable_seqscan = off;
> # explain analyze select * from aaa where num = 1 and flag = true;
> Bitmap Heap Scan on aaa (cost=13099.33..25081.40 rows=10428 width=5) (actual time=122.137..182.811 rows=100000 loops=1)

I was able to get an index plan with:

SET random_page_cost=1; SET cpu_index_tuple_cost=.04; -- default: 0.005; see selfuncs.c

postgres=# EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;
Index Scan using i1 on public.aaa (cost=0.43..50120.71 rows=10754 width=5) (actual time=0.040..149.580 rows=100000 loops=1)

Or with:

SET random_page_cost=1; SET cpu_operator_cost=0.03; -- default: 0.0025; see cost_bitmap_tree_node()

EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;
Index Scan using i1 on public.aaa (cost=5.22..49328.00 rows=10754 width=5) (actual time=0.051..109.082 rows=100000 loops=1)

Or a combination trying to minimize the cost of the index scan:

postgres=# SET random_page_cost=1; SET cpu_index_tuple_cost=.0017; SET cpu_operator_cost=0.03;
EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;
Index Scan using i1 on public.aaa (cost=5.22..48977.10 rows=10754 width=5) (actual time=0.032..86.883 rows=100000 loops=1)

Not sure if that's reasonable, but maybe it helps to understand.

Justin
On 02/12/2017 07:51, Jeff Janes wrote:
> On Fri, Dec 1, 2017 at 3:54 PM, Vitaliy Garnashevich
> > wrote:
>
> On 02/12/2017 01:11, Justin Pryzby wrote:
>
> I tried to reproduce this issue and couldn't, under PG95 and 10.1:
>
> On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
>
> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy
> Garnashevich wrote:
>
> We recently had an issue in production, where a bitmap
> scan was chosen
> instead of an index scan. Despite being 30x slower,
> the bitmap scan had
> about the same cost as the index scan.
> drop table if exists aaa;
> create table aaa as select (id%100)::int num,
> (id%10=1)::bool flag from
> generate_series(1, 10000000) id;
> create index i1 on aaa (num);
> create index i2 on aaa (flag);
> analyze aaa;
>
> What is:
> effective_io_concurrency
> max_parallel_workers_per_gather (I gather you don't have this)
>
> effective_io_concurrency = 0
> max_parallel_workers_per_gather = 0
>
> Did you notice random_page_cost = 1.5 ?
>
>
> For the aaa.num = 39 case, the faster index scan actually does hit 15
> times more buffers than the bitmap scan does. While 1.5 is lot lower
> than 4.0, it is still much higher than the true cost of reading a page
> from the buffer cache. This why the index scan is getting punished.
> You could lower random_page_cost and seq_page_cost to 0, to remove
> those considerations. (I'm not saying you should do this on your
> production system, but rather you should do it as a way to investigate
> the issue. But it might make sense on production as well)
seq_page_cost = 1.0
random_page_cost = 1.0
explain analyze select * from aaa where num = 2 and flag = true;
Bitmap Heap Scan on aaa (cost=11536.74..20856.96 rows=10257 width=5)
(actual time=108.338..108.338 rows=0 loops=1)
-> BitmapAnd (cost=11536.74..11536.74 rows=10257 width=0) (actual
time=108.226..108.226 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..1025.43 rows=100000
width=0) (actual time=18.563..18.563 rows=100000 loops=1)
-> Bitmap Index Scan on i2 (cost=0.00..10505.93 rows=1025666
width=0) (actual time=78.493..78.493 rows=1000000 loops=1)
Index Scan using i1 on aaa (cost=0.44..44663.58 rows=10257 width=5)
(actual time=51.264..51.264 rows=0 loops=1)
Here I've used the filter num = 2, which produces rows=0 at BitmapAnd,
and thus avoids a lot of work at "Bitmap Heap Scan" node, while still
leaving about the same proportion between bitmap and index - the bitmap is
twice as slow but half the cost. It does not matter much which value is
used for the filter, as long as it's other than num = 1.
seq_page_cost = 0.0
random_page_cost = 0.0
explain analyze select * from aaa where num = 2 and flag = true;
Bitmap Heap Scan on aaa (cost=753.00..2003.00 rows=10257 width=5)
(actual time=82.212..82.212 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..750.43 rows=100000 width=0)
(actual time=17.401..17.401 rows=100000 loops=1)
Index Scan using i1 on aaa (cost=0.44..1750.43 rows=10257 width=5)
(actual time=49.766..49.766 rows=0 loops=1)
The bitmap plan was reduced to use only one bitmap scan, and finally it
costs more than the index plan. But I doubt that the settings
seq_page_cost = random_page_cost = 0.0 should actually be used. Probably
it should instead be something like 1.0/1.0 or 1.0/1.1, with the other
costs increased, to give them more weight.
# x4 tuple/operator costs - bitmap scan still a bit cheaper
set seq_page_cost = 1.0;
set random_page_cost = 1.0;
set cpu_tuple_cost = 0.04;
set cpu_index_tuple_cost = 0.02;
set cpu_operator_cost = 0.01;
Bitmap Heap Scan on aaa (cost=36882.97..46587.82 rows=10257 width=5)
(actual time=106.045..106.045 rows=0 loops=1)
-> BitmapAnd (cost=36882.97..36882.97 rows=10257 width=0) (actual
time=105.966..105.966 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..3276.74 rows=100000
width=0) (actual time=15.977..15.977 rows=100000 loops=1)
-> Bitmap Index Scan on i2 (cost=0.00..33584.72 rows=1025666
width=0) (actual time=79.208..79.208 rows=1000000 loops=1)
Index Scan using i1 on aaa (cost=1.74..49914.89 rows=10257 width=5)
(actual time=50.144..50.144 rows=0 loops=1)
# x5 tuple/operator costs - switched to single bitmap index scan, but
now it costs more than the index scan
set seq_page_cost = 1.0;
set random_page_cost = 1.0;
set cpu_tuple_cost = 0.05;
set cpu_index_tuple_cost = 0.025;
set cpu_operator_cost = 0.0125;
Bitmap Heap Scan on aaa (cost=4040.00..54538.00 rows=10257 width=5)
(actual time=82.338..82.338 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..4027.18 rows=100000 width=0)
(actual time=19.541..19.541 rows=100000 loops=1)
Index Scan using i1 on aaa (cost=2.17..51665.32 rows=10257 width=5)
(actual time=49.545..49.545 rows=0 loops=1)
I've also tried seq_page_cost = 1.0, random_page_cost = 1.1, but that
would require more than a 10x increase in tuple/operator costs to make
the bitmap plan more costly than the index plan.
>
>
> For this test I'm using SSD and Windows (if that matters). On
> production we also use SSD, hence lower random_page_cost. But with
> the default random_page_cost=4.0, the difference in cost between
> the index scan plan and the bitmap scan plan is even bigger.
>
>
> Since it is all shared buffers hits, it doesn't matter if you have SSD
> for this particular test case.
Agreed. I was just trying to justify using a random_page_cost lower than
the usual ~2.0.
> Cheers,
>
> Jeff
On Fri, Dec 1, 2017 at 11:08 PM, Vitaliy Garnashevich <
vgarnashevich@gmail.com> wrote:
>
>
> seq_page_cost = 0.0
> random_page_cost = 0.0
> explain analyze select * from aaa where num = 2 and flag = true;
>
> Bitmap Heap Scan on aaa (cost=753.00..2003.00 rows=10257 width=5) (actual
> time=82.212..82.212 rows=0 loops=1)
> -> Bitmap Index Scan on i1 (cost=0.00..750.43 rows=100000 width=0)
> (actual time=17.401..17.401 rows=100000 loops=1)
>
> Index Scan using i1 on aaa (cost=0.44..1750.43 rows=10257 width=5)
> (actual time=49.766..49.766 rows=0 loops=1)
>
> The bitmap plan was reduced to use only one bitmap scan, and finally it
> costs more than the index plan.
>
Right, so there is a cpu costing problem (which could only be fixed by
hacking postgresql and recompiling it), but it is a much smaller problem
than the IO cost being inaccurate due to the high hit rate. Fixing the
CPU costing problem is unlikely to make a difference to your real query.
If you set the page costs to zero, what happens to your real query?
> But I doubt that the settings seq_page_cost = random_page_cost = 0.0
> should actually be used.
>
Why not? If your production server really has everything in memory during
normal operation, that is the correct course of action. If you ever
restart the server, then you could have some unpleasant time getting it
back up to speed again, but pg_prewarm could help with that.
> Probably it should be instead something like 1.0/1.0 or 1.0/1.1, but other
> costs increased, to have more weight.
>
This doesn't make any sense to me. Halving the page costs is
mathematically the same as doubling all the other constants. But the first
way of doing things says what you are doing, and the second way is an
obfuscation of what you are doing.
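Jeff's scaling point can be checked with a toy cost model. This is only a sketch: the two-term formula and the page/tuple counts below are illustrative numbers, not the planner's actual costing code.

```python
# Toy illustration (not the real planner code): treat a plan's total cost
# as a weighted sum of page fetches and per-tuple CPU work.  Multiplying
# every cost constant by the same factor scales all plan costs equally,
# so the planner's *choice* between plans cannot change.

def plan_cost(pages, tuples, page_cost, cpu_tuple_cost):
    return pages * page_cost + tuples * cpu_tuple_cost

# Two hypothetical plans: (pages touched, tuples processed)
plans = [(44248, 10_000_000), (39949, 99_631)]

halved_pages = [plan_cost(p, t, 0.5, 0.01) for p, t in plans]
doubled_cpu = [plan_cost(p, t, 1.0, 0.02) for p, t in plans]

# Doubling the CPU constant gives exactly 2x the cost of halving the
# page constant, for every plan -- so the cheapest plan is the same.
assert all(abs(d - 2 * h) < 1e-9 for d, h in zip(doubled_cpu, halved_pages))
winner_a = min(range(len(plans)), key=lambda i: halved_pages[i])
winner_b = min(range(len(plans)), key=lambda i: doubled_cpu[i])
assert winner_a == winner_b  # same plan wins under both settings
```

In other words, only the ratios between the cost constants matter for plan choice, which is why stating the change as "halve the page costs" is the clearer formulation.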
>
> # x4 tuple/operator costs - bitmap scan still a bit cheaper
> set seq_page_cost = 1.0;
> set random_page_cost = 1.0;
> set cpu_tuple_cost = 0.04;
> set cpu_index_tuple_cost = 0.02;
> set cpu_operator_cost = 0.01;
>
If you really want to target the plan with the BitmapAnd, you should
increase cpu_index_tuple_cost and/or cpu_operator_cost but not increase
cpu_tuple_cost. That is because the unselective bitmap index scan does
not incur any cpu_tuple_cost, but does incur index_tuple and operator
costs. Unfortunately all other index scans in the system will also be
skewed by such a change if you make the change system-wide.
Incidentally, the "actual rows" field of BitmapAnd is always zero. That
field is not implemented for that node type.
Why do you have an index on flag in the first place? What does the index
accomplish, other than enticing the planner into bad plans? I don't know
how this translates back into your real query, but dropping that index
should be considered. Or replace both indexes with one on (num,flag).
Or you can re-write the part of the WHERE clause in a way that it can't use
an index, something like:
and flag::text ='t'
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Fri, Dec 1, 2017 at 11:08 PM, Vitaliy Garnashevich <
> vgarnashevich@gmail.com> wrote:
>> # x4 tuple/operator costs - bitmap scan still a bit cheaper
>> set seq_page_cost = 1.0;
>> set random_page_cost = 1.0;
>> set cpu_tuple_cost = 0.04;
>> set cpu_index_tuple_cost = 0.02;
>> set cpu_operator_cost = 0.01;

> If you really want to target the plan with the BitmapAnd, you should
> increase cpu_index_tuple_cost and/or cpu_operator_cost but not increase
> cpu_tuple_cost. That is because the unselective bitmap index scan does
> not incur any cpu_tuple_cost, but does incur index_tuple and operator
> costs. Unfortunately all other index scans in the system will also be
> skewed by such a change if you make the change system-wide.

I think it'd be a serious error to screw around with your cost settings
on the basis of a single case in which the rowcount estimates are so
far off. It's really those estimates that are the problem AFAICS.

The core issue in this example is that, the way the test data is set up,
the "flag = true" condition actually adds no selectivity at all, because
every row with "num = 1" is certain to have "flag = true". If the planner
realized that, it would certainly not bother with BitmapAnd'ing the flag
index onto the results of the num index. But it doesn't know that those
columns are correlated, so it supposes that adding the extra index will
give a 10x reduction in the number of heap rows that have to be visited
(since it knows that only 1/10th of the rows have "flag = true").
*That* is what causes the overly optimistic cost estimate for the
two-index bitmapscan, and no amount of fiddling with the cost parameters
will make that better.

I tried creating multiple-column statistics using the v10 facility for
that:

regression=# create statistics s1 on num, flag from aaa;
CREATE STATISTICS
regression=# analyze aaa;
ANALYZE

but that changed the estimate not at all, which surprised me because
dependency statistics are supposed to fix exactly this type of problem.
I suspect there may be something in the extended-stats code that causes it
not to work right for boolean columns --- this wouldn't be excessively
surprising because of the way the planner messes around with converting
"flag = true" to just "flag" and sometimes back again. But I've not
looked closer yet.

regards, tom lane
On Sat, Dec 2, 2017 at 3:44 PM, Tom Lane wrote:
> Jeff Janes writes:
> > On Fri, Dec 1, 2017 at 11:08 PM, Vitaliy Garnashevich <
> > vgarnashevich@gmail.com> wrote:
> >> # x4 tuple/operator costs - bitmap scan still a bit cheaper
> >> set seq_page_cost = 1.0;
> >> set random_page_cost = 1.0;
> >> set cpu_tuple_cost = 0.04;
> >> set cpu_index_tuple_cost = 0.02;
> >> set cpu_operator_cost = 0.01;
>
> > If you really want to target the plan with the BitmapAnd, you should
> > increase cpu_index_tuple_cost and/or cpu_operator_cost but not increase
> > cpu_tuple_cost. That is because the unselective bitmap index scan does
> > not incur any cpu_tuple_cost, but does incur index_tuple and operator
> > costs. Unfortunately all other index scans in the system will also be
> > skewed by such a change if you make the change system-wide.
>
> I think it'd be a serious error to screw around with your cost settings
> on the basis of a single case in which the rowcount estimates are so
> far off. It's really those estimates that are the problem AFAICS.
>
> The core issue in this example is that, the way the test data is set up,
> the "flag = true" condition actually adds no selectivity at all, because
> every row with "num = 1" is certain to have "flag = true". If the planner
> realized that, it would certainly not bother with BitmapAnd'ing the flag
> index onto the results of the num index. But it doesn't know that those
> columns are correlated, so it supposes that adding the extra index will
> give a 10x reduction in the number of heap rows that have to be visited
> (since it knows that only 1/10th of the rows have "flag = true").
> *That* is what causes the overly optimistic cost estimate for the
> two-index bitmapscan, and no amount of fiddling with the cost parameters
> will make that better.
>
But he also tested with num=2 and num=39, which reverses the situation so
the bitmap is 100% selective rather than the 90% the planner thinks it will
be.
But it is still slower for him (I am having trouble replicating that
exact behavior), so building the bitmap to rule out 100% of the rows is
empirically not worth it; I don't see how building it to rule out 90%,
as the planner thinks, would be any better.
> I tried creating multiple-column statistics using the v10 facility for
> that:
>
> regression=# create statistics s1 on num, flag from aaa;
> CREATE STATISTICS
> regression=# analyze aaa;
> ANALYZE
>
> but that changed the estimate not at all, which surprised me because
> dependency statistics are supposed to fix exactly this type of problem.
> I suspect there may be something in the extended-stats code that causes it
> not to work right for boolean columns --- this wouldn't be excessively
> surprising because of the way the planner messes around with converting
> "flag = true" to just "flag" and sometimes back again. But I've not
> looked closer yet.
>
I think the non-extended stats code also has trouble with booleans.
pg_stats gives me a correlation of 0.8 or higher for the flag column.
Due to that, when I disable bitmapscans and seqscans, I start getting slow
index scans on the wrong index, i2 rather than i1. I don't know why he
doesn't see that in his example.
Cheers,
Jeff
On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> I think the non-extended stats code also has trouble with booleans.
> pg_stats gives me a correlation of 0.8 or higher for the flag column.

It's not due to the boolean though; you see the same thing if you do:

CREATE INDEX aaa_f ON aaa((flag::text));
ANALYZE aaa;
correlation | 0.81193

or:

ALTER TABLE aaa ADD flag2 int;
UPDATE aaa SET flag2 = flag::int;
correlation | 0.81193

I think it's caused by having so few (2) values to correlate.

most_common_vals  | {f,t}
most_common_freqs | {0.9014,0.0986}
correlation       | 0.822792

It thinks there's somewhat-high correlation since it gets a list of x
and y values (integer positions by logical and physical sort order) and
90% of the x list (logical value) are the same value ('t'), and the
CTIDs are in order on the new index, so 90% of the values are 100%
correlated.

It improves (by which I mean here that it spits out a lower number) if
it's not a 90/10 split:

CREATE TABLE aaa5 AS SELECT (id%100)::int num, (id%10>5)::bool flag
FROM generate_series(1, 10000000) id;
CREATE INDEX ON aaa5 (flag);

tablename   | aaa5
attname     | flag
correlation | 0.522184

Justin
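The effect Justin describes can be reproduced with a short simulation. This is a rough sketch of the idea behind the correlation statistic (Pearson correlation between each sampled row's physical position and its rank in value order), not PostgreSQL's actual ANALYZE code, and the quoted figures are approximate:

```python
import random

# Rough simulation of the "correlation" column statistic for a boolean
# column: correlate each row's physical position with its rank when the
# rows are sorted by value (ties broken by physical order, mimicking
# CTID order within each value).  With a 90/10 split, most of the list
# looks perfectly correlated, so the statistic comes out around 0.82.

def simulated_correlation(values):
    n = len(values)
    # Stable sort by value: each row's logical (value-order) position.
    order = sorted(range(n), key=lambda i: (values[i], i))
    rank = [0] * n
    for logical_pos, phys in enumerate(order):
        rank[phys] = logical_pos
    xs = range(n)                      # physical positions
    mx = (n - 1) / 2
    my = sum(rank) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, rank)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in rank) / n
    return cov / (vx * vy) ** 0.5

random.seed(42)
flags_90_10 = [random.random() < 0.1 for _ in range(100_000)]
r1 = simulated_correlation(flags_90_10)   # around 0.82

flags_60_40 = [random.random() < 0.4 for _ in range(100_000)]
r2 = simulated_correlation(flags_60_40)   # around 0.52, a more even split
assert r2 < r1
```

The two approximate values match the 0.8228 and 0.5222 that ANALYZE reported in the thread, supporting the "few distinct values plus tie-breaking by physical order" explanation.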
On 02/12/2017 23:17, Jeff Janes wrote:
> Right, so there is a CPU costing problem (which could only be fixed by
> hacking PostgreSQL and recompiling it), but it is a much smaller
> problem than the IO cost not being accurate due to the high hit rate.
> Fixing the CPU costing problem is unlikely to make a difference to
> your real query. If you set the page costs to zero, what happens to
> your real query?
I can't reproduce the exact issue on the real database any more. The
query started to use the slow bitmap scan recently, and had been doing
so for some time lately, but now it's switched back to use the index
scan. The table involved in the query gets modified a lot. It has
hundreds of millions of rows. Lots of new rows are appended to it every
day, and the oldest rows are sometimes removed. The table is analyzed at
least daily. It's possible that the statistics were updated and that
caused the query to plan differently. But I would still like to
understand why the issue happened, and how to fix it properly, in case
it returns.
>
> But I doubt that the settings seq_page_cost = random_page_cost =
> 0.0 should actually be used.
>
>
> Why not? If your production server really has everything in memory
> during normal operation, that is the correct course of action. If you
> ever restart the server, then you could have some unpleasant time
> getting it back up to speed again, but pg_prewarm could help with that.
In the real database, not everything is in memory. There are 200GB+ of
RAM, but DB is 500GB+. The table involved in the query itself is 60GB+
of data and 100GB+ of indexes. I'm running the test case in a way where
all reads are done from RAM, only to make it easier to reproduce and to
avoid unrelated effects.
As far as I know, costs in Postgres were designed to be relative to
seq_page_cost, which for that reason is usually defined as 1.0. Even if
everything were in RAM, accessing the pages would still not have zero
cost. Setting it to 0.0 just seems too extreme, as all the other
non-zero costs would become infinitely bigger in comparison.
> If you really want to target the plan with the BitmapAnd, you should
> increase cpu_index_tuple_cost and/or cpu_operator_cost but not
> increase cpu_tuple_cost. That is because the unselective bitmap
> index scan does not incur any cpu_tuple_cost, but does incur
> index_tuple and operator costs. Unfortunately all other index scans
> in the system will also be skewed by such a change if you make the
> change system-wide.
Exactly. I'd like to understand why the worse plan is being chosen, and
1) if it's fixable by tuning costs, to figure out the right settings
that could be used in production; 2) if there is a bug in the Postgres
optimizer, to bring some attention to it, so that it's eventually fixed
in a future release; 3) if Postgres is supposed to work this way, then
at least I (and anyone who reads this thread) will understand it better.
Regards,
Vitaliy
On 03/12/2017 01:44, Tom Lane wrote:
> I think it'd be a serious error to screw around with your cost settings
> on the basis of a single case in which the rowcount estimates are so
> far off. It's really those estimates that are the problem AFAICS.
>
> The core issue in this example is that, the way the test data is set up,
> the "flag = true" condition actually adds no selectivity at all, because
> every row with "num = 1" is certain to have "flag = true". If the planner
> realized that, it would certainly not bother with BitmapAnd'ing the flag
> index onto the results of the num index. But it doesn't know that those
> columns are correlated, so it supposes that adding the extra index will
> give a 10x reduction in the number of heap rows that have to be visited
> (since it knows that only 1/10th of the rows have "flag = true").
> *That* is what causes the overly optimistic cost estimate for the
> two-index bitmapscan, and no amount of fiddling with the cost parameters
> will make that better.

Here I've tried to make a test which would not have correlation between
the two columns.

shared_buffers = 512MB
effective_cache_size = 512MB
work_mem = 100MB

set seq_page_cost = 1.0;
set random_page_cost = 1.5;
set cpu_tuple_cost = 0.01;
set cpu_index_tuple_cost = 0.005;
set cpu_operator_cost = 0.0025;

drop table if exists aaa;
create table aaa as select floor(random()*100)::int num, (random()*10 < 1)::bool flag from generate_series(1, 10000000) id;
create index i1 on aaa (num);
create index i2 on aaa (flag);
analyze aaa;

select relname, reltuples::bigint, relpages::bigint, (reltuples/relpages)::bigint tpp from pg_class where relname in('aaa','i1','i2') order by relname;
"aaa";10000033;44248;226
"i1";10000033;27422;365
"i2";10000033;27422;365

select flag, count(*) from aaa group by flag order by flag;
f;9000661
t;999339

select num, count(*) from aaa group by num order by num;
0;99852
1;99631
2;99699
3;100493
...
96;100345
97;99322
98;100013
99;100030

explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = true;

Bitmap Heap Scan on public.aaa (cost=12829.83..24729.85 rows=10340 width=5) (actual time=104.941..112.649 rows=9944 loops=1)
  Output: num, flag
  Recheck Cond: (aaa.num = 1)
  Filter: aaa.flag
  Heap Blocks: exact=8922
  Buffers: shared hit=11932
  -> BitmapAnd (cost=12829.83..12829.83 rows=10340 width=0) (actual time=102.926..102.926 rows=0 loops=1)
       Buffers: shared hit=3010
       -> Bitmap Index Scan on i1 (cost=0.00..1201.44 rows=103334 width=0) (actual time=15.459..15.459 rows=99631 loops=1)
            Index Cond: (aaa.num = 1)
            Buffers: shared hit=276
       -> Bitmap Index Scan on i2 (cost=0.00..11622.97 rows=1000671 width=0) (actual time=76.906..76.906 rows=999339 loops=1)
            Index Cond: (aaa.flag = true)
            Buffers: shared hit=2734
Planning time: 0.110 ms
Execution time: 113.272 ms

Index Scan using i1 on public.aaa (cost=0.44..66621.56 rows=10340 width=5) (actual time=0.027..47.075 rows=9944 loops=1)
  Output: num, flag
  Index Cond: (aaa.num = 1)
  Filter: aaa.flag
  Rows Removed by Filter: 89687
  Buffers: shared hit=39949
Planning time: 0.104 ms
Execution time: 47.351 ms

> I tried creating multiple-column statistics using the v10 facility for
> that:
>
> regression=# create statistics s1 on num, flag from aaa;
> CREATE STATISTICS
> regression=# analyze aaa;
> ANALYZE
>
> but that changed the estimate not at all, which surprised me because
> dependency statistics are supposed to fix exactly this type of problem.
> I suspect there may be something in the extended-stats code that causes it
> not to work right for boolean columns --- this wouldn't be excessively
> surprising because of the way the planner messes around with converting
> "flag = true" to just "flag" and sometimes back again. But I've not
> looked closer yet.
>
> regards, tom lane
On 03/12/2017 03:27, Jeff Janes wrote:
> Due to that, when I disable bitmapscans and seqscans, I start getting
> slow index scans on the wrong index, i2 rather than i1. I don't know
> why he doesn't see that in his example.
When I increase effective_cache_size to 1024MB, I start getting the plan
with the slower index i2, too.
*Bitmap Heap Scan* on public.aaa (cost=12600.90..*23688.70* rows=9488 width=5) (actual time=107.529..*115.902* rows=9976 loops=1)
  -> BitmapAnd (cost=12600.90..12600.90 rows=9488 width=0) (actual time=105.133..105.133 rows=0 loops=1)
       -> Bitmap Index Scan on i1 (cost=0.00..1116.43 rows=96000 width=0) (actual time=16.313..16.313 rows=100508 loops=1)
       -> Bitmap Index Scan on i2 (cost=0.00..11479.47 rows=988338 width=0) (actual time=77.950..77.950 rows=1000200 loops=1)

*Index Scan* using i2 on public.aaa (cost=0.44..*48227.31* rows=9488 width=5) (actual time=0.020..*285.695* rows=9976 loops=1)

*Seq Scan* on public.aaa (cost=0.00..*169248.54* rows=9488 width=5) (actual time=0.024..*966.469* rows=9976 loops=1)
This way the estimates and the actual times make more sense. But then
the question remains: maybe it's i1 that runs too fast and is costed
incorrectly? Why does that happen?
Here are the complete plans with the two different kinds of index scans
once again:
Index Scan using i1 on public.aaa (cost=0.44..66621.56 rows=10340 width=5) (actual time=0.027..47.075 rows=9944 loops=1)
  Output: num, flag
  Index Cond: (aaa.num = 1)
  Filter: aaa.flag
  Rows Removed by Filter: 89687
  Buffers: shared hit=39949
Planning time: 0.104 ms
Execution time: 47.351 ms

Index Scan using i2 on public.aaa (cost=0.44..48227.31 rows=9488 width=5) (actual time=0.020..285.695 rows=9976 loops=1)
  Output: num, flag
  Index Cond: (aaa.flag = true)
  Filter: (aaa.flag AND (aaa.num = 1))
  Rows Removed by Filter: 990224
  Buffers: shared hit=46984
Planning time: 0.098 ms
Execution time: 286.081 ms
// The test DB was populated with: create table aaa as select
floor(random()*100)::int num, (random()*10 < 1)::bool flag from
generate_series(1, 10000000) id;
Regards,
Vitaliy
I wrote:
> I tried creating multiple-column statistics using the v10 facility for
> that:
> regression=# create statistics s1 on num, flag from aaa;
> CREATE STATISTICS
> regression=# analyze aaa;
> ANALYZE
> but that changed the estimate not at all, which surprised me because
> dependency statistics are supposed to fix exactly this type of problem.
> I suspect there may be something in the extended-stats code that causes it
> not to work right for boolean columns --- this wouldn't be excessively
> surprising because of the way the planner messes around with converting
> "flag = true" to just "flag" and sometimes back again. But I've not
> looked closer yet.

After looking, I found that indeed dependency_is_compatible_clause()
rejects expressions like "flag" or "NOT flag", which it needn't, since
those are fully equivalent to "flag = true" or "flag = false"
respectively. Moreover, there's nothing else in
dependencies_clauselist_selectivity that depends on the exact form of
the clause under test, only on the semantics that it's an equality
condition on some Var.
Hence I propose the attached patch, which fixes the rowcount estimate
for the example discussed in this thread:

create table aaa as select (id%100)::int num, (id%10=1)::bool flag from generate_series(1, 10000000) id;
create index i1 on aaa (num);
create index i2 on aaa (flag);
create statistics s1 on num, flag from aaa;
analyze aaa;
explain analyze select count(*) from aaa where num = 1 and flag = true;

Without patch:

Aggregate (cost=43236.73..43236.74 rows=1 width=8) (actual time=349.365..349.365 rows=1 loops=1)
  -> Bitmap Heap Scan on aaa (cost=20086.40..43212.94 rows=9514 width=0) (actual time=101.308..337.985 rows=100000 loops=1)
       Recheck Cond: (num = 1)
       Filter: flag
       Heap Blocks: exact=44248
       -> BitmapAnd (cost=20086.40..20086.40 rows=9514 width=0) (actual time=92.214..92.214 rows=0 loops=1)
            -> Bitmap Index Scan on i1 (cost=0.00..1776.43 rows=96000 width=0) (actual time=17.236..17.236 rows=100000 loops=1)
                 Index Cond: (num = 1)
            -> Bitmap Index Scan on i2 (cost=0.00..18304.96 rows=991003 width=0) (actual time=72.168..72.168 rows=1000000 loops=1)
                 Index Cond: (flag = true)
Planning time: 0.254 ms
Execution time: 350.796 ms

With patch:

Aggregate (cost=43496.19..43496.20 rows=1 width=8) (actual time=359.195..359.195 rows=1 loops=1)
  -> Bitmap Heap Scan on aaa (cost=20129.64..43256.19 rows=96000 width=0) (actual time=99.750..347.353 rows=100000 loops=1)
       Recheck Cond: (num = 1)
       Filter: flag
       Heap Blocks: exact=44248
       -> BitmapAnd (cost=20129.64..20129.64 rows=9514 width=0) (actual time=90.671..90.671 rows=0 loops=1)
            -> Bitmap Index Scan on i1 (cost=0.00..1776.43 rows=96000 width=0) (actual time=16.946..16.946 rows=100000 loops=1)
                 Index Cond: (num = 1)
            -> Bitmap Index Scan on i2 (cost=0.00..18304.96 rows=991003 width=0) (actual time=70.898..70.898 rows=1000000 loops=1)
                 Index Cond: (flag = true)
Planning time: 0.218 ms
Execution time: 360.608 ms

That's the right overall rowcount estimate for the scan, given the
stats it's working from.
There's apparently still something wrong with bitmap costing, since it's
still estimating this as cheaper than the single-index case --- noting
the bogus rowcount estimate for the BitmapAnd, I suspect that bitmap
costing is taking shortcuts rather than using clauselist_selectivity to
estimate the overall selectivity of the bitmap conditions. But whatever
is causing that, it's independent of this deficiency.

In addition to the bugfix proper, I improved some comments, got rid of
a NumRelids() test that's redundant with the preceding bms_membership()
test, and fixed dependencies_clauselist_selectivity so that
estimatedclauses actually is a pure output argument as stated by its
API contract.

regards, tom lane

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 9756fb8..ae0f304 100644
*** a/src/backend/statistics/dependencies.c
--- b/src/backend/statistics/dependencies.c
*************** pg_dependencies_send(PG_FUNCTION_ARGS)
*** 736,826 ****
   * dependency_is_compatible_clause
   *      Determines if the clause is compatible with functional dependencies
   *
!  * Only OpExprs with two arguments using an equality operator are supported.
!  * When returning True attnum is set to the attribute number of the Var within
!  * the supported clause.
!  *
!  * Currently we only support Var = Const, or Const = Var. It may be possible
!  * to expand on this later.
   */
  static bool
  dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  {
      RestrictInfo *rinfo = (RestrictInfo *) clause;
  
      if (!IsA(rinfo, RestrictInfo))
          return false;
  
!     /* Pseudoconstants are not really interesting here. */
      if (rinfo->pseudoconstant)
          return false;
  
!     /* clauses referencing multiple varnos are incompatible */
      if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
          return false;
  
      if (is_opclause(rinfo->clause))
      {
          OpExpr     *expr = (OpExpr *) rinfo->clause;
-         Var        *var;
-         bool        varonleft = true;
-         bool        ok;
  
!         /* Only expressions with two arguments are considered compatible. */
          if (list_length(expr->args) != 2)
              return false;
  
!         /* see if it actually has the right */
!         ok = (NumRelids((Node *) expr) == 1) &&
!             (is_pseudo_constant_clause(lsecond(expr->args)) ||
!              (varonleft = false,
!               is_pseudo_constant_clause(linitial(expr->args))));
! 
!         /* unsupported structure (two variables or so) */
!         if (!ok)
              return false;
  
          /*
!          * If it's not "=" operator, just ignore the clause, as it's not
           * compatible with functional dependencies.
           *
           * This uses the function for estimating selectivity, not the operator
           * directly (a bit awkward, but well ...).
           */
          if (get_oprrest(expr->opno) != F_EQSEL)
              return false;
  
!         var = (varonleft) ? linitial(expr->args) : lsecond(expr->args);
  
!         /*
!          * We may ignore any RelabelType node above the operand.  (There won't
!          * be more than one, since eval_const_expressions() has been applied
!          * already.)
           */
!         if (IsA(var, RelabelType))
!             var = (Var *) ((RelabelType *) var)->arg;
  
!         /* We only support plain Vars for now */
!         if (!IsA(var, Var))
!             return false;
  
!         /* Ensure var is from the correct relation */
!         if (var->varno != relid)
!             return false;
  
!         /* we also better ensure the Var is from the current level */
!         if (var->varlevelsup > 0)
!             return false;
  
!         /* Also skip system attributes (we don't allow stats on those). */
!         if (!AttrNumberIsForUserDefinedAttr(var->varattno))
!             return false;
  
!         *attnum = var->varattno;
!         return true;
!     }
  
!     return false;
  }
  
  /*
--- 736,839 ----
   * dependency_is_compatible_clause
   *      Determines if the clause is compatible with functional dependencies
   *
!  * Only clauses that have the form of equality to a pseudoconstant, or can be
!  * interpreted that way, are currently accepted.  Furthermore the variable
!  * part of the clause must be a simple Var belonging to the specified
!  * relation, whose attribute number we return in *attnum on success.
   */
  static bool
  dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  {
      RestrictInfo *rinfo = (RestrictInfo *) clause;
+     Var        *var;
  
      if (!IsA(rinfo, RestrictInfo))
          return false;
  
!     /* Pseudoconstants are not interesting (they couldn't contain a Var) */
      if (rinfo->pseudoconstant)
          return false;
  
!     /* Clauses referencing multiple, or no, varnos are incompatible */
      if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
          return false;
  
      if (is_opclause(rinfo->clause))
      {
+         /* If it's an opclause, check for Var = Const or Const = Var. */
          OpExpr     *expr = (OpExpr *) rinfo->clause;
  
!         /* Only expressions with two arguments are candidates. */
          if (list_length(expr->args) != 2)
              return false;
  
!         /* Make sure non-selected argument is a pseudoconstant. */
!         if (is_pseudo_constant_clause(lsecond(expr->args)))
!             var = linitial(expr->args);
!         else if (is_pseudo_constant_clause(linitial(expr->args)))
!             var = lsecond(expr->args);
!         else
              return false;
  
          /*
!          * If it's not an "=" operator, just ignore the clause, as it's not
           * compatible with functional dependencies.
           *
           * This uses the function for estimating selectivity, not the operator
           * directly (a bit awkward, but well ...).
+          *
+          * XXX this is pretty dubious; probably it'd be better to check btree
+          * or hash opclass membership, so as not to be fooled by custom
+          * selectivity functions, and to be more consistent with decisions
+          * elsewhere in the planner.
           */
          if (get_oprrest(expr->opno) != F_EQSEL)
              return false;
  
!         /* OK to proceed with checking "var" */
!     }
!     else if (not_clause((Node *) rinfo->clause))
!     {
          /*
!          * "NOT x" can be interpreted as "x = false", so get the argument and
!          * proceed with seeing if it's a suitable Var.
           */
!         var = (Var *) get_notclausearg(rinfo->clause);
!     }
!     else
!     {
!         /*
!          * A boolean expression "x" can be interpreted as "x = true", so
!          * proceed with seeing if it's a suitable Var.
!          */
!         var = (Var *) rinfo->clause;
!     }
  
!     /*
!      * We may ignore any RelabelType node above the operand.  (There won't be
!      * more than one, since eval_const_expressions has been applied already.)
!      */
!     if (IsA(var, RelabelType))
!         var = (Var *) ((RelabelType *) var)->arg;
  
!     /* We only support plain Vars for now */
!     if (!IsA(var, Var))
!         return false;
  
!     /* Ensure Var is from the correct relation */
!     if (var->varno != relid)
!         return false;
  
!     /* We also better ensure the Var is from the current level */
!     if (var->varlevelsup != 0)
!         return false;
  
!     /* Also ignore system attributes (we don't allow stats on those) */
!     if (!AttrNumberIsForUserDefinedAttr(var->varattno))
!         return false;
  
!     *attnum = var->varattno;
!     return true;
  }
  
  /*
*************** find_strongest_dependency(StatisticExtIn
*** 891,902 ****
  /*
   * dependencies_clauselist_selectivity
!  *      Return the estimated selectivity of the given clauses using
!  *      functional dependency statistics, or 1.0 if no useful functional
   *      dependency statistic exists.
   *
   * 'estimatedclauses' is an output argument that gets a bit set corresponding
!  * to the (zero-based) list index of clauses that are included in the
   *      estimated selectivity.
   *
   * Given equality clauses on attributes (a,b) we find the strongest dependency
--- 904,915 ----
  /*
   * dependencies_clauselist_selectivity
!  *      Return the estimated selectivity of (a subset of) the given clauses
!  *      using functional dependency statistics, or 1.0 if no useful functional
   *      dependency statistic exists.
   *
   * 'estimatedclauses' is an output argument that gets a bit set corresponding
!  *      to the (zero-based) list index of each clause that is included in the
   *      estimated selectivity.
   *
   * Given equality clauses on attributes (a,b) we find the strongest dependency
*************** dependencies_clauselist_selectivity(Plan
*** 932,937 ****
--- 945,953 ----
      AttrNumber *list_attnums;
      int         listidx;
  
+     /* initialize output argument */
+     *estimatedclauses = NULL;
+ 
      /* check if there's any stats that might be useful for us. */
      if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES))
          return 1.0;
I wrote: > I tried creating multiple-column statistics using the v10 facility for > that: > regression=# create statistics s1 on num, flag from aaa; > CREATE STATISTICS > regression=# analyze aaa; > ANALYZE > but that changed the estimate not at all, which surprised me because > dependency statistics are supposed to fix exactly this type of problem. > I suspect there may be something in the extended-stats code that causes it > not to work right for boolean columns --- this wouldn't be excessively > surprising because of the way the planner messes around with converting > "flag = true" to just "flag" and sometimes back again. But I've not > looked closer yet. After looking, I found that indeed dependency_is_compatible_clause() rejects expressions like "flag" or "NOT flag", which it needn't since those are fully equivalent to "flag = true" or "flag = false" respectively. Moreover there's nothing else in dependencies_clauselist_selectivity that depends on the exact form of the clause under test, only on the semantics that it's an equality condition on some Var. 
Hence I propose the attached patch, which fixes the rowcount estimate for the example discussed in this thread: create table aaa as select (id%100)::int num, (id%10=1)::bool flag from generate_series(1, 10000000) id; create index i1 on aaa (num); create index i2 on aaa (flag); create statistics s1 on num, flag from aaa; analyze aaa; explain analyze select count(*) from aaa where num = 1 and flag = true; Without patch: Aggregate (cost=43236.73..43236.74 rows=1 width=8) (actual time=349.365..349.3 65 rows=1 loops=1) -> Bitmap Heap Scan on aaa (cost=20086.40..43212.94 rows=9514 width=0) (act ual time=101.308..337.985 rows=100000 loops=1) Recheck Cond: (num = 1) Filter: flag Heap Blocks: exact=44248 -> BitmapAnd (cost=20086.40..20086.40 rows=9514 width=0) (actual time =92.214..92.214 rows=0 loops=1) -> Bitmap Index Scan on i1 (cost=0.00..1776.43 rows=96000 width =0) (actual time=17.236..17.236 rows=100000 loops=1) Index Cond: (num = 1) -> Bitmap Index Scan on i2 (cost=0.00..18304.96 rows=991003 wid th=0) (actual time=72.168..72.168 rows=1000000 loops=1) Index Cond: (flag = true) Planning time: 0.254 ms Execution time: 350.796 ms With patch: Aggregate (cost=43496.19..43496.20 rows=1 width=8) (actual time=359.195..359.1 95 rows=1 loops=1) -> Bitmap Heap Scan on aaa (cost=20129.64..43256.19 rows=96000 width=0) (ac tual time=99.750..347.353 rows=100000 loops=1) Recheck Cond: (num = 1) Filter: flag Heap Blocks: exact=44248 -> BitmapAnd (cost=20129.64..20129.64 rows=9514 width=0) (actual time =90.671..90.671 rows=0 loops=1) -> Bitmap Index Scan on i1 (cost=0.00..1776.43 rows=96000 width =0) (actual time=16.946..16.946 rows=100000 loops=1) Index Cond: (num = 1) -> Bitmap Index Scan on i2 (cost=0.00..18304.96 rows=991003 wid th=0) (actual time=70.898..70.898 rows=1000000 loops=1) Index Cond: (flag = true) Planning time: 0.218 ms Execution time: 360.608 ms That's the right overall rowcount estimate for the scan, given the stats it's working from. 
There's apparently still something wrong with bitmap costing, since it's still estimating this as cheaper than the single-index case --- noting the bogus rowcount estimate for the BitmapAnd, I suspect that bitmap costing is taking shortcuts rather than using clauselist_selectivity to estimate the overall selectivity of the bitmap conditions.  But whatever is causing that, it's independent of this deficiency.

In addition to the bugfix proper, I improved some comments, got rid of a NumRelids() test that's redundant with the preceding bms_membership() test, and fixed dependencies_clauselist_selectivity so that estimatedclauses actually is a pure output argument as stated by its API contract.

			regards, tom lane

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 9756fb8..ae0f304 100644
*** a/src/backend/statistics/dependencies.c
--- b/src/backend/statistics/dependencies.c
*************** pg_dependencies_send(PG_FUNCTION_ARGS)
*** 736,826 ****
   * dependency_is_compatible_clause
   *		Determines if the clause is compatible with functional dependencies
   *
!  * Only OpExprs with two arguments using an equality operator are supported.
!  * When returning True attnum is set to the attribute number of the Var within
!  * the supported clause.
!  *
!  * Currently we only support Var = Const, or Const = Var.  It may be possible
!  * to expand on this later.
   */
  static bool
  dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  {
  	RestrictInfo *rinfo = (RestrictInfo *) clause;
  
  	if (!IsA(rinfo, RestrictInfo))
  		return false;
  
! 	/* Pseudoconstants are not really interesting here. */
  	if (rinfo->pseudoconstant)
  		return false;
  
! 	/* clauses referencing multiple varnos are incompatible */
  	if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
  		return false;
  
  	if (is_opclause(rinfo->clause))
  	{
  		OpExpr	   *expr = (OpExpr *) rinfo->clause;
- 		Var		   *var;
- 		bool		varonleft = true;
- 		bool		ok;
  
! 		/* Only expressions with two arguments are considered compatible. */
  		if (list_length(expr->args) != 2)
  			return false;
  
! 		/* see if it actually has the right */
! 		ok = (NumRelids((Node *) expr) == 1) &&
! 			(is_pseudo_constant_clause(lsecond(expr->args)) ||
! 			 (varonleft = false,
! 			  is_pseudo_constant_clause(linitial(expr->args))));
! 
! 		/* unsupported structure (two variables or so) */
! 		if (!ok)
  			return false;
  
  		/*
! 		 * If it's not "=" operator, just ignore the clause, as it's not
  		 * compatible with functional dependencies.
  		 *
  		 * This uses the function for estimating selectivity, not the operator
  		 * directly (a bit awkward, but well ...).
  		 */
  		if (get_oprrest(expr->opno) != F_EQSEL)
  			return false;
  
! 		var = (varonleft) ? linitial(expr->args) : lsecond(expr->args);
  
! 		/*
! 		 * We may ignore any RelabelType node above the operand.  (There won't
! 		 * be more than one, since eval_const_expressions() has been applied
! 		 * already.)
  		 */
! 		if (IsA(var, RelabelType))
! 			var = (Var *) ((RelabelType *) var)->arg;
  
! 		/* We only support plain Vars for now */
! 		if (!IsA(var, Var))
! 			return false;
  
! 		/* Ensure var is from the correct relation */
! 		if (var->varno != relid)
! 			return false;
  
! 		/* we also better ensure the Var is from the current level */
! 		if (var->varlevelsup > 0)
! 			return false;
  
! 		/* Also skip system attributes (we don't allow stats on those). */
! 		if (!AttrNumberIsForUserDefinedAttr(var->varattno))
! 			return false;
  
! 		*attnum = var->varattno;
! 		return true;
! 	}
  
! 	return false;
  }
  
  /*
--- 736,839 ----
   * dependency_is_compatible_clause
   *		Determines if the clause is compatible with functional dependencies
   *
!  * Only clauses that have the form of equality to a pseudoconstant, or can be
!  * interpreted that way, are currently accepted.  Furthermore the variable
!  * part of the clause must be a simple Var belonging to the specified
!  * relation, whose attribute number we return in *attnum on success.
   */
  static bool
  dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  {
  	RestrictInfo *rinfo = (RestrictInfo *) clause;
+ 	Var		   *var;
  
  	if (!IsA(rinfo, RestrictInfo))
  		return false;
  
! 	/* Pseudoconstants are not interesting (they couldn't contain a Var) */
  	if (rinfo->pseudoconstant)
  		return false;
  
! 	/* Clauses referencing multiple, or no, varnos are incompatible */
  	if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
  		return false;
  
  	if (is_opclause(rinfo->clause))
  	{
+ 		/* If it's an opclause, check for Var = Const or Const = Var. */
  		OpExpr	   *expr = (OpExpr *) rinfo->clause;
  
! 		/* Only expressions with two arguments are candidates. */
  		if (list_length(expr->args) != 2)
  			return false;
  
! 		/* Make sure non-selected argument is a pseudoconstant. */
! 		if (is_pseudo_constant_clause(lsecond(expr->args)))
! 			var = linitial(expr->args);
! 		else if (is_pseudo_constant_clause(linitial(expr->args)))
! 			var = lsecond(expr->args);
! 		else
  			return false;
  
  		/*
! 		 * If it's not an "=" operator, just ignore the clause, as it's not
  		 * compatible with functional dependencies.
  		 *
  		 * This uses the function for estimating selectivity, not the operator
  		 * directly (a bit awkward, but well ...).
+ 		 *
+ 		 * XXX this is pretty dubious; probably it'd be better to check btree
+ 		 * or hash opclass membership, so as not to be fooled by custom
+ 		 * selectivity functions, and to be more consistent with decisions
+ 		 * elsewhere in the planner.
  		 */
  		if (get_oprrest(expr->opno) != F_EQSEL)
  			return false;
  
! 		/* OK to proceed with checking "var" */
! 	}
! 	else if (not_clause((Node *) rinfo->clause))
! 	{
  		/*
! 		 * "NOT x" can be interpreted as "x = false", so get the argument and
! 		 * proceed with seeing if it's a suitable Var.
  		 */
! 		var = (Var *) get_notclausearg(rinfo->clause);
! 	}
! 	else
! 	{
! 		/*
! 		 * A boolean expression "x" can be interpreted as "x = true", so
! 		 * proceed with seeing if it's a suitable Var.
! 		 */
! 		var = (Var *) rinfo->clause;
! 	}
  
! 	/*
! 	 * We may ignore any RelabelType node above the operand.  (There won't be
! 	 * more than one, since eval_const_expressions has been applied already.)
! 	 */
! 	if (IsA(var, RelabelType))
! 		var = (Var *) ((RelabelType *) var)->arg;
  
! 	/* We only support plain Vars for now */
! 	if (!IsA(var, Var))
! 		return false;
  
! 	/* Ensure Var is from the correct relation */
! 	if (var->varno != relid)
! 		return false;
  
! 	/* We also better ensure the Var is from the current level */
! 	if (var->varlevelsup != 0)
! 		return false;
  
! 	/* Also ignore system attributes (we don't allow stats on those) */
! 	if (!AttrNumberIsForUserDefinedAttr(var->varattno))
! 		return false;
  
! 	*attnum = var->varattno;
! 	return true;
  }
  
  /*
*************** find_strongest_dependency(StatisticExtIn
*** 891,902 ****
  /*
   * dependencies_clauselist_selectivity
!  *		Return the estimated selectivity of the given clauses using
!  *		functional dependency statistics, or 1.0 if no useful functional
   *		dependency statistic exists.
   *
   * 'estimatedclauses' is an output argument that gets a bit set corresponding
!  * to the (zero-based) list index of clauses that are included in the
   * estimated selectivity.
   *
   * Given equality clauses on attributes (a,b) we find the strongest dependency
--- 904,915 ----
  /*
   * dependencies_clauselist_selectivity
!  *		Return the estimated selectivity of (a subset of) the given clauses
!  *		using functional dependency statistics, or 1.0 if no useful functional
   *		dependency statistic exists.
   *
   * 'estimatedclauses' is an output argument that gets a bit set corresponding
!  * to the (zero-based) list index of each clause that is included in the
   * estimated selectivity.
   *
   * Given equality clauses on attributes (a,b) we find the strongest dependency
*************** dependencies_clauselist_selectivity(Plan
*** 932,937 ****
--- 945,953 ----
  	AttrNumber *list_attnums;
  	int			listidx;
  
+ 	/* initialize output argument */
+ 	*estimatedclauses = NULL;
+ 
  	/* check if there's any stats that might be useful for us. */
  	if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES))
  		return 1.0;
On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby wrote:
> On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> > I think the non-extended stats code also has trouble with booleans.
> > pg_stats gives me a correlation of 0.8 or higher for the flag column.
>
> It's not due to the boolean though; you see the same thing if you do:
> CREATE INDEX aaa_f ON aaa((flag::text));
> ANALYZE aaa;
> correlation | 0.81193
>
> or:
> ALTER TABLE aaa ADD flag2 int; UPDATE aaa SET flag2= flag::int
> correlation | 0.81193
>
> I think it's caused by having so few (2) values to correlate.
>
> most_common_vals | {f,t}
> most_common_freqs | {0.9014,0.0986}
> correlation | 0.822792
>
> It thinks there's somewhat-high correlation since it gets a list of x and y
> values (integer positions by logical and physical sort order) and 90% of the
> x list (logical value) are the same value ('t'), and the CTIDs are in order
> on the new index, so 90% of the values are 100% correlated.
>
>
But there is no index involved (except in the case of the functional
index). The correlation of table columns to physical order of the table
doesn't depend on the existence of an index, or the physical order within
an index.
But I do see that ties within the logical order of the column values are
broken to agree with the physical order. That is wrong, right? Is there
any argument that this is desirable?
It looks like it could be fixed with a few extra double calcs per distinct
value. Considering we already sorted the sample values using SQL-callable
collation dependent comparators, I doubt a few C-level double calcs is
going to be meaningful.
Cheers,
Jeff
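[Editor's note: the tie-breaking effect Jeff describes can be sketched numerically. This is an illustration, not PostgreSQL's actual ANALYZE code; it builds a 90%/10% boolean-like column in random heap order and compares the Pearson correlation when ties in the logical sort are broken by physical position (the current behavior) versus left tied (Jeff's proposal).]

```python
import random

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(42)
n = 100_000
vals = [random.random() < 0.1 for _ in range(n)]  # ~10% True, like "flag"
phys = list(range(n))                             # heap (physical) order

# Current behavior: logical rank comes from sorting by (value, heap position),
# so rows within a duplicate group are numbered in heap order.
rank = {row: r for r, row in enumerate(sorted(phys, key=lambda i: (vals[i], i)))}
x_broken = [rank[i] for i in phys]

# Jeff's proposal: all duplicates of a value share one logical position.
x_tied = [int(vals[i]) for i in phys]

print("ties broken by heap order:", round(pearson(x_broken, phys), 3))  # ~0.82
print("ties left tied:           ", round(pearson(x_tied, phys), 3))    # ~0.00
```

For a 90/10 split the tie-broken figure works out to about 0.82, which is essentially the 0.8+ correlation reported upthread for the flag column, while the tied version is near zero for randomly scattered values.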
Jeff Janes <jeff.janes@gmail.com> writes: > On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby <pryzby@telsasoft.com> wrote: >> It thinks there's somewhat-high correlation since it gets a list of x >> and y values (integer positions by logical and physical sort order) and >> 90% of the x list (logical value) are the same value ('t'), and the >> CTIDs are in order on the new index, so 90% of the values are 100% >> correlated. > But there is no index involved (except in the case of the functional > index). The correlation of table columns to physical order of the table > doesn't depend on the existence of an index, or the physical order within > an index. > But I do see that ties within the logical order of the column values are > broken to agree with the physical order. That is wrong, right? Is there > any argument that this is desirable? Uh ... what do you propose doing instead? We'd have to do something with ties, and it's not so obvious this way is wrong. regards, tom lane
On Dec 3, 2017 15:31, "Tom Lane" wrote:
> Jeff Janes writes:
>> On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby wrote:
>>> It thinks there's somewhat-high correlation since it gets a list of x
>>> and y values (integer positions by logical and physical sort order) and
>>> 90% of the x list (logical value) are the same value ('t'), and the
>>> CTIDs are in order on the new index, so 90% of the values are 100%
>>> correlated.
>> But there is no index involved (except in the case of the functional
>> index). The correlation of table columns to physical order of the table
>> doesn't depend on the existence of an index, or the physical order within
>> an index.
>> But I do see that ties within the logical order of the column values are
>> broken to agree with the physical order. That is wrong, right? Is there
>> any argument that this is desirable?
> Uh ... what do you propose doing instead? We'd have to do something with
> ties, and it's not so obvious this way is wrong.
Let them be tied. If there are 10 distinct values, number the values 0 to
9, and all rows of a given distinct values get the same number for the
logical order axis.
Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
me. Although if we switched btree to store duplicate values with tid as a
tie breaker, then maybe it wouldn't be as obviously wrong.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Dec 3, 2017 15:31, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> But I do see that ties within the logical order of the column values are
>>> broken to agree with the physical order.  That is wrong, right?  Is there
>>> any argument that this is desirable?

>> Uh ... what do you propose doing instead?  We'd have to do something with
>> ties, and it's not so obvious this way is wrong.

> Let them be tied.  If there are 10 distinct values, number the values 0 to
> 9, and all rows of a given distinct values get the same number for the
> logical order axis.

> Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
> me.  Although if we switched btree to store duplicate values with tid as a
> tie breaker, then maybe it wouldn't be as obviously wrong.

I thought some more about this.  What we really want the correlation stat
to do is help us estimate how randomly an index-ordered scan will access
the heap.  If the values we've sampled are all unequal then there's no
particular issue.  However, if we have some group of equal values, we do
not really know what order an indexscan will visit them in.  The existing
correlation calculation is making the *most optimistic possible*
assumption, that such a group will be visited exactly in heap order ---
and that assumption isn't too defensible.  IIRC, a freshly built b-tree
will behave that way, because the initial sort of a btree breaks ties using
heap TIDs; but we don't maintain that property during later insertions.
In any case, given that we do this calculation without regard to any
specific index, we can't reasonably expect to model exactly what the index
will do.

It would be best to adopt some middle-of-the-road assumption about what
the heap visitation order will be for a set of duplicate values: not
exactly heap order, but I think we should not use a worst-case assumption
either, since the btree may retain some amount of its initial ordering.

BTW, I disagree that "correlation = zero" is the right answer for this
particular example.  If the btree is freshly built, then an index-order
scan would visit all the heap pages in sequence to fetch "f" rows, and
then visit them all in sequence again to fetch "t" rows, which is a whole
lot better than the purely random access that zero correlation implies.
So I think 0.8 or so is actually a perfectly reasonable answer when the
index is fresh.  The trouble is just that it'd behoove us to derate that
answer somewhat for the probability that the index isn't fresh.

My first thought for a concrete fix was to use the mean position of a
group of duplicates for the purposes of the correlation calculation, but
on reflection that's clearly wrong.  We do have an idea, from the data we
have, whether the duplicates are close together in the heap or spread all
over.  Using only mean position would fail to distinguish those cases, but
really we'd better penalize the spread-all-over case.  I'm not sure how to
do that.

Or we could leave this calculation alone and try to move towards keeping
equal values in TID order in btrees.  Not sure about the downsides of
that, though.

			regards, tom lane
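[Editor's note: Tom's point that mean position alone cannot distinguish clustered from scattered duplicates can be sketched. This is a hypothetical illustration with made-up layout numbers: two heaps where the duplicate group has the same mean position but touches very different numbers of heap pages.]

```python
# Hypothetical layout: value 't' occupies 10% of 1,000,000 rows, 100 rows per
# heap page.  "clustered" puts all 't' rows in one contiguous block mid-heap;
# "scattered" spreads them evenly, one per every 10 rows.
ROWS, PER_PAGE = 1_000_000, 100
n_dup = ROWS // 10

clustered = range(450_000, 450_000 + n_dup)  # contiguous block
scattered = range(5, ROWS, 10)               # every 10th row

def pages_touched(positions):
    # Distinct heap pages an index scan over the group must visit.
    return len({p // PER_PAGE for p in positions})

for name, pos in (("clustered", clustered), ("scattered", scattered)):
    mean = sum(pos) / n_dup
    print(f"{name}: mean position ~{mean:.0f}, pages touched {pages_touched(pos)}")
```

Both layouts have a mean position of about 500,000, yet the clustered group touches 1,000 pages while the scattered one touches 10,000, which is why a mean-position-only statistic cannot carry the penalty Tom wants for the spread-all-over case.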
On 04/12/2017 00:11, Vitaliy Garnashevich wrote:
On 03/12/2017 03:27, Jeff Janes wrote:

When I increase effective_cache_size to 1024MB, I start getting the plan with the slower index i2, too. Due to that, when I disable bitmapscans and seqscans, I start getting slow index scans on the wrong index, i2 rather than i1. I don't know why he doesn't see that in his example.
Bitmap Heap Scan on public.aaa (cost=12600.90..23688.70 rows=9488 width=5) (actual time=107.529..115.902 rows=9976 loops=1)
-> BitmapAnd (cost=12600.90..12600.90 rows=9488 width=0) (actual time=105.133..105.133 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..1116.43 rows=96000 width=0) (actual time=16.313..16.313 rows=100508 loops=1)
-> Bitmap Index Scan on i2 (cost=0.00..11479.47 rows=988338 width=0) (actual time=77.950..77.950 rows=1000200 loops=1)
Index Scan using i2 on public.aaa (cost=0.44..48227.31 rows=9488 width=5) (actual time=0.020..285.695 rows=9976 loops=1)
Seq Scan on public.aaa (cost=0.00..169248.54 rows=9488 width=5) (actual time=0.024..966.469 rows=9976 loops=1)
This way the estimates and the actual times make more sense. But then there's the question - maybe it's that i1 runs too fast, and is estimated incorrectly? Why does that happen?
I've tried to create a better test case:
- Increase shared_buffers and effective_cache_size to fit whole database, including indexes.
- Use random(), to avoid correlation between the filtered values.
- Make both columns of integer type, to avoid special cases with boolean (like the one happened with CREATE STATISTICS).
- Flush OS disk cache and then try running the query several times, to get both cold-cache results and all-in-ram results.
- There are several tests, with different frequency of the selected values in the two columns: [1/10, 1/10], [1/50, 1/10], [1/100, 1/10].
- There is a split of cost by contribution of each of its components: seq_page_cost, random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost. The EXPLAIN is run for each component, every time with only one of the components set to non-zero.
- The test was run on a Digitalocean VM: Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-101-generic x86_64), 2 GB RAM, 2 core CPU, SSD; PostgreSQL 9.5.10.
shared_buffers = 1024MB
effective_cache_size = 1024MB
work_mem = 100MB
create table aaa as select floor(random()*10)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create table aaa as select floor(random()*50)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create table aaa as select floor(random()*100)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create index i1 on aaa (num);
create index i2 on aaa (flag);
set enable_bitmapscan = on; set enable_indexscan = off; set enable_seqscan = off;
set enable_bitmapscan = off; set enable_indexscan = on; set enable_seqscan = off;
set enable_bitmapscan = off; set enable_indexscan = off; set enable_seqscan = on;
set seq_page_cost = 1.0; set random_page_cost = 1.0; set cpu_tuple_cost = 0.01; set cpu_index_tuple_cost = 0.005; set cpu_operator_cost = 0.0025;
explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = 1;
Bitmap Heap Scan on public.aaa (cost=20687.87..66456.91 rows=101403 width=8) (actual time=345.349..6061.834 rows=99600 loops=1) read=45091
Bitmap Heap Scan on public.aaa (cost=20687.87..66456.91 rows=101403 width=8) (actual time=593.915..991.403 rows=99600 loops=1) hit=45091
Bitmap Heap Scan on public.aaa (cost=20687.87..66456.91 rows=101403 width=8) (actual time=255.273..355.694 rows=99600 loops=1) hit=45091
Bitmap Heap Scan on public.aaa (cost=20687.87..66456.91 rows=101403 width=8) (actual time=284.768..385.505 rows=99600 loops=1) hit=45091
? + ? + 1014.03 + ? + 5595.52 = ?
Bitmap Heap Scan on public.aaa (cost=12081.43..28452.09 rows=19644 width=8) (actual time=238.566..3115.445 rows=20114 loops=1) read=19425
Bitmap Heap Scan on public.aaa (cost=12081.43..28452.09 rows=19644 width=8) (actual time=314.590..382.207 rows=20114 loops=1) hit=19425
Bitmap Heap Scan on public.aaa (cost=12081.43..28452.09 rows=19644 width=8) (actual time=265.899..311.064 rows=20114 loops=1) hit=19425
Bitmap Heap Scan on public.aaa (cost=12081.43..28452.09 rows=19644 width=8) (actual time=209.470..237.697 rows=20114 loops=1) hit=19425
9689.92 + ? + 196.44 + ? + ? = ?
Bitmap Heap Scan on public.aaa (cost=11273.15..20482.50 rows=10090 width=8) (actual time=284.381..2019.717 rows=10114 loops=1) read=12059
Bitmap Heap Scan on public.aaa (cost=11273.15..20482.50 rows=10090 width=8) (actual time=153.445..180.770 rows=10114 loops=1) hit=12059
Bitmap Heap Scan on public.aaa (cost=11273.15..20482.50 rows=10090 width=8) (actual time=146.275..159.446 rows=10114 loops=1) hit=12059
Bitmap Heap Scan on public.aaa (cost=11273.15..20482.50 rows=10090 width=8) (actual time=140.973..153.998 rows=10114 loops=1) hit=12059
4098.28 + ? + 100.90 + ? + ? = ?
Seq Scan on public.aaa (cost=0.00..194248.49 rows=101403 width=8) (actual time=0.126..2056.913 rows=99600 loops=1) read=44248
Seq Scan on public.aaa (cost=0.00..194248.49 rows=101403 width=8) (actual time=0.045..1595.377 rows=99600 loops=1) hit=32 read=44216
Seq Scan on public.aaa (cost=0.00..194248.49 rows=101403 width=8) (actual time=0.066..1392.700 rows=99600 loops=1) hit=64 read=44184
Seq Scan on public.aaa (cost=0.00..194248.49 rows=101403 width=8) (actual time=0.069..1378.574 rows=99600 loops=1) hit=96 read=44152
44248.00 + 0.00 + 100000.33 + 0.00 + 50000.17 = 194248.5
Seq Scan on public.aaa (cost=0.00..194247.77 rows=19644 width=8) (actual time=0.646..1801.794 rows=20114 loops=1) read=44248
Seq Scan on public.aaa (cost=0.00..194247.77 rows=19644 width=8) (actual time=0.385..1518.613 rows=20114 loops=1) hit=32 read=44216
Seq Scan on public.aaa (cost=0.00..194247.77 rows=19644 width=8) (actual time=0.346..1369.021 rows=20114 loops=1) hit=64 read=44184
Seq Scan on public.aaa (cost=0.00..194247.77 rows=19644 width=8) (actual time=0.597..1792.959 rows=20114 loops=1) hit=96 read=44152
44248.00 + 0.00 + 99999.85 + 0.00 + 49999.93 = 194247.78
Seq Scan on public.aaa (cost=0.00..194247.77 rows=10090 width=8) (actual time=0.700..2194.195 rows=10114 loops=1) read=44248
Seq Scan on public.aaa (cost=0.00..194247.77 rows=10090 width=8) (actual time=0.145..1401.274 rows=10114 loops=1) hit=32 read=44216
Seq Scan on public.aaa (cost=0.00..194247.77 rows=10090 width=8) (actual time=0.185..1602.002 rows=10114 loops=1) hit=64 read=44184
Seq Scan on public.aaa (cost=0.00..194247.77 rows=10090 width=8) (actual time=0.184..1353.162 rows=10114 loops=1) hit=96 read=44152
44248.00 + 0.00 + 99999.85 + 0.00 + 49999.93 = 194247.78
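[Editor's note: the seq-scan component sums above can be checked directly; a sketch assuming the reltuples figure of 9,999,985 reported by pg_class earlier in the thread, and two quals each charged one cpu_operator_cost per tuple.]

```python
# Seq Scan cost identity:
#   total = pages*seq_page_cost
#         + tuples*cpu_tuple_cost
#         + tuples*n_quals*cpu_operator_cost
pages, tuples = 44248, 9_999_985
seq_page_cost, cpu_tuple_cost, cpu_operator_cost = 1.0, 0.01, 0.0025
n_quals = 2  # "num = 1 and flag = 1"

total = (pages * seq_page_cost
         + tuples * cpu_tuple_cost
         + tuples * n_quals * cpu_operator_cost)
print(f"{total:.2f}")  # ~194247.78, the estimate shown for every Seq Scan above
```

The three non-zero terms are exactly the 44248.00, 99999.85, and 49999.93 in the sums above; random_page_cost and cpu_index_tuple_cost contribute nothing to a sequential scan.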
Index Scan using i1 on public.aaa (cost=0.43..67358.15 rows=101403 width=8) (actual time=0.400..1325.638 rows=99600 loops=1) read=46983
Index Scan using i1 on public.aaa (cost=0.43..67358.15 rows=101403 width=8) (actual time=0.020..479.713 rows=99600 loops=1) hit=46983
Index Scan using i1 on public.aaa (cost=0.43..67358.15 rows=101403 width=8) (actual time=0.024..642.947 rows=99600 loops=1) hit=46983
Index Scan using i1 on public.aaa (cost=0.43..67358.15 rows=101403 width=8) (actual time=0.038..756.045 rows=99600 loops=1) hit=46983
45.95 + 46638.42 + 10336.67 + 5168.34 + 5168.77 = 67358.15
Rows Removed by Filter: 900156
Index Scan using i1 on public.aaa (cost=0.43..48809.34 rows=19644 width=8) (actual time=0.600..1059.269 rows=20114 loops=1) read=44373
Index Scan using i1 on public.aaa (cost=0.43..48809.34 rows=19644 width=8) (actual time=0.071..208.940 rows=20114 loops=1) hit=44373
Index Scan using i1 on public.aaa (cost=0.43..48809.34 rows=19644 width=8) (actual time=0.044..124.437 rows=20114 loops=1) hit=44373
Index Scan using i1 on public.aaa (cost=0.43..48809.34 rows=19644 width=8) (actual time=0.044..127.814 rows=20114 loops=1) hit=44373
0.23 + 44788.68 + 2010.00 + 1005.00 + 1005.43 = 48809.34
Rows Removed by Filter: 179792
Index Scan using i1 on public.aaa (cost=0.43..46544.80 rows=10090 width=8) (actual time=0.647..1510.482 rows=10114 loops=1) read=39928
Index Scan using i1 on public.aaa (cost=0.43..46544.80 rows=10090 width=8) (actual time=0.035..141.847 rows=10114 loops=1) hit=39928
Index Scan using i1 on public.aaa (cost=0.43..46544.80 rows=10090 width=8) (actual time=0.032..86.716 rows=10114 loops=1) hit=39928
Index Scan using i1 on public.aaa (cost=0.43..46544.80 rows=10090 width=8) (actual time=0.032..79.492 rows=10114 loops=1) hit=39928
0.01 + 44524.36 + 1010.00 + 505.00 + 505.44 = 46544.81
Rows Removed by Filter: 89762
Index Scan using i2 on public.aaa (cost=0.43..39166.34 rows=101403 width=8) (actual time=1.337..1543.611 rows=99600 loops=1) read=46985
Index Scan using i2 on public.aaa (cost=0.43..39166.34 rows=101403 width=8) (actual time=0.027..410.623 rows=99600 loops=1) hit=46985
Index Scan using i2 on public.aaa (cost=0.43..39166.34 rows=101403 width=8) (actual time=0.025..377.529 rows=99600 loops=1) hit=46985
Index Scan using i2 on public.aaa (cost=0.43..39166.34 rows=101403 width=8) (actual time=0.025..377.554 rows=99600 loops=1) hit=46985
2979.08 + 16566.83 + 9810.00 + 4905.00 + 4905.44 = 39166.35
Rows Removed by Filter: 900835
Index Scan using i2 on public.aaa (cost=0.43..39247.83 rows=19644 width=8) (actual time=0.783..2236.765 rows=20114 loops=1) read=46985
Index Scan using i2 on public.aaa (cost=0.43..39247.83 rows=19644 width=8) (actual time=0.198..777.279 rows=20114 loops=1) hit=46985
Index Scan using i2 on public.aaa (cost=0.43..39247.83 rows=19644 width=8) (actual time=0.196..437.177 rows=20114 loops=1) hit=46985
Index Scan using i2 on public.aaa (cost=0.43..39247.83 rows=19644 width=8) (actual time=0.075..346.481 rows=20114 loops=1) hit=46985
2949.05 + 16751.68 + 9773.33 + 4886.66 + 4887.10 = 39247.82
Rows Removed by Filter: 980350
Index Scan using i2 on public.aaa (cost=0.43..40167.34 rows=10090 width=8) (actual time=0.619..1350.886 rows=10114 loops=1) read=46987
Index Scan using i2 on public.aaa (cost=0.43..40167.34 rows=10090 width=8) (actual time=0.069..448.975 rows=10114 loops=1) hit=46987
Index Scan using i2 on public.aaa (cost=0.43..40167.34 rows=10090 width=8) (actual time=0.048..383.066 rows=10114 loops=1) hit=46987
Index Scan using i2 on public.aaa (cost=0.43..40167.34 rows=10090 width=8) (actual time=0.050..387.874 rows=10114 loops=1) hit=46987
2974.39 + 17212.52 + 9990.00 + 4995.00 + 4995.44 = 40167.35
Rows Removed by Filter: 991389
What seems odd to me is that in different kinds of tests (with different frequency of column values):
i1 Rows Removed by Filter = 900156, 179792, 89762 (decreased a lot)
i1 buffers = 46983, 44373, 39928 (decreased, but not a lot)
i1 best case time = 756.045, 127.814, 79.492 (decreased a lot, as well as probably average case too)
i1 cost estimates = 67358.15, 48809.34, 46544.80 (did not decrease a lot)
i2 Rows Removed by Filter = 900835, 980350, 991389
i2 buffers = 46985, 46985, 46987
i2 best case time = 377.554, 346.481, 387.874
i2 cost estimates = 39166.34, 39247.83, 40167.34
It's odd that the increase in actual execution time for "i1" was not reflected enough in the cost estimates. The cost didn't even go below the "i2" costs.
Details about the test are attached.
Regards,
Vitaliy
On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > On Dec 3, 2017 15:31, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> >> Jeff Janes <jeff.janes@gmail.com> writes:
> >>> But I do see that ties within the logical order of the column values are
> >>> broken to agree with the physical order.  That is wrong, right?  Is there
> >>> any argument that this is desirable?
>
> >> Uh ... what do you propose doing instead?  We'd have to do something with
> >> ties, and it's not so obvious this way is wrong.
>
> > Let them be tied. ...
>
> I thought some more about this.  What we really want the correlation stat
> to do is help us estimate how randomly an index-ordered scan will access
> the heap.  If the values we've sampled are all unequal then there's no
> particular issue.  However, if we have some group of equal values, we
> do not really know what order an indexscan will visit them in.  The
> existing correlation calculation is making the *most optimistic possible*
> assumption, that such a group will be visited exactly in heap order ---
> and that assumption isn't too defensible.

I'm interested in discussion regarding bitmap cost, since it would have
helped our case discussed here ~18 months ago:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com

...but remember: in Vitaliy's case (as opposed to mine), the index scan is
*faster* but being estimated at higher cost than bitmap (I have to keep
reminding myself).  So the rest of this discussion is about the
overly-optimistic cost estimate of index scans, which moves in the opposite
direction for this reported problem.  For the test cases I looked at, index
scans were used when RPC=1 and redundant conditions were avoided, so I'm
not sure if there's any remaining issue (but I haven't looked at the latest
cases Vitaliy sent).
> In any case, given that we do this calculation without regard
> to any specific index,

One solution is to compute stats (at least correlation) for all indices,
not just expression indexes.  I did that earlier this year while throwing
around/out ideas:
https://www.postgresql.org/message-id/20170707234119.GN17566%40telsasoft.com

> We do have an idea, from the data we have, whether the duplicates are close
> together in the heap or spread all over.

I think you just mean pg_stats.correlation for all values, not just
duplicates (with the understanding that duplicates might be a large
fraction of the tuples, and high weight in correlation).

Another issue I noted in an earlier thread is that as table sizes increase,
the existing correlation computation approaches 1 for correlated insertions
(like "append-only" timestamps clustered around now()), due to ANALYZE
sampling a fraction of the table, and thereby representing only large-scale
correlation, and, to an increasing degree, failing to represent small-scale
variations between adjacent index TIDs, which has real cost (and for which
the mitigation by cache effects probably decreases WRT table size, too).
I think any solution needs to handle this somehow.

Generated data demonstrating this (I reused this query so it's more
complicated than it needs to be):

[pryzbyj@database ~]$ time for sz in 9999{,9{,9{,9{,9}}}} ; do psql postgres -tc "DROP TABLE IF EXISTS t; CREATE TABLE t(i float, j int); CREATE INDEX ON t(i); INSERT INTO t SELECT i/99999.0+pow(2,(-random())) FROM generate_series(1,$sz) i ORDER BY i; ANALYZE t; SELECT $sz, correlation, most_common_freqs[1] FROM pg_stats WHERE attname='i' AND tablename='t'"; done

    9999 | 0.187146 |
   99999 | 0.900629 |
  999999 | 0.998772 |
 9999999 | 0.999987 |

Trying to keep it all in my own head: for a sufficiently large number of
pages, a bitmap scan should be preferred to an index scan due to the reduced
random-page cost outweighing its overhead in CPU cost.  Probably by
penalizing index scans, not discounting bitmap scans.  Conceivably a
correlation adjustment can be conditionalized or weighted based on
index_pages_fetched() ... x = ln(x/999999); if (x > 1) correlation /= x;

I think one could look at the fraction of duplicated index keys expected to
be returned: if we expect to return 1000 tuples, with 200 duplicates from
MCV, cost_index would multiply correlation by (1 - 200/1000), meaning to
use something closer to max_IO_cost rather than min_IO_cost.  I imagine
it'd be possible to limit to only those MCVs which pass quals - if none
pass, then there may be few tuples returned, so apply no correction to
(additionally) penalize index scan.

In my tests, at one point I implemented idx_corr_fudge(), returning a value
like "fragmentation" from pgstatindex (which I'm sure is where I got the
phrase when reporting the problem).  That only uses the leaf nodes' "next"
pointer, and not the individual tuples, which probably works if there's a
sufficient number of repeated keys.

I think that's all for now..

Justin
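[Editor's note: Justin's shell demo can be approximated outside PostgreSQL. This sketch is an illustration, not ANALYZE's actual estimator: it generates the same "append-only trend plus O(1) noise" data and computes an ordinary Pearson correlation of value against insertion position, showing the figure climbing toward 1 as n grows even though the local, random-IO-causing disorder stays constant.]

```python
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(0)
corrs = {}
for n in (9_999, 99_999, 999_999):
    # i/99999.0 is the slow "append-only" trend; 2**-random() is local noise.
    vals = [i / 99999.0 + 2 ** -random.random() for i in range(n)]
    corrs[n] = pearson(vals, list(range(n)))
    print(n, round(corrs[n], 3))
```

The trend term grows linearly with n while the noise stays bounded, so the correlation climbs from roughly 0.2 toward 1, tracking the 0.187 / 0.901 / 0.999 column in the psql output above.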
> What seems odd to me is that in different kinds of tests (with
> different frequency of column values):
>
> i1 Rows Removed by Filter = 900156, 179792, 89762 (decreased a lot)
> i1 buffers = 46983, 44373, 39928 (decreased, but not a lot)
> i1 best case time = 756.045, 127.814, 79.492 (decreased a lot, as well
> as probably average case too)
> i1 cost estimates = 67358.15, 48809.34, 46544.80 (did not decrease a lot)
>
> i2 Rows Removed by Filter = 900835, 980350, 991389
> i2 buffers = 46985, 46985, 46987
> i2 best case time = 377.554, 346.481, 387.874
> i2 cost estimates = 39166.34, 39247.83, 40167.34
>
> It's odd that increase in actual execution time for "i1" was not
> reflected enough in cost estimates. The cost even didn't go below "i2"
> costs.

I've added some logging, in order to get the actual numbers which were used for estimation.

--drop table if exists aaa;
--create table aaa as select floor(random()*100)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
--analyze aaa;
--set enable_bitmapscan = off; set enable_indexscan = on; set enable_seqscan = off;
--set seq_page_cost = 1.0; set random_page_cost = 1.0; set cpu_tuple_cost = 0.01; set cpu_index_tuple_cost = 0.005; set cpu_operator_cost = 0.0025;

--create index i1 on aaa (num);
--drop index if exists i2;
--explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = 1;
Index Scan using i1 on public.aaa  (cost=0.43..46697.59 rows=10641 width=8) (actual time=0.047..153.521 rows=9826 loops=1)
  Rows Removed by Filter: 89948

--drop index if exists i1;
--create index i2 on aaa (flag);
--explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = 1;
Index Scan using i2 on public.aaa  (cost=0.43..39583.11 rows=10641 width=8) (actual time=0.098..351.454 rows=9826 loops=1)
  Rows Removed by Filter: 990249

LOG:  cost_index:
seq_page_cost=1.00, random_page_cost=1.00, cpu_tuple_cost=0.0100, cpu_index_tuple_cost=0.0050, cpu_operator_cost=0.0025, effective_cache_size=131072
indexStartupCost=0.43, indexTotalCost=1103.94, indexSelectivity=0.01076667, indexCorrelation=0.00208220
baserel->tuples=10000033.00, baserel->pages=44248.00, baserel->allvisfrac=0.00000000
tuples_fetched=107667.00, pages_fetched=477.00
max_IO_cost=44248.0000, min_IO_cost=477.0000, csquared=0.0000
qpqual_cost.startup=0.0000, qpqual_cost.per_tuple=0.0025, cpu_per_tuple=0.0125
spc_seq_page_cost=1.00, spc_random_page_cost=1.00
startup_cost=0.43, total_cost=46697.59

LOG:  cost_index:
seq_page_cost=1.00, random_page_cost=1.00, cpu_tuple_cost=0.0100, cpu_index_tuple_cost=0.0050, cpu_operator_cost=0.0025, effective_cache_size=131072
indexStartupCost=0.43, indexTotalCost=10123.93, indexSelectivity=0.09883333, indexCorrelation=0.82505685
baserel->tuples=10000000.00, baserel->pages=44248.00, baserel->allvisfrac=0.00000000
tuples_fetched=988333.00, pages_fetched=4374.00
max_IO_cost=44248.0000, min_IO_cost=4374.0000, csquared=0.6807
qpqual_cost.startup=0.0000, qpqual_cost.per_tuple=0.0025, cpu_per_tuple=0.0125
spc_seq_page_cost=1.00, spc_random_page_cost=1.00
startup_cost=0.43, total_cost=39583.11

Here is a breakdown of the total_cost into components, for the i1 query and for the i2 query (some rounding was removed from the formula for brevity):

path->path.total_cost =
  (indexTotalCost + qpqual_cost.startup) +
  (max_IO_cost + csquared * (min_IO_cost - max_IO_cost)) +
  (cpu_tuple_cost + qpqual_cost.per_tuple) * (indexSelectivity * baserel->tuples);

i1: path->path.total_cost =
  1103.94 + 0.0000 +                                //  1103.94 +
  44248.0000 + 0.0000 * (477.0000 - 44248.0000) +   // 44248.00 +
  (0.0100 + 0.0025) * (0.01076667 * 10000033.00)    //  1345.84
  = 46697.78;                                       // = 46697.78

i2: path->path.total_cost =
  10123.93 + 0.0000 +                               // 10123.93 +
  44248.0000 + 0.6807 * (4374.0000 - 44248.0000) +  // 17105.77 +
  (0.0100 + 0.0025) * (0.09883333 * 10000000.00)    // 12354.17
  = 39583.86;                                       // = 39583.86

PS. The code used for logging (added to cost_index() in src/backend/optimizer/path/costsize.c, PostgreSQL 9.3.1):

ereport(LOG,
        (errmsg("cost_index: \n"
                "seq_page_cost=%.2f, random_page_cost=%.2f, cpu_tuple_cost=%.4f, cpu_index_tuple_cost=%.4f, cpu_operator_cost=%.4f, effective_cache_size=%.0f\n"
                "indexStartupCost=%.2f, indexTotalCost=%.2f, indexSelectivity=%.8f, indexCorrelation=%.8f\n"
                "baserel->tuples=%.2f, baserel->pages=%.2f, baserel->allvisfrac=%.8f\n"
                "tuples_fetched=%.2f, pages_fetched=%.2f\n"
                "max_IO_cost=%.4f, min_IO_cost=%.4f, csquared=%.4f\n"
                "qpqual_cost.startup=%.4f, qpqual_cost.per_tuple=%.4f, cpu_per_tuple=%.4f\n"
                "spc_seq_page_cost=%.2f, spc_random_page_cost=%.2f\n"
                "startup_cost=%.2f, total_cost=%.2f\n",
                seq_page_cost, random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost, (double) effective_cache_size,
                indexStartupCost, indexTotalCost, indexSelectivity, indexCorrelation,
                baserel->tuples, (double) baserel->pages, baserel->allvisfrac,
                tuples_fetched, pages_fetched,
                max_IO_cost, min_IO_cost, csquared,
                qpqual_cost.startup, qpqual_cost.per_tuple, cpu_per_tuple,
                spc_seq_page_cost, spc_random_page_cost,
                startup_cost, startup_cost + run_cost)));

Regards,
Vitaliy
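The breakdown above can be double-checked with a small standalone sketch. This is not PostgreSQL code, just the same formula re-typed with the constants taken from the two log lines; the function name is mine:

```python
def index_scan_total_cost(index_total_cost, qpqual_startup,
                          max_io_cost, min_io_cost, csquared,
                          cpu_tuple_cost, qpqual_per_tuple,
                          index_selectivity, baserel_tuples):
    """Simplified rendering of the total_cost formula from cost_index()."""
    # IO cost interpolates between the worst case (random order) and
    # best case (perfectly correlated) using csquared = correlation^2.
    io_cost = max_io_cost + csquared * (min_io_cost - max_io_cost)
    # CPU cost is charged per heap tuple fetched via the index.
    cpu_cost = (cpu_tuple_cost + qpqual_per_tuple) * \
               (index_selectivity * baserel_tuples)
    return index_total_cost + qpqual_startup + io_cost + cpu_cost

# Constants from the two LOG lines above:
i1 = index_scan_total_cost(1103.94, 0.0, 44248.0, 477.0, 0.0,
                           0.01, 0.0025, 0.01076667, 10000033.0)
i2 = index_scan_total_cost(10123.93, 0.0, 44248.0, 4374.0, 0.6807,
                           0.01, 0.0025, 0.09883333, 10000000.0)
print(round(i1, 2), round(i2, 2))  # 46697.78 39583.86, matching the breakdown
```

Note how i2's large indexTotalCost and CPU component are more than paid for by the csquared discount on its IO cost, which is what lets the slower plan win on estimated cost.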
On Sun, Dec 3, 2017 at 1:15 PM, Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
On 02/12/2017 23:17, Jeff Janes wrote:

> I can't reproduce the exact issue on the real database any more. The query
> started to use the slow bitmap scan recently, and had been doing so for some
> time lately, but now it's switched back to using the index scan. The table
> involved in the query gets modified a lot. It has hundreds of millions of
> rows. Lots of new rows are appended to it every day, the oldest rows are
> sometimes removed. The table is analyzed at least daily. It's possible that
> the statistics were updated and that caused the query to run differently. But
> I still would like to understand why that issue happened, and how to properly
> fix it, in case the issue returns.

Right, so there is a CPU costing problem (which could only be fixed by hacking postgresql and recompiling it), but it is a much smaller problem than the IO cost not being accurate due to the high hit rate. Fixing the CPU costing problem is unlikely to make a difference to your real query. If you set the page costs to zero, what happens to your real query?
While your test case displays some cost estimation issues, there is really no reason to think that they are the same issues your real query shows. Particularly since you said the difference was a factor of 30 in the real case, rather than 3. Any chance you can show EXPLAIN ANALYZE output for the real query, both when it is acting up and when it is not? Something in the plans might stand out to us as the obvious problem. On the other hand, maybe nothing will stand out without having a replicable test case. The only way to know is to try.
> In the real database, not everything is in memory. There are 200GB+ of RAM,
> but the DB is 500GB+. The table involved in the query is itself 60GB+ of data
> and 100GB+ of indexes. I'm running the test case in a way where all reads are
> done from RAM, only to make it easier to reproduce and to avoid unrelated
> effects.
>
> But I doubt that the settings seq_page_cost = random_page_cost = 0.0 should
> actually be used.

Why not? If your production server really has everything in memory during normal operation, that is the correct course of action. If you ever restart the server, then you could have some unpleasant time getting it back up to speed again, but pg_prewarm could help with that.
Is everything that the particular query in question needs in memory, even if other queries need things from disk? Or does the problematic query also need things from disk? If the query does need to read things from disk, the bitmap actually should be faster. Which reinforces the idea that maybe the issue brought up by your test case is not the same as the issue brought up by your real case, even if they both point in the same direction.
As far as I know, costs in Postgres were designed to be relative to seq_page_cost, which for that reason is usually defined as 1.0. Even if everything were in RAM, accesses to the pages would still not have zero cost. Setting 0.0 just seems too extreme, as all other non-zero costs would become infinitely bigger.
When exploring things, 0.0 certainly helps to simplify things. Yes, 0.05 or something similar might be better for a completely cached database. The problem is that it is very context dependent. Reading a page from shared_buffers when there is no contention from other processes for the same page is probably less than 0.01. If it is not in shared_buffers but is in effective_cache_size, it is probably a few multiples of 0.01. If there is contention either for that specific page, or for available buffers into which to read pages, then it could be substantially higher yet. But none of those are things the planner is aware of.
> Exactly. I'd like to understand why the worse plan is being chosen, and 1) if
> it's fixable by tuning costs, to figure out the right settings which could be
> used in production, 2) if there is a bug in the Postgres optimizer, then to
> bring some attention to it, so that it's eventually fixed in a future
> release, 3) if Postgres is supposed to work this way, then at least I (and
> people who ever read this thread) would understand it better.

If you really want to target the plan with the BitmapAnd, you should increase cpu_index_tuple_cost and/or cpu_operator_cost but not increase cpu_tuple_cost. That is because the unselective bitmap index scan does not incur any cpu_tuple_cost, but does incur index_tuple and operator costs. Unfortunately, all other index scans in the system will also be skewed by such a change if you make the change system-wide.
I would argue that it is a planner "bug" (quotes because it doesn't give wrong answers, just sub-optimal plans), but one that is very hard to pin down, and one that also depends on the hardware you are running on. Also, people have made some optimizations to the machinery behind the bitmap code recently, as well as to the costing of the bitmap code, so if it is a bug, the size of it changes with the version you are using. If your aim is to improve the planner (rather than simply tuning the planner that currently exists) then you should probably 1) make your test case use random number generators, rather than modulus, to avoid cross-column correlation and other such issues, 2) run it against 11dev code, which is where improvements to PostgreSQL are targeted, rather than against production versions, and 3) post to pgsql-hackers, rather than pgsql-performance.
Cheers,
Jeff
On Tue, Dec 5, 2017 at 10:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Dec 3, 2017 15:31, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> But I do see that ties within the logical order of the column values are
>>> broken to agree with the physical order. That is wrong, right? Is there
>>> any argument that this is desirable?
>> Uh ... what do you propose doing instead? We'd have to do something with
>> ties, and it's not so obvious this way is wrong.
> Let them be tied. If there are 10 distinct values, number the values 0 to
> 9, and all rows of a given distinct values get the same number for the
> logical order axis.
> Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
> me. Although if we switched btree to store duplicate values with tid as a
> tie breaker, then maybe it wouldn't be as obviously wrong.
I thought some more about this. What we really want the correlation stat
to do is help us estimate how randomly an index-ordered scan will access
the heap.
The correlation is used in another place, estimating how much of the table we will visit in the first place. If the correlation is very high, then scanning 10% of the index leaf pages means we will visit 10% of the table. If the correlation is low, then we use Mackert and Lohman, and (in the case of visiting 10% of the index) predict we will visit most of the table. Assuming effective_cache_size is high, we will visit most of the table just once, but still in a random order, because subsequent visits for the same query will be found in the cache, rather than the various pages being visited repeatedly and not found in cache each time.
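The Mackert and Lohman estimate mentioned here can be sketched as follows. This is a simplified rendering modeled on index_pages_fetched() in costsize.c (function and variable names are mine), fed with the constants from the logging output earlier in the thread (T=44248 heap pages, effective_cache_size=131072 pages):

```python
import math

def index_pages_fetched(tuples_fetched, pages, cache_pages):
    """Mackert-Lohman approximation of distinct heap pages fetched by an
    uncorrelated index scan, modeled on index_pages_fetched() in costsize.c."""
    T = max(pages, 1.0)        # pages in table
    b = max(cache_pages, 1.0)  # cache size in pages
    Ns = tuples_fetched        # N * selectivity
    if T <= b:
        # Table fits in cache: each page needs to be read at most once.
        pf = (2.0 * T * Ns) / (2.0 * T + Ns)
        return math.ceil(min(pf, T))
    # Table exceeds cache: beyond the limit, pages start being re-read.
    lim = (2.0 * T * b) / (2.0 * T - b)
    if Ns <= lim:
        pf = (2.0 * T * Ns) / (2.0 * T + Ns)
    else:
        pf = b + (Ns - lim) * (T - b) / T
    return math.ceil(pf)

# With the thread's numbers, both queries are predicted to touch
# essentially the whole heap (capped at T = 44248 pages):
print(index_pages_fetched(107667, 44248, 131072))  # i1 -> 44248
print(index_pages_fetched(988333, 44248, 131072))  # i2 -> 44248
```

Which is exactly why max_IO_cost came out as 44248.0000 in both log lines: at these selectivities the uncorrelated case is predicted to visit every heap page once.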
In addition to estimating how much of the table we visit, we also estimate how "sequential like" those visits are. Which is the use that you describe. Ideally for that use case, we would know for each distinct value, how correlated the tids are with the leaf page ordering. If the index is freshly built, that is very high. We visit 1/10 of the index, which causes us to visit 100% of the table but in perfect order, plucking 1/10 of the tuples from each table page.
But visiting 100% of the table in physical order in order to pluck out 10% of the tuples from each page is quite different than visiting 10% of the table pages in physical order to pluck out 100% of the tuples from those pages and 0% from the pages not visited.
...
BTW, I disagree that "correlation = zero" is the right answer for this
particular example. If the btree is freshly built, then an index-order
scan would visit all the heap pages in sequence to fetch "f" rows, and
then visit them all in sequence again to fetch "t" rows, which is a whole
lot better than the purely random access that zero correlation implies.
So I think 0.8 or so is actually a perfectly reasonable answer when the
index is fresh. The trouble is just that it'd behoove us to derate that
answer somewhat for the probability that the index isn't fresh.
But, for the case of "how much of the table do we visit at all", correlation = zero is the right answer, even if it isn't the right answer for "how sequentially do we visit whatever we visit".
My first thought for a concrete fix was to use the mean position of
a group of duplicates for the purposes of the correlation calculation,
but on reflection that's clearly wrong. We do have an idea, from the
data we have, whether the duplicates are close together in the heap
or spread all over. Using only mean position would fail to distinguish
those cases, but really we'd better penalize the spread-all-over case.
I'm not sure how to do that.
Departing from correlations, we could also try to estimate "How many different table pages does each index leaf page reference". This could capture functional dependencies which are strong, but not in the form of linear correlations. (The current extended statistics only captures dependencies between user columns, not between one user column and one system column such as table slot)
For whatever its worth, here is my "let ties be ties" patch.
It breaks two regression tests due to plan changes, and both are cases where maybe the plan ought to change for the very reason being discussed. If I just put random gibberish into the correlation field, more regression tests fail, so I think my implementation is not too far broken.
The accumulations into corr_ysum and corr_y2sum could trivially be pushed down into the "if", and corr_xysum could as well with a little algebra. But that seems like premature optimization for a proof-of-concept patch.
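For a rough illustration of what is at stake (a toy sketch, not the patch itself), here is a comparison of the two tie-handling choices on an interleaved two-valued column, similar in spirit to the flag column in the test case. Ranks, names, and the n=1000 setup are mine:

```python
def pearson(xs, ys):
    # Plain Pearson linear correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

n = 1000
# Physical (heap) order: two values perfectly interleaved.
physical = list(range(n))
values = [i % 2 for i in range(n)]

# Current behavior: ties in the logical order are broken by physical
# position, so duplicates look perfectly heap-ordered.
order = sorted(physical, key=lambda i: (values[i], i))
rank_broken = [0] * n
for rank, i in enumerate(order):
    rank_broken[i] = rank

# "Let ties be ties": every duplicate gets the same logical rank.
rank_tied = values

print(pearson(physical, rank_broken))  # ~0.5: looks fairly correlated
print(pearson(physical, rank_tied))    # ~0.0: ties left tied
```

So the same perfectly interleaved data reads as substantially correlated under the current tie-breaking, and as uncorrelated when ties are left tied, which is the crux of the disagreement about whether 0.8 or 0.0 is the "right" answer.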
Cheers,
Jeff
Attachment
On Tue, Dec 5, 2017 at 11:06 PM, Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
This is very cool, thanks.
I've tried to create a better test case:
- Increase shared_buffers and effective_cache_size to fit whole database, including indexes.
- Use random(), to avoid correlation between the filtered values.
- Make both columns of integer type, to avoid special cases with boolean (like the one happened with CREATE STATISTICS).
- Flush OS disk cache and then try running the query several times, to get both cold-cache results and all-in-ram results.
- There are several tests, with different frequency of the selected values in the two columns: [1/10, 1/10], [1/50, 1/10], [1/100, 1/10].
- There is a split of cost by contribution of each of its components: seq_page_cost, random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost. The EXPLAIN is run for each component, every time with only one of the components set to non-zero.
Where you have question marks, that means you could not force it into the plan you wanted with all-but-one settings being zero?
- The test was run on a Digitalocean VM: Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-101-generic x86_64), 2 GB RAM, 2 core CPU, SSD; PostgreSQL 9.5.10.
shared_buffers = 1024MB
effective_cache_size = 1024MB
I would set this even higher.
work_mem = 100MB
create table aaa as select floor(random()*10)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create table aaa as select floor(random()*50)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create table aaa as select floor(random()*100)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create index i1 on aaa (num);
create index i2 on aaa (flag);
set enable_bitmapscan = on; set enable_indexscan = off; set enable_seqscan = off;
set enable_bitmapscan = off; set enable_indexscan = on; set enable_seqscan = off;
set enable_bitmapscan = off; set enable_indexscan = off; set enable_seqscan = on;
set seq_page_cost = 1.0; set random_page_cost = 1.0; set cpu_tuple_cost = 0.01; set cpu_index_tuple_cost = 0.005; set cpu_operator_cost = 0.0025;
explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = 1;
One thing to try is to use explain (analyze, timing off), and then get the total execution time from the summary line at the end of the explain, rather than from the "actual time" fields. Collecting the times of each individual step in the execution can impose a lot of overhead, and some plans have more of this artificial overhead than others. It might change the results, or it might not. I know that sorts and hash joins are very sensitive to this, but I don't know about bitmap scans.
....
What seems odd to me is that in different kinds of tests (with different frequency of column values):
i1 Rows Removed by Filter = 900156, 179792, 89762 (decreased a lot)
i1 buffers = 46983, 44373, 39928 (decreased, but not a lot)
To filter out 89762 tuples, you first have to look them up in the table, and since they are randomly scattered that means you hit nearly every page in the table at least once. In fact, I don't understand how the empirical number of buffer hits can be only 46983 in the first case, when it has to visit 1,000,000 rows (and reject 90% of them). I'm guessing that it is because your newly created index is sorted by ctid order within a given index value, and that the scan holds a pin on the table page between visits, and so doesn't count as a hit if it already holds the pin.
You could try to create an empty table, create the indexes, then populate the table with your random select query, to see if that changes the buffer hit count. (Note that this wouldn't change the cost estimates much even it does change the measured number of buffer hits, because of effective_cache_size. It knows you will be hitting ~47,000 pages ~25 times each, and so only charges you for the first time each one is hit.)
i1 best case time = 756.045, 127.814, 79.492 (decreased a lot, as well as probably average case too)
i1 cost estimates = 67358.15, 48809.34, 46544.80 (did not decrease a lot)
Right. Your best case times are when the data is completely in cache. But your cost estimates are dominated by *_page_cost, which are irrelevant when the data is entirely in cache. You are telling it to estimate the worse-case costs, and it is doing that pretty well (within this one plan).
i2 Rows Removed by Filter = 900835, 980350, 991389
i2 buffers = 46985, 46985, 46987
i2 best case time = 377.554, 346.481, 387.874
i2 cost estimates = 39166.34, 39247.83, 40167.34
It's odd that increase in actual execution time for "i1" was not reflected enough in cost estimates.
No, that's entirely expected given your settings. As long as you are charging disk-read costs for reading data from RAM, you will never get realistic cost estimates. Remember, you aren't trying to tune your production server here, you are trying to get a test case that can be dissected.
Perhaps you think that effective_cache_size is supposed to fix this for you. But it only accounts for blocks hit repeatedly within the same query. It has no idea that you are running a bunch of other queries on the same data back to back, and so it will likely find that data already in memory from one query to the next. That knowledge (currently) has to be baked into your *_page_cost setting. There is also no way to say that data for one table is more likely to be found in cache than data for another table is.
The cost even didn't go below "i2" costs.
That is partially because i2 benefits from a spuriously large correlation due to an artifact of how stats are computed. See another thread that has spun off of this one.
If you want to see the effect of this on your cost estimates, you can do something like:
update pg_statistic set stanumbers2='{0}' where starelid='aaa'::regclass and staattnum=2;
Cheers,
Jeff
On Wed, Dec 6, 2017 at 1:46 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > On Dec 3, 2017 15:31, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> >> Jeff Janes <jeff.janes@gmail.com> writes:
> >>> But I do see that ties within the logical order of the column values are
> >>> broken to agree with the physical order. That is wrong, right? Is there
> >>> any argument that this is desirable?
>
> >> Uh ... what do you propose doing instead? We'd have to do something with
> >> ties, and it's not so obvious this way is wrong.
>
> > Let them be tied.
...
> I thought some more about this. What we really want the correlation stat
> to do is help us estimate how randomly an index-ordered scan will access
> the heap. If the values we've sampled are all unequal then there's no
> particular issue. However, if we have some group of equal values, we
> do not really know what order an indexscan will visit them in. The
> existing correlation calculation is making the *most optimistic possible*
> assumption, that such a group will be visited exactly in heap order ---
> and that assumption isn't too defensible.
I'm interested in discussion regarding bitmap cost, since it would have helped
our case discussed here ~18 months ago:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com
...but remember: in Vitaliy's case (as opposed to mine), the index scan is
*faster* but being estimated at higher cost than bitmap (I have to keep
reminding myself). So the rest of this discussion is about the
overly-optimistic cost estimate of index scans, which moves in the opposite
direction for this reported problem. For the test cases I looked at, index
scans were used when RPC=1 and redundant conditions were avoided, so I'm not
sure if there's any remaining issue (but I haven't looked at the latest cases
Vitaliy sent).
> In any case, given that we do this calculation without regard
> to any specific index,
One solution is to compute stats (at least correlation) for all indices, not
just expr inds. I did that earlier this year while throwing around/out ideas.
https://www.postgresql.org/message-id/20170707234119.GN17566%40telsasoft.com
When is the correlation of a column which is not the leading column of a btree index or in a brin index ever used? If we did compute index-specific correlations, we could maybe just drop pure-column correlations.
> We do have an idea, from the data we have, whether the duplicates are close
> together in the heap or spread all over.
I think you just mean pg_stats.correlation for all values, not just duplicates
(with the understanding that duplicates might be a large fraction of the
tuples, and high weight in correlation).
Another issue I noted in an earlier thread is that as table sizes increase, the
existing correlation computation approaches 1 for correlated insertions, (like
"append-only" timestamps clustered around now()), due to ANALYZE sampling a
fraction of the table, and thereby representing only large-scale correlation.
That isn't due to sampling. That is due to the definition of linear correlation. Large scale is what it is about.
Generated data demonstrating this (I reused this query so it's more complicated
than it needs to be):
[pryzbyj@database ~]$ time for sz in 9999{,9{,9{,9{,9}}}} ; do psql postgres -tc "DROP TABLE IF EXISTS t; CREATE TABLE t(i float, j int); CREATE INDEX ON t(i);INSERT INTO t SELECT i/99999.0+pow(2,(-random())) FROM generate_series(1,$sz) i ORDER BY i; ANALYZE t; SELECT $sz, correlation, most_common_freqs[1] FROM pg_stats WHERE attname='i' AND tablename='t'"; done
9999 | 0.187146 |
99999 | 0.900629 |
999999 | 0.998772 |
9999999 | 0.999987 |
Because the amount of jitter introduced is constant WRT $sz, but the range of i/99999.0 increases with $sz, the correlation actually does increase; it is not a sampling effect.
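This can be checked analytically. The jitter term pow(2,-random()) has a fixed standard deviation (about 0.144, from the closed-form moments of 2^-U for uniform U), while the i/99999.0 ramp's standard deviation grows linearly with $sz, so the expected linear correlation is just signal sd over total sd. A back-of-the-envelope model (my function names; it ignores ANALYZE sampling noise entirely, which is the point):

```python
import math

def jitter_sd():
    # E[2**-U] and E[4**-U] over U ~ Uniform(0,1), in closed form.
    m1 = 0.5 / math.log(2.0)         # mean of 2**-U
    m2 = 0.75 / math.log(4.0)        # mean of (2**-U)**2
    return math.sqrt(m2 - m1 * m1)   # ~0.1438

def expected_correlation(sz):
    # sd of the linear ramp i/99999.0 over i = 1..sz (uniform over its range).
    ramp_sd = (sz / 99999.0) / math.sqrt(12.0)
    j = jitter_sd()
    return ramp_sd / math.sqrt(ramp_sd ** 2 + j ** 2)

for sz in (9999, 99999, 999999, 9999999):
    print(sz, round(expected_correlation(sz), 6))
```

The model reproduces the measured values (0.187, 0.901, 0.9988, 0.99999) to within a couple of percent, confirming that the march toward 1 is inherent in the definition of linear correlation, not a sampling artifact.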
Trying to keep it all in my own head: For sufficiently large number of pages,
bitmap scan should be preferred to idx scan due to reduced random-page-cost
outweighing its overhead in CPU cost.
But CPU cost is probably not why it is losing anyway.
Index scans get a double bonus from high correlation. It assumes that only a small fraction of the table will be visited. And then it assumes that the visits it does make will be largely sequential. I think that you are saying that for a large enough table, that last assumption is wrong, that the residual amount of non-correlation is enough to make the table reads more random than sequential. Maybe. Do you have a test case that demonstrates this? If so, how big do we need to go, and can you see the problem on SSD as well as HDD?
The thing is, the bitmap scan gets cheated out of one of these bonuses. It gets no credit for visiting only a small part of the table when the correlation is high, but does get credit for being mostly sequential in whatever visits it does make (even if the correlation is low, because of course the bitmap perfects the correlation).
I think it would be easier to give bitmap scans their due rather than try to knock down index scans. But of course a synthetic test case would go a long way to either one.
Probably by penalizing index scans, not
discounting bitmap scans. Conceivably a correlation adjustment can be
conditionalized or weighted based on index_pages_fetched() ...
x = ln (x/999999);
if (x>1) correlation/=x;
I think we should go with something with some statistical principle behind it if we want to do that. There is the notion of "restricted range" in correlations, that if there is a high correlation over the full range of data, but you zoom into only a small fraction of that range, the correlation you see over that restricted range will be much less than the full correlation is. I don't know that field well enough to give a formula off the top of my head, but I think it will be based on the fraction of the key space which is being scanned, and (1-RSQ), rather than an arbitrary number like 999999.
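The restricted-range effect is easy to demonstrate numerically (a hypothetical illustration of the statistical phenomenon, not a proposed costing formula): take data with a strong linear trend plus fixed-size noise, and compare the correlation over the full key range with the correlation over a 10% slice of it.

```python
import random

def pearson(xs, ys):
    # Plain Pearson linear correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(42)
n = 20000
# Keys with a strong trend; "positions" are keys plus fixed-size noise.
keys = [random.uniform(0, 1_000_000) for _ in range(n)]
positions = [k + random.gauss(0, 50_000) for k in keys]

full = pearson(keys, positions)

# Zoom in on 10% of the key space: the same noise now dominates the
# much narrower signal range.
pairs = [(k, p) for k, p in zip(keys, positions) if k < 100_000]
restricted = pearson([k for k, _ in pairs], [p for _, p in pairs])

print(round(full, 3))        # high (trend dominates)
print(round(restricted, 3))  # noticeably lower over the slice
```

This is why a correlation measured over the whole table can badly overstate how sequential the heap visits will be within the narrow key range one query actually scans.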
Cheers,
Jeff
On Tue, Dec 12, 2017 at 01:29:48AM -0800, Jeff Janes wrote:
> On Wed, Dec 6, 2017 at 1:46 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:
> > > In any case, given that we do this calculation without regard
> > > to any specific index,
> >
> > One solution is to compute stats (at least correlation) for all indices, not
> > just expr inds. I did that earlier this year while throwing around/out ideas.
> > https://www.postgresql.org/message-id/20170707234119.GN17566%40telsasoft.com
>
> When is the correlation of a column which is not the leading column of a
> btree index or in a brin index ever used? If we did compute index-specific
> correlations, we could maybe just drop pure-column correlations.

Yes, I think so - correlation is collected for every column, but only used for indices. I also have a comment to myself in that patch to force attstattarget=0 for non-expr indices, to avoid keeping MCV/histogram which duplicates that of their column.

> > Trying to keep it all in my own head: For sufficiently large number of pages,
> > bitmap scan should be preferred to idx scan due to reduced random-page-cost
> > outweighing its overhead in CPU cost.
>
> But CPU cost is probably not why it is losing anyway.
>
> Index scans get a double bonus from high correlation. It assumes that only
> a small fraction of the table will be visited. And then it assumes that
> the visits it does make will be largely sequential. I think that you are
> saying that for a large enough table, that last assumption is wrong, that
> the residual amount of non-correlation is enough to make the table reads
> more random than sequential. Maybe. Do you have a test case that
> demonstrates this? If so, how big do we need to go, and can you see the
> problem on SSD as well as HDD?
Right: the "residual"/fine-scale variations (those which are not adequately represented by the correlation metric) are/may be non-sequential, so don't get good readahead.

The original issue was with a 75GB table (an inheritance child) and an analytic query that previously took ~30min, at that point taking 4-5 hours due to random seeks (from duplicate values in a timestamp column with 1-second resolution). There would've been very little if any of the previous day's table cached: the child table being queried (by way of its parent) had size roughly the same as the server's RAM, and would've been loaded over the course of the preceding 6-30 hours, and not frequently accessed. It may be that there's a sharp change once the cache no longer effectively mitigates the random heap reads.

SSD: good question. Here's a rackspace VM with PG9.6.6, 2GB shared_buffers, 8GB RAM (~4GB of which is being used as OS page cache), and 32GB SSD (with random_page_cost=1). The server is in use by our application. I believe you could scale up the size of the table to see this behavior with any cache size. 0.0001 controls the "jitter", with smaller values being more jittery.
postgres=# CREATE TABLE t(i int,j int) TABLESPACE tmp; CREATE INDEX ON t(i); INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM generate_series(1,99999999) a; VACUUM ANALYZE t;
 public | t | table | pryzbyj | 3458 MB |
 relpages | 442478

For comparison purposes/baseline, here's a scan on a SEPARATE index freshly built AFTER insertions:

postgres=# explain(analyze,buffers) SELECT COUNT(j) FROM t WHERE i BETWEEN 0 AND 4000;

First invocation:
#1 -> Index Scan using t_i_idx1 on t (cost=0.57..1413352.60 rows=39933001 width=4) (actual time=25.660..52575.127 rows=39996029 loops=1)
      Buffers: shared hit=1578644 read=286489 written=1084

Subsequent invocations with (extra) effect from OS cache:
#2 -> Index Scan using t_i_idx1 on t (cost=0.57..1413352.60 rows=39933001 width=4) (actual time=61.054..37646.556 rows=39996029 loops=1)
      Buffers: shared hit=1578644 read=286489 written=2223
#3 -> Index Scan using t_i_idx1 on t (cost=0.57..1413352.60 rows=39933001 width=4) (actual time=9.344..31265.398 rows=39996029 loops=1)
      Buffers: shared hit=1578644 read=286489 written=1192

Dropping that index, and scanning a different range on the non-fresh index:

postgres=# explain(analyze,buffers) SELECT COUNT(j) FROM t WHERE i BETWEEN 4000 AND 8000;

#1 -> Index Scan using t_i_idx on t (cost=0.57..1546440.47 rows=40298277 width=4) (actual time=95.815..139152.147 rows=40009853 loops=1)
      Buffers: shared hit=1948069 read=316536 written=3411

Rerunning with cache effects:
#2 -> Index Scan using t_i_idx on t (cost=0.57..1546440.47 rows=40298277 width=4) (actual time=203.590..87547.287 rows=40009853 loops=1)
      Buffers: shared hit=1948069 read=316536 written=5712
#3 -> Index Scan using t_i_idx on t (cost=0.57..1546440.47 rows=40298277 width=4) (actual time=164.504..83768.890 rows=40009853 loops=1)
      Buffers: shared hit=1948069 read=316536 written=1979

Compare to seq scan:
-> Seq Scan on t (cost=0.00..1942478.00 rows=40298277 width=4) (actual time=1173.162..20980.069 rows=40009853 loops=1)
   Buffers: shared hit=47341 read=395137

Bitmap:
-> Bitmap Heap Scan on t (cost=975197.91..2022150.06 rows=40298277 width=4) (actual time=24396.270..39304.813 rows=40009853 loops=1)
   Buffers: shared read=316536 written=1431

The index scan reads 2.3e6 pages, compared to 4e5 pages (seq) and 3e5 pages (bitmap). And idx scans were 4-7x slower than seq scan.

Was the index scan actually that badly affected by the CPU cost of revisiting pages (rather than IO costs)? Or did the OS actually fail to cache the 3e5 pages "read"? That would be consistent with running almost 2x faster on the 3rd invocation.

The "hits" are largely from pages being revisited and recently accessed. Are the misses (reads) mostly from pages being revisited after already falling out of cache? Or mostly initial access? Or ??

If the index scan is really paying a high IO cost for rereads and not primarily a CPU cost, this seems to be something like the "correlated index scan" variant of the traditional failure to effectively cache a seq scan on a sufficiently large table using an MRU buffer - the cache is ineffective for adequately mitigating IO costs when the (re)reads have sufficient "spread".

postgres=# SELECT tablename, attname, correlation FROM pg_stats WHERE tablename='t';
tablename   | t
attname     | i
correlation | 1

I still want to say that's unreasonable due to (I think) the high fraction of nonrandom reads and associated absence of readahead.

> I think it would be easier to give bitmap scans their due rather than try
> to knock down index scans. But of course a synthetic test case would go a
> long way to either one.

As Tom said: index scans involving repeated keys are assuming best-case sequential reads for a given computed correlation. I'd be happy if they were costed to avoid that assumption, and instead used some "middle of the road" interpretation (probably based on correlation and MCV fraction?), but, in the alternate, we need to distinguish the cost_index cases, rather than adjusting bitmap.
This is what led me to play around with stats computation for all indices. Justin
On Fri, Dec 15, 2017 at 02:54:06PM -0600, Justin Pryzby wrote:
> SSD: good question.
>
> Here's a rackspace VM with PG9.6.6, 2GB shared_buffers, 8GB RAM (~4GB of which
> is being used as OS page cache), and 32GB SSD (with random_page_cost=1). The
> server is in use by our application.
>
> I believe you could scale up the size of the table to see this behavior with
> any cache size. 0.0001 controls the "jitter", with smaller values being more
> jittery..
>
> postgres=# CREATE TABLE t(i int,j int) TABLESPACE tmp; CREATE INDEX ON t(i); INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM generate_series(1,99999999) a; VACUUM ANALYZE t;
>  public | t | table | pryzbyj | 3458 MB |
>  relpages | 442478

I realized I've made a mistake here; the table is on SSD but not its index... So all this cost is apparently coming from the index and not the heap.

-> Bitmap Heap Scan on t (cost=855041.91..1901994.06 rows=40298277 width=4) (actual time=14202.624..27754.982 rows=40009853 loops=1)
   -> Bitmap Index Scan on t_i_idx1 (cost=0.00..844967.34 rows=40298277 width=0) (actual time=14145.877..14145.877 rows=40009853 loops=1)

Let me get back to you about that.

Justin
On Fri, Dec 15, 2017 at 02:54:06PM -0600, Justin Pryzby wrote:
> SSD: good question.
>
> Here's a rackspace VM with PG9.6.6, 2GB shared_buffers, 8GB RAM (~4GB of which
> is being used as OS page cache), and 32GB SSD (with random_page_cost=1). The
> server is in use by our application.
>
> I believe you could scale up the size of the table to see this behavior with
> any cache size. 0.0001 controls the "jitter", with smaller values being more
> jittery..

On Sat, Dec 16, 2017 at 01:18:38PM -0600, Justin Pryzby wrote:
> I realized I've made a mistake here; the table is on SSD but not its index...
> So all this cost is apparently coming from the index and not the heap.
>
> -> Bitmap Heap Scan on t (cost=855041.91..1901994.06 rows=40298277 width=4) (actual time=14202.624..27754.982 rows=40009853 loops=1)
>    -> Bitmap Index Scan on t_i_idx1 (cost=0.00..844967.34 rows=40298277 width=0) (actual time=14145.877..14145.877 rows=40009853 loops=1)

I'm rerunning with this:

postgres=# CREATE TABLE t(i int,j int) TABLESPACE tmp; CREATE INDEX ON t(i) TABLESPACE tmp; INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM generate_series(1,99999999) a; VACUUM ANALYZE t; CREATE INDEX ON t(i) TABLESPACE tmp;

That doesn't seem to invalidate my conclusions regarding the test data.
The non-fresh index:

#1 -> Index Scan using t_i_idx on t (cost=0.57..1103588.59 rows=39536704 width=4) (actual time=2.295..60094.704 rows=40009646 loops=1)

Reruns:

#2 -> Index Scan using t_i_idx on t (cost=0.57..1103588.59 rows=39536704 width=4) (actual time=1.671..54209.037 rows=40009646 loops=1)
#3 -> Index Scan using t_i_idx on t (cost=0.57..1103588.59 rows=39536704 width=4) (actual time=1.743..46436.538 rows=40009646 loops=1)

Scan of the fresh index:

 -> Index Scan using t_i_idx1 on t (cost=0.57..1074105.46 rows=39536704 width=4) (actual time=1.715..16119.720 rows=40009646 loops=1)

Bitmap scan on the non-fresh index:

 -> Bitmap Heap Scan on t (cost=543141.78..1578670.34 rows=39536704 width=4) (actual time=4397.767..9137.541 rows=40009646 loops=1)
      Buffers: shared hit=91235 read=225314
   -> Bitmap Index Scan on t_i_idx (cost=0.00..533257.61 rows=39536704 width=0) (actual time=4346.556..4346.556 rows=40009646 loops=1)
        Buffers: shared read=139118

Seq scan:

 -> Seq Scan on t (cost=0.00..1942478.00 rows=39536704 width=4) (actual time=6093.269..17880.164 rows=40009646 loops=1)

I also tried an index-only scan (note COUNT(i) vs COUNT(j), "eye" vs "jay"), which I think should behave like an index scan without the heap costs:

postgres=# SET max_parallel_workers_per_gather=0; SET enable_bitmapscan=off; SET enable_indexscan=on;
begin; DROP INDEX t_i_idx1;
explain(analyze,buffers) SELECT COUNT(i) FROM t WHERE i BETWEEN 4000 AND 8000;
rollback;

 -> Index Only Scan using t_i_idx on t (cost=0.57..928624.65 rows=39536704 width=4) (actual time=0.515..12646.676 rows=40009646 loops=1)
      Buffers: shared hit=276 read=139118

However, in this test, random reads on the INDEX still account for a large fraction of the query time. When the index is cached by the OS, this is much faster.
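The gap between the plain index scan's heap access and the bitmap scan's is consistent with how the planner costs an ordinary index scan. A hedged Python sketch of the interpolation cost_index() performs between the worst case (uncorrelated index, ~one random heap fetch per tuple) and the best case (perfectly correlated heap order), weighted by the squared index correlation; the simplifications here are mine, not PostgreSQL's exact code:

```python
import math

def index_scan_heap_io(heap_pages, selectivity, tuples_fetched, correlation,
                       seq_page_cost=1.0, random_page_cost=1.5):
    """Sketch of cost_index() heap I/O: interpolate between the
    uncorrelated (max) and perfectly correlated (min) cases."""
    # worst case: roughly one random page fetch per matching tuple
    # (the real code caps this with a Mackert-Lohman estimate)
    max_io = tuples_fetched * random_page_cost
    # best case: read just the needed fraction of the heap, first page
    # random, the rest effectively sequential
    min_pages = max(1, math.ceil(selectivity * heap_pages))
    min_io = random_page_cost + (min_pages - 1) * seq_page_cost
    csquared = correlation ** 2
    return max_io - csquared * (max_io - min_io)

# A freshly built index on nearly-sequential data has correlation close
# to 1.0; after churn the correlation drops and the cost climbs fast:
print(index_scan_heap_io(1000, 0.1, 5000, 1.0))  # near min_io
print(index_scan_heap_io(1000, 0.1, 5000, 0.0))  # max_io
```

With ~40M matching tuples, even a modest drop in correlation pushes the estimate (and the real runtime, as the reruns above show) heavily toward the random-fetch worst case.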
Compare:

#1 -> Bitmap Heap Scan on t (cost=543141.78..1578670.34 rows=39536704 width=4) (actual time=25498.978..41418.870 rows=40009646 loops=1)
        Buffers: shared read=316549 written=497
     -> Bitmap Index Scan on t_i_idx (cost=0.00..533257.61 rows=39536704 width=0) (actual time=25435.865..25435.865 rows=40009646 loops=1)
          Buffers: shared read=139118 written=2

#2 -> Bitmap Heap Scan on t (cost=543141.78..1578670.34 rows=39536704 width=4) (actual time=5863.003..17531.860 rows=40009646 loops=1)
        Buffers: shared read=316549 written=31
     -> Bitmap Index Scan on t_i_idx (cost=0.00..533257.61 rows=39536704 width=0) (actual time=5799.400..5799.400 rows=40009646 loops=1)
          Buffers: shared read=139118 written=31

Note that for the test data, the index is a large fraction of the table data (since the only non-indexed column has nullfrac=1):

 public | t        | table | pryzbyj |   | 3458 MB |
 public | t_i_idx  | index | pryzbyj | t | 2725 MB |
 public | t_i_idx1 | index | pryzbyj | t | 2142 MB |

(that could be 10% smaller with fillfactor=100)

I think the test case reasonably reproduces the original issue. Note that the 2nd invocation of the bitmap scan scanned the index in 5.8sec and the heap in 11sec, but the 2nd invocation of the index scan took 54sec, of which I gather ~6sec was from the index. So there's still 48sec spent accessing the heap randomly, rather than 11sec sequentially.

I'm also playing with the tables which were the source of the original problem, for which index reads in the bitmap scan do not appear to be a large fraction of the query time, probably because the indexes are 1-2% of the table size rather than 60-70%. I'll mail about that separately.

Justin
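The heap-access breakdown above can be checked with simple arithmetic; the timings are the ones quoted from the EXPLAIN ANALYZE runs, and the ~6sec attributed to the index portion of the plain index scan is the estimate from the text, not a separate measurement:

```python
# timings in seconds, from the EXPLAIN ANALYZE output above
bitmap_index_time = 5.8       # 2nd bitmap scan: Bitmap Index Scan node
bitmap_total_time = 17.5      # 2nd bitmap scan: total
index_scan_total = 54.2       # 2nd plain index scan: total
index_scan_index_part = 6.0   # estimated index portion (from the text)

bitmap_heap_time = bitmap_total_time - bitmap_index_time
index_scan_heap_time = index_scan_total - index_scan_index_part

print(bitmap_heap_time)      # ~11.7s of mostly sequential heap access
print(index_scan_heap_time)  # ~48.2s of random heap access
```

The same heap pages cost roughly 4x more wall-clock time when fetched in index order rather than in bitmap (physical) order, which is exactly the asymmetry the cost model is supposed to capture.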
Sorry for the delayed response; I had to switch to other tasks and didn't have time to run proper tests and write a meaningful reply.

Recently a similar issue happened with another of our databases, so I decided to send an update. The planner preferred a bitmap scan over an index scan, but the bitmap scan ran worse in practice. Here are the relevant pieces of a much bigger query plan:

 -> Bitmap Heap Scan on cmdb_program_daily_usage cmdb_program_daily_usage_6 (cost=6707.08..6879.35 rows=32 width=20) (actual time=39.994..40.019 rows=12 loops=336)
      Recheck Cond: ((used_from = cmdb_ci_computer_12.id) AND (usage_date >= '2018-02-02'::date) AND (usage_date <= '2018-02-12'::date))
      Filter: (((NOT thin_client) OR (thin_client IS NULL)) AND (program_instance IS NOT NULL) AND (minutes_in_use > 0))
      Rows Removed by Filter: 69
      Heap Blocks: exact=2995
      Buffers: shared hit=563448
      -> BitmapAnd (cost=6707.08..6707.08 rows=154 width=0) (actual time=39.978..39.978 rows=0 loops=336)
           Buffers: shared hit=560453
           -> Bitmap Index Scan on idx_fk_5317241949468942 (cost=0.00..133.87 rows=12641 width=0) (actual time=0.373..0.373 rows=4780 loops=336)
                Index Cond: (used_from = cmdb_ci_computer_12.id)
                Buffers: shared hit=5765
           -> Bitmap Index Scan on idx_263911642415136 (cost=0.00..6572.94 rows=504668 width=0) (actual time=40.873..40.873 rows=540327 loops=324)
                Index Cond: ((usage_date >= '2018-02-02'::date) AND (usage_date <= '2018-02-12'::date))
                Buffers: shared hit=554688

 -> Index Scan using idx_fk_5317241949468942 on cmdb_program_daily_usage cmdb_program_daily_usage_6 (cost=0.56..24322.97 rows=35 width=20) (actual time=1.211..2.196 rows=14 loops=338)
      Index Cond: (used_from = cmdb_ci_computer_12.id)
      Filter: (((NOT thin_client) OR (thin_client IS NULL)) AND (program_instance IS NOT NULL) AND (minutes_in_use > 0) AND (usage_date >= '2018-02-02'::date) AND (usage_date <= '2018-02-12'::date))
      Rows Removed by Filter: 4786
      Buffers: shared hit=289812

The difference in run time does not look very
huge, but when it's part of a loop, it can mean the difference between minutes and hours.

After running some tests, here are the conclusions we've made:

- When running with a cold cache, and data is being read from disk, the planner estimates look adequate. The bitmap scan has a lower cost, and indeed it performs better in that case.

- When running with a hot cache, and most of the data is already in RAM, the index scan starts to outperform the bitmap scan. Unfortunately the planner cannot account for the cache very well, and can't switch the plan. Even if the planner ever learned to account for the current contents of shared buffers, it still couldn't know much about the contents of the filesystem cache.

- Tests showed that the costs are dominated by random_page_cost, but there is still potential to change the total plan cost if the "cpu_*" costs were less distant from the "*_page_cost" settings.

- In our case the data is likely to be in cache, so we decided to change the cost settings: seq_page_cost 1.0 -> 0.5; random_page_cost 1.1 -> 0.6.

Regards,
Vitaliy
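The effect of the change in the last point can be sketched numerically. This is a toy comparison under an assumed simplified model (one page-cost term plus one per-tuple CPU term), not PostgreSQL's full costing, and the page/tuple counts are purely illustrative:

```python
def toy_cost(pages, page_cost, tuples, cpu_cost_per_tuple=0.01):
    """Toy plan cost: an I/O term plus a CPU term."""
    return pages * page_cost + tuples * cpu_cost_per_tuple

pages, tuples = 100_000, 1_000_000

# old settings: seq_page_cost=1.0, random_page_cost=1.1
old_seq  = toy_cost(pages, 1.0, tuples)
old_rand = toy_cost(pages, 1.1, tuples)
# new settings: seq_page_cost=0.5, random_page_cost=0.6
new_seq  = toy_cost(pages, 0.5, tuples)
new_rand = toy_cost(pages, 0.6, tuples)

# Halving the page costs makes the CPU terms a larger share of the
# total, so the same absolute random-vs-sequential gap separates the
# competing plans more strongly in relative terms.
print(old_rand / old_seq)
print(new_rand / new_seq)
```

In other words, the point of 1.0->0.5 / 1.1->0.6 is not only that pages are assumed cheap (cached), but that the fixed gap between random and sequential access now carries more relative weight against the cpu_* costs.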