Thread: Bitmap scan is undercosted?
Hi,

We recently had an issue in production, where a bitmap scan was chosen instead
of an index scan. Despite being 30x slower, the bitmap scan had about the same
cost as the index scan.

I've found some cases where similar issues with bitmap scans were reported before:

https://www.postgresql.org/message-id/flat/1456154321.976561.528310154.6A623C0E%40webmail.messagingengine.com
https://www.postgresql.org/message-id/flat/CA%2BwPC0MRMhF_8fD9dc8%2BQWZQzUvHahPRSv%3DxMtCmsVLSsy-p0w%40mail.gmail.com

I've made a synthetic test, which kind of reproduces the issue:

shared_buffers = 512MB
effective_cache_size = 512MB
work_mem = 100MB

set seq_page_cost = 1.0;
set random_page_cost = 1.5;
set cpu_tuple_cost = 0.01;
set cpu_index_tuple_cost = 0.005;
set cpu_operator_cost = 0.0025;

drop table if exists aaa;
create table aaa as select (id%100)::int num, (id%10=1)::bool flag from generate_series(1, 10000000) id;
create index i1 on aaa (num);
create index i2 on aaa (flag);
analyze aaa;

select relname, reltuples::bigint, relpages::bigint, (reltuples/relpages)::bigint tpp
from pg_class where relname in ('aaa','i1','i2') order by relname;
"aaa";9999985;44248;226
"i1";9999985;27422;365
"i2";9999985;27422;365

I've been running the same query while enabling and disabling different kinds of scans:

1) set enable_bitmapscan = on;  set enable_indexscan = off; set enable_seqscan = off;
2) set enable_bitmapscan = off; set enable_indexscan = on;  set enable_seqscan = off;
3) set enable_bitmapscan = off; set enable_indexscan = off; set enable_seqscan = on;

The query was:
explain (analyze,verbose,costs,buffers)
select count(*) from aaa where num = 1 and flag = true;

Here are the results for PostgreSQL 9.6 (for 9.3 and 10.1 the results are very similar):

1)
Aggregate (cost=24821.70..24821.71 rows=1 width=8) (actual time=184.591..184.591 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=47259
  -> Bitmap Heap Scan on public.aaa (cost=13038.21..24796.22 rows=10189 width=0) (actual time=122.492..178.006 rows=100000 loops=1)
       Output: num, flag
       Recheck Cond: (aaa.num = 1)
       Filter: aaa.flag
       Heap Blocks: exact=44248
       Buffers: shared hit=47259
       -> BitmapAnd (cost=13038.21..13038.21 rows=10189 width=0) (actual time=110.699..110.699 rows=0 loops=1)
            Buffers: shared hit=3011
            -> Bitmap Index Scan on i1 (cost=0.00..1158.94 rows=99667 width=0) (actual time=19.600..19.600 rows=100000 loops=1)
                 Index Cond: (aaa.num = 1)
                 Buffers: shared hit=276
            -> Bitmap Index Scan on i2 (cost=0.00..11873.92 rows=1022332 width=0) (actual time=81.676..81.676 rows=1000000 loops=1)
                 Index Cond: (aaa.flag = true)
                 Buffers: shared hit=2735
Planning time: 0.104 ms
Execution time: 184.988 ms

2)
Aggregate (cost=67939.09..67939.10 rows=1 width=8) (actual time=67.510..67.510 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=44524
  -> Index Scan using i1 on public.aaa (cost=0.44..67910.95 rows=11256 width=0) (actual time=0.020..61.180 rows=100000 loops=1)
       Output: num, flag
       Index Cond: (aaa.num = 1)
       Filter: aaa.flag
       Buffers: shared hit=44524
Planning time: 0.096 ms
Execution time: 67.543 ms

3)
Aggregate (cost=169276.49..169276.50 rows=1 width=8) (actual time=977.063..977.063 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=44248
  -> Seq Scan on public.aaa (cost=0.00..169248.35 rows=11256 width=0) (actual time=0.018..969.294 rows=100000 loops=1)
       Output: num, flag
       Filter: (aaa.flag AND (aaa.num = 1))
       Rows Removed by Filter: 9900000
       Buffers: shared hit=44248
Planning time: 0.099 ms
Execution time: 977.094 ms

The bitmap scan version runs more than twice as slow as the index scan
version, while being costed at less than half as much. I've tried increasing
cpu_tuple_cost and cpu_index_tuple_cost; this behavior remains even after a 6x
increase in the values, although the difference in costs becomes much smaller.
After increasing the settings more than 6x, PostgreSQL switches to a different
plan for bitmap scans, so it's hard to draw conclusions at that point.
Could such cases be fixed by tuning cost settings, or is this just how
PostgreSQL estimates bitmap scans, something that can't be fixed without
modifying the optimizer? Or am I missing something, and this is the expected
behavior?

Thoughts?

Regards,
Vitaliy
On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
> We recently had an issue in production, where a bitmap scan was chosen
> instead of an index scan. Despite being 30x slower, the bitmap scan had
> about the same cost as the index scan.

Me too, see also:
https://www.postgresql.org/message-id/flat/CAH2-WzkRTggiy_LKQUu-oViyp6y_Hhz-a1yWacPy4tcYWV1HoA%40mail.gmail.com#CAH2-WzkRTggiy_LKQUu-oViyp6y_Hhz-a1yWacPy4tcYWV1HoA@mail.gmail.com

> drop table if exists aaa;
> create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
> generate_series(1, 10000000) id;
> create index i1 on aaa (num);
> create index i2 on aaa (flag);
> analyze aaa;
>
> select relname, reltuples::bigint, relpages::bigint,
> (reltuples/relpages)::bigint tpp from pg_class where relname
> in('aaa','i1','i2') order by relname;
> "aaa";9999985;44248;226
> "i1";9999985;27422;365
> "i2";9999985;27422;365
>
> The query was:
> explain (analyze,verbose,costs,buffers)
> select count(*) from aaa where num = 1 and flag = true;

Note that id%100==1 implies flag='t', so the planner anticipates retrieving
fewer rows than it will ultimately read, probably by 2x. It makes sense that
that causes the index scan to be more expensive than expected, but that's only
somewhat important, since there's no joins involved.

The reason why it's more than a bit slower is due to the "density" [0] of the
heap pages read. num=1 is more selective than flag=true, so it scans i1,
reading 1% of the whole table. But it's not reading the first 1% or some other
contiguous 1% of the table; it reads tuples evenly distributed across the
entire table (226*0.01 = ~2 rows of each page). Since the index was created
after the INSERT, the repeated keys (logical value: id%100) are read in
physical order on the heap, so this is basically doing a seq scan, but with
the additional overhead of reading the index, and maybe doing an fseek()
before each/some read()s, too. You could confirm that by connecting strace to
the backend before starting the query.

Since you did that using % and with indices added after the INSERT, you can't
improve it by reindexing (as I was able to for my case). That's an elegant
test case, so thanks.

I think shared_buffers=512MB is just small enough for this test to be painful
for 1e7 rows. I see the table+index is 559MB.

I don't know if that's really similar to your production use case, but I would
recommend trying BRIN indices, which always require a bitmap scan. Note that
some things (like max()) that can use a btree index cannot use BRIN. PG10.1
has WITH autosummarize, which was important for our use, since we rarely do
UPDATEs or DELETEs so tables are rarely vacuumed (only analyzed).

Justin

[0] I'm borrowing Jeff's language from here:
https://www.postgresql.org/message-id/CAMkU%3D1xwGn%2BO0jhKsvrUrbW9MQp1YX0iB4Y-6h1mEz0ffBxK-Q%40mail.gmail.com
"density" wasn't our problem, but it's a perfect description of this issue.
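As a quick sanity check of the density argument above, one can push the
~2-rows-per-page arithmetic one step further (numbers taken from the pg_class
output earlier in the thread; the uniform-spread formula here is a textbook
approximation, not anything PostgreSQL computes verbatim):

```python
total_pages = 44248   # relpages of aaa
rows_per_page = 226   # reltuples / relpages
selectivity = 0.01    # num = 1 matches 1% of the rows

# If matches are spread uniformly across the heap, the probability that a
# given page contains at least one matching tuple is:
p_page_touched = 1 - (1 - selectivity) ** rows_per_page

pages_touched = total_pages * p_page_touched
# Roughly 90% of the heap pages hold at least one match, so fetching just
# 1% of the rows still visits almost the whole table -- which is why the
# "1% index scan" behaves like a sequential scan with extra overhead.
```

(In the actual EXPLAIN output it is even worse: Heap Blocks: exact=44248,
i.e. every page, since ~2.26 matches per page leave essentially no page
empty.)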
On 01/12/2017 20:34, Justin Pryzby wrote:
> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
>> We recently had an issue in production, where a bitmap scan was chosen
>> instead of an index scan. Despite being 30x slower, the bitmap scan had
>> about the same cost as the index scan.
> Me too, see also:
> https://www.postgresql.org/message-id/flat/CAH2-WzkRTggiy_LKQUu-oViyp6y_Hhz-a1yWacPy4tcYWV1HoA%40mail.gmail.com#CAH2-WzkRTggiy_LKQUu-oViyp6y_Hhz-a1yWacPy4tcYWV1HoA@mail.gmail.com
>
>> drop table if exists aaa;
>> create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
>> generate_series(1, 10000000) id;
>> create index i1 on aaa (num);
>> create index i2 on aaa (flag);
>> analyze aaa;
>>
>> select relname, reltuples::bigint, relpages::bigint,
>> (reltuples/relpages)::bigint tpp from pg_class where relname
>> in('aaa','i1','i2') order by relname;
>> "aaa";9999985;44248;226
>> "i1";9999985;27422;365
>> "i2";9999985;27422;365
>>
>> The query was:
>> explain (analyze,verbose,costs,buffers)
>> select count(*) from aaa where num = 1 and flag = true;
> Note that id%100==1 implies flag='t', so the planner anticipates retrieving
> fewer rows than it will ultimately read, probably by 2x. It makes sense that
> causes the index scan to be more expensive than expected, but that's only
> somewhat important, since there's no joins involved.

I don't think the planner is smart enough to account for correlation between
values in different columns. When different values are used in the filter
(num=2, num=39, num=74), the query actually runs faster, while still being
about twice as slow as with an index scan. But the cost does not change much.
It jumps up and down for different values, but it's still close to the
initial value.
Aggregate (cost=24239.02..24239.03 rows=1 width=8) (actual time=105.239..105.239 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=3011
  -> Bitmap Heap Scan on public.aaa (cost=12812.05..24214.48 rows=9816 width=0) (actual time=105.236..105.236 rows=0 loops=1)
       Output: num, flag
       Recheck Cond: (aaa.num = 39)
       Filter: aaa.flag
       Buffers: shared hit=3011
       -> BitmapAnd (cost=12812.05..12812.05 rows=9816 width=0) (actual time=105.157..105.157 rows=0 loops=1)
            Buffers: shared hit=3011
            -> Bitmap Index Scan on i1 (cost=0.00..1134.94 rows=97667 width=0) (actual time=15.725..15.725 rows=100000 loops=1)
                 Index Cond: (aaa.num = 39)
                 Buffers: shared hit=276
            -> Bitmap Index Scan on i2 (cost=0.00..11671.96 rows=1005003 width=0) (actual time=77.920..77.920 rows=1000000 loops=1)
                 Index Cond: (aaa.flag = true)
                 Buffers: shared hit=2735
Planning time: 0.104 ms
Execution time: 105.553 ms

Aggregate (cost=65785.99..65786.00 rows=1 width=8) (actual time=48.587..48.587 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=44524
  -> Index Scan using i1 on public.aaa (cost=0.44..65761.45 rows=9816 width=0) (actual time=48.583..48.583 rows=0 loops=1)
       Output: num, flag
       Index Cond: (aaa.num = 39)
       Filter: aaa.flag
       Rows Removed by Filter: 100000
       Buffers: shared hit=44524
Planning time: 0.097 ms
Execution time: 48.620 ms

> The reason why it's more than a bit slower is due to the "density" [0] of the
> heap pages read. num=1 is more selective than flag=true, so it scans i1,
> reading 1% of the whole table. But it's not reading the first 1% or
> some other 1% of the table, it reads tuples evenly distributed across the
> entire table (226*0.01 = ~2 rows of each page). Since the index was created
> after the INSERT, the repeated keys (logical value: id%100) are read in
> physical order on the heap, so this is basically doing a seq scan, but with the
> additional overhead of reading the index, and maybe doing an fseek() before
> each/some read()s, too. You could confirm that by connecting strace to the
> backend before starting the query.
>
> Since you did that using % and with indices added after the INSERT, you can't
> improve it by reindexing (as I was able to for my case). That's an elegant
> test case, so thanks.
>
> I think shared_buffers=512MB is just small enough for this test to be painful
> for 1e7 rows. I see the table+index is 559MB.

 table | ~count  |  size  |  toast  |  idx   | size + toast + idx
-------+---------+--------+---------+--------+--------------------
 aaa   | 9999994 | 346 MB | 0 bytes | 428 MB | 774 MB

But the plan says all buffers are "shared hit", and none "read", so that's
probably not an issue.

> I don't know if that's really similar to your production use case, but I would
> recommend trying BRIN indices, which always require a bitmap scan. Note that
> some things (like max()) that can use an btree index cannot use brin. PG10.1
> has WITH autosummarize, which was important for our use, since we rarely do
> UPDATEs or DELETEs so tables are rarely vacuumed (only analyzed).

Yes, BRIN indexes should be beneficial for our case, because there is a date
column which grows over time (not strictly regularly, but still).
Unfortunately, we're still migrating our databases from 9.3 to 9.6. Anyway,
thanks for the advice.

> Justin
>
> [0] I'm borrowing Jeff's language from here:
> https://www.postgresql.org/message-id/CAMkU%3D1xwGn%2BO0jhKsvrUrbW9MQp1YX0iB4Y-6h1mEz0ffBxK-Q%40mail.gmail.com
> "density" wasn't our problem, but it's a perfect description of this issue.
I tried to reproduce this issue and couldn't, under PG95 and 10.1:

On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
> > We recently had an issue in production, where a bitmap scan was chosen
> > instead of an index scan. Despite being 30x slower, the bitmap scan had
> > about the same cost as the index scan.
>
> > drop table if exists aaa;
> > create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
> > generate_series(1, 10000000) id;
> > create index i1 on aaa (num);
> > create index i2 on aaa (flag);
> > analyze aaa;

What is:
effective_io_concurrency
max_parallel_workers_per_gather (I gather you don't have this)

Note:
postgres=# SELECT correlation FROM pg_stats WHERE tablename='aaa' AND attname='num';
correlation | 0.00710112

..so this is different from the issue corrected by the patch I created while
testing.

> Note that id%100==1 implies flag='t', so the planner anticipates retrieving
> fewer rows than it will ultimately read, probably by 2x. It makes sense that
> causes the index scan to be more expensive than expected, but that's only
> somewhat important, since there's no joins involved.
I changed the query from COUNT(*) to * for easier-to-read explain:

CREATE TABLE aaa AS SELECT (id%100)::int num, (id%10=1)::bool flag FROM generate_series(1, 10000000) id;
CREATE INDEX i1 ON aaa(num);
CREATE INDEX i2 ON aaa (flag);
ANALYZE VERBOSE aaa;
EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;

Bitmap Heap Scan on public.aaa (cost=20652.98..45751.75 rows=10754 width=5) (actual time=85.314..185.107 rows=100000 loops=1)
  -> BitmapAnd (cost=20652.98..20652.98 rows=10754 width=0) (actual time=163.220..163.220 rows=0 loops=1)
       -> Bitmap Index Scan on i1 (cost=0.00..1965.93 rows=106333 width=0) (actual time=26.943..26.943 rows=100000 loops=1)
       -> Bitmap Index Scan on i2 (cost=0.00..18681.42 rows=1011332 width=0) (actual time=133.804..133.804 rows=1000000 loops=1)

..which is what's wanted with no planner hints (PG10.1 here).

Same on PG95:

postgres=# EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;

Bitmap Heap Scan on public.aaa (cost=19755.64..43640.32 rows=9979 width=5) (actual time=230.017..336.583 rows=100000 loops=1)
  -> BitmapAnd (cost=19755.64..19755.64 rows=9979 width=0) (actual time=205.242..205.242 rows=0 loops=1)
       -> Bitmap Index Scan on i1 (cost=0.00..1911.44 rows=103334 width=0) (actual time=24.911..24.911 rows=100000 loops=1)
       -> Bitmap Index Scan on i2 (cost=0.00..17838.96 rows=965670 width=0) (actual time=154.237..154.237 rows=1000000 loops=1)

The rowcount is off, but not a critical issue without a join.

Justin
On 02/12/2017 01:11, Justin Pryzby wrote:
> I tried to reproduce this issue and couldn't, under PG95 and 10.1:
>
> On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
>> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
>>> We recently had an issue in production, where a bitmap scan was chosen
>>> instead of an index scan. Despite being 30x slower, the bitmap scan had
>>> about the same cost as the index scan.
>>> drop table if exists aaa;
>>> create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
>>> generate_series(1, 10000000) id;
>>> create index i1 on aaa (num);
>>> create index i2 on aaa (flag);
>>> analyze aaa;
> What is:
> effective_io_concurrency
> max_parallel_workers_per_gather (I gather you don't have this)

effective_io_concurrency = 0
max_parallel_workers_per_gather = 0

Did you notice random_page_cost = 1.5 ?

For this test I'm using SSD and Windows (if that matters). On production we
also use SSD, hence the lower random_page_cost. But with the default
random_page_cost=4.0, the difference in cost between the index scan plan and
the bitmap scan plan is even bigger.

> Note:
> postgres=# SELECT correlation FROM pg_stats WHERE tablename='aaa' AND attname='num';
> correlation | 0.00710112
>
> ..so this is different from the issue corrected by the patch I created while
> testing.
>
>> Note that id%100==1 implies flag='t', so the planner anticipates retrieving
>> fewer rows than it will ultimately read, probably by 2x. It makes sense that
>> causes the index scan to be more expensive than expected, but that's only
>> somewhat important, since there's no joins involved.
> I changed the query from COUNT(*) TO * for easier to read explain:
>
> CREATE TABLE aaa AS SELECT (id%100)::int num, (id%10=1)::bool flag FROM generate_series(1, 10000000) id;
> CREATE INDEX i1 ON aaa(num);
> CREATE INDEX i2 ON aaa (flag);
> ANALYZE VERBOSE aaa;
> EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;
> Bitmap Heap Scan on public.aaa (cost=20652.98..45751.75 rows=10754 width=5) (actual time=85.314..185.107 rows=100000 loops=1)
>   -> BitmapAnd (cost=20652.98..20652.98 rows=10754 width=0) (actual time=163.220..163.220 rows=0 loops=1)
>        -> Bitmap Index Scan on i1 (cost=0.00..1965.93 rows=106333 width=0) (actual time=26.943..26.943 rows=100000 loops=1)
>        -> Bitmap Index Scan on i2 (cost=0.00..18681.42 rows=1011332 width=0) (actual time=133.804..133.804 rows=1000000 loops=1)
>
> ..which is what's wanted with no planner hints (PG10.1 here).

Yes, that's what you get without planner hints, but it's strange to get this
plan when there is another one which runs 2-3 times faster, yet is estimated
to cost more than twice as much as the one with bitmap scans:

# set enable_bitmapscan = off; set enable_indexscan = on; set enable_seqscan = off;
# explain analyze select * from aaa where num = 1 and flag = true;
Index Scan using i1 on aaa (cost=0.44..66369.81 rows=10428 width=5) (actual time=0.020..57.765 rows=100000 loops=1)

vs.
# set enable_bitmapscan = on; set enable_indexscan = off; set enable_seqscan = off;
# explain analyze select * from aaa where num = 1 and flag = true;
Bitmap Heap Scan on aaa (cost=13099.33..25081.40 rows=10428 width=5) (actual time=122.137..182.811 rows=100000 loops=1)
  -> BitmapAnd (cost=13099.33..13099.33 rows=10428 width=0) (actual time=110.168..110.168 rows=0 loops=1)
       -> Bitmap Index Scan on i1 (cost=0.00..1181.44 rows=101667 width=0) (actual time=20.845..20.845 rows=100000 loops=1)
       -> Bitmap Index Scan on i2 (cost=0.00..11912.43 rows=1025666 width=0) (actual time=80.323..80.323 rows=1000000 loops=1)

> Same on PG95:
> postgres=# EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;
> Bitmap Heap Scan on public.aaa (cost=19755.64..43640.32 rows=9979 width=5) (actual time=230.017..336.583 rows=100000 loops=1)
>   -> BitmapAnd (cost=19755.64..19755.64 rows=9979 width=0) (actual time=205.242..205.242 rows=0 loops=1)
>        -> Bitmap Index Scan on i1 (cost=0.00..1911.44 rows=103334 width=0) (actual time=24.911..24.911 rows=100000 loops=1)
>        -> Bitmap Index Scan on i2 (cost=0.00..17838.96 rows=965670 width=0) (actual time=154.237..154.237 rows=1000000 loops=1)
>
> The rowcount is off, but not a critical issue without a join.
>
> Justin
On Fri, Dec 01, 2017 at 05:11:04PM -0600, Justin Pryzby wrote:
> I tried to reproduce this issue and couldn't, under PG95 and 10.1:

I'm embarrassed to say that I misread your message, despite your amply clear
subject. You're getting a bitmap scan but you'd prefer to get an index scan.
I anticipated the opposite problem (which is what I've had issues with myself).

> On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
> > On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
> > > We recently had an issue in production, where a bitmap scan was chosen
> > > instead of an index scan. Despite being 30x slower, the bitmap scan had
> > > about the same cost as the index scan.
>
> Note:
> postgres=# SELECT correlation FROM pg_stats WHERE tablename='aaa' AND attname='num';
> correlation | 0.00710112
>
> ..so this is different from the issue corrected by the patch I created while
> testing.

Actually, the fact that the table is "not correlated" on the "num" column is
maybe the primary reason why PG avoids using an index scan. It (more or less
correctly) deduces that it's going to have to "read" a large fraction of the
pages (even if only to process a small fraction of the rows), which is costly,
except it's all cached.. In your case, that overly penalizes the index scan.

This is cost_index() and cost_bitmap_heap_scan() in costsize.c. Since the
index is uncorrelated, it's returning something close to max_IO_cost. It
looks like effective_cache_size only affects index_pages_fetched().

I'm going to try to dig some more into it. Maybe there's evidence to
re-evaluate one of these:

cost_index()
| run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

or

cost_bitmap_heap_scan()
| cost_per_page = spc_random_page_cost -
|                 (spc_random_page_cost - spc_seq_page_cost)
|                 * sqrt(pages_fetched / T);

Justin
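The effect of the two formulas quoted above can be sketched numerically. The
helper names below are invented for illustration; the real code in costsize.c
has many more terms, and the min/max IO costs here are made-up inputs:

```python
import math

def index_io_cost(min_io_cost, max_io_cost, correlation):
    # cost_index(): interpolate between the best case (perfectly correlated
    # index, min_IO_cost) and the worst case (uncorrelated, max_IO_cost)
    # using the squared index correlation.
    csquared = correlation * correlation
    return max_io_cost + csquared * (min_io_cost - max_io_cost)

def bitmap_cost_per_page(random_page_cost, seq_page_cost, pages_fetched, T):
    # cost_bitmap_heap_scan(): per-page charge slides from random_page_cost
    # toward seq_page_cost as the fetched fraction of the table grows.
    return random_page_cost - (random_page_cost - seq_page_cost) \
        * math.sqrt(pages_fetched / T)

# With correlation ~0.0071 (as reported for aaa.num), csquared ~5e-5, so the
# index scan is charged essentially the full worst-case IO cost:
io = index_io_cost(min_io_cost=1000.0, max_io_cost=60000.0, correlation=0.0071)

# Meanwhile the bitmap heap scan, fetching every page (44248 of 44248), is
# charged only seq_page_cost per page:
per_page = bitmap_cost_per_page(1.5, 1.0, pages_fetched=44248, T=44248)
```

So when everything is actually cached, the index scan is penalized with
near-worst-case random IO that never happens, while the bitmap scan's heap
access is costed like a cheap sequential pass.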
On Fri, Dec 1, 2017 at 3:54 PM, Vitaliy Garnashevich <
vgarnashevich@gmail.com> wrote:
> On 02/12/2017 01:11, Justin Pryzby wrote:
>
>> I tried to reproduce this issue and couldn't, under PG95 and 10.1:
>>
>> On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
>>
>>> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
>>>
>>>> We recently had an issue in production, where a bitmap scan was chosen
>>>> instead of an index scan. Despite being 30x slower, the bitmap scan had
>>>> about the same cost as the index scan.
>>>> drop table if exists aaa;
>>>> create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
>>>> generate_series(1, 10000000) id;
>>>> create index i1 on aaa (num);
>>>> create index i2 on aaa (flag);
>>>> analyze aaa;
>>>>
>>> What is:
>> effective_io_concurrency
>> max_parallel_workers_per_gather (I gather you don't have this)
>>
> effective_io_concurrency = 0
> max_parallel_workers_per_gather = 0
>
> Did you notice random_page_cost = 1.5 ?
>
For the aaa.num = 39 case, the faster index scan actually does hit 15 times
more buffers than the bitmap scan does. While 1.5 is a lot lower than 4.0,
it is still much higher than the true cost of reading a page from the
buffer cache. This is why the index scan is getting punished. You could
lower random_page_cost and seq_page_cost to 0, to remove those
considerations. (I'm not saying you should do this on your production
system, but rather you should do it as a way to investigate the issue. But
it might make sense on production as well)
> For this test I'm using SSD and Windows (if that matters). On production
> we also use SSD, hence lower random_page_cost. But with the default
> random_page_cost=4.0, the difference in cost between the index scan plan
> and the bitmap scan plan is even bigger.
Since it is all shared buffers hits, it doesn't matter if you have SSD for
this particular test case.
Cheers,
Jeff
On Sat, Dec 02, 2017 at 01:54:09AM +0200, Vitaliy Garnashevich wrote:
> On 02/12/2017 01:11, Justin Pryzby wrote:
> > ..which is what's wanted with no planner hints (PG10.1 here).
> Yes, that's what you get without planner hints, but it's strange to get this
> plan, when there is another one, which runs 2-3 times faster, but happens to
> be estimated to be twice more costly than the one with bitmap scans:
>
> # set enable_bitmapscan = off; set enable_indexscan = on; set enable_seqscan = off;
> # explain analyze select * from aaa where num = 1 and flag = true;
> Index Scan using i1 on aaa (cost=0.44..66369.81 rows=10428 width=5) (actual time=0.020..57.765 rows=100000 loops=1)
>
> vs.
>
> # set enable_bitmapscan = on; set enable_indexscan = off; set enable_seqscan = off;
> # explain analyze select * from aaa where num = 1 and flag = true;
> Bitmap Heap Scan on aaa (cost=13099.33..25081.40 rows=10428 width=5) (actual time=122.137..182.811 rows=100000 loops=1)

I was able to get an index plan with:

SET random_page_cost=1; SET cpu_index_tuple_cost=.04; -- default: 0.005; see selfuncs.c

postgres=# EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;
Index Scan using i1 on public.aaa (cost=0.43..50120.71 rows=10754 width=5) (actual time=0.040..149.580 rows=100000 loops=1)

Or with:

SET random_page_cost=1; SET cpu_operator_cost=0.03; -- default: 0.0025; see cost_bitmap_tree_node()

EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;
Index Scan using i1 on public.aaa (cost=5.22..49328.00 rows=10754 width=5) (actual time=0.051..109.082 rows=100000 loops=1)

Or a combination trying to minimize the cost of the index scan:

postgres=# SET random_page_cost=1; SET cpu_index_tuple_cost=.0017; SET cpu_operator_cost=0.03;
EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag=true;
Index Scan using i1 on public.aaa (cost=5.22..48977.10 rows=10754 width=5) (actual time=0.032..86.883 rows=100000 loops=1)

Not sure if that's reasonable, but maybe it helps to understand.

Justin
On 02/12/2017 07:51, Jeff Janes wrote:
> On Fri, Dec 1, 2017 at 3:54 PM, Vitaliy Garnashevich
> > wrote:
>
> On 02/12/2017 01:11, Justin Pryzby wrote:
>
> I tried to reproduce this issue and couldn't, under PG95 and 10.1:
>
> On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
>
> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy
> Garnashevich wrote:
>
> We recently had an issue in production, where a bitmap
> scan was chosen
> instead of an index scan. Despite being 30x slower,
> the bitmap scan had
> about the same cost as the index scan.
> drop table if exists aaa;
> create table aaa as select (id%100)::int num,
> (id%10=1)::bool flag from
> generate_series(1, 10000000) id;
> create index i1 on aaa (num);
> create index i2 on aaa (flag);
> analyze aaa;
>
> What is:
> effective_io_concurrency
> max_parallel_workers_per_gather (I gather you don't have this)
>
> effective_io_concurrency = 0
> max_parallel_workers_per_gather = 0
>
> Did you notice random_page_cost = 1.5 ?
>
>
> For the aaa.num = 39 case, the faster index scan actually does hit 15
> times more buffers than the bitmap scan does. While 1.5 is lot lower
> than 4.0, it is still much higher than the true cost of reading a page
> from the buffer cache. This why the index scan is getting punished.
> You could lower random_page_cost and seq_page_cost to 0, to remove
> those considerations. (I'm not saying you should do this on your
> production system, but rather you should do it as a way to investigate
> the issue. But it might make sense on production as well)
seq_page_cost = 1.0
random_page_cost = 1.0
explain analyze select * from aaa where num = 2 and flag = true;
Bitmap Heap Scan on aaa (cost=11536.74..20856.96 rows=10257 width=5)
(actual time=108.338..108.338 rows=0 loops=1)
-> BitmapAnd (cost=11536.74..11536.74 rows=10257 width=0) (actual
time=108.226..108.226 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..1025.43 rows=100000
width=0) (actual time=18.563..18.563 rows=100000 loops=1)
-> Bitmap Index Scan on i2 (cost=0.00..10505.93 rows=1025666
width=0) (actual time=78.493..78.493 rows=1000000 loops=1)
Index Scan using i1 on aaa (cost=0.44..44663.58 rows=10257 width=5)
(actual time=51.264..51.264 rows=0 loops=1)
Here I've used the filter num = 2, which produces rows=0 at BitmapAnd,
and thus avoids a lot of work at "Bitmap Heap Scan" node, while still
leaving about the same proportion between bitmap and index - the bitmap is
twice as slow but half the cost. It does not matter much which value is
used for the filter, as long as it's other than num = 1.
seq_page_cost = 0.0
random_page_cost = 0.0
explain analyze select * from aaa where num = 2 and flag = true;
Bitmap Heap Scan on aaa (cost=753.00..2003.00 rows=10257 width=5)
(actual time=82.212..82.212 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..750.43 rows=100000 width=0)
(actual time=17.401..17.401 rows=100000 loops=1)
Index Scan using i1 on aaa (cost=0.44..1750.43 rows=10257 width=5)
(actual time=49.766..49.766 rows=0 loops=1)
The bitmap plan was reduced to use only one bitmap scan, and finally it
costs more than the index plan. But I doubt that the settings
seq_page_cost = random_page_cost = 0.0 should actually be used. Probably
it should instead be something like 1.0/1.0 or 1.0/1.1, with the other
costs increased, to give them more weight.
# x4 tuple/operator costs - bitmap scan still a bit cheaper
set seq_page_cost = 1.0;
set random_page_cost = 1.0;
set cpu_tuple_cost = 0.04;
set cpu_index_tuple_cost = 0.02;
set cpu_operator_cost = 0.01;
Bitmap Heap Scan on aaa (cost=36882.97..46587.82 rows=10257 width=5)
(actual time=106.045..106.045 rows=0 loops=1)
-> BitmapAnd (cost=36882.97..36882.97 rows=10257 width=0) (actual
time=105.966..105.966 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..3276.74 rows=100000
width=0) (actual time=15.977..15.977 rows=100000 loops=1)
-> Bitmap Index Scan on i2 (cost=0.00..33584.72 rows=1025666
width=0) (actual time=79.208..79.208 rows=1000000 loops=1)
Index Scan using i1 on aaa (cost=1.74..49914.89 rows=10257 width=5)
(actual time=50.144..50.144 rows=0 loops=1)
# x5 tuple/operator costs - switched to single bitmap index scan, but
now it costs more than the index scan
set seq_page_cost = 1.0;
set random_page_cost = 1.0;
set cpu_tuple_cost = 0.05;
set cpu_index_tuple_cost = 0.025;
set cpu_operator_cost = 0.0125;
Bitmap Heap Scan on aaa (cost=4040.00..54538.00 rows=10257 width=5)
(actual time=82.338..82.338 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..4027.18 rows=100000 width=0)
(actual time=19.541..19.541 rows=100000 loops=1)
Index Scan using i1 on aaa (cost=2.17..51665.32 rows=10257 width=5)
(actual time=49.545..49.545 rows=0 loops=1)
I've also tried seq_page_cost = 1.0, random_page_cost = 1.1, but that
would require more than a 10x increase in tuple/operator costs to make
the bitmap plan more costly than the index plan.
>
>
> For this test I'm using SSD and Windows (if that matters). On
> production we also use SSD, hence lower random_page_cost. But with
> the default random_page_cost=4.0, the difference in cost between
> the index scan plan and the bitmap scan plan is even bigger.
>
>
> Since it is all shared buffers hits, it doesn't matter if you have SSD
> for this particular test case.
Agreed. I was just trying to justify using a random_page_cost lower than
the usual ~2.0.
> Cheers,
>
> Jeff
On Fri, Dec 1, 2017 at 11:08 PM, Vitaliy Garnashevich <
vgarnashevich@gmail.com> wrote:
>
>
> seq_page_cost = 0.0
> random_page_cost = 0.0
> explain analyze select * from aaa where num = 2 and flag = true;
>
> Bitmap Heap Scan on aaa (cost=753.00..2003.00 rows=10257 width=5) (actual
> time=82.212..82.212 rows=0 loops=1)
> -> Bitmap Index Scan on i1 (cost=0.00..750.43 rows=100000 width=0)
> (actual time=17.401..17.401 rows=100000 loops=1)
>
> Index Scan using i1 on aaa (cost=0.44..1750.43 rows=10257 width=5)
> (actual time=49.766..49.766 rows=0 loops=1)
>
> The bitmap plan was reduced to use only one bitmap scan, and finally it
> costs more than the index plan.
>
Right, so there is a cpu costing problem (which could only be fixed by
hacking postgresql and recompiling it), but it is a much smaller problem
than the IO cost being inaccurate due to the high hit rate. Fixing the
CPU costing problem is unlikely to make a difference to your real query.
If you set the page costs to zero, what happens to your real query?
> But I doubt that the settings seq_page_cost = random_page_cost = 0.0
> should actually be used.
>
Why not? If your production server really has everything in memory during
normal operation, that is the correct course of action. If you ever
restart the server, then you could have some unpleasant time getting it
back up to speed again, but pg_prewarm could help with that.
> Probably it should be instead something like 1.0/1.0 or 1.0/1.1, but other
> costs increased, to have more weight.
>
This doesn't make any sense to me. Halving the page costs is
mathematically the same as doubling all the other constants. But the first
way of doing things says what you are doing, and the second way is an
obfuscation of what you are doing.
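Jeff's scaling point can be checked with a toy cost model. This is only a sketch: the two-term formula and the page/tuple counts below are illustrative numbers, not the planner's actual costing code.

```python
# Toy illustration (not the real planner code): treat a plan's total cost
# as a weighted sum of page fetches and per-tuple CPU work.  Multiplying
# every cost constant by the same factor scales all plan costs equally,
# so the planner's *choice* between plans cannot change.

def plan_cost(pages, tuples, page_cost, cpu_tuple_cost):
    return pages * page_cost + tuples * cpu_tuple_cost

# Two hypothetical plans: (pages touched, tuples processed)
plans = [(44248, 10_000_000), (39949, 99_631)]

halved_pages = [plan_cost(p, t, 0.5, 0.01) for p, t in plans]
doubled_cpu = [plan_cost(p, t, 1.0, 0.02) for p, t in plans]

# Doubling the CPU constant gives exactly 2x the cost of halving the
# page constant, for every plan -- so the cheapest plan is the same.
assert all(abs(d - 2 * h) < 1e-9 for d, h in zip(doubled_cpu, halved_pages))
winner_a = min(range(len(plans)), key=lambda i: halved_pages[i])
winner_b = min(range(len(plans)), key=lambda i: doubled_cpu[i])
assert winner_a == winner_b  # same plan wins under both settings
```

In other words, only the ratios between the cost constants matter for plan choice, which is why stating the change as "halve the page costs" is the clearer formulation.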
>
> # x4 tuple/operator costs - bitmap scan still a bit cheaper
> set seq_page_cost = 1.0;
> set random_page_cost = 1.0;
> set cpu_tuple_cost = 0.04;
> set cpu_index_tuple_cost = 0.02;
> set cpu_operator_cost = 0.01;
>
If you really want to target the plan with the BitmapAnd, you should
increase cpu_index_tuple_cost and/or cpu_operator_cost but not increase
cpu_tuple_cost. That is because the unselective bitmap index scan does
not incur any cpu_tuple_cost, but does incur index_tuple and operator
costs. Unfortunately all other index scans in the system will also be
skewed by such a change if you make the change system-wide.
Incidentally, the "actual rows" field of BitmapAnd is always zero. That
field is not implemented for that node type.
Why do you have an index on flag in the first place? What does the index
accomplish, other than enticing the planner into bad plans? I don't know
how this translates back into your real query, but dropping that index
should be considered. Or replace both indexes with one on (num,flag).
Or you can re-write the part of the WHERE clause in a way that it can't use
an index, something like:
and flag::text ='t'
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Fri, Dec 1, 2017 at 11:08 PM, Vitaliy Garnashevich <
> vgarnashevich@gmail.com> wrote:
>> # x4 tuple/operator costs - bitmap scan still a bit cheaper
>> set seq_page_cost = 1.0;
>> set random_page_cost = 1.0;
>> set cpu_tuple_cost = 0.04;
>> set cpu_index_tuple_cost = 0.02;
>> set cpu_operator_cost = 0.01;

> If you really want to target the plan with the BitmapAnd, you should
> increase cpu_index_tuple_cost and/or cpu_operator_cost but not increase
> cpu_tuple_cost. That is because the unselective bitmap index scan does
> not incur any cpu_tuple_cost, but does incur index_tuple and operator
> costs. Unfortunately all other index scans in the system will also be
> skewed by such a change if you make the change system-wide.

I think it'd be a serious error to screw around with your cost settings
on the basis of a single case in which the rowcount estimates are so
far off. It's really those estimates that are the problem AFAICS.

The core issue in this example is that, the way the test data is set up,
the "flag = true" condition actually adds no selectivity at all, because
every row with "num = 1" is certain to have "flag = true". If the planner
realized that, it would certainly not bother with BitmapAnd'ing the flag
index onto the results of the num index. But it doesn't know that those
columns are correlated, so it supposes that adding the extra index will
give a 10x reduction in the number of heap rows that have to be visited
(since it knows that only 1/10th of the rows have "flag = true").
*That* is what causes the overly optimistic cost estimate for the
two-index bitmapscan, and no amount of fiddling with the cost parameters
will make that better.

I tried creating multiple-column statistics using the v10 facility for
that:

regression=# create statistics s1 on num, flag from aaa;
CREATE STATISTICS
regression=# analyze aaa;
ANALYZE

but that changed the estimate not at all, which surprised me because
dependency statistics are supposed to fix exactly this type of problem.
I suspect there may be something in the extended-stats code that causes it
not to work right for boolean columns --- this wouldn't be excessively
surprising because of the way the planner messes around with converting
"flag = true" to just "flag" and sometimes back again. But I've not
looked closer yet.

regards, tom lane
On Sat, Dec 2, 2017 at 3:44 PM, Tom Lane wrote:
> Jeff Janes writes:
> > On Fri, Dec 1, 2017 at 11:08 PM, Vitaliy Garnashevich <
> > vgarnashevich@gmail.com> wrote:
> >> # x4 tuple/operator costs - bitmap scan still a bit cheaper
> >> set seq_page_cost = 1.0;
> >> set random_page_cost = 1.0;
> >> set cpu_tuple_cost = 0.04;
> >> set cpu_index_tuple_cost = 0.02;
> >> set cpu_operator_cost = 0.01;
>
> > If you really want to target the plan with the BitmapAnd, you should
> > increase cpu_index_tuple_cost and/or cpu_operator_cost but not increase
> > cpu_tuple_cost. That is because the unselective bitmap index scan does
> > not incur any cpu_tuple_cost, but does incur index_tuple and operator
> > costs. Unfortunately all other index scans in the system will also be
> > skewed by such a change if you make the change system-wide.
>
> I think it'd be a serious error to screw around with your cost settings
> on the basis of a single case in which the rowcount estimates are so
> far off. It's really those estimates that are the problem AFAICS.
>
> The core issue in this example is that, the way the test data is set up,
> the "flag = true" condition actually adds no selectivity at all, because
> every row with "num = 1" is certain to have "flag = true". If the planner
> realized that, it would certainly not bother with BitmapAnd'ing the flag
> index onto the results of the num index. But it doesn't know that those
> columns are correlated, so it supposes that adding the extra index will
> give a 10x reduction in the number of heap rows that have to be visited
> (since it knows that only 1/10th of the rows have "flag = true").
> *That* is what causes the overly optimistic cost estimate for the
> two-index bitmapscan, and no amount of fiddling with the cost parameters
> will make that better.
>
But he also tested with num=2 and num=39, which reverses the situation so
the bitmap is 100% selective rather than the 90% the planner thinks it will
be.
But it is still slower for him (I am having trouble replicating that
exact behavior), so building the bitmap to rule out 100% of the rows is
empirically not worth it; I don't see how building it to rule out 90%,
as the planner thinks, would be any better.
> I tried creating multiple-column statistics using the v10 facility for
> that:
>
> regression=# create statistics s1 on num, flag from aaa;
> CREATE STATISTICS
> regression=# analyze aaa;
> ANALYZE
>
> but that changed the estimate not at all, which surprised me because
> dependency statistics are supposed to fix exactly this type of problem.
> I suspect there may be something in the extended-stats code that causes it
> not to work right for boolean columns --- this wouldn't be excessively
> surprising because of the way the planner messes around with converting
> "flag = true" to just "flag" and sometimes back again. But I've not
> looked closer yet.
>
I think the non-extended stats code also has trouble with booleans.
pg_stats gives me a correlation of 0.8 or higher for the flag column.
Due to that, when I disable bitmapscans and seqscans, I start getting slow
index scans on the wrong index, i2 rather than i1. I don't know why he
doesn't see that in his example.
Cheers,
Jeff
On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> I think the non-extended stats code also has trouble with booleans.
> pg_stats gives me a correlation of 0.8 or higher for the flag column.

It's not due to the boolean though; you see the same thing if you do:

CREATE INDEX aaa_f ON aaa((flag::text));
ANALYZE aaa;
correlation | 0.81193

or:

ALTER TABLE aaa ADD flag2 int;
UPDATE aaa SET flag2 = flag::int;
correlation | 0.81193

I think it's caused by having so few (2) values to correlate.

most_common_vals  | {f,t}
most_common_freqs | {0.9014,0.0986}
correlation       | 0.822792

It thinks there's somewhat-high correlation since it gets a list of x
and y values (integer positions by logical and physical sort order) and
90% of the x list (logical value) are the same value ('t'), and the
CTIDs are in order on the new index, so 90% of the values are 100%
correlated.

It improves (by which I mean here that it spits out a lower number) if
it's not a 90/10 split:

CREATE TABLE aaa5 AS SELECT (id%100)::int num, (id%10>5)::bool flag
FROM generate_series(1, 10000000) id;
CREATE INDEX ON aaa5 (flag);

tablename   | aaa5
attname     | flag
correlation | 0.522184

Justin
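The effect Justin describes can be reproduced with a short simulation. This is a rough sketch of the idea behind the correlation statistic (Pearson correlation between each sampled row's physical position and its rank in value order), not PostgreSQL's actual ANALYZE code, and the quoted figures are approximate:

```python
import random

# Rough simulation of the "correlation" column statistic for a boolean
# column: correlate each row's physical position with its rank when the
# rows are sorted by value (ties broken by physical order, mimicking
# CTID order within each value).  With a 90/10 split, most of the list
# looks perfectly correlated, so the statistic comes out around 0.82.

def simulated_correlation(values):
    n = len(values)
    # Stable sort by value: each row's logical (value-order) position.
    order = sorted(range(n), key=lambda i: (values[i], i))
    rank = [0] * n
    for logical_pos, phys in enumerate(order):
        rank[phys] = logical_pos
    xs = range(n)                      # physical positions
    mx = (n - 1) / 2
    my = sum(rank) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, rank)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in rank) / n
    return cov / (vx * vy) ** 0.5

random.seed(42)
flags_90_10 = [random.random() < 0.1 for _ in range(100_000)]
r1 = simulated_correlation(flags_90_10)   # around 0.82

flags_60_40 = [random.random() < 0.4 for _ in range(100_000)]
r2 = simulated_correlation(flags_60_40)   # around 0.52, a more even split
assert r2 < r1
```

The two approximate values match the 0.8228 and 0.5222 that ANALYZE reported in the thread, supporting the "few distinct values plus tie-breaking by physical order" explanation.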
On 02/12/2017 23:17, Jeff Janes wrote:
> Right, so there is a CPU costing problem (which could only be fixed by
> hacking PostgreSQL and recompiling it), but it is a much smaller
> problem than the IO cost not being accurate due to the high hit rate.
> Fixing the CPU costing problem is unlikely to make a difference to
> your real query. If you set the page costs to zero, what happens to
> your real query?
I can't reproduce the exact issue on the real database any more. The
query started to use the slow bitmap scan recently, and had been doing
so for some time lately, but now it's switched back to use the index
scan. The table involved in the query gets modified a lot. It has
hundreds of millions of rows. Lots of new rows are appended to it every
day, and the oldest rows are sometimes removed. The table is analyzed at
least daily. It's possible that the statistics were updated and that
caused the query to plan differently. But I would still like to
understand why the issue happened, and how to fix it properly, in case
it returns.
>
> But I doubt that the settings seq_page_cost = random_page_cost =
> 0.0 should actually be used.
>
>
> Why not? If your production server really has everything in memory
> during normal operation, that is the correct course of action. If you
> ever restart the server, then you could have some unpleasant time
> getting it back up to speed again, but pg_prewarm could help with that.
In the real database, not everything is in memory. There are 200GB+ of
RAM, but DB is 500GB+. The table involved in the query itself is 60GB+
of data and 100GB+ of indexes. I'm running the test case in a way where
all reads are done from RAM, only to make it easier to reproduce and to
avoid unrelated effects.
As far as I know, costs in Postgres were designed to be relative to
seq_page_cost, which for that reason is usually defined as 1.0. Even if
everything were in RAM, accessing the pages would still not have zero
cost. Setting it to 0.0 just seems too extreme, as all the other
non-zero costs would become infinitely bigger in comparison.
> If you really want to target the plan with the BitmapAnd, you should
> increase cpu_index_tuple_cost and/or cpu_operator_cost but not
> increase cpu_tuple_cost. That is because the unselective bitmap
> index scan does not incur any cpu_tuple_cost, but does incur
> index_tuple and operator costs. Unfortunately all other index scans
> in the system will also be skewed by such a change if you make the
> change system-wide.
Exactly. I'd like to understand why the worse plan is being chosen, and
1) if it's fixable by tuning costs, to figure out the right settings
that could be used in production; 2) if there is a bug in the Postgres
optimizer, to bring some attention to it, so that it's eventually fixed
in a future release; 3) if Postgres is supposed to work this way, then
at least I (and anyone who reads this thread) will understand it better.
Regards,
Vitaliy
On 03/12/2017 01:44, Tom Lane wrote:
> I think it'd be a serious error to screw around with your cost settings
> on the basis of a single case in which the rowcount estimates are so
> far off. It's really those estimates that are the problem AFAICS.
>
> The core issue in this example is that, the way the test data is set up,
> the "flag = true" condition actually adds no selectivity at all, because
> every row with "num = 1" is certain to have "flag = true". If the planner
> realized that, it would certainly not bother with BitmapAnd'ing the flag
> index onto the results of the num index. But it doesn't know that those
> columns are correlated, so it supposes that adding the extra index will
> give a 10x reduction in the number of heap rows that have to be visited
> (since it knows that only 1/10th of the rows have "flag = true").
> *That* is what causes the overly optimistic cost estimate for the
> two-index bitmapscan, and no amount of fiddling with the cost parameters
> will make that better.

Here I've tried to make a test which would not have correlation between
the two columns.

shared_buffers = 512MB
effective_cache_size = 512MB
work_mem = 100MB

set seq_page_cost = 1.0;
set random_page_cost = 1.5;
set cpu_tuple_cost = 0.01;
set cpu_index_tuple_cost = 0.005;
set cpu_operator_cost = 0.0025;

drop table if exists aaa;
create table aaa as select floor(random()*100)::int num, (random()*10 < 1)::bool flag from generate_series(1, 10000000) id;
create index i1 on aaa (num);
create index i2 on aaa (flag);
analyze aaa;

select relname, reltuples::bigint, relpages::bigint, (reltuples/relpages)::bigint tpp from pg_class where relname in('aaa','i1','i2') order by relname;
"aaa";10000033;44248;226
"i1";10000033;27422;365
"i2";10000033;27422;365

select flag, count(*) from aaa group by flag order by flag;
f;9000661
t;999339

select num, count(*) from aaa group by num order by num;
0;99852
1;99631
2;99699
3;100493
...
96;100345
97;99322
98;100013
99;100030

explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = true;

Bitmap Heap Scan on public.aaa (cost=12829.83..24729.85 rows=10340 width=5) (actual time=104.941..112.649 rows=9944 loops=1)
  Output: num, flag
  Recheck Cond: (aaa.num = 1)
  Filter: aaa.flag
  Heap Blocks: exact=8922
  Buffers: shared hit=11932
  -> BitmapAnd (cost=12829.83..12829.83 rows=10340 width=0) (actual time=102.926..102.926 rows=0 loops=1)
       Buffers: shared hit=3010
       -> Bitmap Index Scan on i1 (cost=0.00..1201.44 rows=103334 width=0) (actual time=15.459..15.459 rows=99631 loops=1)
            Index Cond: (aaa.num = 1)
            Buffers: shared hit=276
       -> Bitmap Index Scan on i2 (cost=0.00..11622.97 rows=1000671 width=0) (actual time=76.906..76.906 rows=999339 loops=1)
            Index Cond: (aaa.flag = true)
            Buffers: shared hit=2734
Planning time: 0.110 ms
Execution time: 113.272 ms

Index Scan using i1 on public.aaa (cost=0.44..66621.56 rows=10340 width=5) (actual time=0.027..47.075 rows=9944 loops=1)
  Output: num, flag
  Index Cond: (aaa.num = 1)
  Filter: aaa.flag
  Rows Removed by Filter: 89687
  Buffers: shared hit=39949
Planning time: 0.104 ms
Execution time: 47.351 ms

> I tried creating multiple-column statistics using the v10 facility for
> that:
>
> regression=# create statistics s1 on num, flag from aaa;
> CREATE STATISTICS
> regression=# analyze aaa;
> ANALYZE
>
> but that changed the estimate not at all, which surprised me because
> dependency statistics are supposed to fix exactly this type of problem.
> I suspect there may be something in the extended-stats code that causes it
> not to work right for boolean columns --- this wouldn't be excessively
> surprising because of the way the planner messes around with converting
> "flag = true" to just "flag" and sometimes back again. But I've not
> looked closer yet.
>
> regards, tom lane
On 03/12/2017 03:27, Jeff Janes wrote:
> Due to that, when I disable bitmapscans and seqscans, I start getting
> slow index scans on the wrong index, i2 rather than i1. I don't know
> why he doesn't see that in his example.
When I increase effective_cache_size to 1024MB, I start getting the plan
with the slower index i2, too.
*Bitmap Heap Scan* on public.aaa (cost=12600.90..*23688.70* rows=9488 width=5) (actual time=107.529..*115.902* rows=9976 loops=1)
  -> BitmapAnd (cost=12600.90..12600.90 rows=9488 width=0) (actual time=105.133..105.133 rows=0 loops=1)
       -> Bitmap Index Scan on i1 (cost=0.00..1116.43 rows=96000 width=0) (actual time=16.313..16.313 rows=100508 loops=1)
       -> Bitmap Index Scan on i2 (cost=0.00..11479.47 rows=988338 width=0) (actual time=77.950..77.950 rows=1000200 loops=1)

*Index Scan* using i2 on public.aaa (cost=0.44..*48227.31* rows=9488 width=5) (actual time=0.020..*285.695* rows=9976 loops=1)

*Seq Scan* on public.aaa (cost=0.00..*169248.54* rows=9488 width=5) (actual time=0.024..*966.469* rows=9976 loops=1)
This way the estimates and the actual times make more sense. But then
the question remains: maybe it's i1 that runs too fast and is costed
incorrectly? Why does that happen?
Here are the complete plans with the two different kinds of index scans
once again:
Index Scan using i1 on public.aaa (cost=0.44..66621.56 rows=10340 width=5) (actual time=0.027..47.075 rows=9944 loops=1)
  Output: num, flag
  Index Cond: (aaa.num = 1)
  Filter: aaa.flag
  Rows Removed by Filter: 89687
  Buffers: shared hit=39949
Planning time: 0.104 ms
Execution time: 47.351 ms

Index Scan using i2 on public.aaa (cost=0.44..48227.31 rows=9488 width=5) (actual time=0.020..285.695 rows=9976 loops=1)
  Output: num, flag
  Index Cond: (aaa.flag = true)
  Filter: (aaa.flag AND (aaa.num = 1))
  Rows Removed by Filter: 990224
  Buffers: shared hit=46984
Planning time: 0.098 ms
Execution time: 286.081 ms
// The test DB was populated with: create table aaa as select
floor(random()*100)::int num, (random()*10 < 1)::bool flag from
generate_series(1, 10000000) id;
Regards,
Vitaliy
I wrote:
> I tried creating multiple-column statistics using the v10 facility for
> that:
> regression=# create statistics s1 on num, flag from aaa;
> CREATE STATISTICS
> regression=# analyze aaa;
> ANALYZE
> but that changed the estimate not at all, which surprised me because
> dependency statistics are supposed to fix exactly this type of problem.
> I suspect there may be something in the extended-stats code that causes it
> not to work right for boolean columns --- this wouldn't be excessively
> surprising because of the way the planner messes around with converting
> "flag = true" to just "flag" and sometimes back again. But I've not
> looked closer yet.

After looking, I found that indeed dependency_is_compatible_clause()
rejects expressions like "flag" or "NOT flag", which it needn't, since
those are fully equivalent to "flag = true" or "flag = false"
respectively. Moreover, there's nothing else in
dependencies_clauselist_selectivity that depends on the exact form of
the clause under test, only on the semantics that it's an equality
condition on some Var.
Hence I propose the attached patch, which fixes the rowcount estimate
for the example discussed in this thread:

create table aaa as select (id%100)::int num, (id%10=1)::bool flag from generate_series(1, 10000000) id;
create index i1 on aaa (num);
create index i2 on aaa (flag);
create statistics s1 on num, flag from aaa;
analyze aaa;
explain analyze select count(*) from aaa where num = 1 and flag = true;

Without patch:

Aggregate (cost=43236.73..43236.74 rows=1 width=8) (actual time=349.365..349.365 rows=1 loops=1)
  -> Bitmap Heap Scan on aaa (cost=20086.40..43212.94 rows=9514 width=0) (actual time=101.308..337.985 rows=100000 loops=1)
       Recheck Cond: (num = 1)
       Filter: flag
       Heap Blocks: exact=44248
       -> BitmapAnd (cost=20086.40..20086.40 rows=9514 width=0) (actual time=92.214..92.214 rows=0 loops=1)
            -> Bitmap Index Scan on i1 (cost=0.00..1776.43 rows=96000 width=0) (actual time=17.236..17.236 rows=100000 loops=1)
                 Index Cond: (num = 1)
            -> Bitmap Index Scan on i2 (cost=0.00..18304.96 rows=991003 width=0) (actual time=72.168..72.168 rows=1000000 loops=1)
                 Index Cond: (flag = true)
Planning time: 0.254 ms
Execution time: 350.796 ms

With patch:

Aggregate (cost=43496.19..43496.20 rows=1 width=8) (actual time=359.195..359.195 rows=1 loops=1)
  -> Bitmap Heap Scan on aaa (cost=20129.64..43256.19 rows=96000 width=0) (actual time=99.750..347.353 rows=100000 loops=1)
       Recheck Cond: (num = 1)
       Filter: flag
       Heap Blocks: exact=44248
       -> BitmapAnd (cost=20129.64..20129.64 rows=9514 width=0) (actual time=90.671..90.671 rows=0 loops=1)
            -> Bitmap Index Scan on i1 (cost=0.00..1776.43 rows=96000 width=0) (actual time=16.946..16.946 rows=100000 loops=1)
                 Index Cond: (num = 1)
            -> Bitmap Index Scan on i2 (cost=0.00..18304.96 rows=991003 width=0) (actual time=70.898..70.898 rows=1000000 loops=1)
                 Index Cond: (flag = true)
Planning time: 0.218 ms
Execution time: 360.608 ms

That's the right overall rowcount estimate for the scan, given the
stats it's working from.
There's apparently still something wrong with bitmap costing, since it's
still estimating this as cheaper than the single-index case --- noting
the bogus rowcount estimate for the BitmapAnd, I suspect that bitmap
costing is taking shortcuts rather than using clauselist_selectivity to
estimate the overall selectivity of the bitmap conditions. But whatever
is causing that, it's independent of this deficiency.

In addition to the bugfix proper, I improved some comments, got rid of
a NumRelids() test that's redundant with the preceding bms_membership()
test, and fixed dependencies_clauselist_selectivity so that
estimatedclauses actually is a pure output argument as stated by its
API contract.

regards, tom lane

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 9756fb8..ae0f304 100644
*** a/src/backend/statistics/dependencies.c
--- b/src/backend/statistics/dependencies.c
*************** pg_dependencies_send(PG_FUNCTION_ARGS)
*** 736,826 ****
   * dependency_is_compatible_clause
   *      Determines if the clause is compatible with functional dependencies
   *
!  * Only OpExprs with two arguments using an equality operator are supported.
!  * When returning True attnum is set to the attribute number of the Var within
!  * the supported clause.
!  *
!  * Currently we only support Var = Const, or Const = Var. It may be possible
!  * to expand on this later.
   */
  static bool
  dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  {
      RestrictInfo *rinfo = (RestrictInfo *) clause;
  
      if (!IsA(rinfo, RestrictInfo))
          return false;
  
!     /* Pseudoconstants are not really interesting here. */
      if (rinfo->pseudoconstant)
          return false;
  
!     /* clauses referencing multiple varnos are incompatible */
      if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
          return false;
  
      if (is_opclause(rinfo->clause))
      {
          OpExpr     *expr = (OpExpr *) rinfo->clause;
-         Var        *var;
-         bool        varonleft = true;
-         bool        ok;
  
!         /* Only expressions with two arguments are considered compatible. */
          if (list_length(expr->args) != 2)
              return false;
  
!         /* see if it actually has the right */
!         ok = (NumRelids((Node *) expr) == 1) &&
!             (is_pseudo_constant_clause(lsecond(expr->args)) ||
!              (varonleft = false,
!               is_pseudo_constant_clause(linitial(expr->args))));
! 
!         /* unsupported structure (two variables or so) */
!         if (!ok)
              return false;
  
          /*
!          * If it's not "=" operator, just ignore the clause, as it's not
           * compatible with functional dependencies.
           *
           * This uses the function for estimating selectivity, not the operator
           * directly (a bit awkward, but well ...).
           */
          if (get_oprrest(expr->opno) != F_EQSEL)
              return false;
  
!         var = (varonleft) ? linitial(expr->args) : lsecond(expr->args);
  
!         /*
!          * We may ignore any RelabelType node above the operand.  (There won't
!          * be more than one, since eval_const_expressions() has been applied
!          * already.)
           */
!         if (IsA(var, RelabelType))
!             var = (Var *) ((RelabelType *) var)->arg;
  
!         /* We only support plain Vars for now */
!         if (!IsA(var, Var))
!             return false;
  
!         /* Ensure var is from the correct relation */
!         if (var->varno != relid)
!             return false;
  
!         /* we also better ensure the Var is from the current level */
!         if (var->varlevelsup > 0)
!             return false;
  
!         /* Also skip system attributes (we don't allow stats on those). */
!         if (!AttrNumberIsForUserDefinedAttr(var->varattno))
!             return false;
  
!         *attnum = var->varattno;
!         return true;
!     }
  
!     return false;
  }
  
  /*
--- 736,839 ----
   * dependency_is_compatible_clause
   *      Determines if the clause is compatible with functional dependencies
   *
!  * Only clauses that have the form of equality to a pseudoconstant, or can be
!  * interpreted that way, are currently accepted.  Furthermore the variable
!  * part of the clause must be a simple Var belonging to the specified
!  * relation, whose attribute number we return in *attnum on success.
   */
  static bool
  dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  {
      RestrictInfo *rinfo = (RestrictInfo *) clause;
+     Var        *var;
  
      if (!IsA(rinfo, RestrictInfo))
          return false;
  
!     /* Pseudoconstants are not interesting (they couldn't contain a Var) */
      if (rinfo->pseudoconstant)
          return false;
  
!     /* Clauses referencing multiple, or no, varnos are incompatible */
      if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
          return false;
  
      if (is_opclause(rinfo->clause))
      {
+         /* If it's an opclause, check for Var = Const or Const = Var. */
          OpExpr     *expr = (OpExpr *) rinfo->clause;
  
!         /* Only expressions with two arguments are candidates. */
          if (list_length(expr->args) != 2)
              return false;
  
!         /* Make sure non-selected argument is a pseudoconstant. */
!         if (is_pseudo_constant_clause(lsecond(expr->args)))
!             var = linitial(expr->args);
!         else if (is_pseudo_constant_clause(linitial(expr->args)))
!             var = lsecond(expr->args);
!         else
              return false;
  
          /*
!          * If it's not an "=" operator, just ignore the clause, as it's not
           * compatible with functional dependencies.
           *
           * This uses the function for estimating selectivity, not the operator
           * directly (a bit awkward, but well ...).
+          *
+          * XXX this is pretty dubious; probably it'd be better to check btree
+          * or hash opclass membership, so as not to be fooled by custom
+          * selectivity functions, and to be more consistent with decisions
+          * elsewhere in the planner.
           */
          if (get_oprrest(expr->opno) != F_EQSEL)
              return false;
  
!         /* OK to proceed with checking "var" */
!     }
!     else if (not_clause((Node *) rinfo->clause))
!     {
          /*
!          * "NOT x" can be interpreted as "x = false", so get the argument and
!          * proceed with seeing if it's a suitable Var.
           */
!         var = (Var *) get_notclausearg(rinfo->clause);
!     }
!     else
!     {
!         /*
!          * A boolean expression "x" can be interpreted as "x = true", so
!          * proceed with seeing if it's a suitable Var.
!          */
!         var = (Var *) rinfo->clause;
!     }
  
!     /*
!      * We may ignore any RelabelType node above the operand.  (There won't be
!      * more than one, since eval_const_expressions has been applied already.)
!      */
!     if (IsA(var, RelabelType))
!         var = (Var *) ((RelabelType *) var)->arg;
  
!     /* We only support plain Vars for now */
!     if (!IsA(var, Var))
!         return false;
  
!     /* Ensure Var is from the correct relation */
!     if (var->varno != relid)
!         return false;
  
!     /* We also better ensure the Var is from the current level */
!     if (var->varlevelsup != 0)
!         return false;
  
!     /* Also ignore system attributes (we don't allow stats on those) */
!     if (!AttrNumberIsForUserDefinedAttr(var->varattno))
!         return false;
  
!     *attnum = var->varattno;
!     return true;
  }
  
  /*
*************** find_strongest_dependency(StatisticExtIn
*** 891,902 ****
  /*
   * dependencies_clauselist_selectivity
!  *      Return the estimated selectivity of the given clauses using
!  *      functional dependency statistics, or 1.0 if no useful functional
   *      dependency statistic exists.
   *
   * 'estimatedclauses' is an output argument that gets a bit set corresponding
!  * to the (zero-based) list index of clauses that are included in the
   *      estimated selectivity.
   *
   * Given equality clauses on attributes (a,b) we find the strongest dependency
--- 904,915 ----
  /*
   * dependencies_clauselist_selectivity
!  *      Return the estimated selectivity of (a subset of) the given clauses
!  *      using functional dependency statistics, or 1.0 if no useful functional
   *      dependency statistic exists.
   *
   * 'estimatedclauses' is an output argument that gets a bit set corresponding
!  *      to the (zero-based) list index of each clause that is included in the
   *      estimated selectivity.
   *
   * Given equality clauses on attributes (a,b) we find the strongest dependency
*************** dependencies_clauselist_selectivity(Plan
*** 932,937 ****
--- 945,953 ----
      AttrNumber *list_attnums;
      int         listidx;
  
+     /* initialize output argument */
+     *estimatedclauses = NULL;
+ 
      /* check if there's any stats that might be useful for us. */
      if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES))
          return 1.0;
I wrote: > I tried creating multiple-column statistics using the v10 facility for > that: > regression=# create statistics s1 on num, flag from aaa; > CREATE STATISTICS > regression=# analyze aaa; > ANALYZE > but that changed the estimate not at all, which surprised me because > dependency statistics are supposed to fix exactly this type of problem. > I suspect there may be something in the extended-stats code that causes it > not to work right for boolean columns --- this wouldn't be excessively > surprising because of the way the planner messes around with converting > "flag = true" to just "flag" and sometimes back again. But I've not > looked closer yet. After looking, I found that indeed dependency_is_compatible_clause() rejects expressions like "flag" or "NOT flag", which it needn't since those are fully equivalent to "flag = true" or "flag = false" respectively. Moreover there's nothing else in dependencies_clauselist_selectivity that depends on the exact form of the clause under test, only on the semantics that it's an equality condition on some Var. 
Hence I propose the attached patch, which fixes the rowcount estimate for the example discussed in this thread: create table aaa as select (id%100)::int num, (id%10=1)::bool flag from generate_series(1, 10000000) id; create index i1 on aaa (num); create index i2 on aaa (flag); create statistics s1 on num, flag from aaa; analyze aaa; explain analyze select count(*) from aaa where num = 1 and flag = true; Without patch: Aggregate (cost=43236.73..43236.74 rows=1 width=8) (actual time=349.365..349.3 65 rows=1 loops=1) -> Bitmap Heap Scan on aaa (cost=20086.40..43212.94 rows=9514 width=0) (act ual time=101.308..337.985 rows=100000 loops=1) Recheck Cond: (num = 1) Filter: flag Heap Blocks: exact=44248 -> BitmapAnd (cost=20086.40..20086.40 rows=9514 width=0) (actual time =92.214..92.214 rows=0 loops=1) -> Bitmap Index Scan on i1 (cost=0.00..1776.43 rows=96000 width =0) (actual time=17.236..17.236 rows=100000 loops=1) Index Cond: (num = 1) -> Bitmap Index Scan on i2 (cost=0.00..18304.96 rows=991003 wid th=0) (actual time=72.168..72.168 rows=1000000 loops=1) Index Cond: (flag = true) Planning time: 0.254 ms Execution time: 350.796 ms With patch: Aggregate (cost=43496.19..43496.20 rows=1 width=8) (actual time=359.195..359.1 95 rows=1 loops=1) -> Bitmap Heap Scan on aaa (cost=20129.64..43256.19 rows=96000 width=0) (ac tual time=99.750..347.353 rows=100000 loops=1) Recheck Cond: (num = 1) Filter: flag Heap Blocks: exact=44248 -> BitmapAnd (cost=20129.64..20129.64 rows=9514 width=0) (actual time =90.671..90.671 rows=0 loops=1) -> Bitmap Index Scan on i1 (cost=0.00..1776.43 rows=96000 width =0) (actual time=16.946..16.946 rows=100000 loops=1) Index Cond: (num = 1) -> Bitmap Index Scan on i2 (cost=0.00..18304.96 rows=991003 wid th=0) (actual time=70.898..70.898 rows=1000000 loops=1) Index Cond: (flag = true) Planning time: 0.218 ms Execution time: 360.608 ms That's the right overall rowcount estimate for the scan, given the stats it's working from. 
There's apparently still something wrong with bitmap costing, since it's still estimating this as cheaper than the single-index case --- noting the bogus rowcount estimate for the BitmapAnd, I suspect that bitmap costing is taking shortcuts rather than using clauselist_selectivity to estimate the overall selectivity of the bitmap conditions.  But whatever is causing that, it's independent of this deficiency.

In addition to the bugfix proper, I improved some comments, got rid of a NumRelids() test that's redundant with the preceding bms_membership() test, and fixed dependencies_clauselist_selectivity so that estimatedclauses actually is a pure output argument as stated by its API contract.

			regards, tom lane

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 9756fb8..ae0f304 100644
*** a/src/backend/statistics/dependencies.c
--- b/src/backend/statistics/dependencies.c
*************** pg_dependencies_send(PG_FUNCTION_ARGS)
*** 736,826 ****
   * dependency_is_compatible_clause
   *		Determines if the clause is compatible with functional dependencies
   *
!  * Only OpExprs with two arguments using an equality operator are supported.
!  * When returning True attnum is set to the attribute number of the Var within
!  * the supported clause.
!  *
!  * Currently we only support Var = Const, or Const = Var.  It may be possible
!  * to expand on this later.
   */
  static bool
  dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  {
  	RestrictInfo *rinfo = (RestrictInfo *) clause;
  
  	if (!IsA(rinfo, RestrictInfo))
  		return false;
  
! 	/* Pseudoconstants are not really interesting here. */
  	if (rinfo->pseudoconstant)
  		return false;
  
! 	/* clauses referencing multiple varnos are incompatible */
  	if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
  		return false;
  
  	if (is_opclause(rinfo->clause))
  	{
  		OpExpr	   *expr = (OpExpr *) rinfo->clause;
- 		Var		   *var;
- 		bool		varonleft = true;
- 		bool		ok;
  
! 		/* Only expressions with two arguments are considered compatible. */
  		if (list_length(expr->args) != 2)
  			return false;
  
! 		/* see if it actually has the right */
! 		ok = (NumRelids((Node *) expr) == 1) &&
! 			(is_pseudo_constant_clause(lsecond(expr->args)) ||
! 			 (varonleft = false,
! 			  is_pseudo_constant_clause(linitial(expr->args))));
! 
! 		/* unsupported structure (two variables or so) */
! 		if (!ok)
  			return false;
  
  		/*
! 		 * If it's not "=" operator, just ignore the clause, as it's not
  		 * compatible with functional dependencies.
  		 *
  		 * This uses the function for estimating selectivity, not the operator
  		 * directly (a bit awkward, but well ...).
  		 */
  		if (get_oprrest(expr->opno) != F_EQSEL)
  			return false;
  
! 		var = (varonleft) ? linitial(expr->args) : lsecond(expr->args);
  
! 		/*
! 		 * We may ignore any RelabelType node above the operand.  (There won't
! 		 * be more than one, since eval_const_expressions() has been applied
! 		 * already.)
  		 */
! 		if (IsA(var, RelabelType))
! 			var = (Var *) ((RelabelType *) var)->arg;
  
! 		/* We only support plain Vars for now */
! 		if (!IsA(var, Var))
! 			return false;
  
! 		/* Ensure var is from the correct relation */
! 		if (var->varno != relid)
! 			return false;
  
! 		/* we also better ensure the Var is from the current level */
! 		if (var->varlevelsup > 0)
! 			return false;
  
! 		/* Also skip system attributes (we don't allow stats on those). */
! 		if (!AttrNumberIsForUserDefinedAttr(var->varattno))
! 			return false;
  
! 		*attnum = var->varattno;
! 		return true;
! 	}
  
! 	return false;
  }
  
  /*
--- 736,839 ----
   * dependency_is_compatible_clause
   *		Determines if the clause is compatible with functional dependencies
   *
!  * Only clauses that have the form of equality to a pseudoconstant, or can be
!  * interpreted that way, are currently accepted.  Furthermore the variable
!  * part of the clause must be a simple Var belonging to the specified
!  * relation, whose attribute number we return in *attnum on success.
   */
  static bool
  dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  {
  	RestrictInfo *rinfo = (RestrictInfo *) clause;
+ 	Var		   *var;
  
  	if (!IsA(rinfo, RestrictInfo))
  		return false;
  
! 	/* Pseudoconstants are not interesting (they couldn't contain a Var) */
  	if (rinfo->pseudoconstant)
  		return false;
  
! 	/* Clauses referencing multiple, or no, varnos are incompatible */
  	if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
  		return false;
  
  	if (is_opclause(rinfo->clause))
  	{
+ 		/* If it's an opclause, check for Var = Const or Const = Var. */
  		OpExpr	   *expr = (OpExpr *) rinfo->clause;
  
! 		/* Only expressions with two arguments are candidates. */
  		if (list_length(expr->args) != 2)
  			return false;
  
! 		/* Make sure non-selected argument is a pseudoconstant. */
! 		if (is_pseudo_constant_clause(lsecond(expr->args)))
! 			var = linitial(expr->args);
! 		else if (is_pseudo_constant_clause(linitial(expr->args)))
! 			var = lsecond(expr->args);
! 		else
  			return false;
  
  		/*
! 		 * If it's not an "=" operator, just ignore the clause, as it's not
  		 * compatible with functional dependencies.
  		 *
  		 * This uses the function for estimating selectivity, not the operator
  		 * directly (a bit awkward, but well ...).
+ 		 *
+ 		 * XXX this is pretty dubious; probably it'd be better to check btree
+ 		 * or hash opclass membership, so as not to be fooled by custom
+ 		 * selectivity functions, and to be more consistent with decisions
+ 		 * elsewhere in the planner.
  		 */
  		if (get_oprrest(expr->opno) != F_EQSEL)
  			return false;
  
! 		/* OK to proceed with checking "var" */
! 	}
! 	else if (not_clause((Node *) rinfo->clause))
! 	{
  		/*
! 		 * "NOT x" can be interpreted as "x = false", so get the argument and
! 		 * proceed with seeing if it's a suitable Var.
  		 */
! 		var = (Var *) get_notclausearg(rinfo->clause);
! 	}
! 	else
! 	{
! 		/*
! 		 * A boolean expression "x" can be interpreted as "x = true", so
! 		 * proceed with seeing if it's a suitable Var.
! 		 */
! 		var = (Var *) rinfo->clause;
! 	}
  
! 	/*
! 	 * We may ignore any RelabelType node above the operand.  (There won't be
! 	 * more than one, since eval_const_expressions has been applied already.)
! 	 */
! 	if (IsA(var, RelabelType))
! 		var = (Var *) ((RelabelType *) var)->arg;
  
! 	/* We only support plain Vars for now */
! 	if (!IsA(var, Var))
! 		return false;
  
! 	/* Ensure Var is from the correct relation */
! 	if (var->varno != relid)
! 		return false;
  
! 	/* We also better ensure the Var is from the current level */
! 	if (var->varlevelsup != 0)
! 		return false;
  
! 	/* Also ignore system attributes (we don't allow stats on those) */
! 	if (!AttrNumberIsForUserDefinedAttr(var->varattno))
! 		return false;
  
! 	*attnum = var->varattno;
! 	return true;
  }
  
  /*
*************** find_strongest_dependency(StatisticExtIn
*** 891,902 ****
  /*
   * dependencies_clauselist_selectivity
!  *		Return the estimated selectivity of the given clauses using
!  *		functional dependency statistics, or 1.0 if no useful functional
   *		dependency statistic exists.
   *
   * 'estimatedclauses' is an output argument that gets a bit set corresponding
!  * to the (zero-based) list index of clauses that are included in the
   * estimated selectivity.
   *
   * Given equality clauses on attributes (a,b) we find the strongest dependency
--- 904,915 ----
  /*
   * dependencies_clauselist_selectivity
!  *		Return the estimated selectivity of (a subset of) the given clauses
!  *		using functional dependency statistics, or 1.0 if no useful functional
   *		dependency statistic exists.
   *
   * 'estimatedclauses' is an output argument that gets a bit set corresponding
!  * to the (zero-based) list index of each clause that is included in the
   * estimated selectivity.
   *
   * Given equality clauses on attributes (a,b) we find the strongest dependency
*************** dependencies_clauselist_selectivity(Plan
*** 932,937 ****
--- 945,953 ----
  	AttrNumber *list_attnums;
  	int			listidx;
  
+ 	/* initialize output argument */
+ 	*estimatedclauses = NULL;
+ 
  	/* check if there's any stats that might be useful for us. */
  	if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES))
  		return 1.0;
On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby wrote:
> On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> > I think the non-extended stats code also has trouble with booleans.
> > pg_stats gives me a correlation of 0.8 or higher for the flag column.
>
> It's not due to the boolean though; you see the same thing if you do:
> CREATE INDEX aaa_f ON aaa((flag::text));
> ANALYZE aaa;
> correlation | 0.81193
>
> or:
> ALTER TABLE aaa ADD flag2 int; UPDATE aaa SET flag2= flag::int
> correlation | 0.81193
>
> I think it's caused by having so few (2) values to correlate.
>
> most_common_vals | {f,t}
> most_common_freqs | {0.9014,0.0986}
> correlation | 0.822792
>
> It thinks there's somewhat-high correlation since it gets a list of x and y
> values (integer positions by logical and physical sort order) and 90% of the
> x list (logical value) are the same value ('t'), and the CTIDs are in order
> on the new index, so 90% of the values are 100% correlated.
>
>
But there is no index involved (except in the case of the functional
index). The correlation of table columns to physical order of the table
doesn't depend on the existence of an index, or the physical order within
an index.
But I do see that ties within the logical order of the column values are
broken to agree with the physical order. That is wrong, right? Is there
any argument that this is desirable?
It looks like it could be fixed with a few extra double calcs per distinct
value. Considering we already sorted the sample values using SQL-callable
collation dependent comparators, I doubt a few C-level double calcs is
going to be meaningful.
Cheers,
Jeff
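[Editor's note: the tie-breaking effect Jeff describes can be sketched numerically. This is an illustration, not PostgreSQL's actual ANALYZE code; it builds a 90%/10% boolean-like column in random heap order and compares the Pearson correlation when ties in the logical sort are broken by physical position (the current behavior) versus left tied (Jeff's proposal).]

```python
import random

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(42)
n = 100_000
vals = [random.random() < 0.1 for _ in range(n)]  # ~10% True, like "flag"
phys = list(range(n))                             # heap (physical) order

# Current behavior: logical rank comes from sorting by (value, heap position),
# so rows within a duplicate group are numbered in heap order.
rank = {row: r for r, row in enumerate(sorted(phys, key=lambda i: (vals[i], i)))}
x_broken = [rank[i] for i in phys]

# Jeff's proposal: all duplicates of a value share one logical position.
x_tied = [int(vals[i]) for i in phys]

print("ties broken by heap order:", round(pearson(x_broken, phys), 3))  # ~0.82
print("ties left tied:           ", round(pearson(x_tied, phys), 3))    # ~0.00
```

For a 90/10 split the tie-broken figure works out to about 0.82, which is essentially the 0.8+ correlation reported upthread for the flag column, while the tied version is near zero for randomly scattered values.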
Jeff Janes <jeff.janes@gmail.com> writes: > On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby <pryzby@telsasoft.com> wrote: >> It thinks there's somewhat-high correlation since it gets a list of x >> and y values (integer positions by logical and physical sort order) and >> 90% of the x list (logical value) are the same value ('t'), and the >> CTIDs are in order on the new index, so 90% of the values are 100% >> correlated. > But there is no index involved (except in the case of the functional > index). The correlation of table columns to physical order of the table > doesn't depend on the existence of an index, or the physical order within > an index. > But I do see that ties within the logical order of the column values are > broken to agree with the physical order. That is wrong, right? Is there > any argument that this is desirable? Uh ... what do you propose doing instead? We'd have to do something with ties, and it's not so obvious this way is wrong. regards, tom lane
On Dec 3, 2017 15:31, "Tom Lane" wrote:
> Jeff Janes writes:
>> On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby wrote:
>>> It thinks there's somewhat-high correlation since it gets a list of x
>>> and y values (integer positions by logical and physical sort order) and
>>> 90% of the x list (logical value) are the same value ('t'), and the
>>> CTIDs are in order on the new index, so 90% of the values are 100%
>>> correlated.
>> But there is no index involved (except in the case of the functional
>> index). The correlation of table columns to physical order of the table
>> doesn't depend on the existence of an index, or the physical order within
>> an index.
>> But I do see that ties within the logical order of the column values are
>> broken to agree with the physical order. That is wrong, right? Is there
>> any argument that this is desirable?
> Uh ... what do you propose doing instead? We'd have to do something with
> ties, and it's not so obvious this way is wrong.
Let them be tied. If there are 10 distinct values, number the values 0 to
9, and all rows of a given distinct values get the same number for the
logical order axis.
Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
me. Although if we switched btree to store duplicate values with tid as a
tie breaker, then maybe it wouldn't be as obviously wrong.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Dec 3, 2017 15:31, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> But I do see that ties within the logical order of the column values are
>>> broken to agree with the physical order.  That is wrong, right?  Is there
>>> any argument that this is desirable?

>> Uh ... what do you propose doing instead?  We'd have to do something with
>> ties, and it's not so obvious this way is wrong.

> Let them be tied.  If there are 10 distinct values, number the values 0 to
> 9, and all rows of a given distinct values get the same number for the
> logical order axis.

> Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
> me.  Although if we switched btree to store duplicate values with tid as a
> tie breaker, then maybe it wouldn't be as obviously wrong.

I thought some more about this.  What we really want the correlation stat
to do is help us estimate how randomly an index-ordered scan will access
the heap.  If the values we've sampled are all unequal then there's no
particular issue.  However, if we have some group of equal values, we do
not really know what order an indexscan will visit them in.  The existing
correlation calculation is making the *most optimistic possible*
assumption, that such a group will be visited exactly in heap order ---
and that assumption isn't too defensible.  IIRC, a freshly built b-tree
will behave that way, because the initial sort of a btree breaks ties using
heap TIDs; but we don't maintain that property during later insertions.
In any case, given that we do this calculation without regard to any
specific index, we can't reasonably expect to model exactly what the index
will do.

It would be best to adopt some middle-of-the-road assumption about what
the heap visitation order will be for a set of duplicate values: not
exactly heap order, but I think we should not use a worst-case assumption
either, since the btree may retain some amount of its initial ordering.

BTW, I disagree that "correlation = zero" is the right answer for this
particular example.  If the btree is freshly built, then an index-order
scan would visit all the heap pages in sequence to fetch "f" rows, and
then visit them all in sequence again to fetch "t" rows, which is a whole
lot better than the purely random access that zero correlation implies.
So I think 0.8 or so is actually a perfectly reasonable answer when the
index is fresh.  The trouble is just that it'd behoove us to derate that
answer somewhat for the probability that the index isn't fresh.

My first thought for a concrete fix was to use the mean position of a
group of duplicates for the purposes of the correlation calculation, but
on reflection that's clearly wrong.  We do have an idea, from the data we
have, whether the duplicates are close together in the heap or spread all
over.  Using only mean position would fail to distinguish those cases, but
really we'd better penalize the spread-all-over case.  I'm not sure how to
do that.

Or we could leave this calculation alone and try to move towards keeping
equal values in TID order in btrees.  Not sure about the downsides of
that, though.

			regards, tom lane
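[Editor's note: Tom's point that mean position alone cannot distinguish clustered from scattered duplicates can be sketched. This is a hypothetical illustration with made-up layout numbers: two heaps where the duplicate group has the same mean position but touches very different numbers of heap pages.]

```python
# Hypothetical layout: value 't' occupies 10% of 1,000,000 rows, 100 rows per
# heap page.  "clustered" puts all 't' rows in one contiguous block mid-heap;
# "scattered" spreads them evenly, one per every 10 rows.
ROWS, PER_PAGE = 1_000_000, 100
n_dup = ROWS // 10

clustered = range(450_000, 450_000 + n_dup)  # contiguous block
scattered = range(5, ROWS, 10)               # every 10th row

def pages_touched(positions):
    # Distinct heap pages an index scan over the group must visit.
    return len({p // PER_PAGE for p in positions})

for name, pos in (("clustered", clustered), ("scattered", scattered)):
    mean = sum(pos) / n_dup
    print(f"{name}: mean position ~{mean:.0f}, pages touched {pages_touched(pos)}")
```

Both layouts have a mean position of about 500,000, yet the clustered group touches 1,000 pages while the scattered one touches 10,000, which is why a mean-position-only statistic cannot carry the penalty Tom wants for the spread-all-over case.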
On 04/12/2017 00:11, Vitaliy Garnashevich wrote:
On 03/12/2017 03:27, Jeff Janes wrote:

When I increase effective_cache_size to 1024MB, I start getting the plan with the slower index i2, too. Due to that, when I disable bitmapscans and seqscans, I start getting slow index scans on the wrong index, i2 rather than i1. I don't know why he doesn't see that in his example.
Bitmap Heap Scan on public.aaa (cost=12600.90..23688.70 rows=9488 width=5) (actual time=107.529..115.902 rows=9976 loops=1)
-> BitmapAnd (cost=12600.90..12600.90 rows=9488 width=0) (actual time=105.133..105.133 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..1116.43 rows=96000 width=0) (actual time=16.313..16.313 rows=100508 loops=1)
-> Bitmap Index Scan on i2 (cost=0.00..11479.47 rows=988338 width=0) (actual time=77.950..77.950 rows=1000200 loops=1)
Index Scan using i2 on public.aaa (cost=0.44..48227.31 rows=9488 width=5) (actual time=0.020..285.695 rows=9976 loops=1)
Seq Scan on public.aaa (cost=0.00..169248.54 rows=9488 width=5) (actual time=0.024..966.469 rows=9976 loops=1)
This way the estimates and the actual times make more sense. But then there's the question - maybe it's that i1 runs too fast, and is estimated incorrectly? Why does that happen?
I've tried to create a better test case:
- Increase shared_buffers and effective_cache_size to fit whole database, including indexes.
- Use random(), to avoid correlation between the filtered values.
- Make both columns of integer type, to avoid special cases with boolean (like the one happened with CREATE STATISTICS).
- Flush OS disk cache and then try running the query several times, to get both cold-cache results and all-in-ram results.
- There are several tests, with different frequency of the selected values in the two columns: [1/10, 1/10], [1/50, 1/10], [1/100, 1/10].
- There is a split of cost by contribution of each of its components: seq_page_cost, random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost. The EXPLAIN is run for each component, every time with only one of the components set to non-zero.
- The test was run on a Digitalocean VM: Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-101-generic x86_64), 2 GB RAM, 2 core CPU, SSD; PostgreSQL 9.5.10.
shared_buffers = 1024MB
effective_cache_size = 1024MB
work_mem = 100MB
create table aaa as select floor(random()*10)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create table aaa as select floor(random()*50)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create table aaa as select floor(random()*100)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create index i1 on aaa (num);
create index i2 on aaa (flag);
set enable_bitmapscan = on; set enable_indexscan = off; set enable_seqscan = off;
set enable_bitmapscan = off; set enable_indexscan = on; set enable_seqscan = off;
set enable_bitmapscan = off; set enable_indexscan = off; set enable_seqscan = on;
set seq_page_cost = 1.0; set random_page_cost = 1.0; set cpu_tuple_cost = 0.01; set cpu_index_tuple_cost = 0.005; set cpu_operator_cost = 0.0025;
explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = 1;
Bitmap Heap Scan on public.aaa (cost=20687.87..66456.91 rows=101403 width=8) (actual time=345.349..6061.834 rows=99600 loops=1) read=45091
Bitmap Heap Scan on public.aaa (cost=20687.87..66456.91 rows=101403 width=8) (actual time=593.915..991.403 rows=99600 loops=1) hit=45091
Bitmap Heap Scan on public.aaa (cost=20687.87..66456.91 rows=101403 width=8) (actual time=255.273..355.694 rows=99600 loops=1) hit=45091
Bitmap Heap Scan on public.aaa (cost=20687.87..66456.91 rows=101403 width=8) (actual time=284.768..385.505 rows=99600 loops=1) hit=45091
? + ? + 1014.03 + ? + 5595.52 = ?
Bitmap Heap Scan on public.aaa (cost=12081.43..28452.09 rows=19644 width=8) (actual time=238.566..3115.445 rows=20114 loops=1) read=19425
Bitmap Heap Scan on public.aaa (cost=12081.43..28452.09 rows=19644 width=8) (actual time=314.590..382.207 rows=20114 loops=1) hit=19425
Bitmap Heap Scan on public.aaa (cost=12081.43..28452.09 rows=19644 width=8) (actual time=265.899..311.064 rows=20114 loops=1) hit=19425
Bitmap Heap Scan on public.aaa (cost=12081.43..28452.09 rows=19644 width=8) (actual time=209.470..237.697 rows=20114 loops=1) hit=19425
9689.92 + ? + 196.44 + ? + ? = ?
Bitmap Heap Scan on public.aaa (cost=11273.15..20482.50 rows=10090 width=8) (actual time=284.381..2019.717 rows=10114 loops=1) read=12059
Bitmap Heap Scan on public.aaa (cost=11273.15..20482.50 rows=10090 width=8) (actual time=153.445..180.770 rows=10114 loops=1) hit=12059
Bitmap Heap Scan on public.aaa (cost=11273.15..20482.50 rows=10090 width=8) (actual time=146.275..159.446 rows=10114 loops=1) hit=12059
Bitmap Heap Scan on public.aaa (cost=11273.15..20482.50 rows=10090 width=8) (actual time=140.973..153.998 rows=10114 loops=1) hit=12059
4098.28 + ? + 100.90 + ? + ? = ?
Seq Scan on public.aaa (cost=0.00..194248.49 rows=101403 width=8) (actual time=0.126..2056.913 rows=99600 loops=1) read=44248
Seq Scan on public.aaa (cost=0.00..194248.49 rows=101403 width=8) (actual time=0.045..1595.377 rows=99600 loops=1) hit=32 read=44216
Seq Scan on public.aaa (cost=0.00..194248.49 rows=101403 width=8) (actual time=0.066..1392.700 rows=99600 loops=1) hit=64 read=44184
Seq Scan on public.aaa (cost=0.00..194248.49 rows=101403 width=8) (actual time=0.069..1378.574 rows=99600 loops=1) hit=96 read=44152
44248.00 + 0.00 + 100000.33 + 0.00 + 50000.17 = 194248.5
Seq Scan on public.aaa (cost=0.00..194247.77 rows=19644 width=8) (actual time=0.646..1801.794 rows=20114 loops=1) read=44248
Seq Scan on public.aaa (cost=0.00..194247.77 rows=19644 width=8) (actual time=0.385..1518.613 rows=20114 loops=1) hit=32 read=44216
Seq Scan on public.aaa (cost=0.00..194247.77 rows=19644 width=8) (actual time=0.346..1369.021 rows=20114 loops=1) hit=64 read=44184
Seq Scan on public.aaa (cost=0.00..194247.77 rows=19644 width=8) (actual time=0.597..1792.959 rows=20114 loops=1) hit=96 read=44152
44248.00 + 0.00 + 99999.85 + 0.00 + 49999.93 = 194247.78
Seq Scan on public.aaa (cost=0.00..194247.77 rows=10090 width=8) (actual time=0.700..2194.195 rows=10114 loops=1) read=44248
Seq Scan on public.aaa (cost=0.00..194247.77 rows=10090 width=8) (actual time=0.145..1401.274 rows=10114 loops=1) hit=32 read=44216
Seq Scan on public.aaa (cost=0.00..194247.77 rows=10090 width=8) (actual time=0.185..1602.002 rows=10114 loops=1) hit=64 read=44184
Seq Scan on public.aaa (cost=0.00..194247.77 rows=10090 width=8) (actual time=0.184..1353.162 rows=10114 loops=1) hit=96 read=44152
44248.00 + 0.00 + 99999.85 + 0.00 + 49999.93 = 194247.78
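[Editor's note: the seq-scan component sums above can be checked directly; a sketch assuming the reltuples figure of 9,999,985 reported by pg_class earlier in the thread, and two quals each charged one cpu_operator_cost per tuple.]

```python
# Seq Scan cost identity:
#   total = pages*seq_page_cost
#         + tuples*cpu_tuple_cost
#         + tuples*n_quals*cpu_operator_cost
pages, tuples = 44248, 9_999_985
seq_page_cost, cpu_tuple_cost, cpu_operator_cost = 1.0, 0.01, 0.0025
n_quals = 2  # "num = 1 and flag = 1"

total = (pages * seq_page_cost
         + tuples * cpu_tuple_cost
         + tuples * n_quals * cpu_operator_cost)
print(f"{total:.2f}")  # ~194247.78, the estimate shown for every Seq Scan above
```

The three non-zero terms are exactly the 44248.00, 99999.85, and 49999.93 in the sums above; random_page_cost and cpu_index_tuple_cost contribute nothing to a sequential scan.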
Index Scan using i1 on public.aaa (cost=0.43..67358.15 rows=101403 width=8) (actual time=0.400..1325.638 rows=99600 loops=1) read=46983
Index Scan using i1 on public.aaa (cost=0.43..67358.15 rows=101403 width=8) (actual time=0.020..479.713 rows=99600 loops=1) hit=46983
Index Scan using i1 on public.aaa (cost=0.43..67358.15 rows=101403 width=8) (actual time=0.024..642.947 rows=99600 loops=1) hit=46983
Index Scan using i1 on public.aaa (cost=0.43..67358.15 rows=101403 width=8) (actual time=0.038..756.045 rows=99600 loops=1) hit=46983
45.95 + 46638.42 + 10336.67 + 5168.34 + 5168.77 = 67358.15
Rows Removed by Filter: 900156
Index Scan using i1 on public.aaa (cost=0.43..48809.34 rows=19644 width=8) (actual time=0.600..1059.269 rows=20114 loops=1) read=44373
Index Scan using i1 on public.aaa (cost=0.43..48809.34 rows=19644 width=8) (actual time=0.071..208.940 rows=20114 loops=1) hit=44373
Index Scan using i1 on public.aaa (cost=0.43..48809.34 rows=19644 width=8) (actual time=0.044..124.437 rows=20114 loops=1) hit=44373
Index Scan using i1 on public.aaa (cost=0.43..48809.34 rows=19644 width=8) (actual time=0.044..127.814 rows=20114 loops=1) hit=44373
0.23 + 44788.68 + 2010.00 + 1005.00 + 1005.43 = 48809.34
Rows Removed by Filter: 179792
Index Scan using i1 on public.aaa (cost=0.43..46544.80 rows=10090 width=8) (actual time=0.647..1510.482 rows=10114 loops=1) read=39928
Index Scan using i1 on public.aaa (cost=0.43..46544.80 rows=10090 width=8) (actual time=0.035..141.847 rows=10114 loops=1) hit=39928
Index Scan using i1 on public.aaa (cost=0.43..46544.80 rows=10090 width=8) (actual time=0.032..86.716 rows=10114 loops=1) hit=39928
Index Scan using i1 on public.aaa (cost=0.43..46544.80 rows=10090 width=8) (actual time=0.032..79.492 rows=10114 loops=1) hit=39928
0.01 + 44524.36 + 1010.00 + 505.00 + 505.44 = 46544.81
Rows Removed by Filter: 89762
Index Scan using i2 on public.aaa (cost=0.43..39166.34 rows=101403 width=8) (actual time=1.337..1543.611 rows=99600 loops=1) read=46985
Index Scan using i2 on public.aaa (cost=0.43..39166.34 rows=101403 width=8) (actual time=0.027..410.623 rows=99600 loops=1) hit=46985
Index Scan using i2 on public.aaa (cost=0.43..39166.34 rows=101403 width=8) (actual time=0.025..377.529 rows=99600 loops=1) hit=46985
Index Scan using i2 on public.aaa (cost=0.43..39166.34 rows=101403 width=8) (actual time=0.025..377.554 rows=99600 loops=1) hit=46985
2979.08 + 16566.83 + 9810.00 + 4905.00 + 4905.44 = 39166.35
Rows Removed by Filter: 900835
Index Scan using i2 on public.aaa (cost=0.43..39247.83 rows=19644 width=8) (actual time=0.783..2236.765 rows=20114 loops=1) read=46985
Index Scan using i2 on public.aaa (cost=0.43..39247.83 rows=19644 width=8) (actual time=0.198..777.279 rows=20114 loops=1) hit=46985
Index Scan using i2 on public.aaa (cost=0.43..39247.83 rows=19644 width=8) (actual time=0.196..437.177 rows=20114 loops=1) hit=46985
Index Scan using i2 on public.aaa (cost=0.43..39247.83 rows=19644 width=8) (actual time=0.075..346.481 rows=20114 loops=1) hit=46985
2949.05 + 16751.68 + 9773.33 + 4886.66 + 4887.10 = 39247.82
Rows Removed by Filter: 980350
Index Scan using i2 on public.aaa (cost=0.43..40167.34 rows=10090 width=8) (actual time=0.619..1350.886 rows=10114 loops=1) read=46987
Index Scan using i2 on public.aaa (cost=0.43..40167.34 rows=10090 width=8) (actual time=0.069..448.975 rows=10114 loops=1) hit=46987
Index Scan using i2 on public.aaa (cost=0.43..40167.34 rows=10090 width=8) (actual time=0.048..383.066 rows=10114 loops=1) hit=46987
Index Scan using i2 on public.aaa (cost=0.43..40167.34 rows=10090 width=8) (actual time=0.050..387.874 rows=10114 loops=1) hit=46987
2974.39 + 17212.52 + 9990.00 + 4995.00 + 4995.44 = 40167.35
Rows Removed by Filter: 991389
What seems odd to me is that in different kinds of tests (with different frequency of column values):
i1 Rows Removed by Filter = 900156, 179792, 89762 (decreased a lot)
i1 buffers = 46983, 44373, 39928 (decreased, but not a lot)
i1 best case time = 756.045, 127.814, 79.492 (decreased a lot, as well as probably average case too)
i1 cost estimates = 67358.15, 48809.34, 46544.80 (did not decrease a lot)
i2 Rows Removed by Filter = 900835, 980350, 991389
i2 buffers = 46985, 46985, 46987
i2 best case time = 377.554, 346.481, 387.874
i2 cost estimates = 39166.34, 39247.83, 40167.34
It's odd that the increase in actual execution time for "i1" was not reflected enough in the cost estimates. The cost didn't even go below the "i2" costs.
Details about the test are attached.
Regards,
Vitaliy
On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > On Dec 3, 2017 15:31, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> >> Jeff Janes <jeff.janes@gmail.com> writes:
> >>> But I do see that ties within the logical order of the column values are
> >>> broken to agree with the physical order.  That is wrong, right?  Is there
> >>> any argument that this is desirable?
>
> >> Uh ... what do you propose doing instead?  We'd have to do something with
> >> ties, and it's not so obvious this way is wrong.
>
> > Let them be tied. ...
>
> I thought some more about this.  What we really want the correlation stat
> to do is help us estimate how randomly an index-ordered scan will access
> the heap.  If the values we've sampled are all unequal then there's no
> particular issue.  However, if we have some group of equal values, we
> do not really know what order an indexscan will visit them in.  The
> existing correlation calculation is making the *most optimistic possible*
> assumption, that such a group will be visited exactly in heap order ---
> and that assumption isn't too defensible.

I'm interested in discussion regarding bitmap cost, since it would have
helped our case discussed here ~18 months ago:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com

...but remember: in Vitaliy's case (as opposed to mine), the index scan is
*faster* but being estimated at higher cost than bitmap (I have to keep
reminding myself).  So the rest of this discussion is about the
overly-optimistic cost estimate of index scans, which moves in the opposite
direction for this reported problem.  For the test cases I looked at, index
scans were used when RPC=1 and redundant conditions were avoided, so I'm
not sure if there's any remaining issue (but I haven't looked at the latest
cases Vitaliy sent).
> In any case, given that we do this calculation without regard
> to any specific index,

One solution is to compute stats (at least correlation) for all indices,
not just expression indexes.  I did that earlier this year while throwing
around/out ideas:
https://www.postgresql.org/message-id/20170707234119.GN17566%40telsasoft.com

> We do have an idea, from the data we have, whether the duplicates are close
> together in the heap or spread all over.

I think you just mean pg_stats.correlation for all values, not just
duplicates (with the understanding that duplicates might be a large
fraction of the tuples, and high weight in correlation).

Another issue I noted in an earlier thread is that as table sizes increase,
the existing correlation computation approaches 1 for correlated insertions
(like "append-only" timestamps clustered around now()), due to ANALYZE
sampling a fraction of the table, and thereby representing only large-scale
correlation, and, to an increasing degree, failing to represent small-scale
variations between adjacent index TIDs, which has real cost (and for which
the mitigation by cache effects probably decreases WRT table size, too).
I think any solution needs to handle this somehow.

Generated data demonstrating this (I reused this query so it's more
complicated than it needs to be):

[pryzbyj@database ~]$ time for sz in 9999{,9{,9{,9{,9}}}} ; do psql postgres -tc "DROP TABLE IF EXISTS t; CREATE TABLE t(i float, j int); CREATE INDEX ON t(i); INSERT INTO t SELECT i/99999.0+pow(2,(-random())) FROM generate_series(1,$sz) i ORDER BY i; ANALYZE t; SELECT $sz, correlation, most_common_freqs[1] FROM pg_stats WHERE attname='i' AND tablename='t'"; done

    9999 | 0.187146 |
   99999 | 0.900629 |
  999999 | 0.998772 |
 9999999 | 0.999987 |

Trying to keep it all in my own head: for a sufficiently large number of
pages, a bitmap scan should be preferred to an index scan due to the reduced
random-page cost outweighing its overhead in CPU cost.  Probably by
penalizing index scans, not discounting bitmap scans.  Conceivably a
correlation adjustment can be conditionalized or weighted based on
index_pages_fetched() ... x = ln(x/999999); if (x > 1) correlation /= x;

I think one could look at the fraction of duplicated index keys expected to
be returned: if we expect to return 1000 tuples, with 200 duplicates from
MCV, cost_index would multiply correlation by (1 - 200/1000), meaning to
use something closer to max_IO_cost rather than min_IO_cost.  I imagine
it'd be possible to limit to only those MCVs which pass quals - if none
pass, then there may be few tuples returned, so apply no correction to
(additionally) penalize index scan.

In my tests, at one point I implemented idx_corr_fudge(), returning a value
like "fragmentation" from pgstatindex (which I'm sure is where I got the
phrase when reporting the problem).  That only uses the leaf nodes' "next"
pointer, and not the individual tuples, which probably works if there's a
sufficient number of repeated keys.

I think that's all for now..

Justin
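[Editor's note: Justin's shell demo can be approximated outside PostgreSQL. This sketch is an illustration, not ANALYZE's actual estimator: it generates the same "append-only trend plus O(1) noise" data and computes an ordinary Pearson correlation of value against insertion position, showing the figure climbing toward 1 as n grows even though the local, random-IO-causing disorder stays constant.]

```python
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(0)
corrs = {}
for n in (9_999, 99_999, 999_999):
    # i/99999.0 is the slow "append-only" trend; 2**-random() is local noise.
    vals = [i / 99999.0 + 2 ** -random.random() for i in range(n)]
    corrs[n] = pearson(vals, list(range(n)))
    print(n, round(corrs[n], 3))
```

The trend term grows linearly with n while the noise stays bounded, so the correlation climbs from roughly 0.2 toward 1, tracking the 0.187 / 0.901 / 0.999 column in the psql output above.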
> What seems odd to me is that in different kinds of tests (with
> different frequency of column values):
>
> i1 Rows Removed by Filter = 900156, 179792, 89762 (decreased a lot)
> i1 buffers = 46983, 44373, 39928 (decreased, but not a lot)
> i1 best case time = 756.045, 127.814, 79.492 (decreased a lot, as well
> as probably average case too)
> i1 cost estimates = 67358.15, 48809.34, 46544.80 (did not decrease a lot)
>
> i2 Rows Removed by Filter = 900835, 980350, 991389
> i2 buffers = 46985, 46985, 46987
> i2 best case time = 377.554, 346.481, 387.874
> i2 cost estimates = 39166.34, 39247.83, 40167.34
>
> It's odd that increase in actual execution time for "i1" was not
> reflected enough in cost estimates. The cost even didn't go below "i2"
> costs.

I've added some logging, in order to get the actual numbers which were used for estimation.

--drop table if exists aaa;
--create table aaa as select floor(random()*100)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
--analyze aaa;
--set enable_bitmapscan = off; set enable_indexscan = on; set enable_seqscan = off;
--set seq_page_cost = 1.0; set random_page_cost = 1.0; set cpu_tuple_cost = 0.01; set cpu_index_tuple_cost = 0.005; set cpu_operator_cost = 0.0025;

--create index i1 on aaa (num);
--drop index if exists i2;
--explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = 1;
Index Scan using i1 on public.aaa  (cost=0.43..46697.59 rows=10641 width=8) (actual time=0.047..153.521 rows=9826 loops=1)
  Rows Removed by Filter: 89948

--drop index if exists i1;
--create index i2 on aaa (flag);
--explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = 1;
Index Scan using i2 on public.aaa  (cost=0.43..39583.11 rows=10641 width=8) (actual time=0.098..351.454 rows=9826 loops=1)
  Rows Removed by Filter: 990249

LOG:  cost_index:
seq_page_cost=1.00, random_page_cost=1.00, cpu_tuple_cost=0.0100, cpu_index_tuple_cost=0.0050, cpu_operator_cost=0.0025, effective_cache_size=131072
indexStartupCost=0.43, indexTotalCost=1103.94, indexSelectivity=0.01076667, indexCorrelation=0.00208220
baserel->tuples=10000033.00, baserel->pages=44248.00, baserel->allvisfrac=0.00000000
tuples_fetched=107667.00, pages_fetched=477.00
max_IO_cost=44248.0000, min_IO_cost=477.0000, csquared=0.0000
qpqual_cost.startup=0.0000, qpqual_cost.per_tuple=0.0025, cpu_per_tuple=0.0125
spc_seq_page_cost=1.00, spc_random_page_cost=1.00
startup_cost=0.43, total_cost=46697.59

LOG:  cost_index:
seq_page_cost=1.00, random_page_cost=1.00, cpu_tuple_cost=0.0100, cpu_index_tuple_cost=0.0050, cpu_operator_cost=0.0025, effective_cache_size=131072
indexStartupCost=0.43, indexTotalCost=10123.93, indexSelectivity=0.09883333, indexCorrelation=0.82505685
baserel->tuples=10000000.00, baserel->pages=44248.00, baserel->allvisfrac=0.00000000
tuples_fetched=988333.00, pages_fetched=4374.00
max_IO_cost=44248.0000, min_IO_cost=4374.0000, csquared=0.6807
qpqual_cost.startup=0.0000, qpqual_cost.per_tuple=0.0025, cpu_per_tuple=0.0125
spc_seq_page_cost=1.00, spc_random_page_cost=1.00
startup_cost=0.43, total_cost=39583.11

Here is a breakdown of the total_cost into components, for the i1 query and for the i2 query (some rounding was removed from the formula for brevity):

path->path.total_cost =
  (indexTotalCost + qpqual_cost.startup) +
  (max_IO_cost + csquared * (min_IO_cost - max_IO_cost)) +
  (cpu_tuple_cost + qpqual_cost.per_tuple) * (indexSelectivity * baserel->tuples);

i1: path->path.total_cost =
  1103.94 + 0.0000 +                                //  1103.94 +
  44248.0000 + 0.0000 * (477.0000 - 44248.0000) +   // 44248.00 +
  (0.0100 + 0.0025) * (0.01076667 * 10000033.00)    //  1345.84
  = 46697.78;                                       // = 46697.78

i2: path->path.total_cost =
  10123.93 + 0.0000 +                               // 10123.93 +
  44248.0000 + 0.6807 * (4374.0000 - 44248.0000) +  // 17105.77 +
  (0.0100 + 0.0025) * (0.09883333 * 10000000.00)    // 12354.17
  = 39583.86;                                       // = 39583.86

PS. The code used for logging (added to cost_index() in src/backend/optimizer/path/costsize.c, PostgreSQL 9.3.1):

ereport(LOG,
        (errmsg("cost_index: \n"
                "seq_page_cost=%.2f, random_page_cost=%.2f, cpu_tuple_cost=%.4f, cpu_index_tuple_cost=%.4f, cpu_operator_cost=%.4f, effective_cache_size=%.0f\n"
                "indexStartupCost=%.2f, indexTotalCost=%.2f, indexSelectivity=%.8f, indexCorrelation=%.8f\n"
                "baserel->tuples=%.2f, baserel->pages=%.2f, baserel->allvisfrac=%.8f\n"
                "tuples_fetched=%.2f, pages_fetched=%.2f\n"
                "max_IO_cost=%.4f, min_IO_cost=%.4f, csquared=%.4f\n"
                "qpqual_cost.startup=%.4f, qpqual_cost.per_tuple=%.4f, cpu_per_tuple=%.4f\n"
                "spc_seq_page_cost=%.2f, spc_random_page_cost=%.2f\n"
                "startup_cost=%.2f, total_cost=%.2f\n",
                seq_page_cost, random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost, (double) effective_cache_size,
                indexStartupCost, indexTotalCost, indexSelectivity, indexCorrelation,
                baserel->tuples, (double) baserel->pages, baserel->allvisfrac,
                tuples_fetched, pages_fetched,
                max_IO_cost, min_IO_cost, csquared,
                qpqual_cost.startup, qpqual_cost.per_tuple, cpu_per_tuple,
                spc_seq_page_cost, spc_random_page_cost,
                startup_cost, startup_cost + run_cost)));

Regards,
Vitaliy
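The breakdown above can be double-checked with a small standalone sketch. This is not PostgreSQL code, just the same formula re-typed with the constants taken from the two log lines; the function name is mine:

```python
def index_scan_total_cost(index_total_cost, qpqual_startup,
                          max_io_cost, min_io_cost, csquared,
                          cpu_tuple_cost, qpqual_per_tuple,
                          index_selectivity, baserel_tuples):
    """Simplified rendering of the total_cost formula from cost_index()."""
    # IO cost interpolates between the worst case (random order) and
    # best case (perfectly correlated) using csquared = correlation^2.
    io_cost = max_io_cost + csquared * (min_io_cost - max_io_cost)
    # CPU cost is charged per heap tuple fetched via the index.
    cpu_cost = (cpu_tuple_cost + qpqual_per_tuple) * \
               (index_selectivity * baserel_tuples)
    return index_total_cost + qpqual_startup + io_cost + cpu_cost

# Constants from the two LOG lines above:
i1 = index_scan_total_cost(1103.94, 0.0, 44248.0, 477.0, 0.0,
                           0.01, 0.0025, 0.01076667, 10000033.0)
i2 = index_scan_total_cost(10123.93, 0.0, 44248.0, 4374.0, 0.6807,
                           0.01, 0.0025, 0.09883333, 10000000.0)
print(round(i1, 2), round(i2, 2))  # 46697.78 39583.86, matching the breakdown
```

Note how i2's large indexTotalCost and CPU component are more than paid for by the csquared discount on its IO cost, which is what lets the slower plan win on estimated cost.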
On Sun, Dec 3, 2017 at 1:15 PM, Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
On 02/12/2017 23:17, Jeff Janes wrote:

> I can't reproduce the exact issue on the real database any more. The query
> started to use the slow bitmap scan recently, and had been doing so for some
> time lately, but now it's switched back to using the index scan. The table
> involved in the query gets modified a lot. It has hundreds of millions of
> rows. Lots of new rows are appended to it every day, the oldest rows are
> sometimes removed. The table is analyzed at least daily. It's possible that
> the statistics were updated and that caused the query to run differently. But
> I still would like to understand why that issue happened, and how to properly
> fix it, in case the issue returns.

Right, so there is a CPU costing problem (which could only be fixed by hacking postgresql and recompiling it), but it is a much smaller problem than the IO cost not being accurate due to the high hit rate. Fixing the CPU costing problem is unlikely to make a difference to your real query. If you set the page costs to zero, what happens to your real query?
While your test case displays some cost estimation issues, there is really no reason to think that they are the same issues your real query shows. Particularly since you said the difference was a factor of 30 in the real case, rather than 3. Any chance you can show EXPLAIN ANALYZE output for the real query, both when it is acting up and when it is not? Something in the plans might stand out to us as the obvious problem. On the other hand, maybe nothing will stand out without having a replicable test case. The only way to know is to try.
> In the real database, not everything is in memory. There are 200GB+ of RAM,
> but the DB is 500GB+. The table involved in the query is itself 60GB+ of data
> and 100GB+ of indexes. I'm running the test case in a way where all reads are
> done from RAM, only to make it easier to reproduce and to avoid unrelated
> effects.
>
> But I doubt that the settings seq_page_cost = random_page_cost = 0.0 should
> actually be used.

Why not? If your production server really has everything in memory during normal operation, that is the correct course of action. If you ever restart the server, then you could have some unpleasant time getting it back up to speed again, but pg_prewarm could help with that.
Is everything that the particular query in question needs in memory, even if other queries need things from disk? Or does the problematic query also need things from disk? If the query does need to read things from disk, the bitmap actually should be faster. Which reinforces the idea that maybe the issue brought up by your test case is not the same as the issue brought up by your real case, even if they both point in the same direction.
As far as I know, costs in Postgres were designed to be relative to seq_page_cost, which for that reason is usually defined as 1.0. Even if everything were in RAM, accesses to the pages would still not have zero cost. Setting 0.0 just seems too extreme, as all other non-zero costs would become infinitely bigger.
When exploring things, 0.0 certainly helps to simplify things. Yes, 0.05 or something similar might be better for a completely cached database. The problem is that it is very context dependent. Reading a page from shared_buffers when there is no contention from other processes for the same page is probably less than 0.01. If it is not in shared_buffers but is in effective_cache_size, it is probably a few multiples of 0.01. If there is contention either for that specific page, or for available buffers into which to read pages, then it could be substantially higher yet. But none of those are things the planner is aware of.
> Exactly. I'd like to understand why the worse plan is being chosen, and 1) if
> it's fixable by tuning costs, to figure out the right settings which could be
> used in production, 2) if there is a bug in the Postgres optimizer, then to
> bring some attention to it, so that it's eventually fixed in a future
> release, 3) if Postgres is supposed to work this way, then at least I (and
> people who ever read this thread) would understand it better.

If you really want to target the plan with the BitmapAnd, you should increase cpu_index_tuple_cost and/or cpu_operator_cost but not increase cpu_tuple_cost. That is because the unselective bitmap index scan does not incur any cpu_tuple_cost, but does incur index_tuple and operator costs. Unfortunately, all other index scans in the system will also be skewed by such a change if you make the change system-wide.
I would argue that it is a planner "bug" (quotes because it doesn't give wrong answers, just sub-optimal plans), but one that is very hard to pin down, and one that also depends on the hardware you are running on. Also, people have made some optimizations to the machinery behind the bitmap code recently, as well as to the costing of the bitmap code, so if it is a bug, the size of it changes with the version you are using. If your aim is to improve the planner (rather than simply tuning the planner that currently exists) then you should probably 1) make your test case use random number generators, rather than modulus, to avoid cross-column correlation and other such issues, 2) run it against 11dev code, which is where improvements to PostgreSQL are targeted, rather than against production versions, and 3) post to pgsql-hackers, rather than pgsql-performance.
Cheers,
Jeff
On Tue, Dec 5, 2017 at 10:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Dec 3, 2017 15:31, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> But I do see that ties within the logical order of the column values are
>>> broken to agree with the physical order. That is wrong, right? Is there
>>> any argument that this is desirable?
>> Uh ... what do you propose doing instead? We'd have to do something with
>> ties, and it's not so obvious this way is wrong.
> Let them be tied. If there are 10 distinct values, number the values 0 to
> 9, and all rows of a given distinct values get the same number for the
> logical order axis.
> Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
> me. Although if we switched btree to store duplicate values with tid as a
> tie breaker, then maybe it wouldn't be as obviously wrong.
I thought some more about this. What we really want the correlation stat
to do is help us estimate how randomly an index-ordered scan will access
the heap.
The correlation is used in another place, estimating how much of the table we will visit in the first place. If the correlation is very high, then scanning 10% of the index leaf pages means we will visit 10% of the table. If the correlation is low, then we use Mackert and Lohman, and (in the case of visiting 10% of the index) predict we will visit most of the table. Assuming effective_cache_size is high, we will visit most of the table just once, but still in a random order, because subsequent visits for the same query will be found in the cache, rather than the various pages being visited repeatedly and not found in cache each time.
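The Mackert and Lohman estimate mentioned here can be sketched as follows. This is a simplified rendering modeled on index_pages_fetched() in costsize.c (function and variable names are mine), fed with the constants from the logging output earlier in the thread (T=44248 heap pages, effective_cache_size=131072 pages):

```python
import math

def index_pages_fetched(tuples_fetched, pages, cache_pages):
    """Mackert-Lohman approximation of distinct heap pages fetched by an
    uncorrelated index scan, modeled on index_pages_fetched() in costsize.c."""
    T = max(pages, 1.0)        # pages in table
    b = max(cache_pages, 1.0)  # cache size in pages
    Ns = tuples_fetched        # N * selectivity
    if T <= b:
        # Table fits in cache: each page needs to be read at most once.
        pf = (2.0 * T * Ns) / (2.0 * T + Ns)
        return math.ceil(min(pf, T))
    # Table exceeds cache: beyond the limit, pages start being re-read.
    lim = (2.0 * T * b) / (2.0 * T - b)
    if Ns <= lim:
        pf = (2.0 * T * Ns) / (2.0 * T + Ns)
    else:
        pf = b + (Ns - lim) * (T - b) / T
    return math.ceil(pf)

# With the thread's numbers, both queries are predicted to touch
# essentially the whole heap (capped at T = 44248 pages):
print(index_pages_fetched(107667, 44248, 131072))  # i1 -> 44248
print(index_pages_fetched(988333, 44248, 131072))  # i2 -> 44248
```

Which is exactly why max_IO_cost came out as 44248.0000 in both log lines: at these selectivities the uncorrelated case is predicted to visit every heap page once.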
In addition to estimating how much of the table we visit, we also estimate how "sequential like" those visits are. Which is the use that you describe. Ideally for that use case, we would know for each distinct value, how correlated the tids are with the leaf page ordering. If the index is freshly built, that is very high. We visit 1/10 of the index, which causes us to visit 100% of the table but in perfect order, plucking 1/10 of the tuples from each table page.
But visiting 100% of the table in physical order in order to pluck out 10% of the tuples from each page is quite different than visiting 10% of the table pages in physical order to pluck out 100% of the tuples from those pages and 0% from the pages not visited.
...
BTW, I disagree that "correlation = zero" is the right answer for this
particular example. If the btree is freshly built, then an index-order
scan would visit all the heap pages in sequence to fetch "f" rows, and
then visit them all in sequence again to fetch "t" rows, which is a whole
lot better than the purely random access that zero correlation implies.
So I think 0.8 or so is actually a perfectly reasonable answer when the
index is fresh. The trouble is just that it'd behoove us to derate that
answer somewhat for the probability that the index isn't fresh.
But, for the case of "how much of the table do we visit at all", correlation = zero is the right answer, even if it isn't the right answer for "how sequentially do we visit whatever we visit".
My first thought for a concrete fix was to use the mean position of
a group of duplicates for the purposes of the correlation calculation,
but on reflection that's clearly wrong. We do have an idea, from the
data we have, whether the duplicates are close together in the heap
or spread all over. Using only mean position would fail to distinguish
those cases, but really we'd better penalize the spread-all-over case.
I'm not sure how to do that.
Departing from correlations, we could also try to estimate "How many different table pages does each index leaf page reference". This could capture functional dependencies which are strong, but not in the form of linear correlations. (The current extended statistics only captures dependencies between user columns, not between one user column and one system column such as table slot)
For whatever its worth, here is my "let ties be ties" patch.
It breaks two regression tests due to plan changes, and both are cases where maybe the plan ought to change for the very reason being discussed. If I just put random gibberish into the correlation field, more regression tests fail, so I think my implementation is not too far broken.
The accumulations into corr_ysum and corr_y2sum could trivially be pushed down into the "if", and corr_xysum could as well with a little algebra. But that seems like premature optimization for a proof-of-concept patch.
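For a rough illustration of what is at stake (a toy sketch, not the patch itself), here is a comparison of the two tie-handling choices on an interleaved two-valued column, similar in spirit to the flag column in the test case. Ranks, names, and the n=1000 setup are mine:

```python
def pearson(xs, ys):
    # Plain Pearson linear correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

n = 1000
# Physical (heap) order: two values perfectly interleaved.
physical = list(range(n))
values = [i % 2 for i in range(n)]

# Current behavior: ties in the logical order are broken by physical
# position, so duplicates look perfectly heap-ordered.
order = sorted(physical, key=lambda i: (values[i], i))
rank_broken = [0] * n
for rank, i in enumerate(order):
    rank_broken[i] = rank

# "Let ties be ties": every duplicate gets the same logical rank.
rank_tied = values

print(pearson(physical, rank_broken))  # ~0.5: looks fairly correlated
print(pearson(physical, rank_tied))    # ~0.0: ties left tied
```

So the same perfectly interleaved data reads as substantially correlated under the current tie-breaking, and as uncorrelated when ties are left tied, which is the crux of the disagreement about whether 0.8 or 0.0 is the "right" answer.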
Cheers,
Jeff
Attachment
On Tue, Dec 5, 2017 at 11:06 PM, Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
This is very cool, thanks.
I've tried to create a better test case:
- Increase shared_buffers and effective_cache_size to fit whole database, including indexes.
- Use random(), to avoid correlation between the filtered values.
- Make both columns of integer type, to avoid special cases with boolean (like the one happened with CREATE STATISTICS).
- Flush OS disk cache and then try running the query several times, to get both cold-cache results and all-in-ram results.
- There are several tests, with different frequency of the selected values in the two columns: [1/10, 1/10], [1/50, 1/10], [1/100, 1/10].
- There is a split of cost by contribution of each of its components: seq_page_cost, random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost. The EXPLAIN is run for each component, every time with only one of the components set to non-zero.
Where you have question marks, that means you could not force it into the plan you wanted with all-but-one settings being zero?
- The test was run on a Digitalocean VM: Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-101-generic x86_64), 2 GB RAM, 2 core CPU, SSD; PostgreSQL 9.5.10.
shared_buffers = 1024MB
effective_cache_size = 1024MB
I would set this even higher.
work_mem = 100MB
create table aaa as select floor(random()*10)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create table aaa as select floor(random()*50)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create table aaa as select floor(random()*100)::int num, (random()*10 < 1)::int flag from generate_series(1, 10000000) id;
create index i1 on aaa (num);
create index i2 on aaa (flag);
set enable_bitmapscan = on; set enable_indexscan = off; set enable_seqscan = off;
set enable_bitmapscan = off; set enable_indexscan = on; set enable_seqscan = off;
set enable_bitmapscan = off; set enable_indexscan = off; set enable_seqscan = on;
set seq_page_cost = 1.0; set random_page_cost = 1.0; set cpu_tuple_cost = 0.01; set cpu_index_tuple_cost = 0.005; set cpu_operator_cost = 0.0025;
explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = 1;
One thing to try is to use explain (analyze, timing off), and then get the total execution time from the summary line at the end of the explain, rather than from the "actual time" fields. Collecting the times of each individual step in the execution can impose a lot of overhead, and some plans have more of this artificial overhead than others. It might change the results, or it might not. I know that sorts and hash joins are very sensitive to this, but I don't know about bitmap scans.
....
What seems odd to me is that in different kinds of tests (with different frequency of column values):
i1 Rows Removed by Filter = 900156, 179792, 89762 (decreased a lot)
i1 buffers = 46983, 44373, 39928 (decreased, but not a lot)
To filter out 89762 tuples, you first have to look them up in the table, and since they are randomly scattered that means you hit nearly every page in the table at least once. In fact, I don't understand how the empirical number of buffer hits can be only 46983 in the first case, when it has to visit 1,000,000 rows (and reject 90% of them). I'm guessing that it is because your newly created index is sorted by ctid order within a given index value, and that the scan holds a pin on the table page between visits, and so doesn't count as a hit if it already holds the pin.
You could try to create an empty table, create the indexes, then populate the table with your random select query, to see if that changes the buffer hit count. (Note that this wouldn't change the cost estimates much even it does change the measured number of buffer hits, because of effective_cache_size. It knows you will be hitting ~47,000 pages ~25 times each, and so only charges you for the first time each one is hit.)
i1 best case time = 756.045, 127.814, 79.492 (decreased a lot, as well as probably average case too)
i1 cost estimates = 67358.15, 48809.34, 46544.80 (did not decrease a lot)
Right. Your best case times are when the data is completely in cache. But your cost estimates are dominated by *_page_cost, which are irrelevant when the data is entirely in cache. You are telling it to estimate the worse-case costs, and it is doing that pretty well (within this one plan).
i2 Rows Removed by Filter = 900835, 980350, 991389
i2 buffers = 46985, 46985, 46987
i2 best case time = 377.554, 346.481, 387.874
i2 cost estimates = 39166.34, 39247.83, 40167.34
It's odd that increase in actual execution time for "i1" was not reflected enough in cost estimates.
No, that's entirely expected given your settings. As long as you are charging disk-read costs for reading data from RAM, you will never get realistic cost estimates. Remember, you aren't trying to tune your production server here, you are trying to get a test case that can be dissected.
Perhaps you think that effective_cache_size is supposed to fix this for you. But it only accounts for blocks hit repeatedly within the same query. It has no idea that you are running a bunch of other queries on the same data back to back, and so it will likely find that data already in memory from one query to the next. That knowledge (currently) has to be baked into your *_page_cost setting. There is also no way to say that data for one table is more likely to be found in cache than data for another table is.
The cost even didn't go below "i2" costs.
That is partially because i2 benefits from a spuriously large correlation due to an artifact of how stats are computed. See another thread that has spun off of this one.
If you want to see the effect of this on your cost estimates, you can do something like:
update pg_statistic set stanumbers2='{0}' where starelid='aaa'::regclass and staattnum=2;
Cheers,
Jeff
On Wed, Dec 6, 2017 at 1:46 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > On Dec 3, 2017 15:31, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> >> Jeff Janes <jeff.janes@gmail.com> writes:
> >>> But I do see that ties within the logical order of the column values are
> >>> broken to agree with the physical order. That is wrong, right? Is there
> >>> any argument that this is desirable?
>
> >> Uh ... what do you propose doing instead? We'd have to do something with
> >> ties, and it's not so obvious this way is wrong.
>
> > Let them be tied.
...
> I thought some more about this. What we really want the correlation stat
> to do is help us estimate how randomly an index-ordered scan will access
> the heap. If the values we've sampled are all unequal then there's no
> particular issue. However, if we have some group of equal values, we
> do not really know what order an indexscan will visit them in. The
> existing correlation calculation is making the *most optimistic possible*
> assumption, that such a group will be visited exactly in heap order ---
> and that assumption isn't too defensible.
I'm interested in discussion regarding bitmap cost, since it would have helped
our case discussed here ~18 months ago:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com
...but remember: in Vitaliy's case (as opposed to mine), the index scan is
*faster* but being estimated at higher cost than bitmap (I have to keep
reminding myself). So the rest of this discussion is about the
overly-optimistic cost estimate of index scans, which moves in the opposite
direction for this reported problem. For the test cases I looked at, index
scans were used when RPC=1 and redundant conditions were avoided, so I'm not
sure if there's any remaining issue (but I haven't looked at the latest cases
Vitaliy sent).
> In any case, given that we do this calculation without regard
> to any specific index,
One solution is to compute stats (at least correlation) for all indices, not
just expr inds. I did that earlier this year while throwing around/out ideas.
https://www.postgresql.org/message-id/20170707234119.GN17566%40telsasoft.com
When is the correlation of a column which is not the leading column of a btree index or in a brin index ever used? If we did compute index-specific correlations, we could maybe just drop pure-column correlations.
> We do have an idea, from the data we have, whether the duplicates are close
> together in the heap or spread all over.
I think you just mean pg_stats.correlation for all values, not just duplicates
(with the understanding that duplicates might be a large fraction of the
tuples, and high weight in correlation).
Another issue I noted in an earlier thread is that as table sizes increase, the
existing correlation computation approaches 1 for correlated insertions, (like
"append-only" timestamps clustered around now()), due to ANALYZE sampling a
fraction of the table, and thereby representing only large-scale correlation.
That isn't due to sampling. That is due to the definition of linear correlation. Large scale is what it is about.
Generated data demonstrating this (I reused this query so it's more complicated
than it needs to be):
[pryzbyj@database ~]$ time for sz in 9999{,9{,9{,9{,9}}}} ; do psql postgres -tc "DROP TABLE IF EXISTS t; CREATE TABLE t(i float, j int); CREATE INDEX ON t(i);INSERT INTO t SELECT i/99999.0+pow(2,(-random())) FROM generate_series(1,$sz) i ORDER BY i; ANALYZE t; SELECT $sz, correlation, most_common_freqs[1] FROM pg_stats WHERE attname='i' AND tablename='t'"; done
9999 | 0.187146 |
99999 | 0.900629 |
999999 | 0.998772 |
9999999 | 0.999987 |
Because the amount of jitter introduced is constant WRT $sz, but the range of i/99999.0 increases with $sz, the correlation actually does increase; it is not a sampling effect.
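This can be checked analytically. The jitter term pow(2,-random()) has a fixed standard deviation (about 0.144, from the closed-form moments of 2^-U for uniform U), while the i/99999.0 ramp's standard deviation grows linearly with $sz, so the expected linear correlation is just signal sd over total sd. A back-of-the-envelope model (my function names; it ignores ANALYZE sampling noise entirely, which is the point):

```python
import math

def jitter_sd():
    # E[2**-U] and E[4**-U] over U ~ Uniform(0,1), in closed form.
    m1 = 0.5 / math.log(2.0)         # mean of 2**-U
    m2 = 0.75 / math.log(4.0)        # mean of (2**-U)**2
    return math.sqrt(m2 - m1 * m1)   # ~0.1438

def expected_correlation(sz):
    # sd of the linear ramp i/99999.0 over i = 1..sz (uniform over its range).
    ramp_sd = (sz / 99999.0) / math.sqrt(12.0)
    j = jitter_sd()
    return ramp_sd / math.sqrt(ramp_sd ** 2 + j ** 2)

for sz in (9999, 99999, 999999, 9999999):
    print(sz, round(expected_correlation(sz), 6))
```

The model reproduces the measured values (0.187, 0.901, 0.9988, 0.99999) to within a couple of percent, confirming that the march toward 1 is inherent in the definition of linear correlation, not a sampling artifact.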
Trying to keep it all in my own head: For sufficiently large number of pages,
bitmap scan should be preferred to idx scan due to reduced random-page-cost
outweighing its overhead in CPU cost.
But CPU cost is probably not why it is losing anyway.
Index scans get a double bonus from high correlation. It assumes that only a small fraction of the table will be visited. And then it assumes that the visits it does make will be largely sequential. I think that you are saying that for a large enough table, that last assumption is wrong, that the residual amount of non-correlation is enough to make the table reads more random than sequential. Maybe. Do you have a test case that demonstrates this? If so, how big do we need to go, and can you see the problem on SSD as well as HDD?
The thing is, the bitmap scan gets cheated out of one of these bonuses. It gets no credit for visiting only a small part of the table when the correlation is high, but does get credit for being mostly sequential in whatever visits it does make (even if the correlation is low, because of course the bitmap perfects the correlation).
I think it would be easier to give bitmap scans their due rather than try to knock down index scans. But of course a synthetic test case would go a long way to either one.
Probably by penalizing index scans, not
discounting bitmap scans. Conceivably a correlation adjustment can be
conditionalized or weighted based on index_pages_fetched() ...
x = ln (x/999999);
if (x>1) correlation/=x;
I think we should go with something with some statistical principle behind it if we want to do that. There is the notion of "restricted range" in correlations, that if there is a high correlation over the full range of data, but you zoom into only a small fraction of that range, the correlation you see over that restricted range will be much less than the full correlation is. I don't know that field well enough to give a formula off the top of my head, but I think it will be based on the fraction of the key space which is being scanned, and (1-RSQ), rather than an arbitrary number like 999999.
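The restricted-range effect is easy to demonstrate numerically (a hypothetical illustration of the statistical phenomenon, not a proposed costing formula): take data with a strong linear trend plus fixed-size noise, and compare the correlation over the full key range with the correlation over a 10% slice of it.

```python
import random

def pearson(xs, ys):
    # Plain Pearson linear correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(42)
n = 20000
# Keys with a strong trend; "positions" are keys plus fixed-size noise.
keys = [random.uniform(0, 1_000_000) for _ in range(n)]
positions = [k + random.gauss(0, 50_000) for k in keys]

full = pearson(keys, positions)

# Zoom in on 10% of the key space: the same noise now dominates the
# much narrower signal range.
pairs = [(k, p) for k, p in zip(keys, positions) if k < 100_000]
restricted = pearson([k for k, _ in pairs], [p for _, p in pairs])

print(round(full, 3))        # high (trend dominates)
print(round(restricted, 3))  # noticeably lower over the slice
```

This is why a correlation measured over the whole table can badly overstate how sequential the heap visits will be within the narrow key range one query actually scans.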
Cheers,
Jeff
On Tue, Dec 12, 2017 at 01:29:48AM -0800, Jeff Janes wrote:
> On Wed, Dec 6, 2017 at 1:46 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:
> > > In any case, given that we do this calculation without regard
> > > to any specific index,
> >
> > One solution is to compute stats (at least correlation) for all indices, not
> > just expr inds. I did that earlier this year while throwing around/out ideas.
> > https://www.postgresql.org/message-id/20170707234119.GN17566%40telsasoft.com
>
> When is the correlation of a column which is not the leading column of a
> btree index or in a brin index ever used? If we did compute index-specific
> correlations, we could maybe just drop pure-column correlations.

Yes, I think so - correlation is collected for every column, but only used for indices. I also have a comment to myself in that patch to force attstattarget=0 for non-expr indices, to avoid keeping MCV/histogram which duplicates that of their column.

> > Trying to keep it all in my own head: For sufficiently large number of pages,
> > bitmap scan should be preferred to idx scan due to reduced random-page-cost
> > outweighing its overhead in CPU cost.
>
> But CPU cost is probably not why it is losing anyway.
>
> Index scans get a double bonus from high correlation. It assumes that only
> a small fraction of the table will be visited. And then it assumes that
> the visits it does make will be largely sequential. I think that you are
> saying that for a large enough table, that last assumption is wrong, that
> the residual amount of non-correlation is enough to make the table reads
> more random than sequential. Maybe. Do you have a test case that
> demonstrates this? If so, how big do we need to go, and can you see the
> problem on SSD as well as HDD?
Right: the "residual"/fine-scale variations (those which are not adequately represented by the correlation metric) are/may be non-sequential, so don't get good readahead.

The original issue was with a 75GB table (an inheritance child) and an analytic query that previously took ~30min, at that point taking 4-5 hours due to random seeks (from duplicate values in a timestamp column with 1-second resolution). There would've been very little if any of the previous day's table cached: the child table being queried (by way of its parent) had size roughly the same as the server's RAM, and would've been loaded over the course of the preceding 6-30 hours, and not frequently accessed. It may be that there's a sharp change once the cache no longer effectively mitigates the random heap reads.

SSD: good question. Here's a rackspace VM with PG9.6.6, 2GB shared_buffers, 8GB RAM (~4GB of which is being used as OS page cache), and 32GB SSD (with random_page_cost=1). The server is in use by our application. I believe you could scale up the size of the table to see this behavior with any cache size. 0.0001 controls the "jitter", with smaller values being more jittery.
postgres=# CREATE TABLE t(i int,j int) TABLESPACE tmp; CREATE INDEX ON t(i); INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM generate_series(1,99999999) a; VACUUM ANALYZE t;
 public | t | table | pryzbyj | 3458 MB |
 relpages | 442478

For comparison purposes/baseline, here's a scan on a SEPARATE index freshly built AFTER insertions:

postgres=# explain(analyze,buffers) SELECT COUNT(j) FROM t WHERE i BETWEEN 0 AND 4000;

First invocation:
#1 -> Index Scan using t_i_idx1 on t (cost=0.57..1413352.60 rows=39933001 width=4) (actual time=25.660..52575.127 rows=39996029 loops=1)
      Buffers: shared hit=1578644 read=286489 written=1084

Subsequent invocations with (extra) effect from OS cache:
#2 -> Index Scan using t_i_idx1 on t (cost=0.57..1413352.60 rows=39933001 width=4) (actual time=61.054..37646.556 rows=39996029 loops=1)
      Buffers: shared hit=1578644 read=286489 written=2223
#3 -> Index Scan using t_i_idx1 on t (cost=0.57..1413352.60 rows=39933001 width=4) (actual time=9.344..31265.398 rows=39996029 loops=1)
      Buffers: shared hit=1578644 read=286489 written=1192

Dropping that index, and scanning a different range on the non-fresh index:

postgres=# explain(analyze,buffers) SELECT COUNT(j) FROM t WHERE i BETWEEN 4000 AND 8000;

#1 -> Index Scan using t_i_idx on t (cost=0.57..1546440.47 rows=40298277 width=4) (actual time=95.815..139152.147 rows=40009853 loops=1)
      Buffers: shared hit=1948069 read=316536 written=3411

Rerunning with cache effects:
#2 -> Index Scan using t_i_idx on t (cost=0.57..1546440.47 rows=40298277 width=4) (actual time=203.590..87547.287 rows=40009853 loops=1)
      Buffers: shared hit=1948069 read=316536 written=5712
#3 -> Index Scan using t_i_idx on t (cost=0.57..1546440.47 rows=40298277 width=4) (actual time=164.504..83768.890 rows=40009853 loops=1)
      Buffers: shared hit=1948069 read=316536 written=1979

Compare to seq scan:
-> Seq Scan on t (cost=0.00..1942478.00 rows=40298277 width=4) (actual time=1173.162..20980.069 rows=40009853 loops=1)
   Buffers: shared hit=47341 read=395137

Bitmap:
-> Bitmap Heap Scan on t (cost=975197.91..2022150.06 rows=40298277 width=4) (actual time=24396.270..39304.813 rows=40009853 loops=1)
   Buffers: shared read=316536 written=1431

The index scan reads 2.3e6 pages, compared to 4e5 pages (seq) and 3e5 pages (bitmap). And idx scans were 4-7x slower than seq scan.

Was the index scan actually that badly affected by the CPU cost of revisiting pages (rather than IO costs)? Or did the OS actually fail to cache the 3e5 pages "read"? That would be consistent with running almost 2x faster on the 3rd invocation.

The "hits" are largely from pages being revisited and recently accessed. Are the misses (reads) mostly from pages being revisited after already falling out of cache? Or mostly initial access? Or ??

If the index scan is really paying a high IO cost for rereads and not primarily a CPU cost, this seems to be something like the "correlated index scan" variant of the traditional failure to effectively cache a seq scan on a sufficiently large table using an MRU buffer - the cache is ineffective for adequately mitigating IO costs when the (re)reads have sufficient "spread".

postgres=# SELECT tablename, attname, correlation FROM pg_stats WHERE tablename='t';
tablename   | t
attname     | i
correlation | 1

I still want to say that's unreasonable due to (I think) the high fraction of nonrandom reads and associated absence of readahead.

> I think it would be easier to give bitmap scans their due rather than try
> to knock down index scans. But of course a synthetic test case would go a
> long way to either one.

As Tom said: index scans involving repeated keys are assuming best-case sequential reads for a given computed correlation. I'd be happy if they were costed to avoid that assumption, and instead used some "middle of the road" interpretation (probably based on correlation and MCV fraction?), but, in the alternate, we need to distinguish the cost_index cases, rather than adjusting bitmap.
This is what led me to play around with stats computation for all indices. Justin
On Fri, Dec 15, 2017 at 02:54:06PM -0600, Justin Pryzby wrote:
> SSD: good question.
>
> Here's a rackspace VM with PG9.6.6, 2GB shared_buffers, 8GB RAM (~4GB of which
> is being used as OS page cache), and 32GB SSD (with random_page_cost=1). The
> server is in use by our application.
>
> I believe you could scale up the size of the table to see this behavior with
> any cache size. 0.0001 controls the "jitter", with smaller values being more
> jittery..
>
> postgres=# CREATE TABLE t(i int,j int) TABLESPACE tmp; CREATE INDEX ON t(i); INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM generate_series(1,99999999) a; VACUUM ANALYZE t;
>  public | t | table | pryzbyj | 3458 MB |
>  relpages | 442478

I realized I've made a mistake here; the table is on SSD but not its index... So all this cost is apparently coming from the index and not the heap.

-> Bitmap Heap Scan on t (cost=855041.91..1901994.06 rows=40298277 width=4) (actual time=14202.624..27754.982 rows=40009853 loops=1)
   -> Bitmap Index Scan on t_i_idx1 (cost=0.00..844967.34 rows=40298277 width=0) (actual time=14145.877..14145.877 rows=40009853 loops=1)

Let me get back to you about that.

Justin
On Fri, Dec 15, 2017 at 02:54:06PM -0600, Justin Pryzby wrote:
> SSD: good question.
>
> Here's a rackspace VM with PG9.6.6, 2GB shared_buffers, 8GB RAM (~4GB of which
> is being used as OS page cache), and 32GB SSD (with random_page_cost=1). The
> server is in use by our application.
>
> I believe you could scale up the size of the table to see this behavior with
> any cache size. 0.0001 controls the "jitter", with smaller values being more
> jittery..

On Sat, Dec 16, 2017 at 01:18:38PM -0600, Justin Pryzby wrote:
> I realized I've made a mistake here; the table is on SSD but not its index...
> So all this cost is apparently coming from the index and not the heap.
>
> -> Bitmap Heap Scan on t (cost=855041.91..1901994.06 rows=40298277 width=4) (actual time=14202.624..27754.982 rows=40009853 loops=1)
>    -> Bitmap Index Scan on t_i_idx1 (cost=0.00..844967.34 rows=40298277 width=0) (actual time=14145.877..14145.877 rows=40009853 loops=1)

I'm rerunning with this:

postgres=# CREATE TABLE t(i int,j int) TABLESPACE tmp; CREATE INDEX ON t(i) TABLESPACE tmp; INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM generate_series(1,99999999) a; VACUUM ANALYZE t; CREATE INDEX ON t(i) TABLESPACE tmp;

That doesn't seem to invalidate my conclusions regarding the test data.
The non-fresh index:

#1 -> Index Scan using t_i_idx on t (cost=0.57..1103588.59 rows=39536704 width=4) (actual time=2.295..60094.704 rows=40009646 loops=1)

Reruns:

#2 -> Index Scan using t_i_idx on t (cost=0.57..1103588.59 rows=39536704 width=4) (actual time=1.671..54209.037 rows=40009646 loops=1)
#3 -> Index Scan using t_i_idx on t (cost=0.57..1103588.59 rows=39536704 width=4) (actual time=1.743..46436.538 rows=40009646 loops=1)

Scan of the fresh index:

 -> Index Scan using t_i_idx1 on t (cost=0.57..1074105.46 rows=39536704 width=4) (actual time=1.715..16119.720 rows=40009646 loops=1)

Bitmap scan on the non-fresh index:

 -> Bitmap Heap Scan on t (cost=543141.78..1578670.34 rows=39536704 width=4) (actual time=4397.767..9137.541 rows=40009646 loops=1)
      Buffers: shared hit=91235 read=225314
   -> Bitmap Index Scan on t_i_idx (cost=0.00..533257.61 rows=39536704 width=0) (actual time=4346.556..4346.556 rows=40009646 loops=1)
        Buffers: shared read=139118

Seq scan:

 -> Seq Scan on t (cost=0.00..1942478.00 rows=39536704 width=4) (actual time=6093.269..17880.164 rows=40009646 loops=1)

I also tried an index-only scan (note COUNT(i) vs COUNT(j), "eye" vs "jay"), which I think should behave like an index scan without the heap costs:

postgres=# SET max_parallel_workers_per_gather=0; SET enable_bitmapscan=off; SET enable_indexscan=on;
begin; DROP INDEX t_i_idx1;
explain(analyze,buffers) SELECT COUNT(i) FROM t WHERE i BETWEEN 4000 AND 8000;
rollback;

 -> Index Only Scan using t_i_idx on t (cost=0.57..928624.65 rows=39536704 width=4) (actual time=0.515..12646.676 rows=40009646 loops=1)
      Buffers: shared hit=276 read=139118

However, in this test, random reads on the INDEX still account for a large fraction of the query time. When the index is cached by the OS, this is much faster.
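The gap between the plain index scan's heap access and the bitmap scan's is consistent with how the planner costs an ordinary index scan. A hedged Python sketch of the interpolation cost_index() performs between the worst case (uncorrelated index, ~one random heap fetch per tuple) and the best case (perfectly correlated heap order), weighted by the squared index correlation; the simplifications here are mine, not PostgreSQL's exact code:

```python
import math

def index_scan_heap_io(heap_pages, selectivity, tuples_fetched, correlation,
                       seq_page_cost=1.0, random_page_cost=1.5):
    """Sketch of cost_index() heap I/O: interpolate between the
    uncorrelated (max) and perfectly correlated (min) cases."""
    # worst case: roughly one random page fetch per matching tuple
    # (the real code caps this with a Mackert-Lohman estimate)
    max_io = tuples_fetched * random_page_cost
    # best case: read just the needed fraction of the heap, first page
    # random, the rest effectively sequential
    min_pages = max(1, math.ceil(selectivity * heap_pages))
    min_io = random_page_cost + (min_pages - 1) * seq_page_cost
    csquared = correlation ** 2
    return max_io - csquared * (max_io - min_io)

# A freshly built index on nearly-sequential data has correlation close
# to 1.0; after churn the correlation drops and the cost climbs fast:
print(index_scan_heap_io(1000, 0.1, 5000, 1.0))  # near min_io
print(index_scan_heap_io(1000, 0.1, 5000, 0.0))  # max_io
```

With ~40M matching tuples, even a modest drop in correlation pushes the estimate (and the real runtime, as the reruns above show) heavily toward the random-fetch worst case.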
Compare:

#1 -> Bitmap Heap Scan on t (cost=543141.78..1578670.34 rows=39536704 width=4) (actual time=25498.978..41418.870 rows=40009646 loops=1)
        Buffers: shared read=316549 written=497
     -> Bitmap Index Scan on t_i_idx (cost=0.00..533257.61 rows=39536704 width=0) (actual time=25435.865..25435.865 rows=40009646 loops=1)
          Buffers: shared read=139118 written=2

#2 -> Bitmap Heap Scan on t (cost=543141.78..1578670.34 rows=39536704 width=4) (actual time=5863.003..17531.860 rows=40009646 loops=1)
        Buffers: shared read=316549 written=31
     -> Bitmap Index Scan on t_i_idx (cost=0.00..533257.61 rows=39536704 width=0) (actual time=5799.400..5799.400 rows=40009646 loops=1)
          Buffers: shared read=139118 written=31

Note that for the test data, the index is a large fraction of the table data (since the only non-indexed column has nullfrac=1):

 public | t        | table | pryzbyj |   | 3458 MB |
 public | t_i_idx  | index | pryzbyj | t | 2725 MB |
 public | t_i_idx1 | index | pryzbyj | t | 2142 MB |

(that could be 10% smaller with fillfactor=100)

I think the test case reasonably reproduces the original issue. Note that the 2nd invocation of the bitmap scan scanned the index in 5.8sec and the heap in 11sec, but the 2nd invocation of the index scan took 54sec, of which I gather ~6sec was from the index. So there's still 48sec spent accessing the heap randomly, rather than 11sec sequentially.

I'm also playing with the tables which were the source of the original problem, for which index reads in the bitmap scan do not appear to be a large fraction of the query time, probably because the indexes are 1-2% of the table size rather than 60-70%. I'll mail about that separately.

Justin
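The heap-access breakdown above can be checked with simple arithmetic; the timings are the ones quoted from the EXPLAIN ANALYZE runs, and the ~6sec attributed to the index portion of the plain index scan is the estimate from the text, not a separate measurement:

```python
# timings in seconds, from the EXPLAIN ANALYZE output above
bitmap_index_time = 5.8       # 2nd bitmap scan: Bitmap Index Scan node
bitmap_total_time = 17.5      # 2nd bitmap scan: total
index_scan_total = 54.2       # 2nd plain index scan: total
index_scan_index_part = 6.0   # estimated index portion (from the text)

bitmap_heap_time = bitmap_total_time - bitmap_index_time
index_scan_heap_time = index_scan_total - index_scan_index_part

print(bitmap_heap_time)      # ~11.7s of mostly sequential heap access
print(index_scan_heap_time)  # ~48.2s of random heap access
```

The same heap pages cost roughly 4x more wall-clock time when fetched in index order rather than in bitmap (physical) order, which is exactly the asymmetry the cost model is supposed to capture.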
Sorry for the delayed response; I had to switch to other tasks and didn't have time to run proper tests and write a meaningful reply.

Recently a similar issue happened with another of our databases, so I decided to send an update. The planner preferred a bitmap scan over an index scan, but the bitmap scan ran worse in practice. Here are the relevant pieces of a much bigger query plan:

 -> Bitmap Heap Scan on cmdb_program_daily_usage cmdb_program_daily_usage_6 (cost=6707.08..6879.35 rows=32 width=20) (actual time=39.994..40.019 rows=12 loops=336)
      Recheck Cond: ((used_from = cmdb_ci_computer_12.id) AND (usage_date >= '2018-02-02'::date) AND (usage_date <= '2018-02-12'::date))
      Filter: (((NOT thin_client) OR (thin_client IS NULL)) AND (program_instance IS NOT NULL) AND (minutes_in_use > 0))
      Rows Removed by Filter: 69
      Heap Blocks: exact=2995
      Buffers: shared hit=563448
      -> BitmapAnd (cost=6707.08..6707.08 rows=154 width=0) (actual time=39.978..39.978 rows=0 loops=336)
           Buffers: shared hit=560453
           -> Bitmap Index Scan on idx_fk_5317241949468942 (cost=0.00..133.87 rows=12641 width=0) (actual time=0.373..0.373 rows=4780 loops=336)
                Index Cond: (used_from = cmdb_ci_computer_12.id)
                Buffers: shared hit=5765
           -> Bitmap Index Scan on idx_263911642415136 (cost=0.00..6572.94 rows=504668 width=0) (actual time=40.873..40.873 rows=540327 loops=324)
                Index Cond: ((usage_date >= '2018-02-02'::date) AND (usage_date <= '2018-02-12'::date))
                Buffers: shared hit=554688

 -> Index Scan using idx_fk_5317241949468942 on cmdb_program_daily_usage cmdb_program_daily_usage_6 (cost=0.56..24322.97 rows=35 width=20) (actual time=1.211..2.196 rows=14 loops=338)
      Index Cond: (used_from = cmdb_ci_computer_12.id)
      Filter: (((NOT thin_client) OR (thin_client IS NULL)) AND (program_instance IS NOT NULL) AND (minutes_in_use > 0) AND (usage_date >= '2018-02-02'::date) AND (usage_date <= '2018-02-12'::date))
      Rows Removed by Filter: 4786
      Buffers: shared hit=289812

The difference in run time does not look very
huge, but when it's part of a loop, it can mean the difference between minutes and hours.

After running some tests, here are the conclusions we've made:

- When running with a cold cache, and data is being read from disk, the planner estimates look adequate. The bitmap scan has a lower cost, and indeed it performs better in that case.

- When running with a hot cache, and most of the data is already in RAM, the index scan starts to outperform the bitmap scan. Unfortunately the planner cannot account for the cache very well, and can't switch the plan. Even if the planner ever learned to account for the current contents of shared buffers, it still couldn't know much about the contents of the filesystem cache.

- Tests showed that the costs are dominated by random_page_cost, but there is still potential to change the total plan cost if the "cpu_*" costs were less distant from the "*_page_cost" settings.

- In our case the data is likely to be in cache, so we decided to change the cost settings: seq_page_cost 1.0 -> 0.5; random_page_cost 1.1 -> 0.6.

Regards,
Vitaliy
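The effect of the change in the last point can be sketched numerically. This is a toy comparison under an assumed simplified model (one page-cost term plus one per-tuple CPU term), not PostgreSQL's full costing, and the page/tuple counts are purely illustrative:

```python
def toy_cost(pages, page_cost, tuples, cpu_cost_per_tuple=0.01):
    """Toy plan cost: an I/O term plus a CPU term."""
    return pages * page_cost + tuples * cpu_cost_per_tuple

pages, tuples = 100_000, 1_000_000

# old settings: seq_page_cost=1.0, random_page_cost=1.1
old_seq  = toy_cost(pages, 1.0, tuples)
old_rand = toy_cost(pages, 1.1, tuples)
# new settings: seq_page_cost=0.5, random_page_cost=0.6
new_seq  = toy_cost(pages, 0.5, tuples)
new_rand = toy_cost(pages, 0.6, tuples)

# Halving the page costs makes the CPU terms a larger share of the
# total, so the same absolute random-vs-sequential gap separates the
# competing plans more strongly in relative terms.
print(old_rand / old_seq)
print(new_rand / new_seq)
```

In other words, the point of 1.0->0.5 / 1.1->0.6 is not only that pages are assumed cheap (cached), but that the fixed gap between random and sequential access now carries more relative weight against the cpu_* costs.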