Thread: Unnecessarily imprecise stats for very small tables leading to bad plans
From: Andreas Seltenreich
Date:
Hi,

one of our customers suffers from excessively bad plans when joining a big table (100 GB) against a temp table with only a single-digit number of rows. I found a way to reproduce it, and it seems to me like the optimizer cannot do better, since the statistics say it can early-out of a merge join because the maximum value is missing from the stats.

Below are the data, the statistics, and the plan for the bad merge join as well as the expected nested loop.

Intuitively, I'd say pg_statistic can represent such a small dataset perfectly. But instead, most_common_vals only contains one of the two distinct values. This seems like a bug to me. I looked at the code for why this is the case, but couldn't find an answer right away. Any input?

The customer uses the latest 9.6; the testcase was run against 10.0.

regards
Andreas

=# create table big(c int primary key);
=# insert into big select generate_series(1,1000000);
=# analyze big;
=# create table small(c int);
=# insert into small values (10), (10), (800000);
=# analyze verbose small;

=# explain analyze select * from big b join small s on (b.c=s.c);
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1.49..1.82 rows=3 width=8) (actual time=0.110..151.939 rows=3 loops=1)
   Merge Cond: (b.c = s.c)
   ->  Index Only Scan using big_pkey on big b  (cost=0.42..30408.42 rows=1000000 width=4) (actual time=0.067..122.192 rows=800001 loops=1)
         Heap Fetches: 800001
   ->  Sort  (cost=1.05..1.06 rows=3 width=4) (actual time=0.027..0.028 rows=3 loops=1)
         Sort Key: s.c
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on small s  (cost=0.00..1.03 rows=3 width=4) (actual time=0.009..0.011 rows=3 loops=1)
 Planning time: 0.514 ms
 Execution time: 151.989 ms
(10 rows)

=# select * from pg_stats where tablename='small' \gx
-[ RECORD 1 ]----------+-----------
schemaname             | public
tablename              | small
attname                | c
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -0.666667
most_common_vals       | {10}          -- here, I expected to see 800000 as well
most_common_freqs      | {0.666667}
histogram_bounds       | ⊥
correlation            | 1
most_common_elems      | ⊥
most_common_elem_freqs | ⊥
elem_count_histogram   | ⊥

=# insert into small values (800000);
=# analyze verbose small;
=# explain analyze select * from big b join small s on (b.c=s.c);
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.42..34.81 rows=4 width=8) (actual time=0.037..0.068 rows=4 loops=1)
   ->  Seq Scan on small s  (cost=0.00..1.04 rows=4 width=4) (actual time=0.012..0.015 rows=4 loops=1)
   ->  Index Only Scan using big_pkey on big b  (cost=0.42..8.44 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=4)
         Index Cond: (c = s.c)
         Heap Fetches: 4
 Planning time: 0.478 ms
 Execution time: 0.133 ms
(7 rows)

=# select * from pg_stats where tablename='small' \gx
-[ RECORD 1 ]----------+------------
schemaname             | public
tablename              | small
attname                | c
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -0.5
most_common_vals       | {10,800000}
most_common_freqs      | {0.5,0.5}
histogram_bounds       | ⊥
correlation            | 1
most_common_elems      | ⊥
most_common_elem_freqs | ⊥
elem_count_histogram   | ⊥
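[Editor's note: a back-of-the-envelope model of the early-out effect described above — not PostgreSQL source code. With most_common_vals = {10} and no histogram, the only maximum the planner can infer for small.c is 10, so it expects the sorted scan of big to stop almost immediately; in reality the scan must run up to 800000. The function and constants below are illustrative assumptions.]

```python
# Hypothetical model of the merge-join "early-out" estimate: the planner
# guesses what fraction of big's sorted key range must be read before the
# join key on the small side is exhausted, based on the largest value the
# statistics claim the small side contains.

def estimated_scan_fraction(big_min, big_max, small_max_per_stats):
    """Linear-interpolation guess of how much of big's key range is
    scanned before the merge join can stop early."""
    frac = (small_max_per_stats - big_min) / (big_max - big_min)
    return max(0.0, min(1.0, frac))

BIG_MIN, BIG_MAX = 1, 1_000_000

# After the first ANALYZE the stats top out at 10, so the scan looks
# nearly free -- hence the cheap-looking Merge Join.
frac_per_stats = estimated_scan_fraction(BIG_MIN, BIG_MAX, 10)

# The table actually contains 800000, so ~80% of big is read
# (rows=800001 in the actual plan above).
frac_actual = estimated_scan_fraction(BIG_MIN, BIG_MAX, 800_000)

print(f"scan fraction per stats: {frac_per_stats:.6f}")
print(f"scan fraction actually:  {frac_actual:.6f}")
```

The two fractions differ by roughly five orders of magnitude, which matches the gap between the estimated cost (1.82) and the 151 ms of actual index-only-scan work.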
Andreas Seltenreich <andreas.seltenreich@credativ.de> writes:
> Intuitively, I'd say pg_statistic can represent such a small dataset
> perfectly. But instead, most_common_vals only contains one of the two
> distinct values. This seems like a bug to me. I looked at the code for
> why this is the case, but couldn't find an answer right away. Any
> input?

ANALYZE has a hard-and-fast rule that a value must appear more than once in the sample in order to possibly be considered an MCV. Perhaps we could tinker with that if the sample is known to comprise the entire table, but it feels a bit ad hoc.

			regards, tom lane
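[Editor's note: a minimal illustration of the rule Tom describes — a simulation, not the actual analyze.c code. The function name and frequency bookkeeping are assumptions for the sketch.]

```python
# Simulate ANALYZE's hard-and-fast MCV rule: a value is only a candidate
# for most_common_vals if it appears more than once in the sample.
from collections import Counter

def mcv_candidates(sample):
    """Return {value: sample_frequency} for values occurring > 1 time."""
    counts = Counter(sample)
    n = len(sample)
    return {v: c / n for v, c in counts.items() if c > 1}

# The three-row table from the report: 10 appears twice but 800000 only
# once, so only 10 survives -- matching most_common_vals = {10}.
print(mcv_candidates([10, 10, 800000]))

# After the second insert of 800000, both values qualify, matching
# most_common_vals = {10,800000} with freqs {0.5,0.5}.
print(mcv_candidates([10, 10, 800000, 800000]))
```

This also shows why the stats "improve" after the duplicate insert even though no new distinct value was added: the multiplicity threshold, not the value itself, was the gate.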