strange row count estimates with conditions on multiple column - Mailing list pgsql-general

From Tomas Vondra
Subject strange row count estimates with conditions on multiple column
Date
Msg-id 4CE32D36.7040508@fuzzy.cz
Whole thread Raw
Responses Re: strange row count estimates with conditions on multiple column  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Hi everyone,

I've just noticed a strange behaviour when estimating row counts (I'm
running 9.0.1). A small demonstration - let's create table with two
columns, and fill it with data so that the columns are not independent:

=====================================================================

-- table with two columns
create table test_table (col_a int, col_b int);

-- fill it with correlated data (0..99, 0..199)
insert into test_table select i%100, i%200 from generate_series(1,
1000000) s(i);

-- update statistics
analyze test_table;

=====================================================================

OK, not let's run a few simple queries, producing exactly the same
output but very different plans:

A) SELECT * FROM test_table WHERE col_a = 33 AND col_b = 33;

Seq Scan on test_table  (cost=0.00..19425.00 rows=47 width=8) (actual
time=0.025..216.273 rows=5000 loops=1)
   Filter: ((col_a = 33) AND (col_b = 33))
 Total runtime: 221.676 ms

B) SELECT * FROM test_table WHERE (col_a, col_b) = (33, 33);

plan is exactly the same as (A)

C) SELECT * FROM test_table WHERE (col_a BETWEEN 33 AND 33) AND (col_b
BETWEEN 33 AND 33);

 Seq Scan on test_table  (cost=0.00..24425.00 rows=1 width=8) (actual
time=0.025..282.725 rows=5000 loops=1)
   Filter: ((col_a >= 33) AND (col_a <= 33) AND (col_b >= 33) AND (col_b
<= 33))
 Total runtime: 288.127 ms

D) SELECT * FROM test_table WHERE (col_a, col_b) BETWEEN (33, 33) AND
(33, 33);

 Seq Scan on test_table  (cost=0.00..24425.00 rows=227232 width=8)
(actual time=0.022..238.958 rows=5000 loops=1)
   Filter: ((ROW(col_a, col_b) >= ROW(33, 33)) AND (ROW(col_a, col_b) <=
ROW(33, 33)))
 Total runtime: 244.353 ms

=====================================================================

So the estimated number of rows is this

A) 47
B) 47
C) 1
D) 227232

Results from (A) and (B) seem strange to me because AFAIK there are no
multi-column statistics available, and accoring to this thread

http://archives.postgresql.org/pgsql-hackers/2009-03/msg00052.php

the single-column estimates are not multiplied (which would be OK only
in case of statistically independent columns). Yet the estimates somehow
match the product:

1.000.000 * (1/200) * (1/100) = 1.000.000 / 20.000 = 50


I'm not quite sure why (C) has an estimate of 1. The col_a has only 100
distinct values, so it uses most_common_vals/most_common_freqs, and all
the values will be there (statistics target is 100) along with
frequencies. This gives about 1% selectivity.

Column col_b has 200 distinct values, uniformly distributed, so the
estimates are based on a histogram - there are 100 bins, the range fits
into a single bin, giving 1% selectivity.

But no matter what I do, I'm not sure how to combine these two estimates
into 0.0001% (1 row out of a million).

And I do have exactly the same problem with the estimate in (D). Where
the heck did 227232 come from?

regards
Tomas

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Survey on backing up unlogged tables: help us with PostgreSQL development!
Next
From: Andy Colson
Date:
Subject: Re: Survey on backing up unlogged tables: help us with PostgreSQL development!