From: Niko Kiirala
Subject: High cost estimates when n_distinct is set
Msg-id: D093912CABCEE44F92C35FE3142A759B777D18CD@NBL-EXCD1-02.nebula.local
List: pgsql-performance
Hello

I am working on a potentially large database table, let's call it "observation", that has a foreign key to table
"measurement". Each measurement is associated with either none or around five observations. In this kind of situation,
it is well known that the statistics on the foreign key column in the observation table can get arbitrarily bad as the
row count increases. In particular, the estimate of the number of distinct values in the foreign key column can be
completely off.

To combat this issue I have set n_distinct=-0.2 on the foreign key column. With this, the query planner gets good
estimates of row counts, but it would appear that this setting does not affect the cost estimate of an index scan. As a
result, I get odd cost estimates, as if fetching these approximately five rows using an index scan would take hundreds
of random disk reads. Due to this high cost estimate, when joining these two tables, PostgreSQL switches from using
multiple small index scans to scanning the whole table much earlier than would be beneficial.
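
To make this concrete, a join of roughly the following shape is where the issue shows up. This is illustrative only:
since the real "measurement" table is not part of the test setup below, generate_series stands in for a set of
measurement ids here:

EXPLAIN
SELECT o.id, o.measurement_id
  FROM generate_series(200001, 201000) AS m(id)
  JOIN observation o ON o.measurement_id = m.id;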

So, in more detail:

I am using PostgreSQL 9.2.1 on Windows 7 SP1, installed with the EnterpriseDB one-click installer and running with
default settings.
SELECT version();
PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit

I have set up a simple testing database as follows (running these commands takes around an hour on my DB. It's slow,
sorry, but I need lots of rows to show the issue):

CREATE TABLE observation
(
  id bigserial NOT NULL,
  measurement_id bigint NOT NULL,
  CONSTRAINT observation_pkey PRIMARY KEY (id)
);

CREATE INDEX observation_measurement_id_idx
  ON observation
  USING btree
  (measurement_id);

INSERT INTO observation
  SELECT x as id,
    CASE WHEN (x - 1) % 15 < 3 THEN ((x - 1) / 15) * 4
         WHEN (x - 1) % 15 < 8 THEN ((x - 1) / 15) * 4 + 1
         ELSE ((x - 1) / 15) * 4 + 2
    END AS measurement_id
  FROM (SELECT generate_series(1, 100000000) AS x) AS series;

ANALYZE observation;

Here the measurement_id stands for the foreign key to table "measurement". Actually having that table is not required
to show the issue, though. Each number in the range 1...26666666 (1e8 * 4 / 15) appears 0, 3, 5 or 7 times in the
measurement_id column.
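
As a sanity check of that distribution (not part of the original setup, and itself rather slow on this many rows), a
query like the following counts how many measurement_id values occur 3, 5 and 7 times:

SELECT cnt, count(*) AS num_ids
  FROM (SELECT measurement_id, count(*) AS cnt
          FROM observation
         GROUP BY measurement_id) AS per_id
 GROUP BY cnt
 ORDER BY cnt;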

After this, the statistics on the measurement_id column look something like this:

Distinct Values           1.23203e+006
Most Common Values        {6895590, 8496970, 23294094, 75266, 128877, 150786, 175001, 192645, 216918, 262742, ...
Most Common Frequencies   {0.0001, 0.0001, 0.0001, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, ...
Histogram Bounds          {14, 208364, 511400, 840725, 1091642, 1392585, 1713998, 1945482, 2204897, 2476654, ...
Correlation               1
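
Roughly the same information can be read straight from the pg_stats view, for example:

SELECT n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds, correlation
  FROM pg_stats
 WHERE tablename = 'observation'
   AND attname = 'measurement_id';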

As there are 20e6 distinct values, the 1.23e6 estimate is already off by more than an order of magnitude and will keep
getting worse as more rows are added.

Let's have a look at the query plan for fetching all observations of a single measurement:
EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS on, TIMING on )
SELECT id, measurement_id
  FROM observation
  WHERE measurement_id = 200001;

Index Scan using observation_measurement_id_idx on public.observation  (cost=0.00..119.36 rows=82 width=16) (actual
time=0.060..0.062 rows=5 loops=1)
  Output: id, measurement_id
  Index Cond: (observation.measurement_id = 200001)
  Buffers: shared read=5
Total runtime: 0.081 ms

The row estimate is off by more than a factor of ten, so let's help the planner by telling it how many rows it is
likely to find:
ALTER TABLE observation
   ALTER COLUMN measurement_id
   SET (n_distinct=-0.2);
ANALYZE observation;

This doesn't change the row statistics noticeably, except for changing the number of distinct values.
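
(If you want to double-check that the override took effect, the per-column setting is visible in pg_attribute, for
example:)

SELECT attname, attoptions
  FROM pg_attribute
 WHERE attrelid = 'observation'::regclass
   AND attname = 'measurement_id';
-- attoptions should now contain n_distinct=-0.2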

EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS on, TIMING on )
SELECT id, measurement_id
  FROM observation
  WHERE measurement_id = 200001;

Index Scan using observation_measurement_id_idx on public.observation  (cost=0.00..118.01 rows=5 width=16) (actual
time=0.060..0.061 rows=5 loops=1)
  Output: id, measurement_id
  Index Cond: (observation.measurement_id = 200001)
  Buffers: shared read=5
Total runtime: 0.073 ms

So, the row count estimate is now good, but the cost estimate has not changed much. If I halve random_page_cost (from 4
to 2), the estimated cost is nearly exactly halved as well, so it would appear this cost comes almost entirely from
random page accesses, approximately 29 of them. In the original plan this made sense, as it expected to fetch 82 rows,
but for five rows it seems quite excessive.
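
(For clarity, the random_page_cost experiment is just a session-level change, along these lines:)

SET random_page_cost = 2;  -- default is 4
EXPLAIN
SELECT id, measurement_id
  FROM observation
  WHERE measurement_id = 200001;
RESET random_page_cost;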

Enlarging the table from 100 million to 200 million rows almost doubles the estimated cost (from 118.01 to 227.69).
Since the estimated and actual counts of returned rows stay the same and the query uses a B-tree index, I'd expect the
cost to rise only slightly.
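
(The table can be grown with the same kind of INSERT as above, just continuing the id range; something along these
lines, followed by another ANALYZE:)

INSERT INTO observation
  SELECT x AS id,
    CASE WHEN (x - 1) % 15 < 3 THEN ((x - 1) / 15) * 4
         WHEN (x - 1) % 15 < 8 THEN ((x - 1) / 15) * 4 + 1
         ELSE ((x - 1) / 15) * 4 + 2
    END AS measurement_id
  FROM (SELECT generate_series(100000001, 200000000) AS x) AS series;
ANALYZE observation;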

I have a distinct feeling that this is either a bug in the cost estimator or that there is a quite valid reason I have
missed. Regardless, any insights you might have into this issue would be appreciated.

--
Niko Kiirala


