Thread: BUG #7619: Query cost estimate appears to not use n_distinct setting
BUG #7619: Query cost estimate appears to not use n_distinct setting
From
niko.kiirala@mapvision.fi
Date:
The following bug has been logged on the website: Bug reference: 7619 Logged by: Niko Kiirala Email address: niko.kiirala@mapvision.fi PostgreSQL version: 9.2.1 Operating system: Windows 7 SP 1 (64-bit) Description: = I am working on a potentially large database table, let's call it "observation", that has a foreign key to table "measurement". Each measurement is associated with either none or around five observations. In this kind of situation, it is well known that the statistics on the foreign key column in observation table can get arbitrarily bad as the row count increases. Especially, the estimate of the number of distinct values in the foreign key column can be completely off. To combat this issue I have set n_distinct=3D-0.2 on the foreign key column. With this, the query planner gets good estimates of row counts, but it would appear that this setting does not affect the cost estimate of an index scan. In a fairly small test case, without n_distinct setting the planner expects to find 82 rows, using around 29 random reads. This sounds reasonable, assuming the row estimate was correct. With n_distinct set, the planner correctly expects to find 5 rows but still use ~29 random reads. This strikes me as odd. Also, the estimated number of reads seems to grow linearily in the size of the table, even though constant result size and use of a B-tree would lead me to expect logarithmic cost increase at most. Due to this high cost estimate, when joining these two tables, PostgreSQL changes from using multiple small index scans to scanning the whole table a lot earlier than would be beneficial. A sub-second query suddenly becomes a query taking several minutes or hours. More details, including test database: I am using PostgreSQL 9.2.1 on Windows 7 SP 1 installed with EnterpriseDB one-click installer and with default settings. SELECT version(); PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit I have set up a simple testing database as follows (running these commands takes around an hour on my DB. It's slow, sorry, but I need lots of rows to show the issue): CREATE TABLE observation ( id bigserial NOT NULL, measurement_id bigint NOT NULL, CONSTRAINT observation_pkey PRIMARY KEY (id) ); CREATE INDEX observation_measurement_id_idx ON observation USING btree (measurement_id); INSERT INTO observation SELECT x as id, CASE WHEN (x - 1) % 15 < 3 THEN ((x - 1) / 15) * 4 WHEN (x - 1) % 15 < 8 THEN ((x - 1) / 15) * 4 + 1 ELSE ((x - 1) / 15) * 4 + 2 END AS measurement_id FROM (SELECT generate_series(1, 100000000) AS x) AS series; = ANALYZE observation; Here the measurement_id stands for the foreign key to table "measurement". Actually having that table is not required to show the issue, though. Each number from range 1...26666666 (1e8 * 4 / 15) appears 0, 3, 5 or 7 times in measurement_id column. After this, the statistics on measurement_id column look something like this: Distinct Values 1.23203e+006 Most Common Values {6895590, 8496970, 23294094, 75266, 128877, 150786, 175001, 192645, 216918, 262742, ... Most Common Frequencies {0.0001, 0.0001, 0.0001, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, ... Histogram Bounds {14, 208364, 511400, 840725, 1091642, 1392585, 1713998, 1945482, 2204897, 2476654, ... Correlation 1 As there are 20e6 distinct values, the 1.2e6 estimate is already somewhat off and will keep on going worse if more rows are added. Let's have a look on the query plan of fetching all observations of a single measurement: EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS on, TIMING on ) SELECT id, measurement_id FROM observation WHERE measurement_id =3D 200001; Index Scan using observation_measurement_id_idx on public.observation = (cost=3D0.00..119.36 rows=3D82 width=3D16) (actual time=3D0.060..0.062 rows= =3D5 loops=3D1) Output: id, measurement_id Index Cond: (observation.measurement_id =3D 200001) Buffers: shared read=3D5 Total runtime: 0.081 ms The row estimate is off by a factor of 10, so let's help the planner and tell how many rows it is likely to find: ALTER TABLE observation ALTER COLUMN measurement_id SET (n_distinct=3D-0.2); ANALYZE observation; This doesn't change the row statistics noticeably, except for changing the number of distinct values. EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS on, TIMING on ) SELECT id, measurement_id FROM observation WHERE measurement_id =3D 200001; Index Scan using observation_measurement_id_idx on public.observation = (cost=3D0.00..118.01 rows=3D5 width=3D16) (actual time=3D0.060..0.061 rows= =3D5 loops=3D1) Output: id, measurement_id Index Cond: (observation.measurement_id =3D 200001) Buffers: shared read=3D5 Total runtime: 0.073 ms So, the row count estimate is now good, but the cost estimate has not changed much. If I halve the random_page_cost (from 4 to 2), it nearly exactly halves the estimated cost, so it would appear this cost is almost completely from random page accesses, approximately 29 of them. In the original plan this made sense, as it expected to fetch 81 rows, but for five rows it would seem quite excessive. = Enlarging the table from 100 million to 200 million rows almost doubles the estimated cost (from 118.01 to 227.69). Since the estimated and actual count of returned rows stay the same and the query is using a B tree, I'd expect the cost to rise only slightly.
On 10/23/2012 06:46 PM, niko.kiirala@mapvision.fi wrote: > The following bug has been logged on the website: > > Bug reference: 7619 > Logged by: Niko Kiirala > Email address: niko.kiirala@mapvision.fi > PostgreSQL version: 9.2.1 > Operating system: Windows 7 SP 1 (64-bit) > Description: > > I am working on a potentially large database table, let's call it > "observation", that has a foreign key to table "measurement". Each > measurement is associated with either none or around five observations. In > this kind of situation, it is well known that the statistics on the foreign > key column in observation table can get arbitrarily bad as the row count > increases. Especially, the estimate of the number of distinct values in the > foreign key column can be completely off. For anyone wondering why this feels familiar, the same message was posted to pgsql-performance earlier: http://postgresql.1045698.n5.nabble.com/High-cost-estimates-when-n-distinct-is-set-td5728596.html It's clear that there's a practical performance issue here, but less clear that it's a bug. Nonetheless, thanks for writing it up in so much detail and chasing it up further - though it'd be nice if you'd mentioned your earlier post. I'd love to help, but you've clearly already done a lot of work on this and I'm not sure I have anything useful to add. If you don't have any luck, consider asking one of the professional PostgreSQL consulting firms for their input. -- Craig Ringer
niko.kiirala@mapvision.fi writes: > I am working on a potentially large database table, let's call it > "observation", that has a foreign key to table "measurement". Each > measurement is associated with either none or around five observations. In > this kind of situation, it is well known that the statistics on the foreign > key column in observation table can get arbitrarily bad as the row count > increases. Especially, the estimate of the number of distinct values in the > foreign key column can be completely off. > To combat this issue I have set n_distinct=-0.2 on the foreign key column. > With this, the query planner gets good estimates of row counts, but it would > appear that this setting does not affect the cost estimate of an index > scan. [ traces through example with gdb ... ] The reason the cost estimates for these two cases are nearly the same is that in both cases, the planner predicts that only one index page and one heap page will be visited. There are about 360 tuples per page in your index, so whether 83 of them or 5 of them are to be visited, the index access cost is just about the same --- they'll most likely all be on the same index page, and the per-tuple CPU costs are negligible compared to the page access cost. The table is a little less dense, 185 tuples/page, but that's still a lot more than 83. It might seem like the planner should be estimating 83 vs 5 heap pages fetched, which would change the costs a lot, but it doesn't --- it thinks only one page is fetched in both cases. And it's right, plus or minus a page, because this artificial test case has loaded up the table with perfectly sequential measurement_ids. ANALYZE has recorded the index order correlation as exactly 1.0, so the planner predicts that all the rows with measurement_id = 200001 will be found on the same page, even when there are 83 of them. In short then, the cost estimate for this query really ought to be 2 * random_page_cost plus some not-very-large CPU-per-tuple charges, and so it's not surprising at all that the more accurate rowcount estimate isn't moving the result very much. So if the cost ought to be somewhere in the vicinity of 9 units, why is it actually coming out over 100? The difference is coming from this fudge-factor in genericcostestimate: /* * A difficulty with the leaf-pages-only cost approach is that for small * selectivities (eg, single index tuple fetched) all indexes will look * equally attractive because we will estimate exactly 1 leaf page to be * fetched. All else being equal, we should prefer physically smaller * indexes over larger ones. (An index might be smaller because it is * partial or because it contains fewer columns; presumably the other * columns in the larger index aren't useful to the query, or the larger * index would have better selectivity.) * * We can deal with this by adding a very small "fudge factor" that * depends on the index size. The fudge factor used here is one * spc_random_page_cost per 10000 index pages, which should be small * enough to not alter index-vs-seqscan decisions, but will prevent * indexes of different sizes from looking exactly equally attractive. */ *indexTotalCost += index->pages * spc_random_page_cost / 10000.0; This was intended to be a very small correction, but your index is so large that it's coming out as the dominant part of the cost. (Of course, this also explains your observation that the cost seems to be linear in the index size.) So we probably ought to rethink this calculation, but I'm not sure offhand what to do instead. It still seems like a good thing to penalize larger indexes compared to smaller ones even when we are estimating a single index page being touched. Maybe this should be a log rather than linear calculation, though. regards, tom lane