Thread: serious under-estimation of n_distinct for clustered distributions
I have encountered serious underestimation of distinct values when values are not evenly distributed but clustered within a column. I think this problem may be relevant to many real-world use cases, and I wonder if there is a good workaround or possibly a programmatic solution that could be implemented.
Thanks for your help!
Stefan
The Long Story:
When Postgres collects statistics, it estimates the number of distinct values for every column (see pg_stats.n_distinct). This is one important source for the planner to determine the selectivity and hence can have great influence on the resulting query plan.
My Problem:
When I collected statistics on some columns that have rather high selectivity but not anything like unique values, I consistently got n_distinct values that are far too low (easily by some orders of magnitude). Worse still, the estimates did not really improve until I analyzed the whole table.
I tested this for Postgres 9.1 and 9.2. An artificial test-case is described at the end of this mail.
Some Analysis of the Problem:
Unfortunately, it is not trivial to estimate the total number of different values based on a sample. As far as I found out, Postgres uses an algorithm that is based on the number of values that are found only once in the sample used for ANALYZE. I found references to Good-Turing frequency estimation (http://encodestatistics.org/publications/statistics_and_postgres.pdf) and to a 1996 paper by Haas & Stokes (http://researcher.ibm.com/researcher/files/us-phaas/jasa3rj.pdf). The latter was referenced by Josh Berkus in a 2005 discussion on the Postgres performance list (see e.g. http://grokbase.com/t/postgresql/pgsql-performance/054kztf8pf/bad-n-distinct-estimation-hacks-suggested for the whole discussion). The formula given there for the total number of distinct values is:
n*d / (n - f1 + f1*n/N)
where f1 is the number of values that occurred only once in the sample. n is the number of rows sampled, d the number of distincts found and N the total number of rows in the table.
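To make the behavior of this formula concrete, here is a minimal sketch in Python (the numbers are made up purely for illustration):

```python
# Haas & Stokes estimator as given above:
#   n  = rows sampled, N = total rows in the table,
#   d  = distinct values found in the sample,
#   f1 = values that occurred exactly once in the sample.
def estimate_n_distinct(n, N, d, f1):
    return n * d / (n - f1 + f1 * n / N)

# Many singletons in the sample scale the estimate up well beyond d:
print(estimate_n_distinct(1000, 1000000, 800, 400))  # ~1332.4
# With f1 = 0 the estimate collapses to d itself:
print(estimate_n_distinct(1000, 1000000, 800, 0))    # 800.0
```

The second call shows the failure mode discussed below: once no value is seen exactly once, the estimate is simply the number of distincts in the sample.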
Now, the 2005 discussion goes into great detail on the advantages and disadvantages of this algorithm, particularly when using small sample sizes, and several alternatives are discussed. I do not know whether anything has been changed since then, but I know that the specific problem I will focus on here still persists.
When the number of values that are found only once in the sample (f1) becomes zero, the whole term equals d, that is, n_distinct is estimated to be just the number of distincts found in the sample.
This is basically fine, as it should only happen when the sample has covered more or less all distinct values. However, there is a sampling problem: for maximum efficiency, Postgres samples not random rows but random pages. If the distribution of the values is not random but clustered (that is, the same values tend to lie close together), we run into trouble: when a page covers multiple adjacent rows, the probability that a value from a clustered distribution is sampled only once is very low.
So, under these circumstances, the estimate for n_distinct will always be close to the number of distincts found in the sample - even if every value in fact appears only a few times in the table.
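The effect can be reproduced outside Postgres with a toy simulation (a sketch, not the actual ANALYZE code; the table layout and sample sizes are invented for illustration): with clustered data, sampling whole pages yields almost no singletons, while sampling individual rows yields plenty.

```python
import random
from collections import Counter

# Toy table: 100,000 rows, each value repeated in a run of 5 adjacent
# rows (clustered), stored 100 rows per page.
ROWS, RUN, PER_PAGE = 100000, 5, 100
table = [i // RUN for i in range(ROWS)]
pages = [table[p:p + PER_PAGE] for p in range(0, ROWS, PER_PAGE)]

def f1(sample):
    """Number of values occurring exactly once in the sample."""
    return sum(1 for c in Counter(sample).values() if c == 1)

random.seed(1)
row_sample = random.sample(table, 1000)                           # 1,000 random rows
page_sample = [v for pg in random.sample(pages, 10) for v in pg]  # 10 whole pages

print("f1 with row sampling: ", f1(row_sample))   # many singletons
print("f1 with page sampling:", f1(page_sample))  # 0 - runs never straddle pages here
```

With f1 at or near zero, the Haas & Stokes formula degenerates to the sample's distinct count, which is exactly the underestimation seen in the test case below.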
Relevance:
I think this is not just an unfortunate border case but a very common situation. Imagine two tables that are filled continually over time, where the second table references the first - some objects and multiple properties for each, for example. The foreign key column of the second table will then have many distinct values but a highly clustered distribution. It is probably not helpful if the planner significantly underestimates the high selectivity of the foreign key column.
Workarounds:
There are workarounds: manually setting table column statistics or using an extremely high statistics target, so that the whole table gets analyzed. However, these workarounds do not seem elegant and may be impractical.
Questions:
A) Did I find the correct reason for my problem? Specifically, does Postgres really estimate n_distinct as described above?
B) Are there any elegant workarounds?
C) What could be a programmatic solution to this problem? I think it might be possible to use the number of values that are found in only one page (vs. found only once at all) for f1. Or could the number of distincts be calculated using some completely different approach?
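As a rough illustration of the per-page idea (hypothetical - this is not an existing Postgres mechanism): f1 could count values that occur in only one sampled page, so that a clustered run, which sits entirely inside one page, still registers as "rare":

```python
from collections import Counter

def f1_by_page(sampled_pages):
    """Count values that appear in exactly one of the sampled pages."""
    pages_seen = Counter()
    for page in sampled_pages:
        for v in set(page):   # each page counts at most once per value
            pages_seen[v] += 1
    return sum(1 for c in pages_seen.values() if c == 1)

# A clustered run like 1,1,1 occupies a single page, so it still
# contributes to f1 here, keeping the estimate from collapsing to d:
print(f1_by_page([[1, 1, 1, 2], [3, 3, 4, 4]]))  # 4
print(f1_by_page([[1, 2], [2, 3]]))              # 2 (value 2 spans both pages)
```

Whether such a page-level f1 keeps the Haas & Stokes error bounds is an open question; it is meant only to make question C concrete.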
Test Case:
For an artificial test case, let's create a table and fill it with 10 million rows (approx. 1,300 MB required). There is an ID column with unique values and 4 groups of 3 columns each, with an average number of rows per distinct value of:
- 5 (x_2000k = 2,000,000 distinct values)
- 25 (x_400k = 400,000 distinct values)
- 125 (x_80k = 80,000 distinct values).
The 4 groups of columns show different distributions:
- clustered and ordered (e.g. 1,1,1,2,2,2,3,3,3): clustered_ordered_x
- clustered but random values (e.g. 2,2,2,7,7,7,4,4,4): clustered_random_x
- uniform (e.g. 1,2,3,1,2,3,1,2,3): uniform_x
- random (e.g., well, random, you know): random_x
Here we go:
CREATE UNLOGGED TABLE test_1
(id BIGINT,
clustered_ordered_2000k BIGINT, clustered_ordered_400k BIGINT, clustered_ordered_80k BIGINT,
clustered_random_2000k BIGINT, clustered_random_400k BIGINT, clustered_random_80k BIGINT,
uniform_2000k BIGINT, uniform_400k BIGINT, uniform_80k BIGINT,
random_2000k BIGINT, random_400k BIGINT, random_80k BIGINT);
WITH q1 AS (SELECT generate_series(1,10000000) AS i, random() AS r),
q AS (SELECT q1.i, q1.r, trunc(sub_2000k.r * 10000000) AS r_2000k, trunc(sub_400k.r * 10000000) AS r_400k, trunc(sub_80k.r * 10000000) AS r_80k FROM q1
JOIN q1 AS sub_2000k ON sub_2000k.i - 1 = trunc((q1.i - 1) / 5)
JOIN q1 AS sub_400k ON sub_400k.i - 1 = trunc((q1.i - 1) / 25)
JOIN q1 AS sub_80k ON sub_80k.i - 1 = trunc((q1.i - 1) / 125)
ORDER BY q1.i)
INSERT INTO test_1
SELECT q.i,
trunc((q.i + 4) / 5), trunc((q.i + 24) / 25), trunc((q.i + 124) / 125),
q.r_2000k, q.r_400k, q.r_80k,
trunc(q.i % 2000000), trunc(q.i % 400000), trunc(q.i % 80000),
trunc(q.r * 2000000), trunc(q.r * 400000), trunc(q.r * 80000)
FROM q;
Now let's query the real numbers of distinct values:
SELECT colname, distincts FROM
(SELECT 'id' AS colname, COUNT(DISTINCT id) AS distincts FROM test_1 UNION
SELECT 'clustered_ordered_2000k' AS colname, COUNT(DISTINCT clustered_ordered_2000k) AS distincts FROM test_1 UNION
SELECT 'clustered_ordered_400k' AS colname, COUNT(DISTINCT clustered_ordered_400k) AS distincts FROM test_1 UNION
SELECT 'clustered_ordered_80k' AS colname, COUNT(DISTINCT clustered_ordered_80k) AS distincts FROM test_1 UNION
SELECT 'clustered_random_2000k' AS colname, COUNT(DISTINCT clustered_random_2000k) AS distincts FROM test_1 UNION
SELECT 'clustered_random_400k' AS colname, COUNT(DISTINCT clustered_random_400k) AS distincts FROM test_1 UNION
SELECT 'clustered_random_80k' AS colname, COUNT(DISTINCT clustered_random_80k) AS distincts FROM test_1 UNION
SELECT 'uniform_2000k' AS colname, COUNT(DISTINCT uniform_2000k) AS distincts FROM test_1 UNION
SELECT 'uniform_400k' AS colname, COUNT(DISTINCT uniform_400k) AS distincts FROM test_1 UNION
SELECT 'uniform_80k' AS colname, COUNT(DISTINCT uniform_80k) AS distincts FROM test_1 UNION
SELECT 'random_2000k' AS colname, COUNT(DISTINCT random_2000k) AS distincts FROM test_1 UNION
SELECT 'random_400k' AS colname, COUNT(DISTINCT random_400k) AS distincts FROM test_1 UNION
SELECT 'random_80k' AS colname, COUNT(DISTINCT random_80k) AS distincts FROM test_1) AS sub
ORDER BY colname;
colname | distincts
-------------------------+-----------
clustered_ordered_2000k | 2000000
clustered_ordered_400k | 400000
clustered_ordered_80k | 80000
clustered_random_2000k | 1811948
clustered_random_400k | 391881
clustered_random_80k | 79681
id | 10000000
random_2000k | 1986619
random_400k | 400000
random_80k | 80000
uniform_2000k | 2000000
uniform_400k | 400000
uniform_80k | 80000
-> So we got what we asked for.
Because the table's rows are not very wide, the default statistics target would sample about a quarter of the table and the effect would be less clear, so we decrease the statistics target:
SET default_statistics_target = 10;
ANALYZE VERBOSE test_1;
SELECT attname, n_distinct, correlation FROM pg_stats WHERE tablename = 'test_1'
ORDER BY attname;
attname | n_distinct | correlation
-------------------------+------------+-------------
clustered_ordered_2000k | 51487 | 1
clustered_ordered_400k | 9991 | 1
clustered_ordered_80k | 3752 | 1
clustered_random_2000k | 51487 | 0.00938534
clustered_random_400k | 9991 | 0.00373309
clustered_random_80k | 3748 | -0.0461863
id | -1 | 1
random_2000k | -0.310305 | 0.00140735
random_400k | 289890 | 0.00140921
random_80k | 71763 | 0.00142101
uniform_2000k | -0.310305 | 0.209842
uniform_400k | -0.101016 | 0.0259991
uniform_80k | 74227 | 0.0154193
-> Estimates for the random and uniform distributions are really good. But for the clustered distributions, estimates are off by a factor of 20 to 40.
And clean up:
DROP TABLE test_1;
On 29 December 2012 20:57, Stefan Andreatta <s.andreatta@synedra.com> wrote:
> Now, the 2005 discussion goes into great detail on the advantages and
> disadvantages of this algorithm, particularly when using small sample sizes,
> and several alternatives are discussed. I do not know whether anything has
> been changed after that, but I know that the very distinct problem, which I
> will focus on here, still persists.

It's a really hard problem to solve satisfactorily. It's a problem that has been studied in much detail. Yes, the algorithm used is still the same. See the comments within src/backend/commands/analyze.c (IBM Research Report RJ 10025 is referenced there).

The general advice here is:

1) Increase default_statistics_target for the column.

2) If that doesn't help, consider using the following DDL:

alter table foo alter column bar set (n_distinct = 5.0);

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
> On 29 December 2012 20:57, Stefan Andreatta <s.andreatta@synedra.com> wrote:
>> Now, the 2005 discussion goes into great detail on the advantages and
>> disadvantages of this algorithm, particularly when using small sample sizes,
>> and several alternatives are discussed. I do not know whether anything has
>> been changed after that, but I know that the very distinct problem, which I
>> will focus on here, still persists.
>
> It's a really hard problem to solve satisfactorily. It's a problem
> that has been studied in much detail. Yes, the algorithm used is still
> the same. See the comments within src/backend/commands/analyze.c (IBM
> Research Report RJ 10025 is referenced there).

Thanks a lot for this information! I looked through the code a bit. The Haas & Stokes formula is fine. The problem really lies with the two-stage random selection procedure.

Starting from line 1039, there is a comment:

 * As of May 2004 we use a new two-stage method: Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks). The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.
 *
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected. For large relations
 * the number of different blocks represented by the sample tends to be
 * too small. We can live with that for now. Improvements are welcome.

Now the problem with clustered data is that the probability of sampling a value twice is much higher when the same page is repeatedly sampled. As stage one takes a random sample of pages, and stage two samples rows from these pages, the probability of visiting the same page twice (or more often) is much higher than if random rows were selected from the whole table. Hence we get a lot more repeated values for clustered data, and we end up with the severe underestimation we can see in those cases.

Probabilities do my brain in, as usual, but I tested the procedure for my test data with a simple Python script. There is absolutely nothing wrong with the implementation. It seems to be a purely statistical problem.

Not everything may be hopeless, though ;-) The problem could theoretically be avoided if random rows were selected from the whole table. Again, that may not be feasible - the two-stage approach was probably not implemented for nothing.

Another possible solution would be to avoid much of the resampling (not all) in stage two. For that - in theory - every page visited would have to get a lower weight, so that revisiting it is no more likely than if rows were selected from the whole table. That does not sound easy or elegant to implement. But perhaps there is some clever algorithm - unfortunately, I do not know one.

> The general advice here is:
>
> 1) Increase default_statistics_target for the column.
>
> 2) If that doesn't help, consider using the following DDL:
>
> alter table foo alter column bar set ( n_distinct = 5.0);

Yes, I will probably have to live with that for now - I will come back to these workarounds with one or two questions.

Thanks again & Regards,
Stefan
On Sat, Dec 29, 2012 at 5:57 PM, Stefan Andreatta <s.andreatta@synedra.com> wrote:
> n*d / (n - f1 + f1*n/N)
>
> where f1 is the number of values that occurred only once in the sample. n is
> the number of rows sampled, d the number of distincts found and N the total
> number of rows in the table.
> ...
> When the number of values that are found only once in the sample (f1)
> becomes zero, the whole term equals d, that is, n_distinct is estimated to
> be just the number of distincts found in the sample.

I think the problem lies in the assumption that if there are no singly-sampled values, then the sample must have included all distinct values. This is clearly not true even for a fully random sample; it only means the sampled distinct values appear frequently enough to show up more than once in the sample.

In the clustered case, the error would be evident if the randomly-sampled pages were split into two samples and considered separately. The distinct set in one would not match the distinct set in the other, and the size of the intersection would say something about the real number of distinct values in the whole population.
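The split-sample idea can be sketched with a capture-recapture (Lincoln-Petersen) style estimate - a hypothetical illustration, not anything ANALYZE does: treat the two half-samples as two capture rounds; a small overlap between their distinct sets implies many unseen values.

```python
def capture_recapture_distinct(sample_a, sample_b):
    """Lincoln-Petersen-style estimate of the total number of distinct values."""
    da, db = set(sample_a), set(sample_b)
    overlap = len(da & db)
    if overlap == 0:
        return float('inf')   # no shared values: population looks unbounded
    return len(da) * len(db) / overlap

# Two half-samples of a clustered column that barely overlap suggest
# far more distinct values than either half saw on its own:
print(capture_recapture_distinct([1, 1, 2, 2, 3], [3, 3, 4, 4, 5]))  # 9.0
```

This only yields a sanity check, not a replacement estimator, but it would flag exactly the clustered cases where f1-based estimation collapses.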
On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
> The general advice here is:
>
> 1) Increase default_statistics_target for the column.

I tried that, but to get good estimates under these circumstances I need to set the statistics target so high that the whole table gets analyzed. As this problem matters most for our large tables, I would have to set default_statistics_target to something like 100000 - that's a bit scary for production systems with tables of appr. 100 GB, I find.

> 2) If that doesn't help, consider using the following DDL:
>
> alter table foo alter column bar set ( n_distinct = 5.0);

Yes, that's probably best - even if it means quite some maintenance work. I do it like that:

ALTER TABLE test_1 ALTER COLUMN clustered_random_2000k SET (n_distinct = -0.05);

btw: Postgres itself will never set relative n_distinct values for anything larger than -0.1. If I determine (or know) it to be a constant but lower fraction, could it be a problem to explicitly set this value to between -0.1 and 0?

To activate the setting, an ANALYZE has to be run - that was not clear to me from the documentation:

ANALYZE VERBOSE test_1;

To check column options and statistics values:

SELECT pg_class.relname AS table_name,
       pg_attribute.attname AS column_name,
       pg_attribute.attoptions
  FROM pg_attribute
  JOIN pg_class ON pg_attribute.attrelid = pg_class.oid
 WHERE pg_attribute.attnum > 0
   AND pg_class.relname = 'test_1'
   AND pg_attribute.attname = 'clustered_random_2000k';

SELECT tablename AS table_name, attname AS column_name,
       null_frac, avg_width, n_distinct, correlation
  FROM pg_stats
 WHERE tablename = 'test_1' AND attname = 'clustered_random_2000k';

And finally, we can undo the whole thing, if necessary:

ALTER TABLE test_1 ALTER COLUMN clustered_random_2000k RESET (n_distinct);
ANALYZE VERBOSE test_1;

Regards,
Stefan
A status update on this problem:

1.) Workarounds (setting n_distinct manually) are tested and - as far as workarounds go - OK.

2.) Source of the problem and possible solution: the source of these troubles is the sampling method employed in src/backend/commands/analyze.c. Judging from Tom Lane's comment on the original implementation in 2004, it has never been thought to be perfect. Does anybody see a chance to improve that part? Should this discussion be taken elsewhere? Is there any input from my side that could help?

btw: I do find this problem to be very frequent in our databases. And considering the commonplace conditions leading to it, I would expect many systems to be affected. But searching the forums and the web I hardly found any references to it - which amazes me to no end.

Best Regards,
Stefan
On 14 January 2013 07:35, Stefan Andreatta <s.andreatta@synedra.com> wrote:
> The source of these troubles is the sampling method employed in
> src/backend/commands/analyze.c. Judging from Tom Lane's comment for the
> original implementation in 2004 this has never been thought to be perfect.
> Does anybody see a chance to improve that part? Should this discussion be
> taken elsewhere? Is there any input from my side that could help?

Numerous alternative algorithms exist, as this has been an area of great interest for researchers for some time. Some alternatives may even be objectively better than Haas & Stokes. A quick peruse through the archives shows that Simon Riggs once attempted to introduce an algorithm described in the paper "A Block Sampling Approach to Distinct Value Estimation":

http://www.stat.washington.edu/research/reports/1999/tr355.pdf

However, the word on the street is that it may be worth pursuing some of the ideas described by the literature in just the last few years. I've often thought that this would be an interesting problem to work on. I haven't had time to pursue it, though. You may wish to propose a patch on the pgsql-hackers mailing list.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services