Thread: Problem with n_distinct being consistently inaccurate.
Hi-

I tried to post this to the performance list, but that list seems to have a
problem at the moment, and I think the question fits "admin" as well:

I have a table called "event" with a field event_date_time that is indexed.
There are 1,700,000 rows in the table and 92,000 distinct values of
event_date_time, with anywhere from 1 to 400 rows sharing the same value of
event_date_time. (I did a count grouped by event_date_time and scanned the
results to get this info.)

When I look at pg_stats on this table after running analyze with the
defaults, I always see 15,000 or lower in the n_distinct column for
event_date_time. I re-ran analyze several times and then checked pg_stats to
see whether the numbers varied significantly. Since this is off by about a
factor of 6, I think the planner is missing the chance to use this table as
the "driver" in a complex query plan that I'm trying to optimize.

So the question is: how can I get a better estimate of n_distinct from
analyze? If I alter the stats target as high as it will go, I get closer,
but it still shows the index to be about 1/2 as selective as it actually
is:

alpha=# alter table event alter column event_date_time set statistics 1000;
ALTER TABLE
alpha=# analyze event;
ANALYZE
alpha=# select n_distinct from pg_stats where tablename='event' and attname='event_date_time';
 n_distinct
------------
      51741
(1 row)

This number stays consistently around 51,000 if I re-run analyze a few
times.

I guess my question is two-part:

(1) Is there any tweak to make this estimate work better?

(2) Since I'm getting numbers that are consistent but way off, is there a
bug here?

(2-1/2) Or alternately, am I totally missing what n_distinct is supposed to
contain?

Thanks!

-Nick

---------------------------------------------------------------------
Nick Fankhauser  nickf@doxpop.com
Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/
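A minimal sketch of the kind of counting that produces the figures above
(both statements are full-table scans, so they take a while on a 1.7M-row
table; the table and column names are the ones used in this message):

  -- Actual number of distinct values (about 92,000 in this case):
  SELECT count(DISTINCT event_date_time) FROM event;

  -- Spread of rows per value (the "1 to 400 rows sharing the same value"):
  SELECT min(n) AS min_rows, max(n) AS max_rows
    FROM (SELECT count(*) AS n
            FROM event
           GROUP BY event_date_time) AS per_value;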
"Nick Fankhauser" <nickf@ontko.com> writes: > So the question is- how can I get a better estimate of n_distinct from > analyze? > If I alter the stats target as high as it will go, I get closer, but it > still shows the index to be about 1/2 as selective as it actually is: AFAIK, estimating number of distinct values from a small sample is inherently an ill-conditioned problem. You should probably be happy it can get within a factor of 2 ;-). You could try sticking the correct n_distinct into pg_statistic by hand just to see if it really does change the plan, but I'd like to think that getting within a factor of 2 is good enough. If it's not, then we probably ought to look for ways to avoid using number-of-distinct-values statistics altogether, because we'll seldom have a hard value for it. regards, tom lane
> AFAIK, estimating the number of distinct values from a small sample is
> inherently an ill-conditioned problem.

If I had been getting estimates all over the map, I'd have been a bit less
concerned, but what I'm seeing is a very consistent number that also
increases, and becomes more consistent, in proportion to the stats "target"
number. This makes me think that there is more at work here than the
inaccuracy likely to occur with small samples. It's as if the algorithm and
sample size (even at the default) are pretty reasonable for returning
consistent results in this case, but a multiplier needs to be changed.

For instance, with various values for statistics, if I do an analyze on the
table and then look at n_distinct 6 times, these are the results I get (the
actual number is 92,000):

set statistics = -1 (default):
  13549  14268  14772  14518  13863  13526
  mean = 14083, std dev = 518 (3.7% of mean)

set statistics = 100:
  22457  22598  22566  22580  22767  22490
  mean = 22576, std dev = 108 (0.5% of mean)

set statistics = 500:
  39878  39984  40018  39977  39885  40070
  mean = 39968, std dev = 75 (0.2% of mean)

set statistics = 1000:
  51428  51503  51486  51658  51625  51589
  mean = 51548, std dev = 74 (0.1% of mean)

> You could try sticking the correct n_distinct into pg_statistic by hand
> just to see if it really does change the plan

OK... but I'm a bit confused about how to get to the right row in
pg_statistic. When I do a \d on pg_stats, it appears that
pg_statistic.starelid matches up with pg_class.oid, but apparently this is
not the case. Is there a place I can look to find which keys correspond
among the pg_catalog tables?

> but I'd like to think
> that getting within a factor of 2 is good enough.

Probably so... but with the default stats, it is more like a factor of 6,
which seems significant to me, and if my conjecture is correct, it might be
an easy fix. (Easy for me to say, since I'm not a developer. <grin>)

-Nick
"Nick Fankhauser" <nickf@ontko.com> writes: > when I do a \d on pg_stats, it appears that pg_statistic.starelid matches up > with pg_class.oid, but apparently this is not the case. It certainly should be the case. starelid matches to pg_class.oid and staattnum matches to pg_attribute.attnum. regards, tom lane
> It certainly should be the case.  starelid matches to pg_class.oid and
> staattnum matches to pg_attribute.attnum.

My problem was that I was looking up "event_date_time" in pg_class.relname
(and finding it), but the oid matched nothing. When I looked for 'event' in
pg_class and 'event_date_time' in pg_attribute, everything became clear.

So... I updated stadistinct to the actual value of 92,000 as you suggested,
and as you predicted, it did not change the plan a bit. I guess I'll look
elsewhere for ways to optimize this query.

Thanks!

-Nick
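Roughly what that manual override might look like (a sketch only: it
requires superuser rights, and the next ANALYZE of the table will overwrite
the value again):

  UPDATE pg_statistic
     SET stadistinct = 92000
   WHERE starelid = (SELECT oid FROM pg_class WHERE relname = 'event')
     AND staattnum = (SELECT attnum
                        FROM pg_attribute
                       WHERE attrelid = (SELECT oid FROM pg_class
                                          WHERE relname = 'event')
                         AND attname = 'event_date_time');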
"Nick Fankhauser" <nickf@ontko.com> writes: > So... I updated stadistinct to the actual value of 92,000 as you suggested > and as you predicted, it did not change the plan a bit. Just out of curiosity, what happens if you make it bigger than 92k? Does a value 10x or 100x reality change the plan? regards, tom lane
> Just out of curiosity, what happens if you make it bigger than 92k?
> Does a value 10x or 100x reality change the plan?

Neither one makes a change, so perhaps something else is at work here; my
understanding of the finer points of query plans is shaky. Here is the
query and the plan I'm getting:

alpha=# SELECT
alpha-# event.event_date_time AS start_time,
alpha-# event.event_duration_minutes AS duration_minutes,
alpha-# event.event_ical_status AS status,
alpha-# event.event_location_name AS "location",
alpha-# event.event_room, event.event_type_code AS category,
alpha-# event.event_hearing_type_desc AS sub_category,
alpha-# event.event_summary AS summary,
alpha-# event.event_pk,
alpha-# event.event_revision_number,
alpha-# case_data.case_public_id,
alpha-# case_data.case_title,
alpha-# court_config.court_name AS court,
alpha-# event.event_id,
alpha-# NULL::"unknown" AS related_event_id,
alpha-# case_data.case_id,
alpha-# court_config.court_id,
alpha-# actor_case_assignment.actor_id,
alpha-# actor_identifier.actor_identifier_text AS actor_identifier,
alpha-# actor_identifier.actor_identifier_type
alpha-# FROM
alpha-# actor_identifier,
alpha-# actor_case_assignment,
alpha-# case_data, event,
alpha-# court_config
alpha-# WHERE
alpha-# (
alpha(# (
alpha(# (
alpha(# (actor_identifier.actor_id = actor_case_assignment.actor_id)
alpha(# AND
alpha(# (actor_case_assignment.case_id = case_data.case_id)
alpha(# )
alpha(# AND
alpha(# (case_data.case_id = event.case_id)
alpha(# )
alpha(# AND
alpha(# (case_data.court_id = court_config.court_id)
alpha(# )
alpha-# and actor_identifier.actor_identifier_text='07314-20'
alpha-# and actor_identifier.actor_identifier_type = 'AttorneyStateBarID'
alpha-# and event_date_time >= '06/01/2003'
alpha-# and event_date_time <= '06/30/2003';

 Hash Join  (cost=2702.10..2703.83 rows=1 width=510)
   Hash Cond: ("outer".court_id = "inner".court_id)
   ->  Seq Scan on court_config  (cost=0.00..1.48 rows=48 width=39)
   ->  Hash  (cost=2702.10..2702.10 rows=1 width=471)
         ->  Nested Loop  (cost=0.00..2702.10 rows=1 width=471)
               Join Filter: ("outer".case_id = "inner".case_id)
               ->  Nested Loop  (cost=0.00..2698.10 rows=1 width=397)
                     ->  Nested Loop  (cost=0.00..2602.13 rows=13 width=90)
                           ->  Index Scan using actor_identifier_actor_text on actor_identifier  (cost=0.00..6.63 rows=1 width=55)
                                 Index Cond: ((actor_identifier_text = '07314-20'::character varying) AND (actor_identifier_type = 'AttorneyStateBarID'::character varying))
                           ->  Index Scan using actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2229.73 rows=1758 width=35)
                                 Index Cond: ("outer".actor_id = actor_case_assignment.actor_id)
                     ->  Index Scan using event_case_id on event  (cost=0.00..7.46 rows=1 width=307)
                           Index Cond: (event.case_id = "outer".case_id)
                           Filter: ((event_date_time >= '2003-06-01 00:00:00-05'::timestamp with time zone) AND (event_date_time <= '2003-06-30 00:00:00-05'::timestamp with time zone))
               ->  Index Scan using case_data_case_id on case_data  (cost=0.00..3.98 rows=1 width=74)
                     Index Cond: (case_data.case_id = "outer".case_id)
(17 rows)

What I'm trying to avoid is the Filter on event_date_time. It seems to me
that an index condition could be used to advantage here, and that if this
table "drove" the rest of the plan, it should work nicely.

What got me headed down this path is that if I make this particular table
much smaller by eliminating all events in the past, performance on the query
becomes very good (although the Filter is still used). Then I looked at the
event_date_time field and thought that it looked pretty selective.
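As an aside (not part of the original exchange), a quick way to see whether
the planner would let the index on event_date_time "drive" at all is to
restrict event by the date range alone and look at the plan it picks:

  -- If this shows a Seq Scan rather than an Index Scan using the
  -- event_date_time index, the planner believes the date range matches too
  -- large a fraction of the table for an index scan to pay off.
  EXPLAIN
  SELECT event_pk
    FROM event
   WHERE event_date_time >= '06/01/2003'
     AND event_date_time <= '06/30/2003';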
Am I correct in thinking that "Filter" means that the index is not being
used? My guess is that the problem is that although I know the values are
evenly distributed over 10 years, the planner has no way of knowing that all
of the events don't occur in the month I've specified.
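For what it's worth, ANALYZE does store a histogram of the column's values,
which the planner consults when it estimates how many rows fall inside a
range; it can be inspected through the pg_stats view (a side note, not part
of the original exchange):

  -- Bucket boundaries the planner uses for range estimates on this column:
  SELECT histogram_bounds
    FROM pg_stats
   WHERE tablename = 'event'
     AND attname = 'event_date_time';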
"Nick Fankhauser" <nickf@ontko.com> writes: > Neither one makes a change- perhaps something else is at work here- my > understanding of the finer points of query plans is shaky- Here is the query > and the plan I'm getting: Well, one thing that's not a fine point is that if you are complaining about poor planning, it's important to offer EXPLAIN ANALYZE output not just EXPLAIN output ... the difference between the planner's ideas and reality is exactly the issue ... regards, tom lane
Tom-

Thanks for helping me work through my question about the effect of
n_distinct. Now that I understand how little this statistic affects the
query plan, I'm going to spend some time trying to puzzle it out more on my
own before coming back to the list for advice.

Thanks also for pointing out that I should be using explain analyze instead
of explain. I didn't realize the option existed until you pointed it out,
and will use it henceforth.

I apologize if the tone of my messages sounded like complaining. My interest
is simply to find a solution to a problem I'm having in my production
environment. Problems with Postgres are few, but when I run into something I
don't understand, I value your willingness to help out and certainly don't
want to imply that I'm complaining to the folks who have provided me with a
free database and, more importantly, free education and support.

Thanks (I mean it!)

-Nick

---------------------------------------------------------------------
Nick Fankhauser  nickf@doxpop.com
Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/