Thread: Problem with n_distinct being consistently inaccurate.

Problem with n_distinct being consistently inaccurate.

From: "Nick Fankhauser"
Hi-

I tried to post this to the performance list, but that list seems to have a
problem at the moment. I think the question fits "admin" as well:

I have a table called "event" with an indexed field, event_date_time.
There are 1,700,000 rows in the table and 92,000 distinct values of
event_date_time, with anywhere from 1 to 400 rows sharing the same value.
(I did a count grouped by event_date_time and scanned the results to get
this info.)
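
For reference, the check was more or less this:

select event_date_time, count(*)
from event
group by event_date_time;
-- roughly 92,000 rows come back, with counts ranging from 1 to about 400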

When I look at pg_stats for this table after running analyze with the
defaults, I always see 15,000 or lower in the n_distinct column for
event_date_time. I re-ran analyze several times and then checked pg_stats
to see whether the numbers varied significantly.

Since this is off by about a factor of 6, I think the planner is missing the
chance to use this table as the "driver" in a complex query plan that I'm
trying to optimize.

So the question is- how can I get a better estimate of n_distinct from
analyze?

If I alter the stats target as high as it will go, I get closer, but it
still shows the index to be about 1/2 as selective as it actually is:

alpha=# alter table event alter column event_date_time set statistics 1000;
ALTER TABLE
alpha=# analyze event;
ANALYZE
alpha=# select n_distinct from pg_stats where tablename='event' and
attname='event_date_time';
 n_distinct
------------
      51741
(1 row)

This number seems to be consistently around 51,000 if I re-run analyze a few
times.

I guess my question is two-part:

(1) Is there any tweak to make this estimate work better?

(2) Since I'm getting numbers that are consistent but way off, is there a
bug here?

(2-1/2) Or, alternatively, am I totally missing what n_distinct is supposed
to contain?

Thanks!
   -Nick

---------------------------------------------------------------------
Nick Fankhauser

    nickf@doxpop.com  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/



Re: Problem with n_distinct being consistently inaccurate.

From: Tom Lane
"Nick Fankhauser" <nickf@ontko.com> writes:
> So the question is- how can I get a better estimate of n_distinct from
> analyze?
> If I alter the stats target as high as it will go, I get closer, but it
> still shows the index to be about 1/2 as selective as it actually is:

AFAIK, estimating number of distinct values from a small sample is
inherently an ill-conditioned problem.   You should probably be happy
it can get within a factor of 2 ;-).

You could try sticking the correct n_distinct into pg_statistic by hand
just to see if it really does change the plan, but I'd like to think
that getting within a factor of 2 is good enough.  If it's not, then we
probably ought to look for ways to avoid using number-of-distinct-values
statistics altogether, because we'll seldom have a hard value for it.

            regards, tom lane

Re: Problem with n_distinct being consistently inaccurate.

From: "Nick Fankhauser"
> AFAIK, estimating number of distinct values from a small sample is
> inherently an ill-conditioned problem.

If I had been getting estimates all over the map, I'd be less concerned,
but what I'm seeing is a very consistent number that increases, and gets
more consistent, as the stats "target" goes up. This makes me think that
there is more at work here than the inaccuracy you'd expect from small
samples. It's as if the algorithm and sample size (even at the default)
are quite capable of returning consistent results in this case, but a
multiplier needs to be changed.

For instance, with the various values for statistics, if I do an analyze
on the table and then look at n_distinct six times, these are the results
I get (the statements I repeated for each target are shown after the
numbers):

(actual number is 92,000)

set statistics = -1 (default):
13549
14268
14772
14518
13863
13526

mean = 14083
std dev = 518 or 3.7% of mean


set statistics = 100
22457
22598
22566
22580
22767
22490

mean = 22576
std dev = 108 or .5% of mean


set statistics = 500
39878
39984
40018
39977
39885
40070

mean = 39968
std dev = 75 or .2% of mean


set statistics = 1000
51428
51503
51486
51658
51625
51589

mean = 51548
std dev = 74 or .1% of mean
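
(In each case the procedure was the same: set the statistics target once,
then repeat the two statements below six times and record each value.)

alter table event alter column event_date_time set statistics 100; -- or 500, 1000; -1 restores the default
analyze event;
select n_distinct from pg_stats
where tablename='event' and attname='event_date_time';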


> You could try sticking the correct n_distinct into pg_statistic by hand
> just to see if it really does change the plan

OK... but I'm a bit confused about how to get to the right row in
pg_statistic.

When I do a \d on pg_stats, it appears that pg_statistic.starelid matches
up with pg_class.oid, but apparently this is not the case. Is there a place
I can look to find which keys correspond among the pg_catalog tables?


> but I'd like to think
> that getting within a factor of 2 is good enough.

Probably so... but with the default stats, it is more like a factor of 6,
which seems significant to me, and if my conjecture is correct, it might be
an easy fix. (Easy for me to say, since I'm not a developer. <grin>)

-Nick




Re: Problem with n_distinct being consistently inaccurate.

From: Tom Lane
"Nick Fankhauser" <nickf@ontko.com> writes:
> When I do a \d on pg_stats, it appears that pg_statistic.starelid matches
> up with pg_class.oid, but apparently this is not the case.

It certainly should be the case.  starelid matches to pg_class.oid and
staattnum matches to pg_attribute.attnum.

            regards, tom lane

Re: Problem with n_distinct being consistently inaccurate.

From: "Nick Fankhauser"
> It certainly should be the case.  starelid matches to pg_class.oid and
> staattnum matches to pg_attribute.attnum.

My problem was that I was looking up "event_date_time" in pg_class.relname
(and finding it), but the oid matched nothing. When I looked for 'event' in
pg_class and 'event_date_time' in pg_attribute, everything became clear.
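
For anyone else who trips over this, the hand-update ended up looking
roughly like so (run as superuser, since pg_statistic isn't normally
writable):

update pg_statistic
   set stadistinct = 92000
 where starelid = (select oid from pg_class where relname = 'event')
   and staattnum = (select a.attnum
                      from pg_attribute a, pg_class c
                     where a.attrelid = c.oid
                       and c.relname = 'event'
                       and a.attname = 'event_date_time');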

So... I updated stadistinct to the actual value of 92,000 as you suggested
and as you predicted, it did not change the plan a bit. I guess I'll look
elsewhere for ways to optimize this query.

Thanks!

-Nick



Re: Problem with n_distinct being consistently inaccurate.

From: Tom Lane
"Nick Fankhauser" <nickf@ontko.com> writes:
> So... I updated stadistinct to the actual value of 92,000 as you suggested
> and as you predicted, it did not change the plan a bit.

Just out of curiosity, what happens if you make it bigger than 92k?
Does a value 10x or 100x reality change the plan?

            regards, tom lane

Re: Problem with n_distinct being consistently inaccurate.

From: "Nick Fankhauser"
> Just out of curiosity, what happens if you make it bigger than 92k?
> Does a value 10x or 100x reality change the plan?

Neither one makes a change, so perhaps something else is at work here. My
understanding of the finer points of query plans is shaky, but here is the
query and the plan I'm getting:

alpha=# SELECT
alpha-# event.event_date_time AS start_time,
alpha-# event.event_duration_minutes AS duration_minutes,
alpha-# event.event_ical_status AS status,
alpha-# event.event_location_name AS "location",
alpha-# event.event_room, event.event_type_code AS category,
alpha-# event.event_hearing_type_desc AS sub_category,
alpha-# event.event_summary AS summary,
alpha-# event.event_pk,
alpha-# event.event_revision_number,
alpha-# case_data.case_public_id,
alpha-# case_data.case_title,
alpha-# court_config.court_name AS court,
alpha-# event.event_id,
alpha-# NULL::"unknown" AS related_event_id,
alpha-# case_data.case_id,
alpha-# court_config.court_id,
alpha-# actor_case_assignment.actor_id,
alpha-# actor_identifier.actor_identifier_text AS actor_identifier,
alpha-# actor_identifier.actor_identifier_type
alpha-# FROM
alpha-# actor_identifier,
alpha-# actor_case_assignment,
alpha-# case_data, event,
alpha-# court_config
alpha-# WHERE
alpha-# (
alpha(#     (
alpha(#         (
alpha(#             (actor_identifier.actor_id = actor_case_assignment.actor_id)
alpha(#             AND
alpha(#             (actor_case_assignment.case_id = case_data.case_id)
alpha(#         )
alpha(#         AND
alpha(#         (case_data.case_id = event.case_id)
alpha(#     )
alpha(#     AND
alpha(#     (case_data.court_id = court_config.court_id)
alpha(# )
alpha-# and actor_identifier.actor_identifier_text='07314-20'
alpha-# and actor_identifier.actor_identifier_type = 'AttorneyStateBarID'
alpha-# and event_date_time >= '06/01/2003'
alpha-# and event_date_time <= '06/30/2003';


 Hash Join  (cost=2702.10..2703.83 rows=1 width=510)
   Hash Cond: ("outer".court_id = "inner".court_id)
   ->  Seq Scan on court_config  (cost=0.00..1.48 rows=48 width=39)
   ->  Hash  (cost=2702.10..2702.10 rows=1 width=471)
         ->  Nested Loop  (cost=0.00..2702.10 rows=1 width=471)
               Join Filter: ("outer".case_id = "inner".case_id)
               ->  Nested Loop  (cost=0.00..2698.10 rows=1 width=397)
                     ->  Nested Loop  (cost=0.00..2602.13 rows=13 width=90)
                           ->  Index Scan using actor_identifier_actor_text on actor_identifier  (cost=0.00..6.63 rows=1 width=55)
                                 Index Cond: ((actor_identifier_text = '07314-20'::character varying) AND (actor_identifier_type = 'AttorneyStateBarID'::character varying))
                           ->  Index Scan using actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2229.73 rows=1758 width=35)
                                 Index Cond: ("outer".actor_id = actor_case_assignment.actor_id)
                     ->  Index Scan using event_case_id on event  (cost=0.00..7.46 rows=1 width=307)
                           Index Cond: (event.case_id = "outer".case_id)
                           Filter: ((event_date_time >= '2003-06-01 00:00:00-05'::timestamp with time zone) AND (event_date_time <= '2003-06-30 00:00:00-05'::timestamp with time zone))
               ->  Index Scan using case_data_case_id on case_data  (cost=0.00..3.98 rows=1 width=74)
                     Index Cond: (case_data.case_id = "outer".case_id)
(17 rows)

What I'm trying to avoid is the Filter on event_date_time. It seems to me
that an index condition could be used to advantage here, and that if this
table "drove" the rest of the plan, it should work nicely. What got me
headed down this path is that if I make this particular table much smaller
by eliminating all events in the past, the performance of the query becomes
very good (although Filter is still used). Then I looked at the
event_date_time field and thought that it looked pretty selective. Am I
correct in thinking that Filter means the index is not being used?

My guess is that although I know the values are evenly distributed over 10
years, the planner has no way of knowing that all of the events don't fall
in the month I've specified.
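
I suppose a quick way to see whether the event_date_time index is even
considered would be to look at the date range by itself, something like:

explain
select count(*)
  from event
 where event_date_time >= '06/01/2003'
   and event_date_time <= '06/30/2003';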



Re: Problem with n_distinct being consistently inaccurate.

From: Tom Lane
"Nick Fankhauser" <nickf@ontko.com> writes:
> Neither one makes a change, so perhaps something else is at work here. My
> understanding of the finer points of query plans is shaky, but here is the
> query and the plan I'm getting:

Well, one thing that's not a fine point is that if you are complaining
about poor planning, it's important to offer EXPLAIN ANALYZE output, not
just EXPLAIN output ... the difference between the planner's ideas and
reality is exactly the issue ...

            regards, tom lane

Re: Problem with n_distinct being consistently inaccurate.

From: "Nick Fankhauser"
Tom-

Thanks for helping me work through my question about the effect of
n_distinct. Now that I understand how little this statistic affects the
query plan, I'm going to spend some time trying to puzzle it out more on my
own before coming back to the list for advice.

Thanks also for pointing out that I should be using explain analyze
instead of plain explain. I didn't realize the option existed, and I will
use it henceforth.

I apologize if the tone of my messages sounded like complaining. My
interest is simply to find a solution to a problem I'm having in my
production environment. Problems with Postgres are few, but when I run
into something I don't understand, I value your willingness to help out,
and I certainly don't want to imply that I'm complaining to the folks who
have provided me with a free database and, more importantly, free
education and support.

Thanks (I mean it!)

-Nick

---------------------------------------------------------------------
Nick Fankhauser

    nickf@doxpop.com  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/