Re: Problem with n_distinct being consistently inaccurate. - Mailing list pgsql-admin

From Nick Fankhauser
Subject Re: Problem with n_distinct being consistently inaccurate.
Date
Msg-id NEBBLAAHGLEEPCGOBHDGMEJCIJAA.nickf@ontko.com
Whole thread Raw
In response to Re: Problem with n_distinct being consistently inaccurate.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Problem with n_distinct being consistently inaccurate.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-admin
> Just out of curiosity, what happens if you make it bigger than 92k?
> Does a value 10x or 100x reality change the plan?

Neither one makes a change- perhaps something else is at work here- my
understanding of the finer points of query plans is shaky- Here is the query
and the plan I'm getting:

alpha-# event.event_date_time AS start_time,
alpha-# event.event_duration_minutes AS duration_minutes,
alpha-# event.event_ical_status AS status,
alpha-# event.event_location_name AS "location",
alpha-# event.event_room, event.event_type_code AS category,
alpha-# event.event_hearing_type_desc AS sub_category,
alpha-# event.event_summary AS summary,
alpha-# event.event_pk,
alpha-# event.event_revision_number,
alpha-# case_data.case_public_id,
alpha-# case_data.case_title,
alpha-# court_config.court_name AS court,
alpha-# event.event_id,
alpha-# NULL::"unknown" AS related_event_id,
alpha-# case_data.case_id,
alpha-# court_config.court_id,
alpha-# actor_case_assignment.actor_id,
alpha-# actor_identifier.actor_identifier_text AS actor_identifier,
alpha-# actor_identifier.actor_identifier_type
alpha-# FROM
alpha-# actor_identifier,
alpha-# actor_case_assignment,
alpha-# case_data, event,
alpha-# court_config
alpha-# WHERE
alpha-# (
alpha(#     (
alpha(#         (
alpha(#             (actor_identifier.actor_id =
actor_case_assignment.actor_id)
alpha(#             AND
alpha(#             (actor_case_assignment.case_id = case_data.case_id)
alpha(#         )
alpha(#         AND
alpha(#         (case_data.case_id = event.case_id)
alpha(#     )
alpha(#     AND
alpha(#     (case_data.court_id = court_config.court_id)
alpha(# )
alpha-# and actor_identifier.actor_identifier_text='07314-20'
alpha-# and actor_identifier.actor_identifier_type = 'AttorneyStateBarID'
alpha-# and event_date_time >= '06/01/2003'
alpha-# and event_date_time <= '06/30/2003';


 Hash Join  (cost=2702.10..2703.83 rows=1 width=510)
   Hash Cond: ("outer".court_id = "inner".court_id)
   ->  Seq Scan on court_config  (cost=0.00..1.48 rows=48 width=39)
   ->  Hash  (cost=2702.10..2702.10 rows=1 width=471)
         ->  Nested Loop  (cost=0.00..2702.10 rows=1 width=471)
               Join Filter: ("outer".case_id = "inner".case_id)
               ->  Nested Loop  (cost=0.00..2698.10 rows=1 width=397)
                     ->  Nested Loop  (cost=0.00..2602.13 rows=13 width=90)
                           ->  Index Scan using actor_identifier_actor_text
on actor_identifier  (cost=0.00..6.63 rows=1 width=55)
                                 Index Cond: ((actor_identifier_text =
'07314-20'::character varying) AND (actor_identifier_type =
'AttorneyStateBarID'::character varying))
                           ->  Index Scan using
actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2229.73
rows=1758 width=35)
                                 Index Cond: ("outer".actor_id =
actor_case_assignment.actor_id)
                     ->  Index Scan using event_case_id on event
(cost=0.00..7.46 rows=1 width=307)
                           Index Cond: (event.case_id = "outer".case_id)
                           Filter: ((event_date_time >= '2003-06-01
00:00:00-05'::timestamp with time zone) AND (event_date_time <= '2003-06-30
00:00:00-05'::timestamp with time zone))
               ->  Index Scan using case_data_case_id on case_data
(cost=0.00..3.98 rows=1 width=74)
                     Index Cond: (case_data.case_id = "outer".case_id)
(17 rows)

What I'm trying to avoid is the Filter on event_date_time. It seems to me
that an index condition could be used to advantage here, and that if this
table "drove" the rest of the plan, it should work nicely. What got me
headed down this path is that if I make this particular table much smaller
by eliminating all events in the past, the performance on the query becomes
very good (Although Filter is still used). Then I looked at the
event_date_time field & thought that it looked pretty selective. Am i
coorect in thinking that Filter means that the index is not being used?

My guess is that the problem is that although I know the values are evenly
distributed over 10 years, the planner has no way of knowing that all of the
events don't occur in the month I've specified.



pgsql-admin by date:

Previous
From: Tom Lane
Date:
Subject: Re: Problem with n_distinct being consistently inaccurate.
Next
From: "Robert D. Abernethy IV"
Date:
Subject: Java and SSL