Thread: Nested loop question

From: "Nick Fankhauser - Doxpop"
Hi-

I'm trying to optimize a query that I *think* should run very fast.
Essentially, I'm joining two tables that have very selective indexes and
constraining the query on an indexed field. (There's a third small lookup
table in the mix, but it doesn't really affect the bottom line.)

actor is a table containing roughly 3 million rows with an index on
actor_full_name_uppercase and a unique index on actor_id.

actor_summary also contains roughly 3 million rows. Its PK is a unique
combined index on (actor_id, county_id, case_disp_global_code).

The vast majority of the rows in actor (I'd estimate 95% or more) correspond to
a single row in actor_summary. The remaining actors generally have two
corresponding rows in actor_summary. actor_summary was created as a performance
enhancer: it lets us store pre-calculated values such as the number of court
cases an actor is involved in.
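In case it helps, here's roughly what the relevant definitions look like (paraphrased from memory, so the exact names and DDL may differ):

```sql
-- Sketch of the tables/indexes described above; names are illustrative.
CREATE UNIQUE INDEX actor_pk ON actor (actor_id);
CREATE INDEX actor_full_name_uppercase ON actor (actor_full_name_uppercase);

-- actor_summary's PK is a unique combined index on three columns.
ALTER TABLE actor_summary
    ADD CONSTRAINT actor_summary_pk
    PRIMARY KEY (actor_id, county_id, case_disp_global_code);
```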

The constraint is applied first, with reasonable speed. In the example
below, it takes about 15 seconds to gather the matches in actor.

I'm unsure what is happening next. I notice that an index scan is occurring
on actor_summary_pk, with an "actual time" of 9.15, but then it looks like a
nested loop occurs at the next level to join these tables. Does this mean
that each probe of the actor_summary index will take 9.15 msec, but the
nested loop is going to do this once for each actor_id?

The nested loop appears to be where most of my time is going, so I'm
focusing on this area, but don't know if there is a better approach to this
join.

Is there a more efficient means than a nested loop to handle such a join?
Would a different method be chosen if there was exactly one row in
actor_summary for every row in actor?

-Nick

The query & explain analyze:


alpha=# explain analyze
alpha-#     select
alpha-#     min(actor.actor_id) as actor_id,
alpha-#     min(actor.actor_entity_type) as actor_entity_type,
alpha-#     min(actor.role_class_code) as role_class_code,
alpha-#     min(actor.actor_full_name) as actor_full_name,
alpha-#     min(actor.actor_person_date_of_birth) as actor_person_date_of_birth,
alpha-#     min(actor.actor_entity_acronym) as actor_entity_acronym,
alpha-#     min(actor.actor_person_last_name) as actor_person_last_name,
alpha-#     min(actor.actor_person_first_name) as actor_person_first_name,
alpha-#     min(actor.actor_person_middle_name) as actor_person_middle_name,
alpha-#     min(actor.actor_person_name_suffix) as actor_person_name_suffix,
alpha-#     min(actor.actor_person_place_of_birth) as actor_person_place_of_birth,
alpha-#     min(actor.actor_person_height) as actor_person_height,
alpha-#     min(actor.actor_person_height_unit) as actor_person_height_unit,
alpha-#     min(actor.actor_person_weight) as actor_person_weight,
alpha-#     min(actor.actor_person_weight_unit) as actor_person_weight_unit,
alpha-#     min(actor.actor_person_ethnicity) as actor_person_ethnicity,
alpha-#     min(actor.actor_person_citizenship_count) as actor_person_citizenship_count,
alpha-#     min(actor.actor_person_hair_color) as actor_person_hair_color,
alpha-#     min(actor.actor_person_scars_marks_tatto) as actor_person_scars_marks_tatto,
alpha-#     min(actor.actor_person_marital_status) as actor_person_marital_status,
alpha-#     min(actor.actor_alias_for_actor_id) as actor_alias_for_actor_id,
alpha-#     min(to_char(data_source.source_last_update, 'MM/DD/YYYY HH12:MI AM TZ')) as last_update,
alpha-#     min(actor_summary.single_case_public_id) as case_public_id,
alpha-#     min(actor_summary.single_case_id) as case_id,
alpha-#     sum(actor_summary.case_count) as case_count
alpha-#   from
alpha-#     actor,
alpha-#     actor_summary,
alpha-#     data_source
alpha-#   where
alpha-#     actor.actor_id = actor_summary.actor_id
alpha-#     and data_source.source_id = actor.source_id
alpha-#     and actor_full_name_uppercase like upper('sanders%')
alpha-#   group by
alpha-#     actor.actor_id
alpha-#   order by
alpha-#     min(actor.actor_full_name_uppercase),
alpha-#     case_count desc,
alpha-#     min(actor_summary.case_disp_global_code)
alpha-#   limit
alpha-#     1000
alpha-# ;



                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2555.58..2555.59 rows=1 width=547) (actual time=48841.76..48842.90 rows=1000 loops=1)
   ->  Sort  (cost=2555.58..2555.59 rows=1 width=547) (actual time=48841.76..48842.18 rows=1001 loops=1)
         Sort Key: min((actor.actor_full_name_uppercase)::text), sum(actor_summary.case_count), min((actor_summary.case_disp_global_code)::text)
         ->  Aggregate  (cost=2555.50..2555.57 rows=1 width=547) (actual time=48604.17..48755.28 rows=3590 loops=1)
               ->  Group  (cost=2555.50..2555.50 rows=1 width=547) (actual time=48604.04..48647.91 rows=3594 loops=1)
                     ->  Sort  (cost=2555.50..2555.50 rows=1 width=547) (actual time=48604.01..48605.70 rows=3594 loops=1)
                           Sort Key: actor.actor_id
                           ->  Nested Loop  (cost=1.14..2555.49 rows=1 width=547) (actual time=69.09..48585.83 rows=3594 loops=1)
                                 ->  Hash Join  (cost=1.14..900.39 rows=204 width=475) (actual time=46.92..15259.02 rows=3639 loops=1)
                                       Hash Cond: ("outer".source_id = "inner".source_id)
                                       ->  Index Scan using actor_full_name_uppercase on actor  (cost=0.00..895.04 rows=222 width=463) (actual time=46.54..15220.77 rows=3639 loops=1)
                                             Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                                             Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
                                       ->  Hash  (cost=1.11..1.11 rows=11 width=12) (actual time=0.05..0.05 rows=0 loops=1)
                                             ->  Seq Scan on data_source  (cost=0.00..1.11 rows=11 width=12) (actual time=0.02..0.04 rows=11 loops=1)
                                 ->  Index Scan using actor_summary_pk on actor_summary  (cost=0.00..8.11 rows=1 width=72) (actual time=9.14..9.15 rows=1 loops=3639)
                                       Index Cond: ("outer".actor_id = actor_summary.actor_id)
 Total runtime: 48851.85 msec
(18 rows)


---------------------------------------------------------------------
Nick Fankhauser

    nickf@doxpop.com  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/



Re: Nested loop question

From: Richard Huxton
On Tuesday 16 December 2003 17:06, Nick Fankhauser - Doxpop wrote:
> Hi-
>
> I'm trying to optimize a query that I *think* should run very fast.
> Essentially, I'm joining two tables that have very selective indexes and
> constraining the query on an indexed field. (There's a third small lookup
> table in the mix, but it doesn't really affect the bottom line.)

> I'm unsure what is happening next. I notice that an index scan is occurring
> on actor_summary_pk, with an "actual time" of 9.15, but then it looks like
> a nested loop occurs at the next level to join these tables. Does this mean
> that each probe of the actor_summary index will take 9.15 msec, but the
> nested loop is going to do this once for each actor_id?

That's right - you need to multiply the actual time by the number of loops. In
your case that's 9.15 msec x 3639 loops, or about 33 seconds.

>                                  ->  Index Scan using actor_summary_pk on actor_summary  (cost=0.00..8.11 rows=1 width=72) (actual time=9.14..9.15 rows=1 loops=3639)
>                                        Index Cond: ("outer".actor_id = actor_summary.actor_id)

> The nested loop appears to be where most of my time is going, so I'm
> focusing on this area, but don't know if there is a better approach to this
> join.
>
> Is there a more efficient means than a nested loop to handle such a join?
> Would a different method be chosen if there was exactly one row in
> actor_summary for every row in actor?

Hmm - tricky to say in your case. PG has decided to filter on actor and then
look up the corresponding values in actor_summary. Given that you have 3
million rows in both tables, that seems a reasonable approach. You could always
try forcing different plans by toggling the various enable_hashjoin etc.
options (see the run-time configuration section of the manuals). I'm not sure
that will help you here, though.
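For example (a sketch only - the query itself is elided here, and the settings are session-local, so they revert when you disconnect):

```sql
-- Turn off the planner's nested-loop strategy and see what it picks instead.
SET enable_nestloop = off;
EXPLAIN ANALYZE
SELECT ...;          -- re-run the query from above

-- Restore, then try ruling out the hash join the same way.
SET enable_nestloop = on;
SET enable_hashjoin = off;
EXPLAIN ANALYZE
SELECT ...;
```

Comparing the actual times of the forced plans against the original tells you whether the planner's choice was actually the right one.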

The fact that it's taking you 9ms to do each index lookup suggests to me that
it's going to disk each time. Does that sound plausible, or do you think you
have enough RAM to cache your large indexes?
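One rough way to check, using the planner's statistics (this assumes the default 8 KB block size, and that the index names match the plan above):

```sql
-- Approximate on-disk size of the two big indexes, from pg_class.
-- relpages is a count of 8 KB blocks, so relpages * 8 / 1024 gives MB.
SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
FROM pg_class
WHERE relname IN ('actor_full_name_uppercase', 'actor_summary_pk');
```

If those sizes together approach your available RAM, the 9 ms per probe is almost certainly disk seeks.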

--
  Richard Huxton
  Archonet Ltd

Re: Nested loop question

From: "Nick Fankhauser"
> The fact that it's taking you 9ms to do each index lookup suggests to me
> that it's going to disk each time. Does that sound plausible, or do you
> think you have enough RAM to cache your large indexes?

I'm sure we don't have enough RAM to cache all of our large indexes, so your
supposition makes sense. We have 1GB on this machine. In responding to our
performance problems, one open question has been whether to add memory or to
craft "helper" tables to speed things up. The issue is that this database
needs to scale easily to about 10 times its current size, so although we could
easily triple the memory at reasonable expense, we'd eventually hit a wall.

Is there any solid method to ensure that a particular index always resides in
memory? A hybrid approach that might scale reliably would be to bump up our
memory and then make sure key indexes stay cached. However, I'm concerned that
without a way to guarantee that the indexes we choose remain cached, we would
see very inconsistent response times.

Thanks for your ideas!

-Nick