Thread: Seeking help with a query that takes too long

From: "Nick Fankhauser"
[I originally posted this using the wrong E-Mail account, so a double
posting may occur if the first message gets released by the moderator later-
sorry!]

Hi-

I have a query that I'm trying to speed up. I haven't been able to come
up with any workable ideas on my own, so I'm seeking some input from the
list.

I'm using PostgreSQL 7.3.2.

I have three tables:

case_data (1,947,386 rows)
actor (3,385,669 rows)
actor_case_assignment (8,668,650 rows)

As the names imply, actor_case_assignment contains records that assign an
actor to a case. Actors such as attorneys or judges may have many cases,
while the average actor (we hope) only has one.

What I'm trying to do is link these tables to get back a single row per
actor that shows the actor's name, the number of cases that actor is
assigned to, and if they only have one case, I want the public_id for that
case. This means I have to do a group by to get the case count, but I'm then
forced to use an aggregate function like max on the other fields.

All of the fields ending in "_id" have unique indexes, and
actor_full_name_uppercase is indexed. An ANALYZE is done every night, and
the database is fairly stable in its composition.

Here's the select:

  select
    actor.actor_id,
    max(actor.actor_full_name),
    max(case_data.case_public_id),
    max(case_data.case_id),
    count(case_data.case_id) as case_count
  from
    actor,
    actor_case_assignment,
    case_data
  where
    actor.actor_full_name_uppercase like upper('sanders%')
    and actor.actor_id = actor_case_assignment.actor_id
    and case_data.case_id = actor_case_assignment.case_id
  group by
    actor.actor_id
  order by
    max(actor.actor_full_name),
    case_count desc
  limit
    1000;


Here's the explain analyze:


QUERY PLAN
----------------------------------------------------------------------------------------------------
 Limit  (cost=2214.71..2214.72 rows=1 width=115) (actual time=120034.61..120035.67 rows=1000 loops=1)
   ->  Sort  (cost=2214.71..2214.72 rows=1 width=115) (actual time=120034.60..120034.98 rows=1001 loops=1)
         Sort Key: max((actor.actor_full_name)::text), count(case_data.case_id)
         ->  Aggregate  (cost=2214.67..2214.70 rows=1 width=115) (actual time=119962.80..120011.49 rows=3456 loops=1)
               ->  Group  (cost=2214.67..2214.68 rows=2 width=115) (actual time=119962.76..119987.04 rows=5879 loops=1)
                     ->  Sort  (cost=2214.67..2214.68 rows=2 width=115) (actual time=119962.74..119965.09 rows=5879 loops=1)
                           Sort Key: actor.actor_id
                           ->  Nested Loop  (cost=0.00..2214.66 rows=2 width=115) (actual time=59.05..119929.71 rows=5879 loops=1)
                                 ->  Nested Loop  (cost=0.00..2205.26 rows=3 width=76) (actual time=51.46..66089.04 rows=5882 loops=1)
                                       ->  Index Scan using actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42) (actual time=37.62..677.44 rows=3501 loops=1)
                                             Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                                             Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
                                       ->  Index Scan using actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2165.93 rows=2666 width=34) (actual time=16.37..18.67 rows=2 loops=3501)
                                             Index Cond: ("outer".actor_id = actor_case_assignment.actor_id)
                                 ->  Index Scan using case_data_case_id on case_data  (cost=0.00..3.66 rows=1 width=39) (actual time=9.14..9.15 rows=1 loops=5882)
                                       Index Cond: (case_data.case_id = "outer".case_id)
 Total runtime: 120038.60 msec
(17 rows)


Any ideas?

Thanks!
     -Nick


---------------------------------------------------------------------
Nick Fankhauser

    nickf@doxpop.com  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/



Re: Seeking help with a query that takes too long

From: Manfred Koizar
On Wed, 12 Nov 2003 08:34:50 -0500, "Nick Fankhauser"
<nickf@ontko.com> wrote:
>                                       ->  Index Scan using
>actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
                                                      ^^^^^^
>(actual time=37.62..677.44 rows=3501 loops=1)
                            ^^^^^^^^^
>                                             Index Cond:
>((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
>(actor_full_name_uppercase < 'SANDERT'::character varying))
>                                             Filter:
>(actor_full_name_uppercase ~~ 'SANDERS%'::text)

Nick, can you find out why this row count estimation is so far off?

\x
SELECT * FROM pg_stats
 WHERE tablename='actor' AND attname='actor_full_name_uppercase';

BTW, there seem to be missing cases:
>  ->  Nested Loop  (cost=0.00..2214.66 rows=2 width=115)
>                   (actual time=59.05..119929.71 rows=5879 loops=1)
                                                       ^^^^
>        ->  Nested Loop  (cost=0.00..2205.26 rows=3 width=76)
>                         (actual time=51.46..66089.04 rows=5882 loops=1)
                                                            ^^^^

Servus
 Manfred

Re: Seeking help with a query that takes too long

From: "Nick Fankhauser"

>(actual time=37.62..677.44 rows=3501 loops=1)
                            ^^^^^^^^^

> Nick, can you find out why this row count estimation is so far off?

It's actually correct:

prod1=# select count(actor_id) from actor where actor_full_name_uppercase like 'SANDERS%';
 count
-------
  3501
(1 row)

Of course, I merely chose "SANDERS" arbitrarily as a name that falls
somewhere near the middle of the frequency range for names. SMITH or JONES
would represent a worst-case, and something like KOIZAR would probably be
unique.


Here are the stats:

prod1=# SELECT * FROM pg_stats
prod1-# WHERE tablename='actor' AND attname='actor_full_name_uppercase';
-[ RECORD 1 ]-----+------------------------------------------------------------
schemaname        | public
tablename         | actor
attname           | actor_full_name_uppercase
null_frac         | 0.000333333
avg_width         | 21
n_distinct        | 24215
most_common_vals  | {"STATE OF INDIANA","INDIANA DEPARTMENT OF REVENUE","BARTH CONS SCHOOL CORP","HOWARD COUNTY CLERK","ADVANCED RECOVERY SERVICES","STATE OF INDIANA-DEPT OF REVENUE","ALLIED COLLECTION SERVICE INC","CREDIT BUREAU OF LAPORTE","MIDWEST COLLECTION SVC INC","NCO FINANCIAL SYSTEMS INC"}
most_common_freqs | {0.0153333,0.0143333,0.00433333,0.00433333,0.004,0.00366667,0.00333333,0.00333333,0.00266667,0.00266667}
histogram_bounds  | {"(POE) ESTELLE, DENISE","BRIEN, LIISI","COTTRELL, CAROL","FAMILY RENTALS","HAYNES, TAMIKA","KESSLER, VICTORIA","MEFFORD, VERNON L","PHILLIPS, GERALD L","SHELTON, ANTOINETTE","TRICARICO, MELISSA SUE","ZUEHLKE, THOMAS L"}
correlation       | -0.00147395


I think this means that the average is 357 cases per actor. As you can
see, the range of assignments varies from people with a single parking
ticket to "State of Indiana", which is party to many thousands of cases.


> BTW, there seem to be missing cases:
> >  ->  Nested Loop  (cost=0.00..2214.66 rows=2 width=115)
> >                   (actual time=59.05..119929.71 rows=5879 loops=1)
>                                                        ^^^^
> >        ->  Nested Loop  (cost=0.00..2205.26 rows=3 width=76)
> >                         (actual time=51.46..66089.04 rows=5882 loops=1)

This is expected: we actually aggregate data from many county court
databases, with varying levels of data "cleanliness".
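
For anyone curious, a quick way to count those dangling assignments would
be something like this (just a sketch, using the join columns from the
query above):

    select count(*)
      from actor_case_assignment aca
      left join case_data cd on cd.case_id = aca.case_id
     where cd.case_id is null;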

Regards,
    -Nick




Re: Seeking help with a query that takes too long

From: "Nick Fankhauser"
> >actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
>                                                       ^^^^^^
> >(actual time=37.62..677.44 rows=3501 loops=1)
>                             ^^^^^^^^^
> Nick, can you find out why this row count estimation is so far off?
^^^^^^^^^

Oops, I read this backwards; I see what you mean now. That's a good
question. I'm not sure what part of the stats this estimate might be
pulled from. The average is 357, but the most common frequency may be
around 1.

-Nick



Re: Seeking help with a query that takes too long

From: Tom Lane
"Nick Fankhauser" <nickf@ontko.com> writes:
>> Nick, can you find out why this row count estimation is so far off?

> It's actually correct:

Sure, the 3501 was the "actual".  The estimate was 1 row, which was
pretty far off :-(

> Here are the stats:

It looks like you are running with the default statistics target (10).
Try boosting it to 100 or even more for this column (see ALTER TABLE
SET STATISTICS, then re-ANALYZE) and see if the estimate gets better.
I think the major problem is likely here:
> n_distinct        | 24215
which is no doubt much too small (do you have an idea of the number
of distinct actor_full_name_uppercase values?)

            regards, tom lane

Re: Seeking help with a query that takes too long

From: "Nick Fankhauser"
> It looks like you are running with the default statistics target (10).
> Try boosting it to 100 or even more for this column (see ALTER TABLE
> SET STATISTICS, then re-ANALYZE) and see if the estimate gets better.


Here are the results & a few more clues:

prod1=# alter table actor alter column actor_full_name_uppercase set statistics 1000;
ALTER TABLE
prod1=# analyze actor;
ANALYZE
prod1=# select count(distinct actor_full_name_uppercase) from actor;
  count
---------
 1453371
(1 row)

prod1=# select count(actor_id) from actor;
  count
---------
 3386359
(1 row)

This indicates to me that 1 isn't too shabby as an estimate if the whole
name is specified, but I'm not sure how the estimate is derived in the
case of a LIKE.


prod1=# \x
Expanded display is on.
prod1=# SELECT * FROM pg_stats
prod1-# WHERE tablename='actor' AND attname='actor_full_name_uppercase';

<Header boilerplate snipped out>

schemaname        | public
tablename         | actor
attname           | actor_full_name_uppercase
null_frac         | 0.000586667
avg_width         | 21
n_distinct        | -0.14701

<Long list of values and frequencies snipped out>

correlation       | -0.00211291


Question: What does it mean when n_distinct is negative?

New results of explain analyze:



QUERY PLAN
----------------------------------------------------------------------------------------------------
 Limit  (cost=252683.61..252683.68 rows=28 width=116) (actual time=169377.32..169378.39 rows=1000 loops=1)
   ->  Sort  (cost=252683.61..252683.68 rows=29 width=116) (actual time=169377.31..169377.69 rows=1001 loops=1)
         Sort Key: max((actor.actor_full_name)::text), count(case_data.case_id)
         ->  Aggregate  (cost=252678.57..252682.91 rows=29 width=116) (actual time=169305.79..169354.50 rows=3456 loops=1)
               ->  Group  (cost=252678.57..252680.01 rows=289 width=116) (actual time=169305.76..169330.00 rows=5879 loops=1)
                     ->  Sort  (cost=252678.57..252679.29 rows=289 width=116) (actual time=169305.75..169308.15 rows=5879 loops=1)
                           Sort Key: actor.actor_id
                           ->  Nested Loop  (cost=0.00..252666.74 rows=289 width=116) (actual time=89.27..169273.51 rows=5879 loops=1)
                                 ->  Nested Loop  (cost=0.00..251608.11 rows=289 width=77) (actual time=57.73..92753.49 rows=5882 loops=1)
                                       ->  Index Scan using actor_full_name_uppercase on actor  (cost=0.00..456.88 rows=113 width=42) (actual time=32.80..3197.28 rows=3501 loops=1)
                                             Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                                             Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
                                       ->  Index Scan using actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2181.29 rows=2616 width=35) (actual time=22.26..25.57 rows=2 loops=3501)
                                             Index Cond: ("outer".actor_id = actor_case_assignment.actor_id)
                                 ->  Index Scan using case_data_case_id on case_data  (cost=0.00..3.65 rows=1 width=39) (actual time=13.00..13.00 rows=1 loops=5882)
                                       Index Cond: (case_data.case_id = "outer".case_id)
 Total runtime: 169381.38 msec
(17 rows)



Re: Seeking help with a query that takes too long

From: Tom Lane
"Nick Fankhauser" <nickf@ontko.com> writes:
> This indicates to me that 1 isn't too shabby as an estimate if the whole
> name is specified, but I'm not sure how this gets altered in the case of a
> "LIKE"

For a pattern like "SANDERS%", the estimate is basically a range estimate
for this condition:

> ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND
> (actor_full_name_uppercase < 'SANDERT'::character varying))
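
(You could double-check by explaining the expanded form directly; a
sketch:

    EXPLAIN SELECT count(*) FROM actor
     WHERE actor_full_name_uppercase >= 'SANDERS'
       AND actor_full_name_uppercase < 'SANDERT';

The rows estimate on actor should be close to what you saw for the LIKE.)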

> n_distinct        | -0.14701

> Question: What does it mean when n_distinct is negative?

It means that the number of distinct values is estimated as a fraction
of the table size, rather than an absolute number.  In this case 14.7%
of the table size, which is a bit off compared to the correct value
of 43% (1453371/3386359), but at least it's of the right order of
magnitude now ...
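
(To spell out the arithmetic: a negative n_distinct is scaled by the
number of rows, so the planner's figure here is 0.14701 * 3386359, or
roughly 498000 distinct names, versus the true 1453371.)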

>                                        ->  Index Scan using
> actor_full_name_uppercase on actor  (cost=0.00..456.88 rows=113 width=42)
> (actual time=32.80..3197.28 rows=3501 loops=1)

Hmm.  Better, but not enough better to force a different plan choice.

You might have to resort to brute force, like "set enable_nestloop=false".
Just out of curiosity, what do you get if you do that?
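
In psql that would look something like (sketch):

    SET enable_nestloop = false;
    EXPLAIN ANALYZE <the query from your first post>;
    SET enable_nestloop = true;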

            regards, tom lane

Re: Seeking help with a query that takes too long

From: "Nick Fankhauser"
> You might have to resort to brute force, like "set enable_nestloop=false".
> Just out of curiosity, what do you get if you do that?

I get a different plan, but similar execution time:


 Limit  (cost=323437.13..323437.13 rows=1 width=115) (actual time=170921.89..170922.95 rows=1000 loops=1)
   ->  Sort  (cost=323437.13..323437.13 rows=1 width=115) (actual time=170921.89..170922.26 rows=1001 loops=1)
         Sort Key: max((actor.actor_full_name)::text), count(case_data.case_id)
         ->  Aggregate  (cost=323437.08..323437.12 rows=1 width=115) (actual time=170849.94..170898.06 rows=3457 loops=1)
               ->  Group  (cost=323437.08..323437.09 rows=3 width=115) (actual time=170849.90..170873.60 rows=5880 loops=1)
                     ->  Sort  (cost=323437.08..323437.08 rows=3 width=115) (actual time=170847.97..170850.21 rows=5880 loops=1)
                           Sort Key: actor.actor_id
                           ->  Hash Join  (cost=253333.29..323437.06 rows=3 width=115) (actual time=122873.80..170814.27 rows=5880 loops=1)
                                 Hash Cond: ("outer".case_id = "inner".case_id)
                                 ->  Seq Scan on case_data  (cost=0.00..60368.16 rows=1947116 width=39) (actual time=12.95..43542.25 rows=1947377 loops=1)
                                 ->  Hash  (cost=253333.28..253333.28 rows=3 width=76) (actual time=122844.40..122844.40 rows=0 loops=1)
                                       ->  Hash Join  (cost=6.02..253333.28 rows=3 width=76) (actual time=24992.70..122810.32 rows=5883 loops=1)
                                             Hash Cond: ("outer".actor_id = "inner".actor_id)
                                             ->  Seq Scan on actor_case_assignment  (cost=0.00..209980.49 rows=8669349 width=34) (actual time=9.13..85504.05 rows=8670467 loops=1)
                                             ->  Hash  (cost=6.01..6.01 rows=1 width=42) (actual time=24926.56..24926.56 rows=0 loops=1)
                                                   ->  Index Scan using actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42) (actual time=51.67..24900.53 rows=3502 loops=1)
                                                         Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                                                         Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
 Total runtime: 170925.93 msec
(19 rows)


-Nick



Re: Seeking help with a query that takes too long

From: Manfred Koizar
On Wed, 12 Nov 2003 13:27:53 -0500, "Nick Fankhauser"
<nickf@ontko.com> wrote:
>
>> You might have to resort to brute force, like "set enable_nestloop=false".

>                                             ->  Seq Scan on
>actor_case_assignment  (cost=0.00..209980.49 rows=8669349 width=34) (actual
>time=9.13..85504.05 rows=8670467 loops=1)

Does actor_case_assignment contain more columns than just the two ids?
If yes, do these additional fields account for ca. 70 bytes per tuple?
If not, try
    VACUUM FULL ANALYSE actor_case_assignment;

>                                                   ->  Index Scan using
>actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
>(actual time=51.67..24900.53 rows=3502 loops=1)

This same index scan on actor has been much faster in your previous
postings (677ms, 3200ms), probably due to caching effects.  7ms per
tuple returned suggests a lot of disk seeks are involved.  Is
clustering actor on actor_full_name_uppercase an option, or would this
slow down other queries?

Servus
 Manfred

Re: Seeking help with a query that takes too long

From: "Nick Fankhauser"
> Does actor_case_assignment contain more columns than just the two ids?
> If yes, do these additional fields account for ca. 70 bytes per tuple?
> If not, try
>     VACUUM FULL ANALYSE actor_case_assignment;

actor_case_assignment has its own primary key and a "role" field in addition
to the ids you've seen, so 70 bytes sounds reasonable. (The PK is to allow a
remote mirroring application to update these records; otherwise it would be
unnecessary.)



> 7ms per
> tuple returned looks like a lot of disk seeks are involved.  Is
> clustering actor on actor_full_name_uppercase an option or would this
> slow down other queries?

Good question... I've never used clustering in PostgreSQL before, so I'm
unsure. I presume this is like clustering in Oracle, where the table is
ordered to match the index? If so, I think you may be onto something, because
the only other field we regularly query on is actor_id. actor_id has a
unique index with no clustering currently, so I don't think I'd lose a thing
by clustering on actor_full_name_uppercase.
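
For the archives, I believe the 7.3 incantation is something like this
(assuming actor_full_name_uppercase is the index name, which is what the
plans show, plus a re-ANALYZE since the physical order changes):

    CLUSTER actor_full_name_uppercase ON actor;
    ANALYZE actor;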

I'll give this a try & let you know how it changes.

BTW, you are correct that caching has a big effect on the actual time
figures in this case. I'm working on my development DB, so cached info
doesn't get trampled as quickly by other users. Is there a way to flush out
the cache in a testing situation like this in order to start from a
consistent base?


Thanks!
    -Nick



Re: Seeking help with a query that takes too long

From: Manfred Koizar
On Fri, 14 Nov 2003 11:00:38 -0500, "Nick Fankhauser"
<nickf@ontko.com> wrote:
>Good question... I've never used clustering in PostgreSQL before, so I'm
>unsure. I presume this is like clustering in Oracle where the table is
>ordered to match the index?

Yes, something like that, with the exception that Postgres loses the
clustered status as you INSERT and UPDATE tuples.  So you have to
re-CLUSTER from time to time.  Look at pg_stats.correlation to see if
it's necessary.
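
For example (the same pg_stats query as before, trimmed to the relevant
column):

    SELECT attname, correlation
      FROM pg_stats
     WHERE tablename='actor' AND attname='actor_full_name_uppercase';

Values near +1 or -1 mean the heap is still well ordered by that column;
values near 0 (like the -0.00147 in your first posting) mean the physical
order is essentially random.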

> Is there a way to flush out
>the cache in a testing situation like this in order to start from a
>consistent base?

To flush Postgres shared buffers:
    SELECT count(*) FROM another_large_table;

To flush your database pages from the OS cache:
    tar cf /dev/null /some/large/directory

And run each of your tests at least twice to get a feeling for how caching
affects your specific queries.

Servus
 Manfred