30ms to 30s execution time: query optimization problem - Mailing list pgsql-general

From Tomi N/A
Subject 30ms to 30s execution time: query optimization problem
Date
Msg-id d487eb8e0703141644s26388fbdl5f4d640b385f78e6@mail.gmail.com
Whole thread Raw
List pgsql-general
I have a problem with query optimization. Since it's a bit hard to
explain, I'll describe the database structure and list the queries
first. The database holds contacts information for a call center. The
general idea is to have operators call a random contact and record
the results. Rules for picking a contact are:

- contacts that have an open call shall not be called
- contacts that have been called and refused to participate in
current poll shall not be called again
- contacts that have been called and did not answer or line was busy
will be called again after 4 hours
- contact shall not be attempted more than max_call_times per poll

Contacts are joined in groups and groups are assigned to poll
samples. Each poll has several groups assigned, and each usergroup
can have different sample groups.

Tables sizes at this time are roughly:
contact - 1.5M
call - 0.5M
group - 10K
poll - 10
contact_group - 5M (each contact belongs to ~ 3 groups)

Tables last_call and num_calls have just the last call to a contact
per poll , and number of calls to a person per poll respectively.

All tables have foreign keys and appropriate indices.

SELECT INTO s_contact contact.id_contact
FROM contact
JOIN contact_group ON contact.id_contact = contact_group.id_contact
JOIN group_poll ON contact_group.id_group = group_poll.id_group
   AND group_poll.id_poll = $$s_poll$$
JOIN group ON group.id_group = group_poll.id_group
JOIN user_role ON group_poll.id_role = user_role.id_role
     AND user_role.id_user = $$s_user$$
LEFT OUTER JOIN last_call ON last_call.id_contact = contact.id_contact
LEFT OUTER JOIN num_calls ON num_calls.id_contact =
contact.id_contact AND num_calls.id_poll     = $$s_poll$$
WHERE -- temporal limitations
(
  (  group_poll.valid_from IS NULL )
  OR
  ( group_poll.valid_from <= NOW() )
)
AND
(
  (  group_poll.valid_to IS NULL )
  OR
  ( group_poll.valid_to >= NOW() )
)
AND
(
  ( group.valid_from IS NULL )
  OR
         ( group.valid_from <= NOW() )
)
AND
(
  ( group.valid_to IS NULL )
  OR
         ( group.valid_to >= NOW() )
)
AND
(
  ( contact.valid_from IS NULL )
  OR
         ( contact.valid_from <= NOW() )
)
AND
(
  ( contact.valid_to IS NULL )
  OR
         ( contact.valid_to >= NOW() )
)
AND
( -- no call
 last_call.id_call IS NULL
 OR
 ( -- no call in progress and not processed
    NOT(last_call.id_call_status = call_in_progress
        OR (last_call.sif_anketa = s_poll
        AND last_call.id_call_status
 IN ( call_completed, call_aborted, will_not_participate ))
)
    AND
last_call.end_time < NOW()-'240 minutes'::interval
  )
)
AND
(
  num_calls.num_calls IS NULL
  OR
  num_calls.num_calls < $$max_num_calls$$
)
AND -- randomize
  contact.random_number <= $$random_key$$
ORDER BY contact.random_number DESC LIMIT 1 FOR UPDATE OF contact;

The randomization is performed with a random number column in
contacts table. Variables with $$ are inputs to functions.

This query has an interesting property that it finishes with lightning
speed, seemingly if there is a lot of contacts that cannot be joined
with last call and num_calls. But it also goes to the other extreme
and take up to 30s to execute.



Limit  (cost=0.00..17.64 rows=1 width=18) (actual time=0.205..0.205
rows=1 loops=1)
 ->  Nested Loop  (cost=0.00..1615461.74 rows=91598 width=18) (actual
time=0.201..0.201 rows=1 loops=1)
       ->  Nested Loop  (cost=0.00..1536111.58 rows=42092 width=22)
(actual time=0.179..0.179 rows=1 loops=1)
             ->  Nested Loop Left Join  (cost=0.00..1523725.29
rows=42092 width=30) (actual time=0.159..0.159 rows=1 loops=1)
                   Filter: ((num_calls.num_calls IS NULL) OR
(num_calls.num_calls < 3))
                   ->  Nested Loop  (cost=0.00..1508293.43 rows=48997
width=30) (actual time=0.139..0.139 rows=1 loops=1)
                         ->  Nested Loop  (cost=0.00..1089736.15
rows=1430554 width=22) (actual time=0.120..0.120 rows=1 loops=1)
                               ->  Nested Loop Left Join
(cost=0.00..610920.91 rows=635050 width=18) (actual time=0.088..0.088
rows=1 loops=1)
                                     Filter: ((last_call.id_poziv IS
NULL) OR ((last_call.id_call_status <> 1) AND ((last_call.id_poll <>
1) OR (last_call.id_call_status <> ALL ('{3,8,9}'::integer[]))) AND
(last_call.end_time < (now() - '04:00:00'::interval))))
                                     ->  Index Scan Backward using
ix_contact_random_number on contact  (cost=0.00..191996.83
rows=1266570 width=18) (actual time=0.056..0.056 rows=1 loops=1)
                                           Index Cond: (random_number
<= 0.98::double precision)
                                           Filter: (((valid_from IS
NULL) OR (valid_from <= now())) AND ((valid_to IS NULL) OR (valid_to
>= now())))
                                     ->  Index Scan using
fki_last_call_2 on last_call  (cost=0.00..0.30 rows=1 width=24)
(actual time=0.023..0.023 rows=0 loops=1)
                                           Index Cond:
(last_call.id_contact = contact.id_contact)
                               ->  Index Scan using
fki_contact_group_2 on contact_group  (cost=0.00..0.73 rows=2 width=8)
(actual time=0.026..0.026 rows=1 loops=1)
                                     Index Cond: (contact.id_contact =
contact_group.id_contact)
                         ->  Index Scan using ix_group_poll_id_group
on group_poll  (cost=0.00..0.28 rows=1 width=8) (actual
time=0.015..0.015 rows=1 loops=1)
                               Index Cond: (contact_group.id_group =
group_poll.id_group)
                               Filter: ((id_poll = 1) AND ((valid_from
IS NULL) OR (valid_from <= now())) AND ((valid_to IS NULL) OR
(valid_to >= now())))
                   ->  Index Scan using fki_num_calls_1 on num_calls
(cost=0.00..0.30 rows=1 width=8) (actual time=0.016..0.016 rows=0
loops=1)
                         Index Cond: (num_calls.id_contact = contact.id_contact)
                         Filter: (id_poll = 1)
             ->  Index Scan using group_pkey on group
(cost=0.00..0.28 rows=1 width=4) (actual time=0.015..0.015 rows=1
loops=1)
                   Index Cond: (group.id_group = group_poll.id_group)
                   Filter: (((valid_from IS NULL) OR (valid_from <=
now())) AND ((valid_to IS NULL) OR (valid_to >= now())))
       ->  Index Scan using uq_user_role_id_user_id_role on user_role
(cost=0.00..1.87 rows=1 width=4) (actual time=0.018..0.018 rows=1
loops=1)
             Index Cond: ((user_role.id_user = 1) AND
(group_poll.id_role = user_role.id_role))
Total runtime: 0.972 ms

The slow plan (most of the contacts have already been attempted,
moving to 2nd and 3rd calls, group size 20K):

Limit  (cost=10.19..313.21 rows=1 width=18) (actual
time=166.575..166.576 rows=1 loops=1)
 ->  Nested Loop  (cost=10.19..1156643.83 rows=3817 width=18) (actual
time=166.571..166.571 rows=1 loops=1)
       ->  Nested Loop Left Join  (cost=10.19..1155520.62 rows=3817
width=26) (actual time=166.553..166.553 rows=1 loops=1)
             Filter: ((num_calls.num_calls IS NULL) OR
(num_calls.num_calls < 3))
             ->  Nested Loop  (cost=10.19..1154121.27 rows=4443
width=26) (actual time=24.318..166.259 rows=15 loops=1)
                   Join Filter: (contact_group.id_group = group_poll.id_group)
                   ->  Nested Loop  (cost=0.00..1089736.15
rows=1430554 width=22) (actual time=0.082..124.821 rows=8567 loops=1)
                         ->  Nested Loop Left Join
(cost=0.00..610920.91 rows=635050 width=18) (actual time=0.060..56.886
rows=3768 loops=1)
                               Filter: ((last_call.id_poziv IS NULL)
OR ((last_call.id_call_status <> 1) AND ((last_call.id_poll <> 1) OR
(last_call.id_call_status <> ALL ('{3,8,9}'::integer[]))) AND
(last_call.end_time < (now() - '04:00:00'::interval))))
                               ->  Index Scan Backward using
ix_contact_random_number on contact  (cost=0.00..191996.83
rows=1266570 width=18) (actual time=0.037..15.929 rows=4072 loops=1)
                                     Index Cond: (random_number <=
0.98::double precision)
                                     Filter: (((valid_from IS NULL) OR
(valid_from <= now())) AND ((valid_to IS NULL) OR (valid_to >=
now())))
                               ->  Index Scan using fki_last_call_2 on
last_call  (cost=0.00..0.30 rows=1 width=24) (actual time=0.007..0.007
rows=0 loops=4072)
                                     Index Cond: (last_call.id_contact
= contact.id_contact)
                         ->  Index Scan using fki_contact_group_2 on
contact_group  (cost=0.00..0.73 rows=2 width=8) (actual
time=0.010..0.014 rows=2 loops=3768)
                               Index Cond: (contact.id_contact =
contact_group.id_contact)
                   ->  Materialize  (cost=10.19..10.21 rows=2 width=4)
(actual time=0.000..0.002 rows=3 loops=8567)
                         ->  Hash Join  (cost=8.28..10.19 rows=2
width=4) (actual time=0.106..0.132 rows=3 loops=1)
                               Hash Cond: (group_poll.id_role =
user_role.id_role)
                               ->  Seq Scan on group_poll
(cost=0.00..1.81 rows=20 width=8) (actual time=0.026..0.071 rows=25
loops=1)
                                     Filter: ((id_poll = 1) AND
((valid_from IS NULL) OR (valid_from <= now())) AND ((valid_to IS
NULL) OR (valid_to >= now())))
                               ->  Hash  (cost=8.27..8.27 rows=1
width=4) (actual time=0.017..0.017 rows=1 loops=1)
                                     ->  Index Scan using
fki_user_role_1 on user_role  (cost=0.00..8.27 rows=1 width=4) (actual
time=0.011..0.013 rows=1 loops=1)
                                           Index Cond: (id_user = 265)
             ->  Index Scan using fki_num_calls_1 on num_calls
(cost=0.00..0.30 rows=1 width=8) (actual time=0.014..0.015 rows=1
loops=15)
                   Index Cond: (num_calls.id_contact = contact.id_contact)
                   Filter: (id_poll = 1)
       ->  Index Scan using group_pkey on group  (cost=0.00..0.28
rows=1 width=4) (actual time=0.012..0.012 rows=1 loops=1)
             Index Cond: (group.id_group = group_poll.id_group)
             Filter: (((valid_from IS NULL) OR (valid_from <= now()))
AND ((valid_to IS NULL) OR (valid_to >= now())))
Total runtime: 166.963 ms

This can also be slower, up to 10 seconds.. Query executes within a
function, which is synchronized from the application, so only one
such user has access to it.
So, the question is - what can cause such slowdown? Working set is
round 380M, shared_mem has been set to 400M, and work_mem to 256M,
when testing (single-user), there is no
swapping by postgres. When several users access the database (avg 30),
sometimes it slows down, but it's hard to tell why. Nothing in the
logs. Autovacuum has been enables,
statistics (row, block) also. Db is also slow in some string
operations (split_part() can sometimes slow down to 100 rows per
second).

Tried several changes to the query to make it correct in all
respects - cannot match the speed of this one in case there is lots
of free contacts. I thought of having
two queries and switch between them when the poll comes close to an
end, but in some cases there is no slowdown. Any pointers on what to
look for?

The machine is 2x Xeon 3GHz, 2G RAM, Win 2003, postgresql 8.2.3
(installer from postgresql.org). Running apps - just postgresql and
asp.net. Encoding - UTF8. Clients use Npgsql to connect
via LLBLGen O/R mapper.


To anyone who got this far, a sincere thanks.
Suggestions and comments are a bonus.
Cheers,
t.n.a.

pgsql-general by date:

Previous
From: Stephan Szabo
Date:
Subject: Re: [Bulk] Re: quoted identifier behaviour
Next
From: "Joshua D. Drake"
Date:
Subject: Re: Lifecycle of PostgreSQL releases