Thread: Problems with query in highly concurrent environment

From: Aleksei Valikov

Hi.

We have a very strange problem and no idea how to solve it.

We have a PROGRESSION table that stores information about various
telephony events. Rows are never updated, only inserted, about 20-30
thousand rows a day.

Based on this table, we need to produce certain statistics, which are
created by a complex SQL query. This query runs quite fast on our test
systems, but is EXTREMELY slow on the production system. The slowdown
factor is 100-500 times: a query that runs in 2 seconds on the test
system sometimes takes up to 12 minutes on the production system.

Test systems run on dumps of the production system, so the difference
between the databases is small. The only real difference is that the
production system is concurrently accessed by a number of users (and
the PROGRESSION table gets plenty of inserts).

During the off-peak times (at night, for instance), statistics on the
production system are as fast as on the test systems (even faster
because of the server hardware).

Moreover, what I find very strange is that when I log in with psql (or
any other SQL client) onto the production database and execute the same
query, it runs fast - even in peak hours!

I would be extremely grateful for any hints or ideas on how we could
resolve this.

My personal theory is that concurrent reads/writes create some kind of
blocking (on indexes, for instance). Mere "heavy load" on the production
system would not explain a slowdown factor of 400. Under which
conditions could writes block a reading query? What could we do to
diagnose such blocking?
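
So far, the only diagnostic I have come up with is to poll pg_locks for
ungranted lock requests while the statistics query is slow. A rough,
untested sketch (assuming an 8.x-era server, where pg_stat_activity
exposes procpid/current_query, and current_query is only filled in when
stats_command_string is on):

-- Show sessions that are waiting for a lock, and what they are running.
SELECT l.pid,
       l.relation::regclass AS locked_relation,
       l.mode,
       a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;

If this comes back empty during a slow run, plain lock waits probably
do not explain the slowdown.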

I'm afraid I can't drop indexes on the production system.

Becoming a bit desperate...

Bye.
/lexi

ps.

The table description:

                                   Table "public.progression"
     Column     |          Type          |                      Modifiers
---------------+------------------------+------------------------------------------------------
  id            | bigint                 | not null default
nextval('progression_id_seq'::text)
  timestamp     | bigint                 | not null
  entry_type    | integer                | not null
  call_id       | bigint                 | not null
  hotline_id    | bigint                 | not null
  hotline_name  | character varying(64)  | not null
  team_id       | bigint                 | not null
  team_name     | character varying(64)  | not null
  expert_id     | bigint                 | not null
  expert_name   | character varying(64)  | not null
  expert_role   | bigint                 | not null
  remote_number | character varying(32)  |
  comment       | character varying(255) | not null default
''::character varying

Indexes:
     "progression_call_id_idx" btree (call_id)
     "progression_entry_type_timestamp_idx" btree (entry_type, "timestamp")
     "progression_expert_id" btree (expert_id)

select count(*) from progression returns 1870961

The statistics query (please be gentle):

SELECT CALL.HOTLINE_ID || '.' || CALL.PERIOD_NUMBER AS {record.id},
       CALL.PERIOD_NUMBER AS {record.periodNumber},
       (:startTimestamp + CALL.PERIOD_NUMBER * :intervalSize) AS {record.timestamp},
       CALL.HOTLINE_ID AS {record.hotlineId},
       CALL.HOTLINE_NAME AS {record.hotlineName},
       count(*) AS {record.totalCallCount},
       sum(CALL.IS_PROCESSED) AS {record.processedCallCount},
       sum(CALL.TOTAL_TIME) AS {record.totalTime},
       sum(CALL.SPOKEN_TIME) AS {record.spokenTime}
FROM
  (SELECT
     draft._CALL_ID AS CALL_ID,
     draft._TIMESTAMP AS TIMESTAMP,
     draft._HOTLINE_ID AS HOTLINE_ID,
     draft._HOTLINE_NAME AS HOTLINE_NAME,
     CASE WHEN draft._CUSTOMER_CALL_GETS_CONNECTED > 0
          THEN 1 ELSE 0 END AS IS_CONNECTED,
     CASE WHEN draft._CUSTOMER_CALL_GETS_CONNECTED > 0
           AND (draft._CUSTOMER_CALL_GETS_DISCONNECTED > 0
                OR draft._CUSTOMER_CALL_DROPPED > 0)
          THEN 1 ELSE 0 END AS IS_PROCESSED,
     CASE WHEN draft._AGENT_CALL_GETS_CONNECTED > 0
           AND draft._AGENT_CALL_GETS_DISCONNECTED > 0
          THEN draft._AGENT_CALL_GETS_DISCONNECTED - draft._AGENT_CALL_GETS_CONNECTED
          WHEN draft._AGENT_CALL_GETS_CONNECTED > 0
           AND draft._AGENT_CALL_DROPPED > 0
          THEN draft._AGENT_CALL_DROPPED - draft._AGENT_CALL_GETS_CONNECTED
          ELSE 0 END AS TOTAL_TIME,
     CASE WHEN draft._AGENT_CALL_STARTS_TALKING > 0
           AND draft._AGENT_CALL_GETS_DISCONNECTED > 0
          THEN draft._AGENT_CALL_GETS_DISCONNECTED - draft._AGENT_CALL_STARTS_TALKING
          WHEN draft._AGENT_CALL_STARTS_TALKING > 0
           AND draft._AGENT_CALL_DROPPED > 0
          THEN draft._AGENT_CALL_DROPPED - draft._AGENT_CALL_STARTS_TALKING
          ELSE 0 END AS SPOKEN_TIME,
     floor((draft._TIMESTAMP - :startTimestamp) / :intervalSize) AS PERIOD_NUMBER
   FROM
     (SELECT
        p.call_id AS _CALL_ID,
        sum(CASE WHEN p.entry_type =  1 THEN p.timestamp ELSE 0 END) AS _TIMESTAMP,
        p.hotline_id AS _HOTLINE_ID,
        p.hotline_name AS _HOTLINE_NAME,
        sum(CASE WHEN p.entry_type =  2 THEN p.timestamp ELSE 0 END) AS _CUSTOMER_CALL_GETS_CONNECTED,
        sum(CASE WHEN p.entry_type =  3 THEN p.timestamp ELSE 0 END) AS _CUSTOMER_CALL_GETS_DISCONNECTED,
        sum(CASE WHEN p.entry_type =  4 THEN p.timestamp ELSE 0 END) AS _CUSTOMER_CALL_DROPPED,
        sum(CASE WHEN p.entry_type =  8 THEN p.timestamp ELSE 0 END) AS _AGENT_CALL_GETS_CONNECTED,
        sum(CASE WHEN p.entry_type =  9 THEN p.timestamp ELSE 0 END) AS _AGENT_CALL_GETS_DISCONNECTED,
        sum(CASE WHEN p.entry_type = 10 THEN p.timestamp ELSE 0 END) AS _AGENT_CALL_DROPPED,
        sum(CASE WHEN p.entry_type = 16 THEN p.timestamp ELSE 0 END) AS _AGENT_CALL_STARTS_TALKING
      FROM progression p
      WHERE p.hotline_id IN (:hotlineIds)
        AND p.entry_type IN (1, 2, 3, 4, 8, 9, 10, 16)
        AND p.timestamp > :startTimestamp
        AND p.timestamp < (:endTimestamp + :maximalDuration)
        AND (p.entry_type != 1 OR p.timestamp < :endTimestamp)
      GROUP BY p.call_id, p.hotline_id, p.hotline_name) AS draft) AS CALL
WHERE CALL.TIMESTAMP > 0
GROUP BY CALL.HOTLINE_ID, CALL.HOTLINE_NAME, CALL.PERIOD_NUMBER

Query execution plan for this query (for the time period of January this
year, all hotlines, :maximalDuration = 36000000, :intervalSize =
3600000) is as follows:

HashAggregate  (cost=143395.98..159630.37 rows=209476 width=354)
  ->  Subquery Scan draft  (cost=125590.52..139730.15 rows=209476 width=354)
        ->  HashAggregate  (cost=125590.52..136588.01 rows=209476 width=55)
              Filter: (((min(CASE WHEN (entry_type = 1) THEN "timestamp" ELSE 0::bigint END) - 1104534000000::bigint) / 3600000) > 0)
              ->  Seq Scan on progression p  (cost=0.00..119306.24 rows=209476 width=55)
                    Filter: (((entry_type = 1) OR (entry_type = 2) OR (entry_type = 3) OR (entry_type = 4) OR (entry_type = 8) OR (entry_type = 9) OR (entry_type = 10) OR (entry_type = 16)) AND ("timestamp" > 1104534000000::bigint) AND ("timestamp" < 1107248400000::bigint) AND ((entry_type <> 1) OR ("timestamp" < 1107212400000::bigint)))
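
One more thing I plan to try: checking whether the planner would use
the (entry_type, timestamp) index at all when sequential scans are
discouraged. A diagnostic sketch with the same literal values as above
(session-local; not something to leave enabled in production):

SET enable_seqscan = off;
EXPLAIN
SELECT count(*)
FROM progression p
WHERE p.entry_type IN (1, 2, 3, 4, 8, 9, 10, 16)
  AND p.timestamp > 1104534000000::bigint
  AND p.timestamp < 1107248400000::bigint;
RESET enable_seqscan;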



Re: Problems with query in highly concurrent environment

From: Jim Wilson

> From: Aleksei Valikov
>
<snip>
> Moreover, what I find very strange is that when I log in with psql (or
> any other SQL client) onto the production database and execute the same
> query, it runs fast - even in peak hours!
>
<snip>

That is strange.  Is this a networking issue?  Are your clients becoming memory
bound (e.g. not enough RAM or poor garbage collection/memory leakage on the
client side)?

Best regards,

Jim Wilson



Re: Problems with query in highly concurrent environment

From: Aleksei Valikov

Hi.

>>Moreover, what I find very strange is that when I log in with psql (or
>>any other SQL client) onto the production database and execute the same
>>query, it runs fast - even in peak hours!
>
> That is strange.  Is this a networking issue?

It is not. psql was executed on the same machine as the running
production system.

> Are your clients becoming memory
> bound (e.g. not enough RAM or poor garbage collection/memory leakage on the
> client side)?

The slowdown is not on the client side: database logs show long
statement durations as well.
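
(For reference, we capture these with log_min_duration_statement in
postgresql.conf; the threshold below is just our choice:

# log any statement that runs longer than one second (value in ms)
log_min_duration_statement = 1000

Durations logged this way are measured on the server, so the client
cannot be the bottleneck here.)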

Strangely enough, the problem has disappeared. I have no idea what
happened, but the statistics run very fast now.

A few hypotheses that I have (with the checks I would run for each
sketched after the list):

1. The documentation says indexes can potentially block. This could
have been the case here: a long query uses an index and, as that index
is contended by concurrent inserts (of which there are MANY), the query
is blocked as well. The queries also run in a transactional environment
(Spring transactional proxies), so it might have been a wrong
transaction isolation level, a missing transaction, or something else
entirely. That could explain why the same query ran fast in psql.
Yesterday I optimized the query a bit, getting rid of the join, so if
the new query does not use indexes, it is not blocked and is fast.

2. Some kind of index malfunction: the index on call_id was created
manually on the production system yesterday. It might not have been up
to date, or for some reason not used by the queries run from the
production system (I have no idea how that could happen). After the
update late last night, the production system software was restarted;
this could have helped, maybe...

3. The nightly ANALYZE run...
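
If the problem ever comes back, these are the checks I would start
with, one per hypothesis (untested sketches; table and index names as
in our schema):

-- 1. What isolation level do the application's connections get?
SHOW default_transaction_isolation;

-- 2. Is the new call_id index actually being used? (idx_scan only
--    counts if row-level statistics collection is enabled.)
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'progression';

-- 3. Refresh planner statistics by hand instead of waiting for the
--    nightly job.
ANALYZE VERBOSE progression;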

Still, these are only vague theories. I have no idea what actually
happened, but I am glad that it works.

Bye.
/lexi