Re: Re: recently and selectively slow, but very simple, update query.... - Mailing list pgsql-performance

From Stelios Mavromichalis
Subject Re: Re: recently and selectively slow, but very simple, update query....
Date
Msg-id CAOfbx2Z8r-YM13-K8HZknVfGT9qHqNhgNOq2SCXZHWRV1tLx=g@mail.gmail.com
In response to Re: Re: recently and selectively slow, but very simple, update query....  (Stelios Mavromichalis <mstelios@cytech.gr>)
List pgsql-performance
hello everyone,

i feel like i owe you all an apology.

it _was_ a locking problem after all.

i don't know how i managed to miss it for so long, no matter how hard i tried or how many times i looked.

eventually, after many hours on the problem, repeatedly hitting the lock dependency query (the last one from <http://wiki.postgresql.org/wiki/Lock_Monitoring>) even though it seemed pointless at the time, it finally revealed the heart of my problem.

there was another query updating another column (opt_pushdlr_ltstamp) on that particular row (usid = 10808). however, due to badly written code (it didn't addBatch and execute everything at once; instead it updated immediately and delayed the commit while waiting for external _slow_ servers to reply) and external events, that update didn't commit for long periods of time, forcing the original balance update to wait for it.
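
for anyone who finds this thread later, a minimal sketch of the situation (the table name "users" is only a placeholder, i'm not showing our real schema; usid and opt_pushdlr_ltstamp are real column names, while the balance column and the debit of 2 are assumptions standing in for what easysms_jb_pay does internally):

-- session A: the badly written code path. it updates the row and then sits
-- idle in transaction, waiting for a slow external server, without committing.
BEGIN;
UPDATE users SET opt_pushdlr_ltstamp = now() WHERE usid = 10808;
-- ... no COMMIT for a long time; the row lock on usid = 10808 stays held ...

-- session B: the balance update. it blocks on session A's row lock and only
-- continues once session A finally commits or rolls back.
BEGIN;
UPDATE users SET balance = balance - 2 WHERE usid = 10808;  -- waits here
COMMIT;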

so, thanks to your advice (that a dump/restore wouldn't help me) i didn't bring the service down for a dump/restore cycle that wasn't required after all, saving me service downtime.

i can't thank you enough for leading me to solve this problem and helping me obtain deeper understanding of postgres.

i wish you all a great day.

my apologies and best regards,

/mstelios


Stelios Mavromichalis
Cytech Ltd. - http://www.cytech.gr/
Science & Technology Park of Crete
fax: +30 2810 31 1045
tel.: +30 2810 31 4127
mob.: +30 697 7078013
skype: mstelios


On Tue, May 6, 2014 at 4:31 PM, Stelios Mavromichalis <mstelios@cytech.gr> wrote:
hello again,

some more data on the subject:

you are right, delete/re-insert didn't solve the problem (i haven't yet tried dump/restore; i might tonight, when traffic is low).

a short snippet from the logs where i log all queries that take longer than 1 sec:

2014-05-06T15:57:46.303880+03:00 s10 postgres[46220]: [1891-1] prosms prosms LOG:  00000: duration: 1947.172 ms  execute <unnamed>: select easysms_jb_pay($1,$2)
2014-05-06T15:57:46.304230+03:00 s10 postgres[46220]: [1891-2] prosms prosms DETAIL:  parameters: $1 = '10808', $2 = '2'
2014-05-06T15:57:46.304439+03:00 s10 postgres[46220]: [1891-3] prosms prosms LOCATION:  exec_execute_message, postgres.c:1989
2014-05-06T15:57:56.199005+03:00 s10 postgres[58002]: [2570-1] prosms prosms LOG:  00000: duration: 6869.886 ms  execute <unnamed>: select easysms_jb_pay($1,$2)
2014-05-06T15:57:56.199349+03:00 s10 postgres[58002]: [2570-2] prosms prosms DETAIL:  parameters: $1 = '10808', $2 = '2'
2014-05-06T15:57:56.199567+03:00 s10 postgres[58002]: [2570-3] prosms prosms LOCATION:  exec_execute_message, postgres.c:1989
2014-05-06T15:57:59.134982+03:00 s10 postgres[58002]: [2571-1] prosms prosms LOG:  00000: duration: 1797.747 ms  execute <unnamed>: select easysms_jb_pay($1,$2)
2014-05-06T15:57:59.135334+03:00 s10 postgres[58002]: [2571-2] prosms prosms DETAIL:  parameters: $1 = '10808', $2 = '2'
2014-05-06T15:57:59.135562+03:00 s10 postgres[58002]: [2571-3] prosms prosms LOCATION:  exec_execute_message, postgres.c:1989
2014-05-06T15:58:07.149477+03:00 s10 postgres[46220]: [1892-1] prosms prosms LOG:  00000: duration: 3938.979 ms  execute <unnamed>: select easysms_jb_pay($1,$2)
2014-05-06T15:58:07.149830+03:00 s10 postgres[46220]: [1892-2] prosms prosms DETAIL:  parameters: $1 = '10808', $2 = '2'
2014-05-06T15:58:07.150067+03:00 s10 postgres[46220]: [1892-3] prosms prosms LOCATION:  exec_execute_message, postgres.c:1989
2014-05-06T16:01:33.784422+03:00 s10 postgres[58002]: [2572-1] prosms prosms LOG:  00000: duration: 2921.212 ms  execute <unnamed>: select easysms_jb_pay($1,$2)
2014-05-06T16:01:33.784842+03:00 s10 postgres[58002]: [2572-2] prosms prosms DETAIL:  parameters: $1 = '10808', $2 = '4'
2014-05-06T16:01:33.785037+03:00 s10 postgres[58002]: [2572-3] prosms prosms LOCATION:  exec_execute_message, postgres.c:1989


should you deem it helpful, yes, i can enable full query logging and send you the listing.

a typical vmstat:
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 363476 325852 294064 30520736    0    0    26   347    0    0  1  0 99  0  0
 1  0 363476 328400 294068 30520732    0    0     0    28 4930 8014  6  1 94  0  0
 2  0 363476 331756 294068 30520732    0    0     0  4384 4950 7980  6  1 93  0  0
 0  0 363476 334016 294068 30520756    0    0     0  4384 4961 7981  7  1 92  0  0
 0  0 363476 334700 294068 30520756    0    0     0  4424 4012 6467  4  1 95  0  0
 1  0 363476 330852 294068 30520788    0    0     0     0 2559 3861  5  1 95  0  0
 1  0 363476 331316 294072 30520788    0    0     0  4408 5013 8127  6  1 94  0  0
 1  0 363476 330788 294072 30520788    0    0     0  4384 5535 9055  6  1 93  0  0
 0  0 363476 331496 294072 30520804    0    0     0  4384 5031 8092  7  1 92  0  0
 2  0 363476 331268 294072 30520804    0    0     0  4428 5052 8246  6  1 94  0  0
 1  0 363476 330848 294080 30520812    0    0     0    32 4892 7950  5  1 94  0  0
 1  0 363476 330480 294080 30520812    0    0     0  4388 4935 8036  6  1 94  0  0
 2  0 363476 332616 294084 30521092    0    0     0  4408 5064 8194  6  1 93  0  0
 0  0 363476 333596 294084 30521008    0    0     0  4384 5205 8463  8  1 91  0  0
 1  0 363476 333324 294084 30521008    0    0     0    40 5014 8151  6  1 94  0  0
 0  0 363476 332740 294084 30521016    0    0     0  4384 5047 8163  6  1 93  0  0
 1  0 363476 336052 294084 30520888    0    0     0  4384 4849 7780  6  1 94  0  0
 1  0 363476 334732 294088 30520892    0    0     8  4400 5520 9012  6  1 93  0  0
 0  0 363476 334064 294096 30520884    0    0     0   220 3903 6193  6  1 94  0  0
 0  0 363476 333124 294096 30520916    0    0     0  2232 4088 6462  6  1 93  0  0

the process that is constantly writing the majority of data is:

"postgres: stats collector process"

and varies from 2.5mb/sec up to 5.7mb/sec

all the other postgres (and non-postgres) processes write very little data and rarely.

the checkpointer process is writing about 78.6kb/sec (just for a few seconds now as i write this email; no other process has a constant I/O rate)

also, _while having the problem_, the results of the following queries (taken from http://wiki.postgresql.org/wiki/Lock_Monitoring) are:

 SELECT relation::regclass, * FROM pg_locks WHERE NOT granted;
(0 rows)


SELECT bl.pid                 AS blocked_pid,
       a.usename              AS blocked_user,
       ka.query               AS blocking_statement,
       now() - ka.query_start AS blocking_duration,
       kl.pid                 AS blocking_pid,
       ka.usename             AS blocking_user,
       a.query                AS blocked_statement,
       now() - a.query_start  AS blocked_duration
FROM pg_catalog.pg_locks bl
JOIN pg_catalog.pg_stat_activity a ON a.pid = bl.pid
JOIN pg_catalog.pg_locks kl ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
WHERE NOT bl.granted;
(0 rows)

SELECT waiting.locktype           AS waiting_locktype,
       waiting.relation::regclass AS waiting_table,
       waiting_stm.query          AS waiting_query,
       waiting.mode               AS waiting_mode,
       waiting.pid                AS waiting_pid,
       other.locktype             AS other_locktype,
       other.relation::regclass   AS other_table,
       other_stm.query            AS other_query,
       other.mode                 AS other_mode,
       other.pid                  AS other_pid,
       other.granted              AS other_granted
FROM pg_catalog.pg_locks AS waiting
JOIN pg_catalog.pg_stat_activity AS waiting_stm
  ON (waiting_stm.pid = waiting.pid)
JOIN pg_catalog.pg_locks AS other
  ON ((waiting."database" = other."database" AND waiting.relation = other.relation)
      OR waiting.transactionid = other.transactionid)
JOIN pg_catalog.pg_stat_activity AS other_stm
  ON (other_stm.pid = other.pid)
WHERE NOT waiting.granted
  AND waiting.pid <> other.pid;
(0 rows)



WITH RECURSIVE    c(requested, current) AS      ( VALUES        ('AccessShareLock'::text, 'AccessExclusiveLock'::text),        ('RowShareLock'::text, 'ExclusiveLock'::text),        ('RowShareLock'::text, 'AccessExclusiveLock'::text),        ('RowExclusiveLock'::text, 'ShareLock'::text),        ('RowExclusiveLock'::text, 'ShareRowExclusiveLock'::text),        ('RowExclusiveLock'::text, 'ExclusiveLock'::text),        ('RowExclusiveLock'::text, 'AccessExclusiveLock'::text),        ('ShareUpdateExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),        ('ShareUpdateExclusiveLock'::text, 'ShareLock'::text),        ('ShareUpdateExclusiveLock'::text, 'ShareRowExclusiveLock'::text),        ('ShareUpdateExclusiveLock'::text, 'ExclusiveLock'::text),        ('ShareUpdateExclusiveLock'::text, 'AccessExclusiveLock'::text),        ('ShareLock'::text, 'RowExclusiveLock'::text),        ('ShareLock'::text, 'ShareUpdateExclusiveLock'::text),        ('ShareLock'::text, 'ShareRowExclusiveLock'::text),        ('ShareLock'::text, 'ExclusiveLock'::text),        ('ShareLock'::text, 'AccessExclusiveLock'::text),        ('ShareRowExclusiveLock'::text, 'RowExclusiveLock'::text),        ('ShareRowExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),        ('ShareRowExclusiveLock'::text, 'ShareLock'::text),        ('ShareRowExclusiveLock'::text, 'ShareRowExclusiveLock'::text),        ('ShareRowExclusiveLock'::text, 'ExclusiveLock'::text),        ('ShareRowExclusiveLock'::text, 'AccessExclusiveLock'::text),        ('ExclusiveLock'::text, 'RowShareLock'::text),        ('ExclusiveLock'::text, 'RowExclusiveLock'::text),        ('ExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),        ('ExclusiveLock'::text, 'ShareLock'::text),        ('ExclusiveLock'::text, 'ShareRowExclusiveLock'::text),        ('ExclusiveLock'::text, 'ExclusiveLock'::text),        ('ExclusiveLock'::text, 'AccessExclusiveLock'::text),        ('AccessExclusiveLock'::text, 'AccessShareLock'::text),        ('AccessExclusiveLock'::text, 'RowShareLock'::text),        ('AccessExclusiveLock'::text, 'RowExclusiveLock'::text),        ('AccessExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),        ('AccessExclusiveLock'::text, 'ShareLock'::text),        ('AccessExclusiveLock'::text, 'ShareRowExclusiveLock'::text),        ('AccessExclusiveLock'::text, 'ExclusiveLock'::text),        ('AccessExclusiveLock'::text, 'AccessExclusiveLock'::text)      ),    l AS      (        SELECT            (locktype,DATABASE,relation::regclass::text,page,tuple,virtualxid,transactionid,classid,objid,objsubid) AS target,            virtualtransaction,            pid,            mode,            granted          FROM pg_catalog.pg_locks      ),    t AS      (        SELECT            blocker.target  AS blocker_target,            blocker.pid     AS blocker_pid,            blocker.mode    AS blocker_mode,            blocked.target  AS target,            blocked.pid     AS pid,            blocked.mode    AS mode          FROM l blocker          JOIN l blocked            ON ( NOT blocked.granted             AND blocker.granted             AND blocked.pid != blocker.pid             AND blocked.target IS NOT DISTINCT FROM blocker.target)          JOIN c ON (c.requested = blocked.mode AND c.current = blocker.mode)      ),    r AS      (        SELECT            blocker_target,            blocker_pid,            blocker_mode,            '1'::int        AS depth,            target,            pid,            mode,            blocker_pid::text || ',' || pid::text AS seq    
      FROM t        UNION ALL        SELECT            blocker.blocker_target,            blocker.blocker_pid,            blocker.blocker_mode,            blocker.depth + 1,            blocked.target,            blocked.pid,            blocked.mode,            blocker.seq || ',' || blocked.pid::text          FROM r blocker          JOIN t blocked            ON (blocked.blocker_pid = blocker.pid)          WHERE blocker.depth < 1000      )
SELECT * FROM r ORDER BY seq;
(0 rows)


no finding here either :(
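
since a one-shot snapshot like the ones above can easily miss a blocker that only shows up now and then, i'm also going to re-run a simpler check in a loop. a minimal sketch, assuming pg_stat_activity still has the boolean "waiting" column (it does up to 9.5) and a psql new enough for \watch (9.3+):

-- backends currently waiting on a lock, and how long their query has been running
SELECT pid, usename, now() - query_start AS query_runtime, query
FROM pg_catalog.pg_stat_activity
WHERE waiting;
\watch 1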

best,

/mstelios


Stelios Mavromichalis
Cytech Ltd. - http://www.cytech.gr/
Science & Technology Park of Crete
fax: +30 2810 31 1045
tel.: +30 2810 31 4127
mob.: +30 697 7078013
skype: mstelios


On Tue, May 6, 2014 at 2:54 AM, David G Johnston <david.g.johnston@gmail.com> wrote:
Stelios Mavromichalis wrote
> as a prior step to dump/restore i am thinking of deleting and re-inserting
> that particular row. that might shed some light, you think?

I still dislike the randomness of the unresponsiveness...

Every time you perform an update you "delete and insert" that row - that is
how an update works in MVCC - so doing so explicitly is unlikely to provide
any benefit.  Since that row is continually being inserted, and no other
rows are having this issue, I'm seriously doubting that a dump/restore is
going to have any effect either.  Note that the index scan took twice as
long in the bad case - but still reasonable and you didn't notice any
benefit from a REINDEX.  This is what I would expect.
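
To make the "update = delete and insert" point concrete: the row's physical location (ctid) changes on every update, because a new row version is written each time.  A throwaway example (table and values here are made up, not your schema):

CREATE TEMP TABLE t (usid int PRIMARY KEY, balance numeric);
INSERT INTO t VALUES (10808, 100);

SELECT ctid, * FROM t WHERE usid = 10808;               -- e.g. (0,1)
UPDATE t SET balance = balance - 2 WHERE usid = 10808;
SELECT ctid, * FROM t WHERE usid = 10808;               -- now (0,2): a new tuple was written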

The only other difference, if concurrency has been ruled out, is the 4 vs 18
buffers that had to be read.  I cannot imagine that, since all 22 were in
cache, simply reading that much more data would account for the
difference (we're talking a 10,000-fold increase, not a 2-to-4-fold one).  The
reason for this particular difference, IIUC, is how many candidate tuples are
present whose visibility has to be accounted for (assuming 1 buffer per
tuple, you needed to scan 4 vs 18 for visibility in the two queries).
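
If you want to watch that number directly, EXPLAIN with the BUFFERS option reports how many buffers each execution touched.  Something along these lines, run during both the good and the bad periods, would show whether the extra buffer reads line up with the slow executions (the table name is a placeholder for whatever easysms_jb_pay actually updates):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
  UPDATE some_balance_table SET balance = balance - 2 WHERE usid = 10808;
ROLLBACK;  -- ANALYZE really performs the update, so undo it
-- then compare the "Buffers: shared hit=..." figures between good and bad runs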

Is there any log file information you can share?  Especially if you can set
log_min_duration_statement (or whatever that GUC is named) so that whenever
one of these gets adversely delayed it appears in the log along with
whatever other system messages are being sent.  Checkpoints are a typical
culprit, though they should be affecting a great deal more than what you
indicate you are seeing.
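
For reference, the relevant settings would look roughly like this in postgresql.conf (values are only examples; all three can be changed with a reload, no restart needed).  log_lock_waits is not something discussed above, but it logs any lock wait longer than deadlock_timeout and is cheap to leave on:

# log every statement that runs longer than 1 second
log_min_duration_statement = 1000    # milliseconds; 0 logs everything
# make checkpoint activity visible in the log
log_checkpoints = on
# log any lock wait longer than deadlock_timeout (1s by default)
log_lock_waits = on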

I'm pretty certain you are seeing this here largely because of the frequency
of activity on this particular user; not because the data itself is
corrupted.  It could be some kind of symptom of internal concurrency that
you just haven't observed yet but it could also be I/O or other system
contention that you also haven't properly instrumented.  Unfortunately that
is beyond my current help-providing skill-set.

A dump-restore likely would not make anything worse though I'd be surprised
if it were to improve matters.  It also doesn't seem like hardware - unless
the RAM is bad.  Software bugs are unlikely if this had been working well
before 5 days ago.  So, you need to observe the system during both periods
(good and bad) and look for something that is different - probably not within
PostgreSQL if indeed you've minimized concurrency.  Also check whether any
other queries, executed during both these times, exhibit a performance
decrease.  Logging all statements would help matters greatly if you can
afford it in your production environment - it would make looking for
internal concurrency much easier.

David J.



--
View this message in context: http://postgresql.1045698.n5.nabble.com/Re-recently-and-selectively-slow-but-very-simple-update-query-tp5802553p5802579.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.



