Thread: Re: recently and selectively slow, but very simple, update query....

Re: recently and selectively slow, but very simple, update query....

From: Stelios Mavromichalis
hello list,

since i got no reply, i'm afraid i'll go down the dump/restore path, hoping this will solve my problem.

best regards,

/mstelios


Stelios Mavromichalis
Cytech Ltd. - http://www.cytech.gr/
Science & Technology Park of Crete
fax: +30 2810 31 1045
tel.: +30 2810 31 4127
mob.: +30 697 7078013
skype: mstelios


On Mon, May 5, 2014 at 5:11 PM, Stelios Mavromichalis <mstelios@cytech.gr> wrote:
hello,

after reading this guide: http://wiki.postgresql.org/wiki/SlowQueryQuestions

i decided to seek your help.

my problem is that the same query/function sometimes runs fast/normal (as expected) but, recently (about 5 days now), some/most of the time it runs really slow to _very_ slow.

the query is in essence a very simple update of the balance of _a certain user_ (no other user has this issue). yes, this user has the most frequent balance updates.

i've tried restarting, manual vacuuming (with and without analyze/full), and reindexing the database, with no improvement. it also doesn't look like a hardware problem: diagnostics run fine and there are no kernel messages or anything weird/unexpected. the load of the machine is also low (around 0.2).

i would have dump+restored the database without bothering you, hoping that would solve my problem, but then i thought i wouldn't learn anything from it, nor would you have the chance to potentially trace a problem/bug and thus help the community.

so, without further ado:


Full Table and Index Schema

the function that has the problem (easysms_jb_pay(int, int) returns int):
     DECLARE
       user_id ALIAS FOR $1;
       amount ALIAS FOR $2;
       myuser record;
       mg record;
       newbalance float;
     BEGIN
       SELECT INTO myuser es.login, es.balance as esbalance
         from easysms_users es
         where es.usid = user_id;

       IF NOT FOUND THEN
         RAISE EXCEPTION 'Cannot find user';
         return -2;
       END IF;

       IF myuser.login = 'jbuser' THEN
         return -3;
       END IF;

       IF myuser.esbalance < amount THEN
         return -1;
       END IF;

       UPDATE easysms_users SET balance = balance - amount
         WHERE usid = user_id;

       return 1;
     END;


the related table:
                                            Table "public.easysms_users"
         Column         |            Type             |                          Modifiers
------------------------+-----------------------------+-------------------------------------------------------------
 login                  | character varying(20)       |
 passwd                 | character varying(32)       | not null
 mobile                 | character varying(16)       | not null
 name                   | character varying(20)       |
 lastname               | character varying(20)       |
 balance                | bigint                      | not null default 0
 email                  | character varying(40)       |
 status                 | character varying(1)        | default 'p'::character varying
 lang                   | character varying(2)        |
 trusted                | boolean                     | default false
 opt_originator         | character varying(16)       |
 opt_fullname           | character varying(50)       |
 opt_afm                | character varying(30)       |
 opt_invoice_details    | text                        |
 opt_postal_address     | text                        |
 opt_want_invoice       | smallint                    | default 0
 bulklimit              | integer                     | default 100
 lastlogin              | timestamp without time zone |
 daily_report           | boolean                     | default false
 pro                    | boolean                     | default true
 country_code           | integer                     |
 mobnumber              | character varying(10)       |
 cctld                  | character varying(2)        |
 mpid                   | integer                     |
 ifee                   | boolean                     |
 gsm_code               | character varying(8)        |
 account_reminder_email | boolean                     | default false
 usid                   | integer                     | default (-2)
 namedays               | boolean                     | default true
 opt_concat             | boolean                     | default false
 opt_smtype             | character(1)                | default 't'::bpchar
 opt_url                | text                        |
 opt_permit_concat      | boolean                     | default true
 opt_email              | boolean                     | default false
 suser                  | boolean                     | default false
 susid                  | integer                     |
 perm                   | character varying(20)       |
 opt_statsperiod        | character varying(3)        |
 opt_balance            | boolean                     |
 opt_lblimit            | integer                     |
 opt_override           | boolean                     | default false
 opt_letstamp           | timestamp with time zone    | default (now() - '1 day'::interval)
 opt_lbststamp          | timestamp with time zone    | default now()
 opt_pushdlr_enabled    | boolean                     | default false
 opt_pushdlr_ltstamp    | timestamp with time zone    | default now()
 opt_pushdlr_rperiod    | integer                     | default 300
 opt_pushdlr_dperiod    | integer                     | default 2
 opt_pushdlrs           | boolean                     | default false
 regdate                | timestamp with time zone    | not null default ('now'::text)::timestamp(6) with time zone
 opt_occupation         | character varying(50)       |
 opt_invoice_address    | text                        |
 opt_city               | character varying(50)       |
 opt_invoice_city       | character varying(50)       |
 opt_pcode              | character varying(30)       |
 opt_invoice_pcode      | character varying(30)       |
 opt_doy                | character varying(50)       |
 opt_phone              | character varying(50)       |
 opt_invoice_country    | character varying(50)       |
 opt_country            | character varying(50)       |
 billid                 | integer                     |
 opt_smpp_enabled       | boolean                     | default false
Indexes:
    "idx_easysms_users_usid" UNIQUE, btree (usid)
    "easysms_users_cctld_idx" btree (cctld)
    "easysms_users_email_idx" btree (email)
    "easysms_users_mobile_idx" btree (mobile)
    "easysms_users_mpid_idx" btree (mpid)
    "easysms_users_status_idx" btree (status)


Table Metadata
does not contain large objects
has a fair amount of nulls
does receive a large number of updates, no deletes
is not growing rapidly; it grows very slowly
indexes: see the schema above
does not use triggers


History
only what i've mentioned at the start of this email. i can't think of any event that could be linked to this behavior.


Hardware Components (dedicated to dbs; also runs a low-traffic mysql; runs openSUSE 12.3 x86-64)
Harddisk: 2x 2000 GB SATA 3.5" 7,200 rpm (in RAID 1)
RAM: 32 GB
RAID controller: HP Smart Array P410 (battery-backed, write-back enabled)
Barebone: Hewlett Packard DL320e G8
CPU: Intel Xeon E3-1230 v2


Maintenance Setup
autovacuum on default settings. manual vacuum only in cases like this, not regularly. see db config


WAL Configuration
nothing special here; everything runs on the same disk/partition. see db config


GUC Settings
             name             |  current_setting  |        source
------------------------------+-------------------+----------------------
 application_name             | psql              | client
 checkpoint_completion_target | 0.9               | configuration file
 checkpoint_segments          | 64                | configuration file
 client_encoding              | UTF8              | client
 client_min_messages          | log               | configuration file
 DateStyle                    | ISO, DMY          | configuration file
 deadlock_timeout             | 10s               | configuration file
 debug_print_rewritten        | off               | configuration file
 default_statistics_target    | 100               | configuration file
 default_text_search_config   | pg_catalog.simple | configuration file
 effective_cache_size         | 8GB               | configuration file
 fsync                        | off               | configuration file
 lc_messages                  | el_GR.UTF-8       | configuration file
 lc_monetary                  | el_GR.UTF-8       | configuration file
 lc_numeric                   | el_GR.UTF-8       | configuration file
 lc_time                      | el_GR.UTF-8       | configuration file
 listen_addresses             | *                 | configuration file
 log_connections              | off               | configuration file
 log_destination              | syslog            | configuration file
 log_disconnections           | off               | configuration file
 log_error_verbosity          | verbose           | configuration file
 log_hostname                 | on                | configuration file
 log_line_prefix              | %d %u             | configuration file
 log_lock_waits               | on                | configuration file
 log_min_duration_statement   | 1s                | configuration file
 log_min_error_statement      | debug5            | configuration file
 log_min_messages             | info              | configuration file
 log_statement                | none              | configuration file
 logging_collector            | on                | configuration file
 maintenance_work_mem         | 704MB             | configuration file
 max_connections              | 400               | configuration file
 max_prepared_transactions    | 1000              | configuration file
 max_stack_depth              | 2MB               | environment variable
 random_page_cost             | 1.5               | configuration file
 shared_buffers               | 2816MB            | configuration file
 TimeZone                     | Europe/Athens     | configuration file
 wal_buffers                  | 16MB              | configuration file
 work_mem                     | 28MB              | configuration file
(38 rows)


Postgres version
# select version();
                                                             version
---------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.7 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388], 64-bit
(1 row)

a normal-speed run of the query: <http://explain.depesz.com/s/XeQm>

a slow run of the same query: <http://explain.depesz.com/s/AjwK>

thank you so very much in advance for your time and efforts to help.

best regards,

/mstelios


Stelios Mavromichalis
Cytech Ltd. - http://www.cytech.gr/
Science & Technology Park of Crete
fax: +30 2810 31 1045
tel.: +30 2810 31 4127
mob.: +30 697 7078013
skype: mstelios

Re: recently and selectively slow, but very simple, update query....

From: David G Johnston
Stelios Mavromichalis wrote
>> the load of the machine is also low (like 0.2).

Which means little if the update is waiting for a lock to be released by
some other process; that is the more likely situation (or some other
concurrency contention), especially as you said that this particular user
generates significant transaction/query volume (implied by the fact that
the user has the most balance updates).

During slow-update executions you want to look at:
pg_stat_activity
pg_locks

to see what other concurrent activity is taking place.
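
For example (untested sketch, off the top of my head; on 9.2
pg_stat_activity exposes a boolean "waiting" column), sampling something
like this repeatedly while one of the slow updates is running should show
whether anything is waiting and what else is active at that moment:

SELECT pid, waiting, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE waiting OR state <> 'idle'
ORDER BY query_start;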

It is doubtful that dump/restore would have any effect given that the
symptoms are sporadic and we are only talking about a select statement that
returns a single row; and an update that does not hit any indexed column and
therefore benefits from "HOT" optimization.
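
If you want to double-check that these updates really are HOT, a quick
sketch against the statistics views:

SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'easysms_users';

A high n_tup_hot_upd relative to n_tup_upd means the balance updates are
indeed staying HOT.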

HTH

David J.








Re: Re: recently and selectively slow, but very simple, update query....

From: Stelios Mavromichalis
hello David,

thank you for the reply!

right, in this regard i did some research following the recommendations here: <http://wiki.postgresql.org/wiki/Lock_Monitoring>

however, it showed that nothing was wrong: no deadlocks, no pending locks, nothing.

i have to mention that this update runs serially (no other thread/process is trying to update that balance, or any other; only one. well, not 100% true: the topup mechanism also does, but it runs relatively rarely).

i also have to mention that this exact same mechanism has been in place for some time now, like a few years, and it never had a similar problem.

also, the fact that another usid with a great many updates runs normally leads me to think that it might be corruption or something, hence the dump/restore hope :)

as a step prior to dump/restore i am thinking of deleting and re-inserting that particular row. do you think that might shed some light?

best regards,

/mstelios

Stelios Mavromichalis
Cytech Ltd. - http://www.cytech.gr/
Science & Technology Park of Crete
fax: +30 2810 31 1045
tel.: +30 2810 31 4127
mob.: +30 697 7078013
skype: mstelios



Re: recently and selectively slow, but very simple, update query....

From: David G Johnston
Stelios Mavromichalis wrote
> as a step prior to dump/restore i am thinking of deleting and re-inserting
> that particular row. do you think that might shed some light?

I still dislike the randomness of the unresponsiveness...

Every time you perform an update you "delete and insert" that row - that is
how an update works in MVCC - so doing so explicitly is unlikely to provide
any benefit.  Since that row is continually being inserted, and no other
rows are having this issue, I'm seriously doubting that a dump/restore is
going to have any effect either.  Note that the index scan took twice as
long in the bad case - but still reasonable and you didn't notice any
benefit from a REINDEX.  This is what I would expect.
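
You can see the new row version being created with something along these
lines (sketch only; 10808 stands in for the usid in question):

SELECT ctid, xmin, balance FROM easysms_users WHERE usid = 10808;  -- note ctid/xmin
UPDATE easysms_users SET balance = balance WHERE usid = 10808;     -- no-op update
SELECT ctid, xmin, balance FROM easysms_users WHERE usid = 10808;  -- new ctid and xmin

The old version stays behind until vacuum removes it, which is exactly what
an explicit DELETE + INSERT would leave behind as well.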

The only other difference, if concurrency has been ruled out, is the 4 vs 18
buffers that had to be read.  I cannot imagine, since all 22 were in cache,
that simply reading that much more data would account for the difference
(we're talking a 10,000-fold increase, not 2-to-4-fold).  The reason for this
particular difference, IIUC, is how many candidate tuples are present whose
visibility has to be accounted for (assuming 1 buffer per tuple, you needed
to scan 4 vs 18 for visibility in the two queries).

Is there any log file information you can share?  Especially if you can set
log_min_duration_statement so that whenever one of these gets adversely
delayed it appears in the log along with whatever other system messages are
being sent.  Checkpoints are a typical culprit, though they should be
affecting a great deal more than what you indicate you are seeing.
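
One caveat about your current settings: log_lock_waits only reports a wait
once it has exceeded deadlock_timeout, and you have deadlock_timeout = 10s,
so a lock wait of a few seconds will never show up in the log.  A sketch of
what I would try in postgresql.conf (adjust to taste, then reload):

log_min_duration_statement = 1s   # already set; logs statements slower than 1s
log_lock_waits = on               # already set; reports waits longer than deadlock_timeout
deadlock_timeout = 1s             # lowered from 10s so multi-second lock waits get logged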

I'm pretty certain you are seeing this here largely because of the frequency
of activity on this particular user; not because the data itself is
corrupted.  It could be some kind of symptom of internal concurrency that
you just haven't observed yet but it could also be I/O or other system
contention that you also haven't properly instrumented.  Unfortunately that
is beyond my current help-providing skill-set.

A dump-restore likely would not make anything worse, though I'd be surprised
if it were to improve matters.  It also doesn't seem like hardware - unless
the RAM is bad.  Software bugs are unlikely if this had been working well
until 5 days ago.  So, you need to observe the system during both periods
(good and bad) and spot something that is different - probably not within
PostgreSQL if indeed you've minimized concurrency.  Also see whether any
other queries, executed during both of these periods, exhibit a performance
decrease.  Logging all statements would help matters greatly if you can
afford it in your production environment - it would make looking for
internal concurrency much easier.

David J.





Re: Re: recently and selectively slow, but very simple, update query....

From: Stelios Mavromichalis
hello again,

some more data on the subject:

you were right, delete/re-insert didn't solve the problem (haven't tried dump/restore yet; i might tonight, when traffic is low).

a short snippet from the logs where i log all queries that take longer than 1 sec:

2014-05-06T15:57:46.303880+03:00 s10 postgres[46220]: [1891-1] prosms prosms LOG:  00000: duration: 1947.172 ms  execute <unnamed>: select easysms_jb_pay($1,
$2)
2014-05-06T15:57:46.304230+03:00 s10 postgres[46220]: [1891-2] prosms prosms DETAIL:  parameters: $1 = '10808', $2 = '2'
2014-05-06T15:57:46.304439+03:00 s10 postgres[46220]: [1891-3] prosms prosms LOCATION:  exec_execute_message, postgres.c:1989
2014-05-06T15:57:56.199005+03:00 s10 postgres[58002]: [2570-1] prosms prosms LOG:  00000: duration: 6869.886 ms  execute <unnamed>: select easysms_jb_pay($1,$2)
2014-05-06T15:57:56.199349+03:00 s10 postgres[58002]: [2570-2] prosms prosms DETAIL:  parameters: $1 = '10808', $2 = '2'
2014-05-06T15:57:56.199567+03:00 s10 postgres[58002]: [2570-3] prosms prosms LOCATION:  exec_execute_message, postgres.c:1989
2014-05-06T15:57:59.134982+03:00 s10 postgres[58002]: [2571-1] prosms prosms LOG:  00000: duration: 1797.747 ms  execute <unnamed>: select easysms_jb_pay($1,$2)
2014-05-06T15:57:59.135334+03:00 s10 postgres[58002]: [2571-2] prosms prosms DETAIL:  parameters: $1 = '10808', $2 = '2'
2014-05-06T15:57:59.135562+03:00 s10 postgres[58002]: [2571-3] prosms prosms LOCATION:  exec_execute_message, postgres.c:1989
2014-05-06T15:58:07.149477+03:00 s10 postgres[46220]: [1892-1] prosms prosms LOG:  00000: duration: 3938.979 ms  execute <unnamed>: select easysms_jb_pay($1,$2)
2014-05-06T15:58:07.149830+03:00 s10 postgres[46220]: [1892-2] prosms prosms DETAIL:  parameters: $1 = '10808', $2 = '2'
2014-05-06T15:58:07.150067+03:00 s10 postgres[46220]: [1892-3] prosms prosms LOCATION:  exec_execute_message, postgres.c:1989
2014-05-06T16:01:33.784422+03:00 s10 postgres[58002]: [2572-1] prosms prosms LOG:  00000: duration: 2921.212 ms  execute <unnamed>: select easysms_jb_pay($1,$2)
2014-05-06T16:01:33.784842+03:00 s10 postgres[58002]: [2572-2] prosms prosms DETAIL:  parameters: $1 = '10808', $2 = '4'
2014-05-06T16:01:33.785037+03:00 s10 postgres[58002]: [2572-3] prosms prosms LOCATION:  exec_execute_message, postgres.c:1989


should you deem it helpful, i can enable full query logging and provide the listing for you.

a typical vmstat:
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 363476 325852 294064 30520736    0    0    26   347    0    0  1  0 99  0  0
 1  0 363476 328400 294068 30520732    0    0     0    28 4930 8014  6  1 94  0  0
 2  0 363476 331756 294068 30520732    0    0     0  4384 4950 7980  6  1 93  0  0
 0  0 363476 334016 294068 30520756    0    0     0  4384 4961 7981  7  1 92  0  0
 0  0 363476 334700 294068 30520756    0    0     0  4424 4012 6467  4  1 95  0  0
 1  0 363476 330852 294068 30520788    0    0     0     0 2559 3861  5  1 95  0  0
 1  0 363476 331316 294072 30520788    0    0     0  4408 5013 8127  6  1 94  0  0
 1  0 363476 330788 294072 30520788    0    0     0  4384 5535 9055  6  1 93  0  0
 0  0 363476 331496 294072 30520804    0    0     0  4384 5031 8092  7  1 92  0  0
 2  0 363476 331268 294072 30520804    0    0     0  4428 5052 8246  6  1 94  0  0
 1  0 363476 330848 294080 30520812    0    0     0    32 4892 7950  5  1 94  0  0
 1  0 363476 330480 294080 30520812    0    0     0  4388 4935 8036  6  1 94  0  0
 2  0 363476 332616 294084 30521092    0    0     0  4408 5064 8194  6  1 93  0  0
 0  0 363476 333596 294084 30521008    0    0     0  4384 5205 8463  8  1 91  0  0
 1  0 363476 333324 294084 30521008    0    0     0    40 5014 8151  6  1 94  0  0
 0  0 363476 332740 294084 30521016    0    0     0  4384 5047 8163  6  1 93  0  0
 1  0 363476 336052 294084 30520888    0    0     0  4384 4849 7780  6  1 94  0  0
 1  0 363476 334732 294088 30520892    0    0     8  4400 5520 9012  6  1 93  0  0
 0  0 363476 334064 294096 30520884    0    0     0   220 3903 6193  6  1 94  0  0
 0  0 363476 333124 294096 30520916    0    0     0  2232 4088 6462  6  1 93  0  0

the process that is constantly writing the majority of data is:

"postgres: stats collector process"

and it varies from 2.5 MB/sec up to 5.7 MB/sec

all the other postgres (and non-postgres) processes write very little data and rarely.

the checkpointer process writes around 78.6 KB/sec (measured over a few seconds as i write this email); no other process has a constant rate of I/O.
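
for reference, checkpoint and background-writer activity can also be read from the stats view (untested sketch); sampling it twice, a few minutes apart, shows how much checkpoint write activity there really is:

SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_backend
FROM pg_stat_bgwriter;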

also, _while having the problem_, the results of the following queries (taken from http://wiki.postgresql.org/wiki/Lock_Monitoring) are:

 SELECT relation::regclass, * FROM pg_locks WHERE NOT granted;
(0 rows)


SELECT bl.pid                 AS blocked_pid,
       a.usename              AS blocked_user,
       ka.query               AS blocking_statement,
       now() - ka.query_start AS blocking_duration,
       kl.pid                 AS blocking_pid,
       ka.usename             AS blocking_user,
       a.query                AS blocked_statement,
       now() - a.query_start  AS blocked_duration
FROM pg_catalog.pg_locks bl
JOIN pg_catalog.pg_stat_activity a  ON a.pid = bl.pid
JOIN pg_catalog.pg_locks kl         ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
WHERE NOT bl.granted;
(0 rows)

SELECT waiting.locktype           AS waiting_locktype,
       waiting.relation::regclass AS waiting_table,
       waiting_stm.query          AS waiting_query,
       waiting.mode               AS waiting_mode,
       waiting.pid                AS waiting_pid,
       other.locktype             AS other_locktype,
       other.relation::regclass   AS other_table,
       other_stm.query            AS other_query,
       other.mode                 AS other_mode,
       other.pid                  AS other_pid,
       other.granted              AS other_granted
FROM pg_catalog.pg_locks AS waiting
JOIN pg_catalog.pg_stat_activity AS waiting_stm
  ON (waiting_stm.pid = waiting.pid)
JOIN pg_catalog.pg_locks AS other
  ON ((waiting."database" = other."database" AND waiting.relation = other.relation)
      OR waiting.transactionid = other.transactionid)
JOIN pg_catalog.pg_stat_activity AS other_stm
  ON (other_stm.pid = other.pid)
WHERE NOT waiting.granted
  AND waiting.pid <> other.pid;
(0 rows)



WITH RECURSIVE    c(requested, current) AS      ( VALUES        ('AccessShareLock'::text, 'AccessExclusiveLock'::text),        ('RowShareLock'::text, 'ExclusiveLock'::text),        ('RowShareLock'::text, 'AccessExclusiveLock'::text),        ('RowExclusiveLock'::text, 'ShareLock'::text),        ('RowExclusiveLock'::text, 'ShareRowExclusiveLock'::text),        ('RowExclusiveLock'::text, 'ExclusiveLock'::text),        ('RowExclusiveLock'::text, 'AccessExclusiveLock'::text),        ('ShareUpdateExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),        ('ShareUpdateExclusiveLock'::text, 'ShareLock'::text),        ('ShareUpdateExclusiveLock'::text, 'ShareRowExclusiveLock'::text),        ('ShareUpdateExclusiveLock'::text, 'ExclusiveLock'::text),        ('ShareUpdateExclusiveLock'::text, 'AccessExclusiveLock'::text),        ('ShareLock'::text, 'RowExclusiveLock'::text),        ('ShareLock'::text, 'ShareUpdateExclusiveLock'::text),        ('ShareLock'::text, 'ShareRowExclusiveLock'::text),        ('ShareLock'::text, 'ExclusiveLock'::text),        ('ShareLock'::text, 'AccessExclusiveLock'::text),        ('ShareRowExclusiveLock'::text, 'RowExclusiveLock'::text),        ('ShareRowExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),        ('ShareRowExclusiveLock'::text, 'ShareLock'::text),        ('ShareRowExclusiveLock'::text, 'ShareRowExclusiveLock'::text),        ('ShareRowExclusiveLock'::text, 'ExclusiveLock'::text),        ('ShareRowExclusiveLock'::text, 'AccessExclusiveLock'::text),        ('ExclusiveLock'::text, 'RowShareLock'::text),        ('ExclusiveLock'::text, 'RowExclusiveLock'::text),        ('ExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),        ('ExclusiveLock'::text, 'ShareLock'::text),        ('ExclusiveLock'::text, 'ShareRowExclusiveLock'::text),        ('ExclusiveLock'::text, 'ExclusiveLock'::text),        ('ExclusiveLock'::text, 'AccessExclusiveLock'::text),        ('AccessExclusiveLock'::text, 'AccessShareLock'::text),        ('AccessExclusiveLock'::text, 'RowShareLock'::text),        ('AccessExclusiveLock'::text, 'RowExclusiveLock'::text),        ('AccessExclusiveLock'::text, 'ShareUpdateExclusiveLock'::text),        ('AccessExclusiveLock'::text, 'ShareLock'::text),        ('AccessExclusiveLock'::text, 'ShareRowExclusiveLock'::text),        ('AccessExclusiveLock'::text, 'ExclusiveLock'::text),        ('AccessExclusiveLock'::text, 'AccessExclusiveLock'::text)      ),    l AS      (        SELECT            (locktype,DATABASE,relation::regclass::text,page,tuple,virtualxid,transactionid,classid,objid,objsubid) AS target,            virtualtransaction,            pid,            mode,            granted          FROM pg_catalog.pg_locks      ),    t AS      (        SELECT            blocker.target  AS blocker_target,            blocker.pid     AS blocker_pid,            blocker.mode    AS blocker_mode,            blocked.target  AS target,            blocked.pid     AS pid,            blocked.mode    AS mode          FROM l blocker          JOIN l blocked            ON ( NOT blocked.granted             AND blocker.granted             AND blocked.pid != blocker.pid             AND blocked.target IS NOT DISTINCT FROM blocker.target)          JOIN c ON (c.requested = blocked.mode AND c.current = blocker.mode)      ),    r AS      (        SELECT            blocker_target,            blocker_pid,            blocker_mode,            '1'::int        AS depth,            target,            pid,            mode,            blocker_pid::text || ',' || pid::text AS seq    
      FROM t        UNION ALL        SELECT            blocker.blocker_target,            blocker.blocker_pid,            blocker.blocker_mode,            blocker.depth + 1,            blocked.target,            blocked.pid,            blocked.mode,            blocker.seq || ',' || blocked.pid::text          FROM r blocker          JOIN t blocked            ON (blocked.blocker_pid = blocker.pid)          WHERE blocker.depth < 1000      )
SELECT * FROM r ORDER BY seq;
(0 rows)


no findings here either :(

best,

/mstelios


Stelios Mavromichalis
Cytech Ltd. - http://www.cytech.gr/
Science & Technology Park of Crete
fax: +30 2810 31 1045
tel.: +30 2810 31 4127
mob.: +30 697 7078013
skype: mstelios



Re: Re: recently and selectively slow, but very simple, update query....

From: Stelios Mavromichalis
hello everyone,

i feel like i owe you all an apology.

it _was_ a locking problem after all.

i don't know why i was unlucky enough never to catch it for so long, no matter how hard i tried and how many times i looked.

eventually, after many hours on the problem, and after running the lock dependency query (the last one from <http://wiki.postgresql.org/wiki/Lock_Monitoring>) many times, seemingly (or so it appeared at the time) to no purpose, it finally revealed the heart of my problem.

there was another query updating another column (opt_pushdlr_ltstamp) on that particular row (usid = 10808). however, due to badly written code (it didn't addBatch and execute everything at once; instead it updated immediately and delayed the commit while waiting for external _slow_ servers to reply) and external events, that update didn't commit for long periods of time, forcing the original balance update to wait for it.
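
for anyone hitting something similar: a simpler check that would also have caught this (untested sketch, using 9.2's pg_stat_activity columns) is to look for sessions sitting "idle in transaction" for a long time, since they can hold row locks the whole while:

SELECT pid, usename, state, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;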

so, thanks to your advice (that a dump/restore wouldn't help me), i didn't bring the service down for a dump/restore cycle that wasn't required after all, saving me service downtime.

i can't thank you enough for leading me to the solution of this problem and for helping me gain a deeper understanding of postgres.

i wish you all a great day.

my apology and best regards,

/mstelios


Stelios Mavromichalis
Cytech Ltd. - http://www.cytech.gr/
Science & Technology Park of Crete
fax: +30 2810 31 1045
tel.: +30 2810 31 4127
mob.: +30 697 7078013
skype: mstelios

