Re: Odd blocking (or massively latent) issue - even with EXPLAIN - Mailing list pgsql-performance

From Martin French
Subject Re: Odd blocking (or massively latent) issue - even with EXPLAIN
Date
Msg-id OF7755750C.7073A920-ON80257A45.00249D8E-80257A45.00259C8D@LocalDomain
In response to Re: Odd blocking (or massively latent) issue - even with EXPLAIN  (Jim Vanns <james.vanns@framestore.com>)
Responses Re: Odd blocking (or massively latent) issue - even with EXPLAIN
List pgsql-performance
> > Hi
> >
> > >
> > > We're seeing SELECT statements and even EXPLAIN (no ANALYZE)
> > > statements hang indefinitely until *something* (we don't know what)
> > > releases some kind of resource or no longer becomes a massive bottle
> > > neck. These are the symptoms.
> >
> > Is this in pgAdmin? Or psql on the console?
> >
> psql
>
> > > However, the system seems healthy - no table ('heavyweight') locks are
> > > held by any session (this happens with only a few connected sessions),
> > > all indexes are used correctly, other transactions are writing data (we
> > > generally only have a few sessions running at a time - perhaps 10) etc.
> > > etc. In fact, we are writing (or bgwriter is), 2-3 hundred MB/s
> > > sometimes.
> >
> > What is shown in "top" and "iostat" whilst the queries are running?
>
> Generally, lots of CPU churn (90-100%) and a fair bit of I/O wait.
> iostat reports massive reads (up to 300MB/s).

This looks like a pure I/O issue. You mentioned that this was a software
RAID system. I wonder if there's some complication there.

Have you tried setting the disk queues to deadline?

echo "deadline" > /sys/block/{DEVICE-NAME}/queue/scheduler

That might help. But to be honest, it really does sound disk/software RAID
related, with the CPU and I/O being so high.

Can you attempt to replicate the problem on another system without software
RAID?

Also, you might want to try a disk test on the machine. It's 24GB RAM, right?

So, try the following tests on the Postgres data disk (you'll obviously need
lots of space for this):

Write Test:
 time sh -c "dd if=/dev/zero of=bigfile bs=8k count=6000000 && sync"

Read Test:
 time dd if=bigfile of=/dev/null bs=8k

(Tests taken from Greg Smith's page:
http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm )

>
> > >
> > > We regularly run vacuum analyze at quiet periods - generally 1-2s daily.
>
> (this is to answer to someone who didn't reply to the list)
>
> We run full scans using vacuumdb so don't just rely on autovacuum. The
> small table is so small (<50 tuples) a sequence scan is always
> performed.
>
> > > These sessions (that only read data) that are blocked can block from
> > > anything from between only 5 minutes to 10s of hours then miraculously
> > > complete successfully at once.
> > >
> >
> > Are any "blockers" shown in pg_stat_activity?
>
> None. Ever. Nothing in pg_locks either.
>
> > >
> > > checkpoint_segments = 128
> > > maintenance_work_mem = 256MB
> > > synchronous_commit = off
> > > random_page_cost = 3.0
> > > wal_buffers = 16MB
> > > shared_buffers = 8192MB
> > > checkpoint_completion_target = 0.9
> > > effective_cache_size = 18432MB
> > > work_mem = 32MB
> > > effective_io_concurrency = 12
> > > max_stack_depth = 8MB
> > > log_autovacuum_min_duration = 0
> > > log_lock_waits = on
> > > autovacuum_vacuum_scale_factor = 0.1
> > > autovacuum_naptime = 8
> > > autovacuum_max_workers = 4
> >
> > Memory looks reasonably configured to me. effective_cache_size is only
> > an indication to the planner and is not actually allocated.
>
> I realise that.
>
> > Is anything being written to the logfiles?
>
> Nothing obvious - and we log a fair amount. No tmp table creations,
> no locks held.
>
> To add to this EXPLAIN reports it took only 0.23ms to run (for example)
> whereas the wall clock time is more like 20-30 minutes (or up to n hours
> as I said where everything appears to click back into place at the same
> time).
>
> Thanks.
>

Something else you might want to try is running with a default
postgresql.conf. If the query/EXPLAIN then runs fine, that would lead me to
believe that there is a configuration issue, although I'm pretty convinced
that it may be the disk setup.

Cheers
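
The scheduler check and the dd tests suggested in this message can be collected into a small dry-run helper. This is only a sketch under stated assumptions: the function names (dd_block_count, active_scheduler, show_plan) are hypothetical, and the factor of 250000 8k blocks per GB of RAM is inferred from the count=6000000 figure quoted for a 24GB machine, which makes the test file roughly twice the size of RAM so the page cache cannot hold it. The script only prints the commands; it does not change the scheduler or write any file.

```shell
#!/bin/sh
# Dry-run sketch: prints the suggested commands instead of running them.
# All function names here are hypothetical, chosen for illustration.

# Number of 8k blocks for a test file of roughly twice RAM.
# Assumption: 250000 blocks of 8k (about 2GB) per GB of RAM,
# which matches count=6000000 for the 24GB machine in the thread.
dd_block_count() {
    echo $(( $1 * 250000 ))
}

# Pick the active scheduler out of a sysfs line such as
# "noop deadline [cfq]"; the active one is shown in brackets.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Print the commands for a machine with $1 GB of RAM and block device $2.
show_plan() {
    ram_gb=$1
    dev=$2
    count=$(dd_block_count "$ram_gb")
    echo "cat /sys/block/$dev/queue/scheduler"
    echo "echo deadline > /sys/block/$dev/queue/scheduler"
    echo "time sh -c \"dd if=/dev/zero of=bigfile bs=8k count=$count && sync\""
    echo "time dd if=bigfile of=/dev/null bs=8k"
}
```

For example, `show_plan 24 sda` prints the four commands for a 24GB machine, where sda stands in for the real device name. The printed commands would need to be run as root, from a directory on the Postgres data disk.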
