Thread: locking issue on simple selects?

locking issue on simple selects?

From
Tobias Brox
Date:
We have a production database server ... it's quite busy but usually
working completely fine, simple queries taking a fraction of a
millisecond to run.

Recently we've frequently encountered issues where some simple selects
(meaning, selects doing an index lookup and fetching one row) have
become stuck for several minutes.  Apparently all requests on one
exact table gets stuck, all requests not related to said table are
going through without any problems.  According to the pg_stat_activity
view, all queries getting stuck was read-queries (selects), no updates
or anything like that (some of the transactions were doing updates
and/or inserts though).

The obvious thought seems to be that this is a locking issue ... but
it doesn't seem so.  For one thing, AFAIK locking shouldn't affect
selects, only updates?  I've also looked through tons of logs without
finding any obvious locking issues.  In one of the instances, I could
find that there were some open transactions doing updates on one row
in the table and then later becoming stuck (as all other transactions)
when doing a select on another row in the said table.

My second thought was that the database is on the edge of being
overloaded and that the memory situation is also just on the edge ...
important indexes that used to be in memory now has to be accessed
from the disk.  Still, it doesn't make sense, we're not seeing any
serious impact on the CPU iowait status, and it seems improbable that
it should take minutes to load an index?

There aren't any long-lasting transactions going on when the jam
occurs.  I haven't checked much up, usually the jam seems to resolve
itself pretty instantly, but I think that at some point it took half a
minute from the first query was finished until the pg_stat_activity
view got back to normal (meaning typically 0-5 simultaneously
processed queries).

FWIW, we're running pg 8.3.11, transaction isolation level
serializable.  The machine is quad-core hyperthreaded (16 CPUs visible
to the OS), a SAN is used for storage, and different RAIDs are used
for the root partition, pg data and pg wals.

Any ideas?  I'm aware that some configuration (i.e. checkpoint
interval etc) may cause significant delay on write-queries ... but
this is only read-queries.

Re: locking issue on simple selects?

From
Tom Lane
Date:
Tobias Brox <tobixen@gmail.com> writes:
> Recently we've frequently encountered issues where some simple selects
> (meaning, selects doing an index lookup and fetching one row) have
> become stuck for several minutes.  Apparently all requests on one
> exact table gets stuck, all requests not related to said table are
> going through without any problems.  According to the pg_stat_activity
> view, all queries getting stuck was read-queries (selects), no updates
> or anything like that (some of the transactions were doing updates
> and/or inserts though).

> The obvious thought seems to be that this is a locking issue ... but
> it doesn't seem so.  For one thing, AFAIK locking shouldn't affect
> selects, only updates?

An exclusive lock will block selects too.  Have you looked into pg_locks
for ungranted lock requests?

            regards, tom lane

Re: locking issue on simple selects?

From
Tobias Brox
Date:
On 15 September 2010 15:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> An exclusive lock will block selects too.  Have you looked into pg_locks
> for ungranted lock requests?

Well - I thought so, we have a logging script that logs the content of
the pg_locks table, it didn't log anything interesting but it may be a
problem with the script itself.  It does an inner join on
pg_locks.relation = pg_class.oid but when I check now this join seems
to remove most of the rows in the pg_locks table.  Does it make sense
at all to join pg_class with pg_locks?  I will ask the sysadm to
change to an outer join as for now.

Re: locking issue on simple selects?

From
Tom Lane
Date:
Tobias Brox <tobixen@gmail.com> writes:
> Well - I thought so, we have a logging script that logs the content of
> the pg_locks table, it didn't log anything interesting but it may be a
> problem with the script itself.  It does an inner join on
> pg_locks.relation = pg_class.oid but when I check now this join seems
> to remove most of the rows in the pg_locks table.  Does it make sense
> at all to join pg_class with pg_locks?  I will ask the sysadm to
> change to an outer join as for now.

There are lots of locks that are not tied to a specific table, so an
inner join is definitely bad.  You could use an outer join to annotate
rows that do correspond to table locks, though.

            regards, tom lane

Re: locking issue on simple selects?

From
Greg Smith
Date:
Tobias Brox wrote:
> I thought so, we have a logging script that logs the content of
> the pg_locks table, it didn't log anything interesting but it may be a
> problem with the script itself.  It does an inner join on
> pg_locks.relation = pg_class.oid but when I check now this join seems
> to remove most of the rows in the pg_locks table.

Some of the waits you might be running into will be things waiting on
another transaction holding a lock to finish, which are probably wiped
out by this approach.  There are some useful examples of lock views on
the wiki:

http://wiki.postgresql.org/wiki/Lock_Monitoring
http://wiki.postgresql.org/wiki/Lock_dependency_information
http://wiki.postgresql.org/wiki/Find_Locks

And the idea you have of coverting the pg_class one to an outer join
will help.

The other thing you should try is toggling on log_lock_waits and
possibly reducing deadlock_timeout.  This will put a lot of the
information you're trying to collect right into the logs.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: locking issue on simple selects?

From
Brad Nicholson
Date:
  On 10-09-15 03:07 PM, Tobias Brox wrote:
> On 15 September 2010 15:39, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> An exclusive lock will block selects too.  Have you looked into pg_locks
>> for ungranted lock requests?
> Well - I thought so, we have a logging script that logs the content of
> the pg_locks table, it didn't log anything interesting but it may be a
> problem with the script itself.  It does an inner join on
> pg_locks.relation = pg_class.oid but when I check now this join seems
> to remove most of the rows in the pg_locks table.  Does it make sense
> at all to join pg_class with pg_locks?  I will ask the sysadm to
> change to an outer join as for now.
>

You can also enable log_lock_waits and the lock waits will appear in
your Postgres logs.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



Re: locking issue on simple selects?

From
Tobias Brox
Date:
On 15 September 2010 21:28, Greg Smith <greg@2ndquadrant.com> wrote:
> There are some useful examples of lock views on the wiki:
>
> http://wiki.postgresql.org/wiki/Lock_Monitoring
> http://wiki.postgresql.org/wiki/Lock_dependency_information
> http://wiki.postgresql.org/wiki/Find_Locks

Thanks.  I think those pages probably should be merged ... hmm ... if
I manage to solve my locking issues I should probably try and
contribute to the wiki.

Reading the wiki pages, for me it boils down to three things:

1) the current query we're logging seems good enough except that we
should do an outer join except for inner join towards pg_class, so
I've asked our sysadm to fix it.

2) the middle query on http://wiki.postgresql.org/wiki/Lock_Monitoring
seems very useful, and I've asked our sysadm to set up logging of this
one as well.

3) That log_lock_waits config option that you and Brad points to seems
very useful, so I've asked our sysadm to enable it.

I also discovered that there is an attribute pg_stat_activity.waiting
- I suppose it is 't' if a query is waiting for a lock?  It seems
quite useful ...

> reducing deadlock_timeout.

It's set to one second, and some of the jams we have been experiencing
has lasted for several minutes.  I also think it should say in the pg
log if there is a deadlock situation?  I grepped for "deadlock" in the
logs without finding anything.

Well, we'll improve the logging, and wait for the next "jam" to occur
... and I'll make sure to post an update if/when I figure out
something.

Re: locking issue on simple selects?

From
Greg Smith
Date:
Tobias Brox wrote:
I think those pages probably should be merged ... hmm ... if
I manage to solve my locking issues I should probably try and
contribute to the wiki. 

Certainly the 3rd one could be merged with one of the other two, and maybe all merged into one.  I haven't cleaned up that whole area in a whole, it's due for a round of it soon. You're welcome to take a shot at it.  We can always revert any mistakes made.

reducing deadlock_timeout.   
It's set to one second, and some of the jams we have been experiencing
has lasted for several minutes.  I also think it should say in the pg
log if there is a deadlock situation?  

deadlock_timeout is how long a process trying to acquire a lock waits before it a) looks for a deadlock, and b) prints a report that it's waiting into the logs when log_lock_waits is on.  So if you're looking for slow lock acquisition that's in the sub-second range, it can be useful to reduce regardless of whether deadlock ever happens.  That doesn't sound like your situation though.

-- 
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book

Re: locking issue on simple selects?

From
Tobias Brox
Date:
On 15 September 2010 12:05, Tobias Brox <tobixen@gmail.com> wrote:
> Recently we've frequently encountered issues where some simple selects
> (meaning, selects doing an index lookup and fetching one row) have
> become stuck for several minutes.  Apparently all requests on one
> exact table gets stuck, all requests not related to said table are
> going through without any problems.

Now I've set up all kind of logging regarding locks, so it seems like
we're having issues that aren't lock-related.  I just did a bit of
research into one situation today.

All while having this problem, there was one heavy query running in
parallell ... not sure if that's relevant.

Then comes one query that requires a seq scan on the problem table
(that won't happen again - I just added a new index).  Four seconds
later comes another query requiring a simple index lookup.  Still more
queries comes in, most of them simple index lookups, but on different
indexes.  After one minute there are 25 queries in the
pg_stat_activity view towards this table.  It's not a particularly
huge table.  Moments later all 25 queries have been executed.

Re: locking issue on simple selects?

From
"Kevin Grittner"
Date:
Tobias Brox <tobixen@gmail.com> wrote:

> All while having this problem, there was one heavy query running
> in parallell ... not sure if that's relevant.

Have you turned on checkpoint logging?  You might want to see if
these are happening at some particular point in the checkpoint
processing.  If so, look through the archives for posts from Greg
Smith on how to tune that -- he's worked out a nice methodology to
iteratively improve your configuration in this regard.

-Kevin

Re: locking issue on simple selects?

From
Tobias Brox
Date:
On 23 September 2010 22:55, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Have you turned on checkpoint logging?

Yes ... it seems so:

13:19:13.840 - LOG: checkpoint complete: wrote 3849 buffers (0.2%); 0
transaction log file(s) added, 0 removed, 5 recycled; write=269.551 s,
sync=0.103 s, total=269.953 s
13:19:13.841 - LOG:  checkpoint starting: xlog
13:19:33 - the seq scan query towards the affected table started
13:20:31.454 - one of the index lookup queries towards the affected
table was finished
13:20:43.176 - LOG:  checkpoint complete: wrote 108199 buffers (6.9%);
0 transaction log file(s) added, 0 removed, 16 recycled; write=11.521
s, sync=77.533 s, total=89.335 s

> You might want to see if
> these are happening at some particular point in the checkpoint
> processing.  If so, look through the archives for posts from Greg
> Smith on how to tune that -- he's worked out a nice methodology to
> iteratively improve your configuration in this regard.

Thank you, I will ... hmm ... I found this blog post:

http://blog.2ndquadrant.com/en/2010/01/measuring-postgresql-checkpoin.html

Of course I'm doing it my own way:

select *,now() as snapshot into tmp_pg_stat_bgwriter from pg_stat_bgwriter ;

create view tmp_delta_pg_stat_bgwriter as
 select a.checkpoints_timed-b.checkpoints_timed as
checkpoints_timed,a.checkpoints_req-b.checkpoints_req as
checkpoints_req,a.buffers_checkpoint-b.buffers_checkpoint as
buffers_checkpoint,a.buffers_clean-b.buffers_clean as
buffers_clean,a.maxwritten_clean-b.maxwritten_clean as
maxwritten_clean,a.buffers_backend-b.buffers_backend as
buffers_backend,a.buffers_alloc-b.buffers_alloc as buffers_alloc,
now()-b.snapshot as interval
 from pg_stat_bgwriter a ,
 (select * from tmp_pg_stat_bgwriter order by snapshot desc limit 1) as b;

Checkpoint timeout is set to 5 minutes.  Right now we're having
relatively low activity.  I'm not sure how to read the stats below,
but they look OK to me:

select * from tmp_delta_pg_stat_bgwriter ;
 checkpoints_timed | checkpoints_req | buffers_checkpoint |
buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc |
 interval

-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------+-----------------
                 3 |               0 |               8277 |
15 |                0 |             185 |         18691 |
00:12:02.988842
(1 row)

Re: locking issue on simple selects?

From
"Kevin Grittner"
Date:
Tobias Brox <tobixen@gmail.com> wrote:

> 13:19:13.840 - LOG: checkpoint complete

> 13:19:13.841 - LOG:  checkpoint starting

> 13:20:43.176 - LOG:  checkpoint complete

There wasn't a lot of time between the completion of one checkpoint
and the start of the next.  And the two checkpoints finished a
minute and a half apart.  Perhaps you need to boost your
checkpoint_segments setting?  What is it now?

-Kevin