Thread: backend suddenly becomes slow, then remains slow

backend suddenly becomes slow, then remains slow

From
Andrew Dunstan
Date:
One of my clients has an odd problem. Every so often a backend will
suddenly become very slow. The odd thing is that once this has happened
it remains slowed down, for all subsequent queries. Zone reclaim is off.
There is no IO or CPU spike, no checkpoint issues or stats timeouts, no
other symptom that we can see. The problem was a lot worse than it is
now, but two steps have alleviated it mostly, though not completely: much
less aggressive autovacuuming and reducing the maximum lifetime of
backends in the connection pooler to 30 minutes.

It's got us rather puzzled. Has anyone seen anything like this?

cheers

andrew


Re: backend suddenly becomes slow, then remains slow

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@pgexperts.com> writes:
> One of my clients has an odd problem. Every so often a backend will
> suddenly become very slow. The odd thing is that once this has happened
> it remains slowed down, for all subsequent queries. Zone reclaim is off.
> There is no IO or CPU spike, no checkpoint issues or stats timeouts, no
> other symptom that we can see. The problem was a lot worse than it is
> now, but two steps have alleviated it mostly, but not completely: much
> less aggressive autovacuuming and reducing the maximum lifetime of
> backends in the connection pooler to 30 minutes.

> It's got us rather puzzled. Has anyone seen anything like this?

Maybe the kernel is auto-nice'ing the process once it's accumulated X
amount of CPU time?

            regards, tom lane
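
One way to test the auto-nice idea is to read the scheduling priority of the slow backend directly. The following is a minimal sketch, assuming a Unix-like host and that you know the backend's PID; the helper names `nice_value` and `was_reniced` are made up for illustration, and the default expectation of nice 0 is an assumption about how the server was started.

```python
import os


def nice_value(pid):
    """Return the current nice value of the process with the given PID (Unix only)."""
    return os.getpriority(os.PRIO_PROCESS, pid)


def was_reniced(pid, expected_nice=0):
    """True if the process is no longer at the nice value we expected it to have.

    expected_nice=0 is an assumption: it is the usual default for a
    process that nobody has reniced.
    """
    return nice_value(pid) != expected_nice


if __name__ == "__main__":
    # Hypothetical usage: replace os.getpid() with the PID of the slow backend.
    pid = os.getpid()
    print(pid, nice_value(pid), was_reniced(pid))
```

Comparing the value for a slow backend against a freshly started one would show whether the priority changed over the life of the process.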


Re: backend suddenly becomes slow, then remains slow

From
Andrew Dunstan
Date:
On 12/14/2012 02:56 PM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@pgexperts.com> writes:
>> One of my clients has an odd problem. Every so often a backend will
>> suddenly become very slow. The odd thing is that once this has happened
>> it remains slowed down, for all subsequent queries. Zone reclaim is off.
>> There is no IO or CPU spike, no checkpoint issues or stats timeouts, no
>> other symptom that we can see. The problem was a lot worse than it is
>> now, but two steps have alleviated it mostly, but not completely: much
>> less aggressive autovacuuming and reducing the maximum lifetime of
>> backends in the connection pooler to 30 minutes.
>> It's got us rather puzzled. Has anyone seen anything like this?
> Maybe the kernel is auto-nice'ing the process once it's accumulated X
> amount of CPU time?
>
>


That was my initial thought, but the client said not. We'll check again.

cheers

andrew



Re: backend suddenly becomes slow, then remains slow

From
Jeff Janes
Date:
On Fri, Dec 14, 2012 at 10:40 AM, Andrew Dunstan <andrew.dunstan@pgexperts.com> wrote:
> One of my clients has an odd problem. Every so often a backend will suddenly
> become very slow. The odd thing is that once this has happened it remains
> slowed down, for all subsequent queries. Zone reclaim is off. There is no IO
> or CPU spike, no checkpoint issues or stats timeouts, no other symptom that
> we can see.

By "no spike", do you mean that the system as a whole is not using an unusual amount of IO or CPU, or that this specific slow back-end is not using an unusual amount?

Could you strace it and see what it is doing?

> The problem was a lot worse than it is now, but two steps have
> alleviated it mostly, but not completely: much less aggressive autovacuuming
> and reducing the maximum lifetime of backends in the connection pooler to 30
> minutes.

Do you have a huge number of tables?  Maybe over the course of a long-lived connection, it touches enough tables to bloat the relcache / syscache.  I don't know how the autovac would be involved in that, though.


Cheers,

Jeff

Re: backend suddenly becomes slow, then remains slow

From
Andrew Dunstan
Date:
On 12/26/2012 11:03 PM, Jeff Janes wrote:
> On Fri, Dec 14, 2012 at 10:40 AM, Andrew Dunstan
> <andrew.dunstan@pgexperts.com> wrote:
> > One of my clients has an odd problem. Every so often a backend will suddenly
> > become very slow. The odd thing is that once this has happened it remains
> > slowed down, for all subsequent queries. Zone reclaim is off. There is no IO
> > or CPU spike, no checkpoint issues or stats timeouts, no other symptom that
> > we can see.
>
> By "no spike", do you mean that the system as a whole is not using an
> unusual amount of IO or CPU, or that this specific slow back-end is
> not using an unusual amount?


both, really.

>
> Could you strace it and see what it is doing?


Not very easily, because it's a pool connection and we've lowered the
pool session lifetime as part of the amelioration :-) So it's not
happening very much any more.

>
> > The problem was a lot worse than it is now, but two steps have
> > alleviated it mostly, but not completely: much less aggressive autovacuuming
> > and reducing the maximum lifetime of backends in the connection pooler to 30
> > minutes.
>
> Do you have a huge number of tables?  Maybe over the course of a
> long-lived connection, it touches enough tables to bloat the relcache
> / syscache.  I don't know how the autovac would be involved in that,
> though.
>
>

Yes, we do indeed have a huge number of tables. This seems a plausible
thesis.

cheers

andrew




Re: backend suddenly becomes slow, then remains slow

From
Jeff Janes
Date:
On Thursday, December 27, 2012, Andrew Dunstan wrote:
> On 12/26/2012 11:03 PM, Jeff Janes wrote:
>> Do you have a huge number of tables?  Maybe over the course of a long-lived connection, it touches enough tables to bloat the relcache / syscache.  I don't know how the autovac would be involved in that, though.
>
> Yes, we do indeed have a huge number of tables. This seems a plausible thesis.

All of the syscaches have compiled-in, hard-coded numbers of buckets, at most 2048, and once those are exceeded the resulting collision resolution becomes essentially linear.  It is not hard to exceed 2048 tables by a substantial multiple, and even easier to exceed 2048 columns (summed over all tables).

I don't know why syscache doesn't use dynahash: whether it is older than dynahash and was never converted out of inertia, or whether it needs extra features that don't fit the dynahash API.  If the former, then converting the syscaches to use dynahash should give automatic resizing for free.  Maybe that conversion should be a To Do item?
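
To make the fixed-bucket point concrete, here is a rough simulation. It is not PostgreSQL code, just a sketch of a chained hash table under the assumption that keys hash uniformly; the function name `average_probe_length` and the entry counts are invented for illustration, and only the 2048 figure comes from the discussion above.

```python
import random

NBUCKETS = 2048  # fixed bucket count, taken from the "at most 2048" figure above


def average_probe_length(n_entries, nbuckets=NBUCKETS, seed=0):
    """Average number of chain entries examined per successful lookup
    in a chained hash table with `nbuckets` buckets.

    Assumes keys are spread uniformly at random over the buckets.
    """
    rng = random.Random(seed)
    chain_lengths = [0] * nbuckets
    for _ in range(n_entries):
        chain_lengths[rng.randrange(nbuckets)] += 1
    # Finding the i-th entry of a chain examines i entries, so a chain of
    # length L costs 1 + 2 + ... + L = L * (L + 1) / 2 over all of its entries.
    total_probes = sum(length * (length + 1) // 2 for length in chain_lengths)
    return total_probes / n_entries


if __name__ == "__main__":
    for n in (2048, 20480, 204800):
        fixed = average_probe_length(n)
        resized = average_probe_length(n, nbuckets=n)
        print(f"{n:>7} entries: fixed buckets {fixed:6.1f}, "
              f"buckets grown with entries {resized:4.1f}")
```

With the bucket count held at 2048, the average lookup cost grows roughly in proportion to the number of entries, while letting the bucket count grow with the entries keeps it close to constant. That is the behavior a resizing hash table would be expected to provide.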



Cheers,

Jeff