Re: How to analyze a slowdown in 9.3.5? - Mailing list pgsql-general

From Michael Nolan
Subject Re: How to analyze a slowdown in 9.3.5?
Date
Msg-id CAOzAqu+5vnBo3g1dNy3Vnix-R+uuVjXNV4SBOaKzFPvwfzEhwA@mail.gmail.com
Whole thread Raw
In response to Re: How to analyze a slowdown in 9.3.5?  (Andy Colson <andy@squeakycode.net>)
List pgsql-general
The function is a complicated plpgsql function that makes numerous database queries, all read-only. (Other parts of that program may make changes to the database.)

The first database shutdown and the shutdown/reboot later on were both 'clean' shutdowns, so there shouldn't have been any kind of transaction rollback.  

I has sar running on that server, if that provides any useful data.  Mostly I'm just trying to make up a list of what to look for and what to log in case it happens again.  (It runs again on Tuesday, and I already know it will be going back to review 2004 data so it'll be an even longer run than this week's was.)
--
Mike Nolan

On Sat, Jan 10, 2015 at 12:55 PM, Andy Colson <andy@squeakycode.net> wrote:
On 01/09/2015 07:52 PM, Tomas Vondra wrote:
On 9.1.2015 23:14, Michael Nolan wrote:
I'm running 9.3.5 on a virtual machine with 5 cores and 24 GB of
memory.  Disk is on a SAN.

I have a task that runs weekly that processes possibly as many as
120 months worth of data, one month at a time. Since moving to 9.3.5
(from 8.2!!) the average time for a month has been 3 minutes or less.

Congrats to migrating to a supported version!

Please, comparison of the configuration used on 8.2 and 9.3.5 would be
helpful (i.e. how you've updated the config on the new version?).

However, when this job ran this Tuesday, it ran fine for a number of
months, but then started slowing down dramatically, 300 minutes for
one month and then 167 minutes for the next. I stopped and restarted
postgresql, the next block also ran really slow (157 minutes.) I
then rebooted the server and the remaining blocks ran at the usual
fast speed again, so restarting postgresql didn't fix the problem
but rebooting the server did.

What amounts of data are we talking about? Gigabytes? Tens of gigabytes?


Looking at the logs, I see queries with a function call that would
normally take no more than 100-200 milliseconds, usually far less,
that were taking 100 seconds or longer. This function gets called
thousands of times for each month, so that appears to be one source
of the slowdown.

But why are the functions taking so much longer? Are they eating CPU,
I/O or are generally waiting for something (e.g. locks)?


I don't suspect a memory leak in the calling program (in php),
because since moving to this server in December this weekly task has
run several times over the same range of months, making pretty much
the same function calls each time. I also ran the entire range
several times during testing.

One change made to the server since the previous week's run was that
I moved up to the latest Centos kernel (Linux version
3.10.0-123.13.2.el7.x86_64).

And what was the previous kernel version?

However, if it worked fine after rebooting the server, it may not be a
kernel issue (unless it somehow depends on uptime). Is there something
in the /var/log/messages?


At first, I was thinking, lots of activity within one transaction was messing up the stats and the planner started getting it wrong.  But a reboot wouldn't fix that.  Would it?  What if the reboot rolled back the db, would that stats make sense again?

I have a process that makes a big temp table (with indexes).  After its built if I dont run a quick analyze on it the planner never uses the indexes right.

Another thing I can think of is never commiting.  If it started collecting lots and lots of row versions it could get slower and slower.  But, then, you'd see the same thing on 8.2, so, that's probably not it.  Do you have any "Idle in transaction" connections?

-Andy





--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

pgsql-general by date:

Previous
From: Andres Freund
Date:
Subject: Re: unexpected PQresultStatus: 8 with simple logical replication
Next
From: Michael Nolan
Date:
Subject: Re: How to analyze a slowdown in 9.3.5?