Re: 9.3 performance issues, lots of bind and parse log entries

From: Tomas Vondra
Subject: Re: 9.3 performance issues, lots of bind and parse log entries
In response to: Re: 9.3 performance issues, lots of bind and parse log entries  (Tory M Blue)
List: pgsql-performance

On 5.11.2014 20:16, Tory M Blue wrote:
> Thanks Tomas,
>     On 4.11.2014 21:07, Tory M Blue wrote:
>     > Well after fighting this all day and dealing with a really sluggish db
>     > where even my slon processes were taking several seconds, I reduced my
>     > shared_buffers back to 2GB from 10GB and my work_mem from 7.5GB to 2GB.
>     > I actually undid all my changes, including dropping my
>     > effective_cache_size back to 7GB, and restarted.
>     Have you been using the same parameter values on 9.2, or have you bumped
>     them up only on the new 9.3? I'm wondering whether 9.2 was performing
>     better with the values?
> Things seem to have been running better on 9.2. At this point I'm using
> the same config file from 9.2, and I'm still experiencing slowness under
> heavier write access. My disk subsystem has not changed, hardware has
> not changed; heck, I'm even running the old version of Slony (have not
> upgraded it yet).

So with shared_buffers=10GB and work_mem=7.5GB you saw significant
slowdown both for read and write queries, and after reverting to lower
values the read queries are OK but writes still take much longer?
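For the record, a quick way to double-check which values the running server is actually using after the revert (a sketch; `pg_settings` shows both the effective value and where it came from):

```sql
-- Confirm the settings the running server actually uses, and where
-- each value came from (configuration file, default, etc.):
SELECT name, setting, unit, source
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size');
```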

> But since the upgrade to 9.3 even calls to my sl_log tables which
> are tiny can take:
> 2014-11-04 02:58:40 PST clsdb postgres 21642
> 2014-11-04 02:58:40.515 PST LOG:  duration: 1627.019 ms  statement: fetch
> 500 from LOG;  (log had 145K items).
>     > I have 300 connections configured, we will use around 87 normally
>     > with some spikes, but I'm wondering if the 10GB shared memory caused
>     > me some grief, I don't believe it was the work_mem and don't believe
>     > it was the effective cache, but something caused my DB to run into
>     > issues with basic queries, same queries after restart are finishing
>     > in milliseconds instead of 2-3 seconds. No disk issues seen.
>     I assume only some of the connections will be active (running queries)
>     at the same time. If you expect >> 32 active queries at the same time,
>     you're only increasing latency.
>     Based on your description I assume you're CPU bound (otherwise the
>     machine would not get "hotter", and planning is not about I/O).
>     I'm not sure if this is a production machine or how much you can
>     experiment with it, but it'd be helpful if you could provide some
>     profiling information
>        $ iostat -x -k 1
>        $ vmstat 1
>     and such data. A perf profile would be even better, but to get the most
>     useful info it may be necessary to recompile the postgres with debug
>     info and '-fno-omit-frame-pointer'. Then this should do the trick:
>       perf record -a -g (for a few seconds, then Ctrl-C)
>       perf report
>     or just "perf top" to see what functions are at the top.
> This is a production server, but it was not really CPU bound with 9.2
> so something is odd and I'm starting to stress, because it is a
> production environment :)

Yeah, I was talking about 9.3 - that's clearly CPU bound.

> Connections correct, I have less than 20 or so active requests at a
> time, but I would say active queries are only a handful. I was not
> seeing IO, but was seeing load increase as queries started taking
> longer; nothing in iostat or vmstat/free showed any contention. Heck,
> even top, while it showed some cores as busy, had nothing sitting at
> over 60% utilized. And we are talking a load of 12-14 here on a 32
> core system, when it's normally asleep!

Right. That's consistent with being CPU bound.
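One cheap way to see which backends are busy during such an episode, without touching the config (a sketch; 9.3's `pg_stat_activity` exposes `state` and `query` rather than the old `current_query` column):

```sql
-- Show backends that are actively running a query, longest-running
-- first, so the CPU hogs float to the top:
SELECT pid, state, now() - query_start AS runtime, query
  FROM pg_stat_activity
 WHERE state = 'active'
 ORDER BY query_start;
```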

> This is my master slon insert server, so I can run commands, tweak
> configs but any type of rebuild or restart of postgres is a
> scheduled affair.

OK, understood. That however mostly rules out recompiling with debug
info and frame pointers, as that might make it significantly slower.
That's not something you'd like to do on production.

> These work loads that seem to be creating the issues run between
> midnight and now almost 6am; prior to 9.3 it was taking maybe 4
> hours, now it's taking 6. So tomorrow AM I'll grab some stats when
> I see that it's struggling.
> But even now, with almost no connections or really any major access,
> I'm seeing the sl_log fetch of 500 rows take 1-3 seconds, which is
> just plain silly (it's not constant, though; I may see one of these
> alerts every hour).

Is that a plain "SELECT * FROM sl_log" or something more complex? When
you run EXPLAIN ANALYZE on the query, what do you see?
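For a cursor fetch like the one in the log, the FETCH itself hides the plan; one way to approximate it is to explain the underlying scan directly (a sketch - the schema name `_mycluster` and the choice of `sl_log_1` are placeholders for the actual Slony cluster objects, and the real Slony query carries additional filters):

```sql
-- Approximate the cursor's work by explaining the underlying scan;
-- substitute the real Slony cluster schema for _mycluster:
EXPLAIN ANALYZE
SELECT * FROM _mycluster.sl_log_1;
```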

