
From Greg Smith
Subject Re: auto_explain WAS: RFC: Timing Events
Msg-id 50AD607E.80806@2ndQuadrant.com
In response to Re: auto_explain WAS: RFC: Timing Events  (Josh Berkus <josh@agliodbs.com>)
Responses Re: auto_explain WAS: RFC: Timing Events  (Gavin Flower <GavinFlower@archidevsys.co.nz>)
Re: auto_explain WAS: RFC: Timing Events  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On 11/8/12 2:16 PM, Josh Berkus wrote:

> Also, logging only the long-running queries is less useful than people
> on this list seem to think.  When I'm doing real performance analysis, I
> need to see *everything* which was run, not just the slow stuff.  Often
> the real problem is a query which used to take 1.1ms, now takes 1.8ms,
> and gets run 400 times/second. Looking just at the slow queries won't
> tell you that.

No argument here.  I've tried to be clear that some of these high-speed 
but lossy things I'm hacking on are not going to be useful for everyone. 
A solution that found most of these 1.8ms queries, but dropped some 
percentage under the highest peak load conditions, would still be very 
useful to me.

An anecdote on this topic seems relevant.  I have a troublesome 
production server that has moved log_min_duration_statement from 100ms 
to 500ms to 2000ms as the system grew.  Even the original setting wasn't 
short enough to catch everything we would like to watch *now*, but 
seeing sub-second data is a dream at this point.  The increases have 
been forced by logging contention becoming unmanageable when every client 
has to fight for the log to write to disk.  I can see the buggers stack 
up as waiting for locks if I try to log shorter statements, stalling 
enough that it drags the whole server down under peak load.

If I could turn off logging just during those periods--basically, 
throwing statements away only when some rate-throttled output component 
hit its limit--I could still find them in the data collected the rest 
of the time.  Some types of problems only occur under peak load, and 
this idea would miss those, but you'd still be likely to catch *some* 
of them, statistically.
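
To make that throttling idea concrete, here's a minimal sketch--purely 
hypothetical, nothing that exists in PostgreSQL today--of the sort of 
rate-throttled gate I mean.  It's a token bucket: entries pass through 
until the output budget is spent, then get dropped instead of making 
the caller wait.  The rate and burst numbers are made up for the demo.

    /*
     * Hypothetical rate-throttled log gate; not PostgreSQL code.
     * Admits up to "rate" entries/second with bursts up to "burst";
     * when the bucket is empty, entries are dropped, never blocked on.
     */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    typedef struct LogThrottle
    {
        double      tokens;     /* current budget, in entries */
        double      rate;       /* refill rate, entries/second */
        double      burst;      /* bucket capacity */
        double      last;       /* last refill time, in seconds */
    } LogThrottle;

    static double
    now_seconds(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Returns true to log this entry, false to drop it. */
    static bool
    throttle_admit(LogThrottle *t)
    {
        double      now = now_seconds();

        /* Refill the bucket for the time elapsed since last check. */
        t->tokens += (now - t->last) * t->rate;
        if (t->tokens > t->burst)
            t->tokens = t->burst;
        t->last = now;

        if (t->tokens < 1.0)
            return false;       /* over budget: drop, don't block */
        t->tokens -= 1.0;
        return true;
    }

    int
    main(void)
    {
        LogThrottle t = {.tokens = 100, .rate = 100, .burst = 100};
        int         kept = 0,
                    dropped = 0;

        t.last = now_seconds();

        /* Simulate 10000 loggable statements arriving in a burst. */
        for (int i = 0; i < 10000; i++)
        {
            if (throttle_admit(&t))
                kept++;
            else
                dropped++;
        }
        printf("kept %d, dropped %d\n", kept, dropped);
        return 0;
    }

Under normal load the bucket never empties and everything gets logged; 
drops only start once arrivals outrun the configured output rate, which 
is exactly the lossy-but-cheap behavior I'm after.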

There's a really hard trade-off here:

-Sometimes you must save data on every query to capture fine details
-Making every query wait for disk I/O is impractical

The sort of ideas you threw out for making things like auto_explain 
logging per-session can work.  The main limitation, though, is that it 
presumes you already know roughly where the problem is in the first 
place.  I am not optimistic that covers a whole lot of ground either.

Upthread I talked about a background process that persists shared memory 
queues as a future consumer of timing events--one that might consume 
slow query data too.  That is effectively acting as the ideal component 
I described above, one that only loses things when it exceeds the 
system's write capacity for saving them.  I wouldn't want to try and 
retrofit the existing PostgreSQL logging facility for such a thing 
though.  Log parsing as the way to collect data is filled with headaches 
anyway.
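
As a sketch of what the backend side of that might look like--again 
hypothetical, with a plain in-process ring buffer standing in for a 
real shared memory queue with proper locking--producers enqueue events 
without ever blocking, dropping them when the ring is full, and a 
single background writer drains at whatever rate storage can sustain:

    /*
     * Hypothetical lossy event queue, a stand-in for a shared memory
     * queue drained by a background writer.  Producers never block:
     * if the writer has fallen behind and the ring is full, the event
     * is counted as lost and the backend moves on.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define QUEUE_SIZE 8        /* deliberately tiny to force drops */

    typedef struct EventQueue
    {
        char        slots[QUEUE_SIZE][128];
        unsigned    head;       /* next slot to write */
        unsigned    tail;       /* next slot to read */
        unsigned long lost;     /* events dropped: ring was full */
    } EventQueue;

    /* Backend side: enqueue if there's room, else drop and move on. */
    static bool
    queue_put(EventQueue *q, const char *event)
    {
        if (q->head - q->tail == QUEUE_SIZE)
        {
            q->lost++;          /* writer is behind; accept the loss */
            return false;
        }
        snprintf(q->slots[q->head % QUEUE_SIZE],
                 sizeof(q->slots[0]), "%s", event);
        q->head++;
        return true;
    }

    /* Writer side: drain one event, or return false if empty. */
    static bool
    queue_get(EventQueue *q, char *out, size_t outlen)
    {
        if (q->head == q->tail)
            return false;
        snprintf(out, outlen, "%s", q->slots[q->tail % QUEUE_SIZE]);
        q->tail++;
        return true;
    }

    int
    main(void)
    {
        EventQueue  q = {0};
        char        buf[128];

        /* 20 events arrive while the writer is stalled; 8 fit. */
        for (int i = 0; i < 20; i++)
        {
            snprintf(buf, sizeof(buf), "slow query event %d", i);
            queue_put(&q, buf);
        }
        while (queue_get(&q, buf, sizeof(buf)))
            printf("persisted: %s\n", buf);
        printf("lost: %lu\n", q.lost);
        return 0;
    }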

I don't see any other good way to resolve this trade-off.  To help with 
the real-world situation you describe, ideally you dump all the query 
data somewhere, fast, and have the client move on.  You can't make 
queries wait for storage; something else (with a buffer!) needs to 
take over that job.

I can't find the URL right now, but at PG.EU someone was showing me a 
module that grabbed the new 9.2 logging hook and shipped the result to 
another server.  Clients burn a little CPU and network time, move on, 
and disk I/O magically drops out of their concerns.  How much overhead 
persisting the data takes is then not the database server's problem at 
all.  That's the sort of flexibility logging needs to gain eventually: 
something so efficient that every client can afford to do it, capable 
of saving all events under ideal conditions, but willing to keep going 
and accept the loss under adverse ones.
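
The 9.2 hook in question is emit_log_hook, and I'd guess that module 
was built along these lines.  A hedged sketch of such an extension--not 
the actual module, and the collector address, port, and use of 
non-blocking UDP (a Linux-ism via SOCK_NONBLOCK) are all my inventions 
here--would forward each message with a fire-and-forget send, so a slow 
or absent collector costs the backend nothing:

    /*
     * Sketch of a log-shipping extension using the emit_log_hook
     * added in PostgreSQL 9.2.  Hypothetical code: each message is
     * fired at a collector over non-blocking UDP, so the backend
     * either ships it or drops it, and never stalls on remote disk.
     */
    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/elog.h"

    #include <arpa/inet.h>
    #include <sys/socket.h>

    PG_MODULE_MAGIC;

    void        _PG_init(void);

    static emit_log_hook_type prev_emit_log_hook = NULL;
    static int  ship_socket = -1;
    static struct sockaddr_in collector;

    static void
    ship_log_entry(ErrorData *edata)
    {
        if (ship_socket >= 0 && edata->message != NULL)
        {
            /* Fire and forget: ship the message or drop it. */
            (void) sendto(ship_socket, edata->message,
                          strlen(edata->message), 0,
                          (struct sockaddr *) &collector,
                          sizeof(collector));
        }

        /* Chain to any hook installed before us. */
        if (prev_emit_log_hook)
            prev_emit_log_hook(edata);
    }

    void
    _PG_init(void)
    {
        /* Hard-coded collector; a real module would expose GUCs. */
        collector.sin_family = AF_INET;
        collector.sin_port = htons(9999);
        inet_pton(AF_INET, "127.0.0.1", &collector.sin_addr);

        ship_socket = socket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, 0);

        prev_emit_log_hook = emit_log_hook;
        emit_log_hook = ship_log_entry;
    }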

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


