Re: auto_explain WAS: RFC: Timing Events - Mailing list pgsql-hackers

From Gavin Flower
Subject Re: auto_explain WAS: RFC: Timing Events
Date
Msg-id 50AD64B6.4030607@archidevsys.co.nz
In response to Re: auto_explain WAS: RFC: Timing Events  (Greg Smith <greg@2ndQuadrant.com>)
Responses Re: auto_explain WAS: RFC: Timing Events  (Jim Nasby <jim@nasby.net>)
List pgsql-hackers
On 22/11/12 12:15, Greg Smith wrote:
> On 11/8/12 2:16 PM, Josh Berkus wrote:
>
>> Also, logging only the long-running queries is less useful than people
>> on this list seem to think.  When I'm doing real performance analysis, I
>> need to see *everything* which was run, not just the slow stuff.  Often
>> the real problem is a query which used to take 1.1ms, now takes 1.8ms,
>> and gets run 400 times/second. Looking just at the slow queries won't
>> tell you that.
>
> No argument here.  I've tried to be clear that some of these 
> high-speed but lossy things I'm hacking on are not going to be useful 
> for everyone.  A solution that found most of these 1.8ms queries, but 
> dropped some percentage under the highest peak load conditions, would 
> still be very useful to me.
>
> An anecdote on this topic seems relevant.  I have a troublesome 
> production server that has moved log_min_duration_statement from 100ms 
> to 500ms to 2000ms as the system grew.  Even the original setting 
> wasn't short enough to catch everything we would like to watch *now*, 
> but seeing sub-second data is a dream at this point.  The increases 
> have been forced by logging contention becoming unmanageable when 
> every client has to fight for the log to write to disk.  I can see 
> the buggers stack up waiting for locks if I try to log shorter 
> statements, stalling enough that it drags the whole server down under 
> peak load.
>
> If I could turn off logging just during those periods--basically, 
> throwing those entries away only when some rate-throttled output 
> component hit its limit--I could still find them in the data collected the rest of 
> the time.  There are some types of problems that also only occur under 
> peak load that this idea would miss, but you'd still be likely to get 
> *some* of them, statistically.
>
> There's a really hard trade-off here:
>
> -Sometimes you must save data on every query to capture fine details
> -Making every query wait for disk I/O is impractical
>
> The sort of ideas you threw out for making things like auto_explain 
> logging per-session can work.  The main limitation there, though, is 
> that it presumes you already know which area the problem is in to 
> begin with.  I am not optimistic that covers a whole lot of ground 
> either.
>
> Upthread I talked about a background process that persists shared 
> memory queues as a future consumer of timing events--one that might 
> consume slow query data too.  That is effectively acting as the ideal 
> component I described above, one that only loses things when it 
> exceeds the system's write capacity for saving them.  I wouldn't want 
> to try and retrofit the existing PostgreSQL logging facility for such 
> a thing though.  Log parsing as the way to collect data is filled with 
> headaches anyway.
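
As a rough illustration of that drop-on-overflow behaviour, here is a
minimal, generic sketch in plain C with pthreads rather than PostgreSQL's
shared-memory machinery; the queue size, entry format, and names are made
up for illustration.  Producers never wait on disk, a background writer
drains the queue at its own pace, and anything arriving while the queue is
full is counted as dropped instead of stalling the query path.

/*
 * Generic sketch of "lose data only past write capacity".  Not
 * PostgreSQL's shared-memory or IPC APIs; sizes and names are invented.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_SLOTS 1024
#define EVENT_LEN   256

typedef struct
{
    char            events[QUEUE_SLOTS][EVENT_LEN];
    int             head;       /* next slot to write */
    int             tail;       /* next slot to read  */
    long            dropped;    /* events lost while the queue was full */
    pthread_mutex_t lock;
} event_queue;

static event_queue q = {.lock = PTHREAD_MUTEX_INITIALIZER};

/* Producer side: called from the query path, must never wait on disk. */
static void
queue_event(const char *msg)
{
    pthread_mutex_lock(&q.lock);
    if ((q.head + 1) % QUEUE_SLOTS == q.tail)
        q.dropped++;            /* full: accept the loss and move on */
    else
    {
        snprintf(q.events[q.head], EVENT_LEN, "%s", msg);
        q.head = (q.head + 1) % QUEUE_SLOTS;
    }
    pthread_mutex_unlock(&q.lock);
}

/* Consumer side: a background writer drains the queue at its own pace. */
static void *
writer_loop(void *arg)
{
    FILE   *out = (FILE *) arg;

    for (;;)
    {
        char    event[EVENT_LEN];
        int     have = 0;

        pthread_mutex_lock(&q.lock);
        if (q.tail != q.head)
        {
            memcpy(event, q.events[q.tail], EVENT_LEN);
            q.tail = (q.tail + 1) % QUEUE_SLOTS;
            have = 1;
        }
        pthread_mutex_unlock(&q.lock);

        if (have)
            fprintf(out, "%s\n", event);    /* slow I/O happens only here */
        else
            usleep(1000);                   /* idle briefly when empty */
    }
    return NULL;
}

int
main(void)
{
    pthread_t   writer;
    int         i;

    pthread_create(&writer, NULL, writer_loop, stdout);

    /* Simulate a burst of per-query events arriving faster than the writer. */
    for (i = 0; i < 100000; i++)
    {
        char    msg[EVENT_LEN];

        snprintf(msg, sizeof(msg), "query %d finished", i);
        queue_event(msg);
    }

    sleep(1);
    fprintf(stderr, "events dropped under load: %ld\n", q.dropped);
    return 0;
}
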
>
> I don't see any other good way to resolve this trade-off.  To help 
> with the real-world situation you describe, ideally you dump all the 
> query data somewhere, fast, and have the client move on.  You can't 
> make queries wait for storage; something else (with a buffer!) needs 
> to take over that job.
>
> I can't find the URL right now, but at PG.EU someone was showing me a 
> module that grabbed the new 9.2 logging hook and shipped the result to 
> another server.  Clients burn a little CPU and network time, then 
> move on; disk I/O magically stops being their concern.  How much 
> overhead persisting the data takes isn't the database server's 
> problem at all then.  That's the sort of flexibility people need to 
> have with logging eventually.  Something so efficient that every 
> client can afford to do it; something that saves all events under 
> ideal conditions; and something that, under adverse conditions, keeps 
> going and accepts the loss.
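
The 9.2 hook in question is presumably emit_log_hook.  A minimal sketch of
a module in that spirit might look like the following; the hook itself is
real, but the module name, collector address and port, and the
fire-and-forget UDP transport are assumptions for illustration, not a
description of the module shown at PG.EU.

/*
 * Minimal sketch: install the 9.2 emit_log_hook and forward each log
 * message to a remote collector over UDP.  Build with PGXS and load via
 * shared_preload_libraries.  Collector address/port are hypothetical.
 */
#include "postgres.h"
#include "fmgr.h"
#include "utils/elog.h"

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

PG_MODULE_MAGIC;

void        _PG_init(void);

static emit_log_hook_type prev_emit_log_hook = NULL;
static int  log_sock = -1;
static struct sockaddr_in collector;

static void
udp_ship_log(ErrorData *edata)
{
    /*
     * Fire and forget: a dropped datagram is accepted loss, and the
     * backend never waits on the collector or on local disk.
     */
    if (log_sock >= 0 && edata->output_to_server && edata->message != NULL)
        (void) sendto(log_sock, edata->message, strlen(edata->message), 0,
                      (struct sockaddr *) &collector, sizeof(collector));

    /* Chain to any previously installed hook. */
    if (prev_emit_log_hook)
        prev_emit_log_hook(edata);
}

void
_PG_init(void)
{
    log_sock = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&collector, 0, sizeof(collector));
    collector.sin_family = AF_INET;
    collector.sin_port = htons(5140);                    /* hypothetical port */
    collector.sin_addr.s_addr = inet_addr("127.0.0.1");  /* hypothetical host */

    prev_emit_log_hook = emit_log_hook;
    emit_log_hook = udp_ship_log;
}
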
>
Would it be useful to be able to specify a fixed sampling rate, say 1 
in 100?  That way you could get a well-defined statistical sample, yet 
not cause excessive I/O.
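
For illustration, a minimal sketch of such a sampling decision in generic
C; the 1-in-100 constant and the function name are hypothetical, not an
existing PostgreSQL setting:

#include <stdlib.h>
#include <stdbool.h>

/* Hypothetical knob: log roughly 1 query in 100. */
#define SAMPLE_ONE_IN 100

static bool
should_log_this_query(void)
{
    /*
     * One random draw per query; random() is already seeded per backend.
     * A real patch would presumably expose the rate as a GUC.
     */
    return (random() % SAMPLE_ONE_IN) == 0;
}

The per-query cost is a single random draw, and the expected logging I/O
shrinks by the sampling factor while the sample stays representative.
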


Cheers,
Gavin


