Thread: suggestion: log_statement = sample

suggestion: log_statement = sample

From
Janning Vygen
Date:
hi,

http://archives.postgresql.org/pgsql-general/2009-03/msg00581.php

This was my suggestion about introducing a statment to get a sample of SQL
statements. Nobody answered yet. Why not? i think my suggestion would help a
lot. Or was it kind of stupid?

kind regards
Janning


Re: suggestion: log_statement = sample

From
Bill Moran
Date:
In response to Janning Vygen <vygen@kicktipp.de>:

> hi,
>
> http://archives.postgresql.org/pgsql-general/2009-03/msg00581.php
>
> This was my suggestion about introducing a statment to get a sample of SQL
> statements. Nobody answered yet. Why not? i think my suggestion would help a
> lot. Or was it kind of stupid?

For my part, I don't think this would be useful.

Since most of your queries are run by software, you're going to see a
fairly predictable pattern to the queries, which means your sampling isn't
going to be anywhere near random, thus it will still be inaccurate and
incomplete.

In my experience, I've found that enabling full logging for a short time
(perhaps a few hours) gathers enough data to run through tools like
pgFouine and find problem areas.  Also, we have development servers that
run automated tests, and since it's not critical that they be performant,
we can run full query logging on them all the time.  Additionally, we make
sure our production systems have enough hardware behind them that we can
add additional tasks without it affecting production use.

All of these are (in my opinion) better approaches to the problem than
yet another arbitrary query filtering technique.  I mean, logging only
the most time-consuming queries is already arbitrary enough (as you
already stated).

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: suggestion: log_statement = sample

From
Janning Vygen
Date:
hi,

thanks for your comments on this.

On Thursday 16 July 2009 15:05:58 you wrote:
> In response to Janning Vygen <vygen@kicktipp.de>:
> > hi,
> >
> > http://archives.postgresql.org/pgsql-general/2009-03/msg00581.php
> >
> > This was my suggestion about introducing a statment to get a sample of
> > SQL statements. Nobody answered yet. Why not? i think my suggestion would
> > help a lot. Or was it kind of stupid?
>
> For my part, I don't think this would be useful.
>
> Since most of your queries are run by software, you're going to see a
> fairly predictable pattern to the queries, which means your sampling isn't
> going to be anywhere near random, thus it will still be inaccurate and
> incomplete.

I dont think so. In my use case i will get a good sampling of queries as I
could keep my log_sample running over long period of time. The sampling is in
any case much better than with log_minduration while logging all statement is
not acceptable in production.

> In my experience, I've found that enabling full logging for a short time
> (perhaps a few hours) gathers enough data to run through tools like
> pgFouine and find problem areas.

It is not possible for us. Logging millions of statements take too much time.

> Also, we have development servers that
> run automated tests, and since it's not critical that they be performant,
> we can run full query logging on them all the time.

But you dont run the real use cases with automated tests. There so many
factors involved in real time: caching, concurrency, data, peaktime,
deadlocks, doubleclicks, robots etc. that you just can't reproduce it on a
development system without lots of effort.

> Additionally, we make
> sure our production systems have enough hardware behind them that we can
> add additional tasks without it affecting production use.

that's nice, but not everybody can afford it. Of course i would love to log
every statement. But do you really log every statement in production? I guess
not.

> All of these are (in my opinion) better approaches to the problem than
> yet another arbitrary query filtering technique.  I mean, logging only
> the most time-consuming queries is already arbitrary enough (as you
> already stated).

With log_min duration i get only most time-consuming queries.
With log sample i can detect if there is a fast query which is called to
often. This is impossible today.

Again: for my use case it makes sense to have a log_sample feature.

kind regards
Janning


Re: suggestion: log_statement = sample

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


>> In my experience, I've found that enabling full logging for a short time
>> (perhaps a few hours) gathers enough data to run through tools like
>> pgFouine and find problem areas.

> It is not possible for us. Logging millions of statements take too much time.

Too much time to do what? Where is the bottleneck?

> But you dont run the real use cases with automated tests. There so many
> factors involved in real time: caching, concurrency, data, peaktime,
> deadlocks, doubleclicks, robots etc. that you just can't reproduce it on a
> development system without lots of effort.

Agreed.

>> Additionally, we make
>> sure our production systems have enough hardware behind them that we can
>> add additional tasks without it affecting production use.

> that's nice, but not everybody can afford it. Of course i would love to log
> every statement. But do you really log every statement in production? I guess
> not.

We have clients that do, and we always recommend it for all of our clients. It
can get very expensive (hint: ship them off with syslog over an internal
network to a dedicated box), but being able to see exactly what happened
in your database when something goes wrong after the fact is invaluable.
Plus we can then run tools like pgsi (http://bucardo.org/pgsi/) to get
nice statistics and find weak spots in the application code.

> Again: for my use case it makes sense to have a log_sample feature.

Perhaps, but I don't think you've quite overcome the 'log everything'
counter-argument. I don't think log_sample is necessarily a bad idea, but I
doubt it is going to happen soon, and certainly not in time to put
on your production system in the next year or so. Perhaps you can attack
this from another angle, such as using some sort of filter or script
to grab every X lines from the logs and discard the rest while in
full logging mode. That would have the advantage of being fully
backwards-compatible and usable today.


- --
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
PGP Key: 0x14964AC8 200907201256
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkpkoh4ACgkQvJuQZxSWSsi5swCfSgxE/3Vs+kCmfqSERL6u84XJ
nNQAoO5oOs/Nwuhe27FQ+THZEUVcdULO
=7JPQ
-----END PGP SIGNATURE-----



Re: suggestion: log_statement = sample

From
Bill Moran
Date:
In response to "Greg Sabino Mullane" <greg@turnstep.com>:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: RIPEMD160
>
>
> >> In my experience, I've found that enabling full logging for a short time
> >> (perhaps a few hours) gathers enough data to run through tools like
> >> pgFouine and find problem areas.
>
> > It is not possible for us. Logging millions of statements take too much time.

This is a ridiculous statement.  In actual practice, full query logging
is 1/50 the amount of disk I/O as the actual database activity.  If your
systems are so stressed that they can't handle another 2% increase, then
you've got bigger problems lurking.

Have you benchmarked the load it creates under your workload?

> > But you dont run the real use cases with automated tests. There so many
> > factors involved in real time: caching, concurrency, data, peaktime,
> > deadlocks, doubleclicks, robots etc. that you just can't reproduce it on a
> > development system without lots of effort.

Logging a sample query every arbitrary number of queries isn't a real use
case either, thus your counter argument makes no sense.

I suggested the test system because it's a good compromise.

> >> Additionally, we make
> >> sure our production systems have enough hardware behind them that we can
> >> add additional tasks without it affecting production use.
>
> > that's nice, but not everybody can afford it. Of course i would love to log
> > every statement. But do you really log every statement in production? I guess
> > not.

Try reading my responses instead of guessing.  We run full query logging on
occasion to get a sample of real usage on our production systems.  As a
result, we know that our servers can handle the load.  We're working on
the infrastructure to manage the amount of data so we can do it all the
time (we don't currently have enough disk space).

Overall, it seems like you've decided that you want this feature and nothing
else will do.  If that's the case, then just go ahead and write it.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: suggestion: log_statement = sample

From
"Joshua D. Drake"
Date:
On Mon, 2009-07-20 at 13:24 -0400, Bill Moran wrote:
> In response to "Greg Sabino Mullane" <greg@turnstep.com>:
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: RIPEMD160
> >
> >
> > >> In my experience, I've found that enabling full logging for a short time
> > >> (perhaps a few hours) gathers enough data to run through tools like
> > >> pgFouine and find problem areas.
> >
> > > It is not possible for us. Logging millions of statements take too much time.
>
> This is a ridiculous statement.

No it isn't.

>   In actual practice, full query logging
> is 1/50 the amount of disk I/O as the actual database activity.  If your
> systems are so stressed that they can't handle another 2% increase, then
> you've got bigger problems lurking.

It depends on the system. I have seen even big systems take a huge hit
by full logging due to transactional velocity.

Joshua D. Drake

--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: suggestion: log_statement = sample

From
Steve Atkins
Date:
On Jul 20, 2009, at 10:24 AM, Bill Moran wrote:

> In response to "Greg Sabino Mullane" <greg@turnstep.com>:
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: RIPEMD160
>>
>>
>>>> In my experience, I've found that enabling full logging for a
>>>> short time
>>>> (perhaps a few hours) gathers enough data to run through tools like
>>>> pgFouine and find problem areas.
>>
>>> It is not possible for us. Logging millions of statements take too
>>> much time.
>
> This is a ridiculous statement.

No, it isn't.

>  In actual practice, full query logging
> is 1/50 the amount of disk I/O as the actual database activity.  If
> your
> systems are so stressed that they can't handle another 2% increase,
> then
> you've got bigger problems lurking.

That's really not true. Ever, probably, but certainly in my situation.

A lot of my inserts are large text fields (10k - 1M). Many of those
are never
read again.

If I want to log all statements or slow statements then a lot of
what's logged is those large inserts. There's currently no way
to log the statement without logging all the data inserted, so
the log traffic produced by each insert is huge.

Logging via syslog on Solaris I've had reports of that slowing
the machine down to unusability. (I'm fairly sure I know why,
and I suspect you can guess too, but this is on customer machines).
The hit due to logging can be huge, even on fairly overpowered
systems.

There are a lot of limitations in the current logging system when
it comes to capturing data for performance analysis, and that's
one of them. There's certainly significant room for improvement
in the logging system - some of that can be added externally,
but some of it really needs to be done within the core code.

Cheers,
   Steve


Re: suggestion: log_statement = sample

From
Bill Moran
Date:
In response to "Joshua D. Drake" <jd@commandprompt.com>:

> On Mon, 2009-07-20 at 13:24 -0400, Bill Moran wrote:
> > In response to "Greg Sabino Mullane" <greg@turnstep.com>:
> > >
> > > -----BEGIN PGP SIGNED MESSAGE-----
> > > Hash: RIPEMD160
> > >
> > >
> > > >> In my experience, I've found that enabling full logging for a short time
> > > >> (perhaps a few hours) gathers enough data to run through tools like
> > > >> pgFouine and find problem areas.
> > >
> > > > It is not possible for us. Logging millions of statements take too much time.
> >
> > This is a ridiculous statement.
>
> No it isn't.
>
> >   In actual practice, full query logging
> > is 1/50 the amount of disk I/O as the actual database activity.  If your
> > systems are so stressed that they can't handle another 2% increase, then
> > you've got bigger problems lurking.
>
> It depends on the system. I have seen even big systems take a huge hit
> by full logging due to transactional velocity.

Perhaps I'm just in a foul mood today, but I feel like people are picking
my emails apart to make me the bad guy.  Note that you trimmed out a key
part of what I said here:

"Have you benchmarked the load it creates under your workload?"

Perhaps I should have explicitly said, "There are likely some cases where
this statement isn't true, so you should benchmark your specific load
case, but you don't mention that you've done that."

Or, perhaps I should just abstain from posting to mailing lists when I'm
in a foul mood ...

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: suggestion: log_statement = sample

From
"Joshua D. Drake"
Date:
On Mon, 2009-07-20 at 13:46 -0400, Bill Moran wrote:
> In response to "Joshua D. Drake" <jd@commandprompt.com>:

> > It depends on the system. I have seen even big systems take a huge hit
> > by full logging due to transactional velocity.
>
> Perhaps I'm just in a foul mood today, but I feel like people are picking

Possibly. :)

> my emails apart to make me the bad guy.  Note that you trimmed out a key
> part of what I said here:

Right. I didn't have issue with the rest of your email. Its just the
idea what the person said was ridiculous.

>
> "Have you benchmarked the load it creates under your workload?"

Which is certainly a valid question.

>
> Perhaps I should have explicitly said, "There are likely some cases where
> this statement isn't true, so you should benchmark your specific load
> case, but you don't mention that you've done that."
>
> Or, perhaps I should just abstain from posting to mailing lists when I'm
> in a foul mood ...

It has certainly gotten me in trouble more than once. There is a reason
they say, "Don't go JD".

Joshua D. Drake




>
--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: suggestion: log_statement = sample

From
Janning Vygen
Date:
On Monday 20 July 2009 18:58:21 Greg Sabino Mullane wrote:
> Perhaps, but I don't think you've quite overcome the 'log everything'
> counter-argument.
#
Not everybody can afford a system with lots of raid arrays or dedicated logging
boxes. Many people log to the same disk. I do it in some projects.

So log_sample gives you quite as much log analysing as full logging but
without the I/O overhead.

Maybe most of the postgresql guys have lots of hardware with RAID 10 SCSI
Disks but reality sometimes differ. we run most of our boxes on IDE RAID 1
without additional disks for logging. In this scenario you just can't log
everything. log_min_duration gives you only a view to long running queries but
that is just on half of the whole picture.

kind regards
Janning


Re: suggestion: log_statement = sample

From
Janning Vygen
Date:
On Monday 20 July 2009 19:24:13 Bill Moran wrote:
> > > It is not possible for us. Logging millions of statements take too much
> > > time.
>
> This is a ridiculous statement.  In actual practice, full query logging
> is 1/50 the amount of disk I/O as the actual database activity.  If your
> systems are so stressed that they can't handle another 2% increase, then
> you've got bigger problems lurking.
>
> Have you benchmarked the load it creates under your workload?

Yes, it takes up to 15% of our workload in an average use case. But we have
peak times where we can not afford 15% lost for logging!

And if we log on the same disk, it slows down reading and writing. So it does
not really matter how much work load it creates. If it slows down my response,
i dont want to log everything. (I know that logging onto the same disk is
bad!)

> Overall, it seems like you've decided that you want this feature and
> nothing else will do.  If that's the case, then just go ahead and write it.

It was really just a suggestion. I know that i can handle my problems in a
different way like doing full logging from time to time, better automatic
tests, more hardware and so on. I just thought, it couldnt be so hard to log
every nth statement instead of all. And i think that there are lot of
usescases out there for a directive like this. I really just wanted to help to
make postgresql better. Even if it would be implemented soon, it wouldn't help
me. As i need to analyze my load now!

kind regards
Janning


Re: suggestion: log_statement = sample

From
Tom Lane
Date:
Janning Vygen <vygen@kicktipp.de> writes:
> On Monday 20 July 2009 19:24:13 Bill Moran wrote:
>> Have you benchmarked the load it creates under your workload?

> Yes, it takes up to 15% of our workload in an average use case. But we have
> peak times where we can not afford 15% lost for logging!

Well, you could turn it off during the peak times.  Or buy better
hardware --- if you have less than 15% headroom then it's past time
to be doing that anyway.

You really haven't given any convincing case for this behavior...

            regards, tom lane

Re: suggestion: log_statement = sample

From
Janning Vygen
Date:
On Tuesday 21 July 2009 15:49:36 Tom Lane wrote:
> Janning Vygen <vygen@kicktipp.de> writes:
> > On Monday 20 July 2009 19:24:13 Bill Moran wrote:
> >> Have you benchmarked the load it creates under your workload?
> >
> > Yes, it takes up to 15% of our workload in an average use case. But we
> > have peak times where we can not afford 15% lost for logging!
>
> Well, you could turn it off during the peak times.

It affords a server restart which is not a good idea. And i will never get any
real time analyze of what happens at peak time.

> Or buy better
> hardware --- if you have less than 15% headroom then it's past time
> to be doing that anyway.

yes, you are right, but software solutions are so much nicer...

> You really haven't given any convincing case for this behavior...

that's fine more me. I still don't understand it why it was not convincing. I
guess you guys run your DB with lots of hardware power, RAIDs and NAS and so
on. But there are many people running it on small boxes with one IDE hard
disk. Switching full logging on with such hardware is like shutting it down as
IO is blocked! I still think it has reasonable uses cases without lots of
effort (just a guess, i havn't looked at the source code yet).

But, hey: I just wanted to help improving postgresql. If you think this is not
an improvment, its fine for me. Thanks to all for discussing it.

kind regards
Janning


Re: suggestion: log_statement = sample

From
Tom Lane
Date:
Janning Vygen <vygen@kicktipp.de> writes:
> On Tuesday 21 July 2009 15:49:36 Tom Lane wrote:
>> Well, you could turn it off during the peak times.

> It affords a server restart which is not a good idea.

Changing logging options does not require a server restart.

            regards, tom lane

Re: suggestion: log_statement = sample

From
Merlin Moncure
Date:
On Mon, Jul 20, 2009 at 1:28 PM, Joshua D. Drake<jd@commandprompt.com> wrote:
> On Mon, 2009-07-20 at 13:24 -0400, Bill Moran wrote:
>>   In actual practice, full query logging
>> is 1/50 the amount of disk I/O as the actual database activity.  If your
>> systems are so stressed that they can't handle another 2% increase, then
>> you've got bigger problems lurking.
>
> It depends on the system. I have seen even big systems take a huge hit
> by full logging due to transactional velocity.
>
> Joshua D. Drake

I've seen this too.  On high transaction load systems, log_statement
has big overhead.  I haven't tested this lately though, and I suspect
this information is version dependent (in the 'old' days, it used to
be a really big deal).

merlin

Re: suggestion: log_statement = sample

From
Janning Vygen
Date:
On Tuesday 21 July 2009 18:09:57 you wrote:
> Janning Vygen <vygen@kicktipp.de> writes:
> > On Tuesday 21 July 2009 15:49:36 Tom Lane wrote:
> >> Well, you could turn it off during the peak times.
> >
> > It affords a server restart which is not a good idea.
>
> Changing logging options does not require a server restart.

great. I didn't know that. thank you!

Janning