Thread: Patch: add timing of buffer I/O requests
Hi, I know I'm late for the ongoing commitfest, but I thought I might as well be early for the next one. Attached is a patch against master that implements timing of shared buffer fills and per-relation stats collection of said timings. Buffer flushes are timed as well, but aren't exposed per table because of the difficulty of correctly attributing them.

Some notes on the implementation:

* The timing is done in bufmgr.c. Under high load some CPU contention will get attributed to I/O because the process doing the I/O won't get a time-slice immediately.
* I decided to also account waiting for others doing the I/O as I/O waits. They aren't double counted for per-relation stats though.
* I added a GUC track_iotiming, defaulting to off, because timing isn't cheap on all platforms.
* I used instr_time to keep the counts, to be consistent with function timings, but maybe both should convert to plain uint64 counts to make the arithmetic code cleaner.
* Timings are exposed via EXPLAIN (BUFFERS), pg_stat_statements and the pg_statio_* views.
* I noticed there aren't any pg_statio_xact_* views. I don't have any need for them myself, but thought I'd mention the inconsistency.
* The feature is really useful for me with auto_explain. Even better with Peter's pg_stat_statements query cleaning applied.

I did some testing on an older AMD Athlon X2 BE-2350 and an Intel i5 M 540 to see the overhead. The AMD CPU doesn't have the necessary features for fast user-mode timing and has an overhead of about 900ns per gettimeofday call. The Intel processor has an overhead of 22ns per call.

I tried a read-only pgbench with scale factor 50 and shared_buffers=32MB to induce a lot of I/O traffic that hits the OS cache. That seems like it should be the worst case for this patch. On the AMD I saw about a 3% performance drop with timing enabled. On the Intel machine I couldn't measure any statistically significant change. The median was actually higher with timing enabled, but stddevs were large enough to hide a couple of percent of performance loss. This needs some further testing.

Preliminary results for the Intel machine with stddev (10 5min runs):

 -c | master          | io-stats
  4 | 16521.53 ±4.49% | +1.16% ±3.21%
 16 | 13923.49 ±5.98% | +0.56% ±7.11%

This is my first patch, so I hope I haven't missed anything too trivial.

--
Ants Aasma
ants.aasma@eesti.ee
Attachment
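To make the pattern described above concrete, here is a standalone sketch of timing an I/O call behind an on/off switch. The names (timed_read, the track_iotiming stand-in) are invented for illustration; the actual patch uses PostgreSQL's instr_time machinery inside bufmgr.c rather than code like this.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <fcntl.h>

static bool track_iotiming = true;   /* stands in for the GUC */
static uint64_t read_time_ns;        /* accumulated read-wait time */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static ssize_t timed_read(int fd, void *buf, size_t len)
{
    uint64_t start = 0;
    ssize_t  rc;

    if (track_iotiming)
        start = now_ns();            /* pay for the clock only when enabled */
    rc = read(fd, buf, len);
    if (track_iotiming)
        read_time_ns += now_ns() - start;
    return rc;
}

int main(void)
{
    char buf[8192];
    int  fd = open("/etc/hostname", O_RDONLY);

    if (fd >= 0)
    {
        timed_read(fd, buf, sizeof(buf));
        close(fd);
    }
    printf("read wait: %llu ns\n", (unsigned long long) read_time_ns);
    return 0;
}

Note how the clock is read twice per I/O when the switch is on; that is why the per-call cost of the clock source dominates the overhead discussion that follows.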
On 11/27/2011 04:39 PM, Ants Aasma wrote:
> On the AMD I saw about 3% performance drop with timing enabled. On the
> Intel machine I couldn't measure any statistically significant change.

Oh no, it's party pooper time again. Sorry I have to be the one to do it this round. The real problem with this whole area is that we know there are systems floating around where the amount of time taken to grab timestamps like this is just terrible. I've been annoyed enough by that problem to spend some time digging into why that is--it seems to be a bunch of trivia around the multiple ways to collect time info on x86 systems--and after this CommitFest is over I was already hoping to dig through my notes and start quantifying that more. So you can't really prove the overhead of this approach is acceptable just by showing two examples; we need to find one of the really terrible clocks and test there to get a real feel for the worst case.

I recall a patch similar to this one was submitted by Greg Stark some time ago. It used the info for different reasons--to try and figure out whether reads were cached or not--but I believe it withered rather than being implemented mainly because it ran into the same fundamental roadblocks here. My memory could be wrong here; there were also concerns about what the data would be used for.

I've been thinking about a few ways to try and cope with this whole class of timing problem:

-Document the underlying problem and known workarounds, provide a way to test how bad the overhead is, and just throw our hands up and say "sorry, you just can't instrument like this" if someone has a slow system.

-Have one of the PostgreSQL background processes keep track of a time estimate on its own, only periodically pausing to sync against the real time. Then most calls to gettimeofday() can use that value instead. I was thinking of that idea for slightly longer running things though; I doubt it can be made accurate enough to instrument buffer I/O.

And while I hate to kick off massive bike-shedding in your direction, I'm also afraid this area--collecting stats about how long individual operations take--will need a much wider-ranging approach than just looking at the buffer cache ones. If you step back and ask "what do people expect here?", there's a pretty large number who really want something like Oracle's v$session_wait and v$system_event interface for finding the underlying source of slow things. There's enough demand for that that EnterpriseDB has even done some work in this area too; what I've been told about it suggests the code isn't a great fit for contribution to community PostgreSQL though. Like I said, this area is really messy and hard to get right.

Something more ambitious like the v$ stuff would also take care of what you're doing here; I'm not sure that what you've done helps build it though. Please don't take that personally. Part of one of my own instrumentation patches recently was rejected out of hand for the same reason, just not being general enough.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
On 28 Listopad 2011, 8:54, Greg Smith wrote:
> -Have one of the PostgreSQL background processes keep track of a time
> estimate on its own, only periodically pausing to sync against the real
> time. Then most calls to gettimeofday() can use that value instead. I
> was thinking of that idea for slightly longer running things though; I
> doubt it can be made accurate enough to instrument buffer I/O.

What about random sampling, i.e. "measure just 5% of the events" or something like that? Sure, it's not exact, but it significantly reduces the overhead. And it might be a config parameter, so the user could decide how precise the results need to be, and even take into account how fast the clocks are.

> Something more ambitious like the v$ stuff would also take care of what
> you're doing here; I'm not sure that what you've done helps build it
> though. Please don't take that personally. Part of one of my own
> instrumentation patches recently was rejected out of hand for the same
> reason, just not being general enough.

Yes, that'd be a significant improvement. The wait-event stuff is very useful and changes the tuning significantly.

Tomas
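A standalone sketch of what that sampling idea could look like (illustrative only; the 1-in-20 rate, the names, and the scaling step are assumptions, not from any posted patch): time only a fraction of the I/O events, then scale the sampled mean up to all events.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define SAMPLE_ONE_IN 20             /* time ~5% of events */

static uint64_t sampled_ns;          /* raw timed nanoseconds */
static uint64_t sampled_events;      /* events actually timed */
static uint64_t total_events;        /* all events, timed or not */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void do_io_event(void)
{
    int timed = (rand() % SAMPLE_ONE_IN) == 0;
    uint64_t start = timed ? now_ns() : 0;

    /* ... the actual I/O would happen here ... */

    if (timed)
    {
        sampled_ns += now_ns() - start;
        sampled_events++;
    }
    total_events++;
}

int main(void)
{
    for (int i = 0; i < 100000; i++)
        do_io_event();

    /* estimate total time by scaling the sampled mean to all events */
    if (sampled_events > 0)
        printf("estimated total: %llu ns over %llu events\n",
               (unsigned long long) (sampled_ns / sampled_events * total_events),
               (unsigned long long) total_events);
    return 0;
}

The clock cost drops by roughly the sampling factor, at the price of statistical noise on rarely-hit relations.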
On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> The real problem with this whole area is that we know there are
> systems floating around where the amount of time taken to grab timestamps
> like this is just terrible.

Assuming the feature is off by default (and I can't imagine we'd consider anything else), I don't see why that should be cause for concern. If the instrumentation creates too much system load, then don't use it: simple as that. A more interesting question is "how much load does this feature create even when it's turned off?".

The other big problem for a patch of this sort is that it would bloat the stats file. I think we really need to come up with a more scalable alternative to the current system, but I haven't looked at the current system in enough detail to have a clear feeling about what such an alternative would look like.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Nov 28, 2011 8:55 AM, "Greg Smith" <greg@2ndquadrant.com> wrote:
> On 11/27/2011 04:39 PM, Ants Aasma wrote:
>> On the AMD I saw about 3% performance drop with timing enabled. On the
>> Intel machine I couldn't measure any statistically significant change.
>
> Oh no, it's party pooper time again. Sorry I have to be the one to do it
> this round. The real problem with this whole area is that we know there
> are systems floating around where the amount of time taken to grab
> timestamps like this is just terrible.

I believe on most systems with modern Linux kernels gettimeofday and its ilk will be a vsyscall and nearly as fast as a regular function call.

> I recall a patch similar to this one was submitted by Greg Stark some
> time ago. It used the info for different reasons--to try and figure out
> whether reads were cached or not--but I believe it withered rather than
> being implemented mainly because it ran into the same fundamental
> roadblocks here. My memory could be wrong here, there were also concerns
> about what the data would be used for.

I speculated about doing that but never did. I had an experimental patch using mincore to do what you describe, but it wasn't intended for production code, I think. The only real patch was to use getrusage, which I still intend to commit, but it doesn't tell you the time spent in I/O -- though it does tell you the sys time, which should be similar.
On 28 Listopad 2011, 15:40, Greg Stark wrote:
> On Nov 28, 2011 8:55 AM, "Greg Smith" <greg@2ndquadrant.com> wrote:
>> On 11/27/2011 04:39 PM, Ants Aasma wrote:
>>> On the AMD I saw about 3% performance drop with timing enabled. On the
>>> Intel machine I couldn't measure any statistically significant change.
>>
>> Oh no, it's party pooper time again. Sorry I have to be the one to do
>> it this round. The real problem with this whole area is that we know
>> there are systems floating around where the amount of time taken to
>> grab timestamps like this is just terrible.
>
> I believe on most systems with modern Linux kernels gettimeofday and its
> ilk will be a vsyscall and nearly as fast as a regular function call.

AFAIK a vsyscall should be faster than a regular syscall. It does not need to switch to kernel space at all, it "just" reads the data from a shared page. The problem is that this is Linux-specific - for example FreeBSD does not have vsyscall at all (it's actually one of the Linux-isms mentioned here: http://wiki.freebsd.org/AvoidingLinuxisms). There's also something called the vDSO, which (among other things) uses vsyscall if available, or the best implementation available. So there are platforms that do not provide vsyscall, and in that case it'd be just as slow as a regular syscall :(

I wouldn't expect a patch that works fine on Linux but not on other platforms to be accepted, unless there's a compile-time configure switch (--with-timings) that allows disabling it.

Another option would be to reimplement the vsyscall, even on platforms that don't provide it. The principle is actually quite simple - allocate shared memory, store the current time there, and update it whenever a clock interrupt happens. This is basically what Greg suggested in one of the previous posts, where "regularly" means "on every interrupt". Greg was worried about the precision, but this should be just fine I guess. It's the precision you get on Linux, anyway ...

>> I recall a patch similar to this one was submitted by Greg Stark some
>> time ago. It used the info for different reasons--to try and figure out
>> whether reads were cached or not--but I believe it withered rather than
>> being implemented mainly because it ran into the same fundamental
>> roadblocks here. My memory could be wrong here, there were also concerns
>> about what the data would be used for.

The difficulty of distinguishing whether reads were cached or not is the price we pay for using the filesystem cache instead of managing our own. Not sure if this can be solved just by measuring the latency - with spinners it's quite easy, the differences are rather huge (and it's not difficult to derive that even from the pgbench log). But with SSDs, multiple tablespaces on different storage, etc. it gets much harder.

Tomas
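A minimal sketch of that do-it-yourself vsyscall idea (illustrative only; a real implementation would have a background process update the shared value on a timer, not a forked toy loop): one process periodically stores the current time into shared memory, and everyone else reads it with a plain load instead of making a clock call.

#include <stdio.h>
#include <stdint.h>
#include <stdatomic.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

static _Atomic uint64_t *shared_now_ns;

static uint64_t real_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    shared_now_ns = mmap(NULL, sizeof(*shared_now_ns),
                         PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared_now_ns == MAP_FAILED)
        return 1;
    atomic_store(shared_now_ns, real_now_ns());

    if (fork() == 0)
    {
        /* "ticker": refresh the shared clock every millisecond */
        for (int i = 0; i < 100; i++)
        {
            atomic_store(shared_now_ns, real_now_ns());
            usleep(1000);
        }
        _exit(0);
    }

    /* reader: a cheap memory load, accurate to roughly one tick */
    for (int i = 0; i < 5; i++)
    {
        printf("approx now: %llu ns\n",
               (unsigned long long) atomic_load(shared_now_ns));
        usleep(20000);
    }
    wait(NULL);
    return 0;
}

The trade-off is exactly the one under discussion: readers pay a memory load instead of a clock call, but the precision is bounded by the tick interval.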
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>> The real problem with this whole area is that we know there are
>> systems floating around where the amount of time taken to grab timestamps
>> like this is just terrible.

> Assuming the feature is off by default (and I can't imagine we'd
> consider anything else), I don't see why that should be cause for
> concern. If the instrumentation creates too much system load, then
> don't use it: simple as that. A more interesting question is "how
> much load does this feature create even when it's turned off?".

Right. I see that the code already has a switch to skip the gettimeofday calls, so the objection is only problematic if the added overhead is significant even with the switch off. I would worry mainly about the added time/space to deal with the extra stats counters.

> The other big problem for a patch of this sort is that it would bloat
> the stats file.

Yes. Which begs the question of why we need to measure this per-table. I would think per-tablespace would be sufficient.

			regards, tom lane
On Sun, Nov 27, 2011 at 11:54:38PM -0800, Greg Smith wrote:
> On 11/27/2011 04:39 PM, Ants Aasma wrote:
>> On the AMD I saw about 3% performance drop with timing enabled. On the
>> Intel machine I couldn't measure any statistically significant change.
>
> Oh no, it's party pooper time again. Sorry I have to be the one to
> do it this round. The real problem with this whole area is that we
> know there are systems floating around where the amount of time
> taken to grab timestamps like this is just terrible. I've been
> annoyed enough by that problem to spend some time digging into why
> that is--seems to be a bunch of trivia around the multiple ways to
> collect time info on x86 systems--and after this CommitFest is over

Something good to know: in Linux the file

  /sys/devices/system/clocksource/clocksource0/current_clocksource

lists the current clock source, and

  /sys/devices/system/clocksource/clocksource0/available_clocksource

lists the available clock sources. You can switch between them, so you may be able to quantify the effects on a single machine.

Learned the hard way while tracking clock skew on a multicore system. The hpet may not be the fastest (that would be the CPU timer), but it's the fastest (IME) that gives guaranteed monotonic time.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
  -- Arthur Schopenhauer
"Tomas Vondra" <tv@fuzzy.cz> writes: > Another option would be to reimplement the vsyscall, even on platforms > that don't provide it. The principle is actually quite simple - allocate a > shared memory, store there a current time and update it whenever a clock > interrupt happens. This is basically what Greg suggested in one of the > previous posts, where "regularly" means "on every interrupt". Greg was > worried about the precision, but this should be just fine I guess. It's > the precision you get on Linux, anyway ... That sounds good for other interesting things, which entails being able to have timing information attached to the XID sequence. If we go this way, how far are we from having a ticker in PostgreSQL? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Nov 28, 2011, at 9:29 AM, Tomas Vondra wrote:
>>> I recall a patch similar to this one was submitted by Greg Stark some
>>> time ago. It used the info for different reasons--to try and figure out
>>> whether reads were cached or not--but I believe it withered rather than
>>> being implemented mainly because it ran into the same fundamental
>>> roadblocks here. My memory could be wrong here, there were also concerns
>>> about what the data would be used for.
>
> The difficulty of distinguishing whether reads were cached or not is
> the price we pay for using the filesystem cache instead of managing our
> own. Not sure if this can be solved just by measuring the latency - with
> spinners it's quite easy, the differences are rather huge (and it's not
> difficult to derive that even from the pgbench log). But with SSDs,
> multiple tablespaces on different storage, etc. it gets much harder.

True, but every use case for this information I can think of ultimately only cares about how long it took to perform some kind of I/O; it doesn't *really* care about whether it was cached. So in that context, we don't really care if SSDs are fast enough that they look like cache, because that means they're performing (essentially) the same as cache.

--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
On 28.11.2011 22:32, Dimitri Fontaine wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> Another option would be to reimplement the vsyscall, even on platforms
>> that don't provide it. The principle is actually quite simple - allocate
>> shared memory, store the current time there, and update it whenever a
>> clock interrupt happens. This is basically what Greg suggested in one of
>> the previous posts, where "regularly" means "on every interrupt". Greg
>> was worried about the precision, but this should be just fine I guess.
>> It's the precision you get on Linux, anyway ...
>
> That sounds good for other interesting things too, such as being able
> to have timing information attached to the XID sequence. If we go this
> way, how far are we from having a ticker in PostgreSQL?

I'm not sure. On Linux/x86 this is already done, but my knowledge of kernel development is rather limited, especially when it comes to other OSes and platforms. E.g. I'm not sure why it's not available in FreeBSD on x86; I guess it's rather "we don't want it" than "it's not possible."

In the Linux sources, the most interesting pieces are probably these:

1) arch/x86/include/asm/vgtod.h - that's the shared memory structure
2) arch/x86/kernel/vsyscall_64.c - this is how the memory is read (do_vgettimeofday)
3) arch/x86/kernel/vsyscall_64.c - this is how the memory is updated (update_vsyscall)
4) kernel/time/timekeeping.c - do_settimeofday (calls update_vsyscall)
5) drivers/rtc/class.c (and other) RTC drivers call do_settimeofday

Tomas
On 29.11.2011 02:14, Jim Nasby wrote:
> On Nov 28, 2011, at 9:29 AM, Tomas Vondra wrote:
>>>> I recall a patch similar to this one was submitted by Greg Stark some
>>>> time ago. It used the info for different reasons--to try and figure
>>>> out whether reads were cached or not--but I believe it withered
>>>> rather than being implemented mainly because it ran into the same
>>>> fundamental roadblocks here. My memory could be wrong here, there
>>>> were also concerns about what the data would be used for.
>>
>> The difficulty of distinguishing whether reads were cached or not is
>> the price we pay for using the filesystem cache instead of managing
>> our own. Not sure if this can be solved just by measuring the latency -
>> with spinners it's quite easy, the differences are rather huge (and
>> it's not difficult to derive that even from the pgbench log). But with
>> SSDs, multiple tablespaces on different storage, etc. it gets much
>> harder.
>
> True, but every use case for this information I can think of
> ultimately only cares about how long it took to perform some kind of
> I/O; it doesn't *really* care about whether it was cached. So in that
> context, we don't really care if SSDs are fast enough that they look
> like cache, because that means they're performing (essentially) the
> same as cache.

Yup, that's right. The wait times are generally much more interesting than the cached/not-cached ratio.

Tomas
Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:
> That sounds good for other interesting things too, such as being able
> to have timing information attached to the XID sequence. If we go this
> way, how far are we from having a ticker in PostgreSQL?

Those of us who are trying to get rid of idle-state process wakeups will protest any such thing with vigor.

			regards, tom lane
Sorry for taking so long to respond, had a pretty busy day at work. Anyway..

On Mon, Nov 28, 2011 at 9:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Oh no, it's party pooper time again. Sorry I have to be the one to do it
> this round. The real problem with this whole area is that we know there
> are systems floating around where the amount of time taken to grab
> timestamps like this is just terrible. I've been annoyed enough by that
> problem to spend some time digging into why that is--seems to be a bunch
> of trivia around the multiple ways to collect time info on x86
> systems--and after this CommitFest is over I was already hoping to dig
> through my notes and start quantifying that more. So you can't really
> prove the overhead of this approach is acceptable just by showing two
> examples; we need to find one of the really terrible clocks and test
> there to get a real feel for the worst case.

Sure, I know that the timing calls might be awfully slow. That's why I turned it off by default. I saw that track_functions was already using this, so I figured it was OK to have it potentially run very slowly.

> -Document the underlying problem and known workarounds, provide a way to
> test how bad the overhead is, and just throw our hands up and say
> "sorry, you just can't instrument like this" if someone has a slow
> system.

Some documentation about potential problems would definitely be good. Same goes for a test tool. ISTM that fast, accurate timing is just not possible on all supported platforms. That doesn't seem like a good enough justification to refuse implementing something useful for the majority of platforms that do support it, as long as it doesn't cause regressions for those that don't, or significant code complexity.

> -Have one of the PostgreSQL background processes keep track of a time
> estimate on its own, only periodically pausing to sync against the real
> time. Then most calls to gettimeofday() can use that value instead. I
> was thinking of that idea for slightly longer running things though; I
> doubt it can be made accurate enough to instrument buffer I/O.

This would limit it to those cases where hundreds of milliseconds of jitter or more don't matter all that much.

> And while I hate to kick off massive bike-shedding in your direction,
> I'm also afraid this area--collecting stats about how long individual
> operations take--will need a much wider-ranging approach than just
> looking at the buffer cache ones. If you step back and ask "what do
> people expect here?", there's a pretty large number who really want
> something like Oracle's v$session_wait and v$system_event interface for
> finding the underlying source of slow things. There's enough demand for
> that that EnterpriseDB has even done some work in this area too; what
> I've been told about it suggests the code isn't a great fit for
> contribution to community PostgreSQL though. Like I said, this area is
> really messy and hard to get right.

Yeah, something like that should probably be something to strive for. I'll ponder a bit more about resource and latency tracking in general. Maybe the question here should be about the cost/benefit ratio of having some utility now vs. maintaining/deprecating the user-visible interface when a more general framework turns up.

> Something more ambitious like the v$ stuff would also take care of what
> you're doing here; I'm not sure that what you've done helps build it
> though. Please don't take that personally.
> Part of one of my own instrumentation patches recently was rejected out
> of hand for the same reason, just not being general enough.

No problem, I understand that half-way solutions can be more trouble than they're worth. I actually built this to help with performance testing an application, and thought it would be an interesting experience to try to give the community back something.

On Mon, Nov 28, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:
> I believe on most systems with modern Linux kernels gettimeofday and its
> ilk will be a vsyscall and nearly as fast as a regular function call.

clock_gettime() has been implemented via the vDSO since 2.6.23. gettimeofday() has been user-context callable since before git shows any history (2.6.12).

On Mon, Nov 28, 2011 at 5:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The other big problem for a patch of this sort is that it would bloat
>> the stats file.
>
> Yes. Which begs the question of why we need to measure this per-table.
> I would think per-tablespace would be sufficient.

Yeah, I figured this is something that should be discussed. I implemented per-table collection because I thought it might be useful for tools to pick it up and show a quick overview of which tables are causing the most I/O overhead for queries.

On Mon, Nov 28, 2011 at 8:10 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Something good to know: in Linux the file
> /sys/devices/system/clocksource/clocksource0/current_clocksource
> lists the current clock source, and
> /sys/devices/system/clocksource/clocksource0/available_clocksource
> lists the available clock sources. You can switch between them, so you
> may be able to quantify the effects on a single machine.
>
> Learned the hard way while tracking clock skew on a multicore system.
> The hpet may not be the fastest (that would be the CPU timer), but it's
> the fastest (IME) that gives guaranteed monotonic time.

The Linux kernel seems to go pretty far out of its way to ensure that the TSC (CPU timestamp counter) based clock source returns monotonic values, including actually testing that it does. [1] If the hardware doesn't support stable and consistent TSC values, the TSC isn't used as a clock source. Of course, trying to keep it monotonic doesn't mean succeeding. I thought about inserting a sanity check, but as the current instrumentation doesn't use one, and it would catch errors only in one direction, biasing the long-term average, I decided against it. Because this is non-essential instrumentation, I don't see an issue with it returning bogus information when the system clock is broken. At least it seems that no one has complained about the same issue in track_functions. The only complaint I found is that it's off by default.

On Mon, Nov 28, 2011 at 5:29 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Another option would be to reimplement the vsyscall, even on platforms
> that don't provide it. The principle is actually quite simple - allocate
> shared memory, store the current time there, and update it whenever a
> clock interrupt happens. This is basically what Greg suggested in one of
> the previous posts, where "regularly" means "on every interrupt". Greg
> was worried about the precision, but this should be just fine I guess.
> It's the precision you get on Linux, anyway ...

On modern platforms you actually really do get microsecond precision. Even more: if you use clock_gettime(CLOCK_MONOTONIC), you get nanosecond precision and avoid issues with someone changing the system time while you're timing.
This precision does require OS and hardware cooperation, because of CPU offsets, TSCs changing frequencies, stopping, etc.

--
Ants Aasma

[1] https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc_sync.c#L143
On 11/28/2011 05:51 AM, Robert Haas wrote:
> On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>> The real problem with this whole area is that we know there are
>> systems floating around where the amount of time taken to grab
>> timestamps like this is just terrible.
> Assuming the feature is off by default (and I can't imagine we'd
> consider anything else), I don't see why that should be cause for
> concern. If the instrumentation creates too much system load, then
> don't use it: simple as that.

It's not quite that simple though. Releasing a performance measurement feature that can itself perform terribly under undocumented conditions has a wider downside than that. Consider that people aren't going to turn it on until they are already overloaded. If that has the potential to completely tank performance, we'd better make sure that area is at least explored usefully first; the minimum diligence would be to document that fact and make suggestions for avoiding or testing it.

Instrumentation that can itself become a performance problem is an advocacy problem waiting to happen. As I write this I'm picturing such an encounter resulting in an angry blog post about how this proves PostgreSQL isn't usable for serious systems, because someone sees massive overhead after turning this on.

Right now the primary exposure to this class of issue is EXPLAIN ANALYZE. When I was working on my book, I went out of my way to find a worst case for that[1], and that turned out to be a query that went from 7.994ms to 69.837ms when instrumented. I've been meaning to investigate what was going on there since finding it. The fact that we already have one such problem exposed already worries me; I'd really prefer not to have two.

[1] (Dell Store 2 schema, query was "SELECT count(*) FROM customers;")
Greg Smith <greg@2ndQuadrant.com> writes:
> On 11/28/2011 05:51 AM, Robert Haas wrote:
>> Assuming the feature is off by default (and I can't imagine we'd
>> consider anything else), I don't see why that should be cause for
>> concern. If the instrumentation creates too much system load, then
>> don't use it: simple as that.

> It's not quite that simple though. Releasing a performance measurement
> feature that can itself perform terribly under undocumented conditions
> has a wider downside than that.

Yeah, that's a good point, and the machines on which this would suck are exactly the ones where EXPLAIN ANALYZE creates very large overhead. We don't seem to see a lot of complaints about that anymore, but we do still see some ... and yes, it's documented that EXPLAIN ANALYZE can add significant overhead, but that doesn't stop the questions.

> Instrumentation that can itself become a performance problem is an
> advocacy problem waiting to happen. As I write this I'm picturing such
> an encounter resulting in an angry blog post about how this proves
> PostgreSQL isn't usable for serious systems, because someone sees
> massive overhead after turning this on.

Of course, the rejoinder could be that if you see that, you're not testing on serious hardware. But still, I take your point.

> Right now the primary exposure to this class of issue is EXPLAIN
> ANALYZE. When I was working on my book, I went out of my way to find a
> worst case for that[1],
> [1] (Dell Store 2 schema, query was "SELECT count(*) FROM customers;")

That's pretty meaningless without saying what sort of clock hardware was on the machine...

			regards, tom lane
Here's the second version of the I/O timings patch. Changes from the previous version:

* Rebased against master.
* Added the missing pg_stat_statements upgrade script that I accidentally left out of the previous version.
* Added a tool to test timing overhead under contrib/pg_test_timing.

I hope that having a tool to measure the overhead and check the sanity of clock sources is enough to answer the worries about the potential performance hit. We could also check that the clock source is fast enough on startup or when the GUC is changed, but that seems a bit too much and leaves open the question of what is fast enough.

About the issues with stats file bloat - if it really is a blocker, I can easily rip out the per-table or even per-database stats fields. The patch is plenty useful without them. Still, it seemed like a useful tool for overworked DBAs with a limited amount of SSD space available to easily figure out which tables and indexes would benefit most from fast storage.

--
Ants Aasma
Attachment
On 01/15/2012 05:14 PM, Ants Aasma wrote:
> I hope that having a tool to measure the overhead and check the sanity
> of clock sources is enough to answer the worries about the potential
> performance hit. We could also check that the clock source is fast
> enough on startup or when the GUC is changed, but that seems a bit too
> much and leaves open the question of what is fast enough.

I've been thinking along those same lines--check at startup, provide some guidance on the general range of what's considered fast vs. slow in both the code and documentation. What I'm hoping to do here is split your patch in half and work on the pg_test_timing contrib utility first. That part answers some overdue questions about when EXPLAIN ANALYZE can be expected to add a lot of overhead, which means it's useful all on its own. I'd like to see that utility go into 9.2, along with a new documentation section covering that topic. I'll write the new documentation bit.

As far as the buffer timing goes, there is a lot of low-level timing information I'd like to see the database collect. To pick a second example with very similar mechanics, I'd like to know which queries spend a lot of their time waiting on locks. The subset of time a statement spends waiting just for commit-related things is a third. The industry-standard term for these is wait events, as seen in Oracle, MySQL, MS SQL Server, etc. That's so standard I don't see an intellectual property issue with PostgreSQL using the same term. Talk with a random person who is converting from Oracle to PostgreSQL and ask them about their performance concerns: at least 3/4 of those conversations I have mention being nervous about not having wait event data.

Right now, I feel the biggest hurdle to performance tuning PostgreSQL is not having good enough built-in query log analysis tools. If the pg_stat_statements normalization upgrade in the CF queue is committed, that's enough to make me bump that to "solved well enough". After clearing that hurdle, figuring out how to log, analyze, and manage storage of wait events is the next biggest missing piece. One of my top goals for 9.3 was to make sure that happened.

I don't think the long-term answer for how to manage wait event data is to collect it as part of pg_stat_statements though. But I don't have a good alternate proposal, while you've submitted a patch that actually does something useful right now. I'm going to think some more about how to reconcile all that. There is an intermediate point to consider as well, which is just committing something that adjusts the core code to make the buffer wait event data available. pg_stat_statements is easy enough to continue work on outside of core. I could see a path where that happens in parallel with adding a better core wait event infrastructure, just to get the initial buffer wait info into people's hands earlier.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Attached is the pg_test_timing utility portion of this submission, broken out into its own patch. It's a contrib module modeled on pg_test_fsync. The documentation is still a bit rough; I'm not done with that yet. I have included an example of good timing results, switching to a bad clock source, and the resulting bad results.

Code review found some formatting things to nitpick that I've already fixed: non-standard brace locations and not including enough spaces in expressions were the main two.

This is now referenced by the existing cryptic documentation comment around EXPLAIN ANALYZE, which says that overhead can be high because gettimeofday is slow on some systems. Since this utility measures that directly, I think it's a clear win to include it just for that purpose. The fact that there are more places coming where timing overhead matters is also true. But this existing one is already bad enough to justify shipping something to help measure/manage it, in my mind.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Attachment
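For a feel of what such a measurement involves, here is a rough standalone sketch of a timing-overhead loop in the spirit of the utility (illustrative assumptions only; the actual contrib/pg_test_timing code differs in its reporting and histogram details): call the clock in a tight loop, then report the average cost per call and the worst gap seen.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define LOOPS 1000000L

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    uint64_t prev = now_ns();
    uint64_t total = 0, max_delta = 0, backwards = 0;

    for (long i = 0; i < LOOPS; i++)
    {
        uint64_t cur = now_ns();

        if (cur < prev)
            backwards++;             /* should never happen with MONOTONIC */
        else
        {
            uint64_t d = cur - prev;
            total += d;
            if (d > max_delta)
                max_delta = d;
        }
        prev = cur;
    }
    printf("avg per call: %.1f ns, max delta: %llu ns, backwards: %llu\n",
           (double) total / LOOPS, (unsigned long long) max_delta,
           (unsigned long long) backwards);
    return 0;
}

On a machine with a fast clock source the average lands in the tens of nanoseconds; on the slow systems discussed earlier it can be an order of magnitude or more worse, which is exactly what the utility is meant to expose.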
Attached are updated versions of this feature without the pg_test_timing tool part, since I broke that out into another discussion thread. I've split the part that updates pg_stat_statements out from the main feature too; a separate patch is attached here (but I'm not reviewing that yet). Lots of bitrot since this was submitted, and yes, I noticed that I've almost recreated earlier versions of this patch--by splitting off the parts that were developed later.

Earlier discussion of this got sidetracked on a few things, partly my fault. It's worth taking a look at what this provides before judging it too much. It can demo well. The stated purpose is helping figure out what relations are gobbling up the most access time, presumably to optimize them and/or the storage they are on. "What do I put onto SSD" is surely a popular request nowadays.

To check suitability for that, I decided to run the standard pgbench test and see what it would show lots of time being consumed by. Any answer other than "pgbench_accounts, and to a lesser extent its index" is a failing grade. I started with a clean database and OS cache so I'd get real read timings:

$ psql -d pgbench -x -c "select relname, heap_blks_read, heap_blks_hit, heap_blks_time, idx_blks_read, idx_blks_hit, idx_blks_time from pg_statio_user_tables where idx_blks_read > 0 order by heap_blks_time desc"

relname        | pgbench_accounts
heap_blks_read | 7879
heap_blks_hit  | 43837
heap_blks_time | 151770
idx_blks_read  | 7503
idx_blks_hit   | 60484
idx_blks_time  | 70968

relname        | pgbench_tellers
heap_blks_read | 19
heap_blks_hit  | 15856
heap_blks_time | 105
idx_blks_read  | 11
idx_blks_hit   | 15745
idx_blks_time  | 62

relname        | pgbench_branches
heap_blks_read | 11
heap_blks_hit  | 32333
heap_blks_time | 77
idx_blks_read  | 2
idx_blks_hit   | 0
idx_blks_time  | 9

Now, the first critical question to ask is "what additional information is this providing above the existing counters?" After all, it's possible to tell pgbench_accounts is the hotspot just from comparing heap_blks_read, right? To really be useful, this would need to make it obvious that reads from pgbench_accounts are slower than the other two, because it's bigger and requires more seeking around to populate. That should show up if we compute time-per-read numbers:

$ psql -d pgbench -x -c "select relname, 1.0 * heap_blks_time / heap_blks_read as time_per_read, 1.0 * idx_blks_time / idx_blks_read as time_per_idx_read from pg_statio_user_tables where idx_blks_read > 0 order by heap_blks_time desc"

relname           | pgbench_accounts
time_per_read     | 19.2625967762406397
time_per_idx_read | 9.4586165533786485

relname           | pgbench_tellers
time_per_read     | 5.5263157894736842
time_per_idx_read | 5.6363636363636364

relname           | pgbench_branches
time_per_read     | 7.0000000000000000
time_per_idx_read | 4.5000000000000000

This run looks useful at providing the data wished for--that read times are slower per capita from the accounts table. The first time I tried this I got a bizarrely high number for pgbench_branches.heap_blks_time; I'm not sure how reliable this is yet. One problem that might be easy to fix is that the write timing info doesn't show up in any of these system views, only in EXPLAIN and the statement-level ones.

I still think a full wait timing interface is the right long-term direction here. It's hard to reject this idea when it seems to be working right now though, while more comprehensive wait storage is still at least a release off. Opinions welcome; I'm still juggling this around now that I have it working again.
Some implementation notes. This currently fails regression test create_function_3; I haven't looked into why yet. I've confirmed that on a system where timing is cheap, this is cheap too. On something touching real data, not just a synthetic thing moving memory around, Ants couldn't measure it on a server similar to mine; I can't either. Yes, this is going to gobble up more room for statistics.

The track_iotiming GUC seems to work as expected. Off by default, you can turn it on in a session and see that session's work get timed, and it toggles on a config reload. Everything needed to only turn it on selectively; the main penalty you pay all the time is the stats bloat.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Attachment
On Wed, Feb 22, 2012 at 4:43 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Attached are updated versions of this feature without the pg_test_timing
> tool part, since I broke that out into another discussion thread. I've
> split the part that updates pg_stat_statements out from the main feature
> too; a separate patch is attached here (but I'm not reviewing that yet).
> Lots of bitrot since this was submitted, and yes, I noticed that I've
> almost recreated earlier versions of this patch--by splitting off the
> parts that were developed later.

Thanks for the review and splitting. Sorry I didn't fix up the bitrot myself.

> Earlier discussion of this got sidetracked on a few things, partly my
> fault. It's worth taking a look at what this provides before judging it
> too much. It can demo well.
>
> The stated purpose is helping figure out what relations are gobbling up
> the most access time, presumably to optimize them and/or the storage
> they are on. "What do I put onto SSD" is surely a popular request
> nowadays.

I should have stated the purpose more clearly. The original reason for developing this patch was to figure out "what queries are taking the most time and why", specifically in the case where OS memory is a lot larger than shared_buffers. Basically the following query, to get a quick overview of where the bottlenecks are:

SELECT query, total_time,
       (time_read + time_write) / total_time AS io_fraction
  FROM pg_stat_statements
 ORDER BY total_time DESC LIMIT 20;

This of course hugely benefits from Peter's pg_stat_statements normalization patch. Tracking timings per relation was actually an afterthought.

> Now, the first critical question to ask is "what additional information
> is this providing above the existing counters?" After all, it's possible
> to tell pgbench_accounts is the hotspot just from comparing
> heap_blks_read, right?

Like I said above, I find it mostly useful to see what is missing the OS cache. With memory being as cheap as it is, a reasonably priced server can have 128GB of memory, while the max recommended value for shared_buffers is 8GB. It's quite likely to have tables that fit into the OS cache but not into shared_buffers, and it's not trivial to figure out which those are.

> This run looks useful at providing the data wished for--that read times
> are slower per capita from the accounts table. The first time I tried
> this I got a bizarrely high number for pgbench_branches.heap_blks_time;
> I'm not sure how reliable this is yet. One problem that might be easy to
> fix is that the write timing info doesn't show up in any of these system
> views, only in EXPLAIN and the statement-level ones.

I'm not sure about the source of the huge number; it might be instability in the clock source. Have you tried running the monotonicity check for a longer period while the system is under load? Another issue with the current timing code is that gettimeofday isn't guaranteed to be monotonic anyway; things like NTP adjustments can make time go backwards. clock_gettime with CLOCK_MONOTONIC_RAW would be better, but that's Linux-specific :(

The reason why I didn't add write timings to relation stats is that I couldn't figure out what the semantics should be. It could be either "time spent waiting for this relation's blocks to be written out" or "time spent waiting for some other relation's blocks to be written out to free space for this relation's block", or maybe distribute the cost; background writes could be included or excluded.
Writes usually return quickly, unless lots of possibly unrelated writes have dirtied enough of the OS cache, etc. I figured that whatever choices I made, they wouldn't really help anyone diagnose anything. Having global write timings in pg_stat_bgwriter might be useful, but I feel that is something for another patch.

> I still think a full wait timing interface is the right long-term
> direction here. It's hard to reject this idea when it seems to be
> working right now though, while more comprehensive wait storage is still
> at least a release off. Opinions welcome; I'm still juggling this around
> now that I have it working again.

I agree that a wait timing interface is the right direction. I have thought a bit about it and could share some ideas - maybe I should create a wiki page where the general design could be hashed out? Anyway, the user-visible information from this patch should be trivial to extract from a general wait timing framework. Pushing my own agenda a bit - having this patch in the current release would help to get some field experience on any issues surrounding timing :)

> Some implementation notes. This currently fails regression test
> create_function_3; I haven't looked into why yet.

I'll take a look at it.

Thanks again.

--
Ants Aasma
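The long-running monotonicity check suggested above might look roughly like this as a standalone loop (illustrative only): repeatedly read gettimeofday() under load and count how often the wall clock steps backwards, e.g. due to NTP adjustments.

#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

int main(void)
{
    struct timeval tv;
    uint64_t prev = 0, backwards = 0, samples = 0;

    for (;;)                         /* run until interrupted */
    {
        gettimeofday(&tv, NULL);
        uint64_t cur = (uint64_t) tv.tv_sec * 1000000 + tv.tv_usec;

        if (cur < prev)
            backwards++;             /* wall clock stepped backwards */
        prev = cur;

        if (++samples % 10000000ULL == 0)
            printf("%llu samples, %llu backwards steps\n",
                   (unsigned long long) samples,
                   (unsigned long long) backwards);
    }
    return 0;
}

Any nonzero backwards count during a long run would show up as negative (or wrapped) intervals in naive gettimeofday-based instrumentation, which is the argument for CLOCK_MONOTONIC_RAW where it's available.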
On Wed, Feb 22, 2012 at 6:35 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> Some implementation notes. This currently fails regression test
>> create_function_3; I haven't looked into why yet.
>
> I'll take a look at it.

The failure was due to the leakproof changes to pg_proc. The attached patches are adjusted accordingly and rebased over Robert's blocks-dirtied patch.

Cheers,
Ants Aasma
Attachment
On 2/22/12 10:35 AM, Ants Aasma wrote:
> The reason why I didn't add write timings to relation stats is that I
> couldn't figure out what the semantics should be. It could be either
> "time spent waiting for this relation's blocks to be written out" or
> "time spent waiting for some other relation's blocks to be written out
> to free space for this relation's block", or maybe distribute the cost;
> background writes could be included or excluded. Writes usually return
> quickly, unless lots of possibly unrelated writes have dirtied enough
> of the OS cache, etc. I figured that whatever choices I made, they
> wouldn't really help anyone diagnose anything. Having global write
> timings in pg_stat_bgwriter might be useful, but I feel that is
> something for another patch.

I know it's not perfect, but I would argue that what users care about most of the time is time taken up in actual backends. So I wouldn't worry about the bgwriter. I also wouldn't worry about time spent waiting to find a buffer at this point (see below).

>> I still think a full wait timing interface is the right long-term
>> direction here. It's hard to reject this idea when it seems to be
>> working right now though, while more comprehensive wait storage is
>> still at least a release off. Opinions welcome; I'm still juggling this
>> around now that I have it working again.
>
> I agree that a wait timing interface is the right direction. I have
> thought a bit about it and could share some ideas - maybe I should
> create a wiki page where the general design could be hashed out?

Yes, I think a wiki page would be a good place to start. As you showed in your previous question about writes, there's a *lot* of places where timing info would be useful to us.

--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
On Fri, Feb 24, 2012 at 2:23 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Wed, Feb 22, 2012 at 6:35 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>>> Some implementation notes. This currently fails regression test
>>> create_function_3; I haven't looked into why yet.
>>
>> I'll take a look at it.
>
> The failure was due to the leakproof changes to pg_proc. The attached
> patches are adjusted accordingly and rebased over Robert's
> blocks-dirtied patch.

This seems to have bitrotted again. :-(

Can you please rebase again?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 21, 2012 at 5:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This seems to have bitrotted again. :-(
>
> Can you please rebase again?

Rebased.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Attachment
On Thu, Mar 22, 2012 at 7:38 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Wed, Mar 21, 2012 at 5:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This seems to have bitrotted again. :-(
>>
>> Can you please rebase again?
>
> Rebased.

I've committed the core of this. I left out the stats collector stuff, because it's still per-table and I think perhaps we should back off to just per-database. I changed it so that it does not conflate wait time with I/O time. Maybe there should be a separate method of measuring wait time, but I don't think it's a good idea for the per-backend stats to measure a different thing than what gets reported up to the stats collector - we should have ONE definition of each counter. I also tweaked the EXPLAIN output format a bit, and the docs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 27, 2012 at 2:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 22, 2012 at 7:38 PM, Ants Aasma <ants@cybertec.at> wrote:
>> On Wed, Mar 21, 2012 at 5:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> This seems to have bitrotted again. :-(
>>>
>>> Can you please rebase again?
>>
>> Rebased.
>
> I've committed the core of this. I left out the stats collector
> stuff, because it's still per-table and I think perhaps we should back
> off to just per-database. I changed it so that it does not conflate
> wait time with I/O time. Maybe there should be a separate method of
> measuring wait time, but I don't think it's a good idea for the
> per-backend stats to measure a different thing than what gets reported
> up to the stats collector - we should have ONE definition of each
> counter. I also tweaked the EXPLAIN output format a bit, and the docs.

And I've now committed the pg_stat_statements code as well, hopefully not stomping too badly on what Tom's apparently in the midst of doing with Peter's pg_stat_statements patch. I committed this almost exactly as submitted; just a minor code style correction and a few documentation enhancements.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 27, 2012 at 9:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've committed the core of this. I left out the stats collector
> stuff, because it's still per-table and I think perhaps we should back
> off to just per-database. I changed it so that it does not conflate
> wait time with I/O time. Maybe there should be a separate method of
> measuring wait time, but I don't think it's a good idea for the
> per-backend stats to measure a different thing than what gets reported
> up to the stats collector - we should have ONE definition of each
> counter. I also tweaked the EXPLAIN output format a bit, and the docs.

Thank you for working on this.

Taking a fresh look at it, I agree with you that conflating waits for backend-local I/O with I/O performed by some other backend might paint us into a corner. For most workloads the wait for I/O performed by others should be the minority anyway.

I won't really miss the per-table stats. But if the pg_stat_statements normalisation patch gets committed, it would be really neat to also have I/O waits there. It would be much easier to quickly diagnose "what is causing all this I/O" issues.

Thanks again,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Tue, Mar 27, 2012 at 3:22 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Tue, Mar 27, 2012 at 9:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I've committed the core of this. I left out the stats collector
>> stuff, because it's still per-table and I think perhaps we should back
>> off to just per-database. I changed it so that it does not conflate
>> wait time with I/O time. Maybe there should be a separate method of
>> measuring wait time, but I don't think it's a good idea for the
>> per-backend stats to measure a different thing than what gets reported
>> up to the stats collector - we should have ONE definition of each
>> counter. I also tweaked the EXPLAIN output format a bit, and the docs.
>
> Thank you for working on this.
>
> Taking a fresh look at it, I agree with you that conflating waits for
> backend-local I/O with I/O performed by some other backend might paint
> us into a corner. For most workloads the wait for I/O performed by
> others should be the minority anyway.
>
> I won't really miss the per-table stats. But if the pg_stat_statements
> normalisation patch gets committed, it would be really neat to also
> have I/O waits there. It would be much easier to quickly diagnose "what
> is causing all this I/O" issues.

So, the pg_stat_statements part of this is committed now. Do you want to prepare a revised patch to add per-database counters to the statistics collector? I think that might be a good idea...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 27, 2012 at 7:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've committed the core of this. I left out the stats collector
> stuff, because it's still per-table and I think perhaps we should back
> off to just per-database. I changed it so that it does not conflate
> wait time with I/O time. Maybe there should be a separate method of
> measuring wait time, but I don't think it's a good idea for the
> per-backend stats to measure a different thing than what gets reported
> up to the stats collector - we should have ONE definition of each
> counter. I also tweaked the EXPLAIN output format a bit, and the docs.

Maybe I missed some earlier discussion -- I've been having trouble keeping up with the lists. But was there discussion of why this is a GUC? Why not just another parameter to EXPLAIN like the others? i.e.

EXPLAIN (ANALYZE, BUFFERS, IOTIMING)

--
greg
On Tue, Mar 27, 2012 at 8:41 PM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Mar 27, 2012 at 7:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I've committed the core of this. I left out the stats collector
>> stuff, because it's still per-table and I think perhaps we should back
>> off to just per-database. I changed it so that it does not conflate
>> wait time with I/O time. Maybe there should be a separate method of
>> measuring wait time, but I don't think it's a good idea for the
>> per-backend stats to measure a different thing than what gets reported
>> up to the stats collector - we should have ONE definition of each
>> counter. I also tweaked the EXPLAIN output format a bit, and the docs.
>
> Maybe I missed some earlier discussion -- I've been having trouble
> keeping up with the lists.
>
> But was there discussion of why this is a GUC? Why not just another
> parameter to EXPLAIN like the others?
> i.e. EXPLAIN (ANALYZE, BUFFERS, IOTIMING)

Because you want to be able to expose the data even for queries that aren't explained. Right now, you can do that with pg_stat_statements; the original patch also had per-table counters, but I didn't commit that part due to some concerns about stats bloat.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 27, 2012 at 8:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 27, 2012 at 3:22 PM, Ants Aasma <ants@cybertec.at> wrote:
>> On Tue, Mar 27, 2012 at 9:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I've committed the core of this. I left out the stats collector
>>> stuff, because it's still per-table and I think perhaps we should back
>>> off to just per-database. I changed it so that it does not conflate
>>> wait time with I/O time. Maybe there should be a separate method of
>>> measuring wait time, but I don't think it's a good idea for the
>>> per-backend stats to measure a different thing than what gets reported
>>> up to the stats collector - we should have ONE definition of each
>>> counter. I also tweaked the EXPLAIN output format a bit, and the docs.
>>
>> Thank you for working on this.
>>
>> Taking a fresh look at it, I agree with you that conflating waits for
>> backend-local I/O with I/O performed by some other backend might paint
>> us into a corner. For most workloads the wait for I/O performed by
>> others should be the minority anyway.
>>
>> I won't really miss the per-table stats. But if the pg_stat_statements
>> normalisation patch gets committed, it would be really neat to also
>> have I/O waits there. It would be much easier to quickly diagnose "what
>> is causing all this I/O" issues.
>
> So, the pg_stat_statements part of this is committed now. Do you want
> to prepare a revised patch to add per-database counters to the
> statistics collector? I think that might be a good idea...

Hearing nothing further on this point, I went and did it myself. In the process I noticed a couple of things that I think we need to fix.

Currently, the statistics views that include timing information display it in milliseconds (see pg_stat_user_functions), while the underlying SQL-callable functions return the data in microseconds. Whether or not this was a good design decision, we're stuck with it now; the documentation implies that the views and functions use the same units. I'll go fix that next.

Meanwhile, pg_stat_statements converts the same data to seconds but makes it a double rather than a bigint. I think that was a bad idea, and we should make it consistent - use a bigint and milliseconds - while we still can.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 5, 2012 at 11:45 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Meanwhile, pg_stat_statements converts the same data to seconds but > makes it a double rather than a bigint. I think that was a bad idea > and we should make it consistently use a bigint and milliseconds while > we still can. Hmm. So, on further review, this is not as simple as it seems. I'd like some input from other people on what we should do here. pg_stat_statements has long exposed a column called "total_time" as a float8. It now exposes columns "time_read" and "time_write" which are actually measuring the time spent reading and writing data blocks, and those are also exposed as a float8; all these count seconds. Meanwhile, all times exposed by the stats collector (including the new and analogous pg_stat_database.block_read_time and pg_stat_database.block_write_time columns) are exposed as int8; these count milliseconds. So, should we make the new columns exposed by pg_stat_statements use milliseconds, so that the block read/write timings are everywhere in milliseconds, or should we keep them as a float8, so that all the times exposed by pg_stat_statements use float8? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
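Until that is settled, lining the two up requires a conversion at query time; a sketch assuming the column definitions as described above (these columns were still candidates for renaming at this point):

    -- time_read is float8 seconds; block_read_time is int8 milliseconds
    SELECT s.query,
           s.time_read * 1000.0 AS stmt_read_ms,
           d.block_read_time    AS db_read_ms
    FROM pg_stat_statements s
    CROSS JOIN pg_stat_database d
    WHERE d.datname = current_database();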
Robert Haas <robertmhaas@gmail.com> writes: > Hmm. So, on further review, this is not as simple as it seems. I'd > like some input from other people on what we should do here. > pg_stat_statements has long exposed a column called "total_time" as a > float8. It now exposes columns "time_read" and "time_write" which are > actually measuring the time spent reading and writing data blocks, and > those are also exposed as a float8; all these count seconds. > Meanwhile, all times exposed by the stats collector (including the new > and analogous pg_stat_database.block_read_time and > pg_stat_database.block_write_time columns) are exposed as int8; these > count milliseconds. > So, should we make the new columns exposed by pg_stat_statements use > milliseconds, so that the block read/write timings are everywhere in > milliseconds, or should we keep them as a float8, so that all the > times exposed by pg_stat_statements use float8? Given that we've whacked pg_stat_statements' behavior around rather thoroughly in this release, maybe we could get away with redefining total_time as being measured in msec rather than sec, thereby aligning units as msec across the board. It's arguably a smaller deal than the way we've redefined what the query column contains... float8 vs int8 is a distinct issue, and probably one that is not as much of an impact on clients if we change it. It is not hard to predict that somebody will eventually want sub-msec resolution on these things, which would suggest that float8 would be the better idea. But perhaps we could leave that change for a future release. regards, tom lane
On 10 April 2012 14:33, Robert Haas <robertmhaas@gmail.com> wrote: > So, should we make the new columns exposed by pg_stat_statements use > milliseconds, so that the block read/write timings are everywhere in > milliseconds, or should we keep them as a float8, so that all the > times exposed by pg_stat_statements use float8? I believe that we should keep them as float8, on the basis that a user is more likely to generalise from total_time's format (when writing a script to query the view or whatever) than from that of pg_stat_database. A part of me would like to change the view definitions so that all the columns are strongly typed (i.e. all these values would be intervals). I realise that that isn't practical though. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Apr 10, 2012 at 10:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Hmm. So, on further review, this is not as simple as it seems. I'd >> like some input from other people on what we should do here. > >> pg_stat_statements has long exposed a column called "total_time" as a >> float8. It now exposes columns "time_read" and "time_write" which are >> actually measuring the time spent reading and writing data blocks, and >> those are also exposed as a float8; all these count seconds. > >> Meanwhile, all times exposed by the stats collector (including the new >> and analogous pg_stat_database.block_read_time and >> pg_stat_database.block_write_time columns) are exposed as int8; these >> count milliseconds. > >> So, should we make the new columns exposed by pg_stat_statements use >> milliseconds, so that the block read/write timings are everywhere in >> milliseconds, or should we keep them as a float8, so that all the >> times exposed by pg_stat_statements use float8? > > Given that we've whacked pg_stat_statements' behavior around rather > thoroughly in this release, maybe we could get away with redefining > total_time as being measured in msec rather than sec, thereby aligning > units as msec across the board. It's arguably a smaller deal than the > way we've redefined what the query column contains... Retyping columns is an awfully good way to produce grumpy users. Then again, if we're going to do it, it would certainly be better to do it now rather than later, because right now I'm guessing pg_stat_statements is a lot less heavily used than it will be after 9.2 hits shelves. > float8 vs int8 is a distinct issue, and probably one that is not as > much of an impact on clients if we change it. It is not hard to predict > that somebody will eventually want sub-msec resolution on these things, > which would suggest that float8 would be the better idea. But perhaps > we could leave that change for a future release. Well, internally, the I/O timing stuff as well as the function timing stuff use microseconds, and the SQL functions expose it as microseconds, but then the view divides by 1000 to convert to milliseconds. I made the I/O timing stuff do it that way because that's how the function timing stuff does it, but it does seem a little random. One thing in its favor is that it provides a way for users to get this if they want it, without screwing readability for the vast majority who don't care. On the flip side, the new checkpoint timing stuff is in milliseconds all the way through, though it seems vanishingly unlikely that anyone needs more resolution in that case. We have lots of other things in milliseconds, too. No matter what we end up doing here it will be consistent with something; I am reminded of the phrase "the good thing about standards is that there are so many to choose from...". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
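Summarising the conventions named in this exchange (a snapshot of the discussion as it stood, not of the final committed state):

    counter                                     type     unit
    ------------------------------------------  -------  ------------
    pg_stat_statements.total_time               float8   seconds
    pg_stat_statements.time_read / time_write   float8   seconds
    pg_stat_database.block_read/write_time      int8     milliseconds
    underlying SQL-callable timing functions    int8     microseconds
    checkpoint timing counters                  --       milliseconds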
Robert Haas <robertmhaas@gmail.com> writes: > No matter what we end up doing here it will be consistent with > something; I am reminded of the phrase "the good thing about standards > is that there are so many to choose from...". Well, FWIW I vote for making the new columns be float8 msec. If you don't want to change total_time to match, I guess there's no law that says it *has* to be consistent ... regards, tom lane
On Tue, Apr 10, 2012 at 17:58, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 10, 2012 at 10:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Hmm. So, on further review, this is not as simple as it seems. I'd >>> like some input from other people on what we should do here. >> >>> pg_stat_statements has long exposed a column called "total_time" as a >>> float8. It now exposes columns "time_read" and "time_write" which are >>> actually measuring the time spent reading and writing data blocks, and >>> those are also exposed as a float8; all these count seconds. >> >>> Meanwhile, all times exposed by the stats collector (including the new >>> and analogous pg_stat_database.block_read_time and >>> pg_stat_database.block_write_time columns) are exposed as int8; these >>> count milliseconds. >> >>> So, should we make the new columns exposed by pg_stat_statements use >>> milliseconds, so that the block read/write timings are everywhere in >>> milliseconds, or should we keep them as a float8, so that all the >>> times exposed by pg_stat_statements use float8? >> >> Given that we've whacked pg_stat_statements' behavior around rather >> thoroughly in this release, maybe we could get away with redefining >> total_time as being measured in msec rather than sec, thereby aligning >> units as msec across the board. It's arguably a smaller deal than the >> way we've redefined what the query column contains... > > Retyping columns is an awfully good way to produce grumpy users. Then > again, if we're going to do it, it would certainly be better to do it > now rather than later, because right now I'm guessing > pg_stat_statements is a lot less heavily used than it will be after > 9.2 hits shelves. Agreed. It's better if we can also change the name of it - provided we can come up with a reasonable new name. Then people's applications will break *visibly*, which is a lot better than breaking invisibly. (This is the main reason why we renamed current_query in pg_stat_activity..) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > On Tue, Apr 10, 2012 at 17:58, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Apr 10, 2012 at 10:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Given that we've whacked pg_stat_statements' behavior around rather >>> thoroughly in this release, maybe we could get away with redefining >>> total_time as being measured in msec rather than sec, thereby aligning >>> units as msec across the board. It's arguably a smaller deal than the >>> way we've redefined what the query column contains... >> >> Retyping columns is an awfully good way to produce grumpy users. Then >> again, if we're going to do it, it would certainly be better to do it >> now rather than later, because right now I'm guessing >> pg_stat_statements is a lot less heavily used than it will be after >> 9.2 hits shelves. > Agreed. It's better if we can also change the name of it - provided we > can come up with a reasonable new name. Then people's applications will > break *visibly*, which is a lot better than breaking invisibly. (This > is the main reason why we renamed current_query in pg_stat_activity..) That might be overkill. Changing the column name will definitely break anything more specific than "select * from pg_stat_statements". However, it's less clear that changing the units in which the column is expressed will break things. It seems likely to me that nobody out there is doing anything much more sophisticated than sorting by the column, and that's still going to work the same. regards, tom lane
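For illustration, the kind of query Tom has in mind survives a unit change untouched, since only the ordering matters:

    -- top statements by cumulative time; works the same whether
    -- total_time counts seconds or milliseconds
    SELECT query, calls, total_time
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;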
On Tue, Apr 10, 2012 at 18:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Magnus Hagander <magnus@hagander.net> writes: >> On Tue, Apr 10, 2012 at 17:58, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Tue, Apr 10, 2012 at 10:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> Given that we've whacked pg_stat_statements' behavior around rather >>>> thoroughly in this release, maybe we could get away with redefining >>>> total_time as being measured in msec rather than sec, thereby aligning >>>> units as msec across the board. It's arguably a smaller deal than the >>>> way we've redefined what the query column contains... >>> >>> Retyping columns is an awfully good way to produce grumpy users. Then >>> again, if we're going to do it, it would certainly be better to do it >>> now rather than later, because right now I'm guessing >>> pg_stat_statements is a lot less heavily used than it will be after >>> 9.2 hits shelves. > >> Agreed. It's better if we can also change the name of it - provided we >> can come up with a reasonable new name. Then people's applications will >> break *visibly*, which is a lot better than breaking invisibly. (This >> is the main reason why we renamed current_query in pg_stat_activity..) > > That might be overkill. Changing the column name will definitely break > anything more specific than "select * from pg_stat_statements". > However, it's less clear that changing the units in which the column is > expressed will break things. It seems likely to me that nobody out > there is doing anything much more sophisticated than sorting by the > column, and that's still going to work the same. I've seen cases where the timing is correlated with external timings, e.g. from the application. Have I seen it a lot? No - but then I haven't seen a big usage of pg_stat_statements either, which might be the better argument for allowing a change of unit but not name. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Tue, Apr 10, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> No matter what we end up doing here it will be consistent with >> something; I am reminded of the phrase "the good thing about standards >> is that there are so many to choose from...". > > Well, FWIW I vote for making the new columns be float8 msec. If you > don't want to change total_time to match, I guess there's no law that > says it *has* to be consistent ... Ugh. So the three ways of doing timing that we have already aren't enough, and we need a fourth one? Ack! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Apr 10, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, FWIW I vote for making the new columns be float8 msec. > Ugh. So the three ways of doing timing that we have already aren't > enough, and we need a fourth one? Ack! Huh? I understood what you said upthread to be that we have two ways in existing releases (anything unreleased has zero standing in this discussion): float8 sec in pg_stat_statements.total_time, and int8 msec everywhere else. Did I miss something? regards, tom lane
On Tue, Apr 10, 2012 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Apr 10, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Well, FWIW I vote for making the new columns be float8 msec. > >> Ugh. So the three ways of doing timing that we have already aren't >> enough, and we need a fourth one? Ack! > > Huh? I understood what you said upthread to be that we have two ways > in existing releases (anything unreleased has zero standing in this > discussion): float8 sec in pg_stat_statements.total_time, and > int8 msec everywhere else. Did I miss something? We also have int8 usec floating around. But even if we didn't, float8 msec would be a new one, regardless of whether it would be third or fourth... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Apr 10, 2012 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Huh? I understood what you said upthread to be that we have two ways >> in existing releases (anything unreleased has zero standing in this >> discussion): float8 sec in pg_stat_statements.total_time, and >> int8 msec everywhere else. Did I miss something? > We also have int8 usec floating around. But even if we didn't, float8 > msec would be a new one, regardless of whether it would be third or > fourth... It would still be the second one, because it would replace the only use of float8 sec, no? And more to the point, it converges us on msec being the only exposed unit. The business about underlying microseconds is maybe not so good, but I don't think we want to touch that right now. In the long run I think it would make sense to converge on float8 msec as being the standard for exposed timing values, because that is readily adaptable to the underlying data having nsec or even better precision. regards, tom lane
On Tue, Apr 10, 2012 at 1:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Apr 10, 2012 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Huh? I understood what you said upthread to be that we have two ways >>> in existing releases (anything unreleased has zero standing in this >>> discussion): float8 sec in pg_stat_statements.total_time, and >>> int8 msec everywhere else. Did I miss something? > >> We also have int8 usec floating around. But even if we didn't, float8 >> msec would be a new one, regardless of whether it would be third or >> fourth... > > It would still be the second one, because it would replace the only use > of float8 sec, no? And more to the point, it converges us on msec being > the only exposed unit. > > The business about underlying microseconds is maybe not so good, but > I don't think we want to touch that right now. In the long run > I think it would make sense to converge on float8 msec as being the > standard for exposed timing values, because that is readily adaptable to > the underlying data having nsec or even better precision. Hmm. Maybe we should think about numeric ms, which would have all the same advantages but without the round-off error. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 10, 2012 at 02:01:02PM -0400, Robert Haas wrote: > On Tue, Apr 10, 2012 at 1:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> On Tue, Apr 10, 2012 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> Huh? I understood what you said upthread to be that we have two ways > >>> in existing releases (anything unreleased has zero standing in this > >>> discussion): float8 sec in pg_stat_statements.total_time, and > >>> int8 msec everywhere else. Did I miss something? > > > >> We also have int8 usec floating around. But even if we didn't, float8 > >> msec would be a new one, regardless of whether it would be third or > >> fourth... > > > > It would still be the second one, because it would replace the only use > > of float8 sec, no? And more to the point, it converges us on msec being > > the only exposed unit. > > > > The business about underlying microseconds is maybe not so good, but > > I don't think we want to touch that right now. In the long run > > I think it would make sense to converge on float8 msec as being the > > standard for exposed timing values, because that is readily adaptable to > > the underlying data having nsec or even better precision. > > Hmm. Maybe we should think about numeric ms, which would have all the > same advantages but without the round-off error. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > They are also a lot bigger with tons of added overhead. :) Regards, Ken
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Apr 10, 2012 at 1:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The business about underlying microseconds is maybe not so good, but >> I don't think we want to touch that right now. In the long run >> I think it would make sense to converge on float8 msec as being the >> standard for exposed timing values, because that is readily adaptable to >> the underlying data having nsec or even better precision. > Hmm. Maybe we should think about numeric ms, which would have all the > same advantages but without the round-off error. Color me unimpressed ... numeric calculations are vastly more expensive than float, and where are you going to get timing data that has more than sixteen decimal digits of accuracy? IME we're lucky to get three repeatable digits in any timing measurement. The point of using a non-integer type here is not so much precision as dynamic range: sometimes you might be measuring queries that run for hours, and other times ones that run for microseconds. In the latter case it's important to be able to represent nanoseconds, but not so much in the former. regards, tom lane
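The dynamic-range point is easy to demonstrate: float8's roughly sixteen significant digits let one value carry an hours-scale accumulation and a sub-microsecond fraction at the same time. A quick sketch:

    -- one hour expressed in ms, plus one nanosecond: 13 significant
    -- digits, comfortably within float8's ~16
    SELECT 3600000::float8 + 0.000001 AS ms;  -- 3600000.000001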
On 04/10/2012 12:27 PM, Tom Lane wrote: > Changing the column name will definitely break > anything more specific than "select * from pg_stat_statements". > However, it's less clear that changing the units in which the column is > expressed will break things. It seems likely to me that nobody out > there is doing anything much more sophisticated than sorting by the > column, and that's still going to work the same. I am doing more sophisticated things with it, so I'll celebrate this as my opportunity to say I did something you didn't see coming for 2012. All the sites involved will happily shred those scripts and rewrite for either normalized queries *or* better I/O timing info though, so net positive for 9.2 changes even if this part breaks on them. I think this is one of those rare opportunities where there's enough positive goodwill from changes to ask "what's the best way to handle this long-term?" and get away with whatever change that requires, too. I'm really not liking the look of this wart now that Robert has pointed it out. I'd prefer to see at least usec resolution and 8 bytes of "dynamic range" for query related statistics. Any of these would be fine from a UI perspective to me:

-float8 seconds
-float8 msec
-float8 usec
-int8 usec

I don't think int8 msec will be enough resolution to time queries for very long, if it's not already obsolete. The committed example for pg_test_timing on good hardware already clocks trivial events at a single usec. Even I/O is getting there. I've measured my Fusion-io loaner card peaking at 8GB/s, which works out to 1 usec per 8K page. None of that is even price no object hardware today; it's the stuff sitting in my office. If anything, I'd expect more timing code in the database that only has ms resolution right now will start looking fat in a year or two, and more things might need to be shifted to usec instead. Checkpoint timing can survive having less resolution because its primary drumbeat is very unlikely to drop below the minutes range. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
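Checking the arithmetic behind that last figure (assuming a flat 8 GB/s and 8K pages):

    SELECT 8e9 / 8192       AS pages_per_sec,  -- ~976,562
           8192 / 8e9 * 1e6 AS usec_per_page;  -- ~1.02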
On 10 April 2012 23:07, Greg Smith <greg@2ndquadrant.com> wrote: > On 04/10/2012 12:27 PM, Tom Lane wrote: > I am doing more sophisticated things with it, so I'll celebrate this as my > opportunity to say I did something you didn't see coming for 2012. This is why I requested that we expose the query_id hash value - I believe that it will be generally useful in clustering situations. It would be nice to have a persistent identifier. While we're discussing revising pg_stat_statement's interface, are you still opposed to exposing that value, Tom? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Apr 10, 2012 at 6:32 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 10 April 2012 23:07, Greg Smith <greg@2ndquadrant.com> wrote: >> On 04/10/2012 12:27 PM, Tom Lane wrote: >> I am doing more sophisticated things with it, so I'll celebrate this as my >> opportunity to say I did something you didn't see coming for 2012. > > This is why I requested that we expose the query_id hash value - I > believe that it will be generally useful in clustering situations. It > would be nice to have a persistent identifier. While we're discussing > revising pg_stat_statement's interface, are you still opposed to > exposing that value, Tom? If people need something like that, couldn't they create it by hashing the normalized query text with an arbitrary algorithm? The only obvious advantage of exposing the value used internally is that it might be helpful in terms of understanding the collision behavior. But hopefully collisions are pretty rare anyway, so... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
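A sketch of the client-side alternative Robert describes, using md5() purely as an example of an arbitrary, stable algorithm:

    -- derive a portable identifier from the normalized query text,
    -- independent of the internal hash
    SELECT md5(query) AS query_key, calls, total_time
    FROM pg_stat_statements;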
Peter Geoghegan <peter@2ndquadrant.com> writes: > On 10 April 2012 23:07, Greg Smith <greg@2ndquadrant.com> wrote: >> On 04/10/2012 12:27 PM, Tom Lane wrote: >> I am doing more sophisticated things with it, so I'll celebrate this as my >> opportunity to say I did something you didn't see coming for 2012. > This is why I requested that we expose the query_id hash value - I > believe that it will be generally useful in clustering situations. It > would be nice to have a persistent identifier. While we're discussing > revising pg_stat_statement's interface, are you still opposed to > exposing that value, Tom? I still am. I'm unconvinced by references to "clustering situations", because as constructed the hash is extremely database-specific. It will vary depending on OID assignments, not to mention platform characteristics such as word width and endianness. regards, tom lane
On 11 April 2012 00:35, Robert Haas <robertmhaas@gmail.com> wrote: > If people need something like that, couldn't they create it by hashing > the normalized query text with an arbitrary algorithm? That supposes that the normalised query text is perfectly stable. It may well not be, particularly for things like ad-hoc queries or queries generated by ORMs, across database clusters and over long periods of time - you're basically throwing the benefit of all of that intelligent normalisation out of the window, because it's pretty close to free to expose it. What if a developer tweaks an alias in the application for clarity? Also, as you point out, it has additional utility in advertising when a collision has happened, and setting the user's expectations appropriately. I assume that collisions are very rare, but when they do happen, this gives you a fighting chance of noticing them. As Tom points out, the query hash will vary according to platform-specific characteristics, including endianness, and will require that OIDs are the same on every node. However, it is still going to be useful in clusters that use streaming replication, though not a third-party trigger-based replication system, for example, because streaming replication does of course require that those factors (and rather a lot more) be identical across the cluster anyway. Realistically, I'd expect a large majority of people interested in this feature to only want to use it with streaming replication anyway. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes: > On 11 April 2012 00:35, Robert Haas <robertmhaas@gmail.com> wrote: >> If people need something like that, couldn't they create it by hashing >> the normalized query text with an arbitrary algorithm? > That supposes that the normalised query text is perfectly stable. It > may well not be, particularly for things like ad-hoc queries or > queries generated by ORMs, across database clusters and over long > periods of time - Indeed, but the hash value isn't stable either given those sorts of assumptions, so I'm not convinced that there's any advantage there. What I think people would actually like to know, if they're in a situation where distinct query texts are getting hashed to the same thing, is *which* different texts got hashed to the same thing. But there's no good way to expose that given the pg_stat_statements infrastructure, and exposing the hash value doesn't help. regards, tom lane
On 11 April 2012 01:16, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Geoghegan <peter@2ndquadrant.com> writes: >> On 11 April 2012 00:35, Robert Haas <robertmhaas@gmail.com> wrote: >>> If people need something like that, couldn't they create it by hashing >>> the normalized query text with an arbitrary algorithm? > >> That supposes that the normalised query text is perfectly stable. It >> may well not be, particularly for things like ad-hoc queries or >> queries generated by ORMs, across database clusters and over long >> periods of time - > > Indeed, but the hash value isn't stable either given those sorts of > assumptions, so I'm not convinced that there's any advantage there. Isn't it? The hash captures the true meaning of the query, while having the database server's platform as a usually irrelevant artefact. Another thing that I forgot to mention is client encoding - it may well be fairly inconvenient to have to use the same algorithm to hash the query string across applications. You also have to hash the query string yourself again and again, which is expensive to do from Python or something, and is often inconvenient - differences beyond track_activity_query_size bytes (default: 1024) are not recognised. Using an SQL code beautifier where a single byte varies now breaks everything, which developers don't expect at all (we've trained them not to), so in many ways you're back to the same limitations as classic pg_stat_statements if you attempt to aggregate queries over time and across machines, which is a very real use case. It's probably pretty annoying to have to get your Python app to use the same hash function as your Java app or whatever, unless you want to use something heavyweight like a cryptographic hash function. By doing it within Postgres, you avoid those headaches. I'm not asking you to very loudly proclaim that it should be used like this - just expose it, accurately document it, and I'm quite confident that it will be widely used and relied upon by those that are reasonably well informed, and understand its limitations, which are really quite straightforward. > What I think people would actually like to know, if they're in a > situation where distinct query texts are getting hashed to the same > thing, is *which* different texts got hashed to the same thing. > But there's no good way to expose that given the pg_stat_statements > infrastructure, and exposing the hash value doesn't help. Apart from detecting the case where we get a straightforward collision, I don't expect that that would be useful. The whole point is that the user doesn't care about the difference, and I think we've specified a practical, widely useful standard for when queries should be considered equivalent. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Apr 11, 2012 at 01:53:06AM +0100, Peter Geoghegan wrote: > On 11 April 2012 01:16, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Peter Geoghegan <peter@2ndquadrant.com> writes: > >> On 11 April 2012 00:35, Robert Haas <robertmhaas@gmail.com> wrote: > >>> If people need something like that, couldn't they create it by hashing > >>> the normalized query text with an arbitrary algorithm? > > > >> That supposes that the normalised query text is perfectly stable. It > >> may well not be, particularly for things like ad-hoc queries or > >> queries generated by ORMs, across database clusters and over long > >> periods of time - > > > > Indeed, but the hash value isn't stable either given those sorts of > > assumptions, so I'm not convinced that there's any advantage there. > > Isn't it? The hash captures the true meaning of the query, while > having the database server's platform as a usually irrelevant > artefact. Another thing that I forgot to mention is client encoding - > it may well be fairly inconvenient to have to use the same algorithm > to hash the query string across applications. You also have to hash > the query string yourself again and again, which is expensive to do > from Python or something, and is often inconvenient - differences > beyond track_activity_query_size bytes (default: 1024) are not > recognised. Using an SQL code beautifier where a single byte varies > now breaks everything, which developers don't expect at all (we've > trained them not to), so in many ways you're back to the same > limitations as classic pg_stat_statements if you attempt to aggregate > queries over time and across machines, which is a very real use case. > > It's probably pretty annoying to have to get your Python app to use > the same hash function as your Java app or whatever, unless you want > to use something heavyweight like a cryptographic hash function. By > doing it within Postgres, you avoid those headaches. > > I'm not asking you to very loudly proclaim that it should be used like > this - just expose it, accurately document it, and I'm quite confident > that it will be widely used and relied upon by those that are > reasonably well informed, and understand its limitations, which are > really quite straightforward. > > > What I think people would actually like to know, if they're in a > > situation where distinct query texts are getting hashed to the same > > thing, is *which* different texts got hashed to the same thing. > > But there's no good way to expose that given the pg_stat_statements > > infrastructure, and exposing the hash value doesn't help. > > Apart from detecting the case where we get a straightforward > collision, I don't expect that that would be useful. The whole point > is that the user doesn't care about the difference, and I think we've > specified a practical, widely useful standard for when queries should > be considered equivalent. > -- > Peter Geoghegan http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training and Services > By using all 64 bits of the hash that we currently calculate, instead of the current use of 32 bits only, the collision probabilities are very low. Regards, Ken
On Wed, Apr 11, 2012 at 9:02 AM, ktm@rice.edu <ktm@rice.edu> wrote: > By using all 64 bits of the hash that we currently calculate, instead > of the current use of 32 bits only, the collision probabilities are > very low. That's probably true, but I'm not sure it's worth worrying about - one-in-four-billion is a pretty small probability. On the broader issue, Peter's argument here seems to boil down to "there is probably a reason why this is useful" and Tom's argument seems to boil down to "I don't want to expose it without knowing what that reason is". Peter may well be right, but that doesn't make Tom wrong. If we are going to expose this, we ought to be able to document why we have it, and right now we can't, because we don't *really* know what it's good for, other than raising awareness of the theoretical possibility of collisions, which doesn't seem like quite enough. On another note, I think this is a sufficiently major change that we ought to add Peter's name to the "Author" section of the pg_stat_statements documentation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On another note, I think this is a sufficiently major change that we > ought to add Peter's name to the "Author" section of the > pg_stat_statements documentation. +1 ... as long as we have those at all, they probably ought to credit anybody who did substantial work on the module. I think that eventually we'll have to give them up, just the way that we don't credit anybody in particular as author of the core code; but for now this is a good change. regards, tom lane
On Fri, Apr 13, 2012 at 4:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On another note, I think this is a sufficiently major change that we >> ought to add Peter's name to the "Author" section of the >> pg_stat_statements documentation. > > +1 ... as long as we have those at all, they probably ought to credit > anybody who did substantial work on the module. > > I think that eventually we'll have to give them up, just the way that > we don't credit anybody in particular as author of the core code; but > for now this is a good change. Yeah. I'd actually be in favor of removing them, and similar references in the comments, because they do lead and have led to disputes about who deserves to be mentioned. However, a change of this magnitude certainly deserves mention; it's not really the same module any more. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 13 April 2012 20:15, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Apr 11, 2012 at 9:02 AM, ktm@rice.edu <ktm@rice.edu> wrote: >> By using all 64 bits of the hash that we currently calculate, instead >> of the current use of 32 bits only, the collision probabilities are >> very low. > > That's probably true, but I'm not sure it's worth worrying about - > one-in-four-billion is a pretty small probability. That assumes that our hashing of query trees will exhibit uniform distribution. While it's easy enough to prove that hash_any does that, it would seem to me that that does not imply that we will exhibit perfectly uniform distribution within pg_stat_statements. The reason that I invented the jumble nomenclature to distinguish it from a straight serialisation is that, apart from the fact that many fields are simply ignored, we still couldn't deserialize what we do store from the jumble, because the correct way to serialise a tree entails serialising NULL pointers too - two non-equivalent jumbles could actually be bitwise identical. In practice, I think that adding NodeTags ought to be sufficient here, but I wouldn't like to bet my life on it. Actually, that is a point that perhaps should have been touched on in the comments at the top of the file. You may wish to take a look at my original analysis in the pg_stat_statements normalisation thread, which attempts to quantify the odds by drawing upon the mathematics of the birthday problem. > On the broader issue, Peter's argument here seems to boil down to > "there is probably a reason why this is useful" and Tom's argument > seems to boil down to "I don't want to expose it without knowing what > that reason is". Peter may well be right, but that doesn't make Tom > wrong. If we are going to expose this, we ought to be able to > document why we have it, and right now we can't, because we don't > *really* know what it's good for, other than raising awareness of the > theoretical possibility of collisions, which doesn't seem like quite > enough. Well, it's important to realise that the query string isn't really stable, to the extent that it could differ as the query is evicted and re-enters pg_stat_statements over time. The hash value is necessarily a stable identifier, as it is the very identifier used by pg_stat_statements. People are going to want to aggregate this information over long periods and across database clusters, and I would really like to facilitate that. To be honest, with the plans that we have for replication, with the addition of things like cascading replication, I think it will increasingly be the case that people prefer built-in replication. The fact that the actual hash value is affected by factors like the endianness of the architecture in question seems mostly irrelevant. If we were to expose this, I'd suggest that the documentation in the table describing each column say of the value:

query_hash | stable identifier used by pg_stat_statements for the query

There'd then be additional clarification after the existing reference to the hash value, giving the required caveats about the hash value being subject to various implementation artefacts that effectively only make the values persistently identify queries referencing particular objects in the same database (that is, the object OID values are used), or across databases that are binary clones, such as between a streaming replica master and its standby.
The OID restriction alone effectively shadows all others, so there's no need to mention endianness or anything like that. Anyone who jumped to the conclusion that their aggregation tool would work fine with Slony or something based on the query_hash description alone would probably have bigger problems. > On another note, I think this is a sufficiently major change that we > ought to add Peter's name to the "Author" section of the > pg_stat_statements documentation. Thanks. I actually thought this myself, but didn't want to mention it because I didn't think that it was up to me to decide, or to attempt to influence that decision. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 4/10/12 5:07 PM, Greg Smith wrote: > I'd prefer to see at least usec resolution and 8 bytes of "dynamic range" for query related statistics. Any of these would be fine from a UI perspective to me:
>
> -float8 seconds
> -float8 msec
> -float8 usec
> -int8 usec
>
> I don't think int8 msec will be enough resolution to time queries for very long, if it's not already obsolete. The committed example for pg_test_timing on good hardware already clocks trivial events at a single usec. Even I/O is getting there. I've measured my Fusion-io loaner card peaking at 8GB/s, which works out to 1 usec per 8K page. None of that is even price no object hardware today; it's the stuff sitting in my office.
>
> If anything, I'd expect more timing code in the database that only has ms resolution right now will start looking fat in a year or two, and more things might need to be shifted to usec instead. Checkpoint timing can survive having less resolution because its primary drumbeat is very unlikely to drop below the minutes range.

I agree that ms is on its way out... and frankly it wouldn't surprise me if at some point we actually had need of ns resolution. Given that, I don't think ms or us definitions make sense... float8 seconds seems much more logical to me. Though, if we're going to end up seriously breaking things anyway, perhaps it would make sense to switch everything over to interval... I realize that there's more overhead there, but I don't think selecting from the stats views is exactly performance critical. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Tue, 2012-04-10 at 09:33 -0400, Robert Haas wrote: > So, should we make the new columns exposed by pg_stat_statements use > milliseconds, so that the block read/write timings are everywhere in > milliseconds, or should we keep them as a float8, so that all the > times exposed by pg_stat_statements use float8? Wouldn't interval be the proper type to represent elapsed time?
Jim Nasby <jim@nasby.net> writes: > I agree that ms is on its way out... and frankly it wouldn't surprise me if at some point we actually had need of ns resolution. > Given that, I don't think ms or us definitions make sense... float8 seconds seems much more logical to me. Well, the important point is that it be float8, so that fractions of whatever units you choose can be represented. I do not feel a strong need to change the units in all the existing pg_stat_ columns from msec to sec; that strikes me as just breaking things to little gain. If the majority of them were in sec then I'd want to converge on that, but since the majority are in msec it seems like the path of least breakage is to converge on that. > Though, if we're going to end up seriously breaking things anyway, > perhaps it would make sense to switch everything over to interval... I > realize that there's more overhead there, but I don't think selecting > from the stats views is exactly performance critical. But (a) I *don't* want to seriously break things, and don't see a need to; (b) interval is expensive and has got its own problems, notably an internal limitation to usec resolution that we would not be able to get rid of easily. It's not even apparent to me that interval is semantically The Right Thing for values that represent an accumulation of a lot of different measurements. regards, tom lane
On Fri, Apr 13, 2012 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: > That's probably true, but I'm not sure it's worth worrying about - > one-in-four-billion is a pretty small probability. Is this not subject to the birthday paradox? If you have a given hash you're worried about a collision with then you have a one-in-four-billion chance. But if you have a collection of hashes and you're worried about any collisions then it only takes about 64k before there's likely a collision. -- greg
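For reference, the usual rule of thumb puts the threshold at sqrt(d) entries for a hash space of size d, with the 50% collision point slightly higher; a sketch assuming uniform hashing:

    SELECT sqrt(2 ^ 32)             AS rule_of_thumb,    -- 65536
           sqrt(2 * ln(2) * 2 ^ 32) AS fifty_pct_point;  -- ~77163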
Greg Stark <stark@mit.edu> writes: > On Fri, Apr 13, 2012 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> That's probably true, but I'm not sure it's worth worrying about - >> one-in-four-billion is a pretty small probability. > Is this not subject to the birthday paradox? If you have a given hash > you're worried about a collision with then you have a > one-in-four-billion chance. But if you have a collection of hashes and > you're worried about any collisions then it only takes about 64k > before there's likely a collision. ... so, if pg_stat_statements.max were set as high as 64k, you would need to worry. Realistically, I'm more worried about collisions due to inadequacies in the jumble calculation logic (Peter already pointed out some risk factors in that regard). regards, tom lane
On 14 April 2012 03:01, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Realistically, I'm more worried about collisions due to inadequacies in > the jumble calculation logic (Peter already pointed out some risk > factors in that regard). It's important to have a sense of proportion about the problem. The worst thing that a collision can do is lead the DBA on a bit of a wild goose chase. Importantly, collisions across databases and users are impossible. I've always taken the view that aggregating query statistics is a lossy process, and these kinds of problems seem like a more than acceptable price to pay for low-overhead dynamic query statistics. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Apr 13, 2012 at 10:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <stark@mit.edu> writes: >> On Fri, Apr 13, 2012 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> That's probably true, but I'm not sure it's worth worrying about - >>> one-in-four-billion is a pretty small probability. > >> Is this not subject to the birthday paradox? If you have a given hash >> you're worried about a collision with then you have a >> one-in-four-billion chance. But if you have a collection of hashes and >> you're worried about any collisions then it only takes about 64k >> before there's likely a collision. > > ... so, if pg_stat_statements.max were set as high as 64k, you would > need to worry. Well... at 64k, you'd be very likely to have a collision. But the whole birthday paradox thing means that there's a non-trivial collision probability even at lower numbers of entries. Seems like maybe we ought to be using 64 bits here... > Realistically, I'm more worried about collisions due to inadequacies in > the jumble calculation logic (Peter already pointed out some risk > factors in that regard). ...especially if collisions are even more frequent than random chance would suggest. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 04/13/2012 06:22 PM, Tom Lane wrote: > But (a) I *don't* want to seriously break things, and don't see a need > to; (b) interval is expensive and has got its own problems, notably an > internal limitation to usec resolution that we would not be able to get > rid of easily. A straight float seems pretty future-proof compared to a usec resolution interval. Jim was commenting in the same direction I already did, that ns resolution is not impossible to see coming. I also expect to compute plenty of derived statistics from these numbers. Interval math is good enough that I'm sure such things could be done, but it seems odd to start with those units. I appreciate that the interval type has a nice purist feel to it. My pragmatic side says we're going to pay overhead to create values in that type, only to find people end up converting it right back to other types for easier math tricks. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Sat, Apr 14, 2012 at 3:27 AM, Greg Smith <greg@2ndquadrant.com> wrote: > On 04/13/2012 06:22 PM, Tom Lane wrote: >> >> But (a) I *don't* want to seriously break things, and don't see a need >> to; (b) interval is expensive and has got its own problems, notably an >> internal limitation to usec resolution that we would not be able to get >> rid of easily. > > A straight float seems pretty future proof compared to a usec resolution > interval. Jim was commenting in the same direction I already did, that ns > resolution is not impossible to see coming. > > I also expect to compute plenty of derived statistics from these numbers. > Interval math is good enough that I'm sure such things could be done, but > it seems odd to start with those units. I appreciate that the interval type > has a nice purist feel to it. My pragmatic side says we're going to pay > overhead to create in that type, only to find people end up converting it > right back to other types for easier math tricks. I'm still rooting for numeric. As somebody said upthread, performance ain't critical here; and that lets us whack around the internal representation however we like without worrying about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10 April 2012 19:10, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm. Maybe we should think about numeric ms, which would have all the >> same advantages but without the round-off error. > > Color me unimpressed ... numeric calculations are vastly more expensive > than float, and where are you going to get timing data that has more > than sixteen decimal digits of accuracy? +1 Besides, how do you propose to solve the problem of storing numerics in a fixed allocation of shared memory? The only comparable thing in pg_stat_statements is the query string, which is capped at track_activity_query_size bytes for this very reason. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 14 April 2012 02:42, Greg Stark <stark@mit.edu> wrote: > Is this not subject to the birthday paradox? If you have a given hash > you're worried about a collision with then you have a > one-in-four-billion chance. But if you have a collection of hashes and > you're worried about any collisions then it only takes about 64k > before there's likely a collision. Just for the sake of the archives, assuming that there is a uniform distribution of values, by my calculations that puts the probability of collision at: pg_stat_statements.max of 1,000 = 0.00011562303995116263 and perhaps more representatively, if we follow the example in the docs: pg_stat_statements.max of 10,000 = 0.011496378237656368 It's enough of a problem that I'd expect to hear one or two complaints about it in the next few years, maybe. This is the probability of there being a collision, not the probability of someone caring about it. You probably wouldn't want to push your luck too far: pg_stat_statements.max of 100,000 = 0.6853509059051395 Even if you did, that would only mean that there was usually one, but perhaps two or three values that were collisions, out of an entire 100,000. To labour the point, you'd have to have a lot of bad luck for those to be the values that a human actually ended up caring about. Jim Nasby said upthread that selecting from the stats view isn't performance critical, and he's right. However, maintaining the stats themselves certainly is, since each and every query will have to update them, adding latency. pg_stat_statements is far from being a tool of minority interest, particularly now. People are going to want to add additional bells and whistles to it, which is fine by me, but I'm very much opposed to making everyone pay for additional features that imply performance overhead for all queries, particularly if the feature is of minority interest. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
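For the archives as well: the standard approximation p = 1 - exp(-n(n-1)/2d) reproduces these figures to within a few parts per thousand (the exact product formula accounts for the small differences):

    -- birthday-collision estimate for a 32-bit hash, uniform hashing
    SELECT n,
           1 - exp(-(n::float8 * (n - 1)) / (2 * 2 ^ 32)) AS p_collision
    FROM (VALUES (1000), (10000), (100000)) AS t(n);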
On Sat, Apr 14, 2012 at 9:54 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 10 April 2012 19:10, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Hmm. Maybe we should think about numeric ms, which would have all the >>> same advantages but without the round-off error. >> >> Color me unimpressed ... numeric calculations are vastly more expensive >> than float, and where are you going to get timing data that has more >> than sixteen decimal digits of accuracy? > > +1 > > Besides, how do you propose to solve the problem of storing numerics > in a fixed allocation of shared memory? The only comparable thing in > pg_stat_statements is the query string, which is capped at > track_activity_query_size bytes for this very reason. The internal representation doesn't have to be (and certainly shouldn't be) numeric. But if you translate to numeric before returning the data to the user, then you have the freedom, in the future, to whack around the internal representation however you like, without breaking backward compatibility. Choosing float8 for the external representation is fine as long as we're sure we're not ever going to want more than 16 significant digits, but I see no particular value in baking in that assumption. But perhaps, as the saying goes, 16 digits ought to be enough for anyone. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
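On how far those sixteen digits stretch: a float8 counter of milliseconds keeps microsecond-level resolution until the accumulated total reaches roughly 2^53 microseconds. A quick sketch of that horizon:

    -- 2^53 usec expressed in years: the point where adjacent usec
    -- steps stop being distinguishable in a float8 ms counter
    SELECT 2 ^ 53 / 1e6 / 86400 / 365.25 AS years;  -- ~285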
Robert Haas <robertmhaas@gmail.com> writes: > The internal representation doesn't have to be (and certainly > shouldn't be) numeric. But if you translate to numeric before > returning the data to the user, then you have the freedom, in the > future, to whack around the internal representation however you like, > without breaking backward compatibility. Choosing float8 for the > external representation is fine as long as we're sure we're not ever > going to want more than 16 significant digits, but I see no particular > value in baking in that assumption. But perhaps, as the saying goes, > 16 digits ought to be enough for anyone. There's no particular reason to think that Moore's Law is going to result in an increase in the fractional precision of timing data. It hasn't done so in the past, for sure. regards, tom lane
On Sat, Apr 14, 2012 at 10:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> The internal representation doesn't have to be (and certainly >> shouldn't be) numeric. But if you translate to numeric before >> returning the data to the user, then you have the freedom, in the >> future, to whack around the internal representation however you like, >> without breaking backward compatibility. Choosing float8 for the >> external representation is fine as long as we're sure we're not ever >> going to want more than 16 significant digits, but I see no particular >> value in baking in that assumption. But perhaps, as the saying goes, >> 16 digits ought to be enough for anyone. > > There's no particular reason to think that Moore's Law is going to > result in an increase in the fractional precision of timing data. > It hasn't done so in the past, for sure. Perhaps, but nobody's explained what we gain out of NOT using numeric. "It's slow" doesn't impress me; selecting from a system view doesn't need to be lightning-fast. However, the main thing here is that we need to do *something* here... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Apr 14, 2012 at 10:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There's no particular reason to think that Moore's Law is going to >> result in an increase in the fractional precision of timing data. >> It hasn't done so in the past, for sure. > Perhaps, but nobody's explained what we gain out of NOT using numeric. > "It's slow" doesn't impress me; selecting from a system view doesn't > need to be lightning-fast. Well, how about "the code is going to be quite a lot less readable"? C can manipulate floats natively, but not numerics. Also, as was pointed out upthread, the underlying data in shared memory is almost certainly never going to be infinite-precision; so using numeric in the API seems to me to be more likely to convey a false impression of exactness than to do anything useful. > However, the main thing here is that we need to do *something* here... Agreed, this has got to be pushed forward. regards, tom lane
On Wed, Apr 25, 2012 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sat, Apr 14, 2012 at 10:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> There's no particular reason to think that Moore's Law is going to >>> result in an increase in the fractional precision of timing data. >>> It hasn't done so in the past, for sure. > >> Perhaps, but nobody's explained what we gain out of NOT using numeric. >> "It's slow" doesn't impress me; selecting from a system view doesn't >> need to be lightning-fast. > > Well, how about "the code is going to be quite a lot less readable"? > C can manipulate floats natively, but not numerics. > > Also, as was pointed out upthread, the underlying data in shared memory > is almost certainly never going to be infinite-precision; so using > numeric in the API seems to me to be more likely to convey a false > impression of exactness than to do anything useful. > >> However, the main thing here is that we need to do *something* here... > > Agreed, this has got to be pushed forward. In the interest of furthering that goal, I propose that whoever is willing to take the time to clean this up gets to decide what to standardize on, and I'm happy to give you first crack at that if you have the cycles. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Apr 25, 2012 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> However, the main thing here is that we need to do *something* here... >> Agreed, this has got to be pushed forward. > In the interest of furthering that goal, I propose that whoever is > willing to take the time to clean this up gets to decide what to > standardize on, and I'm happy to give you first crack at that if you > have the cycles. OK. I have just returned from some emergency family business, and have got assorted catching-up to do, but I will try to get to that later this week. regards, tom lane
On Wed, Apr 25, 2012 at 1:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Apr 25, 2012 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> However, the main thing here is that we need to do *something* here... > >>> Agreed, this has got to be pushed forward. > >> In the interest of furthering that goal, I propose that whoever is >> willing to take the time to clean this up gets to decide what to >> standardize on, and I'm happy to give you first crack at that if you >> have the cycles. > > OK. I have just returned from some emergency family business, and have > got assorted catching-up to do, but I will try to get to that later > this week. Sounds good to me. You might want to revisit the issue of how the new columns in pg_stat_statements are named, as well. I am not sure I'm happy with that, but neither am I sure that I know what I'd like better. It's not too clear that the timing is specifically for data block reads and writes, for example. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 25, 2012 at 5:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Also, as was pointed out upthread, the underlying data in shared memory > is almost certainly never going to be infinite-precision; so using > numeric in the API seems to me to be more likely to convey a false > impression of exactness than to do anything useful. I don't think that follows. The underlying data will be measured in some metric unit of time like microsecond or nanosecond or something like that. So a base-10 representation will show exactly the precision that the underlying data has. On the other hand, a floating point number will show a base-2 approximation that may in fact display with more digits than the underlying data representation has. -- greg
Greg Stark <stark@mit.edu> writes: > On Wed, Apr 25, 2012 at 5:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also, as was pointed out upthread, the underlying data in shared memory >> is almost certainly never going to be infinite-precision; so using >> numeric in the API seems to me to be more likely to convey a false >> impression of exactness than to do anything useful. > I don't think that follows. The underlying data will be measured in > some metric unit of time like microsecond or nanosecond or something > like that. So a base-10 representation will show exactly the precision > that the underlying data has. On the other hand, a floating point > number will show a base-2 approximation that may in fact display with > more digits than the underlying data representation has. My point is that the underlying data is going to be stored in a fixed-width representation, and therefore it will have accuracy and/or range limitations that are considerably more severe than use of "numeric" for output might suggest to the user. In the current pg_stat_statements code, timings are in fact accumulated in float8, and emitting them as something other than float8 is just plain misleading IMHO. regards, tom lane
On Wed, Apr 25, 2012 at 1:58 PM, Greg Stark <stark@mit.edu> wrote: > On Wed, Apr 25, 2012 at 5:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also, as was pointed out upthread, the underlying data in shared memory >> is almost certainly never going to be infinite-precision; so using >> numeric in the API seems to me to be more likely to convey a false >> impression of exactness than to do anything useful. > > I don't think that follows. The underlying data will be measured in > some metric unit of time like microsecond or nanosecond or something > like that. So a base-10 representation will show exactly the precision > that the underlying data has. On the other hand, a floating point > number will show a base-2 approximation that may in fact display with > more digits than the underlying data representation has. I wholeheartedly agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
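A minimal standalone sketch of the point at issue, assuming 64-bit IEEE-754 doubles: an elapsed time measured as an exact integer count of microseconds picks up a base-2 approximation the moment it is converted to float8 milliseconds, and printing it at full precision can show digits the measurement never contained.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* An elapsed time measured as an exact integer count of
     * microseconds, e.g. from gettimeofday(). */
    int64_t usec = 1234567;

    /* Base-10 view: exactly the precision the data has. */
    printf("exact:  %lld.%03lld ms\n",
           (long long) (usec / 1000), (long long) (usec % 1000));

    /* float8 view: 1234.567 has no exact base-2 representation, so
     * asking for all 17 significant digits can print trailing digits
     * that were never measured. */
    double msec = usec / 1000.0;
    printf("float8: %.17g ms\n", msec);
    return 0;
}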
On mån, 2012-04-23 at 22:03 -0400, Robert Haas wrote: > Perhaps, but nobody's explained what we gain out of NOT using numeric. So if you want to have possibly different internal and external representations, why not use interval for the external one?
Peter Eisentraut <peter_e@gmx.net> writes: > On mån, 2012-04-23 at 22:03 -0400, Robert Haas wrote: >> Perhaps, but nobody's explained what we gain out of NOT using numeric. > So if you want to have possibly different internal and external > representations, why not use interval for the external one? That doesn't add any usefulness, only extra complication for clients that want to do more arithmetic with the values. Also, as was pointed out earlier, we have a hard-coded restriction to microsecond precision with the default implementation of interval; and it's not hard to foresee the day when that won't do. regards, tom lane
On Wed, Apr 25, 2012 at 5:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Eisentraut <peter_e@gmx.net> writes: >> On mån, 2012-04-23 at 22:03 -0400, Robert Haas wrote: >>> Perhaps, but nobody's explained what we gain out of NOT using numeric. > >> So if you want to have possibly different internal and external >> representations, why not use interval for the external one? > > That doesn't add any usefulness, only extra complication for clients > that want to do more arithmetic with the values. Also, as was pointed > out earlier, we have a hard-coded restriction to microsecond precision > with the default implementation of interval; and it's not hard to > foresee the day when that won't do. Agreed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... You might want to revisit the issue of how the new > columns in pg_stat_statements are named, as well. I am not sure I'm > happy with that, but neither am I sure that I know what I'd like > better. It's not too clear that the timing is specifically for data > block reads and writes, for example. Well, the names "time_read" and "time_write" are certainly out of step with every other stats view in the system; everyplace else, such columns are named "something_time" (and even in this view itself the other timing column is "total_time", not "time_total"). So that's got to change. We could just reverse the word order to "read_time" and "write_time", or we could do something like "buf_read_time" or "data_read_time". IIUC block_read_time/block_write_time in the pg_stat_database view are database-wide totals for the same numbers, so perhaps the pg_stat_statements column names should be consistent with those. I am kinda wondering though why those columns spell out "block" where every single other column name in the stats views uses the abbreviation "blk". regards, tom lane
... btw, while I'm criticizing names, how about changing "track_iotiming" to "track_io_timing"? The former seems inelegant and unreadable. regards, tom lane
On Sat, Apr 28, 2012 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> ... You might want to revisit the issue of how the new >> columns in pg_stat_statements are named, as well. I am not sure I'm >> happy with that, but neither am I sure that I know what I'd like >> better. It's not too clear that the timing is specifically for data >> block reads and writes, for example. > > Well, the names "time_read" and "time_write" are certainly out of step > with every other stats view in the system; everyplace else, such columns > are named "something_time" (and even in this view itself the other > timing column is "total_time", not "time_total"). So that's got to > change. We could just reverse the word order to "read_time" and > "write_time", or we could do something like "buf_read_time" or > "data_read_time". IIUC block_read_time/block_write_time in the > pg_stat_database view are database-wide totals for the same numbers, so > perhaps the pg_stat_statements column names should be consistent with > those. I am kinda wondering though why those columns spell out "block" > where every single other column name in the stats views uses the > abbreviation "blk". I like the idea of including the word block in there. I don't think it was probably a terribly good idea to abbreviate that to blk everywhere, but at this point it's probably better to be consistent, sigh. As for track_iotiming -> track_io_timing, I'm fine with that as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I like the idea of including the word block in there. I don't think > it was probably a terribly good idea to abbreviate that to blk > everywhere, but at this point it's probably better to be consistent, > sigh. > As for track_iotiming -> track_io_timing, I'm fine with that as well. I made these changes, so I think we are done with the naming issues. However, I'd still like to propose that we think about adjusting the timing column datatypes, ie uniformly use float8 msec for values representing elapsed times. By my count there are six columns that would be affected:

pg_stat_bgwriter.checkpoint_write_time
pg_stat_bgwriter.checkpoint_sync_time
pg_stat_database.blk_read_time
pg_stat_database.blk_write_time
pg_stat_user_functions.total_time
pg_stat_user_functions.self_time

The first four of these are new in 9.2, meaning that we would only be creating a compatibility issue for the last two. If we wait to do this in the future, we will have a significantly bigger problem than if we do it today. Advantages of the change are:

* Better precision exposed to the user (pg_stat_user_functions has historically provided only millisecond precision).
* Removal of arbitrary limit of microsecond precision. Of course, the underlying data is still no better than microsecond, but if we ever are able to migrate to OS APIs that return better-than-microsecond data, we won't have to adjust the stats view APIs to expose that data.
* A chance to make the functions underlying these stats view columns agree with the exposed column definitions.

Any objections out there? regards, tom lane
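For illustration, a minimal standalone sketch of the proposed convention, assuming POSIX gettimeofday() as the clock source (the backend's instr_time abstraction varies by platform): measure in integer microseconds, expose the elapsed time as float8 milliseconds.

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    usleep(1500);                     /* stand-in for the timed I/O */
    gettimeofday(&end, NULL);

    /* Expose the elapsed time as float8 milliseconds: the double
     * keeps the full microsecond precision of the source data and
     * leaves headroom if the clock source ever gets finer. */
    double msec = (end.tv_sec - start.tv_sec) * 1000.0
                + (end.tv_usec - start.tv_usec) / 1000.0;

    printf("%.3f ms\n", msec);
    return 0;
}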
On Sun, Apr 29, 2012 at 12:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: > As for track_iotiming -> track_io_timing, I'm fine with that as well. I'm still grumpy about the idea of a GUC changing the explain analyze output. How would people feel about adding an explain option that explicitly requests io timing for this explain analyze, and then having the io timing be enabled if either it's requested by explain analyze or it's turned on globally? That would make it more consistent with the other explain analyze options. I realize I don't get to be grumpy without actually contributing anything, but I'm happy to write up the patch if people agree with the change. -- greg
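A minimal sketch of the rule being proposed, with hypothetical names (explain_wants_io_timing stands in for the suggested per-statement EXPLAIN option; the track_io_timing argument mirrors the GUC of the same name):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: time I/O for this query if either the EXPLAIN
 * option asks for it or the GUC is on globally. */
static bool io_timing_enabled(bool explain_wants_io_timing,
                              bool track_io_timing)
{
    return explain_wants_io_timing || track_io_timing;
}

int main(void)
{
    printf("%d\n", io_timing_enabled(true, false));  /* 1: EXPLAIN asked */
    printf("%d\n", io_timing_enabled(false, true));  /* 1: GUC is on */
    printf("%d\n", io_timing_enabled(false, false)); /* 0: neither */
    return 0;
}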
Greg Stark <stark@mit.edu> writes: > On Sun, Apr 29, 2012 at 12:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> As for track_iotiming -> track_io_timing, I'm fine with that as well. > I'm still grumpy about the idea of a GUC changing the explain analyze > output. How would people feel about adding an explain option that > explicitly requests io timing for this explain analyze and then having > the io timing be enabled if either it's requested by explain analyze > or if it's set on globally? That would make it more consistent with > the other explain analyze options? I think it's going to be hard to decouple that altogether. For instance, if track_io_timing were not on but you did EXPLAIN (TIMING), you'd end up with timing info getting sent to the stats collector for just that one statement. That seems a bit weird too. I see where you're coming from but I don't think it's a good idea to add an EXPLAIN option unless you can make the two behaviors (EXPLAIN reporting and stats collection) truly independent. regards, tom lane