Thread: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}

Hi,

I found that pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}. For example, when you run (track_io_timing should be on):

CREATE EXTENSION pg_stat_statements;
CREATE TEMP TABLE example_table (id serial PRIMARY KEY, data text);
INSERT INTO example_table (data) SELECT 'Some data'
    FROM generate_series(1, 100000);
UPDATE example_table SET data = 'Updated data';
SELECT query, local_blks_read, local_blks_written,
    blk_read_time, blk_write_time FROM pg_stat_statements
    WHERE query like '%UPDATE%';


on master:

query              | UPDATE example_table SET data = $1
local_blks_read    | 467
local_blks_written | 2087
blk_read_time      | 0
blk_write_time     | 0


There are two reasons of that:

1- When local_blks_{read|written} are incremented, pgstat_count_io_op_time() is called with IOOBJECT_TEMP_RELATION. But in pgstat_count_io_op_time(), pgBufferUsage.blk_{read|write}_time increments are called only when io_object is IOOBJECT_RELATION. The first patch attached fixes that.

2- in ExtendBufferedRelLocal() and in ExtendBufferedRelShared(), extend calls increment local_blks_written and shared_blks_written respectively. But these extends are not counted while calculating the blk_write_time. If there is no specific reason to not do that, I think these extends needs to be counted in blk_write_time. The second patch attached does that.

Results after applying first patch:

query              | UPDATE example_table SET data = $1
local_blks_read    | 467
local_blks_written | 2087
blk_read_time      | 0.30085
blk_write_time     | 1.475123

Results after applying both patches:

query              | UPDATE example_table SET data = $1
local_blks_read    | 467
local_blks_written | 2087
blk_read_time      | 0.329597
blk_write_time     | 4.050305


Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft
Attachment
On Fri, Sep 15, 2023 at 9:24 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> I found that pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}

Yes, good catch. This is a bug. I will note that at least in 15 and
likely before, pgBufferUsage.local_blks_written is incremented for
local buffers but pgBufferUsage.blk_write_time is only added to for
shared buffers (in FlushBuffer()). I think it makes sense to propose a
bug fix to stable branches counting blk_write_time for local buffers
as well.

- Melanie



Hi,

On Fri, 15 Sept 2023 at 16:30, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Yes, good catch. This is a bug. I will note that at least in 15 and
> likely before, pgBufferUsage.local_blks_written is incremented for
> local buffers but pgBufferUsage.blk_write_time is only added to for
> shared buffers (in FlushBuffer()). I think it makes sense to propose a
> bug fix to stable branches counting blk_write_time for local buffers
> as well.

I attached the PG16+ (after pg_stat_io) and PG15- (before pg_stat_io)
versions of the same patch.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachment
On Fri, Sep 15, 2023 at 12:34 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Fri, Sep 15, 2023 at 9:24 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > I found that pgBufferUsage.blk_{read|write}_time are zero although there are
pgBufferUsage.local_blks_{read|written}
>
> Yes, good catch. This is a bug. I will note that at least in 15 and
> likely before, pgBufferUsage.local_blks_written is incremented for
> local buffers but pgBufferUsage.blk_write_time is only added to for
> shared buffers (in FlushBuffer()). I think it makes sense to propose a
> bug fix to stable branches counting blk_write_time for local buffers
> as well.

My first thought was to wonder whether this was even a bug. I
remembered that EXPLAIN treats shared, local, and temp buffers as
three separate categories of things. But it seems that someone decided
to conflate two of them for I/O timing purposes:

                        if (has_timing)
                        {
                                appendStringInfoString(es->str, "
shared/local");

^^^^ Notice this bit in particular.

                                if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
                                        appendStringInfo(es->str, " read=%0.3f",

  INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
                                if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
                                        appendStringInfo(es->str, "
write=%0.3f",

  INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
                                if (has_temp_timing)
                                        appendStringInfoChar(es->str, ',');
                        }
                        if (has_temp_timing)
                        {
                                appendStringInfoString(es->str, " temp");
                                if
(!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time))
                                        appendStringInfo(es->str, " read=%0.3f",

  INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
                                if
(!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time))
                                        appendStringInfo(es->str, "
write=%0.3f",

  INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
                        }

Given that, I'm inclined to agree that this is a bug. But we might
need to go through and make sure all of the code that deals with these
counters is on the same page about what the values represent. Maybe
there is code lurking somewhere that thinks these counters are only
supposed to include "shared" rather than, as the fragment above
suggests, "shared/local".

--
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On Tue, 3 Oct 2023 at 19:44, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Given that, I'm inclined to agree that this is a bug. But we might
> need to go through and make sure all of the code that deals with these
> counters is on the same page about what the values represent. Maybe
> there is code lurking somewhere that thinks these counters are only
> supposed to include "shared" rather than, as the fragment above
> suggests, "shared/local".

Thank you for the guidance.

What do you think about the second patch, counting extend calls'
timings in blk_write_time? In my opinion if something increments
{shared|local}_blks_written, then it needs to be counted in
blk_write_time too. I am not sure why it is decided like that.

Regards,
Nazir Bilal Yavuz
Microsoft



On Thu, Oct 5, 2023 at 6:25 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> What do you think about the second patch, counting extend calls'
> timings in blk_write_time? In my opinion if something increments
> {shared|local}_blks_written, then it needs to be counted in
> blk_write_time too. I am not sure why it is decided like that.

I agree that an extend should be counted the same way as a write. But
I'm suspicious that here too we have confusion about whether
blk_write_time is supposed to be covering shared buffers and local
buffers or just shared buffers.

--
Robert Haas
EDB: http://www.enterprisedb.com



On Thu, Oct 05, 2023 at 08:51:40AM -0400, Robert Haas wrote:
> On Thu, Oct 5, 2023 at 6:25 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>> What do you think about the second patch, counting extend calls'
>> timings in blk_write_time? In my opinion if something increments
>> {shared|local}_blks_written, then it needs to be counted in
>> blk_write_time too. I am not sure why it is decided like that.
>
> I agree that an extend should be counted the same way as a write. But
> I'm suspicious that here too we have confusion about whether
> blk_write_time is supposed to be covering shared buffers and local
> buffers or just shared buffers.

Agreed.

In ~14, as far as I can see blk_write_time is only incremented for
shared buffers.  FWIW, I agree that we should improve these stats for
local buffers but I am not on board with a solution where we'd use the
same counter for local and shared buffers while we've historically
only counted the former, because that could confuse existing
monitoring queries.  It seems to me that the right solution is to do
the same separation as temp blocks with two separate counters, without
a backpatch.  I'd like to go as far as renaming blk_read_time and
blk_write_time to respectively shared_blk_read_time and
shared_blk_write_time to know exactly what the type of block dealt
with is when querying this data, particularly for pg_stat_statements's
sake.
--
Michael

Attachment
Hi,

On Tue, 10 Oct 2023 at 03:54, Michael Paquier <michael@paquier.xyz> wrote:
>
> In ~14, as far as I can see blk_write_time is only incremented for
> shared buffers.  FWIW, I agree that we should improve these stats for
> local buffers but I am not on board with a solution where we'd use the
> same counter for local and shared buffers while we've historically
> only counted the former, because that could confuse existing
> monitoring queries.  It seems to me that the right solution is to do
> the same separation as temp blocks with two separate counters, without
> a backpatch.  I'd like to go as far as renaming blk_read_time and
> blk_write_time to respectively shared_blk_read_time and
> shared_blk_write_time to know exactly what the type of block dealt
> with is when querying this data, particularly for pg_stat_statements's
> sake.

Yes, that could be a better solution. Also, having more detailed stats
for shared and local buffers is helpful. I updated patches in line
with that:

0001: Counts extends same way as a write.
0002: Rename blk_{read|write}_time as shared_blk_{read|write}_time.
0003: Add new local_blk_{read|write}_time variables.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachment
On Mon, Oct 16, 2023 at 01:07:07PM +0300, Nazir Bilal Yavuz wrote:
> Yes, that could be a better solution. Also, having more detailed stats
> for shared and local buffers is helpful. I updated patches in line
> with that:
>
> 0001: Counts extends same way as a write.

It can change existing query results on an already-released branch,
but we already count the number of blocks when doing a relation
extension, so counting the write time is something I'd rather fix in
v16.  If you have any objections, let me know.

> 0002: Rename blk_{read|write}_time as shared_blk_{read|write}_time.

Note that `git diff --check` complains here.

--- a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
@@ -30,8 +30,8 @@ CREATE FUNCTION pg_stat_statements(IN showtext boolean,
     OUT local_blks_written int8,
     OUT temp_blks_read int8,
     OUT temp_blks_written int8,
-    OUT blk_read_time float8,
-    OUT blk_write_time float8
+    OUT shared_blk_read_time float8,
+    OUT shared_blk_write_time float8

Doing that in an extension upgrade script is incorrect.  These should
not be touched.

-     Total time the statement spent reading data file blocks, in milliseconds
+     Total time the statement spent reading shared data file blocks, in milliseconds

Or just shared blocks?  That's what we use elsewhere for
pg_stat_statements.  "shared data file blocks" sounds a bit confusing
for relation file blocks read/written from/to shared buffers.

> 0003: Add new local_blk_{read|write}_time variables.

 DATA = pg_stat_statements--1.4.sql \
+       pg_stat_statements--1.11--1.12.sql \
        pg_stat_statements--1.10--1.11.sql \

There is no need to bump again pg_stat_statements, as it has already
been bumped to 1.11 on HEAD per the recent commit 5a3423ad8ee1 from
Daniel.  So the new changes can just be added to 1.11.
--
Michael

Attachment
Hi,

Thanks for the review!

On Tue, 17 Oct 2023 at 11:40, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Oct 16, 2023 at 01:07:07PM +0300, Nazir Bilal Yavuz wrote:
> > Yes, that could be a better solution. Also, having more detailed stats
> > for shared and local buffers is helpful. I updated patches in line
> > with that:
> >
> > 0001: Counts extends same way as a write.
>
> It can change existing query results on an already-released branch,
> but we already count the number of blocks when doing a relation
> extension, so counting the write time is something I'd rather fix in
> v16.  If you have any objections, let me know.

I agree.

>
> > 0002: Rename blk_{read|write}_time as shared_blk_{read|write}_time.
>
> Note that `git diff --check` complains here.
>
> --- a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
> +++ b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
> @@ -30,8 +30,8 @@ CREATE FUNCTION pg_stat_statements(IN showtext boolean,
>      OUT local_blks_written int8,
>      OUT temp_blks_read int8,
>      OUT temp_blks_written int8,
> -    OUT blk_read_time float8,
> -    OUT blk_write_time float8
> +    OUT shared_blk_read_time float8,
> +    OUT shared_blk_write_time float8
>
> Doing that in an extension upgrade script is incorrect.  These should
> not be touched.
>
> -     Total time the statement spent reading data file blocks, in milliseconds
> +     Total time the statement spent reading shared data file blocks, in milliseconds
>
> Or just shared blocks?  That's what we use elsewhere for
> pg_stat_statements.  "shared data file blocks" sounds a bit confusing
> for relation file blocks read/written from/to shared buffers.
>
> > 0003: Add new local_blk_{read|write}_time variables.
>
>  DATA = pg_stat_statements--1.4.sql \
> +       pg_stat_statements--1.11--1.12.sql \
>         pg_stat_statements--1.10--1.11.sql \
>
> There is no need to bump again pg_stat_statements, as it has already
> been bumped to 1.11 on HEAD per the recent commit 5a3423ad8ee1 from
> Daniel.  So the new changes can just be added to 1.11.

I updated patches based on your comments. v4 is attached.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachment
On Tue, Oct 17, 2023 at 04:44:25PM +0300, Nazir Bilal Yavuz wrote:
> I updated patches based on your comments. v4 is attached.

Thanks for the new versions.  I have applied 0001 and backpatched it
for now.  0002 and 0003 look in much cleaner shape than previously.
--
Michael

Attachment
On Tue, Oct 03, 2023 at 12:44:36PM -0400, Robert Haas wrote:
> My first thought was to wonder whether this was even a bug. I
> remembered that EXPLAIN treats shared, local, and temp buffers as
> three separate categories of things. But it seems that someone decided
> to conflate two of them for I/O timing purposes:
>
>             if (has_timing)
>             {
>                 appendStringInfoString(es->str, "
> shared/local");
>
> ^^^^ Notice this bit in particular.

I was reviewing the whole, and this is an oversight specific to
efb0ef909f60, because we've never incremented the write/read counters
for local buffers, even with this commit applied, for both the EXPLAIN
reports and anything stored in pg_stat_statement.  It seems to me that
the origin of the confusion comes down to pg_stat_database where
blk_{read|write}_time increments on both local and shared blocks, but
on EXPLAIN this stuff only reflects data about shared buffers.  So the
"shared" part of the string is right, but the "local" part is not in
v15 and v16.
--
Michael

Attachment
On Wed, Oct 18, 2023 at 02:56:42PM +0900, Michael Paquier wrote:
> Thanks for the new versions.  I have applied 0001 and backpatched it
> for now.  0002 and 0003 look in much cleaner shape than previously.

0002 and 0003 have now been applied.  I have split 0003 into two parts
at the end, mainly on clarity grounds: one for the counters with
EXPLAIN and a second for pg_stat_statements.

There were a few things in the patch set.  Per my notes:
- Some incorrect indentation.
- The additions of show_buffer_usage() did not handle correctly the
addition of a comma before/after the local timing block.  The code
area for has_local_timing needs to check for has_temp_timing, while
the area of has_shared_timing needs to check for (has_local_timing ||
has_temp_timing).
- explain.sgml was missing an update for the information related to
the read/write timings of the local blocks.

Remains what we should do about the "shared/local" string in
show_buffer_usage() for v16 and v15, as "local" is unrelated to that.
Perhaps we should just switch to "shared" in this case or just remove
the string entirely?  Still that implies changing the output of
EXPLAIN on a stable branch in this case, so there could be an argument
for leaving this stuff alone.
--
Michael

Attachment
Hi,

On Thu, 19 Oct 2023 at 08:26, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Oct 18, 2023 at 02:56:42PM +0900, Michael Paquier wrote:
> > Thanks for the new versions.  I have applied 0001 and backpatched it
> > for now.  0002 and 0003 look in much cleaner shape than previously.
>
> 0002 and 0003 have now been applied.  I have split 0003 into two parts
> at the end, mainly on clarity grounds: one for the counters with
> EXPLAIN and a second for pg_stat_statements.
>
> There were a few things in the patch set.  Per my notes:
> - Some incorrect indentation.
> - The additions of show_buffer_usage() did not handle correctly the
> addition of a comma before/after the local timing block.  The code
> area for has_local_timing needs to check for has_temp_timing, while
> the area of has_shared_timing needs to check for (has_local_timing ||
> has_temp_timing).
> - explain.sgml was missing an update for the information related to
> the read/write timings of the local blocks.

Thanks for the changes, push and feedback!

>
> Remains what we should do about the "shared/local" string in
> show_buffer_usage() for v16 and v15, as "local" is unrelated to that.
> Perhaps we should just switch to "shared" in this case or just remove
> the string entirely?  Still that implies changing the output of
> EXPLAIN on a stable branch in this case, so there could be an argument
> for leaving this stuff alone.

I think switching it to 'shared' makes sense. That shouldn't confuse
existing monitoring queries much as the numbers won't change, right?
Also, if we keep 'shared/local' there could be similar complaints to
this thread in the future; so, at least adding comments can be
helpful.

Regards,
Nazir Bilal Yavuz
Microsoft



On Fri, Oct 27, 2023 at 04:58:20PM +0300, Nazir Bilal Yavuz wrote:
> I think switching it to 'shared' makes sense. That shouldn't confuse
> existing monitoring queries much as the numbers won't change, right?
> Also, if we keep 'shared/local' there could be similar complaints to
> this thread in the future; so, at least adding comments can be
> helpful.

The problem is that it may impact existing tools that do explain
output deparsing.  One of them is https://explain.depesz.com/ that
Hubert Depesz Lubaczewski has implemented, and it would be sad to
break anything related to it.

I am adding Hubert in CC for comments about changing this
"shared/local" to "shared" on a released branch.  Knowing that
"shared" and "local" will need to be handled as separate terms in 17~
anyway, perhaps that's not a big deal, but let's be sure.
--
Michael

Attachment
On Mon, Oct 30, 2023 at 10:45:05AM +0900, Michael Paquier wrote:
> On Fri, Oct 27, 2023 at 04:58:20PM +0300, Nazir Bilal Yavuz wrote:
> > I think switching it to 'shared' makes sense. That shouldn't confuse
> > existing monitoring queries much as the numbers won't change, right?
> > Also, if we keep 'shared/local' there could be similar complaints to
> > this thread in the future; so, at least adding comments can be
> > helpful.
>
> The problem is that it may impact existing tools that do explain
> output deparsing.  One of them is https://explain.depesz.com/ that
> Hubert Depesz Lubaczewski has implemented, and it would be sad to
> break anything related to it.
>
> I am adding Hubert in CC for comments about changing this
> "shared/local" to "shared" on a released branch.  Knowing that
> "shared" and "local" will need to be handled as separate terms in 17~
> anyway, perhaps that's not a big deal, but let's be sure.

Hi,
some things will definitely break, but that's 100% OK. The change seems
needed, and I can update my parser to deal with it :)

Best regards,

depesz




On Mon, Oct 30, 2023 at 03:14:16PM +0100, hubert depesz lubaczewski wrote:
> some things will definitely break, but that's 100% OK. The change seems
> needed, and I can update my parser to deal with it :)

Thanks for the input.  I was looking yesterday if this code was
available somewhere, but couldn't find it..  Until this morning:
https://gitlab.com/depesz/explain.depesz.com.git

And..  It looks like things would become better if we change
"shared/local" to "shared", because the parsing code seems to have an
issue once you add a '/'.  All the fields in I/O Timings are
considered as part of a Node, and they're just included in the output.
Now, pasting a plan that includes "shared/local" drops entirely the
string from the output result, so some information is lost.  In short,
imagine that we have the following string in a node:
I/O Timings: shared/local write=23.77

This would show up like that, meaning that the context where the
write/read timings happened is lost:
I/O Timings: write=23.77

If we switch back to "shared", the context would be kept around.  Of
course, this does not count for all the parsers that may be out
there, but at least that's something.
--
Michael

Attachment
On Tue, Oct 31, 2023 at 08:17:52AM +0900, Michael Paquier wrote:
> Thanks for the input.  I was looking yesterday if this code was
> available somewhere, but couldn't find it..  Until this morning:
> https://gitlab.com/depesz/explain.depesz.com.git

Well, the parser itself is https://gitlab.com/depesz/Pg--Explain/ :)

> And..  It looks like things would become better if we change
> "shared/local" to "shared", because the parsing code seems to have an
> issue once you add a '/'.  All the fields in I/O Timings are
> considered as part of a Node, and they're just included in the output.
> Now, pasting a plan that includes "shared/local" drops entirely the
> string from the output result, so some information is lost.  In short,
> imagine that we have the following string in a node:
> I/O Timings: shared/local write=23.77
>
> This would show up like that, meaning that the context where the
> write/read timings happened is lost:
> I/O Timings: write=23.77
>
> If we switch back to "shared", the context would be kept around.  Of
> course, this does not count for all the parsers that may be out
> there, but at least that's something.

Well, if it's possible to deduce what is the meaning in given line,
I can add the logic to do the deduction to parser.

Also, I want to say that I appreciate being looped in the discussion.

Best regards,

depesz




On Tue, Oct 31, 2023 at 03:11:03PM +0100, hubert depesz lubaczewski wrote:
> On Tue, Oct 31, 2023 at 08:17:52AM +0900, Michael Paquier wrote:
>> Thanks for the input.  I was looking yesterday if this code was
>> available somewhere, but couldn't find it..  Until this morning:
>> https://gitlab.com/depesz/explain.depesz.com.git
>
> Well, the parser itself is https://gitlab.com/depesz/Pg--Explain/ :)

That was close enough ;)

> Well, if it's possible to deduce what is the meaning in given line,
> I can add the logic to do the deduction to parser.
>
> Also, I want to say that I appreciate being looped in the discussion.

I lost sight of this thread, so my apologies for the delay.  The patch
to fix the description of the EXPLAIN field has now been applied to
v15 and v16.
--
Michael

Attachment
> > Well, if it's possible to deduce what is the meaning in given line,
> > I can add the logic to do the deduction to parser.
> > Also, I want to say that I appreciate being looped in the discussion.
> I lost sight of this thread, so my apologies for the delay.  The patch
> to fix the description of the EXPLAIN field has now been applied to
> v15 and v16.

Thanks. Will do my best to update the parser soon.

Best regards,

depesz