Re: Track IO times in pg_stat_io - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Track IO times in pg_stat_io
Date
Msg-id CAAKRu_Z1x5FX6NA2Qa2S4w_=uzWTP-QcyomTOVUkgBYOVkxduw@mail.gmail.com
Whole thread Raw
In response to Re: Track IO times in pg_stat_io  (Andres Freund <andres@anarazel.de>)
Responses Re: Track IO times in pg_stat_io
List pgsql-hackers
v5 attached mostly addresses instr_time persistence issues.

On Tue, Mar 14, 2023 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote:
> > On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:
> > > On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
> > > > > to also check that if counts are not Zero then times are not Zero.
> > > >
> > > > Yes, I think adding some validation around the relationship between
> > > > counts and timing should help prevent developers from forgetting to call
> > > > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> > > >
> > > > However, I think that we cannot check that if IO counts are non-zero
> > > > that IO times are non-zero, because the user may not have
> > > > track_io_timing enabled. We can check that if IO times are not zero, IO
> > > > counts are not zero. I've done this in the attached v3.
> > >
> > > And even if track_io_timing is enabled, the timer granularity might be so low
> > > that we *still* get zeroes.
> > >
> > > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
> >
> > And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> > stats?
>
> Yes.

I think this makes sense but I am hesitant to do it in this patchset,
because it feels a bit hidden...maybe?

But, if you feel strongly, I will make the change.

> > > >  typedef struct PgStat_BktypeIO
> > > >  {
> > > > -     PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > > +     PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > > +     instr_time      times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > >  } PgStat_BktypeIO;
> > >
> > > Ah, you're going to hate me. We can't store instr_time on disk. There's
> > > another patch that gets substantial peformance gains by varying the frequency
> > > at which instr_time keeps track of time based on the CPU frequency...
> >
> > What does that have to do with what we can store on disk?
>
> The frequency can change.

Ah, I see.

> > If so, would it not be enough to do this when reading/writing the stats
> > file?
>
> Theoretically yes. But to me it seems cleaner to do it when flushing to shared
> stats. See also the overflow issue below.
>
> > > It also just doesn't have enough range to keep track of system wide
> > > time on a larger system. A single backend won't run for 293 years, but
> > > with a few thousand backends that's a whole different story.
> > >
> > > I think we need to accumulate in instr_time, but convert to floating point
> > > when flushing stats.
> >
> > Hmmm. So, are you saying that we need to read from disk when we query
> > the view and add that to what is in shared memory? That we only store
> > the delta since the last restart in the instr_time array?
>
> No, I don't think I am suggesting that. What I am trying to suggest is that
> PendingIOStats should contain instr_time, but that PgStat_IO should contain
> PgStat_Counter as microseconds, as before.

So, I've modified the code to make a union of instr_time and
PgStat_Counter in PgStat_BktypeIO. I am not quite sure if this is okay.
I store in microsec and then in pg_stat_io, I multiply to get
milliseconds for display.

I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF()
like [1], but, in the end I actually think I would end up with more
operations because of the various different counters needing to be
updated. As it is now, I do a single subtract and a few adds (one for
each of the different statistics objects tracking IO times
(pgBufferUsage, pgStatBlockWrite/ReadTime). Whereas, I would need to do
an accum diff for every one of those.

- Melanie

[1] https://www.postgresql.org/message-id/1feedb83-7aa9-cb4b-5086-598349d3f555%40gmail.com

Attachment

pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: doc: mentioned CREATE+ATTACH PARTITION as an alternative to CREATE TABLE..PARTITION OF
Next
From: Tomas Vondra
Date:
Subject: Re: Add LZ4 compression in pg_dump