Re: Track IO times in pg_stat_io - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Track IO times in pg_stat_io
Date
Msg-id 20230314225615.4twkbpcl6mtini4a@awork3.anarazel.de
Whole thread Raw
In response to Re: Track IO times in pg_stat_io  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Track IO times in pg_stat_io
List pgsql-hackers
Hi,

On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote:
> On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea
> > > > to also check that if counts are not Zero then times are not Zero.
> > >
> > > Yes, I think adding some validation around the relationship between
> > > counts and timing should help prevent developers from forgetting to call
> > > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> > >
> > > However, I think that we cannot check that if IO counts are non-zero
> > > that IO times are non-zero, because the user may not have
> > > track_io_timing enabled. We can check that if IO times are not zero, IO
> > > counts are not zero. I've done this in the attached v3.
> >
> > And even if track_io_timing is enabled, the timer granularity might be so low
> > that we *still* get zeroes.
> >
> > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
>
> And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> stats?

Yes.


> > > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> > >
> > >       if (isExtend)
> > >       {
> > > +             instr_time      io_start,
> > > +                                     io_time;
> > > +
> > >               /* new buffers are zero-filled */
> > >               MemSet((char *) bufBlock, 0, BLCKSZ);
> > > +
> > > +             if (track_io_timing)
> > > +                     INSTR_TIME_SET_CURRENT(io_start);
> > > +             else
> > > +                     INSTR_TIME_SET_ZERO(io_start);
> > > +
> >
> > I wonder if there's an argument for tracking this in the existing IO stats as
> > well. But I guess we've lived with this for a long time...
>
> Not sure I want to include that in this patchset.

No, probably not.


> > >  typedef struct PgStat_BktypeIO
> > >  {
> > > -     PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > +     PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > +     instr_time      times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > >  } PgStat_BktypeIO;
> >
> > Ah, you're going to hate me. We can't store instr_time on disk. There's
> > another patch that gets substantial peformance gains by varying the frequency
> > at which instr_time keeps track of time based on the CPU frequency...
>
> What does that have to do with what we can store on disk?

The frequency can change.


> If so, would it not be enough to do this when reading/writing the stats
> file?

Theoretically yes. But to me it seems cleaner to do it when flushing to shared
stats. See also the overflow issue below.



> void
> instr_time_deserialize(instr_time *dest, int64 *src, int length)
> {
>     for (size_t i = 0; i < length; i++)
>     {
>         INSTR_TIME_SET_ZERO(dest[i]);
>         dest[i].ticks = src[i];
>     }
> }

That wouldn't be correct, because what ticks means will at some point change
between postgres stopping and starting.



> > It also just doesn't have enough range to keep track of system wide
> > time on a larger system. A single backend won't run for 293 years, but
> > with a few thousand backends that's a whole different story.
> >
> > I think we need to accumulate in instr_time, but convert to floating point
> > when flushing stats.
>
> Hmmm. So, are you saying that we need to read from disk when we query
> the view and add that to what is in shared memory? That we only store
> the delta since the last restart in the instr_time array?

No, I don't think I am suggesting that. What I am trying to suggest is that
PendingIOStats should contain instr_time, but that PgStat_IO should contain
PgStat_Counter as microseconds, as before.


> But, I don't see how that avoids the problem of backend total runtime
> being 293 years. We would have to reset and write out the delta whenever
> we thought the times would overflow.

The overflow risk is due to storing nanoseconds (which is what instr_time
stores internally on linux now) - which we don't need to do once
accumulatated. Right now we store them as microseconds.

nanosecond range:
((2**63) - 1)/(10**9*60*60*24*365) -> 292 years
microsecond range:
((2**63) - 1)/(10**6*60*60*24*365) -> 292471 years

If you assume 5k connections continually doing IO, a range of 292 years would
last 21 days at nanosecond resolution. At microsecond resolution it's 58
years.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Melanie Plageman
Date:
Subject: Re: Add pg_walinspect function with block info columns
Next
From: Peter Geoghegan
Date:
Subject: Re: Add pg_walinspect function with block info columns