Re: Track IO times in pg_stat_io - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Track IO times in pg_stat_io |
Date | |
Msg-id | 20230314225615.4twkbpcl6mtini4a@awork3.anarazel.de Whole thread Raw |
In response to | Re: Track IO times in pg_stat_io (Melanie Plageman <melanieplageman@gmail.com>) |
Responses |
Re: Track IO times in pg_stat_io
|
List | pgsql-hackers |
Hi, On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote: > On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <andres@anarazel.de> wrote: > > On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote: > > > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea > > > > to also check that if counts are not Zero then times are not Zero. > > > > > > Yes, I think adding some validation around the relationship between > > > counts and timing should help prevent developers from forgetting to call > > > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant). > > > > > > However, I think that we cannot check that if IO counts are non-zero > > > that IO times are non-zero, because the user may not have > > > track_io_timing enabled. We can check that if IO times are not zero, IO > > > counts are not zero. I've done this in the attached v3. > > > > And even if track_io_timing is enabled, the timer granularity might be so low > > that we *still* get zeroes. > > > > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime, > > And then have pg_stat_reset_shared('io') reset pg_stat_database IO > stats? Yes. > > > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > > > > > > if (isExtend) > > > { > > > + instr_time io_start, > > > + io_time; > > > + > > > /* new buffers are zero-filled */ > > > MemSet((char *) bufBlock, 0, BLCKSZ); > > > + > > > + if (track_io_timing) > > > + INSTR_TIME_SET_CURRENT(io_start); > > > + else > > > + INSTR_TIME_SET_ZERO(io_start); > > > + > > > > I wonder if there's an argument for tracking this in the existing IO stats as > > well. But I guess we've lived with this for a long time... > > Not sure I want to include that in this patchset. No, probably not. > > > typedef struct PgStat_BktypeIO > > > { > > > - PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; > > > + PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; > > > + instr_time times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; > > > } PgStat_BktypeIO; > > > > Ah, you're going to hate me. We can't store instr_time on disk. There's > > another patch that gets substantial peformance gains by varying the frequency > > at which instr_time keeps track of time based on the CPU frequency... > > What does that have to do with what we can store on disk? The frequency can change. > If so, would it not be enough to do this when reading/writing the stats > file? Theoretically yes. But to me it seems cleaner to do it when flushing to shared stats. See also the overflow issue below. > void > instr_time_deserialize(instr_time *dest, int64 *src, int length) > { > for (size_t i = 0; i < length; i++) > { > INSTR_TIME_SET_ZERO(dest[i]); > dest[i].ticks = src[i]; > } > } That wouldn't be correct, because what ticks means will at some point change between postgres stopping and starting. > > It also just doesn't have enough range to keep track of system wide > > time on a larger system. A single backend won't run for 293 years, but > > with a few thousand backends that's a whole different story. > > > > I think we need to accumulate in instr_time, but convert to floating point > > when flushing stats. > > Hmmm. So, are you saying that we need to read from disk when we query > the view and add that to what is in shared memory? That we only store > the delta since the last restart in the instr_time array? No, I don't think I am suggesting that. What I am trying to suggest is that PendingIOStats should contain instr_time, but that PgStat_IO should contain PgStat_Counter as microseconds, as before. > But, I don't see how that avoids the problem of backend total runtime > being 293 years. We would have to reset and write out the delta whenever > we thought the times would overflow. The overflow risk is due to storing nanoseconds (which is what instr_time stores internally on linux now) - which we don't need to do once accumulatated. Right now we store them as microseconds. nanosecond range: ((2**63) - 1)/(10**9*60*60*24*365) -> 292 years microsecond range: ((2**63) - 1)/(10**6*60*60*24*365) -> 292471 years If you assume 5k connections continually doing IO, a range of 292 years would last 21 days at nanosecond resolution. At microsecond resolution it's 58 years. Greetings, Andres Freund
pgsql-hackers by date: