Thread: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi,

Currently pg_stat_bgwriter.buffers_backend is pretty useless for gauging
whether backends are doing writes they shouldn't do. That's because it
counts things that either cannot be done, or are unlikely to be done, by
other parts of the system (checkpointer, bgwriter). In particular,
extending the file cannot currently be done by any other type of
process, yet it is counted. When using a buffer access strategy it is
also very likely that writes have to be done by the 'dirtying' backend
itself, as the buffer will be reused soon after (when not previously in
s_b, that is). Additionally, pg_stat_bgwriter.buffers_backend also
counts writes done by autovacuum et al.

I think it'd make sense to at least split buffers_backend into

  buffers_backend_extend,
  buffers_backend_write,
  buffers_backend_write_strat

but it could also be worthwhile to expand it into

  buffers_backend_extend,
  buffers_{backend,checkpoint,bgwriter,autovacuum}_write,
  buffers_{backend,autovacuum}_write_strat

possibly by internally (in contrast to the SQL level) having just
counter arrays indexed by backend type.

It's also noteworthy that buffers_backend is accounted in an absurd
manner. One might think that writes are accounted from backend -> shared
memory or such. But instead it works like this:

1) The backend flushes a buffer in bufmgr.c, accounting for backend
   *write time*.
2) mdwrite() writes the block and registers a sync request, which
   forwards the sync request to the checkpointer.
3) ForwardSyncRequest(), when not called by the bgwriter, increments
   CheckpointerShmem->num_backend_writes.
4) The checkpointer, whenever doing AbsorbSyncRequests(), moves
   CheckpointerShmem->num_backend_writes to
   BgWriterStats.m_buf_written_backend (local memory!).
5) Occasionally it calls pgstat_send_bgwriter(), which sends the data to
   pgstat (which the bgwriter also does).
6) pgstat then updates the shared memory used by the display functions.

It's worth noting that backend buffer read/write *time* is accounted
differently: that's done via pgstat_send_tabstat().

I think there's very little excuse for the indirection via the
checkpointer. Besides being architecturally weird, it actually requires
that we continue to wake up the checkpointer over and over instead of
optimizing how and when we submit fsync requests.

As far as I can tell we're also simply not accounting at all for writes
done outside of shared buffers. All writes done directly through
smgrwrite()/smgrextend() aren't accounted anywhere, as far as I can
tell.

I think we also count things as writes that aren't writes: mdtruncate()
is AFAICT counted as one backend write for each segment. Which seems
weird to me.

Lastly, I don't understand the point of sending fixed-size stats, like
the stuff underlying pg_stat_bgwriter, through pgstat IPC. While I don't
like its architecture, we obviously need something like pgstat to handle
variable amounts of stats (database-, table-level etc. stats). But that
doesn't at all apply to these types of global stats.

Greetings,

Andres Freund
On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Currently pg_stat_bgwriter.buffers_backend is pretty useless to gauge
> whether backends are doing writes they shouldn't do. That's because it
> counts things that are either unavoidably or unlikely doable by other
> parts of the system (checkpointer, bgwriter).
> In particular extending the file can not currently be done by any
> other type of process, yet is counted. When using a buffer access
> strategy it is also very likely that writes have to be done by the
> 'dirtying' backend itself, as the buffer will be reused soon after (when
> not previously in s_b that is).

Yeah. That's quite annoying.

> Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
> autovacuum et al.
>
> I think it'd make sense to at least split buffers_backend into
> buffers_backend_extend,
> buffers_backend_write,
> buffers_backend_write_strat
>
> but it could also be worthwhile to expand it into
> buffers_backend_extend,
> buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> buffers_{backend,autovacuum}_write_strat

Given that these are individual global counters, I don't really see
any reason not to expand it to the bigger set of counters. It's easy
enough to add them up together later if needed.

> Possibly by internally, in contrast to SQL level, having just counter
> arrays indexed by backend types.
>
> It's also noteworthy that buffers_backend is accounted in an absurd
> manner. One might think that writes are accounted from backend -> shared
> memory or such. But instead it works like this:
>
> 1) backend flushes buffer in bufmgr.c, accounts for backend *write time*
> 2) mdwrite writes and registers a sync request, which forwards the sync
>    request to checkpointer
> 3) ForwardSyncRequest(), when not called by bgwriter, increments
>    CheckpointerShmem->num_backend_writes
> 4) checkpointer, whenever doing AbsorbSyncRequests(), moves
>    CheckpointerShmem->num_backend_writes to
>    BgWriterStats.m_buf_written_backend (local memory!)
> 5) Occasionally it calls pgstat_send_bgwriter(), which sends the data to
>    pgstat (which bgwriter also does)
> 6) Which then updates the shared memory used by the display functions
>
> Worthwhile to note that backend buffer read/write *time* is accounted
> differently. That's done via pgstat_send_tabstat().
>
> I think there's very little excuse for the indirection via checkpointer,
> besides architecturally being weird, it actually requires that we
> continue to wake up checkpointer over and over instead of optimizing how
> and when we submit fsync requests.
>
> As far as I can tell we're also simply not accounting at all for writes
> done outside of shared buffers. All writes done directly through
> smgrwrite()/smgrextend() aren't accounted anywhere as far as I can tell.
>
> I think we also count things as writes that aren't writes: mdtruncate()
> is AFAICT counted as one backend write for each segment. Which seems
> weird to me.

It's at least slightly weird :) Might it be worth counting truncate
events separately?

> Lastly, I don't understand what the point of sending fixed size stats,
> like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> I don't like its architecture, we obviously need something like pgstat
> to handle variable amounts of stats (database, table level etc
> stats). But that doesn't at all apply to these types of global stats.

That part has annoyed me as well a few times. +1 for just moving that
into a global shared memory. Given that we don't really care about
things being in sync between those different counters *or* if we lose
a bit of data (which the stats collector is designed to do), we could
even do that without a lock?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Hi,

On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
> > autovacuum et al.
> >
> > I think it'd make sense to at least split buffers_backend into
> > buffers_backend_extend,
> > buffers_backend_write,
> > buffers_backend_write_strat
> >
> > but it could also be worthwhile to expand it into
> > buffers_backend_extend,
> > buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> > buffers_{backend,autovacuum}_write_strat
>
> Given that these are individual global counters, I don't really see
> any reason not to expand it to the bigger set of counters. It's easy
> enough to add them up together later if needed.

Are you agreeing to
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
or are you suggesting further ones?

> > I think we also count things as writes that aren't writes: mdtruncate()
> > is AFAICT counted as one backend write for each segment. Which seems
> > weird to me.
>
> It's at least slightly weird :) Might it be worth counting truncate
> events separately?

Is that really something interesting? Feels like it'd have to be done at
a higher level to be useful. E.g. the truncates done by TRUNCATE (when in
the same xact as creation) and by VACUUM are quite different. I think it'd
be better to just not include it.

> > Lastly, I don't understand what the point of sending fixed size stats,
> > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > I don't like its architecture, we obviously need something like pgstat
> > to handle variable amounts of stats (database, table level etc
> > stats). But that doesn't at all apply to these types of global stats.
>
> That part has annoyed me as well a few times. +1 for just moving that
> into a global shared memory. Given that we don't really care about
> things being in sync between those different counters *or* if we lose
> a bit of data (which the stats collector is designed to do), we could
> even do that without a lock?

I don't think we'd quite want to do it without any (single counter)
synchronization - high concurrency setups would be pretty likely to
lose values that way. I suspect the best would be to have a struct in
shared memory that contains the potential counters for each potential
process, and then sum them up when actually wanting the concrete
value. That way we avoid unnecessary contention, in contrast to having a
single shared memory value for each (which would just ping-pong between
different sockets and store buffers). There's a few details like how
exactly to implement resetting the counters, but ...

Thanks,

Andres Freund
On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
> > > autovacuum et al.
> > >
> > > I think it'd make sense to at least split buffers_backend into
> > > buffers_backend_extend,
> > > buffers_backend_write,
> > > buffers_backend_write_strat
> > >
> > > but it could also be worthwhile to expand it into
> > > buffers_backend_extend,
> > > buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> > > buffers_{backend,autovacuum}_write_strat
> >
> > Given that these are individual global counters, I don't really see
> > any reason not to expand it to the bigger set of counters. It's easy
> > enough to add them up together later if needed.
>
> Are you agreeing to
> buffers_{backend,checkpoint,bgwriter,autovacuum}_write
> or are you suggesting further ones?

The former.

> > > I think we also count things as writes that aren't writes: mdtruncate()
> > > is AFAICT counted as one backend write for each segment. Which seems
> > > weird to me.
> >
> > It's at least slightly weird :) Might it be worth counting truncate
> > events separately?
>
> Is that really something interesting? Feels like it'd have to be done at
> a higher level to be useful. E.g. the truncates done by TRUNCATE (when in
> the same xact as creation) and by VACUUM are quite different. I think it'd
> be better to just not include it.

Yeah, you're probably right. It certainly makes very little sense where
it is now.

> > > Lastly, I don't understand what the point of sending fixed size stats,
> > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > > I don't like its architecture, we obviously need something like pgstat
> > > to handle variable amounts of stats (database, table level etc
> > > stats). But that doesn't at all apply to these types of global stats.
> >
> > That part has annoyed me as well a few times. +1 for just moving that
> > into a global shared memory. Given that we don't really care about
> > things being in sync between those different counters *or* if we lose
> > a bit of data (which the stats collector is designed to do), we could
> > even do that without a lock?
>
> I don't think we'd quite want to do it without any (single counter)
> synchronization - high concurrency setups would be pretty likely to
> lose values that way. I suspect the best would be to have a struct in
> shared memory that contains the potential counters for each potential
> process, and then sum them up when actually wanting the concrete
> value. That way we avoid unnecessary contention, in contrast to having a
> single shared memory value for each (which would just ping-pong between
> different sockets and store buffers). There's a few details like how
> exactly to implement resetting the counters, but ...

Right. Each process gets to do their own write, but still in shared
memory. But do you need to lock them when reading them (for the
summary)? That's the part where I figured you could just read and
summarize them, and accept the possible loss.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Hi,

On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > > Lastly, I don't understand what the point of sending fixed size stats,
> > > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > > > I don't like its architecture, we obviously need something like pgstat
> > > > to handle variable amounts of stats (database, table level etc
> > > > stats). But that doesn't at all apply to these types of global stats.
> > >
> > > That part has annoyed me as well a few times. +1 for just moving that
> > > into a global shared memory. Given that we don't really care about
> > > things being in sync between those different counters *or* if we lose
> > > a bit of data (which the stats collector is designed to do), we could
> > > even do that without a lock?
> >
> > I don't think we'd quite want to do it without any (single counter)
> > synchronization - high concurrency setups would be pretty likely to
> > lose values that way. I suspect the best would be to have a struct in
> > shared memory that contains the potential counters for each potential
> > process, and then sum them up when actually wanting the concrete
> > value. That way we avoid unnecessary contention, in contrast to having a
> > single shared memory value for each (which would just ping-pong between
> > different sockets and store buffers). There's a few details like how
> > exactly to implement resetting the counters, but ...
>
> Right. Each process gets to do their own write, but still in shared
> memory. But do you need to lock them when reading them (for the
> summary)? That's the part where I figured you could just read and
> summarize them, and accept the possible loss.

Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
integers can be read / written without a danger of torn values, and I
don't think we need perfect cross-counter accuracy. To deal with the few
platforms without 64bit "single copy atomicity", we can just use
pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
fall back to using locked operations for those platforms. So I don't
think there's actually a danger of loss.

Obviously we could also use atomic ops to increment the value, but I'd
rather not add all those atomic operations, even if they're on
uncontended cachelines. It'd allow us to reset the backend values more
easily by just swapping in a 0, which we can't do if the backend
increments non-atomically. But I think we could instead just have one
global "bias" value to implement resets (by subtracting that from the
summarized value, and storing the current sum when resetting). Or use
the new global barrier to trigger a reset. Or something similar.

Greetings,

Andres Freund
Hello.

At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund <andres@anarazel.de> wrote in
> Hi,

I feel the same on the specific issues brought up upthread.

> On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> > On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > Lastly, I don't understand what the point of sending fixed size stats,
> > > > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > > > > I don't like its architecture, we obviously need something like pgstat
> > > > > to handle variable amounts of stats (database, table level etc
> > > > > stats). But that doesn't at all apply to these types of global stats.
> > > >
> > > > That part has annoyed me as well a few times. +1 for just moving that
> > > > into a global shared memory. Given that we don't really care about
> > > > things being in sync between those different counters *or* if we lose
> > > > a bit of data (which the stats collector is designed to do), we could
> > > > even do that without a lock?
> > >
> > > I don't think we'd quite want to do it without any (single counter)
> > > synchronization - high concurrency setups would be pretty likely to
> > > lose values that way. I suspect the best would be to have a struct in
> > > shared memory that contains the potential counters for each potential
> > > process, and then sum them up when actually wanting the concrete
> > > value. That way we avoid unnecessary contention, in contrast to having a
> > > single shared memory value for each (which would just ping-pong between
> > > different sockets and store buffers). There's a few details like how
> > > exactly to implement resetting the counters, but ...
> >
> > Right. Each process gets to do their own write, but still in shared
> > memory. But do you need to lock them when reading them (for the
> > summary)? That's the part where I figured you could just read and
> > summarize them, and accept the possible loss.
>
> Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
> integers can be read / written without a danger of torn values, and I
> don't think we need perfect cross-counter accuracy. To deal with the few
> platforms without 64bit "single copy atomicity", we can just use
> pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
> fall back to using locked operations for those platforms. So I don't
> think there's actually a danger of loss.
>
> Obviously we could also use atomic ops to increment the value, but I'd
> rather not add all those atomic operations, even if they're on
> uncontended cachelines. It'd allow us to reset the backend values more
> easily by just swapping in a 0, which we can't do if the backend
> increments non-atomically. But I think we could instead just have one
> global "bias" value to implement resets (by subtracting that from the
> summarized value, and storing the current sum when resetting). Or use
> the new global barrier to trigger a reset. Or something similar.

Fixed-size or global stats are a suitable starting point for the
shared-memory stats collector. In the case of buffers_*_write, the
global stats entry for each process needs just 8 bytes, plus maybe an
extra 8 bytes for the bias value. I'm not sure how many counters like
this there are, but is such a footprint acceptable? (Each backend
already uses the same amount of local memory for pgstat use, though.)

Anyway, I will do something like that as a trial, maybe by adding a
member to PgBackendStatus and one global shared value for the bias:

 	int64		st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+	PgBackendStatsCounters counters;
 } PgBackendStatus;

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sun, Jan 26, 2020 at 11:21 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund <andres@anarazel.de> wrote in
> > On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> > > On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > > > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > Lastly, I don't understand what the point of sending fixed size stats,
> > > > > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > > > > > I don't like its architecture, we obviously need something like pgstat
> > > > > > to handle variable amounts of stats (database, table level etc
> > > > > > stats). But that doesn't at all apply to these types of global stats.
> > > > >
> > > > > That part has annoyed me as well a few times. +1 for just moving that
> > > > > into a global shared memory. Given that we don't really care about
> > > > > things being in sync between those different counters *or* if we lose
> > > > > a bit of data (which the stats collector is designed to do), we could
> > > > > even do that without a lock?
> > > >
> > > > I don't think we'd quite want to do it without any (single counter)
> > > > synchronization - high concurrency setups would be pretty likely to
> > > > lose values that way. I suspect the best would be to have a struct in
> > > > shared memory that contains the potential counters for each potential
> > > > process, and then sum them up when actually wanting the concrete
> > > > value. That way we avoid unnecessary contention, in contrast to having a
> > > > single shared memory value for each (which would just ping-pong between
> > > > different sockets and store buffers). There's a few details like how
> > > > exactly to implement resetting the counters, but ...
> > >
> > > Right. Each process gets to do their own write, but still in shared
> > > memory. But do you need to lock them when reading them (for the
> > > summary)? That's the part where I figured you could just read and
> > > summarize them, and accept the possible loss.
> >
> > Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
> > integers can be read / written without a danger of torn values, and I
> > don't think we need perfect cross-counter accuracy. To deal with the few
> > platforms without 64bit "single copy atomicity", we can just use
> > pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
> > fall back to using locked operations for those platforms. So I don't
> > think there's actually a danger of loss.
> >
> > Obviously we could also use atomic ops to increment the value, but I'd
> > rather not add all those atomic operations, even if they're on
> > uncontended cachelines. It'd allow us to reset the backend values more
> > easily by just swapping in a 0, which we can't do if the backend
> > increments non-atomically. But I think we could instead just have one
> > global "bias" value to implement resets (by subtracting that from the
> > summarized value, and storing the current sum when resetting). Or use
> > the new global barrier to trigger a reset. Or something similar.
>
> Fixed-size or global stats are a suitable starting point for the
> shared-memory stats collector. In the case of buffers_*_write, the
> global stats entry for each process needs just 8 bytes, plus maybe an
> extra 8 bytes for the bias value. I'm not sure how many counters like
> this there are, but is such a footprint acceptable? (Each backend
> already uses the same amount of local memory for pgstat use, though.)
>
> Anyway, I will do something like that as a trial, maybe by adding a
> member to PgBackendStatus and one global shared value for the bias:
>
>  	int64		st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
> +	PgBackendStatsCounters counters;
> } PgBackendStatus;
>

So, I took a stab at implementing this in PgBackendStatus. The attached
patch is not quite on top of current master, so, alas, don't try and
apply it. I went to rebase today and realized I needed to make some
changes in light of e1025044cd4; however, I wanted to share this WIP so
that I could pose a few questions that I imagine will still be relevant
after I rewrite the patch.

I removed buffers_backend and buffers_backend_fsync from
pg_stat_bgwriter and have created a new view which tracks

- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
  BufferAccessStrategy which, were they not to use this strategy,
  could perhaps have been avoided if a clean shared buffer was
  available
- number of fsyncs done by a backend which could have been done by the
  checkpointer if the sync queue had not been full

This view currently only tracks writes and extends that go through
shared buffers, and fsyncs of shared buffers (which, AFAIK, are the only
things fsync'd through the SyncRequest machinery currently).
BufferAlloc() and SyncOneBuffer() are the main points at which the
tracking is done.

I can definitely expand this, but I want to make sure that we are
tracking the right kind of information. num_backend_writes and
num_backend_fsync were intended (though they were not accurate) to count
buffers that backends had to end up writing themselves, and fsyncs that
backends had to end up doing themselves, which could have been avoided
with a different configuration (or, I suppose, a different
workload/different data, etc). That is, they were meant to tell you
whether checkpointer and bgwriter were keeping up and/or whether the
size of shared buffers was adequate.

In implementing this counting per backend, it is easy for all types of
backends to keep track of the number of writes, extends, fsyncs, and
strategy writes they are doing. So, as recommended upthread, I have
added columns in the view for the number of writes for checkpointer and
bgwriter and others. Thus, this view becomes more than just stats on
"avoidable I/O done by backends".

So, my question is: does it make sense to track all extends -- those to
extend the fsm and visimap and those done when making a new relation or
index? Is that information useful? If so, is it different from the
extends done through shared buffers? Should it be tracked separately?
Also, if we care about all of the extends, then it seems a bit annoying
to pepper the counting all over the place when it really just needs to
be done in smgrextend() -- even though maybe a stats function doesn't
belong in that API.

Another question I have is: should the number of extends count every
single block extended, or should we try to track the initiation of a set
of extends (all of those added in RelationAddExtraBlocks(), in this
case)?

When it comes to fsync counting, I only count the fsyncs counted by the
previous code -- that is, fsyncs done by backends themselves when the
checkpointer sync request queue was full. I did the counting in the same
place in the checkpointer code -- in ForwardSyncRequest() -- partially
because there did not seem to be another good place to do it.
register_dirty_segment() returns void (I thought about having it return
a bool to indicate whether it fsync'd or registered the fsync, which
seemed alright, but mdextend(), mdwrite() etc. also return void), so
there is no way to propagate back up to the bufmgr that the process had
to do its own fsync without mucking with the md.c API. And, since the
checkpointer is the one processing these sync requests anyway, it
actually seems okay to do it in the checkpointer code.

I'm not counting fsyncs that are "unavoidable" in the sense that they
couldn't be avoided by changing settings/workload etc -- like those done
when building an index or creating/rewriting/copying a table. Is it
useful to count these? It seems like counting them would make the number
of "avoidable fsyncs by backends" less useful. Also, should we count how
many fsyncs checkpointer has done (I have to check if there is already a
stat for that)? Is that useful in this context?

Of course, this view, when grown, will begin to overlap with pg_statio,
which is another consideration. What is its identity? I would find
"avoidable I/O" -- either avoidable entirely, or avoidable for that
particular type of process -- to be useful. Or maybe it should have a
more expansive mandate. Maybe it would be useful to aggregate some of
the info from pg_stat_statements at a higher level -- for example,
shared_blks_read counted across many statements for a period of
time/context in which we expected the relation to be in shared buffers
becomes potentially interesting.

As for the way I have recorded strategy writes -- it is quite inelegant,
but I wanted to make sure that I only counted a strategy write as one in
which the backend wrote out the dirty buffer from its strategy ring but
did not check whether there was any clean buffer in shared buffers more
generally (so it is *potentially* an avoidable write). I'm not sure if
this distinction is useful to anyone. I haven't done enough with
BufferAccessStrategies to know what I'd want to know about them when
developing or using Postgres. However, if I don't need to be so careful,
it will make the code much simpler (though I'm sure I can improve the
code regardless).

As for the implementation of the counters themselves, I appreciate that
it isn't very nice to have a bunch of random members in PgBackendStatus
to count all of these writes, extends, and fsyncs. I considered whether
I could add params that were used for all command types to
st_progress_param, but I haven't looked into it yet. Alternatively, I
could create an array just for these kinds of stats in PgBackendStatus.
Though, I imagine that I should take a look at the changes that have
been made recently to this area and at the shared memory stats patch.

Oh, also, there should be a way to reset the stats, especially if we add
more extends and fsyncs that happen at the time of relation/index
creation. I, at least, would find it useful to see these numbers once
the database is at some kind of steady state.

Oh, and src/test/regress/sql/stats.sql will fail and, of course, I don't
intend to add that SELECT from the view to regress; it was just for
testing purposes to make sure the view was working.

--
Melanie
Hi,

On 2021-04-12 19:49:36 -0700, Melanie Plageman wrote:
> So, I took a stab at implementing this in PgBackendStatus.

Cool!

> The attached patch is not quite on top of current master, so, alas,
> don't try and apply it. I went to rebase today and realized I needed
> to make some changes in light of e1025044cd4; however, I wanted to
> share this WIP so that I could pose a few questions that I imagine
> will still be relevant after I rewrite the patch.
>
> I removed buffers_backend and buffers_backend_fsync from
> pg_stat_bgwriter and have created a new view which tracks
> - number of shared buffers the checkpointer and bgwriter write out
> - number of shared buffers a regular backend is forced to flush
> - number of extends done by a regular backend through shared buffers
> - number of buffers flushed by a backend or autovacuum using a
>   BufferAccessStrategy which, were they not to use this strategy,
>   could perhaps have been avoided if a clean shared buffer was
>   available
> - number of fsyncs done by a backend which could have been done by the
>   checkpointer if the sync queue had not been full

I wonder if leaving buffers_alloc in pg_stat_bgwriter makes sense after
this? I'm tempted to move that to pg_stat_buffers or such...

I'm not quite convinced by having separate columns for checkpointer,
bgwriter, etc. That doesn't seem to scale all that well. What if we
instead made it a view that has one row for each BackendType?

> In implementing this counting per backend, it is easy for all types of
> backends to keep track of the number of writes, extends, fsyncs, and
> strategy writes they are doing. So, as recommended upthread, I have
> added columns in the view for the number of writes for checkpointer and
> bgwriter and others. Thus, this view becomes more than just stats on
> "avoidable I/O done by backends".
>
> So, my question is: does it make sense to track all extends -- those to
> extend the fsm and visimap and those done when making a new relation or
> index? Is that information useful? If so, is it different from the
> extends done through shared buffers? Should it be tracked separately?

I don't fully understand what you mean with "extends done through shared
buffers"?

> Another question I have is: should the number of extends count every
> single block extended, or should we try to track the initiation of a set
> of extends (all of those added in RelationAddExtraBlocks(), in this
> case)?

I think it should be 8k blocks, i.e. RelationAddExtraBlocks() should be
tracked as many individual extends. Not just because it's implemented
that way, but more importantly because it should be in BLCKSZ units. If
we later add some actually batched operations, we can have separate
stats for that.

> Of course, this view, when grown, will begin to overlap with pg_statio,
> which is another consideration. What is its identity? I would find
> "avoidable I/O" -- either avoidable entirely, or avoidable for that
> particular type of process -- to be useful.

I think it's fine to overlap with pg_statio_* - those are for individual
objects, so it seems to be expected to overlap with coarser stats.

> Or maybe it should have a more expansive mandate. Maybe it would be
> useful to aggregate some of the info from pg_stat_statements at a higher
> level -- for example, shared_blks_read counted across many statements
> for a period of time/context in which we expected the relation to be in
> shared buffers becomes potentially interesting.

Let's do something more basic first...

Greetings,

Andres Freund
On Thu, Apr 15, 2021 at 7:59 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-04-12 19:49:36 -0700, Melanie Plageman wrote:
> So, I took a stab at implementing this in PgBackendStatus.
Cool!
Just a note on v2 of the patch -- the diff for the changes I made to
pgstatfuncs.c is pretty atrocious and hard to read. I tried using a
different diff algorithm, to no avail.
> The attached patch is not quite on top of current master, so, alas,
> don't try and apply it. I went to rebase today and realized I needed
> to make some changes in light of e1025044cd4, however, I wanted to
> share this WIP so that I could pose a few questions that I imagine
> will still be relevant after I rewrite the patch.
Regarding the refactor done in e1025044cd4:
Most of the functions I've added access variables in PgBackendStatus, so
I put most of them in backend_status.h/c. However, technically, these
are stats which are aggregated over time, which e1025044cd4 says should
go in pgstat.c/h. I could move some of it, but I hadn't tried to do so,
as it made a few things inconvenient, and, I wasn't sure if it was the
right thing to do anyway.
>
> I removed buffers_backend and buffers_backend_fsync from
> pg_stat_bgwriter and have created a new view which tracks
> - number of shared buffers the checkpointer and bgwriter write out
> - number of shared buffers a regular backend is forced to flush
> - number of extends done by a regular backend through shared buffers
> - number of buffers flushed by a backend or autovacuum using a
> BufferAccessStrategy which, were they not to use this strategy,
> could perhaps have been avoided if a clean shared buffer was
> available
> - number of fsyncs done by a backend which could have been done by
> checkpointer if sync queue had not been full
I wonder if leaving buffers_alloc in pg_stat_bgwriter makes sense after
this? I'm tempted to move that to pg_stat_buffers or such...
I've gone ahead and moved buffers_alloc out of pg_stat_bgwriter and into
pg_stat_buffer_actions (I've renamed it from pg_stat_buffers_written).
I'm not quite convinced by having separate columns for checkpointer,
bgwriter, etc. That doesn't seem to scale all that well. What if we
instead made it a view that has one row for each BackendType?
I've changed the view to have one row for each backend type for which we
would like to report stats and one column for each buffer action type.
To make the code easier to write, I record buffer actions for all
backend types -- even if we don't have any buffer actions we care about
for that backend type. I thought it was okay because when I actually
aggregate the counters across backends, I only do so for the backend
types we care about -- thus there shouldn't be much accessing of shared
memory by multiple different processes.
Also, I copy-pasted most of the code in pg_stat_get_buffer_actions() to
set up the result tuplestore from pg_stat_get_activity() without totally
understanding all the parts of it, so I'm not sure if all of it is
required here.
> In implementing this counting per backend, it is easy for all types of
> backends to keep track of the number of writes, extends, fsyncs, and
> strategy writes they are doing. So, as recommended upthread, I have
> added columns in the view for the number of writes for checkpointer and
> bgwriter and others. Thus, this view becomes more than just stats on
> "avoidable I/O done by backends".
>
> So, my question is, does it make sense to track all extends -- those to
> extend the fsm and visimap and when making a new relation or index? Is
> that information useful? If so, is it different than the extends done
> through shared buffers? Should it be tracked separately?
I don't fully understand what you mean with "extends done through shared
buffers"?
By "extends done through shared buffers", I just mean when an extend of
a relation is done and the data that will be written to the new block is
written into a shared buffer (as opposed to a local one or local memory
or a strategy buffer).
Random note:
I added a length member to the BackendType enum (BACKEND_NUM_TYPES),
which led to this compiler warning:
miscinit.c: In function ‘GetBackendTypeDesc’:
miscinit.c:236:2: warning: enumeration value ‘BACKEND_NUM_TYPES’ not handled in switch [-Wswitch]
236 | switch (backendType)
| ^~~~~~
I tried using pg_attribute_unused() for BACKEND_NUM_TYPES, but it
didn't seem to have the desired effect. As such, I just threw a case
into GetBackendTypeDesc() which does nothing (as opposed to erroring
out); since backendDesc is already initialized to "unknown process
type", erroring out doesn't seem to be expected.
- Melanie
On 2021-Apr-12, Melanie Plageman wrote:

> As for the way I have recorded strategy writes -- it is quite inelegant,
> but, I wanted to make sure that I only counted a strategy write as one
> in which the backend wrote out the dirty buffer from its strategy ring
> but did not check if there was any clean buffer in shared buffers more
> generally (so, it is *potentially* an avoidable write). I'm not sure if
> this distinction is useful to anyone. I haven't done enough with
> BufferAccessStrategies to know what I'd want to know about them when
> developing or using Postgres. However, if I don't need to be so careful,
> it will make the code much simpler (though, I'm sure I can improve the
> code regardless).

I was bitten last year by REFRESH MATERIALIZED VIEW counting its writes
via buffers_backend, and I was very surprised/confused about it. So it
seems definitely worthwhile to count writes via strategy separately.
For a DBA tuning the server configuration it is very useful.

The main thing is to *not* let these writes end up in regular
buffers_backend (or whatever you call these now). I didn't read your
patch, but the way you have described it seems okay to me.

--
Álvaro Herrera 39°49'30"S 73°17'W
On Fri, Jun 4, 2021 at 5:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Apr-12, Melanie Plageman wrote:
>
> > As for the way I have recorded strategy writes -- it is quite inelegant,
> > but, I wanted to make sure that I only counted a strategy write as one
> > in which the backend wrote out the dirty buffer from its strategy ring
> > but did not check if there was any clean buffer in shared buffers more
> > generally (so, it is *potentially* an avoidable write). I'm not sure if
> > this distinction is useful to anyone. I haven't done enough with
> > BufferAccessStrategies to know what I'd want to know about them when
> > developing or using Postgres. However, if I don't need to be so careful,
> > it will make the code much simpler (though, I'm sure I can improve the
> > code regardless).
>
> I was bitten last year by REFRESH MATERIALIZED VIEW counting its writes
> via buffers_backend, and I was very surprised/confused about it. So it
> seems definitely worthwhile to count writes via strategy separately.
> For a DBA tuning the server configuration it is very useful.
>
> The main thing is to *not* let these writes end up in regular
> buffers_backend (or whatever you call these now). I didn't read your
> patch, but the way you have described it seems okay to me.

Thanks for the feedback!

I agree it makes sense to count strategy writes separately.

I thought about this some more, and I don't know if it makes sense to
only count "avoidable" strategy writes.

This would mean that a backend writing out a buffer from the strategy
ring when no clean shared buffers (as well as no clean strategy buffers)
are available would not count that write as a strategy write (even
though it is writing out a buffer from its strategy ring). But, it
obviously doesn't make sense to count it as a regular buffer being
written out. So, I plan to change this code.
On another note, I've updated the patch with more correct concurrency
control mechanisms (had some data races and other problems before).
Now, I am using atomics for the buffer action counters, though the code
includes several #TODO questions around the correctness of what I have
now too. I also wrapped the buffer action types in a struct to make them
easier to work with.

The most substantial missing piece of the patch right now is persisting
the data across reboots.

The two places in the code I can see to persist the buffer action stats
data are:
1) using the stats collector code (like in
   pgstat_read/write_statsfiles())
2) using a before_shmem_exit() hook which writes the data structure to a
   file and then reads from it when making the shared memory array
   initially

It feels a bit weird to me to wedge the buffer action stats into the
stats collector code -- since the stats collector isn't receiving and
aggregating the buffer action stats.

Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.

And, I don't think I can use pgstat_read_statsfiles() since the
BufferActionStatsArray should have the data from the file as soon as the
view containing the buffer action stats can be queried. Thus, it seems
like I would need to read the file while initializing the array in
CreateBufferActionStatsCounters().

I am registering the patch for the September commitfest but plan to
update the stats persistence before then (and docs, etc).

--
Melanie
Hi,

On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:
> Thanks for the feedback!
>
> I agree it makes sense to count strategy writes separately.
>
> I thought about this some more, and I don't know if it makes sense to
> only count "avoidable" strategy writes.
>
> This would mean that a backend writing out a buffer from the strategy
> ring when no clean shared buffers (as well as no clean strategy buffers)
> are available would not count that write as a strategy write (even
> though it is writing out a buffer from its strategy ring). But, it
> obviously doesn't make sense to count it as a regular buffer being
> written out. So, I plan to change this code.

What do you mean with "no clean shared buffers ... are available"?

> The most substantial missing piece of the patch right now is persisting
> the data across reboots.
>
> The two places in the code I can see to persist the buffer action stats
> data are:
> 1) using the stats collector code (like in
>    pgstat_read/write_statsfiles())
> 2) using a before_shmem_exit() hook which writes the data structure to a
>    file and then read from it when making the shared memory array
>    initially

I think it's pretty clear that we should go for 1. Having two mechanisms
for persisting stats data is a bad idea.

> Also, I'm unsure how writing the buffer action stats out in
> pgstat_write_statsfiles() will work, since I think that backends can
> update their buffer action stats after we would have already persisted
> the data from the BufferActionStatsArray -- causing us to lose those
> updates.

I was thinking it'd work differently. Whenever a connection ends, it
reports its data up to pgstat.c (otherwise we'd lose those stats). By
the time shutdown happens, they all need to already have reported their
stats - so we don't need to do anything to get the data to pgstat.c
during shutdown time.
> And, I don't think I can use pgstat_read_statsfiles() since the
> BufferActionStatsArray should have the data from the file as soon as the
> view containing the buffer action stats can be queried. Thus, it seems
> like I would need to read the file while initializing the array in
> CreateBufferActionStatsCounters().

Why would backends need to read that data back?

> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> index 55f6e3711d..96cac0a74e 100644
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
>          pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
>          pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
>          pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> -        pg_stat_get_buf_written_backend() AS buffers_backend,
> -        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> -        pg_stat_get_buf_alloc() AS buffers_alloc,
>          pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands
what that is supposed to mean.

> @@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
>
>      LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
>
> -    /* Count all backend writes regardless of if they fit in the queue */
> -    if (!AmBackgroundWriterProcess())
> -        CheckpointerShmem->num_backend_writes++;
> -
>      /*
>       * If the checkpointer isn't running or the request queue is full, the
>       * backend will have to perform its own fsync request. But before forcing
> @@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
>       * Count the subset of writes where backends have to do their own
>       * fsync
>       */
> +    /* TODO: should we count fsyncs for all types of procs? */
>      if (!AmBackgroundWriterProcess())
> -        CheckpointerShmem->num_backend_fsync++;
> +        pgstat_increment_buffer_action(BA_Fsync);
> +

Yes, I think that'd make sense. Now that we can disambiguate the
different types of syncs between procs, I don't see a point of having a
process-type filter here. We just lose data...

>      /* don't set checksum for all-zero page */
> @@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>              if (XLogNeedsFlush(lsn) &&
>                  StrategyRejectBuffer(strategy, buf))
>              {
> +                /*
> +                 * Unset the strat write flag, as we will not be writing
> +                 * this particular buffer from our ring out and may end
> +                 * up having to find a buffer from main shared buffers,
> +                 * which, if it is dirty, we may have to write out, which
> +                 * could have been prevented by checkpointing and background
> +                 * writing
> +                 */
> +                StrategyUnChooseBufferFromRing(strategy);
> +
>                  /* Drop lock/pin and loop around for another buffer */
>                  LWLockRelease(BufferDescriptorGetContentLock(buf));
>                  UnpinBuffer(buf, true);
>                  continue;
>              }

Could we combine this with StrategyRejectBuffer()? It seems a bit
wasteful to have two function calls into freelist.c when the second
happens exactly when the first returns true?

> +
> +            /*
> +             * TODO: there is certainly a better way to write this
> +             * logic
> +             */
> +
> +            /*
> +             * The dirty buffer that will be written out was selected
> +             * from the ring and we did not bother checking the
> +             * freelist or doing a clock sweep to look for a clean
> +             * buffer to use, thus, this write will be counted as a
> +             * strategy write -- one that may be unnecessary without a
> +             * strategy
> +             */
> +            if (StrategyIsBufferFromRing(strategy))
> +            {
> +                pgstat_increment_buffer_action(BA_Write_Strat);
> +            }
> +
> +            /*
> +             * If the dirty buffer was one we grabbed from the
> +             * freelist or through a clock sweep, it could have been
> +             * written out by bgwriter or checkpointer, thus, we will
> +             * count it as a regular write
> +             */
> +            else
> +                pgstat_increment_buffer_action(BA_Write);

It seems this would be better solved by having a "bool *from_ring" or
GetBufferSource* parameter to StrategyGetBuffer().

> @@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
>      /*
>       * bufToWrite is either the shared buffer or a copy, as appropriate.
>       */
> +
> +    /*
> +     * TODO: consider that if we did not need to distinguish between a buffer
> +     * flushed that was grabbed from the ring buffer and written out as part
> +     * of a strategy which was not from main Shared Buffers (and thus
> +     * preventable by bgwriter or checkpointer), then we could move all calls
> +     * to pgstat_increment_buffer_action() here except for the one for
> +     * extends, which would remain in ReadBuffer_common() before smgrextend()
> +     * (unless we decide to start counting other extends). That includes the
> +     * call to count buffers written by bgwriter and checkpointer which go
> +     * through FlushBuffer() but not BufferAlloc(). That would make it
> +     * simpler. Perhaps instead we can find somewhere else to indicate that
> +     * the buffer is from the ring of buffers to reuse.
> +     */
>      smgrwrite(reln,
>                buf->tag.forkNum,
>                buf->tag.blockNum,

Can we just add a parameter to FlushBuffer indicating what the source of
the write is?

> @@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
>       * the rate of buffer consumption.  Note that buffers recycled by a
>       * strategy object are intentionally not counted here.
>       */
> -    pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
> +    pgstat_increment_buffer_action(BA_Alloc);
>
>      /*
>       * First check, without acquiring the lock, whether there's buffers in the
>
> @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
>           */
>          *complete_passes += nextVictimBuffer / NBuffers;
>      }
> -
> -    if (num_buf_alloc)
> -    {
> -        *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
> -    }
>      SpinLockRelease(&StrategyControl->buffer_strategy_lock);
>      return result;
>  }

Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity?
I suspect this patch shouldn't get rid of numBufferAllocs at the same
time as overhauling the stats stuff. Perhaps we don't need both - but
it's not obvious that that's the case / how we can make that work.

> +void
> +pgstat_increment_buffer_action(BufferActionType ba_type)
> +{
> +    volatile PgBackendStatus *beentry = MyBEEntry;
> +
> +    if (!beentry || !pgstat_track_activities)
> +        return;
> +
> +    if (ba_type == BA_Alloc)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
> +    else if (ba_type == BA_Extend)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
> +    else if (ba_type == BA_Fsync)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
> +    else if (ba_type == BA_Write)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
> +    else if (ba_type == BA_Write_Strat)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
> +}

I don't think we want to use atomic increments here - they're *slow*.
And there only ever can be a single writer to a backend's stats. So just
doing something like
  pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1)
should do the trick.

> +/*
> + * Called for a single backend at the time of death to persist its I/O stats
> + */
> +void
> +pgstat_record_dead_backend_buffer_actions(void)
> +{
> +    volatile PgBackendBufferActionStats *ba_stats;
> +    volatile PgBackendStatus *beentry = MyBEEntry;
> +
> +    if (beentry->st_procpid != 0)
> +        return;
> +
> +    // TODO: is this correct? could there be a data race? do I need a lock?
> +    ba_stats = &BufferActionStatsArray[beentry->st_backendType];
> +    pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs));
> +    pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends));
> +    pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs));
> +    pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes));
> +    pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat));
> +}

I don't see a race, FWIW. This is where I propose that we instead report
the values up to the stats collector, instead of having a separate array
that we need to persist.

> +/*
> + * Fill the provided values array with the accumulated counts of buffer actions
> + * taken by all backends of type backend_type (input parameter), both alive and
> + * dead. This is currently only used by pg_stat_get_buffer_actions() to create
> + * the rows in the pg_stat_buffer_actions system view.
> + */
> +void
> +pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values)
> +{
> +    int         i;
> +    volatile PgBackendStatus *beentry;
> +
> +    /*
> +     * Add stats from all exited backends
> +     */
> +    values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs);
> +    values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends);
> +    values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs);
> +    values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes);
> +    values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat);
> +
> +    /*
> +     * Loop through all live backends and count their buffer actions
> +     */
> +    // TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method
> +
> +    beentry = BackendStatusArray;
> +    for (i = 1; i <= MaxBackends; i++)
> +    {
> +        /* Don't count dead backends. They should already be counted */
> +        if (beentry->st_procpid == 0)
> +            continue;
> +        if (beentry->st_backendType != backend_type)
> +            continue;
> +
> +        values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
> +        values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
> +        values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
> +        values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
> +        values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
> +
> +        beentry++;
> +    }
> +}

It seems to make a bit more sense to have this sum up the stats for all
backend types at once.

> +    /*
> +     * Currently, the only supported backend types for stats are the following.
> +     * If this were to change, pg_proc.dat would need to be changed as well
> +     * to reflect the new expected number of rows.
> +     */
> +    Datum       values[BUFFER_ACTION_NUM_TYPES];
> +    bool        nulls[BUFFER_ACTION_NUM_TYPES];

Ah ;)

Greetings,

Andres Freund
On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:
> > Thanks for the feedback!
> >
> > I agree it makes sense to count strategy writes separately.
> >
> > I thought about this some more, and I don't know if it makes sense to
> > only count "avoidable" strategy writes.
> >
> > This would mean that a backend writing out a buffer from the strategy
> > ring when no clean shared buffers (as well as no clean strategy buffers)
> > are available would not count that write as a strategy write (even
> > though it is writing out a buffer from its strategy ring). But, it
> > obviously doesn't make sense to count it as a regular buffer being
> > written out. So, I plan to change this code.
>
> What do you mean with "no clean shared buffers ... are available"?

I think I was talking about the scenario in which a backend using a
strategy does not find a clean buffer in the strategy ring and goes to
look in the freelist for a clean shared buffer and doesn't find one. I
was probably talking in circles up there. I think the current patch
counts the right writes in the right way, though.

> > The most substantial missing piece of the patch right now is persisting
> > the data across reboots.
> >
> > The two places in the code I can see to persist the buffer action stats
> > data are:
> > 1) using the stats collector code (like in
> >    pgstat_read/write_statsfiles())
> > 2) using a before_shmem_exit() hook which writes the data structure to a
> >    file and then read from it when making the shared memory array
> >    initially
>
> I think it's pretty clear that we should go for 1. Having two mechanisms
> for persisting stats data is a bad idea.

New version uses the stats collector.

> > Also, I'm unsure how writing the buffer action stats out in
> > pgstat_write_statsfiles() will work, since I think that backends can
> > update their buffer action stats after we would have already persisted
> > the data from the BufferActionStatsArray -- causing us to lose those
> > updates.
>
> I was thinking it'd work differently. Whenever a connection ends, it
> reports its data up to pgstat.c (otherwise we'd lose those stats). By
> the time shutdown happens, they all need to already have reported their
> stats - so we don't need to do anything to get the data to pgstat.c
> during shutdown time.

When you say "whenever a connection ends", what part of the code are you
referring to specifically? Also, when you say "shutdown", do you mean a
backend shutting down or all backends shutting down (including
postmaster) -- like pg_ctl stop?

> > And, I don't think I can use pgstat_read_statsfiles() since the
> > BufferActionStatsArray should have the data from the file as soon as the
> > view containing the buffer action stats can be queried. Thus, it seems
> > like I would need to read the file while initializing the array in
> > CreateBufferActionStatsCounters().
>
> Why would backends need to read that data back?

To get totals across restarts, but, doesn't matter now that I am using
the stats collector.

> > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> > index 55f6e3711d..96cac0a74e 100644
> > --- a/src/backend/catalog/system_views.sql
> > +++ b/src/backend/catalog/system_views.sql
> > @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
> >          pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
> >          pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
> >          pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> > -        pg_stat_get_buf_written_backend() AS buffers_backend,
> > -        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> > -        pg_stat_get_buf_alloc() AS buffers_alloc,
> >          pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
>
> Material for a separate patch, not this. But if we're going to break
> monitoring queries anyway, I think we should consider also renaming
> maxwritten_clean (and perhaps a few others), because nobody understands
> what that is supposed to mean.

Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?

> > @@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
> >
> >      LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
> >
> > -    /* Count all backend writes regardless of if they fit in the queue */
> > -    if (!AmBackgroundWriterProcess())
> > -        CheckpointerShmem->num_backend_writes++;
> > -
> >      /*
> >       * If the checkpointer isn't running or the request queue is full, the
> >       * backend will have to perform its own fsync request. But before forcing
> > @@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
> >       * Count the subset of writes where backends have to do their own
> >       * fsync
> >       */
> > +    /* TODO: should we count fsyncs for all types of procs? */
> >      if (!AmBackgroundWriterProcess())
> > -        CheckpointerShmem->num_backend_fsync++;
> > +        pgstat_increment_buffer_action(BA_Fsync);
> > +
>
> Yes, I think that'd make sense. Now that we can disambiguate the
> different types of syncs between procs, I don't see a point of having a
> process-type filter here. We just lose data...

Done

> >      /* don't set checksum for all-zero page */
> > @@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >              if (XLogNeedsFlush(lsn) &&
> >                  StrategyRejectBuffer(strategy, buf))
> >              {
> > +                /*
> > +                 * Unset the strat write flag, as we will not be writing
> > +                 * this particular buffer from our ring out and may end
> > +                 * up having to find a buffer from main shared buffers,
> > +                 * which, if it is dirty, we may have to write out, which
> > +                 * could have been prevented by checkpointing and background
> > +                 * writing
> > +                 */
> > +                StrategyUnChooseBufferFromRing(strategy);
> > +
> >                  /* Drop lock/pin and loop around for another buffer */
> >                  LWLockRelease(BufferDescriptorGetContentLock(buf));
> >                  UnpinBuffer(buf, true);
> >                  continue;
> >              }
>
> Could we combine this with StrategyRejectBuffer()? It seems a bit
> wasteful to have two function calls into freelist.c when the second
> happens exactly when the first returns true?
>
> > +
> > +            /*
> > +             * TODO: there is certainly a better way to write this
> > +             * logic
> > +             */
> > +
> > +            /*
> > +             * The dirty buffer that will be written out was selected
> > +             * from the ring and we did not bother checking the
> > +             * freelist or doing a clock sweep to look for a clean
> > +             * buffer to use, thus, this write will be counted as a
> > +             * strategy write -- one that may be unnecessary without a
> > +             * strategy
> > +             */
> > +            if (StrategyIsBufferFromRing(strategy))
> > +            {
> > +                pgstat_increment_buffer_action(BA_Write_Strat);
> > +            }
> > +
> > +            /*
> > +             * If the dirty buffer was one we grabbed from the
> > +             * freelist or through a clock sweep, it could have been
> > +             * written out by bgwriter or checkpointer, thus, we will
> > +             * count it as a regular write
> > +             */
> > +            else
> > +                pgstat_increment_buffer_action(BA_Write);
>
> It seems this would be better solved by having a "bool *from_ring" or
> GetBufferSource* parameter to StrategyGetBuffer().

I've addressed both of these in the new version.

> > @@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
> >      /*
> >       * bufToWrite is either the shared buffer or a copy, as appropriate.
> >       */
> > +
> > +    /*
> > +     * TODO: consider that if we did not need to distinguish between a buffer
> > +     * flushed that was grabbed from the ring buffer and written out as part
> > +     * of a strategy which was not from main Shared Buffers (and thus
> > +     * preventable by bgwriter or checkpointer), then we could move all calls
> > +     * to pgstat_increment_buffer_action() here except for the one for
> > +     * extends, which would remain in ReadBuffer_common() before smgrextend()
> > +     * (unless we decide to start counting other extends). That includes the
> > +     * call to count buffers written by bgwriter and checkpointer which go
> > +     * through FlushBuffer() but not BufferAlloc(). That would make it
> > +     * simpler. Perhaps instead we can find somewhere else to indicate that
> > +     * the buffer is from the ring of buffers to reuse.
> > +     */
> >      smgrwrite(reln,
> >                buf->tag.forkNum,
> >                buf->tag.blockNum,
>
> Can we just add a parameter to FlushBuffer indicating what the source of
> the write is?

I just noticed this comment now, so I'll address that in the next
version. I rebased today and noticed merge conflicts, so, it looks like
v5 will be on its way soon anyway.

> > @@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
> >       * the rate of buffer consumption.  Note that buffers recycled by a
> >       * strategy object are intentionally not counted here.
> >       */
> > -    pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
> > +    pgstat_increment_buffer_action(BA_Alloc);
> >
> >      /*
> >       * First check, without acquiring the lock, whether there's buffers in the
> >
> > @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
> >           */
> >          *complete_passes += nextVictimBuffer / NBuffers;
> >      }
> > -
> > -    if (num_buf_alloc)
> > -    {
> > -        *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
> > -    }
> >      SpinLockRelease(&StrategyControl->buffer_strategy_lock);
> >      return result;
> >  }
>
> Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity?
> I suspect this patch shouldn't get rid of numBufferAllocs at the same
> time as overhauling the stats stuff. Perhaps we don't need both - but
> it's not obvious that that's the case / how we can make that work.

I initially meant to add a function to the patch like
pg_stat_get_buffer_actions() but which took a BufferActionType and
BackendType as parameters and returned a single value which is the
number of buffer actions of that type for that type of backend.

Let's say I defined it like this:

  uint64
  pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
                                           BufferActionType ba_type)

Then, I intended to use that in StrategySyncStart() to set
num_buf_alloc: subtract the value of StrategyControl->numBufferAllocs
from the value returned by
pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER, BA_Alloc) to get
val, then add val to StrategyControl->numBufferAllocs.

I think that would have the same behavior as current, though I'm not
sure if the performance would end up being better or worse. It wouldn't
be atomically incrementing StrategyControl->numBufferAllocs, but it
would do a few additional atomic operations in StrategySyncStart() than
before. Also, we would do all the work done by
pg_stat_get_buffer_actions() in StrategySyncStart(). But that is called
comparatively infrequently, right?

> > +void
> > +pgstat_increment_buffer_action(BufferActionType ba_type)
> > +{
> > +    volatile PgBackendStatus *beentry = MyBEEntry;
> > +
> > +    if (!beentry || !pgstat_track_activities)
> > +        return;
> > +
> > +    if (ba_type == BA_Alloc)
> > +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
> > +    else if (ba_type == BA_Extend)
> > +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
> > +    else if (ba_type == BA_Fsync)
> > +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
> > +    else if (ba_type == BA_Write)
> > +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
> > +    else if (ba_type == BA_Write_Strat)
> > +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
> > +}
>
> I don't think we want to use atomic increments here - they're *slow*.
> And there only ever can be a single writer to a backend's stats. So just
> doing something like
>   pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1)
> should do the trick.
> Done > > > +/* > > + * Called for a single backend at the time of death to persist its I/O stats > > + */ > > +void > > +pgstat_record_dead_backend_buffer_actions(void) > > +{ > > + volatile PgBackendBufferActionStats *ba_stats; > > + volatile PgBackendStatus *beentry = MyBEEntry; > > + > > + if (beentry->st_procpid != 0) > > + return; > > + > > + // TODO: is this correct? could there be a data race? do I need a lock? > > + ba_stats = &BufferActionStatsArray[beentry->st_backendType]; > > + pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs)); > > + pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends)); > > + pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs)); > > + pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes)); > > + pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat)); > > +} > > I don't see a race, FWIW. > > This is where I propose that we instead report the values up to the stats > collector, instead of having a separate array that we need to persist > Changed > > > +/* > > + * Fill the provided values array with the accumulated counts of buffer actions > > + * taken by all backends of type backend_type (input parameter), both alive and > > + * dead. This is currently only used by pg_stat_get_buffer_actions() to create > > + * the rows in the pg_stat_buffer_actions system view. 
> > + */ > > +void > > +pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values) > > +{ > > + int i; > > + volatile PgBackendStatus *beentry; > > + > > + /* > > + * Add stats from all exited backends > > + */ > > + values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs); > > + values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends); > > + values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs); > > + values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes); > > + values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat); > > + > > + /* > > + * Loop through all live backends and count their buffer actions > > + */ > > + // TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method > > + > > + beentry = BackendStatusArray; > > + for (i = 1; i <= MaxBackends; i++) > > + { > > + /* Don't count dead backends. They should already be counted */ > > + if (beentry->st_procpid == 0) > > + continue; > > + if (beentry->st_backendType != backend_type) > > + continue; > > + > > + values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs); > > + values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends); > > + values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs); > > + values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes); > > + values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat); > > + > > + beentry++; > > + } > > +} > > It seems to make a bit more sense to have this sum up the stats for all > backend types at once. Changed. > > > + /* > > + * Currently, the only supported backend types for stats are the following. > > + * If this were to change, pg_proc.dat would need to be changed as well > > + * to reflect the new expected number of rows. 
> > + */ > > + Datum values[BUFFER_ACTION_NUM_TYPES]; > > + bool nulls[BUFFER_ACTION_NUM_TYPES]; > > Ah ;) > I just went ahead and made a row for each backend type. - Melanie
On Wed, Aug 11, 2021 at 4:11 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote: > > > > > @@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) > > > /* > > > * bufToWrite is either the shared buffer or a copy, as appropriate. > > > */ > > > + > > > + /* > > > + * TODO: consider that if we did not need to distinguish between a buffer > > > + * flushed that was grabbed from the ring buffer and written out as part > > > + * of a strategy which was not from main Shared Buffers (and thus > > > + * preventable by bgwriter or checkpointer), then we could move all calls > > > + * to pgstat_increment_buffer_action() here except for the one for > > > + * extends, which would remain in ReadBuffer_common() before smgrextend() > > > + * (unless we decide to start counting other extends). That includes the > > > + * call to count buffers written by bgwriter and checkpointer which go > > > + * through FlushBuffer() but not BufferAlloc(). That would make it > > > + * simpler. Perhaps instead we can find somewhere else to indicate that > > > + * the buffer is from the ring of buffers to reuse. > > > + */ > > > smgrwrite(reln, > > > buf->tag.forkNum, > > > buf->tag.blockNum, > > > > Can we just add a parameter to FlushBuffer indicating what the source of the > > write is? > > > > I just noticed this comment now, so I'll address that in the next > version. I rebased today and noticed merge conflicts, so, it looks like > v5 will be on its way soon anyway. > Actually, after moving the code around like you suggested, calling pgstat_increment_buffer_action() before smgrwrite() in FlushBuffer() and using a parameter to indicate if it is a strategy write or not would only save us one other call to pgstat_increment_buffer_action() -- the one in SyncOneBuffer(). We would end up moving the one in BufferAlloc() to FlushBuffer() and removing the one in SyncOneBuffer(). 
Do you think it is still worth it? Rebased v5 attached.
Hi, On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote: > On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote: > > > Also, I'm unsure how writing the buffer action stats out in > > > pgstat_write_statsfiles() will work, since I think that backends can > > > update their buffer action stats after we would have already persisted > > > the data from the BufferActionStatsArray -- causing us to lose those > > > updates. > > > > I was thinking it'd work differently. Whenever a connection ends, it reports > > its data up to pgstats.c (otherwise we'd lose those stats). By the time > > shutdown happens, they all need to already have reported their stats - so > > we don't need to do anything to get the data to pgstats.c during shutdown > > time. > > > > When you say "whenever a connection ends", what part of the code are you > referring to specifically? pgstat_beshutdown_hook() > Also, when you say "shutdown", do you mean a backend shutting down or > all backends shutting down (including postmaster) -- like pg_ctl stop? Admittedly our language is very imprecise around this :(. What I meant is that backends would report their own stats up to the stats collector when the connection ends (in pgstat_beshutdown_hook()). That means that when the whole server (pgstat and then postmaster, potentially via pg_ctl stop) shuts down, all the per-connection stats have already been reported up to pgstat. 
> > > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql > > > index 55f6e3711d..96cac0a74e 100644 > > > --- a/src/backend/catalog/system_views.sql > > > +++ b/src/backend/catalog/system_views.sql > > > @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS > > > pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, > > > pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, > > > pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, > > > - pg_stat_get_buf_written_backend() AS buffers_backend, > > > - pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, > > > - pg_stat_get_buf_alloc() AS buffers_alloc, > > > pg_stat_get_bgwriter_stat_reset_time() AS stats_reset; > > > > Material for a separate patch, not this. But if we're going to break > > monitoring queries anyway, I think we should consider also renaming > > maxwritten_clean (and perhaps a few others), because nobody understands what > > that is supposed to mean. > Do you mean I shouldn't remove anything from the pg_stat_bgwriter view? No - I just meant that now that we're breaking pg_stat_bgwriter queries, we should also rename the columns to be easier to understand. But that it should be a separate patch / commit... > > > @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc) > > > */ > > > *complete_passes += nextVictimBuffer / NBuffers; > > > } > > > - > > > - if (num_buf_alloc) > > > - { > > > - *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0); > > > - } > > > SpinLockRelease(&StrategyControl->buffer_strategy_lock); > > > return result; > > > } > > > > Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I > > suspect this patch shouldn't get rid of numBufferAllocs at the same time as > > overhauling the stats stuff. Perhaps we don't need both - but it's not obvious > > that that's the case / how we can make that work. 
> > > > > > I initially meant to add a function to the patch like > pg_stat_get_buffer_actions() but which took a BufferActionType and > BackendType as parameters and returned a single value which is the > number of buffer action types of that type for that type of backend. > > let's say I defined it like this: > uint64 > pg_stat_get_backend_buffer_actions_stats(BackendType backend_type, > BufferActionType ba_type) > > Then, I intended to use that in StrategySyncStart() to set num_buf_alloc > by subtracting the value of StrategyControl->numBufferAllocs from the > value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER, > BA_Alloc), val, then adding that value, val, to > StrategyControl->numBufferAllocs. I don't think you could restrict this to B_BG_WRITER? The whole point of this logic is that bgwriter uses the stats for *all* backends to get the "usage rate" for buffers, which it then uses to control how many buffers to clean. > I think that would have the same behavior as current, though I'm not > sure if the performance would end up being better or worse. It wouldn't > be atomically incrementing StrategyControl->numBufferAllocs, but it > would do a few additional atomic operations in StrategySyncStart() than > before. Also, we would do all the work done by > pg_stat_get_buffer_actions() in StrategySyncStart(). I think it'd be better to separate changing the bgwriter pacing logic (and thus numBufferAllocs) from changing the stats reporting. > But that is called comparatively infrequently, right? Depending on the workload not that rarely. I'm afraid this might be a bit too expensive. It's possible we can work around that however. Greetings, Andres Freund
On Fri, Aug 13, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote: > > On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote: > > > > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql > > > > index 55f6e3711d..96cac0a74e 100644 > > > > --- a/src/backend/catalog/system_views.sql > > > > +++ b/src/backend/catalog/system_views.sql > > > > @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS > > > > pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, > > > > pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, > > > > pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, > > > > - pg_stat_get_buf_written_backend() AS buffers_backend, > > > > - pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, > > > > - pg_stat_get_buf_alloc() AS buffers_alloc, > > > > pg_stat_get_bgwriter_stat_reset_time() AS stats_reset; > > > > > > Material for a separate patch, not this. But if we're going to break > > > monitoring queries anyway, I think we should consider also renaming > > > maxwritten_clean (and perhaps a few others), because nobody understands what > > > that is supposed to mean. > > > Do you mean I shouldn't remove anything from the pg_stat_bgwriter view? > > No - I just meant that now that we're breaking pg_stat_bgwriter queries, > we should also rename the columns to be easier to understand. But that > it should be a separate patch / commit... > I separated the removal of some redundant stats from pg_stat_bgwriter into a different commit but haven't removed or clarified any additional columns in pg_stat_bgwriter. 
> > > > > > @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc) > > > > */ > > > > *complete_passes += nextVictimBuffer / NBuffers; > > > > } > > > > - > > > > - if (num_buf_alloc) > > > > - { > > > > - *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0); > > > > - } > > > > SpinLockRelease(&StrategyControl->buffer_strategy_lock); > > > > return result; > > > > } > > > > > > Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I > > > suspect this patch shouldn't get rid of numBufferAllocs at the same time as > > > overhauling the stats stuff. Perhaps we don't need both - but it's not obvious > > > that that's the case / how we can make that work. > > > > > > > > > > I initially meant to add a function to the patch like > > pg_stat_get_buffer_actions() but which took a BufferActionType and > > BackendType as parameters and returned a single value which is the > > number of buffer action types of that type for that type of backend. > > > > let's say I defined it like this: > > uint64 > > pg_stat_get_backend_buffer_actions_stats(BackendType backend_type, > > BufferActionType ba_type) > > > > Then, I intended to use that in StrategySyncStart() to set num_buf_alloc > > by subtracting the value of StrategyControl->numBufferAllocs from the > > value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER, > > BA_Alloc), val, then adding that value, val, to > > StrategyControl->numBufferAllocs. > > I don't think you could restrict this to B_BG_WRITER? The whole point of > this logic is that bgwriter uses the stats for *all* backends to get the > "usage rate" for buffers, which it then uses to control how many buffers > to clean. > > > > I think that would have the same behavior as current, though I'm not > > sure if the performance would end up being better or worse. 
It wouldn't > > be atomically incrementing StrategyControl->numBufferAllocs, but it > > would do a few additional atomic operations in StrategySyncStart() than > > before. Also, we would do all the work done by > > pg_stat_get_buffer_actions() in StrategySyncStart(). > > I think it'd be better to separate changing the bgwriter pacing logic > (and thus numBufferAllocs) from changing the stats reporting. > > > But that is called comparatively infrequently, right? > > Depending on the workload not that rarely. I'm afraid this might be a > bit too expensive. It's possible we can work around that however. > I've restored StrategyControl->numBuffersAlloc. Attached is v6 of the patchset. I have made several small updates to the patch, including user docs updates, comment clarifications, various changes related to how structures are initialized, code simplifications, small details like alphabetizing of #includes, etc. Below are details on the remaining TODOs and open questions for this patch and why I haven't done them yet: 1) performance testing (initial tests done, but need to do some further investigation before sharing) 2) stats_reset Because pg_stat_buffer_actions fields were added to the globalStats structure, they get reset when the target RESET_BGWRITER is reset. Depending on whether or not these commits remove columns from the pg_stat_bgwriter view, I would approach adding stats_reset to pg_stat_buffer_actions differently. If removing all of pg_stat_bgwriter, I would just rename the target to apply to pg_stat_buffer_actions. If not removing all of pg_stat_bgwriter, I would add a new target for pg_stat_buffer_actions to reset those stats and then either remove them from globalStats or MemSet() only the relevant parts of the struct in pgstat_recv_resetsharedcounter(). I haven't done this yet because I want to get input on what should happen to pg_stat_bgwriter first (all of it goes, all of it stays, some goes, etc). 
3) what to count Currently, the patch counts allocs, extends, fsyncs and writes of shared buffers and writes done when using a buffer access strategy. So, it is a mix of mostly shared buffers and a few non-shared buffers. I am wondering if it makes sense to also count extends with smgrextend() other than those using shared buffers--for example when building an index or when extending the free space map or visibility map. For fsyncs, the patch does not count checkpointer fsyncs or fsyncs done from XLogWrite(). On a related note, depending on what the view counts, the name buffer_actions may or may not be too general. I also feel like the BackendType B_BACKEND is a bit confusing when we are tracking buffer actions for different backend types -- this name makes it seem like other types of backends are not backends. I'm not sure what the view should track and can see arguments for excluding certain extends or separating them into another stat. I haven't made the changes because I am looking for other peoples' opinions. 4) Adding some sort of protection against regressions when code is added that adds additional buffer actions but doesn't count them -- more likely if we are counting all users of smgrextend() but not doing the counter incrementing there. I'm not sure how I would even do this, so, that's why I haven't done it. 5) It seems like the code to create a tuplestore used by various stats functions like pg_stat_get_progress_info(), pg_stat_get_activity, and pg_stat_get_slru could be refactored into a helper function since it is quite redundant (maybe returning a ReturnSetInfo). I haven't done this because I wasn't sure if it was a good idea, and, if it is, if I should do it in a separate commit. 6) Cleaning up of commit message, running pgindent, and, eventually, catalog bump (waiting until the patch is done to do this). 7) Additional testing to ensure all codepaths added are hit (one-off testing, not added to regression test suite). 
I am waiting to do this until all of the types of buffer actions that will be done are finalized. - Melanie
On Fri, Aug 13, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote: > > On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote: > > > > Also, I'm unsure how writing the buffer action stats out in > > > > pgstat_write_statsfiles() will work, since I think that backends can > > > > update their buffer action stats after we would have already persisted > > > > the data from the BufferActionStatsArray -- causing us to lose those > > > > updates. > > > > > > I was thinking it'd work differently. Whenever a connection ends, it reports > > > its data up to pgstats.c (otherwise we'd loose those stats). By the time > > > shutdown happens, they all need to have already have reported their stats - so > > > we don't need to do anything to get the data to pgstats.c during shutdown > > > time. > > > > > > > When you say "whenever a connection ends", what part of the code are you > > referring to specifically? > > pgstat_beshutdown_hook() > > > > Also, when you say "shutdown", do you mean a backend shutting down or > > all backends shutting down (including postmaster) -- like pg_ctl stop? > > Admittedly our language is very imprecise around this :(. What I meant > is that backends would report their own stats up to the stats collector > when the connection ends (in pgstat_beshutdown_hook()). That means that > when the whole server (pgstat and then postmaster, potentially via > pg_ctl stop) shuts down, all the per-connection stats have already been > reported up to pgstat. > So, I realized that the patch has a problem. I added the code to send buffer actions stats to the stats collector (pgstat_send_buffer_actions()) to pgstat_report_stat() and this isn't getting called when all types of backends exit. 
I originally thought to add pgstat_send_buffer_actions() to pgstat_beshutdown_hook() (as suggested), but, this is called after pgstat_shutdown_hook(), so, we aren't able to send stats to the stats collector at that time. (pgstat_shutdown_hook() sets pgstat_is_shutdown to true and then in pgstat_beshutdown_hook() (called after), if we call pgstat_send_buffer_actions(), it calls pgstat_send() which calls pgstat_assert_is_up() which trips when pgstat_is_shutdown is true.) After calling pgstat_send_buffer_actions() from pgstat_report_stat(), it seems to miss checkpointer stats entirely. I did find that if I sprinkled pgstat_send_buffer_actions() around in the various places that pgstat_send_checkpointer() is called, I could get checkpointer stats (see attached patch, capture_checkpointer_buffer_actions.patch), but, that seems a little bit haphazard since pgstat_send_buffer_actions() is supposed to capture stats for all backend types. Is there somewhere else I can call it that is exercised by all backend types before pgstat_shutdown_hook() is called but after they would have finished any relevant buffer actions? - Melanie
On Wed, Sep 8, 2021 at 9:28 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Aug 13, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote: > > > On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote: > > > > > Also, I'm unsure how writing the buffer action stats out in > > > > > pgstat_write_statsfiles() will work, since I think that backends can > > > > > update their buffer action stats after we would have already persisted > > > > > the data from the BufferActionStatsArray -- causing us to lose those > > > > > updates. > > > > > > > > I was thinking it'd work differently. Whenever a connection ends, it reports > > > > its data up to pgstats.c (otherwise we'd loose those stats). By the time > > > > shutdown happens, they all need to have already have reported their stats - so > > > > we don't need to do anything to get the data to pgstats.c during shutdown > > > > time. > > > > > > > > > > When you say "whenever a connection ends", what part of the code are you > > > referring to specifically? > > > > pgstat_beshutdown_hook() > > > > > > > Also, when you say "shutdown", do you mean a backend shutting down or > > > all backends shutting down (including postmaster) -- like pg_ctl stop? > > > > Admittedly our language is very imprecise around this :(. What I meant > > is that backends would report their own stats up to the stats collector > > when the connection ends (in pgstat_beshutdown_hook()). That means that > > when the whole server (pgstat and then postmaster, potentially via > > pg_ctl stop) shuts down, all the per-connection stats have already been > > reported up to pgstat. > > > > So, I realized that the patch has a problem. I added the code to send > buffer actions stats to the stats collector > (pgstat_send_buffer_actions()) to pgstat_report_stat() and this isn't > getting called when all types of backends exit. 
> > I originally thought to add pgstat_send_buffer_actions() to > pgstat_beshutdown_hook() (as suggested), but, this is called after > pgstat_shutdown_hook(), so, we aren't able to send stats to the stats > collector at that time. (pgstat_shutdown_hook() sets pgstat_is_shutdown > to true and then in pgstat_beshutdown_hook() (called after), if we call > pgstat_send_buffer_actions(), it calls pgstat_send() which calls > pgstat_assert_is_up() which trips when pgstat_is_shutdown is true.) > > After calling pgstat_send_buffer_actions() from pgstat_report_stat(), it > seems to miss checkpointer stats entirely. I did find that if I > sprinkled pgstat_send_buffer_actions() around in the various places that > pgstat_send_checkpointer() is called, I could get checkpointer stats > (see attached patch, capture_checkpointer_buffer_actions.patch), but, > that seems a little bit haphazard since pgstat_send_buffer_actions() is > supposed to capture stats for all backend types. Is there somewhere else > I can call it that is exercised by all backend types before > pgstat_shutdown_hook() is called but after they would have finished any > relevant buffer actions? > I realized that putting these additional calls in checkpointer code and not clearing out PgBackendStatus counters for buffer actions results in a lot of duplicate stats. I was wondering if pgstat_send_buffer_actions() is needed, however, in HandleCheckpointerInterrupts() before the proc_exit(). It does seem like additional calls to pgstat_send_buffer_actions() shouldn't be needed since most processes register pgstat_shutdown_hook(). However, since MyDatabaseId isn't valid for the auxiliary processes, even though the pgstat_shutdown_hook() is registered from BaseInit(), pgstat_report_stat() never gets called for them, so their stats aren't persisted using the current method. 
It seems like the best solution to persisting all processes' stats would be to have all processes register pgstat_shutdown_hook() and to still call pgstat_report_stat() even if MyDatabaseId is not valid if the process is not a regular backend (I assume that it is only a problem that MyDatabaseId is InvalidOid for backends that have had it set to a valid oid at some point). For the stats that rely on database OID, perhaps those can be reported based on whether or not MyDatabaseId is valid from within pgstat_report_stat(). I also realized that I am not collecting stats from live auxiliary processes in pg_stat_get_buffer_actions(). I need to change the loop to for (i = 0; i <= MaxBackends + NUM_AUXPROCTYPES; i++) to actually get stats from live auxiliary processes when querying the view. On an unrelated note, I am planning to remove buffers_clean and buffers_checkpoint from the pg_stat_bgwriter view since those are also redundant. When I was removing them, I noticed that buffers_checkpoint and buffers_clean count buffers as having been written even when FlushBuffer() "does nothing" because someone else wrote out the dirty buffer before the bgwriter or checkpointer had a chance to do it. This seems like it would result in an incorrect count. Am I missing something? - Melanie
Hi,
I've attached the v7 patch set.
Changes from v6:
- removed unnecessary global variable BufferActionsStats
- fixed the loop condition in pg_stat_get_buffer_actions()
- updated some comments
- removed buffers_checkpoint and buffers_clean from pg_stat_bgwriter
view (now pg_stat_bgwriter view is mainly checkpointer statistics,
which isn't great)
- instead of calling pgstat_send_buffer_actions() in
pgstat_report_stat(), I renamed pgstat_send_buffer_actions() to
pgstat_report_buffers() and call it directly from
pgstat_shutdown_hook() for all types of processes (including processes
with invalid MyDatabaseId [like auxiliary processes])
I began changing the code to add the stats reset timestamp to the
pg_stat_buffer_actions view, but, I realized that it will be kind of
distracting to have every row for every backend type have a stats reset
timestamp (since it will be the same timestamp over and over). If,
however, you could reset buffer stats for each backend type
individually, then, I could see having it. Otherwise, we could add a
function like pg_stat_get_stats_reset_time(viewname) where viewname
would be pg_stat_buffer_actions in our case. Though, maybe that is
annoying and not very usable--I'm not sure.
I also think it makes sense to rename the pg_stat_buffer_actions view to
pg_stat_buffers and to name the columns using both the buffer action
type and buffer type -- e.g. shared, strategy, local. This leaves open
the possibility of counting buffer actions done on other non-shared
buffers -- like those done while building indexes or those using local
buffers. The third patch in the set does this (I wanted to see if it
made sense before fixing it up into the first patch in the set).
This naming convention (BufferType_BufferActionType) made me think that
it might make sense to have two enumerations: one being the current
BufferActionType (which could also be called BufferAccessType though
that might get confusing with BufferAccessStrategyType and buffer access
strategies in general) and the other being BufferType (which would be
one of shared, local, index, etc).
I attached a patch with the outline of this idea
(buffer_type_enum_addition.patch). It doesn't work because
pg_stat_get_buffer_actions() uses the BufferActionType as an index into
the values array returned. If I wanted to use a combination of the two
enums as an indexing mechanism (BufferActionType and BufferType), we
would end up with a tuple having every combination of the two
enums--some of which aren't valid. It might not make sense to implement
this. I do think it is useful to think of these stats as a combination
of a buffer action and a type of buffer.
- Melanie
Attachment
Hello Melanie

On 2021-Sep-13, Melanie Plageman wrote:

> I also think it makes sense to rename the pg_stat_buffer_actions view to
> pg_stat_buffers and to name the columns using both the buffer action
> type and buffer type -- e.g. shared, strategy, local. This leaves open
> the possibility of counting buffer actions done on other non-shared
> buffers -- like those done while building indexes or those using local
> buffers. The third patch in the set does this (I wanted to see if it
> made sense before fixing it up into the first patch in the set).

What do you think of the idea of having the "shared/strategy/local" attribute be a column? So you'd have up to three rows per buffer action type. Users wishing to see an aggregate can just aggregate them, just like they'd do with pg_buffercache. I think that leads to an easy decision with regards to this point:

> I attached a patch with the outline of this idea
> (buffer_type_enum_addition.patch). It doesn't work because
> pg_stat_get_buffer_actions() uses the BufferActionType as an index into
> the values array returned. If I wanted to use a combination of the two
> enums as an indexing mechanism (BufferActionType and BufferType), we
> would end up with a tuple having every combination of the two
> enums--some of which aren't valid. It might not make sense to implement
> this. I do think it is useful to think of these stats as a combination
> of a buffer action and a type of buffer.

Does that seem sensible?

(It's weird to have enum values that are there just to indicate what's the maximum value. I think that sort of thing is better done by having a "#define LAST_THING" that takes the last valid value from the enum. That would free you from having to handle the last value in switch blocks, for example. LAST_OCLASS in dependency.h is a precedent on this.)

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key sequence which makes any sense." (Paul Thomas)
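Alvaro's "#define LAST_THING" suggestion can be illustrated with a small sketch. The IOPath member names follow the naming that appears later in the thread, and LAST_IOPATH mirrors the LAST_OCLASS precedent he cites; this is an illustration, not code from the patch.

```c
#include <string.h>

/*
 * Instead of a sentinel member such as IOPATH_NUM_TYPES inside the
 * enum, name the last valid member with a #define (cf. LAST_OCLASS in
 * dependency.h).  A switch over the enum then covers every real value
 * and needs no case for an artificial "max" member.
 */
typedef enum IOPath
{
	IOPATH_SHARED,
	IOPATH_LOCAL,
	IOPATH_STRATEGY
} IOPath;

#define LAST_IOPATH IOPATH_STRATEGY
#define IOPATH_COUNT (LAST_IOPATH + 1)

const char *
iopath_name(IOPath path)
{
	switch (path)
	{
		case IOPATH_SHARED:
			return "shared";
		case IOPATH_LOCAL:
			return "local";
		case IOPATH_STRATEGY:
			return "strategy";
	}
	return "unknown";			/* unreachable for valid values */
}
```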
On Tue, Sep 14, 2021 at 9:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2021-Sep-13, Melanie Plageman wrote: > > > I also think it makes sense to rename the pg_stat_buffer_actions view to > > pg_stat_buffers and to name the columns using both the buffer action > > type and buffer type -- e.g. shared, strategy, local. This leaves open > > the possibility of counting buffer actions done on other non-shared > > buffers -- like those done while building indexes or those using local > > buffers. The third patch in the set does this (I wanted to see if it > > made sense before fixing it up into the first patch in the set). > > What do you think of the idea of having the "shared/strategy/local" > attribute be a column? So you'd have up to three rows per buffer action > type. Users wishing to see an aggregate can just aggregate them, just > like they'd do with pg_buffercache. I think that leads to an easy > decision with regards to this point: I have rewritten the code to implement this. > > > (It's weird to have enum values that are there just to indicate what's > the maximum value. I think that sort of thing is better done by having > a "#define LAST_THING" that takes the last valid value from the enum. > That would free you from having to handle the last value in switch > blocks, for example. LAST_OCLASS in dependency.h is a precedent on this.) > I have made this change. The attached v8 patchset is rewritten to add in an additional dimension -- buffer type. Now, a backend keeps track of how many buffers of a particular type (e.g. shared, local) it has accessed in a particular way (e.g. alloc, write). It also changes the naming of various structures and the view members. Previously, stats reset did not work since it did not consider live backends' counters. Now, the reset message includes the current live backends' counters to be tracked by the stats collector and used when the view is queried. 
The reset message is one of the areas in which I still need to do some work -- I shoved the array of PgBufferAccesses into the existing reset message used for checkpointer, bgwriter, etc. Before making a new type of message, I would like feedback from a reviewer about the approach.

There are various TODOs in the code which are actually questions for the reviewer. Once I have some feedback, it will be easier to address these items.

There are a few other items which may be material for other commits that I would also like to do:

1) Write wrapper functions for smgr* functions which count buffer accesses of the appropriate type. I wasn't sure if these should literally just take all the parameters that the smgr* functions take + buffer type. Once these exist, there will be less possibility for regressions in which new code is added using smgr* functions without counting this buffer activity. Once I add these, I was going to go through and replace existing calls to smgr* functions and thereby start counting currently uncounted buffer type accesses (direct, local, etc).

2) Separate checkpointer and bgwriter into two views and add additional stats to the bgwriter view.

3) Consider adding a helper function to pgstatfuncs.c to help create the tuplestore. These functions all have quite a few lines which are exactly the same, and I thought it might be nice to do something about that:
pg_stat_get_progress_info(PG_FUNCTION_ARGS)
pg_stat_get_activity(PG_FUNCTION_ARGS)
pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS)
pg_stat_get_slru(PG_FUNCTION_ARGS)
I can imagine a function that takes a Datums array, a nulls array, and a ResultSetInfo and then makes the tuplestore -- though I think that will use more memory. Perhaps we could make a macro which does the initial error checking (checking if caller supports returning a tuplestore)? I'm not sure if there is something meaningful here, but I thought I would ask.
Finally, I haven't removed the test in pg_stats and haven't done a final pass for comment clarity, alphabetization, etc on this version. - Melanie
Attachment
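Item (1) above -- wrappers around the smgr* functions so that accounting cannot be forgotten at new call sites -- can be sketched like this. Everything here is a hypothetical stand-in (smgr_write_raw, CountedBuffers, the IOPath values), not PostgreSQL's actual smgr API:

```c
/* Illustrative IO-path classification; the real patch's enum differs. */
typedef enum
{
	IOPATH_SHARED,
	IOPATH_LOCAL,
	IOPATH_NUM_TYPES
} IOPath;

typedef struct CountedBuffers
{
	unsigned long writes[IOPATH_NUM_TYPES];
} CountedBuffers;

/* Stand-in for the real smgrwrite(); here it just reports success. */
static int
smgr_write_raw(const void *buffer, unsigned long blocknum)
{
	(void) buffer;
	(void) blocknum;
	return 0;
}

/*
 * Wrapper taking the underlying parameters plus the buffer type: it
 * forwards to the low-level write and bumps the matching counter, so a
 * new caller gets the accounting for free instead of having to remember
 * a separate pgstat call.
 */
int
counted_smgr_write(CountedBuffers *stats, IOPath path,
				   const void *buffer, unsigned long blocknum)
{
	int			rc = smgr_write_raw(buffer, blocknum);

	if (rc == 0)
		stats->writes[path]++;	/* count only successful writes */
	return rc;
}
```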
On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > The attached v8 patchset is rewritten to add in an additional dimension > -- buffer type. Now, a backend keeps track of how many buffers of a > particular type (e.g. shared, local) it has accessed in a particular way > (e.g. alloc, write). It also changes the naming of various structures > and the view members. > > Previously, stats reset did not work since it did not consider live > backends' counters. Now, the reset message includes the current live > backends' counters to be tracked by the stats collector and used when > the view is queried. > > The reset message is one of the areas in which I still need to do some > work -- I shoved the array of PgBufferAccesses into the existing reset > message used for checkpointer, bgwriter, etc. Before making a new type > of message, I would like feedback from a reviewer about the approach. > > There are various TODOs in the code which are actually questions for the > reviewer. Once I have some feedback, it will be easier to address these > items. > > There a few other items which may be material for other commits that > I would also like to do: > 1) write wrapper functions for smgr* functions which count buffer > accesses of the appropriate type. I wasn't sure if these should > literally just take all the parameters that the smgr* functions take + > buffer type. Once these exist, there will be less possibility for > regressions in which new code is added using smgr* functions without > counting this buffer activity. Once I add these, I was going to go > through and replace existing calls to smgr* functions and thereby start > counting currently uncounted buffer type accesses (direct, local, etc). > > 2) Separate checkpointer and bgwriter into two views and add additional > stats to the bgwriter view. > > 3) Consider adding a helper function to pgstatfuncs.c to help create the > tuplestore. 
These functions all have quite a few lines which are exactly > the same, and I thought it might be nice to do something about that: > pg_stat_get_progress_info(PG_FUNCTION_ARGS) > pg_stat_get_activity(PG_FUNCTION_ARGS) > pg_stat_get_buffers_accesses(PG_FUNCTION_ARGS) > pg_stat_get_slru(PG_FUNCTION_ARGS) > pg_stat_get_progress_info(PG_FUNCTION_ARGS) > I can imagine a function that takes a Datums array, a nulls array, and a > ResultSetInfo and then makes the tuplestore -- though I think that will > use more memory. Perhaps we could make a macro which does the initial > error checking (checking if caller supports returning a tuplestore)? I'm > not sure if there is something meaningful here, but I thought I would > ask. > > Finally, I haven't removed the test in pg_stats and haven't done a final > pass for comment clarity, alphabetization, etc on this version. > I have addressed almost all of the issues mentioned above in v9. The only remaining TODOs are described in the commit message. The most critical one is that the reset message doesn't work.
Attachment
On Fri, Sep 24, 2021 at 5:58 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > The only remaining TODOs are described in the commit message. > most critical one is that the reset message doesn't work. v10 is attached with updated comments and some limited code refactoring.
Attachment
On Mon, Sep 27, 2021 at 2:58 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Sep 24, 2021 at 5:58 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > The only remaining TODOs are described in the commit message. > > most critical one is that the reset message doesn't work. > > v10 is attached with updated comments and some limited code refactoring v11 has fixed the oversize message issue by sending a reset message for each backend type. Now, we will call GetCurrentTimestamp BACKEND_NUM_TYPES times, so maybe I should add some kind of flag to the reset message that indicates the first message so that all the "do once" things can be done at that point. I've also fixed a few style/cosmetic issues and updated the commit message with a link to the thread [1] where I proposed smgrwrite() and smgrextend() wrappers (which is where I propose to call pgstat_incremement_buffer_access_type() for unbuffered writes and extends). - Melanie [1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com
Attachment
On Wed, Sep 29, 2021 at 4:46 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Sep 27, 2021 at 2:58 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Fri, Sep 24, 2021 at 5:58 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman > > > <melanieplageman@gmail.com> wrote: > > > The only remaining TODOs are described in the commit message. > > > most critical one is that the reset message doesn't work. > > > > v10 is attached with updated comments and some limited code refactoring > > v11 has fixed the oversize message issue by sending a reset message for > each backend type. Now, we will call GetCurrentTimestamp > BACKEND_NUM_TYPES times, so maybe I should add some kind of flag to the > reset message that indicates the first message so that all the "do once" > things can be done at that point. > > I've also fixed a few style/cosmetic issues and updated the commit > message with a link to the thread [1] where I proposed smgrwrite() and > smgrextend() wrappers (which is where I propose to call > pgstat_incremement_buffer_access_type() for unbuffered writes and > extends). > > - Melanie > > [1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com v12 (attached) has various style and code clarity updates (it is pgindented as well). I also added a new commit which creates a utility function to make a tuplestore for views that need one in pgstatfuncs.c. Having received some offlist feedback about the names BufferAccessType and BufferType being confusing, I am planning to rename these variables and all of the associated functions. I agree that BufferType and BufferAccessType are confusing for the following reasons: - They sound similar. - They aren't very precise. - One of the types of buffers is not using a Postgres buffer. 
So far, the proposed alternative is IO_Op or IOOp for BufferAccessType and IOPath for BufferType. - Melanie
Attachment
Can you say more about 0001? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Use it up, wear it out, make it do, or do without"
v13 (attached) contains several cosmetic updates and the full rename (comments included) of BufferAccessType and BufferType. On Thu, Sep 30, 2021 at 7:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Can you say more about 0001? > The rationale for this patch was that it doesn't save much to avoid initializing backend activity state in the bootstrap process, and by doing so I don't have to do the if (beentry) check in pgstat_inc_ioop(), which happens on most buffer accesses.
Attachment
Hi, On 2021-10-01 16:05:31 -0400, Melanie Plageman wrote: > From 40c809ad1127322f3462e85be080c10534485f0d Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Fri, 24 Sep 2021 17:39:12 -0400 > Subject: [PATCH v13 1/4] Allow bootstrap process to beinit > > --- > src/backend/utils/init/postinit.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c > index 78bc64671e..fba5864172 100644 > --- a/src/backend/utils/init/postinit.c > +++ b/src/backend/utils/init/postinit.c > @@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, > EnablePortalManager(); > > /* Initialize status reporting */ > - if (!bootstrap) > - pgstat_beinit(); > + pgstat_beinit(); > > /* > * Load relcache entries for the shared system catalogs. This must create > -- > 2.27.0 > I think it's good to remove more and more of these !bootstrap cases - they really make it harder to understand the state of the system at various points. Optimizing for the rarely executed bootstrap mode at the cost of checks in very common codepaths... > From a709ddb30b2b747beb214f0b13cd1e1816094e6b Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Thu, 30 Sep 2021 16:16:22 -0400 > Subject: [PATCH v13 2/4] Add utility to make tuplestores for pg stat views > > Most of the steps to make a tuplestore for those pg_stat views requiring > one are the same. Consolidate them into a single helper function for > clarity and to avoid bugs. 
> --- > src/backend/utils/adt/pgstatfuncs.c | 129 ++++++++++------------------ > 1 file changed, 44 insertions(+), 85 deletions(-) > > diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c > index ff5aedc99c..513f5aecf6 100644 > --- a/src/backend/utils/adt/pgstatfuncs.c > +++ b/src/backend/utils/adt/pgstatfuncs.c > @@ -36,6 +36,42 @@ > > #define HAS_PGSTAT_PERMISSIONS(role) (is_member_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role)) > > +/* > + * Helper function for views with multiple rows constructed from a tuplestore > + */ > +static Tuplestorestate * > +pg_stat_make_tuplestore(FunctionCallInfo fcinfo, TupleDesc *tupdesc) > +{ > + Tuplestorestate *tupstore; > + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; > + MemoryContext per_query_ctx; > + MemoryContext oldcontext; > + > + /* check to see if caller supports us returning a tuplestore */ > + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("set-valued function called in context that cannot accept a set"))); > + if (!(rsinfo->allowedModes & SFRM_Materialize)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("materialize mode required, but it is not allowed in this context"))); > + > + /* Build a tuple descriptor for our result type */ > + if (get_call_result_type(fcinfo, NULL, tupdesc) != TYPEFUNC_COMPOSITE) > + elog(ERROR, "return type must be a row type"); > + > + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; > + oldcontext = MemoryContextSwitchTo(per_query_ctx); > + > + tupstore = tuplestore_begin_heap(true, false, work_mem); > + rsinfo->returnMode = SFRM_Materialize; > + rsinfo->setResult = tupstore; > + rsinfo->setDesc = *tupdesc; > + MemoryContextSwitchTo(oldcontext); > + return tupstore; > +} Is pgstattuple the best place for this helper? It's not really pgstatfuncs specific... 
It also looks vaguely familiar - I wonder if we have a helper roughly like this somewhere else already... > From e9a5d2a021d429fdbb2daa58ab9d75a069f334d4 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Wed, 29 Sep 2021 15:39:45 -0400 > Subject: [PATCH v13 3/4] Add system view tracking IO ops per backend type > > diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c > index be7366379d..0d18e7f71a 100644 > --- a/src/backend/postmaster/checkpointer.c > +++ b/src/backend/postmaster/checkpointer.c > @@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type) > */ > if (!AmBackgroundWriterProcess()) > CheckpointerShmem->num_backend_fsync++; > + pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED); > LWLockRelease(CheckpointerCommLock); > return false; > } ISTM this doesn't need to happen while holding CheckpointerCommLock? > @@ -1461,7 +1467,25 @@ pgstat_reset_shared_counters(const char *target) > errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\"."))); > > pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER); > - pgstat_send(&msg, sizeof(msg)); > + > + if (msg.m_resettarget == RESET_BUFFERS) > + { > + int backend_type; > + PgStatIOPathOps ops[BACKEND_NUM_TYPES]; > + > + memset(ops, 0, sizeof(ops)); > + pgstat_report_live_backend_io_path_ops(ops); > + > + for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++) > + { > + msg.m_backend_resets.backend_type = backend_type; > + memcpy(&msg.m_backend_resets.iop, &ops[backend_type], sizeof(msg.m_backend_resets.iop)); > + pgstat_send(&msg, sizeof(msg)); > + } > + } > + else > + pgstat_send(&msg, sizeof(msg)); > + > } I'd perhaps put this in a small helper function.
> /* ---------- > * pgstat_fetch_stat_dbentry() - > @@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg) > { > Assert(!pgstat_is_shutdown); > > + /* > + * Only need to send stats on IO Ops for IO Paths when a process exits, as > + * pg_stat_get_buffers() will read from live backends' PgBackendStatus and > + * then sum this with totals from exited backends persisted by the stats > + * collector. > + */ > + pgstat_send_buffers(); > + > /* > * If we got as far as discovering our own database ID, we can report what > * we did to the collector. Otherwise, we'd be sending an invalid > @@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len) > #endif > } I think it might be nicer to move pgstat_beshutdown_hook() to be a before_shmem_exit(), and do this in there. > +/* > + * Add live IO Op stats for all IO Paths (e.g. shared, local) to those in the > + * equivalent stats structure for exited backends. Note that this adds and > + * doesn't set, so the destination stats structure should be zeroed out by the > + * caller initially. This would commonly be used to transfer all IO Op stats > + * for all IO Paths for a particular backend type to the pgstats structure. > + */ This seems a bit odd. Why not zero it in here? Perhaps it also should be called something like _sum_ instead of _add_? > +void > +pgstat_add_io_path_ops(PgStatIOOps *dest, IOOps *src, int io_path_num_types) > +{ Why is io_path_num_types a parameter? 
> +static void > +pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len) > +{ > + int io_path; > + PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops; > + PgStatIOOps *dest_io_path_ops = > + globalStats.buffers.ops[msg->backend_type].io_path_ops; > + > + for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++) > + { > + PgStatIOOps *src = &src_io_path_ops[io_path]; > + PgStatIOOps *dest = &dest_io_path_ops[io_path]; > + > + dest->allocs += src->allocs; > + dest->extends += src->extends; > + dest->fsyncs += src->fsyncs; > + dest->writes += src->writes; > + } > +} Could this, with a bit of finessing, use pgstat_add_io_path_ops()? > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c What about writes originating in e.g. FlushRelationBuffers()? > bool > -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf) > +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring) > { > + /* > + * If we decide to use the dirty buffer selected by StrategyGetBuffer(), > + * then ensure that we count it as such in pg_stat_buffers view. > + */ > + *from_ring = true; > + Absolutely minor nitpick: Somehow it feels off to talk about the view here. > +PgBackendStatus * > +pgstat_fetch_backend_statuses(void) > +{ > + return BackendStatusArray; > +} Hm, not sure this adds much? > + /* > + * Subtract 1 from backend_type to avoid having rows for B_INVALID > + * BackendType > + */ > + int rownum = (beentry->st_backendType - 1) * IOPATH_NUM_TYPES + io_path; Perhaps worth wrapping this in a macro or inline function? It's repeated and nontrivial. > + /* Add stats from all exited backends */ > + backend_io_path_ops = pgstat_fetch_exited_backend_buffers(); It's probably *not* worth it, but I do wonder if we should do the addition on the SQL level, and actually have two functions, one returning data for exited backends, and one for currently connected ones.
> +static inline void > +pgstat_inc_ioop(IOOp io_op, IOPath io_path) > +{ > + IOOps *io_ops; > + PgBackendStatus *beentry = MyBEEntry; > + > + Assert(beentry); > + > + io_ops = &beentry->io_path_stats[io_path]; > + switch (io_op) > + { > + case IOOP_ALLOC: > + pg_atomic_write_u64(&io_ops->allocs, > + pg_atomic_read_u64(&io_ops->allocs) + 1); > + break; > + case IOOP_EXTEND: > + pg_atomic_write_u64(&io_ops->extends, > + pg_atomic_read_u64(&io_ops->extends) + 1); > + break; > + case IOOP_FSYNC: > + pg_atomic_write_u64(&io_ops->fsyncs, > + pg_atomic_read_u64(&io_ops->fsyncs) + 1); > + break; > + case IOOP_WRITE: > + pg_atomic_write_u64(&io_ops->writes, > + pg_atomic_read_u64(&io_ops->writes) + 1); > + break; > + } > +} IIRC Thomas Munro had a patch adding a nonatomic_add or such somewhere. Perhaps in the recovery readahead thread? Might be worth using instead? Greetings, Andres Freund
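The pg_atomic_write_u64(read + 1) pattern quoted in the review can be sketched with C11 atomics; this is an illustration of the idea, not the pg_atomic API. A plain load-then-store is cheaper than a lock-prefixed fetch-add, and it is correct only under the single-writer assumption pgstat_inc_ioop relies on (each backend increments only its own counters, while readers in other processes still get untorn 64-bit loads).

```c
#include <stdatomic.h>

typedef struct IOOpCounter
{
	/* 64-bit atomic so readers never observe a torn value */
	atomic_ullong writes;
} IOOpCounter;

/*
 * Non-atomic increment of an atomic: a relaxed load plus a relaxed
 * store, avoiding the cost of atomic_fetch_add.  Safe only because a
 * single process ever writes this counter.
 */
static inline void
inc_writes(IOOpCounter *counter)
{
	unsigned long long v =
		atomic_load_explicit(&counter->writes, memory_order_relaxed);

	atomic_store_explicit(&counter->writes, v + 1, memory_order_relaxed);
}
```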
On Fri, Oct 8, 2021 at 1:56 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-10-01 16:05:31 -0400, Melanie Plageman wrote: > > From 40c809ad1127322f3462e85be080c10534485f0d Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman <melanieplageman@gmail.com> > > Date: Fri, 24 Sep 2021 17:39:12 -0400 > > Subject: [PATCH v13 1/4] Allow bootstrap process to beinit > > > > --- > > src/backend/utils/init/postinit.c | 3 +-- > > 1 file changed, 1 insertion(+), 2 deletions(-) > > > > diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c > > index 78bc64671e..fba5864172 100644 > > --- a/src/backend/utils/init/postinit.c > > +++ b/src/backend/utils/init/postinit.c > > @@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, > > EnablePortalManager(); > > > > /* Initialize status reporting */ > > - if (!bootstrap) > > - pgstat_beinit(); > > + pgstat_beinit(); > > > > /* > > * Load relcache entries for the shared system catalogs. This must create > > -- > > 2.27.0 > > > > I think it's good to remove more and more of these !bootstrap cases - they > really make it harder to understand the state of the system at various > points. Optimizing for the rarely executed bootstrap mode at the cost of > checks in very common codepaths... What scope do you suggest for this patch set? A single patch which does this in more locations (remove !bootstrap) or should I remove this patch from the patchset? > > > > > From a709ddb30b2b747beb214f0b13cd1e1816094e6b Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman <melanieplageman@gmail.com> > > Date: Thu, 30 Sep 2021 16:16:22 -0400 > > Subject: [PATCH v13 2/4] Add utility to make tuplestores for pg stat views > > > > Most of the steps to make a tuplestore for those pg_stat views requiring > > one are the same. Consolidate them into a single helper function for > > clarity and to avoid bugs. 
> > --- > > src/backend/utils/adt/pgstatfuncs.c | 129 ++++++++++------------------ > > 1 file changed, 44 insertions(+), 85 deletions(-) > > > > diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c > > index ff5aedc99c..513f5aecf6 100644 > > --- a/src/backend/utils/adt/pgstatfuncs.c > > +++ b/src/backend/utils/adt/pgstatfuncs.c > > @@ -36,6 +36,42 @@ > > > > #define HAS_PGSTAT_PERMISSIONS(role) (is_member_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role)) > > > > +/* > > + * Helper function for views with multiple rows constructed from a tuplestore > > + */ > > +static Tuplestorestate * > > +pg_stat_make_tuplestore(FunctionCallInfo fcinfo, TupleDesc *tupdesc) > > +{ > > + Tuplestorestate *tupstore; > > + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; > > + MemoryContext per_query_ctx; > > + MemoryContext oldcontext; > > + > > + /* check to see if caller supports us returning a tuplestore */ > > + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) > > + ereport(ERROR, > > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > + errmsg("set-valued function called in context that cannot accept a set"))); > > + if (!(rsinfo->allowedModes & SFRM_Materialize)) > > + ereport(ERROR, > > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > + errmsg("materialize mode required, but it is not allowed in this context"))); > > + > > + /* Build a tuple descriptor for our result type */ > > + if (get_call_result_type(fcinfo, NULL, tupdesc) != TYPEFUNC_COMPOSITE) > > + elog(ERROR, "return type must be a row type"); > > + > > + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; > > + oldcontext = MemoryContextSwitchTo(per_query_ctx); > > + > > + tupstore = tuplestore_begin_heap(true, false, work_mem); > > + rsinfo->returnMode = SFRM_Materialize; > > + rsinfo->setResult = tupstore; > > + rsinfo->setDesc = *tupdesc; > > + MemoryContextSwitchTo(oldcontext); > > + return tupstore; > > +} > > Is pgstattuple the 
best place for this helper? It's not really pgstatfuncs > specific... > > It also looks vaguely familiar - I wonder if we have a helper roughly like > this somewhere else already... > I don't see a function which is specifically a utility to make a tuplestore. Looking at the callers of tuplestore_begin_heap(), I notice very similar code to the function I added in pg_tablespace_databases() in utils/adt/misc.c, pg_stop_backup_v2() in xlogfuncs.c, pg_event_trigger_dropped_objects() and pg_event_trigger_ddl_commands in event_tigger.c, pg_available_extensions in extension.c, etc. Do you think it makes sense to refactor this code out of all of these places? If so, where would such a utility function belong? > > > > From e9a5d2a021d429fdbb2daa58ab9d75a069f334d4 Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman <melanieplageman@gmail.com> > > Date: Wed, 29 Sep 2021 15:39:45 -0400 > > Subject: [PATCH v13 3/4] Add system view tracking IO ops per backend type > > > > > diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c > > index be7366379d..0d18e7f71a 100644 > > --- a/src/backend/postmaster/checkpointer.c > > +++ b/src/backend/postmaster/checkpointer.c > > @@ -1104,6 +1104,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type) > > */ > > if (!AmBackgroundWriterProcess()) > > CheckpointerShmem->num_backend_fsync++; > > + pgstat_inc_ioop(IOOP_FSYNC, IOPATH_SHARED); > > LWLockRelease(CheckpointerCommLock); > > return false; > > } > > ISTM this doens't need to happen while holding CheckpointerCommLock? > Fixed in attached updates. I only attached the diff from my previous version. 
> > > > @@ -1461,7 +1467,25 @@ pgstat_reset_shared_counters(const char *target) > > errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\"."))); > > > > pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER); > > - pgstat_send(&msg, sizeof(msg)); > > + > > + if (msg.m_resettarget == RESET_BUFFERS) > > + { > > + int backend_type; > > + PgStatIOPathOps ops[BACKEND_NUM_TYPES]; > > + > > + memset(ops, 0, sizeof(ops)); > > + pgstat_report_live_backend_io_path_ops(ops); > > + > > + for (backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++) > > + { > > + msg.m_backend_resets.backend_type = backend_type; > > + memcpy(&msg.m_backend_resets.iop, &ops[backend_type], sizeof(msg.m_backend_resets.iop)); > > + pgstat_send(&msg, sizeof(msg)); > > + } > > + } > > + else > > + pgstat_send(&msg, sizeof(msg)); > > + > > } > > I'd perhaps put this in a small helper function. > Done. > > > /* ---------- > > * pgstat_fetch_stat_dbentry() - > > @@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg) > > { > > Assert(!pgstat_is_shutdown); > > > > + /* > > + * Only need to send stats on IO Ops for IO Paths when a process exits, as > > + * pg_stat_get_buffers() will read from live backends' PgBackendStatus and > > + * then sum this with totals from exited backends persisted by the stats > > + * collector. > > + */ > > + pgstat_send_buffers(); > > + > > /* > > * If we got as far as discovering our own database ID, we can report what > > * we did to the collector. Otherwise, we'd be sending an invalid > > @@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len) > > #endif > > } > > I think it might be nicer to move pgstat_beshutdown_hook() to be a > before_shmem_exit(), and do this in there. > Not really sure the correct way to do this. A cursory attempt to do so failed because ShutdownXLOG() is also registered as a before_shmem_exit() and ends up being called after pgstat_beshutdown_hook(). 
pgstat_beshutdown_hook() zeroes out PgBackendStatus, ShutdownXLOG() initiates a checkpoint, and during a checkpoint, the checkpointer increments IO op counter for writes in its PgBackendStatus. > > > +/* > > + * Add live IO Op stats for all IO Paths (e.g. shared, local) to those in the > > + * equivalent stats structure for exited backends. Note that this adds and > > + * doesn't set, so the destination stats structure should be zeroed out by the > > + * caller initially. This would commonly be used to transfer all IO Op stats > > + * for all IO Paths for a particular backend type to the pgstats structure. > > + */ > > This seems a bit odd. Why not zero it in here? Perhaps it also should be > called something like _sum_ instead of _add_? > I wanted to be able to use the function both when it was setting the values and when it needed to add to the values (which are the two current callers). I have changed the name from add -> sum. > > > +void > > +pgstat_add_io_path_ops(PgStatIOOps *dest, IOOps *src, int io_path_num_types) > > +{ > > Why is io_path_num_types a parameter? > I imagined that maybe another caller would want to only add some IO path types and still use the function, but I think it is more confusing than anything else so I've changed it. > > > +static void > > +pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len) > > +{ > > + int io_path; > > + PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops; > > + PgStatIOOps *dest_io_path_ops = > > + globalStats.buffers.ops[msg->backend_type].io_path_ops; > > + > > + for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++) > > + { > > + PgStatIOOps *src = &src_io_path_ops[io_path]; > > + PgStatIOOps *dest = &dest_io_path_ops[io_path]; > > + > > + dest->allocs += src->allocs; > > + dest->extends += src->extends; > > + dest->fsyncs += src->fsyncs; > > + dest->writes += src->writes; > > + } > > +} > > Could this, with a bit of finessing, use pgstat_add_io_path_ops()? 
> I didn't really see a good way to do this -- given that pgstat_add_io_path_ops() adds IOOps members to PgStatIOOps members -- which requires a pg_atomic_read_u64() and pgstat_recv_io_path_ops adds PgStatIOOps to PgStatIOOps which doesn't require pg_atomic_read_u64(). Maybe I could pass a flag which, based on the type, either does or doesn't use pg_atomic_read_u64 to access the value? But that seems worse to me. > > > --- a/src/backend/storage/buffer/bufmgr.c > > +++ b/src/backend/storage/buffer/bufmgr.c > > What about writes originating in places like FlushRelationBuffers()? > Yes, I have made IOPath a parameter to FlushBuffer() so that it can distinguish between strategy buffer writes and shared buffer writes and then pushed pgstat_inc_ioop() into FlushBuffer(). > > > bool > > -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf) > > +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *from_ring) > > { > > + /* > > + * If we decide to use the dirty buffer selected by StrategyGetBuffer(), > > + * then ensure that we count it as such in pg_stat_buffers view. > > + */ > > + *from_ring = true; > > + > > Absolutely minor nitpick: Somehow it feels off to talk about the view here. Fixed. > > > > +PgBackendStatus * > > +pgstat_fetch_backend_statuses(void) > > +{ > > + return BackendStatusArray; > > +} > > Hm, not sure this adds much? Is there a better way to access the whole BackendStatusArray from within pgstatfuncs.c? > > > > + /* > > + * Subtract 1 from backend_type to avoid having rows for B_INVALID > > + * BackendType > > + */ > > + int rownum = (beentry->st_backendType - 1) * IOPATH_NUM_TYPES + io_path; > > > Perhaps worth wrapping this in a macro or inline function? It's repeated and nontrivial. > Done.
> > > + /* Add stats from all exited backends */ > > + backend_io_path_ops = pgstat_fetch_exited_backend_buffers(); > > It's probably *not* worth it, but I do wonder if we should do the addition on the SQL > level, and actually have two functions, one returning data for exited > backends, and one for currently connected ones. > It would be easy enough to implement. I would defer to others on whether or not this would be useful. My use case for pg_stat_buffers() is to see what IO backends do during a benchmark or test workload. For that, I reset the stats before and then query pg_stat_buffers after running the benchmark. I don't know if I would use exited and live stats individually. In a real workload, I could see using pg_stat_buffers live and exited to see if a workload that causes lots of backends to do their own writes is ongoing. Though a given workload may be composed of lots of different queries, with backends exiting throughout. > > > +static inline void > > +pgstat_inc_ioop(IOOp io_op, IOPath io_path) > > +{ > > + IOOps *io_ops; > > + PgBackendStatus *beentry = MyBEEntry; > > + > > + Assert(beentry); > > + > > + io_ops = &beentry->io_path_stats[io_path]; > > + switch (io_op) > > + { > > + case IOOP_ALLOC: > > + pg_atomic_write_u64(&io_ops->allocs, > > + pg_atomic_read_u64(&io_ops->allocs) + 1); > > + break; > > + case IOOP_EXTEND: > > + pg_atomic_write_u64(&io_ops->extends, > > + pg_atomic_read_u64(&io_ops->extends) + 1); > > + break; > > + case IOOP_FSYNC: > > + pg_atomic_write_u64(&io_ops->fsyncs, > > + pg_atomic_read_u64(&io_ops->fsyncs) + 1); > > + break; > > + case IOOP_WRITE: > > + pg_atomic_write_u64(&io_ops->writes, > > + pg_atomic_read_u64(&io_ops->writes) + 1); > > + break; > > + } > > +} > > IIRC Thomas Munro had a patch adding a nonatomic_add or such > somewhere. Perhaps in the recovery readahead thread? Might be worth using > instead? > I've added Thomas' function in a separate commit.
I looked for a better place to add it (I was thinking somewhere in src/backend/utils/misc) but couldn't find anywhere that made sense. I also added a call to pgstat_inc_ioop() in ProcessSyncRequests() to capture when the checkpointer does fsyncs. I also added pgstat_inc_ioop() calls to callers of smgrwrite() flushing local buffers. I don't know if that is desirable or not in this patch. They could be removed if wrappers for smgrwrite() go in and pgstat_inc_ioop() can be called from within those wrappers. - Melanie
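As an aside for readers following the pgstat_inc_ioop() discussion above: the unlocked read-then-write increment is safe only because each backend is the sole writer of its own counters. A minimal standalone sketch of that single-writer pattern, using C11 atomics rather than PostgreSQL's pg_atomic API (all type and function names here are illustrative, not the patch's actual definitions):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for per-backend IO counters: each "backend"
 * owns one slot and is the only writer; readers may load concurrently. */
typedef struct IOOpsSketch
{
    atomic_uint_fast64_t allocs;
    atomic_uint_fast64_t writes;
} IOOpsSketch;

/* Mirrors the patch's pattern: a plain load + store is enough, since
 * only the owning backend ever writes this counter. */
static void
inc_counter_sketch(atomic_uint_fast64_t *counter)
{
    uint_fast64_t cur = atomic_load_explicit(counter, memory_order_relaxed);

    atomic_store_explicit(counter, cur + 1, memory_order_relaxed);
}

uint64_t
demo_increments(int n)
{
    IOOpsSketch io = {0};

    for (int i = 0; i < n; i++)
        inc_counter_sketch(&io.writes);
    return (uint64_t) atomic_load(&io.writes);
}
```

Concurrent readers may see a slightly stale count, but never a torn 64-bit value, which is the guarantee a stats view needs.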
Attachment
Hi, On 2021-10-11 16:48:01 -0400, Melanie Plageman wrote: > On Fri, Oct 8, 2021 at 1:56 PM Andres Freund <andres@anarazel.de> wrote: > > On 2021-10-01 16:05:31 -0400, Melanie Plageman wrote: > > > diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c > > > index 78bc64671e..fba5864172 100644 > > > --- a/src/backend/utils/init/postinit.c > > > +++ b/src/backend/utils/init/postinit.c > > > @@ -670,8 +670,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, > > > EnablePortalManager(); > > > > > > /* Initialize status reporting */ > > > - if (!bootstrap) > > > - pgstat_beinit(); > > > + pgstat_beinit(); > > > > > > /* > > > * Load relcache entries for the shared system catalogs. This must create > > > -- > > > 2.27.0 > > > > > > > I think it's good to remove more and more of these !bootstrap cases - they > > really make it harder to understand the state of the system at various > > points. Optimizing for the rarely executed bootstrap mode at the cost of > > checks in very common codepaths... > > What scope do you suggest for this patch set? A single patch which does > this in more locations (remove !bootstrap) or should I remove this patch > from the patchset? I think the scope is fine as-is. > > Is pgstattuple the best place for this helper? It's not really pgstatfuncs > > specific... > > > > It also looks vaguely familiar - I wonder if we have a helper roughly like > > this somewhere else already... > > > > I don't see a function which is specifically a utility to make a > tuplestore. Looking at the callers of tuplestore_begin_heap(), I notice > very similar code to the function I added in pg_tablespace_databases() > in utils/adt/misc.c, pg_stop_backup_v2() in xlogfuncs.c, > pg_event_trigger_dropped_objects() and pg_event_trigger_ddl_commands in > event_tigger.c, pg_available_extensions in extension.c, etc. > > Do you think it makes sense to refactor this code out of all of these > places? 
Yes, I think it'd make sense. We have about 40 copies of this stuff, which is fairly ridiculous. > If so, where would such a utility function belong? Not quite sure. src/backend/utils/fmgr/funcapi.c maybe? I suggest starting a separate thread about that... > > > @@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg) > > > { > > > Assert(!pgstat_is_shutdown); > > > > > > + /* > > > + * Only need to send stats on IO Ops for IO Paths when a process exits, as > > > + * pg_stat_get_buffers() will read from live backends' PgBackendStatus and > > > + * then sum this with totals from exited backends persisted by the stats > > > + * collector. > > > + */ > > > + pgstat_send_buffers(); > > > + > > > /* > > > * If we got as far as discovering our own database ID, we can report what > > > * we did to the collector. Otherwise, we'd be sending an invalid > > > @@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len) > > > #endif > > > } > > > > I think it might be nicer to move pgstat_beshutdown_hook() to be a > > before_shmem_exit(), and do this in there. > > > > Not really sure the correct way to do this. A cursory attempt to do so > failed because ShutdownXLOG() is also registered as a > before_shmem_exit() and ends up being called after > pgstat_beshutdown_hook(). pgstat_beshutdown_hook() zeroes out > PgBackendStatus, ShutdownXLOG() initiates a checkpoint, and during a > checkpoint, the checkpointer increments IO op counter for writes in its > PgBackendStatus. I think we'll really need to do a proper redesign of the shutdown callback mechanism :(. 
> > > +static void > > > +pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len) > > > +{ > > > + int io_path; > > > + PgStatIOOps *src_io_path_ops = msg->iop.io_path_ops; > > > + PgStatIOOps *dest_io_path_ops = > > > + globalStats.buffers.ops[msg->backend_type].io_path_ops; > > > + > > > + for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++) > > > + { > > > + PgStatIOOps *src = &src_io_path_ops[io_path]; > > > + PgStatIOOps *dest = &dest_io_path_ops[io_path]; > > > + > > > + dest->allocs += src->allocs; > > > + dest->extends += src->extends; > > > + dest->fsyncs += src->fsyncs; > > > + dest->writes += src->writes; > > > + } > > > +} > > > > Could this, with a bit of finessing, use pgstat_add_io_path_ops()? > > > > I didn't really see a good way to do this -- given that > pgstat_add_io_path_ops() adds IOOps members to PgStatIOOps members -- > which requires a pg_atomic_read_u64() and pgstat_recv_io_path_ops adds > PgStatIOOps to PgStatIOOps which doesn't require pg_atomic_read_u64(). > Maybe I could pass a flag which, based on the type, either does or > doesn't use pg_atomic_read_u64 to access the value? But that seems worse > to me. Yea, you're probably right, that's worse. > > > +PgBackendStatus * > > > +pgstat_fetch_backend_statuses(void) > > > +{ > > > + return BackendStatusArray; > > > +} > > > > Hm, not sure this adds much? > > Is there a better way to access the whole BackendStatusArray from within > pgstatfuncs.c? Export the variable itself? > > IIRC Thomas Munro had a patch adding a nonatomic_add or such > > somewhere. Perhaps in the recovery readahead thread? Might be worth using > > instead? > > > > I've added Thomas' function in a separate commit. I looked for a better > place to add it (I was thinking somewhere in src/backend/utils/misc) but > couldn't find anywhere that made sense. I think it should just live in atomics.h? > I also added pgstat_inc_ioop() calls to callers of smgrwrite() flushing local > buffers. 
I don't know if that is desirable or not in this patch. They could be > removed if wrappers for smgrwrite() go in and pgstat_inc_ioop() can be called > from within those wrappers. Makes sense to me to to have it here. Greetings, Andres Freund
v14 attached. On Tue, Oct 19, 2021 at 3:29 PM Andres Freund <andres@anarazel.de> wrote: > > > > > Is pgstattuple the best place for this helper? It's not really pgstatfuncs > > > specific... > > > > > > It also looks vaguely familiar - I wonder if we have a helper roughly like > > > this somewhere else already... > > > > > > > I don't see a function which is specifically a utility to make a > > tuplestore. Looking at the callers of tuplestore_begin_heap(), I notice > > very similar code to the function I added in pg_tablespace_databases() > > in utils/adt/misc.c, pg_stop_backup_v2() in xlogfuncs.c, > > pg_event_trigger_dropped_objects() and pg_event_trigger_ddl_commands in > > event_tigger.c, pg_available_extensions in extension.c, etc. > > > > Do you think it makes sense to refactor this code out of all of these > > places? > > Yes, I think it'd make sense. We have about 40 copies of this stuff, which is > fairly ridiculous. > > > > If so, where would such a utility function belong? > > Not quite sure. src/backend/utils/fmgr/funcapi.c maybe? I suggest starting a > separate thread about that... > done [1]. also, I dropped that commit from this patchset. > > > > > @@ -2999,6 +3036,14 @@ pgstat_shutdown_hook(int code, Datum arg) > > > > { > > > > Assert(!pgstat_is_shutdown); > > > > > > > > + /* > > > > + * Only need to send stats on IO Ops for IO Paths when a process exits, as > > > > + * pg_stat_get_buffers() will read from live backends' PgBackendStatus and > > > > + * then sum this with totals from exited backends persisted by the stats > > > > + * collector. > > > > + */ > > > > + pgstat_send_buffers(); > > > > + > > > > /* > > > > * If we got as far as discovering our own database ID, we can report what > > > > * we did to the collector. 
Otherwise, we'd be sending an invalid > > > > @@ -3092,6 +3137,30 @@ pgstat_send(void *msg, int len) > > > > #endif > > > > } > > > > > > I think it might be nicer to move pgstat_beshutdown_hook() to be a > > > before_shmem_exit(), and do this in there. > > > > > > > Not really sure the correct way to do this. A cursory attempt to do so > > failed because ShutdownXLOG() is also registered as a > > before_shmem_exit() and ends up being called after > > pgstat_beshutdown_hook(). pgstat_beshutdown_hook() zeroes out > > PgBackendStatus, ShutdownXLOG() initiates a checkpoint, and during a > > checkpoint, the checkpointer increments IO op counter for writes in its > > PgBackendStatus. > > I think we'll really need to do a proper redesign of the shutdown callback > mechanism :(. > I've left what I originally had, then. > > > > > > +PgBackendStatus * > > > > +pgstat_fetch_backend_statuses(void) > > > > +{ > > > > + return BackendStatusArray; > > > > +} > > > > > > Hm, not sure this adds much? > > > > Is there a better way to access the whole BackendStatusArray from within > > pgstatfuncs.c? > > Export the variable itself? > done but wasn't sure about PGDLLIMPORT > > > > IIRC Thomas Munro had a patch adding a nonatomic_add or such > > > somewhere. Perhaps in the recovery readahead thread? Might be worth using > > > instead? > > > > > > > I've added Thomas' function in a separate commit. I looked for a better > > place to add it (I was thinking somewhere in src/backend/utils/misc) but > > couldn't find anywhere that made sense. > > I think it should just live in atomics.h? > done -- melanie [1] https://www.postgresql.org/message-id/flat/CAAKRu_azyd1Z3W_r7Ou4sorTjRCs%2BPxeHw1CWJeXKofkE6TuZg%40mail.gmail.com
Attachment
Hi, On 2021-11-02 15:26:52 -0400, Melanie Plageman wrote: > Subject: [PATCH v14 1/4] Allow bootstrap process to beinit Pushed. > +/* > + * On modern systems this is really just *counter++. On some older systems > + * there might be more to it, due to inability to read and write 64 bit values > + * atomically. > + */ > +static inline void inc_counter(pg_atomic_uint64 *counter) > +{ > + pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1); > +} > + > #undef INSIDE_ATOMICS_H Why is this using a completely different naming scheme from the rest of the file? > doc/src/sgml/monitoring.sgml | 116 +++++++++++++- > src/backend/catalog/system_views.sql | 11 ++ > src/backend/postmaster/checkpointer.c | 3 +- > src/backend/postmaster/pgstat.c | 161 +++++++++++++++++++- > src/backend/storage/buffer/bufmgr.c | 46 ++++-- > src/backend/storage/buffer/freelist.c | 23 ++- > src/backend/storage/buffer/localbuf.c | 3 + > src/backend/storage/sync/sync.c | 1 + > src/backend/utils/activity/backend_status.c | 60 +++++++- > src/backend/utils/adt/pgstatfuncs.c | 152 ++++++++++++++++++ > src/include/catalog/pg_proc.dat | 9 ++ > src/include/miscadmin.h | 2 + > src/include/pgstat.h | 53 +++++++ > src/include/storage/buf_internals.h | 4 +- > src/include/utils/backend_status.h | 80 ++++++++++ > src/test/regress/expected/rules.out | 8 + > 16 files changed, 701 insertions(+), 31 deletions(-) This is a pretty large change, I wonder if there's a way to make it a bit more granular. Greetings, Andres Freund
On Fri, Nov 19, 2021 at 11:49 AM Andres Freund <andres@anarazel.de> wrote: > > +/* > > + * On modern systems this is really just *counter++. On some older systems > > + * there might be more to it, due to inability to read and write 64 bit values > > + * atomically. > > + */ > > +static inline void inc_counter(pg_atomic_uint64 *counter) > > +{ > > + pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1); > > +} > > + > > #undef INSIDE_ATOMICS_H > > Why is this using a completely different naming scheme from the rest of the > file? It was what Thomas originally named it. Also, I noticed all the other pg_atomic* in this file were wrappers around the same impl function, so I thought maybe naming it this way would be confusing. I renamed it to pg_atomic_inc_counter(), though maybe pg_atomic_readonly_write() would be better? > > > doc/src/sgml/monitoring.sgml | 116 +++++++++++++- > > src/backend/catalog/system_views.sql | 11 ++ > > src/backend/postmaster/checkpointer.c | 3 +- > > src/backend/postmaster/pgstat.c | 161 +++++++++++++++++++- > > src/backend/storage/buffer/bufmgr.c | 46 ++++-- > > src/backend/storage/buffer/freelist.c | 23 ++- > > src/backend/storage/buffer/localbuf.c | 3 + > > src/backend/storage/sync/sync.c | 1 + > > src/backend/utils/activity/backend_status.c | 60 +++++++- > > src/backend/utils/adt/pgstatfuncs.c | 152 ++++++++++++++++++ > > src/include/catalog/pg_proc.dat | 9 ++ > > src/include/miscadmin.h | 2 + > > src/include/pgstat.h | 53 +++++++ > > src/include/storage/buf_internals.h | 4 +- > > src/include/utils/backend_status.h | 80 ++++++++++ > > src/test/regress/expected/rules.out | 8 + > > 16 files changed, 701 insertions(+), 31 deletions(-) > > This is a pretty large change, I wonder if there's a way to make it a bit more > granular. > I have done this. See v15 patch set attached. - Melanie
Attachment
- v15-0007-small-comment-correction.patch
- v15-0004-Add-buffers-to-pgstat_reset_shared_counters.patch
- v15-0006-Remove-superfluous-bgwriter-stats.patch
- v15-0003-Send-IO-operations-to-stats-collector.patch
- v15-0005-Add-system-view-tracking-IO-ops-per-backend-type.patch
- v15-0002-Add-IO-operation-counters-to-PgBackendStatus.patch
- v15-0001-Read-only-atomic-backend-write-function.patch
Thanks for working on this. I was just trying to find something like "pg_stat_checkpointer". You wrote beentry++ at the start of two loops, but I think that's wrong; it should be at the end, as in the rest of the file (or as a loop increment). BackendStatusArray[0] is actually used (even though its backend has backendId==1, not 0). "MyBEEntry = &BackendStatusArray[MyBackendId - 1];" You could put *_NUM_TYPES as the last value in these enums, like NUM_AUXPROCTYPES, NUM_PMSIGNALS, and NUM_PROCSIGNALS: +#define IOOP_NUM_TYPES (IOOP_WRITE + 1) +#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1) +#define BACKEND_NUM_TYPES (B_LOGGER + 1) There's extraneous blank lines in these functions: +pgstat_sum_io_path_ops +pgstat_report_live_backend_io_path_ops +pgstat_recv_resetsharedcounter +GetIOPathDesc +StrategyRejectBuffer This function is doubly-indented: +pgstat_send_buffers_reset As support for C99 is now required by postgres, variables can be declared as part of various loops. + int io_path; + for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++) Rather than memset(), you could initialize msg like this. PgStat_MsgIOPathOps msg = {0}; +pgstat_send_buffers(void) +{ + PgStat_MsgIOPathOps msg; + + PgBackendStatus *beentry = MyBEEntry; + + if (!beentry) + return; + + memset(&msg, 0, sizeof(msg)); -- Justin
On Wed, Nov 24, 2021 at 07:15:59PM -0600, Justin Pryzby wrote: > There's extraneous blank lines in these functions: > > +pgstat_sum_io_path_ops > +pgstat_report_live_backend_io_path_ops > +pgstat_recv_resetsharedcounter > +GetIOPathDesc > +StrategyRejectBuffer + an extra blank line in pgstat_reset_shared_counters. In 0005: monitoring.sgml says that the columns in pg_stat_buffers are integers, but they're actually bigint. + tupstore = tuplestore_begin_heap(true, false, work_mem); You're passing a constant randomAccess=true to tuplestore_begin_heap ;) +Datum all_values[NROWS][COLUMN_LENGTH]; If you were to allocate this as an array, I think it could actually be 3-D: Datum all_values[BACKEND_NUM_TYPES-1][IOPATH_NUM_TYPES][COLUMN_LENGTH]; But I don't know if this is portable across postgres' supported platforms; I haven't seen any place which allocates a multidimensional array on the stack, nor passes one to a function: +static inline Datum * +get_pg_stat_buffers_row(Datum all_values[NROWS][COLUMN_LENGTH], BackendType backend_type, IOPath io_path) Maybe the allocation half is okay (I think it's ~3kB), but it seems easier to palloc the required amount than to research compiler behavior. That function is only used as a one-line helper, and doesn't use multidimensional array access anyway: + return all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path]; I think it'd be better as a macro, like (I think) #define ROW(backend_type, io_path) all_values[IOPATH_NUM_TYPES*(backend_type-1)+io_path] Maybe it should take the column type as a 3rd arg. The enum with COLUMN_LENGTH should be named. Or maybe it should be removed, and the enum names moved to comments, like: + /* backend_type */ + values[val++] = backend_type_desc; + /* io_path */ + values[val++] = CStringGetTextDatum(GetIOPathDesc(io_path)); + /* allocs */ + values[val++] += io_ops->allocs - resets->allocs; ... *Note the use of += and not =.
Also: src/include/miscadmin.h:#define BACKEND_NUM_TYPES (B_LOGGER + 1) I think it's wrong to say NUM_TYPES = B_LOGGER + 1 (which would suggest using less-than-or-equal instead of less-than as you are). Since the valid backend types start at 1, the "count" of backend types is currently B_LOGGER (13) - not 14. I think you should remove the "+1" here. Then NROWS (if it continued to exist at all) wouldn't need to subtract one. -- Justin
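The ROW() accessor suggested above can be sketched as a standalone example; the sizes are illustrative, and long stands in for Datum so it compiles outside the backend:

```c
/* Illustrative dimensions, not the real PostgreSQL values. */
#define SKETCH_NUM_BACKEND_TYPES 4
#define SKETCH_NUM_IO_PATHS      2
#define SKETCH_COLUMN_LENGTH     3

/* One flat array of rows, one row per (backend_type, io_path) pair;
 * backend types are 1-based, so row 0 belongs to backend type 1. */
static long all_values[SKETCH_NUM_BACKEND_TYPES * SKETCH_NUM_IO_PATHS]
                      [SKETCH_COLUMN_LENGTH];

#define ROW(backend_type, io_path) \
    all_values[((backend_type) - 1) * SKETCH_NUM_IO_PATHS + (io_path)]

long
fill_and_read(void)
{
    /* Column 0 of the row for backend type 2, io path 1. */
    ROW(2, 1)[0] = 42;
    return ROW(2, 1)[0];
}
```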
Thanks for the review! On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > You wrote beentry++ at the start of two loops, but I think that's wrong; it > should be at the end, as in the rest of the file (or as a loop increment). > BackendStatusArray[0] is actually used (even though its backend has > backendId==1, not 0). "MyBEEntry = &BackendStatusArray[MyBackendId - 1];" I've fixed this in v16 which I will attach to the next email in the thread. > You could put *_NUM_TYPES as the last value in these enums, like > NUM_AUXPROCTYPES, NUM_PMSIGNALS, and NUM_PROCSIGNALS: > > +#define IOOP_NUM_TYPES (IOOP_WRITE + 1) > +#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1) > +#define BACKEND_NUM_TYPES (B_LOGGER + 1) I originally had it as you describe, but based on this feedback upthread from Álvaro Herrera: > (It's weird to have enum values that are there just to indicate what's > the maximum value. I think that sort of thing is better done by having > a "#define LAST_THING" that takes the last valid value from the enum. > That would free you from having to handle the last value in switch > blocks, for example. LAST_OCLASS in dependency.h is a precedent on this.) So, I changed it to use macros. > There's extraneous blank lines in these functions: > > +pgstat_sum_io_path_ops Fixed > +pgstat_report_live_backend_io_path_ops I didn't see one here > +pgstat_recv_resetsharedcounter I didn't see one here > +GetIOPathDesc Fixed > +StrategyRejectBuffer Fixed > This function is doubly-indented: > > +pgstat_send_buffers_reset Fixed. Thanks for catching this. I also ran pgindent and manually picked a few of the formatting fixes that were relevant to code I added. > > As support for C99 is now required by postgres, variables can be declared as > part of various loops. > > + int io_path; > + for (io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++) Fixed this and all other occurrences in my code. > Rather than memset(), you could initialize msg like this. 
> PgStat_MsgIOPathOps msg = {0}; > > +pgstat_send_buffers(void) > +{ > + PgStat_MsgIOPathOps msg; > + > + PgBackendStatus *beentry = MyBEEntry; > + > + if (!beentry) > + return; > + > + memset(&msg, 0, sizeof(msg)); > though changing the initialization to universal zero initialization seems to be the correct way, I do get this compiler warning when I make the change pgstat.c:3212:29: warning: suggest braces around initialization of subobject [-Wmissing-braces] PgStat_MsgIOPathOps msg = {0}; ^ {} I have seen some comments online that say that this is a spurious warning present with some versions of both gcc and clang when using -Wmissing-braces to compile code with universal zero initialization, but I'm not sure what I should do. v16 attached in next message - Melanie
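For reference, a minimal standalone reproduction of the warning scenario: some gcc/clang versions emit -Wmissing-braces for "= {0}" when the struct's first member is itself an aggregate, and bracing that member explicitly (or using memset) keeps the same zero-initialization without the warning. The types below are hypothetical stand-ins, not the real PgStat_MsgIOPathOps layout:

```c
/* A struct whose first member is itself a struct -- the shape that
 * trips -Wmissing-braces on some compilers when written "= {0}". */
typedef struct Header
{
    int type;
    int len;
} Header;

typedef struct Msg
{
    Header hdr;
    long   counts[4];
} Msg;

long
sum_msg_counts(void)
{
    /* "= {{0}}" braces the first (struct) member explicitly; everything
     * else is still zero-initialized per C's aggregate-init rules. */
    Msg  msg = {{0}};
    long total = msg.hdr.type + msg.hdr.len;

    for (int i = 0; i < 4; i++)
        total += msg.counts[i];
    return total;
}
```

Both spellings produce identical object code; only the diagnostic differs.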
v16 (also rebased) attached On Fri, Nov 26, 2021 at 4:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Wed, Nov 24, 2021 at 07:15:59PM -0600, Justin Pryzby wrote: > > There's extraneous blank lines in these functions: > > > > +pgstat_sum_io_path_ops > > +pgstat_report_live_backend_io_path_ops > > +pgstat_recv_resetsharedcounter > > +GetIOPathDesc > > +StrategyRejectBuffer > > + an extra blank line pgstat_reset_shared_counters. Fixed > > In 0005: > > monitoring.sgml says that the columns in pg_stat_buffers are integers, but > they're actually bigint. Fixed > > + tupstore = tuplestore_begin_heap(true, false, work_mem); > > You're passing a constant randomAccess=true to tuplestore_begin_heap ;) Fixed > > +Datum all_values[NROWS][COLUMN_LENGTH]; > > If you were to allocate this as an array, I think it could actually be 3-D: > Datum all_values[BACKEND_NUM_TYPES-1][IOPATH_NUM_TYPES][COLUMN_LENGTH]; I've changed this to a 3D array as you suggested and removed the NROWS macro. > But I don't know if this is portable across postgres' supported platforms; I > haven't seen any place which allocates a multidimensional array on the stack, > nor passes one to a function: > > +static inline Datum * > +get_pg_stat_buffers_row(Datum all_values[NROWS][COLUMN_LENGTH], BackendType backend_type, IOPath io_path) > > Maybe the allocation half is okay (I think it's ~3kB), but it seems easier to > palloc the required amount than to research compiler behavior. I think passing it to the function is okay. The parameter type would be adjusted from an array to a pointer. I am not sure if the allocation on the stack in the body of pg_stat_get_buffers is too large. 
(left as is for now) > That function is only used as a one-line helper, and doesn't use > multidimensional array access anyway: > > + return all_values[(backend_type - 1) * IOPATH_NUM_TYPES + io_path]; with your suggested changes to a 3D array, it now does use multidimensional array access > I think it'd be better as a macro, like (I think) > #define ROW(backend_type, io_path) all_values[NROWS*(backend_type-1)+io_path] If I am understanding the idea of the macro, it would change the call site from this: +Datum *values = get_pg_stat_buffers_row(all_values, beentry->st_backendType, io_path); +values[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs); +values[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs); to this: +Datum *row = ROW(beentry->st_backendType, io_path); +row[COLUMN_ALLOCS] += pg_atomic_read_u64(&io_ops->allocs); +row[COLUMN_FSYNCS] += pg_atomic_read_u64(&io_ops->fsyncs); I usually prefer functions to macros, but I am fine with changing it. (I did not change it in this version) I have changed all the local variables from "values" to "row" which I think is a bit clearer. > Maybe it should take the column type as a 3 arg. If I am understanding this idea, the call site would look like this now: +CELL(beentry->st_backendType, io_path, COLUMN_FSYNCS) += pg_atomic_read_u64(&io_ops->fsyncs); +CELL(beentry->st_backendType, io_path, COLUMN_ALLOCS) += pg_atomic_read_u64(&io_ops->allocs); I don't like this as much. Since this code is inside of a loop, it kind of makes sense to me that you get a row at the top of the loop and then fill in all the cells in the row using that "row" variable. > The enum with COLUMN_LENGTH should be named. I only use the values in it, so it didn't need a name. 
> Or maybe it should be removed, and the enum names moved to comments, like: > > + /* backend_type */ > + values[val++] = backend_type_desc; > > + /* io_path */ > + values[val++] = CStringGetTextDatum(GetIOPathDesc(io_path)); > > + /* allocs */ > + values[val++] += io_ops->allocs - resets->allocs; > ... I find it easier to understand with it in code instead of as a comment. > *Note the use of += and not =. Thanks for seeing this. I have changed this (to use +=). > Also: > src/include/miscadmin.h:#define BACKEND_NUM_TYPES (B_LOGGER + 1) > > I think it's wrong to say NUM_TYPES = B_LOGGER + 1 (which would suggest using > lessthan-or-equal instead of lessthan as you are). > > Since the valid backend types start at 1 , the "count" of backend types is > currently B_LOGGER (13) - not 14. I think you should remove the "+1" here. > Then NROWS (if it continued to exist at all) wouldn't need to subtract one. I think what I currently have is technically correct because I start at 1 when I am using it as a loop condition. I do waste a spot in the arrays I allocate with BACKEND_NUM_TYPES size. I was hesitant to make the value of BACKEND_NUM_TYPES == B_LOGGER because it seems kind of weird to have it have the same value as the B_LOGGER enum. I am open to changing it. (I didn't change it in this v16). - Melanie
Attachment
- v16-0006-Remove-superfluous-bgwriter-stats.patch
- v16-0004-Add-buffers-to-pgstat_reset_shared_counters.patch
- v16-0007-small-comment-correction.patch
- v16-0005-Add-system-view-tracking-IO-ops-per-backend-type.patch
- v16-0003-Send-IO-operations-to-stats-collector.patch
- v16-0002-Add-IO-operation-counters-to-PgBackendStatus.patch
- v16-0001-Read-only-atomic-backend-write-function.patch
On Wed, Dec 01, 2021 at 05:00:14PM -0500, Melanie Plageman wrote: > > Also: > > src/include/miscadmin.h:#define BACKEND_NUM_TYPES (B_LOGGER + 1) > > > > I think it's wrong to say NUM_TYPES = B_LOGGER + 1 (which would suggest using > > lessthan-or-equal instead of lessthan as you are). > > > > Since the valid backend types start at 1 , the "count" of backend types is > > currently B_LOGGER (13) - not 14. I think you should remove the "+1" here. > > Then NROWS (if it continued to exist at all) wouldn't need to subtract one. > > I think what I currently have is technically correct because I start at > 1 when I am using it as a loop condition. I do waste a spot in the > arrays I allocate with BACKEND_NUM_TYPES size. > > I was hesitant to make the value of BACKEND_NUM_TYPES == B_LOGGER > because it seems kind of weird to have it have the same value as the > B_LOGGER enum. I don't mean to say that the code is misbehaving - I mean "num_x" means "the number of x's" - how many there are. Since the first, valid backend type is 1, and they're numbered consecutively and without duplicates, then "the number of backend types" is the same as the value of the last one (B_LOGGER). It's confusing if there's a macro called BACKEND_NUM_TYPES which is greater than the number of backend types. Most loops say for (int i=0; i<NUM; ++i) If it's 1-based, they say for (int i=1; i<=NUM; ++i) You have two different loops like: + for (int i = 0; i < BACKEND_NUM_TYPES - 1 ; i++) + for (int backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++) Both of these iterate over the correct number of backend types, but they both *look* wrong, which isn't desirable. -- Justin
On Wed, Dec 01, 2021 at 04:59:44PM -0500, Melanie Plageman wrote: > Thanks for the review! > > On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > You wrote beentry++ at the start of two loops, but I think that's wrong; it > > should be at the end, as in the rest of the file (or as a loop increment). > > BackendStatusArray[0] is actually used (even though its backend has > > backendId==1, not 0). "MyBEEntry = &BackendStatusArray[MyBackendId - 1];" > > I've fixed this in v16 which I will attach to the next email in the thread. > > > You could put *_NUM_TYPES as the last value in these enums, like > > NUM_AUXPROCTYPES, NUM_PMSIGNALS, and NUM_PROCSIGNALS: > > > > +#define IOOP_NUM_TYPES (IOOP_WRITE + 1) > > +#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1) > > +#define BACKEND_NUM_TYPES (B_LOGGER + 1) > > I originally had it as you describe, but based on this feedback upthread > from Álvaro Herrera: I saw that after I made my suggestion. Sorry for the noise. Both ways already exist in postgres and seem to be acceptable. > > There's extraneous blank lines in these functions: > > +pgstat_recv_resetsharedcounter > I didn't see one here => The extra blank line is after the RESET_BUFFERS memset. > + * Reset the global, bgwriter and checkpointer statistics for the > + * cluster. The first comma in this comment was introduced in 1bc8e7b09, and seems to be extraneous, since bgwriter and checkpointer are both global. With the comma, it looks like it should be memsetting 3 things. > + /* Don't count dead backends. They should already be counted */ Maybe this comment should say ".. 
they'll be added below" > + row[COLUMN_BACKEND_TYPE] = backend_type_desc; > + row[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path)); > + row[COLUMN_ALLOCS] += io_ops->allocs - resets->allocs; > + row[COLUMN_EXTENDS] += io_ops->extends - resets->extends; > + row[COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs; > + row[COLUMN_WRITES] += io_ops->writes - resets->writes; > + row[COLUMN_RESET_TIME] = reset_time; It'd be clearer if RESET_TIME were set adjacent to BACKEND_TYPE and IO_PATH. > > Rather than memset(), you could initialize msg like this. > > PgStat_MsgIOPathOps msg = {0}; > > though changing the initialization to universal zero initialization > seems to be the correct way, I do get this compiler warning when I make > the change > > pgstat.c:3212:29: warning: suggest braces around initialization of subobject [-Wmissing-braces] > > I have seen some comments online that say that this is a spurious > warning present with some versions of both gcc and clang when using > -Wmissing-braces to compile code with universal zero initialization, but > I'm not sure what I should do. I think gcc is suggesting to write something like {{0}}, and I think the online commentary you found is saying that the warning is a false positive. So I think you should ignore my suggestion - it's not worth the bother. This message needs to be updated: errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\"."))) When I query the view, I see reset times as: 1999-12-31 18:00:00-06. I guess it should be initialized like this one: globalStats.bgwriter.stat_reset_timestamp = ts The cfbot shows failures now (I thought it was passing with the previous patch, but I suppose I'm wrong.) It looks like running recovery during single user mode hits this assertion. TRAP: FailedAssertion("beentry", File: "../../../../src/include/utils/backend_status.h", Line: 359, PID: 3499) -- Justin
On Wed, Dec 01, 2021 at 04:59:44PM -0500, Melanie Plageman wrote: > Thanks for the review! > > On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > You wrote beentry++ at the start of two loops, but I think that's wrong; it > > should be at the end, as in the rest of the file (or as a loop increment). > > BackendStatusArray[0] is actually used (even though its backend has > > backendId==1, not 0). "MyBEEntry = &BackendStatusArray[MyBackendId - 1];" > > I've fixed this in v16 which I will attach to the next email in the thread. I just noticed that since beentry++ is now at the end of the loop, it's being missed when you "continue": + if (beentry->st_procpid == 0) + continue; Also, I saw that pgindent messed up and added spaces after pointers in function declarations, due to new typedefs not in typedefs.list: -pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg) +pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter * msg) -static inline void pg_atomic_inc_counter(pg_atomic_uint64 *counter) +static inline void +pg_atomic_inc_counter(pg_atomic_uint64 * counter) -- Justin
Thanks again! I really appreciate the thorough review. I have combined responses to all three of your emails below. Let me know if it is more confusing to do it this way. On Wed, Dec 1, 2021 at 6:59 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Wed, Dec 01, 2021 at 05:00:14PM -0500, Melanie Plageman wrote: > > > Also: > > > src/include/miscadmin.h:#define BACKEND_NUM_TYPES (B_LOGGER + 1) > > > > > > I think it's wrong to say NUM_TYPES = B_LOGGER + 1 (which would suggest using > > > lessthan-or-equal instead of lessthan as you are). > > > > > > Since the valid backend types start at 1 , the "count" of backend types is > > > currently B_LOGGER (13) - not 14. I think you should remove the "+1" here. > > > Then NROWS (if it continued to exist at all) wouldn't need to subtract one. > > > > I think what I currently have is technically correct because I start at > > 1 when I am using it as a loop condition. I do waste a spot in the > > arrays I allocate with BACKEND_NUM_TYPES size. > > > > I was hesitant to make the value of BACKEND_NUM_TYPES == B_LOGGER > > because it seems kind of weird to have it have the same value as the > > B_LOGGER enum. > > I don't mean to say that the code is misbehaving - I mean "num_x" means "the > number of x's" - how many there are. Since the first, valid backend type is 1, > and they're numbered consecutively and without duplicates, then "the number of > backend types" is the same as the value of the last one (B_LOGGER). It's > confusing if there's a macro called BACKEND_NUM_TYPES which is greater than the > number of backend types. > > Most loops say for (int i=0; i<NUM; ++i) > If it's 1-based, they say for (int i=1; i<=NUM; ++i) > You have two different loops like: > > + for (int i = 0; i < BACKEND_NUM_TYPES - 1 ; i++) > + for (int backend_type = 1; backend_type < BACKEND_NUM_TYPES; backend_type++) > > Both of these iterate over the correct number of backend types, but they both > *look* wrong, which isn't desirable. 
I've changed this and added comments wherever I could to make it clear. Whenever the parameter was of type BackendType, I tried to use the correct (not adjusted by subtracting 1) number and wherever the type was int and being used as an index into the array, I used the adjusted value and added the idx suffix to make it clear that the number does not reflect the actual BackendType: On Wed, Dec 1, 2021 at 10:31 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Wed, Dec 01, 2021 at 04:59:44PM -0500, Melanie Plageman wrote: > > Thanks for the review! > > > > On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > There's extraneous blank lines in these functions: > > > +pgstat_recv_resetsharedcounter > > I didn't see one here > > => The extra blank line is after the RESET_BUFFERS memset. Fixed. > > + * Reset the global, bgwriter and checkpointer statistics for the > > + * cluster. > > The first comma in this comment was introduced in 1bc8e7b09, and seems to be > extraneous, since bgwriter and checkpointer are both global. With the comma, > it looks like it should be memsetting 3 things. Fixed. > > + /* Don't count dead backends. They should already be counted */ > > Maybe this comment should say ".. they'll be added below" Fixed. > > + row[COLUMN_BACKEND_TYPE] = backend_type_desc; > > + row[COLUMN_IO_PATH] = CStringGetTextDatum(GetIOPathDesc(io_path)); > > + row[COLUMN_ALLOCS] += io_ops->allocs - resets->allocs; > > + row[COLUMN_EXTENDS] += io_ops->extends - resets->extends; > > + row[COLUMN_FSYNCS] += io_ops->fsyncs - resets->fsyncs; > > + row[COLUMN_WRITES] += io_ops->writes - resets->writes; > > + row[COLUMN_RESET_TIME] = reset_time; > > It'd be clearer if RESET_TIME were set adjacent to BACKEND_TYPE and IO_PATH. If you mean just in the order here (not in the column order in the view), then I have changed it as you recommended. 
> This message needs to be updated: > errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\"."))) Done. > When I query the view, I see reset times as: 1999-12-31 18:00:00-06. > I guess it should be initialized like this one: > globalStats.bgwriter.stat_reset_timestamp = ts Done. > The cfbot shows failures now (I thought it was passing with the previous patch, > but I suppose I'm wrong.) > > It looks like running recovery during single user mode hits this assertion. > TRAP: FailedAssertion("beentry", File: "../../../../src/include/utils/backend_status.h", Line: 359, PID: 3499) > Yes, thank you for catching this. I have moved up pgstat_beinit and pgstat_bestart so that single user mode process will also have PgBackendStatus. I also have to guard against sending these stats to the collector since there is no room for B_INVALID backendtype in the array of IO Op values. With this change `make check-world` passes on my machine. On Wed, Dec 1, 2021 at 11:06 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Wed, Dec 01, 2021 at 04:59:44PM -0500, Melanie Plageman wrote: > > Thanks for the review! > > > > On Wed, Nov 24, 2021 at 8:16 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > You wrote beentry++ at the start of two loops, but I think that's wrong; it > > > should be at the end, as in the rest of the file (or as a loop increment). > > > BackendStatusArray[0] is actually used (even though its backend has > > > backendId==1, not 0). "MyBEEntry = &BackendStatusArray[MyBackendId - 1];" > > > > I've fixed this in v16 which I will attach to the next email in the thread. > > I just noticed that since beentry++ is now at the end of the loop, it's being > missed when you "continue": > > + if (beentry->st_procpid == 0) > + continue; Fixed. 
> Also, I saw that pgindent messed up and added spaces after pointers in function > declarations, due to new typedefs not in typedefs.list: > > -pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg) > +pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter * msg) > > -static inline void pg_atomic_inc_counter(pg_atomic_uint64 *counter) > +static inline void > +pg_atomic_inc_counter(pg_atomic_uint64 * counter) Fixed. -- Melanie
Attachment
- v17-0007-small-comment-correction.patch
- v17-0006-Remove-superfluous-bgwriter-stats.patch
- v17-0005-Add-system-view-tracking-IO-ops-per-backend-type.patch
- v17-0003-Send-IO-operations-to-stats-collector.patch
- v17-0004-Add-buffers-to-pgstat_reset_shared_counters.patch
- v17-0002-Add-IO-operation-counters-to-PgBackendStatus.patch
- v17-0001-Read-only-atomic-backend-write-function.patch
Hi, On 2021-12-03 15:02:24 -0500, Melanie Plageman wrote: > From e0f7f3dfd60a68fa01f3c023bcdb69305ade3738 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Mon, 11 Oct 2021 16:15:06 -0400 > Subject: [PATCH v17 1/7] Read-only atomic backend write function > > For counters in shared memory which can be read by any backend but only > written to by one backend, an atomic is still needed to protect against > torn values, however, pg_atomic_fetch_add_u64() is overkill for > incrementing the counter. pg_atomic_inc_counter() is a helper function > which can be used to increment these values safely but without > unnecessary overhead. > > Author: Thomas Munro > --- > src/include/port/atomics.h | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h > index 856338f161..39ffff24dd 100644 > --- a/src/include/port/atomics.h > +++ b/src/include/port/atomics.h > @@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_) > return pg_atomic_sub_fetch_u64_impl(ptr, sub_); > } > > +/* > + * On modern systems this is really just *counter++. On some older systems > + * there might be more to it, due to inability to read and write 64 bit values > + * atomically. > + */ > +static inline void > +pg_atomic_inc_counter(pg_atomic_uint64 *counter) > +{ > + pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1); > +} I wonder if it's worth putting something in the name indicating that this is not an actual atomic RMW operation. Perhaps adding _unlocked? > From b0e193cfa08f0b8cf1be929f26fe38f06a39aeae Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Wed, 24 Nov 2021 10:32:56 -0500 > Subject: [PATCH v17 2/7] Add IO operation counters to PgBackendStatus > > Add an array of counters in PgBackendStatus which count the buffers > allocated, extended, fsynced, and written by a given backend. 
Each "IO > Op" (alloc, fsync, extend, write) is counted per "IO Path" (direct, > local, shared, or strategy). "local" and "shared" IO Path counters count > operations on local and shared buffers. The "strategy" IO Path counts > buffers alloc'd/written/read/fsync'd as part of a BufferAccessStrategy. > The "direct" IO Path counts blocks of IO which are read, written, or > fsync'd using smgrwrite/extend/immedsync directly (as opposed to through > [Local]BufferAlloc()). > > With this commit, all backends increment a counter in their > PgBackendStatus when performing an IO operation. This is in preparation > for future commits which will persist these stats upon backend exit and > use the counters to provide observability of database IO operations. > > Note that this commit does not add code to increment the "direct" path. > A separate proposed patch [1] which would add wrappers for smgrwrite(), > smgrextend(), and smgrimmedsync() would provide a good location to call > pgstat_inc_ioop() for unbuffered IO and avoid regressions for future > users of these functions. > > [1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com On longer threads it's nice for committers to already have Reviewed-By: in the commit message. 
> diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c > index 7229598822..413cc605f8 100644 > --- a/src/backend/utils/activity/backend_status.c > +++ b/src/backend/utils/activity/backend_status.c > @@ -399,6 +399,15 @@ pgstat_bestart(void) > lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID; > lbeentry.st_progress_command_target = InvalidOid; > lbeentry.st_query_id = UINT64CONST(0); > + for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++) > + { > + IOOps *io_ops = &lbeentry.io_path_stats[io_path]; > + > + pg_atomic_init_u64(&io_ops->allocs, 0); > + pg_atomic_init_u64(&io_ops->extends, 0); > + pg_atomic_init_u64(&io_ops->fsyncs, 0); > + pg_atomic_init_u64(&io_ops->writes, 0); > + } > > /* > * we don't zero st_progress_param here to save cycles; nobody should nit: I think we nearly always have a blank line before loops > diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c > index 646126edee..93f1b4bcfc 100644 > --- a/src/backend/utils/init/postinit.c > +++ b/src/backend/utils/init/postinit.c > @@ -623,6 +623,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, > RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler); > } > > + pgstat_beinit(); > /* > * Initialize local process's access to XLOG. > */ > nit: same with multi-line comments. > @@ -649,6 +650,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, > */ > CreateAuxProcessResourceOwner(); > > + pgstat_bestart(); > StartupXLOG(); > /* Release (and warn about) any buffer pins leaked in StartupXLOG */ > ReleaseAuxProcessResources(true); > @@ -676,7 +678,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, > EnablePortalManager(); > > /* Initialize status reporting */ > - pgstat_beinit() I'd like to see changes like moving this kind of thing around broken out and committed separately. 
It's much easier to pinpoint breakage if the CF breaks after moving just pgstat_beinit() around, rather than when committing this considerably larger patch. And reordering subsystem initialization has the habit of causing problems... > +/* ---------- > + * IO Stats reporting utility types > + * ---------- > + */ > + > +typedef enum IOOp > +{ > + IOOP_ALLOC, > + IOOP_EXTEND, > + IOOP_FSYNC, > + IOOP_WRITE, > +} IOOp; > [...] > +/* > + * Structure for counting all types of IOOps for a live backend. > + */ > +typedef struct IOOps > +{ > + pg_atomic_uint64 allocs; > + pg_atomic_uint64 extends; > + pg_atomic_uint64 fsyncs; > + pg_atomic_uint64 writes; > +} IOOps; To me IOop and IOOps sound too much alike - even though they're really kind of separate things. s/IOOps/IOOpCounters/ maybe? > @@ -3152,6 +3156,14 @@ pgstat_shutdown_hook(int code, Datum arg) > { > Assert(!pgstat_is_shutdown); > > + /* > + * Only need to send stats on IO Ops for IO Paths when a process exits. > + * Users requiring IO Ops for both live and exited backends can read from > + * live backends' PgBackendStatus and sum this with totals from exited > + * backends persisted by the stats collector. > + */ > + pgstat_send_buffers(); Perhaps something like this comment belongs somewhere at the top of the file, or in the header, or ...? It's a fairly central design piece, and it's not obvious one would need to look in the shutdown hook for it? > +/* > + * Before exiting, a backend sends its IO op statistics to the collector so > + * that they may be persisted. > + */ > +void > +pgstat_send_buffers(void) > +{ > + PgStat_MsgIOPathOps msg; > + > + PgBackendStatus *beentry = MyBEEntry; > + > + /* > + * Though some backends with type B_INVALID (such as the single-user mode > + * process) do initialize and increment IO operations stats, there is no > + * spot in the array of IO operations for backends of type B_INVALID. As > + * such, do not send these to the stats collector. 
> + */ > + if (!beentry || beentry->st_backendType == B_INVALID) > + return; Why does single user mode use B_INVALID? That doesn't seem quite right. > + memset(&msg, 0, sizeof(msg)); > + msg.backend_type = beentry->st_backendType; > + > + pgstat_sum_io_path_ops(msg.iop.io_path_ops, > + (IOOps *) &beentry->io_path_stats); > + > + pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS); > + pgstat_send(&msg, sizeof(msg)); > +} It seems worth having a path skipping sending the message if there was no IO? > +/* > + * Helper function to sum all live IO Op stats for all IO Paths (e.g. shared, > + * local) to those in the equivalent stats structure for exited backends. Note > + * that this adds and doesn't set, so the destination stats structure should be > + * zeroed out by the caller initially. This would commonly be used to transfer > + * all IO Op stats for all IO Paths for a particular backend type to the > + * pgstats structure. > + */ > +void > +pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src) > +{ > + for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++) > + { Sacrilegious, but I find io_path a harder to understand variable name for the counter than i (or io_path_off or ...) ;) > +static void > +pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len) > +{ > + PgStatIOOps *src_io_path_ops; > + PgStatIOOps *dest_io_path_ops; > + > + /* > + * Subtract 1 from message's BackendType to get a valid index into the > + * array of IO Ops which does not include an entry for B_INVALID > + * BackendType. > + */ > + Assert(msg->backend_type > B_INVALID); Probably worth also asserting the upper boundary? > From f972ea87270feaed464a74fb6541ac04b4fc7d98 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Wed, 24 Nov 2021 11:39:48 -0500 > Subject: [PATCH v17 4/7] Add "buffers" to pgstat_reset_shared_counters > > Backends count IO operations for various IO paths in their PgBackendStatus. 
> Upon exit, they send these counts to the stats collector. Prior to this commit, > these IO Ops stats would have been reset when the target was "bgwriter". > > With this commit, target "bgwriter" no longer will cause the IO operations > stats to be reset, and the IO operations stats can be reset with new target, > "buffers". > --- > doc/src/sgml/monitoring.sgml | 2 +- > src/backend/postmaster/pgstat.c | 83 +++++++++++++++++++-- > src/backend/utils/activity/backend_status.c | 29 +++++++ > src/include/pgstat.h | 8 +- > src/include/utils/backend_status.h | 2 + > 5 files changed, 117 insertions(+), 7 deletions(-) > > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml > index 62f2a3332b..bda3eef309 100644 > --- a/doc/src/sgml/monitoring.sgml > +++ b/doc/src/sgml/monitoring.sgml > @@ -3604,7 +3604,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i > <structfield>stats_reset</structfield> <type>timestamp with time zone</type> > </para> > <para> > - Time at which these statistics were last reset > + Time at which these statistics were last reset. > </para></entry> > </row> > </tbody> Hm? Shouldn't this new reset target be documented? > +/* > + * Helper function to collect and send live backends' current IO operations > + * stats counters when a stats reset is initiated so that they may be deducted > + * from future totals. > + */ > +static void > +pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg) > +{ > + PgStatIOPathOps ops[BACKEND_NUM_TYPES]; > + > + memset(ops, 0, sizeof(ops)); > + pgstat_report_live_backend_io_path_ops(ops); > + > + /* > + * Iterate through the array of IO Ops for all IO Paths for each > + * BackendType. Because the array does not include a spot for BackendType > + * B_INVALID, add 1 to the index when setting backend_type so that there is > + * no confusion as to the BackendType with which this reset message > + * corresponds. 
> + */ > + for (int backend_type_idx = 0; backend_type_idx < BACKEND_NUM_TYPES; backend_type_idx++) > + { > + msg->m_backend_resets.backend_type = backend_type_idx + 1; > + memcpy(&msg->m_backend_resets.iop, &ops[backend_type_idx], > + sizeof(msg->m_backend_resets.iop)); > + pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter)); > + } > +} Probably worth explaining why multiple messages are sent? > @@ -5583,10 +5621,45 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len) > { > if (msg->m_resettarget == RESET_BGWRITER) > { > - /* Reset the global, bgwriter and checkpointer statistics for the cluster. */ > - memset(&globalStats, 0, sizeof(globalStats)); > + /* > + * Reset the global bgwriter and checkpointer statistics for the > + * cluster. > + */ > + memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer)); > + memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter)); > globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp(); > } Oh, is this a live bug? > + /* > + * Subtract 1 from the BackendType to arrive at a valid index in the > + * array, as it does not contain a spot for B_INVALID BackendType. > + */ Instead of repeating a comment about +- 1 in a bunch of places, would it look better to have two helper inline functions for this purpose? > +/* > +* When adding a new column to the pg_stat_buffers view, add a new enum > +* value here above COLUMN_LENGTH. > +*/ > +enum > +{ > + COLUMN_BACKEND_TYPE, > + COLUMN_IO_PATH, > + COLUMN_ALLOCS, > + COLUMN_EXTENDS, > + COLUMN_FSYNCS, > + COLUMN_WRITES, > + COLUMN_RESET_TIME, > + COLUMN_LENGTH, > +}; COLUMN_LENGTH seems like a fairly generic name... > From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Wed, 24 Nov 2021 12:20:10 -0500 > Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats > > Remove stats from pg_stat_bgwriter which are now more clearly expressed > in pg_stat_buffers. 
> > TODO: > - make pg_stat_checkpointer view and move relevant stats into it > - add additional stats to pg_stat_bgwriter When do you think it makes sense to tackle these wrt committing some of the patches? > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c > index 6926fc5742..67447f997a 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -2164,7 +2164,6 @@ BufferSync(int flags) > if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN) > { > TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id); > - PendingCheckpointerStats.m_buf_written_checkpoints++; > num_written++; > } > } > @@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context) > */ > strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc); > > - /* Report buffer alloc counts to pgstat */ > - PendingBgWriterStats.m_buf_alloc += recent_alloc; > - > /* > * If we're not running the LRU scan, just stop after doing the stats > * stuff. We mark the saved state invalid so that we can recover sanely > @@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context) > reusable_buffers++; > } > > - PendingBgWriterStats.m_buf_written_clean += num_written; > - Isn't num_written unused now, unless tracepoints are enabled? I'd expect some compilers to warn... Perhaps we should just remove information from the tracepoint? Greetings, Andres Freund
v18 attached. On Thu, Dec 9, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-12-03 15:02:24 -0500, Melanie Plageman wrote: > > From e0f7f3dfd60a68fa01f3c023bcdb69305ade3738 Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman <melanieplageman@gmail.com> > > Date: Mon, 11 Oct 2021 16:15:06 -0400 > > Subject: [PATCH v17 1/7] Read-only atomic backend write function > > > > For counters in shared memory which can be read by any backend but only > > written to by one backend, an atomic is still needed to protect against > > torn values, however, pg_atomic_fetch_add_u64() is overkill for > > incrementing the counter. pg_atomic_inc_counter() is a helper function > > which can be used to increment these values safely but without > > unnecessary overhead. > > > > Author: Thomas Munro > > --- > > src/include/port/atomics.h | 11 +++++++++++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h > > index 856338f161..39ffff24dd 100644 > > --- a/src/include/port/atomics.h > > +++ b/src/include/port/atomics.h > > @@ -519,6 +519,17 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_) > > return pg_atomic_sub_fetch_u64_impl(ptr, sub_); > > } > > > > +/* > > + * On modern systems this is really just *counter++. On some older systems > > + * there might be more to it, due to inability to read and write 64 bit values > > + * atomically. > > + */ > > +static inline void > > +pg_atomic_inc_counter(pg_atomic_uint64 *counter) > > +{ > > + pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1); > > +} > > I wonder if it's worth putting something in the name indicating that this is > not actual atomic RMW operation. Perhaps adding _unlocked? > Done. 
> > > From b0e193cfa08f0b8cf1be929f26fe38f06a39aeae Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman <melanieplageman@gmail.com> > > Date: Wed, 24 Nov 2021 10:32:56 -0500 > > Subject: [PATCH v17 2/7] Add IO operation counters to PgBackendStatus > > > > Add an array of counters in PgBackendStatus which count the buffers > > allocated, extended, fsynced, and written by a given backend. Each "IO > > Op" (alloc, fsync, extend, write) is counted per "IO Path" (direct, > > local, shared, or strategy). "local" and "shared" IO Path counters count > > operations on local and shared buffers. The "strategy" IO Path counts > > buffers alloc'd/written/read/fsync'd as part of a BufferAccessStrategy. > > The "direct" IO Path counts blocks of IO which are read, written, or > > fsync'd using smgrwrite/extend/immedsync directly (as opposed to through > > [Local]BufferAlloc()). > > > > With this commit, all backends increment a counter in their > > PgBackendStatus when performing an IO operation. This is in preparation > > for future commits which will persist these stats upon backend exit and > > use the counters to provide observability of database IO operations. > > > > Note that this commit does not add code to increment the "direct" path. > > A separate proposed patch [1] which would add wrappers for smgrwrite(), > > smgrextend(), and smgrimmedsync() would provide a good location to call > > pgstat_inc_ioop() for unbuffered IO and avoid regressions for future > > users of these functions. > > > > [1] https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com > > On longer thread it's nice for committers to already have Reviewed-By: in the > commit message. Done. 
> > diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c > > index 7229598822..413cc605f8 100644 > > --- a/src/backend/utils/activity/backend_status.c > > +++ b/src/backend/utils/activity/backend_status.c > > @@ -399,6 +399,15 @@ pgstat_bestart(void) > > lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID; > > lbeentry.st_progress_command_target = InvalidOid; > > lbeentry.st_query_id = UINT64CONST(0); > > + for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++) > > + { > > + IOOps *io_ops = &lbeentry.io_path_stats[io_path]; > > + > > + pg_atomic_init_u64(&io_ops->allocs, 0); > > + pg_atomic_init_u64(&io_ops->extends, 0); > > + pg_atomic_init_u64(&io_ops->fsyncs, 0); > > + pg_atomic_init_u64(&io_ops->writes, 0); > > + } > > > > /* > > * we don't zero st_progress_param here to save cycles; nobody should > > nit: I think we nearly always have a blank line before loops Done. > > diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c > > index 646126edee..93f1b4bcfc 100644 > > --- a/src/backend/utils/init/postinit.c > > +++ b/src/backend/utils/init/postinit.c > > @@ -623,6 +623,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, > > RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler); > > } > > > > + pgstat_beinit(); > > /* > > * Initialize local process's access to XLOG. > > */ > > nit: same with multi-line comments. Done. 
> > @@ -649,6 +650,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
> >  	 */
> >  	CreateAuxProcessResourceOwner();
> >
> > +	pgstat_bestart();
> >  	StartupXLOG();
> >  	/* Release (and warn about) any buffer pins leaked in StartupXLOG */
> >  	ReleaseAuxProcessResources(true);
> > @@ -676,7 +678,6 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
> >  	EnablePortalManager();
> >
> >  	/* Initialize status reporting */
> > -	pgstat_beinit();
>
> I'd like to see changes like moving this kind of thing around broken out
> and committed separately. It's much easier to pinpoint breakage if the CF
> breaks after moving just pgstat_beinit() around, rather than when committing
> this considerably larger patch. And reordering subsystem initialization has
> the habit of causing problems...

Done.

> > +/* ----------
> > + * IO Stats reporting utility types
> > + * ----------
> > + */
> > +
> > +typedef enum IOOp
> > +{
> > +	IOOP_ALLOC,
> > +	IOOP_EXTEND,
> > +	IOOP_FSYNC,
> > +	IOOP_WRITE,
> > +} IOOp;
> [...]
> > +/*
> > + * Structure for counting all types of IOOps for a live backend.
> > + */
> > +typedef struct IOOps
> > +{
> > +	pg_atomic_uint64 allocs;
> > +	pg_atomic_uint64 extends;
> > +	pg_atomic_uint64 fsyncs;
> > +	pg_atomic_uint64 writes;
> > +} IOOps;
>
> To me IOOp and IOOps sound too much alike - even though they're really kind of
> separate things. s/IOOps/IOOpCounters/ maybe?

Done.

> > @@ -3152,6 +3156,14 @@ pgstat_shutdown_hook(int code, Datum arg)
> >  {
> >  	Assert(!pgstat_is_shutdown);
> >
> > +	/*
> > +	 * Only need to send stats on IO Ops for IO Paths when a process exits.
> > +	 * Users requiring IO Ops for both live and exited backends can read from
> > +	 * live backends' PgBackendStatus and sum this with totals from exited
> > +	 * backends persisted by the stats collector.
> > +	 */
> > +	pgstat_send_buffers();
>
> Perhaps something like this comment belongs somewhere at the top of the file,
> or in the header, or ...?
> It's a fairly central design piece, and it's not
> obvious one would need to look in the shutdown hook for it?

Now in pgstat.h, above the declaration of pgstat_send_buffers().

> > +/*
> > + * Before exiting, a backend sends its IO op statistics to the collector so
> > + * that they may be persisted.
> > + */
> > +void
> > +pgstat_send_buffers(void)
> > +{
> > +	PgStat_MsgIOPathOps msg;
> > +
> > +	PgBackendStatus *beentry = MyBEEntry;
> > +
> > +	/*
> > +	 * Though some backends with type B_INVALID (such as the single-user mode
> > +	 * process) do initialize and increment IO operations stats, there is no
> > +	 * spot in the array of IO operations for backends of type B_INVALID. As
> > +	 * such, do not send these to the stats collector.
> > +	 */
> > +	if (!beentry || beentry->st_backendType == B_INVALID)
> > +		return;
>
> Why does single user mode use B_INVALID? That doesn't seem quite right.

I think PgBackendStatus->st_backendType is set from MyBackendType, which
isn't set for the single user mode process. What BackendType would you
expect to see?

> > +	memset(&msg, 0, sizeof(msg));
> > +	msg.backend_type = beentry->st_backendType;
> > +
> > +	pgstat_sum_io_path_ops(msg.iop.io_path_ops,
> > +						   (IOOps *) &beentry->io_path_stats);
> > +
> > +	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_IO_PATH_OPS);
> > +	pgstat_send(&msg, sizeof(msg));
> > +}
>
> It seems worth having a path skipping sending the message if there was no IO?

Makes sense. I've updated pgstat_send_buffers() to do a loop after calling
pgstat_sum_io_path_ops() and check if it should skip sending.

I also thought about having pgstat_sum_io_path_ops() return a value to
indicate if everything was 0 -- which could be useful to future callers
potentially? I didn't do this because I am not sure what the return value
would be.
It could be a bool -- true if any IO was done and false if none was done --
but that doesn't really make sense given the function's name; it would be
called like

    if (!pgstat_sum_io_path_ops())
        return;

which I'm not sure is very clear.

> > +/*
> > + * Helper function to sum all live IO Op stats for all IO Paths (e.g. shared,
> > + * local) to those in the equivalent stats structure for exited backends. Note
> > + * that this adds and doesn't set, so the destination stats structure should be
> > + * zeroed out by the caller initially. This would commonly be used to transfer
> > + * all IO Op stats for all IO Paths for a particular backend type to the
> > + * pgstats structure.
> > + */
> > +void
> > +pgstat_sum_io_path_ops(PgStatIOOps *dest, IOOps *src)
> > +{
> > +	for (int io_path = 0; io_path < IOPATH_NUM_TYPES; io_path++)
> > +	{
>
> Sacrilegious, but I find io_path a harder to understand variable name for the
> counter than i (or io_path_off or ...) ;)

I've updated almost all my non-standard loop index variable names.

> > +static void
> > +pgstat_recv_io_path_ops(PgStat_MsgIOPathOps *msg, int len)
> > +{
> > +	PgStatIOOps *src_io_path_ops;
> > +	PgStatIOOps *dest_io_path_ops;
> > +
> > +	/*
> > +	 * Subtract 1 from message's BackendType to get a valid index into the
> > +	 * array of IO Ops which does not include an entry for B_INVALID
> > +	 * BackendType.
> > +	 */
> > +	Assert(msg->backend_type > B_INVALID);
>
> Probably worth also asserting the upper boundary?

Done.

> > From f972ea87270feaed464a74fb6541ac04b4fc7d98 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 24 Nov 2021 11:39:48 -0500
> > Subject: [PATCH v17 4/7] Add "buffers" to pgstat_reset_shared_counters
> >
> > Backends count IO operations for various IO paths in their PgBackendStatus.
> > Upon exit, they send these counts to the stats collector. Prior to this commit,
> > these IO Ops stats would have been reset when the target was "bgwriter".
> >
> > With this commit, target "bgwriter" no longer will cause the IO operations
> > stats to be reset, and the IO operations stats can be reset with new target,
> > "buffers".
> > ---
> >  doc/src/sgml/monitoring.sgml                |  2 +-
> >  src/backend/postmaster/pgstat.c             | 83 +++++++++++++++++++--
> >  src/backend/utils/activity/backend_status.c | 29 +++++++
> >  src/include/pgstat.h                        |  8 +-
> >  src/include/utils/backend_status.h          |  2 +
> >  5 files changed, 117 insertions(+), 7 deletions(-)
> >
> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > index 62f2a3332b..bda3eef309 100644
> > --- a/doc/src/sgml/monitoring.sgml
> > +++ b/doc/src/sgml/monitoring.sgml
> > @@ -3604,7 +3604,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
> >        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
> >       </para>
> >       <para>
> > -       Time at which these statistics were last reset
> > +       Time at which these statistics were last reset.
> >       </para></entry>
> >      </row>
> >     </tbody>
>
> Hm?
>
> Shouldn't this new reset target be documented?

It is in the commit adding the view. I didn't include it in this commit
because the pg_stat_buffers view doesn't exist yet, as of this commit, and
I thought it would be odd to mention it in the docs (in this commit).

As an aside, I shouldn't have left this correction in this commit. I moved
it now to the other one.

> > +/*
> > + * Helper function to collect and send live backends' current IO operations
> > + * stats counters when a stats reset is initiated so that they may be deducted
> > + * from future totals.
> > + */
> > +static void
> > +pgstat_send_buffers_reset(PgStat_MsgResetsharedcounter *msg)
> > +{
> > +	PgStatIOPathOps ops[BACKEND_NUM_TYPES];
> > +
> > +	memset(ops, 0, sizeof(ops));
> > +	pgstat_report_live_backend_io_path_ops(ops);
> >
> > +	/*
> > +	 * Iterate through the array of IO Ops for all IO Paths for each
> > +	 * BackendType.
> > +	 * Because the array does not include a spot for BackendType
> > +	 * B_INVALID, add 1 to the index when setting backend_type so that there is
> > +	 * no confusion as to the BackendType with which this reset message
> > +	 * corresponds.
> > +	 */
> > +	for (int backend_type_idx = 0; backend_type_idx < BACKEND_NUM_TYPES; backend_type_idx++)
> > +	{
> > +		msg->m_backend_resets.backend_type = backend_type_idx + 1;
> > +		memcpy(&msg->m_backend_resets.iop, &ops[backend_type_idx],
> > +			   sizeof(msg->m_backend_resets.iop));
> > +		pgstat_send(msg, sizeof(PgStat_MsgResetsharedcounter));
> > +	}
> > +}
>
> Probably worth explaining why multiple messages are sent?

Done.

> > @@ -5583,10 +5621,45 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
> >  {
> >  	if (msg->m_resettarget == RESET_BGWRITER)
> >  	{
> > -		/* Reset the global, bgwriter and checkpointer statistics for the cluster. */
> > -		memset(&globalStats, 0, sizeof(globalStats));
> > +		/*
> > +		 * Reset the global bgwriter and checkpointer statistics for the
> > +		 * cluster.
> > +		 */
> > +		memset(&globalStats.checkpointer, 0, sizeof(globalStats.checkpointer));
> > +		memset(&globalStats.bgwriter, 0, sizeof(globalStats.bgwriter));
> >  		globalStats.bgwriter.stat_reset_timestamp = GetCurrentTimestamp();
> >  	}
>
> Oh, is this a live bug?

I don't think it is a bug. globalStats only contained bgwriter and
checkpointer stats, and those were all only displayed in pg_stat_bgwriter,
so memsetting the whole thing seems fine.

> > +	/*
> > +	 * Subtract 1 from the BackendType to arrive at a valid index in the
> > +	 * array, as it does not contain a spot for B_INVALID BackendType.
> > +	 */
>
> Instead of repeating a comment about +- 1 in a bunch of places, would it look
> better to have two helper inline functions for this purpose?

Done.

> > +/*
> > + * When adding a new column to the pg_stat_buffers view, add a new enum
> > + * value here above COLUMN_LENGTH.
> > + */
> > +enum
> > +{
> > +	COLUMN_BACKEND_TYPE,
> > +	COLUMN_IO_PATH,
> > +	COLUMN_ALLOCS,
> > +	COLUMN_EXTENDS,
> > +	COLUMN_FSYNCS,
> > +	COLUMN_WRITES,
> > +	COLUMN_RESET_TIME,
> > +	COLUMN_LENGTH,
> > +};
>
> COLUMN_LENGTH seems like a fairly generic name...

Changed.

> > From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 24 Nov 2021 12:20:10 -0500
> > Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats
> >
> > Remove stats from pg_stat_bgwriter which are now more clearly expressed
> > in pg_stat_buffers.
> >
> > TODO:
> > - make pg_stat_checkpointer view and move relevant stats into it
> > - add additional stats to pg_stat_bgwriter
>
> When do you think it makes sense to tackle these wrt committing some of the
> patches?

Well, the new stats are a superset of the old stats (no stats have been
removed that are not represented in the new or old views). So, I don't see
that as a blocker for committing these patches.

Since it is weird that pg_stat_bgwriter had mostly checkpointer stats, I've
edited this commit to rename that view to pg_stat_checkpointer.

I have not made a separate view just for maxwritten_clean (presumably called
pg_stat_bgwriter), but I would not be opposed to doing this if you thought
having a view with a single column isn't a problem (in the event that we
don't get around to adding more bgwriter stats right away).

I noticed after changing the docs on the "bgwriter" target for
pg_stat_reset_shared to say "checkpointer", that it still said "bgwriter" in

    src/backend/po/ko.po
    src/backend/po/it.po
    ...

I presume these are automatically updated with some incantation, but I wasn't
sure what it was, nor could I find documentation on this.
> > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> > index 6926fc5742..67447f997a 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -2164,7 +2164,6 @@ BufferSync(int flags)
> >  			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
> >  			{
> >  				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
> > -				PendingCheckpointerStats.m_buf_written_checkpoints++;
> >  				num_written++;
> >  			}
> >  		}
> > @@ -2273,9 +2272,6 @@ BgBufferSync(WritebackContext *wb_context)
> >  	 */
> >  	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
> >
> > -	/* Report buffer alloc counts to pgstat */
> > -	PendingBgWriterStats.m_buf_alloc += recent_alloc;
> >
> >  	/*
> >  	 * If we're not running the LRU scan, just stop after doing the stats
> >  	 * stuff. We mark the saved state invalid so that we can recover sanely
> > @@ -2472,8 +2468,6 @@ BgBufferSync(WritebackContext *wb_context)
> >  		reusable_buffers++;
> >  	}
> >
> > -	PendingBgWriterStats.m_buf_written_clean += num_written;
>
> Isn't num_written unused now, unless tracepoints are enabled? I'd expect some
> compilers to warn... Perhaps we should just remove information from the
> tracepoint?

The local variable num_written is used in BgBufferSync() to determine whether
or not to increment maxwritten_clean, which is still represented in the view
pg_stat_checkpointer (formerly pg_stat_bgwriter). A local variable num_written
is used in BufferSync() to increment CheckpointStats.ckpt_bufs_written, which
is logged in LogCheckpointEnd(), so I'm not sure that can be removed.

- Melanie
Attachment
- v18-0008-small-comment-correction.patch
- v18-0005-Add-buffers-to-pgstat_reset_shared_counters.patch
- v18-0007-Remove-superfluous-bgwriter-stats.patch
- v18-0004-Send-IO-operations-to-stats-collector.patch
- v18-0006-Add-system-view-tracking-IO-ops-per-backend-type.patch
- v18-0003-Add-IO-operation-counters-to-PgBackendStatus.patch
- v18-0001-Read-only-atomic-backend-write-function.patch
- v18-0002-Move-backend-pgstat-initialization-earlier.patch
On Fri, Dec 03, 2021 at 03:02:24PM -0500, Melanie Plageman wrote:
> Thanks again! I really appreciate the thorough review.
>
> I have combined responses to all three of your emails below.
> Let me know if it is more confusing to do it this way.

One email is better than three - I'm just not a model citizen ;)

Thanks for updating the patch. I checked that all my previous review comments
were addressed (except for the part about passing the 3D array to a function -
I know that technically the pointer is being passed).

+int backend_type_get_idx(BackendType backend_type)
+BackendType idx_get_backend_type(int idx)

=> I think it'd be desirable for these to be either static functions (which
won't work for your needs) or macros, or inline functions in the header.

-	if (strcmp(target, "archiver") == 0)
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
+	if (strcmp(target, "buffers") == 0)

=> This should be added in alphabetical order. Which is unimportant, but it
will also make the patch 2 lines shorter. The doc patch should also be in
order.

+	 * Don't count dead backends. They will be added below There are no

=> Missing a period.

--
Justin
Hi,

On 2021-12-15 16:40:27 -0500, Melanie Plageman wrote:
> > > +/*
> > > + * Before exiting, a backend sends its IO op statistics to the collector so
> > > + * that they may be persisted.
> > > + */
> > > +void
> > > +pgstat_send_buffers(void)
> > > +{
> > > +	PgStat_MsgIOPathOps msg;
> > > +
> > > +	PgBackendStatus *beentry = MyBEEntry;
> > > +
> > > +	/*
> > > +	 * Though some backends with type B_INVALID (such as the single-user mode
> > > +	 * process) do initialize and increment IO operations stats, there is no
> > > +	 * spot in the array of IO operations for backends of type B_INVALID. As
> > > +	 * such, do not send these to the stats collector.
> > > +	 */
> > > +	if (!beentry || beentry->st_backendType == B_INVALID)
> > > +		return;
> >
> > Why does single user mode use B_INVALID? That doesn't seem quite right.
>
> I think PgBackendStatus->st_backendType is set from MyBackendType which
> isn't set for the single user mode process. What BackendType would you
> expect to see?

Either B_BACKEND or something new like B_SINGLE_USER_BACKEND?

> I also thought about having pgstat_sum_io_path_ops() return a value to
> indicate if everything was 0 -- which could be useful to future callers
> potentially?
>
> I didn't do this because I am not sure what the return value would be.
> It could be a bool and be true if any IO was done and false if none was
> done -- but that doesn't really make sense given the function's name; it
> would be called like
>     if (!pgstat_sum_io_path_ops())
>         return;
> which I'm not sure is very clear

Yea, I think it's ok to not do something fancier here for now.

> > > From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > Date: Wed, 24 Nov 2021 12:20:10 -0500
> > > Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats
> > >
> > > Remove stats from pg_stat_bgwriter which are now more clearly expressed
> > > in pg_stat_buffers.
> > >
> > > TODO:
> > > - make pg_stat_checkpointer view and move relevant stats into it
> > > - add additional stats to pg_stat_bgwriter
> >
> > When do you think it makes sense to tackle these wrt committing some of the
> > patches?
>
> Well, the new stats are a superset of the old stats (no stats have been
> removed that are not represented in the new or old views). So, I don't
> see that as a blocker for committing these patches.
>
> Since it is weird that pg_stat_bgwriter had mostly checkpointer stats,
> I've edited this commit to rename that view to pg_stat_checkpointer.
>
> I have not made a separate view just for maxwritten_clean (presumably
> called pg_stat_bgwriter), but I would not be opposed to doing this if
> you thought having a view with a single column isn't a problem (in the
> event that we don't get around to adding more bgwriter stats right
> away).

How about keeping old bgwriter values in place in the view, but generated
from the new stats stuff?

> I noticed after changing the docs on the "bgwriter" target for
> pg_stat_reset_shared to say "checkpointer", that it still said "bgwriter" in
> src/backend/po/ko.po
> src/backend/po/it.po
> ...
> I presume these are automatically updated with some incantation, but I wasn't
> sure what it was nor could I find documentation on this.

Yes, they are - and often some languages lag updating things. There's a bit
of docs at https://www.postgresql.org/docs/devel/nls.html

Greetings,

Andres Freund
On 2021-Dec-15, Melanie Plageman wrote:
> I noticed after changing the docs on the "bgwriter" target for
> pg_stat_reset_shared to say "checkpointer", that it still said "bgwriter" in
> src/backend/po/ko.po
> src/backend/po/it.po
> ...
> I presume these are automatically updated with some incantation, but I wasn't
> sure what it was nor could I find documentation on this.

Yes, feel free to ignore those files completely. They are updated using an
external workflow that you don't need to concern yourself with.

--
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"    (Andrew Morton)
On Tue, Dec 21, 2021 at 8:32 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > On Thu, Dec 16, 2021 at 3:18 PM Andres Freund <andres@anarazel.de> wrote: > > > > > From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001 > > > > > From: Melanie Plageman <melanieplageman@gmail.com> > > > > > Date: Wed, 24 Nov 2021 12:20:10 -0500 > > > > > Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats > > > > > > > > > > Remove stats from pg_stat_bgwriter which are now more clearly expressed > > > > > in pg_stat_buffers. > > > > > > > > > > TODO: > > > > > - make pg_stat_checkpointer view and move relevant stats into it > > > > > - add additional stats to pg_stat_bgwriter > > > > > > > > When do you think it makes sense to tackle these wrt committing some of the > > > > patches? > > > > > > Well, the new stats are a superset of the old stats (no stats have been > > > removed that are not represented in the new or old views). So, I don't > > > see that as a blocker for committing these patches. > > > > > Since it is weird that pg_stat_bgwriter had mostly checkpointer stats, > > > I've edited this commit to rename that view to pg_stat_checkpointer. > > > > > I have not made a separate view just for maxwritten_clean (presumably > > > called pg_stat_bgwriter), but I would not be opposed to doing this if > > > you thought having a view with a single column isn't a problem (in the > > > event that we don't get around to adding more bgwriter stats right > > > away). > > > > How about keeping old bgwriter values in place in the view , but generated > > from the new stats stuff? > > I tried this, but I actually don't think it is the right way to go. In > order to maintain the old view with the new source code, I had to add > new code to maintain a separate resets array just for the bgwriter view. > It adds some fiddly code that will be annoying to maintain (the reset > logic is confusing enough as is). 
> And, besides the implementation complexity, if a user resets
> pg_stat_bgwriter and not pg_stat_buffers (or vice versa), they will
> see totally different numbers for "buffers_backend" in pg_stat_bgwriter
> than shared buffers written by B_BACKEND in pg_stat_buffers. I would
> find that confusing.

In a quick chat off-list, Andres suggested it might be okay to have a
single reset target for both the pg_stat_buffers view and legacy
pg_stat_bgwriter view. So, I am planning to share a new patchset which
has only the new "buffers" target, which will also reset the legacy
pg_stat_bgwriter view.

I'll also remove the bgwriter stats I proposed and the
pg_stat_checkpointer view to keep things simple for now.

- Melanie
On Thu, Dec 30, 2021 at 3:30 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Dec 21, 2021 at 8:32 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > On Thu, Dec 16, 2021 at 3:18 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > From 9f22da9041e1e1fbc0ef003f5f78f4e72274d438 Mon Sep 17 00:00:00 2001 > > > > > > From: Melanie Plageman <melanieplageman@gmail.com> > > > > > > Date: Wed, 24 Nov 2021 12:20:10 -0500 > > > > > > Subject: [PATCH v17 6/7] Remove superfluous bgwriter stats > > > > > > > > > > > > Remove stats from pg_stat_bgwriter which are now more clearly expressed > > > > > > in pg_stat_buffers. > > > > > > > > > > > > TODO: > > > > > > - make pg_stat_checkpointer view and move relevant stats into it > > > > > > - add additional stats to pg_stat_bgwriter > > > > > > > > > > When do you think it makes sense to tackle these wrt committing some of the > > > > > patches? > > > > > > > > Well, the new stats are a superset of the old stats (no stats have been > > > > removed that are not represented in the new or old views). So, I don't > > > > see that as a blocker for committing these patches. > > > > > > > Since it is weird that pg_stat_bgwriter had mostly checkpointer stats, > > > > I've edited this commit to rename that view to pg_stat_checkpointer. > > > > > > > I have not made a separate view just for maxwritten_clean (presumably > > > > called pg_stat_bgwriter), but I would not be opposed to doing this if > > > > you thought having a view with a single column isn't a problem (in the > > > > event that we don't get around to adding more bgwriter stats right > > > > away). > > > > > > How about keeping old bgwriter values in place in the view , but generated > > > from the new stats stuff? > > > > I tried this, but I actually don't think it is the right way to go. In > > order to maintain the old view with the new source code, I had to add > > new code to maintain a separate resets array just for the bgwriter view. 
> > It adds some fiddly code that will be annoying to maintain (the reset > > logic is confusing enough as is). > > And, besides the implementation complexity, if a user resets > > pg_stat_bgwriter and not pg_stat_buffers (or vice versa), they will > > see totally different numbers for "buffers_backend" in pg_stat_bgwriter > > than shared buffers written by B_BACKEND in pg_stat_buffers. I would > > find that confusing. > > In a quick chat off-list, Andres suggested it might be okay to have a > single reset target for both the pg_stat_buffers view and legacy > pg_stat_bgwriter view. So, I am planning to share a new patchset which > has only the new "buffers" target which will also reset the legacy > pg_stat_bgwriter view. > > I'll also remove the bgwriter stats I proposed and the > pg_stat_checkpointer view to keep things simple for now. > I've done the above in v20, attached. - Melanie
Attachment
- v20-0006-Add-system-view-tracking-IO-ops-per-backend-type.patch
- v20-0005-Add-buffers-to-pgstat_reset_shared_counters.patch
- v20-0008-small-comment-correction.patch
- v20-0004-Send-IO-operations-to-stats-collector.patch
- v20-0007-Remove-superfluous-bgwriter-stats-code.patch
- v20-0003-Add-IO-operation-counters-to-PgBackendStatus.patch
- v20-0001-Read-only-atomic-backend-write-function.patch
- v20-0002-Move-backend-pgstat-initialization-earlier.patch
v21 rebased with compile errors fixed is attached.
Attachment
- v21-0008-small-comment-correction.patch
- v21-0007-Remove-superfluous-bgwriter-stats-code.patch
- v21-0004-Send-IO-operations-to-stats-collector.patch
- v21-0005-Add-buffers-to-pgstat_reset_shared_counters.patch
- v21-0006-Add-system-view-tracking-IO-ops-per-backend-type.patch
- v21-0002-Move-backend-pgstat-initialization-earlier.patch
- v21-0003-Add-IO-operation-counters-to-PgBackendStatus.patch
- v21-0001-Read-only-atomic-backend-write-function.patch
Hi,

On 2022-02-19 11:06:18 -0500, Melanie Plageman wrote:
> v21 rebased with compile errors fixed is attached.

This currently doesn't apply (mea culpa likely):
http://cfbot.cputube.org/patch_37_3272.log

Could you rebase? Marked as waiting-on-author for now.

- Andres
I already rebased this in a local branch, so here it is. I don't expect it to survive the day. This should be updated to use the tuplestore helper.
Attachment
- 0001-Read-only-atomic-backend-write-function.patch
- 0002-Move-backend-pgstat-initialization-earlier.patch
- 0003-Add-IO-operation-counters-to-PgBackendStatus.patch
- 0004-Send-IO-operations-to-stats-collector.patch
- 0005-Add-buffers-to-pgstat_reset_shared_counters.patch
- 0006-Add-system-view-tracking-IO-ops-per-backend-type.patch
- 0007-Remove-superfluous-bgwriter-stats-code.patch
- 0008-small-comment-correction.patch
On Mon, Mar 21, 2022 at 8:15 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2022-02-19 11:06:18 -0500, Melanie Plageman wrote:
> > v21 rebased with compile errors fixed is attached.
>
> This currently doesn't apply (mea culpa likely):
> http://cfbot.cputube.org/patch_37_3272.log
>
> Could you rebase? Marked as waiting-on-author for now.
Attached is the rebased/rewritten version of the pg_stat_buffers patch
which uses the cumulative stats system instead of stats collector.
I've moved to the model of backend-local pending stats which get
accumulated into shared memory by pgstat_report_stat().
It is worth noting that, with this method, other backends will no longer
have access to each other's individual IO operation statistics. An
argument could be made to keep the statistics in each backend in
PgBackendStatus before accumulating them to the cumulative stats system
so that they can be accessed at the per-backend level of detail.
There are two TODOs related to when pgstat_report_io_ops() should be
called. pgstat_report_io_ops() is meant for backends that will not
commonly call pgstat_report_stat(). I was unsure if it made sense for
BootstrapModeMain() to explicitly call pgstat_report_io_ops() and if
auto vacuum worker should call it explicitly and, if so, if it was the
right location to call it after do_autovacuum().
Archiver and syslogger do not increment or report IO operations.
I did not change pg_stat_bgwriter fields to derive from the IO
operations statistics structures since the reset targets differ.
Also, I added one test, but I'm not sure if it will be flakey. It tests
that the "writes" for checkpointer are tracked when data is inserted
into a table and then CHECKPOINT is explicitly invoked directly after. I
don't know if this will have a problem if the checkpointer is busy and
somehow the backend which dirtied the buffer is forced to write out its
own buffer, causing the test to potentially fail (even if the
checkpointer is doing other writes [causing it to be busy], it may not
do them in between the INSERT and the SELECT from pg_stat_buffers).
I am wondering how to add a non-flakey test. For regular backends, I
couldn't think of a way to suspend checkpointer to make them do their
own writes and fsyncs in the context of a regression or isolation test.
In fact for many of the dirty buffers it seems like it will be difficult
to keep bgwriter, checkpointer, and regular backends from competing and
sometimes causing test failures.
- Melanie
Attachment
Hi,

On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
> From 2d089e26236c55d1be5b93833baa0cf7667ba38d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 28 Jun 2022 11:33:04 -0400
> Subject: [PATCH v22 1/3] Add BackendType for standalone backends
>
> All backends should have a BackendType to enable statistics reporting
> per BackendType.
>
> Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
> alphabetize the BackendTypes). Both the bootstrap backend and single
> user mode backends will have BackendType B_STANDALONE_BACKEND.
>
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
> ---
>  src/backend/utils/init/miscinit.c | 17 +++++++++++------
>  src/include/miscadmin.h           |  5 +++--
>  2 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
> index eb43b2c5e5..07e6db1a1c 100644
> --- a/src/backend/utils/init/miscinit.c
> +++ b/src/backend/utils/init/miscinit.c
> @@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
>  {
>  	Assert(!IsPostmasterEnvironment);
>
> +	MyBackendType = B_STANDALONE_BACKEND;

Hm. This is used for singleuser mode as well as bootstrap. Should we split
those? It's not like bootstrap mode really matters for stats, so I'm
inclined not to.

> @@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
>  	 * out the initial relation mapping files.
>  	 */
>  	RelationMapFinishBootstrap();
> +	// TODO: should this be done for bootstrap?
> +	pgstat_report_io_ops();

Hm. Not particularly useful, but also not harmful. But we don't need an
explicit call, because it'll be done at process exit too. At least I think,
it could be that it's different for bootstrap.
> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c > index 2e146aac93..e6dbb1c4bb 100644 > --- a/src/backend/postmaster/autovacuum.c > +++ b/src/backend/postmaster/autovacuum.c > @@ -1712,6 +1712,9 @@ AutoVacWorkerMain(int argc, char *argv[]) > recentXid = ReadNextTransactionId(); > recentMulti = ReadNextMultiXactId(); > do_autovacuum(); > + > + // TODO: should this be done more often somewhere in do_autovacuum()? > + pgstat_report_io_ops(); > } Don't think you need all these calls before process exit - it'll happen via pgstat_shutdown_hook(). IMO it'd be a good idea to add pgstat_report_io_ops() to pgstat_report_vacuum()/analyze(), so that the stats for a longrunning autovac worker get updated more regularly. > diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c > index 91e6f6ea18..87e4b9e9bd 100644 > --- a/src/backend/postmaster/bgwriter.c > +++ b/src/backend/postmaster/bgwriter.c > @@ -242,6 +242,7 @@ BackgroundWriterMain(void) > > /* Report pending statistics to the cumulative stats system */ > pgstat_report_bgwriter(); > + pgstat_report_io_ops(); > > if (FirstCallSinceLastCheckpoint()) > { How about moving the pgstat_report_io_ops() into pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems unnecessary to have multiple pgstat_* calls in these places. > +/* > + * Flush out locally pending IO Operation statistics entries > + * > + * If nowait is true, this function returns false on lock failure. Otherwise > + * this function always returns true. Writer processes are mutually excluded > + * using LWLock, but readers are expected to use change-count protocol to avoid > + * interference with writers. > + * > + * If nowait is true, this function returns true if the lock could not be > + * acquired. Otherwise return false. 
> + * > + */ > +bool > +pgstat_flush_io_ops(bool nowait) > +{ > + PgStat_IOPathOps *dest_io_path_ops; > + PgStatShared_BackendIOPathOps *stats_shmem; > + > + PgBackendStatus *beentry = MyBEEntry; > + > + if (!have_ioopstats) > + return false; > + > + if (!beentry || beentry->st_backendType == B_INVALID) > + return false; > + > + stats_shmem = &pgStatLocal.shmem->io_ops; > + > + if (!nowait) > + LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE); > + else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE)) > + return true; Wonder if it's worth making the lock specific to the backend type? > + dest_io_path_ops = > + &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)]; > + This could be done before acquiring the lock, right? > +void > +pgstat_io_ops_snapshot_cb(void) > +{ > + PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops; > + PgStat_IOPathOps *snapshot_ops = pgStatLocal.snapshot.io_path_ops; > + PgStat_IOPathOps *reset_ops; > + > + PgStat_IOPathOps *reset_offset = stats_shmem->reset_offset; > + PgStat_IOPathOps reset[BACKEND_NUM_TYPES]; > + > + pgstat_copy_changecounted_stats(snapshot_ops, > + &stats_shmem->stats, sizeof(stats_shmem->stats), > + &stats_shmem->changecount); This doesn't make sense - with multiple writers you can't use the changecount approach (and you don't in the flush part above). > + LWLockAcquire(&stats_shmem->lock, LW_SHARED); > + memcpy(&reset, reset_offset, sizeof(stats_shmem->stats)); > + LWLockRelease(&stats_shmem->lock); Which then also means that you don't need the reset offset stuff. It's only there because with the changecount approach we can't take a lock to reset the stats (since there is no lock). With a lock you can just reset the shared state. 
> +void > +pgstat_count_io_op(IOOp io_op, IOPath io_path) > +{ > + PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path]; > + PgStat_IOOpCounters *cumulative_counters = > + &cumulative_IOOpStats.data[io_path]; the pending_/cumultive_ prefix before an uppercase-first camelcase name seems ugly... > + switch (io_op) > + { > + case IOOP_ALLOC: > + pending_counters->allocs++; > + cumulative_counters->allocs++; > + break; > + case IOOP_EXTEND: > + pending_counters->extends++; > + cumulative_counters->extends++; > + break; > + case IOOP_FSYNC: > + pending_counters->fsyncs++; > + cumulative_counters->fsyncs++; > + break; > + case IOOP_WRITE: > + pending_counters->writes++; > + cumulative_counters->writes++; > + break; > + } > + > + have_ioopstats = true; > +} Doing two math ops / memory accesses every time seems off. Seems better to maintain cumultive_counters whenever reporting stats, just before zeroing pending_counters? > +/* > + * Report IO operation statistics > + * > + * This works in much the same way as pgstat_flush_io_ops() but is meant for > + * BackendTypes like bgwriter for whom pgstat_report_stat() will not be called > + * frequently enough to keep shared memory stats fresh. > + * Backends not typically calling pgstat_report_stat() can invoke > + * pgstat_report_io_ops() explicitly. > + */ > +void > +pgstat_report_io_ops(void) > +{ This shouldn't be needed - the flush function above can be used. > + PgStat_IOPathOps *dest_io_path_ops; > + PgStatShared_BackendIOPathOps *stats_shmem; > + > + PgBackendStatus *beentry = MyBEEntry; > + > + Assert(!pgStatLocal.shmem->is_shutdown); > + pgstat_assert_is_up(); > + > + if (!have_ioopstats) > + return; > + > + if (!beentry || beentry->st_backendType == B_INVALID) > + return; Is there a case where this may be called where we have no beentry? Why not just use MyBackendType? 
> + stats_shmem = &pgStatLocal.shmem->io_ops; > + > + dest_io_path_ops = > + &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)]; > + > + pgstat_begin_changecount_write(&stats_shmem->changecount); A mentioned before, the changecount stuff doesn't apply here. You need a lock. > +PgStat_IOPathOps * > +pgstat_fetch_backend_io_path_ops(void) > +{ > + pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS); > + return pgStatLocal.snapshot.io_path_ops; > +} > + > +PgStat_Counter > +pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op) > +{ > + PgStat_IOOpCounters *counters = &cumulative_IOOpStats.data[io_path]; > + > + switch (io_op) > + { > + case IOOP_ALLOC: > + return counters->allocs; > + case IOOP_EXTEND: > + return counters->extends; > + case IOOP_FSYNC: > + return counters->fsyncs; > + case IOOP_WRITE: > + return counters->writes; > + default: > + elog(ERROR, "IO Operation %s for IO Path %s is undefined.", > + pgstat_io_op_desc(io_op), pgstat_io_path_desc(io_path)); > + } > +} There's currently no user for this, right? Maybe let's just defer the cumulative stuff until we need it? > +const char * > +pgstat_io_path_desc(IOPath io_path) > +{ > + const char *io_path_desc = "Unknown IO Path"; > + This should be unreachable, right? > From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Mon, 4 Jul 2022 15:44:17 -0400 > Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type > Add pg_stat_buffers, a system view which tracks the number of IO > operations (allocs, writes, fsyncs, and extends) done through each IO > path (e.g. shared buffers, local buffers, unbuffered IO) by each type of > backend. I think I like pg_stat_io a bit better? Nearly everything in here seems to fit better in that. I guess we could split out buffers allocated, but that's actually interesting in the context of the kind of IO too. 
> <row> > <entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry> > <entry>One row only, showing statistics about WAL activity. See > @@ -3595,7 +3604,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i > <structfield>stats_reset</structfield> <type>timestamp with time zone</type> > </para> > <para> > - Time at which these statistics were last reset > + Time at which these statistics were last reset. > + </para></entry> Grammar critique time :) > +CREATE VIEW pg_stat_buffers AS > +SELECT > + b.backend_type, > + b.io_path, > + b.alloc, > + b.extend, > + b.fsync, > + b.write, > + b.stats_reset > +FROM pg_stat_get_buffers() b; Do we want to expose all data to all users? I guess pg_stat_bgwriter does? But this does split things out a lot more... > + for (int i = 0; i < BACKEND_NUM_TYPES; i++) > + { > + PgStat_IOOpCounters *counters = io_path_ops->data; > + Datum backend_type_desc = > + CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i))); > + /* const char *log_name = GetBackendTypeDesc(idx_get_backend_type(i)); */ > + > + for (int j = 0; j < IOPATH_NUM_TYPES; j++) > + { > + Datum values[BUFFERS_NUM_COLUMNS]; > + bool nulls[BUFFERS_NUM_COLUMNS]; > + memset(values, 0, sizeof(values)); > + memset(nulls, 0, sizeof(nulls)); > + > + values[BUFFERS_COLUMN_BACKEND_TYPE] = backend_type_desc; > + values[BUFFERS_COLUMN_IO_PATH] = CStringGetTextDatum(pgstat_io_path_desc(j)); Random musing: I wonder if we should start to use SQL level enums for this kind of thing. 
> DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4; > DROP TABLE prevstats; > +SELECT pg_stat_reset_shared('buffers'); > + pg_stat_reset_shared > +---------------------- > + > +(1 row) > + > +SELECT pg_stat_force_next_flush(); > + pg_stat_force_next_flush > +-------------------------- > + > +(1 row) > + > +SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer'; > + ?column? > +---------- > + t > +(1 row) Don't think you can rely on that. The lookup of the view, functions might have needed to load catalog data, which might have needed to evict buffers. I think you can do something more reliable by checking that there's more written buffers after a checkpoint than before, or such. Would be nice to have something testing that the ringbuffer stats stuff does something sensible - that feels not entirely trivial. Greetings, Andres Freund
Hi,
In the attached patch set, I've added missing IO operations for
certain IO Paths and enumerated in the commit message which IO
Paths and IO Operations are not currently counted and/or not possible.
There is a TODO in HandleWalWriterInterrupts() about removing
pgstat_report_wal(), since it is immediately before a proc_exit().
I was wondering if LocalBufferAlloc() should increment the counter or if
I should wait until GetLocalBufferStorage() to increment the counter.
I also realized that I am not differentiating between IOPATH_SHARED and
IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
of buffer we are fsync'ing by the time we call register_dirty_segment(),
I'm not sure how we would fix this.
certain IO Paths as well as enumerating in the commit message which IO
Paths and IO Operations are not currently counted and or not possible.
There is a TODO in HandleWalWriterInterrupts() about removing
pgstat_report_wal() since it is immediately before a proc_exit()
I was wondering if LocalBufferAlloc() should increment the counter or if
I should wait until GetLocalBufferStorage() to increment the counter.
I also realized that I am not differentiating between IOPATH_SHARED and
IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
of buffer we are fsync'ing by the time we call register_dirty_segment(),
I'm not sure how we would fix this.
On Wed, Jul 6, 2022 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
> From 2d089e26236c55d1be5b93833baa0cf7667ba38d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 28 Jun 2022 11:33:04 -0400
> Subject: [PATCH v22 1/3] Add BackendType for standalone backends
>
> All backends should have a BackendType to enable statistics reporting
> per BackendType.
>
> Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
> alphabetize the BackendTypes). Both the bootstrap backend and single
> user mode backends will have BackendType B_STANDALONE_BACKEND.
>
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
> ---
> src/backend/utils/init/miscinit.c | 17 +++++++++++------
> src/include/miscadmin.h | 5 +++--
> 2 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
> index eb43b2c5e5..07e6db1a1c 100644
> --- a/src/backend/utils/init/miscinit.c
> +++ b/src/backend/utils/init/miscinit.c
> @@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
> {
> Assert(!IsPostmasterEnvironment);
>
> + MyBackendType = B_STANDALONE_BACKEND;
Hm. This is used for singleuser mode as well as bootstrap. Should we
split those? It's not like bootstrap mode really matters for stats, so
I'm inclined not to.
I have no opinion currently.
It depends on how commonly you think developers might want separate
bootstrap and single user mode IO stats.
> @@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
> * out the initial relation mapping files.
> */
> RelationMapFinishBootstrap();
> + // TODO: should this be done for bootstrap?
> + pgstat_report_io_ops();
Hm. Not particularly useful, but also not harmful. But we don't need an
explicit call, because it'll be done at process exit too. At least I
think, it could be that it's different for bootstrap.
I've removed this and other occurrences which were before proc_exit()
(and thus redundant). (Though I did not explicitly check if it was
different for bootstrap.)
> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 2e146aac93..e6dbb1c4bb 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1712,6 +1712,9 @@ AutoVacWorkerMain(int argc, char *argv[])
> recentXid = ReadNextTransactionId();
> recentMulti = ReadNextMultiXactId();
> do_autovacuum();
> +
> + // TODO: should this be done more often somewhere in do_autovacuum()?
> + pgstat_report_io_ops();
> }
Don't think you need all these calls before process exit - it'll happen
via pgstat_shutdown_hook().
IMO it'd be a good idea to add pgstat_report_io_ops() to
pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
autovac worker get updated more regularly.
noted and fixed.
> diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
> index 91e6f6ea18..87e4b9e9bd 100644
> --- a/src/backend/postmaster/bgwriter.c
> +++ b/src/backend/postmaster/bgwriter.c
> @@ -242,6 +242,7 @@ BackgroundWriterMain(void)
>
> /* Report pending statistics to the cumulative stats system */
> pgstat_report_bgwriter();
> + pgstat_report_io_ops();
>
> if (FirstCallSinceLastCheckpoint())
> {
How about moving the pgstat_report_io_ops() into
pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
unnecessary to have multiple pgstat_* calls in these places.
noted and fixed.
> +/*
> + * Flush out locally pending IO Operation statistics entries
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true. Writer processes are mutually excluded
> + * using LWLock, but readers are expected to use change-count protocol to avoid
> + * interference with writers.
> + *
> + * If nowait is true, this function returns true if the lock could not be
> + * acquired. Otherwise return false.
> + *
> + */
> +bool
> +pgstat_flush_io_ops(bool nowait)
> +{
> + PgStat_IOPathOps *dest_io_path_ops;
> + PgStatShared_BackendIOPathOps *stats_shmem;
> +
> + PgBackendStatus *beentry = MyBEEntry;
> +
> + if (!have_ioopstats)
> + return false;
> +
> + if (!beentry || beentry->st_backendType == B_INVALID)
> + return false;
> +
> + stats_shmem = &pgStatLocal.shmem->io_ops;
> +
> + if (!nowait)
> + LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
> + else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
> + return true;
Wonder if it's worth making the lock specific to the backend type?
I've added another Lock into PgStat_IOPathOps so that each BackendType
can be locked separately. But, I've also kept the lock in
PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
done easily.
> + dest_io_path_ops =
> + &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
> +
This could be done before acquiring the lock, right?
> +void
> +pgstat_io_ops_snapshot_cb(void)
> +{
> + PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
> + PgStat_IOPathOps *snapshot_ops = pgStatLocal.snapshot.io_path_ops;
> + PgStat_IOPathOps *reset_ops;
> +
> + PgStat_IOPathOps *reset_offset = stats_shmem->reset_offset;
> + PgStat_IOPathOps reset[BACKEND_NUM_TYPES];
> +
> + pgstat_copy_changecounted_stats(snapshot_ops,
> + &stats_shmem->stats, sizeof(stats_shmem->stats),
> + &stats_shmem->changecount);
This doesn't make sense - with multiple writers you can't use the
changecount approach (and you don't in the flush part above).
> + LWLockAcquire(&stats_shmem->lock, LW_SHARED);
> + memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
> + LWLockRelease(&stats_shmem->lock);
Which then also means that you don't need the reset offset stuff. It's
only there because with the changecount approach we can't take a lock to
reset the stats (since there is no lock). With a lock you can just reset
the shared state.
Yes, I believe I have cleaned up all of this embarrassing mess. I use the
lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the
locks in PgStat_IOPathOps for flush.
> +void
> +pgstat_count_io_op(IOOp io_op, IOPath io_path)
> +{
> + PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
> + PgStat_IOOpCounters *cumulative_counters =
> + &cumulative_IOOpStats.data[io_path];
the pending_/cumulative_ prefix before an uppercase-first camelcase name
seems ugly...
> + switch (io_op)
> + {
> + case IOOP_ALLOC:
> + pending_counters->allocs++;
> + cumulative_counters->allocs++;
> + break;
> + case IOOP_EXTEND:
> + pending_counters->extends++;
> + cumulative_counters->extends++;
> + break;
> + case IOOP_FSYNC:
> + pending_counters->fsyncs++;
> + cumulative_counters->fsyncs++;
> + break;
> + case IOOP_WRITE:
> + pending_counters->writes++;
> + cumulative_counters->writes++;
> + break;
> + }
> +
> + have_ioopstats = true;
> +}
Doing two math ops / memory accesses every time seems off. Seems better
to maintain cumulative_counters whenever reporting stats, just before
zeroing pending_counters?
I've gone ahead and cut the cumulative counters concept.
> +/*
> + * Report IO operation statistics
> + *
> + * This works in much the same way as pgstat_flush_io_ops() but is meant for
> + * BackendTypes like bgwriter for whom pgstat_report_stat() will not be called
> + * frequently enough to keep shared memory stats fresh.
> + * Backends not typically calling pgstat_report_stat() can invoke
> + * pgstat_report_io_ops() explicitly.
> + */
> +void
> +pgstat_report_io_ops(void)
> +{
This shouldn't be needed - the flush function above can be used.
Fixed.
> + PgStat_IOPathOps *dest_io_path_ops;
> + PgStatShared_BackendIOPathOps *stats_shmem;
> +
> + PgBackendStatus *beentry = MyBEEntry;
> +
> + Assert(!pgStatLocal.shmem->is_shutdown);
> + pgstat_assert_is_up();
> +
> + if (!have_ioopstats)
> + return;
> +
> + if (!beentry || beentry->st_backendType == B_INVALID)
> + return;
Is there a case where this may be called where we have no beentry?
Why not just use MyBackendType?
Fixed.
> + stats_shmem = &pgStatLocal.shmem->io_ops;
> +
> + dest_io_path_ops =
> + &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
> +
> + pgstat_begin_changecount_write(&stats_shmem->changecount);
As mentioned before, the changecount stuff doesn't apply here. You need a
lock.
Fixed.
> +PgStat_IOPathOps *
> +pgstat_fetch_backend_io_path_ops(void)
> +{
> + pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
> + return pgStatLocal.snapshot.io_path_ops;
> +}
> +
> +PgStat_Counter
> +pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op)
> +{
> + PgStat_IOOpCounters *counters = &cumulative_IOOpStats.data[io_path];
> +
> + switch (io_op)
> + {
> + case IOOP_ALLOC:
> + return counters->allocs;
> + case IOOP_EXTEND:
> + return counters->extends;
> + case IOOP_FSYNC:
> + return counters->fsyncs;
> + case IOOP_WRITE:
> + return counters->writes;
> + default:
> + elog(ERROR, "IO Operation %s for IO Path %s is undefined.",
> + pgstat_io_op_desc(io_op), pgstat_io_path_desc(io_path));
> + }
> +}
There's currently no user for this, right? Maybe let's just defer the
cumulative stuff until we need it?
Removed.
> +const char *
> +pgstat_io_path_desc(IOPath io_path)
> +{
> + const char *io_path_desc = "Unknown IO Path";
> +
This should be unreachable, right?
Changed it to an error.
> From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 4 Jul 2022 15:44:17 -0400
> Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type
> Add pg_stat_buffers, a system view which tracks the number of IO
> operations (allocs, writes, fsyncs, and extends) done through each IO
> path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
> backend.
I think I like pg_stat_io a bit better? Nearly everything in here seems
to fit better in that.
I guess we could split out buffers allocated, but that's actually
interesting in the context of the kind of IO too.
changed it to pg_stat_io
> +CREATE VIEW pg_stat_buffers AS
> +SELECT
> + b.backend_type,
> + b.io_path,
> + b.alloc,
> + b.extend,
> + b.fsync,
> + b.write,
> + b.stats_reset
> +FROM pg_stat_get_buffers() b;
Do we want to expose all data to all users? I guess pg_stat_bgwriter
does? But this does split things out a lot more...
I didn't see another similar example limiting access.
> DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
> DROP TABLE prevstats;
> +SELECT pg_stat_reset_shared('buffers');
> + pg_stat_reset_shared
> +----------------------
> +
> +(1 row)
> +
> +SELECT pg_stat_force_next_flush();
> + pg_stat_force_next_flush
> +--------------------------
> +
> +(1 row)
> +
> +SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
> + ?column?
> +----------
> + t
> +(1 row)
Don't think you can rely on that. The lookup of the view, functions
might have needed to load catalog data, which might have needed to evict
buffers. I think you can do something more reliable by checking that
there's more written buffers after a checkpoint than before, or such.
Yes, per an off list suggestion by you, I have changed the tests to use a
sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
the missing calls to count IO Operations for IOPATH_LOCAL and
IOPATH_STRATEGY.
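For reference, the "assert growth rather than an absolute value" shape of
such a test might look something like the following psql sketch (view and
column names as in this patch version, so subject to change; untested):

```sql
-- capture the current total before doing anything
SELECT sum(write) AS writes_before
  FROM pg_stat_buffers
 WHERE io_path = 'Shared'
\gset

-- force some shared-buffer writes, then flush pending stats
CREATE TABLE test_io AS SELECT generate_series(1, 100000) AS g;
CHECKPOINT;
SELECT pg_stat_force_next_flush();

-- assert only that the total grew, since catalog access during the
-- query itself may also have evicted (and thus written) buffers
SELECT sum(write) > :writes_before
  FROM pg_stat_buffers
 WHERE io_path = 'Shared';
```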
I struggled to come up with a way to test that writes for a particular
type of backend are counted correctly, since a dirty buffer could be
written out by another type of backend before the target BackendType has
a chance to write it out.
I also struggled to come up with a way to test IO operations for
background workers. I'm not sure of a way to deterministically have a
background worker do a particular kind of IO in a test scenario.
I'm not sure how to cause a strategy "extend" for testing.
Would be nice to have something testing that the ringbuffer stats stuff
does something sensible - that feels not entirely trivial.
I've added a test to test that reused strategy buffers are counted as
allocs. I would like to add a test which checks that if a buffer in the
ring is pinned and thus not reused, that it is not counted as a strategy
alloc, but I found it challenging without a way to pause vacuuming, pin
a buffer, then resume vacuuming.
Thanks,
Melanie
At Mon, 11 Jul 2022 22:22:28 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in > Hi, > > In the attached patch set, I've added in missing IO operations for > certain IO Paths as well as enumerating in the commit message which IO > Paths and IO Operations are not currently counted and or not possible. > > There is a TODO in HandleWalWriterInterrupts() about removing > pgstat_report_wal() since it is immediately before a proc_exit() Right. walwriter does that without needing the explicit call. > I was wondering if LocalBufferAlloc() should increment the counter or if > I should wait until GetLocalBufferStorage() to increment the counter. Depends on what "allocate" means. Different from shared buffers, local buffers are taken from OS then allocated to page. OS-allcoated pages are restricted by num_temp_buffers so I think what we're interested in is the count incremented by LocalBuferAlloc(). (And it is the parallel of alloc for shared-buffers) > I also realized that I am not differentiating between IOPATH_SHARED and > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type > of buffer we are fsync'ing by the time we call register_dirty_segment(), > I'm not sure how we would fix this. I think there scarcely happens flush for strategy-loaded buffers. If that is sensible, IOOP_FSYNC would not make much sense for IOPATH_STRATEGY. > On Wed, Jul 6, 2022 at 3:20 PM Andres Freund <andres@anarazel.de> wrote: > > > On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote: > > > @@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0) > > > { > > > Assert(!IsPostmasterEnvironment); > > > > > > + MyBackendType = B_STANDALONE_BACKEND; > > > > Hm. This is used for singleuser mode as well as bootstrap. Should we > > split those? It's not like bootstrap mode really matters for stats, so > > I'm inclined not to. > > > > > I have no opinion currently. 
> It depends on how commonly you think developers might want separate > bootstrap and single user mode IO stats. Regarding to stats, I don't think separating them makes much sense. > > > @@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool > > check_only) > > > * out the initial relation mapping files. > > > */ > > > RelationMapFinishBootstrap(); > > > + // TODO: should this be done for bootstrap? > > > + pgstat_report_io_ops(); > > > > Hm. Not particularly useful, but also not harmful. But we don't need an > > explicit call, because it'll be done at process exit too. At least I > > think, it could be that it's different for bootstrap. > > I've removed this and other occurrences which were before proc_exit() > (and thus redundant). (Though I did not explicitly check if it was > different for bootstrap.) pgstat_report_stat(true) is supposed to be called as needed via before_shmem_hook so I think that's the right thing. > > IMO it'd be a good idea to add pgstat_report_io_ops() to > > pgstat_report_vacuum()/analyze(), so that the stats for a longrunning > > autovac worker get updated more regularly. > > > > noted and fixed. > > How about moving the pgstat_report_io_ops() into > > pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems > > unnecessary to have multiple pgstat_* calls in these places. > > > > > > > noted and fixed. + * Also report IO Operations statistics I think that the function comment also should mention this. > > Wonder if it's worth making the lock specific to the backend type? > > > > I've added another Lock into PgStat_IOPathOps so that each BackendType > can be locked separately. But, I've also kept the lock in > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be > done easily. Looks fine about the lock separation. By the way, in the following line: + &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)]; backend_type_get_idx(x) is actually (x - 1) plus assertion on the value range. 
And the only use-case is here. There's an reverse function and also used only at one place. + Datum backend_type_desc = + CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i))); In this usage GetBackendTypeDesc() gracefully treats out-of-domain values but idx_get_backend_type keenly kills the process for the same. This is inconsistent. My humbel opinion on this is we don't define the two functions and replace the calls to them with (x +/- 1). Addition to that, I think we should not abort() by invalid backend types. In that sense, I wonder if we could use B_INVALIDth element for this purpose. > > > + LWLockAcquire(&stats_shmem->lock, LW_SHARED); > > > + memcpy(&reset, reset_offset, sizeof(stats_shmem->stats)); > > > + LWLockRelease(&stats_shmem->lock); > > > > Which then also means that you don't need the reset offset stuff. It's > > only there because with the changecount approach we can't take a lock to > > reset the stats (since there is no lock). With a lock you can just reset > > the shared state. > > > > Yes, I believe I have cleaned up all of this embarrassing mess. I use the > lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the > locks in PgStat_IOPathOps for flush. Looks fine, but I think pgstat_flush_io_ops() need more comments like other pgstat_flush_* functions. + for (int i = 0; i < BACKEND_NUM_TYPES; i++) + stats_shmem->stats[i].stat_reset_timestamp = ts; I'm not sure we need a separate reset timestamp for each backend type but SLRU counter does the same thing.. > > > +pgstat_report_io_ops(void) > > > +{ > > > > This shouldn't be needed - the flush function above can be used. > > > > Fixed. The commit message of 0002 contains that name:p > > > +const char * > > > +pgstat_io_path_desc(IOPath io_path) > > > +{ > > > + const char *io_path_desc = "Unknown IO Path"; > > > + > > > > This should be unreachable, right? > > > > Changed it to an error. 
+ elog(ERROR, "Attempt to describe an unknown IOPath"); I think we usually spell it as ("unrecognized IOPath value: %d", io_path). > > > From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001 > > > From: Melanie Plageman <melanieplageman@gmail.com> > > > Date: Mon, 4 Jul 2022 15:44:17 -0400 > > > Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type > > > > > Add pg_stat_buffers, a system view which tracks the number of IO > > > operations (allocs, writes, fsyncs, and extends) done through each IO > > > path (e.g. shared buffers, local buffers, unbuffered IO) by each type of > > > backend. > > > > I think I like pg_stat_io a bit better? Nearly everything in here seems > > to fit better in that. > > > > I guess we could split out buffers allocated, but that's actually > > interesting in the context of the kind of IO too. > > > > changed it to pg_stat_io A bit different thing, but I felt a little uneasy about some uses of "pgstat_io_ops". IOOp looks like a neighbouring word of IOPath. On the other hand, actually iopath is used as an attribute of io_ops in many places. Couldn't we be more consistent about the relationship between the names? IOOp -> PgStat_IOOpType IOPath -> PgStat_IOPath PgStat_IOOpCOonters -> PgStat_IOCounters PgStat_IOPathOps -> PgStat_IO pgstat_count_io_op -> pgstat_count_io ... (Better wordings are welcome.) > > > +CREATE VIEW pg_stat_buffers AS > > > +SELECT > > > + b.backend_type, > > > + b.io_path, > > > + b.alloc, > > > + b.extend, > > > + b.fsync, > > > + b.write, > > > + b.stats_reset > > > +FROM pg_stat_get_buffers() b; > > > > Do we want to expose all data to all users? I guess pg_stat_bgwriter > > does? But this does split things out a lot more... > > > > > I didn't see another similar example limiting access. (The doc told me that) pg_buffercache view is restricted to pg_monitor. But other activity-stats(aka stats collector:)-related pg_stat_* views are not restricted to pg_monitor. 
doc> pg_monitor Read/execute various monitoring views and functions.

Hmm....

> > Don't think you can rely on that. The lookup of the view, functions
> > might have needed to load catalog data, which might have needed to evict
> > buffers. I think you can do something more reliable by checking that
> > there's more written buffers after a checkpoint than before, or such.
>
> Yes, per an off list suggestion by you, I have changed the tests to use a
> sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
> the missing calls to count IO Operations for IOPATH_LOCAL and
> IOPATH_STRATEGY.
>
> I struggled to come up with a way to test writes for a particular
> type of backend are counted correctly since a dirty buffer could be
> written out by another type of backend before the target BackendType has
> a chance to write it out.
>
> I also struggled to come up with a way to test IO operations for
> background workers. I'm not sure of a way to deterministically have a
> background worker do a particular kind of IO in a test scenario.
>
> I'm not sure how to cause a strategy "extend" for testing.

I'm not sure what you are expecting, but for example, "create table t
as select generate_series(0, 99999)" increments Strategy-extend by
about 400. (I'm surprised that the autovac worker's shared-extend has
a non-zero count.)

> > Would be nice to have something testing that the ringbuffer stats stuff
> > does something sensible - that feels not entirely trivial.
>
> I've added a test to test that reused strategy buffers are counted as
> allocs. I would like to add a test which checks that if a buffer in the
> ring is pinned and thus not reused, that it is not counted as a strategy
> alloc, but I found it challenging without a way to pause vacuuming, pin
> a buffer, then resume vacuuming.

===

If I'm not missing something, in BufferAlloc, when a strategy is not
used and the victim is dirty, iopath is determined based on the
uninitialized from_ring.
It seems to me from_ring is equivalent to strategy_current_was_in_ring.
And if StrategyGetBuffer has set from_ring to false,
StrategyRejectBuffer may set it to true, which is wrong. The logic
around there seems to need a rethink.

What can we read from the values separated into Shared and Strategy?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Thanks for the review!
On Tue, Jul 12, 2022 at 4:06 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Mon, 11 Jul 2022 22:22:28 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in
> Hi,
>
> In the attached patch set, I've added in missing IO operations for
> certain IO Paths as well as enumerating in the commit message which IO
> Paths and IO Operations are not currently counted and or not possible.
>
> There is a TODO in HandleWalWriterInterrupts() about removing
> pgstat_report_wal() since it is immediately before a proc_exit()
Right. walwriter does that without needing the explicit call.
I have deleted it.
> I was wondering if LocalBufferAlloc() should increment the counter or if
> I should wait until GetLocalBufferStorage() to increment the counter.
Depends on what "allocate" means. Different from shared buffers, local
buffers are taken from the OS and then allocated to pages. OS-allocated
pages are restricted by num_temp_buffers, so I think what we're
interested in is the count incremented by LocalBufferAlloc(). (And it
is the parallel of alloc for shared buffers.)
I've left it in LocalBufferAlloc().
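The distinction being made here, counting the alloc at the point a local buffer is assigned to a page rather than when memory is first obtained from the OS, can be sketched as a standalone model. All names below are illustrative stand-ins for the patch's counters, not the actual PostgreSQL definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the IOPath/IOOp enums under discussion */
typedef enum { IOPATH_DIRECT, IOPATH_LOCAL, IOPATH_SHARED, IOPATH_STRATEGY,
               IOPATH_NUM_TYPES } IOPath;
typedef enum { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_WRITE,
               IOOP_NUM_TYPES } IOOp;

/* backend-local counters, to be flushed to shared memory later */
static uint64_t pending_io_ops[IOPATH_NUM_TYPES][IOOP_NUM_TYPES];

static void
pgstat_count_io_op(IOOp io_op, IOPath io_path)
{
    pending_io_ops[io_path][io_op]++;
}

/*
 * Stand-in for LocalBufferAlloc(): the alloc is counted when a local
 * buffer (bounded by num_temp_buffers) is assigned to a page, not in
 * the GetLocalBufferStorage() analogue where OS memory is obtained.
 */
static int
local_buffer_alloc(void)
{
    pgstat_count_io_op(IOOP_ALLOC, IOPATH_LOCAL);
    return 0;                   /* would return the chosen buffer id */
}
```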
> I also realized that I am not differentiating between IOPATH_SHARED and
> IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> of buffer we are fsync'ing by the time we call register_dirty_segment(),
> I'm not sure how we would fix this.
I think flushes scarcely happen for strategy-loaded buffers. If that
is sensible, IOOP_FSYNC would not make much sense for IOPATH_STRATEGY.
Why would it be less likely for a backend to do its own fsync when
flushing a dirty strategy buffer than a regular dirty shared buffer?
> > IMO it'd be a good idea to add pgstat_report_io_ops() to
> > pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
> > autovac worker get updated more regularly.
> >
>
> noted and fixed.
> > How about moving the pgstat_report_io_ops() into
> > pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
> > unnecessary to have multiple pgstat_* calls in these places.
> >
> >
> >
> noted and fixed.
+ * Also report IO Operations statistics
I think the function comment should also mention this.
I've added comments at the top of all these functions.
> > Wonder if it's worth making the lock specific to the backend type?
> >
>
> I've added another Lock into PgStat_IOPathOps so that each BackendType
> can be locked separately. But, I've also kept the lock in
> PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> done easily.
Looks fine about the lock separation.
Actually, I think it is not safe to use both of these locks. So for
picking one method, it is probably better to go with the locks in
PgStat_IOPathOps, it will be more efficient for flush (and not for
fetching and resetting), so that is probably the way to go, right?
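The single-kind-of-lock design being weighed here can be sketched outside PostgreSQL with ordinary pthread mutexes standing in for LWLocks. All structure and function names below are illustrative; the point is that per-backend-type locks alone suffice, because reset-all and snapshot can simply take each per-type lock in turn:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define BACKEND_NUM_TYPES 13    /* illustrative count of backend types */
#define NUM_COUNTERS 4

/* per-backend-type entry, each with its own lock (an LWLock stand-in) */
typedef struct
{
    pthread_mutex_t lock;
    uint64_t        counters[NUM_COUNTERS];
} IOStatsEntry;

static IOStatsEntry shared_stats[BACKEND_NUM_TYPES];

static void
io_stats_init(void)
{
    for (int t = 0; t < BACKEND_NUM_TYPES; t++)
        pthread_mutex_init(&shared_stats[t].lock, NULL);
}

/* flush: a backend touches only its own backend type's lock */
static void
io_stats_flush(int backend_type_idx, const uint64_t local[NUM_COUNTERS])
{
    IOStatsEntry *e = &shared_stats[backend_type_idx];

    pthread_mutex_lock(&e->lock);
    for (int i = 0; i < NUM_COUNTERS; i++)
        e->counters[i] += local[i];
    pthread_mutex_unlock(&e->lock);
}

/* reset_all: no outer wrapper lock needed; take each per-type lock in turn */
static void
io_stats_reset_all(void)
{
    for (int t = 0; t < BACKEND_NUM_TYPES; t++)
    {
        pthread_mutex_lock(&shared_stats[t].lock);
        for (int i = 0; i < NUM_COUNTERS; i++)
            shared_stats[t].counters[i] = 0;
        pthread_mutex_unlock(&shared_stats[t].lock);
    }
}
```

A snapshot function would follow the same per-type-lock loop as `io_stats_reset_all()`, copying instead of zeroing.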
By the way, in the following line:
+ &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];
backend_type_get_idx(x) is actually (x - 1) plus an assertion on the
value range, and the only use-case is here. There's a reverse
function, also used in only one place.
+ Datum backend_type_desc =
+ CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
In this usage GetBackendTypeDesc() gracefully treats out-of-domain
values, but idx_get_backend_type eagerly kills the process for the
same. This is inconsistent.
My humble opinion is that we not define the two functions at all and
replace the calls to them with (x +/- 1). In addition to that, I think
we should not abort() on invalid backend types. In that sense, I
wonder if we could use the B_INVALIDth element for this purpose.
I think that GetBackendTypeDesc() should probably also error out for an
unknown value.
I would be open to not using the helper functions. I thought it would be
less error-prone, but since it is limited to the code in
pgstat_io_ops.c, it is probably okay. Let me think a bit more.
Could you explain more about what you mean about using B_INVALID
BackendType?
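For reference, the off-by-one mapping being debated can be shown in a self-contained sketch. The enum below is a deliberately abbreviated, illustrative version of BackendType (the real list lives in miscadmin.h); the helpers mirror the patch's `(x +/- 1)` behavior with range assertions:

```c
#include <assert.h>

/* Abbreviated, illustrative version of BackendType */
typedef enum BackendType
{
    B_INVALID = 0,
    B_AUTOVAC_LAUNCHER,
    B_AUTOVAC_WORKER,
    B_BACKEND,
    B_CHECKPOINTER,
    B_LAST_TYPE = B_CHECKPOINTER
} BackendType;

/* number of valid types, with B_INVALID excluded */
#define BACKEND_NUM_TYPES ((int) B_LAST_TYPE)

/*
 * Convert a valid BackendType to a slot in an array that reserves no
 * slot for B_INVALID; asserts (aborts) on out-of-range input, which is
 * the behavior under discussion.
 */
static inline int
backend_type_get_idx(BackendType backend_type)
{
    assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
    return (int) backend_type - 1;
}

/* Inverse mapping: array slot back to the BackendType it stands for. */
static inline BackendType
idx_get_backend_type(int idx)
{
    assert(idx >= 0 && idx < BACKEND_NUM_TYPES);
    return (BackendType) (idx + 1);
}
```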
> > > + LWLockAcquire(&stats_shmem->lock, LW_SHARED);
> > > + memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
> > > + LWLockRelease(&stats_shmem->lock);
> >
> > Which then also means that you don't need the reset offset stuff. It's
> > only there because with the changecount approach we can't take a lock to
> > reset the stats (since there is no lock). With a lock you can just reset
> > the shared state.
> >
>
> Yes, I believe I have cleaned up all of this embarrassing mess. I use the
> lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the
> locks in PgStat_IOPathOps for flush.
Looks fine, but I think pgstat_flush_io_ops() needs more comments like
the other pgstat_flush_* functions.
+ for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+ stats_shmem->stats[i].stat_reset_timestamp = ts;
I'm not sure we need a separate reset timestamp for each backend type,
but the SLRU counters do the same thing...
Yes, I think for SLRU stats it is because you can reset individual SLRU
stats. Also there is no wrapper data structure to put it in. I could
keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
operation stats at once, but I am thinking of getting rid of
PgStatShared_BackendIOPathOps since it is not needed if I only keep the
locks in PgStat_IOPathOps and make the global shared value an array of
PgStat_IOPathOps.
> > > +pgstat_report_io_ops(void)
> > > +{
> >
> > This shouldn't be needed - the flush function above can be used.
> >
>
> Fixed.
The commit message of 0002 contains that name:p
Thanks! Fixed.
> > > +const char *
> > > +pgstat_io_path_desc(IOPath io_path)
> > > +{
> > > + const char *io_path_desc = "Unknown IO Path";
> > > +
> >
> > This should be unreachable, right?
> >
>
> Changed it to an error.
+ elog(ERROR, "Attempt to describe an unknown IOPath");
I think we usually spell it as ("unrecognized IOPath value: %d", io_path).
I have changed to this.
> > > From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > Date: Mon, 4 Jul 2022 15:44:17 -0400
> > > Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type
> >
> > > Add pg_stat_buffers, a system view which tracks the number of IO
> > > operations (allocs, writes, fsyncs, and extends) done through each IO
> > > path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
> > > backend.
> >
> > I think I like pg_stat_io a bit better? Nearly everything in here seems
> > to fit better in that.
> >
> > I guess we could split out buffers allocated, but that's actually
> > interesting in the context of the kind of IO too.
> >
>
> changed it to pg_stat_io
A bit different thing, but I felt a little uneasy about some uses of
"pgstat_io_ops". IOOp looks like a neighbouring word of IOPath. On the
other hand, actually iopath is used as an attribute of io_ops in many
places. Couldn't we be more consistent about the relationship between
the names?
IOOp -> PgStat_IOOpType
IOPath -> PgStat_IOPath
PgStat_IOOpCounters -> PgStat_IOCounters
PgStat_IOPathOps -> PgStat_IO
pgstat_count_io_op -> pgstat_count_io
...
(Better wordings are welcome.)
Let me think about naming and make changes in the next version.
> > Would be nice to have something testing that the ringbuffer stats stuff
> > does something sensible - that feels not entirely trivial.
> >
> >
> I've added a test to test that reused strategy buffers are counted as
> allocs. I would like to add a test which checks that if a buffer in the
> ring is pinned and thus not reused, that it is not counted as a strategy
> alloc, but I found it challenging without a way to pause vacuuming, pin
> a buffer, then resume vacuuming.
===
If I'm not missing something, in BufferAlloc, when strategy is not
used and the victim is dirty, iopath is determined based on the
uninitialized from_ring. It seems to me from_ring is equivalent to
strategy_current_was_in_ring. And if StrategyGetBuffer has set
from_ring to false, StrategyRejectBuffer may set it to true, which is
wrong. The logic around there seems to need a rethink.
What can we read from the values separated into Shared and Strategy?
I have changed this local variable to only be used for communicating if
the buffer which was not rejected by StrategyRejectBuffer() was from the
ring or not for the purposes of counting strategy writes. I could add an
accessor for this member (strategy->current_was_in_ring) if that makes
more sense? For strategy allocs, I just use
strategy->current_was_in_ring inside of StrategyGetBuffer() since this
has access to that member of the struct.
Currently, strategy allocs count only reuses of a strategy buffer (not
initial shared buffers which are added to the ring).
strategy writes count only the writing out of dirty buffers which are
already in the ring and are being reused.
Alternatively, we could also count as strategy allocs all those buffers
which are added to the ring and count as strategy writes all those
shared buffers which are dirty when initially added to the ring.
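The counting rules just described, only reuses of a buffer already in the ring count as strategy allocs, and only dirty reused ring buffers count as strategy writes, can be modeled with a toy ring. Pinning, the buffer table, and the shared-path counters are all elided, and every name here is illustrative rather than the actual freelist.c code:

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 4

typedef struct
{
    int      buf[RING_SIZE];     /* -1 = slot not yet populated */
    bool     dirty[RING_SIZE];
    int      next;
    unsigned strategy_allocs;
    unsigned strategy_writes;
} Ring;

static void
ring_init(Ring *r)
{
    for (int i = 0; i < RING_SIZE; i++)
    {
        r->buf[i] = -1;
        r->dirty[i] = false;
    }
    r->next = 0;
    r->strategy_allocs = 0;
    r->strategy_writes = 0;
}

/* get the next victim buffer for the strategy */
static int
ring_get_buffer(Ring *r, int fresh_buf_id)
{
    int slot = r->next;

    r->next = (r->next + 1) % RING_SIZE;
    if (r->buf[slot] != -1)
    {
        /* reusing a buffer already in the ring: a strategy alloc */
        r->strategy_allocs++;
        if (r->dirty[slot])
        {
            /* must be written out before reuse: a strategy write */
            r->strategy_writes++;
            r->dirty[slot] = false;
        }
        return r->buf[slot];
    }

    /*
     * First time around: a shared buffer is added to the ring. Under
     * the current counting rules this is not a strategy alloc.
     */
    r->buf[slot] = fresh_buf_id;
    return fresh_buf_id;
}
```

Under the alternative counting mentioned above, the `buf[slot] == -1` branch would also bump the strategy counters.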
- Melanie
Hi,

On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote:
> > > I also realized that I am not differentiating between IOPATH_SHARED and
> > > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> > > of buffer we are fsync'ing by the time we call register_dirty_segment(),
> > > I'm not sure how we would fix this.
> >
> > I think there scarcely happens flush for strategy-loaded buffers. If
> > that is sensible, IOOP_FSYNC would not make much sense for
> > IOPATH_STRATEGY.
>
> Why would it be less likely for a backend to do its own fsync when
> flushing a dirty strategy buffer than a regular dirty shared buffer?

We really just don't expect a backend to do many segment fsyncs at
all. Otherwise there's something wrong with the forwarding mechanism.

It'd be different if we tracked WAL fsyncs more granularly - which would
be quite interesting - but that's something for another day^Wpatch.

> > > > Wonder if it's worth making the lock specific to the backend type?
> > >
> > > I've added another Lock into PgStat_IOPathOps so that each BackendType
> > > can be locked separately. But, I've also kept the lock in
> > > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> > > done easily.
> >
> > Looks fine about the lock separation.
>
> Actually, I think it is not safe to use both of these locks. So for
> picking one method, it is probably better to go with the locks in
> PgStat_IOPathOps, it will be more efficient for flush (and not for
> fetching and resetting), so that is probably the way to go, right?

I think it's good to just use one kind of lock, and efficiency of
snapshotting / resetting is nearly irrelevant. But I don't see why it's
not safe to use both kinds of locks?

> > Looks fine, but I think pgstat_flush_io_ops() need more comments like
> > other pgstat_flush_* functions.
> > + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> > + stats_shmem->stats[i].stat_reset_timestamp = ts;
> >
> > I'm not sure we need a separate reset timestamp for each backend type
> > but SLRU counter does the same thing..
>
> Yes, I think for SLRU stats it is because you can reset individual SLRU
> stats. Also there is no wrapper data structure to put it in. I could
> keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
> operation stats at once, but I am thinking of getting rid of
> PgStatShared_BackendIOPathOps since it is not needed if I only keep the
> locks in PgStat_IOPathOps and make the global shared value an array of
> PgStat_IOPathOps.

I'm strongly against introducing super granular reset timestamps. I
think that was a mistake for SLRU stats, but we can't fix that as
easily.

> Currently, strategy allocs count only reuses of a strategy buffer (not
> initial shared buffers which are added to the ring).
> strategy writes count only the writing out of dirty buffers which are
> already in the ring and are being reused.

That seems right to me.

> Alternatively, we could also count as strategy allocs all those buffers
> which are added to the ring and count as strategy writes all those
> shared buffers which are dirty when initially added to the ring.

I don't think that'd provide valuable information. The whole reason that
strategy writes are interesting is that they can lead to writing out
data a lot sooner than they would be written out without a strategy
being used.

> Subject: [PATCH v24 2/3] Track IO operation statistics
>
> Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> location or type of IO done by a backend. For example, the checkpointer
> may write a shared buffer out. This would be counted as an IOOp write on
> an IOPath IOPATH_SHARED by BackendType "checkpointer".

I'm still not 100% happy with IOPath - seems a bit too easy to confuse
with the file path. What about 'origin'?
> Each IOOp (alloc, fsync, extend, write) is counted per IOPath
> (direct, local, shared, or strategy) through a call to
> pgstat_count_io_op().

It seems we should track reads too - it's quite interesting to know
whether reads happened because of a strategy, for example. You do
reference reads in a later part of the commit message even :)

> The primary concern of these statistics is IO operations on data blocks
> during the course of normal database operations. IO done by, for
> example, the archiver or syslogger is not counted in these statistics.

We could extend this at a later stage, if we really want to. But I'm not
sure it's interesting or fully possible. E.g. the archiver's writes are
largely not done by the archiver itself, but by a command (or module
these days) it shells out to.

> Note that this commit does not add code to increment IOPATH_DIRECT. A
> future patch adding wrappers for smgrwrite(), smgrextend(), and
> smgrimmedsync() would provide a good location to call
> pgstat_count_io_op() for unbuffered IO and avoid regressions for future
> users of these functions.

Hm. Perhaps we should defer introducing IOPATH_DIRECT for now then?

> Stats on IOOps for all IOPaths for a backend are initially accumulated
> locally.
>
> Later they are flushed to shared memory and accumulated with those from
> all other backends, exited and live.

Perhaps mention here that this later could be extended to make
per-connection stats visible?

> Some BackendTypes will not execute pgstat_report_stat() and thus must
> explicitly call pgstat_flush_io_ops() in order to flush their backend
> local IO operation statistics to shared memory.

Maybe add "flush ... during ongoing operation" or such? Because they'd
all flush at commit, IIRC.
> diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
> index 088556ab54..963b05321e 100644
> --- a/src/backend/bootstrap/bootstrap.c
> +++ b/src/backend/bootstrap/bootstrap.c
> @@ -33,6 +33,7 @@
> #include "miscadmin.h"
> #include "nodes/makefuncs.h"
> #include "pg_getopt.h"
> +#include "pgstat.h"
> #include "storage/bufmgr.h"
> #include "storage/bufpage.h"
> #include "storage/condition_variable.h"

Hm?

> diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
> index e926f8c27c..beb46dcb55 100644
> --- a/src/backend/postmaster/walwriter.c
> +++ b/src/backend/postmaster/walwriter.c
> @@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
> }
>
> if (ShutdownRequestPending)
> - {
> - /*
> - * Force reporting remaining WAL statistics at process exit.
> - *
> - * Since pgstat_report_wal is invoked with 'force' is false in main
> - * loop to avoid overloading the cumulative stats system, there may
> - * exist unreported stats counters for the WAL writer.
> - */
> - pgstat_report_wal(true);
> -
> proc_exit(0);
> - }
>
> /* Perform logging of memory contexts of this process */
> if (LogMemoryContextPending)

Let's do this in a separate commit and get it out of the way...

> @@ -682,16 +694,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
> * if this buffer should be written and re-used.
> */
> bool
> -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
> +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
> {
> - /* We only do this in bulkread mode */
> +
> + /*
> + * We only reject reusing and writing out the strategy buffer this in
> + * bulkread mode.
> + */
> if (strategy->btype != BAS_BULKREAD)
> + {
> + /*
> + * If the buffer was from the ring and we are not rejecting it, consider it
> + * a write of a strategy buffer.
> + */
> + if (strategy->current_was_in_ring)
> + *write_from_ring = true;

Hm.
This is set even if the buffer wasn't dirty? I guess we don't expect
StrategyRejectBuffer() to be called for clean buffers...

> /*
> diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
> index d9275611f0..d3963f59d0 100644
> --- a/src/backend/utils/activity/pgstat_database.c
> +++ b/src/backend/utils/activity/pgstat_database.c
> @@ -47,7 +47,8 @@ pgstat_drop_database(Oid databaseid)
> }
>
> /*
> - * Called from autovacuum.c to report startup of an autovacuum process.
> + * Called from autovacuum.c to report startup of an autovacuum process and
> + * flush IO Operation statistics.
> * We are called before InitPostgres is done, so can't rely on MyDatabaseId;
> * the db OID must be passed in, instead.
> */
> @@ -72,6 +73,11 @@ pgstat_report_autovac(Oid dboid)
> dbentry->stats.last_autovac_time = GetCurrentTimestamp();
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operation statistics
> + */
> + pgstat_flush_io_ops(false);
> }

Hm. I suspect this will always be zero - at this point we haven't
connected to a database, so there really can't have been much, if any,
IO. I think I suggested doing something here, but on a second look it
really doesn't make much sense. Note that that's different from doing
something in pgstat_report_(vacuum|analyze) - clearly we've done
something at that point.

> /*
> - * Report that the table was just vacuumed.
> + * Report that the table was just vacuumed and flush IO Operation statistics.
> */
> void
> pgstat_report_vacuum(Oid tableoid, bool shared,
> @@ -257,10 +257,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
> }
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operations statistics
> + */
> + pgstat_flush_io_ops(false);
> }
>
> /*
> - * Report that the table was just analyzed.
> + * Report that the table was just analyzed and flush IO Operation statistics.
> *
> * Caller must provide new live- and dead-tuples estimates, as well as a
> * flag indicating whether to reset the changes_since_analyze counter.
> @@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
> }
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operations statistics
> + */
> + pgstat_flush_io_ops(false);
> }

Think it'd be good to amend these comments to say that otherwise stats
would only get flushed after a multi-relation autovacuum cycle is done /
a VACUUM/ANALYZE command processed all tables. Perhaps add the comment
to one of the two functions, and just reference it in the other place?

> --- a/src/include/utils/backend_status.h
> +++ b/src/include/utils/backend_status.h
> @@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
> int buflen);
> extern uint64 pgstat_get_my_query_id(void);
>
> +/* Utility functions */
> +
> +/*
> + * When maintaining an array of information about all valid BackendTypes, in
> + * order to avoid wasting the 0th spot, use this helper to convert a valid
> + * BackendType to a valid location in the array (given that no spot is
> + * maintained for B_INVALID BackendType).
> + */
> +static inline int backend_type_get_idx(BackendType backend_type)
> +{
> + /*
> + * backend_type must be one of the valid backend types. If caller is
> + * maintaining backend information in an array that includes B_INVALID,
> + * this function is unnecessary.
> + */
> + Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
> + return backend_type - 1;
> +}

In function definitions (vs declarations) we put the 'static inline int'
on a separate line from the rest of the function signature.

> +/*
> + * When using a value from an array of information about all valid
> + * BackendTypes, add 1 to the index before using it as a BackendType to adjust
> + * for not maintaining a spot for B_INVALID BackendType.
> + */
> +static inline BackendType idx_get_backend_type(int idx)
> +{
> + int backend_type = idx + 1;
> + /*
> + * If the array includes a spot for B_INVALID BackendType this function is
> + * not required.

The comments around this seem a bit over the top, but I also don't mind
them much.

> Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
> writes, fsyncs, and extends) done through each IOPath (e.g. shared
> buffers, local buffers, unbuffered IO) by each type of backend.

Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting
the latter, except that we already have a bunch of views with that
prefix.

> Some of these should always be zero. For example, checkpointer does not
> use a BufferAccessStrategy (currently), so the "strategy" IOPath for
> checkpointer will be 0 for all IOOps.

What do you think about returning NULL for the values that we expect to
never be non-zero? Perhaps with an assert against non-zero values? Seems
like it might be helpful for understanding the view.

> +/*
> +* When adding a new column to the pg_stat_io view, add a new enum
> +* value here above IO_NUM_COLUMNS.
> +*/
> +enum
> +{
> + IO_COLUMN_BACKEND_TYPE,
> + IO_COLUMN_IO_PATH,
> + IO_COLUMN_ALLOCS,
> + IO_COLUMN_EXTENDS,
> + IO_COLUMN_FSYNCS,
> + IO_COLUMN_WRITES,
> + IO_COLUMN_RESET_TIME,
> + IO_NUM_COLUMNS,
> +};

We typedef pretty much every enum so the enum can be referenced without
the 'enum' prefix. I'd do that here, even if we don't need it.

Greetings,

Andres Freund
Hi,

On 2022-07-11 22:22:28 -0400, Melanie Plageman wrote:
> Yes, per an off list suggestion by you, I have changed the tests to use a
> sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
> the missing calls to count IO Operations for IOPATH_LOCAL and
> IOPATH_STRATEGY.
>
> I struggled to come up with a way to test writes for a particular
> type of backend are counted correctly since a dirty buffer could be
> written out by another type of backend before the target BackendType has
> a chance to write it out.

I guess temp file writes would be reliably done by one backend... Don't
have a good idea otherwise.

> I also struggled to come up with a way to test IO operations for
> background workers. I'm not sure of a way to deterministically have a
> background worker do a particular kind of IO in a test scenario.

I think it's perfectly fine to not test that - for it to be broken we'd
have to somehow screw up setting the backend type. Everything else is
the same as other types of backends anyway.

If you *do* want to test it, you probably could use
SET parallel_leader_participation = false;
SET force_parallel_mode = 'regress';
SELECT something_triggering_io;

> I'm not sure how to cause a strategy "extend" for testing.

COPY into a table should work. But might be unattractive due to the size
of the COPY ringbuffer.

> > Would be nice to have something testing that the ringbuffer stats stuff
> > does something sensible - that feels not entirely trivial.
>
> I've added a test to test that reused strategy buffers are counted as
> allocs. I would like to add a test which checks that if a buffer in the
> ring is pinned and thus not reused, that it is not counted as a strategy
> alloc, but I found it challenging without a way to pause vacuuming, pin
> a buffer, then resume vacuuming.

Yea, that's probably too hard to make reliable to be worth it.

Greetings,

Andres Freund
At Tue, 12 Jul 2022 12:19:06 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in
> > + &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];
> >
> > backend_type_get_idx(x) is actually (x - 1) plus assertion on the
> > value range. And the only use-case is here. There's an reverse
> > function and also used only at one place.
> >
> > + Datum backend_type_desc =
> > + CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
> >
> > In this usage GetBackendTypeDesc() gracefully treats out-of-domain
> > values but idx_get_backend_type keenly kills the process for the
> > same. This is inconsistent.
> >
> > My humbel opinion on this is we don't define the two functions and
> > replace the calls to them with (x +/- 1). Addition to that, I think
> > we should not abort() by invalid backend types. In that sense, I
> > wonder if we could use B_INVALIDth element for this purpose.
>
> I think that GetBackendTypeDesc() should probably also error out for an
> unknown value.
>
> I would be open to not using the helper functions. I thought it would be
> less error-prone, but since it is limited to the code in
> pgstat_io_ops.c, it is probably okay. Let me think a bit more.
>
> Could you explain more about what you mean about using B_INVALID
> BackendType?

I imagined using B_INVALID as a kind of "default" partition, which
accepts all unknown backend types. We could just ignore such values,
but then we lose the clue to a malfunction of the stats machinery. I
thought of that backend type as the sentinel for malfunctions; that
way we can emit logs instead.

I feel that the stats machinery should avoid stopping the server where
possible, and I think it is an overreaction to abort for invalid
values that can easily be coped with.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2022-07-13 11:00:07 +0900, Kyotaro Horiguchi wrote:
> I imagined to use B_INVALID as a kind of "default" partition, which
> accepts all unknown backend types.

There shouldn't be any unknown backend types. Something has gone wrong
if we get far without a backend type set.

> We can just ignore that values but then we lose the clue for malfunction of
> stats machinery. I thought that that backend-type as the sentinel for
> malfunctions. Thus we can emit logs instead.
>
> I feel that the stats machinery shouldn't stop the server as possible,
> or I think it is overreaction to abort for invalid values that can be
> easily coped with.

I strongly disagree. That just ends up with hard to find bugs.

Greetings,

Andres Freund
At Tue, 12 Jul 2022 19:18:22 -0700, Andres Freund <andres@anarazel.de> wrote in
> Hi,
>
> On 2022-07-13 11:00:07 +0900, Kyotaro Horiguchi wrote:
> > I imagined to use B_INVALID as a kind of "default" partition, which
> > accepts all unknown backend types.
>
> There shouldn't be any unknown backend types. Something has gone wrong if we
> get far without a backend type set.
>
> > We can just ignore that values but then we lose the clue for malfunction of
> > stats machinery. I thought that that backend-type as the sentinel for
> > malfunctions. Thus we can emit logs instead.
> >
> > I feel that the stats machinery shouldn't stop the server as possible,
> > or I think it is overreaction to abort for invalid values that can be
> > easily coped with.
>
> I strongly disagree. That just ends up with hard to find bugs.

I was not sure about the policy on that since, as Melanie (and I)
mentioned, GetBackendTypeDesc() gracefully treats invalid values.
Since both of you agree on this point, I'm fine with Assert()ing,
assuming that GetBackendTypeDesc() (or other places where backend-type
is handled) is modified to behave the same way.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attached patch set is substantially different enough from previous
versions that I kept it as a new patch set.
Note that local buffer allocations are now correctly tracked.
On Tue, Jul 12, 2022 at 1:01 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote:
> > > I also realized that I am not differentiating between IOPATH_SHARED and
> > > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> > > of buffer we are fsync'ing by the time we call register_dirty_segment(),
> > > I'm not sure how we would fix this.
> >
> > I think there scarcely happens flush for strategy-loaded buffers. If
> > that is sensible, IOOP_FSYNC would not make much sense for
> > IOPATH_STRATEGY.
> >
>
> Why would it be less likely for a backend to do its own fsync when
> flushing a dirty strategy buffer than a regular dirty shared buffer?
We really just don't expect a backend to do many segment fsyncs at
all. Otherwise there's something wrong with the forwarding mechanism.
When a dirty strategy buffer is written out, if the pendingOps sync
queue is full and the backend has to fsync the segment itself instead of
relying on the checkpointer, this will show up in the statistics as an
IOOP_FSYNC for IOPATH_SHARED, not IOPATH_STRATEGY.
IOPATH_STRATEGY + IOOP_FSYNC will always be 0 for all BackendTypes.
Does this seem right?
It'd be different if we tracked WAL fsyncs more granularly - which would be
quite interesting - but that's something for another day^Wpatch.
I do have a question about this.
So, if we were to start tracking WAL IO would it fit within this
paradigm to have a new IOPATH_WAL for WAL or would it add a separate
dimension?
I was thinking that we might want to consider calling this view
pg_stat_io_data because we might want to have a separate view,
pg_stat_io_wal and then, maybe eventually, convert pg_stat_slru to
pg_stat_io_slru (or a subset of what is in pg_stat_slru).
And maybe then later add pg_stat_io_[archiver/other]
Is pg_stat_io_data a good name that gives us flexibility to
introduce views which expose per-backend IO operation stats (maybe that
goes in pg_stat_activity, though [or maybe not because it wouldn't
include exited backends?]) and per query IO operation stats?
I would like to add roughly the same additional columns to all of
these during AIO development (basically the columns from iostat):
- average block size (will usually be 8kB for pg_stat_io_data but won't
necessarily for the others)
- IOPS/BW
- avg read/write wait time
- demand rate/completion rate
- merges
- maybe queue depth
And I would like to be able to see all of these per query, per backend,
per relation, per BackendType, per IOPath, per SLRU type, etc.
Basically, what I'm asking is
1) what can we name the view to enable these future stats to exist with
the least confusing/wordy view names?
2) will the current view layout and column titles work with minimal
changes for future stats extensions like what I mention above?
> > > > Wonder if it's worth making the lock specific to the backend type?
> > > >
> > >
> > > I've added another Lock into PgStat_IOPathOps so that each BackendType
> > > can be locked separately. But, I've also kept the lock in
> > > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> > > done easily.
> >
> > Looks fine about the lock separation.
> >
>
> Actually, I think it is not safe to use both of these locks. So for
> picking one method, it is probably better to go with the locks in
> PgStat_IOPathOps, it will be more efficient for flush (and not for
> fetching and resetting), so that is probably the way to go, right?
I think it's good to just use one kind of lock, and efficiency of snapshotting
/ resetting is nearly irrelevant. But I don't see why it's not safe to use
both kinds of locks?
The way I implemented it was not safe because I didn't use both locks
when resetting the stats.
In this new version of the patch, I've done the following: In shared
memory I've put the lock in PgStatShared_IOPathOps -- the data structure
which contains an array of PgStat_IOOpCounters for all IOOp types for
all IOPaths. Thus, different BackendType + IOPath combinations can be
updated concurrently without contending for the same lock.
To make this work, I made two versions of the PgStat_IOPathOps -- one
that has the lock, PgStatShared_IOPathOps, and one without,
PgStat_IOPathOps, so that I can persist it to the stats file without
writing and reading the LWLock and can have a local and snapshot version
of the data structure without the lock.
This also necessitated two versions of the data structure wrapping
PgStat_IOPathOps, PgStat_BackendIOPathOps, which contains an array with
a PgStat_IOPathOps for each BackendType, and
PgStatShared_BackendIOPathOps, containing an array of
PgStatShared_IOPathOps.
> > Looks fine, but I think pgstat_flush_io_ops() need more comments like
> > other pgstat_flush_* functions.
> >
> > + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> > + stats_shmem->stats[i].stat_reset_timestamp = ts;
> >
> > I'm not sure we need a separate reset timestamp for each backend type
> > but SLRU counter does the same thing..
> >
>
> Yes, I think for SLRU stats it is because you can reset individual SLRU
> stats. Also there is no wrapper data structure to put it in. I could
> keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
> operation stats at once, but I am thinking of getting rid of
> PgStatShared_BackendIOPathOps since it is not needed if I only keep the
> locks in PgStat_IOPathOps and make the global shared value an array of
> PgStat_IOPathOps.
I'm strongly against introducing super granular reset timestamps. I think that
was a mistake for SLRU stats, but we can't fix that as easily.
Since all stats in pg_stat_io must be reset at the same time, I've put
the reset timestamp in the PgStat[Shared]_BackendIOPathOps and
removed it from each PgStat[Shared]_IOPathOps.
> Currently, strategy allocs count only reuses of a strategy buffer (not
> initial shared buffers which are added to the ring).
> strategy writes count only the writing out of dirty buffers which are
> already in the ring and are being reused.
That seems right to me.
> Alternatively, we could also count as strategy allocs all those buffers
> which are added to the ring and count as strategy writes all those
> shared buffers which are dirty when initially added to the ring.
I don't think that'd provide valuable information. The whole reason that
strategy writes are interesting is that they can lead to writing out data a
lot sooner than they would be written out without a strategy being used.
Then I agree that strategy writes should only count strategy buffers
that are written out in order to reuse the buffer (which is in lieu of
getting a new, potentially clean, shared buffer). This patch implements
that behavior.
However, for strategy allocs, it seems like we would want to count all
demand for buffers as part of a BufferAccessStrategy. So, that would
include allocating buffers to initially fill the ring, allocations of
new shared buffers after the ring was already full that are added to the
ring because all existing buffers in the ring are pinned, and buffers
already in the ring which are being reused.
This version of the patch only counts the third scenario as a strategy
allocation, but I think it would make more sense to count all three as
strategy allocs.
The downside of this behavior is that strategy allocs count different
scenarios than strategy writes, reads, and extends. But, I think that
this is okay.
I'll clarify it in the docs once there is a decision.
Also, note that, as stated above, there will never be any strategy
fsyncs (that is, IOPATH_STRATEGY + IOOP_FSYNC will always be 0): the
code path starting with register_dirty_segment(), which ends with a
regular backend doing its own fsync when pendingOps is full, does not
know what the current IOPath is, and checkpointer does not use a
BufferAccessStrategy.
> Subject: [PATCH v24 2/3] Track IO operation statistics
>
> Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> location or type of IO done by a backend. For example, the checkpointer
> may write a shared buffer out. This would be counted as an IOOp write on
> an IOPath IOPATH_SHARED by BackendType "checkpointer".
I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
the file path. What about 'origin'?
Enough has changed in this version of the patch that I decided to defer
renaming until some of the other issues are resolved.
> Each IOOp (alloc, fsync, extend, write) is counted per IOPath
> (direct, local, shared, or strategy) through a call to
> pgstat_count_io_op().
It seems we should track reads too - it's quite interesting to know whether
reads happened because of a strategy, for example. You do reference reads in a
later part of the commit message even :)
I've added reads to what is counted.
> The primary concern of these statistics is IO operations on data blocks
> during the course of normal database operations. IO done by, for
> example, the archiver or syslogger is not counted in these statistics.
We could extend this at a later stage, if we really want to. But I'm not sure
it's interesting or fully possible. E.g. the archiver's write are largely not
done by the archiver itself, but by a command (or module these days) it shells
out to.
I've added note of this to some of the comments and the commit message.
I also omit rows for these BackendTypes from the view. See my later
comment in this email for more detail on that.
> Note that this commit does not add code to increment IOPATH_DIRECT. A
> future patch adding wrappers for smgrwrite(), smgrextend(), and
> smgrimmedsync() would provide a good location to call
> pgstat_count_io_op() for unbuffered IO and avoid regressions for future
> users of these functions.
Hm. Perhaps we should defer introducing IOPATH_DIRECT for now then?
It's gone.
> Stats on IOOps for all IOPaths for a backend are initially accumulated
> locally.
>
> Later they are flushed to shared memory and accumulated with those from
> all other backends, exited and live.
Perhaps mention here that this later could be extended to make per-connection
stats visible?
Mentioned.
> Some BackendTypes will not execute pgstat_report_stat() and thus must
> explicitly call pgstat_flush_io_ops() in order to flush their backend
> local IO operation statistics to shared memory.
Maybe add "flush ... during ongoing operation" or such? Because they'd all
flush at commit, IIRC.
Added.
> diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
> index 088556ab54..963b05321e 100644
> --- a/src/backend/bootstrap/bootstrap.c
> +++ b/src/backend/bootstrap/bootstrap.c
> @@ -33,6 +33,7 @@
> #include "miscadmin.h"
> #include "nodes/makefuncs.h"
> #include "pg_getopt.h"
> +#include "pgstat.h"
> #include "storage/bufmgr.h"
> #include "storage/bufpage.h"
> #include "storage/condition_variable.h"
Hm?
Removed
> diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
> index e926f8c27c..beb46dcb55 100644
> --- a/src/backend/postmaster/walwriter.c
> +++ b/src/backend/postmaster/walwriter.c
> @@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
> }
>
> if (ShutdownRequestPending)
> - {
> - /*
> - * Force reporting remaining WAL statistics at process exit.
> - *
> - * Since pgstat_report_wal is invoked with 'force' is false in main
> - * loop to avoid overloading the cumulative stats system, there may
> - * exist unreported stats counters for the WAL writer.
> - */
> - pgstat_report_wal(true);
> -
> proc_exit(0);
> - }
>
> /* Perform logging of memory contexts of this process */
> if (LogMemoryContextPending)
Let's do this in a separate commit and get it out of the way...
I've put it in a separate commit.
> @@ -682,16 +694,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
> * if this buffer should be written and re-used.
> */
> bool
> -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
> +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
> {
> - /* We only do this in bulkread mode */
> +
> + /*
> + * We only reject reusing and writing out the strategy buffer this in
> + * bulkread mode.
> + */
> if (strategy->btype != BAS_BULKREAD)
> + {
> + /*
> + * If the buffer was from the ring and we are not rejecting it, consider it
> + * a write of a strategy buffer.
> + */
> + if (strategy->current_was_in_ring)
> + *write_from_ring = true;
Hm. This is set even if the buffer wasn't dirty? I guess we don't expect
StrategyRejectBuffer() to be called for clean buffers...
Yes, we do not expect it to be called for clean buffers.
I've added a comment about this assumption.
> /*
> diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
> index d9275611f0..d3963f59d0 100644
> --- a/src/backend/utils/activity/pgstat_database.c
> +++ b/src/backend/utils/activity/pgstat_database.c
> @@ -47,7 +47,8 @@ pgstat_drop_database(Oid databaseid)
> }
>
> /*
> - * Called from autovacuum.c to report startup of an autovacuum process.
> + * Called from autovacuum.c to report startup of an autovacuum process and
> + * flush IO Operation statistics.
> * We are called before InitPostgres is done, so can't rely on MyDatabaseId;
> * the db OID must be passed in, instead.
> */
> @@ -72,6 +73,11 @@ pgstat_report_autovac(Oid dboid)
> dbentry->stats.last_autovac_time = GetCurrentTimestamp();
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operation statistics
> + */
> + pgstat_flush_io_ops(false);
> }
Hm. I suspect this will always be zero - at this point we haven't connected to
a database, so there really can't have been much, if any, IO. I think I
suggested doing something here, but on a second look it really doesn't make
much sense.
Note that that's different from doing something in
pgstat_report_(vacuum|analyze) - clearly we've done something at that point.
I've removed this.
> /*
> - * Report that the table was just vacuumed.
> + * Report that the table was just vacuumed and flush IO Operation statistics.
> */
> void
> pgstat_report_vacuum(Oid tableoid, bool shared,
> @@ -257,10 +257,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
> }
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operations statistics
> + */
> + pgstat_flush_io_ops(false);
> }
>
> /*
> - * Report that the table was just analyzed.
> + * Report that the table was just analyzed and flush IO Operation statistics.
> *
> * Caller must provide new live- and dead-tuples estimates, as well as a
> * flag indicating whether to reset the changes_since_analyze counter.
> @@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
> }
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operations statistics
> + */
> + pgstat_flush_io_ops(false);
> }
Think it'd be good to amend these comments to say that otherwise stats would
only get flushed after a multi-relatio autovacuum cycle is done / a
VACUUM/ANALYZE command processed all tables. Perhaps add the comment to one
of the two functions, and just reference it in the other place?
Done
> --- a/src/include/utils/backend_status.h
> +++ b/src/include/utils/backend_status.h
> @@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
> int buflen);
> extern uint64 pgstat_get_my_query_id(void);
>
> +/* Utility functions */
> +
> +/*
> + * When maintaining an array of information about all valid BackendTypes, in
> + * order to avoid wasting the 0th spot, use this helper to convert a valid
> + * BackendType to a valid location in the array (given that no spot is
> + * maintained for B_INVALID BackendType).
> + */
> +static inline int backend_type_get_idx(BackendType backend_type)
> +{
> + /*
> + * backend_type must be one of the valid backend types. If caller is
> + * maintaining backend information in an array that includes B_INVALID,
> + * this function is unnecessary.
> + */
> + Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
> + return backend_type - 1;
> +}
In function definitions (vs declarations) we put the 'static inline int' in a
separate line from the rest of the function signature.
Fixed.
> +/*
> + * When using a value from an array of information about all valid
> + * BackendTypes, add 1 to the index before using it as a BackendType to adjust
> + * for not maintaining a spot for B_INVALID BackendType.
> + */
> +static inline BackendType idx_get_backend_type(int idx)
> +{
> + int backend_type = idx + 1;
> + /*
> + * If the array includes a spot for B_INVALID BackendType this function is
> + * not required.
The comments around this seem a bit over the top, but I also don't mind them
much.
Feel free to change them to something shorter. I couldn't think of something I liked.
> Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
> writes, fsyncs, and extends) done through each IOPath (e.g. shared
> buffers, local buffers, unbuffered IO) by each type of backend.
Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
latter, except that we already have a bunch of views with that prefix.
I have thoughts on this but thought it best deferred until after the _data decision.
> Some of these should always be zero. For example, checkpointer does not
> use a BufferAccessStrategy (currently), so the "strategy" IOPath for
> checkpointer will be 0 for all IOOps.
What do you think about returning NULL for the values that we except to never
be non-zero? Perhaps with an assert against non-zero values? Seems like it
might be helpful for understanding the view.
Yes, I like this idea.
Beyond just setting individual cells to NULL, if an entire row would be
NULL, I have now dropped it from the view.
So far, I have omitted from the view all rows for BackendTypes
B_ARCHIVER, B_LOGGER, and B_STARTUP.
Should I also omit rows for B_WAL_RECEIVER and B_WAL_WRITER for now?
I have also omitted rows for IOPATH_STRATEGY for all BackendTypes
*except* B_AUTOVAC_WORKER, B_BACKEND, B_STANDALONE_BACKEND, and
B_BG_WORKER.
Do these seem correct?
I think there are some BackendTypes which will never do IO Operations on
IOPATH_LOCAL but I am not sure which. Do you know which?
As for individual cells which should be NULL, so far what I have is:
- IOPATH_LOCAL + IOOP_FSYNC
I am sure there are others as well. Can you think of any?
> +/*
> +* When adding a new column to the pg_stat_io view, add a new enum
> +* value here above IO_NUM_COLUMNS.
> +*/
> +enum
> +{
> + IO_COLUMN_BACKEND_TYPE,
> + IO_COLUMN_IO_PATH,
> + IO_COLUMN_ALLOCS,
> + IO_COLUMN_EXTENDS,
> + IO_COLUMN_FSYNCS,
> + IO_COLUMN_WRITES,
> + IO_COLUMN_RESET_TIME,
> + IO_NUM_COLUMNS,
> +};
We typedef pretty much every enum so the enum can be referenced without the
'enum' prefix. I'd do that here, even if we don't need it.
So, I left it anonymous because I didn't want it being used as a type
or referenced anywhere else.
I am interested to hear more about your SQL enums idea from upthread.
- Melanie
Attachment
In addition to adding several new tests, the attached version 26 fixes a
major bug in constructing the view.
The only valid combination of IOPATH/IOOP that is not tested now is
IOPATH_STRATEGY + IOOP_WRITE. In most cases when I ran this in regress,
the checkpointer wrote out the dirty strategy buffer before VACUUM got
around to reusing and writing it out in my tests.
I've also changed the BACKEND_NUM_TYPES definition. Now arrays will have
that dead spot for B_INVALID, but I feel like it is much easier to
understand without trying to skip that spot and use those special helper
functions.
I also started skipping adding rows to the view for WAL_RECEIVER and
WAL_WRITER and for BackendTypes except B_BACKEND and WAL_SENDER for
IOPATH_LOCAL.
On Tue, Jul 12, 2022 at 1:18 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-11 22:22:28 -0400, Melanie Plageman wrote:
> Yes, per an off list suggestion by you, I have changed the tests to use a
> sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
> the missing calls to count IO Operations for IOPATH_LOCAL and
> IOPATH_STRATEGY.
>
> I struggled to come up with a way to test writes for a particular
> type of backend are counted correctly since a dirty buffer could be
> written out by another type of backend before the target BackendType has
> a chance to write it out.
I guess temp file writes would be reliably done by one backend... Don't have a
good idea otherwise.
This was mainly an issue for IOPATH_STRATEGY writes as I mentioned. I
still have not solved this.
> I'm not sure how to cause a strategy "extend" for testing.
COPY into a table should work. But might be unattractive due to the size of of
the COPY ringbuffer.
Did it with a CTAS as Horiguchi-san suggested.
> > Would be nice to have something testing that the ringbuffer stats stuff
> > does something sensible - that feels not entirely trivial.
> >
> >
> I've added a test to test that reused strategy buffers are counted as
> allocs. I would like to add a test which checks that if a buffer in the
> ring is pinned and thus not reused, that it is not counted as a strategy
> alloc, but I found it challenging without a way to pause vacuuming, pin
> a buffer, then resume vacuuming.
Yea, that's probably too hard to make reliable to be worth it.
Yes, I have skipped this.
- Melanie
Attachment
I am consolidating the various naming points from this thread into one
email:
From Horiguchi-san:
> A bit different thing, but I felt a little uneasy about some uses of
> "pgstat_io_ops". IOOp looks like a neighbouring word of IOPath. On the
> other hand, actually iopath is used as an attribute of io_ops in many
> places. Couldn't we be more consistent about the relationship between
> the names?
>
> IOOp -> PgStat_IOOpType
> IOPath -> PgStat_IOPath
> PgStat_IOOpCOonters -> PgStat_IOCounters
> PgStat_IOPathOps -> PgStat_IO
> pgstat_count_io_op -> pgstat_count_io
So, because the data structures contain arrays of each other, the
naming was meant to spell out all the information contained in each
data structure:
PgStat_IOOpCounters are all IOOp (I could see removing the word
"counters" from the name for more consistency)
PgStat_IOPathOps are all IOOp for all IOPath
PgStat_BackendIOPathOps are all IOOp for all IOPath for all BackendType
The downside of this naming is that, when choosing a local variable name
for all of the IOOp for all IOPath for a single BackendType,
"backend_io_path_ops" seems accurate but is actually confusing if the
type name for all IOOp for all IOPath for all BackendType is
PgStat_BackendIOPathOps.
I would be open to changing PgStat_BackendIOPathOps to PgStat_IO, but I
don't see how I could omit Path or Op from PgStat_IOPathOps without
making its meaning unclear.
I'm not sure about the idea of prefixing the IOOp and IOPath enums with
Pg_Stat. I could see them being used outside of statistics (though they
are defined in pgstat.h) and could see myself using them in, for
example, calculations for the prefetcher.
From Andres:
Quoting me (Melanie):
> > Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> > location or type of IO done by a backend. For example, the checkpointer
> > may write a shared buffer out. This would be counted as an IOOp write on
> > an IOPath IOPATH_SHARED by BackendType "checkpointer".
> I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
> the file path. What about 'origin'?
I can see the point about IOPATH.
I'm not wild about origin mostly because of the number of O's given that
IO Operation already has two O's. It gets kind of hard to read when
using Pascal Case: IOOrigin and IOOp.
Also, it doesn't totally make sense for alloc. I could be convinced,
though.
IOSOURCE doesn't have the O problem but does still not make sense for
alloc. I also thought of IOSITE and IOVENUE.
> Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
> latter, except that we already have a bunch of views with that prefix.
As far as pg_stat_io vs pg_statio, they are the only stats views which
don't have an underscore between stat and the rest of the view name, so
perhaps we should move away from statio to stat_io going forward anyway.
I am imagining adding to them with other iostat type metrics once direct
IO is introduced, so they may well be changing soon anyway.
- Melanie
Hi, On 2022-07-15 11:59:41 -0400, Melanie Plageman wrote: > I'm not sure about the idea of prefixing the IOOp and IOPath enums with > Pg_Stat. I could see them being used outside of statistics (though they > are defined in pgstat.h) +1 > From Andres: > > Quoting me (Melanie): > > > Introduce "IOOp", an IO operation done by a backend, and "IOPath", the > > > location or type of IO done by a backend. For example, the checkpointer > > > may write a shared buffer out. This would be counted as an IOOp write on > > > an IOPath IOPATH_SHARED by BackendType "checkpointer". > > > I'm still not 100% happy with IOPath - seems a bit too easy to confuse > with > > the file path. What about 'origin'? > > I can see the point about IOPATH. > I'm not wild about origin mostly because of the number of O's given that > IO Operation already has two O's. It gets kind of hard to read when > using Pascal Case: IOOrigin and IOOp. > Also, it doesn't totally make sense for alloc. I could be convinced, > though. > > IOSOURCE doesn't have the O problem but does still not make sense for > alloc. I also thought of IOSITE and IOVENUE. I like "source" - not too bothered by the alloc aspect. I can also see "context" working. > > Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting > the > > latter, except that we already have a bunch of views with that prefix. > > As far as pg_stat_io vs pg_statio, they are the only stats views which > don't have an underscore between stat and the rest of the view name, so > perhaps we should move away from statio to stat_io going forward anyway. > I am imagining adding to them with other iostat type metrics once direct > IO is introduced, so they may well be changing soon anyway. I don't think I have strong opinions on this one. I can see arguments for either naming. Greetings, Andres Freund
Hi,

On 2022-07-14 18:44:48 -0400, Melanie Plageman wrote:
> Subject: [PATCH v26 1/4] Add BackendType for standalone backends

> Subject: [PATCH v26 2/4] Remove unneeded call to pgstat_report_wal()

LGTM.

> Subject: [PATCH v26 3/4] Track IO operation statistics

> @@ -978,8 +979,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>
> bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
>
> + if (isLocalBuf)
> + io_path = IOPATH_LOCAL;
> + else if (strategy != NULL)
> + io_path = IOPATH_STRATEGY;
> + else
> + io_path = IOPATH_SHARED;

Seems a bit ugly to have an if (isLocalBuf) just after an isLocalBuf ?.

> + /*
> + * When a strategy is in use, reused buffers from the strategy ring will
> + * be counted as allocations for the purposes of IO Operation statistics
> + * tracking.
> + *
> + * However, even when a strategy is in use, if a new buffer must be
> + * allocated from shared buffers and added to the ring, this is counted
> + * as a IOPATH_SHARED allocation.
> + */

There's a bit too much duplication between the paragraphs...

> @@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
> /* flush database / relation / function / ... stats */
> partial_flush |= pgstat_flush_pending_entries(nowait);
>
> + /* flush IO Operations stats */
> + partial_flush |= pgstat_flush_io_ops(nowait);

Could you either add a note to the commit message that the stats file
version needs to be increased, or just include that in the patch.

> @@ -1427,8 +1445,10 @@ pgstat_read_statsfile(void)
> FILE *fpin;
> int32 format_id;
> bool found;
> + PgStat_BackendIOPathOps io_stats;
> const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
> PgStat_ShmemControl *shmem = pgStatLocal.shmem;
> + PgStatShared_BackendIOPathOps *io_stats_shmem = &shmem->io_ops;
>
> /* shouldn't be called from postmaster */
> Assert(IsUnderPostmaster || !IsPostmasterEnvironment);

> @@ -1486,6 +1506,22 @@ pgstat_read_statsfile(void)
> if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
> goto error;
>
> + /*
> + * Read IO Operations stats struct
> + */
> + if (!read_chunk_s(fpin, &io_stats))
> + goto error;
> +
> + io_stats_shmem->stat_reset_timestamp = io_stats.stat_reset_timestamp;
> +
> + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> + {
> + PgStat_IOPathOps *stats = &io_stats.stats[i];
> + PgStatShared_IOPathOps *stats_shmem = &io_stats_shmem->stats[i];
> +
> + memcpy(stats_shmem->data, stats->data, sizeof(stats->data));
> + }

Why can't the data be read directly into shared memory?

> /*
> +void
> +pgstat_io_ops_snapshot_cb(void)
> +{
> + PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
> + PgStat_BackendIOPathOps *all_backend_stats_snap = &pgStatLocal.snapshot.io_ops;
> +
> + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> + {
> + PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
> + PgStat_IOPathOps *stats_snap = &all_backend_stats_snap->stats[i];
> +
> + LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);

Why acquire the same lock repeatedly for each type, rather than once for
the whole?

> + /*
> + * Use the lock in the first BackendType's PgStat_IOPathOps to protect the
> + * reset timestamp as well.
> + */
> + if (i == 0)
> + all_backend_stats_snap->stat_reset_timestamp = all_backend_stats_shmem->stat_reset_timestamp;

Which also would make this look a bit less awkward.

Starting to look pretty good...

- Andres
On Wed, Jul 20, 2022 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-14 18:44:48 -0400, Melanie Plageman wrote:
> @@ -1427,8 +1445,10 @@ pgstat_read_statsfile(void)
> FILE *fpin;
> int32 format_id;
> bool found;
> + PgStat_BackendIOPathOps io_stats;
> const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
> PgStat_ShmemControl *shmem = pgStatLocal.shmem;
> + PgStatShared_BackendIOPathOps *io_stats_shmem = &shmem->io_ops;
>
> /* shouldn't be called from postmaster */
> Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
> @@ -1486,6 +1506,22 @@ pgstat_read_statsfile(void)
> if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
> goto error;
>
> + /*
> + * Read IO Operations stats struct
> + */
> + if (!read_chunk_s(fpin, &io_stats))
> + goto error;
> +
> + io_stats_shmem->stat_reset_timestamp = io_stats.stat_reset_timestamp;
> +
> + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> + {
> + PgStat_IOPathOps *stats = &io_stats.stats[i];
> + PgStatShared_IOPathOps *stats_shmem = &io_stats_shmem->stats[i];
> +
> + memcpy(stats_shmem->data, stats->data, sizeof(stats->data));
> + }
Why can't the data be read directly into shared memory?
Because I don't want a lock in the backend local stats, I have two data
structures PgStatShared_IOPathOps and PgStat_IOPathOps. I thought it was
odd to write out the lock to the file, so when persisting the stats, I
write out the relevant data only and when reading it back in to shared
memory, I read in the data member of PgStatShared_IOPathOps.
> +void
> +pgstat_io_ops_snapshot_cb(void)
> +{
> + PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
> + PgStat_BackendIOPathOps *all_backend_stats_snap = &pgStatLocal.snapshot.io_ops;
> +
> + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> + {
> + PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
> + PgStat_IOPathOps *stats_snap = &all_backend_stats_snap->stats[i];
> +
> + LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
Why acquire the same lock repeatedly for each type, rather than once for
the whole?
This is also because of having a LWLock in each PgStatShared_IOPathOps.
It is not the same lock. Each PgStatShared_IOPathOps has a lock so that
they can be accessed individually (per BackendType in
PgStatShared_BackendIOPathOps). It is optimized for the more common
operation of flushing at the expense of the snapshot operation (which
should be less common) and reset operation.
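For orientation, the split described here can be sketched as two parallel structs, one with a lock and one without, where only the data member is ever persisted. This is a simplified stand-alone sketch (a plain int stands in for LWLock, the counters are collapsed into one array, and the array sizes are illustrative), not the patch's actual definitions:

```c
#include <string.h>

#define BACKEND_NUM_TYPES 13            /* illustrative counts only */
#define IOPATH_NUM_TYPES   3
#define IOOP_NUM_TYPES     5

/* Local/persisted form: counters only, no lock. */
typedef struct PgStat_IOPathOps
{
    unsigned long long data[IOPATH_NUM_TYPES][IOOP_NUM_TYPES];
} PgStat_IOPathOps;

/* Shared-memory form: the same counters plus a per-BackendType lock. */
typedef struct PgStatShared_IOPathOps
{
    int         lock;                   /* stand-in for an LWLock */
    unsigned long long data[IOPATH_NUM_TYPES][IOOP_NUM_TYPES];
} PgStatShared_IOPathOps;

/*
 * When restoring the stats file, only the data member is copied into the
 * shared struct, so the lock never round-trips through the file.
 */
static void
copy_io_path_ops(PgStatShared_IOPathOps *dst, const PgStat_IOPathOps *src)
{
    memcpy(dst->data, src->data, sizeof(dst->data));
}
```

A flush then only needs to take the one lock belonging to the flushing backend's BackendType, which is the trade-off Melanie describes.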
I've attached v27 of the patch.
I've renamed IOPATH to IOCONTEXT. I also have added assertions to
confirm that unexpected statistics are not being accumulated.
There are also assorted other cleanups and changes.
It would be good to confirm that the rows being skipped and cells that
are NULL in the view are the correct ones.
The startup process will never use a BufferAccessStrategy, right?
On Wed, Jul 20, 2022 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
> Subject: [PATCH v26 3/4] Track IO operation statistics
> @@ -978,8 +979,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>
> bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
>
> + if (isLocalBuf)
> + io_path = IOPATH_LOCAL;
> + else if (strategy != NULL)
> + io_path = IOPATH_STRATEGY;
> + else
> + io_path = IOPATH_SHARED;
Seems a bit ugly to have an if (isLocalBuf) just after an isLocalBuf ?.
Changed this.
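For reference, the branch under discussion reduces to a three-way choice like the following (a toy stand-alone version; the enum mirrors the patch's IOPATH values, everything else is simplified):

```c
#include <stddef.h>

typedef enum IOPath
{
    IOPATH_LOCAL,                       /* temp-table IO through local buffers */
    IOPATH_STRATEGY,                    /* IO through a buffer access strategy ring */
    IOPATH_SHARED                       /* ordinary shared-buffer IO */
} IOPath;

/* Local buffers win first; otherwise a ring strategy; otherwise shared. */
static IOPath
select_io_path(int is_local_buf, const void *strategy)
{
    if (is_local_buf)
        return IOPATH_LOCAL;
    return strategy != NULL ? IOPATH_STRATEGY : IOPATH_SHARED;
}
```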
> + /*
> + * When a strategy is in use, reused buffers from the strategy ring will
> + * be counted as allocations for the purposes of IO Operation statistics
> + * tracking.
> + *
> + * However, even when a strategy is in use, if a new buffer must be
> + * allocated from shared buffers and added to the ring, this is counted
> + * as a IOPATH_SHARED allocation.
> + */
There's a bit too much duplication between the paragraphs...
I actually think the two paragraphs are making separate points. I've
edited this, so see if you like it better now.
> @@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
> /* flush database / relation / function / ... stats */
> partial_flush |= pgstat_flush_pending_entries(nowait);
>
> + /* flush IO Operations stats */
> + partial_flush |= pgstat_flush_io_ops(nowait);
Could you either add a note to the commit message that the stats file
version needs to be increased, or just include that in the patch.
Bumped the stats file version in attached patchset.
- Melanie
v28 attached.
I've added the new structs I added to typedefs.list.
I've split the commit which adds all of the logic to track
IO operation statistics into two commits -- one which includes all of
the code to count IOOps for IOContexts locally in a backend and a second
which includes all of the code to accumulate and manage these with the
cumulative stats system.
A few notes about the commit which adds local IO Operation stats:
- There is a comment above pgstat_io_op_stats_collected() which mentions
the cumulative stats system even though this commit doesn't engage the
cumulative stats system. I wasn't sure if it was more or less
confusing to have two different versions of this comment.
- should pgstat_count_io_op() take BackendType as a parameter instead of
using MyBackendType internally?
- pgstat_count_io_op() Assert()s that the passed-in IOOp and IOContext
are valid for this BackendType, but it doesn't check that all of the
pending stats which should be zero are zero. I thought this was okay
because if I did add that zero-check, it would be added to
pgstat_count_ioop() as well, and we already Assert() there that we can
count the op. Thus, it doesn't seem like checking that the stats are
zero would add any additional regression protection.
- I've kept pgstat_io_context_desc() and pgstat_io_op_desc() in the
commit which adds those types (the local stats commit), however they
are not used in that commit. I wasn't sure if I should keep them in
that commit or move them to the first commit using them (the commit
adding the new view).
Notes on the commit which accumulates IO Operation stats in shared
memory:
- I've extended the usage of the Assert()s that IO Operation stats that
should be zero are. Previously we only checked the stats validity when
querying the view. Now we check it when flushing pending stats and
when reading the stats file into shared memory.
Note that the three locations with these validity checks (when
flushing pending stats, when reading stats file into shared memory,
and when querying the view) have similar looking code to loop through
and validate the stats. However, the actual action they perform if the
stats are valid is different for each site (adding counters together,
doing a read, setting nulls in a tuple column to true). Also, some of
these instances have other code interspersed in the loops which would
require additional looping if separated from this logic. So it was
difficult to see a way of combining these into a single helper
function.
- I've left pgstat_fetch_backend_io_context_ops() in the shared stats
commit, however it is not used until the commit which adds the view in
pg_stat_get_io(). I wasn't sure which way seemed better.
- Melanie
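To make the validity layering described above concrete, here is a stand-alone toy version encoding just two of the rules mentioned in the thread (bgwriter and checkpointer never read buffers in; local buffers back temp tables and are never fsync'd). The names echo the patch, but the rule set is deliberately incomplete and illustrative:

```c
typedef enum BackendType { B_BACKEND, B_BG_WRITER, B_CHECKPOINTER } BackendType;
typedef enum IOContext { IOCONTEXT_LOCAL, IOCONTEXT_STRATEGY, IOCONTEXT_SHARED } IOContext;
typedef enum IOOp { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_READ, IOOP_WRITE } IOOp;

/* bgwriter and checkpointer only ever write buffers out, never read them in */
static int
bktype_io_op_valid(BackendType bktype, IOOp io_op)
{
    if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op == IOOP_READ)
        return 0;
    return 1;
}

/* local buffers hold temp-table data that is not WAL-logged, so no fsync */
static int
io_context_io_op_valid(IOContext io_context, IOOp io_op)
{
    if (io_context == IOCONTEXT_LOCAL && io_op == IOOP_FSYNC)
        return 0;
    return 1;
}

/* combined check used by all three validation sites */
static int
expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
{
    return bktype_io_op_valid(bktype, io_op) &&
           io_context_io_op_valid(io_context, io_op);
}
```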
Hi,

On 2022-08-22 13:15:18 -0400, Melanie Plageman wrote:
> v28 attached.

Pushed 0001, 0002. Thanks!

- Andres
Hi, On 2022-08-22 13:15:18 -0400, Melanie Plageman wrote: > v28 attached. > > I've added the new structs I added to typedefs.list. > > I've split the commit which adds all of the logic to track > IO operation statistics into two commits -- one which includes all of > the code to count IOOps for IOContexts locally in a backend and a second > which includes all of the code to accumulate and manage these with the > cumulative stats system. Thanks! > A few notes about the commit which adds local IO Operation stats: > > - There is a comment above pgstat_io_op_stats_collected() which mentions > the cumulative stats system even though this commit doesn't engage the > cumulative stats system. I wasn't sure if it was more or less > confusing to have two different versions of this comment. Not worth being worried about... > - should pgstat_count_io_op() take BackendType as a parameter instead of > using MyBackendType internally? I don't forsee a case where a different value would be passed in. > - pgstat_count_io_op() Assert()s that the passed-in IOOp and IOContext > are valid for this BackendType, but it doesn't check that all of the > pending stats which should be zero are zero. I thought this was okay > because if I did add that zero-check, it would be added to > pgstat_count_ioop() as well, and we already Assert() there that we can > count the op. Thus, it doesn't seem like checking that the stats are > zero would add any additional regression protection. It's probably ok. > - I've kept pgstat_io_context_desc() and pgstat_io_op_desc() in the > commit which adds those types (the local stats commit), however they > are not used in that commit. I wasn't sure if I should keep them in > that commit or move them to the first commit using them (the commit > adding the new view). > - I've left pgstat_fetch_backend_io_context_ops() in the shared stats > commit, however it is not used until the commit which adds the view in > pg_stat_get_io(). 
I wasn't sure which way seemed better. Think that's fine. > Notes on the commit which accumulates IO Operation stats in shared > memory: > > - I've extended the usage of the Assert()s that IO Operation stats that > should be zero are. Previously we only checked the stats validity when > querying the view. Now we check it when flushing pending stats and > when reading the stats file into shared memory. > Note that the three locations with these validity checks (when > flushing pending stats, when reading stats file into shared memory, > and when querying the view) have similar looking code to loop through > and validate the stats. However, the actual action they perform if the > stats are valid is different for each site (adding counters together, > doing a read, setting nulls in a tuple column to true). Also, some of > these instances have other code interspersed in the loops which would > require additional looping if separated from this logic. So it was > difficult to see a way of combining these into a single helper > function. All of them seem to repeat something like > + if (!pgstat_bktype_io_op_valid(bktype, io_op) || > + !pgstat_io_context_io_op_valid(io_context, io_op)) perhaps those could be combined? Afaics nothing uses pgstat_bktype_io_op_valid separately. > Subject: [PATCH v28 3/5] Track IO operation statistics locally > > Introduce "IOOp", an IO operation done by a backend, and "IOContext", > the IO location source or target or IO type done by a backend. For > example, the checkpointer may write a shared buffer out. This would be > counted as an IOOp "write" on an IOContext IOCONTEXT_SHARED by > BackendType "checkpointer". > > Each IOOp (alloc, extend, fsync, read, write) is counted per IOContext > (local, shared, or strategy) through a call to pgstat_count_io_op(). > > The primary concern of these statistics is IO operations on data blocks > during the course of normal database operations. 
IO done by, for > example, the archiver or syslogger is not counted in these statistics. s/is/are/? > Stats on IOOps for all IOContexts for a backend are counted in a > backend's local memory. This commit does not expose any functions for > aggregating or viewing these stats. s/This commit does not/A subsequent commit will expose/... > @@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > BufferDesc *bufHdr; > Block bufBlock; > bool found; > + IOContext io_context; > bool isExtend; > bool isLocalBuf = SmgrIsTemp(smgr); > > @@ -986,10 +987,25 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > */ > Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */ > > - bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr); > + if (isLocalBuf) > + { > + bufBlock = LocalBufHdrGetBlock(bufHdr); > + io_context = IOCONTEXT_LOCAL; > + } > + else > + { > + bufBlock = BufHdrGetBlock(bufHdr); > + > + if (strategy != NULL) > + io_context = IOCONTEXT_STRATEGY; > + else > + io_context = IOCONTEXT_SHARED; > + } There's a isLocalBuf block earlier on, couldn't we just determine the context there? I guess there's a branch here already, so it's probably fine as is. > if (isExtend) > { > + > + pgstat_count_io_op(IOOP_EXTEND, io_context); Spurious newline. > @@ -2820,9 +2857,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum, > * > * If the caller has an smgr reference for the buffer's relation, pass it > * as the second parameter. If not, pass NULL. > + * > + * IOContext will always be IOCONTEXT_SHARED except when a buffer access strategy is > + * used and the buffer being flushed is a buffer from the strategy ring. > */ > static void > -FlushBuffer(BufferDesc *buf, SMgrRelation reln) > +FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context) Too long line? But also, why document the possible values here? 
Seems likely to get out of date at some point, and it doesn't seem important to know? > @@ -3549,6 +3591,8 @@ FlushRelationBuffers(Relation rel) > localpage, > false); > > + pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL); > + > buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED); > pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); > Probably not worth doing, but these made me wonder whether there should be a function for counting N operations at once. > @@ -212,8 +215,23 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state) > if (strategy != NULL) > { > buf = GetBufferFromRing(strategy, buf_state); > - if (buf != NULL) > + *from_ring = buf != NULL; > + if (*from_ring) > + { Don't really like the if (*from_ring) - why not keep it as buf != NULL? Seems a bit confusing this way, making it less obvious what's being changed. > diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c > index 014f644bf9..a3d76599bf 100644 > --- a/src/backend/storage/buffer/localbuf.c > +++ b/src/backend/storage/buffer/localbuf.c > @@ -15,6 +15,7 @@ > */ > #include "postgres.h" > > +#include "pgstat.h" > #include "access/parallel.h" > #include "catalog/catalog.h" > #include "executor/instrument.h" Do most other places not put pgstat.h in the alphabetical order of headers? > @@ -432,6 +432,15 @@ ProcessSyncRequests(void) > total_elapsed += elapsed; > processed++; > > + /* > + * Note that if a backend using a BufferAccessStrategy is > + * forced to do its own fsync (as opposed to the > + * checkpointer doing it), it will not be counted as an > + * IOCONTEXT_STRATEGY IOOP_FSYNC and instead will be > + * counted as an IOCONTEXT_SHARED IOOP_FSYNC. > + */ > + pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED); Why is this noted here? Perhaps just point to the place where that happens instead? I think it's also documented in ForwardSyncRequest()? Or just only mention it there... 
> @@ -0,0 +1,191 @@ > +/* ------------------------------------------------------------------------- > + * > + * pgstat_io_ops.c > + * Implementation of IO operation statistics. > + * > + * This file contains the implementation of IO operation statistics. It is kept > + * separate from pgstat.c to enforce the line between the statistics access / > + * storage implementation and the details about individual types of > + * statistics. > + * > + * Copyright (c) 2001-2022, PostgreSQL Global Development Group Arguably this would just be 2021-2022 > +void > +pgstat_count_io_op(IOOp io_op, IOContext io_context) > +{ > + PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_context]; > + > + Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op)); > + > + switch (io_op) > + { > + case IOOP_ALLOC: > + pending_counters->allocs++; > + break; > + case IOOP_EXTEND: > + pending_counters->extends++; > + break; > + case IOOP_FSYNC: > + pending_counters->fsyncs++; > + break; > + case IOOP_READ: > + pending_counters->reads++; > + break; > + case IOOP_WRITE: > + pending_counters->writes++; > + break; > + } > + > +} How about replacing the breaks with a return and then erroring out if we reach the end of the function? You did that below, and I think it makes sense. > +bool > +pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context) > +{ Maybe add a tiny comment about what 'valid' means here? Something like 'return whether the backend type counts io in io_context'. > + /* > + * Only regular backends and WAL Sender processes executing queries should > + * use local buffers. > + */ > + no_local = bktype == B_AUTOVAC_LAUNCHER || bktype == > + B_BG_WRITER || bktype == B_CHECKPOINTER || bktype == > + B_AUTOVAC_WORKER || bktype == B_BG_WORKER || bktype == > + B_STANDALONE_BACKEND || bktype == B_STARTUP; I think BG_WORKERS could end up using local buffers, extensions can do just about everything in them. 
> +bool > +pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op) > +{ > + if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op == > + IOOP_READ) > + return false; Perhaps we should add an assertion about the backend type making sense here? I.e. that it's not archiver, walwriter etc? > +bool > +pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op) > +{ > + /* > + * Temporary tables using local buffers are not logged and thus do not > + * require fsync'ing. Set this cell to NULL to differentiate between an > + * invalid combination and 0 observed IO Operations. This comment feels a bit out of place? > +bool > +pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op) > +{ > + if (!pgstat_io_op_stats_collected(bktype)) > + return false; > + > + if (!pgstat_bktype_io_context_valid(bktype, io_context)) > + return false; > + > + if (!pgstat_bktype_io_op_valid(bktype, io_op)) > + return false; > + > + if (!pgstat_io_context_io_op_valid(io_context, io_op)) > + return false; > + > + /* > + * There are currently no cases of a BackendType, IOContext, IOOp > + * combination that are specifically invalid. > + */ "specifically"? > From 0f141fa7f97a57b8628b1b6fd6029bd3782f16a1 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Mon, 22 Aug 2022 11:35:20 -0400 > Subject: [PATCH v28 4/5] Aggregate IO operation stats per BackendType > > Stats on IOOps for all IOContexts for a backend are tracked locally. Add > functionality for backends to flush these stats to shared memory and > accumulate them with those from all other backends, exited and live. > Also add reset and snapshot functions used by cumulative stats system > for management of these statistics. > > The aggregated stats in shared memory could be extended in the future > with per-backend stats -- useful for per connection IO statistics and > monitoring. 
> > Some BackendTypes will not flush their pending statistics at regular > intervals and explicitly call pgstat_flush_io_ops() during the course of > normal operations to flush their backend-local IO Operation statistics > to shared memory in a timely manner. > Because not all BackendType, IOOp, IOContext combinations are valid, the > validity of the stats are checked before flushing pending stats and > before reading in the existing stats file to shared memory. s/are checked/is checked/? > @@ -1486,6 +1507,42 @@ pgstat_read_statsfile(void) > if (!read_chunk_s(fpin, &shmem->checkpointer.stats)) > goto error; > > + /* > + * Read IO Operations stats struct > + */ > + if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp)) > + goto error; > + > + for (int backend_type = 0; backend_type < BACKEND_NUM_TYPES; backend_type++) > + { > + PgStatShared_IOContextOps *backend_io_context_ops = &shmem->io_ops.stats[backend_type]; > + bool expect_backend_stats = true; > + > + if (!pgstat_io_op_stats_collected(backend_type)) > + expect_backend_stats = false; > + > + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) > + { > + if (!expect_backend_stats || > + !pgstat_bktype_io_context_valid(backend_type, io_context)) > + { > + pgstat_io_context_ops_assert_zero(&backend_io_context_ops->data[io_context]); > + continue; > + } > + > + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) > + { > + if (!pgstat_bktype_io_op_valid(backend_type, io_op) || > + !pgstat_io_context_io_op_valid(io_context, io_op)) > + pgstat_io_op_assert_zero(&backend_io_context_ops->data[io_context], > + io_op); > + } > + } > + > + if (!read_chunk_s(fpin, &backend_io_context_ops->data)) > + goto error; > + } Could we put the validation out of line? That's a lot of io stats specific code to be in pgstat_read_statsfile(). > +/* > + * Helper function to accumulate PgStat_IOOpCounters. 
If either of the > + * passed-in PgStat_IOOpCounters are members of PgStatShared_IOContextOps, the > + * caller is responsible for ensuring that the appropriate lock is held. This > + * is not asserted because this function could plausibly be used to accumulate > + * two local/pending PgStat_IOOpCounters. What's "this" here? > + */ > +static void > +pgstat_accum_io_op(PgStat_IOOpCounters *shared, PgStat_IOOpCounters *local, IOOp io_op) Given that the comment above says both of them may be local, it's a bit odd to call it 'shared' here... > +PgStat_BackendIOContextOps * > +pgstat_fetch_backend_io_context_ops(void) > +{ > + pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS); > + > + return &pgStatLocal.snapshot.io_ops; > +} Not for this patch series, but we really should replace this set of functions with storing the relevant offset in the kind_info. > @@ -496,6 +503,8 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void); > */ > > extern void pgstat_count_io_op(IOOp io_op, IOContext io_context); > +extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void); > +extern bool pgstat_flush_io_ops(bool nowait); > extern const char *pgstat_io_context_desc(IOContext io_context); > extern const char *pgstat_io_op_desc(IOOp io_op); > Is there any call to pgstat_flush_io_ops() from outside pgstat*.c? So possibly it could be in pgstat_internal.h? Not that it's particularly important... > @@ -506,6 +515,43 @@ extern bool pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op); > extern bool pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op); > extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op); > > +/* > + * Functions to assert that invalid IO Operation counters are zero. 
Used with > + * the validation functions in pgstat_io_ops.c > + */ > +static inline void > +pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters) > +{ > + Assert(counters->allocs == 0 && counters->extends == 0 && > + counters->fsyncs == 0 && counters->reads == 0 && > + counters->writes == 0); > +} > + > +static inline void > +pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op) > +{ > + switch (io_op) > + { > + case IOOP_ALLOC: > + Assert(counters->allocs == 0); > + return; > + case IOOP_EXTEND: > + Assert(counters->extends == 0); > + return; > + case IOOP_FSYNC: > + Assert(counters->fsyncs == 0); > + return; > + case IOOP_READ: > + Assert(counters->reads == 0); > + return; > + case IOOP_WRITE: > + Assert(counters->writes == 0); > + return; > + } > + > + elog(ERROR, "unrecognized IOOp value: %d", io_op); Hm. This means it'll emit code even in non-assertion builds - this should probably just be an Assert(false) or pg_unreachable(). > Subject: [PATCH v28 5/5] Add system view tracking IO ops per backend type > View stats are fetched from statistics incremented when a backend > performs an IO Operation and maintained by the cumulative statistics > subsystem. "fetched from statistics incremented"? > Each row of the view is stats for a particular BackendType for a > particular IOContext (e.g. shared buffer accesses by checkpointer) and > each column in the view is the total number of IO Operations done (e.g. > writes). s/is/shows/? s/for a particular BackendType for a particular IOContext/for a particularl BackendType and IOContext/? Somehow the repetition is weird. > Note that some of the cells in the view are redundant with fields in > pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in > pg_stat_bgwriter for backwards compatibility. Deriving the redundant > pg_stat_bgwriter stats from the IO operations stats structures was also > problematic due to the separate reset targets for 'bgwriter' and > 'io'. 
I suspect we should still consider doing that in the future, perhaps by documenting that the relevant fields in pg_stat_bgwriter aren't reset by the 'bgwriter' target anymore? And noting that reliance on those fields is "deprecated" and that pg_stat_io should be used instead? > Suggested by Andres Freund > > Author: Melanie Plageman <melanieplageman@gmail.com> > Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>, Kyotaro Horiguchi <horikyota.ntt@gmail.com> > Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de > --- > doc/src/sgml/monitoring.sgml | 115 ++++++++++++++- > src/backend/catalog/system_views.sql | 12 ++ > src/backend/utils/adt/pgstatfuncs.c | 100 +++++++++++++ > src/include/catalog/pg_proc.dat | 9 ++ > src/test/regress/expected/rules.out | 9 ++ > src/test/regress/expected/stats.out | 201 +++++++++++++++++++++++++++ > src/test/regress/sql/stats.sql | 103 ++++++++++++++ > 7 files changed, 548 insertions(+), 1 deletion(-) > > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml > index 9440b41770..9949011ba3 100644 > --- a/doc/src/sgml/monitoring.sgml > +++ b/doc/src/sgml/monitoring.sgml > @@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser > </entry> > </row> > > + <row> > + <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry> > + <entry>A row for each IO Context for each backend type showing > + statistics about backend IO operations. See > + <link linkend="monitoring-pg-stat-io-view"> > + <structname>pg_stat_io</structname></link> for details. > + </entry> > + </row> The "for each for each" thing again :) > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>io_context</structfield> <type>text</type> > + </para> > + <para> > + IO Context used (e.g. shared buffers, direct). > + </para></entry> > + </row> Wrong list of contexts. 
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>alloc</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of buffers allocated.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>extend</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of blocks extended.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>fsync</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of blocks fsynced.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>read</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of blocks read.
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>write</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of blocks written.
> +      </para></entry>
> +     </row>
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
> +      </para>
> +      <para>
> +       Time at which these statistics were last reset.
>       </para></entry>
>      </row>
>     </tbody>

Part of me thinks it'd be nicer if it were "allocated, read, written,
extended, fsynced, stats_reset", instead of alphabetical order. The order
already isn't alphabetical.


> +	/*
> +	 * When adding a new column to the pg_stat_io view, add a new enum value
> +	 * here above IO_NUM_COLUMNS.
> +	 */
> +	enum
> +	{
> +		IO_COLUMN_BACKEND_TYPE,
> +		IO_COLUMN_IO_CONTEXT,
> +		IO_COLUMN_ALLOCS,
> +		IO_COLUMN_EXTENDS,
> +		IO_COLUMN_FSYNCS,
> +		IO_COLUMN_READS,
> +		IO_COLUMN_WRITES,
> +		IO_COLUMN_RESET_TIME,
> +		IO_NUM_COLUMNS,
> +	};

Given it's local and some of the lines are long, maybe just use COL?
> +#define IO_COLUMN_IOOP_OFFSET (IO_COLUMN_IO_CONTEXT + 1)

Undef'ing it probably worth doing.


> +	SetSingleFuncCall(fcinfo, 0);
> +	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> +
> +	backends_io_stats = pgstat_fetch_backend_io_context_ops();
> +
> +	reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
> +
> +	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
> +	{
> +		Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
> +		bool		expect_backend_stats = true;
> +		PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
> +
> +		/*
> +		 * For those BackendTypes without IO Operation stats, skip
> +		 * representing them in the view altogether.
> +		 */
> +		if (!pgstat_io_op_stats_collected(bktype))
> +			expect_backend_stats = false;

Why not just expect_backend_stats = pgstat_io_op_stats_collected()?


> +		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> +		{
> +			PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
> +			Datum		values[IO_NUM_COLUMNS];
> +			bool		nulls[IO_NUM_COLUMNS];
> +
> +			/*
> +			 * Some combinations of IOCONTEXT and BackendType are not valid
> +			 * for any type of IO Operation. In such cases, omit the entire
> +			 * row from the view.
> +			 */
> +			if (!expect_backend_stats ||
> +				!pgstat_bktype_io_context_valid(bktype, io_context))
> +			{
> +				pgstat_io_context_ops_assert_zero(counters);
> +				continue;
> +			}
> +
> +			memset(values, 0, sizeof(values));
> +			memset(nulls, 0, sizeof(nulls));

I'd replace the memset with values[...] = {0} etc.


> +			values[IO_COLUMN_BACKEND_TYPE] = bktype_desc;
> +			values[IO_COLUMN_IO_CONTEXT] = CStringGetTextDatum(
> +															   pgstat_io_context_desc(io_context));

Pgindent, I hate you.

Perhaps put the context desc in a local var, so it doesn't look quite this
ugly?
> +			values[IO_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
> +			values[IO_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
> +			values[IO_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
> +			values[IO_COLUMN_READS] = Int64GetDatum(counters->reads);
> +			values[IO_COLUMN_WRITES] = Int64GetDatum(counters->writes);
> +			values[IO_COLUMN_RESET_TIME] = TimestampTzGetDatum(reset_time);
> +
> +
> +			/*
> +			 * Some combinations of BackendType and IOOp and of IOContext and
> +			 * IOOp are not valid. Set these cells in the view NULL and assert
> +			 * that these stats are zero as expected.
> +			 */
> +			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> +			{
> +				if (!pgstat_bktype_io_op_valid(bktype, io_op) ||
> +					!pgstat_io_context_io_op_valid(io_context, io_op))
> +				{
> +					pgstat_io_op_assert_zero(counters, io_op);
> +					nulls[io_op + IO_COLUMN_IOOP_OFFSET] = true;
> +				}
> +			}

A bit weird that we first assign a value and then set nulls separately. But
it's not obvious how to make it look nice otherwise.


> +-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
> +-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
> +SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +-- Create a regular table and insert some data to generate IOCONTEXT_SHARED allocs and extends.
> +CREATE TABLE test_io_shared(a int);
> +INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
> +SELECT pg_stat_force_next_flush();
> + pg_stat_force_next_flush
> +--------------------------
> +
> +(1 row)
> +
> +-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes and fsyncs.
> +CHECKPOINT;

Does that work reliably? A checkpoint could have started just before the
CREATE TABLE, I think? Then it'd not have flushed those writes yet. I think
doing two checkpoints would protect against that.


> +DROP TABLE test_io_shared;
> +DROP TABLESPACE test_io_shared_stats_tblspc;

Tablespace creation is somewhat expensive, do we really need that? There
should be one set up in setup.sql or such.


> +-- Test that allocs, extends, reads, and writes of temporary tables are tracked
> +-- in pg_stat_io.
> +CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
> +SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +-- Insert enough values that we need to reuse and write out dirty local
> +-- buffers.
> +INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
> +'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';

Could be abbreviated with repeat('a', some-number) :P

Can the table be smaller than this? That might show up on a slow machine.
> +SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;

Random q: Why are we uppercasing the first letter of the context?


> +CREATE TABLE test_io_strategy(a INT, b INT);
> +ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');

I think you can specify that as part of the CREATE TABLE. Not sure if
otherwise there's not a race where autovac could start before you do the
ALTER.


> +INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
> +-- Ensure that the next VACUUM will need to perform IO by rewriting the table
> +-- first with VACUUM (FULL).

... because VACUUM FULL currently doesn't set all-visible etc on the pages,
which the subsequent vacuum will then do.


> +-- Hope that the previous value of wal_skip_threshold was the default. We
> +-- can't use BEGIN...SET LOCAL since VACUUM can't be run inside a transaction
> +-- block.
> +RESET wal_skip_threshold;

Nothing in this file set it before, so that's a pretty sure-to-be-fulfilled
hope.


> +-- Test that, when using a Strategy, if creating a relation, Strategy extends

s/if/when/?

Looks good!

Greetings,

Andres Freund
v29 attached
On Thu, Aug 25, 2022 at 3:15 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-08-22 13:15:18 -0400, Melanie Plageman wrote:
> Notes on the commit which accumulates IO Operation stats in shared
> memory:
>
> - I've extended the usage of the Assert()s that IO Operation stats that
> should be zero are. Previously we only checked the stats validity when
> querying the view. Now we check it when flushing pending stats and
> when reading the stats file into shared memory.
> Note that the three locations with these validity checks (when
> flushing pending stats, when reading stats file into shared memory,
> and when querying the view) have similar looking code to loop through
> and validate the stats. However, the actual action they perform if the
> stats are valid is different for each site (adding counters together,
> doing a read, setting nulls in a tuple column to true). Also, some of
> these instances have other code interspersed in the loops which would
> require additional looping if separated from this logic. So it was
> difficult to see a way of combining these into a single helper
> function.
All of them seem to repeat something like
> + if (!pgstat_bktype_io_op_valid(bktype, io_op) ||
> + !pgstat_io_context_io_op_valid(io_context, io_op))
perhaps those could be combined? Afaics nothing uses pgstat_bktype_io_op_valid
separately.
I've combined these into pgstat_io_op_valid().
> Subject: [PATCH v28 3/5] Track IO operation statistics locally
>
> Introduce "IOOp", an IO operation done by a backend, and "IOContext",
> the IO location source or target or IO type done by a backend. For
> example, the checkpointer may write a shared buffer out. This would be
> counted as an IOOp "write" on an IOContext IOCONTEXT_SHARED by
> BackendType "checkpointer".
>
> Each IOOp (alloc, extend, fsync, read, write) is counted per IOContext
> (local, shared, or strategy) through a call to pgstat_count_io_op().
>
> The primary concern of these statistics is IO operations on data blocks
> during the course of normal database operations. IO done by, for
> example, the archiver or syslogger is not counted in these statistics.
s/is/are/?
changed
> Stats on IOOps for all IOContexts for a backend are counted in a
> backend's local memory. This commit does not expose any functions for
> aggregating or viewing these stats.
s/This commit does not/A subsequent commit will expose/...
changed
> @@ -823,6 +823,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> BufferDesc *bufHdr;
> Block bufBlock;
> bool found;
> + IOContext io_context;
> bool isExtend;
> bool isLocalBuf = SmgrIsTemp(smgr);
>
> @@ -986,10 +987,25 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> */
> Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
>
> - bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
> + if (isLocalBuf)
> + {
> + bufBlock = LocalBufHdrGetBlock(bufHdr);
> + io_context = IOCONTEXT_LOCAL;
> + }
> + else
> + {
> + bufBlock = BufHdrGetBlock(bufHdr);
> +
> + if (strategy != NULL)
> + io_context = IOCONTEXT_STRATEGY;
> + else
> + io_context = IOCONTEXT_SHARED;
> + }
There's a isLocalBuf block earlier on, couldn't we just determine the context
there? I guess there's a branch here already, so it's probably fine as is.
I've added this as close as possible to the code where we use the
io_context. If I were to move it, it would make sense to move it all the
way to the top of ReadBuffer_common() where we first define isLocalBuf.
I've left it as is.
> if (isExtend)
> {
> +
> + pgstat_count_io_op(IOOP_EXTEND, io_context);
Spurious newline.
fixed
> @@ -2820,9 +2857,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
> *
> * If the caller has an smgr reference for the buffer's relation, pass it
> * as the second parameter. If not, pass NULL.
> + *
> + * IOContext will always be IOCONTEXT_SHARED except when a buffer access strategy is
> + * used and the buffer being flushed is a buffer from the strategy ring.
> */
> static void
> -FlushBuffer(BufferDesc *buf, SMgrRelation reln)
> +FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOContext io_context)
Too long line?
But also, why document the possible values here? Seems likely to get out of
date at some point, and it doesn't seem important to know?
Deleted.
> @@ -3549,6 +3591,8 @@ FlushRelationBuffers(Relation rel)
> localpage,
> false);
>
> + pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_LOCAL);
> +
> buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
> pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
>
Probably not worth doing, but these made me wonder whether there should be a
function for counting N operations at once.
Would it be worth it here? We would need a local variable to track how
many local buffers we end up writing. Do you think that
pgstat_count_io_op() will not be inlined and thus we will end up with
lots of extra function calls if we do a pgstat_count_io_op() on every
iteration? And that it will matter in FlushRelationBuffers()?
The other times that pgstat_count_io_op() is used in a loop, it is
part of the branch that will exit the loop and only be called once-ish.
Or are you thinking that just generally it might be nice to have?
> @@ -212,8 +215,23 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
> if (strategy != NULL)
> {
> buf = GetBufferFromRing(strategy, buf_state);
> - if (buf != NULL)
> + *from_ring = buf != NULL;
> + if (*from_ring)
> + {
Don't really like the if (*from_ring) - why not keep it as buf != NULL? Seems
a bit confusing this way, making it less obvious what's being changed.
Changed
> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 014f644bf9..a3d76599bf 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -15,6 +15,7 @@
> */
> #include "postgres.h"
>
> +#include "pgstat.h"
> #include "access/parallel.h"
> #include "catalog/catalog.h"
> #include "executor/instrument.h"
Do most other places not put pgstat.h in the alphabetical order of headers?
Fixed
> @@ -432,6 +432,15 @@ ProcessSyncRequests(void)
> total_elapsed += elapsed;
> processed++;
>
> + /*
> + * Note that if a backend using a BufferAccessStrategy is
> + * forced to do its own fsync (as opposed to the
> + * checkpointer doing it), it will not be counted as an
> + * IOCONTEXT_STRATEGY IOOP_FSYNC and instead will be
> + * counted as an IOCONTEXT_SHARED IOOP_FSYNC.
> + */
> + pgstat_count_io_op(IOOP_FSYNC, IOCONTEXT_SHARED);
Why is this noted here? Perhaps just point to the place where that happens
instead? I think it's also documented in ForwardSyncRequest()? Or just only
mention it there...
Removed
> @@ -0,0 +1,191 @@
> +/* -------------------------------------------------------------------------
> + *
> + * pgstat_io_ops.c
> + * Implementation of IO operation statistics.
> + *
> + * This file contains the implementation of IO operation statistics. It is kept
> + * separate from pgstat.c to enforce the line between the statistics access /
> + * storage implementation and the details about individual types of
> + * statistics.
> + *
> + * Copyright (c) 2001-2022, PostgreSQL Global Development Group
Arguably this would just be 2021-2022
Changed
> +void
> +pgstat_count_io_op(IOOp io_op, IOContext io_context)
> +{
> + PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_context];
> +
> + Assert(pgstat_expect_io_op(MyBackendType, io_context, io_op));
> +
> + switch (io_op)
> + {
> + case IOOP_ALLOC:
> + pending_counters->allocs++;
> + break;
> + case IOOP_EXTEND:
> + pending_counters->extends++;
> + break;
> + case IOOP_FSYNC:
> + pending_counters->fsyncs++;
> + break;
> + case IOOP_READ:
> + pending_counters->reads++;
> + break;
> + case IOOP_WRITE:
> + pending_counters->writes++;
> + break;
> + }
> +
> +}
How about replacing the breaks with a return and then erroring out if we reach
the end of the function? You did that below, and I think it makes sense.
I used breaks because in the subsequent commit I introduce the variable
"have_ioopstats", and I set have_ioopstats to false in
pgstat_count_io_op() after counting.
It is probably safe to set have_ioopstats to true before incrementing it
since this backend is the only one that can see have_ioopstats and it
shouldn't fail while incrementing the counter but it seems less clear
than doing it after.
Instead of erroring out for an unknown IOOp, I decided to add Asserts
about the IOContext and IOOp being valid and that the combination of
MyBackendType, IOContext, and IOOp are valid. I think it will be good to
assert that the IOContext is valid before using it as an array index for
lookup in pending stats.
"have_ioopstats", and I set have_ioopstats to false in
pgstat_count_io_op() after counting.
It is probably safe to set have_ioopstats to true before incrementing it
since this backend is the only one that can see have_ioopstats and it
shouldn't fail while incrementing the counter but it seems less clear
than doing it after.
Instead of erroring out for an unknown IOOp, I decided to add Asserts
about the IOContext and IOOp being valid and that the combination of
MyBackendType, IOContext, and IOOp are valid. I think it will be good to
assert that the IOContext is valid before using it as an array index for
lookup in pending stats.
> +bool
> +pgstat_bktype_io_context_valid(BackendType bktype, IOContext io_context)
> +{
Maybe add a tiny comment about what 'valid' means here? Something like
'return whether the backend type counts io in io_context'.
Changed
> + /*
> + * Only regular backends and WAL Sender processes executing queries should
> + * use local buffers.
> + */
> + no_local = bktype == B_AUTOVAC_LAUNCHER || bktype ==
> + B_BG_WRITER || bktype == B_CHECKPOINTER || bktype ==
> + B_AUTOVAC_WORKER || bktype == B_BG_WORKER || bktype ==
> + B_STANDALONE_BACKEND || bktype == B_STARTUP;
I think BG_WORKERS could end up using local buffers, extensions can do just
about everything in them.
Fixed and added comment.
> +bool
> +pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op)
> +{
> + if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) && io_op ==
> + IOOP_READ)
> + return false;
Perhaps we should add an assertion about the backend type making sense here?
I.e. that it's not archiver, walwriter etc?
Done
> +bool
> +pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op)
> +{
> + /*
> + * Temporary tables using local buffers are not logged and thus do not
> + * require fsync'ing. Set this cell to NULL to differentiate between an
> + * invalid combination and 0 observed IO Operations.
This comment feels a bit out of place?
Deleted
> +bool
> +pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op)
> +{
> + if (!pgstat_io_op_stats_collected(bktype))
> + return false;
> +
> + if (!pgstat_bktype_io_context_valid(bktype, io_context))
> + return false;
> +
> + if (!pgstat_bktype_io_op_valid(bktype, io_op))
> + return false;
> +
> + if (!pgstat_io_context_io_op_valid(io_context, io_op))
> + return false;
> +
> + /*
> + * There are currently no cases of a BackendType, IOContext, IOOp
> + * combination that are specifically invalid.
> + */
"specifically"?
I removed this and mentioned it (rephrased) above pgstat_io_op_valid()
> From 0f141fa7f97a57b8628b1b6fd6029bd3782f16a1 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 22 Aug 2022 11:35:20 -0400
> Subject: [PATCH v28 4/5] Aggregate IO operation stats per BackendType
>
> Stats on IOOps for all IOContexts for a backend are tracked locally. Add
> functionality for backends to flush these stats to shared memory and
> accumulate them with those from all other backends, exited and live.
> Also add reset and snapshot functions used by cumulative stats system
> for management of these statistics.
>
> The aggregated stats in shared memory could be extended in the future
> with per-backend stats -- useful for per connection IO statistics and
> monitoring.
>
> Some BackendTypes will not flush their pending statistics at regular
> intervals and explicitly call pgstat_flush_io_ops() during the course of
> normal operations to flush their backend-local IO Operation statistics
> to shared memory in a timely manner.
> Because not all BackendType, IOOp, IOContext combinations are valid, the
> validity of the stats are checked before flushing pending stats and
> before reading in the existing stats file to shared memory.
s/are checked/is checked/?
Fixed
> @@ -1486,6 +1507,42 @@ pgstat_read_statsfile(void)
> if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
> goto error;
>
> + /*
> + * Read IO Operations stats struct
> + */
> + if (!read_chunk_s(fpin, &shmem->io_ops.stat_reset_timestamp))
> + goto error;
> +
> + for (int backend_type = 0; backend_type < BACKEND_NUM_TYPES; backend_type++)
> + {
> + PgStatShared_IOContextOps *backend_io_context_ops = &shmem->io_ops.stats[backend_type];
> + bool expect_backend_stats = true;
> +
> + if (!pgstat_io_op_stats_collected(backend_type))
> + expect_backend_stats = false;
> +
> + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> + {
> + if (!expect_backend_stats ||
> + !pgstat_bktype_io_context_valid(backend_type, io_context))
> + {
> + pgstat_io_context_ops_assert_zero(&backend_io_context_ops->data[io_context]);
> + continue;
> + }
> +
> + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
> + {
> + if (!pgstat_bktype_io_op_valid(backend_type, io_op) ||
> + !pgstat_io_context_io_op_valid(io_context, io_op))
> + pgstat_io_op_assert_zero(&backend_io_context_ops->data[io_context],
> + io_op);
> + }
> + }
> +
> + if (!read_chunk_s(fpin, &backend_io_context_ops->data))
> + goto error;
> + }
Could we put the validation out of line? That's a lot of io stats specific
code to be in pgstat_read_statsfile().
Done.
> +/*
> + * Helper function to accumulate PgStat_IOOpCounters. If either of the
> + * passed-in PgStat_IOOpCounters are members of PgStatShared_IOContextOps, the
> + * caller is responsible for ensuring that the appropriate lock is held. This
> + * is not asserted because this function could plausibly be used to accumulate
> + * two local/pending PgStat_IOOpCounters.
What's "this" here?
I rephrased it.
> @@ -496,6 +503,8 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
> */
>
> extern void pgstat_count_io_op(IOOp io_op, IOContext io_context);
> +extern PgStat_BackendIOContextOps *pgstat_fetch_backend_io_context_ops(void);
> +extern bool pgstat_flush_io_ops(bool nowait);
> extern const char *pgstat_io_context_desc(IOContext io_context);
> extern const char *pgstat_io_op_desc(IOOp io_op);
>
Is there any call to pgstat_flush_io_ops() from outside pgstat*.c? So possibly
it could be in pgstat_internal.h? Not that it's particularly important...
Moved it.
> @@ -506,6 +515,43 @@ extern bool pgstat_bktype_io_op_valid(BackendType bktype, IOOp io_op);
> extern bool pgstat_io_context_io_op_valid(IOContext io_context, IOOp io_op);
> extern bool pgstat_expect_io_op(BackendType bktype, IOContext io_context, IOOp io_op);
>
> +/*
> + * Functions to assert that invalid IO Operation counters are zero. Used with
> + * the validation functions in pgstat_io_ops.c
> + */
> +static inline void
> +pgstat_io_context_ops_assert_zero(PgStat_IOOpCounters *counters)
> +{
> + Assert(counters->allocs == 0 && counters->extends == 0 &&
> + counters->fsyncs == 0 && counters->reads == 0 &&
> + counters->writes == 0);
> +}
> +
> +static inline void
> +pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op)
> +{
> + switch (io_op)
> + {
> + case IOOP_ALLOC:
> + Assert(counters->allocs == 0);
> + return;
> + case IOOP_EXTEND:
> + Assert(counters->extends == 0);
> + return;
> + case IOOP_FSYNC:
> + Assert(counters->fsyncs == 0);
> + return;
> + case IOOP_READ:
> + Assert(counters->reads == 0);
> + return;
> + case IOOP_WRITE:
> + Assert(counters->writes == 0);
> + return;
> + }
> +
> + elog(ERROR, "unrecognized IOOp value: %d", io_op);
Hm. This means it'll emit code even in non-assertion builds - this should
probably just be an Assert(false) or pg_unreachable().
Fixed.
> Subject: [PATCH v28 5/5] Add system view tracking IO ops per backend type
> View stats are fetched from statistics incremented when a backend
> performs an IO Operation and maintained by the cumulative statistics
> subsystem.
"fetched from statistics incremented"?
Rephrased it.
> Each row of the view is stats for a particular BackendType for a
> particular IOContext (e.g. shared buffer accesses by checkpointer) and
> each column in the view is the total number of IO Operations done (e.g.
> writes).
s/is/shows/?
s/for a particular BackendType for a particular IOContext/for a particular
BackendType and IOContext/? Somehow the repetition is weird.
Both of the above wordings are now changed.
> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 9440b41770..9949011ba3 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
> </entry>
> </row>
>
> + <row>
> + <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
> + <entry>A row for each IO Context for each backend type showing
> + statistics about backend IO operations. See
> + <link linkend="monitoring-pg-stat-io-view">
> + <structname>pg_stat_io</structname></link> for details.
> + </entry>
> + </row>
The "for each for each" thing again :)
Changed it.
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>io_context</structfield> <type>text</type>
> + </para>
> + <para>
> + IO Context used (e.g. shared buffers, direct).
> + </para></entry>
> + </row>
Wrong list of contexts.
Fixed it.
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>alloc</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Number of buffers allocated.
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>extend</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Number of blocks extended.
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>fsync</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Number of blocks fsynced.
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>read</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Number of blocks read.
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>write</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Number of blocks written.
> + </para></entry>
> + </row>
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
> + </para>
> + <para>
> + Time at which these statistics were last reset.
> </para></entry>
> </row>
> </tbody>
Part of me thinks it'd be nicer if it were "allocated, read, written, extended,
fsynced, stats_reset", instead of alphabetical order. The order already isn't
alphabetical.
I've updated the order in the view and docs.
> + /*
> + * When adding a new column to the pg_stat_io view, add a new enum value
> + * here above IO_NUM_COLUMNS.
> + */
> + enum
> + {
> + IO_COLUMN_BACKEND_TYPE,
> + IO_COLUMN_IO_CONTEXT,
> + IO_COLUMN_ALLOCS,
> + IO_COLUMN_EXTENDS,
> + IO_COLUMN_FSYNCS,
> + IO_COLUMN_READS,
> + IO_COLUMN_WRITES,
> + IO_COLUMN_RESET_TIME,
> + IO_NUM_COLUMNS,
> + };
Given it's local and some of the lines are long, maybe just use COL?
I've shortened COLUMN to COL. However, I've also moved this enum outside
of the function and typedef'd it. I did this because, upon changing the
order of the columns in the view, I could no longer use
IO_COLUMN_IOOP_OFFSET and the IOOp value in the loop at the bottom of
pg_stat_get_io() to set the correct column to NULL. So, I created a
helper function which translates IOOp to io_stat_col.
> +#define IO_COLUMN_IOOP_OFFSET (IO_COLUMN_IO_CONTEXT + 1)
Undef'ing it probably worth doing.
It's gone now anyway.
> + SetSingleFuncCall(fcinfo, 0);
> + rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> +
> + backends_io_stats = pgstat_fetch_backend_io_context_ops();
> +
> + reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp);
> +
> + for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
> + {
> + Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
> + bool expect_backend_stats = true;
> + PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype];
> +
> + /*
> + * For those BackendTypes without IO Operation stats, skip
> + * representing them in the view altogether.
> + */
> + if (!pgstat_io_op_stats_collected(bktype))
> + expect_backend_stats = false;
Why not just expect_backend_stats = pgstat_io_op_stats_collected()?
Updated this everywhere it occurred.
> + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
> + {
> + PgStat_IOOpCounters *counters = &io_context_ops->data[io_context];
> + Datum values[IO_NUM_COLUMNS];
> + bool nulls[IO_NUM_COLUMNS];
> +
> + /*
> + * Some combinations of IOCONTEXT and BackendType are not valid
> + * for any type of IO Operation. In such cases, omit the entire
> + * row from the view.
> + */
> + if (!expect_backend_stats ||
> + !pgstat_bktype_io_context_valid(bktype, io_context))
> + {
> + pgstat_io_context_ops_assert_zero(counters);
> + continue;
> + }
> +
> + memset(values, 0, sizeof(values));
> + memset(nulls, 0, sizeof(nulls));
I'd replace the memset with values[...] = {0} etc.
Done.
> + values[IO_COLUMN_BACKEND_TYPE] = bktype_desc;
> + values[IO_COLUMN_IO_CONTEXT] = CStringGetTextDatum(
> + pgstat_io_context_desc(io_context));
Pgindent, I hate you.
Perhaps put the context desc in a local var, so it doesn't look quite this
ugly?
Did this.
> +-- Test that allocs, extends, reads, and writes to Shared Buffers and fsyncs
> +-- done to ensure durability of Shared Buffers are tracked in pg_stat_io.
> +SELECT sum(alloc) AS io_sum_shared_allocs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(extend) AS io_sum_shared_extends_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(fsync) AS io_sum_shared_fsyncs_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(read) AS io_sum_shared_reads_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_context = 'Shared' \gset
> +-- Create a regular table and insert some data to generate IOCONTEXT_SHARED allocs and extends.
> +CREATE TABLE test_io_shared(a int);
> +INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
> +SELECT pg_stat_force_next_flush();
> + pg_stat_force_next_flush
> +--------------------------
> +
> +(1 row)
> +
> +-- After a checkpoint, there should be some additional IOCONTEXT_SHARED writes and fsyncs.
> +CHECKPOINT;
Does that work reliably? A checkpoint could have started just before the
CREATE TABLE, I think? Then it'd not have flushed those writes yet. I think
doing two checkpoints would protect against that.
If the first checkpoint starts just before creating the table and those
buffers are dirtied during that checkpoint and thus not written out by
checkpointer during that checkpoint, then the test's (single) explicit
checkpoint would end up picking up those dirty buffers and writing them
out, right?
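For concreteness, the two-checkpoint variant Andres suggests would be a one-line change to the test script (a sketch, not the committed test):

```sql
-- Issuing two back-to-back checkpoints makes the test robust: even if a
-- concurrent checkpoint began just before CREATE TABLE and skipped the
-- newly dirtied buffers, the second explicit checkpoint necessarily
-- starts after they were dirtied and must write and fsync them.
CHECKPOINT;
CHECKPOINT;
```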
> +DROP TABLE test_io_shared;
> +DROP TABLESPACE test_io_shared_stats_tblspc;
Tablespace creation is somewhat expensive, do we really need that? There
should be one set up in setup.sql or such.
The only ones I see in regress are for tablespace.sql which drops them
in the same test and is testing dropping tablespaces.
> +-- Test that allocs, extends, reads, and writes of temporary tables are tracked
> +-- in pg_stat_io.
> +CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
> +SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(write) AS io_sum_local_writes_before FROM pg_stat_io WHERE io_context = 'Local' \gset
> +-- Insert enough values that we need to reuse and write out dirty local
> +-- buffers.
> +INSERT INTO test_io_local SELECT generate_series(1, 80000) as id,
> +'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
Could be abbreviated with repeat('a', some-number) :P
Done.
Can the table be smaller than this? That might show up on a slow machine.
Setting temp_buffers to 1MB, 7500 tuples of this width seem like enough.
I inserted 8000 to be safe -- seems like an order of magnitude less
should be good.
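The abbreviated insert could then look like this (the repeat count of 100 is illustrative; anything keeping the table comfortably larger than a 1MB temp_buffers works):

```sql
-- Insert enough tuples that dirty local buffers must be reused and
-- written out with temp_buffers set to 1MB.
INSERT INTO test_io_local
SELECT generate_series(1, 8000) AS id, repeat('a', 100);
```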
> +SELECT sum(alloc) AS io_sum_local_allocs_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(read) AS io_sum_local_reads_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT sum(write) AS io_sum_local_writes_after FROM pg_stat_io WHERE io_context = 'Local' \gset
> +SELECT :io_sum_local_allocs_after > :io_sum_local_allocs_before;
Random q: Why are we uppercasing the first letter of the context?
hmm. dunno. I changed it to be lowercase now.
> +CREATE TABLE test_io_strategy(a INT, b INT);
> +ALTER TABLE test_io_strategy SET (autovacuum_enabled = 'false');
I think you can specify that as part of the CREATE TABLE. Not sure if
otherwise there's not a race where autovac could start before you do the ALTER.
Done.
> +INSERT INTO test_io_strategy SELECT i, i from generate_series(1, 8000)i;
> +-- Ensure that the next VACUUM will need to perform IO by rewriting the table
> +-- first with VACUUM (FULL).
... because VACUUM FULL currently doesn't set all-visible etc on the pages,
which the subsequent vacuum will then do.
It is true that the second VACUUM will set all-visible while VACUUM FULL
will not. However, I didn't think that that writing was what allowed us
to test strategy reads and allocs. It would theoretically allow us to
test strategy writes, however, in practice, checkpointer or background
writer often wrote out these dirty pages with all-visible set before
this backend had a chance to reuse them and write them out itself.
Unless you are saying that the subsequent VACUUM would be a no-op were
VACUUM FULL to set all-visible on the rewritten pages?
> +-- Hope that the previous value of wal_skip_threshold was the default. We
> +-- can't use BEGIN...SET LOCAL since VACUUM can't be run inside a transaction
> +-- block.
> +RESET wal_skip_threshold;
Nothing in this file set it before, so that's a pretty sure-to-be-fulfilled
hope.
I've removed the comment.
> +-- Test that, when using a Strategy, if creating a relation, Strategy extends
s/if/when/?
Changed this.
Thanks for the detailed review!
- Melanie
v30 attached

rebased and pgstat_io_ops.c builds with meson now

also, I tested with pgstat_report_stat() only flushing when forced and tests still pass
On Tue, Sep 27, 2022 at 11:20 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
v30 attached
rebased and pgstat_io_ops.c builds with meson now
also, I tested with pgstat_report_stat() only flushing when forced and
tests still pass
First of all, I'm excited about this patch, and I think it will be a big help to understand better which part of Postgres is producing I/O (and why).
I've paired up with Maciek (CCed) on a review of this patch and had a few comments, focused on the user experience:
The term "strategy" as an "io_context" is hard to understand, as its not a concept an end-user / DBA would be familiar with. Since this comes from BufferAccessStrategyType (i.e. anything not NULL/BAS_NORMAL is treated as "strategy"), maybe we could instead split this out into the individual strategy types? i.e. making "strategy" three different I/O contexts instead: "shared_bulkread", "shared_bulkwrite" and "shared_vacuum", retaining "shared" to mean NULL / BAS_NORMAL.
Separately, could we also track buffer hits without incurring extra overhead? (not just allocs and reads) -- Whilst we already have shared read and hit counters in a few other places, this would help make the common "What's my cache hit ratio" question more accurate to answer in the presence of different shared buffer access strategies. Tracking hits could also help for local buffers (e.g. to tune temp_buffers based on seeing a low cache hit ratio).
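For illustration, the sort of query that hit tracking would enable might look like this (the "hit" and "read" column names are hypothetical here, since the view's columns were still under discussion in this thread):

```sql
-- Hypothetical sketch: cache hit ratio per backend type from the
-- proposed pg_stat_io view, assuming "hit" and "read" columns exist.
SELECT backend_type,
       round(sum(hit) * 100.0 / nullif(sum(hit) + sum(read), 0), 2)
           AS cache_hit_pct
FROM pg_stat_io
GROUP BY backend_type;
```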
Additionally, some minor notes:
- Since the stats are counting blocks, it would make sense to prefix the view columns with "blks_", and word them in the past tense (to match current style), i.e. "blks_written", "blks_read", "blks_extended", "blks_fsynced" (realistically one would combine this new view with other data e.g. from pg_stat_database or pg_stat_statements, which all use the "blks_" prefix, and stop using pg_stat_bgwriter for this which does not use such a prefix)
- "alloc" as a name doesn't seem intuitive (and it may be confused with memory allocations) - whilst this is already named this way in pg_stat_bgwriter, it feels like this is an opportunity to eventually deprecate the column there and make this easier to understand - specifically, maybe we can clarify that this means buffer *acquisitions*? (either by renaming the field to "blks_acquired", or clarifying in the documentation)
- Assuming we think this view could realistically cover all I/O produced by Postgres in the future (thus warranting the name "pg_stat_io"), it may be best to have an explicit list of things that are not currently tracked in the documentation, to reduce user confusion (i.e. WAL writes are not tracked, temporary files are not tracked, and some forms of direct writes are not tracked, e.g. when a table moves to a different tablespace)
- In the view documentation, it would be good to explain the different values for "io_strategy" (and what they mean)
- Overall it would be helpful if we had a dedicated documentation page on I/O statistics that's linked from the pg_stat_io view description, and explains how the I/O statistics tie into the various concepts of shared buffers / buffer access strategies / etc (and what is not tracked today)
Thanks,
Lukas Fittl
Hi,

On 2022-09-27 14:20:44 -0400, Melanie Plageman wrote:
> v30 attached
> rebased and pgstat_io_ops.c builds with meson now
> also, I tested with pgstat_report_stat() only flushing when forced and
> tests still pass

Unfortunately tests fail in CI / cfbot. E.g.,
https://cirrus-ci.com/task/5816109319323648
https://api.cirrus-ci.com/v1/artifact/task/5816109319323648/testrun/build/testrun/main/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/stats.out /tmp/cirrus-ci-build/build/testrun/main/regress/results/stats.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/stats.out	2022-10-01 12:07:47.779183501 +0000
+++ /tmp/cirrus-ci-build/build/testrun/main/regress/results/stats.out	2022-10-01 12:11:38.686433303 +0000
@@ -997,6 +997,8 @@
 -- Set temp_buffers to a low value so that we can trigger writes with fewer
 -- inserted tuples.
 SET temp_buffers TO '1MB';
+ERROR:  invalid value for parameter "temp_buffers": 128
+DETAIL:  "temp_buffers" cannot be changed after any temporary tables have been accessed in the session.
 CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
 SELECT sum(alloc) AS io_sum_local_allocs_before FROM pg_stat_io WHERE io_context = 'local' \gset
 SELECT sum(read) AS io_sum_local_reads_before FROM pg_stat_io WHERE io_context = 'local' \gset
@@ -1037,7 +1039,7 @@
 SELECT :io_sum_local_writes_after > :io_sum_local_writes_before;
  ?column?
 ----------
- t
+ f
 (1 row)

 SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;

So the problem is just that something else accesses temp buffers earlier in the same test. That's likely because, since you sent your email,

commit d7e39d72ca1c6f188b400d7d58813ff5b5b79064
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2022-09-29 12:14:39 -0400

    Use actual backend IDs in pg_stat_get_backend_idset() and friends.

was applied, which adds a temp table earlier in the same session.

I think the easiest way to make this robust would be to just add a reconnect before the place you need to set temp_buffers; that way additional temp tables won't cause a problem.

Setting the patch to waiting-for-author for now.

Greetings,

Andres Freund
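The reconnect Andres suggests could look like this in the regression script (a sketch: psql's \c re-establishes the session, so no temporary tables have been accessed yet and temp_buffers can still be changed):

```sql
-- Start a fresh session so the "temp_buffers cannot be changed after any
-- temporary tables have been accessed" error can no longer fire, even if
-- an earlier test in the same session created a temp table.
\c -
SET temp_buffers TO '1MB';
CREATE TEMPORARY TABLE test_io_local(a int, b TEXT);
```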
v31 attached

I've also addressed the failing test mentioned by Andres in [1]

On Fri, Sep 30, 2022 at 7:18 PM Lukas Fittl <lukas@fittl.com> wrote:
>
> On Tue, Sep 27, 2022 at 11:20 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> First of all, I'm excited about this patch, and I think it will be a big help to understand better which part of Postgres is producing I/O (and why).

Thanks! I'm happy to hear that.

> I've paired up with Maciek (CCed) on a review of this patch and had a few comments, focused on the user experience:

Thanks for taking the time to review!

> The term "strategy" as an "io_context" is hard to understand, as its not a concept an end-user / DBA would be familiar with. Since this comes from BufferAccessStrategyType (i.e. anything not NULL/BAS_NORMAL is treated as "strategy"), maybe we could instead split this out into the individual strategy types? i.e. making "strategy" three different I/O contexts instead: "shared_bulkread", "shared_bulkwrite" and "shared_vacuum", retaining "shared" to mean NULL / BAS_NORMAL.

I have split strategy out into "vacuum", "bulkread", and "bulkwrite". I thought it was less clear with shared as a prefix. If we were to have BufferAccessStrategies in the future which acquire local buffers (for example), we could start prefixing the columns to differentiate.

This opened up some new questions about which BufferAccessStrategies will be employed by which BackendTypes and which IOOps will be valid in a given BufferAccessStrategy. I've excluded IOCONTEXT_BULKREAD and IOCONTEXT_BULKWRITE for autovacuum worker -- though those may not be inherently invalid, they seem not to be done now and added extra rows to the view. I've also disallowed IOOP_EXTEND for IOCONTEXT_BULKREAD.

> Separately, could we also track buffer hits without incurring extra overhead? (not just allocs and reads) -- Whilst we already have shared read and hit counters in a few other places, this would help make the common "What's my cache hit ratio" question more accurate to answer in the presence of different shared buffer access strategies. Tracking hits could also help for local buffers (e.g. to tune temp_buffers based on seeing a low cache hit ratio).

I've started tracking hits and added "hit" to the view. I added IOOP_HIT and IOOP_ACQUIRE to those IOOps disallowed for checkpointer and bgwriter.

I have added tests for hit, but I'm not sure I can keep them. It seems like they might fail if the blocks are evicted between the first and second time I try to read them.

> Additionally, some minor notes:
>
> - Since the stats are counting blocks, it would make sense to prefix the view columns with "blks_", and word them in the past tense (to match current style), i.e. "blks_written", "blks_read", "blks_extended", "blks_fsynced" (realistically one would combine this new view with other data e.g. from pg_stat_database or pg_stat_statements, which all use the "blks_" prefix, and stop using pg_stat_bgwriter for this which does not use such a prefix)

I have changed the column names to be in the past tense. There are no columns equivalent to "dirty" or "misses" from the other views containing information on buffer hits/block reads/writes/etc. I'm not sure whether or not those make sense in this context.

Because we want to add non-block-oriented IO in the future (like temporary file IO) to this view and want to use the same "read", "written", "extended" columns, I would prefer not to prefix the columns with "blks_". I have added a column "unit" which would contain the unit in which read, written, and extended are in. Unfortunately, fsyncs are not per block, so "unit" doesn't really work for this. I documented this.

The most correct thing to do to accommodate block-oriented and non-block-oriented IO would be to specify all the values in bytes. However, I would like this view to be usable visually (as opposed to just in scripts and by tools). The only current value of unit is "block_size" which could potentially be combined with the value of the GUC to get bytes.

I've hard-coded the string "block_size" into the view generation function pg_stat_get_io(), so, if this idea makes sense, perhaps I should do something better there.

> - "alloc" as a name doesn't seem intuitive (and it may be confused with memory allocations) - whilst this is already named this way in pg_stat_bgwriter, it feels like this is an opportunity to eventually deprecate the column there and make this easier to understand - specifically, maybe we can clarify that this means buffer *acquisitions*? (either by renaming the field to "blks_acquired", or clarifying in the documentation)

I have renamed it to acquired. It doesn't overlap completely with buffers_alloc in pg_stat_bgwriter, so I didn't mention that in docs.

> - Assuming we think this view could realistically cover all I/O produced by Postgres in the future (thus warranting the name "pg_stat_io"), it may be best to have an explicit list of things that are not currently tracked in the documentation, to reduce user confusion (i.e. WAL writes are not tracked, temporary files are not tracked, and some forms of direct writes are not tracked, e.g. when a table moves to a different tablespace)

I have added this to the docs. The list is not exhaustive, so I would love to get feedback on if there are other specific examples of IO which is using smgr* directly that users will wonder about and I should call out.

> - In the view documentation, it would be good to explain the different values for "io_strategy" (and what they mean)

I have added this and would love feedback on my docs additions.

> - Overall it would be helpful if we had a dedicated documentation page on I/O statistics that's linked from the pg_stat_io view description, and explains how the I/O statistics tie into the various concepts of shared buffers / buffer access strategies / etc (and what is not tracked today)

I haven't done this yet. How specific were you thinking -- like interpretations of all the combinations and what to do with what you see? Like you should run pg_prewarm if you see X? Specific checkpointer or bgwriter GUCs to change? Or just links to other docs pages on recommended tunings?

Were you imagining the other IO statistics views (like pg_statio_all_tables and pg_stat_database) also being included in this page? Like would it be a comprehensive guide to IO statistics and what their significance/purposes are?

- Melanie

[1] https://www.postgresql.org/message-id/20221002172404.xyzhftbedh4zpio2%40awork3.anarazel.de
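For illustration, combining the proposed "unit" column with the block_size GUC to recover bytes might look like this (the column names follow the patch under discussion, not a released view, so treat this purely as a sketch):

```sql
-- Convert block-oriented counts to bytes by multiplying by the
-- (read-only, build-time) block_size GUC; rows whose unit is not
-- block-oriented are excluded by the WHERE clause.
SELECT backend_type, io_context,
       read    * current_setting('block_size')::bigint AS bytes_read,
       written * current_setting('block_size')::bigint AS bytes_written
FROM pg_stat_io
WHERE unit = 'block_size';
```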
v31 failed in CI, so I've attached v32 which has a few issues fixed:
- addressed some compiler warnings I hadn't noticed locally
- autovac launcher and worker do indeed use bulkread strategy if they end up starting before critical indexes have loaded and end up doing a sequential scan of some catalog tables, so I have changed the restrictions on BackendTypes allowed to track IO Operations in IOCONTEXT_BULKREAD
- changed the name of the column "fsynced" to "files_synced" to make it more clear what unit it is in (and that the unit differs from that of the "unit" column)

In an off-list discussion with Andres, he mentioned that he thought buffers reused by a BufferAccessStrategy should be split from buffers "acquired" and that "acquired" should be renamed "clocksweeps".

I have started doing this, but for BufferAccessStrategy IO there are a few choices about how we want to count the clocksweeps.

Currently the following situations are counted under the following IOContexts and IOOps:

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_ACQUIRE
- reuse a buffer from the ring

IOCONTEXT_SHARED, IOOP_ACQUIRE
- add a buffer to the strategy ring initially
- add a new shared buffer to the ring when all the existing buffers in the ring are pinned

And in the new paradigm, I think these are two good options:

1)
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
- add a buffer to the strategy ring initially
- add a new shared buffer to the ring when all the existing buffers in the ring are pinned

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
- reuse a buffer from the ring

2)
IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
- add a buffer to the strategy ring initially

IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
- reuse a buffer from the ring

IOCONTEXT_SHARED, IOOP_CLOCKSWEEP
- add a new shared buffer to the ring when all the existing buffers in the ring are pinned

However, if we want to differentiate between buffers initially added to the ring and buffers taken from shared buffers and added to the ring because all strategy ring buffers are pinned or have a usage count above one, then we would need to either do so inside of GetBufferFromRing() or propagate this distinction out somehow (easy enough if we care to do it).

There are other combinations that I could come up with a justification for as well, but I wanted to know what other people thought made sense (and would make sense to users).

- Melanie
I've gone ahead and implemented option 1 (commented below).

On Thu, Oct 6, 2022 at 6:23 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> v31 failed in CI, so I've attached v32 which has a few issues fixed:
> - addressed some compiler warnings I hadn't noticed locally
> - autovac launcher and worker do indeed use bulkread strategy if they
> end up starting before critical indexes have loaded and end up doing a
> sequential scan of some catalog tables, so I have changed the
> restrictions on BackendTypes allowed to track IO Operations in
> IOCONTEXT_BULKREAD
> - changed the name of the column "fsynced" to "files_synced" to make it
> more clear what unit it is in (and that the unit differs from that of
> the "unit" column)
>
> In an off-list discussion with Andres, he mentioned that he thought
> buffers reused by a BufferAccessStrategy should be split from buffers
> "acquired" and that "acquired" should be renamed "clocksweeps".
>
> I have started doing this, but for BufferAccessStrategy IO there are a
> few choices about how we want to count the clocksweeps:
>
> Currently the following situations are counted under the following
> IOContexts and IOOps:
>
> IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_ACQUIRE
> - reuse a buffer from the ring
>
> IOCONTEXT_SHARED, IOOP_ACQUIRE
> - add a buffer to the strategy ring initially
> - add a new shared buffer to the ring when all the existing buffers in
> the ring are pinned
>
> And in the new paradigm, I think these are two good options:
>
> 1)
> IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
> - add a buffer to the strategy ring initially
> - add a new shared buffer to the ring when all the existing buffers in
> the ring are pinned
>
> IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
> - reuse a buffer from the ring

I've implemented this option in attached v33.

> 2)
> IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_CLOCKSWEEP
> - add a buffer to the strategy ring initially
>
> IOCONTEXT_[VACUUM,BULKREAD,BULKWRITE], IOOP_REUSE
> - reuse a buffer from the ring
>
> IOCONTEXT SHARED, IOOP_CLOCKSWEEP
> - add a new shared buffer to the ring when all the existing buffers in
> the ring are pinned

- Melanie
Thanks for working on this! Like Lukas, I'm excited to see more visibility into important parts of the system like this.

On Mon, Oct 10, 2022 at 11:49 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> I've gone ahead and implemented option 1 (commented below).

No strong opinion on 1 versus 2, but I guess at least partly because I don't understand the implications (I do understand the difference, just not when it might be important in terms of stats). Can we think of a situation where combining stats about initial additions with pinned additions hides some behavior that might be good to understand and hard to pinpoint otherwise?

I took a look at the latest docs (as someone mostly familiar with internals at only a pretty high level, so probably somewhat close to the target audience) and have some feedback.

+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>backend_type</structfield> <type>text</type>
+      </para>
+      <para>
+       Type of backend (e.g. background worker, autovacuum worker).
+      </para></entry>
+     </row>

Not critical, but is there a list of backend types we could cross-reference elsewhere in the docs?

From the io_context column description:

+       The autovacuum daemon, explicit <command>VACUUM</command>, explicit
+       <command>ANALYZE</command>, many bulk reads, and many bulk writes use a
+       fixed amount of memory, acquiring the equivalent number of shared
+       buffers and reusing them circularly to avoid occupying an undue portion
+       of the main shared buffer pool.
+      </para></entry>

I don't understand how this is relevant to the io_context column. Could you expand on that, or am I just missing something obvious?

+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>extended</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Extends of relations done by this <varname>backend_type</varname> in
+       order to write data in this <varname>io_context</varname>.
+      </para></entry>
+     </row>

I understand what this is, but not why this is something I might want to know about.

And from your earlier e-mail:

On Thu, Oct 6, 2022 at 10:42 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Because we want to add non-block-oriented IO in the future (like
> temporary file IO) to this view and want to use the same "read",
> "written", "extended" columns, I would prefer not to prefix the columns
> with "blks_". I have added a column "unit" which would contain the unit
> in which read, written, and extended are in. Unfortunately, fsyncs are
> not per block, so "unit" doesn't really work for this. I documented
> this.
>
> The most correct thing to do to accommodate block-oriented and
> non-block-oriented IO would be to specify all the values in bytes.
> However, I would like this view to be usable visually (as opposed to
> just in scripts and by tools). The only current value of unit is
> "block_size" which could potentially be combined with the value of the
> GUC to get bytes.
>
> I've hard-coded the string "block_size" into the view generation
> function pg_stat_get_io(), so, if this idea makes sense, perhaps I
> should do something better there.

That seems broadly reasonable, but pg_settings also has a 'unit' field, and in that view, unit is '8kB' on my system--i.e., it (presumably) reflects the block size. Is that something we should try to be consistent with (not sure if that's a good idea, but thought it was worth asking)?

> On Fri, Sep 30, 2022 at 7:18 PM Lukas Fittl <lukas@fittl.com> wrote:
> > - Overall it would be helpful if we had a dedicated documentation page on I/O statistics that's linked from the pg_stat_io view description, and explains how the I/O statistics tie into the various concepts of shared buffers / buffer access strategies / etc (and what is not tracked today)
>
> I haven't done this yet. How specific were you thinking -- like
> interpretations of all the combinations and what to do with what you
> see? Like you should run pg_prewarm if you see X? Specific checkpointer
> or bgwriter GUCs to change? Or just links to other docs pages on
> recommended tunings?
>
> Were you imagining the other IO statistics views (like
> pg_statio_all_tables and pg_stat_database) also being included in this
> page? Like would it be a comprehensive guide to IO statistics and what
> their significance/purposes are?

I can't speak for Lukas here, but I encouraged him to suggest more thorough documentation in general, so I can speak to my concerns: in general, these stats should be usable for someone who does not know much about Postgres internals. It's pretty low-level information, sure, so I think you need some understanding of how the system broadly works to make sense of it. But ideally you should be able to find what you need to understand the concepts involved within the docs.

I think your updated docs are much clearer (with the caveats of my specific comments above). It would still probably be helpful to have a dedicated page on I/O stats (and yeah, something with a broad scope, along the lines of a comprehensive guide), but I think that can wait until a future patch.

Thanks,
Maciek
On Mon, Oct 10, 2022 at 7:43 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > > Thanks for working on this! Like Lukas, I'm excited to see more > visibility into important parts of the system like this. Thanks for taking another look! > > On Mon, Oct 10, 2022 at 11:49 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > I've gone ahead and implemented option 1 (commented below). > > No strong opinion on 1 versus 2, but I guess at least partly because I > don't understand the implications (I do understand the difference, > just not when it might be important in terms of stats). Can we think > of a situation where combining stats about initial additions with > pinned additions hides some behavior that might be good to understand > and hard to pinpoint otherwise? I think that it makes sense to count both the initial buffers added to the ring and subsequent shared buffers added to the ring (either when the current strategy buffer is pinned or in use or when a bulkread rejects dirty strategy buffers in favor of new shared buffers) as strategy clocksweeps because of how the statistic would be used. Clocksweeps give you an idea of how much of your working set is cached (setting aside initially reading data into shared buffers when you are warming up the db). You may use clocksweeps to determine if you need to make shared buffers larger. Distinguishing strategy buffer clocksweeps from shared buffer clocksweeps allows us to avoid enlarging shared buffers if most of the clocksweeps are to bring in blocks for the strategy operation. However, I could see an argument that discounting strategy clocksweeps done because the current strategy buffer is pinned makes the number of shared buffer clocksweeps artificially low since those other queries using the buffer would have suffered a cache miss were it not for the strategy. And, in this case, you would take strategy clocksweeps together with shared clocksweeps to make your decision. 
And if we include buffers initially added to the strategy ring in the strategy clocksweep statistic, this number may be off because those blocks may not be needed in the main shared working set. But you won't know that until you try to reuse the buffer and it is pinned. So, I think we don't have a better option than counting initial buffers added to the ring as strategy clocksweeps (as opposed to as reuses). So, in answer to your question, no, I cannot think of a scenario like that. Sitting down and thinking about that for a long time did, however, help me realize that some of my code comments were misleading (and some incorrect). I will update these in the next version once we agree on updated docs. It also made me remember that I am incorrectly counting rejected buffers as reused. I'm not sure if it is a good idea to subtract from reuses when a buffer is rejected. Waiting until after it is rejected to count the reuse will take some other code changes. Perhaps we could also count rejections in the stats? > > I took a look at the latest docs (as someone mostly familiar with > internals at only a pretty high level, so probably somewhat close to > the target audience) and have some feedback. > > + <row> > + <entry role="catalog_table_entry"><para > role="column_definition"> > + <structfield>backend_type</structfield> <type>text</type> > + </para> > + <para> > + Type of backend (e.g. background worker, autovacuum worker). > + </para></entry> > + </row> > > Not critical, but is there a list of backend types we could > cross-reference elsewhere in the docs? The most I could find was this longer explanation (with exhaustive list of types) in pg_stat_activity docs [1]. I could duplicate what it says or I could link to the view and say "see pg_stat_activity" for a description of backend_type" or something like that (to keep them from getting out of sync as new backend_types are added. 
I suppose I could also add docs on backend_types, but I'm not sure where something like that would go. > > From the io_context column description: > > + The autovacuum daemon, explicit <command>VACUUM</command>, > explicit > + <command>ANALYZE</command>, many bulk reads, and many bulk > writes use a > + fixed amount of memory, acquiring the equivalent number of > shared > + buffers and reusing them circularly to avoid occupying an > undue portion > + of the main shared buffer pool. > + </para></entry> > > I don't understand how this is relevant to the io_context column. > Could you expand on that, or am I just missing something obvious? > I'm trying to explain why those other IO Contexts exist (bulkread, bulkwrite, vacuum) and why they are separate from shared buffers. Should I cut it altogether or preface it with something like: these are counted separate from shared buffers because...? > + <row> > + <entry role="catalog_table_entry"><para > role="column_definition"> > + <structfield>extended</structfield> <type>bigint</type> > + </para> > + <para> > + Extends of relations done by this > <varname>backend_type</varname> in > + order to write data in this <varname>io_context</varname>. > + </para></entry> > + </row> > > I understand what this is, but not why this is something I might want > to know about. Unlike writes, backends largely have to do their own extends, so separating this from writes lets us determine whether or not we need to change checkpointer/bgwriter to be more aggressive using the writes without the distraction of the extends. Should I mention this in the docs? The other stats views don't seem to editorialize at all, and I wasn't sure if this was an objective enough point to include in docs. 
> > And from your earlier e-mail: > > On Thu, Oct 6, 2022 at 10:42 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > Because we want to add non-block-oriented IO in the future (like > > temporary file IO) to this view and want to use the same "read", > > "written", "extended" columns, I would prefer not to prefix the columns > > with "blks_". I have added a column "unit" which would contain the unit > > in which read, written, and extended are in. Unfortunately, fsyncs are > > not per block, so "unit" doesn't really work for this. I documented > > this. > > > > The most correct thing to do to accommodate block-oriented and > > non-block-oriented IO would be to specify all the values in bytes. > > However, I would like this view to be usable visually (as opposed to > > just in scripts and by tools). The only current value of unit is > > "block_size" which could potentially be combined with the value of the > > GUC to get bytes. > > > > I've hard-coded the string "block_size" into the view generation > > function pg_stat_get_io(), so, if this idea makes sense, perhaps I > > should do something better there. > > That seems broadly reasonable, but pg_settings also has a 'unit' > field, and in that view, unit is '8kB' on my system--i.e., it > (presumably) reflects the block size. Is that something we should try > to be consistent with (not sure if that's a good idea, but thought it > was worth asking)? > I think this idea is a good option. I am wondering if it would be clear when mixed with non-block-oriented IO. Block-oriented IO would say 8kB (or whatever the build-time value of a block was) and non-block-oriented IO would say B or kB. The math would work out. Looking at pg_settings now though, I am confused about how the units for wal_buffers is 8kB but then the value of wal_buffers when I show it in psql is "16MB"... 
Though the units for the pg_stat_io view for block-oriented IO would be the build-time values for block size, so it wouldn't line up exactly with pg_settings. However, I do like the idea of having a unit column that reflects the value and not the name of the GUC/setting which determined the unit. I can update this in the next version. - Melanie [1] https://www.postgresql.org/docs/15/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW
On Thu, Oct 13, 2022 at 10:29 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I think that it makes sense to count both the initial buffers added to > the ring and subsequent shared buffers added to the ring (either when > the current strategy buffer is pinned or in use or when a bulkread > rejects dirty strategy buffers in favor of new shared buffers) as > strategy clocksweeps because of how the statistic would be used. > > Clocksweeps give you an idea of how much of your working set is cached > (setting aside initially reading data into shared buffers when you are > warming up the db). You may use clocksweeps to determine if you need to > make shared buffers larger. > > Distinguishing strategy buffer clocksweeps from shared buffer > clocksweeps allows us to avoid enlarging shared buffers if most of the > clocksweeps are to bring in blocks for the strategy operation. > > However, I could see an argument that discounting strategy clocksweeps > done because the current strategy buffer is pinned makes the number of > shared buffer clocksweeps artificially low since those other queries > using the buffer would have suffered a cache miss were it not for the > strategy. And, in this case, you would take strategy clocksweeps > together with shared clocksweeps to make your decision. And if we > include buffers initially added to the strategy ring in the strategy > clocksweep statistic, this number may be off because those blocks may > not be needed in the main shared working set. But you won't know that > until you try to reuse the buffer and it is pinned. So, I think we don't > have a better option than counting initial buffers added to the ring as > strategy clocksweeps (as opposed to as reuses). > > So, in answer to your question, no, I cannot think of a scenario like > that. That analysis makes sense to me; thanks. > It also made me remember that I am incorrectly counting rejected buffers > as reused. 
I'm not sure if it is a good idea to subtract from reuses > when a buffer is rejected. Waiting until after it is rejected to count > the reuse will take some other code changes. Perhaps we could also count > rejections in the stats? I'm not sure what makes sense here. > > Not critical, but is there a list of backend types we could > > cross-reference elsewhere in the docs? > > The most I could find was this longer explanation (with exhaustive list > of types) in pg_stat_activity docs [1]. I could duplicate what it says > or I could link to the view and say "see pg_stat_activity" for a > description of backend_type" or something like that (to keep them from > getting out of sync as new backend_types are added. I suppose I could > also add docs on backend_types, but I'm not sure where something like > that would go. I think linking pg_stat_activity is reasonable for now. A separate section for this might be nice at some point, but that seems out of scope. > > From the io_context column description: > > > > + The autovacuum daemon, explicit <command>VACUUM</command>, > > explicit > > + <command>ANALYZE</command>, many bulk reads, and many bulk > > writes use a > > + fixed amount of memory, acquiring the equivalent number of > > shared > > + buffers and reusing them circularly to avoid occupying an > > undue portion > > + of the main shared buffer pool. > > + </para></entry> > > > > I don't understand how this is relevant to the io_context column. > > Could you expand on that, or am I just missing something obvious? > > > > I'm trying to explain why those other IO Contexts exist (bulkread, > bulkwrite, vacuum) and why they are separate from shared buffers. > Should I cut it altogether or preface it with something like: these are > counted separate from shared buffers because...? Oh I see. That makes sense; it just wasn't obvious to me this was talking about the last three values of io_context. 
I think a brief preface like that would be helpful (maybe explicitly with "these last three values", and I think "counted separately"). > > + <row> > > + <entry role="catalog_table_entry"><para > > role="column_definition"> > > + <structfield>extended</structfield> <type>bigint</type> > > + </para> > > + <para> > > + Extends of relations done by this > > <varname>backend_type</varname> in > > + order to write data in this <varname>io_context</varname>. > > + </para></entry> > > + </row> > > > > I understand what this is, but not why this is something I might want > > to know about. > > Unlike writes, backends largely have to do their own extends, so > separating this from writes lets us determine whether or not we need to > change checkpointer/bgwriter to be more aggressive using the writes > without the distraction of the extends. Should I mention this in the > docs? The other stats views don't seems to editorialize at all, and I > wasn't sure if this was an objective enough point to include in docs. Thanks for the clarification. Just to make sure I understand, you mean that if I see a high extended count, that may be interesting in terms of write activity, but I can't fix that by tuning--it's just the nature of my workload? I think you're right that this is not objective enough. It's unfortunate that there's not a good place in the docs for info like that, since stats like this are hard to interpret without that context, but I admit that it's not really this patch's job to solve that larger issue. > > That seems broadly reasonable, but pg_settings also has a 'unit' > > field, and in that view, unit is '8kB' on my system--i.e., it > > (presumably) reflects the block size. Is that something we should try > > to be consistent with (not sure if that's a good idea, but thought it > > was worth asking)? > > > > I think this idea is a good option. I am wondering if it would be clear > when mixed with non-block-oriented IO. 
Block-oriented IO would say 8kB > (or whatever the build-time value of a block was) and non-block-oriented > IO would say B or kB. The math would work out. Right, yeah. Although maybe that's a little confusing? When you originally added "unit", you had said: >The most correct thing to do to accommodate block-oriented and >non-block-oriented IO would be to specify all the values in bytes. >However, I would like this view to be usable visually (as opposed to >just in scripts and by tools). The only current value of unit is >"block_size" which could potentially be combined with the value of the >GUC to get bytes. Is this still usable visually if you have to compare values across units? I don't really have any great ideas here (and maybe this is still the best option), just pointing it out. > Looking at pg_settings now though, I am confused about > how the units for wal_buffers is 8kB but then the value of wal_buffers > when I show it in psql is "16MB"... You mean the difference between maciek=# select setting, unit from pg_settings where name = 'wal_buffers'; setting | unit ---------+------ 512 | 8kB (1 row) and maciek=# show wal_buffers; wal_buffers ------------- 4MB (1 row) ? Poking around, I think it looks like that's due to convert_int_from_base_unit (indirectly called from SHOW / current_setting): /* * Convert an integer value in some base unit to a human-friendly unit. * * The output unit is chosen so that it's the greatest unit that can represent * the value without loss. For example, if the base unit is GUC_UNIT_KB, 1024 * is converted to 1 MB, but 1025 is represented as 1025 kB. */ > Though the units for the pg_stat_io view for block-oriented IO would be > the build-time values for block size, so it wouldn't line up exactly > with pg_settings. I don't follow--what would be the discrepancy?
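Incidentally, the rule in that comment is easy to model. Here's a toy Python sketch (not the actual C implementation) that reproduces the wal_buffers example above:

```python
def humanize_kb_setting(value, base_unit_kb):
    # Pick the largest unit that can represent the value without loss,
    # mirroring the comment on convert_int_from_base_unit().
    total_kb = value * base_unit_kb
    for factor, name in ((1024**3, "TB"), (1024**2, "GB"), (1024, "MB")):
        if total_kb >= factor and total_kb % factor == 0:
            return f"{total_kb // factor}{name}"
    return f"{total_kb}kB"

# 512 units of 8kB -> "4MB", which is what SHOW wal_buffers displays,
# while 1025 units of 1kB stays "1025kB" because 1MB would lose precision.
```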
v34 is attached. I think the column names need discussion. Also, the docs need more work (I added a lot of new content there). I could use feedback on the column names and definitions and review/rephrasing ideas for the docs additions. On Mon, Oct 17, 2022 at 1:28 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > > On Thu, Oct 13, 2022 at 10:29 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I think that it makes sense to count both the initial buffers added to > > the ring and subsequent shared buffers added to the ring (either when > > the current strategy buffer is pinned or in use or when a bulkread > > rejects dirty strategy buffers in favor of new shared buffers) as > > strategy clocksweeps because of how the statistic would be used. > > > > Clocksweeps give you an idea of how much of your working set is cached > > (setting aside initially reading data into shared buffers when you are > > warming up the db). You may use clocksweeps to determine if you need to > > make shared buffers larger. > > > > Distinguishing strategy buffer clocksweeps from shared buffer > > clocksweeps allows us to avoid enlarging shared buffers if most of the > > clocksweeps are to bring in blocks for the strategy operation. > > > > However, I could see an argument that discounting strategy clocksweeps > > done because the current strategy buffer is pinned makes the number of > > shared buffer clocksweeps artificially low since those other queries > > using the buffer would have suffered a cache miss were it not for the > > strategy. And, in this case, you would take strategy clocksweeps > > together with shared clocksweeps to make your decision. And if we > > include buffers initially added to the strategy ring in the strategy > > clocksweep statistic, this number may be off because those blocks may > > not be needed in the main shared working set. But you won't know that > > until you try to reuse the buffer and it is pinned. 
So, I think we don't > > have a better option than counting initial buffers added to the ring as > > strategy clocksweeps (as opposed to as reuses). > > > > So, in answer to your question, no, I cannot think of a scenario like > > that. > > That analysis makes sense to me; thanks. I have made some major changes in this area to make the columns more useful. I have renamed and split "clocksweeps". It is now "evicted" and "freelist acquired". This makes it clear when a block must be evicted from a shared buffer and may help to identify misconfiguration of shared buffers. There is some nuance here that I tried to make clear in the docs. "freelist acquired" in a shared context is straightforward. "freelist acquired" in a strategy context is counted when a shared buffer is added to the strategy ring (not when it is reused). "freelist acquired" in the local buffer context is actually the initial allocation of a local buffer (in contrast with reuse). "evicted" in the shared IOContext is a block being evicted from a shared buffer in order to reuse that buffer when not using a strategy. "evicted" in a strategy IOContext is a block being evicted from a shared buffer in order to add that shared buffer to the strategy ring. This is in contrast with "reused" in a strategy IOContext, which is when an existing buffer in the strategy ring has a block evicted in order to reuse that buffer in a strategy context. "evicted" in a local IOContext is when an existing local buffer has a block evicted in order to reuse that local buffer. "freelist_acquired" is confusing for local buffers, but I wanted to distinguish between reuse/eviction of local buffers and initial allocation. "freelist_acquired" seemed more fitting because there is a clocksweep to find a local buffer and, if it hasn't been allocated yet, it is allocated in a place similar to where shared buffers acquire a buffer from the freelist. 
If I didn't count it here, I would need to make a new column only for local buffers called "allocated" or something like that. I chose not to call "evicted" "sb_evicted" because then we would need a separate "local_evicted". I could instead make "local_evicted", "sb_evicted", and rename "reused" to "strat_evicted". If I did that we would end up with separate columns for every IO Context describing behavior when a buffer is initially acquired vs when it is reused. It would look something like this:

shared buffers:
  initial: freelist_acquired
  reused: sb_evicted

local buffers:
  initial: allocated
  reused: local_evicted

strategy buffers:
  initial: sb_evicted | freelist_acquired
  reused: strat_evicted
  replaced: sb_evicted | freelist_acquired

This seems not too bad at first, but if you consider that later we will add other kinds of IO -- eg WAL IO or temporary file IO, we won't be able to use these existing columns and will need to add even more columns describing the exact behavior in those cases. I wanted to devise a paradigm which allowed for reuse of columns across IOContexts even if with slightly different meanings. I have also added the columns "repossessed" and "rejected". "rejected" is when a bulkread rejects a strategy buffer because it is dirty and requires flush. Seeing a lot of rejections could indicate you need to vacuum. "repossessed" is the number of times a strategy buffer was pinned or in use by another backend and had to be removed from the strategy ring and replaced with a new shared buffer. This gives you some indication that there is contention on blocks recently used by a strategy. I've also added some descriptions to the docs of how these columns might be used or what a large value in one of them may mean. I haven't added tests for repossessed or rejected yet. I can add tests for repossessed if we decide to keep it. 
Rejected is hard to write a test for because we can't guarantee checkpointer won't clean up the buffer before we can reject it > > > It also made me remember that I am incorrectly counting rejected buffers > > as reused. I'm not sure if it is a good idea to subtract from reuses > > when a buffer is rejected. Waiting until after it is rejected to count > > the reuse will take some other code changes. Perhaps we could also count > > rejections in the stats? > > I'm not sure what makes sense here. I have fixed the counting of rejected and have made a new column dedicated to rejected. > > > > From the io_context column description: > > > > > > + The autovacuum daemon, explicit <command>VACUUM</command>, > > > explicit > > > + <command>ANALYZE</command>, many bulk reads, and many bulk > > > writes use a > > > + fixed amount of memory, acquiring the equivalent number of > > > shared > > > + buffers and reusing them circularly to avoid occupying an > > > undue portion > > > + of the main shared buffer pool. > > > + </para></entry> > > > > > > I don't understand how this is relevant to the io_context column. > > > Could you expand on that, or am I just missing something obvious? > > > > > > > I'm trying to explain why those other IO Contexts exist (bulkread, > > bulkwrite, vacuum) and why they are separate from shared buffers. > > Should I cut it altogether or preface it with something like: these are > > counted separate from shared buffers because...? > > Oh I see. That makes sense; it just wasn't obvious to me this was > talking about the last three values of io_context. I think a brief > preface like that would be helpful (maybe explicitly with "these last > three values", and I think "counted separately"). I've done this. Thanks for the suggested wording. 
> > > > + <row> > > > + <entry role="catalog_table_entry"><para > > > role="column_definition"> > > > + <structfield>extended</structfield> <type>bigint</type> > > > + </para> > > > + <para> > > > + Extends of relations done by this > > > <varname>backend_type</varname> in > > > + order to write data in this <varname>io_context</varname>. > > > + </para></entry> > > > + </row> > > > > > > I understand what this is, but not why this is something I might want > > > to know about. > > > > Unlike writes, backends largely have to do their own extends, so > > separating this from writes lets us determine whether or not we need to > > change checkpointer/bgwriter to be more aggressive using the writes > > without the distraction of the extends. Should I mention this in the > > docs? The other stats views don't seems to editorialize at all, and I > > wasn't sure if this was an objective enough point to include in docs. > > Thanks for the clarification. Just to make sure I understand, you mean > that if I see a high extended count, that may be interesting in terms > of write activity, but I can't fix that by tuning--it's just the > nature of my workload? That is correct. > > > > That seems broadly reasonable, but pg_settings also has a 'unit' > > > field, and in that view, unit is '8kB' on my system--i.e., it > > > (presumably) reflects the block size. Is that something we should try > > > to be consistent with (not sure if that's a good idea, but thought it > > > was worth asking)? > > > > > > > I think this idea is a good option. I am wondering if it would be clear > > when mixed with non-block-oriented IO. Block-oriented IO would say 8kB > > (or whatever the build-time value of a block was) and non-block-oriented > > IO would say B or kB. The math would work out. > > Right, yeah. Although maybe that's a little confusing? 
When you > originally added "unit", you had said: > > >The most correct thing to do to accommodate block-oriented and > >non-block-oriented IO would be to specify all the values in bytes. > >However, I would like this view to be usable visually (as opposed to > >just in scripts and by tools). The only current value of unit is > >"block_size" which could potentially be combined with the value of the > >GUC to get bytes. > > Is this still usable visually if you have to compare values across > units? I don't really have any great ideas here (and maybe this is > still the best option), just pointing it out. > > > Looking at pg_settings now though, I am confused about > > how the units for wal_buffers is 8kB but then the value of wal_buffers > > when I show it in psql is "16MB"... > > You mean the difference between > > maciek=# select setting, unit from pg_settings where name = 'wal_buffers'; > setting | unit > ---------+------ > 512 | 8kB > (1 row) > > and > > maciek=# show wal_buffers; > wal_buffers > ------------- > 4MB > (1 row) > > ? > > Poking around, I think it looks like that's due to > convert_int_from_base_unit (indirectly called from SHOW / > current_setting): > > /* > * Convert an integer value in some base unit to a human-friendly > unit. > * > * The output unit is chosen so that it's the greatest unit that can > represent > * the value without loss. For example, if the base unit is > GUC_UNIT_KB, 1024 > * is converted to 1 MB, but 1025 is represented as 1025 kB. > */ I've implemented a change using the same function pg_settings uses to turn the build-time parameter BLCKSZ into 8kB (get_config_unit_name()) using the flag GUC_UNIT_BLOCKS. I am unsure if this is better or worse than "block_size". I am feeling very conflicted about this column. > > > Though the units for the pg_stat_io view for block-oriented IO would be > > the build-time values for block size, so it wouldn't line up exactly > > with pg_settings. 
> > I don't follow--what would be the discrepancy? I got confused. You are right -- pg_settings does seem to use the build-time value of BLCKSZ to derive this. I was confused because the description of pg_settings says: "The view pg_settings provides access to run-time parameters of the server." - Melanie
Hi, - we shouldn't do pgstat_count_io_op() while the buffer header lock is held, if possible. I wonder if we should add a "source" output argument to StrategyGetBuffer(). Then nearly all the counting can happen in BufferAlloc(). - "repossession" is a very unintuitive name for me. If we want something like it, can't we just name it reuse_failed or such? - Wonder if the column names should be reads, writes, extends, etc instead of the current naming pattern - Is it actually correct to count evictions in StrategyGetBuffer()? What if we then decide to not use that buffer in BufferAlloc()? Yes, that'll be counted via rejected, but that still leaves the eviction count to be "misleading"? On 2022-10-19 15:26:51 -0400, Melanie Plageman wrote: > I have made some major changes in this area to make the columns more > useful. I have renamed and split "clocksweeps". It is now "evicted" and > "freelist acquired". This makes it clear when a block must be evicted > from a shared buffer must be and may help to identify misconfiguration > of shared buffers. I'm not sure freelist acquired is really that useful? If we don't add it, we should however definitely not count buffers from the freelist as evictions. > There is some nuance here that I tried to make clear in the docs. > "freelist acquired" in a shared context is straightforward. > "freelist acquired" in a strategy context is counted when a shared > buffer is added to the strategy ring (not when it is reused). Not sure what the second half here means - why would a buffer that's not from the freelist ever be counted as being from the freelist? > "freelist_acquired" is confusing for local buffers but I wanted to > distinguish between reuse/eviction of local buffers and initial > allocation. "freelist_acquired" seemed more fitting because there is a > clocksweep to find a local buffer and if it hasn't been allocated yet it > is allocated in a place similar to where shared buffers acquire a buffer > from the freelist. 
If I didn't count it here, I would need to make a new > column only for local buffers called "allocated" or something like that. I think you're making this too granular. We need to have more detail than today. But we don't necessarily need to catch every nuance. > I chose not to call "evicted" "sb_evicted" > because then we would need a separate "local_evicted". I could instead > make "local_evicted", "sb_evicted", and rename "reused" to > "strat_evicted". If I did that we would end up with separate columns for > every IO Context describing behavior when a buffer is initially acquired > vs when it is reused. > > It would look something like this: > > shared buffers: > initial: freelist_acquired > reused: sb_evicted > > local buffers: > initial: allocated > reused: local_evicted > > strategy buffers: > initial: sb_evicted | freelist_acquired > reused: strat_evicted > replaced: sb_evicted | freelist_acquired > > This seems not too bad at first, but if you consider that later we will > add other kinds of IO -- eg WAL IO or temporary file IO, we won't be > able to use these existing columns and will need to add even more > columns describing the exact behavior in those cases. I think it's clearly not the right direction. > I have also added the columns "repossessed" and "rejected". "rejected" > is when a bulkread rejects a strategy buffer because it is dirty and > requires flush. Seeing a lot of rejections could indicate you need to > vacuum. "repossessed" is the number of times a strategy buffer was > pinned or in use by another backend and had to be removed from the > strategy ring and replaced with a new shared buffer. This gives you some > indication that there is contention on blocks recently used by a > strategy. I don't immediately see a real use case for repossessed. Why isn't it sufficient to count it as part of rejected? Greetings, Andres Freund
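On the first point above (not calling pgstat_count_io_op() while the buffer header lock is held), the shape of the fix is just to record what happened under the lock and bump the counter after releasing it. A toy Python sketch of that pattern -- all names here are made up for illustration; the real code is C in bufmgr.c:

```python
import threading

buf_hdr_lock = threading.Lock()   # stand-in for the buffer header spinlock
io_ops = {"evicted": 0}           # stand-in for the pgstat IO counters

def claim_buffer(buf):
    # Under the lock: only note whether the buffer held valid contents.
    with buf_hdr_lock:
        evicted = buf.get("valid", False)
        buf["valid"] = False
    # After releasing the lock: do the stats accounting.
    if evicted:
        io_ops["evicted"] += 1
    return buf

claim_buffer({"valid": True})    # claiming a valid buffer counts as an eviction
claim_buffer({"valid": False})   # claiming an invalid buffer does not
```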
On Wed, Oct 19, 2022 at 12:27 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > v34 is attached. > I think the column names need discussion. Also, the docs need more work > (I added a lot of new content there). I could use feedback on the column > names and definitions and review/rephrasing ideas for the docs > additions. Nice! I think the expanded docs are great, and make this information much easier to interpret. >+ <varname>io_context</varname> <literal>bulkread</literal>, existing >+ dirty buffers in the ring requirng flush are "requiring" >+ shared buffers were acquired from the freelist and added to the >+ fixed-size strategy ring buffer. Shared buffers are added to the >+ strategy ring lazily. If the current buffer in the ring is pinned or in This is the first mention of the term "strategy" in these docs. It's not totally opaque, since there's some context, but maybe we should either try to avoid that term or define it more explicitly? >+ <varname>io_context</varname>s. This is equivalent to >+ <varname>evicted</varname> for shared buffers in >+ <varname>io_context</varname> <literal>shared</literal>, as the contents >+ of the buffer are <quote>evicted</quote> but refers to the case when the I don't quite follow this: does this mean that I should expect 'reused' and 'evicted' to be equal in the 'shared' context, because they represent the same thing? Or will 'reused' just be null because it's not distinct from 'evicted'? It looks like it's null right now, but I find the wording here confusing. >+ future with a new shared buffer. A high number of >+ <literal>bulkread</literal> rejections can indicate a need for more >+ frequent vacuuming or more aggressive autovacuum settings, as buffers are >+ dirtied during a bulkread operation when updating the hint bit or when >+ performing on-access pruning. This is great. Just wanted to re-iterate that notes like this are really helpful to understanding this view. 
> I've implemented a change using the same function pg_settings uses to > turn the build-time parameter BLCKSZ into 8kB (get_config_unit_name()) > using the flag GUC_UNIT_BLOCKS. I am unsure if this is better or worse > than "block_size". I am feeling very conflicted about this column. Yeah, I guess it feels less natural here than in pg_settings, but it still kind of feels like one way of doing this is better than two...
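Whichever spelling wins, a consumer ends up doing the same normalization; e.g. a hypothetical client-side helper (the unit strings and the 8192 default are assumptions about the view, not its settled contract):

```python
def io_bytes(count, unit, block_size=8192):
    # Normalize a pg_stat_io-style count to bytes, whether the unit
    # column says "block_size" or a concrete size like "8kB".
    factors = {"block_size": block_size, "8kB": 8192, "kB": 1024, "B": 1}
    return count * factors[unit]
```

Either representation round-trips to the same byte count, which is perhaps an argument that the choice is mostly cosmetic.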
On Thu, Oct 20, 2022 at 10:31 AM Andres Freund <andres@anarazel.de> wrote: > - "repossession" is a very unintuitive name for me. If we want something like > it, can't we just name it reuse_failed or such? +1, I think "repossessed" is awkward. I think "reuse_failed" works, but no strong opinions on an alternate name. > - Wonder if the column names should be reads, writes, extends, etc instead of > the current naming pattern Why? Lukas suggested alignment with existing views like pg_stat_database and pg_stat_statements. It doesn't make sense to use the blks_ prefix since it's not all blocks, but otherwise it seems like we should be consistent, no? > > "freelist_acquired" is confusing for local buffers but I wanted to > > distinguish between reuse/eviction of local buffers and initial > > allocation. "freelist_acquired" seemed more fitting because there is a > > clocksweep to find a local buffer and if it hasn't been allocated yet it > > is allocated in a place similar to where shared buffers acquire a buffer > > from the freelist. If I didn't count it here, I would need to make a new > > column only for local buffers called "allocated" or something like that. > > I think you're making this too granular. We need to have more detail than > today. But we don't necessarily need to catch every nuance. In general I agree that coarser granularity here may be easier to use. I do think the current docs explain what's going on pretty well, though, and I worry if merging too many concepts will make that harder to follow. But if a less detailed breakdown still communicates potential problems, +1. > > This seems not too bad at first, but if you consider that later we will > > add other kinds of IO -- eg WAL IO or temporary file IO, we won't be > > able to use these existing columns and will need to add even more > > columns describing the exact behavior in those cases. > > I think it's clearly not the right direction. +1, I think the existing approach makes more sense.
On Thu, Oct 20, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > - we shouldn't do pgstat_count_io_op() while the buffer header lock is held, > if possible. I've changed this locally. It will be fixed in the next version I share. > > I wonder if we should add a "source" output argument to > StrategyGetBuffer(). Then nearly all the counting can happen in > BufferAlloc(). I think we can just check for BM_VALID being set before invalidating it in order to claim the buffer at the end of BufferAlloc(). Then we can count it as an eviction or reuse. > > - "repossession" is a very unintuitive name for me. If we want something like > it, can't we just name it reuse_failed or such? Repossession could be called eviction_failed or reuse_failed. Do we think we will ever want to use it to count buffers we released in other IOContexts (thus making the name eviction_failed better than reuse_failed)? > - Is it actually correct to count evictions in StrategyGetBuffer()? What if we > then decide to not use that buffer in BufferAlloc()? Yes, that'll be counted > via rejected, but that still leaves the eviction count to be "misleading"? I agree that counting evictions in StrategyGetBuffer() is incorrect. Checking BM_VALID at bottom of BufferAlloc() should be better. > On 2022-10-19 15:26:51 -0400, Melanie Plageman wrote: > > I have made some major changes in this area to make the columns more > > useful. I have renamed and split "clocksweeps". It is now "evicted" and > > "freelist acquired". This makes it clear when a block must be evicted > > from a shared buffer must be and may help to identify misconfiguration > > of shared buffers. > > I'm not sure freelist acquired is really that useful? If we don't add it, we > should however definitely not count buffers from the freelist as evictions. > > > > There is some nuance here that I tried to make clear in the docs. > > "freelist acquired" in a shared context is straightforward. 
> > "freelist acquired" in a strategy context is counted when a shared > > buffer is added to the strategy ring (not when it is reused). > > Not sure what the second half here means - why would a buffer that's not from > the freelist ever be counted as being from the freelist? > > > > "freelist_acquired" is confusing for local buffers but I wanted to > > distinguish between reuse/eviction of local buffers and initial > > allocation. "freelist_acquired" seemed more fitting because there is a > > clocksweep to find a local buffer and if it hasn't been allocated yet it > > is allocated in a place similar to where shared buffers acquire a buffer > > from the freelist. If I didn't count it here, I would need to make a new > > column only for local buffers called "allocated" or something like that. > > I think you're making this too granular. We need to have more detail than > today. But we don't necessarily need to catch every nuance. > I am fine with cutting freelist_acquired. The same actionable information that it could provide could be provided by "read", right? Also, removing it means I can remove the complicated explanation of how freelist_acquired should be interpreted in IOCONTEXT_LOCAL. Speaking of IOCONTEXT_LOCAL, I was wondering if it is confusing to call it IOCONTEXT_LOCAL since it refers to IO done for temporary tables. What if, in the future, we want to track other IO done using data in local memory? Also, what if we want to track other IO done using data from shared memory that is not in shared buffers? Would IOCONTEXT_SB and IOCONTEXT_TEMP be better? Should IOContext literally describe the context of the IO being done and there be a separate column which indicates the source of the data for the IO? Like wal_buffer, local_buffer, shared_buffer? Then if it is not block-oriented, it could be shared_mem, local_mem, or bypass? 
If we had another dimension to the matrix "data_src" which, with block-oriented IO is equivalent to "buffer type", this could help with some of the clarity problems. We could remove the "reused" column and that becomes:

IOCONTEXT | DATA_SRC        | IOOP
----------------------------------------
strategy  | strategy_buffer | EVICT

Having data_src and iocontext simplifies the meaning of all io operations involving a strategy. Some operations are done on shared buffers and some on existing strategy buffers and this would be more clear without the addition of special columns for strategies. > > I have also added the columns "repossessed" and "rejected". "rejected" > > is when a bulkread rejects a strategy buffer because it is dirty and > > requires flush. Seeing a lot of rejections could indicate you need to > > vacuum. "repossessed" is the number of times a strategy buffer was > > pinned or in use by another backend and had to be removed from the > > strategy ring and replaced with a new shared buffer. This gives you some > > indication that there is contention on blocks recently used by a > > strategy. > > I don't immediately see a real use case for repossessed. Why isn't it > sufficient to count it as part of rejected? I'm still on the fence about combining rejection and reuse_failed. A buffer rejected by a bulkread for being dirty may indicate the need to vacuum but doesn't say anything about contention. Whereas, failed reuses indicate contention for the blocks operated on by the strategy. You would react to them differently. And you could have a bulkread racking up both failed reuses and rejections. If this seems like an unlikely or niche case, I would be okay with combining rejections with reuse_failed. But it would be nice if we could help with interpreting the column. I wonder if there is a rule of thumb for determining which scenario you have. 
For example, how likely is it that if you see a high number of reuse_rejected in a bulkread IOContext that you would see any reused if the rejections are due to the bulkread dirtying its own buffers? I suppose it would depend on your workload and how random your updates/deletes were? If there is some way to use reuse_rejected in combination with another column to determine the cause of the rejections, it would be easier to combine them. - Melanie
On Sun, Oct 23, 2022 at 6:35 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > > On Wed, Oct 19, 2022 at 12:27 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > v34 is attached. > > I think the column names need discussion. Also, the docs need more work > > (I added a lot of new content there). I could use feedback on the column > > names and definitions and review/rephrasing ideas for the docs > > additions. > > Nice! I think the expanded docs are great, and make this information > much easier to interpret. > > >+ <varname>io_context</varname> <literal>bulkread</literal>, existing > >+ dirty buffers in the ring requirng flush are > > "requiring" Thanks! > > >+ shared buffers were acquired from the freelist and added to the > >+ fixed-size strategy ring buffer. Shared buffers are added to the > >+ strategy ring lazily. If the current buffer in the ring is pinned or in > > This is the first mention of the term "strategy" in these docs. It's > not totally opaque, since there's some context, but maybe we should > either try to avoid that term or define it more explicitly? > I am thinking it might be good to define the term strategy for use in this view documentation. In the IOContext column documentation, I've added this ... avoid occupying an undue portion of the main shared buffer pool. This pattern is called a Buffer Access Strategy and the fixed-size ring buffer can be referred to as a <quote>strategy ring buffer</quote>. </para></entry> I was thinking this would allow me to refer to the strategy ring buffer more easily. I fear simply referring to "the" ring buffer throughout this view documentation will be confusing. > >+ <varname>io_context</varname>s. 
This is equivalent to > >+ <varname>evicted</varname> for shared buffers in > >+ <varname>io_context</varname> <literal>shared</literal>, as the contents > >+ of the buffer are <quote>evicted</quote> but refers to the case when the > > I don't quite follow this: does this mean that I should expect > 'reused' and 'evicted' to be equal in the 'shared' context, because > they represent the same thing? Or will 'reused' just be null because > it's not distinct from 'evicted'? It looks like it's null right now, > but I find the wording here confusing. You should only see evictions when the strategy evicts shared buffers and reuses when the strategy evicts existing strategy buffers. How about this instead in this docs? the number of times an existing buffer in the strategy ring was reused as part of an operation in the <literal>bulkread</literal>, <literal>bulkwrite</literal>, or <literal>vacuum</literal> <varname>io_context</varname>s. when a buffer access strategy <quote>reuses</quote> a buffer in the strategy ring, it must evict its contents, incrementing <varname>reused</varname>. when a buffer access strategy adds a new shared buffer to the strategy ring and this shared buffer is occupied, the buffer access strategy must evict the contents of the shared buffer, incrementing <varname>evicted</varname>. > > I've implemented a change using the same function pg_settings uses to > > turn the build-time parameter BLCKSZ into 8kB (get_config_unit_name()) > > using the flag GUC_UNIT_BLOCKS. I am unsure if this is better or worse > > than "block_size". I am feeling very conflicted about this column. > > Yeah, I guess it feels less natural here than in pg_settings, but it > still kind of feels like one way of doing this is better than two... So, Andres pointed out that it would be nice to be able to multiply the unit column by the operation column (e.g. select unit * reused from pg_stat_io...) and get a number of bytes. 
Then you can use pg_size_pretty to convert it to something more human readable. It probably shouldn't be called unit, then, since that would be the same name as pg_settings but a different meaning. I thought of "bytes_conversion". Then, non-block-oriented IO also wouldn't have to be in bytes. They could put 1000 or 10000 for bytes_conversion. What do you think? - Melanie
v35 is attached On Mon, Oct 24, 2022 at 2:38 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Thu, Oct 20, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote: > > I wonder if we should add a "source" output argument to > > StrategyGetBuffer(). Then nearly all the counting can happen in > > BufferAlloc(). > > I think we can just check for BM_VALID being set before invalidating it > in order to claim the buffer at the end of BufferAlloc(). Then we can > count it as an eviction or reuse. Done this in attached version > > > On 2022-10-19 15:26:51 -0400, Melanie Plageman wrote: > > > I have made some major changes in this area to make the columns more > > > useful. I have renamed and split "clocksweeps". It is now "evicted" and > > > "freelist acquired". This makes it clear when a block must be evicted > > > from a shared buffer and may help to identify misconfiguration > > > of shared buffers. > > > > I'm not sure freelist acquired is really that useful? If we don't add it, we > > should however definitely not count buffers from the freelist as evictions. > > > > > > > There is some nuance here that I tried to make clear in the docs. > > > "freelist acquired" in a shared context is straightforward. > > > "freelist acquired" in a strategy context is counted when a shared > > > buffer is added to the strategy ring (not when it is reused). > > > > Not sure what the second half here means - why would a buffer that's not from > > the freelist ever be counted as being from the freelist? > > > > > > > "freelist_acquired" is confusing for local buffers but I wanted to > > > distinguish between reuse/eviction of local buffers and initial > > > allocation. "freelist_acquired" seemed more fitting because there is a > > > clocksweep to find a local buffer and if it hasn't been allocated yet it > > > is allocated in a place similar to where shared buffers acquire a buffer > > > from the freelist. 
If I didn't count it here, I would need to make a new > > > column only for local buffers called "allocated" or something like that. > > > > I think you're making this too granular. We need to have more detail than > > today. But we don't necessarily need to catch every nuance. I cut freelist_acquired in attached version. > I am fine with cutting freelist_acquired. The same actionable > information that it could provide could be provided by "read", right? > Also, removing it means I can remove the complicated explanation of how > freelist_acquired should be interpreted in IOCONTEXT_LOCAL. > > Speaking of IOCONTEXT_LOCAL, I was wondering if it is confusing to call > it IOCONTEXT_LOCAL since it refers to IO done for temporary tables. What > if, in the future, we want to track other IO done using data in local > memory? Also, what if we want to track other IO done using data from > shared memory that is not in shared buffers? Would IOCONTEXT_SB and > IOCONTEXT_TEMP be better? Should IOContext literally describe the > context of the IO being done and there be a separate column which > indicates the source of the data for the IO? > Like wal_buffer, local_buffer, shared_buffer? Then if it is not > block-oriented, it could be shared_mem, local_mem, or bypass? pg_stat_statements uses local_blks_read and temp_blks_read for local buffers for temp tables and temp file IO respectively -- so perhaps we should stick to that Other updates in this version: I've also updated the unit column to bytes_conversion. I've made quite a few updates to the docs including more information on overlaps between pg_stat_database, pg_statio_*, and pg_stat_statements. Let me know if there are other configuration tip resources from the existing docs that I could link in the column "files_synced". I still need to look at the docs with fresh eyes and do another round of cleanup (probably). - Melanie
Attachment
okay, so I realized v35 had an issue where I wasn't counting strategy evictions correctly. fixed in attached v36. This made me wonder if there is actually a way to add a test for evictions (in strategy and shared contexts) that is not flakey. On Sun, Oct 23, 2022 at 6:48 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > > On Thu, Oct 20, 2022 at 10:31 AM Andres Freund <andres@anarazel.de> wrote: > > - "repossession" is a very unintuitive name for me. If we want something like > > it, can't we just name it reuse_failed or such? > > +1, I think "repossessed" is awkward. I think "reuse_failed" works, > but no strong opinions on an alternate name. Also, re: repossessed, I can change it to reuse_failed but I do think it is important to give users a way to distinguish between bulkread rejections of dirty buffers and strategies failing to reuse buffers due to concurrent pinning (since the reaction to these two scenarios would likely be different). If we added another column called something like "claim_failed" which counts buffers which we failed to reuse because of concurrent pinning or usage, we could recommend use of this column together with "reuse_failed" to determine the cause of the failed reuses for a bulkread. We could also use "claim_failed" in IOContext shared to provide information on shared buffer contention. - Melanie
Attachment
Hi, On 2022-10-24 14:38:52 -0400, Melanie Plageman wrote: > > - "repossession" is a very unintuitive name for me. If we want something like > > it, can't we just name it reuse_failed or such? > > Repossession could be called eviction_failed or reuse_failed. > Do we think we will ever want to use it to count buffers we released > in other IOContexts (thus making the name eviction_failed better than > reuse_failed)? I've a somewhat radical proposal: Let's just not count any of this in the initial version. I think we want something, but clearly it's one of the harder aspects of this patch. Let's get the rest in, and then work on this is in isolation. > Speaking of IOCONTEXT_LOCAL, I was wondering if it is confusing to call > it IOCONTEXT_LOCAL since it refers to IO done for temporary tables. What > if, in the future, we want to track other IO done using data in local > memory? Fair point. However, I think 'tmp' or 'temp' would be worse, because there's other sources of temporary files that would be worth counting, consider e.g. tuplestore temporary files. 'temptable' isn't good because it's not just tables. 'temprel'? On balance I think local is better, but not sure. > Also, what if we want to track other IO done using data from shared memory > that is not in shared buffers? Would IOCONTEXT_SB and IOCONTEXT_TEMP be > better? Should IOContext literally describe the context of the IO being done > and there be a separate column which indicates the source of the data for > the IO? Like wal_buffer, local_buffer, shared_buffer? Then if it is not > block-oriented, it could be shared_mem, local_mem, or bypass? Hm. I don't think we'd need _buffer for WAL or such, because there's nothing else. > If we had another dimension to the matrix "data_src" which, with > block-oriented IO is equivalent to "buffer type", this could help with > some of the clarity problems. 
>
> We could remove the "reused" column and that becomes:
>
> IOCONTEXT | DATA_SRC        | IOOP
> ----------------------------------------
> strategy  | strategy_buffer | EVICT
>
> Having data_src and iocontext simplifies the meaning of all io operations involving a strategy. Some operations are done on shared buffers and some on existing strategy buffers and this would be more clear without the addition of special columns for strategies.

-1, I think this just blows up the complexity further, without providing much benefit. But: Perhaps a somewhat similar idea could be used to address the concerns in the preceding paragraphs. How about the following set of columns:

backend_type:
object: relation, temp_relation[, WAL, tempfiles, ...]
iocontext: buffer_pool, bulkread, bulkwrite, vacuum[, bypass]
read:
written:
extended:
bytes_conversion:
evicted:
reused:
files_synced:
stats_reset:

Greetings, Andres Freund
On Wed, Oct 26, 2022 at 10:55 AM Melanie Plageman <melanieplageman@gmail.com> wrote: + The <structname>pg_statio_</structname> and + <structname>pg_stat_io</structname> views are primarily useful to determine + the effectiveness of the buffer cache. When the number of actual disk reads Totally nitpicking, but this reads a little funny to me. Previously the trailing underscore suggested this is a group, and now with pg_stat_io itself added (stupid question: should this be "pg_statio"?), it sounds like we're talking about two views: pg_stat_io and "pg_statio_". Maybe something like "The pg_stat_io view and the pg_statio_ set of views are primarily..."? + by that backend type in that IO context. Currently only a subset of IO + operations are tracked here. WAL IO, IO on temporary files, and some forms + of IO outside of shared buffers (such as when building indexes or moving a + table from one tablespace to another) could be added in the future. Again nitpicking, but should this be "may be added"? I think "could" suggests the possibility of implementation, whereas "may" feels more like a hint as to how the feature could evolve. + portion of the main shared buffer pool. This pattern is called a + <quote>Buffer Access Strategy</quote> in the + <productname>PostgreSQL</productname> source code and the fixed-size + ring buffer is referred to as a <quote>strategy ring buffer</quote> for + the purposes of this view's documentation. + </para></entry> Nice, I think this explanation is very helpful. You also use the term "strategy context" and "strategy operation" below. I think it's fairly obvious what those mean, but pointing it out in case we want to note that here, too. + <varname>read</varname> and <varname>extended</varname> for Maybe "plus" instead of "and" here for clarity (I'm assuming that's what the "and" means)? 
+ <varname>backend_type</varname>s <literal>autovacuum launcher</literal>, + <literal>autovacuum worker</literal>, <literal>client backend</literal>, + <literal>standalone backend</literal>, <literal>background + worker</literal>, and <literal>walsender</literal> for all + <varname>io_context</varname>s is similar to the sum of I'm reviewing the rendered docs now, and I noticed sentences like this are a bit hard to scan: they force the reader to parse a big list of backend types before even getting to the meat of what this is talking about. Should we maybe reword this so that the backend list comes at the end of the sentence? Or maybe even use a list (e.g., like in the "state" column description in pg_stat_activity)? + <varname>heap_blks_read</varname>, <varname>idx_blks_read</varname>, + <varname>tidx_blks_read</varname>, and + <varname>toast_blks_read</varname> in <link + linkend="monitoring-pg-statio-all-tables-view"> + <structname>pg_statio_all_tables</structname></link>. and + <varname>blks_read</varname> from <link I think that's a stray period before the "and." + <para>If using the <productname>PostgreSQL</productname> extension, + <xref linkend="pgstatstatements"/>, + <varname>read</varname> for + <varname>backend_type</varname>s <literal>autovacuum launcher</literal>, + <literal>autovacuum worker</literal>, <literal>client backend</literal>, + <literal>standalone backend</literal>, <literal>background + worker</literal>, and <literal>walsender</literal> for all + <varname>io_context</varname>s is equivalent to Same comment as above re: the lengthy list. + Normal client backends should be able to rely on maintenance processes + like the checkpointer and background writer to write out dirty data as Nice--it's great to see this mentioned. But I think these are generally referred to as "auxiliary" not "maintenance" processes, no? 
+ <para>If using the <productname>PostgreSQL</productname> extension, + <xref linkend="pgstatstatements"/>, <varname>written</varname> and + <varname>extended</varname> for <varname>backend_type</varname>s Again, should this be "plus" instead of "and"? + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>bytes_conversion</structfield> <type>bigint</type> + </para> I think this general approach works (instead of unit). I'm not wild about the name, but I don't really have a better suggestion. Maybe "op_bytes" (since each cell is counting the number of I/O operations)? But I think bytes_conversion is okay. Also, is this (in the middle of the table) the right place for this column? I would have expected to see it before or after all the actual I/O op cells. + <varname>io_context</varname>s. When a <quote>Buffer Access + Strategy</quote> reuses a buffer in the strategy ring, it must evict its + contents, incrementing <varname>reused</varname>. When a <quote>Buffer + Access Strategy</quote> adds a new shared buffer to the strategy ring + and this shared buffer is occupied, the <quote>Buffer Access + Strategy</quote> must evict the contents of the shared buffer, + incrementing <varname>evicted</varname>. I think the parallel phrasing here makes this a little hard to follow. Specifically, I think "must evict its contents" for the strategy case sounds like a bad thing, but in fact this is a totally normal thing that happens as part of strategy access, no? The idea is you probably won't need that buffer again, so it's fine to evict it. I'm not sure how to reword, but I think the current phrasing is misleading. + The number of times a <literal>bulkread</literal> found the current + buffer in the fixed-size strategy ring dirty and requiring flush. Maybe "...found ... to be dirty..."? 
+ frequent vacuuming or more aggressive autovacuum settings, as buffers are + dirtied during a bulkread operation when updating the hint bit or when + performing on-access pruning. Are there docs to cross-reference here, especially for pruning? I couldn't find much except a few un-explained mentions in the page layout docs [2], and most of the search results refer to partition pruning. Searching for hint bits at least gives some info in blog posts and the wiki. + again. A high number of repossessions is a sign of contention for the + blocks operated on by the strategy operation. This (and in general the repossession description) makes sense, but I'm not sure what to do with the information. Maybe Andres is right that we could skip this in the first version? On Mon, Oct 24, 2022 at 12:39 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > I don't quite follow this: does this mean that I should expect > > 'reused' and 'evicted' to be equal in the 'shared' context, because > > they represent the same thing? Or will 'reused' just be null because > > it's not distinct from 'evicted'? It looks like it's null right now, > > but I find the wording here confusing. > > You should only see evictions when the strategy evicts shared buffers > and reuses when the strategy evicts existing strategy buffers. > > How about this instead in this docs? > > the number of times an existing buffer in the strategy ring was reused > as part of an operation in the <literal>bulkread</literal>, > <literal>bulkwrite</literal>, or <literal>vacuum</literal> > <varname>io_context</varname>s. when a buffer access strategy > <quote>reuses</quote> a buffer in the strategy ring, it must evict its > contents, incrementing <varname>reused</varname>. when a buffer access > strategy adds a new shared buffer to the strategy ring and this shared > buffer is occupied, the buffer access strategy must evict the contents > of the shared buffer, incrementing <varname>evicted</varname>. 
It looks like you ended up with different wording in the patch, but both this explanation and what's in the patch now make sense to me. Thanks for clarifying. Also, I noticed that the commit message explains missing rows for some backend_type / io_context combinations and NULL (versus 0) in some cells, but the docs don't really talk about that. Do you think that should be in there as well? Thanks, Maciek [1]: https://www.postgresql.org/docs/15/glossary.html#GLOSSARY-AUXILIARY-PROC [2]: https://www.postgresql.org/docs/15/storage-page-layout.html
v37 attached On Sun, Oct 30, 2022 at 9:09 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > > On Wed, Oct 26, 2022 at 10:55 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > + The <structname>pg_statio_</structname> and > + <structname>pg_stat_io</structname> views are primarily useful to determine > + the effectiveness of the buffer cache. When the number of actual disk reads > > Totally nitpicking, but this reads a little funny to me. Previously > the trailing underscore suggested this is a group, and now with > pg_stat_io itself added (stupid question: should this be > "pg_statio"?), it sounds like we're talking about two views: > pg_stat_io and "pg_statio_". Maybe something like "The pg_stat_io view > and the pg_statio_ set of views are primarily..."? I decided not to call it pg_statio because all of the other stats views have an underscore after stat and I thought it was an opportunity to be consistent with them. > + by that backend type in that IO context. Currently only a subset of IO > + operations are tracked here. WAL IO, IO on temporary files, and some forms > + of IO outside of shared buffers (such as when building indexes or moving a > + table from one tablespace to another) could be added in the future. > > Again nitpicking, but should this be "may be added"? I think "could" > suggests the possibility of implementation, whereas "may" feels more > like a hint as to how the feature could evolve. I've adopted the wording you suggested. > + portion of the main shared buffer pool. This pattern is called a > + <quote>Buffer Access Strategy</quote> in the > + <productname>PostgreSQL</productname> source code and the fixed-size > + ring buffer is referred to as a <quote>strategy ring buffer</quote> for > + the purposes of this view's documentation. > + </para></entry> > > Nice, I think this explanation is very helpful. You also use the term > "strategy context" and "strategy operation" below. 
I think it's fairly > obvious what those mean, but pointing it out in case we want to note > that here, too. Thanks! I've added definitions of those as well. > + <varname>read</varname> and <varname>extended</varname> for > > Maybe "plus" instead of "and" here for clarity (I'm assuming that's > what the "and" means)? Modified this -- in some cases by adding the lists mentioned below > + <varname>backend_type</varname>s <literal>autovacuum launcher</literal>, > + <literal>autovacuum worker</literal>, <literal>client backend</literal>, > + <literal>standalone backend</literal>, <literal>background > + worker</literal>, and <literal>walsender</literal> for all > + <varname>io_context</varname>s is similar to the sum of > > I'm reviewing the rendered docs now, and I noticed sentences like this > are a bit hard to scan: they force the reader to parse a big list of > backend types before even getting to the meat of what this is talking > about. Should we maybe reword this so that the backend list comes at > the end of the sentence? Or maybe even use a list (e.g., like in the > "state" column description in pg_stat_activity)? Good idea with the bullet points. For the lengthy lists, I've added bullet point lists to the docs for several of the columns. It is quite long now but, hopefully, clearer? Let me know if you think it improves the readability. > + <varname>heap_blks_read</varname>, <varname>idx_blks_read</varname>, > + <varname>tidx_blks_read</varname>, and > + <varname>toast_blks_read</varname> in <link > + linkend="monitoring-pg-statio-all-tables-view"> > + <structname>pg_statio_all_tables</structname></link>. and > + <varname>blks_read</varname> from <link > > I think that's a stray period before the "and." Fixed! > + Normal client backends should be able to rely on maintenance processes > + like the checkpointer and background writer to write out dirty data as > > Nice--it's great to see this mentioned. 
But I think these are > generally referred to as "auxiliary" not "maintenance" processes, no? Thanks! Fixed. > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>bytes_conversion</structfield> <type>bigint</type> > + </para> > > I think this general approach works (instead of unit). I'm not wild > about the name, but I don't really have a better suggestion. Maybe > "op_bytes" (since each cell is counting the number of I/O operations)? > But I think bytes_conversion is okay. I really like op_bytes and have changed it to this. Thanks for the suggestion! > Also, is this (in the middle of the table) the right place for this > column? I would have expected to see it before or after all the actual > I/O op cells. I put it after read, write, and extend columns because it applies to them. It doesn't apply to files_synced. For reused and evicted, I didn't think bytes reused and evicted made sense. Also, when we add non-block oriented IO, reused and evicted won't be used but op_bytes will be. So I thought it made more sense to place it after the operations it applies to. > + <varname>io_context</varname>s. When a <quote>Buffer Access > + Strategy</quote> reuses a buffer in the strategy ring, it must evict its > + contents, incrementing <varname>reused</varname>. When a <quote>Buffer > + Access Strategy</quote> adds a new shared buffer to the strategy ring > + and this shared buffer is occupied, the <quote>Buffer Access > + Strategy</quote> must evict the contents of the shared buffer, > + incrementing <varname>evicted</varname>. > > I think the parallel phrasing here makes this a little hard to follow. > Specifically, I think "must evict its contents" for the strategy case > sounds like a bad thing, but in fact this is a totally normal thing > that happens as part of strategy access, no? The idea is you probably > won't need that buffer again, so it's fine to evict it. 
I'm not sure > how to reword, but I think the current phrasing is misleading. I had trouble rephrasing this. I changed a few words. I see what you mean. It is worth noting that reusing strategy buffers when there are buffers on the freelist may not be the best behavior, so I wouldn't necessarily consider "reused" a good thing. However, I'm not sure how much the user could really do about this. I would at least like this phrasing to be clear (evicted is for shared buffers, reused is for strategy buffers), so, perhaps this section requires more work. > + The number of times a <literal>bulkread</literal> found the current > + buffer in the fixed-size strategy ring dirty and requiring flush. > > Maybe "...found ... to be dirty..."? Changed to this wording. > + frequent vacuuming or more aggressive autovacuum settings, as buffers are > + dirtied during a bulkread operation when updating the hint bit or when > + performing on-access pruning. > > Are there docs to cross-reference here, especially for pruning? I > couldn't find much except a few un-explained mentions in the page > layout docs [2], and most of the search results refer to partition > pruning. Searching for hint bits at least gives some info in blog > posts and the wiki. yes, I don't see anything explaining this either -- below the page layout it discusses tuple layout but that doesn't mention hint bits. > + again. A high number of repossessions is a sign of contention for the > + blocks operated on by the strategy operation. > > This (and in general the repossession description) makes sense, but > I'm not sure what to do with the information. Maybe Andres is right > that we could skip this in the first version? I've removed repossessed and rejected in attached v37. I am a bit sad about this because I don't see a good way forward and I think those could be useful for users. 
I have added the new column Andres recommended in [1] ("io_object") to clarify temp and local buffers and pave the way for bypass IO (IO not done through a buffer pool), which can be done on temp or permanent files for temp or permanent relations, and spill file IO which is done on temporary files but isn't related to temporary tables. IOObject has increased the memory footprint and complexity of the code around tracking and accumulating the statistics, though it has not increased the number of rows in the view. One question I still have about this additional dimension is how much enumeration we need of the various combinations of IO operations, IO objects, IO ops, and backend types which are allowed and not allowed. Currently because it is only valid to operate on both IOOBJECT_RELATION and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the various functions asserting and validating what is "allowed" in terms of combinations of ops, objects, contexts, and backend types aren't much different than they were without IO Object. However, once we begin adding other objects and contexts, we will need to make this logic more comprehensive. I'm not sure whether or not I should do that preemptively. > On Mon, Oct 24, 2022 at 12:39 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > I don't quite follow this: does this mean that I should expect > > > 'reused' and 'evicted' to be equal in the 'shared' context, because > > > they represent the same thing? Or will 'reused' just be null because > > > it's not distinct from 'evicted'? It looks like it's null right now, > > > but I find the wording here confusing. > > > > You should only see evictions when the strategy evicts shared buffers > > and reuses when the strategy evicts existing strategy buffers. > > > > How about this instead in this docs? 
> > > > the number of times an existing buffer in the strategy ring was reused > > as part of an operation in the <literal>bulkread</literal>, > > <literal>bulkwrite</literal>, or <literal>vacuum</literal> > > <varname>io_context</varname>s. when a buffer access strategy > > <quote>reuses</quote> a buffer in the strategy ring, it must evict its > > contents, incrementing <varname>reused</varname>. when a buffer access > > strategy adds a new shared buffer to the strategy ring and this shared > > buffer is occupied, the buffer access strategy must evict the contents > > of the shared buffer, incrementing <varname>evicted</varname>. > > It looks like you ended up with different wording in the patch, but > both this explanation and what's in the patch now make sense to me. > Thanks for clarifying. Yes, I tried to rework it and your suggestion and feedback was very helpful. > Also, I noticed that the commit message explains missing rows for some > backend_type / io_context combinations and NULL (versus 0) in some > cells, but the docs don't really talk about that. Do you think that > should be in there as well? Thanks for pointing this out. I have added notes about this to the relevant columns in the docs. - Melanie [1] https://www.postgresql.org/message-id/20221026185808.4qnxowtn35x43u7u%40awork3.anarazel.de
On Thu, Nov 3, 2022 at 10:00 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > I decided not to call it pg_statio because all of the other stats views > have an underscore after stat and I thought it was an opportunity to be > consistent with them. Oh, got it. Makes sense. > > I'm reviewing the rendered docs now, and I noticed sentences like this > > are a bit hard to scan: they force the reader to parse a big list of > > backend types before even getting to the meat of what this is talking > > about. Should we maybe reword this so that the backend list comes at > > the end of the sentence? Or maybe even use a list (e.g., like in the > > "state" column description in pg_stat_activity)? > > Good idea with the bullet points. > For the lengthy lists, I've added bullet point lists to the docs for > several of the columns. It is quite long now but, hopefully, clearer? > Let me know if you think it improves the readability. Hmm, I should have tried this before suggesting it. I think the lists break up the flow of the column description too much. What do you think about the attached (on top of your patches--attaching it as a .diff to hopefully not confuse cfbot)? I kept the lists for backend types but inlined the others as a middle ground. I also added a few omitted periods and reworded "read plus extended" to avoid starting the sentence with a (lowercase) varname (I think in general it's fine to do that, but the more complicated sentence structure here makes it easier to follow if the sentence starts with a capital). Alternately, what do you think about pulling equivalencies to existing views out of the main column descriptions, and adding them after the main table as a sort of footnote? Most view docs don't have anything like that, but pg_stat_replication does and it might be a good pattern to follow. Thoughts? > > Also, is this (in the middle of the table) the right place for this > > column? 
I would have expected to see it before or after all the actual > > I/O op cells. > > I put it after read, write, and extend columns because it applies to > them. It doesn't apply to files_synced. For reused and evicted, I didn't > think bytes reused and evicted made sense. Also, when we add non-block > oriented IO, reused and evicted won't be used but op_bytes will be. So I > thought it made more sense to place it after the operations it applies > to. Got it, makes sense. > > + <varname>io_context</varname>s. When a <quote>Buffer Access > > + Strategy</quote> reuses a buffer in the strategy ring, it must evict its > > + contents, incrementing <varname>reused</varname>. When a <quote>Buffer > > + Access Strategy</quote> adds a new shared buffer to the strategy ring > > + and this shared buffer is occupied, the <quote>Buffer Access > > + Strategy</quote> must evict the contents of the shared buffer, > > + incrementing <varname>evicted</varname>. > > > > I think the parallel phrasing here makes this a little hard to follow. > > Specifically, I think "must evict its contents" for the strategy case > > sounds like a bad thing, but in fact this is a totally normal thing > > that happens as part of strategy access, no? The idea is you probably > > won't need that buffer again, so it's fine to evict it. I'm not sure > > how to reword, but I think the current phrasing is misleading. > > I had trouble rephrasing this. I changed a few words. I see what you > mean. It is worth noting that reusing strategy buffers when there are > buffers on the freelist may not be the best behavior, so I wouldn't > necessarily consider "reused" a good thing. However, I'm not sure how > much the user could really do about this. I would at least like this > phrasing to be clear (evicted is for shared buffers, reused is for > strategy buffers), so, perhaps this section requires more work. Oh, I see. I think the updated wording works better. 
Although I think we can drop the quotes around "Buffer Access Strategy" here. They're useful when defining the term originally, but after that I think it's clearer to use the term unquoted. Just to understand this better myself, though: can you clarify when "reused" is not a normal, expected part of the strategy execution? I was under the impression that a ring buffer is used because each page is needed only "once" (i.e., for one set of operations) for the command using the strategy ring buffer. Naively, in that situation, it seems better to reuse a no-longer-needed buffer than to claim another buffer from the freelist (where other commands may eventually make better use of it). > > + again. A high number of repossessions is a sign of contention for the > > + blocks operated on by the strategy operation. > > > > This (and in general the repossession description) makes sense, but > > I'm not sure what to do with the information. Maybe Andres is right > > that we could skip this in the first version? > > I've removed repossessed and rejected in attached v37. I am a bit sad > about this because I don't see a good way forward and I think those > could be useful for users. I can see that, but I think as long as we're not doing anything to preclude adding this in the future, it's better to get something out there and expand it later. For what it's worth, I don't feel it needs to be excluded, just that it's not worth getting hung up on. > I have added the new column Andres recommended in [1] ("io_object") to > clarify temp and local buffers and pave the way for bypass IO (IO not > done through a buffer pool), which can be done on temp or permanent > files for temp or permanent relations, and spill file IO which is done > on temporary files but isn't related to temporary tables. > > IOObject has increased the memory footprint and complexity of the code > around tracking and accumulating the statistics, though it has not > increased the number of rows in the view. 
>
> One question I still have about this additional dimension is how much
> enumeration we need of the various combinations of IO operations, IO
> objects, IO ops, and backend types which are allowed and not allowed.
> Currently because it is only valid to operate on both IOOBJECT_RELATION
> and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the
> various functions asserting and validating what is "allowed" in terms of
> combinations of ops, objects, contexts, and backend types aren't much
> different than they were without IO Object. However, once we begin
> adding other objects and contexts, we will need to make this logic more
> comprehensive. I'm not sure whether or not I should do that
> preemptively.

It's definitely something to consider, but I have no useful input here.

Some more notes on the docs patch:

+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>io_context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context or location of an IO operation.
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <varname>io_context</varname> <literal>buffer pool</literal> refers to
+ IO operations on data in both the shared buffer pool and process-local
+ buffer pools used for temporary relation data.
+ </para>
+ <para>
+ Operations on temporary relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>temp relation</literal>.
+ </para>
+ <para>
+ Operations on permanent relations are tracked in
+ <varname>io_context</varname> <literal>buffer pool</literal> and
+ <varname>io_object</varname> <literal>relation</literal>.
+ </para>
+ </listitem>

For this column, you repeat "io_context" in the list describing the possible values of the column. Enum-style columns in other tables don't do that (e.g., the pg_stat_activity "state" column). I think it might read better to omit "io_context" from the list.
+ <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>io_object</structfield> <type>text</type> + </para> + <para> + Object operated on in a given <varname>io_context</varname> by a given + <varname>backend_type</varname>. + </para> Is this a fixed set of objects we should list, like for io_context? Thanks, Maciek
Hi, One good follow up patch will be to rip out the accounting for pg_stat_bgwriter's buffers_backend, buffers_backend_fsync and perhaps buffers_alloc and replace it with a subselect getting the equivalent data from pg_stat_io. It might not be quite worth doing for buffers_alloc because of the way that's tied into bgwriter pacing. On 2022-11-03 13:00:24 -0400, Melanie Plageman wrote: > > + again. A high number of repossessions is a sign of contention for the + > > blocks operated on by the strategy operation. > > > > This (and in general the repossession description) makes sense, but > > I'm not sure what to do with the information. Maybe Andres is right > > that we could skip this in the first version? > > I've removed repossessed and rejected in attached v37. I am a bit sad > about this because I don't see a good way forward and I think those > could be useful for users. Let's get the basic patch in and then check whether we can find a way to have something providing at least some more information like repossessed and rejected. I think it'll be easier to analyze in isolation. > I have added the new column Andres recommended in [1] ("io_object") to > clarify temp and local buffers and pave the way for bypass IO (IO not > done through a buffer pool), which can be done on temp or permanent > files for temp or permanent relations, and spill file IO which is done > on temporary files but isn't related to temporary tables. > IOObject has increased the memory footprint and complexity of the code > around tracking and accumulating the statistics, though it has not > increased the number of rows in the view. It doesn't look too bad from here. Is there a specific portion of the code where it concerns you the most? > One question I still have about this additional dimension is how much > enumeration we need of the various combinations of IO operations, IO > objects, IO ops, and backend types which are allowed and not allowed. 
> > Currently because it is only valid to operate on both IOOBJECT_RELATION > and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the > various functions asserting and validating what is "allowed" in terms of > combinations of ops, objects, contexts, and backend types aren't much > different than they were without IO Object. However, once we begin > adding other objects and contexts, we will need to make this logic more > comprehensive. I'm not sure whether or not I should do that > preemptively. I'd not do it preemptively. > @@ -833,6 +836,22 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > > isExtend = (blockNum == P_NEW); > > + if (isLocalBuf) > + { > + /* > + * Though a strategy object may be passed in, no strategy is employed > + * when using local buffers. This could happen when doing, for example, > + * CREATE TEMPORRARY TABLE AS ... > + */ > + io_context = IOCONTEXT_BUFFER_POOL; > + io_object = IOOBJECT_TEMP_RELATION; > + } > + else > + { > + io_context = IOContextForStrategy(strategy); > + io_object = IOOBJECT_RELATION; > + } I think given how frequently ReadBuffer_common() is called in some workloads, it'd be good to make IOContextForStrategy inlinable. But I guess that's not easily doable, because struct BufferAccessStrategyData is only defined in freelist.c. Could we defer this until later, given that we don't currently need this in case of buffer hits afaict? 
> @@ -1121,6 +1144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > BufferAccessStrategy strategy, > bool *foundPtr) > { > + bool from_ring; > + IOContext io_context; > BufferTag newTag; /* identity of requested block */ > uint32 newHash; /* hash value for newTag */ > LWLock *newPartitionLock; /* buffer partition lock for it */ > @@ -1187,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > */ > LWLockRelease(newPartitionLock); > > + io_context = IOContextForStrategy(strategy); Hm - doesn't this mean we do IOContextForStrategy() twice? Once in ReadBuffer_common() and then again here? > /* Loop here in case we have to try another victim buffer */ > for (;;) > { > + > /* > * Ensure, while the spinlock's not yet held, that there's a free > * refcount entry. > @@ -1200,7 +1228,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > * Select a victim buffer. The buffer is returned with its header > * spinlock still held! > */ > - buf = StrategyGetBuffer(strategy, &buf_state); > + buf = StrategyGetBuffer(strategy, &buf_state, &from_ring); > > Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0); > I think patch 0001 relies on this change already having been made, If I am not misunderstanding? > @@ -1263,13 +1291,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > } > } > > + /* > + * When a strategy is in use, only flushes of dirty buffers > + * already in the strategy ring are counted as strategy writes > + * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the > + * purpose of IO operation statistics tracking. > + * > + * If a shared buffer initially added to the ring must be > + * flushed before being used, this is counted as an > + * IOCONTEXT_BUFFER_POOL IOOP_WRITE. > + * > + * If a shared buffer added to the ring later because the Missing word? 
> + * current strategy buffer is pinned or in use or because all > + * strategy buffers were dirty and rejected (for BAS_BULKREAD > + * operations only) requires flushing, this is counted as an > + * IOCONTEXT_BUFFER_POOL IOOP_WRITE (from_ring will be false). I think this makes sense for now, but it'd be good if somebody else could chime in on this... > + * > + * When a strategy is not in use, the write can only be a > + * "regular" write of a dirty shared buffer (IOCONTEXT_BUFFER_POOL > + * IOOP_WRITE). > + */ > + > /* OK, do the I/O */ > TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum, > smgr->smgr_rlocator.locator.spcOid, > smgr->smgr_rlocator.locator.dbOid, > smgr->smgr_rlocator.locator.relNumber); > > - FlushBuffer(buf, NULL); > + FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION); > LWLockRelease(BufferDescriptorGetContentLock(buf)); > ScheduleBufferTagForWriteback(&BackendWritebackContext, > + if (oldFlags & BM_VALID) > + { > + /* > + * When a BufferAccessStrategy is in use, evictions adding a > + * shared buffer to the strategy ring are counted in the > + * corresponding strategy's context. Perhaps "adding a shared buffer to the ring are counted in the corresponding context"? "strategy's context" sounds off to me. > This includes the evictions > + * done to add buffers to the ring initially as well as those > + * done to add a new shared buffer to the ring when current > + * buffer is pinned or otherwise in use. I think this sentence could use a few commas, but not sure. s/current/the current/? > + * We wait until this point to count reuses and evictions in order to > + * avoid incorrectly counting a buffer as reused or evicted when it was > + * released because it was concurrently pinned or in use or counting it > + * as reused when it was rejected or when we errored out. > + */ I can't quite parse this sentence. > + IOOp io_op = from_ring ? 
IOOP_REUSE : IOOP_EVICT; > + > + pgstat_count_io_op(io_op, IOOBJECT_RELATION, io_context); > + } I'd just inline the variable, but ... > @@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, > LocalRefCount[b]++; > ResourceOwnerRememberBuffer(CurrentResourceOwner, > BufferDescriptorGetBuffer(bufHdr)); > + > break; > } > } Spurious change. > pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); > > *foundPtr = false; > + > return bufHdr; > } Dito. > +/* > +* IO Operation statistics are not collected for all BackendTypes. > +* > +* The following BackendTypes do not participate in the cumulative stats > +* subsystem or do not do IO operations worth reporting statistics on: s/worth reporting/we currently report/? > + /* > + * In core Postgres, only regular backends and WAL Sender processes > + * executing queries will use local buffers and operate on temporary > + * relations. Parallel workers will not use local buffers (see > + * InitLocalBuffers()); however, extensions leveraging background workers > + * have no such limitation, so track IO Operations on > + * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER. > + */ > + no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype > + == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype == > + B_STANDALONE_BACKEND || bktype == B_STARTUP; > + > + if (no_temp_rel && io_context == IOCONTEXT_BUFFER_POOL && io_object == > + IOOBJECT_TEMP_RELATION) > + return false; Personally I don't like line breaks on the == and would rather break earlier on the && or ||. 
> + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) > + { > + PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context]; > + PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context]; > + > + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) > + { Is there any compiler that'd complain if you used IOContext/IOObject/IOOp as the type in the for loop? I don't think so? Then you'd not need the casts in other places, which I think would make the code easier to read. > + PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object]; > + PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object]; > + > + if (!expect_backend_stats || > + !pgstat_bktype_io_context_io_object_valid(MyBackendType, > + (IOContext) io_context, (IOObject) io_object)) > + { > + pgstat_io_context_ops_assert_zero(sharedent); > + pgstat_io_context_ops_assert_zero(pendingent); > + continue; > + } > + > + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) > + { > + if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context, > + (IOObject) io_object, (IOOp) io_op))) Superfluous parens after the !, I think? > void > pgstat_report_vacuum(Oid tableoid, bool shared, > @@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared, > } > > pgstat_unlock_entry(entry_ref); > + > + /* > + * Flush IO Operations statistics now. pgstat_report_stat() will flush IO > + * Operation stats, however this will not be called after an entire Missing "until"? > +static inline void > +pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op) > +{ Does this need to be in pgstat.h? Perhaps pgstat_internal.h would suffice, afaict it's not used outside of pgstat code? > + > +/* > + * Assert that stats have not been counted for any combination of IOContext, > + * IOObject, and IOOp which is not valid for the passed-in BackendType. 
The > + * passed-in array of PgStat_IOOpCounters must contain stats from the > + * BackendType specified by the second parameter. Caller is responsible for > + * locking of the passed-in PgStatShared_IOContextOps, if needed. > + */ > +static inline void > +pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops, > + BackendType bktype) > +{ This doesn't look like it should be an inline function - it's quite long. I think it's also too complicated for the compiler to optimize out if assertions are disabled. So you'd need to handle this with an explicit #ifdef USE_ASSERT_CHECKING. > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>io_context</structfield> <type>text</type> > + </para> > + <para> > + The context or location of an IO operation. > + </para> > + <itemizedlist> > + <listitem> > + <para> > + <varname>io_context</varname> <literal>buffer pool</literal> refers to > + IO operations on data in both the shared buffer pool and process-local > + buffer pools used for temporary relation data. > + </para> > + <para> The indentation in the sgml part of the patch seems to be a bit wonky. > + <para> > + These last three <varname>io_context</varname>s are counted separately > + because the autovacuum daemon, explicit <command>VACUUM</command>, > + explicit <command>ANALYZE</command>, many bulk reads, and many bulk > + writes use a fixed amount of memory, acquiring the equivalent number of s/memory/buffers/? The amount of memory isn't really fixed. > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>read</structfield> <type>bigint</type> > + </para> > + <para> > + Reads by this <varname>backend_type</varname> into buffers in this > + <varname>io_context</varname>. 
> + <varname>read</varname> plus <varname>extended</varname> for > + <varname>backend_type</varname>s > + > + <itemizedlist> > + > + <listitem> > + <para> > + <literal>autovacuum launcher</literal> > + </para> > + </listitem> Hm. ISTM that we should not document the set of valid backend types as part of this view. Couldn't we share it with pg_stat_activity.backend_type? > + The difference is that reads done as part of <command>CREATE > + DATABASE</command> are not counted in > + <structname>pg_statio_all_tables</structname> and > + <structname>pg_stat_database</structname> > + </para> Hm, this seems a bit far into the weeds? > +Datum > +pg_stat_get_io(PG_FUNCTION_ARGS) > +{ > + PgStat_BackendIOContextOps *backends_io_stats; > + ReturnSetInfo *rsinfo; > + Datum reset_time; > + > + InitMaterializedSRF(fcinfo, 0); > + rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; > + > + backends_io_stats = pgstat_fetch_backend_io_context_ops(); > + > + reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp); > + > + for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++) > + { > + Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype)); > + bool expect_backend_stats = true; > + PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype]; > + > + /* > + * For those BackendTypes without IO Operation stats, skip > + * representing them in the view altogether. 
> + */ > + expect_backend_stats = pgstat_io_op_stats_collected((BackendType) > + bktype); > + > + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) > + { > + const char *io_context_str = pgstat_io_context_desc(io_context); > + PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context]; > + > + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) > + { > + PgStat_IOOpCounters *counters = &io_objs->data[io_object]; > + const char *io_obj_str = pgstat_io_object_desc(io_object); > + > + Datum values[IO_NUM_COLUMNS] = {0}; > + bool nulls[IO_NUM_COLUMNS] = {0}; > + > + /* > + * Some combinations of IOContext, IOObject, and BackendType are > + * not valid for any type of IOOp. In such cases, omit the > + * entire row from the view. > + */ > + if (!expect_backend_stats || > + !pgstat_bktype_io_context_io_object_valid((BackendType) bktype, > + (IOContext) io_context, (IOObject) io_object)) > + { > + pgstat_io_context_ops_assert_zero(counters); > + continue; > + } Perhaps mention in a comment two loops up that we don't skip the nested loops despite !expect_backend_stats because we want to assert here? Greetings, Andres Freund
Note that 001 fails to compile without 002:

../src/backend/storage/buffer/bufmgr.c:1257:43: error: ‘from_ring’ undeclared (first use in this function)
1257 | StrategyRejectBuffer(strategy, buf, from_ring))

My "warnings" script informed me about these gripes from MSVC:

[03:42:30.607] c:\cirrus>call sh -c 'if grep ": warning " build.txt; then exit 1; fi; exit 0'
[03:42:30.749] c:\cirrus\src\backend\storage\buffer\freelist.c(699) : warning C4715: 'IOContextForStrategy': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(190) : warning C4715: 'pgstat_io_context_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(204) : warning C4715: 'pgstat_io_object_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(226) : warning C4715: 'pgstat_io_op_desc': not all control paths return a value
[03:42:30.749] c:\cirrus\src\backend\utils\adt\pgstatfuncs.c(1816) : warning C4715: 'pgstat_io_op_get_index': not all control paths return a value

In the docs table, you say things like:

| io_context vacuum refers to the IO operations incurred while vacuuming and analyzing.

..but it's a bit unclear (maybe due to the way the docs are rendered). I think it may be more clear to say "when <io_context> is <vacuum>, ..."

| acquiring the equivalent number of shared buffers

I don't think "equivalent" fits here, since it's actually acquiring a different number of buffers.

There's a missing period before " The difference is"

The sentence beginning "read plus extended for backend_types" is difficult to parse due to having a bulleted list in its middle.

There aren't many references to "IOOps", which is good, because I started to read it as "I oops".

+ * Flush IO Operations statistics now.
pgstat_report_stat() will flush IO
+ * Operation stats, however this will not be called after an entire

=> I think that's intended to say *until* after ?

+ * Functions to assert that invalid IO Operation counters are zero.

=> There's a missing newline above this comment.

+ Assert(counters->evictions == 0 && counters->extends == 0 &&
+ counters->fsyncs == 0 && counters->reads == 0 && counters->reuses
+ == 0 && counters->writes == 0);

=> It'd be more readable and also maybe help debugging if these were separate assertions. I wondered in the past if that should be a general policy for all assertions.

+pgstat_io_op_stats_collected(BackendType bktype)
+{
+ return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
+ bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;

Similar: I'd prefer to see this as 5 "ifs" or a "switch" to return false, else return true. But YMMV.

+ * CREATE TEMPORRARY TABLE AS ...

=> typo: temporary

+ if (strategy_io_context && io_op == IOOP_FSYNC)

=> Extra space.

pgstat_count_io_op() has a superfluous newline before "}".

I think there may be a problem/deficiency with hint bits:

|postgres=# DROP TABLE u2; CREATE TABLE u2 AS SELECT generate_series(1,999999)a; SELECT pg_stat_reset_shared('io'); explain(analyze,buffers) SELECT * FROM u2;
|...
| Seq Scan on u2 (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.111..458.239 rows=999999 loops=1)
| Buffers: shared hit=2048 read=2377 dirtied=2377 written=2345

|postgres=# SELECT COUNT(1), relname, COUNT(1) FILTER(WHERE isdirty) FROM pg_buffercache b LEFT JOIN pg_class c ON pg_relation_filenode(c.oid)=b.relfilenode GROUP BY 2 ORDER BY 1 DESC LIMIT 11;
| count | relname | count
|-------+---------------------------------+-------
| 13619 | | 0
| 2080 | u2 | 2080
| 104 | pg_attribute | 4
| 71 | pg_statistic | 1
| 51 | pg_class | 1

It says that SELECT caused 2377 buffers to be dirtied, of which 2080 are associated with the new table in pg_buffercache.
|postgres=# SELECT * FROM pg_stat_io WHERE backend_type!~'autovac|archiver|logger|standalone|startup|^wal|background worker' or true ORDER BY 2;
| backend_type | io_context | io_object | read | written | extended | op_bytes | evicted | reused | files_synced | stats_reset
|...
| client backend | bulkread | relation | 2377 | 2345 | | 8192 | 0 | 2345 | | 2022-11-22 22:32:33.044552-06

I think it's a known behavior that hint bits do not use the strategy ring buffer. For BAS_BULKREAD, ring_size = 256kB (32, 8kB pages), but there's 2080 dirty pages in the buffercache (~16MB).

But the IO view says that 2345 of the pages were "reused", which seems misleading to me. Maybe that just follows from the behavior and the view is fine. If the view is fine, maybe this case should still be specifically mentioned in the docs.

-- Justin
Hi,

On 2022-11-22 23:43:29 -0600, Justin Pryzby wrote:
> I think there may be a problem/deficiency with hint bits:
>
> |postgres=# DROP TABLE u2; CREATE TABLE u2 AS SELECT generate_series(1,999999)a; SELECT pg_stat_reset_shared('io'); explain(analyze,buffers) SELECT * FROM u2;
> |...
> | Seq Scan on u2 (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.111..458.239 rows=999999 loops=1)
> | Buffers: shared hit=2048 read=2377 dirtied=2377 written=2345
>
> |postgres=# SELECT COUNT(1), relname, COUNT(1) FILTER(WHERE isdirty) FROM pg_buffercache b LEFT JOIN pg_class c ON pg_relation_filenode(c.oid)=b.relfilenode GROUP BY 2 ORDER BY 1 DESC LIMIT 11;
> | count | relname | count
> |-------+---------------------------------+-------
> | 13619 | | 0
> | 2080 | u2 | 2080
> | 104 | pg_attribute | 4
> | 71 | pg_statistic | 1
> | 51 | pg_class | 1
>
> It says that SELECT caused 2377 buffers to be dirtied, of which 2080 are
> associated with the new table in pg_buffercache.

Note that there's 2048 dirty buffers for u2 in shared_buffers before the SELECT, despite the relation being 4425 blocks long, due to the CTAS using BAS_BULKWRITE.

> |postgres=# SELECT * FROM pg_stat_io WHERE backend_type!~'autovac|archiver|logger|standalone|startup|^wal|background worker' or true ORDER BY 2;
> | backend_type | io_context | io_object | read | written | extended | op_bytes | evicted | reused | files_synced | stats_reset
> |...
> | client backend | bulkread | relation | 2377 | 2345 | | 8192 | 0 | 2345 | | 2022-11-22 22:32:33.044552-06
>
> I think it's a known behavior that hint bits do not use the strategy
> ring buffer. For BAS_BULKREAD, ring_size = 256kB (32, 8kB pages), but
> there's 2080 dirty pages in the buffercache (~16MB).

I don't think there's any "circumvention" of the ringbuffer here. There's 2048 buffers for u2 in s_b before, all dirty, there's 2080 after, also all dirty. So the ringbuffer restricted the increase in shared buffers used for u2 to 2080-2048=32 additional buffers.
The reason hint bits don't prevent pages from being written out here is that a BAS_BULKREAD strategy doesn't cause all buffer writes to be rejected, it just causes buffer writes to be rejected when the page LSN would require a WAL flush. And that's not typically the case when you just set a hint bit, unless you use wal_log_hints = true.

If I turn on wal_log_hints=true and add a CHECKPOINT after the CTAS I see 0 reuses (and 4425 dirty buffers), which is what I'd expect.

> But the IO view says that 2345 of the pages were "reused", which seems
> misleading to me. Maybe that just follows from the behavior and the view is
> fine. If the view is fine, maybe this case should still be specifically
> mentioned in the docs.

I think that's just confusing due to the reset. 2048 + 2345 = 4393, but we only have 2080 buffers for u2 in s_b.

Greetings, Andres Freund
v38 attached. On Sun, Nov 20, 2022 at 7:38 PM Andres Freund <andres@anarazel.de> wrote: > One good follow up patch will be to rip out the accounting for > pg_stat_bgwriter's buffers_backend, buffers_backend_fsync and perhaps > buffers_alloc and replace it with a subselect getting the equivalent data from > pg_stat_io. It might not be quite worth doing for buffers_alloc because of > the way that's tied into bgwriter pacing. I don't see how it will make sense to have buffers_backend and buffers_backend_fsync respond to a different reset target than the rest of the fields in pg_stat_bgwriter. > On 2022-11-03 13:00:24 -0400, Melanie Plageman wrote: > > @@ -833,6 +836,22 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > > > > isExtend = (blockNum == P_NEW); > > > > + if (isLocalBuf) > > + { > > + /* > > + * Though a strategy object may be passed in, no strategy is employed > > + * when using local buffers. This could happen when doing, for example, > > + * CREATE TEMPORRARY TABLE AS ... > > + */ > > + io_context = IOCONTEXT_BUFFER_POOL; > > + io_object = IOOBJECT_TEMP_RELATION; > > + } > > + else > > + { > > + io_context = IOContextForStrategy(strategy); > > + io_object = IOOBJECT_RELATION; > > + } > > I think given how frequently ReadBuffer_common() is called in some workloads, > it'd be good to make IOContextForStrategy inlinable. But I guess that's not > easily doable, because struct BufferAccessStrategyData is only defined in > freelist.c. Correct > Could we defer this until later, given that we don't currently need this in > case of buffer hits afaict? Yes, you are right. In ReadBuffer_common(), we can easily move the IOContextForStrategy() call to directly before using io_context. I've done that in the attached version. 
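The strategy-to-context mapping under discussion can be sketched as below. This is a simplified stand-in with local enums; the real IOContextForStrategy() in freelist.c inspects the opaque BufferAccessStrategyData, which is exactly why it is not trivially inlinable from other translation units.

```c
#include <assert.h>

/* Local stand-ins for the types referenced in the patch discussion. */
typedef enum
{
	BAS_NORMAL,
	BAS_BULKREAD,
	BAS_BULKWRITE,
	BAS_VACUUM
} BufferAccessStrategyType;

typedef enum
{
	IOCONTEXT_BUFFER_POOL,
	IOCONTEXT_BULKREAD,
	IOCONTEXT_BULKWRITE,
	IOCONTEXT_VACUUM
} IOContext;

/*
 * Sketch: map a strategy type to the IOContext its IO is counted under.
 * No strategy (BAS_NORMAL here standing in for a NULL strategy) means the
 * IO is counted against the shared buffer pool context.
 */
static IOContext
io_context_for_strategy_type(BufferAccessStrategyType bas)
{
	switch (bas)
	{
		case BAS_BULKREAD:
			return IOCONTEXT_BULKREAD;
		case BAS_BULKWRITE:
			return IOCONTEXT_BULKWRITE;
		case BAS_VACUUM:
			return IOCONTEXT_VACUUM;
		default:
			return IOCONTEXT_BUFFER_POOL;
	}
}
```

Deferring the call until the context is actually needed, as done in the attached version, avoids paying even this small cost on buffer-hit paths.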
> > @@ -1121,6 +1144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> > BufferAccessStrategy strategy,
> > bool *foundPtr)
> > {
> > + bool from_ring;
> > + IOContext io_context;
> > BufferTag newTag; /* identity of requested block */
> > uint32 newHash; /* hash value for newTag */
> > LWLock *newPartitionLock; /* buffer partition lock for it */
> > @@ -1187,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> > */
> > LWLockRelease(newPartitionLock);
> >
> > + io_context = IOContextForStrategy(strategy);
>
> Hm - doesn't this mean we do IOContextForStrategy() twice? Once in
> ReadBuffer_common() and then again here?

Yes. So, there are a few options for addressing this.

- If the goal is to call IOContextForStrategy() exactly once in a given codepath, BufferAlloc() can set IOContext (passed by reference as an output parameter). I don't like this much because it doesn't make sense to me that BufferAlloc() would set the "io_context" parameter -- especially given that strategy is already passed as a parameter and is obviously available to the caller. I also don't see a good way of waiting until BufferAlloc() returns to count the IO operations counted in FlushBuffer() and BufferAlloc() itself.

- If the goal is to avoid calling IOContextForStrategy() in more common codepaths, or to call it as close to its use as possible, then we can push down its call in BufferAlloc() to the two locations where it is used -- when a dirty buffer must be flushed and when a block was evicted or reused. This will avoid calling it when we are not evicting a block from a valid buffer. However, if we do that, I don't know how to avoid calling it twice in that codepath.
Even though we can assume io_context was set in the first location by the time we get to the second location, we would need to initialize the variable with something if we only plan to set it in some branches and there is no "invalid" or "default" value of the IOContext enum. Given the above, I've left the call in BufferAlloc() as is in the attached version. > > > > /* Loop here in case we have to try another victim buffer */ > > for (;;) > > { > > + > > /* > > * Ensure, while the spinlock's not yet held, that there's a free > > * refcount entry. > > @@ -1200,7 +1228,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > > * Select a victim buffer. The buffer is returned with its header > > * spinlock still held! > > */ > > - buf = StrategyGetBuffer(strategy, &buf_state); > > + buf = StrategyGetBuffer(strategy, &buf_state, &from_ring); > > > > Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0); > > > > I think patch 0001 relies on this change already having been made, If I am not misunderstanding? Fixed. > > > > @@ -1263,13 +1291,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > > } > > } > > > > + /* > > + * When a strategy is in use, only flushes of dirty buffers > > + * already in the strategy ring are counted as strategy writes > > + * (IOCONTEXT [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the > > + * purpose of IO operation statistics tracking. > > + * > > + * If a shared buffer initially added to the ring must be > > + * flushed before being used, this is counted as an > > + * IOCONTEXT_BUFFER_POOL IOOP_WRITE. > > + * > > + * If a shared buffer added to the ring later because the > > Missing word? Fixed. > > > > + * current strategy buffer is pinned or in use or because all > > + * strategy buffers were dirty and rejected (for BAS_BULKREAD > > + * operations only) requires flushing, this is counted as an > > + * IOCONTEXT_BUFFER_POOL IOOP_WRITE (from_ring will be false). 
> > I think this makes sense for now, but it'd be good if somebody else could > chime in on this... > > > + * > > + * When a strategy is not in use, the write can only be a > > + * "regular" write of a dirty shared buffer (IOCONTEXT_BUFFER_POOL > > + * IOOP_WRITE). > > + */ > > + > > /* OK, do the I/O */ > > TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum, > > smgr->smgr_rlocator.locator.spcOid, > > smgr->smgr_rlocator.locator.dbOid, > > smgr->smgr_rlocator.locator.relNumber); > > > > - FlushBuffer(buf, NULL); > > + FlushBuffer(buf, NULL, io_context, IOOBJECT_RELATION); > > LWLockRelease(BufferDescriptorGetContentLock(buf)); > > ScheduleBufferTagForWriteback(&BackendWritebackContext, > > > > > + if (oldFlags & BM_VALID) > > + { > > + /* > > + * When a BufferAccessStrategy is in use, evictions adding a > > + * shared buffer to the strategy ring are counted in the > > + * corresponding strategy's context. > > Perhaps "adding a shared buffer to the ring are counted in the corresponding > context"? "strategy's context" sounds off to me. Fixed. > > This includes the evictions > > + * done to add buffers to the ring initially as well as those > > + * done to add a new shared buffer to the ring when current > > + * buffer is pinned or otherwise in use. > > I think this sentence could use a few commas, but not sure. > > s/current/the current/? Reworded. > > > + * We wait until this point to count reuses and evictions in order to > > + * avoid incorrectly counting a buffer as reused or evicted when it was > > + * released because it was concurrently pinned or in use or counting it > > + * as reused when it was rejected or when we errored out. > > + */ > > I can't quite parse this sentence. I've reworded the whole comment. I think it is clearer now. > > > + IOOp io_op = from_ring ? IOOP_REUSE : IOOP_EVICT; > > + > > + pgstat_count_io_op(io_op, IOOBJECT_RELATION, io_context); > > + } > > I'd just inline the variable, but ... Done. 
> > @@ -196,6 +197,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, > > LocalRefCount[b]++; > > ResourceOwnerRememberBuffer(CurrentResourceOwner, > > BufferDescriptorGetBuffer(bufHdr)); > > + > > break; > > } > > } > > Spurious change. Removed. > > pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); > > > > *foundPtr = false; > > + > > return bufHdr; > > } > > Dito. Removed. > > +/* > > +* IO Operation statistics are not collected for all BackendTypes. > > +* > > +* The following BackendTypes do not participate in the cumulative stats > > +* subsystem or do not do IO operations worth reporting statistics on: > > s/worth reporting/we currently report/? Updated > > + /* > > + * In core Postgres, only regular backends and WAL Sender processes > > + * executing queries will use local buffers and operate on temporary > > + * relations. Parallel workers will not use local buffers (see > > + * InitLocalBuffers()); however, extensions leveraging background workers > > + * have no such limitation, so track IO Operations on > > + * IOOBJECT_TEMP_RELATION for BackendType B_BG_WORKER. > > + */ > > + no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype > > + == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype == > > + B_STANDALONE_BACKEND || bktype == B_STARTUP; > > + > > + if (no_temp_rel && io_context == IOCONTEXT_BUFFER_POOL && io_object == > > + IOOBJECT_TEMP_RELATION) > > + return false; > > Personally I don't like line breaks on the == and would rather break earlier > on the && or ||. I've gone through and fixed all of these that I could find. 
> > + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) > > + { > > + PgStatShared_IOObjectOps *shared_objs = &type_shstats->data[io_context]; > > + PgStat_IOObjectOps *pending_objs = &pending_IOOpStats.data[io_context]; > > + > > + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) > > + { > > Is there any compiler that'd complain if you used IOContext/IOObject/IOOp as the > type in the for loop? I don't think so? Then you'd not need the casts in other > places, which I think would make the code easier to read. I changed the type and currently get no compiler warnings, however, on a previous CI run, with the type changed to an enum I got the following warning: /tmp/cirrus-ci-build/src/include/utils/pgstat_internal.h:605:48: error: no ‘operator++(int)’ declared for postfix ‘++’ [-fpermissive] 605 | io_context < IOCONTEXT_NUM_TYPES; io_context++) I'm not sure why I am no longer getting it. > > + PgStat_IOOpCounters *sharedent = &shared_objs->data[io_object]; > > + PgStat_IOOpCounters *pendingent = &pending_objs->data[io_object]; > > + > > + if (!expect_backend_stats || > > + !pgstat_bktype_io_context_io_object_valid(MyBackendType, > > + (IOContext) io_context, (IOObject) io_object)) > > + { > > + pgstat_io_context_ops_assert_zero(sharedent); > > + pgstat_io_context_ops_assert_zero(pendingent); > > + continue; > > + } > > + > > + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) > > + { > > + if (!(pgstat_io_op_valid(MyBackendType, (IOContext) io_context, > > + (IOObject) io_object, (IOOp) io_op))) > > Superfluous parens after the !, I think? Thanks! I've looked for other occurrences as well and fixed them. > > void > > pgstat_report_vacuum(Oid tableoid, bool shared, > > @@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared, > > } > > > > pgstat_unlock_entry(entry_ref); > > + > > + /* > > + * Flush IO Operations statistics now. 
pgstat_report_stat() will flush IO > > + * Operation stats, however this will not be called after an entire > > Missing "until"? Fixed. > > +static inline void > > +pgstat_io_op_assert_zero(PgStat_IOOpCounters *counters, IOOp io_op) > > +{ > > Does this need to be in pgstat.h? Perhaps pgstat_internal.h would suffice, > afaict it's not used outside of pgstat code? It is used in pgstatfuncs.c during the view creation. > > + > > +/* > > + * Assert that stats have not been counted for any combination of IOContext, > > + * IOObject, and IOOp which is not valid for the passed-in BackendType. The > > + * passed-in array of PgStat_IOOpCounters must contain stats from the > > + * BackendType specified by the second parameter. Caller is responsible for > > + * locking of the passed-in PgStatShared_IOContextOps, if needed. > > + */ > > +static inline void > > +pgstat_backend_io_stats_assert_well_formed(PgStatShared_IOContextOps *backend_io_context_ops, > > + BackendType bktype) > > +{ > > This doesn't look like it should be an inline function - it's quite long. > > I think it's also too complicated for the compiler to optimize out if > assertions are disabled. So you'd need to handle this with an explicit #ifdef > USE_ASSERT_CHECKING. I've made it a static helper function in pgstat.c. > > > + <row> > > + <entry role="catalog_table_entry"><para role="column_definition"> > > + <structfield>io_context</structfield> <type>text</type> > > + </para> > > + <para> > > + The context or location of an IO operation. > > + </para> > > + <itemizedlist> > > + <listitem> > > + <para> > > + <varname>io_context</varname> <literal>buffer pool</literal> refers to > > + IO operations on data in both the shared buffer pool and process-local > > + buffer pools used for temporary relation data. > > + </para> > > + <para> > > The indentation in the sgml part of the patch seems to be a bit wonky. I'll address this and the other docs feedback in a separate patchset and email. 
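Circling back to the enum-typed loop question a couple of exchanges above: the quoted error arises because C++ (e.g. a cpluspluscheck pass over headers with g++) defines no operator++ for plain enums, while C happily increments them. A minimal illustration of the int-indexed loop style the patch settled on, using a local stand-in enum:

```c
#include <assert.h>

/* Local stand-in for the IOContext enum in the patch. */
typedef enum
{
	IOCONTEXT_BUFFER_POOL,
	IOCONTEXT_BULKREAD,
	IOCONTEXT_BULKWRITE,
	IOCONTEXT_VACUUM,
	IOCONTEXT_NUM_TYPES
} IOContext;

/*
 * "for (IOContext c = 0; c < IOCONTEXT_NUM_TYPES; c++)" is legal C but is
 * rejected by a C++ compiler, since plain enums have no operator++ there.
 * Iterating with an int and casting at the use site keeps the code valid
 * under both compilers, at the cost of the casts noted in the review.
 */
static int
count_io_contexts(void)
{
	int			n = 0;

	for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
	{
		IOContext	c = (IOContext) io_context; /* cast where the enum is needed */

		(void) c;
		n++;
	}
	return n;
}
```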
> > +Datum > > +pg_stat_get_io(PG_FUNCTION_ARGS) > > +{ > > + PgStat_BackendIOContextOps *backends_io_stats; > > + ReturnSetInfo *rsinfo; > > + Datum reset_time; > > + > > + InitMaterializedSRF(fcinfo, 0); > > + rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; > > + > > + backends_io_stats = pgstat_fetch_backend_io_context_ops(); > > + > > + reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp); > > + > > + for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++) > > + { > > + Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc((BackendType) bktype)); > > + bool expect_backend_stats = true; > > + PgStat_IOContextOps *io_context_ops = &backends_io_stats->stats[bktype]; > > + > > + /* > > + * For those BackendTypes without IO Operation stats, skip > > + * representing them in the view altogether. > > + */ > > + expect_backend_stats = pgstat_io_op_stats_collected((BackendType) > > + bktype); > > + > > + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) > > + { > > + const char *io_context_str = pgstat_io_context_desc(io_context); > > + PgStat_IOObjectOps *io_objs = &io_context_ops->data[io_context]; > > + > > + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) > > + { > > + PgStat_IOOpCounters *counters = &io_objs->data[io_object]; > > + const char *io_obj_str = pgstat_io_object_desc(io_object); > > + > > + Datum values[IO_NUM_COLUMNS] = {0}; > > + bool nulls[IO_NUM_COLUMNS] = {0}; > > + > > + /* > > + * Some combinations of IOContext, IOObject, and BackendType are > > + * not valid for any type of IOOp. In such cases, omit the > > + * entire row from the view. 
> > + */
> > + if (!expect_backend_stats ||
> > + !pgstat_bktype_io_context_io_object_valid((BackendType) bktype,
> > + (IOContext) io_context, (IOObject) io_object))
> > + {
> > + pgstat_io_context_ops_assert_zero(counters);
> > + continue;
> > + }
>
> Perhaps mention in a comment two loops up that we don't skip the nested loops
> despite !expect_backend_stats because we want to assert here?

Done.

I've also removed the test for bulkread reads from regress, because CREATE DATABASE is expensive, and added it to the verify_heapam test, since verify_heapam is one of the only code paths that unconditionally uses a BULKREAD strategy.

Thanks,
Melanie
On Wed, Nov 23, 2022 at 12:43 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> Note that 001 fails to compile without 002:
>
> ../src/backend/storage/buffer/bufmgr.c:1257:43: error: ‘from_ring’ undeclared (first use in this function)
> 1257 | StrategyRejectBuffer(strategy, buf, from_ring))

Thanks! I fixed this in version 38 attached in response to Andres upthread [1].

> My "warnings" script informed me about these gripes from MSVC:
>
> [03:42:30.607] c:\cirrus>call sh -c 'if grep ": warning " build.txt; then exit 1; fi; exit 0'
> [03:42:30.749] c:\cirrus\src\backend\storage\buffer\freelist.c(699) : warning C4715: 'IOContextForStrategy': not all control paths return a value
> [03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(190) : warning C4715: 'pgstat_io_context_desc': not all control paths return a value
> [03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(204) : warning C4715: 'pgstat_io_object_desc': not all control paths return a value
> [03:42:30.749] c:\cirrus\src\backend\utils\activity\pgstat_io_ops.c(226) : warning C4715: 'pgstat_io_op_desc': not all control paths return a value
> [03:42:30.749] c:\cirrus\src\backend\utils\adt\pgstatfuncs.c(1816) : warning C4715: 'pgstat_io_op_get_index': not all control paths return a value

Thanks, I forgot to look at those warnings in CI. I added pg_unreachable() and think it silenced the warnings.

> In the docs table, you say things like:
> | io_context vacuum refers to the IO operations incurred while vacuuming and analyzing.
>
> ..but it's a bit unclear (maybe due to the way the docs are rendered).
> I think it may be more clear to say "when <io_context> is
> <vacuum>, ..."

So, because I use this language [column name] [column value] so often in the docs, I would prefer a pattern that is as concise as possible. I agree it may be hard to see due to the rendering. Currently, I am using <varname> tags for the column name and <literal> tags for the column value.
Is there another tag type I could use to perhaps make this more clear without adding additional words? This is what the code looks like for the above docs text: <varname>io_context</varname> <literal>vacuum</literal> refers to the IO

> | acquiring the equivalent number of shared buffers
>
> I don't think "equivalent" fits here, since it's actually acquiring a
> different number of buffers.

I'm planning to do docs changes in a separate patchset after addressing code feedback. I plan to change "equivalent" to "corresponding" here.

> There's a missing period before " The difference is"
>
> The sentence beginning "read plus extended for backend_types" is difficult to
> parse due to having a bulleted list in its middle.

Will address in future version.

> There aren't many references to "IOOps", which is good, because I
> started to read it as "I oops".

Grep'ing for this in the code, I only use the word IOOp(s) in the code when I very clearly want to use the type name -- and never in the docs. But, yes, it does look like "I oops" :)

> > + * Flush IO Operations statistics now. pgstat_report_stat() will flush IO
> > + * Operation stats, however this will not be called after an entire
>
> => I think that's intended to say *until* after ?

Fixed in v38.

> + * Functions to assert that invalid IO Operation counters are zero.
>
> => There's a missing newline above this comment.

Fixed in v38.

> + Assert(counters->evictions == 0 && counters->extends == 0 &&
> + counters->fsyncs == 0 && counters->reads == 0 && counters->reuses
> + == 0 && counters->writes == 0);
>
> => It'd be more readable and also maybe help debugging if these were separate
> assertions.

I have made this change.

> +pgstat_io_op_stats_collected(BackendType bktype)
> +{
> + return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
> + bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
>
> Similar: I'd prefer to see this as 5 "ifs" or a "switch" to return
> false, else return true. But YMMV.
I don't know that separating it into multiple if statements or a switch would make it more clear to me or help me with debugging here. Separately, since this is used in non-assert builds, I would like to ensure it is efficient. Do you know if a switch or if statements will be compiled to the exact same thing as this at useful optimization levels?

> > + * CREATE TEMPORRARY TABLE AS ...
>
> => typo: temporary

Fixed in v38.

> > + if (strategy_io_context && io_op == IOOP_FSYNC)
>
> => Extra space.

Fixed.

> pgstat_count_io_op() has a superfluous newline before "}".

I couldn't find the one you are referencing. Do you mind pasting in the code?

Thanks,
Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Zvaj_yFA_eiSRrLZsjhT0J8cJ044QhZfKuXq6WN5bu5g%40mail.gmail.com
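The C4715 fix mentioned above follows a common pattern for exhaustive-switch description functions: MSVC cannot see that the switch covers every enum value, so the fall-through path must be explicitly terminated. A hedged sketch with local stand-in names, using abort() where PostgreSQL would use pg_unreachable() (which maps to __builtin_unreachable() or abort() depending on the compiler):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in for the IOOp enum in the patch. */
typedef enum
{
	IOOP_EVICT,
	IOOP_EXTEND,
	IOOP_FSYNC,
	IOOP_READ,
	IOOP_REUSE,
	IOOP_WRITE
} IOOp;

/*
 * Even when the switch handles every enum value, MSVC emits C4715 ("not
 * all control paths return a value") because a caller could pass an
 * out-of-range int. Marking the fall-through as unreachable gives every
 * path a terminator and silences the warning; abort() stands in here for
 * pg_unreachable().
 */
static const char *
io_op_desc(IOOp io_op)
{
	switch (io_op)
	{
		case IOOP_EVICT:
			return "evicted";
		case IOOP_EXTEND:
			return "extended";
		case IOOP_FSYNC:
			return "files_synced";
		case IOOP_READ:
			return "read";
		case IOOP_REUSE:
			return "reused";
		case IOOP_WRITE:
			return "written";
	}
	abort();					/* pg_unreachable() in PostgreSQL proper */
}
```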
Thanks for the review, Maciek! I've attached a new version 39 of the patch which addresses your docs feedback from this email as well as docs feedback from Andres in [1] and Justin in [2]. I've made some additional code changes addressing a few of their other points as well, and I've moved the verify_heapam test to a plain sql test in contrib/amcheck instead of putting it in the perl test. This patchset also includes various cleanup, pgindenting, and addressing the sgml indentation issue brought up in the thread. On Mon, Nov 7, 2022 at 1:26 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > > On Thu, Nov 3, 2022 at 10:00 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > I'm reviewing the rendered docs now, and I noticed sentences like this > > > are a bit hard to scan: they force the reader to parse a big list of > > > backend types before even getting to the meat of what this is talking > > > about. Should we maybe reword this so that the backend list comes at > > > the end of the sentence? Or maybe even use a list (e.g., like in the > > > "state" column description in pg_stat_activity)? > > > > Good idea with the bullet points. > > For the lengthy lists, I've added bullet point lists to the docs for > > several of the columns. It is quite long now but, hopefully, clearer? > > Let me know if you think it improves the readability. > > Hmm, I should have tried this before suggesting it. I think the lists > break up the flow of the column description too much. What do you > think about the attached (on top of your patches--attaching it as a > .diff to hopefully not confuse cfbot)? I kept the lists for backend > types but inlined the others as a middle ground. I also added a few > omitted periods and reworded "read plus extended" to avoid starting > the sentence with a (lowercase) varname (I think in general it's fine > to do that, but the more complicated sentence structure here makes it > easier to follow if the sentence starts with a capital). 
> > Alternately, what do you think about pulling equivalencies to existing > views out of the main column descriptions, and adding them after the > main table as a sort of footnote? Most view docs don't have anything > like that, but pg_stat_replication does and it might be a good pattern > to follow. > > Thoughts? Thanks for including a patch! In the attached v39, I've taken your suggestion of flattening some of the lists and done some rewording as well. I have also moved the note about equivalence with pg_stat_statements columns to the pg_stat_statements documentation. The result is quite a bit different than what I had before, so I would be interested to hear your thoughts. My concern with the blue "note" section like you mentioned is that it would be harder to read the lists of backend types than it was in the tabular format. > > > + <varname>io_context</varname>s. When a <quote>Buffer Access > > > + Strategy</quote> reuses a buffer in the strategy ring, it must evict its > > > + contents, incrementing <varname>reused</varname>. When a <quote>Buffer > > > + Access Strategy</quote> adds a new shared buffer to the strategy ring > > > + and this shared buffer is occupied, the <quote>Buffer Access > > > + Strategy</quote> must evict the contents of the shared buffer, > > > + incrementing <varname>evicted</varname>. > > > > > > I think the parallel phrasing here makes this a little hard to follow. > > > Specifically, I think "must evict its contents" for the strategy case > > > sounds like a bad thing, but in fact this is a totally normal thing > > > that happens as part of strategy access, no? The idea is you probably > > > won't need that buffer again, so it's fine to evict it. I'm not sure > > > how to reword, but I think the current phrasing is misleading. > > > > I had trouble rephrasing this. I changed a few words. I see what you > > mean. 
It is worth noting that reusing strategy buffers when there are > > buffers on the freelist may not be the best behavior, so I wouldn't > > necessarily consider "reused" a good thing. However, I'm not sure how > > much the user could really do about this. I would at least like this > > phrasing to be clear (evicted is for shared buffers, reused is for > > strategy buffers), so, perhaps this section requires more work. > > Oh, I see. I think the updated wording works better. Although I think > we can drop the quotes around "Buffer Access Strategy" here. They're > useful when defining the term originally, but after that I think it's > clearer to use the term unquoted. Thanks! I've fixed this. > Just to understand this better myself, though: can you clarify when > "reused" is not a normal, expected part of the strategy execution? I > was under the impression that a ring buffer is used because each page > is needed only "once" (i.e., for one set of operations) for the > command using the strategy ring buffer. Naively, in that situation, it > seems better to reuse a no-longer-needed buffer than to claim another > buffer from the freelist (where other commands may eventually make > better use of it). You are right: reused is a normal, expected part of strategy execution. And you are correct: the idea behind reusing existing strategy buffers instead of taking buffers off the freelist is to leave those buffers for blocks that we might expect to be accessed more than once. In practice, however, if you happen to not be using many shared buffers, and then do a large COPY, for example, you will end up doing a bunch of writes (in order to reuse the strategy buffers) that you perhaps didn't need to do at that time had you leveraged the freelist. I think the decision about which tradeoff to make is quite contentious, though. 
> Some more notes on the docs patch: > > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>io_context</structfield> <type>text</type> > + </para> > + <para> > + The context or location of an IO operation. > + </para> > + <itemizedlist> > + <listitem> > + <para> > + <varname>io_context</varname> <literal>buffer pool</literal> refers to > + IO operations on data in both the shared buffer pool and process-local > + buffer pools used for temporary relation data. > + </para> > + <para> > + Operations on temporary relations are tracked in > + <varname>io_context</varname> <literal>buffer pool</literal> and > + <varname>io_object</varname> <literal>temp relation</literal>. > + </para> > + <para> > + Operations on permanent relations are tracked in > + <varname>io_context</varname> <literal>buffer pool</literal> and > + <varname>io_object</varname> <literal>relation</literal>. > + </para> > + </listitem> > > For this column, you repeat "io_context" in the list describing the > possible values of the column. Enum-style columns in other tables > don't do that (e.g., the pg_stat_activty "state" column). I think it > might read better to omit "io_context" from the list. I changed this. > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>io_object</structfield> <type>text</type> > + </para> > + <para> > + Object operated on in a given <varname>io_context</varname> by a given > + <varname>backend_type</varname>. > + </para> > > Is this a fixed set of objects we should list, like for io_context? I've added this. - Melanie [1] https://www.postgresql.org/message-id/20221121003815.qnwlnz2lhkow2e5w%40awork3.anarazel.de [2] https://www.postgresql.org/message-id/20221123054329.GG11463%40telsasoft.com
On Mon, Nov 28, 2022 at 09:08:36PM -0500, Melanie Plageman wrote: > > +pgstat_io_op_stats_collected(BackendType bktype) > > +{ > > + return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER && > > + bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER; > > > > Similar: I'd prefer to see this as 5 "ifs" or a "switch" to return > > false, else return true. But YMMV. > > I don't know that separating it into multiple if statements or a switch > would make it more clear to me or help me with debugging here. > > Separately, since this is used in non-assert builds, I would like to > ensure it is efficient. Do you know if a switch or if statements will > be compiled to the exact same thing as this at useful optimization > levels? This doesn't seem like a detail worth much bother, but I did a test. With -O2 (but not -O1 nor -Og) the assembly (gcc 9.4) is the same when written like: + if (bktype == B_INVALID) + return false; + if (bktype == B_ARCHIVER) + return false; + if (bktype == B_LOGGER) + return false; + if (bktype == B_WAL_RECEIVER) + return false; + if (bktype == B_WAL_WRITER) + return false; + + return true; objdump --disassemble=pgstat_io_op_stats_collected src/backend/postgres_lib.a.p/utils_activity_pgstat_io_ops.c.o 0000000000000110 <pgstat_io_op_stats_collected>: 110: f3 0f 1e fa endbr64 114: b8 01 00 00 00 mov $0x1,%eax 119: 83 ff 0d cmp $0xd,%edi 11c: 77 10 ja 12e <pgstat_io_op_stats_collected+0x1e> 11e: b8 03 29 00 00 mov $0x2903,%eax 123: 89 f9 mov %edi,%ecx 125: 48 d3 e8 shr %cl,%rax 128: 48 f7 d0 not %rax 12b: 83 e0 01 and $0x1,%eax 12e: c3 retq I was surprised, but the assembly is *not* the same when I used a switch{}. I think it's fine to write however you want. > > pgstat_count_io_op() has a superflous newline before "}". > > I couldn't find the one you are referencing. > Do you mind pasting in the code? + case IOOP_WRITE: + pending_counters->writes++; + break; + } + --> here <-- +} -- Justin
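The folding Justin observed — gcc at -O2 turning the comparison chain into a single bitmask test — can be checked for behavioral equivalence directly. The enum ordering below is an illustrative stand-in for the real BackendType in miscadmin.h, but with this ordering the skip mask works out to 0x2903 (bits 0, 1, 8, 11, 13), matching the `mov $0x2903,%eax` constant in the disassembly above:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in ordering for BackendType (real enum: miscadmin.h). */
typedef enum
{
	B_INVALID,					/* bit 0 */
	B_ARCHIVER,					/* bit 1 */
	B_AUTOVAC_LAUNCHER,
	B_AUTOVAC_WORKER,
	B_BACKEND,
	B_BG_WORKER,
	B_BG_WRITER,
	B_CHECKPOINTER,
	B_LOGGER,					/* bit 8 */
	B_STANDALONE_BACKEND,
	B_STARTUP,
	B_WAL_RECEIVER,				/* bit 11 */
	B_WAL_SENDER,
	B_WAL_WRITER,				/* bit 13 */
	BACKEND_NUM_TYPES
} BackendType;

/* The form from the patch: one chained boolean expression. */
static bool
collected_expr(BackendType bktype)
{
	return bktype != B_INVALID && bktype != B_ARCHIVER && bktype != B_LOGGER &&
		bktype != B_WAL_RECEIVER && bktype != B_WAL_WRITER;
}

/* What gcc -O2 effectively emits: shift a constant mask and test one bit. */
static bool
collected_mask(BackendType bktype)
{
	const unsigned mask = (1u << B_INVALID) | (1u << B_ARCHIVER) |
		(1u << B_LOGGER) | (1u << B_WAL_RECEIVER) | (1u << B_WAL_WRITER);

	return ((mask >> bktype) & 1u) == 0;
}
```

Since both forms compile to the same code anyway, readability can drive the choice, as Justin concludes.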
On Tue, Nov 29, 2022 at 5:13 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Thanks for the review, Maciek! > > I've attached a new version 39 of the patch which addresses your docs > feedback from this email as well as docs feedback from Andres in [1] and > Justin in [2]. This looks great! Just a couple of minor comments. > You are right: reused is a normal, expected part of strategy > execution. And you are correct: the idea behind reusing existing > strategy buffers instead of taking buffers off the freelist is to leave > those buffers for blocks that we might expect to be accessed more than > once. > > In practice, however, if you happen to not be using many shared buffers, > and then do a large COPY, for example, you will end up doing a bunch of > writes (in order to reuse the strategy buffers) that you perhaps didn't > need to do at that time had you leveraged the freelist. I think the > decision about which tradeoff to make is quite contentious, though. Thanks for the explanation--that makes sense. > On Mon, Nov 7, 2022 at 1:26 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > > Alternately, what do you think about pulling equivalencies to existing > > views out of the main column descriptions, and adding them after the > > main table as a sort of footnote? Most view docs don't have anything > > like that, but pg_stat_replication does and it might be a good pattern > > to follow. > > > > Thoughts? > > Thanks for including a patch! > In the attached v39, I've taken your suggestion of flattening some of > the lists and done some rewording as well. I have also moved the note > about equivalence with pg_stat_statements columns to the > pg_stat_statements documentation. The result is quite a bit different > than what I had before, so I would be interested to hear your thoughts. > > My concern with the blue "note" section like you mentioned is that it > would be harder to read the lists of backend types than it was in the > tabular format. 
Oh, I wasn't thinking of doing a separate "note": just additional paragraphs of text after the table (like what pg_stat_replication has before its "note", or the brief comment after the pg_stat_archiver table). But I think the updated docs work also. + <para> + The context or location of an IO operation. + </para> maybe "...of an IO operation:" (colon) instead? + default. Future values could include those derived from + <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and + constant multipliers once non-block-oriented IO (e.g. temporary file IO) + is tracked here. I know Lukas had commented that we should communicate that the goal is to eventually provide relatively comprehensive I/O stats in this view (you do that in the view description and I think that works), and this is sort of along those lines, but I think speculative documentation like this is not all that helpful. I'd drop this last sentence. Just my two cents. + <para> + <varname>evicted</varname> in <varname>io_context</varname> + <literal>buffer pool</literal> and <varname>io_object</varname> + <literal>temp relation</literal> counts the number of times a block of + data from an existing local buffer was evicted in order to replace it + with another block, also in local buffers. + </para> Doesn't this follow from the first sentence of the column description? I think we could drop this, no? Otherwise, the docs look good to me. Thanks, Maciek
Hi, - I think it might be worth to rename IOCONTEXT_BUFFER_POOL to IOCONTEXT_{NORMAL, PLAIN, DEFAULT}. I'd like at some point to track WAL IO, temporary file IO etc, and it doesn't seem useful to define a version of BUFFER_POOL for each of them. And it'd make it less confusing, because all the other existing contexts are also in the buffer pool (for now, can't wait for "bypass" or whatever to be tracked as well). - given that IOContextForStrategy() is defined in freelist.c, and that declaring it in pgstat.h requires including buf.h, I think it's probably better to move IOContextForStrategy()'s declaration to freelist.h (doesn't exist, but whatever the right one is) - pgstat_backend_io_stats_assert_well_formed() doesn't seem to belong in pgstat.c. Why not pgstat_io_ops.c? - Do pgstat_io_context_ops_assert_zero(), pgstat_io_op_assert_zero() have to be in pgstat.h? I think the only non-trivial thing is the first point; the rest is stuff that I can also evolve during commit. Greetings, Andres Freund
Attached is v40. I have addressed the feedback from Justin [1] and Maciek [2] as well. I took all of the suggestions regarding the docs that Maciek made, including the following: > + default. Future values could include those derived from > + <symbol>XLOG_BLCKSZ</symbol>, once WAL IO is tracked in this view, and > + constant multipliers once non-block-oriented IO (e.g. temporary file IO) > + is tracked here. > > > I know Lukas had commented that we should communicate that the goal is > to eventually provide relatively comprehensive I/O stats in this view > (you do that in the view description and I think that works), and this > is sort of along those lines, but I think speculative documentation > like this is not all that helpful. I'd drop this last sentence. Just > my two cents. I have removed this and added the relevant part of this as a comment to the view generating function pg_stat_get_io(). On Mon, Dec 5, 2022 at 2:32 PM Andres Freund <andres@anarazel.de> wrote: > - I think it might be worth to rename IOCONTEXT_BUFFER_POOL to > IOCONTEXT_{NORMAL, PLAIN, DEFAULT}. I'd like at some point to track WAL IO , > temporary file IO etc, and it doesn't seem useful to define a version of > BUFFER_POOL for each of them. And it'd make it less confusing, because all > the other existing contexts are also in the buffer pool (for now, can't wait > for "bypass" or whatever to be tracked as well). In attached v40, I've renamed IOCONTEXT_BUFFER_POOL to IOCONTEXT_NORMAL. > - given that IOContextForStrategy() is defined in freelist.c, and that > declaring it in pgstat.h requires including buf.h, I think it's probably > better to move IOContextForStrategy()'s declaration to freelist.h (doesn't > exist, but whatever the right one is) I have moved it to buf_internals.h. > - pgstat_backend_io_stats_assert_well_formed() doesn't seem to belong in > pgstat.c. Why not pgstat_io_ops.c? I put it in pgstat.c because it is only used there -- so I made it static. 
I've moved it to pg_stat_io_ops.c and declared it in pgstat_internal.h > - Do pgstat_io_context_ops_assert_zero(), pgstat_io_op_assert_zero() have to > be in pgstat.h? They are used in pgstatfuncs.c, which I presume should not include pgstat_internal.h. Or did you mean that I should not put them in a header file at all? - Melanie [1] https://www.postgresql.org/message-id/20221130025113.GD24131%40telsasoft.com [2] https://www.postgresql.org/message-id/CAOtHd0BfFdMqO7-zDOk%3DiJTatzSDgVcgYcaR1_wk0GS4NN%2BRUQ%40mail.gmail.com
Attachment
In the pg_stat_statements docs, there are several column descriptions like Total number of ... by the statement You added an additional sentence to some describing the equivalent pg_stat_io values, but you only added a period to the previous sentence for shared_blks_read (for other columns, the additional description just follows directly). These should be consistent. Otherwise, the docs look good to me.
Hi, On 2022-10-06 13:42:09 -0400, Melanie Plageman wrote: > > Additionally, some minor notes: > > > > - Since the stats are counting blocks, it would make sense to prefix the view columns with "blks_", and word them in the past tense (to match current style), i.e. "blks_written", "blks_read", "blks_extended", "blks_fsynced" (realistically one would combine this new view with other data e.g. from pg_stat_database or pg_stat_statements, which all use the "blks_" prefix, and stop using pg_stat_bgwriter for this which does not use such a prefix) > > I have changed the column names to be in the past tense. For a while I was convinced by the consistency argument (after Melanie pointing it out to me). But the more I look, the less convinced I am. The existing IO related stats in pg_stat_database, pg_stat_bgwriter aren't past tense, just the ones in pg_stat_statements. pg_stat_database uses past tense for tup_*, but not xact_*, deadlocks, checksum_failures etc. And even pg_stat_statements isn't consistent about it - otherwise it'd be 'planned' instead of 'plans', 'called' instead of 'calls' etc. I started to look at the naming "tense" issue again, after I got "confused" about "extended", because that somehow makes me think about more detailed stats or such, rather than files getting extended. ISTM that 'evictions', 'extends', 'fsyncs', 'reads', 'reuses', 'writes' are clearer than the past tense versions, and about as consistent with existing columns. FWIW, I've been hacking on this code a bunch, mostly around renaming things and changing the 'stacking' of the patches. My current state is at https://github.com/anarazel/postgres/tree/pg_stat_io A bit more to do before posting the edited version... Greetings, Andres Freund
On Wed, Dec 28, 2022 at 6:56 PM Andres Freund <andres@anarazel.de> wrote: > > FWIW, I've been hacking on this code a bunch, mostly around renaming things > and changing the 'stacking' of the patches. My current state is at > https://github.com/anarazel/postgres/tree/pg_stat_io > A bit more to do before posting the edited version... Here is the bit more done. I've attached a new version 42 which incorporates all of Andres' changes on his branch (which I am considering version 41). I have fixed various issues with counting fsyncs and added more comments and done cosmetic cleanup. The docs have substantial changes but still require more work: - The comparisons between columns in pg_stat_io and pg_stat_statements have been removed, since the granularity and lifetime are so different, comparing them isn't quite correct. - The lists of backend types still take up a lot of visual space in the definitions, which doesn't look great. I'm not sure what to do about that. - Andres has pointed out that it is difficult to read the definitions of the columns because of the added clutter of the interpretations and the comparisons to other stats views. I'm not sure if I should cut these. He and I tried adding that information as a note and in other various table types, however none of the alternatives were an improvement. Besides docs, there is one large change to the code which I am currently working on, which is to change PgStat_IOOpCounters into an array of PgStatCounters instead of having individual members for each IOOp type. I hadn't done this previously because the additional level of nesting seemed confusing. However, it seems it would simplify the code quite a bit and is probably worth doing. - Melanie
Attachment
On Mon, Jan 2, 2023 at 5:46 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > Besides docs, there is one large change to the code which I am currently > working on, which is to change PgStat_IOOpCounters into an array of > PgStatCounters instead of having individual members for each IOOp type. > I hadn't done this previously because the additional level of nesting > seemed confusing. However, it seems it would simplify the code quite a > bit and is probably worth doing. As described above, attached v43 uses an array for the PgStatCounters of IOOps instead of struct members.
Attachment
On Mon, Jan 2, 2023 at 8:15 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Jan 2, 2023 at 5:46 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > Besides docs, there is one large change to the code which I am currently > > working on, which is to change PgStat_IOOpCounters into an array of > > PgStatCounters instead of having individual members for each IOOp type. > > I hadn't done this previously because the additional level of nesting > > seemed confusing. However, it seems it would simplify the code quite a > > bit and is probably worth doing. > > As described above, attached v43 uses an array for the PgStatCounters of > IOOps instead of struct members. This wasn't quite a multi-dimensional array. Attached is v44, in which I have removed all of the granular struct types -- PgStat_IOOps, PgStat_IOContext, and PgStat_IOObject by collapsing them into a single array of PgStat_Counters in a new struct PgStat_BackendIO. I needed to keep this in addition to PgStat_IO to have a data type for backends to track their stats in locally. I've also done another round of cleanup. - Melanie
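The collapsed layout described above amounts to roughly the following sketch (the names and dimensions are approximations for illustration, not the actual patch code):

```c
#include <assert.h>

/* Rough sketch of collapsing per-IOOp struct members into one nested
 * counter array, as described in the message above (names and
 * dimensions are illustrative, not the patch's definitions). */
typedef long long PgStat_Counter;

#define IOOBJECT_NUM_TYPES 2    /* e.g. relation, temp relation */
#define IOCONTEXT_NUM_TYPES 4   /* e.g. bulkread, bulkwrite, normal, vacuum */
#define IOOP_NUM_TYPES 6        /* e.g. evict, extend, fsync, read, reuse, write */

typedef struct PgStat_BackendIO
{
    PgStat_Counter counts[IOOBJECT_NUM_TYPES]
                         [IOCONTEXT_NUM_TYPES]
                         [IOOP_NUM_TYPES];
} PgStat_BackendIO;

/* With the array layout, bumping a counter is a single indexed
 * increment instead of a switch over individual struct members. */
static void
count_io_op(PgStat_BackendIO *io, int io_object, int io_context, int io_op)
{
    io->counts[io_object][io_context][io_op]++;
}
```

The extra level of nesting is the "confusing" part mentioned earlier, but it lets accumulation, flushing, and display code loop over the dimensions generically.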
Attachment
Attached is v45 of the patchset. I've done some additional code cleanup and changes. The most significant change, however, is the docs. I've separated the docs into its own patch for ease of review. The docs patch here was edited and co-authored by Samay Sharma. I'm not sure if the order of pg_stat_io in the docs is correct. The significant changes are removal of all "correspondence" or "equivalence"-related sections (those explaining how other IO stats were the same or different from pg_stat_io columns). I've tried to remove references to "strategies" and "Buffer Access Strategy" as much as possible. I've moved the advice and interpretation section to the bottom -- outside of the table of definitions. Since this page is primarily a reference page, I agree with Samay that incorporating interpretation into the column definitions adds clutter and confusion. I think the best course would be to have an "Interpreting Statistics" section. I suggest a structure like the following for this section: - Statistics Collection Configuration - Viewing Statistics - Statistics Views Reference - Statistics Functions Reference - Interpreting Statistics As an aside, this section of the docs has some other structural issues as well. For example, I'm not sure it makes sense to have the dynamic statistics views as sub-sections under 28.2, which is titled "The Cumulative Statistics System." In fact the docs say this under Section 28.2 https://www.postgresql.org/docs/current/monitoring-stats.html "PostgreSQL also supports reporting dynamic information about exactly what is going on in the system right now, such as the exact command currently being executed by other server processes, and which other connections exist in the system. This facility is independent of the cumulative statistics system." So, it is a bit weird that they are defined under the section titled "The Cumulative Statistics System". 
In this version of the patchset, I have not attempted a new structure but instead moved the advice/interpretation for pg_stat_io to below the table containing the column definitions. - Melanie
Attachment
On Tue, 10 Jan 2023 at 02:41, Melanie Plageman <melanieplageman@gmail.com> wrote: > > Attached is v45 of the patchset. I've done some additional code cleanup > and changes. The most significant change, however, is the docs. I've > separated the docs into its own patch for ease of review. > > The docs patch here was edited and co-authored by Samay Sharma. > I'm not sure if the order of pg_stat_io in the docs is correct. > > The significant changes are removal of all "correspondence" or > "equivalence"-related sections (those explaining how other IO stats were > the same or different from pg_stat_io columns). > > I've tried to remove references to "strategies" and "Buffer Access > Strategy" as much as possible. > > I've moved the advice and interpretation section to the bottom -- > outside of the table of definitions. Since this page is primarily a > reference page, I agree with Samay that incorporating interpretation > into the column definitions adds clutter and confusion. > > I think the best course would be to have an "Interpreting Statistics" > section. > > I suggest a structure like the following for this section: > - Statistics Collection Configuration > - Viewing Statistics > - Statistics Views Reference > - Statistics Functions Reference > - Interpreting Statistics > > As an aside, this section of the docs has some other structural issues > as well. > > For example, I'm not sure it makes sense to have the dynamic statistics > views as sub-sections under 28.2, which is titled "The Cumulative > Statistics System." > > In fact the docs say this under Section 28.2 > https://www.postgresql.org/docs/current/monitoring-stats.html > > "PostgreSQL also supports reporting dynamic information about exactly > what is going on in the system right now, such as the exact command > currently being executed by other server processes, and which other > connections exist in the system. This facility is independent of the > cumulative statistics system." 
> > So, it is a bit weird that they are defined under the section titled > "The Cumulative Statistics System". > > In this version of the patchset, I have not attempted a new structure > but instead moved the advice/interpretation for pg_stat_io to below the > table containing the column definitions. For some reason cfbot is not able to apply this patch as in [1], please have a look and post an updated patch if required: === Applying patches on top of PostgreSQL commit ID 3c6fc58209f24b959ee18f5d19ef96403d08f15c === === applying patch ./v45-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch patching file src/backend/storage/buffer/bufmgr.c patching file src/backend/storage/buffer/localbuf.c patching file src/backend/utils/activity/pgstat.c patching file src/backend/utils/activity/pgstat_relation.c patching file src/backend/utils/adt/pgstatfuncs.c patching file src/include/pgstat.h patching file src/include/utils/pgstat_internal.h === applying patch ./v45-0002-pgstat-Infrastructure-to-track-IO-operations.patch gpatch: **** Only garbage was found in the patch input. [1] - http://cfbot.cputube.org/patch_41_3272.log Regards, Vignesh
> Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type The patch can/will fail with: CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION ''; +WARNING: tablespaces created by regression test cases should have names starting with "regress_" CREATE TABLESPACE test_stats LOCATION ''; +WARNING: tablespaces created by regression test cases should have names starting with "regress_" (I already sent patches to address the omission in cirrus.yml) 1760 : errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\"."))); => Do you want to put these in order? pgstat_get_io_op_name() isn't currently being hit by tests; actually, it's completely unused. FlushRelationBuffers() isn't being hit for local buffers. > + <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry> > + <entry> > + One row per backend type, context, target object combination showing > + cluster-wide I/O statistics. I suggest: "One row for each combination of .." > + The <structname>pg_stat_io</structname> and > + <structname>pg_statio_</structname> set of views are especially useful for > + determining the effectiveness of the buffer cache. When the number of actual > + disk reads is much smaller than the number of buffer hits, then the cache is > + satisfying most read requests without invoking a kernel call. I would change this to say "Postgres' own buffer cache is satisfying ..."
> However, these > + statistics do not give the entire story: due to the way in which > + <productname>PostgreSQL</productname> handles disk I/O, data that is not in > + the <productname>PostgreSQL</productname> buffer cache might still reside in > + the kernel's I/O cache, and might therefore still be fetched without I suggest to refer to "the kernel's page cache" > + The <structname>pg_stat_io</structname> view will contain one row for each > + backend type, I/O context, and target I/O object combination showing > + cluster-wide I/O statistics. Combinations which do not make sense are > + omitted. "..for each combination of .." > + <varname>io_context</varname> for a type of I/O operation. For "for I/O operations" > + <literal>vacuum</literal>: I/O operations done outside of shared > + buffers incurred while vacuuming and analyzing permanent relations. s/incurred/performed/ > + <literal>bulkread</literal>: Qualifying large read I/O operations > + done outside of shared buffers, for example, a sequential scan of a > + large table. I don't think it's correct to say that it's "outside of" shared-buffers. s/Qualifying/Certain/ > + <literal>bulkwrite</literal>: Qualifying large write I/O operations > + done outside of shared buffers, such as <command>COPY</command>. Same > + Target object of an I/O operation. Possible values are: > + <itemizedlist> > + <listitem> > + <para> > + <literal>relation</literal>: This includes permanent relations. It says "includes permanent" but what it seems to mean is that it is "exclusive of temporary relations". > + <row> > + <entry role="catalog_table_entry"> > + <para role="column_definition"> > + <structfield>read</structfield> <type>bigint</type> > + </para> > + <para> > + Number of read operations in units of <varname>op_bytes</varname>. This looks too much like it means "bytes".
Should say: "in number of blocks of size >op_bytes<" But wait - is it the number of read operations "in units of op_bytes" (which would mean this is already multiplied by op_bytes, and is in units of bytes). Or the "number of read operations" *of* op_bytes chunks? Which would mean this is a "pure" number, and could be multiplied by op_bytes to obtain a size in bytes. > + Number of write operations in units of <varname>op_bytes</varname>. > + Number of relation extend operations in units of > + <varname>op_bytes</varname>. same > + In <varname>io_context</varname> <literal>normal</literal>, this counts > + the number of times a block was evicted from a buffer and replaced with > + another block. In <varname>io_context</varname>s > + <literal>bulkwrite</literal>, <literal>bulkread</literal>, and > + <literal>vacuum</literal>, this counts the number of times a block was > + evicted from shared buffers in order to add the shared buffer to a > + separate size-limited ring buffer. This never defines what "evicted" means. Does it mean that a dirty buffer was written out? > + The number of times an existing buffer in a size-limited ring buffer > + outside of shared buffers was reused as part of an I/O operation in the > + <literal>bulkread</literal>, <literal>bulkwrite</literal>, or > + <literal>vacuum</literal> <varname>io_context</varname>s. Maybe say "as part of a bulk I/O operation (bulkread, bulkwrite, or vacuum)." > + <para> > + <structname>pg_stat_io</structname> can be used to inform database tuning. > + For example: > + <itemizedlist> > + <listitem> > + <para> > + A high <varname>evicted</varname> count can indicate that shared buffers > + should be increased. > + </para> > + </listitem> > + <listitem> > + <para> > + Client backends rely on the checkpointer to ensure data is persisted to > + permanent storage.
Large numbers of <varname>files_synced</varname> by > + <literal>client backend</literal>s could indicate a misconfiguration of > + shared buffers or of checkpointer. More information on checkpointer of *the* checkpointer > + Normally, client backends should be able to rely on auxiliary processes > + like the checkpointer and background writer to write out dirty data as *the* bg writer > + much as possible. Large numbers of writes by client backends could > + indicate a misconfiguration of shared buffers or of checkpointer. More *the* ckpointer Should this link to various docs for checkpointer/bgwriter ? Maybe the docs for ALTER/COPY/VACUUM/CREATE/etc should be updated to refer to some central description of ring buffers. Maybe something should be included to the appendix. -- Justin
Attached is v46. On Wed, Dec 28, 2022 at 6:56 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-10-06 13:42:09 -0400, Melanie Plageman wrote: > > > Additionally, some minor notes: > > > > > > - Since the stats are counting blocks, it would make sense to prefix the view columns with "blks_", and word them in the past tense (to match current style), i.e. "blks_written", "blks_read", "blks_extended", "blks_fsynced" (realistically one would combine this new view with other data e.g. from pg_stat_database or pg_stat_statements, which all use the "blks_" prefix, and stop using pg_stat_bgwriter for this which does not use such a prefix) > > > > I have changed the column names to be in the past tense. > > For a while I was convinced by the consistency argument (after Melanie > pointing it out to me). But the more I look, the less convinced I am. The > existing IO related stats in pg_stat_database, pg_stat_bgwriter aren't past > tense, just the ones in pg_stat_statements. pg_stat_database uses past tense > for tup_*, but not xact_*, deadlocks, checksum_failures etc. > > And even pg_stat_statements isn't consistent about it - otherwise it'd be > 'planned' instead of 'plans', 'called' instead of 'calls' etc. > > I started to look at the naming "tense" issue again, after I got "confused" > about "extended", because that somehow makes me think about more detailed > stats or such, rather than files getting extended. > > ISTM that 'evictions', 'extends', 'fsyncs', 'reads', 'reuses', 'writes' are > clearer than the past tense versions, and about as consistent with existing > columns. I have updated the column names to the above recommendation.
On Wed, Jan 11, 2023 at 11:32 AM vignesh C <vignesh21@gmail.com> wrote: > > For some reason cfbot is not able to apply this patch as in [1], > please have a look and post an updated patch if required: > === Applying patches on top of PostgreSQL commit ID > 3c6fc58209f24b959ee18f5d19ef96403d08f15c === > === applying patch > ./v45-0001-pgindent-and-some-manual-cleanup-in-pgstat-relat.patch > patching file src/backend/storage/buffer/bufmgr.c > patching file src/backend/storage/buffer/localbuf.c > patching file src/backend/utils/activity/pgstat.c > patching file src/backend/utils/activity/pgstat_relation.c > patching file src/backend/utils/adt/pgstatfuncs.c > patching file src/include/pgstat.h > patching file src/include/utils/pgstat_internal.h > === applying patch ./v45-0002-pgstat-Infrastructure-to-track-IO-operations.patch > gpatch: **** Only garbage was found in the patch input. > > [1] - http://cfbot.cputube.org/patch_41_3272.log > This was an issue with cfbot that Thomas has now fixed as he describes in [1]. On Wed, Jan 11, 2023 at 4:58 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type > > The patch can/will fail with: > > CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION ''; > +WARNING: tablespaces created by regression test cases should have names starting with "regress_" > > CREATE TABLESPACE test_stats LOCATION ''; > +WARNING: tablespaces created by regression test cases should have names starting with "regress_" > > (I already sent patches to address the omission in cirrus.yml) Thanks. I've fixed this. I make a tablespace in amcheck -- are there recommendations for naming tablespaces in contrib also? > > 1760 : errhint("Target must be \"archiver\", \"io\", \"bgwriter\", \"recovery_prefetch\", or \"wal\"."))); > => Do you want to put these in order? Thanks. Fixed. > pgstat_get_io_op_name() isn't currently being hit by tests; actually, > it's completely unused. Deleted it.
> FlushRelationBuffers() isn't being hit for local buffers. I added a test. > > + <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry> > > + <entry> > > + One row per backend type, context, target object combination showing > > + cluster-wide I/O statistics. > > I suggest: "One row for each combination of .." I have made this change. > > + The <structname>pg_stat_io</structname> and > > + <structname>pg_statio_</structname> set of views are especially useful for > > + determining the effectiveness of the buffer cache. When the number of actual > > + disk reads is much smaller than the number of buffer hits, then the cache is > > + satisfying most read requests without invoking a kernel call. > > I would change this to say "Postgres' own buffer cache is satisfying ..." So, this is existing copy to which I added the pg_stat_io view name and re-flowed the indentation. However, I think your suggestions are a good idea, so I've taken them and just rewritten this paragraph altogether. > > > However, these > > + statistics do not give the entire story: due to the way in which > > + <productname>PostgreSQL</productname> handles disk I/O, data that is not in > > + the <productname>PostgreSQL</productname> buffer cache might still reside in > > + the kernel's I/O cache, and might therefore still be fetched without > > I suggest to refer to "the kernel's page cache" same applies here. > > > + The <structname>pg_stat_io</structname> view will contain one row for each > > + backend type, I/O context, and target I/O object combination showing > > + cluster-wide I/O statistics. Combinations which do not make sense are > > + omitted. > > "..for each combination of .." I have changed this. > > > + <varname>io_context</varname> for a type of I/O operation.
For > > "for I/O operations" So I actually mean for a type of I/O operation -- that is, relation data is normally written to a shared buffer but sometimes we bypass shared buffers and just call write and sometimes we use a buffer access strategy and write it to a special ring buffer (made up of buffers stolen from shared buffers, but still). So I don't want to say "for I/O operations" because I think that would imply that writes of relation data will always be in the same IO Context. > > > + <literal>vacuum</literal>: I/O operations done outside of shared > > + buffers incurred while vacuuming and analyzing permanent relations. > > s/incurred/performed/ I changed this. > > > + <literal>bulkread</literal>: Qualifying large read I/O operations > > + done outside of shared buffers, for example, a sequential scan of a > > + large table. > > I don't think it's correct to say that it's "outside of" shared-buffers. I suppose "outside of" gives the wrong idea. But I need to make clear that this I/O is to and from buffers which are not a part of shared buffers right now -- they may still be accessible from the same data structures which access shared buffers but they are currently being used in a different way. > s/Qualifying/Certain/ I feel like qualifying is more specific than certain, but I would be open to changing it if there was a specific reason you don't like it. > > > + <literal>bulkwrite</literal>: Qualifying large write I/O operations > > + done outside of shared buffers, such as <command>COPY</command>. > > Same > > > + Target object of an I/O operation. Possible values are: > > + <itemizedlist> > > + <listitem> > > + <para> > > + <literal>relation</literal>: This includes permanent relations. > > It says "includes permanent" but what seems to mean is that it > "exclusive of temporary relations". I've changed this. 
> > + <row> > > + <entry role="catalog_table_entry"> > > + <para role="column_definition"> > > + <structfield>read</structfield> <type>bigint</type> > > + </para> > > + <para> > > + Number of read operations in units of <varname>op_bytes</varname>. > > This looks too much like it means "bytes". > Should say: "in number of blocks of size >op_bytes<" > > But wait - is it the number of read operations "in units of op_bytes" > (which would mean this is already multiplied by op_bytes, and is in units > of bytes). > > Or the "number of read operations" *of* op_bytes chunks? Which would > mean this is a "pure" number, and could be multiplied by op_bytes to > obtain a size in bytes. It is the number of read operations of op_bytes size -- thanks so much for pointing this out. The wording was really unclear. The idea is that you can do something like: SELECT pg_size_pretty(reads * op_bytes) FROM pg_stat_io; and get it in bytes. The view will contain other types of IO that are not in BLCKSZ chunks, which is where this column will be handy. > > > + Number of write operations in units of <varname>op_bytes</varname>. > > > + Number of relation extend operations in units of > > + <varname>op_bytes</varname>. > > same > > > + In <varname>io_context</varname> <literal>normal</literal>, this counts > > + the number of times a block was evicted from a buffer and replaced with > > + another block. In <varname>io_context</varname>s > > + <literal>bulkwrite</literal>, <literal>bulkread</literal>, and > > + <literal>vacuum</literal>, this counts the number of times a block was > > + evicted from shared buffers in order to add the shared buffer to a > > + separate size-limited ring buffer. > > This never defines what "evicted" means. Does it mean that a dirty > buffer was written out? Thanks. I've updated this.
> > > + The number of times an existing buffer in a size-limited ring buffer > > + outside of shared buffers was reused as part of an I/O operation in the > > + <literal>bulkread</literal>, <literal>bulkwrite</literal>, or > > + <literal>vacuum</literal> <varname>io_context</varname>s. > > Maybe say "as part of a bulk I/O operation (bulkread, bulkwrite, or > vacuum)." I've changed this. > > > + <para> > > + <structname>pg_stat_io</structname> can be used to inform database tuning. > > > + For example: > > + <itemizedlist> > > + <listitem> > > + <para> > > + A high <varname>evicted</varname> count can indicate that shared buffers > > + should be increased. > > + </para> > > + </listitem> > > + <listitem> > > + <para> > > + Client backends rely on the checkpointer to ensure data is persisted to > > + permanent storage. Large numbers of <varname>files_synced</varname> by > > + <literal>client backend</literal>s could indicate a misconfiguration of > > + shared buffers or of checkpointer. More information on checkpointer > > of *the* checkpointer > > > + Normally, client backends should be able to rely on auxiliary processes > > + like the checkpointer and background writer to write out dirty data as > > *the* bg writer > > > + much as possible. Large numbers of writes by client backends could > > + indicate a misconfiguration of shared buffers or of checkpointer. More > > *the* ckpointer I've made most of these changes. > Should this link to various docs for checkpointer/bgwriter ? I couldn't find docs related to tuning checkpointer outside of the WAL configuration docs. There is the docs page for the CHECKPOINT command -- but I don't think that is very relevant here. > Maybe the docs for ALTER/COPY/VACUUM/CREATE/etc should be updated to > refer to some central description of ring buffers. Maybe something > should be included to the appendix. I agree it would be nice to explain Buffer Access Strategies in the docs. 
- Melanie [1] https://www.postgresql.org/message-id/CA%2BhUKGLiY1e%2B1%3DpB7hXJOyGj1dJOfgde%2BHmiSnv3gDKayUFJMA%40mail.gmail.com
Attachment
On Thu, Jan 12, 2023 at 09:19:36PM -0500, Melanie Plageman wrote: > On Wed, Jan 11, 2023 at 4:58 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > > > Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type > > > > The patch can/will fail with: > > > > CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION ''; > > +WARNING: tablespaces created by regression test cases should have names starting with "regress_" > > > > CREATE TABLESPACE test_stats LOCATION ''; > > +WARNING: tablespaces created by regression test cases should have names starting with "regress_" > > > > (I already sent patches to address the omission in cirrus.yml) > > Thanks. I've fixed this > I make a tablespace in amcheck -- are there recommendations for naming > tablespaces in contrib also? That's the test_stats one I mentioned. Check with -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS > > > + <literal>bulkread</literal>: Qualifying large read I/O operations > > > + done outside of shared buffers, for example, a sequential scan of a > > > + large table. > > > > I don't think it's correct to say that it's "outside of" shared-buffers. > > I suppose "outside of" gives the wrong idea. But I need to make clear > that this I/O is to and from buffers which are not a part of shared > buffers right now -- they may still be accessible from the same data > structures which access shared buffers but they are currently being used > in a different way. This would be a good place to link to a description of the ringbuffer, if we had one. > > s/Qualifying/Certain/ > > I feel like qualifying is more specific than certain, but I would be open > to changing it if there was a specific reason you don't like it. I suggested to change it because at first I started to interpret it as "The act of qualifying large I/O ops .." rather than "Large I/O ops that qualify..". + Number of read operations of <varname>op_bytes</varname> size. This is still a bit too easy to misinterpret as being in units of bytes. 
I suggest: Number of read operations (which are each of the size specified in >op_bytes<). + in order to add the shared buffer to a separate size-limited ring buffer separate comma + More information on configuring checkpointer can be found in Section 30.5. *the* checkpointer (as in the following paragraph) + <varname>backend_type</varname> <literal>checkpointer</literal> and + <varname>io_object</varname> <literal>temp relation</literal>. + </para> I still think it's a bit hard to understand the <varname>s adjacent to <literal>s. + Some backend_types + in some io_contexts + on some io_objects + in certain io_contexts + on certain io_objects Maybe these should not use underscores: Some backend types never perform I/O operations in some I/O contexts and/or on some I/O objects. + for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++) + for (IOContext io_context = IOCONTEXT_BULKREAD; io_context < IOCONTEXT_NUM_TYPES; io_context++) + for (IOObject io_obj = IOOBJECT_RELATION; io_obj < IOOBJECT_NUM_TYPES; io_obj++) + for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++) These look a bit fragile due to starting at some hardcoded "first" value. In other places you use "FIRST" symbols: + for (IOContext io_context = IOCONTEXT_FIRST; io_context < IOCONTEXT_NUM_TYPES; io_context++) + for (IOObject io_object = IOOBJECT_FIRST; io_object < IOOBJECT_NUM_TYPES; io_object++) + for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++) I think that's marginally better, but I think having to define both FIRST and NUM is excessive and doesn't make it less fragile. Not sure what anyone else will say, but I'd prefer if it started at "0". Thanks for working on this - I'm looking forward to updating my rrdtool script for this soon. It'll be nice to finally distinguish the huge number of "backend ringbuffer writes during ALTER" from other backend writes. Currently, that makes it look like something is terribly wrong. -- Justin
Attached is v47. On Fri, Jan 13, 2023 at 12:23 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Thu, Jan 12, 2023 at 09:19:36PM -0500, Melanie Plageman wrote: > > On Wed, Jan 11, 2023 at 4:58 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > > > > > Subject: [PATCH v45 4/5] Add system view tracking IO ops per backend type > > > > > > The patch can/will fail with: > > > > > > CREATE TABLESPACE test_io_shared_stats_tblspc LOCATION ''; > > > +WARNING: tablespaces created by regression test cases should have names starting with "regress_" > > > > > > CREATE TABLESPACE test_stats LOCATION ''; > > > +WARNING: tablespaces created by regression test cases should have names starting with "regress_" > > > > > > (I already sent patches to address the omission in cirrus.yml) > > > > Thanks. I've fixed this > > I make a tablespace in amcheck -- are there recommendations for naming > > tablespaces in contrib also? > > That's the test_stats one I mentioned. > > Check with -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS Thanks. I have now changed both tablespace names and checked using that macro. > > > > + <literal>bulkread</literal>: Qualifying large read I/O operations > > > > + done outside of shared buffers, for example, a sequential scan of a > > > > + large table. > > > > > > I don't think it's correct to say that it's "outside of" shared-buffers. > > > > I suppose "outside of" gives the wrong idea. But I need to make clear > > that this I/O is to and from buffers which are not a part of shared > > buffers right now -- they may still be accessible from the same data > > structures which access shared buffers but they are currently being used > > in a different way. > > This would be a good place to link to a description of the ringbuffer, > if we had one. Indeed. > > > s/Qualifying/Certain/ > > > > I feel like qualifying is more specific than certain, but I would be open > > to changing it if there was a specific reason you don't like it. 
> > I suggested to change it because at first I started to interpret it as > "The act of qualifying large I/O ops .." rather than "Large I/O ops that > qualify..". I have changed it to "certain". > + Number of read operations of <varname>op_bytes</varname> size. > > This is still a bit too easy to misinterpret as being in units of bytes. > I suggest: Number of read operations (which are each of the size > specified in >op_bytes<). I have changed this. > + in order to add the shared buffer to a separate size-limited ring buffer > > separate comma > > + More information on configuring checkpointer can be found in Section 30.5. > > *the* checkpointer (as in the following paragraph) above items changed. > + <varname>backend_type</varname> <literal>checkpointer</literal> and > + <varname>io_object</varname> <literal>temp relation</literal>. > + </para> > > I still think it's a bit hard to understand the <varname>s adjacent to > <literal>s. I agree it isn't great -- is there a different XML tag you suggest instead of literal? > + Some backend_types > + in some io_contexts > + on some io_objects > + in certain io_contexts > + on certain io_objects > > Maybe these should not use underscores: Some backend types never > perform I/O operations in some I/O contexts and/or on some i/o objects. I've changed this. Also, taking another look, I forgot to update the docs' column name tenses in the last version. That is now done. > + for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++) > + for (IOContext io_context = IOCONTEXT_BULKREAD; io_context < IOCONTEXT_NUM_TYPES; io_context++) > + for (IOObject io_obj = IOOBJECT_RELATION; io_obj < IOOBJECT_NUM_TYPES; io_obj++) > + for (IOOp io_op = IOOP_EVICT; io_op < IOOP_NUM_TYPES; io_op++) > > These look a bit fragile due to starting at some hardcoded "first" > value. 
In other places you use symbols "FIRST" symbols: > > + for (IOContext io_context = IOCONTEXT_FIRST; io_context < IOCONTEXT_NUM_TYPES; io_context++) > + for (IOObject io_object = IOOBJECT_FIRST; io_object < IOOBJECT_NUM_TYPES; io_object++) > + for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++) > > I think that's marginally better, but I think having to define both > FIRST and NUM is excessive and doesn't make it less fragile. Not sure > what anyone else will say, but I'd prefer if it started at "0". Thanks for catching the discrepancy in pg_stat_get_io(). I have changed those instances to use _FIRST. I think that having the loop start from the first enum value (except when that value is something special like _INVALID like with BackendType) is confusing. I agree that having multiple macros to allow iteration through all enum values introduces some fragility. I'm not sure about using the number 0 with the enum as the loop variable data type. Is that a common pattern? In this version, I have updated the loops in pg_stat_get_io() to use _FIRST. > Thanks for working on this - I'm looking forward to updating my rrdtool > script for this soon. It'll be nice to finally distinguish huge number > of "backend ringbuffer writes during ALTER" from other backend writes. > Currently, that makes it look like something is terribly wrong. Cool! I'm glad to know you will use it. - Melanie
Hi, On 2023-01-13 13:38:15 -0500, Melanie Plageman wrote: > > I think that's marginally better, but I think having to define both > > FIRST and NUM is excessive and doesn't make it less fragile. Not sure > > what anyone else will say, but I'd prefer if it started at "0". The reason for using FIRST is to be able to define the loop variable as the enum type, without assigning numeric values to an enum var. I prefer it slightly. > From f8c9077631169a778c893fd16b7a973ad5725f2a Mon Sep 17 00:00:00 2001 > From: Andres Freund <andres@anarazel.de> > Date: Fri, 9 Dec 2022 18:23:19 -0800 > Subject: [PATCH v47 1/5] pgindent and some manual cleanup in pgstat related Applied. > Subject: [PATCH v47 2/5] pgstat: Infrastructure to track IO operations > diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c > index 0fa5370bcd..608c3b59da 100644 > --- a/src/backend/utils/activity/pgstat.c > +++ b/src/backend/utils/activity/pgstat.c Reminder to self: Need to bump PGSTAT_FILE_FORMAT_ID before commit. Perhaps you could add a note about that to the commit message? > @@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = { > .snapshot_cb = pgstat_checkpointer_snapshot_cb, > }, > > + [PGSTAT_KIND_IO] = { > + .name = "io_ops", That should be "io" now I think? > +/* > + * Check that stats have not been counted for any combination of IOContext, > + * IOObject, and IOOp which are not tracked for the passed-in BackendType. The > + * passed-in PgStat_BackendIO must contain stats from the BackendType specified > + * by the second parameter. Caller is responsible for locking the passed-in > + * PgStat_BackendIO, if needed. > + */ Other PgStat_Backend* structs are just for pending data. Perhaps we could rename it slightly to make that clearer? PgStat_BktypeIO? PgStat_IOForBackendType? or a similar variation? 
> +bool > +pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io, > + BackendType bktype) > +{ > + bool bktype_tracked = pgstat_tracks_io_bktype(bktype); > + > + for (IOContext io_context = IOCONTEXT_FIRST; > + io_context < IOCONTEXT_NUM_TYPES; io_context++) > + { > + for (IOObject io_object = IOOBJECT_FIRST; > + io_object < IOOBJECT_NUM_TYPES; io_object++) > + { > + /* > + * Don't bother trying to skip to the next loop iteration if > + * pgstat_tracks_io_object() would return false here. We still > + * need to validate that each counter is zero anyway. > + */ > + for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++) > + { > + if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) && > + backend_io->data[io_context][io_object][io_op] != 0) > + return false; Hm, perhaps this could be broken up into multiple lines? Something like /* no stats, so nothing to validate */ if (backend_io->data[io_context][io_object][io_op] == 0) continue; /* something went wrong if have stats for something not tracked */ if (!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) return false; > +typedef struct PgStat_BackendIO > +{ > + PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES]; > +} PgStat_BackendIO; Would it bother you if we swapped the order of iocontext and iobject here and related places? It makes more sense to me semantically, and should now be pretty easy, code wise. > +/* shared version of PgStat_IO */ > +typedef struct PgStatShared_IO > +{ Maybe /* PgStat_IO in shared memory */? > Subject: [PATCH v47 3/5] pgstat: Count IO for relations Nearly happy with this now. See one minor nit below. I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but I don't have a better idea, and it doesn't seem too horrible. 
> @@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > > UnlockBufHdr(buf, buf_state); > > + if (oldFlags & BM_VALID) > + { > + /* > + * When a BufferAccessStrategy is in use, blocks evicted from shared > + * buffers are counted as IOOP_EVICT in the corresponding context > + * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a > + * strategy in two cases: 1) while initially claiming buffers for the > + * strategy ring 2) to replace an existing strategy ring buffer > + * because it is pinned or in use and cannot be reused. > + * > + * Blocks evicted from buffers already in the strategy ring are > + * counted as IOOP_REUSE in the corresponding strategy context. > + * > + * At this point, we can accurately count evictions and reuses, > + * because we have successfully claimed the valid buffer. Previously, > + * we may have been forced to release the buffer due to concurrent > + * pinners or erroring out. > + */ > + pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT, > + IOOBJECT_RELATION, *io_context); > + } > + > if (oldPartitionLock != NULL) > { > BufTableDelete(&oldTag, oldHash); There's no reason to do this while we still hold the buffer partition lock, right? That's a highly contended lock, and we can just move the counting a few lines down. > @@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path) > if (need_to_close) > FileClose(file); > > + if (result >= 0) > + pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL); > + I'd lean towards doing this unconditionally, it's still an fsync if it failed... Not that it matters. > Subject: [PATCH v47 4/5] Add system view tracking IO ops per backend type Note to self + commit message: Remember the need to do a catversion bump. > +-- pg_stat_io test: > +-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. Maybe add that "whereas a sequential scan does not, see ..."? 
> This allows > +-- us to reliably test that pg_stat_io BULKREAD reads are being captured > +-- without relying on the size of shared buffers or on an expensive operation > +-- like CREATE DATABASE. CREATE / DROP TABLESPACE is also pretty expensive, but I don't have a better idea. > +-- Create an alternative tablespace and move the heaptest table to it, causing > +-- it to be rewritten. IIRC the point of that is that it reliably evicts all the buffers from s_b, correct? If so, mention that? > +Datum > +pg_stat_get_io(PG_FUNCTION_ARGS) > +{ > + ReturnSetInfo *rsinfo; > + PgStat_IO *backends_io_stats; > + Datum reset_time; > + > + InitMaterializedSRF(fcinfo, 0); > + rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; > + > + backends_io_stats = pgstat_fetch_stat_io(); > + > + reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp); > + > + for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++) > + { > + bool bktype_tracked; > + Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype)); > + PgStat_BackendIO *bktype_stats = &backends_io_stats->stats[bktype]; > + > + /* > + * For those BackendTypes without IO Operation stats, skip > + * representing them in the view altogether. We still loop through > + * their counters so that we can assert that all values are zero. > + */ > + bktype_tracked = pgstat_tracks_io_bktype(bktype); How about instead just doing Assert(pgstat_bktype_io_stats_valid(...))? That deduplicates the logic for the asserts, and avoids doing the full loop when assertions aren't enabled anyway? Otherwise, see also the suggestion aout formatting the assertions as I suggested for 0002. > +-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes > +-- and fsyncs. 
> +-- The second checkpoint ensures that stats from the first checkpoint have been > +-- reported and protects against any potential races amongst the table > +-- creation, a possible timing-triggered checkpoint, and the explicit > +-- checkpoint in the test. There's a comment about the subsequent checkpoints earlier in the file, and I think the comment is slightly more precise. Maybe just reference the earlier comment? > +-- Change the tablespace so that the table is rewritten directly, then SELECT > +-- from it to cause it to be read back into shared buffers. > +SET allow_in_place_tablespaces = true; > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION ''; Perhaps worth doing this in tablespace.sql, to avoid the additional checkpoints done as part of CREATE/DROP TABLESPACE? Or, at least combine this with the CHECKPOINTs above? > +-- Drop the table so we can drop the tablespace later. > +DROP TABLE test_io_shared; > +-- Test that the follow IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io: > +-- - eviction of local buffers in order to reuse them > +-- - reads of temporary table blocks into local buffers > +-- - writes of local buffers to permanent storage > +-- - extends of temporary tables > +-- Set temp_buffers to a low value so that we can trigger writes with fewer > +-- inserted tuples. Do so in a new session in case temporary tables have been > +-- accessed by previous tests in this session. 
> +CREATE TEMPORARY TABLE test_io_local(a int, b TEXT); > +SELECT sum(extends) AS io_sum_local_extends_before > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > +SELECT sum(evictions) AS io_sum_local_evictions_before > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > +SELECT sum(writes) AS io_sum_local_writes_before > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > +-- Insert tuples into the temporary table, generating extends in the stats. > +-- Insert enough values that we need to reuse and write out dirty local > +-- buffers, generating evictions and writes. > +INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100); > +SELECT sum(reads) AS io_sum_local_reads_before > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset Maybe add something like SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100; Better toast compression or such could easily make test_io_local smaller than it is today. Seeing that it's too small would make it easier to understand the failure. 
> +SELECT :io_sum_local_evictions_after > :io_sum_local_evictions_before; > + ?column? > +---------- > + t > +(1 row) > + > +SELECT :io_sum_local_reads_after > :io_sum_local_reads_before; > + ?column? > +---------- > + t > +(1 row) > + > +SELECT :io_sum_local_writes_after > :io_sum_local_writes_before; > + ?column? > +---------- > + t > +(1 row) > + > +SELECT :io_sum_local_extends_after > :io_sum_local_extends_before; > + ?column? > +---------- > + t > +(1 row) Similar. > +SELECT sum(reuses) AS io_sum_vac_strategy_reuses_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset > +SELECT sum(reads) AS io_sum_vac_strategy_reads_before FROM pg_stat_io WHERE io_context = 'vacuum' \gset There's quite a few more instances of this, so I'll now omit further mentions. Greetings, Andres Freund
On Fri, Jan 13, 2023 at 10:38 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > Attached is v47. I missed a couple of versions, but I think the docs are clearer now. I'm torn on losing some of the detail, but overall I do think it's a good trade-off. Moving some details out to after the table does keep the bulk of the view documentation more readable, and the "inform database tuning" part is great. I really like the idea of a separate Interpreting Statistics section, but for now this works. >+ <literal>vacuum</literal>: I/O operations performed outside of shared >+ buffers while vacuuming and analyzing permanent relations. Why only permanent relations? Are temporary relations treated differently? I imagine if someone has a temp-table-heavy workload that requires regularly vacuuming and analyzing those relations, this point may be confusing without some additional explanation. Other than that, this looks great. Thanks, Maciek
v48 attached. On Fri, Jan 13, 2023 at 6:36 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-01-13 13:38:15 -0500, Melanie Plageman wrote: > > From f8c9077631169a778c893fd16b7a973ad5725f2a Mon Sep 17 00:00:00 2001 > > From: Andres Freund <andres@anarazel.de> > > Date: Fri, 9 Dec 2022 18:23:19 -0800 > > Subject: [PATCH v47 2/5] pgstat: Infrastructure to track IO operations > > diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c > > index 0fa5370bcd..608c3b59da 100644 > > --- a/src/backend/utils/activity/pgstat.c > > +++ b/src/backend/utils/activity/pgstat.c > > Reminder to self: Need to bump PGSTAT_FILE_FORMAT_ID before commit. > > Perhaps you could add a note about that to the commit message? > done > > > > @@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = { > > .snapshot_cb = pgstat_checkpointer_snapshot_cb, > > }, > > > > + [PGSTAT_KIND_IO] = { > > + .name = "io_ops", > > That should be "io" now I think? > Oh no! I didn't notice this was broken. I've added pg_stat_have_stats() to the IO stats tests now. It would be nice if pgstat_get_kind_from_str() could be used in pg_stat_reset_shared() to avoid having to remember to change both. It doesn't really work because we want to be able to throw the error message in pg_stat_reset_shared() when the user input is wrong -- not the one in pgstat_get_kind_from_str(). Also: - Since recovery_prefetch doesn't have a statistic kind, it doesn't fit well into this paradigm - Only a subset of the statistics kinds are reset through this function - bgwriter and checkpointer share a reset target I added a comment -- perhaps that's all I can do? On a separate note, should we be setting have_[io/slru/etc]stats to false in the reset all functions? > > > +/* > > + * Check that stats have not been counted for any combination of IOContext, > > + * IOObject, and IOOp which are not tracked for the passed-in BackendType. 
The > > + * passed-in PgStat_BackendIO must contain stats from the BackendType specified > > + * by the second parameter. Caller is responsible for locking the passed-in > > + * PgStat_BackendIO, if needed. > > + */ > > Other PgStat_Backend* structs are just for pending data. Perhaps we could > rename it slightly to make that clearer? PgStat_BktypeIO? > PgStat_IOForBackendType? or a similar variation? I've done this. > > > +bool > > +pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io, > > + BackendType bktype) > > +{ > > + bool bktype_tracked = pgstat_tracks_io_bktype(bktype); > > + > > + for (IOContext io_context = IOCONTEXT_FIRST; > > + io_context < IOCONTEXT_NUM_TYPES; io_context++) > > + { > > + for (IOObject io_object = IOOBJECT_FIRST; > > + io_object < IOOBJECT_NUM_TYPES; io_object++) > > + { > > + /* > > + * Don't bother trying to skip to the next loop iteration if > > + * pgstat_tracks_io_object() would return false here. We still > > + * need to validate that each counter is zero anyway. > > + */ > > + for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++) > > + { > > + if ((!bktype_tracked || !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) && > > + backend_io->data[io_context][io_object][io_op] != 0) > > + return false; > > Hm, perhaps this could be broken up into multiple lines? Something like > > /* no stats, so nothing to validate */ > if (backend_io->data[io_context][io_object][io_op] == 0) > continue; > > /* something went wrong if have stats for something not tracked */ > if (!bktype_tracked || > !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) > return false; I've done this. > > +typedef struct PgStat_BackendIO > > +{ > > + PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES]; > > +} PgStat_BackendIO; > > Would it bother you if we swapped the order of iocontext and iobject here and > related places? It makes more sense to me semantically, and should now be > pretty easy, code wise. 
So, thinking about this I started noticing inconsistencies in other areas around this order: For example: ordering of objects mentioned in commit messages and comments, ordering of parameters (like in pgstat_count_io_op() [currently in reverse order]). I think we should make a final decision about this ordering and then make everywhere consistent (including ordering in the view). Currently the order is: BackendType IOContext IOObject IOOp You are suggesting this order: BackendType IOObject IOContext IOOp Could you explain what you find more natural about this ordering (as I find the other more natural)? This is one possible natural sentence with these objects: During COPY, a client backend may read in data from a permanent relation. This order is: IOContext BackendType IOOp IOObject I think English sentences are often structured subject, verb, object -- but in our case, we have an extra thing that doesn't fit neatly (IOContext). Also, IOOp in a sentence would be in the middle (as the verb). I made it last because a) it feels like the smallest unit b) it would make the code a lot more annoying if it wasn't last. WRT IOObject and IOContext, is there a future case for which having IOObject first will be better or lead to fewer mistakes? I actually see loads of places where this needs to be made consistent. > > > +/* shared version of PgStat_IO */ > > +typedef struct PgStatShared_IO > > +{ > > Maybe /* PgStat_IO in shared memory */? > updated. > > > Subject: [PATCH v47 3/5] pgstat: Count IO for relations > > Nearly happy with this now. See one minor nit below. > > I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but > I don't have a better idea, and it doesn't seem too horrible. You don't like it because such things shouldn't be in md.c -- since we went to the trouble of having function pointers and making it general? 
> > > @@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > > > > UnlockBufHdr(buf, buf_state); > > > > + if (oldFlags & BM_VALID) > > + { > > + /* > > + * When a BufferAccessStrategy is in use, blocks evicted from shared > > + * buffers are counted as IOOP_EVICT in the corresponding context > > + * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a > > + * strategy in two cases: 1) while initially claiming buffers for the > > + * strategy ring 2) to replace an existing strategy ring buffer > > + * because it is pinned or in use and cannot be reused. > > + * > > + * Blocks evicted from buffers already in the strategy ring are > > + * counted as IOOP_REUSE in the corresponding strategy context. > > + * > > + * At this point, we can accurately count evictions and reuses, > > + * because we have successfully claimed the valid buffer. Previously, > > + * we may have been forced to release the buffer due to concurrent > > + * pinners or erroring out. > > + */ > > + pgstat_count_io_op(from_ring ? IOOP_REUSE : IOOP_EVICT, > > + IOOBJECT_RELATION, *io_context); > > + } > > + > > if (oldPartitionLock != NULL) > > { > > BufTableDelete(&oldTag, oldHash); > > There's no reason to do this while we still hold the buffer partition lock, > right? That's a highly contended lock, and we can just move the counting a few > lines down. Thanks, I've done this. > > > @@ -1410,6 +1432,9 @@ mdsyncfiletag(const FileTag *ftag, char *path) > > if (need_to_close) > > FileClose(file); > > > > + if (result >= 0) > > + pgstat_count_io_op(IOOP_FSYNC, IOOBJECT_RELATION, IOCONTEXT_NORMAL); > > + > > I'd lean towards doing this unconditionally, it's still an fsync if it > failed... Not that it matters. Good point. We still incurred the costs if not benefited from the effects. I've updated this. > > > Subject: [PATCH v47 4/5] Add system view tracking IO ops per backend type > > Note to self + commit message: Remember the need to do a catversion bump. 
Noted. > > > +-- pg_stat_io test: > > +-- verify_heapam always uses a BAS_BULKREAD BufferAccessStrategy. > > Maybe add that "whereas a sequential scan does not, see ..."? Updated. > > > This allows > > +-- us to reliably test that pg_stat_io BULKREAD reads are being captured > > +-- without relying on the size of shared buffers or on an expensive operation > > +-- like CREATE DATABASE. > > CREATE / DROP TABLESPACE is also pretty expensive, but I don't have a better > idea. I've added a comment. > > > +-- Create an alternative tablespace and move the heaptest table to it, causing > > +-- it to be rewritten. > > IIRC the point of that is that it reliably evicts all the buffers from s_b, > correct? If so, mention that? Done. > > > +Datum > > +pg_stat_get_io(PG_FUNCTION_ARGS) > > +{ > > + ReturnSetInfo *rsinfo; > > + PgStat_IO *backends_io_stats; > > + Datum reset_time; > > + > > + InitMaterializedSRF(fcinfo, 0); > > + rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; > > + > > + backends_io_stats = pgstat_fetch_stat_io(); > > + > > + reset_time = TimestampTzGetDatum(backends_io_stats->stat_reset_timestamp); > > + > > + for (BackendType bktype = B_INVALID; bktype < BACKEND_NUM_TYPES; bktype++) > > + { > > + bool bktype_tracked; > > + Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype)); > > + PgStat_BackendIO *bktype_stats = &backends_io_stats->stats[bktype]; > > + > > + /* > > + * For those BackendTypes without IO Operation stats, skip > > + * representing them in the view altogether. We still loop through > > + * their counters so that we can assert that all values are zero. > > + */ > > + bktype_tracked = pgstat_tracks_io_bktype(bktype); > > How about instead just doing Assert(pgstat_bktype_io_stats_valid(...))? That > deduplicates the logic for the asserts, and avoids doing the full loop when > assertions aren't enabled anyway? > I've done this and added a comment. 
> > > > +-- After a checkpoint, there should be some additional IOCONTEXT_NORMAL writes > > +-- and fsyncs. > > +-- The second checkpoint ensures that stats from the first checkpoint have been > > +-- reported and protects against any potential races amongst the table > > +-- creation, a possible timing-triggered checkpoint, and the explicit > > +-- checkpoint in the test. > > There's a comment about the subsequent checkpoints earlier in the file, and I > think the comment is slightly more precise. Mybe just reference the earlier comment? > > > > +-- Change the tablespace so that the table is rewritten directly, then SELECT > > +-- from it to cause it to be read back into shared buffers. > > +SET allow_in_place_tablespaces = true; > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION ''; > > Perhaps worth doing this in tablespace.sql, to avoid the additional > checkpoints done as part of CREATE/DROP TABLESPACE? > > Or, at least combine this with the CHECKPOINTs above? I see a checkpoint is requested when dropping the tablespace if not all the files in it are deleted. It seems like if the DROP TABLE for the permanent table is before the explicit checkpoints in the test, then the DROP TABLESPACE will not cause an additional checkpoint. Is this what you are suggesting? Dropping the temporary table should not have an effect on this. > > > +-- Drop the table so we can drop the tablespace later. > > +DROP TABLE test_io_shared; > > +-- Test that the follow IOCONTEXT_LOCAL IOOps are tracked in pg_stat_io: > > +-- - eviction of local buffers in order to reuse them > > +-- - reads of temporary table blocks into local buffers > > +-- - writes of local buffers to permanent storage > > +-- - extends of temporary tables > > +-- Set temp_buffers to a low value so that we can trigger writes with fewer > > +-- inserted tuples. Do so in a new session in case temporary tables have been > > +-- accessed by previous tests in this session. 
> > +\c > > +SET temp_buffers TO '1MB'; > > I'd set it to the actual minimum '100' (in pages). Perhaps that'd allow to > make test_io_local a bit smaller? I've done this. > > > +CREATE TEMPORARY TABLE test_io_local(a int, b TEXT); > > +SELECT sum(extends) AS io_sum_local_extends_before > > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > > +SELECT sum(evictions) AS io_sum_local_evictions_before > > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > > +SELECT sum(writes) AS io_sum_local_writes_before > > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > > +-- Insert tuples into the temporary table, generating extends in the stats. > > +-- Insert enough values that we need to reuse and write out dirty local > > +-- buffers, generating evictions and writes. > > +INSERT INTO test_io_local SELECT generate_series(1, 8000) as id, repeat('a', 100); > > +SELECT sum(reads) AS io_sum_local_reads_before > > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > > Maybe add something like > > SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100; > > Better toast compression or such could easily make test_io_local smaller than > it's today. Seeing that it's too small would make it easier to understand the > failure. Good idea. So, I used pg_table_size() because it seems like pg_relation_size() does not include the toast relations. However, I'm not sure this is a good idea, because pg_table_size() includes FSM and visibility map. Should I write a query to get the toast relation name and add pg_relation_size() of that relation and the main relation? 
> > > +SELECT sum(evictions) AS io_sum_local_evictions_after > > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > > +SELECT sum(reads) AS io_sum_local_reads_after > > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > > +SELECT sum(writes) AS io_sum_local_writes_after > > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > > +SELECT sum(extends) AS io_sum_local_extends_after > > + FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'temp relation' \gset > > This could just be one select with multiple columns? > > I think if you use something like \gset io_sum_local_after_ you can also avoid > the need to repeat "io_sum_local_" so many times. Thanks. I didn't realize. I've fixed this throughout the test file. On Mon, Jan 16, 2023 at 4:42 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > I missed a couple of versions, but I think the docs are clearer now. > I'm torn on losing some of the detail, but overall I do think it's a > good trade-off. Moving some details out to after the table does keep > the bulk of the view documentation more readable, and the "inform > database tuning" part is great. I really like the idea of a separate > Interpreting Statistics section, but for now this works. > > >+ <literal>vacuum</literal>: I/O operations performed outside of shared > >+ buffers while vacuuming and analyzing permanent relations. > > Why only permanent relations? Are temporary relations treated > differently? I imagine if someone has a temp-table-heavy workload that > requires regularly vacuuming and analyzing those relations, this point > may be confusing without some additional explanation. Ah, yes. This is a bit confusing. We don't use buffer access strategies when operating on temp relations, so vacuuming them is counted in IO Context normal. I've added this information to the docs but now that definition is a bit long. Perhaps it should be a note? 
That seems like it would draw too much attention to this detail, though... - Melanie
Attachment
Hi, On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote: > > > @@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = { > > > .snapshot_cb = pgstat_checkpointer_snapshot_cb, > > > }, > > > > > > + [PGSTAT_KIND_IO] = { > > > + .name = "io_ops", > > > > That should be "io" now I think? > > > > Oh no! I didn't notice this was broken. I've added pg_stat_have_stats() > to the IO stats tests now. > > It would be nice if pgstat_get_kind_from_str() could be used in > pg_stat_reset_shared() to avoid having to remember to change both. It's hard to make that work, because of the historical behaviour of that function :( > Also: > - Since recovery_prefetch doesn't have a statistic kind, it doesn't fit > well into this paradigm I think that needs a rework anyway - it went in at about the same time as the shared mem stats patch, so it doesn't quite cohere. > On a separate note, should we be setting have_[io/slru/etc]stats to > false in the reset all functions? That'd not work reliably, because other backends won't do the same. I don't see a benefit in doing it differently in the local connection than the other connections. > > > +typedef struct PgStat_BackendIO > > > +{ > > > + PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES]; > > > +} PgStat_BackendIO; > > > > Would it bother you if we swapped the order of iocontext and iobject here and > > related places? It makes more sense to me semantically, and should now be > > pretty easy, code wise. > > So, thinking about this I started noticing inconsistencies in other > areas around this order: > For example: ordering of objects mentioned in commit messages and comments, > ordering of parameters (like in pgstat_count_io_op() [currently in > reverse order]). > > I think we should make a final decision about this ordering and then > make everywhere consistent (including ordering in the view). 
> > Currently the order is: > BackendType > IOContext > IOObject > IOOp > > You are suggesting this order: > BackendType > IOObject > IOContext > IOOp > > Could you explain what you find more natural about this ordering (as I > find the other more natural)? The object we're performing IO on determines more things than the context. So it just seems like the natural hierarchical fit. The context is a sub-category of the object. Consider how it'll look like if we also have objects for 'wal', 'temp files'. It'll make sense to group by just the object, but it won't make sense to group by just the context. If it were trivial to do I'd use a different IOContext for each IOObject. But it'd make it much harder. So there'll just be a bunch of values of IOContext that'll only be used for one or a subset of the IOObjects. The reason to put BackendType at the top is pragmatic - one backend is of a single type, but can do IO for all kinds of objects/contexts. So any other hierarchy would make the locking etc much harder. > This is one possible natural sentence with these objects: > > During COPY, a client backend may read in data from a permanent > relation. > This order is: > IOContext > BackendType > IOOp > IOObject > > I think English sentences are often structured subject, verb, object -- > but in our case, we have an extra thing that doesn't fit neatly > (IOContext). "..., to avoid polluting the buffer cache it uses the bulk (read|write) strategy". > Also, IOOp in a sentence would be in the middle (as the > verb). I made it last because a) it feels like the smallest unit b) it > would make the code a lot more annoying if it wasn't last. Yea, I think pragmatically that is the right choice. > > > Subject: [PATCH v47 3/5] pgstat: Count IO for relations > > > > Nearly happy with this now. See one minor nit below. > > > > I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but > > I don't have a better idea, and it doesn't seem too horrible. 
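For illustration, the layout under discussion can be sketched as a standalone C toy. The enum members and the count_io_op helper below are simplified stand-ins for the real pgstat definitions (which live elsewhere and differ in detail, and which are also indexed per backend type); the point is the dimension ordering Andres suggests, with IOObject outermost and IOOp innermost:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, pared-down versions of the enums discussed above. */
typedef enum IOObject { IOOBJECT_RELATION, IOOBJECT_TEMP_RELATION,
                        IOOBJECT_NUM_TYPES } IOObject;
typedef enum IOContext { IOCONTEXT_NORMAL, IOCONTEXT_BULKREAD,
                         IOCONTEXT_BULKWRITE, IOCONTEXT_VACUUM,
                         IOCONTEXT_NUM_TYPES } IOContext;
typedef enum IOOp { IOOP_READ, IOOP_WRITE, IOOP_EXTEND, IOOP_EVICT,
                    IOOP_NUM_TYPES } IOOp;

typedef int64_t PgStat_Counter;

/*
 * Object outermost, context next, op innermost: the object determines
 * which contexts are meaningful, so it sits higher in the hierarchy.
 * One such struct would exist per backend type.
 */
typedef struct PgStat_BktypeIO
{
    PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
} PgStat_BktypeIO;

static void
count_io_op(PgStat_BktypeIO *io, IOObject obj, IOContext ctx, IOOp op)
{
    /* Guard against out-of-range values before indexing the array. */
    assert(obj < IOOBJECT_NUM_TYPES);
    assert(ctx < IOCONTEXT_NUM_TYPES);
    assert(op < IOOP_NUM_TYPES);
    io->data[obj][ctx][op]++;
}
```

With this ordering, grouping by IOObject collapses the inner two dimensions naturally, whereas grouping by IOContext alone would mix counters that only make sense for a subset of the objects.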
> > You don't like it because such things shouldn't be in md.c -- since we > went to the trouble of having function pointers and making it general? It's more of a gut feeling than well reasoned ;) > > > +-- Change the tablespace so that the table is rewritten directly, then SELECT > > > +-- from it to cause it to be read back into shared buffers. > > > +SET allow_in_place_tablespaces = true; > > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION ''; > > > > Perhaps worth doing this in tablespace.sql, to avoid the additional > > checkpoints done as part of CREATE/DROP TABLESPACE? > > > > Or, at least combine this with the CHECKPOINTs above? > > I see a checkpoint is requested when dropping the tablespace if not all > the files in it are deleted. It seems like if the DROP TABLE for the > permanent table is before the explicit checkpoints in the test, then the > DROP TABLESPACE will not cause an additional checkpoint. Unfortunately, that's not how it works :(. See the comment above mdunlink(): > * For regular relations, we don't unlink the first segment file of the rel, > * but just truncate it to zero length, and record a request to unlink it after > * the next checkpoint. Additional segments can be unlinked immediately, > * however. Leaving the empty file in place prevents that relfilenumber > * from being reused. The scenario this protects us from is: > ... > Is this what you are suggesting? Dropping the temporary table should not > have an effect on this. I was wondering about simply moving that portion of the test to tablespace.sql, where we already created a tablespace. An alternative would be to propose splitting tablespace.sql into one portion running at the start of parallel_schedule, and one at the end. Historically, we needed tablespace.sql to be optional due to causing problems when replicating to another instance on the same machine, but now we have allow_in_place_tablespaces. 
> > SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100; > > > > Better toast compression or such could easily make test_io_local smaller than > > it's today. Seeing that it's too small would make it easier to understand the > > failure. > > Good idea. So, I used pg_table_size() because it seems like > pg_relation_size() does not include the toast relations. However, I'm > not sure this is a good idea, because pg_table_size() includes FSM and > visibility map. Should I write a query to get the toast relation name > and add pg_relation_size() of that relation and the main relation? I think it's the right thing to just include the relation size. Your queries IIRC won't use the toast table or other forks. So I'd leave it at just pg_relation_size(). Greetings, Andres Freund
v49 attached On Tue, Jan 17, 2023 at 2:12 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote: > > > > > +typedef struct PgStat_BackendIO > > > > +{ > > > > + PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES]; > > > > +} PgStat_BackendIO; > > > > > > Would it bother you if we swapped the order of iocontext and iobject here and > > > related places? It makes more sense to me semantically, and should now be > > > pretty easy, code wise. > > > > So, thinking about this I started noticing inconsistencies in other > > areas around this order: > > For example: ordering of objects mentioned in commit messages and comments, > > ordering of parameters (like in pgstat_count_io_op() [currently in > > reverse order]). > > > > I think we should make a final decision about this ordering and then > > make everywhere consistent (including ordering in the view). > > > > Currently the order is: > > BackendType > > IOContext > > IOObject > > IOOp > > > > You are suggesting this order: > > BackendType > > IOObject > > IOContext > > IOOp > > > > Could you explain what you find more natural about this ordering (as I > > find the other more natural)? > > The object we're performing IO on determines more things than the context. So > it just seems like the natural hierarchical fit. The context is a sub-category > of the object. Consider how it'll look like if we also have objects for 'wal', > 'temp files'. It'll make sense to group by just the object, but it won't make > sense to group by just the context. > > If it were trivial to do I'd use a different IOContext for each IOObject. But > it'd make it much harder. So there'll just be a bunch of values of IOContext > that'll only be used for one or a subset of the IOObjects. > > > The reason to put BackendType at the top is pragmatic - one backend is of a > single type, but can do IO for all kinds of objects/contexts. 
So any other > hierarchy would make the locking etc much harder. > > > > This is one possible natural sentence with these objects: > > > > During COPY, a client backend may read in data from a permanent > > relation. > > This order is: > > IOContext > > BackendType > > IOOp > > IOObject > > > > I think English sentences are often structured subject, verb, object -- > > but in our case, we have an extra thing that doesn't fit neatly > > (IOContext). > > "..., to avoid polluting the buffer cache it uses the bulk (read|write) > strategy". > > > > Also, IOOp in a sentence would be in the middle (as the > > verb). I made it last because a) it feels like the smallest unit b) it > > would make the code a lot more annoying if it wasn't last. > > Yea, I think pragmatically that is the right choice. I have changed the order and updated all the places using PgStat_BktypeIO as well as in all locations in which it should be ordered for consistency (that I could find in the pass I did) -- e.g. the view definition, function signatures, comments, commit messages, etc. > > > > +-- Change the tablespace so that the table is rewritten directly, then SELECT > > > > +-- from it to cause it to be read back into shared buffers. > > > > +SET allow_in_place_tablespaces = true; > > > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION ''; > > > > > > Perhaps worth doing this in tablespace.sql, to avoid the additional > > > checkpoints done as part of CREATE/DROP TABLESPACE? > > > > > > Or, at least combine this with the CHECKPOINTs above? > > > > I see a checkpoint is requested when dropping the tablespace if not all > > the files in it are deleted. It seems like if the DROP TABLE for the > > permanent table is before the explicit checkpoints in the test, then the > > DROP TABLESPACE will not cause an additional checkpoint. > > Unfortunately, that's not how it works :(. 
See the comment above mdunlink(): > > > * For regular relations, we don't unlink the first segment file of the rel, > > * but just truncate it to zero length, and record a request to unlink it after > > * the next checkpoint. Additional segments can be unlinked immediately, > > * however. Leaving the empty file in place prevents that relfilenumber > > * from being reused. The scenario this protects us from is: > > ... > > > > Is this what you are suggesting? Dropping the temporary table should not > > have an effect on this. > > I was wondering about simply moving that portion of the test to > tablespace.sql, where we already created a tablespace. > > > An alternative would be to propose splitting tablespace.sql into one portion > running at the start of parallel_schedule, and one at the end. Historically, > we needed tablespace.sql to be optional due to causing problems when > replicating to another instance on the same machine, but now we have > allow_in_place_tablespaces. It seems like the best way would be to split up the tablespace test file as you suggested and drop the tablespace at the end of the regression test suite. There could be other tests that could use a tablespace. Though what I wrote is kind of tablespace test coverage, if this rewriting behavior no longer happened when doing alter table set tablespace, we would want to come up with a new test which exercised that code to count those IO stats, not simply delete it from the tablespace tests. > > > SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100; > > > > > > Better toast compression or such could easily make test_io_local smaller than > > > it's today. Seeing that it's too small would make it easier to understand the > > > failure. > > > > Good idea. So, I used pg_table_size() because it seems like > > pg_relation_size() does not include the toast relations. 
However, I'm > > not sure this is a good idea, because pg_table_size() includes FSM and > > visibility map. Should I write a query to get the toast relation name > > and add pg_relation_size() of that relation and the main relation? > > I think it's the right thing to just include the relation size. Your queries > IIRC won't use the toast table or other forks. So I'd leave it at just > pg_relation_size(). I did notice that this test wasn't using the toast table for the toastable column -- but you mentioned better toast compression affecting the future test stability, so I'm confused. - Melanie
Attachment
On Tue, Jan 17, 2023 at 9:22 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > On Mon, Jan 16, 2023 at 4:42 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > > I missed a couple of versions, but I think the docs are clearer now. > > I'm torn on losing some of the detail, but overall I do think it's a > > good trade-off. Moving some details out to after the table does keep > > the bulk of the view documentation more readable, and the "inform > > database tuning" part is great. I really like the idea of a separate > > Interpreting Statistics section, but for now this works. > > > > >+ <literal>vacuum</literal>: I/O operations performed outside of shared > > >+ buffers while vacuuming and analyzing permanent relations. > > > > Why only permanent relations? Are temporary relations treated > > differently? I imagine if someone has a temp-table-heavy workload that > > requires regularly vacuuming and analyzing those relations, this point > > may be confusing without some additional explanation. > > Ah, yes. This is a bit confusing. We don't use buffer access strategies > when operating on temp relations, so vacuuming them is counted in IO > Context normal. I've added this information to the docs but now that > definition is a bit long. Perhaps it should be a note? That seems like > it would draw too much attention to this detail, though... Thanks for clarifying. I think the updated definition still works: it's still shorter than the `normal` context definition.
On Wed, 18 Jan 2023 at 03:30, Melanie Plageman <melanieplageman@gmail.com> wrote: > > v49 attached > > On Tue, Jan 17, 2023 at 2:12 PM Andres Freund <andres@anarazel.de> wrote: > > On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote: > > > > > > > +typedef struct PgStat_BackendIO > > > > > +{ > > > > > + PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES]; > > > > > +} PgStat_BackendIO; > > > > > > > > Would it bother you if we swapped the order of iocontext and iobject here and > > > > related places? It makes more sense to me semantically, and should now be > > > > pretty easy, code wise. > > > > > > So, thinking about this I started noticing inconsistencies in other > > > areas around this order: > > > For example: ordering of objects mentioned in commit messages and comments, > > > ordering of parameters (like in pgstat_count_io_op() [currently in > > > reverse order]). > > > > > > I think we should make a final decision about this ordering and then > > > make everywhere consistent (including ordering in the view). > > > > > > Currently the order is: > > > BackendType > > > IOContext > > > IOObject > > > IOOp > > > > > > You are suggesting this order: > > > BackendType > > > IOObject > > > IOContext > > > IOOp > > > > > > Could you explain what you find more natural about this ordering (as I > > > find the other more natural)? > > > > The object we're performing IO on determines more things than the context. So > > it just seems like the natural hierarchical fit. The context is a sub-category > > of the object. Consider how it'll look like if we also have objects for 'wal', > > 'temp files'. It'll make sense to group by just the object, but it won't make > > sense to group by just the context. > > > > If it were trivial to do I'd use a different IOContext for each IOObject. But > > it'd make it much harder. So there'll just be a bunch of values of IOContext > > that'll only be used for one or a subset of the IOObjects. 
> > > > > > The reason to put BackendType at the top is pragmatic - one backend is of a > > single type, but can do IO for all kinds of objects/contexts. So any other > > hierarchy would make the locking etc much harder. > > > > > > > This is one possible natural sentence with these objects: > > > > > > During COPY, a client backend may read in data from a permanent > > > relation. > > > This order is: > > > IOContext > > > BackendType > > > IOOp > > > IOObject > > > > > > I think English sentences are often structured subject, verb, object -- > > > but in our case, we have an extra thing that doesn't fit neatly > > > (IOContext). > > > > "..., to avoid polluting the buffer cache it uses the bulk (read|write) > > strategy". > > > > > > > Also, IOOp in a sentence would be in the middle (as the > > > verb). I made it last because a) it feels like the smallest unit b) it > > > would make the code a lot more annoying if it wasn't last. > > > > Yea, I think pragmatically that is the right choice. > > I have changed the order and updated all the places using > PgStat_BktypeIO as well as in all locations in which it should be > ordered for consistency (that I could find in the pass I did) -- e.g. > the view definition, function signatures, comments, commit messages, > etc. > > > > > > +-- Change the tablespace so that the table is rewritten directly, then SELECT > > > > > +-- from it to cause it to be read back into shared buffers. > > > > > +SET allow_in_place_tablespaces = true; > > > > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION ''; > > > > > > > > Perhaps worth doing this in tablespace.sql, to avoid the additional > > > > checkpoints done as part of CREATE/DROP TABLESPACE? > > > > > > > > Or, at least combine this with the CHECKPOINTs above? > > > > > > I see a checkpoint is requested when dropping the tablespace if not all > > > the files in it are deleted. 
It seems like if the DROP TABLE for the > > > permanent table is before the explicit checkpoints in the test, then the > > > DROP TABLESPACE will not cause an additional checkpoint. > > > > Unfortunately, that's not how it works :(. See the comment above mdunlink(): > > > > > * For regular relations, we don't unlink the first segment file of the rel, > > > * but just truncate it to zero length, and record a request to unlink it after > > > * the next checkpoint. Additional segments can be unlinked immediately, > > > * however. Leaving the empty file in place prevents that relfilenumber > > > * from being reused. The scenario this protects us from is: > > > ... > > > > > > > Is this what you are suggesting? Dropping the temporary table should not > > > have an effect on this. > > > > I was wondering about simply moving that portion of the test to > > tablespace.sql, where we already created a tablespace. > > > > > > An alternative would be to propose splitting tablespace.sql into one portion > > running at the start of parallel_schedule, and one at the end. Historically, > > we needed tablespace.sql to be optional due to causing problems when > > replicating to another instance on the same machine, but now we have > > allow_in_place_tablespaces. > > It seems like the best way would be to split up the tablespace test file > as you suggested and drop the tablespace at the end of the regression > test suite. There could be other tests that could use a tablespace. > Though what I wrote is kind of tablespace test coverage, if this > rewriting behavior no longer happened when doing alter table set > tablespace, we would want to come up with a new test which exercised > that code to count those IO stats, not simply delete it from the > tablespace tests. > > > > > SELECT pg_relation_size('test_io_local') / current_setting('block_size')::int8 > 100; > > > > > > > > Better toast compression or such could easily make test_io_local smaller than > > > > it's today. 
Seeing that it's too small would make it easier to understand the > > > > failure. > > > > > > Good idea. So, I used pg_table_size() because it seems like > > > pg_relation_size() does not include the toast relations. However, I'm > > > not sure this is a good idea, because pg_table_size() includes FSM and > > > visibility map. Should I write a query to get the toast relation name > > > and add pg_relation_size() of that relation and the main relation? > > > > I think it's the right thing to just include the relation size. Your queries > > IIRC won't use the toast table or other forks. So I'd leave it at just > > pg_relation_size(). > > I did notice that this test wasn't using the toast table for the > toastable column -- but you mentioned better toast compression affecting > the future test stability, so I'm confused. The patch does not apply on top of HEAD as in [1], please post a rebased patch: === Applying patches on top of PostgreSQL commit ID 4f74f5641d53559ec44e74d5bf552e167fdd5d20 === === applying patch ./v49-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch .... patching file src/test/regress/expected/rules.out Hunk #1 FAILED at 1876. 1 out of 1 hunk FAILED -- saving rejects to file src/test/regress/expected/rules.out.rej [1] - http://cfbot.cputube.org/patch_41_3272.log Regards, Vignesh
On Thu, Jan 19, 2023 at 6:18 AM vignesh C <vignesh21@gmail.com> wrote: > The patch does not apply on top of HEAD as in [1], please post a rebased patch: > === Applying patches on top of PostgreSQL commit ID > 4f74f5641d53559ec44e74d5bf552e167fdd5d20 === > === applying patch > ./v49-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch > .... > patching file src/test/regress/expected/rules.out > Hunk #1 FAILED at 1876. > 1 out of 1 hunk FAILED -- saving rejects to file > src/test/regress/expected/rules.out.rej > > [1] - http://cfbot.cputube.org/patch_41_3272.log Yes, it conflicted with 47bb9db75996232. rebased v50 is attached. On Tue, Jan 17, 2023 at 5:00 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > > > > +-- Change the tablespace so that the table is rewritten directly, then SELECT > > > > > +-- from it to cause it to be read back into shared buffers. > > > > > +SET allow_in_place_tablespaces = true; > > > > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION ''; > > > > > > > > Perhaps worth doing this in tablespace.sql, to avoid the additional > > > > checkpoints done as part of CREATE/DROP TABLESPACE? > > > > > > > > Or, at least combine this with the CHECKPOINTs above? > > > > > > I see a checkpoint is requested when dropping the tablespace if not all > > > the files in it are deleted. It seems like if the DROP TABLE for the > > > permanent table is before the explicit checkpoints in the test, then the > > > DROP TABLESPACE will not cause an additional checkpoint. > > > > Unfortunately, that's not how it works :(. See the comment above mdunlink(): > > > > > * For regular relations, we don't unlink the first segment file of the rel, > > > * but just truncate it to zero length, and record a request to unlink it after > > > * the next checkpoint. Additional segments can be unlinked immediately, > > > * however. Leaving the empty file in place prevents that relfilenumber > > > * from being reused. 
The scenario this protects us from is: > > > ... > > > > > > > Is this what you are suggesting? Dropping the temporary table should not > > > have an effect on this. > > > > I was wondering about simply moving that portion of the test to > > tablespace.sql, where we already created a tablespace. > > > > > > An alternative would be to propose splitting tablespace.sql into one portion > > running at the start of parallel_schedule, and one at the end. Historically, > > we needed tablespace.sql to be optional due to causing problems when > > replicating to another instance on the same machine, but now we have > > allow_in_place_tablespaces. > > It seems like the best way would be to split up the tablespace test file > as you suggested and drop the tablespace at the end of the regression > test suite. There could be other tests that could use a tablespace. > Though what I wrote is a kind of tablespace test coverage: if this > rewriting behavior no longer happened when doing alter table set > tablespace, we would want to come up with a new test which exercised > that code to count those IO stats, not simply delete it from the > tablespace tests. I have added a patch to the set which creates the regress_tblspace (formerly created in tablespace.sql) in test_setup.sql. I then moved the tablespace test to the end of the parallel schedule so that my test (and others) could use the regress_tblspace. I modified some of the tablespace.sql tests to be more specific in terms of the objects they are looking for so that tests using the tablespace are not forced to drop all of the objects they make in the tablespace. Note that I did not proactively change all tests in tablespace.sql that may fail in this way -- only those that failed because of the tables I created (and did not drop) from regress_tblspace.
Attachment
On Thu, Jan 19, 2023 at 4:28 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Thu, Jan 19, 2023 at 6:18 AM vignesh C <vignesh21@gmail.com> wrote: > > The patch does not apply on top of HEAD as in [1], please post a rebased patch: > > === Applying patches on top of PostgreSQL commit ID > > 4f74f5641d53559ec44e74d5bf552e167fdd5d20 === > > === applying patch > > ./v49-0003-Add-system-view-tracking-IO-ops-per-backend-type.patch > > .... > > patching file src/test/regress/expected/rules.out > > Hunk #1 FAILED at 1876. > > 1 out of 1 hunk FAILED -- saving rejects to file > > src/test/regress/expected/rules.out.rej > > > > [1] - http://cfbot.cputube.org/patch_41_3272.log > > Yes, it conflicted with 47bb9db75996232. rebased v50 is attached. Oh dear-- an extra FlushBuffer() snuck in there somehow. Removed it in attached v51. Also, I fixed an issue in my tablespace.sql updates - Melanie
Attachment
Hello. At Thu, 19 Jan 2023 21:15:34 -0500, Melanie Plageman <melanieplageman@gmail.com> wrote in > Oh dear-- an extra FlushBuffer() snuck in there somehow. > Removed it in attached v51. > Also, I fixed an issue in my tablespace.sql updates I only looked at 0002 and 0004. (Sorry for the random order of the comments.) 0002: + Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType)); This is relatively complex checking. We already assert out increments of invalid counters. Thus this is checking if some unrelated code clobbered them, which we do only when consistency is critical. Is there any need to do that here? I saw another occurrence of the same assertion. -/* Reset some shared cluster-wide counters */ +/* + * Reset some shared cluster-wide counters + * + * When adding a new reset target, ideally the name should match that in + * pgstat_kind_infos, if relevant. + */ I'm not sure the addition is useful. +pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op) +{ + Assert(io_object < IOOBJECT_NUM_TYPES); + Assert(io_context < IOCONTEXT_NUM_TYPES); + Assert(io_op < IOOP_NUM_TYPES); + Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op)); Is there any reason for not checking the value ranges at the bottom-most functions? They can lead to out-of-bounds access, so I don't think we need to continue execution for such invalid values. + no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || + bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || + bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP; I'm not sure I like omitting parentheses for such a long Boolean expression on the right side. + write_chunk_s(fpout, &pgStatLocal.snapshot.io); + if (!read_chunk_s(fpin, &shmem->io.stats)) The names of these functions hardly make sense on their own to me. How about write_struct()/read_struct()? (I personally prefer to use write_chunk() directly.) 
+ PgStat_BktypeIO This patch abbreviates "backend" as "bk", but "be" is used in many places. I think the naming should follow the precedents. 0004: system_views.sql: +FROM pg_stat_get_io() b; What does the "b" stand for? (Backend? Then "s" or "i" would seem more straightforward.) + nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op); + + if (nulls[col_idx]) + continue; + + values[col_idx] = + Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]); This is a bit hard to read since it requires following the condition flow. The following is simpler and, I think, closer to our standard. if (pgstat_tracks_io_op()) values[col_idx] = Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]); else nulls[col_idx] = true; > + Number of read operations in units of <varname>op_bytes</varname>. I may be the only one who sees the name as ambiguous between "total number of handled bytes" and "bytes handled per operation". Can't it be op_blocksize or just block_size? + b.io_object, + b.io_context, It's unclear to me why only these two columns are prefixed by "io". Wouldn't "object_type" and just "context" work instead? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
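The suggested restructuring of the values/nulls filling can be seen in a compilable toy. Everything here is a hypothetical stand-in: pgstat_tracks_io_op is a stub predicate, Int64GetDatum and Datum are simplified fakes of the real machinery, and fill_column is an invented helper that only demonstrates the if/else shape:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fake Datum for the sake of the sketch. */
typedef int64_t Datum;

static Datum
Int64GetDatum(int64_t v)
{
    return (Datum) v;
}

/* Stub predicate: pretend writes (op 1) on temp relations (obj 1) are
 * not tracked; the real predicate also consults the backend type. */
static bool
pgstat_tracks_io_op(int io_obj, int io_context, int io_op)
{
    return !(io_obj == 1 && io_op == 1);
}

/*
 * Fill one output column: a tracked counter becomes a value, an
 * untracked one becomes NULL.  The if/else reads top-down, instead of
 * assigning nulls[] first and then testing it to 'continue'.
 */
static void
fill_column(Datum *values, bool *nulls, int col_idx,
            int64_t counters[2][2][2],
            int io_obj, int io_context, int io_op)
{
    if (pgstat_tracks_io_op(io_obj, io_context, io_op))
        values[col_idx] = Int64GetDatum(counters[io_obj][io_context][io_op]);
    else
        nulls[col_idx] = true;
}
```

The point is only the control flow: both versions compute the same values/nulls arrays, but the positive test comes first and the NULL case is the explicit alternative.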
At Tue, 24 Jan 2023 17:22:03 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > +pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op) > +{ > + Assert(io_object < IOOBJECT_NUM_TYPES); > + Assert(io_context < IOCONTEXT_NUM_TYPES); > + Assert(io_op < IOOP_NUM_TYPES); > + Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op)); > > Is there any reason for not checking the value ranges at the > bottom-most functions? They can lead to out-of-bounds access so I To be clear, the "They" here means "out-of-range io_object/context/op values". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi,

On 2023-01-24 17:22:03 +0900, Kyotaro Horiguchi wrote:
> Hello.
> 
> At Thu, 19 Jan 2023 21:15:34 -0500, Melanie Plageman <melanieplageman@gmail.com> wrote in
> > Oh dear-- an extra FlushBuffer() snuck in there somehow.
> > Removed it in attached v51.
> > Also, I fixed an issue in my tablespace.sql updates
> 
> I only looked at 0002 and 0004.
> (Sorry for the random order of the comments..)
> 
> 0002:
> 
> +	Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
> 
> This is a relatively complex check. We already assert out increments
> of invalid counters, so this is checking whether some unrelated code
> clobbered them, which we do only when consistency is critical. Is
> there any need to do that here? I saw another occurrence of the same
> assertion.

I found it useful to find problems.

> +	no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
> +		bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
> +		bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
> 
> I'm not sure I like omitting the parentheses for such a long Boolean
> expression on the right side.

What parens would help?

> +	write_chunk_s(fpout, &pgStatLocal.snapshot.io);
> +	if (!read_chunk_s(fpin, &shmem->io.stats))
> 
> The names of the functions hardly make sense alone to me. How about
> write_struct()/read_struct()? (I personally prefer to use
> write_chunk() directly..)

That's not related to this patch - there's several existing callers for
it. And write_struct wouldn't be better imo, because it's not just for
structs.

> +	PgStat_BktypeIO
> 
> This patch abbreviates "backend" as "bk" but "be" is used in many
> places. I think that naming should follow the predecessors.

The precedents aren't consistent, unfortunately :)

> > +	Number of read operations in units of <varname>op_bytes</varname>.
> 
> I may be the only one who sees the name as ambiguous between "total
> number of handled bytes" and "bytes handled at an operation". Can't it
> be op_blocksize or just block_size?
> 
> +	b.io_object,
> +	b.io_context,

No, block wouldn't be helpful - we'd like to use this for something that isn't
uniform blocks.

Greetings,

Andres Freund
At Tue, 24 Jan 2023 14:35:12 -0800, Andres Freund <andres@anarazel.de> wrote in
> > 0002:
> > 
> > +	Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
> > 
> > This is a relatively complex check. We already assert out increments
> > of invalid counters, so this is checking whether some unrelated code
> > clobbered them, which we do only when consistency is critical. Is
> > there any need to do that here? I saw another occurrence of the same
> > assertion.
> 
> I found it useful to find problems.

Okay.

> > +	no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
> > +		bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
> > +		bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
> > 
> > I'm not sure I like omitting the parentheses for such a long Boolean
> > expression on the right side.
> 
> What parens would help?

I thought about the following.

	no_temp_rel = (bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
				   bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
				   bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP);

> > +	write_chunk_s(fpout, &pgStatLocal.snapshot.io);
> > +	if (!read_chunk_s(fpin, &shmem->io.stats))
> > 
> > The names of the functions hardly make sense alone to me. How about
> > write_struct()/read_struct()? (I personally prefer to use
> > write_chunk() directly..)
> 
> That's not related to this patch - there's several existing callers for
> it. And write_struct wouldn't be better imo, because it's not just for
> structs.

Hmm. Then what does the "_s" stand for?

> > +	PgStat_BktypeIO
> > 
> > This patch abbreviates "backend" as "bk" but "be" is used in many
> > places. I think that naming should follow the predecessors.
> 
> The precedents aren't consistent, unfortunately :)

Uuuummmmm. Okay, it's just that I like "be" there! Anyway, I don't
push that strongly.

> > > +	Number of read operations in units of <varname>op_bytes</varname>.
> > 
> > I may be the only one who sees the name as ambiguous between "total
> > number of handled bytes" and "bytes handled at an operation". Can't it
> > be op_blocksize or just block_size?
> > 
> > +	b.io_object,
> > +	b.io_context,
> 
> No, block wouldn't be helpful - we'd like to use this for something that isn't
> uniform blocks.

What does the field show in that case? The mean operation size? Or
one row per operation size? If the former, the name looks somewhat
wrong. If the latter, block_size seems to make sense.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

I did another read through the series. I do have some changes, but
they're minor. I think this is ready for commit. I plan to start pushing
tomorrow.

The changes I made are:
- the tablespace test changes didn't quite work in isolation / needed a bit of
  polishing
- moved the tablespace changes to later in the series
- split the tests out of the commit adding the view into its own commit
- minor code formatting things (e.g. didn't like nested for()s without {})

On 2023-01-25 16:56:17 +0900, Kyotaro Horiguchi wrote:
> At Tue, 24 Jan 2023 14:35:12 -0800, Andres Freund <andres@anarazel.de> wrote in
> > > +	write_chunk_s(fpout, &pgStatLocal.snapshot.io);
> > > +	if (!read_chunk_s(fpin, &shmem->io.stats))
> > > 
> > > The names of the functions hardly make sense alone to me. How about
> > > write_struct()/read_struct()? (I personally prefer to use
> > > write_chunk() directly..)
> > 
> > That's not related to this patch - there's several existing callers for
> > it. And write_struct wouldn't be better imo, because it's not just for
> > structs.
> 
> Hmm. Then what does the "_s" stand for?

Size. It's a macro that just forwards to read_chunk()/write_chunk().

> > > > +	Number of read operations in units of <varname>op_bytes</varname>.
> > > 
> > > I may be the only one who sees the name as ambiguous between "total
> > > number of handled bytes" and "bytes handled at an operation". Can't it
> > > be op_blocksize or just block_size?
> > > 
> > > +	b.io_object,
> > > +	b.io_context,
> > 
> > No, block wouldn't be helpful - we'd like to use this for something that isn't
> > uniform blocks.
> 
> What does the field show in that case? The mean operation size? Or
> one row per operation size? If the former, the name looks somewhat
> wrong. If the latter, block_size seems to make sense.

1, so that it's clear that the rest are in bytes.

Greetings,

Andres Freund
At Tue, 7 Feb 2023 22:38:14 -0800, Andres Freund <andres@anarazel.de> wrote in
> Hi,
> 
> I did another read through the series. I do have some changes, but
> they're minor. I think this is ready for commit. I plan to start pushing
> tomorrow.
> 
> The changes I made are:
> - the tablespace test changes didn't quite work in isolation / needed a bit of
>   polishing
> - moved the tablespace changes to later in the series
> - split the tests out of the commit adding the view into its own commit
> - minor code formatting things (e.g. didn't like nested for()s without {})

> On 2023-01-25 16:56:17 +0900, Kyotaro Horiguchi wrote:
> > At Tue, 24 Jan 2023 14:35:12 -0800, Andres Freund <andres@anarazel.de> wrote in
> > > > +	write_chunk_s(fpout, &pgStatLocal.snapshot.io);
> > > > +	if (!read_chunk_s(fpin, &shmem->io.stats))
> > > > 
> > > > The names of the functions hardly make sense alone to me. How about
> > > > write_struct()/read_struct()? (I personally prefer to use
> > > > write_chunk() directly..)
> > > 
> > > That's not related to this patch - there's several existing callers for
> > > it. And write_struct wouldn't be better imo, because it's not just for
> > > structs.
> > 
> > Hmm. Then what does the "_s" stand for?
> 
> Size. It's a macro that just forwards to read_chunk()/write_chunk().

I know what the macros do. But I'm fine with the names as they are,
since they predate this patch. Sorry for the noise.

> > > > > +	Number of read operations in units of <varname>op_bytes</varname>.
> > > > 
> > > > I may be the only one who sees the name as ambiguous between "total
> > > > number of handled bytes" and "bytes handled at an operation". Can't it
> > > > be op_blocksize or just block_size?
> > > > 
> > > > +	b.io_object,
> > > > +	b.io_context,
> > > 
> > > No, block wouldn't be helpful - we'd like to use this for something that isn't
> > > uniform blocks.
> > 
> > What does the field show in that case? The mean operation size? Or
> > one row per operation size? If the former, the name looks somewhat
> > wrong. If the latter, block_size seems to make sense.
> 
> 1, so that it's clear that the rest are in bytes.

Thanks. Okay, I guess the documentation will be changed as necessary.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2023-02-07 22:38:14 -0800, Andres Freund wrote:
> I did another read through the series. I do have some changes, but
> they're minor. I think this is ready for commit. I plan to start pushing
> tomorrow.

Pushed the first (and biggest) commit. More tomorrow.

Already can't wait to see incremental improvements of this version of
pg_stat_io ;). Tracking buffer hits. Tracking WAL IO. Tracking relation
IO bypassing shared buffers. Per-connection IO statistics. Tracking IO
time.

Greetings,

Andres Freund
Hi,

On 2023-02-08 21:03:19 -0800, Andres Freund wrote:
> Pushed the first (and biggest) commit. More tomorrow.

Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
and the pg_stat_io tests. Yay! Thanks all for the patch and reviews!

> Already can't wait to see incremental improvements of this version of
> pg_stat_io ;). Tracking buffer hits. Tracking WAL IO. Tracking relation
> IO bypassing shared buffers. Per-connection IO statistics. Tracking IO
> time.

That's still the case.

Greetings,

Andres Freund
Hi,

On 2023-02-11 10:24:37 -0800, Andres Freund wrote:
> Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
> and the pg_stat_io tests.

One thing I started to wonder about since is whether we should remove the io_
prefix from io_object, io_context. The prefixes make sense on the C level, but
it's not clear to me that that's also the case on the table level.

Greetings,

Andres Freund
On Tue, Feb 14, 2023 at 11:08 AM Andres Freund <andres@anarazel.de> wrote:
> One thing I started to wonder about since is whether we should remove the io_
> prefix from io_object, io_context. The prefixes make sense on the C level, but
> it's not clear to me that that's also the case on the table level.

Yeah, +1. It's hard to argue that there would be any confusion,
considering `io_` is in the name of the view.

(Unless, I suppose, some other, non-I/O, "some_object" or
"some_context" column were to be introduced to this view in the
future. But that doesn't seem likely?)
At Tue, 14 Feb 2023 22:35:01 -0800, Maciek Sakrejda <m.sakrejda@gmail.com> wrote in
> On Tue, Feb 14, 2023 at 11:08 AM Andres Freund <andres@anarazel.de> wrote:
> > One thing I started to wonder about since is whether we should remove the io_
> > prefix from io_object, io_context. The prefixes make sense on the C level, but
> > it's not clear to me that that's also the case on the table level.
> 
> Yeah, +1. It's hard to argue that there would be any confusion,
> considering `io_` is in the name of the view.

We usually add such prefixes to the columns of system views and
catalogs, but it seems that's not the case for the stats views. Thus
+1 from me, too.

> (Unless, I suppose, some other, non-I/O, "some_object" or
> "some_context" column were to be introduced to this view in the
> future. But that doesn't seem likely?)

I don't think that can happen. As for cross-view ambiguity, that is
already present: many columns in the stats views share the same names
with some other views.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Sat, Feb 11, 2023 at 10:24:37AM -0800, Andres Freund wrote:
> On 2023-02-08 21:03:19 -0800, Andres Freund wrote:
> > Pushed the first (and biggest) commit. More tomorrow.
> 
> Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
> and the pg_stat_io tests.

pg_stat_io says:

	 * Some BackendTypes do not currently perform any IO in certain
	 * IOContexts, and, while it may not be inherently incorrect for them to
	 * do so, excluding those rows from the view makes the view easier to use.

	if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
		return false;

	if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
		io_context == IOCONTEXT_BULKWRITE)
		return false;

What about these combinations? Aren't these also "can't happen" ?

	relation | bulkread | autovacuum worker
	relation | bulkread | autovacuum launcher
	relation | vacuum   | startup

-- 
Justin
On Tue, Feb 21, 2023 at 07:50:35PM -0600, Justin Pryzby wrote:
> On Sat, Feb 11, 2023 at 10:24:37AM -0800, Andres Freund wrote:
> > On 2023-02-08 21:03:19 -0800, Andres Freund wrote:
> > > Pushed the first (and biggest) commit. More tomorrow.
> > 
> > Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
> > and the pg_stat_io tests.
> 
> pg_stat_io says:
> 
> 	 * Some BackendTypes do not currently perform any IO in certain
> 	 * IOContexts, and, while it may not be inherently incorrect for them to
> 	 * do so, excluding those rows from the view makes the view easier to use.
> 
> 	if (bktype == B_AUTOVAC_LAUNCHER && io_context == IOCONTEXT_VACUUM)
> 		return false;
> 
> 	if ((bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_LAUNCHER) &&
> 		io_context == IOCONTEXT_BULKWRITE)
> 		return false;
> 
> What about these combinations? Aren't these also "can't happen" ?
> 
> 	relation | bulkread | autovacuum worker
> 	relation | bulkread | autovacuum launcher
> 	relation | vacuum   | startup

Nevermind - at least these are possible.

(gdb) p MyBackendType
$1 = B_AUTOVAC_WORKER
(gdb) p io_object
$2 = IOOBJECT_RELATION
(gdb) p io_context
$3 = IOCONTEXT_BULKREAD
(gdb) p io_op
$4 = IOOP_EVICT
(gdb) bt
...
#9  0x0000557b2f6097a3 in ReadBufferExtended (reln=0x7ff5ccee36b8, forkNum=forkNum@entry=MAIN_FORKNUM,
    blockNum=blockNum@entry=16, mode=mode@entry=RBM_NORMAL, strategy=0x557b305fb568) at ../src/include/utils/rel.h:573
#10 0x0000557b2f3057c0 in heapgetpage (sscan=sscan@entry=0x557b305fb158, block=block@entry=16)
    at ../src/backend/access/heap/heapam.c:405
#11 0x0000557b2f305d6c in heapgettup_pagemode (scan=scan@entry=0x557b305fb158, dir=dir@entry=ForwardScanDirection,
    nkeys=0, key=0x0) at ../src/backend/access/heap/heapam.c:885
#12 0x0000557b2f306956 in heap_getnext (sscan=sscan@entry=0x557b305fb158, direction=direction@entry=ForwardScanDirection)
    at ../src/backend/access/heap/heapam.c:1122
#13 0x0000557b2f59be0c in do_autovacuum () at ../src/backend/postmaster/autovacuum.c:2061
#14 0x0000557b2f59ccf7 in AutoVacWorkerMain (argc=argc@entry=0, argv=argv@entry=0x0) at ../src/backend/postmaster/autovacuum.c:1716
#15 0x0000557b2f59cdd8 in StartAutoVacWorker () at ../src/backend/postmaster/autovacuum.c:1494
#16 0x0000557b2f5a561a in StartAutovacuumWorker () at ../src/backend/postmaster/postmaster.c:5481
#17 0x0000557b2f5a5a39 in process_pm_pmsignal () at ../src/backend/postmaster/postmaster.c:5192
#18 0x0000557b2f5a5d7e in ServerLoop () at ../src/backend/postmaster/postmaster.c:1770
#19 0x0000557b2f5a73da in PostmasterMain (argc=9, argv=<optimized out>) at ../src/backend/postmaster/postmaster.c:1463
#20 0x0000557b2f4dfc39 in main (argc=9, argv=0x557b30568f50) at ../src/backend/main/main.c:200

-- 
Justin
Andres Freund <andres@anarazel.de> writes:
> Pushed the first (and biggest) commit. More tomorrow.

I hadn't run my buildfarm-compile-warning scraper for a little while,
but I just did, and I find that this commit is causing warnings on
no fewer than 14 buildfarm animals. They all look like

 ayu | 2023-02-25 23:02:08 | pgstat_io.c:40:14: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
 ayu | 2023-02-25 23:02:08 | pgstat_io.c:43:16: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
 ayu | 2023-02-25 23:02:08 | pgstat_io.c:70:19: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
 ayu | 2023-02-25 23:02:08 | pgstat_io.c:71:20: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
 ayu | 2023-02-25 23:02:08 | pgstat_io.c:115:14: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
 ayu | 2023-02-25 23:02:08 | pgstat_io.c:118:16: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
 ayu | 2023-02-25 23:02:08 | pgstatfuncs.c:1329:12: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
 ayu | 2023-02-25 23:02:08 | pgstatfuncs.c:1334:17: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]

That is, these compilers think that comparisons like

	io_object < IOOBJECT_NUM_TYPES
	io_context < IOCONTEXT_NUM_TYPES

are constant-true. This seems not good; if they were to actually
act on this observation, by removing those loop-ending tests,
we'd have a problem.

The issue seems to be that code like this:

typedef enum IOContext
{
	IOCONTEXT_BULKREAD,
	IOCONTEXT_BULKWRITE,
	IOCONTEXT_NORMAL,
	IOCONTEXT_VACUUM,
} IOContext;

#define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)

is far too cute for its own good. I'm not sure about how to fix it
either. I thought of defining

#define IOCONTEXT_LAST IOCONTEXT_VACUUM

and making the loop conditions like "io_context <= IOCONTEXT_LAST",
but that doesn't actually fix the problem.

(Even aside from that, I do not find this coding even a little bit
mistake-proof: you still have to remember to update the #define
when adding another enum value.)

We have similar code involving enum ForkNumber but it looks to me
like the loop variables are always declared as plain "int". That
might be the path of least resistance here.

			regards, tom lane
I wrote:
> The issue seems to be that code like this:
> ...
> is far too cute for its own good.

Oh, there's another thing here that qualifies as too-cute: loops like

	for (IOObject io_object = IOOBJECT_FIRST;
		 io_object < IOOBJECT_NUM_TYPES; io_object++)

make it look like we could define these enums as 1-based rather
than 0-based, but if we did this code would fail, because it's
confusing "the number of values" with "1 more than the last value".

Again, we could fix that with tests like "io_context <= IOCONTEXT_LAST",
but I don't see the point of adding more macros rather than removing
some. We do need IOOBJECT_NUM_TYPES to declare array sizes with,
so I think we should nuke the "xxx_FIRST" macros as being not worth
the electrons they're written on, and write these loops like

	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)

which is not actually adding any assumptions that you don't already
make by using io_object as a C array subscript.

			regards, tom lane
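The loop shape proposed above can be shown as a compilable miniature. The enum and counter names below are illustrative stand-ins for the pgstat_io definitions, not the real PostgreSQL headers.

```c
/* Illustrative miniature of the pgstat_io enum/array pattern. */
typedef enum IOObject
{
	IOOBJECT_RELATION,
	IOOBJECT_TEMP_RELATION,
} IOObject;

#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)

static long counts[IOOBJECT_NUM_TYPES];

static long
sum_counts(void)
{
	long		total = 0;

	/*
	 * Plain "int" loop variable: it can legitimately hold
	 * IOOBJECT_NUM_TYPES at loop exit, so the tautological-comparison
	 * warning cannot arise, and using it as an array subscript needs no
	 * cast.
	 */
	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
		total += counts[io_object];

	return total;
}
```

The same subscripting works unchanged if the enum later grows, since IOOBJECT_NUM_TYPES tracks the last member.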
Hi,

On 2023-02-26 13:20:00 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Pushed the first (and biggest) commit. More tomorrow.
> 
> I hadn't run my buildfarm-compile-warning scraper for a little while,
> but I just did, and I find that this commit is causing warnings on
> no fewer than 14 buildfarm animals. They all look like
> 
>  ayu | 2023-02-25 23:02:08 | pgstat_io.c:40:14: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
>  ayu | 2023-02-25 23:02:08 | pgstat_io.c:43:16: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
>  ayu | 2023-02-25 23:02:08 | pgstat_io.c:70:19: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
>  ayu | 2023-02-25 23:02:08 | pgstat_io.c:71:20: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
>  ayu | 2023-02-25 23:02:08 | pgstat_io.c:115:14: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
>  ayu | 2023-02-25 23:02:08 | pgstat_io.c:118:16: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]
>  ayu | 2023-02-25 23:02:08 | pgstatfuncs.c:1329:12: warning: comparison of constant 2 with expression of type 'IOObject' (aka 'enum IOObject') is always true [-Wtautological-constant-out-of-range-compare]
>  ayu | 2023-02-25 23:02:08 | pgstatfuncs.c:1334:17: warning: comparison of constant 4 with expression of type 'IOContext' (aka 'enum IOContext') is always true [-Wtautological-constant-out-of-range-compare]

What other animals? If it had been just ayu / clang 4, I'd not be sure
it's worth doing much here.

> That is, these compilers think that comparisons like
> 
> 	io_object < IOOBJECT_NUM_TYPES
> 	io_context < IOCONTEXT_NUM_TYPES
> 
> are constant-true. This seems not good; if they were to actually
> act on this observation, by removing those loop-ending tests,
> we'd have a problem.

It'd at least be obvious breakage :/

> The issue seems to be that code like this:
> 
> typedef enum IOContext
> {
> 	IOCONTEXT_BULKREAD,
> 	IOCONTEXT_BULKWRITE,
> 	IOCONTEXT_NORMAL,
> 	IOCONTEXT_VACUUM,
> } IOContext;
> 
> #define IOCONTEXT_FIRST IOCONTEXT_BULKREAD
> #define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
> 
> is far too cute for its own good. I'm not sure about how to fix it
> either. I thought of defining
> 
> #define IOCONTEXT_LAST IOCONTEXT_VACUUM
> 
> and making the loop conditions like "io_context <= IOCONTEXT_LAST",
> but that doesn't actually fix the problem.
> 
> (Even aside from that, I do not find this coding even a little bit
> mistake-proof: you still have to remember to update the #define
> when adding another enum value.)

But the alternative is going around and updating N places, or having a
LAST member in the enum, which then means either adding pointless case
statements or adding default: cases, which prevents the compiler from
warning when a new case is added.

I haven't dug up an old enough compiler yet; what happens if
IOCONTEXT_NUM_TYPES is redefined to ((int) IOOBJECT_TEMP_RELATION + 1)?

> We have similar code involving enum ForkNumber but it looks to me
> like the loop variables are always declared as plain "int". That
> might be the path of least resistance here.

IIRC that caused some even longer lines due to casting the integer to
the enum in some other lines. Perhaps we should just cast for the <
comparison?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-02-26 13:20:00 -0500, Tom Lane wrote:
>> I hadn't run my buildfarm-compile-warning scraper for a little while,
>> but I just did, and I find that this commit is causing warnings on
>> no fewer than 14 buildfarm animals. They all look like

> What other animals? If it had been just ayu / clang 4, I'd not be sure it's
> worth doing much here.

ayu
batfish
demoiselle
desmoxytes
dragonet
idiacanthus
mantid
petalura
phycodurus
pogona
wobbegong

Some of those are yours ;-)

Actually there are only 11, because I miscounted before, but
there are new compilers in that group, not only old ones.
desmoxytes is gcc 10, for instance.

			regards, tom lane
Hi,

On 2023-02-26 14:40:00 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-02-26 13:20:00 -0500, Tom Lane wrote:
> >> I hadn't run my buildfarm-compile-warning scraper for a little while,
> >> but I just did, and I find that this commit is causing warnings on
> >> no fewer than 14 buildfarm animals. They all look like
> 
> > What other animals? If it had been just ayu / clang 4, I'd not be sure it's
> > worth doing much here.
> 
> ayu
> batfish
> demoiselle
> desmoxytes
> dragonet
> idiacanthus
> mantid
> petalura
> phycodurus
> pogona
> wobbegong
> 
> Some of those are yours ;-)
> 
> Actually there are only 11, because I miscounted before, but
> there are new compilers in that group, not only old ones.
> desmoxytes is gcc 10, for instance.

I think on mine the warnings come from the clang used to generate
bitcode, rather than gcc. The parallel make output makes that a bit hard
to see though, as commands and warnings are interspersed.

They're all animals for testing older LLVM versions. They're using
pretty old clang versions. phycodurus and dragonet are clang 3.9, petalura and
desmoxytes are clang 4, idiacanthus and pogona are clang 5.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> They're all animals for testing older LLVM versions. They're using
> pretty old clang versions. phycodurus and dragonet are clang 3.9, petalura and
> desmoxytes are clang 4, idiacanthus and pogona are clang 5.

[ shrug ... ] If I thought this was actually good code, I might
agree with ignoring these warnings; but I think what it mostly is
is misleading overcomplication.

			regards, tom lane
On 2023-02-26 15:08:33 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > They're all animals for testing older LLVM versions. They're using
> > pretty old clang versions. phycodurus and dragonet are clang 3.9, petalura and
> > desmoxytes are clang 4, idiacanthus and pogona are clang 5.
> 
> [ shrug ... ] If I thought this was actually good code, I might
> agree with ignoring these warnings; but I think what it mostly is
> is misleading overcomplication.

I don't mind removing *_FIRST et al by using 0. None of the proposals for
getting rid of *_NUM_* seemed a cure actually better than the disease.

Adding a cast to int of the loop iteration variable seems to work and is
only noticeably, not intolerably, ugly.

One thing that's odd is that the warnings don't appear reliably. The
"io_op < IOOP_NUM_TYPES" comparison in pgstatfuncs.c doesn't trigger any
with clang-4.

Greetings,

Andres Freund
On Sun, Feb 26, 2023 at 12:33:03PM -0800, Andres Freund wrote:
> On 2023-02-26 15:08:33 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > They're all animals for testing older LLVM versions. They're using
> > > pretty old clang versions. phycodurus and dragonet are clang 3.9, petalura and
> > > desmoxytes are clang 4, idiacanthus and pogona are clang 5.
> > 
> > [ shrug ... ] If I thought this was actually good code, I might
> > agree with ignoring these warnings; but I think what it mostly is
> > is misleading overcomplication.
> 
> I don't mind removing *_FIRST et al by using 0. None of the proposals for
> getting rid of *_NUM_* seemed a cure actually better than the disease.

I am also fine with removing *_FIRST and allowing those electrons to
move on to bigger and better things :)

> Adding a cast to int of the loop iteration variable seems to work and is
> only noticeably, not intolerably, ugly.
> 
> One thing that's odd is that the warnings don't appear reliably. The
> "io_op < IOOP_NUM_TYPES" comparison in pgstatfuncs.c doesn't trigger any
> with clang-4.

Using an int and casting all over the place certainly doesn't make the
code more attractive, but I am fine with this if it seems like the least
bad solution.

I didn't want to write a patch with this (ints instead of enums as loop
control variables) without being able to reproduce the warnings myself
and confirm that the patch silences them. However, I wasn't able to
reproduce the warnings. I tried to do so with a minimal repro on
godbolt, and even with
-Wtautological-constant-out-of-range-compare -Wall -Wextra -Weverything -Werror
I couldn't get clang 4 or 5 (or a number of other compilers I randomly
picked from the dropdown) to produce the warnings.

- Melanie
On Sun, Feb 26, 2023 at 04:11:45PM -0500, Melanie Plageman wrote:
> On Sun, Feb 26, 2023 at 12:33:03PM -0800, Andres Freund wrote:
> > On 2023-02-26 15:08:33 -0500, Tom Lane wrote:
> > > Andres Freund <andres@anarazel.de> writes:
> > > > They're all animals for testing older LLVM versions. They're using
> > > > pretty old clang versions. phycodurus and dragonet are clang 3.9, petalura and
> > > > desmoxytes are clang 4, idiacanthus and pogona are clang 5.
> > > 
> > > [ shrug ... ] If I thought this was actually good code, I might
> > > agree with ignoring these warnings; but I think what it mostly is
> > > is misleading overcomplication.
> > 
> > I don't mind removing *_FIRST et al by using 0. None of the proposals for
> > getting rid of *_NUM_* seemed a cure actually better than the disease.
> 
> I am also fine with removing *_FIRST and allowing those electrons to
> move on to bigger and better things :)
> 
> > Adding a cast to int of the loop iteration variable seems to work and is
> > only noticeably, not intolerably, ugly.
> > 
> > One thing that's odd is that the warnings don't appear reliably. The
> > "io_op < IOOP_NUM_TYPES" comparison in pgstatfuncs.c doesn't trigger any
> > with clang-4.
> 
> Using an int and casting all over the place certainly doesn't make the
> code more attractive, but I am fine with this if it seems like the least
> bad solution.
> 
> I didn't want to write a patch with this (ints instead of enums as loop
> control variables) without being able to reproduce the warnings myself
> and confirm that the patch silences them. However, I wasn't able to
> reproduce the warnings. I tried to do so with a minimal repro on
> godbolt, and even with
> -Wtautological-constant-out-of-range-compare -Wall -Wextra -Weverything -Werror
> I couldn't get clang 4 or 5 (or a number of other compilers I randomly
> picked from the dropdown) to produce the warnings.

Just kidding: it reproduces if the defined enum has two or fewer values.
Interesting...

After discovering this, I tried out various solutions, including one
Andres suggested:

	for (IOOp io_op = 0; (int) io_op < IOOP_NUM_TYPES; io_op++)

and it does silence the warning. What do you think?

- Melanie
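The cast variant tried above keeps the enum-typed loop variable and casts only in the termination test. A minimal sketch with illustrative names (a three-member enum rather than the real IOOp, to keep it self-contained):

```c
/* Illustrative three-member enum standing in for the real IOOp. */
typedef enum IOOp
{
	IOOP_EVICT,
	IOOP_READ,
	IOOP_WRITE,
} IOOp;

#define IOOP_NUM_TYPES ((int) IOOP_WRITE + 1)

static int	visited[IOOP_NUM_TYPES];

static int
visit_all(void)
{
	int			n = 0;

	/*
	 * Enum-typed loop variable, but the termination test compares as int,
	 * which sidesteps -Wtautological-constant-out-of-range-compare on the
	 * older clang versions discussed in this thread.
	 */
	for (IOOp io_op = 0; (int) io_op < IOOP_NUM_TYPES; io_op++)
	{
		visited[io_op] = 1;
		n++;
	}
	return n;
}
```

The trade-off relative to a plain int counter: the enum type is preserved for array subscripting and function arguments, at the cost of a cast in every termination test.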
On Sun, Feb 26, 2023 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > The issue seems to be that code like this:
> > ...
> > is far too cute for its own good.
> 
> Oh, there's another thing here that qualifies as too-cute: loops like
> 
> 	for (IOObject io_object = IOOBJECT_FIRST;
> 		 io_object < IOOBJECT_NUM_TYPES; io_object++)
> 
> make it look like we could define these enums as 1-based rather
> than 0-based, but if we did this code would fail, because it's
> confusing "the number of values" with "1 more than the last value".
> 
> Again, we could fix that with tests like "io_context <= IOCONTEXT_LAST",
> but I don't see the point of adding more macros rather than removing
> some. We do need IOOBJECT_NUM_TYPES to declare array sizes with,
> so I think we should nuke the "xxx_FIRST" macros as being not worth
> the electrons they're written on, and write these loops like
> 
> 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
> 
> which is not actually adding any assumptions that you don't already
> make by using io_object as a C array subscript.

Attached is a patch to remove the *_FIRST macros.

I was going to add in code to change

	for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)

to

	for (IOObject io_object = 0; (int) io_object < IOOBJECT_NUM_TYPES; io_object++)

but then I couldn't remember why we didn't just do

	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)

I recall that when passing that loop variable into a function I was
getting a compiler warning that required me to cast the value back to an
enum to silence it:

	pgstat_tracks_io_op(bktype, (IOObject) io_object,
						io_context, io_op))

However, I am now unable to reproduce that warning. Moreover, I see in
cases like table_block_relation_size() with ForkNumber, the variable i
is passed with no cast to smgrnblocks().

- Melanie
Attachment
Melanie Plageman <melanieplageman@gmail.com> writes: > Attached is a patch to remove the *_FIRST macros. > I was going to add in code to change > for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) > to > for (IOObject io_object = 0; (int) io_object < IOOBJECT_NUM_TYPES; io_object++) I don't really like that proposal. ISTM it's just silencing the messenger rather than addressing the underlying problem, namely that there's no guarantee that an IOObject variable can hold the value IOOBJECT_NUM_TYPES, which it had better do if you want the loop to terminate. Admittedly it's quite unlikely that these three enums would grow to the point that that becomes an actual hazard for them --- but IMO it's still bad practice and a bad precedent for future code. > but then I couldn't remember why we didn't just do > for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) > I recall that when passing that loop variable into a function I was > getting a compiler warning that required me to cast the value back to an > enum to silence it: > pgstat_tracks_io_op(bktype, (IOObject) io_object, > io_context, io_op)) > However, I am now unable to reproduce that warning. > Moreover, I see in cases like table_block_relation_size() with > ForkNumber, the variable i is passed with no cast to smgrnblocks(). Yeah, my druthers would be to just do it the way we do comparable things with ForkNumber. I don't feel like we need to invent a better way here. The risk of needing to cast when using the "int" loop variable as an enum is obviously the downside of that approach, but we have not seen any indication that any compilers actually do warn. It's interesting that you did see such a warning ... I wonder which compiler you were using at the time? regards, tom lane
On Mon, Feb 27, 2023 at 10:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Melanie Plageman <melanieplageman@gmail.com> writes: > > Attached is a patch to remove the *_FIRST macros. > > I was going to add in code to change > > > for (IOObject io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) > > to > > for (IOObject io_object = 0; (int) io_object < IOOBJECT_NUM_TYPES; io_object++) > > I don't really like that proposal. ISTM it's just silencing the > messenger rather than addressing the underlying problem, namely that > there's no guarantee that an IOObject variable can hold the value > IOOBJECT_NUM_TYPES, which it had better do if you want the loop to > terminate. Admittedly it's quite unlikely that these three enums would > grow to the point that that becomes an actual hazard for them --- but > IMO it's still bad practice and a bad precedent for future code. That's fair. Patch attached. > > but then I couldn't remember why we didn't just do > > > for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) > > > I recall that when passing that loop variable into a function I was > > getting a compiler warning that required me to cast the value back to an > > enum to silence it: > > > pgstat_tracks_io_op(bktype, (IOObject) io_object, > > io_context, io_op)) > > > However, I am now unable to reproduce that warning. > > Moreover, I see in cases like table_block_relation_size() with > > ForkNumber, the variable i is passed with no cast to smgrnblocks(). > > Yeah, my druthers would be to just do it the way we do comparable > things with ForkNumber. I don't feel like we need to invent a > better way here. > > The risk of needing to cast when using the "int" loop variable > as an enum is obviously the downside of that approach, but we have > not seen any indication that any compilers actually do warn. > It's interesting that you did see such a warning ... I wonder which > compiler you were using at the time? 
so, pretty much any version of clang I tried with -Wsign-conversion produces a warning. <source>:35:32: warning: implicit conversion changes signedness: 'int' to 'IOOp' (aka 'enum IOOp') [-Wsign-conversion] I didn't do the casts in the attached patch since they aren't done elsewhere. - Melanie
Attachment
Melanie Plageman <melanieplageman@gmail.com> writes: > On Mon, Feb 27, 2023 at 10:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The risk of needing to cast when using the "int" loop variable >> as an enum is obviously the downside of that approach, but we have >> not seen any indication that any compilers actually do warn. >> It's interesting that you did see such a warning ... I wonder which >> compiler you were using at the time? > so, pretty much any version of clang I tried with > -Wsign-conversion produces a warning. > <source>:35:32: warning: implicit conversion changes signedness: 'int' > to 'IOOp' (aka 'enum IOOp') [-Wsign-conversion] Oh, interesting --- so it's not about the implicit conversion to enum but just about signedness. I bet we could silence that by making the loop variables be "unsigned int". I doubt it's worth any extra keystrokes though, because we are not at all clean about sign-conversion warnings. I tried enabling -Wsign-conversion on Apple's clang 14.0.0 just now, and counted 13462 such warnings just in the core build :-(. I don't foresee anybody trying to clean that up. > I didn't do the casts in the attached patch since they aren't done elsewhere. Agreed. I'll push this along with the earlier patch if there are not objections. regards, tom lane
On 2023-02-27 14:58:30 -0500, Tom Lane wrote: > Agreed. I'll push this along with the earlier patch if there are > not objections. None here.
Andres Freund <andres@anarazel.de> writes: > Just pushed the actual pg_stat_io view, the splitting of the tablespace test, > and the pg_stat_io tests. One of the test cases is flapping a bit: diff -U3 /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out --- /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out 2023-03-04 21:30:05.891579466+0100 +++ /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out 2023-03-04 21:34:26.745552661+0100 @@ -1201,7 +1201,7 @@ SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads; ?column? ---------- - t + f (1 row) DROP TABLE test_io_shared; There are two instances of this today [1][2], and I've seen it before but failed to note down where. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2023-03-04%2021%3A19%3A39 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2023-03-04%2020%3A30%3A05
At Sat, 04 Mar 2023 18:21:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Andres Freund <andres@anarazel.de> writes: > > Just pushed the actual pg_stat_io view, the splitting of the tablespace test, > > and the pg_stat_io tests. > > One of the test cases is flapping a bit: > > diff -U3 /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out > --- /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out 2023-03-04 21:30:05.891579466+0100 > +++ /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out 2023-03-04 21:34:26.745552661+0100 > @@ -1201,7 +1201,7 @@ > SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads; > ?column? > ---------- > - t > + f > (1 row) > > DROP TABLE test_io_shared; > > There are two instances of this today [1][2], and I've seen it before > but failed to note down where. The concurrent autoanalyze below is logged as performing at least one page read from the table. It is unclear, however, how that analyze operation resulted in 19 hits and 2 reads on the (I think) single-page relation. In any case, I think we need to avoid such concurrent autovacuum/analyze. 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2023-03-04%2021%3A19%3A39 2023-03-04 22:36:27.781 CET [4073:106] pg_regress/stats LOG: statement: ALTER TABLE test_io_shared SET TABLESPACE regress_tblspace; 2023-03-04 22:36:27.838 CET [4073:107] pg_regress/stats LOG: statement: SELECT COUNT(*) FROM test_io_shared; 2023-03-04 22:36:27.864 CET [4255:5] LOG: automatic analyze of table "regression.public.test_io_shared" avg read rate: 5.208 MB/s, avg write rate: 5.208 MB/s buffer usage: 17 hits, 2 misses, 2 dirtied 2023-03-04 22:36:28.024 CET [4073:108] pg_regress/stats LOG: statement: SELECT pg_stat_force_next_flush(); 2023-03-04 22:36:28.024 CET [4073:108] pg_regress/stats LOG: statement: SELECT pg_stat_force_next_flush(); 2023-03-04 22:36:28.027 CET [4073:109] pg_regress/stats LOG: statement: SELECT sum(reads) AS io_sum_shared_after_reads FROM pg_stat_io WHERE io_context = 'normal' AND io_object = 'relation' > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2023-03-04%2021%3A19%3A39 > [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2023-03-04%2020%3A30%3A05 regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > In any case, I think we need to avoid such concurrent autovacuum/analyze. If it is correct, I believe the attached fix works. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > In any case, I think we need to avoid such concurrent autovacuum/analyze. > > If it is correct, I believe the attached fix works. Thanks for investigating this! Yes, this fix looks correct and makes sense to me. On Mon, Mar 6, 2023 at 1:24 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Sat, 04 Mar 2023 18:21:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in > > Andres Freund <andres@anarazel.de> writes: > > > Just pushed the actual pg_stat_io view, the splitting of the tablespace test, > > > and the pg_stat_io tests. > > > > One of the test cases is flapping a bit: > > > > diff -U3 /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out > > --- /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/expected/stats.out 2023-03-04 21:30:05.891579466+0100 > > +++ /home/pg/build-farm-15/buildroot/HEAD/pgsql.build/src/test/regress/results/stats.out 2023-03-04 21:34:26.745552661+0100 > > @@ -1201,7 +1201,7 @@ > > SELECT :io_sum_shared_after_reads > :io_sum_shared_before_reads; > > ?column? > > ---------- > > - t > > + f > > (1 row) > > > > DROP TABLE test_io_shared; > > > > There are two instances of this today [1][2], and I've seen it before > > but failed to note down where. > > The concurrent autoanalyze below is logged as performing at least one > page read from the table. It is unclear, however, how that analyze > operation resulted in 19 hits and 2 reads on the (I think) single-page > relation. Yes, it is a single page. 
I think there could be a few different reasons why it is 2 misses/2 dirtied, but the one that seems most likely is that I/O on other relations (such as catalog tables) done during this autovacuum/analyze of this relation is counted in the same global variables. - Melanie
Hi, On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote: > On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > > In any case, I think we need to avoid such concurrent autovacuum/analyze. > > > > If it is correct, I believe the attached fix works. > > Thanks for investigating this! > > Yes, this fix looks correct and makes sense to me. Wouldn't it be better to just perform the section from the ALTER TABLE till the DROP TABLE in a transaction? Then there couldn't be any other accesses in just that section. I'm not convinced it's good to disallow all concurrent activity in other parts of the test. Greetings, Andres Freund
On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote: > Hi, > > On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote: > > On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > > > In any case, I think we need to avoid such concurrent autovacuum/analyze. > > > > > > If it is correct, I believe the attached fix works. > > > > Thanks for investigating this! > > > > Yes, this fix looks correct and makes sense to me. > > Wouldn't it be better to just perform the section from the ALTER TABLE till > the DROP TABLE in a transaction? Then there couldn't be any other accesses in > just that section. I'm not convinced it's good to disallow all concurrent > activity in other parts of the test. You mean for test coverage reasons? Because the table in question only exists for a few operations in this test file. - Melanie
Hi, On 2023-03-06 14:24:09 -0500, Melanie Plageman wrote: > On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote: > > On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote: > > > On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > > > > > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > > > > In any case, I think we need to avoid such concurrent autovacuum/analyze. > > > > > > > > If it is correct, I believe the attached fix works. > > > > > > Thanks for investigating this! > > > > > > Yes, this fix looks correct and makes sense to me. > > > > Wouldn't it be better to just perform the section from the ALTER TABLE till > > the DROP TABLE in a transaction? Then there couldn't be any other accesses in > > just that section. I'm not convinced it's good to disallow all concurrent > > activity in other parts of the test. > > You mean for test coverage reasons? Because the table in question only > exists for a few operations in this test file. That, but also because it's simply more reliable. autovacuum=off doesn't protect against an anti-wraparound vacuum or such. Or a concurrent test somehow triggering a read. Or ... Greetings, Andres Freund
On Mon, Mar 6, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2023-03-06 14:24:09 -0500, Melanie Plageman wrote: > > On Mon, Mar 06, 2023 at 11:09:19AM -0800, Andres Freund wrote: > > > On 2023-03-06 10:09:24 -0500, Melanie Plageman wrote: > > > > On Mon, Mar 6, 2023 at 1:48 AM Kyotaro Horiguchi > > > > <horikyota.ntt@gmail.com> wrote: > > > > > > > > > > At Mon, 06 Mar 2023 15:24:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > > > > > In any case, I think we need to avoid such concurrent autovacuum/analyze. > > > > > > > > > > If it is correct, I believe the attached fix works. > > > > > > > > Thanks for investigating this! > > > > > > > > Yes, this fix looks correct and makes sense to me. > > > > > > Wouldn't it be better to just perform the section from the ALTER TABLE till > > > the DROP TABLE in a transaction? Then there couldn't be any other accesses in > > > just that section. I'm not convinced it's good to disallow all concurrent > > > activity in other parts of the test. > > > > You mean for test coverage reasons? Because the table in question only > > exists for a few operations in this test file. > > That, but also because it's simply more reliable. autovacuum=off doesn't > protect against a anti-wraparound vacuum or such. Or a concurrent test somehow > triggering a read. Or ... Good point. Attached is what you suggested. I committed the transaction before the drop table so that the statistics would be visible when we queried pg_stat_io. - Melanie
Attachment
At Mon, 6 Mar 2023 15:21:14 -0500, Melanie Plageman <melanieplageman@gmail.com> wrote in > On Mon, Mar 6, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote: > > That, but also because it's simply more reliable. autovacuum=off doesn't > > protect against an anti-wraparound vacuum or such. Or a concurrent test somehow > > triggering a read. Or ... > > Good point. Attached is what you suggested. I committed the transaction > before the drop table so that the statistics would be visible when we > queried pg_stat_io. While I don't believe an anti-wraparound vacuum can occur during testing, Melanie's solution (moving the commit by a few lines) seems to work (verified by manual testing). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote: > Good point. Attached is what you suggested. I committed the transaction > before the drop table so that the statistics would be visible when we > queried pg_stat_io. Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie. Greetings, Andres Freund
On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote: > Hi, > > On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote: > > Good point. Attached is what you suggested. I committed the transaction > > before the drop table so that the statistics would be visible when we > > queried pg_stat_io. > > Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie. There's a 2nd portion of the test that's still flapping, at least on cirrusci. The issue that Tom mentioned is at: SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes; But what I've seen on cirrusci is at: SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes; https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs https://api.cirrus-ci.com/v1/artifact/task/5355168397524992/log/src/test/recovery/tmp_check/regression.diffs https://api.cirrus-ci.com/v1/artifact/task/6142435751886848/testrun/build/testrun/recovery/027_stream_regress/log/regress_log_027_stream_regress It'd be neat if cfbot could show a histogram of test failures, although I'm not entirely sure what granularity would be most useful: the test that failed (027_regress) or the way it failed (:after_write > :before_writes). Maybe it's enough to show the test, with links to its recent failures. -- Justin
Hi, On 2023-03-09 06:51:31 -0600, Justin Pryzby wrote: > On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote: > > Hi, > > > > On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote: > > > Good point. Attached is what you suggested. I committed the transaction > > > before the drop table so that the statistics would be visible when we > > > queried pg_stat_io. > > > > Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie. > > There's a 2nd portion of the test that's still flapping, at least on > cirrusci. > > The issue that Tom mentioned is at: > SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes; > > But what I've seen on cirrusci is at: > SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes; Seems you meant to copy a different line for Tom's (s/writes/reads/)? > > > https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs Hm. I guess the explanation here is that the buffers were already all written out by another backend. Which is made more likely by your patch. I found a few more occurrences and chatted with Melanie. Melanie will come up with a fix, I think. Greetings, Andres Freund
On Thu, Mar 9, 2023 at 2:43 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-03-09 06:51:31 -0600, Justin Pryzby wrote: > > On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote: > > There's a 2nd portion of the test that's still flapping, at least on > > cirrusci. > > > > The issue that Tom mentioned is at: > > SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes; > > > > But what I've seen on cirrusci is at: > > SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes; > > Seems you meant to copy a different line for Tom's (s/writes/reads/)? > > > > https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs > > Hm. I guess the explanation here is that the buffers were already all written > out by another backend. Which is made more likely by your patch. > > > I found a few more occurrences and chatted with Melanie. Melanie will come up > with a fix I think. So, what this test is relying on is that either the checkpointer or another backend will flush the pages of test_io_shared which we dirtied above in the test. The test specifically checks for IOCONTEXT_NORMAL writes. It could fail if some other backend is doing a bulkread or bulkwrite and flushes these buffers first in a strategy context. This will happen more often when shared_buffers is small. I tried to come up with a reliable test which was limited to IOCONTEXT_NORMAL. I thought that if we could guarantee a dirty buffer would be pinned using a cursor, we could then issue a checkpoint and guarantee a flush that way. However, I don't see a way to guarantee that no one flushes the buffer between dirtying it and pinning it with the cursor. So, I think our best bet is to just change the test to pass if there are any writes in any context. By moving the sum(writes) before the INSERT and keeping the checkpoint, we can guarantee that one way or another, some buffers will be flushed. This essentially covers the same code anyway.
Patch attached. - Melanie
Attachment
On Thu, Mar 09, 2023 at 11:43:01AM -0800, Andres Freund wrote: > On 2023-03-09 06:51:31 -0600, Justin Pryzby wrote: > > On Tue, Mar 07, 2023 at 10:18:44AM -0800, Andres Freund wrote: > > > Hi, > > > > > > On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote: > > > > Good point. Attached is what you suggested. I committed the transaction > > > > before the drop table so that the statistics would be visible when we > > > > queried pg_stat_io. > > > > > > Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie. > > > > There's a 2nd portion of the test that's still flapping, at least on > > cirrusci. > > > > The issue that Tom mentioned is at: > > SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes; > > > > But what I've seen on cirrusci is at: > > SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes; > > Seems you meant to copy a different line for Tom's (s/writes/redas/)? Seems so > > https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs > > Hm. I guess the explanation here is that the buffers were already all written > out by another backend. Which is made more likely by your patch. FYI: that patch would've made it more likely for each backend to write out its *own* dirty pages of TOAST ... but the two other failures that I mentioned were for patches which wouldn't have affected this at all. -- Justin
On Fri, Mar 10, 2023 at 3:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Thu, Mar 09, 2023 at 11:43:01AM -0800, Andres Freund wrote: > > > https://api.cirrus-ci.com/v1/artifact/task/6701069548388352/log/src/test/recovery/tmp_check/regression.diffs > > > > Hm. I guess the explanation here is that the buffers were already all written > > out by another backend. Which is made more likely by your patch. > > FYI: that patch would've made it more likely for each backend to write > out its *own* dirty pages of TOAST ... but the two other failures that I > mentioned were for patches which wouldn't have affected this at all. I think your patch made it more likely that a backend needing to flush a buffer in order to fit its own data would be doing so in a buffer access strategy IO context. Your patch makes it so those toast table writes are using a BAS_BULKWRITE (see GetBulkInsertState()) and when they are looking for buffers to put their data in, they have to evict other data (theirs and others) but all of this is tracked in io_context = 'bulkwrite' -- and the test only counted writes done in io_context 'normal'. But it is good that your patch did that! It helped us to see that this test is not reliable. The other times this test failed in cfbot were for a patch that had many failures and might have something wrong with its code, IIRC. Thanks again for the report! - Melanie
Hello, I found that the 'standalone backend' backend type is not documented right now. Adding something like (from commit message) would be helpful: Both the bootstrap backend and single user mode backends will have backend_type STANDALONE_BACKEND. -- Pavel Luzanov Postgres Professional: https://postgrespro.com
On Mon, Apr 3, 2023 at 12:13 AM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote: > > Hello, > > I found that the 'standalone backend' backend type is not documented > right now. > Adding something like (from commit message) would be helpful: > > Both the bootstrap backend and single user mode backends will have > backend_type STANDALONE_BACKEND. Thanks for the report. Attached is a tiny patch to add standalone backend type to pg_stat_activity documentation (referenced by pg_stat_io). I mentioned both the bootstrap process and single user mode process in the docs, though I can't imagine that the bootstrap process is relevant for pg_stat_activity. I also noticed that the pg_stat_activity docs call background workers "parallel workers" (though it also mentions that extensions could have other background workers registered), but this seems a bit weird because pg_stat_activity uses GetBackendTypeDesc() and this prints "background worker" for type B_BG_WORKER. Background workers doing parallelism tasks is what users will most often see in pg_stat_activity, but I feel like it is confusing to have it documented as something different than what would appear in the view. Unless I am misunderstanding something... - Melanie
Attachment
On 03.04.2023 23:50, Melanie Plageman wrote: > Attached is a tiny patch to add standalone backend type to > pg_stat_activity documentation (referenced by pg_stat_io). > > I mentioned both the bootstrap process and single user mode process in > the docs, though I can't imagine that the bootstrap process is relevant > for pg_stat_activity. After a little thought... I'm not sure about the term 'bootstrap process'. I can't find this term in the documentation. Do I understand correctly that this is the postmaster? If so, then the postmaster process is not shown in pg_stat_activity. Perhaps it would be worth adding a description of the standalone backend to pg_stat_io, not to pg_stat_activity. Something like: backend_type is all types from pg_stat_activity plus 'standalone backend', which is used for the postmaster process and in single user mode. > I also noticed that the pg_stat_activity docs call background workers > "parallel workers" (though it also mentions that extensions could have > other background workers registered), but this seems a bit weird because > pg_stat_activity uses GetBackendTypeDesc() and this prints "background > worker" for type B_BG_WORKER. Background workers doing parallelism tasks > is what users will most often see in pg_stat_activity, but I feel like > it is confusing to have it documented as something different than what > would appear in the view. Unless I am misunderstanding something... 'parallel worker' appears in pg_stat_activity for parallel queries. I think it's right here. -- Pavel Luzanov Postgres Professional: https://postgrespro.com
On Tue, Apr 4, 2023 at 4:35 PM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote: > > On 03.04.2023 23:50, Melanie Plageman wrote: > > Attached is a tiny patch to add standalone backend type to > > pg_stat_activity documentation (referenced by pg_stat_io). > > > > I mentioned both the bootstrap process and single user mode process in > > the docs, though I can't imagine that the bootstrap process is relevant > > for pg_stat_activity. > > After a little thought... I'm not sure about the term 'bootstrap > process'. I can't find this term in the documentation. There are various mentions of "bootstrap" peppered throughout the docs but no concise summary of what it is. For example, initdb docs mention the "bootstrap backend" [1]. Interestingly, 910cab820d0 added "Bootstrap superuser" in November. This doesn't really cover what bootstrapping is itself, but I wonder if that is useful? If so, you could propose a glossary entry for it? (preferably in a new thread) > Do I understand correctly that this is a postmaster? If so, then the > postmaster process is not shown in pg_stat_activity. No, bootstrap process is for initializing the template database. You will not be able to see pg_stat_activity when it is running. > Perhaps it may be worth adding a description of the standalone backend > to pg_stat_io, not to pg_stat_activity. > Something like: backend_type is all types from pg_stat_activity plus > 'standalone backend', > which is used for the postmaster process and in a single user mode. You can query pg_stat_activity from single user mode, so it is relevant to pg_stat_activity also. I take your point that bootstrap mode isn't relevant for pg_stat_activity, but I am hesitant to add that distinction to the pg_stat_io docs since the reason you won't see it in pg_stat_activity is because it is ephemeral and before a user can access the database and not because stats are not tracked for it. Can you think of a way to convey this? 
> > I also noticed that the pg_stat_activity docs call background workers > > "parallel workers" (though it also mentions that extensions could have > > other background workers registered), but this seems a bit weird because > > pg_stat_activity uses GetBackendTypeDesc() and this prints "background > > worker" for type B_BG_WORKER. Background workers doing parallelism tasks > > is what users will most often see in pg_stat_activity, but I feel like > > it is confusing to have it documented as something different than what > > would appear in the view. Unless I am misunderstanding something... > > 'parallel worker' appears in the pg_stat_activity for parallel queries. > I think it's right here. Ah, I didn't read the code closely enough in pg_stat_get_activity(). Even though there is no BackendType for which GetBackendTypeDesc() returns "parallel worker", we go out of our way to be specific using GetBackgroundWorkerTypeByPid():

/* Add backend type */
if (beentry->st_backendType == B_BG_WORKER)
{
    const char *bgw_type;

    bgw_type = GetBackgroundWorkerTypeByPid(beentry->st_procpid);
    if (bgw_type)
        values[17] = CStringGetTextDatum(bgw_type);
    else
        nulls[17] = true;
}
else
    values[17] = CStringGetTextDatum(GetBackendTypeDesc(beentry->st_backendType));

- Melanie [1] https://www.postgresql.org/docs/current/app-initdb.html
On 05.04.2023 03:41, Melanie Plageman wrote: > On Tue, Apr 4, 2023 at 4:35 PM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote: > >> After a little thought... I'm not sure about the term 'bootstrap >> process'. I can't find this term in the documentation. > There are various mentions of "bootstrap" peppered throughout the docs > but no concise summary of what it is. For example, initdb docs mention > the "bootstrap backend" [1]. > > Interestingly, 910cab820d0 added "Bootstrap superuser" in November. This > doesn't really cover what bootstrapping is itself, but I wonder if that > is useful? If so, you could propose a glossary entry for it? > (preferably in a new thread) I'm not sure if this is the reason for adding a new entry in the glossary. >> Do I understand correctly that this is a postmaster? If so, then the >> postmaster process is not shown in pg_stat_activity. > No, bootstrap process is for initializing the template database. You > will not be able to see pg_stat_activity when it is running. Oh, it's clear to me now. Thank you for the explanation. > You can query pg_stat_activity from single user mode, so it is relevant > to pg_stat_activity also. I take your point that bootstrap mode isn't > relevant for pg_stat_activity, but I am hesitant to add that distinction > to the pg_stat_io docs since the reason you won't see it in > pg_stat_activity is because it is ephemeral and before a user can access > the database and not because stats are not tracked for it. > > Can you think of a way to convey this? See my attempt attached. I'm not sure about the wording. But I think we can avoid the term 'bootstrap process' by replacing it with "database cluster initialization", which should be clear to everyone. -- Pavel Luzanov Postgres Professional: https://postgrespro.com
On Mon, Apr 10, 2023 at 3:41 AM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
>
> On 05.04.2023 03:41, Melanie Plageman wrote:
> > On Tue, Apr 4, 2023 at 4:35 PM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
> >
> >> After a little thought... I'm not sure about the term 'bootstrap
> >> process'. I can't find this term in the documentation.
> > There are various mentions of "bootstrap" peppered throughout the docs
> > but no concise summary of what it is. For example, initdb docs mention
> > the "bootstrap backend" [1].
> >
> > Interestingly, 910cab820d0 added "Bootstrap superuser" in November. This
> > doesn't really cover what bootstrapping is itself, but I wonder if that
> > is useful? If so, you could propose a glossary entry for it?
> > (preferably in a new thread)
>
> I'm not sure if this is the reason for adding a new entry in the glossary.
>
> >> Do I understand correctly that this is a postmaster? If so, then the
> >> postmaster process is not shown in pg_stat_activity.
> > No, bootstrap process is for initializing the template database. You
> > will not be able to see pg_stat_activity when it is running.
>
> Oh, it's clear to me now. Thank you for the explanation.
>
> > You can query pg_stat_activity from single user mode, so it is relevant
> > to pg_stat_activity also. I take your point that bootstrap mode isn't
> > relevant for pg_stat_activity, but I am hesitant to add that distinction
> > to the pg_stat_io docs since the reason you won't see it in
> > pg_stat_activity is because it is ephemeral and before a user can access
> > the database and not because stats are not tracked for it.
> >
> > Can you think of a way to convey this?
>
> See my attempt attached.
> I'm not sure about the wording. But I think we can avoid the term
> 'bootstrap process'
> by replacing it with "database cluster initialization", which should be
> clear to everyone.

I like that idea.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f33a1c56c..45e20efbfb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -991,6 +991,9 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
      <literal>archiver</literal>, <literal>startup</literal>,
      <literal>walreceiver</literal>, <literal>walsender</literal> and
      <literal>walwriter</literal>.
+     The special type <literal>standalone backend</literal> is used

I think referring to it as a "special type" is a bit confusing. I think
you can just start the sentence with "standalone backend". You could
even include it in the main list of backend_types since it is possible
to see it in pg_stat_activity when in single user mode.

+     when initializing a database cluster by <xref linkend="app-initdb"/>
+     and when running in the <xref linkend="app-postgres-single-user"/>.
      In addition, background workers registered by extensions may have
      additional types.
      </para></entry>

I like the rest of this.

I copied the committer who most recently touched pg_stat_io (Michael
Paquier) to see if we could get someone interested in committing this
docs update.

- Melanie
On 24.04.2023 23:53, Melanie Plageman wrote:
> I copied the committer who most recently touched pg_stat_io (Michael
> Paquier) to see if we could get someone interested in committing this
> docs update.

Let me explain my motivation for suggesting this update.

pg_stat_io is a very impressive feature, so I decided to try it. I see
4 rows for some 'standalone backend' out of the 30 total rows in the
view. An attempt to find a description of 'standalone backend' in the
docs did not turn up anything. The pg_stat_io page references
pg_stat_activity for the backend types, but the pg_stat_activity page
doesn't say anything about 'standalone backend'. I think this question
will come up often unless it is clarified in the docs.

-- 
Pavel Luzanov
Postgres Professional: https://postgrespro.com
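For anyone who wants to reproduce what I saw, this is roughly the query
I used (a sketch; the column list assumes the pg_stat_io shape as
committed for PostgreSQL 16):

```sql
-- The 'standalone backend' rows in pg_stat_io. On a fresh cluster this
-- I/O comes from initdb's bootstrap/initialization phase, or from any
-- subsequent single-user-mode session.
SELECT backend_type, object, context, reads, writes, extends
FROM pg_stat_io
WHERE backend_type = 'standalone backend';
```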