Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?) - Mailing list pgsql-hackers
From: Melanie Plageman
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Msg-id: CAAKRu_atgf8pnWw4So5Sbi-YhD+Vff-Ac10VEjdvE3By-nM3uA@mail.gmail.com
In response to: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?) (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
           Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
List: pgsql-hackers
On Sun, Jan 26, 2020 at 11:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund <andres@anarazel.de> wrote in
> > On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> > > On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > > > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > Lastly, I don't understand what the point is of sending fixed-size stats,
> > > > > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > > > > > I don't like its architecture, we obviously need something like pgstat
> > > > > > to handle variable amounts of stats (database, table level etc.
> > > > > > stats). But that doesn't at all apply to these types of global stats.
> > > > >
> > > > > That part has annoyed me as well a few times. +1 for just moving that
> > > > > into global shared memory. Given that we don't really care about
> > > > > things being in sync between those different counters *or* if we lose
> > > > > a bit of data (which the stats collector is designed to do), we could
> > > > > even do that without a lock?
> > > >
> > > > I don't think we'd quite want to do it without any (single counter)
> > > > synchronization - high concurrency setups would be pretty likely to
> > > > lose values that way. I suspect the best would be to have a struct in
> > > > shared memory that contains the potential counters for each potential
> > > > process. And then sum them up when actually wanting the concrete
> > > > value. That way we avoid unnecessary contention, in contrast to having a
> > > > single shared memory value for each (which would just pingpong between
> > > > different sockets and store buffers). There's a few details like how
> > > > exactly to implement resetting the counters, but ...
> > >
> > > Right. Each process gets to do their own write, but still in shared
> > > memory. But do you need to lock them when reading them (for the
> > > summary)? That's the part where I figured you could just read and
> > > summarize them, and accept the possible loss.
> >
> > Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
> > integers can be read / written without a danger of torn values, and I
> > don't think we need perfect cross counter accuracy. To deal with the few
> > platforms without 64bit "single copy atomicity", we can just use
> > pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
> > fall back to using locked operations for those platforms. So I don't
> > think there's actually a danger of loss.
> >
> > Obviously we could also use atomic ops to increment the value, but I'd
> > rather not add all those atomic operations, even if it's on uncontended
> > cachelines. It'd allow us to reset the backend values more easily by
> > just swapping in a 0, which we can't do if the backend increments
> > non-atomically. But I think we could instead just have one global "bias"
> > value to implement resets (by subtracting that from the summarized
> > value, and storing the current sum when resetting). Or use the new
> > global barrier to trigger a reset. Or something similar.
>
> Fixed or global stats are suitable for the start of the shared-memory
> stats collector. In the case of buffers_*_write, the global stats
> entry for each process needs just 8 bytes plus maybe an extra 8 bytes for
> the bias value.
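To make the scheme described above a little more concrete, here is a rough
sketch of per-process counters that only their owning process writes and
that readers sum without locking. All of the names (the
PgBackendStatsCounters fields and the helper functions) are made up for
illustration and are not taken from the attached patch:

#include "postgres.h"
#include "port/atomics.h"

/*
 * Sketch only: names are illustrative.  Each process owns exactly one
 * slot and is its only writer, so increments need no lock; the atomic
 * read/write just guards against torn 64-bit values on platforms
 * without single-copy atomicity.
 */
typedef struct PgBackendStatsCounters
{
	pg_atomic_uint64 buffers_written;	/* shared buffers written out */
	pg_atomic_uint64 buffers_extended;	/* blocks added through shared buffers */
	pg_atomic_uint64 backend_fsyncs;	/* fsyncs done by the process itself */
} PgBackendStatsCounters;

/* called only by the owning process, so a plain read-then-write is enough */
static inline void
count_buffer_written(PgBackendStatsCounters *my_counters)
{
	uint64		cur = pg_atomic_read_u64(&my_counters->buffers_written);

	pg_atomic_write_u64(&my_counters->buffers_written, cur + 1);
}

/* reader side: sum all slots; slightly stale values are fine */
static uint64
sum_buffers_written(PgBackendStatsCounters *slots, int nslots)
{
	uint64		total = 0;

	for (int i = 0; i < nslots; i++)
		total += pg_atomic_read_u64(&slots[i].buffers_written);

	return total;
}

A reset could then use the "bias" idea Andres mentions; there is a sketch
of that further down.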
> I'm not sure how many counters like this there are,
> but is such a size of footprint acceptable? (Each backend already uses
> the same amount of local memory for pgstat use, though.)
>
> Anyway I will do something like that as a trial, maybe by adding a
> member in PgBackendStatus and one global shared value for the bias.
>
>       int64       st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
> +     PgBackendStatsCounters counters;
>   } PgBackendStatus;

So, I took a stab at implementing this in PgBackendStatus. The attached
patch is not quite on top of current master, so, alas, don't try to apply
it. I went to rebase today and realized I needed to make some changes in
light of e1025044cd4; however, I wanted to share this WIP so that I could
pose a few questions that I imagine will still be relevant after I rewrite
the patch.

I removed buffers_backend and buffers_backend_fsync from pg_stat_bgwriter
and have created a new view which tracks

- the number of shared buffers the checkpointer and bgwriter write out
- the number of shared buffers a regular backend is forced to flush
- the number of extends done by a regular backend through shared buffers
- the number of buffers flushed by a backend or autovacuum using a
  BufferAccessStrategy which, were they not to use this strategy, could
  perhaps have been avoided if a clean shared buffer was available
- the number of fsyncs done by a backend which could have been done by the
  checkpointer if the sync queue had not been full

This view currently only tracks writes and extends that go through shared
buffers and fsyncs of shared buffers (which, AFAIK, are the only things
fsync'd through the SyncRequest machinery currently). BufferAlloc() and
SyncOneBuffer() are the main points at which the tracking is done. I can
definitely expand this, but I want to make sure that we are tracking the
right kind of information.

num_backend_writes and num_backend_fsync were intended (though they were
not accurate) to count buffers that backends had to end up writing
themselves and fsyncs that backends had to end up doing themselves which
could have been avoided with a different configuration (or, I suppose, a
different workload/different data, etc). That is, they were meant to tell
you if the checkpointer and bgwriter were keeping up and/or if the size of
shared buffers was adequate.

In implementing this counting per backend, it is easy for all types of
backends to keep track of the number of writes, extends, fsyncs, and
strategy writes they are doing. So, as recommended upthread, I have added
columns in the view for the number of writes for the checkpointer,
bgwriter, and others. Thus, this view becomes more than just stats on
"avoidable I/O done by backends".

So, my question is: does it make sense to track all extends -- those to
extend the FSM and visibility map and those done when making a new relation
or index? Is that information useful? If so, is it different from the
extends done through shared buffers? Should it be tracked separately? Also,
if we care about all of the extends, then it seems a bit annoying to pepper
the counting all over the place when it really just needs to be done in
smgrextend() -- even though maybe a stats function doesn't belong in that
API (a sketch of that is below). Another question I have is: should the
number of extends be counted for every single block extended, or should we
try to track the initiation of a set of extends (all of those added in
RelationAddExtraBlocks(), in this case)?
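For what it's worth, if the counting were centralized, the hook could be as
small as the following. This is only a sketch of how smgrextend() in smgr.c
might look with counting added: pgstat_count_backend_extend() is a
hypothetical helper invented for this example, and the real function does
additional bookkeeping (the cached nblocks handling) that is omitted here.

void
smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
		   char *buffer, bool skipFsync)
{
	/*
	 * Counting here catches every block added to any fork (heap, index,
	 * FSM, visibility map) in one place, instead of sprinkling calls
	 * through bufmgr.c and its callers.
	 */
	pgstat_count_backend_extend();

	smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
										 buffer, skipFsync);
}

Whether that bump should happen once per block, or once per batch in
RelationAddExtraBlocks(), is exactly the question above.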
When it comes to fsync counting, I only count the fsyncs counted by the
previous code -- that is, fsyncs done by backends themselves when the
checkpointer sync request queue was full. I did the counting in the same
place in the checkpointer code -- in ForwardSyncRequest() -- partially
because there did not seem to be another good place to do it, since
register_dirty_segment() returns void. (I thought about having it return a
bool to indicate whether it fsync'd the segment itself or registered the
fsync, which seemed alright, but mdextend(), mdwrite(), etc. also return
void.) So there is no way to propagate back up to the bufmgr the
information that the process had to do its own fsync, which means I would
have to muck with the md.c API. And, since the checkpointer is the one
processing these sync requests anyway, it actually seems okay to do it in
the checkpointer code.

I'm not counting fsyncs that are "unavoidable" in the sense that they
couldn't be avoided by changing settings/workload etc. -- like those done
when building an index or when creating/rewriting/copying a table -- is it
useful to count these? It seems like counting the others makes the number
of "avoidable fsyncs by backends" less useful. Also, should we count how
many fsyncs the checkpointer has done (I have to check if there is already
a stat for that)? Is that useful in this context?

Of course, this view, when grown, will begin to overlap with pg_statio,
which is another consideration. What is its identity? I would find
"avoidable I/O" -- either avoidable entirely or avoidable for that
particular type of process -- to be useful. Or maybe it should have a more
expansive mandate. Maybe it would be useful to aggregate some of the info
from pg_stat_statements at a higher level -- for example, shared_blks_read
counted across many statements over a period of time, or in a context in
which we expected the relation to be in shared buffers, becomes potentially
interesting.

As for the way I have recorded strategy writes -- it is quite inelegant,
but I wanted to make sure that I only counted a strategy write as one in
which the backend wrote out the dirty buffer from its strategy ring but did
not check whether there was any clean buffer in shared buffers more
generally (so, it is *potentially* an avoidable write). I'm not sure if
this distinction is useful to anyone. I haven't done enough with
BufferAccessStrategies to know what I'd want to know about them when
developing or using Postgres. However, if I don't need to be so careful, it
will make the code much simpler (though I'm sure I can improve the code
regardless).

As for the implementation of the counters themselves, I appreciate that it
isn't very nice to have a bunch of random members in PgBackendStatus to
count all of these writes, extends, and fsyncs. I considered whether I
could add params that are used for all command types to st_progress_param,
but I haven't looked into it yet. Alternatively, I could create an array
just for this kind of stats in PgBackendStatus. Though, I imagine that I
should take a look at the changes that have been made recently to this area
and at the shared memory stats patch.

Oh, also, there should be a way to reset the stats, especially if we add
more extends and fsyncs that happen at the time of relation/index creation.
I, at least, would find it useful to see these numbers once the database is
at some kind of steady state.
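One way to handle the reset without atomics in the hot path is the "bias"
idea from upthread: at reset time, remember the current raw sums and
subtract them when reporting. Again a sketch only -- StatsResetBias and
these helpers are names made up for this example, and sum_buffers_written()
refers to the summation sketched earlier in this mail:

typedef struct StatsResetBias
{
	pg_atomic_uint64 buffers_written;	/* raw sum recorded at last reset */
} StatsResetBias;

/* reset: remember the current raw sum */
static void
reset_buffers_written(StatsResetBias *bias,
					  PgBackendStatsCounters *slots, int nslots)
{
	pg_atomic_write_u64(&bias->buffers_written,
						sum_buffers_written(slots, nslots));
}

/* value exposed in the view */
static uint64
report_buffers_written(StatsResetBias *bias,
					   PgBackendStatsCounters *slots, int nslots)
{
	return sum_buffers_written(slots, nslots) -
		pg_atomic_read_u64(&bias->buffers_written);
}

That does assume either that exiting backends fold their totals into
something persistent or that a little drift after a reset is acceptable;
the global-barrier approach Andres mentions would be an alternative.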
Oh, and src/test/regress/sql/stats.sql will fail; of course, I don't intend
to add that SELECT from the view to regress -- it was just for testing
purposes, to make sure the view was working.

--
Melanie