Thread: shared memory stats ideas
Hi,

Shortly after shared memory stats went in I had a conversation with Lukas about what it'd enable us to do going forward. I also chatted with Peter about autovacuum related stats. I started to write an email, but then somehow lost the draft and couldn't bring myself to start from scratch.

Here's a largely unordered list of ideas. I'm not planning to work on them myself, but thought it'd nevertheless be useful to have them memorialized somewhere.

1) Track some statistics based on relfilenodes rather than oids

We currently track IO related statistics as part of the normal relation stats. The problem is that that prevents us from collecting stats whenever we operate on a relfilenode rather than a Relation. We e.g. currently can't track the number of blocks written out in a relation, because we don't have a Relation at that point. Nor can we really get hold of one, as the writeback can happen in a different database, without access to pg_class. Which is also the reason why the per-relation IO stats aren't populated by the startup process, even though it'd obviously sometimes be helpful to know where the most IO time is spent on a standby.

There are also quite a few contortions of the bufmgr interface related to this.

I think the solution to this is actually fairly simple: We split the IO related statistics out from the relation statistics and track them on a relfilenode basis instead. That'd allow us to track all the IO stats from all the places, rather than the partial job we do right now.

2) Split index and table statistics into different types of stats

We track both types of statistics in the same format and rename columns in views etc. to make them somewhat sensible. A number of the "columns" in index stats are currently unused.

If we split the stats for indexes and relations we can have reasonable names for the fields, shrink the current memory usage by halving the set of fields we keep for indexes, and extend the stats in a more targeted fashion. This would e.g. allow us to keep track of the number of index entries killed via the killtuples mechanism, which in turn would allow us to more intelligently decide whether we should vacuum indexes (often the most expensive part of vacuum). In a lot of workloads killtuples takes care of most of the cleanup, but in others it doesn't do much.

3) Maintain more historical statistics about vacuuming

We currently track the last time a table was vacuumed, the number of times it was vacuumed, and a bunch of counters for the number of tuples modified since the last vacuum. However, none of that allows the user to identify which relations are causing autovacuum to not keep up. Even just keeping track of the total time autovacuum has spent on certain relations would be a significant improvement, with further metrics easily imaginable (total IO [time], autovacuum delay time, xid age).

4) Make the stats mechanism extensible

Most of the work towards this has already been done, but a bit more work is necessary. The hardest part is likely how to identify stats belonging to an extension across restarts. There are a bunch of extensions with their own stats mechanisms, but it's hard to get this stuff right from the outside.

5) Use extensible shared memory stats to store pg_stat_statements data

pg_stat_statements' current mechanism has a few issues. The top ones I know of are:

- Contention on individual stats entries when the same queryid is executed concurrently. pgstats deals with this by allowing stats to be collected in backend-local memory and flushed into shared stats at a lower frequency.

- The querytext file can get huge (I've seen > 100GB) and cause massive slowdowns. It's better than the old fixed-length, fixed-shared-memory mechanism, don't get me wrong. But we can do better by storing the data in dynamic shared memory and then also supporting trimming based on the total size.

There were some other things, but I can't remember them right now.

Greetings,

Andres Freund
On Wed, Oct 19, 2022 at 11:19 AM Andres Freund <andres@anarazel.de> wrote:
> We e.g. currently can't track the number of blocks written out in a relation,
> because we don't have a Relation at that point. Nor can we really get hold
> of one, as the writeback can happen in a different database without access to
> pg_class. Which is also the reason why the per-relation IO stats aren't
> populated by the startup process, even though it'd obviously sometimes be
> helpful to know where the most IO time is spent on a standby.
>
> There are also quite a few contortions of the bufmgr interface related to
> this.

This seems related to the difficulty with distinguishing between internal pages and leaf pages (or some generalized AM-agnostic definition) in views like pg_statio_*_indexes.

Differentiating between leaf pages and internal pages would definitely be a big improvement, but it's kind of an awkward thing to implement [1] because you have to somehow invent the general concept of multiple distinct kinds of buffers/pages within a relation. A lot of code would need to be taught about that.

This work would be more likely to actually happen if it was tied to some bigger project that promised other benefits.

> 2) Split index and table statistics into different types of stats
>
> This would e.g. allow us to keep track of the number of index entries killed via
> the killtuples mechanism, which in turn would allow us to more intelligently
> decide whether we should vacuum indexes (often the most expensive part of
> vacuum). In a lot of workloads killtuples takes care of most of the cleanup,
> but in others it doesn't do much.

While I do agree that it would be nice to record information about the number of deletion operations per index, that information will still be tricky to interpret and act upon relative to other kinds of information. As a general rule, we should prefer to focus on signals that show things really aren't going well in some specific and unambiguous way. Signals about things that are going well seem harder to work with -- they don't generalize well.

What I really mean here is this: I think that page split stuff is going to be much more interesting than index deletion stuff. Index deletion exists to prevent page splits. So it's natural to ask questions about where that seems like it ought to have happened, but didn't actually happen. This likely requires bucketing page splits into different categories (since most individual page splits aren't like that at all). Then it becomes much easier to (say) compare indexes on the same table -- the user can follow a procedure that is likely to generalize well to many different kinds of situations.

It's not completely clear how the bucketization would work. We ought to remember how many page splits were caused by INSERT statements rather than non-HOT UPDATEs, though -- that much seems likely to be very useful and actionable. The DBA can probably consume this information in a low-context way by looking at the proportions of one kind of split to the other at the level of each index.

One type of split is mostly just a "cost of doing business" for B-Tree indexing. The other type really isn't.

> 3) Maintain more historical statistics about vacuuming
>
> However, none of that allows the user to identify which relations are causing
> autovacuum to not keep up. Even just keeping track of the total time
> autovacuum has spent on certain relations would be a significant improvement,
> with further metrics easily imaginable (total IO [time], autovacuum delay
> time, xid age).

With VACUUM in particular the picture over time can be far easier to work with than any given snapshot from any single VACUUM operation. Focusing on how things seem to be changing can make it a lot easier to spot concerning trends, especially if you're a non-expert.

I would also expect a similar focus on the picture over time to be useful with the indexing stuff, for roughly the same underlying reasons.
[1] https://postgr.es/m/CAA8Fd-pB=mr42YQuoaLPO_o2=XO9YJnjQ23CYJDFwC8SXGM8zg@mail.gmail.com

--
Peter Geoghegan
Hi,

On 10/19/22 8:19 PM, Andres Freund wrote:
> Hi,
>
> Here's a largely unordered list of ideas. I'm not planning to work on them
> myself, but thought it'd nevertheless be useful to have them memorialized
> somewhere.

Thanks for sharing this list of ideas!

> 2) Split index and table statistics into different types of stats
>
> We track both types of statistics in the same format and rename columns in
> views etc. to make them somewhat sensible. A number of the "columns" in index
> stats are currently unused.
>
> If we split the stats for indexes and relations we can have reasonable names
> for the fields, shrink the current memory usage by halving the set of fields
> we keep for indexes, and extend the stats in a more targeted fashion.

I started to work on this. I should be able to provide a patch attempt in the next couple of weeks.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Thanks for the nice list.

At Wed, 19 Oct 2022 12:37:30 -0700, Peter Geoghegan <pg@bowt.ie> wrote in
> On Wed, Oct 19, 2022 at 11:19 AM Andres Freund <andres@anarazel.de> wrote:
> > We e.g. currently can't track the number of blocks written out in a relation,
> > because we don't have a Relation at that point. Nor can we really get hold
> > of one, as the writeback can happen in a different database without access to
> > pg_class. Which is also the reason why the per-relation IO stats aren't
> > populated by the startup process, even though it'd obviously sometimes be
> > helpful to know where the most IO time is spent on a standby.
> >
> > There are also quite a few contortions of the bufmgr interface related to
> > this.
>
> This seems related to the difficulty with distinguishing between
> internal pages and leaf pages (or some generalized AM-agnostic
> definition) in views like pg_statio_*_indexes.
>
> Differentiating between leaf pages and internal pages would definitely
> be a big improvement, but it's kind of an awkward thing to implement
> [1] because you have to somehow invent the general concept of multiple
> distinct kinds of buffers/pages within a relation. A lot of code would
> need to be taught about that.
>
> This work would be more likely to actually happen if it was tied to
> some bigger project that promised other benefits.

Stickier buffers for index pages seem to be related. I haven't seen that even get started, though. But this might be an additional reason for starting it.

> > 2) Split index and table statistics into different types of stats
> >
> > This would e.g. allow us to keep track of the number of index entries killed via
> > the killtuples mechanism, which in turn would allow us to more intelligently
> > decide whether we should vacuum indexes (often the most expensive part of
> > vacuum). In a lot of workloads killtuples takes care of most of the cleanup,
> > but in others it doesn't do much.
>
> While I do agree that it would be nice to record information about the
> number of deletion operations per index, that information will still
> be tricky to interpret and act upon relative to other kinds of
> information. As a general rule, we should prefer to focus on signals
> that show things really aren't going well in some specific and
> unambiguous way. Signals about things that are going well seem harder
> to work with -- they don't generalize well.

I think some statistics can be for purely internal purposes. We can maintain some statistics hidden from users, if we want. (However, I suspect people will eventually request that the numbers be revealed.)

> What I really mean here is this: I think that page split stuff is
> going to be much more interesting than index deletion stuff. Index
> deletion exists to prevent page splits. So it's natural to ask
> questions about where that seems like it ought to have happened, but
> didn't actually happen. This likely requires bucketing page splits
> into different categories (since most individual page splits aren't
> like that at all). Then it becomes much easier to (say) compare
> indexes on the same table -- the user can follow a procedure that is
> likely to generalize well to many different kinds of situations.
>
> It's not completely clear how the bucketization would work. We ought
> to remember how many page splits were caused by INSERT statements
> rather than non-HOT UPDATEs, though -- that much seems likely to be
> very useful and actionable. The DBA can probably consume this
> information in a low-context way by looking at the proportions of one
> kind of split to the other at the level of each index.
>
> One type of split is mostly just a "cost of doing business" for B-Tree
> indexing. The other type really isn't.
>
> > 3) Maintain more historical statistics about vacuuming
> >
> > However, none of that allows the user to identify which relations are causing
> > autovacuum to not keep up. Even just keeping track of the total time
> > autovacuum has spent on certain relations would be a significant improvement,
> > with further metrics easily imaginable (total IO [time], autovacuum delay
> > time, xid age).
>
> With VACUUM in particular the picture over time can be far easier to
> work with than any given snapshot from any single VACUUM operation.
> Focusing on how things seem to be changing can make it a lot easier to
> spot concerning trends, especially if you're a non-expert.

Agreed. It seems like a kind of easy (low-hanging) one. I'll give it a try. There should be some other numbers for which time-series stats would be useful.

> I would also expect a similar focus on the picture over time to be
> useful with the indexing stuff, for roughly the same underlying
> reasons.
>
> [1] https://postgr.es/m/CAA8Fd-pB=mr42YQuoaLPO_o2=XO9YJnjQ23CYJDFwC8SXGM8zg@mail.gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 10/20/22 9:17 AM, Drouvot, Bertrand wrote:
> On 10/19/22 8:19 PM, Andres Freund wrote:
>>
>> 2) Split index and table statistics into different types of stats
>>
>> We track both types of statistics in the same format and rename columns in
>> views etc. to make them somewhat sensible. A number of the "columns" in
>> index stats are currently unused.
>>
>> If we split the stats for indexes and relations we can have reasonable
>> names for the fields, shrink the current memory usage by halving the set
>> of fields we keep for indexes, and extend the stats in a more targeted
>> fashion.
>
> I started to work on this.
> I should be able to provide a patch attempt in the next couple of weeks.

Patch submitted and CF entry created: https://commitfest.postgresql.org/40/3984/

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Oct 21, 2022 at 2:26 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> Stickier buffers for index pages seem to be related. I haven't seen
> that even get started, though. But this might be an additional
> reason for starting it.

Maybe, but FWIW I think that that will mostly just need to distinguish leaf pages from heap pages (and mostly ignore internal pages). Within each index, internal pages are typically no more than a fraction of 1% of all pages. There are already so few internal pages that it seems very likely that they're practically guaranteed to be cached already.

There is a huge asymmetry in how pages are naturally accessed, which justifies treating them as fundamentally different things. Separating leaf pages from internal pages for instrumentation purposes is valuable because it allows the DBA to completely *ignore* internal pages. Internal pages are accessed far, far more frequently than leaf pages. In effect, internal pages add "noise" to the instrumentation, obscuring the useful "signal" that the DBA should focus on (by considering leaf level hits and misses in isolation). So the value comes from "removing noise", not from "adding signal".

You only need about 1% of the memory required to cache a big index to get a "hit rate" of 75% (assuming you don't have a workload that's very scan heavy, which would be unusual). Obviously the standard naive definition of "index hit rate" isn't particularly useful.

> > While I do agree that it would be nice to record information about the
> > number of deletion operations per index, that information will still
> > be tricky to interpret and act upon relative to other kinds of
> > information. As a general rule, we should prefer to focus on signals
> > that show things really aren't going well in some specific and
> > unambiguous way. Signals about things that are going well seem harder
> > to work with -- they don't generalize well.
>
> I think some statistics can be for purely internal purposes. We can
> maintain some statistics hidden from users, if we want. (However, I
> suspect people will eventually request that the numbers be revealed.)

It will probably be easy to add information about index tuple deletions, with almost no downside, so of course we should do it. My point was just that it's probably not the single most informative thing that could be instrumented to help users understand index bloat. It's just much easier to understand what's not working than what is going well. It's a stronger and more informative signal.

> > With VACUUM in particular the picture over time can be far easier to
> > work with than any given snapshot from any single VACUUM operation.
> > Focusing on how things seem to be changing can make it a lot easier to
> > spot concerning trends, especially if you're a non-expert.
>
> Agreed. It seems like a kind of easy (low-hanging) one. I'll give it a
> try. There should be some other numbers for which time-series stats
> would be useful.

Great!

There probably is some way that VACUUM itself will ultimately use this information to decide what to do. For example, if we go too long without doing any index vacuuming, we might want to do it despite the fact that there are relatively few LP_DEAD items in heap pages. I don't think that we need to worry too much about how VACUUM itself might apply the same information for now, but it's something that you might want to consider.

--
Peter Geoghegan