Thread: relfilenode statistics
Hi hackers, Please find attached a POC patch to implement $SUBJECT. Adding relfilenode statistics has been proposed in [1]. The idea is to allow tracking dirtied blocks, written blocks,... on a per relation basis. The attached patch is not in a fully "polished" state yet: there is more places we should add relfilenode counters, create more APIS to retrieve the relfilenode stats.... But I think that it is in a state that can be used to discuss the approach it is implementing (so that we can agree or not on it) before moving forward. The approach that is implemented in this patch is the following: - A new PGSTAT_KIND_RELFILENODE is added - A new attribute (aka relfile) has been added to PgStat_HashKey so that we can record (dboid, spcOid and relfile) to identify a relfilenode entry - pgstat_create_transactional() is used in RelationCreateStorage() - pgstat_drop_transactional() is used in RelationDropStorage() - RelationPreserveStorage() will remove the entry from the list of dropped stats The current approach to deal with table rewrite is to: - copy the relfilenode stats in table_relation_set_new_filelocator() from the relfilenode stats entry to the shared table stats entry - in the pg_statio_all_tables view: add the table stats entry (that contains "previous" relfilenode stats (due to the above) that were linked to this relation ) to the current relfilenode stats linked to the relation An example is done in the attached patch for the new heap_blks_written field in pg_statio_all_tables. Outcome is: " postgres=# create table bdt (a int); CREATE TABLE postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt'; heap_blks_written ------------------- 0 (1 row) postgres=# insert into bdt select generate_series(1,10000); INSERT 0 10000 postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt'; heap_blks_written ------------------- 0 (1 row) postgres=# checkpoint; CHECKPOINT postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt'; heap_blks_written ------------------- 45 (1 row) postgres=# truncate table bdt; TRUNCATE TABLE postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt'; heap_blks_written ------------------- 45 (1 row) postgres=# insert into bdt select generate_series(1,10000); INSERT 0 10000 postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt'; heap_blks_written ------------------- 45 (1 row) postgres=# checkpoint; CHECKPOINT postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt'; heap_blks_written ------------------- 90 (1 row) " Some remarks: - My first attempt has been to call the pgstat_create_transactional() and pgstat_drop_transactional() at the same places it is done for the relations but that did not work well (mainly due to corner cases in case of rewrite). - Please don't take care of the pgstat_count_buffer_read() and pgstat_count_buffer_hit() calls in pgstat_report_relfilenode_buffer_read() and pgstat_report_relfilenode_buffer_hit(). Those stats will follow the same flow as the one done and explained above for the new heap_blks_written one ( should we agree on it). Looking forward to your comments, feedback. Regards, [1]: https://www.postgresql.org/message-id/20231113204439.r4lmys73tessqmak%40awork3.anarazel.de -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
Hi Bertrand, It would be helpful to me if the reasons why we're splitting out relfilenodestats could be more clearly spelled out. I see Andres's comment in the thread to which you linked, but it's pretty vague about why we should do this ("it's not nice") and whether we should do this ("I wonder if this is an argument for") and maybe that's all fine if Andres is going to be the one to review and commit this, but even if then it would be nice if the rest of us could follow along from home, and right now I can't. The commit message is often a good place to spell this kind of thing out, because then it's included with every version of the patch you post, and may be of some use to the eventual committer in writing their commit message. The body of the email where you post the patch set can be fine, too. ...Robert
Hi Robert, On Mon, May 27, 2024 at 09:10:13AM -0400, Robert Haas wrote: > Hi Bertrand, > > It would be helpful to me if the reasons why we're splitting out > relfilenodestats could be more clearly spelled out. I see Andres's > comment in the thread to which you linked, but it's pretty vague about > why we should do this ("it's not nice") and whether we should do this > ("I wonder if this is an argument for") and maybe that's all fine if > Andres is going to be the one to review and commit this, but even if > then it would be nice if the rest of us could follow along from home, > and right now I can't. Thanks for the feedback! You’re completely right, my previous message is missing clear explanation as to why I think that relfilenode stats could be useful. Let me try to fix this. The main argument is that we currently don’t have writes counters for relations. The reason is that we don’t have the relation OID when writing buffers out. Tracking writes per relfilenode would allow us to track/consolidate writes per relation (example in the v1 patch and in the message up-thread). I think that adding instrumentation in this area (writes counters) could be beneficial (like it is for the ones we currently have for reads). Second argument is that this is also beneficial for the "Split index and table statistics into different types of stats" thread (mentioned in the previous message). It would allow us to avoid additional branches in some situations (like the one mentioned by Andres in the link I provided up-thread). If we agree that the main argument makes sense to think about having relfilenode stats then I think using them as proposed in the second argument makes sense too: We’d move the current buffer read and buffer hit counters from the relation stats to the relfilenode stats (while still being able to retrieve them from the pg_statio_all_tables/indexes views: see the example for the new heap_blks_written stat added in the patch). Generally speaking, I think that tracking counters at a common level (i.e relfilenode level instead of table or index level) is beneficial (avoid storing/allocating space for the same counters in multiple structs) and sounds more intuitive to me. Also I think this is open door for new ideas: for example, with relfilenode statistics in place, we could probably also start thinking about tracking checksum errors per relfllenode. > The commit message is often a good place to spell this kind of thing > out, because then it's included with every version of the patch you > post, and may be of some use to the eventual committer in writing > their commit message. The body of the email where you post the patch > set can be fine, too. > Yeah, I’ll update the commit message in V2 with better explanations once I get feedback on V1 (should we decide to move on with the relfilenode stats idea). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jun 3, 2024 at 7:11 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > The main argument is that we currently don’t have writes counters for relations. > The reason is that we don’t have the relation OID when writing buffers out. OK. > Second argument is that this is also beneficial for the "Split index and > table statistics into different types of stats" thread (mentioned in the previous > message). It would allow us to avoid additional branches in some situations (like > the one mentioned by Andres in the link I provided up-thread). OK. > We’d move the current buffer read and buffer hit counters from the relation stats > to the relfilenode stats (while still being able to retrieve them from the > pg_statio_all_tables/indexes views: see the example for the new heap_blks_written > stat added in the patch). Generally speaking, I think that tracking counters at > a common level (i.e relfilenode level instead of table or index level) is > beneficial (avoid storing/allocating space for the same counters in multiple > structs) and sounds more intuitive to me. Hmm. So if I CLUSTER or VACUUM FULL the relation, the relfilenode changes. Does that mean I lose all of those stats? Is that a problem? Or is it good? Or what? I also thought about the other direction. Suppose I drop the a relation and create a new one that gets a different relation OID but the same relfilenode. But I don't think that's a problem: dropping the relation should forcibly remove the old stats, so there won't be any conflict in this case. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 04, 2024 at 09:26:27AM -0400, Robert Haas wrote: > On Mon, Jun 3, 2024 at 7:11 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > We’d move the current buffer read and buffer hit counters from the relation stats > > to the relfilenode stats (while still being able to retrieve them from the > > pg_statio_all_tables/indexes views: see the example for the new heap_blks_written > > stat added in the patch). Generally speaking, I think that tracking counters at > > a common level (i.e relfilenode level instead of table or index level) is > > beneficial (avoid storing/allocating space for the same counters in multiple > > structs) and sounds more intuitive to me. > > Hmm. So if I CLUSTER or VACUUM FULL the relation, the relfilenode > changes. Does that mean I lose all of those stats? Is that a problem? > Or is it good? Or what? I think we should keep the stats in the relation during relfilenode changes. As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can see in the example provided up-thread the new heap_blks_written statistic has been preserved during the TRUNCATE. Please note that the v1 POC only takes care of the new heap_blks_written stat and that the logic used in table_relation_set_new_filelocator() would probably need to be applied in rebuild_relation() or such to deal with CLUSTER or VACUUM FULL. For the relation, the new counter "blocks_written" has been added to the PgStat_StatTabEntry struct (it's not needed in the PgStat_TableCounts one as the relfilenode stat takes care of it). It's added in PgStat_StatTabEntry only to copy/preserve the relfilenode stats during rewrite operations and to retrieve the stats in pg_statio_all_tables. Then, if later we split the relation stats to index/table stats, we'd have blocks_written defined in less structs (as compare to doing the split without relfilenode stat in place). As mentioned up-thread, the new logic has been implemented in v1 only for the new blocks_written stat (we'd need to do the same for the existing buffer read / buffer hit if we agree on the approach implemented in v1). > I also thought about the other direction. Suppose I drop the a > relation and create a new one that gets a different relation OID but > the same relfilenode. But I don't think that's a problem: dropping the > relation should forcibly remove the old stats, so there won't be any > conflict in this case. Yeah. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Jun 03, 2024 at 11:11:46AM +0000, Bertrand Drouvot wrote: > Yeah, I’ll update the commit message in V2 with better explanations once I get > feedback on V1 (should we decide to move on with the relfilenode stats idea). > Please find attached v2, mandatory rebase due to cd312adc56. In passing it provides a more detailed commit message (also making clear that the goal of this patch is to start the discussion and agree on the design before moving forward.) Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > I think we should keep the stats in the relation during relfilenode changes. > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can > see in the example provided up-thread the new heap_blks_written statistic has > been preserved during the TRUNCATE. Yeah, I think there's something weird about this design. Somehow we're ending up with both per-relation and per-relfilenode counters: + pg_stat_get_blocks_written(C.oid) + pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END, C.relfilenode) AS heap_blks_written, I'll defer to Andres if he thinks that's awesome, but to me it does not seem right to track some blocks written in a per-relation counter and others in a per-relfilenode counter. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2024-06-06 12:27:49 -0400, Robert Haas wrote: > On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > I think we should keep the stats in the relation during relfilenode changes. > > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in > > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can > > see in the example provided up-thread the new heap_blks_written statistic has > > been preserved during the TRUNCATE. > > Yeah, I think there's something weird about this design. Somehow we're > ending up with both per-relation and per-relfilenode counters: > > + pg_stat_get_blocks_written(C.oid) + > pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN > C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END, > C.relfilenode) AS heap_blks_written, > > I'll defer to Andres if he thinks that's awesome, but to me it does > not seem right to track some blocks written in a per-relation counter > and others in a per-relfilenode counter. It doesn't immediately sound awesome. Nor really necessary? If we just want to keep prior stats upon arelation rewrite, we can just copy the stats from the old relfilenode. Or we can decide that those stats don't really make sense anymore, and start from scratch. I *guess* I could see an occasional benefit in having both counter for "prior relfilenodes" and "current relfilenode" - except that stats get reset manually and upon crash anyway, making this less useful than if it were really "lifetime" stats. Greetings, Andres Freund
Hi, On 2024-06-03 11:11:46 +0000, Bertrand Drouvot wrote: > The main argument is that we currently don’t have writes counters for relations. > The reason is that we don’t have the relation OID when writing buffers out. > Tracking writes per relfilenode would allow us to track/consolidate writes per > relation (example in the v1 patch and in the message up-thread). > > I think that adding instrumentation in this area (writes counters) could be > beneficial (like it is for the ones we currently have for reads). > > Second argument is that this is also beneficial for the "Split index and > table statistics into different types of stats" thread (mentioned in the previous > message). It would allow us to avoid additional branches in some situations (like > the one mentioned by Andres in the link I provided up-thread). I think there's another *very* significant benefit: Right now physical replication doesn't populate statistics fields like n_dead_tup, which can be a huge issue after failovers, because there's little information about what autovacuum needs to do. Auto-analyze *partially* can fix it at times, if it's lucky enough to see enough dead tuples - but that's not a given and even if it works, is often wildly inaccurate. Once we put things like n_dead_tup into per-relfilenode stats, we can populate them during WAL replay. Thus after a promotion autovacuum has much better data. This also is important when we crash: We've been talking about storing a snapshot of the stats alongside each REDO pointer. Combined with updating stats during crash recovery, we'll have accurate dead-tuple stats once recovey has finished. Greetings, Andres Freund
Hi, On Thu, Jun 06, 2024 at 08:38:06PM -0700, Andres Freund wrote: > Hi, > > On 2024-06-03 11:11:46 +0000, Bertrand Drouvot wrote: > > The main argument is that we currently don’t have writes counters for relations. > > The reason is that we don’t have the relation OID when writing buffers out. > > Tracking writes per relfilenode would allow us to track/consolidate writes per > > relation (example in the v1 patch and in the message up-thread). > > > > I think that adding instrumentation in this area (writes counters) could be > > beneficial (like it is for the ones we currently have for reads). > > > > Second argument is that this is also beneficial for the "Split index and > > table statistics into different types of stats" thread (mentioned in the previous > > message). It would allow us to avoid additional branches in some situations (like > > the one mentioned by Andres in the link I provided up-thread). > > I think there's another *very* significant benefit: > > Right now physical replication doesn't populate statistics fields like > n_dead_tup, which can be a huge issue after failovers, because there's little > information about what autovacuum needs to do. > > Auto-analyze *partially* can fix it at times, if it's lucky enough to see > enough dead tuples - but that's not a given and even if it works, is often > wildly inaccurate. > > > Once we put things like n_dead_tup into per-relfilenode stats, Hm - I had in mind to populate relfilenode stats only with stats that are somehow related to I/O activities. Which ones do you have in mind to put in relfilenode stats? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Thu, Jun 06, 2024 at 08:17:36PM -0700, Andres Freund wrote: > Hi, > > On 2024-06-06 12:27:49 -0400, Robert Haas wrote: > > On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > I think we should keep the stats in the relation during relfilenode changes. > > > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in > > > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can > > > see in the example provided up-thread the new heap_blks_written statistic has > > > been preserved during the TRUNCATE. > > > > Yeah, I think there's something weird about this design. Somehow we're > > ending up with both per-relation and per-relfilenode counters: > > > > + pg_stat_get_blocks_written(C.oid) + > > pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN > > C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END, > > C.relfilenode) AS heap_blks_written, > > > > I'll defer to Andres if he thinks that's awesome, but to me it does > > not seem right to track some blocks written in a per-relation counter > > and others in a per-relfilenode counter. > > It doesn't immediately sound awesome. Nor really necessary? > > If we just want to keep prior stats upon arelation rewrite, we can just copy > the stats from the old relfilenode. Agree, that's another option. But I think that would be in another field like "cumulative_XXX" to ensure one could still retrieve stats that are "dedicated" to this particular "new" relfilenode. Thoughts? > Or we can decide that those stats don't > really make sense anymore, and start from scratch. > > > I *guess* I could see an occasional benefit in having both counter for "prior > relfilenodes" and "current relfilenode" - except that stats get reset manually > and upon crash anyway, making this less useful than if it were really > "lifetime" stats. Right but currently they are not lost during a relation rewrite. If we decide to not keep the relfilenode stats during a rewrite then things like heap_blks_read would stop surviving a rewrite (if we move it to relfilenode stats) while it currently does. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Jun 6, 2024 at 11:17 PM Andres Freund <andres@anarazel.de> wrote: > If we just want to keep prior stats upon arelation rewrite, we can just copy > the stats from the old relfilenode. Or we can decide that those stats don't > really make sense anymore, and start from scratch. I think we need to think carefully about what we want the user experience to be here. "Per-relfilenode stats" could mean "sometimes I don't know the relation OID so I want to use the relfilenumber instead, without changing the user experience" or it could mean "some of these stats actually properly pertain to the relfilenode rather than the relation so I want to associate them with the right object and that will affect how the user sees things." We need to decide which it is. If it's the former, then we need to examine whether the goal of hiding the distinction between relfilenode stats and relation stats from the user is in fact feasible. If it's the latter, then we need to make sure the whole patch reflects that design, which would include e.g. NOT copying stats from the old to the new relfilenode, and which would also include documenting the behavior in a way that will be understandable to users. In my experience, the worst thing you can do in cases like this is be somewhere in the middle. Then you tend to end up with stuff like: the difference isn't supposed to be something that the user knows or cares about, except that they do have to know and care because you haven't thoroughly covered up the deception, and often they have to reverse engineer the behavior because you didn't document what was really happening because you imagined that they wouldn't notice. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote: > On Thu, Jun 6, 2024 at 11:17 PM Andres Freund <andres@anarazel.de> wrote: > > If we just want to keep prior stats upon arelation rewrite, we can just copy > > the stats from the old relfilenode. Or we can decide that those stats don't > > really make sense anymore, and start from scratch. > > I think we need to think carefully about what we want the user > experience to be here. "Per-relfilenode stats" could mean "sometimes I > don't know the relation OID so I want to use the relfilenumber > instead, without changing the user experience" or it could mean "some > of these stats actually properly pertain to the relfilenode rather > than the relation so I want to associate them with the right object > and that will affect how the user sees things." We need to decide > which it is. If it's the former, then we need to examine whether the > goal of hiding the distinction between relfilenode stats and relation > stats from the user is in fact feasible. If it's the latter, then we > need to make sure the whole patch reflects that design, which would > include e.g. NOT copying stats from the old to the new relfilenode, > and which would also include documenting the behavior in a way that > will be understandable to users. Thanks for sharing your thoughts! Let's take the current heap_blks_read as an example: it currently survives a relation rewrite and I guess we don't want to change the existing user experience for it. Now say we want to add "heap_blks_written" (like in this POC patch) then I think that it makes sense for the user to 1) query this new stat from the same place as the existing heap_blks_read: from pg_statio_all_tables and 2) to have the same experience as far the relation rewrite is concerned (keep the previous stats). To achieve the rewrite behavior we could: 1) copy the stats from the OLD relfilenode to the relation (like in the POC patch) 2) copy the stats from the OLD relfilenode to the NEW one (could be in a dedicated field) The PROS of 1) is that the behavior is consistent with the current heap_blks_read and that the user could still see the current relfilenode stats (through a new API) if he wants to. > In my experience, the worst thing you can do in cases like this is be > somewhere in the middle. Then you tend to end up with stuff like: the > difference isn't supposed to be something that the user knows or cares > about, except that they do have to know and care because you haven't > thoroughly covered up the deception, and often they have to reverse > engineer the behavior because you didn't document what was really > happening because you imagined that they wouldn't notice. My idea was to move all that is in pg_statio_all_tables to relfilenode stats and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) ensure the user can still retrieve the stats from pg_statio_all_tables in such a way that it survives a rewrite, 3) provide dedicated APIs to retrieve relfilenode stats but only for the current relfilenode, 4) document this behavior. This is what the POC patch is doing for heap_blks_written (would need to do the same for heap_blks_read and friends) except for the documentation part. What do you think? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
At Mon, 10 Jun 2024 08:09:56 +0000, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote in > Hi, > > On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote: > > On Thu, Jun 6, 2024 at 11:17 PM Andres Freund <andres@anarazel.de> wrote: > > > If we just want to keep prior stats upon arelation rewrite, we can just copy > > > the stats from the old relfilenode. Or we can decide that those stats don't > > > really make sense anymore, and start from scratch. > > > > I think we need to think carefully about what we want the user > > experience to be here. "Per-relfilenode stats" could mean "sometimes I > > don't know the relation OID so I want to use the relfilenumber > > instead, without changing the user experience" or it could mean "some > > of these stats actually properly pertain to the relfilenode rather > > than the relation so I want to associate them with the right object > > and that will affect how the user sees things." We need to decide > > which it is. If it's the former, then we need to examine whether the > > goal of hiding the distinction between relfilenode stats and relation > > stats from the user is in fact feasible. If it's the latter, then we > > need to make sure the whole patch reflects that design, which would > > include e.g. NOT copying stats from the old to the new relfilenode, > > and which would also include documenting the behavior in a way that > > will be understandable to users. > > Thanks for sharing your thoughts! > > Let's take the current heap_blks_read as an example: it currently survives > a relation rewrite and I guess we don't want to change the existing user > experience for it. > > Now say we want to add "heap_blks_written" (like in this POC patch) then I think > that it makes sense for the user to 1) query this new stat from the same place > as the existing heap_blks_read: from pg_statio_all_tables and 2) to have the same > experience as far the relation rewrite is concerned (keep the previous stats). > > To achieve the rewrite behavior we could: > > 1) copy the stats from the OLD relfilenode to the relation (like in the POC patch) > 2) copy the stats from the OLD relfilenode to the NEW one (could be in a dedicated > field) > > The PROS of 1) is that the behavior is consistent with the current heap_blks_read > and that the user could still see the current relfilenode stats (through a new API) > if he wants to. > > > In my experience, the worst thing you can do in cases like this is be > > somewhere in the middle. Then you tend to end up with stuff like: the > > difference isn't supposed to be something that the user knows or cares > > about, except that they do have to know and care because you haven't > > thoroughly covered up the deception, and often they have to reverse > > engineer the behavior because you didn't document what was really > > happening because you imagined that they wouldn't notice. > > My idea was to move all that is in pg_statio_all_tables to relfilenode stats > and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) ensure > the user can still retrieve the stats from pg_statio_all_tables in such a way > that it survives a rewrite, 3) provide dedicated APIs to retrieve > relfilenode stats but only for the current relfilenode, 4) document this > behavior. This is what the POC patch is doing for heap_blks_written (would > need to do the same for heap_blks_read and friends) except for the documentation > part. What do you think? In my opinion, it is certainly strange that bufmgr is aware of relation kinds, but introducing relfilenode stats to avoid this skew doesn't seem to be the best way, as it invites inconclusive arguments like the one raised above. The fact that we transfer counters from old relfilenodes to new ones indicates that we are not really interested in counts by relfilenode. If that's the case, wouldn't it be simpler to call pgstat_count_relation_buffer_read() from bufmgr.c and then branch according to relkind within that function? If you're concerned about the additional branch, some ingenuity may be needed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, On Tue, Jun 11, 2024 at 03:35:23PM +0900, Kyotaro Horiguchi wrote: > At Mon, 10 Jun 2024 08:09:56 +0000, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote in > > > > My idea was to move all that is in pg_statio_all_tables to relfilenode stats > > and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) ensure > > the user can still retrieve the stats from pg_statio_all_tables in such a way > > that it survives a rewrite, 3) provide dedicated APIs to retrieve > > relfilenode stats but only for the current relfilenode, 4) document this > > behavior. This is what the POC patch is doing for heap_blks_written (would > > need to do the same for heap_blks_read and friends) except for the documentation > > part. What do you think? > > In my opinion, Thanks for looking at it! > it is certainly strange that bufmgr is aware of > relation kinds, but introducing relfilenode stats to avoid this skew > doesn't seem to be the best way, as it invites inconclusive arguments > like the one raised above. The fact that we transfer counters from old > relfilenodes to new ones indicates that we are not really interested > in counts by relfilenode. If that's the case, wouldn't it be simpler > to call pgstat_count_relation_buffer_read() from bufmgr.c and then > branch according to relkind within that function? If you're concerned > about the additional branch, some ingenuity may be needed. That may be doable for "read" activities but what about write activities? Do you mean not relying on relfilenode stats for reads but relying on relfilenode stats for writes? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Sat, May 25, 2024 at 07:52:02AM +0000, Bertrand Drouvot wrote: > But I think that it is in a state that can be used to discuss the approach it > is implementing (so that we can agree or not on it) before moving > forward. I have read through the patch to get an idea of how things are done, and I am troubled by the approach taken (mentioned down by you), but that's invasive compared to how pgstats wants to be transparent with its stats kinds. + Oid objoid; /* object ID, either table or function or tablespace. */ + RelFileNumber relfile; /* relfilenumber for RelFileLocator. */ } PgStat_HashKey; This adds a relfilenode component to the central hash key used for the dshash of pgstats, which is something most stats types don't care about. That looks like the incorrect thing to do to me, particularly seeing a couple of lines down that a stats kind is assigned so the HashKey uniqueness is ensured by the KindInfo: + [PGSTAT_KIND_RELFILENODE] = { + .name = "relfilenode", FWIW, I have on my stack of patches something to switch the objoid to 8 bytes, actually, which is something that would be required for pg_stat_statements as query IDs are wider than that and affect all databases, FWIW. Relfilenodes are 4 bytes, okay still Robert has proposed a couple of years ago a patch set to bump that to 56 bits, change reverted in a448e49bcbe4. The objoid is also not something specific to OIDs, see replication slots with their idx for example. What you would be looking instead is to use the relfilenode as an objoid and keep track of the OID of the original relation in each PgStat_StatRelFileNodeEntry so as it is possible to know where a past relfilenode was used? That makes looking back at the past relation's elfilenodes stats more complicated as it would be necessary to keep a list of the past relfilenodes for a relation, as well. Perhaps with some kind of cache that maintains a mapping between the relation and its relfilenode history? -- Michael
Attachment
Hi, On Wed, Jul 10, 2024 at 03:02:34PM +0900, Michael Paquier wrote: > On Sat, May 25, 2024 at 07:52:02AM +0000, Bertrand Drouvot wrote: > > But I think that it is in a state that can be used to discuss the approach it > > is implementing (so that we can agree or not on it) before moving > > forward. > > I have read through the patch to get an idea of how things are done, Thanks! > and I am troubled by the approach taken (mentioned down by you), but > that's invasive compared to how pgstats wants to be transparent with > its stats kinds. > > + Oid objoid; /* object ID, either table or function > or tablespace. */ > + RelFileNumber relfile; /* relfilenumber for RelFileLocator. */ > } PgStat_HashKey; > > This adds a relfilenode component to the central hash key used for the > dshash of pgstats, which is something most stats types don't care > about. That's right but that's an existing behavior without the patch as: PGSTAT_KIND_DATABASE does not care care about the objoid PGSTAT_KIND_REPLSLOT does not care care about the dboid PGSTAT_KIND_SUBSCRIPTION does not care care about the dboid That's 3 kinds out of the 5 non fixed stats kind. Not saying it's good, just saying that's an existing behavior. > That looks like the incorrect thing to do to me, particularly > seeing a couple of lines down that a stats kind is assigned so the > HashKey uniqueness is ensured by the KindInfo: > + [PGSTAT_KIND_RELFILENODE] = { > + .name = "relfilenode", You mean, just rely on kind, dboid and relfile to ensure uniqueness? I'm not sure that would work as there is this comment in relfilelocator.h: " * Notice that relNumber is only unique within a database in a particular * tablespace. " So, I think it makes sense to link the hashkey to all the RelFileLocator fields, means: dboid (linked to RelFileLocator's dbOid) objoid (linked to RelFileLocator's spcOid) relfile (linked to RelFileLocator's relNumber) > FWIW, I have on my stack of patches something to switch the objoid to > 8 bytes, actually, which is something that would be required for > pg_stat_statements as query IDs are wider than that and affect all > databases, FWIW. Relfilenodes are 4 bytes, okay still Robert has > proposed a couple of years ago a patch set to bump that to 56 bits, > change reverted in a448e49bcbe4. Right, but it really looks like this extra field is needed to ensure uniqueness (see above). > What you would be looking instead is to use the relfilenode as an > objoid Not sure that works, as it looks like uniqueness won't be ensured (see above). > and keep track of the OID of the original relation in each > PgStat_StatRelFileNodeEntry so as it is possible to know where a past > relfilenode was used? That makes looking back at the past relation's > elfilenodes stats more complicated as it would be necessary to keep a > list of the past relfilenodes for a relation, as well. Perhaps with > some kind of cache that maintains a mapping between the relation and > its relfilenode history? Yeah, I also thought about keeping a list of "previous" relfilenodes stats for a relation but that would lead to: 1. Keep previous relfilnode stats 2. A more complicated way to look at relation stats (as you said) 3. Extra memory usage I think the only reason "previous" relfilenode stats are needed is to provide accurate stats for the relation. Outside of this need, I don't think we would want to retrieve "individual" previous relfilenode stats in the past. That's why the POC patch "simply" copies the stats to the relation during a rewrite (before getting rid of the "previous" relfilenode stats). What do you think? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Jul 10, 2024 at 01:38:06PM +0000, Bertrand Drouvot wrote: > On Wed, Jul 10, 2024 at 03:02:34PM +0900, Michael Paquier wrote: >> and I am troubled by the approach taken (mentioned down by you), but >> that's invasive compared to how pgstats wants to be transparent with >> its stats kinds. >> >> + Oid objoid; /* object ID, either table or function >> or tablespace. */ >> + RelFileNumber relfile; /* relfilenumber for RelFileLocator. */ >> } PgStat_HashKey; >> >> This adds a relfilenode component to the central hash key used for the >> dshash of pgstats, which is something most stats types don't care >> about. > > That's right but that's an existing behavior without the patch as: > > PGSTAT_KIND_DATABASE does not care care about the objoid > PGSTAT_KIND_REPLSLOT does not care care about the dboid > PGSTAT_KIND_SUBSCRIPTION does not care care about the dboid > > That's 3 kinds out of the 5 non fixed stats kind. I'd like to think that this is just going to increase across time. >> That looks like the incorrect thing to do to me, particularly >> seeing a couple of lines down that a stats kind is assigned so the >> HashKey uniqueness is ensured by the KindInfo: >> + [PGSTAT_KIND_RELFILENODE] = { >> + .name = "relfilenode", > > You mean, just rely on kind, dboid and relfile to ensure uniqueness? Or table OID for the objid, with a hardcoded number of past relfilenodes stats stored, to limit bloating the dshash with too much past stats. See below. > So, I think it makes sense to link the hashkey to all the RelFileLocator > fields, means: > > dboid (linked to RelFileLocator's dbOid) > objoid (linked to RelFileLocator's spcOid) > relfile (linked to RelFileLocator's relNumber) Hmm. How about using the table OID as objoid, but store in the stats of the new KindInfo an array of entries with the relfilenodes (current and past, perhaps with more data than the relfilenode to ensure the uniqueness tracking) and each of its stats? The number of past relfilenodes would be fixed, meaning that there would be a strict control with the retention of the past stats. When a table is dropped, removing its relfilenode stats would be as cheap as when its PGSTAT_KIND_RELATION is dropped. > Yeah, I also thought about keeping a list of "previous" relfilenodes stats for a > relation but that would lead to: > > 1. Keep previous relfilnode stats > 2. A more complicated way to look at relation stats (as you said) > 3. Extra memory usage > > I think the only reason "previous" relfilenode stats are needed is to provide > accurate stats for the relation. Outside of this need, I don't think we would > want to retrieve "individual" previous relfilenode stats in the past. > > That's why the POC patch "simply" copies the stats to the relation during a > rewrite (before getting rid of the "previous" relfilenode stats). Hmm. Okay. -- Michael
Attachment
Hi, On Thu, Jul 11, 2024 at 01:58:19PM +0900, Michael Paquier wrote: > On Wed, Jul 10, 2024 at 01:38:06PM +0000, Bertrand Drouvot wrote: > > So, I think it makes sense to link the hashkey to all the RelFileLocator > > fields, means: > > > > dboid (linked to RelFileLocator's dbOid) > > objoid (linked to RelFileLocator's spcOid) > > relfile (linked to RelFileLocator's relNumber) > > Hmm. How about using the table OID as objoid, The issue is that we don't have the relation OID when writing buffers out (that's one of the reason explained in [1]). [1]: https://www.postgresql.org/message-id/Zl2k8u4HDTUW6QlC%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Thu, Jul 11, 2024 at 06:10:23AM +0000, Bertrand Drouvot wrote: > Hi, > > On Thu, Jul 11, 2024 at 01:58:19PM +0900, Michael Paquier wrote: > > On Wed, Jul 10, 2024 at 01:38:06PM +0000, Bertrand Drouvot wrote: > > > So, I think it makes sense to link the hashkey to all the RelFileLocator > > > fields, means: > > > > > > dboid (linked to RelFileLocator's dbOid) > > > objoid (linked to RelFileLocator's spcOid) > > > relfile (linked to RelFileLocator's relNumber) > > > > Hmm. How about using the table OID as objoid, > > The issue is that we don't have the relation OID when writing buffers out (that's > one of the reason explained in [1]). > > [1]: https://www.postgresql.org/message-id/Zl2k8u4HDTUW6QlC%40ip-10-97-1-34.eu-west-3.compute.internal > > Regards, > Please find attached a mandatory rebase due to the recent changes around statistics. As mentioned up-thread: The attached patch is not in a fully "polished" state yet: there is more places we should add relfilenode counters, create more APIS to retrieve the relfilenode stats.... It is in a state that can be used to discuss the approach it is implementing (as we have done so far) before moving forward. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
Hi, On Tue, Sep 10, 2024 at 05:30:32AM +0000, Bertrand Drouvot wrote: > Hi, > > On Thu, Sep 05, 2024 at 04:48:36AM +0000, Bertrand Drouvot wrote: > > Please find attached a mandatory rebase. > > > > In passing, checking if based on the previous discussion (and given that we > > don't have the relation OID when writing buffers out) you see another approach > > that the one this patch is implementing? > > Attached v5, mandatory rebase due to recent changes in the stats area. Attached v6, mandatory rebase due to b14e9ce7d5. Note that 0001 is the same as the one proposed in [0] and needs to be applied here to make the stats machinery working as expected with the relfile added in the stats hash key (though it deserves its own dedicated thread as explained in [0]). Don't look at 0001 and 0002 as I think we need more design discussion. === Sum up the feedback received up-thread I re-read this thread and it appears that there is 3 main remarks: R1: Andres did propose to add stuff like "n_dead_tup" (see [1]), to provide even more benefits. R2: Robert mentioned ([2]) that we need to decide between "sometimes I don't know the relation OID so I want to use the relfilenumber instead, without changing the user experience" and "some of these stats actually properly pertain to the relfilenode rather than the relation so I want to associate them with the right object and that will affect how the user sees things". R3: Michael had concerns about adding a new field (the relfile) in the hash key, see [3]. === My thoughts: While my initial idea was that the relfilenode stats would deal only with I/O activities it also looks like that it would be benficial to add sutff like "n_dead_tup". Then I think we should go with the "sometimes I don't know the relation OID so I want to use the relfilenumber instead, without changing the user experience" way. Regarding the concern about adding a new field in the hash key, I think we can't avoid that as we don't have the relation OID when writing buffers out. === Moving forward I would go for trying to store everything that is "relation" related into the relfilenode stats (that will then include n_dead_tup among other things) and try to hide the distinction between relfilenode stats and relation stats from the user. Thoughts of moving forward that way? [0]: https://www.postgresql.org/message-id/Zyb7RW1y9dVfO0UH%40ip-10-97-1-34.eu-west-3.compute.internal [1]: https://www.postgresql.org/message-id/20240607033806.6gwgolihss72cj6r%40awork3.anarazel.de [2]: https://www.postgresql.org/message-id/CA%2BTgmoZtwT6h%3DnyuQ1J9GNSrRyhf0fv7Ai6FzO%3DbH0C9Bf6tew%40mail.gmail.com [3]: https://www.postgresql.org/message-id/Zo9j69GhexDpeV4k%40paquier.xyz Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Nov 4, 2024 at 4:27 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > Then I think we should go with the "sometimes I don't know the relation OID > so I want to use the relfilenumber instead, without changing the user experience" > way. So does the latest version of the patch implement that principal uniformly throughout? -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On Mon, Nov 04, 2024 at 02:51:10PM -0500, Robert Haas wrote: > On Mon, Nov 4, 2024 at 4:27 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Then I think we should go with the "sometimes I don't know the relation OID > > so I want to use the relfilenumber instead, without changing the user experience" > > way. > > So does the latest version of the patch implement that principal > uniformly throughout? No, please don't look at v6-0001 and 0002 (as mentioned up-thread). The purpose here is mainly to get an agreement on the design before moving forward. Does it sound ok to you to move with the above principal? (I'm +1 on it). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Nov 5, 2024 at 1:06 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > Does it sound ok to you to move with the above principal? (I'm +1 on it). Yes, provided we can get a clean implementation of it. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, 5 Nov 2024 at 11:06, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > Does it sound ok to you to move with the above principal? (I'm +1 on it). > Hi! I looked through this thread. Looks like we are still awaiting a patch which stores more counters (n_dead_tup, ... etc) into relfilenode stats. So, I assume this should be moved to the next CF. I also have a very stupid question: If we don’t have the relation OID when writing buffers out, can we just store oid to buffertag mapping somewhere and use it? I suspect that this is a horrible idea, but what's the exact reason? Is it that we will break too many abstraction layers for such a minor matter? -- Best regards, Kirill Reshke
Hi, On Fri, Nov 29, 2024 at 11:23:12AM +0500, Kirill Reshke wrote: > On Tue, 5 Nov 2024 at 11:06, Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Does it sound ok to you to move with the above principal? (I'm +1 on it). > > > > Hi! I looked through this thread. Thanks for looking at it! > Looks like we are still awaiting a patch which stores more counters > (n_dead_tup, ... etc) into relfilenode stats. Yes. > If we don’t have the relation OID when writing buffers out, can we > just store oid to buffertag mapping somewhere and use it? Do you mean add the relation OID into the BufferTag? While that could probably be done from a technical point of view (with probably non negligible amount of refactoring), I can see those cons: 1. We'd increase the BufferDesc size and approaching the 64 bytes limit (cache line size) that we don't want to exceed (see comment above BufferDesc definition) 2. Probably lot of refactoring 3. This new member would be there "only" for stats and reporting purpose as it is not needed at all for buffer related operations 4. 3. seems to indicate that's not the right place Then I think 1. and 2. are not worth it given 3. and 4. There is probably other cons too though. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, 29 Nov 2024 at 20:20, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > On Fri, Nov 29, 2024 at 11:23:12AM +0500, Kirill Reshke wrote: > > If we don’t have the relation OID when writing buffers out, can we > > just store oid to buffertag mapping somewhere and use it? > > Do you mean add the relation OID into the BufferTag? While that could probably > be done from a technical point of view (with probably non negligible amount > of refactoring), I can see those cons: Not exactly, what i had in mind was a separate hashmap into shared memory, mapping buffertag<>oid. > 2. Probably lot of refactoring > 3. This new member would be there "only" for stats and reporting purpose as > it is not needed at all for buffer related operations To this design, your points 2&3 apply. -- Best regards, Kirill Reshke
Hi, On Fri, Nov 29, 2024 at 08:52:13PM +0500, Kirill Reshke wrote: > On Fri, 29 Nov 2024 at 20:20, Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > On Fri, Nov 29, 2024 at 11:23:12AM +0500, Kirill Reshke wrote: > > > If we don’t have the relation OID when writing buffers out, can we > > > just store oid to buffertag mapping somewhere and use it? > > > > Do you mean add the relation OID into the BufferTag? While that could probably > > be done from a technical point of view (with probably non negligible amount > > of refactoring), I can see those cons: > > Not exactly, what i had in mind was a separate hashmap into shared > memory, mapping buffertag<>oid. I see. > > 2. Probably lot of refactoring > > 3. This new member would be there "only" for stats and reporting purpose as > > it is not needed at all for buffer related operations > > To this design, your points 2&3 apply. That said, it might also help for DropRelationBuffers() where we need to scan the entire buffer pool (there is an optimization in place though). We could imagine buffertag as key and the value could be the relation OID and each entry would have next/prev pointers linking to other BufferTags with same OID. That's probably much more refactoring (and more invasive) that the initial idea in this thread but could lead to multiple pros though. I'm not very familar with the "buffer" area of the code and would also need to study the performance impact to maintain this new hash map. Do you and/or others have any thoughts/ideas about it? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com