Thread: Temporary tables prevent autovacuum, leading to XID wraparound
Hello,

I've found a problem where an orphaned temporary table can cause XID wraparound. Our customer encountered this problem with PG 9.5.2, but I think it can happen with the latest PG as well. I'm willing to fix this, but I'd like to ask you what approach we should take.

PROBLEM
====================================

The customer has a database for application data, which I'll call user_db here. They don't store application data in the postgres database.

No tables in user_db had been autovacuumed for more than a month, leading to user table bloat. Those tables are eligible for autovacuum according to pg_stat_all_tables and the autovacuum settings.

age(datfrozenxid) of both the user_db and postgres databases is greater than autovacuum_freeze_max_age, so they are eligible for autovacuuming for XID wraparound. There are user tables in user_db whose age(relfrozenxid) is greater than autovacuum_freeze_max_age, so those tables should get autovacuum treatment.

CAUSE
====================================

The postgres database has a table named pg_temp_3.fetchchunks, whose age(relfrozenxid) is greater than autovacuum_freeze_max_age. This temporary table is the culprit. pg_temp_3.fetchchunks is created by pg_rewind, and the customer says they ran pg_rewind.

The autovacuum launcher always chooses postgres, because do_start_worker() scans pg_database and finds that the postgres database needs vacuuming for XID wraparound. user_db is never chosen for vacuuming, although it also needs vacuuming for XID wraparound.

The autovacuum worker doesn't delete pg_temp_3.fetchchunks, because backend ID 3 is used by some application, so the autovacuum worker thinks that the backend is active and the temporary table cannot be dropped.

I don't know why pg_temp_3.fetchchunks still exists. Maybe the user ran pg_ctl stop -mi while pg_rewind was running.

FIX
====================================

I have the following questions. Along which line should I proceed to fix the problem?

* Why does the autovacuum launcher always choose only one database when that database needs vacuuming for XID wraparound? Shouldn't it also choose other databases?

* I think temporary tables should not require vacuuming for XID wraparound. Furthermore, should updates/deletes to temporary tables be done in place instead of creating garbage, so that any form of vacuum is unnecessary? Other sessions do not need to read temporary tables.

* In this incident, the autovacuum worker misjudged that pg_temp_3.fetchchunks couldn't be deleted, although its creator (pg_rewind) is no longer active. How can we delete orphaned temporary tables safely?

Regards
Takayuki Tsunakawa
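(For readers reproducing the diagnosis: the checks described above boil down to two catalog queries. This is a minimal sketch using only standard catalogs and functions; the LIMIT is arbitrary, and for brevity only ordinary tables are listed.)

    -- Databases whose frozen-XID age exceeds autovacuum_freeze_max_age
    -- are due for anti-wraparound autovacuum:
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

    -- Within a database, the oldest tables -- including temp tables such
    -- as pg_temp_3.fetchchunks -- that hold back datfrozenxid:
    SELECT n.nspname, c.relname, age(c.relfrozenxid) AS xid_age
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
    ORDER BY xid_age DESC
    LIMIT 10;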
On Thu, Jan 25, 2018 at 06:14:41AM +0000, Tsunakawa, Takayuki wrote:
> I don't know why pg_temp_3.fetchchunks still exists. Maybe the user ran pg_ctl stop -mi while pg_rewind was running.

Likely that was the case :( As a superuser, DROP TABLE should work on the temporary schema of another session. Have you tried that to solve the situation?

> * I think temporary tables should not require vacuuming for XID wraparound. Furthermore, should updates/deletes to temporary tables be done in place instead of creating garbage, so that any form of vacuum is unnecessary? Other sessions do not need to read temporary tables.

Yeah, there are many areas of improvement here. Temp tables also generate WAL...

> * In this incident, the autovacuum worker misjudged that pg_temp_3.fetchchunks couldn't be deleted, although its creator (pg_rewind) is no longer active. How can we delete orphaned temporary tables safely?

As long as Postgres sees that its temporary schema is in use, it will think that the table is not orphaned. Another possibility would be to have the session now holding this schema reuse fetchchunks so that things get reset.
--
Michael
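(Michael's suggested cleanup, spelled out as a sketch: a single statement using the schema and table names from this report, run by a superuser while connected to the affected database.)

    -- As a superuser, connected to the postgres database:
    DROP TABLE pg_temp_3.fetchchunks;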
> From: Michael Paquier [mailto:michael.paquier@gmail.com]
> As a superuser, DROP TABLE should work on the temporary schema of another session. Have you tried that to solve the situation?

Yes, we asked the customer to do that today. I think they will do so in the near future.

> > * In this incident, the autovacuum worker misjudged that pg_temp_3.fetchchunks couldn't be deleted, although its creator (pg_rewind) is no longer active. How can we delete orphaned temporary tables safely?
>
> As long as Postgres sees that its temporary schema is in use, it will think that the table is not orphaned. Another possibility would be to have the session now holding this schema reuse fetchchunks so that things get reset.

I understand you're suggesting that a new session which recycles the temp schema should erase the zombie metadata of old temp tables or recreate the temp schema. That sounds easy.

Regards
Takayuki Tsunakawa
On Thu, Jan 25, 2018 at 08:10:00AM +0000, Tsunakawa, Takayuki wrote:
>>> * In this incident, the autovacuum worker misjudged that pg_temp_3.fetchchunks couldn't be deleted, although its creator (pg_rewind) is no longer active. How can we delete orphaned temporary tables safely?
>>
>> As long as Postgres sees that its temporary schema is in use, it will think that the table is not orphaned. Another possibility would be to have the session now holding this schema reuse fetchchunks so that things get reset.
>
> I understand you're suggesting that a new session which recycles the temp schema should erase the zombie metadata of old temp tables or recreate the temp schema. That sounds easy.

If the new session makes use of the same temporary schema where the orphaned table is, cleanup is possible. But you have a problem if no such session shows up, as the cleanup depends on that uniquely assigned backend ID. It would be better to just drop the table manually in the end.
--
Michael
From: Michael Paquier [mailto:michael.paquier@gmail.com]
> On Thu, Jan 25, 2018 at 08:10:00AM +0000, Tsunakawa, Takayuki wrote:
> > I understand you're suggesting that a new session which recycles the temp schema should erase the zombie metadata of old temp tables or recreate the temp schema. That sounds easy.
>
> If the new session makes use of the same temporary schema where the orphaned table is, cleanup is possible. But you have a problem if no such session shows up, as the cleanup depends on that uniquely assigned backend ID.

Ouch, you're right. If a new session "uses the temp schema," it has a chance to clean up the old temp table metadata. However, that won't help if the session never tries to use the temp schema by creating a temp table...

> It would be better to just drop the table manually in the end.

To resolve this particular incident, yes. But this is a bug, so we need to fix it somehow.

Regards
Takayuki Tsunakawa
On Thu, Jan 25, 2018 at 3:14 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
>
> FIX
> ====================================
>
> I have the following questions. Along which line should I proceed to fix the problem?
>
> * Why does the autovacuum launcher always choose only one database when that database needs vacuuming for XID wraparound? Shouldn't it also choose other databases?

Yeah, I'd also like to fix this issue. This can be a problem in other cases too: suppose there are two databases that require anti-wraparound vacuum, and one of them has a very large table that could take a long time to vacuum. In this case, if autovacuum chooses the database with the big table first, the other database would not be selected until an autovacuum worker completes the vacuum on the large table. To deal with this, I think we can make autovacuum workers tell the launcher, before they exit, that it is no longer necessary to launch a new autovacuum worker on the database. For example, an autovacuum worker could report both the total number of relations and the number of relations that require an anti-wraparound vacuum to the stats collector.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Jan 25, 2018 at 1:14 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> * I think temporary tables should not require vacuuming for XID wraparound. Furthermore, should updates/deletes to temporary tables be done in place instead of creating garbage, so that any form of vacuum is unnecessary? Other sessions do not need to read temporary tables.

Temporary tables contain XIDs, so they need to be vacuumed for XID wraparound. Otherwise, queries against those tables by the session that created them could yield wrong answers. However, autovacuum can't perform that vacuuming; it would have to be done by the session. I think we should consider having backends try to remove their temporary schema on startup; then, if a temp table in a backend is old enough that it's due for vacuum for wraparound, have autovacuum kill the connection. The former is necessary to prevent sessions from being killed on account of temp tables they "inherited" from a backend that didn't exit cleanly.

The in-place update idea won't work for a couple of reasons. First, a command shouldn't see the results of modifications made earlier in the same command. Second, using cursors, it's possible to have more than one distinct snapshot open against a temporary table at the same time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
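(Robert's cursor argument can be seen in a short session sketch; the table and cursor names below are made up for illustration. The FETCH must return the pre-UPDATE row version, which would be impossible if the UPDATE had overwritten the tuple in place.)

    BEGIN;
    CREATE TEMP TABLE t (a int);
    INSERT INTO t VALUES (1);
    DECLARE c CURSOR FOR SELECT a FROM t;
    UPDATE t SET a = 2;  -- must keep the old tuple version around
    FETCH c;             -- returns 1: the cursor's snapshot predates the UPDATE
    COMMIT;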
From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]
> On Thu, Jan 25, 2018 at 3:14 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> > * Why does the autovacuum launcher always choose only one database when that database needs vacuuming for XID wraparound? Shouldn't it also choose other databases?
>
> Yeah, I'd also like to fix this issue. This can be a problem in other cases too: suppose there are two databases that require anti-wraparound vacuum, and one of them has a very large table that could take a long time to vacuum. In this case, if autovacuum chooses the database with the big table first, the other database would not be selected until an autovacuum worker completes the vacuum on the large table. To deal with this, I think we can make autovacuum workers tell the launcher, before they exit, that it is no longer necessary to launch a new autovacuum worker on the database. For example, an autovacuum worker could report both the total number of relations and the number of relations that require an anti-wraparound vacuum to the stats collector.

Thanks for commenting. I believe you have deep knowledge of and experience with vacuum because you did great work on the freeze map in 9.6, so I appreciate your help!

How would you use those two counts?

How about just modifying do_start_worker() so that the launcher chooses a database in the following order?

1. wraparound-risky database not being vacuumed by any worker
2. non-wraparound-risky database not being vacuumed by any worker
3. wraparound-risky database being vacuumed by any worker
4. non-wraparound-risky database being vacuumed by any worker

Regards
Takayuki Tsunakawa
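(Expressed as a sort, the proposed pick order amounts to the sketch below. is_wraparound_risky and has_worker are hypothetical placeholders for state the launcher tracks internally; they are not real pg_database columns.)

    -- Hypothetical sketch only; neither column exists in pg_database.
    SELECT datname
    FROM pg_database
    ORDER BY has_worker ASC,           -- no-worker databases first (cases 1-2)
             is_wraparound_risky DESC  -- then risky before non-risky
    LIMIT 1;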
From: Robert Haas [mailto:robertmhaas@gmail.com]
> I think we should consider having backends try to remove their temporary schema on startup; then, if a temp table in a backend is old enough that it's due for vacuum for wraparound, have autovacuum kill the connection. The former is necessary to prevent sessions from being killed on account of temp tables they "inherited" from a backend that didn't exit cleanly.

That seems to be the only reasonable solution. One might find it annoying to emit WAL during connection establishment to delete the temp schema, but even now client authentication can emit WAL for hot pruning while scanning pg_database, pg_authid, etc. Thanks.

> The in-place update idea won't work for a couple of reasons. First, a command shouldn't see the results of modifications made earlier in the same command. Second, using cursors, it's possible to have more than one distinct snapshot open against a temporary table at the same time.

You're right. And if the transaction rolls back, it needs to see the old tuples, which would require an undo log.

Regards
Takayuki Tsunakawa
On Fri, Jan 26, 2018 at 2:22 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]
>> Yeah, I'd also like to fix this issue. This can be a problem in other cases too: suppose there are two databases that require anti-wraparound vacuum, and one of them has a very large table that could take a long time to vacuum. In this case, if autovacuum chooses the database with the big table first, the other database would not be selected until an autovacuum worker completes the vacuum on the large table. To deal with this, I think we can make autovacuum workers tell the launcher, before they exit, that it is no longer necessary to launch a new autovacuum worker on the database. For example, an autovacuum worker could report both the total number of relations and the number of relations that require an anti-wraparound vacuum to the stats collector.
>
> Thanks for commenting. I believe you have deep knowledge of and experience with vacuum because you did great work on the freeze map in 9.6, so I appreciate your help!

Thanks. Most hackers in the community have deep knowledge, so I'd like to hear other opinions.

> How would you use those two counts?

What I thought is that a worker reports these two values after scanning pg_class and after freezing a table. The launcher decides to launch a new worker if the number of tables requiring anti-wraparound vacuum is greater than the number of workers running on the database. Similarly, the autovacuum launcher doesn't launch a new worker if the two values are equal, which means all tables requiring an anti-wraparound vacuum are already being vacuumed. There is a chance that a new relation is added while a worker is running on the last table requiring anti-wraparound vacuum, and the launcher then launches a new worker on the database. I think that's no problem, because the new worker would update those two values and exit soon.

> How about just modifying do_start_worker() so that the launcher chooses a database in the following order?
>
> 1. wraparound-risky database not being vacuumed by any worker
> 2. non-wraparound-risky database not being vacuumed by any worker
> 3. wraparound-risky database being vacuumed by any worker
> 4. non-wraparound-risky database being vacuumed by any worker

IMO limiting the number of workers per database to 1 seems risky. If a database has many tables that require an anti-wraparound vacuum, it takes a long time to freeze all of those tables. In the current implementation, as I mentioned above, the launcher can launch multiple workers on one database.

Since the above idea would be a bit complex, as an alternative it might be better to specify the number of workers to launch per database with a new GUC parameter or something. If the number of workers running on a database exceeds that limit, the launcher doesn't select the database, even if it is about to wrap around.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]
> What I thought is that a worker reports these two values after scanning pg_class and after freezing a table. The launcher decides to launch a new worker if the number of tables requiring anti-wraparound vacuum is greater than the number of workers running on the database. Similarly, the autovacuum launcher doesn't launch a new worker if the two values are equal, which means all tables requiring an anti-wraparound vacuum are already being vacuumed. There is a chance that a new relation is added while a worker is running on the last table requiring anti-wraparound vacuum, and the launcher then launches a new worker on the database. I think that's no problem, because the new worker would update those two values and exit soon.

I got it. Currently, the launcher assigns all workers to one database even if that database has only one table in danger of wraparound. With your suggestion, the launcher assigns as many workers as there are tables to be frozen, and uses the remaining workers for the other databases.

> > How about just modifying do_start_worker() so that the launcher chooses a database in the following order?
> >
> > 1. wraparound-risky database not being vacuumed by any worker
> > 2. non-wraparound-risky database not being vacuumed by any worker
> > 3. wraparound-risky database being vacuumed by any worker
> > 4. non-wraparound-risky database being vacuumed by any worker
>
> IMO limiting the number of workers per database to 1 seems risky. If a database has many tables that require an anti-wraparound vacuum, it takes a long time to freeze all of those tables. In the current implementation, as I mentioned above, the launcher can launch multiple workers on one database.

I can understand your concern. On the other hand, it's unfair that one database could monopolize all workers, because other databases might also be facing wraparound risk.

> Since the above idea would be a bit complex, as an alternative it might be better to specify the number of workers to launch per database with a new GUC parameter or something. If the number of workers running on a database exceeds that limit, the launcher doesn't select the database, even if it is about to wrap around.

I'm afraid the GUC would be difficult for users to understand and tune.

I want to submit the patch for handling the garbage temporary table metadata, as Robert suggested, in the next CF. That should be enough to prevent this customer's problem. I would appreciate it if anyone could address the other improvement that Sawada-san proposed.

Regards
Takayuki Tsunakawa
On Mon, Jan 29, 2018 at 3:33 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]
>> What I thought is that a worker reports these two values after scanning pg_class and after freezing a table. The launcher decides to launch a new worker if the number of tables requiring anti-wraparound vacuum is greater than the number of workers running on the database. Similarly, the autovacuum launcher doesn't launch a new worker if the two values are equal, which means all tables requiring an anti-wraparound vacuum are already being vacuumed. There is a chance that a new relation is added while a worker is running on the last table requiring anti-wraparound vacuum, and the launcher then launches a new worker on the database. I think that's no problem, because the new worker would update those two values and exit soon.
>
> I got it. Currently, the launcher assigns all workers to one database even if that database has only one table in danger of wraparound. With your suggestion, the launcher assigns as many workers as there are tables to be frozen, and uses the remaining workers for the other databases.
>
>> > How about just modifying do_start_worker() so that the launcher chooses a database in the following order?
>> >
>> > 1. wraparound-risky database not being vacuumed by any worker
>> > 2. non-wraparound-risky database not being vacuumed by any worker
>> > 3. wraparound-risky database being vacuumed by any worker
>> > 4. non-wraparound-risky database being vacuumed by any worker
>>
>> IMO limiting the number of workers per database to 1 seems risky. If a database has many tables that require an anti-wraparound vacuum, it takes a long time to freeze all of those tables. In the current implementation, as I mentioned above, the launcher can launch multiple workers on one database.
>
> I can understand your concern. On the other hand, it's unfair that one database could monopolize all workers, because other databases might also be facing wraparound risk.

On third thought, we can change the policy of launching workers so that the launcher dispatches workers evenly to wraparound-risky databases instead of choosing only the most wraparound-risky database.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]
> On Mon, Jan 29, 2018 at 3:33 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> > I can understand your concern. On the other hand, it's unfair that one database could monopolize all workers, because other databases might also be facing wraparound risk.
>
> On third thought, we can change the policy of launching workers so that the launcher dispatches workers evenly to wraparound-risky databases instead of choosing only the most wraparound-risky database.

+1

Regards
Takayuki Tsunakawa
On Mon, Jan 29, 2018 at 1:33 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
>> Since the above idea would be a bit complex, as an alternative it might be better to specify the number of workers to launch per database with a new GUC parameter or something. If the number of workers running on a database exceeds that limit, the launcher doesn't select the database, even if it is about to wrap around.
>
> I'm afraid the GUC would be difficult for users to understand and tune.

I agree. It's autovacuum's job to do the correct thing. If it's not, then we need to figure out how to make it do the right thing. Adding a GUC seems like saying we don't know what the right thing to do is but we hope the user does know. That's not a good answer.

Unfortunately, I think a full solution to the problem of allocating AV workers to avoid wraparound is quite complex. Suppose that database A will force a shutdown due to impending wraparound in 4 hours and database B will force a shutdown in 12 hours. On first blush, it seems that we should favor adding workers to A. But it might be that database A needs only 2 hours of vacuuming to avoid a shutdown whereas B needs 12 hours. In that case, it seems that we ought to instead favor adding workers to B. However, it might be the case that A has more tables coming due for wraparound 6 hours from now, and we need another 15 hours of vacuuming to avoid that shutdown. That would favor adding workers to A. Then again, it might be that A and B already both have workers, and that adding yet another worker to A won't speed anything up (because only large tables remain to be processed and each has a worker already), whereas adding a worker to B would speed things up (because it still has a big table that we could immediately start to vacuum for wraparound). In that case, perhaps we ought to instead add a worker to B. But, thinking some more, it seems like that should cause autovacuum_vacuum_cost_limit to be reallocated among the workers, making the existing vacuum processes take longer, which might actually make a bad situation worse. It seems possible that the right answer could be to start no new autovacuum worker at all.

Given all of the foregoing, this seems like a very hard problem. I can't even articulate a clear set of rules for what our priorities should be, and it seems that such rules would depend on the rate at which we're consuming XIDs, how close we are in each database to a wraparound shutdown, what tables exist in each database, how big the not-all-frozen part of each one is, how big their indexes are, how much they're holding back relfrozenxid, and which ones already have workers, among other things. I think it's quite possible that we can come up with something that's better than what we have now without embarking on a huge project, but it's not going to be anywhere near perfect because this is really complicated, and there's a real risk that we'll just make some cases better and others worse rather than actually coming out ahead overall.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
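(One way to see why "closest to shutdown first" is the wrong rule for Robert's A-and-B example: compute the slack between time-to-shutdown and the vacuuming time needed. The numbers below are the hypothetical ones from the example, not anything measurable from the catalogs.)

    -- Hypothetical numbers from the example above; no such table exists.
    WITH db(name, hours_to_shutdown, vacuum_hours_needed) AS (
        VALUES ('A', 4, 2), ('B', 12, 12)
    )
    SELECT name, hours_to_shutdown - vacuum_hours_needed AS slack_hours
    FROM db
    ORDER BY slack_hours;  -- B has zero slack, so B is actually more urgent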
From: Robert Haas [mailto:robertmhaas@gmail.com]
> Unfortunately, I think a full solution to the problem of allocating AV workers to avoid wraparound is quite complex.

Yes, that easily puts my small brain into an infinite loop...

> Given all of the foregoing, this seems like a very hard problem. I can't even articulate a clear set of rules for what our priorities should be, and it seems that such rules would depend on the rate at which we're consuming XIDs, how close we are in each database to a wraparound shutdown, what tables exist in each database, how big the not-all-frozen part of each one is, how big their indexes are, how much they're holding back relfrozenxid, and which ones already have workers, among other things. I think it's quite possible that we can come up with something that's better than what we have now without embarking on a huge project, but it's not going to be anywhere near perfect because this is really complicated, and there's a real risk that we'll just make some cases better and others worse rather than actually coming out ahead overall.

So a simple improvement would be to assign workers fairly to databases facing a wraparound risk, as Sawada-san suggested.

One ultimate solution would be the undo-based MVCC that makes vacuuming unnecessary, which you proposed about a year ago...

Regards
Takayuki Tsunakawa
On Tue, Jan 30, 2018 at 7:04 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> So a simple improvement would be to assign workers fairly to databases facing a wraparound risk, as Sawada-san suggested.

Is that always an improvement, or does it make some cases better and others worse?

> One ultimate solution would be the undo-based MVCC that makes vacuuming unnecessary, which you proposed about a year ago...

And blogged about yesterday.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 1, 2018 at 2:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 30, 2018 at 7:04 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
>> So a simple improvement would be to assign workers fairly to databases facing a wraparound risk, as Sawada-san suggested.
>
> Is that always an improvement, or does it make some cases better and others worse?

I think the idea would not be an improvement so much as a change of policy. The current launcher's policy is "launch as many workers as possible on the database most at risk of wraparound". The idea I suggested makes the cases mentioned on this thread better while perhaps making other cases worse.

To improve things while keeping the current policy, we might want to use the first idea I proposed. That is, we don't launch a new worker on a database with impending wraparound if the last table of the database is already being vacuumed. But that requires sharing new information, such as what tables exist in each database and which tables already have workers. It might be overkill just to deal with such a corner case, though.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Jan 31, 2018 at 7:37 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I think the idea would not be an improvement so much as a change of policy. The current launcher's policy is "launch as many workers as possible on the database most at risk of wraparound". The idea I suggested makes the cases mentioned on this thread better while perhaps making other cases worse.
>
> To improve things while keeping the current policy, we might want to use the first idea I proposed. That is, we don't launch a new worker on a database with impending wraparound if the last table of the database is already being vacuumed. But that requires sharing new information, such as what tables exist in each database and which tables already have workers. It might be overkill just to deal with such a corner case, though.

Something like that might work, but I think it needs more thought. Maybe, for each database currently being processed by at least 1 worker, advertise in shared memory the oldest XID that isn't already being vacuumed by some AV worker; when considering which database is at greatest risk of wraparound, if that value is available, use it instead of the database's datfrozenxid. Then when all tables that make that database riskier than some other database already have workers, some other database can get a chance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Feb 2, 2018 at 12:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 31, 2018 at 7:37 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I think the idea would not be an improvement so much as a change of policy. The current launcher's policy is "launch as many workers as possible on the database most at risk of wraparound". The idea I suggested makes the cases mentioned on this thread better while perhaps making other cases worse.
>>
>> To improve things while keeping the current policy, we might want to use the first idea I proposed. That is, we don't launch a new worker on a database with impending wraparound if the last table of the database is already being vacuumed. But that requires sharing new information, such as what tables exist in each database and which tables already have workers. It might be overkill just to deal with such a corner case, though.
>
> Something like that might work, but I think it needs more thought. Maybe, for each database currently being processed by at least 1 worker, advertise in shared memory the oldest XID that isn't already being vacuumed by some AV worker; when considering which database is at greatest risk of wraparound, if that value is available, use it instead of the database's datfrozenxid. Then when all tables that make that database riskier than some other database already have workers, some other database can get a chance.

Thank you for the suggestion. That sounds smarter. So would it be better to vacuum databases for anti-wraparound in ascending order of relfrozenxid?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]
> Thank you for the suggestion. That sounds smarter. So would it be better to vacuum databases for anti-wraparound in ascending order of relfrozenxid?

I thought so, too. The current behavior is inconsistent: the launcher tries to assign all workers to the one database with the biggest wraparound risk in order to eliminate the risk as fast as possible, but the workers are in no hurry to reduce the risk.

Regards
Takayuki Tsunakawa
On Fri, Feb 2, 2018 at 1:27 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for the suggestion. That sounds smarter. So would it be better to vacuum databases for anti-wraparound in ascending order of relfrozenxid?

Currently, we're doing it based on datfrozenxid. I was looking for a small, relatively safe change to improve things, which led to my proposal: continue using datfrozenxid when the database isn't being vacuumed, but when it is, substitute the datfrozenxid that we expect the database *will have* after the tables currently being vacuumed are finished for the actual datfrozenxid.

On further reflection, though, I see problems. Suppose db1 has age(datfrozenxid) = 600m and two tables with age(relfrozenxid) of 600m and 400m. db2 has age(datfrozenxid) = 500m. Then suppose worker #1 starts vacuuming the table with age 600m. The AV launcher sees that, with that table out of the way, we'll be able to bring datfrozenxid forward to 400m and so sends worker #2 to db2. But actually, when the worker in db1 finishes vacuuming the table with age 600m, it's going to have to vacuum the table with age 400m before performing any relfrozenxid update. So actually, the table with age 400m is holding the relfrozenxid at 600m just as much as if it too had a relfrozenxid of 600m. In fact, any tables in the same database that need routine vacuuming are a problem, too: we're going to vacuum all of those before updating relfrozenxid as well.

Maybe we should start by making the scheduling algorithm used by the individual workers smarter:

1. Let's teach do_autovacuum() that, when there are any tables needing vacuum for wraparound, either for relfrozenxid or relminmxid, it should vacuum only those and forget about the rest. This seems like an obvious optimization to prevent us from delaying datfrozenxid/datminmxid updates for the sake of vacuuming tables that are "merely" bloated.

2. Let's have do_autovacuum() sort the tables to be vacuumed in ascending order by relfrozenxid, so older tables are vacuumed first. A zero-order idea is to put tables needing relfrozenxid vacuuming before tables needing relminmxid vacuuming, and sort the latter by ascending relminmxid, but maybe there's something smarter possible there. The idea here is to vacuum tables in order of priority rather than in whatever order they happen to be physically mentioned in pg_class.

3. Let's have do_autovacuum() make an extra call to vac_update_datfrozenxid() whenever the next table to be vacuumed is at least 10 million XIDs (or MXIDs) newer than the first one it vacuumed either since the last call to vac_update_datfrozenxid() or, if none, since it started. That way, we don't have to wait for every single table to get vacuumed before we can consider advancing relfrozenxid/relminmxid.

4. When it's performing vacuuming for wraparound, let's have AV workers advertise in shared memory the oldest relfrozenxid and relminmxid that might still exist in the database. Given #1 and #2, this is pretty easy, since we start by moving through tables in increasing relfrozenxid order and then shift to moving through them in increasing relminmxid order. When we're working through the relfrozenxid tables, the oldest relminmxid doesn't move, and the oldest relfrozenxid is that of the next table in the list. When we're working through the relminmxid tables, it's the reverse.
We need a little cleverness to figure out what value to advertise when we're on the last table in each list -- it should be the next-higher value, even though that will be above the relevant threshold, not a sentinel value.

5. With those steps in place, I think we can now adopt my previous idea to have the AV launcher use any advertised relfrozenxid (and, as I now realize, relminmxid) instead of the ones found in pg_database, because now we know that individual workers are definitely focused on getting relfrozenxid (and/or relminmxid) advanced as soon as possible, and vacuuming unrelated tables won't help them do it any faster.

This gets us fairly close to vacuuming tables in decreasing order of wraparound danger across the entire cluster. It's not perfect. It prefers to keep vacuuming tables in the same database rather than having a worker exit and maybe launching a new one in a different database -- but the alternative is not very appealing. If we didn't do it that way, and if we had a million tables with XIDs that were closely spaced spread across different databases, we'd have to terminate and relaunch workers at a very high rate to get everything sorted out, which would be inefficient and annoying to program. Also, it keeps the existing hard prioritization of relfrozenxid over relminmxid, which could theoretically be wrong for some installation. But I think that might not be a big problem in practice, and it seems like that could be separately improved at another time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
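(Steps 1 and 2 of the plan above amount to a work list like the one sketched below. The real change would live in do_autovacuum() in C; this SQL rendering only shows the filter and the ordering, with multixacts handled analogously via relminmxid, and ordinary tables only for brevity.)

    -- Tables needing anti-wraparound vacuum, in ascending relfrozenxid
    -- order (i.e. oldest first, so descending age):
    SELECT c.oid::regclass, age(c.relfrozenxid) AS xid_age
    FROM pg_class c
    WHERE c.relkind = 'r'
      AND age(c.relfrozenxid) >
          current_setting('autovacuum_freeze_max_age')::int
    ORDER BY xid_age DESC;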
From: Robert Haas [mailto:robertmhaas@gmail.com]
> Temporary tables contain XIDs, so they need to be vacuumed for XID wraparound. Otherwise, queries against those tables by the session that created them could yield wrong answers. However, autovacuum can't perform that vacuuming; it would have to be done by the session. I think we should consider having backends try to remove their temporary schema on startup; then, if a temp table in a backend is old enough that it's due for vacuum for wraparound, have autovacuum kill the connection. The former is necessary to prevent sessions from being killed on account of temp tables they "inherited" from a backend that didn't exit cleanly.

The attached patch does the former. The small change in autovacuum.c is mainly for the autovac launcher and background workers which don't connect to a database. I'll add this to the next CF. I'd like this to be back-patched.

I didn't do the latter, because killing the connection seemed a bit overkill. If we're going to do it, then we should also kill the connection which is preventing vacuum, regardless of whether it has temporary tables in its session.

Regards
Takayuki Tsunakawa
On Sun, Feb 4, 2018 at 10:10 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> The attached patch does the former. The small change in autovacuum.c is mainly for the autovac launcher and background workers which don't connect to a database. I'll add this to the next CF. I'd like this to be back-patched.

This is really two separate changes:

1. Teach backends to remove any leftover pg_temp_%d schema on startup.

2. Teach autovacuum to remove any leftover tables in a pg_temp_%d schema if the backend is active but in some other database (rather than only when the backend is not active at all).

Both of those changes seem broadly reasonable to me, but what do other people think?

#1 is bound to have some small impact on backend startup time, but it's probably negligible (one extra syscache lookup, in the common case). I guess one problem is that if the cleanout of the old pg_temp_%d schema should fail, then the backend will probably end up exiting. So it would be really bad if you had catalog corruption preventing the removal of pg_temp_2, because then every time you connect, it will try to remove that schema, fail, and disconnect you. On the other hand, there's plenty of kinds of catalog corruption that can prevent a database connection already. It would be nice if we could think of a way to skip this cleanup if the previous backend failed to complete it, but I don't see a way to do that off-hand. I think we should at least skip it if we're in single-user mode, though.

I don't see any problem with #2, although that part of the patch could use a little cosmetic work.

I would not be inclined to back-patch these changes, or at least not farther than v10. Before v10, we only tried to drop temporary tables if they were old enough to threaten wraparound, so we've already made this much more aggressive than it used to be. I think making it more aggressive in the ways mentioned above is a good idea, but I'm not sure that the failure to do so constitutes a bug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 06, 2018 at 11:33:59AM -0500, Robert Haas wrote:
> This is really two separate changes:
>
> 1. Teach backends to remove any leftover pg_temp_%d schema on startup.

I am not sure that we want to give up so easily the property we have now of cleaning up past temporary files only at postmaster startup, and only when not in recovery. If you implement that, there is a risk that the backend you are starting eats the connection slot and, by consequence, its temporary schema and its set of temporary tables, which one may want to look into after a crash.

> 2. Teach autovacuum to remove any leftover tables in a pg_temp_%d schema if the backend is active but in some other database (rather than only when the backend is not active at all).

Yeah. Here we can do something. This does not sound too difficult to me.
--
Michael
From: Michael Paquier [mailto:michael.paquier@gmail.com]
> I am not sure that we want to give up so easily the property we have now of cleaning up past temporary files only at postmaster startup, and only when not in recovery. If you implement that, there is a risk that the backend you are starting eats the connection slot and, by consequence, its temporary schema and its set of temporary tables, which one may want to look into after a crash.

postmaster deletes temporary relation files at startup by calling RemovePgTempFiles() regardless of whether it's in recovery. It doesn't call that function during the automatic restart after a crash when restart_after_crash is on.

> > 2. Teach autovacuum to remove any leftover tables in a pg_temp_%d schema if the backend is active but in some other database (rather than only when the backend is not active at all).
>
> Yeah. Here we can do something. This does not sound too difficult to me.

I did that in my patch.

Regards
Takayuki Tsunakawa
On Wed, Feb 07, 2018 at 12:37:50AM +0000, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:michael.paquier@gmail.com]
>> I am not sure that we want to give up so easily the property we have now of cleaning up past temporary files only at postmaster startup, and only when not in recovery. If you implement that, there is a risk that the backend you are starting eats the connection slot and, by consequence, its temporary schema and its set of temporary tables, which one may want to look into after a crash.
>
> postmaster deletes temporary relation files at startup by calling RemovePgTempFiles() regardless of whether it's in recovery. It doesn't call that function during the automatic restart after a crash when restart_after_crash is on.

The comment on top of RemovePgTempFiles() states the following:

 * NOTE: we could, but don't, call this during a post-backend-crash restart
 * cycle.  The argument for not doing it is that someone might want to examine
 * the temp files for debugging purposes.  This does however mean that
 * OpenTemporaryFile had better allow for collision with an existing temp
 * file name.

>> > 2. Teach autovacuum to remove any leftover tables in a pg_temp_%d schema if the backend is active but in some other database (rather than only when the backend is not active at all).
>>
>> Yeah. Here we can do something. This does not sound too difficult to me.
>
> I did that in my patch.

Nice to hear that. Please note that I did not check your patch, so I cannot conclude on its correctness in detail.
--
Michael
From: Michael Paquier [mailto:michael.paquier@gmail.com]
> > postmaster deletes temporary relation files at startup by calling RemovePgTempFiles() regardless of whether it's in recovery. It doesn't call that function during the automatic restart after a crash when restart_after_crash is on.
>
> The comment on top of RemovePgTempFiles() states the following:
>
>  * NOTE: we could, but don't, call this during a post-backend-crash restart
>  * cycle.  The argument for not doing it is that someone might want to examine
>  * the temp files for debugging purposes.  This does however mean that
>  * OpenTemporaryFile had better allow for collision with an existing temp
>  * file name.

Yes, I saw that comment. postmaster keeps orphaned temp relation files only after a crash with restart_after_crash on.

> Nice to hear that. Please note that I did not check your patch, so I cannot conclude on its correctness in detail.

I thought so. Don't worry about it.

Regards
Takayuki Tsunakawa
On Tue, Feb 6, 2018 at 6:03 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> I am not sure that we want to give up so easily the property we have now of cleaning up past temporary files only at postmaster startup, and only when not in recovery. If you implement that, there is a risk that the backend you are starting eats the connection slot and, by consequence, its temporary schema and its set of temporary tables, which one may want to look into after a crash.

You can't actually do that from SQL, because as soon as you try to access your pg_temp schema, it'll be cleared out. All this change does is move that forward from time-of-first-access to session start. That's a significant change that needs discussion, but I don't think it has the effect you are supposing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Feb 3, 2018 at 1:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 2, 2018 at 1:27 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Thank you for the suggestion. That sounds smarter. So would it be better to vacuum databases for anti-wraparound in ascending order of relfrozenxid?
>
> Currently, we're doing it based on datfrozenxid. I was looking for a small, relatively safe change to improve things, which led to my proposal: continue using datfrozenxid when the database isn't being vacuumed, but when it is, substitute the datfrozenxid that we expect the database *will have* after the tables currently being vacuumed are finished for the actual datfrozenxid.
>
> On further reflection, though, I see problems. Suppose db1 has age(datfrozenxid) = 600m and two tables with age(relfrozenxid) of 600m and 400m. db2 has age(datfrozenxid) = 500m. Then suppose worker #1 starts vacuuming the table with age 600m. The AV launcher sees that, with that table out of the way, we'll be able to bring datfrozenxid forward to 400m and so sends worker #2 to db2. But actually, when the worker in db1 finishes vacuuming the table with age 600m, it's going to have to vacuum the table with age 400m before performing any relfrozenxid update. So actually, the table with age 400m is holding the relfrozenxid at 600m just as much as if it too had a relfrozenxid of 600m. In fact, any tables in the same database that need routine vacuuming are a problem, too: we're going to vacuum all of those before updating relfrozenxid as well.
>
> Maybe we should start by making the scheduling algorithm used by the individual workers smarter:
>
> 1. Let's teach do_autovacuum() that, when there are any tables needing vacuum for wraparound, either for relfrozenxid or relminmxid, it should vacuum only those and forget about the rest. This seems like an obvious optimization to prevent us from delaying datfrozenxid/datminmxid updates for the sake of vacuuming tables that are "merely" bloated.
>
> 2. Let's have do_autovacuum() sort the tables to be vacuumed in ascending order by relfrozenxid, so older tables are vacuumed first. A zero-order idea is to put tables needing relfrozenxid vacuuming before tables needing relminmxid vacuuming, and sort the latter by ascending relminmxid, but maybe there's something smarter possible there. The idea here is to vacuum tables in order of priority rather than in whatever order they happen to be physically mentioned in pg_class.
>
> 3. Let's have do_autovacuum() make an extra call to vac_update_datfrozenxid() whenever the next table to be vacuumed is at least 10 million XIDs (or MXIDs) newer than the first one it vacuumed either since the last call to vac_update_datfrozenxid() or, if none, since it started. That way, we don't have to wait for every single table to get vacuumed before we can consider advancing relfrozenxid/relminmxid.
>
> 4. When it's performing vacuuming for wraparound, let's have AV workers advertise in shared memory the oldest relfrozenxid and relminmxid that might still exist in the database. Given #1 and #2, this is pretty easy, since we start by moving through tables in increasing relfrozenxid order and then shift to moving through them in increasing relminmxid order. When we're working through the relfrozenxid tables, the oldest relminmxid doesn't move, and the oldest relfrozenxid is that of the next table in the list. When we're working through the relminmxid tables, it's the reverse. We need a little cleverness to figure out what value to advertise when we're on the last table in each list -- it should be the next-higher value, even though that will be above the relevant threshold, not a sentinel value.

So I think we should include tables that are not at risk of wraparound as well, in order to get the next-higher value (that is, the oldest of the non-risky tables) instead of forgetting them. We would then skip actually vacuuming them.

> 5. With those steps in place, I think we can now adopt my previous idea to have the AV launcher use any advertised relfrozenxid (and, as I now realize, relminmxid) instead of the ones found in pg_database, because now we know that individual workers are definitely focused on getting relfrozenxid (and/or relminmxid) advanced as soon as possible, and vacuuming unrelated tables won't help them do it any faster.
>
> This gets us fairly close to vacuuming tables in decreasing order of wraparound danger across the entire cluster. It's not perfect. It prefers to keep vacuuming tables in the same database rather than having a worker exit and maybe launching a new one in a different database -- but the alternative is not very appealing. If we didn't do it that way, and if we had a million tables with XIDs that were closely spaced spread across different databases, we'd have to terminate and relaunch workers at a very high rate to get everything sorted out, which would be inefficient and annoying to program. Also, it keeps the existing hard prioritization of relfrozenxid over relminmxid, which could theoretically be wrong for some installation. But I think that might not be a big problem in practice, and it seems like that could be separately improved at another time.

I think this algorithm works fine and improves the current behavior.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Robert Haas <robertmhaas@gmail.com> writes:
> This is really two separate changes:
> 1. Teach backends to remove any leftover pg_temp_%d schema on startup.
> 2. Teach autovacuum to remove any leftover tables in a pg_temp_%d schema if the backend is active but in some other database (rather than only when the backend is not active at all).
> Both of those changes seem broadly reasonable to me, but what do other people think?

I just read through this thread for the first time; sorry for not paying attention sooner.

I'm uncomfortable with all the discussion of changing the autovacuum launcher's algorithm; all of that seems complicated and unsure of making anyone's life better. It definitely does nothing for the problem originally stated, that autovacuum is skipping an orphaned temp table altogether. That's not relevant to the actually proposed patch; I'm just saying that I'm not sure anything useful would come of that.

Now as for the problem originally stated, step 1 alone doesn't fix it, and there's reason not to like that change much. Forcing backends to clear their temp schemas immediately on connection will slow down connection times, and for applications that never make any temp tables, that's just a dead loss (though admittedly it might not be too expensive in that case). But the problem here is that it doesn't fix the issue. The reason that a temp table might stay orphaned for a long time, if we're speaking of an app that does use temp tables, is that it's in a very high-numbered temp schema and only once in a blue moon does the database have enough connections for that temp schema to be used at all. Forcing on-connection cleaning doesn't fix that.

So the correct fix is to improve autovacuum's check to discover whether an old temp table is orphaned, so that it isn't fooled by putative owner processes that are connected to some other DB. Step 2 of the proposed patch tries to do that, but it's only half right: we need a change near line 2264 not only line 2090. (Evidently, we need either a comment that the logic is repeated, or else refactor so that there's only one copy.)

Now, you can argue that autovacuum's check can be fooled by an "owner" backend that is connected to the current DB but hasn't actually taken possession of its assigned temp schema (and hence the table in question really is orphaned after all). This edge case could be taken care of by having backends clear their temp schema immediately, as in step 1 of the patch. But I still think that that is an expensive way to catch what would be a really infrequent case. Perhaps a better idea is to have backends advertise in PGPROC whether they have taken possession of their assigned temp schema, and then autovacuum could ignore putative owners that aren't showing that flag.

Or we could just do nothing about that, on the grounds that nobody has shown the case to be worth worrying about. Temp schemas that are sufficiently low-numbered to be likely to have an apparent owner are also likely to get cleaned out by actual temp table creation reasonably often, so I'm not at all convinced that we need a mechanism that covers this case. We do need something for the cross-DB case though, because even a very low-numbered temp schema can be problematic if it's in a seldom-used DB, which seems to be the case that was complained of originally.

On the whole, my vote is to fix and apply step 2, and leave it at that.

			regards, tom lane
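(Whichever variant gets committed, an administrator hitting this today can at least enumerate the suspect schemas per database with a sketch like the one below. Whether the apparent owner backend is really gone cannot be fully determined from SQL, so treat the output only as candidates for the manual DROP discussed earlier in the thread.)

    -- Temp tables by schema and age in the current database:
    SELECT n.nspname, c.relname, age(c.relfrozenxid) AS xid_age
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname LIKE 'pg\_temp\_%'
    ORDER BY xid_age DESC;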
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> I just read through this thread for the first time; sorry for not paying attention sooner.

Please don't mind. I'm very happy that you're giving it attention now.

> I'm uncomfortable with all the discussion of changing the autovacuum launcher's algorithm; all of that seems complicated and unsure of making anyone's life better. It definitely does nothing for the problem originally stated, that autovacuum is skipping an orphaned temp table altogether. That's not relevant to the actually proposed patch; I'm just saying that I'm not sure anything useful would come of that.

Yes. Sawada-san is addressing the autovacuum launcher algorithm in another thread.

> The reason that a temp table might stay orphaned for a long time, if we're speaking of an app that does use temp tables, is that it's in a very high-numbered temp schema and only once in a blue moon does the database have enough connections for that temp schema to be used at all. Forcing on-connection cleaning doesn't fix that.

Uh, you're right.

> So the correct fix is to improve autovacuum's check to discover whether an old temp table is orphaned, so that it isn't fooled by putative owner processes that are connected to some other DB. Step 2 of the proposed patch tries to do that, but it's only half right: we need a change near line 2264 not only line 2090. (Evidently, we need either a comment that the logic is repeated, or else refactor so that there's only one copy.)

I see...

> Now, you can argue that autovacuum's check can be fooled by an "owner" backend that is connected to the current DB but hasn't actually taken possession of its assigned temp schema (and hence the table in question really is orphaned after all). This edge case could be taken care of by having backends clear their temp schema immediately, as in step 1 of the patch. But I still think that that is an expensive way to catch what would be a really infrequent case. Perhaps a better idea is to have backends advertise in PGPROC whether they have taken possession of their assigned temp schema, and then autovacuum could ignore putative owners that aren't showing that flag.

That autovacuum-driven approach sounds interesting. But it seems to require some new locking to prevent a race condition between the autovacuum launcher and a backend: the autovacuum launcher thinks that a temp schema is not owned by anyone, then a backend begins to use that temp schema, and the autovacuum launcher deletes the temp tables in that schema.

> Or we could just do nothing about that, on the grounds that nobody has shown the case to be worth worrying about. Temp schemas that are sufficiently low-numbered to be likely to have an apparent owner are also likely to get cleaned out by actual temp table creation reasonably often, so I'm not at all convinced that we need a mechanism that covers this case. We do need something for the cross-DB case though, because even a very low-numbered temp schema can be problematic if it's in a seldom-used DB, which seems to be the case that was complained of originally.

Yes, the original problem was a low-numbered temp schema (pg_temp_3) which pg_rewind had used, so we want to rescue this case.

> On the whole, my vote is to fix and apply step 2, and leave it at that.

Regards
Takayuki Tsunakawa
On Tue, Mar 6, 2018 at 6:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Now as for the problem originally stated, step 1 alone doesn't fix it, and there's reason not to like that change much. Forcing backends to clear their temp schemas immediately on connection will slow down connection times, and for applications that never make any temp tables, that's just a dead loss (though admittedly it might not be too expensive in that case).

I think that's a little short-sighted. I think we really want temp tables of no-longer-running backends to go away as soon as possible; that should be viewed as a gain in and of itself. One, it saves disk space. Two, it prevents them from causing wraparound problems. I believe we've had people complain about both, but definitely the latter.

> Now, you can argue that autovacuum's check can be fooled by an "owner" backend that is connected to the current DB but hasn't actually taken possession of its assigned temp schema (and hence the table in question really is orphaned after all). This edge case could be taken care of by having backends clear their temp schema immediately, as in step 1 of the patch. But I still think that that is an expensive way to catch what would be a really infrequent case.

I think we should try to do something about this case -- if not now, then later. I agree that it would be better if we could get autovacuum to do it instead of doing it in the foreground. I don't really share your concern about performance; one extra syscache lookup at backend startup isn't going to break the bank. Rather, I'm concerned about reliability. As I said upthread: "So it would be really bad if you had catalog corruption preventing the removal of pg_temp_2, because then every time you connect, it will try to remove that schema, fail, and disconnect you." Now granted your database *shouldn't* have catalog corruption, but a lot of things that shouldn't happen sometimes do, and it's better when those problems don't cause cascading failures.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Mar 6, 2018 at 6:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Now as for the problem originally stated, step 1 alone doesn't fix it, >> and there's reason not to like that change much. Forcing backends to >> clear their temp schemas immediately on connection will slow down >> connection times, and for applications that never make any temp tables, >> that's just a dead loss (though admittedly it might not be too expensive >> in that case). > I think that's a little short-sighted. I think we really want temp > tables of no-longer-running backends to go away as soon as possible; > that should be viewed as a gain in and of itself. We already have autovacuum taking care of that, and as I stated, asking backends to do it is provably insufficient. The right path is to make autovacuum cover the cases it's missing today. > I don't > really share your concern about performance; one extra syscache lookup > at backend startup isn't going to break the bank. If it were just that, I wouldn't be worried either. But it's not. Once the pg_namespace row exists, which it will in an installation that's been in use for any length of time, you're going to have to actively search for entries in that schema. I'm not sure how expensive a performDeletion() scan that finds nothing to do really is, but for sure it's more than the syscache lookup you expended to find the pg_namespace row. It's perhaps also worth reminding people that this whole discussion pertains only to crash recovery; if the previous owner of the temp schema had exited cleanly, it would've cleaned things up. I'm unexcited about adding overhead for crash recovery to the normal connection code path when it's not necessary, and even more so when it's neither necessary nor sufficient. regards, tom lane
On Thu, Mar 8, 2018 at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > We already have autovacuum taking care of that, and as I stated, asking > backends to do it is provably insufficient. The right path is to make > autovacuum cover the cases it's missing today. Well, your counter-proposal of teaching autovacuum to recognize the case where the backend and the orphaned schema are in different databases is also provably insufficient. We have to make more than one change here to really fix this, and the two changes that Tsunakawa-san proposed, taken together, are sufficient; either one by itself is not. Doing only #2 out of his proposals, as you proposed, is better than doing nothing, but it's not a complete fix. >> I don't >> really share your concern about performance; one extra syscache lookup >> at backend startup isn't going to break the bank. > > If it were just that, I wouldn't be worried either. But it's not. > Once the pg_namespace row exists, which it will in an installation > that's been in use for any length of time, you're going to have > to actively search for entries in that schema. I'm not sure how > expensive a performDeletion() scan that finds nothing to do really > is, but for sure it's more than the syscache lookup you expended to > find the pg_namespace row. True, but that's a rare scenario. Still, I'm not sure we're too far apart on the underlying issue here; we both would prefer, for one reason or another, if autovacuum could just take care of this. But I think that Tsunakawa-san is correct to guess that your proposal might have some race conditions that need to be solved somehow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > So the correct fix is to improve autovacuum's check to discover whether > an old temp table is orphaned, so that it isn't fooled by putative owner > processes that are connected to some other DB. Step 2 of the proposed patch > tries to do that, but it's only half right: we need a change near line 2264 > not only line 2090. (Evidently, we need either a comment that the logic > is repeated, or else refactor so that there's only one copy.) > > Now, you can argue that autovacuum's check can be fooled by an "owner" > backend that is connected to the current DB but hasn't actually taken > possession of its assigned temp schema (and hence the table in question > really is orphaned after all). This edge case could be taken care of by > having backends clear their temp schema immediately, as in step 1 of the > patch. But I still think that that is an expensive way to catch what would > be a really infrequent case. Perhaps a better idea is to have backends > advertise in PGPROC whether they have taken possession of their assigned > temp schema, and then autovacuum could ignore putative owners that aren't > showing that flag. Or we could just do nothing about that, on the grounds > that nobody has shown the case to be worth worrying about. > Temp schemas that are sufficiently low-numbered to be likely to have an > apparent owner are also likely to get cleaned out by actual temp table > creation reasonably often, so I'm not at all convinced that we need a > mechanism that covers this case. We do need something for the cross-DB > case though, because even a very low-numbered temp schema can be problematic > if it's in a seldom-used DB, which seems to be the case that was complained > of originally. > > On the whole, my vote is to fix and apply step 2, and leave it at that. Done. It seems to work well. Regards Takayuki Tsunakawa
On Tue, Mar 13, 2018 at 08:08:48AM +0000, Tsunakawa, Takayuki wrote: > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] >> On the whole, my vote is to fix and apply step 2, and leave it at that. Yeah, I have been thinking about idea 1 mentioned above, or in short, cleaning up the temporary namespace at connection start instead of at its first use, and while that would make the cleanup more aggressive, it could be possible as well that only having autovacuum do the work could be enough, so I am +-0 on this idea. > Done. It seems to work well. I have looked at the patch proposed. + /* Does the backend own the temp schema? */ + if (proc->tempNamespaceId != namespaceID) + return false; I have a very hard time believing that this is safe lock-less, and a spin lock would be enough it seems. + /* Is the backend connected to this database? */ + if (proc->databaseId != MyDatabaseId) + return false; Wouldn't it be more interesting to do this cleanup as well if the backend is *not* connected to the database autovacuum'ed? This would make the cleanup more aggressive, which is better. -- Michael
From: Michael Paquier [mailto:michael@paquier.xyz] > + /* Does the backend own the temp schema? */ > + if (proc->tempNamespaceId != namespaceID) > + return false; > I have a very hard time believing that this is safe lock-less, and a spin > lock would be enough it seems. The lwlock in BackendIdGetProc() flushes the CPU cache so that proc->tempNamespaceId reflects the latest value. Or, do you mean another spinlock elsewhere? > + /* Is the backend connected to this database? */ > + if (proc->databaseId != MyDatabaseId) > + return false; > Wouldn't it be more interesting to do this cleanup as well if the backend > is *not* connected to the database autovacuum'ed? This would make the > cleanup more aggressive, which is better. IIUC, the above code does what you suggest. proc->databaseId is the database the client is connected to, and MyDatabaseId is the database being autovacuumed (by this autovacuum worker.) Regards Takayuki Tsunakawa
Hello. At Wed, 18 Jul 2018 07:34:10 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FA538FD@G01JPEXMBYT05> > From: Michael Paquier [mailto:michael@paquier.xyz] > > + /* Does the backend own the temp schema? */ > > + if (proc->tempNamespaceId != namespaceID) > > + return false; > > I have a very hard time believing that this is safe lock-less, and a spin > > lock would be enough it seems. > > The lwlock in BackendIdGetProc() flushes the CPU cache so that proc->tempNamespaceId reflects the latest value. Or, do you mean another spinlock elsewhere? It seems to be allowed that the series of checks on *proc results in false-positive, which is the safer side for the usage, even if it is not atomically updated. Actually ->databaseId is written without taking a lock. > > + /* Is the backend connected to this database? */ > > + if (proc->databaseId != MyDatabaseId) > > + return false; > > Wouldn't it be more interesting to do this cleanup as well if the backend > > is *not* connected to the database autovacuum'ed? This would make the > > cleanup more aggressive, which is better. > > IIUC, the above code does what you suggest. proc->databaseId is the database the client is connected to, and MyDatabaseId is the database being autovacuumed (by this autovacuum worker.) It seems that you're right. But maybe the comments in backend_uses_temp_namespace should be written inverted. And the comment on the last condition needs to be more detailed. For example: | /* Not used if the backend is not on connection */ | if (proc == NULL) | .. | /* Not used if the backend is on another database */ | if (proc->databaseId != MyDatabaseId) | .. | /* | * Not used if this backend has no temp namespace or one with | * a different OID. However we could reuse the namespace OID, since | * tables in the old namespace have been cleaned up at the | * creation time. See InitTempTableNamespace. | */ | if (proc->tempNamespaceId != namespaceId) ...(is it correct?) backend_uses_temp_namespace is taking two parameters but the first one is always derived from the second. backendID doesn't seem to be needed outside so just one parameter namespaceID is needed. There may be a case where classForm->relnamespace is not found in the catalog or nspname is corrupt, but the namespace is identified as inactive and orphaned tables are cleaned up in that case, so we don't need to inform users of that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
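For illustration, the signature simplification suggested here would look roughly like this (a sketch only; the static qualifier and exact spellings are assumptions based on the fragments quoted in this thread):

    /* As posted: two parameters, the first always derived from the second */
    static bool backend_uses_temp_namespace(int backendID, Oid namespaceID);

    /* As suggested: derive the backend ID internally, for example with
     * GetTempNamespaceBackendId(namespaceID), so callers pass only the OID */
    static bool backend_uses_temp_namespace(Oid namespaceID);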
On Thu, Jul 26, 2018 at 07:05:11PM +0900, Kyotaro HORIGUCHI wrote: > At Wed, 18 Jul 2018 07:34:10 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FA538FD@G01JPEXMBYT05> >> From: Michael Paquier [mailto:michael@paquier.xyz] >>> + /* Does the backend own the temp schema? */ >>> + if (proc->tempNamespaceId != namespaceID) >>> + return false; >>> I have a very hard time believing that this is safe lock-less, and a spin >>> lock would be enough it seems. >> >> The lwlock in BackendIdGetProc() flushes the CPU cache so that >> proc->tempNamespaceId reflects the latest value. Or, do you mean >> another spinlock elsewhere? > > It seems to be allowed that the series of checks on *proc results > in false-positive, which is the safer side for the usage, even if it > is not atomically updated. Actually ->databaseId is written > without taking a lock. Well, from postinit.c there are a couple of assumptions in this case, so it is neither white nor black: /* * Now we can mark our PGPROC entry with the database ID. * * We assume this is an atomic store so no lock is needed; though actually * things would work fine even if it weren't atomic. Anyone searching the * ProcArray for this database's ID should hold the database lock, so they * would not be executing concurrently with this store. A process looking * for another database's ID could in theory see a chance match if it read * a partially-updated databaseId value; but as long as all such searches * wait and retry, as in CountOtherDBBackends(), they will certainly see * the correct value on their next try. */ MyProc->databaseId = MyDatabaseId; Anyway, I have spent some time on this patch, and it does a rather bad job of explaining why it is fine to assume that the update can happen lock-less; the central part of that reasoning is that autovacuum cannot see the newly created temporary schema, nor any objects in it, when it scans pg_class until the creation gets committed, and the tracking field is updated in PGPROC before the commit. > backend_uses_temp_namespace is taking two parameters but the > first one is always derived from the second. backendID doesn't > seem to be needed outside so just one parameter namespaceID is > needed. Yes, that reduces the number of calls to GetTempNamespaceBackendId. autovacuum.c is a pretty bad place for this stuff, as namespace.c holds all the logic related to temporary namespaces, so I renamed the routine to isTempNamespaceInUse and moved it there. The patch I have now is attached. I have not been able to test it much, particularly with orphaned temp tables and autovacuum, which is something I'll try to look at this weekend or perhaps at the beginning of next week, heading toward a commit if I am fine enough with it. Please feel free to look at it for the time being. Thanks, -- Michael
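Pieced together from the checks quoted upthread, the renamed routine would look roughly like this; a sketch under the assumptions discussed in this thread, not necessarily the exact committed code:

    bool
    isTempNamespaceInUse(Oid namespaceId)
    {
        PGPROC     *proc;
        int         backendId;

        /* Map the pg_temp_N namespace OID back to backend ID N */
        backendId = GetTempNamespaceBackendId(namespaceId);

        /* No backend can be using this namespace */
        if (backendId == InvalidBackendId)
            return false;

        /* Our own temporary namespace is obviously in use */
        if (backendId == MyBackendId)
            return true;

        /* If no live backend occupies that slot, the namespace is orphaned */
        proc = BackendIdGetProc(backendId);
        if (proc == NULL)
            return false;

        /* Is the backend connected to this database? */
        if (proc->databaseId != MyDatabaseId)
            return false;

        /* Does the backend own the temp schema? */
        if (proc->tempNamespaceId != namespaceId)
            return false;

        /* Yes, the namespace is in use. */
        return true;
    }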
Thank you, Michael and Horiguchi-san, From: Michael Paquier [mailto:michael@paquier.xyz] > autovacuum.c is a pretty bad place for this stuff, as namespace.c holds all the > logic related to temporary namespaces, so I renamed the routine to > isTempNamespaceInUse and moved it there. I don't have a strong opinion, but I wonder which of namespace.c or autovacuum.c is suitable, because isTempNamespaceInUse is specific to autovacuum. Regards Takayuki Tsunakawa
On Fri, Jul 27, 2018 at 08:27:26AM +0000, Tsunakawa, Takayuki wrote: > I don't have a strong opinion, but I wonder which of namespace.c or > autovacuum.c is suitable, because isTempNamespaceInUse is specific to > autovacuum. I think that there is also a point in allowing other backends to use it as well, so I left it in namespace.c. I have been doing more testing with this patch today. In order to catch code paths where this is triggered I have for example added an infinite loop in do_autovacuum when finding a temp table which exits once a given on-disk file is found. This leaves plenty of time to attach autovacuum to a debugger, and play with other sessions in parallel, so I have checked the transactional assumption this patch relied on, and tested down to v10 as that's where removal of orphaned temp relations has been made more aggressive. This patch can also be applied cleanly there. The check on MyDatabaseId does not actually matter much as the same temporary namespace OID would get reused only after an OID wraparound with a backend using a different database but I left it anyway as that's more consistent to me to check for database and namespace, and added a sanity check to make sure that the routine gets called only by a process connected to a database. I am going to be out of town for a couple of days, and the next minor release planned is close by, so I am not pushing that today, but I'd like to do so after the next minor release if there are no objections. Attached is a patch with a proposal of commit message. -- Michael
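As a sketch of that debugging trick (a throwaway hack dropped into do_autovacuum where the orphaned temp table is found; the sentinel file name is made up, and the usual unistd.h access()/F_OK is assumed to be available, as it is in server code):

    /* Park the autovacuum worker here until a sentinel file shows up,
     * leaving plenty of time to attach a debugger to the worker and to
     * play with concurrent sessions. */
    while (access("/tmp/autovacuum_resume", F_OK) != 0)
        pg_usleep(1000000L);    /* recheck once per second */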
At Mon, 30 Jul 2018 16:59:16 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180730075916.GB2878@paquier.xyz> > On Fri, Jul 27, 2018 at 08:27:26AM +0000, Tsunakawa, Takayuki wrote: > > I don't have a strong opinion, but I wonder which of namespace.c or > > autovacuum.c is suitable, because isTempNamespaceInUse is specific to > > autovacuum. > > I think that there is also a point in allowing other backends to use it > as well, so I left it in namespace.c. I have been doing more testing > with this patch today. In order to catch code paths where this is +1 for namespace.c from me. > triggered I have for example added an infinite loop in do_autovacuum > when finding a temp table which exits once a given on-disk file is > found. This leaves plenty of time to attach autovacuum to a debugger, and > play with other sessions in parallel, so I have checked the > transactional assumption this patch relied on, and tested down to v10 as > that's where removal of orphaned temp relations has been made more > aggressive. This patch can also be applied cleanly there. > > The check on MyDatabaseId does not actually matter much as the same > temporary namespace OID would get reused only after an OID wraparound > with a backend using a different database but I left it anyway as that's > more consistent to me to check for database and namespace, and added a > sanity check to make sure that the routine gets called only by a process > connected to a database. > > I am going to be out of town for a couple of days, and the next minor > release planned is close by, so I am not pushing that today, but I'd > like to do so after the next minor release if there are no objections. > Attached is a patch with a proposal of commit message. "int backendID" is left in autovacuum.c as an unused variable. + * decide if a temporary file entry can be marked as orphaned or not. Even + * if this is an atomic operation, this can be safely done lock-less as no "Even if this is *not* an atomic operation" ? + * Mark the proc entry as owning this namespace which autovacuum uses to + * decide if a temporary file entry can be marked as orphaned or not. Even + * if this is an atomic operation, this can be safely done lock-less as no + * temporary relations associated to it can be seen yet by autovacuum when + * scanning pg_class. + */ + MyProc->tempNamespaceId = namespaceId; The comment looks wrong. After a crash with temp tables present, pg_class has orphaned temp relations, and the namespace can belong to reconnected backends, especially for low-numbered backends. So autovacuum may look at the shared variable while the reconnected backend is writing it. The only harmful misbehavior is a false "unused" judgement. This happens only when the same namespace ID is reused, and in that case the temp namespace has already been cleaned up by InitTempTableNamespace. In the case of a false "used" judgement, later autovacuum rounds will do the cleanup correctly. In short, I think it is safely done lock-less, but the reason is not that there are no concurrent reads, but that transient reads do no harm (currently..). | * Mark the proc entry as owning this namespace which autovacuum uses to | * decide if a temporary file entry can be marked as orphaned or not. Even | * if this is not an atomic operation, this can be safely done lock-less as | * a transient read doesn't let autovacuum go wrong. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Jul 31, 2018 at 01:39:07PM +0900, Kyotaro HORIGUCHI wrote: > "int backendID" is left in autovacuum.c as an unused variable. Fixed. > "Even if this is *not* an atomic operation" ? Yeah, I am reworking this comment a bit more to map with the other PGPROC fields. > + * Mark the proc entry as owning this namespace which autovacuum uses to > + * decide if a temporary file entry can be marked as orphaned or not. Even > + * if this is an atomic operation, this can be safely done lock-less as no > + * temporary relations associated to it can be seen yet by autovacuum when > + * scanning pg_class. > + */ > + MyProc->tempNamespaceId = namespaceId; > > The comment looks wrong. After a crash with temp tables present, > pg_class has orphaned temp relations, and the namespace can belong to > reconnected backends, especially for low-numbered backends I don't quite understand your comment. InitProcess() initializes the field, so if a backend reconnects and uses the same proc slot as a backend which had a temporary namespace, then you would discard orphaned tables. Anyway, I have reworked it as such: + /* + * Mark MyProc as owning this namespace which other processes can use to + * decide if a temporary namespace is in use or not. We assume that + * assignment of namespaceId is an atomic operation. Even if it is not, + * there are no temporary relations associated to it and the temporary + * namespace creation is not committed yet, so none of its contents can + * be seen yet if scanning pg_class or pg_namespace. + */ As new minor versions have been tagged, I am planning to commit this patch soon with some tweaks for the comments. As this introduces a new field to PGPROC, back-patching the thing as-is would cause an ABI breakage. Are folks here fine with the new field added to the bottom of the structure for the backpatched versions, including v11? I have found commit 13752743, which has also added a new field called isBackgroundWorker in the middle of PGPROC in a released branch, which looks to me like an ABI breakage... -- Michael
Hi, On 2018-07-30 16:59:16 +0900, Michael Paquier wrote: > + /* > + * Mark MyProc as owning this namespace which other processes can use to > + * decide if a temporary namespace is in use or not. Even if this is an > + * atomic operation, this can be safely done lock-less as no temporary > + * relations associated to it can be seen yet if scanning pg_class. > + */ > + MyProc->tempNamespaceId = namespaceId; I can't parse this. "Even if this is an atomic operation, this can be safely done lock-less" - that seems like a contradictory sentence. Is there a "not" missing? Also, this seems like insufficient reasoning. What guarantees the visibility of the flag? You're going to have to talk about externally implied memory ordering here. Or add explicit barriers - the latter is probably preferable. Greetings, Andres Freund
On 2018-Aug-08, Michael Paquier wrote: > As this introduces a new > field to PGPROC, back-patching the thing as-is would cause an ABI > breakage. Are folks here fine with the new field added to the bottom of > the structure for the backpatched versions, including v11? I have found > commit 13752743, which has also added a new field called > isBackgroundWorker in the middle of PGPROC in a released branch, which > looks to me like an ABI breakage... Unnoticed ABI breaks make my hair stand on end. I suppose if we didn't know about 13752743 earlier, then not much outside code relies on PGPROC, or at least its members after isBackgroundWorker. I wouldn't move it now (I suppose anyone who cared has already adjusted for it), but please do put your new member at the end in backbranches. I'm unsure about pg11 -- is it a backbranch already or not? Since we've released beta3 already, ISTM we should consider it so. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
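To illustrate the ABI point (a sketch; everything except tempNamespaceId is a placeholder): extensions compiled against a released branch address PGPROC members by their offsets, so inserting a field in the middle shifts every later member, while appending keeps the old layout intact.

    typedef struct PGPROC
    {
        /* ... all pre-existing members, offsets unchanged ... */

        /*
         * Appended at the end in back-branches, so that code compiled
         * against the old struct layout still finds every existing
         * member at the offset it expects.
         */
        Oid         tempNamespaceId;
    } PGPROC;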
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Unnoticed ABI breaks make my hair stand on end. Yeah. > I'm unsure about pg11 -- is it a backbranch already or not? Since we've > released beta3 already, ISTM we should consider it so. No. See https://www.postgresql.org/message-id/12725.1533737052%40sss.pgh.pa.us At this point I'd judge it's still more valuable to minimize differences between v11 and v12 than to avoid ABI breakages in v11. It's unlikely this would be the last ABI break before 11.0 anyway. regards, tom lane
On Wed, Aug 08, 2018 at 10:50:34AM -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> I'm unsure about pg11 -- is it a backbranch already or not? Since we've >> released beta3 already, ISTM we should consider it so. > > No. See > https://www.postgresql.org/message-id/12725.1533737052%40sss.pgh.pa.us > > At this point I'd judge it's still more valuable to minimize differences > between v11 and v12 than to avoid ABI breakages in v11. It's unlikely > this would be the last ABI break before 11.0 anyway. Thanks! Based on that I will get this patch into HEAD and 11 pretty soon. v10 is definitely out of scope, even if it would have been interesting to make autovacuum more aggressive there as well, as that's where the last improvements have been done. -- Michael
Hi Andres, (Not my intention to miss your message, I have just noticed it.) On Wed, Aug 08, 2018 at 01:41:27AM -0700, Andres Freund wrote: > I can't parse this. "Even if this is an atomic operation, this can be > safely done lock-less" - that seems like a contradictory sentence. Is > there a "not" missing? Yes, a "not" has gone missing here. I reworked the comment block as mentioned upthread. > Also, this seems like insufficient reasoning. What guarantees the > visibility of the flag? You're going to have to talk about externally > implied memory ordering here. Or add explicit barriers - the latter is > probably preferrable. Well, we use BackendIdGetProc() in this case, where we could finish with information out-of-date pretty quickly, and there is no special reasoning for backendId and databaseId for autovacuum but... Perhaps you could explain more what you have in mind? And it is not like this relies on the number of elements in PGPROC. -- Michael
On August 9, 2018 12:41:17 AM GMT+05:30, Michael Paquier <michael@paquier.xyz> wrote: >Hi Andres, > >(Not my intention to miss your message, I have just noticed it.) > >On Wed, Aug 08, 2018 at 01:41:27AM -0700, Andres Freund wrote: >> I can't parse this. "Even if this is an atomic operation, this can be >> safely done lock-less" - that seems like a contradictory sentence. Is >> there a "not" missing? > >Yes, a "not" has gone missing here. I reworked the comment block as >mentioned upthread. > >> Also, this seems like insufficient reasoning. What guarantees the >> visibility of the flag? You're going to have to talk about externally >> implied memory ordering here. Or add explicit barriers - the latter >is >> probably preferrable. > >Well, we use BackendIdGetProc() in this case, where we could finish >with >information out-of-date pretty quickly, and there is no special >reasoning for backendId and databaseId for autovacuum but... Perhaps >you could explain more what you have in mind? And it is not like this >relies on the number of elements in PGPROC. My point is that the documentation isn't sufficient. Not that there's an active problem. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Aug 09, 2018 at 08:32:32AM +0530, Andres Freund wrote: > My point is that the documentation isn't sufficient. Not that there's an active problem. OK. Attached is the latest version of the patch that I was preparing for commit. On top of isTempNamespaceInUse I have this note: + * Note: this can be used while scanning relations in pg_class to detect + * orphaned temporary tables or namespaces with a backend connected to a + * given database. Would you be fine if I add an extra note like what's in BackendIdGetProc? Say "The result may be out of date quickly, so the caller must be careful how to handle this information." -- Michael
On 2018-08-09 09:00:29 +0200, Michael Paquier wrote: > On Thu, Aug 09, 2018 at 08:32:32AM +0530, Andres Freund wrote: > > My point is that the documentation isn't sufficient. Not that there's an active problem. > > OK. Attached is the latest version of the patch that I was > preparing for commit. > > On top of isTempNamespaceInUse I have this note: > + * Note: this can be used while scanning relations in pg_class to detect > + * orphaned temporary tables or namespaces with a backend connected to a > + * given database. > > Would you be fine if I add an extra note like what's in > BackendIdGetProc? Say "The result may be out of date quickly, so the > caller must be careful how to handle this information." That's a caveat, not a documentation of the memory ordering / concurrency. You somewhere need a comment to the effect of "We are guaranteed to see a recent enough value of ->tempNamespace because anybody creating a temporary table acquires a lock - serving as a memory barrier - during the creation of said table, after assigning the tempNamespace variable. At the same time, before considering dropping a temporary table as orphaned, we acquire a lock and recheck tempNamespace afterwards". Note that I'm not explaining the concrete model you have here, but the way you could explain a theoretical one. - Andres
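A minimal sketch of the explicit-barrier alternative, using the pg_write_barrier()/pg_read_barrier() primitives (whether the ordering implied by existing lock acquisitions already suffices is exactly what is being discussed here):

    /* Backend side: publish the namespace before its contents can be seen */
    MyProc->tempNamespaceId = namespaceId;
    pg_write_barrier();         /* pairs with the reader's barrier below */
    /* ... create temp tables, commit ... */

    /* Autovacuum side: after spotting a temp table in pg_class */
    pg_read_barrier();          /* see at least the store made before commit */
    if (proc->tempNamespaceId == namespaceId)
        ;                       /* treat the namespace as still in use */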
On Thu, Aug 09, 2018 at 02:29:54AM -0700, Andres Freund wrote: > On 2018-08-09 09:00:29 +0200, Michael Paquier wrote: >> Would you be fine if I add an extra note like what's in >> BackendIdGetProc? Say "The result may be out of date quickly, so the >> caller must be careful how to handle this information." > > That's a caveat, not a documentation of the memory ordering / > concurrency. I think that I am going to add this note anyway in the new routine. > You somewhere need a comment to the effect of "We are > guaranteed to see a recent enough value of ->tempNamespace because > anybody creating a temporary table acquires a lock - serving as a memory > barrier - during the creation of said table, after assigning the > tempNamespace variable. At the same time, before considering dropping a > temporary table as orphaned, we acquire a lock and recheck tempNamespace > afterwards". Note that I'm not explaining the concrete model you have > here, but the way you could explain a theoretical one. Okay, here is an idea... When assigning the value I have now that: + /* + * Mark MyProc as owning this namespace which other processes can use to + * decide if a temporary namespace is in use or not. We assume that + * assignment of namespaceId is an atomic operation. Even if it is not, + * there is no visible temporary relations associated to it and the + * temporary namespace creation is not committed yet, so none of its + * contents should be seen yet if scanning pg_class or pg_namespace. + */ I actually have tried to mention what you are willing to see in the comments with the last sentence. So that is awkward :) I would propose to reword the last sentence of the patch as follows then: "Even if it is not atomic, the temporary relation which resulted in the creation of this temporary namespace is still locked until the current transaction commits, so it would not be accessible yet." When resetting the value on abort I have that: + /* + * Reset the temporary namespace flag in MyProc. The creation of + * the temporary namespace has failed for some reason and should + * not be seen by other processes as it has not been committed + * yet, hence this would be fine even if not atomic, still we + * assume that it is an atomic assignment. + */ Hence I would propose the following wording for this part: "Reset the temporary namespace flag in MyProc. We assume that this operation is atomic, however it would be fine even if not atomic as the temporary table which created this namespace is still locked until this transaction aborts so it would not be visible yet." Better ideas are of course welcome. -- Michael
On Thu, Aug 09, 2018 at 06:50:47PM +0200, Michael Paquier wrote: > Better ideas are of course welcome. I have gone back-and-forth on this patch for the last couple of days, reworded the comment blocks to outline the point Andres has been making, and I have finally been able to push it and back-patch it down to v11, but not further down, as we don't want an ABI breakage in already released branches. -- Michael
On 2018-08-09 18:50:47 +0200, Michael Paquier wrote: > On Thu, Aug 09, 2018 at 02:29:54AM -0700, Andres Freund wrote: > + /* > + * Mark MyProc as owning this namespace which other processes can use to > + * decide if a temporary namespace is in use or not. We assume that > + * assignment of namespaceId is an atomic operation. Even if it is not, > + * there is no visible temporary relations associated to it and the > + * temporary namespace creation is not committed yet, so none of its > + * contents should be seen yet if scanning pg_class or pg_namespace. > + */ > I actually have tried to mention what you are willing to see in the > comments with the last sentence. So that is awkward :) I don't know what you're trying to say with this. > I would propose to reword the last sentence of the patch as follows > then: > "Even if it is not atomic, the temporary relation which resulted in the > creation of this temporary namespace is still locked until the current > transaction commits, so it would not be accessible yet." > > When resetting the value on abort I have that: > + /* > + * Reset the temporary namespace flag in MyProc. The creation of > + * the temporary namespace has failed for some reason and should > + * not be seen by other processes as it has not been committed > + * yet, hence this would be fine even if not atomic, still we > + * assume that it is an atomic assignment. > + */ > > Hence I would propose the following wording for this part: > "Reset the temporary namespace flag in MyProc. We assume that this > operation is atomic, however it would be fine even if not atomic as the > temporary table which created this namespace is still locked until this > transaction aborts so it would not be visible yet." I don't think that comment, nor the comment that you ended up committing: + + /* + * Reset the temporary namespace flag in MyProc. We assume that + * this operation is atomic. Even if it is not, the temporary + * table which created this namespace is still locked until this + * transaction aborts so it would not be visible yet, acting as a + * barrier. + */ is actually correct. *Holding* a lock isn't a memory barrier. Acquiring or releasing one is. Greetings, Andres Freund
From 1e9ba9fa9b172bda1ea54b1f3be1b930973ff45f Mon Sep 17 00:00:00 2001 From: Michael Paquier <michael@paquier.xyz> Date: Wed, 8 Aug 2018 19:45:31 +0200 Subject: [PATCH] Make autovacuum more aggressive to remove orphaned temp tables I was here to complain about this piece: > @@ -3975,6 +4033,15 @@ AtEOSubXact_Namespace(bool isCommit, SubTransactionId mySubid, > myTempNamespace = InvalidOid; > myTempToastNamespace = InvalidOid; > baseSearchPathValid = false; /* need to rebuild list */ > + > + /* > + * Reset the temporary namespace flag in MyProc. The creation of > + * the temporary namespace has failed for some reason and should > + * not be seen by other processes as it has not been committed > + * yet, hence this would be fine even if not atomic, still we > + * assume that it is an atomic assignment. > + */ > + MyProc->tempNamespaceId = InvalidOid; > } > } But after looking carefully, I realized that this only runs if the subtransaction being aborted is the same that set up the temp namespace (or one of its children) in the first place; therefore it's correct to tear it down unconditionally. I was afraid of this clobbering a temp namespace that had been set up by some other branch of the transaction tree. However, that's not a concern because we compare the subTransactionId (and update the value when propagating to parent subxact.) I do share Andres' concerns about the wording of the comment. I would say something like /* * Reset the temporary namespace flag in MyProc. We assume this to be * an atomic assignment. * * Because this subtransaction is rolling back, the pg_namespace * row is not visible to anyone else anyway, but that doesn't matter: * it's not a problem if objects contained in this namespace are removed * concurrently. */ The fact of assignment being atomic and the fact of the pg_namespace row being visible are separately important. You care about it being atomic because it means you must not have someone read "16" (0x10) when you were partway through removing the value "65552" (0x10010), thus causing that someone to remove namespace 16. And you care about the visibility of the pg_namespace row because of whether you're worried about a third party removing the tables from that namespace or not: since the subxact is aborting, you are not. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 13, 2018 at 02:56:16AM -0700, Andres Freund wrote: > On 2018-08-09 18:50:47 +0200, Michael Paquier wrote: > I don't think that comment, nor the comment that you ended up > committing: > + > + /* > + * Reset the temporary namespace flag in MyProc. We assume that > + * this operation is atomic. Even if it is not, the temporary > + * table which created this namespace is still locked until this > + * transaction aborts so it would not be visible yet, acting as a > + * barrier. > + */ > > is actually correct. *Holding* a lock isn't a memory barrier. Acquiring > or releasing one is. I cannot guess what you think, but would something like the attached be more appropriate? Both things look rather similar to me now, though likely they do not to you. -- Michael
If writing an OID were not atomic, the assignment would be really dangerous. I don't think your proposed update is good. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 13, 2018 at 02:15:07PM -0300, Alvaro Herrera wrote: > If writing an OID were not atomic, the assignment would be really > dangerous. I don't think your proposed update is good. OK, I am withdrawing this one then. -- Michael
On Mon, Aug 13, 2018 at 01:53:18PM -0300, Alvaro Herrera wrote: > I do share Andres' concerns about the wording of the comment. I would say > something like > > /* > * Reset the temporary namespace flag in MyProc. We assume this to be > * an atomic assignment. > * > * Because this subtransaction is rolling back, the pg_namespace > * row is not visible to anyone else anyway, but that doesn't matter: > * it's not a problem if objects contained in this namespace are removed > * concurrently. > */ > The fact of assignment being atomic and the fact of the pg_namespace row > being visible are separately important. You care about it being atomic > because it means you must not have someone read "16" (0x10) when you > were partway through removing the value "65552" (0x10010), thus causing that > someone to remove namespace 16. And you care about the visibility of the > pg_namespace row because of whether you're worried about a third party > removing the tables from that namespace or not: since the subxact is > aborting, you are not. I was thinking about adding "Even if it is not atomic" or such at the beginning of the paragraph, but at the end your phrasing sounds better to me. So I have hacked up the attached, which also reworks the comment in InitTempTableNamespace in the same spirit. Thoughts? -- Michael
On Tue, Aug 14, 2018 at 06:53:32PM +0200, Michael Paquier wrote: > I was thinking about adding "Even if it is not atomic" or such at the > beginning of the paragraph, but at the end your phrasing sounds better > to me. So I have hacked up the attached, which also reworks the comment > in InitTempTableNamespace in the same spirit. Thoughts? It has been close to one week since I sent this patch. Anything? I don't like to keep folks unhappy about anything I committed. -- Michael
On 2018-Aug-20, Michael Paquier wrote: > On Tue, Aug 14, 2018 at 06:53:32PM +0200, Michael Paquier wrote: > > I was thinking about adding "Even if it is not atomic" or such at the > > beginning of the paragraph, but at the end your phrasing sounds better > > to me. So I have hacked up the attached, which also reworks the comment > > in InitTempTableNamespace in the same spirit. Thoughts? > > It has been close to one week since I sent this patch. Anything? I > don't like to keep folks unhappy about anything I committed. Seems reasonable. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 20, 2018 at 10:54:28AM -0300, Alvaro Herrera wrote: > Seems reasonable. Thanks Álvaro, pushed. -- Michael