Thread: recovering from "found xmin ... from before relfrozenxid ..."
Hi,

A number of EDB customers have had this error crop up on their tables for reasons that we have usually not been able to determine. In many cases, it's probably down to things like running buggy old releases for a long time before upgrading, or bad backup and recovery procedures. It's more than possible that there are still-unfixed server bugs, but I do not have any compelling evidence of such bugs at this time. Unfortunately, once you're in this situation, it's kind of hard to find your way out of it. There are a few problems:

1. There's nothing to identify the tuple that has the problem, and no way to know how many more of them there might be. Back-patching b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first part of this.

2. In some other, similar situations, e.g. where the tuple data is garbled, it's often possible to get out from under the problem by deleting the tuple at issue. But I think that doesn't necessarily fix anything in this case.

3. We've had some success with using a PL/pgSQL loop with an EXCEPTION block to extract all the accessible tuples from the table. Then you can truncate the original table and reinsert the data. But this is slow, so it stinks if the table is big, and it's not a viable approach if the table in question is a system catalog table -- at least not if it's something critical like pg_class.

I realize somebody's probably going to say "well, you shouldn't try to repair a database that's in this state, you shouldn't let it happen in the first place, and if it does happen, you should track the root cause to the ends of the earth." But I think that's a completely impractical approach. I at least have no idea how I'm supposed to figure out when and how a bad relfrozenxid ended up in the table, and by the time the problem is discovered after an upgrade the problem that caused it may be quite old.
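The PL/pgSQL salvage loop described in point 3 can be sketched roughly as follows. This is only an illustration under assumptions: the table names here are hypothetical, and in practice the driving scan may need to iterate over TID or key ranges explicitly so that it does not itself die on the corrupted tuple.

```sql
-- Hedged sketch of the "copy out what's readable" approach.
-- All names (broken_table, salvaged) are hypothetical.
CREATE TABLE salvaged (LIKE broken_table);

DO $$
DECLARE
    t tid;
BEGIN
    -- Visit candidate tuples one at a time so a failure on one tuple
    -- doesn't abort the whole copy.
    FOR t IN SELECT ctid FROM broken_table LOOP
        BEGIN
            INSERT INTO salvaged
                SELECT * FROM broken_table WHERE ctid = t;
        EXCEPTION WHEN OTHERS THEN
            RAISE WARNING 'skipping unreadable tuple at %', t;
        END;
    END LOOP;
END;
$$;

-- Afterwards:
-- TRUNCATE broken_table;
-- INSERT INTO broken_table SELECT * FROM salvaged;
```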
Moreover, not everyone is as interested in an extended debugging exercise as they are in getting the system working again, and VACUUM failing repeatedly is a pretty serious problem.

Therefore, one of my colleagues has - at my request - created a couple of functions called heap_force_kill() and heap_force_freeze() which take an array of TIDs. The former truncates them all to dead line pointers. The latter resets the infomask and xmin to make the xmin frozen. (It should probably handle the xmax too; not sure that the current version does that, but it's easily fixed if not.) The intention is that you can use these to either get rid of, or get access to, tuples whose visibility information is corrupted for whatever reason.

These are pretty sharp tools; you could corrupt a perfectly-good table by incautious use of them, or destroy a large amount of data. You could, for example, force-freeze a tuple created by a transaction which added a column, inserted data, and rolled back; that would likely be disastrous. However, in the cases that I'm thinking about, disaster has already struck, and something that you can use to get things back to a saner state is better than just leaving the table perpetually broken. Without something like this, the backup plan is probably to shut down the server and try to edit the pages using a perl script or something, but that seems clearly worse.

So I have these questions:

- Do people think it would be smart/good/useful to include something like this in PostgreSQL?

- If so, how? I would propose a new contrib module that we back-patch all the way, because the VACUUM errors were back-patched all the way, and there seems to be no advantage in making people wait 5 years for a new version that has some kind of tooling in this area.

- Any ideas for additional things we should include, or improvements on the sketch above?

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
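For concreteness, usage of the proposed functions would presumably look something like the following. The exact signatures (relation plus TID array) are an assumption on my part, not something the patch has settled:

```sql
-- Hypothetical invocation; 'damaged_table' and the signatures are assumed.

-- Truncate the named tuples to dead line pointers, removing them:
SELECT heap_force_kill('damaged_table'::regclass,
                       ARRAY['(0,1)', '(0,2)']::tid[]);

-- Or force xmin frozen so the tuples become visible again:
SELECT heap_force_freeze('damaged_table'::regclass,
                         ARRAY['(0,3)']::tid[]);
```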
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> - Do people think it would be smart/good/useful to include something
> like this in PostgreSQL?

Absolutely, yes.

> - If so, how? I would propose a new contrib module that we back-patch
> all the way, because the VACUUM errors were back-patched all the way,
> and there seems to be no advantage in making people wait 5 years for a
> new version that has some kind of tooling in this area.

While I agree that this would be a good and useful new contrib module to have, I don't think it would be appropriate to back-patch it into PG formally.

Unfortunately, that gets into the discussion that's cropped up on a few other threads of late- that we don't have a good place to put extensions which are well maintained/recommended by core PG hackers, and which are able to work with lots of different versions of PG, and are versioned and released independently of PG (and, ideally, built for all the versions of PG that we distribute through our packages). Given the lack of such a place today, I'd at least suggest starting with proposing it as a new contrib module for v14.

> - Any ideas for additional things we should include, or improvements
> on the sketch above?

Not right off-hand, but I will think about it; there could certainly be a lot of very interesting tools in such a toolbox.

Thanks!

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> - If so, how? I would propose a new contrib module that we back-patch
>> all the way, because the VACUUM errors were back-patched all the way,
>> and there seems to be no advantage in making people wait 5 years for a
>> new version that has some kind of tooling in this area.

> While I agree that this would be a good and useful new contrib module to
> have, I don't think it would be appropriate to back-patch it into PG
> formally.

Yeah, I don't care for that either. That's a pretty huge violation of our normal back-patching rules, and I'm not convinced that it's justified. No objection to adding it as a new contrib module.

			regards, tom lane
On Mon, Jul 13, 2020 at 2:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> 1. There's nothing to identify the tuple that has the problem, and no
> way to know how many more of them there might be. Back-patching
> b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> part of this.

I am in favor of backpatching such changes in cases where senior community members feel that it could help with hypothetical undiscovered data corruption issues -- if they're willing to take responsibility for the change. It certainly wouldn't be the first time. A "defense in depth" mindset seems like the right one when it comes to data corruption bugs. Early detection is really important.

> Moreover, not everyone is as
> interested in an extended debugging exercise as they are in getting
> the system working again, and VACUUM failing repeatedly is a pretty
> serious problem.

That's absolutely consistent with my experience. Most users want to get back to business as usual now, while letting somebody else do the hard work of debugging.

> Therefore, one of my colleagues has - at my request - created a couple
> of functions called heap_force_kill() and heap_force_freeze() which
> take an array of TIDs.
>
> So I have these questions:
>
> - Do people think it would be smart/good/useful to include something
> like this in PostgreSQL?

I'm in favor of it.

> - If so, how? I would propose a new contrib module that we back-patch
> all the way, because the VACUUM errors were back-patched all the way,
> and there seems to be no advantage in making people wait 5 years for a
> new version that has some kind of tooling in this area.

I'm in favor of it being *possible* to backpatch tooling that is clearly related to correctness in a fundamental way. Obviously this would mean that we'd be revising our general position on backpatching to allow some limited exceptions around corruption. I'm not sure that this meets that standard, though. It's hardly something that we can expect all that many users to be able to use effectively. I may be biased, but I'd be inclined to permit it in the case of something like amcheck, or pg_visibility, on the grounds that they're more or less the same as the new VACUUM errcontext instrumentation you mentioned. The same cannot be said of something like this new heap_force_kill() stuff.

> - Any ideas for additional things we should include, or improvements
> on the sketch above?

Clearly you should work out a way of making it very hard to accidentally (mis)use. For example, maybe you make the functions check for the presence of a sentinel file in the data directory.

-- 
Peter Geoghegan
Hi,

On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
> 1. There's nothing to identify the tuple that has the problem, and no
> way to know how many more of them there might be. Back-patching
> b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> part of this.

Not fully, I'm afraid. Afaict it doesn't currently tell you the item pointer offset, just the block number, right? We probably should extend it to also include the offset...

> 2. In some other, similar situations, e.g. where the tuple data is
> garbled, it's often possible to get out from under the problem by
> deleting the tuple at issue. But I think that doesn't necessarily fix
> anything in this case.

Huh, why not? That worked in the cases I saw.

> Therefore, one of my colleagues has - at my request - created a couple
> of functions called heap_force_kill() and heap_force_freeze() which
> take an array of TIDs. The former truncates them all to dead line
> pointers. The latter resets the infomask and xmin to make the xmin
> frozen. (It should probably handle the xmax too; not sure that the
> current version does that, but it's easily fixed if not.)

xmax is among the problematic cases IIRC, so yes, it'd be good to fix that.

Greetings,

Andres Freund
On Mon, Jul 13, 2020 at 6:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I don't care for that either. That's a pretty huge violation of our
> normal back-patching rules, and I'm not convinced that it's justified.

I think that our normal back-patching rules are based primarily on the risk of breaking things, and a new contrib module carries a pretty negligible risk of breaking anything that works today. I wouldn't propose to back-patch something on those grounds just as a way of delivering a new feature more quickly, but that's not the intention here.

At least in my experience, un-VACUUM-able tables have gotten several orders of magnitude more common since Andres put those changes in. As far as I can recall, EDB has not had this many instances of different customers reporting the same problem since the 9.3-era multixact issues. So far, this does not rise to that level, but it is by no means a negligible issue, either. I believe it deserves to be taken quite seriously, especially because the existing options for helping customers with this kind of problem are so limited.

Now, if this goes into v14, we can certainly stick it up on github, or put it out there in some other way for users to download, self-compile, and install, but that seems noticeably less convenient for people who need it, and I'm not clear what the benefit to the project is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 13, 2020 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
> Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> pointer offset, just the block number, right? We probably should extend
> it to also include the offset...

Oh, I hadn't realized that limitation. That would be good to fix. It would be even better, I think, if we could have VACUUM proceed with the rest of vacuuming the table, emitting warnings about each instance, instead of blowing up when it hits the first bad tuple, but I think you may have told me sometime that doing so would be, uh, less than straightforward. We probably should refuse to update relfrozenxid/relminmxid when this is happening, but I *think* it would be better to still proceed with dead tuple cleanup as far as we can, or at least have an option to enable that behavior. I'm not positive about that, but not being able to complete VACUUM at all is a FAR more urgent problem than not being able to freeze, even though in the long run the latter is more severe.

> > 2. In some other, similar situations, e.g. where the tuple data is
> > garbled, it's often possible to get out from under the problem by
> > deleting the tuple at issue. But I think that doesn't necessarily fix
> > anything in this case.
>
> Huh, why not? That worked in the cases I saw.

I'm not sure I've seen a case where that didn't work, but I don't see a reason why it couldn't happen. Do you think the code is structured in such a way that a deleted tuple is guaranteed to be pruned even if the XID is old? What if clog has been truncated so that the xmin can't be looked up?

> xmax is among the problematic cases IIRC, so yes, it'd be good to fix
> that.

Thanks for the input.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Oh, I hadn't realized that limitation. That would be good to fix. It
> would be even better, I think, if we could have VACUUM proceed with
> the rest of vacuuming the table, emitting warnings about each
> instance, instead of blowing up when it hits the first bad tuple, but
> I think you may have told me sometime that doing so would be, uh, less
> than straightforward. We probably should refuse to update
> relfrozenxid/relminmxid when this is happening, but I *think* it would
> be better to still proceed with dead tuple cleanup as far as we can,
> or at least have an option to enable that behavior. I'm not positive
> about that, but not being able to complete VACUUM at all is a FAR more
> urgent problem than not being able to freeze, even though in the long
> run the latter is more severe.

+1 for proceeding in this direction, rather than handing users tools that they *will* hurt themselves with.

The more that I think about it, the more I think that the proposed functions are tools for wizards only, and so I'm getting hesitant about having them in contrib at all. We lack a better place to put them, but that doesn't mean they should be there.

			regards, tom lane
On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> +1 for proceeding in this direction, rather than handing users tools
> that they *will* hurt themselves with.
>
> The more that I think about it, the more I think that the proposed
> functions are tools for wizards only, and so I'm getting hesitant
> about having them in contrib at all. We lack a better place to
> put them, but that doesn't mean they should be there.

It's not an either/or; it's a both/and. To recover from this problem, you need to:

1. Be able to tell which tuples are affected.
2. Do something about it.

I think there are a number of strategies that we could pursue around either of those things, and there are better and worse ways of accomplishing them, but having one without the other isn't too great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-07-13 20:47:10 -0400, Robert Haas wrote:
> On Mon, Jul 13, 2020 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
> > Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> > pointer offset, just the block number, right? We probably should extend
> > it to also include the offset...
>
> Oh, I hadn't realized that limitation. That would be good to fix.

Yea. And it'd even be good if we were to end up implementing your suggestion below about continuing to vacuum other tuples.

> It would be even better, I think, if we could have VACUUM proceed with
> the rest of vacuuming the table, emitting warnings about each
> instance, instead of blowing up when it hits the first bad tuple, but
> I think you may have told me sometime that doing so would be, uh, less
> than straightforward.

Yea, it's not that simple to implement. Not impossible either.

> We probably should refuse to update relfrozenxid/relminmxid when this
> is happening, but I *think* it would be better to still proceed with
> dead tuple cleanup as far as we can, or at least have an option to
> enable that behavior. I'm not positive about that, but not being able
> to complete VACUUM at all is a FAR more urgent problem than not being
> able to freeze, even though in the long run the latter is more severe.

I'm hesitant to default to removing tuples once we've figured out that something is seriously wrong. That could easily enough make us plow ahead and delete valuable data on other tuples, even if we'd already detected there's a problem. But I also see the problem you raise. That's not academic: a number of the multixact corruption issues the checks detected IIRC weren't guaranteed to be caught.

> > > 2. In some other, similar situations, e.g. where the tuple data is
> > > garbled, it's often possible to get out from under the problem by
> > > deleting the tuple at issue. But I think that doesn't necessarily fix
> > > anything in this case.
> >
> > Huh, why not? That worked in the cases I saw.
>
> I'm not sure I've seen a case where that didn't work, but I don't see
> a reason why it couldn't happen. Do you think the code is structured
> in such a way that a deleted tuple is guaranteed to be pruned even if
> the XID is old?

I think so, leaving aside some temporary situations perhaps.

> What if clog has been truncated so that the xmin can't be looked up?

That's possible, but probably only in cases where xmin actually committed.

Greetings,

Andres Freund
On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The more that I think about it, the more I think that the proposed
> functions are tools for wizards only, and so I'm getting hesitant
> about having them in contrib at all. We lack a better place to
> put them, but that doesn't mean they should be there.

Also, I want to clarify that in a typical situation in which a customer is facing this problem, I don't have any access to their system. I basically never touch customer systems directly. Typically, the customer sends us log files and a description of the problem and their goals, and we send them back advice or instructions. So it's impractical to imagine that this can be something where you have to know the secret magic wizard password to get access to it. We'd just have to give the customers who need to use this tool said password, and then the jig is up - they can redistribute that password to all the non-wizards on the Internet, if they so choose.

I understand that it's not too great when we give people access to sharp tools and they hurt themselves with said tools. But this is open source. That's how it goes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 13, 2020 at 9:10 PM Andres Freund <andres@anarazel.de> wrote:
> > What if clog has been truncated so that the xmin can't be looked up?
>
> That's possible, but probably only in cases where xmin actually
> committed.

Isn't that the normal case? I'm imagining something like:

- Tuple gets inserted. Transaction commits.
- VACUUM processes table.
- Mischievous fairies mark page all-visible in the visibility map.
- VACUUM runs lots more times, relfrozenxid advances, but without ever looking at the page in question, because it's all-visible.
- clog is truncated, rendering xmin no longer accessible.
- User runs VACUUM disabling page skipping, gets ERROR.
- User deletes offending tuple.
- At this point, I think the tuple is both invisible and unprunable?
- Fairies happy, user sad.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
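The sequence above, expressed as the commands a user would actually run (the fairy step has no SQL equivalent; it stands in for a hypothetical bug that sets the visibility-map bit incorrectly, and the table name is assumed):

```sql
-- Hypothetical reproduction outline; 't' is an assumed table name.
INSERT INTO t VALUES (1);             -- tuple inserted, transaction commits
VACUUM t;                             -- normal vacuum processes the table
-- (a bug corrupts the visibility map here; no SQL step corresponds to it)
VACUUM t;                             -- later vacuums skip the "all-visible"
                                      -- page while relfrozenxid advances,
                                      -- and clog is eventually truncated
VACUUM (DISABLE_PAGE_SKIPPING) t;     -- ERROR: found xmin ... from before
                                      -- relfrozenxid ...
DELETE FROM t WHERE ctid = '(0,1)';   -- tuple is now invisible but,
                                      -- per the question above, unprunable
```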
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The more that I think about it, the more I think that the proposed
>> functions are tools for wizards only, and so I'm getting hesitant
>> about having them in contrib at all. We lack a better place to
>> put them, but that doesn't mean they should be there.

> I understand that it's not too great when we give people access to
> sharp tools and they hurt themselves with said tools. But this is open
> source. That's how it goes.

I think you're attacking a straw man. I'm well aware of how open source works, thanks. What I'm saying is that contrib is mostly seen to be reasonably harmless stuff. Sure, you can overwrite data you didn't want to with adminpack's pg_file_write. But that's the price of having such a capability at all, and in general it's not hard for users to understand both the uses and risks of that function. That statement does not apply to the functions being proposed here. It doesn't seem like they could possibly be safe to use without very specific expert advice --- and even then, we're talking rather small values of "safe". So I wish we had some other way to distribute them than via contrib.

			regards, tom lane
On 2020/07/14 9:41, Robert Haas wrote:
> On Mon, Jul 13, 2020 at 6:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I don't care for that either. That's a pretty huge violation of our
>> normal back-patching rules, and I'm not convinced that it's justified.
>
> I think that our normal back-patching rules are based primarily on the
> risk of breaking things, and a new contrib module carries a pretty
> negligible risk of breaking anything that works today. I wouldn't
> propose to back-patch something on those grounds just as a way of
> delivering a new feature more quickly, but that's not the intention
> here. At least in my experience, un-VACUUM-able tables have gotten
> several orders of magnitude more common since Andres put those changes
> in. As far as I can recall, EDB has not had this many instances of
> different customers reporting the same problem since the 9.3-era
> multixact issues. So far, this does not rise to that level, but it is
> by no means a negligible issue, either. I believe it deserves to be
> taken quite seriously, especially because the existing options for
> helping customers with this kind of problem are so limited.
>
> Now, if this goes into v14, we can certainly stick it up on github, or
> put it out there in some other way for users to download,
> self-compile, and install, but that seems noticeably less convenient
> for people who need it, and I'm not clear what the benefit to the
> project is.

But can updates to this tool fit the release schedule and policy of PostgreSQL? While investigating a problem using this tool, we may want to add a new feature to the tool because it's necessary for the investigation. But users would need to wait for the next minor version release to use that new feature.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hi!

> On Jul 14, 2020, at 02:12, Robert Haas <robertmhaas@gmail.com> wrote:
>
> So I have these questions:
>
> - Do people think it would be smart/good/useful to include something
> like this in PostgreSQL?
>
> - If so, how? I would propose a new contrib module that we back-patch
> all the way

My 0.05₽. At Yandex we used to fix similar corruption problems with our pg_dirty_hands extension [0]. But then we developed our internal pg_heapcheck module (unfortunately we did not publish it) and incorporated aggressive recovery into heapcheck. Now that the community has an official heapcheck, I think it's worth keeping the detection and fixing tools together.

Best regards, Andrey Borodin.

[0] https://github.com/dsarafan/pg_dirty_hands/blob/master/src/pg_dirty_hands.c
On 2020-07-14 02:41, Robert Haas wrote:
> I think that our normal back-patching rules are based primarily on the
> risk of breaking things, and a new contrib module carries a pretty
> negligible risk of breaking anything that works today.

I think that all feature code ought to go through a beta cycle. So if this code makes it to 14.0 or 14.1, then I'd consider backpatching it.

> Now, if this goes into v14, we can certainly stick it up on github, or
> put it out there in some other way for users to download,
> self-compile, and install, but that seems noticeably less convenient
> for people who need it, and I'm not clear what the benefit to the
> project is.

In the meantime, if you're wizard enough to deal with this kind of thing, you could also clone the module from the PG14 tree and build it against older versions manually.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 14, 2020 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think you're attacking a straw man. I'm well aware of how open source
> works, thanks. What I'm saying is that contrib is mostly seen to be
> reasonably harmless stuff. Sure, you can overwrite data you didn't want
> to with adminpack's pg_file_write. But that's the price of having such a
> capability at all, and in general it's not hard for users to understand
> both the uses and risks of that function. That statement does not apply
> to the functions being proposed here. It doesn't seem like they could
> possibly be safe to use without very specific expert advice --- and even
> then, we're talking rather small values of "safe".

Would it be possible to make them safe(r)? For instance: truncate only, don't freeze; only tuples whose visibility information is corrupted; and only in non-catalog tables. What exactly is the risk in that case? Foreign keys might not be satisfied, which might make it impossible to restore a dump, but is that worse than what a DBA can do anyway? I would think that it is not and would leave the database in a state DBAs are much better equipped to deal with.

Or would it be possible to create a table like the original table (minus any constraints) and copy all tuples with corrupted visibility there before truncating to a dead line pointer?

Jochem
On Tue, Jul 14, 2020 at 3:08 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> In the meantime, if you're wizard enough to deal with this kind of
> thing, you could also clone the module from the PG14 tree and build it
> against older versions manually.

But what if you are NOT a wizard, and a wizard is giving you directions? Then having to build from source is a real pain. And that's normally the situation I'm in when a customer has this issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 14, 2020 at 4:59 AM Magnus Hagander <magnus@hagander.net> wrote:
> The counterexample of this is pg_resetwal. The number of people who have broken their database with that tool is not small.

Very true.

> That said, we could have a separate "class" of extensions/tools in the distribution, and encourage packagers to pack them up as separate packages for example. Technically they don't have to be in the same source repository at all of course, but I have a feeling some of them might be a lot easier to maintain if they are. And then the user would just have to install something like "postgresql-14-wizardtools". They'd still be available to everybody, of course, but at least the knives would be in a closed drawer until intentionally picked up.

I don't think that does much to help with the immediate problem here, because people are being bitten by this problem *now* and a packaging change like this will take a long time to happen and become standard out there, but I think it's a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 13, 2020 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> But updating this tool can fit to the release schedule and
> policy of PostgreSQL?
>
> While investigating the problem by using this tool, we may want to
> add new feature into the tool because it's necessary for the investigation.
> But users would need to wait for next minor version release, to use this
> new feature.

Yeah, that's a point that needs careful thought. I don't think it means that we shouldn't have something in core; after all, this is a problem that is created in part by the way that PostgreSQL itself works, and I think it would be quite unfriendly if we refused to do anything about that in the core distribution. On the other hand, it might be a good reason not to back-patch, which is something most people don't seem enthusiastic about anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander <magnus@hagander.net> wrote:
> I don't think that it necessarily has to be. As long as we're talking about adding something and not actually changing their existing packages, getting this into both yum and apt shouldn't be *that* hard, if it's coordinated well with Christoph and Devrim (obviously that's based on my experience and they will have to give a more complete answer themselves). It would be a lot more complicated if it involved changing an existing package.

I mean, you presumably could not move pg_resetwal to this new package in existing branches, right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander <magnus@hagander.net> wrote:
> I don't think that it necessarily has to be. As long as we're talking about adding something and not actually changing their existing packages, getting this into both yum and apt shouldn't be *that* hard, if it's coordinated well with Christoph and Devrim (obviously that's based on my experience and they will have to give a more complete answer themselves). It would be a lot more complicated if it involved changing an existing package.
I mean, you presumably could not move pg_resetwal to this new package
in existing branches, right?
Greetings,

* Magnus Hagander (magnus@hagander.net) wrote:
> On Tue, Jul 14, 2020 at 4:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander <magnus@hagander.net> wrote:
> > > I don't think that it necessarily has to be. As long as we're talking
> > > about adding something and not actually changing their existing packages,
> > > getting this into both yum and apt shouldn't be *that* hard, if it's
> > > coordinated well with Christoph and Devrim (obviously that's based on my
> > > experience and they will have to give a more complete answer themselves).
> > > It would be a lot more complicated if it involved changing an existing
> > > package.
> >
> > I mean, you presumably could not move pg_resetwal to this new package
> > in existing branches, right?
>
> Probably and eventually. But that can be done for 14+ (or 13+ depending on
> how "done" the packaging is there -- we should just make sure that hits the
> biggest platform in the same release).

Considering we just got rid of the -contrib independent package on at
least Debian-based systems, it doesn't really seem likely that the
packagers are going to be anxious to create a new one -- they are not
without costs.

Also, in such dire straits as this thread is contemplating, I would
think we'd *really* like to have access to these tools with as small an
amount of change as absolutely possible to the system: what if
pg_extension itself got munged and we aren't able to install this new
contrib module, for example?

I would suggest that, instead, we make this part of core, but have it be
in a relatively clearly marked special schema that isn't part of
search_path by default -- e.g. pg_hacks, or pg_dirty_hands (I kinda like
the latter though it seems a bit unprofessional for us).

I'd also certainly be in support of having a contrib module with the
same functions that's independent from core and available and able to be
installed on pre-v14 systems.

I'd further support having another repo that's "core maintained" or
however we want to phrase it which includes this proposed module (and
possibly all of contrib) and which has a different release cadence and
requirements for what gets into it, has its own packages, etc.

Thanks,

Stephen
On 2020-Jul-13, Andres Freund wrote:
> Hi,
>
> On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
> > 1. There's nothing to identify the tuple that has the problem, and no
> > way to know how many more of them there might be. Back-patching
> > b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> > part of this.
>
> Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> pointer offset, just the block number, right? We probably should extend
> it to also include the offset...

Just having the block number is already a tremendous step forward; with
that you can ask the customer to send a pageinspect dump of tuple
headers, and then the problem is obvious. Now if you want to add block
number to that, by all means do so.

FWIW I do support the idea of backpatching the vacuum errcontext commit.

One useful thing to do is to mark a tuple frozen unconditionally if it's
marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
The attached (for 9.6) does that; IIRC it would have helped in a couple
of cases.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
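[For readers following along: the pageinspect dump of tuple headers that Álvaro mentions could be gathered roughly like this. This is a sketch; the relation name "mytable" and block number 0 are placeholders for whatever VACUUM's errcontext reported.]

```sql
-- Inspect tuple headers on the block that VACUUM complained about.
CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT lp, t_ctid, t_xmin, t_xmax,
       to_hex(t_infomask)  AS infomask,
       to_hex(t_infomask2) AS infomask2
FROM heap_page_items(get_raw_page('mytable', 0));
```

Comparing t_xmin on each line pointer against the table's relfrozenxid usually makes the offending tuple obvious.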
Hi,

On 2020-07-13 21:18:10 -0400, Robert Haas wrote:
> On Mon, Jul 13, 2020 at 9:10 PM Andres Freund <andres@anarazel.de> wrote:
> > > What if clog has been truncated so that the xmin can't be looked up?
> >
> > That's possible, but probably only in cases where xmin actually
> > committed.
>
> Isn't that the normal case? I'm imagining something like:
>
> - Tuple gets inserted. Transaction commits.
> - VACUUM processes table.
> - Mischievous fairies mark page all-visible in the visibility map.
> - VACUUM runs lots more times, relfrozenxid advances, but without ever
>   looking at the page in question, because it's all-visible.
> - clog is truncated, rendering xmin no longer accessible.
> - User runs VACUUM disabling page skipping, gets ERROR.
> - User deletes offending tuple.
> - At this point, I think the tuple is both invisible and unprunable?
> - Fairies happy, user sad.

I'm not saying it's impossible that that happens, but the cases I did
investigate didn't look like this. If something just roguely wrote to
the VM I'd expect a lot more "is not marked all-visible but visibility
map bit is set in relation" type WARNINGs, and I've not seen much of
those (they're WARNINGs though, so maybe we wouldn't). Presumably this
wouldn't always just happen with tuples that'd trigger an error first
during hot pruning.

I've definitely seen indications of both datfrozenxid and relfrozenxid
getting corrupted (in particular vac_update_datfrozenxid being racy as
hell), xid wraparound, indications of multixact problems (although it's
possible we've now fixed those) and some signs of corrupted relcache
entries for shared relations leading to vacuums being skipped.

Greetings,

Andres Freund
Hi,

On 2020-07-14 13:20:25 -0400, Alvaro Herrera wrote:
> On 2020-Jul-13, Andres Freund wrote:
>
> > Hi,
> >
> > On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
> > > 1. There's nothing to identify the tuple that has the problem, and no
> > > way to know how many more of them there might be. Back-patching
> > > b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> > > part of this.
> >
> > Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> > pointer offset, just the block number, right? We probably should extend
> > it to also include the offset...
>
> Just having the block number is already a tremendous step forward; with
> that you can ask the customer to send a pageinspect dump of tuple
> headers, and then the problem is obvious. Now if you want to add block
> number to that, by all means do so.

offset number I assume?

> One useful thing to do is to mark a tuple frozen unconditionally if it's
> marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
> The attached (for 9.6) does that; IIRC it would have helped in a couple
> of cases.

I think it might also have hidden corruption in at least one case where
we subsequently fixed a bug (and helped detect at least one unfixed
bug). That should only be possible if either required clog has been
removed, or if relfrozenxid/datfrozenxid are corrupt, right?

Greetings,

Andres Freund
Hi,

On 2020-07-14 07:51:27 -0400, Robert Haas wrote:
> On Tue, Jul 14, 2020 at 3:08 AM Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
> > In the meantime, if you're wizard enough to deal with this kind of
> > thing, you could also clone the module from the PG14 tree and build it
> > against older versions manually.
>
> But what if you are NOT a wizard, and a wizard is giving you
> directions? Then having to build from source is a real pain. And
> that's normally the situation I'm in when a customer has this issue.

The "found xmin ... from before relfrozenxid ..." cases should all be
fixable without needing such a function, and without it making fixing
them significantly easier, no? As far as I understand your suggested
solution, you need the tid(s) of these tuples, right? If you have those,
I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
... or something like that.

ISTM that the hard part is finding all problematic tuples in an
efficient manner (i.e. that doesn't require one manual VACUUM for each
individual block + parsing VACUUMs error message), not "fixing" those
tuples.

Greetings,

Andres Freund
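[A sketch of the INSERT ... DELETE approach Andres describes. The table name and ctid are placeholders, and this assumes the damaged tuple is still readable by a plain SELECT -- the problem tuple is re-inserted as a fresh tuple with a sane xmin, then the damaged original is deleted by its ctid.]

```sql
BEGIN;
-- copy the data out of the damaged tuple into a new, sane tuple
INSERT INTO mytable SELECT * FROM mytable WHERE ctid = '(0,1)';
-- then get rid of the damaged original by its ctid
DELETE FROM mytable WHERE ctid = '(0,1)';
COMMIT;
```

As discussed downthread, the deleted tuple may still linger until pruning, but the freeze-time sanity checks should not apply to it.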
On 2020-Jul-14, Andres Freund wrote:
> Hi,
>
> On 2020-07-14 13:20:25 -0400, Alvaro Herrera wrote:
> > Just having the block number is already a tremendous step forward; with
> > that you can ask the customer to send a pageinspect dump of tuple
> > headers, and then the problem is obvious. Now if you want to add block
> > number to that, by all means do so.
>
> offset number I assume?

Eh, yeah, that.

> > One useful thing to do is to mark a tuple frozen unconditionally if it's
> > marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
> > The attached (for 9.6) does that; IIRC it would have helped in a couple
> > of cases.
>
> I think it might also have hidden corruption in at least one case where
> we subsequently fixed a bug (and helped detect at least one unfixed
> bug). That should only be possible if either required clog has been
> removed, or if relfrozenxid/datfrozenxid are corrupt, right?

Yes, that's precisely the reason I never submitted it :-)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 14, 2020 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:
> The "found xmin ... from before relfrozenxid ..." cases should all be
> fixable without needing such a function, and without it making fixing
> them significantly easier, no? As far as I understand your suggested
> solution, you need the tid(s) of these tuples, right? If you have those,
> I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
> ... or something like that.
>
> ISTM that the hard part is finding all problematic tuples in an
> efficient manner (i.e. that doesn't require one manual VACUUM for each
> individual block + parsing VACUUMs error message), not "fixing" those
> tuples.

I haven't tried the INSERT ... DELETE approach, but I've definitely
seen a case where a straight UPDATE did not fix the problem; VACUUM
continued failing afterwards. In that case, it was a system catalog
that was affected, and not one where TRUNCATE + re-INSERT was remotely
practical. The only solution I could come up with was to drop the
database and recreate it. Fortunately in that case the affected
database didn't seem to have any actual data in it, but if it had been
a 1TB database I think we would have been in really bad trouble.

Do you have a reason for believing that INSERT ... DELETE is going to
be better than UPDATE? It seems to me that either way you can end up
with a deleted and thus invisible tuple that you still can't get rid
of.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-07-14 15:59:21 -0400, Robert Haas wrote:
> On Tue, Jul 14, 2020 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:
> > The "found xmin ... from before relfrozenxid ..." cases should all be
> > fixable without needing such a function, and without it making fixing
> > them significantly easier, no? As far as I understand your suggested
> > solution, you need the tid(s) of these tuples, right? If you have those,
> > I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
> > ... or something like that.
> >
> > ISTM that the hard part is finding all problematic tuples in an
> > efficient manner (i.e. that doesn't require one manual VACUUM for each
> > individual block + parsing VACUUMs error message), not "fixing" those
> > tuples.
>
> I haven't tried the INSERT ... DELETE approach, but I've definitely
> seen a case where a straight UPDATE did not fix the problem; VACUUM
> continued failing afterwards.

The only way I can see that to happen is for the old tuple's multixact
being copied forward. That'd not happen with INSERT ... DELETE.

> In that case, it was a system catalog
> that was affected, and not one where TRUNCATE + re-INSERT was remotely
> practical.

FWIW, a rewriting ALTER TABLE would likely also fix it. But obviously
that'd require allow_system_table_mods...

> Do you have a reason for believing that INSERT ... DELETE is going to
> be better than UPDATE? It seems to me that either way you can end up
> with a deleted and thus invisible tuple that you still can't get rid
> of.

None of the "new" checks around freezing would apply to deleted
tuples. So we shouldn't fail with an error like $subject.

Greetings,

Andres Freund
On Wed, Jul 15, 2020 at 11:41 AM Andres Freund <andres@anarazel.de> wrote:
> > Do you have a reason for believing that INSERT ... DELETE is going to
> > be better than UPDATE? It seems to me that either way you can end up
> > with a deleted and thus invisible tuple that you still can't get rid
> > of.
>
> None of the "new" checks around freezing would apply to deleted
> tuples. So we shouldn't fail with an error like $subject.

It can definitely happen at least transiently:

S1:
rhaas=# create table wubble (a int, b text);
CREATE TABLE
rhaas=# insert into wubble values (1, 'glumpf');
INSERT 0 1

S2:
rhaas=# begin transaction isolation level repeatable read;
BEGIN
rhaas=*# select * from wubble;
 a |   b
---+--------
 1 | glumpf
(1 row)

S1:
rhaas=# delete from wubble;
DELETE 1
rhaas=# update pg_class set relfrozenxid = (relfrozenxid::text::integer + 1000000)::text::xid where relname = 'wubble';
UPDATE 1
rhaas=# vacuum verbose wubble;
INFO: vacuuming "public.wubble"
ERROR: found xmin 528 from before relfrozenxid 1000527
CONTEXT: while scanning block 0 of relation "public.wubble"

S2:
rhaas=*# commit;
COMMIT

S1:
rhaas=# vacuum verbose wubble;
INFO: vacuuming "public.wubble"
INFO: "wubble": removed 1 row versions in 1 pages
INFO: "wubble": found 1 removable, 0 nonremovable row versions in 1 out of 1 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 531
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "wubble": truncated 1 to 0 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: vacuuming "pg_toast.pg_toast_16415"
INFO: index "pg_toast_16415_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_16415": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 532
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

I see your point, though: the tuple has to be able to survive
HOT-pruning in order to cause a problem when we check whether it needs
freezing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 16, 2020 at 10:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I see your point, though: the tuple has to be able to survive
> HOT-pruning in order to cause a problem when we check whether it needs
> freezing.
Here's an example where the new sanity checks fail on an invisible
tuple without any concurrent transactions:
$ initdb
$ pg_ctl start -l ~/logfile
$ createdb
$ psql
create table simpsons (a int, b text);
vacuum freeze;
$ cat > txid.sql
select txid_current();
$ pgbench -t 131072 -c 8 -j 8 -n -f txid.sql
$ psql
insert into simpsons values (1, 'homer');
$ pg_ctl stop
$ pg_resetwal -x 1000 $PGDATA
$ pg_ctl start -l ~/logfile
$ psql
update pg_class set relfrozenxid = (relfrozenxid::text::integer +
2000000)::text::xid where relname = 'simpsons';
rhaas=# select * from simpsons;
a | b
---+---
(0 rows)
rhaas=# vacuum simpsons;
ERROR: found xmin 1049082 from before relfrozenxid 2000506
CONTEXT: while scanning block 0 of relation "public.simpsons"
This is a fairly insane situation, because we should have relfrozenxid
< tuple xid < xid counter, but instead we have xid counter < tuple xid
< relfrozenxid, but it demonstrates that it's possible to have a
database which is sufficiently corrupt that you can't escape from the
new sanity checks using only INSERT, UPDATE, and DELETE.
Now, an even easier way to create a table with a tuple that prevents
vacuuming and also can't just be deleted is to simply remove a
required pg_clog file (and maybe restart the server to clear out any
cached data in the SLRUs). What we typically do with customers who
need to recover from that situation today is give them a script to
fabricate a bogus CLOG file that shows all transactions as committed
(or, perhaps, aborted). But I think that the tools proposed on this
thread might be a better approach in certain cases. If the problem is
that a pg_clog file vanished, then recreating it with whatever content
you think is closest to what was probably there before is likely the
best you can do. But if you've got some individual tuples with crazy
xmin values, you don't really want to drop matching files in pg_clog;
it's better to fix the tuples.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
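[The "fabricate a bogus CLOG file that shows all transactions as committed" recipe Robert mentions can be sketched in shell. This is a hedged sketch under standard assumptions, not part of the thread: a default-build CLOG segment is 256kB, each byte packs four 2-bit transaction status codes, and "committed" is 01, so an all-committed segment is 0x55 in every byte. The file name "0000" is an example; use the segment name the server complains about, and only do this on a stopped server, ideally against a copy of the data directory.]

```shell
# Fabricate one all-committed CLOG segment.
# 0x55 = binary 01010101 = four "committed" (01) status codes per byte;
# 262144 bytes = 32 pages * 8kB = one default CLOG segment.
dd if=/dev/zero bs=262144 count=1 | tr '\000' '\125' > 0000
```

Dropping such a file into pg_clog (pg_xact on newer releases) makes every xid in that segment's range look committed, which is exactly the blunt instrument described above.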
> On 24 Jul 2020, at 14:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Attached is the patch that adds heap_force_kill(regclass, tid[]) and
> heap_force_freeze(regclass, tid[]) functions which Robert mentioned in
> the first email in this thread. The patch basically adds an extension
> named pg_surgery that contains these functions. Please have a look and
> let me know your feedback. Thank you.

Thanks for the patch! I have just a few random thoughts.

I think here we should report that we haven't done what was asked.
+ /* Nothing to do if the itemid is unused or already dead. */
+ if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+ continue;

Also, should we try to fix VM along the way?

Are there any caveats with concurrent VACUUM? (I do not see any, just asking)

It would be good to have some checks for interrupts in safe places.

I think we should not trust the user entirely here. I'd prefer
validation and graceful exit, not a core dump.
+ Assert(noffs <= PageGetMaxOffsetNumber(page));

For some reason we had unlogged versions of these functions. But I do
not recall the exact rationale..

Also, I'd be happy if we had something like "Restore this tuple iff
this does not break unique constraint". To do so we need to sort tids
by xmin\xmax, to revive most recent data first.

Thanks!

Best regards, Andrey Borodin.
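[For reference, the proposed functions under review are invoked like this. The signatures come from the patch as posted in the thread; the table name and tids are placeholders.]

```sql
CREATE EXTENSION pg_surgery;

-- truncate the listed tuples to dead line pointers
SELECT heap_force_kill('mytab'::regclass, ARRAY['(0,1)', '(0,2)']::tid[]);

-- reset infomask/xmin so the listed tuples appear frozen,
-- letting VACUUM get past "found xmin ... from before relfrozenxid"
SELECT heap_force_freeze('mytab'::regclass, ARRAY['(0,3)']::tid[]);
```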
> On 27 Jul 2020, at 09:36, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> > Also, should we try to fix VM along the way?
>
> I think we should let VACUUM do that.

Sometimes VACUUM will not get to these pages, because they are marked
All Frozen. And possibly some tuples will get stale on this page again.

> > Are there any caveats with concurrent VACUUM? (I do not see any, just asking)
>
> As of now, I don't see any.

VACUUM has a collection of dead item pointers. We will not resurrect
any of them, right?

> > It would be good to have some checks for interrupts in safe places.
>
> I think I have already added those wherever I felt it was required. If
> you feel it's missing somewhere, it would be good if you could point
> it out.

Sorry, I just overlooked that call, everything is fine here.

> > Also, I'd be happy if we had something like "Restore this tuple iff
> > this does not break unique constraint". To do so we need to sort tids
> > by xmin\xmax, to revive most recent data first.
>
> I didn't get this point. Could you please elaborate.

You may have 10 corrupted tuples for the same record, that was updated
9 times. And if you have a unique constraint on the table you may want
to have only the latest version of the row. So you want to kill 9
tuples and freeze 1.

Thanks!

Best regards, Andrey Borodin.
Hello Ashutosh,

On Fri, 24 Jul 2020 at 14:35, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi All,
>
> Attached is the patch that adds heap_force_kill(regclass, tid[]) and
> heap_force_freeze(regclass, tid[]) functions which Robert mentioned in
> the first email in this thread. The patch basically adds an extension
> named pg_surgery that contains these functions. Please have a look and
> let me know your feedback. Thank you.

Thanks for the patch.

1. We would be marking the buffer dirty and writing WAL even if we have
not done any changes (e.g. if we pass invalid/dead tids). Maybe we can
handle this better?

Cosmetic changes:

1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and
also the variable names could use surgery instead.

2. extension comment pg_surgery.control "extension to perform surgery
the damaged heap table" -> "extension to perform surgery on the
damaged heap table"

> On Thu, Jul 16, 2020 at 9:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Jul 16, 2020 at 10:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> > I see your point, though: the tuple has to be able to survive
>> > HOT-pruning in order to cause a problem when we check whether it needs
>> > freezing.
>>
>> Here's an example where the new sanity checks fail on an invisible
>> tuple without any concurrent transactions:
>>
>> [...]

--
M Beena Emerson
Sr. Software Engineer
edbpostgres.com
On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
> A number of EDB customers have had this error crop on their tables for
> reasons that we have usually not been able to determine. In many

<long-shot>Do you happen to know if they ever used the
snapshot-too-old feature?</long-shot>
On Wed, Jul 29, 2020 at 3:23 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > A number of EDB customers have had this error crop on their tables for
> > reasons that we have usually not been able to determine. In many
>
> <long-shot>Do you happen to know if they ever used the
> snapshot-too-old feature?</long-shot>

I don't have any reason to believe that they did. Why?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 30, 2020 at 1:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 29, 2020 at 3:23 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > A number of EDB customers have had this error crop on their tables for
> > > reasons that we have usually not been able to determine. In many
> >
> > <long-shot>Do you happen to know if they ever used the
> > snapshot-too-old feature?</long-shot>
>
> I don't have any reason to believe that they did. Why?

Nothing specific, I was just contemplating the problems with that
feature and the patches[1] proposed so far to fix some of them, and
what types of corruption might be possible due to that stuff, and it
occurred to me to ask if you'd thought about that in connection to
these reports.

[1] https://www.postgresql.org/message-id/flat/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com
> > I think we should let VACUUM do that.
>
> Sometimes VACUUM will not get to these pages, because they are marked
> All Frozen. And possibly some tuples will get stale on this page again.

Hmm, okay, will have a look into this. Thanks.
Hi Beena,

Thanks for the review.

> 1. We would be marking buffer dirty and writing wal even if we have
> not done any changes (e.g. if we pass invalid/dead tids). Maybe we can
> handle this better?

Yeah, we can skip this when nothing has changed. Will take care of it
in the next version of the patch.

> cosmetic changes
>
> 1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and
> also the variable names could use surgery instead.

I think that looks fine. I would rather prefer the word "Force" just
because all the enum options contain the word "Force" in it.

> 2. extension comment pg_surgery.control "extension to perform surgery
> the damaged heap table" -> "extension to perform surgery on the
> damaged heap table"

Okay, will fix that typo.
> On 31 Jul 2020, at 17:32, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> On Wed, Jul 29, 2020 at 9:58 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > I think we should let VACUUM do that.
> >
> > Sometimes VACUUM will not get to these pages, because they are marked
> > All Frozen. And possibly some tuples will get stale on this page again.
>
> Hmm, okay, will have a look into this. Thanks.
>
> I had a look over this and found that one can use the
> DISABLE_PAGE_SKIPPING option with VACUUM to disable all its
> page-skipping behavior.

Oh, wow, I didn't know that. Thanks! This actually will do the trick.

I'll try to review your patch again next week.

Thanks!

Best regards, Andrey Borodin.
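[For readers following along, the VACUUM option discussed above is spelled like this; the table name is a placeholder. It forces VACUUM to visit every page, including ones the visibility map marks All Frozen, which is how you surface every "found xmin ... from before relfrozenxid" tuple in a table.]

```sql
VACUUM (VERBOSE, DISABLE_PAGE_SKIPPING) mytab;
```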
On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Attached is the new version of patch that addresses the comments from Andrey and Beena. +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table" the -> a I also suggest: heap table -> relation, because we might want to extend this to other cases later. +-- toast table. +begin; +create table ttab(a text); +insert into ttab select string_agg(chr(floor(random() * 26)::int + 65), '') from generate_series(1,10000); +select * from ttab where xmin = 2; + a +--- +(0 rows) + +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]); + heap_force_freeze +------------------- + +(1 row) + I don't understand the point of this. You're not testing the function on the TOAST table; you're testing it on the main table when there happens to be a TOAST table that is probably getting used for something. But that's not really relevant to what is being tested here, so as written this seems redundant with the previous cases. +-- test pg_surgery functions with the unsupported relations. Should fail. Please name the specific functions being tested here in case we add more in the future that are tested separately. +++ b/contrib/pg_surgery/heap_surgery_funcs.c I think we could drop _funcs from the file name. +#ifdef PG_MODULE_MAGIC +PG_MODULE_MAGIC; +#endif The #ifdef here is not required, and if you look at other contrib modules you'll see that they don't have it. I don't like all the macros at the top of the file much. CHECKARRVALID is only used in one place, so it seems to me that you might as well just inline it and lose the macro. Likewise for SORT and ARRISEMPTY. Once you do that, heap_force_common() can just compute the number of array elements once, instead of doing it once inside ARRISEMPTY, then again inside SORT, and then a third time to initialize ntids. You can just have a local variable in that function that is set once and then used as needed. 
Then you are only doing ARRNELEMS once, so you can get rid of that macro too. The same technique can be used to get rid of ARRPTR. So then all the macros go away without introducing any code duplication. +/* Options to forcefully change the state of a heap tuple. */ +typedef enum HTupleForceOption +{ + FORCE_KILL, + FORCE_FREEZE +} HTupleForceOption; I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE. Also, how about option -> operation? + return heap_force_common(fcinfo, FORCE_KILL); I think it might be more idiomatic to use PG_RETURN_DATUM here. I know it's the same thing, though, and perhaps I'm even wrong about the prevailing style. + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE); I think this is unnecessary. It's an enum with 2 values. + if (ARRISEMPTY(ta)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("empty tid array"))); I don't see why this should be an error. Why can't we just continue normally and if it does nothing, it does nothing? You'd need to change the do..while loop to a while loop so that the end condition is tested at the top, but that seems fine. + rel = relation_open(relid, AccessShareLock); Maybe we should take RowExclusiveLock, since we are going to modify rows. Not sure how much it matters, though. + if (!superuser() && GetUserId() != rel->rd_rel->relowner) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("must be superuser or object owner to use %s.", + force_opt == FORCE_KILL ? "heap_force_kill()" : + "heap_force_freeze()"))); This is the wrong way to do a permissions check, and it's also the wrong way to write an error message about having failed one. To see the correct method, grep for pg_class_aclcheck. However, I think that we shouldn't in general trust the object owner to do this, unless the super-user gave permission. This is a data-corrupting operation, and only the boss is allowed to authorize it.
So I think you should also add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then have this check as a backup. Then, the superuser is always allowed, and if they choose to GRANT EXECUTE on this function to some users, those users can do it for their own relations, but not others. + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("only heap AM is supported"))); + + check_relation_relkind(rel); Seems like these checks are in the wrong order. Also, maybe you could call the function something like check_relation_ok() and put the permissions test, the relkind test, and the relam test all inside of it, just to tighten up the code in this main function a bit. + if (noffs > maxoffset) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("number of offsets specified for block %u exceeds the max offset number %u", + blkno, maxoffset))); Hmm, this doesn't seem quite right. The actual problem is if an individual item pointer's offset number is greater than maxoffset, which can be true even if the total number of offsets is less than maxoffset. So I think you need to remove this check and add a check inside the loop which follows that offnos[i] is in range. The way you've structured that loop is actually problematic -- I don't think we want to be calling elog() or ereport() inside a critical section. You could fix the case that checks for an invalid force_opt by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op == HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The NOTICE case you have here is a bigger problem. You really cannot modify the buffer like this and then decide, oops, never mind, I think I won't mark it dirty or write WAL for the changes. If you do that, the buffer is still in memory, but it's now been modified. 
A subsequent operation that modifies it will start with the altered state you created here, quite possibly leading to WAL that cannot be correctly replayed on the standby. In other words, you've got to decide for certain whether you want to proceed with the operation *before* you enter the critical section. You also need to emit any messages before or after the critical section. So you could: 1. If you encounter a TID that's unused or dead, skip it silently. -or- 2. Loop over offsets twice. The first time, ERROR if you find any one that is unused or dead. Then start a critical section. Loop again and do the real work. -or- 3. Like #2, but emit a NOTICE about an unused or dead item rather than an ERROR, and skip the critical section and the second loop if you did that >0 times. -or- 4. Like #3, but don't skip anything just because you emitted a NOTICE about the page. #3 is closest to the behavior you have now, but I'm not sure what else it has going for it. It doesn't seem like particularly intuitive behavior that finding a dead or unused TID should cause other item TIDs on the same page not to get processed while still permitting TIDs on other pages to get processed. I don't think that's the behavior users will be expecting. I think my vote is for #4, which will emit a NOTICE about any TID that is dead or unused -- and I guess also about any TID whose offset number is out of range -- but won't actually skip any operations that can be performed. But there are decent arguments for #1 or #2 too. + (errmsg("skipping tid (%u, %u) because it is already marked %s", + blkno, offnos[i], + ItemIdIsDead(itemid) ? "dead" : "unused"))); I believe this violates our guidelines on message construction. Have two completely separate messages -- and maybe lose the word "already": "skipping tid (%u, %u) because it is dead" "skipping tid (%u, %u) because it is unused" The point of this is that it makes it easier for translators. I see very little point in what verify_tid() is doing.
Before using each block number, we should check that it's less than or equal to a cached value of RelationGetNumberOfBlocks(rel). That's necessary in any case to avoid funny errors; and then the check here against specifically InvalidBlockNumber is redundant. For the offset number, same thing: we need to check each offset against the page's PageGetMaxOffsetNumber(page); and if we do that then we don't need these checks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I've gone through all your review comments and understood all of them except this one:
You really cannot
modify the buffer like this and then decide, oops, never mind, I think
I won't mark it dirty or write WAL for the changes. If you do that,
the buffer is still in memory, but it's now been modified. A
subsequent operation that modifies it will start with the altered
state you created here, quite possibly leading to WAL that cannot be
correctly replayed on the standby. In other words, you've got to
decide for certain whether you want to proceed with the operation
*before* you enter the critical section.
On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Could you please explain this point once more in detail? I am not quite able to understand under what circumstances a buffer would be modified, but won't be marked as dirty or WAL won't be written for it. Whenever this branch is taken: + if (nskippedItems == noffs) + goto skip_wal; At this point you have already modified the page, using ItemIdSetDead, HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this branch is taken, then MarkBufferDirty() and log_newpage_buffer() are skipped. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 3, 2020 at 7:06 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Could you please explain this point once more in detail? I am not quite able to understand under what circumstances a buffer would be modified, but won't be marked as dirty or WAL won't be written for it. > > Whenever this branch is taken: > > + if (nskippedItems == noffs) > + goto skip_wal; > If the above path is taken, that means none of the items in the page got changed. As per the following if-check, whenever an item in the offnos[] array is found dead or unused, it is skipped (due to the continue statement), which means the item is neither marked dead nor marked frozen. Now, if this happens for all the items in a page, then the above condition (nskippedItems == noffs) would be true and hence the buffer would remain unchanged, so we don't mark such a buffer as dirty, nor do any WAL logging for it. This is my understanding, please let me know if I am missing something here. Thank you. if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) { nskippedItems++; ereport(NOTICE, (errmsg("skipping tid (%u, %u) because it is already marked %s", blkno, offnos[i], ItemIdIsDead(itemid) ? "dead" : "unused"))); continue; } > At this point you have already modified the page, using ItemIdSetDead, > HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this > branch is taken, then MarkBufferDirty() and log_newpage_buffer() are > skipped. > -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Hi Robert, Thanks for the review. Please find my comments inline: On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Attached is the new version of patch that addresses the comments from Andrey and Beena. > > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table" > > the -> a > > I also suggest: heap table -> relation, because we might want to > extend this to other cases later. > Corrected. > +-- toast table. > +begin; > +create table ttab(a text); > +insert into ttab select string_agg(chr(floor(random() * 26)::int + > 65), '') from generate_series(1,10000); > +select * from ttab where xmin = 2; > + a > +--- > +(0 rows) > + > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]); > + heap_force_freeze > +------------------- > + > +(1 row) > + > > I don't understand the point of this. You're not testing the function > on the TOAST table; you're testing it on the main table when there > happens to be a TOAST table that is probably getting used for > something. But that's not really relevant to what is being tested > here, so as written this seems redundant with the previous cases. > Yeah, it's being tested on the main table, not on a toast table. I've removed this test-case and also restricted direct access to the toast table using heap_force_kill/freeze functions. I think we shouldn't be using these functions to do any changes in the toast table. We will only use these functions with the main table and let VACUUM remove the corresponding data chunks (pointed by the tuple that got removed from the main table). Another option would be to identify all the data chunks corresponding to the tuple (ctid) being killed from the main table and remove them one by one. We will only do this if the tuple from the main table that has been marked as killed has an external storage. 
We will have to add a bunch of code for this; otherwise, we can let VACUUM do this for us. Let me know your thoughts on this. > +-- test pg_surgery functions with the unsupported relations. Should fail. > > Please name the specific functions being tested here in case we add > more in the future that are tested separately. > Done. > +++ b/contrib/pg_surgery/heap_surgery_funcs.c > > I think we could drop _funcs from the file name. > Done. > +#ifdef PG_MODULE_MAGIC > +PG_MODULE_MAGIC; > +#endif > > The #ifdef here is not required, and if you look at other contrib > modules you'll see that they don't have it. > Okay, done. > I don't like all the macros at the top of the file much. CHECKARRVALID > is only used in one place, so it seems to me that you might as well > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY. > Done. > Once you do that, heap_force_common() can just compute the number of > array elements once, instead of doing it once inside ARRISEMPTY, then > again inside SORT, and then a third time to initialize ntids. You can > just have a local variable in that function that is set once and then > used as needed. Then you are only doing ARRNELEMS once, so you can get > rid of that macro too. The same technique can be used to get rid of > ARRPTR. So then all the macros go away without introducing any code > duplication. > Done. > +/* Options to forcefully change the state of a heap tuple. */ > +typedef enum HTupleForceOption > +{ > + FORCE_KILL, > + FORCE_FREEZE > +} HTupleForceOption; > > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE. Done. > Also, how about option -> operation? I think both look okay to me. > + return heap_force_common(fcinfo, FORCE_KILL); > > I think it might be more idiomatic to use PG_RETURN_DATUM here. I > know it's the same thing, though, and perhaps I'm even wrong about the > prevailing style. > Done.
> + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE); > > I think this is unnecessary. It's an enum with 2 values. > Removed. > + if (ARRISEMPTY(ta)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("empty tid array"))); > > I don't see why this should be an error. Why can't we just continue > normally and if it does nothing, it does nothing? You'd need to change > the do..while loop to a while loop so that the end condition is tested > at the top, but that seems fine. > I think it's okay to have this check. So, just left it as-is. We do have such checks in other contrib modules as well wherever the array is being passed as an input to the function. > + rel = relation_open(relid, AccessShareLock); > > Maybe we should take RowExclusiveLock, since we are going to modify > rows. Not sure how much it matters, though. > I don't know how it would make a difference, but still as you said replaced AccessShare with RowExclusive. > + if (!superuser() && GetUserId() != rel->rd_rel->relowner) > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > + errmsg("must be superuser or object owner to use %s.", > + force_opt == FORCE_KILL ? "heap_force_kill()" : > + "heap_force_freeze()"))); > > This is the wrong way to do a permissions check, and it's also the > wrong way to write an error message about having failed one. To see > the correct method, grep for pg_class_aclcheck. However, I think that > we shouldn't in general trust the object owner to do this, unless the > super-user gave permission. This is a data-corrupting operation, and > only the boss is allowed to authorize it. So I think you should also > add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then > have this check as a backup. Then, the superuser is always allowed, > and if they choose to GRANT EXECUTE on this function to some users, > those users can do it for their own relations, but not others. > Done. 
> + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("only heap AM is supported"))); > + > + check_relation_relkind(rel); > > Seems like these checks are in the wrong order. I don't think there is anything wrong with the order. I can see the same order in other contrib modules as well. > Also, maybe you could call the function something like check_relation_ok() and put the > permissions test, the relkind test, and the relam test all inside of > it, just to tighten up the code in this main function a bit. > Yeah, I've added a couple of functions named sanity_check_relation and sanity_check_tid_array and shifted all the sanity checks there. > + if (noffs > maxoffset) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("number of offsets specified for block %u exceeds the max offset number %u", > + blkno, maxoffset))); > > Hmm, this doesn't seem quite right. The actual problem is if an > individual item pointer's offset number is greater than maxoffset, > which can be true even if the total number of offsets is less than > maxoffset. So I think you need to remove this check and add a check > inside the loop which follows that offnos[i] is in range. > Agreed and done. > The way you've structured that loop is actually problematic -- I don't > think we want to be calling elog() or ereport() inside a critical > section. You could fix the case that checks for an invalid force_opt > by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op == > HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The > NOTICE case you have here is a bigger problem. > Done. > You really cannot modify the buffer like this and then decide, oops, never mind, I think > I won't mark it dirty or write WAL for the changes. If you do that, > the buffer is still in memory, but it's now been modified.
A > subsequent operation that modifies it will start with the altered > state you created here, quite possibly leading to WAL that cannot be > correctly replayed on the standby. In other words, you've got to > decide for certain whether you want to proceed with the operation > *before* you enter the critical section. You also need to emit any > messages before or after the critical section. So you could: > This is still not clear. I think Robert needs to respond to my earlier comment. > I believe this violates our guidelines on message construction. Have > two completely separate messages -- and maybe lose the word "already": > > "skipping tid (%u, %u) because it is dead" > "skipping tid (%u, %u) because it is unused" > > The point of this is that it makes it easier for translators. > Done. > I see very little point in what verify_tid() is doing. Before using > each block number, we should check that it's less than or equal to a > cached value of RelationGetNumberOfBlocks(rel). That's necessary in > any case to avoid funny errors; and then the check here against > specifically InvalidBlockNumber is redundant. For the offset number, > same thing: we need to check each offset against the page's > PageGetMaxOffsetNumber(page); and if we do that then we don't need > these checks. > Done. Please check the attached patch for the changes. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > If the above path is taken that means none of the items in the page > got changed. Oops. I didn't realize that, sorry. Maybe it would be a little more clear if instead of "int nSkippedItems" you had "bool did_modify_page"? Then you could initialize it to false and set it to true just before doing the page modifications. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Yeah, it's being tested on the main table, not on a toast table. I've > removed this test-case and also restricted direct access to the toast > table using heap_force_kill/freeze functions. I think we shouldn't be > using these functions to do any changes in the toast table. We will > only use these functions with the main table and let VACUUM remove the > corresponding data chunks (pointed by the tuple that got removed from > the main table). I agree with removing the test case, but I disagree with restricting this from being used on the TOAST table. These are tools for experts, who may use them as they see fit. It's unlikely that it would be useful to use this on a TOAST table, I think, but not impossible. > Another option would be to identify all the data chunks corresponding > to the tuple (ctid) being killed from the main table and remove them > one by one. We will only do this if the tuple from the main table that > has been marked as killed has an external storage. We will have to add > a bunch of code for this otherwise we can let VACUUM do this for us. > Let me know your thoughts on this. I don't think VACUUM will do anything for us automatically -- it isn't going to know that we force-killed the tuple in the main table. Normally, a tuple delete would have to set xmax on the TOAST tuples and then VACUUM would do its thing, but in this case that won't happen. So if you force-kill a tuple in the main table you would end up with a space leak in the TOAST table. The problem here is that one reason you might force-kill a tuple in the main table is that it's full of garbage.
If so, trying to decode the tuple so that you can find the TOAST pointers might crash or error out, or maybe that part will work but then you'll error out trying to look up the corresponding TOAST tuples, either because the values are not valid or because the TOAST table itself is generally hosed in some way. So I think it is probably best if we keep this tool as simple as possible, with as few dependencies as we can, and document the possible negative outcomes of using it. It's not impossible to recover from a space-leak like this; you can always move the data into a new table with CTAS and then drop the old one. Not sure whether CLUSTER or VACUUM FULL would also be sufficient. Separately, we might want to add a TOAST-checker to amcheck, or enhance the heap-checker Mark is working on, and one of the things it could do is check for TOAST entries to which nothing points. Then if you force-kill tuples in the main table you could also use that tool to look for things in the TOAST table that ought to also be force-killed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Aug 6, 2020 at 1:04 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > If the above path is taken that means none of the items in the page > > got changed. > > Oops. I didn't realize that, sorry. Maybe it would be a little more > clear if instead of "int nSkippedItems" you had "bool > did_modify_page"? Then you could initialize it to false and set it to > true just before doing the page modifications. > Okay, np, in that case, as you suggested, I will replace "int nSkippedItems" with "did_modify_page" to increase the clarity. I will do this change in the next version of patch. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Thu, Aug 6, 2020 at 1:29 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Yeah, it's being tested on the main table, not on a toast table. I've > > removed this test-case and also restricted direct access to the toast > > table using heap_force_kill/freeze functions. I think we shouldn't be > > using these functions to do any changes in the toast table. We will > > only use these functions with the main table and let VACUUM remove the > > corresponding data chunks (pointed by the tuple that got removed from > > the main table). > > I agree with removing the test case, but I disagree with restricting > this from being used on the TOAST table. These are tools for experts, > who may use them as they see fit. It's unlikely that it would be > useful to use this on a TOAST table, I think, but not impossible. > Okay, if you want, I can remove the restriction on a toast table, but then that means a user can directly remove the data chunks from the toast table without changing anything in the main table. This means we won't be able to query the main table as it will fail with an error like "ERROR: unexpected chunk number ...". So, we will have to find some way to identify the pointer to the chunks that got deleted from the toast table and remove that pointer from the main table. We also need to make sure that before we remove a tuple (pointer) from the main table, we identify all the remaining data chunks pointed by this tuple and remove them completely; only then would the table be considered to be in a good state. Now, I am not sure if we can currently do all these things. > > Another option would be to identify all the data chunks corresponding > > to the tuple (ctid) being killed from the main table and remove them > > one by one. We will only do this if the tuple from the main table that > > has been marked as killed has an external storage.
We will have to add > > a bunch of code for this otherwise we can let VACUUM do this for us. > > Let me know your thoughts on this. > > I don't think VACUUM will do anything for us automatically -- it isn't > going to know that we force-killed the tuple in the main table. > Normally, a tuple delete would have to set xmax on the TOAST tuples > and then VACUUM would do its thing, but in this case that won't > happen. So if you force-kill a tuple in the main table you would end > up with a space leak in the TOAST table. > > The problem here is that one reason you might force-kill a tuple in > the main table is that it's full of garbage. If so, trying to > decode the tuple so that you can find the TOAST pointers might crash > or error out, or maybe that part will work but then you'll error out > trying to look up the corresponding TOAST tuples, either because the > values are not valid or because the TOAST table itself is generally > hosed in some way. So I think it is probably best if we keep this tool > as simple as possible, with as few dependencies as we can, and > document the possible negative outcomes of using it. I completely agree with you. > It's not impossible to recover from a space-leak like this; you can always move > the data into a new table with CTAS and then drop the old one. Not > sure whether CLUSTER or VACUUM FULL would also be sufficient. > Yeah, I think we can use either CTAS or VACUUM FULL; both look fine. > Separately, we might want to add a TOAST-checker to amcheck, or > enhance the heap-checker Mark is working on, and one of the things it > could do is check for TOAST entries to which nothing points. Then if > you force-kill tuples in the main table you could also use that tool > to look for things in the TOAST table that ought to also be > force-killed. > Okay, good to know that. Thanks for sharing this info. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > [... full review exchange quoted upthread, trimmed ...] > Please check the attached patch for the changes. I also looked at this version patch and have some small comments: + Oid relid = PG_GETARG_OID(0); + ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1); + ItemPointer tids; + int ntids; + Relation rel; + Buffer buf; + Page page; + ItemId itemid; + BlockNumber blkno; + OffsetNumber *offnos; + OffsetNumber offno, + noffs, + curr_start_ptr, + next_start_ptr, + maxoffset; + int i, + nskippedItems, + nblocks; You declare all variables at the top of the heap_force_common() function, but I think we can declare some variables, such as buf and page, inside the do loop. --- + if (offnos[i] > maxoffset) + { + ereport(NOTICE, + errmsg("skipping tid (%u, %u) because it contains an invalid offset", + blkno, offnos[i])); + continue; + } If all tids on a page take the above path, we will end up logging an FPI despite modifying nothing on the page. --- + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + log_newpage_buffer(buf, true); I think we need to set the LSN returned by log_newpage_buffer() as the page LSN. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I have been doing some user-level testing of this feature, apart from the sanity tests for the extension and its functions.
I have tried to corrupt tuples and was then able to fix them using the heap_force_freeze/kill functions.
--corrupt relfrozenxid, causing vacuum to fail.
update pg_class set relfrozenxid = (relfrozenxid::text::integer + 10)::text::xid where relname = 'test_tbl';
UPDATE 1
insert into test_tbl values (2, 'BBB');
postgres=# vacuum test_tbl;
ERROR: found xmin 507 from before relfrozenxid 516
CONTEXT: while scanning block 0 of relation "public.test_tbl"
postgres=# select *, ctid, xmin, xmax from test_tbl;
a | b | ctid | xmin | xmax
---+-----+-------+------+------
1 | AAA | (0,1) | 505 | 0
2 | BBB | (0,2) | 507 | 0
(2 rows)
--fixed using heap_force_freeze
postgres=# select heap_force_freeze('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
heap_force_freeze
-------------------
postgres=# vacuum test_tbl;
VACUUM
postgres=# select *, ctid, xmin, xmax from test_tbl;
a | b | ctid | xmin | xmax
---+-----+-------+------+------
1 | AAA | (0,1) | 505 | 0
2 | BBB | (0,2) | 2 | 0
(2 rows)
--corrupt tuple headers in the table's base/<oid> file, causing table access to fail.
postgres=# select ctid, * from test_tbl;
ERROR: could not access status of transaction 4294967295
DETAIL: Could not open file "pg_xact/0FFF": No such file or directory.
--removed the corrupted tuple using heap_force_kill
postgres=# select heap_force_kill('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
heap_force_kill
-----------------
(1 row)
postgres=# select ctid, * from test_tbl;
ctid | a | b
-------+---+-----
(0,1) | 1 | AAA
(1 row)
I will continue testing with the latest patch and will update here if I find anything.
On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Robert,
>
> Thanks for the review. Please find my comments inline:
>
> On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > Attached is the new version of patch that addresses the comments from Andrey and Beena.
> >
> > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
> >
> > the -> a
> >
> > I also suggest: heap table -> relation, because we might want to
> > extend this to other cases later.
> >
>
> Corrected.
>
> > +-- toast table.
> > +begin;
> > +create table ttab(a text);
> > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
> > 65), '') from generate_series(1,10000);
> > +select * from ttab where xmin = 2;
> > + a
> > +---
> > +(0 rows)
> > +
> > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
> > + heap_force_freeze
> > +-------------------
> > +
> > +(1 row)
> > +
> >
> > I don't understand the point of this. You're not testing the function
> > on the TOAST table; you're testing it on the main table when there
> > happens to be a TOAST table that is probably getting used for
> > something. But that's not really relevant to what is being tested
> > here, so as written this seems redundant with the previous cases.
> >
>
> Yeah, it's being tested on the main table, not on a toast table. I've
> removed this test-case and also restricted direct access to the toast
> table using heap_force_kill/freeze functions. I think we shouldn't be
> using these functions to do any changes in the toast table. We will
> only use these functions with the main table and let VACUUM remove the
> corresponding data chunks (pointed by the tuple that got removed from
> the main table).
>
> Another option would be to identify all the data chunks corresponding
> to the tuple (ctid) being killed from the main table and remove them
> one by one. We will only do this if the tuple from the main table that
> has been marked as killed has an external storage. We will have to add
> a bunch of code for this otherwise we can let VACUUM do this for us.
> Let me know your thoughts on this.
>
> > +-- test pg_surgery functions with the unsupported relations. Should fail.
> >
> > Please name the specific functions being tested here in case we add
> > more in the future that are tested separately.
> >
>
> Done.
>
> > +++ b/contrib/pg_surgery/heap_surgery_funcs.c
> >
> > I think we could drop _funcs from the file name.
> >
>
> Done.
>
> > +#ifdef PG_MODULE_MAGIC
> > +PG_MODULE_MAGIC;
> > +#endif
> >
> > The #ifdef here is not required, and if you look at other contrib
> > modules you'll see that they don't have it.
> >
>
> Okay, done.
>
> > I don't like all the macros at the top of the file much. CHECKARRVALID
> > is only used in one place, so it seems to me that you might as well
> > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
> >
>
> Done.
>
> > Once you do that, heap_force_common() can just compute the number of
> > array elements once, instead of doing it once inside ARRISEMPTY, then
> > again inside SORT, and then a third time to initialize ntids. You can
> > just have a local variable in that function that is set once and then
> > used as needed. Then you are only doing ARRNELEMS once, so you can get
> > rid of that macro too. The same technique can be used to get rid of
> > ARRPTR. So then all the macros go away without introducing any code
> > duplication.
> >
>
> Done.
>
> > +/* Options to forcefully change the state of a heap tuple. */
> > +typedef enum HTupleForceOption
> > +{
> > + FORCE_KILL,
> > + FORCE_FREEZE
> > +} HTupleForceOption;
> >
> > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
> > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE.
>
> Done.
>
> Also, how about
> > option -> operation?
> >
>
> I think both look okay to me.
>
> > + return heap_force_common(fcinfo, FORCE_KILL);
> >
> > I think it might be more idiomatic to use PG_RETURN_DATUM here. I
> > know it's the same thing, though, and perhaps I'm even wrong about the
> > prevailing style.
> >
>
> Done.
>
> > + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
> >
> > I think this is unnecessary. It's an enum with 2 values.
> >
>
> Removed.
>
> > + if (ARRISEMPTY(ta))
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("empty tid array")));
> >
> > I don't see why this should be an error. Why can't we just continue
> > normally and if it does nothing, it does nothing? You'd need to change
> > the do..while loop to a while loop so that the end condition is tested
> > at the top, but that seems fine.
> >
>
> I think it's okay to have this check. So, just left it as-is. We do
> have such checks in other contrib modules as well wherever the array
> is being passed as an input to the function.
>
> > + rel = relation_open(relid, AccessShareLock);
> >
> > Maybe we should take RowExclusiveLock, since we are going to modify
> > rows. Not sure how much it matters, though.
> >
>
> I don't know how it would make a difference, but still as you said
> replaced AccessShare with RowExclusive.
>
> > + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > + errmsg("must be superuser or object owner to use %s.",
> > + force_opt == FORCE_KILL ? "heap_force_kill()" :
> > + "heap_force_freeze()")));
> >
> > This is the wrong way to do a permissions check, and it's also the
> > wrong way to write an error message about having failed one. To see
> > the correct method, grep for pg_class_aclcheck. However, I think that
> > we shouldn't in general trust the object owner to do this, unless the
> > super-user gave permission. This is a data-corrupting operation, and
> > only the boss is allowed to authorize it. So I think you should also
> > add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
> > have this check as a backup. Then, the superuser is always allowed,
> > and if they choose to GRANT EXECUTE on this function to some users,
> > those users can do it for their own relations, but not others.
> >
>
> Done.
>
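(For reference, the REVOKE approach described in the quoted review above might look like this in the extension's SQL file — a sketch, with the function signatures assumed from their usage elsewhere in this thread:)

```sql
-- Data-corrupting operations: execute permission is revoked from PUBLIC,
-- so only superusers (who bypass privilege checks) or users explicitly
-- granted EXECUTE can run these functions.
REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
```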
> > + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > + errmsg("only heap AM is supported")));
> > +
> > + check_relation_relkind(rel);
> >
> > Seems like these checks are in the wrong order.
>
> I don't think there is anything wrong with the order. I can see the
> same order in other contrib modules as well.
>
> Also, maybe you could
> > call the function something like check_relation_ok() and put the
> > permissions test, the relkind test, and the relam test all inside of
> > it, just to tighten up the code in this main function a bit.
> >
>
> Yeah, I've added a couple of functions named sanity_check_relation and
> sanity_check_tid_array and shifted all the sanity checks there.
>
> > + if (noffs > maxoffset)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("number of offsets specified for block %u exceeds the max
> > offset number %u",
> > + blkno, maxoffset)));
> >
> > Hmm, this doesn't seem quite right. The actual problem is if an
> > individual item pointer's offset number is greater than maxoffset,
> > which can be true even if the total number of offsets is less than
> > maxoffset. So I think you need to remove this check and add a check
> > inside the loop which follows that offnos[i] is in range.
> >
>
> Agreed and done.
>
> > The way you've structured that loop is actually problematic -- I don't
> > think we want to be calling elog() or ereport() inside a critical
> > section. You could fix the case that checks for an invalid force_opt
> > by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
> > HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
> > NOTICE case you have here is a bigger problem.
>
> Done.
>
> You really cannot
> > modify the buffer like this and then decide, oops, never mind, I think
> > I won't mark it dirty or write WAL for the changes. If you do that,
> > the buffer is still in memory, but it's now been modified. A
> > subsequent operation that modifies it will start with the altered
> > state you created here, quite possibly leading to WAL that cannot be
> > correctly replayed on the standby. In other words, you've got to
> > decide for certain whether you want to proceed with the operation
> > *before* you enter the critical section. You also need to emit any
> > messages before or after the critical section. So you could:
> >
>
> This is still not clear. I think Robert needs to respond to my earlier comment.
>
> > I believe this violates our guidelines on message construction. Have
> > two completely separate messages -- and maybe lose the word "already":
> >
> > "skipping tid (%u, %u) because it is dead"
> > "skipping tid (%u, %u) because it is unused"
> >
> > The point of this is that it makes it easier for translators.
> >
>
> Done.
>
> > I see very little point in what verify_tid() is doing. Before using
> > each block number, we should check that it's less than or equal to a
> > cached value of RelationGetNumberOfBlocks(rel). That's necessary in
> > any case to avoid funny errors; and then the check here against
> > specifically InvalidBlockNumber is redundant. For the offset number,
> > same thing: we need to check each offset against the page's
> > PageGetMaxOffsetNumber(page); and if we do that then we don't need
> > these checks.
> >
>
> Done.
>
> Please check the attached patch for the changes.
I also looked at this version patch and have some small comments:
+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids;
+ Relation rel;
+ Buffer buf;
+ Page page;
+ ItemId itemid;
+ BlockNumber blkno;
+ OffsetNumber *offnos;
+ OffsetNumber offno,
+ noffs,
+ curr_start_ptr,
+ next_start_ptr,
+ maxoffset;
+ int i,
+ nskippedItems,
+ nblocks;
You declare all variables at the top of heap_force_common() function
but I think we can declare some variables such as buf, page inside of
the do loop.
---
+ if (offnos[i] > maxoffset)
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) because it contains an invalid offset",
+ blkno, offnos[i]));
+ continue;
+ }
If all tids on a page take the above path, we will end up logging FPI
in spite of modifying nothing on the page.
---
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Masahiko-san,

Thanks for looking into the patch. Please find my comments inline below:

On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Please check the attached patch for the changes.
>
> I also looked at this version patch and have some small comments:
>
> [...]
>
> You declare all variables at the top of heap_force_common() function
> but I think we can declare some variables such as buf, page inside of
> the do loop.

Sure, I will do this in the next version of patch.

> ---
> + if (offnos[i] > maxoffset)
> + {
> + ereport(NOTICE,
> + errmsg("skipping tid (%u, %u) because it contains an invalid offset",
> + blkno, offnos[i]));
> + continue;
> + }
>
> If all tids on a page take the above path, we will end up logging FPI
> in spite of modifying nothing on the page.

Yeah, that's right. I've taken care of this in the new version of
patch which I am yet to share.

> ---
> + /* XLOG stuff */
> + if (RelationNeedsWAL(rel))
> + log_newpage_buffer(buf, true);
>
> I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.

I think we are already setting the page lsn in the log_newpage() which
is being called from log_newpage_buffer(). So, AFAIU, no change would
be required here. Please let me know if I am missing something.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Thu, 6 Aug 2020 at 18:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > + /* XLOG stuff */
> > > + if (RelationNeedsWAL(rel))
> > > + log_newpage_buffer(buf, true);
> > >
> > > I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
> >
> > I think we are already setting the page lsn in the log_newpage() which
> > is being called from log_newpage_buffer(). So, AFAIU, no change would
> > be required here. Please let me know if I am missing something.

You're right. I'd missed the comment of log_newpage_buffer(). Thanks!

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attached v4 patch fixes the latest comments from Robert and Masahiko-san.

Changes:
1) Allow the heap_force_kill and heap_force_freeze functions to be used with toast tables.
2) Replace "int nskippedItems" with a "bool did_modify_page" flag to know if any modification happened in the page or not.
3) Declare some of the variables, such as buf and page, inside of the do loop instead of declaring them at the top of heap_force_common function.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, Aug 6, 2020 at 2:49 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> On Thu, 6 Aug 2020 at 18:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> [...]
On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Okay, If you want I can remove the restriction on a toast table, but,
> then that means a user can directly remove the data chunks from the
> toast table without changing anything in the main table. This means we
> won't be able to query the main table as it will fail with an error
> like "ERROR: unexpected chunk number ...". So, we will have to find
> some way to identify the pointer to the chunks that got deleted from
> the toast table and remove that pointer from the main table. We also
> need to make sure that before we remove a tuple (pointer) from the
> main table, we identify all the remaining data chunks pointed by this
> tuple and remove them completely; only then would that table be
> considered to be in a good state. Now, I am not sure if we can
> currently do all these things.

That's the user's problem. If they don't have a plan for that, they
shouldn't use this tool. I don't think we can or should try to handle
it in this code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 6, 2020 at 9:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
> That's the user's problem. If they don't have a plan for that, they
> shouldn't use this tool. I don't think we can or should try to handle
> it in this code.

Okay, thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Thanks Rajkumar for testing the patch.

Here are some additional test-cases that I would suggest you execute, if possible:

1) You may try running the test-cases that you have executed so far with an SR (streaming replication) setup and see if the changes are getting reflected on the standby.

2) You may also try running some concurrent test-cases, e.g. try running these functions with VACUUM or some other sql commands (preferably DML commands) in parallel.

3) See what happens when you pass some invalid tids (containing an invalid block or offset number) to these functions. You may also try running these functions on the same tuple repeatedly and see the behaviour.

...

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, Aug 6, 2020 at 2:25 PM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
> I have been doing some user-level testing of this feature, apart from the sanity tests for the extension and its functions.
> I have tried to corrupt tuples and was then able to fix them using the heap_force_freeze/kill functions.
> [...]
>> > >> > > + if (noffs > maxoffset) >> > > + ereport(ERROR, >> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >> > > + errmsg("number of offsets specified for block %u exceeds the max >> > > offset number %u", >> > > + blkno, maxoffset))); >> > > >> > > Hmm, this doesn't seem quite right. The actual problem is if an >> > > individual item pointer's offset number is greater than maxoffset, >> > > which can be true even if the total number of offsets is less than >> > > maxoffset. So I think you need to remove this check and add a check >> > > inside the loop which follows that offnos[i] is in range. >> > > >> > >> > Agreed and done. >> > >> > > The way you've structured that loop is actually problematic -- I don't >> > > think we want to be calling elog() or ereport() inside a critical >> > > section. You could fix the case that checks for an invalid force_opt >> > > by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op == >> > > HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The >> > > NOTICE case you have here is a bigger problem. >> > >> > Done. >> > >> > You really cannot >> > > modify the buffer like this and then decide, oops, never mind, I think >> > > I won't mark it dirty or write WAL for the changes. If you do that, >> > > the buffer is still in memory, but it's now been modified. A >> > > subsequent operation that modifies it will start with the altered >> > > state you created here, quite possibly leading to WAL that cannot be >> > > correctly replayed on the standby. In other words, you've got to >> > > decide for certain whether you want to proceed with the operation >> > > *before* you enter the critical section. You also need to emit any >> > > messages before or after the critical section. So you could: >> > > >> > >> > This is still not clear. I think Robert needs to respond to my earlier comment. >> > >> > > I believe this violates our guidelines on message construction. 
Have >> > > two completely separate messages -- and maybe lose the word "already": >> > > >> > > "skipping tid (%u, %u) because it is dead" >> > > "skipping tid (%u, %u) because it is unused" >> > > >> > > The point of this is that it makes it easier for translators. >> > > >> > >> > Done. >> > >> > > I see very little point in what verify_tid() is doing. Before using >> > > each block number, we should check that it's less than or equal to a >> > > cached value of RelationGetNumberOfBlocks(rel). That's necessary in >> > > any case to avoid funny errors; and then the check here against >> > > specifically InvalidBlockNumber is redundant. For the offset number, >> > > same thing: we need to check each offset against the page's >> > > PageGetMaxOffsetNumber(page); and if we do that then we don't need >> > > these checks. >> > > >> > >> > Done. >> > >> > Please check the attached patch for the changes. >> >> I also looked at this version patch and have some small comments: >> >> + Oid relid = PG_GETARG_OID(0); >> + ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1); >> + ItemPointer tids; >> + int ntids; >> + Relation rel; >> + Buffer buf; >> + Page page; >> + ItemId itemid; >> + BlockNumber blkno; >> + OffsetNumber *offnos; >> + OffsetNumber offno, >> + noffs, >> + curr_start_ptr, >> + next_start_ptr, >> + maxoffset; >> + int i, >> + nskippedItems, >> + nblocks; >> >> You declare all variables at the top of heap_force_common() function >> but I think we can declare some variables such as buf, page inside of >> the do loop. >> >> --- >> + if (offnos[i] > maxoffset) >> + { >> + ereport(NOTICE, >> + errmsg("skipping tid (%u, %u) because it >> contains an invalid offset", >> + blkno, offnos[i])); >> + continue; >> + } >> >> If all tids on a page take the above path, we will end up logging FPI >> in spite of modifying nothing on the page. 
>>
>> ---
>> + /* XLOG stuff */
>> + if (RelationNeedsWAL(rel))
>> + log_newpage_buffer(buf, true);
>>
>> I think we need to set the page LSN to the LSN returned by log_newpage_buffer().
>>
>> Regards,
>>
>> --
>> Masahiko Sawada http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Attached v4 patch fixes the latest comments from Robert and Masahiko-san. Compiler warning: heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is always false [-Werror,-Wtautological-compare] if (blkno < 0 || blkno >= nblocks) ~~~~~ ^ ~ There's a certain inconsistency to these messages: rhaas=# create table foo (a int); CREATE TABLE rhaas=# insert into foo values (1); INSERT 0 1 rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]); NOTICE: skipping tid (0, 2) because it contains an invalid offset heap_force_kill ----------------- (1 row) rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]); ERROR: invalid item pointer LOCATION: tids_same_page_fetch_offnums, heap_surgery.c:347 rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]); ERROR: block number 1 is out of range for relation "foo" From a user perspective it seems like I've made three very similar mistakes: in the first case, the offset is too high, in the second case it's too low, and in the third case the block number is out of range. But in one case I get a NOTICE and in the other two cases I get an ERROR. In one case I get the relation name and in the other two cases I don't. The two complaints about an invalid offset are phrased completely differently from each other. For example, suppose you do this: ERROR: tid (%u, %u) is invalid for relation "%s" because the block number is out of range (%u..%u) ERROR: tid (%u, %u) is invalid for relation "%s" because the item number is out of range for this block (%u..%u) ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead I think I misled you when I said to use pg_class_aclcheck. I think it should actually be pg_class_ownercheck. I think the relkind sanity check should permit RELKIND_MATVIEW also. 
It's unclear to me why the freeze logic here shouldn't do this part of what heap_prepare_freeze_tuple() does when freezing xmax:

frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;

Likewise, why should we not freeze or invalidate xvac in the case where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple() would do?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > There's a certain inconsistency to these messages: > > rhaas=# create table foo (a int); > CREATE TABLE > rhaas=# insert into foo values (1); > INSERT 0 1 > rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]); > NOTICE: skipping tid (0, 2) because it contains an invalid offset > heap_force_kill > ----------------- > > (1 row) > > rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]); > ERROR: invalid item pointer > LOCATION: tids_same_page_fetch_offnums, heap_surgery.c:347 > rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]); > ERROR: block number 1 is out of range for relation "foo" > > From a user perspective it seems like I've made three very similar > mistakes: in the first case, the offset is too high, in the second > case it's too low, and in the third case the block number is out of > range. But in one case I get a NOTICE and in the other two cases I get > an ERROR. In one case I get the relation name and in the other two > cases I don't. The two complaints about an invalid offset are phrased > completely differently from each other. For example, suppose you do > this: > > ERROR: tid (%u, %u) is invalid for relation "%s" because the block > number is out of range (%u..%u) > ERROR: tid (%u, %u) is invalid for relation "%s" because the item > number is out of range for this block (%u..%u) > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead > Thank you for your suggestions. 
To make this consistent, I am planning to do the following changes: Remove the error message to report "invalid item pointer" from tids_same_page_fetch_offnums() and expand the if-check to detect any invalid offset number in the CRITICAL section such that it not just checks if the offset number is > maxoffset, but also checks if the offset number is equal to 0. That way it would also do the job that "if (!ItemPointerIsValid)" was doing for us. Further, if any invalid block number is detected, then I am planning to skip all the tids associated with this block and move to the next block. Hence, instead of reporting the error I would report the NOTICE message to the user. The other two messages for reporting unused items and dead items remain the same. Hence, with above change, we would be reporting the following 4 messages: NOTICE: skipping all the tids in block %u for relation "%s" because the block number is out of range NOTICE: skipping tid (%u, %u) for relation "%s" because the item number is out of range for this block NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked dead NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked unused Please let me know if you are okay with the above changes or not? -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > The other two messages for reporting unused items and dead items > remain the same. Hence, with above change, we would be reporting the > following 4 messages: > > NOTICE: skipping all the tids in block %u for relation "%s" because > the block number is out of range > > NOTICE: skipping tid (%u, %u) for relation "%s" because the item > number is out of range for this block > > NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked dead > > NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked unused > > Please let me know if you are okay with the above changes or not? That seems broadly reasonable, but I would suggest phrasing the first message like this: skipping block %u for relation "%s" because the block number is out of range -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Aug 11, 2020 at 7:33 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > The other two messages for reporting unused items and dead items > > remain the same. Hence, with above change, we would be reporting the > > following 4 messages: > > > > NOTICE: skipping all the tids in block %u for relation "%s" because > > the block number is out of range > > > > NOTICE: skipping tid (%u, %u) for relation "%s" because the item > > number is out of range for this block > > > > NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked dead > > > > NOTICE: skipping tid (%u, %u) for relation "%s" because it is marked unused > > > > Please let me know if you are okay with the above changes or not? > > That seems broadly reasonable, but I would suggest phrasing the first > message like this: > > skipping block %u for relation "%s" because the block number is out of range > Okay, thanks for the confirmation. I'll do that. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Thanks Robert for the review. Please find my comments inline below: On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Attached v4 patch fixes the latest comments from Robert and Masahiko-san. > > Compiler warning: > > heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is > always false [-Werror,-Wtautological-compare] > if (blkno < 0 || blkno >= nblocks) > ~~~~~ ^ ~ > Fixed. > There's a certain inconsistency to these messages: > > rhaas=# create table foo (a int); > CREATE TABLE > rhaas=# insert into foo values (1); > INSERT 0 1 > rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]); > NOTICE: skipping tid (0, 2) because it contains an invalid offset > heap_force_kill > ----------------- > > (1 row) > > rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]); > ERROR: invalid item pointer > LOCATION: tids_same_page_fetch_offnums, heap_surgery.c:347 > rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]); > ERROR: block number 1 is out of range for relation "foo" > > From a user perspective it seems like I've made three very similar > mistakes: in the first case, the offset is too high, in the second > case it's too low, and in the third case the block number is out of > range. But in one case I get a NOTICE and in the other two cases I get > an ERROR. In one case I get the relation name and in the other two > cases I don't. The two complaints about an invalid offset are phrased > completely differently from each other. 
For example, suppose you do > this: > > ERROR: tid (%u, %u) is invalid for relation "%s" because the block > number is out of range (%u..%u) > ERROR: tid (%u, %u) is invalid for relation "%s" because the item > number is out of range for this block (%u..%u) > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead > Corrected. > I think I misled you when I said to use pg_class_aclcheck. I think it > should actually be pg_class_ownercheck. > okay, I've changed it to pg_class_ownercheck. > I think the relkind sanity check should permit RELKIND_MATVIEW also. > Yeah, actually we should allow MATVIEW, don't know why I thought of blocking it earlier. > It's unclear to me why the freeze logic here shouldn't do this part > what heap_prepare_freeze_tuple() does when freezing xmax: > > frz->t_infomask2 &= ~HEAP_HOT_UPDATED; > frz->t_infomask2 &= ~HEAP_KEYS_UPDATED; > Yeah, we should have these changes when freezing the xmax. > Likewise, why should we not freeze or invalidate xvac in the case > where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple() > would do? > Again, we should have this as well. Apart from above, this time I've also added the documentation on pg_surgery module and added a few more test-cases. Attached patch with above changes. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Hi Ashutosh

I stumbled upon this thread today, went through your patch, and it looks good. A minor suggestion in sanity_check_relation():

if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("only heap AM is supported")));

Instead of checking the access method OID, it seems better to check the handler OID like so:

if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)

The reason is that the current version of sanity_check_relation() would emit an error for the following case even when the table structure is actually heap.

create access method myam type table handler heap_tableam_handler;
create table mytable (…) using myam;

Asim
Hi Asim, Thanks for having a look into the patch and for sharing your feedback. Please find my comments inline below: On Thu, Aug 13, 2020 at 12:36 PM Asim Praveen <pasim@vmware.com> wrote: > > Hi Ashutosh > > I stumbled upon this thread today, went through your patch and it looks good. A minor suggestion in sanity_check_relation(): > > if (rel->rd_rel->relam != HEAP_TABLE_AM_OID) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("only heap AM is supported"))); > > Instead of checking the access method OID, it seems better to check the handler OID like so: > > if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID) > > The reason is current version of sanity_check_relation() would emit error for the following case even when the table structureis actually heap. > > create access method myam type table handler heap_tableam_handler; > create table mytable (…) using myam; > This looks like a very good suggestion to me. I will do this change in the next version. Just wondering if we should be doing similar changes in other contrib modules (like pgrowlocks, pageinspect and pgstattuple) as well? -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Thu, Aug 13, 2020 at 3:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > This looks like a very good suggestion to me. I will do this change in > the next version. Just wondering if we should be doing similar changes > in other contrib modules (like pgrowlocks, pageinspect and > pgstattuple) as well? It seems like it should be consistent, but I'm not sure the proposed change is really an improvement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 12 Aug 2020 at 22:27, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Thanks Robert for the review. Please find my comments inline below: > > On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > Attached v4 patch fixes the latest comments from Robert and Masahiko-san. > > > > Compiler warning: > > > > heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is > > always false [-Werror,-Wtautological-compare] > > if (blkno < 0 || blkno >= nblocks) > > ~~~~~ ^ ~ > > > > Fixed. > > > There's a certain inconsistency to these messages: > > > > rhaas=# create table foo (a int); > > CREATE TABLE > > rhaas=# insert into foo values (1); > > INSERT 0 1 > > rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]); > > NOTICE: skipping tid (0, 2) because it contains an invalid offset > > heap_force_kill > > ----------------- > > > > (1 row) > > > > rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]); > > ERROR: invalid item pointer > > LOCATION: tids_same_page_fetch_offnums, heap_surgery.c:347 > > rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]); > > ERROR: block number 1 is out of range for relation "foo" > > > > From a user perspective it seems like I've made three very similar > > mistakes: in the first case, the offset is too high, in the second > > case it's too low, and in the third case the block number is out of > > range. But in one case I get a NOTICE and in the other two cases I get > > an ERROR. In one case I get the relation name and in the other two > > cases I don't. The two complaints about an invalid offset are phrased > > completely differently from each other. 
For example, suppose you do > > this: > > > > ERROR: tid (%u, %u) is invalid for relation "%s" because the block > > number is out of range (%u..%u) > > ERROR: tid (%u, %u) is invalid for relation "%s" because the item > > number is out of range for this block (%u..%u) > > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused > > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead > > > > Corrected. > > > I think I misled you when I said to use pg_class_aclcheck. I think it > > should actually be pg_class_ownercheck. > > > > okay, I've changed it to pg_class_ownercheck. > > > I think the relkind sanity check should permit RELKIND_MATVIEW also. > > > > Yeah, actually we should allow MATVIEW, don't know why I thought of > blocking it earlier. > > > It's unclear to me why the freeze logic here shouldn't do this part > > what heap_prepare_freeze_tuple() does when freezing xmax: > > > > frz->t_infomask2 &= ~HEAP_HOT_UPDATED; > > frz->t_infomask2 &= ~HEAP_KEYS_UPDATED; > > > > Yeah, we should have these changes when freezing the xmax. > > > Likewise, why should we not freeze or invalidate xvac in the case > > where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple() > > would do? > > > > Again, we should have this as well. > > Apart from above, this time I've also added the documentation on > pg_surgery module and added a few more test-cases. > > Attached patch with above changes. > Thank you for updating the patch! Here are my comments on v5 patch: --- a/contrib/Makefile +++ b/contrib/Makefile @@ -35,6 +35,7 @@ SUBDIRS = \ pg_standby \ pg_stat_statements \ pg_trgm \ + pg_surgery \ pgcrypto \ I guess we use alphabetical order here. So pg_surgery should be placed before pg_trgm. --- + if (heap_force_opt == HEAP_FORCE_KILL) + ItemIdSetDead(itemid); I think that if the page is an all-visible page, we should clear an all-visible bit on the visibility map corresponding to the page and PD_ALL_VISIBLE on the page header. 
Otherwise, index only scan would return the wrong results.

---
+ /*
+ * We do not mark the buffer dirty or do WAL logging for unmodifed
+ * pages.
+ */
+ if (!did_modify_page)
+ goto skip_wal;
+
+ /* Mark buffer dirty before we write WAL. */
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);
+
+skip_wal:
+ END_CRIT_SECTION();
+

s/unmodifed/unmodified/

Do we really need to use goto? I think we can modify it as follows:

if (did_modify_page)
{
    /* Mark buffer dirty before we write WAL. */
    MarkBufferDirty(buf);

    /* XLOG stuff */
    if (RelationNeedsWAL(rel))
        log_newpage_buffer(buf, true);
}

END_CRIT_SECTION();

---
pg_force_freeze() can revive a tuple that is already deleted but not vacuumed yet. Therefore, the user might need to reindex indexes after using that function. For instance, with the following script, the last two queries (an index scan and a seq scan) will return different results.

set enable_seqscan to off;
set enable_bitmapscan to off;
set enable_indexonlyscan to off;
create table tbl (a int primary key);
insert into tbl values (1);

update tbl set a = a + 100 where a = 1;

explain analyze select * from tbl where a < 200;

-- revive deleted tuple on heap
select heap_force_freeze('tbl', array['(0,1)'::tid]);

-- index scan returns 2 tuples
explain analyze select * from tbl where a < 200;

-- seq scan returns 1 tuple
set enable_seqscan to on;
explain analyze select * from tbl;

Also, if a tuple that was updated and moved to another partition is revived by heap_force_freeze(), its ctid still has the special values MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't see a problem yet caused by a visible tuple having the special ctid value, but it might be worth considering either resetting the ctid value as well or not freezing an already-deleted tuple.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Masahiko-san, Thanks for the review. Please check the comments inline below: On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > Thank you for updating the patch! Here are my comments on v5 patch: > > --- a/contrib/Makefile > +++ b/contrib/Makefile > @@ -35,6 +35,7 @@ SUBDIRS = \ > pg_standby \ > pg_stat_statements \ > pg_trgm \ > + pg_surgery \ > pgcrypto \ > > I guess we use alphabetical order here. So pg_surgery should be placed > before pg_trgm. > Okay, will take care of this in the next version of patch. > --- > + if (heap_force_opt == HEAP_FORCE_KILL) > + ItemIdSetDead(itemid); > > I think that if the page is an all-visible page, we should clear an > all-visible bit on the visibility map corresponding to the page and > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would > return the wrong results. > I think we should let VACUUM do that. Please note that this module is intended to be used only on a damaged relation and should only be operated on damaged tuples of such relations. And the execution of any of the functions provided by this module on a damaged relation must be followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation. This is necessary to bring back a damaged relation to the sane state once a surgery is performed on it. I will try to add this note in the documentation for this module. > --- > + /* > + * We do not mark the buffer dirty or do WAL logging for unmodifed > + * pages. > + */ > + if (!did_modify_page) > + goto skip_wal; > + > + /* Mark buffer dirty before we write WAL. */ > + MarkBufferDirty(buf); > + > + /* XLOG stuff */ > + if (RelationNeedsWAL(rel)) > + log_newpage_buffer(buf, true); > + > +skip_wal: > + END_CRIT_SECTION(); > + > > s/unmodifed/unmodified/ > okay, will fix this typo. > Do we really need to use goto? I think we can modify it like follows: > > if (did_modity_page) > { > /* Mark buffer dirty before we write WAL. 
*/ > MarkBufferDirty(buf); > > /* XLOG stuff */ > if (RelationNeedsWAL(rel)) > log_newpage_buffer(buf, true); > } > > END_CRIT_SECTION(); > No, we don't need it. We can achieve the same by checking the status of did_modify_page flag as you suggested. I will do this change in the next version. > --- > pg_force_freeze() can revival a tuple that is already deleted but not > vacuumed yet. Therefore, the user might need to reindex indexes after > using that function. For instance, with the following script, the last > two queries: index scan and seq scan, will return different results. > > set enable_seqscan to off; > set enable_bitmapscan to off; > set enable_indexonlyscan to off; > create table tbl (a int primary key); > insert into tbl values (1); > > update tbl set a = a + 100 where a = 1; > > explain analyze select * from tbl where a < 200; > > -- revive deleted tuple on heap > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > -- index scan returns 2 tuples > explain analyze select * from tbl where a < 200; > > -- seq scan returns 1 tuple > set enable_seqscan to on; > explain analyze select * from tbl; > I am not sure if this is the right use-case of pg_force_freeze function. I think we should only be running pg_force_freeze function on a tuple for which VACUUM reports "found xmin ABC from before relfrozenxid PQR" sort of error otherwise it might worsen the things instead of making it better. Now, the question is - can VACUUM report this type of error for a deleted tuple or it would only report it for a live tuple? AFAIU this won't be reported for the deleted tuples because VACUUM wouldn't consider freezing a tuple that has been deleted. > Also, if a tuple updated and moved to another partition is revived by > heap_force_freeze(), its ctid still has special values: > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. 
I don't > see a problem yet caused by a visible tuple having the special ctid > value, but it might be worth considering either to reset ctid value as > well or to not freezing already-deleted tuple. > For this as well, the answer remains the same as above. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Hello Masahiko-san,

I've spent some more time trying to understand the code in the lazy_scan_heap() function to know under what circumstances a VACUUM can fail with a "found xmin ... before relfrozenxid ..." error for a tuple whose xmin is behind relfrozenxid. Here are my observations:

1) It can fail with this error for a live tuple. OR,

2) It can also fail with this error if a tuple (that went through update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE. OR,

3) If there are any concurrent transactions, then the tuple might be marked as HEAPTUPLE_INSERT_IN_PROGRESS, HEAPTUPLE_DELETE_IN_PROGRESS, or HEAPTUPLE_RECENTLY_DEAD, in which case VACUUM can also fail with this error.

Now, AFAIU, as we will be dealing with a damaged table, point #3 looks impossible as the cause of this error in our case, because I don't think we will be doing anything in parallel when performing surgery on a damaged table; in fact, we shouldn't be doing any such things. However, it is quite possible that reason #2 could cause VACUUM to fail with this sort of error; but, as we are already skipping redirected item pointers in heap_force_common(), I think we would never be marking a HEAP_HOT_UPDATED tuple as frozen, and I don't see any problem in marking a HEAP_ONLY_TUPLE as frozen. So, probably, we may not need to handle point #2 either.

Further, I also don't see VACUUM reporting this error for a tuple that has been moved from one partition to another. So, I think we might not need to do any special handling for a tuple that got updated and whose new version was moved to another partition. If you feel I am missing something here, please correct me. Thank you.

Moreover, while exploring the above, I noticed that in lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum(), we check for redirected item pointers, and if any redirected item pointer is detected we do not call HeapTupleSatisfiesVacuum().
So, not sure how HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked with HEAP_HOT_UPDATED. I am referring to the following code in lazy_scan_heap(). for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum)) { ItemId itemid; itemid = PageGetItemId(page, offnum); ............. ............. /* Redirect items mustn't be touched */ <-- this check would bypass the redirected item pointers from being checked for HeapTupleSatisfiesVacuum. if (ItemIdIsRedirected(itemid)) { hastup = true; /* this page won't be truncatable */ continue; } .............. .............. switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf)) { case HEAPTUPLE_DEAD: if (HeapTupleIsHotUpdated(&tuple) || HeapTupleIsHeapOnly(&tuple) || params->index_cleanup == VACOPT_TERNARY_DISABLED) nkeep += 1; else tupgone = true; /* we can delete the tuple */ .............. .............. } So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true? -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Hello Masahiko-san, > > Thanks for the review. Please check the comments inline below: > > On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > Thank you for updating the patch! Here are my comments on v5 patch: > > > > --- a/contrib/Makefile > > +++ b/contrib/Makefile > > @@ -35,6 +35,7 @@ SUBDIRS = \ > > pg_standby \ > > pg_stat_statements \ > > pg_trgm \ > > + pg_surgery \ > > pgcrypto \ > > > > I guess we use alphabetical order here. So pg_surgery should be placed > > before pg_trgm. > > > > Okay, will take care of this in the next version of patch. 
> > ---
> > + if (heap_force_opt == HEAP_FORCE_KILL)
> > +     ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with the DISABLE_PAGE_SKIPPING option on that
> relation. This is necessary to bring a damaged relation back to a sane
> state once surgery has been performed on it. I will try to add this
> note to the documentation for this module.
>
> > ---
> > + /*
> > +  * We do not mark the buffer dirty or do WAL logging for unmodifed
> > +  * pages.
> > +  */
> > + if (!did_modify_page)
> > +     goto skip_wal;
> > +
> > + /* Mark buffer dirty before we write WAL. */
> > + MarkBufferDirty(buf);
> > +
> > + /* XLOG stuff */
> > + if (RelationNeedsWAL(rel))
> > +     log_newpage_buffer(buf, true);
> > +
> > +skip_wal:
> > + END_CRIT_SECTION();
> > +
> >
> > s/unmodifed/unmodified/
>
> Okay, will fix this typo.
>
> > Do we really need to use goto? I think we can modify it as follows:
> >
> > if (did_modify_page)
> > {
> >     /* Mark buffer dirty before we write WAL. */
> >     MarkBufferDirty(buf);
> >
> >     /* XLOG stuff */
> >     if (RelationNeedsWAL(rel))
> >         log_newpage_buffer(buf, true);
> > }
> >
> > END_CRIT_SECTION();
>
> No, we don't need it. We can achieve the same by checking the status
> of the did_modify_page flag as you suggested. I will make this change
> in the next version.
>
> > ---
> > pg_force_freeze() can revive a tuple that is already deleted but not
> > vacuumed yet. Therefore, the user might need to reindex indexes after
> > using that function. For instance, with the following script, the
> > last two queries, an index scan and a seq scan, will return different
> > results.
> >
> > set enable_seqscan to off;
> > set enable_bitmapscan to off;
> > set enable_indexonlyscan to off;
> > create table tbl (a int primary key);
> > insert into tbl values (1);
> >
> > update tbl set a = a + 100 where a = 1;
> >
> > explain analyze select * from tbl where a < 200;
> >
> > -- revive deleted tuple on heap
> > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> >
> > -- index scan returns 2 tuples
> > explain analyze select * from tbl where a < 200;
> >
> > -- seq scan returns 1 tuple
> > set enable_seqscan to on;
> > explain analyze select * from tbl;
>
> I am not sure if this is the right use-case for the pg_force_freeze
> function. I think we should only be running pg_force_freeze on a tuple
> for which VACUUM reports a "found xmin ABC from before relfrozenxid
> PQR" sort of error; otherwise it might worsen things instead of making
> them better. Now, the question is: can VACUUM report this type of
> error for a deleted tuple, or would it only report it for a live
> tuple? AFAIU, this won't be reported for deleted tuples, because
> VACUUM wouldn't consider freezing a tuple that has been deleted.
>
> > Also, if a tuple updated and moved to another partition is revived by
> > heap_force_freeze(), its ctid still has special values:
> > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > see a problem yet caused by a visible tuple having the special ctid
> > value, but it might be worth considering either resetting the ctid
> > value as well or not freezing an already-deleted tuple.
>
> For this as well, the answer remains the same as above.
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com
Attached is the new version of the patch that addresses the comments
from Asim Praveen and Masahiko-san. It also improves the documentation
to some extent.

On Tue, Aug 18, 2020 at 1:46 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hello Masahiko-san,
>
> I've spent some more time trying to understand the code in
> lazy_scan_heap function to know under what all circumstances a VACUUM
> can fail with "found xmin ... before relfrozenxid ..." error for a
> tuple whose xmin is behind relfrozenxid.
> [...]
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Jul 13, 2020 at 2:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> 1. There's nothing to identify the tuple that has the problem, and no
> way to know how many more of them there might be. Back-patching
> b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> part of this.
I am in favor of backpatching such changes in cases where senior
community members feel that it could help with hypothetical
undiscovered data corruption issues -- if they're willing to take
responsibility for the change. It certainly wouldn't be the first
time. A "defense in depth" mindset seems like the right one when it
comes to data corruption bugs. Early detection is really important.
> Moreover, not everyone is as
> interested in an extended debugging exercise as they are in getting
> the system working again, and VACUUM failing repeatedly is a pretty
> serious problem.
That's absolutely consistent with my experience. Most users want to
get back to business as usual now, while letting somebody else do the
hard work of debugging.
> Therefore, one of my colleagues has - at my request - created a couple
> of functions called heap_force_kill() and heap_force_freeze() which
> take an array of TIDs.
> So I have these questions:
>
> - Do people think it would be smart/good/useful to include something
> like this in PostgreSQL?
I'm in favor of it.
> - If so, how? I would propose a new contrib module that we back-patch
> all the way, because the VACUUM errors were back-patched all the way,
> and there seems to be no advantage in making people wait 5 years for a
> new version that has some kind of tooling in this area.
I'm in favor of it being *possible* to backpatch tooling that is
clearly related to correctness in a fundamental way. Obviously this
would mean that we'd be revising our general position on backpatching
to allow some limited exceptions around corruption. I'm not sure that
this meets that standard, though. It's hardly something that we can
expect all that many users to be able to use effectively.
I may be biased, but I'd be inclined to permit it in the case of
something like amcheck, or pg_visibility, on the grounds that they're
more or less the same as the new VACUUM errcontext instrumentation you
mentioned. The same cannot be said of something like this new
heap_force_kill() stuff.
> - Any ideas for additional things we should include, or improvements
> on the sketch above?
Clearly you should work out a way of making it very hard to
accidentally (mis)use. For example, maybe you make the functions check
for the presence of a sentinel file in the data directory.
--
Peter Geoghegan
> ... and found no issues. I will continue testing same with updated patch
> posted on this thread.
Thanks Rajkumar for testing the patch.
Here are some additional test-cases that I would suggest you
execute, if possible:
1) You may try running the test-cases that you have executed so far
with a streaming replication setup and see if the changes are getting
reflected on the standby.
2) You may also try running some concurrent test-cases, e.g. try
running these functions with VACUUM or some other SQL commands
(preferably DML commands) in parallel.
3) See what happens when you pass some invalid TIDs (containing an
invalid block or offset number) to these functions. You may also try
running these functions on the same tuple repeatedly and see the
behaviour.
...
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On 2020-Aug-17, Ashutosh Sharma wrote:

> > + if (heap_force_opt == HEAP_FORCE_KILL)
> > +     ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.

It makes sense to recommend VACUUM after fixing the page, but I agree
with Sawada-san that it would be sensible to reset the VM bit while
doing surgery, since that's the state that the page would be in. We
should certainly *strongly recommend* to do VACUUM
DISABLE_PAGE_SKIPPING, but if users fail to do so, then leaving the VM
bit set just means that we know *for certain* that there will be
further corruption as soon as the XID counter advances sufficiently.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hello Masahiko-san,
>
> Thanks for the review. Please check the comments inline below:
> [...]
>
> I am not sure if this is the right use-case of the pg_force_freeze
> function. I think we should only be running the pg_force_freeze
> function on a tuple for which VACUUM reports a "found xmin ABC from
> before relfrozenxid PQR" sort of error; otherwise it might worsen
> things instead of making them better.

Should this also be documented? I think that it's hard to force the
user to always use this module in the right situation, but we need to
show at least when to use it.

> > Also, if a tuple updated and moved to another partition is revived by
> > heap_force_freeze(), its ctid still has special values:
> > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > see a problem yet caused by a visible tuple having the special ctid
> > value, but it might be worth considering either to reset ctid value
> > as well or to not freeze an already-deleted tuple.
>
> For this as well, the answer remains the same as above.

Perhaps the same is true when a tuple header is corrupted, including
xmin and ctid, for some reason and the user wants to fix it? I'm
concerned that a live tuple having a wrong ctid will cause a SEGV or
PANIC error in the future.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 18, 2020 at 9:44 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2020-Aug-17, Ashutosh Sharma wrote:
>
> > > + if (heap_force_opt == HEAP_FORCE_KILL)
> > > +     ItemIdSetDead(itemid);
> > >
> > > I think that if the page is an all-visible page, we should clear an
> > > all-visible bit on the visibility map corresponding to the page and
> > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > > return the wrong results.
> >
> > I think we should let VACUUM do that. [...]
>
> It makes sense to recommend VACUUM after fixing the page, but I agree
> with Sawada-san that it would be sensible to reset the VM bit while
> doing surgery, since that's the state that the page would be in.

Sure, I will try to do that change, but I would still recommend always
running VACUUM with the DISABLE_PAGE_SKIPPING option on the relation
that underwent surgery.

> We should certainly *strongly recommend* to do VACUUM
> DISABLE_PAGE_SKIPPING, but if users fail to do so, then leaving the VM
> bit set just means that we know *for certain* that there will be
> further corruption as soon as the XID counter advances sufficiently.

Yeah, I've already added a note for this in the documentation:

Note: "After a surgery is performed on a damaged relation using this
module, we must run VACUUM with DISABLE_PAGE_SKIPPING option on that
relation to bring it back into a sane state."

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>
> On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> [...]
>
> Should this also be documented? I think that it's hard to force the
> user to always use this module in the right situation but we need to
> show at least when to use.

I've already added some examples in the documentation explaining the
use-case of the force_freeze function. If required, I will also add a
note about it.

> Perhaps the same is true when a tuple header is corrupted including
> xmin and ctid for some reason and the user wants to fix it? I'm
> concerned that a live tuple having the wrong ctid will cause SEGV or
> PANIC error in the future.

If a tuple header itself is corrupted, then I think we must kill that
tuple. If only the xmin and t_ctid fields are corrupted, then probably
we can think of resetting the ctid value of that tuple. However, it
won't always be possible to detect a corrupted ctid value. It's quite
possible that the corrupted ctid field has valid values for the block
number and offset number in it, but it's actually corrupted, and it
would be difficult to consider such a ctid as corrupted. Hence, we
can't do anything about such types of corruption. Probably in such
cases we need to run VACUUM FULL on such tables so that a new ctid
gets generated for each tuple in the table.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> [...]
>
> If a tuple header itself is corrupted, then I think we must kill that
> tuple. If only xmin and t_ctid fields are corrupted, then probably we
> can think of resetting the ctid value of that tuple. However, it won't
> be always possible to detect the corrupted ctid value. [...] Probably
> in such cases we need to run VACUUM FULL on such tables so that new
> ctid gets generated for each tuple in the table.

Understood.

Perhaps such corruption will be able to be detected by the new heapam
check functions discussed on another thread. My point was that it
might be better to attempt to bring the tuple header to as sane a
state as possible when fixing a live tuple, in order to prevent
further problems such as database crashes caused by malicious
attackers.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > > > > pg_force_freeze() can revival a tuple that is already deleted but not > > > > > vacuumed yet. Therefore, the user might need to reindex indexes after > > > > > using that function. For instance, with the following script, the last > > > > > two queries: index scan and seq scan, will return different results. > > > > > > > > > > set enable_seqscan to off; > > > > > set enable_bitmapscan to off; > > > > > set enable_indexonlyscan to off; > > > > > create table tbl (a int primary key); > > > > > insert into tbl values (1); > > > > > > > > > > update tbl set a = a + 100 where a = 1; > > > > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > -- revive deleted tuple on heap > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > > > > > > > > > -- index scan returns 2 tuples > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > -- seq scan returns 1 tuple > > > > > set enable_seqscan to on; > > > > > explain analyze select * from tbl; > > > > > > > > > > > > > I am not sure if this is the right use-case of pg_force_freeze > > > > function. I think we should only be running pg_force_freeze function > > > > on a tuple for which VACUUM reports "found xmin ABC from before > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things > > > > instead of making it better. > > > > > > Should this also be documented? I think that it's hard to force the > > > user to always use this module in the right situation but we need to > > > show at least when to use. 
> > > > > > > I've already added some examples in the documentation explaining the > > use-case of force_freeze function. If required, I will also add a note > > about it. > > > > > > > Also, if a tuple updated and moved to another partition is revived by > > > > > heap_force_freeze(), its ctid still has special values: > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't > > > > > see a problem yet caused by a visible tuple having the special ctid > > > > > value, but it might be worth considering either to reset ctid value as > > > > > well or to not freezing already-deleted tuple. > > > > > > > > > > > > > For this as well, the answer remains the same as above. > > > > > > Perhaps the same is true when a tuple header is corrupted including > > > xmin and ctid for some reason and the user wants to fix it? I'm > > > concerned that a live tuple having the wrong ctid will cause SEGV or > > > PANIC error in the future. > > > > > > > If a tuple header itself is corrupted, then I think we must kill that > > tuple. If only xmin and t_ctid fields are corrupted, then probably we > > can think of resetting the ctid value of that tuple. However, it won't > > be always possible to detect the corrupted ctid value. It's quite > > possible that the corrupted ctid field has valid values for block > > number and offset number in it, but it's actually corrupted and it > > would be difficult to consider such ctid as corrupted. Hence, we can't > > do anything about such types of corruption. Probably in such cases we > > need to run VACUUM FULL on such tables so that new ctid gets generated > > for each tuple in the table. > > Understood. > > Perhaps such corruption will be able to be detected by new heapam > check functions discussed on another thread. 
> My point was that it might be better to attempt making the tuple header
> sane state as much as possible when fixing a live tuple in order to
> prevent further problems such as databases crash by malicious attackers.
>

Agreed. So, to handle the ctid related concern that you raised, I'm
planning to do the following changes to ensure that the tuple being
marked as frozen contains the correct item pointer value. Please let
me know if you are okay with these changes.

                HeapTupleHeader htup;
+               ItemPointerData ctid;

                Assert(heap_force_opt == HEAP_FORCE_FREEZE);

+               ItemPointerSet(&ctid, blkno, offnos[i]);
+
                htup = (HeapTupleHeader) PageGetItem(page, itemid);

+               /*
+                * Make sure that this tuple holds the correct item pointer
+                * value.
+                */
+               if (!HeapTupleHeaderIndicatesMovedPartitions(htup) &&
+                   !ItemPointerEquals(&ctid, &htup->t_ctid))
+                   ItemPointerSet(&htup->t_ctid, blkno, offnos[i]);
+
                HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
                HeapTupleHeaderSetXmax(htup, InvalidTransactionId);

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
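[Editor's note: the intent of the hunk above can be seen in isolation with a minimal standalone sketch. The flat `ItemPointerData` struct and the `sanitize_ctid` helper below are hypothetical stand-ins invented for illustration; they are not PostgreSQL's actual bit-packed definitions and not part of the patch.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's BlockNumber/OffsetNumber/ItemPointerData. */
typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

typedef struct
{
    BlockNumber  ip_blkid;
    OffsetNumber ip_posid;
} ItemPointerData;

static bool
item_pointer_equals(const ItemPointerData *a, const ItemPointerData *b)
{
    return a->ip_blkid == b->ip_blkid && a->ip_posid == b->ip_posid;
}

/*
 * If the tuple's stored t_ctid does not point back at the tuple's own
 * position, and the tuple is not marked as moved to another partition,
 * overwrite it with the correct self-referencing value.  Returns true
 * if the ctid was reset.
 */
static bool
sanitize_ctid(ItemPointerData *t_ctid, bool moved_partitions,
              BlockNumber blkno, OffsetNumber offnum)
{
    ItemPointerData self = {blkno, offnum};

    if (!moved_partitions && !item_pointer_equals(&self, t_ctid))
    {
        *t_ctid = self;
        return true;
    }
    return false;
}
```

The `moved_partitions` flag plays the role of the `HeapTupleHeaderIndicatesMovedPartitions()` test in the real hunk: a tuple whose ctid carries the moved-partitions sentinel values must not have that sentinel clobbered.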
On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada > > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > > > > > > pg_force_freeze() can revival a tuple that is already deleted but not > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes after > > > > > > using that function. For instance, with the following script, the last > > > > > > two queries: index scan and seq scan, will return different results. > > > > > > > > > > > > set enable_seqscan to off; > > > > > > set enable_bitmapscan to off; > > > > > > set enable_indexonlyscan to off; > > > > > > create table tbl (a int primary key); > > > > > > insert into tbl values (1); > > > > > > > > > > > > update tbl set a = a + 100 where a = 1; > > > > > > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > > > -- revive deleted tuple on heap > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > > > > > > > > > > > -- index scan returns 2 tuples > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > > > -- seq scan returns 1 tuple > > > > > > set enable_seqscan to on; > > > > > > explain analyze select * from tbl; > > > > > > > > > > > > > > > > I am not sure if this is the right use-case of pg_force_freeze > > > > > function. I think we should only be running pg_force_freeze function > > > > > on a tuple for which VACUUM reports "found xmin ABC from before > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things > > > > > instead of making it better. > > > > > > > > Should this also be documented? 
I think that it's hard to force the > > > > user to always use this module in the right situation but we need to > > > > show at least when to use. > > > > > > > > > > I've already added some examples in the documentation explaining the > > > use-case of force_freeze function. If required, I will also add a note > > > about it. > > > > > > > > > Also, if a tuple updated and moved to another partition is revived by > > > > > > heap_force_freeze(), its ctid still has special values: > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't > > > > > > see a problem yet caused by a visible tuple having the special ctid > > > > > > value, but it might be worth considering either to reset ctid value as > > > > > > well or to not freezing already-deleted tuple. > > > > > > > > > > > > > > > > For this as well, the answer remains the same as above. > > > > > > > > Perhaps the same is true when a tuple header is corrupted including > > > > xmin and ctid for some reason and the user wants to fix it? I'm > > > > concerned that a live tuple having the wrong ctid will cause SEGV or > > > > PANIC error in the future. > > > > > > > > > > If a tuple header itself is corrupted, then I think we must kill that > > > tuple. If only xmin and t_ctid fields are corrupted, then probably we > > > can think of resetting the ctid value of that tuple. However, it won't > > > be always possible to detect the corrupted ctid value. It's quite > > > possible that the corrupted ctid field has valid values for block > > > number and offset number in it, but it's actually corrupted and it > > > would be difficult to consider such ctid as corrupted. Hence, we can't > > > do anything about such types of corruption. Probably in such cases we > > > need to run VACUUM FULL on such tables so that new ctid gets generated > > > for each tuple in the table. > > > > Understood. 
> > > > Perhaps such corruption will be able to be detected by new heapam > > check functions discussed on another thread. My point was that it > > might be better to attempt making the tuple header sane state as much > > as possible when fixing a live tuple in order to prevent further > > problems such as databases crash by malicious attackers. > > > > Agreed. So, to handle the ctid related concern that you raised, I'm > planning to do the following changes to ensure that the tuple being > marked as frozen contains the correct item pointer value. Please let > me know if you are okay with these changes. Given that a live tuple never indicates to have moved partitions, I guess the first condition in the if statement is not necessary. The rest looks good to me, although other hackers might think differently. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada > > > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > > > > > > > > pg_force_freeze() can revival a tuple that is already deleted but not > > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes after > > > > > > > using that function. For instance, with the following script, the last > > > > > > > two queries: index scan and seq scan, will return different results. > > > > > > > > > > > > > > set enable_seqscan to off; > > > > > > > set enable_bitmapscan to off; > > > > > > > set enable_indexonlyscan to off; > > > > > > > create table tbl (a int primary key); > > > > > > > insert into tbl values (1); > > > > > > > > > > > > > > update tbl set a = a + 100 where a = 1; > > > > > > > > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > > > > > -- revive deleted tuple on heap > > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > > > > > > > > > > > > > -- index scan returns 2 tuples > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > > > > > -- seq scan returns 1 tuple > > > > > > > set enable_seqscan to on; > > > > > > > explain analyze select * from tbl; > > > > > > > > > > > > > > > > > > > I am not sure if this is the right use-case of pg_force_freeze > > > > > > function. 
I think we should only be running pg_force_freeze function > > > > > > on a tuple for which VACUUM reports "found xmin ABC from before > > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things > > > > > > instead of making it better. > > > > > > > > > > Should this also be documented? I think that it's hard to force the > > > > > user to always use this module in the right situation but we need to > > > > > show at least when to use. > > > > > > > > > > > > > I've already added some examples in the documentation explaining the > > > > use-case of force_freeze function. If required, I will also add a note > > > > about it. > > > > > > > > > > > Also, if a tuple updated and moved to another partition is revived by > > > > > > > heap_force_freeze(), its ctid still has special values: > > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't > > > > > > > see a problem yet caused by a visible tuple having the special ctid > > > > > > > value, but it might be worth considering either to reset ctid value as > > > > > > > well or to not freezing already-deleted tuple. > > > > > > > > > > > > > > > > > > > For this as well, the answer remains the same as above. > > > > > > > > > > Perhaps the same is true when a tuple header is corrupted including > > > > > xmin and ctid for some reason and the user wants to fix it? I'm > > > > > concerned that a live tuple having the wrong ctid will cause SEGV or > > > > > PANIC error in the future. > > > > > > > > > > > > > If a tuple header itself is corrupted, then I think we must kill that > > > > tuple. If only xmin and t_ctid fields are corrupted, then probably we > > > > can think of resetting the ctid value of that tuple. However, it won't > > > > be always possible to detect the corrupted ctid value. 
It's quite > > > > possible that the corrupted ctid field has valid values for block > > > > number and offset number in it, but it's actually corrupted and it > > > > would be difficult to consider such ctid as corrupted. Hence, we can't > > > > do anything about such types of corruption. Probably in such cases we > > > > need to run VACUUM FULL on such tables so that new ctid gets generated > > > > for each tuple in the table. > > > > > > Understood. > > > > > > Perhaps such corruption will be able to be detected by new heapam > > > check functions discussed on another thread. My point was that it > > > might be better to attempt making the tuple header sane state as much > > > as possible when fixing a live tuple in order to prevent further > > > problems such as databases crash by malicious attackers. > > > > > > > Agreed. So, to handle the ctid related concern that you raised, I'm > > planning to do the following changes to ensure that the tuple being > > marked as frozen contains the correct item pointer value. Please let > > me know if you are okay with these changes. > > Given that a live tuple never indicates to ve moved partitions, I > guess the first condition in the if statement is not necessary. The > rest looks good to me, although other hackers might think differently. > Okay, thanks for confirming. I am planning to go ahead with this approach. Will later see what others have to say about it. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Hi Masahiko-san, Please find the updated patch with the following new changes: 1) It adds the code changes in heap_force_kill function to clear an all-visible bit on the visibility map corresponding to the page that is marked all-visible. Along the way it also clears PD_ALL_VISIBLE flag on the page header. 2) It adds the code changes in heap_force_freeze function to reset the ctid value in a tuple header if it is corrupted. 3) It adds several notes and examples in the documentation stating when and how we need to use the functions provided by this module. Please have a look and let me know for any other concern. Thanks, -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Thu, Aug 20, 2020 at 11:43 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada > > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > > > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada > > > > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > > > > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > > > > > > > > > > pg_force_freeze() can revival a tuple that is already deleted but not > > > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes after > > > > > > > > using that function. For instance, with the following script, the last > > > > > > > > two queries: index scan and seq scan, will return different results. 
> > > > > > > > > > > > > > > > set enable_seqscan to off; > > > > > > > > set enable_bitmapscan to off; > > > > > > > > set enable_indexonlyscan to off; > > > > > > > > create table tbl (a int primary key); > > > > > > > > insert into tbl values (1); > > > > > > > > > > > > > > > > update tbl set a = a + 100 where a = 1; > > > > > > > > > > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > > > > > > > -- revive deleted tuple on heap > > > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]); > > > > > > > > > > > > > > > > -- index scan returns 2 tuples > > > > > > > > explain analyze select * from tbl where a < 200; > > > > > > > > > > > > > > > > -- seq scan returns 1 tuple > > > > > > > > set enable_seqscan to on; > > > > > > > > explain analyze select * from tbl; > > > > > > > > > > > > > > > > > > > > > > I am not sure if this is the right use-case of pg_force_freeze > > > > > > > function. I think we should only be running pg_force_freeze function > > > > > > > on a tuple for which VACUUM reports "found xmin ABC from before > > > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things > > > > > > > instead of making it better. > > > > > > > > > > > > Should this also be documented? I think that it's hard to force the > > > > > > user to always use this module in the right situation but we need to > > > > > > show at least when to use. > > > > > > > > > > > > > > > > I've already added some examples in the documentation explaining the > > > > > use-case of force_freeze function. If required, I will also add a note > > > > > about it. > > > > > > > > > > > > > Also, if a tuple updated and moved to another partition is revived by > > > > > > > > heap_force_freeze(), its ctid still has special values: > > > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. 
I don't > > > > > > > > see a problem yet caused by a visible tuple having the special ctid > > > > > > > > value, but it might be worth considering either to reset ctid value as > > > > > > > > well or to not freezing already-deleted tuple. > > > > > > > > > > > > > > > > > > > > > > For this as well, the answer remains the same as above. > > > > > > > > > > > > Perhaps the same is true when a tuple header is corrupted including > > > > > > xmin and ctid for some reason and the user wants to fix it? I'm > > > > > > concerned that a live tuple having the wrong ctid will cause SEGV or > > > > > > PANIC error in the future. > > > > > > > > > > > > > > > > If a tuple header itself is corrupted, then I think we must kill that > > > > > tuple. If only xmin and t_ctid fields are corrupted, then probably we > > > > > can think of resetting the ctid value of that tuple. However, it won't > > > > > be always possible to detect the corrupted ctid value. It's quite > > > > > possible that the corrupted ctid field has valid values for block > > > > > number and offset number in it, but it's actually corrupted and it > > > > > would be difficult to consider such ctid as corrupted. Hence, we can't > > > > > do anything about such types of corruption. Probably in such cases we > > > > > need to run VACUUM FULL on such tables so that new ctid gets generated > > > > > for each tuple in the table. > > > > > > > > Understood. > > > > > > > > Perhaps such corruption will be able to be detected by new heapam > > > > check functions discussed on another thread. My point was that it > > > > might be better to attempt making the tuple header sane state as much > > > > as possible when fixing a live tuple in order to prevent further > > > > problems such as databases crash by malicious attackers. > > > > > > > > > > Agreed. 
So, to handle the ctid related concern that you raised, I'm > > > planning to do the following changes to ensure that the tuple being > > > marked as frozen contains the correct item pointer value. Please let > > > me know if you are okay with these changes. > > > > Given that a live tuple never indicates to ve moved partitions, I > > guess the first condition in the if statement is not necessary. The > > rest looks good to me, although other hackers might think differently. > > > > Okay, thanks for confirming. I am planning to go ahead with this > approach. Will later see what others have to say about it. > > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com
On Tue, Aug 18, 2020 at 12:14 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > It makes sense to recommend VACUUM after fixing the page, but I agree > with Sawada-san that it would be sensible to reset the VM bit while > doing surgery, since that's the state that the page would be in. We > should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING, > but if users fail to do so, then leaving the VM bit set just means that > we know *for certain* that there will be further corruption as soon as > the XID counter advances sufficiently. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi! > On 21 Aug 2020, at 18:24, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Please find the updated patch with the following new changes: Do you have plans to support pg_surgery as an external extension? For example, for earlier versions of Postgres and for new features, the way amcheck_next is maintained. ISTM that I'll have to use something like that tomorrow and I'm in doubt - should I resurrect our pg_dirty_hands or try your new pg_surgery... Thanks! Best regards, Andrey Borodin.
On Mon, Aug 24, 2020 at 7:15 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Aug 18, 2020 at 12:14 PM Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > It makes sense to recommend VACUUM after fixing the page, but I agree > > with Sawada-san that it would be sensible to reset the VM bit while > > doing surgery, since that's the state that the page would be in. We > > should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING, > > but if users fail to do so, then leaving the VM bit set just means that > > we know *for certain* that there will be further corruption as soon as > > the XID counter advances sufficiently. > > +1. > This has been taken care of in the latest v7 patch. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Mon, Aug 24, 2020 at 11:00 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > Hi! > > > On 21 Aug 2020, at 18:24, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > Please find the updated patch with the following new changes: > > Do you have plans to support pg_surgery as an external extension? For example, for earlier versions of Postgres and for new features, the way amcheck_next is maintained. > ISTM that I'll have to use something like that tomorrow and I'm in doubt - should I resurrect our pg_dirty_hands or try your new pg_surgery... > AFAICS, we don't have any plans to support pg_surgery as an external extension as of now. Based on the discussion that has happened earlier in this thread, I think we might also back-patch this contrib module. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Masahiko-san,
>
> Please find the updated patch with the following new changes:
>

Thank you for updating the patch!

> 1) It adds the code changes in heap_force_kill function to clear an
> all-visible bit on the visibility map corresponding to the page that
> is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
> flag on the page header.

I think we need to clear all visibility map bits by using
VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but
not all-visible bit, which is not a valid state.

> 2) It adds the code changes in heap_force_freeze function to reset the
> ctid value in a tuple header if it is corrupted.
>
> 3) It adds several notes and examples in the documentation stating
> when and how we need to use the functions provided by this module.
>
> Please have a look and let me know for any other concern.
>

And here are small comments on the heap_surgery.c:

+       /*
+        * Get the offset numbers from the tids belonging to one particular page
+        * and process them one by one.
+        */
+       blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+                                            offnos);
+
+       /* Calculate the number of offsets stored in offnos array. */
+       noffs = next_start_ptr - curr_start_ptr;
+
+       /*
+        * Update the current start pointer so that next time when
+        * tids_same_page_fetch_offnums() is called, we can calculate the number
+        * of offsets present in the offnos array.
+        */
+       curr_start_ptr = next_start_ptr;
+
+       /* Check whether the block number is valid. */
+       if (blkno >= nblocks)
+       {
+           ereport(NOTICE,
+                   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                    errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+                           blkno, RelationGetRelationName(rel))));
+           continue;
+       }
+
+       CHECK_FOR_INTERRUPTS();

I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
of the do loop for safety. I think it's unlikely to happen but the
user might mistakenly specify a lot of wrong block numbers.

----

+   offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+   noffs = curr_start_ptr = next_start_ptr = 0;
+   nblocks = RelationGetNumberOfBlocks(rel);
+
+   do
+   {

(snip)

+
+       /*
+        * Get the offset numbers from the tids belonging to one particular page
+        * and process them one by one.
+        */
+       blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+                                            offnos);
+
+       /* Calculate the number of offsets stored in offnos array. */
+       noffs = next_start_ptr - curr_start_ptr;
+

(snip)

+       /* No ereport(ERROR) from here until all the changes are logged. */
+       START_CRIT_SECTION();
+
+       for (i = 0; i < noffs; i++)

You copy all offset numbers belonging to the same page to palloc'd
array, offnos, and iterate it while processing the tuples. I might be
missing something but I think we can do that without allocating the
space for offset numbers. Is there any reason for this? I guess we can
do that by just iterating the sorted tids array.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
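[Editor's note: the "just iterate the sorted tids array" suggestion can be sketched standalone. The `TidData` struct and `next_page_group` helper below are simplified, hypothetical stand-ins for illustration; they are not PostgreSQL's `ItemPointerData` or any function in the patch.]

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's block/offset types. */
typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

typedef struct
{
    BlockNumber  blkno;
    OffsetNumber offnum;
} TidData;

/*
 * Given a tid array sorted by (block, offset) and the index of the
 * first tid of a page group, return the index one past the last tid on
 * that page.  This mirrors the "next start pointer" idea from the
 * patch, but needs no separate offnos[] array: callers can process
 * tids[start .. end) in place.
 */
static size_t
next_page_group(const TidData *tids, size_t ntids, size_t start)
{
    BlockNumber blkno = tids[start].blkno;
    size_t      end = start;

    while (end < ntids && tids[end].blkno == blkno)
        end++;
    return end;
}
```

A caller would loop `start = 0; while (start < ntids) { end = next_page_group(tids, ntids, start); /* lock page tids[start].blkno, process tids[start..end) */ start = end; }`, which replaces both the palloc'd offnos[] copy and the curr/next start-pointer bookkeeping with a single pass over the sorted array.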
On Tue, 25 Aug 2020 at 17:08, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > Hi Masahiko-san, > > > > Please find the updated patch with the following new changes: > > > > Thank you for updating the patch! > > > 1) It adds the code changes in heap_force_kill function to clear an > > all-visible bit on the visibility map corresponding to the page that > > is marked all-visible. Along the way it also clears PD_ALL_VISIBLE > > flag on the page header. > > I think we need to clear all visibility map bits by using > VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but > not all-visible bit, which is not a valid state. > > > > > 2) It adds the code changes in heap_force_freeze function to reset the > > ctid value in a tuple header if it is corrupted. > > > > 3) It adds several notes and examples in the documentation stating > > when and how we need to use the functions provided by this module. > > > > Please have a look and let me know for any other concern. > > > > And here are small comments on the heap_surgery.c: > > + /* > + * Get the offset numbers from the tids belonging to one particular page > + * and process them one by one. > + */ > + blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr, > + offnos); > + > + /* Calculate the number of offsets stored in offnos array. */ > + noffs = next_start_ptr - curr_start_ptr; > + > + /* > + * Update the current start pointer so that next time when > + * tids_same_page_fetch_offnums() is called, we can calculate the number > + * of offsets present in the offnos array. > + */ > + curr_start_ptr = next_start_ptr; > + > + /* Check whether the block number is valid. 
*/
> + if (blkno >= nblocks)
> + {
> +     ereport(NOTICE,
> +             (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +              errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
> +                     blkno, RelationGetRelationName(rel))));
> +     continue;
> + }
> +
> + CHECK_FOR_INTERRUPTS();
>
> I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
> of the do loop for safety. I think it's unlikely to happen but the
> user might mistakenly specify a lot of wrong block numbers.
>
> ----
> + offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
> + noffs = curr_start_ptr = next_start_ptr = 0;
> + nblocks = RelationGetNumberOfBlocks(rel);
> +
> + do
> + {
>
> (snip)
>
> +
> + /*
> +  * Get the offset numbers from the tids belonging to one particular page
> +  * and process them one by one.
> +  */
> + blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
> +                                      offnos);
> +
> + /* Calculate the number of offsets stored in offnos array. */
> + noffs = next_start_ptr - curr_start_ptr;
> +
>
> (snip)
>
> + /* No ereport(ERROR) from here until all the changes are logged. */
> + START_CRIT_SECTION();
> +
> + for (i = 0; i < noffs; i++)
>
> You copy all offset numbers belonging to the same page to palloc'd
> array, offnos, and iterate it while processing the tuples. I might be
> missing something but I think we can do that without allocating the
> space for offset numbers. Is there any reason for this? I guess we can
> do that by just iterating the sorted tids array.
>

Let me share other comments on the latest version patch:

Some words need to be tagged. For instance, I found the following words:

VACUUM
DISABLE_PAGE_SKIPPING
HEAP_XMIN_FROZEN
HEAP_XMAX_INVALID

---
+test=# select ctid from t1 where xmin = 507;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+-[ RECORD 1 ]-----+-
+heap_force_freeze |

I think it's better to use a consistent output format. The former uses
the normal format whereas the latter uses the expanded format.

---
+ <note>
+  <para>
+   While performing surgery on a damaged relation, we must not be doing anything
+   else on that relation in parallel. This is to ensure that when we are
+   operating on a damaged tuple there is no other transaction trying to modify
+   that tuple.
+  </para>
+ </note>

If we prefer to avoid concurrent operations on the target relation why
don't we use AccessExclusiveLock?

---
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;

+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;

I think these functions should be PARALLEL UNSAFE.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Masahiko-san, Thank you for the review. Please check my comments inline below: On Tue, Aug 25, 2020 at 1:39 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > Hi Masahiko-san, > > > > Please find the updated patch with the following new changes: > > > > Thank you for updating the patch! > > > 1) It adds the code changes in heap_force_kill function to clear an > > all-visible bit on the visibility map corresponding to the page that > > is marked all-visible. Along the way it also clears PD_ALL_VISIBLE > > flag on the page header. > > I think we need to clear all visibility map bits by using > VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but > not all-visible bit, which is not a valid state. > Yeah, makes sense, I will do that change in the next version of patch. > > > > 2) It adds the code changes in heap_force_freeze function to reset the > > ctid value in a tuple header if it is corrupted. > > > > 3) It adds several notes and examples in the documentation stating > > when and how we need to use the functions provided by this module. > > > > Please have a look and let me know for any other concern. > > > > And here are small comments on the heap_surgery.c: > > + /* > + * Get the offset numbers from the tids belonging to one particular page > + * and process them one by one. > + */ > + blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr, > + offnos); > + > + /* Calculate the number of offsets stored in offnos array. */ > + noffs = next_start_ptr - curr_start_ptr; > + > + /* > + * Update the current start pointer so that next time when > + * tids_same_page_fetch_offnums() is called, we can calculate the number > + * of offsets present in the offnos array. > + */ > + curr_start_ptr = next_start_ptr; > + > + /* Check whether the block number is valid. 
*/ > + if (blkno >= nblocks) > + { > + ereport(NOTICE, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("skipping block %u for relation \"%s\" > because the block number is out of range", > + blkno, RelationGetRelationName(rel)))); > + continue; > + } > + > + CHECK_FOR_INTERRUPTS(); > > I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top > of the do loop for safety. I think it's unlikely to happen but the > user might mistakenly specify a lot of wrong block numbers. > Okay, np, will shift it to top of the do loop. > ---- > + offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber)); > + noffs = curr_start_ptr = next_start_ptr = 0; > + nblocks = RelationGetNumberOfBlocks(rel); > + > + do > + { > > (snip) > > + > + /* > + * Get the offset numbers from the tids belonging to one particular page > + * and process them one by one. > + */ > + blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr, > + offnos); > + > + /* Calculate the number of offsets stored in offnos array. */ > + noffs = next_start_ptr - curr_start_ptr; > + > > (snip) > > + /* No ereport(ERROR) from here until all the changes are logged. */ > + START_CRIT_SECTION(); > + > + for (i = 0; i < noffs; i++) > > You copy all offset numbers belonging to the same page to palloc'd > array, offnos, and iterate it while processing the tuples. I might be > missing something but I think we can do that without allocating the > space for offset numbers. Is there any reason for this? I guess we can > do that by just iterating the sorted tids array. > Hmmm.. okay, I see your point. I think probably what you are trying to suggest here is to make use of the current and next start pointers to get the tids belonging to the same page and process them one by one instead of fetching the offset numbers of all tids belonging to one page into the offnos array and then iterate through the offnos array. 
I think that is probably possible and I will try to do that in the next version of the patch. If there is something else that you have in mind, please let me know. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Tue, Aug 25, 2020 at 8:17 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > + <note> > + <para> > + While performing surgery on a damaged relation, we must not be doing anything > + else on that relation in parallel. This is to ensure that when we are > + operating on a damaged tuple there is no other transaction trying to modify > + that tuple. > + </para> > + </note> > > If we prefer to avoid concurrent operations on the target relation why > don't we use AccessExclusiveLock? I disagree with the content of the note. It's up to the user whether to perform any concurrent operations on the target relation, and in many cases it would be fine to do so. Users who can afford to take the table off-line to repair the problem don't really need this tool in the first place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Aug 25, 2020 at 11:51 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Aug 25, 2020 at 8:17 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > + <note> > > + <para> > > + While performing surgery on a damaged relation, we must not be doing anything > > + else on that relation in parallel. This is to ensure that when we are > > + operating on a damaged tuple there is no other transaction trying to modify > > + that tuple. > > + </para> > > + </note> > > > > If we prefer to avoid concurrent operations on the target relation why > > don't we use AccessExclusiveLock? > > I disagree with the content of the note. It's up to the user whether > to perform any concurrent operations on the target relation, and in > many cases it would be fine to do so. Users who can afford to take the > table off-line to repair the problem don't really need this tool in > the first place. > The only reason I added this note was to ensure that we do not revive a tuple that has been deleted but not yet vacuumed. There is one corner-case scenario, reported by you in [1], under which VACUUM can report this sort of "found xmin ... from before relfrozenxid ..." error for deleted tuples. As per the explanation provided there, it can happen when there are multiple transactions operating on the same tuple. However, I think we can take care of this scenario by making some code changes in heap_force_freeze to identify such deleted tuples and skip them. So, yeah, I will make the code changes to handle this and remove the note added in the documentation. Thank you Robert and Masahiko-san for pointing this out. [1] - https://www.postgresql.org/message-id/CA%2BTgmobfJ8CkabKJZ-1FGfvbSz%2Bb8bBX807Y6hHEtVfzVe%2Bg6A%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Thanks for the review. Please find my comments inline below: On Tue, Aug 25, 2020 at 5:47 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > Let me share other comments on the latest version patch: > > Some words need to be tagged. For instance, I found the following words: > > VACUUM > DISABLE_PAGE_SKIPPING > HEAP_XMIN_FROZEN > HEAP_XMAX_INVALID > Okay, done. > --- > +test=# select ctid from t1 where xmin = 507; > + ctid > +------- > + (0,3) > +(1 row) > + > +test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]); > +-[ RECORD 1 ]-----+- > +heap_force_freeze | > > I think it's better to use a consistent output format. The former uses > the normal format whereas the latter uses the expanded format. > Yep, makes sense, done. > --- > + <note> > + <para> > + While performing surgery on a damaged relation, we must not be doing anything > + else on that relation in parallel. This is to ensure that when we are > + operating on a damaged tuple there is no other transaction trying to modify > + that tuple. > + </para> > + </note> > > If we prefer to avoid concurrent operations on the target relation why > don't we use AccessExclusiveLock? > Removed this note from the documentation and added a note saying: "The user needs to ensure that they do not operate pg_force_freeze function on a deleted tuple because it may revive the deleted tuple." > --- > +CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[]) > +RETURNS VOID > +AS 'MODULE_PATHNAME', 'heap_force_kill' > +LANGUAGE C STRICT; > > +CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[]) > +RETURNS VOID > +AS 'MODULE_PATHNAME', 'heap_force_freeze' > +LANGUAGE C STRICT; > > I think these functions should be PARALLEL UNSAFE. > By default the functions are marked PARALLEL UNSAFE, so I think there is nothing to do here. Attached patch with above changes. This patch also takes care of all the other review comments from - [1]. 
[1] - https://www.postgresql.org/message-id/CA%2Bfd4k6%2BJWq2MfQt5b7fSJ2wMvCes9TRfbDhVO_fQP9B8JJRAA%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Attachment
On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Removed this note from the documentation and added a note saying: "The > user needs to ensure that they do not operate pg_force_freeze function > on a deleted tuple because it may revive the deleted tuple." I do not agree with that note, either. I believe that trying to tell people what things specifically they should do or avoid doing with the tool is the wrong approach. Instead, the thrust of the message should be to tell people that if you use this, it may corrupt your database, and that's your problem. The difficulty with telling people what specifically they ought to avoid doing is that experts will be annoyed to be told that something is not safe when they know that it is fine, and non-experts will think that some uses are safer than they really are. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Aug 26, 2020 at 9:19 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Removed this note from the documentation and added a note saying: "The > > user needs to ensure that they do not operate pg_force_freeze function > > on a deleted tuple because it may revive the deleted tuple." > > I do not agree with that note, either. I believe that trying to tell > people what things specifically they should do or avoid doing with the > tool is the wrong approach. Instead, the thrust of the message should > be to tell people that if you use this, it may corrupt your database, > and that's your problem. The difficulty with telling people what > specifically they ought to avoid doing is that experts will be annoyed > to be told that something is not safe when they know that it is fine, > and non-experts will think that some uses are safer than they really > are. > Okay, point noted. Thanks, -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Okay, point noted. I spent some time today working on this patch. I'm fairly happy with it now and intend to commit it if nobody sees a big problem with that. Per discussion, I do not intend to back-patch at this time. The two most significant changes I made to your version are: 1. I changed things around to avoid using any form of ereport() in a critical section. I'm not actually sure whether it is project policy to avoid ereport(NOTICE, ...) or similar in a critical section, but it seems prudent, because if anything fails in a critical section, we will PANIC, so doing fewer things there seems prudent. 2. I changed the code so that it does not try to follow redirected line pointers; instead, it skips them with an appropriate message, as we were already doing for dead and unused line pointers. I think the way you had it coded might've been my suggestion originally, but the more I looked into it the less I liked it. One problem is that it didn't match the docs. A second is that following a corrupted line pointer might index off the end of the line pointer array, and while that probably shouldn't happen, we are talking about corruption recovery here. Then I realized that, as you coded it, if the line pointer was redirected to a line pointer that is in turn dead (or unused, if there's corruption) the user would get a NOTICE complaining about a TID they hadn't specified, which seems like it would be very confusing. I thought about trying to fix all that stuff, but it just didn't seem worth it, because I can't think of a good reason to pass this function the TID of a redirected line pointer in the first place. If you're doing surgery, you should probably specify the exact thing upon which you want to operate, not some other thing that points to it. Here is a list of other changes I made: * Added a .gitignore file. 
* Renamed the regression test file from pg_surgery to heap_surgery to match the name of the single C source file we currently have. * Capitalized TID in a few places. * Ran pgindent. * Adjusted various comments. * Removed the check for an empty TID array. I don't see any reason why this should be an error case and I don't see much precedent for having such a check. * Fixed the code to work properly with that change and added a test case. * Added a check that the array is not multi-dimensional. * Put the AM type check before the relkind check, following existing precedent. * Adjusted the check to use the AM OID rather than the handler OID, following existing precedent. Fixed the message wording accordingly. * Changed the documentation wording to say less about specific recovery procedures and focus more on the general idea that this is dangerous. * Removed all but one of the test cases that checked what happens if you use this on a non-heap; three tests for basically the same thing seemed excessive. * Added some additional tests to improve code coverage. There are now only a handful of lines not covered. * Reorganized the test cases somewhat. New patch attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
> On Aug 26, 2020, at 4:36 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > This patch also takes care of all the other review comments from - [1]. > > [1] - https://www.postgresql.org/message-id/CA%2Bfd4k6%2BJWq2MfQt5b7fSJ2wMvCes9TRfbDhVO_fQP9B8JJRAA%40mail.gmail.com > > > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com > <v8-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch> Hi Ashutosh, I took a look at the v8 patch, created a commitfest entry [1] because I did not find an existing one, and have the following review comments: HeapTupleForceOption should be added to src/tools/pgindent/typedefs.list. The tidcmp function can be removed, and ItemPointerCompare used directly by qsort as: - qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp); + qsort((void*) tids, ntids, sizeof(ItemPointerData), + (int (*) (const void *, const void *)) ItemPointerCompare); sanity_check_tid_array() has two error messages: "array must not contain nulls" "empty tid array" I would change the first to say "tid array must not contain nulls", as "tid" is the name of the parameter being checked. It is also more consistent with the second error message, but that doesn't matter to me so much, as I'd argue for removing the second check. I don't see why an empty array should draw an error. It seems more reasonable to just return early since there is no work to do. Consider if somebody uses a function that returns the tids for all corrupt tuples in a table, aggregates that into an array, and hands that to this function. It doesn't seem like an error for that aggregated array to have zero elements in it. I suppose you could emit a NOTICE in this case? Upthread: > On Aug 13, 2020, at 12:03 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> This looks like a very good suggestion to me. I will do this change in >> the next version.
Just wondering if we should be doing similar changes >> in other contrib modules (like pgrowlocks, pageinspect and >> pgstattuple) as well? > > It seems like it should be consistent, but I'm not sure the proposed > change is really an improvement. You have used Asim's proposed check: if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("only the relation using heap_tableam_handler is supported"))); which Robert seems unenthusiastic about, but if you are going that direction, I think at least the language of the error message should be changed. I recommend something like: if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("only the relation using heap_tableam_handler is supported"))); + errmsg("\"%s\" does not use a heap access method", + RelationGetRelationName(rel)))); where "a heap access method" could also be written as "a heap table type access method", "a heap table compatible access method", and so forth. There doesn't seem to be enough precedent to dictate exactly how to phrase this, or perhaps I'm just not looking in the right place. The header comment for function find_tids_one_page should state the requirement that the tids array must be sorted. The heap_force_common function contains multiple ereport(NOTICE,...) within a critical section. I don't think that is normal practice. Can those reports be buffered until after the critical section is exited? You also have a CHECK_FOR_INTERRUPTS within the critical section. [1] https://commitfest.postgresql.org/29/2700/ — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Aug 28, 2020 at 1:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Okay, point noted. > > I spent some time today working on this patch. I'm fairly happy with > it now and intend to commit it if nobody sees a big problem with that. > Per discussion, I do not intend to back-patch at this time. The two > most significant changes I made to your version are: > > 1. I changed things around to avoid using any form of ereport() in a > critical section. I'm not actually sure whether it is project policy > to avoid ereport(NOTICE, ...) or similar in a critical section, but it > seems prudent, because if anything fails in a critical section, we > will PANIC, so doing fewer things there seems prudent. > > 2. I changed the code so that it does not try to follow redirected > line pointers; instead, it skips them with an appropriate message, as > we were already doing for dead and unused line pointers. I think the > way you had it coded might've been my suggestion originally, but the > more I looked into it the less I liked it. One problem is that it > didn't match the docs. A second is that following a corrupted line > pointer might index off the end of the line pointer array, and while > that probably shouldn't happen, we are talking about corruption > recovery here. Then I realized that, as you coded it, if the line > pointer was redirected to a line pointer that is in turn dead (or > unused, if there's corruption) the user would get a NOTICE complaining > about a TID they hadn't specified, which seems like it would be very > confusing. I thought about trying to fix all that stuff, but it just > didn't seem worth it, because I can't think of a good reason to pass > this function the TID of a redirected line pointer in the first place. > If you're doing surgery, you should probably specify the exact thing > upon which you want to operate, not some other thing that points to > it. 
> > Here is a list of other changes I made: > > * Added a .gitignore file. > * Renamed the regression test file from pg_surgery to heap_surgery to > match the name of the single C source file we currently have. > * Capitalized TID in a few places. > * Ran pgindent. > * Adjusted various comments. > * Removed the check for an empty TID array. I don't see any reason why > this should be an error case and I don't see much precedent for having > such a check. > * Fixed the code to work properly with that change and added a test case. > * Added a check that the array is not multi-dimensional. > * Put the AM type check before the relkind check, following existing precedent. > * Adjusted the check to use the AM OID rather than the handler OID, > following existing precedent. Fixed the message wording accordingly. > * Changed the documentation wording to say less about specific > recovery procedures and focus more on the general idea that this is > dangerous. > * Removed all but one of the test cases that checked what happens if > you use this on a non-heap; three tests for basically the same thing > seemed excessive. > * Added some additional tests to improve code coverage. There are now > only a handful of lines not covered. > * Reorganized the test cases somewhat. > > New patch attached. > Thank you Robert for the patch. I've looked into the changes you've made to the v8 patch and they all look good to me. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Hi Mark, Thanks for the review. Please find my comments inline below: > HeapTupleForceOption should be added to src/tools/pgindent/typedefs.list. > This has been fixed in the v9 patch. > > The tidcmp function can be removed, and ItemPointerCompare used directly by qsort as: > > - qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp); > + qsort((void*) tids, ntids, sizeof(ItemPointerData), > + (int (*) (const void *, const void *)) ItemPointerCompare); > Will have a look into this. > > sanity_check_tid_array() has two error messages: > > "array must not contain nulls" > "empty tid array" > > I would change the first to say "tid array must not contain nulls", as "tid" is the name of the parameter being checked. It is also more consistent with the second error message, but that doesn't matter to me so much, as I'd argue for removing the second check. I don't see why an empty array should draw an error. It seems more reasonable to just return early since there is no work to do. Consider if somebody uses a function that returns the tids for all corrupt tuples in a table, aggregates that into an array, and hands that to this function. It doesn't seem like an error for that aggregated array to have zero elements in it. I suppose you could emit a NOTICE in this case? > This comment is no longer valid given the changes done in the v9 patch. > > Upthread: > > On Aug 13, 2020, at 12:03 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > >> This looks like a very good suggestion to me. I will do this change in > >> the next version. Just wondering if we should be doing similar changes > >> in other contrib modules (like pgrowlocks, pageinspect and > >> pgstattuple) as well? > > > > It seems like it should be consistent, but I'm not sure the proposed > > change is really an improvement.
> > You have used Asim's proposed check: > > if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("only the relation using heap_tableam_handler is supported"))); > > which Robert seems unenthusiastic about, but if you are going that direction, I think at least the language of the error message should be changed. I recommend something like: > > if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > - errmsg("only the relation using heap_tableam_handler is supported"))); > + errmsg("\"%s\" does not use a heap access method", > + RelationGetRelationName(rel)))); > > where "a heap access method" could also be written as "a heap table type access method", "a heap table compatible access method", and so forth. There doesn't seem to be enough precedent to dictate exactly how to phrase this, or perhaps I'm just not looking in the right place. > Same here. This also no longer applies given the changes done in the v9 patch. > > The header comment for function find_tids_one_page should state the requirement that the tids array must be sorted. > Okay, will add a comment for this. > > The heap_force_common function contains multiple ereport(NOTICE,...) within a critical section. I don't think that is normal practice. Can those reports be buffered until after the critical section is exited? You also have a CHECK_FOR_INTERRUPTS within the critical section. > This has been fixed in the v9 patch. > [1] https://commitfest.postgresql.org/29/2700/ > — Thanks for adding a commitfest entry for this. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Fri, 28 Aug 2020 at 05:14, Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Okay, point noted. > > I spent some time today working on this patch. I'm fairly happy with > it now and intend to commit it if nobody sees a big problem with that. > Per discussion, I do not intend to back-patch at this time. The two > most significant changes I made to your version are: Thank you for updating the patch. > > 1. I changed things around to avoid using any form of ereport() in a > critical section. I'm not actually sure whether it is project policy > to avoid ereport(NOTICE, ...) or similar in a critical section, but it > seems prudent, because if anything fails in a critical section, we > will PANIC, so doing fewer things there seems prudent. > > 2. I changed the code so that it does not try to follow redirected > line pointers; instead, it skips them with an appropriate message, as > we were already doing for dead and unused line pointers. I think the > way you had it coded might've been my suggestion originally, but the > more I looked into it the less I liked it. One problem is that it > didn't match the docs. A second is that following a corrupted line > pointer might index off the end of the line pointer array, and while > that probably shouldn't happen, we are talking about corruption > recovery here. Then I realized that, as you coded it, if the line > pointer was redirected to a line pointer that is in turn dead (or > unused, if there's corruption) the user would get a NOTICE complaining > about a TID they hadn't specified, which seems like it would be very > confusing. I thought about trying to fix all that stuff, but it just > didn't seem worth it, because I can't think of a good reason to pass > this function the TID of a redirected line pointer in the first place. 
> If you're doing surgery, you should probably specify the exact thing > upon which you want to operate, not some other thing that points to > it. > > Here is a list of other changes I made: > > * Added a .gitignore file. > * Renamed the regression test file from pg_surgery to heap_surgery to > match the name of the single C source file we currently have. > * Capitalized TID in a few places. > * Ran pgindent. > * Adjusted various comments. > * Removed the check for an empty TID array. I don't see any reason why > this should be an error case and I don't see much precedent for having > such a check. > * Fixed the code to work properly with that change and added a test case. > * Added a check that the array is not multi-dimensional. > * Put the AM type check before the relkind check, following existing precedent. > * Adjusted the check to use the AM OID rather than the handler OID, > following existing precedent. Fixed the message wording accordingly. > * Changed the documentation wording to say less about specific > recovery procedures and focus more on the general idea that this is > dangerous. You've removed the description about executing VACUUM with DISABLE_PAGE_SKIPPING option on the target relation after using pg_surgery functions from the doc but I guess it’s better to recommend that in the doc for safety. Could you please tell me the reason for removing that? Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> >
> > - qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
> > + qsort((void*) tids, ntids, sizeof(ItemPointerData),
> > + (int (*) (const void *, const void *)) ItemPointerCompare);
> >
>
> Will have a look into this.
>
We can certainly do it this way, but I would still prefer having a comparator function (tidcmp) here, for the reasons that it makes the code look a bit cleaner and it also makes us more consistent with the way the comparator function argument is passed to qsort at several other places in postgres, which kind of increases the code readability and simplicity. For example, there is a comparator function for gin that does the same thing as tidcmp is doing here. See below:
static int
qsortCompareItemPointers(const void *a, const void *b)
{
int res = ginCompareItemPointers((ItemPointer) a, (ItemPointer) b);
/* Assert that there are no equal item pointers being sorted */
Assert(res != 0);
return res;
}
In this case as well, it could have been done the way you are suggesting, but writing a small comparator function with the prototype that qsort accepts seemed like the better option. Considering this, I am leaving it as it is. Please let me know if you feel otherwise.
> > The header comment for function find_tids_one_page should state the requirement that the tids array must be sorted.
> >
>
> Okay, will add a comment for this.
Attachment
On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > We can certainly do this way, but I would still prefer having a comparator function (tidcmp) here for the reasons that it makes the code look a bit cleaner, it also makes us more consistent with the way the comparator function argument is being passed to qsort at several other places in postgres which kinda of increases the code readability and simplicity. For e.g. there is a comparator function for gin that does the same thing as tidcmp is doing here. Me too. Casting one kind of function pointer to another kind of function pointer assumes that the compiler is using the same argument-passing conventions in both cases, which seems slightly risky. It also means that if the signature for the function were to diverge further from the signature that we need in the future, the compiler might not warn us about it. Perhaps there is some case where the performance gains would be sufficient to justify those risks, but this is certainly not that case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Aug 28, 2020 at 4:07 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > You've removed the description about executing VACUUM with > DISABLE_PAGE_SKIPPING option on the target relation after using > pg_surgery functions from the doc but I guess it’s better to recommend > that in the doc for safety. Could you please tell me the reason for > removing that? Well, I think that was added because there wasn't code to clear the visibility map bits, either at the page level or in the map, but we added code for that, so now I don't really see why it's necessary or even desirable. Here are a few example scenarios: 1. My table is not corrupt. For no particular reason, I force-freeze or force-kill a tuple which is neither dead nor all-visible. Concurrent queries might return wrong answers, but the table is not corrupt. It does not require VACUUM and would not benefit from it. Actually, it doesn't need anything at all. 2. My table is not corrupt. For no particular reason, I force-freeze a tuple which is dead. I believe it's possible that the index entries for that tuple might be gone already, but VACUUM won't fix that. REINDEX or a table rewrite would, though. It's also possible, if the dead tuple was added by an aborted transaction which added columns to the table, that the tuple might have been created using a tuple descriptor that differs from the table's current tuple descriptor. If so, I think scanning the table could produce a crash. VACUUM won't fix this, either. I would need to delete or force-kill the offending tuple. 3. I have one or more tuples in my table that are intact except that they have garbage values for xmin, resulting in VACUUM failure or possibly even SELECT failure if the CLOG entries are also missing. I force-kill or force-freeze them.
On the other hand, if those tuples are present in the indexes, there's no remaining problem and VACUUM is not needed for the purpose of restoring the integrity of the table. If the problem has been ongoing for a while, VACUUM might be needed to advance relfrozenxid, but that doesn't require DISABLE_PAGE_SKIPPING. 4. I have some pages in my table that have incorrect visibility map bits. In this case, I need VACUUM (DISABLE_PAGE_SKIPPING). However, I don't need the functions we're talking about here at all unless I also have tuples with corrupted visibility information. If I do happen to have both tuples with corrupted visibility information and also pages with incorrect visibility map bits, then I suppose I need both these tools and also VACUUM (DISABLE_PAGE_SKIPPING). Probably, I'll want to do the VACUUM second. But, if I happened to do the VACUUM first and then use these functions afterward, the worst thing that could happen is that I might end up with some dead tuples that could've gotten removed faster if I'd switched the order. And that's not a disaster. Basically, I can see no real reason to recommend VACUUM (DISABLE_PAGE_SKIPPING) here. There are problems that can be fixed with that command, and there are problems that can be fixed by this method, but they are mostly independent of each other. We should not recommend that people run VACUUM "just in case." That kind of fuzzy thinking seems relatively prevalent already, and it leads to people spending a lot of time running slow maintenance commands that do nothing to help them, and which occasionally make things worse. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 28 Aug 2020 at 23:39, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Aug 28, 2020 at 4:07 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > You've removed the description about executing VACUUM with > > DISABLE_PAGE_SKIPPING option on the target relation after using > > pg_surgery functions from the doc but I guess it’s better to recommend > > that in the doc for safety. Could you please tell me the reason for > > removing that? > > Well, I think that was added because there wasn't code to clear the > visibility map bits, either page-level in the map, but we added code > for that, so now I don't really see why it's necessary or even > desirable. > > Here are a few example scenarios: > > 1. My table is not corrupt. For no particular reason, I force-freeze > or force-kill a tuple which is neither dead nor all-visible. > Concurrent queries might return wrong answers, but the table is not > corrupt. It does not require VACUUM and would not benefit from it. > Actually, it doesn't need anything at all. > > 2. My table is not corrupt. For no particular reason, I force-freeze a > tuple which is dead. I believe it's possible that the index entries > for that tuple might be gone already, but VACUUM won't fix that. > REINDEX or a table rewrite would, though. It's also possible, if the > dead tuple was added by an aborted transaction which added columns to > the table, that the tuple might have been created using a tuple > descriptor that differs from the table's current tuple descriptor. If > so, I think scanning the table could produce a crash. VACUUM won't fix > this, either. I would need to delete or force-kill the offending > tuple. > > 3. I have one or more tuples in my table that are intact except that > they have garbage values for xmin, resulting in VACUUM failure or > possibly even SELECT failure if the CLOG entries are also missing. I > force-kill or force-freeze them. 
If by chance the affected tuples were > also omitted from one or more indexes, a REINDEX or table rewrite is > needed to fix them, but a VACUUM will not help. On the other hand, if > those tuples are present in the indexes, there's no remaining problem > and VACUUM is not needed for the purpose of restoring the integrity of > the table. If the problem has been ongoing for a while, VACUUM might > be needed to advance relfrozenxid, but that doesn't require > DISABLE_PAGE_SKIPPING. > > 4. I have some pages in my table that have incorrect visibility map > bits. In this case, I need VACUUM (DISABLE_PAGE_SKIPPING). However, I > don't need the functions we're talking about here at all unless I also > have tuples with corrupted visibility information. If I do happen to > have both tuples with corrupted visibility information and also pages > with incorrect visibility map bits, then I suppose I need both these > tools and also VACUUM (DISABLE_PAGE_SKIPPING). Probably, I'll want to > do the VACUUM second. But, if I happened to do the VACUUM first and > then use these functions afterward, the worst thing that could happen > is that I might end up with a some dead tuples that could've gotten > removed faster if I'd switched the order. And that's not a disaster. > > Basically, I can see no real reason to recommend VACUUM > (DISABLE_PAGE_SKIPPING) here. There are problems that can be fixed > with that command, and there are problems that can be fixed by this > method, but they are mostly independent of each other. We should not > recommend that people run VACUUM "just in case." That kind of fuzzy > thinking seems relatively prevalent already, and it leads to people > spending a lot of time running slow maintenance commands that do > nothing to help them, and which occasionally make things worse. > Thank you for your explanation. That makes a lot of sense to me. If VACUUM could fix a particular kind of problem when used together with pg_surgery, we could recommend running VACUUM.
But I agree that heap-table corruption is not such a case. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Please have a look into the attached patch for the changes and let me know for any other concerns. Thank you. I have committed this version. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Sep 10, 2020 at 11:21:02AM -0400, Robert Haas wrote: > On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Please have a look into the attached patch for the changes and let me know for any other concerns. Thank you. > > I have committed this version. Thanks; I marked it as such in the CF app. https://commitfest.postgresql.org/29/2700/ -- Justin
Robert Haas <robertmhaas@gmail.com> writes: > I have committed this version.

This failure says that the test case is not entirely stable:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-12%2005%3A13%3A12

diff -U3 /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out
--- /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out	2020-09-11 06:31:36.000000000 +0000
+++ /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out	2020-09-12 11:40:26.000000000 +0000
@@ -116,7 +116,6 @@
 vacuum freeze htab2;
 -- unused TIDs should be skipped
 select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
-NOTICE:  skipping tid (0, 2) for relation "htab2" because it is marked unused
 heap_force_kill
-----------------

sungazer's first run after pg_surgery went in was successful, so it's not a hard failure. I'm guessing that it's timing dependent. The most obvious theory for the cause is that what VACUUM does with a tuple depends on whether the tuple's xmin is below global xmin, and a concurrent autovacuum could very easily be holding back global xmin. While I can't easily get autovac to run at just the right time, I did verify that a concurrent regular session holding back global xmin produces the symptom seen above. (To replicate, insert "select pg_sleep(...)" in heap_surgery.sql before "-- now create an unused line pointer"; run make installcheck; and use the delay to connect to the database manually, start a serializable transaction, and do any query to acquire a snapshot.) I suggest that the easiest way to make this test reliable is to make the test tables be temp tables (which allows dropping the autovacuum_enabled = off property, too). In the wake of commit a7212be8b, that should guarantee that vacuum has stable tuple-level behavior regardless of what is happening concurrently. 
regards, tom lane
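Tom's manual reproduction recipe amounts to something like the following in the second session (a sketch of the steps he describes; the point is simply to hold an open snapshot, and thereby hold back the global xmin, for the duration of the test):

```sql
-- In a second session, while the pg_sleep() inserted into
-- heap_surgery.sql is running under "make installcheck":
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 1;   -- any query acquires a snapshot, holding back global xmin

-- Leave the transaction open until heap_surgery.sql finishes, then:
ROLLBACK;
```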
On Sun, Sep 13, 2020 at 3:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > I have committed this version. > > This failure says that the test case is not entirely stable: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-12%2005%3A13%3A12 > > diff -U3 /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out > --- /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out 2020-09-11 06:31:36.000000000 +0000 > +++ /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out 2020-09-12 11:40:26.000000000 +0000 > @@ -116,7 +116,6 @@ > vacuum freeze htab2; > -- unused TIDs should be skipped > select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]); > - NOTICE: skipping tid (0, 2) for relation "htab2" because it is marked unused > heap_force_kill > ----------------- > > > sungazer's first run after pg_surgery went in was successful, so it's > not a hard failure. I'm guessing that it's timing dependent. > > The most obvious theory for the cause is that what VACUUM does with > a tuple depends on whether the tuple's xmin is below global xmin, > and a concurrent autovacuum could very easily be holding back global > xmin. While I can't easily get autovac to run at just the right > time, I did verify that a concurrent regular session holding back > global xmin produces the symptom seen above. (To replicate, insert > "select pg_sleep(...)" in heap_surgery.sql before "-- now create an unused > line pointer"; run make installcheck; and use the delay to connect > to the database manually, start a serializable transaction, and do > any query to acquire a snapshot.) > Thanks for reporting. 
I'm able to reproduce the issue by creating some delay just before "-- now create an unused line pointer" and using that delay to start a new session, at either the repeatable read or serializable transaction isolation level, that runs some query on the test table. To fix this, as you suggested, I've converted the test table to a temp table. Attached is the patch with the changes. Please have a look and let me know about any concerns. Thanks, -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Attachment
On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Thanks for reporting. I'm able to reproduce the issue by creating some > delay just before "-- now create an unused line pointer" and use the > delay to start a new session either with repeatable read or > serializable transaction isolation level and run some query on the > test table. To fix this, as you suggested I've converted the test > table to the temp table. Attached is the patch with the changes. > Please have a look and let me know about any concerns. Tom, do you have any concerns about this fix? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> Thanks for reporting. I'm able to reproduce the issue by creating some >> delay just before "-- now create an unused line pointer" and use the >> delay to start a new session either with repeatable read or >> serializable transaction isolation level and run some query on the >> test table. To fix this, as you suggested I've converted the test >> table to the temp table. Attached is the patch with the changes. >> Please have a look and let me know about any concerns. > Tom, do you have any concerns about this fix? It seems OK as far as it goes. Two thoughts: * Do we need a comment in the test pointing out that the table must be temp to ensure that we get stable vacuum results? Or will the commit log message be enough documentation? * Should any of the other tables in the test be converted to temp? I see that the other test cases are kluging around related issues by dint of never committing their tables at all. It's not clear to me how badly those test cases have been distorted by that, or whether it means they're testing less-than-typical situations. Anyway, if you're satisfied with leaving the other cases as-is, I have no objections. regards, tom lane
On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > >> Thanks for reporting. I'm able to reproduce the issue by creating some > >> delay just before "-- now create an unused line pointer" and use the > >> delay to start a new session either with repeatable read or > >> serializable transaction isolation level and run some query on the > >> test table. To fix this, as you suggested I've converted the test > >> table to the temp table. Attached is the patch with the changes. > >> Please have a look and let me know about any concerns. > > > Tom, do you have any concerns about this fix? > > It seems OK as far as it goes. Two thoughts: > > * Do we need a comment in the test pointing out that the table must be > temp to ensure that we get stable vacuum results? Or will the commit > log message be enough documentation? > I'll add a note for this. > * Should any of the other tables in the test be converted to temp? > I see that the other test cases are kluging around related issues > by dint of never committing their tables at all. It's not clear > to me how badly those test cases have been distorted by that, or > whether it means they're testing less-than-typical situations. > Are you trying to say that we can achieve the things being done in test-case 1 and 2 by having a single temp table and we should aim for it because it will make the test-case more efficient and easy to maintain? If so, I will try to do the necessary changes and submit a new patch for it. please confirm. Thanks, -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Ashutosh Sharma <ashu.coek88@gmail.com> writes: > On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> * Should any of the other tables in the test be converted to temp? > Are you trying to say that we can achieve the things being done in > test-case 1 and 2 by having a single temp table and we should aim for > it because it will make the test-case more efficient and easy to > maintain? Well, I'm just looking at the comment that says the reason for the begin/rollback structure is to keep autovacuum's hands off the table. In most if not all of the other places where we need that, the preferred method is to make the table temp or mark it with (autovacuum = off). While this way isn't wrong exactly, nor inefficient, it does seem a little restrictive. For instance, you can't easily test cases that involve intentional errors. Another point is that we have a few optimizations that apply to tables created in the current transaction. I'm not sure whether any of them would fire in this test case, but if they do (now or in the future) that might mean you aren't testing the usual scenario for use of pg_surgery, which is surely not going to be a new-in-transaction table. (That might be an argument for preferring autovacuum = off over a temp table, too.) Like I said, I don't have a big problem with leaving the rest of the test as-is. It just seems to be doing things in an unusual way for no very good reason. regards, tom lane
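The two conventional ways of keeping autovacuum's hands off a test table that Tom contrasts above look like this (a sketch; `autovacuum_enabled` is the storage parameter already referenced earlier in the thread):

```sql
-- Option 1: a temp table.  Autovacuum never touches it, but
-- temp-table and new-in-transaction optimizations may apply, so it is
-- not quite the typical pg_surgery target.
CREATE TEMP TABLE htab (a int);

-- Option 2: a normal heap table with autovacuum disabled via a
-- reloption, which exercises the more usual scenario.
CREATE TABLE htab2 (a int) WITH (autovacuum_enabled = off);
```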
On Wed, Sep 16, 2020 at 9:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Ashutosh Sharma <ashu.coek88@gmail.com> writes: > > On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> * Should any of the other tables in the test be converted to temp? > > > Are you trying to say that we can achieve the things being done in > > test-case 1 and 2 by having a single temp table and we should aim for > > it because it will make the test-case more efficient and easy to > > maintain? > > Well, I'm just looking at the comment that says the reason for the > begin/rollback structure is to keep autovacuum's hands off the table. > In most if not all of the other places where we need that, the preferred > method is to make the table temp or mark it with (autovacuum = off). > While this way isn't wrong exactly, nor inefficient, it does seem > a little restrictive. For instance, you can't easily test cases that > involve intentional errors. > > Another point is that we have a few optimizations that apply to tables > created in the current transaction. I'm not sure whether any of them > would fire in this test case, but if they do (now or in the future) > that might mean you aren't testing the usual scenario for use of > pg_surgery, which is surely not going to be a new-in-transaction > table. (That might be an argument for preferring autovacuum = off > over a temp table, too.) > I agree with you on both the above points. I'll try to make the necessary changes to address all your comments. Also, I'd prefer creating a normal heap table with autovacuum = off over the temp table for the reasons you mentioned in the second point. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Wed, Sep 16, 2020 at 10:40 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > On Wed, Sep 16, 2020 at 9:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Ashutosh Sharma <ashu.coek88@gmail.com> writes: > > > On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> * Should any of the other tables in the test be converted to temp? > > > > > Are you trying to say that we can achieve the things being done in > > > test-case 1 and 2 by having a single temp table and we should aim for > > > it because it will make the test-case more efficient and easy to > > > maintain? > > > > Well, I'm just looking at the comment that says the reason for the > > begin/rollback structure is to keep autovacuum's hands off the table. > > In most if not all of the other places where we need that, the preferred > > method is to make the table temp or mark it with (autovacuum = off). > > While this way isn't wrong exactly, nor inefficient, it does seem > > a little restrictive. For instance, you can't easily test cases that > > involve intentional errors. > > > > Another point is that we have a few optimizations that apply to tables > > created in the current transaction. I'm not sure whether any of them > > would fire in this test case, but if they do (now or in the future) > > that might mean you aren't testing the usual scenario for use of > > pg_surgery, which is surely not going to be a new-in-transaction > > table. (That might be an argument for preferring autovacuum = off > > over a temp table, too.) > > > > I agree with you on both the above points. I'll try to make the > necessary changes to address all your comments. Also, I'd prefer > creating a normal heap table with autovacuum = off over the temp table > for the reasons you mentioned in the second point. > Attached is the patch with the changes suggested here. I've tried to use a normal heap table with (autovacuum = off) wherever possible. Please have a look and let me know for any other issues. 
Thanks, -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Attachment
On Wed, Sep 16, 2020 at 1:48 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Attached is the patch with the changes suggested here. I've tried to > use a normal heap table with (autovacuum = off) wherever possible. > Please have a look and let me know for any other issues. I think the comment needs some wordsmithing -- "unlike other cases" is not that informative, and "we get a stable vacuum results" isn't either very clear or all that grammatical. If we're going to add a comment here, why not just "use a temp table, so autovacuum can't interfere"? Tom, I know that you often have strong feelings about the exact wording and details of this kind of stuff, so if you feel moved to commit something that is fine with me. If not, I will take my best shot at it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Tom, I know that you often have strong feelings about the exact > wording and details of this kind of stuff, so if you feel moved to > commit something that is fine with me. If not, I will take my best > shot at it. I'm not feeling terribly picky about it --- so it's all yours. regards, tom lane
On Wed, Sep 16, 2020 at 11:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > Tom, I know that you often have strong feelings about the exact > > wording and details of this kind of stuff, so if you feel moved to > > commit something that is fine with me. If not, I will take my best > > shot at it. > > I'm not feeling terribly picky about it --- so it's all yours. OK. After some more study of the thread and some more experimentation, I came up with the attached. I'll go ahead and commit this if nobody objects. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > OK. After some more study of the thread and some more experimentation, > I came up with the attached. I'll go ahead and commit this if nobody > objects. This is OK by me. regards, tom lane
On Thu, Sep 17, 2020 at 12:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > OK. After some more study of the thread and some more experimentation, > > I came up with the attached. I'll go ahead and commit this if nobody > > objects. > > This is OK by me. > Looks good to me too. Thanks, -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Wed, Sep 16, 2020 at 10:27 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > This is OK by me. > > Looks good to me too. Cool, thanks to both of you for looking. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Cool, thanks to both of you for looking. Committed. Hmph, according to whelk this is worse not better: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-09-18%2017%3A42%3A11 I'm at a loss to understand what's going on there. regards, tom lane
On Sat, Sep 19, 2020 at 4:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > Cool, thanks to both of you for looking. Committed. > > Hmph, according to whelk this is worse not better: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-09-18%2017%3A42%3A11 > > I'm at a loss to understand what's going on there. > I think our assumption that changing the tests to use temp tables would make them safe w.r.t. concurrent activity isn't correct. We do set OldestXmin for temp tables aggressively enough that it allows us to remove all dead tuples, but the test case's behavior relies on whether we are able to prune the chain. AFAICS, we are using different cut-offs in heap_page_prune when it is called via lazy_scan_heap. That seems to be causing both of the failures. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes: > I think our assumption that changing the tests to have temp tables > will make them safe w.r.t concurrent activity doesn't seem to be > correct. We do set OldestXmin for temp tables aggressive enough that > it allows us to remove all dead tuples but the test case behavior lies > on whether we are able to prune the chain. AFAICS, we are using > different cut-offs in heap_page_prune when it is called via > lazy_scan_heap. So that seems to be causing both the failures. Hm, reasonable theory. I was able to partially reproduce whelk's failure here. I got a couple of cases of "cannot freeze committed xmax", which then leads to the second NOTICE diff; but I couldn't reproduce the first NOTICE diff. That was out of about a thousand tries :-( so it's not looking like a promising thing to reproduce without modifying the test. I wonder whether "cannot freeze committed xmax" doesn't represent an actual bug, ie is a7212be8b setting the cutoff *too* aggressively? But if so, why's it so hard to reproduce? regards, tom lane
I wrote: > I was able to partially reproduce whelk's failure here. I got a > couple of cases of "cannot freeze committed xmax", which then leads > to the second NOTICE diff; but I couldn't reproduce the first > NOTICE diff. That was out of about a thousand tries :-( so it's not > looking like a promising thing to reproduce without modifying the test. ... however, it's trivial to reproduce via manual interference, using the same strategy discussed recently for another case: add a pg_sleep at the start of the heap_surgery.sql script, run "make installcheck", and while that's running start another session in which you begin a serializable transaction, execute any old SELECT, and wait. AFAICT this reproduces all of whelk's symptoms with 100% reliability. With a little more effort, this could be automated by putting some long-running transaction (likely, it needn't be any more complicated than "select pg_sleep(10)") in a second test script launched in parallel with heap_surgery.sql. So this confirms the suspicion that the cause of the buildfarm failures is a concurrently-open transaction, presumably from autovacuum. I don't have time to poke further right now. regards, tom lane
I wrote: > So this confirms the suspicion that the cause of the buildfarm > failures is a concurrently-open transaction, presumably from > autovacuum. I don't have time to poke further right now. I spent some more time analyzing this, and there seem to be two distinct issues: 1. My patch a7212be8b does indeed have a problem. It will allow vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp table if freeze_min_age is zero (ie VACUUM FREEZE). If there's any concurrent transactions, this falls foul of heap_prepare_freeze_tuple's expectation that * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any * XID older than it could neither be running nor seen as running by any * open transaction. This ensures that the replacement will not change * anyone's idea of the tuple state. The "cannot freeze committed xmax" error message appears to be banking on the assumption that we'd not reach heap_prepare_freeze_tuple for any committed-dead tuple unless its xmax is past the specified cutoff_xid. 2. As Amit suspected, there's an inconsistency between pruneheap.c's rules for which tuples are removable and vacuum.c's rules for that. This seems like a massive bug in its own right: what's the point of pruneheap.c going to huge effort to decide whether it should keep a tuple if vacuum will then kill it anyway? I do not understand why whoever put in the GlobalVisState stuff only applied it in pruneheap.c and not VACUUM proper. These two points interact, in that we don't get to the "cannot freeze" failure except when pruneheap has decided not to remove something that is removable according to VACUUM's rules. (VACUUM doesn't actually remove it, because lazy_scan_heap won't try to remove HeapOnly tuples even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze the tuple, and heap_prepare_freeze_tuple spits up.) 
However, if I revert a7212be8b then the pg_surgery test still fails in the presence of a concurrent transaction (both of the expected "skipping TID" notices disappear). So reverting that patch wouldn't get us out of trouble. I think to move forward, we need to figure out what the freezing behavior ought to be for temp tables. We could make it the same as it was before a7212be8b, which'd just require some more complexity in vacuum_set_xid_limits. However, that negates the idea that we'd like VACUUM's behavior on a temp table to be fully independent of whether concurrent transactions exist. I'd prefer to allow a7212be8b's behavior to stand, but then it seems we need to lobotomize the error check in heap_prepare_freeze_tuple to some extent. Independently of that, it seems like we need to fix things so that when pruneheap.c is called by vacuum, it makes EXACTLY the same dead-or-not-dead decisions that the main vacuum code makes. This business with applying some GlobalVisState rule or other instead seems just as unsafe as can be. AFAICS, there is no chance of the existing pg_surgery regression test being fully stable if we don't fix both things. BTW, attached is a quick-hack patch to allow automated testing of this scenario, along the lines I sketched yesterday. This test passes if you run the two scripts serially, but not when you run them in parallel. I'm not proposing this for commit; it's a hack and its timing behavior is probably not stable enough for the buildfarm. But it's pretty useful for poking at these issues. 
regards, tom lane

diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
index a66776c4c4..a7bcfca0fb 100644
--- a/contrib/pg_surgery/Makefile
+++ b/contrib/pg_surgery/Makefile
@@ -9,7 +9,7 @@
 EXTENSION = pg_surgery
 DATA = pg_surgery--1.0.sql
 PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
-REGRESS = heap_surgery
+REGRESS = --schedule=surgery_schedule
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index d4a757ffa0..6ef5597e7c 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -1,4 +1,10 @@
 create extension pg_surgery;
+select pg_sleep(0.1); -- ensure concurrent transaction is ready
+ pg_sleep
+----------
+
+(1 row)
+
 -- create a normal heap table and insert some rows.
 -- use a temp table so that vacuum behavior doesn't depend on global xmin
 create temp table htab (a int);
diff --git a/contrib/pg_surgery/expected/sleep.out b/contrib/pg_surgery/expected/sleep.out
new file mode 100644
index 0000000000..208e31163d
--- /dev/null
+++ b/contrib/pg_surgery/expected/sleep.out
@@ -0,0 +1,6 @@
+select pg_sleep(1.0); -- long enough for whole heap_surgery test
+ pg_sleep
+----------
+
+(1 row)
+
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 6526b27535..212c657f3c 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -1,5 +1,7 @@
 create extension pg_surgery;
 
+select pg_sleep(0.1); -- ensure concurrent transaction is ready
+
 -- create a normal heap table and insert some rows.
 -- use a temp table so that vacuum behavior doesn't depend on global xmin
 create temp table htab (a int);
diff --git a/contrib/pg_surgery/sql/sleep.sql b/contrib/pg_surgery/sql/sleep.sql
new file mode 100644
index 0000000000..a85c25bca2
--- /dev/null
+++ b/contrib/pg_surgery/sql/sleep.sql
@@ -0,0 +1 @@
+select pg_sleep(1.0); -- long enough for whole heap_surgery test
diff --git a/contrib/pg_surgery/surgery_schedule b/contrib/pg_surgery/surgery_schedule
new file mode 100644
index 0000000000..1bb3aa5e53
--- /dev/null
+++ b/contrib/pg_surgery/surgery_schedule
@@ -0,0 +1 @@
+test: heap_surgery sleep
Hi Tom, On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: > > So this confirms the suspicion that the cause of the buildfarm > > failures is a concurrently-open transaction, presumably from > > autovacuum. I don't have time to poke further right now. > > I spent some more time analyzing this, and there seem to be two distinct > issues: > > 1. My patch a7212be8b does indeed have a problem. It will allow > vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp > table if freeze_min_age is zero (ie VACUUM FREEZE). If there's > any concurrent transactions, this falls foul of > heap_prepare_freeze_tuple's expectation that > > * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any > * XID older than it could neither be running nor seen as running by any > * open transaction. This ensures that the replacement will not change > * anyone's idea of the tuple state. > > The "cannot freeze committed xmax" error message appears to be banking on > the assumption that we'd not reach heap_prepare_freeze_tuple for any > committed-dead tuple unless its xmax is past the specified cutoff_xid. > > 2. As Amit suspected, there's an inconsistency between pruneheap.c's > rules for which tuples are removable and vacuum.c's rules for that. > This seems like a massive bug in its own right: what's the point of > pruneheap.c going to huge effort to decide whether it should keep a > tuple if vacuum will then kill it anyway? I do not understand why > whoever put in the GlobalVisState stuff only applied it in pruneheap.c > and not VACUUM proper. > > These two points interact, in that we don't get to the "cannot freeze" > failure except when pruneheap has decided not to remove something that > is removable according to VACUUM's rules. 
(VACUUM doesn't actually > remove it, because lazy_scan_heap won't try to remove HeapOnly tuples > even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze > the tuple, and heap_prepare_freeze_tuple spits up.) However, if I revert > a7212be8b then the pg_surgery test still fails in the presence of a > concurrent transaction (both of the expected "skipping TID" notices > disappear). So reverting that patch wouldn't get us out of trouble. > > I think to move forward, we need to figure out what the freezing > behavior ought to be for temp tables. We could make it the same > as it was before a7212be8b, which'd just require some more complexity > in vacuum_set_xid_limits. However, that negates the idea that we'd > like VACUUM's behavior on a temp table to be fully independent of > whether concurrent transactions exist. I'd prefer to allow a7212be8b's > behavior to stand, but then it seems we need to lobotomize the error > check in heap_prepare_freeze_tuple to some extent. > > Independently of that, it seems like we need to fix things so that > when pruneheap.c is called by vacuum, it makes EXACTLY the same > dead-or-not-dead decisions that the main vacuum code makes. This > business with applying some GlobalVisState rule or other instead > seems just as unsafe as can be. > > AFAICS, there is no chance of the existing pg_surgery regression test > being fully stable if we don't fix both things. > From the explanation you provided above it seems like the test-case for pg_surgery module is failing because there is some issue with the changes done in a7212be8b commit (shown below). In other words, I believe that the test-case for pg_surgery has actually detected an issue in this commit. commit a7212be8b9e0885ee769e8c55f99ef742cda487b Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Tue Sep 1 18:37:12 2020 -0400 Set cutoff xmin more aggressively when vacuuming a temporary table. .... 
So, do you mean to say that if the issues related to temp tables induced by the above commit are fixed, the regression test for pg_surgery will become stable? Please let me know if I am missing something here. Thank you. > BTW, attached is a quick-hack patch to allow automated testing > of this scenario, along the lines I sketched yesterday. This > test passes if you run the two scripts serially, but not when > you run them in parallel. I'm not proposing this for commit; > it's a hack and its timing behavior is probably not stable enough > for the buildfarm. But it's pretty useful for poking at these > issues. > Yeah, understood, thanks for sharing this. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
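To make the failure mode under discussion concrete, here is a rough sketch of the freeze-limit interaction from the quoted analysis. This is plain Python, not PostgreSQL code; the function names and XID values are simplified stand-ins for vacuum_set_xid_limits and the invariant documented in heap_prepare_freeze_tuple.

```python
# Toy model of the invariant discussed above, not PostgreSQL code.
# heap_prepare_freeze_tuple requires cutoff_xid <= the oldest xmin of any
# open transaction; a7212be8b let a temp-table VACUUM FREEZE skip the clamp.

def freeze_limit(next_xid, freeze_min_age, clamp_to_global_xmin, global_xmin):
    """Simplified vacuum_set_xid_limits: cutoff is nextXid - freeze_min_age,
    optionally clamped so it cannot pass the global xmin."""
    limit = next_xid - freeze_min_age
    if clamp_to_global_xmin:
        limit = min(limit, global_xmin)
    return limit

def freeze_invariant_holds(cutoff_xid, global_xmin):
    # The "NB" comment quoted above: cutoff *must* be <= the global xmin.
    return cutoff_xid <= global_xmin

next_xid = 1000
concurrent_xmin = 990          # a concurrently open transaction

# Pre-a7212be8b behaviour: the clamp keeps the invariant intact.
safe = freeze_limit(next_xid, 0, True, concurrent_xmin)
assert freeze_invariant_holds(safe, concurrent_xmin)

# a7212be8b on a temp table with freeze_min_age = 0: cutoff == nextXid,
# which is newer than the concurrent xmin, so the invariant is violated.
aggressive = freeze_limit(next_xid, 0, False, concurrent_xmin)
assert not freeze_invariant_holds(aggressive, concurrent_xmin)
```

The point is simply that once the clamp to the global xmin is dropped for temp tables, VACUUM FREEZE with freeze_min_age = 0 can hand heap_prepare_freeze_tuple a cutoff newer than a concurrent transaction's xmin, which is exactly the state its sanity checks reject.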
On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: > > So this confirms the suspicion that the cause of the buildfarm > > failures is a concurrently-open transaction, presumably from > > autovacuum. I don't have time to poke further right now. > .. > 2. As Amit suspected, there's an inconsistency between pruneheap.c's > rules for which tuples are removable and vacuum.c's rules for that. > This seems like a massive bug in its own right: what's the point of > pruneheap.c going to huge effort to decide whether it should keep a > tuple if vacuum will then kill it anyway? I do not understand why > whoever put in the GlobalVisState stuff only applied it in pruneheap.c > and not VACUUM proper. > > These two points interact, in that we don't get to the "cannot freeze" > failure except when pruneheap has decided not to remove something that > is removable according to VACUUM's rules. (VACUUM doesn't actually > remove it, because lazy_scan_heap won't try to remove HeapOnly tuples > even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze > the tuple, and heap_prepare_freeze_tuple spits up.) However, if I revert > a7212be8b then the pg_surgery test still fails in the presence of a > concurrent transaction (both of the expected "skipping TID" notices > disappear). So reverting that patch wouldn't get us out of trouble. > > I think to move forward, we need to figure out what the freezing > behavior ought to be for temp tables. We could make it the same > as it was before a7212be8b, which'd just require some more complexity > in vacuum_set_xid_limits. However, that negates the idea that we'd > like VACUUM's behavior on a temp table to be fully independent of > whether concurrent transactions exist. I'd prefer to allow a7212be8b's > behavior to stand, but then it seems we need to lobotomize the error > check in heap_prepare_freeze_tuple to some extent. 
> > Independently of that, it seems like we need to fix things so that > when pruneheap.c is called by vacuum, it makes EXACTLY the same > dead-or-not-dead decisions that the main vacuum code makes. This > business with applying some GlobalVisState rule or other instead > seems just as unsafe as can be. > Yeah, on a quick look it seems that before commit dc7420c2c9, pruneheap.c and the main vacuum code used to make the same decision; that is the commit which introduced the GlobalVisState stuff. > AFAICS, there is no chance of the existing pg_surgery regression test > being fully stable if we don't fix both things. > What if we ensure that it runs with autovacuum = off and that no parallel test is running? I am not sure about the second part, but if we can do that then the test will probably be stable. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes: > On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> AFAICS, there is no chance of the existing pg_surgery regression test >> being fully stable if we don't fix both things. > What if we ensure that it runs with autovacuum = off and that no > parallel test is running? I am not sure about the second part, but if we > can do that then the test will probably be stable. Then it'll not be usable under "make installcheck", which is not very nice. It's also arguable that you aren't testing pg_surgery under real-world conditions if you do it like that. Moreover, I think that both of these points need to be addressed anyway, as they represent bugs that are reachable independently of pg_surgery. Admittedly, we do not have a test case that proves that the inconsistency between pruneheap and vacuum has any bad effects in the absence of a7212be8b. But do you really want to bet that there are none? regards, tom lane
On Sun, Sep 20, 2020 at 1:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > 1. My patch a7212be8b does indeed have a problem. It will allow > vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp > table if freeze_min_age is zero (ie VACUUM FREEZE). If there's > any concurrent transactions, this falls foul of > heap_prepare_freeze_tuple's expectation that > > * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any > * XID older than it could neither be running nor seen as running by any > * open transaction. This ensures that the replacement will not change > * anyone's idea of the tuple state. > > The "cannot freeze committed xmax" error message appears to be banking on > the assumption that we'd not reach heap_prepare_freeze_tuple for any > committed-dead tuple unless its xmax is past the specified cutoff_xid. > > 2. As Amit suspected, there's an inconsistency between pruneheap.c's > rules for which tuples are removable and vacuum.c's rules for that. > This seems like a massive bug in its own right: what's the point of > pruneheap.c going to huge effort to decide whether it should keep a > tuple if vacuum will then kill it anyway? I do not understand why > whoever put in the GlobalVisState stuff only applied it in pruneheap.c > and not VACUUM proper. I am not sure I fully understand why you're contrasting pruneheap.c with vacuum here, because vacuum just does HOT pruning to remove dead tuples - maybe calling the relevant functions with different arguments, but it doesn't have its own independent logic for that. The key point is that the freezing code isn't, or at least historically wasn't, very smart about dead tuples. For example, I think if you told it to freeze something that was dead it would just do it, which is obviously bad. And that's why Andres stuck those sanity checks in there. But it's still pretty fragile. 
I think perhaps the pruning code should be rewritten in such a way that it can be combined with the code that freezes and marks pages all-visible, so that there's not so much action at a distance, but such an endeavor is in itself pretty scary, and certainly not back-patchable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Sep 20, 2020 at 1:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> 2. As Amit suspected, there's an inconsistency between pruneheap.c's >> rules for which tuples are removable and vacuum.c's rules for that. >> This seems like a massive bug in its own right: what's the point of >> pruneheap.c going to huge effort to decide whether it should keep a >> tuple if vacuum will then kill it anyway? I do not understand why >> whoever put in the GlobalVisState stuff only applied it in pruneheap.c >> and not VACUUM proper. > I am not sure I fully understand why you're contrasting pruneheap.c > with vacuum here, because vacuum just does HOT pruning to remove dead > tuples - maybe calling the relevant functions with different > arguments, but it doesn't have its own independent logic for that. Right, but what we end up with is that the very same tuple xmin and xmax might result in pruning/deletion, or not, depending on whether it's part of a HOT chain or not. That's at best pretty weird, and at worst it means that corner-case bugs in other places are triggered in only one of the two scenarios ... which is what we have here. > The key point is that the freezing code isn't, or at least > historically wasn't, very smart about dead tuples. For example, I > think if you told it to freeze something that was dead it would just > do it, which is obviously bad. And that's why Andres stuck those > sanity checks in there. But it's still pretty fragile. I think perhaps > the pruning code should be rewritten in such a way that it can be > combined with the code that freezes and marks pages all-visible, so > that there's not so much action at a distance, but such an endeavor is > in itself pretty scary, and certainly not back-patchable. Not sure. The pruning code is trying to serve two masters, that is both VACUUM and on-the-fly cleanup during ordinary queries. 
If you try to merge it with other tasks that VACUUM does, you're going to have a mess for the second usage. I fear there's going to be pretty strong conservation of cruft either way. FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is *not* my preferred fix here. But it'll take some work in other places to preserve them. regards, tom lane
On Mon, Sep 21, 2020 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Right, but what we end up with is that the very same tuple xmin and > xmax might result in pruning/deletion, or not, depending on whether > it's part of a HOT chain or not. That's at best pretty weird, and > at worst it means that corner-case bugs in other places are triggered > in only one of the two scenarios ... which is what we have here. I'm not sure I really understand how that's happening, because surely HOT chains and non-HOT chains are pruned by the same code, but it doesn't sound good. > FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is > *not* my preferred fix here. But it'll take some work in other > places to preserve them. Make sense. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-09-21 16:02:29 -0400, Robert Haas wrote: > On Mon, Sep 21, 2020 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Right, but what we end up with is that the very same tuple xmin and > > xmax might result in pruning/deletion, or not, depending on whether > > it's part of a HOT chain or not. That's at best pretty weird, and > > at worst it means that corner-case bugs in other places are triggered > > in only one of the two scenarios ... which is what we have here. > > I'm not sure I really understand how that's happening, because surely > HOT chains and non-HOT chains are pruned by the same code, but it > doesn't sound good. Not necessarily, unfortunately:

    case HEAPTUPLE_DEAD:

        /*
         * Ordinarily, DEAD tuples would have been removed by
         * heap_page_prune(), but it's possible that the tuple state
         * changed since heap_page_prune() looked. In particular an
         * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
         * inserter aborted. So this cannot be considered an error
         * condition.
         *
         * If the tuple is HOT-updated then it must only be removed by
         * a prune operation; so we keep it just as if it were
         * RECENTLY_DEAD. Also, if it's a heap-only tuple, we choose to
         * keep it, because it'll be a lot cheaper to get rid of it in
         * the next pruning pass than to treat it like an indexed
         * tuple. Finally, if index cleanup is disabled, the second
         * heap pass will not execute, and the tuple will not get
         * removed, so we must treat it like any other dead tuple that
         * we choose to keep.
         *
         * If this were to happen for a tuple that actually needed to
         * be deleted, we'd be in trouble, because it'd possibly leave
         * a tuple below the relation's xmin horizon alive.
         * heap_prepare_freeze_tuple() is prepared to detect that case
         * and abort the transaction, preventing corruption.
         */
        if (HeapTupleIsHotUpdated(&tuple) ||
            HeapTupleIsHeapOnly(&tuple) ||
            params->index_cleanup == VACOPT_TERNARY_DISABLED)
            nkeep += 1;
        else
            tupgone = true; /* we can delete the tuple */
        all_visible = false;

So if e.g. a transaction aborts between the heap_page_prune and this check, the pruning behaviour depends on whether the tuple is part of a HOT chain or not. Greetings, Andres Freund
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Sep 21, 2020 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Right, but what we end up with is that the very same tuple xmin and >> xmax might result in pruning/deletion, or not, depending on whether >> it's part of a HOT chain or not. That's at best pretty weird, and >> at worst it means that corner-case bugs in other places are triggered >> in only one of the two scenarios ... which is what we have here. > I'm not sure I really understand how that's happening, because surely > HOT chains and non-HOT chains are pruned by the same code, but it > doesn't sound good. No, they're not. lazy_scan_heap() will never remove a tuple that is HeapTupleIsHotUpdated or HeapTupleIsHeapOnly, even if it thinks it's DEAD -- cf. vacuumlazy.c, about line 1350. So tuples in a HOT chain are deleted exactly when pruneheap.c sees fit to do so. OTOH, for tuples not in a HOT chain, the decision is (I believe) entirely on lazy_scan_heap(). And the core of my complaint is that pruneheap.c's decisions about what is DEAD are not reliably identical to what HeapTupleSatisfiesVacuum thinks. I don't mind if a free-standing prune operation has its own rules, but when it's invoked by VACUUM it ought to follow VACUUM's rules about what is dead or alive. What remains unclear at this point is whether we ought to import some of the intelligence added by the GlobalVisState patch into VACUUM's behavior, or just dumb down pruneheap.c so that it applies exactly the HeapTupleSatisfiesVacuum rules when invoked by VACUUM. regards, tom lane
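The asymmetry described here can be condensed into a toy decision function. This is plain Python, not PostgreSQL code; it only mirrors the shape of the HEAPTUPLE_DEAD branch in vacuumlazy.c discussed in this thread, with the field names simplified.

```python
# Toy reproduction of the asymmetry, not PostgreSQL code: with the same
# visibility verdict ("DEAD"), lazy_scan_heap deletes a plain tuple but
# keeps a HOT-updated or heap-only one, leaving its removal to pruning.

def lazy_scan_decision(htsv_state, hot_updated, heap_only,
                       index_cleanup_disabled):
    """Mirrors the shape of the HEAPTUPLE_DEAD branch in vacuumlazy.c."""
    if htsv_state != "DEAD":
        return "keep"
    if hot_updated or heap_only or index_cleanup_disabled:
        return "keep"        # nkeep += 1 in the real code
    return "delete"          # tupgone = true

# Identical visibility state, different outcome depending on HOT membership:
assert lazy_scan_decision("DEAD", False, False, False) == "delete"
assert lazy_scan_decision("DEAD", False, True, False) == "keep"
assert lazy_scan_decision("DEAD", True, False, False) == "keep"
```

So whenever a tuple's state changes to DEAD between the prune pass and this check (e.g. because the inserter aborted), whether it actually gets removed depends on whether it sits in a HOT chain, which is the inconsistency at issue.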
Hi, On 2020-09-20 13:13:16 -0400, Tom Lane wrote: > 2. As Amit suspected, there's an inconsistency between pruneheap.c's > rules for which tuples are removable and vacuum.c's rules for that. > This seems like a massive bug in its own right: what's the point of > pruneheap.c going to huge effort to decide whether it should keep a > tuple if vacuum will then kill it anyway? I do not understand why > whoever put in the GlobalVisState stuff only applied it in pruneheap.c > and not VACUUM proper. The reason for that is that the GlobalVisState stuff is computed heuristically (and then re-checked if that's not sufficient to prune a tuple, unless already done so). That's done so GetSnapshotData() doesn't have to look at each backend's ->xmin, which is quite a massive speedup at higher connection counts, as each backend's ->xmin changes much more often than each backend's xid. But for VACUUM we need to do the accurate scan of the procarray anyway, because we need an accurate value for things other than HOT pruning decisions. What exactly do you mean by the "going to huge effort to decide" bit? > I think to move forward, we need to figure out what the freezing > behavior ought to be for temp tables. We could make it the same > as it was before a7212be8b, which'd just require some more complexity > in vacuum_set_xid_limits. However, that negates the idea that we'd > like VACUUM's behavior on a temp table to be fully independent of > whether concurrent transactions exist. I'd prefer to allow a7212be8b's > behavior to stand, but then it seems we need to lobotomize the error > check in heap_prepare_freeze_tuple to some extent. I think that's an argument for what I suggested elsewhere, which is that we should move the logic for a different horizon for temp tables out of vacuum_set_xid_limits, and into procarray. 
> Independently of that, it seems like we need to fix things so that > when pruneheap.c is called by vacuum, it makes EXACTLY the same > dead-or-not-dead decisions that the main vacuum code makes. This > business with applying some GlobalVisState rule or other instead > seems just as unsafe as can be. It's not great, I agree. Not sure there is a super nice answer though. Note that, even before my changes, vacuumlazy can decide differently than pruning whether a tuple is live. E.g. when an inserting transaction aborts. That's pretty much unavoidable as long as we have multiple HTSV calls for a tuple, since none of our locking can (nor should) prevent concurrent transactions from aborting. Before your new code avoiding the GetOldestNonRemovableTransactionId() call for temp tables, the GlobalVis* could never be more pessimistic than decisions based on a prior GetOldestNonRemovableTransactionId call (as that internally updates the heuristic horizons used by GlobalVis). Greetings, Andres Freund
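The heuristic-then-recheck pattern Andres describes can be sketched roughly as follows. This is plain Python, not the real GlobalVisState API; the class, its fields, and the numbers are all illustrative stand-ins.

```python
# Toy model, not PostgreSQL code, of the two-step horizon logic: pruning
# first consults a cached, conservative horizon and only falls back to an
# accurate (expensive) recomputation when the cheap test is inconclusive.

class VisState:
    def __init__(self, maybe_needed, accurate_scan):
        self.maybe_needed = maybe_needed    # cached, conservative bound
        self.accurate_scan = accurate_scan  # stands in for a procarray scan
        self.scans = 0                      # counts expensive rechecks

    def is_removable(self, xid):
        if xid < self.maybe_needed:         # cheap test: definitely removable
            return True
        # Inconclusive: recompute an accurate horizon (expensive path).
        self.scans += 1
        self.maybe_needed = self.accurate_scan()
        return xid < self.maybe_needed

vis = VisState(maybe_needed=900, accurate_scan=lambda: 990)
assert vis.is_removable(850) and vis.scans == 0   # cheap path suffices
assert vis.is_removable(950) and vis.scans == 1   # needed the recheck
assert vis.is_removable(960) and vis.scans == 1   # refreshed cache reused
```

The cheap test answers most calls without touching the procarray; only a borderline XID forces the accurate recomputation, which is why GetSnapshotData() no longer needs to read every backend's xmin. VACUUM, by contrast, pays for the accurate scan regardless, since it needs exact horizons for more than pruning.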
Andres Freund <andres@anarazel.de> writes: > The reason for that is that the GlobalVisState stuff is computed > heuristically (and then re-checked if that's not sufficient to prune a > tuple, unless already done so). That's done so GetSnapshotData() doesn't > have to look at each backend's ->xmin, which is quite a massive speedup > at higher connection counts, as each backend's ->xmin changes much more > often than each backend's xid. OK. > What exactly do you mean by the "going to huge effort to decide" bit? I'd looked at all the complexity around GlobalVisState, but failed to register that it should be pretty cheap on a per-tuple basis. So never mind that complaint. The point that remains is just that it's different from HeapTupleSatisfiesVacuum's rules. >> I think to move forward, we need to figure out what the freezing >> behavior ought to be for temp tables. We could make it the same >> as it was before a7212be8b, which'd just require some more complexity >> in vacuum_set_xid_limits. However, that negates the idea that we'd >> like VACUUM's behavior on a temp table to be fully independent of >> whether concurrent transactions exist. I'd prefer to allow a7212be8b's >> behavior to stand, but then it seems we need to lobotomize the error >> check in heap_prepare_freeze_tuple to some extent. > I think that's an argument for what I suggested elsewhere, which is that > we should move the logic for a different horizon for temp tables out of > vacuum_set_xid_limits, and into procarray. But procarray does not seem like a great place for table-persistence-dependent decisions either? >> Independently of that, it seems like we need to fix things so that >> when pruneheap.c is called by vacuum, it makes EXACTLY the same >> dead-or-not-dead decisions that the main vacuum code makes. This >> business with applying some GlobalVisState rule or other instead >> seems just as unsafe as can be. > It's not great, I agree. Not sure there is a super nice answer > though. 
Note that, even before my changes, vacuumlazy can decide > differently than pruning whether a tuple is live. E.g. when an inserting > transaction aborts. That's pretty much unavoidable as long as we have > multiple HTSV calls for a tuple, since none of our locking can (nor > should) prevent concurrent transactions from aborting. It's clear that if the environment changes between test A and test B, we might get different results. What I'm not happy about is that the rules are different, so we might get different results even if the environment did not change. regards, tom lane
Hi, On 2020-09-21 16:40:40 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > >> I think to move forward, we need to figure out what the freezing > >> behavior ought to be for temp tables. We could make it the same > >> as it was before a7212be8b, which'd just require some more complexity > >> in vacuum_set_xid_limits. However, that negates the idea that we'd > >> like VACUUM's behavior on a temp table to be fully independent of > >> whether concurrent transactions exist. I'd prefer to allow a7212be8b's > >> behavior to stand, but then it seems we need to lobotomize the error > >> check in heap_prepare_freeze_tuple to some extent. > > > I think that's an argument for what I suggested elsewhere, which is that > > we should move the logic for a different horizon for temp tables out of > > vacuum_set_xid_limits, and into procarray. > > But procarray does not seem like a great place for > table-persistence-dependent decisions either? That ship has sailed a long long time ago though. GetOldestXmin() has looked at the passed-in relation for quite a while, and even before that we had logic about 'allDbs' etc. It doesn't easily seem possible to avoid that, given how intimately that's coupled with how snapshots are built and used, database & vacuumFlags checks etc. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2020-09-21 16:40:40 -0400, Tom Lane wrote: >> Andres Freund <andres@anarazel.de> writes: >>> I think that's an argument for what I suggested elsewhere, which is that >>> we should move the logic for a different horizon for temp tables out of >>> vacuum_set_xid_limits, and into procarray. >> But procarray does not seem like a great place for >> table-persistence-dependent decisions either? > That ship has sailed a long long time ago though. GetOldestXmin() has > looked at the passed-in relation for quite a while, and even before > that we had logic about 'allDbs' etc. It doesn't easily seem possible > to avoid that, given how intimately that's coupled with how snapshots > are built and used, database & vacuumFlags checks etc. OK. Given that you've got strong feelings about this, do you want to propose a patch? I'm happy to fix it, since it's at least in part my bug, but I probably won't do it exactly like you would. regards, tom lane
On 2020-09-21 17:03:53 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2020-09-21 16:40:40 -0400, Tom Lane wrote: > >> Andres Freund <andres@anarazel.de> writes: > >>> I think that's an argument for what I suggested elsewhere, which is that > >>> we should move the logic for a different horizon for temp tables out of > >>> vacuum_set_xid_limits, and into procarray. > > >> But procarray does not seem like a great place for > >> table-persistence-dependent decisions either? > > > That ship has sailed a long long time ago though. GetOldestXmin() has > > looked at the passed-in relation for quite a while, and even before > > that we had logic about 'allDbs' etc. It doesn't easily seem possible > > to avoid that, given how intimately that's coupled with how snapshots > > are built and used, database & vacuumFlags checks etc. > > OK. Given that you've got strong feelings about this, do you want to > propose a patch? I'm happy to fix it, since it's at least in part my > bug, but I probably won't do it exactly like you would. I can give it a try. I can see several paths of varying invasiveness, not sure yet what the best approach is. Let me think about it for a bit. Greetings, Andres Freund
Hi, On 2020-09-21 14:20:03 -0700, Andres Freund wrote: > I can give it a try. I can see several paths of varying invasiveness, > not sure yet what the best approach is. Let me think about it for a bit. Ugh, sorry for taking so long to get around to this. Attached is a *prototype* implementation of this concept, which clearly is lacking some comment work (and is intentionally lacking some re-indentation). I described my thoughts about how to limit the horizons for temp tables in https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de Besides comments, this probably mainly needs a few more tests around temp table vacuuming. It should have at least an isolation test that verifies that temp table rows can be a) vacuumed b) pruned away in the presence of other sessions with xids. Greetings, Andres Freund
Attachment
Hi, On 2020-10-15 01:37:35 -0700, Andres Freund wrote: > Attached is a *prototype* implementation of this concept, which clearly is > lacking some comment work (and is intentionally lacking some > re-indentation). > > I described my thoughts about how to limit the horizons for temp tables in > https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de Attached is an updated version of this patch. Quite a bit of polish, added removal of the isTopLevel arguments added by a7212be8b9e that are now unnecessary, and changed the initialization of the temp table horizons to be latestCompletedXid + 1 instead of just latestCompletedXid when no xid is assigned. > Besides comments this probably mainly needs a bit more tests around temp > table vacuuming. Should have at least an isolation test that verifies > that temp table rows can be a) vacuumed b) pruned away in the presence > of other sessions with xids. I added an isolationtester test for this. It verifies that dead rows in temp tables get vacuumed and pruned despite concurrent sessions having older snapshots. It does so by forcing an index-only scan (IOS) and checking the number of heap fetches reported by EXPLAIN. I also added a companion test for permanent relations, ensuring that such rows do not get removed. Any comments? Otherwise I'll push that patch tomorrow. Greetings, Andres Freund
Attachment
Hi, On 2020-10-27 20:51:10 -0700, Andres Freund wrote: > On 2020-10-15 01:37:35 -0700, Andres Freund wrote: > > Attached is a *prototype* implementation of this concept, which clearly is > > lacking some comment work (and is intentionally lacking some > > re-indentation). > > > > I described my thoughts about how to limit the horizons for temp tables in > > https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de > > Attached is an updated version of this patch. Quite a bit of polish, > added removal of the isTopLevel arguments added by a7212be8b9e that are now > unnecessary, and changed the initialization of the temp table horizons > to be latestCompletedXid + 1 instead of just latestCompletedXid when no > xid is assigned. > > > > Besides comments this probably mainly needs a bit more tests around temp > > table vacuuming. Should have at least an isolation test that verifies > > that temp table rows can be a) vacuumed b) pruned away in the presence > > of other sessions with xids. > > I added an isolationtester test for this. It verifies that dead rows in > temp tables get vacuumed and pruned despite concurrent sessions having > older snapshots. It does so by forcing an IOS and checking the number of > heap fetches reported by EXPLAIN. I also added a companion test for > permanent relations, ensuring that such rows do not get removed. > > > Any comments? Otherwise I'll push that patch tomorrow. Just pushed this. Let's see what the BF says... It's kinda cool how much more aggressive hot pruning / killtuples now is for temp tables. Greetings, Andres Freund
On 2020-10-28 18:13:44 -0700, Andres Freund wrote: > Just pushed this. Let's see what the BF says... It says that apparently something is unstable about my new test. It first passed on a few animals, but then failed a lot in a row. Looking.
Hi, On 2020-10-28 19:09:14 -0700, Andres Freund wrote: > On 2020-10-28 18:13:44 -0700, Andres Freund wrote: > > Just pushed this. Let's see what the BF says... > > It says that apparently something is unstable about my new test. It > first passed on a few animals, but then failed a lot in a row. Looking. The differentiating factor is force_parallel_mode=regress. Ugh, this is nasty: The problem is that we can end up computing the horizons the first time before MyDatabaseId is even set. Which leads us to compute a too aggressive horizon for plain tables, because we skip over them, as MyDatabaseId still is InvalidOid:

    /*
     * Normally queries in other databases are ignored for anything but
     * the shared horizon. But in recovery we cannot compute an accurate
     * per-database horizon as all xids are managed via the
     * KnownAssignedXids machinery.
     */
    if (in_recovery ||
        proc->databaseId == MyDatabaseId ||
        proc->databaseId == 0) /* always include WalSender */
        h->data_oldest_nonremovable =
            TransactionIdOlder(h->data_oldest_nonremovable, xmin);

That then subsequently leads us to consider a row fully dead in heap_hot_search_buffers(), triggering the killtuples logic and causing the test to fail. With force_parallel_mode=regress we constantly start parallel workers, which makes it much more likely that this case is hit. It's trivial to fix luckily... Greetings, Andres Freund
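The failure mode can be modeled in a few lines. This is plain Python, not PostgreSQL code; the per-database filter mirrors the shape of the snippet quoted above, with 0 standing in for InvalidOid and the XID values chosen for illustration.

```python
# Toy model, not PostgreSQL code, of the bug: if the per-database filter
# runs before MyDatabaseId is set (InvalidOid, i.e. 0), backends in our
# own database are skipped and the computed horizon comes out too new.

def data_horizon(procs, my_db, latest_completed):
    """Simplified horizon computation: fold in the xmin of each backend in
    our database, or of walsenders (which have databaseId == 0)."""
    horizon = latest_completed + 1
    for db, xmin in procs:
        if db == my_db or db == 0:      # always include walsenders
            horizon = min(horizon, xmin)
    return horizon

procs = [(5, 980)]    # one backend in database 5 holding xmin 980

# Normal case: MyDatabaseId is set, so the old snapshot holds the horizon back.
assert data_horizon(procs, my_db=5, latest_completed=1000) == 980

# Bug: horizon computed before MyDatabaseId is set (still InvalidOid == 0):
# the same-database backend is skipped and the horizon is far too aggressive.
assert data_horizon(procs, my_db=0, latest_completed=1000) == 1001
```

With the too-new horizon, rows still visible to the concurrent snapshot look fully dead, which is what let heap_hot_search_buffers() trigger the killtuples logic and destabilize the test under force_parallel_mode=regress, where freshly started parallel workers hit the not-yet-initialized state constantly.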
Hi, On 2020-10-28 21:00:30 -0700, Andres Freund wrote: > On 2020-10-28 19:09:14 -0700, Andres Freund wrote: > > On 2020-10-28 18:13:44 -0700, Andres Freund wrote: > > > Just pushed this. Let's see what the BF says... > > > > It says that apparently something is unstable about my new test. It > > first passed on a few animals, but then failed a lot in a row. Looking. > > The differentiating factor is force_parallel_mode=regress. > > Ugh, this is nasty: The problem is that we can end up computing the > horizons the first time before MyDatabaseId is even set. Which leads us > to compute a too aggressive horizon for plain tables, because we skip > over them, as MyDatabaseId still is InvalidOid: > > /* > * Normally queries in other databases are ignored for anything but > * the shared horizon. But in recovery we cannot compute an accurate > * per-database horizon as all xids are managed via the > * KnownAssignedXids machinery. > */ > if (in_recovery || > proc->databaseId == MyDatabaseId || > proc->databaseId == 0) /* always include WalSender */ > h->data_oldest_nonremovable = > TransactionIdOlder(h->data_oldest_nonremovable, xmin); > > That then subsequently leads us consider a row fully dead in > heap_hot_search_buffers(). Triggering the killtuples logic. Causing the > test to fail. > > With force_parallel_mode=regress we constantly start parallel workers, > which makes it much more likely that this case is hit. > > It's trivial to fix luckily... Pushed that fix, hopefully that calms the BF. Greetings, Andres Freund
Hi hackers! Does anyone maintain open-source pg_surgery analogs for released versions of PG? It seems to me I'll have to use something like this, and I just thought that I should consider pg_surgery in favour of our pg_dirty_hands. Thanks! Best regards, Andrey Borodin.
On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > Does anyone maintain open-source pg_surgery analogs for released versions of PG? > It seems to me I'll have to use something like this and I just thought that I should consider pg_surgery in favour of our pg_dirty_hands. I do not. I'm still of the opinion that we ought to back-patch pg_surgery. This didn't attract a consensus before, and it's hard to dispute that it's a new feature in what would be a back branch. But it's unclear to me how users are otherwise supposed to recover from some of the bugs that are or have been present in those back branches. I'm not sure that I see the logic in telling people we'll try to prevent them from getting hosed in the future but if they're already hosed they can wait for v14 to fix it. They can't wait that long, and a dump-and-restore cycle is awfully painful. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jan 18, 2021 at 08:54:10AM -0500, Robert Haas wrote: > On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > Does anyone maintain open-source pg_surgery analogs for released > > versions of PG? It seems to me I'll have to use something like this > > and I just thought that I should consider pg_surgery in favour of our > > pg_dirty_hands. > > I do not. I'm still of the opinion that we ought to back-patch > pg_surgery. This didn't attract a consensus before, and it's hard to > dispute that it's a new feature in what would be a back branch. But > it's unclear to me how users are otherwise supposed to recover from > some of the bugs that are or have been present in those back branches. One other possibility would be to push a version of pg_surgery that is compatible with the back-branches somewhere external (e.g. either git.postgresql.org and/or Github), so that it can be picked up by distributions and/or individual users in need. That is assuming it does not need assorted server changes to go with; I did not read the thread in detail but I was under the assumption it is a client program? Michael -- Michael Banck Project Lead / Senior Consultant Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Our handling of personal data is subject to the following provisions: https://www.credativ.de/datenschutz
On Mon, Jan 18, 2021 at 9:25 AM Michael Banck <michael.banck@credativ.de> wrote: > One other possibility would be to push a version of pg_surgery that is > compatible with the back-branches somewhere external (e.g. either > git.postgresql.org and/or Github), so that it can be picked up by > distributions and/or individual users in need. Sure, but I don't see how that's better. > That is assuming it does not need assorted server changes to go with; I > did not read the thread in detail but I was under the assumption it is a > client program? It's a server extension. It does not require core changes. -- Robert Haas EDB: http://www.enterprisedb.com
> On 18 Jan 2021, at 18:54, Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: >> Does anyone maintain open-source pg_surgery analogs for released versions of PG? >> It seems to me I'll have to use something like this and I just thought that I should consider pg_surgery in favour of our pg_dirty_hands. > > I do not. I'm still of the opinion that we ought to back-patch > pg_surgery. +1. Yesterday I spent a few hours packaging pg_dirty_hands and pg_surgery (BTW, it works fine for 12). It's a kind of 911 tool: one doesn't think they will need it until they actually do. And the clock is ticking. OTOH, it opens new ways to shoot yourself in the foot. Best regards, Andrey Borodin.