Thread: Re: pg_upgrade check for invalid databases
On Sun, Sep 29, 2024 at 08:45:50PM -0400, Thomas Krennwallner wrote: > if a cluster contains invalid databases that we cannot connect to anymore, > pg_upgrade would currently fail when trying to connect to the first > encountered invalid database with > > [...] > > If there is more than one invalid database, we need to run pg_upgrade more > than once (unless the user inspects pg_database). > > I attached two small patches for PG 17 and PG 18 (can be easily backported > to all previous versions upon request). Instead of just failing to connect > with an error, we collect all invalid databases in a report file > invalid_databases.txt: Should we have pg_upgrade skip invalid databases? If the only valid action is to drop them, IMHO it seems unnecessary to require users to manually drop them before retrying pg_upgrade. -- nathan
Nathan Bossart <nathandbossart@gmail.com> writes: > Should we have pg_upgrade skip invalid databases? If the only valid action > is to drop them, IMHO it seems unnecessary to require users to manually > drop them before retrying pg_upgrade. I was thinking the same. But I wonder if there is any chance of losing data that could be recoverable. It feels like this should not be a default behavior. TBH I'm not finding anything very much wrong with the current behavior... this has to be a rare situation, do we need to add debatable behavior to make it easier? regards, tom lane
> On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > TBH I'm not finding anything very much wrong with the current > behavior... this has to be a rare situation, do we need to add > debatable behavior to make it easier? One argument would be to make the checks consistent, pg_upgrade generally tries to report all the offending entries to help the user when fixing the source database. Not sure if it's a strong enough argument for carrying code which really shouldn't see much use though. -- Daniel Gustafsson
Daniel Gustafsson <daniel@yesql.se> writes: >> On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> TBH I'm not finding anything very much wrong with the current >> behavior... this has to be a rare situation, do we need to add >> debatable behavior to make it easier? > One argument would be to make the checks consistent, pg_upgrade generally tries > to report all the offending entries to help the user when fixing the source > database. Not sure if it's a strong enough argument for carrying code which > really shouldn't see much use though. OK, but the consistency argument would be to just report and fail. I don't think there's a precedent in other pg_upgrade checks for trying to fix problems automatically. regards, tom lane
On 30/09/2024 17.29, Daniel Gustafsson wrote: >> On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> TBH I'm not finding anything very much wrong with the current >> behavior... this has to be a rare situation, do we need to add >> debatable behavior to make it easier? > > One argument would be to make the checks consistent, pg_upgrade generally tries > to report all the offending entries to help the user when fixing the source > database. Not sure if it's a strong enough argument for carrying code which > really shouldn't see much use though. In general, I agree that this situation should be rare for deliberate DROP DATABASE interrupted in interactive sessions. Unfortunately, for (popular) tools that perform automatic "temporary database" cleanup, we could recently see an increase in invalid databases. The additional check for pg_upgrade was made necessary due to several unrelated customers having invalid databases that stem from left-over Prisma Migrate "shadow databases" [1]. We could not reproduce this Prisma Migrate issue yet, as those migrations happened some time ago. Maybe this bug really stems from a much older Prisma Migrate version and we only see the fallout now. This is still a TODO item. But it appears that this tool can get interrupted "at the wrong time" while it is deleting temporary databases (probably a manual Ctrl-C), and clients are unaware that this can then leave behind invalid databases. Those temporary databases do not cause any harm as they are not used anymore. But eventually, PG installations will be upgraded to the next major version, and it is only then when those invalid databases resurface after pg_upgrade fails to run the checks. Long story short: interactive DROP DATABASE interrupts are rare (they do exist, but customers are usually aware). Automation tools on the other hand may run DROP DATABASE and when they get interrupted at the wrong time they will then produce several left-over invalid databases. pg_upgrade will then fail to run the checks. [1] https://www.prisma.io/docs/orm/prisma-migrate/understanding-prisma-migrate/shadow-database
> On 1 Oct 2024, at 00:20, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Daniel Gustafsson <daniel@yesql.se> writes: >>> On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> TBH I'm not finding anything very much wrong with the current >>> behavior... this has to be a rare situation, do we need to add >>> debatable behavior to make it easier? > >> One argument would be to make the checks consistent, pg_upgrade generally tries >> to report all the offending entries to help the user when fixing the source >> database. Not sure if it's a strong enough argument for carrying code which >> really shouldn't see much use though. > > OK, but the consistency argument would be to just report and fail. > I don't think there's a precedent in other pg_upgrade checks for > trying to fix problems automatically. Correct, sorry for being unclear. The consistency argument would be to expand pg_upgrade to report all invalid databases rather than just the first found; attempting to fix problems would be a new behavior. -- Daniel Gustafsson
> On 1 Oct 2024, at 02:35, Thomas Krennwallner <tk@postsubmeta.net> wrote: > > On 30/09/2024 17.29, Daniel Gustafsson wrote: >>> On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> TBH I'm not finding anything very much wrong with the current >>> behavior... this has to be a rare situation, do we need to add >>> debatable behavior to make it easier? >> One argument would be to make the checks consistent, pg_upgrade generally tries >> to report all the offending entries to help the user when fixing the source >> database. Not sure if it's a strong enough argument for carrying code which >> really shouldn't see much use though. > In general, I agree that this situation should be rare for deliberate DROP DATABASE interrupted in interactive sessions. > > Unfortunately, for (popular) tools that perform automatic "temporary database" cleanup, we could recently see an increasein invalid databases. > > The additional check for pg_upgrade was made necessary due to several unrelated customers having invalid databases thatstem from left-over Prisma Migrate "shadow databases" [1]. We could not reproduce this Prisma Migrate issue yet, as thosemigrations happened some time ago. Maybe this bug really stems from a much older Prisma Migrate version and we onlysee the fallout now. This is still a TODO item. > > But it appears that this tool can get interrupted "at the wrong time" while it is deleting temporary databases (probablya manual Ctrl-C), and clients are unaware that this can then leave behind invalid databases. > > Those temporary databases do not cause any harm as they are not used anymore. But eventually, PG installations will beupgraded to the next major version, and it is only then when those invalid databases resurface after pg_upgrade fails torun the checks. Databases containing transient data no longer needed left by buggy tools is one thing, but pg_upgrade won't be able to differentiate between those and invalid databases of legitimate interest. Allowing pg_upgrade to skip invalid databases expose the risk of (potentially) valuable data being dropped during the upgrade due to the user not having realized a rarely-used production database was invalid. > Long story short: interactive DROP DATABASE interrupts are rare (they do exist, but customers are usually aware). Automationtools on the other hand may run DROP DATABASE and when they get interrupted at the wrong time they will then produceseveral left-over invalid databases. pg_upgrade will then fail to run the checks. Checking and reporting all invalid databases during the check phase seems like a user-friendly option here, I can agree that the current behaviour isn't great for users experiencing this issue. -- Daniel Gustafsson
On Tue, Oct 1, 2024 at 09:28:54AM +0200, Daniel Gustafsson wrote: > > On 1 Oct 2024, at 00:20, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Daniel Gustafsson <daniel@yesql.se> writes: > >>> On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> TBH I'm not finding anything very much wrong with the current > >>> behavior... this has to be a rare situation, do we need to add > >>> debatable behavior to make it easier? > > > >> One argument would be to make the checks consistent, pg_upgrade generally tries > >> to report all the offending entries to help the user when fixing the source > >> database. Not sure if it's a strong enough argument for carrying code which > >> really shouldn't see much use though. > > > > OK, but the consistency argument would be to just report and fail. > > I don't think there's a precedent in other pg_upgrade checks for > > trying to fix problems automatically. > > Correct, sorry for being unclear. The consistency argument would be to expand > pg_upgrade to report all invalid databases rather than just the first found; > attempting to fix problems would be a new behavior. Yes, historically pg_upgrade will fail if it finds anything unusual, mostly because what it does normally is already scary enough. If users what pg_upgrade to do cleanups, it would be enabled by a separate flag, or even a new command-line app. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On Mon, Oct 07, 2024 at 03:37:35PM -0400, Bruce Momjian wrote: > On Tue, Oct 1, 2024 at 09:28:54AM +0200, Daniel Gustafsson wrote: >> Correct, sorry for being unclear. The consistency argument would be to expand >> pg_upgrade to report all invalid databases rather than just the first found; >> attempting to fix problems would be a new behavior. > > Yes, historically pg_upgrade will fail if it finds anything unusual, > mostly because what it does normally is already scary enough. If users > what pg_upgrade to do cleanups, it would be enabled by a separate flag, > or even a new command-line app. While I suspect it's rare that someone CTRL-C's out of an accidental DROP DATABASE and then runs pg_upgrade before trying to recover the data, I agree with the principle of having pg_upgrade fail by default for things like this. If we did add a new flag, the new invalid database report that Daniel mentions could say something like "try again with --skip-invalid-databases to have pg_upgrade automatically drop invalid databases." -- nathan
> On 7 Oct 2024, at 22:04, Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Mon, Oct 07, 2024 at 03:37:35PM -0400, Bruce Momjian wrote: >> On Tue, Oct 1, 2024 at 09:28:54AM +0200, Daniel Gustafsson wrote: >>> Correct, sorry for being unclear. The consistency argument would be to expand >>> pg_upgrade to report all invalid databases rather than just the first found; >>> attempting to fix problems would be a new behavior. >> >> Yes, historically pg_upgrade will fail if it finds anything unusual, >> mostly because what it does normally is already scary enough. If users >> what pg_upgrade to do cleanups, it would be enabled by a separate flag, >> or even a new command-line app. > > While I suspect it's rare that someone CTRL-C's out of an accidental DROP > DATABASE and then runs pg_upgrade before trying to recover the data, I > agree with the principle of having pg_upgrade fail by default for things > like this. If we did add a new flag, the new invalid database report that > Daniel mentions could say something like "try again with > --skip-invalid-databases to have pg_upgrade automatically drop invalid > databases." If we are teaching pg_upgrade to handle errors, either by skipping or by fixing, then I believe this is the right way to go about it. A successful run should probably also create a report of the databases which were skipped. -- Daniel Gustafsson
On Sun, Oct 13, 2024 at 08:28:57AM -0400, Thomas Krennwallner wrote: > In v2 I've made changes to the patch incorporating the suggestions here: > > * Default behaviour is to just fail with a report of all invalid databases > > * A new option --skip-invalid-databases will then skip the checks, and > would not transfer any invalid database to the new cluster. A warning > with a report file will then follow after a successful run. > > Dropping invalid databases in the old cluster will make invalid > databases unrecoverable, so I opted for a skip over invalid databases > approach that would leave invalid databases in the old cluster. > > Apart from a missing --skip-invalid-databases test, does this attempt look OK? I don't think there is enough demand for the feature of skipping invalid databases because we have gotten few reports of such problems, and also because your case is related to an external tool causing this problem. What might be acceptable would be to add an option that would make pg_upgrade more tolerant of problems in many areas, that is a lot more research and discussion. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"