Thread: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
[bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
"MauMau"
Date:
Hello, I've found and fixed a bug that causes recovery (crash recovery, PITR) to fail. Please find attached the patch against HEAD. [Bug] To reproduce the problem, do the following on Windows: 1. pg_ctl start 2. CREATE TABLESPACE tbs LOCATION 'some_tblspc_path'; 3. pg_ctl stop -mi 4. pg_ctl start The database server fails to start, leaving the below messages: LOG: database system was interrupted; last known up at 2013-10-31 20:24:07 JST LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 0/1788938 FATAL: could not remove symbolic link "pg_tblspc/16385": Permission denied CONTEXT: xlog redo create tablespace: 16385 "d:/tbs" LOG: startup process (PID 2724) exited with exit code 1 LOG: aborting startup due to startup process failure [Cause] In redo, create_tablespace_directories() tries to remove the symbolic link for the tablespace using unlink(). However, unlink() on Windows fails with errno=13 (Permission denied). This is because junction points are directories on Windows. [Fix] Follow destroy_tablespace_directories() and use rmdir() to remove the junction point. I've tested the patch. Could you review it and commit? I wish it to be backported to all major releases. Regards MauMau
Attachment
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
Andrew Dunstan
Date:
On 10/31/2013 08:40 AM, MauMau wrote: > Hello, > > I've found and fixed a bug that causes recovery (crash recovery, PITR) > to fail. Please find attached the patch against HEAD. > > > [Bug] > To reproduce the problem, do the following on Windows: > > 1. pg_ctl start > 2. CREATE TABLESPACE tbs LOCATION 'some_tblspc_path'; > 3. pg_ctl stop -mi > 4. pg_ctl start > > The database server fails to start, leaving the below messages: > > LOG: database system was interrupted; last known up at 2013-10-31 > 20:24:07 JST > LOG: database system was not properly shut down; automatic recovery in > progress > LOG: redo starts at 0/1788938 > FATAL: could not remove symbolic link "pg_tblspc/16385": Permission > denied > CONTEXT: xlog redo create tablespace: 16385 "d:/tbs" > LOG: startup process (PID 2724) exited with exit code 1 > LOG: aborting startup due to startup process failure > > > [Cause] > In redo, create_tablespace_directories() tries to remove the symbolic > link for the tablespace using unlink(). However, unlink() on Windows > fails with errno=13 (Permission denied). This is because junction > points are directories on Windows. > > > [Fix] > Follow destroy_tablespace_directories() and use rmdir() to remove the > junction point. > > > I've tested the patch. Could you review it and commit? I wish it to be > backported to all major releases. > > > Regards > MauMau > Why are you making this a runtime check instead of a compile time check? cheers andrew
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
"MauMau"
Date:
From: "Andrew Dunstan" <andrew@dunslane.net> > Why are you making this a runtime check instead of a compile time check? I thought the same situation could happen as in destroy_tablespace_directories(), but it doesn't seem to apply on second thought. Revised patch attached, which is very simple based on compile time check. Regards MauMau
Attachment
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
"MauMau"
Date:
From: "MauMau" <maumau307@gmail.com> > I thought the same situation could happen as in > destroy_tablespace_directories(), but it doesn't seem to apply on second > thought. Revised patch attached, which is very simple based on compile > time > check. Sorry, I was careless to leave an old comment. Please use this version. Regards MauMau
Attachment
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
Asif Naeem
Date:
I did worked on testing the patch on Windows, test scenario mentioned above thread is reproducible and the provided patch resolve the issue. In case of junction or directory unlink function (deprecated) returns -1 with errno “EACCES” (i.e. Permission denied). As you have followed destroy_tablespace_directories() function, Is there any specific reason not to use same logic to detect type of the file/link i.e. “(lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))”, It also seems have more appropriate error message i.e.
src/backend/commands/tablespace.c
….
745 /*
746 * Try to remove the symlink. We must however deal with the possibility
747 * that it's a directory instead of a symlink --- this could happen during
748 * WAL replay (see TablespaceCreateDbspace), and it is also the case on
749 * Windows where junction points lstat() as directories.
750 *
751 * Note: in the redo case, we'll return true if this final step fails;
752 * there's no point in retrying it. Also, ENOENT should provoke no more
753 * than a warning.
754 */
755 remove_symlink:
756 linkloc = pstrdup(linkloc_with_version_dir);
757 get_parent_directory(linkloc);
758 if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))
759 {
760 if (rmdir(linkloc) < 0)
761 ereport(redo ? LOG : ERROR,
762 (errcode_for_file_access(),
763 errmsg("could not remove directory \"%s\": %m",
764 linkloc)));
765 }
766 else
767 {
768 if (unlink(linkloc) < 0)
769 ereport(redo ? LOG : (errno == ENOENT ? WARNING : ERROR),
770 (errcode_for_file_access(),
771 errmsg("could not remove symbolic link \"%s\": %m",
772 linkloc)));
773 }
….
Other than this patch looks good to me.
Regards,
Muhammad Asif Naeem
On Thu, Oct 31, 2013 at 8:03 PM, MauMau <maumau307@gmail.com> wrote:
From: "MauMau" <maumau307@gmail.com>Sorry, I was careless to leave an old comment. Please use this version.I thought the same situation could happen as in
destroy_tablespace_directories(), but it doesn't seem to apply on second
thought. Revised patch attached, which is very simple based on compile time
check.
Regards
MauMau
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
"MauMau"
Date:
From: "Asif Naeem" <anaeem.it@gmail.com> > As you have followed destroy_tablespace_directories() function, Is there any specific reason not to use same logic to detect type of the file/link i.e. “(lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))”, It also seems have more appropriate error message i.e. Thanks for reviewing and testing the patch. Yes, at first I did what you mentioned, but modified the patch according to some advice in the mail thread. During redo, create_tablespace_directories() needs to handle the case where the $PGDATA/pg_tblspc/xxx is not a symlink but a directory even on UNIX/Linux. Please see TablespaceCreateDbspace is(). destroy_tablespace_directories() doesn't have to handle such situation. Regards MauMau
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
Asif Naeem
Date:
Hi MauMau,
+ #ifdef WIN32
+ if (rmdir(linkloc) < 0 && errno != ENOENT)
+ #else
if (unlink(linkloc) < 0 && errno != ENOENT)
+ #endif
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove symbolic link \"%s\": %m",
What is your opinion about it, Is it not worth changing ? . Thanks.
On Wed, Jan 15, 2014 at 7:42 PM, MauMau <maumau307@gmail.com> wrote:
From: "Asif Naeem" <anaeem.it@gmail.com>Thanks for reviewing and testing the patch. Yes, at first I did what you mentioned, but modified the patch according to some advice in the mail thread. During redo, create_tablespace_directories() needs to handle the case where the $PGDATA/pg_tblspc/xxx is not a symlink but a directory even on UNIX/Linux. Please see TablespaceCreateDbspace is(). destroy_tablespace_directories() doesn't have to handle such situation.As you havefollowed destroy_tablespace_directories() function, Is there any specific
reason not to use same logic to detect type of the file/link i.e.
“(lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))”, It also seems have
more appropriate error message i.e.
Regards
MauMau
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
"MauMau"
Date:
I'm sorry, I'm replying to an older mail, because I lost your latest mail by mistake. > Ah. Sorry, I missed that part. As NTFS junctions and symbolic links are > different (although they behave similarly), there seems only a minor > inconvenience related to misleading error message i.e. You are right. Fixed. Regards MauMau
Attachment
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
Amit Kapila
Date:
On Wed, Jan 15, 2014 at 8:12 PM, MauMau <maumau307@gmail.com> wrote: > From: "Asif Naeem" <anaeem.it@gmail.com> > >> As you have > > followed destroy_tablespace_directories() function, Is there any specific > reason not to use same logic to detect type of the file/link i.e. > "(lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))", It also seems have > more appropriate error message i.e. > > Thanks for reviewing and testing the patch. Yes, at first I did what you > mentioned, but modified the patch according to some advice in the mail > thread. During redo, create_tablespace_directories() needs to handle the > case where the $PGDATA/pg_tblspc/xxx is not a symlink but a directory even > on UNIX/Linux. Please see TablespaceCreateDbspace is(). > destroy_tablespace_directories() doesn't have to handle such situation. If create_tablespace_directories() needs to handle with directory both on Windows/Linux, then shouldn't it be a runtime check as in your first version rather than compile time check? Also isn't that the reason why destroy_tablespace_directories() have similar check? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
"MauMau"
Date:
From: "Amit Kapila" <amit.kapila16@gmail.com> > On Wed, Jan 15, 2014 at 8:12 PM, MauMau <maumau307@gmail.com> wrote: >> Thanks for reviewing and testing the patch. Yes, at first I did what you >> mentioned, but modified the patch according to some advice in the mail >> thread. During redo, create_tablespace_directories() needs to handle the >> case where the $PGDATA/pg_tblspc/xxx is not a symlink but a directory >> even >> on UNIX/Linux. Please see TablespaceCreateDbspace is(). >> destroy_tablespace_directories() doesn't have to handle such situation. > > If create_tablespace_directories() needs to handle with directory both on > Windows/Linux, then shouldn't it be a runtime check as in your first > version rather than compile time check? > Also isn't that the reason why destroy_tablespace_directories() have > similar > check? I see..., and you are correct. The first version of my patch should be the right fix. It seems that my head went somewhere when I submitted the second revision. What should I do? Should I re-submit the first revision as the latest fifth revision and link the email from the CommitFest newest entry? Regards MauMau
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
Amit Kapila
Date:
On Fri, Mar 21, 2014 at 12:24 PM, MauMau <maumau307@gmail.com> wrote: > From: "Amit Kapila" <amit.kapila16@gmail.com> >> If create_tablespace_directories() needs to handle with directory both on >> Windows/Linux, then shouldn't it be a runtime check as in your first >> version rather than compile time check? >> Also isn't that the reason why destroy_tablespace_directories() have >> similar >> check? > > > I see..., and you are correct. The first version of my patch should be the > right fix. It seems that my head went somewhere when I submitted the second > revision. > > What should I do? Should I re-submit the first revision as the latest fifth > revision and link the email from the CommitFest newest entry? The comments in your first version needs to be improved, as there you just mentioned a Windows specific comment: + /* On Windows, lstat() I think you can change comments (make it somewhat similar to destroy_tablespace_directories) and then submit it as a new version. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
"MauMau"
Date:
From: "Amit Kapila" <amit.kapila16@gmail.com> > The comments in your first version needs to be improved, as there > you just mentioned a Windows specific comment: > + /* On Windows, lstat() > > I think you can change comments (make it somewhat similar to > destroy_tablespace_directories) and then submit it as a new version. OK, done. Please find the attached patch. I also rebased the patch to HEAD. I'll update the CommitFest entry soon. Regards MauMau
Attachment
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
Amit Kapila
Date:
On Fri, Mar 21, 2014 at 12:24 PM, MauMau <maumau307@gmail.com> wrote: > From: "Amit Kapila" <amit.kapila16@gmail.com> >> On Wed, Jan 15, 2014 at 8:12 PM, MauMau <maumau307@gmail.com> wrote: >>> >>> Thanks for reviewing and testing the patch. Yes, at first I did what you >>> mentioned, but modified the patch according to some advice in the mail >>> thread. During redo, create_tablespace_directories() needs to handle the >>> case where the $PGDATA/pg_tblspc/xxx is not a symlink but a directory >>> even >>> on UNIX/Linux. Please see TablespaceCreateDbspace is(). >>> destroy_tablespace_directories() doesn't have to handle such situation. >> >> >> If create_tablespace_directories() needs to handle with directory both on >> Windows/Linux, then shouldn't it be a runtime check as in your first >> version rather than compile time check? >> Also isn't that the reason why destroy_tablespace_directories() have >> similar >> check? > > > I see..., and you are correct. I have thought about the above point again and it seems to me that the above claim (create_tablespace_directories() needs to handle a path which is not a symlink but directory) might not be true. For Example, I could easily think of case where it is required for destroy_tablespace_directories(). 1. Assume a tablespace tbs already exists. 2. Create table t1(c1 int) tablespace tbs; 3. drop table t1; 4. Drop tablespace tbs; 5. Do immediate shutdown (pg_ctl stop -mi); 6. During recovery it will create a table in directory (in function TablespaceCreateDbspace) which needs to be removedby destroy_tablespace_directories(). I am neither aware of, nor could think of such a case for create_tablespace_directories(). Do you have any such case in mind which I might be missing? By saying above, I don't mean that your current patch has any problem; even if there is no such scenario, I think your code is right as stat/isdir check seems to be okay to identify junction points and it avoids ifdef WIN32 check (which I personally think is bit annoying and we should try to avoid such code unless it is must or provides any significant advantage). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
"MauMau"
Date:
From: "Amit Kapila" <amit.kapila16@gmail.com> > 1. Assume a tablespace tbs already exists. > 2. Create table t1(c1 int) tablespace tbs; > 3. drop table t1; > 4. Drop tablespace tbs; > 5. Do immediate shutdown (pg_ctl stop -mi); > 6. During recovery it will create a table in directory (in function > TablespaceCreateDbspace) which needs to be removed by > destroy_tablespace_directories(). > > I am neither aware of, nor could think of such a case for > create_tablespace_directories(). Do you have any such case in mind > which I might be missing? A bit contrived example is: 1. After the directory is created by TablespaceCreateDbspace(), recovery is stopped (e.g. due to power outage). The directory remains. 2. Restart the server, redoing CREATE TABLESPACE during recovery, which executes create_tablespace_directories(). > By saying above, I don't mean that your current patch has any > problem; even if there is no such scenario, I think your code is > right as stat/isdir check seems to be okay to identify junction > points and it avoids ifdef WIN32 check (which I personally think > is bit annoying and we should try to avoid such code unless it > is must or provides any significant advantage). I think so, too. Regards MauMau
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
Amit Kapila
Date:
On Mon, Mar 24, 2014 at 7:49 PM, MauMau <maumau307@gmail.com> wrote: > A bit contrived example is: > > 1. After the directory is created by TablespaceCreateDbspace(), recovery is > stopped (e.g. due to power outage). The directory remains. > 2. Restart the server, redoing CREATE TABLESPACE during recovery, which > executes create_tablespace_directories(). I don't understand how after step-1, step-2 can occur in recovery for same tablespace path. Function TablespaceCreateDbspace() would have called for CREATE TABLE. Now Step-2 can only occur if there is a Drop Tablespace command in-between step-1 and step-2, else CREATE TABLESPACE can't be successful during command execution so will not get recorded in WAL. Basically Create Table cannot happen on a particular directory without having some CREATE TABLESPACE before it, so in the above example taken by you, there must be some Create TableSpace before step-1 or it's on default tablespace location which means you cannot perform step-2 for same tablespace path as step-1 without having DROP TABLESPACE in-between step-1 and step-2. If you think that above scenario is not possible, then you just need to modify comment: "! * Remove old symlink in recovery...." One more minor point about patch: + struct stat st; if (InRecovery) { struct stat st; Defining stat struct two times in same function in different ways doesn't seem to be good, we can do the same way for new usage as is already done in code or may be declare it once. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
"MauMau"
Date:
From: "Amit Kapila" <amit.kapila16@gmail.com> > If you think that above scenario is not possible, then you just need to > modify comment: > "! * Remove old symlink in recovery...." Hm, my scenario seems impossible. I reverted the comment. > One more minor point about patch: > + struct stat st; > > if (InRecovery) > { > struct stat st; > > Defining stat struct two times in same function in different ways doesn't > seem to be good, we can do the same way for new usage as is already > done in code or may be declare it once. OK, I removed the second (existing) definition of st. Regards MauMau
Attachment
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
Amit Kapila
Date:
On Tue, Mar 25, 2014 at 5:39 PM, MauMau <maumau307@gmail.com> wrote: > OK, I removed the second (existing) definition of st. This patch version looks fine, I have verified the issue, regression tests passed. I had attached the latest version of patch in CF app and marked it as "Ready For Committer". With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
Andres Freund
Date:
On 2014-03-25 21:09:13 +0900, MauMau wrote: > ! /* > ! * Remove old symlink in recovery, in case it points to the wrong place. > ! * On Windows, lstat() reports junction points as directories. > ! */ > if (InRecovery) > { > ! if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode)) > ! { > ! if (rmdir(linkloc) < 0) > ! ereport(ERROR, > ! (errcode_for_file_access(), > ! errmsg("could not remove directory \"%s\": %m", > ! linkloc))); > ! } > ! else > ! { > ! if (unlink(linkloc) < 0 && errno != ENOENT) > ! ereport(ERROR, > ! (errcode_for_file_access(), > ! errmsg("could not remove symbolic link \"%s\": %m", > ! linkloc))); > ! } > } if (..) ... else { if (...) ... } is pretty odd code layout. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From
Tom Lane
Date:
"MauMau" <maumau307@gmail.com> writes: > [ remove_tblspc_symlink_v6.patch ] Committed, thanks. regards, tom lane