Thread: [HACKERS] Still another race condition in recovery TAP tests
In a moment of idleness I tried to run the TAP tests on pademelon,
which is a mighty old and slow machine. Behold,
src/test/recovery/t/010_logical_decoding_timelines.pl fell over, with
the relevant section of its log contents being:

# testing logical timeline following with a filesystem-level copy
# Taking filesystem backup b1 from node "master"
# pg_start_backup: 0/2000028
could not opendir(/home/postgres/pgsql/src/test/recovery/tmp_check/t_010_logical_decoding_timelines_master_data/pgdata/pg_wal/archive_status/000000010000000000000001.ready): No such file or directory at ../../../src/test/perl//RecursiveCopy.pm line 115.
### Stopping node "master" using mode immediate

The postmaster log has this relevant entry:

2017-09-08 22:03:22.917 EDT [19160] DEBUG:  archived write-ahead log file "000000010000000000000001"

It looks to me like the archiver removed 000000010000000000000001.ready
between where RecursiveCopy.pm checks that $srcpath is a regular file
or directory (line 95) and where it rechecks whether it's a regular
file (line 105). Then the "-f" test on line 105 fails, allowing it to
fall through to the its-a-directory path, and unsurprisingly the opendir
at line 115 fails with the above symptom.

In short, RecursiveCopy.pm is woefully unprepared for the idea that the
source directory tree might be changing underneath it. I'm not real sure
if the appropriate answer to this is "we need to fix RecursiveCopy" or
"we need to fix the callers to not call RecursiveCopy until the source
directory is stable". Thoughts?

(I do kinda wonder why we rolled our own RecursiveCopy; surely there's
a better implementation in CPAN?)

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
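The racy check-then-use pattern Tom describes can be sketched in a few lines of Perl. This is not the module's code — racy_copy_dispatch and the file names are mine, for illustration only — but it mirrors the unpatched shape of _copypath_recurse: a -f test, then an opendir on the fall-through path, so a file unlinked in between lands in the directory branch and opendir() fails with ENOENT.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

sub racy_copy_dispatch
{
	my ($srcpath) = @_;

	return 'file' if -f $srcpath;    # copy() branch in the real module

	# If $srcpath was unlinked after the -f test, we wrongly take the
	# directory branch and opendir() fails with ENOENT.
	opendir(my $dh, $srcpath) or return "opendir failed: $!";
	closedir($dh);
	return 'dir';
}

my $dir    = tempdir(CLEANUP => 1);
my $victim = "$dir/000000010000000000000001.ready";
open(my $fh, '>', $victim) or die $!;
close($fh);

# Simulate the archiver removing the .ready file mid-copy.
unlink($victim) or die $!;
print racy_copy_dispatch($victim), "\n";    # prints "opendir failed: ..."
```

With a real concurrent writer the window is only between the -f test and the opendir, which is why the failure shows up only on slow machines.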
On Sat, Sep 9, 2017 at 11:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In a moment of idleness I tried to run the TAP tests on pademelon,
> which is a mighty old and slow machine.

How long did it take? Just wondering if that's actually the slowest one
or not to run the full set of recovery tests. This would be mighty slow.
Even hamster never detected this failure, I think.

> It looks to me like the archiver removed 000000010000000000000001.ready
> between where RecursiveCopy.pm checks that $srcpath is a regular file
> or directory (line 95) and where it rechecks whether it's a regular
> file (line 105). Then the "-f" test on line 105 fails, allowing it to
> fall through to the its-a-directory path, and unsurprisingly the opendir
> at line 115 fails with the above symptom.
>
> In short, RecursiveCopy.pm is woefully unprepared for the idea that the
> source directory tree might be changing underneath it.

Yes, before 010_logical_decoding_timelines and _backup_fs were
introduced, we only used RecursiveCopy with a static source in
init_from_backup(), which presented no risks.

> I'm not real sure if the appropriate answer to this is "we need to fix
> RecursiveCopy" or "we need to fix the callers to not call RecursiveCopy
> until the source directory is stable". Thoughts?

We could make RecursiveCopy smarter here. In backup_fs_hot, if for
example another segment switch happens between pg_start_backup() and
RecursiveCopy, say with a low archive_timeout in a TAP test using this
configuration, then we would be back at the same point. Another thing
that we could actually use here is the fact that _backup_fs assumes that
archiving must be enabled (a sanity check would be nice!), so we could
just exclude pg_wal from the FS-backup and avoid any problems, at least
for this test, but that won't save us from any other paths getting
modified on disk. So making RecursiveCopy smarter really looks like the
best bet in the long term.
> (I do kinda wonder why we rolled our own RecursiveCopy; surely there's
> a better implementation in CPAN?)

This comes from here, which is something I got involved in:

commit: 1caef31d9e550408d0cbc5788a422dcb69736df5
author: Alvaro Herrera <alvherre@alvh.no-ip.org>
date: Wed, 2 Dec 2015 18:46:16 -0300
Refactor Perl test code

So the idea here is really to have:
1) Something compatible down to perl 5.8.
2) Something which is not external, which is where we could have used
File::Copy::Recursive (only compatible from 5.12 if I recall correctly?
I am too lazy to check now).

Note as well that a copy of the whole directory is used for backups
instead of the tar format of pg_basebackup because there is no easy
support for tar files down to 5.8.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> On Sat, Sep 9, 2017 at 11:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In a moment of idleness I tried to run the TAP tests on pademelon,
>> which is a mighty old and slow machine.

> How long did it take?

The last time I tried it, make check-world took about 3 hours IIRC.
But that was a few months ago; I suspect it's more now. I re-launched
the run after diagnosing this failure; it's gotten past the recovery
tests (proving that this is a race condition, not a hard failure) and
is somewhere in the contrib tests, about 2 hours in.

>> I'm not real sure if the appropriate answer to this is "we need to fix
>> RecursiveCopy" or "we need to fix the callers to not call RecursiveCopy
>> until the source directory is stable". Thoughts?

> ... So making RecursiveCopy really looks
> like the best bet in the long term.

Yeah, even if we fixed this particular call site, I'm sure the issue
would come up again. Certainly we expect hot backups to work with
a changing source directory.

>> (I do kinda wonder why we rolled our own RecursiveCopy; surely there's
>> a better implementation in CPAN?)

> This comes from here, which is something I got involved in:
> So the idea here is really to have:
> 1) Something compatible down to perl 5.8.
> 2) Something which is not external, which is where we could have used
> File::Copy::Recursive (only compatible from 5.12 if I recall
> correctly? I am too lazy to check now).

Hm. Even if we don't want to use File::Copy::Recursive, maybe we should
look at it and see what (if anything) it does about source-tree changes.

			regards, tom lane
On Sat, Sep 9, 2017 at 1:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>>> I'm not real sure if the appropriate answer to this is "we need to fix
>>> RecursiveCopy" or "we need to fix the callers to not call RecursiveCopy
>>> until the source directory is stable". Thoughts?
>
>> ... So making RecursiveCopy really looks
>> like the best bet in the long term.
>
> Yeah, even if we fixed this particular call site, I'm sure the issue
> would come up again. Certainly we expect hot backups to work with
> a changing source directory.

Definitely...

> Hm. Even if we don't want to use File::Copy::Recursive, maybe we should
> look at it and see what (if anything) it does about source-tree changes.

Actually, looking at the CPAN file of the module here:
http://cpansearch.perl.org/src/DMUEY/File-Copy-Recursive-0.38/Recursive.pm
There is no minimum version defined. There is even some handling for
perl 5.6, so I think that we should actually be fine from a
compatibility point of view. Also, dircopy() uses fcopy(), which just
uses File::Copy::copy() to copy a file. A difference with RecursiveCopy
is that File::Copy::Recursive does not fail if copy() itself fails (see
"copy(@_) or return;" in "sub fcopy"). It just returns on failure.

We have actually discussed RecursiveCopy and File::Copy::Recursive two
years ago:
https://www.postgresql.org/message-id/20151123212707.GD4073@alvherre.pgsql
The first version of RecursiveCopy did not need any fancy copy except a
simple recursive function, and I personally still like the fact that
symlinks are not handled. Should we actually care about that for cases
where we take an FS-backup with tablespaces and a pg_wal symlink? Such a
case has not shown up yet, but if it does I think that we could consider
integrating File::Copy::Recursive.

In short, I'd still like to keep RecursiveCopy for now, but change its
code so that a failed copy() is not a hard failure. What do you think?
--
Michael
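The behavioral difference Michael points at — RecursiveCopy dying on a failed copy() versus File::Copy::Recursive's fcopy() quietly returning via "copy(@_) or return;" — can be sketched side by side. The two subs below are my illustrative names, not code from either module:

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Strict style (RecursiveCopy today): any copy() failure is fatal.
sub copy_strict  { copy($_[0], $_[1]) or die "copy failed: $!"; return 1; }

# Lenient style (File::Copy::Recursive's fcopy): failure just returns,
# so the caller never learns the file was skipped.
sub copy_lenient { copy($_[0], $_[1]) or return; return 1; }

# A missing source is fatal in one style, a quiet no-op in the other.
my $ok = eval { copy_strict('/no/such/source', '/tmp/dest.sketch'); 1 };
print "strict: ",  ($ok ? "ok" : "died"), "\n";
print "lenient: ", (copy_lenient('/no/such/source', '/tmp/dest.sketch')
					  ? "ok" : "returned false"), "\n";
```

The lenient form swallows every errno, which is exactly the foot-gun the thread ends up rejecting in favor of tolerating ENOENT only.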
Michael Paquier <michael.paquier@gmail.com> writes:
> On Sat, Sep 9, 2017 at 1:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, even if we fixed this particular call site, I'm sure the issue
>> would come up again. Certainly we expect hot backups to work with
>> a changing source directory.

> In short, I'd still like to keep RecursiveCopy for now, but change its
> code so that a failed copy() is not a hard failure. What do you think?

The specific case we need to allow is "ENOENT on a file/dir that was
there a moment ago". I think it still behooves us to complain about
anything else. If you think it's a simple fix, have at it. But
I see at least three ways for _copypath_recurse to fail depending on
exactly when the file disappears.

			regards, tom lane
On Sun, Sep 10, 2017 at 12:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Sat, Sep 9, 2017 at 1:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Yeah, even if we fixed this particular call site, I'm sure the issue
>>> would come up again. Certainly we expect hot backups to work with
>>> a changing source directory.
>
>> In short, I'd still like to keep RecursiveCopy for now, but change its
>> code so that a failed copy() is not a hard failure. What do you think?
>
> The specific case we need to allow is "ENOENT on a file/dir that was
> there a moment ago". I think it still behooves us to complain about
> anything else. If you think it's a simple fix, have at it. But
> I see at least three ways for _copypath_recurse to fail depending on
> exactly when the file disappears.

I have finally been able to take a couple of hours and looked at the
problems in this module. Errno is supported down to 5.8.8 per the docs,
and errno values are set on Windows as well; copy() and opendir()
complain with ENOENT for missing entries:
http://perldoc.perl.org/5.8.8/Errno.html

readdir() does not, though. If for example a directory gets removed in
the middle of a scan, then an empty list is returned. mkdir() would not
fail with ENOENT either: the parent directory would have already been
created by the time a child directory is being scanned. With the checks
for -d and -f, this brings the count to three code paths, indeed.

With the patch attached, I have added some manual sleep calls in
RecursiveCopy.pm and triggered manual deletions of the source directory.
The copy is able to complete in any case, warning about missing
directories or files. I have added warn messages in the patch when
ENOENT is triggered, though I think that those should be removed in the
final patch.

By the way, 010_logical_decoding_timelines.pl does not need to include
RecursiveCopy.
--
Michael
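The "tolerate ENOENT only" rule under discussion hinges on the %! hash from the Errno module (available back to perl 5.8), which lets code test the failure reason by name instead of parsing $!. A minimal sketch — open_or_skip_missing is my name for illustration, not part of RecursiveCopy:

```perl
use strict;
use warnings;
use Errno;    # populates %! so $!{ENOENT} works by name

sub open_or_skip_missing
{
	my ($path) = @_;
	my $fh;
	unless (open($fh, '<', $path))
	{
		return undef if $!{ENOENT};    # vanished concurrently: skip it
		die "open($path) failed: $!";  # anything else is still fatal
	}
	return $fh;
}

# A file that disappeared is quietly skipped; a readable one is opened.
print defined(open_or_skip_missing('/no/such/file')) ? "opened\n" : "skipped\n";
```

The same unless/return-if-ENOENT shape applies to opendir(); readdir() needs no such guard, since (as noted above) it just returns an empty list when the directory vanishes mid-scan.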
Michael Paquier <michael.paquier@gmail.com> writes:
> On Sun, Sep 10, 2017 at 12:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The specific case we need to allow is "ENOENT on a file/dir that was
>> there a moment ago". I think it still behooves us to complain about
>> anything else. If you think it's a simple fix, have at it. But
>> I see at least three ways for _copypath_recurse to fail depending on
>> exactly when the file disappears.

> With the check for -d and -f, this brings indeed the count to three
> code paths. With the patch attached, I have added some manual sleep
> calls in RecursiveCopy.pm and triggered manual deletions of the source
> repository. The copy is able to complete in any case, warning about
> missing directories or files. I have added warn messages in the patch
> when ENOENT is triggered, though I think that those should be removed
> in the final patch.

Hm, I don't much like having it silently ignore files that are present;
that seems like a foot-gun in the long run. What do you think of the
attached?

> By the way, 010_logical_decoding_timelines.pl does not need to include
> RecursiveCopy.

Good catch.

			regards, tom lane

diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm
index 28ecaf6..19f7dd2 100644
*** a/src/test/perl/RecursiveCopy.pm
--- b/src/test/perl/RecursiveCopy.pm
*************** use File::Copy;
*** 29,40 ****
  =head2 copypath($from, $to, %params)
  
  Recursively copy all files and directories from $from to $to.
  
  Only regular files and subdirectories are copied.  Trying to copy other types
  of directory entries raises an exception.
  
  Raises an exception if a file would be overwritten, the source directory can't
! be read, or any I/O operation fails.  Always returns true.
  
  If the B<filterfn> parameter is given, it must be a subroutine reference.
  This subroutine will be called for each entry in the source directory with its
--- 29,45 ----
  =head2 copypath($from, $to, %params)
  
  Recursively copy all files and directories from $from to $to.
+ Does not preserve file metadata (e.g., permissions).
  
  Only regular files and subdirectories are copied.  Trying to copy other types
  of directory entries raises an exception.
  
  Raises an exception if a file would be overwritten, the source directory can't
! be read, or any I/O operation fails.  However, we silently ignore ENOENT on
! open, because when copying from a live database it's possible for a file/dir
! to be deleted after we see its directory entry but before we can open it.
! 
! Always returns true.
  
  If the B<filterfn> parameter is given, it must be a subroutine reference.
  This subroutine will be called for each entry in the source directory with its
*************** sub copypath
*** 74,79 ****
--- 79,87 ----
  		$filterfn = sub { return 1; };
  	}
  
+ 	# Complain if original path is bogus, because _copypath_recurse won't.
+ 	die "\"$base_src_dir\" does not exist" if !-e $base_src_dir;
+ 
  	# Start recursive copy from current directory
  	return _copypath_recurse($base_src_dir, $base_dest_dir, "", $filterfn);
  }
*************** sub _copypath_recurse
*** 89,100 ****
  	return 1 unless &$filterfn($curr_path);
  
  	# Check for symlink -- needed only on source dir
! 	die "Cannot operate on symlinks" if -l $srcpath;
! 
! 	# Can't handle symlinks or other weird things
! 	die "Source path \"$srcpath\" is not a regular file or directory"
! 	  unless -f $srcpath
! 		  or -d $srcpath;
  
  	# Abort if destination path already exists.  Should we allow directories
  	# to exist already?
--- 97,104 ----
  	return 1 unless &$filterfn($curr_path);
  
  	# Check for symlink -- needed only on source dir
! 	# (note: this will fall through quietly if file is already gone)
! 	die "Cannot operate on symlink \"$srcpath\"" if -l $srcpath;
  
  	# Abort if destination path already exists.  Should we allow directories
  	# to exist already?
*************** sub _copypath_recurse
*** 104,128 ****
  	# same name and we're done.
  	if (-f $srcpath)
  	{
! 		copy($srcpath, $destpath)
  		  or die "copy $srcpath -> $destpath failed: $!";
  		return 1;
  	}
  
! 	# Otherwise this is directory: create it on dest and recurse onto it.
! 	mkdir($destpath) or die "mkdir($destpath) failed: $!";
! 
! 	opendir(my $directory, $srcpath)
! 	  or die "could not opendir($srcpath): $!";
! 	while (my $entry = readdir($directory))
  	{
! 		next if ($entry eq '.' or $entry eq '..');
! 		_copypath_recurse($base_src_dir, $base_dest_dir,
! 			$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn)
! 		  or die "copypath $srcpath/$entry -> $destpath/$entry failed";
  	}
- 	closedir($directory);
  
! 	return 1;
  }
  
  1;
--- 108,154 ----
  	# same name and we're done.
  	if (-f $srcpath)
  	{
! 		my $fh;
! 		unless (open($fh, '<', $srcpath))
! 		{
! 			return 1 if ($!{ENOENT});
! 			die "open($srcpath) failed: $!";
! 		}
! 		copy($fh, $destpath)
  		  or die "copy $srcpath -> $destpath failed: $!";
+ 		close $fh;
  		return 1;
  	}
  
! 	# If it's a directory, create it on dest and recurse into it.
! 	if (-d $srcpath)
  	{
! 		my $directory;
! 		unless (opendir($directory, $srcpath))
! 		{
! 			return 1 if ($!{ENOENT});
! 			die "opendir($srcpath) failed: $!";
! 		}
! 
! 		mkdir($destpath) or die "mkdir($destpath) failed: $!";
! 
! 		while (my $entry = readdir($directory))
! 		{
! 			next if ($entry eq '.' or $entry eq '..');
! 			_copypath_recurse($base_src_dir, $base_dest_dir,
! 				$curr_path eq '' ? $entry : "$curr_path/$entry", $filterfn)
! 			  or die "copypath $srcpath/$entry -> $destpath/$entry failed";
! 		}
! 
! 		closedir($directory);
! 		return 1;
  	}
  
! 	# If it disappeared from sight, that's OK.
! 	return 1 if !-e $srcpath;
! 
! 	# Else it's some weird file type; complain.
! 	die "Source path \"$srcpath\" is not a regular file or directory";
  }
  
  1;
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
index edc0219..5620450 100644
*** a/src/test/recovery/t/010_logical_decoding_timelines.pl
--- b/src/test/recovery/t/010_logical_decoding_timelines.pl
*************** use warnings;
*** 24,30 ****
  use PostgresNode;
  use TestLib;
  use Test::More tests => 13;
- use RecursiveCopy;
  use File::Copy;
  use IPC::Run ();
  use Scalar::Util qw(blessed);
--- 24,29 ----
On Mon, Sep 11, 2017 at 11:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm, I don't much like having it silently ignore files that are present;
> that seems like a foot-gun in the long run. What do you think of the
> attached?

That looks good to me. I have tried pretty hard to break it, but could not.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> On Mon, Sep 11, 2017 at 11:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm, I don't much like having it silently ignore files that are present;
>> that seems like a foot-gun in the long run. What do you think of the
>> attached?

> That looks good to me. I have tried pretty hard to break it, but could not.

Pushed, thanks for testing!

			regards, tom lane
On Fri, Sep 08, 2017 at 10:32:03PM -0400, Tom Lane wrote:
> (I do kinda wonder why we rolled our own RecursiveCopy; surely there's
> a better implementation in CPAN?)

Fewer people will test as we grow the list of modules they must first
install. Bundling a copy is tempting, but most CPAN modules use the
Perl license. We're in that hole already[1], but I perceived a consensus
to stop digging. Code used only for testing is less concerning, but I'd
still not bundle it.

[1] https://www.postgresql.org/message-id/flat/54EC6D80.5090507%40vmware.com
On 6 October 2017 at 14:03, Noah Misch <noah@leadboat.com> wrote:
> On Fri, Sep 08, 2017 at 10:32:03PM -0400, Tom Lane wrote:
>> (I do kinda wonder why we rolled our own RecursiveCopy; surely there's
>> a better implementation in CPAN?)
>
> Fewer people will test as we grow the list of modules they must first install.

Meh, I don't buy that. At worst, all we have to do is provide a script
that fetches them, from distro repos if possible, and failing that from
CPAN.

With cpanminus, that's pretty darn simple too.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Oct 06, 2017 at 05:57:24PM +0800, Craig Ringer wrote:
> On 6 October 2017 at 14:03, Noah Misch <noah@leadboat.com> wrote:
> > On Fri, Sep 08, 2017 at 10:32:03PM -0400, Tom Lane wrote:
> >> (I do kinda wonder why we rolled our own RecursiveCopy; surely there's
> >> a better implementation in CPAN?)
> >
> > Fewer people will test as we grow the list of modules they must first install.
>
> Meh, I don't buy that. At worst, all we have to do is provide a script
> that fetches them, from distro repos if possible, and failing that
> from CPAN.
>
> With cpanminus, that's pretty darn simple too.

If the tree had such a script and it were reliable, then yes, it would
matter little whether the script procured one module or five.
On 10/13/2017 01:04 AM, Noah Misch wrote:
> On Fri, Oct 06, 2017 at 05:57:24PM +0800, Craig Ringer wrote:
>> On 6 October 2017 at 14:03, Noah Misch <noah@leadboat.com> wrote:
>>> On Fri, Sep 08, 2017 at 10:32:03PM -0400, Tom Lane wrote:
>>>> (I do kinda wonder why we rolled our own RecursiveCopy; surely there's
>>>> a better implementation in CPAN?)
>>> Fewer people will test as we grow the list of modules they must first install.
>> Meh, I don't buy that. At worst, all we have to do is provide a script
>> that fetches them, from distro repos if possible, and failing that
>> from CPAN.
>>
>> With cpanminus, that's pretty darn simple too.
> If the tree had such a script and it were reliable, then yes, it would
> matter little whether the script procured one module or five.

Not everyone has cpanminus installed either. My approach in the
buildfarm code is to lean over backwards in order to avoid non-standard
modules. For the problem at hand we use cp/xcopy, but the tree being
copied is stable, so we don't run into the disappearing/changing file
problem.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Oct 6, 2017 at 5:57 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> Fewer people will test as we grow the list of modules they must first install.
>
> Meh, I don't buy that.

Well, I do. Prerequisites are a pain, and the more of them there are,
the more pain it is.

> At worst, all we have to do is provide a script
> that fetches them, from distro repos if possible, and failing that
> from CPAN.
>
> With cpanminus, that's pretty darn simple too.

I think the idea that this is going to work for everybody is not very
realistic. I am not going to run some random script to install stuff on
my system in whatever manner it feels like, because when I do stuff like
that I usually end up regretting it. If it's a throw-away VM, sure, but
not otherwise.

We don't even have good documentation of what packages you should
install on various packaging systems to build the server, let alone all
of the additional things that may be required depending on build options
and TAP tests you might want to run.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 17, 2017 at 5:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 5:57 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> Fewer people will test as we grow the list of modules they must first install.
>> At worst, all we have to do is provide a script
>> that fetches them, from distro repos if possible, and failing that
>> from CPAN.
>>
>> With cpanminus, that's pretty darn simple too.
>
> I think the idea that this is going to work for everybody is not very
> realistic. I am not going to run some random script to install stuff
> on my system in whatever manner it feels like, because when I do stuff
> like that I usually end up regretting it. If it's a throw-away VM,
> sure, but not otherwise.

Yeah, having something network-dependent in order to run the full set of
regression tests is not a good idea. I have seen users, at least in
Japan, deploying Postgres on hosts that have no external connection,
just a base OS image built with all packages present. Data center rules
can be very strict.

> We don't even have good documentation of what packages you should
> install on various packaging systems to build the server, let alone
> all of the additional things that may be required depending on build
> options and TAP tests you might want to run.

On top of that, we take the time to improve the modules we already have
in the code tree. Let's not forget that ;)
--
Michael