Thread: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFS destination throws fsync error at end of backup
BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFS destination throws fsync error at end of backup
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 15636 Logged by: John Klann Email address: jk7255@gmail.com PostgreSQL version: 11.1 Operating system: Red Hat Enterprise Linux Server release 7.5 Description: Issue: - PostgreSQL 11.1 pg_basebackup and pg_dump parallel database backup to a CIFS destination throws fsync error at the very end of the backup. - Command: pg_basebackup -D /cifs/backups/<backupDirectoryName> -U backupuser -Ft -Z 1 -X fetch -p 5432 - Error: pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>": Invalid argument Details: We are preparing to move to PostgreSQL 11.1 from 9.3.x, this move will be a complete rebuild on new hardware. After setting up the new hardware installing/configuring Rhel 7.5 (updated), installing/configuring PostgreSQL 11.1 I tested our migration process, parallel dump --> parellel restoring databases worked without issue. I started testing backups and that is when I came across the fsync error. I have seen references that PostgreSQL does not support storing the data directory on CIFS due to similar issues. Although I have not found any reference to backing up to CIFS not being supported. I am able to fully restore and recovery from these backups no issue and based off the research I have done I would suspect some sort of issue of cifs not supporting the fsync call on the containing directory level. I put examples below of all of the testing I have performed that has also lead me to this conclusion. Environment: - Server ○ Model: Dell R740 ○ RAM: 768 GB RDIMM 2666MT/s ○ Processor: Intel Xeon Gold 6146 3.2G 24.75MB § 2 Nodes, 12 cores each, HT = total cores of 48 ○ Storage: § OS: local ssd in raid § pgdata, pgwal, pglog: each on their own dedicated EMC XIO luns attached via 16 8Gbps paths (2x QLogic 2562, Dual Port 8Gb Optical Fibre Channel HBAs) □ XIO is XtremIO V1 in two brick clustered configuration ○ OS: § uname -a: □ Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Fri Sep 21 09:07:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux § cat /etc/redhat-release: □ Red Hat Enterprise Linux Server release 7.5 (Maipo) ○ PostgreSQL § PostgreSQL 11.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit § Custom Configs: [postgres@servername data1]$ cat postgresql.auto.conf # Do not edit this file manually! # It will be overwritten by the ALTER SYSTEM command. port = '5432' listen_addresses = '0.0.0.0' work_mem = '8MB' maintenance_work_mem = '1GB' random_page_cost = '1.0' track_functions = 'all' wal_buffers = '-1' checkpoint_timeout = '10min' checkpoint_completion_target = '0.9' checkpoint_warning = '30s' log_destination = 'csvlog' log_directory = '/dbalog/data1' log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' log_min_messages = 'error' log_min_error_statement = 'error' log_line_prefix = '[%u] [%d] [%h] [%m] [%p]:>' log_rotation_size = '10MB' log_statement = 'ddl' shared_buffers = '8GB' max_connections = '500' effective_cache_size = '589824 MB' wal_level = 'replica' max_wal_senders = '2' archive_mode = 'on' archive_command = 'test ! -f /pgxlog1/data1/%f || cp /pgxlog1/data1/%f /cifs/backups/<backupDirectoryName>/dmp/archive/%f' log_disconnections = 'on' standard_conforming_strings = 'off' § Databases □ 4 dbs ® 3 - very small < 3 GB total ® 1 - 964.67 GB □ Load: OLTP, DW mix - CIFS (Backup destination) ○ Windows 2012 R2 (last patched September/2018) ○ VNX Block Storage lun configured for CIFS using NTFS Testing/Reproduction: - T1: ○ basebackup to windows cifs § Command: □ pg_basebackup -D /cifs/backups/<backupDirectoryName> -U backupuser -Ft -Z 1 -X fetch -p 5432 § Error: pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>": Invalid argument - T2: ○ Same backup as T1 with much less data - same result - T3: ○ Same as T2 with -N (--no-sync) option - no error (fairly obvious why) - T4 ○ basebackup same dataset as T2, no tar, no compression going to windows cifs § Command: □ pg_basebackup -D /cifs/backups/dbadb1linbos/5432/dmp/basebkp -U backupuser -X fetch -p 5432 § Same error seems to happen on all directories: pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/base/1": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/base/13877": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/base/13878": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/base/16397": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/base/16660": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/base": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/global": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/log": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_commit_ts": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_dynshmem": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_logical/mappings": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_logical/snapshots": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_logical": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_multixact/members": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_multixact/offsets": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_multixact": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_notify": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_replslot": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_serial": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_snapshots": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_stat": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_stat_tmp": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_subtrans": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_tblspc": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_twophase": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_wal/archive_status": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_wal": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_xact": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp": Invalid argument pg_basebackup: could not fsync file "/cifs/backups/<backupDirectoryName>/basebkp/pg_tblspc": Invalid argument - T5 ○ pg_dump single threaded backup of a database to windows cifs - no error - T6 ○ pg_dump multithreaded backup of a database to windows cifs - same fsync error on containing directory § Command: pg_dump -p 5432 -j 16 -f /cifs/backups/<backupDirectoryName>/dmp/DBA_02132019_133120 -U backupuser -Fd -d DBA § Error: □ pg_dump: could not fsync file "/cifs/backups/<backupDirectoryName>/dmp/DBA_02132019_133120": Invalid argument - T7 ○ Same as T6 but with --no-sync option - no error - T8 ○ Same as T1 but to local storage (XIO SAN attached Lun ext4) § No error - T9 ○ Same as T1 but to linux CIFS share with XIO SAN attached lun ext4 (same version of linux) § Same Fsync error - T10 ○ Same as T1 but to linux NFS share with XIO SAN attached lun ext4 (same version of linux) § No Error Questions: - Is backing up to CIFS supported? - Based on research and other reported issues that the error may mean that cifs handles this call differently or already performs this action itself, should this error bring the integrity of the backup into question? ○ In what scenarios would it bring the integrity into question? ○ Issue reference, see bug #6372 thread: https://www.postgresql.org/message-id/1149.1325535272%40sss.pgh.pa.us - Is there a workaround or configuration that we could use that maintains using fsync to our current windows CIFS configuration? - Is there a better way to check integrity of backup rather than restoring and performing a dump backup? - Recommended steps forward?
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Thomas Munro
Date:
On Fri, Feb 15, 2019 at 10:15 AM PG Bug reporting form <noreply@postgresql.org> wrote: > pg_basebackup: could not fsync file > "/cifs/backups/<backupDirectoryName>/basebkp/base/1": Invalid argument > pg_basebackup: could not fsync file Hmm, it looks like your system gives EINVAL when you try to fsync a directory. Perhaps we should teach fsync__fname() about that here: /* * Some OSes don't allow us to fsync directories at all, so we can ignore * those errors. Anything else needs to be reported. */ if (returncode != 0 && !(isdir && errno == EBADF)) { fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"), progname, fname, strerror(errno)); (void) close(fd); return -1; } EINVAL actually makes more sense to me than EBADF for a filesystem that can't fsync directories. From POSIX: EINVAL = "The fildes argument does not refer to a file on which this operation is possible." vs EBADF "The fildes argument is not a valid descriptor." It *is* a valid descriptor, it's just not a valid operation (apparently). Quick googling on the topic tells me that CIFS directory operations are "synchronous", so fsync'ing isn't necessary. However, they only made it silently do nothing in a recent version: https://github.com/torvalds/linux/commit/6e70c267e68d77679534dcf4aaf84e66f2cf1425 Presumably before that you get EINVAL because there is no handler registered. The commit message even mentions that this was breaking stuff like us. -- Thomas Munro http://www.enterprisedb.com
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
John Klann
Date:
+adding back to thread
On Thu, Feb 14, 2019 at 5:49 PM John Klann <jk7255@gmail.com> wrote:
Thanks Thomas that certainly makes sense considering the commit comment as well: "Directory operations under CIFS/SMB2/SMB3 are synchronous, so fsync()".I think will probably not use the --no-sync option then and handle for the EINVAL messages.On Thu, Feb 14, 2019 at 5:31 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:Hi John,
Unfortunately that applies only to directories, not to regular files
inside them.
--no-sync should get you past the problem as a workaround for now, but
then if your server(s) crash/lose power in the following seconds the
data might not be on disk. Fsync just means that the program doesn't
return until the data is flushed, but within some short period of time
the data will be flushed anyway, and power failure in the short window
before that is really unlikely, it's more a theoretical issue that we
database hackers like to worry about.
On Fri, Feb 15, 2019 at 11:22 AM John Klann <jk7255@gmail.com> wrote:
>
> Ah thank you this all makes sense.
>
> If CIFS is synchronous and and fsync'ing is not necessary then running with the --no-sync option should be safe and possibly more performant correct?
>
> John
>
> On Thu, Feb 14, 2019 at 5:06 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>
>> On Fri, Feb 15, 2019 at 10:15 AM PG Bug reporting form
>> <noreply@postgresql.org> wrote:
>> > pg_basebackup: could not fsync file
>> > "/cifs/backups/<backupDirectoryName>/basebkp/base/1": Invalid argument
>> > pg_basebackup: could not fsync file
>>
>> Hmm, it looks like your system gives EINVAL when you try to fsync a
>> directory. Perhaps we should teach fsync__fname() about that here:
>>
>> /*
>> * Some OSes don't allow us to fsync directories at all, so we
>> can ignore
>> * those errors. Anything else needs to be reported.
>> */
>> if (returncode != 0 && !(isdir && errno == EBADF))
>> {
>> fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
>> progname, fname, strerror(errno));
>> (void) close(fd);
>> return -1;
>> }
>>
>> EINVAL actually makes more sense to me than EBADF for a filesystem
>> that can't fsync directories. From POSIX: EINVAL = "The fildes
>> argument does not refer to a file on which this operation is
>> possible." vs EBADF "The fildes argument is not a valid descriptor."
>> It *is* a valid descriptor, it's just not a valid operation
>> (apparently).
>>
>> Quick googling on the topic tells me that CIFS directory operations
>> are "synchronous", so fsync'ing isn't necessary. However, they only
>> made it silently do nothing in a recent version:
>>
>> https://github.com/torvalds/linux/commit/6e70c267e68d77679534dcf4aaf84e66f2cf1425
>>
>> Presumably before that you get EINVAL because there is no handler
>> registered. The commit message even mentions that this was breaking
>> stuff like us.
>>
>> --
>> Thomas Munro
>> http://www.enterprisedb.com
--
Thomas Munro
http://www.enterprisedb.com
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Thomas Munro
Date:
On Fri, Feb 15, 2019 at 11:53 AM John Klann <jk7255@gmail.com> wrote: >>> >> Hmm, it looks like your system gives EINVAL when you try to fsync a >>> >> directory. Perhaps we should teach fsync__fname() about that here: >>> >> >>> >> /* >>> >> * Some OSes don't allow us to fsync directories at all, so we >>> >> can ignore >>> >> * those errors. Anything else needs to be reported. >>> >> */ >>> >> if (returncode != 0 && !(isdir && errno == EBADF)) >>> >> { >>> >> fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"), >>> >> progname, fname, strerror(errno)); >>> >> (void) close(fd); >>> >> return -1; >>> >> } Here is a patch like that. Any chance you could test it? Happy to send instructions on how to do that if necessary. Does anyone think it'd be a bad idea to do this, and possibly even back-patch it? We have tolerated EBADF since 7d7db18a (which OS is that for?) -- Thomas Munro http://www.enterprisedb.com
Attachment
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
John Klann
Date:
Happy to test it, I will need the instructions if you can. It's for Rhel 7.5 Maipo latest. (Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Fri Sep 21 09:07:21 UTC 2018
x86_64 x86_64 x86_64 GNU/Linux)
Thanks.
On Thu, Feb 14, 2019, 23:11 Thomas Munro <thomas.munro@enterprisedb.com wrote:
On Fri, Feb 15, 2019 at 11:53 AM John Klann <jk7255@gmail.com> wrote:
>>> >> Hmm, it looks like your system gives EINVAL when you try to fsync a
>>> >> directory. Perhaps we should teach fsync__fname() about that here:
>>> >>
>>> >> /*
>>> >> * Some OSes don't allow us to fsync directories at all, so we
>>> >> can ignore
>>> >> * those errors. Anything else needs to be reported.
>>> >> */
>>> >> if (returncode != 0 && !(isdir && errno == EBADF))
>>> >> {
>>> >> fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
>>> >> progname, fname, strerror(errno));
>>> >> (void) close(fd);
>>> >> return -1;
>>> >> }
Here is a patch like that. Any chance you could test it? Happy to
send instructions on how to do that if necessary. Does anyone think
it'd be a bad idea to do this, and possibly even back-patch it? We
have tolerated EBADF since 7d7db18a (which OS is that for?)
--
Thomas Munro
http://www.enterprisedb.com
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Thomas Munro
Date:
On Fri, Feb 15, 2019 at 5:27 PM John Klann <jk7255@gmail.com> wrote: > Happy to test it, I will need the instructions if you can. It's for Rhel 7.5 Maipo latest. (Linux 3.10.0-862.14.4.el7.x86_64#1 SMP Fri Sep 21 09:07:21 UTC 2018 > x86_64 x86_64 x86_64 GNU/Linux) Thanks! Use yum to install gcc, make, bison, yacc, git, readline-devel, zlib-devel. (Hope I'm not forgetting anything here...) Pull down the sources: $ git clone https://github.com/postgres/postgres.git Go into the source directory: $ cd postgres Let's say we want to test with the 11 branch (so you can test against an existing 11.x development environment database you might have installed the usual way with packages, as opposed to the master branch where we are developing 12). Switch to that branch: $ git checkout REL_11_STABLE Create a new branch for this experiment: $ git checkout -b tolerate-einval Apply the patch (adjust path to wherever you put the patch): $ git am 0001-Tolerate-EINVAL-when-calling-fsync-on-a-directory.patch Configure, build, install (the prefix is where it will install the resulting binaries, adjust to taste, -j is how many CPUs you have if you want to build fast): $ mkdir $HOME/install $ ./configure --prefix=$HOME/install $ make -s -j4 && make -s install Optionally, run a self-test to see if everything is sane: $ make -s check Now you should have binaries like $HOME/install/bin/pg_basebackup and you can test if they work on CIFS. -- Thomas Munro http://www.enterprisedb.com
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
John Klann
Date:
Great thank you I will let you know how is goes.
John
On Thu, Feb 14, 2019, 23:47 Thomas Munro <thomas.munro@enterprisedb.com wrote:
On Fri, Feb 15, 2019 at 5:27 PM John Klann <jk7255@gmail.com> wrote:
> Happy to test it, I will need the instructions if you can. It's for Rhel 7.5 Maipo latest. (Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Fri Sep 21 09:07:21 UTC 2018
> x86_64 x86_64 x86_64 GNU/Linux)
Thanks! Use yum to install gcc, make, bison, yacc, git,
readline-devel, zlib-devel. (Hope I'm not forgetting anything
here...)
Pull down the sources:
$ git clone https://github.com/postgres/postgres.git
Go into the source directory:
$ cd postgres
Let's say we want to test with the 11 branch (so you can test against
an existing 11.x development environment database you might have
installed the usual way with packages, as opposed to the master branch
where we are developing 12). Switch to that branch:
$ git checkout REL_11_STABLE
Create a new branch for this experiment:
$ git checkout -b tolerate-einval
Apply the patch (adjust path to wherever you put the patch):
$ git am 0001-Tolerate-EINVAL-when-calling-fsync-on-a-directory.patch
Configure, build, install (the prefix is where it will install the
resulting binaries, adjust to taste, -j is how many CPUs you have if
you want to build fast):
$ mkdir $HOME/install
$ ./configure --prefix=$HOME/install
$ make -s -j4 && make -s install
Optionally, run a self-test to see if everything is sane:
$ make -s check
Now you should have binaries like $HOME/install/bin/pg_basebackup and
you can test if they work on CIFS.
--
Thomas Munro
http://www.enterprisedb.com
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
John Klann
Date:
Hi Thomas,
I was able to test the patch without issue and it appears to have worked see below. Let me know if there is further testing I can do or logging you would like.
existing bin:
patched bin:[postgres@server]$ /usr/pgsql-11/bin/pg_basebackup -D /cifs/backups/99testNoPatch -U backupuser -Ft -Z 1 -X fetch -p 5433pg_basebackup: could not fsync file "/cifs/backups/99testNoPatch": Invalid argument
I also restored the from the patched backup no issues:[postgres@server]$ /home/postgres/patchinstall/bin/pg_basebackup -D /cifs/backups/99testWithPatch -U backupuser -Ft -Z 1 -X fetch -p 5433[postgres@server]$
2019-02-15 10:29:12.426 EST, LOG, "ending log output to stderr",,"Future log output will go to log destination ""csvlog""."2019-02-15 10:29:12.431 EST, LOG, "database system was interrupted; last known up at 2019-02-15 10:18:00 EST"2019-02-15 10:29:12.605 EST, LOG, "starting point-in-time recovery to 2019-02-15 10:29:09-05"2019-02-15 10:29:12.796 EST, LOG, "restored log file ""000000010000000000000008"" from archive"2019-02-15 10:29:12.845 EST, LOG, "redo starts at 0/8000028"2019-02-15 10:29:12.848 EST, LOG, "consistent recovery state reached at 0/80000F8"2019-02-15 10:29:12.849 EST, LOG, "database system is ready to accept read only connections"2019-02-15 10:29:13.013 EST, LOG, "restored log file ""000000010000000000000009"" from archive"2019-02-15 10:29:13.039 EST, LOG, "redo done at 0/9000140"2019-02-15 10:29:13.135 EST, LOG, "restored log file ""000000010000000000000009"" from archive"2019-02-15 10:29:13.188 EST, LOG, "selected new timeline ID: 3"2019-02-15 10:29:13.228 EST, LOG, "archive recovery complete"2019-02-15 10:29:13.363 EST, LOG, "database system is ready to accept connections"
Thanks,
John
On Fri, Feb 15, 2019 at 12:53 AM John Klann <jk7255@gmail.com> wrote:
Great thank you I will let you know how is goes.JohnOn Thu, Feb 14, 2019, 23:47 Thomas Munro <thomas.munro@enterprisedb.com wrote:On Fri, Feb 15, 2019 at 5:27 PM John Klann <jk7255@gmail.com> wrote:
> Happy to test it, I will need the instructions if you can. It's for Rhel 7.5 Maipo latest. (Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Fri Sep 21 09:07:21 UTC 2018
> x86_64 x86_64 x86_64 GNU/Linux)
Thanks! Use yum to install gcc, make, bison, yacc, git,
readline-devel, zlib-devel. (Hope I'm not forgetting anything
here...)
Pull down the sources:
$ git clone https://github.com/postgres/postgres.git
Go into the source directory:
$ cd postgres
Let's say we want to test with the 11 branch (so you can test against
an existing 11.x development environment database you might have
installed the usual way with packages, as opposed to the master branch
where we are developing 12). Switch to that branch:
$ git checkout REL_11_STABLE
Create a new branch for this experiment:
$ git checkout -b tolerate-einval
Apply the patch (adjust path to wherever you put the patch):
$ git am 0001-Tolerate-EINVAL-when-calling-fsync-on-a-directory.patch
Configure, build, install (the prefix is where it will install the
resulting binaries, adjust to taste, -j is how many CPUs you have if
you want to build fast):
$ mkdir $HOME/install
$ ./configure --prefix=$HOME/install
$ make -s -j4 && make -s install
Optionally, run a self-test to see if everything is sane:
$ make -s check
Now you should have binaries like $HOME/install/bin/pg_basebackup and
you can test if they work on CIFS.
--
Thomas Munro
http://www.enterprisedb.com
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Thomas Munro
Date:
On Sat, Feb 16, 2019 at 4:39 AM John Klann <jk7255@gmail.com> wrote: > I was able to test the patch without issue and it appears to have worked see below. Let me know if there is further testingI can do or logging you would like. Thanks. So that leaves the question for the list: should we accept this as a bug and consider back-patching it? Our relationship with fsync() is ... complicated at the moment. Considering that we already tolerated EBADF for directories on some other (forgotten?) operating system, I think I would vote +1 for doing that ... if I didn't know that they'd fixed this on the Linux side, and I see now that they've back-patched that even to long term kernel 3.16: https://cdn.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.16.60 You mentioned that you were on RHEL 7.5, which ships with a 3.10 kernel. When you eventually move to RHEL 8 (4.x) or even install an alternative newer kernel for 7.x, the problem will hopefully go away by itself. Considering that, the lack of other complaints, and the availability of a workaround (--no-sync followed by "sync"), I'm not so sure it's worth committing this. PS I found an earlier discussion of this topic that didn't go anywhere: https://www.postgresql.org/message-id/flat/E1RhkjA-0005bq-B6%40wrigleys.postgresql.org -- Thomas Munro http://www.enterprisedb.com
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Michael Paquier
Date:
On Mon, Feb 18, 2019 at 02:34:15PM +1300, Thomas Munro wrote: > You mentioned that you were on RHEL 7.5, which ships with a 3.10 > kernel. When you eventually move to RHEL 8 (4.x) or even install an > alternative newer kernel for 7.x, the problem will hopefully go away > by itself. Considering that, the lack of other complaints, and the > availability of a workaround (--no-sync followed by "sync"), I'm not > so sure it's worth committing this. If we get to ignore EINVAL, then we won't know that fsync has failed even in cases where the caller has done an incorrect thing by using a special file, which sounds like a bad idea for the core code as much as any plugins calling that? -- Michael
Attachment
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Thomas Munro
Date:
On Mon, Feb 18, 2019 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote: > On Mon, Feb 18, 2019 at 02:34:15PM +1300, Thomas Munro wrote: > > ... Considering that, the lack of other complaints, and the > > availability of a workaround (--no-sync followed by "sync"), I'm not > > so sure it's worth committing this. > > If we get to ignore EINVAL, then we won't know that fsync has failed > even in cases where the caller has done an incorrect thing by using a > special file, which sounds like a bad idea for the core code as much > as any plugins calling that? Well the theory is that it's OK because it's only for directories (the same case in which case we already tolerate EBADF (for an OS that nobody documented so we don't really know if/when we can take it out)). But I agree with you that the patch does not spark joy. -- Thomas Munro http://www.enterprisedb.com
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Michael Paquier
Date:
On Mon, Feb 18, 2019 at 04:10:58PM +1300, Thomas Munro wrote: > Well the theory is that it's OK because it's only for directories (the > same case in which case we already tolerate EBADF (for an OS that > nobody documented so we don't really know if/when we can take it > out)). But I agree with you that the patch does not spark joy. If I would vote for this patch, that would be -0.1, meaning that I am not strongly opposed to it, but I don't see any strong point either to do something about it as there are downsides. When in doubt about a back-patch, I use as basic rule to never back-patch so as we are sure that users avoid any surprises and keep relying on the existing design. -- Michael
Attachment
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Stephen Frost
Date:
Greetings, * Thomas Munro (thomas.munro@enterprisedb.com) wrote: > On Sat, Feb 16, 2019 at 4:39 AM John Klann <jk7255@gmail.com> wrote: > > I was able to test the patch without issue and it appears to have worked see below. Let me know if there is further testingI can do or logging you would like. > > Thanks. So that leaves the question for the list: should we accept > this as a bug and consider back-patching it? Our relationship with > fsync() is ... complicated at the moment. Yes, we should accept this as a bug and back-patch it, imv. > Considering that we already > tolerated EBADF for directories on some other (forgotten?) operating > system, I think I would vote +1 for doing that ... if I didn't know > that they'd fixed this on the Linux side, and I see now that they've > back-patched that even to long term kernel 3.16: > > https://cdn.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.16.60 Which seems like it's just another good reason why we should also back-patch it. > PS I found an earlier discussion of this topic that didn't go > anywhere: https://www.postgresql.org/message-id/flat/E1RhkjA-0005bq-B6%40wrigleys.postgresql.org As does this. I would be much more concerned about a change like this if we either didn't know it was a directory that we got fsync() EINVAL for. Since we know it's a directory and we know there's filesystems out there which do synchronous metadata changes and we know they return EINVAL in those cases, I'm inclined to say we should allow it. In the above thread, Tom mentioned the standard, which pretty clearly distinguishes between "this thing you gave to fsync() doesn't support synchronization" and "an I/O error occured" and it's really the latter that we care about here. Further, saying "just use --no-sync" is *not* a workaround. That's throwing the baby out with the bathwater. A workaround would be an option like "--do-not-complain-about-directory-fsync-einval", but suggesting that all fsync's be removed is terrible. Thanks! Stephen
Attachment
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Stephen Frost
Date:
Greetings, * PG Bug reporting form (noreply@postgresql.org) wrote: > archive_mode = 'on' > archive_command = 'test ! -f /pgxlog1/data1/%f || cp /pgxlog1/data1/%f > /cifs/backups/<backupDirectoryName>/dmp/archive/%f' I realize you were complaining about pg_basebackup here, but since you shared your config, I wanted to point out that the above is an absolutely *terrible* archive_command, *especially* when you've got a network filesystem involved. There's no guarantee with that archive command that the WAL makes it to the backup server or that it's completely written out on the backup server before the PG server removes the WAL- if something happened (a network hiccup, the CIFS server being rebooted, or the PG server restarting) there's a real possibility that you'd lose some portion of your WAL, which could make a backup invalid, or could make it so you can't do PITR. Please don't roll your own backup solution like this. Thanks! Stephen
Attachment
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
John Klann
Date:
Thanks Stephen for pointing this out.
John
On Mon, Feb 18, 2019 at 11:26 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* PG Bug reporting form (noreply@postgresql.org) wrote:
> archive_mode = 'on'
> archive_command = 'test ! -f /pgxlog1/data1/%f || cp /pgxlog1/data1/%f
> /cifs/backups/<backupDirectoryName>/dmp/archive/%f'
I realize you were complaining about pg_basebackup here, but since you
shared your config, I wanted to point out that the above is an
absolutely *terrible* archive_command, *especially* when you've got a
network filesystem involved.
There's no guarantee with that archive command that the WAL makes it to
the backup server or that it's completely written out on the backup
server before the PG server removes the WAL- if something happened (a
network hiccup, the CIFS server being rebooted, or the PG server
restarting) there's a real possibility that you'd lose some portion of
your WAL, which could make a backup invalid, or could make it so you
can't do PITR.
Please don't roll your own backup solution like this.
Thanks!
Stephen
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Thomas Munro
Date:
Hi, So, for this patch tolerating EINVAL for fsync of directories (but not files), enabling eg pg_backbackup on CIFS on eg Linux 3.10 to work, tested by John, we have: Stephen: +1 Michael: -0.1 Thomas: -0.1 Any other opinions? Just in case you think it's strange that I'm voting against my own patch: I'd probably be for it if I hadn't discovered that they've fixed this in Linux 3.16 and later so that it succeeds. It's apparently not in the default RHEL 6 and 7 kernels, though, and the latter could be around for a while. I'm not entirely sure what amount of work we should be doing to tolerate problems that are fixed in a newer versions. One argument is that a 3.10 user who cares about this should petition RH to back-port the fix into that kernel. (The nearby WSL thread has some things in common but is a more clear cut case IMV because there is no fix available on the WSL side.) -- Thomas Munro https://enterprisedb.com
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Andres Freund
Date:
Hi, On 2019-02-23 11:20:02 +1300, Thomas Munro wrote: > So, for this patch tolerating EINVAL for fsync of directories (but not > files), enabling eg pg_backbackup on CIFS on eg Linux 3.10 to work, > tested by John, we have: > > Stephen: +1 > Michael: -0.1 > Thomas: -0.1 > > Any other opinions? > > Just in case you think it's strange that I'm voting against my own > patch: I'd probably be for it if I hadn't discovered that they've > fixed this in Linux 3.16 and later so that it succeeds. It's > apparently not in the default RHEL 6 and 7 kernels, though, and the > latter could be around for a while. I'm not entirely sure what amount > of work we should be doing to tolerate problems that are fixed in a > newer versions. One argument is that a 3.10 user who cares about this > should petition RH to back-port the fix into that kernel. (The nearby > WSL thread has some things in common but is a more clear cut case IMV > because there is no fix available on the WSL side.) Hm, given the fact that EINVAL doesn't appear to be triggerable by data level issues, I don't see much reason not to allow it. I mean it's really stupid that cifs ever returns it, but there's going to be a lot of users running things on older kernel for the forseeable future. Greetings, Andres Freund
Re: BUG #15636: PostgreSQL 11.1 pg_basebackup backup to a CIFSdestination throws fsync error at end of backup
From
Thomas Munro
Date:
On Sat, Feb 23, 2019 at 11:28 AM Andres Freund <andres@anarazel.de> wrote: > On 2019-02-23 11:20:02 +1300, Thomas Munro wrote: > > So, for this patch tolerating EINVAL for fsync of directories (but not > > files), enabling eg pg_backbackup on CIFS on eg Linux 3.10 to work, > > tested by John, we have: > > > > Stephen: +1 > > Michael: -0.1 > > Thomas: -0.1 > > > > Any other opinions? > > > > Just in case you think it's strange that I'm voting against my own > > patch: I'd probably be for it if I hadn't discovered that they've > > fixed this in Linux 3.16 and later so that it succeeds. It's > > apparently not in the default RHEL 6 and 7 kernels, though, and the > > latter could be around for a while. I'm not entirely sure what amount > > of work we should be doing to tolerate problems that are fixed in a > > newer versions. One argument is that a 3.10 user who cares about this > > should petition RH to back-port the fix into that kernel. (The nearby > > WSL thread has some things in common but is a more clear cut case IMV > > because there is no fix available on the WSL side.) > > Hm, given the fact that EINVAL doesn't appear to be triggerable by data > level issues, I don't see much reason not to allow it. I mean it's > really stupid that cifs ever returns it, but there's going to be a lot > of users running things on older kernel for the forseeable future. Thanks. Pushed. FTR here is the discussion about the pre-existing tolerance of EBADF for directories: https://www.postgresql.org/message-id/flat/20100215005057.5AC8F7541C5%40cvs.postgresql.org -- Thomas Munro https://enterprisedb.com