Thread: fd.c: flush data problems on osx
Hi Current implementation of pg_flush_data when called with zero offset and zero nbytes is assumed to flush all file. In osxit uses mmap with these arguments, but according to man: "[EINVAL] The len argument was negative or zero. Historically, the system call would not return an error if the argument was zero. See other potential additional restrictions in the COMPAT- IBILITY section below." so it is generate a lot of warnings: "WARNING: could not mmap while flushing dirty data: Invalid argument" Call to pg_flush_data with zero offset and nbytes happens when replica starts from base backup and fsync=on. TAP test toreproduce is attached. One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0. Also there are no default ifdef inside this function, is there any check that will guarantee that pg_flush_data will notend up with empty body on some platform? --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
Hi, On 2016-03-17 20:09:53 +0300, Stas Kelvich wrote: > Current implementation of pg_flush_data when called with zero offset and zero nbytes is assumed to flush all file. In osxit uses mmap with these arguments, but according to man: > > "[EINVAL] The len argument was negative or zero. Historically, the system call would not return an > error if the argument was zero. See other potential additional restrictions in the COMPAT- > IBILITY section below." > > so it is generate a lot of warnings: > > "WARNING: could not mmap while flushing dirty data: Invalid argument" Hm, yea, that's buggy. > One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0. Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get the file size. Could you test that? > Also there are no default ifdef inside this function, is there any > check that will guarantee that pg_flush_data will not end up with > empty body on some platform? There doesn't need to be - it's purely "advisory", i.e. just an optimization. Greetings, Andres Freund
> On 17 Mar 2016, at 20:23, Andres Freund <andres@anarazel.de> wrote: > >> >> Also there are no default ifdef inside this function, is there any >> check that will guarantee that pg_flush_data will not end up with >> empty body on some platform? > > There doesn't need to be - it's purely "advisory", i.e. just an > optimization. Ah, okay, then I misinterpret purpose of that function and it shouldn’t be forced to sync. > >> One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0. > > Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get > the file size. Could you test that? > It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize while manual says it can be of arbitrary length,so i aligned it. Also there were call to mmap with PROT_READ | PROT_WRITE, but when called from pre_sync_fname file descriptor is just O_RDONLY,so i changed mmap mode to PROT_READ — seems that PROT_WRITE wasn’t needed anyway. And all of that reduces number of warnings in order of magnitude but there are still some and I don’t yet understand whyare they happening. > > Greetings, > > Andres Freund > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
> On 18 Mar 2016, at 14:45, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >> >>> One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0. >> >> Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get >> the file size. Could you test that? >> > > It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize while manual says it can be of arbitrary length,so i aligned it. > Also there were call to mmap with PROT_READ | PROT_WRITE, but when called from pre_sync_fname file descriptor is just O_RDONLY,so i changed mmap mode to PROT_READ — seems that PROT_WRITE wasn’t needed anyway. > > And all of that reduces number of warnings in order of magnitude but there are still some and I don’t yet understand whyare they happening. I’ve spend some more time on this issue and found that remaining warnings were caused by mmap-ing directories — that raisesEINVAL in OSX (probably not only OSX, but I didn’t tried). So i’ve skipped mmap for dirs and now restore happens without warnings. Also I’ve fixed wrong error check that was in previousversion of patch. --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On 2016-03-21 14:46:09 +0300, Stas Kelvich wrote: > > > On 18 Mar 2016, at 14:45, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > >> > >>> One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0. > >> > >> Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get > >> the file size. Could you test that? > >> > > > > It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize while manual says it can be of arbitrary length,so i aligned it. > > Also there were call to mmap with PROT_READ | PROT_WRITE, but when called from pre_sync_fname file descriptor is justO_RDONLY, so i changed mmap mode to PROT_READ — seems that PROT_WRITE wasn’t needed anyway. > > > > And all of that reduces number of warnings in order of magnitude but there are still some and I don’t yet understandwhy are they happening. > > I’ve spend some more time on this issue and found that remaining > warnings were caused by mmap-ing directories — that raises EINVAL in > OSX (probably not only OSX, but I didn’t tried). > So i’ve skipped mmap for dirs and now restore happens without > warnings. Also I’ve fixed wrong error check that was in previous > version of patch. Hm. I think we should rather just skip calling pg_flush_data in the directory case, that very likely isn't beneficial on any OS. I think we still need to fix the mmap() implementation to support the offset = 0, nbytes = 0 case (via fseek(SEEK_END). Greetings, Andres Freund
On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote: > Hm. I think we should rather just skip calling pg_flush_data in the > directory case, that very likely isn't beneficial on any OS. Seems reasonable, changed. > I think we still need to fix the mmap() implementation to support the > offset = 0, nbytes = 0 case (via fseek(SEEK_END). It is already in this diff. I’ve added this few messages ago. --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On Mon, Mar 21, 2016 at 03:09:56PM +0300, Stas Kelvich wrote: > On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote: > > Hm. I think we should rather just skip calling pg_flush_data in the > > directory case, that very likely isn't beneficial on any OS. > > Seems reasonable, changed. > > > I think we still need to fix the mmap() implementation to support the > > offset = 0, nbytes = 0 case (via fseek(SEEK_END). > > It is already in this diff. I’ve added this few messages ago. [This is a generic notification.] The above-described topic is currently a PostgreSQL 9.6 open item. Andres, since you committed the patch believed to have created it, you own this open item. If that responsibility lies elsewhere, please let us know whose responsibility it is to fix this. Since new open items may be discovered at any time and I want to plan to have them all fixed well in advance of the ship date, I will appreciate your efforts toward speedy resolution. Please present, within 72 hours, a plan to fix the defect within seven days of this message. Thanks.
On Mon, Mar 21, 2016 at 9:09 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote: >> Hm. I think we should rather just skip calling pg_flush_data in the >> directory case, that very likely isn't beneficial on any OS. > > Seems reasonable, changed. > >> I think we still need to fix the mmap() implementation to support the >> offset = 0, nbytes = 0 case (via fseek(SEEK_END). > > It is already in this diff. I’ve added this few messages ago. A similar change seems to be needed in initdb.c's pre_sync_fname. -- Michael
Michael Paquier <michael.paquier@gmail.com> writes: > On Mon, Mar 21, 2016 at 9:09 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >> On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote: >>> I think we still need to fix the mmap() implementation to support the >>> offset = 0, nbytes = 0 case (via fseek(SEEK_END). >> It is already in this diff. I’ve added this few messages ago. There are bigger issues in this code, actually, to wit assuming that it should always be possible to mmap the target region. That's patently false on 32-bit machines, where you likely won't have a gigabyte of free address space. For this reason, I think that issuing a WARNING on mmap failure is a damfool idea, and giving up on the flush attempt even more so. We should just silently fall through to the next implementation if we cannot mmap the target region. While I'm whinging: what happens when off_t is wider than size_t? It's entirely possible that the user has configured the relation segment size to more than 4GB, so I do not think that's academic. I think we also need a test for length > SIZE_T_MAX and then fall through to another implementation, rather than mapping and msync'ing some initial fragment of the file, which is what will happen now. > A similar change seems to be needed in initdb.c's pre_sync_fname. Hmm, do we need to move this logic into src/common? regards, tom lane
I wrote: > Michael Paquier <michael.paquier@gmail.com> writes: >> A similar change seems to be needed in initdb.c's pre_sync_fname. > Hmm, do we need to move this logic into src/common? I concluded that sharing the code would be more trouble than it's worth, because initdb.c's version of this is in fact not broken: it was never taught about mmap, and it doesn't need to be IMO, because it's not that performance-critical. I fixed the other things I complained about and pushed this. Please check to see that the committed patch resolves your original problem. regards, tom lane
On 2016-04-13 17:21:01 -0400, Tom Lane wrote: > I concluded that sharing the code would be more trouble than it's worth, > because initdb.c's version of this is in fact not broken: it was never > taught about mmap, and it doesn't need to be IMO, because it's not that > performance-critical. Agreed, there seems little immediate reason to go through additional trouble. The issue of posix_fadvise(DONTNEED) dropping the cache seems much less severe for initdb, than for the flushing patch (where it'd obviously hurt). It's probably good to unify the paths at some point in the future though. We're lacking fsyncs in a bunch of relatively important paths (WIP patch for most of them posted), and we end up needing fsync (and probably the flushing logic) in the backend, initdb, pg_basebackup/pg_recvlogical, pg_rewind and possibly some others that I forgot about. > I fixed the other things I complained about and pushed this. Thanks for taking care of it! > Please check to see that the committed patch resolves your original problem. I looked through the commit, and I only have one very minor nitpick: It might be good if + * We compile all alternatives that are supported on the current platform, + * to find portability problems more easily. noted that we also fall through the approaches. I'm not entirely sure what + /* + * Caution: do not call pg_flush_data with amount = 0, it could trash the + * file's seek position. + */ + if (amount <= 0) + return; + is about? - Andres
Andres Freund <andres@anarazel.de> writes: > I'm not entirely sure what > + /* > + * Caution: do not call pg_flush_data with amount = 0, it could trash the > + * file's seek position. > + */ > + if (amount <= 0) > + return; > + > is about? fd.c tracks seek position for open files. I'm not sure that that function can get called with amount == 0, but if it did, the caller would certainly not be expecting the file position to change. regards, tom lane
On 2016-04-13 17:44:41 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I'm not entirely sure what > > > + /* > > + * Caution: do not call pg_flush_data with amount = 0, it could trash the > > + * file's seek position. > > + */ > > + if (amount <= 0) > > + return; > > + > > > is about? > > fd.c tracks seek position for open files. I'm not sure that that > function can get called with amount == 0, but if it did, the caller > would certainly not be expecting the file position to change. Ok, fair enough. (And no, it should currently be never called that way) Andres
Andres Freund <andres@anarazel.de> writes: > On 2016-04-13 17:44:41 -0400, Tom Lane wrote: >> fd.c tracks seek position for open files. I'm not sure that that >> function can get called with amount == 0, but if it did, the caller >> would certainly not be expecting the file position to change. > Ok, fair enough. (And no, it should currently be never called that way) BTW, I just noticed another issue here, which is that FileWriteback and the corresponding smgr routines are declared with bogusly narrow "amount" arguments --- eg, it's silly that FileWriteback only takes an int amount. I think this code could be actively broken for relation segment sizes exceeding 2GB, and even if it isn't, we should define the functions correctly the first time. Will fix the function definitions, but I'm kind of wondering exactly how many times the inner loop in IssuePendingWritebacks() could possibly iterate ... regards, tom lane
On 2016-04-13 18:09:18 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2016-04-13 17:44:41 -0400, Tom Lane wrote: > >> fd.c tracks seek position for open files. I'm not sure that that > >> function can get called with amount == 0, but if it did, the caller > >> would certainly not be expecting the file position to change. > > > Ok, fair enough. (And no, it should currently be never called that way) > > BTW, I just noticed another issue here, which is that FileWriteback > and the corresponding smgr routines are declared with bogusly narrow > "amount" arguments --- eg, it's silly that FileWriteback only takes > an int amount. Well, I modeled it after the nearby routines (like FileRead), which all only take an amount in int. Now there might be less reason to read a lot of data at once, than to flush large amounts; but it still didn't seem necessary to break with the rest of the functions in the file. > I think this code could be actively broken for relation segment sizes > exceeding 2GB, and even if it isn't, we should define the functions > correctly the first time. I don't think it's actively problematic, ->max_pending (and thus nr_pending) is limited by #define WRITEBACK_MAX_PENDING_FLUSHES 256 (although I wondered whether we should increase the limit a bit further). Even with a blocksize of 32768, that's pretty far away from exceeding INT_MAX. > Will fix the function definitions, but I'm kind of wondering exactly how > many times the inner loop in IssuePendingWritebacks() could possibly > iterate ... WRITEBACK_MAX_PENDING_FLUSHES/256, due to the limitation mentioned above. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2016-04-13 18:09:18 -0400, Tom Lane wrote: >> BTW, I just noticed another issue here, which is that FileWriteback >> and the corresponding smgr routines are declared with bogusly narrow >> "amount" arguments --- eg, it's silly that FileWriteback only takes >> an int amount. > Well, I modeled it after the nearby routines (like FileRead), which all > only take an amount in int. Now there might be less reason to read a lot > of data at once, than to flush large amounts; but it still didn't seem > necessary to break with the rest of the functions in the file. Well, those APIs are pretty historical. I think it's useful to get new ones correct from the outset. regards, tom lane