Thread: fd.c: flush data problems on osx

fd.c: flush data problems on osx

From
Stas Kelvich
Date:
Hi

Current implementation of pg_flush_data when called with zero offset and zero nbytes is assumed to flush all file. In
osxit uses mmap with these arguments, but according to man:  

"[EINVAL]           The len argument was negative or zero. Historically, the system call would not return an
                        error if the argument was zero.  See other potential additional restrictions in the COMPAT-
                        IBILITY section below."

so it is generate a lot of warnings:

"WARNING:  could not mmap while flushing dirty data: Invalid argument"

Call to pg_flush_data with zero offset and nbytes happens when replica starts from base backup and fsync=on. TAP test
toreproduce is attached. 

One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0.

Also there are no default ifdef inside this function, is there any check that will guarantee that pg_flush_data will
notend up with empty body on some platform? 

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: fd.c: flush data problems on osx

From
Andres Freund
Date:
Hi,

On 2016-03-17 20:09:53 +0300, Stas Kelvich wrote:
> Current implementation of pg_flush_data when called with zero offset and zero nbytes is assumed to flush all file. In
osxit uses mmap with these arguments, but according to man: 
 
> 
> "[EINVAL]           The len argument was negative or zero. Historically, the system call would not return an
>                         error if the argument was zero.  See other potential additional restrictions in the COMPAT-
>                         IBILITY section below."
> 
> so it is generate a lot of warnings:
> 
> "WARNING:  could not mmap while flushing dirty data: Invalid argument"

Hm, yea, that's buggy.


> One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0.

Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
the file size. Could you test that?


> Also there are no default ifdef inside this function, is there any
> check that will guarantee that pg_flush_data will not end up with
> empty body on some platform?

There doesn't need to be - it's purely "advisory", i.e. just an
optimization.

Greetings,

Andres Freund



Re: fd.c: flush data problems on osx

From
Stas Kelvich
Date:
> On 17 Mar 2016, at 20:23, Andres Freund <andres@anarazel.de> wrote:
>
>>
>> Also there are no default ifdef inside this function, is there any
>> check that will guarantee that pg_flush_data will not end up with
>> empty body on some platform?
>
> There doesn't need to be - it's purely "advisory", i.e. just an
> optimization.


Ah, okay, then I misinterpret purpose of that function and it shouldn’t be forced to sync.

>
>> One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0.
>
> Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
> the file size. Could you test that?
>

It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize while manual says it can be of arbitrary
length,so i aligned it. 
Also there were call to mmap with PROT_READ | PROT_WRITE, but when called from pre_sync_fname file descriptor is just
O_RDONLY,so i changed mmap mode to PROT_READ — seems that PROT_WRITE wasn’t needed anyway. 

And all of that reduces number of warnings in order of magnitude but there are still some and I don’t yet understand
whyare they happening. 

>
> Greetings,
>
> Andres Freund
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company





Attachment

Re: fd.c: flush data problems on osx

From
Stas Kelvich
Date:
> On 18 Mar 2016, at 14:45, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>>
>>> One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0.
>>
>> Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
>> the file size. Could you test that?
>>
>
> It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize while manual says it can be of arbitrary
length,so i aligned it. 
> Also there were call to mmap with PROT_READ | PROT_WRITE, but when called from pre_sync_fname file descriptor is just
O_RDONLY,so i changed mmap mode to PROT_READ — seems that PROT_WRITE wasn’t needed anyway. 
>
> And all of that reduces number of warnings in order of magnitude but there are still some and I don’t yet understand
whyare they happening. 

I’ve spend some more time on this issue and found that remaining warnings were caused by mmap-ing directories — that
raisesEINVAL in OSX (probably not only OSX, but I didn’t tried). 
So i’ve skipped mmap for dirs and now restore happens without warnings. Also I’ve fixed wrong error check that was in
previousversion of patch. 




---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Attachment

Re: fd.c: flush data problems on osx

From
Andres Freund
Date:
On 2016-03-21 14:46:09 +0300, Stas Kelvich wrote:
> 
> > On 18 Mar 2016, at 14:45, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> >> 
> >>> One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0.
> >> 
> >> Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
> >> the file size. Could you test that?
> >> 
> > 
> > It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize while manual says it can be of arbitrary
length,so i aligned it.
 
> > Also there were call to mmap with PROT_READ | PROT_WRITE, but when called from pre_sync_fname file descriptor is
justO_RDONLY, so i changed mmap mode to PROT_READ — seems that PROT_WRITE wasn’t needed anyway.
 
> > 
> > And all of that reduces number of warnings in order of magnitude but there are still some and I don’t yet
understandwhy are they happening.
 
> 
> I’ve spend some more time on this issue and found that remaining
> warnings were caused by mmap-ing directories — that raises EINVAL in
> OSX (probably not only OSX, but I didn’t tried).

> So i’ve skipped mmap for dirs and now restore happens without
> warnings. Also I’ve fixed wrong error check that was in previous
> version of patch.

Hm. I think we should rather just skip calling pg_flush_data in the
directory case, that very likely isn't beneficial on any OS.

I think we still need to fix the mmap() implementation to support the
offset = 0, nbytes = 0 case (via fseek(SEEK_END).

Greetings,

Andres Freund



Re: fd.c: flush data problems on osx

From
Stas Kelvich
Date:
On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote:
> Hm. I think we should rather just skip calling pg_flush_data in the
> directory case, that very likely isn't beneficial on any OS.

Seems reasonable, changed.

> I think we still need to fix the mmap() implementation to support the
> offset = 0, nbytes = 0 case (via fseek(SEEK_END).

It is already in this diff. I’ve added this few messages ago.


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Attachment

Re: fd.c: flush data problems on osx

From
Noah Misch
Date:
On Mon, Mar 21, 2016 at 03:09:56PM +0300, Stas Kelvich wrote:
> On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote:
> > Hm. I think we should rather just skip calling pg_flush_data in the
> > directory case, that very likely isn't beneficial on any OS.
> 
> Seems reasonable, changed.
> 
> > I think we still need to fix the mmap() implementation to support the
> > offset = 0, nbytes = 0 case (via fseek(SEEK_END).
> 
> It is already in this diff. I’ve added this few messages ago.

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Andres,
since you committed the patch believed to have created it, you own this open
item.  If that responsibility lies elsewhere, please let us know whose
responsibility it is to fix this.  Since new open items may be discovered at
any time and I want to plan to have them all fixed well in advance of the ship
date, I will appreciate your efforts toward speedy resolution.  Please
present, within 72 hours, a plan to fix the defect within seven days of this
message.  Thanks.



Re: fd.c: flush data problems on osx

From
Michael Paquier
Date:
On Mon, Mar 21, 2016 at 9:09 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote:
>> Hm. I think we should rather just skip calling pg_flush_data in the
>> directory case, that very likely isn't beneficial on any OS.
>
> Seems reasonable, changed.
>
>> I think we still need to fix the mmap() implementation to support the
>> offset = 0, nbytes = 0 case (via fseek(SEEK_END).
>
> It is already in this diff. I’ve added this few messages ago.

A similar change seems to be needed in initdb.c's pre_sync_fname.
--
Michael



Re: fd.c: flush data problems on osx

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Mon, Mar 21, 2016 at 9:09 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>> On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote:
>>> I think we still need to fix the mmap() implementation to support the
>>> offset = 0, nbytes = 0 case (via fseek(SEEK_END).

>> It is already in this diff. I’ve added this few messages ago.

There are bigger issues in this code, actually, to wit assuming that
it should always be possible to mmap the target region.  That's patently
false on 32-bit machines, where you likely won't have a gigabyte of free
address space.

For this reason, I think that issuing a WARNING on mmap failure is a
damfool idea, and giving up on the flush attempt even more so.  We
should just silently fall through to the next implementation if we
cannot mmap the target region.

While I'm whinging: what happens when off_t is wider than size_t?  It's
entirely possible that the user has configured the relation segment size
to more than 4GB, so I do not think that's academic.  I think we also need
a test for length > SIZE_T_MAX and then fall through to another
implementation, rather than mapping and msync'ing some initial fragment of
the file, which is what will happen now.

> A similar change seems to be needed in initdb.c's pre_sync_fname.

Hmm, do we need to move this logic into src/common?
        regards, tom lane



Re: fd.c: flush data problems on osx

From
Tom Lane
Date:
I wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> A similar change seems to be needed in initdb.c's pre_sync_fname.

> Hmm, do we need to move this logic into src/common?

I concluded that sharing the code would be more trouble than it's worth,
because initdb.c's version of this is in fact not broken: it was never
taught about mmap, and it doesn't need to be IMO, because it's not that
performance-critical.

I fixed the other things I complained about and pushed this.  Please
check to see that the committed patch resolves your original problem.
        regards, tom lane



Re: fd.c: flush data problems on osx

From
Andres Freund
Date:
On 2016-04-13 17:21:01 -0400, Tom Lane wrote:
> I concluded that sharing the code would be more trouble than it's worth,
> because initdb.c's version of this is in fact not broken: it was never
> taught about mmap, and it doesn't need to be IMO, because it's not that
> performance-critical.

Agreed, there seems little immediate reason to go through additional
trouble. The issue of posix_fadvise(DONTNEED) dropping the cache seems
much less severe for initdb, than for the flushing patch (where it'd
obviously hurt).

It's probably good to unify the paths at some point in the future
though. We're lacking fsyncs in a bunch of relatively important paths
(WIP patch for most of them posted), and we end up needing fsync (and
probably the flushing logic) in the backend, initdb,
pg_basebackup/pg_recvlogical, pg_rewind and possibly some others that I
forgot about.


> I fixed the other things I complained about and pushed this.

Thanks for taking care of it!


> Please check to see that the committed patch resolves your original problem.

I looked through the commit, and I only have one very minor nitpick:
It might be good if
+    * We compile all alternatives that are supported on the current platform,
+    * to find portability problems more easily.
noted that we also fall through the approaches.


I'm not entirely sure what

+   /*
+    * Caution: do not call pg_flush_data with amount = 0, it could trash the
+    * file's seek position.
+    */
+   if (amount <= 0)
+       return;
+

is about?

- Andres



Re: fd.c: flush data problems on osx

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I'm not entirely sure what

> +   /*
> +    * Caution: do not call pg_flush_data with amount = 0, it could trash the
> +    * file's seek position.
> +    */
> +   if (amount <= 0)
> +       return;
> +

> is about?

fd.c tracks seek position for open files.  I'm not sure that that
function can get called with amount == 0, but if it did, the caller
would certainly not be expecting the file position to change.
        regards, tom lane



Re: fd.c: flush data problems on osx

From
Andres Freund
Date:
On 2016-04-13 17:44:41 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I'm not entirely sure what
> 
> > +   /*
> > +    * Caution: do not call pg_flush_data with amount = 0, it could trash the
> > +    * file's seek position.
> > +    */
> > +   if (amount <= 0)
> > +       return;
> > +
> 
> > is about?
> 
> fd.c tracks seek position for open files.  I'm not sure that that
> function can get called with amount == 0, but if it did, the caller
> would certainly not be expecting the file position to change.

Ok, fair enough.  (And no, it should currently be never called that way)

Andres



Re: fd.c: flush data problems on osx

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2016-04-13 17:44:41 -0400, Tom Lane wrote:
>> fd.c tracks seek position for open files.  I'm not sure that that
>> function can get called with amount == 0, but if it did, the caller
>> would certainly not be expecting the file position to change.

> Ok, fair enough.  (And no, it should currently be never called that way)

BTW, I just noticed another issue here, which is that FileWriteback
and the corresponding smgr routines are declared with bogusly narrow
"amount" arguments --- eg, it's silly that FileWriteback only takes
an int amount.  I think this code could be actively broken for
relation segment sizes exceeding 2GB, and even if it isn't, we should
define the functions correctly the first time.

Will fix the function definitions, but I'm kind of wondering exactly how
many times the inner loop in IssuePendingWritebacks() could possibly
iterate ...
        regards, tom lane



Re: fd.c: flush data problems on osx

From
Andres Freund
Date:
On 2016-04-13 18:09:18 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2016-04-13 17:44:41 -0400, Tom Lane wrote:
> >> fd.c tracks seek position for open files.  I'm not sure that that
> >> function can get called with amount == 0, but if it did, the caller
> >> would certainly not be expecting the file position to change.
> 
> > Ok, fair enough.  (And no, it should currently be never called that way)
> 
> BTW, I just noticed another issue here, which is that FileWriteback
> and the corresponding smgr routines are declared with bogusly narrow
> "amount" arguments --- eg, it's silly that FileWriteback only takes
> an int amount.

Well, I modeled it after the nearby routines (like FileRead), which all
only take an amount in int. Now there might be less reason to read a lot
of data at once, than to flush large amounts; but it still didn't seem
necessary to break with the rest of the functions in the file.


> I think this code could be actively broken for relation segment sizes
> exceeding 2GB, and even if it isn't, we should define the functions
> correctly the first time.

I don't think it's actively problematic, ->max_pending (and thus
nr_pending) is limited by #define WRITEBACK_MAX_PENDING_FLUSHES 256
(although I wondered whether we should increase the limit a bit
further). Even with a blocksize of 32768, that's pretty far away from
exceeding INT_MAX.


> Will fix the function definitions, but I'm kind of wondering exactly how
> many times the inner loop in IssuePendingWritebacks() could possibly
> iterate ...

WRITEBACK_MAX_PENDING_FLUSHES/256, due to the limitation mentioned
above.


Greetings,

Andres Freund



Re: fd.c: flush data problems on osx

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2016-04-13 18:09:18 -0400, Tom Lane wrote:
>> BTW, I just noticed another issue here, which is that FileWriteback
>> and the corresponding smgr routines are declared with bogusly narrow
>> "amount" arguments --- eg, it's silly that FileWriteback only takes
>> an int amount.

> Well, I modeled it after the nearby routines (like FileRead), which all
> only take an amount in int. Now there might be less reason to read a lot
> of data at once, than to flush large amounts; but it still didn't seem
> necessary to break with the rest of the functions in the file.

Well, those APIs are pretty historical.  I think it's useful to get
new ones correct from the outset.
        regards, tom lane