Thread: retry shm attach for windows (WAS: Re: [HACKERS] OK, so culicidae is*still* broken)

On Sat, May 20, 2017 at 5:56 PM, Noah Misch <noah@leadboat.com> wrote:
> On Sat, Apr 15, 2017 at 02:30:18PM -0700, Andres Freund wrote:
>> On 2017-04-15 17:24:54 -0400, Tom Lane wrote:
>> > Andres Freund <andres@anarazel.de> writes:
>> > > On 2017-04-15 17:09:38 -0400, Tom Lane wrote:
>> > >> Why doesn't Windows' ability to map the segment into the new process
>> > >> before it executes take care of that?
>> >
>> > > Because of ASLR of the main executable (i.e. something like PIE).
>> >
>> > Not following.  Are you saying that the main executable gets mapped into
>> > the process address space immediately, but shared libraries are not?
>
> At the time of the pgwin32_ReserveSharedMemoryRegion() call, the child process
> contains only ntdll.dll and the executable.
>
>> Without PIE/ASLR we can somewhat rely on pgwin32_ReserveSharedMemoryRegion
>> to find the space that PGSharedMemoryCreate allocated still unoccupied.
>
> I've never had access to a Windows system that can reproduce the fork
> failures.  My best theory is that antivirus or similar software injects an
> additional DLL at that early stage.
>
>> > I wonder whether we could work around that by just destroying the created
>> > process and trying again if we get a collision.  It'd be a tad
>> > inefficient, but hopefully collisions wouldn't happen often enough to be a
>> > big problem.
>>
>> That might work, although it's obviously not pretty.
>
> I didn't like that idea when Michael proposed it in 2015.  Since disabling
> ASLR on the exe proved insufficient, I do like it now.  It degrades nicely; if
> something raises the collision rate from 1% to 10%, that just looks like fork
> latency degrading.
>

So it seems both you and Tom are leaning towards some sort of retry
mechanism for shm reattach on Windows.  I also think that is a viable
option to negate the impact of ASLR.  Attached patch does that.  Note
that, as I have mentioned above I think we need to do it for shm
reserve operation as well.  I think we need to decide how many retries
are sufficient before bailing out.  As of now, I have used 10 to have
some similarity with PGSharedMemoryCreate(), but we can choose some
different count as well.  One might say that we can have "number of
retries" as a guc parameter, but I am not sure about it, so not used.
Another point to consider is that do we want the same retry mechanism
for EXEC_BACKEND builds (function
PGSharedMemoryReAttach is different on Windows and EXEC_BACKEND
builds).  I think it makes sense to have retry mechanism for
EXEC_BACKEND builds, so done that way in the patch.  Yet another point
which needs some thought is for reattach operation, before retrying do
we want to reserve the shm by using VirtualAllocEx?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Tue, May 23, 2017 at 8:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> So it seems both you and Tom are leaning towards some sort of retry
> mechanism for shm reattach on Windows.  I also think that is a viable
> option to negate the impact of ASLR.  Attached patch does that.  Note
> that, as I have mentioned above I think we need to do it for shm
> reserve operation as well.  I think we need to decide how many retries
> are sufficient before bailing out.  As of now, I have used 10 to have
> some similarity with PGSharedMemoryCreate(), but we can choose some
> different count as well.  One might say that we can have "number of
> retries" as a guc parameter, but I am not sure about it, so not used.

New GUCs can be backpatched if necessary, though this does not seem
necessary. Who is going to set up that anyway if we have a limit high
enough. 10 looks like a sufficient number to me.

> Another point to consider is that do we want the same retry mechanism
> for EXEC_BACKEND builds (function
> PGSharedMemoryReAttach is different on Windows and EXEC_BACKEND
> builds).  I think it makes sense to have retry mechanism for
> EXEC_BACKEND builds, so done that way in the patch.  Yet another point
> which needs some thought is for reattach operation, before retrying do
> we want to reserve the shm by using VirtualAllocEx?

-       elog(FATAL, "could not reattach to shared memory (key=%p,
addr=%p): error code %lu",
+   {
+       elog(LOG, "could not reattach to shared memory (key=%p,
addr=%p): error code %lu",            UsedShmemSegID, UsedShmemSegAddr, GetLastError());
+       return false;
+   }
This should be a WARNING, with the attempt number reported as well?

-void
-PGSharedMemoryReAttach(void)
+bool
+PGSharedMemoryReAttach(int retry_count)
I think that the loop logic should be kept within
PGSharedMemoryReAttach, this makes the code of postmaster.c cleaner,
and it seems to me that each step of PGSharedMemoryReAttach() should
be retried in order. Do we need also to worry about SysV? I agree with
you that having consistency is better, but I don't recall seeing
failures or complains related to cygwin for ASLR.

I think that you are forgetting PGSharedMemoryCreate in the retry process.
-- 
Michael



On Wed, May 24, 2017 at 6:59 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, May 23, 2017 at 8:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> So it seems both you and Tom are leaning towards some sort of retry
>> mechanism for shm reattach on Windows.  I also think that is a viable
>> option to negate the impact of ASLR.  Attached patch does that.  Note
>> that, as I have mentioned above I think we need to do it for shm
>> reserve operation as well.  I think we need to decide how many retries
>> are sufficient before bailing out.  As of now, I have used 10 to have
>> some similarity with PGSharedMemoryCreate(), but we can choose some
>> different count as well.  One might say that we can have "number of
>> retries" as a guc parameter, but I am not sure about it, so not used.
>
> New GUCs can be backpatched if necessary, though this does not seem
> necessary. Who is going to set up that anyway if we have a limit high
> enough. 10 looks like a sufficient number to me.
>

Okay.

>> Another point to consider is that do we want the same retry mechanism
>> for EXEC_BACKEND builds (function
>> PGSharedMemoryReAttach is different on Windows and EXEC_BACKEND
>> builds).  I think it makes sense to have retry mechanism for
>> EXEC_BACKEND builds, so done that way in the patch.  Yet another point
>> which needs some thought is for reattach operation, before retrying do
>> we want to reserve the shm by using VirtualAllocEx?
>
> -       elog(FATAL, "could not reattach to shared memory (key=%p,
> addr=%p): error code %lu",
> +   {
> +       elog(LOG, "could not reattach to shared memory (key=%p,
> addr=%p): error code %lu",
>              UsedShmemSegID, UsedShmemSegAddr, GetLastError());
> +       return false;
> +   }
> This should be a WARNING, with the attempt number reported as well?
>

I think for retry we just want to log, why you want to display it as a
warning?  During startup, other similar places (where we continue
startup even after the call has failed) also use LOG (refer
PGSharedMemoryDetach), so why do differently here?   However, I think
adding retry_count should be okay.

> -void
> -PGSharedMemoryReAttach(void)
> +bool
> +PGSharedMemoryReAttach(int retry_count)
> I think that the loop logic should be kept within
> PGSharedMemoryReAttach, this makes the code of postmaster.c cleaner,
>

Sure, we can do that, but then we need to repeat the same looping
logic in both sysv and win32 case.  Now, if decide not to do for the
sysv case, then it might make sense to consider it doing it in
function PGSharedMemoryReAttach().

> and it seems to me that each step of PGSharedMemoryReAttach() should
> be retried in order. Do we need also to worry about SysV? I agree with
> you that having consistency is better, but I don't recall seeing
> failures or complains related to cygwin for ASLR.
>

I am also not aware of Cygwin failures, but I think keeping the code
same for the cases where we are not using fork mechanism seems like an
advisable approach.  Also, if someone is testing EXEC_BACKEND on Linux
then randomization is 'on' by default, so one can hit this issue
during tests which doesn't matter much, but it still seems better to
have retry logic to prevent the issue.

> I think that you are forgetting PGSharedMemoryCreate in the retry process.
>

No, we don't need retry for PGSharedMemoryCreate as we need this only
we are trying to attach to some pre-reserved shared memory.  Do you
have something else in mind?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Wed, May 24, 2017 at 09:29:11AM -0400, Michael Paquier wrote:
> On Tue, May 23, 2017 at 8:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > So it seems both you and Tom are leaning towards some sort of retry
> > mechanism for shm reattach on Windows.  I also think that is a viable
> > option to negate the impact of ASLR.  Attached patch does that.  Note
> > that, as I have mentioned above I think we need to do it for shm
> > reserve operation as well.  I think we need to decide how many retries
> > are sufficient before bailing out.  As of now, I have used 10 to have
> > some similarity with PGSharedMemoryCreate(), but we can choose some
> > different count as well.  One might say that we can have "number of
> > retries" as a guc parameter, but I am not sure about it, so not used.
> 
> New GUCs can be backpatched if necessary, though this does not seem
> necessary. Who is going to set up that anyway if we have a limit high
> enough. 10 looks like a sufficient number to me.

Ten feels low to me.  The value should be be low enough so users don't give up
and assume a permanent hang, but there's little advantage to making it lower.
I'd set it such that we give up in 1-5s on a modern Windows machine, which I
expect implies a retry count of one hundred or more.



Re: retry shm attach for windows (WAS: Re: [HACKERS] OK, soculicidae is *still* broken)

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Noah Misch
> Ten feels low to me.  The value should be be low enough so users don't give
> up and assume a permanent hang, but there's little advantage to making it
> lower.
> I'd set it such that we give up in 1-5s on a modern Windows machine, which
> I expect implies a retry count of one hundred or more.

Then, maybe we can measure the time in each iteration and give up after a particular seconds.

Regards
Takayuki Tsunakawa




On Thu, May 25, 2017 at 11:34 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Noah Misch
>> Ten feels low to me.  The value should be be low enough so users don't give
>> up and assume a permanent hang, but there's little advantage to making it
>> lower.
>> I'd set it such that we give up in 1-5s on a modern Windows machine, which
>> I expect implies a retry count of one hundred or more.
>
> Then, maybe we can measure the time in each iteration and give up after a particular seconds.

Indeed, pgrename() does so with a 100ms sleep time between each
iteration. Perhaps we could do that and limit to 50 iterations?
-- 
Michael



On Thu, May 25, 2017 at 11:41:19AM +0900, Michael Paquier wrote:
> On Thu, May 25, 2017 at 11:34 AM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > From: pgsql-hackers-owner@postgresql.org
> >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Noah Misch
> >> Ten feels low to me.  The value should be be low enough so users don't give
> >> up and assume a permanent hang, but there's little advantage to making it
> >> lower.
> >> I'd set it such that we give up in 1-5s on a modern Windows machine, which
> >> I expect implies a retry count of one hundred or more.
> >
> > Then, maybe we can measure the time in each iteration and give up after a particular seconds.

Exact duration is not important.  Giving up after 0.1s is needlessly early,
because a system taking that long to start a backend is still usable.  Giving
up after 50s is quite late.  In between those extremes, lots of durations
would be reasonable.  Thus, measuring time is needless complexity; retry count
is a suitable proxy.

> Indeed, pgrename() does so with a 100ms sleep time between each
> iteration. Perhaps we could do that and limit to 50 iterations?

pgrename() is polling for an asynchronous event, hence the sleep.  To my
knowledge, time doesn't heal shm attach failures; therefore, a sleep is not
appropriate here.



On Thu, May 25, 2017 at 8:41 AM, Noah Misch <noah@leadboat.com> wrote:
> On Thu, May 25, 2017 at 11:41:19AM +0900, Michael Paquier wrote:
>
>> Indeed, pgrename() does so with a 100ms sleep time between each
>> iteration. Perhaps we could do that and limit to 50 iterations?
>
> pgrename() is polling for an asynchronous event, hence the sleep.  To my
> knowledge, time doesn't heal shm attach failures; therefore, a sleep is not
> appropriate here.
>

Yes, I also share this opinion, the shm attach failures are due to
randomization behavior, so sleep won't help much.  So, I will change
the patch to use 100 retries unless people have other opinions.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Amit Kapila <amit.kapila16@gmail.com> writes:
> Yes, I also share this opinion, the shm attach failures are due to
> randomization behavior, so sleep won't help much.  So, I will change
> the patch to use 100 retries unless people have other opinions.

Sounds about right to me.
        regards, tom lane



Re: retry shm attach for windows (WAS: Re: [HACKERS] OK, soculicidae is *still* broken)

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Amit Kapila
> Yes, I also share this opinion, the shm attach failures are due to
> randomization behavior, so sleep won't help much.  So, I will change the
> patch to use 100 retries unless people have other opinions.

Perhaps I'm misunderstanding, but I thought it is not known yet whether the cause of the original problem is ASLR.  I
remembersomeone referred to anti-virus software and something else.  I guessed that the reason Noah suggested 1 - 5
secondsof retry is based on the expectation that the address space might be freed by the anti-virus software.
 

Regards
Takayuki Tsunakawa



On Thu, May 25, 2017 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> Yes, I also share this opinion, the shm attach failures are due to
>> randomization behavior, so sleep won't help much.  So, I will change
>> the patch to use 100 retries unless people have other opinions.
>
> Sounds about right to me.
>

Okay.  I have changed the retry count to 100 and modified few comments
in the attached patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Fri, May 26, 2017 at 5:30 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Amit Kapila
>> Yes, I also share this opinion, the shm attach failures are due to
>> randomization behavior, so sleep won't help much.  So, I will change the
>> patch to use 100 retries unless people have other opinions.
>
> Perhaps I'm misunderstanding, but I thought it is not known yet whether the cause of the original problem is ASLR.  I
remembersomeone referred to anti-virus software and something else.
 
>

We are here purposefully trying to resolve the randomize shm
allocation behavior due to ASLR.  The original failure was on a linux
machine and is resolved.  We presumably sometimes get the failures [1]
due to this behavior.

>  I guessed that the reason Noah suggested 1 - 5 seconds of retry is based on the expectation that the address space
mightbe freed by the anti-virus software.
 
>

Noah is also suggesting to have a retry count, read his mail above in
this thread and refer to his comment ("Thus, measuring time is
needless complexity; retry count is a suitable proxy.")


I think the real question here is, shall we backpatch this fix or we
want to do this just in Head or we want to consider it as a new
feature for PostgreSQL-11.  I think it should be fixed in Head and the
change seems harmless to me, so we should even backpatch it.


[1] - https://www.postgresql.org/message-id/14121.1485360296%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Fri, May 26, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think the real question here is, shall we backpatch this fix or we
> want to do this just in Head or we want to consider it as a new
> feature for PostgreSQL-11.  I think it should be fixed in Head and the
> change seems harmless to me, so we should even backpatch it.

The thing is not invasive, so backpatching is a low-risk move. We can
as well get that into HEAD first, wait a bit for dust to settle on it,
and then backpatch.
-- 
Michael




On Fri, May 26, 2017 at 8:24 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, May 26, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think the real question here is, shall we backpatch this fix or we
> want to do this just in Head or we want to consider it as a new
> feature for PostgreSQL-11.  I think it should be fixed in Head and the
> change seems harmless to me, so we should even backpatch it.

The thing is not invasive, so backpatching is a low-risk move. We can
as well get that into HEAD first, wait a bit for dust to settle on it,
and then backpatch.


I would definitely suggest putting it in HEAD (and thus, v10) for a while to get some real world exposure before backpatching. But if it does work out well in the end, then we can certainly consider backpatching it. But given the difficulty in reliably reproducing the problem etc, I think it's a good idea to give it some proper real world experience in 10 first.

--
On Fri, May 26, 2017 at 8:21 PM, Magnus Hagander <magnus@hagander.net> wrote:
>
> On Fri, May 26, 2017 at 8:24 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Fri, May 26, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > I think the real question here is, shall we backpatch this fix or we
>> > want to do this just in Head or we want to consider it as a new
>> > feature for PostgreSQL-11.  I think it should be fixed in Head and the
>> > change seems harmless to me, so we should even backpatch it.
>>
>> The thing is not invasive, so backpatching is a low-risk move. We can
>> as well get that into HEAD first, wait a bit for dust to settle on it,
>> and then backpatch.
>
>
>
> I would definitely suggest putting it in HEAD (and thus, v10) for a while to
> get some real world exposure before backpatching.
>

make sense to me, so I have added an entry in "Older Bugs" section in
PostgreSQL 10 Open Items.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Fri, May 26, 2017 at 10:51 AM, Magnus Hagander <magnus@hagander.net> wrote:
> I would definitely suggest putting it in HEAD (and thus, v10) for a while to
> get some real world exposure before backpatching. But if it does work out
> well in the end, then we can certainly consider backpatching it. But given
> the difficulty in reliably reproducing the problem etc, I think it's a good
> idea to give it some proper real world experience in 10 first.

So, are you going to, perhaps, commit this?  Or who is picking this up?

/me knows precious little about Windows.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> So, are you going to, perhaps, commit this?  Or who is picking this up?

> /me knows precious little about Windows.

I'm not going to be the one to commit this either, but seems like someone
should.
        regards, tom lane



On 01/06/17 15:25, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> So, are you going to, perhaps, commit this?  Or who is picking this up?
> 
>> /me knows precious little about Windows.
> 
> I'm not going to be the one to commit this either, but seems like someone
> should.
> 

The new code does not use any windows specific APIs or anything, it just
adds retry logic for reattaching when we do EXEC_BACKEND which seems to
be agreed way of solving this. I do have couple of comments about the
code though.

The new parameter retry_count in PGSharedMemoryReAttach() seems to be
only used to decide if to log reattach issues so that we don't spam log
when retrying, but this fact is not mentioned anywhere.

Also, I am not excited about following coding style:
> +        if (!pgwin32_ReserveSharedMemoryRegion(pi.hProcess))
> +            continue;
> +        else
> +        {

Amit, if you want to avoid having to add the curly braces for single
line while still having else, I'd invert the expression in the if ()
statement so that true comes first. It's much less ugly to have curly
braces part first and the continue statement in the else block IMHO.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



On Fri, May 26, 2017 at 05:50:45PM +0530, Amit Kapila wrote:
> On Fri, May 26, 2017 at 5:30 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> >  I guessed that the reason Noah suggested 1 - 5 seconds of retry is based on the expectation that the address space
mightbe freed by the anti-virus software.
 

No, I suggested it because I wouldn't seriously consider keeping an
installation where backend start takes 5s.  If the address conflicts are that
persistent, I'd fix the bug or switch operating systems.  Therefore, we may as
well let it fail at that duration, thereby showing the user what to
investigate.  Startup time of 0.2s, on the other hand, is noticeable but
usable; I'd prefer not to fail hard at that duration.

> Noah is also suggesting to have a retry count, read his mail above in
> this thread and refer to his comment ("Thus, measuring time is
> needless complexity; retry count is a suitable proxy.")

Right.



On Thu, Jun 1, 2017 at 10:36 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 01/06/17 15:25, Tom Lane wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> So, are you going to, perhaps, commit this?  Or who is picking this up?
>>
>>> /me knows precious little about Windows.
>>
>> I'm not going to be the one to commit this either, but seems like someone
>> should.
>>
>
> The new code does not use any windows specific APIs or anything, it just
> adds retry logic for reattaching when we do EXEC_BACKEND which seems to
> be agreed way of solving this. I do have couple of comments about the
> code though.
>
> The new parameter retry_count in PGSharedMemoryReAttach() seems to be
> only used to decide if to log reattach issues so that we don't spam log
> when retrying, but this fact is not mentioned anywhere.
>

No, it is to avoid calling free of memory which is not reserved on
retry.  See the comment:
+ * On the first try, release memory region reservation that was made by
+ * the postmaster.

Are you referring to the same function in sysv_shm.c, if so probably I
can say refer the same API in win32_shmem.c or maybe add a similar
comment there as well?


> Also, I am not excited about following coding style:
>> +             if (!pgwin32_ReserveSharedMemoryRegion(pi.hProcess))
>> +                     continue;
>> +             else
>> +             {
>
> Amit, if you want to avoid having to add the curly braces for single
> line while still having else, I'd invert the expression in the if ()
> statement so that true comes first. It's much less ugly to have curly
> braces part first and the continue statement in the else block IMHO.
>

I felt that it is easier to understand the code in the way it is
currently written, but I can invert the check if you find it is easier
to read and understand that way.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On 02/06/17 15:37, Amit Kapila wrote:
> On Thu, Jun 1, 2017 at 10:36 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 01/06/17 15:25, Tom Lane wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> So, are you going to, perhaps, commit this?  Or who is picking this up?
>>>
>>>> /me knows precious little about Windows.
>>>
>>> I'm not going to be the one to commit this either, but seems like someone
>>> should.
>>>
>>
>> The new code does not use any windows specific APIs or anything, it just
>> adds retry logic for reattaching when we do EXEC_BACKEND which seems to
>> be agreed way of solving this. I do have couple of comments about the
>> code though.
>>
>> The new parameter retry_count in PGSharedMemoryReAttach() seems to be
>> only used to decide if to log reattach issues so that we don't spam log
>> when retrying, but this fact is not mentioned anywhere.
>>
> 
> No, it is to avoid calling free of memory which is not reserved on
> retry.  See the comment:
> + * On the first try, release memory region reservation that was made by
> + * the postmaster.
> 
> Are you referring to the same function in sysv_shm.c, if so probably I
> can say refer the same API in win32_shmem.c or maybe add a similar
> comment there as well?
> 

Yeah something like that would help, but my main confusion comes from
the fact that there is counter (and even named as such) but only
relevant difference is 0 and not 0. I'd like mention of that mainly
since I was confused by that on the first read.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



On Fri, Jun 2, 2017 at 7:20 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 02/06/17 15:37, Amit Kapila wrote:
>>
>> No, it is to avoid calling free of memory which is not reserved on
>> retry.  See the comment:
>> + * On the first try, release memory region reservation that was made by
>> + * the postmaster.
>>
>> Are you referring to the same function in sysv_shm.c, if so probably I
>> can say refer the same API in win32_shmem.c or maybe add a similar
>> comment there as well?
>>
>
> Yeah something like that would help, but my main confusion comes from
> the fact that there is counter (and even named as such) but only
> relevant difference is 0 and not 0. I'd like mention of that mainly
> since I was confused by that on the first read.
>

Okay, I have added the comment to explain the same.  I have also
modified the patch to adjust the looping as per your suggestion.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
Amit Kapila <amit.kapila16@gmail.com> writes:
> Okay, I have added the comment to explain the same.  I have also
> modified the patch to adjust the looping as per your suggestion.

I took a quick look at this, and it seems rather beside the point.
You can't just loop inside an already-forked process and expect
that address space that was previously unavailable will suddenly
become available, when the problem is that the executable itself
(or some shared library) is mapped into the space you want.
What we have to do in order to retry is to fork an entire new
process, whose address space will hopefully get laid out differently.

While it's possible we could do that in a platform-independent
way, it would be pretty invasive and I don't see the value.
The implementation I'd had in mind originally was to put a loop
inside postmaster.c's internal_forkexec (win32 version), such
that if pgwin32_ReserveSharedMemoryRegion failed, it would
terminate/close up the subprocess as it already does, but then
go back around and do another CreateProcess.  Perhaps it's as
simple as the attached, but I'm not in a position to test it.

Assuming this passes minimal smoke testing, the way to actually test
it is to undo the hack in the MSVC build scripts that disables ASLR,
and then (on a Windows version where ASLR is active) run it until
you've seen a few occurrences of the retry, which should be detectable
by looking for bleats from pgwin32_ReserveSharedMemoryRegion
in the postmaster log.

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 35b4ec8..e30b0c6 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** internal_forkexec(int argc, char *argv[]
*** 4487,4492 ****
--- 4487,4493 ----
  static pid_t
  internal_forkexec(int argc, char *argv[], Port *port)
  {
+     int            retry_count = 0;
      STARTUPINFO si;
      PROCESS_INFORMATION pi;
      int            i;
*************** internal_forkexec(int argc, char *argv[]
*** 4504,4509 ****
--- 4505,4513 ----
      Assert(strncmp(argv[1], "--fork", 6) == 0);
      Assert(argv[2] == NULL);

+     /* Resume here if we need to retry */
+ retry:
+
      /* Set up shared memory for parameter passing */
      ZeroMemory(&sa, sizeof(sa));
      sa.nLength = sizeof(sa);
*************** internal_forkexec(int argc, char *argv[]
*** 4595,4616 ****

      /*
       * Reserve the memory region used by our main shared memory segment before
!      * we resume the child process.
       */
      if (!pgwin32_ReserveSharedMemoryRegion(pi.hProcess))
      {
!         /*
!          * Failed to reserve the memory, so terminate the newly created
!          * process and give up.
!          */
          if (!TerminateProcess(pi.hProcess, 255))
              ereport(LOG,
                      (errmsg_internal("could not terminate process that failed to reserve memory: error code %lu",
                                       GetLastError())));
          CloseHandle(pi.hProcess);
          CloseHandle(pi.hThread);
!         return -1;                /* logging done made by
!                                  * pgwin32_ReserveSharedMemoryRegion() */
      }

      /*
--- 4599,4624 ----

      /*
       * Reserve the memory region used by our main shared memory segment before
!      * we resume the child process.  Normally this should succeed, but if ASLR
!      * is active then it might sometimes fail due to the executable having
!      * gotten mapped into that range.  In that case, just terminate the
!      * process and retry.
       */
      if (!pgwin32_ReserveSharedMemoryRegion(pi.hProcess))
      {
!         /* pgwin32_ReserveSharedMemoryRegion already made a log entry */
          if (!TerminateProcess(pi.hProcess, 255))
              ereport(LOG,
                      (errmsg_internal("could not terminate process that failed to reserve memory: error code %lu",
                                       GetLastError())));
          CloseHandle(pi.hProcess);
          CloseHandle(pi.hThread);
!         if (++retry_count < 100)
!             goto retry;
!         ereport(LOG,
!           (errmsg("giving up after too many tries to reserve shared memory"),
!            errhint("This might be caused by ASLR or antivirus software.")));
!         return -1;
      }

      /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Mon, Jun 5, 2017 at 4:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> Okay, I have added the comment to explain the same.  I have also
>> modified the patch to adjust the looping as per your suggestion.
>
> I took a quick look at this, and it seems rather beside the point.
> You can't just loop inside an already-forked process and expect
> that address space that was previously unavailable will suddenly
> become available, when the problem is that the executable itself
> (or some shared library) is mapped into the space you want.
>

What I understood from the randomization shm allocation behavior due
to ASLR is that when we try to allocate memory (say using
VirtualAllocEx as in our case) at a pre-defined address, it will
instead allocate it at a different address which can lead to a
collision.  So, it seems to me the problem to solve here is not to
attempt to allocate the address space which is allocated to another
library,  rather negate the impact of randomized address allocation
behavior.  So I thought retrying to allocate space at predefined
address is sufficient.

> What we have to do in order to retry is to fork an entire new
> process, whose address space will hopefully get laid out differently.
>
> While it's possible we could do that in a platform-independent
> way, it would be pretty invasive and I don't see the value.
> The implementation I'd had in mind originally was to put a loop
> inside postmaster.c's internal_forkexec (win32 version), such
> that if pgwin32_ReserveSharedMemoryRegion failed, it would
> terminate/close up the subprocess as it already does, but then
> go back around and do another CreateProcess.  Perhaps it's as
> simple as the attached, but I'm not in a position to test it.
>

I think the same problem can happen during reattach as well.
Basically, MapViewOfFileEx can fail to load image at predefined
address (
UsedShmemSegAddr).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Mon, Jun 5, 2017 at 4:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I took a quick look at this, and it seems rather beside the point.

> What I understood from the randomization shm allocation behavior due
> to ASLR is that when we try to allocate memory (say using
> VirtualAllocEx as in our case) at a pre-defined address, it will
> instead allocate it at a different address which can lead to a
> collision.

No.  The problem we're concerned about is that by the time we try
pgwin32_ReserveSharedMemoryRegion's callVirtualAllocEx(hChild, UsedShmemSegAddr, UsedShmemSegSize,
MEM_RESERVE,PAGE_READWRITE)
 
in the child process, something might have already taken that address
space in the child process, which would cause the call to fail not just
allocate some randomly different space.  That can't happen unless the base
executable, or ntdll.dll, or possibly something injected by an antivirus
product, has taken up address space different from what it took up in the
postmaster process.  What we are assuming here is that any such variation
is randomized, and so will probably be different in the next child process
launched by the postmaster, allowing a new process launch to possibly fix
the problem.  But once that address space is allocated in a given process,
no amount of merely retrying a new allocation request in that process is
going to make it go away.  You'd have to do something to free the existing
allocation, and we have no way to do that short of starting over.

> I think the same problem can happen during reattach as well.
> Basically, MapViewOfFileEx can fail to load image at predefined
> address (UsedShmemSegAddr).

Once we've successfully done the VirtualAllocEx call, that should hold
until the VirtualFree call in PGSharedMemoryReAttach, allowing the
immediately-following MapViewOfFileEx call to succeed.  Were that not the
case, we'd have had problems even without ASLR.  We did have problems
exactly like that before the pgwin32_ReserveSharedMemoryRegion code was
added.  So my feeling about this is that retrying the process creation as
in my prototype patch ought to be sufficient; if you think it isn't, the
burden of proof is on you.
        regards, tom lane



On Mon, Jun 5, 2017 at 9:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
>> I think the same problem can happen during reattach as well.
>> Basically, MapViewOfFileEx can fail to load image at predefined
>> address (UsedShmemSegAddr).
>
> Once we've successfully done the VirtualAllocEx call, that should hold
> until the VirtualFree call in PGSharedMemoryReAttach, allowing the
> immediately-following MapViewOfFileEx call to succeed.  Were that not the
> case, we'd have had problems even without ASLR.  We did have problems
> exactly like that before the pgwin32_ReserveSharedMemoryRegion code was
> added.
>

I could not find anything directly in specs which could prove the
theory either way.  However, in one of the StackOverflow discussions,
it has been indicated that MapViewOfFile can opt to load the image at
an address different than the predefined address due to ASLR.

>  So my feeling about this is that retrying the process creation as
> in my prototype patch ought to be sufficient; if you think it isn't, the
> burden of proof is on you.
>

Sure.  I think it is slightly tricky because specs don't say clearly
how ASLR can impact the behavior of any API and in my last attempt I
could not reproduce the issue.

I can try to do basic verification with the patch you have proposed,
but I fear, to do the actual tests as suggested by you is difficult as
it is not reproducible on my machine, but I can still try.


[1] - https://stackoverflow.com/questions/9718616/what-does-mapviewoffile-return/11233456
Refer below text:

"Yes, MapViewOfFile returns the virtual memory base address where the
image has been loaded. The value (content) of this address depends on
whether the image has been successfully loaded at its predefined
address (which has been setup by the linker) or whether the image has
been relocated (because the desired, predefined address is already
occupied or because the image has opt-in to support ASLR)."



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Mon, Jun 5, 2017 at 1:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 5, 2017 at 9:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
>> I think the same problem can happen during reattach as well.
>> Basically, MapViewOfFileEx can fail to load image at predefined
>> address (UsedShmemSegAddr).
>
> Once we've successfully done the VirtualAllocEx call, that should hold
> until the VirtualFree call in PGSharedMemoryReAttach, allowing the
> immediately-following MapViewOfFileEx call to succeed.  Were that not the
> case, we'd have had problems even without ASLR.  We did have problems
> exactly like that before the pgwin32_ReserveSharedMemoryRegion code was
> added.
>

I could not find anything directly in specs which could prove the
theory either way.  However, in one of the StackOverflow discussions,
it has been indicated that MapViewOfFile can opt to load the image at
an address different than the predefined address due to ASLR.

>  So my feeling about this is that retrying the process creation as
> in my prototype patch ought to be sufficient; if you think it isn't, the
> burden of proof is on you.
>

Sure.  I think it is slightly tricky because specs don't say clearly
how ASLR can impact the behavior of any API and in my last attempt I
could not reproduce the issue.

I can try to do basic verification with the patch you have proposed,
but I fear, to do the actual tests as suggested by you is difficult as
it is not reproducible on my machine, but I can still try.


[1] - https://stackoverflow.com/questions/9718616/what-does-mapviewoffile-return/11233456
Refer below text:

"Yes, MapViewOfFile returns the virtual memory base address where the
image has been loaded. The value (content) of this address depends on
whether the image has been successfully loaded at its predefined
address (which has been setup by the linker) or whether the image has
been relocated (because the desired, predefined address is already
occupied or because the image has opt-in to support ASLR)."


That statements refers to mapping executables though, like DLL and EXE. Not mapping of data segments.

It does randomize the entire location of the heap, in which case it might also change. But not for the individual block.

But in neither of those cases does it help to retry without restarting the process, because the heap will be mapped into the same place, and any DLLs loading prior to our code will already have been loaded.

--
On Mon, Jun 5, 2017 at 4:58 PM, Magnus Hagander <magnus@hagander.net> wrote:
> On Mon, Jun 5, 2017 at 1:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Jun 5, 2017 at 9:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Amit Kapila <amit.kapila16@gmail.com> writes:
>> >
>> >> I think the same problem can happen during reattach as well.
>> >> Basically, MapViewOfFileEx can fail to load image at predefined
>> >> address (UsedShmemSegAddr).
>> >
>> > Once we've successfully done the VirtualAllocEx call, that should hold
>> > until the VirtualFree call in PGSharedMemoryReAttach, allowing the
>> > immediately-following MapViewOfFileEx call to succeed.  Were that not
>> > the
>> > case, we'd have had problems even without ASLR.  We did have problems
>> > exactly like that before the pgwin32_ReserveSharedMemoryRegion code was
>> > added.
>> >
>>
>> I could not find anything directly in specs which could prove the
>> theory either way.  However, in one of the StackOverflow discussions,
>> it has been indicated that MapViewOfFile can opt to load the image at
>> an address different than the predefined address due to ASLR.
>>
>> >  So my feeling about this is that retrying the process creation as
>> > in my prototype patch ought to be sufficient; if you think it isn't, the
>> > burden of proof is on you.
>> >
>>
>> Sure.  I think it is slightly tricky because specs don't say clearly
>> how ASLR can impact the behavior of any API and in my last attempt I
>> could not reproduce the issue.
>>
>> I can try to do basic verification with the patch you have proposed,
>> but I fear, to do the actual tests as suggested by you is difficult as
>> it is not reproducible on my machine, but I can still try.
>>
>>
>> [1] -
>> https://stackoverflow.com/questions/9718616/what-does-mapviewoffile-return/11233456
>> Refer below text:
>>
>> "Yes, MapViewOfFile returns the virtual memory base address where the
>> image has been loaded. The value (content) of this address depends on
>> whether the image has been successfully loaded at its predefined
>> address (which has been setup by the linker) or whether the image has
>> been relocated (because the desired, predefined address is already
>> occupied or because the image has opt-in to support ASLR)."
>>
>
> That statements refers to mapping executables though, like DLL and EXE. Not
> mapping of data segments.
>
> It does randomize the entire location of the heap, in which case it might
> also change. But not for the individual block.
>
> But in neither of those cases does it help to retry without restarting the
> process,
>

Okay, the question here is do we need some handling during reattach
operation where we do MapViewOfFileEx at the predefined location?




-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Mon, Jun 5, 2017 at 1:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 5, 2017 at 4:58 PM, Magnus Hagander <magnus@hagander.net> wrote:
> On Mon, Jun 5, 2017 at 1:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Jun 5, 2017 at 9:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Amit Kapila <amit.kapila16@gmail.com> writes:
>> >
>> >> I think the same problem can happen during reattach as well.
>> >> Basically, MapViewOfFileEx can fail to load image at predefined
>> >> address (UsedShmemSegAddr).
>> >
>> > Once we've successfully done the VirtualAllocEx call, that should hold
>> > until the VirtualFree call in PGSharedMemoryReAttach, allowing the
>> > immediately-following MapViewOfFileEx call to succeed.  Were that not
>> > the
>> > case, we'd have had problems even without ASLR.  We did have problems
>> > exactly like that before the pgwin32_ReserveSharedMemoryRegion code was
>> > added.
>> >
>>
>> I could not find anything directly in specs which could prove the
>> theory either way.  However, in one of the StackOverflow discussions,
>> it has been indicated that MapViewOfFile can opt to load the image at
>> an address different than the predefined address due to ASLR.
>>
>> >  So my feeling about this is that retrying the process creation as
>> > in my prototype patch ought to be sufficient; if you think it isn't, the
>> > burden of proof is on you.
>> >
>>
>> Sure.  I think it is slightly tricky because specs don't say clearly
>> how ASLR can impact the behavior of any API and in my last attempt I
>> could not reproduce the issue.
>>
>> I can try to do basic verification with the patch you have proposed,
>> but I fear, to do the actual tests as suggested by you is difficult as
>> it is not reproducible on my machine, but I can still try.
>>
>>
>> [1] -
>> https://stackoverflow.com/questions/9718616/what-does-mapviewoffile-return/11233456
>> Refer below text:
>>
>> "Yes, MapViewOfFile returns the virtual memory base address where the
>> image has been loaded. The value (content) of this address depends on
>> whether the image has been successfully loaded at its predefined
>> address (which has been setup by the linker) or whether the image has
>> been relocated (because the desired, predefined address is already
>> occupied or because the image has opt-in to support ASLR)."
>>
>
> That statements refers to mapping executables though, like DLL and EXE. Not
> mapping of data segments.
>
> It does randomize the entire location of the heap, in which case it might
> also change. But not for the individual block.
>
> But in neither of those cases does it help to retry without restarting the
> process,
>

Okay, the question here is do we need some handling during reattach
operation where we do MapViewOfFileEx at the predefined location?


As long as we restart in a new process, I don't think so. ASLR does not randomize away our addresses *after we've been given them*. It does it at process (or thread start), so as long as we get it initially, I think we are safe.

(Obviously nobody can be 100% sure without seeing the source code for it, but all references I've seen to the implementation seem to indicate that)

--
Amit Kapila <amit.kapila16@gmail.com> writes:
> Sure.  I think it is slightly tricky because specs don't say clearly
> how ASLR can impact the behavior of any API and in my last attempt I
> could not reproduce the issue.

> I can try to do basic verification with the patch you have proposed,
> but I fear, to do the actual tests as suggested by you is difficult as
> it is not reproducible on my machine, but I can still try.

Yeah, being able to reproduce the problem reliably enough to say whether
it's fixed or not is definitely the sticking point here.  I have some
ideas about that:

1. Be sure to use Win32 not Win64 --- the odds of a failure in the larger
address space are so small you'd never prove anything.  (And of course
it has to be a version that has ASLR enabled.)

2. Revert 7f3e17b48 so that you have an ASLR-enabled build.

3. Crank shared_buffers to the maximum the machine will allow, reducing
the amount of free address space and improving the odds of a collision.

4. Spawn lots of sessions --- pgbench with -C option might be a useful
testing tool.

With luck, that will get failures often enough that you can be pretty
sure whether a patch improves the situation or not.
        regards, tom lane



On Mon, Jun 5, 2017 at 7:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> Sure.  I think it is slightly tricky because specs don't say clearly
>> how ASLR can impact the behavior of any API and in my last attempt I
>> could not reproduce the issue.
>
>> I can try to do basic verification with the patch you have proposed,
>> but I fear, to do the actual tests as suggested by you is difficult as
>> it is not reproducible on my machine, but I can still try.
>
> Yeah, being able to reproduce the problem reliably enough to say whether
> it's fixed or not is definitely the sticking point here.

Agreed.  By the way, while browsing about this problem, I found that
one other open source (nginx) has used a solution similar to what
Andres was proposing upthread to solve this problem.  Refer:
https://github.com/nginx/nginx/commit/2ec8cfcd7d34415a99c3f3db3024ae954c00d0dd

Just to be clear, by above, I don't mean to say that if some other
open source is using some solution, we should also use it, but I think
it is worth considering (especially if it is a proven solution - just
saying based on the time (2015) it has been committed and sustained in
the code).

>  I have some
> ideas about that:
>
> 1. Be sure to use Win32 not Win64 --- the odds of a failure in the larger
> address space are so small you'd never prove anything.  (And of course
> it has to be a version that has ASLR enabled.)
>

I don't have access to Win32, so I request others who are reading this
and have access to Win32 to see if they can help in reproducing the
problem.

> 2. Revert 7f3e17b48 so that you have an ASLR-enabled build.
>
> 3. Crank shared_buffers to the maximum the machine will allow, reducing
> the amount of free address space and improving the odds of a collision.
>
> 4. Spawn lots of sessions --- pgbench with -C option might be a useful
> testing tool.
>

All are very good points and I think together they will certainly
improve the chances of reproducing this problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Mon, Jun 5, 2017 at 10:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Agreed.  By the way, while browsing about this problem, I found that
> one other open source (nginx) has used a solution similar to what
> Andres was proposing upthread to solve this problem.  Refer:
> https://github.com/nginx/nginx/commit/2ec8cfcd7d34415a99c3f3db3024ae954c00d0dd
>
> Just to be clear, by above, I don't mean to say that if some other
> open source is using some solution, we should also use it, but I think
> it is worth considering (especially if it is a proven solution - just
> saying based on the time (2015) it has been committed and sustained in
> the code).

I think the idea of retrying process creation (and I definitely agree
with Tom and Magnus that we have to retry process creation, not just
individual mappings) is a good place to start.  Now if we find that we
are having to retry frequently, then I think we might need to try
something along the lines of what Andres proposed and what nginx
apparently did.  However, any fixed address will be prone to
occasional failures (or maybe, on some systems, regular failures) if
that particular address happens to get claimed by something.  I don't
think we can say that there is any address where that definitely won't
happen.  So I would say let's do this retry thing first, and then if
that proves inadequate, we can also try moving the mappings to a range
where conflicts are less likely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> I think the idea of retrying process creation (and I definitely agree
> with Tom and Magnus that we have to retry process creation, not just
> individual mappings) is a good place to start.  Now if we find that we
> are having to retry frequently, then I think we might need to try
> something along the lines of what Andres proposed and what nginx
> apparently did.  However, any fixed address will be prone to
> occasional failures (or maybe, on some systems, regular failures) if
> that particular address happens to get claimed by something.  I don't
> think we can say that there is any address where that definitely won't
> happen.  So I would say let's do this retry thing first, and then if
> that proves inadequate, we can also try moving the mappings to a range
> where conflicts are less likely.

By definition, the address range we're trying to reuse worked successfully
in the postmaster process.  I don't see how forcing a specific address
could do anything but create an additional risk of postmaster startup
failure.
        regards, tom lane



On Tue, Jun 6, 2017 at 12:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> By definition, the address range we're trying to reuse worked successfully
> in the postmaster process.  I don't see how forcing a specific address
> could do anything but create an additional risk of postmaster startup
> failure.

If the postmaster picked an address where other things are unlikely to
get loaded, then that would increase the chances of child processes
finding it available, wouldn't it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 6, 2017 at 12:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> By definition, the address range we're trying to reuse worked successfully
>> in the postmaster process.  I don't see how forcing a specific address
>> could do anything but create an additional risk of postmaster startup
>> failure.

> If the postmaster picked an address where other things are unlikely to
> get loaded, then that would increase the chances of child processes
> finding it available, wouldn't it?

But how would we know that a particular address range is more unlikely
than others to have a conflict?  (And even if we do know that, what
happens when there is a conflict anyway?)  I sure don't want to be in
the business of figuring out what to use across all the different Windows
versions there are, to say nothing of the different antivirus products
that might be causing the problem.

Also, the big picture here is that we ought to be working towards allowing
our Windows builds to use ASLR; our inability to support that is not
something to be proud of in 2017.  No predetermined-address scheme is
likely to be helpful for that.
        regards, tom lane



On Tue, Jun 6, 2017 at 1:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jun 6, 2017 at 12:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> By definition, the address range we're trying to reuse worked successfully
>>> in the postmaster process.  I don't see how forcing a specific address
>>> could do anything but create an additional risk of postmaster startup
>>> failure.
>
>> If the postmaster picked an address where other things are unlikely to
>> get loaded, then that would increase the chances of child processes
>> finding it available, wouldn't it?
>
> But how would we know that a particular address range is more unlikely
> than others to have a conflict? (And even if we do know that, what
> happens when there is a conflict anyway?)  I sure don't want to be in
> the business of figuring out what to use across all the different Windows
> versions there are, to say nothing of the different antivirus products
> that might be causing the problem.

I'm not quite sure how to respond to this.  How do we know any piece
of information about anything, ever?  Sometimes we figure it out by
looking for relevant sources using, say, Google, and other times we
determine it by experiment or from first principles.  You and Andres
were having a discussion earlier about gathering this exact
information, so apparently you thought it might be possible back then:

https://www.postgresql.org/message-id/4180.1492292046%40sss.pgh.pa.us

Now, that having been said, I agree that no address range is perfectly
safe (and that's why it's good to have retries).  I also agree that
this is likely to be heavily platform-dependent, which is why I wrote
DSM the way that I did instead of (as Noah was advocating) trying to
solve the problem of getting a constant mapping across all processes
in a parallel group.  But since nobody's keen on the idea of trying to
tolerate having the main shared memory segment at different addresses
in different processes, we'll have to come up with some other solution
for that case.  If the retry thing doesn't plug the hole adequately,
trying to put the initial allocation in some place that's less likely
to induce conflicts seems like the next thing to try.

> Also, the big picture here is that we ought to be working towards allowing
> our Windows builds to use ASLR; our inability to support that is not
> something to be proud of in 2017.  No predetermined-address scheme is
> likely to be helpful for that.

Sure, retrying is better for that as well, as I already said upthread,
but that doesn't mean that putting the initial mapping someplace less
likely to conflict couldn't reduce the need for retries.

The even-bigger picture here is that both this issue and the need for
DSA and DSM are due to the fact that we've picked an unpopular
programming model with poor operating system support.   If we switch
from using processes to using threads, we don't have to deal with all
of this nonsense any more, and we'd solve some other problems, too -
e.g. at least on Windows, I think backend startup would get quite a
bit faster.  Obviously, anybody else who is using processes + shared
memory is going to run into the same problems we're hitting, and if
operating system manufacturers wanted to make this kind of programming
easy, they could do it.  We're expending a lot of effort here because
we're swimming against the current.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Tue, Jun 6, 2017 at 10:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think the idea of retrying process creation (and I definitely agree
>> with Tom and Magnus that we have to retry process creation, not just
>> individual mappings) is a good place to start.  Now if we find that we
>> are having to retry frequently, then I think we might need to try
>> something along the lines of what Andres proposed and what nginx
>> apparently did.  However, any fixed address will be prone to
>> occasional failures (or maybe, on some systems, regular failures) if
>> that particular address happens to get claimed by something.  I don't
>> think we can say that there is any address where that definitely won't
>> happen.  So I would say let's do this retry thing first, and then if
>> that proves inadequate, we can also try moving the mappings to a range
>> where conflicts are less likely.
>
> By definition, the address range we're trying to reuse worked successfully
> in the postmaster process.  I don't see how forcing a specific address
> could do anything but create an additional risk of postmaster startup
> failure.
>

I think it won't create an additional risk, because the idea is that
if we fail to map the shm segment at a predefined address, then we
will allow the system to choose the initial address as we are doing
now.  So, it can reduce chances of doing retries.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Tue, Jun 6, 2017 at 10:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> By definition, the address range we're trying to reuse worked successfully
>> in the postmaster process.  I don't see how forcing a specific address
>> could do anything but create an additional risk of postmaster startup
>> failure.

> I think it won't create an additional risk, because the idea is that
> if we fail to map the shm segment at a predefined address, then we
> will allow the system to choose the initial address as we are doing
> now.  So, it can reduce chances of doing retries.

[ shrug... ]  That would just make the patch even more complicated and
hard to test.  And it doesn't do anything to fix the ASLR issue.
Could we get on with trying to test something that *does* fix the
ASLR issue, like the draft patch I posted upthread?
        regards, tom lane



On Mon, Jun 05, 2017 at 09:56:33AM -0400, Tom Lane wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Sure.  I think it is slightly tricky because specs don't say clearly
> > how ASLR can impact the behavior of any API and in my last attempt I
> > could not reproduce the issue.
> 
> > I can try to do basic verification with the patch you have proposed,
> > but I fear, to do the actual tests as suggested by you is difficult as
> > it is not reproducible on my machine, but I can still try.
> 
> Yeah, being able to reproduce the problem reliably enough to say whether
> it's fixed or not is definitely the sticking point here.  I have some
> ideas about that:
> 
> 1. Be sure to use Win32 not Win64 --- the odds of a failure in the larger
> address space are so small you'd never prove anything.  (And of course
> it has to be a version that has ASLR enabled.)
> 
> 2. Revert 7f3e17b48 so that you have an ASLR-enabled build.
> 
> 3. Crank shared_buffers to the maximum the machine will allow, reducing
> the amount of free address space and improving the odds of a collision.
> 
> 4. Spawn lots of sessions --- pgbench with -C option might be a useful
> testing tool.
> 
> With luck, that will get failures often enough that you can be pretty
> sure whether a patch improves the situation or not.

I tried this procedure without finding a single failure.  Attributes:

- 32-bit build of commit fad7873 w/ 7f3e17b48 reverted
- confirmed ASLR-enabled with "dumpbin /headers postgres.exe"
- OS = 64-bit Windows Server 2016
- compiler = Visual Studio 2015 Express
- no config.pl, so not linked with any optional library
- tried shared_buffers=1100M and shared_buffers=24M
- echo 'select 1;' >pgbench-trivial; pgbench -n -f pgbench-trivial -C -c 30 -j5 -T900
- tried starting as a service at boot time, in addition to manual start

=== postgresql.conf ===
log_connections = on
log_line_prefix = '%p %m '
autovacuum = off
listen_addresses = '127.0.0.1'
log_min_messages = debug1
max_connections = 40
shared_buffers = 24MB
deadlock_timeout = '20ms'
wal_buffers = '16MB'
fsync = off


I watched the ensuing memory maps, which led me to these interesting facts:
 An important result of the ASLR design in Windows Vista is that some address space layout parameters, such as PEB,
stack,and heap locations, are selected once per program execution. Other parameters, such as the location of the
programcode, data segment, BSS segment, and libraries, change only between reboots. ... This offset is selected once
perreboot, although we’ve uncovered at least one other way to cause this offset to be reset without a reboot (see
AppendixII). -- http://www.symantec.com/avcenter/reference/Address_Space_Layout_Randomization.pdf 
 

So, reattach failures might be reproducible on some reboots and not others.
While I had a few reboots during the test work, I did not exercise that
dimension systematically.  This information also implies we should not
re-enable ASLR, even if your patch helps, due to addresses that change less
than once per process creation but occasionally more than once per boot.


I did try the combination of your patch and the following to simulate a 95%
failure rate:

--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -410,2 +410,4 @@ pgwin32_ReserveSharedMemoryRegion(HANDLE hChild)                             MEM_RESERVE,
PAGE_READWRITE);
+    if (random() > MAX_RANDOM_VALUE / 20)
+        address = NULL;    if (address == NULL)

This confirmed retries worked.  During a 900s test run, one connection attempt
failed permanently by exhausting its 100 retries.  The run achieved 11
connections per second.  That is an order of magnitude slower than a run
without simulated failures, but most applications would tolerate it.

I recommend pushing your patch so the August back-branch releases have it.
One can see by inspection that your patch has negligible effect on systems
healthy today.  I have a reasonable suspicion it will help some systems.  If
others remain broken after this, that fact will provide a useful clue.

Thanks,
nm



On Mon, Jul 10, 2017 at 3:36 PM, Noah Misch <noah@leadboat.com> wrote:
> I recommend pushing your patch so the August back-branch releases have it.
> One can see by inspection that your patch has negligible effect on systems
> healthy today.  I have a reasonable suspicion it will help some systems.  If
> others remain broken after this, that fact will provide a useful clue.

+1. Thanks for taking the time to test and look at the patch.
-- 
Michael



Noah Misch <noah@leadboat.com> writes:
> On Mon, Jun 05, 2017 at 09:56:33AM -0400, Tom Lane wrote:
>> Yeah, being able to reproduce the problem reliably enough to say whether
>> it's fixed or not is definitely the sticking point here.  I have some
>> ideas about that: ...

> I tried this procedure without finding a single failure.

Thanks for testing!  But apparently we still lack some critical part
of the recipe for getting failures in the wild.

> I watched the ensuing memory maps, which led me to these interesting facts:

>   An important result of the ASLR design in Windows Vista is that some address
>   space layout parameters, such as PEB, stack, and heap locations, are
>   selected once per program execution. Other parameters, such as the location
>   of the program code, data segment, BSS segment, and libraries, change only
>   between reboots.
>   ...
>   This offset is selected once per reboot, although we’ve uncovered at least
>   one other way to cause this offset to be reset without a reboot (see
>   Appendix II).
>   -- http://www.symantec.com/avcenter/reference/Address_Space_Layout_Randomization.pdf

That is really interesting, though I'm not quite sure what to make of it.
It suggests though that you might need certain types of antivirus products
to be active, to create more variability in the initial process address
space layout than would happen from Windows ASLR alone.

> I recommend pushing your patch so the August back-branch releases have it.
> One can see by inspection that your patch has negligible effect on systems
> healthy today.  I have a reasonable suspicion it will help some systems.  If
> others remain broken after this, that fact will provide a useful clue.

Okay, so that leaves us with a decision to make: push it into beta2, or
wait till after wrap?  I find it pretty scary to push a patch with
portability implications so soon before wrap, but a quick look at the
buildfarm suggests we'd get runs from 3 or 4 Windows members before the
wrap, if I do it quickly.  If we wait, then it will hit the field in
production releases with reasonable buildfarm testing but little more.
That's also pretty scary.

On balance I'm tempted to push it now for beta2, but it's a close call.
Thoughts?
        regards, tom lane





On Jul 10, 2017 16:08, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Noah Misch <noah@leadboat.com> writes:
> On Mon, Jun 05, 2017 at 09:56:33AM -0400, Tom Lane wrote:
>> Yeah, being able to reproduce the problem reliably enough to say whether
>> it's fixed or not is definitely the sticking point here.  I have some
>> ideas about that: ...

> I tried this procedure without finding a single failure.

Thanks for testing!  But apparently we still lack some critical part
of the recipe for getting failures in the wild.

> I watched the ensuing memory maps, which led me to these interesting facts:

>   An important result of the ASLR design in Windows Vista is that some address
>   space layout parameters, such as PEB, stack, and heap locations, are
>   selected once per program execution. Other parameters, such as the location
>   of the program code, data segment, BSS segment, and libraries, change only
>   between reboots.
>   ...
>   This offset is selected once per reboot, although we’ve uncovered at least
>   one other way to cause this offset to be reset without a reboot (see
>   Appendix II).
>   -- http://www.symantec.com/avcenter/reference/Address_Space_Layout_Randomization.pdf

That is really interesting, though I'm not quite sure what to make of it.
It suggests though that you might need certain types of antivirus products
to be active, to create more variability in the initial process address
space layout than would happen from Windows ASLR alone.

> I recommend pushing your patch so the August back-branch releases have it.
> One can see by inspection that your patch has negligible effect on systems
> healthy today.  I have a reasonable suspicion it will help some systems.  If
> others remain broken after this, that fact will provide a useful clue.

Okay, so that leaves us with a decision to make: push it into beta2, or
wait till after wrap?  I find it pretty scary to push a patch with
portability implications so soon before wrap, but a quick look at the
buildfarm suggests we'd get runs from 3 or 4 Windows members before the
wrap, if I do it quickly.  If we wait, then it will hit the field in
production releases with reasonable buildfarm testing but little more.
That's also pretty scary.

On balance I'm tempted to push it now for beta2, but it's a close call.
Thoughts?

Given the lack of windows testing on non packaged releases I think we should definitely push this for beta2. That will give it a much better real world testing on what is still a beta. 

If it breaks its a lot better to do it in beta2 (and possibly quickly roll a beta3) than in production. 

/Magnus 


Magnus Hagander <magnus@hagander.net> writes:
> On Jul 10, 2017 16:08, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> Okay, so that leaves us with a decision to make: push it into beta2, or
>> wait till after wrap?  I find it pretty scary to push a patch with
>> portability implications so soon before wrap, but a quick look at the
>> buildfarm suggests we'd get runs from 3 or 4 Windows members before the
>> wrap, if I do it quickly.  If we wait, then it will hit the field in
>> production releases with reasonable buildfarm testing but little more.
>> That's also pretty scary.
>> On balance I'm tempted to push it now for beta2, but it's a close call.
>> Thoughts?

> Given the lack of windows testing on non packaged releases I think we
> should definitely push this for beta2. That will give it a much better real
> world testing on what is still a beta.
> If it breaks its a lot better to do it in beta2 (and possibly quickly roll
> a beta3) than in production.

Shall we go for broke and also remove the ASLR-disabling patch in beta2?
I do not feel a need to back-patch that one, at least not yet.  But if
we're going to do it any time soon, a beta seems like the time.
        regards, tom lane



On Mon, Jul 10, 2017 at 10:08:53AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > I recommend pushing your patch so the August back-branch releases have it.
> > One can see by inspection that your patch has negligible effect on systems
> > healthy today.  I have a reasonable suspicion it will help some systems.  If
> > others remain broken after this, that fact will provide a useful clue.
> 
> Okay, so that leaves us with a decision to make: push it into beta2, or
> wait till after wrap?  I find it pretty scary to push a patch with
> portability implications so soon before wrap, but a quick look at the
> buildfarm suggests we'd get runs from 3 or 4 Windows members before the
> wrap, if I do it quickly.  If we wait, then it will hit the field in
> production releases with reasonable buildfarm testing but little more.
> That's also pretty scary.
> 
> On balance I'm tempted to push it now for beta2, but it's a close call.
> Thoughts?

Pushing now for beta2 sounds good.



On Mon, Jul 10, 2017 at 10:46:09AM -0400, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > On Jul 10, 2017 16:08, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> >> Okay, so that leaves us with a decision to make: push it into beta2, or
> >> wait till after wrap?  I find it pretty scary to push a patch with
> >> portability implications so soon before wrap, but a quick look at the
> >> buildfarm suggests we'd get runs from 3 or 4 Windows members before the
> >> wrap, if I do it quickly.  If we wait, then it will hit the field in
> >> production releases with reasonable buildfarm testing but little more.
> >> That's also pretty scary.
> >> On balance I'm tempted to push it now for beta2, but it's a close call.
> >> Thoughts?
> 
> > Given the lack of windows testing on non packaged releases I think we
> > should definitely push this for beta2. That will give it a much better real
> > world testing on what is still a beta.
> > If it breaks its a lot better to do it in beta2 (and possibly quickly roll
> > a beta3) than in production.
> 
> Shall we go for broke and also remove the ASLR-disabling patch in beta2?
> I do not feel a need to back-patch that one, at least not yet.  But if
> we're going to do it any time soon, a beta seems like the time.

As I mentioned in my message eight hours ago, no.



Noah Misch <noah@leadboat.com> writes:
> On Mon, Jul 10, 2017 at 10:46:09AM -0400, Tom Lane wrote:
>> Shall we go for broke and also remove the ASLR-disabling patch in beta2?

> As I mentioned in my message eight hours ago, no.

Ah, sorry, I'd managed to swap out that bit of info already.  However,
I've now gone and read the Symantec paper you pointed to.  It's pretty
interesting, although I think it should not be treated as gospel.
For one thing, it's ten years old, and for another, it's describing
the behavior of Windows Vista; one might reasonably assume that Microsoft
has improved some of this stuff since then.

I noted with particular interest that the PEB (Process Environment Block)
location is stated to be randomized whether or not you have ASLR enabled,
although apparently there are only a small number of possible locations.
I wonder whether that could explain anything.  If it is affecting us
then the retry logic should fix it.

Anyway, the point at hand is Symantec's claim that there's a global
offset that normally only changes at reboot but might change at other
times.  I'm inclined to take that with a really large grain of salt.
For instance, here's a flat denial of that from a Microsoft person:
https://blogs.msdn.microsoft.com/oldnewthing/20170118-00/?p=95205
with some interesting stuff in the comments too, particularly about
how Windows attempts to load DLLs at the same location in all
processes so as to share page table entries.

Looking at Symantec's appendix II, which is a script that claims to
provoke changes in this global offset, what it's doing is creating,
executing, and deleting files over and over.

It seems plausible to me that Windows intends to assign a randomized
address to an executable or DLL once per boot, or at least once per
loading of such a file, and that what Symantec's script is doing is
defeating Windows' recognition that the "same" executable is being
run each time.  Without games like that, the executable image would
persist at the same base address, at least till it got swapped out,
allowing it to appear at the same address across multiple executions.

If things do work more or less like that, then it's likely that it would
have no effect on us, because the continuous existence of the postmaster
process would lock the PG executable as well as all DLLs loaded by the
postmaster into addresses that would not change for the life of the
postmaster.  Under that theory, what might bite us is randomization of the
PEB, stack, or heap (which are stated to be re-done every time); or some
antivirus code that's doing something weird.  It's entirely possible that
security-oriented code might take measures to ensure that it gets mapped
into different addresses in every process even though Windows wouldn't
normally do that.  But all of these cases would presumably yield to retry
of the process launch.

In short, I don't really buy that we should be afraid to enable ASLR
because of this Symantec paper.
        regards, tom lane