Thread: [BUGS] Replication to Postgres 10 on Windows is broken

[BUGS] Replication to Postgres 10 on Windows is broken

From
"Augustine, Jobin"
Date:
Summary:
Replica / Slave on Windows fails to conenct to Master.

Scenario:
Master: PostgreSQL 10 (beta1,beta2), Any OS (Linux, Windows, Mac)
Slave : PostgreSQL 10 (beta1,beta2), Windows (Any MS Windows OS)
Slave must be on Windows Machine running on a different server than the Master


Steps to reproduce:
1. Setup a Postgres 10betaX master database can be on any platform (Linux/Windows/Mac)
2. Try to setup a streaming replica on any of the MS Windows running a different Server than the master
    (replication within the same server works!)


Error in Postgres logs:
Lines like below keep repeating in Slave side postgres log
----------------------
2017-08-03 10:49:41 UTC [2108]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
2017-08-03 10:49:45 UTC [3600]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
2017-08-03 10:49:50 UTC [4832]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
-------------------------

Additional Information:
Stand alone tools like psql, pg_basebackup, pg_recivewal etc  are successful in connecting to primary server and perform their functionalities.
Postgres 9.6 replication works fine across same set of servers.
This Windows specific issue reproducible across pg10beta1 and beta2
This is tested and confirmed by multiple people from my team across different Windows versions.


-Jobin Augustine

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Igor Neyman
Date:

 

From: pgsql-bugs-owner@postgresql.org [mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Augustine, Jobin
Sent: Friday, August 04, 2017 1:54 AM
To: pgsql-bugs@postgresql.org
Subject: [BUGS] Replication to Postgres 10 on Windows is broken

 

Attention: This email was sent from someone outside of Perceptron. Always exercise caution when opening attachments or clicking links from unknown senders or when receiving unexpected emails.

 

Summary:
Replica / Slave on Windows fails to conenct to Master.

Scenario:
Master: PostgreSQL 10 (beta1,beta2), Any OS (Linux, Windows, Mac)
Slave : PostgreSQL 10 (beta1,beta2), Windows (Any MS Windows OS)
Slave must be on Windows Machine running on a different server than the Master


Steps to reproduce:
1. Setup a Postgres 10betaX master database can be on any platform (Linux/Windows/Mac)
2. Try to setup a streaming replica on any of the MS Windows running a different Server than the master
    (replication within the same server works!)


Error in Postgres logs:
Lines like below keep repeating in Slave side postgres log
----------------------
2017-08-03 10:49:41 UTC [2108]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
2017-08-03 10:49:45 UTC [3600]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
2017-08-03 10:49:50 UTC [4832]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
-------------------------

Additional Information:
Stand alone tools like psql, pg_basebackup, pg_recivewal etc  are successful in connecting to primary server and perform their functionalities.
Postgres 9.6 replication works fine across same set of servers.
This Windows specific issue reproducible across pg10beta1 and beta2
This is tested and confirmed by multiple people from my team across different Windows versions.

-Jobin Augustine

 

Using PG10 Beta1 and BETA2 I’m getting the same error:

 

Socket is not connected (0x00002749/10057)
    could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)

 

with logical replication in Windows environment, when I’m trying to Create Subscription, even though any other connection from Subscriber server to Publisher server works quite fine, for instance there is no errors when using Postgres_fdw to connect from Subscriber server to Publisher server.

I reported this issue on this forum: Bug #14669, but didn’t get any replies.

 

Must be a common issue between streaming and logical replication in PG10 BETA in Windows environment.

I’d appreciate if someone from 2ndQardrant could take a look at this issue.

 

Regards,

Igor Neyman

 

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Noah Misch
Date:
On Fri, Aug 04, 2017 at 11:23:43AM +0530, Augustine, Jobin wrote:
> *Summary:*
> Replica / Slave on Windows fails to conenct to Master.
> 
> *Scenario:*
> Master: PostgreSQL 10 (beta1,beta2), Any OS (Linux, Windows, Mac)
> Slave : PostgreSQL 10 (beta1,beta2), Windows (Any MS Windows OS)
> Slave must be on Windows Machine running on a different server than the
> Master
> 
> 
> *Steps to reproduce:*
> 1. Setup a Postgres 10betaX master database can be on any platform
> (Linux/Windows/Mac)
> 2. Try to setup a streaming replica on any of the MS Windows running a
> different Server than the master

I've added this as an open item.  Confirmed in this setup:

-- Server
Debian 9
PostgreSQL 10beta2 from apt.postgresql.org
edit postgresql.conf to allow remote connections
edit pg_hba.conf to allow remote connections
create replication user w/ password

-- Client
Windows Server 2016
postgresql-10.0-beta2-windows-x64-binaries.zip from EnterpriseDB
pg_basebackup -h myserver -D datadir -X stream -R
install ordinary *.conf
postgres -D datadir

> Stand alone tools like psql, pg_basebackup, pg_recivewal etc  are
> successful in connecting to primary server and perform their
> functionalities.

I can confirm each of those, too.

I don't, however, see a smoking gun among commits.  Would you bisect the
commits since 9.6 and see which one broke things?


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> I've added this as an open item.  Confirmed in this setup:

> -- Client
> Windows Server 2016
> postgresql-10.0-beta2-windows-x64-binaries.zip from EnterpriseDB

I wonder whether the other complainants were using EDB's build,
and if not, just what were they using.  The indirect question is:
what version of OpenSSL is the Windows build using?

> I don't, however, see a smoking gun among commits.  Would you bisect the
> commits since 9.6 and see which one broke things?

Gut instinct says that the reason this case fails when other tools
can connect successfully is that libpqwalreceiver is the only tool
that uses PQconnectStart/PQconnectPoll rather than a plain
PQconnectdb, and that there is some behavioral difference between
connectDBComplete's wait loop and libpqrcv_connect's wait loop that
OpenSSL is sensitive to --- but only on Windows, and maybe only on
particular OpenSSL versions.

I have some memory of having tweaked libpqrcv_connect's loop once
already, but I can't find a commit for that right now.

A different line of thought is that maybe we broke it with the
changes for OpenSSL 1.1.0.  Again, it would be very important
to know which OpenSSL version we're dealing with here.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Noah Misch
Date:
On Sun, Aug 06, 2017 at 11:17:57AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > I've added this as an open item.  Confirmed in this setup:
> 
> > -- Client
> > Windows Server 2016
> > postgresql-10.0-beta2-windows-x64-binaries.zip from EnterpriseDB
> 
> I wonder whether the other complainants were using EDB's build,
> and if not, just what were they using.  The indirect question is:
> what version of OpenSSL is the Windows build using?

Those binaries I used have OpenSSL 1.0.2l.

> > I don't, however, see a smoking gun among commits.  Would you bisect the
> > commits since 9.6 and see which one broke things?
> 
> Gut instinct says that the reason this case fails when other tools
> can connect successfully is that libpqwalreceiver is the only tool
> that uses PQconnectStart/PQconnectPoll rather than a plain
> PQconnectdb, and that there is some behavioral difference between
> connectDBComplete's wait loop and libpqrcv_connect's wait loop that

That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
callers outside of libpq itself.

> OpenSSL is sensitive to --- but only on Windows, and maybe only on
> particular OpenSSL versions.

The failure is so early, before pgtls_init(), that I doubt OpenSSL version is
a factor.


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
I wrote:
> Gut instinct says that the reason this case fails when other tools
> can connect successfully is that libpqwalreceiver is the only tool
> that uses PQconnectStart/PQconnectPoll rather than a plain
> PQconnectdb, and that there is some behavioral difference between
> connectDBComplete's wait loop and libpqrcv_connect's wait loop that
> OpenSSL is sensitive to --- but only on Windows, and maybe only on
> particular OpenSSL versions.

On closer inspection, I take that back.  This can't be directly
OpenSSL's fault, because those error messages come out before libpq
has invoked OpenSSL at all; in particular we see

2017-08-03 10:49:41 UTC [2108]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not
senddata to server: Socket is not connected (0x00002749/10057)   could not send SSL negotiation packet: Socket is not
connected
(0x00002749/10057)

and "could not send SSL negotiation packet" certainly must occur
before we've asked OpenSSL to do anything.

What seems likely to me at this point is that the changes in
PQconnectPoll() to support multiple hosts are somehow responsible.
It must still be connected to libpqwalreceiver's different wait loop,
but the details are unclear.

It would likely be useful to add some debug logging to PQconnectPoll
to find out what set of addresses it's seeing and whether this failure
occurs after having advanced over some of them.
        regards, tom lane


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Sun, Aug 06, 2017 at 11:17:57AM -0400, Tom Lane wrote:
>> Gut instinct says that the reason this case fails when other tools
>> can connect successfully is that libpqwalreceiver is the only tool
>> that uses PQconnectStart/PQconnectPoll rather than a plain
>> PQconnectdb, and that there is some behavioral difference between
>> connectDBComplete's wait loop and libpqrcv_connect's wait loop that

> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> callers outside of libpq itself.

Yeah.  After some digging around I think I see exactly what is happening.
The error message would be better read as "Socket is not connected *yet*",
that is, the problem is that we're trying to write data before the
nonblocking connection request has completed.  (This fits with the OP's
observation that local loopback connections work fine --- they probably
complete immediately.)  PQconnectPoll believes that it just has to wait
for write-ready when waiting for a connection to complete.  When using
connectDBComplete's wait loop, that reduces to a call to Windows' version
of select(2), in pqSocketPoll, and according to

https://msdn.microsoft.com/en-us/library/windows/desktop/ms740141(v=vs.85).aspx

"The parameter writefds identifies the sockets that are to be checked for
writability. If a socket is processing a connect call (nonblocking), a
socket is writeable if the connection establishment successfully
completes."

On the other hand, in libpqwalreceiver, we're depending on latch.c's
implementation, and it uses WSAEventSelect's FD_WRITE event:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576(v=vs.85).aspx

If I'm reading that correctly, FD_WRITE is set instantly by the connect
request, probably even in the nonblock case, and it only gets cleared
by a failed write request.  It looks to me like we would have to
specifically look for FD_CONNECT, *not* FD_WRITE, to make this work.

This is problematic, because the APIs in between don't provide a way
to report that we're still waiting for connect rather than for
data-write-ready.  Anybody have the stomach for extending PQconnectPoll's
API with an extra PGRES_POLLING_CONNECTING state?  If not, can we tell in
WaitEventAdjustWin32 that the socket is still connecting and we must
substitute FD_CONNECT for FD_WRITE?
        regards, tom lane


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
Hi,

On 2017-08-06 12:29:07 -0400, Tom Lane wrote:
> Yeah.  After some digging around I think I see exactly what is happening.
> The error message would be better read as "Socket is not connected *yet*",
> that is, the problem is that we're trying to write data before the
> nonblocking connection request has completed.  (This fits with the OP's
> observation that local loopback connections work fine --- they probably
> complete immediately.)  PQconnectPoll believes that it just has to wait
> for write-ready when waiting for a connection to complete.  When using
> connectDBComplete's wait loop, that reduces to a call to Windows' version
> of select(2), in pqSocketPoll, and according to
> 
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms740141(v=vs.85).aspx
> 
> "The parameter writefds identifies the sockets that are to be checked for
> writability. If a socket is processing a connect call (nonblocking), a
> socket is writeable if the connection establishment successfully
> completes."
> 
> On the other hand, in libpqwalreceiver, we're depending on latch.c's
> implementation, and it uses WSAEventSelect's FD_WRITE event:
> 
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576(v=vs.85).aspx
> 
> If I'm reading that correctly, FD_WRITE is set instantly by the connect
> request, probably even in the nonblock case, and it only gets cleared
> by a failed write request.  It looks to me like we would have to
> specifically look for FD_CONNECT, *not* FD_WRITE, to make this work.

Nice digging.


> This is problematic, because the APIs in between don't provide a way
> to report that we're still waiting for connect rather than for
> data-write-ready.  Anybody have the stomach for extending PQconnectPoll's
> API with an extra PGRES_POLLING_CONNECTING state?

I'm a bit hesitant to do so at this phase of the release cycle, it'd
kind of force all users to upgrade their code, and I'm sure there's a
couple out-of-tree ones. And not just code explicitly using new versions
of libpq, also users of old versions - several distributions just
install newer libpq versions and rely on it being compatible.


> If not, can we tell in
> WaitEventAdjustWin32 that the socket is still connecting and we must
> substitute FD_CONNECT for FD_WRITE?

I was wondering, for a second, if we should just always use FD_CONNECT
once in every set. But unfortunately there's plenty places that
create/destroy sets at a high enough speed for that to not be a nice
solution.

A third solution would be to, for v10, add a #ifdef WIN32 block to
libpqrcv_connect() that just waits till FD_CONNECT is ready. That has
the disadvantage of not accepting interrupts, but still seems better
than not working at all.  That's not much of a real solution, but this
late in the cycle it might be advisable to hold our noses :(

Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-08-06 12:29:07 -0400, Tom Lane wrote:
>> This is problematic, because the APIs in between don't provide a way
>> to report that we're still waiting for connect rather than for
>> data-write-ready.  Anybody have the stomach for extending PQconnectPoll's
>> API with an extra PGRES_POLLING_CONNECTING state?

> I'm a bit hesitant to do so at this phase of the release cycle,

I'm pretty hesitant to do that at *any* phase of the release cycle,
it's just too likely to break user code.

However, there's always more than one way to skin the cat.  What we need
to know is whether PQconnectPoll's state machine is in CONNECTION_STARTED
state, and PQstatus() does expose that, even though it claims it's an
internal state that callers shouldn't rely on.  So here's a sketch:

* Invent a separate WL_SOCKET_CONNECTED flag for WaitLatchOrSocket.
On Unix we can treat that the same as WL_SOCKET_WRITEABLE, on Windows
it maps to FD_CONNECT.

* In libpqwalreceiver, do something like
       io_flag = (status == PGRES_POLLING_READING                  ? WL_SOCKET_READABLE
#ifdef WIN32                  : PQstatus(conn->streamConn) == CONNECTION_STARTED       ? WL_SOCKET_CONNECTED
#endif                  : WL_SOCKET_WRITEABLE);

The #ifdef's optional, but no reason to spend extra cycles if we don't
have to.  (Not sure if this is adequately parenthesized...)
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
Hi,

On 2017-08-06 13:34:42 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-08-06 12:29:07 -0400, Tom Lane wrote:
> >> This is problematic, because the APIs in between don't provide a way
> >> to report that we're still waiting for connect rather than for
> >> data-write-ready.  Anybody have the stomach for extending PQconnectPoll's
> >> API with an extra PGRES_POLLING_CONNECTING state?
> 
> > I'm a bit hesitant to do so at this phase of the release cycle,
> 
> I'm pretty hesitant to do that at *any* phase of the release cycle,
> it's just too likely to break user code.

Right, that's what I went on to say ;)


> However, there's always more than one way to skin the cat.  What we need
> to know is whether PQconnectPoll's state machine is in CONNECTION_STARTED
> state, and PQstatus() does expose that, even though it claims it's an
> internal state that callers shouldn't rely on.  So here's a sketch:
> 
> * Invent a separate WL_SOCKET_CONNECTED flag for WaitLatchOrSocket.
> On Unix we can treat that the same as WL_SOCKET_WRITEABLE, on Windows
> it maps to FD_CONNECT.

Yea, that should work, and could be useful-ish for things outside of
libpqwalreceiver. E.g. parallel fdw scan, that IIRC somebody is working
on, seems like it very well could run into similar issues if we want to
handle interrupts, which we should...


> * In libpqwalreceiver, do something like
> 
>         io_flag = (status == PGRES_POLLING_READING
>                    ? WL_SOCKET_READABLE
> #ifdef WIN32
>                    : PQstatus(conn->streamConn) == CONNECTION_STARTED
>            ? WL_SOCKET_CONNECTED
> #endif
>                    : WL_SOCKET_WRITEABLE);
> 
> The #ifdef's optional, but no reason to spend extra cycles if we don't
> have to.  (Not sure if this is adequately parenthesized...)

Let's use proper ifs ;)...


Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
Hi,


On 2017-08-06 10:48:22 -0700, Andres Freund wrote:
> Hi,
> 
> On 2017-08-06 13:34:42 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2017-08-06 12:29:07 -0400, Tom Lane wrote:
> > >> This is problematic, because the APIs in between don't provide a way
> > >> to report that we're still waiting for connect rather than for
> > >> data-write-ready.  Anybody have the stomach for extending PQconnectPoll's
> > >> API with an extra PGRES_POLLING_CONNECTING state?
> > 
> > > I'm a bit hesitant to do so at this phase of the release cycle,
> > 
> > I'm pretty hesitant to do that at *any* phase of the release cycle,
> > it's just too likely to break user code.
> 
> Right, that's what I went on to say ;)
> 
> 
> > However, there's always more than one way to skin the cat.  What we need
> > to know is whether PQconnectPoll's state machine is in CONNECTION_STARTED
> > state, and PQstatus() does expose that, even though it claims it's an
> > internal state that callers shouldn't rely on.  So here's a sketch:
> > 
> > * Invent a separate WL_SOCKET_CONNECTED flag for WaitLatchOrSocket.
> > On Unix we can treat that the same as WL_SOCKET_WRITEABLE, on Windows
> > it maps to FD_CONNECT.
> 
> Yea, that should work, and could be useful-ish for things outside of
> libpqwalreceiver. E.g. parallel fdw scan, that IIRC somebody is working
> on, seems like it very well could run into similar issues if we want to
> handle interrupts, which we should...
> 
> 
> > * In libpqwalreceiver, do something like
> > 
> >         io_flag = (status == PGRES_POLLING_READING
> >                    ? WL_SOCKET_READABLE
> > #ifdef WIN32
> >                    : PQstatus(conn->streamConn) == CONNECTION_STARTED
> >            ? WL_SOCKET_CONNECTED
> > #endif
> >                    : WL_SOCKET_WRITEABLE);
> > 
> > The #ifdef's optional, but no reason to spend extra cycles if we don't
> > have to.  (Not sure if this is adequately parenthesized...)
> 
> Let's use proper ifs ;)...

Is anybody working on a patch like this? I can try writing one blindly,
if somebody else tests it.  But if somebody natively on windows is
working on it, that's going to be more efficient.  It'd be nice to have
something merged by wrap tomorrow...

Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
On 2017-08-06 13:07:42 -0700, Andres Freund wrote:
> Is anybody working on a patch like this? I can try writing one blindly,
> if somebody else tests it.  But if somebody natively on windows is
> working on it, that's going to be more efficient.  It'd be nice to have
> something merged by wrap tomorrow...

Here's a prototype patch implementing what Tom outlined. I've compiled
it with mingw under linux, but haven't run it.  Could some windows
capable person take this for a spin?

Unfortunately we can't just push this to the BF and see what it says -
our tests don't catch this one... And there's not that much time before
the wrap.

Anybody have an opinion about adding ifs for WL_SOCKET_CONNECTED to
!win32 implementations rather than redefining it to WL_SOCKET_WRITEABLE?

- Andres

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Here's a prototype patch implementing what Tom outlined.

This bit is flat wrong:

-        int            io_flag;
+        int            io_flag = WL_POSTMASTER_DEATH | WL_LATCH_SET;

io_flag has to be *just* the I/O condition, because we use it in a test
after the WaitLatchOrSocket call.

> Anybody have an opinion about adding ifs for WL_SOCKET_CONNECTED to
> !win32 implementations rather than redefining it to WL_SOCKET_WRITEABLE?

I fear it would complicate matters greatly, because you'd have to figure
out which of the two flags to signal back after detecting socket writable.
I think defining it as equal to WL_SOCKET_WRITEABLE is fine.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
On 2017-08-06 18:04:49 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Here's a prototype patch implementing what Tom outlined.
> 
> This bit is flat wrong:
> 
> -        int            io_flag;
> +        int            io_flag = WL_POSTMASTER_DEATH | WL_LATCH_SET;
> 
> io_flag has to be *just* the I/O condition, because we use it in a test
> after the WaitLatchOrSocket call.

Hm, right. Wouldn't be particularly consequential, but... I'd actually
consider just removing the if around    /* If socket is ready, advance the libpq state machine */    if (rc & io_flag)
     status = PQconnectPoll(conn->streamConn);
 
the only thing that protects us against is calling PQconnectPoll() when
the latch has been set. Hardly problematic.


> > Anybody have an opinion about adding ifs for WL_SOCKET_CONNECTED to
> > !win32 implementations rather than redefining it to WL_SOCKET_WRITEABLE?
> 
> I fear it would complicate matters greatly, because you'd have to figure
> out which of the two flags to signal back after detecting socket writable.
> I think defining it as equal to WL_SOCKET_WRITEABLE is fine.

Well, I'd have said, signal the one(s) back that have been
requested. But I'm ok with the current state, adding a bunch of
pointless branches didn't strike me as worthwhile...

Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-08-06 18:04:49 -0400, Tom Lane wrote:
>> io_flag has to be *just* the I/O condition, because we use it in a test
>> after the WaitLatchOrSocket call.

> Hm, right. Wouldn't be particularly consequential,

Yes, it would be.  The entire point of this loop is to avoid blocking
in PQconnectPoll, and that's not guaranteed unless we honor the contract
of only calling it when we've gotten the read-ready or write-ready status
it asked for.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Replication to Postgres 10 on Windows is broken

From
Noah Misch
Date:
On Sun, Aug 06, 2017 at 08:50:37AM -0700, Noah Misch wrote:
> On Sun, Aug 06, 2017 at 11:17:57AM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > I've added this as an open item.  Confirmed in this setup:
> > 
> > > -- Client
> > > Windows Server 2016
> > > postgresql-10.0-beta2-windows-x64-binaries.zip from EnterpriseDB
> > 
> > I wonder whether the other complainants were using EDB's build,
> > and if not, just what were they using.  The indirect question is:
> > what version of OpenSSL is the Windows build using?
> 
> Those binaries I used have OpenSSL 1.0.2l.
> 
> > > I don't, however, see a smoking gun among commits.  Would you bisect the
> > > commits since 9.6 and see which one broke things?
> > 
> > Gut instinct says that the reason this case fails when other tools
> > can connect successfully is that libpqwalreceiver is the only tool
> > that uses PQconnectStart/PQconnectPoll rather than a plain
> > PQconnectdb, and that there is some behavioral difference between
> > connectDBComplete's wait loop and libpqrcv_connect's wait loop that
> 
> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> callers outside of libpq itself.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Eisentraut
Date:
On 8/7/17 21:06, Noah Misch wrote:
>> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
>> callers outside of libpq itself.
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.

I don't think I can usefully contribute to this.  Could someone else
take it?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
On 2017-08-08 19:25:37 -0400, Peter Eisentraut wrote:
> On 8/7/17 21:06, Noah Misch wrote:
> >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> >> callers outside of libpq itself.
> > [Action required within three days.  This is a generic notification.]
> > 
> > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > since you committed the patch believed to have created it, you own this open
> > item.
> 
> I don't think I can usefully contribute to this.  Could someone else
> take it?

I've written up a patch at
http://archives.postgresql.org/message-id/20170806215521.d6fq4esvx7s5ejka%40alap3.anarazel.de
but I can't test this either. We really need somebody with access to
window to verify whether it works.  That patch needs some adjustments as
remarked by Tom, but can be tested as is.

Regards,

Andres



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Noah Misch
Date:
On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> On 8/7/17 21:06, Noah Misch wrote:
> >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> >> callers outside of libpq itself.
> > [Action required within three days.  This is a generic notification.]
> > 
> > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > since you committed the patch believed to have created it, you own this open
> > item.
> 
> I don't think I can usefully contribute to this.  Could someone else
> take it?

If nobody volunteers, you could always resolve this by reverting 1e8a850 and
successors.



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > On 8/7/17 21:06, Noah Misch wrote:
> > >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> > >> callers outside of libpq itself.
> > > [Action required within three days.  This is a generic notification.]
> > > 
> > > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > > since you committed the patch believed to have created it, you own this open
> > > item.
> > 
> > I don't think I can usefully contribute to this.  Could someone else
> > take it?
> 
> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I've volunteered a fix nearby, just can't test it. I don't think we can
require committers to work on windows.

- Andres



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
"Augustine, Jobin"
Date:
I am in an effort to create independent build environment on Windows to test the patch from Andres.
I shall come back with result possibly in another 24 hours.

-Jobin

On Fri, Aug 11, 2017 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > On 8/7/17 21:06, Noah Misch wrote:
> > >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> > >> callers outside of libpq itself.
> > > [Action required within three days.  This is a generic notification.]
> > >
> > > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > > since you committed the patch believed to have created it, you own this open
> > > item.
> >
> > I don't think I can usefully contribute to this.  Could someone else
> > take it?
>
> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I've volunteered a fix nearby, just can't test it. I don't think we can
require committers to work on windows.

- Andres



--

Jobin Augustine
Architect : Production Database Operations

OpenSCG

phone : +91 9989932600

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
>> I don't think I can usefully contribute to this.  Could someone else
>> take it?

> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I think you're blaming the victim.  Our current theory about the cause
of this is that on Windows, WaitLatchOrSocket cannot be used to wait for
completion of a nonblocking connect() call.  That seems pretty broken
independently of whether libpqwalreceiver needs the capability.

In any case, we have a draft patch, so what we should be pressing for
is for somebody to test it.  Peter's not in a position to do that
(and neither am I), but anyone who can build from source on Windows
could do so.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
"Augustine, Jobin"
Date:
Appears that patch is not helping.
Errors like below are still appearing in the log
===============================
2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send startup packet: Socket is not connected (0x00002749/10057)
2017-08-11 12:22:40 UTC [1964]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send startup packet: Socket is not connected (0x00002749/10057)
2017-08-11 12:22:45 UTC [248]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send startup packet: Socket is not connected (0x00002749/10057)
===============================


On Fri, Aug 11, 2017 at 7:28 AM, Augustine, Jobin <jobin.augustine@openscg.com> wrote:
I am in an effort to create independent build environment on Windows to test the patch from Andres.
I shall come back with result possibly in another 24 hours.

-Jobin

On Fri, Aug 11, 2017 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > On 8/7/17 21:06, Noah Misch wrote:
> > >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> > >> callers outside of libpq itself.
> > > [Action required within three days.  This is a generic notification.]
> > >
> > > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > > since you committed the patch believed to have created it, you own this open
> > > item.
> >
> > I don't think I can usefully contribute to this.  Could someone else
> > take it?
>
> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I've volunteered a fix nearby, just can't test it. I don't think we can
require committers to work on windows.

- Andres



--

Jobin Augustine
Architect : Production Database Operations

OpenSCG

phone : +91 9989932600




--

Jobin Augustine
Architect : Production Database Operations

OpenSCG

phone : +91 9989932600

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Noah Misch
Date:
On Thu, Aug 10, 2017 at 09:59:40PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> >> I don't think I can usefully contribute to this.  Could someone else
> >> take it?

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

> > If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> > successors.
> 
> I think you're blaming the victim.  Our current theory about the cause
> of this is that on Windows, WaitLatchOrSocket cannot be used to wait for
> completion of a nonblocking connect() call.  That seems pretty broken
> independently of whether libpqwalreceiver needs the capability.

Yes, the theorized defect lies in APIs commit 1e8a850 used, not in the commit
itself.  Nonetheless, commit 1e8a850 promoted the defect from one reachable
only by writing C code to one reachable by merely configuring replication on
Windows according to the documentation.  For that, its committer owns this
open item.  Besides the one approach I mentioned, there exist several other
fine ways to implement said ownership.

> In any case, we have a draft patch, so what we should be pressing for
> is for somebody to test it.

Now done.  (Thanks, Jobin.)



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Noah Misch
Date:
On Fri, Aug 11, 2017 at 08:56:22PM -0700, Noah Misch wrote:
> On Thu, Aug 10, 2017 at 09:59:40PM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > >> I don't think I can usefully contribute to this.  Could someone else
> > >> take it?
> 
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
for your status update.  Please reacquaint yourself with the policy on open
item ownership[1] and then reply immediately.  If I do not hear from you by
2017-08-14 20:00 UTC, I will transfer this item to release management team
ownership without further notice.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
On 2017-08-11 20:56:22 -0700, Noah Misch wrote:
> > > If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> > > successors.
> > 
> > I think you're blaming the victim.  Our current theory about the cause
> > of this is that on Windows, WaitLatchOrSocket cannot be used to wait for
> > completion of a nonblocking connect() call.  That seems pretty broken
> > independently of whether libpqwalreceiver needs the capability.
> 
> Yes, the theorized defect lies in APIs commit 1e8a850 used, not in the commit
> itself.  Nonetheless, commit 1e8a850 promoted the defect from one reachable
> only by writing C code to one reachable by merely configuring replication on
> Windows according to the documentation.  For that, its committer owns this
> open item.  Besides the one approach I mentioned, there exist several other
> fine ways to implement said ownership.

FWIW, I'm personally quite demotivated by this style of handling
issues. You're essentially saying that any code change, even if it just
increases exposure of a preexisting bug, needs to be handled by the
committer of the exposing change. And even if that bug is on a platform
the committer doesn't have.  And all that despite the issue getting
attention.

I'm not ok with this.



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
Hi,

On 2017-08-11 18:11:03 +0530, Augustine, Jobin wrote:
> Appears that patch is not helping.
> Errors like below are still appearing in the log
> ===============================
> 2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x00002749/10057)
>     could not send startup packet: Socket is not connected
> (0x00002749/10057)
> 2017-08-11 12:22:40 UTC [1964]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x00002749/10057)
>     could not send startup packet: Socket is not connected
> (0x00002749/10057)
> 2017-08-11 12:22:45 UTC [248]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x00002749/10057)
>     could not send startup packet: Socket is not connected
> (0x00002749/10057)
> ===============================

That's too bad. Any chance you could install
https://docs.microsoft.com/en-us/sysinternals/downloads/procmon and
activate monitoring just for that phase? I think it can export to a text
file...

Greetings,

Andres



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Geoghegan
Date:
On Sun, Aug 13, 2017 at 12:57 PM, Andres Freund <andres@anarazel.de> wrote:
> FWIW, I'm personally quite demotivated by this style of handling
> issues. You're essentially saying that any code change, even if it just
> increases exposure of a preexisting bug, needs to be handled by the
> committer of the exposing change. And even if that bug is on a platform
> the committer doesn't have.  And all that despite the issue getting
> attention.

I don't think you can generalize from what Noah said like that,
because it's always a matter of degree (the degree to which the
preexisting bug was a problem). Abbreviated keys for collated text
were disabled, though not due to bug in strxfrm(). Technically, it was
due to a bug in strcoll(), which glibc always had. strxfrm() therefore
only failed to be bug compatible with glibc's strcoll(). Does that
mean that we were wrong to disable the use of strxfrm() for
abbreviated keys?

I think that it's useful for these things to be handled in an
adversarial manner, in the same way that litigation is adversarial in
a common law court. I doubt that Noah actually set out to demoralize
anyone. He is just doing the job he was assigned.

-- 
Peter Geoghegan



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-08-11 18:11:03 +0530, Augustine, Jobin wrote:
>> Appears that patch is not helping.

> That's too bad. Any chance you could install
> https://docs.microsoft.com/en-us/sysinternals/downloads/procmon and
> activate monitoring just for that phase? I think it can export to a text
> file...

It strikes me that maybe the misuse of io_flag could be contributing
to this: if the walreceiver process's latch were set, we'd end up calling
PQconnectPoll before the socket had necessarily come ready, which would
produce the described symptom.  That's grasping at straws admittedly,
because I'm not sure why the walreceiver process's latch would be set
at this point; but it seems like we ought to test a version of the patch
that we believe correct before deciding that we still have a problem.

To move things along, here's a corrected patch --- Jobin, please test.

            regards, tom lane

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index de03362..37b481c 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -168,13 +168,18 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
     status = PGRES_POLLING_WRITING;
     do
     {
-        /* Wait for socket ready and/or other events. */
         int            io_flag;
         int            rc;

-        io_flag = (status == PGRES_POLLING_READING
-                   ? WL_SOCKET_READABLE
-                   : WL_SOCKET_WRITEABLE);
+        if (status == PGRES_POLLING_READING)
+            io_flag = WL_SOCKET_READABLE;
+#ifdef WIN32
+        /* Windows needs a different test while waiting for connection-made */
+        else if (PQstatus(conn->streamConn) == CONNECTION_STARTED)
+            io_flag = WL_SOCKET_CONNECTED;
+#endif
+        else
+            io_flag = WL_SOCKET_WRITEABLE;

         rc = WaitLatchOrSocket(MyLatch,
                                WL_POSTMASTER_DEATH |
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364..c7345de 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -374,11 +374,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
         AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
                           NULL, NULL);

-    if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+    if (wakeEvents & WL_SOCKET_MASK)
     {
         int            ev;

-        ev = wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
+        ev = wakeEvents & WL_SOCKET_MASK;
         AddWaitEventToSet(set, ev, sock, NULL, NULL);
     }

@@ -390,8 +390,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
     {
         ret |= event.events & (WL_LATCH_SET |
                                WL_POSTMASTER_DEATH |
-                               WL_SOCKET_READABLE |
-                               WL_SOCKET_WRITEABLE);
+                               WL_SOCKET_MASK);
     }

     FreeWaitEventSet(set);
@@ -640,10 +639,12 @@ FreeWaitEventSet(WaitEventSet *set)
  * Add an event to the set. Possible events are:
  * - WL_LATCH_SET: Wait for the latch to be set
  * - WL_POSTMASTER_DEATH: Wait for postmaster to die
- * - WL_SOCKET_READABLE: Wait for socket to become readable
- *     can be combined in one event with WL_SOCKET_WRITEABLE
- * - WL_SOCKET_WRITEABLE: Wait for socket to become writeable
- *     can be combined with WL_SOCKET_READABLE
+ * - WL_SOCKET_READABLE: Wait for socket to become readable,
+ *     can be combined in one event with other WL_SOCKET_* events
+ * - WL_SOCKET_WRITEABLE: Wait for socket to become writeable,
+ *     can be combined with other WL_SOCKET_* events
+ * - WL_SOCKET_CONNECTED: Wait for socket connection to be established,
+ *     can be combined with other WL_SOCKET_* events
  *
  * Returns the offset in WaitEventSet->events (starting from 0), which can be
  * used to modify previously added wait events using ModifyWaitEvent().
@@ -652,9 +653,9 @@ FreeWaitEventSet(WaitEventSet *set)
  * i.e. it must be a process-local latch initialized with InitLatch, or a
  * shared latch associated with the current process by calling OwnLatch.
  *
- * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are
- * reported by returning the socket as readable/writable or both, depending on
- * WL_SOCKET_READABLE/WRITEABLE being specified.
+ * In the WL_SOCKET_READABLE/WRITEABLE/CONNECTED cases, EOF and error
+ * conditions are reported by returning the socket as readable/writable or
+ * both, depending on WL_SOCKET_READABLE/WRITEABLE/CONNECTED being specified.
  *
  * The user_data pointer specified here will be set for the events returned
  * by WaitEventSetWait(), allowing to easily associate additional data with
@@ -685,8 +686,7 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch,
     }

     /* waiting for socket readiness without a socket indicates a bug */
-    if (fd == PGINVALID_SOCKET &&
-        (events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)))
+    if (fd == PGINVALID_SOCKET && (events & (WL_SOCKET_MASK)))
         elog(ERROR, "cannot wait on socket event without a socket");

     event = &set->events[set->nevents];
@@ -885,6 +885,8 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
             flags |= FD_READ;
         if (event->events & WL_SOCKET_WRITEABLE)
             flags |= FD_WRITE;
+        if (event->events & WL_SOCKET_CONNECTED)
+            flags |= FD_CONNECT;

         if (*handle == WSA_INVALID_EVENT)
         {
@@ -1395,7 +1397,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
             returned_events++;
         }
     }
-    else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+    else if (cur_event->events & WL_SOCKET_MASK)
     {
         WSANETWORKEVENTS resEvents;
         HANDLE        handle = set->handles[cur_event->pos + 1];
@@ -1432,6 +1434,12 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
             /* writeable */
             occurred_events->events |= WL_SOCKET_WRITEABLE;
         }
+        if ((cur_event->events & WL_SOCKET_CONNECTED) &&
+            (resEvents.lNetworkEvents & FD_CONNECT))
+        {
+            /* connected */
+            occurred_events->events |= WL_SOCKET_CONNECTED;
+        }
         if (resEvents.lNetworkEvents & FD_CLOSE)
         {
             /* EOF */
@@ -1439,6 +1447,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
                 occurred_events->events |= WL_SOCKET_READABLE;
             if (cur_event->events & WL_SOCKET_WRITEABLE)
                 occurred_events->events |= WL_SOCKET_WRITEABLE;
+            if (cur_event->events & WL_SOCKET_CONNECTED)
+                occurred_events->events |= WL_SOCKET_CONNECTED;
         }

         if (occurred_events->events != 0)
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 73abfaf..8af0266 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -126,6 +126,16 @@ typedef struct Latch
 #define WL_SOCKET_WRITEABLE  (1 << 2)
 #define WL_TIMEOUT             (1 << 3)    /* not for WaitEventSetWait() */
 #define WL_POSTMASTER_DEATH  (1 << 4)
+#ifdef WIN32
+#define WL_SOCKET_CONNECTED  (1 << 5)
+#else
+/* avoid having to to deal with case on platforms not requiring it */
+#define WL_SOCKET_CONNECTED     WL_SOCKET_WRITEABLE
+#endif
+
+#define WL_SOCKET_MASK        (WL_SOCKET_READABLE | \
+                             WL_SOCKET_WRITEABLE | \
+                             WL_SOCKET_CONNECTED)

 typedef struct WaitEvent
 {

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> I think that it's useful for these things to be handled in an
> adversarial manner, in the same way that litigation is adversarial in
> a common law court. I doubt that Noah actually set out to demoralize
> anyone. He is just doing the job he was assigned.

FWIW, I agree that Noah is just carrying out the RMT's task as assigned.
However, the only thing that Peter could really do about this of his own
authority is to revert 1e8a850, which I certainly think should be the last
resort not the first.  We are continuing to make progress towards finding
a better solution, I think, and since no particular deadline is imminent
we should let that process play out.

I think what Peter should do to satisfy the RMT process is to state that
if no better solution is found by date X, he will revert 1e8a850 (or
somehow mask the problem by removing functionality), and he sees no
need for additional progress reports before that.  Date X could be
sometime early in September, perhaps.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
"Augustine, Jobin" <jobin.augustine@openscg.com> writes:
> Appears that patch is not helping.
> Errors like below are still appearing in the log
> ===============================
> 2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x00002749/10057)
>     could not send startup packet: Socket is not connected
> (0x00002749/10057)

BTW, I notice that this is a bit different from your first report, because
that mentioned "could not send SSL negotiation packet" rather than "could
not send startup packet".  If the initial report was about a setup in
which SSL was enabled, and in this report's setup it isn't, then that
difference is unsurprising; but otherwise it needs explanation.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
On 2017-08-13 16:55:33 -0400, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > I think that it's useful for these things to be handled in an
> > adversarial manner, in the same way that litigation is adversarial in
> > a common law court. I doubt that Noah actually set out to demoralize
> > anyone. He is just doing the job he was assigned.
> 
> FWIW, I agree that Noah is just carrying out the RMT's task as
> assigned.

Well, then that's a sign that the tasks/process need to be rescoped.


> However, the only thing that Peter could really do about this of his own
> authority is to revert 1e8a850, which I certainly think should be the last
> resort not the first.  We are continuing to make progress towards finding
> a better solution, I think, and since no particular deadline is imminent
> we should let that process play out.

I agree that that'd be a bad fix. There's other things that should, but
don't, really use the asynchronous API, e.g. postgres_fdw.

As a measure of last restart we could add a libpq workaround that forces
a pqSocketCheck() at the right moment, while still establishing a
connection.  That's not good from an interruptability perspective, but
better than blocking for the entire connection establishment.

Greetings,

Andres Freund



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> As a measure of last restart we could add a libpq workaround that forces
> a pqSocketCheck() at the right moment, while still establishing a
> connection.  That's not good from an interruptability perspective, but
> better than blocking for the entire connection establishment.

Probably a better idea is to fix things so that can't-send-yet results
in returning PGRES_POLLING_WRITING rather than a failure.  That would
result in a busy-wait in this scenario on Windows, which is arguably
better than blocking.  It would also make this part of libpq more robust
against applications being sloppy about socket readiness checks (which,
you could argue, is exactly what libpqwalreceiver is being).  But it
would be a somewhat ticklish change because of the portability hazards,
so I'm really disinclined to do it this late in beta.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Geoghegan
Date:
On Sun, Aug 13, 2017 at 2:22 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-08-13 16:55:33 -0400, Tom Lane wrote:
>> Peter Geoghegan <pg@bowt.ie> writes:
>> > I think that it's useful for these things to be handled in an
>> > adversarial manner, in the same way that litigation is adversarial in
>> > a common law court. I doubt that Noah actually set out to demoralize
>> > anyone. He is just doing the job he was assigned.
>>
>> FWIW, I agree that Noah is just carrying out the RMT's task as
>> assigned.
>
> Well, then that's a sign that the tasks/process need to be rescoped.

Why? I might agree if the RMT had an outsized influence on final
outcome. If that's the case, then it's something that I missed.

I also don't think that it's fair to expect Noah to spend a lot of
time poring over whether sending out one of those pro forma status
update policy violation mails is warranted. I expect someone is his
position to aim to err on the side of sending out too many of those.
It's easy for the patch author to explain the situation, but hard for
the RMT to understand the situation fully at all times.

-- 
Peter Geoghegan



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
"Augustine, Jobin"
Date:
Thank you Tom Lane,
This patch fixes the problem.

With this patch, streaming replication started working (replication to Windows)

(Tested for Linux to Windows replication)

-Jobin.

On Mon, Aug 14, 2017 at 2:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
> On 2017-08-11 18:11:03 +0530, Augustine, Jobin wrote:
>> Appears that patch is not helping.

> That's too bad. Any chance you could install
> https://docs.microsoft.com/en-us/sysinternals/downloads/procmon and
> activate monitoring just for that phase? I think it can export to a text
> file...

It strikes me that maybe the misuse of io_flag could be contributing
to this: if the walreceiver process's latch were set, we'd end up calling
PQconnectPoll before the socket had necessarily come ready, which would
produce the described symptom.  That's grasping at straws admittedly,
because I'm not sure why the walreceiver process's latch would be set
at this point; but it seems like we ought to test a version of the patch
that we believe correct before deciding that we still have a problem.

To move things along, here's a corrected patch --- Jobin, please test.

                        regards, tom lane




--

Jobin Augustine
Architect : Production Database Operations

OpenSCG

phone : +91 9989932600

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Eisentraut
Date:
On 8/13/17 15:46, Noah Misch wrote:
> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
> for your status update.  Please reacquaint yourself with the policy on open
> item ownership[1] and then reply immediately.  If I do not hear from you by
> 2017-08-14 20:00 UTC, I will transfer this item to release management team
> ownership without further notice.

It appears there is a patch that fixes the issue now.  Let's wait until
Wednesday to see if that ends up being successful.

Otherwise, the plan forward would be to decorate the change in the
original commit with an #ifndef WIN32.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> It appears there is a patch that fixes the issue now.  Let's wait until
> Wednesday to see if that ends up being successful.

Jobim reported success with add-connected-event-2.patch --- are you
expecting more testing to happen?

I did instrument the loop in libpqrcv_connect and verified that the
process latch is set the first time through.  I haven't traced down
exactly why it's set at that point, but that confirms why Andres'
initial patch didn't work and the new one did.

I think we could commit add-connected-event-2.patch and call this
issue resolved.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Eisentraut
Date:
On 8/14/17 10:57, Tom Lane wrote:
> I think we could commit add-connected-event-2.patch and call this
> issue resolved.

Would you like to commit your patch then?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 8/14/17 10:57, Tom Lane wrote:
>> I think we could commit add-connected-event-2.patch and call this
>> issue resolved.

> Would you like to commit your patch then?

It's really Andres' patch, but I can push it.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:


On August 15, 2017 7:04:59 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 8/14/17 10:57, Tom Lane wrote:
>>> I think we could commit add-connected-event-2.patch and call this
>>> issue resolved.
>
>> Would you like to commit your patch then?
>
>It's really Andres' patch, but I can push it.

I'll in a bit, unless you prefer to do it. You seem busy enough ;)

I don't see any relevant uses of sockets in older branches, did I miss something?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:

On August 15, 2017 8:07:43 AM PDT, Andres Freund <andres@anarazel.de> wrote:
>
>
>On August 15, 2017 7:04:59 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>> On 8/14/17 10:57, Tom Lane wrote:
>>>> I think we could commit add-connected-event-2.patch and call this
>>>> issue resolved.
>>
>>> Would you like to commit your patch then?
>>
>>It's really Andres' patch, but I can push it.
>
>I'll in a bit, unless you prefer to do it. You seem busy enough ;)

Hah, synchronicity. That works too ;)
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I don't see any relevant uses of sockets in older branches, did I miss something?

No, I think we don't need to go back further than v10.  Even if it turns
out we do, I'd just as soon let this get a bit of field testing first.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
"Augustine, Jobin"
Date:
This patch is fixing similar problem reported in Logical replication also by another user.

Message:
20170524173644.1440.53348@wrigleys.postgresql.org
Bug reference:      14669
Logged by:          Igor Neyman
Email address:      ineyman(at)perceptron(dot)com
PostgreSQL version: 10beta1
Operating system:   Windows

Regards,
Jobin.


Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows isbroken

From
Tatsuo Ishii
Date:
> Thank you Tom Lane,
> This patch fixes the problem.
> 
> With this patch, streaming replication started working (replication to
> Windows)
> 
> (Tested for Linux to Windows replication)

Do we allow streaming replication among different OS? I thought it is
required that primary and standbys are same platform (in my
understanding this means the same hardware architecture and OS) in
streaming replication.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Michael Paquier
Date:
On Thu, Aug 17, 2017 at 8:13 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> Do we allow streaming replication among different OS?

No. WAL is a binary format.

> I thought it is
> required that primary and standbys are same platform (in my
> understanding this means the same hardware architecture and OS) in
> streaming replication.

Yep, I recall the same requirement.
-- 
Michael



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Thu, Aug 17, 2017 at 8:13 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> Do we allow streaming replication among different OS?

> No. WAL is a binary format.

>> I thought it is
>> required that primary and standbys are same platform (in my
>> understanding this means the same hardware architecture and OS) in
>> streaming replication.

> Yep, I recall the same requirement.

In practice it's largely about architecture, I think.  You definitely
need the same endianness and floating-point format, which are hardware,
and you need the same MAXALIGN, which is partly hardware but in principle
could be chosen differently in different ABI conventions for the same
hardware.  (At least for non-alignment-picky hardware like Intel.
Alignment-picky hardware is likely going to dictate that choice too.)

We disclaim cross-OS portability partly because of the possible influence
of different ABI conventions, but it doesn't surprise me in the least if
in practice 64-bit Windows can interoperate with x86_64 Linux.  (Less sure
about the 32-bit case --- it looks like pg_config.h.win32 chooses
MAXALIGN 8 in all cases, which would mean 32-bit Windows is more likely to
interoperate with 64-bit Linux than 32-bit Linux.)

The thing that's really likely to bite you cross-platform is dependency
of textual index sort ordering on non-C locale definitions.  But that
wouldn't show up in a quick smoke test of replication ... and if you
really wanted, you could use C locale on both ends.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Eisentraut
Date:
On 8/16/17 19:41, Michael Paquier wrote:
> On Thu, Aug 17, 2017 at 8:13 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> Do we allow streaming replication among different OS?
> 
> No. WAL is a binary format.
> 
>> I thought it is
>> required that primary and standbys are same platform (in my
>> understanding this means the same hardware architecture and OS) in
>> streaming replication.
> 
> Yep, I recall the same requirement.

He meant logical replication, but the code in question here is the same
for streaming replication, or whatever it's called.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows isbroken

From
Tatsuo Ishii
Date:
> In practice it's largely about architecture, I think.  You definitely
> need the same endianness and floating-point format, which are hardware,
> and you need the same MAXALIGN, which is partly hardware but in principle
> could be chosen differently in different ABI conventions for the same
> hardware.  (At least for non-alignment-picky hardware like Intel.
> Alignment-picky hardware is likely going to dictate that choice too.)
> 
> We disclaim cross-OS portability partly because of the possible influence
> of different ABI conventions, but it doesn't surprise me in the least if
> in practice 64-bit Windows can interoperate with x86_64 Linux.  (Less sure
> about the 32-bit case --- it looks like pg_config.h.win32 chooses
> MAXALIGN 8 in all cases, which would mean 32-bit Windows is more likely to
> interoperate with 64-bit Linux than 32-bit Linux.)

I feel like we need to be a little bit more cleaner about this
in our official documentation. From section 26.2.1. Planning:

"Hardware need not be exactly the same, but experience shows that
maintaining two identical systems is easier than maintaining two
dissimilar ones over the lifetime of the application and system. In
any case the hardware architecture must be the same ― shipping from,
say, a 32-bit to a 64-bit system will not work."

Probably We should disclaim cross-OS portability here?

> The thing that's really likely to bite you cross-platform is dependency
> of textual index sort ordering on non-C locale definitions.  But that
> wouldn't show up in a quick smoke test of replication ... and if you
> really wanted, you could use C locale on both ends.

Of course.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows isbroken

From
Tatsuo Ishii
Date:
> He meant logical replication,

Oh I could not find he meant logical replication in the original
report.

> but the code in question here is the same
> for streaming replication, or whatever it's called.

Yes.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Michael Paquier
Date:
On Thu, Aug 17, 2017 at 9:15 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> He meant logical replication,
>
> Oh I could not find he meant logical replication in the original
> report.

The second message of the thread says so, but the first does not
mention logical replication at all.
From here are mentioned PG 9.6 and pg_basebackup:
https://www.postgresql.org/message-id/CAHBggj8g2T+ZDcACZ2FmzX9CTxkWjKBsHd6NkYB4i9Ojf6K1Fw@mail.gmail.com.
Explaining the confusion.
-- 
Michael



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows isbroken

From
Tatsuo Ishii
Date:
> On Thu, Aug 17, 2017 at 9:15 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>>> He meant logical replication,
>>
>> Oh I could not find he meant logical replication in the original
>> report.
> 
> The second message of the thread says so, but the first does not
> mention logical replication at all.
>>From here are mentioned PG 9.6 and pg_basebackup:
> https://www.postgresql.org/message-id/CAHBggj8g2T+ZDcACZ2FmzX9CTxkWjKBsHd6NkYB4i9Ojf6K1Fw@mail.gmail.com.
> Explaining the confusion.

Oh, I see it now. Thanks.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp