Thread: Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Noah Misch
Date:
On Sun, Aug 06, 2017 at 08:50:37AM -0700, Noah Misch wrote:
> On Sun, Aug 06, 2017 at 11:17:57AM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > I've added this as an open item.  Confirmed in this setup:
> > 
> > > -- Client
> > > Windows Server 2016
> > > postgresql-10.0-beta2-windows-x64-binaries.zip from EnterpriseDB
> > 
> > I wonder whether the other complainants were using EDB's build,
> > and if not, just what were they using.  The indirect question is:
> > what version of OpenSSL is the Windows build using?
> 
> Those binaries I used have OpenSSL 1.0.2l.
> 
> > > I don't, however, see a smoking gun among commits.  Would you bisect the
> > > commits since 9.6 and see which one broke things?
> > 
> > Gut instinct says that the reason this case fails when other tools
> > can connect successfully is that libpqwalreceiver is the only tool
> > that uses PQconnectStart/PQconnectPoll rather than a plain
> > PQconnectdb, and that there is some behavioral difference between
> > connectDBComplete's wait loop and libpqrcv_connect's wait loop that
> 
> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> callers outside of libpq itself.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Eisentraut
Date:
On 8/7/17 21:06, Noah Misch wrote:
>> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
>> callers outside of libpq itself.
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.

I don't think I can usefully contribute to this.  Could someone else
take it?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
On 2017-08-08 19:25:37 -0400, Peter Eisentraut wrote:
> On 8/7/17 21:06, Noah Misch wrote:
> >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> >> callers outside of libpq itself.
> > [Action required within three days.  This is a generic notification.]
> > 
> > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > since you committed the patch believed to have created it, you own this open
> > item.
> 
> I don't think I can usefully contribute to this.  Could someone else
> take it?

I've written up a patch at
http://archives.postgresql.org/message-id/20170806215521.d6fq4esvx7s5ejka%40alap3.anarazel.de
but I can't test this either. We really need somebody with access to
window to verify whether it works.  That patch needs some adjustments as
remarked by Tom, but can be tested as is.

Regards,

Andres



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Noah Misch
Date:
On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> On 8/7/17 21:06, Noah Misch wrote:
> >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> >> callers outside of libpq itself.
> > [Action required within three days.  This is a generic notification.]
> > 
> > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > since you committed the patch believed to have created it, you own this open
> > item.
> 
> I don't think I can usefully contribute to this.  Could someone else
> take it?

If nobody volunteers, you could always resolve this by reverting 1e8a850 and
successors.



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > On 8/7/17 21:06, Noah Misch wrote:
> > >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> > >> callers outside of libpq itself.
> > > [Action required within three days.  This is a generic notification.]
> > > 
> > > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > > since you committed the patch believed to have created it, you own this open
> > > item.
> > 
> > I don't think I can usefully contribute to this.  Could someone else
> > take it?
> 
> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I've volunteered a fix nearby, just can't test it. I don't think we can
require committers to work on windows.

- Andres



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
"Augustine, Jobin"
Date:
I am in an effort to create independent build environment on Windows to test the patch from Andres.
I shall come back with result possibly in another 24 hours.

-Jobin

On Fri, Aug 11, 2017 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > On 8/7/17 21:06, Noah Misch wrote:
> > >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> > >> callers outside of libpq itself.
> > > [Action required within three days.  This is a generic notification.]
> > >
> > > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > > since you committed the patch believed to have created it, you own this open
> > > item.
> >
> > I don't think I can usefully contribute to this.  Could someone else
> > take it?
>
> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I've volunteered a fix nearby, just can't test it. I don't think we can
require committers to work on windows.

- Andres



--

Jobin Augustine
Architect : Production Database Operations

OpenSCG

phone : +91 9989932600

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
>> I don't think I can usefully contribute to this.  Could someone else
>> take it?

> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I think you're blaming the victim.  Our current theory about the cause
of this is that on Windows, WaitLatchOrSocket cannot be used to wait for
completion of a nonblocking connect() call.  That seems pretty broken
independently of whether libpqwalreceiver needs the capability.

In any case, we have a draft patch, so what we should be pressing for
is for somebody to test it.  Peter's not in a position to do that
(and neither am I), but anyone who can build from source on Windows
could do so.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
"Augustine, Jobin"
Date:
Appears that patch is not helping.
Errors like below are still appearing in the log
===============================
2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send startup packet: Socket is not connected (0x00002749/10057)
2017-08-11 12:22:40 UTC [1964]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send startup packet: Socket is not connected (0x00002749/10057)
2017-08-11 12:22:45 UTC [248]: [1-1] user=,db=,app=,client= FATAL:  could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
    could not send startup packet: Socket is not connected (0x00002749/10057)
===============================


On Fri, Aug 11, 2017 at 7:28 AM, Augustine, Jobin <jobin.augustine@openscg.com> wrote:
I am in an effort to create independent build environment on Windows to test the patch from Andres.
I shall come back with result possibly in another 24 hours.

-Jobin

On Fri, Aug 11, 2017 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > On 8/7/17 21:06, Noah Misch wrote:
> > >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> > >> callers outside of libpq itself.
> > > [Action required within three days.  This is a generic notification.]
> > >
> > > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > > since you committed the patch believed to have created it, you own this open
> > > item.
> >
> > I don't think I can usefully contribute to this.  Could someone else
> > take it?
>
> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I've volunteered a fix nearby, just can't test it. I don't think we can
require committers to work on windows.

- Andres



--

Jobin Augustine
Architect : Production Database Operations

OpenSCG

phone : +91 9989932600




--

Jobin Augustine
Architect : Production Database Operations

OpenSCG

phone : +91 9989932600

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Noah Misch
Date:
On Thu, Aug 10, 2017 at 09:59:40PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> >> I don't think I can usefully contribute to this.  Could someone else
> >> take it?

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

> > If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> > successors.
> 
> I think you're blaming the victim.  Our current theory about the cause
> of this is that on Windows, WaitLatchOrSocket cannot be used to wait for
> completion of a nonblocking connect() call.  That seems pretty broken
> independently of whether libpqwalreceiver needs the capability.

Yes, the theorized defect lies in APIs commit 1e8a850 used, not in the commit
itself.  Nonetheless, commit 1e8a850 promoted the defect from one reachable
only by writing C code to one reachable by merely configuring replication on
Windows according to the documentation.  For that, its committer owns this
open item.  Besides the one approach I mentioned, there exist several other
fine ways to implement said ownership.

> In any case, we have a draft patch, so what we should be pressing for
> is for somebody to test it.

Now done.  (Thanks, Jobin.)



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Noah Misch
Date:
On Fri, Aug 11, 2017 at 08:56:22PM -0700, Noah Misch wrote:
> On Thu, Aug 10, 2017 at 09:59:40PM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > >> I don't think I can usefully contribute to this.  Could someone else
> > >> take it?
> 
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
for your status update.  Please reacquaint yourself with the policy on open
item ownership[1] and then reply immediately.  If I do not hear from you by
2017-08-14 20:00 UTC, I will transfer this item to release management team
ownership without further notice.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
On 2017-08-11 20:56:22 -0700, Noah Misch wrote:
> > > If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> > > successors.
> > 
> > I think you're blaming the victim.  Our current theory about the cause
> > of this is that on Windows, WaitLatchOrSocket cannot be used to wait for
> > completion of a nonblocking connect() call.  That seems pretty broken
> > independently of whether libpqwalreceiver needs the capability.
> 
> Yes, the theorized defect lies in APIs commit 1e8a850 used, not in the commit
> itself.  Nonetheless, commit 1e8a850 promoted the defect from one reachable
> only by writing C code to one reachable by merely configuring replication on
> Windows according to the documentation.  For that, its committer owns this
> open item.  Besides the one approach I mentioned, there exist several other
> fine ways to implement said ownership.

FWIW, I'm personally quite demotivated by this style of handling
issues. You're essentially saying that any code change, even if it just
increases exposure of a preexisting bug, needs to be handled by the
committer of the exposing change. And even if that bug is on a platform
the committer doesn't have.  And all that despite the issue getting
attention.

I'm not ok with this.



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
Hi,

On 2017-08-11 18:11:03 +0530, Augustine, Jobin wrote:
> Appears that patch is not helping.
> Errors like below are still appearing in the log
> ===============================
> 2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x00002749/10057)
>     could not send startup packet: Socket is not connected
> (0x00002749/10057)
> 2017-08-11 12:22:40 UTC [1964]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x00002749/10057)
>     could not send startup packet: Socket is not connected
> (0x00002749/10057)
> 2017-08-11 12:22:45 UTC [248]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x00002749/10057)
>     could not send startup packet: Socket is not connected
> (0x00002749/10057)
> ===============================

That's too bad. Any chance you could install
https://docs.microsoft.com/en-us/sysinternals/downloads/procmon and
activate monitoring just for that phase? I think it can export to a text
file...

Greetings,

Andres



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Geoghegan
Date:
On Sun, Aug 13, 2017 at 12:57 PM, Andres Freund <andres@anarazel.de> wrote:
> FWIW, I'm personally quite demotivated by this style of handling
> issues. You're essentially saying that any code change, even if it just
> increases exposure of a preexisting bug, needs to be handled by the
> committer of the exposing change. And even if that bug is on a platform
> the committer doesn't have.  And all that despite the issue getting
> attention.

I don't think you can generalize from what Noah said like that,
because it's always a matter of degree (the degree to which the
preexisting bug was a problem). Abbreviated keys for collated text
were disabled, though not due to bug in strxfrm(). Technically, it was
due to a bug in strcoll(), which glibc always had. strxfrm() therefore
only failed to be bug compatible with glibc's strcoll(). Does that
mean that we were wrong to disable the use of strxfrm() for
abbreviated keys?

I think that it's useful for these things to be handled in an
adversarial manner, in the same way that litigation is adversarial in
a common law court. I doubt that Noah actually set out to demoralize
anyone. He is just doing the job he was assigned.

-- 
Peter Geoghegan



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-08-11 18:11:03 +0530, Augustine, Jobin wrote:
>> Appears that patch is not helping.

> That's too bad. Any chance you could install
> https://docs.microsoft.com/en-us/sysinternals/downloads/procmon and
> activate monitoring just for that phase? I think it can export to a text
> file...

It strikes me that maybe the misuse of io_flag could be contributing
to this: if the walreceiver process's latch were set, we'd end up calling
PQconnectPoll before the socket had necessarily come ready, which would
produce the described symptom.  That's grasping at straws admittedly,
because I'm not sure why the walreceiver process's latch would be set
at this point; but it seems like we ought to test a version of the patch
that we believe correct before deciding that we still have a problem.

To move things along, here's a corrected patch --- Jobin, please test.

            regards, tom lane

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index de03362..37b481c 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -168,13 +168,18 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
     status = PGRES_POLLING_WRITING;
     do
     {
-        /* Wait for socket ready and/or other events. */
         int            io_flag;
         int            rc;

-        io_flag = (status == PGRES_POLLING_READING
-                   ? WL_SOCKET_READABLE
-                   : WL_SOCKET_WRITEABLE);
+        if (status == PGRES_POLLING_READING)
+            io_flag = WL_SOCKET_READABLE;
+#ifdef WIN32
+        /* Windows needs a different test while waiting for connection-made */
+        else if (PQstatus(conn->streamConn) == CONNECTION_STARTED)
+            io_flag = WL_SOCKET_CONNECTED;
+#endif
+        else
+            io_flag = WL_SOCKET_WRITEABLE;

         rc = WaitLatchOrSocket(MyLatch,
                                WL_POSTMASTER_DEATH |
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364..c7345de 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -374,11 +374,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
         AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
                           NULL, NULL);

-    if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+    if (wakeEvents & WL_SOCKET_MASK)
     {
         int            ev;

-        ev = wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
+        ev = wakeEvents & WL_SOCKET_MASK;
         AddWaitEventToSet(set, ev, sock, NULL, NULL);
     }

@@ -390,8 +390,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
     {
         ret |= event.events & (WL_LATCH_SET |
                                WL_POSTMASTER_DEATH |
-                               WL_SOCKET_READABLE |
-                               WL_SOCKET_WRITEABLE);
+                               WL_SOCKET_MASK);
     }

     FreeWaitEventSet(set);
@@ -640,10 +639,12 @@ FreeWaitEventSet(WaitEventSet *set)
  * Add an event to the set. Possible events are:
  * - WL_LATCH_SET: Wait for the latch to be set
  * - WL_POSTMASTER_DEATH: Wait for postmaster to die
- * - WL_SOCKET_READABLE: Wait for socket to become readable
- *     can be combined in one event with WL_SOCKET_WRITEABLE
- * - WL_SOCKET_WRITEABLE: Wait for socket to become writeable
- *     can be combined with WL_SOCKET_READABLE
+ * - WL_SOCKET_READABLE: Wait for socket to become readable,
+ *     can be combined in one event with other WL_SOCKET_* events
+ * - WL_SOCKET_WRITEABLE: Wait for socket to become writeable,
+ *     can be combined with other WL_SOCKET_* events
+ * - WL_SOCKET_CONNECTED: Wait for socket connection to be established,
+ *     can be combined with other WL_SOCKET_* events
  *
  * Returns the offset in WaitEventSet->events (starting from 0), which can be
  * used to modify previously added wait events using ModifyWaitEvent().
@@ -652,9 +653,9 @@ FreeWaitEventSet(WaitEventSet *set)
  * i.e. it must be a process-local latch initialized with InitLatch, or a
  * shared latch associated with the current process by calling OwnLatch.
  *
- * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are
- * reported by returning the socket as readable/writable or both, depending on
- * WL_SOCKET_READABLE/WRITEABLE being specified.
+ * In the WL_SOCKET_READABLE/WRITEABLE/CONNECTED cases, EOF and error
+ * conditions are reported by returning the socket as readable/writable or
+ * both, depending on WL_SOCKET_READABLE/WRITEABLE/CONNECTED being specified.
  *
  * The user_data pointer specified here will be set for the events returned
  * by WaitEventSetWait(), allowing to easily associate additional data with
@@ -685,8 +686,7 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch,
     }

     /* waiting for socket readiness without a socket indicates a bug */
-    if (fd == PGINVALID_SOCKET &&
-        (events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)))
+    if (fd == PGINVALID_SOCKET && (events & (WL_SOCKET_MASK)))
         elog(ERROR, "cannot wait on socket event without a socket");

     event = &set->events[set->nevents];
@@ -885,6 +885,8 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
             flags |= FD_READ;
         if (event->events & WL_SOCKET_WRITEABLE)
             flags |= FD_WRITE;
+        if (event->events & WL_SOCKET_CONNECTED)
+            flags |= FD_CONNECT;

         if (*handle == WSA_INVALID_EVENT)
         {
@@ -1395,7 +1397,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
             returned_events++;
         }
     }
-    else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+    else if (cur_event->events & WL_SOCKET_MASK)
     {
         WSANETWORKEVENTS resEvents;
         HANDLE        handle = set->handles[cur_event->pos + 1];
@@ -1432,6 +1434,12 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
             /* writeable */
             occurred_events->events |= WL_SOCKET_WRITEABLE;
         }
+        if ((cur_event->events & WL_SOCKET_CONNECTED) &&
+            (resEvents.lNetworkEvents & FD_CONNECT))
+        {
+            /* connected */
+            occurred_events->events |= WL_SOCKET_CONNECTED;
+        }
         if (resEvents.lNetworkEvents & FD_CLOSE)
         {
             /* EOF */
@@ -1439,6 +1447,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
                 occurred_events->events |= WL_SOCKET_READABLE;
             if (cur_event->events & WL_SOCKET_WRITEABLE)
                 occurred_events->events |= WL_SOCKET_WRITEABLE;
+            if (cur_event->events & WL_SOCKET_CONNECTED)
+                occurred_events->events |= WL_SOCKET_CONNECTED;
         }

         if (occurred_events->events != 0)
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 73abfaf..8af0266 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -126,6 +126,16 @@ typedef struct Latch
 #define WL_SOCKET_WRITEABLE  (1 << 2)
 #define WL_TIMEOUT             (1 << 3)    /* not for WaitEventSetWait() */
 #define WL_POSTMASTER_DEATH  (1 << 4)
+#ifdef WIN32
+#define WL_SOCKET_CONNECTED  (1 << 5)
+#else
+/* avoid having to to deal with case on platforms not requiring it */
+#define WL_SOCKET_CONNECTED     WL_SOCKET_WRITEABLE
+#endif
+
+#define WL_SOCKET_MASK        (WL_SOCKET_READABLE | \
+                             WL_SOCKET_WRITEABLE | \
+                             WL_SOCKET_CONNECTED)

 typedef struct WaitEvent
 {

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> I think that it's useful for these things to be handled in an
> adversarial manner, in the same way that litigation is adversarial in
> a common law court. I doubt that Noah actually set out to demoralize
> anyone. He is just doing the job he was assigned.

FWIW, I agree that Noah is just carrying out the RMT's task as assigned.
However, the only thing that Peter could really do about this of his own
authority is to revert 1e8a850, which I certainly think should be the last
resort not the first.  We are continuing to make progress towards finding
a better solution, I think, and since no particular deadline is imminent
we should let that process play out.

I think what Peter should do to satisfy the RMT process is to state that
if no better solution is found by date X, he will revert 1e8a850 (or
somehow mask the problem by removing functionality), and he sees no
need for additional progress reports before that.  Date X could be
sometime early in September, perhaps.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
"Augustine, Jobin" <jobin.augustine@openscg.com> writes:
> Appears that patch is not helping.
> Errors like below are still appearing in the log
> ===============================
> 2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x00002749/10057)
>     could not send startup packet: Socket is not connected
> (0x00002749/10057)

BTW, I notice that this is a bit different from your first report, because
that mentioned "could not send SSL negotiation packet" rather than "could
not send startup packet".  If the initial report was about a setup in
which SSL was enabled, and in this report's setup it isn't, then that
difference is unsurprising; but otherwise it needs explanation.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:
On 2017-08-13 16:55:33 -0400, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > I think that it's useful for these things to be handled in an
> > adversarial manner, in the same way that litigation is adversarial in
> > a common law court. I doubt that Noah actually set out to demoralize
> > anyone. He is just doing the job he was assigned.
> 
> FWIW, I agree that Noah is just carrying out the RMT's task as
> assigned.

Well, then that's a sign that the tasks/process need to be rescoped.


> However, the only thing that Peter could really do about this of his own
> authority is to revert 1e8a850, which I certainly think should be the last
> resort not the first.  We are continuing to make progress towards finding
> a better solution, I think, and since no particular deadline is imminent
> we should let that process play out.

I agree that that'd be a bad fix. There's other things that should, but
don't, really use the asynchronous API, e.g. postgres_fdw.

As a measure of last restart we could add a libpq workaround that forces
a pqSocketCheck() at the right moment, while still establishing a
connection.  That's not good from an interruptability perspective, but
better than blocking for the entire connection establishment.

Greetings,

Andres Freund



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> As a measure of last restart we could add a libpq workaround that forces
> a pqSocketCheck() at the right moment, while still establishing a
> connection.  That's not good from an interruptability perspective, but
> better than blocking for the entire connection establishment.

Probably a better idea is to fix things so that can't-send-yet results
in returning PGRES_POLLING_WRITING rather than a failure.  That would
result in a busy-wait in this scenario on Windows, which is arguably
better than blocking.  It would also make this part of libpq more robust
against applications being sloppy about socket readiness checks (which,
you could argue, is exactly what libpqwalreceiver is being).  But it
would be a somewhat ticklish change because of the portability hazards,
so I'm really disinclined to do it this late in beta.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Geoghegan
Date:
On Sun, Aug 13, 2017 at 2:22 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-08-13 16:55:33 -0400, Tom Lane wrote:
>> Peter Geoghegan <pg@bowt.ie> writes:
>> > I think that it's useful for these things to be handled in an
>> > adversarial manner, in the same way that litigation is adversarial in
>> > a common law court. I doubt that Noah actually set out to demoralize
>> > anyone. He is just doing the job he was assigned.
>>
>> FWIW, I agree that Noah is just carrying out the RMT's task as
>> assigned.
>
> Well, then that's a sign that the tasks/process need to be rescoped.

Why? I might agree if the RMT had an outsized influence on final
outcome. If that's the case, then it's something that I missed.

I also don't think that it's fair to expect Noah to spend a lot of
time poring over whether sending out one of those pro forma status
update policy violation mails is warranted. I expect someone is his
position to aim to err on the side of sending out too many of those.
It's easy for the patch author to explain the situation, but hard for
the RMT to understand the situation fully at all times.

-- 
Peter Geoghegan



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
"Augustine, Jobin"
Date:
Thank you Tom Lane,
This patch fixes the problem.

With this patch, streaming replication started working (replication to Windows)

(Tested for Linux to Windows replication)

-Jobin.

On Mon, Aug 14, 2017 at 2:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
> On 2017-08-11 18:11:03 +0530, Augustine, Jobin wrote:
>> Appears that patch is not helping.

> That's too bad. Any chance you could install
> https://docs.microsoft.com/en-us/sysinternals/downloads/procmon and
> activate monitoring just for that phase? I think it can export to a text
> file...

It strikes me that maybe the misuse of io_flag could be contributing
to this: if the walreceiver process's latch were set, we'd end up calling
PQconnectPoll before the socket had necessarily come ready, which would
produce the described symptom.  That's grasping at straws admittedly,
because I'm not sure why the walreceiver process's latch would be set
at this point; but it seems like we ought to test a version of the patch
that we believe correct before deciding that we still have a problem.

To move things along, here's a corrected patch --- Jobin, please test.

                        regards, tom lane




--

Jobin Augustine
Architect : Production Database Operations

OpenSCG

phone : +91 9989932600

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Eisentraut
Date:
On 8/13/17 15:46, Noah Misch wrote:
> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
> for your status update.  Please reacquaint yourself with the policy on open
> item ownership[1] and then reply immediately.  If I do not hear from you by
> 2017-08-14 20:00 UTC, I will transfer this item to release management team
> ownership without further notice.

It appears there is a patch that fixes the issue now.  Let's wait until
Wednesday to see if that ends up being successful.

Otherwise, the plan forward would be to decorate the change in the
original commit with an #ifndef WIN32.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> It appears there is a patch that fixes the issue now.  Let's wait until
> Wednesday to see if that ends up being successful.

Jobim reported success with add-connected-event-2.patch --- are you
expecting more testing to happen?

I did instrument the loop in libpqrcv_connect and verified that the
process latch is set the first time through.  I haven't traced down
exactly why it's set at that point, but that confirms why Andres'
initial patch didn't work and the new one did.

I think we could commit add-connected-event-2.patch and call this
issue resolved.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Eisentraut
Date:
On 8/14/17 10:57, Tom Lane wrote:
> I think we could commit add-connected-event-2.patch and call this
> issue resolved.

Would you like to commit your patch then?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 8/14/17 10:57, Tom Lane wrote:
>> I think we could commit add-connected-event-2.patch and call this
>> issue resolved.

> Would you like to commit your patch then?

It's really Andres' patch, but I can push it.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:


On August 15, 2017 7:04:59 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 8/14/17 10:57, Tom Lane wrote:
>>> I think we could commit add-connected-event-2.patch and call this
>>> issue resolved.
>
>> Would you like to commit your patch then?
>
>It's really Andres' patch, but I can push it.

I'll in a bit, unless you prefer to do it. You seem busy enough ;)

I don't see any relevant uses of sockets in older branches, did I miss something?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Andres Freund
Date:

On August 15, 2017 8:07:43 AM PDT, Andres Freund <andres@anarazel.de> wrote:
>
>
>On August 15, 2017 7:04:59 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>> On 8/14/17 10:57, Tom Lane wrote:
>>>> I think we could commit add-connected-event-2.patch and call this
>>>> issue resolved.
>>
>>> Would you like to commit your patch then?
>>
>>It's really Andres' patch, but I can push it.
>
>I'll in a bit, unless you prefer to do it. You seem busy enough ;)

Hah, synchronicity. That works too ;)
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I don't see any relevant uses of sockets in older branches, did I miss something?

No, I think we don't need to go back further than v10.  Even if it turns
out we do, I'd just as soon let this get a bit of field testing first.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
"Augustine, Jobin"
Date:
This patch is fixing similar problem reported in Logical replication also by another user.

Message:
20170524173644.1440.53348@wrigleys.postgresql.org
Bug reference:      14669
Logged by:          Igor Neyman
Email address:      ineyman(at)perceptron(dot)com
PostgreSQL version: 10beta1
Operating system:   Windows

Regards,
Jobin.


Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows isbroken

From
Tatsuo Ishii
Date:
> Thank you Tom Lane,
> This patch fixes the problem.
> 
> With this patch, streaming replication started working (replication to
> Windows)
> 
> (Tested for Linux to Windows replication)

Do we allow streaming replication among different OS? I thought it is
required that primary and standbys are same platform (in my
understanding this means the same hardware architecture and OS) in
streaming replication.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Michael Paquier
Date:
On Thu, Aug 17, 2017 at 8:13 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> Do we allow streaming replication among different OS?

No. WAL is a binary format.

> I thought it is
> required that primary and standbys are same platform (in my
> understanding this means the same hardware architecture and OS) in
> streaming replication.

Yep, I recall the same requirement.
-- 
Michael



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Thu, Aug 17, 2017 at 8:13 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> Do we allow streaming replication among different OS?

> No. WAL is a binary format.

>> I thought it is
>> required that primary and standbys are same platform (in my
>> understanding this means the same hardware architecture and OS) in
>> streaming replication.

> Yep, I recall the same requirement.

In practice it's largely about architecture, I think.  You definitely
need the same endianness and floating-point format, which are hardware,
and you need the same MAXALIGN, which is partly hardware but in principle
could be chosen differently in different ABI conventions for the same
hardware.  (At least for non-alignment-picky hardware like Intel.
Alignment-picky hardware is likely going to dictate that choice too.)

We disclaim cross-OS portability partly because of the possible influence
of different ABI conventions, but it doesn't surprise me in the least if
in practice 64-bit Windows can interoperate with x86_64 Linux.  (Less sure
about the 32-bit case --- it looks like pg_config.h.win32 chooses
MAXALIGN 8 in all cases, which would mean 32-bit Windows is more likely to
interoperate with 64-bit Linux than 32-bit Linux.)

The thing that's really likely to bite you cross-platform is dependency
of textual index sort ordering on non-C locale definitions.  But that
wouldn't show up in a quick smoke test of replication ... and if you
really wanted, you could use C locale on both ends.
        regards, tom lane



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Peter Eisentraut
Date:
On 8/16/17 19:41, Michael Paquier wrote:
> On Thu, Aug 17, 2017 at 8:13 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> Do we allow streaming replication among different OS?
> 
> No. WAL is a binary format.
> 
>> I thought it is
>> required that primary and standbys are same platform (in my
>> understanding this means the same hardware architecture and OS) in
>> streaming replication.
> 
> Yep, I recall the same requirement.

He meant logical replication, but the code in question here is the same
for streaming replication, or whatever it's called.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows isbroken

From
Tatsuo Ishii
Date:
> In practice it's largely about architecture, I think.  You definitely
> need the same endianness and floating-point format, which are hardware,
> and you need the same MAXALIGN, which is partly hardware but in principle
> could be chosen differently in different ABI conventions for the same
> hardware.  (At least for non-alignment-picky hardware like Intel.
> Alignment-picky hardware is likely going to dictate that choice too.)
> 
> We disclaim cross-OS portability partly because of the possible influence
> of different ABI conventions, but it doesn't surprise me in the least if
> in practice 64-bit Windows can interoperate with x86_64 Linux.  (Less sure
> about the 32-bit case --- it looks like pg_config.h.win32 chooses
> MAXALIGN 8 in all cases, which would mean 32-bit Windows is more likely to
> interoperate with 64-bit Linux than 32-bit Linux.)

I feel like we need to be a little bit more cleaner about this
in our official documentation. From section 26.2.1. Planning:

"Hardware need not be exactly the same, but experience shows that
maintaining two identical systems is easier than maintaining two
dissimilar ones over the lifetime of the application and system. In
any case the hardware architecture must be the same ― shipping from,
say, a 32-bit to a 64-bit system will not work."

Probably We should disclaim cross-OS portability here?

> The thing that's really likely to bite you cross-platform is dependency
> of textual index sort ordering on non-C locale definitions.  But that
> wouldn't show up in a quick smoke test of replication ... and if you
> really wanted, you could use C locale on both ends.

Of course.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows isbroken

From
Tatsuo Ishii
Date:
> He meant logical replication,

Oh I could not find he meant logical replication in the original
report.

> but the code in question here is the same
> for streaming replication, or whatever it's called.

Yes.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

From
Michael Paquier
Date:
On Thu, Aug 17, 2017 at 9:15 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> He meant logical replication,
>
> Oh I could not find he meant logical replication in the original
> report.

The second message of the thread says so, but the first does not
mention logical replication at all.
From here are mentioned PG 9.6 and pg_basebackup:
https://www.postgresql.org/message-id/CAHBggj8g2T+ZDcACZ2FmzX9CTxkWjKBsHd6NkYB4i9Ojf6K1Fw@mail.gmail.com.
Explaining the confusion.
-- 
Michael



Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows isbroken

From
Tatsuo Ishii
Date:
> On Thu, Aug 17, 2017 at 9:15 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>>> He meant logical replication,
>>
>> Oh I could not find he meant logical replication in the original
>> report.
> 
> The second message of the thread says so, but the first does not
> mention logical replication at all.
>>From here are mentioned PG 9.6 and pg_basebackup:
> https://www.postgresql.org/message-id/CAHBggj8g2T+ZDcACZ2FmzX9CTxkWjKBsHd6NkYB4i9Ojf6K1Fw@mail.gmail.com.
> Explaining the confusion.

Oh, I see it now. Thanks.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp