Thread: [BUGS] Replication to Postgres 10 on Windows is broken
Replica / Slave on Windows fails to connect to Master.
Scenario:
Master: PostgreSQL 10 (beta1,beta2), Any OS (Linux, Windows, Mac)
Slave : PostgreSQL 10 (beta1,beta2), Windows (Any MS Windows OS)
The slave must be on a Windows machine running on a different server than the Master.
Steps to reproduce:
1. Set up a Postgres 10betaX master database; it can be on any platform (Linux/Windows/Mac).
2. Try to set up a streaming replica on any MS Windows machine running on a different server than the master
(replication within the same server works!)
Error in Postgres logs:
Lines like the ones below keep repeating in the slave-side postgres log:
----------------------
2017-08-03 10:49:41 UTC [2108]: [1-1] user=,db=,app=,client= FATAL: could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
2017-08-03 10:49:45 UTC [3600]: [1-1] user=,db=,app=,client= FATAL: could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
2017-08-03 10:49:50 UTC [4832]: [1-1] user=,db=,app=,client= FATAL: could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)
-------------------------
Additional Information:
Standalone tools like psql, pg_basebackup, pg_receivewal etc. connect to the primary server successfully and perform their functions.
Postgres 9.6 replication works fine across the same set of servers.
This Windows-specific issue is reproducible on both pg10beta1 and beta2.
This has been tested and confirmed by multiple people on my team across different Windows versions.
From: pgsql-bugs-owner@postgresql.org [mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Augustine, Jobin
Sent: Friday, August 04, 2017 1:54 AM
To: pgsql-bugs@postgresql.org
Subject: [BUGS] Replication to Postgres 10 on Windows is broken
Using PG10 Beta1 and Beta2 I'm getting the same error:

Socket is not connected (0x00002749/10057)
could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)

with logical replication in a Windows environment, when I try to CREATE SUBSCRIPTION, even though any other connection from the subscriber server to the publisher server works quite fine; for instance, there are no errors when using postgres_fdw to connect from the subscriber server to the publisher server.

I reported this issue on this forum as Bug #14669, but didn't get any replies.

It must be a common issue between streaming and logical replication in PG10 beta in a Windows environment.

I'd appreciate it if someone from 2ndQuadrant could take a look at this issue.
Regards,
Igor Neyman
On Fri, Aug 04, 2017 at 11:23:43AM +0530, Augustine, Jobin wrote:
> Replica / Slave on Windows fails to connect to Master.
>
> Master: PostgreSQL 10 (beta1, beta2), any OS (Linux, Windows, Mac)
> Slave: PostgreSQL 10 (beta1, beta2), Windows (any MS Windows OS)
> The slave must be on a Windows machine running on a different server than the Master.
>
> Steps to reproduce:
> 1. Set up a Postgres 10betaX master database; it can be on any platform (Linux/Windows/Mac).
> 2. Try to set up a streaming replica on any MS Windows machine running on a different server than the master.

I've added this as an open item. Confirmed in this setup:

-- Server
Debian 9
PostgreSQL 10beta2 from apt.postgresql.org
edit postgresql.conf to allow remote connections
edit pg_hba.conf to allow remote connections
create replication user w/ password

-- Client
Windows Server 2016
postgresql-10.0-beta2-windows-x64-binaries.zip from EnterpriseDB
pg_basebackup -h myserver -D datadir -X stream -R
install ordinary *.conf
postgres -D datadir

> Standalone tools like psql, pg_basebackup, pg_receivewal etc. connect to
> the primary server successfully and perform their functions.

I can confirm each of those, too. I don't, however, see a smoking gun among commits. Would you bisect the commits since 9.6 and see which one broke things?

Noah Misch <noah@leadboat.com> writes:
> I've added this as an open item. Confirmed in this setup:

> -- Client
> Windows Server 2016
> postgresql-10.0-beta2-windows-x64-binaries.zip from EnterpriseDB

I wonder whether the other complainants were using EDB's build, and if not, just what were they using. The indirect question is: what version of OpenSSL is the Windows build using?

> I don't, however, see a smoking gun among commits. Would you bisect the
> commits since 9.6 and see which one broke things?

Gut instinct says that the reason this case fails when other tools can connect successfully is that libpqwalreceiver is the only tool that uses PQconnectStart/PQconnectPoll rather than a plain PQconnectdb, and that there is some behavioral difference between connectDBComplete's wait loop and libpqrcv_connect's wait loop that OpenSSL is sensitive to --- but only on Windows, and maybe only on particular OpenSSL versions.

I have some memory of having tweaked libpqrcv_connect's loop once already, but I can't find a commit for that right now.

A different line of thought is that maybe we broke it with the changes for OpenSSL 1.1.0. Again, it would be very important to know which OpenSSL version we're dealing with here.

			regards, tom lane

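[For readers following the distinction being drawn here, a minimal sketch of the documented nonblocking libpq pattern that libpqwalreceiver uses, simplified with error handling trimmed; this is not an excerpt from the PostgreSQL source. PQconnectdb() drives the same state machine internally through connectDBComplete()'s select()-based wait, while a PQconnectStart()/PQconnectPoll() caller supplies its own wait, which is where the two paths can diverge.]

    #include <sys/select.h>
    #include <libpq-fe.h>

    static PGconn *
    connect_nonblocking(const char *conninfo)
    {
        PGconn     *conn = PQconnectStart(conninfo);
        PostgresPollingStatusType status = PGRES_POLLING_WRITING;

        if (conn == NULL || PQstatus(conn) == CONNECTION_BAD)
            return conn;

        while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED)
        {
            int         sock = PQsocket(conn);
            fd_set      rfds,
                        wfds;

            FD_ZERO(&rfds);
            FD_ZERO(&wfds);
            if (status == PGRES_POLLING_READING)
                FD_SET(sock, &rfds);
            else
                FD_SET(sock, &wfds);

            /*
             * On Unix, select() reports a connecting socket as writable only
             * once the connect has completed.  libpqwalreceiver replaces this
             * wait with WaitLatchOrSocket(), and on Windows that wait behaves
             * differently, as the discussion below works out.
             */
            if (select(sock + 1, &rfds, &wfds, NULL, NULL) < 0)
                break;

            status = PQconnectPoll(conn);
        }
        return conn;
    }
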
On Sun, Aug 06, 2017 at 11:17:57AM -0400, Tom Lane wrote:
> I wonder whether the other complainants were using EDB's build,
> and if not, just what were they using. The indirect question is:
> what version of OpenSSL is the Windows build using?

Those binaries I used have OpenSSL 1.0.2l.

> Gut instinct says that the reason this case fails when other tools
> can connect successfully is that libpqwalreceiver is the only tool
> that uses PQconnectStart/PQconnectPoll rather than a plain
> PQconnectdb, and that there is some behavioral difference between
> connectDBComplete's wait loop and libpqrcv_connect's wait loop

That would fit. Until v10 (commit 1e8a850), PQconnectStart() had no in-tree callers outside of libpq itself.

> OpenSSL is sensitive to --- but only on Windows, and maybe only on
> particular OpenSSL versions.

The failure is so early, before pgtls_init(), that I doubt OpenSSL version is a factor.

I wrote:
> Gut instinct says that the reason this case fails when other tools
> can connect successfully is that libpqwalreceiver is the only tool
> that uses PQconnectStart/PQconnectPoll rather than a plain
> PQconnectdb, and that there is some behavioral difference between
> connectDBComplete's wait loop and libpqrcv_connect's wait loop that
> OpenSSL is sensitive to --- but only on Windows, and maybe only on
> particular OpenSSL versions.

On closer inspection, I take that back. This can't be directly OpenSSL's fault, because those error messages come out before libpq has invoked OpenSSL at all; in particular we see

2017-08-03 10:49:41 UTC [2108]: [1-1] user=,db=,app=,client= FATAL: could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
	could not send SSL negotiation packet: Socket is not connected (0x00002749/10057)

and "could not send SSL negotiation packet" certainly must occur before we've asked OpenSSL to do anything.

What seems likely to me at this point is that the changes in PQconnectPoll() to support multiple hosts are somehow responsible. It must still be connected to libpqwalreceiver's different wait loop, but the details are unclear. It would likely be useful to add some debug logging to PQconnectPoll to find out what set of addresses it's seeing and whether this failure occurs after having advanced over some of them.

			regards, tom lane

Noah Misch <noah@leadboat.com> writes:
> On Sun, Aug 06, 2017 at 11:17:57AM -0400, Tom Lane wrote:
>> Gut instinct says that the reason this case fails when other tools
>> can connect successfully is that libpqwalreceiver is the only tool
>> that uses PQconnectStart/PQconnectPoll rather than a plain
>> PQconnectdb, and that there is some behavioral difference between
>> connectDBComplete's wait loop and libpqrcv_connect's wait loop

> That would fit. Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> callers outside of libpq itself.

Yeah. After some digging around I think I see exactly what is happening.

The error message would be better read as "Socket is not connected *yet*", that is, the problem is that we're trying to write data before the nonblocking connection request has completed. (This fits with the OP's observation that local loopback connections work fine --- they probably complete immediately.) PQconnectPoll believes that it just has to wait for write-ready when waiting for a connection to complete. When using connectDBComplete's wait loop, that reduces to a call to Windows' version of select(2), in pqSocketPoll, and according to

https://msdn.microsoft.com/en-us/library/windows/desktop/ms740141(v=vs.85).aspx

"The parameter writefds identifies the sockets that are to be checked for writability. If a socket is processing a connect call (nonblocking), a socket is writeable if the connection establishment successfully completes."

On the other hand, in libpqwalreceiver, we're depending on latch.c's implementation, and it uses WSAEventSelect's FD_WRITE event:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576(v=vs.85).aspx

If I'm reading that correctly, FD_WRITE is set instantly by the connect request, probably even in the nonblock case, and it only gets cleared by a failed write request. It looks to me like we would have to specifically look for FD_CONNECT, *not* FD_WRITE, to make this work.

This is problematic, because the APIs in between don't provide a way to report that we're still waiting for connect rather than for data-write-ready. Anybody have the stomach for extending PQconnectPoll's API with an extra PGRES_POLLING_CONNECTING state? If not, can we tell in WaitEventAdjustWin32 that the socket is still connecting and we must substitute FD_CONNECT for FD_WRITE?

			regards, tom lane

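[To illustrate the Winsock semantics being cited, a standalone sketch, not code from PostgreSQL or from the eventual patch. It assumes a nonblocking socket on which connect() has already returned WSAEWOULDBLOCK, omits error checking, and needs ws2_32 at link time. Registering for FD_CONNECT is what signals connect completion; FD_WRITE alone does not reliably indicate that the connect has finished, which is the behavioral difference described above.]

    #include <winsock2.h>

    /* Returns 1 once the pending nonblocking connect on s has completed. */
    static int
    wait_for_connect(SOCKET s, DWORD timeout_ms)
    {
        WSAEVENT    ev = WSACreateEvent();
        WSANETWORKEVENTS net;
        int         connected = 0;

        /* Register for connect completion, not just FD_WRITE. */
        WSAEventSelect(s, ev, FD_CONNECT);

        if (WSAWaitForMultipleEvents(1, &ev, FALSE, timeout_ms, FALSE) == WSA_WAIT_EVENT_0)
        {
            WSAEnumNetworkEvents(s, ev, &net);
            if ((net.lNetworkEvents & FD_CONNECT) &&
                net.iErrorCode[FD_CONNECT_BIT] == 0)
                connected = 1;
        }

        WSACloseEvent(ev);
        return connected;
    }
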
Hi,

On 2017-08-06 12:29:07 -0400, Tom Lane wrote:
> Yeah. After some digging around I think I see exactly what is happening.
> [...]
> If I'm reading that correctly, FD_WRITE is set instantly by the connect
> request, probably even in the nonblock case, and it only gets cleared
> by a failed write request. It looks to me like we would have to
> specifically look for FD_CONNECT, *not* FD_WRITE, to make this work.

Nice digging.

> This is problematic, because the APIs in between don't provide a way
> to report that we're still waiting for connect rather than for
> data-write-ready. Anybody have the stomach for extending PQconnectPoll's
> API with an extra PGRES_POLLING_CONNECTING state?

I'm a bit hesitant to do so at this phase of the release cycle, it'd kind of force all users to upgrade their code, and I'm sure there's a couple out-of-tree ones. And not just code explicitly using new versions of libpq, also users of old versions - several distributions just install newer libpq versions and rely on it being compatible.

> If not, can we tell in
> WaitEventAdjustWin32 that the socket is still connecting and we must
> substitute FD_CONNECT for FD_WRITE?

I was wondering, for a second, if we should just always use FD_CONNECT once in every set. But unfortunately there's plenty places that create/destroy sets at a high enough speed for that to not be a nice solution.

A third solution would be to, for v10, add a #ifdef WIN32 block to libpqrcv_connect() that just waits till FD_CONNECT is ready. That has the disadvantage of not accepting interrupts, but still seems better than not working at all. That's not much of a real solution, but this late in the cycle it might be advisable to hold our noses :(

Greetings,

Andres Freund

Andres Freund <andres@anarazel.de> writes:
> On 2017-08-06 12:29:07 -0400, Tom Lane wrote:
>> This is problematic, because the APIs in between don't provide a way
>> to report that we're still waiting for connect rather than for
>> data-write-ready. Anybody have the stomach for extending PQconnectPoll's
>> API with an extra PGRES_POLLING_CONNECTING state?

> I'm a bit hesitant to do so at this phase of the release cycle,

I'm pretty hesitant to do that at *any* phase of the release cycle, it's just too likely to break user code.

However, there's always more than one way to skin the cat. What we need to know is whether PQconnectPoll's state machine is in CONNECTION_STARTED state, and PQstatus() does expose that, even though it claims it's an internal state that callers shouldn't rely on. So here's a sketch:

* Invent a separate WL_SOCKET_CONNECTED flag for WaitLatchOrSocket. On Unix we can treat that the same as WL_SOCKET_WRITEABLE, on Windows it maps to FD_CONNECT.

* In libpqwalreceiver, do something like

	io_flag = (status == PGRES_POLLING_READING
			   ? WL_SOCKET_READABLE
#ifdef WIN32
			   : PQstatus(conn->streamConn) == CONNECTION_STARTED
			   ? WL_SOCKET_CONNECTED
#endif
			   : WL_SOCKET_WRITEABLE);

The #ifdef's optional, but no reason to spend extra cycles if we don't have to. (Not sure if this is adequately parenthesized...)

			regards, tom lane

Hi,

On 2017-08-06 13:34:42 -0400, Tom Lane wrote:
> I'm pretty hesitant to do that at *any* phase of the release cycle,
> it's just too likely to break user code.

Right, that's what I went on to say ;)

> However, there's always more than one way to skin the cat. What we need
> to know is whether PQconnectPoll's state machine is in CONNECTION_STARTED
> state, and PQstatus() does expose that, even though it claims it's an
> internal state that callers shouldn't rely on. So here's a sketch:
>
> * Invent a separate WL_SOCKET_CONNECTED flag for WaitLatchOrSocket.
> On Unix we can treat that the same as WL_SOCKET_WRITEABLE, on Windows
> it maps to FD_CONNECT.

Yea, that should work, and could be useful-ish for things outside of libpqwalreceiver. E.g. parallel fdw scan, that IIRC somebody is working on, seems like it very well could run into similar issues if we want to handle interrupts, which we should...

> * In libpqwalreceiver, do something like
> [the io_flag conditional above]
>
> The #ifdef's optional, but no reason to spend extra cycles if we don't
> have to. (Not sure if this is adequately parenthesized...)

Let's use proper ifs ;)...

Greetings,

Andres Freund

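[For reference, the if-based shape this suggestion eventually takes in the corrected patch posted further down the thread looks roughly like the following; WL_SOCKET_CONNECTED is the new wait flag, which on non-Windows platforms is simply defined equal to WL_SOCKET_WRITEABLE.]

    if (status == PGRES_POLLING_READING)
        io_flag = WL_SOCKET_READABLE;
    #ifdef WIN32
    /* Windows needs a different test while waiting for connection-made */
    else if (PQstatus(conn->streamConn) == CONNECTION_STARTED)
        io_flag = WL_SOCKET_CONNECTED;
    #endif
    else
        io_flag = WL_SOCKET_WRITEABLE;
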
Hi,

On 2017-08-06 10:48:22 -0700, Andres Freund wrote:
> On 2017-08-06 13:34:42 -0400, Tom Lane wrote:
>> * Invent a separate WL_SOCKET_CONNECTED flag for WaitLatchOrSocket.
>> On Unix we can treat that the same as WL_SOCKET_WRITEABLE, on Windows
>> it maps to FD_CONNECT.
>
> Yea, that should work, and could be useful-ish for things outside of
> libpqwalreceiver. [...]

Is anybody working on a patch like this? I can try writing one blindly, if somebody else tests it. But if somebody natively on windows is working on it, that's going to be more efficient. It'd be nice to have something merged by wrap tomorrow...

Greetings,

Andres Freund

On 2017-08-06 13:07:42 -0700, Andres Freund wrote:
> Is anybody working on a patch like this? I can try writing one blindly,
> if somebody else tests it. But if somebody natively on windows is
> working on it, that's going to be more efficient. It'd be nice to have
> something merged by wrap tomorrow...

Here's a prototype patch implementing what Tom outlined. I've compiled it with mingw under linux, but haven't run it. Could some windows capable person take this for a spin?

Unfortunately we can't just push this to the BF and see what it says - our tests don't catch this one... And there's not that much time before the wrap.

Anybody have an opinion about adding ifs for WL_SOCKET_CONNECTED to !win32 implementations rather than redefining it to WL_SOCKET_WRITEABLE?

- Andres

Andres Freund <andres@anarazel.de> writes:
> Here's a prototype patch implementing what Tom outlined.

This bit is flat wrong:

-	int io_flag;
+	int io_flag = WL_POSTMASTER_DEATH | WL_LATCH_SET;

io_flag has to be *just* the I/O condition, because we use it in a test after the WaitLatchOrSocket call.

> Anybody have an opinion about adding ifs for WL_SOCKET_CONNECTED to
> !win32 implementations rather than redefining it to WL_SOCKET_WRITEABLE?

I fear it would complicate matters greatly, because you'd have to figure out which of the two flags to signal back after detecting socket writable. I think defining it as equal to WL_SOCKET_WRITEABLE is fine.

			regards, tom lane

On 2017-08-06 18:04:49 -0400, Tom Lane wrote:
> This bit is flat wrong:
>
> -	int io_flag;
> +	int io_flag = WL_POSTMASTER_DEATH | WL_LATCH_SET;
>
> io_flag has to be *just* the I/O condition, because we use it in a test
> after the WaitLatchOrSocket call.

Hm, right. Wouldn't be particularly consequential, but...

I'd actually consider just removing the if around

	/* If socket is ready, advance the libpq state machine */
	if (rc & io_flag)
		status = PQconnectPoll(conn->streamConn);

the only thing that protects us against is calling PQconnectPoll() when the latch has been set. Hardly problematic.

> I fear it would complicate matters greatly, because you'd have to figure
> out which of the two flags to signal back after detecting socket writable.
> I think defining it as equal to WL_SOCKET_WRITEABLE is fine.

Well, I'd have said, signal the one(s) back that have been requested. But I'm ok with the current state, adding a bunch of pointless branches didn't strike me as worthwhile...

Greetings,

Andres Freund

Andres Freund <andres@anarazel.de> writes:
> On 2017-08-06 18:04:49 -0400, Tom Lane wrote:
>> io_flag has to be *just* the I/O condition, because we use it in a test
>> after the WaitLatchOrSocket call.

> Hm, right. Wouldn't be particularly consequential, but...

Yes, it would be. The entire point of this loop is to avoid blocking in PQconnectPoll, and that's not guaranteed unless we honor the contract of only calling it when we've gotten the read-ready or write-ready status it asked for.

			regards, tom lane

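[To make that contract concrete, here is a simplified sketch of the wait loop under discussion, reconstructed for illustration rather than quoted verbatim from libpqrcv_connect(); the exit/interrupt handling and the wait-event bookkeeping are approximations. Because the return value is tested against io_flag, a wakeup caused only by the latch (or postmaster death) must not fall through into PQconnectPoll(), or that call could block.]

    status = PGRES_POLLING_WRITING;
    do
    {
        int         io_flag;
        int         rc;

        if (status == PGRES_POLLING_READING)
            io_flag = WL_SOCKET_READABLE;
        else
            io_flag = WL_SOCKET_WRITEABLE;  /* WL_SOCKET_CONNECTED on Windows
                                             * while the connect is pending */

        rc = WaitLatchOrSocket(MyLatch,
                               WL_POSTMASTER_DEATH | WL_LATCH_SET | io_flag,
                               PQsocket(conn->streamConn),
                               0,
                               0 /* wait_event_info omitted in this sketch */ );

        if (rc & WL_POSTMASTER_DEATH)
            exit(1);

        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }

        /* Only advance the libpq state machine if the socket is ready. */
        if (rc & io_flag)
            status = PQconnectPoll(conn->streamConn);
    } while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED);
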
On Sun, Aug 06, 2017 at 08:50:37AM -0700, Noah Misch wrote:
> On Sun, Aug 06, 2017 at 11:17:57AM -0400, Tom Lane wrote:
>> Gut instinct says that the reason this case fails when other tools
>> can connect successfully is that libpqwalreceiver is the only tool
>> that uses PQconnectStart/PQconnectPoll rather than a plain
>> PQconnectdb [...]
>
> That would fit. Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> callers outside of libpq itself.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

On 8/7/17 21:06, Noah Misch wrote:
>> That would fit. Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
>> callers outside of libpq itself.
> [Action required within three days. This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item. Peter,
> since you committed the patch believed to have created it, you own this open
> item.

I don't think I can usefully contribute to this. Could someone else take it?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 2017-08-08 19:25:37 -0400, Peter Eisentraut wrote:
> I don't think I can usefully contribute to this. Could someone else
> take it?

I've written up a patch at
http://archives.postgresql.org/message-id/20170806215521.d6fq4esvx7s5ejka%40alap3.anarazel.de
but I can't test this either. We really need somebody with access to Windows to verify whether it works. That patch needs some adjustments as remarked by Tom, but can be tested as is.

Regards,

Andres

On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> I don't think I can usefully contribute to this. Could someone else
> take it?

If nobody volunteers, you could always resolve this by reverting 1e8a850 and successors.

On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I've volunteered a fix nearby, just can't test it. I don't think we can require committers to work on windows.

- Andres

I am in an effort to create an independent build environment on Windows to test the patch from Andres. I shall come back with results, possibly in another 24 hours.

On Fri, Aug 11, 2017 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote:
> I've volunteered a fix nearby, just can't test it. I don't think we can
> require committers to work on windows.

--
Jobin Augustine
Architect : Production Database Operations
OpenSCG
phone : +91 9989932600

Noah Misch <noah@leadboat.com> writes:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
>> I don't think I can usefully contribute to this. Could someone else
>> take it?

> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I think you're blaming the victim. Our current theory about the cause of this is that on Windows, WaitLatchOrSocket cannot be used to wait for completion of a nonblocking connect() call. That seems pretty broken independently of whether libpqwalreceiver needs the capability.

In any case, we have a draft patch, so what we should be pressing for is for somebody to test it. Peter's not in a position to do that (and neither am I), but anyone who can build from source on Windows could do so.

			regards, tom lane

Appears that patch is not helping.
Errors like below are still appearing in the log
===============================
2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL: could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
	could not send startup packet: Socket is not connected (0x00002749/10057)
2017-08-11 12:22:40 UTC [1964]: [1-1] user=,db=,app=,client= FATAL: could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
	could not send startup packet: Socket is not connected (0x00002749/10057)
2017-08-11 12:22:45 UTC [248]: [1-1] user=,db=,app=,client= FATAL: could not connect to the primary server: could not send data to server: Socket is not connected (0x00002749/10057)
	could not send startup packet: Socket is not connected (0x00002749/10057)
===============================

-Jobin

--
Jobin Augustine
Architect : Production Database Operations
OpenSCG
phone : +91 9989932600

On Thu, Aug 10, 2017 at 09:59:40PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
>>> I don't think I can usefully contribute to this. Could someone else
>>> take it?

This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

>> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
>> successors.
>
> I think you're blaming the victim. Our current theory about the cause
> of this is that on Windows, WaitLatchOrSocket cannot be used to wait for
> completion of a nonblocking connect() call. That seems pretty broken
> independently of whether libpqwalreceiver needs the capability.

Yes, the theorized defect lies in APIs commit 1e8a850 used, not in the commit itself. Nonetheless, commit 1e8a850 promoted the defect from one reachable only by writing C code to one reachable by merely configuring replication on Windows according to the documentation. For that, its committer owns this open item. Besides the one approach I mentioned, there exist several other fine ways to implement said ownership.

> In any case, we have a draft patch, so what we should be pressing for
> is for somebody to test it.

Now done. (Thanks, Jobin.)

On Fri, Aug 11, 2017 at 08:56:22PM -0700, Noah Misch wrote:
> This PostgreSQL 10 open item is past due for your status update. Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update. Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due for your status update. Please reacquaint yourself with the policy on open item ownership[1] and then reply immediately. If I do not hear from you by 2017-08-14 20:00 UTC, I will transfer this item to release management team ownership without further notice.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

On 2017-08-11 20:56:22 -0700, Noah Misch wrote:
> Yes, the theorized defect lies in APIs commit 1e8a850 used, not in the commit
> itself. Nonetheless, commit 1e8a850 promoted the defect from one reachable
> only by writing C code to one reachable by merely configuring replication on
> Windows according to the documentation. For that, its committer owns this
> open item. Besides the one approach I mentioned, there exist several other
> fine ways to implement said ownership.

FWIW, I'm personally quite demotivated by this style of handling issues. You're essentially saying that any code change, even if it just increases exposure of a preexisting bug, needs to be handled by the committer of the exposing change. And even if that bug is on a platform the committer doesn't have. And all that despite the issue getting attention. I'm not ok with this.

Hi,

On 2017-08-11 18:11:03 +0530, Augustine, Jobin wrote:
> Appears that patch is not helping.
> Errors like below are still appearing in the log
> ===============================
> 2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL: could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x00002749/10057)
>         could not send startup packet: Socket is not connected
> (0x00002749/10057)
> [...]
> ===============================

That's too bad. Any chance you could install
https://docs.microsoft.com/en-us/sysinternals/downloads/procmon
and activate monitoring just for that phase? I think it can export to a text file...

Greetings,

Andres

On Sun, Aug 13, 2017 at 12:57 PM, Andres Freund <andres@anarazel.de> wrote:
> FWIW, I'm personally quite demotivated by this style of handling
> issues. You're essentially saying that any code change, even if it just
> increases exposure of a preexisting bug, needs to be handled by the
> committer of the exposing change. And even if that bug is on a platform
> the committer doesn't have. And all that despite the issue getting
> attention.

I don't think you can generalize from what Noah said like that, because it's always a matter of degree (the degree to which the preexisting bug was a problem).

Abbreviated keys for collated text were disabled, though not due to bug in strxfrm(). Technically, it was due to a bug in strcoll(), which glibc always had. strxfrm() therefore only failed to be bug compatible with glibc's strcoll(). Does that mean that we were wrong to disable the use of strxfrm() for abbreviated keys?

I think that it's useful for these things to be handled in an adversarial manner, in the same way that litigation is adversarial in a common law court. I doubt that Noah actually set out to demoralize anyone. He is just doing the job he was assigned.

-- 
Peter Geoghegan

Andres Freund <andres@anarazel.de> writes:
> On 2017-08-11 18:11:03 +0530, Augustine, Jobin wrote:
>> Appears that patch is not helping.

> That's too bad. Any chance you could install
> https://docs.microsoft.com/en-us/sysinternals/downloads/procmon and
> activate monitoring just for that phase? I think it can export to a text
> file...

It strikes me that maybe the misuse of io_flag could be contributing to this: if the walreceiver process's latch were set, we'd end up calling PQconnectPoll before the socket had necessarily come ready, which would produce the described symptom. That's grasping at straws admittedly, because I'm not sure why the walreceiver process's latch would be set at this point; but it seems like we ought to test a version of the patch that we believe correct before deciding that we still have a problem.

To move things along, here's a corrected patch --- Jobin, please test.

			regards, tom lane

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index de03362..37b481c 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -168,13 +168,18 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 	status = PGRES_POLLING_WRITING;
 	do
 	{
-		/* Wait for socket ready and/or other events. */
 		int			io_flag;
 		int			rc;
 
-		io_flag = (status == PGRES_POLLING_READING
-				   ? WL_SOCKET_READABLE
-				   : WL_SOCKET_WRITEABLE);
+		if (status == PGRES_POLLING_READING)
+			io_flag = WL_SOCKET_READABLE;
+#ifdef WIN32
+		/* Windows needs a different test while waiting for connection-made */
+		else if (PQstatus(conn->streamConn) == CONNECTION_STARTED)
+			io_flag = WL_SOCKET_CONNECTED;
+#endif
+		else
+			io_flag = WL_SOCKET_WRITEABLE;
 
 		rc = WaitLatchOrSocket(MyLatch,
 							   WL_POSTMASTER_DEATH |
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364..c7345de 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -374,11 +374,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
 						  NULL, NULL);
 
-	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+	if (wakeEvents & WL_SOCKET_MASK)
 	{
 		int			ev;
 
-		ev = wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
+		ev = wakeEvents & WL_SOCKET_MASK;
 		AddWaitEventToSet(set, ev, sock, NULL, NULL);
 	}
 
@@ -390,8 +390,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	{
 		ret |= event.events & (WL_LATCH_SET |
 							   WL_POSTMASTER_DEATH |
-							   WL_SOCKET_READABLE |
-							   WL_SOCKET_WRITEABLE);
+							   WL_SOCKET_MASK);
 	}
 
 	FreeWaitEventSet(set);
@@ -640,10 +639,12 @@ FreeWaitEventSet(WaitEventSet *set)
  * Add an event to the set.  Possible events are:
  * - WL_LATCH_SET: Wait for the latch to be set
  * - WL_POSTMASTER_DEATH: Wait for postmaster to die
- * - WL_SOCKET_READABLE: Wait for socket to become readable
- *	 can be combined in one event with WL_SOCKET_WRITEABLE
- * - WL_SOCKET_WRITEABLE: Wait for socket to become writeable
- *	 can be combined with WL_SOCKET_READABLE
+ * - WL_SOCKET_READABLE: Wait for socket to become readable,
+ *	 can be combined in one event with other WL_SOCKET_* events
+ * - WL_SOCKET_WRITEABLE: Wait for socket to become writeable,
+ *	 can be combined with other WL_SOCKET_* events
+ * - WL_SOCKET_CONNECTED: Wait for socket connection to be established,
+ *	 can be combined with other WL_SOCKET_* events
  *
  * Returns the offset in WaitEventSet->events (starting from 0), which can be
  * used to modify previously added wait events using ModifyWaitEvent().
@@ -652,9 +653,9 @@ FreeWaitEventSet(WaitEventSet *set)
  * i.e. it must be a process-local latch initialized with InitLatch, or a
  * shared latch associated with the current process by calling OwnLatch.
  *
- * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are
- * reported by returning the socket as readable/writable or both, depending on
- * WL_SOCKET_READABLE/WRITEABLE being specified.
+ * In the WL_SOCKET_READABLE/WRITEABLE/CONNECTED cases, EOF and error
+ * conditions are reported by returning the socket as readable/writable or
+ * both, depending on WL_SOCKET_READABLE/WRITEABLE/CONNECTED being specified.
  *
  * The user_data pointer specified here will be set for the events returned
  * by WaitEventSetWait(), allowing to easily associate additional data with
@@ -685,8 +686,7 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch,
 	}
 
 	/* waiting for socket readiness without a socket indicates a bug */
-	if (fd == PGINVALID_SOCKET &&
-		(events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)))
+	if (fd == PGINVALID_SOCKET && (events & (WL_SOCKET_MASK)))
 		elog(ERROR, "cannot wait on socket event without a socket");
 
 	event = &set->events[set->nevents];
@@ -885,6 +885,8 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
 		flags |= FD_READ;
 	if (event->events & WL_SOCKET_WRITEABLE)
 		flags |= FD_WRITE;
+	if (event->events & WL_SOCKET_CONNECTED)
+		flags |= FD_CONNECT;
 
 	if (*handle == WSA_INVALID_EVENT)
 	{
@@ -1395,7 +1397,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 				returned_events++;
 			}
 		}
-		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		else if (cur_event->events & WL_SOCKET_MASK)
 		{
 			WSANETWORKEVENTS resEvents;
 			HANDLE		handle = set->handles[cur_event->pos + 1];
@@ -1432,6 +1434,12 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 				/* writeable */
 				occurred_events->events |= WL_SOCKET_WRITEABLE;
 			}
+			if ((cur_event->events & WL_SOCKET_CONNECTED) &&
+				(resEvents.lNetworkEvents & FD_CONNECT))
+			{
+				/* connected */
+				occurred_events->events |= WL_SOCKET_CONNECTED;
+			}
 			if (resEvents.lNetworkEvents & FD_CLOSE)
 			{
 				/* EOF */
@@ -1439,6 +1447,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 					occurred_events->events |= WL_SOCKET_READABLE;
 				if (cur_event->events & WL_SOCKET_WRITEABLE)
 					occurred_events->events |= WL_SOCKET_WRITEABLE;
+				if (cur_event->events & WL_SOCKET_CONNECTED)
+					occurred_events->events |= WL_SOCKET_CONNECTED;
 			}
 
 			if (occurred_events->events != 0)
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 73abfaf..8af0266 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -126,6 +126,16 @@ typedef struct Latch
 #define WL_SOCKET_WRITEABLE  (1 << 2)
 #define WL_TIMEOUT			 (1 << 3)	/* not for WaitEventSetWait() */
 #define WL_POSTMASTER_DEATH  (1 << 4)
+#ifdef WIN32
+#define WL_SOCKET_CONNECTED  (1 << 5)
+#else
+/* avoid having to deal with case on platforms not requiring it */
+#define WL_SOCKET_CONNECTED  WL_SOCKET_WRITEABLE
+#endif
+
+#define WL_SOCKET_MASK		(WL_SOCKET_READABLE | \
+							 WL_SOCKET_WRITEABLE | \
+							 WL_SOCKET_CONNECTED)
 
 typedef struct WaitEvent
 {

Peter Geoghegan <pg@bowt.ie> writes:
> I think that it's useful for these things to be handled in an
> adversarial manner, in the same way that litigation is adversarial in
> a common law court. I doubt that Noah actually set out to demoralize
> anyone. He is just doing the job he was assigned.

FWIW, I agree that Noah is just carrying out the RMT's task as assigned. However, the only thing that Peter could really do about this of his own authority is to revert 1e8a850, which I certainly think should be the last resort not the first. We are continuing to make progress towards finding a better solution, I think, and since no particular deadline is imminent we should let that process play out.

I think what Peter should do to satisfy the RMT process is to state that if no better solution is found by date X, he will revert 1e8a850 (or somehow mask the problem by removing functionality), and he sees no need for additional progress reports before that. Date X could be sometime early in September, perhaps.

			regards, tom lane

"Augustine, Jobin" <jobin.augustine@openscg.com> writes: > Appears that patch is not helping. > Errors like below are still appearing in the log > =============================== > 2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL: could > not connect to the primary server: could not send data to server: Socket is > not connected (0x00002749/10057) > could not send startup packet: Socket is not connected > (0x00002749/10057) BTW, I notice that this is a bit different from your first report, because that mentioned "could not send SSL negotiation packet" rather than "could not send startup packet". If the initial report was about a setup in which SSL was enabled, and in this report's setup it isn't, then that difference is unsurprising; but otherwise it needs explanation. regards, tom lane
On 2017-08-13 16:55:33 -0400, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> I think that it's useful for these things to be handled in an
>> adversarial manner [...]
>
> FWIW, I agree that Noah is just carrying out the RMT's task as
> assigned.

Well, then that's a sign that the tasks/process need to be rescoped.

> However, the only thing that Peter could really do about this of his own
> authority is to revert 1e8a850, which I certainly think should be the last
> resort not the first. We are continuing to make progress towards finding
> a better solution, I think, and since no particular deadline is imminent
> we should let that process play out.

I agree that that'd be a bad fix. There's other things that should, but don't, really use the asynchronous API, e.g. postgres_fdw.

As a measure of last resort we could add a libpq workaround that forces a pqSocketCheck() at the right moment, while still establishing a connection. That's not good from an interruptability perspective, but better than blocking for the entire connection establishment.

Greetings,

Andres Freund

Andres Freund <andres@anarazel.de> writes:
> As a measure of last resort we could add a libpq workaround that forces
> a pqSocketCheck() at the right moment, while still establishing a
> connection. That's not good from an interruptability perspective, but
> better than blocking for the entire connection establishment.

Probably a better idea is to fix things so that can't-send-yet results in returning PGRES_POLLING_WRITING rather than a failure. That would result in a busy-wait in this scenario on Windows, which is arguably better than blocking. It would also make this part of libpq more robust against applications being sloppy about socket readiness checks (which, you could argue, is exactly what libpqwalreceiver is being). But it would be a somewhat ticklish change because of the portability hazards, so I'm really disinclined to do it this late in beta.

			regards, tom lane

On Sun, Aug 13, 2017 at 2:22 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-08-13 16:55:33 -0400, Tom Lane wrote:
>> FWIW, I agree that Noah is just carrying out the RMT's task as
>> assigned.
>
> Well, then that's a sign that the tasks/process need to be rescoped.

Why? I might agree if the RMT had an outsized influence on the final outcome. If that's the case, then it's something that I missed.

I also don't think that it's fair to expect Noah to spend a lot of time poring over whether sending out one of those pro forma status update policy violation mails is warranted. I expect someone in his position to aim to err on the side of sending out too many of those. It's easy for the patch author to explain the situation, but hard for the RMT to understand the situation fully at all times.

-- 
Peter Geoghegan

Thank you Tom Lane,
This patch fixes the problem.

With this patch, streaming replication started working (replication to Windows)
(Tested for Linux to Windows replication)

> It strikes me that maybe the misuse of io_flag could be contributing
> to this: if the walreceiver process's latch were set, we'd end up calling
> PQconnectPoll before the socket had necessarily come ready, which would
> produce the described symptom. [...]
>
> To move things along, here's a corrected patch --- Jobin, please test.
>
> 			regards, tom lane

--
Jobin Augustine
Architect : Production Database Operations
OpenSCG
phone : +91 9989932600

On 8/13/17 15:46, Noah Misch wrote:
> IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due
> for your status update. Please reacquaint yourself with the policy on open
> item ownership[1] and then reply immediately. If I do not hear from you by
> 2017-08-14 20:00 UTC, I will transfer this item to release management team
> ownership without further notice.

It appears there is a patch that fixes the issue now. Let's wait until Wednesday to see if that ends up being successful. Otherwise, the plan forward would be to decorate the change in the original commit with an #ifndef WIN32.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> It appears there is a patch that fixes the issue now. Let's wait until
> Wednesday to see if that ends up being successful.

Jobin reported success with add-connected-event-2.patch --- are you expecting more testing to happen?

I did instrument the loop in libpqrcv_connect and verified that the process latch is set the first time through. I haven't traced down exactly why it's set at that point, but that confirms why Andres' initial patch didn't work and the new one did.

I think we could commit add-connected-event-2.patch and call this issue resolved.

			regards, tom lane

On 8/14/17 10:57, Tom Lane wrote:
> I think we could commit add-connected-event-2.patch and call this
> issue resolved.

Would you like to commit your patch then?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 8/14/17 10:57, Tom Lane wrote:
>> I think we could commit add-connected-event-2.patch and call this
>> issue resolved.

> Would you like to commit your patch then?

It's really Andres' patch, but I can push it.

			regards, tom lane

On August 15, 2017 7:04:59 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 8/14/17 10:57, Tom Lane wrote:
>>> I think we could commit add-connected-event-2.patch and call this
>>> issue resolved.
>
>> Would you like to commit your patch then?
>
>It's really Andres' patch, but I can push it.
I'll in a bit, unless you prefer to do it. You seem busy enough ;)
I don't see any relevant uses of sockets in older branches, did I miss something?
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On August 15, 2017 8:07:43 AM PDT, Andres Freund <andres@anarazel.de> wrote:
> On August 15, 2017 7:04:59 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>> Would you like to commit your patch then?
>>
>> It's really Andres' patch, but I can push it.
>
> I'll in a bit, unless you prefer to do it. You seem busy enough ;)

Hah, synchronicity. That works too ;)

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Andres Freund <andres@anarazel.de> writes:
> I don't see any relevant uses of sockets in older branches, did I miss something?

No, I think we don't need to go back further than v10. Even if it turns out we do, I'd just as soon let this get a bit of field testing first.

			regards, tom lane

Message: 20170524173644.1440.53348@wrigleys.postgresql.org
Bug reference: 14669
Logged by: Igor Neyman
Email address: ineyman(at)perceptron(dot)com
PostgreSQL version: 10beta1
Operating system: Windows

Regards,
Jobin.

> Thank you Tom Lane,
> This patch fixes the problem.
>
> With this patch, streaming replication started working (replication to
> Windows)
>
> (Tested for Linux to Windows replication)

Do we allow streaming replication among different OS? I thought it is required that primary and standbys are the same platform (in my understanding this means the same hardware architecture and OS) in streaming replication.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

On Thu, Aug 17, 2017 at 8:13 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> Do we allow streaming replication among different OS?

No. WAL is a binary format.

> I thought it is
> required that primary and standbys are the same platform (in my
> understanding this means the same hardware architecture and OS) in
> streaming replication.

Yep, I recall the same requirement.

-- 
Michael

Michael Paquier <michael.paquier@gmail.com> writes:
> On Thu, Aug 17, 2017 at 8:13 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> Do we allow streaming replication among different OS?

> No. WAL is a binary format.

>> I thought it is
>> required that primary and standbys are the same platform (in my
>> understanding this means the same hardware architecture and OS) in
>> streaming replication.

> Yep, I recall the same requirement.

In practice it's largely about architecture, I think. You definitely need the same endianness and floating-point format, which are hardware, and you need the same MAXALIGN, which is partly hardware but in principle could be chosen differently in different ABI conventions for the same hardware. (At least for non-alignment-picky hardware like Intel. Alignment-picky hardware is likely going to dictate that choice too.)

We disclaim cross-OS portability partly because of the possible influence of different ABI conventions, but it doesn't surprise me in the least if in practice 64-bit Windows can interoperate with x86_64 Linux. (Less sure about the 32-bit case --- it looks like pg_config.h.win32 chooses MAXALIGN 8 in all cases, which would mean 32-bit Windows is more likely to interoperate with 64-bit Linux than 32-bit Linux.)

The thing that's really likely to bite you cross-platform is dependency of textual index sort ordering on non-C locale definitions. But that wouldn't show up in a quick smoke test of replication ... and if you really wanted, you could use C locale on both ends.

			regards, tom lane

On 8/16/17 19:41, Michael Paquier wrote:
> On Thu, Aug 17, 2017 at 8:13 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> Do we allow streaming replication among different OS?
>
> No. WAL is a binary format.
>
>> I thought it is
>> required that primary and standbys are the same platform (in my
>> understanding this means the same hardware architecture and OS) in
>> streaming replication.
>
> Yep, I recall the same requirement.

He meant logical replication, but the code in question here is the same for streaming replication, or whatever it's called.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

> In practice it's largely about architecture, I think. You definitely
> need the same endianness and floating-point format, which are hardware,
> and you need the same MAXALIGN, which is partly hardware but in principle
> could be chosen differently in different ABI conventions for the same
> hardware. [...]
>
> We disclaim cross-OS portability partly because of the possible influence
> of different ABI conventions, but it doesn't surprise me in the least if
> in practice 64-bit Windows can interoperate with x86_64 Linux. [...]

I feel like we need to be a little clearer about this in our official documentation. From section 26.2.1, Planning:

"Hardware need not be exactly the same, but experience shows that maintaining two identical systems is easier than maintaining two dissimilar ones over the lifetime of the application and system. In any case the hardware architecture must be the same ― shipping from, say, a 32-bit to a 64-bit system will not work."

Probably we should disclaim cross-OS portability here?

> The thing that's really likely to bite you cross-platform is dependency
> of textual index sort ordering on non-C locale definitions. But that
> wouldn't show up in a quick smoke test of replication ... and if you
> really wanted, you could use C locale on both ends.

Of course.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> He meant logical replication,

Oh, I could not find where he meant logical replication in the original report.

> but the code in question here is the same
> for streaming replication, or whatever it's called.

Yes.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

On Thu, Aug 17, 2017 at 9:15 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> He meant logical replication,
>
> Oh, I could not find where he meant logical replication in the original
> report.

The second message of the thread says so, but the first does not mention logical replication at all. From here on, it is PG 9.6 and pg_basebackup that are mentioned:
https://www.postgresql.org/message-id/CAHBggj8g2T+ZDcACZ2FmzX9CTxkWjKBsHd6NkYB4i9Ojf6K1Fw@mail.gmail.com
That explains the confusion.

-- 
Michael

> On Thu, Aug 17, 2017 at 9:15 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>>> He meant logical replication,
>>
>> Oh, I could not find where he meant logical replication in the original
>> report.
>
> The second message of the thread says so, but the first does not
> mention logical replication at all. From here on, it is PG 9.6 and
> pg_basebackup that are mentioned:
> https://www.postgresql.org/message-id/CAHBggj8g2T+ZDcACZ2FmzX9CTxkWjKBsHd6NkYB4i9Ojf6K1Fw@mail.gmail.com
> That explains the confusion.

Oh, I see it now. Thanks.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp