Thread: An I/O error occurred while sending to the backend (PG 13.4)
Hello,
I hope this email finds you all well!
I have a data warehouse with a fairly complex ETL process that has been running for years now, across PG 9.6, 11.2, and, for the past couple of months, 13.4. Under load on 13.4 I have been getting the error “An I/O error occurred while sending to the backend” quite often, which I never used to get on 11.2. I have applied some tricks, particularly with the socketTimeout JDBC configuration.
So my first question is whether anyone has any idea why this is happening? My hardware and general PG configuration have not changed between 11.2 and 13.4 and I NEVER experienced this on 11.2 in about 2y of production.
Second, I have one stored procedure that takes a very long time to run (roughly 40 minutes), so obviously I'd need to set socketTimeout to something like an hour in order to call it without timing out. That doesn't seem reasonable, does it?
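For illustration, this is roughly how the socketTimeout workaround is wired into the JDBC connection (a minimal sketch: host, database and credentials are placeholders, and socketTimeout is the standard pgJDBC property, in seconds):

    // Minimal sketch of the socketTimeout workaround on the pgJDBC connection.
    // Host, database and credentials are placeholders; socketTimeout is in seconds.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class EtlConnection {
        public static Connection open() throws Exception {
            Properties props = new Properties();
            props.setProperty("user", "etl_user");
            props.setProperty("password", "secret");
            // Large enough to cover the ~40 minute stored procedure plus some margin.
            props.setProperty("socketTimeout", "3600");
            return DriverManager.getConnection(
                    "jdbc:postgresql://dbhost:5432/warehouse", props);
        }
    }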
I understand that there is not just Postgres 13.4 in play, but also the JDBC driver. I ran production for several days on 42.2.19 (which had run fine with PG 11.2) to test, and got the error a couple of times, i.e., the same as with 42.2.24, so I am not sure this has to do with the JDBC driver.
So I am not sure what to do now. I do not know whether I missed some configuration option introduced since 11.2 that could trigger this issue, or whether there is some other phenomenon going on. I have always had a few “long running” queries in the system (i.e., > 20 minutes) and never experienced this on 11.2, yet I am seeing it maybe once or twice a week on 13.4, seemingly at random. So sometimes the queries run fine, and other times they time out. Weird.
Thanks,
Laurent.
On Sat, Dec 04, 2021 at 05:32:10PM +0000, ldh@laurent-hasson.com wrote:
> I have a data warehouse with a fairly complex ETL process that has been running for years now across PG 9.6, 11.2 and now 13.4 for the past couple of months. I have been getting the error "An I/O error occurred while sending to the backend" quite often under load in 13.4 which I never used to get on 11.2. I have applied some tricks, particularly with the socketTimeout JDBC configuration.
>
> So my first question is whether anyone has any idea why this is happening? My hardware and general PG configuration have not changed between 11.2 and 13.4 and I NEVER experienced this on 11.2 in about 2y of production.
>
> Second, I have one stored procedure that takes a very long time to run (40mn more or less), so obviously, I'd need to set socketTimeout to something like 1h in order to call it and not timeout. That doesn't seem reasonable?

Is the DB server local or remote (TCP/IP) to the client?

Could you collect the corresponding postgres query logs when this happens ?

It'd be nice to see a network trace for this too. Using tcpdump or wireshark.
Preferably from the client side.

FWIW, I suspect the JDBC socketTimeout is a bad workaround.

--
Justin
RE: An I/O error occurred while sending to the backend (PG 13.4)
> -----Original Message-----
> From: Justin Pryzby <pryzby@telsasoft.com>
> Sent: Saturday, December 4, 2021 12:59
> To: ldh@laurent-hasson.com
> Cc: pgsql-performance@postgresql.org
> Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
>
> On Sat, Dec 04, 2021 at 05:32:10PM +0000, ldh@laurent-hasson.com wrote:
> > I have a data warehouse with a fairly complex ETL process that has been running for years now across PG 9.6, 11.2 and now 13.4 for the past couple of months. I have been getting the error "An I/O error occurred while sending to the backend" quite often under load in 13.4 which I never used to get on 11.2. I have applied some tricks, particularly with the socketTimeout JDBC configuration.
> >
> > So my first question is whether anyone has any idea why this is happening? My hardware and general PG configuration have not changed between 11.2 and 13.4 and I NEVER experienced this on 11.2 in about 2y of production.
> >
> > Second, I have one stored procedure that takes a very long time to run (40mn more or less), so obviously, I'd need to set socketTimeout to something like 1h in order to call it and not timeout. That doesn't seem reasonable?
>
> Is the DB server local or remote (TCP/IP) to the client?
>
> Could you collect the corresponding postgres query logs when this happens ?
>
> It'd be nice to see a network trace for this too. Using tcpdump or wireshark.
> Preferably from the client side.
>
> FWIW, I suspect the JDBC socketTimeout is a bad workaround.
>
> --
> Justin

It's a remote server, but all on a local network. Network performance is, I am sure, not the issue. Also, the system is on Windows Server. What are you expecting to see out of a tcpdump? I'll try to get PG logs on the failing query.

Thank you,
Laurent.
On Sat, Dec 04, 2021 at 07:18:06PM +0000, ldh@laurent-hasson.com wrote:
> It's a remote server, but all on a local network. Network performance is I am sure not the issue. Also, the system is on Windows Server. What are you expecting to see out of a tcpdump? I'll try to get PG logs on the failing query.

I'd want to know if postgres sent anything to the client, like TCP RST, or if the client decided on its own that there had been an error.

--
Justin
RE: An I/O error occurred while sending to the backend (PG 13.4)
> -----Original Message-----
> From: ldh@laurent-hasson.com <ldh@laurent-hasson.com>
> Sent: Saturday, December 4, 2021 14:18
> To: Justin Pryzby <pryzby@telsasoft.com>
> Cc: pgsql-performance@postgresql.org
> Subject: RE: An I/O error occurred while sending to the backend (PG 13.4)
>
> > -----Original Message-----
> > From: Justin Pryzby <pryzby@telsasoft.com>
> > Sent: Saturday, December 4, 2021 12:59
> > To: ldh@laurent-hasson.com
> > Cc: pgsql-performance@postgresql.org
> > Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
> >
> > On Sat, Dec 04, 2021 at 05:32:10PM +0000, ldh@laurent-hasson.com wrote:
> > > I have a data warehouse with a fairly complex ETL process that has been running for years now across PG 9.6, 11.2 and now 13.4 for the past couple of months. I have been getting the error "An I/O error occurred while sending to the backend" quite often under load in 13.4 which I never used to get on 11.2. I have applied some tricks, particularly with the socketTimeout JDBC configuration.
> > >
> > > So my first question is whether anyone has any idea why this is happening? My hardware and general PG configuration have not changed between 11.2 and 13.4 and I NEVER experienced this on 11.2 in about 2y of production.
> > >
> > > Second, I have one stored procedure that takes a very long time to run (40mn more or less), so obviously, I'd need to set socketTimeout to something like 1h in order to call it and not timeout. That doesn't seem reasonable?
> >
> > Is the DB server local or remote (TCP/IP) to the client?
> >
> > Could you collect the corresponding postgres query logs when this happens ?
> >
> > It'd be nice to see a network trace for this too. Using tcpdump or wireshark.
> > Preferably from the client side.
> >
> > FWIW, I suspect the JDBC socketTimeout is a bad workaround.
> >
> > --
> > Justin
>
> It's a remote server, but all on a local network. Network performance is I am sure not the issue. Also, the system is on Windows Server. What are you expecting to see out of a tcpdump? I'll try to get PG logs on the failing query.
>
> Thank you,
> Laurent.

Hello Justin,

It has been ages! The issue has been happening a bit more often recently, as much as once every 10 days or so. As a reminder, the setup is Postgres 13.4 on Windows Server with 16 cores and 64GB memory. The scenario where this occurs is an ETL tool called Pentaho Kettle (V7) connecting to the DB for DataWarehouse workloads. The tool is Java-based and connects via JDBC using postgresql-42.2.5.jar. There are no particular settings besides the socketTimeout setting mentioned above.

The workload has some steps being lots of quick transactions for dimension tables for example, but some fact table calculations, especially large pivots, can make queries run for 40mn up to over an hour (a few of those).
I caught these in the logs at the time of a failure but unsure what to make of that:

2022-02-21 02:08:16.214 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:29.347 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.371 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.463 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.596 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.687 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.786 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.873 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:30.976 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.050 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.131 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.240 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.906 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:31.988 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:33.068 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:08:34.850 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:10:43.596 EST [836] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
2022-02-21 02:10:43.598 EST [8616] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
2022-02-21 02:10:43.598 EST [8616] LOG: unexpected EOF on client connection with an open transaction
2022-02-21 02:10:43.605 EST [7000] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
2022-02-21 02:10:43.605 EST [7000] LOG: unexpected EOF on client connection with an open transaction
2022-02-21 02:10:43.605 EST [1368] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
2022-02-21 02:10:43.605 EST [1368] LOG: unexpected EOF on client connection with an open transaction
2022-02-21 02:10:43.605 EST [3304] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
2022-02-21 02:10:43.605 EST [3304] LOG: unexpected EOF on client connection with an open transaction
2022-02-21 02:31:38.808 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:31:38.817 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:31:38.825 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:31:38.834 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:31:38.845 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported
2022-02-21 02:34:32.112 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported

Thank you,
Laurent.
On Thu, Feb 24, 2022 at 12:47:42AM +0000, ldh@laurent-hasson.com wrote:
> On Sat, Dec 04, 2021 at 05:32:10PM +0000, ldh@laurent-hasson.com wrote:
> > I have a data warehouse with a fairly complex ETL process that has been running for years now across PG 9.6, 11.2 and now 13.4 for the past couple of months. I have been getting the error "An I/O error occurred while sending to the backend" quite often under load in 13.4 which I never used to get on 11.2. I have applied some tricks, particularly with the socketTimeout JDBC configuration.
>
> It'd be nice to see a network trace for this too. Using tcpdump or wireshark. Preferably from the client side.
>
> Hello Justin,
>
> It has been ages! The issue has been happening a bit more often recently, as much as once every 10 days or so. As a reminder, the set up is Postgres 13.4 on Windows Server with 16cores and 64GB memory. The scenario where this occurs is an ETL tool called Pentaho Kettle (V7) connecting to the DB for DataWarehouse workloads. The tool is Java-based and connects via JDBC using postgresql-42.2.5.jar. There are no particular settings besides the socketTimeout setting mentioned above.
>
> The workload has some steps being lots of quick transactions for dimension tables for example, but some fact table calculations, especially large pivots, can make queries run for 40mn up to over an hour (a few of those).
>
> I caught these in the logs at the time of a failure but unsure what to make of that:
>
> 2022-02-21 02:10:43.605 EST [1368] LOG: unexpected EOF on client connection with an open transaction
> 2022-02-21 02:10:43.605 EST [3304] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
> 2022-02-21 02:10:43.605 EST [3304] LOG: unexpected EOF on client connection with an open transaction
> 2022-02-21 02:31:38.808 EST [1704] LOG: setsockopt(TCP_USER_TIMEOUT) not supported

I suggest to enable CSV logging, which has many more columns of data. Some of them might provide an insight - I'm not sure. log_destination=csvlog (in addition to whatever else you have set).

And the aforementioned network trace. You could set a capture filter on TCP SYN|RST so it's not absurdly large. From my notes, it might look like this:

(tcp[tcpflags]&(tcp-rst|tcp-syn|tcp-fin)!=0)

--
Justin
You originally mailed about an error on the client, and now you found corresponding server logs, which suggests a veritable network issue.

Are the postgres clients and server on the same subnet ? If not, what are the intermediate routers ? Is there any NAT happening ? Do those devices have any interesting logs that correspond with the server/client connection failures ?

Have you tried enabling TCP keepalives ? This might help to convince a NAT device not to forget about your connection.
https://www.postgresql.org/docs/current/runtime-config-connection.html

tcp_keepalives_idle=9
tcp_keepalives_interval=9
tcp_keepalives_count=0
tcp_user_timeout=0 -- You apparently have this set, but it cannot work on windows, so just generates noise.

On linux, you can check the keepalive counters in "netstat -not" to be sure that it's enabled. A similar switch hopefully exists for windows.

--
Justin
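Keepalives can also be requested from the client side of the connection. A minimal sketch, assuming the documented pgJDBC tcpKeepAlive and socketTimeout URL parameters; host, database, user and password are placeholders:

    // Minimal sketch: enabling TCP keepalive (and the socketTimeout workaround)
    // through the pgJDBC connection URL. Connection details are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class KeepAliveConnection {
        public static Connection open() throws Exception {
            String url = "jdbc:postgresql://dbhost:5432/warehouse"
                    + "?tcpKeepAlive=true"    // ask the OS to probe idle client sockets
                    + "&socketTimeout=3600";  // seconds; generous enough for the long procedure
            return DriverManager.getConnection(url, "etl_user", "secret");
        }
    }

Whether the OS-level keepalive intervals take effect on the client still depends on the Windows TCP settings of the ETL machine.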
On Thu, Feb 24, 2022 at 08:50:45AM -0300, Ranier Vilela wrote:
> I can't understand why you are still using 13.4?
> [1] There is a long discussion about the issue with 13.4, the project was
> made to fix a DLL bottleneck.
>
> Why you not use 13.6?
That other problem (and its fix) were in the windows build environment, and not an issue in some postgres version. It's still a good idea to schedule an update.

--
Justin
On Wed, Feb 23, 2022 at 07:04:15PM -0600, Justin Pryzby wrote:
> And the aforementioned network trace. You could set a capture filter on TCP
> SYN|RST so it's not absurdly large. From my notes, it might look like this:
> (tcp[tcpflags]&(tcp-rst|tcp-syn|tcp-fin)!=0)

I'd also add '|| icmp'. My hunch is that you'll see some ICMP (not "ping") being sent by an intermediate gateway, resulting in the connection being reset.

--
Justin
RE: An I/O error occurred while sending to the backend (PG 13.4)
>From: Ranier Vilela <ranier.vf@gmail.com>
>Sent: Thursday, February 24, 2022 08:46
>To: Justin Pryzby <pryzby@telsasoft.com>
>Cc: ldh@laurent-hasson.com; pgsql-performance@postgresql.org
>Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
>
>On Thu, Feb 24, 2022 at 09:59, Justin Pryzby <mailto:pryzby@telsasoft.com> wrote:
>On Thu, Feb 24, 2022 at 08:50:45AM -0300, Ranier Vilela wrote:
>> I can't understand why you are still using 13.4?
>> [1] There is a long discussion about the issue with 13.4, the project was
>> made to fix a DLL bottleneck.
>>
>> Why you not use 13.6?
>
>That other problem (and its fix) were in the windows build environment, and not
>an issue in some postgres version.
>Yeah, correct.
>But I think that it was very clear in the other thread that version 13.4,
>on Windows, may have a slowdown, because of the DLL problem.
>So it would be better to use the latest available version
>that has this specific fix and many others.
>
>regards,
>Ranier Vilela

OK, absolutely. I was thinking about even moving to 14. I know migrations within a release are painless, but my experience with upgrading across releases has also been quite good (short of bugs that were found of course). Any opinion on 14.2?

Thank you,
Laurent.
RE: An I/O error occurred while sending to the backend (PG 13.4)
> -----Original Message-----
> From: Justin Pryzby <pryzby@telsasoft.com>
> Sent: Friday, February 25, 2022 08:02
> To: ldh@laurent-hasson.com
> Cc: pgsql-performance@postgresql.org
> Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
>
> On Wed, Feb 23, 2022 at 07:04:15PM -0600, Justin Pryzby wrote:
> > And the aforementioned network trace. You could set a capture filter on TCP
> > SYN|RST so it's not absurdly large. From my notes, it might look like this:
> > (tcp[tcpflags]&(tcp-rst|tcp-syn|tcp-fin)!=0)
>
> I'd also add '|| icmp'. My hunch is that you'll see some ICMP (not "ping")
> being sent by an intermediate gateway, resulting in the connection being reset.
>
> --
> Justin

Hello Justin,

I am so sorry but I do not understand what you are asking me to do. I am unfamiliar with these commands. Is this a postgres configuration file? Is this something I just do once, or something I leave on to hopefully catch it when the issue occurs? Is this something to do on the DB machine or the ETL machine? FYI:

- My ETL machine is on 10.64.17.211
- My DB machine is on 10.64.17.210
- Both on Windows Server 2012 R2, x64

So sorry for the bother,
Laurent.
On Mon, Feb 28, 2022 at 09:43:09PM +0000, ldh@laurent-hasson.com wrote:
> On Wed, Feb 23, 2022 at 07:04:15PM -0600, Justin Pryzby wrote:
> > > And the aforementioned network trace. You could set a capture filter on TCP
> > > SYN|RST so it's not absurdly large. From my notes, it might look like this:
> > > (tcp[tcpflags]&(tcp-rst|tcp-syn|tcp-fin)!=0)
> >
> > I'd also add '|| icmp'. My hunch is that you'll see some ICMP (not "ping")
> > being sent by an intermediate gateway, resulting in the connection being reset.
>
> I am so sorry but I do not understand what you are asking me to do. I am unfamiliar with these commands. Is this a postgres configuration file? Is this something I just do once or something I leave on to hopefully catch it when the issue occurs? Is this something to do on the DB machine or the ETL machine? FYI:

It's no problem.

I suggest that you run wireshark with a capture filter to try to show *why* the connections are failing. I think the capture filter might look like:

(icmp || (tcp[tcpflags] & (tcp-rst|tcp-syn|tcp-fin)!=0)) && host 10.64.17.211

With the "host" filtering for the IP address of the *remote* machine.

You could run that on whichever machine is more convenient and leave it running for however long it takes for that error to happen. You'll be able to save a .pcap file for inspection. I suppose it'll show either a TCP RST or an ICMP. Whichever side sent that is where the problem is. I still suspect the issue isn't in postgres.

> - My ETL machine is on 10.64.17.211
> - My DB machine is on 10.64.17.210
> - Both on Windows Server 2012 R2, x64

These network details make my theory unlikely.

They're on the same subnet with no intermediate gateways, and communicate directly via a hub/switch/crossover cable. If that's true, then both will have each other's hardware address in ARP after pinging from one to the other.

--
Justin
RE: An I/O error occurred while sending to the backend (PG 13.4)
> -----Original Message-----
> From: Justin Pryzby <pryzby@telsasoft.com>
> Sent: Monday, February 28, 2022 17:05
> To: ldh@laurent-hasson.com
> Cc: pgsql-performance@postgresql.org
> Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
>
> On Mon, Feb 28, 2022 at 09:43:09PM +0000, ldh@laurent-hasson.com wrote:
> > On Wed, Feb 23, 2022 at 07:04:15PM -0600, Justin Pryzby wrote:
> > > > And the aforementioned network trace. You could set a capture filter on TCP
> > > > SYN|RST so it's not absurdly large. From my notes, it might look like this:
> > > > (tcp[tcpflags]&(tcp-rst|tcp-syn|tcp-fin)!=0)
> > >
> > > I'd also add '|| icmp'. My hunch is that you'll see some ICMP (not "ping")
> > > being sent by an intermediate gateway, resulting in the connection being reset.
> >
> > I am so sorry but I do not understand what you are asking me to do. I am unfamiliar with these commands. Is this a postgres configuration file? Is this something I just do once or something I leave on to hopefully catch it when the issue occurs? Is this something to do on the DB machine or the ETL machine? FYI:
>
> It's no problem.
>
> I suggest that you run wireshark with a capture filter to try to show *why* the connections are failing. I think the capture filter might look like:
>
> (icmp || (tcp[tcpflags] & (tcp-rst|tcp-syn|tcp-fin)!=0)) && host 10.64.17.211
>
> With the "host" filtering for the IP address of the *remote* machine.
>
> You could run that on whichever machine is more convenient and leave it running for however long it takes for that error to happen. You'll be able to save a .pcap file for inspection. I suppose it'll show either a TCP RST or an ICMP. Whichever side sent that is where the problem is. I still suspect the issue isn't in postgres.
>
> > - My ETL machine is on 10.64.17.211
> > - My DB machine is on 10.64.17.210
> > - Both on Windows Server 2012 R2, x64
>
> These network details make my theory unlikely.
>
> They're on the same subnet with no intermediate gateways, and communicate directly via a hub/switch/crossover cable. If that's true, then both will have each other's hardware address in ARP after pinging from one to the other.
>
> --
> Justin

Yes, the machines ARE on the same subnet. They actually even are on the same physical rack as per what I have been told. When I run a tracert, I get this:

Tracing route to PRODDB.xxx.int [10.64.17.210] over a maximum of 30 hops:

  1    1 ms   <1 ms   <1 ms  PRODDB.xxx.int [10.64.17.210]

Trace complete.

Now, there is an additional component I think... Storage is on an array and I am not getting a clear answer as to where it is 😊 Is it possible that something is happening at the storage layer? Could that be reported as a network issue vs a storage issue for Postgres?

Also, both machines are actually VMs. I forgot to mention that and not sure if that's relevant.

Thank you,
Laurent.
On Tue, Mar 01, 2022 at 04:28:31PM +0000, ldh@laurent-hasson.com wrote:
> Now, there is an additional component I think... Storage is on an array and I am not getting a clear answer as to where it is 😊 Is it possible that something is happening at the storage layer? Could that be reported as a network issue vs a storage issue for Postgres?

No. If there were an error with storage, it'd be reported as a local error, and the query would fail, rather than failing with client-server communication.

> Also, both machines are actually VMs. I forgot to mention that and not sure if that's relevant.

Are they running on the same hypervisor ? Is that hyperv ?
Lacking other good hypotheses, that does seem relevant.

--
Justin
Of course, 14.2 would be better than 13.6, but I think that there are
RE: An I/O error occurred while sending to the backend (PG 13.4)
> -----Original Message-----
> From: Justin Pryzby <pryzby@telsasoft.com>
> Sent: Tuesday, March 1, 2022 14:27
> To: ldh@laurent-hasson.com
> Cc: pgsql-performance@postgresql.org
> Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
>
> On Tue, Mar 01, 2022 at 04:28:31PM +0000, ldh@laurent-hasson.com
> wrote:
> > Now, there is an additional component I think... Storage is on an array
> and I am not getting a clear answer as to where it is 😊 Is it possible that
> something is happening at the storage layer? Could that be reported as a
> network issue vs a storage issue for Postgres?
>
> No. If there were an error with storage, it'd be reported as a local error,
> and the query would fail, rather than failing with client-server
> communication.
>
> > Also, both machines are actually VMs. I forgot to mention that and not
> sure if that's relevant.
>
> Are they running on the same hypervisor ? Is that hyperv ?
> Lacking other good hypotheses, that does seem relevant.
>
> --
> Justin
Issue happened again last night. I did implement your recommendations but it didn't seem to prevent the issue:
tcp_keepalives_idle=9 # TCP_KEEPIDLE, in seconds;
# 0 selects the system default
tcp_keepalives_interval=9 # TCP_KEEPINTVL, in seconds;
# 0 selects the system default
tcp_keepalives_count=0 # TCP_KEEPCNT;
# 0 selects the system default
#tcp_user_timeout = 0 # TCP_USER_TIMEOUT, in milliseconds;
# 0 selects the system default
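As a cross-check that the server-side settings above are actually in effect, the values can be read back over the same JDBC connection. A minimal sketch, with placeholder connection details:

    // Minimal sketch: reading the keepalive-related settings back over JDBC to
    // confirm what the server reports for this session. Connection details are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ShowKeepalives {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://dbhost:5432/warehouse", "etl_user", "secret");
                 Statement st = conn.createStatement()) {
                for (String guc : new String[] {
                        "tcp_keepalives_idle", "tcp_keepalives_interval", "tcp_keepalives_count"}) {
                    try (ResultSet rs = st.executeQuery("SHOW " + guc)) {
                        rs.next();
                        System.out.println(guc + " = " + rs.getString(1));
                    }
                }
            }
        }
    }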
On the client application, the exceptions are:
2022/03/03 01:04:56 - Upsert2.0 - ERROR (version 7.1.0.0-12, build 1 from 2017-05-16 17.18.02 by buildguy) : Unexpected error
2022/03/03 01:04:56 - Upsert2.0 - ERROR (version 7.1.0.0-12, build 1 from 2017-05-16 17.18.02 by buildguy) : org.pentaho.di.core.exception.KettleStepException:
2022/03/03 01:04:56 - Upsert2.0 - Error in step, asking everyone to stop because of:
2022/03/03 01:04:56 - Upsert2.0 -
2022/03/03 01:04:56 - Upsert2.0 - Error inserting/updating row
2022/03/03 01:04:56 - Upsert2.0 - An I/O error occurred while sending to the backend.
2022/03/03 01:04:56 - Upsert2.0 -
2022/03/03 01:04:56 - Upsert2.0 -
2022/03/03 01:04:56 - Upsert2.0 - at org.pentaho.di.trans.steps.insertupdate.InsertUpdate.processRow(InsertUpdate.java:313)
2022/03/03 01:04:56 - Upsert2.0 - at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
2022/03/03 01:04:56 - Upsert2.0 - at java.lang.Thread.run(Thread.java:745)
2022/03/03 01:04:56 - Upsert2.0 - Caused by: org.pentaho.di.core.exception.KettleDatabaseException:
2022/03/03 01:04:56 - Upsert2.0 - Error inserting/updating row
2022/03/03 01:04:56 - Upsert2.0 - An I/O error occurred while sending to the backend.
2022/03/03 01:04:56 - Upsert2.0 -
2022/03/03 01:04:56 - Upsert2.0 - at org.pentaho.di.core.database.Database.insertRow(Database.java:1321)
2022/03/03 01:04:56 - Upsert2.0 - at org.pentaho.di.core.database.Database.insertRow(Database.java:1245)
2022/03/03 01:04:56 - Upsert2.0 - at org.pentaho.di.core.database.Database.insertRow(Database.java:1233)
2022/03/03 01:04:56 - Upsert2.0 - at org.pentaho.di.trans.steps.insertupdate.InsertUpdate.lookupValues(InsertUpdate.java:163)
2022/03/03 01:04:56 - Upsert2.0 - at org.pentaho.di.trans.steps.insertupdate.InsertUpdate.processRow(InsertUpdate.java:299)
2022/03/03 01:04:56 - Upsert2.0 - ... 2 more
2022/03/03 01:04:56 - Upsert2.0 - Caused by: org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:382)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:166)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:134)
2022/03/03 01:04:56 - Upsert2.0 - at org.pentaho.di.core.database.Database.insertRow(Database.java:1288)
2022/03/03 01:04:56 - Upsert2.0 - ... 6 more
2022/03/03 01:04:56 - Upsert2.0 - Caused by: java.net.SocketException: Connection reset
2022/03/03 01:04:56 - Upsert2.0 - at java.net.SocketInputStream.read(SocketInputStream.java:209)
2022/03/03 01:04:56 - Upsert2.0 - at java.net.SocketInputStream.read(SocketInputStream.java:141)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:161)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:128)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:113)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:73)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.core.PGStream.receiveChar(PGStream.java:453)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2119)
2022/03/03 01:04:56 - Upsert2.0 - at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355)
2022/03/03 01:04:56 - Upsert2.0 - ... 11 more
On the DB:
2022-03-03 01:04:40 EST [21228] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
2022-03-03 01:04:40 EST [21228] LOG: unexpected EOF on client connection with an open transaction
2022-03-03 01:04:40 EST [21228] LOG: disconnection: session time: 0:02:07.570 user=postgres database=Pepper host=10.64.17.211 port=63686
2022-03-03 01:04:41 EST [21160] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
2022-03-03 01:04:41 EST [21160] LOG: unexpected EOF on client connection with an open transaction
2022-03-03 01:04:41 EST [21160] LOG: disconnection: session time: 0:02:07.730 user=postgres database=Pepper host=10.64.17.211 port=63688

I don't know if that is meaningful, but I see a 15s delay between the timestamp on the database and on the application. The servers are synchronized properly. I have asked the IT team to look at the VMs and see if anything strange is happening. They are not too happy with installing WireShark to do more analysis given the "complexity of the tools and size of the logs" 😊 I keep on pushing.

Thank you,
Laurent.
On Thu, Mar 03, 2022 at 01:33:08PM -0300, Ranier Vilela wrote:
> Sorry, but this is much more on the client side.
The client is reporting the problem, as is the server.
2022-03-03 01:04:40 EST [21228] LOG: unexpected EOF on client connection with an open transaction
> Following the logs, it is understood that the client is dropping the
> connection.
The logs show that the client's connection *was* dropped.
And on the server, the same.
> So most likely the error could be from Pentaho or JDBC.
>
> https://www.geeksforgeeks.org/java-net-socketexception-in-java-with-examples/
> " This *SocketException* occurs on the server-side when the client closed
> the socket connection before the response could be returned over the
> socket."
>
> I suggest moving this thread to the Pentaho or JDBC support.
We don't know the source of the problem.
I still doubt it's in postgres,
but I
don't think it's helpful to blame the client, just because the client reported
the problem. If the server were to disconnect abruptly, I'd expect the client
to report that, too.
Laurent would just have to start the conversation over (and probably collect
the same diagnostic information anyway). The client projects could blame
postgres with as much rationale as there is for us to blame the client.
Please don't add confusion here.

I made suggestions for how to collect more information to better understand the source of the problem, and there's probably not much else to say without that.

--
Justin
RE: An I/O error occurred while sending to the backend (PG 13.4)
I am also starting to feel that the issue being on the database’s side is less and less likely. There is something happening in between, or possibly on the client.
Ranier, the only reason I was focusing on this at the PG level is that this issue started to show up several months ago shortly after I updated to PG13 from PG11. Had run PG11 for 2 years without ever seeing that issue at all. The ETL itself hasn’t changed either, except for upgrading the JDBC driver… But I did revert back to an older JDBC driver and the issue still did occur eventually.
Of course, other things could have changed in the client’s IT infrastructure that I am not aware of, so I am pushing that angle as well more aggressively now. I am also pushing for WireShark to monitor the network more closely. Stay tuned!
Thank you so much, all, for your support, but at this time I think the ball is in my camp to work out some plan with IT.
Thank you,
Laurent.
From: Ranier Vilela <ranier.vf@gmail.com>
Sent: Thursday, March 3, 2022 11:57
To: Justin Pryzby <pryzby@telsasoft.com>
Cc: ldh@laurent-hasson.com; pgsql-performance@postgresql.org
Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
On Thu, Mar 3, 2022 at 13:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
On Thu, Mar 03, 2022 at 01:33:08PM -0300, Ranier Vilela wrote:
> Sorry, but this is much more on the client side.
The client is reporting the problem, as is the server.
Have you read the server log?
" 2022-03-03 01:04:40 EST [21228] LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
2022-03-03 01:04:40 EST [21228] LOG: unexpected EOF on client connection with an open transaction"
> Following the logs, it is understood that the client is dropping the
> connection.
The logs show that the client's connection *was* dropped.
And on the server, the same.
No, the log server shows that the client dropped the connection.
> So most likely the error could be from Pentaho or JDBC.
>
> https://www.geeksforgeeks.org/java-net-socketexception-in-java-with-examples/
> " This *SocketException* occurs on the server-side when the client closed
> the socket connection before the response could be returned over the
> socket."
>
> I suggest moving this thread to the Pentaho or JDBC support.
We don't know the source of the problem.
Yeah, but it is much more likely to be on the client.
I still doubt it's in postgres,
Everything indicates not.
but I
don't think it's helpful to blame the client, just because the client reported
the problem. If the server were to disconnect abruptly, I'd expect the client
to report that, too.
Laurent would just have to start the conversation over (and probably collect
the same diagnostic information anyway). The client projects could blame
postgres with as much rationale as there is for us to blame the client.
Please don't add confusion here.
I just suggested, this is not an order.
regards,
Ranier Vilela
RE: An I/O error occurred while sending to the backend (PG 13.4)
>From: Ranier Vilela ranier.vf@gmail.com
>Sent: Thursday, March 03, 2022 13:22
>
>
>You are welcome.
>
>regards,
>Ranier Vilela
Hello all,
After a lot of back and forth, someone in IT informed us that the database VM is under a backup schedule using Veeam. Apparently, during the backup window, Veeam creates a snapshot and that takes the VM offline for a couple of minutes… And of course, they scheduled this right at the busiest time of the day for this machine, which is during our nightly ETL. Their backup doesn't perform very well either, which explains why the failure seemed to happen at random points during our ETL (which takes about 2h30).
They moved the schedule out and the issue has not happened again over the past 3 weeks. This looks like it was the root cause and would explain (I think) how the database and the client simultaneously reported a connection timeout.
Thank you so much for all your help in trying to figure this out and exonerate Postgres.
Thank you,
Laurent.
On Wed, Apr 13, 2022 at 03:36:19PM +0000, ldh@laurent-hasson.com wrote:
> After a lot of back and forth, someone in IT informed us that the database VM is under a backup schedule using Veeam. Apparently, during the backup window, Veeam creates a snapshot and that takes the VM offline for a couple of minutes… And of course, they scheduled this right at the busiest time of the day for this machine which is during our nightly ETL. Their backup doesn't perform very well either, which explained why the failure seemed to randomly happen at various points during our ETL (which takes about 2h30).
>
> They moved the schedule out and the issue has not happened again over the past 3 weeks. This looks like it was the root cause and would explain (I think) how the database and the client simultaneously reported a connection timeout.
>
> Thank you so much for all your help in trying to figure this out and exonerate Postgres.

Great, thanks for letting us know. This time it wasn't postgres' fault; you're 2 for 3 ;)

One issue I've seen is if a vmware snapshot is taken and then saved for a long time. It can be okay if Veeam takes a transient snapshot, copies its data, and then destroys the snapshot. But it can be bad if multiple snapshots are taken and then left around for a long time to use as a backup themselves.

--
Justin