Thread: psql \copy hanging
Hi list On Windows, I have an unattended script that exports data overnight by issuing psql commands like : psql -f mycommands.sql whith mycommands.sql containing a bunch of \copy lines like : \copy (SELECT * FROM someview) TO '\\windowsserver\windowsshare\somefile.csv' When I run this script by hand, everything is fine but when it runs overnight it hangs and allways on the same line. There is no lock in the database, and the connection is marked as active in the backend and running the statement "COPY (SELECT * FROM myview) TO STDOUT". So where can things be hanging ? My first guess (appart from db lock) was that the target file was locked by the filesystem. But wouldn't psql know about it and fail with an error ? Thanks for your help ! -- Arnaud
On 8/22/19 1:47 AM, Arnaud L. wrote: > Hi list > > On Windows, I have an unattended script that exports data overnight by > issuing psql commands like : > psql -f mycommands.sql > whith mycommands.sql containing a bunch of \copy lines like : > \copy (SELECT * FROM someview) TO > '\\windowsserver\windowsshare\somefile.csv' > > When I run this script by hand, everything is fine but when it runs > overnight it hangs and allways on the same line. > There is no lock in the database, and the connection is marked as active > in the backend and running the statement "COPY (SELECT * FROM myview) TO > STDOUT". > So where can things be hanging ? The first thing I think of when seeing this sort of behavior on Windows is anti-virus software. Doing you have any running on the machine? You say it always hangs on the same line. Is that line doing something different from the others? Say writing to a different location. > My first guess (appart from db lock) was that the target file was locked > by the filesystem. But wouldn't psql know about it and fail with an error ? > > Thanks for your help ! > > -- > Arnaud > > > -- Adrian Klaver adrian.klaver@aklaver.com
Le 22/08/2019 à 16:00, Adrian Klaver a écrit : > The first thing I think of when seeing this sort of behavior on Windows > is anti-virus software. Doing you have any running on the machine? > > You say it always hangs on the same line. Is that line doing something > different from the others? Say writing to a different location. Thanks for your insights Adrian. There is no AV software on the machine as it is a Server version of Windows. (I know some people run AV on Windows Server, but I don't) All the lines write to the same location which is a network share, but the hanging one is clearly the longest query to execute and the output file is one of the biggest one (though not the biggest). Note that the share points to a local folder. The fact the psql writes to a share is only for robustness so that this script could be executed from any machine. Cheers -- Arnaud
On 8/22/19 7:07 AM, Arnaud L. wrote: > Le 22/08/2019 à 16:00, Adrian Klaver a écrit : >> The first thing I think of when seeing this sort of behavior on Windows >> is anti-virus software. Doing you have any running on the machine? >> >> You say it always hangs on the same line. Is that line doing something >> different from the others? Say writing to a different location. > > Thanks for your insights Adrian. > > There is no AV software on the machine as it is a Server version of > Windows. (I know some people run AV on Windows Server, but I don't) > > All the lines write to the same location which is a network share, but > the hanging one is clearly the longest query to execute and the output > file is one of the biggest one (though not the biggest). The only thing I can think of at the moment is to put the offending line in a separate script and see what happens. > > Note that the share points to a local folder. The fact the psql writes > to a share is only for robustness so that this script could be executed > from any machine. > > Cheers > > -- > Arnaud > -- Adrian Klaver adrian.klaver@aklaver.com
Le 22/08/2019 à 16:21, Adrian Klaver a écrit : > The only thing I can think of at the moment is to put the offending line > in a separate script and see what happens. Then I think I'll first try to switch the line orders. I'll execute this action last and see if it changes anything. If it doesn't, I'll break the script into smaller parts. Thanks for you help ! -- Arnaud
"Arnaud L." <arnaud.listes@codata.eu> writes: > Hi list > > On Windows, I have an unattended script that exports data overnight by > issuing psql commands like : > psql -f mycommands.sql > whith mycommands.sql containing a bunch of \copy lines like : > \copy (SELECT * FROM someview) TO > '\\windowsserver\windowsshare\somefile.csv' > > When I run this script by hand, everything is fine but when it runs > overnight it hangs and allways on the same line. Is the overnight run being done as the same DB user you are testing with? If not, then perhaps the automated user deviates in resource settings and/or permissions, in turn resulting in sluggish performance caused by thrashing and/or inferior query plan. > There is no lock in the database, and the connection is marked as > active in the backend and running the statement "COPY (SELECT * FROM > myview) TO STDOUT". > So where can things be hanging ? > My first guess (appart from db lock) was that the target file was > locked by the filesystem. But wouldn't psql know about it and fail > with an error ? > > Thanks for your help ! > > -- > Arnaud > > > -- Jerry Sievers Postgres DBA/Development Consulting e: postgres.consulting@comcast.net
Le 22/08/2019 à 20:00, Jerry Sievers a écrit : > Is the overnight run being done as the same DB user you are testing > with? > > If not, then perhaps the automated user deviates in resource settings > and/or permissions, in turn resulting in sluggish performance caused by > thrashing and/or inferior query plan. Hi Jerry, yes, they both run under the same user. I've checked my other scheduled tasks and I don't see what could interfere, but then this is quite complex so I could easily overlook something. Unfortunately, tonight's run failed for a totally different reason... Thanks for your help Cheers -- Arnaud
Le 22/08/2019 à 16:23, Arnaud L. a écrit : > Le 22/08/2019 à 16:21, Adrian Klaver a écrit : >> The only thing I can think of at the moment is to put the offending line >> in a separate script and see what happens. > > Then I think I'll first try to switch the line orders. I'll execute this > action last and see if it changes anything. If it doesn't, I'll break > the script into smaller parts. It ran fine over the week-end so I'll consider this as temporary glitch. It's always dissatisfying not to know what has happened, but I'm not going to try too hard to reproduce this. Thanks for helping and sorry for the noise ! Cheers -- Arnaud
Le 22/08/2019 à 10:47, Arnaud L. a écrit : > On Windows, I have an unattended script that exports data overnight by > issuing psql commands like : > psql -f mycommands.sql > whith mycommands.sql containing a bunch of \copy lines like : > \copy (SELECT * FROM someview) TO > '\\windowsserver\windowsshare\somefile.csv' > > When I run this script by hand, everything is fine but when it runs > overnight it hangs and allways on the same line. > There is no lock in the database, and the connection is marked as active > in the backend and running the statement "COPY (SELECT * FROM myview) TO > STDOUT". > So where can things be hanging ? > My first guess (appart from db lock) was that the target file was locked > by the filesystem. But wouldn't psql know about it and fail with an error ? OK, so it happended again... I'm really puzzled. There is no lock on the FS that I can find (openfiles reports 0 lock on this file). The client psql process is sitting idle, doing nothing. In PostgreSQL, the connection is active with the query being simple copy (select * from view) to stdout. In pg_stat_activity wait_event_type IS NULL, state is active, query_start is 2019-08-27 00:18:20.471957+02 and state_change is 2019-08-27 00:18:20.47196+02... Nothing in postgresql LOG around that time. Later I have : 2019-08-27 01:00:04 CEST LOG: could not receive data from client: An existing connection was forcibly closed by the remote host. But I realy doubt this is related... Any other idea ? I'll change the lines order for tonight's run, but that is not what I'd call a solution... Thanks a lot ! -- Arnaud
On Tue, Aug 27, 2019 at 9:54 AM Arnaud L. <arnaud.listes@codata.eu> wrote: > Any other idea ? I'll change the lines order for tonight's run, but that > is not what I'd call a solution... Does it hangs against the same line content or the same line number? Are you able to run the script automatically during working hours (to avoid firewalling or upgrades running in parrallel to your nightly script execution)? Any chance something is querying the same data and a lock blocks the transaction (pg_locks)? Just throwing on the table some desperate ideas.... Luca
Le 27/08/2019 à 10:00, Luca Ferrari a écrit : > On Tue, Aug 27, 2019 at 9:54 AM Arnaud L. <arnaud.listes@codata.eu> wrote: >> Any other idea ? I'll change the lines order for tonight's run, but that >> is not what I'd call a solution... > > Does it hangs against the same line content or the same line number? > Are you able to run the script automatically during working hours (to > avoid firewalling or upgrades running in parrallel to your nightly > script execution)? > Any chance something is querying the same data and a lock blocks the > transaction (pg_locks)? Hi Luca. I can run the script just fine during working hours. I checked pg_locks, and this pid is the only process requesting locks. It has around 100 of them since it is a view querying multiple tables, but I see nothing blocking. All locks are granted. I have a csv output of pg_locks so I can post this if asked. It hangs against the same line in the sql script, all lines being "\copy (select ....) to 'file on unc share'". This line is simply the longest running query because the view inside the select outputs almost 1M rows and does some subqueries. Also when the script is hung, output has not started (file size is 0). > Just throwing on the table some desperate ideas.... Desperate ideas are very welcome ! Cheers -- Arnaud
On Tue, Aug 27, 2019 at 10:48 AM Arnaud L. <arnaud.listes@codata.eu> wrote: > I can run the script just fine during working hours. I meant thru your scheduler (cron or something). > It hangs against the same line in the sql script, all lines being "\copy > (select ....) to 'file on unc share'". This is a new detail to me: what if you output to a local file and move it after on the share? > This line is simply the longest running query because the view inside > the select outputs almost 1M rows and does some subqueries. > I still don't get: is the line content the longest or the highest numbered in the output? > Also when the script is hung, output has not started (file size is 0). This makes me think there is some problem with the share, not the script. Again: test against a local file that you move (rsync?) to the share after. Luca
Le 27/08/2019 à 10:57, Luca Ferrari a écrit : > On Tue, Aug 27, 2019 at 10:48 AM Arnaud L. <arnaud.listes@codata.eu> wrote: >> I can run the script just fine during working hours. > > I meant thru your scheduler (cron or something). Yes, it runs fine too. There is indeed something happening on the server at this specific time but I don't know what. >> It hangs against the same line in the sql script, all lines being "\copy >> (select ....) to 'file on unc share'". > > This is a new detail to me: what if you output to a local file and > move it after on the share? I could do this but it would make the script a lot more complicated. There are a lot of views that I \copy directly to this share, and this is the only one that poses any problem. > I still don't get: is the line content the longest or the highest > numbered in the output? Sorry, I don't undertand either. Your question was "Does it hangs against the same line content or the same line number?". I run an sql script that contains a bunch of \copy commands. It is always the same \copy that is hanging. >> Also when the script is hung, output has not started (file size is 0). > > This makes me think there is some problem with the share, not the > script. Again: test against a local file that you move (rsync?) to the > share after. Everything is windows based, so rsync is not really an option. There are alternatives of course. I could write the file locally, but for robustness I'm trying not to rely on the local FS so that this script could be run from any machine on the network without modifications. Cheers -- Arnaud
On Tue, Aug 27, 2019 at 12:34 PM Arnaud L. <arnaud.listes@codata.eu> wrote: > I could do this but it would make the script a lot more complicated. > There are a lot of views that I \copy directly to this share, and this > is the only one that poses any problem. I would however give it a try. I would schedule a nightly execution with a local path to see if that completes. > Sorry, I don't undertand either. Your question was "Does it hangs > against the same line content or the same line number?". > I run an sql script that contains a bunch of \copy commands. It is > always the same \copy that is hanging. Now I got it: your script is hanging against a specific \copy command, while I was thinking it was hanging once it was dumping a specific table line from your copy. > I could write the file locally, but for robustness I'm trying not to > rely on the local FS so that this script could be run from any machine > on the network without modifications. My personal experience with shares and network that are not so much stable is that they can block, and that is why I'm suggesting to try on the local filesystem to see if that is a share-related problem or a resource problem. Also scheduling the script at a different time could help, because it could be at that time the machine (and it could mean the sharing machine) is busy at the point it does not respond. Luca
Le 27/08/2019 à 13:13, Luca Ferrari a écrit : > My personal experience with shares and network that are not so much > stable is that they can block, and that is why I'm suggesting to try > on the local filesystem to see if that is a share-related problem or a > resource problem. Also scheduling the script at a different time could > help, because it could be at that time the machine (and it could mean > the sharing machine) is busy at the point it does not respond. I'm quite confident that changing the schedule would help (since the script runs fine during work hours), but the time-table is already quite busy. I move the offending line at the end of the script, so it will run some minutes later, maybe this will be enough. I'll also try to write it directly on a local folder to rule this out. Cheers -- Arnaud
Interesting topic.
Is there any easy way to debug, and get correct error messages?
Regards,
Shao
On Tue, 27 Aug 2019 at 12:17, Arnaud L. <arnaud.listes@codata.eu> wrote:
Le 27/08/2019 à 13:13, Luca Ferrari a écrit :
> My personal experience with shares and network that are not so much
> stable is that they can block, and that is why I'm suggesting to try
> on the local filesystem to see if that is a share-related problem or a
> resource problem. Also scheduling the script at a different time could
> help, because it could be at that time the machine (and it could mean
> the sharing machine) is busy at the point it does not respond.
I'm quite confident that changing the schedule would help (since the
script runs fine during work hours), but the time-table is already quite
busy.
I move the offending line at the end of the script, so it will run some
minutes later, maybe this will be enough.
I'll also try to write it directly on a local folder to rule this out.
Cheers
--
Arnaud
Le 27/08/2019 à 13:17, Arnaud L. a écrit : > I move the offending line at the end of the script, so it will run some > minutes later, maybe this will be enough. OK, so this was enough for last night's schedule to run without problem. I still don't get it so I'm not satisfied with this solution, but at least it works. I'll keep the list informed if something new arises. Thanks to everyone for helping ! Cheers -- Arnaud
On Wed, Aug 28, 2019 at 9:09 AM Arnaud L. <arnaud.listes@codata.eu> wrote: > OK, so this was enough for last night's schedule to run without problem. > I still don't get it so I'm not satisfied with this solution, but at > least it works. > I'll keep the list informed if something new arises. I don't want to be pedantic, but I would have tried with a single change at a time. And my bet is: the local file would do the trick (i.e., it is a weird share problem). If you are going to invest some time, you could also try to write a small file on the share just before the copy starts, so that you are guaranteed the share is working. Something like: echo $(date) >> $SHARE/log.txt psql 'copy ...' echo 'done' >> $SHARE/log.txt Luca
Le 28/08/2019 à 09:43, Luca Ferrari a écrit : > I don't want to be pedantic, but I would have tried with a single > change at a time. > And my bet is: the local file would do the trick (i.e., it is a weird > share problem). You're not don't worry. This process is quite important in our workflow (not critical), so I can't really afford to make tests for weeks. Moving the offending line to the end of the script is what I'd consider "single change". To me it makes something clear : the problem occurs only at the very specific time this command is running. > If you are going to invest some time, you could also try to write a > small file on the share just before the copy starts, so that you are > guaranteed the share is working. Something like: > > echo $(date) >> $SHARE/log.txt > psql 'copy ...' > echo 'done' >> $SHARE/log.txt Well, I do know that the share is working just before this command runs because ALL the commands in the script write to this specific share. The script is basically outputting many views results in a single share, something like : \copy (select * from view1) TO '\\myserver\myshare\file1.csv' \copy (select * from view2) TO '\\myserver\myshare\file2.csv' ... \copy (select * from view99) TO '\\myserver\myshare\file99.csv' So the \copy command right before (say view15 for instance) is writing to this same share just milliseconds before the the problematic command (view16) tries to do the same. Since this particular view takes some time to execute, there is some time between the moment the file gets created on the share (which happens immediately when the \copy command runs if I get it right), and the moment psql receives content and starts writing to it (my tests suggest 1min to 1min30s). Either psql doesn't receive anything (possible, since the connection is marked as active but it does not look as if it's doing anything at all), or there has been some timeout. It could have been tcp keepalive, but in Windows the default is 2h I believe and postgresql uses system default if nothing is specified in conf (which is my case). So with all this in mind I'd rather think I have a problem with either psql's \copy or with my query on the server side. But I'm not rulling anything out of course. One other thing I could try is using COPY TO STDOUT \g. From what I understand in the documentation this would not be 100% similar to what \copy is doing. Anyway, if it works with the current setup I won't dig into it much deeper. Cheers -- Arnaud