Thread: Why is psql \copy process stuck?
I have two postgres 7.4.6 databases, running on different hosts (Linux 2.6.11 kernel), and I'm copying data from one to the other as follows: psql -h 10.3.1.154 mydb -U username -c "\copy sometable to stdout" | \ psql mydb -U username -c "\copy sometable from stdin" Such copies are pretty frequent, and have always worked well. Today, the processes doing the copy have been stuck for a few hours. ps axw says 21167 ? S 0:00 /usr/local/pgsql/bin/psql -h 10.3.1.154 mydb -U username -c \copy sometable to stdout 21168 ? S 0:00 /usr/local/pgsql/bin/psql mydb -U username -c \copy sometable from stdin 21169 ? S< 0:00 postgres: risdba ris [local] COPY The amount of data is small, about 13,000 rows, an average of 150-200 bytes each. pg_locks shows this on the local host: table | transaction | pid | mode | granted ---------------+-------------+-------+---------------------+--------- ... sometable | | 21169 | RowExclusiveLock | t ... (I joined pg_locks with other catalog tables to get the table name.) One other pid is waiting to get an AccessExclusiveLock on sometable, but that's it. pg_locks doesn't have any other entries with granted = f. On the remote host, I see no sign of a COPY going on in pg_locks, (no lock of any kind for sometable). So it looks like the copy from stdin completed, as far as the remote db is concerned, but that the process reading from the remote db, 21167, doesn't realize this, and the copy from process (21168) is simply waiting for more data. The postgres logs don't indicate any trouble on either host around the time that the copy starts, or for the next few minutes. Why is process 21167 stuck? Jack Orenstein
Jack Orenstein <jao@geophile.com> writes: > 21167 ? S 0:00 /usr/local/pgsql/bin/psql -h 10.3.1.154 mydb -U username -c \copy sometable to stdout > Why is process 21167 stuck? Are you sure it's made a connection at all? You might try attaching to it with a debugger to get a stack trace. Also, see if you can identify a connected backend on the remote machine, and gdb that to see what it is doing. regards, tom lane
Tom Lane wrote: > Jack Orenstein <jao@geophile.com> writes: > >>21167 ? S 0:00 /usr/local/pgsql/bin/psql -h 10.3.1.154 mydb -U username -c \copy sometable to stdout > > >>Why is process 21167 stuck? > > > Are you sure it's made a connection at all? You might try attaching to > it with a debugger to get a stack trace. Also, see if you can identify > a connected backend on the remote machine, and gdb that to see what it > is doing. As far as I can tell there is no process corresponding to the copy on the remote node. I checked ps and looked at pg_locks. I'll work on attaching to the local copy processes. Jack Orenstein
Jack Orenstein <jorenstein@archivas.com> writes: > I'm not sure I did this right, (it's been a while since I worked with > gdb). But here's what I found. Doesn't look like either of those can be trusted very far :-(. But both of them seem to be waiting for input. What this looks like to me is some kind of networking problem. Perhaps the connection to the remote host got dropped and the local kernel hasn't realized it yet? From memory, we use TCP KEEPALIVE on the server socket but not in client-side code, so there's probably not any timeout that will trigger in anything like a sane amount of time. On the other hand, if the remote backend didn't die immediately, it *would* have recognized a network timeout and gone away in at most an hour or so. Did you check for any "unexpected EOF on client connection" or similar messages in the remote postmaster's log? regards, tom lane
Tom Lane wrote: > Jack Orenstein <jorenstein@archivas.com> writes: > >>I'm not sure I did this right, (it's been a while since I worked with >>gdb). But here's what I found. > > > Doesn't look like either of those can be trusted very far :-(. But both > of them seem to be waiting for input. > > What this looks like to me is some kind of networking problem. Perhaps > the connection to the remote host got dropped and the local kernel > hasn't realized it yet? From memory, we use TCP KEEPALIVE on the > server socket but not in client-side code, so there's probably not any > timeout that will trigger in anything like a sane amount of time. > > On the other hand, if the remote backend didn't die immediately, it > *would* have recognized a network timeout and gone away in at most an > hour or so. Did you check for any "unexpected EOF on client connection" > or similar messages in the remote postmaster's log? I did check, and didn't see any such message. Jack Orenstein