Thread: Why is psql \copy process stuck?

Why is psql \copy process stuck?

From
Jack Orenstein
Date:
I have two postgres 7.4.6 databases, running on different hosts (Linux
2.6.11 kernel), and I'm copying data from one to the other as follows:

     psql -h 10.3.1.154 mydb -U username -c "\copy sometable to stdout" | \
     psql mydb -U username -c "\copy sometable from stdin"

Such copies are pretty frequent, and have always worked well.

Today, the processes doing the copy have been stuck for a few
hours. ps axw says

21167 ?        S      0:00 /usr/local/pgsql/bin/psql -h 10.3.1.154 mydb -U username -c \copy
sometable to stdout
21168 ?        S      0:00 /usr/local/pgsql/bin/psql mydb -U username -c \copy sometable from stdin
21169 ?        S<     0:00 postgres: risdba ris [local] COPY

The amount of data is small, about 13,000 rows, an average of 150-200
bytes each.

pg_locks shows this on the local host:

  table         | transaction |  pid  |        mode         | granted
---------------+-------------+-------+---------------------+---------
...
  sometable     |             | 21169 | RowExclusiveLock    | t
...

(I joined pg_locks with other catalog tables to get the table name.)

One other pid is waiting to get an AccessExclusiveLock on sometable,
but that's it. pg_locks doesn't have any other entries with granted =
f.

On the remote host, I see no sign of a COPY going on in pg_locks, (no lock
of any kind for sometable).

So it looks like the copy from stdin completed, as far as the remote
db is concerned, but that the process reading from the remote db,
21167, doesn't realize this, and the copy from process (21168) is
simply waiting for more data.

The postgres logs don't indicate any trouble on either host around the
time that the copy starts, or for the next few minutes.

Why is process 21167 stuck?

Jack Orenstein

Re: Why is psql \copy process stuck?

From
Tom Lane
Date:
Jack Orenstein <jao@geophile.com> writes:
> 21167 ?        S      0:00 /usr/local/pgsql/bin/psql -h 10.3.1.154 mydb -U username -c \copy sometable to stdout

> Why is process 21167 stuck?

Are you sure it's made a connection at all?  You might try attaching to
it with a debugger to get a stack trace.  Also, see if you can identify
a connected backend on the remote machine, and gdb that to see what it
is doing.

            regards, tom lane

Re: Why is psql \copy process stuck?

From
Jack Orenstein
Date:
Tom Lane wrote:
> Jack Orenstein <jao@geophile.com> writes:
>
>>21167 ?        S      0:00 /usr/local/pgsql/bin/psql -h 10.3.1.154 mydb -U username -c \copy sometable to stdout
>
>
>>Why is process 21167 stuck?
>
>
> Are you sure it's made a connection at all?  You might try attaching to
> it with a debugger to get a stack trace.  Also, see if you can identify
> a connected backend on the remote machine, and gdb that to see what it
> is doing.

As far as I can tell there is no process corresponding to the copy
on the remote node. I checked ps and looked at pg_locks.

I'll work on attaching to the local copy processes.

Jack Orenstein

Re: Why is psql \copy process stuck?

From
Tom Lane
Date:
Jack Orenstein <jorenstein@archivas.com> writes:
> I'm not sure I did this right, (it's been a while since I worked with
> gdb). But here's what I found.

Doesn't look like either of those can be trusted very far :-(.  But both
of them seem to be waiting for input.

What this looks like to me is some kind of networking problem.  Perhaps
the connection to the remote host got dropped and the local kernel
hasn't realized it yet?  From memory, we use TCP KEEPALIVE on the
server socket but not in client-side code, so there's probably not any
timeout that will trigger in anything like a sane amount of time.

On the other hand, if the remote backend didn't die immediately, it
*would* have recognized a network timeout and gone away in at most an
hour or so.  Did you check for any "unexpected EOF on client connection"
or similar messages in the remote postmaster's log?

            regards, tom lane

Re: Why is psql \copy process stuck?

From
Jack Orenstein
Date:
Tom Lane wrote:
> Jack Orenstein <jorenstein@archivas.com> writes:
>
>>I'm not sure I did this right, (it's been a while since I worked with
>>gdb). But here's what I found.
>
>
> Doesn't look like either of those can be trusted very far :-(.  But both
> of them seem to be waiting for input.
>
> What this looks like to me is some kind of networking problem.  Perhaps
> the connection to the remote host got dropped and the local kernel
> hasn't realized it yet?  From memory, we use TCP KEEPALIVE on the
> server socket but not in client-side code, so there's probably not any
> timeout that will trigger in anything like a sane amount of time.
>
> On the other hand, if the remote backend didn't die immediately, it
> *would* have recognized a network timeout and gone away in at most an
> hour or so.  Did you check for any "unexpected EOF on client connection"
> or similar messages in the remote postmaster's log?

I did check, and didn't see any such message.

Jack Orenstein