Thread: pg_restore stuck in a loop?

pg_restore stuck in a loop?

From
Rod Taylor
Date:
I took a pg_dump -Fc from a Pg 8.0.1 installation on (Solaris / Sun)
with Slony installed and attempted to restore to 8.0.2 on Linux / x86
without Slony installed.

pg_restore skipped all of the Slony related items, but then it
immediately jumped up to 700MB of memory usage and 99% CPU, but the DB
connection was idle. Two hours later, memory usage had gradually dropped
to 500MB, but CPU was still going full still with an idle database
connection.

I eventually clued in and made a TOC and removed all of the Slony items,
but I'm still curious to know what exactly pg_restore had been doing for
the last hour or so.

-- 



Re: pg_restore stuck in a loop?

From
Tom Lane
Date:
Rod Taylor <pg@rbt.ca> writes:
> I eventually clued in and made a TOC and removed all of the Slony items,
> but I'm still curious to know what exactly pg_restore had been doing for
> the last hour or so.

You tell us ;-).  You've got the test case, attach to it with a debugger
and find out what it's doing.
        regards, tom lane


Re: pg_restore stuck in a loop?

From
Rod Taylor
Date:
On Tue, 2005-04-26 at 23:22 -0400, Tom Lane wrote:
> Rod Taylor <pg@rbt.ca> writes:
> > I eventually clued in and made a TOC and removed all of the Slony items,
> > but I'm still curious to know what exactly pg_restore had been doing for
> > the last hour or so.
> 
> You tell us ;-).  You've got the test case, attach to it with a debugger
> and find out what it's doing.

I wasn't entirely sure how to "catch it in action" so I just used CTRL+C
to interrupt the pg_restore process, and got these as backtraces:

(gdb) backtrace
#0  0xb74b9589 in mremap () from /lib/tls/libc.so.6
#1  0xb744dc13 in mremap_chunk () from /lib/tls/libc.so.6
#2  0xb744dfd1 in realloc () from /lib/tls/libc.so.6
#3  0xb75e46b9 in enlargePQExpBuffer ()
from /home/rbt/pg8/lib/libpq.so.4
#4  0xb75e481f in appendPQExpBufferChar ()
from /home/rbt/pg8/lib/libpq.so.4
#5  0x0804efe3 in _sendSQLLine ()
#6  0x0804f2bd in ExecuteSqlCommandBuf ()
#7  0x0805019e in _PrintData ()
#8  0x0804fee5 in _PrintTocData ()
#9  0x0804aba2 in RestoreArchive ()
#10 0x0804a525 in main ()

(gdb) backtrace
#0  0xb75e465e in enlargePQExpBuffer ()
from /home/rbt/pg8/lib/libpq.so.4
#1  0xb75e481f in appendPQExpBufferChar ()
from /home/rbt/pg8/lib/libpq.so.4
#2  0x0804efe3 in _sendSQLLine ()
#3  0x0804f2bd in ExecuteSqlCommandBuf ()
#4  0x0805019e in _PrintData ()
#5  0x0804fee5 in _PrintTocData ()
#6  0x0804aba2 in RestoreArchive ()
#7  0x0804a525 in main ()

(gdb) backtrace
#0  0xb75e4800 in appendPQExpBufferChar ()
from /home/rbt/pg8/lib/libpq.so.4
#1  0x0804efe3 in _sendSQLLine ()
#2  0x0804f2bd in ExecuteSqlCommandBuf ()
#3  0x0805019e in _PrintData ()
#4  0x0804fee5 in _PrintTocData ()
#5  0x0804aba2 in RestoreArchive ()
#6  0x0804a525 in main ()





-- 



Re: pg_restore stuck in a loop?

From
Tom Lane
Date:
Rod Taylor <pg@rbt.ca> writes:
> On Tue, 2005-04-26 at 23:22 -0400, Tom Lane wrote:
>> You tell us ;-).  You've got the test case, attach to it with a debugger
>> and find out what it's doing.

> I wasn't entirely sure how to "catch it in action" so I just used CTRL+C
> to interrupt the pg_restore process, and got these as backtraces:

Hm, I wonder if there is something broken about the "SQL parsing" logic
in _sendSQLLine, such that it could go into an infinite loop given the
right input?
        regards, tom lane


Re: pg_restore stuck in a loop?

From
Rod Taylor
Date:
On Thu, 2005-04-28 at 02:11 -0400, Tom Lane wrote:
> Rod Taylor <pg@rbt.ca> writes:
> > On Tue, 2005-04-26 at 23:22 -0400, Tom Lane wrote:
> >> You tell us ;-).  You've got the test case, attach to it with a debugger
> >> and find out what it's doing.
> 
> > I wasn't entirely sure how to "catch it in action" so I just used CTRL+C
> > to interrupt the pg_restore process, and got these as backtraces:
> 
> Hm, I wonder if there is something broken about the "SQL parsing" logic
> in _sendSQLLine, such that it could go into an infinite loop given the
> right input?

Let me see if I can put together a test case which is smaller for next
week, and I'll send that in.

-- 



Re: pg_restore stuck in a loop?

From
Tom Lane
Date:
Rod Taylor <pg@rbt.ca> writes:
> On Thu, 2005-04-28 at 02:11 -0400, Tom Lane wrote:
>> Hm, I wonder if there is something broken about the "SQL parsing" logic
>> in _sendSQLLine, such that it could go into an infinite loop given the
>> right input?

> Let me see if I can put together a test case which is smaller for next
> week, and I'll send that in.

If the parsing-bug theory is right, then there is probably some one
object in your database that will make even "pg_dump -s" go nuts.
        regards, tom lane


Re: pg_restore stuck in a loop?

From
Tom Lane
Date:
Rod Taylor <pg@rbt.ca> writes:
> On Tue, 2005-04-26 at 23:22 -0400, Tom Lane wrote:
>> Rod Taylor <pg@rbt.ca> writes:
>>> I eventually clued in and made a TOC and removed all of the Slony items,
>>> but I'm still curious to know what exactly pg_restore had been doing for
>>> the last hour or so.
>> 
>> You tell us ;-).  You've got the test case, attach to it with a debugger
>> and find out what it's doing.

> I wasn't entirely sure how to "catch it in action" so I just used CTRL+C
> to interrupt the pg_restore process, and got these as backtraces:

I had forgotten about this problem, but noticed it again in my mail
folder.  After staring at the code for awhile, I have a theory: whoever
added dollar-quote parsing to _sendSQLLine() did not understand that it
has to keep all its state in AH->sqlparse, because the input it's passed
can be broken into arbitrary bufferload boundaries.  Suppose that the
input contains $1$2 (which should properly be parsed as two parameter
symbols) and a bufferload boundary happens to fall between the first $
and the 1.  Because startDT is incorrectly treated as a reinitializable
local, the code will mistakenly decide that $1$ is a dollar-quote
start tag, and start looking for the matching close tag.  As long as it
doesn't happen to find a match, all the rest of the output intended to
go to the database (including both SQL commands and COPY data) is going
to get crammed into AH->sqlBuf.  This wouldn't be an infinite loop
exactly, but it could certainly drive the program into swap hell before
long, if the dump file contains sufficiently many megabytes of data.

To believe that this explains your report, we'd have to assume that the
Slony part of the dump included such a string and it wasn't within any
other form of quoting.  I'm not sure offhand whether pg_dump would ever
actually produce such a string --- in most contexts I'd expect "$n" to
be surrounded by spaces.

Whether or not this actually is what you saw, the code is definitely broken.
Will work on cleaning it up.
        regards, tom lane