Thread: PG Seg Faults Performing a Query

PG Seg Faults Performing a Query

From: Bill Thoen
How would you suggest I try to track down this problem?
I run the following query:

SELECT a.* FROM compliance_2006 a, ers_regions b
  WHERE a.fips_st_cd=b.fips_st
    AND a.fips_cnty_cd=b.fips_cou AND b.region =1
    AND a.fips_st_cd='17' AND a.fips_cnty_cd='003';

and it works. But when I try this:

SELECT a.* FROM compliance_2006 a, ers_regions b
  WHERE a.fips_st_cd=b.fips_st
    AND a.fips_cnty_cd=b.fips_cou AND b.region =1
    AND a.fips_st_cd='17' ;

psql dies with the message:
Segmentation Fault.

Any suggestions?


Re: PG Seg Faults Performing a Query

From: "Scott Marlowe"
On 8/21/07, Bill Thoen <bthoen@gisnet.com> wrote:
> How would you suggest I try to track down this problem?
> I run the following query:
>
> SELECT a.* FROM compliance_2006 a, ers_regions b
>   WHERE a.fips_st_cd=b.fips_st
>     AND a.fips_cnty_cd=b.fips_cou AND b.region =1
>     AND a.fips_st_cd='17' AND a.fips_cnty_cd='003';
>
> and it works. But when I try this:
>
> SELECT a.* FROM compliance_2006 a, ers_regions b
>   WHERE a.fips_st_cd=b.fips_st
>     AND a.fips_cnty_cd=b.fips_cou AND b.region =1
>     AND a.fips_st_cd='17' ;
>
> psql dies with the message:
> Segmentation Fault.

So the client psql is what's dying, right?  In that case you're likely
getting too big a result set for psql to handle at once.  Try
declaring a cursor to hold your query and fetching 100 or 1000 or so
rows at a time.

Just guessing.  What's the exact text of the error message?
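
A minimal sketch of that cursor approach, reusing the query from the original
post (the cursor name and fetch size here are only examples):

BEGIN;
DECLARE big_cur CURSOR FOR
  SELECT a.* FROM compliance_2006 a, ers_regions b
    WHERE a.fips_st_cd=b.fips_st
      AND a.fips_cnty_cd=b.fips_cou AND b.region =1
      AND a.fips_st_cd='17';
FETCH 1000 FROM big_cur;   -- repeat until no rows are returned
CLOSE big_cur;
COMMIT;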

Re: PG Seg Faults Performing a Query

From: "Andrej Ricnik-Bay"
On 8/22/07, Bill Thoen <bthoen@gisnet.com> wrote:
> How would you suggest I try to track down this problem?
> Any suggestions?
postgres version?
Operating system?
Anything in the log(s)?


--
Please don't top post, and don't use HTML e-Mail :}  Make your quotes concise.

http://www.american.edu/econ/notes/htmlmail.htm

Re: PG Seg Faults Performing a Query

From: Bill Thoen
On Tue, Aug 21, 2007 at 04:38:42PM -0500, Scott Marlowe wrote:
> On 8/21/07, Bill Thoen <bthoen@gisnet.com> wrote:
> > How would you suggest I try to track down this problem?
> > I run the following query:
> >
> > SELECT a.* FROM compliance_2006 a, ers_regions b
> >   WHERE a.fips_st_cd=b.fips_st
> >     AND a.fips_cnty_cd=b.fips_cou AND b.region =1
> >     AND a.fips_st_cd='17' AND a.fips_cnty_cd='003';
> >
> > and it works. But when I try this:
> >
> > SELECT a.* FROM compliance_2006 a, ers_regions b
> >   WHERE a.fips_st_cd=b.fips_st
> >     AND a.fips_cnty_cd=b.fips_cou AND b.region =1
> >     AND a.fips_st_cd='17' ;
> >
> > psql dies with the message:
> > Segmentation Fault.
>
> So the client psql is what's dying, right?  In that case you're likely
> getting too big a result set for psql to handle at once.  Try
> declaring a cursor to hold your query and fetching 100 or 1000 or so
> rows at a time.
>
> Just guessing.  What's the exact text of the error message?
>

The exact message was:

Segmentation Fault.


But the table compliance_2006 is very big (18 million-plus records), so I'll
try that cursor idea. Even so, an error like that makes me think that
something's broken.

Re: PG Seg Faults Performing a Query

From: Bill Thoen
On Wed, Aug 22, 2007 at 09:46:21AM +1200, Andrej Ricnik-Bay wrote:
> On 8/22/07, Bill Thoen <bthoen@gisnet.com> wrote:
> > How would you suggest I try to track down this problem?
> > Any suggestions?
> postgres version?
> Operating system?
> Anything in the log(s)?

The PostgreSQL version is 8.1.5, running on Linux (Fedora Core 6). The last few
lines in the server log are:
LOG:  unexpected EOF on client connection
LOG:  transaction ID wrap limit is 1073746500, limited by database "postgres"
LOG:  transaction ID wrap limit is 1073746500, limited by database "postgres"

(I ran VACUUM FULL after it crashed to make sure there was no loose disk
space floating around, so that last line was probably from that.) I assume
that bit about "transaction wrap limit" is informational and not related to
this problem.

My PostgreSQL is working great for small SQL queries, even from my large
table (18 million records). But when I ask it to retrieve anything that
takes it more than 10 minutes to assemble, it crashes with this
"Segmentation Fault" error. I get so little feedback, and I'm still pretty
unfamiliar with PostgreSQL, that I don't even know where to begin.

This version of PostgreSQL was compiled from source along with various
other packages needed for GIS support, but the tables I'm trying to extract
data from contain no GIS information, so I believe this operation is
plain PostgreSQL.

Any help you can offer as to how I can track down what's wrong would be
greatly appreciated. If I can't get this to work and can only use small
tables in PG, then its usefulness to me will be pretty limited.

- Bill Thoen

Re: PG Seg Faults Performing a Query

From: Martijn van Oosterhout
On Wed, Aug 22, 2007 at 07:09:22AM -0600, Bill Thoen wrote:
> The PostgreSQL version is 8.1.5, running on Linux (Fedora Core 6). The last few
> lines in the server log are:
> LOG:  unexpected EOF on client connection
> LOG:  transaction ID wrap limit is 1073746500, limited by database "postgres"
> LOG:  transaction ID wrap limit is 1073746500, limited by database "postgres"

All indications are that your client is unable to hold the 18 million
row result and is crashing with out of memory. Nothing you do in the
server or client is going to magic up more memory for you; you need to
avoid fetching it all in the first place.

If you only want to display part of it, do a LIMIT <rows>. Or use a
cursor to page through it.
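
For example, a capped version of the failing query (the 1000 here is
arbitrary):

SELECT a.* FROM compliance_2006 a, ers_regions b
  WHERE a.fips_st_cd=b.fips_st
    AND a.fips_cnty_cd=b.fips_cou AND b.region =1
    AND a.fips_st_cd='17'
  LIMIT 1000;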

That said, it would be nice if it returned an error instead of
crashing.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: PG Seg Faults Performing a Query

From: Alvaro Herrera
Martijn van Oosterhout wrote:

> That said, it would be nice if it returned an error instead of
> crashing.

In my opinion it isn't just a matter of "would be nice".  It is a
possible bug that should be investigated.

A look at a stack trace from the crashing process would be the first
place to start.  In order to do that, please set "ulimit -c unlimited"
and rerun the query under psql.  That should produce a core file.  Then
run
gdb psql core
and inside gdb, execute "bt".  Please send that output our way.
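
Roughly, the whole sequence looks like this (the database name is a
placeholder, and depending on system settings the core file may be named
core.<pid> rather than core):

$ ulimit -c unlimited
$ psql mydb
mydb=# SELECT a.* FROM compliance_2006 a, ers_regions b WHERE ... ;
Segmentation Fault.
$ gdb psql core
(gdb) bt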

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: PG Seg Faults Performing a Query

From: Tom Lane
Bill Thoen <bthoen@gisnet.com> writes:
> My PostgreSQL is working great for small SQL queries, even from my large
> table (18 million records). But when I ask it to retrieve anything that
> takes it more than 10 minutes to assemble, it crashes with this
> "Segmentation Fault" error. I get so little feedback, and I'm still pretty
> unfamiliar with PostgreSQL, that I don't even know where to begin.

Running the client under gdb and getting a stack trace would be a good
place to begin.

FWIW, when I deliberately try to read a query result that's too large
for client memory, I get reasonable behavior:

regression=# select x, y, repeat('xyzzy',200) from generate_series(1,10000) x, generate_series(1,100) y;
out of memory for query result
regression=#

If you're seeing a segfault in psql then it sounds like a PG bug.  If
you're seeing a segfault in a homebrew program then I wonder whether
it's properly checking for an error return from libpq ...

            regards, tom lane

Re: PG Seg Faults Performing a Query

From: Bill Thoen
As you requested, here's what bt in gdb reports:
(gdb) bt
#0  0x0000003054264571 in fputc () from /lib64/libc.so.6
#1  0x000000000040dbd2 in print_aligned_text ()
#2  0x000000000040f10b in printTable ()
#3  0x000000000041020b in printQuery ()
#4  0x0000000000407906 in SendQuery ()
#5  0x0000000000409153 in MainLoop ()
#6  0x000000000040b16e in main ()

Please tell me what this means if you can, and whether I can fix this problem.

Thanks,
- Bill Thoen

Alvaro Herrera wrote:
> Martijn van Oosterhout wrote:
>
>
>> That said, it would be nice if it returned an error instead of
>> crashing.
>>
>
> In my opinion it isn't just a matter of "would be nice".  It is a
> possible bug that should be investigated.
>
> A look at a stack trace from the crashing process would be the first
> place to start.  In order to do that, please set "ulimit -c unlimited"
> and rerun the query under psql.  That should produce a core file.  Then
> run
> gdb psql core
> and inside gdb, execute "bt".  Please send that output our way.
>
>


Re: PG Seg Faults Performing a Query

From: Tom Lane
Bill Thoen <bthoen@gisnet.com> writes:
> As you requested, here's what bt in gdb reports:
> (gdb) bt
> #0  0x0000003054264571 in fputc () from /lib64/libc.so.6
> #1  0x000000000040dbd2 in print_aligned_text ()
> #2  0x000000000040f10b in printTable ()
> #3  0x000000000041020b in printQuery ()
> #4  0x0000000000407906 in SendQuery ()
> #5  0x0000000000409153 in MainLoop ()
> #6  0x000000000040b16e in main ()

Hmph.  So it looks like it successfully absorbed the query result from
the backend and is dying trying to print it.

What this smells like to me is someplace failing to check for a malloc()
failure result, but I don't see any such places in print.c.  And I
didn't have any luck reproducing the problem while exercising 8.1 psql
on 64-bit Fedora 6.  I got either "out of memory for query result" or
plain "out of memory", nothing else.

Can you install the postgresql debuginfo RPM, or reproduce this on a
custom build with debugging enabled?  Knowing just where the crash
is might help more.

            regards, tom lane

Re: PG Seg Faults Performing a Query

From: Tom Lane
Bill Thoen <bthoen@gisnet.com> writes:
> (gdb) bt
> #0  0x0000003054264571 in fputc () from /lib64/libc.so.6
> #1  0x000000000040dbc2 in print_aligned_text (title=0x0, headers=0x5665d0,
>     cells=0x2aaaaf8fc010, footers=0x557c90,
>     opt_align=0x557ef0 'l' <repeats 18 times>, "rr", 'l' <repeats 12
> times>, "rl lllllll", opt_tuples_only=0 '\0', opt_numeric_locale=0 '\0',
> opt_border=1,
>     encoding=8, fout=0x0) at print.c:448
> #2  0x000000000040f0eb in printTable (title=0x0, headers=0x5665d0,
>     cells=0x2aaaaf8fc010, footers=0x557c90,
>     align=0x557ef0 'l' <repeats 18 times>, "rr", 'l' <repeats 12 times>,
> "rlllll lll", opt=0x7fff3e3be8c0, fout=0x3054442760, flog=0x0) at
> print.c:1551

OK, so the problem is that print_aligned_text is being passed fout = NULL.
Since that wasn't what was passed to printTable, the conclusion must be
that PageOutput() was called and returned NULL --- that is, that its
popen() call failed.  Obviously we should put in some sort of check for
that.  I can see three reasonable responses: either make psql abort
entirely (akin to its out-of-memory behavior), or have it fall back to
not using the pager, either silently or after printing an error
message.  Any thoughts which way to jump?

Meanwhile, the question Bill needs to look into is why popen() is
failing for him.  I'm guessing it's a fork() failure at bottom, but
why so consistent?  strace'ing the psql run might provide some more
info.

            regards, tom lane
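
As an aside (not something suggested in the thread, just a sketch): since the
crash is in the pager pipe, turning psql's pager off or sending query output
straight to a file avoids the pager's popen() call entirely (the file path
below is only an example):

\pset pager off
-- or, instead, redirect query output to a file:
\o /tmp/results.txt
SELECT a.* FROM compliance_2006 a, ers_regions b WHERE ... ;
\o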

Re: PG Seg Faults Performing a Query

From: Bill Thoen
I'm a bit out of my depth using these debugging tools and
interpreting their results, but I think the problem is due to the output
being just too big for interactive display. Using the same query with
tighter limits in the WHERE clause works perfectly. When I changed the
SQL script to write its output into a table, it worked with the same query
using even looser limits in the WHERE clause. So sending output to a
table instead of to the monitor when a query produces a large amount
of output is reliable, faster, and doesn't tie up the machine.
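
Presumably something along these lines, since the actual script isn't shown
here (the output table name is made up):

CREATE TABLE compliance_region1_il AS
  SELECT a.* FROM compliance_2006 a, ers_regions b
    WHERE a.fips_st_cd=b.fips_st
      AND a.fips_cnty_cd=b.fips_cou AND b.region =1
      AND a.fips_st_cd='17';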

I tried using strace, but it produced so much output that I don't think it
would do me any good, and unfortunately I couldn't understand it anyway.
I don't want to bug the PostgreSQL list with a problem that's probably not
a PostgreSQL one, but if someone here would be willing to help me track
down this apparent popen() or fork() problem, I'd appreciate it. However,
I managed to get the results I needed, so we could also call this "fixed
via workaround."

Thanks for the help, Tom and others!
- Bill Thoen

Tom Lane wrote:
> Bill Thoen <bthoen@gisnet.com> writes:
>
>> (gdb) bt
>> #0  0x0000003054264571 in fputc () from /lib64/libc.so.6
>> #1  0x000000000040dbc2 in print_aligned_text (title=0x0, headers=0x5665d0,
>>     cells=0x2aaaaf8fc010, footers=0x557c90,
>>     opt_align=0x557ef0 'l' <repeats 18 times>, "rr", 'l' <repeats 12
>> times>, "rl lllllll", opt_tuples_only=0 '\0', opt_numeric_locale=0 '\0',
>> opt_border=1,
>>     encoding=8, fout=0x0) at print.c:448
>> #2  0x000000000040f0eb in printTable (title=0x0, headers=0x5665d0,
>>     cells=0x2aaaaf8fc010, footers=0x557c90,
>>     align=0x557ef0 'l' <repeats 18 times>, "rr", 'l' <repeats 12 times>,
>> "rlllll lll", opt=0x7fff3e3be8c0, fout=0x3054442760, flog=0x0) at
>> print.c:1551
>>
>
> OK, so the problem is that print_aligned_text is being passed fout = NULL.
> Since that wasn't what was passed to printTable, the conclusion must be
> that PageOutput() was called and returned NULL --- that is, that its
> popen() call failed.  Obviously we should put in some sort of check for
> that.  I can see three reasonable responses: either make psql abort
> entirely (akin to its out-of-memory behavior), or have it fall back to
> not using the pager, either silently or after printing an error
> message.  Any thoughts which way to jump?
>
> Meanwhile, the question Bill needs to look into is why popen() is
> failing for him.  I'm guessing it's a fork() failure at bottom, but
> why so consistent?  strace'ing the psql run might provide some more
> info.
>
>             regards, tom lane
>
>


Re: PG Seg Faults Performing a Query

From: Tom Lane
Bill Thoen <bthoen@gisnet.com> writes:
> I'm a bit out of my depth using these debugging tools and
> interpreting their results, but I think the problem is due to the output
> being just too big for interactive display.

Well, I can certainly believe it's related to the amount of data
involved, but the exact relationship is far from clear.  popen()
doesn't do any actual data-pushing, it just sets up a pipe and forks
a child process --- so even if the child fails immediately after being
forked, that wouldn't lead to the problem seen here.  The rarity of
a failure here explains why we hadn't noticed the lack of error checking
long ago.

What I suppose is that you are running into some system-wide resource
constraint.  Exactly which one, and whether it's easy to fix, remain to
be seen.

> I tried using strace, but it produced so much output that I don't think it
> would do me any good, and unfortunately I couldn't understand it anyway.

Sorry, I should have said: the last few dozen lines before the crash are
all that will be interesting.

            regards, tom lane