Thread: Backend message type 0x44 arrived while idle

Backend message type 0x44 arrived while idle

From
Oleg Bartunov
Date:
While testing 6.5 CVS to see what progress Postgres has made in handling big joins, I get the following error messages:

select a.a,at1.a as t1,at2.a as t2,at3.a as t3,at4.a as t4,at5.a as t5,at6.a as t6,at7.a as t7,at8.a as t8,at9.a as t9,at10.a as t10
 from t0 a ,t1 at1,t2 at2,t3 at3,t4 at4,t5 at5,t6 at6,t7 at7,t8 at8,t9 at9,t10 at10
 where at1.a_id = a.a_id and at2.a_id=a.a_id and at3.a_id=a.a_id and at4.a_id=a.a_id and at5.a_id=a.a_id and at6.a_id=a.a_id and at7.a_id=a.a_id and at8.a_id=a.a_id and at9.a_id=a.a_id and at10.a_id=a.a_id ;

Backend message type 0x44 arrived while idle
Backend message type 0x44 arrived while idle
We have lost the connection to the backend, so further processing is impossible.  Terminating.

Postgres+psql have eaten all the memory+swap.
Are these messages OK in such a situation?

Oleg

PS.

Postgres still can't handle large joins :-(



_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [HACKERS] Backend message type 0x44 arrived while idle

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> While testing 6.5 CVS to see what progress Postgres has made in handling big joins, I get the following error messages:

I think there are still some nasty bugs in the GEQO planner.  (I assume
you have the GEQO threshold set to less than the number of tables in
your query?)  Bruce did a lot of good cleanup work on the main planner
but GEQO is mostly untouched.  I've been hoping to poke at it some more
before 6.5 release.
        regards, tom lane
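
For reference, the threshold Tom asks about is an ordinary per-session variable.  A minimal
psql sketch, assuming the 6.5-era "set GEQO" syntax that also appears (commented out) in
Oleg's test script further down in this thread; the SHOW form is an assumption:

    set geqo to 'ON=8';    -- hand joins of 8 or more tables to the genetic optimizer
    show geqo;             -- display the current setting
    set geqo to 'OFF';     -- always use the regular, exhaustive optimizer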


Re: [HACKERS] Backend message type 0x44 arrived while idle

From
Bruce Momjian
Date:
> Oleg Bartunov <oleg@sai.msu.su> writes:
> > While testing 6.5 CVS to see what progress Postgres has made in handling big joins, I get the following error messages:
> 
> I think there are still some nasty bugs in the GEQO planner.  (I assume
> you have the GEQO threshold set to less than the number of tables in
> your query?)  Bruce did a lot of good cleanup work on the main planner
> but GEQO is mostly untouched.  I've been hoping to poke at it some more
> before 6.5 release.

I hope I didn't break it.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


I wrote:
> Oleg Bartunov <oleg@sai.msu.su> writes:
>> While testing 6.5 CVS to see what progress Postgres has made in handling big joins, I get the following error messages:

> I think there are still some nasty bugs in the GEQO planner.

I have just committed some changes that fix bugs in the GEQO planner
and limit its memory usage.  It should now be possible to use GEQO even
for queries that join a very large number of tables --- at least from
the standpoint of not running out of memory during planning.  (It can
still take a while :-(.  I think that the default GEQO parameter
settings may be configured to use too many generations, but haven't
poked at this yet.)

I have observed that the regular optimizer requires about 50MB to plan
some ten-way joins, and can exceed my system's 128MB process data limit
on some eleven-way joins.  We currently have the GEQO threshold set at
11, which prevents the latter case by default --- but 50MB is a lot.
I wonder whether we shouldn't back the GEQO threshold off to 10.
(When I suggested setting it to 11, I was only looking at speed relative
to GEQO, not memory usage.  There is now a *big* difference in memory
usage...)  Comments?
        regards, tom lane


> I have observed that the regular optimizer requires about 50MB to plan
> some ten-way joins, and can exceed my system's 128MB process data limit
> on some eleven-way joins.  We currently have the GEQO threshold set at
> 11, which prevents the latter case by default --- but 50MB is a lot.
> I wonder whether we shouldn't back the GEQO threshold off to 10.
> (When I suggested setting it to 11, I was only looking at speed relative
> to GEQO, not memory usage.  There is now a *big* difference in memory
> usage...)  Comments?

You chose 11 by comparing GEQO with non-GEQO.  I think you will find
that with your improved GEQO, GEQO is faster for a smaller number of
joins, preventing the memory problem.  Can you check the speeds again?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


On Sun, 16 May 1999, Bruce Momjian wrote:

> Date: Sun, 16 May 1999 21:17:30 -0400 (EDT)
> From: Bruce Momjian <maillist@candle.pha.pa.us>
> To: Tom Lane <tgl@sss.pgh.pa.us>
> Cc: Oleg Bartunov <oleg@sai.msu.su>, pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] GEQO optimizer (was Re: Backend message type 0x44 arrived while idle)
> 
> > I have observed that the regular optimizer requires about 50MB to plan
> > some ten-way joins, and can exceed my system's 128MB process data limit
> > on some eleven-way joins.  We currently have the GEQO threshold set at
> > 11, which prevents the latter case by default --- but 50MB is a lot.
> > I wonder whether we shouldn't back the GEQO threshold off to 10.
> > (When I suggested setting it to 11, I was only looking at speed relative
> > to GEQO, not memory usage.  There is now a *big* difference in memory
> > usage...)  Comments?
> 
> You chose 11 by comparing GEQO with non-GEQO.  I think you will find
> that with your improved GEQO, GEQO is faster for a smaller number of
> joins, preventing the memory problem.  Can you check the speeds again?
> 

I confirm that a big join with 11 tables no longer eats all memory+swap on
my Linux box as it did before, but it runs *forever* :-).  It has already taken
18 minutes of CPU (P200, 64MB)!  Will wait.
8438 postgres  12   0 11104 3736  2620 R       0 98.6  5.9  18:16 postmaster

This query doesn't (explicitly) use GEQO

select t0.a,t1.a as t1,t2.a as t2,t3.a as t3,t4.a as t4,t5.a as t5,t6.a as t6,t7.a as t7,t8.a as t8,t9.a as t9,t10.a as t10
 from t0  ,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10
 where t1.a_id = t0.a_t1_id and t2.a_id=t0.a_t2_id and t3.a_id=t0.a_t3_id and t4.a_id=t0.a_t4_id and t5.a_id=t0.a_t5_id and t6.a_id=t0.a_t6_id and t7.a_id=t0.a_t7_id and t8.a_id=t0.a_t8_id and t9.a_id=t0.a_t9_id and t10.a_id=t0.a_t10_id ;

Regards,
Oleg


> -- 
>   Bruce Momjian                        |  http://www.op.net/~candle
>   maillist@candle.pha.pa.us            |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Oleg Bartunov <oleg@sai.msu.su> writes:
> I confirm that a big join with 11 tables no longer eats all memory+swap on
> my Linux box as it did before, but it runs *forever* :-).  It has already taken
> 18 minutes of CPU (P200, 64MB)!  Will wait.

18 minutes???  It takes barely over a minute on my aging 75MHz HP-PA
box.  (Practically all of which is planning time, since there are only
10 tuples to join... or are you doing this on a realistically sized
set of tables now?)

> This query doesn't (explicitly) use GEQO

> select t0.a,t1.a as t1,t2.a as t2,t3.a as t3,t4.a as t4,t5.a as t5,t6.a as t6,t7.a as t7,t8.a as t8,t9.a as t9,t10.a as t10
>  from t0  ,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10
>  where t1.a_id = t0.a_t1_id and t2.a_id=t0.a_t2_id and t3.a_id=t0.a_t3_id and t4.a_id=t0.a_t4_id and t5.a_id=t0.a_t5_id and t6.a_id=t0.a_t6_id and t7.a_id=t0.a_t7_id and t8.a_id=t0.a_t8_id and t9.a_id=t0.a_t9_id and t10.a_id=t0.a_t10_id;

No, but since there are 11 tables mentioned, it will be sent to the GEQO
optimizer anyway with the default GEQO threshold of 11...
        regards, tom lane
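
Since the default threshold already routes this 11-table query to GEQO, the two planners can
be compared on the same statement just by toggling the variable before re-running EXPLAIN; a
sketch only, assuming the same "set GEQO" forms as above:

    set geqo to 'OFF';     -- force the regular, exhaustive optimizer
    -- (re-run EXPLAIN on the 11-table select above to see the non-GEQO plan)
    set geqo to 'ON=11';   -- hand 11-way and larger joins back to GEQO (the default)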


On Mon, 17 May 1999, Tom Lane wrote:

> Date: Mon, 17 May 1999 09:44:18 -0400
> From: Tom Lane <tgl@sss.pgh.pa.us>
> To: Oleg Bartunov <oleg@sai.msu.su>
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] GEQO optimizer (was Re: Backend message type 0x44 arrived while idle) 
> 
> Oleg Bartunov <oleg@sai.msu.su> writes:
> > I confirm that a big join with 11 tables no longer eats all memory+swap on
> > my Linux box as it did before, but it runs *forever* :-).  It has already taken
> > 18 minutes of CPU (P200, 64MB)!  Will wait.
> 
> 18 minutes???  It takes barely over a minute on my aging 75MHz HP-PA
> box.  (Practically all of which is planning time, since there are only
> 10 tuples to join... or are you doing this on a realistically sized
> set of tables now?)
> 

Oops,

I found the problem.  I modified my test script to add 'vacuum analyze'
after creating the test data, and now it works really fast!  Great!
Now I'm wondering why I need vacuum analyze after creating test data
and indices.  What's the state of the discussion in hackers?
Regards,
    Oleg

> > This query doesn't (explicitly) use GEQO
> 
> > select t0.a,t1.a as t1,t2.a as t2,t3.a as t3,t4.a as t4,t5.a as t5,t6.a as t6,t7.a as t7,t8.a as t8,t9.a as t9,t10.a as t10
> >  from t0  ,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10
> >  where t1.a_id = t0.a_t1_id and t2.a_id=t0.a_t2_id and t3.a_id=t0.a_t3_id and t4.a_id=t0.a_t4_id and t5.a_id=t0.a_t5_id and t6.a_id=t0.a_t6_id and t7.a_id=t0.a_t7_id and t8.a_id=t0.a_t8_id and t9.a_id=t0.a_t9_id and t10.a_id=t0.a_t10_id;
> 
> No, but since there are 11 tables mentioned, it will be sent to the GEQO
> optimizer anyway with the default GEQO threshold of 11...
> 
>             regards, tom lane
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Tom,

I was so happy the problem was solved that I decided to play with more joins :-)
Indeed, the queries were processed very quickly, but at 14 tables the backend died:

COPY t13 FROM STDIN USING DELIMITERS '|';
vacuum analyze;
VACUUM
select t0.a,t1.a as t1,t2.a as t2,t3.a as t3,t4.a as t4,t5.a as t5,t6.a as t6,t7.a as t7,t8.a as t8,t9.a as t9,t10.a as t10,t11.a as t11,t12.a as t12,t13.a as t13
 from t0  ,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13
 where t1.a_id = t0.a_t1_id and t2.a_id=t0.a_t2_id and t3.a_id=t0.a_t3_id and t4.a_id=t0.a_t4_id and t5.a_id=t0.a_t5_id and t6.a_id=t0.a_t6_id and t7.a_id=t0.a_t7_id and t8.a_id=t0.a_t8_id and t9.a_id=t0.a_t9_id and t10.a_id=t0.a_t10_id and t11.a_id=t0.a_t11_id and t12.a_id=t0.a_t12_id and t13.a_id=t0.a_t13_id ;
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
We have lost the connection to the backend, so further processing is impossible.  Terminating.


Tom, could you try my script on your machine?
I attached the script.  You need Perl to run it.

mkjoindata.pl | psql test
Regards,
    Oleg


On Mon, 17 May 1999, Tom Lane wrote:

> Date: Mon, 17 May 1999 09:44:18 -0400
> From: Tom Lane <tgl@sss.pgh.pa.us>
> To: Oleg Bartunov <oleg@sai.msu.su>
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] GEQO optimizer (was Re: Backend message type 0x44 arrived while idle) 
> 
> Oleg Bartunov <oleg@sai.msu.su> writes:
> > I confirm that a big join with 11 tables no longer eats all memory+swap on
> > my Linux box as it did before, but it runs *forever* :-).  It has already taken
> > 18 minutes of CPU (P200, 64MB)!  Will wait.
> 
> 18 minutes???  It takes barely over a minute on my aging 75MHz HP-PA
> box.  (Practically all of which is planning time, since there are only
> 10 tuples to join... or are you doing this on a realistically sized
> set of tables now?)
> 
> > This query doesn't (explicitly) use GEQO
> 
> > select t0.a,t1.a as t1,t2.a as t2,t3.a as t3,t4.a as t4,t5.a as t5,t6.a as t6,t7.a as t7,t8.a as t8,t9.a as t9,t10.a as t10
> >  from t0  ,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10
> >  where t1.a_id = t0.a_t1_id and t2.a_id=t0.a_t2_id and t3.a_id=t0.a_t3_id and t4.a_id=t0.a_t4_id and t5.a_id=t0.a_t5_id and t6.a_id=t0.a_t6_id and t7.a_id=t0.a_t7_id and t8.a_id=t0.a_t8_id and t9.a_id=t0.a_t9_id and t10.a_id=t0.a_t10_id;
> 
> No, but since there are 11 tables mentioned, it will be sent to the GEQO
> optimizer anyway with the default GEQO threshold of 11...
> 
>             regards, tom lane
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
#!/usr/local/bin/perl -w
#use strict;

# Number of tables - classificators
my $ntables  = 14;
# Number of joins, must be <= $ntables !
my $njoins   = 13;
# dimensions of tables
my @nitems   = ( 200, 200, 10, 10, 10, 10, 10,10,10, 10,10);


die ("njoins ($njoins) must be <= ntables ($ntables) !")  if ( $njoins > $ntables);
my $tables = ($njoins < $ntables ) ? $njoins : $ntables;
my $SQL = '';
# classificators
for ( $i = 0; $i <= $tables; $i++ ) {
   my $table = 't'. $i;
   $SQL .= qq^drop table $table;\ncreate table $table ( a_id int4 NOT NULL, a varchar(10)^;
   if ( $i == 0 ) { # main table
      for ( $k = 1; $k <= $ntables; $k++ ) {
         $SQL .= ', a_t'.$k.'_id int4';
      }
   }
   $SQL .= qq^);\ncreate index a_id_$table on $table (a_id);\n^;
   if ( $i == 0 ) { # main table - create add. indices
      for ( $k = 1; $k <= $ntables; $k++ ) {
         my $table = 't'.$k;
         $SQL .= 'create index a_'.$table.'_id_t0 on t0 (a_'.$table.'_id);';
         $SQL .= "\n";
      }
   }
   $SQL .= qq^COPY $table FROM STDIN USING DELIMITERS '|';\n^;
   srand($i);
   for ( $j = 1; $j <= $nitems[$i]; $j++ ) {
      my @cols = ();
#     push @cols, (int(rand($nitems[$i])),'a'.$table.$j);
      push @cols, ($j,'a'.$table);
      if ( $i == 0 ) { # main table - add cols
         for ( $k = 1; $k <= $ntables; $k++ ) {
#           push @cols, int(rand($nitems[$k]));
            push @cols, $j;
         }
      }
      $SQL .= join ('|',@cols)."\n";
   }
   $SQL .= qq^\\.\n^;
}

# vacuum analyze
$SQL .= qq^vacuum analyze;\n^;
print "$SQL\n";

# Query with joins
my $SELECT = '';
# Try GEQO
#$SELECT .= qq^set GEQO to 'ON=$njoins';^;
#$SELECT .= 'explain ';
$SELECT .= 'select t0.a';
for ( $i = 1; $i <= $njoins; $i++ ) {
   $SELECT .= ',t'.$i.'.a as t'.$i;
}
$SELECT .= "\n".' from t0  ';
for ( $i = 1; $i <= $njoins; $i++ ) {
   $SELECT .= ',t'.$i;
}
$SELECT .= "\n".' where t1.a_id = t0.a_t1_id ';
for ( $i = 2; $i <= $njoins; $i++ ) {
   $SELECT .= 'and t'.$i.'.a_id=t0.a_t'.$i.'_id ';
}
$SELECT .= ';';

print $SELECT,"\n";

Oops,

it seems that was my fault: I didn't specify @nitems (sizes of tables)
for all tables.  Now it works fine.
Tom, even though it was my fault, why did Postgres die?
Regards,
    Oleg



On Mon, 17 May 1999, Oleg Bartunov wrote:

> Date: Mon, 17 May 1999 18:08:45 +0400 (MSD)
> From: Oleg Bartunov <oleg@sai.msu.su>
> To: Tom Lane <tgl@sss.pgh.pa.us>
> Cc: Oleg Bartunov <oleg@sai.msu.su>, pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] GEQO optimizer (was Re: Backend message type 0x44 arrived while idle) 
> 
> Tom,
> 
> I was so happy the problem was solved that I decided to play with more joins :-)
> Indeed, the queries were processed very quickly, but at 14 tables the backend died:
> 
> COPY t13 FROM STDIN USING DELIMITERS '|';
> vacuum analyze;
> VACUUM
>  
> select t0.a,t1.a as t1,t2.a as t2,t3.a as t3,t4.a as t4,t5.a as t5,t6.a as t6,t7.a as t7,t8.a as t8,t9.a as t9,t10.a as t10,t11.a as t11,t12.a as t12,t13.a as t13
>  from t0  ,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13
>  where t1.a_id = t0.a_t1_id and t2.a_id=t0.a_t2_id and t3.a_id=t0.a_t3_id and t4.a_id=t0.a_t4_id and t5.a_id=t0.a_t5_id and t6.a_id=t0.a_t6_id and t7.a_id=t0.a_t7_id and t8.a_id=t0.a_t8_id and t9.a_id=t0.a_t9_id and t10.a_id=t0.a_t10_id and t11.a_id=t0.a_t11_id and t12.a_id=t0.a_t12_id and t13.a_id=t0.a_t13_id ;
> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.
> We have lost the connection to the backend, so further processing is impossible.  Terminating.
> 
> 
> Tom, could you try my script on your machine?
> I attached the script.  You need Perl to run it.
> 
> mkjoindata.pl | psql test
> 
>     Regards,
> 
>         Oleg
> 
> 
> On Mon, 17 May 1999, Tom Lane wrote:
> 
> > Date: Mon, 17 May 1999 09:44:18 -0400
> > From: Tom Lane <tgl@sss.pgh.pa.us>
> > To: Oleg Bartunov <oleg@sai.msu.su>
> > Cc: pgsql-hackers@postgreSQL.org
> > Subject: Re: [HACKERS] GEQO optimizer (was Re: Backend message type 0x44 arrived while idle) 
> > 
> > Oleg Bartunov <oleg@sai.msu.su> writes:
> > > I confirm that a big join with 11 tables no longer eats all memory+swap on
> > > my Linux box as it did before, but it runs *forever* :-).  It has already taken
> > > 18 minutes of CPU (P200, 64MB)!  Will wait.
> > 
> > 18 minutes???  It takes barely over a minute on my aging 75MHz HP-PA
> > box.  (Practically all of which is planning time, since there are only
> > 10 tuples to join... or are you doing this on a realistically sized
> > set of tables now?)
> > 
> > > This query doesn't (explicitly) use GEQO
> > 
> > > select t0.a,t1.a as t1,t2.a as t2,t3.a as t3,t4.a as t4,t5.a as t5,t6.a as t6,t7.a as t7,t8.a as t8,t9.a as t9,t10.a as t10
> > >  from t0  ,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10
> > >  where t1.a_id = t0.a_t1_id and t2.a_id=t0.a_t2_id and t3.a_id=t0.a_t3_id and t4.a_id=t0.a_t4_id and t5.a_id=t0.a_t5_id and t6.a_id=t0.a_t6_id and t7.a_id=t0.a_t7_id and t8.a_id=t0.a_t8_id and t9.a_id=t0.a_t9_id and t10.a_id=t0.a_t10_id;
> > 
> > No, but since there are 11 tables mentioned, it will be sent to the GEQO
> > optimizer anyway with the default GEQO threshold of 11...
> > 
> >             regards, tom lane
> > 
> 
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Oleg Bartunov <oleg@sai.msu.su> writes:
> I found the problem.  I modified my test script to add 'vacuum analyze'
> after creating the test data, and now it works really fast!  Great!
> Now I'm wondering why I need vacuum analyze after creating test data
> and indices.

VACUUM ANALYZE would create pg_statistics entries for the tables,
which'd allow the optimizer to make better estimates of restriction
and join selectivities.  I expect that it changes the plan being used;
what does EXPLAIN say with and without the analyze?
        regards, tom lane
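
One quick way to see the effect Tom describes is to compare plans before and after analyzing;
a hedged sketch against the tables created by Oleg's script (the plans themselves will of
course differ from machine to machine):

    explain
     select t0.a, t1.a as t1, t2.a as t2
      from t0, t1, t2
      where t1.a_id = t0.a_t1_id and t2.a_id = t0.a_t2_id;
    vacuum analyze;        -- gather the per-column statistics the optimizer uses
    explain
     select t0.a, t1.a as t1, t2.a as t2
      from t0, t1, t2
      where t1.a_id = t0.a_t1_id and t2.a_id = t0.a_t2_id;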


Oleg Bartunov <oleg@sai.msu.su> writes:
> it seems that was my fault: I didn't specify @nitems (sizes of tables)
> for all tables.  Now it works fine.
> Tom, even though it was my fault, why did Postgres die?

I don't know --- I don't see it here.  I just ran your script as given,
and it worked.  (It produced zero rows of output, since the missing
nitems values meant that no data was loaded into the last few tables ...
but there was no backend crash.)

Is it crashing because of running out of memory, or something else?
Can you provide a backtrace?
        regards, tom lane


On Mon, 17 May 1999, Tom Lane wrote:

> Date: Mon, 17 May 1999 11:24:31 -0400
> From: Tom Lane <tgl@sss.pgh.pa.us>
> To: Oleg Bartunov <oleg@sai.msu.su>
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] GEQO optimizer (was Re: Backend message type 0x44 arrived while idle) 
> 
> Oleg Bartunov <oleg@sai.msu.su> writes:
> > it seems that was my fault: I didn't specify @nitems (sizes of tables)
> > for all tables.  Now it works fine.
> > Tom, even though it was my fault, why did Postgres die?
> 
> I don't know --- I don't see it here.  I just ran your script as given,
> and it worked.  (It produced zero rows of output, since the missing
> nitems values meant that no data was loaded into the last few tables ...
> but there was no backend crash.)
> 
> Is it crashing because of running out of memory, or something else?

No, memory is fine. It just dies.

> Can you provide a backtrace?

Will try to reproduce the crash.  How can I debug psql?



> 
>             regards, tom lane
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Obtaining a backtrace (was Re: [HACKERS] GEQO optimizer)

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
>> Can you provide a backtrace?

> Will try to reproduce the crash.  How can I debug psql?

There should be a core file left in the database subdirectory, eg
/usr/local/pgsql/data/base/DB/core

where DB represents the name of the database you used.

As the postgres user, do this (with appropriate pathname changes of course)
gdb /usr/local/pgsql/bin/postgres /usr/local/pgsql/data/base/DB/core

and when you get the (gdb) prompt, enter "bt" for backtrace.  You should
get a few dozen lines of printout, more or less like this:

(gdb) bt
#0  AllocSetAlloc (set=0x40254600, size=1076446936) at aset.c:267
#1  0x169314 in PortalHeapMemoryAlloc (this=0x40254600, size=36) at portalmem.c:264
#2  0x168bb4 in MemoryContextAlloc (context=0x4007d940, size=36) at mcxt.c:230
#3  0xe4d88 in newNode (size=36, tag=T_Resdom) at nodes.c:41
#4  0xea92c in makeResdom (resno=17920, restype=23, restypmod=-1, resname=0x0, reskey=0, reskeyop=0, resjunk=0) at makefuncs.c:102
#5  0x101448 in create_tl_element (var=0x402402e0, resdomno=36) at tlist.c:135
#6  0xf689c in new_join_tlist (tlist=0x40254600, first_resdomno=36) at joinrels.c:286
    ...

The "q" command will get you out of gdb after you've copied and pasted
this info.

BTW, if you did not build the backend with "-g" included in CFLAGS, you
will get a much less complete backtrace ... but it may still tell us
something of use.
        regards, tom lane


Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> I have observed that the regular optimizer requires about 50MB to plan
>> some ten-way joins, and can exceed my system's 128MB process data limit
>> on some eleven-way joins.  We currently have the GEQO threshold set at
>> 11, which prevents the latter case by default --- but 50MB is a lot.
>> I wonder whether we shouldn't back the GEQO threshold off to 10.
>> (When I suggested setting it to 11, I was only looking at speed relative
>> to GEQO, not memory usage.  There is now a *big* difference in memory
>> usage...)  Comments?

> You chose 11 by comparing GEQO with non-GEQO.  I think you will find
> that with your improved GEQO, GEQO is faster for a smaller number of
> joins, preventing the memory problem.  Can you check the speeds again?

Bruce, I have rerun a couple of tests and am getting numbers like these:
                          # tables joined  (plan times in seconds)
                      ...     10     11    ...

STD OPTIMIZER                 24    115
GEQO                          45     55

This is after tweaking the GEQO parameters to improve speed slightly
in the default case.  (Setting EFFORT=LOW reduces the 11-way plan time
to about 40 sec, setting EFFORT=HIGH makes it about 70.)

The breakpoint for speed is still clearly at GEQO threshold 11.
*However*, the regular optimizer uses close to 120MB of memory to
plan these 11-way joins, and that's excessive (especially since that's
not even counting the space that will be used for execution...).
Until we can do something about reclaiming space more effectively,
I recommend reducing the default GEQO threshold to 10.
        regards, tom lane


> > You chose 11 by comparing GEQO with non-GEQO.  I think you will find
> > that with your improved GEQO, GEQO is faster for a smaller number of
> > joins, preventing the memory problem.  Can you check the speeds again?
> 
> Bruce, I have rerun a couple of tests and am getting numbers like these:
> 
>                           # tables joined  (plan times in seconds)
> 
>                       ...     10     11    ...
> 
> STD OPTIMIZER                 24    115
> GEQO                          45     55
> 
> This is after tweaking the GEQO parameters to improve speed slightly
> in the default case.  (Setting EFFORT=LOW reduces the 11-way plan time
> to about 40 sec, setting EFFORT=HIGH makes it about 70.)
> 
> The breakpoint for speed is still clearly at GEQO threshold 11.
> *However*, the regular optimizer uses close to 120MB of memory to
> plan these 11-way joins, and that's excessive (especially since that's
> not even counting the space that will be used for execution...).
> Until we can do something about reclaiming space more effectively,
> I recommend reducing the default GEQO threshold to 10.

Agreed.


-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026