Thread: pg crashing

pg crashing

From

"Roberts, Jon"

Date:

01 July 2008, 17:19:39

Platform: Windows 2003 Server 32 bit
Windows SharedSection set to 1024,3072 based on recommendations from
http://www.postgresql.org/docs/faqs.FAQ_windows.html#4.4 and
http://support.microsoft.com/kb/184802
Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"
Autovacuum: on
Virus Scanning: disabled on PostgreSQL home directory
Problem: My database keeps on crashing every few days with this type of
error message:

2008-07-01 10:46:30 CDT LOG:  all server processes terminated;
reinitializing
2008-07-01 10:46:31 CDT FATAL:  pre-existing shared memory block is
still in use
2008-07-01 10:46:31 CDT HINT:  Check if there are any old server
processes still running, and terminate them.


Is this problem fixed in 8.3.3 and/or fixed by moving to a Unix
environment like Solaris?


Jon

Re: pg crashing

From

Tom Lane

Date:

01 July 2008, 17:33:54

"Roberts, Jon" <Jon.Roberts@asurion.com> writes:
> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"

Well, there are plenty of known bugs in 8.3.0 by now.  You really
should update before complaining, not after.

> Problem: My database keeps on crashing every few days with this type of
> error message:

> 2008-07-01 10:46:30 CDT LOG:  all server processes terminated;
> reinitializing

I think your real problem is with what happened *before* that.
But:

> 2008-07-01 10:46:31 CDT FATAL:  pre-existing shared memory block is
> still in use
> 2008-07-01 10:46:31 CDT HINT:  Check if there are any old server
> processes still running, and terminate them.

Hmm ... the code in win32_shmem.c that generates this message seems
mighty bogus to me --- it's just hoping that one-second delay is
enough.  Another problem is that postmaster children that do
PGSharedMemoryDetach will still have valid inherited handles for
the shmem segment --- does that factor into the behavior?  It looks
to me like the CloseHandle ought to be in PGSharedMemoryDetach.

            regards, tom lane

Re: pg crashing

From

Magnus Hagander

Date:

02 July 2008, 03:21:22

Tom Lane wrote:
> "Roberts, Jon" <Jon.Roberts@asurion.com> writes:
>> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"
>
> Well, there are plenty of known bugs in 8.3.0 by now.  You really
> should update before complaining, not after.

Yes. And the traditional question should be asked - is there any
antivirus or other "personal security" software running on tihs machine?
If so, uninstall (not just disable!) it and see if the problem goes away.

>> 2008-07-01 10:46:30 CDT LOG:  all server processes terminated;
>> reinitializing
>
> I think your real problem is with what happened *before* that.

Agreed. There is some reason that they all terminated...

> But:
>> 2008-07-01 10:46:31 CDT FATAL:  pre-existing shared memory block is
>> still in use
>> 2008-07-01 10:46:31 CDT HINT:  Check if there are any old server
>> processes still running, and terminate them.
>
> Hmm ... the code in win32_shmem.c that generates this message seems
> mighty bogus to me --- it's just hoping that one-second delay is
> enough.

Well, we're basically waiting for the kernel cleanup thread to get run.
In my tests it never came above 20ms or so, so 1 second is a pretty long
time.

And you need *some* kind of timeout...

It would also be interesting to know if there are actually any other
processes running at this time.

> Another problem is that postmaster children that do
> PGSharedMemoryDetach will still have valid inherited handles for
> the shmem segment --- does that factor into the behavior?  It looks
> to me like the CloseHandle ought to be in PGSharedMemoryDetach.

Not as long as the processes die. If they die, their handles go with
them, and once the reference count goes to zero, the object goes away.

//Magnus

Re: pg crashing

From

Tom Lane

Date:

02 July 2008, 03:29:39

Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> Another problem is that postmaster children that do
>> PGSharedMemoryDetach will still have valid inherited handles for
>> the shmem segment --- does that factor into the behavior?  It looks
>> to me like the CloseHandle ought to be in PGSharedMemoryDetach.

> Not as long as the processes die. If they die, their handles go with
> them, and once the reference count goes to zero, the object goes away.

But the syslogger process (and maybe others) is *not* supposed to die.

            regards, tom lane

Re: pg crashing

From

Magnus Hagander

Date:

02 July 2008, 03:34:35

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Tom Lane wrote:
>>> Another problem is that postmaster children that do
>>> PGSharedMemoryDetach will still have valid inherited handles for
>>> the shmem segment --- does that factor into the behavior?  It looks
>>> to me like the CloseHandle ought to be in PGSharedMemoryDetach.
>
>> Not as long as the processes die. If they die, their handles go with
>> them, and once the reference count goes to zero, the object goes away.
>
> But the syslogger process (and maybe others) is *not* supposed to die.

Right. But are you saying we actually want to start up a new backend in
a directory where we already have a running syslogger (and maybe others)
processes, just no postmaster? I'd assume we might run into such simple
things as "sharing violations" on the logfile - if nothing inside the db
itself..

//Magnus

Re: pg crashing

From

Tom Lane

Date:

02 July 2008, 03:41:08

Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> But the syslogger process (and maybe others) is *not* supposed to die.

> Right. But are you saying we actually want to start up a new backend in
> a directory where we already have a running syslogger (and maybe others)
> processes, just no postmaster?

Not great, maybe, but what it looks to me is that the current system
guarantees that a postmaster with a syslogger child will never recover
from a backend-child crash.  That's not better.

            regards, tom lane

Re: pg crashing

From

"Roberts, Jon"

Date:

02 July 2008, 09:48:45

> "Roberts, Jon" <Jon.Roberts@asurion.com> writes:
> > Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"
>
> Well, there are plenty of known bugs in 8.3.0 by now.  You really
> should update before complaining, not after.

I'm not complaining.  I just want to make sure that if I upgrade, it
will fix the problem.  An upgrade could possible introduce a new
problem.  I also wonder if this is isolated to Win32 because we are
upgrading to Solaris very soon.

>
> > Problem: My database keeps on crashing every few days with this type
of
> > error message:
>
> > 2008-07-01 10:46:30 CDT LOG:  all server processes terminated;
> > reinitializing
>
> I think your real problem is with what happened *before* that.

I found the first instance of "crash" and then got the rest of the log
file.

2008-07-01 10:43:42 CDT LOG:  server process (PID 3524) exited with exit
code 128
2008-07-01 10:43:42 CDT LOG:  terminating any other active server
processes
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT CONTEXT:  SQL statement "SELECT
pg_sleep(cast(control.fn_get_variable('sleep_time') as int))"
    PL/pgSQL function "fn_update_status" line 136 at PERFORM
    SQL statement "SELECT  gp_load.fn_update_status( $1 ,  $2 ,
'Processing', '', 0)"
    PL/pgSQL function "fn_load" line 56 at PERFORM
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT CONTEXT:  PL/pgSQL function "fn_get_job_details"
line 114 at IF
    PL/pgSQL function "fn_load" line 465 at FOR over SELECT rows
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT CONTEXT:  SQL statement "SELECT
pg_sleep(cast(control.fn_get_variable('sleep_time') as int))"
    PL/pgSQL function "fn_update_status" line 136 at PERFORM
    SQL statement "SELECT  gp_load.fn_update_status( $1 ,  $2 ,
'Processing', '', 0)"
    PL/pgSQL function "fn_load" line 56 at PERFORM
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT CONTEXT:  SQL statement "SELECT
pg_sleep(cast(control.fn_get_variable('sleep_time') as int))"
    PL/pgSQL function "fn_update_status" line 136 at PERFORM
    SQL statement "SELECT  gp_load.fn_update_status( $1 ,  $2 ,
'Processing', '', 0)"
    PL/pgSQL function "fn_load" line 56 at PERFORM
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT CONTEXT:  SQL statement "SELECT
pg_sleep(cast(control.fn_get_variable('sleep_time') as int))"
    PL/pgSQL function "fn_update_status" line 136 at PERFORM
    SQL statement "SELECT  gp_load.fn_update_status( $1 ,  $2 ,
'Processing', '', 0)"
    PL/pgSQL function "fn_load" line 56 at PERFORM
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT CONTEXT:  SQL statement "SELECT
pg_sleep(cast(control.fn_get_variable('sleep_time') as int))"
    PL/pgSQL function "fn_update_status" line 136 at PERFORM
    SQL statement "SELECT  gp_load.fn_update_status( $1 ,  $2 ,
'Processing', '', 0)"
    PL/pgSQL function "fn_load" line 56 at PERFORM
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:42 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:42 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:42 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
INFO:  2008-07-01 10:44:42.373008-05:3000:Executing Query ..... create
temporary table order_iud on commit drop as select
s."jid",s."order_id",s."parent_order_id",s."claim_id",s."fulfillment_met
hod",s."order_status_id",s."payment_status",s."created_by",s."created_dt
m",s."modified_by",s."modified_dtm",s."deleted_flg",s."claimed_item_id",
s."order_state_hashcode",s."autopaymentauth_flg",s."salvageexpectation_s
tatus",s."snr_fee",s."salvageexpectation_reason",s."salvageexpectation_n
ote",s."nocharge_override",s."change_datetime",s."change_type",s."edw_pr
ocess_flag",s."edw_process_ts" from stage_intelliset.order s, (select
max("order".jid) as jid, "order".order_id from stage_intelliset."order"
group by "order".order_id) m where m.jid = s.jid and m.order_id::text =
s.order_id::text distributed by (order_id)
INFO:  2008-07-01 10:44:43.185777-05:3000:Execution Complete.
INFO:  2008-07-01 10:44:46.988228-05:5000:Executing Update ..... update
replica_intelliset.invoiceautopayment set "order_id" = y."order_id",
"invoice_id" = y."invoice_id", "transaction_id" = y."transaction_id",
"paymenttype_code" = y."paymenttype_code", "amount" = y."amount",
"status_code" = y."status_code", "deleted_flg" = y."deleted_flg",
"created_by" = y."created_by", "created_dtm" = y."created_dtm",
"modified_by" = y."modified_by", "modified_dtm" = y."modified_dtm",
edw_modified_dt = y.edw_process_ts, edw_modified_id = y.jid,
edw_clock_ts = clock_timestamp() from (select
a."invoiceautopayement_id", a."order_id", a."invoice_id",
a."transaction_id", a."paymenttype_code", a."amount", a."status_code",
a."deleted_flg", a."created_by", a."created_dtm", a."modified_by",
a."modified_dtm",  a.edw_process_ts, a.jid, a.change_type from
invoiceautopayment_iud a inner join
replica_intelliset.invoiceautopayment b on a.invoiceautopayement_id =
b.invoiceautopayement_id) y where
replica_intelliset.invoiceautopayment.invoiceautopayement_id =
y.invoiceautopayement_id and y.change_type <> 3
INFO:  2008-07-01 10:44:48.785215-05:5000:Update Complete.
INFO:  2008-07-01 10:44:48.786175-05:6000:Executing Insert ..... insert
into replica_intelliset.invoiceautopayment ("invoiceautopayement_id",
"order_id", "invoice_id", "transaction_id", "paymenttype_code",
"amount", "status_code", "deleted_flg", "created_by", "created_dtm",
"modified_by", "modified_dtm",  edw_created_dt, edw_modified_dt,
edw_created_id, edw_modified_id, edw_clock_ts) select
a."invoiceautopayement_id", a."order_id", a."invoice_id",
a."transaction_id", a."paymenttype_code", a."amount", a."status_code",
a."deleted_flg", a."created_by", a."created_dtm", a."modified_by",
a."modified_dtm",  edw_process_ts as edw_created_dt, null as
edw_modified_dt, a.jid as edw_created_id, null as edw_modified_id,
clock_timestamp() as edw_clock_ts from invoiceautopayment_iud a left
outer join replica_intelliset.invoiceautopayment b on
a.invoiceautopayement_id = b.invoiceautopayement_id where
(b.invoiceautopayement_id is null) and a.change_type <> 3
INFO:  2008-07-01 10:44:50.738058-05:6000:Insert Complete.
INFO:  2008-07-01 10:44:50.750247-05:8000:Executing insert into archive
table..... insert into stage_intelliset.invoiceautopayment_arch select *
from stage_intelliset.invoiceautopayment
INFO:  2008-07-01 10:44:50.839105-05:8000:Insert complete
2008-07-01 10:43:55 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:55 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:55 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:55 CDT CONTEXT:  SQL statement "select * from
dblink('gp',  $1 ) as t1 (return_text varchar)"
    PL/pgSQL function "fn_remote_sql" line 63 at SQL statement
    SQL statement "SELECT  gp_execute.fn_remote_sql( $1 ,  $2 )"
    PL/pgSQL function "fn_load" line 393 at PERFORM
2008-07-01 10:43:55 CDT LOG:  could not send data to client: No
connection could be made because the target machine actively refused it.


2008-07-01 10:43:55 CDT CONTEXT:  SQL statement "select * from
dblink('gp',  $1 ) as t1 (return_text varchar)"
    PL/pgSQL function "fn_remote_sql" line 63 at SQL statement
    SQL statement "SELECT  gp_execute.fn_remote_sql( $1 ,  $2 )"
    PL/pgSQL function "fn_load" line 393 at PERFORM
2008-07-01 10:43:55 CDT STATEMENT:  select * from gp_load.fn_load(809)
INFO:  2008-07-01 10:44:53.103065-05:5000:Executing Update ..... update
replica_intelliset.order set "parent_order_id" = y."parent_order_id",
"claim_id" = y."claim_id", "fulfillment_method" =
y."fulfillment_method", "order_status_id" = y."order_status_id",
"payment_status" = y."payment_status", "created_by" = y."created_by",
"created_dtm" = y."created_dtm", "modified_by" = y."modified_by",
"modified_dtm" = y."modified_dtm", "deleted_flg" = y."deleted_flg",
"claimed_item_id" = y."claimed_item_id", "order_state_hashcode" =
y."order_state_hashcode", "autopaymentauth_flg" =
y."autopaymentauth_flg", "salvageexpectation_status" =
y."salvageexpectation_status", "snr_fee" = y."snr_fee",
"salvageexpectation_reason" = y."salvageexpectation_reason",
"salvageexpectation_note" = y."salvageexpectation_note",
"nocharge_override" = y."nocharge_override", edw_modified_dt =
y.edw_process_ts, edw_modified_id = y.jid, edw_clock_ts =
clock_timestamp() from (select a."order_id", a."parent_order_id",
a."claim_id", a."fulfillment_method", a."order_status_id",
a."payment_status", a."created_by", a."created_dtm", a."modified_by",
a."modified_dtm", a."deleted_flg", a."claimed_item_id",
a."order_state_hashcode", a."autopaymentauth_flg",
a."salvageexpectation_status", a."snr_fee",
a."salvageexpectation_reason", a."salvageexpectation_note",
a."nocharge_override",  a.edw_process_ts, a.jid, a.change_type from
order_iud a inner join replica_intelliset.order b on a.order_id =
b.order_id) y where replica_intelliset.order.order_id = y.order_id and
y.change_type <> 3
INFO:  2008-07-01 10:44:54.536366-05:5000:Update Complete.
INFO:  2008-07-01 10:44:54.537312-05:6000:Executing Insert ..... insert
into replica_intelliset.order ("order_id", "parent_order_id",
"claim_id", "fulfillment_method", "order_status_id", "payment_status",
"created_by", "created_dtm", "modified_by", "modified_dtm",
"deleted_flg", "claimed_item_id", "order_state_hashcode",
"autopaymentauth_flg", "salvageexpectation_status", "snr_fee",
"salvageexpectation_reason", "salvageexpectation_note",
"nocharge_override",  edw_created_dt, edw_modified_dt, edw_created_id,
edw_modified_id, edw_clock_ts) select a."order_id", a."parent_order_id",
a."claim_id", a."fulfillment_method", a."order_status_id",
a."payment_status", a."created_by", a."created_dtm", a."modified_by",
a."modified_dtm", a."deleted_flg", a."claimed_item_id",
a."order_state_hashcode", a."autopaymentauth_flg",
a."salvageexpectation_status", a."snr_fee",
a."salvageexpectation_reason", a."salvageexpectation_note",
a."nocharge_override",  edw_process_ts as edw_created_dt, null as
edw_modified_dt, a.jid as edw_created_id, null as edw_modified_id,
clock_timestamp() as edw_clock_ts from order_iud a left outer join
replica_intelliset.order b on a.order_id = b.order_id where (b.order_id
is null) and a.change_type <> 3
INFO:  2008-07-01 10:44:54.577749-05:6000:Insert Complete.
INFO:  2008-07-01 10:44:54.583708-05:8000:Executing insert into archive
table..... insert into stage_intelliset.order_arch select * from
stage_intelliset.order
INFO:  2008-07-01 10:44:54.715591-05:8000:Insert complete
2008-07-01 10:43:58 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:43:58 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:43:58 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:43:58 CDT CONTEXT:  SQL statement "select * from
dblink('gp',  $1 ) as t1 (return_text varchar)"
    PL/pgSQL function "fn_remote_sql" line 63 at SQL statement
    SQL statement "SELECT  gp_execute.fn_remote_sql( $1 ,  $2 )"
    PL/pgSQL function "fn_load" line 393 at PERFORM
2008-07-01 10:43:58 CDT LOG:  could not send data to client: No
connection could be made because the target machine actively refused it.


2008-07-01 10:43:58 CDT CONTEXT:  SQL statement "select * from
dblink('gp',  $1 ) as t1 (return_text varchar)"
    PL/pgSQL function "fn_remote_sql" line 63 at SQL statement
    SQL statement "SELECT  gp_execute.fn_remote_sql( $1 ,  $2 )"
    PL/pgSQL function "fn_load" line 393 at PERFORM
2008-07-01 10:43:58 CDT STATEMENT:  select * from gp_load.fn_load(813)
2008-07-01 10:46:30 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:46:30 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:46:30 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:46:30 CDT CONTEXT:  PL/pgSQL function "fn_load" line 271
at IF
2008-07-01 10:46:30 CDT LOG:  could not send data to client: No
connection could be made because the target machine actively refused it.


2008-07-01 10:46:30 CDT CONTEXT:  PL/pgSQL function "fn_load" line 271
at IF
2008-07-01 10:46:30 CDT STATEMENT:  select * from gp_load.fn_load(757)
2008-07-01 10:46:30 CDT WARNING:  terminating connection because of
crash of another server process
2008-07-01 10:46:30 CDT DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2008-07-01 10:46:30 CDT HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2008-07-01 10:46:30 CDT CONTEXT:  SQL statement "select
remote_execute.fn_get_max( $1 ,  $2 ,  $3 ,  $4 ,  $5 ,  $6 ,  $7 ,  $8
,  $9 ,  $10 ,  $11 )"
    PL/pgSQL function "fn_load" line 265 at SQL statement
2008-07-01 10:46:30 CDT LOG:  could not send data to client: No
connection could be made because the target machine actively refused it.


2008-07-01 10:46:30 CDT CONTEXT:  SQL statement "select
remote_execute.fn_get_max( $1 ,  $2 ,  $3 ,  $4 ,  $5 ,  $6 ,  $7 ,  $8
,  $9 ,  $10 ,  $11 )"
    PL/pgSQL function "fn_load" line 265 at SQL statement
2008-07-01 10:46:30 CDT STATEMENT:  select * from gp_load.fn_load(965)
2008-07-01 10:46:30 CDT LOG:  all server processes terminated;
reinitializing
2008-07-01 10:46:31 CDT FATAL:  pre-existing shared memory block is
still in use
2008-07-01 10:46:31 CDT HINT:  Check if there are any old server
processes still running, and terminate them.

> But:
>
> > 2008-07-01 10:46:31 CDT FATAL:  pre-existing shared memory block is
> > still in use
> > 2008-07-01 10:46:31 CDT HINT:  Check if there are any old server
> > processes still running, and terminate them.
>
> Hmm ... the code in win32_shmem.c that generates this message seems
> mighty bogus to me --- it's just hoping that one-second delay is
> enough.  Another problem is that postmaster children that do
> PGSharedMemoryDetach will still have valid inherited handles for
> the shmem segment --- does that factor into the behavior?  It looks
> to me like the CloseHandle ought to be in PGSharedMemoryDetach.
>
>             regards, tom lane

This wouldn't be a problem in Unix?


Jon

Re: pg crashing

From

"Roberts, Jon"

Date:

02 July 2008, 09:51:51

> Tom Lane wrote:
> > "Roberts, Jon" <Jon.Roberts@asurion.com> writes:
> >> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"
> >
> > Well, there are plenty of known bugs in 8.3.0 by now.  You really
> > should update before complaining, not after.
>
> Yes. And the traditional question should be asked - is there any
> antivirus or other "personal security" software running on tihs
machine?
> If so, uninstall (not just disable!) it and see if the problem goes
away.

I am not able to un-install this.  However, this problem only started as
the database grew in size and usage.  It is 232 MB in size now and it
has 30 or so active sessions 24x7.

>
> >> 2008-07-01 10:46:30 CDT LOG:  all server processes terminated;
> >> reinitializing
> >
> > I think your real problem is with what happened *before* that.
>
> Agreed. There is some reason that they all terminated...
>

I just emailed a larger part of the log file.

>
> > But:
> >> 2008-07-01 10:46:31 CDT FATAL:  pre-existing shared memory block is
> >> still in use
> >> 2008-07-01 10:46:31 CDT HINT:  Check if there are any old server
> >> processes still running, and terminate them.
> >
> > Hmm ... the code in win32_shmem.c that generates this message seems
> > mighty bogus to me --- it's just hoping that one-second delay is
> > enough.
>
> Well, we're basically waiting for the kernel cleanup thread to get
run.
> In my tests it never came above 20ms or so, so 1 second is a pretty
long
> time.
>
> And you need *some* kind of timeout...
>
> It would also be interesting to know if there are actually any other
> processes running at this time.
>

Yes, there were about 30 active sessions executing functions.


> > Another problem is that postmaster children that do
> > PGSharedMemoryDetach will still have valid inherited handles for
> > the shmem segment --- does that factor into the behavior?  It looks
> > to me like the CloseHandle ought to be in PGSharedMemoryDetach.
>
> Not as long as the processes die. If they die, their handles go with
> them, and once the reference count goes to zero, the object goes away.
>
> //Magnus

Re: pg crashing

From

"Roberts, Jon"

Date:

02 July 2008, 09:55:16

> Magnus Hagander <magnus@hagander.net> writes:
> > Tom Lane wrote:
> >> But the syslogger process (and maybe others) is *not* supposed to
die.
>
> > Right. But are you saying we actually want to start up a new backend
in
> > a directory where we already have a running syslogger (and maybe
others)
> > processes, just no postmaster?
>
> Not great, maybe, but what it looks to me is that the current system
> guarantees that a postmaster with a syslogger child will never recover
> from a backend-child crash.  That's not better.
>

When you say "current system", do you mean PG on Windows?


Jon

Re: pg crashing

From

Craig Ringer

Date:

02 July 2008, 09:59:44

Roberts, Jon wrote:
>> "Roberts, Jon" <Jon.Roberts@asurion.com> writes:
>>> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"
>> Well, there are plenty of known bugs in 8.3.0 by now.  You really
>> should update before complaining, not after.
>
> I'm not complaining.  I just want to make sure that if I upgrade, it
> will fix the problem.  An upgrade could possible introduce a new
> problem.  I also wonder if this is isolated to Win32 because we are
> upgrading to Solaris very soon.

Are you able to get your hands on a cheap Linux box (say, convert a
spare desktop) and do some testing on that?

It may also be helpful to test on a Windows machine without a virus
scanner, personal firewall, or other such system-mangling poison.

If you can reproduce the fault in either of those configurations I'm
sure it'd be very informative. Personally (and my opinion isn't worth
much) I'd blame the virus scanner basically by default, because they're
always causing subtle problems and introducing weird quirks. Something
like PostgreSQL really does rely on the OS doing what it says it will,
and virus scanners tend to bend or break the rules.

--
Craig Ringer

Re: pg crashing

From

Magnus Hagander

Date:

02 July 2008, 10:06:35

Roberts, Jon wrote:
>> Tom Lane wrote:
>>> "Roberts, Jon" <Jon.Roberts@asurion.com> writes:
>>>> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"
>>> Well, there are plenty of known bugs in 8.3.0 by now.  You really
>>> should update before complaining, not after.
>> Yes. And the traditional question should be asked - is there any
>> antivirus or other "personal security" software running on tihs
> machine?
>> If so, uninstall (not just disable!) it and see if the problem goes
> away.
>
> I am not able to un-install this.  However, this problem only started as
> the database grew in size and usage.  It is 232 MB in size now and it
> has 30 or so active sessions 24x7.

But you do run antivirus on the machine? Which antivirus?

It's quite possible that this is the reason. As the files grow, the AV
may take longer to do whatever crap it's doing to them, thus exceeding
timeouts.

In general, you can get all sorts of strange things when you run AV on
your database server. The general recommendation is never to do that if
you want things to work, and it will be the first thing you're told to
remove.

It would be good if you could at least temporarily remove it and see if
it fixes the issue.

>> It would also be interesting to know if there are actually any other
>> processes running at this time.
>>
>
> Yes, there were about 30 active sessions executing functions.

Right, but are the processes still active at the moment whrere it
crashes, or has the postmaster managed to kill them off for the restart?

//Magnus

Re: pg crashing

From

"Roberts, Jon"

Date:

02 July 2008, 10:25:44

> Roberts, Jon wrote:
> >> Tom Lane wrote:
> >>> "Roberts, Jon" <Jon.Roberts@asurion.com> writes:
> >>>> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"
> >>> Well, there are plenty of known bugs in 8.3.0 by now.  You really
> >>> should update before complaining, not after.
> >> Yes. And the traditional question should be asked - is there any
> >> antivirus or other "personal security" software running on tihs
> > machine?
> >> If so, uninstall (not just disable!) it and see if the problem goes
> > away.
> >
> > I am not able to un-install this.  However, this problem only
started as
> > the database grew in size and usage.  It is 232 MB in size now and
it
> > has 30 or so active sessions 24x7.
>
> But you do run antivirus on the machine? Which antivirus?

We are running Computer Associates eTrust Antivirus.  The realtime
scanner has an exclusion of e:\PostgreSQL and all subdirectories.

>
> It's quite possible that this is the reason. As the files grow, the AV
> may take longer to do whatever crap it's doing to them, thus exceeding
> timeouts.
>
The log files are located here: e:\PostgreSQL\data\pg_log

Before I had that exclusion on the directory, the database would crash
because data files would get locked.  We don't get those errors now at
all so I don't think this problem is caused by the antivirus program.

We do have lots of sessions running at once.  It seems that the extra
load on the server could also cause it to exceed the timeout.

> In general, you can get all sorts of strange things when you run AV on
> your database server. The general recommendation is never to do that
if
> you want things to work, and it will be the first thing you're told to
> remove.
>
I agree to an extent.  I've used Oracle and SQL Server on Windows and
when the antivirus program scans any database files, it screws it up.
However, I've never had a problem with an antivirus program once the
database directories were excluded.

> It would be good if you could at least temporarily remove it and see
if
> it fixes the issue.
>

I can't do that.  In this corporate environment, I would get shot if I
did that.

>
> >> It would also be interesting to know if there are actually any
other
> >> processes running at this time.
> >>
> >
> > Yes, there were about 30 active sessions executing functions.
>
> Right, but are the processes still active at the moment whrere it
> crashes, or has the postmaster managed to kill them off for the
restart?
>

Yes, there are active sessions.

Could this be the problem?

We have a queue table in which we only allow x number of concurrent
jobs.  A job is first inserted into the Queue with a Status of "Queued".
Next the job executes a function to update the Status to "Processing".

Function Details:
1.  Creates a db_link
2.  "lock table gp_load.queue in access exclusive mode"
3.  executes this in the linked session:
    select count(*) from gp_load.queue where status = 'Processing'
4.  if current jobs less than max jobs allowed then
    Update gp_load.queue set status = 'Processing'
    Commit
5.  Close connection
6.  Returns the status of still Queued or Processing

I have to use a dblink because PostgreSQL doesn't have autonomous
transactions.  We call a single function to do all of this work and the
other jobs need to see the current Status so they know if they should
wait or run right now.

The calling function uses a while loop to execute this function.  If the
function returns Queued, it means it was unable to change the status so
it uses pg_sleep to wait 10 seconds and then it trys again until it is
able to set the status to Processing.

Sorry for the lengthy answer but yes, we could have active sessions
using pg_sleep waiting for a slot in the Queue to process a job.  The
log file shows that this happened too.

Jon

Re: pg crashing

From

Magnus Hagander

Date:

02 July 2008, 11:18:19

Roberts, Jon wrote:
>> Roberts, Jon wrote:
>>>> Tom Lane wrote:
>>>>> "Roberts, Jon" <Jon.Roberts@asurion.com> writes:
>>>>>> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"
>>>>> Well, there are plenty of known bugs in 8.3.0 by now.  You really
>>>>> should update before complaining, not after.
>>>> Yes. And the traditional question should be asked - is there any
>>>> antivirus or other "personal security" software running on tihs
>>> machine?
>>>> If so, uninstall (not just disable!) it and see if the problem goes
>>> away.
>>>
>>> I am not able to un-install this.  However, this problem only
> started as
>>> the database grew in size and usage.  It is 232 MB in size now and
> it
>>> has 30 or so active sessions 24x7.
>> But you do run antivirus on the machine? Which antivirus?
>
> We are running Computer Associates eTrust Antivirus.  The realtime
> scanner has an exclusion of e:\PostgreSQL and all subdirectories.

Yeah. Exclusion alone often isn't enough though. CA isn't one of the
ones on my "blacklist", but *all* AV is on my very-dark-graylist.


>> It's quite possible that this is the reason. As the files grow, the AV
>> may take longer to do whatever crap it's doing to them, thus exceeding
>> timeouts.
>>
> The log files are located here: e:\PostgreSQL\data\pg_log
>
> Before I had that exclusion on the directory, the database would crash
> because data files would get locked.  We don't get those errors now at
> all so I don't think this problem is caused by the antivirus program.

It can still very much be caused by the AV.


> We do have lots of sessions running at once.  It seems that the extra
> load on the server could also cause it to exceed the timeout.

Lots of sessions = lots of processes = lots of AV don't like that.


>> In general, you can get all sorts of strange things when you run AV on
>> your database server. The general recommendation is never to do that
> if
>> you want things to work, and it will be the first thing you're told to
>> remove.
>>
> I agree to an extent.  I've used Oracle and SQL Server on Windows and
> when the antivirus program scans any database files, it screws it up.
> However, I've never had a problem with an antivirus program once the
> database directories were excluded.

The difference is that PostgreSQL uses a multi-process architecture with
inherited handles. Most AV don't deal well with that, and it can show up
as very strange errors. SQL Server and Oracle are both threadead on
win32, which is what the AV software is used to seeing.


>> It would be good if you could at least temporarily remove it and see
> if
>> it fixes the issue.
>>
>
> I can't do that.  In this corporate environment, I would get shot if I
> did that.

Oops. We wouldn't want *that* to happen :D


>>>> It would also be interesting to know if there are actually any
> other
>>>> processes running at this time.
>>>>
>>> Yes, there were about 30 active sessions executing functions.
>> Right, but are the processes still active at the moment whrere it
>> crashes, or has the postmaster managed to kill them off for the
> restart?
>
> Yes, there are active sessions.
>
> Could this be the problem?
>
> We have a queue table in which we only allow x number of concurrent
> jobs.  A job is first inserted into the Queue with a Status of "Queued".
> Next the job executes a function to update the Status to "Processing".
>
> Function Details:
> 1.  Creates a db_link
> 2.  "lock table gp_load.queue in access exclusive mode"
> 3.  executes this in the linked session:
>     select count(*) from gp_load.queue where status = 'Processing'
> 4.  if current jobs less than max jobs allowed then
>     Update gp_load.queue set status = 'Processing'
>     Commit
> 5.  Close connection
> 6.  Returns the status of still Queued or Processing
>
> I have to use a dblink because PostgreSQL doesn't have autonomous
> transactions.  We call a single function to do all of this work and the
> other jobs need to see the current Status so they know if they should
> wait or run right now.
>
> The calling function uses a while loop to execute this function.  If the
> function returns Queued, it means it was unable to change the status so
> it uses pg_sleep to wait 10 seconds and then it trys again until it is
> able to set the status to Processing.
>
> Sorry for the lengthy answer but yes, we could have active sessions
> using pg_sleep waiting for a slot in the Queue to process a job.  The
> log file shows that this happened too.

Not having looked at the internals of db_link, I'd say it's certainly
possible that this is the reason for the failed restart. If db_link is
blocking something, the postmaster can't kill it off, and it'll still be
sitting there holding a reference to the shared memory segment.

That said, it shouldn't be the reason why it's crashing in the first
place - just the reason why it won't restart properly.

//Magnus

Re: pg crashing

From

"Roberts, Jon"

Date:

02 July 2008, 11:40:34

>
> Not having looked at the internals of db_link, I'd say it's certainly
> possible that this is the reason for the failed restart. If db_link is
> blocking something, the postmaster can't kill it off, and it'll still
be
> sitting there holding a reference to the shared memory segment.
>
> That said, it shouldn't be the reason why it's crashing in the first
> place - just the reason why it won't restart properly.
>

Is this a problem in Unix?  We are about 1 - 2 weeks away from moving
this database to Solaris.


Jon

Re: pg crashing

From

Magnus Hagander

Date:

02 July 2008, 11:45:51

Roberts, Jon wrote:
>> Not having looked at the internals of db_link, I'd say it's certainly
>> possible that this is the reason for the failed restart. If db_link is
>> blocking something, the postmaster can't kill it off, and it'll still
> be
>> sitting there holding a reference to the shared memory segment.
>>
>> That said, it shouldn't be the reason why it's crashing in the first
>> place - just the reason why it won't restart properly.
>>
>
> Is this a problem in Unix?  We are about 1 - 2 weeks away from moving
> this database to Solaris.

Not likely, but I'd test it anyway. If the issue is related to AV, it's
certainly fine - you won't be running AV on your Solaris. But more
importantly, Unix has actual support for signals and not just the fake
stuff we have on Win32, so it's likely that the postmaster will be
capable of killing the child processes.

//Magnus

Re: pg crashing

From

Tom Lane

Date:

02 July 2008, 12:10:49

"Roberts, Jon" <Jon.Roberts@asurion.com> writes:
> I'm not complaining.  I just want to make sure that if I upgrade, it
> will fix the problem.  An upgrade could possible introduce a new
> problem.  I also wonder if this is isolated to Win32 because we are
> upgrading to Solaris very soon.

The specific problem you are seeing is clearly isolated to Win32,
because the error message you are getting comes out of Win32-only code.
This is not to say that there might not be comparable failures on
Unixen, but PG has a lot more track record on Unixen ...

            regards, tom lane

Re: pg crashing

From

Tom Lane

Date:

02 July 2008, 12:18:34

BTW, just looking at win32_shmem.c ...

    retptr = malloc(bufsize + 1 + 18);    /* 1 NULL and 18 for
                                           * Global\PostgreSQL: */
    if (retptr == NULL)
        elog(FATAL, "could not allocate memory for shared memory name");

    strcpy(retptr, "Global\\PostgreSQL:");
    r = GetFullPathName(DataDir, bufsize, retptr + 11, NULL);

Surely that "11" ought to be "18".  Also, since the loop immediately
below this is going to convert \ to /, wouldn't it be clearer to
describe the path prefix as Global/PostgreSQL: in the first place?

(BTW, as far as I can tell the +1 added to the malloc request is
useless: bufsize includes the trailing null, and the code would
not work if it did not.)

            regards, tom lane

Re: pg crashing

From

Tom Lane

Date:

02 July 2008, 12:52:32

Magnus Hagander <magnus@hagander.net> writes:
> Not likely, but I'd test it anyway. If the issue is related to AV, it's
> certainly fine - you won't be running AV on your Solaris. But more
> importantly, Unix has actual support for signals and not just the fake
> stuff we have on Win32, so it's likely that the postmaster will be
> capable of killing the child processes.

I'm not sure what failure mode you're imagining, but the postmaster has
already verified that all the children that are supposed to be connected
to shared memory are dead before it attempts to recreate shared memory.
So the above sounds completely bogus.

I'm still suspicious of the syslogger holding onto an inherited handle
to the shared-memory file, though that theory would seem to mean that
crash recovery would never work at all on Windows if the syslogger
were enabled.  But maybe there is some additional gating factor needed
to cause the problem to manifest.

            regards, tom lane

Re: pg crashing

From

Magnus Hagander

Date:

02 July 2008, 12:55:22

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Not likely, but I'd test it anyway. If the issue is related to AV, it's
>> certainly fine - you won't be running AV on your Solaris. But more
>> importantly, Unix has actual support for signals and not just the fake
>> stuff we have on Win32, so it's likely that the postmaster will be
>> capable of killing the child processes.
>
> I'm not sure what failure mode you're imagining, but the postmaster has
> already verified that all the children that are supposed to be connected
> to shared memory are dead before it attempts to recreate shared memory.
> So the above sounds completely bogus.
>
> I'm still suspicious of the syslogger holding onto an inherited handle
> to the shared-memory file, though that theory would seem to mean that
> crash recovery would never work at all on Windows if the syslogger
> were enabled.  But maybe there is some additional gating factor needed
> to cause the problem to manifest.

Well, the syslogger is enabled by default on *all* binary installs on
windows, so I think we would've seen more if it never works.

I'll see if I can repro a case like it to see if the syslogger prevents
the shared mem from going away when I get back to a dev box. Should be
enough to just stick a sleep preventing it from stopping, right?

//Magnus

Re: pg crashing

From

Tom Lane

Date:

02 July 2008, 13:04:50

Magnus Hagander <magnus@hagander.net> writes:
> I'll see if I can repro a case like it to see if the syslogger prevents
> the shared mem from going away when I get back to a dev box. Should be
> enough to just stick a sleep preventing it from stopping, right?

The syslogger isn't restarted at all during a crash --- this isn't
a race-condition scenario.

If there is a race condition here, it must be associated with cleanup
for a process continuing to happen after win32_waitpid has already
reported it dead.  Hmm ... how much do we trust that bit of spaghetti
around pgwin32_deadchild_callback?  What condition is it really waiting
for?

            regards, tom lane

Re: pg crashing

From

"Roberts, Jon"

Date:

08 July 2008, 10:59:58

> Roberts, Jon wrote:
> >> Not having looked at the internals of db_link, I'd say it's
certainly
> >> possible that this is the reason for the failed restart. If db_link
is
> >> blocking something, the postmaster can't kill it off, and it'll
still
> > be
> >> sitting there holding a reference to the shared memory segment.
> >>
> >> That said, it shouldn't be the reason why it's crashing in the
first
> >> place - just the reason why it won't restart properly.
> >>
> >
> > Is this a problem in Unix?  We are about 1 - 2 weeks away from
moving
> > this database to Solaris.
>
> Not likely, but I'd test it anyway. If the issue is related to AV,
it's
> certainly fine - you won't be running AV on your Solaris. But more
> importantly, Unix has actual support for signals and not just the fake
> stuff we have on Win32, so it's likely that the postmaster will be
> capable of killing the child processes.
>

Our AV program has been off for a while now and I haven't had a crash.
I think part of the problem is how we have PostgreSQL installed and how
eTrust is configured.  We have the binaries installed on C:\program
files\PostgreSQL\8.3\ and the data is located on E:\PostgreSQL\data\.
We have eTrust excluding just E:\PostgreSQL\data\.

I'm guessing the activity on the binaries causes some scanning which may
have slowed down the cleanup enough to cause the crash to happen.

Jon

Re: pg crashing

From

Magnus Hagander

Date:

01 August 2008, 08:59:36

Roberts, Jon wrote:
>> Roberts, Jon wrote:
>>>> Not having looked at the internals of db_link, I'd say it's
> certainly
>>>> possible that this is the reason for the failed restart. If db_link
> is
>>>> blocking something, the postmaster can't kill it off, and it'll
> still
>>> be
>>>> sitting there holding a reference to the shared memory segment.
>>>>
>>>> That said, it shouldn't be the reason why it's crashing in the
> first
>>>> place - just the reason why it won't restart properly.
>>>>
>>> Is this a problem in Unix?  We are about 1 - 2 weeks away from
> moving
>>> this database to Solaris.
>> Not likely, but I'd test it anyway. If the issue is related to AV,
> it's
>> certainly fine - you won't be running AV on your Solaris. But more
>> importantly, Unix has actual support for signals and not just the fake
>> stuff we have on Win32, so it's likely that the postmaster will be
>> capable of killing the child processes.
>>
>
> Our AV program has been off for a while now and I haven't had a crash.
> I think part of the problem is how we have PostgreSQL installed and how
> eTrust is configured.  We have the binaries installed on C:\program
> files\PostgreSQL\8.3\ and the data is located on E:\PostgreSQL\data\.
> We have eTrust excluding just E:\PostgreSQL\data\.
>
> I'm guessing the activity on the binaries causes some scanning which may
> have slowed down the cleanup enough to cause the crash to happen.

Yeah, that does seem like a reasonable explanation. Yet another reason
not to use AV on your database server ;-) And if you absolutely have to,
exclude the postgresql stuff.

Since we do re-execute postgres.exe for every new connection, it's quite
possible that the AV scanned it every single time, and it's a fairly
large EXE...

//Magnus

Re: pg crashing

From

Magnus Hagander

Date:

12 August 2008, 09:25:18

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> I'll see if I can repro a case like it to see if the syslogger prevents
>> the shared mem from going away when I get back to a dev box. Should be
>> enough to just stick a sleep preventing it from stopping, right?
>
> The syslogger isn't restarted at all during a crash --- this isn't
> a race-condition scenario.
>
> If there is a race condition here, it must be associated with cleanup
> for a process continuing to happen after win32_waitpid has already
> reported it dead.  Hmm ... how much do we trust that bit of spaghetti
> around pgwin32_deadchild_callback?  What condition is it really waiting
> for?

I looked that code over a bit again, and it still looks good to me :-)
The wait on the handle will fire when a process exits (according to the
API). When it does, we post that information to the queue and send
SIGCHLD. And the waitpid function pick off the top of the queue.

(It's not particularly spaghettified if you know your way around those
APIs :-P That's not to say it's impossible there's a bug there, of course)

//Magnus