Re: pg crashing - Mailing list pgsql-general

From Magnus Hagander
Subject Re: pg crashing
Date
Msg-id 486B8E00.2010303@hagander.net
Whole thread Raw
In response to Re: pg crashing  ("Roberts, Jon" <Jon.Roberts@asurion.com>)
Responses Re: pg crashing
List pgsql-general
Roberts, Jon wrote:
>> Roberts, Jon wrote:
>>>> Tom Lane wrote:
>>>>> "Roberts, Jon" <Jon.Roberts@asurion.com> writes:
>>>>>> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"
>>>>> Well, there are plenty of known bugs in 8.3.0 by now.  You really
>>>>> should update before complaining, not after.
>>>> Yes. And the traditional question should be asked - is there any
>>>> antivirus or other "personal security" software running on tihs
>>> machine?
>>>> If so, uninstall (not just disable!) it and see if the problem goes
>>> away.
>>>
>>> I am not able to un-install this.  However, this problem only
> started as
>>> the database grew in size and usage.  It is 232 MB in size now and
> it
>>> has 30 or so active sessions 24x7.
>> But you do run antivirus on the machine? Which antivirus?
>
> We are running Computer Associates eTrust Antivirus.  The realtime
> scanner has an exclusion of e:\PostgreSQL and all subdirectories.

Yeah. Exclusion alone often isn't enough though. CA isn't one of the
ones on my "blacklist", but *all* AV is on my very-dark-graylist.


>> It's quite possible that this is the reason. As the files grow, the AV
>> may take longer to do whatever crap it's doing to them, thus exceeding
>> timeouts.
>>
> The log files are located here: e:\PostgreSQL\data\pg_log
>
> Before I had that exclusion on the directory, the database would crash
> because data files would get locked.  We don't get those errors now at
> all so I don't think this problem is caused by the antivirus program.

It can still very much be caused by the AV.


> We do have lots of sessions running at once.  It seems that the extra
> load on the server could also cause it to exceed the timeout.

Lots of sessions = lots of processes = lots of AV don't like that.


>> In general, you can get all sorts of strange things when you run AV on
>> your database server. The general recommendation is never to do that
> if
>> you want things to work, and it will be the first thing you're told to
>> remove.
>>
> I agree to an extent.  I've used Oracle and SQL Server on Windows and
> when the antivirus program scans any database files, it screws it up.
> However, I've never had a problem with an antivirus program once the
> database directories were excluded.

The difference is that PostgreSQL uses a multi-process architecture with
inherited handles. Most AV don't deal well with that, and it can show up
as very strange errors. SQL Server and Oracle are both threadead on
win32, which is what the AV software is used to seeing.


>> It would be good if you could at least temporarily remove it and see
> if
>> it fixes the issue.
>>
>
> I can't do that.  In this corporate environment, I would get shot if I
> did that.

Oops. We wouldn't want *that* to happen :D


>>>> It would also be interesting to know if there are actually any
> other
>>>> processes running at this time.
>>>>
>>> Yes, there were about 30 active sessions executing functions.
>> Right, but are the processes still active at the moment whrere it
>> crashes, or has the postmaster managed to kill them off for the
> restart?
>
> Yes, there are active sessions.
>
> Could this be the problem?
>
> We have a queue table in which we only allow x number of concurrent
> jobs.  A job is first inserted into the Queue with a Status of "Queued".
> Next the job executes a function to update the Status to "Processing".
>
> Function Details:
> 1.  Creates a db_link
> 2.  "lock table gp_load.queue in access exclusive mode"
> 3.  executes this in the linked session:
>     select count(*) from gp_load.queue where status = 'Processing'
> 4.  if current jobs less than max jobs allowed then
>     Update gp_load.queue set status = 'Processing'
>     Commit
> 5.  Close connection
> 6.  Returns the status of still Queued or Processing
>
> I have to use a dblink because PostgreSQL doesn't have autonomous
> transactions.  We call a single function to do all of this work and the
> other jobs need to see the current Status so they know if they should
> wait or run right now.
>
> The calling function uses a while loop to execute this function.  If the
> function returns Queued, it means it was unable to change the status so
> it uses pg_sleep to wait 10 seconds and then it trys again until it is
> able to set the status to Processing.
>
> Sorry for the lengthy answer but yes, we could have active sessions
> using pg_sleep waiting for a slot in the Queue to process a job.  The
> log file shows that this happened too.

Not having looked at the internals of db_link, I'd say it's certainly
possible that this is the reason for the failed restart. If db_link is
blocking something, the postmaster can't kill it off, and it'll still be
sitting there holding a reference to the shared memory segment.

That said, it shouldn't be the reason why it's crashing in the first
place - just the reason why it won't restart properly.

//Magnus

pgsql-general by date:

Previous
From: "Albe Laurenz"
Date:
Subject: Re: SAST FATAL: could not access private key file "server.key"
Next
From: Felipe de Jesús Molina Bravo
Date:
Subject: Re: SRF written in C