Thread: Issue with pgAgent jobs forever running state after server restart

Issue with pgAgent jobs forever running state after server restart

From
"Paragon Corporation"
Date:
This has been an issue for a while but has only become a more common issue
for us recently and one we can more easily predict.

We are running pgAgent on windows, and whenever we restart the server, if a
job is in the middle of a run when we restart, it gets stuck in a forever
endless running state.

To fix the issue, we have to go into the pgagent.pga_job table and get rid
of the jobagentid that is in there for the specific job that is stuck.

We are running the pgAgent 3.3.0 that is available via Stack Builder.


Thanks,
Leo





Re: Issue with pgAgent jobs forever running state after server restart

From
Bastiaan Olij
Date:
Hi Leo,

On 17/03/13 9:37 AM, Paragon Corporation wrote:
> This has been an issue for a while but has only become a more common issue
> for us recently and one we can more easily predict.
>
> We are running pgAgent on windows, and whenever we restart the server, if a
> job is in the middle of a run when we restart, it gets stuck in a forever
> endless running state.
>
> To fix the issue, we have to go into the pgagent.pga_job table and get rid
> of the jobagentid that is in there for the specific job that is stuck.
>
> We are running the pgAgent 3.3.0 that is available via Stack Builder.
>
When the job starts pgAgent simply updates the status to running. When
it finishes it updates the status to either failed or completed.
When you restart in the middle of a job running that job gets
interrupted but as pgAgent also shuts down it is not able to change the
status on the job. When it restarts it has all but forgotten about that
job that was running.

That it still says running doesn't mean there is actually something
still running in this case, pgAgent in that respect is a pretty straight
forward unintelligent application (and I don't mean that in a negative
way). If pgAgent went down for some other reason the process could
indeed still be running and pgAgent has no way of knowing this. Not all
application shut down just because the host application disappears, some
are designed to complete the process if at all possible in the background.

The only way to get around this is indeed to do what you do and reset
the status manually.

I guess pgAgent could be enhanced to change the status to failed of any
job still marked as running when it starts up but if this is happening
often enough for you to worry about it you might want to talk to the
people who decide to restart servers while the server is actively busy
doing stuff like running that all too important backup process. I would
be much more worried about that then pgAgent failing to mark a job as
failed or complete on a restart of the server.

I have seen too many servers getting into serious problems because of
ignorant IT people who think janking out the power cord of a machine is
a perfectly reasonable way to fix windows explorer from having gotten stuck.

Cheers,

Bas



Re: Issue with pgAgent jobs forever running state after server restart

From
Dave Page
Date:
On Sat, Mar 16, 2013 at 11:25 PM, Bastiaan Olij <bastiaan@basenlily.me> wrote:
> Hi Leo,
>
> On 17/03/13 9:37 AM, Paragon Corporation wrote:
>> This has been an issue for a while but has only become a more common issue
>> for us recently and one we can more easily predict.
>>
>> We are running pgAgent on windows, and whenever we restart the server, if a
>> job is in the middle of a run when we restart, it gets stuck in a forever
>> endless running state.
>>
>> To fix the issue, we have to go into the pgagent.pga_job table and get rid
>> of the jobagentid that is in there for the specific job that is stuck.
>>
>> We are running the pgAgent 3.3.0 that is available via Stack Builder.
>>
> When the job starts pgAgent simply updates the status to running. When
> it finishes it updates the status to either failed or completed.
> When you restart in the middle of a job running that job gets
> interrupted but as pgAgent also shuts down it is not able to change the
> status on the job. When it restarts it has all but forgotten about that
> job that was running.

Actually, that's not what should happen. When an agent first runs, it
records it's backend PID in pga_jobagent. Then, when it starts a job
it sets the pga_job.jobagentid column to it's backend PID as well.
Later, when the agent restarts it:

1) Checks the pga_jobagent table for agent PIDs that are thought to be
running, and compares them with the PIDs listed in pg_stat_activity.
Any that don't existing in pg_stat_activity are considered to be
zombies and are recorded in a temp table.

2) It then updates the job log and job step log for any jobs/steps
that were being executed by a zombie agent, setting the status to 'd'
(oddly, for aborted).

3) It then resets the execution info in pga_job, so the job can
execute again on it's next schedule.

4) Finally it clears the entry from the pga_jobagent table.

As far as I'm aware, this has always worked well. The only caveat
being that it will fail of course, if when the agent restarts there
happens to be a new connection with the same backend PID as was
previously used. That means that the zombie detection can conceivably
fail from time to time, but that should be relatively rare. I'm not
sure there's any easy way round that, though maybe there is something
that could be done with the appname GUC that we have now.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company