Thread: Issue with pgAgent jobs forever running state after server restart
This has been an issue for a while but has only become a more common issue for us recently and one we can more easily predict. We are running pgAgent on windows, and whenever we restart the server, if a job is in the middle of a run when we restart, it gets stuck in a forever endless running state. To fix the issue, we have to go into the pgagent.pga_job table and get rid of the jobagentid that is in there for the specific job that is stuck. We are running the pgAgent 3.3.0 that is available via Stack Builder. Thanks, Leo
Hi Leo, On 17/03/13 9:37 AM, Paragon Corporation wrote: > This has been an issue for a while but has only become a more common issue > for us recently and one we can more easily predict. > > We are running pgAgent on windows, and whenever we restart the server, if a > job is in the middle of a run when we restart, it gets stuck in a forever > endless running state. > > To fix the issue, we have to go into the pgagent.pga_job table and get rid > of the jobagentid that is in there for the specific job that is stuck. > > We are running the pgAgent 3.3.0 that is available via Stack Builder. > When the job starts pgAgent simply updates the status to running. When it finishes it updates the status to either failed or completed. When you restart in the middle of a job running that job gets interrupted but as pgAgent also shuts down it is not able to change the status on the job. When it restarts it has all but forgotten about that job that was running. That it still says running doesn't mean there is actually something still running in this case, pgAgent in that respect is a pretty straight forward unintelligent application (and I don't mean that in a negative way). If pgAgent went down for some other reason the process could indeed still be running and pgAgent has no way of knowing this. Not all application shut down just because the host application disappears, some are designed to complete the process if at all possible in the background. The only way to get around this is indeed to do what you do and reset the status manually. I guess pgAgent could be enhanced to change the status to failed of any job still marked as running when it starts up but if this is happening often enough for you to worry about it you might want to talk to the people who decide to restart servers while the server is actively busy doing stuff like running that all too important backup process. I would be much more worried about that then pgAgent failing to mark a job as failed or complete on a restart of the server. I have seen too many servers getting into serious problems because of ignorant IT people who think janking out the power cord of a machine is a perfectly reasonable way to fix windows explorer from having gotten stuck. Cheers, Bas
On Sat, Mar 16, 2013 at 11:25 PM, Bastiaan Olij <bastiaan@basenlily.me> wrote: > Hi Leo, > > On 17/03/13 9:37 AM, Paragon Corporation wrote: >> This has been an issue for a while but has only become a more common issue >> for us recently and one we can more easily predict. >> >> We are running pgAgent on windows, and whenever we restart the server, if a >> job is in the middle of a run when we restart, it gets stuck in a forever >> endless running state. >> >> To fix the issue, we have to go into the pgagent.pga_job table and get rid >> of the jobagentid that is in there for the specific job that is stuck. >> >> We are running the pgAgent 3.3.0 that is available via Stack Builder. >> > When the job starts pgAgent simply updates the status to running. When > it finishes it updates the status to either failed or completed. > When you restart in the middle of a job running that job gets > interrupted but as pgAgent also shuts down it is not able to change the > status on the job. When it restarts it has all but forgotten about that > job that was running. Actually, that's not what should happen. When an agent first runs, it records it's backend PID in pga_jobagent. Then, when it starts a job it sets the pga_job.jobagentid column to it's backend PID as well. Later, when the agent restarts it: 1) Checks the pga_jobagent table for agent PIDs that are thought to be running, and compares them with the PIDs listed in pg_stat_activity. Any that don't existing in pg_stat_activity are considered to be zombies and are recorded in a temp table. 2) It then updates the job log and job step log for any jobs/steps that were being executed by a zombie agent, setting the status to 'd' (oddly, for aborted). 3) It then resets the execution info in pga_job, so the job can execute again on it's next schedule. 4) Finally it clears the entry from the pga_jobagent table. As far as I'm aware, this has always worked well. The only caveat being that it will fail of course, if when the agent restarts there happens to be a new connection with the same backend PID as was previously used. That means that the zombie detection can conceivably fail from time to time, but that should be relatively rare. I'm not sure there's any easy way round that, though maybe there is something that could be done with the appname GUC that we have now. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company