Thread: pgagent hangs forever in "r" state
All, On the same server with the SQL errors (sequences are cleaned up now), we're having pgagent start all jobs in the "r" state, log the first step in each job in the "r" state, and then hang forever (or, at least for three days). At this point, 8 different jobs are in "r" state. In pg_stat_activity, pgagent is connected, but hasn't run anything except the query to get the next step. We're debugging away, but would appreciate advice on other ways to figure out what pgagent is doing. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 07/06/2015 10:26 AM, Josh Berkus wrote: > All, > > On the same server with the SQL errors (sequences are cleaned up now), > we're having pgagent start all jobs in the "r" state, log the first step > in each job in the "r" state, and then hang forever (or, at least for > three days). At this point, 8 different jobs are in "r" state. ... as an update, what it appears to be doing is executing the first step of each job, successfully. Then, for some reason, it attempts to destroy that jobthread after just the first step. I say "attempts" because that child backend is still running, and none of pga_job, pga_joblog, or pga_jobsteplog get updated. I've looked in jobs.cpp and pgAgent.cpp, and I can't figure out what would cause it to stop after the first step. All of the steps I've looked at are regular queries, and can be successfully executed by hand. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Jul 7, 2015 at 12:02 AM, Josh Berkus <josh@agliodbs.com> wrote: > On 07/06/2015 10:26 AM, Josh Berkus wrote: >> All, >> >> On the same server with the SQL errors (sequences are cleaned up now), >> we're having pgagent start all jobs in the "r" state, log the first step >> in each job in the "r" state, and then hang forever (or, at least for >> three days). At this point, 8 different jobs are in "r" state. > > ... as an update, what it appears to be doing is executing the first > step of each job, successfully. Then, for some reason, it attempts to > destroy that jobthread after just the first step. I say "attempts" > because that child backend is still running, and none of pga_job, > pga_joblog, or pga_jobsteplog get updated. > > I've looked in jobs.cpp and pgAgent.cpp, and I can't figure out what > would cause it to stop after the first step. All of the steps I've > looked at are regular queries, and can be successfully executed by hand. Are you able to get a stacktrace from a running instance that's hung? Also, does setting the logging level to debug reveal anything useful in the filesystem logs? (pgagent -l2 ....) -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 07/10/2015 07:16 AM, Dave Page wrote: > Are you able to get a stacktrace from a running instance that's hung? Yes, but it's not doing anything, it's just polling for the next job. > Also, does setting the logging level to debug reveal anything useful > in the filesystem logs? (pgagent -l2 ....) Nope. And I scanned the postgresql log for SQL errors, and there's nothing. What appears to be happening is: 1. pgagent picks up job 2. pgagent finds first step in that job 3. pgagent executes first step (successfully, according to the postgresql log, and according to a fake "touch file" job I added). 4. pgagent abandons the job without updating the status of either the job or the step. 5. pgagent starts polling for new jobs. 6. Repeat 1-5 until all jobs are in "r" state with an assigned agent, but not running. Note that this particular user has now abandoned pgagent and is switching to centrally managed cron, so my ability to get further troubleshooting information will be limited. According to apt-get, this is the latest pgagent package. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Sanket, can you reproduce this? On Fri, Jul 10, 2015 at 5:41 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 07/10/2015 07:16 AM, Dave Page wrote: >> Are you able to get a stacktrace from a running instance that's hung? > > Yes, but it's not doing anything, it's just polling for the next job. > >> Also, does setting the logging level to debug reveal anything useful >> in the filesystem logs? (pgagent -l2 ....) > > Nope. And I scanned the postgresql log for SQL errors, and there's nothing. > > What appears to be happening is: > > 1. pgagent picks up job > > 2. pgagent finds first step in that job > > 3. pgagent executes first step (successfully, according to the > postgresql log, and according to a fake "touch file" job I added). > > 4. pgagent abandons the job without updating the status of either the > job or the step. > > 5. pgagent starts polling for new jobs. > > 6. Repeat 1-5 until all jobs are in "r" state with an assigned agent, > but not running. > > Note that this particular user has now abandoned pgagent and is > switching to centrally managed cron, so my ability to get further > troubleshooting information will be limited. > > According to apt-get, this is the latest pgagent package. > > -- > Josh Berkus > PostgreSQL Experts Inc. > http://pgexperts.com -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Sure Dave,
I will try and let you know.
Regards,
Sanket Mehta
Sr Software engineer
Enterprisedb
On Mon, Jul 13, 2015 at 7:44 PM, Dave Page <dpage@pgadmin.org> wrote:
Sanket, can you reproduce this?
On Fri, Jul 10, 2015 at 5:41 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/10/2015 07:16 AM, Dave Page wrote:
>> Are you able to get a stacktrace from a running instance that's hung?
>
> Yes, but it's not doing anything, it's just polling for the next job.
>
>> Also, does setting the logging level to debug reveal anything useful
>> in the filesystem logs? (pgagent -l2 ....)
>
> Nope. And I scanned the postgresql log for SQL errors, and there's nothing.
>
> What appears to be happening is:
>
> 1. pgagent picks up job
>
> 2. pgagent finds first step in that job
>
> 3. pgagent executes first step (successfully, according to the
> postgresql log, and according to a fake "touch file" job I added).
>
> 4. pgagent abandons the job without updating the status of either the
> job or the step.
>
> 5. pgagent starts polling for new jobs.
>
> 6. Repeat 1-5 until all jobs are in "r" state with an assigned agent,
> but not running.
>
> Note that this particular user has now abandoned pgagent and is
> switching to centrally managed cron, so my ability to get further
> troubleshooting information will be limited.
>
> According to apt-get, this is the latest pgagent package.
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
--Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Dave,
I have tried to reproduce the same in my system by resetting the pgagent sequences to 1.2015-07-20 15:35:14 IST ERROR: duplicate key value violates unique constraint "pga_joblog_pkey"
2015-07-20 15:35:14 IST DETAIL: Key (jlgid)=(3) already exists.
2015-07-20 15:35:14 IST STATEMENT: INSERT INTO pgagent.pga_joblog(jlgid, jlgjobid, jlgstatus) VALUES (3, 1, 'r')
So my analysis is:
as sequences are getting reset before the job starts, it tries to insert jlgid directly from nextvalue of its sequence which is already present in the table. same thing happens for jobsteplog and joblog tables also.
as sequences are getting reset before the job starts, it tries to insert jlgid directly from nextvalue of its sequence which is already present in the table. same thing happens for jobsteplog and joblog tables also.
Suggestion to resolve the issue:
get the max value of ID from table before insert query and run the sequence till we get the max(ID)+1 and then we execute the insert statement with this new value as ID.
Please provide your suggestion for the same.
Regards,
Sanket Mehta
Sr Software engineer
Enterprisedb
On Tue, Jul 14, 2015 at 11:07 AM, Sanket Mehta <sanket.mehta@enterprisedb.com> wrote:
Sure Dave,I will try and let you know.Regards,Sanket MehtaSr Software engineerEnterprisedbOn Mon, Jul 13, 2015 at 7:44 PM, Dave Page <dpage@pgadmin.org> wrote:Sanket, can you reproduce this?
On Fri, Jul 10, 2015 at 5:41 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/10/2015 07:16 AM, Dave Page wrote:
>> Are you able to get a stacktrace from a running instance that's hung?
>
> Yes, but it's not doing anything, it's just polling for the next job.
>
>> Also, does setting the logging level to debug reveal anything useful
>> in the filesystem logs? (pgagent -l2 ....)
>
> Nope. And I scanned the postgresql log for SQL errors, and there's nothing.
>
> What appears to be happening is:
>
> 1. pgagent picks up job
>
> 2. pgagent finds first step in that job
>
> 3. pgagent executes first step (successfully, according to the
> postgresql log, and according to a fake "touch file" job I added).
>
> 4. pgagent abandons the job without updating the status of either the
> job or the step.
>
> 5. pgagent starts polling for new jobs.
>
> 6. Repeat 1-5 until all jobs are in "r" state with an assigned agent,
> but not running.
>
> Note that this particular user has now abandoned pgagent and is
> switching to centrally managed cron, so my ability to get further
> troubleshooting information will be limited.
>
> According to apt-get, this is the latest pgagent package.
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
--Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi On Mon, Jul 20, 2015 at 12:49 PM, Sanket Mehta <sanket.mehta@enterprisedb.com> wrote: > Hi Dave, > > I have tried to reproduce the same in my system by resetting the pgagent > sequences to 1. > > Below is the results I have came across: > > the job is not getting executed and its current status is "i" (no > steps found ) although I have specified 2 steps for that job. > > Below is the log entry in postgreSQL logs: > > 2015-07-20 15:35:14 IST ERROR: duplicate key value violates unique > constraint "pga_joblog_pkey" > 2015-07-20 15:35:14 IST DETAIL: Key (jlgid)=(3) already exists. > 2015-07-20 15:35:14 IST STATEMENT: INSERT INTO > pgagent.pga_joblog(jlgid, jlgjobid, jlgstatus) VALUES (3, 1, 'r') OK - I'm not sure that's related to the issue in this thread though. > So my analysis is: > > as sequences are getting reset before the job starts, it tries to > insert jlgid directly from nextvalue of its sequence which is already > present in the table. same thing happens for jobsteplog and joblog tables > also. > > Suggestion to resolve the issue: > > get the max value of ID from table before insert query and run the > sequence till we get the max(ID)+1 and then we execute the insert statement > with this new value as ID. We're not doing that - it'll be a performance killer. The user needs to not muck about with their sequences. The real questions are: 1) Why doesn't the insert error appear in the pgAgent log? 2) Can you reproduce Josh's issue where the first step in a job runs, but the second doesn't (without touching the sequences)? -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 07/20/2015 08:01 AM, Dave Page wrote: >> Suggestion to resolve the issue: >> >> get the max value of ID from table before insert query and run the >> sequence till we get the max(ID)+1 and then we execute the insert statement >> with this new value as ID. > > We're not doing that - it'll be a performance killer. The user needs > to not muck about with their sequences. Yeah, pgagent doesn't need to cope with sequence failures. That's the DBA's job to fix. > > The real questions are: > > 1) Why doesn't the insert error appear in the pgAgent log? > > 2) Can you reproduce Josh's issue where the first step in a job runs, > but the second doesn't (without touching the sequences)? Particularly, *after* the sequence errors are fixed, do the jobs now run normally? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Hi Dave,
As per our discussion, I have tried to reproduce the issue, but it seems jobs are running fine without any problem.
Hence the issue is not reproducible on my system.
Regards,
Sanket Mehta
Sr Software engineer
Enterprisedb
On Tue, Jul 21, 2015 at 9:52 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 07/20/2015 08:01 AM, Dave Page wrote:
>> Suggestion to resolve the issue:
>>
>> get the max value of ID from table before insert query and run the
>> sequence till we get the max(ID)+1 and then we execute the insert statement
>> with this new value as ID.
>
> We're not doing that - it'll be a performance killer. The user needs
> to not muck about with their sequences.
Yeah, pgagent doesn't need to cope with sequence failures. That's the
DBA's job to fix.
>
> The real questions are:
>
> 1) Why doesn't the insert error appear in the pgAgent log?
>
> 2) Can you reproduce Josh's issue where the first step in a job runs,
> but the second doesn't (without touching the sequences)?
Particularly, *after* the sequence errors are fixed, do the jobs now run
normally?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com--
Sent via pgadmin-support mailing list (pgadmin-support@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgadmin-support
On Wed, Jul 22, 2015 at 11:56 AM, Sanket Mehta <sanket.mehta@enterprisedb.com> wrote:
Hi Dave,As per our discussion, I have tried to reproduce the issue, but it seems jobs are running fine without any problem.Hence the issue is not reproducible on my system.
Josh,
As we're unable to reproduce the issue at our end, can you please generate the logs after setting the log level to debug2 (i.e. -l2) for analyzing the issue?
Regards,Sanket MehtaSr Software engineerEnterprisedbOn Tue, Jul 21, 2015 at 9:52 AM, Josh Berkus <josh@agliodbs.com> wrote:On 07/20/2015 08:01 AM, Dave Page wrote:
>> Suggestion to resolve the issue:
>>
>> get the max value of ID from table before insert query and run the
>> sequence till we get the max(ID)+1 and then we execute the insert statement
>> with this new value as ID.
>
> We're not doing that - it'll be a performance killer. The user needs
> to not muck about with their sequences.
Yeah, pgagent doesn't need to cope with sequence failures. That's the
DBA's job to fix.
>
> The real questions are:
>
> 1) Why doesn't the insert error appear in the pgAgent log?
>
> 2) Can you reproduce Josh's issue where the first step in a job runs,
> but the second doesn't (without touching the sequences)?
Particularly, *after* the sequence errors are fixed, do the jobs now run
normally?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com--
Sent via pgadmin-support mailing list (pgadmin-support@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgadmin-support
On 07/21/2015 11:29 PM, Ashesh Vashi wrote: > On Wed, Jul 22, 2015 at 11:56 AM, Sanket Mehta > <sanket.mehta@enterprisedb.com <mailto:sanket.mehta@enterprisedb.com>> > wrote: > > Hi Dave, > > As per our discussion, I have tried to reproduce the issue, but it > seems jobs are running fine without any problem. > Hence the issue is not reproducible on my system. > > Josh, > > As we're unable to reproduce the issue at our end, can you please > generate the logs after setting the log level to debug2 (i.e. -l2) for > analyzing the issue? As I said before, the logs show nothing of interest. There are no errors. just polling for new jobs. Anyway, the failing system has now been shut down, so we'll need to chalk this particular failure up to "wierd stuff". -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Just chiming in quickly, We experienced exactly the same behavior with pg 9.4. It was consistent across the 18 database servers that we upgraded from 9.3 to 9.4. A job would complete successfully, and then just remain in 'running' state. We tried purging and re-installing pg and pgadmin/agent cleanly and still had the same results. Eventually in frustration we just copied over the pgagent binary (nothing else) from a server running 9.3 and amazingly stuff started working. All I can really contribute to this issue discussion is our "path to the problem". Initial OS: Ubuntu 12 LTS x64 running pg 9.1 from the pgdg repos We were running 9.3 without any issues on those servers. All pgagent jobs would complete and run as expected We upgraded the OS to Ubuntu 14.04 LTS We then upgraded pg to 9.4 Post-upgrade, pgagent (which was also upgraded in the process) started exhibiting the behavior as described in the thread Leonard On Wed, Jul 22, 2015 at 12:33 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 07/21/2015 11:29 PM, Ashesh Vashi wrote: >> On Wed, Jul 22, 2015 at 11:56 AM, Sanket Mehta >> <sanket.mehta@enterprisedb.com <mailto:sanket.mehta@enterprisedb.com>> >> wrote: >> >> Hi Dave, >> >> As per our discussion, I have tried to reproduce the issue, but it >> seems jobs are running fine without any problem. >> Hence the issue is not reproducible on my system. >> >> Josh, >> >> As we're unable to reproduce the issue at our end, can you please >> generate the logs after setting the log level to debug2 (i.e. -l2) for >> analyzing the issue? > > As I said before, the logs show nothing of interest. There are no > errors. just polling for new jobs. > > Anyway, the failing system has now been shut down, so we'll need to > chalk this particular failure up to "wierd stuff". > > > -- > Josh Berkus > PostgreSQL Experts Inc. > http://pgexperts.com > > > -- > Sent via pgadmin-support mailing list (pgadmin-support@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgadmin-support