Thread: Weird error

Weird error

From

Philip Molter

Date:

27 June 2001, 01:44:01

I have a Postgres application running right now.  The thing is
constantly doing 3-5 updates/sec and 1-2 multi-join selects/sec and
performance is actually doing all right.  Unfortunately, as the system
runs, performance degrades, which I guess has been documented, although
I still don't understand why.

To work around this, I have a cron job that runs every hour and vacuum
analyzes the three tables that are actually updated significantly.  Most
of the time, it works fine, but recently, I've been getting this error:

  NOTICE:  Child itemid in update-chain marked as unused - can't continue
  repair_frag

What causes this and how do I make it stop?  When this happens,
whatever table is affected doesn't get analyzed and the database
continues its downward resource spiral.

Thanks in advance,
Philip

* Philip Molter
* DataFoundry.net
* http://www.datafoundry.net/
* philip@datafoundry.net
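
A minimal sketch of the hourly job described above, for illustration only. The table names, the database name, and the idea of scanning psql's stderr for the repair_frag notice are assumptions, not details from the original post; the only external command used is psql with its -d and -c options.

```python
import subprocess

# Hypothetical names: the post does not say which three tables are vacuumed.
HOT_TABLES = ["table_a", "table_b", "table_c"]


def vacuum_commands(database, tables):
    """Build one psql invocation per table, each running VACUUM ANALYZE."""
    return [
        ["psql", "-d", database, "-c", f"VACUUM ANALYZE {table};"]
        for table in tables
    ]


def run_vacuums(database, tables, runner=subprocess.run):
    """Run VACUUM ANALYZE on each table and return the tables that need a retry.

    Assumption: a table is flagged when psql exits non-zero or when the
    repair_frag notice quoted in the post shows up on stderr.
    """
    needs_retry = []
    for table, cmd in zip(tables, vacuum_commands(database, tables)):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode != 0 or "repair_frag" in result.stderr:
            needs_retry.append(table)
    return needs_retry


# Example call from an hourly cron entry (database name is hypothetical):
#   leftover = run_vacuums("mydb", HOT_TABLES)
```

Returning the flagged tables rather than raising lets the caller decide whether to retry right away or wait for the next hourly run.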

Re: Weird error

From

Alex Knight

Date:

27 June 2001, 01:59:29

On Tue, 26 Jun 2001, Philip Molter wrote:

> I have a Postgres application running right now.  The thing is
> constantly doing 3-5 updates/sec and 1-2 multi-join selects/sec and
> performance is actually doing all right.  Unfortunately, as the system
> runs, performance degrades, which I guess has been documented, although
> I still don't understand why.
>
> To work around this, I have a cron job that runs every hour and vacuum
> analyzes the three tables that are actually updated significantly.  Most
> of the time, it works fine, but recently, I've been getting this error:
>
>   NOTICE:  Child itemid in update-chain marked as unused - can't continue
>   repair_frag
>
> What causes this and how do I make it stop?  When this happens,
> whatever table is affected doesn't get analyzed and the database
> continues its downward resource spiral.

I'm fairly sure you are _supposed_ to run VACUUM ANALYZE when there are no
clients connected to the database. You may have to have your cron job
temporarily suspend remote connectivity while the actions are performed.

-Knight

>
> Thanks in advance,
> Philip
>
> * Philip Molter
> * DataFoundry.net
> * http://www.datafoundry.net/
> * philip@datafoundry.net
>

Re: Weird error

From

Philip Molter

Date:

27 June 2001, 03:59:08

On Tue, Jun 26, 2001 at 07:58:55PM -0700, Alex Knight wrote:
: >   NOTICE:  Child itemid in update-chain marked as unused - can't continue
: >   repair_frag
:
: I'm fairly sure you are _supposed_ to run VACUUM ANALYZE when there are no
: clients connected to the database. You may have to have your cron job
: temporarily suspend remote connectivity while the actions are performed.

Hrmm, if that's the case, then that REALLY sucks.  I have a system
that's constantly running, forcing me to run VACUUM ANALYZE on it
because Postgres will constantly consume CPU if I don't.  However, I
need to stop or suspend my constantly running system to solve the
problem.

What causes that resource issue again?

And 90% of the time, I can run VACUUM ANALYZE just fine, without any
errors (in fact, it was running on the hour for about 10 hours before I
got the first warning).

* Philip Molter
* DataFoundry.net
* http://www.datafoundry.net/
* philip@datafoundry.net

Re: Weird error

From

Alex Pilosov

Date:

27 June 2001, 09:06:54

On Tue, 26 Jun 2001, Alex Knight wrote:

> On Tue, 26 Jun 2001, Philip Molter wrote:
>
> > I have a Postgres application running right now.  The thing is
> > constantly doing 3-5 updates/sec and 1-2 multi-join selects/sec and
> > performance is actually doing all right.  Unfortunately, as the system
> > runs, performance degrades, which I guess has been documented, although
> > I still don't understand why.
> >
> > To work around this, I have a cron job that runs every hour and vacuum
> > analyzes the three tables that are actually updated significantly.  Most
> > of the time, it works fine, but recently, I've been getting this error:
> >
> >   NOTICE:  Child itemid in update-chain marked as unused - can't continue
> >   repair_frag
> >
> > What causes this and how do I make it stop?  When this happens,
> > whatever table is affected doesn't get analyzed and the database
> > continues its downward resource spiral.
>
> I'm fairly sure you are _supposed_ to run VACUUM ANALYZE when there are no
> clients connected to the database. You may have to have your cron job
> temporarily suspend remote connectivity while the actions are performed.

This is definitely FALSE. Vacuum does not lock the database; it acquires
certain locks while it's vacuuming certain tables. I.e., your clients may
not be able to modify a table while it's being vacuumed.

Regarding the error you are getting: Which postgres version is it? See if
7.1.2 has it fixed...

Re: Weird error

From

Philip Molter

Date:

27 June 2001, 12:39:14

On Wed, Jun 27, 2001 at 06:16:35AM -0400, Alex Pilosov wrote:
: This is definitely FALSE. Vacuum does not lock the database; it acquires
: certain locks while it's vacuuming certain tables. I.e., your clients may
: not be able to modify a table while it's being vacuumed.
:
: Regarding the error you are getting: Which postgres version is it? See if
: 7.1.2 has it fixed...

I am using 7.1.2.  Plenty of memory (512MB, about 300MB used), Linux
2.4.2, SMP.  The problem is real intermittent.  It's happened twice in
the 24 hours the system has been running (this particular action
happens once an hour).

* Philip Molter
* DataFoundry.net
* http://www.datafoundry.net/
* philip@datafoundry.net

Re: Weird error

From

Tom Lane

Date:

27 June 2001, 13:59:48

Philip Molter <philip@datafoundry.net> writes:
>   NOTICE:  Child itemid in update-chain marked as unused - can't continue
>   repair_frag

What Postgres version is this?  I think we fixed some bugs in that
general vicinity in 7.1.

            regards, tom lane

Re: Weird error

From

Tom Lane

Date:

27 June 2001, 14:56:05

Philip Molter <philip@datafoundry.net> writes:
> I am using 7.1.2.

Drat.

Don't suppose you want to dig in there with a debugger when it happens?
You must be seeing some hard-to-replicate problem in VACUUM's
tuple-chain-moving logic.  That stuff is pretty hairy, and I doubt
anyone will be able to intuit what's wrong without close examination
of a failure case.

            regards, tom lane

Re: Weird error

From

Philip Molter

Date:

27 June 2001, 15:41:57

On Wed, Jun 27, 2001 at 11:30:54AM -0400, Tom Lane wrote:
: Philip Molter <philip@datafoundry.net> writes:
: > I am using 7.1.2.
:
: Don't suppose you want to dig in there with a debugger when it happens?
: You must be seeing some hard-to-replicate problem in VACUUM's
: tuple-chain-moving logic.  That stuff is pretty hairy, and I doubt
: anyone will be able to intuit what's wrong without close examination
: of a failure case.

Well, considering that we're pushing this into production and the
server was installed from Rawhide RPMs, no, not really. :)  Reproducing
the RedHat install locations for this stuff is a pain in the ass.
However, considering that it's not consistent and not continuous, I can
work around it.  In the meantime, I'll try to get some detailed logging
so that perhaps I can get a good look at what goes on during a failure
case.

* Philip Molter
* DataFoundry.net
* http://www.datafoundry.net/
* philip@datafoundry.net

Re: Weird error

From

Hiroshi Inoue

Date:

28 June 2001, 09:08:26

Tom Lane wrote:
>
> Philip Molter <philip@datafoundry.net> writes:
> > I am using 7.1.2.
>
> Drat.
>
> Don't suppose you want to dig in there with a debugger when it happens?
> You must be seeing some hard-to-replicate problem in VACUUM's
> tuple-chain-moving logic.

I had a pretty reproducible example 2 years ago.
IIRC the situation was like

When vacuum starts, xids 10001 and 10004 are alive.
If vacuum finds an update chain (10002 -> 10000 -> 10003),
it removes the tuple (10000) because no xid <= 10000 is
alive. The chain is then broken.

The problem seems to lie in scan_heap.
How could vacuum know that the tuple (10000) must be kept
alive after vacuum?

regards,
Hiroshi Inoue
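
The scenario above can be walked through with a toy model. This is only a sketch of the reasoning in this message, not PostgreSQL's actual vacuum code: the function name is hypothetical, and the removal rule (drop any tuple whose xid is older than every live xid) is a simplification taken from the description above.

```python
def vacuum_chain(chain, live_xids):
    """Toy model of the rule described above.

    chain     -- tuple xids in update-chain order
    live_xids -- xids of transactions alive when vacuum starts
    A tuple is removed when its xid is older than every live xid.
    """
    oldest_live = min(live_xids)
    kept = [xid for xid in chain if xid >= oldest_live]
    removed = [xid for xid in chain if xid < oldest_live]
    # A link is broken when a removed tuple sat between two kept tuples.
    broken_links = [
        (chain[i - 1], chain[i + 1])
        for i in range(1, len(chain) - 1)
        if chain[i] in removed
        and chain[i - 1] in kept
        and chain[i + 1] in kept
    ]
    return kept, removed, broken_links


kept, removed, broken_links = vacuum_chain([10002, 10000, 10003], {10001, 10004})
print(kept)          # [10002, 10003]
print(removed)       # [10000]
print(broken_links)  # [(10002, 10003)]
```

In this model the middle tuple goes away even though both of its neighbours survive, which is the broken chain described above.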

Re: Weird error

From

Joseph Shraibman

Date:

28 June 2001, 17:53:09

Alex Pilosov wrote:
>

> This is definitely FALSE. Vacuum does not lock the database; it acquires
> certain locks while it's vacuuming certain tables. I.e., your clients may
> not be able to modify a table while it's being vacuumed.
>
I've had a vacuum deadlock my database.  When I killed the vacuum client
(^C from the command line) my program  continued.


--
Joseph Shraibman
jks@selectacast.net
Increase signal to noise ratio.  http://www.targabot.com

Re: Weird error

From

Alex Pilosov

Date:

28 June 2001, 18:18:29

On Thu, 28 Jun 2001, Joseph Shraibman wrote:

> Alex Pilosov wrote:
> >
>
> > This is definitely FALSE. Vacuum does not lock the database; it acquires
> > certain locks while it's vacuuming certain tables. I.e., your clients may
> > not be able to modify a table while it's being vacuumed.
> >
> I've had a vacuum deadlock my database.  When I killed the vacuum client
> (^C from the command line) my program  continued.

Are you sure you mean 'deadlock'? Deadlock is when neither client nor
vacuum can proceed. What most likely happened is vacuum locking the table
until it's done, and that is normal behavior.

-alex
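
The difference between an ordinary lock wait and a deadlock can be sketched with a toy wait-for graph. The session names and the function are hypothetical, and this is not how PostgreSQL's lock manager works internally; it only illustrates the definition given above, assuming each session waits on at most one other session.

```python
def find_deadlock(waits_for):
    """Return the sessions forming a wait cycle, or None if there is no cycle.

    waits_for maps a session name to the session it is blocked on.
    """
    for start in waits_for:
        path = [start]
        seen = {start}
        current = waits_for.get(start)
        while current is not None:
            if current == start:
                return path  # followed the waits back to where we began
            if current in seen:
                break  # a cycle that excludes start; found from its own members
            path.append(current)
            seen.add(current)
            current = waits_for.get(current)
    return None


# Ordinary lock wait: the client waits for vacuum, vacuum waits for nobody.
print(find_deadlock({"client": "vacuum"}))                      # None

# Deadlock: each side waits for the other, so neither can proceed.
print(find_deadlock({"client": "vacuum", "vacuum": "client"}))  # ['client', 'vacuum']
```

In the first case the client resumes as soon as vacuum finishes; in the second nothing moves until one side is cancelled.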

Re: Weird error

From

Joseph Shraibman

Date:

28 June 2001, 20:58:08

No, it was deadlocked.  Neither vacuum nor my program was doing
anything.

Alex Pilosov wrote:
>
> On Thu, 28 Jun 2001, Joseph Shraibman wrote:
>
> > Alex Pilosov wrote:
> > >
> >
> > > This is definitely FALSE. Vacuum does not lock the database; it acquires
> > > certain locks while it's vacuuming certain tables. I.e., your clients may
> > > not be able to modify a table while it's being vacuumed.
> > >
> > I've had a vacuum deadlock my database.  When I killed the vacuum client
> > (^C from the command line) my program  continued.
>
> Are you sure you mean 'deadlock'? Deadlock is when neither client nor
> vacuum can proceed. What most likely happened is vacuum locking the table
> until it's done, and that is normal behavior.
>
> -alex
>

--
Joseph Shraibman
jks@selectacast.net
Increase signal to noise ratio.  http://www.targabot.com