Thread: Re: [HACKERS] ERROR: btree scan list trashed ??

Re: [HACKERS] ERROR: btree scan list trashed ??

From
Tom Lane
Date:
Adriaan Joubert <a.joubert@albourne.com> writes:
>> What Postgres version are you using, and on what platform?  If it's
>> anything older than 6.5.1, an upgrade would probably be a good idea.

> Sorry, I should have mentioned that. I'm using 6.5.0 on DEC Alpha
> (Digital Unix, compiled with cc). 

Alpha, eh?  We have some known porting problems on 64-bit architectures,
and I wonder whether this is one of them.  Going to be hard to nail that
down until we can reproduce the error, however.

After some digging around in backend/access/nbtree/nbtscan.c, which is
producing the error, I notice that the routine in question is searching
a list that does not get cleared properly at transaction abort.  It's
not clear that that's the cause of the error message, though.  What
I suggest at this point is that you pay more attention to what happens
just before the transaction in which you get the "btree scan list
trashed" message.  In particular, are there any commands that abort
with errors a little bit earlier in the same backend?  It might take
the combination of an error in a btree-index-using command and then
another btree index access to provoke the "trashed" symptom.
        regards, tom lane


Re: [HACKERS] ERROR: btree scan list trashed ??

From
Adriaan Joubert
Date:
Tom Lane wrote:
> 
> Adriaan Joubert <a.joubert@albourne.com> writes:
> >> What Postgres version are you using, and on what platform?  If it's
> >> anything older than 6.5.1, an upgrade would probably be a good idea.
> 
> > Sorry, I should have mentioned that. I'm using 6.5.0 on DEC Alpha
> > (Digital Unix, compiled with cc).
> 
> Alpha, eh?  We have some known porting problems on 64-bit architectures,
> and I wonder whether this is one of them.  Going to be hard to nail that
> down until we can reproduce the error, however.
> 
> After some digging around in backend/access/nbtree/nbtscan.c, which is
> producing the error, I notice that the routine in question is searching
> a list that does not get cleared properly at transaction abort.  It's
> not clear that that's the cause of the error message, though.  What
> I suggest at this point is that you pay more attention to what happens
> just before the transaction in which you get the "btree scan list
> trashed" message.  In particular, are there any commands that abort
> with errors a little bit earlier in the same backend?  It might take
> the combination of an error in a btree-index-using command and then
> another btree index access to provoke the "trashed" symptom.


That may be it. I have some PL routines that raise an exception if an
operation could lead to an inconsistency in my database tables. This is
not really an error, but I do want to abort the transaction in that
case, so I raise an exception. I'll continue trying to nail it down and
then I can look at what happens with the debugger.

BTW, I've installed 6.5.1 and still have the same problems. Vacuuming
hung up everything, and I had to shut the whole thing down and restart
it to get it working again. Dropping the indices and rebuilding them all
fixed the problem.

How difficult is it to clear the list at transaction abort? Is this
something I could patch and try out?

Thanks a lot for looking at this, much appreciated!

Adriaan


Re: [HACKERS] ERROR: btree scan list trashed ??

From
Tom Lane
Date:
Adriaan Joubert <a.joubert@albourne.com> writes:
>> After some digging around in backend/access/nbtree/nbtscan.c, which is
>> producing the error, I notice that the routine in question is searching
>> a list that does not get cleared properly at transaction abort.  It's
>> not clear that that's the cause of the error message, though.

> BTW, I've installed 6.5.1 and still have the same problems.

No surprise, really.

> Vacuuming hung up everything, and I had to shut the whole thing down
> and restart it to get it working again. Dropping the indices and
> rebuilding them all fixed the problem.

Hmm, that suggests that your indexes are actually getting corrupted.

> How difficult is it to clear the list at transaction abort? Is this
> something I could patch and try out?

The BTScans variable in nbtscan.c needs to be reset to NULL during
xact abort.  I don't see how this would *directly* cause the
observed symptom, but failing to do it should lead to misbehavior in
_bt_adjscans() during later transactions, so it might be related
somehow.  If you want to patch it, make a subroutine that clears the
variable (no need to free the list; since it's palloc'd it'll go
away anyway) and call it from transaction cleanup in
backend/access/transam/xact.c.
        regards, tom lane
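
The cleanup described above can be sketched as a small standalone model.
This is not the PostgreSQL source: only the variable name BTScans comes
from the thread, and every type and function below (ModelScan, pool_alloc,
pool_reset, model_regscan, model_dropscan, model_scan_count,
model_clear_scans, model_abort_transaction) is a hypothetical name invented
for illustration. A fixed pool stands in for palloc'd memory that goes away
at abort, which is why the clearing routine only resets the list head and
frees nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a scan descriptor; not the PostgreSQL type. */
typedef struct ModelScan { int id; } ModelScan;

/* One list cell per registered scan. */
typedef struct ModelScanListCell {
    ModelScan *scan;
    struct ModelScanListCell *next;
} ModelScanListCell;

/* Plays the role of the BTScans variable named in the thread. */
static ModelScanListCell *BTScans = NULL;

/* Tiny pool that imitates per-transaction memory (assumption of this model). */
#define POOL_CELLS 16
static ModelScanListCell pool[POOL_CELLS];
static size_t pool_used = 0;

ModelScanListCell *pool_alloc(void)
{
    return (pool_used < POOL_CELLS) ? &pool[pool_used++] : NULL;
}

/* Imitates transaction memory being released wholesale at abort. */
void pool_reset(void)
{
    pool_used = 0;
}

/* Register a scan at the head of the list; returns 0 on success. */
int model_regscan(ModelScan *scan)
{
    ModelScanListCell *cell = pool_alloc();
    if (cell == NULL)
        return -1;
    cell->scan = scan;
    cell->next = BTScans;
    BTScans = cell;
    return 0;
}

/* Unlink a scan; returns -1 when it is not in the list, which is the
 * situation the "btree scan list trashed" message reports. */
int model_dropscan(ModelScan *scan)
{
    ModelScanListCell **link = &BTScans;
    while (*link != NULL && (*link)->scan != scan)
        link = &(*link)->next;
    if (*link == NULL)
        return -1;
    *link = (*link)->next;
    return 0;
}

/* Count the scans currently registered. */
size_t model_scan_count(void)
{
    size_t n = 0;
    for (ModelScanListCell *c = BTScans; c != NULL; c = c->next)
        n++;
    return n;
}

/* The suggested subroutine: forget the list, free nothing. */
void model_clear_scans(void)
{
    BTScans = NULL;
}

/* Stand-in for transaction cleanup: memory goes away, then the list
 * head is reset so that it no longer points into released memory. */
void model_abort_transaction(void)
{
    pool_reset();
    model_clear_scans();
}
```

In this model, calling pool_reset() without model_clear_scans() leaves
BTScans pointing at a cell that the next registration recycles, so a later
transaction walks a list that no longer describes its own scans. Whether
the same mechanism explains the reported error is left open in the thread.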


Re: [HACKERS] ERROR: btree scan list trashed ??

From
Vadim Mikheev
Date:
Tom Lane wrote:
> 
> The BTScans variable in nbtscan.c needs to be reset to NULL during
> xact abort.  I don't see how this would *directly* cause the
> observed symptom, but failing to do it should lead to misbehavior in
> _bt_adjscans() during later transactions, so it might be related
> somehow.  If you want to patch it, make a subroutine that clears the
> variable (no need to free the list; since it's palloc'd it'll go
> away anyway) and call it from transaction cleanup in
> backend/access/transam/xact.c.

This should be fixed in CVS too.

Vadim


Re: [HACKERS] ERROR: btree scan list trashed ??

From
The Hermit Hacker
Date:
On Thu, 5 Aug 1999, Vadim Mikheev wrote:

> Tom Lane wrote:
> > 
> > The BTScans variable in nbtscan.c needs to be reset to NULL during
> > xact abort.  I don't see how this would *directly* cause the
> > observed symptom, but failing to do it should lead to misbehavior in
> > _bt_adjscans() during later transactions, so it might be related
> > somehow.  If you want to patch it, make a subroutine that clears the
> > variable (no need to free the list; since it's palloc'd it'll go
> > away anyway) and call it from transaction cleanup in
> > backend/access/transam/xact.c.
> 
> This should be fixed in CVS too.

Is this something that can be easily back-patched for v6.5.2?

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [HACKERS] ERROR: btree scan list trashed ??

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> On Thu, 5 Aug 1999, Vadim Mikheev wrote:
>> Tom Lane wrote:
>>>> The BTScans variable in nbtscan.c needs to be reset to NULL during
>>>> xact abort.
>> 
>> This should be fixed in CVS too.

Yes, absolutely.

> Is this something that can be easily back-patched for v6.5.2?

I will patch this in both current and REL6_5.  But, although this
is clearly a bug, I am not at all convinced that it explains
Adriaan's problem.  I think more creepie-crawlies lurk nearby :-(
        regards, tom lane


Re: [HACKERS] ERROR: btree scan list trashed ??

From
Adriaan Joubert
Date:
> I will patch this in both current and REL6_5.  But, although this
> is clearly a bug, I am not at all convinced that it explains
> Adriaan's problem.  I think more creepie-crawlies lurk nearby :-(


Hmm, I made the changes and I only got three errors out of the system
today. So it is not fixed, although perhaps improved (or was I just
lucky?). I've been locking tables more restrictively, so this may have
helped as well. I definitely think this has something to do with
concurrent accesses to the same index. It always seems to start
happening as the tables start getting updates more rapidly.

Another thought: an index on a table that gets updated sometimes through
a PL trigger is an index on a user-defined type (the bitmask type I
posted a while ago). Could this have something to do with a btree index
on a user-defined type? I'll drop that index and see whether it makes a
difference. All indexes on other tables that are touched are int4.

Thanks for all the help, Tom!

Adriaan


Re: [HACKERS] ERROR: btree scan list trashed ??

From
Adriaan Joubert
Date:
OK, I've dropped my user-defined type index and it hasn't made any
difference. I've had quite a few of the following again:

UPDATE TasksIds SET qstate=8::bit1 where task=358 and id=5654
ERROR:  btree scan list trashed; can't find 0x1401744a0

I've got a lot of logging switched on, and these do not seem to be
preceded by errors. Since patching it the system seems to recover ok, so
I'm wondering whether this could be a caching issue. I think I will just
lock all tables in their entirety now, and see whether that fixes it
(there goes my MVCC performance boost 8-(). I still think it has
something to do with concurrent access to the indices.

If anybody has any more suggestions of what I could try, please let me
know.

Cheers,

Adriaan


Re: [HACKERS] ERROR: btree scan list trashed ??

From
Tom Lane
Date:
Adriaan Joubert <a.joubert@albourne.com> writes:
> OK, I've dropped my user-defined type index and it hasn't made any
> difference. I've had quite a few of the following again:

> UPDATE TasksIds SET qstate=8::bit1 where task=358 and id=5654
> ERROR:  btree scan list trashed; can't find 0x1401744a0

> I've got a lot of logging switched on, and these do not seem to be
> preceded by errors. Since patching it the system seems to recover ok, so
> I'm wondering whether this could be a caching issue. I think I will just
> lock all tables in their entirety now, and see whether that fixes it
> (there goes my MVCC performance boost 8-(). I still think it has
> something to do with concurrent access to the indices.

Let us know whether going to full locking makes any difference.

I am currently wondering whether this is a porting issue (64-bit vs
32-bit pointers).  If it only happens on 64-bit platforms, that would
explain why we haven't seen many similar reports.  Unfortunately,
that theory provides little useful guidance about where to look :-(
        regards, tom lane
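
To illustrate the kind of defect a 64-bit vs 32-bit pointer theory has in
mind, here is a minimal sketch of what happens when an address passes
through a 32-bit integer. It is only an illustration of that class of bug:
the thread does not show that any such conversion exists in the btree code.
The function names truncate_through_32bit and survives_32bit_round_trip are
hypothetical, and plain uint64_t values are used instead of real pointers
so the behavior is the same on every platform.

```c
#include <assert.h>
#include <stdint.h>

/* What a 64-bit address looks like after being stored in a 32-bit integer:
 * the conversion to uint32_t keeps only the low 32 bits. */
uint64_t truncate_through_32bit(uint64_t address)
{
    uint32_t narrowed = (uint32_t) address;
    return (uint64_t) narrowed;
}

/* Returns 1 when the address is unchanged by the round trip, 0 otherwise. */
int survives_32bit_round_trip(uint64_t address)
{
    return truncate_through_32bit(address) == address;
}
```

The address in the reported message, 0x1401744a0, needs more than 32 bits,
so truncate_through_32bit(0x1401744a0) yields 0x401744a0 and a lookup keyed
on the narrowed value would not match the original. Addresses below 4 GB
survive the round trip, which is consistent with such a bug going unnoticed
on 32-bit platforms.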