Thread: Strong feeling of something ugly lurking deeply within 7.0 ;-)

From
Christof Petig
Date:
The severity of this bug heavily depends on your lack of buggy programs.

Short description:
Long-standing open transactions combined with high-traffic updates and
regular vacuums eventually corrupt memory.

Long description:
Due to a design flaw in our ecpg programs (I don't recommend
designing for autocommit off!), some transactions stayed open for
several days. A process data collection system generates a lot of
status change updates (3MB a day) to about 110 rows in a table at the
same time. After every 1024 updates I vacuum the high-traffic table,
which should shrink it to 16kB. First I noticed that vacuum did not
free old tuples. This put me on the track of the real cause.
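
The vacuum behaviour described above can be illustrated with a toy
model. This is a minimal sketch in Python, not PostgreSQL's actual
visibility code: it assumes a dead row version becomes removable only
once the transaction that superseded it is older than every transaction
still open. The names RowVersion, apply_update, vacuum and simulate are
hypothetical; the figures (110 rows, 1024 updates) come from the report.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowVersion:
    key: int                    # which logical row this version belongs to
    xmin: int                   # id of the transaction that created it
    xmax: Optional[int] = None  # id of the transaction that superseded it


def apply_update(versions: List[RowVersion], key: int, xid: int) -> None:
    """Mark the live version of `key` as dead and append its successor."""
    for v in versions:
        if v.key == key and v.xmax is None:
            v.xmax = xid
    versions.append(RowVersion(key=key, xmin=xid))


def vacuum(versions: List[RowVersion], oldest_open_xid: Optional[int]) -> int:
    """Remove dead versions that no open transaction could still see.

    Simplifying assumption: a dead version is removable only if its xmax
    is older than the oldest open transaction (None = nothing open).
    Returns the number of versions removed.
    """
    def removable(v: RowVersion) -> bool:
        if v.xmax is None:
            return False  # still the live version
        return oldest_open_xid is None or v.xmax < oldest_open_xid

    kept = [v for v in versions if not removable(v)]
    removed = len(versions) - len(kept)
    versions[:] = kept
    return removed


def simulate(oldest_open_xid: Optional[int]):
    """110 rows, 1024 updates, then one vacuum; returns (removed, remaining)."""
    versions = [RowVersion(key=k, xmin=1) for k in range(110)]
    for i in range(1024):
        apply_update(versions, key=i % 110, xid=3 + i)
    removed = vacuum(versions, oldest_open_xid)
    return removed, len(versions)
```

In this model, simulate(2) (a transaction open since before the first
update) removes nothing and leaves 1134 versions in the table, while
simulate(None) reclaims all 1024 dead versions and leaves the 110 live
rows.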

For the past three weeks (with more buggy long-standing transactions) I
have seen one major crash of the program system per week. For months I
have seen some strange NOTICEs which went away after another vacuum. And
this morning I found a 'possible memory corruption, killing other
backends' message.

The situation got better and better during the 7.0 development cycle (I
started with a pre-beta version this January and reported some
concurrent vacuum oddities at that time). And it got worse the more
interactive programs we added.
But until now I had not spotted the special ingredient that causes the
pain: long-standing transactions.

It's not very bad. It seems to happen only under rare conditions. Until
this week I thought of it as a minor oddity, a temporary nuisance.

And: this is the current stable CVS tree, running on a 233MHz Pentium2,
Linux 2.2.14(?)

Sample Code:
    update bn_actual set meter=meter+1 where machine= ?; -- repeat every second
combined with
    begin transaction; -- hold open
    select something;
and
    vacuum analyze; -- once a day
and
    vacuum bn_actual; -- every 1024 updates

and some others.

PS: Of course I'm currently fixing the long transactions problem. I'll
tell you once the system has run for 4 weeks again without any strange
occurrence.
PPS: Yes, I'm following the hackers list.
P3S: No, I don't believe in a hardware bug.

Re: Strong feeling of something ugly lurking deeply within 7.0 ;-)

From
Tom Lane
Date:
I think the cause here is probably a known problem.  The vacuums in
parallel with the long-running transactions would result in periodic
sinval message queue overflows, with resultant flushes of syscache
entries in all active backends.  We know that there are places where
syscache entry pointers are used longer than is safe --- i.e., it's
possible for an entry to get flushed while some routine still has
a pointer to it.  Finding all these places, or better, redesigning the
syscache mechanism to eliminate the issue completely, has been on the
todo list for a while.
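
The hazard described above can be sketched with a toy cache. This is an
illustrative Python model only, not the PostgreSQL syscache or sinval
code: ToyCache, InvalidationQueue, unsafe_use and safe_use are
hypothetical names, and clearing an entry's value to None stands in for
the entry's memory being freed. The assumption is that a bounded message
queue which overflows forces a full flush of every registered cache.

```python
class CacheEntry:
    def __init__(self, name, value):
        self.name = name
        self.value = value


class ToyCache:
    """A lookup cache whose entries can be flushed at any time."""

    def __init__(self, loader):
        self._loader = loader
        self._entries = {}

    def lookup(self, name):
        entry = self._entries.get(name)
        if entry is None:
            entry = CacheEntry(name, self._loader(name))
            self._entries[name] = entry
        return entry

    def flush_all(self):
        for entry in self._entries.values():
            entry.value = None  # stands in for freed memory
        self._entries.clear()


class InvalidationQueue:
    """Bounded queue; an overflow resets it and flushes every cache."""

    def __init__(self, capacity, caches):
        self._capacity = capacity
        self._caches = caches
        self._pending = []
        self.overflows = 0

    def post(self, message):
        if len(self._pending) >= self._capacity:
            self._pending.clear()
            for cache in self._caches:
                cache.flush_all()
            self.overflows += 1
            return True
        self._pending.append(message)
        return False


def unsafe_use(cache, name, do_work):
    entry = cache.lookup(name)
    do_work()            # may overflow the queue and flush the cache
    return entry.value   # entry may have been flushed underneath us


def safe_use(cache, name, do_work):
    value = cache.lookup(name).value  # copy out before anything can flush
    do_work()
    return value
```

If do_work posts more messages than the queue can hold, unsafe_use
returns the cleared value (None in this model) while safe_use still
returns the value it copied out before the flush.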

In the short term I'd recommend that you avoid vacuuming system tables
while there are other open transactions; that should reduce the
incidence of overflows to a livable level.

            regards, tom lane