Thread: Strong feeling of something ugly lurking deeply within 7.0 ;-)
The severity of this bug depends heavily on your lack of buggy programs.

Short description: Long-standing open transactions combined with high-traffic updates and some regular vacuums eventually corrupt memory.

Long description: Due to a design flaw within our ecpg programs (I don't recommend designing for autocommit off!) some transactions stayed open for several days. A process data collection system generates a lot of status change updates (3MB a day) to about 110 rows in a table at the same time. After 1024 updates I vacuum the high-traffic table, which should then shrink to 16kB.

First I noticed that vacuum did not free old tuples. This put me on the track of the real cause. For the past three weeks (with more buggy long-standing transactions) I have seen one major crash of the program system per week. For months I have seen some strange NOTICES which went away after another vacuum. And this morning I found a 'possible memory corruption, killing other backends' message.

The situation got better and better during the 7.0 development cycle (I started with a pre-beta version this January and reported some concurrent vacuum oddities at that time). And it got worse the more interactive programs we added. But until now I had not spotted the special ingredient which causes the pain: long-standing transactions.

It's not very bad. This seems to happen only under rare conditions. Until this week I thought of it as a minor oddity, a temporary nuisance.

And: this is the current stable CVS tree, running on a 233MHz Pentium II, Linux 2.2.14(?)

Sample code:

    update bn_actual set meter=meter+1 where machine= ?;   // repeat every second

combined with

    begin transaction;   // hold
    select something;

and

    vacuum analyze;      // once a day

and

    vacuum bn_actual;    // every 1024 updates

and some others.

PS: Of course I'm currently fixing the long transactions problem. I'll tell you once the system has again run for 4 weeks without any strange occurrence.

PPS: Yes, I'm following the hackers list.

P3S: No, I don't believe in a hardware bug.
I think the cause here is probably a known problem. The vacuums in parallel with the long-running transactions would result in periodic sinval message queue overflows, with resultant flushes of syscache entries in all active backends. We know that there are places where syscache entry pointers are used longer than is safe --- i.e., it's possible for an entry to get flushed while some routine still has a pointer to it. Finding all these places, or, better, redesigning the syscache mechanism to eliminate the issue completely, has been on the todo list for a while.

In the short term I'd recommend that you avoid vacuuming system tables while there are other open transactions; that should reduce the incidence of overflows to a livable level.

regards, tom lane
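
The overflow mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not PostgreSQL's actual sinval code: the names `Backend`, `InvalQueue` and `simulate`, the queue capacity of 4, and the rule that a backend idling inside an open transaction never reads the queue are all hypothetical simplifications. It shows that once one backend stops consuming invalidation messages, a bounded queue overflows periodically and every backend, including the well-behaved one, has its whole cache flushed.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    read_pos: int = 0                       # next queue position to read
    cache: set = field(default_factory=set)
    full_flushes: int = 0                   # whole-cache resets suffered


class InvalQueue:
    """Bounded shared queue of cache invalidation messages (toy model)."""

    def __init__(self, capacity, backends):
        self.capacity = capacity
        self.backends = backends
        self.messages = []                  # messages[0] sits at position base
        self.base = 0
        self.overflows = 0

    def _trim(self):
        # Drop messages that every backend has already read.
        low = min(b.read_pos for b in self.backends)
        del self.messages[: low - self.base]
        self.base = low

    def publish(self, key):
        self._trim()
        if len(self.messages) >= self.capacity:
            # Overflow: the slowest reader blocks trimming, so the queue is
            # reset and every backend must discard its entire cache.
            self.overflows += 1
            self.base += len(self.messages)
            self.messages.clear()
            for b in self.backends:
                b.cache.clear()
                b.full_flushes += 1
                b.read_pos = self.base
        self.messages.append(key)

    def catch_up(self, backend):
        # Normal path: invalidate only the entries named in unread messages.
        for key in self.messages[backend.read_pos - self.base:]:
            backend.cache.discard(key)
        backend.read_pos = self.base + len(self.messages)


def simulate(messages, capacity, idle_backend_present):
    """Return (queue overflows, full cache flushes seen by the active worker)."""
    worker = Backend("worker")
    backends = [worker]
    if idle_backend_present:
        # Assumption: this backend sits in an open transaction and never reads.
        backends.append(Backend("idle-in-transaction"))
    queue = InvalQueue(capacity, backends)
    for _ in range(messages):
        worker.cache.add("bn_actual")       # worker caches a catalog entry
        queue.publish("bn_actual")          # a vacuum invalidates it
        queue.catch_up(worker)              # the active worker drains promptly
    return queue.overflows, worker.full_flushes


if __name__ == "__main__":
    print(simulate(10, 4, idle_backend_present=False))  # (0, 0)
    print(simulate(10, 4, idle_backend_present=True))   # (2, 2)
```

In this model the full flushes are the dangerous events: any routine that still held a reference to a cached entry at that moment would be left with a stale one. Fewer invalidation messages (no vacuum of system tables) or no idle reader (no long-standing transactions) both keep the queue from filling, which matches the short-term advice given above.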