Thread: Nasty VACUUM/bgwriter/segmentation bug
While working on fixing the recently reported hash-index problem, I was using a test build with a very small RELSEG_SIZE (128K), so that I could trigger the reported bug with a reasonably small amount of data. And I started finding some unexpected data corruption. I eventually reduced it to this test case:

checkpoint;
create table foo (f1 int);
insert into foo select x from generate_series(1,100000) x;
-- wait 30 seconds
delete from foo;
vacuum verbose foo;
insert into foo select x from generate_series(1,100000) x;
\q

stop and restart postmaster, then

vacuum verbose foo;

This vacuum will generate a whole lot of

WARNING:  relation "foo" page 16 is uninitialized --- fixing
WARNING:  relation "foo" page 17 is uninitialized --- fixing
...

and when the dust settles, most of the second batch of 100000 rows is gone.

What is happening is that during that 30-second wait, the bgwriter is dumping out all the dirty pages, and acquiring open file references to each segment of table "foo" as it does so. The VACUUM then truncates "foo" back to zero size, since it contains no data after the DELETE, and then the second INSERT bulks it up again. The problem is that the bgwriter still has open file references to the deleted segments after the first one, and that's where it'll write the data if given a chance. So the updates disappear into the ether as far as any other process is concerned, for each segment except the first.

There's a rather indirect mechanism that's supposed to prevent similar problems between two backends: a file truncation is supposed to be associated with a forced relcache flush, and that causes an smgrclose(), so other backends will be forced to reopen the file(s) before they can do any more work with the truncated relation. On reflection I think I don't trust this though, because in the case where a backend writes a dirty buffer because it needs to reclaim a buffer, it doesn't try to open the relation at the relcache level (it cannot, because the rel might be in a different database). So it could use a stale smgr relation, same as the bgwriter. The bgwriter does not participate in shared cache inval, not having a relcache in the first place, and so this mechanism doesn't help it anyway.

This is a fairly low-probability bug in real-world cases, because it could only happen when a relation is truncated and then re-expanded across a 1GB segment boundary. Moreover, because the bgwriter flushes all its open files after each checkpoint, the window for trouble only extends to the next checkpoint. But it definitely could happen, and it might explain some irreproducible corruption reports.

I think that the easiest fix might be to not remove no-longer-used segment files during a truncate, but simply reduce them to zero size rather than delete them. Then any open file pointers aren't invalidated. The only alternative I can see is to invent some new signaling mechanism to force closure of open files, but that seems ugly, complex, and perhaps subject to race conditions.

Thoughts?

			regards, tom lane
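[To make the failure mode concrete, here is a minimal standalone sketch in plain POSIX C -- this is not PostgreSQL's smgr/md code, and the file name "segment.1" and the process roles are invented for the demo. A "stale" descriptor plays the bgwriter's cached open file; the unlink-and-recreate branch plays the current VACUUM-truncate-then-INSERT behaviour, while the truncate-to-zero branch plays the proposed fix.]

/*
 * Not PostgreSQL source: a plain POSIX illustration of why a process holding
 * an old file descriptor loses its writes when a segment file is unlinked and
 * recreated, but not when the same file is merely truncated to zero length.
 */
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static void run_cycle(const char *path, int use_unlink)
{
    char    buf[16] = "";
    int     fd, stale_fd;

    /* create the "segment" with some data in it */
    fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    (void) write(fd, "old data", 8);
    close(fd);

    /* the "bgwriter" opens the segment and hangs on to the descriptor */
    stale_fd = open(path, O_RDWR);

    if (use_unlink)
    {
        /* current behaviour: truncation removes the segment file ...      */
        unlink(path);
        /* ... and the later INSERT re-extends the relation into a new file */
        fd = open(path, O_CREAT | O_RDWR, 0600);
    }
    else
    {
        /* proposed fix: keep the file, just chop it to zero length */
        fd = open(path, O_RDWR);
        (void) ftruncate(fd, 0);
    }

    /* the "bgwriter" now flushes a dirty page through its old descriptor */
    (void) pwrite(stale_fd, "flushed!", 8, 0);

    /* what does everyone else, reading through the current file, see? */
    ssize_t got = pread(fd, buf, sizeof(buf) - 1, 0);
    printf("%s: current file contains %zd bytes (\"%s\")\n",
           use_unlink ? "unlink + recreate" : "truncate to zero", got, buf);

    close(stale_fd);
    close(fd);
    unlink(path);
}

int main(void)
{
    run_cycle("segment.1", 1);   /* write vanishes: 0 bytes visible  */
    run_cycle("segment.1", 0);   /* write survives: 8 bytes visible  */
    return 0;
}

[In the unlink-and-recreate case the stale descriptor still points at the old, now-anonymous inode, so the flushed page never shows up in the recreated file; truncating to zero keeps the inode in place, so a later write through any existing descriptor lands in the file everyone else sees. That is the sense in which zeroing the segments instead of deleting them sidesteps the bug.]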
Tom Lane wrote:
> While working on fixing the recently reported hash-index problem, I was using a test build with a very small RELSEG_SIZE (128K), so that I could trigger the reported bug with a reasonably small amount of data. And I started finding some unexpected data corruption. I eventually reduced it to this test case:
>
> checkpoint;
> create table foo (f1 int);
> insert into foo select x from generate_series(1,100000) x;
> -- wait 30 seconds
> delete from foo;
> vacuum verbose foo;
> insert into foo select x from generate_series(1,100000) x;
> \q
> stop and restart postmaster, then
> vacuum verbose foo;
>
> This vacuum will generate a whole lot of
> WARNING:  relation "foo" page 16 is uninitialized --- fixing
> WARNING:  relation "foo" page 17 is uninitialized --- fixing
> ...
> and when the dust settles, most of the second batch of 100000 rows is gone.
>
> What is happening is that during that 30-second wait, the bgwriter is dumping out all the dirty pages, and acquiring open file references to each segment of table "foo" as it does so. The VACUUM then truncates "foo" back to zero size, since it contains no data after the DELETE, and then the second INSERT bulks it up again. The problem is that the bgwriter still has open file references to the deleted segments after the first one, and that's where it'll write the data if given a chance. So the updates disappear into the ether as far as any other process is concerned, for each segment except the first.

Does TRUNCATE suffer from the same issue?

> There's a rather indirect mechanism that's supposed to prevent similar problems between two backends: a file truncation is supposed to be associated with a forced relcache flush, and that causes an smgrclose(), so other backends will be forced to reopen the file(s) before they can do any more work with the truncated relation. On reflection I think I don't trust this though, because in the case where a backend writes a dirty buffer because it needs to reclaim a buffer, it doesn't try to open the relation at the relcache level (it cannot, because the rel might be in a different database). So it could use a stale smgr relation, same as the bgwriter. The bgwriter does not participate in shared cache inval, not having a relcache in the first place, and so this mechanism doesn't help it anyway.
>
> This is a fairly low-probability bug in real-world cases, because it could only happen when a relation is truncated and then re-expanded across a 1GB segment boundary. Moreover, because the bgwriter flushes all its open files after each checkpoint, the window for trouble only extends to the next checkpoint. But it definitely could happen, and it might explain some irreproducible corruption reports.

Regular imports that delete data or truncate relations would increase this probability, wouldn't they? Autovac is also likely to run on such a relation in the "wait" phase, while other relations are being truncated by an import process.

> I think that the easiest fix might be to not remove no-longer-used segment files during a truncate, but simply reduce them to zero size rather than delete them. Then any open file pointers aren't invalidated. The only alternative I can see is to invent some new signaling mechanism to force closure of open files, but that seems ugly, complex, and perhaps subject to race conditions.
>
> Thoughts?

Seems reasonable from my lowly user point of view. Would there be a requirement to remove the extra segments at any point in the future, or would they hang around on the disk forever?

Russell Smith
Russell Smith <mr-russ@pws.com.au> writes:
> Tom Lane wrote:
>> What is happening is that during that 30-second wait, the bgwriter is dumping out all the dirty pages, and acquiring open file references to each segment of table "foo" as it does so. The VACUUM then truncates "foo" back to zero size, since it contains no data after the DELETE, and then the second INSERT bulks it up again. The problem is that the bgwriter still has open file references to the deleted segments after the first one, and that's where it'll write the data if given a chance. So the updates disappear into the ether as far as any other process is concerned, for each segment except the first.

> Does TRUNCATE suffer from the same issue?

No, because TRUNCATE actually substitutes new physical files with different relfilenode numbers, in order to allow rollback if its transaction aborts. VACUUM is at risk, and I think we also use a non-transactional truncate for temp tables, but those shouldn't be at risk either.

> Seems reasonable from my lowly user point of view. Would there be a requirement to remove the extra segments at any point in the future, or would they hang around on the disk forever?

I'm envisioning that they'd stay there as long as the table exists (though I suppose a TRUNCATE could get rid of 'em). Zero-size files shouldn't be a big problem...

			regards, tom lane
Tom Lane wrote:
> I think that the easiest fix might be to not remove no-longer-used segment files during a truncate, but simply reduce them to zero size rather than delete them. Then any open file pointers aren't invalidated. The only alternative I can see is to invent some new signaling mechanism to force closure of open files, but that seems ugly, complex, and perhaps subject to race conditions.

Setting the files to zero size seems to make sense all around, because when the bgwriter wants to use one of those segments it can 'figure out' that there's nothing in the file and it should start at offset zero.

Regards,
Paul Bort
Tom Lane wrote:
> I think that the easiest fix might be to not remove no-longer-used segment files during a truncate, but simply reduce them to zero size rather than delete them. Then any open file pointers aren't invalidated. The only alternative I can see is to invent some new signaling mechanism to force closure of open files, but that seems ugly, complex, and perhaps subject to race conditions.
>
> Thoughts?

Ouch. Let's go with the easy fix. With regular 1GB segment size, having a few empty files in the data directory isn't going to hurt anyone.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Tom,

> Let's go with the easy fix. With regular 1GB segment size, having a few empty files in the data directory isn't going to hurt anyone.

No, but it will confuse DBAs ("What the heck are all these 0B files?"). Maybe we should add code to VACUUM to look for these empty file segments and unlink them if they haven't been touched in a while (say, a day?).

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
> Tom,
>
> > Let's go with the easy fix. With regular 1GB segment size, having a few empty files in the data directory isn't going to hurt anyone.
>
> No, but it will confuse DBAs ("What the heck are all these 0B files?"). Maybe we should add code to VACUUM to look for these empty file segments and unlink them if they haven't been touched in a while (say, a day?).

I don't buy this argument. A smart DBA isn't going to do any such thing. If you go around deleting unknown files you deserve everything you get. Especially if you are in the middle of a PostgreSQL cluster tree.

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
On Sun, Nov 19, 2006 at 12:01:15PM -0800, Joshua D. Drake wrote:
> On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
> > Tom,
> >
> > > Let's go with the easy fix. With regular 1GB segment size, having a few empty files in the data directory isn't going to hurt anyone.
> >
> > No, but it will confuse DBAs ("What the heck are all these 0B files?"). Maybe we should add code to VACUUM to look for these empty file segments and unlink them if they haven't been touched in a while (say, a day?).
>
> I don't buy this argument. A smart DBA isn't going to do any such thing. If you go around deleting unknown files you deserve everything you get. Especially if you are in the middle of a PostgreSQL cluster tree.

A mention in the README in that directory wouldn't hurt, tho.

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
Skype: davidfetter
Remember to vote!
Joshua D. Drake wrote:
> On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
>> Tom,
>>
>>> Let's go with the easy fix. With regular 1GB segment size, having a few empty files in the data directory isn't going to hurt anyone.
>>
>> No, but it will confuse DBAs ("What the heck are all these 0B files?"). Maybe we should add code to VACUUM to look for these empty file segments and unlink them if they haven't been touched in a while (say, a day?).
>
> I don't buy this argument. A smart DBA isn't going to do any such thing. If you go around deleting unknown files you deserve everything you get. Especially if you are in the middle of a PostgreSQL cluster tree.

Couldn't some application trigger this problem regularly (say, once a day while importing new data), and therefore create an ever-increasing number of empty files that you can't get rid of without a backup, reinit and restore? Maybe postgres could delete them during recovery?

greetings, Florian Pflug
Florian G. Pflug wrote:
> Joshua D. Drake wrote:
>> On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
>>> Tom,
>>>> Let's go with the easy fix. With regular 1GB segment size, having a few empty files in the data directory isn't going to hurt anyone.
>>> No, but it will confuse DBAs ("What the heck are all these 0B files?"). Maybe we should add code to VACUUM to look for these empty file segments and unlink them if they haven't been touched in a while (say, a day?).
>>
>> I don't buy this argument. A smart DBA isn't going to do any such thing. If you go around deleting unknown files you deserve everything you get. Especially if you are in the middle of a PostgreSQL cluster tree.
>
> Couldn't some application trigger this problem regularly (say, once a day while importing new data), and therefore create an ever-increasing number of empty files that you can't get rid of without a backup, reinit and restore?

No, the old empty files would get reused if the relation grows again. And they would get deleted if you truncate or drop the relation.

> Maybe postgres could delete them during recovery?

Yes, though it would have to run not only when recovering from WAL, but on normal startup as well. It would require sweeping through the data directory, looking for the empty files, which would increase the startup time a bit. We discussed a solution like that to find orphaned relation files some time ago, which IMHO is a real problem, but that idea got forgotten.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
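[For a feel of what such a sweep would involve, here is a rough standalone sketch, again in plain C rather than PostgreSQL code: it walks one database directory and reports zero-length segment files, i.e. names with a numeric suffix after a dot. The directory path "base/16384" is a made-up example, and a real implementation would also have to verify that a file truly belongs to a live relation before removing it.]

/*
 * Not PostgreSQL source: a rough sketch of a startup sweep that looks for
 * zero-length relation segment files ("12345.1", "12345.2", ...) in one
 * database directory. The path used in main() is hypothetical.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static void report_empty_segments(const char *dbdir)
{
    DIR            *dir = opendir(dbdir);
    struct dirent  *ent;
    struct stat     st;
    char            path[1024];

    if (dir == NULL)
        return;

    while ((ent = readdir(dir)) != NULL)
    {
        /* segment files carry a ".<n>" suffix; skip everything else */
        const char *dot = strrchr(ent->d_name, '.');
        if (dot == NULL || dot[1] == '\0')
            continue;

        snprintf(path, sizeof(path), "%s/%s", dbdir, ent->d_name);
        if (stat(path, &st) == 0 && S_ISREG(st.st_mode) && st.st_size == 0)
            printf("empty segment file: %s\n", path);
    }
    closedir(dir);
}

int main(void)
{
    report_empty_segments("base/16384");   /* hypothetical database directory */
    return 0;
}

[Even this trivial walk shows that the cost grows with the number of files in the directory, which is the startup-time concern mentioned above.]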
Heikki Linnakangas wrote:
> Florian G. Pflug wrote:
>> Joshua D. Drake wrote:
>>> On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
>>>> Tom,
>>>>> Let's go with the easy fix. With regular 1GB segment size, having a few empty files in the data directory isn't going to hurt anyone.
>>>> No, but it will confuse DBAs ("What the heck are all these 0B files?"). Maybe we should add code to VACUUM to look for these empty file segments and unlink them if they haven't been touched in a while (say, a day?).
>>>
>>> I don't buy this argument. A smart DBA isn't going to do any such thing. If you go around deleting unknown files you deserve everything you get. Especially if you are in the middle of a PostgreSQL cluster tree.
>>
>> Couldn't some application trigger this problem regularly (say, once a day while importing new data), and therefore create an ever-increasing number of empty files that you can't get rid of without a backup, reinit and restore?
>
> No, the old empty files would get reused if the relation grows again. And they would get deleted if you truncate or drop the relation.

Ah, sorry, I had misunderstood that point. I believed that a new segment would be created when the relation grows again, and the empty file would linger around forever... Thanks for your explanation.

greetings, Florian Pflug
On Sun, 2006-11-19 at 12:03 -0800, David Fetter wrote:
> On Sun, Nov 19, 2006 at 12:01:15PM -0800, Joshua D. Drake wrote:
> > On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
> > > Tom,
> > >
> > > > Let's go with the easy fix. With regular 1GB segment size, having a few empty files in the data directory isn't going to hurt anyone.
> > >
> > > No, but it will confuse DBAs ("What the heck are all these 0B files?"). Maybe we should add code to VACUUM to look for these empty file segments and unlink them if they haven't been touched in a while (say, a day?).
> >
> > I don't buy this argument. A smart DBA isn't going to do any such thing. If you go around deleting unknown files you deserve everything you get. Especially if you are in the middle of a PostgreSQL cluster tree.
>
> A mention in the README in that directory wouldn't hurt, tho.

Definitely :)

Sincerely,

Joshua D. Drake

> Cheers,
> D

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate