Nasty VACUUM/bgwriter/segmentation bug

From: Tom Lane
While working on fixing the recently reported hash-index problem,
I was using a test build with a very small RELSEG_SIZE (128K),
so that I could trigger the reported bug with a reasonably small
amount of data.  And I started finding some unexpected data corruption.
I eventually reduced it to this test case:

checkpoint;
create table foo (f1 int);
insert into foo select x from generate_series(1,100000) x;
-- wait 30 seconds
delete from foo;
vacuum verbose foo;
insert into foo select x from generate_series(1,100000) x;
\q
stop and restart postmaster, then
vacuum verbose foo;

This vacuum will generate a whole lot of 
WARNING:  relation "foo" page 16 is uninitialized --- fixing
WARNING:  relation "foo" page 17 is uninitialized --- fixing
...
and when the dust settles, most of the second batch of 100000 rows
is gone.

What is happening is that during that 30-second wait, the bgwriter is
dumping out all the dirty pages, and acquiring open file references
to each segment of table "foo" as it does so.  The VACUUM then truncates
"foo" back to zero size, since it contains no data after the DELETE,
and then the second INSERT bulks it up again.  The problem is that the
bgwriter still has open file references to the deleted segments after
the first one, and that's where it'll write the data if given a chance.
So the updates disappear into the ether as far as any other process is
concerned, for each segment except the first.
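
To see the filesystem behavior in isolation, here is a minimal standalone
C sketch, assuming plain POSIX semantics (this is not Postgres code;
"segment" just stands in for a relation segment file).  A cached
descriptor survives an unlink-and-recreate of the same path, and a write
through it lands in the old, unlinked inode where no other process will
ever see it:

#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* "bgwriter" opens the segment and caches the descriptor */
    int stale_fd = open("segment", O_RDWR | O_CREAT, 0600);

    /* "VACUUM" truncates: the old file is unlinked ... */
    unlink("segment");
    /* ... and a later INSERT re-creates a file at the same path */
    int new_fd = open("segment", O_RDWR | O_CREAT, 0600);

    /* the cached descriptor still points at the unlinked inode,
     * so this write is invisible to every other process */
    pwrite(stale_fd, "lost data", 9, 0);

    /* reading through the path sees none of it */
    char buf[16];
    printf("bytes visible via path: %zd\n",
           read(new_fd, buf, sizeof(buf)));     /* prints 0 */

    close(stale_fd);
    close(new_fd);
    unlink("segment");
    return 0;
}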

There's a rather indirect mechanism that's supposed to prevent similar
problems between two backends: a file truncation is supposed to be
associated with a forced relcache flush, and that causes an smgrclose(),
so other backends will be forced to reopen the file(s) before they can
do any more work with the truncated relation.  On reflection I think
I don't trust this though, because in the case where a backend writes a
dirty buffer because it needs to reclaim a buffer, it doesn't try to
open the relation at the relcache level (it cannot, because the rel
might be in a different database).  So it could use a stale smgr
relation, same as the bgwriter.  The bgwriter does not participate
in shared cache inval, not having a relcache in the first place, and
so this mechanism doesn't help it anyway.

This is a fairly low-probability bug in real-world cases, because it
could only happen when a relation is truncated and then re-expanded
across a 1GB segment boundary.  Moreover, because the bgwriter flushes
all its open files after each checkpoint, the window for trouble only
extends to the next checkpoint.  But it definitely could happen, and
it might explain some irreproducible corruption reports.

I think that the easiest fix might be to not remove no-longer-used
segment files during a truncate, but simply reduce them to zero size
rather than delete them.  Then any open file pointers aren't
invalidated.  The only alternative I can see is to invent some new
signaling mechanism to force closure of open files, but that seems
ugly, complex, and perhaps subject to race conditions.
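
For illustration, here is the same sketch with the unlink replaced by a
zero-length truncate, again assuming plain POSIX semantics: the inode
stays put, so a stale descriptor and a fresh open of the path refer to
the same file, and a delayed write simply re-extends it where everyone
can see it.

#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int stale_fd = open("segment", O_RDWR | O_CREAT, 0600);

    /* the truncate empties the file but keeps the inode */
    truncate("segment", 0);

    /* a write through the old descriptor lands in the same file
     * that any fresh open of the path will see */
    pwrite(stale_fd, "visible", 7, 0);

    int new_fd = open("segment", O_RDONLY);
    char buf[16];
    printf("bytes visible via path: %zd\n",
           read(new_fd, buf, sizeof(buf)));     /* prints 7 */

    close(stale_fd);
    close(new_fd);
    unlink("segment");
    return 0;
}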

Thoughts?
        regards, tom lane


Re: Nasty VACUUM/bgwriter/segmentation bug

From: Russell Smith
Tom Lane wrote:
> While working on fixing the recently reported hash-index problem,
> I was using a test build with a very small RELSEG_SIZE (128K),
> so that I could trigger the reported bug with a reasonably small
> amount of data.  And I started finding some unexpected data corruption.
> I eventually reduced it to this test case:
>
> checkpoint;
> create table foo (f1 int);
> insert into foo select x from generate_series(1,100000) x;
> -- wait 30 seconds
> delete from foo;
> vacuum verbose foo;
> insert into foo select x from generate_series(1,100000) x;
> \q
> stop and restart postmaster, then
> vacuum verbose foo;
>
> This vacuum will generate a whole lot of 
> WARNING:  relation "foo" page 16 is uninitialized --- fixing
> WARNING:  relation "foo" page 17 is uninitialized --- fixing
> ...
> and when the dust settles, most of the second batch of 100000 rows
> is gone.
>
> What is happening is that during that 30-second wait, the bgwriter is
> dumping out all the dirty pages, and acquiring open file references
> to each segment of table "foo" as it does so.  The VACUUM then truncates
> "foo" back to zero size, since it contains no data after the DELETE,
> and then the second INSERT bulks it up again.  The problem is that the
> bgwriter still has open file references to the deleted segments after
> the first one, and that's where it'll write the data if given a chance.
> So the updates disappear into the ether as far as any other process is
> concerned, for each segment except the first.
>   
Does TRUNCATE suffer from the same issue?
> There's a rather indirect mechanism that's supposed to prevent similar
> problems between two backends: a file truncation is supposed to be
> associated with a forced relcache flush, and that causes an smgrclose(),
> so other backends will be forced to reopen the file(s) before they can
> do any more work with the truncated relation.  On reflection I think
> I don't trust this though, because in the case where a backend writes a
> dirty buffer because it needs to reclaim a buffer, it doesn't try to
> open the relation at the relcache level (it cannot, because the rel
> might be in a different database).  So it could use a stale smgr
> relation, same as the bgwriter.  The bgwriter does not participate
> in shared cache inval, not having a relcache in the first place, and
> so this mechanism doesn't help it anyway.
>
> This is a fairly low-probability bug in real-world cases, because it
> could only happen when a relation is truncated and then re-expanded
> across a 1GB segment boundary.  Moreover, because the bgwriter flushes
> all its open files after each checkpoint, the window for trouble only
> extends to the next checkpoint.  But it definitely could happen, and
> it might explain some irreproducible corruption reports.
>   
Regular imports that delete data or truncate relations would increase 
this probability, wouldn't they?
Autovac is also likely to run on that relation during the "wait" phase, 
while other relations are being truncated by an import process.

> I think that the easiest fix might be to not remove no-longer-used
> segment files during a truncate, but simply reduce them to zero size
> rather than delete them.  Then any open file pointers aren't
> invalidated.  The only alternative I can see is to invent some new
> signaling mechanism to force closure of open files, but that seems
> ugly, complex, and perhaps subject to race conditions.
>
> Thoughts?
>   
Seems reasonable from my lowly user point of view.  Would there be a 
requirement to remove the extra segments at any point in the future or 
would they hang around on the disk forever?

Russell Smith
>             regards, tom lane



Re: Nasty VACUUM/bgwriter/segmentation bug

From: Tom Lane
Russell Smith <mr-russ@pws.com.au> writes:
> Tom Lane wrote:
>> What is happening is that during that 30-second wait, the bgwriter is
>> dumping out all the dirty pages, and acquiring open file references
>> to each segment of table "foo" as it does so.  The VACUUM then truncates
>> "foo" back to zero size, since it contains no data after the DELETE,
>> and then the second INSERT bulks it up again.  The problem is that the
>> bgwriter still has open file references to the deleted segments after
>> the first one, and that's where it'll write the data if given a chance.
>> So the updates disappear into the ether as far as any other process is
>> concerned, for each segment except the first.
>> 
> Does TRUNCATE suffer from the same issue?

No, because TRUNCATE actually substitutes new physical files with
different relfilenode numbers, in order to allow rollback if its
transaction aborts.  VACUUM is at risk, and I think we also use a
non-transactional truncate for temp tables, but those shouldn't be at
risk either.

> Seems reasonable from my lowly user point of view.  Would there be a 
> requirement to remove the extra segments at any point in the future or 
> would they hang around on the disk forever?

I'm envisioning that they'd stay there as long as the table exists
(though I suppose a TRUNCATE could get rid of 'em).  Zero-size files
shouldn't be a big problem...
        regards, tom lane


Re: Nasty VACUUM/bgwriter/segmentation bug

From: "Bort, Paul"
Tom Lane wrote:
>
> I think that the easiest fix might be to not remove no-longer-used
> segment files during a truncate, but simply reduce them to zero size
> rather than delete them.  Then any open file pointers aren't
> invalidated.  The only alternative I can see is to invent some new
> signaling mechanism to force closure of open files, but that seems
> ugly, complex, and perhaps subject to race conditions.
>

Setting the files to zero size seems to make sense all around: when the
bgwriter wants to use one of those segments, it can 'figure out' that
there's nothing in the file and that it should start at offset zero.

Regards,
Paul Bort


Re: Nasty VACUUM/bgwriter/segmentation bug

From: "Heikki Linnakangas"
Tom Lane wrote:
> I think that the easiest fix might be to not remove no-longer-used
> segment files during a truncate, but simply reduce them to zero size
> rather than delete them.  Then any open file pointers aren't
> invalidated.  The only alternative I can see is to invent some new
> signaling mechanism to force closure of open files, but that seems
> ugly, complex, and perhaps subject to race conditions.
> 
> Thoughts?

Ouch.

Let's go with the easy fix. With regular 1GB segment size, having a few 
empty files in the data directory isn't going to hurt anyone.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Nasty VACUUM/bgwriter/segmentation bug

From: Josh Berkus
Tom, 

> Let's go with the easy fix. With regular 1GB segment size, having a few
> empty files in the data directory isn't going to hurt anyone.

No, but it will confuse DBAs ("What the heck are all these 0B files?").  Maybe 
we should add code to VACUUM to look for these empty file segments and unlink 
them if they haven't been touched in a while (say, a day?).

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Nasty VACUUM/bgwriter/segmentation bug

From: "Joshua D. Drake"
On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
> Tom, 
> 
> > Let's go with the easy fix. With regular 1GB segment size, having a few
> > empty files in the data directory isn't going to hurt anyone.
> 
> No, but it will confuse DBAs ("What the heck are all these 0B files?").  Maybe 
> we should add code to VACUUM to look for these empty file segments and unlink 
> them if they haven't been touched in a while (say, a day?).

I don't buy this argument. A smart DBA isn't going to do any such thing.
If you go around deleting unknown files you deserve everything you get.
Especially if you are in the middle of a PostgreSQL cluster tree.

Sincerely,

Joshua D. Drake


-- 
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate





Re: Nasty VACUUM/bgwriter/segmentation bug

From: David Fetter
On Sun, Nov 19, 2006 at 12:01:15PM -0800, Joshua D. Drake wrote:
> On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
> > Tom, 
> > 
> > > Let's go with the easy fix. With regular 1GB segment size,
> > > having a few empty files in the data directory isn't going to
> > > hurt anyone.
> > 
> > No, but it will confuse DBAs ("What the heck are all these 0B
> > files?").  Maybe we should add code to VACUUM to look for these
> > empty file segments and unlink them if they haven't been touched
> > in a while (say, a day?).
> 
> I don't buy this argument. A smart DBA isn't going to do any such
> thing.  If you go around deleting unknown files you deserve
> everything you get.  Especially if you are in the middle of a
> PostgreSQL cluster tree.

A mention in the README in that directory wouldn't hurt, tho.

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778    AIM: dfetter666    Skype: davidfetter

Remember to vote!


Re: Nasty VACUUM/bgwriter/segmentation bug

From: "Florian G. Pflug"
Joshua D. Drake wrote:
> On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
>> Tom, 
>>
>>> Let's go with the easy fix. With regular 1GB segment size, having a few
>>> empty files in the data directory isn't going to hurt anyone.
>> No, but it will confuse DBAs ("What the heck are all these 0B files?").  Maybe 
>> we should add code to VACUUM to look for these empty file segments and unlink 
>> them if they haven't been touched in a while (say, a day?).
> 
> I don't buy this argument. A smart DBA isn't going to do any such thing.
> If you go around deleting unknown files you deserve everything you get.
> Especially if you are in the middle of a PostgreSQL cluster tree.

Couldn't some application trigger this problem regularly (say, once a day
while importing new data), and therefore create an ever-increasing number
of empty files that you can't get rid of without a backup, reinit, and
restore?

Maybe postgres could delete them during recovery?

greetings, Florian Pflug




Re: Nasty VACUUM/bgwriter/segmentation bug

From: "Heikki Linnakangas"
Florian G. Pflug wrote:
> Joshua D. Drake wrote:
>> On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
>>> Tom,
>>>> Let's go with the easy fix. With regular 1GB segment size, having a few
>>>> empty files in the data directory isn't going to hurt anyone.
>>> No, but it will confuse DBAs ("What the heck are all these 0B 
>>> files?").  Maybe we should add code to VACUUM to look for these empty 
>>> file segments and unlink them if they haven't been touched in a while 
>>> (say, a day?).
>>
>> I don't buy this argument. A smart DBA isn't going to do any such thing.
>> If you go around deleting unknown files you deserve everything you get.
>> Especially if you are in the middle of a PostgreSQL cluster tree.
> 
> Couldn't some application trigger this problem regularly (say, once a
> day while importing new data), and therefore create an ever-increasing
> number of empty files that you can't get rid of without a backup,
> reinit, and restore?

No, the old empty files would get reused if the relation grows again. 
And they would get deleted if you truncate or drop the relation.

> Maybe postgres could delete them during recovery?

Yes, though it would have to run not only when recovering from WAL, but 
on normal startup as well. It would require sweeping through the data 
directory looking for the empty files, which would increase the startup 
time a bit. We discussed a similar sweep some time ago for finding 
orphaned relation files, which IMHO are a real problem, but that idea 
got forgotten.
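
Roughly, such a sweep over one database directory might look like the
sketch below, using plain dirent/stat calls. This is only a sketch under
assumptions: is_extra_segment() is a hypothetical helper, and real code
would have to verify the file really is a trailing segment past the
relation's current end before unlinking it.

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int
is_extra_segment(const char *name)
{
    /* assumed check: "<relfilenode>.<n>" with n >= 1; real code would
     * also have to confirm the relation doesn't extend this far */
    const char *dot = strchr(name, '.');

    return dot != NULL && dot != name && dot[1] >= '1' && dot[1] <= '9';
}

static void
sweep_empty_segments(const char *dbpath)
{
    DIR        *dir = opendir(dbpath);
    struct dirent *de;
    struct stat st;
    char        path[4096];

    if (dir == NULL)
        return;
    while ((de = readdir(dir)) != NULL)
    {
        if (!is_extra_segment(de->d_name))
            continue;
        snprintf(path, sizeof(path), "%s/%s", dbpath, de->d_name);
        if (stat(path, &st) == 0 && st.st_size == 0)
            unlink(path);       /* zero-length trailing segment */
    }
    closedir(dir);
}

int
main(void)
{
    sweep_empty_segments("base/16384");     /* hypothetical database dir */
    return 0;
}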

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Nasty VACUUM/bgwriter/segmentation bug

From: "Florian G. Pflug"
Heikki Linnakangas wrote:
> Florian G. Pflug wrote:
>> Joshua D. Drake wrote:
>>> On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
>>>> Tom,
>>>>> Let's go with the easy fix. With regular 1GB segment size, having a
>>>>> few empty files in the data directory isn't going to hurt anyone.
>>>> No, but it will confuse DBAs ("What the heck are all these 0B 
>>>> files?").  Maybe we should add code to VACUUM to look for these 
>>>> empty file segments and unlink them if they haven't been touched in 
>>>> a while (say, a day?).
>>>
>>> I don't buy this argument. A smart DBA isn't going to do any such thing.
>>> If you go around deleting unknown files you deserve everything you get.
>>> Especially if you are in the middle of a PostgreSQL cluster tree.
>>
>> Couldn't some application trigger this problem regularly (say, once a
>> day while importing new data), and therefore create an ever-increasing
>> number of empty files that you can't get rid of without a backup,
>> reinit, and restore?
> 
> No, the old empty files would get reused if the relation grows again. 
> And they would get deleted if you truncate or drop the relation.
Ah, sorry, I had misunderstood that point. I believed that a new
segment would be created when the relation grows again, and the
empty file would linger around forever...

Thanks for your explanation.
greetings, Florian Pflug



Re: Nasty VACUUM/bgwriter/segmentation bug

From: "Joshua D. Drake"
On Sun, 2006-11-19 at 12:03 -0800, David Fetter wrote:
> On Sun, Nov 19, 2006 at 12:01:15PM -0800, Joshua D. Drake wrote:
> > On Sun, 2006-11-19 at 11:28 -0800, Josh Berkus wrote:
> > > Tom, 
> > > 
> > > > Let's go with the easy fix. With regular 1GB segment size,
> > > > having a few empty files in the data directory isn't going to
> > > > hurt anyone.
> > > 
> > > No, but it will confuse DBAs ("What the heck are all these 0B
> > > files?").  Maybe we should add code to VACUUM to look for these
> > > empty file segments and unlink them if they haven't been touched
> > > in a while (say, a day?).
> > 
> > I don't buy this argument. A smart DBA isn't going to do any such
> > thing.  If you go around deleting unknown files you deserve
> > everything you get.  Especially if you are in the middle of a
> > PostgreSQL cluster tree.
> 
> A mention in the README in that directory wouldn't hurt, tho.

Definitely :)

Sincerely,

Joshua D. Drake

> 
> Cheers,
> D
-- 
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate