Thread: CREATE DATABASE vs delayed table unlink

CREATE DATABASE vs delayed table unlink

From: Tom Lane
The thread here
http://archives.postgresql.org/pgsql-performance/2008-10/msg00031.php
illustrates an undesirable side effect of the recent patch to delay
table file unlinks to the next checkpoint.  What is evidently happening
is that copydir() fetches a block of a directory, and by the time it
arrives at some particular entry in the block, a checkpoint has happened
and that file got removed.  If there are some large files in the
directory then the window for this race condition can be wide.

The only real solution I can see is to replace createdb()'s
FlushDatabaseBuffers call with a full-blown checkpoint.  It's pretty
annoying to do *two* checkpoints in a CREATE DATABASE, but as long as
we're doing this via filesystem-based APIs we probably haven't got much
choice.

Comments?
        regards, tom lane
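
The race described above can be reproduced in miniature. The following is a rough Python sketch, not PostgreSQL's C implementation of copydir(): the function names, the `after_listing` hook, and the file names are all hypothetical, chosen only to show a directory listing going stale when a file is unlinked between the listing and the per-file stat.

```python
import os
import tempfile


def copy_dir_snapshot(src, dst, after_listing=None):
    """Copy every file in src to dst, fetching the directory listing once up front."""
    os.makedirs(dst, exist_ok=True)
    names = sorted(os.listdir(src))      # entries fetched as one batch
    if after_listing is not None:
        after_listing()                  # hypothetical hook: stands in for a concurrent checkpoint
    for name in names:
        src_path = os.path.join(src, name)
        os.stat(src_path)                # raises FileNotFoundError if the file has vanished
        with open(src_path, "rb") as fin, open(os.path.join(dst, name), "wb") as fout:
            fout.write(fin.read())


def demo_race():
    """Return the name of the file that disappeared between listing and stat."""
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src")
        dst = os.path.join(root, "dst")
        os.makedirs(src)
        for name in ("1001", "1002", "1003"):
            with open(os.path.join(src, name), "wb") as f:
                f.write(b"data")
        # Assumption for the demo: "1002" belongs to a dropped table, so it is
        # truncated now and only unlinked at the next "checkpoint".
        open(os.path.join(src, "1002"), "wb").close()

        def checkpoint():
            os.unlink(os.path.join(src, "1002"))

        try:
            copy_dir_snapshot(src, dst, after_listing=checkpoint)
        except FileNotFoundError as exc:
            return os.path.basename(exc.filename)
        return None
```

Calling `demo_race()` fails on the stale entry, much like the "could not stat file" error reported in the linked thread. The larger the files copied before the stale entry is reached, the longer the window in which the unlink can happen.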


Re: CREATE DATABASE vs delayed table unlink

From: Heikki Linnakangas
Tom Lane wrote:
> The thread here
> http://archives.postgresql.org/pgsql-performance/2008-10/msg00031.php
> illustrates an undesirable side effect of the recent patch to delay
> table file unlinks to the next checkpoint.  What is evidently happening
> is that copydir() fetches a block of a directory, and by the time it
> arrives at some particular entry in the block, a checkpoint has happened
> and that file got removed.  If there are some large files in the
> directory then the window for this race condition can be wide.
> 
> The only real solution I can see is to replace createdb()'s
> FlushDatabaseBuffers call with a full-blown checkpoint.  It's pretty
> annoying to do *two* checkpoints in a CREATE DATABASE, but as long as
> we're doing this via filesystem-based APIs we probably haven't got much
> choice.

Hmph, that is pretty annoying. An extra checkpoint seems like the easy 
solution.

Another thought is to ignore ENOENT in copydir. But then you'd still 
copy all the lingering empty files, which would never be deleted. They'd 
be zero-length, and you can end up with orphaned files anyway in crash 
scenarios, but it'd still be annoying.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
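
To make the drawback of the ENOENT idea concrete, here is a small Python sketch under stated assumptions. It is not PostgreSQL code: `copy_dir_ignore_missing`, the `before_each` hook, and the file names are hypothetical. It shows that skipping missing files avoids the error, yet any zero-length file copied before the checkpoint fired still lands in the destination, where nothing is scheduled to remove it.

```python
import os
import shutil
import tempfile


def copy_dir_ignore_missing(src, dst, before_each=None):
    """Copy src to dst, silently skipping files that vanish mid-copy."""
    os.makedirs(dst, exist_ok=True)
    skipped = []
    for name in sorted(os.listdir(src)):
        if before_each is not None:
            before_each(name)            # hypothetical hook to inject a checkpoint
        try:
            shutil.copyfile(os.path.join(src, name), os.path.join(dst, name))
        except FileNotFoundError:
            skipped.append(name)         # the "ignore ENOENT" behaviour
    return skipped


def demo_orphans():
    """Return (files skipped, zero-length files left in the copy)."""
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src")
        dst = os.path.join(root, "dst")
        os.makedirs(src)
        with open(os.path.join(src, "1001"), "wb") as f:
            f.write(b"live data")
        pending = ["1002", "1003"]       # truncated files awaiting unlink
        for name in pending:
            open(os.path.join(src, name), "wb").close()

        def checkpoint_before_1003(name):
            # Simulated checkpoint: fires just before "1003" is copied and
            # removes every pending file from the source directory.
            if name == "1003":
                for p in pending:
                    path = os.path.join(src, p)
                    if os.path.exists(path):
                        os.unlink(path)

        skipped = copy_dir_ignore_missing(src, dst, before_each=checkpoint_before_1003)
        empty = sorted(
            n for n in os.listdir(dst)
            if os.path.getsize(os.path.join(dst, n)) == 0
        )
        return skipped, empty
```

In this run "1003" is skipped without complaint, while "1002" was already copied as an empty file and stays behind in the destination as an orphan.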


Re: CREATE DATABASE vs delayed table unlink

From: Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Another thought is to ignore ENOENT in copydir.

Yeah, I thought about that too, but it seems extremely dangerous ...
        regards, tom lane


Re: CREATE DATABASE vs delayed table unlink

From: Heikki Linnakangas
Matthew Wakeling wrote:
> I could be wrong - but couldn't other bad things happen too? If you're 
> copying the files before the checkpoint has completed, couldn't the new 
> database end up with some of the recent changes going missing? Or is 
> that prevented by FlushDatabaseBuffers?

FlushDatabaseBuffers prevents that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: CREATE DATABASE vs delayed table unlink

From: Heikki Linnakangas
Matthew Wakeling wrote:
>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>> Another thought is to ignore ENOENT in copydir.
> 
> On Wed, 8 Oct 2008, Tom Lane wrote:
>> Yeah, I thought about that too, but it seems extremely dangerous ...
> 
> I agree. If a file randomly goes missing, that's not an error to ignore, 
> even if you think the only way that could happen is safe.

I committed a patch to do a full-blown checkpoint before the copy. 
Annoying to do two checkpoints, but CREATE DATABASE is a pretty 
heavy-weight operation anyway. I don't see any other solution at the 
moment, at least not one that we could back-patch.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: CREATE DATABASE vs delayed table unlink

From: Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I committed a patch to do a full-blown checkpoint before the copy. 
> Annoying to do two checkpoints, but CREATE DATABASE is a pretty 
> heavy-weight operation anyway. I don't see any other solution at the 
> moment, at least not one that we could back-patch.

Agreed.  Patch looks good.

I tried to reproduce the issue here using yesterday's CVS HEAD.
It is not hard to get the "file does not exist" failure, but so
far as I can tell CREATE DATABASE does clean up the target directory
before reporting that failure to the user.  It is probably possible
to interrupt the cleanup, but if that happened then the original
error message wouldn't ever get delivered at all.  So I'm mystified
how Matthew could have seen the expected error and yet had the
destination tree (or at least large chunks of it) left behind.

[ thinks for a bit... ]  We know there were multiple occurrences.
Matthew, is it possible that you had other createdb failures that
did *not* report "file does not exist"?  For instance, a createdb
interrupted by a "fast" database shutdown might have left things this
way.
        regards, tom lane


Re: CREATE DATABASE vs delayed table unlink

From: Matthew Wakeling
On Thu, 9 Oct 2008, Tom Lane wrote:
> So I'm mystified
> how Matthew could have seen the expected error and yet had the
> destination tree (or at least large chunks of it) left behind.

Remember I was running 8.3.0, and you mentioned a few changes after that 
version which would have made sure the destination tree was cleaned up 
properly.

> [ thinks for a bit... ]  We know there were multiple occurrences.
> Matthew, is it possible that you had other createdb failures that
> did *not* report "file does not exist"?  For instance, a createdb
> interrupted by a "fast" database shutdown might have left things this
> way.

Well, we didn't have any fast database shutdowns or power failures. I 
don't think so.

Matthew

-- 
Heat is work, and work's a curse. All the heat in the universe, it's
going to cool down, because it can't increase, then there'll be no
more work, and there'll be perfect peace.      -- Michael Flanders


Re: CREATE DATABASE vs delayed table unlink

From: Matthew Wakeling
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Another thought is to ignore ENOENT in copydir.

On Wed, 8 Oct 2008, Tom Lane wrote:
> Yeah, I thought about that too, but it seems extremely dangerous ...

I agree. If a file randomly goes missing, that's not an error to ignore, 
even if you think the only way that could happen is safe.

I could be wrong - but couldn't other bad things happen too? If you're 
copying the files before the checkpoint has completed, couldn't the new 
database end up with some of the recent changes going missing? Or is that 
prevented by FlushDatabaseBuffers?

Matthew

-- 
Isn't "Microsoft Works" something of a contradiction?


Re: CREATE DATABASE vs delayed table unlink

From: Tom Lane
Matthew Wakeling <mnw21@cam.ac.uk> writes:
> On Thu, 9 Oct 2008, Tom Lane wrote:
>> So I'm mystified
>> how Matthew could have seen the expected error and yet had the
>> destination tree (or at least large chunks of it) left behind.

> Remember I was running 8.3.0, and you mentioned a few changes after that 
> version which would have made sure the destination tree was cleaned up 
> properly.

Well, there were some fixes for the case of a SIGTERM shutdown, but
I still don't see how 8.3.0 (or any PG version for some time back)
could report the file-not-found-in-source-tree failure without having
passed through the cleanup code.

There's some possibility that it tried to clean up and got a failure
(which would be reported as a WARNING, which conceivably you didn't
note) ... but it's kind of hard to see what failure it could get from
deleting files it just created.  Is there anything weird about the
ownership/permissions on the orphaned directories and files?
        regards, tom lane


Re: CREATE DATABASE vs delayed table unlink

From: Matthew Wakeling
The error on createdb happened again this morning. However, this time an 
abandoned directory was not created. The full error message was:

$ createdb -E SQL_ASCII -U flyminebuild -h brian.flymine.org -T production-flyminebuild production-flyminebuild:uniprot
createdb: database creation failed: ERROR:  could not stat file "base/33049747/33269704": No such file or directory

However, my colleagues promptly dropped the database that was being 
copied and restarted the build process, so I can't diagnose anything. 
Suffice to say that there is no abandoned directory, and the directory 
33049747 no longer exists either.

I'll try again to get some details next time it happens.

Matthew

-- 
$ rm core
Segmentation Fault (core dumped)