Thread: Checkpoint request failed on version 8.2.1.

Checkpoint request failed on version 8.2.1.

From
"Patrick Earl"
Date:
Hi all.  I'm getting a checkpoint request failed message when I try to
execute a CREATE DATABASE command.  Since it was a fresh install, I've
included the entire server log up to the point of the error.  I
truncated the log output two lines after the error message.

Is there a way I can avoid this problem?  Is this a bug?  Thanks.

        Patrick

2007-01-11 09:46:20 LOG:  database system was shut down at 2007-01-11
09:46:16 Mountain Standard Time
2007-01-11 09:46:20 LOG:  checkpoint record is at 0/487970
2007-01-11 09:46:20 LOG:  redo record is at 0/487970; undo record is
at 0/0; shutdown TRUE
2007-01-11 09:46:20 LOG:  next transaction ID: 0/595; next OID: 10820
2007-01-11 09:46:20 LOG:  next MultiXactId: 1; next MultiXactOffset: 0
2007-01-11 09:46:20 LOG:  database system is ready
2007-01-11 09:47:51 LOG:  received SIGHUP, reloading configuration files
2007-01-11 09:48:18 FATAL:  no pg_hba.conf entry for host
"192.168.2.102", user "sa", database "postgres", SSL off
2007-01-11 09:55:39 LOG:  could not receive data from client: No
connection could be made because the target machine actively refused
it.

2007-01-11 09:55:39 LOG:  unexpected EOF on client connection
2007-01-11 09:55:59 NOTICE:  CREATE TABLE will create implicit
sequence "Transactions_Transaction_ID_seq" for serial column
"Transactions.Transaction_ID"
2007-01-11 09:55:59 NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create
implicit index "Transactions_pkey" for table "Transactions"
2007-01-11 09:56:05 NOTICE:  CREATE TABLE will create implicit
sequence "Database_Updates_ID_seq" for serial column
"Database_Updates.ID"
2007-01-11 09:56:05 NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create
implicit index "Database_Updates_pkey" for table "Database_Updates"
2007-01-11 09:56:06 NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create
implicit index "Key_pkey" for table "Key"
2007-01-11 09:56:07 NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create
implicit index "UnitTestTable1_pkey" for table "UnitTestTable1"
2007-01-11 09:56:07 NOTICE:  trigger "Delete_Tracking_Trigger" for
table "UnitTestTable1" does not exist, skipping
2007-01-11 09:56:07 NOTICE:  trigger "Update_Tracking_Trigger" for
table "UnitTestTable1" does not exist, skipping
2007-01-11 09:56:07 NOTICE:  trigger "Insert_Tracking_Trigger" for
table "UnitTestTable1" does not exist, skipping
2007-01-11 09:56:10 NOTICE:  drop cascades to sequence "_WEBDRIL"."IDSequence"
2007-01-11 09:56:10 NOTICE:  drop cascades to table "_WEBDRIL"."Database_Id"
2007-01-11 09:56:10 NOTICE:  drop cascades to table
"_WEBDRIL"."Database_Updates"
2007-01-11 09:56:10 NOTICE:  drop cascades to default for table
"_WEBDRIL"."Database_Updates" column ID
2007-01-11 09:56:10 NOTICE:  drop cascades to table "_WEBDRIL"."Transactions"
2007-01-11 09:56:10 NOTICE:  drop cascades to default for table
"_WEBDRIL"."Transactions" column Transaction_ID
2007-01-11 09:56:10 NOTICE:  drop cascades to function
"_WEBDRIL"."Get_Transaction_ID"()
2007-01-11 09:56:10 NOTICE:  drop cascades to function
"_WEBDRIL"."Current_Transaction_ID"()
2007-01-11 09:56:10 NOTICE:  drop cascades to function
"_WEBDRIL"."Get_User_ID"()
2007-01-11 09:56:11 NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create
implicit index "UnitTestTable1_pkey" for table "UnitTestTable1"
2007-01-11 09:56:11 NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create
implicit index "UnitTestTable2_pkey" for table "UnitTestTable2"
2007-01-11 09:56:17 ERROR:  could not open relation 1663/16403/16426:
Permission denied
2007-01-11 09:56:17 ERROR:  checkpoint request failed
2007-01-11 09:56:17 HINT:  Consult recent messages in the server log
for details.
2007-01-11 09:56:17 STATEMENT:  CREATE DATABASE "TestDatabase"
2007-01-11 09:56:18 NOTICE:  drop cascades to table "TestSchema"."stable"

Re: Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
"Patrick Earl" <patearl@patearl.net> writes:
> 2007-01-11 09:56:17 ERROR:  could not open relation 1663/16403/16426:
> Permission denied
> 2007-01-11 09:56:17 ERROR:  checkpoint request failed
> 2007-01-11 09:56:17 HINT:  Consult recent messages in the server log
> for details.
> 2007-01-11 09:56:17 STATEMENT:  CREATE DATABASE "TestDatabase"

Well, like the HINT says, consult recent messages for details.  In this
case the reason the checkpoint failed was evidently a file permissions
problem.  Is this repeatable?  What platform is it on?

            regards, tom lane

Re: Checkpoint request failed on version 8.2.1.

From
"Patrick Earl"
Date:
We're getting the error as part of an automated test suite and it is
seems to occur every time the suite is run.  The platform is Win XP 64
bit.

When running the same unit test suite from a remote machine, the error
does not occur.  The error also does not occur when manually running
the create database command using pgadmin.  We are logging into
PostgreSQL as the same user.

        Patrick

On 1/11/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Patrick Earl" <patearl@patearl.net> writes:
> > 2007-01-11 09:56:17 ERROR:  could not open relation 1663/16403/16426:
> > Permission denied
> > 2007-01-11 09:56:17 ERROR:  checkpoint request failed
> > 2007-01-11 09:56:17 HINT:  Consult recent messages in the server log
> > for details.
> > 2007-01-11 09:56:17 STATEMENT:  CREATE DATABASE "TestDatabase"
>
> Well, like the HINT says, consult recent messages for details.  In this
> case the reason the checkpoint failed was evidently a file permissions
> problem.  Is this repeatable?  What platform is it on?
>
>                         regards, tom lane
>

Re: Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
"Patrick Earl" <patearl@patearl.net> writes:
> We're getting the error as part of an automated test suite and it is
> seems to occur every time the suite is run.  The platform is Win XP 64
> bit.

Hm.  We've seen problems of this ilk caused by bogus antivirus software,
but if that were the explanation it's not clear why running the test
locally vs remotely would change anything.

Another theory is that it's got something to do with Windows' problems
with unlinking open files --- we've found that "permission denied" can
be the error code when attempting to open a file that's been unlinked
but isn't yet gone because someone is holding it open.  Can you
determine whether the complained-of relation is one that's recently been
deleted?

            regards, tom lane

Re: Checkpoint request failed on version 8.2.1.

From
"Patrick Earl"
Date:
There is no antivirus software running on the machine.

I'm not entirely sure how to determine which relation it is
complaining about.  I see a folder that corresponds to the middle
number in the log, and I see numbers in the same range as the right
number from the log.

In any case, the unit tests remove all contents and schema within the
database before starting, and they remove the tables they create as
they proceed.  Certainly there are many things have been recently
deleted.

        Patrick

On 1/11/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Patrick Earl" <patearl@patearl.net> writes:
> > We're getting the error as part of an automated test suite and it is
> > seems to occur every time the suite is run.  The platform is Win XP 64
> > bit.
>
> Hm.  We've seen problems of this ilk caused by bogus antivirus software,
> but if that were the explanation it's not clear why running the test
> locally vs remotely would change anything.
>
> Another theory is that it's got something to do with Windows' problems
> with unlinking open files --- we've found that "permission denied" can
> be the error code when attempting to open a file that's been unlinked
> but isn't yet gone because someone is holding it open.  Can you
> determine whether the complained-of relation is one that's recently been
> deleted?
>
>                         regards, tom lane
>

Re: Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
"Patrick Earl" <patearl@patearl.net> writes:
> In any case, the unit tests remove all contents and schema within the
> database before starting, and they remove the tables they create as
> they proceed.  Certainly there are many things have been recently
> deleted.

Yeah, I think then there's no question that the bgwriter is trying to
fsync something that's been deleted but isn't yet closed by every
process.  We have things set up so that that's not a really serious
problem anymore --- eventually it will be closed and then the next
checkpoint will succeed.  But CREATE DATABASE insists on checkpointing
and so it's vulnerable to even a transient failure.

I've been resisting changing the checkpoint code to treat EACCES as a
non-error situation on Windows, but maybe we have no choice.  How do
people feel about this idea: #ifdef WIN32 and the open or fsync fails
with EACCES, then

1. Emit a LOG (or maybe DEBUG) message noting the problem.
2. Leave the fsync request entry in the hashtable for next time.
3. Allow the current checkpoint to complete normally anyway.

If the file has actually been deleted, then eventually it will be closed
and the next checkpoint will be able to remove the hash entry.  If
there's something else wrong, we'll keep bleating and maybe the DBA will
notice eventually.

The downside of this is that a real EACCES problem wouldn't get noted at
any level higher than LOG, and so you could theoretically lose data
without much warning.  But I'm not seeing anything else we could do
about it --- AFAIK we have not heard of a way we can distinguish this
case from a real permissions problem.  And anyway there should never
*be* a real permissions problem; if there is then the user's been poking
under the hood sufficient to void the warranty anyway ;-)

Comments?

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
Tom Lane wrote:
> "Patrick Earl" <patearl@patearl.net> writes:
>> In any case, the unit tests remove all contents and schema within the
>> database before starting, and they remove the tables they create as
>> they proceed.  Certainly there are many things have been recently
>> deleted.
>
> Yeah, I think then there's no question that the bgwriter is trying to
> fsync something that's been deleted but isn't yet closed by every
> process.  We have things set up so that that's not a really serious
> problem anymore --- eventually it will be closed and then the next
> checkpoint will succeed.  But CREATE DATABASE insists on checkpointing
> and so it's vulnerable to even a transient failure.
>
> I've been resisting changing the checkpoint code to treat EACCES as a
> non-error situation on Windows, but maybe we have no choice.  How do
> people feel about this idea: #ifdef WIN32 and the open or fsync fails
> with EACCES, then
>
> 1. Emit a LOG (or maybe DEBUG) message noting the problem.
> 2. Leave the fsync request entry in the hashtable for next time.
> 3. Allow the current checkpoint to complete normally anyway.
>
> If the file has actually been deleted, then eventually it will be closed
> and the next checkpoint will be able to remove the hash entry.  If
> there's something else wrong, we'll keep bleating and maybe the DBA will
> notice eventually.
>
> The downside of this is that a real EACCES problem wouldn't get noted at
> any level higher than LOG, and so you could theoretically lose data
> without much warning.  But I'm not seeing anything else we could do
> about it --- AFAIK we have not heard of a way we can distinguish this
> case from a real permissions problem.  And anyway there should never
> *be* a real permissions problem; if there is then the user's been poking
> under the hood sufficient to void the warranty anyway ;-)
>
> Comments?

I find it very unlikely that you would "during normal operations" end up
in a situation where you would first have permissions to create files in
a directory, and then lose them.
What could be is that you have a directory where you never had
permissions to create the file in the first place.

Any chance to differentiate between these? In the first case, someone
did something to change the permissions, and can be expected to actually
check that things continued to work after that. In the second case, it
would be nice if it was possible to catch it faster.

//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> I find it very unlikely that you would "during normal operations" end up
> in a situation where you would first have permissions to create files in
> a directory, and then lose them.
> What could be is that you have a directory where you never had
> permissions to create the file in the first place.

> Any chance to differentiate between these?

The cases we're concerned about involve access to an existing file, not
attempts to create a new one, so I'm not clear what your point is.

I would certainly *love* to differentiate between these failures and
ordinary permissions failures, but so far as I've heard we can't.

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
"Jim C. Nasby"
Date:
On Thu, Jan 11, 2007 at 03:14:37PM -0500, Tom Lane wrote:
> The downside of this is that a real EACCES problem wouldn't get noted at
> any level higher than LOG, and so you could theoretically lose data
> without much warning.  But I'm not seeing anything else we could do
> about it --- AFAIK we have not heard of a way we can distinguish this
> case from a real permissions problem.  And anyway there should never
> *be* a real permissions problem; if there is then the user's been poking
> under the hood sufficient to void the warranty anyway ;-)

Or some other "helpful" process such as a virus scanner has been poking
under the hood for you... :(

Given that this could result in data loss, if this was to be done I'd
very much want to see a way to disable it in a production environment.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> Given that this could result in data loss, if this was to be done I'd
> very much want to see a way to disable it in a production environment.

Production environments are the same ones that won't be happy with
random checkpoint failures, either.

If we can't find a way to positively identify the deleted-file failures
then I think we've got to do something like this.

(You know, of course, that my opinion is that no sane person would run a
production database on Windows in the first place.  So the data-loss
risk to me seems less of a problem than the unexpected-failures problem.
It's not like there aren't a ton of other data-loss scenarios in that OS
that we can't do anything about...)

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> I find it very unlikely that you would "during normal operations" end up
>> in a situation where you would first have permissions to create files in
>> a directory, and then lose them.
>> What could be is that you have a directory where you never had
>> permissions to create the file in the first place.
>
>> Any chance to differentiate between these?
>
> The cases we're concerned about involve access to an existing file, not
> attempts to create a new one, so I'm not clear what your point is.

Well, then I don't see it as being a big problem, which was the
question, I think. If pgsql had permissions to create the file, it would
never lose it unless the dba changed something - and if the dba changed
something, then he should check his logs afterwards to make sure he
didn't break anything.

My point is that if we know that *we* could create the file, than the
probability of it being an *actual* permissions problem is very low
during normal operations. So it's most likely "the delete issue", and
thus doing what you propose does seem like a fairly safe bet.


> I would certainly *love* to differentiate between these failures and
> ordinary permissions failures, but so far as I've heard we can't.

Right, that's the base problem.

//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
"Jim C. Nasby"
Date:
On Thu, Jan 11, 2007 at 04:32:42PM -0500, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > Given that this could result in data loss, if this was to be done I'd
> > very much want to see a way to disable it in a production environment.
>
> Production environments are the same ones that won't be happy with
> random checkpoint failures, either.

Maybe I'm not understanding what happens in a checkpoint failure, but
I'd rather have my pg_xlog fill up (hopefully with a lot af WARNINGS
thrown before-hand) and face a long recovery than lose data...

> If we can't find a way to positively identify the deleted-file failures
> then I think we've got to do something like this.
>
> (You know, of course, that my opinion is that no sane person would run a
> production database on Windows in the first place.  So the data-loss
> risk to me seems less of a problem than the unexpected-failures problem.
> It's not like there aren't a ton of other data-loss scenarios in that OS
> that we can't do anything about...)

Yeah, and I share your opinion. Unfortunately, a lot of others do not.
:(

It would be useful if we had a page somewhere that explained in detail
what these data-loss issues were and why they're out of our control. At
least then people would (or could...) understand why production +
Windows == BadIdea.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Richard Troy
Date:
On Thu, 11 Jan 2007, Tom Lane wrote:

...snip...
>
> (You know, of course, that my opinion is that no sane person would run a
> production database on Windows in the first place.  So the data-loss
> risk to me seems less of a problem than the unexpected-failures problem.
> It's not like there aren't a ton of other data-loss scenarios in that OS
> that we can't do anything about...)
>
>             regards, tom lane
>

PLEASE OH PLEASE document every f-ing one of them! (And I don't mean
document Windows issues as comments in the source code. Best would be in
the official documentation/on a web page.) On occasion, I could *really*
use such a list! (If such already exists, please point me at it!)

Thing is, Tom, not everybody has the same level of information you have on
the subject...

Regards,
Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy@ScienceTools.com, http://ScienceTools.com/


Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Andrew Dunstan
Date:
Richard Troy wrote:
> On Thu, 11 Jan 2007, Tom Lane wrote:
>
> ...snip...
>
>> (You know, of course, that my opinion is that no sane person would run a
>> production database on Windows in the first place.  So the data-loss
>> risk to me seems less of a problem than the unexpected-failures problem.
>> It's not like there aren't a ton of other data-loss scenarios in that OS
>> that we can't do anything about...)
>>
>>
>>
>
> PLEASE OH PLEASE document every f-ing one of them! (And I don't mean
> document Windows issues as comments in the source code. Best would be in
> the official documentation/on a web page.) On occasion, I could *really*
> use such a list! (If such already exists, please point me at it!)
>
> Thing is, Tom, not everybody has the same level of information you have on
> the subject...
>
>
>


Please don't. At least not on the PostgreSQL web site nor in the docs.
And no, I don't run my production servers on Windows either.

For good or ill, we made a decision years ago to do a proper Windows
port. I think that it's actually worked out reasonably well. All
operating systems have warts. Not long ago I tended to advise people not
to run mission critical Postgresql on Linux unless they were *very*
careful, due to the over-commit issue.

In fact, I don't trust any OS. I use dumps and backups and replication
to protect myself from them all.

In the present instance, the data loss risk is largely theoretical, as I
understand it, as we don't expect a genuine EACCESS error.

cheers

andrew

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
"Joshua D. Drake"
Date:
> >
> >
> >
>
>
> Please don't. At least not on the PostgreSQL web site nor in the docs.
> And no, I don't run my production servers on Windows either.

It does seem like it might be a good idea to have FAQs based on each OS,
yes? There are various things that effect each OS differently. The most
obvious to me being shared memory and wal_sync_method.

If could be a good idea to have.

Joshua D. Drake

--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
jam@zoidtechnologies.com
Date:
On Thu, Jan 11, 2007 at 03:12:07PM -0800, Joshua D. Drake wrote:
> It does seem like it might be a good idea to have FAQs based on each OS,
> yes? There are various things that effect each OS differently. The most
> obvious to me being shared memory and wal_sync_method.
>
> If could be a good idea to have.
>
> Joshua D. Drake
>

+1

regards,
J

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Alvaro Herrera
Date:
Joshua D. Drake wrote:

> > Please don't. At least not on the PostgreSQL web site nor in the docs.
> > And no, I don't run my production servers on Windows either.
>
> It does seem like it might be a good idea to have FAQs based on each OS,
> yes? There are various things that effect each OS differently. The most
> obvious to me being shared memory and wal_sync_method.

But we have per-platform FAQs.  If there is information missing, the
reason is that nobody has submitted an appropriate patch, nothing more.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
jam@zoidtechnologies.com
Date:
On Thu, Jan 11, 2007 at 09:42:38PM -0300, Alvaro Herrera wrote:
>
> But we have per-platform FAQs.  If there is information missing, the
> reason is that nobody has submitted an appropriate patch, nothing more.
>

where are these FAQs, and why were they not easily found when the original
poster sent his email? is there some "SEO" we need to do on the websites to
make things more obvious?

regards,
J



Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
"Joshua D. Drake"
Date:
On Thu, 2007-01-11 at 21:42 -0300, Alvaro Herrera wrote:
> Joshua D. Drake wrote:
>
> > > Please don't. At least not on the PostgreSQL web site nor in the docs.
> > > And no, I don't run my production servers on Windows either.
> >
> > It does seem like it might be a good idea to have FAQs based on each OS,
> > yes? There are various things that effect each OS differently. The most
> > obvious to me being shared memory and wal_sync_method.
>
> But we have per-platform FAQs.  If there is information missing, the
> reason is that nobody has submitted an appropriate patch, nothing more.

Yes you are correct, now that I look. It is not obviously apparent
though and they do appear to be quite out of date.

Joshua D. Drake




>
--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> On Thu, Jan 11, 2007 at 03:14:37PM -0500, Tom Lane wrote:
>> ... And anyway there should never
>> *be* a real permissions problem; if there is then the user's been poking
>> under the hood sufficient to void the warranty anyway ;-)

> Or some other "helpful" process such as a virus scanner has been poking
> under the hood for you... :(

One point worth making is that I'm not really convinced anymore that
we have proof that antivirus code has been creating any such problems.
We have several anecdotal cases where someone reported erratic
"permission denied" problems on Windows, and we suggested getting rid
of any AV code, and it seemed to fix their problem --- but how long did
they test?  This problem is inherently very timing-sensitive, and so the
fact that you don't see it for a little while is hardly proof that it's
gone.  See the report that started this thread for examples of apparent
correlations that are really quite spurious, like whether the test case
is being driven locally or not.  It could easy be that every report
we've heard really traces to the not-yet-deleted-file problem.

So basically what we'd have is that if you manually remove permissions
on a database file or directory you'd be risking data loss; but heck,
if you manually move, rename, delete such a file you're risking
(guaranteeing) data loss.  Any sane user is going to figure "keep your
fingers away from the moving parts"; or if he can't figure that out,
he's got no one but himself to blame.

It's not ideal, granted, but we're dealing with a much-less-than-ideal
OS, so we gotta make some compromises.

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
"Zeugswetter Andreas ADI SD"
Date:
> > I find it very unlikely that you would "during normal operations"
end up
> > in a situation where you would first have permissions to create
files in
> > a directory, and then lose them.
> > What could be is that you have a directory where you never had
> > permissions to create the file in the first place.
>
> > Any chance to differentiate between these?
>
> The cases we're concerned about involve access to an existing file,
not
> attempts to create a new one, so I'm not clear what your point is.

I am wondering if we can delete the file by opening it with
FILE_FLAG_DELETE_ON_CLOSE, and immediately close it again.
The semantics should be clear if we let the OS delete the file after the

last handle on it is closed ?
Until all handles are closed another process can still open it with
FILE_SHARE_DELETE (according to docs), but not without the flag.
This seems to be what we want.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/
fs/createfile.asp

If this fails (see the loop in dirmod.c) we could try to move it to
the recycle bin with SHFileOperation with FO_DELETE.

It seems the win unlink is not implemented correctly and we need to
replace it.
I don't feel easy with the ignore EACCES idea.

Should I try to supply a patch along this line ?

Andreas

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
On Fri, Jan 12, 2007 at 10:49:53AM +0100, Zeugswetter Andreas ADI SD wrote:
>
> > > I find it very unlikely that you would "during normal operations"
> end up
> > > in a situation where you would first have permissions to create
> files in
> > > a directory, and then lose them.
> > > What could be is that you have a directory where you never had
> > > permissions to create the file in the first place.
> >
> > > Any chance to differentiate between these?
> >
> > The cases we're concerned about involve access to an existing file,
> not
> > attempts to create a new one, so I'm not clear what your point is.
>
> I am wondering if we can delete the file by opening it with
> FILE_FLAG_DELETE_ON_CLOSE, and immediately close it again.
> The semantics should be clear if we let the OS delete the file after the
>
> last handle on it is closed ?
> Until all handles are closed another process can still open it with
> FILE_SHARE_DELETE (according to docs), but not without the flag.
> This seems to be what we want.
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/
> fs/createfile.asp
>
> If this fails (see the loop in dirmod.c) we could try to move it to
> the recycle bin with SHFileOperation with FO_DELETE.
>
> It seems the win unlink is not implemented correctly and we need to
> replace it.
> I don't feel easy with the ignore EACCES idea.
>
> Should I try to supply a patch along this line ?

Doesn't sound unreasonable, so yes, let's give it a try at least.

//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
On Thu, Jan 11, 2007 at 10:39:47PM -0500, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > On Thu, Jan 11, 2007 at 03:14:37PM -0500, Tom Lane wrote:
> >> ... And anyway there should never
> >> *be* a real permissions problem; if there is then the user's been poking
> >> under the hood sufficient to void the warranty anyway ;-)
>
> > Or some other "helpful" process such as a virus scanner has been poking
> > under the hood for you... :(
>
> One point worth making is that I'm not really convinced anymore that
> we have proof that antivirus code has been creating any such problems.

We do. I have positive proof of this being caused by AV software.

I don't know that it has been the problem in *all cases*, certainly, but
I've had kernel stacktraces pointing into AV filter drivers more than
once.


> We have several anecdotal cases where someone reported erratic
> "permission denied" problems on Windows, and we suggested getting rid
> of any AV code, and it seemed to fix their problem --- but how long did
> they test?  This problem is inherently very timing-sensitive, and so the
> fact that you don't see it for a little while is hardly proof that it's
> gone.  See the report that started this thread for examples of apparent
> correlations that are really quite spurious, like whether the test case
> is being driven locally or not.  It could easy be that every report
> we've heard really traces to the not-yet-deleted-file problem.

No, not all of them. But certainly a fair share of them can have been.

> So basically what we'd have is that if you manually remove permissions
> on a database file or directory you'd be risking data loss; but heck,
> if you manually move, rename, delete such a file you're risking
> (guaranteeing) data loss.

That was the point I was trying tom ake erarlier :-)

//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes:
> It seems the win unlink is not implemented correctly and we need to
> replace it.

Easier said than done ...

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Thu, Jan 11, 2007 at 10:39:47PM -0500, Tom Lane wrote:
>> One point worth making is that I'm not really convinced anymore that
>> we have proof that antivirus code has been creating any such problems.

> We do. I have positive proof of this being caused by AV software.
> I don't know that it has been the problem in *all cases*, certainly, but
> I've had kernel stacktraces pointing into AV filter drivers more than
> once.

No, I didn't claim that Windows AV software is bug-free ;-).  What I
said was that I'm not certain it's related to the "permission denied"
reports, as opposed to other problems.  Or are your stack traces
specifically for "permission denied" failures?

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
On Fri, Jan 12, 2007 at 09:47:55AM -0500, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > On Thu, Jan 11, 2007 at 10:39:47PM -0500, Tom Lane wrote:
> >> One point worth making is that I'm not really convinced anymore that
> >> we have proof that antivirus code has been creating any such problems.
>
> > We do. I have positive proof of this being caused by AV software.
> > I don't know that it has been the problem in *all cases*, certainly, but
> > I've had kernel stacktraces pointing into AV filter drivers more than
> > once.
>
> No, I didn't claim that Windows AV software is bug-free ;-).  What I
> said was that I'm not certain it's related to the "permission denied"
> reports, as opposed to other problems.  Or are your stack traces
> specifically for "permission denied" failures?

I have at least two cases specifically for the permission denied
failures in postgres.

//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Fri, Jan 12, 2007 at 09:47:55AM -0500, Tom Lane wrote:
>> No, I didn't claim that Windows AV software is bug-free ;-).  What I
>> said was that I'm not certain it's related to the "permission denied"
>> reports, as opposed to other problems.  Or are your stack traces
>> specifically for "permission denied" failures?

> I have at least two cases specifically for the permission denied
> failures in postgres.

Actually, it could still be the same problem, with the AV software only
involved to the extent that it's trying to scan files for viruses.
That could result in the AV code holding a table file open for a little
bit (or not such a little bit, if it's a big table) after it's nominally
been deleted, and that's exactly the situation we see checkpoints
failing in.

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Fri, Jan 12, 2007 at 09:47:55AM -0500, Tom Lane wrote:
>>> No, I didn't claim that Windows AV software is bug-free ;-).  What I
>>> said was that I'm not certain it's related to the "permission denied"
>>> reports, as opposed to other problems.  Or are your stack traces
>>> specifically for "permission denied" failures?
>
>> I have at least two cases specifically for the permission denied
>> failures in postgres.
>
> Actually, it could still be the same problem, with the AV software only
> involved to the extent that it's trying to scan files for viruses.
> That could result in the AV code holding a table file open for a little
> bit (or not such a little bit, if it's a big table) after it's nominally
> been deleted, and that's exactly the situation we see checkpoints
> failing in.

Partially the same, but I've seen AV software keeping it open for
hours... Basically until reboot.

//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> Actually, it could still be the same problem, with the AV software only
>> involved to the extent that it's trying to scan files for viruses.

> Partially the same, but I've seen AV software keeping it open for
> hours... Basically until reboot.

Well, the bug report that just went by proves there's another problem:

: select version();
:
: "PostgreSQL 8.2.1 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC)
: 3.4.2 (mingw-special)"
:
: pg_log:
:
: 2007-01-12 17:23:16 PANIC:  could not open control file
: "global/pg_control": Permission denied

pg_control is certainly not ever deleted or renamed, and in fact I
believe there's an LWLock enforcing that only one PG process at a time
is even touching it.  So we need another theory to explain this one :-(
... anyone have a better one than "Windows is a piece of junk"?

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Tom Lane wrote:
>>> Actually, it could still be the same problem, with the AV software only
>>> involved to the extent that it's trying to scan files for viruses.
>
>> Partially the same, but I've seen AV software keeping it open for
>> hours... Basically until reboot.
>
> Well, the bug report that just went by proves there's another problem:
>
> : select version();
> :
> : "PostgreSQL 8.2.1 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC)
> : 3.4.2 (mingw-special)"
> :
> : pg_log:
> :
> : 2007-01-12 17:23:16 PANIC:  could not open control file
> : "global/pg_control": Permission denied
>
> pg_control is certainly not ever deleted or renamed, and in fact I
> believe there's an LWLock enforcing that only one PG process at a time
> is even touching it.  So we need another theory to explain this one :-(
> ... anyone have a better one than "Windows is a piece of junk"?

Right. What we need is a list of which processes have handles open to
the file, which can be dumped using Process Explorer (there are other
sysinternals tools to do it as well, but PE is probably the easiest)-

//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> pg_control is certainly not ever deleted or renamed, and in fact I
>> believe there's an LWLock enforcing that only one PG process at a time
>> is even touching it.  So we need another theory to explain this one :-(

> Right. What we need is a list of which processes have handles open to
> the file, which can be dumped using Process Explorer (there are other
> sysinternals tools to do it as well, but PE is probably the easiest)-

Hmm, are you just assuming that the underlying error is
ERROR_SHARING_VIOLATION?  One of the things that's bothered me all along
is that there are a dozen different Windows error codes that we map to
EACCES ... perhaps it's time to think about disambiguating that a bit
better?

            regards, tom lane

Re: Checkpoint request failed on version 8.2.1.

From
Scott Ribe
Date:
> Comments?

Note when it happens, and if it doesn't succeed for some value of "too
long", at least escalate to ERROR message, possibly fail.

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



Re: Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
Scott Ribe <scott_ribe@killerbytes.com> writes:
> Note when it happens, and if it doesn't succeed for some value of "too
> long", at least escalate to ERROR message, possibly fail.

ERROR and "fail" are the same thing.  We could do this, and it wouldn't
even be much code, but it doesn't seem to address the problem real well:

1. It's still the case that if we let the checkpoint complete, and then
the system crashes before the file's actually been flushed to disk,
you'd lose data.  Putting a limit on the time we let this state persist
doesn't seem to me to accomplish a lot, mainly because it's hardly
credible that the system wouldn't have sync'd the data of its own accord
before the timeout.  (The place where we have this problem is *not*
trying to write data out to the kernel; it's trying to be sure that the
kernel has pushed previously-written data down to disk.  I do not know
Windows' exact policies for that, but I seriously doubt they let dirty
disk buffers sit around indefinitely.)

2. It'd certainly be useful if we could alert the DBA that something is
interfering with access to our disk files, but having the bgwriter log a
message in the postmaster log that says ERROR rather than only LOG is
probably not going to draw the average Windows user's attention :-(

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
"Jim C. Nasby"
Date:
On Thu, Jan 11, 2007 at 06:04:56PM -0500, Andrew Dunstan wrote:
> Please don't. At least not on the PostgreSQL web site nor in the docs.
> And no, I don't run my production servers on Windows either.
>
> For good or ill, we made a decision years ago to do a proper Windows
> port. I think that it's actually worked out reasonably well. All
> operating systems have warts. Not long ago I tended to advise people not
> to run mission critical Postgresql on Linux unless they were *very*
> careful, due to the over-commit issue.

Yes, and IIRC we documented the overcommit stuff as well.

This isn't about OS holy wars, it's about providing information so that
people can make an informed decision about what OS to run their database
on.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Tom Lane wrote:
>>> pg_control is certainly not ever deleted or renamed, and in fact I
>>> believe there's an LWLock enforcing that only one PG process at a time
>>> is even touching it.  So we need another theory to explain this one :-(
>
>> Right. What we need is a list of which processes have handles open to
>> the file, which can be dumped using Process Explorer (there are other
>> sysinternals tools to do it as well, but PE is probably the easiest)-
>
> Hmm, are you just assuming that the underlying error is
> ERROR_SHARING_VIOLATION?  One of the things that's bothered me all along
> is that there are a dozen different Windows error codes that we map to
> EACCES ... perhaps it's time to think about disambiguating that a bit
> better?

I was. Using PE is just one way to prove that was it :-)

But yeah, that's probably a good idea. A quick look at the code says we
should at least ask people who have this problem to give it a run with
logging at DEBUG5 which should then log exactly what the errorcode was.
Or are you seeing more places that need such logging first?


//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> But yeah, that's probably a good idea. A quick look at the code says we
> should at least ask people who have this problem to give it a run with
> logging at DEBUG5 which should then log exactly what the errorcode was.
> Or are you seeing more places that need such logging first?

DEBUG5 is going to be a bit voluminous, but let's try that if we can.

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> But yeah, that's probably a good idea. A quick look at the code says we
>> should at least ask people who have this problem to give it a run with
>> logging at DEBUG5 which should then log exactly what the errorcode was.
>> Or are you seeing more places that need such logging first?
>
> DEBUG5 is going to be a bit voluminous, but let's try that if we can.

Perhaps we should switch down the DEBUG level of it, at least until we
know what happens?

//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> DEBUG5 is going to be a bit voluminous, but let's try that if we can.

> Perhaps we should switch down the DEBUG level of it, at least until we
> know what happens?

That would have to wait on another update release, or at least someone
being willing to build a nonstandard executable for Windows, so let's
first see if people are willing to do the DEBUG5 bit.

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
"Takayuki Tsunakawa"
Date:
From: "Magnus Hagander" <magnus@hagander.net>
> But yeah, that's probably a good idea. A quick look at the code says
we
> should at least ask people who have this problem to give it a run
with
> logging at DEBUG5 which should then log exactly what the errorcode
was.
> Or are you seeing more places that need such logging first?

I'm sorry we can't get get the Win32 error code to be displayed.  I
got the following messages:

2007-01-16 09:24:48 DEBUG:  checkpoint starting
2007-01-16 09:24:48 ERROR:  could not open relation 1663/10819/18296:
Permission denied
2007-01-16 09:24:48 ERROR:  checkpoint request failed
2007-01-16 09:24:48 HINT:  Consult recent messages in the server log
for details.
2007-01-16 09:24:48 STATEMENT:  checkpoint;


The reason is that src/port/open.c does not use _dosmaperr().  It
converts the Win32 error code to errno directly.  EACCES is converted
from ERROR_ACCESS_DENIED only.  Mmm, we may have to compromise as
Tom-san says.

BTW, why does the bgwriter try to open and write the pages of already
dropped relations?  When dropping relations, DropRelFileNodeBuffers is
called to discard dirty buffers.  If the relation being dropeed has
already been registered in the list of files to be fsynced, isn't it
possible to remove the file from the list before unlinking the file,
asking bgwriter in a similar way as ForwardFsyncRequest()?

#
The timestamp at the head of each message is noisy since the event
viewer has the time info, isn't it?  Besides, several PostgreSQL
messages appeared as one entry of event log, separated by a LF instead
of CR LF.  On Windows, CR LF should separate lines.





Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> writes:
> BTW, why does the bgwriter try to open and write the pages of already
> dropped relations?

It does not; the problem is with stale fsync requests.

> If the relation being dropeed has
> already been registered in the list of files to be fsynced, isn't it
> possible to remove the file from the list before unlinking the file,
> asking bgwriter in a similar way as ForwardFsyncRequest()?

I suggested that here
http://archives.postgresql.org/pgsql-hackers/2007-01/msg00642.php
but have received no feedback about it ...

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
On Tue, Jan 16, 2007 at 10:20:04AM +0900, Takayuki Tsunakawa wrote:
> From: "Magnus Hagander" <magnus@hagander.net>
> > But yeah, that's probably a good idea. A quick look at the code says
> we
> > should at least ask people who have this problem to give it a run
> with
> > logging at DEBUG5 which should then log exactly what the errorcode
> was.
> > Or are you seeing more places that need such logging first?
>
> I'm sorry we can't get get the Win32 error code to be displayed.  I
> got the following messages:
>
> 2007-01-16 09:24:48 DEBUG:  checkpoint starting
> 2007-01-16 09:24:48 ERROR:  could not open relation 1663/10819/18296:
> Permission denied
> 2007-01-16 09:24:48 ERROR:  checkpoint request failed
> 2007-01-16 09:24:48 HINT:  Consult recent messages in the server log
> for details.
> 2007-01-16 09:24:48 STATEMENT:  checkpoint;
>
>
> The reason is that src/port/open.c does not use _dosmaperr().  It
> converts the Win32 error code to errno directly.  EACCES is converted
> from ERROR_ACCESS_DENIED only.  Mmm, we may have to compromise as
> Tom-san says.

Good point. In this case, we *know* it's access denied then, and not
some other error code.

And actually, when I look at the API docs, our case now seems to be
documented. Or am I misreading our situation. I have:

"If you call CreateFile on a file that is pending deletion as a result
of a previous call to DeleteFile, the function fails. The operating
system delays file deletion until all handles to the file are closed.
GetLastError returns ERROR_ACCESS_DENIED."


//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> And actually, when I look at the API docs, our case now seems to be
> documented. Or am I misreading our situation. I have:

> "If you call CreateFile on a file that is pending deletion as a result
> of a previous call to DeleteFile, the function fails. The operating
> system delays file deletion until all handles to the file are closed.
> GetLastError returns ERROR_ACCESS_DENIED."

We are not calling CreateFile ... we're just trying to open the thing.

            regards, tom lane

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Magnus Hagander
Date:
On Tue, Jan 16, 2007 at 11:11:59AM -0500, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > And actually, when I look at the API docs, our case now seems to be
> > documented. Or am I misreading our situation. I have:
>
> > "If you call CreateFile on a file that is pending deletion as a result
> > of a previous call to DeleteFile, the function fails. The operating
> > system delays file deletion until all handles to the file are closed.
> > GetLastError returns ERROR_ACCESS_DENIED."
>
> We are not calling CreateFile ... we're just trying to open the thing.

Yes, we are calling CreateFile. It's used to open files as well. Check
the code in src/port/open.c.

It's not very intuitive, but createfile is used both to create new files
and to open existing ones (create handles to them).

//Magnus

Re: [HACKERS] Checkpoint request failed on version 8.2.1.

From
Andrew Dunstan
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>
>> And actually, when I look at the API docs, our case now seems to be
>> documented. Or am I misreading our situation. I have:
>>
>
>
>> "If you call CreateFile on a file that is pending deletion as a result
>> of a previous call to DeleteFile, the function fails. The operating
>> system delays file deletion until all handles to the file are closed.
>> GetLastError returns ERROR_ACCESS_DENIED."
>>
>
> We are not calling CreateFile ... we're just trying to open the thing.
>
>

see src/port/open.c - pgwin32_open() calls CreateFile().

cheers

andrew