Thread: Checkpoint request failed on version 8.2.1.
Hi all. I'm getting a checkpoint request failed message when I try to execute a CREATE DATABASE command. Since it was a fresh install, I've included the entire server log up to the point of the error. I truncated the log output two lines after the error message. Is there a way I can avoid this problem? Is this a bug? Thanks. Patrick 2007-01-11 09:46:20 LOG: database system was shut down at 2007-01-11 09:46:16 Mountain Standard Time 2007-01-11 09:46:20 LOG: checkpoint record is at 0/487970 2007-01-11 09:46:20 LOG: redo record is at 0/487970; undo record is at 0/0; shutdown TRUE 2007-01-11 09:46:20 LOG: next transaction ID: 0/595; next OID: 10820 2007-01-11 09:46:20 LOG: next MultiXactId: 1; next MultiXactOffset: 0 2007-01-11 09:46:20 LOG: database system is ready 2007-01-11 09:47:51 LOG: received SIGHUP, reloading configuration files 2007-01-11 09:48:18 FATAL: no pg_hba.conf entry for host "192.168.2.102", user "sa", database "postgres", SSL off 2007-01-11 09:55:39 LOG: could not receive data from client: No connection could be made because the target machine actively refused it. 2007-01-11 09:55:39 LOG: unexpected EOF on client connection 2007-01-11 09:55:59 NOTICE: CREATE TABLE will create implicit sequence "Transactions_Transaction_ID_seq" for serial column "Transactions.Transaction_ID" 2007-01-11 09:55:59 NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "Transactions_pkey" for table "Transactions" 2007-01-11 09:56:05 NOTICE: CREATE TABLE will create implicit sequence "Database_Updates_ID_seq" for serial column "Database_Updates.ID" 2007-01-11 09:56:05 NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "Database_Updates_pkey" for table "Database_Updates" 2007-01-11 09:56:06 NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "Key_pkey" for table "Key" 2007-01-11 09:56:07 NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "UnitTestTable1_pkey" for table "UnitTestTable1" 2007-01-11 09:56:07 NOTICE: trigger "Delete_Tracking_Trigger" for table "UnitTestTable1" does not exist, skipping 2007-01-11 09:56:07 NOTICE: trigger "Update_Tracking_Trigger" for table "UnitTestTable1" does not exist, skipping 2007-01-11 09:56:07 NOTICE: trigger "Insert_Tracking_Trigger" for table "UnitTestTable1" does not exist, skipping 2007-01-11 09:56:10 NOTICE: drop cascades to sequence "_WEBDRIL"."IDSequence" 2007-01-11 09:56:10 NOTICE: drop cascades to table "_WEBDRIL"."Database_Id" 2007-01-11 09:56:10 NOTICE: drop cascades to table "_WEBDRIL"."Database_Updates" 2007-01-11 09:56:10 NOTICE: drop cascades to default for table "_WEBDRIL"."Database_Updates" column ID 2007-01-11 09:56:10 NOTICE: drop cascades to table "_WEBDRIL"."Transactions" 2007-01-11 09:56:10 NOTICE: drop cascades to default for table "_WEBDRIL"."Transactions" column Transaction_ID 2007-01-11 09:56:10 NOTICE: drop cascades to function "_WEBDRIL"."Get_Transaction_ID"() 2007-01-11 09:56:10 NOTICE: drop cascades to function "_WEBDRIL"."Current_Transaction_ID"() 2007-01-11 09:56:10 NOTICE: drop cascades to function "_WEBDRIL"."Get_User_ID"() 2007-01-11 09:56:11 NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "UnitTestTable1_pkey" for table "UnitTestTable1" 2007-01-11 09:56:11 NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "UnitTestTable2_pkey" for table "UnitTestTable2" 2007-01-11 09:56:17 ERROR: could not open relation 1663/16403/16426: Permission denied 2007-01-11 09:56:17 ERROR: checkpoint request failed 2007-01-11 09:56:17 HINT: Consult recent messages in the server log for details. 2007-01-11 09:56:17 STATEMENT: CREATE DATABASE "TestDatabase" 2007-01-11 09:56:18 NOTICE: drop cascades to table "TestSchema"."stable"
"Patrick Earl" <patearl@patearl.net> writes: > 2007-01-11 09:56:17 ERROR: could not open relation 1663/16403/16426: > Permission denied > 2007-01-11 09:56:17 ERROR: checkpoint request failed > 2007-01-11 09:56:17 HINT: Consult recent messages in the server log > for details. > 2007-01-11 09:56:17 STATEMENT: CREATE DATABASE "TestDatabase" Well, like the HINT says, consult recent messages for details. In this case the reason the checkpoint failed was evidently a file permissions problem. Is this repeatable? What platform is it on? regards, tom lane
We're getting the error as part of an automated test suite and it is seems to occur every time the suite is run. The platform is Win XP 64 bit. When running the same unit test suite from a remote machine, the error does not occur. The error also does not occur when manually running the create database command using pgadmin. We are logging into PostgreSQL as the same user. Patrick On 1/11/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Patrick Earl" <patearl@patearl.net> writes: > > 2007-01-11 09:56:17 ERROR: could not open relation 1663/16403/16426: > > Permission denied > > 2007-01-11 09:56:17 ERROR: checkpoint request failed > > 2007-01-11 09:56:17 HINT: Consult recent messages in the server log > > for details. > > 2007-01-11 09:56:17 STATEMENT: CREATE DATABASE "TestDatabase" > > Well, like the HINT says, consult recent messages for details. In this > case the reason the checkpoint failed was evidently a file permissions > problem. Is this repeatable? What platform is it on? > > regards, tom lane >
"Patrick Earl" <patearl@patearl.net> writes: > We're getting the error as part of an automated test suite and it is > seems to occur every time the suite is run. The platform is Win XP 64 > bit. Hm. We've seen problems of this ilk caused by bogus antivirus software, but if that were the explanation it's not clear why running the test locally vs remotely would change anything. Another theory is that it's got something to do with Windows' problems with unlinking open files --- we've found that "permission denied" can be the error code when attempting to open a file that's been unlinked but isn't yet gone because someone is holding it open. Can you determine whether the complained-of relation is one that's recently been deleted? regards, tom lane
There is no antivirus software running on the machine. I'm not entirely sure how to determine which relation it is complaining about. I see a folder that corresponds to the middle number in the log, and I see numbers in the same range as the right number from the log. In any case, the unit tests remove all contents and schema within the database before starting, and they remove the tables they create as they proceed. Certainly there are many things have been recently deleted. Patrick On 1/11/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Patrick Earl" <patearl@patearl.net> writes: > > We're getting the error as part of an automated test suite and it is > > seems to occur every time the suite is run. The platform is Win XP 64 > > bit. > > Hm. We've seen problems of this ilk caused by bogus antivirus software, > but if that were the explanation it's not clear why running the test > locally vs remotely would change anything. > > Another theory is that it's got something to do with Windows' problems > with unlinking open files --- we've found that "permission denied" can > be the error code when attempting to open a file that's been unlinked > but isn't yet gone because someone is holding it open. Can you > determine whether the complained-of relation is one that's recently been > deleted? > > regards, tom lane >
"Patrick Earl" <patearl@patearl.net> writes: > In any case, the unit tests remove all contents and schema within the > database before starting, and they remove the tables they create as > they proceed. Certainly there are many things have been recently > deleted. Yeah, I think then there's no question that the bgwriter is trying to fsync something that's been deleted but isn't yet closed by every process. We have things set up so that that's not a really serious problem anymore --- eventually it will be closed and then the next checkpoint will succeed. But CREATE DATABASE insists on checkpointing and so it's vulnerable to even a transient failure. I've been resisting changing the checkpoint code to treat EACCES as a non-error situation on Windows, but maybe we have no choice. How do people feel about this idea: #ifdef WIN32 and the open or fsync fails with EACCES, then 1. Emit a LOG (or maybe DEBUG) message noting the problem. 2. Leave the fsync request entry in the hashtable for next time. 3. Allow the current checkpoint to complete normally anyway. If the file has actually been deleted, then eventually it will be closed and the next checkpoint will be able to remove the hash entry. If there's something else wrong, we'll keep bleating and maybe the DBA will notice eventually. The downside of this is that a real EACCES problem wouldn't get noted at any level higher than LOG, and so you could theoretically lose data without much warning. But I'm not seeing anything else we could do about it --- AFAIK we have not heard of a way we can distinguish this case from a real permissions problem. And anyway there should never *be* a real permissions problem; if there is then the user's been poking under the hood sufficient to void the warranty anyway ;-) Comments? regards, tom lane
Tom Lane wrote: > "Patrick Earl" <patearl@patearl.net> writes: >> In any case, the unit tests remove all contents and schema within the >> database before starting, and they remove the tables they create as >> they proceed. Certainly there are many things have been recently >> deleted. > > Yeah, I think then there's no question that the bgwriter is trying to > fsync something that's been deleted but isn't yet closed by every > process. We have things set up so that that's not a really serious > problem anymore --- eventually it will be closed and then the next > checkpoint will succeed. But CREATE DATABASE insists on checkpointing > and so it's vulnerable to even a transient failure. > > I've been resisting changing the checkpoint code to treat EACCES as a > non-error situation on Windows, but maybe we have no choice. How do > people feel about this idea: #ifdef WIN32 and the open or fsync fails > with EACCES, then > > 1. Emit a LOG (or maybe DEBUG) message noting the problem. > 2. Leave the fsync request entry in the hashtable for next time. > 3. Allow the current checkpoint to complete normally anyway. > > If the file has actually been deleted, then eventually it will be closed > and the next checkpoint will be able to remove the hash entry. If > there's something else wrong, we'll keep bleating and maybe the DBA will > notice eventually. > > The downside of this is that a real EACCES problem wouldn't get noted at > any level higher than LOG, and so you could theoretically lose data > without much warning. But I'm not seeing anything else we could do > about it --- AFAIK we have not heard of a way we can distinguish this > case from a real permissions problem. And anyway there should never > *be* a real permissions problem; if there is then the user's been poking > under the hood sufficient to void the warranty anyway ;-) > > Comments? I find it very unlikely that you would "during normal operations" end up in a situation where you would first have permissions to create files in a directory, and then lose them. What could be is that you have a directory where you never had permissions to create the file in the first place. Any chance to differentiate between these? In the first case, someone did something to change the permissions, and can be expected to actually check that things continued to work after that. In the second case, it would be nice if it was possible to catch it faster. //Magnus
Magnus Hagander <magnus@hagander.net> writes: > I find it very unlikely that you would "during normal operations" end up > in a situation where you would first have permissions to create files in > a directory, and then lose them. > What could be is that you have a directory where you never had > permissions to create the file in the first place. > Any chance to differentiate between these? The cases we're concerned about involve access to an existing file, not attempts to create a new one, so I'm not clear what your point is. I would certainly *love* to differentiate between these failures and ordinary permissions failures, but so far as I've heard we can't. regards, tom lane
On Thu, Jan 11, 2007 at 03:14:37PM -0500, Tom Lane wrote: > The downside of this is that a real EACCES problem wouldn't get noted at > any level higher than LOG, and so you could theoretically lose data > without much warning. But I'm not seeing anything else we could do > about it --- AFAIK we have not heard of a way we can distinguish this > case from a real permissions problem. And anyway there should never > *be* a real permissions problem; if there is then the user's been poking > under the hood sufficient to void the warranty anyway ;-) Or some other "helpful" process such as a virus scanner has been poking under the hood for you... :( Given that this could result in data loss, if this was to be done I'd very much want to see a way to disable it in a production environment. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> writes: > Given that this could result in data loss, if this was to be done I'd > very much want to see a way to disable it in a production environment. Production environments are the same ones that won't be happy with random checkpoint failures, either. If we can't find a way to positively identify the deleted-file failures then I think we've got to do something like this. (You know, of course, that my opinion is that no sane person would run a production database on Windows in the first place. So the data-loss risk to me seems less of a problem than the unexpected-failures problem. It's not like there aren't a ton of other data-loss scenarios in that OS that we can't do anything about...) regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> I find it very unlikely that you would "during normal operations" end up >> in a situation where you would first have permissions to create files in >> a directory, and then lose them. >> What could be is that you have a directory where you never had >> permissions to create the file in the first place. > >> Any chance to differentiate between these? > > The cases we're concerned about involve access to an existing file, not > attempts to create a new one, so I'm not clear what your point is. Well, then I don't see it as being a big problem, which was the question, I think. If pgsql had permissions to create the file, it would never lose it unless the dba changed something - and if the dba changed something, then he should check his logs afterwards to make sure he didn't break anything. My point is that if we know that *we* could create the file, than the probability of it being an *actual* permissions problem is very low during normal operations. So it's most likely "the delete issue", and thus doing what you propose does seem like a fairly safe bet. > I would certainly *love* to differentiate between these failures and > ordinary permissions failures, but so far as I've heard we can't. Right, that's the base problem. //Magnus
On Thu, Jan 11, 2007 at 04:32:42PM -0500, Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: > > Given that this could result in data loss, if this was to be done I'd > > very much want to see a way to disable it in a production environment. > > Production environments are the same ones that won't be happy with > random checkpoint failures, either. Maybe I'm not understanding what happens in a checkpoint failure, but I'd rather have my pg_xlog fill up (hopefully with a lot af WARNINGS thrown before-hand) and face a long recovery than lose data... > If we can't find a way to positively identify the deleted-file failures > then I think we've got to do something like this. > > (You know, of course, that my opinion is that no sane person would run a > production database on Windows in the first place. So the data-loss > risk to me seems less of a problem than the unexpected-failures problem. > It's not like there aren't a ton of other data-loss scenarios in that OS > that we can't do anything about...) Yeah, and I share your opinion. Unfortunately, a lot of others do not. :( It would be useful if we had a page somewhere that explained in detail what these data-loss issues were and why they're out of our control. At least then people would (or could...) understand why production + Windows == BadIdea. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Thu, 11 Jan 2007, Tom Lane wrote: ...snip... > > (You know, of course, that my opinion is that no sane person would run a > production database on Windows in the first place. So the data-loss > risk to me seems less of a problem than the unexpected-failures problem. > It's not like there aren't a ton of other data-loss scenarios in that OS > that we can't do anything about...) > > regards, tom lane > PLEASE OH PLEASE document every f-ing one of them! (And I don't mean document Windows issues as comments in the source code. Best would be in the official documentation/on a web page.) On occasion, I could *really* use such a list! (If such already exists, please point me at it!) Thing is, Tom, not everybody has the same level of information you have on the subject... Regards, Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
Richard Troy wrote: > On Thu, 11 Jan 2007, Tom Lane wrote: > > ...snip... > >> (You know, of course, that my opinion is that no sane person would run a >> production database on Windows in the first place. So the data-loss >> risk to me seems less of a problem than the unexpected-failures problem. >> It's not like there aren't a ton of other data-loss scenarios in that OS >> that we can't do anything about...) >> >> >> > > PLEASE OH PLEASE document every f-ing one of them! (And I don't mean > document Windows issues as comments in the source code. Best would be in > the official documentation/on a web page.) On occasion, I could *really* > use such a list! (If such already exists, please point me at it!) > > Thing is, Tom, not everybody has the same level of information you have on > the subject... > > > Please don't. At least not on the PostgreSQL web site nor in the docs. And no, I don't run my production servers on Windows either. For good or ill, we made a decision years ago to do a proper Windows port. I think that it's actually worked out reasonably well. All operating systems have warts. Not long ago I tended to advise people not to run mission critical Postgresql on Linux unless they were *very* careful, due to the over-commit issue. In fact, I don't trust any OS. I use dumps and backups and replication to protect myself from them all. In the present instance, the data loss risk is largely theoretical, as I understand it, as we don't expect a genuine EACCESS error. cheers andrew
> > > > > > > > > Please don't. At least not on the PostgreSQL web site nor in the docs. > And no, I don't run my production servers on Windows either. It does seem like it might be a good idea to have FAQs based on each OS, yes? There are various things that effect each OS differently. The most obvious to me being shared memory and wal_sync_method. If could be a good idea to have. Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
On Thu, Jan 11, 2007 at 03:12:07PM -0800, Joshua D. Drake wrote: > It does seem like it might be a good idea to have FAQs based on each OS, > yes? There are various things that effect each OS differently. The most > obvious to me being shared memory and wal_sync_method. > > If could be a good idea to have. > > Joshua D. Drake > +1 regards, J
Joshua D. Drake wrote: > > Please don't. At least not on the PostgreSQL web site nor in the docs. > > And no, I don't run my production servers on Windows either. > > It does seem like it might be a good idea to have FAQs based on each OS, > yes? There are various things that effect each OS differently. The most > obvious to me being shared memory and wal_sync_method. But we have per-platform FAQs. If there is information missing, the reason is that nobody has submitted an appropriate patch, nothing more. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, Jan 11, 2007 at 09:42:38PM -0300, Alvaro Herrera wrote: > > But we have per-platform FAQs. If there is information missing, the > reason is that nobody has submitted an appropriate patch, nothing more. > where are these FAQs, and why were they not easily found when the original poster sent his email? is there some "SEO" we need to do on the websites to make things more obvious? regards, J
On Thu, 2007-01-11 at 21:42 -0300, Alvaro Herrera wrote: > Joshua D. Drake wrote: > > > > Please don't. At least not on the PostgreSQL web site nor in the docs. > > > And no, I don't run my production servers on Windows either. > > > > It does seem like it might be a good idea to have FAQs based on each OS, > > yes? There are various things that effect each OS differently. The most > > obvious to me being shared memory and wal_sync_method. > > But we have per-platform FAQs. If there is information missing, the > reason is that nobody has submitted an appropriate patch, nothing more. Yes you are correct, now that I look. It is not obviously apparent though and they do appear to be quite out of date. Joshua D. Drake > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
"Jim C. Nasby" <jim@nasby.net> writes: > On Thu, Jan 11, 2007 at 03:14:37PM -0500, Tom Lane wrote: >> ... And anyway there should never >> *be* a real permissions problem; if there is then the user's been poking >> under the hood sufficient to void the warranty anyway ;-) > Or some other "helpful" process such as a virus scanner has been poking > under the hood for you... :( One point worth making is that I'm not really convinced anymore that we have proof that antivirus code has been creating any such problems. We have several anecdotal cases where someone reported erratic "permission denied" problems on Windows, and we suggested getting rid of any AV code, and it seemed to fix their problem --- but how long did they test? This problem is inherently very timing-sensitive, and so the fact that you don't see it for a little while is hardly proof that it's gone. See the report that started this thread for examples of apparent correlations that are really quite spurious, like whether the test case is being driven locally or not. It could easy be that every report we've heard really traces to the not-yet-deleted-file problem. So basically what we'd have is that if you manually remove permissions on a database file or directory you'd be risking data loss; but heck, if you manually move, rename, delete such a file you're risking (guaranteeing) data loss. Any sane user is going to figure "keep your fingers away from the moving parts"; or if he can't figure that out, he's got no one but himself to blame. It's not ideal, granted, but we're dealing with a much-less-than-ideal OS, so we gotta make some compromises. regards, tom lane
> > I find it very unlikely that you would "during normal operations" end up > > in a situation where you would first have permissions to create files in > > a directory, and then lose them. > > What could be is that you have a directory where you never had > > permissions to create the file in the first place. > > > Any chance to differentiate between these? > > The cases we're concerned about involve access to an existing file, not > attempts to create a new one, so I'm not clear what your point is. I am wondering if we can delete the file by opening it with FILE_FLAG_DELETE_ON_CLOSE, and immediately close it again. The semantics should be clear if we let the OS delete the file after the last handle on it is closed ? Until all handles are closed another process can still open it with FILE_SHARE_DELETE (according to docs), but not without the flag. This seems to be what we want. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/ fs/createfile.asp If this fails (see the loop in dirmod.c) we could try to move it to the recycle bin with SHFileOperation with FO_DELETE. It seems the win unlink is not implemented correctly and we need to replace it. I don't feel easy with the ignore EACCES idea. Should I try to supply a patch along this line ? Andreas
On Fri, Jan 12, 2007 at 10:49:53AM +0100, Zeugswetter Andreas ADI SD wrote: > > > > I find it very unlikely that you would "during normal operations" > end up > > > in a situation where you would first have permissions to create > files in > > > a directory, and then lose them. > > > What could be is that you have a directory where you never had > > > permissions to create the file in the first place. > > > > > Any chance to differentiate between these? > > > > The cases we're concerned about involve access to an existing file, > not > > attempts to create a new one, so I'm not clear what your point is. > > I am wondering if we can delete the file by opening it with > FILE_FLAG_DELETE_ON_CLOSE, and immediately close it again. > The semantics should be clear if we let the OS delete the file after the > > last handle on it is closed ? > Until all handles are closed another process can still open it with > FILE_SHARE_DELETE (according to docs), but not without the flag. > This seems to be what we want. > http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/ > fs/createfile.asp > > If this fails (see the loop in dirmod.c) we could try to move it to > the recycle bin with SHFileOperation with FO_DELETE. > > It seems the win unlink is not implemented correctly and we need to > replace it. > I don't feel easy with the ignore EACCES idea. > > Should I try to supply a patch along this line ? Doesn't sound unreasonable, so yes, let's give it a try at least. //Magnus
On Thu, Jan 11, 2007 at 10:39:47PM -0500, Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: > > On Thu, Jan 11, 2007 at 03:14:37PM -0500, Tom Lane wrote: > >> ... And anyway there should never > >> *be* a real permissions problem; if there is then the user's been poking > >> under the hood sufficient to void the warranty anyway ;-) > > > Or some other "helpful" process such as a virus scanner has been poking > > under the hood for you... :( > > One point worth making is that I'm not really convinced anymore that > we have proof that antivirus code has been creating any such problems. We do. I have positive proof of this being caused by AV software. I don't know that it has been the problem in *all cases*, certainly, but I've had kernel stacktraces pointing into AV filter drivers more than once. > We have several anecdotal cases where someone reported erratic > "permission denied" problems on Windows, and we suggested getting rid > of any AV code, and it seemed to fix their problem --- but how long did > they test? This problem is inherently very timing-sensitive, and so the > fact that you don't see it for a little while is hardly proof that it's > gone. See the report that started this thread for examples of apparent > correlations that are really quite spurious, like whether the test case > is being driven locally or not. It could easy be that every report > we've heard really traces to the not-yet-deleted-file problem. No, not all of them. But certainly a fair share of them can have been. > So basically what we'd have is that if you manually remove permissions > on a database file or directory you'd be risking data loss; but heck, > if you manually move, rename, delete such a file you're risking > (guaranteeing) data loss. That was the point I was trying tom ake erarlier :-) //Magnus
"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes: > It seems the win unlink is not implemented correctly and we need to > replace it. Easier said than done ... regards, tom lane
Magnus Hagander <magnus@hagander.net> writes: > On Thu, Jan 11, 2007 at 10:39:47PM -0500, Tom Lane wrote: >> One point worth making is that I'm not really convinced anymore that >> we have proof that antivirus code has been creating any such problems. > We do. I have positive proof of this being caused by AV software. > I don't know that it has been the problem in *all cases*, certainly, but > I've had kernel stacktraces pointing into AV filter drivers more than > once. No, I didn't claim that Windows AV software is bug-free ;-). What I said was that I'm not certain it's related to the "permission denied" reports, as opposed to other problems. Or are your stack traces specifically for "permission denied" failures? regards, tom lane
On Fri, Jan 12, 2007 at 09:47:55AM -0500, Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: > > On Thu, Jan 11, 2007 at 10:39:47PM -0500, Tom Lane wrote: > >> One point worth making is that I'm not really convinced anymore that > >> we have proof that antivirus code has been creating any such problems. > > > We do. I have positive proof of this being caused by AV software. > > I don't know that it has been the problem in *all cases*, certainly, but > > I've had kernel stacktraces pointing into AV filter drivers more than > > once. > > No, I didn't claim that Windows AV software is bug-free ;-). What I > said was that I'm not certain it's related to the "permission denied" > reports, as opposed to other problems. Or are your stack traces > specifically for "permission denied" failures? I have at least two cases specifically for the permission denied failures in postgres. //Magnus
Magnus Hagander <magnus@hagander.net> writes: > On Fri, Jan 12, 2007 at 09:47:55AM -0500, Tom Lane wrote: >> No, I didn't claim that Windows AV software is bug-free ;-). What I >> said was that I'm not certain it's related to the "permission denied" >> reports, as opposed to other problems. Or are your stack traces >> specifically for "permission denied" failures? > I have at least two cases specifically for the permission denied > failures in postgres. Actually, it could still be the same problem, with the AV software only involved to the extent that it's trying to scan files for viruses. That could result in the AV code holding a table file open for a little bit (or not such a little bit, if it's a big table) after it's nominally been deleted, and that's exactly the situation we see checkpoints failing in. regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> On Fri, Jan 12, 2007 at 09:47:55AM -0500, Tom Lane wrote: >>> No, I didn't claim that Windows AV software is bug-free ;-). What I >>> said was that I'm not certain it's related to the "permission denied" >>> reports, as opposed to other problems. Or are your stack traces >>> specifically for "permission denied" failures? > >> I have at least two cases specifically for the permission denied >> failures in postgres. > > Actually, it could still be the same problem, with the AV software only > involved to the extent that it's trying to scan files for viruses. > That could result in the AV code holding a table file open for a little > bit (or not such a little bit, if it's a big table) after it's nominally > been deleted, and that's exactly the situation we see checkpoints > failing in. Partially the same, but I've seen AV software keeping it open for hours... Basically until reboot. //Magnus
Magnus Hagander <magnus@hagander.net> writes: > Tom Lane wrote: >> Actually, it could still be the same problem, with the AV software only >> involved to the extent that it's trying to scan files for viruses. > Partially the same, but I've seen AV software keeping it open for > hours... Basically until reboot. Well, the bug report that just went by proves there's another problem: : select version(); : : "PostgreSQL 8.2.1 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) : 3.4.2 (mingw-special)" : : pg_log: : : 2007-01-12 17:23:16 PANIC: could not open control file : "global/pg_control": Permission denied pg_control is certainly not ever deleted or renamed, and in fact I believe there's an LWLock enforcing that only one PG process at a time is even touching it. So we need another theory to explain this one :-( ... anyone have a better one than "Windows is a piece of junk"? regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> Tom Lane wrote: >>> Actually, it could still be the same problem, with the AV software only >>> involved to the extent that it's trying to scan files for viruses. > >> Partially the same, but I've seen AV software keeping it open for >> hours... Basically until reboot. > > Well, the bug report that just went by proves there's another problem: > > : select version(); > : > : "PostgreSQL 8.2.1 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) > : 3.4.2 (mingw-special)" > : > : pg_log: > : > : 2007-01-12 17:23:16 PANIC: could not open control file > : "global/pg_control": Permission denied > > pg_control is certainly not ever deleted or renamed, and in fact I > believe there's an LWLock enforcing that only one PG process at a time > is even touching it. So we need another theory to explain this one :-( > ... anyone have a better one than "Windows is a piece of junk"? Right. What we need is a list of which processes have handles open to the file, which can be dumped using Process Explorer (there are other sysinternals tools to do it as well, but PE is probably the easiest)- //Magnus
Magnus Hagander <magnus@hagander.net> writes: > Tom Lane wrote: >> pg_control is certainly not ever deleted or renamed, and in fact I >> believe there's an LWLock enforcing that only one PG process at a time >> is even touching it. So we need another theory to explain this one :-( > Right. What we need is a list of which processes have handles open to > the file, which can be dumped using Process Explorer (there are other > sysinternals tools to do it as well, but PE is probably the easiest)- Hmm, are you just assuming that the underlying error is ERROR_SHARING_VIOLATION? One of the things that's bothered me all along is that there are a dozen different Windows error codes that we map to EACCES ... perhaps it's time to think about disambiguating that a bit better? regards, tom lane
> Comments? Note when it happens, and if it doesn't succeed for some value of "too long", at least escalate to ERROR message, possibly fail. -- Scott Ribe scott_ribe@killerbytes.com http://www.killerbytes.com/ (303) 722-0567 voice
Scott Ribe <scott_ribe@killerbytes.com> writes: > Note when it happens, and if it doesn't succeed for some value of "too > long", at least escalate to ERROR message, possibly fail. ERROR and "fail" are the same thing. We could do this, and it wouldn't even be much code, but it doesn't seem to address the problem real well: 1. It's still the case that if we let the checkpoint complete, and then the system crashes before the file's actually been flushed to disk, you'd lose data. Putting a limit on the time we let this state persist doesn't seem to me to accomplish a lot, mainly because it's hardly credible that the system wouldn't have sync'd the data of its own accord before the timeout. (The place where we have this problem is *not* trying to write data out to the kernel; it's trying to be sure that the kernel has pushed previously-written data down to disk. I do not know Windows' exact policies for that, but I seriously doubt they let dirty disk buffers sit around indefinitely.) 2. It'd certainly be useful if we could alert the DBA that something is interfering with access to our disk files, but having the bgwriter log a message in the postmaster log that says ERROR rather than only LOG is probably not going to draw the average Windows user's attention :-( regards, tom lane
On Thu, Jan 11, 2007 at 06:04:56PM -0500, Andrew Dunstan wrote: > Please don't. At least not on the PostgreSQL web site nor in the docs. > And no, I don't run my production servers on Windows either. > > For good or ill, we made a decision years ago to do a proper Windows > port. I think that it's actually worked out reasonably well. All > operating systems have warts. Not long ago I tended to advise people not > to run mission critical Postgresql on Linux unless they were *very* > careful, due to the over-commit issue. Yes, and IIRC we documented the overcommit stuff as well. This isn't about OS holy wars, it's about providing information so that people can make an informed decision about what OS to run their database on. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> Tom Lane wrote: >>> pg_control is certainly not ever deleted or renamed, and in fact I >>> believe there's an LWLock enforcing that only one PG process at a time >>> is even touching it. So we need another theory to explain this one :-( > >> Right. What we need is a list of which processes have handles open to >> the file, which can be dumped using Process Explorer (there are other >> sysinternals tools to do it as well, but PE is probably the easiest)- > > Hmm, are you just assuming that the underlying error is > ERROR_SHARING_VIOLATION? One of the things that's bothered me all along > is that there are a dozen different Windows error codes that we map to > EACCES ... perhaps it's time to think about disambiguating that a bit > better? I was. Using PE is just one way to prove that was it :-) But yeah, that's probably a good idea. A quick look at the code says we should at least ask people who have this problem to give it a run with logging at DEBUG5 which should then log exactly what the errorcode was. Or are you seeing more places that need such logging first? //Magnus
Magnus Hagander <magnus@hagander.net> writes: > But yeah, that's probably a good idea. A quick look at the code says we > should at least ask people who have this problem to give it a run with > logging at DEBUG5 which should then log exactly what the errorcode was. > Or are you seeing more places that need such logging first? DEBUG5 is going to be a bit voluminous, but let's try that if we can. regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> But yeah, that's probably a good idea. A quick look at the code says we >> should at least ask people who have this problem to give it a run with >> logging at DEBUG5 which should then log exactly what the errorcode was. >> Or are you seeing more places that need such logging first? > > DEBUG5 is going to be a bit voluminous, but let's try that if we can. Perhaps we should switch down the DEBUG level of it, at least until we know what happens? //Magnus
Magnus Hagander <magnus@hagander.net> writes: > Tom Lane wrote: >> DEBUG5 is going to be a bit voluminous, but let's try that if we can. > Perhaps we should switch down the DEBUG level of it, at least until we > know what happens? That would have to wait on another update release, or at least someone being willing to build a nonstandard executable for Windows, so let's first see if people are willing to do the DEBUG5 bit. regards, tom lane
From: "Magnus Hagander" <magnus@hagander.net> > But yeah, that's probably a good idea. A quick look at the code says we > should at least ask people who have this problem to give it a run with > logging at DEBUG5 which should then log exactly what the errorcode was. > Or are you seeing more places that need such logging first? I'm sorry we can't get get the Win32 error code to be displayed. I got the following messages: 2007-01-16 09:24:48 DEBUG: checkpoint starting 2007-01-16 09:24:48 ERROR: could not open relation 1663/10819/18296: Permission denied 2007-01-16 09:24:48 ERROR: checkpoint request failed 2007-01-16 09:24:48 HINT: Consult recent messages in the server log for details. 2007-01-16 09:24:48 STATEMENT: checkpoint; The reason is that src/port/open.c does not use _dosmaperr(). It converts the Win32 error code to errno directly. EACCES is converted from ERROR_ACCESS_DENIED only. Mmm, we may have to compromise as Tom-san says. BTW, why does the bgwriter try to open and write the pages of already dropped relations? When dropping relations, DropRelFileNodeBuffers is called to discard dirty buffers. If the relation being dropeed has already been registered in the list of files to be fsynced, isn't it possible to remove the file from the list before unlinking the file, asking bgwriter in a similar way as ForwardFsyncRequest()? # The timestamp at the head of each message is noisy since the event viewer has the time info, isn't it? Besides, several PostgreSQL messages appeared as one entry of event log, separated by a LF instead of CR LF. On Windows, CR LF should separate lines.
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> writes: > BTW, why does the bgwriter try to open and write the pages of already > dropped relations? It does not; the problem is with stale fsync requests. > If the relation being dropeed has > already been registered in the list of files to be fsynced, isn't it > possible to remove the file from the list before unlinking the file, > asking bgwriter in a similar way as ForwardFsyncRequest()? I suggested that here http://archives.postgresql.org/pgsql-hackers/2007-01/msg00642.php but have received no feedback about it ... regards, tom lane
On Tue, Jan 16, 2007 at 10:20:04AM +0900, Takayuki Tsunakawa wrote: > From: "Magnus Hagander" <magnus@hagander.net> > > But yeah, that's probably a good idea. A quick look at the code says > we > > should at least ask people who have this problem to give it a run > with > > logging at DEBUG5 which should then log exactly what the errorcode > was. > > Or are you seeing more places that need such logging first? > > I'm sorry we can't get get the Win32 error code to be displayed. I > got the following messages: > > 2007-01-16 09:24:48 DEBUG: checkpoint starting > 2007-01-16 09:24:48 ERROR: could not open relation 1663/10819/18296: > Permission denied > 2007-01-16 09:24:48 ERROR: checkpoint request failed > 2007-01-16 09:24:48 HINT: Consult recent messages in the server log > for details. > 2007-01-16 09:24:48 STATEMENT: checkpoint; > > > The reason is that src/port/open.c does not use _dosmaperr(). It > converts the Win32 error code to errno directly. EACCES is converted > from ERROR_ACCESS_DENIED only. Mmm, we may have to compromise as > Tom-san says. Good point. In this case, we *know* it's access denied then, and not some other error code. And actually, when I look at the API docs, our case now seems to be documented. Or am I misreading our situation. I have: "If you call CreateFile on a file that is pending deletion as a result of a previous call to DeleteFile, the function fails. The operating system delays file deletion until all handles to the file are closed. GetLastError returns ERROR_ACCESS_DENIED." //Magnus
Magnus Hagander <magnus@hagander.net> writes: > And actually, when I look at the API docs, our case now seems to be > documented. Or am I misreading our situation. I have: > "If you call CreateFile on a file that is pending deletion as a result > of a previous call to DeleteFile, the function fails. The operating > system delays file deletion until all handles to the file are closed. > GetLastError returns ERROR_ACCESS_DENIED." We are not calling CreateFile ... we're just trying to open the thing. regards, tom lane
On Tue, Jan 16, 2007 at 11:11:59AM -0500, Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: > > And actually, when I look at the API docs, our case now seems to be > > documented. Or am I misreading our situation. I have: > > > "If you call CreateFile on a file that is pending deletion as a result > > of a previous call to DeleteFile, the function fails. The operating > > system delays file deletion until all handles to the file are closed. > > GetLastError returns ERROR_ACCESS_DENIED." > > We are not calling CreateFile ... we're just trying to open the thing. Yes, we are calling CreateFile. It's used to open files as well. Check the code in src/port/open.c. It's not very intuitive, but createfile is used both to create new files and to open existing ones (create handles to them). //Magnus
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: > >> And actually, when I look at the API docs, our case now seems to be >> documented. Or am I misreading our situation. I have: >> > > >> "If you call CreateFile on a file that is pending deletion as a result >> of a previous call to DeleteFile, the function fails. The operating >> system delays file deletion until all handles to the file are closed. >> GetLastError returns ERROR_ACCESS_DENIED." >> > > We are not calling CreateFile ... we're just trying to open the thing. > > see src/port/open.c - pgwin32_open() calls CreateFile(). cheers andrew