Thread: FATAL: lock file "postmaster.pid" already exists
Hi,
On Windows 2008, sometimes the server fails to start due to an existing "postmaster.pid" file.
I tried rebooting a few times and even force shutting down the server, and it started up fine.
It seems to be a race condition of sorts in the code that detects whether the process with the PID
in the file is running or not.
Does anyone have this same problem? Is there any way to fix it besides removing the PID file
manually each time the server complains about this?
Thanks,
Deepak
On 8 May 2012, at 24:34, deepak wrote:
> Hi,
>
> On Windows 2008, sometimes the server fails to start due to an existing "postmaster.pid" file.
>
> I tried rebooting a few times and even force shutting down the server, and it started up fine.
> It seems to be a race-condition of sorts in the code that detects whether the process with PID
> in the file is running or not.
No, it means that postgres wasn't shut down properly when Windows shut down. Removing the pid-file is one of the last things the shut-down procedure does. The file is used to prevent 2 instances of the same server running on the same data-directory.
If it's a race-condition, it's probably one in Microsoft's shutdown code. I've seen similar problems with Outlook mailboxes on a network directory; Windows unmounts the remote file-systems before Outlook finished updating its files under that mount point, so Outlook throws an error message and Windows doesn't shut down because of that.
I don't suppose that pid-file is on a remote file-system?
> Does any one have this same problem? Any way to fix it besides removing the PID file
> manually each time the server complains about this?
You could probably script removal of the pid file if its creation date is before the time the system started booting up.
Alban Hertroys
--
The scale of a problem often equals the size of an ego.
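A minimal sketch of the scripted cleanup suggested above, written in Python. The function name `remove_stale_pid_file` and the `boot_time` parameter are hypothetical; how the boot time is obtained is left to the caller, and the sketch compares the file's modification time rather than its creation date. Deleting the lock file bypasses the server's own protection against two instances on one data directory, so this only makes sense when no postgres process from the current boot can own the file.

```python
from pathlib import Path


def remove_stale_pid_file(pid_file, boot_time):
    """Delete pid_file if it was last modified before boot_time.

    boot_time is in seconds since the epoch and is supplied by the caller
    (an assumption of this sketch). Returns True if the file was removed.
    """
    path = Path(pid_file)
    try:
        modified = path.stat().st_mtime
    except FileNotFoundError:
        return False  # nothing to clean up
    if modified >= boot_time:
        # Written during the current boot: a live server may own it.
        return False
    path.unlink()
    return True
```

A file written after the boot time is deliberately left alone, because in that case the lock may be legitimate.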
On Tue, May 8, 2012 at 3:09 AM, Alban Hertroys <haramrae@gmail.com> wrote:
> I don't suppose that pid-file is on a remote file-system?
No, it's local.
> You could probably script removal of the pid file if its creation date is before the time the system started booting up.
Thanks, it looks like the code already overwrites an old pid file if no other process is using it (if I understand the code correctly, it just echoes a byte onto a pipe to detect this).
Still, I can't see under what conditions this occurs. I have seen it happen a couple of times, but I don't know how to reproduce the problem predictably.
--
Deepak
Hi!
We could reproduce the start-up problem on Windows 2003. After a reboot, the postmaster cleans up old temporary files during its start-up sequence, and this step used to take several minutes (a little over 4 minutes), delaying the writing of line 6 onwards into the PID file. This delay caused pg_ctl to time out, leaving behind an orphaned postgres.exe process (which eventually forks off many other postgres.exe processes). But since pg_ctl itself isn't running after the timeout, Windows thinks the service isn't running. A subsequent attempt to start the service using pg_ctl then complains that the existing lock file is still in use by one of the postgres.exe processes spawned earlier.
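The failure sequence described above (a launcher polls for readiness, gives up after a timeout, and leaves the launched process running) can be modeled with a small polling loop. This is a toy sketch in Python, not pg_ctl's code; `wait_until_ready` and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behaviour can be exercised without real waiting.

```python
import time


def wait_until_ready(is_ready, timeout, poll_interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout seconds elapse.

    Returns True if the server became ready in time, False if the caller
    gave up. Giving up does not stop the server: it keeps starting.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

With a start-up that needs about 240 seconds, a 60-second budget returns False while the server is still coming up, which is the orphaned-process situation; a 300-second budget returns True.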
deepak <deepak.pn@gmail.com> writes:
> We could reproduce the start-up problem on Windows 2003. After a reboot,
> postmaster, in its start-up sequence cleans up old temporary files, and
> this step used to take several minutes (a little over 4 minutes), delaying
> the writing of line 6 onwards into the PID file. This delay caused pg_ctl
> to timeout, leaving behind an orphaned postgres.exe process (which
> eventually forks off many other postgres.exe processes).
Hmm. It's easy enough to postpone temp file cleanup till after the
postmaster's PID file is completely written, so I've committed a patch
for that. However, I find it mildly astonishing that such cleanup could
take multiple minutes. What are you using for storage, a man with an
abacus?
regards, tom lane
Thanks, I have asked one of the other developers working on this issue to comment.
--
Deepak
I tried moving the call to RemovePgTempFiles until
after the PID file is fully written, but it did not help.
pg_ctl attempts to connect to the database, and does
not report the database as running until that connection
succeeds. I am not comfortable moving the call to
RemovePgTempFiles after the point in the postmaster
where child processes are spawned and connections
made available to clients because by that point the
temporary files encountered may be valid ones from
the current incarnation of Postgres and not from the
incarnation before the reboot.
I do not know precisely why the filesystem is so slow,
except to say that we have many relations:
xyzzy=# select count(*) from pg_catalog.pg_class;
count
-------
27340
(1 row)
xyzzy=# select count(*) from pg_catalog.pg_attribute;
count
--------
236252
(1 row)
Running `find . | wc -l` on the data directory gives
55219
Mark Dilger <markdilger@yahoo.com> writes:
> I tried moving the call to RemovePgTempFiles until
> after the PID file is fully written, but it did not help.
I wonder whether you correctly identified the source of the slowness.
The thing I would have suspected is identify_system_timezone(), which
will attempt to read every file in the timezone-database directory tree,
of which there are about 600. It's not unusual for that to take several
seconds on a cold-started machine that doesn't have any of that tree in
filesystem cache. It's still a stretch to believe that it'd take
several minutes on any storage system more advanced than a floppy disk;
but at least we'd only be trying to pin about one order of magnitude
slowdown on the filesystem, rather than several orders.
If that is what is causing it, there is a very simple workaround, which
is to set the timezone setting explicitly in postgresql.conf instead of
leaving the postmaster to try to figure it out from the environment.
(9.2 will use a better answer, which is for initdb to do this once and
store the result in postgresql.conf.)
regards, tom lane
Prior to posting to the mailing list, we made some
changes in postmaster.c to identify where time was
being spent. Based on the elog(NOTICE,...) lines
we put in the file, we determined the time was spent
inside RemovePgTempFiles.
I then altered RemovePgTempFiles to take a starttime
parameter and, while recursing, to check if more than
5 seconds has passed since it started. I did not want
to add the complexity of setting an alarm and catching
the signal, so I just made the code check the wallclock
time at each step of the recursion. When more than
5 seconds has passed, it does not recurse further.
After making this change, we have not been able to
reproduce the slowness.
We do not consider this a fix to the problem. It is just
a tool for verifying where the slowness comes from.
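The diagnostic described above (check the time at each step of the recursion and stop descending once a budget is used up) can be sketched in Python as follows. This is an illustration of the technique, not the modified RemovePgTempFiles; `scan_with_budget` and its parameters are hypothetical names, and it uses a monotonic clock where the original description says wall-clock time.

```python
import os
import time


def scan_with_budget(root, budget_seconds, clock=time.monotonic):
    """Recursively collect file paths under root, but stop descending into
    further subdirectories once budget_seconds have elapsed.

    Returns (paths, complete); complete is False if the budget ran out.
    """
    start = clock()
    paths = []
    complete = True

    def visit(directory):
        nonlocal complete
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    # Check the clock at each step of the recursion.
                    if clock() - start > budget_seconds:
                        complete = False
                        continue
                    visit(entry.path)
                else:
                    paths.append(entry.path)

    visit(root)
    return paths, complete
```

As in the original experiment, this only bounds the time spent; directories skipped after the deadline are simply not examined, so it is a measuring tool rather than a fix.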
We tried setting the timezone, as:
timezone = 'US/Eastern'
in postgresql.conf, but it did not help.
Mark Dilger <markdilger@yahoo.com> writes:
> Prior to posting to the mailing list, we made some
> changes in postmaster.c to identify where time was
> being spent. Based on the elog(NOTICE,...) lines
> we put in the file, we determined the time was spent
> inside RemovePgTempFiles.
> I then altered RemovePgTempFiles to take a starttime
> parameter and, while recursing, to check if more than
> 5 seconds has passed since it started. I did not want
> to add the complexity of setting an alarm and catching
> the signal, so I just made the code check the wallclock
> time at each step of the recursion. When more than
> 5 seconds has passed, it does not recurse further.
> After making this change, we have not been able to
> reproduce the slowness.
OK, so we're back to the original question: how could this possibly be
taking that long? Have you got thousands of tablespaces (and if so why)?
Does your system have a habit of crashing at times when there are
thousands of temp files? Maybe you're using IP over avian carriers to
access your SAN? It just doesn't make any sense given the information
you've provided.
regards, tom lane
We do not use tablespaces at all. We do use table
partitioning very heavily, with many check
constraints. That is the only thing unusual about
the schema.
To my eyes, the birds appear to be flying pretty
darned fast, though we have not figured out how
to remove the message bands quickly without
cutting off their feet.
The server is a virtual machine, and at this point
I will ask the sys admins to get a non-virtual
server running to reconfirm the problem.
Thanks
Mark Dilger <markdilger@yahoo.com> writes:
> We do not use tablespaces at all.
[ scratches head... ] If you aren't using any tablespaces, there should
be only *one* pgsql_tmp directory, which makes this even more confusing.
(Unless you're using a pre-8.3 release, in which case there would be one
per database, so maybe if you've got hundreds/thousands of databases in
the cluster that would explain it. But I sure hope you're not still
using pre-8.3, especially not on Windows.)
regards, tom lane
We only use one database, not counting the
built-in template databases. The server is
running 9.1.3. We were running 9.1.1 until
fairly recently.
We are still getting set up to test this on
non-virtual hardware, but hope to have results
from that in a few hours or less.
Mark Dilger <markdilger@yahoo.com> writes:
> We only use one database, not counting the
> built-in template databases. The server is
> running 9.1.3. We were running 9.1.1 until
> fairly recently.
OK. I had forgotten that in recent versions, RemovePgTempFiles doesn't
only iterate through the pgsql_tmp directories; it scans the regular
database directories too, looking for possibly orphaned temp relations.
So if you had lots and lots of files in your regular database
directories, possibly scanning those could be slow. Still, it's only
looking at the file names, not attempting to stat() them or anything,
so it would be a pretty shoddy filesystem that would take a really long
time for that.
regards, tom lane
I am running this code on Windows 2003. It
appears that postgres has in src/port/dirent.c
a port of readdir() that internally uses the
WIN32_FIND_DATA structure, and the function
FindNextFile() to iterate through the directory.
Looking at the documentation, it seems that
this function does collect file creation time,
last access time, last write time, file size, etc.,
much like performing a stat.
In my case, the code is iterating through roughly
56,000 files. Apparently, this is doing the
equivalent of a stat on each of them.
See http://msdn.microsoft.com/en-us/library/windows/desktop/aa365740%28v=vs.85%29.aspx
Mark Dilger <markdilger@yahoo.com> writes:
> I am running this code on Windows 2003. It
> appears that postgres has in src/port/dirent.c
> a port of readdir() that internally uses the
> WIN32_FIND_DATA structure, and the function
> FindNextFile() to iterate through the directory.
> Looking at the documentation, it seems that
> this function does collect file creation time,
> last access time, last write time, file size, etc.,
> much like performing a stat.
> In my case, the code is iterating through roughly
> 56,000 files. Apparently, this is doing the
> equivalent of a stat on each of them.
That would explain it all right. I think you're basically screwed here,
because so far as I can see Windows doesn't provide any means to
enumerate a directory's contents without fetching that info; at least
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364232(v=vs.85).aspx
doesn't seem to offer any substitutes for FindFirstFile/FindNextFile.
It's barely possible that using FindFirstFileEx with fInfoLevelId =
FindExInfoBasic would save enough to be useful, except that that option
doesn't exist on Windows 2003 anyway.
Consider using another operating system ...
regards, tom lane
FindFirstFile can take a wildcard filename
pattern. It appears that we are effectively
calling FindFirstFile without a pattern, getting
all 56000 file names with complete stat
information, doing a poor-man's regex on
those names, and matching just the temporary
files.
If RemovePgTempFiles were modified to
pass a filter, this code might perform better
on Windows. I'll look into this.
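The "enumerate everything, then match names" approach described above can be sketched in Python. The naming rules encoded in `TEMP_NAME` are assumptions made for illustration (temporary files prefixed with `pgsql_tmp`, temporary relation files named `t<digits>_<digits>` with an optional fork suffix or segment number); `list_temp_names` is a hypothetical helper, not a PostgreSQL function.

```python
import os
import re

# Assumed naming conventions, for illustration only: temporary files start
# with "pgsql_tmp", and temporary relation files look like t<digits>_<digits>,
# optionally followed by a fork suffix or a segment number.
TEMP_NAME = re.compile(r"^(pgsql_tmp.*|t\d+_\d+(_[a-z]+)?(\.\d+)?)$")


def list_temp_names(directory):
    """Enumerate every entry in directory, keeping only temp-looking names."""
    return sorted(name for name in os.listdir(directory) if TEMP_NAME.match(name))
```

The filter itself is cheap; the cost discussed in this thread comes from the enumeration step that feeds it, which still visits every entry in the directory.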
On Thu, May 24, 2012 at 12:47 AM, Mark Dilger <markdilger@yahoo.com> wrote:
> I am running this code on Windows 2003. It
> appears that postgres has in src/port/dirent.c
> a port of readdir() that internally uses the
> WIN32_FIND_DATA structure, and the function
> FindNextFile() to iterate through the directory.
> Looking at the documentation, it seems that
> this function does collect file creation time,
> last access time, last write time, file size, etc.,
> much like performing a stat.
>
> In my case, the code is iterating through roughly
> 56,000 files. Apparently, this is doing the
> equivalent of a stat on each of them.
How did you end up with 56,000 files? Lots and lots and lots of tables?
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Thu, May 24, 2012 at 2:42 AM, Mark Dilger <markdilger@yahoo.com> wrote:
> FindFirstFile can take a wildcard filename
> pattern. It appears that we are effectively
> calling FindFirstFile without a pattern, getting
> all 56000 file names with complete stat
> information, doing a poor-man's regex on
> those names, and matching just the temporary
> files.
>
> If RemovePgTempFiles were modified to
> pass a filter, this code might perform better
> on Windows. I'll look into this.
It might in that case be worthwhile looking at using scandir() on
platforms that support that as well, so that other platforms can
benefit from an optimization as well. Though I'm not sure how much
that would actually help - ISTM that one actually scans the whole
directory anyway, just you don't have to do it yourself...
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
We have lots of partition tables that inherit
from a smaller number of parents. Some,
but not all of these tables also have indexes.
The number actually varies depending on
the data loaded. For some other database
instances, fortunately on Linux, the number
is in the millions.
I have been testing with passing FindFirstFile
a pattern to match the temporary file names,
rather than letting FindFirstFile/FindNextFile
return all names and then having postgres
do the pattern match itself. So far, this looks
very promising, with a stand-alone program
that uses this technique cutting the runtime
from 4 minutes down to less than a second.
I have a fairly clean patch in the works that
I will submit after I have verified it on
Windows 2003, Windows 2008 and Linux.
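The technique described above (handing the wildcard pattern to FindFirstFile instead of filtering names afterwards) can be sketched in Python with ctypes. This is an illustration, not the patch mentioned here: `find_names` is a hypothetical helper, the `t*_*` pattern in the usage note is an assumed example, and on non-Windows platforms the sketch falls back to listing everything and filtering with `fnmatch`, so it shows no speed-up there. Windows wildcard matching is case-insensitive and may also match short 8.3 names, which the fallback does not reproduce.

```python
import ctypes
import fnmatch
import os
import sys


def find_names(directory, pattern):
    """Return entry names in directory matching a wildcard pattern."""
    if sys.platform != "win32":
        # Fallback for other platforms: enumerate everything, filter here.
        return sorted(fnmatch.filter(os.listdir(directory), pattern))

    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.FindFirstFileW.argtypes = [
        wintypes.LPCWSTR, ctypes.POINTER(wintypes.WIN32_FIND_DATAW)]
    kernel32.FindFirstFileW.restype = wintypes.HANDLE
    kernel32.FindNextFileW.argtypes = [
        wintypes.HANDLE, ctypes.POINTER(wintypes.WIN32_FIND_DATAW)]
    kernel32.FindNextFileW.restype = wintypes.BOOL
    kernel32.FindClose.argtypes = [wintypes.HANDLE]
    kernel32.FindClose.restype = wintypes.BOOL

    data = wintypes.WIN32_FIND_DATAW()
    # The pattern is part of the search path, so the OS does the matching.
    handle = kernel32.FindFirstFileW(os.path.join(directory, pattern),
                                     ctypes.byref(data))
    if handle == wintypes.HANDLE(-1).value:  # INVALID_HANDLE_VALUE
        return []  # no match (or the directory could not be searched)
    names = []
    try:
        while True:
            names.append(data.cFileName)
            if not kernel32.FindNextFileW(handle, ctypes.byref(data)):
                break
    finally:
        kernel32.FindClose(handle)
    return sorted(names)
```

A call such as `find_names(datadir, "t*_*")` returns only the matching names, so the caller never sees the tens of thousands of non-matching entries.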
We have observed conclusively that the file system cache is coming into play. We tested a scenario in which a reboot was followed by navigating the file system under the data directory using the Cygwin "find" command. After that, pg_ctl did not time out and the server started up fine, suggesting that the cleanup is much faster when the file system is cached.
Any ideas on fixing this start-up delay in the postmaster?
Could the cleanup task move elsewhere, specifically to somewhere after the writing of the PID file is complete, so that pg_ctl doesn't time out?
Any other suggestions for working around this problem?
Thanks,
Deepak
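The "find" experiment described above amounts to walking the data directory once before the server starts. A Python equivalent of that walk is sketched below; `warm_directory_cache` is a hypothetical name, and whether such a walk actually shortens start-up depends on the operating system's caching behaviour, which this sketch does not measure.

```python
import os


def warm_directory_cache(root):
    """Walk root once, touching every directory entry; return the entry count."""
    count = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        count += len(dirnames) + len(filenames)
    return count
```

The returned count excludes the root directory itself, so it will be one lower than `find . | wc -l` run from the same directory.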