Thread: Re: [DOCS] pg_total_relation_size() and CHECKPOINT
[ moved to -hackers --- see original thread here http://archives.postgresql.org/pgsql-docs/2008-03/msg00039.php ] "Zubkovsky, Sergey" <Sergey.Zubkovsky@transas.com> writes: > Here is my example. Hmm ... on my Fedora machine I get the same result (704512) in all these cases, which is what I'd expect. (The exact value could vary across platforms, of course.) You said you were using the MinGW build --- maybe MinGW's version of stat(2) isn't trustworthy? regards, tom lane
The previous results were received on PG 8.3 version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400"
"Zubkovsky, Sergey" <Sergey.Zubkovsky@transas.com> writes: > The previous results were received on PG 8.3 version: > "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" Hmm. I find the whole thing fairly worrisome, because what it suggests is that Windows isn't actually allocating file space during smgrextend, which would mean that we'd be prone to running out of disk space at unfortunate times --- like during a checkpoint, after we've already promised the client the data is committed. Can any Windows hackers look into this and find out what's really happening? regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > "Zubkovsky, Sergey" <Sergey.Zubkovsky@transas.com> writes: >> The previous results were received on PG 8.3 version: >> "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" > > Hmm. I find the whole thing fairly worrisome, because what it suggests > is that Windows isn't actually allocating file space during smgrextend, > which would mean that we'd be prone to running out of disk space at > unfortunate times --- like during a checkpoint, after we've already > promised the client the data is committed. Surely we can't lose after the fsync? Losing at commit rather than at the time of insert might still be poor, but how could we lose after we've promised the data is committed? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> Hmm. I find the whole thing fairly worrisome, because what it suggests >> is that Windows isn't actually allocating file space during smgrextend, >> which would mean that we'd be prone to running out of disk space at >> unfortunate times --- like during a checkpoint, after we've already >> promised the client the data is committed. > Surely we can't lose after the fsync? Losing at commit rather than at > the time of insert might still be poor, but how could we lose after > we've promised the data is committed? What I'm afraid of is write() returning ENOSPC for a write to a disk block we thought we had allocated previously. If such a situation is persistent we'd be unable to flush dirty data from shared buffers and thus never be able to complete a checkpoint. We'd never *get* to the fsync, so whether the data is safe after fsync is moot. The way it is supposed to work is that ENOSPC ought to happen during smgrextend, that is before we've put any data into a shared buffer corresponding to a new page of the file. With that, we will never be able to commit a transaction that requires disk space we don't have. The real question here is whether Windows' stat() is telling the truth about how much filesystem space has actually been allocated to a file. It seems entirely possible that it's not; but if it is, then I think we have a problem. regards, tom lane
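The ordering Tom describes, where running out of space must show up when the file is extended rather than later at checkpoint time, can be sketched roughly as follows. This is a minimal illustration in Python, not PostgreSQL's actual smgrextend/mdextend code: the function name `extend_by_one_block`, the 8192-byte block size, and the scratch file are assumptions made for the example.

```python
import errno
import os
import tempfile

BLOCK_SIZE = 8192  # assumed page size for this sketch


def extend_by_one_block(fd, block_no, block_size=BLOCK_SIZE):
    """Seek to the start of block `block_no` and write one zero-filled block.

    Any failure, including ENOSPC, is raised here, at extension time,
    before a caller would place real data in a buffer for the new page.
    """
    os.lseek(fd, block_no * block_size, os.SEEK_SET)
    written = os.write(fd, bytes(block_size))
    if written != block_size:
        # A short write without an exception is treated as "out of space".
        raise OSError(errno.ENOSPC, "could not extend file: short write")
    return written


def demo():
    """Extend a scratch file by three blocks and return its final size."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "relation")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            for block_no in range(3):
                extend_by_one_block(fd, block_no)
            return os.fstat(fd).st_size
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(demo())
```

The point of writing a real block during extension is that the filesystem has to allocate the space then, so a later write to the same block should not be the first place ENOSPC can appear. Whether Windows behaves that way is exactly the open question in this thread.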
Tom Lane wrote: > The real question here is whether Windows' stat() is telling the truth > about how much filesystem space has actually been allocated to a file. > It seems entirely possible that it's not; but if it is, then I think we > have a problem. Has this been examined by a Windows hacker? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Tom Lane wrote: > > >> The real question here is whether Windows' stat() is telling the truth >> about how much filesystem space has actually been allocated to a file. >> It seems entirely possible that it's not; but if it is, then I think we >> have a problem. >> > > Has this been examined by a Windows hacker? > > If someone can suggest a test program I'll be happy to run it. cheers andrew
Can anybody tell me how filesystem space is allocated and point me to the sources if it's possible? I have some experience with programming for Windows and I'll try to investigate this problem.
Andrew Dunstan <andrew@dunslane.net> writes: > Alvaro Herrera wrote: >> Tom Lane wrote: >>> The real question here is whether Windows' stat() is telling the truth >>> about how much filesystem space has actually been allocated to a file. >>> It seems entirely possible that it's not; but if it is, then I think we >>> have a problem. >> Has this been examined by a Windows hacker? > If someone can suggest a test program I'll be happy to run it. One thing that would be good is just to see who else can reproduce the original observation: http://archives.postgresql.org/pgsql-docs/2008-03/msg00041.php It might occur only on some versions of Windows, for instance. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Alvaro Herrera wrote: >> >>> Tom Lane wrote: >>> >>>> The real question here is whether Windows' stat() is telling the truth >>>> about how much filesystem space has actually been allocated to a file. >>>> It seems entirely possible that it's not; but if it is, then I think we >>>> have a problem. >>>> > > >>> Has this been examined by a Windows hacker? >>> > > >> If someone can suggest a test program I'll be happy to run it. >> > > One thing that would be good is just to see who else can reproduce > the original observation: > http://archives.postgresql.org/pgsql-docs/2008-03/msg00041.php > > It might occur only on some versions of Windows, for instance. > > > I have reproduced it in XP-Pro/SP2 running in a VMWare machine on an FC6 host. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >>> The real question here is whether Windows' stat() is telling the truth >>> about how much filesystem space has actually been allocated to a file. >> >> One thing that would be good is just to see who else can reproduce >> the original observation: >> http://archives.postgresql.org/pgsql-docs/2008-03/msg00041.php > I have reproduced it in XP-Pro/SP2 running in a VMWare machine on an FC6 > host. OK, so the next question is do we really have an issue, or is this just an observational artifact? What I'd try is deliberately running the machine out of disk space with a long series of inserts, and then see whether subsequent checkpoint attempts fail due to ENOSPC errors while trying to write out dirty buffers. To avoid conflating this effect with anything else, it'd be best if you could put the DB on its own small partition, and *not* put pg_xlog there. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>>> The real question here is whether Windows' stat() is telling the truth >>>> about how much filesystem space has actually been allocated to a file. >>>> >>> One thing that would be good is just to see who else can reproduce >>> the original observation: >>> http://archives.postgresql.org/pgsql-docs/2008-03/msg00041.php >>> > > >> I have reproduced it in XP-Pro/SP2 running in a VMWare machine on an FC6 >> host. >> > > OK, so the next question is do we really have an issue, or is this just > an observational artifact? What I'd try is deliberately running the > machine out of disk space with a long series of inserts, and then see > whether subsequent checkpoint attempts fail due to ENOSPC errors while > trying to write out dirty buffers. > > To avoid conflating this effect with anything else, it'd be best if you > could put the DB on its own small partition, and *not* put pg_xlog > there. > > > I'm working on this (thank goodness for junctions). Maybe we should look at providing a config setting for pg_xlog. cheers andrew
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>>> The real question here is whether Windows' stat() is telling the truth >>>> about how much filesystem space has actually been allocated to a file. >>>> >>> One thing that would be good is just to see who else can reproduce >>> the original observation: >>> http://archives.postgresql.org/pgsql-docs/2008-03/msg00041.php >>> > > >> I have reproduced it in XP-Pro/SP2 running in a VMWare machine on an FC6 >> host. >> > > OK, so the next question is do we really have an issue, or is this just > an observational artifact? What I'd try is deliberately running the > machine out of disk space with a long series of inserts, and then see > whether subsequent checkpoint attempts fail due to ENOSPC errors while > trying to write out dirty buffers. > > To avoid conflating this effect with anything else, it'd be best if you > could put the DB on its own small partition, and *not* put pg_xlog > there. > > > OK, a very large insert failed as expected. Checkpoint succeeded. Then vacuum recovered the space. I suspect that the size reported by stat() is a little delayed here, but the file system is keeping proper track of it, so the lseek that tries to extend the file fails at the right spot. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I suspect that the size reported by stat() is a little delayed here, but > the file system is keeping proper track of it, so the lseek that tries > to extend the file fails at the right spot. Hmm. If it really works that way, one would hope Microsoft would've documented that someplace. Can anyone find a statement that Windows' stat() is not current? regards, tom lane
On Thu, 27 Mar 2008 00:13:42 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > I suspect that the size reported by stat() is a little delayed > > here, but the file system is keeping proper track of it, so the > > lseek that tries to extend the file fails at the right spot. > > Hmm. If it really works that way, one would hope Microsoft would've > documented that someplace. Can anyone find a statement that Windows' > stat() is not current? I'm not in a position to test it myself now (doing training, and then I'll be off to pg-east...), but it'd be interesting to see if it acts the same way with GetFileSize(), or if it's just stat()... /Magnus
Andrew Dunstan wrote: > I'm working on this (thank goodness for junctions). Maybe we should > look at providing a config setting for pg_xlog. I hope you mean an initdb switch -- otherwise it is way too easy to misuse. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Maybe this helps: "It is not an error to set a file pointer to a position beyond the end of the file. The size of the file does not increase until you call the SetEndOfFile, WriteFile, or WriteFileEx function. A write operation increases the size of the file to the file pointer position plus the size of the buffer written, which results in the intervening bytes uninitialized." http://msdn2.microsoft.com/en-us/library/aa365541(VS.85).aspx According to Windows' lseek implementation (attached) SetEndOfFile() isn't called for this case. Thanks, Sergey Zubkovsky
Attachment
Zubkovsky, Sergey wrote: > Maybe this helps: > > "It is not an error to set a file pointer to a position beyond the end > of the file. The size of the file does not increase until you call the > SetEndOfFile, WriteFile, or WriteFileEx function. A write operation > increases the size of the file to the file pointer position plus the > size of the buffer written, which results in the intervening bytes > uninitialized." > > http://msdn2.microsoft.com/en-us/library/aa365541(VS.85).aspx > > According to Windows' lseek implementation (attached) SetEndOfFile() > isn't called for this case. > > > Yes, but we immediately follow the lseek by a write(). See src/backend/storage/smgr/md.c:mdextend(). cheers andrew
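The behaviour under discussion here, that a seek past end-of-file does not by itself grow the file but a following write does, can be observed with a small portable sketch. The Python below is only an illustration using os.lseek, os.write, and os.fstat; the function name `sizes_around_seek_and_write`, the offset, and the payload are assumptions for the example, and it says nothing about how a particular Windows C runtime implements lseek.

```python
import os
import tempfile


def sizes_around_seek_and_write(offset=8192, payload=b"x" * 16):
    """Return (size_after_seek, size_after_write) for a fresh, empty file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scratch")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            # Move the file pointer past the current end of the file.
            os.lseek(fd, offset, os.SEEK_SET)
            size_after_seek = os.fstat(fd).st_size
            # The write at the new position is what actually extends the file.
            os.write(fd, payload)
            size_after_write = os.fstat(fd).st_size
            return size_after_seek, size_after_write
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(sizes_around_seek_and_write())
```

This matches Andrew's point: because the seek is followed immediately by a write, the file size should already include the new block once the write returns.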
Alvaro Herrera <alvherre@commandprompt.com> writes: > Andrew Dunstan wrote: >> I'm working on this (thank goodness for junctions). Maybe we should >> look at providing a config setting for pg_xlog. > I hope you mean an initdb switch -- otherwise it is way too easy to > misuse. There's one already .. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > >> Andrew Dunstan wrote: >> >>> I'm working on this (thank goodness for junctions). Maybe we should >>> look at providing a config setting for pg_xlog. >>> > > >> I hope you mean an initdb switch -- otherwise it is way too easy to >> misuse. >> > > There's one already .. > > > heh, the things that creep up on you when you're not looking ... cheers andrew
It seems I've found the cause of the problem and a workaround. MSVC's stat() is implemented by using FindNextFile(). MSDN contains the following suspicious paragraph about FindNextFile(): "In rare cases, file attribute information on NTFS file systems may not be current at the time you call this function. To obtain the current NTFS file system file attributes, call GetFileInformationByHandle." Since we generally cannot open the file being examined, we need another way. In the custom build of PG 8.3.1 that I prepared, the native MSVC stat() was rewritten to add a GetFileAttributesEx() call that corrects stat's st_size value. I have seen that the results of MSVC's stat() and GetFileAttributesEx() may differ, at least in the file size. Most importantly, the test in the original post ( http://archives.postgresql.org/pgsql-docs/2008-03/msg00041.php ) no longer reproduces any inconsistency. Everything works fine. This was tested on my WinXP SP2 platform, but I suppose it will work on any NT-based OS. Thanks, Sergey Zubkovsky
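A quick way to look for the kind of discrepancy described above is to ask for a file's size twice while it is still open, once by path and once through the open handle, and compare the answers. The Python sketch below is only an analogy for the stat() versus GetFileAttributesEx() comparison: the function name `compare_size_reports` is hypothetical, the byte count is arbitrary (it reuses the 704512 figure quoted earlier in the thread), and how Python maps os.stat and os.fstat onto Win32 calls is not something this thread establishes.

```python
import os
import tempfile


def compare_size_reports(n_bytes=704512):
    """Write n_bytes to a scratch file, then report its size two ways.

    Returns (size_by_path, size_by_handle), both taken while the file
    is still open for writing.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scratch")
        with open(path, "wb") as f:
            f.write(bytes(n_bytes))
            f.flush()  # hand Python's buffer to the OS; no fsync is done
            size_by_path = os.stat(path).st_size
            size_by_handle = os.fstat(f.fileno()).st_size
        return size_by_path, size_by_handle


if __name__ == "__main__":
    by_path, by_handle = compare_size_reports()
    print("by path:", by_path, "by handle:", by_handle)
```

If the two numbers disagree on some platform, that would suggest the path-based lookup is returning stale metadata there, which is the symptom the patched stat() is meant to avoid.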