Thread: fsync, ext2 on Linux

fsync, ext2 on Linux

From
Heikki Linnakangas
Date:
The Linux fsync man page says:

"It does not necessarily ensure that the entry in the directory 
containing the file has also reached disk. For that an explicit fsync on 
the file descriptor of the directory is also needed."

AFAIK, we don't care about it at the moment. The actual behaviour depends 
on the filesystem, reiserfs and other journaling filesystems probably 
don't need the explicit fsync on the parent directory, but at least ext2 
does.

I've experimented with a user-mode-linux installation, crashing it at 
specific points. It seems that on ext2, it's possible to get the database 
in non-consistent state.

In particular:

1. start transaction
2. do a lot of updates, so that a new xlog file is created
3. commit
4. crash

Sometimes the creation of the new xlog file is lost, losing the already 
committed transaction.

I also got into this situation after one crash test:

template1=# SELECT * FROM foo;
ERROR:  could not access status of transaction 1768515945
DETAIL:  could not open file "/home/hlinnaka/pgsql/data_broken/pg_clog/0696": No such file or directory

I haven't tried to debug it more deeply.

Should we fix this by fsyncing the parent directory of each new file? We could 
also declare ext2 broken, but other filesystems may behave the same way.

- Heikki
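
A minimal sketch of the fix proposed above, assuming Linux semantics where a directory can be opened read-only and its descriptor passed to fsync. Python is used only for brevity. The names `fsync_dir` and `create_file_durably` are hypothetical and are not PostgreSQL code.

```python
import os


def fsync_dir(dir_path):
    """Flush a directory's entries to disk by fsyncing a descriptor opened on it."""
    # Assumption: the platform allows opening a directory read-only (true on Linux).
    fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def create_file_durably(path, data):
    """Create a new file, fsync its contents, then fsync its parent directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        offset = 0
        while offset < len(data):
            # os.write may write fewer bytes than requested, so loop until done.
            offset += os.write(fd, data[offset:])
        os.fsync(fd)  # flushes the file's data and inode
    finally:
        os.close(fd)
    # Per the man page quoted above, the directory entry needs its own fsync.
    fsync_dir(os.path.dirname(os.path.abspath(path)))
```

Only the immediate parent directory is synced here. Whether that is sufficient on every filesystem is exactly the open question in this thread.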


Re: fsync, ext2 on Linux

From
Oliver Jowett
Date:
Heikki Linnakangas wrote:
> The Linux fsync man page says:
> 
> "It does not necessarily ensure that the entry in the directory 
> containing the file has also reached disk. For that an explicit fsync on 
> the file descriptor of the directory is also needed."
> 
> AFAIK, we don't care about it at the moment. The actual behaviour 
> depends on the filesystem, reiserfs and other journaling filesystems 
> probably don't need the explicit fsync on the parent directory, but at 
> least ext2 does.
> 
> I've experimented with a user-mode-linux installation, crashing it at 
> specific points. It seems that on ext2, it's possible to get the 
> database in non-consistent state.

Have you experimented with mounting the filesystem with the dirsync 
option ('-o dirsync') or marking the log directory as synchronous with 
'chattr +D'?  (no, it's not a real fix, just another data point..)

-O


Re: fsync, ext2 on Linux

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> The Linux [ext2] fsync man page says:
> "It does not necessarily ensure that the entry in the directory 
> containing the file has also reached disk. For that an explicit fsync on 
> the file descriptor of the directory is also needed."

This seems so broken as to defy belief.  A process creating a file
doesn't normally *have* a file descriptor for the parent directory,
and I don't think the concept of an FD for a directory is even
portable (opendir() certainly doesn't return an FD).  One might also
ask if we are expected to fsync everything up to the root in order
to be sure that the file remains accessible, and how exactly we should
do that on directories we don't have write access for.

In general we expect the filesystem to take care of its own metadata.
Run ext3 in journaling mode, or something like that.

(It occurs to me that the admin guide really ought to have a few words
about recommended and non-recommended filesystems ...)
        regards, tom lane


Re: fsync, ext2 on Linux

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>The Linux [ext2] fsync man page says:
>>"It does not necessarily ensure that the entry in the directory 
>>containing the file has also reached disk. For that an explicit fsync on 
>>the file descriptor of the directory is also needed."
>
>This seems so broken as to defy belief.  A process creating a file
>doesn't normally *have* a file descriptor for the parent directory,
>and I don't think the concept of an FD for a directory is even
>portable (opendir() certainly doesn't return an FD).  One might also
>ask if we are expected to fsync everything up to the root in order
>to be sure that the file remains accessible, and how exactly we should
>do that on directories we don't have write access for.

The notes say this:
    When an ext2 file system is mounted with the sync option, directory
    entries are also implicitly synced by fsync.

cheers

andrew




Re: fsync, ext2 on Linux

From
"Joshua D. Drake"
Date:
>In general we expect the filesystem to take care of its own metadata.
>Run ext3 in journaling mode, or something like that.
>
>(It occurs to me that the admin guide really ought to have a few words
>about recommended and non-recommended filesystems ...)

Well, I am not their admin, but I don't recommend any of the ext filesystems.
Although ext3 is reasonably stable, it is very slow.

Stick with XFS, JFS or even Reiser.

Sincerely,

Joshua D. Drake








Re: fsync, ext2 on Linux

From
Heikki Linnakangas
Date:
On Mon, 1 Nov 2004, Oliver Jowett wrote:

> Heikki Linnakangas wrote:
>> The Linux fsync man page says:
>> 
>> "It does not necessarily ensure that the entry in the directory containing 
>> the file has also reached disk. For that an explicit fsync on the file 
>> descriptor of the directory is also needed."
>> 
>> AFAIK, we don't care about it at the moment. The actual behaviour depends 
>> on the filesystem, reiserfs and other journaling filesystems probably don't 
>> need the explicit fsync on the parent directory, but at least ext2 does.
>> 
>> I've experimented with a user-mode-linux installation, crashing it at 
>> specific points. It seems that on ext2, it's possible to get the database 
>> in non-consistent state.
>
> Have you experimented with mounting the filesystem with the dirsync option 
> ('-o dirsync') or marking the log directory as synchronous with 'chattr +D'? 
> (no, it's not a real fix, just another data point..)

A quick experiment shows that both seem to fix it, as expected.

"chattr +D" might not be such a bad idea. A warning would be nice if you 
start the postmaster on a filesystem that requires it. Few admins would 
remember/know about it otherwise.

- Heikki


Re: fsync, ext2 on Linux

From
Heikki Linnakangas
Date:
On Sun, 31 Oct 2004, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> The Linux [ext2] fsync man page says:
>> "It does not necessarily ensure that the entry in the directory
>> containing the file has also reached disk. For that an explicit fsync on
>> the file descriptor of the directory is also needed."
>
> This seems so broken as to defy belief.  A process creating a file
> doesn't normally *have* a file descriptor for the parent directory,
> and I don't think the concept of an FD for a directory is even
> portable (opendir() certainly doesn't return an FD).  One might also
> ask if we are expected to fsync everything up to the root in order
> to be sure that the file remains accessible, and how exactly we should
> do that on directories we don't have write access for.

I agree on the brokenness. Linux is the only OS I know of that's broken in 
this way. Therefore it doesn't really matter whether the fix is portable; we 
would only do it on Linux anyway.

Surely it's not necessary to crawl up to the root. Just fsync the 
parent of every new file and directory.

> In general we expect the filesystem to take care of its own metadata.
> Run ext3 in journaling mode, or something like that.

I normally run reiserfs; I set up the ext2 filesystem just to test this.

> (It occurs to me that the admin guide really ought to have a few words
> about recommended and non-recommended filesystems ...)

That's the least we can do. I wonder if we could check the filesystem at 
runtime and issue a warning if it's not in the list of recommended 
filesystems.

- Heikki
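
The runtime check suggested above could be sketched roughly as follows, assuming a Linux system where mount information is available in /proc/mounts format. Python is used only for brevity. The set `NEEDS_DIR_FSYNC` and all function names are hypothetical; the thread only establishes that ext2 is affected and that the sync and dirsync mount options avoid the problem. A directory marked with 'chattr +D' is not visible in the mount table, so this sketch would still warn in that case.

```python
import os

# Hypothetical policy for illustration; the thread only identifies ext2.
NEEDS_DIR_FSYNC = {"ext2"}
# Mount options that, per the thread, make directory updates synchronous.
DIR_SYNC_OPTIONS = {"sync", "dirsync"}


def parse_mounts(mounts_text):
    """Parse /proc/mounts-style text into (mount_point, fstype, options) tuples."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        # /proc/mounts escapes spaces in paths as \040.
        mount_point = fields[1].replace("\\040", " ").rstrip("/") or "/"
        entries.append((mount_point, fields[2], set(fields[3].split(","))))
    return entries


def find_mount(path, entries):
    """Return the entry whose mount point is the longest prefix of path."""
    path = os.path.abspath(path)  # note: symlinks are not resolved here
    best = None
    for entry in entries:
        mount_point = entry[0]
        covers = (mount_point == "/" or path == mount_point
                  or path.startswith(mount_point + "/"))
        # ">=" lets a later mount on the same point shadow an earlier one.
        if covers and (best is None or len(mount_point) >= len(best[0])):
            best = entry
    return best


def directory_sync_warning(data_dir, mounts_text):
    """Return a warning string, or None if no problem is detected."""
    entry = find_mount(data_dir, parse_mounts(mounts_text))
    if entry is None:
        return None
    mount_point, fstype, options = entry
    if fstype in NEEDS_DIR_FSYNC and not (options & DIR_SYNC_OPTIONS):
        return ("data directory %s is on %s (mounted at %s) without sync or "
                "dirsync; new directory entries may be lost in a crash"
                % (data_dir, fstype, mount_point))
    return None
```

At startup the caller would read /proc/mounts, pass its contents as `mounts_text`, and log the returned string if it is not None.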