Thread: fsync, ext2 on Linux
The Linux fsync man page says:

"It does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync on the file descriptor of the directory is also needed."

AFAIK, we don't care about it at the moment. The actual behaviour depends on the filesystem: reiserfs and other journaling filesystems probably don't need the explicit fsync on the parent directory, but at least ext2 does.

I've experimented with a user-mode-linux installation, crashing it at specific points. It seems that on ext2, it's possible to get the database into an inconsistent state. In particular:

1. start transaction
2. do a lot of updates, so that a new xlog file is created
3. commit
4. crash

Sometimes the creation of the new xlog file is lost, losing the already committed transaction.

I also got into this situation after one crash test:

template1=# SELECT * FROM foo;
ERROR:  could not access status of transaction 1768515945
DETAIL:  could not open file "/home/hlinnaka/pgsql/data_broken/pg_clog/0696": No such file or directory

I haven't tried to debug it more deeply.

Should we fix this by fsyncing the parent directory of new files? We could also declare ext2 broken, but there could be others.

- Heikki
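For illustration, the proposed fix (fsync the parent directory after creating a file) could look roughly like the following sketch in C. It assumes Linux behaviour, where open() with O_RDONLY on a directory yields a descriptor that fsync() accepts. The helper names fsync_dir() and create_durable() are hypothetical and are not taken from the PostgreSQL source.

```c
/* Sketch only: make a newly created file and its directory entry durable.
 * fsync_dir() and create_durable() are hypothetical helper names. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Flush a directory's entries to disk. Returns 0 on success, -1 on error. */
int fsync_dir(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY);   /* read-only is enough to fsync a directory on Linux */
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* Create dirpath/name, write len bytes, then fsync the file and its parent.
 * Fails (returns -1) if the file already exists. */
int create_durable(const char *dirpath, const char *name,
                   const void *buf, size_t len)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/%s", dirpath, name);
    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* First make the file contents durable... */
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* ...then the directory entry that names the file. */
    return fsync_dir(dirpath);
}
```

A caller creating a new log segment would call something like create_durable(logdir, segname, data, len) and treat a nonzero result as a failed creation. Only the immediate parent is synced here; whether that is sufficient on a given filesystem is an assumption of this sketch.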
Heikki Linnakangas wrote:
> The Linux fsync man page says:
>
> "It does not necessarily ensure that the entry in the directory
> containing the file has also reached disk. For that an explicit fsync on
> the file descriptor of the directory is also needed."
>
> AFAIK, we don't care about it at the moment. The actual behaviour
> depends on the filesystem, reiserfs and other journaling filesystems
> probably don't need the explicit fsync on the parent directory, but at
> least ext2 does.
>
> I've experimented with a user-mode-linux installation, crashing it at
> specific points. It seems that on ext2, it's possible to get the
> database in non-consistent state.

Have you experimented with mounting the filesystem with the dirsync option ('-o dirsync') or marking the log directory as synchronous with 'chattr +D'?

(No, it's not a real fix, just another data point.)

-O
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> The Linux [ext2] fsync man page says:
> "It does not necessarily ensure that the entry in the directory
> containing the file has also reached disk. For that an explicit fsync on
> the file descriptor of the directory is also needed."

This seems so broken as to defy belief. A process creating a file doesn't normally *have* a file descriptor for the parent directory, and I don't think the concept of an FD for a directory is even portable (opendir() certainly doesn't return an FD). One might also ask if we are expected to fsync everything up to the root in order to be sure that the file remains accessible, and how exactly we should do that on directories we don't have write access for.

In general we expect the filesystem to take care of its own metadata. Run ext3 in journaling mode, or something like that.

(It occurs to me that the admin guide really ought to have a few words about recommended and non-recommended filesystems ...)

regards, tom lane
Tom Lane wrote:
>Heikki Linnakangas <hlinnaka@iki.fi> writes:
>
>>The Linux [ext2] fsync man page says:
>>"It does not necessarily ensure that the entry in the directory
>>containing the file has also reached disk. For that an explicit fsync on
>>the file descriptor of the directory is also needed."
>
>This seems so broken as to defy belief. A process creating a file
>doesn't normally *have* a file descriptor for the parent directory,
>and I don't think the concept of an FD for a directory is even
>portable (opendir() certainly doesn't return an FD). One might also
>ask if we are expected to fsync everything up to the root in order
>to be sure that the file remains accessible, and how exactly we should
>do that on directories we don't have write access for.

The notes say this:

When an ext2 file system is mounted with the sync option, directory entries are also implicitly synced by fsync.

cheers

andrew
>In general we expect the filesystem to take care of its own metadata.
>Run ext3 in journaling mode, or something like that.
>
>(It occurs to me that the admin guide really ought to have a few words
>about recommended and non-recommended filesystems ...)

Well, I am not their admin, but I don't suggest any of the ext systems. Although ext3 is reasonably stable, it is very slow. Stick with XFS, JFS or even Reiser.

Sincerely,

Joshua D. Drake
On Mon, 1 Nov 2004, Oliver Jowett wrote:
> Heikki Linnakangas wrote:
>> The Linux fsync man page says:
>>
>> "It does not necessarily ensure that the entry in the directory containing
>> the file has also reached disk. For that an explicit fsync on the file
>> descriptor of the directory is also needed."
>>
>> AFAIK, we don't care about it at the moment. The actual behaviour depends
>> on the filesystem, reiserfs and other journaling filesystems probably don't
>> need the explicit fsync on the parent directory, but at least ext2 does.
>>
>> I've experimented with a user-mode-linux installation, crashing it at
>> specific points. It seems that on ext2, it's possible to get the database
>> in non-consistent state.
>
> Have you experimented with mounting the filesystem with the dirsync option
> ('-o dirsync') or marking the log directory as synchronous with 'chattr +D'?
> (no, it's not a real fix, just another data point..)

A quick experiment shows that they seem to fix it, as expected.

"chattr +D" might not be such a bad idea. A warning would be nice if you start the postmaster on a filesystem that requires it. Few admins would remember or know about it otherwise.

- Heikki
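A startup check for the 'chattr +D' attribute could be sketched as follows. This is an illustration under assumptions, not existing postmaster code: it reads the per-directory inode flags with the Linux FS_IOC_GETFLAGS ioctl and tests FS_DIRSYNC_FL, which is the flag chattr's D attribute corresponds to. It does not detect the '-o dirsync' mount option, and the function name dir_has_dirsync_attr() is hypothetical.

```c
/* Sketch only: report whether a directory carries the dirsync attribute
 * (the one set by 'chattr +D'). Linux-specific. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_DIRSYNC_FL */

/* Returns 1 if the attribute is set, 0 if it is not, and -1 if the flags
 * cannot be read (missing directory, or a filesystem without inode flags). */
int dir_has_dirsync_attr(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY);
    if (fd < 0)
        return -1;

    int flags = 0;
    int rc = ioctl(fd, FS_IOC_GETFLAGS, &flags);
    close(fd);
    if (rc != 0)
        return -1;

    return (flags & FS_DIRSYNC_FL) ? 1 : 0;
}
```

A server could call this on its log directory at startup and print a warning when the result is 0 on a filesystem known to need the attribute. A result of -1 only means the flags could not be read, so it should not be treated as "unsafe" by itself.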
On Sun, 31 Oct 2004, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> The Linux [ext2] fsync man page says:
>> "It does not necessarily ensure that the entry in the directory
>> containing the file has also reached disk. For that an explicit fsync on
>> the file descriptor of the directory is also needed."
>
> This seems so broken as to defy belief. A process creating a file
> doesn't normally *have* a file descriptor for the parent directory,
> and I don't think the concept of an FD for a directory is even
> portable (opendir() certainly doesn't return an FD). One might also
> ask if we are expected to fsync everything up to the root in order
> to be sure that the file remains accessible, and how exactly we should
> do that on directories we don't have write access for.

I agree on the brokenness.

Linux is the only OS that's broken that I know of. Therefore it doesn't really matter whether the fix is portable or not; we would only do it on Linux anyway.

Surely it's not necessary to crawl up to the root. Just fsync the parent of every new file and directory.

> In general we expect the filesystem to take care of its own metadata.
> Run ext3 in journaling mode, or something like that.

I normally run reiserfs; I set up the ext2 filesystem just to test it.

> (It occurs to me that the admin guide really ought to have a few words
> about recommended and non-recommended filesystems ...)

That's the least we can do. I wonder if we could check the filesystem at runtime and issue a warning if it's not in the list of recommended filesystems.

- Heikki
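A runtime filesystem check along these lines could be sketched with the Linux statfs() call, which reports a filesystem type number in f_type. One caveat limits how far this goes: ext2, ext3 and ext4 all report the same magic number (0xEF53), so statfs() alone cannot tell a non-journaled ext2 from a journaled ext3. The function names and the warning text below are hypothetical.

```c
/* Sketch only: warn when a data directory lives on an ext-family filesystem.
 * Linux-specific; on_ext_family() and warn_about_filesystem() are
 * hypothetical names. */
#include <stdio.h>
#include <sys/vfs.h>    /* statfs() */

/* Magic number shared by ext2, ext3 and ext4. */
#define EXT_FAMILY_MAGIC 0xEF53UL

/* Returns 1 if path is on ext2/ext3/ext4, 0 if not, -1 if statfs() fails. */
int on_ext_family(const char *path)
{
    struct statfs sfs;

    if (statfs(path, &sfs) != 0)
        return -1;
    return ((unsigned long) sfs.f_type == EXT_FAMILY_MAGIC) ? 1 : 0;
}

/* Prints a warning to stderr for ext-family filesystems and returns the
 * result of on_ext_family() unchanged. */
int warn_about_filesystem(const char *datadir)
{
    int rc = on_ext_family(datadir);

    if (rc == 1)
        fprintf(stderr,
                "WARNING: \"%s\" is on an ext2/ext3/ext4 filesystem; "
                "if it is ext2, new files may be lost after a crash\n",
                datadir);
    return rc;
}
```

Because of the shared magic number, a real check would need more information than this (for example the mount table) before it could compare against a list of recommended filesystems.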