Thread: What is WAL used for?
I'm just trying to figure out the terminology used on this board, and wanted to know what WAL is and what role it plays in PostgreSQL.

Thanks
WAL is write-ahead logging. Basically, before the database actually performs an operation, it writes to a log what it is about to do. Then it goes and does it. This ensures data consistency. Say the computer is powered off suddenly. There are several points at which that could happen:

1) Before a write: the database would be fine with or without write-ahead logging.

2) During a write: without write-ahead logging, if the machine is powered off during a write, the database has no way of knowing what was being written or what remained to be written. With Postgres, this breaks down into two possibilities:

   * The power-off occurred while it was writing to the log. In this case the incomplete log entry is rolled back. The database is unaffected because the data was never written to the database proper.

   * The power-off occurred after writing to the log, while writing to the data files on disk. In this case Postgres can simply read from the log what was supposed to be written, and complete the write.

3) After a write: again, this does not affect Postgres, with or without WAL.

In addition, WAL increases PostgreSQL's efficiency, because it can delay random-access writes to disk and just do sequential writes to the log for a long time. This reduces the amount of head seeking the disks are doing. If you store your WAL files on a different disk, you get even more speed advantages.

Jon
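To make the two "during a write" cases above concrete, here is a minimal Python sketch of the write-ahead idea. It is a toy under stated assumptions, not how PostgreSQL implements WAL: the record format (a CRC32 checksum followed by a JSON payload), the function names `encode_record` and `recover`, and the page names are all made up for illustration.

```python
import json
import zlib


def encode_record(page, value):
    """Serialize one log record as a line: CRC32 checksum, a space, JSON payload."""
    payload = json.dumps({"page": page, "value": value})
    return f"{zlib.crc32(payload.encode()):08x} {payload}\n"


def recover(log_text, data_pages):
    """Replay every complete log record onto data_pages; stop at a torn record."""
    pages = dict(data_pages)
    for line in log_text.splitlines(keepends=True):
        if not line.endswith("\n"):
            break  # partial record: the crash happened while writing the log
        checksum, _, payload = line.rstrip("\n").partition(" ")
        if f"{zlib.crc32(payload.encode()):08x}" != checksum:
            break  # damaged record: ignore it and everything after it
        record = json.loads(payload)
        pages[record["page"]] = record["value"]  # replaying twice gives the same result
    return pages


# Case A: crash while writing the log. The second record is cut off mid-write.
log_a = encode_record("p1", "alice") + encode_record("p2", "bob")[:10]
after_a = recover(log_a, {})

# Case B: crash after the log was written, before the data pages were updated.
log_b = encode_record("p1", "alice") + encode_record("p2", "bob")
after_b = recover(log_b, {"p1": "alice"})  # "p2" never reached the data file

print(after_a)
print(after_b)
```

In case A the torn record is simply discarded, so the data only reflects what was fully logged. In case B the log is complete, so recovery can finish the write that the crash interrupted.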
Jonathan,

Could you tell me what the real impact of "fsync=false" is on the WAL and on the database in the same catastrophic scenario?

Thierry Missimilly
> Could you tell me what is the real impact of "fsync=false" on the WAL and on the
> database in the same catastrophic scenario ?

I am not certain on this point, but I believe fsync=false messes up the whole thing. The nice thing about WAL is that fsync is no longer as much of a slowdown, because PG rarely has to do random-access writes to the disk.

Jon
Jon,

I have tried a little benchmark with pgbench on my 2-processor 2.4 GHz machine with 4 GB RAM and Linux RH 9.0. The database size is 700 MB, so it fits in memory.

Postgres 7.4 is on disk sda (root disk)
Meta data are on disk sdb
Bench data are on disk sdc

While pgbench is running, I can see with top that the CPUs are at 53% I/O wait, mainly because Postgres is writing blocks to disk sdb. The result is 222 transactions per second (tps).

With "fsync=false", the CPU I/O wait decreases to 0.6%, and the result is 466 tps.

So, should I conclude that even if the whole database is in memory, the TPS is slowed down by the WAL mechanism, which waits for the log to be written to disk? And that the main way to increase the TPS while preserving data consistency in case of a crash is to increase the I/O throughput of the Postgres WAL disk, by creating a RAID0 on a Fibre Channel subsystem? (I will test that as soon as possible.)

Regards,
Thierry
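The numbers above (222 tps with fsync, 466 tps without) are roughly a 2.1x difference. The effect can be reproduced in miniature with a short Python sketch that appends small records to a file, with and without forcing each one to disk. This is only an illustration of why waiting on fsync costs throughput, not a model of pgbench or PostgreSQL: the function name `timed_appends`, the 128-byte record size, and the record count are arbitrary assumptions, and the measured ratio depends heavily on the disk, the filesystem, and any write caches.

```python
import os
import tempfile
import time


def timed_appends(n_records, sync_each):
    """Append n_records small records to a temp file; optionally fsync after each."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(n_records):
            os.write(fd, b"x" * 128)
            if sync_each:
                os.fsync(fd)  # block until the OS reports the data as written out
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)


n = 200
with_sync = timed_appends(n, sync_each=True)
without_sync = timed_appends(n, sync_each=False)
print(f"fsync per record: {n / with_sync:.0f} records/s")
print(f"no fsync:         {n / without_sync:.0f} records/s")
```

Without the fsync call the writes only reach the operating system's cache, which is why they are fast and also why they can be lost in a power failure.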
On Fri, 28 Nov 2003 15:19:36 +0100, Thierry Missimilly wrote:

> I have tried a little bench with pgbench on my 2 proc 2.4 Gb with 4 GB RAM
> and Linux RH 9.0.
> ...

Which filesystem, in which mode? Yes, that's relevant and in fact the make-or-break factor here, at least from the POV of the hard drive. I guess RH9 uses ext3 in journaled mode by default, which does data as well as metadata journaling. Retry your benchmarks with both ext2 and ext3 in data=writeback mode; both results should be much closer to each other.

> So, should i conclude that even if the whole database is in memory, the
> TPS result is slow down by the WAL mechanism which wait for writing the

No, you need to take the workings of your filesystem into account. As soon as data journaling comes into play, it is normal and in fact unavoidable that performance drops, because everything is effectively written twice - once into the log, once into the file - and to do so the drive head has to move. WAL with ext3's data journaling is quite unnecessary because the WAL sort of IS the database's journal.

Holger
--
A: Maybe because some people are too annoyed by top-posting.
Q: Why do I not get an answer to my question(s)?
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
> WAL with ext3's data journaling is quite unnecessary because the WAL
> sort of IS the database's journal.

I believe you are mistaken. ext3 data journalling only covers the filesystem. It has no concept of the structure of the database itself. WAL is still necessary to keep consistency on the table itself.
Jonathan Bartlett wrote:

> > WAL with ext3's data journaling is quite unnecessary because the WAL
> > sort of IS the database's journal.
>
> I believe you are mistaken. ext3 data journalling only does the
> filesystem. It has no concept of the structure of the database itself.
> WAL is still necessary to keep consistency on the table itself.

What he means is that PostgreSQL doesn't need the file contents restored to a pristine state on crash recovery, just the directory structure; WAL can recreate the file contents.

--
Bruce Momjian | http://candle.pha.pa.us
Holger Hoffstaette wrote:

> No, you need to take the working of your filesystem into account. As soon
> as data journaling comes into play, it is normal and in fact unavoidable
> that performance drops, because everything is written effectively twice -
> once into the log, once into the file, and to do so the drive has to move.
> WAL with ext3's data journaling is quite unnecessary because the WAL
> sort of IS the database's journal.

Logically that seems right, but in practice it may be untrue. I've found that for my apps, data=journal performs better. When I was picking filesystems, I did a whole bunch of Googling, and there were quite a few people who also said data=journal performed faster for their Postgres or DB config.

Here's one explanation I found:

"If the database is seeking all over the filesystem and then running fsync(), then ext3 in data=journal mode can make a huge difference, because all the dirty data is written out *linearly* to the journal, for later asynchronous writeback. This can offer 10x speedups or more."
Holger Hoffstaette wrote:

> Which filesystem in which mode? Yes, that's relevant and in fact the
> make-or-break factor here, at least from the POV of the hard drive.
> I guess RH9 uses ext3 in journaled mode by default, which does data as
> well as metadata journaling. Retry your benchmarks with both ext2 and ext3
> in data=writeback mode; both results should be much closer to each other.

You are right, my filesystems are ext3.

With data=writeback mode, the TPS increases by 18% and the I/O wait decreases from 54% to 30%.

I did not change my filesystem to ext2, as I would have to delete the partition and recreate the whole database. Furthermore, I have understood that a journaled filesystem allows a better and faster fsck after a power-off crash, and that it is not redundant with WAL crash recovery.

I think that journaling works at the filesystem level and WAL sits above it, at the database level. What happens if the xlog filesystem is damaged by a power-off? All the data consistency provided by PG would be lost. I hope that the data stored in the FS journal can prevent that.

Thierry Missimilly
On Fri, 2003-12-05 at 02:40, Thierry Missimilly wrote:

> With the data=writeback mode, I increase the TPS by 18% and decrease the wait
> I/O from 54% to 30%.

What's the recommended method of changing an ext3 partition to data=writeback mode on RH9?

I tried the tune2fs -j method to set the default journal type and rebooted, but saw *no* performance differences, so I was wondering if setting the default actually put it in writeback mode.

Does anyone know if there's an easy way to verify the mode, or if I'm setting it wrong?
Cott Lang wrote:

> What's the recommended method of changing an ext3 partition to
> data=writeback mode on RH9?

For example, with root privileges:

    mount -t ext3 -o data=writeback /dev/sdb1 /data1

or in /etc/fstab:

    /dev/sdb1 /data1 ext3 data=writeback 0 0

> I tried the tune2fs -j method to set the default journal type and
> rebooted, but saw *no* performance differences, so I was wondering if
> setting the default actually put it in writeback mode.
>
> Does anyone know if there's an easy way to verify the mode, or if I'm
> setting it wrong ?

By default the mode is "ordered". The command

    mount

shows how the filesystems are mounted.

Thierry Missimilly
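For checking the mode from a script rather than by eye, a small Python sketch along the following lines could pull the data= option out of a mount table. It assumes the table is in the usual /proc/mounts layout (device, mount point, type, comma-separated options); the function name `data_mode` and the sample devices and mount points are hypothetical. Whether a given kernel lists the data= option there at all is also an assumption, so a result of None only means the option was not listed, in which case the default ("ordered", as noted above) is the likely mode.

```python
def data_mode(mounts_text, mount_point):
    """Return the data= journaling mode listed for mount_point, or None if not listed."""
    mode = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        mode = None  # a later entry for the same mount point overrides an earlier one
        for option in fields[3].split(","):
            if option.startswith("data="):
                mode = option[len("data="):]
    return mode


# Sample text in /proc/mounts layout; devices and mount points are hypothetical.
sample = (
    "/dev/sda1 / ext3 rw 0 0\n"
    "/dev/sdb1 /data1 ext3 rw,data=writeback 0 0\n"
)
print(data_mode(sample, "/data1"))
print(data_mode(sample, "/"))

# On a live Linux system, read the real table instead:
# with open("/proc/mounts") as f:
#     print(data_mode(f.read(), "/data1"))
```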
On Mon, 2003-12-08 at 02:53, Thierry Missimilly wrote:

> For example, with root privileges:
> mount -t ext3 -o data=writeback /dev/sdb1 /data1

Thanks, but my problem is I need to change the root partition to data=writeback, which you can't do by changing fstab. :(
Cott Lang <cott@internetstaff.com> writes:

> Thanks, but my problem is I need to change the root partition to
> data=writeback, which you can't do by changing fstab. :(

I think there is a kernel boot argument for this, but I don't know what it's called. Google will probably turn it up...

You shouldn't have your database files on / on a production system anyway. /var should be a separate partition.

-Doug