Thread: "full_page_writes" makes no difference?
Hi guys,

No matter whether I turn "full_page_writes" on or off, I always observe 8192-byte writes of log data for simple write operations (write/update).

But according to the documentation, turning this off can speed up operations but may cause problems during recovery. So, I guess this is because it writes less when the option is turned off. However, this contradicts my observations ...

If I am not missing anything, the writes of log data go through the function "XLogWrite" in the source file "backend/access/transam/xlog.c".

In this file, log data are written with the following code:

    from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
    nbytes = npages * (Size) XLOG_BLCKSZ;
    if (write(openLogFile, from, nbytes) != nbytes)
    {
        ...
    }

So, "nbytes" should always be multiples of XLOG_BLCKSZ, which in the default case, is 8192.

My question is, if it always writes full pages no matter "full_page_writes" is on or off, what is the difference?

Thanks!

Regards,
- Tian
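The arithmetic in the quoted snippet can be sketched in a few lines of Python. This is only an illustration of why the write size comes out as a multiple of the page size; `pages_spanned` is a hypothetical helper invented for this sketch and is not how XLogWrite itself determines `npages`.

```python
XLOG_BLCKSZ = 8192  # default WAL page size, as stated in the post


def pages_spanned(start_offset, end_offset):
    """Hypothetical helper: whole WAL pages covering bytes [start_offset, end_offset)."""
    first = start_offset // XLOG_BLCKSZ
    last = (end_offset - 1) // XLOG_BLCKSZ
    return last - first + 1


def write_size(npages):
    """Same arithmetic as `nbytes = npages * XLOG_BLCKSZ` in the quoted code."""
    return npages * XLOG_BLCKSZ


# 100 bytes of log data still occupy one whole page.
print(write_size(pages_spanned(0, 100)))
# 300 bytes that straddle a page boundary occupy two whole pages.
print(write_size(pages_spanned(8000, 8300)))
```

Under these assumptions the size passed to write() depends only on how many pages are touched, not on how many bytes of log data they hold.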
On Wed, 2011-05-04 at 00:17 -0400, Tian Luo wrote:
> So, "nbytes" should always be multiples of XLOG_BLCKSZ, which in the
> default case, is 8192.
>
> My question is, if it always writes full pages no matter
> "full_page_writes" is on or off, what is the difference?

Most I/O systems and filesystems can end up writing only part of a page (in this case, 8192 bytes) in the event of a power failure, which is called a "torn page". That can cause problems for PostgreSQL, because the page will be a mix of old and new data, which is corrupt.

The solution is "full page writes", which means that when a data page is modified for the first time after a checkpoint, PostgreSQL logs the entire contents of the page (except the free space) to WAL, and can use that as a starting point during recovery. This results in extra WAL data for safety, but it's unnecessary if your filesystem + I/O system guarantee that there will be no torn pages (and that's the only safe time to turn it off).

So, to answer your question, the difference is that full_page_writes=off means less total WAL data, which means fewer 8192-byte writes in the long run (you have to test long enough to go through a checkpoint to see this difference, however). PostgreSQL will never issue write() calls with 17 bytes, or some other odd number, regardless of the full_page_writes setting.

I can see how the name is slightly misleading, but it has to do with whether to write this extra information to WAL (where "extra information" happens to be "full data pages" in this case), not whether to write the WAL itself in full pages.

Regards,
Jeff Davis
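To make the accounting described above concrete, here is a toy model in Python. It is not PostgreSQL code: the 100-byte record size, the function names, and the checkpoint trigger are assumptions made for illustration, and the model logs a whole 8192-byte page image even though the real system leaves out the free space.

```python
import math

BLCKSZ = 8192        # data page size (PostgreSQL default)
XLOG_BLCKSZ = 8192   # WAL page size (PostgreSQL default)
RECORD_SIZE = 100    # assumed size of an ordinary WAL record, illustration only


def wal_bytes(touched_pages, full_page_writes, checkpoint_every=None):
    """Total WAL bytes for a sequence of data-page modifications.

    touched_pages: data page numbers, one entry per modification.
    checkpoint_every: simulate a checkpoint before every Nth modification
    (None means no checkpoint occurs during the run).
    The run is assumed to start right after a checkpoint.
    """
    seen_since_checkpoint = set()
    total = 0
    for i, page in enumerate(touched_pages):
        if checkpoint_every and i > 0 and i % checkpoint_every == 0:
            seen_since_checkpoint.clear()   # checkpoint: every page is "new" again
        total += RECORD_SIZE                # the ordinary change record
        if full_page_writes and page not in seen_since_checkpoint:
            total += BLCKSZ                 # first touch since checkpoint: page image
        seen_since_checkpoint.add(page)
    return total


def wal_pages(nbytes):
    """WAL is written in whole pages, so round up to a page count."""
    return math.ceil(nbytes / XLOG_BLCKSZ)


workload = [1, 2, 1, 2, 3, 1]
for fpw in (False, True):
    n = wal_bytes(workload, fpw, checkpoint_every=3)
    print(f"full_page_writes={fpw}: {n} bytes -> {wal_pages(n)} WAL page(s)")
```

In both settings the model writes WAL in whole 8192-byte pages; only the number of pages changes, and each simulated checkpoint makes the gap wider because every page has to be imaged again on its next modification.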
Thanks Jeff. It makes sense now.

I did a test with DBT2, turning "full_page_writes" on and off. The arguments were set to "-d 200 -w 1 -c 10" for a short test.

There is a roughly 7-fold difference in the number of pages written:
With the option on, 1066 pages are written;
With the option off, 158 pages are written.

I agree with you that the name "full_page_writes" is a little bit misleading.

- Tian

On Wed, May 25, 2011 at 5:59 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> [...]
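As a quick check on the DBT2 numbers reported in this message, the following Python snippet works out the ratio and the byte totals. It assumes each counted page is one 8192-byte WAL page, which the message does not state explicitly.

```python
XLOG_BLCKSZ = 8192  # assumed size in bytes of each counted page

pages_on = 1066   # pages written with full_page_writes on (reported above)
pages_off = 158   # pages written with full_page_writes off (reported above)

ratio = pages_on / pages_off
bytes_on = pages_on * XLOG_BLCKSZ
bytes_off = pages_off * XLOG_BLCKSZ
saved = bytes_on - bytes_off

print(f"ratio: {ratio:.2f}x")     # ratio: 6.75x
print(f"on:    {bytes_on} bytes")   # on:    8732672 bytes
print(f"off:   {bytes_off} bytes")  # off:   1294336 bytes
print(f"saved: {saved} bytes")      # saved: 7438336 bytes
```

Under that assumption, the short run wrote about 8.7 MB of WAL with the option on and about 1.3 MB with it off.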