Thread: WAL sync behaviour
Hi, We're having problems with our PostgreSQL server using forever for simple queries, even when there's little load -- or rather, the transactions seem to take forever to commit. We're using 8.1 (yay!) on a single Opteron, with WAL on the system two-disk (software) RAID-1, separate from the database four-disk RAID-10. All drives are 10000rpm SCSI disks, with write cache turned off; we value our data :-) We're running Linux 2.6.13.4, with 64-bit kernel but 32-bit userspace. The main oddity is that simple transactions take forever to execute, even on small tables with no triggers. A COMMIT on an otherwise idle system with one row to commit can take anything from 60-200ms to execute, which seems quite excessive -- sometimes (and I've verified that there's not a checkpoint or vacuum going on at that time), transactions seem to pile up and you get behaviour like: LOG: duration: 836.004 ms statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.100', hostname = 'mivu-03.samfundet.no'WHERE posid = 'mivu-03' LOG: duration: 753.545 ms statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.110', hostname = 'mivu-13.samfundet.no'WHERE posid = 'mivu-13' LOG: duration: 567.914 ms statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.109', hostname = 'mivu-12.samfundet.no'WHERE posid = 'mivu-12' LOG: duration: 515.013 ms statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.105', hostname = 'mivu-08.samfundet.no'WHERE posid = 'mivu-08' LOG: duration: 427.541 ms statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.104', hostname = 'mivu-07.samfundet.no'WHERE posid = 'mivu-07' LOG: duration: 383.314 ms statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.107', hostname = 'mivu-10.samfundet.no'WHERE posid = 'mivu-10' LOG: duration: 348.965 ms statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.103', hostname = 'mivu-06.samfundet.no'WHERE posid = 'mivu-06' LOG: duration: 314.465 ms statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.101', hostname = 'mivu-04.samfundet.no'WHERE posid = 'mivu-04' LOG: duration: 824.893 ms statement: UPDATE mivu3ping SET pingtime = NOW(), ip = '129.241.93.106', hostname = 'mivu-09.samfundet.no'WHERE posid = 'mivu-09' Sometimes, six or seven of these transactions even seem to wait for the same thing, reporting finishing times of something like 6, 5, 4, 3 and 2 seconds right after each other in the log! This is not a highly loaded system, so I don't really see why this should happen. (We had the same problems with 7.4, but if my imagination isn't playing games on me, they seem to have become slightly worse with 8.1.) strace shows that fdatasync() takes almost all that time, but when I run my own fdatasync() test program on the same file system, I can consistently sync a file (after an 8kB write) in about 30ms every time, so I don't really know why this would be so much slower with PostgreSQL. We're using the cfq scheduler, but deadline and noop give about the same results. Setting wal_sync_method = open_sync seems to improve the situation dramatically on simple commits; we get down into the 10-30ms range on an idle system. OTOH, behaviour seems to get slightly worse when there's more stuff going on, and we still get the 300ms transactions in batches every now and then. Any good ideas? /* Steinar */ -- Homepage: http://www.sesse.net/
Steinar H. Gunderson wrote: > Hi, > > We're having problems with our PostgreSQL server using forever for simple > queries, even when there's little load -- or rather, the transactions seem > to take forever to commit. We're using 8.1 (yay!) on a single Opteron, with > WAL on the system two-disk (software) RAID-1, separate from the database > four-disk RAID-10. All drives are 10000rpm SCSI disks, with write cache > turned off; we value our data :-) We're running Linux 2.6.13.4, with 64-bit > kernel but 32-bit userspace. You're beyond my area of expertise, but I do know that someone's going to ask what filesystem this is (ext2/xfs/etc). And probably to see the strace too. Hmm - the only things I can think to check: Do vmstat/iostat show any unusual activity? Are your system logs on these disks too? Could it be the journalling on the fs clashing with the WAL? -- Richard Huxton Archonet Ltd
On Thu, Nov 10, 2005 at 02:14:30PM +0000, Richard Huxton wrote: > You're beyond my area of expertise, but I do know that someone's going > to ask what filesystem this is (ext2/xfs/etc). Ah, yes, I forgot -- it's ext3. We're considering simply moving the WAL onto a separate partition (with data=writeback and noatime) if that can help us any. > And probably to see the strace too. The strace with wal_sync_method = fdatasync goes like this (I attach just before I do the commit): cirkus:~> sudo strace -T -p 15718 Process 15718 attached - interrupt to quit read(8, "\27\3\1\0 ", 5) = 5 <2.635856> read(8, "\336\333\24KB\325Ga\324\264[\307v\254h\254\350\20\220a"..., 32) = 32 <0.000031> read(8, "\27\3\1\0000", 5) = 5 <0.000027> read(8, "$E\217<z\350bI\2217\317\3662\301\273\233\17\177\256\225"..., 48) = 48 <0.000026> send(7, "\3\0\0\0\30\0\0\0\20\0\0\0f=\0\0commit;\0", 24, 0) = 24 <0.000071> gettimeofday({1131632603, 187599}, NULL) = 0 <0.000026> time(NULL) = 1131632603 <0.000027> open("pg_xlog/0000000100000000000000A2", O_RDWR|O_LARGEFILE) = 14 <0.000039> _llseek(14, 12500992, [12500992], SEEK_SET) = 0 <0.000026> write(14, "]\320\1\0\1\0\0\0\0\0\0\0\0\300\276\242\362\0\0\0\31\0"..., 8192) = 8192 <0.000057> fdatasync(14) = 0 <0.260194> gettimeofday({1131632603, 448459}, NULL) = 0 <0.000034> time(NULL) = 1131632603 <0.000027> time([1131632603]) = 1131632603 <0.000025> getpid() = 15718 <0.000025> rt_sigaction(SIGPIPE, {0x558a27e0, [], 0}, {SIG_IGN}, 8) = 0 <0.000029> send(3, "<134>Nov 10 15:23:23 postgres[15"..., 121, 0) = 121 <0.000032> rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0 <0.000029> send(7, "\4\0\0\0\330\3\0\0\20\0\0\0f=\0\0\247@\0\0\16\0\0\0\1\0"..., 984, 0) = 984 <0.000076> send(7, "\4\0\0\0\330\3\0\0\20\0\0\0f=\0\0\247@\0\0\16\0\0\0\0\0"..., 984, 0) = 984 <0.000051> send(7, "\4\0\0\0\330\3\0\0\20\0\0\0f=\0\0\247@\0\0\16\0\0\0\0\0"..., 984, 0) = 984 <0.000050> send(7, "\4\0\0\0\250\0\0\0\20\0\0\0f=\0\0\247@\0\0\2\0\0\0\0\0"..., 168, 0) = 168 <0.000050> send(7, "\4\0\0\0\250\0\0\0\20\0\0\0f=\0\0\0\0\0\0\2\0\0\0\0\0\0"..., 168, 0) = 168 <0.000049> send(7, "\3\0\0\0\27\0\0\0\20\0\0\0f=\0\0<IDLE>\0", 23, 0) = 23 <0.000047> write(8, "\27\3\1\0 B\260\253rq)\232\265o\225\272\235\v\375\31\323"..., 90) = 90 <0.000229> read(8, <unfinished ...> Process 15718 detached > Do vmstat/iostat show any unusual activity? No, there's not much activity. In fact, it's close to idle. > Are your system logs on these disks too? Yes, they are, but nothing much is logged, really -- and sync is off for most of the logs in syslogd. > Could it be the journalling on the fs clashing with the WAL? Unsure -- that's what I was hoping to get some information on :-) /* Steinar */ -- Homepage: http://www.sesse.net/
On Thu, Nov 10, 2005 at 03:25:35PM +0100, Steinar H. Gunderson wrote: >Ah, yes, I forgot -- it's ext3. We're considering simply moving the WAL onto >a separate partition (with data=writeback and noatime) if that can help us >any. There's no reason to use a journaled filesystem for the wal. Use ext2 in preference to ext3. Mike Stone
On Thu, 2005-11-10 at 08:43, Michael Stone wrote: > On Thu, Nov 10, 2005 at 03:25:35PM +0100, Steinar H. Gunderson wrote: > >Ah, yes, I forgot -- it's ext3. We're considering simply moving the WAL onto > >a separate partition (with data=writeback and noatime) if that can help us > >any. > > There's no reason to use a journaled filesystem for the wal. Use ext2 in > preference to ext3. Not from what I understood. Ext2 can't guarantee that your data will even be there in any form after a crash. I believe only metadata journaling is needed though.
On Thu, Nov 10, 2005 at 09:52:38AM -0600, Scott Marlowe wrote: >Not from what I understood. Ext2 can't guarantee that your data will >even be there in any form after a crash. It can if you sync the data. (Which is the whole point of the WAL.) >I believe only metadata journaling is needed though. If you don't sync, metadata journaling doesn't do anything to guarantee that your data will be there, so you're adding no data security in the non-synchronous-write case. (Which is irrelevant for the WAL.) What metadata journalling gets you is fast recovery from crashes by avoiding a fsck. The fsck time is related to the number of files on a filesystem--so it's generally pretty quick on a WAL partition anyway. Mike Stone
Scott Marlowe <smarlowe@g2switchworks.com> writes: > On Thu, 2005-11-10 at 08:43, Michael Stone wrote: >> There's no reason to use a journaled filesystem for the wal. Use ext2 in >> preference to ext3. > Not from what I understood. Ext2 can't guarantee that your data will > even be there in any form after a crash. I believe only metadata > journaling is needed though. No, Mike is right: for WAL you shouldn't need any journaling. This is because we zero out *and fsync* an entire WAL file before we ever consider putting live WAL data in it. During live use of a WAL file, its metadata is not changing. As long as the filesystem follows the minimal rule of syncing metadata about a file when it fsyncs the file, all the live WAL files should survive crashes OK. We can afford to do this mainly because WAL files can normally be recycled instead of created afresh, so the zero-out overhead doesn't get paid during normal operation. You do need metadata journaling for all non-WAL PG files, since we don't fsync them every time we extend them; which means the filesystem could lose track of which disk blocks belong to such a file, if it's not journaled. regards, tom lane
On Thu, 2005-11-10 at 10:39, Tom Lane wrote: > Scott Marlowe <smarlowe@g2switchworks.com> writes: > > On Thu, 2005-11-10 at 08:43, Michael Stone wrote: > >> There's no reason to use a journaled filesystem for the wal. Use ext2 in > >> preference to ext3. > > > Not from what I understood. Ext2 can't guarantee that your data will > > even be there in any form after a crash. I believe only metadata > > journaling is needed though. > > No, Mike is right: for WAL you shouldn't need any journaling. This is > because we zero out *and fsync* an entire WAL file before we ever > consider putting live WAL data in it. During live use of a WAL file, > its metadata is not changing. As long as the filesystem follows > the minimal rule of syncing metadata about a file when it fsyncs the > file, all the live WAL files should survive crashes OK. > > We can afford to do this mainly because WAL files can normally be > recycled instead of created afresh, so the zero-out overhead doesn't > get paid during normal operation. > > You do need metadata journaling for all non-WAL PG files, since we don't > fsync them every time we extend them; which means the filesystem could > lose track of which disk blocks belong to such a file, if it's not > journaled. Thanks for the clarification! Nice to know I can setup an ext2 partition for my WAL files then. Is this in the docs anywhere?
On Thu, Nov 10, 2005 at 11:39:34AM -0500, Tom Lane wrote: > No, Mike is right: for WAL you shouldn't need any journaling. This is > because we zero out *and fsync* an entire WAL file before we ever > consider putting live WAL data in it. During live use of a WAL file, > its metadata is not changing. As long as the filesystem follows > the minimal rule of syncing metadata about a file when it fsyncs the > file, all the live WAL files should survive crashes OK. Yes, with emphasis on the zero out... :-) > You do need metadata journaling for all non-WAL PG files, since we don't > fsync them every time we extend them; which means the filesystem could > lose track of which disk blocks belong to such a file, if it's not > journaled. I think there may be theoretical problems with regard to the ordering of the fsync operation, for files that are not pre-allocated. For example, if a new block is allocated - there are two blocks that need to be updated. The indirect reference block (or inode block, if block references fit into the inode entry), and the block itself. If the indirect reference block is written first, before the data block, the state of the disk is inconsistent. This would be a crash during the fsync() operation. The metadata journalling can ensure that the data block is allocated first, and then all the necessary references updated, allowing for the operation to be incomplete and rolled back, or committed in full. Or, that is my understanding, anyways, and this is why I would not use ext2 for the database, even if it was claimed that fsync() was used. For WAL, with pre-allocated zero blocks? Sure. Ext2... :-) mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Scott Marlowe <smarlowe@g2switchworks.com> writes: > Thanks for the clarification! Nice to know I can setup an ext2 > partition for my WAL files then. Is this in the docs anywhere? Don't think so ... want to write something up? Hard part is to figure out where to put it ... regards, tom lane
On Thu, Nov 10, 2005 at 12:44:15PM -0500, Tom Lane wrote: > Don't think so ... want to write something up? Hard part is to > figure out where to put it ... To be honest, I think we could use a "newbie's guide to PostgreSQL performance tuning". I've seen rather good guides for query tuning, and guides for general performance tuning, but none that really cover both in a coherent way. (Also, many of the ones I've seen start getting rather dated; after bitmap index scans arrived, for instance, many of the rules with regard to index planning probably changed.) I'd guess http://www.powerpostgresql.com/PerfList is a rather good start for the second part (and it's AFAICS under a free license); having something like that in the docs (or some other document) would probably be a good start. /* Steinar */ -- Homepage: http://www.sesse.net/