Re: Load distributed checkpoint - Mailing list pgsql-hackers
From | Takayuki Tsunakawa
---|---
Subject | Re: Load distributed checkpoint
Date |
Msg-id | 037901c72427$859864f0$19527c0a@OPERAO
In response to | Load distributed checkpoint (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>)
Responses | Re: Load distributed checkpoint
 | Re: Load distributed checkpoint
 | Re: Load distributed checkpoint
List | pgsql-hackers
Hello, Itagaki-san, all,

I have to report a sad result: your patch didn't work. Let's consider the solution together. What you are addressing is very important for system designers in the real world -- smoothing response time.

Recall that unpatched PostgreSQL showed the following tps figures in case (1) (i.e. with default bgwriter_* and checkpoint_* settings):

235 80 226 77 240

The patched PostgreSQL showed the following tps figures:

230 228 77 209 66

[disk usage]

The same tendency can be seen as with the unpatched PostgreSQL, that is: when the tps is low, the %util of the disk for data files is high, and the %util of the disk for WAL is low.

Why is transaction logging disturbed by the cleaning and/or syncing activity? While the bgwriter is fsync()ing, it does not lock any data structures that the transactions want to access. And even though WAL and data share the same SCSI controller and bus, they are on different disks, and the bandwidth does not appear to be exhausted, since Ultra320 is said to deliver about 256 MB/s in practice. (Recall that WAL is on sdd and the data files are on sde.)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 810.78 0.00 102.20 0.00 7306.99 0.00 3653.49 71.50 1.12 10.95 7.32 74.77
sde 0.00 25.35 0.00 6.19 0.00 252.30 0.00 126.15 40.77 0.50 81.32 5.94 3.67

sdd 0.00 884.20 0.00 126.00 0.00 8080.00 0.00 4040.00 64.13 1.26 10.00 7.11 89.64
sde 0.00 21.40 0.00 5.00 0.00 211.20 0.00 105.60 42.24 0.31 62.56 6.52 3.26

sdd 0.00 924.80 0.00 116.20 0.00 8326.40 0.00 4163.20 71.66 1.23 10.59 7.37 85.64
sde 0.00 27.60 0.00 26.60 0.00 433.60 0.00 216.80 16.30 4.24 159.29 2.44 6.50

sdd 0.00 721.20 0.00 102.40 0.00 6588.80 0.00 3294.40 64.34 0.99 9.71 7.07 72.40
sde 0.00 1446.80 0.00 101.60 0.00 20289.60 0.00 10144.80 199.70 1192.40 572.45 2.29 23.30

sdd 0.00 0.00 0.00 0.20 0.00 1.60 0.00 0.80 8.00 0.11 539.00 539.00 10.80
sde 0.00 0.00 0.00 452.10 0.00 0.00 0.00 0.00 0.00 3829.57 3715.83 2.22 100.22

sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 349.80 0.00 0.00 0.00 0.00 0.00 1745.52 8515.74 2.86 100.02

sdd 0.00 442.40 0.00 51.00 0.00 3948.80 0.00 1974.40 77.43 0.60 11.73 7.54 38.46
sde 0.00 2.80 0.00 184.00 0.00 25.60 0.00 12.80 0.14 277.52 12629.41 3.19 58.74

sdd 0.00 898.00 0.00 124.80 0.00 8182.40 0.00 4091.20 65.56 1.30 10.40 7.24 90.30
sde 0.00 19.20 0.00 3.80 0.00 184.00 0.00 92.00 48.42 0.24 62.11 14.11 5.36

sdd 0.00 842.28 0.00 109.02 0.00 7612.02 0.00 3806.01 69.82 1.33 12.26 8.35 91.02
sde 0.00 45.49 0.00 46.89 0.00 739.08 0.00 369.54 15.76 9.04 192.73 3.38 15.85

sdd 0.00 1198.41 0.00 71.51 0.00 10505.18 0.00 5252.59 146.90 128.19 99.76 13.48 96.43
sde 0.00 1357.77 0.00 199.80 0.00 19263.75 0.00 9631.87 96.41 2251.09 1179.42 2.39 47.81

sdd 0.00 0.00 0.00 7.20 0.00 0.00 0.00 0.00 0.00 203.87 5671.83 138.92 100.02
sde 0.00 0.00 0.00 409.60 0.00 0.00 0.00 0.00 0.00 3171.04 4779.83 2.44 100.02

sdd 0.00 0.00 0.00 17.80 0.00 0.00 0.00 0.00 0.00 137.87 10240.90 56.19 100.02
sde 0.00 0.00 0.00 240.60 0.00 0.00 0.00 0.00 0.00 1573.85 9815.29 4.16 100.02

sdd 0.00 109.80 0.00 35.40 0.00 1012.80 0.00 506.40 28.61 42.14 7974.47 27.86 98.64
sde 0.00 2.80 0.00 198.80 0.00 30.40 0.00 15.20 0.15 428.49 14474.39 4.30 85.56

sdd 0.00 466.20 0.00 62.80 0.00 4230.40 0.00 2115.20 67.36 0.59 9.49 6.79 42.62
sde 0.00 5.20 0.00 0.80 0.00 48.00 0.00 24.00 60.00 0.01 16.25 11.25 0.90

sdd 0.00 0.00 0.00 0.20 0.00 1.60 0.00 0.80 8.00 0.01 35.00 35.00 0.70
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
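For what it's worth, here is a tiny back-of-the-envelope check of the bandwidth point above -- just an illustrative sketch I added, assuming only that iostat's wsec/s column counts 512-byte sectors, and taking the peak values from the excerpt and the ~256 MB/s Ultra320 figure mentioned in the text:

/* Back-of-the-envelope check: convert the peak iostat wsec/s readings
 * above (512-byte sectors per second) into MB/s and compare with the
 * ~256 MB/s practical Ultra320 figure mentioned in the text. */
#include <stdio.h>

int main(void)
{
    const double sector = 512.0;              /* bytes per sector        */
    const double peak_wal_wsec  = 10505.18;   /* sdd (WAL) peak wsec/s   */
    const double peak_data_wsec = 20289.60;   /* sde (data) peak wsec/s  */
    const double bus_mb = 256.0;              /* Ultra320, "in practice" */

    double wal_mb  = peak_wal_wsec  * sector / (1024 * 1024);   /* ~5.1 MB/s */
    double data_mb = peak_data_wsec * sector / (1024 * 1024);   /* ~9.9 MB/s */

    printf("WAL  (sdd) peak write rate: %.1f MB/s\n", wal_mb);
    printf("data (sde) peak write rate: %.1f MB/s\n", data_mb);
    printf("combined: %.1f MB/s, i.e. about %.0f%% of %.0f MB/s\n",
           wal_mb + data_mb, 100.0 * (wal_mb + data_mb) / bus_mb, bus_mb);
    return 0;
}

Even at the busiest sample, the two disks together move on the order of 15 MB/s, so the SCSI bus itself really does look far from saturated.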
I suspect that fsync() is the criminal, as I've been afraid. Let me show you some interesting data: I collected stack traces of the backend processes while a checkpoint was happening.

[bgwriter]
Oh, he is fsync()ing hard.

#0  0x0000003a629bfbb2 in __fsync_nocancel () from /lib64/tls/libc.so.6
#1  0x00000000005742a1 in mdsync ()
#2  0x00000000005753d7 in smgrsync ()
#3  0x0000000000564d65 in FlushBufferPool ()
...

[some backends]
They are forced to wait for a lock pertaining to WAL when they try to insert a log record.

#0  0x0000003a629c7b79 in semop () from /lib64/tls/libc.so.6
#1  0x000000000054c9c6 in PGSemaphoreLock ()
#2  0x0000000000572374 in LWLockAcquire ()
#3  0x000000000046ab10 in XLogInsert ()

[some backends]
They are waiting for a lock when they commit.

#0  0x0000003a629c7b79 in semop () from /lib64/tls/libc.so.6
#1  0x000000000054c9c6 in PGSemaphoreLock ()
#2  0x0000000000572374 in LWLockAcquire ()
#3  0x000000000046a746 in XLogFlush ()
#4  0x000000000046228e in RecordTransactionCommit ()

[the second criminal backend]
He is making the other backends wait for WALWriteLock, because he is performing a long-running operation -- fdatasync() -- while holding it. But I think he is also a victim: he is probably being delayed by the bgwriter's fsync() activity.

#0  0x0000003a629bfc59 in fdatasync () from /lib64/tls/libc.so.6
#1  0x0000000000469ac0 in issue_xlog_fsync ()
#2  0x000000000046a210 in XLogWrite ()
#3  0x000000000046a7df in XLogFlush ()
#4  0x000000000046228e in RecordTransactionCommit ()

The following stack trace sometimes appears as the second criminal instead of the above:

#0  0x0000003a629bfbb2 in __fsync_nocancel () from /lib64/tls/libc.so.6
#1  0x000000000046530e in UpdateControlFile ()
#2  0x000000000046a05d in XLogWrite ()
#3  0x000000000046a7df in XLogFlush ()
#4  0x000000000046228e in RecordTransactionCommit ()

[Conclusion]
I believe that the problem cannot be solved in a real sense by avoiding fsync()/fdatasync(). We can't ignore what commercial databases have done so far. The kernel does as much as it likes when PostgreSQL asks it to fsync().

One remaining question is the disk utilization: while the bgwriter is fsync()ing, the %util of the WAL disk drops to almost 0, yet the bandwidth of the Ultra320 SCSI bus does not appear to be fully used. Any idea?

So what should we do next?
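P.S. To make the lock interaction in the stack traces above concrete, here is a small standalone toy program -- just an illustrative sketch, not PostgreSQL source; every name, file and size in it is invented. "Backend" threads hold a mutex, standing in for WALWriteLock, across a small write()+fdatasync() of a WAL-like file, while a "bgwriter" thread dirties a few hundred megabytes and then fsync()s them in one go, the way mdsync() does at checkpoint time:

/* Toy reproduction (NOT PostgreSQL code) of the pattern in the stack
 * traces above: committers hold a mutex across a small fdatasync(),
 * while another thread issues one huge checkpoint-style fsync().
 *
 * Build: gcc -O2 -pthread -o walstall walstall.c
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static pthread_mutex_t wal_write_lock = PTHREAD_MUTEX_INITIALIZER;

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

/* "bgwriter": dirty many data pages, then fsync them all at once. */
static void *bgwriter(void *arg)
{
    (void) arg;
    int  fd = open("datafile.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    char page[8192];
    memset(page, 'd', sizeof(page));
    for (int i = 0; i < 32 * 1024; i++)       /* ~256 MB of dirty pages */
        if (write(fd, page, sizeof(page)) < 0)
            perror("write data");
    double t0 = now_sec();
    fsync(fd);                                /* checkpoint-style fsync */
    printf("bgwriter fsync took %.2f s\n", now_sec() - t0);
    close(fd);
    return NULL;
}

/* "backend": commit loop -- lock, append a WAL record, fdatasync, unlock. */
static void *backend(void *arg)
{
    (void) arg;
    int  fd = open("walfile.tmp", O_CREAT | O_WRONLY | O_APPEND, 0600);
    char rec[512];
    memset(rec, 'w', sizeof(rec));
    for (int i = 0; i < 500; i++)
    {
        double t0 = now_sec();
        pthread_mutex_lock(&wal_write_lock);  /* stand-in for WALWriteLock */
        if (write(fd, rec, sizeof(rec)) < 0)
            perror("write wal");
        fdatasync(fd);                        /* may stall behind the big fsync */
        pthread_mutex_unlock(&wal_write_lock);
        double dt = now_sec() - t0;
        if (dt > 0.5)
            printf("commit stalled for %.2f s\n", dt);
    }
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t bg, be[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&be[i], NULL, backend, NULL);
    sleep(1);                                 /* let commits get going first */
    pthread_create(&bg, NULL, bgwriter, NULL);
    pthread_join(bg, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(be[i], NULL);
    unlink("datafile.tmp");
    unlink("walfile.tmp");
    return 0;
}

On a single disk this stalls trivially, of course; the puzzling part in the iostat numbers above is that the same shape shows up even with WAL and data on separate spindles that share only the controller and bus.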