Re: Load distributed checkpoint V3 - Mailing list pgsql-patches
From | Takayuki Tsunakawa
---|---
Subject | Re: Load distributed checkpoint V3
Date |
Msg-id | 00ce01c777fa$877b1bb0$19527c0a@OPERAO
In response to | Load distributed checkpoint V3 (ITAGAKI Takahiro &lt;itagaki.takahiro@oss.ntt.co.jp&gt;)
Responses | Re: Load distributed checkpoint V3
List | pgsql-patches
Hello, long time no see. I'm sorry to interrupt your discussion.

I'm afraid the code is getting more complicated in order to continue using fsync(). Though I don't mean to say the current approach is wrong, could anyone evaluate again the O_SYNC approach that commercial databases use, and tell me whether and why PostgreSQL's fsync() approach is better than theirs? This January I got a good result with O_SYNC, which I haven't reported here yet. I'll show it briefly. Please forgive me for my abrupt email; I don't have enough time.

# Personally, I want to work in the community, if I'm allowed.

And sorry again: I reported last year that O_SYNC resulted in very bad performance, but that was wrong. The PC server I borrowed was configured so that all the disks formed one RAID5 device. Therefore the disks for data and WAL (/dev/sdd and /dev/sde) came from the same RAID5 device, resulting in I/O contention.

What I modified is md.c only. I just added O_SYNC to the open flags in mdopen() and _mdfd_openseg() when am_bgwriter is true. I didn't want backends to use O_SYNC, because mdextend() does not have to transfer data to disk. (A minimal sketch of this kind of change appears near the end of this message.)

My evaluation environment was:

- CPU: Intel Xeon 3.2GHz * 2 (HT on)
- Memory: 4GB
- Disk: Ultra320 SCSI (perhaps configured as write-back)
- OS: RHEL3.0 Update 6, kernel 2.4.21-37.ELsmp
- PostgreSQL: 8.2.1

The relevant settings of PostgreSQL are:

- shared_buffers = 2GB
- wal_buffers = 1MB
- wal_sync_method = open_sync

The checkpoint_* and bgwriter_* parameters were left at their defaults. I used pgbench with data of scaling factor 50.

[without O_SYNC, original behavior]

- pgbench -c1 -t16000: best response 1ms, worst response 6314ms, 10th-worst response 427ms, tps 318
- pgbench -c32 -t500: best response 1ms, worst response 8690ms, 10th-worst response 8668ms, tps 330

[with O_SYNC]

- pgbench -c1 -t16000: best response 1ms, worst response 350ms, 10th-worst response 91ms, tps 427
- pgbench -c32 -t500: best response 1ms, worst response 496ms, 10th-worst response 435ms, tps 1117

If the write-back cache were disabled, the difference would be smaller. The Windows version showed similar improvements.

However, this approach has two big problems.

(1) It slows down bulk updates

Updates of large amounts of data get much slower, because the bgwriter seeks to and writes dirty buffers synchronously page by page. For example:

- COPY of accounts (5 million records) plus a CHECKPOINT command after the COPY: without O_SYNC 100sec, with O_SYNC 1046sec
- UPDATE of all records of accounts: without O_SYNC 139sec, with O_SYNC 639sec
- CHECKPOINT command flushing 1.6GB of dirty buffers: without O_SYNC 24sec, with O_SYNC 126sec

To mitigate this problem, I sorted the dirty buffers by their relfilenode and block numbers and wrote multiple pages that are adjacent both in memory and on disk in one call. The result was:

- COPY of accounts (5 million records) plus a CHECKPOINT command after the COPY: 227sec
- UPDATE of all records of accounts: 569sec
- CHECKPOINT command flushing 1.6GB of dirty buffers: 71sec

Still bad...

(2) It can't utilize tablespaces

Though I didn't evaluate it, update activity would be much less efficient with O_SYNC than with fsync() when using multiple tablespaces, because there is only one bgwriter.

Can anyone solve these problems? One of my ideas is to use scattered I/O. I hear that readv()/writev() have done real scattered I/O since kernel 2.6 (RHEL4.0); with kernels before 2.6 they just performed the I/Os sequentially. Windows has provided reliable scattered I/O for years.
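To make the scattered I/O idea concrete, here is a minimal standalone sketch. This is not PostgreSQL code: the file name, block numbers, and buffer contents are invented for illustration. It writes four pages that are scattered in memory but adjacent on disk with a single lseek() plus writev(), the way the sorted dirty buffers above could be flushed:

```c
/*
 * Sketch: flush pages that are scattered in memory but adjacent on disk
 * with one writev() call instead of one write() per page.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default page size */
#define NPAGES 4

int
main(void)
{
    char        *pages[NPAGES];
    struct iovec iov[NPAGES];
    int          i;

    int fd = open("testfile.dat", O_RDWR | O_CREAT | O_SYNC, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Simulate dirty buffers scattered around memory ... */
    for (i = 0; i < NPAGES; i++)
    {
        pages[i] = malloc(BLCKSZ);
        memset(pages[i], i, BLCKSZ);
        iov[i].iov_base = pages[i];
        iov[i].iov_len = BLCKSZ;
    }

    /*
     * ... that belong to consecutive block numbers, say blocks 10..13.
     * After sorting by (relfilenode, block number), one seek plus one
     * writev() replaces four separate write() calls.
     */
    if (lseek(fd, (off_t) 10 * BLCKSZ, SEEK_SET) < 0)
        perror("lseek");
    if (writev(fd, iov, NPAGES) != (ssize_t) NPAGES * BLCKSZ)
        perror("writev");

    close(fd);
    for (i = 0; i < NPAGES; i++)
        free(pages[i]);
    return 0;
}
```

On a 2.6 kernel this should reach the block layer as one request; on 2.4, as noted above, it degenerates into sequential writes.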
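And for reference, here is the minimal sketch of the O_SYNC change mentioned at the top of this message. Again, this is a standalone illustration, not the actual md.c patch: the file name is invented, and the am_bgwriter variable stands in for the flag tested in mdopen()/_mdfd_openseg():

```c
/*
 * Sketch: open a data file with O_SYNC (for the bgwriter only) so that
 * each write() reaches disk before returning, making later fsync()
 * calls on the file unnecessary.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default page size */

int
main(void)
{
    int  am_bgwriter = 1;       /* stand-in for the real bgwriter flag */
    int  flags = O_RDWR | O_CREAT;
    char page[BLCKSZ];

    if (am_bgwriter)
        flags |= O_SYNC;        /* the one-line change described above */

    int fd = open("testfile.dat", flags, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    memset(page, 0, sizeof(page));
    /* With O_SYNC, write() does not return until the page is on disk. */
    if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
        perror("write");

    close(fd);
    return 0;
}
```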
Another idea is to use async I/O, possibly combined with a multiple-bgwriter approach on platforms where async I/O is not available. How about the chance Josh-san has brought?
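As one illustration of the async I/O idea, here is a minimal standalone sketch using POSIX AIO. Whether this interface is suitable for the bgwriter is untested and an assumption on my part; the file name and offsets are invented, and on Linux it needs -lrt:

```c
/*
 * Sketch: queue several page writes with POSIX AIO and collect the
 * results afterwards, instead of blocking on each synchronous write.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default page size */
#define NREQS  4

int
main(void)
{
    static char  pages[NREQS][BLCKSZ];
    struct aiocb cb[NREQS];
    int          i;

    int fd = open("testfile.dat", O_RDWR | O_CREAT, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Queue NREQS page writes without waiting for each to finish. */
    memset(cb, 0, sizeof(cb));
    for (i = 0; i < NREQS; i++)
    {
        memset(pages[i], i, BLCKSZ);
        cb[i].aio_fildes = fd;
        cb[i].aio_buf = pages[i];
        cb[i].aio_nbytes = BLCKSZ;
        cb[i].aio_offset = (off_t) i * BLCKSZ;
        if (aio_write(&cb[i]) != 0)
            perror("aio_write");
    }

    /* Collect the results; real code could do useful work meanwhile. */
    for (i = 0; i < NREQS; i++)
    {
        const struct aiocb *const list[1] = { &cb[i] };

        while (aio_error(&cb[i]) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        if (aio_return(&cb[i]) != BLCKSZ)
            fprintf(stderr, "aio write %d failed\n", i);
    }

    close(fd);
    return 0;
}
```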