Re: WAL prefetch - Mailing list pgsql-hackers
From | Konstantin Knizhnik |
---|---|
Subject | Re: WAL prefetch |
Date | |
Msg-id | 9823470b-96a7-683e-1a9c-485be74b28b8@postgrespro.ru |
In response to | Re: WAL prefetch (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
List | pgsql-hackers |
On 08.07.2018 00:47, Tomas Vondra wrote:
> Hi,
>
> I've done a bit of testing on the current patch, mostly to see how much
> the prefetching can help (if at all). While the patch is still in early
> WIP stages (at least that's my assessment, YMMV), the improvements are
> already quite significant.
>
> I had also planned to compare it to the pg_prefaulter [1] which kinda
> started this all, but I've been unable to get it working with my very
> limited knowledge of golang. I've fixed the simple stuff (references to
> renamed PostgreSQL functions etc.) but then it does not do anything :-(
> I wonder if it works on FreeBSD only, or something like that ...
>
> So this compares only master with and without WAL prefetching.
>
> Instead of killing the server and measuring local recovery (which is
> what Konstantin did before), I've decided to use replication. That is,
> set up a replica, run pgbench on the master and see how much apply lag
> we end up with over time. I find this much easier to reproduce, monitor
> over time, do longer runs, ...
>
> master
> ------
> * 32 cores (2x E5-2620v4)
> * 32GB of RAM
> * Intel Optane SSD 280GB
> * shared_buffers=4GB
> * max_wal_size=128GB
> * checkpoint_timeout=30min
>
> replica
> -------
> * 4 cores (i5-2500k)
> * 8GB RAM
> * 6x Intel S3700 SSD (RAID0)
> * shared_buffers=512MB
> * effective_cache_size=256MB
>
> I've also decided to use pgbench scale 1000 (~16GB), which fits into RAM
> on the master but not the replica. This may seem like a strange choice,
> but I think it's not entirely crazy, for a couple of reasons:
>
> * It's not entirely uncommon to have replicas with different hardware
> configurations. For HA it's a bad idea, but there are valid use cases.
>
> * Even with the same hardware config, you may have a very different
> workload on the replica, accessing a very different subset of the data.
> Consider the master doing OLTP on a small active set, while the replica
> runs BI queries on almost all data, pushing everything else from RAM.
>
> * It amplifies the effect of prefetching, which is nice for testing.
>
> * I don't have two machines with exactly the same config anyway ;-)
>
> The pgbench test is then executed on the master like this:
>
> pgbench -c 32 -j 8 -T 3600 -l --aggregate-interval=1 test
>
> The replica is unlikely to keep up with the master, so the question is
> how much apply lag we end up with at the end.
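A note for anyone reproducing this setup: one way to track the apply lag over time is to sample the receive/replay LSN difference on the replica, along the lines of the sketch below (assuming PostgreSQL 10+ function names and default psql connection settings; the log file name is arbitrary):

```sh
# Once per second, record how many bytes of received WAL have not yet
# been replayed on the replica -- a proxy for the apply lag.
while sleep 1; do
    psql -At -c "SELECT now(),
                        pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                        pg_last_wal_replay_lsn())"
done >> apply-lag.log
```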
> Without prefetching, it's ~70GB of WAL. With prefetching, it's only
> about 30GB. Considering the 1-hour test generates about 90GB of WAL,
> this means the replay speed grew from 20GB/h to almost 60GB/h. That's
> a rather measurable improvement ;-)
>
> The attached replication-lag.png chart shows how the lag grows over
> time. The "bumps" after ~30 minutes coincide with a checkpoint,
> triggering FPIs for a short while. The record-size.png and fpi-size.png
> charts come from pg_waldump and show what part of the WAL consists of
> regular records and FPIs.
>
> Note: I've done two runs with each configuration, so there are four data
> series on all charts.
>
> With prefetching the lag drops down a bit after a while (by about the
> same amount of WAL), while without prefetching it does not. My
> explanation is that the replay is so slow it does not get to the FPIs
> until after the test - so it happens, but we don't see it here.
>
> Now, how does this look on system metrics? Without prefetching we see
> low CPU usage, because the process is waiting for I/O. And the I/O is
> under-utilized, because we only issue one request at a time (which means
> short I/O queues and low utilization of the individual devices in the
> RAID).
>
> In this case I see that without prefetching, the replay process uses
> about 20% of a CPU. Prefetching increases this to ~60%, which is nice.
>
> At the storage level, the utilization of each device in the RAID0 array
> is ~20%, and with prefetching enabled this jumps up to ~40%. If you look
> at IOPS instead, that jumps from ~2000 to ~6500, so about 3x. How is
> this possible when the utilization grew only ~2x? We're generating
> longer I/O queues (20 requests instead of 3), and the devices can
> optimize them quite a bit.
>
> I think there's room for additional improvement. We probably can't get
> the CPU usage to 100%, but 60% is still quite low. The storage can
> certainly handle more requests; the devices are doing something only
> about 40% of the time.
>
> But overall it looks quite nice, and I think it's worth continuing to
> work on.
>
> BTW, to make this work I had to tweak NUM_AUXILIARY_PROCS (increase it
> from 4 to 5), otherwise InitAuxiliaryProcess() fails because there's no
> room for an additional process. I assume it works with local recovery,
> but once you need to start the walreceiver it fails.
>
> regards

Thank you very much for such a precise and detailed investigation of my patch. Right now I am on vacation, but I am going to continue working on it. Any advice on what else can be improved or refactored in this patch is welcome.
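For reference, breakdowns like the record-size and fpi-size charts mentioned above can be produced with pg_waldump's --stats mode, roughly as sketched below (the segment names are hypothetical placeholders; substitute the actual files from pg_wal):

```sh
# Summarize WAL contents by record type, including FPI volume, for a
# range of segments. STARTSEG/ENDSEG are placeholder file names.
STARTSEG=000000010000000100000001
ENDSEG=00000001000000010000000A
pg_waldump --stats=record "$PGDATA/pg_wal/$STARTSEG" "$PGDATA/pg_wal/$ENDSEG"
```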