Re: design for parallel backup - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: design for parallel backup |
| Msg-id | 20200503023643.x2o2244sa4vkikyb@alap3.anarazel.de |
| In response to | Re: design for parallel backup (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: design for parallel backup |
| List | pgsql-hackers |
Hi,
On 2020-05-01 16:32:15 -0400, Robert Haas wrote:
> On Thu, Apr 30, 2020 at 6:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Apr 30, 2020 at 3:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > Why 8kb? That's smaller than what we currently do in pg_basebackup,
> > > afaict, and you're actually going to be bottlenecked by syscall
> > > overhead at that point (unless you disable / don't have the whole intel
> > > security mitigation stuff).
> >
> > I just picked something. Could easily try other things.
>
> I tried changing the write size to 64kB, keeping the rest the same.
> Here are the results:
>
> filesystem media 1@64GB 2@32GB 4@16GB 8@8GB 16@4GB
> xfs mag 65 53 64 74 79
> ext4 mag 96 68 75 303 437
> xfs ssd 75 43 29 33 38
> ext4 ssd 96 68 63 214 254
> spread spread n/a n/a 43 38 40
>
> And here again are the previous results with an 8kB write size:
>
> xfs mag 97 53 60 67 71
> ext4 mag 94 68 66 335 549
> xfs ssd 97 55 33 27 25
> ext4 ssd 116 70 66 227 450
> spread spread n/a n/a 48 42 44
>
> Generally, those numbers look better than the previous numbers, but
> parallelism still looks fairly appealing on the SSD storage - less so
> on magnetic disks, at least in this test.
I spent a fair bit of time analyzing this, and my conclusion is that you
might largely be seeing numa effects. Yay.
I don't have as large a numa machine at hand, but here's what I'm
seeing on my local machine during a run of writing out 400GiB (this
particular run had other noise on the machine; the benchmarks below do
not). The machine has 192GiB of ram, evenly distributed to two sockets
/ numa domains.
At start I see
numastat -m | grep -E 'MemFree|MemUsed|Dirty|Writeback|Active\(file\)|Inactive\(file\)'
(columns: Node 0, Node 1, Total; values in MB)
MemFree 91908.20 92209.85 184118.05
MemUsed 3463.05 4553.33 8016.38
Active(file) 105.46 328.52 433.98
Inactive(file) 68.29 190.14 258.43
Dirty 0.86 0.90 1.76
Writeback 0.00 0.00 0.00
WritebackTmp 0.00 0.00 0.00
For a while there's pretty decent IO throughput (all 10s samples):
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz
d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1955.67 2299.32 0.00 0.00 42.48 1203.94
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 82.10 89.33
Then it starts to be slower on a sustained basis:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz
d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1593.33 1987.85 0.00 0.00 42.90 1277.55
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 67.55 76.53
And then performance tanks completely:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz
d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 646.33 781.85 0.00 0.00 132.68 1238.70
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 85.43 58.63
That amount of degradation confused me for a while, especially because I
couldn't reproduce it the more controlled I made the setups. In
particular, I stopped seeing issues of the same magnitude after pinning
processes to one numa socket (both execution and memory).
After a few seconds:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz
d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1882.00 2320.07 0.00 0.00 42.50 1262.35
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 79.05 88.07
MemFree 35356.50 80986.46 116342.96
MemUsed 60014.75 15776.72 75791.47
Active(file) 179.44 163.28 342.72
Inactive(file) 58293.18 13385.15 71678.33
Dirty 18407.50 882.00 19289.50
Writeback 235.78 335.43 571.21
WritebackTmp 0.00 0.00 0.00
A bit later io starts to get slower:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz
d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1556.30 1898.70 0.00 0.00 40.92 1249.29
0.00 0.00 0.00 0.00 0.00 0.00 0.20 24.00 62.90 72.01
MemFree 519.56 36086.14 36605.70
MemUsed 94851.69 60677.04 155528.73
Active(file) 303.84 212.96 516.80
Inactive(file) 92776.70 58133.28 150909.97
Dirty 10913.20 5374.07 16287.27
Writeback 812.94 331.96 1144.90
WritebackTmp 0.00 0.00 0.00
And then later it gets worse:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz
d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1384.70 1671.25 0.00 0.00 40.87 1235.91
0.00 0.00 0.00 0.00 0.00 0.00 0.20 7.00 55.89 63.45
MemFree 519.54 242.98 762.52
MemUsed 94851.71 96520.20 191371.91
Active(file) 175.82 246.03 421.85
Inactive(file) 92820.19 93985.79 186805.98
Dirty 10482.75 4140.72 14623.47
Writeback 0.00 0.00 0.00
WritebackTmp 0.00 0.00 0.00
When using a 1s iostat interval instead of 10s, it's noticeable that
performance swings widely between very slow (<100MB/s) and very high
throughput (>2500MB/s).
It's clearly visible that performance degrades substantially first when
all of a numa node's free memory is exhausted, then when the second numa
node's is.
Looking at a profile I see a lot of cacheline bouncing between the kernel
threads that "reclaim" pages (i.e. make them available for reuse), the
kernel threads that write out dirty pages, the kernel threads where the
IO completes (i.e. where the dirty bit can be flipped / locks get
released), and the writing process.
I think there's a lot the kernel side could improve - but it's
not too surprising that letting the kernel cache / forcing it to make
caching decisions for a large streaming write has substantial costs.
I changed Robert's test program to optionally fallocate(),
sync_file_range(SYNC_FILE_RANGE_WRITE) and
posix_fadvise(POSIX_FADV_DONTNEED), to avoid a large
footprint in the page cache. The performance
differences are quite substantial:
gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
/tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1
running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
[/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220
Comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1:
running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
[/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161
Below are the results of running the program with a variation of
parameters (both the file and the results are attached).
I used perf stat in this run to measure the difference in CPU
usage.
ref_cycles is the number of CPU cycles, across all 20 cores / 40
threads, during which the CPUs were doing *something*. It is not
affected by CPU frequency scaling, only by the time the CPUs were not
"halted", whereas cycles is affected by frequency scaling.
A high ref_cycles_sec, combined with a decent number of total
instructions/cycles, is *good*, because it indicates that fewer CPUs were
used, whereas a very high ref_cycles_tot means that more CPUs were
busy doing something for the duration of the benchmark.
The run-to-run variation between the runs without cache control is
pretty large, so these are probably not the end-all-be-all numbers. But I
think the trends are pretty clear.
test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0 248.430736196 1,497,048,950,014 150.653M/sec 1,226,822,167,960 0.123GHz 705,950,461,166 0.54
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1 310.275952938 1,921,817,571,226 154.849M/sec 1,499,581,687,133 0.121GHz 944,243,167,053 0.59
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0 243.609959554 1,802,385,405,203 184.970M/sec 1,449,560,513,247 0.149GHz 855,426,288,031 0.56
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0 230.880100449 1,328,417,418,799 143.846M/sec 1,148,924,667,393 0.124GHz 723,158,246,628 0.63
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1 253.591234992 1,548,485,571,798 152.658M/sec 1,229,926,994,613 0.121GHz 1,117,352,436,324 0.95
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1 164.488835158 911,974,902,254 138.611M/sec 760,756,011,483 0.116GHz 672,105,046,261 0.84
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0 192.151682414 1,526,440,715,456 198.603M/sec 1,037,135,756,007 0.135GHz 802,754,964,096 0.76
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1 242.648245159 1,782,637,416,163 183.629M/sec 1,463,696,313,881 0.151GHz 1,000,100,694,932 0.69
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0 421.580487642 2,756,486,952,728 163.449M/sec 1,387,708,033,752 0.082GHz 990,478,650,874 0.72
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0 169.854206542 1,333,619,626,854 196.282M/sec 1,036,261,531,134 0.153GHz 666,052,333,591 0.64
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1 305.078100578 1,970,042,289,192 161.445M/sec 1,505,706,462,812 0.123GHz 954,963,240,648 0.62
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1 166.295223626 1,290,699,256,763 194.044M/sec 857,873,391,283 0.129GHz 761,338,026,415 0.89
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0 256.156100686 2,407,922,637,215 235.003M/sec 1,133,311,037,956 0.111GHz 748,666,206,805 0.65
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1 215.255015340 1,977,578,120,924 229.676M/sec 1,461,504,758,029 0.170GHz 1,005,270,838,642 0.68
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0 334.932246893 2,366,388,662,460 176.628M/sec 1,216,049,589,993 0.091GHz 796,698,831,717 0.68
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0 161.697270285 1,866,036,713,483 288.576M/sec 1,068,181,502,433 0.165GHz 739,559,279,008 0.70
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1 231.440889430 1,965,389,749,057 212.391M/sec 1,407,927,406,358 0.152GHz 997,199,361,968 0.72
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1 214.433248700 2,232,198,239,769 260.300M/sec 1,073,334,918,389 0.125GHz 861,540,079,120 0.80
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0 644.521613661 3,688,449,404,537 143.079M/sec 2,020,128,131,309 0.078GHz 961,486,630,359 0.48
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0 243.830464632 1,499,608,983,445 153.756M/sec 1,227,468,439,403 0.126GHz 691,534,661,654 0.59
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1 292.866419420 1,753,376,415,877 149.677M/sec 1,483,169,463,392 0.127GHz 860,035,914,148 0.56
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1 162.152397194 925,643,754,128 142.719M/sec 743,208,501,601 0.115GHz 554,462,585,110 0.70
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0 211.369510165 1,558,996,898,599 184.401M/sec 1,359,343,408,200 0.161GHz 766,769,036,524 0.57
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=0 233.315094908 1,427,133,080,540 152.927M/sec 1,166,000,868,597 0.125GHz 743,027,329,074 0.64
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1 290.698155820 1,732,849,079,701 149.032M/sec 1,441,508,612,326 0.124GHz 835,039,426,282 0.57
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1 159.945462440 850,162,390,626 132.892M/sec 724,286,281,548 0.113GHz 670,069,573,150 0.90
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0 163.244592275 1,524,807,507,173 233.531M/sec 1,398,319,581,978 0.214GHz 689,514,058,243 0.46
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0 231.795934322 1,731,030,267,153 186.686M/sec 1,124,935,745,020 0.121GHz 736,084,922,669 0.70
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1 315.564163702 1,958,199,733,216 155.128M/sec 1,405,115,546,716 0.111GHz 1,000,595,890,394 0.73
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1 210.945487961 1,527,169,148,899 180.990M/sec 906,023,518,692 0.107GHz 700,166,552,207 0.80
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0 161.759094088 1,468,321,054,671 226.934M/sec 1,221,167,105,510 0.189GHz 735,855,415,612 0.59
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0 158.578248952 1,354,770,825,277 213.586M/sec 936,436,363,752 0.148GHz 654,823,079,884 0.68
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1 274.628500801 1,792,841,068,080 163.209M/sec 1,343,398,055,199 0.122GHz 996,073,874,051 0.73
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1 179.140070123 1,383,595,004,328 193.095M/sec 850,299,722,091 0.119GHz 706,959,617,654 0.83
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0 445.496787199 2,663,914,572,687 149.495M/sec 1,267,340,496,930 0.071GHz 787,469,552,454 0.62
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0 261.866083604 2,325,884,820,091 222.043M/sec 1,094,814,208,219 0.105GHz 649,479,233,453 0.57
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1 172.963505544 1,717,387,683,260 248.228M/sec 1,356,381,335,831 0.196GHz 822,256,638,370 0.58
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1 157.934678897 1,650,503,807,778 261.266M/sec 970,705,561,971 0.154GHz 637,953,927,131 0.66
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0 225.623143601 1,804,402,820,599 199.938M/sec 1,086,394,788,362 0.120GHz 656,392,112,807 0.62
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0 157.930900998 1,797,506,082,342 284.548M/sec 1,001,509,813,741 0.159GHz 644,107,150,289 0.66
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1 165.772265335 1,805,895,001,689 272.353M/sec 1,514,173,918,970 0.228GHz 823,435,044,810 0.54
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1 187.664764448 1,964,118,348,429 261.660M/sec 978,060,510,880 0.130GHz 668,316,194,988 0.67
Greetings,
Andres Freund