Re: where should I stick that backup? - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: where should I stick that backup? |
Date | |
Msg-id | CA+Tgmoa=fYTLHahw7dbMnvRTROY45c-TMN3P5-XZZYAww76oZg@mail.gmail.com |
In response to | Re: where should I stick that backup? (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: where should I stick that backup? |
List | pgsql-hackers |
On Thu, Apr 16, 2020 at 10:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm. Could we learn what we need to know about this by doing something
> as taking a basebackup of a cluster with some data in it (say, created
> by pgbench -i -s 400 or something) and then comparing the speed of cat
> < base.tar | gzip > base.tgz to the speed of gzip < base.tar >
> base.tgz? It seems like there's no difference between those except
> that the first one relays through an extra process and an extra pipe.

I decided to try this. First I experimented on my laptop using a
backup of a pristine pgbench database, scale factor 100, ~1.5GB.

[rhaas pgbackup]$ for i in 1 2 3; do echo "= run number $i = "; sync; sync; time gzip < base.tar > base.tar.gz; rm -f base.tar.gz; sync; sync; time cat < base.tar | gzip > base.tar.gz; rm -f base.tar.gz; sync; sync; time cat < base.tar | cat | cat | gzip > base.tar.gz; rm -f base.tar.gz; done
= run number 1 =

real    0m24.011s
user    0m23.542s
sys     0m0.408s

real    0m23.623s
user    0m23.447s
sys     0m0.908s

real    0m23.688s
user    0m23.847s
sys     0m2.085s
= run number 2 =

real    0m23.704s
user    0m23.290s
sys     0m0.374s

real    0m23.389s
user    0m23.239s
sys     0m0.879s

real    0m23.762s
user    0m23.888s
sys     0m2.057s
= run number 3 =

real    0m23.567s
user    0m23.187s
sys     0m0.361s

real    0m23.573s
user    0m23.422s
sys     0m0.903s

real    0m23.749s
user    0m23.884s
sys     0m2.113s

It looks like piping everything through an extra copy of 'cat' may
even be *faster* than having the process read it directly; two out of
three runs with the extra "cat" finished very slightly quicker than
the test where gzip read the file directly. The third set of numbers
for each test run is with three copies of "cat" interposed. That
appears to be slower than with no extra pipes, but not very much, and
it might just be noise.

Next I tried it out on Linux. For this I used 'cthulhu', an older box
with lots and lots of memory and cores. Here I took the scale factor
up to 400, so it's about 5.9GB of data. Same command as above produced
these results:

= run number 1 =

real    2m35.797s
user    2m30.990s
sys     0m4.760s

real    2m35.407s
user    2m32.730s
sys     0m16.714s

real    2m40.598s
user    2m39.054s
sys     0m37.596s
= run number 2 =

real    2m35.529s
user    2m30.971s
sys     0m4.510s

real    2m33.933s
user    2m31.685s
sys     0m16.003s

real    2m45.563s
user    2m44.042s
sys     0m40.357s
= run number 3 =

real    2m35.876s
user    2m31.437s
sys     0m4.391s

real    2m33.872s
user    2m31.629s
sys     0m16.266s

real    2m40.836s
user    2m39.359s
sys     0m38.960s

These results are pretty similar to the macOS results. The overall
performance was worse, but I think that is probably explained by the
fact that the MacBook has a Haswell-class processor rather than
Westmere, and significantly higher RAM speed. The pattern that one
extra pipe seems to be perhaps slightly faster, and three extra pipes
a tad slower, persists. So at least in this test, the overhead added
by each pipe appears to be <1%, which I would classify as good enough
not to worry too much about.

> I don't know exactly how to do the equivalent of this on Windows, but
> I bet somebody does.

However, I still don't know what the situation is on Windows. I did do
some searching around on the Internet to try to find out whether pipes
being slow on Windows is a generally-known phenomenon, and I didn't
find anything very compelling, but I don't have an environment set up
to test it myself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
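
For anyone who wants to repeat the test with a different number of
relay stages, the one-liner above can be generalized. What follows is
only a sketch, not the script that produced the numbers in this
message: "relay" is a hypothetical helper name, and it assumes a POSIX
shell with cat and gzip on the PATH.

```shell
# relay N: copy stdin to stdout through N extra "cat" processes
# (and therefore N extra pipes), then compress the result with gzip.
# "relay" is a hypothetical helper name, not part of the original test.
relay() {
    if [ "$1" -gt 0 ]; then
        # Add one cat stage, then recurse for the remaining stages.
        cat | relay $(( $1 - 1 ))
    else
        # No stages left: gzip reads whatever is on stdin.
        gzip
    fi
}

# Same data path as the three cases in the one-liner:
#   relay 0 < base.tar > base.tar.gz   # gzip < base.tar
#   relay 1 < base.tar > base.tar.gz   # cat < base.tar | gzip
#   relay 3 < base.tar > base.tar.gz   # cat < base.tar | cat | cat | gzip
```

In bash, where "time" is a shell keyword that can time a function
call, "time relay 3 < base.tar > base.tar.gz" would time the three-cat
case.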