Re: CREATE DATABASE with filesystem cloning - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: CREATE DATABASE with filesystem cloning |
| Date | |
| Msg-id | CA+hUKGKO=R=K5meb4+n2hmH6W=_Q01TT_89KPJS6BdDuJTJQdQ@mail.gmail.com |
| In response to | Re: CREATE DATABASE with filesystem cloning (Andrew Dunstan <andrew@dunslane.net>) |
| List | pgsql-hackers |
On Mon, Oct 9, 2023 at 2:20 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> I've had to disable COW on my BTRFS-resident buildfarm animals (see
> previous discussion re Direct I/O).

Right, because it is still buggy[1].  I don't see any sign that a fix
has been committed yet, assuming that is the right thing (and it sure
sounds like it).  It means you still have to disable COW to run the
004_io_direct.pl test for now, but that's an independent thing that
will hopefully be fixed soon, and you can still run PostgreSQL just
fine with COW enabled as it is by default, as long as you don't turn
on debug_io_direct (which isn't for users yet anyway).

Since I hadn't actually tried this cloning stuff out on Linux/btrfs
before and was claiming that it should work, I took it for a quick
unscientific spin (literally: this is on a spinning SATA disk, for
extra crunchy slowness...).  I created a scale 500 pgbench database,
saw that du -h showed 7.4G, and got these times:

postgres=# create database foodb_copy template=foodb strategy=file_copy;
CREATE DATABASE
Time: 124019.885 ms (02:04.020)
postgres=# create database foodb_clone template=foodb strategy=file_clone;
CREATE DATABASE
Time: 8618.195 ms (00:08.618)

That's something, but not as good as I was expecting, so let's also
try Linux/XFS for reference on the same spinny rust...  One thing I
learned is that if you have an existing XFS partition, it might have
been created without reflinks enabled (see the output of xfs_info), as
that was the default not very long ago and it can't be changed later,
so on the box I'm writing from I had to do a fresh mkfs.xfs to see any
benefit from this.

postgres=# create database foodb_copy template=foodb strategy=file_copy;
CREATE DATABASE
Time: 49157.876 ms (00:49.158)
postgres=# create database foodb_clone template=foodb strategy=file_clone;
CREATE DATABASE
Time: 1026.455 ms (00:01.026)

Not bad.  To understand what that did, we can check which physical
blocks on disk hold the first segment of the pgbench_accounts table in
foodb and foodb_clone:

$ sudo xfs_bmap /mnt/xfs/pgdata/base/16384/16400
/mnt/xfs/pgdata/base/16384/16400:
        0: [0..1637439]: 977586048..979223487
        1: [1637440..2097151]: 1464966136..1465425847
$ sudo xfs_bmap /mnt/xfs/pgdata/base/16419/16400
/mnt/xfs/pgdata/base/16419/16400:
        0: [0..1637439]: 977586048..979223487
        1: [1637440..2097151]: 1464966136..1465425847

The same blocks.  Now let's update a tuple on the second page of
pgbench_accounts in the clone:

foodb=# update pgbench_accounts set bid = bid + 1 where ctid = '(1, 1)';
UPDATE 1
foodb=# checkpoint;
CHECKPOINT

Now some new physical disk blocks have been allocated just for that
page, but the rest are still clones:

$ sudo xfs_bmap /mnt/xfs/pgdata/base/16419/16400
/mnt/xfs/pgdata/base/16419/16400:
        0: [0..15]: 977586048..977586063
        1: [16..31]: 977586064..977586079
        2: [32..1637439]: 977586080..979223487
        3: [1637440..2097151]: 1464966136..1465425847

I tried changing it to work in 1MB chunks and add the CFI() (v2
attached), and it didn't affect the time measurably and also didn't
generate any extra extents as displayed by xfs_bmap, so the end result
is the same.  I haven't looked into the chunked version on the other
file systems yet.

I don't have the numbers to hand (different machines far from me right
now), but FreeBSD/ZFS and macOS/APFS were on the order of a few hundred
milliseconds for the same scale of pgbench on laptop storage (so not
comparable with the above).
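To make the 1MB-chunk-plus-CFI() idea concrete for anyone who doesn't
want to open the patch, here is a simplified sketch of the shape of
the loop, not the attached v2 itself: the function name and error
message are invented, and it assumes copy_file_range() is the
primitive doing the cloning on this platform (the real code has to
cope with other platforms and fallbacks).

#include "postgres.h"

#include <unistd.h>

#include "miscadmin.h"

#define CLONE_CHUNK_SIZE (1024 * 1024)	/* 1MB between CFI() calls */

static void
clone_file_in_chunks(const char *srcpath, const char *dstpath,
					 int src_fd, int dst_fd, off_t nbytes)
{
	off_t		done = 0;

	while (done < nbytes)
	{
		size_t		request = Min((off_t) CLONE_CHUNK_SIZE, nbytes - done);
		ssize_t		copied;

		/* NULL offsets: use and advance both files' positions. */
		copied = copy_file_range(src_fd, NULL, dst_fd, NULL, request, 0);
		if (copied < 0)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not clone file \"%s\" to \"%s\": %m",
							srcpath, dstpath)));
		if (copied == 0)
			break;				/* source shorter than expected */
		done += copied;

		/* The whole point of chunking: stay responsive to interrupts. */
		CHECK_FOR_INTERRUPTS();
	}
}

As far as I know, XFS and btrfs implement copy_file_range() by sharing
extents where they can, which would explain why chunking didn't
generate any extra extents in the xfs_bmap output above; the chunk
size only bounds how long we go between CHECK_FOR_INTERRUPTS() calls.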
I also tried a -s 5000 database, and saw that XFS could clone a 74GB
database just as fast as the 7.4GB one (still ~1s).  At a guess, this
is going to scale not so much with total data size as with things like
the number of relations, segment size and internal (invisible)
fragmentation due to previous cloning/update history, in
filesystem-dependent ways, since those are the things that generate
extents (contiguous ranges of physical blocks to be referenced by the
new file).

[1] https://lore.kernel.org/linux-btrfs/ae81e48b0e954bae1c3451c0da1a24ae7146606c.1676684984.git.boris@bur.io/T/#u