Thread: problems with making relfilenodes 56-bits
OK, so the recent commit and revert of the 56-bit relfilenode patch revealed a few issues that IMHO need design-level input. Let me try to surface those here, starting a new thread to separate this discussion from the clutter:

1. Commit Record Alignment. ParseCommitRecord() and ParseAbortRecord() are dependent on every subsidiary structure that can be added to a commit or abort record requiring exactly 4-byte alignment. IMHO, this seems awfully fragile, even leaving the 56-bit patch aside. Prepare records seem to have a much saner scheme: they've also got a bunch of different things that can be stuck onto the main record, but they maxalign each top-level thing that they stick in there. So ParsePrepareRecord() doesn't have to make any icky alignment assumptions the way ParseCommitRecord() and ParseAbortRecord() do. Unfortunately, that scheme doesn't work as well for commit records, because the very first top-level thing only needs 2 bytes. We're currently using 4, and it would obviously be nicer to cut that down to 2 than to have it go up to 8. We could try to rejigger things around somehow to avoid needing that 2-byte quantity in there as a separate top-level item, but I'm not quite sure how to do that, or we could just copy everything to ensure alignment, but that seems kind of expensive. If we don't decide to do either of those things, we should at least better document, and preferably enforce via asserts (see the sketch after this message), the requirement that these structs be exactly 4-byte aligned, so that nobody else makes the same mistake in the future.

2. WAL Size. Block references in the WAL are by RelFileLocator, so if you make RelFileLocators bigger, WAL gets bigger. We'd have to test the exact impact of this, but it seems a bit scary: if you have a WAL stream with few FPIs doing DML on a narrow table, probably most records will contain 1 block reference (and occasionally more, but I guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that block reference feels like it might add up to something significant. I don't really see any way around this, either: if you make relfilenode values wider, they take up more space. Perhaps there's a way to claw that back elsewhere, or we could do something really crazy like switch to variable-width representations of integer quantities in WAL records, but there doesn't seem to be any simple way forward other than, you know, deciding that we're willing to pay the cost of the additional WAL volume.

3. Sinval Message Size. Sinval messages are 16 bytes right now. They'll have to grow to 20 bytes if we do this. There's even less room for bit-squeezing here than there is for the WAL stuff. I'm skeptical that this really matters, but Tom seems concerned.

4. Other Uses of RelFileLocator. There are a bunch of structs I haven't looked into yet that also embed RelFileLocator, which may have their own issues with alignment, padding, and/or size: ginxlogSplit, ginxlogDeletePage, ginxlogUpdateMeta, gistxlogPageReuse, xl_heap_new_cid, xl_btree_reuse_page, LogicalRewriteMappingData, xl_smgr_truncate, xl_seq_rec, ReorderBufferChange, FileTag. I think a bunch of these are things that get written into WAL, but at least some of them seem like they probably don't get written into WAL enough to matter. Needs more investigation, though.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
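A minimal sketch of the kind of compile-time enforcement point 1 asks for. The struct names are commit/abort-record subsidiary structs assumed from src/include/access/xact.h; the real list would need to be checked against what ParseCommitRecord() and ParseAbortRecord() actually walk, and the structs' member alignment requirement (no 8-byte members) would still have to be verified separately:

#include "postgres.h"
#include "access/xact.h"

/*
 * Each subsidiary chunk appended to a commit/abort record must be a
 * multiple of 4 bytes, or the next chunk ends up misaligned when the
 * record is parsed in place.
 */
#define AssertCommitChunkAligned(type) \
	StaticAssertDecl(sizeof(type) % sizeof(uint32) == 0, \
					 #type " must be a multiple of 4 bytes")

AssertCommitChunkAligned(xl_xact_subxacts);
AssertCommitChunkAligned(xl_xact_invals);
/* ...and likewise for the other subsidiary structs. */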
Robert Haas <robertmhaas@gmail.com> writes: > 3. Sinval Message Size. Sinval messages are 16 bytes right now. > They'll have to grow to 20 bytes if we do this. There's even less room > for bit-squeezing here than there is for the WAL stuff. I'm skeptical > that this really matters, but Tom seems concerned. As far as that goes, I'm entirely prepared to accept a conclusion that the benefits of widening relfilenodes justify whatever space or speed penalties may exist there. However, we cannot honestly make that conclusion if we haven't measured said penalties. The same goes for the other issues you raise here. regards, tom lane
On Wed, Sep 28, 2022 at 4:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > As far as that goes, I'm entirely prepared to accept a conclusion > that the benefits of widening relfilenodes justify whatever space > or speed penalties may exist there. However, we cannot honestly > make that conclusion if we haven't measured said penalties. > The same goes for the other issues you raise here. I generally agree, but the devil is in the details. I tend to agree with Robert that many individual WAL record types just don't appear frequently enough to matter (it also helps that even the per-record space overhead with wider 56-bit relfilenodes isn't so bad). Just offhand I'd say that ginxlogSplit, ginxlogDeletePage, ginxlogUpdateMeta, gistxlogPageReuse and xl_btree_reuse_page are likely to be in this category (though would be nice to see some numbers for those). I'm much less sure about the other record types. Any WAL records with a variable number of relfilenode entries seem like they might be more of a problem. But I'm not ready to accept that that cannot be ameliorated in some way. Just for example, it wouldn't be impossible to do some kind of varbyte encoding for some record types. How many times will the cluster actually need billions of relfilenodes? It has to work, but maybe it can be suboptimal from a space overhead perspective. I'm not saying that we need to do anything fancy just yet. I'm just saying that there definitely *are* options. Maybe it's not really necessary to come up with something like a varbyte encoding, and maybe the complexity it imposes just won't be worth it -- I really have no opinion on that just yet. -- Peter Geoghegan
On Thu, 29 Sep 2022, 00:06 Robert Haas, <robertmhaas@gmail.com> wrote:
>
> 2. WAL Size. Block references in the WAL are by RelFileLocator, so if
> you make RelFileLocators bigger, WAL gets bigger. We'd have to test
> the exact impact of this, but it seems a bit scary: if you have a WAL
> stream with few FPIs doing DML on a narrow table, probably most
> records will contain 1 block reference (and occasionally more, but I
> guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
> block reference feels like it might add up to something significant. I
> don't really see any way around this, either: if you make relfilenode
> values wider, they take up more space. Perhaps there's a way to claw
> that back elsewhere, or we could do something really crazy like switch
> to variable-width representations of integer quantities in WAL
> records, but there doesn't seem to be any simple way forward other
> than, you know, deciding that we're willing to pay the cost of the
> additional WAL volume.

Re: WAL volume and record size optimization

I've been working off and on with WAL for some time now due to [0] and the interest of Neon in the area, and I think we can reduce the size of the base record by a significant margin:

Currently, our minimal WAL record is exactly 24 bytes: length (4B), TransactionId (4B), previous record pointer (8B), flags (1B), redo manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these fields, the TransactionId could reasonably be omitted for certain WAL records (as an example: index insertions don't really need the XID). Additionally, the length field could be made to be variable length, and any padding is just plain bad (adding 4 bytes to all insert/update/delete/lock records was frowned upon).

I'm working on a prototype patch for a more bare-bones WAL record header of which the only required fields would be prevptr (8B), CRC (4B), rmgr (1B) and flags (1B), for a minimal size of 14 bytes (a rough struct sketch follows below this message). I don't yet know the performance of this, but considering that there will be a lot more conditionals in header decoding it might be slower for any one backend, but faster overall (less overall IOps).

The flags field would be indications for additional information [flag name (bits): explanation (additional xlog header data in bytes)]:

- len_size (0..1): xlog record size is at most xlrec_header_only (0B), uint8_max (1B), uint16_max (2B), uint32_max (4B)
- has_xid (2): contains transaction ID of logging transaction (4B, or probably 8B when we introduce 64-bit xids)
- has_cid (3): contains the command ID of the logging statement (4B) (rationale for logging CID in [0]; now in the record header because XID is included there as well, and both are required for consistent snapshots)
- has_rminfo (4): has non-zero redo-manager flags field (1B) (rationale for a separate field in [1]; non-zero allows a 1B space optimization for one of each rmgr's operations)
- special_rel (5): pre-existing definition
- check_consistency (6): pre-existing definition
- unset (7): no meaning defined yet. Could be used for full record compression, or other purposes.

A normal record header (an XLOG record with at least some registered data) would be only 15 to 17 bytes (0-1B rminfo + 1-2B in xl_len), and one with an XID only up to 21 bytes. So, when compared to the current XLogRecord format, we would in general recover 2 or 3 bytes from the xl_tot_len field, 1 or 2 bytes from the alignment hole, and potentially the 4 bytes of the xid when that data is considered useless during recovery, or physical or logical replication.
Kind regards, Matthias van de Meent [0] https://postgr.es/m/CAEze2WhmU8WciEgaVPZm71vxFBOpp8ncDc%3DSdEHHsW6HS%2Bk9zw%40mail.gmail.com [1] https://postgr.es/m/20220715173731.6t3km5cww3f5ztfq%40awork3.anarazel.de
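Purely to make the layout described above concrete, the bare-bones header could look roughly like the following. This is a sketch, not the actual prototype patch: the type name and flag macros are invented, and the field widths simply follow the sizes listed above. Note that the C compiler would pad such a struct to 16 bytes, so a real implementation would presumably read and write the fields individually rather than via sizeof().

#include "postgres.h"
#include "access/rmgr.h"		/* RmgrId */
#include "access/xlogdefs.h"	/* XLogRecPtr */
#include "port/pg_crc32c.h"		/* pg_crc32c */

/* Hypothetical flag bits, mirroring the list in the message above. */
#define XLR_MIN_LENSIZE_MASK		0x03	/* 0/1/2/4-byte total-length field */
#define XLR_MIN_HAS_XID				0x04	/* 4B TransactionId follows */
#define XLR_MIN_HAS_CID				0x08	/* 4B CommandId follows */
#define XLR_MIN_HAS_RMINFO			0x10	/* 1B rmgr-specific info follows */
#define XLR_MIN_SPECIAL_REL			0x20	/* pre-existing definition */
#define XLR_MIN_CHECK_CONSISTENCY	0x40	/* pre-existing definition */

/* The 14 bytes of required on-disk header; optional fields follow per xl_flags. */
typedef struct XLogRecordMinimal
{
	XLogRecPtr	xl_prev;		/* 8B: start of previous record in WAL */
	pg_crc32c	xl_crc;			/* 4B: CRC covering the whole record */
	RmgrId		xl_rmid;		/* 1B: resource manager for this record */
	uint8		xl_flags;		/* 1B: which optional fields follow */
	/* then: 0-4B length, optional xid, cid and rminfo, per xl_flags */
} XLogRecordMinimal;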
On Thu, Sep 29, 2022 at 12:24 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Currently, our minimal WAL record is exactly 24 bytes: length (4B), > TransactionId (4B), previous record pointer (8B), flags (1B), redo > manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these > fields, TransactionID could reasonably be omitted for certain WAL > records (as example: index insertions don't really need the XID). > Additionally, the length field could be made to be variable length, > and any padding is just plain bad (adding 4 bytes to all > insert/update/delete/lock records was frowned upon). Right. I was shocked when I realized that we had two bytes of padding in there, considering that numerous rmgrs are stealing bits from the 1-byte field that identifies the record type. My question was: why aren't we exposing those 2 bytes for rmgr-type-specific use? Or for something like xl_xact_commit, we could get rid of xl_xact_info if we had those 2 bytes to work with. Right now, I see that a bare commit record is 34 bytes which rounds out to 40. With the trick above, we could shave off 4 bytes bringing the size to 30 which would round to 32. That's a pretty significant savings, although it'd be a lot better if we could get some kind of savings for DML records which could be much higher frequency. > I'm working on a prototype patch for a more bare-bones WAL record > header of which the only required fields would be prevptr (8B), CRC > (4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't > yet know the performance of this, but the considering that there will > be a lot more conditionals in header decoding it might be slower for > any one backend, but faster overall (less overall IOps) > > The flags field would be indications for additional information: [flag > name (bits): explanation (additional xlog header data in bytes)] > - len_size(0..1): xlog record size is at most xlrec_header_only (0B), > uint8_max(1B), uint16_max(2B), uint32_max(4B) > - has_xid (2): contains transaction ID of logging transaction (4B, or > probably 8B when we introduce 64-bit xids) > - has_cid (3): contains the command ID of the logging statement (4B) > (rationale for logging CID in [0], now in record header because XID is > included there as well, and both are required for consistent > snapshots. > - has_rminfo (4): has non-zero redo-manager flags field (1B) > (rationale for separate field [1], non-zero allows 1B space > optimization for one of each RMGR's operations) > - special_rel (5): pre-existing definition > - check_consistency (6): pre-existing definition > - unset (7): no meaning defined yet. Could be used for full record > compression, or other purposes. Interesting. One fly in the ointment here is that WAL records start on 8-byte boundaries (probably MAXALIGN boundaries, but I didn't check the details). And after the 24-byte header, there's a 2-byte header (or 5-byte header) introducing the payload data (see XLR_BLOCK_ID_DATA_SHORT/LONG). So if the size of the actual payload data is a multiple of 8, and is short enough that we use the short data header, we waste 6 bytes. If the data length is a multiple of 4, we waste 2 bytes. And those are probably really common cases. So the big improvements probably come from saving 2 bytes or 6 bytes or 10 bytes, and saving say 3 or 5 is probably not much better than 2. Or at least that's what I'm guessing. -- Robert Haas EDB: http://www.enterprisedb.com
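To make the rounding arithmetic concrete, here is a throwaway program reproducing the numbers above. The 24-byte XLogRecord, the 2-byte short data header and MAXALIGN of 8 are hard-coded assumptions here; the real definitions live in access/xlogrecord.h and c.h.

#include <stdio.h>

#define MAXALIGN(len)	(((len) + 7) & ~(size_t) 7)

int
main(void)
{
	const size_t header = 24 + 2;	/* XLogRecord + short data header */

	for (size_t payload = 1; payload <= 16; payload++)
	{
		size_t		total = header + payload;
		size_t		stored = MAXALIGN(total);

		/* e.g. payload 8 -> record 34, stored 40, 6 bytes of padding */
		printf("payload %2zu: record %2zu, stored %2zu, padding %zu\n",
			   payload, total, stored, stored - total);
	}
	return 0;
}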
On Thu, Sep 29, 2022 at 2:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> 2. WAL Size. Block references in the WAL are by RelFileLocator, so if
> you make RelFileLocators bigger, WAL gets bigger. We'd have to test
> the exact impact of this, but it seems a bit scary

I have done some testing around this area to see the impact on WAL size, especially when the WAL records are small. With a very simple test with insert/update/delete I can see around an 11% increase in WAL size [1]. Then I did some more tests with pgbench at a smaller scale factor (1); there I do not see a significant increase in the WAL size, although it does increase by around 1-2% [2].

[1]
checkpoint;
do $$
declare
  lsn1 pg_lsn;
  lsn2 pg_lsn;
  diff float;
begin
  select pg_current_wal_lsn() into lsn1;
  CREATE TABLE test(a int);
  for counter in 1..1000 loop
    INSERT INTO test values(1);
    UPDATE test set a=a+1;
    DELETE FROM test where a=1;
  end loop;
  DROP TABLE test;
  select pg_current_wal_lsn() into lsn2;
  select pg_wal_lsn_diff(lsn2, lsn1) into diff;
  raise notice '%', diff/1024;
end;
$$;

wal generated head:  66199.09375 kB
wal generated patch: 73906.984375 kB
wal-size increase:   11%

[2]
./pgbench -i postgres
./pgbench -c1 -j1 -t 30000 -M prepared postgres

wal generated head:  30780 kB
wal generated patch: 31284 kB
wal-size increase:   ~1-2%

I have done further analysis to understand why the WAL size increases by only 1-2% on the pgbench workload. With waldump I could see that the WAL size per transaction increased from 566 bytes (on head) to 590 bytes (with the patch), which is around 4%; but when we look at the total WAL size difference after 30k transactions it is just 1-2%, and I think that is because there are other records which are not impacted, like FPIs.

Conclusion: as suspected, with a very targeted test case producing small WAL records we can see a significant 11% increase in WAL size, but with a pgbench-like workload the increase in WAL size is much smaller.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi, On 2022-09-30 15:36:11 +0530, Dilip Kumar wrote: > I have done some testing around this area to see the impact on WAL > size especially when WAL sizes are smaller, with a very simple test > with insert/update/delete I can see around an 11% increase in WAL size > [1] then I did some more test with pgbench with smaller scale > factor(1) there I do not see a significant increase in the WAL size > although it increases WAL size around 1-2%. [2]. I think it'd be interesting to look at per-record-type stats between two equivalent workload, to see where practical workloads suffer the most (possibly with fpw=off, to make things more repeatable). I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to pay off for 56bit relfilenodes. The class of problems foreclosed is large enough to "waste" "improvement potential" on this. Greetings, Andres Freund
On Fri, Sep 30, 2022 at 5:20 PM Andres Freund <andres@anarazel.de> wrote: > I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to > pay off for 56bit relfilenodes. The class of problems foreclosed is large > enough to "waste" "improvement potential" on this. I agree overall. A closely related but distinct question occurs to me: if we're going to be "wasting" space on alignment padding in certain cases one way or another, can we at least recognize those cases and take advantage at the level of individual WAL record formats? In other words: So far we've been discussing the importance of not going over a critical threshold for certain WAL records. But it might also be valuable to consider recognizing that that's inevitable, and that we might as well make the most of it by including one or two other things. This seems most likely to matter when managing the problem of negative compression with per-WAL-record compression schemes for things like arrays of page offset numbers [1]. If (say) a given compression scheme "wastes" space for arrays of only 1-3 items, but we already know that the relevant space will all be lost to alignment needed by code one level down in any case, does it really count as waste? We're likely always going to have some kind of negative compression, but you do get to influence where and when the negative compression happens. Not sure how relevant this will turn out to be, but seems worth considering. More generally, thinking about how things work across multiple layers of abstraction seems like it could be valuable in other ways. [1] https://postgr.es/m/CAH2-WzmLCn2Hx9tQLdmdb+9CkHKLyWD2bsz=PmRebc4dAxjy6g@mail.gmail.com -- Peter Geoghegan
On Sat, Oct 1, 2022 at 5:50 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-09-30 15:36:11 +0530, Dilip Kumar wrote:
> > I have done some testing around this area to see the impact on WAL
> > size especially when WAL sizes are smaller, with a very simple test
> > with insert/update/delete I can see around an 11% increase in WAL size
> > [1] then I did some more test with pgbench with smaller scale
> > factor(1) there I do not see a significant increase in the WAL size
> > although it increases WAL size around 1-2%. [2].
>
> I think it'd be interesting to look at per-record-type stats between two
> equivalent workload, to see where practical workloads suffer the most
> (possibly with fpw=off, to make things more repeatable).

While testing pgbench, I dumped the WAL sizes using waldump. In the pgbench case, most of the record sizes increased by 4 bytes, as they include a single block reference, and the same is true for the other test case I sent. Here is the WAL dump of what the sizes look like for a single pgbench transaction [1]. Maybe for seeing these changes with different workloads we can run some of the files from the regression tests and compare the individual WAL record sizes.

Head:

rmgr: Heap        len (rec/tot):  54/  54, tx: 867, lsn: 0/02DD1280, prev 0/02DD1250, desc: LOCK off 44: xid 867: flags 0x01 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/16424 blk 226
rmgr: Heap        len (rec/tot): 171/ 171, tx: 867, lsn: 0/02DD12B8, prev 0/02DD1280, desc: UPDATE off 44 xmax 867 flags 0x11 ; new off 30 xmax 0, blkref #0: rel 1663/5/16424 blk 1639, blkref #1: rel 1663/5/16424 blk 226
rmgr: Btree       len (rec/tot):  64/  64, tx: 867, lsn: 0/02DD1368, prev 0/02DD12B8, desc: INSERT_LEAF off 290, blkref #0: rel 1663/5/16432 blk 39
rmgr: Heap        len (rec/tot):  78/  78, tx: 867, lsn: 0/02DD13A8, prev 0/02DD1368, desc: HOT_UPDATE off 15 xmax 867 flags 0x10 ; new off 19 xmax 0, blkref #0: rel 1663/5/16427 blk 0
rmgr: Heap        len (rec/tot):  74/  74, tx: 867, lsn: 0/02DD13F8, prev 0/02DD13A8, desc: HOT_UPDATE off 9 xmax 867 flags 0x10 ; new off 10 xmax 0, blkref #0: rel 1663/5/16425 blk 0
rmgr: Heap        len (rec/tot):  79/  79, tx: 867, lsn: 0/02DD1448, prev 0/02DD13F8, desc: INSERT off 9 flags 0x08, blkref #0: rel 1663/5/16434 blk 0
rmgr: Transaction len (rec/tot):  46/  46, tx: 867, lsn: 0/02DD1498, prev 0/02DD1448, desc: COMMIT 2022-10-01 11:24:03.464437 IST

Patch:

rmgr: Heap        len (rec/tot):  58/  58, tx: 818, lsn: 0/0218BEB0, prev 0/0218BE80, desc: LOCK off 34: xid 818: flags 0x01 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/100004 blk 522
rmgr: Heap        len (rec/tot): 175/ 175, tx: 818, lsn: 0/0218BEF0, prev 0/0218BEB0, desc: UPDATE off 34 xmax 818 flags 0x11 ; new off 8 xmax 0, blkref #0: rel 1663/5/100004 blk 1645, blkref #1: rel 1663/5/100004 blk 522
rmgr: Btree       len (rec/tot):  68/  68, tx: 818, lsn: 0/0218BFA0, prev 0/0218BEF0, desc: INSERT_LEAF off 36, blkref #0: rel 1663/5/100010 blk 89
rmgr: Heap        len (rec/tot):  82/  82, tx: 818, lsn: 0/0218BFE8, prev 0/0218BFA0, desc: HOT_UPDATE off 66 xmax 818 flags 0x10 ; new off 90 xmax 0, blkref #0: rel 1663/5/100007 blk 0
rmgr: Heap        len (rec/tot):  78/  78, tx: 818, lsn: 0/0218C058, prev 0/0218BFE8, desc: HOT_UPDATE off 80 xmax 818 flags 0x10 ; new off 81 xmax 0, blkref #0: rel 1663/5/100005 blk 0
rmgr: Heap        len (rec/tot):  83/  83, tx: 818, lsn: 0/0218C0A8, prev 0/0218C058, desc: INSERT off 80 flags 0x08, blkref #0: rel 1663/5/100011 blk 0
rmgr: Transaction len (rec/tot):  46/  46, tx: 818, lsn: 0/0218C100, prev 0/0218C0A8, desc: COMMIT 2022-10-01 11:11:03.564063 IST

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote: > I think it'd be interesting to look at per-record-type stats between two > equivalent workload, to see where practical workloads suffer the most > (possibly with fpw=off, to make things more repeatable). I would expect, and Dilip's results seem to confirm, the effect to be pretty uniform: basically, nearly every record gets bigger by 4 bytes. That's because most records contain at least one block reference, and if they contain multiple block references, likely all but one will be marked BKPBLOCK_SAME_REL, so we pay the cost just once. Because of alignment padding, the practical effect is probably that about half of the records get bigger by 8 bytes and the other half don't get bigger at all. But I see no reason to believe that things are any better or worse than that. Most interesting record types are going to contain some kind of variable-length payload, so the chances that a 4 byte size increase pushes you across a MAXALIGN boundary seem to be no better or worse than fifty-fifty. > I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to > pay off for 56bit relfilenodes. The class of problems foreclosed is large > enough to "waste" "improvement potential" on this. I thought about trying to buy back some space elsewhere, and I think that would be a reasonable approach to getting this committed if we could find a way to do it. However, I don't see a terribly obvious way of making it happen. Trying to do it by optimizing specific WAL record types seems like a real pain in the neck, because there's tons of different WAL records that all have the same problem. Trying to do it in a generic way makes more sense, and the fact that we have 2 padding bytes available in XLogRecord seems like a place to start looking, but the way forward from there is not clear to me. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > I think it'd be interesting to look at per-record-type stats between two
> > equivalent workload, to see where practical workloads suffer the most
> > (possibly with fpw=off, to make things more repeatable).
>
> I would expect, and Dilip's results seem to confirm, the effect to be
> pretty uniform: basically, nearly every record gets bigger by 4 bytes.
> That's because most records contain at least one block reference, and
> if they contain multiple block references, likely all but one will be
> marked BKPBLOCK_SAME_REL, so we pay the cost just once.

But it doesn't really matter that much if an already large record gets a bit bigger. Whereas it does matter if it's a small record. Focussing on optimizing the record types where the increase is large seems like a potential way forward to me, even if we can't find something generic.

> I thought about trying to buy back some space elsewhere, and I think
> that would be a reasonable approach to getting this committed if we
> could find a way to do it. However, I don't see a terribly obvious way
> of making it happen.

I think there's plenty potential...

> Trying to do it by optimizing specific WAL record
> types seems like a real pain in the neck, because there's tons of
> different WAL records that all have the same problem.

I am not so sure about that. Improving a bunch of the most frequent small records might buy you back enough on just about every workload to be OK.

I put the top record sizes for an installcheck run with full_page_writes off at the bottom. Certainly our regression tests aren't generally representative. But I think it still decently highlights how just improving a few records could buy you back more than enough.

> Trying to do it in a generic way makes more sense, and the fact that we have
> 2 padding bytes available in XLogRecord seems like a place to start looking,
> but the way forward from there is not clear to me.

Random idea: xl_prev is large. Store a full xl_prev in the page header, but only store a 2 byte offset from the page header xl_prev within each record.

Greetings,

Andres Freund

by total size:
Type                       N      (%)  Record size      (%)  FPI size      (%)  Combined size      (%)
----                       -      ---  -----------      ---  --------      ---  -------------      ---
Heap/INSERT          1041666 ( 50.48)    106565255 ( 50.54)         0 (  0.00)      106565255 ( 43.92)
Btree/INSERT_LEAF     352196 ( 17.07)     24067672 ( 11.41)         0 (  0.00)       24067672 (  9.92)
Heap/DELETE           250852 ( 12.16)     13546008 (  6.42)         0 (  0.00)       13546008 (  5.58)
Hash/INSERT           108499 (  5.26)      7811928 (  3.70)         0 (  0.00)        7811928 (  3.22)
Transaction/COMMIT     16053 (  0.78)      6402657 (  3.04)         0 (  0.00)        6402657 (  2.64)
Gist/PAGE_UPDATE       57225 (  2.77)      5217100 (  2.47)         0 (  0.00)        5217100 (  2.15)
Gin/UPDATE_META_PAGE   23943 (  1.16)      4539970 (  2.15)         0 (  0.00)        4539970 (  1.87)
Gin/INSERT             27004 (  1.31)      3623998 (  1.72)         0 (  0.00)        3623998 (  1.49)
Gist/PAGE_SPLIT          448 (  0.02)      3391244 (  1.61)         0 (  0.00)        3391244 (  1.40)
SPGist/ADD_LEAF        38968 (  1.89)      3341696 (  1.58)         0 (  0.00)        3341696 (  1.38)
...
XLOG/FPI                7228 (  0.35)       378924 (  0.18)  29788166 ( 93.67)       30167090 ( 12.43)
...
Gin/SPLIT                141 (  0.01)        13011 (  0.01)   1187588 (  3.73)        1200599 (  0.49)
...
                    --------            --------              --------              --------
Total                2063609           210848282 [86.89%]     31802766 [13.11%]     242651048 [100%]

(Included XLOG/FPI and Gin/SPLIT to explain why there's FPIs despite running with fpw=off)

sorted by number of records:
Heap/INSERT          1041666 ( 50.48)    106565255 ( 50.54)         0 (  0.00)      106565255 ( 43.92)
Btree/INSERT_LEAF     352196 ( 17.07)     24067672 ( 11.41)         0 (  0.00)       24067672 (  9.92)
Heap/DELETE           250852 ( 12.16)     13546008 (  6.42)         0 (  0.00)       13546008 (  5.58)
Hash/INSERT           108499 (  5.26)      7811928 (  3.70)         0 (  0.00)        7811928 (  3.22)
Gist/PAGE_UPDATE       57225 (  2.77)      5217100 (  2.47)         0 (  0.00)        5217100 (  2.15)
SPGist/ADD_LEAF        38968 (  1.89)      3341696 (  1.58)         0 (  0.00)        3341696 (  1.38)
Gin/INSERT             27004 (  1.31)      3623998 (  1.72)         0 (  0.00)        3623998 (  1.49)
Gin/UPDATE_META_PAGE   23943 (  1.16)      4539970 (  2.15)         0 (  0.00)        4539970 (  1.87)
Standby/LOCK           18451 (  0.89)       775026 (  0.37)         0 (  0.00)         775026 (  0.32)
Transaction/COMMIT     16053 (  0.78)      6402657 (  3.04)         0 (  0.00)        6402657 (  2.64)
On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote: > > Hi, > > On 2022-10-03 08:12:39 -0400, Robert Haas wrote: > > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote: > > > I think it'd be interesting to look at per-record-type stats between two > > > equivalent workload, to see where practical workloads suffer the most > > > (possibly with fpw=off, to make things more repeatable). > > > > I would expect, and Dilip's results seem to confirm, the effect to be > > pretty uniform: basically, nearly every record gets bigger by 4 bytes. > > That's because most records contain at least one block reference, and > > if they contain multiple block references, likely all but one will be > > marked BKPBLOCK_SAME_REL, so we pay the cost just once. > > But it doesn't really matter that much if an already large record gets a bit > bigger. Whereas it does matter if it's a small record. Focussing on optimizing > the record types where the increase is large seems like a potential way > forward to me, even if we can't find something generic. > > > > I thought about trying to buy back some space elsewhere, and I think > > that would be a reasonable approach to getting this committed if we > > could find a way to do it. However, I don't see a terribly obvious way > > of making it happen. > > I think there's plenty potential... > > > > Trying to do it by optimizing specific WAL record > > types seems like a real pain in the neck, because there's tons of > > different WAL records that all have the same problem. > > I am not so sure about that. Improving a bunch of the most frequent small > records might buy you back enough on just about every workload to be OK. > > I put the top record sizes for an installcheck run with full_page_writes off > at the bottom. Certainly our regression tests aren't generally > representative. But I think it still decently highlights how just improving a > few records could buy you back more than enough. > > > > Trying to do it in a generic way makes more sense, and the fact that we have > > 2 padding bytes available in XLogRecord seems like a place to start looking, > > but the way forward from there is not clear to me. > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but > only store a 2 byte offset from the page header xl_prev within each record. With that small xl_prev we may not detect partial page writes in recycled segments; or other issues in the underlying file system. With small record sizes, the chance of returning incorrect data would be significant for small records (it would be approximately the chance of getting a record boundary on the underlying page boundary * chance of getting the same MAXALIGN-adjusted size record before the persistence boundary). That issue is part of the reason why my proposed change upthread still contains the full xl_prev. A different idea is removing most block_ids from the record, and optionally reducing per-block length fields to 1B. Used block ids are effectively always sequential, and we only allow 33+4 valid values, so we can use 2 bits to distinguish between 'block belonging to this ID field have at most 255B of data registered' and 'blocks up to this ID follow sequentially without own block ID'. That would save 2N-1 total bytes for N blocks. It is scraping the barrel, but I think it is quite possible. Lastly, we could add XLR_BLOCK_ID_DATA_MED for values >255 containing up to UINT16_MAX lengths. 
That would save 2 bytes for records that only just pass the 255B barrier, where 2B is still a fairly significant part of the record size. Kind regards, Matthias van de Meent
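For concreteness, here is a rough sketch of how the insertion side might pick between data-header sizes with the proposed medium variant. XLR_BLOCK_ID_DATA_SHORT and XLR_BLOCK_ID_DATA_LONG exist in access/xlogrecord.h today; XLR_BLOCK_ID_DATA_MED, its value and this helper are hypothetical:

#include "postgres.h"
#include "access/xlogrecord.h"

/* Hypothetical block ID for a 2-byte main-data length; not in core. */
#define XLR_BLOCK_ID_DATA_MED	251

static char *
emit_main_data_header(char *out, uint32 mainrdata_len)
{
	if (mainrdata_len <= UINT8_MAX)
	{
		*out++ = (char) XLR_BLOCK_ID_DATA_SHORT;	/* 2 bytes of header */
		*out++ = (char) mainrdata_len;
	}
	else if (mainrdata_len <= UINT16_MAX)
	{
		uint16		len16 = (uint16) mainrdata_len;

		*out++ = (char) XLR_BLOCK_ID_DATA_MED;		/* 3 bytes of header */
		memcpy(out, &len16, sizeof(len16));
		out += sizeof(len16);
	}
	else
	{
		*out++ = (char) XLR_BLOCK_ID_DATA_LONG;		/* 5 bytes of header */
		memcpy(out, &mainrdata_len, sizeof(mainrdata_len));
		out += sizeof(mainrdata_len);
	}
	return out;
}

Main data in the 256..65535 byte range would then take 3 header bytes instead of 5, matching the 2-byte saving estimated above.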
Hi, On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote: > On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote: > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but > > only store a 2 byte offset from the page header xl_prev within each record. > > With that small xl_prev we may not detect partial page writes in > recycled segments; or other issues in the underlying file system. With > small record sizes, the chance of returning incorrect data would be > significant for small records (it would be approximately the chance of > getting a record boundary on the underlying page boundary * chance of > getting the same MAXALIGN-adjusted size record before the persistence > boundary). That issue is part of the reason why my proposed change > upthread still contains the full xl_prev. What exactly is the theory for this significant increase? I don't think xl_prev provides a meaningful protection against torn pages in the first place? Greetings, Andres Freund
On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote: > > On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote: > > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but > > > only store a 2 byte offset from the page header xl_prev within each record. > > > > With that small xl_prev we may not detect partial page writes in > > recycled segments; or other issues in the underlying file system. With > > small record sizes, the chance of returning incorrect data would be > > significant for small records (it would be approximately the chance of > > getting a record boundary on the underlying page boundary * chance of > > getting the same MAXALIGN-adjusted size record before the persistence > > boundary). That issue is part of the reason why my proposed change > > upthread still contains the full xl_prev. > > What exactly is the theory for this significant increase? I don't think > xl_prev provides a meaningful protection against torn pages in the first > place? XLog pages don't have checksums, so they do not provide torn page protection capabilities on their own. A singular xlog record is protected against torn page writes through the checksum that covers the whole record - if only part of the record was written, we can detect that through the mismatching checksum. However, if records end at the tear boundary, we must know for certain that any record that starts after the tear is the record that was written after the one before the tear. Page-local references/offsets would not work, because the record decoding doesn't know which xlog page the record should be located on; it could be both the version of the page before it was recycled, or the one after. Currently, we can detect this because the value of xl_prev will point to a record far in the past (i.e. not the expected value), but with a page-local version of xl_prev we would be less likely to detect torn pages (and thus be unable to handle this without risk of corruption) due to the significant chance of the truncated xl_prev value being the same in both the old and new record. Example: Page { [ record A ] | tear boundary | [ record B ] } gets recycled and receives a new record C at the place of A with the same length. With your proposal, record B would still be a valid record when it follows C; as the page-local serial number/offset reference to the previous record would still match after the torn write. With the current situation and a full LSN in xl_prev, the mismatching value in the xl_prev pointer allows us to detect this torn page write and halt replay, before redoing an old (incorrect) record. Kind regards, Matthias van de Meent PS. there are ideas floating around (I heard about this one from Heikki) where we could concatenate WAL records into one combined record that has only one shared xl_prev+crc; which would save these 12 bytes per record. However, that needs a lot of careful consideration to make sure that the persistence guarantee of operations doesn't get lost somewhere in the traffic.
Hi, On 2022-10-04 15:05:47 +0200, Matthias van de Meent wrote: > On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote: > > On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote: > > > On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote: > > > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but > > > > only store a 2 byte offset from the page header xl_prev within each record. > > > > > > With that small xl_prev we may not detect partial page writes in > > > recycled segments; or other issues in the underlying file system. With > > > small record sizes, the chance of returning incorrect data would be > > > significant for small records (it would be approximately the chance of > > > getting a record boundary on the underlying page boundary * chance of > > > getting the same MAXALIGN-adjusted size record before the persistence > > > boundary). That issue is part of the reason why my proposed change > > > upthread still contains the full xl_prev. > > > > What exactly is the theory for this significant increase? I don't think > > xl_prev provides a meaningful protection against torn pages in the first > > place? > > XLog pages don't have checksums, so they do not provide torn page > protection capabilities on their own. > A singular xlog record is protected against torn page writes through > the checksum that covers the whole record - if only part of the record > was written, we can detect that through the mismatching checksum. > However, if records end at the tear boundary, we must know for certain > that any record that starts after the tear is the record that was > written after the one before the tear. Page-local references/offsets > would not work, because the record decoding doesn't know which xlog > page the record should be located on; it could be both the version of > the page before it was recycled, or the one after. > Currently, we can detect this because the value of xl_prev will point > to a record far in the past (i.e. not the expected value), but with a > page-local version of xl_prev we would be less likely to detect torn > pages (and thus be unable to handle this without risk of corruption) > due to the significant chance of the truncated xl_prev value being the > same in both the old and new record. Think this is addressable, see below. > Example: Page { [ record A ] | tear boundary | [ record B ] } gets > recycled and receives a new record C at the place of A with the same > length. > > With your proposal, record B would still be a valid record when it > follows C; as the page-local serial number/offset reference to the > previous record would still match after the torn write. > With the current situation and a full LSN in xl_prev, the mismatching > value in the xl_prev pointer allows us to detect this torn page write > and halt replay, before redoing an old (incorrect) record. In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection? As you specified it, C has the same length as A, so B's xl_prev will be the same whether it's a page local offset or the full 8 bytes. The relevant protection against issues like this isn't xl_prev, it's the CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather than the offset. > PS. there are ideas floating around (I heard about this one from > Heikki) where we could concatenate WAL records into one combined > record that has only one shared xl_prev+crc; which would save these 12 > bytes per record. 
However, that needs a lot of careful consideration > to make sure that the persistence guarantee of operations doesn't get > lost somewhere in the traffic. One version of that is to move the CRCs to the page header, make the pages smaller (512 bytes / 4K, depending on the hardware), and to pad out partial pages when flushing them out. Rewriting pages is bad for hardware and prevents having multiple WAL IOs in flight at the same time. Greetings, Andres Freund
On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:
> > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > recycled and receives a new record C at the place of A with the same
> > length.
> >
> > With your proposal, record B would still be a valid record when it
> > follows C; as the page-local serial number/offset reference to the
> > previous record would still match after the torn write.
> > With the current situation and a full LSN in xl_prev, the mismatching
> > value in the xl_prev pointer allows us to detect this torn page write
> > and halt replay, before redoing an old (incorrect) record.
>
> In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
> As you specified it, C has the same length as A, so B's xl_prev will be the
> same whether it's a page local offset or the full 8 bytes.
>
> The relevant protection against issues like this isn't xl_prev, it's the
> CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
> than the offset.

I'm really confused. xl_prev *is* a full-width LSN currently, as I understand it. So in the scenario that Matthias poses, let's say the segment was previously 000000010000000400000025 and now it's 000000010000000400000049. So if a given chunk of the page is leftover from when the page was 000000010000000400000025, it will have xl_prev values like 4/25xxxxxx. If it's been rewritten since the segment was recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell whether record B has been overwritten with a new record since the segment was recycled. But if we stored only 2 bytes in each xl_prev field, that would no longer be possible.

So I'm lost. It seems like Matthias has correctly identified a real hazard, and not some weird corner case but actually something that will happen regularly. All you need is for the old segment that got recycled to have a record starting at the same place where the page tore, and for the previous record to have been the same length as the one on the new page. Given that there's only <~1024 places on a page where a record can start, and given that in many workloads the lengths of WAL records will be fairly uniform, this doesn't seem unlikely at all.

A way to up the chances of detecting this case would be to store only 2 or 4 bytes of xl_prev on disk, but arrange to include the full xl_prev value in the xl_crc calculation. Then your chances of a collision are about 2^-32, or maybe more if you posit that CRC is a weak and crappy algorithm, but even then it's strictly better than just hoping that there isn't a tear point at a record boundary where the same length record precedes the tear in both the old and new WAL segments. However, on the flip side, even if you assume that CRC is a fantastic algorithm with beautiful and state-of-the-art bit mixing, the chances of it failing to notice the problem are still >0, whereas the current algorithm that compares the full xl_prev value is a sure thing. Because xl_prev values are never repeated, it's certain that when a segment is recycled, any values that were legal for the old one aren't legal in the new one.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2022-10-04 13:36:33 -0400, Robert Haas wrote:
> On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:
> > > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > > recycled and receives a new record C at the place of A with the same
> > > length.
> > >
> > > With your proposal, record B would still be a valid record when it
> > > follows C; as the page-local serial number/offset reference to the
> > > previous record would still match after the torn write.
> > > With the current situation and a full LSN in xl_prev, the mismatching
> > > value in the xl_prev pointer allows us to detect this torn page write
> > > and halt replay, before redoing an old (incorrect) record.
> >
> > In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
> > As you specified it, C has the same length as A, so B's xl_prev will be the
> > same whether it's a page local offset or the full 8 bytes.
> >
> > The relevant protection against issues like this isn't xl_prev, it's the
> > CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
> > than the offset.
>
> I'm really confused. xl_prev *is* a full-width LSN currently, as I
> understand it. So in the scenario that Matthias poses, let's say the
> segment was previously 000000010000000400000025 and now it's
> 000000010000000400000049. So if a given chunk of the page is leftover
> from when the page was 000000010000000400000025, it will have xl_prev
> values like 4/25xxxxxx. If it's been rewritten since the segment was
> recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell
> whether record B has been overwritten with a new record since the
> segment was recycled. But if we stored only 2 bytes in each xl_prev
> field, that would no longer be possible.

Oh, I think I misunderstood the scenario. I was thinking of cases where we write out a bunch of pages, crash, only some of the pages made it to disk, we then write new ones of the same length, and now find a record after the "real" end of the WAL to be valid. Not sure how I mentally swallowed the "recycled".

For the recycling scenario to be a problem we'll also need to crash, with parts of the page ending up with the new contents and parts of the page ending up with the old "pre recycling" content, correct? Because without a crash we'll have zeroed out the remainder of the page (well, leaving walreceiver out of the picture, grr). However, this can easily happen without any record boundaries on the partially recycled page, so we rely on the CRCs to protect against this.

Here I originally wrote a more in-depth explanation of the scenario I was thinking about, where we already rely on CRCs to protect us. But, ooph, I think they don't protect us reliably, with today's design. But maybe I'm missing more things today.

Consider the following sequence:

1) we write WAL like this:

[record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn]

2) crash, the sectors with A and C made it to disk, the one with B didn't

3) We replay A, discover B is invalid (possibly zeroes or old contents), insert a new record B' with the same length. Now it looks like this:

[record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn]

4) crash, the sector with B' makes it to disk

5) we replay A, B', C, because C has an xl_prev that's compatible with B' location and a valid CRC.

Oops.

I think this can happen both within a single page and across page boundaries.

I hope I am missing something here?
> A way to up the chances of detecting this case would be to store only > 2 or 4 bytes of xl_prev on disk, but arrange to include the full > xl_prev value in the xl_crc calculation. Right, that's what I was suggesting as well. > Then your chances of a collision are about 2^-32, or maybe more if you posit > that CRC is a weak and crappy algorithm, but even then it's strictly better > than just hoping that there isn't a tear point at a record boundary where > the same length record precedes the tear in both the old and new WAL > segments. However, on the flip side, even if you assume that CRC is a > fantastic algorithm with beautiful and state-of-the-art bit mixing, the > chances of it failing to notice the problem are still >0, whereas the > current algorithm that compares the full xl_prev value is a sure > thing. Because xl_prev values are never repeated, it's certain that when a > segment is recycled, any values that were legal for the old one aren't legal > in the new one. Given that we already rely on the CRCs to detect corruption within a single record spanning tear boundaries, this doesn't cause me a lot of heartburn. But I suspect we might need to do something about the scenario I outlined above, which likely would also increase the protection against this issue. I think there might be reasonable ways to increase the guarantees based on the 2 byte xl_prev approach "alone". We don't have to store the offset from the page header as a plain offset. What about storing something like: page_offset ^ (page_lsn >> wal_segsz_shift) I think something like that'd result in prev_not_really_lsn typically not simply matching after recycling. Of course it only provides so much protection, given 16bits... Greetings, Andres Freund
On Tue, Oct 4, 2022 at 2:30 PM Andres Freund <andres@anarazel.de> wrote: > Consider the following sequence: > > 1) we write WAL like this: > > [record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn] > > 2) crash, the sectors with A and C made it to disk, the one with B didn't > > 3) We replay A, discover B is invalid (possibly zeroes or old contents), > insert a new record B' with the same length. Now it looks like this: > > [record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn] > > 4) crash, the sector with B' makes it to disk > > 5) we replay A, B', C, because C has an xl_prev that's compatible with B' > location and a valid CRC. > > Oops. > > I think this can happen both within a single page and across page boundaries. > > I hope I am missing something here? If you are, I don't know what it is off-hand. That seems like a plausible scenario to me. It does require the OS to write things out of order, and I don't know how likely that is in practice, but the answer probably isn't zero. > I think there might be reasonable ways to increase the guarantees based on the > 2 byte xl_prev approach "alone". We don't have to store the offset from the > page header as a plain offset. What about storing something like: > page_offset ^ (page_lsn >> wal_segsz_shift) > > I think something like that'd result in prev_not_really_lsn typically not > simply matching after recycling. Of course it only provides so much > protection, given 16bits... Maybe. That does seem somewhat better, but I feel like it's hard to reason about whether it's safe in absolute terms or just resistant to the precise scenario Matthias postulated while remaining vulnerable to slightly modified versions. How about this: remove xl_prev. widen xl_crc to 64 bits. include the CRC of the previous WAL record in the xl_crc calculation. That doesn't cut quite as many bytes out of the record size as your proposal, but it feels like it should strongly resist pretty much every attack of this general type, with only the minor disadvantage that the more expensive CRC calculation will destroy all hope of getting anything committed. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2022-10-03 10:01:25 -0700, Andres Freund wrote:
> On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > I thought about trying to buy back some space elsewhere, and I think
> > that would be a reasonable approach to getting this committed if we
> > could find a way to do it. However, I don't see a terribly obvious way
> > of making it happen.
>
> I think there's plenty potential...

I lightly dusted off my old varint implementation from [1] and converted the RelFileLocator and BlockNumber from fixed width integers to varint ones. This isn't meant as a serious patch, but an experiment to see if this is a path worth pursuing.

A run of installcheck in a cluster with autovacuum=off, full_page_writes=off (for increased reproducibility) shows a decent saving:

master: 241106544 - 230 MB
varint: 227858640 - 217 MB

The average record size goes from 102.7 to 95.7 bytes excluding the remaining FPIs, 118.1 to 111.0 including FPIs.

There's plenty other spots that could be converted (e.g. the record length which rarely needs four bytes), this is just meant as a demonstration.

I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A profile does show pg_varint_decode_uint64(), but partially that seems to be offset by the reduced amount of bytes to CRC. Maybe a ~2% overhead remains. That would be tolerable, I think, because waldump --stats pretty much doesn't do anything with the WAL.

But I suspect there's plenty of optimization potential in the varint code. Right now it e.g. stores data as big endian, and the bswap instructions do show up. And a lot of my bit-maskery could be optimized too.

Greetings,

Andres Freund
Attachment
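For readers unfamiliar with variable-width integer encodings, here is a simple LEB128-style sketch. This is not the encoding from the attached patch (which, as noted elsewhere in the thread, is big-endian and built around fixed 8-byte reads); it only illustrates why small values, such as typical tablespace/database OIDs, relfilenumbers and block numbers, shrink to a byte or two:

#include "postgres.h"

/* Encode 'value' into 'buf' using 7 data bits per byte; returns length. */
static size_t
varint_encode_uint64(uint8 *buf, uint64 value)
{
	size_t		len = 0;

	do
	{
		uint8		b = value & 0x7F;

		value >>= 7;
		if (value != 0)
			b |= 0x80;			/* continuation bit: more bytes follow */
		buf[len++] = b;
	} while (value != 0);

	return len;					/* 1..10 bytes for a uint64 */
}

/* Decode a value encoded as above; returns the number of bytes consumed. */
static size_t
varint_decode_uint64(const uint8 *buf, uint64 *value)
{
	uint64		result = 0;
	size_t		len = 0;
	int			shift = 0;

	for (;;)
	{
		uint8		b = buf[len++];

		result |= (uint64) (b & 0x7F) << shift;
		if ((b & 0x80) == 0)
			break;
		shift += 7;
	}
	*value = result;
	return len;
}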
On Wed, Oct 5, 2022 at 5:19 AM Andres Freund <andres@anarazel.de> wrote:
>
> I light dusted off my old varint implementation from [1] and converted the
> RelFileLocator and BlockNumber from fixed width integers to varint ones. This
> isn't meant as a serious patch, but an experiment to see if this is a path
> worth pursuing.
>
> A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
> (for increased reproducability) shows a decent saving:
>
> master: 241106544 - 230 MB
> varint: 227858640 - 217 MB
>
> The average record size goes from 102.7 to 95.7 bytes excluding the remaining
> FPIs, 118.1 to 111.0 including FPIs.
>

I have also executed my original test after applying these patches on top of the 56 bit relfilenode patch. So earlier we saw the WAL size increased by 11% (66199.09375 kB to 73906.984375 kB), and with this patch the WAL generated is 58179.2265625 kB. That means in this particular example this patch is reducing the WAL size by 12% even with the 56 bit relfilenode patch.

[1] https://www.postgresql.org/message-id/CAFiTN-uut%2B04AdwvBY_oK_jLvMkwXUpDJj5mXg--nek%2BucApPQ%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have also executed my original test after applying these patches on > top of the 56 bit relfilenode patch. So earlier we saw the WAL size > increased by 11% (66199.09375 kB to 73906.984375 kB) and after this > patch now the WAL generated is 58179.2265625. That means in this > particular example this patch is reducing the WAL size by 12% even > with the 56 bit relfilenode patch. That's a very promising result, but the question in my mind is how much work would be required to bring this patch to a committable state? -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2022-10-10 08:10:22 -0400, Robert Haas wrote:
> On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have also executed my original test after applying these patches on
> > top of the 56 bit relfilenode patch. So earlier we saw the WAL size
> > increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
> > patch now the WAL generated is 58179.2265625. That means in this
> > particular example this patch is reducing the WAL size by 12% even
> > with the 56 bit relfilenode patch.
>
> That's a very promising result, but the question in my mind is how
> much work would be required to bring this patch to a committable
> state?

The biggest part clearly is to review the variable width integer patch. It's not a large amount of code, but probably more complicated than average.

One complication there is that currently the patch assumes:

 * Note that this function, for efficiency, reads 8 bytes, even if the
 * variable integer is less than 8 bytes long. The buffer has to be
 * allocated sufficiently large to account for that fact. The maximum
 * amount of memory read is 9 bytes.

We could make a less efficient version without that assumption, but I think it might be easier to just guarantee it in the xlog*.c case.

Using it in xloginsert.c is pretty darn simple, code-wise. xlogreader is a bit harder, although not for intrinsic reasons - the logic underlying COPY_HEADER_FIELD seems unnecessarily complicated to me. The minimal solution would likely be to just wrap the varint reads in another weird macro.

Leaving the code issues themselves aside, one important thing would be to evaluate what the performance impacts of the varint encoding/decoding are as part of a "full" server. I suspect it'll vanish in the noise, but we'd need to validate that.

Greetings,

Andres Freund
On Mon, Oct 10, 2022 at 5:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have also executed my original test after applying these patches on
> > top of the 56 bit relfilenode patch. So earlier we saw the WAL size
> > increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
> > patch now the WAL generated is 58179.2265625. That means in this
> > particular example this patch is reducing the WAL size by 12% even
> > with the 56 bit relfilenode patch.
>
> That's a very promising result, but the question in my mind is how
> much work would be required to bring this patch to a committable
> state?

Right, the results are promising. I have done some more testing, comparing the WAL size of a make installcheck run (fpw=off), and I see a similar gain with this patch.

1. Head: 272 MB
2. 56 bit RelFileLocator: 285 MB
3. 56 bit RelFileLocator + this patch: 261 MB

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-10-03 10:01:25 -0700, Andres Freund wrote:
> > On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > I thought about trying to buy back some space elsewhere, and I think
> > > that would be a reasonable approach to getting this committed if we
> > > could find a way to do it. However, I don't see a terribly obvious way
> > > of making it happen.
> >
> > I think there's plenty potential...
>
> I light dusted off my old varint implementation from [1] and converted the
> RelFileLocator and BlockNumber from fixed width integers to varint ones. This
> isn't meant as a serious patch, but an experiment to see if this is a path
> worth pursuing.
>
> A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
> (for increased reproducability) shows a decent saving:
>
> master: 241106544 - 230 MB
> varint: 227858640 - 217 MB

I think a significant part of this improvement comes from the premise of starting with a fresh database. tablespace OID will indeed most likely be low, but database OID may very well be linearly distributed if concurrent workloads in the cluster include updating (potentially unlogged) TOASTed columns and the databases are not created in one "big bang" but over the lifetime of the cluster. In that case DBOID will consume 5B for a significant fraction of databases (anything with OID >=2^28).

My point being: I don't think that we should have different WAL performance in databases which is dependent on which OID was assigned to that database.

In addition, this varlen encoding of relfilenode would mean that performance would drop over time, as a relation's relfilelocator is updated to something with a wider number (through VACUUM FULL or other relfilelocator cycling; e.g. re-importing a database). For maximum performance, you'd have to tune your database to have the lowest possible database, namespace and relfilelocator numbers, which (in older clusters) implies hacking into the catalogs - which seems like an antipattern.

I would have much less issue with this if we had separate counters per database (and approximately incremental dbOids), but that's not the case right now.

> The average record size goes from 102.7 to 95.7 bytes excluding the remaining
> FPIs, 118.1 to 111.0 including FPIs.

That's quite promising.

> There's plenty other spots that could be converted (e.g. the record length
> which rarely needs four bytes), this is just meant as a demonstration.

Agreed.

> I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A
> profile does show pg_varint_decode_uint64(), but partially that seems to be
> offset by the reduced amount of bytes to CRC. Maybe a ~2% overhead remains.
>
> That would be tolerable, I think, because waldump --stats pretty much doesn't
> do anything with the WAL.
>
> But I suspect there's plenty of optimization potential in the varint
> code. Right now it e.g. stores data as big endian, and the bswap instructions
> do show up. And a lot of my bit-maskery could be optimized too.

One thing that comes to mind is that we will never see dbOid < 2^8 (and rarely < 2^14, nor spcOid less than 2^8 for that matter), so we'll probably waste at least one or two bits in the encoding of those values. That's not the end of the world, but it'd probably be better if we could improve on that - up to 6% of the field's disk usage would be wasted on an always-on bit.
---- Attached is a prototype patchset that reduces the WAL record size in many common cases. This is a prototype, as it fails tests due to a locking issue in prepared_xacts that I have not been able to find the source of yet. It also could use some more polishing, but the base case seems quite good. I haven't yet run the numbers though...

0001 - Extract xl_rminfo from xl_info
See [0] for more info as to why that's useful; the patch was pulled from there. It is mainly here to reduce the size of 0002, and mostly consists of find-and-replace changes in the rmgrs, extracting their bits from xl_info.

0002 - Rework XLogRecord
This makes many fields in the xlog header optional, reducing the size of many xlog records by several bytes. This implements the design I shared in my earlier message [1].

0003 - Rework XLogRecordBlockHeader
This patch could be applied on current HEAD, and saves some bytes in per-block data. It potentially saves some bytes per registered block/buffer in the WAL record (max 2 bytes for the first block, after that up to 3). See the commit message in the patch for detailed information.

Kind regards, Matthias van de Meent [0] https://postgr.es/m/CAEze2WgZti_Bgs-Aw3egsR5PJQpHcYZwZFCJND5MS-O_DX0-Hg%40mail.gmail.com [1] https://postgr.es/m/CAEze2WjOFzRzPMPYhH4odSa9OCF2XeZszE3jGJhJzrpdFmyLOw@mail.gmail.com
Hi, On 2022-10-12 22:05:30 +0200, Matthias van de Meent wrote: > On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote: > > On 2022-10-03 10:01:25 -0700, Andres Freund wrote: > > > On 2022-10-03 08:12:39 -0400, Robert Haas wrote: > > > > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote: > > > > I thought about trying to buy back some space elsewhere, and I think > > > > that would be a reasonable approach to getting this committed if we > > > > could find a way to do it. However, I don't see a terribly obvious way > > > > of making it happen. > > > > > > I think there's plenty potential... > > > > I light dusted off my old varint implementation from [1] and converted the > > RelFileLocator and BlockNumber from fixed width integers to varint ones. This > > isn't meant as a serious patch, but an experiment to see if this is a path > > worth pursuing. > > > > A run of installcheck in a cluster with autovacuum=off, full_page_writes=off > > (for increased reproducability) shows a decent saving: > > > > master: 241106544 - 230 MB > > varint: 227858640 - 217 MB > > I think a signficant part of this improvement comes from the premise > of starting with a fresh database. tablespace OID will indeed most > likely be low, but database OID may very well be linearly distributed > if concurrent workloads in the cluster include updating (potentially > unlogged) TOASTed columns and the databases are not created in one > "big bang" but over the lifetime of the cluster. In that case DBOID > will consume 5B for a significant fraction of databases (anything with > OID >=2^28). > > My point being: I don't think that we should have different WAL > performance in databases which is dependent on which OID was assigned > to that database. To me this is raising the bar to an absurd level. Some minor space usage increase after oid wraparound and for very large block numbers isn't a huge issue - if you're in that situation you already have a huge amount of wal. > 0002 - Rework XLogRecord > This makes many fields in the xlog header optional, reducing the size > of many xlog records by several bytes. This implements the design I > shared in my earlier message [1]. > > 0003 - Rework XLogRecordBlockHeader. > This patch could be applied on current head, and saves some bytes in > per-block data. It potentially saves some bytes per registered > block/buffer in the WAL record (max 2 bytes for the first block, after > that up to 3). See the patch's commit message in the patch for > detailed information. The amount of complexity these two introduce seems quite substantial to me. Both from a maintenance and a runtime perspective. I think we'd be better off using building blocks like variable-length encoded values rather than open-coding it in many places. Greetings, Andres Freund
On Wed, 12 Oct 2022 at 23:13, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-10-12 22:05:30 +0200, Matthias van de Meent wrote: > > On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote: > > > I light dusted off my old varint implementation from [1] and converted the > > > RelFileLocator and BlockNumber from fixed width integers to varint ones. This > > > isn't meant as a serious patch, but an experiment to see if this is a path > > > worth pursuing. > > > > > > A run of installcheck in a cluster with autovacuum=off, full_page_writes=off > > > (for increased reproducability) shows a decent saving: > > > > > > master: 241106544 - 230 MB > > > varint: 227858640 - 217 MB > > > > I think a signficant part of this improvement comes from the premise > > of starting with a fresh database. tablespace OID will indeed most > > likely be low, but database OID may very well be linearly distributed > > if concurrent workloads in the cluster include updating (potentially > > unlogged) TOASTed columns and the databases are not created in one > > "big bang" but over the lifetime of the cluster. In that case DBOID > > will consume 5B for a significant fraction of databases (anything with > > OID >=2^28). > > > > My point being: I don't think that we should have different WAL > > performance in databases which is dependent on which OID was assigned > > to that database. > > To me this is raising the bar to an absurd level. Some minor space usage > increase after oid wraparound and for very large block numbers isn't a huge > issue - if you're in that situation you already have a huge amount of wal. I didn't want to block all varlen encoding; I just want to make clear that I don't think it's great for performance testing and consistency across installations if WAL size (and thus part of your performance) depends on which actual database/relation/tablespace combination you're running your workload in. With the 56-bit relfilenode, the size of a block reference would realistically differ between 7 bytes and 23 bytes:
- tblspc=0 = 1B, db=16386 = 3B, rel=797 = 2B (797 = 4 * default # of data relations in a fresh DB in PG14, + 1), block=0 = 1B
vs
- tblspc>=2^28 = 5B, db>=2^28 = 5B, rel>=2^49 = 8B, block>=2^28 = 5B
That's a difference of 16 bytes, of which only the block number can realistically be directly influenced by the user ("just don't have relations larger than X blocks"). If applied to Dilip's pgbench transaction data, that would imply a minimum per-transaction WAL usage of 509 bytes, and a maximum per-transaction WAL usage of 609 bytes. That is nearly a 20% difference in WAL size based only on the location of your data, and I'm just not comfortable with that. Users have little or zero control over the internal IDs we assign to these objects, yet those IDs would affect performance fairly significantly. (The percentage difference between min and max WAL size is unchanged (within 0.1%) after accounting for record alignment.) > > 0002 - Rework XLogRecord > > This makes many fields in the xlog header optional, reducing the size > > of many xlog records by several bytes. This implements the design I > > shared in my earlier message [1]. > > > > 0003 - Rework XLogRecordBlockHeader. > > This patch could be applied on current head, and saves some bytes in > > per-block data. It potentially saves some bytes per registered > > block/buffer in the WAL record (max 2 bytes for the first block, after > > that up to 3). See the patch's commit message in the patch for > > detailed information.
> > The amount of complexity these two introduce seems quite substantial to > me. Both from an maintenance and a runtime perspective. I think we'd be better > off using building blocks like variable lengths encoded values than open > coding it in many places. I guess that's true for length fields, but I don't think dynamic header field presence (the 0002 rewrite, and the omission of data_length in 0003) is that bad. We already have dynamic data inclusion through block IDs 25x; I'm not sure why we couldn't do that more compactly with bitfields as indicators instead (hence the dynamic header size). As for complexity, I think my current patchset is mostly complex due to a lack of tooling. Note that decoding makes common use of COPY_HEADER_FIELD, which we don't really have an equivalent for in XLogRecordAssemble. I think the code for 0002 would improve significantly in readability if such a construct were available. To reduce complexity in 0003, I could drop the 'repeat id' optimization; that simplifies things significantly, at the cost of not saving that 1 byte per registered block after the first. Kind regards, Matthias van de Meent
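[As a rough sketch of the "presence bits instead of always-present header fields" idea being debated here - the flag names and field set below are hypothetical and do not match what the 0002/0003 patches actually do - the assembly side could look something like this.]

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical layout: a leading flags byte announces which optional
 * header fields follow.  Not the layout used by the 0002/0003 patches.
 */
#define SKETCH_HAS_XID      0x01    /* a 4-byte xid follows */
#define SKETCH_HAS_RMINFO   0x02    /* a 1-byte rmgr-specific info byte follows */

static size_t
sketch_emit_header(uint8_t *out, bool has_xid, uint32_t xid,
                   bool has_rminfo, uint8_t rminfo)
{
    uint8_t     flags = 0;
    size_t      off = 1;        /* byte 0 is reserved for the flags */

    if (has_xid)
    {
        flags |= SKETCH_HAS_XID;
        memcpy(out + off, &xid, sizeof(xid));   /* unaligned copy */
        off += sizeof(xid);
    }
    if (has_rminfo)
    {
        flags |= SKETCH_HAS_RMINFO;
        out[off++] = rminfo;
    }
    out[0] = flags;
    return off;                 /* 1, 2, 5 or 6 bytes instead of a fixed size */
}

[The decode side would test the same flag bits before consuming each field; an assembly-side counterpart to COPY_HEADER_FIELD, as suggested above, would mostly replace the repeated memcpy-and-advance bookkeeping shown here.]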
On Wed, Oct 12, 2022 at 5:13 PM Andres Freund <andres@anarazel.de> wrote: > > I think a signficant part of this improvement comes from the premise > > of starting with a fresh database. tablespace OID will indeed most > > likely be low, but database OID may very well be linearly distributed > > if concurrent workloads in the cluster include updating (potentially > > unlogged) TOASTed columns and the databases are not created in one > > "big bang" but over the lifetime of the cluster. In that case DBOID > > will consume 5B for a significant fraction of databases (anything with > > OID >=2^28). > > > > My point being: I don't think that we should have different WAL > > performance in databases which is dependent on which OID was assigned > > to that database. > > To me this is raising the bar to an absurd level. Some minor space usage > increase after oid wraparound and for very large block numbers isn't a huge > issue - if you're in that situation you already have a huge amount of wal. I have to admit that I worried about the same thing that Matthias raises, more or less. But I don't know whether I'm right to be worried. A variable-length representation of any kind is essentially a gamble that values requiring fewer bytes will be more common than values requiring more bytes, and by enough to justify the overhead that the method has. And, you want it to be more common for each individual user, not just overall. For example, more people are going to have small relations than large ones, but nobody wants performance to drop off a cliff when the relation passes a certain size threshold. Now, it wouldn't drop off a cliff here, but what about someone with a really big, append-only relation? Won't they just end up writing more to WAL than with the present system? Maybe not. They might still have some writes to relations other than the very large, append-only relation, and then they could still win. Also, if we assume that the overhead of the variable-length representation is never more than 1 byte beyond what is needed to represent the underlying quantity in the minimal number of bytes, they are only going to lose if their relation is already more than half the maximum theoretical size, and if that is the case, they are in danger of hitting the size limit anyway. You can argue that there's still a risk here, but it doesn't seem like that bad of a risk. But the same thing is not so obvious for, let's say, database OIDs. What if you just have one or a few databases, but due to the previous history of the cluster, their OIDs just happen to be big? Then you're just behind where you would have been without the patch. Granted, if this happens to you, you will be in the minority, because most users are likely to have small database OIDs, but the fact that other people are writing less WAL on average isn't going to make you happy about writing more WAL on average. And even for a user for which that doesn't happen, it's not at all unlikely that the gains they see will be less than what we see on a freshly-initdb'd database. So I don't really know what the answer is here. I don't think this technique sucks, but I don't think it's necessarily a categorical win for every case, either. And it even seems hard to reason about which cases are likely to be wins and which cases are likely to be losses. > > 0002 - Rework XLogRecord > > This makes many fields in the xlog header optional, reducing the size > > of many xlog records by several bytes. This implements the design I > > shared in my earlier message [1]. 
> > > > 0003 - Rework XLogRecordBlockHeader. > > This patch could be applied on current head, and saves some bytes in > > per-block data. It potentially saves some bytes per registered > > block/buffer in the WAL record (max 2 bytes for the first block, after > > that up to 3). See the patch's commit message in the patch for > > detailed information. > > The amount of complexity these two introduce seems quite substantial to > me. Both from a maintenance and a runtime perspective. I think we'd be better > off using building blocks like variable lengths encoded values than open > coding it in many places. I agree that this looks pretty ornate as written, but I think there might be some good ideas in here, too. It is also easy to reason about this kind of thing at least in terms of space consumption. It's a bit harder to know how things will play out in terms of CPU cycles and code complexity. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2022-10-17 17:14:21 -0400, Robert Haas wrote: > I have to admit that I worried about the same thing that Matthias > raises, more or less. But I don't know whether I'm right to be > worried. A variable-length representation of any kind is essentially a > gamble that values requiring fewer bytes will be more common than > values requiring more bytes, and by enough to justify the overhead > that the method has. And, you want it to be more common for each > individual user, not just overall. For example, more people are going > to have small relations than large ones, but nobody wants performance > to drop off a cliff when the relation passes a certain size threshold. > Now, it wouldn't drop off a cliff here, but what about someone with a > really big, append-only relation? Won't they just end up writing more > to WAL than with the present system? Perhaps. But I suspect it'd be a very small increase because they'd be using bulk-insert paths in all likelihood anyway, if they managed to get to a very large relation. And even in that case, if we e.g. were to make the record size variable length, they'd still pretty much never reach that and it'd be an overall win. The number of people with relations that large - leaving aside partitioning, which would still benefit since each partition is smaller - strikes me as a very small percentage. And as you say, it's not like there's a cliff where everything starts to be horrible. > Maybe not. They might still have some writes to relations other than > the very large, append-only relation, and then they could still win. > Also, if we assume that the overhead of the variable-length > representation is never more than 1 byte beyond what is needed to > represent the underlying quantity in the minimal number of bytes, they > are only going to lose if their relation is already more than half the > maximum theoretical size, and if that is the case, they are in danger > of hitting the size limit anyway. You can argue that there's still a > risk here, but it doesn't seem like that bad of a risk. Another thing here is that I suspect we ought to increase our relation size limit beyond 4-byte block numbers * blocksize at some point - and then we'll have to use variable encodings... Admittedly the amount of work needed to get there is substantial. Somewhat relatedly, I think we, very slowly, should move towards wider OIDs as well. Not having to deal with oid wraparound will be a significant win (particularly for toast), but to keep the overhead reasonable, we're going to need variable encodings. > But the same thing is not so obvious for, let's say, database OIDs. > What if you just have one or a few databases, but due to the previous > history of the cluster, their OIDs just happen to be big? Then you're > just behind where you would have been without the patch. Granted, if > this happens to you, you will be in the minority, because most users > are likely to have small database OIDs, but the fact that other people > are writing less WAL on average isn't going to make you happy about > writing more WAL on average. And even for a user for which that > doesn't happen, it's not at all unlikely that the gains they see will > be less than what we see on a freshly-initdb'd database. I agree that going for variable-width encodings on the basis of the database oid field alone would be an unconvincing proposition. But variably encoding database oids when we already variably encode other fields seems like a decent bet. If you e.g.
think of the 56-bit relfilenode field itself - obviously what I was thinking about in the first place - it's going to be a win much more often. To really lose, you'd have to have not just a large database oid, but also a large tablespace and relation oid and a huge block number... > So I don't really know what the answer is here. I don't think this > technique sucks, but I don't think it's necessarily a categorical win > for every case, either. And it even seems hard to reason about which > cases are likely to be wins and which cases are likely to be losses. True. I'm far less concerned than you or Matthias about increasing the size in rare cases as long as it wins in the majority of cases. But that doesn't mean every case is easy to consider. > > > 0002 - Rework XLogRecord > > > This makes many fields in the xlog header optional, reducing the size > > > of many xlog records by several bytes. This implements the design I > > > shared in my earlier message [1]. > > > > > > 0003 - Rework XLogRecordBlockHeader. > > > This patch could be applied on current head, and saves some bytes in > > > per-block data. It potentially saves some bytes per registered > > > block/buffer in the WAL record (max 2 bytes for the first block, after > > > that up to 3). See the patch's commit message in the patch for > > > detailed information. > > > > The amount of complexity these two introduce seems quite substantial to > > me. Both from a maintenance and a runtime perspective. I think we'd be better > > off using building blocks like variable lengths encoded values than open > > coding it in many places. > > I agree that this looks pretty ornate as written, but I think there > might be some good ideas in here, too. Agreed! Several of the ideas seem orthogonal to using variable encodings, so this isn't really an either/or. > It is also easy to reason about this kind of thing at least in terms of > space consumption. Hm, not for me, but... Greetings, Andres Freund
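[To put rough numbers on that, here is a back-of-the-envelope tally under the same 7-bits-per-byte assumption as the earlier sketch - again an illustration, not the actual patch encoding. It reproduces the 7-byte vs. 23-byte range Matthias quoted, and shows that hitting the worst case requires all four components to be near their maxima at the same time.]

#include <stdint.h>

/*
 * Illustration only: byte counts under a generic 7-bits-per-byte varint
 * scheme, not the encoding from the actual patch.
 */
static int
varint_size_uint64(uint64_t value)
{
    int         len = 1;

    while (value >= 0x80)
    {
        value >>= 7;
        len++;
    }
    return len;
}

static int
block_ref_size(uint64_t spcOid, uint64_t dbOid,
               uint64_t relNumber, uint64_t blockNum)
{
    return varint_size_uint64(spcOid) + varint_size_uint64(dbOid) +
        varint_size_uint64(relNumber) + varint_size_uint64(blockNum);
}

/*
 * block_ref_size(0, 16386, 797, 0)                 ->  7 bytes
 * block_ref_size(1U << 28, 1U << 28,
 *                UINT64_C(1) << 49, 1U << 28)      -> 23 bytes
 */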
On Thu, Oct 20, 2022 at 12:51 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-10-17 17:14:21 -0400, Robert Haas wrote: > > I have to admit that I worried about the same thing that Matthias > > raises, more or less. But I don't know whether I'm right to be > > worried. A variable-length representation of any kind is essentially a > > gamble that values requiring fewer bytes will be more common than > > values requiring more bytes, and by enough to justify the overhead > > that the method has. And, you want it to be more common for each > > individual user, not just overall. For example, more people are going > > to have small relations than large ones, but nobody wants performance > > to drop off a cliff when the relation passes a certain size threshold. > > Now, it wouldn't drop off a cliff here, but what about someone with a > > really big, append-only relation? Won't they just end up writing more > > to WAL than with the present system? > > Perhaps. But I suspect it'd be a very small increase because they'd be using > bulk-insert paths in all likelihood anyway, if they managed to get to a very > large relation. And even in that case, if we e.g. were to make the record size > variable length, they'd still pretty much never reach that and it'd be an > overall win. I think the number of cases where we will reduce the WAL size will be far more than the cases where it will slightly increase, and the number of bytes we save in the winning cases is far bigger than the number of bytes we add in the losing ones. So IMHO it seems like an overall win, at least from the WAL size reduction POV. Do we have any numbers for how much overhead the encoding/decoding adds? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com