Thread: Re: [PATCH] Refactor SLRU to always use long file names
Hi Michael,

> On Wed, Sep 11, 2024 at 04:07:06PM +0300, Aleksander Alekseev wrote:
> > Commit 4ed8f0913bfd introduced long SLRU file names. The proposed
> > patch removes the SlruCtl->long_segment_names flag and makes SLRU
> > always use long file names. This simplifies both the code and the
> > API. Corresponding changes to pg_upgrade are included.
>
> That's leaner, indeed.
>
> > One drawback I see is that technically SLRU is an exposed API and
> > changing it may affect third-party code. I'm not sure if we should
> > seriously worry about this. Firstly, the change is trivial and
> > secondly, it's not clear whether such third-party code even exists
> > (we broke this API just recently in 4ed8f0913bfd and no one
> > complained).
>
> Any third-party code using custom SLRUs would need to take care of
> handling their upgrade path outside pg_upgrade. Not sure there are
> any of them, TBH, but let's see.
>
> > I didn't include any tests for the new pg_upgrade code. To my
> > knowledge we test it manually, with buildfarm members and during
> > alpha- and beta-testing periods. Please let me know if you think
> > there should be a corresponding TAP test.
>
> Removing the old API means that it is impossible to test a move from
> short to long file names. That's OK by me to rely on the pg_upgrade
> paths in the buildfarm code. We have a few of them.

Thanks for the feedback.

> There is one thing I am wondering, here, though, which is to think
> harder about a validity check at the end of 002_pg_upgrade.pl to make
> sure that all the SLRU use long file names after running the tests.
> That would mean thinking about a mechanism to list all of them from a
> backend, rather than hardcode a list of them. Perhaps that's not
> worth it, just dropping an idea in the bucket of ideas. I would guess
> in the shape of a catalog that's able to represent at SQL level all
> the SLRUs that exist in a backend.

Hmm... IMO that would be a rather niche facility to maintain in PG
core. At least I'm not aware of cases where a DBA wanted to list the
initialized SLRUs. Would it be convenient for core or extension
developers? Setting a breakpoint on SimpleLruInit() or adding a
temporary elog() sounds simpler to me.

It wouldn't hurt to re-check the segment file names in the TAP test,
but this would mean hardcoding a list of SLRU names, which, as I
understand, you want to avoid. With high probability PG wouldn't
start at all if the corresponding piece of pg_upgrade were wrong
(I checked more than once :). So I'm not entirely sure it's worth
the effort, but let's see what others think.

--
Best regards,
Aleksander Alekseev
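The TAP-test check discussed above could be sketched roughly like this. To be clear, the directory list and the short-name pattern below are my assumptions for illustration, not something the patch defines:

```perl
#!/usr/bin/env perl
# Rough sketch of the discussed check: scan a hardcoded list of SLRU
# directories under $PGDATA and verify that no segment file still uses
# a short (pre-4ed8f0913bfd, 4-hex-digit) name. Directory names and the
# name pattern are assumptions made for this sketch.
use strict;
use warnings;

my @slru_dirs = qw(pg_xact pg_commit_ts pg_subtrans pg_serial
                   pg_multixact/offsets pg_multixact/members);

sub all_segments_long
{
	my ($pgdata) = @_;

	for my $dir (@slru_dirs)
	{
		opendir(my $dh, "$pgdata/$dir") or next;    # skip missing dirs
		for my $seg (grep { !/^\./ } readdir($dh))
		{
			# A 4-hex-digit name means an un-upgraded segment.
			return 0 if $seg =~ /^[0-9A-F]{4}$/;
		}
		closedir($dh);
	}
	return 1;
}
```

This keeps the hardcoded list in one place, but of course it would go stale whenever an SLRU is added or removed, which is exactly the maintenance concern raised above.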
Hi again,

Just a quick follow-up.

> (*) BTW I noticed a mistake in the commented code. The condition
> should be `>=`, not `<`, i.e.:
>
> ```
> if (new_cluster.controldata.cat_ver >= SLRU_SEG_FILENAMES_CHANGE_CAT_VER)
>     return;
> ```

The concentration of caffeine in my blood is a bit low right now. I
suspect I may need to re-check this statement with a fresh head.

It also occurred to me that as a fourth option we could simply drop
this check. Users, however, would pay the price every time they
execute pg_upgrade, so I doubt we are going to do that.

--
Best regards,
Aleksander Alekseev
Hi Michael,

> The scans may be quite long as well, actually, which could be a
> bottleneck. Did you measure the runtime with a maximized (still
> realistic) pool of files for these SLRUs in the upgrade time? For
> upgrades, data would be the neck.

Good question. In theory SLRUs are not supposed to grow large, and
their size is a small fraction of the rest of the database. As an
example, CLOG (pg_xact/) stores 2 bits per transaction. Since every
SLRU has a dedicated directory and we scan only that directory,
non-SLRU files don't affect the scan time.

To make sure, I asked several people to check how many SLRU segments
they have in their production environments. The typical response
looked like this:

```
$PGDATA/pg_xact: 191 segments
$PGDATA/pg_commit_ts: 3
$PGDATA/pg_multixact/offsets: 148
$PGDATA/pg_multixact/members: 400
$PGDATA/pg_subtrans: 4
$PGDATA/pg_serial: 3
```

This is an 800 GB database. Interestingly, larger databases (4.2 TB)
may have far fewer SLRU segments (220 in total, most of them in
pg_xact). And here is the *worst* case that was reported to me:

```
$PGDATA/pg_xact: 171 segments
$PGDATA/pg_commit_ts: 3
$PGDATA/pg_multixact/offsets: 4864
$PGDATA/pg_multixact/members: 40996
$PGDATA/pg_subtrans: 5
$PGDATA/pg_serial: 3
```

I was told this is a "1Tb+" database. For this user pg_upgrade will
rename about 45,000 files. I wrote a little script to check how much
time that takes:

```
#!/usr/bin/env perl

use strict;

my $from = "test_0001.tmp";
my $to   = "test_0002.tmp";

system("touch $from");

for my $i (1 .. 45000)
{
	rename($from, $to);
	($from, $to) = ($to, $from);
}
```

On my laptop this takes 0.5 seconds. Note that I don't do any
scanning, only renaming, assuming that the renaming should take most
of the time. I think this figure should be multiplied by 10 to take
into account the role of the filesystem cache and other factors. All
in all, in the absolutely worst-case scenario this shouldn't take
more than 5 seconds; in reality it will probably be orders of
magnitude less.
> Note that this also depends on the system endianness, see
> 039_end_of_wal.pl.

Sure, I think I took that into account when using pack("L!"). My
understanding is that "L" takes care of the endianness, since I see
special flags to force little- or big-endianness independently of the
platform [1]. This of course should be tested in practice on
different machines. Using an exclamation mark in "L!" was a mistake,
though, since cat_ver is not a native int, but rather a uint32.

> You don't really need the lookup part, actually?

For lookup we already have the pg_controldata tool, so that's not a
problem.

> Control file manipulation may be useful as a routine in Cluster.pm,
> based on an offset in the file and a format to pack as argument?
> [...]
> It's one of these things I could see myself reuse to force a state in
> the cluster and make a test cheaper, for example.
> You would just need the part where
> the control file is rewritten, which should be OK as long as the
> cluster is freshly initdb'd meaning that there should be nothing that
> interacts with the new value set.

Agreed. Still, I don't see a good way of figuring out
sizeof(ControlFileData) from Perl. The structure has ints in it (e.g.
wal_level, MaxConnections, etc.), so its size is platform-dependent.
The CRC is placed at the end of the structure. If we want to
manipulate MaxConnections and friends, their offsets are going to be
platform-dependent as well. And my understanding is that the
alignment is platform- and compiler-dependent too.

I guess we are going to need either a `pg_writecontroldata` tool or a
`pg_controldata -w` flag. I wonder which option you find more
attractive, or maybe you have better ideas?

[1]: https://perldoc.perl.org/functions/pack

--
Best regards,
Aleksander Alekseev
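To illustrate the pack() formats in question: "L!" packs the platform's native unsigned long, whose size (and byte order) depends on the host, while the "<" and ">" modifiers pin down a 32-bit little- or big-endian layout regardless of the machine. The cat_ver value below is made up:

```perl
#!/usr/bin/env perl
# Demonstrate the difference between "L!" (native unsigned long) and
# the endianness-forcing "L<" / "L>" formats. The value is arbitrary.
use strict;
use warnings;

my $cat_ver = 202409061;    # made-up catalog version number

my $native = pack("L!", $cat_ver);    # 4 or 8 bytes, host byte order
my $le     = pack("L<", $cat_ver);    # always 4 bytes, little-endian
my $be     = pack("L>", $cat_ver);    # always 4 bytes, big-endian

printf("native unsigned long: %d bytes\n", length($native));
printf("little-endian uint32: %s\n", unpack("H*", $le));
printf("big-endian uint32:    %s\n", unpack("H*", $be));

die "round-trip failed" unless unpack("L<", $le) == $cat_ver;
```

On an LP64 platform the native form is 8 bytes, which is exactly why "L!" is the wrong choice for a field that is a uint32 on disk; "L<" or "L>" (matching the cluster's endianness) avoids both the size and byte-order ambiguity.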
Hi,

> I guess we are going to need either a `pg_writecontroldata` tool or a
> `pg_controldata -w` flag. I wonder which option you find more
> attractive, or maybe you have better ideas?

For the record, Michael and I had a brief discussion about this
off-list and decided to abandon the idea of adding TAP tests, relying
only on the buildfarm.

Also, I will check whether we produce a clear error message in the
case where a user forgot to run pg_upgrade and runs the new slru.c
against old file names. If the user doesn't get such an error
message, I will see if it's possible to add one somewhere in slru.c
without introducing much performance overhead.

I'm also going to submit precise steps for testing this migration
manually, for the reviewers' convenience.

--
Best regards,
Aleksander Alekseev