Re: POC: make mxidoff 64 bits - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: POC: make mxidoff 64 bits
Date
Msg-id 2c62322e-a0e3-49cd-b369-370718a8efd8@iki.fi
In response to Re: POC: make mxidoff 64 bits  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
List pgsql-hackers
After a little detour to the "IPC/MultixactCreation on the Standby 
server" issue [1], I'm back to working on this. New patch version 
attached, addressing your comments and Maxim's.

On 05/12/2025 15:42, Ashutosh Bapat wrote:
> There is duplication of code/functionality between server and
> pg_upgrade. With it we carry all the risks that come with
> code/functionality duplication like the copies going out of sync.
> There may be a valid reason to do that but it's not documented in the
> comments. At least multixact_new.c and slru_io.c are not as well
> commented as their server counterparts. I understand that the SLRU
> code in the server deals with shared memory which is not needed in
> pg_upgrade; pg_upgrade will not need more than one buffer in memory,
> and the pg_upgrade code doesn't need to deal with locks, nor can it.
> That means the code required by pg_upgrade is much simpler than that
> on the server. But there's also non-trivial code which is
> required in both cases. Will it be possible to extract parts of
> slru.c which deal with IO into slru_io.c, make it part of the core and
> then use it in pg_upgrade as well as slru.c? Or whether it's possible
> to make SLRU use local memory? And throwing some FRONTEND magic to the
> mix, we may be able to avoid duplication. Have we tried this or
> something else to avoid duplication? Sorry, if this has been discussed
> earlier. Please point me to the relevant discussion if so.

That's a fair point, but I think it's better to have some code 
duplication in this case than to try to write code that works for both 
the server and pg_upgrade. The needs are just too different.

> 007_multixact_conversion.pl fires thousands of queries through
> BackgroundPsql which prints debug output for each of the queries. When
> running this file with oldinstall set,
> 2.2M  regress_log_007_multixact_conversion (size of file)
> 77874 regress_log_007_multixact_conversion (wc -l output)
> 
> 
> Since this output is also copied in testlog.txt, the effect is two-fold.
> 
> 
> Most, if not all, of this output is useless. It also makes it hard to
> find the output we are looking for. PFA patch which reduces this
> output. The patch adds a flag verbose to query_safe() and query() to
> toggle this output. With the patch the sizes are
> 27K  regress_log_007_multixact_conversion
> 588 regress_log_007_multixact_conversion
> 
> 
> And it makes the test faster by about a second or two on my laptop.
> Something along those lines, or something similar, is required to
> reduce the output from query_safe().

Nice! That log bloat was the reason I bundled together the "COMMIT; 
BEGIN; SELECT ...;" steps into one statement in the loop. Your solution 
addresses it more directly.

I turned 'verbose' into a keyword parameter, for future extensibility of 
those functions, so you now call it like "$node->query_safe("SELECT 1", 
verbose => 0);". I also set "log_statement = 'none'" in those 
connections, to reduce the noise in the server log too.

> Some more comments
> +++ b/src/bin/pg_upgrade/multixact_old.c
> 
> 
> We may need to introduce new _new and then _old will become _older.
> Should we rename the files to have pre19 and post19 or some similar
> suffixes which make it clear what is meant by old and new?

+1. I renamed multixact_old.c to multixact_pre_v19.c. And 
multixact_new.c to multixact_rewrite.c. I also moved the 
"convert_multixact" function that drives the conversion to 
multixact_rewrite.c. The idea is that if in the future we change the 
format again, we will have:

multixact_pre_v19.c  # for reading pre-v19 files
multixact_pre_v24.c  # for reading v19-v23 files
multixact_rewrite.c  # for writing new files

It's hard to predict what a possible future format might look like, or 
how we'd want to organize the code then. This can be reorganized later 
if needed, but it makes sense for now.

> +static inline int64
> +MultiXactIdToOffsetPage(MultiXactId multi)
> 
> 
> The prologue mentions that the definitions are copy-pasted from
> multixact.c from version 18, but they share the names with functions
> in the current version. I think that's going to be a good source of
> confusion especially in a file which is a few hundred lines long. Can
> we rename them to have "Old" prefix or something similar?

Fair. On the other hand, having the same names makes it easier to see 
what the real differences from the server functions are. Not sure what's 
best here...

As long as we use the same names, it's important that 
multixact_pre_v19.c doesn't #include the new definitions. I added some 
comments on that, and also this safeguard:

#define MultiXactOffset should_not_be_used

That actually caught one (harmless) instance in the file where we had 
not renamed MultiXactOffset to OldMultiXactOffset.

I'm not entirely happy with the "Old" prefix here, because as you 
pointed out, we might end up needing "older" or "oldold" in the future. 
I couldn't come up with anything better though. "PreV19MultiXactOffset" 
is quite a mouthful.

> +# Dump contents of the 'mxofftest' table, created by mxact_workload
> +sub get_dump_for_comparison
> 
> 
> This local function shares its name with a local function in
> 002_pg_upgrade.pl. Better to use a separate name. Also it's not
> "dumping" data using "pg_dump", so "dump" in the name can be
> misleading.

Renamed to "get_test_table_contents"

> + $newnode->start;
> + my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
> + $newnode->stop;
> 
> 
> There is no code which actually looks at the multixact offsets here to
> make sure that the conversion happened correctly. I guess the test
> relies on visibility checks for that. Anyway, we need a comment
> explaining why just comparing the contents of the table is enough to
> ensure correct conversion. Better if we can add an explicit test that
> the offsets were converted correctly. I don't have any idea of how to
> do that right now, though. Maybe use pg_get_multixact_members()
> somehow in the query to extract data out of the table?

Agreed, the verification here is quite weak. I didn't realize that 
pg_get_multixact_members() exists! That might indeed be handy here, but 
I'm not sure how exactly to construct the test. A direct C function like 
test_create_multixact() in test_multixact.c would be handy here, but 
we'd need to compile and run that in the old cluster, which seems 
difficult.

> + compare_files($old_dump, $new_dump,
> + 'dump outputs from original and restored regression databases match');
> 
> 
> A shared test name too :); but there is no regression database here.

Fixed :-)

> +
> + note ">>> case #${tag}\n"
> +   . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
> +   . " newnode mxoff ${new_next_mxoff}\n";
> 
> 
> Should we check that some condition holds between finish_mxoff and
> new_next_mxoff?

Got something in mind that we could check?


On 04/12/2025 17:33, Maxim Orlov wrote:
> On Thu, 4 Dec 2025 at 13:39, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> However, that doesn't hold for pg_upgrade. pg_upgrade will try to read
>> all the multixids. So we need to make the multixact reading code
>> tolerant of the situations that could be present after a crash. I think
>> the right philosophy here is that we try to read all the old multixids,
>> and do our best to interpret them the same way that the old server
>> would.
> 
> Something like attached?

+1

> Now the previous upgrade scheme with the byte juggling starts
> looking not so bad. Just a funny thought that came to my mind.

:-)

>> Perhaps we should check that all the files exist and have the correct
>> sizes in the pre-check stage
> 
> Not sure about it. Because SLRU does not support "holes", simply
> checking if the first and last multixacts exist will be enough. But
> we'll do it anyway in a real conversion.

Yeah, the point would be to complain if there are holes when there 
shouldn't be. As a sanity check.

There's a reason to not do that though: if you use pg_resetwal to skip
over some multixids, you end up with holes. We shouldn't encourage 
people to use pg_resetwal, but it seems reasonable to tolerate it if you 
have done it.

I worked some more on this. One notable change is that in light of the 
"IPC/MultixactCreation on the Standby server" changes [1], we need to 
always write the next multixid's offset, even if the next multixid 
itself is invalid; otherwise the previous multixid is unreadable too.

The SlruReadSwitchPageSlow() function didn't handle short reads 
properly. As a result, if an SLRU file was shorter than expected, the 
buffer kept its old contents when switching to read the missing page. 
Fixed that so that the missing part is read as all-zeros instead.

I removed the warnings from some of the invalid multixid cases, like if 
the offset or some of the members are zeros. Those cases can 
legitimately happen after crash and restart, so we shouldn't complain 
about them. If a multixid has more than one updating member, I kept that 
as a fatal error. That really should not happen.

To summarize, the behavior now is that if an old SLRU file does not 
exist, you get an error. If an SLRU file is too short, you get warnings 
and the missing pages are read as all-zeros, i.e. all the multixids on 
the missing pages are considered invalid. If an individual multixid is 
invalid because its offset is zero, it's silently written as invalid in 
the new file too.

I'm still not 100% sure what the desired behavior for missing files is. 
For now, I didn't include the pre-checks for the first and last files in 
this version. You can end up with missing files if you skip over many 
multixids with pg_resetwal. Or it could be a sign of lost data. If it's 
lost data, would you prefer for the upgrade to fail, or continue 
upgrading the data that you have? The conversion shouldn't make things 
worse, if the data was already lost, but then again, if something really 
bad has happened, all bets are off and perhaps it would be best to abort 
and complain loudly.

[1] 
https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru

- Heikki

