Thread: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)

Hi all,

The problem mentioned in $subject has been discussed here:

https://www.postgresql.org/message-id/DM5PR0501MB38800D9E4605BCA72DD35557CCE10@DM5PR0501MB3880.namprd05.prod.outlook.com

Ths issue has been fixed by 947789f, without a backpatch to v12 (as
per 96cdeae) as the risk seemed rather limited seen from here, back
when the problem was discussed.  Unfortunately, I have seen customer
deployments on v12 and v13 playing with pg_database entries large
enough that they would have toast entries and would be able to trigger
the problem fixed in v14 at the end of a vacuum.

Any objections about getting 947789f applied to REL_13_STABLE and
REL_12_STABLE and see this issue completely gone from all the versions
supported?

Thanks,
--
Michael

Attachment
Michael Paquier <michael@paquier.xyz> writes:
> Any objections about getting 947789f applied to REL_13_STABLE and
> REL_12_STABLE and see this issue completely gone from all the versions
> supported?

No objections to back-patching the fix, but I wonder if we can find
some mechanical way to prevent this sort of error in future.  It's
surely far from obvious that we need to apply heap_inplace_update
to a raw tuple rather than a syscache entry.

A partial fix perhaps could be to verify that the supplied tuple
is the same length as what we see on-disk?  It's partial because
it'd only trigger if there had actually been a toasted-field
expansion, so it'd most likely not catch such coding errors
during developer testing.

            regards, tom lane



On Tue, Jan 10, 2023 at 02:57:43AM -0500, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> Any objections about getting 947789f applied to REL_13_STABLE and
>> REL_12_STABLE and see this issue completely gone from all the versions
>> supported?
> 
> No objections to back-patching the fix...

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Hi,

On 2023-01-10 02:57:43 -0500, Tom Lane wrote:
> No objections to back-patching the fix, but I wonder if we can find
> some mechanical way to prevent this sort of error in future.

What about a define that forces external toasting very aggressively for
catalog tables, iff they have a toast table? I suspect doing so for
non-catalog tables as well would trigger test changes. Running a buildfarm
animal with that would at least make issues like this much easier to discover.

Greetings,

Andres Freund



On Tue, Jan 10, 2023 at 09:54:31AM -0800, Nathan Bossart wrote:
> +1

Okay, thanks.  Done this part as of c0ee694 and 72b6098, then.
--
Michael

Attachment
On Tue, Jan 10, 2023 at 11:05:04AM -0800, Andres Freund wrote:
> What about a define that forces external toasting very aggressively for
> catalog tables, iff they have a toast table? I suspect doing so for
> non-catalog tables as well would trigger test changes. Running a buildfarm
> animal with that would at least make issues like this much easier to discover.

Hmm.  That could work.  I guess that you mean to do something like
that in SearchSysCacheCopy() when we build the tuple copy.  There is
an access to the where the cacheId, meaning that we know the catalog
involved.  Still, we would need a lookup at its pg_class entry to
check after a reltoastrelid, meaning an extra relation opened, which
would be fine under a specific #define, anyway..
--
Michael

Attachment