Thread: Hide exposed impl detail of wchar.c

Hide exposed impl detail of wchar.c

From
Jubilee Young
Date:
Hello,

I work on pgrx, a Rust crate (really, a set of them) that allows
people to use Rust to write extensions against Postgres, exporting
what Postgres sees as ordinary "C" dynamic libraries. Unfortunately,
the build system for this is a touch complicated, as it cannot simply
run pgxs.mk, and as-of Postgres 16 it has been periodically failing on
platforms it used to do fine on, due to troubles involved with the
SIMD extension headers.

I have root-caused the exact problem, but the bug is... social, rather
than technical in nature: people with inadequate options at their
disposal have implemented increasingly clever educated guesses that
are increasingly prone to going wrong, rather than just asking anyone
to help them increase their options. Rather than continuing this
trend, I figured I would simply start doing things to hopefully draw
the line here. I will be looking to follow up with the bindgen tools
that fail to handle this correctly, but it would be nice if this
stopped shipping in Postgres 16."${PG_NEXT_MINOR}", as pgrx does need
the definitions in pg_wchar.h to have enough data to correctly
determine database encoding and preserve certain Rust library
invariants ("all Rust strings are correctly-formed UTF-8, anything
else is just a sequence of bytes") without also obliterating
performance.

On the off-chance that everyone agrees with me without reserve, the
attached patch moves the relevant code around (and includes the gory
details). This seems to be unlikely to be the only mildly-exotic build
system failure caused by such an overexposed implementation detail, so
while I'm not married to this particular code motion, it seems best to
improve this some way.

Attachment

Re: Hide exposed impl detail of wchar.c

From
Tom Lane
Date:
Jubilee Young <workingjubilee@gmail.com> writes:
> I have root-caused the exact problem, but the bug is... social, rather
> than technical in nature: people with inadequate options at their
> disposal have implemented increasingly clever educated guesses that
> are increasingly prone to going wrong, rather than just asking anyone
> to help them increase their options. Rather than continuing this
> trend, I figured I would simply start doing things to hopefully draw
> the line here. I will be looking to follow up with the bindgen tools
> that fail to handle this correctly, but it would be nice if this
> stopped shipping in Postgres 16."${PG_NEXT_MINOR}", as pgrx does need
> the definitions in pg_wchar.h to have enough data to correctly
> determine database encoding and preserve certain Rust library
> invariants ("all Rust strings are correctly-formed UTF-8, anything
> else is just a sequence of bytes") without also obliterating
> performance.

It would be nice if you would state your problem straightforwardly,
rather than burying us in irrelevant-to-us details; but apparently
what you are unhappy about is that pg_wchar.h now #include's simd.h.
That evidently stems from commit 121d2d3d7 trying to make
is_valid_ascii() faster.

Currently the only caller of is_valid_ascii() is in wchar.c,
and so we could easily fix this by moving is_valid_ascii()
into wchar.c as your patch proposes.  However ... I suppose the
point of having it as a "static inline" in a header file is to
be able to optimize other call sites too.  So I wonder if there
used to be some, or this was just somebody's over-eagerness to
expose stuff they thought possibly might be useful.  And maybe
more to the point, are we worried about there being other
callers in future?  I'm really not sure.

            regards, tom lane



Re: Hide exposed impl detail of wchar.c

From
Nathan Bossart
Date:
On Thu, Nov 16, 2023 at 12:10:59PM -0800, Jubilee Young wrote:
> On the off-chance that everyone agrees with me without reserve, the
> attached patch moves the relevant code around (and includes the gory
> details). This seems to be unlikely to be the only mildly-exotic build
> system failure caused by such an overexposed implementation detail, so
> while I'm not married to this particular code motion, it seems best to
> improve this some way.

It looks like is_valid_ascii() was originally added to pg_wchar.h so that
it could easily be used elsewhere [0] [1], but that doesn't seem to have
happened yet.

Would moving this definition to a separate header file be a viable option?
That'd still break any existing projects that are using it, but at least
there'd be an easy fix.  I'm not sure there _are_ any other projects using
it, anyway.  However, both of these proposals feel like they might be
slightly beyond what we'd ordinarily consider back-patching.

That being said, it's not unheard of for Postgres to make adjustments for
third-party code (search for "pljava" in commits 62aba76 and f4aa3a1).  I
read the description of the pgrx problem [2], and I'm still trying to
understand the scope of the issue.  I don't think it's reasonable to expect
someone building an extension to always use the exact same compiler that
was used by the packager, but I also don't fully understand why different
compilations of an inline function would cause problems.

[0] https://postgr.es/m/CAFBsxsHG%3Dg6W8Mie%2B_NO8dV6O0pO2stxrnS%3Dme5ZmGqk--fd5g%40mail.gmail.com
[1] https://postgr.es/m/CAFBsxsH1Yutrmu%2B6LLHKK8iXY%2BvG--Do6zN%2B2900spHXQNNQKQ%40mail.gmail.com
[2] https://github.com/pgcentralfoundation/pgrx/issues/1298

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Hide exposed impl detail of wchar.c

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> It looks like is_valid_ascii() was originally added to pg_wchar.h so that
> it could easily be used elsewhere [0] [1], but that doesn't seem to have
> happened yet.

It seems to be new as of v15, so there wouldn't have been a lot of time
for external code to adopt it.  As far as I can tell from Debian Code
Search, nobody has yet.

> Would moving this definition to a separate header file be a viable option?
> That'd still break any existing projects that are using it, but at least
> there'd be an easy fix.

That would provide a little bit of cover, at least, compared to just
hiding it in the .c file.

I'm generally sympathetic to the idea that simd.h was a rather large
dependency to add to something as widely used as pg_wchar.h.  So I'd
favor getting it out of there just on compilation-time grounds,
independently of whether it's causing active problems.  That argument
wouldn't justify a back-patch, but "it's causing problems" might.

            regards, tom lane



Re: Hide exposed impl detail of wchar.c

From
Jubilee Young
Date:
On Thu, Nov 16, 2023 at 2:54 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> That being said, it's not unheard of for Postgres to make adjustments for
> third-party code (search for "pljava" in commits 62aba76 and f4aa3a1). I
> read the description of the pgrx problem [2], and I'm still trying to
> understand the scope of the issue. I don't think it's reasonable to expect
> someone building an extension to always use the exact same compiler that
> was used by the packager, but I also don't fully understand why different
> compilations of an inline function would cause problems.
>
> [0] https://postgr.es/m/CAFBsxsHG%3Dg6W8Mie%2B_NO8dV6O0pO2stxrnS%3Dme5ZmGqk--fd5g%40mail.gmail.com
> [1] https://postgr.es/m/CAFBsxsH1Yutrmu%2B6LLHKK8iXY%2BvG--Do6zN%2B2900spHXQNNQKQ%40mail.gmail.com
> [2] https://github.com/pgcentralfoundation/pgrx/issues/1298
>

We don't directly `#include` C into Rust, but use libclang to preprocess and
compile a wrapping C header into a list of symbols Rust will look for at link
time. Our failure is in libclang and how we steer it:
- The Clang-C API (libclang.so) cannot determine where system headers are.
- A clang executable can determine where system headers are, but our bindgen
may be asked to use a libclang.so without a matching clang executable!
- This is partly because system packagers do not agree on what clang parts
must be installed together, nor even on the clang directory's layout.
- Thus, it is currently impossible to, given a libclang.so, determine with
100% accuracy where version-appropriate system headers are and include them,
nor does it do so implicitly.
- Users cannot be expected to always have reasonable directory layouts, nor
always have one clang + libclang.so + clang/"$MAJOR"/include on the system.
- We have tried various solutions and have had several users report, by various
channels, that their builds are breaking, even after they charitably try out
the patches I offer in their CI. Especially after system updates.

The clang-sys and rust-bindgen crates committed a series of unfortunate hacks
that surprisingly work. But the only real solution is actually exposing the
C++ API for header searches to Clang-C, and then move that up the deps chain.
Perhaps libclang-18.so will not have this problem?

- Jubilee



Re: Hide exposed impl detail of wchar.c

From
Nathan Bossart
Date:
On Thu, Nov 16, 2023 at 06:06:30PM -0500, Tom Lane wrote:
> I'm generally sympathetic to the idea that simd.h was a rather large
> dependency to add to something as widely used as pg_wchar.h.  So I'd
> favor getting it out of there just on compilation-time grounds,
> independently of whether it's causing active problems.  That argument
> wouldn't justify a back-patch, but "it's causing problems" might.

Given the lack of evidence of anyone else using is_valid_ascii(), I'm
leaning towards back-patching being the better option in this case.  I
don't know if it'll be feasible to keep simd.h out of all headers that
third-party code might want to use forever, but that's not an argument
against doing this right now for pgrx.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Hide exposed impl detail of wchar.c

From
John Naylor
Date:
On Fri, Nov 17, 2023 at 5:54 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> It looks like is_valid_ascii() was originally added to pg_wchar.h so that
> it could easily be used elsewhere [0] [1], but that doesn't seem to have
> happened yet.
>
> Would moving this definition to a separate header file be a viable option?

Seems fine to me. (I believe the original motivation for making it an
inline function was for in pg_mbstrlen_with_len(), but trying that
hasn't been a priority.)



Re: Hide exposed impl detail of wchar.c

From
Jubilee Young
Date:
On Fri, Nov 17, 2023 at 2:26 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Fri, Nov 17, 2023 at 5:54 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >
> > It looks like is_valid_ascii() was originally added to pg_wchar.h so that
> > it could easily be used elsewhere [0] [1], but that doesn't seem to have
> > happened yet.
> >
> > Would moving this definition to a separate header file be a viable option?
>
> Seems fine to me. (I believe the original motivation for making it an
> inline function was for in pg_mbstrlen_with_len(), but trying that
> hasn't been a priority.)

In that case, I took a look across the codebase and saw a
utils/ascii.h that doesn't
seem to have gotten much love, but I suppose one could argue that it's intended
to be a backend-only header file?

As the codebase is growing some enhanced UTF-8 support, you'll want somewhere
that contains the optimized US-ASCII routines, because, as US-ASCII is
a subset of
UTF-8, and often faster to handle, it's typical for such codepaths to look like

```c
while (i < len && no_multibyte_chars) {
   i = i + ascii_op_version(i, buffer, &no_multibyte_chars);
}

while (i < len) {
    i = i + utf8_op_version(i, buffer);
}
```

So it should probably end up living somewhere near the UTF-8 support, and
the easiest way to make it not go into something pgrx currently
includes would be
to make it a new header file, though there's a fair amount of API we
don't touch.

From the pgrx / Rust perspective, Postgres function calls are passed
via callback
to a "guard function" that guarantees that longjmp and setjmp don't
cause trouble
(and makes sure we participate in that). So we only want to call
Postgres functions
if we "can't replace" them, as the overhead is quite a lot. That means
UTF-8-per-se
functions aren't very interesting to us as the Rust language already
supports it, but
we do benefit from access to transcoding to/from UTF-8.

—Jubilee



Re: Hide exposed impl detail of wchar.c

From
Nathan Bossart
Date:
On Mon, Nov 20, 2023 at 10:50:36AM -0800, Jubilee Young wrote:
> In that case, I took a look across the codebase and saw a
> utils/ascii.h that doesn't
> seem to have gotten much love, but I suppose one could argue that it's intended
> to be a backend-only header file?

That might work.  It's not #included in very many files, so adding
port/simd.h shouldn't be too bad.  And ascii.h is also pretty inexpensive,
so including it in wchar.c seems permissible, too.  I'm not certain this
doesn't cause problems with libpgcommon, but I don't see why it would,
either.

> So it should probably end up living somewhere near the UTF-8 support, and
> the easiest way to make it not go into something pgrx currently
> includes would be
> to make it a new header file, though there's a fair amount of API we
> don't touch.

Does pgrx use ascii.h at all?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Hide exposed impl detail of wchar.c

From
Andres Freund
Date:
Hi,

On 2023-11-16 17:11:03 -0800, Jubilee Young wrote:
> We don't directly `#include` C into Rust, but use libclang to preprocess and
> compile a wrapping C header into a list of symbols Rust will look for at link
> time. Our failure is in libclang and how we steer it:
> - The Clang-C API (libclang.so) cannot determine where system headers are.
> - A clang executable can determine where system headers are, but our bindgen
> may be asked to use a libclang.so without a matching clang executable!
> - This is partly because system packagers do not agree on what clang parts
> must be installed together, nor even on the clang directory's layout.
> - Thus, it is currently impossible to, given a libclang.so, determine with
> 100% accuracy where version-appropriate system headers are and include them,
> nor does it do so implicitly.

I remember battling this in the past, independent of rust :(


What I don't quite get is why SIMD headers are particularly more problematic
than others - there's other headers that are compiler specific?

Greetings,

Andres Freund



Re: Hide exposed impl detail of wchar.c

From
Nathan Bossart
Date:
On Mon, Nov 20, 2023 at 05:14:17PM -0800, Jubilee Young wrote:
> On Mon, Nov 20, 2023 at 2:52 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Does pgrx use ascii.h at all?
> 
> We don't use utils/ascii.h, no.

Alright.  The next minor release isn't until February, so I'll let this one
sit a little while longer in case anyone objects to back-patching something
like this [0].

[0] https://postgr.es/m/attachment/152305/move_is_valid_ascii_v2.patch

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Hide exposed impl detail of wchar.c

From
Eric Ridge
Date:
> On Nov 20, 2023, at 7:10 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
> What I don't quite get is why SIMD headers are particularly more problematic
> than others - there's other headers that are compiler specific?

The short answer is the rust-based bindings generation tool pgrx uses (bindgen) is a little brain dead and gets
confusedwhen there’s multiple compiler builtins headers on the host. 

This simd header is the first of its kind we’ve run across that’s exposed via Postgres’ “public api”. And bindgen wants
tofollow all the includes, it gets confused, picks the wrong one, and then errors happen. 

And I don’t know that it makes much sense for Postgres to include such a header into 3rd-party code anyways.

I think Jubilee is also working with them to fix this, but we’re hoping Jubilee’s patch here (or similar) can get
mergedso we can clear our build drama. 

eric


Re: Hide exposed impl detail of wchar.c

From
Eric Ridge
Date:
(I hope you don't mind I'm reposting your reply -- I accidentally replied directly to you b/c phone)

> On Nov 21, 2023, at 11:56 AM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-11-21 10:11:18 -0500, Eric Ridge wrote:
>> On Mon, Nov 20, 2023 at 7:10 PM Andres Freund <andres@anarazel.de> wrote:

<snip>

>> And I don’t know that it makes much sense for Postgres to include such a
>> header into 3rd-party code anyways.
>
> Well, we want to expose such functionality to extensions. For some cases using
> full functions would to be too expensive, hence using static inlines. In case
> of exposing simd stuff, that means we need to include headers.

Sure.  Probably not my place to make that kind of broad statement anyways.  The "static inlines" are problematic for us
inpgrx-land too, but that's a different problem for another day. 


> I'm not against splitting this out of pg_wchar.h, to be clear - that's a too
> widely used header for, so there's a good independent reason for such a
> change. I just don't really believe that moving simd.h out of there will end
> the issue, we'll add more inlines using simd over time...

Yeah and that's why Jubilee is working with the bindgen folks to tighten this up for good.

(We tracked all of the pg16 betas and didn't run into this until after pg16 went gold.  I personally haven't groveled
throughthe git logs to see when this header/static inline was added, but we'd have reported this sooner had we found it
sooner.)

eric


Re: Hide exposed impl detail of wchar.c

From
Nathan Bossart
Date:
On Mon, Nov 20, 2023 at 10:39:43PM -0600, Nathan Bossart wrote:
> Alright.  The next minor release isn't until February, so I'll let this one
> sit a little while longer in case anyone objects to back-patching something
> like this [0].
> 
> [0] https://postgr.es/m/attachment/152305/move_is_valid_ascii_v2.patch

Barring objections, I plan to commit this and back-patch it to v16 in the
next few days.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Hide exposed impl detail of wchar.c

From
Nathan Bossart
Date:
On Thu, Jan 04, 2024 at 04:43:29PM -0600, Nathan Bossart wrote:
> On Mon, Nov 20, 2023 at 10:39:43PM -0600, Nathan Bossart wrote:
>> Alright.  The next minor release isn't until February, so I'll let this one
>> sit a little while longer in case anyone objects to back-patching something
>> like this [0].
>> 
>> [0] https://postgr.es/m/attachment/152305/move_is_valid_ascii_v2.patch
> 
> Barring objections, I plan to commit this and back-patch it to v16 in the
> next few days.

Apologies for the delay.  We're getting close to the February release, so I
should probably take care of this one soon...

I see that I was planning on back-patching this to v16, but since
is_valid_ascii() was introduced in v15, I'm wondering if it'd be better to
back-patch it there so that is_valid_ascii() lives in the same file for all
versions where it exists.  Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Hide exposed impl detail of wchar.c

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> I see that I was planning on back-patching this to v16, but since
> is_valid_ascii() was introduced in v15, I'm wondering if it'd be better to
> back-patch it there so that is_valid_ascii() lives in the same file for all
> versions where it exists.  Thoughts?

Yeah, if we're going to back-patch at all, that probably makes sense.

            regards, tom lane



Re: Hide exposed impl detail of wchar.c

From
Nathan Bossart
Date:
On Fri, Jan 26, 2024 at 01:24:19PM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I see that I was planning on back-patching this to v16, but since
>> is_valid_ascii() was introduced in v15, I'm wondering if it'd be better to
>> back-patch it there so that is_valid_ascii() lives in the same file for all
>> versions where it exists.  Thoughts?
> 
> Yeah, if we're going to back-patch at all, that probably makes sense.

Committed/back-patched.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com