Thread: making the backend's json parser work in frontend code

making the backend's json parser work in frontend code

From
Robert Haas
Date:
The discussion on the backup manifest thread has gotten bogged down on
the issue of the format that should be used to store the backup
manifest file. I want something simple and ad-hoc; David Steele and
Stephen Frost prefer JSON. That is problematic because our JSON parser
does not work in frontend code, and I want to be able to validate a
backup against its manifest, which involves being able to parse the
manifest from frontend code. The latest development over there is that
David Steele has posted the JSON parser that he wrote for pgbackrest
with an offer to try to adapt it for use in front-end PostgreSQL code,
an offer which I genuinely appreciate. I'll write more about that over
on that thread. However, I decided to spend today doing some further
investigation of an alternative approach, namely making the backend's
existing JSON parser work in frontend code as well. I did not solve
all the problems there, but I did come up with some patches which I
think would be worth committing on independent grounds, and I think
the whole series is worth posting. So here goes.

0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm
missing something, this seems like an overdue cleanup. It's long been
the case that wchar.c is actually compiled and linked into both
frontend and backend code. Commit
60f11b87a2349985230c08616fa8a34ffde934c8 added code into src/common
that depends on wchar.c being available, but didn't actually make
wchar.c part of src/common, which seems like an odd decision: the
functions in the library are dependent on code that is not part of any
library but whose source files get copied around where needed. Eh?

0002 does some basic header cleanup to make it possible to include the
existing header file jsonapi.h in frontend code. The state of the JSON
headers today looks generally poor. There seems not to have been much
attempt to get the prototypes for a given source file, say foo.c, into
a header file with the same name, say foo.h. Also, dependencies
between various header files seem to have been added somewhat freely.
This patch does not come close to fixing all that, but I consider it a
modest down payment on a cleanup that probably ought to be taken
further.

0003 splits json.c into two files, json.c and jsonapi.c. All the
lexing and parsing stuff (whose prototypes are in jsonapi.h) goes into
jsonapi.c, while the stuff that pertains to the 'json' data type
remains in json.c. This also seems like a good cleanup, because to me,
at least, it's not a great idea to mix together code that is used by
both the json and jsonb data types as well as other things in the
system that want to generate or parse json together with things that
are specific to the 'json' data type.

As far as I know all three of the above patches are committable as-is;
review and contrary opinions welcome.

On the other hand, 0004, 0005, and 0006 are charitably described as
experimental or WIP.  0004 and 0005 hack up jsonapi.c so that it can
still be compiled even if #include "postgres.h" is changed to #include
"postgres-fe.h" and 0006 moves it into src/common. Note that I say
that they make it compile, not work. It's not just untested; it's
definitely broken. But it gives a feeling for what the remaining
obstacles to making this code available in a frontend environment are.
Since I wrote my very first email complaining about the difficulty of
making the backend's JSON parser work in a frontend environment, one
obstacle has been knocked down: StringInfo is now available in
front-end code (commit 26aaf97b683d6258c098859e6b1268e1f5da242f). The
remaining problems (that I know about) have to do with error reporting
and multibyte character support; a read of the patches is suggested
for those wanting further details.
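For concreteness, here is the sort of frontend caller this is all
aiming at. This is a sketch only: it assumes the backend signatures of
the day survive the move unchanged, and it supplies its own all-NULL
JsonSemAction rather than relying on any particular exported one.

#include "postgres_fe.h"
#include "utils/jsonapi.h"

/* Validate a JSON document without building any output. */
static bool
manifest_is_valid_json(char *buf, int len)
{
    /* all-NULL callbacks: the parser just checks the syntax */
    JsonSemAction nullsem = {0};

    /* need_escapes = false: we never look at de-escaped strings */
    JsonLexContext *lex = makeJsonLexContextCstringLen(buf, len, false);

    pg_parse_json(lex, &nullsem);   /* today this throws on bad syntax */
    return true;
}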

Suggestions welcome.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... However, I decided to spend today doing some further
> investigation of an alternative approach, namely making the backend's
> existing JSON parser work in frontend code as well. I did not solve
> all the problems there, but I did come up with some patches which I
> think would be worth committing on independent grounds, and I think
> the whole series is worth posting. So here goes.

In general, if we can possibly get to having one JSON parser in
src/common, that seems like an obviously better place to be than
having two JSON parsers.  So I'm encouraged that it might be
feasible after all.

> 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm
> missing something, this seems like an overdue cleanup.

FWIW, I've been wanting to do that for a while.  I've not studied
your patch, but +1 for the idea.  We might also need to take a
hard look at mbutils.c to see if any of that code can/should move.

> Since I wrote my very first email complaining about the difficulty of
> making the backend's JSON parser work in a frontend environment, one
> obstacle has been knocked down: StringInfo is now available in
> front-end code (commit 26aaf97b683d6258c098859e6b1268e1f5da242f). The
> remaining problems (that I know about) have to do with error reporting
> and multibyte character support; a read of the patches is suggested
> for those wanting further details.

The patch I just posted at <2863.1579127649@sss.pgh.pa.us> probably
affects this in small ways, but not anything major.

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Andres Freund
Date:
Hi,

On 2020-01-15 16:02:49 -0500, Robert Haas wrote:
> The discussion on the backup manifest thread has gotten bogged down on
> the issue of the format that should be used to store the backup
> manifest file. I want something simple and ad-hoc; David Steele and
> Stephen Frost prefer JSON. That is problematic because our JSON parser
> does not work in frontend code, and I want to be able to validate a
> backup against its manifest, which involves being able to parse the
> manifest from frontend code. The latest development over there is that
> David Steele has posted the JSON parser that he wrote for pgbackrest
> with an offer to try to adapt it for use in front-end PostgreSQL code,
> an offer which I genuinely appreciate. I'll write more about that over
> on that thread.

I'm not sure where I come down between using json and a simple ad-hoc
format, when the dependency for the former is making the existing json
parser work in the frontend. But if the alternative is to add a second
json parser, it very clearly shifts towards using an ad-hoc
format. Having to maintain a simple ad-hoc parser is a lot less
technical debt than having a second full-blown json parser. Imo that
holds even when an external project or three also has to have that
simple parser.

If the alternative were to use that newly proposed json parser to
*replace* the backend one too, the story would again be different.


> 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm
> missing something, this seems like an overdue cleanup. It's long been
> the case that wchar.c is actually compiled and linked into both
> frontend and backend code. Commit
> 60f11b87a2349985230c08616fa8a34ffde934c8 added code into src/common
> that depends on wchar.c being available, but didn't actually make
> wchar.c part of src/common, which seems like an odd decision: the
> functions in the library are dependent on code that is not part of any
> library but whose source files get copied around where needed. Eh?

Cool.


> 0002 does some basic header cleanup to make it possible to include the
> existing header file jsonapi.h in frontend code. The state of the JSON
> headers today looks generally poor. There seems not to have been much
> attempt to get the prototypes for a given source file, say foo.c, into
> a header file with the same name, say foo.h. Also, dependencies
> between various header files seem to have been added somewhat freely.
> This patch does not come close to fixing all that, but I consider it a
> modest down payment on a cleanup that probably ought to be taken
> further.

Yea, this seems like a necessary cleanup (or well, maybe the start of
it).


> 0003 splits json.c into two files, json.c and jsonapi.c. All the
> lexing and parsing stuff (whose prototypes are in jsonapi.h) goes into
> jsonapi.c, while the stuff that pertains to the 'json' data type
> remains in json.c. This also seems like a good cleanup, because to me,
> at least, it's not a great idea to mix together code that is used by
> both the json and jsonb data types as well as other things in the
> system that want to generate or parse json together with things that
> are specific to the 'json' data type.

+1


> On the other hand, 0004, 0005, and 0006 are charitably described as
> experimental or WIP.  0004 and 0005 hack up jsonapi.c so that it can
> still be compiled even if #include "postgres.h" is changed to #include
> "postgres-fe.h" and 0006 moves it into src/common. Note that I say
> that they make it compile, not work. It's not just untested; it's
> definitely broken. But it gives a feeling for what the remaining
> obstacles to making this code available in a frontend environment are.
> Since I wrote my very first email complaining about the difficulty of
> making the backend's JSON parser work in a frontend environment, one
> obstacle has been knocked down: StringInfo is now available in
> front-end code (commit 26aaf97b683d6258c098859e6b1268e1f5da242f). The
> remaining problems (that I know about) have to do with error reporting
> and multibyte character support; a read of the patches is suggested
> for those wanting further details.

> From d05e1fc82a51cb583a0367e72b1afc0de561dd00 Mon Sep 17 00:00:00 2001
> From: Robert Haas <rhaas@postgresql.org>
> Date: Wed, 15 Jan 2020 10:36:52 -0500
> Subject: [PATCH 4/6] Introduce json_error() macro.
> 
> ---
>  src/backend/utils/adt/jsonapi.c | 221 +++++++++++++-------------------
>  1 file changed, 90 insertions(+), 131 deletions(-)
> 
> diff --git a/src/backend/utils/adt/jsonapi.c b/src/backend/utils/adt/jsonapi.c
> index fc8af9f861..20f7f0f7ac 100644
> --- a/src/backend/utils/adt/jsonapi.c
> +++ b/src/backend/utils/adt/jsonapi.c
> @@ -17,6 +17,9 @@
>  #include "miscadmin.h"
>  #include "utils/jsonapi.h"
>  
> +#define json_error(rest) \
> +    ereport(ERROR, (rest, report_json_context(lex)))
> +

It's not obvious why the better approach here wouldn't be to just have a
very simple ereport replacement, that needs to be explicitly included
from frontend code. It'd not be meaningfully harder, imo, and it'd
require fewer adaptions, and it'd look more familiar.
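For illustration, a minimal sketch of that kind of substitute
(hypothetical, not from the posted patches): it assumes the parser
only ever reports through errcode() and errmsg(), and it leans on the
fact that ereport()'s second argument is a parenthesized comma
expression, so simply evaluating it prints the message.

#ifdef FRONTEND
#include <stdio.h>
#include <stdlib.h>

#define ERROR 1                   /* the only severity the parser uses */
#define errcode(sqlstate) (0)     /* argument is discarded, unevaluated */
#define errmsg(...) \
    (fprintf(stderr, __VA_ARGS__), fputc('\n', stderr), 0)

/* evaluating "rest" prints the message via errmsg(); then bail out */
#define ereport(elevel, rest) \
    do { (void) (rest); exit(1); } while (0)
#endif

Anything else the parser's reports use (report_json_context() in the
patch's json_error(), say) would need a comparable stub.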



>  /* the null action object used for pure validation */
> @@ -701,7 +735,11 @@ json_lex_string(JsonLexContext *lex)
>                          ch = (ch * 16) + (*s - 'A') + 10;
>                      else
>                      {
> +#ifdef FRONTEND
> +                        lex->token_terminator = s + PQmblen(s, PG_UTF8);
> +#else
>                          lex->token_terminator = s + pg_mblen(s);
> +#endif

If we were to go this way, it seems like the ifdef should rather be in a
helper function, rather than all over. It seems like it should be
unproblematic to have a common interface for both frontend/backend?
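Something like the following, say (a sketch; json_mblen is an invented
name, and hard-wiring PG_UTF8 on the frontend side is itself one of
the open problems):

static int
json_mblen(const char *s)
{
#ifdef FRONTEND
    /* frontend: no database encoding available, so assume UTF-8 */
    return PQmblen(s, PG_UTF8);
#else
    /* backend: length under the current database encoding */
    return pg_mblen(s);
#endif
}

Each call site would then just read
"lex->token_terminator = s + json_mblen(s);".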

Greetings,

Andres Freund



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Wed, Jan 15, 2020 at 6:40 PM Andres Freund <andres@anarazel.de> wrote:
> It's not obvious why the better approach here wouldn't be to just have a
> very simple ereport replacement, that needs to be explicitly included
> from frontend code. It'd not be meaningfully harder, imo, and it'd
> require fewer adaptions, and it'd look more familiar.

I agree that it's far from obvious that the hacks in the patch are
best; to the contrary, they are hacks. That said, I feel that the
semantics of throwing an error are not very well-defined in a
front-end environment. I mean, in a backend context, throwing an error
is going to abort the current transaction, with all that this implies.
If the frontend equivalent is to do nothing and hope for the best, I
doubt it will survive anything more than the simplest use cases. This
is one of the reasons I've been very reluctant to go down this
whole path in the first place.

> > +#ifdef FRONTEND
> > +                                             lex->token_terminator = s + PQmblen(s, PG_UTF8);
> > +#else
> >                                               lex->token_terminator = s + pg_mblen(s);
> > +#endif
>
> If we were to go this way, it seems like the ifdef should rather be in a
> helper function, rather than all over.

Sure... like I said, this is just to illustrate the problem.

> It seems like it should be
> unproblematic to have a common interface for both frontend/backend?

Not sure how. pg_mblen() and PQmblen() are both existing interfaces,
and they're not compatible with each other. I guess we could make
PQmblen() available to backend code, but given that the function name
implies an origin in libpq, that seems wicked confusing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Michael Paquier
Date:
On Wed, Jan 15, 2020 at 09:39:13PM -0500, Robert Haas wrote:
> On Wed, Jan 15, 2020 at 6:40 PM Andres Freund <andres@anarazel.de> wrote:
>> It's not obvious why the better approach here wouldn't be to just have a
>> very simple ereport replacement, that needs to be explicitly included
>> from frontend code. It'd not be meaningfully harder, imo, and it'd
>> require fewer adaptations, and it'd look more familiar.
>
> I agree that it's far from obvious that the hacks in the patch are
> best; to the contrary, they are hacks. That said, I feel that the
> semantics of throwing an error are not very well-defined in a
> front-end environment. I mean, in a backend context, throwing an error
> is going to abort the current transaction, with all that this implies.
> If the frontend equivalent is to do nothing and hope for the best, I
> doubt it will survive anything more than the simplest use cases. This
> is one of the reasons I've been very reluctant to go down this
> whole path in the first place.

The error handling is a well-defined concept in the backend.  If
connected to a database, you know that a session has to roll back any
existing activity, etc.  Clients have to be more flexible, because
error handling depends a lot on how the tool is designed and how it
should react to an error.  So the backend code in charge of logging an
error does the best it can: it throws an error, then lets the caller
decide what to do with it.  I agree with the feeling that having a
simple replacement for ereport() in the frontend would be nice; that
would mean less code churn in parts shared by backend/frontend.

> Not sure how. pg_mblen() and PQmblen() are both existing interfaces,
> and they're not compatible with each other. I guess we could make
> PQmblen() available to backend code, but given that the function name
> implies an origin in libpq, that seems wicked confusing.

Well, the problem here is the encoding part: both routines look at the
same table pg_wchar_table[] in the end, so this needs some thought.
On top of that, on the client side we don't know exactly what kind of
encoding is in use (this led, for example, to several
assumptions/hiccups in the libpq part of the SCRAM implementation,
since SCRAM requires UTF-8 per its RFC).
--
Michael


Re: making the backend's json parser work in frontend code

From
David Steele
Date:
Hi Robert,

On 1/15/20 2:02 PM, Robert Haas wrote:
 > The discussion on the backup manifest thread has gotten bogged down on
 > the issue of the format that should be used to store the backup
 > manifest file. I want something simple and ad-hoc; David Steele and
 > Stephen Frost prefer JSON. That is problematic because our JSON parser
 > does not work in frontend code, and I want to be able to validate a
 > backup against its manifest, which involves being able to parse the
 > manifest from frontend code. The latest development over there is that
 > David Steele has posted the JSON parser that he wrote for pgbackrest
 > with an offer to try to adapt it for use in front-end PostgreSQL code,
 > an offer which I genuinely appreciate. I'll write more about that over
 > on that thread. However, I decided to spend today doing some further
 > investigation of an alternative approach, namely making the backend's
 > existing JSON parser work in frontend code as well. I did not solve
 > all the problems there, but I did come up with some patches which I
 > think would be worth committing on independent grounds, and I think
 > the whole series is worth posting. So here goes.

I was starting to wonder if it wouldn't be simpler to go back to the 
Postgres JSON parser and see if we can adapt it.  I'm not sure that it 
*is* simpler, but it would almost certainly be more acceptable.

 > 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm
 > missing something, this seems like an overdue cleanup. It's long been
 > the case that wchar.c is actually compiled and linked into both
 > frontend and backend code. Commit
 > 60f11b87a2349985230c08616fa8a34ffde934c8 added code into src/common
 > that depends on wchar.c being available, but didn't actually make
 > wchar.c part of src/common, which seems like an odd decision: the
 > functions in the library are dependent on code that is not part of any
 > library but whose source files get copied around where needed. Eh?

This looks like an obvious improvement to me.

 > 0002 does some basic header cleanup to make it possible to include the
 > existing header file jsonapi.h in frontend code. The state of the JSON
 > headers today looks generally poor. There seems not to have been much
 > attempt to get the prototypes for a given source file, say foo.c, into
 > a header file with the same name, say foo.h. Also, dependencies
 > between various header files seem to have been added somewhat freely.
 > This patch does not come close to fixing all that, but I consider it a
 > modest down payment on a cleanup that probably ought to be taken
 > further.

Agreed that these header files are fairly disorganized.  In general the 
names json, jsonapi, jsonfuncs don't tell me a whole lot.  I feel like 
I'd want to include json.h to get a json parser but it only contains one 
utility function before these patches.  I can see that json.c primarily 
contains SQL functions so that's why.

So the idea here is that json.c will have the JSON SQL functions, 
jsonb.c the JSONB SQL functions, jsonapi.c the parser, and 
jsonfuncs.c the utility functions?

 > 0003 splits json.c into two files, json.c and jsonapi.c. All the
 > lexing and parsing stuff (whose prototypes are in jsonapi.h) goes into
 > jsonapi.c, while the stuff that pertains to the 'json' data type
 > remains in json.c. This also seems like a good cleanup, because to me,
 > at least, it's not a great idea to mix together code that is used by
 > both the json and jsonb data types as well as other things in the
 > system that want to generate or parse json together with things that
 > are specific to the 'json' data type.

This seems like a good first step.  I wonder if the remainder of the SQL 
json/jsonb functions should be moved to json.c/jsonb.c respectively?

That does represent a lot of code churn though, so perhaps not worth it.

 > As far as I know all three of the above patches are committable as-is;
 > review and contrary opinions welcome.

Agreed, with some questions as above.

 > On the other hand, 0004, 0005, and 0006 are charitably described as
 > experimental or WIP.  0004 and 0005 hack up jsonapi.c so that it can
 > still be compiled even if #include "postgres.h" is changed to #include
 > "postgres-fe.h" and 0006 moves it into src/common. Note that I say
 > that they make it compile, not work. It's not just untested; it's
 > definitely broken. But it gives a feeling for what the remaining
 > obstacles to making this code available in a frontend environment are.
 > Since I wrote my very first email complaining about the difficulty of
 > making the backend's JSON parser work in a frontend environment, one
 > obstacle has been knocked down: StringInfo is now available in
 > front-end code (commit 26aaf97b683d6258c098859e6b1268e1f5da242f). The
 > remaining problems (that I know about) have to do with error reporting
 > and multibyte character support; a read of the patches is suggested
 > for those wanting further details.

Well, with the caveat that it doesn't work, it's less than I expected.

Obviously ereport() is a pretty big deal and I agree with Michael 
downthread that we should port this to the frontend code.

It would also be nice to unify functions like PQmblen() and pg_mblen() 
if possible.

The next question in my mind is: given the caveat that the error 
handling is questionable in the front end, can we at least render/parse 
valid JSON with the code?

Regards,
-- 
-David
david@pgmasters.net



Re: making the backend's json parser work in frontend code

From
David Steele
Date:
Hi Robert,

On 1/16/20 11:37 AM, David Steele wrote:
> 
> The next question in my mind is: given the caveat that the error 
> handling is questionable in the front end, can we at least render/parse 
> valid JSON with the code?

Hrm, this bit was from an earlier edit.  I meant:

The next question in my mind is: what will it take to get this working 
in a limited form so we can at least prototype it with pg_basebackup?  I 
can hack on this with some static strings in front end code tomorrow to 
see what works and what doesn't, if that makes sense.

Regards,
-- 
-David
david@pgmasters.net



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Thu, Jan 16, 2020 at 1:37 PM David Steele <david@pgmasters.net> wrote:
> I was starting to wonder if it wouldn't be simpler to go back to the
> Postgres JSON parser and see if we can adapt it.  I'm not sure that it
> *is* simpler, but it would almost certainly be more acceptable.

That is my feeling also.

> So the idea here is that json.c will have the JSON SQL functions,
> jsonb.c the JSONB SQL functions, jsonapi.c the parser, and
> jsonfuncs.c the utility functions?

Uh, I think roughly that, yes. Although I can't claim to fully
understand everything that's here.

> This seems like a good first step.  I wonder if the remainder of the SQL
> json/jsonb functions should be moved to json.c/jsonb.c respectively?
>
> That does represent a lot of code churn though, so perhaps not worth it.

I don't have an opinion on this right now.

> Well, with the caveat that it doesn't work, it's less than I expected.
>
> Obviously ereport() is a pretty big deal and I agree with Michael
> downthread that we should port this to the frontend code.

Another possibly-attractive option would be to defer throwing the
error: i.e. set some flags in the lex or parse state or something, and
then just return. The caller notices the flags and has enough
information to throw an error or whatever it wants to do. The reason I
think this might be attractive is that it dodges the whole question of
what exactly throwing an error is supposed to do in a world without
transactions, memory contexts, resource owners, etc. However, it has
some pitfalls of its own, like maybe being too much code churn or
hurting performance in non-error cases.
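As a rough sketch of the shape that could take (all names here are
invented for illustration):

#include <stdio.h>
#include <stdlib.h>

typedef enum JsonParseErr
{
    JSON_PARSE_OK,
    JSON_PARSE_INVALID_TOKEN,
    JSON_PARSE_UNTERMINATED_STRING
    /* ... one value per distinct complaint the parser can make ... */
} JsonParseErr;

typedef struct JsonLexState
{
    JsonParseErr parse_err;     /* set instead of calling ereport() */
    const char *err_location;   /* start of the offending token */
    /* ... existing lexer fields ... */
} JsonLexState;

static void
check_parse_result(JsonLexState *lex)
{
    if (lex->parse_err != JSON_PARSE_OK)
    {
        /*
         * Frontend policy: print and bail.  A backend wrapper could
         * instead turn the code back into an ereport().
         */
        fprintf(stderr, "JSON parse failed near \"%s\"\n",
                lex->err_location);
        exit(1);
    }
}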

> It would also be nice to unify functions like PQmblen() and pg_mblen()
> if possible.

I don't see how to do that at the moment, but I agree that it would be
nice if we can figure it out.

> The next question in my mind is: given the caveat that the error
> handling is questionable in the front end, can we at least render/parse
> valid JSON with the code?

That's a real good question. Thanks for offering to test it; I think
that would be very helpful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
David Steele
Date:
On 1/15/20 4:40 PM, Andres Freund wrote:
 >
 > I'm not sure where I come down between using json and a simple ad-hoc
 > format, when the dependency for the former is making the existing json
 > parser work in the frontend. But if the alternative is to add a second
 > json parser, it very clearly shifts towards using an ad-hoc
 > format. Having to maintain a simple ad-hoc parser is a lot less
 > technical debt than having a second full-blown json parser.

Maybe at first, but it will grow and become more complex as new features 
are added.  This has been our experience with pgBackRest, at least.

 > Imo that holds even when an external project or three also has to 
 > have that simple parser.

I don't agree here.  Especially if we outgrow the format and they need 
two parsers, depending on the version of PostgreSQL.

To do page-level incrementals (which this feature is intended to enable), 
the user will need to be able to associate full and incremental backups, 
and the only way I see to do that (currently) is to read the manifests, 
since the prior backup should be recorded there.  I think this means that 
parsing the manifest is not really optional -- it will be required to do 
any kind of automation with incrementals.

It's easy enough for a tool like pgBackRest to do something like that, 
much harder for a user hacking together a tool in bash based on 
pg_basebackup.

 > If the alternative were to use that newly proposed json parser to
 > *replace* the backend one too, the story would again be different.

That was certainly not my intention.

Regards,
-- 
-David
david@pgmasters.net



Re: making the backend's json parser work in frontend code

From
David Steele
Date:
On 1/15/20 7:39 PM, Robert Haas wrote:
 > On Wed, Jan 15, 2020 at 6:40 PM Andres Freund <andres@anarazel.de> wrote:
 >> It's not obvious why the better approach here wouldn't be to just have a
 >> very simple ereport replacement, that needs to be explicitly included
 >> from frontend code. It'd not be meaningfully harder, imo, and it'd
 >> require fewer adaptions, and it'd look more familiar.
 >
 > I agree that it's far from obvious that the hacks in the patch are
 > best; to the contrary, they are hacks. That said, I feel that the
 > semantics of throwing an error are not very well-defined in a
 > front-end environment. I mean, in a backend context, throwing an error
 > is going to abort the current transaction, with all that this implies.
 > If the frontend equivalent is to do nothing and hope for the best, I
 > doubt it will survive anything more than the simplest use cases. This
 > is one of the reasons I've been very reluctant to go down this
 > whole path in the first place.

The way we handle this in pgBackRest is to put a TRY ... CATCH block in 
main() to log and exit on any uncaught THROW.  That seems like a 
reasonable way to start here.  Without memory contexts that almost 
certainly will mean memory leaks but I'm not sure how much that matters 
if the action is to exit immediately.
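In outline, the pattern is something like this (a bare-bones
setjmp/longjmp sketch; pgBackRest's actual TRY/CATCH macros are more
elaborate than this):

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf error_ctx;
static const char *error_msg;

static void
throw_error(const char *msg)
{
    error_msg = msg;
    longjmp(error_ctx, 1);
}

int
main(int argc, char **argv)
{
    if (setjmp(error_ctx) != 0)
    {
        /* any uncaught throw lands here: log and exit */
        fprintf(stderr, "FATAL: %s\n", error_msg);
        return 1;
    }

    /* ... normal processing; anything below may throw_error() ... */
    return 0;
}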

Regards,
-- 
-David
david@pgmasters.net



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
David Steele <david@pgmasters.net> writes:
> On 1/15/20 7:39 PM, Robert Haas wrote:
>>> I agree that it's far from obvious that the hacks in the patch are
>>> best; to the contrary, they are hacks. That said, I feel that the
>>> semantics of throwing an error are not very well-defined in a
>>> front-end environment. I mean, in a backend context, throwing an error
>>> is going to abort the current transaction, with all that this implies.
>>> If the frontend equivalent is to do nothing and hope for the best, I
>>> doubt it will survive anything more than the simplest use cases. This
>>> is one of the reasons I've been very reluctant to go down this
>>> whole path in the first place.

> The way we handle this in pgBackRest is to put a TRY ... CATCH block in 
> main() to log and exit on any uncaught THROW.  That seems like a 
> reasonable way to start here.  Without memory contexts that almost 
> certainly will mean memory leaks but I'm not sure how much that matters 
> if the action is to exit immediately.

If that's the expectation, we might as well replace backend ereport(ERROR)
with something that just prints a message and does exit(1).

The question comes down to whether there are use-cases where a frontend
application would really want to recover and continue processing after
a JSON syntax problem.  I'm not seeing that that's a near-term
requirement, so maybe we could leave it for somebody to solve when
and if they want to do it.

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Andres Freund
Date:
Hi,

On 2020-01-16 14:20:28 -0500, Tom Lane wrote:
> David Steele <david@pgmasters.net> writes:
> > The way we handle this in pgBackRest is to put a TRY ... CATCH block in
> > main() to log and exit on any uncaught THROW.  That seems like a
> > reasonable way to start here.  Without memory contexts that almost
> > certainly will mean memory leaks but I'm not sure how much that matters
> > if the action is to exit immediately.
>
> If that's the expectation, we might as well replace backend ereport(ERROR)
> with something that just prints a message and does exit(1).

Well, the process might still want to do some cleanup of half-finished
work. You'd not need to be resistant against memory leaks to do so, if
followed by an exit. Obviously you can also do all the necessarily
cleanup from within the ereport(ERROR) itself, but that doesn't seem
appealing to me (not composable, harder to reuse for other programs,
etc).

Greetings,

Andres Freund



Re: making the backend's json parser work in frontend code

From
David Steele
Date:
On 1/16/20 12:26 PM, Andres Freund wrote:
> Hi,
> 
> On 2020-01-16 14:20:28 -0500, Tom Lane wrote:
>> David Steele <david@pgmasters.net> writes:
>>> The way we handle this in pgBackRest is to put a TRY ... CATCH block in
>>> main() to log and exit on any uncaught THROW.  That seems like a
>>> reasonable way to start here.  Without memory contexts that almost
>>> certainly will mean memory leaks but I'm not sure how much that matters
>>> if the action is to exit immediately.
>>
>> If that's the expectation, we might as well replace backend ereport(ERROR)
>> with something that just prints a message and does exit(1).
> 
> Well, the process might still want to do some cleanup of half-finished
> work. You'd not need to be resistant against memory leaks to do so, if
> followed by an exit. Obviously you can also do all the necessary
> cleanup from within the ereport(ERROR) itself, but that doesn't seem
> appealing to me (not composable, harder to reuse for other programs,
> etc).

In pgBackRest we have a default handler that just logs the message to 
stderr and exits (though we consider it a coding error if it gets 
called).  Seems like we could do the same here.  Default message and 
exit if no handler, but optionally allow a handler (which could RETHROW 
to get to the default handler afterwards).

It seems like we've been wanting a front end version of ereport() for a 
while so I'll take a look at that and see what it involves.

Regards,
-- 
-David
david@pgmasters.net



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm
> missing something, this seems like an overdue cleanup.

Here's a reviewed version of 0001.  You missed fixing the MSVC build,
and there were assorted comments and other things referencing wchar.c
that needed to be cleaned up.

Also, it seemed to me that if we are going to move wchar.c, we should
also move encnames.c, so that libpq can get fully out of the
symlinking-source-files business.  It makes initdb less weird too.

I took the liberty of sticking proper copyright headers onto these
two files, too.  (This makes the diff a lot more bulky :-(.  Would
it help to add the headers in a separate commit?)

Another thing I'm wondering about is if any of the #ifndef FRONTEND
code should get moved *back* to src/backend/utils/mb.  But that
could be a separate commit, too.

Lastly, it strikes me that maybe pg_wchar.h, or parts of it, should
migrate over to src/include/common.  But that'd be far more invasive
to other source files, so I've not touched the issue here.

            regards, tom lane

diff --git a/src/backend/utils/mb/Makefile b/src/backend/utils/mb/Makefile
index cd4a016..b19a125 100644
--- a/src/backend/utils/mb/Makefile
+++ b/src/backend/utils/mb/Makefile
@@ -14,10 +14,8 @@ include $(top_builddir)/src/Makefile.global

 OBJS = \
     conv.o \
-    encnames.o \
     mbutils.o \
     stringinfo_mb.o \
-    wchar.o \
     wstrcmp.o \
     wstrncmp.o

diff --git a/src/backend/utils/mb/README b/src/backend/utils/mb/README
index 7495ca5..ef36626 100644
--- a/src/backend/utils/mb/README
+++ b/src/backend/utils/mb/README
@@ -3,12 +3,8 @@ src/backend/utils/mb/README
 Encodings
 =========

-encnames.c:    public functions for both the backend and the frontend.
 conv.c:        static functions and a public table for code conversion
-wchar.c:    mostly static functions and a public table for mb string and
-        multibyte conversion
 mbutils.c:    public functions for the backend only.
-        requires conv.c and wchar.c
 stringinfo_mb.c: public backend-only multibyte-aware stringinfo functions
 wstrcmp.c:    strcmp for mb
 wstrncmp.c:    strncmp for mb
@@ -16,6 +12,12 @@ win866.c:    a tool to generate KOI8 <--> CP866 conversion table
 iso.c:        a tool to generate KOI8 <--> ISO8859-5 conversion table
 win1251.c:    a tool to generate KOI8 <--> CP1251 conversion table

+See also in src/common/:
+
+encnames.c:    public functions for encoding names
+wchar.c:    mostly static functions and a public table for mb string and
+        multibyte conversion
+
 Introduction
 ------------
     http://www.cprogramming.com/tutorial/unicode.html
diff --git a/src/backend/utils/mb/encnames.c b/src/backend/utils/mb/encnames.c
deleted file mode 100644
index 12b61cd..0000000
--- a/src/backend/utils/mb/encnames.c
+++ /dev/null
@@ -1,629 +0,0 @@
-/*
- * Encoding names and routines for work with it. All
- * in this file is shared between FE and BE.
- *
- * src/backend/utils/mb/encnames.c
- */
-#ifdef FRONTEND
-#include "postgres_fe.h"
-#else
-#include "postgres.h"
-#include "utils/builtins.h"
-#endif
-
-#include <ctype.h>
-#include <unistd.h>
-
-#include "mb/pg_wchar.h"
-
-
-/* ----------
- * All encoding names, sorted:         *** A L P H A B E T I C ***
- *
- * All names must be without irrelevant chars, search routines use
- * isalnum() chars only. It means ISO-8859-1, iso_8859-1 and Iso8859_1
- * are always converted to 'iso88591'. All must be lower case.
- *
- * The table doesn't contain 'cs' aliases (like csISOLatin1). It's needed?
- *
- * Karel Zak, Aug 2001
- * ----------
- */
-typedef struct pg_encname
-{
-    const char *name;
-    pg_enc        encoding;
-} pg_encname;
-
-static const pg_encname pg_encname_tbl[] =
-{
-    {
-        "abc", PG_WIN1258
-    },                            /* alias for WIN1258 */
-    {
-        "alt", PG_WIN866
-    },                            /* IBM866 */
-    {
-        "big5", PG_BIG5
-    },                            /* Big5; Chinese for Taiwan multibyte set */
-    {
-        "euccn", PG_EUC_CN
-    },                            /* EUC-CN; Extended Unix Code for simplified
-                                 * Chinese */
-    {
-        "eucjis2004", PG_EUC_JIS_2004
-    },                            /* EUC-JIS-2004; Extended UNIX Code fixed
-                                 * Width for Japanese, standard JIS X 0213 */
-    {
-        "eucjp", PG_EUC_JP
-    },                            /* EUC-JP; Extended UNIX Code fixed Width for
-                                 * Japanese, standard OSF */
-    {
-        "euckr", PG_EUC_KR
-    },                            /* EUC-KR; Extended Unix Code for Korean , KS
-                                 * X 1001 standard */
-    {
-        "euctw", PG_EUC_TW
-    },                            /* EUC-TW; Extended Unix Code for
-                                 *
-                                 * traditional Chinese */
-    {
-        "gb18030", PG_GB18030
-    },                            /* GB18030;GB18030 */
-    {
-        "gbk", PG_GBK
-    },                            /* GBK; Chinese Windows CodePage 936
-                                 * simplified Chinese */
-    {
-        "iso88591", PG_LATIN1
-    },                            /* ISO-8859-1; RFC1345,KXS2 */
-    {
-        "iso885910", PG_LATIN6
-    },                            /* ISO-8859-10; RFC1345,KXS2 */
-    {
-        "iso885913", PG_LATIN7
-    },                            /* ISO-8859-13; RFC1345,KXS2 */
-    {
-        "iso885914", PG_LATIN8
-    },                            /* ISO-8859-14; RFC1345,KXS2 */
-    {
-        "iso885915", PG_LATIN9
-    },                            /* ISO-8859-15; RFC1345,KXS2 */
-    {
-        "iso885916", PG_LATIN10
-    },                            /* ISO-8859-16; RFC1345,KXS2 */
-    {
-        "iso88592", PG_LATIN2
-    },                            /* ISO-8859-2; RFC1345,KXS2 */
-    {
-        "iso88593", PG_LATIN3
-    },                            /* ISO-8859-3; RFC1345,KXS2 */
-    {
-        "iso88594", PG_LATIN4
-    },                            /* ISO-8859-4; RFC1345,KXS2 */
-    {
-        "iso88595", PG_ISO_8859_5
-    },                            /* ISO-8859-5; RFC1345,KXS2 */
-    {
-        "iso88596", PG_ISO_8859_6
-    },                            /* ISO-8859-6; RFC1345,KXS2 */
-    {
-        "iso88597", PG_ISO_8859_7
-    },                            /* ISO-8859-7; RFC1345,KXS2 */
-    {
-        "iso88598", PG_ISO_8859_8
-    },                            /* ISO-8859-8; RFC1345,KXS2 */
-    {
-        "iso88599", PG_LATIN5
-    },                            /* ISO-8859-9; RFC1345,KXS2 */
-    {
-        "johab", PG_JOHAB
-    },                            /* JOHAB; Extended Unix Code for simplified
-                                 * Chinese */
-    {
-        "koi8", PG_KOI8R
-    },                            /* _dirty_ alias for KOI8-R (backward
-                                 * compatibility) */
-    {
-        "koi8r", PG_KOI8R
-    },                            /* KOI8-R; RFC1489 */
-    {
-        "koi8u", PG_KOI8U
-    },                            /* KOI8-U; RFC2319 */
-    {
-        "latin1", PG_LATIN1
-    },                            /* alias for ISO-8859-1 */
-    {
-        "latin10", PG_LATIN10
-    },                            /* alias for ISO-8859-16 */
-    {
-        "latin2", PG_LATIN2
-    },                            /* alias for ISO-8859-2 */
-    {
-        "latin3", PG_LATIN3
-    },                            /* alias for ISO-8859-3 */
-    {
-        "latin4", PG_LATIN4
-    },                            /* alias for ISO-8859-4 */
-    {
-        "latin5", PG_LATIN5
-    },                            /* alias for ISO-8859-9 */
-    {
-        "latin6", PG_LATIN6
-    },                            /* alias for ISO-8859-10 */
-    {
-        "latin7", PG_LATIN7
-    },                            /* alias for ISO-8859-13 */
-    {
-        "latin8", PG_LATIN8
-    },                            /* alias for ISO-8859-14 */
-    {
-        "latin9", PG_LATIN9
-    },                            /* alias for ISO-8859-15 */
-    {
-        "mskanji", PG_SJIS
-    },                            /* alias for Shift_JIS */
-    {
-        "muleinternal", PG_MULE_INTERNAL
-    },
-    {
-        "shiftjis", PG_SJIS
-    },                            /* Shift_JIS; JIS X 0202-1991 */
-
-    {
-        "shiftjis2004", PG_SHIFT_JIS_2004
-    },                            /* SHIFT-JIS-2004; Shift JIS for Japanese,
-                                 * standard JIS X 0213 */
-    {
-        "sjis", PG_SJIS
-    },                            /* alias for Shift_JIS */
-    {
-        "sqlascii", PG_SQL_ASCII
-    },
-    {
-        "tcvn", PG_WIN1258
-    },                            /* alias for WIN1258 */
-    {
-        "tcvn5712", PG_WIN1258
-    },                            /* alias for WIN1258 */
-    {
-        "uhc", PG_UHC
-    },                            /* UHC; Korean Windows CodePage 949 */
-    {
-        "unicode", PG_UTF8
-    },                            /* alias for UTF8 */
-    {
-        "utf8", PG_UTF8
-    },                            /* alias for UTF8 */
-    {
-        "vscii", PG_WIN1258
-    },                            /* alias for WIN1258 */
-    {
-        "win", PG_WIN1251
-    },                            /* _dirty_ alias for windows-1251 (backward
-                                 * compatibility) */
-    {
-        "win1250", PG_WIN1250
-    },                            /* alias for Windows-1250 */
-    {
-        "win1251", PG_WIN1251
-    },                            /* alias for Windows-1251 */
-    {
-        "win1252", PG_WIN1252
-    },                            /* alias for Windows-1252 */
-    {
-        "win1253", PG_WIN1253
-    },                            /* alias for Windows-1253 */
-    {
-        "win1254", PG_WIN1254
-    },                            /* alias for Windows-1254 */
-    {
-        "win1255", PG_WIN1255
-    },                            /* alias for Windows-1255 */
-    {
-        "win1256", PG_WIN1256
-    },                            /* alias for Windows-1256 */
-    {
-        "win1257", PG_WIN1257
-    },                            /* alias for Windows-1257 */
-    {
-        "win1258", PG_WIN1258
-    },                            /* alias for Windows-1258 */
-    {
-        "win866", PG_WIN866
-    },                            /* IBM866 */
-    {
-        "win874", PG_WIN874
-    },                            /* alias for Windows-874 */
-    {
-        "win932", PG_SJIS
-    },                            /* alias for Shift_JIS */
-    {
-        "win936", PG_GBK
-    },                            /* alias for GBK */
-    {
-        "win949", PG_UHC
-    },                            /* alias for UHC */
-    {
-        "win950", PG_BIG5
-    },                            /* alias for BIG5 */
-    {
-        "windows1250", PG_WIN1250
-    },                            /* Windows-1251; Microsoft */
-    {
-        "windows1251", PG_WIN1251
-    },                            /* Windows-1251; Microsoft */
-    {
-        "windows1252", PG_WIN1252
-    },                            /* Windows-1252; Microsoft */
-    {
-        "windows1253", PG_WIN1253
-    },                            /* Windows-1253; Microsoft */
-    {
-        "windows1254", PG_WIN1254
-    },                            /* Windows-1254; Microsoft */
-    {
-        "windows1255", PG_WIN1255
-    },                            /* Windows-1255; Microsoft */
-    {
-        "windows1256", PG_WIN1256
-    },                            /* Windows-1256; Microsoft */
-    {
-        "windows1257", PG_WIN1257
-    },                            /* Windows-1257; Microsoft */
-    {
-        "windows1258", PG_WIN1258
-    },                            /* Windows-1258; Microsoft */
-    {
-        "windows866", PG_WIN866
-    },                            /* IBM866 */
-    {
-        "windows874", PG_WIN874
-    },                            /* Windows-874; Microsoft */
-    {
-        "windows932", PG_SJIS
-    },                            /* alias for Shift_JIS */
-    {
-        "windows936", PG_GBK
-    },                            /* alias for GBK */
-    {
-        "windows949", PG_UHC
-    },                            /* alias for UHC */
-    {
-        "windows950", PG_BIG5
-    }                            /* alias for BIG5 */
-};
-
-/* ----------
- * These are "official" encoding names.
- * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
- * ----------
- */
-#ifndef WIN32
-#define DEF_ENC2NAME(name, codepage) { #name, PG_##name }
-#else
-#define DEF_ENC2NAME(name, codepage) { #name, PG_##name, codepage }
-#endif
-const pg_enc2name pg_enc2name_tbl[] =
-{
-    DEF_ENC2NAME(SQL_ASCII, 0),
-    DEF_ENC2NAME(EUC_JP, 20932),
-    DEF_ENC2NAME(EUC_CN, 20936),
-    DEF_ENC2NAME(EUC_KR, 51949),
-    DEF_ENC2NAME(EUC_TW, 0),
-    DEF_ENC2NAME(EUC_JIS_2004, 20932),
-    DEF_ENC2NAME(UTF8, 65001),
-    DEF_ENC2NAME(MULE_INTERNAL, 0),
-    DEF_ENC2NAME(LATIN1, 28591),
-    DEF_ENC2NAME(LATIN2, 28592),
-    DEF_ENC2NAME(LATIN3, 28593),
-    DEF_ENC2NAME(LATIN4, 28594),
-    DEF_ENC2NAME(LATIN5, 28599),
-    DEF_ENC2NAME(LATIN6, 0),
-    DEF_ENC2NAME(LATIN7, 0),
-    DEF_ENC2NAME(LATIN8, 0),
-    DEF_ENC2NAME(LATIN9, 28605),
-    DEF_ENC2NAME(LATIN10, 0),
-    DEF_ENC2NAME(WIN1256, 1256),
-    DEF_ENC2NAME(WIN1258, 1258),
-    DEF_ENC2NAME(WIN866, 866),
-    DEF_ENC2NAME(WIN874, 874),
-    DEF_ENC2NAME(KOI8R, 20866),
-    DEF_ENC2NAME(WIN1251, 1251),
-    DEF_ENC2NAME(WIN1252, 1252),
-    DEF_ENC2NAME(ISO_8859_5, 28595),
-    DEF_ENC2NAME(ISO_8859_6, 28596),
-    DEF_ENC2NAME(ISO_8859_7, 28597),
-    DEF_ENC2NAME(ISO_8859_8, 28598),
-    DEF_ENC2NAME(WIN1250, 1250),
-    DEF_ENC2NAME(WIN1253, 1253),
-    DEF_ENC2NAME(WIN1254, 1254),
-    DEF_ENC2NAME(WIN1255, 1255),
-    DEF_ENC2NAME(WIN1257, 1257),
-    DEF_ENC2NAME(KOI8U, 21866),
-    DEF_ENC2NAME(SJIS, 932),
-    DEF_ENC2NAME(BIG5, 950),
-    DEF_ENC2NAME(GBK, 936),
-    DEF_ENC2NAME(UHC, 949),
-    DEF_ENC2NAME(GB18030, 54936),
-    DEF_ENC2NAME(JOHAB, 0),
-    DEF_ENC2NAME(SHIFT_JIS_2004, 932)
-};
-
-/* ----------
- * These are encoding names for gettext.
- *
- * This covers all encodings except MULE_INTERNAL, which is alien to gettext.
- * ----------
- */
-const pg_enc2gettext pg_enc2gettext_tbl[] =
-{
-    {PG_SQL_ASCII, "US-ASCII"},
-    {PG_UTF8, "UTF-8"},
-    {PG_LATIN1, "LATIN1"},
-    {PG_LATIN2, "LATIN2"},
-    {PG_LATIN3, "LATIN3"},
-    {PG_LATIN4, "LATIN4"},
-    {PG_ISO_8859_5, "ISO-8859-5"},
-    {PG_ISO_8859_6, "ISO_8859-6"},
-    {PG_ISO_8859_7, "ISO-8859-7"},
-    {PG_ISO_8859_8, "ISO-8859-8"},
-    {PG_LATIN5, "LATIN5"},
-    {PG_LATIN6, "LATIN6"},
-    {PG_LATIN7, "LATIN7"},
-    {PG_LATIN8, "LATIN8"},
-    {PG_LATIN9, "LATIN-9"},
-    {PG_LATIN10, "LATIN10"},
-    {PG_KOI8R, "KOI8-R"},
-    {PG_KOI8U, "KOI8-U"},
-    {PG_WIN1250, "CP1250"},
-    {PG_WIN1251, "CP1251"},
-    {PG_WIN1252, "CP1252"},
-    {PG_WIN1253, "CP1253"},
-    {PG_WIN1254, "CP1254"},
-    {PG_WIN1255, "CP1255"},
-    {PG_WIN1256, "CP1256"},
-    {PG_WIN1257, "CP1257"},
-    {PG_WIN1258, "CP1258"},
-    {PG_WIN866, "CP866"},
-    {PG_WIN874, "CP874"},
-    {PG_EUC_CN, "EUC-CN"},
-    {PG_EUC_JP, "EUC-JP"},
-    {PG_EUC_KR, "EUC-KR"},
-    {PG_EUC_TW, "EUC-TW"},
-    {PG_EUC_JIS_2004, "EUC-JP"},
-    {PG_SJIS, "SHIFT-JIS"},
-    {PG_BIG5, "BIG5"},
-    {PG_GBK, "GBK"},
-    {PG_UHC, "UHC"},
-    {PG_GB18030, "GB18030"},
-    {PG_JOHAB, "JOHAB"},
-    {PG_SHIFT_JIS_2004, "SHIFT_JISX0213"},
-    {0, NULL}
-};
-
-
-#ifndef FRONTEND
-
-/*
- * Table of encoding names for ICU
- *
- * Reference: <https://ssl.icu-project.org/icu-bin/convexp>
- *
- * NULL entries are not supported by ICU, or their mapping is unclear.
- */
-static const char *const pg_enc2icu_tbl[] =
-{
-    NULL,                        /* PG_SQL_ASCII */
-    "EUC-JP",                    /* PG_EUC_JP */
-    "EUC-CN",                    /* PG_EUC_CN */
-    "EUC-KR",                    /* PG_EUC_KR */
-    "EUC-TW",                    /* PG_EUC_TW */
-    NULL,                        /* PG_EUC_JIS_2004 */
-    "UTF-8",                    /* PG_UTF8 */
-    NULL,                        /* PG_MULE_INTERNAL */
-    "ISO-8859-1",                /* PG_LATIN1 */
-    "ISO-8859-2",                /* PG_LATIN2 */
-    "ISO-8859-3",                /* PG_LATIN3 */
-    "ISO-8859-4",                /* PG_LATIN4 */
-    "ISO-8859-9",                /* PG_LATIN5 */
-    "ISO-8859-10",                /* PG_LATIN6 */
-    "ISO-8859-13",                /* PG_LATIN7 */
-    "ISO-8859-14",                /* PG_LATIN8 */
-    "ISO-8859-15",                /* PG_LATIN9 */
-    NULL,                        /* PG_LATIN10 */
-    "CP1256",                    /* PG_WIN1256 */
-    "CP1258",                    /* PG_WIN1258 */
-    "CP866",                    /* PG_WIN866 */
-    NULL,                        /* PG_WIN874 */
-    "KOI8-R",                    /* PG_KOI8R */
-    "CP1251",                    /* PG_WIN1251 */
-    "CP1252",                    /* PG_WIN1252 */
-    "ISO-8859-5",                /* PG_ISO_8859_5 */
-    "ISO-8859-6",                /* PG_ISO_8859_6 */
-    "ISO-8859-7",                /* PG_ISO_8859_7 */
-    "ISO-8859-8",                /* PG_ISO_8859_8 */
-    "CP1250",                    /* PG_WIN1250 */
-    "CP1253",                    /* PG_WIN1253 */
-    "CP1254",                    /* PG_WIN1254 */
-    "CP1255",                    /* PG_WIN1255 */
-    "CP1257",                    /* PG_WIN1257 */
-    "KOI8-U",                    /* PG_KOI8U */
-};
-
-bool
-is_encoding_supported_by_icu(int encoding)
-{
-    return (pg_enc2icu_tbl[encoding] != NULL);
-}
-
-const char *
-get_encoding_name_for_icu(int encoding)
-{
-    const char *icu_encoding_name;
-
-    StaticAssertStmt(lengthof(pg_enc2icu_tbl) == PG_ENCODING_BE_LAST + 1,
-                     "pg_enc2icu_tbl incomplete");
-
-    icu_encoding_name = pg_enc2icu_tbl[encoding];
-
-    if (!icu_encoding_name)
-        ereport(ERROR,
-                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-                 errmsg("encoding \"%s\" not supported by ICU",
-                        pg_encoding_to_char(encoding))));
-
-    return icu_encoding_name;
-}
-
-#endif                            /* not FRONTEND */
-
-
-/* ----------
- * Encoding checks, for error returns -1 else encoding id
- * ----------
- */
-int
-pg_valid_client_encoding(const char *name)
-{
-    int            enc;
-
-    if ((enc = pg_char_to_encoding(name)) < 0)
-        return -1;
-
-    if (!PG_VALID_FE_ENCODING(enc))
-        return -1;
-
-    return enc;
-}
-
-int
-pg_valid_server_encoding(const char *name)
-{
-    int            enc;
-
-    if ((enc = pg_char_to_encoding(name)) < 0)
-        return -1;
-
-    if (!PG_VALID_BE_ENCODING(enc))
-        return -1;
-
-    return enc;
-}
-
-int
-pg_valid_server_encoding_id(int encoding)
-{
-    return PG_VALID_BE_ENCODING(encoding);
-}
-
-/* ----------
- * Remove irrelevant chars from encoding name
- * ----------
- */
-static char *
-clean_encoding_name(const char *key, char *newkey)
-{
-    const char *p;
-    char       *np;
-
-    for (p = key, np = newkey; *p != '\0'; p++)
-    {
-        if (isalnum((unsigned char) *p))
-        {
-            if (*p >= 'A' && *p <= 'Z')
-                *np++ = *p + 'a' - 'A';
-            else
-                *np++ = *p;
-        }
-    }
-    *np = '\0';
-    return newkey;
-}
-
-/* ----------
- * Search encoding by encoding name
- *
- * Returns encoding ID, or -1 for error
- * ----------
- */
-int
-pg_char_to_encoding(const char *name)
-{
-    unsigned int nel = lengthof(pg_encname_tbl);
-    const pg_encname *base = pg_encname_tbl,
-               *last = base + nel - 1,
-               *position;
-    int            result;
-    char        buff[NAMEDATALEN],
-               *key;
-
-    if (name == NULL || *name == '\0')
-        return -1;
-
-    if (strlen(name) >= NAMEDATALEN)
-    {
-#ifdef FRONTEND
-        fprintf(stderr, "encoding name too long\n");
-        return -1;
-#else
-        ereport(ERROR,
-                (errcode(ERRCODE_NAME_TOO_LONG),
-                 errmsg("encoding name too long")));
-#endif
-    }
-    key = clean_encoding_name(name, buff);
-
-    while (last >= base)
-    {
-        position = base + ((last - base) >> 1);
-        result = key[0] - position->name[0];
-
-        if (result == 0)
-        {
-            result = strcmp(key, position->name);
-            if (result == 0)
-                return position->encoding;
-        }
-        if (result < 0)
-            last = position - 1;
-        else
-            base = position + 1;
-    }
-    return -1;
-}
-
-#ifndef FRONTEND
-Datum
-PG_char_to_encoding(PG_FUNCTION_ARGS)
-{
-    Name        s = PG_GETARG_NAME(0);
-
-    PG_RETURN_INT32(pg_char_to_encoding(NameStr(*s)));
-}
-#endif
-
-const char *
-pg_encoding_to_char(int encoding)
-{
-    if (PG_VALID_ENCODING(encoding))
-    {
-        const pg_enc2name *p = &pg_enc2name_tbl[encoding];
-
-        Assert(encoding == p->encoding);
-        return p->name;
-    }
-    return "";
-}
-
-#ifndef FRONTEND
-Datum
-PG_encoding_to_char(PG_FUNCTION_ARGS)
-{
-    int32        encoding = PG_GETARG_INT32(0);
-    const char *encoding_name = pg_encoding_to_char(encoding);
-
-    return DirectFunctionCall1(namein, CStringGetDatum(encoding_name));
-}
-
-#endif
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
deleted file mode 100644
index 02e2588..0000000
--- a/src/backend/utils/mb/wchar.c
+++ /dev/null
@@ -1,2036 +0,0 @@
-/*
- * conversion functions between pg_wchar and multibyte streams.
- * Tatsuo Ishii
- * src/backend/utils/mb/wchar.c
- *
- */
-/* can be used in either frontend or backend */
-#ifdef FRONTEND
-#include "postgres_fe.h"
-#else
-#include "postgres.h"
-#endif
-
-#include "mb/pg_wchar.h"
-
-
-/*
- * Operations on multi-byte encodings are driven by a table of helper
- * functions.
- *
- * To add an encoding support, define mblen(), dsplen() and verifier() for
- * the encoding.  For server-encodings, also define mb2wchar() and wchar2mb()
- * conversion functions.
- *
- * These functions generally assume that their input is validly formed.
- * The "verifier" functions, further down in the file, have to be more
- * paranoid.
- *
- * We expect that mblen() does not need to examine more than the first byte
- * of the character to discover the correct length.  GB18030 is an exception
- * to that rule, though, as it also looks at second byte.  But even that
- * behaves in a predictable way, if you only pass the first byte: it will
- * treat 4-byte encoded characters as two 2-byte encoded characters, which is
- * good enough for all current uses.
- *
- * Note: for the display output of psql to work properly, the return values
- * of the dsplen functions must conform to the Unicode standard. In particular
- * the NUL character is zero width and control characters are generally
- * width -1. It is recommended that non-ASCII encodings refer their ASCII
- * subset to the ASCII routines to ensure consistency.
- */
-
-/*
- * SQL/ASCII
- */
-static int
-pg_ascii2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
-{
-    int            cnt = 0;
-
-    while (len > 0 && *from)
-    {
-        *to++ = *from++;
-        len--;
-        cnt++;
-    }
-    *to = 0;
-    return cnt;
-}
-
-static int
-pg_ascii_mblen(const unsigned char *s)
-{
-    return 1;
-}
-
-static int
-pg_ascii_dsplen(const unsigned char *s)
-{
-    if (*s == '\0')
-        return 0;
-    if (*s < 0x20 || *s == 0x7f)
-        return -1;
-
-    return 1;
-}
-
-/*
- * EUC
- */
-static int
-pg_euc2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
-{
-    int            cnt = 0;
-
-    while (len > 0 && *from)
-    {
-        if (*from == SS2 && len >= 2)    /* JIS X 0201 (so called "1 byte
-                                         * KANA") */
-        {
-            from++;
-            *to = (SS2 << 8) | *from++;
-            len -= 2;
-        }
-        else if (*from == SS3 && len >= 3)    /* JIS X 0212 KANJI */
-        {
-            from++;
-            *to = (SS3 << 16) | (*from++ << 8);
-            *to |= *from++;
-            len -= 3;
-        }
-        else if (IS_HIGHBIT_SET(*from) && len >= 2) /* JIS X 0208 KANJI */
-        {
-            *to = *from++ << 8;
-            *to |= *from++;
-            len -= 2;
-        }
-        else                    /* must be ASCII */
-        {
-            *to = *from++;
-            len--;
-        }
-        to++;
-        cnt++;
-    }
-    *to = 0;
-    return cnt;
-}
-
-static inline int
-pg_euc_mblen(const unsigned char *s)
-{
-    int            len;
-
-    if (*s == SS2)
-        len = 2;
-    else if (*s == SS3)
-        len = 3;
-    else if (IS_HIGHBIT_SET(*s))
-        len = 2;
-    else
-        len = 1;
-    return len;
-}
-
-static inline int
-pg_euc_dsplen(const unsigned char *s)
-{
-    int            len;
-
-    if (*s == SS2)
-        len = 2;
-    else if (*s == SS3)
-        len = 2;
-    else if (IS_HIGHBIT_SET(*s))
-        len = 2;
-    else
-        len = pg_ascii_dsplen(s);
-    return len;
-}
-
-/*
- * EUC_JP
- */
-static int
-pg_eucjp2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
-{
-    return pg_euc2wchar_with_len(from, to, len);
-}
-
-static int
-pg_eucjp_mblen(const unsigned char *s)
-{
-    return pg_euc_mblen(s);
-}
-
-static int
-pg_eucjp_dsplen(const unsigned char *s)
-{
-    int            len;
-
-    if (*s == SS2)
-        len = 1;
-    else if (*s == SS3)
-        len = 2;
-    else if (IS_HIGHBIT_SET(*s))
-        len = 2;
-    else
-        len = pg_ascii_dsplen(s);
-    return len;
-}
-
-/*
- * EUC_KR
- */
-static int
-pg_euckr2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
-{
-    return pg_euc2wchar_with_len(from, to, len);
-}
-
-static int
-pg_euckr_mblen(const unsigned char *s)
-{
-    return pg_euc_mblen(s);
-}
-
-static int
-pg_euckr_dsplen(const unsigned char *s)
-{
-    return pg_euc_dsplen(s);
-}
-
-/*
- * EUC_CN
- *
- */
-static int
-pg_euccn2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
-{
-    int            cnt = 0;
-
-    while (len > 0 && *from)
-    {
-        if (*from == SS2 && len >= 3)    /* code set 2 (unused?) */
-        {
-            from++;
-            *to = (SS2 << 16) | (*from++ << 8);
-            *to |= *from++;
-            len -= 3;
-        }
-        else if (*from == SS3 && len >= 3)    /* code set 3 (unused ?) */
-        {
-            from++;
-            *to = (SS3 << 16) | (*from++ << 8);
-            *to |= *from++;
-            len -= 3;
-        }
-        else if (IS_HIGHBIT_SET(*from) && len >= 2) /* code set 1 */
-        {
-            *to = *from++ << 8;
-            *to |= *from++;
-            len -= 2;
-        }
-        else
-        {
-            *to = *from++;
-            len--;
-        }
-        to++;
-        cnt++;
-    }
-    *to = 0;
-    return cnt;
-}
-
-static int
-pg_euccn_mblen(const unsigned char *s)
-{
-    int            len;
-
-    if (IS_HIGHBIT_SET(*s))
-        len = 2;
-    else
-        len = 1;
-    return len;
-}
-
-static int
-pg_euccn_dsplen(const unsigned char *s)
-{
-    int            len;
-
-    if (IS_HIGHBIT_SET(*s))
-        len = 2;
-    else
-        len = pg_ascii_dsplen(s);
-    return len;
-}
-
-/*
- * EUC_TW
- *
- */
-static int
-pg_euctw2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
-{
-    int            cnt = 0;
-
-    while (len > 0 && *from)
-    {
-        if (*from == SS2 && len >= 4)    /* code set 2 */
-        {
-            from++;
-            *to = (((uint32) SS2) << 24) | (*from++ << 16);
-            *to |= *from++ << 8;
-            *to |= *from++;
-            len -= 4;
-        }
-        else if (*from == SS3 && len >= 3)    /* code set 3 (unused?) */
-        {
-            from++;
-            *to = (SS3 << 16) | (*from++ << 8);
-            *to |= *from++;
-            len -= 3;
-        }
-        else if (IS_HIGHBIT_SET(*from) && len >= 2) /* code set 2 */
-        {
-            *to = *from++ << 8;
-            *to |= *from++;
-            len -= 2;
-        }
-        else
-        {
-            *to = *from++;
-            len--;
-        }
-        to++;
-        cnt++;
-    }
-    *to = 0;
-    return cnt;
-}
-
-static int
-pg_euctw_mblen(const unsigned char *s)
-{
-    int            len;
-
-    if (*s == SS2)
-        len = 4;
-    else if (*s == SS3)
-        len = 3;
-    else if (IS_HIGHBIT_SET(*s))
-        len = 2;
-    else
-        len = 1;
-    return len;
-}
-
-static int
-pg_euctw_dsplen(const unsigned char *s)
-{
-    int            len;
-
-    if (*s == SS2)
-        len = 2;
-    else if (*s == SS3)
-        len = 2;
-    else if (IS_HIGHBIT_SET(*s))
-        len = 2;
-    else
-        len = pg_ascii_dsplen(s);
-    return len;
-}
-
-/*
- * Convert pg_wchar to EUC_* encoding.
- * caller must allocate enough space for "to", including a trailing zero!
- * len: length of from.
- * "from" not necessarily null terminated.
- */
-static int
-pg_wchar2euc_with_len(const pg_wchar *from, unsigned char *to, int len)
-{
-    int            cnt = 0;
-
-    while (len > 0 && *from)
-    {
-        unsigned char c;
-
-        if ((c = (*from >> 24)))
-        {
-            *to++ = c;
-            *to++ = (*from >> 16) & 0xff;
-            *to++ = (*from >> 8) & 0xff;
-            *to++ = *from & 0xff;
-            cnt += 4;
-        }
-        else if ((c = (*from >> 16)))
-        {
-            *to++ = c;
-            *to++ = (*from >> 8) & 0xff;
-            *to++ = *from & 0xff;
-            cnt += 3;
-        }
-        else if ((c = (*from >> 8)))
-        {
-            *to++ = c;
-            *to++ = *from & 0xff;
-            cnt += 2;
-        }
-        else
-        {
-            *to++ = *from;
-            cnt++;
-        }
-        from++;
-        len--;
-    }
-    *to = 0;
-    return cnt;
-}
-
-
-/*
- * JOHAB
- */
-static int
-pg_johab_mblen(const unsigned char *s)
-{
-    return pg_euc_mblen(s);
-}
-
-static int
-pg_johab_dsplen(const unsigned char *s)
-{
-    return pg_euc_dsplen(s);
-}
-
-/*
- * convert UTF8 string to pg_wchar (UCS-4)
- * caller must allocate enough space for "to", including a trailing zero!
- * len: length of from.
- * "from" not necessarily null terminated.
- */
-static int
-pg_utf2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
-{
-    int            cnt = 0;
-    uint32        c1,
-                c2,
-                c3,
-                c4;
-
-    while (len > 0 && *from)
-    {
-        if ((*from & 0x80) == 0)
-        {
-            *to = *from++;
-            len--;
-        }
-        else if ((*from & 0xe0) == 0xc0)
-        {
-            if (len < 2)
-                break;            /* drop trailing incomplete char */
-            c1 = *from++ & 0x1f;
-            c2 = *from++ & 0x3f;
-            *to = (c1 << 6) | c2;
-            len -= 2;
-        }
-        else if ((*from & 0xf0) == 0xe0)
-        {
-            if (len < 3)
-                break;            /* drop trailing incomplete char */
-            c1 = *from++ & 0x0f;
-            c2 = *from++ & 0x3f;
-            c3 = *from++ & 0x3f;
-            *to = (c1 << 12) | (c2 << 6) | c3;
-            len -= 3;
-        }
-        else if ((*from & 0xf8) == 0xf0)
-        {
-            if (len < 4)
-                break;            /* drop trailing incomplete char */
-            c1 = *from++ & 0x07;
-            c2 = *from++ & 0x3f;
-            c3 = *from++ & 0x3f;
-            c4 = *from++ & 0x3f;
-            *to = (c1 << 18) | (c2 << 12) | (c3 << 6) | c4;
-            len -= 4;
-        }
-        else
-        {
-            /* treat a bogus char as length 1; not ours to raise error */
-            *to = *from++;
-            len--;
-        }
-        to++;
-        cnt++;
-    }
-    *to = 0;
-    return cnt;
-}
-
-
-/*
- * Map a Unicode code point to UTF-8.  utf8string must have 4 bytes of
- * space allocated.
- */
-unsigned char *
-unicode_to_utf8(pg_wchar c, unsigned char *utf8string)
-{
-    if (c <= 0x7F)
-    {
-        utf8string[0] = c;
-    }
-    else if (c <= 0x7FF)
-    {
-        utf8string[0] = 0xC0 | ((c >> 6) & 0x1F);
-        utf8string[1] = 0x80 | (c & 0x3F);
-    }
-    else if (c <= 0xFFFF)
-    {
-        utf8string[0] = 0xE0 | ((c >> 12) & 0x0F);
-        utf8string[1] = 0x80 | ((c >> 6) & 0x3F);
-        utf8string[2] = 0x80 | (c & 0x3F);
-    }
-    else
-    {
-        utf8string[0] = 0xF0 | ((c >> 18) & 0x07);
-        utf8string[1] = 0x80 | ((c >> 12) & 0x3F);
-        utf8string[2] = 0x80 | ((c >> 6) & 0x3F);
-        utf8string[3] = 0x80 | (c & 0x3F);
-    }
-
-    return utf8string;
-}
-
-/*
- * Trivial conversion from pg_wchar to UTF-8.
- * caller should allocate enough space for "to"
- * len: length of from.
- * "from" not necessarily null terminated.
- */
-static int
-pg_wchar2utf_with_len(const pg_wchar *from, unsigned char *to, int len)
-{
-    int            cnt = 0;
-
-    while (len > 0 && *from)
-    {
-        int            char_len;
-
-        unicode_to_utf8(*from, to);
-        char_len = pg_utf_mblen(to);
-        cnt += char_len;
-        to += char_len;
-        from++;
-        len--;
-    }
-    *to = 0;
-    return cnt;
-}
-
-/*
- * Return the byte length of a UTF8 character pointed to by s
- *
- * Note: in the current implementation we do not support UTF8 sequences
- * of more than 4 bytes; hence do NOT return a value larger than 4.
- * We return "1" for any leading byte that is either flat-out illegal or
- * indicates a length larger than we support.
- *
- * pg_utf2wchar_with_len(), utf8_to_unicode(), pg_utf8_islegal(), and perhaps
- * other places would need to be fixed to change this.
- */
-int
-pg_utf_mblen(const unsigned char *s)
-{
-    int            len;
-
-    if ((*s & 0x80) == 0)
-        len = 1;
-    else if ((*s & 0xe0) == 0xc0)
-        len = 2;
-    else if ((*s & 0xf0) == 0xe0)
-        len = 3;
-    else if ((*s & 0xf8) == 0xf0)
-        len = 4;
-#ifdef NOT_USED
-    else if ((*s & 0xfc) == 0xf8)
-        len = 5;
-    else if ((*s & 0xfe) == 0xfc)
-        len = 6;
-#endif
-    else
-        len = 1;
-    return len;
-}
-
-/*
- * This is an implementation of wcwidth() and wcswidth() as defined in
- * "The Single UNIX Specification, Version 2, The Open Group, 1997"
- * <http://www.unix.org/online.html>
- *
- * Markus Kuhn -- 2001-09-08 -- public domain
- *
- * customised for PostgreSQL
- *
- * original available at : http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
- */
-
-struct mbinterval
-{
-    unsigned short first;
-    unsigned short last;
-};
-
-/* auxiliary function for binary search in interval table */
-static int
-mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
-{
-    int            min = 0;
-    int            mid;
-
-    if (ucs < table[0].first || ucs > table[max].last)
-        return 0;
-    while (max >= min)
-    {
-        mid = (min + max) / 2;
-        if (ucs > table[mid].last)
-            min = mid + 1;
-        else if (ucs < table[mid].first)
-            max = mid - 1;
-        else
-            return 1;
-    }
-
-    return 0;
-}
-
-
-/* The following functions define the column width of an ISO 10646
- * character as follows:
- *
- *      - The null character (U+0000) has a column width of 0.
- *
- *      - Other C0/C1 control characters and DEL will lead to a return
- *        value of -1.
- *
- *      - Non-spacing and enclosing combining characters (general
- *        category code Mn or Me in the Unicode database) have a
- *        column width of 0.
- *
- *      - Other format characters (general category code Cf in the Unicode
- *        database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
- *
- *      - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
- *        have a column width of 0.
- *
- *      - Spacing characters in the East Asian Wide (W) or East Asian
- *        FullWidth (F) category as defined in Unicode Technical
- *        Report #11 have a column width of 2.
- *
- *      - All remaining characters (including all printable
- *        ISO 8859-1 and WGL4 characters, Unicode control characters,
- *        etc.) have a column width of 1.
- *
- * This implementation assumes that wchar_t characters are encoded
- * in ISO 10646.
- */
-
-static int
-ucs_wcwidth(pg_wchar ucs)
-{
-#include "common/unicode_combining_table.h"
-
-    /* test for 8-bit control characters */
-    if (ucs == 0)
-        return 0;
-
-    if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff)
-        return -1;
-
-    /* binary search in table of non-spacing characters */
-    if (mbbisearch(ucs, combining,
-                   sizeof(combining) / sizeof(struct mbinterval) - 1))
-        return 0;
-
-    /*
-     * if we arrive here, ucs is not a combining or C0/C1 control character
-     */
-
-    return 1 +
-        (ucs >= 0x1100 &&
-         (ucs <= 0x115f ||        /* Hangul Jamo init. consonants */
-          (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
-           ucs != 0x303f) ||    /* CJK ... Yi */
-          (ucs >= 0xac00 && ucs <= 0xd7a3) ||    /* Hangul Syllables */
-          (ucs >= 0xf900 && ucs <= 0xfaff) ||    /* CJK Compatibility
-                                                 * Ideographs */
-          (ucs >= 0xfe30 && ucs <= 0xfe6f) ||    /* CJK Compatibility Forms */
-          (ucs >= 0xff00 && ucs <= 0xff5f) ||    /* Fullwidth Forms */
-          (ucs >= 0xffe0 && ucs <= 0xffe6) ||
-          (ucs >= 0x20000 && ucs <= 0x2ffff)));
-}
-
-/*
- * Convert a UTF-8 character to a Unicode code point.
- * This is a one-character version of pg_utf2wchar_with_len.
- *
- * No error checks here, c must point to a long-enough string.
- */
-pg_wchar
-utf8_to_unicode(const unsigned char *c)
-{
-    if ((*c & 0x80) == 0)
-        return (pg_wchar) c[0];
-    else if ((*c & 0xe0) == 0xc0)
-        return (pg_wchar) (((c[0] & 0x1f) << 6) |
-                           (c[1] & 0x3f));
-    else if ((*c & 0xf0) == 0xe0)
-        return (pg_wchar) (((c[0] & 0x0f) << 12) |
-                           ((c[1] & 0x3f) << 6) |
-                           (c[2] & 0x3f));
-    else if ((*c & 0xf8) == 0xf0)
-        return (pg_wchar) (((c[0] & 0x07) << 18) |
-                           ((c[1] & 0x3f) << 12) |
-                           ((c[2] & 0x3f) << 6) |
-                           (c[3] & 0x3f));
-    else
-        /* that is an invalid code on purpose */
-        return 0xffffffff;
-}
-
-static int
-pg_utf_dsplen(const unsigned char *s)
-{
-    return ucs_wcwidth(utf8_to_unicode(s));
-}
-
-/*
- * convert mule internal code to pg_wchar
- * caller should allocate enough space for "to"
- * len: length of from.
- * "from" not necessarily null terminated.
- */
-static int
-pg_mule2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
-{
-    int            cnt = 0;
-
-    while (len > 0 && *from)
-    {
-        if (IS_LC1(*from) && len >= 2)
-        {
-            *to = *from++ << 16;
-            *to |= *from++;
-            len -= 2;
-        }
-        else if (IS_LCPRV1(*from) && len >= 3)
-        {
-            from++;
-            *to = *from++ << 16;
-            *to |= *from++;
-            len -= 3;
-        }
-        else if (IS_LC2(*from) && len >= 3)
-        {
-            *to = *from++ << 16;
-            *to |= *from++ << 8;
-            *to |= *from++;
-            len -= 3;
-        }
-        else if (IS_LCPRV2(*from) && len >= 4)
-        {
-            from++;
-            *to = *from++ << 16;
-            *to |= *from++ << 8;
-            *to |= *from++;
-            len -= 4;
-        }
-        else
-        {                        /* assume ASCII */
-            *to = (unsigned char) *from++;
-            len--;
-        }
-        to++;
-        cnt++;
-    }
-    *to = 0;
-    return cnt;
-}
-
-/*
- * convert pg_wchar to mule internal code
- * caller should allocate enough space for "to"
- * len: length of from.
- * "from" not necessarily null terminated.
- */
-static int
-pg_wchar2mule_with_len(const pg_wchar *from, unsigned char *to, int len)
-{
-    int            cnt = 0;
-
-    while (len > 0 && *from)
-    {
-        unsigned char lb;
-
-        lb = (*from >> 16) & 0xff;
-        if (IS_LC1(lb))
-        {
-            *to++ = lb;
-            *to++ = *from & 0xff;
-            cnt += 2;
-        }
-        else if (IS_LC2(lb))
-        {
-            *to++ = lb;
-            *to++ = (*from >> 8) & 0xff;
-            *to++ = *from & 0xff;
-            cnt += 3;
-        }
-        else if (IS_LCPRV1_A_RANGE(lb))
-        {
-            *to++ = LCPRV1_A;
-            *to++ = lb;
-            *to++ = *from & 0xff;
-            cnt += 3;
-        }
-        else if (IS_LCPRV1_B_RANGE(lb))
-        {
-            *to++ = LCPRV1_B;
-            *to++ = lb;
-            *to++ = *from & 0xff;
-            cnt += 3;
-        }
-        else if (IS_LCPRV2_A_RANGE(lb))
-        {
-            *to++ = LCPRV2_A;
-            *to++ = lb;
-            *to++ = (*from >> 8) & 0xff;
-            *to++ = *from & 0xff;
-            cnt += 4;
-        }
-        else if (IS_LCPRV2_B_RANGE(lb))
-        {
-            *to++ = LCPRV2_B;
-            *to++ = lb;
-            *to++ = (*from >> 8) & 0xff;
-            *to++ = *from & 0xff;
-            cnt += 4;
-        }
-        else
-        {
-            *to++ = *from & 0xff;
-            cnt += 1;
-        }
-        from++;
-        len--;
-    }
-    *to = 0;
-    return cnt;
-}
-
-int
-pg_mule_mblen(const unsigned char *s)
-{
-    int            len;
-
-    if (IS_LC1(*s))
-        len = 2;
-    else if (IS_LCPRV1(*s))
-        len = 3;
-    else if (IS_LC2(*s))
-        len = 3;
-    else if (IS_LCPRV2(*s))
-        len = 4;
-    else
-        len = 1;                /* assume ASCII */
-    return len;
-}
-
-static int
-pg_mule_dsplen(const unsigned char *s)
-{
-    int            len;
-
-    /*
-     * Note: it's not really appropriate to assume that all multibyte charsets
-     * are double-wide on screen.  But this seems an okay approximation for
-     * the MULE charsets we currently support.
-     */
-
-    if (IS_LC1(*s))
-        len = 1;
-    else if (IS_LCPRV1(*s))
-        len = 1;
-    else if (IS_LC2(*s))
-        len = 2;
-    else if (IS_LCPRV2(*s))
-        len = 2;
-    else
-        len = 1;                /* assume ASCII */
-
-    return len;
-}
-
-/*
- * ISO8859-1
- */
-static int
-pg_latin12wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
-{
-    int            cnt = 0;
-
-    while (len > 0 && *from)
-    {
-        *to++ = *from++;
-        len--;
-        cnt++;
-    }
-    *to = 0;
-    return cnt;
-}
-
-/*
- * Trivial conversion from pg_wchar to single byte encoding. Just ignores
- * high bits.
- * caller should allocate enough space for "to"
- * len: length of from.
- * "from" not necessarily null terminated.
- */
-static int
-pg_wchar2single_with_len(const pg_wchar *from, unsigned char *to, int len)
-{
-    int            cnt = 0;
-
-    while (len > 0 && *from)
-    {
-        *to++ = *from++;
-        len--;
-        cnt++;
-    }
-    *to = 0;
-    return cnt;
-}
-
-static int
-pg_latin1_mblen(const unsigned char *s)
-{
-    return 1;
-}
-
-static int
-pg_latin1_dsplen(const unsigned char *s)
-{
-    return pg_ascii_dsplen(s);
-}
-
-/*
- * SJIS
- */
-static int
-pg_sjis_mblen(const unsigned char *s)
-{
-    int            len;
-
-    if (*s >= 0xa1 && *s <= 0xdf)
-        len = 1;                /* 1 byte kana? */
-    else if (IS_HIGHBIT_SET(*s))
-        len = 2;                /* kanji? */
-    else
-        len = 1;                /* should be ASCII */
-    return len;
-}
-
-static int
-pg_sjis_dsplen(const unsigned char *s)
-{
-    int            len;
-
-    if (*s >= 0xa1 && *s <= 0xdf)
-        len = 1;                /* 1 byte kana? */
-    else if (IS_HIGHBIT_SET(*s))
-        len = 2;                /* kanji? */
-    else
-        len = pg_ascii_dsplen(s);    /* should be ASCII */
-    return len;
-}
-
-/*
- * Big5
- */
-static int
-pg_big5_mblen(const unsigned char *s)
-{
-    int            len;
-
-    if (IS_HIGHBIT_SET(*s))
-        len = 2;                /* kanji? */
-    else
-        len = 1;                /* should be ASCII */
-    return len;
-}
-
-static int
-pg_big5_dsplen(const unsigned char *s)
-{
-    int            len;
-
-    if (IS_HIGHBIT_SET(*s))
-        len = 2;                /* kanji? */
-    else
-        len = pg_ascii_dsplen(s);    /* should be ASCII */
-    return len;
-}
-
-/*
- * GBK
- */
-static int
-pg_gbk_mblen(const unsigned char *s)
-{
-    int            len;
-
-    if (IS_HIGHBIT_SET(*s))
-        len = 2;                /* kanji? */
-    else
-        len = 1;                /* should be ASCII */
-    return len;
-}
-
-static int
-pg_gbk_dsplen(const unsigned char *s)
-{
-    int            len;
-
-    if (IS_HIGHBIT_SET(*s))
-        len = 2;                /* kanji? */
-    else
-        len = pg_ascii_dsplen(s);    /* should be ASCII */
-    return len;
-}
-
-/*
- * UHC
- */
-static int
-pg_uhc_mblen(const unsigned char *s)
-{
-    int            len;
-
-    if (IS_HIGHBIT_SET(*s))
-        len = 2;                /* 2byte? */
-    else
-        len = 1;                /* should be ASCII */
-    return len;
-}
-
-static int
-pg_uhc_dsplen(const unsigned char *s)
-{
-    int            len;
-
-    if (IS_HIGHBIT_SET(*s))
-        len = 2;                /* 2byte? */
-    else
-        len = pg_ascii_dsplen(s);    /* should be ASCII */
-    return len;
-}
-
-/*
- * GB18030
- *    Added by Bill Huang <bhuang@redhat.com>,<bill_huanghb@ybb.ne.jp>
- */
-
-/*
- * Unlike all other mblen() functions, this also looks at the second byte of
- * the input.  However, if you only pass the first byte of a multi-byte
- * string, and \0 as the second byte, this still works in a predictable way:
- * a 4-byte character will be reported as two 2-byte characters.  That's
- * enough for all current uses, as a client-only encoding.  It works that
- * way, because in any valid 4-byte GB18030-encoded character, the third and
- * fourth byte look like a 2-byte encoded character, when looked at
- * separately.
- */
-static int
-pg_gb18030_mblen(const unsigned char *s)
-{
-    int            len;
-
-    if (!IS_HIGHBIT_SET(*s))
-        len = 1;                /* ASCII */
-    else if (*(s + 1) >= 0x30 && *(s + 1) <= 0x39)
-        len = 4;
-    else
-        len = 2;
-    return len;
-}
-
-static int
-pg_gb18030_dsplen(const unsigned char *s)
-{
-    int            len;
-
-    if (IS_HIGHBIT_SET(*s))
-        len = 2;
-    else
-        len = pg_ascii_dsplen(s);    /* ASCII */
-    return len;
-}
-
-/*
- *-------------------------------------------------------------------
- * multibyte sequence validators
- *
- * These functions accept "s", a pointer to the first byte of a string,
- * and "len", the remaining length of the string.  If there is a validly
- * encoded character beginning at *s, return its length in bytes; else
- * return -1.
- *
- * The functions can assume that len > 0 and that *s != '\0', but they must
- * test for and reject zeroes in any additional bytes of a multibyte character.
- *
- * Note that this definition allows the function for a single-byte
- * encoding to be just "return 1".
- *-------------------------------------------------------------------
- */
-
-static int
-pg_ascii_verifier(const unsigned char *s, int len)
-{
-    return 1;
-}
-
-#define IS_EUC_RANGE_VALID(c)    ((c) >= 0xa1 && (c) <= 0xfe)
-
-static int
-pg_eucjp_verifier(const unsigned char *s, int len)
-{
-    int            l;
-    unsigned char c1,
-                c2;
-
-    c1 = *s++;
-
-    switch (c1)
-    {
-        case SS2:                /* JIS X 0201 */
-            l = 2;
-            if (l > len)
-                return -1;
-            c2 = *s++;
-            if (c2 < 0xa1 || c2 > 0xdf)
-                return -1;
-            break;
-
-        case SS3:                /* JIS X 0212 */
-            l = 3;
-            if (l > len)
-                return -1;
-            c2 = *s++;
-            if (!IS_EUC_RANGE_VALID(c2))
-                return -1;
-            c2 = *s++;
-            if (!IS_EUC_RANGE_VALID(c2))
-                return -1;
-            break;
-
-        default:
-            if (IS_HIGHBIT_SET(c1)) /* JIS X 0208? */
-            {
-                l = 2;
-                if (l > len)
-                    return -1;
-                if (!IS_EUC_RANGE_VALID(c1))
-                    return -1;
-                c2 = *s++;
-                if (!IS_EUC_RANGE_VALID(c2))
-                    return -1;
-            }
-            else
-                /* must be ASCII */
-            {
-                l = 1;
-            }
-            break;
-    }
-
-    return l;
-}
-
-static int
-pg_euckr_verifier(const unsigned char *s, int len)
-{
-    int            l;
-    unsigned char c1,
-                c2;
-
-    c1 = *s++;
-
-    if (IS_HIGHBIT_SET(c1))
-    {
-        l = 2;
-        if (l > len)
-            return -1;
-        if (!IS_EUC_RANGE_VALID(c1))
-            return -1;
-        c2 = *s++;
-        if (!IS_EUC_RANGE_VALID(c2))
-            return -1;
-    }
-    else
-        /* must be ASCII */
-    {
-        l = 1;
-    }
-
-    return l;
-}
-
-/* EUC-CN byte sequences are exactly same as EUC-KR */
-#define pg_euccn_verifier    pg_euckr_verifier
-
-static int
-pg_euctw_verifier(const unsigned char *s, int len)
-{
-    int            l;
-    unsigned char c1,
-                c2;
-
-    c1 = *s++;
-
-    switch (c1)
-    {
-        case SS2:                /* CNS 11643 Plane 1-7 */
-            l = 4;
-            if (l > len)
-                return -1;
-            c2 = *s++;
-            if (c2 < 0xa1 || c2 > 0xa7)
-                return -1;
-            c2 = *s++;
-            if (!IS_EUC_RANGE_VALID(c2))
-                return -1;
-            c2 = *s++;
-            if (!IS_EUC_RANGE_VALID(c2))
-                return -1;
-            break;
-
-        case SS3:                /* unused */
-            return -1;
-
-        default:
-            if (IS_HIGHBIT_SET(c1)) /* CNS 11643 Plane 1 */
-            {
-                l = 2;
-                if (l > len)
-                    return -1;
-                /* no further range check on c1? */
-                c2 = *s++;
-                if (!IS_EUC_RANGE_VALID(c2))
-                    return -1;
-            }
-            else
-                /* must be ASCII */
-            {
-                l = 1;
-            }
-            break;
-    }
-    return l;
-}
-
-static int
-pg_johab_verifier(const unsigned char *s, int len)
-{
-    int            l,
-                mbl;
-    unsigned char c;
-
-    l = mbl = pg_johab_mblen(s);
-
-    if (len < l)
-        return -1;
-
-    if (!IS_HIGHBIT_SET(*s))
-        return mbl;
-
-    while (--l > 0)
-    {
-        c = *++s;
-        if (!IS_EUC_RANGE_VALID(c))
-            return -1;
-    }
-    return mbl;
-}
-
-static int
-pg_mule_verifier(const unsigned char *s, int len)
-{
-    int            l,
-                mbl;
-    unsigned char c;
-
-    l = mbl = pg_mule_mblen(s);
-
-    if (len < l)
-        return -1;
-
-    while (--l > 0)
-    {
-        c = *++s;
-        if (!IS_HIGHBIT_SET(c))
-            return -1;
-    }
-    return mbl;
-}
-
-static int
-pg_latin1_verifier(const unsigned char *s, int len)
-{
-    return 1;
-}
-
-static int
-pg_sjis_verifier(const unsigned char *s, int len)
-{
-    int            l,
-                mbl;
-    unsigned char c1,
-                c2;
-
-    l = mbl = pg_sjis_mblen(s);
-
-    if (len < l)
-        return -1;
-
-    if (l == 1)                    /* pg_sjis_mblen already verified it */
-        return mbl;
-
-    c1 = *s++;
-    c2 = *s;
-    if (!ISSJISHEAD(c1) || !ISSJISTAIL(c2))
-        return -1;
-    return mbl;
-}
-
-static int
-pg_big5_verifier(const unsigned char *s, int len)
-{
-    int            l,
-                mbl;
-
-    l = mbl = pg_big5_mblen(s);
-
-    if (len < l)
-        return -1;
-
-    while (--l > 0)
-    {
-        if (*++s == '\0')
-            return -1;
-    }
-
-    return mbl;
-}
-
-static int
-pg_gbk_verifier(const unsigned char *s, int len)
-{
-    int            l,
-                mbl;
-
-    l = mbl = pg_gbk_mblen(s);
-
-    if (len < l)
-        return -1;
-
-    while (--l > 0)
-    {
-        if (*++s == '\0')
-            return -1;
-    }
-
-    return mbl;
-}
-
-static int
-pg_uhc_verifier(const unsigned char *s, int len)
-{
-    int            l,
-                mbl;
-
-    l = mbl = pg_uhc_mblen(s);
-
-    if (len < l)
-        return -1;
-
-    while (--l > 0)
-    {
-        if (*++s == '\0')
-            return -1;
-    }
-
-    return mbl;
-}
-
-static int
-pg_gb18030_verifier(const unsigned char *s, int len)
-{
-    int            l;
-
-    if (!IS_HIGHBIT_SET(*s))
-        l = 1;                    /* ASCII */
-    else if (len >= 4 && *(s + 1) >= 0x30 && *(s + 1) <= 0x39)
-    {
-        /* Should be 4-byte, validate remaining bytes */
-        if (*s >= 0x81 && *s <= 0xfe &&
-            *(s + 2) >= 0x81 && *(s + 2) <= 0xfe &&
-            *(s + 3) >= 0x30 && *(s + 3) <= 0x39)
-            l = 4;
-        else
-            l = -1;
-    }
-    else if (len >= 2 && *s >= 0x81 && *s <= 0xfe)
-    {
-        /* Should be 2-byte, validate */
-        if ((*(s + 1) >= 0x40 && *(s + 1) <= 0x7e) ||
-            (*(s + 1) >= 0x80 && *(s + 1) <= 0xfe))
-            l = 2;
-        else
-            l = -1;
-    }
-    else
-        l = -1;
-    return l;
-}
-
-static int
-pg_utf8_verifier(const unsigned char *s, int len)
-{
-    int            l = pg_utf_mblen(s);
-
-    if (len < l)
-        return -1;
-
-    if (!pg_utf8_islegal(s, l))
-        return -1;
-
-    return l;
-}
-
-/*
- * Check for validity of a single UTF-8 encoded character
- *
- * This directly implements the rules in RFC3629.  The bizarre-looking
- * restrictions on the second byte are meant to ensure that there isn't
- * more than one encoding of a given Unicode character point; that is,
- * you may not use a longer-than-necessary byte sequence with high order
- * zero bits to represent a character that would fit in fewer bytes.
- * To do otherwise is to create security hazards (eg, create an apparent
- * non-ASCII character that decodes to plain ASCII).
- *
- * length is assumed to have been obtained by pg_utf_mblen(), and the
- * caller must have checked that that many bytes are present in the buffer.
- */
-bool
-pg_utf8_islegal(const unsigned char *source, int length)
-{
-    unsigned char a;
-
-    switch (length)
-    {
-        default:
-            /* reject lengths 5 and 6 for now */
-            return false;
-        case 4:
-            a = source[3];
-            if (a < 0x80 || a > 0xBF)
-                return false;
-            /* FALL THRU */
-        case 3:
-            a = source[2];
-            if (a < 0x80 || a > 0xBF)
-                return false;
-            /* FALL THRU */
-        case 2:
-            a = source[1];
-            switch (*source)
-            {
-                case 0xE0:
-                    if (a < 0xA0 || a > 0xBF)
-                        return false;
-                    break;
-                case 0xED:
-                    if (a < 0x80 || a > 0x9F)
-                        return false;
-                    break;
-                case 0xF0:
-                    if (a < 0x90 || a > 0xBF)
-                        return false;
-                    break;
-                case 0xF4:
-                    if (a < 0x80 || a > 0x8F)
-                        return false;
-                    break;
-                default:
-                    if (a < 0x80 || a > 0xBF)
-                        return false;
-                    break;
-            }
-            /* FALL THRU */
-        case 1:
-            a = *source;
-            if (a >= 0x80 && a < 0xC2)
-                return false;
-            if (a > 0xF4)
-                return false;
-            break;
-    }
-    return true;
-}
-
-#ifndef FRONTEND
-
-/*
- * Generic character incrementer function.
- *
- * Not knowing anything about the properties of the encoding in use, we just
- * keep incrementing the last byte until we get a validly-encoded result,
- * or we run out of values to try.  We don't bother to try incrementing
- * higher-order bytes, so there's no growth in runtime for wider characters.
- * (If we did try to do that, we'd need to consider the likelihood that 255
- * is not a valid final byte in the encoding.)
- */
-static bool
-pg_generic_charinc(unsigned char *charptr, int len)
-{
-    unsigned char *lastbyte = charptr + len - 1;
-    mbverifier    mbverify;
-
-    /* We can just invoke the character verifier directly. */
-    mbverify = pg_wchar_table[GetDatabaseEncoding()].mbverify;
-
-    while (*lastbyte < (unsigned char) 255)
-    {
-        (*lastbyte)++;
-        if ((*mbverify) (charptr, len) == len)
-            return true;
-    }
-
-    return false;
-}
-
-/*
- * UTF-8 character incrementer function.
- *
- * For a one-byte character less than 0x7F, we just increment the byte.
- *
- * For a multibyte character, every byte but the first must fall between 0x80
- * and 0xBF; and the first byte must be between 0xC0 and 0xF4.  We increment
- * the last byte that's not already at its maximum value.  If we can't find a
- * byte that's less than the maximum allowable value, we simply fail.  We also
- * need some special-case logic to skip regions used for surrogate pair
- * handling, as those should not occur in valid UTF-8.
- *
- * Note that we don't reset lower-order bytes back to their minimums, since
- * we can't afford to make an exhaustive search (see make_greater_string).
- */
-static bool
-pg_utf8_increment(unsigned char *charptr, int length)
-{
-    unsigned char a;
-    unsigned char limit;
-
-    switch (length)
-    {
-        default:
-            /* reject lengths 5 and 6 for now */
-            return false;
-        case 4:
-            a = charptr[3];
-            if (a < 0xBF)
-            {
-                charptr[3]++;
-                break;
-            }
-            /* FALL THRU */
-        case 3:
-            a = charptr[2];
-            if (a < 0xBF)
-            {
-                charptr[2]++;
-                break;
-            }
-            /* FALL THRU */
-        case 2:
-            a = charptr[1];
-            switch (*charptr)
-            {
-                case 0xED:
-                    limit = 0x9F;
-                    break;
-                case 0xF4:
-                    limit = 0x8F;
-                    break;
-                default:
-                    limit = 0xBF;
-                    break;
-            }
-            if (a < limit)
-            {
-                charptr[1]++;
-                break;
-            }
-            /* FALL THRU */
-        case 1:
-            a = *charptr;
-            if (a == 0x7F || a == 0xDF || a == 0xEF || a == 0xF4)
-                return false;
-            charptr[0]++;
-            break;
-    }
-
-    return true;
-}
-
-/*
- * EUC-JP character incrementer function.
- *
- * If the sequence starts with SS2 (0x8e), it must be a two-byte sequence
- * representing JIS X 0201 characters with the second byte ranging between
- * 0xa1 and 0xdf.  We just increment the last byte if it's less than 0xdf,
- * and otherwise rewrite the whole sequence to 0xa1 0xa1.
- *
- * If the sequence starts with SS3 (0x8f), it must be a three-byte sequence
- * in which the last two bytes range between 0xa1 and 0xfe.  The last byte
- * is incremented if possible, otherwise the second-to-last byte.
- *
- * If the sequence starts with a value other than the above and its MSB
- * is set, it must be a two-byte sequence representing JIS X 0208 characters
- * with both bytes ranging between 0xa1 and 0xfe.  The last byte is
- * incremented if possible, otherwise the second-to-last byte.
- *
- * Otherwise, the sequence is a single-byte ASCII character. It is
- * incremented up to 0x7f.
- */
-static bool
-pg_eucjp_increment(unsigned char *charptr, int length)
-{
-    unsigned char c1,
-                c2;
-    int            i;
-
-    c1 = *charptr;
-
-    switch (c1)
-    {
-        case SS2:                /* JIS X 0201 */
-            if (length != 2)
-                return false;
-
-            c2 = charptr[1];
-
-            if (c2 >= 0xdf)
-                charptr[0] = charptr[1] = 0xa1;
-            else if (c2 < 0xa1)
-                charptr[1] = 0xa1;
-            else
-                charptr[1]++;
-            break;
-
-        case SS3:                /* JIS X 0212 */
-            if (length != 3)
-                return false;
-
-            for (i = 2; i > 0; i--)
-            {
-                c2 = charptr[i];
-                if (c2 < 0xa1)
-                {
-                    charptr[i] = 0xa1;
-                    return true;
-                }
-                else if (c2 < 0xfe)
-                {
-                    charptr[i]++;
-                    return true;
-                }
-            }
-
-            /* Out of 3-byte code region */
-            return false;
-
-        default:
-            if (IS_HIGHBIT_SET(c1)) /* JIS X 0208? */
-            {
-                if (length != 2)
-                    return false;
-
-                for (i = 1; i >= 0; i--)
-                {
-                    c2 = charptr[i];
-                    if (c2 < 0xa1)
-                    {
-                        charptr[i] = 0xa1;
-                        return true;
-                    }
-                    else if (c2 < 0xfe)
-                    {
-                        charptr[i]++;
-                        return true;
-                    }
-                }
-
-                /* Out of 2 byte code region */
-                return false;
-            }
-            else
-            {                    /* ASCII, single byte */
-                if (c1 > 0x7e)
-                    return false;
-                (*charptr)++;
-            }
-            break;
-    }
-
-    return true;
-}
-#endif                            /* !FRONTEND */
-
-
-/*
- *-------------------------------------------------------------------
- * encoding info table
- * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
- *-------------------------------------------------------------------
- */
-const pg_wchar_tbl pg_wchar_table[] = {
-    {pg_ascii2wchar_with_len, pg_wchar2single_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_ascii_verifier, 1}, /* PG_SQL_ASCII */
-    {pg_eucjp2wchar_with_len, pg_wchar2euc_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JP */
-    {pg_euccn2wchar_with_len, pg_wchar2euc_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_euccn_verifier, 2}, /* PG_EUC_CN */
-    {pg_euckr2wchar_with_len, pg_wchar2euc_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_euckr_verifier, 3}, /* PG_EUC_KR */
-    {pg_euctw2wchar_with_len, pg_wchar2euc_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_euctw_verifier, 4}, /* PG_EUC_TW */
-    {pg_eucjp2wchar_with_len, pg_wchar2euc_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004 */
-    {pg_utf2wchar_with_len, pg_wchar2utf_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4},    /* PG_UTF8 */
-    {pg_mule2wchar_with_len, pg_wchar2mule_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier, 4}, /* PG_MULE_INTERNAL */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN1 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN2 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN3 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN4 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN5 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN6 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN7 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN8 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN9 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN10 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1256 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1258 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN866 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN874 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8R */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1251 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1252 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-5 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-6 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-7 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-8 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1250 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1253 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1254 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1255 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1257 */
-    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8U */
-    {0, 0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */
-    {0, 0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */
-    {0, 0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2},    /* PG_GBK */
-    {0, 0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2},    /* PG_UHC */
-    {0, 0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4},    /* PG_GB18030 */
-    {0, 0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3},    /* PG_JOHAB */
-    {0, 0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}    /* PG_SHIFT_JIS_2004 */
-};
-
-/* returns the byte length of a word for mule internal code */
-int
-pg_mic_mblen(const unsigned char *mbstr)
-{
-    return pg_mule_mblen(mbstr);
-}
-
-/*
- * Returns the byte length of a multibyte character.
- */
-int
-pg_encoding_mblen(int encoding, const char *mbstr)
-{
-    return (PG_VALID_ENCODING(encoding) ?
-            pg_wchar_table[encoding].mblen((const unsigned char *) mbstr) :
-            pg_wchar_table[PG_SQL_ASCII].mblen((const unsigned char *) mbstr));
-}
-
-/*
- * Returns the display length of a multibyte character.
- */
-int
-pg_encoding_dsplen(int encoding, const char *mbstr)
-{
-    return (PG_VALID_ENCODING(encoding) ?
-            pg_wchar_table[encoding].dsplen((const unsigned char *) mbstr) :
-            pg_wchar_table[PG_SQL_ASCII].dsplen((const unsigned char *) mbstr));
-}
-
-/*
- * Verify the first multibyte character of the given string.
- * Return its byte length if good, -1 if bad.  (See comments above for
- * full details of the mbverify API.)
- */
-int
-pg_encoding_verifymb(int encoding, const char *mbstr, int len)
-{
-    return (PG_VALID_ENCODING(encoding) ?
-            pg_wchar_table[encoding].mbverify((const unsigned char *) mbstr, len) :
-            pg_wchar_table[PG_SQL_ASCII].mbverify((const unsigned char *) mbstr, len));
-}
-
-/*
- * fetch maximum length of a given encoding
- */
-int
-pg_encoding_max_length(int encoding)
-{
-    Assert(PG_VALID_ENCODING(encoding));
-
-    return pg_wchar_table[encoding].maxmblen;
-}
-
-#ifndef FRONTEND
-
-/*
- * fetch maximum length of the encoding for the current database
- */
-int
-pg_database_encoding_max_length(void)
-{
-    return pg_wchar_table[GetDatabaseEncoding()].maxmblen;
-}
-
-/*
- * get the character incrementer for the encoding for the current database
- */
-mbcharacter_incrementer
-pg_database_encoding_character_incrementer(void)
-{
-    /*
-     * Eventually it might be best to add a field to pg_wchar_table[], but for
-     * now we just use a switch.
-     */
-    switch (GetDatabaseEncoding())
-    {
-        case PG_UTF8:
-            return pg_utf8_increment;
-
-        case PG_EUC_JP:
-            return pg_eucjp_increment;
-
-        default:
-            return pg_generic_charinc;
-    }
-}
-
-/*
- * Verify mbstr to make sure that it is validly encoded in the current
- * database encoding.  Otherwise same as pg_verify_mbstr().
- */
-bool
-pg_verifymbstr(const char *mbstr, int len, bool noError)
-{
-    return
-        pg_verify_mbstr_len(GetDatabaseEncoding(), mbstr, len, noError) >= 0;
-}
-
-/*
- * Verify mbstr to make sure that it is validly encoded in the specified
- * encoding.
- */
-bool
-pg_verify_mbstr(int encoding, const char *mbstr, int len, bool noError)
-{
-    return pg_verify_mbstr_len(encoding, mbstr, len, noError) >= 0;
-}
-
-/*
- * Verify mbstr to make sure that it is validly encoded in the specified
- * encoding.
- *
- * mbstr is not necessarily zero terminated; length of mbstr is
- * specified by len.
- *
- * If OK, return length of string in the encoding.
- * If a problem is found, return -1 when noError is
- * true; when noError is false, ereport() a descriptive message.
- */
-int
-pg_verify_mbstr_len(int encoding, const char *mbstr, int len, bool noError)
-{
-    mbverifier    mbverify;
-    int            mb_len;
-
-    Assert(PG_VALID_ENCODING(encoding));
-
-    /*
-     * In single-byte encodings, we need only reject nulls (\0).
-     */
-    if (pg_encoding_max_length(encoding) <= 1)
-    {
-        const char *nullpos = memchr(mbstr, 0, len);
-
-        if (nullpos == NULL)
-            return len;
-        if (noError)
-            return -1;
-        report_invalid_encoding(encoding, nullpos, 1);
-    }
-
-    /* fetch function pointer just once */
-    mbverify = pg_wchar_table[encoding].mbverify;
-
-    mb_len = 0;
-
-    while (len > 0)
-    {
-        int            l;
-
-        /* fast path for ASCII-subset characters */
-        if (!IS_HIGHBIT_SET(*mbstr))
-        {
-            if (*mbstr != '\0')
-            {
-                mb_len++;
-                mbstr++;
-                len--;
-                continue;
-            }
-            if (noError)
-                return -1;
-            report_invalid_encoding(encoding, mbstr, len);
-        }
-
-        l = (*mbverify) ((const unsigned char *) mbstr, len);
-
-        if (l < 0)
-        {
-            if (noError)
-                return -1;
-            report_invalid_encoding(encoding, mbstr, len);
-        }
-
-        mbstr += l;
-        len -= l;
-        mb_len++;
-    }
-    return mb_len;
-}
-
-/*
- * check_encoding_conversion_args: check arguments of a conversion function
- *
- * "expected" arguments can be either an encoding ID or -1 to indicate that
- * the caller will check whether it accepts the ID.
- *
- * Note: the errors here are not really user-facing, so elog instead of
- * ereport seems sufficient.  Also, we trust that the "expected" encoding
- * arguments are valid encoding IDs, but we don't trust the actuals.
- */
-void
-check_encoding_conversion_args(int src_encoding,
-                               int dest_encoding,
-                               int len,
-                               int expected_src_encoding,
-                               int expected_dest_encoding)
-{
-    if (!PG_VALID_ENCODING(src_encoding))
-        elog(ERROR, "invalid source encoding ID: %d", src_encoding);
-    if (src_encoding != expected_src_encoding && expected_src_encoding >= 0)
-        elog(ERROR, "expected source encoding \"%s\", but got \"%s\"",
-             pg_enc2name_tbl[expected_src_encoding].name,
-             pg_enc2name_tbl[src_encoding].name);
-    if (!PG_VALID_ENCODING(dest_encoding))
-        elog(ERROR, "invalid destination encoding ID: %d", dest_encoding);
-    if (dest_encoding != expected_dest_encoding && expected_dest_encoding >= 0)
-        elog(ERROR, "expected destination encoding \"%s\", but got \"%s\"",
-             pg_enc2name_tbl[expected_dest_encoding].name,
-             pg_enc2name_tbl[dest_encoding].name);
-    if (len < 0)
-        elog(ERROR, "encoding conversion length must not be negative");
-}
-
-/*
- * report_invalid_encoding: complain about invalid multibyte character
- *
- * note: len is remaining length of string, not length of character;
- * len must be greater than zero, as we always examine the first byte.
- */
-void
-report_invalid_encoding(int encoding, const char *mbstr, int len)
-{
-    int            l = pg_encoding_mblen(encoding, mbstr);
-    char        buf[8 * 5 + 1];
-    char       *p = buf;
-    int            j,
-                jlimit;
-
-    jlimit = Min(l, len);
-    jlimit = Min(jlimit, 8);    /* prevent buffer overrun */
-
-    for (j = 0; j < jlimit; j++)
-    {
-        p += sprintf(p, "0x%02x", (unsigned char) mbstr[j]);
-        if (j < jlimit - 1)
-            p += sprintf(p, " ");
-    }
-
-    ereport(ERROR,
-            (errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
-             errmsg("invalid byte sequence for encoding \"%s\": %s",
-                    pg_enc2name_tbl[encoding].name,
-                    buf)));
-}
-
-/*
- * report_untranslatable_char: complain about untranslatable character
- *
- * note: len is remaining length of string, not length of character;
- * len must be greater than zero, as we always examine the first byte.
- */
-void
-report_untranslatable_char(int src_encoding, int dest_encoding,
-                           const char *mbstr, int len)
-{
-    int            l = pg_encoding_mblen(src_encoding, mbstr);
-    char        buf[8 * 5 + 1];
-    char       *p = buf;
-    int            j,
-                jlimit;
-
-    jlimit = Min(l, len);
-    jlimit = Min(jlimit, 8);    /* prevent buffer overrun */
-
-    for (j = 0; j < jlimit; j++)
-    {
-        p += sprintf(p, "0x%02x", (unsigned char) mbstr[j]);
-        if (j < jlimit - 1)
-            p += sprintf(p, " ");
-    }
-
-    ereport(ERROR,
-            (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
-             errmsg("character with byte sequence %s in encoding \"%s\" has no equivalent in encoding \"%s\"",
-                    buf,
-                    pg_enc2name_tbl[src_encoding].name,
-                    pg_enc2name_tbl[dest_encoding].name)));
-}
-
-#endif                            /* !FRONTEND */
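
To make the point of 0001 concrete: once wchar.c lives in src/common, the
verifier API described in the comments above is directly callable from
frontend code. Here is a minimal sketch, not part of the patch series, of a
frontend-safe stand-in for the backend-only pg_verify_mbstr() loop. It
assumes only pg_encoding_verifymb() and IS_HIGHBIT_SET(), both of which are
usable under -DFRONTEND:

    #include "postgres_fe.h"
    #include "mb/pg_wchar.h"

    /*
     * Return true if "s" (length "len", not necessarily NUL-terminated)
     * is validly encoded in "encoding".  Same loop as the backend's
     * pg_verify_mbstr_len(), minus the ereport() error paths.
     */
    static bool
    frontend_verify_mbstr(int encoding, const char *s, int len)
    {
        while (len > 0)
        {
            int         l;

            /* fast path for ASCII-subset characters, as in the backend */
            if (!IS_HIGHBIT_SET(*s))
            {
                if (*s == '\0')
                    return false;   /* embedded NUL is never valid */
                s++;
                len--;
                continue;
            }

            l = pg_encoding_verifymb(encoding, s, len);
            if (l < 0)
                return false;       /* invalid multibyte sequence */
            s += l;
            len -= l;
        }
        return true;
    }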
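
Likewise, pg_utf8_islegal() becomes available to frontend code. The RFC 3629
rules it implements reject overlong encodings by construction: any first
byte in the 0x80-0xC1 range is either a bare continuation byte or the start
of an overlong two-byte sequence. A tiny illustration, again not part of the
patches:

    #include "postgres_fe.h"
    #include "mb/pg_wchar.h"

    int
    main(void)
    {
        const unsigned char overlong[] = {0xC0, 0xAF};  /* overlong '/' */
        const unsigned char eacute[] = {0xC3, 0xA9};    /* U+00E9, valid */

        /* pg_utf_mblen() reports length 2 for both, but only one is legal */
        if (pg_utf8_islegal(overlong, pg_utf_mblen(overlong)))
            return 1;           /* must not be accepted */
        if (!pg_utf8_islegal(eacute, pg_utf_mblen(eacute)))
            return 1;           /* must be accepted */
        return 0;
    }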
diff --git a/src/bin/initdb/.gitignore b/src/bin/initdb/.gitignore
index 71a899f..b3167c4 100644
--- a/src/bin/initdb/.gitignore
+++ b/src/bin/initdb/.gitignore
@@ -1,4 +1,3 @@
-/encnames.c
 /localtime.c

 /initdb
diff --git a/src/bin/initdb/Makefile b/src/bin/initdb/Makefile
index f587a86..7e23754 100644
--- a/src/bin/initdb/Makefile
+++ b/src/bin/initdb/Makefile
@@ -18,7 +18,12 @@ include $(top_builddir)/src/Makefile.global

 override CPPFLAGS := -DFRONTEND -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(CPPFLAGS)

-# note: we need libpq only because fe_utils does
+# Note: it's important that we link to encnames.o from libpgcommon, not
+# from libpq, else we have risks of version skew if we run with a libpq
+# shared library from a different PG version.  The libpq_pgport macro
+# should ensure that that happens.
+#
+# We need libpq only because fe_utils does.
 LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)

 # use system timezone data?
@@ -28,7 +33,6 @@ endif

 OBJS = \
     $(WIN32RES) \
-    encnames.o \
     findtimezone.o \
     initdb.o \
     localtime.o
@@ -38,15 +42,7 @@ all: initdb
 initdb: $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils
     $(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

-# We used to pull in all of libpq to get encnames.c, but that
-# exposes us to risks of version skew if we link to a shared library.
-# Do it the hard way, instead, so that we're statically linked.
-
-encnames.c: % : $(top_srcdir)/src/backend/utils/mb/%
-    rm -f $@ && $(LN_S) $< .
-
-# Likewise, pull in localtime.c from src/timezones
-
+# We must pull in localtime.c from src/timezone
 localtime.c: % : $(top_srcdir)/src/timezone/%
     rm -f $@ && $(LN_S) $< .

@@ -60,7 +56,7 @@ uninstall:
     rm -f '$(DESTDIR)$(bindir)/initdb$(X)'

 clean distclean maintainer-clean:
-    rm -f initdb$(X) $(OBJS) encnames.c localtime.c
+    rm -f initdb$(X) $(OBJS) localtime.c
     rm -rf tmp_check

 # ensure that changes in datadir propagate into object file
diff --git a/src/common/Makefile b/src/common/Makefile
index ffb0f6e..5b44340 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -51,6 +51,7 @@ OBJS_COMMON = \
     config_info.o \
     controldata_utils.o \
     d2s.o \
+    encnames.o \
     exec.o \
     f2s.o \
     file_perm.o \
@@ -70,7 +71,8 @@ OBJS_COMMON = \
     stringinfo.o \
     unicode_norm.o \
     username.o \
-    wait_error.o
+    wait_error.o \
+    wchar.o

 ifeq ($(with_openssl),yes)
 OBJS_COMMON += sha2_openssl.o
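
A note on the encnames.c copy that follows: encoding-name lookups are keyed
on normalized names, considering only alphanumeric characters and folding to
lower case, which is why the table below stores "iso88591" rather than
"ISO-8859-1". The normalization amounts to something like this sketch; the
function name here is illustrative, since the file's own equivalent helper
falls outside this excerpt:

    #include <ctype.h>

    /*
     * Strip non-alphanumerics and fold to lower case, so that
     * "ISO-8859-1", "iso_8859-1", and "Iso8859_1" all normalize to
     * "iso88591".  "newkey" must be at least as large as "key".
     */
    static char *
    normalize_encoding_name(const char *key, char *newkey)
    {
        const char *p = key;
        char       *np = newkey;

        while (*p)
        {
            if (isalnum((unsigned char) *p))
                *np++ = (char) tolower((unsigned char) *p);
            p++;
        }
        *np = '\0';
        return newkey;
    }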
diff --git a/src/common/encnames.c b/src/common/encnames.c
new file mode 100644
index 0000000..2086e00
--- /dev/null
+++ b/src/common/encnames.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * encnames.c
+ *      Encoding names and routines for working with them.
+ *
+ * Portions Copyright (c) 2001-2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *      src/common/encnames.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
+#include "postgres.h"
+#include "utils/builtins.h"
+#endif
+
+#include <ctype.h>
+#include <unistd.h>
+
+#include "mb/pg_wchar.h"
+
+
+/* ----------
+ * All encoding names, sorted:         *** A L P H A B E T I C ***
+ *
+ * All names must be without irrelevant chars, search routines use
+ * isalnum() chars only. It means ISO-8859-1, iso_8859-1 and Iso8859_1
+ * are always converted to 'iso88591'. All must be lower case.
+ *
+ * The table doesn't contain 'cs' aliases (like csISOLatin1). It's needed?
+ *
+ * Karel Zak, Aug 2001
+ * ----------
+ */
+typedef struct pg_encname
+{
+    const char *name;
+    pg_enc        encoding;
+} pg_encname;
+
+static const pg_encname pg_encname_tbl[] =
+{
+    {
+        "abc", PG_WIN1258
+    },                            /* alias for WIN1258 */
+    {
+        "alt", PG_WIN866
+    },                            /* IBM866 */
+    {
+        "big5", PG_BIG5
+    },                            /* Big5; Chinese for Taiwan multibyte set */
+    {
+        "euccn", PG_EUC_CN
+    },                            /* EUC-CN; Extended Unix Code for simplified
+                                 * Chinese */
+    {
+        "eucjis2004", PG_EUC_JIS_2004
+    },                            /* EUC-JIS-2004; Extended UNIX Code fixed
+                                 * Width for Japanese, standard JIS X 0213 */
+    {
+        "eucjp", PG_EUC_JP
+    },                            /* EUC-JP; Extended UNIX Code fixed Width for
+                                 * Japanese, standard OSF */
+    {
+        "euckr", PG_EUC_KR
+    },                            /* EUC-KR; Extended Unix Code for Korean , KS
+                                 * X 1001 standard */
+    {
+        "euctw", PG_EUC_TW
+    },                            /* EUC-TW; Extended Unix Code for
+                                 *
+                                 * traditional Chinese */
+    {
+        "gb18030", PG_GB18030
+    },                            /* GB18030;GB18030 */
+    {
+        "gbk", PG_GBK
+    },                            /* GBK; Chinese Windows CodePage 936
+                                 * simplified Chinese */
+    {
+        "iso88591", PG_LATIN1
+    },                            /* ISO-8859-1; RFC1345,KXS2 */
+    {
+        "iso885910", PG_LATIN6
+    },                            /* ISO-8859-10; RFC1345,KXS2 */
+    {
+        "iso885913", PG_LATIN7
+    },                            /* ISO-8859-13; RFC1345,KXS2 */
+    {
+        "iso885914", PG_LATIN8
+    },                            /* ISO-8859-14; RFC1345,KXS2 */
+    {
+        "iso885915", PG_LATIN9
+    },                            /* ISO-8859-15; RFC1345,KXS2 */
+    {
+        "iso885916", PG_LATIN10
+    },                            /* ISO-8859-16; RFC1345,KXS2 */
+    {
+        "iso88592", PG_LATIN2
+    },                            /* ISO-8859-2; RFC1345,KXS2 */
+    {
+        "iso88593", PG_LATIN3
+    },                            /* ISO-8859-3; RFC1345,KXS2 */
+    {
+        "iso88594", PG_LATIN4
+    },                            /* ISO-8859-4; RFC1345,KXS2 */
+    {
+        "iso88595", PG_ISO_8859_5
+    },                            /* ISO-8859-5; RFC1345,KXS2 */
+    {
+        "iso88596", PG_ISO_8859_6
+    },                            /* ISO-8859-6; RFC1345,KXS2 */
+    {
+        "iso88597", PG_ISO_8859_7
+    },                            /* ISO-8859-7; RFC1345,KXS2 */
+    {
+        "iso88598", PG_ISO_8859_8
+    },                            /* ISO-8859-8; RFC1345,KXS2 */
+    {
+        "iso88599", PG_LATIN5
+    },                            /* ISO-8859-9; RFC1345,KXS2 */
+    {
+        "johab", PG_JOHAB
+    },                            /* JOHAB; Johab encoding for Korean */
+    {
+        "koi8", PG_KOI8R
+    },                            /* _dirty_ alias for KOI8-R (backward
+                                 * compatibility) */
+    {
+        "koi8r", PG_KOI8R
+    },                            /* KOI8-R; RFC1489 */
+    {
+        "koi8u", PG_KOI8U
+    },                            /* KOI8-U; RFC2319 */
+    {
+        "latin1", PG_LATIN1
+    },                            /* alias for ISO-8859-1 */
+    {
+        "latin10", PG_LATIN10
+    },                            /* alias for ISO-8859-16 */
+    {
+        "latin2", PG_LATIN2
+    },                            /* alias for ISO-8859-2 */
+    {
+        "latin3", PG_LATIN3
+    },                            /* alias for ISO-8859-3 */
+    {
+        "latin4", PG_LATIN4
+    },                            /* alias for ISO-8859-4 */
+    {
+        "latin5", PG_LATIN5
+    },                            /* alias for ISO-8859-9 */
+    {
+        "latin6", PG_LATIN6
+    },                            /* alias for ISO-8859-10 */
+    {
+        "latin7", PG_LATIN7
+    },                            /* alias for ISO-8859-13 */
+    {
+        "latin8", PG_LATIN8
+    },                            /* alias for ISO-8859-14 */
+    {
+        "latin9", PG_LATIN9
+    },                            /* alias for ISO-8859-15 */
+    {
+        "mskanji", PG_SJIS
+    },                            /* alias for Shift_JIS */
+    {
+        "muleinternal", PG_MULE_INTERNAL
+    },
+    {
+        "shiftjis", PG_SJIS
+    },                            /* Shift_JIS; JIS X 0202-1991 */
+    {
+        "shiftjis2004", PG_SHIFT_JIS_2004
+    },                            /* SHIFT-JIS-2004; Shift JIS for Japanese,
+                                 * standard JIS X 0213 */
+    {
+        "sjis", PG_SJIS
+    },                            /* alias for Shift_JIS */
+    {
+        "sqlascii", PG_SQL_ASCII
+    },
+    {
+        "tcvn", PG_WIN1258
+    },                            /* alias for WIN1258 */
+    {
+        "tcvn5712", PG_WIN1258
+    },                            /* alias for WIN1258 */
+    {
+        "uhc", PG_UHC
+    },                            /* UHC; Korean Windows CodePage 949 */
+    {
+        "unicode", PG_UTF8
+    },                            /* alias for UTF8 */
+    {
+        "utf8", PG_UTF8
+    },                            /* alias for UTF8 */
+    {
+        "vscii", PG_WIN1258
+    },                            /* alias for WIN1258 */
+    {
+        "win", PG_WIN1251
+    },                            /* _dirty_ alias for windows-1251 (backward
+                                 * compatibility) */
+    {
+        "win1250", PG_WIN1250
+    },                            /* alias for Windows-1250 */
+    {
+        "win1251", PG_WIN1251
+    },                            /* alias for Windows-1251 */
+    {
+        "win1252", PG_WIN1252
+    },                            /* alias for Windows-1252 */
+    {
+        "win1253", PG_WIN1253
+    },                            /* alias for Windows-1253 */
+    {
+        "win1254", PG_WIN1254
+    },                            /* alias for Windows-1254 */
+    {
+        "win1255", PG_WIN1255
+    },                            /* alias for Windows-1255 */
+    {
+        "win1256", PG_WIN1256
+    },                            /* alias for Windows-1256 */
+    {
+        "win1257", PG_WIN1257
+    },                            /* alias for Windows-1257 */
+    {
+        "win1258", PG_WIN1258
+    },                            /* alias for Windows-1258 */
+    {
+        "win866", PG_WIN866
+    },                            /* IBM866 */
+    {
+        "win874", PG_WIN874
+    },                            /* alias for Windows-874 */
+    {
+        "win932", PG_SJIS
+    },                            /* alias for Shift_JIS */
+    {
+        "win936", PG_GBK
+    },                            /* alias for GBK */
+    {
+        "win949", PG_UHC
+    },                            /* alias for UHC */
+    {
+        "win950", PG_BIG5
+    },                            /* alias for BIG5 */
+    {
+        "windows1250", PG_WIN1250
+    },                            /* Windows-1250; Microsoft */
+    {
+        "windows1251", PG_WIN1251
+    },                            /* Windows-1251; Microsoft */
+    {
+        "windows1252", PG_WIN1252
+    },                            /* Windows-1252; Microsoft */
+    {
+        "windows1253", PG_WIN1253
+    },                            /* Windows-1253; Microsoft */
+    {
+        "windows1254", PG_WIN1254
+    },                            /* Windows-1254; Microsoft */
+    {
+        "windows1255", PG_WIN1255
+    },                            /* Windows-1255; Microsoft */
+    {
+        "windows1256", PG_WIN1256
+    },                            /* Windows-1256; Microsoft */
+    {
+        "windows1257", PG_WIN1257
+    },                            /* Windows-1257; Microsoft */
+    {
+        "windows1258", PG_WIN1258
+    },                            /* Windows-1258; Microsoft */
+    {
+        "windows866", PG_WIN866
+    },                            /* IBM866 */
+    {
+        "windows874", PG_WIN874
+    },                            /* Windows-874; Microsoft */
+    {
+        "windows932", PG_SJIS
+    },                            /* alias for Shift_JIS */
+    {
+        "windows936", PG_GBK
+    },                            /* alias for GBK */
+    {
+        "windows949", PG_UHC
+    },                            /* alias for UHC */
+    {
+        "windows950", PG_BIG5
+    }                            /* alias for BIG5 */
+};
+
+/* ----------
+ * These are "official" encoding names.
+ * XXX must be sorted in the same order as enum pg_enc (in mb/pg_wchar.h)
+ * ----------
+ */
+#ifndef WIN32
+#define DEF_ENC2NAME(name, codepage) { #name, PG_##name }
+#else
+#define DEF_ENC2NAME(name, codepage) { #name, PG_##name, codepage }
+#endif
+const pg_enc2name pg_enc2name_tbl[] =
+{
+    DEF_ENC2NAME(SQL_ASCII, 0),
+    DEF_ENC2NAME(EUC_JP, 20932),
+    DEF_ENC2NAME(EUC_CN, 20936),
+    DEF_ENC2NAME(EUC_KR, 51949),
+    DEF_ENC2NAME(EUC_TW, 0),
+    DEF_ENC2NAME(EUC_JIS_2004, 20932),
+    DEF_ENC2NAME(UTF8, 65001),
+    DEF_ENC2NAME(MULE_INTERNAL, 0),
+    DEF_ENC2NAME(LATIN1, 28591),
+    DEF_ENC2NAME(LATIN2, 28592),
+    DEF_ENC2NAME(LATIN3, 28593),
+    DEF_ENC2NAME(LATIN4, 28594),
+    DEF_ENC2NAME(LATIN5, 28599),
+    DEF_ENC2NAME(LATIN6, 0),
+    DEF_ENC2NAME(LATIN7, 0),
+    DEF_ENC2NAME(LATIN8, 0),
+    DEF_ENC2NAME(LATIN9, 28605),
+    DEF_ENC2NAME(LATIN10, 0),
+    DEF_ENC2NAME(WIN1256, 1256),
+    DEF_ENC2NAME(WIN1258, 1258),
+    DEF_ENC2NAME(WIN866, 866),
+    DEF_ENC2NAME(WIN874, 874),
+    DEF_ENC2NAME(KOI8R, 20866),
+    DEF_ENC2NAME(WIN1251, 1251),
+    DEF_ENC2NAME(WIN1252, 1252),
+    DEF_ENC2NAME(ISO_8859_5, 28595),
+    DEF_ENC2NAME(ISO_8859_6, 28596),
+    DEF_ENC2NAME(ISO_8859_7, 28597),
+    DEF_ENC2NAME(ISO_8859_8, 28598),
+    DEF_ENC2NAME(WIN1250, 1250),
+    DEF_ENC2NAME(WIN1253, 1253),
+    DEF_ENC2NAME(WIN1254, 1254),
+    DEF_ENC2NAME(WIN1255, 1255),
+    DEF_ENC2NAME(WIN1257, 1257),
+    DEF_ENC2NAME(KOI8U, 21866),
+    DEF_ENC2NAME(SJIS, 932),
+    DEF_ENC2NAME(BIG5, 950),
+    DEF_ENC2NAME(GBK, 936),
+    DEF_ENC2NAME(UHC, 949),
+    DEF_ENC2NAME(GB18030, 54936),
+    DEF_ENC2NAME(JOHAB, 0),
+    DEF_ENC2NAME(SHIFT_JIS_2004, 932)
+};
+
+/* ----------
+ * These are encoding names for gettext.
+ *
+ * This covers all encodings except MULE_INTERNAL, which is alien to gettext.
+ * ----------
+ */
+const pg_enc2gettext pg_enc2gettext_tbl[] =
+{
+    {PG_SQL_ASCII, "US-ASCII"},
+    {PG_UTF8, "UTF-8"},
+    {PG_LATIN1, "LATIN1"},
+    {PG_LATIN2, "LATIN2"},
+    {PG_LATIN3, "LATIN3"},
+    {PG_LATIN4, "LATIN4"},
+    {PG_ISO_8859_5, "ISO-8859-5"},
+    {PG_ISO_8859_6, "ISO_8859-6"},
+    {PG_ISO_8859_7, "ISO-8859-7"},
+    {PG_ISO_8859_8, "ISO-8859-8"},
+    {PG_LATIN5, "LATIN5"},
+    {PG_LATIN6, "LATIN6"},
+    {PG_LATIN7, "LATIN7"},
+    {PG_LATIN8, "LATIN8"},
+    {PG_LATIN9, "LATIN-9"},
+    {PG_LATIN10, "LATIN10"},
+    {PG_KOI8R, "KOI8-R"},
+    {PG_KOI8U, "KOI8-U"},
+    {PG_WIN1250, "CP1250"},
+    {PG_WIN1251, "CP1251"},
+    {PG_WIN1252, "CP1252"},
+    {PG_WIN1253, "CP1253"},
+    {PG_WIN1254, "CP1254"},
+    {PG_WIN1255, "CP1255"},
+    {PG_WIN1256, "CP1256"},
+    {PG_WIN1257, "CP1257"},
+    {PG_WIN1258, "CP1258"},
+    {PG_WIN866, "CP866"},
+    {PG_WIN874, "CP874"},
+    {PG_EUC_CN, "EUC-CN"},
+    {PG_EUC_JP, "EUC-JP"},
+    {PG_EUC_KR, "EUC-KR"},
+    {PG_EUC_TW, "EUC-TW"},
+    {PG_EUC_JIS_2004, "EUC-JP"},
+    {PG_SJIS, "SHIFT-JIS"},
+    {PG_BIG5, "BIG5"},
+    {PG_GBK, "GBK"},
+    {PG_UHC, "UHC"},
+    {PG_GB18030, "GB18030"},
+    {PG_JOHAB, "JOHAB"},
+    {PG_SHIFT_JIS_2004, "SHIFT_JISX0213"},
+    {0, NULL}
+};
+
+
+#ifndef FRONTEND
+
+/*
+ * Table of encoding names for ICU
+ *
+ * Reference: <https://ssl.icu-project.org/icu-bin/convexp>
+ *
+ * NULL entries denote encodings that are not supported by ICU, or whose
+ * mapping is unclear.
+ */
+static const char *const pg_enc2icu_tbl[] =
+{
+    NULL,                        /* PG_SQL_ASCII */
+    "EUC-JP",                    /* PG_EUC_JP */
+    "EUC-CN",                    /* PG_EUC_CN */
+    "EUC-KR",                    /* PG_EUC_KR */
+    "EUC-TW",                    /* PG_EUC_TW */
+    NULL,                        /* PG_EUC_JIS_2004 */
+    "UTF-8",                    /* PG_UTF8 */
+    NULL,                        /* PG_MULE_INTERNAL */
+    "ISO-8859-1",                /* PG_LATIN1 */
+    "ISO-8859-2",                /* PG_LATIN2 */
+    "ISO-8859-3",                /* PG_LATIN3 */
+    "ISO-8859-4",                /* PG_LATIN4 */
+    "ISO-8859-9",                /* PG_LATIN5 */
+    "ISO-8859-10",                /* PG_LATIN6 */
+    "ISO-8859-13",                /* PG_LATIN7 */
+    "ISO-8859-14",                /* PG_LATIN8 */
+    "ISO-8859-15",                /* PG_LATIN9 */
+    NULL,                        /* PG_LATIN10 */
+    "CP1256",                    /* PG_WIN1256 */
+    "CP1258",                    /* PG_WIN1258 */
+    "CP866",                    /* PG_WIN866 */
+    NULL,                        /* PG_WIN874 */
+    "KOI8-R",                    /* PG_KOI8R */
+    "CP1251",                    /* PG_WIN1251 */
+    "CP1252",                    /* PG_WIN1252 */
+    "ISO-8859-5",                /* PG_ISO_8859_5 */
+    "ISO-8859-6",                /* PG_ISO_8859_6 */
+    "ISO-8859-7",                /* PG_ISO_8859_7 */
+    "ISO-8859-8",                /* PG_ISO_8859_8 */
+    "CP1250",                    /* PG_WIN1250 */
+    "CP1253",                    /* PG_WIN1253 */
+    "CP1254",                    /* PG_WIN1254 */
+    "CP1255",                    /* PG_WIN1255 */
+    "CP1257",                    /* PG_WIN1257 */
+    "KOI8-U",                    /* PG_KOI8U */
+};
+
+bool
+is_encoding_supported_by_icu(int encoding)
+{
+    return (pg_enc2icu_tbl[encoding] != NULL);
+}
+
+const char *
+get_encoding_name_for_icu(int encoding)
+{
+    const char *icu_encoding_name;
+
+    StaticAssertStmt(lengthof(pg_enc2icu_tbl) == PG_ENCODING_BE_LAST + 1,
+                     "pg_enc2icu_tbl incomplete");
+
+    icu_encoding_name = pg_enc2icu_tbl[encoding];
+
+    if (!icu_encoding_name)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("encoding \"%s\" not supported by ICU",
+                        pg_encoding_to_char(encoding))));
+
+    return icu_encoding_name;
+}
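+
+/*
+ * For example, get_encoding_name_for_icu(PG_UTF8) returns "UTF-8", while
+ * get_encoding_name_for_icu(PG_SQL_ASCII) raises the "not supported by
+ * ICU" error, since that table slot is NULL.
+ */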
+
+#endif                            /* not FRONTEND */
+
+
+/* ----------
+ * Encoding checks: return the encoding ID, or -1 on error
+ * ----------
+ */
+int
+pg_valid_client_encoding(const char *name)
+{
+    int            enc;
+
+    if ((enc = pg_char_to_encoding(name)) < 0)
+        return -1;
+
+    if (!PG_VALID_FE_ENCODING(enc))
+        return -1;
+
+    return enc;
+}
+
+int
+pg_valid_server_encoding(const char *name)
+{
+    int            enc;
+
+    if ((enc = pg_char_to_encoding(name)) < 0)
+        return -1;
+
+    if (!PG_VALID_BE_ENCODING(enc))
+        return -1;
+
+    return enc;
+}
+
+int
+pg_valid_server_encoding_id(int encoding)
+{
+    return PG_VALID_BE_ENCODING(encoding);
+}
+
+/* ----------
+ * Remove irrelevant chars from encoding name and convert it to lower case
+ * ----------
+ */
+static char *
+clean_encoding_name(const char *key, char *newkey)
+{
+    const char *p;
+    char       *np;
+
+    for (p = key, np = newkey; *p != '\0'; p++)
+    {
+        if (isalnum((unsigned char) *p))
+        {
+            if (*p >= 'A' && *p <= 'Z')
+                *np++ = *p + 'a' - 'A';
+            else
+                *np++ = *p;
+        }
+    }
+    *np = '\0';
+    return newkey;
+}
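+
+/*
+ * For example, clean_encoding_name("ISO-8859-1", buf),
+ * clean_encoding_name("iso_8859-1", buf) and
+ * clean_encoding_name("Iso8859_1", buf) all yield "iso88591", the form
+ * stored in pg_encname_tbl above.
+ */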
+
+/* ----------
+ * Search encoding by encoding name
+ *
+ * Returns encoding ID, or -1 for error
+ * ----------
+ */
+int
+pg_char_to_encoding(const char *name)
+{
+    unsigned int nel = lengthof(pg_encname_tbl);
+    const pg_encname *base = pg_encname_tbl,
+               *last = base + nel - 1,
+               *position;
+    int            result;
+    char        buff[NAMEDATALEN],
+               *key;
+
+    if (name == NULL || *name == '\0')
+        return -1;
+
+    if (strlen(name) >= NAMEDATALEN)
+    {
+#ifdef FRONTEND
+        fprintf(stderr, "encoding name too long\n");
+        return -1;
+#else
+        ereport(ERROR,
+                (errcode(ERRCODE_NAME_TOO_LONG),
+                 errmsg("encoding name too long")));
+#endif
+    }
+    key = clean_encoding_name(name, buff);
+
+    while (last >= base)
+    {
+        position = base + ((last - base) >> 1);
+        result = key[0] - position->name[0];
+
+        if (result == 0)
+        {
+            result = strcmp(key, position->name);
+            if (result == 0)
+                return position->encoding;
+        }
+        if (result < 0)
+            last = position - 1;
+        else
+            base = position + 1;
+    }
+    return -1;
+}
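+
+/*
+ * For example, pg_char_to_encoding("UTF-8") and pg_char_to_encoding("utf8")
+ * both return PG_UTF8, while an unrecognized name such as "foobar"
+ * returns -1.
+ */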
+
+#ifndef FRONTEND
+Datum
+PG_char_to_encoding(PG_FUNCTION_ARGS)
+{
+    Name        s = PG_GETARG_NAME(0);
+
+    PG_RETURN_INT32(pg_char_to_encoding(NameStr(*s)));
+}
+#endif
+
+const char *
+pg_encoding_to_char(int encoding)
+{
+    if (PG_VALID_ENCODING(encoding))
+    {
+        const pg_enc2name *p = &pg_enc2name_tbl[encoding];
+
+        Assert(encoding == p->encoding);
+        return p->name;
+    }
+    return "";
+}
+
+#ifndef FRONTEND
+Datum
+PG_encoding_to_char(PG_FUNCTION_ARGS)
+{
+    int32        encoding = PG_GETARG_INT32(0);
+    const char *encoding_name = pg_encoding_to_char(encoding);
+
+    return DirectFunctionCall1(namein, CStringGetDatum(encoding_name));
+}
+
+#endif
diff --git a/src/common/saslprep.c b/src/common/saslprep.c
index 2a2449e..7739b81 100644
--- a/src/common/saslprep.c
+++ b/src/common/saslprep.c
@@ -27,12 +27,6 @@

 #include "common/saslprep.h"
 #include "common/unicode_norm.h"
-
-/*
- * Note: The functions in this file depend on functions from
- * src/backend/utils/mb/wchar.c, so in order to use this in frontend
- * code, you will need to link that in, too.
- */
 #include "mb/pg_wchar.h"

 /*
diff --git a/src/common/wchar.c b/src/common/wchar.c
new file mode 100644
index 0000000..74a8823
--- /dev/null
+++ b/src/common/wchar.c
@@ -0,0 +1,2041 @@
+/*-------------------------------------------------------------------------
+ *
+ * wchar.c
+ *      Functions for working with multibyte characters in various encodings.
+ *
+ * Portions Copyright (c) 1998-2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *      src/common/wchar.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
+#include "postgres.h"
+#endif
+
+#include "mb/pg_wchar.h"
+
+
+/*
+ * Operations on multi-byte encodings are driven by a table of helper
+ * functions.
+ *
+ * To add support for an encoding, define mblen(), dsplen() and verifier()
+ * for it.  For server encodings, also define mb2wchar() and wchar2mb()
+ * conversion functions.
+ *
+ * These functions generally assume that their input is validly formed.
+ * The "verifier" functions, further down in the file, have to be more
+ * paranoid.
+ *
+ * We expect that mblen() does not need to examine more than the first byte
+ * of the character to discover the correct length.  GB18030 is an exception
+ * to that rule, though, as it also looks at the second byte.  But even that
+ * behaves in a predictable way, if you only pass the first byte: it will
+ * treat 4-byte encoded characters as two 2-byte encoded characters, which is
+ * good enough for all current uses.
+ *
+ * Note: for the display output of psql to work properly, the return values
+ * of the dsplen functions must conform to the Unicode standard. In particular
+ * the NUL character is zero width and control characters are generally
+ * width -1. It is recommended that non-ASCII encodings refer their ASCII
+ * subset to the ASCII routines to ensure consistency.
+ */
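+
+/*
+ * As an illustration (a sketch, not part of this table's API proper):
+ * given a validly encoded string, a caller can walk it one character at
+ * a time with the pg_encoding_mblen() wrapper declared in mb/pg_wchar.h:
+ *
+ *     const char *p = str;
+ *     while (*p != '\0')
+ *         p += pg_encoding_mblen(PG_UTF8, p);
+ *
+ * The mblen() functions do not validate their input, so this is only
+ * safe on strings that have already passed the verifier.
+ */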
+
+/*
+ * SQL/ASCII
+ */
+static int
+pg_ascii2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
+{
+    int            cnt = 0;
+
+    while (len > 0 && *from)
+    {
+        *to++ = *from++;
+        len--;
+        cnt++;
+    }
+    *to = 0;
+    return cnt;
+}
+
+static int
+pg_ascii_mblen(const unsigned char *s)
+{
+    return 1;
+}
+
+static int
+pg_ascii_dsplen(const unsigned char *s)
+{
+    if (*s == '\0')
+        return 0;
+    if (*s < 0x20 || *s == 0x7f)
+        return -1;
+
+    return 1;
+}
+
+/*
+ * EUC
+ */
+static int
+pg_euc2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
+{
+    int            cnt = 0;
+
+    while (len > 0 && *from)
+    {
+        if (*from == SS2 && len >= 2)    /* JIS X 0201 (so called "1 byte
+                                         * KANA") */
+        {
+            from++;
+            *to = (SS2 << 8) | *from++;
+            len -= 2;
+        }
+        else if (*from == SS3 && len >= 3)    /* JIS X 0212 KANJI */
+        {
+            from++;
+            *to = (SS3 << 16) | (*from++ << 8);
+            *to |= *from++;
+            len -= 3;
+        }
+        else if (IS_HIGHBIT_SET(*from) && len >= 2) /* JIS X 0208 KANJI */
+        {
+            *to = *from++ << 8;
+            *to |= *from++;
+            len -= 2;
+        }
+        else                    /* must be ASCII */
+        {
+            *to = *from++;
+            len--;
+        }
+        to++;
+        cnt++;
+    }
+    *to = 0;
+    return cnt;
+}
+
+static inline int
+pg_euc_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if (*s == SS2)
+        len = 2;
+    else if (*s == SS3)
+        len = 3;
+    else if (IS_HIGHBIT_SET(*s))
+        len = 2;
+    else
+        len = 1;
+    return len;
+}
+
+static inline int
+pg_euc_dsplen(const unsigned char *s)
+{
+    int            len;
+
+    if (*s == SS2)
+        len = 2;
+    else if (*s == SS3)
+        len = 2;
+    else if (IS_HIGHBIT_SET(*s))
+        len = 2;
+    else
+        len = pg_ascii_dsplen(s);
+    return len;
+}
+
+/*
+ * EUC_JP
+ */
+static int
+pg_eucjp2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
+{
+    return pg_euc2wchar_with_len(from, to, len);
+}
+
+static int
+pg_eucjp_mblen(const unsigned char *s)
+{
+    return pg_euc_mblen(s);
+}
+
+static int
+pg_eucjp_dsplen(const unsigned char *s)
+{
+    int            len;
+
+    if (*s == SS2)
+        len = 1;
+    else if (*s == SS3)
+        len = 2;
+    else if (IS_HIGHBIT_SET(*s))
+        len = 2;
+    else
+        len = pg_ascii_dsplen(s);
+    return len;
+}
+
+/*
+ * EUC_KR
+ */
+static int
+pg_euckr2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
+{
+    return pg_euc2wchar_with_len(from, to, len);
+}
+
+static int
+pg_euckr_mblen(const unsigned char *s)
+{
+    return pg_euc_mblen(s);
+}
+
+static int
+pg_euckr_dsplen(const unsigned char *s)
+{
+    return pg_euc_dsplen(s);
+}
+
+/*
+ * EUC_CN
+ */
+static int
+pg_euccn2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
+{
+    int            cnt = 0;
+
+    while (len > 0 && *from)
+    {
+        if (*from == SS2 && len >= 3)    /* code set 2 (unused?) */
+        {
+            from++;
+            *to = (SS2 << 16) | (*from++ << 8);
+            *to |= *from++;
+            len -= 3;
+        }
+        else if (*from == SS3 && len >= 3)    /* code set 3 (unused ?) */
+        {
+            from++;
+            *to = (SS3 << 16) | (*from++ << 8);
+            *to |= *from++;
+            len -= 3;
+        }
+        else if (IS_HIGHBIT_SET(*from) && len >= 2) /* code set 1 */
+        {
+            *to = *from++ << 8;
+            *to |= *from++;
+            len -= 2;
+        }
+        else
+        {
+            *to = *from++;
+            len--;
+        }
+        to++;
+        cnt++;
+    }
+    *to = 0;
+    return cnt;
+}
+
+static int
+pg_euccn_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if (IS_HIGHBIT_SET(*s))
+        len = 2;
+    else
+        len = 1;
+    return len;
+}
+
+static int
+pg_euccn_dsplen(const unsigned char *s)
+{
+    int            len;
+
+    if (IS_HIGHBIT_SET(*s))
+        len = 2;
+    else
+        len = pg_ascii_dsplen(s);
+    return len;
+}
+
+/*
+ * EUC_TW
+ */
+static int
+pg_euctw2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
+{
+    int            cnt = 0;
+
+    while (len > 0 && *from)
+    {
+        if (*from == SS2 && len >= 4)    /* code set 2 */
+        {
+            from++;
+            *to = (((uint32) SS2) << 24) | (*from++ << 16);
+            *to |= *from++ << 8;
+            *to |= *from++;
+            len -= 4;
+        }
+        else if (*from == SS3 && len >= 3)    /* code set 3 (unused?) */
+        {
+            from++;
+            *to = (SS3 << 16) | (*from++ << 8);
+            *to |= *from++;
+            len -= 3;
+        }
+        else if (IS_HIGHBIT_SET(*from) && len >= 2) /* code set 2 */
+        {
+            *to = *from++ << 8;
+            *to |= *from++;
+            len -= 2;
+        }
+        else
+        {
+            *to = *from++;
+            len--;
+        }
+        to++;
+        cnt++;
+    }
+    *to = 0;
+    return cnt;
+}
+
+static int
+pg_euctw_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if (*s == SS2)
+        len = 4;
+    else if (*s == SS3)
+        len = 3;
+    else if (IS_HIGHBIT_SET(*s))
+        len = 2;
+    else
+        len = 1;
+    return len;
+}
+
+static int
+pg_euctw_dsplen(const unsigned char *s)
+{
+    int            len;
+
+    if (*s == SS2)
+        len = 2;
+    else if (*s == SS3)
+        len = 2;
+    else if (IS_HIGHBIT_SET(*s))
+        len = 2;
+    else
+        len = pg_ascii_dsplen(s);
+    return len;
+}
+
+/*
+ * Convert pg_wchar to EUC_* encoding.
+ * caller must allocate enough space for "to", including a trailing zero!
+ * len: length of from.
+ * "from" not necessarily null terminated.
+ */
+static int
+pg_wchar2euc_with_len(const pg_wchar *from, unsigned char *to, int len)
+{
+    int            cnt = 0;
+
+    while (len > 0 && *from)
+    {
+        unsigned char c;
+
+        if ((c = (*from >> 24)))
+        {
+            *to++ = c;
+            *to++ = (*from >> 16) & 0xff;
+            *to++ = (*from >> 8) & 0xff;
+            *to++ = *from & 0xff;
+            cnt += 4;
+        }
+        else if ((c = (*from >> 16)))
+        {
+            *to++ = c;
+            *to++ = (*from >> 8) & 0xff;
+            *to++ = *from & 0xff;
+            cnt += 3;
+        }
+        else if ((c = (*from >> 8)))
+        {
+            *to++ = c;
+            *to++ = *from & 0xff;
+            cnt += 2;
+        }
+        else
+        {
+            *to++ = *from;
+            cnt++;
+        }
+        from++;
+        len--;
+    }
+    *to = 0;
+    return cnt;
+}
+
+
+/*
+ * JOHAB
+ */
+static int
+pg_johab_mblen(const unsigned char *s)
+{
+    return pg_euc_mblen(s);
+}
+
+static int
+pg_johab_dsplen(const unsigned char *s)
+{
+    return pg_euc_dsplen(s);
+}
+
+/*
+ * convert UTF8 string to pg_wchar (UCS-4)
+ * caller must allocate enough space for "to", including a trailing zero!
+ * len: length of from.
+ * "from" not necessarily null terminated.
+ */
+static int
+pg_utf2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
+{
+    int            cnt = 0;
+    uint32        c1,
+                c2,
+                c3,
+                c4;
+
+    while (len > 0 && *from)
+    {
+        if ((*from & 0x80) == 0)
+        {
+            *to = *from++;
+            len--;
+        }
+        else if ((*from & 0xe0) == 0xc0)
+        {
+            if (len < 2)
+                break;            /* drop trailing incomplete char */
+            c1 = *from++ & 0x1f;
+            c2 = *from++ & 0x3f;
+            *to = (c1 << 6) | c2;
+            len -= 2;
+        }
+        else if ((*from & 0xf0) == 0xe0)
+        {
+            if (len < 3)
+                break;            /* drop trailing incomplete char */
+            c1 = *from++ & 0x0f;
+            c2 = *from++ & 0x3f;
+            c3 = *from++ & 0x3f;
+            *to = (c1 << 12) | (c2 << 6) | c3;
+            len -= 3;
+        }
+        else if ((*from & 0xf8) == 0xf0)
+        {
+            if (len < 4)
+                break;            /* drop trailing incomplete char */
+            c1 = *from++ & 0x07;
+            c2 = *from++ & 0x3f;
+            c3 = *from++ & 0x3f;
+            c4 = *from++ & 0x3f;
+            *to = (c1 << 18) | (c2 << 12) | (c3 << 6) | c4;
+            len -= 4;
+        }
+        else
+        {
+            /* treat a bogus char as length 1; not ours to raise error */
+            *to = *from++;
+            len--;
+        }
+        to++;
+        cnt++;
+    }
+    *to = 0;
+    return cnt;
+}
+
+
+/*
+ * Map a Unicode code point to UTF-8.  utf8string must have 4 bytes of
+ * space allocated.
+ */
+unsigned char *
+unicode_to_utf8(pg_wchar c, unsigned char *utf8string)
+{
+    if (c <= 0x7F)
+    {
+        utf8string[0] = c;
+    }
+    else if (c <= 0x7FF)
+    {
+        utf8string[0] = 0xC0 | ((c >> 6) & 0x1F);
+        utf8string[1] = 0x80 | (c & 0x3F);
+    }
+    else if (c <= 0xFFFF)
+    {
+        utf8string[0] = 0xE0 | ((c >> 12) & 0x0F);
+        utf8string[1] = 0x80 | ((c >> 6) & 0x3F);
+        utf8string[2] = 0x80 | (c & 0x3F);
+    }
+    else
+    {
+        utf8string[0] = 0xF0 | ((c >> 18) & 0x07);
+        utf8string[1] = 0x80 | ((c >> 12) & 0x3F);
+        utf8string[2] = 0x80 | ((c >> 6) & 0x3F);
+        utf8string[3] = 0x80 | (c & 0x3F);
+    }
+
+    return utf8string;
+}
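+
+/*
+ * For example, unicode_to_utf8(0x00E9, buf) stores the two bytes 0xC3 0xA9
+ * in buf; note that no NUL terminator is appended.
+ */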
+
+/*
+ * Trivial conversion from pg_wchar to UTF-8.
+ * caller should allocate enough space for "to"
+ * len: length of from.
+ * "from" not necessarily null terminated.
+ */
+static int
+pg_wchar2utf_with_len(const pg_wchar *from, unsigned char *to, int len)
+{
+    int            cnt = 0;
+
+    while (len > 0 && *from)
+    {
+        int            char_len;
+
+        unicode_to_utf8(*from, to);
+        char_len = pg_utf_mblen(to);
+        cnt += char_len;
+        to += char_len;
+        from++;
+        len--;
+    }
+    *to = 0;
+    return cnt;
+}
+
+/*
+ * Return the byte length of a UTF8 character pointed to by s
+ *
+ * Note: in the current implementation we do not support UTF8 sequences
+ * of more than 4 bytes; hence do NOT return a value larger than 4.
+ * We return "1" for any leading byte that is either flat-out illegal or
+ * indicates a length larger than we support.
+ *
+ * pg_utf2wchar_with_len(), utf8_to_unicode(), pg_utf8_islegal(), and perhaps
+ * other places would need to be fixed to change this.
+ */
+int
+pg_utf_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if ((*s & 0x80) == 0)
+        len = 1;
+    else if ((*s & 0xe0) == 0xc0)
+        len = 2;
+    else if ((*s & 0xf0) == 0xe0)
+        len = 3;
+    else if ((*s & 0xf8) == 0xf0)
+        len = 4;
+#ifdef NOT_USED
+    else if ((*s & 0xfc) == 0xf8)
+        len = 5;
+    else if ((*s & 0xfe) == 0xfc)
+        len = 6;
+#endif
+    else
+        len = 1;
+    return len;
+}
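+
+/*
+ * Examples: pg_utf_mblen() returns 2 for a lead byte of 0xC3, 3 for 0xE2,
+ * 4 for 0xF0, and 1 for an illegal lead byte such as 0xFF.
+ */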
+
+/*
+ * This is an implementation of wcwidth() and wcswidth() as defined in
+ * "The Single UNIX Specification, Version 2, The Open Group, 1997"
+ * <http://www.unix.org/online.html>
+ *
+ * Markus Kuhn -- 2001-09-08 -- public domain
+ *
+ * customised for PostgreSQL
+ *
+ * original available at: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
+ */
+
+struct mbinterval
+{
+    unsigned short first;
+    unsigned short last;
+};
+
+/* auxiliary function for binary search in interval table */
+static int
+mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
+{
+    int            min = 0;
+    int            mid;
+
+    if (ucs < table[0].first || ucs > table[max].last)
+        return 0;
+    while (max >= min)
+    {
+        mid = (min + max) / 2;
+        if (ucs > table[mid].last)
+            min = mid + 1;
+        else if (ucs < table[mid].first)
+            max = mid - 1;
+        else
+            return 1;
+    }
+
+    return 0;
+}
+
+
+/* The following functions define the column width of an ISO 10646
+ * character as follows:
+ *
+ *      - The null character (U+0000) has a column width of 0.
+ *
+ *      - Other C0/C1 control characters and DEL will lead to a return
+ *        value of -1.
+ *
+ *      - Non-spacing and enclosing combining characters (general
+ *        category code Mn or Me in the Unicode database) have a
+ *        column width of 0.
+ *
+ *      - Other format characters (general category code Cf in the Unicode
+ *        database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
+ *
+ *      - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
+ *        have a column width of 0.
+ *
+ *      - Spacing characters in the East Asian Wide (W) or East Asian
+ *        FullWidth (F) category as defined in Unicode Technical
+ *        Report #11 have a column width of 2.
+ *
+ *      - All remaining characters (including all printable
+ *        ISO 8859-1 and WGL4 characters, Unicode control characters,
+ *        etc.) have a column width of 1.
+ *
+ * This implementation assumes that wchar_t characters are encoded
+ * in ISO 10646.
+ */
+
+static int
+ucs_wcwidth(pg_wchar ucs)
+{
+#include "common/unicode_combining_table.h"
+
+    /* test for 8-bit control characters */
+    if (ucs == 0)
+        return 0;
+
+    if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff)
+        return -1;
+
+    /* binary search in table of non-spacing characters */
+    if (mbbisearch(ucs, combining,
+                   sizeof(combining) / sizeof(struct mbinterval) - 1))
+        return 0;
+
+    /*
+     * if we arrive here, ucs is not a combining or C0/C1 control character
+     */
+
+    return 1 +
+        (ucs >= 0x1100 &&
+         (ucs <= 0x115f ||        /* Hangul Jamo init. consonants */
+          (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
+           ucs != 0x303f) ||    /* CJK ... Yi */
+          (ucs >= 0xac00 && ucs <= 0xd7a3) ||    /* Hangul Syllables */
+          (ucs >= 0xf900 && ucs <= 0xfaff) ||    /* CJK Compatibility
+                                                 * Ideographs */
+          (ucs >= 0xfe30 && ucs <= 0xfe6f) ||    /* CJK Compatibility Forms */
+          (ucs >= 0xff00 && ucs <= 0xff5f) ||    /* Fullwidth Forms */
+          (ucs >= 0xffe0 && ucs <= 0xffe6) ||
+          (ucs >= 0x20000 && ucs <= 0x2ffff)));
+}
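+
+/*
+ * For instance, ucs_wcwidth() reports 'A' (U+0041) as width 1, HIRAGANA
+ * LETTER A (U+3042) as width 2, COMBINING ACUTE ACCENT (U+0301) as width 0
+ * (assuming it appears in the combining table), and the BEL control
+ * character (U+0007) as width -1.
+ */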
+
+/*
+ * Convert a UTF-8 character to a Unicode code point.
+ * This is a one-character version of pg_utf2wchar_with_len.
+ *
+ * No error checks here, c must point to a long-enough string.
+ */
+pg_wchar
+utf8_to_unicode(const unsigned char *c)
+{
+    if ((*c & 0x80) == 0)
+        return (pg_wchar) c[0];
+    else if ((*c & 0xe0) == 0xc0)
+        return (pg_wchar) (((c[0] & 0x1f) << 6) |
+                           (c[1] & 0x3f));
+    else if ((*c & 0xf0) == 0xe0)
+        return (pg_wchar) (((c[0] & 0x0f) << 12) |
+                           ((c[1] & 0x3f) << 6) |
+                           (c[2] & 0x3f));
+    else if ((*c & 0xf8) == 0xf0)
+        return (pg_wchar) (((c[0] & 0x07) << 18) |
+                           ((c[1] & 0x3f) << 12) |
+                           ((c[2] & 0x3f) << 6) |
+                           (c[3] & 0x3f));
+    else
+        /* that is an invalid code on purpose */
+        return 0xffffffff;
+}
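+
+/*
+ * This is the inverse of unicode_to_utf8() above: for example, applied to
+ * the bytes 0xC3 0xA9 it returns 0x00E9.
+ */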
+
+static int
+pg_utf_dsplen(const unsigned char *s)
+{
+    return ucs_wcwidth(utf8_to_unicode(s));
+}
+
+/*
+ * convert mule internal code to pg_wchar
+ * caller should allocate enough space for "to"
+ * len: length of from.
+ * "from" not necessarily null terminated.
+ */
+static int
+pg_mule2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
+{
+    int            cnt = 0;
+
+    while (len > 0 && *from)
+    {
+        if (IS_LC1(*from) && len >= 2)
+        {
+            *to = *from++ << 16;
+            *to |= *from++;
+            len -= 2;
+        }
+        else if (IS_LCPRV1(*from) && len >= 3)
+        {
+            from++;
+            *to = *from++ << 16;
+            *to |= *from++;
+            len -= 3;
+        }
+        else if (IS_LC2(*from) && len >= 3)
+        {
+            *to = *from++ << 16;
+            *to |= *from++ << 8;
+            *to |= *from++;
+            len -= 3;
+        }
+        else if (IS_LCPRV2(*from) && len >= 4)
+        {
+            from++;
+            *to = *from++ << 16;
+            *to |= *from++ << 8;
+            *to |= *from++;
+            len -= 4;
+        }
+        else
+        {                        /* assume ASCII */
+            *to = (unsigned char) *from++;
+            len--;
+        }
+        to++;
+        cnt++;
+    }
+    *to = 0;
+    return cnt;
+}
+
+/*
+ * convert pg_wchar to mule internal code
+ * caller should allocate enough space for "to"
+ * len: length of from.
+ * "from" not necessarily null terminated.
+ */
+static int
+pg_wchar2mule_with_len(const pg_wchar *from, unsigned char *to, int len)
+{
+    int            cnt = 0;
+
+    while (len > 0 && *from)
+    {
+        unsigned char lb;
+
+        lb = (*from >> 16) & 0xff;
+        if (IS_LC1(lb))
+        {
+            *to++ = lb;
+            *to++ = *from & 0xff;
+            cnt += 2;
+        }
+        else if (IS_LC2(lb))
+        {
+            *to++ = lb;
+            *to++ = (*from >> 8) & 0xff;
+            *to++ = *from & 0xff;
+            cnt += 3;
+        }
+        else if (IS_LCPRV1_A_RANGE(lb))
+        {
+            *to++ = LCPRV1_A;
+            *to++ = lb;
+            *to++ = *from & 0xff;
+            cnt += 3;
+        }
+        else if (IS_LCPRV1_B_RANGE(lb))
+        {
+            *to++ = LCPRV1_B;
+            *to++ = lb;
+            *to++ = *from & 0xff;
+            cnt += 3;
+        }
+        else if (IS_LCPRV2_A_RANGE(lb))
+        {
+            *to++ = LCPRV2_A;
+            *to++ = lb;
+            *to++ = (*from >> 8) & 0xff;
+            *to++ = *from & 0xff;
+            cnt += 4;
+        }
+        else if (IS_LCPRV2_B_RANGE(lb))
+        {
+            *to++ = LCPRV2_B;
+            *to++ = lb;
+            *to++ = (*from >> 8) & 0xff;
+            *to++ = *from & 0xff;
+            cnt += 4;
+        }
+        else
+        {
+            *to++ = *from & 0xff;
+            cnt += 1;
+        }
+        from++;
+        len--;
+    }
+    *to = 0;
+    return cnt;
+}
+
+int
+pg_mule_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if (IS_LC1(*s))
+        len = 2;
+    else if (IS_LCPRV1(*s))
+        len = 3;
+    else if (IS_LC2(*s))
+        len = 3;
+    else if (IS_LCPRV2(*s))
+        len = 4;
+    else
+        len = 1;                /* assume ASCII */
+    return len;
+}
+
+static int
+pg_mule_dsplen(const unsigned char *s)
+{
+    int            len;
+
+    /*
+     * Note: it's not really appropriate to assume that all multibyte charsets
+     * are double-wide on screen.  But this seems an okay approximation for
+     * the MULE charsets we currently support.
+     */
+
+    if (IS_LC1(*s))
+        len = 1;
+    else if (IS_LCPRV1(*s))
+        len = 1;
+    else if (IS_LC2(*s))
+        len = 2;
+    else if (IS_LCPRV2(*s))
+        len = 2;
+    else
+        len = 1;                /* assume ASCII */
+
+    return len;
+}
+
+/*
+ * ISO8859-1
+ */
+static int
+pg_latin12wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
+{
+    int            cnt = 0;
+
+    while (len > 0 && *from)
+    {
+        *to++ = *from++;
+        len--;
+        cnt++;
+    }
+    *to = 0;
+    return cnt;
+}
+
+/*
+ * Trivial conversion from pg_wchar to single byte encoding. Just ignores
+ * high bits.
+ * caller should allocate enough space for "to"
+ * len: length of from.
+ * "from" not necessarily null terminated.
+ */
+static int
+pg_wchar2single_with_len(const pg_wchar *from, unsigned char *to, int len)
+{
+    int            cnt = 0;
+
+    while (len > 0 && *from)
+    {
+        *to++ = *from++;
+        len--;
+        cnt++;
+    }
+    *to = 0;
+    return cnt;
+}
+
+static int
+pg_latin1_mblen(const unsigned char *s)
+{
+    return 1;
+}
+
+static int
+pg_latin1_dsplen(const unsigned char *s)
+{
+    return pg_ascii_dsplen(s);
+}
+
+/*
+ * SJIS
+ */
+static int
+pg_sjis_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if (*s >= 0xa1 && *s <= 0xdf)
+        len = 1;                /* 1 byte kana? */
+    else if (IS_HIGHBIT_SET(*s))
+        len = 2;                /* kanji? */
+    else
+        len = 1;                /* should be ASCII */
+    return len;
+}
+
+static int
+pg_sjis_dsplen(const unsigned char *s)
+{
+    int            len;
+
+    if (*s >= 0xa1 && *s <= 0xdf)
+        len = 1;                /* 1 byte kana? */
+    else if (IS_HIGHBIT_SET(*s))
+        len = 2;                /* kanji? */
+    else
+        len = pg_ascii_dsplen(s);    /* should be ASCII */
+    return len;
+}
+
+/*
+ * Big5
+ */
+static int
+pg_big5_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if (IS_HIGHBIT_SET(*s))
+        len = 2;                /* kanji? */
+    else
+        len = 1;                /* should be ASCII */
+    return len;
+}
+
+static int
+pg_big5_dsplen(const unsigned char *s)
+{
+    int            len;
+
+    if (IS_HIGHBIT_SET(*s))
+        len = 2;                /* kanji? */
+    else
+        len = pg_ascii_dsplen(s);    /* should be ASCII */
+    return len;
+}
+
+/*
+ * GBK
+ */
+static int
+pg_gbk_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if (IS_HIGHBIT_SET(*s))
+        len = 2;                /* kanji? */
+    else
+        len = 1;                /* should be ASCII */
+    return len;
+}
+
+static int
+pg_gbk_dsplen(const unsigned char *s)
+{
+    int            len;
+
+    if (IS_HIGHBIT_SET(*s))
+        len = 2;                /* kanji? */
+    else
+        len = pg_ascii_dsplen(s);    /* should be ASCII */
+    return len;
+}
+
+/*
+ * UHC
+ */
+static int
+pg_uhc_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if (IS_HIGHBIT_SET(*s))
+        len = 2;                /* 2byte? */
+    else
+        len = 1;                /* should be ASCII */
+    return len;
+}
+
+static int
+pg_uhc_dsplen(const unsigned char *s)
+{
+    int            len;
+
+    if (IS_HIGHBIT_SET(*s))
+        len = 2;                /* 2byte? */
+    else
+        len = pg_ascii_dsplen(s);    /* should be ASCII */
+    return len;
+}
+
+/*
+ * GB18030
+ *    Added by Bill Huang <bhuang@redhat.com>, <bill_huanghb@ybb.ne.jp>
+ */
+
+/*
+ * Unlike all other mblen() functions, this also looks at the second byte of
+ * the input.  However, if you only pass the first byte of a multi-byte
+ * string, and \0 as the second byte, this still works in a predictable way:
+ * a 4-byte character will be reported as two 2-byte characters.  That's
+ * enough for all current uses, as a client-only encoding.  It works that
+ * way, because in any valid 4-byte GB18030-encoded character, the third and
+ * fourth byte look like a 2-byte encoded character, when looked at
+ * separately.
+ */
+static int
+pg_gb18030_mblen(const unsigned char *s)
+{
+    int            len;
+
+    if (!IS_HIGHBIT_SET(*s))
+        len = 1;                /* ASCII */
+    else if (*(s + 1) >= 0x30 && *(s + 1) <= 0x39)
+        len = 4;
+    else
+        len = 2;
+    return len;
+}
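+
+/*
+ * Worked example: for the 4-byte GB18030 character 0x81 0x30 0x81 0x30,
+ * pg_gb18030_mblen() on the first byte returns 4, because the second byte
+ * is a digit.  Called on the first byte alone, with '\0' as the second
+ * byte, it returns 2 instead; the trailing pair 0x81 0x30 then also reads
+ * as a 2-byte character, which is the predictable behavior described
+ * above.
+ */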
+
+static int
+pg_gb18030_dsplen(const unsigned char *s)
+{
+    int            len;
+
+    if (IS_HIGHBIT_SET(*s))
+        len = 2;
+    else
+        len = pg_ascii_dsplen(s);    /* ASCII */
+    return len;
+}
+
+/*
+ *-------------------------------------------------------------------
+ * multibyte sequence validators
+ *
+ * These functions accept "s", a pointer to the first byte of a string,
+ * and "len", the remaining length of the string.  If there is a validly
+ * encoded character beginning at *s, return its length in bytes; else
+ * return -1.
+ *
+ * The functions can assume that len > 0 and that *s != '\0', but they must
+ * test for and reject zeroes in any additional bytes of a multibyte character.
+ *
+ * Note that this definition allows the function for a single-byte
+ * encoding to be just "return 1".
+ *-------------------------------------------------------------------
+ */
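+
+/*
+ * For example, for UTF-8 the verifier accepts the two-byte sequence
+ * 0xC3 0xA9 (returning 2) but rejects the overlong sequence 0xC0 0xAF
+ * (returning -1), even though pg_utf_mblen() reports a length of 2 for
+ * both.
+ */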
+
+static int
+pg_ascii_verifier(const unsigned char *s, int len)
+{
+    return 1;
+}
+
+#define IS_EUC_RANGE_VALID(c)    ((c) >= 0xa1 && (c) <= 0xfe)
+
+static int
+pg_eucjp_verifier(const unsigned char *s, int len)
+{
+    int            l;
+    unsigned char c1,
+                c2;
+
+    c1 = *s++;
+
+    switch (c1)
+    {
+        case SS2:                /* JIS X 0201 */
+            l = 2;
+            if (l > len)
+                return -1;
+            c2 = *s++;
+            if (c2 < 0xa1 || c2 > 0xdf)
+                return -1;
+            break;
+
+        case SS3:                /* JIS X 0212 */
+            l = 3;
+            if (l > len)
+                return -1;
+            c2 = *s++;
+            if (!IS_EUC_RANGE_VALID(c2))
+                return -1;
+            c2 = *s++;
+            if (!IS_EUC_RANGE_VALID(c2))
+                return -1;
+            break;
+
+        default:
+            if (IS_HIGHBIT_SET(c1)) /* JIS X 0208? */
+            {
+                l = 2;
+                if (l > len)
+                    return -1;
+                if (!IS_EUC_RANGE_VALID(c1))
+                    return -1;
+                c2 = *s++;
+                if (!IS_EUC_RANGE_VALID(c2))
+                    return -1;
+            }
+            else
+                /* must be ASCII */
+            {
+                l = 1;
+            }
+            break;
+    }
+
+    return l;
+}
+
+static int
+pg_euckr_verifier(const unsigned char *s, int len)
+{
+    int            l;
+    unsigned char c1,
+                c2;
+
+    c1 = *s++;
+
+    if (IS_HIGHBIT_SET(c1))
+    {
+        l = 2;
+        if (l > len)
+            return -1;
+        if (!IS_EUC_RANGE_VALID(c1))
+            return -1;
+        c2 = *s++;
+        if (!IS_EUC_RANGE_VALID(c2))
+            return -1;
+    }
+    else
+        /* must be ASCII */
+    {
+        l = 1;
+    }
+
+    return l;
+}
+
+/* EUC-CN byte sequences are exactly the same as EUC-KR */
+#define pg_euccn_verifier    pg_euckr_verifier
+
+static int
+pg_euctw_verifier(const unsigned char *s, int len)
+{
+    int            l;
+    unsigned char c1,
+                c2;
+
+    c1 = *s++;
+
+    switch (c1)
+    {
+        case SS2:                /* CNS 11643 Plane 1-7 */
+            l = 4;
+            if (l > len)
+                return -1;
+            c2 = *s++;
+            if (c2 < 0xa1 || c2 > 0xa7)
+                return -1;
+            c2 = *s++;
+            if (!IS_EUC_RANGE_VALID(c2))
+                return -1;
+            c2 = *s++;
+            if (!IS_EUC_RANGE_VALID(c2))
+                return -1;
+            break;
+
+        case SS3:                /* unused */
+            return -1;
+
+        default:
+            if (IS_HIGHBIT_SET(c1)) /* CNS 11643 Plane 1 */
+            {
+                l = 2;
+                if (l > len)
+                    return -1;
+                /* no further range check on c1? */
+                c2 = *s++;
+                if (!IS_EUC_RANGE_VALID(c2))
+                    return -1;
+            }
+            else
+                /* must be ASCII */
+            {
+                l = 1;
+            }
+            break;
+    }
+    return l;
+}
+
+static int
+pg_johab_verifier(const unsigned char *s, int len)
+{
+    int            l,
+                mbl;
+    unsigned char c;
+
+    l = mbl = pg_johab_mblen(s);
+
+    if (len < l)
+        return -1;
+
+    if (!IS_HIGHBIT_SET(*s))
+        return mbl;
+
+    while (--l > 0)
+    {
+        c = *++s;
+        if (!IS_EUC_RANGE_VALID(c))
+            return -1;
+    }
+    return mbl;
+}
+
+static int
+pg_mule_verifier(const unsigned char *s, int len)
+{
+    int            l,
+                mbl;
+    unsigned char c;
+
+    l = mbl = pg_mule_mblen(s);
+
+    if (len < l)
+        return -1;
+
+    while (--l > 0)
+    {
+        c = *++s;
+        if (!IS_HIGHBIT_SET(c))
+            return -1;
+    }
+    return mbl;
+}
+
+static int
+pg_latin1_verifier(const unsigned char *s, int len)
+{
+    return 1;
+}
+
+static int
+pg_sjis_verifier(const unsigned char *s, int len)
+{
+    int            l,
+                mbl;
+    unsigned char c1,
+                c2;
+
+    l = mbl = pg_sjis_mblen(s);
+
+    if (len < l)
+        return -1;
+
+    if (l == 1)                    /* pg_sjis_mblen already verified it */
+        return mbl;
+
+    c1 = *s++;
+    c2 = *s;
+    if (!ISSJISHEAD(c1) || !ISSJISTAIL(c2))
+        return -1;
+    return mbl;
+}
+
+static int
+pg_big5_verifier(const unsigned char *s, int len)
+{
+    int            l,
+                mbl;
+
+    l = mbl = pg_big5_mblen(s);
+
+    if (len < l)
+        return -1;
+
+    while (--l > 0)
+    {
+        if (*++s == '\0')
+            return -1;
+    }
+
+    return mbl;
+}
+
+static int
+pg_gbk_verifier(const unsigned char *s, int len)
+{
+    int            l,
+                mbl;
+
+    l = mbl = pg_gbk_mblen(s);
+
+    if (len < l)
+        return -1;
+
+    while (--l > 0)
+    {
+        if (*++s == '\0')
+            return -1;
+    }
+
+    return mbl;
+}
+
+static int
+pg_uhc_verifier(const unsigned char *s, int len)
+{
+    int            l,
+                mbl;
+
+    l = mbl = pg_uhc_mblen(s);
+
+    if (len < l)
+        return -1;
+
+    while (--l > 0)
+    {
+        if (*++s == '\0')
+            return -1;
+    }
+
+    return mbl;
+}
+
+static int
+pg_gb18030_verifier(const unsigned char *s, int len)
+{
+    int            l;
+
+    if (!IS_HIGHBIT_SET(*s))
+        l = 1;                    /* ASCII */
+    else if (len >= 4 && *(s + 1) >= 0x30 && *(s + 1) <= 0x39)
+    {
+        /* Should be 4-byte, validate remaining bytes */
+        if (*s >= 0x81 && *s <= 0xfe &&
+            *(s + 2) >= 0x81 && *(s + 2) <= 0xfe &&
+            *(s + 3) >= 0x30 && *(s + 3) <= 0x39)
+            l = 4;
+        else
+            l = -1;
+    }
+    else if (len >= 2 && *s >= 0x81 && *s <= 0xfe)
+    {
+        /* Should be 2-byte, validate */
+        if ((*(s + 1) >= 0x40 && *(s + 1) <= 0x7e) ||
+            (*(s + 1) >= 0x80 && *(s + 1) <= 0xfe))
+            l = 2;
+        else
+            l = -1;
+    }
+    else
+        l = -1;
+    return l;
+}
+
+static int
+pg_utf8_verifier(const unsigned char *s, int len)
+{
+    int            l = pg_utf_mblen(s);
+
+    if (len < l)
+        return -1;
+
+    if (!pg_utf8_islegal(s, l))
+        return -1;
+
+    return l;
+}
+
+/*
+ * Check for validity of a single UTF-8 encoded character
+ *
+ * This directly implements the rules in RFC3629.  The bizarre-looking
+ * restrictions on the second byte are meant to ensure that there isn't
+ * more than one encoding of a given Unicode code point; that is,
+ * you may not use a longer-than-necessary byte sequence with high order
+ * zero bits to represent a character that would fit in fewer bytes.
+ * To do otherwise is to create security hazards (eg, create an apparent
+ * non-ASCII character that decodes to plain ASCII).
+ *
+ * length is assumed to have been obtained by pg_utf_mblen(), and the
+ * caller must have checked that that many bytes are present in the buffer.
+ */
+bool
+pg_utf8_islegal(const unsigned char *source, int length)
+{
+    unsigned char a;
+
+    switch (length)
+    {
+        default:
+            /* reject lengths 5 and 6 for now */
+            return false;
+        case 4:
+            a = source[3];
+            if (a < 0x80 || a > 0xBF)
+                return false;
+            /* FALL THRU */
+        case 3:
+            a = source[2];
+            if (a < 0x80 || a > 0xBF)
+                return false;
+            /* FALL THRU */
+        case 2:
+            a = source[1];
+            switch (*source)
+            {
+                case 0xE0:
+                    if (a < 0xA0 || a > 0xBF)
+                        return false;
+                    break;
+                case 0xED:
+                    if (a < 0x80 || a > 0x9F)
+                        return false;
+                    break;
+                case 0xF0:
+                    if (a < 0x90 || a > 0xBF)
+                        return false;
+                    break;
+                case 0xF4:
+                    if (a < 0x80 || a > 0x8F)
+                        return false;
+                    break;
+                default:
+                    if (a < 0x80 || a > 0xBF)
+                        return false;
+                    break;
+            }
+            /* FALL THRU */
+        case 1:
+            a = *source;
+            if (a >= 0x80 && a < 0xC2)
+                return false;
+            if (a > 0xF4)
+                return false;
+            break;
+    }
+    return true;
+}
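+
+/*
+ * Examples of sequences rejected here: the overlong encoding 0xC0 0xAF
+ * (lead byte below 0xC2), and 0xED 0xA0 0x80, which would decode to the
+ * UTF-16 surrogate code point U+D800.
+ */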
+
+#ifndef FRONTEND
+
+/*
+ * Generic character incrementer function.
+ *
+ * Not knowing anything about the properties of the encoding in use, we just
+ * keep incrementing the last byte until we get a validly-encoded result,
+ * or we run out of values to try.  We don't bother to try incrementing
+ * higher-order bytes, so there's no growth in runtime for wider characters.
+ * (If we did try to do that, we'd need to consider the likelihood that 255
+ * is not a valid final byte in the encoding.)
+ */
+static bool
+pg_generic_charinc(unsigned char *charptr, int len)
+{
+    unsigned char *lastbyte = charptr + len - 1;
+    mbverifier    mbverify;
+
+    /* We can just invoke the character verifier directly. */
+    mbverify = pg_wchar_table[GetDatabaseEncoding()].mbverify;
+
+    while (*lastbyte < (unsigned char) 255)
+    {
+        (*lastbyte)++;
+        if ((*mbverify) (charptr, len) == len)
+            return true;
+    }
+
+    return false;
+}
+
+/*
+ * UTF-8 character incrementer function.
+ *
+ * For a one-byte character less than 0x7F, we just increment the byte.
+ *
+ * For a multibyte character, every byte but the first must fall between 0x80
+ * and 0xBF; and the first byte must be between 0xC0 and 0xF4.  We increment
+ * the last byte that's not already at its maximum value.  If we can't find a
+ * byte that's less than the maximum allowable value, we simply fail.  We also
+ * need some special-case logic to skip regions used for surrogate pair
+ * handling, as those should not occur in valid UTF-8.
+ *
+ * Note that we don't reset lower-order bytes back to their minimums, since
+ * we can't afford to make an exhaustive search (see make_greater_string).
+ */
+static bool
+pg_utf8_increment(unsigned char *charptr, int length)
+{
+    unsigned char a;
+    unsigned char limit;
+
+    switch (length)
+    {
+        default:
+            /* reject lengths 5 and 6 for now */
+            return false;
+        case 4:
+            a = charptr[3];
+            if (a < 0xBF)
+            {
+                charptr[3]++;
+                break;
+            }
+            /* FALL THRU */
+        case 3:
+            a = charptr[2];
+            if (a < 0xBF)
+            {
+                charptr[2]++;
+                break;
+            }
+            /* FALL THRU */
+        case 2:
+            a = charptr[1];
+            switch (*charptr)
+            {
+                case 0xED:
+                    limit = 0x9F;
+                    break;
+                case 0xF4:
+                    limit = 0x8F;
+                    break;
+                default:
+                    limit = 0xBF;
+                    break;
+            }
+            if (a < limit)
+            {
+                charptr[1]++;
+                break;
+            }
+            /* FALL THRU */
+        case 1:
+            a = *charptr;
+            if (a == 0x7F || a == 0xDF || a == 0xEF || a == 0xF4)
+                return false;
+            charptr[0]++;
+            break;
+    }
+
+    return true;
+}
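+
+/*
+ * For example, incrementing the two-byte character 0xC3 0xBF (U+00FF):
+ * the second byte is already at its limit 0xBF, so control falls through
+ * and the first byte is bumped instead, yielding 0xC4 0xBF (U+013F).  As
+ * noted above, the lower-order byte is deliberately not reset.
+ */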
+
+/*
+ * EUC-JP character incrementer function.
+ *
+ * If the sequence starts with SS2 (0x8e), it must be a two-byte sequence
+ * representing JIS X 0201 characters with the second byte ranging between
+ * 0xa1 and 0xdf.  We just increment the last byte if it's less than 0xdf,
+ * and otherwise rewrite the whole sequence to 0xa1 0xa1.
+ *
+ * If the sequence starts with SS3 (0x8f), it must be a three-byte sequence
+ * in which the last two bytes range between 0xa1 and 0xfe.  The last byte
+ * is incremented if possible, otherwise the second-to-last byte.
+ *
+ * If the sequence starts with a value other than the above and its MSB
+ * is set, it must be a two-byte sequence representing JIS X 0208 characters
+ * with both bytes ranging between 0xa1 and 0xfe.  The last byte is
+ * incremented if possible, otherwise the second-to-last byte.
+ *
+ * Otherwise, the sequence is a single-byte ASCII character. It is
+ * incremented up to 0x7f.
+ */
+static bool
+pg_eucjp_increment(unsigned char *charptr, int length)
+{
+    unsigned char c1,
+                c2;
+    int            i;
+
+    c1 = *charptr;
+
+    switch (c1)
+    {
+        case SS2:                /* JIS X 0201 */
+            if (length != 2)
+                return false;
+
+            c2 = charptr[1];
+
+            if (c2 >= 0xdf)
+                charptr[0] = charptr[1] = 0xa1;
+            else if (c2 < 0xa1)
+                charptr[1] = 0xa1;
+            else
+                charptr[1]++;
+            break;
+
+        case SS3:                /* JIS X 0212 */
+            if (length != 3)
+                return false;
+
+            for (i = 2; i > 0; i--)
+            {
+                c2 = charptr[i];
+                if (c2 < 0xa1)
+                {
+                    charptr[i] = 0xa1;
+                    return true;
+                }
+                else if (c2 < 0xfe)
+                {
+                    charptr[i]++;
+                    return true;
+                }
+            }
+
+            /* Out of 3-byte code region */
+            return false;
+
+        default:
+            if (IS_HIGHBIT_SET(c1)) /* JIS X 0208? */
+            {
+                if (length != 2)
+                    return false;
+
+                for (i = 1; i >= 0; i--)
+                {
+                    c2 = charptr[i];
+                    if (c2 < 0xa1)
+                    {
+                        charptr[i] = 0xa1;
+                        return true;
+                    }
+                    else if (c2 < 0xfe)
+                    {
+                        charptr[i]++;
+                        return true;
+                    }
+                }
+
+                /* Out of 2 byte code region */
+                return false;
+            }
+            else
+            {                    /* ASCII, single byte */
+                if (c1 > 0x7e)
+                    return false;
+                (*charptr)++;
+            }
+            break;
+    }
+
+    return true;
+}
+#endif                            /* !FRONTEND */
+
+
+/*
+ *-------------------------------------------------------------------
+ * encoding info table
+ * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
+ *-------------------------------------------------------------------
+ */
+const pg_wchar_tbl pg_wchar_table[] = {
+    {pg_ascii2wchar_with_len, pg_wchar2single_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_ascii_verifier, 1},    /* PG_SQL_ASCII */
+    {pg_eucjp2wchar_with_len, pg_wchar2euc_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3},    /* PG_EUC_JP */
+    {pg_euccn2wchar_with_len, pg_wchar2euc_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_euccn_verifier, 2},    /* PG_EUC_CN */
+    {pg_euckr2wchar_with_len, pg_wchar2euc_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_euckr_verifier, 3},    /* PG_EUC_KR */
+    {pg_euctw2wchar_with_len, pg_wchar2euc_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_euctw_verifier, 4},    /* PG_EUC_TW */
+    {pg_eucjp2wchar_with_len, pg_wchar2euc_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3},    /* PG_EUC_JIS_2004 */
+    {pg_utf2wchar_with_len, pg_wchar2utf_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4},    /* PG_UTF8 */
+    {pg_mule2wchar_with_len, pg_wchar2mule_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier, 4},    /* PG_MULE_INTERNAL */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_LATIN1 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_LATIN2 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_LATIN3 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_LATIN4 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_LATIN5 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_LATIN6 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_LATIN7 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_LATIN8 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_LATIN9 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_LATIN10 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN1256 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN1258 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN866 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN874 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_KOI8R */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN1251 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN1252 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* ISO-8859-5 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* ISO-8859-6 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* ISO-8859-7 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* ISO-8859-8 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN1250 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN1253 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN1254 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN1255 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_WIN1257 */
+    {pg_latin12wchar_with_len, pg_wchar2single_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1},    /* PG_KOI8U */
+    {0, 0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */
+    {0, 0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */
+    {0, 0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2},    /* PG_GBK */
+    {0, 0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2},    /* PG_UHC */
+    {0, 0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4},    /* PG_GB18030 */
+    {0, 0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3},    /* PG_JOHAB */
+    {0, 0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}    /* PG_SHIFT_JIS_2004 */
+};
+
+/* returns the byte length of a word for mule internal code */
+int
+pg_mic_mblen(const unsigned char *mbstr)
+{
+    return pg_mule_mblen(mbstr);
+}
+
+/*
+ * Returns the byte length of a multibyte character.
+ */
+int
+pg_encoding_mblen(int encoding, const char *mbstr)
+{
+    return (PG_VALID_ENCODING(encoding) ?
+            pg_wchar_table[encoding].mblen((const unsigned char *) mbstr) :
+            pg_wchar_table[PG_SQL_ASCII].mblen((const unsigned char *) mbstr));
+}
+
+/*
+ * Returns the display length of a multibyte character.
+ */
+int
+pg_encoding_dsplen(int encoding, const char *mbstr)
+{
+    return (PG_VALID_ENCODING(encoding) ?
+            pg_wchar_table[encoding].dsplen((const unsigned char *) mbstr) :
+            pg_wchar_table[PG_SQL_ASCII].dsplen((const unsigned char *) mbstr));
+}
+
+/*
+ * Verify the first multibyte character of the given string.
+ * Return its byte length if good, -1 if bad.  (See comments above for
+ * full details of the mbverify API.)
+ */
+int
+pg_encoding_verifymb(int encoding, const char *mbstr, int len)
+{
+    return (PG_VALID_ENCODING(encoding) ?
+            pg_wchar_table[encoding].mbverify((const unsigned char *) mbstr, len) :
+            pg_wchar_table[PG_SQL_ASCII].mbverify((const unsigned char *) mbstr, len));
+}
+
+/*
+ * fetch maximum length of a given encoding
+ */
+int
+pg_encoding_max_length(int encoding)
+{
+    Assert(PG_VALID_ENCODING(encoding));
+
+    return pg_wchar_table[encoding].maxmblen;
+}
+
+#ifndef FRONTEND
+
+/*
+ * fetch maximum length of the encoding for the current database
+ */
+int
+pg_database_encoding_max_length(void)
+{
+    return pg_wchar_table[GetDatabaseEncoding()].maxmblen;
+}
+
+/*
+ * get the character incrementer for the encoding for the current database
+ */
+mbcharacter_incrementer
+pg_database_encoding_character_incrementer(void)
+{
+    /*
+     * Eventually it might be best to add a field to pg_wchar_table[], but for
+     * now we just use a switch.
+     */
+    switch (GetDatabaseEncoding())
+    {
+        case PG_UTF8:
+            return pg_utf8_increment;
+
+        case PG_EUC_JP:
+            return pg_eucjp_increment;
+
+        default:
+            return pg_generic_charinc;
+    }
+}
+
+/*
+ * Verify mbstr to make sure that it is validly encoded in the current
+ * database encoding.  Otherwise same as pg_verify_mbstr().
+ */
+bool
+pg_verifymbstr(const char *mbstr, int len, bool noError)
+{
+    return
+        pg_verify_mbstr_len(GetDatabaseEncoding(), mbstr, len, noError) >= 0;
+}
+
+/*
+ * Verify mbstr to make sure that it is validly encoded in the specified
+ * encoding.
+ */
+bool
+pg_verify_mbstr(int encoding, const char *mbstr, int len, bool noError)
+{
+    return pg_verify_mbstr_len(encoding, mbstr, len, noError) >= 0;
+}
+
+/*
+ * Verify mbstr to make sure that it is validly encoded in the specified
+ * encoding.
+ *
+ * mbstr is not necessarily zero terminated; length of mbstr is
+ * specified by len.
+ *
+ * If OK, return length of string in the encoding.
+ * If a problem is found, return -1 when noError is
+ * true; when noError is false, ereport() a descriptive message.
+ */
+int
+pg_verify_mbstr_len(int encoding, const char *mbstr, int len, bool noError)
+{
+    mbverifier    mbverify;
+    int            mb_len;
+
+    Assert(PG_VALID_ENCODING(encoding));
+
+    /*
+     * In single-byte encodings, we need only reject nulls (\0).
+     */
+    if (pg_encoding_max_length(encoding) <= 1)
+    {
+        const char *nullpos = memchr(mbstr, 0, len);
+
+        if (nullpos == NULL)
+            return len;
+        if (noError)
+            return -1;
+        report_invalid_encoding(encoding, nullpos, 1);
+    }
+
+    /* fetch function pointer just once */
+    mbverify = pg_wchar_table[encoding].mbverify;
+
+    mb_len = 0;
+
+    while (len > 0)
+    {
+        int            l;
+
+        /* fast path for ASCII-subset characters */
+        if (!IS_HIGHBIT_SET(*mbstr))
+        {
+            if (*mbstr != '\0')
+            {
+                mb_len++;
+                mbstr++;
+                len--;
+                continue;
+            }
+            if (noError)
+                return -1;
+            report_invalid_encoding(encoding, mbstr, len);
+        }
+
+        l = (*mbverify) ((const unsigned char *) mbstr, len);
+
+        if (l < 0)
+        {
+            if (noError)
+                return -1;
+            report_invalid_encoding(encoding, mbstr, len);
+        }
+
+        mbstr += l;
+        len -= l;
+        mb_len++;
+    }
+    return mb_len;
+}
+
+/*
+ * check_encoding_conversion_args: check arguments of a conversion function
+ *
+ * "expected" arguments can be either an encoding ID or -1 to indicate that
+ * the caller will check whether it accepts the ID.
+ *
+ * Note: the errors here are not really user-facing, so elog instead of
+ * ereport seems sufficient.  Also, we trust that the "expected" encoding
+ * arguments are valid encoding IDs, but we don't trust the actuals.
+ */
+void
+check_encoding_conversion_args(int src_encoding,
+                               int dest_encoding,
+                               int len,
+                               int expected_src_encoding,
+                               int expected_dest_encoding)
+{
+    if (!PG_VALID_ENCODING(src_encoding))
+        elog(ERROR, "invalid source encoding ID: %d", src_encoding);
+    if (src_encoding != expected_src_encoding && expected_src_encoding >= 0)
+        elog(ERROR, "expected source encoding \"%s\", but got \"%s\"",
+             pg_enc2name_tbl[expected_src_encoding].name,
+             pg_enc2name_tbl[src_encoding].name);
+    if (!PG_VALID_ENCODING(dest_encoding))
+        elog(ERROR, "invalid destination encoding ID: %d", dest_encoding);
+    if (dest_encoding != expected_dest_encoding && expected_dest_encoding >= 0)
+        elog(ERROR, "expected destination encoding \"%s\", but got \"%s\"",
+             pg_enc2name_tbl[expected_dest_encoding].name,
+             pg_enc2name_tbl[dest_encoding].name);
+    if (len < 0)
+        elog(ERROR, "encoding conversion length must not be negative");
+}
+
+/*
+ * report_invalid_encoding: complain about invalid multibyte character
+ *
+ * note: len is remaining length of string, not length of character;
+ * len must be greater than zero, as we always examine the first byte.
+ */
+void
+report_invalid_encoding(int encoding, const char *mbstr, int len)
+{
+    int            l = pg_encoding_mblen(encoding, mbstr);
+    char        buf[8 * 5 + 1];
+    char       *p = buf;
+    int            j,
+                jlimit;
+
+    jlimit = Min(l, len);
+    jlimit = Min(jlimit, 8);    /* prevent buffer overrun */
+
+    for (j = 0; j < jlimit; j++)
+    {
+        p += sprintf(p, "0x%02x", (unsigned char) mbstr[j]);
+        if (j < jlimit - 1)
+            p += sprintf(p, " ");
+    }
+
+    ereport(ERROR,
+            (errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
+             errmsg("invalid byte sequence for encoding \"%s\": %s",
+                    pg_enc2name_tbl[encoding].name,
+                    buf)));
+}
+
+/*
+ * report_untranslatable_char: complain about untranslatable character
+ *
+ * note: len is remaining length of string, not length of character;
+ * len must be greater than zero, as we always examine the first byte.
+ */
+void
+report_untranslatable_char(int src_encoding, int dest_encoding,
+                           const char *mbstr, int len)
+{
+    int            l = pg_encoding_mblen(src_encoding, mbstr);
+    char        buf[8 * 5 + 1];
+    char       *p = buf;
+    int            j,
+                jlimit;
+
+    jlimit = Min(l, len);
+    jlimit = Min(jlimit, 8);    /* prevent buffer overrun */
+
+    for (j = 0; j < jlimit; j++)
+    {
+        p += sprintf(p, "0x%02x", (unsigned char) mbstr[j]);
+        if (j < jlimit - 1)
+            p += sprintf(p, " ");
+    }
+
+    ereport(ERROR,
+            (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
+             errmsg("character with byte sequence %s in encoding \"%s\" has no equivalent in encoding \"%s\"",
+                    buf,
+                    pg_enc2name_tbl[src_encoding].name,
+                    pg_enc2name_tbl[dest_encoding].name)));
+}
+
+#endif                            /* !FRONTEND */
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 7fb5fa4..026f64f 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -222,8 +222,8 @@ typedef unsigned int pg_wchar;
  * PostgreSQL encoding identifiers
  *
  * WARNING: the order of this enum must be same as order of entries
- *            in the pg_enc2name_tbl[] array (in mb/encnames.c), and
- *            in the pg_wchar_table[] array (in mb/wchar.c)!
+ *            in the pg_enc2name_tbl[] array (in src/common/encnames.c), and
+ *            in the pg_wchar_table[] array (in src/common/wchar.c)!
  *
  *            If you add some encoding don't forget to check
  *            PG_ENCODING_BE_LAST macro.
diff --git a/src/interfaces/libpq/.gitignore b/src/interfaces/libpq/.gitignore
index 7b438f3..a4afe7c 100644
--- a/src/interfaces/libpq/.gitignore
+++ b/src/interfaces/libpq/.gitignore
@@ -1,4 +1 @@
 /exports.list
-# .c files that are symlinked in from elsewhere
-/encnames.c
-/wchar.c
diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile
index f5f1c0c..a068826 100644
--- a/src/interfaces/libpq/Makefile
+++ b/src/interfaces/libpq/Makefile
@@ -45,11 +45,6 @@ OBJS = \
     pqexpbuffer.o \
     fe-auth.o

-# src/backend/utils/mb
-OBJS += \
-    encnames.o \
-    wchar.o
-
 ifeq ($(with_openssl),yes)
 OBJS += \
     fe-secure-common.o \
@@ -102,17 +97,7 @@ include $(top_srcdir)/src/Makefile.shlib
 backend_src = $(top_srcdir)/src/backend


-# We use a few backend modules verbatim, but since we need
-# to compile with appropriate options to build a shared lib, we can't
-# use the same object files built for the backend.
-# Instead, symlink the source files in here and build our own object files.
-# When you add a file here, remember to add it in the "clean" target below.
-
-encnames.c wchar.c: % : $(backend_src)/utils/mb/%
-    rm -f $@ && $(LN_S) $< .
-
-
-# Make dependencies on pg_config_paths.h visible, too.
+# Make dependencies on pg_config_paths.h visible in all builds.
 fe-connect.o: fe-connect.c $(top_builddir)/src/port/pg_config_paths.h
 fe-misc.o: fe-misc.c $(top_builddir)/src/port/pg_config_paths.h

@@ -144,8 +129,6 @@ clean distclean: clean-lib
     rm -f $(OBJS) pthread.h
 # Might be left over from a Win32 client-only build
     rm -f pg_config_paths.h
-# Remove files we (may have) symlinked in from other places
-    rm -f encnames.c wchar.c

 maintainer-clean: distclean
     $(MAKE) -C test $@
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index f6ab0d5..67b9f23 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -120,11 +120,12 @@ sub mkvcbuild
     }

     our @pgcommonallfiles = qw(
-      base64.c config_info.c controldata_utils.c d2s.c exec.c f2s.c file_perm.c ip.c
+      base64.c config_info.c controldata_utils.c d2s.c encnames.c exec.c
+      f2s.c file_perm.c ip.c
       keywords.c kwlookup.c link-canary.c md5.c
       pg_lzcompress.c pgfnames.c psprintf.c relpath.c rmtree.c
       saslprep.c scram-common.c string.c stringinfo.c unicode_norm.c username.c
-      wait_error.c);
+      wait_error.c wchar.c);

     if ($solution->{options}->{openssl})
     {

Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Thu, Jan 16, 2020 at 3:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm
> > missing something, this seems like an overdue cleanup.
>
> Here's a reviewed version of 0001.  You missed fixing the MSVC build,
> and there were assorted comments and other things referencing wchar.c
> that needed to be cleaned up.

Wow, thanks.

> Also, it seemed to me that if we are going to move wchar.c, we should
> also move encnames.c, so that libpq can get fully out of the
> symlinking-source-files business.  It makes initdb less weird too.

OK.

> I took the liberty of sticking proper copyright headers onto these
> two files, too.  (This makes the diff a lot more bulky :-(.  Would
> it help to add the headers in a separate commit?)

I wouldn't bother making it a separate commit, but please do whatever you like.

> Another thing I'm wondering about is if any of the #ifndef FRONTEND
> code should get moved *back* to src/backend/utils/mb.  But that
> could be a separate commit, too.

+1 for moving that stuff to a separate backend-only file.

> Lastly, it strikes me that maybe pg_wchar.h, or parts of it, should
> migrate over to src/include/common.  But that'd be far more invasive
> to other source files, so I've not touched the issue here.

I don't have a view on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Thu, Jan 16, 2020 at 1:58 PM David Steele <david@pgmasters.net> wrote:
> To do page-level incrementals (which this feature is intended to enable)
> the user will need to be able to associate full and incremental backups
> and the only way I see to do that (currently) is to read the manifests,
> since the prior backup should be stored there.  I think this means that
> parsing the manifest is not really optional -- it will be required to do
> any kind of automation with incrementals.

My current belief is that enabling incremental backup will require
extending the manifest format either not at all or by adding one
additional line with some LSN info.

If we could foresee a need to store a bunch of additional *per-file*
details, I'd be a lot more receptive to the argument that we ought to
be using a more structured format like JSON. And it doesn't seem
impossible that such a thing could happen, but I don't think it's at
all clear that it actually will happen, or that it will happen soon
enough that we ought to be worrying about it now.

It's possible that we're chasing a real problem here, and if there's
something we can agree on and get done I'd rather do that than argue,
but I am still quite suspicious that there's no actually serious
technical problem here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 16, 2020 at 3:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Here's a reviewed version of 0001.  You missed fixing the MSVC build,
>> and there were assorted comments and other things referencing wchar.c
>> that needed to be cleaned up.

> Wow, thanks.

Pushed that.

>> Another thing I'm wondering about is if any of the #ifndef FRONTEND
>> code should get moved *back* to src/backend/utils/mb.  But that
>> could be a separate commit, too.

> +1 for moving that stuff to a separate backend-only file.

After a brief look, I propose the following:

* I think we should just shove the "#ifndef FRONTEND" stuff in
wchar.c into mbutils.c.  It doesn't seem worth inventing a whole
new file for that code, especially when it's arguably within the
remit of mbutils.c anyway.

* Let's remove the "#ifndef FRONTEND" restriction on the ICU-related
stuff in encnames.c.  Even if we don't need that stuff in frontend
today, it's hardly unlikely that we will need it tomorrow.  And there's
not that much bulk there anyway.

* The one positive reason for that restriction is the ereport() in
get_encoding_name_for_icu.  We could change that to be the usual
#ifdef-ereport-or-printf dance, but I think there's a better way: put
the ereport at the caller, by redefining that function to return NULL
for an unsupported encoding.  There's only one caller today anyhow.

* PG_char_to_encoding() and PG_encoding_to_char() can be moved to
mbutils.c; they'd fit reasonably well beside getdatabaseencoding and
pg_client_encoding.  (I also thought about utils/adt/misc.c, but
that's not obviously better.)

Barring objections I'll go make this happen shortly.
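
For concreteness, the third point above would make the division of
labor look roughly like this -- a sketch of the proposed calling
convention only, with icu_name_table standing in for whatever lookup
encnames.c actually uses:

    /* encnames.c: usable in frontend code, no ereport() needed */
    const char *
    get_encoding_name_for_icu(int encoding)
    {
        return icu_name_table[encoding];    /* NULL if unsupported */
    }

    /* the lone backend caller supplies the error report */
    static const char *
    icu_encoding_name_or_error(int encoding)
    {
        const char *name = get_encoding_name_for_icu(encoding);

        if (name == NULL)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("encoding \"%s\" not supported by ICU",
                            pg_encoding_to_char(encoding))));
        return name;
    }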

>> Lastly, it strikes me that maybe pg_wchar.h, or parts of it, should
>> migrate over to src/include/common.  But that'd be far more invasive
>> to other source files, so I've not touched the issue here.

> I don't have a view on this.

If anyone is hot to do this part, please have at it.  I'm not.

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> It's possible that we're chasing a real problem here, and if there's
> something we can agree on and get done I'd rather do that than argue,
> but I am still quite suspicious that there's no actually serious
> technical problem here.

It's entirely possible that you're right.  But if this is a file format
that is meant to be exposed to user tools, we need to take a very long
view of the requirements for it.  Five or ten years down the road, we
might be darn glad we spent extra time now.

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Andrew Dunstan
Date:
On Thu, Jan 16, 2020 at 7:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
>

>
> 0002 does some basic header cleanup to make it possible to include the
> existing header file jsonapi.h in frontend code. The state of the JSON
> headers today looks generally poor. There seems not to have been much
> attempt to get the prototypes for a given source file, say foo.c, into
> a header file with the same name, say foo.h. Also, dependencies
> between various header files seem to have been added somewhat freely.
> This patch does not come close to fixing all that, but I consider it a
> modest down payment on a cleanup that probably ought to be taken
> further.
>
> 0003 splits json.c into two files, json.c and jsonapi.c. All the
> lexing and parsing stuff (whose prototypes are in jsonapi.h) goes into
> jsonapi.c, while the stuff that pertains to the 'json' data type
> remains in json.c. This also seems like a good cleanup, because to me,
> at least, it's not a great idea to mix together code that is used by
> both the json and jsonb data types as well as other things in the
> system that want to generate or parse json together with things that
> are specific to the 'json' data type.
>

I'm probably responsible for a good deal of the mess, so let me say thank you.

I'll have a good look at these.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
David Steele
Date:
Hi Robert,

On 1/16/20 11:51 AM, Robert Haas wrote:
> On Thu, Jan 16, 2020 at 1:37 PM David Steele <david@pgmasters.net> wrote:
> 
>> The next question in my mind is given the caveat that the error handing
>> is questionable in the front end, can we at least render/parse valid
>> JSON with the code?
> 
> That's a real good question. Thanks for offering to test it; I think
> that would be very helpful.

It seems to work just fine.  I didn't stress it too hard but I did put 
in one escape and a multi-byte character and check the various data types.

Attached is a test hack on pg_basebackup which produces this output:

START
     FIELD "number", null 0
     SCALAR TYPE 2: 123
     FIELD "string", null 0
     SCALAR TYPE 1: val    ue-丏
     FIELD "bool", null 0
     SCALAR TYPE 9: true
     FIELD "null", null 1
     SCALAR TYPE 11: null
END

I used the callbacks because that's the first method I found but it 
seems like json_lex() might be easier to use in practice.
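
For reference, the callback approach looks roughly like this -- a
minimal sketch against jsonapi.h as it stands in these patches, where
print_field() and print_scalar() are made-up helpers that would produce
output like the above:

    #include <stdio.h>
    #include <string.h>

    #include "utils/jsonapi.h"    /* header location prior to any move to src/common */

    static void
    print_field(void *state, char *fname, bool isnull)
    {
        printf("    FIELD \"%s\", null %d\n", fname, isnull ? 1 : 0);
    }

    static void
    print_scalar(void *state, char *token, JsonTokenType tokentype)
    {
        printf("    SCALAR TYPE %d: %s\n", (int) tokentype, token);
    }

    static void
    parse_manifest_string(char *json)
    {
        JsonLexContext *lex;
        JsonSemAction sem = {0};

        lex = makeJsonLexContextCstringLen(json, strlen(json), true);
        sem.object_field_start = print_field;
        sem.scalar = print_scalar;
        pg_parse_json(lex, &sem);    /* still ereport()s on bad input */
    }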

I think it's an issue that the entire string must be passed to the lexer 
at once.  That will not be great for large manifests.  However, I don't 
think it will be all that hard to implement an optional "want more" 
callback in the lexer so JSON data can be fed in from the file in chunks.
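
One possible shape for that hook, purely as an illustration -- nothing
like this exists in jsonapi.c today, and all of the names below are
made up:

    /*
     * Hypothetical: the lexer asks the caller for more input whenever its
     * current buffer is exhausted.  Returning 0 would signal end of input.
     */
    typedef size_t (*json_refill_cb) (void *state, char *buf, size_t bufsize);

    typedef struct JsonChunkedSource
    {
        json_refill_cb refill;          /* called on buffer exhaustion */
        void           *refill_state;   /* e.g. a FILE * for the manifest */
    } JsonChunkedSource;

The lexer would just need to be careful not to split a token (or a
multibyte character) across two refill calls.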

So, that just leaves ereport() as the largest remaining issue?  I'll 
look at that today and Tuesday and see what I can work up.

Regards,
-- 
-David
david@pgmasters.net

Attachment

Re: making the backend's json parser work in frontend code

From
David Steele
Date:
Hi Robert,

On 1/16/20 11:51 AM, Robert Haas wrote:
> On Thu, Jan 16, 2020 at 1:37 PM David Steele <david@pgmasters.net> wrote:
> 
>> So the idea here is that json.c will have the JSON SQL functions,
>> jsonb.c the JSONB SQL functions, and jsonapi.c the parser, and
>> jsonfuncs.c the utility functions?
> 
> Uh, I think roughly that, yes. Although I can't claim to fully
> understand everything that's here.

Now that I've spent some time with the code I see your intent was just 
to isolate the JSON lexer code with 0002 and 0003.  As such, I now think 
they are commit-able as is.

Regards,
-- 
-David
david@pgmasters.net



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Thu, Jan 16, 2020 at 8:55 PM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> I'm probably responsible for a good deal of the mess, so let me say Thankyou.
>
> I'll have a good look at these.

Thanks, appreciated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Fri, Jan 17, 2020 at 12:36 PM David Steele <david@pgmasters.net> wrote:
> It seems to work just fine.  I didn't stress it too hard but I did put
> in one escape and a multi-byte character and check the various data types.

Cool.

> I used the callbacks because that's the first method I found but it
> seems like json_lex() might be easier to use in practice.

Ugh, really? That doesn't seem like it would be nice at all.

> I think it's an issue that the entire string must be passed to the lexer
> at once.  That will not be great for large manifests.  However, I don't
> think it will be all that hard to implement an optional "want more"
> callback in the lexer so JSON data can be fed in from the file in chunks.

I thought so initially, but now I'm not so sure. The thing is, you
actually need all the manifest data in memory at once anyway, or so I
think. You're essentially doing a "full join" between the contents of
the manifest and the contents of the file system, so you've got to
scan one (probably the filesystem) and then mark entries in the other
(probably the manifest) used as you go.

But this might need more thought. The details probably depend on
exactly how you design it all.

> So, that just leaves ereport() as the largest remaining issue?  I'll
> look at that today and Tuesday and see what I can work up.

PFA my work on that topic. As compared with my previous patch series,
the previous 0001 is dropped and what are now 0001 and 0002 are the
same as patches from the previous series. 0003 and 0004 are aiming
toward getting rid of ereport() and, I believe, show a plausible
strategy for so doing. There are, possibly, things not to like here,
and it's certainly incomplete, but I think I kinda like this
direction. Comments appreciated.

0003 nukes lex_accept(), inlining the logic into callers. I found that
the refactoring I wanted to do in 0004 was pretty hard without this,
and it turns out to save code, so I think this is a good idea
independently of anything else.

0004 adjusts many functions in jsonapi.c to return a new enumerated
type, JsonParseErrorType, instead of directly doing ereport(). It adds
a new function that takes this value and a lexing context and throws
an error. The JSON_ESCAPING_INVALID case is wrong and involves a gross
hack, but that's fixable with another field in the lexing context.
More work is needed to really bring this up to scratch, but the idea
is to make this code have a soft dependency on ereport() rather than a
hard one.
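
In outline, the direction looks something like the following. This is
an abbreviated sketch rather than the patch itself: the real enum has
many more members, and the names here follow the patch only loosely.

    typedef enum JsonParseErrorType
    {
        JSON_SUCCESS = 0,
        JSON_ESCAPING_INVALID,
        JSON_EXPECTED_COLON,
        JSON_EXPECTED_STRING,
        JSON_INVALID_TOKEN
        /* ... */
    } JsonParseErrorType;

    /* lexing/parsing functions return a code instead of throwing */
    static JsonParseErrorType
    parse_object_field(JsonLexContext *lex, JsonSemAction *sem)
    {
        if (lex_peek(lex) != JSON_TOKEN_STRING)
            return JSON_EXPECTED_STRING;
        /* ... recurse, propagating any non-JSON_SUCCESS result ... */
        return JSON_SUCCESS;
    }

    /* backend-only wrapper keeps the ereport() dependency soft */
    void
    throw_a_json_error(JsonParseErrorType error, JsonLexContext *lex)
    {
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("invalid input syntax for type %s: \"%s\"",
                        "json", lex->input)));
    }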

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 16, 2020, at 1:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>>>
>>> Lastly, it strikes me that maybe pg_wchar.h, or parts of it, should
>>> migrate over to src/include/common.  But that'd be far more invasive
>>> to other source files, so I've not touched the issue here.
>
>> I don't have a view on this.
>
> If anyone is hot to do this part, please have at it.  I'm not.

I moved the file pg_wchar.h into src/include/common and split out
most of the functions you marked as being suitable for the
backend only into a new file src/include/utils/mbutils.h.  That
resulted in the need to include this new “utils/mbutils.h” from
a number of .c files in the source tree.

One issue that came up was libpq/pqformat.h uses a couple
of those functions from within static inline functions, preventing
me from moving those to a backend-only include file without
making pqformat.h a backend-only include file.

I think the right thing to do here is to move references to these
functions into pqformat.c by un-inlining these functions.  I have
not done that yet.

There are whitespace cleanup issues I’m not going to fix just
yet, since I’ll be making more changes anyway.  What do you
think of the direction I’m taking in the attached?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Attachment

Re: making the backend's json parser work in frontend code

From
David Steele
Date:
Hi Robert,

On 1/17/20 2:33 PM, Robert Haas wrote:
 > On Fri, Jan 17, 2020 at 12:36 PM David Steele <david@pgmasters.net> wrote:
 >
 >> I used the callbacks because that's the first method I found but it
 >> seems like json_lex() might be easier to use in practice.
 >
 > Ugh, really? That doesn't seem like it would be nice at all.

I guess it's a matter of how you want to structure the code.

 >> I think it's an issue that the entire string must be passed to the lexer
 >> at once.  That will not be great for large manifests.  However, I don't
 >> think it will be all that hard to implement an optional "want more"
 >> callback in the lexer so JSON data can be fed in from the file in chunks.
 >
 > I thought so initially, but now I'm not so sure. The thing is, you
 > actually need all the manifest data in memory at once anyway, or so I
 > think. You're essentially doing a "full join" between the contents of
 > the manifest and the contents of the file system, so you've got to
 > scan one (probably the filesystem) and then mark entries in the other
 > (probably the manifest) used as you go.

Yeah, having a copy of the manifest in memory is the easiest way to do 
validation, but I think you'd want it in a structured format.

We parse the file part of the manifest into a sorted struct array which 
we can then do binary searches on by filename.
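
Something along these lines, say -- a generic sketch, not pgBackRest's
actual code, with all names invented:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct manifest_file_t
    {
        char       *path;
        uint64_t    size;
        bool        matched;    /* set while walking the filesystem */
    } manifest_file_t;

    static int
    manifest_file_cmp(const void *a, const void *b)
    {
        return strcmp(((const manifest_file_t *) a)->path,
                      ((const manifest_file_t *) b)->path);
    }

    /* qsort() once after parsing, then probe per filesystem entry */
    static manifest_file_t *
    manifest_file_find(manifest_file_t *files, size_t nfiles,
                       const char *path)
    {
        manifest_file_t key;

        key.path = (char *) path;
        return bsearch(&key, files, nfiles, sizeof(manifest_file_t),
                       manifest_file_cmp);
    }

Anything still unmatched at the end of the walk was deleted from the
filesystem; anything present on disk but absent from the array is an
extra file.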

 >> So, that just leaves ereport() as the largest remaining issue?  I'll
 >> look at that today and Tuesday and see what I can work up.
 >
 > PFA my work on that topic. As compared with my previous patch series,
 > the previous 0001 is dropped and what are now 0001 and 0002 are the
 > same as patches from the previous series. 0003 and 0004 are aiming
 > toward getting rid of ereport() and, I believe, show a plausible
 > strategy for so doing. There are, possibly, things not to like here,
 > and it's certainly incomplete, but I think I kinda like this
 > direction. Comments appreciated.
 >
 > 0003 nukes lex_accept(), inlining the logic into callers. I found that
 > the refactoring I wanted to do in 0004 was pretty hard without this,
 > and it turns out to save code, so I think this is a good idea
 > independently of anything else.

No arguments here.

 > 0004 adjusts many functions in jsonapi.c to return a new enumerated
 > type, JsonParseErrorType, instead of directly doing ereport(). It adds
 > a new function that takes this value and a lexing context and throws
 > an error. The JSON_ESCAPING_INVALID case is wrong and involves a gross
 > hack, but that's fixable with another field in the lexing context.
 > More work is needed to really bring this up to scratch, but the idea
 > is to make this code have a soft dependency on ereport() rather than a
 > hard one.

My first reaction was that if we migrated ereport() first it would make 
this all so much easier.  Now I'm no so sure.

Having a general json parser in libcommon that is not tied into a 
specific error handling/logging system actually sounds like a really 
nice thing to have.  If we do migrate ereport() the user would always 
have the choice to call throw_a_json_error() if they wanted to.

There's also a bit of de-duplication of error messages, which is nice, 
especially in the case JSON_ESCAPING_INVALID.  And I agree that this 
case can be fixed with another field in the lexer -- or at least so it 
seems to me.

Though, throw_a_json_error() is not my favorite name.  Perhaps 
json_ereport()?

Regards,
-- 
-David
david@pgmasters.net



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Tue, Jan 21, 2020 at 5:34 PM David Steele <david@pgmasters.net> wrote:
> Though, throw_a_json_error() is not my favorite name.  Perhaps
> json_ereport()?

That name was deliberately chosen to be dumb, with the thought that
readers would understand it was to be replaced at some point before
this was final. It sounds like it wasn't quite dumb enough to make
that totally clear.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Tue, Jan 21, 2020 at 7:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 21, 2020 at 5:34 PM David Steele <david@pgmasters.net> wrote:
> > Though, throw_a_json_error() is not my favorite name.  Perhaps
> > json_ereport()?
>
> That name was deliberately chosen to be dumb, with the thought that
> readers would understand it was to be replaced at some point before
> this was final. It sounds like it wasn't quite dumb enough to make
> that totally clear.

Here is a new version that is, I think, much closer what I would
consider a final form. 0001 through 0003 are as before, and unless
somebody says pretty soon that they see a problem with those or want
more time to review them, I'm going to commit them; David Steele has
endorsed all three, and they seem like independently sensible
cleanups.

0004 is a substantially cleaned up version of the patch to make the
JSON parser return a result code rather than throwing errors. Names
have been fixed, interfaces have been tidied up, and the thing is
better integrated with the surrounding code. I would really like
comments, if anyone has them, on whether this approach is acceptable.

0005 builds on 0004 by moving three functions from jsonapi.c to
jsonfuncs.c. With that done, jsonapi.c has minimal remaining
dependencies on the backend environment. It would still need a
substitute for elog(ERROR, "some internal thing is broken"); I'm
thinking of using pg_log_fatal() for that case. It would also need a
fix for the problem that pg_mblen() is not available in the front-end
environment. I don't know what to do about that yet exactly, but it
doesn't seem unsolvable. The frontend environment just needs to know
which encoding to use, and needs a way to call PQmblen() rather than
pg_mblen().
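
That last piece could be as small as a shim like this -- a sketch;
PQmblen() is libpq's existing export, but the wrapper and its static
variable are invented here:

    #include "libpq-fe.h"

    static int frontend_encoding;    /* set once, e.g. from PQenv2encoding() */

    /* stand-in for pg_mblen() in frontend builds of jsonapi.c */
    static int
    frontend_mblen(const char *mbstr)
    {
        return PQmblen(mbstr, frontend_encoding);
    }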

One problem with this whole thing that I just realized is that the
backup manifest file needs to store filenames, and we don't know that
the filenames we get from the filesystem are going to be valid in
UTF-8 (or, for that matter, any other encoding we might want to
choose). So, just deciding that the backup manifest is always UTF-8
doesn't seem like an option, unless we stick another level of escaping
in there somehow. Strictly as a theoretical matter, someone might
consider this a reason why using JSON for the backup manifest is not
necessarily the best fit, but since other arguments to that effect
have gotten me nowhere until now, I will instead request that someone
suggest to me how I ought to handle that problem.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: making the backend's json parser work in frontend code

From
Alvaro Herrera
Date:
On 2020-Jan-22, Robert Haas wrote:

> Here is a new version that is, I think, much closer what I would
> consider a final form. 0001 through 0003 are as before, and unless
> somebody says pretty soon that they see a problem with those or want
> more time to review them, I'm going to commit them; David Steele has
> endorsed all three, and they seem like independently sensible
> cleanups.

I'm not sure I see the point of keeping json.h split from jsonapi.h.  It
seems to me that you could move back all the contents from jsonapi.h
into json.h, and everything would work just as well.  (Evidently the
Datum in JsonEncodeDateTime's proto is problematic ... perhaps putting
that prototype in jsonfuncs.h would work.)

I don't really object to your 0001 patch as posted, though.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Wed, Jan 22, 2020 at 2:26 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I'm not sure I see the point of keeping json.h split from jsonapi.h.  It
> seems to me that you could move back all the contents from jsonapi.h
> into json.h, and everything would work just as well.  (Evidently the
> Datum in JsonEncodeDateTime's proto is problematic ... perhaps putting
> that prototype in jsonfuncs.h would work.)
>
> I don't really object to your 0001 patch as posted, though.

The goal is to make it possible to use the JSON parser in the
frontend, and we can't do that if the header files that would have to
be included on the frontend side rely on things that only work in the
backend. As written, the patch series leaves json.h with a dependency
on Datum, so the stuff that it leaves in jsonapi.h (which is intended
to be the header that gets moved to src/common and included by
frontend code) can't be merged with it.

Now, we could obviously rearrange that. I don't think any of the file
naming here is great. But I think we probably want, as far as
possible, for the code in FOO.c to correspond to the prototypes in
FOO.h. What I'm thinking we should work towards is:

json.c/h - support for the 'json' data type
jsonb.c/h - support for the 'jsonb' data type
jsonfuncs.c/h - backend code that doesn't fit in either of the above
jsonapi.c/h - lexing/parsing code that can be used in either the
frontend or the backend

I'm not wedded to that. It just looks like the most natural thing from
where we are now.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Alvaro Herrera
Date:
On 2020-Jan-22, Robert Haas wrote:

> On Wed, Jan 22, 2020 at 2:26 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > I'm not sure I see the point of keeping json.h split from jsonapi.h.  It
> > seems to me that you could move back all the contents from jsonapi.h
> > into json.h, and everything would work just as well.  (Evidently the
> > Datum in JsonEncodeDateTime's proto is problematic ... perhaps putting
> > that prototype in jsonfuncs.h would work.)
> >
> > I don't really object to your 0001 patch as posted, though.
> 
> The goal is to make it possible to use the JSON parser in the
> frontend, and we can't do that if the header files that would have to
> be included on the frontend side rely on things that only work in the
> backend. As written, the patch series leaves json.h with a dependency
> on Datum, so the stuff that it leaves in jsonapi.h (which is intended
> to be the header that gets moved to src/common and included by
> frontend code) can't be merged with it.

Right, I agree with that goal, and as I said, I don't object to your
patch as posted.

> Now, we could obviously rearrange that. I don't think any of the file
> naming here is great. But I think we probably want, as far as
> possible, for the code in FOO.c to correspond to the prototypes in
> FOO.h. What I'm thinking we should work towards is:
> 
> json.c/h - support for the 'json' data type
> jsonb.c/h - support for the 'jsonb' data type
> jsonfuncs.c/h - backend code that doesn't fit in either of the above
> jsonapi.c/h - lexing/parsing code that can be used in either the
> frontend or the backend

... it would probably require more work to make this 100% attainable,
but I don't really care all that much.

> I'm not wedded to that. It just looks like the most natural thing from
> where we are now.

Let's go with it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 22, 2020, at 12:11 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2020-Jan-22, Robert Haas wrote:
>
>> On Wed, Jan 22, 2020 at 2:26 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> I'm not sure I see the point of keeping json.h split from jsonapi.h.  It
>>> seems to me that you could move back all the contents from jsonapi.h
>>> into json.h, and everything would work just as well.  (Evidently the
>>> Datum in JsonEncodeDateTime's proto is problematic ... perhaps putting
>>> that prototype in jsonfuncs.h would work.)
>>>
>>> I don't really object to your 0001 patch as posted, though.
>>
>> The goal is to make it possible to use the JSON parser in the
>> frontend, and we can't do that if the header files that would have to
>> be included on the frontend side rely on things that only work in the
>> backend. As written, the patch series leaves json.h with a dependency
>> on Datum, so the stuff that it leaves in jsonapi.h (which is intended
>> to be the header that gets moved to src/common and included by
>> frontend code) can't be merged with it.
>
> Right, I agree with that goal, and as I said, I don't object to your
> patch as posted.
>
>> Now, we could obviously rearrange that. I don't think any of the file
>> naming here is great. But I think we probably want, as far as
>> possible, for the code in FOO.c to correspond to the prototypes in
>> FOO.h. What I'm thinking we should work towards is:
>>
>> json.c/h - support for the 'json' data type
>> jsonb.c/h - support for the 'jsonb' data type
>> jsonfuncs.c/h - backend code that doesn't fit in either of the above
>> jsonapi.c/h - lexing/parsing code that can be used in either the
>> frontend or the backend
>
> ... it would probably require more work to make this 100% attainable,
> but I don't really care all that much.
>
>> I'm not wedded to that. It just looks like the most natural thing from
>> where we are now.
>
> Let's go with it.

I have this done in my local repo to the point that I can build
frontend tools against the json parser that is now in src/common and
also run all the check-world tests without failure.  I’m planning to
post my work soon, possibly tonight if I don’t run out of time, but
more likely tomorrow.

The main issue remaining is that my repo has a lot of stuff organized
differently than Robert’s patches, so I’m trying to turn my code into a
simple extension of his work rather than having my implementation
compete against his.

For the curious, the code as Robert left it still relies on the
DatabaseEncoding through the use of GetDatabaseEncoding, pg_mblen, and
similar, and that has been changed in my patches to only rely on the
database encoding in the backend, with the code in src/common taking an
explicit encoding, which the backend gets in the usual way and the
frontend might get with PQenv2encoding() or whatever the frontend
programmer finds appropriate.  Hopefully, this addresses Robert’s
concern upthread about the filesystem name not necessarily being in
utf8 format, though I might be misunderstanding the exact thrust of his
concern.  I can think of other possible interpretations of his concern
as he expressed it, so I’ll wait for him to clarify.

For those who want a sneak peek, I’m attaching WIP patches to this
email with all my changes, with Robert’s changes partially manually
cherry-picked and the rest still unmerged.  *THESE ARE NOT MEANT FOR
COMMIT.  THIS IS FOR ADVISORY PURPOSES ONLY.*  I have some debugging
cruft left in here, too, like gcc __attribute__ stuff that won’t be in
the patches I submit.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company









Attachment

Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Wed, Jan 22, 2020 at 10:00 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Hopefully, this addresses Robert’s concern upthread about the
> filesystem name not necessarily being in utf8 format, though I might
> be misunderstanding the exact thrust of his concern.  I can think of
> other possible interpretations of his concern as he expressed it, so
> I’ll wait for him to clarify.

No, that's not it. Suppose that Álvaro Herrera has some custom
settings he likes to put on all the PostgreSQL clusters that he uses,
so he creates a file álvaro.conf and uses an "include" directive in
postgresql.conf to suck in those settings. If he also likes UTF-8,
then the file name will be stored in the file system as a 12-byte
value of which the first two bytes will be 0xc3 0xa1. In that case,
everything will be fine, because JSON is supposed to always be UTF-8,
and the file name is UTF-8, and it's all good. But suppose he instead
likes LATIN-1. Then the file name will be stored as an 11-byte value
and the first byte will be 0xe1. The second byte, representing a
lower-case 'l', will be 0x6c. But we can't put a byte sequence that
goes 0xe1 0x6c into a JSON manifest stored as UTF-8, because that's
not valid in UTF-8.  UTF-8 requires that every byte from 0xc0-0xff be
followed by one or more bytes in the range 0x80-0xbf, and our
hypothetical file name that starts with 0xe1 0x6c does not meet that
criteria.
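
With wchar.c now in src/common, a tool can at least detect this case
cheaply, frontend or backend -- roughly as follows, a sketch using the
pg_encoding_verifymb() function moved earlier in this thread, with the
wrapper name invented:

    #include "mb/pg_wchar.h"

    /* returns true if name consists solely of valid UTF-8 sequences */
    static bool
    filename_is_valid_utf8(const char *name, int len)
    {
        while (len > 0)
        {
            int         l = pg_encoding_verifymb(PG_UTF8, name, len);

            if (l < 0)
                return false;    /* e.g. the 0xe1 0x6c sequence above */
            name += l;
            len -= l;
        }
        return true;
    }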

Now, you might say "well, why don't we just do an encoding
conversion?", but we can't. When the filesystem tells us what the file
names are, it does not tell us what encoding the person who created
those files had in mind. We don't know that they had *any* encoding in
mind. IIUC, a file in the data directory can have a name that consists
of any sequence of bytes whatsoever, so long as it doesn't contain
prohibited characters like a path separator or \0 byte. But only some
of those possible octet sequences can be stored in a manifest that has
to be valid UTF-8.

The degree to which there is a practical problem here is limited by
the fact that most filenames within the data directory are chosen by
the system, e.g. base/16384/16385, and those file names are only going
to contain ASCII characters (i.e. code points 0-127) and those are
valid in UTF-8 and lots of other encodings. Moreover, most people who
create additional files in the data directory will probably use ASCII
characters for those as well, at least if they are from an
English-speaking country, and if they're not, they're likely going to
use UTF-8, and then they'll still be fine. But there is no rule that
says people have to do that, and if somebody wants to use file names
based around SJIS or whatever, the backup manifest functionality
should not for that reason break.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Alvaro Herrera
Date:
On 2020-Jan-23, Robert Haas wrote:

> No, that's not it. Suppose that Álvaro Herrera has some custom
> settings he likes to put on all the PostgreSQL clusters that he uses,
> so he creates a file álvaro.conf and uses an "include" directive in
> postgresql.conf to suck in those settings. If he also likes UTF-8,
> then the file name will be stored in the file system as a 12-byte
> value of which the first two bytes will be 0xc3 0xa1. In that case,
> everything will be fine, because JSON is supposed to always be UTF-8,
> and the file name is UTF-8, and it's all good. But suppose he instead
> likes LATIN-1.

I do have files with Latin-1-encoded names in my filesystem, even though
my system is UTF-8, so I understand the problem.  I was wondering if it
would work to encode any non-UTF8-valid name using something like
base64; the encoded name will be plain ASCII and can be put in the
manifest, probably using a different field of the JSON object -- so for
a normal file you'd have { path => '1234/2345' } but for a
Latin-1-encoded file you'd have { path_base64 => '4Wx2YXJvLmNvbmYK' }.
Then it's the job of the tool to ensure it decodes the name to its
original form when creating/querying for the file.
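
On the generating side the decision could look roughly like this -- a
sketch only; pg_b64_enc_len() and pg_b64_encode() are the existing
src/common/base64.c routines (exact signatures vary by branch),
escape_json() is the backend's existing JSON escaper, and
filename_is_valid_utf8() is the hypothetical check sketched upthread:

    if (filename_is_valid_utf8(pathname, strlen(pathname)))
    {
        appendStringInfoString(buf, "\"path\": ");
        escape_json(buf, pathname);
    }
    else
    {
        int         len = strlen(pathname);
        int         b64len = pg_b64_enc_len(len);
        char       *b64 = palloc(b64len + 1);

        b64len = pg_b64_encode(pathname, len, b64, b64len);
        b64[b64len] = '\0';
        appendStringInfo(buf, "\"path_base64\": \"%s\"", b64);
    }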

A problem I have with this idea is that this is very corner-casey, so
most tool implementors will never realize that there's a need to decode
certain file names.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Bruce Momjian
Date:
On Thu, Jan 23, 2020 at 02:23:14PM -0300, Alvaro Herrera wrote:
> On 2020-Jan-23, Robert Haas wrote:
> 
> > No, that's not it. Suppose that Álvaro Herrera has some custom
> > settings he likes to put on all the PostgreSQL clusters that he uses,
> > so he creates a file álvaro.conf and uses an "include" directive in
> > postgresql.conf to suck in those settings. If he also likes UTF-8,
> > then the file name will be stored in the file system as a 12-byte
> > value of which the first two bytes will be 0xc3 0xa1. In that case,
> > everything will be fine, because JSON is supposed to always be UTF-8,
> > and the file name is UTF-8, and it's all good. But suppose he instead
> > likes LATIN-1.
> 
> I do have files with Latin-1-encoded names in my filesystem, even though
> my system is UTF-8, so I understand the problem.  I was wondering if it
> would work to encode any non-UTF8-valid name using something like
> base64; the encoded name will be plain ASCII and can be put in the
> manifest, probably using a different field of the JSON object -- so for
> a normal file you'd have { path => '1234/2345' } but for a
> Latin-1-encoded file you'd have { path_base64 => '4Wx2YXJvLmNvbmYK' }.
> Then it's the job of the tool to ensure it decodes the name to its
> original form when creating/querying for the file.
> 
> A problem I have with this idea is that this is very corner-casey, so
> most tool implementors will never realize that there's a need to decode
> certain file names.

Another idea is to use base64 for all non-ASCII file names, so we don't
need to check if the file name is valid UTF8 before outputting --- we
just need to check for non-ASCII, which is much easier.  Another
problem, though, is how do you _flag_ file names as being
base64-encoded?  Use another JSON field to specify that?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Thu, Jan 23, 2020 at 12:24 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I do have files with Latin-1-encoded names in my filesystem, even though
> my system is UTF-8, so I understand the problem.  I was wondering if it
> would work to encode any non-UTF8-valid name using something like
> base64; the encoded name will be plain ASCII and can be put in the
> manifest, probably using a different field of the JSON object -- so for
> a normal file you'd have { path => '1234/2345' } but for a
> Latin-1-encoded file you'd have { path_base64 => '4Wx2YXJvLmNvbmYK' }.
> Then it's the job of the tool to ensure it decodes the name to its
> original form when creating/querying for the file.

Right. That's what I meant, a couple of messages back, when I
mentioned an extra layer of escaping, but your explanation here is
better because it's more detailed.

> A problem I have with this idea is that this is very corner-casey, so
> most tool implementors will never realize that there's a need to decode
> certain file names.

That's a valid concern. I would not necessarily have thought that
out-of-core tools would find a lot of use in reading them, provided
PostgreSQL itself both knows how to generate them and how to validate
them, but the interest in this topic suggests that people do care
about that.

Mostly, I think this issue shows the folly of imagining that putting
everything into JSON is a good idea because it gets rid of escaping
problems. Actually, what it does is create multiple kinds of escaping
problems. With the format I proposed, you only have to worry that the
file name might contain a tab character, because in that format, tab
is the delimiter. But, if we use JSON, then we've got the same problem
with JSON's delimiter, namely a double quote, which the JSON parser
will solve for you.  We then have this additional and somewhat obscure
problem with invalidly encoded data, to which JSON itself provides no
solution. We have to invent our own, probably along the lines of what
you have just proposed. I think one can reasonably wonder whether this
is really an improvement over just inventing a way to escape tabs.

That said, there are other reasons to want to go with JSON, most of
all the fact that it's easy to see how to extend the format to
additional fields. Once you decide that each file will have an object,
you can add any keys that you like to that object and things should
scale up nicely. It has been my contention that we probably will not
find the need to add much more here, but such arguments are always
suspect and have a good chance of being wrong. Also, such prophecies
can be self-fulfilling: if the format is easy to extend, then people
may extend it, whereas if it is hard to extend, they may not try, or
they may try and then give up.

At the end of the day, I'm willing to make this work either way. I do
not think that my original proposal was bad, but there were things not
to like about it. There are also things not to like about using a
JSON-based format, and this seems to me to be a fairly significant
one. However, both sets of problems are solvable, and neither design
is awful. It's just a question of which kind of warts we like better.
To be blunt, I've already spent a lot more effort on this problem than
I would have liked, and more than 90% of it has been spent on the
issue of how to format a file that only PostgreSQL needs to read and
write. While I do not think that good file formats are unimportant, I
remain unconvinced that switching to JSON is making things better. It
seems like it's just making them different, while inflating the amount
of coding required by a fairly significant multiple.

That being said, unless somebody objects in the next few days, I am
going to assume that the people who preferred JSON over a
tab-separated file are also happy with the idea of using base-64
encoding as proposed by you above to represent files whose names are
not valid UTF-8 sequences; and I will then go rewrite the patch that
generates the manifests to use that format, rewrite the validator
patch to parse that format using this infrastructure, and hopefully
end up with something that can be reviewed and committed before we run
out of time to get things done for this release. If anybody wants to
vote for another plan, please vote soon.

In the meantime, any review of the new patches I posted here yesterday
would be warmly appreciated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Thu, Jan 23, 2020 at 12:49 PM Bruce Momjian <bruce@momjian.us> wrote:
> Another idea is to use base64 for all non-ASCII file names, so we don't
> need to check if the file name is valid UTF8 before outputting --- we
> just need to check for non-ASCII, which is much easier.

I think that we have the infrastructure available to check in a
convenient way whether it's valid as UTF-8, so this might not be
necessary, but I will look into it further unless there is a consensus
to go another direction entirely.

> Another
> problem, though, is how do you _flag_ file names as being
> base64-encoded?  Use another JSON field to specify that?

Alvaro's proposed solution in the message to which you replied was to
call the field either 'path' or 'path_base64' depending on whether
base-64 escaping was used. That seems better to me than having a field
called 'path' and a separate field called 'is_path_base64' or
whatever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Bruce Momjian
Date:
On Thu, Jan 23, 2020 at 01:05:50PM -0500, Robert Haas wrote:
> > Another
> > problem, though, is how do you _flag_ file names as being
> > base64-encoded?  Use another JSON field to specify that?
> 
> Alvaro's proposed solution in the message to which you replied was to
> call the field either 'path' or 'path_base64' depending on whether
> base-64 escaping was used. That seems better to me than having a field
> called 'path' and a separate field called 'is_path_base64' or
> whatever.

Hmm, so the JSON key name is the flag --- interesting.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: making the backend's json parser work in frontend code

From
Alvaro Herrera
Date:
On 2020-Jan-23, Bruce Momjian wrote:

> On Thu, Jan 23, 2020 at 01:05:50PM -0500, Robert Haas wrote:
> > > Another
> > > problem, though, is how do you _flag_ file names as being
> > > base64-encoded?  Use another JSON field to specify that?
> > 
> > Alvaro's proposed solution in the message to which you replied was to
> > call the field either 'path' or 'path_base64' depending on whether
> > base-64 escaping was used. That seems better to me than having a field
> > called 'path' and a separate field called 'is_path_base64' or
> > whatever.
> 
> Hmm, so the JSON key name is the flag --- interesting.

Yes, because if you use the same key name, you risk a dumb tool writing
the file name as the encoded name.  That's worse because it's harder to
figure out that it's wrong.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Bruce Momjian
Date:
On Thu, Jan 23, 2020 at 03:20:27PM -0300, Alvaro Herrera wrote:
> On 2020-Jan-23, Bruce Momjian wrote:
> 
> > On Thu, Jan 23, 2020 at 01:05:50PM -0500, Robert Haas wrote:
> > > > Another
> > > > problem, though, is how do you _flag_ file names as being
> > > > base64-encoded?  Use another JSON field to specify that?
> > > 
> > > Alvaro's proposed solution in the message to which you replied was to
> > > call the field either 'path' or 'path_base64' depending on whether
> > > base-64 escaping was used. That seems better to me than having a field
> > > called 'path' and a separate field called 'is_path_base64' or
> > > whatever.
> > 
> > Hmm, so the JSON key name is the flag --- interesting.
> 
> Yes, because if you use the same key name, you risk a dumb tool writing
> the file name as the encoded name.  That's worse because it's harder to
> figure out that it's wrong.

Yes, good point.  I think my one concern is that someone might specify
both keys in the JSON, which would be very odd.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Thu, Jan 23, 2020 at 1:22 PM Bruce Momjian <bruce@momjian.us> wrote:
> Yes, good point.  I think my one concern is that someone might specify
> both keys in the JSON, which would be very odd.

I think that if a tool other than PostgreSQL chooses to generate a
PostgreSQL backup manifest, it must take care to do it in a manner that
is compatible with what PostgreSQL would generate. If it doesn't,
well, that sucks for them, but we can't prevent other people from
writing bad code. On a very good day, we can prevent ourselves from
writing bad code.

There is in general the question of how rigorous PostgreSQL ought to
be when validating a backup manifest. The proposal on the table is to
store four (4) fields per file: name, size, last modification time,
and checksum. So a JSON object representing a file should have four
keys, say "path", "size", "mtime", and "checksum". The "checksum" key
could perhaps be optional, in case the user disables checksums, or we
could represent that case in some other way, like "checksum" => null,
"checksum" => "", or "checksum" => "NONE". There is an almost
unlimited scope for bike-shedding here, but let's leave that to one
side for the moment.
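
Just for illustration (every value here is invented), an entry would
then look something like:

    {"path": "base/16384/16385", "size": 8192,
     "mtime": "2020-01-23 12:34:56 EST", "checksum": "SHA-256:9a3b..."}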

Suppose that someone asks PostgreSQL's backup manifest verification
tool to validate a backup manifest where there's an extra key. Say, in
addition to the four keys listed in the previous paragraph, there is
an additional key, "momjian". On the one hand, our backup manifest
verification tool could take this as a sign that the manifest is
invalid, and accordingly throw an error. Or, it could assume that some
third-party backup tool generated the backup manifest and that the
"momjian" field is there to track something which is of interest to
that tool but not to PostgreSQL core, in which case it should just be
ignored.

Incidentally, some research seems to suggest that the problem of
filenames which don't form a valid UTF-8 sequence cannot occur on
Windows. This blog post may be helpful in understanding the issues:

http://beets.io/blog/paths.html

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Bruce Momjian
Date:
On Thu, Jan 23, 2020 at 02:00:23PM -0500, Robert Haas wrote:
> Incidentally, some research seems to suggest that the problem of
> filenames which don't form a valid UTF-8 sequence cannot occur on
> Windows. This blog post may be helpful in understanding the issues:
> 
> http://beets.io/blog/paths.html

Is there any danger of assuming a non-UTF8 sequence to be UTF8 even when
it isn't, except that it displays oddly?  I am thinking of a sequence
that wasn't meant as UTF8 but happens to be valid UTF8.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: making the backend's json parser work in frontend code

From
"Daniel Verite"
Date:
    Robert Haas wrote:

> With the format I proposed, you only have to worry that the
> file name might contain a tab character, because in that format, tab
> is the delimiter

It could be CSV, which has this problem already solved,
is easier to parse than JSON, certainly no less popular,
and is not bound to a specific encoding.
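
For instance (hypothetical values), even a name containing a comma or
a double quote needs nothing beyond CSV's usual quoting rules:

    "weird,""name",8192,2020-01-23 12:34:56,SHA-256:9a3b...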


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Thu, Jan 23, 2020 at 2:08 PM Daniel Verite <daniel@manitou-mail.org> wrote:
> It could be CSV, which has this problem already solved,
> is easier to parse than JSON, certainly no less popular,
> and is not bound to a specific encoding.

Sure. I don't think that would look quite as nice visually as what I
proposed when inspected by humans, and our default COPY output format
is tab-separated rather than comma-separated. However, if CSV would be
more acceptable, great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Alvaro Herrera
Date:
On 2020-Jan-23, Bruce Momjian wrote:

> Yes, good point.  I think my one concern is that someone might specify
> both keys in the JSON, which would be very odd.

Just make that a reason to raise an error.  I think it's even possible
to specify that as a JSON Schema constraint, using a "oneOf" predicate.
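
As a sketch (the rest of the schema omitted), something like:

    "oneOf": [
        { "required": ["path"] },
        { "required": ["path_base64"] }
    ]

An object with both keys matches both branches, and one with neither
key matches none, so "oneOf" rejects both cases.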

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:
> On Jan 22, 2020, at 7:00 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> I have this done in my local repo to the point that I can build frontend tools against the json parser that is now in
src/common and also run all the check-world tests without failure.  I’m planning to post my work soon, possibly tonight
if I don’t run out of time, but more likely tomorrow.

Ok, I finished merging with Robert’s patches.  The attached follow his numbering, with my patches intended to be
applied after his.

I tried not to change his work too much, but I did a bit of refactoring in 0010, as explained in the commit comment.

0011 is just for verifying the linking works ok and the json parser can be invoked from a frontend tool without error —
I don’t really see the point in committing it.

I ran some benchmarks for json parsing in the backend both before and after these patches, with very slight changes in
runtime.  The setup for the benchmark creates an unlogged table with a single text column and loads rows of
json-formatted text:

CREATE UNLOGGED TABLE benchmark (
    j text
);
COPY benchmark (j) FROM '/Users/mark.dilger/bench/json.csv';


FYI:

wc ~/bench/json.csv
     107 34465023 503364244 /Users/mark.dilger/bench/json.csv

The benchmark itself casts the text column to jsonb, as follows:

SELECT jsonb_typeof(j::jsonb) typ, COUNT(*) FROM benchmark GROUP BY typ;

In summary, the times are:

    pristine    patched
    —————    —————
    11.985    12.237
    12.200    11.992
    11.691    11.896
    11.847    11.833
    11.722    11.936

The full output for the runtimes without the patch over five iterations:


CREATE TABLE
COPY 107
  typ   | count
--------+-------
 object |   107
(1 row)


real    0m11.985s
user    0m0.002s
sys    0m0.003s
  typ   | count
--------+-------
 object |   107
(1 row)


real    0m12.200s
user    0m0.002s
sys    0m0.004s
  typ   | count
--------+-------
 object |   107
(1 row)


real    0m11.691s
user    0m0.002s
sys    0m0.003s
  typ   | count
--------+-------
 object |   107
(1 row)


real    0m11.847s
user    0m0.002s
sys    0m0.004s
  typ   | count
--------+-------
 object |   107
(1 row)


real    0m11.722s
user    0m0.002s
sys    0m0.003s


And with the patch, also five iterations:


CREATE TABLE
COPY 107
  typ   | count
--------+-------
 object |   107
(1 row)


real    0m12.237s
user    0m0.002s
sys    0m0.004s
  typ   | count
--------+-------
 object |   107
(1 row)


real    0m11.992s
user    0m0.002s
sys    0m0.004s
  typ   | count
--------+-------
 object |   107
(1 row)


real    0m11.896s
user    0m0.002s
sys    0m0.004s
  typ   | count
--------+-------
 object |   107
(1 row)


real    0m11.833s
user    0m0.002s
sys    0m0.004s
  typ   | count
--------+-------
 object |   107
(1 row)


real    0m11.936s
user    0m0.002s
sys    0m0.004s


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Attachment

Re: making the backend's json parser work in frontend code

From
Andrew Dunstan
Date:
On Fri, Jan 24, 2020 at 7:35 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
> > On Jan 22, 2020, at 7:00 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> >
> > I have this done in my local repo to the point that I can build frontend tools against the json parser that is now
in src/common and also run all the check-world tests without failure.  I’m planning to post my work soon, possibly
tonight if I don’t run out of time, but more likely tomorrow.
>
> Ok, I finished merging with Robert’s patches.  The attached follow his numbering, with my patches intended to be
applied after his.
>
> I tried not to change his work too much, but I did a bit of refactoring in 0010, as explained in the commit comment.
>
> 0011 is just for verifying the linking works ok and the json parser can be invoked from a frontend tool without error
— I don’t really see the point in committing it.
>
> I ran some benchmarks for json parsing in the backend both before and after these patches, with very slight changes
in runtime.  The setup for the benchmark creates an unlogged table with a single text column and loads rows of
json-formatted text:
>
> CREATE UNLOGGED TABLE benchmark (
>     j text
> );
> COPY benchmark (j) FROM '/Users/mark.dilger/bench/json.csv';
>
>
> FYI:
>
> wc ~/bench/json.csv
>      107 34465023 503364244 /Users/mark.dilger/bench/json.csv
>
> The benchmark itself casts the text column to jsonb, as follows:
>
> SELECT jsonb_typeof(j::jsonb) typ, COUNT(*) FROM benchmark GROUP BY typ;
>
> In summary, the times are:
>
>         pristine        patched
>         —————   —————
>         11.985  12.237
>         12.200  11.992
>         11.691  11.896
>         11.847  11.833
>         11.722  11.936
>

OK, nothing noticeable there.

"accept" is a common utility I've used in the past with parsers of
this kind, but inlining it seems perfectly reasonable.

I've reviewed these patches and Robert's, and they seem basically good to me.

But I don't think src/bin is the right place for the test program. I
assume we're not going to ship this program, so it really belongs in
src/test somewhere, I think. It should also have a TAP test.

cheers

andrew



--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 23, 2020, at 4:27 PM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> On Fri, Jan 24, 2020 at 7:35 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>>
>>
>>> On Jan 22, 2020, at 7:00 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>>>
>>> I have this done in my local repo to the point that I can build frontend tools against the json parser that is now
in src/common and also run all the check-world tests without failure.  I’m planning to post my work soon, possibly
tonight if I don’t run out of time, but more likely tomorrow.
>>
>> Ok, I finished merging with Robert’s patches.  The attached follow his numbering, with my patches intended to be
applied after his.
>>
>> I tried not to change his work too much, but I did a bit of refactoring in 0010, as explained in the commit comment.
>>
>> 0011 is just for verifying the linking works ok and the json parser can be invoked from a frontend tool without
error — I don’t really see the point in committing it.
>>
>> I ran some benchmarks for json parsing in the backend both before and after these patches, with very slight changes
in runtime.  The setup for the benchmark creates an unlogged table with a single text column and loads rows of
json-formatted text:
>>
>> CREATE UNLOGGED TABLE benchmark (
>>    j text
>> );
>> COPY benchmark (j) FROM '/Users/mark.dilger/bench/json.csv';
>>
>>
>> FYI:
>>
>> wc ~/bench/json.csv
>>     107 34465023 503364244 /Users/mark.dilger/bench/json.csv
>>
>> The benchmark itself casts the text column to jsonb, as follows:
>>
>> SELECT jsonb_typeof(j::jsonb) typ, COUNT(*) FROM benchmark GROUP BY typ;
>>
>> In summary, the times are:
>>
>>        pristine        patched
>>        —————   —————
>>        11.985  12.237
>>        12.200  11.992
>>        11.691  11.896
>>        11.847  11.833
>>        11.722  11.936
>>
>
> OK, nothing noticeable there.
>
> "accept" is a common utility I've used in the past with parsers of
> this kind, but inlining it seems perfectly reasonable.
>
> I've reviewed these patches and Robert's, and they seem basically good to me.

Thanks for the review!

> But I don't think src/bin is the right place for the test program. I
> assume we're not going to ship this program, so it really belongs in
> src/test somewhere, I think. It should also have a TAP test.

Ok, I’ll go do that; thanks for the suggestion.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: making the backend's json parser work in frontend code

From
Peter Eisentraut
Date:
On 2020-01-23 18:04, Robert Haas wrote:
> Now, you might say "well, why don't we just do an encoding
> conversion?", but we can't. When the filesystem tells us what the file
> names are, it does not tell us what encoding the person who created
> those files had in mind. We don't know that they had *any* encoding in
> mind. IIUC, a file in the data directory can have a name that consists
> of any sequence of bytes whatsoever, so long as it doesn't contain
> prohibited characters like a path separator or \0 byte. But only some
> of those possible octet sequences can be stored in a manifest that has
> to be valid UTF-8.

I think it wouldn't be unreasonable to require that file names in the 
database directory be consistently encoded (as defined by pg_control, 
probably).  After all, this information is sometimes also shown in 
system views, so it's already difficult to process total junk.  In 
practice, this shouldn't be an onerous requirement.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 2020-01-23 18:04, Robert Haas wrote:
>> Now, you might say "well, why don't we just do an encoding
>> conversion?", but we can't. When the filesystem tells us what the file
>> names are, it does not tell us what encoding the person who created
>> those files had in mind. We don't know that they had *any* encoding in
>> mind. IIUC, a file in the data directory can have a name that consists
>> of any sequence of bytes whatsoever, so long as it doesn't contain
>> prohibited characters like a path separator or \0 byte. But only some
>> of those possible octet sequences can be stored in a manifest that has
>> to be valid UTF-8.

> I think it wouldn't be unreasonable to require that file names in the 
> database directory be consistently encoded (as defined by pg_control, 
> probably).  After all, this information is sometimes also shown in 
> system views, so it's already difficult to process total junk.  In 
> practice, this shouldn't be an onerous requirement.

I don't entirely follow why we're discussing this at all, if the
requirement is backing up a PG data directory.  There are not, and
are never likely to be, any legitimate files with non-ASCII names
in that context.  Why can't we just skip any such files?

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
David Steele
Date:
On 1/23/20 11:05 AM, Robert Haas wrote:
 > On Thu, Jan 23, 2020 at 12:49 PM Bruce Momjian <bruce@momjian.us> wrote:
 >> Another idea is to use base64 for all non-ASCII file names, so we don't
 >> need to check if the file name is valid UTF8 before outputting --- we
 >> just need to check for non-ASCII, which is much easier.
 >
 > I think that we have the infrastructure available to check in a
 > convenient way whether it's valid as UTF-8, so this might not be
 > necessary, but I will look into it further unless there is a consensus
 > to go another direction entirely.
 >
 >> Another
 >> problem, though, is how do you _flag_ file names as being
 >> base64-encoded?  Use another JSON field to specify that?
 >
 > Alvaro's proposed solution in the message to which you replied was to
 > call the field either 'path' or 'path_base64' depending on whether
 > base-64 escaping was used. That seems better to me than having a field
 > called 'path' and a separate field called 'is_path_base64' or
 > whatever.

+1. I'm not excited about this solution but don't have a better idea.

It might be nice to have a strict mode where non-ASCII/UTF8 characters 
will error instead, but that can be added on later.

Regards,
-- 
-David
david@pgmasters.net



Re: making the backend's json parser work in frontend code

From
David Steele
Date:
On 1/24/20 9:27 AM, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 2020-01-23 18:04, Robert Haas wrote:
>>> Now, you might say "well, why don't we just do an encoding
>>> conversion?", but we can't. When the filesystem tells us what the file
>>> names are, it does not tell us what encoding the person who created
>>> those files had in mind. We don't know that they had *any* encoding in
>>> mind. IIUC, a file in the data directory can have a name that consists
>>> of any sequence of bytes whatsoever, so long as it doesn't contain
>>> prohibited characters like a path separator or \0 byte. But only some
>>> of those possible octet sequences can be stored in a manifest that has
>>> to be valid UTF-8.
> 
>> I think it wouldn't be unreasonable to require that file names in the
>> database directory be consistently encoded (as defined by pg_control,
>> probably).  After all, this information is sometimes also shown in
>> system views, so it's already difficult to process total junk.  In
>> practice, this shouldn't be an onerous requirement.
> 
> I don't entirely follow why we're discussing this at all, if the
> requirement is backing up a PG data directory.  There are not, and
> are never likely to be, any legitimate files with non-ASCII names
> in that context.  Why can't we just skip any such files?

It's not uncommon in my experience for users to drop odd files into 
PGDATA (usually versioned copies of postgresql.conf, etc.), but I agree 
that it should be discouraged.  Even so, I don't recall ever seeing any 
non-ASCII filenames.

Skipping files sounds scary, I'd prefer an error or a warning (and then 
base64 encode the filename).

Regards,
-- 
-David
david@pgmasters.net



Re: making the backend's json parser work in frontend code

From
Alvaro Herrera
Date:
On 2020-Jan-24, David Steele wrote:

> It might be nice to have a strict mode where non-ASCII/UTF8 characters will
> error instead, but that can be added on later.

"your backup failed because you have a file we don't like" is not great
behavior.  IIRC we already fail when a file is owned by root (or maybe
unreadable and owned by root), and it messes up severely when people
edit postgresql.conf as root.  Let's not add more cases of that sort.

Maybe we can get away with *ignoring* such files, perhaps after emitting
a warning.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
David Steele
Date:
On 1/24/20 10:00 AM, Alvaro Herrera wrote:
> On 2020-Jan-24, David Steele wrote:
> 
>> It might be nice to have a strict mode where non-ASCII/UTF8 characters will
>> error instead, but that can be added on later.
> 
> "your backup failed because you have a file we don't like" is not great
> behavior.  IIRC we already fail when a file is owned by root (or maybe
> unreadable and owned by root), and it messes up severely when people
> edit postgresql.conf as root.  Let's not add more cases of that sort.

My intention was that the strict mode would not be the default, so I 
don't see why it would be a big issue.

> Maybe we can get away with *ignoring* such files, perhaps after emitting
> a warning.

I'd prefer an error (or base64 encoding) rather than just skipping a
file.  The latter sounds scary.

Regards,
-- 
-David
david@pgmasters.net



Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 24, 2020, at 8:36 AM, David Steele <david@pgmasters.net> wrote:
>
>> I don't entirely follow why we're discussing this at all, if the
>> requirement is backing up a PG data directory.  There are not, and
>> are never likely to be, any legitimate files with non-ASCII names
>> in that context.  Why can't we just skip any such files?
>
> It's not uncommon in my experience for users to drop odd files into PGDATA (usually versioned copies of
postgresql.conf, etc.), but I agree that it should be discouraged.  Even so, I don't recall ever seeing any non-ASCII
filenames.
>
> Skipping files sounds scary, I'd prefer an error or a warning (and then base64 encode the filename).

I tend to agree with Tom.  We know that postgres doesn’t write any such files now, and if we ever decided to change
that, we could change this, too.  So for now, we can assume any such files are not ours.  Either the user manually
scribbled in this directory, or had a tool (antivirus checksum file, vim .WHATEVER.swp file, etc.) that did so.  Raising
an error would break any automated backup process that hit this issue, and base64 encoding the file name and backing
up the file contents could grab data that the user would not reasonably expect in the backup.  But this argument
applies equally well to such files regardless of filename encoding.  It would be odd to back them up when they happen
to be valid UTF-8/ASCII/whatever, but not do so when they are not valid.  I would expect, therefore, that we only back
up files which match our expected file name pattern and ignore (perhaps with a warning) everything else.

Quoting from Robert’s email about why we want a backup manifest seems to support this idea, at least as I see it:

> So, let's suppose we invent a backup manifest. What should it contain?
> I imagine that it would consist of a list of files, and the lengths of
> those files, and a checksum for each file. I think you should have a
> choice of what kind of checksums to use, because algorithms that used
> to seem like good choices (e.g. MD5) no longer do; this trend can
> probably be expected to continue. Even if we initially support only
> one kind of checksum -- presumably SHA-something since we have code
> for that already for SCRAM -- I think that it would also be a good
> idea to allow for future changes. And maybe it's best to just allow a
> choice of SHA-224, SHA-256, SHA-384, and SHA-512 right out of the
> gate, so that we can avoid bikeshedding over which one is secure
> enough. I guess we'll still have to argue about the default. I also
> think that it should be possible to build a manifest with no
> checksums, so that one need not pay the overhead of computing
> checksums if one does not wish. Of course, such a manifest is of much
> less utility for checking backup integrity, but you can still check
> that you've got the right files, which is noticeably better than
> nothing.  The manifest should probably also contain a checksum of its
> own contents so that the integrity of the manifest itself can be
> verified. And maybe a few other bits of metadata, but I'm not sure
> exactly what.  Ideas?
>
>
>
> Once we invent the concept of a backup manifest, what do we need to do
> with them? I think we'd want three things initially:
>
>
>
> (1) When taking a backup, have the option (perhaps enabled by default)
> to include a backup manifest.
> (2) Given an existing backup that has not got a manifest, construct one.
> (3) Cross-check a manifest against a backup and complain about extra
> files, missing files, size differences, or checksum mismatches.


Nothing in there sounds to me like it needs to include random cruft.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: making the backend's json parser work in frontend code

From
Alvaro Herrera
Date:
On 2020-Jan-24, David Steele wrote:

> On 1/24/20 10:00 AM, Alvaro Herrera wrote:

> > Maybe we can get away with *ignoring* such files, perhaps after emitting
> > a warning.
> 
> I'd prefer an an error (or base64 encoding) rather than just skipping a
> file.  The latter sounds scary.

Well, if the file is "invalid" then evidently Postgres cannot possibly
care about it, so why would it care if it's missing from the backup?

I prefer the encoding scheme myself.  I don't see the point of the
error.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I prefer the encoding scheme myself.  I don't see the point of the
> error.

Yeah, if we don't want to skip such files, then storing them using
a base64-encoded name (with a different key than regular names)
seems plausible.  But I don't really see why we'd go to that much
trouble, nor why we'd think it's likely that tools would correctly
handle a case that is going to have 0.00% usage in the field.

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Alvaro Herrera
Date:
On 2020-Jan-24, Mark Dilger wrote:

> I would expect, therefore, that we only back up files which match our
> expected file name pattern and ignore (perhaps with a warning)
> everything else.

That risks missing files placed in the datadir by extensions; see
discussion about pg_checksums using a whitelist[1], which does not
translate directly to this problem, because omitting to checksum a file
is not the same as failing to copy a file into the backups.
(Essentially, the backups would be incomplete.)

[1] https://postgr.es/m/20181019171747.4uithw2sjkt6msne@alap3.anarazel.de

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Jan-24, Mark Dilger wrote:
>> I would expect, therefore, that we only back up files which match our
>> expected file name pattern and ignore (perhaps with a warning)
>> everything else.

> That risks missing files placed in the datadir by extensions;

I agree that assuming we know everything that will appear in the
data directory is a pretty unsafe assumption.  But no rational
extension is going to use a non-ASCII file name, either, if only
because it can't predict what the filesystem encoding will be.

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Fri, Jan 24, 2020 at 9:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > I prefer the encoding scheme myself.  I don't see the point of the
> > error.
>
> Yeah, if we don't want to skip such files, then storing them using
> a base64-encoded name (with a different key than regular names)
> seems plausible.  But I don't really see why we'd go to that much
> trouble, nor why we'd think it's likely that tools would correctly
> handle a case that is going to have 0.00% usage in the field.

I mean, I gave a not-totally-unrealistic example of how this could
happen upthread. I agree it's going to be rare, but it's not usually
OK to decide that if a user does something a little unusual,
not-obviously-related features subtly break.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Alvaro Herrera
Date:
On 2020-Jan-24, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2020-Jan-24, Mark Dilger wrote:
> >> I would expect, therefore, that we only back up files which match our
> >> expected file name pattern and ignore (perhaps with a warning)
> >> everything else.
> 
> > That risks missing files placed in the datadir by extensions;
> 
> I agree that assuming we know everything that will appear in the
> data directory is a pretty unsafe assumption.  But no rational
> extension is going to use a non-ASCII file name, either, if only
> because it can't predict what the filesystem encoding will be.

I see two different arguments. One is about the file encoding. Those
files are rare and would be placed by the user manually.  We can fix
that by encoding the name.  We can have a debug mode that encodes all
names that way, just to ensure the tools are prepared for it.

The other is Mark's point about "expected file pattern", which seems a
slippery slope to me.  If the pattern is /^[a-zA-Z0-9_.]*$/ then I'm
okay with it (maybe add a few other punctuation chars); as you say no
sane extension would use names much weirder than that.  But we should
not be stricter, such as counting the number of periods/underscores
allowed or where alpha chars are expected (except maybe disallow a
period at the start of a filename), or anything too specific like that.
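
A minimal sketch of such a check, one path component at a time (the
function name is made up, and '-' stands in for the "few other
punctuation chars"):

    #include <stdbool.h>

    /* Allow ASCII letters, digits, and a little punctuation. */
    static bool
    filename_looks_sane(const char *name)
    {
        if (name[0] == '\0' || name[0] == '.')  /* no empty name, no leading period */
            return false;
        for (const char *p = name; *p; p++)
        {
            if (!((*p >= 'a' && *p <= 'z') ||
                  (*p >= 'A' && *p <= 'Z') ||
                  (*p >= '0' && *p <= '9') ||
                  *p == '_' || *p == '.' || *p == '-'))
                return false;
        }
        return true;
    }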

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Peter Eisentraut
Date:
On 2020-01-24 18:56, Robert Haas wrote:
> On Fri, Jan 24, 2020 at 9:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> I prefer the encoding scheme myself.  I don't see the point of the
>>> error.
>>
>> Yeah, if we don't want to skip such files, then storing them using
>> a base64-encoded name (with a different key than regular names)
>> seems plausible.  But I don't really see why we'd go to that much
>> trouble, nor why we'd think it's likely that tools would correctly
>> handle a case that is going to have 0.00% usage in the field.
> 
> I mean, I gave a not-totally-unrealistic example of how this could
> happen upthread. I agree it's going to be rare, but it's not usually
> OK to decide that if a user does something a little unusual,
> not-obviously-related features subtly break.

Another example might be log files under pg_log with localized weekday 
or month names.  (Maybe we're not planning to back up log files, but the 
routines that deal with file names should probably be prepared to at 
least look at the name and decide that they don't care about it rather 
than freaking out right away.)

I'm not fond of the base64 idea btw., because it seems to sort of 
penalize using non-ASCII characters by making the result completely not 
human readable.  Something along the lines of MIME would be better in 
that way.  There are existing solutions to storing data with metadata 
around it.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 24, 2020, at 10:03 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> The other is Mark's point about "expected file pattern", which seems a
> slippery slope to me.  If the pattern is /^[a-zA-Z0-9_.]*$/ then I'm
> okay with it (maybe add a few other punctuation chars); as you say no
> sane extension would use names much weirder than that.  But we should
> not be stricter, such as counting the number of periods/underscores
> allowed or where are alpha chars expected (except maybe disallow period
> at start of filename), or anything too specific like that.

What bothered me about skipping files based only on encoding is that it creates hard-to-anticipate bugs.  If extensions
embed something, like a customer name, into a filename, and that something is usually ASCII, or usually valid UTF-8,
and gets backed up, but then some day they embed something that is not ASCII/UTF-8, then it does not get backed up, and
maybe nobody notices until they actually *need* the backup, and it’s too late.

We either need to be really strict about what gets backed up, so that nobody gets a false sense of security about what
gets included in that list, or we need to be completely permissive, which would include files named in arbitrary
encodings.  I don’t see how it does anybody any favors to make the system appear to back up everything until you hit
this unanticipated case and then it fails.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Thu, Jan 23, 2020 at 1:05 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Ok, I finished merging with Robert’s patches.  The attached follow his numbering, with my patches intended to be
applied after his.

I think it'd be a good idea to move the pg_wchar.h stuff into a new
thread. This thread is getting a bit complicated, because we've got
(1) the patches needed to do $SUBJECT, plus (2) additional patches that
clean up the multibyte stuff more plus (3) discussion of issues that
pertain to the backup manifest thread. To my knowledge, $SUBJECT
doesn't strictly require the pg_wchar.h changes, so I suggest we try
to segregate those.

> I tried not to change his work too much, but I did a bit of refactoring in 0010, as explained in the commit comment.

Hmm, I generally prefer to avoid these kinds of macro tricks because I
think they can be confusing to the reader. It's worth it in a case
like equalfuncs.c where so much boilerplate code is saved that the
gain in readability more than makes up for having to go check what the
macros do -- but I don't feel that's the case here. There aren't
*that* many call sites, and I think the code will be easier to
understand without "return" statements concealed within macros...

> I ran some benchmarks for json parsing in the backend both before and after these patches, with very slight changes
in runtime.

Cool, thanks.

Since 0001-0003 have been reviewed by multiple people and nobody's
objected, I have committed those. But I made a hash of it: on the first
one, I failed to credit any reviewers or include a Discussion link,
and I just realized that I should have listed Alvaro's name as a
reviewer also. Sorry about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Alvaro Herrera
Date:
On 2020-Jan-24, Peter Eisentraut wrote:

> I'm not fond of the base64 idea btw., because it seems to sort of penalize
> using non-ASCII characters by making the result completely not human
> readable.  Something along the lines of MIME would be better in that way.
> There are existing solutions to storing data with metadata around it.

You mean quoted-printable?  That works for me.
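
For example, a Latin-1 name like "álvaro.conf" (where the "á" is the
single byte 0xE1) would come out as

    =E1lvaro.conf

which stays mostly readable, unlike the base64 spelling.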

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 24, 2020, at 10:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Since 0001-0003 have been reviewed by multiple people and nobody's
> objected, I have committed those.

I think 0004-0005 have been reviewed and accepted by both me and Andrew, if I understood him correctly:

> I've reviewed these patches and Robert's, and they seem basically good to me.

Certainly, nothing in those two patches caused me any concern.  I’m going to modify my patches as you suggested, get
rid of the INSIST macro, and move the pg_wchar changes to their own thread.  None of that should require changes in
your 0004 or 0005.  It won’t bother me if you commit those two.  Andrew?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 22, 2020, at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> 0004 is a substantially cleaned up version of the patch to make the
> JSON parser return a result code rather than throwing errors. Names
> have been fixed, interfaces have been tidied up, and the thing is
> better integrated with the surrounding code. I would really like
> comments, if anyone has them, on whether this approach is acceptable.
>
> 0005 builds on 0004 by moving three functions from jsonapi.c to
> jsonfuncs.c. With that done, jsonapi.c has minimal remaining
> dependencies on the backend environment. It would still need a
> substitute for elog(ERROR, "some internal thing is broken"); I'm
> thinking of using pg_log_fatal() for that case. It would also need a
> fix for the problem that pg_mblen() is not available in the front-end
> environment. I don't know what to do about that yet exactly, but it
> doesn't seem unsolvable. The frontend environment just needs to know
> which encoding to use, and needs a way to call PQmblen() rather than
> pg_mblen().

I have completed the work in the attached 0006 and 0007 patches.
These are intended to apply after your 0004 and 0005; they won’t
work directly on master which, as of this writing, only contains your
0001-0003 patches.

0006 finishes moving the json parser to src/include/common and src/common.

0007 adds testing.

I would appreciate somebody looking at the portability issues for 0007
on Windows.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Attachment

Re: making the backend's json parser work in frontend code

From
Andrew Dunstan
Date:
On Sat, Jan 25, 2020 at 6:20 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Jan 24, 2020, at 10:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > Since 0001-0003 have been reviewed by multiple people and nobody's
> > objected, I have committed those.
>
> I think 0004-0005 have been reviewed and accepted by both me and Andrew, if I understood him correctly:
>
> > I've reviewed these patches and Robert's, and they seem basically good to me.
>
> Certainly, nothing in those two patches caused me any concern.  I’m going to modify my patches as you suggested, get
rid of the INSIST macro, and move the pg_wchar changes to their own thread.  None of that should require changes in
your 0004 or 0005.  It won’t bother me if you commit those two.  Andrew?
>


Just reviewed the latest versions of 4 and 5, they look good to me.

cheers

andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Andrew Dunstan
Date:
On Mon, Jan 27, 2020 at 5:54 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Jan 22, 2020, at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > 0004 is a substantially cleaned up version of the patch to make the
> > JSON parser return a result code rather than throwing errors. Names
> > have been fixed, interfaces have been tidied up, and the thing is
> > better integrated with the surrounding code. I would really like
> > comments, if anyone has them, on whether this approach is acceptable.
> >
> > 0005 builds on 0004 by moving three functions from jsonapi.c to
> > jsonfuncs.c. With that done, jsonapi.c has minimal remaining
> > dependencies on the backend environment. It would still need a
> > substitute for elog(ERROR, "some internal thing is broken"); I'm
> > thinking of using pg_log_fatal() for that case. It would also need a
> > fix for the problem that pg_mblen() is not available in the front-end
> > environment. I don't know what to do about that yet exactly, but it
> > doesn't seem unsolvable. The frontend environment just needs to know
> > which encoding to use, and needs a way to call PQmblen() rather than
> > pg_mblen().
>
> I have completed the work in the attached 0006 and 0007 patches.
> These are intended to apply after your 0004 and 0005; they won’t
> work directly on master which, as of this writing, only contains your
> 0001-0003 patches.
>
> 0006 finishes moving the json parser to src/include/common and src/common.
>
> 0007 adds testing.
>
> I would appreciate somebody looking at the portability issues for 0007
> on Windows.
>

We'll need at a minimum something added to src/tools/msvc to build the
test program, maybe some other stuff too. I'll take a look.

cheers

andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 26, 2020, at 5:09 PM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> We'll need at a minimum something added to src/tools/msvc to build the
> test program, maybe some other stuff too. I'll take a look.

Thanks, much appreciated.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: making the backend's json parser work in frontend code

From
Andrew Dunstan
Date:
> > 0007 adds testing.
> >
> > I would appreciate somebody looking at the portability issues for 0007
> > on Windows.
> >
>
> We'll need at a minimum something added to src/tools/msvc to build the
> test program, maybe some other stuff too. I'll take a look.


Patch complains that the 0007 patch is malformed:

andrew@ariana:pg_head (master)*$ patch -p 1 <
~/Downloads/v4-0007-Adding-frontend-tests-for-json-parser.patch
patching file src/Makefile
patching file src/test/Makefile
patching file src/test/bin/.gitignore
patching file src/test/bin/Makefile
patching file src/test/bin/README
patching file src/test/bin/t/001_test_json.pl
patch: **** malformed patch at line 201: diff --git
a/src/test/bin/test_json.c b/src/test/bin/test_json.c


cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 26, 2020, at 5:51 PM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
>>> 0007 adds testing.
>>>
>>> I would appreciate somebody looking at the portability issues for 0007
>>> on Windows.
>>>
>>
>> We'll need at a minimum something added to src/tools/msvc to build the
>> test program, maybe some other stuff too. I'll take a look.
>
>
> Patch complains that the 0007 patch is malformed:
>
> andrew@ariana:pg_head (master)*$ patch -p 1 <
> ~/Downloads/v4-0007-Adding-frontend-tests-for-json-parser.patch
> patching file src/Makefile
> patching file src/test/Makefile
> patching file src/test/bin/.gitignore
> patching file src/test/bin/Makefile
> patching file src/test/bin/README
> patching file src/test/bin/t/001_test_json.pl
> patch: **** malformed patch at line 201: diff --git
> a/src/test/bin/test_json.c b/src/test/bin/test_json.c

I manually removed a stray newline in the patch file.  I shouldn’t have done that.  I’ve removed the stray newline in
the sources, committed (with git commit --amend) and am testing again, which is what I should have done the first time…

Ok, the tests pass.  Here are those two patches again, both regenerated with a fresh invocation of ‘git format-patch’.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Attachment

Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Sun, Jan 26, 2020 at 9:05 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Ok, the tests pass.  Here are those two patches again, both regenerated with a fresh invocation of ‘git
format-patch’.

Regarding 0006:

+#ifndef FRONTEND
 #include "miscadmin.h"
-#include "utils/jsonapi.h"
+#endif

I suggest

#ifdef FRONTEND
#define check_stack_depth()
#else
#include "miscadmin.h"
#endif

- lex->token_terminator = s + pg_mblen(s);
+ lex->token_terminator = s +
pg_wchar_table[lex->input_encoding].mblen((const unsigned char *) s);

Can we use pq_encoding_mblen() here? Regardless, it doesn't seem great
to add more direct references to pg_wchar_table. I think we should
avoid that.

+ return JSON_BAD_PARSER_STATE;

I don't like this, either. I'm thinking about adding some
variable-argument macros that either elog() in backend code or else
pg_log_fatal() and exit(1) in frontend code. There are some existing
precedents already (e.g. rmtree.c, pgfnames.c) which could perhaps be
generalized. I think I'll start a new thread about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Mahendra Singh Thalor
Date:
On Mon, 27 Jan 2020 at 19:00, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Jan 26, 2020 at 9:05 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
> > Ok, the tests pass.  Here are those two patches again, both regenerated with a fresh invocation of ‘git
format-patch’.
>
> Regarding 0006:
>
> +#ifndef FRONTEND
>  #include "miscadmin.h"
> -#include "utils/jsonapi.h"
> +#endif
>
> I suggest
>
> #ifdef FRONTEND
> #define check_stack_depth()
> #else
> #include "miscadmin.h"
> #endif
>
> - lex->token_terminator = s + pg_mblen(s);
> + lex->token_terminator = s +
> pg_wchar_table[lex->input_encoding].mblen((const unsigned char *) s);
>
> Can we use pq_encoding_mblen() here? Regardless, it doesn't seem great
> to add more direct references to pg_wchar_table. I think we should
> avoid that.
>
> + return JSON_BAD_PARSER_STATE;
>
> I don't like this, either. I'm thinking about adding some
> variable-argument macros that either elog() in backend code or else
> pg_log_fatal() and exit(1) in frontend code. There are some existing
> precedents already (e.g. rmtree.c, pgfnames.c) which could perhaps be
> generalized. I think I'll start a new thread about that.
>

Hi,
I can see one warning on HEAD.

jsonapi.c: In function ‘json_errdetail’:
jsonapi.c:1068:1: warning: control reaches end of non-void function
[-Wreturn-type]
 }
 ^

Attaching a patch to fix warning.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 27, 2020, at 5:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Jan 26, 2020 at 9:05 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> Ok, the tests pass.  Here are those two patches again, both regenerated with a fresh invocation of ‘git
format-patch’.
>
> Regarding 0006:
>
> +#ifndef FRONTEND
> #include "miscadmin.h"
> -#include "utils/jsonapi.h"
> +#endif
>
> I suggest
>
> #ifdef FRONTEND
> #define check_stack_depth()
> #else
> #include "miscadmin.h"
> #endif

Sure, we can do it that way.

> - lex->token_terminator = s + pg_mblen(s);
> + lex->token_terminator = s +
> pg_wchar_table[lex->input_encoding].mblen((const unsigned char *) s);
>
> Can we use pq_encoding_mblen() here? Regardless, it doesn't seem great
> to add more direct references to pg_wchar_table. I think we should
> avoid that.

Yes, that looks a lot cleaner.

>
> + return JSON_BAD_PARSER_STATE;
>
> I don't like this, either. I'm thinking about adding some
> variable-argument macros that either elog() in backend code or else
> pg_log_fatal() and exit(1) in frontend code. There are some existing
> precedents already (e.g. rmtree.c, pgfnames.c) which could perhaps be
> generalized. I think I'll start a new thread about that.

Right, you started the “pg_croak, or something like it?” thread, which already looks like it might not be resolved
quickly. Can we use the 

#ifndef FRONTEND
#define pg_log_warning(...) elog(WARNING, __VA_ARGS__)
#else
#include "common/logging.h"
#endif

pattern here as a placeholder, and revisit it along with the other couple of instances of this pattern if/when the
“pg_croak, or something like it?” thread is ready for commit?  I’m calling it json_log_and_abort(…) for now, as I can’t
hope to guess what the final name will be.

I’m attaching a new patch set with these three changes including Mahendra’s patch posted elsewhere on this thread.

Since you’ve committed your 0004 and 0005 patches, this v6 patch set is now based on a fresh copy of master.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Attachment

Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Mon, Jan 27, 2020 at 2:02 PM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
> I can see one warning on HEAD.
>
> jsonapi.c: In function ‘json_errdetail’:
> jsonapi.c:1068:1: warning: control reaches end of non-void function
> [-Wreturn-type]
>  }
>  ^
>
> Attaching a patch to fix warning.

Hmm, I don't get a warning there. This function is a switch over an
enum type with a case for every value of the enum, and every branch
either does a "return" or an "elog," so any code after the switch
should be unreachable. It's possible your compiler is too dumb to know
that, but I thought there were other places in the code base where we
assumed that if we handled every defined value of the enum, that was good
enough.

But maybe not. I found similar coding in CreateDestReceiver(), and
that ends with:

        /* should never get here */
        pg_unreachable();

So perhaps we need the same thing here. Does adding that fix it for you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Mon, Jan 27, 2020 at 3:05 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I’m attaching a new patch set with these three changes including Mahendra’s patch posted elsewhere on this thread.
>
> Since you’ve committed your 0004 and 0005 patches, this v6 patch set is now based on a fresh copy of master.

OK, so I think this is getting close.

What is now 0001 manages to have four (4) conditionals on FRONTEND at
the top of the file. This seems like at least one too many. I am OK
with this being separate:

+#ifndef FRONTEND
 #include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif

postgres(_fe).h has pride of place among includes, so it's reasonable
to put this in its own section like this.

+#ifdef FRONTEND
+#define check_stack_depth()
+#define json_log_and_abort(...) pg_log_fatal(__VA_ARGS__); exit(1);
+#else
+#define json_log_and_abort(...) elog(ERROR, __VA_ARGS__)
+#endif

OK, so here we have a section entirely devoted to our own file-local
macros. Also reasonable. But in between, you have both an #ifdef
FRONTEND and an #ifndef FRONTEND for other includes, and I really
think that should be like #ifdef FRONTEND .. #else .. #endif.

Also, the preprocessor police are on their way to your house now to
arrest you for that first one. You need to write it like this:

#define json_log_and_abort(...) \
    do { pg_log_fatal(__VA_ARGS__); exit(1); } while (0)

Otherwise, hilarity ensues if somebody writes if (my_code_is_buggy)
json_log_and_abort("oops").

 {
- JsonLexContext *lex = palloc0(sizeof(JsonLexContext));
+ JsonLexContext *lex;
+
+#ifndef FRONTEND
+ lex = palloc0(sizeof(JsonLexContext));
+#else
+ lex = (JsonLexContext*) malloc(sizeof(JsonLexContext));
+ memset(lex, 0, sizeof(JsonLexContext));
+#endif

Instead of this, how about making no change at all here? (src/common's
fe_memutils.c already provides palloc0 for frontend code.)

- default:
- elog(ERROR, "unexpected json parse state: %d", ctx);
  }
+
+ /* Not reached */
+ json_log_and_abort("unexpected json parse state: %d", ctx);

This, too, seems unnecessary.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Julien Rouhaud
Date:
On Tue, Jan 28, 2020 at 4:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 27, 2020 at 2:02 PM Mahendra Singh Thalor
> <mahi6run@gmail.com> wrote:
> > I can see one warning on HEAD.
> >
> > jsonapi.c: In function ‘json_errdetail’:
> > jsonapi.c:1068:1: warning: control reaches end of non-void function
> > [-Wreturn-type]
> >  }
> >  ^
> >
> > Attaching a patch to fix warning.
>
> Hmm, I don't get a warning there. This function is a switch over an
> enum type with a case for every value of the enum, and every branch
> either does a "return" or an "elog," so any code after the switch
> should be unreachable. It's possible your compiler is too dumb to know
> that, but I thought there were other places in the code base where we
> assumed that if we handled every defined value of enum, that was good
> enough.
>
> But maybe not. I found similar coding in CreateDestReceiver(), and
> that ends with:
>
>         /* should never get here */
>         pg_unreachable();
>
> So perhaps we need the same thing here. Does adding that fix it for you?

FTR this unfortunately has the same result on Thomas' automatic patch
tester, e.g. https://travis-ci.org/postgresql-cfbot/postgresql/builds/642634195#L1968



Re: making the backend's json parser work in frontend code

From
Mahendra Singh Thalor
Date:
On Tue, 28 Jan 2020 at 20:36, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 27, 2020 at 2:02 PM Mahendra Singh Thalor
> <mahi6run@gmail.com> wrote:
> > I can see one warning on HEAD.
> >
> > jsonapi.c: In function ‘json_errdetail’:
> > jsonapi.c:1068:1: warning: control reaches end of non-void function
> > [-Wreturn-type]
> >  }
> >  ^
> >
> > Attaching a patch to fix warning.
>
> Hmm, I don't get a warning there. This function is a switch over an
> enum type with a case for every value of the enum, and every branch
> either does a "return" or an "elog," so any code after the switch
> should be unreachable. It's possible your compiler is too dumb to know
> that, but I thought there were other places in the code base where we
> assumed that if we handled every defined value of enum, that was good
> enough.
>
> But maybe not. I found similar coding in CreateDestReceiver(), and
> that ends with:
>
>         /* should never get here */
>         pg_unreachable();
>
> So perhaps we need the same thing here. Does adding that fix it for you?
>

Hi Robert,
Tom Lane already fixed this and committed yesterday (4589c6a2a30faba53d0655a8e).

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Tue, Jan 28, 2020 at 10:30 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
> Tom Lane already fixed this and committed yesterday (4589c6a2a30faba53d0655a8e).

Oops. OK, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Tue, Jan 28, 2020 at 10:19 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> FTR this has unfortunately the same result on Thomas' automatic patch
> tester, e.g. https://travis-ci.org/postgresql-cfbot/postgresql/builds/642634195#L1968

That's unfortunate ... but presumably Tom's changes took care of this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 28, 2020 at 10:19 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>> FTR this has unfortunately the same result on Thomas' automatic patch
>> tester, e.g. https://travis-ci.org/postgresql-cfbot/postgresql/builds/642634195#L1968

> That's unfortunate ... but presumably Tom's changes took care of this?

Probably the cfbot just hasn't retried this build since that fix.
I don't know what its algorithm is for retrying failed builds, but it
does seem to do so after a while.

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Mon, Jan 27, 2020 at 3:05 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Since you’ve committed your 0004 and 0005 patches, this v6 patch set is now based on a fresh copy of master.

I think the first question for 0005 is whether we want this at all.
Initially, you proposed NOT committing it, but then Andrew reviewed it
as if it were for commit. I'm not sure whether he was actually saying
that it ought to be committed, though, or whether he just missed your
remarks on the topic. Nobody else has really taken a position. I'm not
100% convinced that it's necessary to include this, but I'm also not
particularly opposed to it. It's a fairly small amount of code, which
is nice, and perhaps useful as a demonstration of how to use the JSON
parser in a frontend application, which someone also might find nice.
Anyone else want to express an opinion?

Meanwhile, here is a round of nitp^H^H^H^Hreview:

-# installcheck and install should not recurse into the subdirectory "modules".
+# installcheck and install should not recurse into the subdirectory "modules"
+# nor "bin".

I would probably have just changed this to:

# installcheck and install should not recurse into "modules" or "bin"

The details are arguable, but you definitely shouldn't say "the
subdirectory" and then list two of them.

+This directory contains a set of programs that exercise functionality declared
+in src/include/common and defined in src/common.  The purpose of these programs
+is to verify that code intended to work both from frontend and backend code do
+indeed work when compiled and used in frontend code.  The structure of this
+directory makes no attempt to test that such code works in the backend, as the
+backend has its own tests already, and presumably those tests sufficiently
+exercide the code as used by it.

"exercide" is not spelled correctly, but I also disagree with giving
the directory so narrow a charter. I think you should just say
something like:

This directory contains programs that are built and executed for testing
purposes, but never installed. It may be used, for example, to test that
code in src/common works in frontend environments.

+# There doesn't seem to be any easy way to get TestLib to use the binaries from
+# our directory, so we hack up a path to our binary and run that directly.  This
+# seems brittle enough that some other solution should be found, if possible.
+
+my $test_json = join('/', $ENV{TESTDIR}, 'test_json');

I don't know what the right thing to do here is. Perhaps someone more
familiar with TAP testing can comment.

+ set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_test_json"));

Do we need this? I guess we're not likely to bother with translations
for a test program.

+ /*
+ * Make stdout unbuffered to match stderr; and ensure stderr is unbuffered
+ * too, which it should already be everywhere except sometimes in Windows.
+ */
+ setbuf(stdout, NULL);
+ setbuf(stderr, NULL);

Do we need this? If so, why?

+ char *json;
+ unsigned int json_len;
+ JsonLexContext *lex;
+ int client_encoding;
+ JsonParseErrorType parse_result;
+
+ json_len = (unsigned int) strlen(str);
+ client_encoding = PQenv2encoding();
+
+ json = strdup(str);
+ lex = makeJsonLexContextCstringLen(json, strlen(json),
+                                    client_encoding, true /* need_escapes */);
+ parse_result = pg_parse_json(lex, &nullSemAction);
+ fprintf(stdout, _("%s\n"), (JSON_SUCCESS == parse_result ? "VALID" : "INVALID"));
+ return;

json_len is set but not used.

Not entirely sure why we are using PQenv2encoding() here.

The trailing return is unnecessary.

I think it would be a good idea to use json_errdetail() in the failure
case, print the error, and have the tests check that we got the
expected error.
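Something along these lines, perhaps (just a sketch, assuming
json_errdetail()'s current (error, lex) signature; the exact output format
is up to you):

if (parse_result == JSON_SUCCESS)
    fprintf(stdout, "VALID\n");
else
    fprintf(stdout, "INVALID: %s\n", json_errdetail(parse_result, lex));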

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 28, 2020 at 10:30 AM Mahendra Singh Thalor
> <mahi6run@gmail.com> wrote:
>> Tom Lane already fixed this and committed yesterday (4589c6a2a30faba53d0655a8e).

> Oops. OK, thanks.

Yeah, there were multiple issues here:

1. If a switch is expected to cover all values of an enum type,
we now prefer not to have a default: case, so that we'll get
compiler warnings if somebody adds an enum value and fails to
update the switch.

2. Without a default:, though, you need to have after-the-switch
code to catch the possibility that the runtime value was not a
legal enum element.  Some compilers are trusting and assume that
that's not a possible case, but some are not (and Coverity will
complain about it too).

3. Some compilers still don't understand that elog(ERROR) doesn't
return, so you need a dummy return.  Perhaps pg_unreachable()
would do as well, but project style has been the dummy return for
a long time ... and I'm not entirely convinced by the assumption
that every compiler understands pg_unreachable(), anyway.

(I know Robert knows all this stuff, even if he momentarily
forgot.  Just summarizing for onlookers.)
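
Putting those three points together, the preferred shape is approximately
this (a sketch with made-up names):

static const char *
demo_name(DemoState s)
{
    switch (s)
    {
        case DEMO_A:
            return "A";
        case DEMO_B:
            return "B";
            /* no default: case, so compilers warn about new enum values */
    }

    /* catch illegal in-memory values */
    elog(ERROR, "unrecognized DemoState: %d", (int) s);
    return NULL;                /* keep compilers quiet */
}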

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Julien Rouhaud
Date:
On Tue, Jan 28, 2020 at 5:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Jan 28, 2020 at 10:19 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >> FTR this has unfortunately the same result on Thomas' automatic patch
> >> tester, e.g. https://travis-ci.org/postgresql-cfbot/postgresql/builds/642634195#L1968
>
> > That's unfortunate ... but presumably Tom's changes took care of this?
>
> Probably the cfbot just hasn't retried this build since that fix.
> I don't know what its algorithm is for retrying failed builds, but it
> does seem to do so after a while.

Yes, I seem to remember that Thomas put some rules in place to avoid
rebuilding everything all the time.  Patches that were rebuilt since then
are indeed starting to go back to green, so it's all good!



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Tue, Jan 28, 2020 at 11:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 3. Some compilers still don't understand that elog(ERROR) doesn't
> return, so you need a dummy return.  Perhaps pg_unreachable()
> would do as well, but project style has been the dummy return for
> a long time ... and I'm not entirely convinced by the assumption
> that every compiler understands pg_unreachable(), anyway.

Is the example of CreateDestReceiver() sufficient to show that this is
not a problem in practice?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 28, 2020 at 11:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 3. Some compilers still don't understand that elog(ERROR) doesn't
>> return, so you need a dummy return.  Perhaps pg_unreachable()
>> would do as well, but project style has been the dummy return for
>> a long time ... and I'm not entirely convinced by the assumption
>> that every compiler understands pg_unreachable(), anyway.

> Is the example of CreateDestReceiver() sufficient to show that this is
> not a problem in practice?

Dunno.  I don't see any warnings about that in the buildfarm, but
that's not a very large sample of non-gcc compilers.

Another angle here is that on non-gcc compilers, pg_unreachable()
is going to expand to an abort() call, which is likely to eat more
code space than a dummy "return 0".
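
(For reference, pg_unreachable() is defined in c.h roughly as follows.
This is a paraphrase; the real definition also accounts for
assert-enabled builds:

#if defined(HAVE__BUILTIN_UNREACHABLE)
#define pg_unreachable() __builtin_unreachable()
#elif defined(_MSC_VER)
#define pg_unreachable() __assume(0)
#else
#define pg_unreachable() abort()
#endif

so the abort() branch is what compilers other than gcc/clang/MSVC get.)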

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Is the example of CreateDestReceiver() sufficient to show that this is
>> not a problem in practice?

> Dunno.  I don't see any warnings about that in the buildfarm, but
> that's not a very large sample of non-gcc compilers.

BTW, now that I think about it, CreateDestReceiver is not up to project
standards anyway, in that it fails to provide reasonable behavior in
the case where what's passed is not a legal value of the enum.
What you'll get, if you're lucky, is a SIGABRT crash with no
indication of the cause --- or if you're not lucky, some very
hard-to-debug crash elsewhere as a result of the function returning
a garbage pointer.  So independently of whether the current coding
suppresses compiler warnings reliably, I think we ought to replace it
with elog()-and-return-NULL.  Admittedly, that's wasting a few bytes
on a case that should never happen ... but we haven't ever hesitated
to do that elsewhere, if it'd make the problem more diagnosable.

IOW, there's a good reason why there are exactly no other uses
of that coding pattern.

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Tue, Jan 28, 2020 at 1:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, now that I think about it, CreateDestReceiver is not up to project
> standards anyway, in that it fails to provide reasonable behavior in
> the case where what's passed is not a legal value of the enum.
> What you'll get, if you're lucky, is a SIGABRT crash with no
> indication of the cause --- or if you're not lucky, some very
> hard-to-debug crash elsewhere as a result of the function returning
> a garbage pointer.  So independently of whether the current coding
> suppresses compiler warnings reliably, I think we ought to replace it
> with elog()-and-return-NULL.  Admittedly, that's wasting a few bytes
> on a case that should never happen ... but we haven't ever hesitated
> to do that elsewhere, if it'd make the problem more diagnosable.

Well, I might be responsible for the CreateDestReceiver thing -- or I
might not, I haven't checked -- but I do think that style is a bit
cleaner and more elegant. I think it's VERY unlikely that anyone would
ever manage to call it with something that's not a legal value of the
enum, and if they do, I think the chances of surviving are basically
nil, and frankly, I'd rather die. If you ask me where I want you to
store my output and I tell you to store it in the sdklgjsdjgslkdg, you
really should refuse to do anything at all, not just stick my output
someplace-or-other and hope for the best.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 28, 2020 at 1:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, now that I think about it, CreateDestReceiver is not up to project
>> standards anyway, in that it fails to provide reasonable behavior in
>> the case where what's passed is not a legal value of the enum.

> Well, I might be responsible for the CreateDestReceiver thing -- or I
> might not, I haven't checked -- but I do think that style is a bit
> cleaner and more elegant. I think it's VERY unlikely that anyone would
> ever manage to call it with something that's not a legal value of the
> enum, and if they do, I think the chances of surviving are basically
> nil, and frankly, I'd rather die. If you asked me where you want me to
> store my output and I tell you to store it in the sdklgjsdjgslkdg, you
> really should refuse to do anything at all, not just stick my output
> someplace-or-other and hope for the best.

Well, yeah, that's exactly my point.  But in my book, "refuse to do
anything" should be "elog(ERROR)", not "invoke undefined behavior".
An actual abort() call might be all right here, in that at least
we'd know what would happen and we could debug it once we got hold
of a stack trace.  But pg_unreachable() is not that.  Basically, if
there's *any* circumstance, broken code or not, where control could
reach a pg_unreachable() call, you did it wrong.

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Tue, Jan 28, 2020 at 2:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, yeah, that's exactly my point.  But in my book, "refuse to do
> anything" should be "elog(ERROR)", not "invoke undefined behavior".
> An actual abort() call might be all right here, in that at least
> we'd know what would happen and we could debug it once we got hold
> of a stack trace.  But pg_unreachable() is not that.  Basically, if
> there's *any* circumstance, broken code or not, where control could
> reach a pg_unreachable() call, you did it wrong.

I don't really agree. I think such defensive coding is more than
justified when the input is coming from a file on disk or some other
external source where it might have been corrupted. For instance, I
think the fact that the code which deforms heap tuples will cheerfully
sail off the end of the buffer or seg fault if the tuple descriptor
doesn't match the tuple is a seriously bad thing. It results in actual
production crashes that could be avoided with more defensive coding.
Admittedly, there would likely be a performance cost, which might not
be a reason to do it, but if that cost is small I would probably vote
for paying it, because this is something that actually happens to
users on a pretty regular basis.

In the case at hand, though, there are no values of type
CommandDest that come from any place other than a constant in the
program text, and it seems unlikely that this will ever be different
in the future. So, how could we ever end up with a value that's not in
the enum? I guess the program text itself could be corrupted, but we
cannot defend against that.

Mind you, I'm not going to put up a huge stink if you're bound and
determined to go change this. I prefer it the way that it is, and I
think that preference is well-justified by facts on the ground, but I
don't think it's worth fighting about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 28, 2020 at 2:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, yeah, that's exactly my point.  But in my book, "refuse to do
>> anything" should be "elog(ERROR)", not "invoke undefined behavior".
>> An actual abort() call might be all right here, in that at least
>> we'd know what would happen and we could debug it once we got hold
>> of a stack trace.  But pg_unreachable() is not that.  Basically, if
>> there's *any* circumstance, broken code or not, where control could
>> reach a pg_unreachable() call, you did it wrong.

> I don't really agree. I think such defensive coding is more than
> justified when the input is coming from a file on disk or some other
> external source where it might have been corrupted.

There's certainly an argument to be made that an elog() call is an
unjustified expenditure of code space and we should just do an abort()
(but still not pg_unreachable(), IMO).  However, what I'm really on about
here is that CreateDestReceiver is out of step with nigh-universal project
practice.  If it's not worth having an elog() here, then there are
literally hundreds of other elog() calls that we ought to be nuking on
the same grounds.  I don't really want to run around and have a bunch
of debates about exactly which extant elog() calls are effectively
unreachable and which are not.  That's not always very clear, and even
if it is clear today it might not be tomorrow.  The minute somebody calls
CreateDestReceiver with a non-constant argument, the issue becomes open
again.  And I'd rather not have to stop and think hard about the tradeoff
between elog() and abort() when I write such functions in future.

So basically, my problem with this is that I don't think it's a coding
style we want to encourage, because it's too fragile.  And there's no
good argument (like performance) to leave it that way.  I quite agree
with you that there are places like tuple deforming where we're taking
more chances than I'd like --- but there is a noticeable performance
cost to being paranoid there.

            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 28, 2020, at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 27, 2020 at 3:05 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> Since you’ve committed your 0004 and 0005 patches, this v6 patch set is now based on a fresh copy of master.
>
> I think the first question for 0005 is whether we want this at all.
> Initially, you proposed NOT committing it, but then Andrew reviewed it
> as if it were for commit. I'm not sure whether he was actually saying
> that it ought to be committed, though, or whether he just missed your
> remarks on the topic. Nobody else has really taken a position. I'm not
> 100% convinced that it's necessary to include this, but I'm also not
> particularly opposed to it. It's a fairly small amount of code, which
> is nice, and perhaps useful as a demonstration of how to use the JSON
> parser in a frontend application, which someone also might find nice.

Once Andrew reviewed it, I started thinking about it as something that might get committed.  In that context, I think
there should be a lot more tests in this new src/test/bin directory for other common code, but adding those as part of
this patch just seems to confuse it.

In addition to adding frontend tests for code already in src/common, the conversation in another thread about adding
frontend versions of elog and ereport seems like a source of candidates for tests in this location.  Sure, you can add
an elog into a real frontend tool, such as pg_ctl, and update the tests for that program to expect that elog’s output,
but what if you just want to exhaustively test the elog infrastructure in the frontend spanning multiple locales,
encodings, whatever?  You’ve also recently mentioned the possibility of having memory contexts in frontend code.
Testing those seems like a good fit, too.

I decided to leave this in the next version of the patch set, v7.  v6 had three files, the second being something that
already got committed in a different form, so this is now in v7-0002 whereas it had been in v6-0003.  v6-0002 has no
equivalent in v7.

> Anyone else want to express an opinion?
>
> Meanwhile, here is a round of nitp^H^H^H^Hreview:
>
> -# installcheck and install should not recurse into the subdirectory "modules".
> +# installcheck and install should not recurse into the subdirectory "modules"
> +# nor "bin".
>
> I would probably have just changed this to:
>
> # installcheck and install should not recurse into "modules" or "bin"
>
> The details are arguable, but you definitely shouldn't say "the
> subdirectory" and then list two of them.

I read that as “nor [the subdirectory] bin” with the [the subdirectory] portion elided, and it doesn’t sound anomalous
to me, but your formulation is more compact.  I have used it in v7 of the patch set.  Thanks.

>
> +This directory contains a set of programs that exercise functionality declared
> +in src/include/common and defined in src/common.  The purpose of these programs
> +is to verify that code intended to work both from frontend and backend code do
> +indeed work when compiled and used in frontend code.  The structure of this
> +directory makes no attempt to test that such code works in the backend, as the
> +backend has its own tests already, and presumably those tests sufficiently
> +exercide the code as used by it.
>
> "exercide" is not spelled correctly, but I also disagree with giving
> the directory so narrow a charter. I think you should just say
> something like:
>
> This directory contains programs that are built and executed for testing
> purposes, but never installed. It may be used, for example, to test that
> code in src/common works in frontend environments.

Your formulation sounds fine, and I’ve used it in v7.

> +# There doesn't seem to be any easy way to get TestLib to use the binaries from
> +# our directory, so we hack up a path to our binary and run that directly.  This
> +# seems brittle enough that some other solution should be found, if possible.
> +
> +my $test_json = join('/', $ENV{TESTDIR}, 'test_json');
>
> I don't know what the right thing to do here is. Perhaps someone more
> familiar with TAP testing can comment.

Yeah, I was hoping that might get a comment from Andrew.  I think if it works as-is on Windows, we could just use it
this way until it causes a problem on some platform or other.  It’s not a runtime issue, being only a build-time test,
and only then when tap tests are enabled *and* running check-world, so nobody should really be adversely affected.
I’ll likely get around to testing this on Windows, but I don’t have any Windows environments set up yet, as that is
still on my todo list.

> + set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_test_json"));
>
> Do we need this? I guess we're not likely to bother with translations
> for a test program.

Removed.

> + /*
> + * Make stdout unbuffered to match stderr; and ensure stderr is unbuffered
> + * too, which it should already be everywhere except sometimes in Windows.
> + */
> + setbuf(stdout, NULL);
> + setbuf(stderr, NULL);
>
> Do we need this? If so, why?

For the current test setup, it is not needed.  The tap test executes this program (test_json) once per json string, and
exits after printing a single line.  Surely the tap test wouldn’t have problems hanging on an unflushed buffer for a
program that has exited.  I was imagining this code might grow more complex, with the tap test communicating repeatedly
with the same instance of test_json, such as if we extend the json parser to iterate over chunks of the input json
string.

I’ve removed this for v7, since we don’t need it yet.

> + char *json;
> + unsigned int json_len;
> + JsonLexContext *lex;
> + int client_encoding;
> + JsonParseErrorType parse_result;
> +
> + json_len = (unsigned int) strlen(str);
> + client_encoding = PQenv2encoding();
> +
> + json = strdup(str);
> + lex = makeJsonLexContextCstringLen(json, strlen(json),
> +                                    client_encoding, true /* need_escapes */);
> + parse_result = pg_parse_json(lex, &nullSemAction);
> + fprintf(stdout, _("%s\n"), (JSON_SUCCESS == parse_result ? "VALID" : "INVALID"));
> + return;
>
> json_len is set but not used.

You’re right.  I’ve removed it.

> Not entirely sure why we are using PQenv2encoding() here.

This program, which passes possibly-JSON-formatted strings into the parser, gets those strings from perl through the
shell.  If locale settings on the machine where this runs might break something about that for a real client
application, then our test should break in the same way.  Hard-coding “C” or “POSIX” or whatever for the locale
side-steps part of the issue we’re trying to test.  No?

I’m leaving it as is for v7, but if you still disagree, I’ll change it.  Let me know what you want me to change it
*to*, though, as there is no obvious choice that I can see.

> The trailing return is unnecessary.

Ok, I’ve removed it.

> I think it would be a good idea to use json_errdetail() in the failure
> case, print the error, and have the tests check that we got the
> expected error.

Oh, yeah, I like that idea.  That works, and is included in v7.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Attachment

Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:
Thanks for your review.  I considered all of this, along with your review comments in another email, prior to sending
v7 in response to that other email a few minutes ago.

> On Jan 28, 2020, at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 27, 2020 at 3:05 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> I’m attaching a new patch set with these three changes including Mahendra’s patch posted elsewhere on this thread.
>>
>> Since you’ve committed your 0004 and 0005 patches, this v6 patch set is now based on a fresh copy of master.
>
> OK, so I think this is getting close.
>
> What is now 0001 manages to have four (4) conditionals on FRONTEND at
> the top of the file. This seems like at least one too many.

You are referencing this section, copied here from the patch:

> #ifndef FRONTEND
> #include "postgres.h"
> #else
> #include "postgres_fe.h"
> #endif
>
> #include "common/jsonapi.h"
>
> #ifdef FRONTEND
> #include "common/logging.h"
> #endif
>
> #include "mb/pg_wchar.h"
>
> #ifndef FRONTEND
> #include "miscadmin.h"
> #endif

I merged these a bit.  See v7-0001 for details.
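
For the archives, the merged arrangement looks something like this (a
sketch; the patch as committed may arrange these slightly differently):

#ifndef FRONTEND
#include "postgres.h"
#else
#include "postgres_fe.h"
#endif

#include "common/jsonapi.h"
#include "mb/pg_wchar.h"

#ifdef FRONTEND
#include "common/logging.h"
#else
#include "miscadmin.h"
#endif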

> Also, the preprocessor police are on their way to your house now to
> arrest you for that first one. You need to write it like this:
>
> #define json_log_and_abort(...) \
>    do { pg_log_fatal(__VA_ARGS__); exit(1); } while (0)

Yes, right, I had done that and somehow didn’t get it into the patch.  I’ll have coffee and donuts waiting.

> {
> - JsonLexContext *lex = palloc0(sizeof(JsonLexContext));
> + JsonLexContext *lex;
> +
> +#ifndef FRONTEND
> + lex = palloc0(sizeof(JsonLexContext));
> +#else
> + lex = (JsonLexContext*) malloc(sizeof(JsonLexContext));
> + memset(lex, 0, sizeof(JsonLexContext));
> +#endif
>
> Instead of this, how making no change at all here?

Yes, good point.  I had split that into frontend vs backend because I was using palloc0fast for the backend, which
seems to me the preferred function when the size is compile-time known, like it is here, and there is no palloc0fast in
fe_memutils.h for frontend use.  I then changed back to palloc0 when I noticed that pretty much nothing else similar to
this in the project uses palloc0fast.  I neglected to change back completely, which left what you are quoting.

Out of curiosity, why is palloc0fast not used in more places?

> - default:
> - elog(ERROR, "unexpected json parse state: %d", ctx);
>  }
> +
> + /* Not reached */
> + json_log_and_abort("unexpected json parse state: %d", ctx);
>
> This, too, seems unnecessary.

This was in response to Mahendra’s report of a compiler warning, which I didn’t get on my platform.  The code in master
changed a bit since v6 was written, so v7 just goes with how the newer committed code does this.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: making the backend's json parser work in frontend code

From
Andrew Dunstan
Date:
On 1/28/20 5:28 PM, Mark Dilger wrote:
>
>
>> +# There doesn't seem to be any easy way to get TestLib to use the binaries from
>> +# our directory, so we hack up a path to our binary and run that directly.  This
>> +# seems brittle enough that some other solution should be found, if possible.
>> +
>> +my $test_json = join('/', $ENV{TESTDIR}, 'test_json');
>>
>> I don't know what the right thing to do here is. Perhaps someone more
>> familiar with TAP testing can comment.
> Yeah, I was hoping that might get a comment from Andrew.  I think if it works as-is on Windows, we could just use it
> this way until it causes a problem on some platform or other.  It’s not a runtime issue, being only a build-time test,
> and only then when tap tests are enabled *and* running check-world, so nobody should really be adversely affected.
> I’ll likely get around to testing this on Windows, but I don’t have any Windows environments set up yet, as that is
> still on my todo list.
>


I think using TESTDIR is Ok, but we do need a little more on Windows,
because the executable name will be different. See attached revised
version of the test script.



We also need some extra stuff for MSVC. Something like the attached
change to src/tools/msvc/Mkvcbuild.pm. Also, the Makefile will need a
line like:


PROGRAM = test_json


I'm still not 100% sure on the location of the test. I think the way the
MSVC suite works, this should be in its own dedicated directory, e.g.
src/test/json_parse.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Tue, Jan 28, 2020 at 5:35 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I merged these a bit.  See v7-0001 for details.

I jiggered that a bit more and committed this. I couldn't see the
point of having both the FRONTEND and non-FRONTEND code include
pg_wchar.h.

I'll wait to see what you make of Andrew's latest comments before
doing anything further.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 28, 2020 at 5:35 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> I merged these a bit.  See v7-0001 for details.

> I jiggered that a bit more and committed this. I couldn't see the
> point of having both the FRONTEND and non-FRONTEND code include
> pg_wchar.h.

First buildfarm report is not positive:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2020-01-29%2015%3A30%3A26

  json.obj : error LNK2019: unresolved external symbol makeJsonLexContextCstringLen referenced in function json_recv
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  jsonb.obj : error LNK2001: unresolved external symbol makeJsonLexContextCstringLen
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  jsonfuncs.obj : error LNK2001: unresolved external symbol makeJsonLexContextCstringLen
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  json.obj : error LNK2019: unresolved external symbol json_lex referenced in function json_typeof
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  json.obj : error LNK2019: unresolved external symbol IsValidJsonNumber referenced in function datum_to_json
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  json.obj : error LNK2001: unresolved external symbol nullSemAction
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  jsonfuncs.obj : error LNK2019: unresolved external symbol pg_parse_json referenced in function json_strip_nulls
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  jsonfuncs.obj : error LNK2019: unresolved external symbol json_count_array_elements referenced in function
get_array_start [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  jsonfuncs.obj : error LNK2019: unresolved external symbol json_errdetail referenced in function json_ereport_error
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  .\Release\postgres\postgres.exe : fatal error LNK1120: 7 unresolved externals
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]



            regards, tom lane



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Wed, Jan 29, 2020 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Jan 28, 2020 at 5:35 PM Mark Dilger
> > <mark.dilger@enterprisedb.com> wrote:
> >> I merged these a bit.  See v7-0001 for details.
>
> > I jiggered that a bit more and committed this. I couldn't see the
> > point of having both the FRONTEND and non-FRONTEND code include
> > pg_wchar.h.
>
> First buildfarm report is not positive:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2020-01-29%2015%3A30%3A26
>
>   json.obj : error LNK2019: unresolved external symbol makeJsonLexContextCstringLen referenced in function json_recv
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   jsonb.obj : error LNK2001: unresolved external symbol makeJsonLexContextCstringLen
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   jsonfuncs.obj : error LNK2001: unresolved external symbol makeJsonLexContextCstringLen
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   json.obj : error LNK2019: unresolved external symbol json_lex referenced in function json_typeof
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   json.obj : error LNK2019: unresolved external symbol IsValidJsonNumber referenced in function datum_to_json
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   json.obj : error LNK2001: unresolved external symbol nullSemAction
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   jsonfuncs.obj : error LNK2019: unresolved external symbol pg_parse_json referenced in function json_strip_nulls
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   jsonfuncs.obj : error LNK2019: unresolved external symbol json_count_array_elements referenced in function
> get_array_start [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   jsonfuncs.obj : error LNK2019: unresolved external symbol json_errdetail referenced in function json_ereport_error
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   .\Release\postgres\postgres.exe : fatal error LNK1120: 7 unresolved externals
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]

Hrrm, OK. I think it must need a sprinkling of Windows-specific magic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Robert Haas
Date:
On Wed, Jan 29, 2020 at 10:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Hrrm, OK. I think it must need a sprinkling of Windows-specific magic.

I see that the patch Andrew posted earlier adjusts Mkvcbuild.pm's
@pgcommonallfiles, so I pushed that fix. The other hunks there should
go into the patch to add a test_json utility, I think. Hopefully that
will fix it, but I guess we'll see.

I was under the impression that the MSVC build gets the list of files
to build by parsing the Makefiles, but I guess that's not true at
least in the case of src/common.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: making the backend's json parser work in frontend code

From
Andrew Dunstan
Date:
On Wed, Jan 29, 2020 at 4:32 PM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
>
>
> On 1/28/20 5:28 PM, Mark Dilger wrote:
> >
> >
> >> +# There doesn't seem to be any easy way to get TestLib to use the binaries from
> >> +# our directory, so we hack up a path to our binary and run that directly.  This
> >> +# seems brittle enough that some other solution should be found, if possible.
> >> +
> >> +my $test_json = join('/', $ENV{TESTDIR}, 'test_json');
> >>
> >> I don't know what the right thing to do here is. Perhaps someone more
> >> familiar with TAP testing can comment.
> > Yeah, I was hoping that might get a comment from Andrew.  I think if it works as-is on Windows, we could just use
> > it this way until it causes a problem on some platform or other.  It’s not a runtime issue, being only a build-time
> > test, and only then when tap tests are enabled *and* running check-world, so nobody should really be adversely
> > affected.  I’ll likely get around to testing this on Windows, but I don’t have any Windows environments set up yet,
> > as that is still on my todo list.
> >
>
>
> I think using TESTDIR is Ok,


I've changed my mind, I don't think that will work for MSVC, the
executable gets built elsewhere for that. I'll try to come up with
something portable.

cheers

andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: making the backend's json parser work in frontend code

From
Mark Dilger
Date:

> On Jan 29, 2020, at 1:02 PM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> On Wed, Jan 29, 2020 at 4:32 PM Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>>
>>
>> On 1/28/20 5:28 PM, Mark Dilger wrote:
>>>
>>>
>>>> +# There doesn't seem to be any easy way to get TestLib to use the binaries from
>>>> +# our directory, so we hack up a path to our binary and run that directly.  This
>>>> +# seems brittle enough that some other solution should be found, if possible.
>>>> +
>>>> +my $test_json = join('/', $ENV{TESTDIR}, 'test_json');
>>>>
>>>> I don't know what the right thing to do here is. Perhaps someone more
>>>> familiar with TAP testing can comment.
>>> Yeah, I was hoping that might get a comment from Andrew.  I think if it works as-is on Windows, we could just use
>>> it this way until it causes a problem on some platform or other.  It’s not a runtime issue, being only a build-time
>>> test, and only then when tap tests are enabled *and* running check-world, so nobody should really be adversely
>>> affected.  I’ll likely get around to testing this on Windows, but I don’t have any Windows environments set up yet,
>>> as that is still on my todo list.
>>>
>>
>>
>> I think using TESTDIR is Ok,
>
>
> I've changed my mind, I don't think that will work for MSVC, the
> executable gets built elsewhere for that. I'll try to come up with
> something portable.

I’m just now working on getting my Windows VMs set up with Visual Studio and whatnot, per the wiki instructions, so I
don’t need to burden you with this sort of Windows task in the future.  If there are any gotchas not mentioned on the
wiki, I’d appreciate pointers about how to avoid them.  I’ll try to help devise a solution, or test what you come up
with, once I’m properly set up for that.

For no particular reason, I chose Windows Server 2019 and Windows 10 Pro.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: making the backend's json parser work in frontend code

From
Andrew Dunstan
Date:
On Thu, Jan 30, 2020 at 7:36 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Jan 29, 2020, at 1:02 PM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> >
> > On Wed, Jan 29, 2020 at 4:32 PM Andrew Dunstan
> > <andrew.dunstan@2ndquadrant.com> wrote:
> >>
> >>
> >> On 1/28/20 5:28 PM, Mark Dilger wrote:
> >>>
> >>>
> >>>> +# There doesn't seem to be any easy way to get TestLib to use the binaries from
> >>>> +# our directory, so we hack up a path to our binary and run that directly.  This
> >>>> +# seems brittle enough that some other solution should be found, if possible.
> >>>> +
> >>>> +my $test_json = join('/', $ENV{TESTDIR}, 'test_json');
> >>>>
> >>>> I don't know what the right thing to do here is. Perhaps someone more
> >>>> familiar with TAP testing can comment.
> >>> Yeah, I was hoping that might get a comment from Andrew.  I think if it works as-is on Windows, we could just use
> >>> it this way until it causes a problem on some platform or other.  It’s not a runtime issue, being only a build-time
> >>> test, and only then when tap tests are enabled *and* running check-world, so nobody should really be adversely
> >>> affected.  I’ll likely get around to testing this on Windows, but I don’t have any Windows environments set up yet,
> >>> as that is still on my todo list.
> >>>
> >>
> >>
> >> I think using TESTDIR is Ok,
> >
> >
> > I've changed my mind, I don't think that will work for MSVC, the
> > executable gets built elsewhere for that. I'll try to come up with
> > something portable.
>
> I’m just now working on getting my Windows VMs set up with Visual Studio and whatnot, per the wiki instructions, so I
> don’t need to burden you with this sort of Windows task in the future.  If there are any gotchas not mentioned on the
> wiki, I’d appreciate pointers about how to avoid them.  I’ll try to help devise a solution, or test what you come up
> with, once I’m properly set up for that.
>
> For no particular reason, I chose Windows Server 2019 and Windows 10 Pro.
>


One VM should be sufficient. Either W10Pro or WS2019 would be fine. I
have buildfarm animals running on both.

Here's what I got working after a lot of trial and error. (This will
require a tiny change in the buildfarm script to make the animals test
it). Note that there is one test that I couldn't get working, so I
skipped it. If you can find out why it fails, so much the better ... it
seems to be related to how the command processor handles quotes.

cheers

andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment