Thread: Request for comment on setting binary format output per session

Request for comment on setting binary format output per session

From
Dave Cramer
Date:
Greetings,

In [1] I proposed a patch that used a GUC to request a list of OID's to be returned in binary format. 
In [2] Peter Eisentraut proposed a very similar solution to the problem.

In [2] there was some discussion regarding whether this should be set via GUC or a new protocol message. 

I'd like to open up this discussion again so that we can move forward. I prefer the GUC as it is relatively simple and as Peter mentioned it works, but I'm not married to the idea. 

Regards,
Dave

Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Thu, 2023-03-02 at 09:13 -0500, Dave Cramer wrote:
> I'd like to open up this discussion again so that we can
> move forward. I prefer the GUC as it is relatively simple and as
> Peter mentioned it works, but I'm not married to the idea. 

It's not very friendly to extensions, where the types are not
guaranteed to have stable OIDs. Did you consider any proposals that
work with type names?

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:

Dave Cramer


On Sat, 4 Mar 2023 at 11:35, Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2023-03-02 at 09:13 -0500, Dave Cramer wrote:
> I'd like to open up this discussion again so that we can
> move forward. I prefer the GUC as it is relatively simple and as
> Peter mentioned it works, but I'm not married to the idea. 

It's not very friendly to extensions, where the types are not
guaranteed to have stable OIDs. Did you consider any proposals that
work with type names?

I had not. 
Most of the clients know how to decode the builtin types. I'm not sure there is a use case for binary encode types that the clients don't have a priori knowledge of.

Dave 

Regards,
        Jeff Davis

Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Sat, 2023-03-04 at 18:04 -0500, Dave Cramer wrote:
> Most of the clients know how to decode the builtin types. I'm not
> sure there is a use case for binary encode types that the clients
> don't have a priori knowledge of.

The client could, in theory, have a priori knowledge of a non-builtin
type.

I don't have a great solution for that, though. Maybe it's only
practical for builtin types.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2023-03-04 at 18:04 -0500, Dave Cramer wrote:
>> Most of the clients know how to decode the builtin types. I'm not
>> sure there is a use case for binary encode types that the clients
>> don't have a priori knowledge of.

> The client could, in theory, have a priori knowledge of a non-builtin
> type.

I don't see what's "in theory" about that.  There seems plenty of
use for binary I/O of, say, PostGIS types.  Even for built-in types,
do we really want to encourage people to hard-wire their OIDs into
applications?

I don't see a big problem with driving this off a GUC, but I think
it should be a list of type names not OIDs.  We already have plenty
of precedent for dealing with that sort of thing; see search_path
for the canonical example.  IIRC, there's similar caching logic
for temp_tablespaces.

            regards, tom lane



Re: Request for comment on setting binary format output per session

From
"David G. Johnston"
Date:
On Sat, Mar 4, 2023 at 5:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2023-03-04 at 18:04 -0500, Dave Cramer wrote:
>> Most of the clients know how to decode the builtin types. I'm not
>> sure there is a use case for binary encode types that the clients
>> don't have a priori knowledge of.

> The client could, in theory, have a priori knowledge of a non-builtin
> type.

I don't see what's "in theory" about that.  There seems plenty of
use for binary I/O of, say, PostGIS types.  Even for built-in types,
do we really want to encourage people to hard-wire their OIDs into
applications?

I don't see a big problem with driving this off a GUC, but I think
it should be a list of type names not OIDs.  We already have plenty
of precedent for dealing with that sort of thing; see search_path
for the canonical example.  IIRC, there's similar caching logic
for temp_tablespaces.


This seems slightly different since types depend upon schemas whereas search_path is top-level and tablespaces are global.  But I agree that names should be accepted, maybe in addition to OIDs, the latter, for core types in particular, being a way to not have to worry about masking in user-space.

David J.

Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Sat, 4 Mar 2023 at 19:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2023-03-04 at 18:04 -0500, Dave Cramer wrote:
>> Most of the clients know how to decode the builtin types. I'm not
>> sure there is a use case for binary encode types that the clients
>> don't have a priori knowledge of.

> The client could, in theory, have a priori knowledge of a non-builtin
> type.

I don't see what's "in theory" about that.  There seems plenty of
use for binary I/O of, say, PostGIS types.  Even for built-in types,
do we really want to encourage people to hard-wire their OIDs into
applications?

How does a client read these? I'm pretty narrowly focussed. The JDBC API doesn't really have a way to read a non built-in type.  There is a facility to read a UDT, but the user would have to provide that transcoder. I guess I'm curious how other clients read binary UDT's ?

I don't see a big problem with driving this off a GUC, but I think
it should be a list of type names not OIDs.  We already have plenty
of precedent for dealing with that sort of thing; see search_path
for the canonical example.  IIRC, there's similar caching logic
for temp_tablespaces.

I have no issue with allowing names, OID's were compact, but we could easily support both

Dave

Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:

Dave Cramer


On Sat, 4 Mar 2023 at 19:39, Dave Cramer <davecramer@gmail.com> wrote:


On Sat, 4 Mar 2023 at 19:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2023-03-04 at 18:04 -0500, Dave Cramer wrote:
>> Most of the clients know how to decode the builtin types. I'm not
>> sure there is a use case for binary encode types that the clients
>> don't have a priori knowledge of.

> The client could, in theory, have a priori knowledge of a non-builtin
> type.

I don't see what's "in theory" about that.  There seems plenty of
use for binary I/O of, say, PostGIS types.  Even for built-in types,
do we really want to encourage people to hard-wire their OIDs into
applications?

How does a client read these? I'm pretty narrowly focussed. The JDBC API doesn't really have a way to read a non built-in type.  There is a facility to read a UDT, but the user would have to provide that transcoder. I guess I'm curious how other clients read binary UDT's ?

I don't see a big problem with driving this off a GUC, but I think
it should be a list of type names not OIDs.  We already have plenty
of precedent for dealing with that sort of thing; see search_path
for the canonical example.  IIRC, there's similar caching logic
for temp_tablespaces.

I have no issue with allowing names, OID's were compact, but we could easily support both

Attached is a preliminary patch that takes a list of OID's. I'd like to know if this is going in the right direction.

Next step would be to deal with type names as opposed to OID's. 
This will be a bit more challenging as type names are schema specific.

Dave
Attachment

Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Mon, 2023-03-13 at 16:33 -0400, Dave Cramer wrote:
> Attached is a preliminary patch that takes a list of OID's. I'd like
> to know if this is going in the right direction.

I found a few issues:

1. Some kind of memory error:

  SET format_binary='25,1082,1184';
  WARNING:  problem in alloc set PortalContext: detected write past
chunk end in block 0x55ba7b5f7610, chunk 0x55ba7b5f7a48
  ...
  SET

2. Easy to confuse psql:

  CREATE TABLE a(d date, t timestamptz);
  SET format_binary='25,1082,1184';
  SELECT * FROM a;
   d | t
  ---+---
   ! |
  (1 row)

3. Some style issues
  - use of "//" comments
  - findOid should return bool, not int

When you add support for user-defined types, that introduces a couple
other issues:

4. The format_binary GUC would depend on the search_path GUC, which
isn't great.

5. There's a theoretical invalidation problem. It might also be a
practical problem in some testing setups with long-lived connections
that are recreating user-defined types.


We've had this problem with binary for a long time, and it seems
desirable to solve it. But I'm not sure GUCs are the right way.

How hard did you try to solve it in the protocol rather than with a
GUC? I see that the startup message allows protocol extensions by
prefixing a parameter name with "_pq_". Are protocol extensions
documented somewhere and would that be a reasonable thing to do here?

Also, if we're going to make the binary format more practical to use,
can we document the expectations better? It seems the expecatation is
that the binary format just never changes, and that if it does, that's
a new type name.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:

On Mon, 20 Mar 2023 at 13:05, Jeff Davis <pgsql@j-davis.com> wrote:
On Mon, 2023-03-13 at 16:33 -0400, Dave Cramer wrote:
> Attached is a preliminary patch that takes a list of OID's. I'd like
> to know if this is going in the right direction.


Thanks for the review. I'm curious what system you are running on as I don't see any of these errors. 
I found a few issues:

1. Some kind of memory error:

  SET format_binary='25,1082,1184';
  WARNING:  problem in alloc set PortalContext: detected write past
chunk end in block 0x55ba7b5f7610, chunk 0x55ba7b5f7a48
  ...
  SET 
2. Easy to confuse psql:

  CREATE TABLE a(d date, t timestamptz);
  SET format_binary='25,1082,1184';
  SELECT * FROM a;
   d | t
  ---+---
   ! |
  (1 row)

Well I'm guessing psql doesn't know how to read date or timestamptz in binary. This is not a failing of the code.
 
3. Some style issues
  - use of "//" comments
  - findOid should return bool, not int

Sure will fix see attached patch 
When you add support for user-defined types, that introduces a couple
other issues:

4. The format_binary GUC would depend on the search_path GUC, which
isn't great.
This is an interesting question. If the type isn't visible then it's not visible to the query so 

5. There's a theoretical invalidation problem. It might also be a
practical problem in some testing setups with long-lived connections
that are recreating user-defined types.
UDT's seem to be a problem here which candidly have very little use case for binary output. 


We've had this problem with binary for a long time, and it seems
desirable to solve it. But I'm not sure GUCs are the right way.

How hard did you try to solve it in the protocol rather than with a
GUC? I see that the startup message allows protocol extensions by
prefixing a parameter name with "_pq_". Are protocol extensions
documented somewhere and would that be a reasonable thing to do here?

I didn't try to solve it as Tom was OK with using a GUC. Using a startup GUC is interesting, 
but how would that work with pools where we want to reset the connection when we return it and then
set the binary format on borrow ? By using a GUC when a client borrows a connection from a pool the client
can reconfigure the oids it wants formatted in binary.

Also, if we're going to make the binary format more practical to use,
can we document the expectations better?
Yes we can do that. 
It seems the expecatation is
that the binary format just never changes, and that if it does, that's
a new type name.

I really hadn't considered supporting type names. I have asked Paul Ramsey  about PostGIS and he doesn't see PostGIS using this.
 
Regards,
        Jeff Davis

Attachment

Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Mon, 2023-03-20 at 10:04 -0700, Jeff Davis wrote:
>   CREATE TABLE a(d date, t timestamptz);
>   SET format_binary='25,1082,1184';
>   SELECT * FROM a;
>    d | t
>   ---+---
>    ! |
>   (1 row)

Oops, missing the following statement after the CREATE TABLE:

  INSERT INTO a VALUES('1234-01-01', '2023-03-20 09:00:00');

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Tom Lane
Date:
Dave Cramer <davecramer@gmail.com> writes:
> On Mon, 20 Mar 2023 at 13:05, Jeff Davis <pgsql@j-davis.com> wrote:
>> 2. Easy to confuse psql:
>> 
>> CREATE TABLE a(d date, t timestamptz);
>> SET format_binary='25,1082,1184';
>> SELECT * FROM a;
>> d | t
>> ---+---
>> ! |
>> (1 row)
>> 
>> Well I'm guessing psql doesn't know how to read date or timestamptz in
>> binary. This is not a failing of the code.

What it is is a strong suggestion that controlling this via a GUC is
not a great choice.  There are many inappropriate (wrong abstraction
level) ways to change a GUC and thereby break a client that's not
expecting binary output.  I think Jeff's suggestion that we should
treat this as a protocol extension might be a good idea.

If I recall the protocol-extension design correctly, such a setting
could only be set at session start, which could be annoying --- at the
very least we'd have to tolerate entries for unrecognized data types,
since clients couldn't be expected to have checked the list against
the current server in advance.

            regards, tom lane



Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Mon, 2023-03-20 at 14:36 -0400, Dave Cramer wrote:
> Thanks for the review. I'm curious what system you are running on as
> I don't see any of these errors. 

Are asserts enabled?

> Well I'm guessing psql doesn't know how to read date or timestamptz
> in binary. This is not a failing of the code.

It seems strange, and potentially dangerous, to send binary data to a
client that's not expecting it. It feels too easy to cause confusion by
changing the GUC mid-session.

Also, it seems like DISCARD ALL is not resetting it, which I think is a
bug.

>
> This is an interesting question. If the type isn't visible then it's
> not visible to the query so 

I don't think that's true -- the type could be in a different schema
from the table.

> >
> > 5. There's a theoretical invalidation problem. It might also be a
> > practical problem in some testing setups with long-lived
> > connections
> > that are recreating user-defined types.
> >
>
> UDT's seem to be a problem here which candidly have very little use
> case for binary output. 

I mostly agree with that, but it also might not be hard to support
UDTs. Is there a design problem here or is it "just a matter of code"?

>
> I didn't try to solve it as Tom was OK with using a GUC. Using a
> startup GUC is interesting, 
> but how would that work with pools where we want to reset the
> connection when we return it and then
> set the binary format on borrow ? By using a GUC when a client
> borrows a connection from a pool the client
> can reconfigure the oids it wants formatted in binary.

That's a good point. How common is it to share a connection pool
between different clients (some of which might support a binary format,
and others which don't)? And would the connection pool put connections
with and without the property in different pools?

>
> I really hadn't considered supporting type names. I have asked Paul
> Ramsey  about PostGIS and he doesn't see PostGIS using this.

One of the things I like about Postgres is that the features all work
together, and that user-defined objects are generally as good as built-
in ones. Sometimes there's a reason to make a special case (e.g. syntax
support or something), but in this case it seems like we could support
user-defined types just fine, right? It's also just more friendly and
readable to use type names, especially if it's a GUC.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Merlin Moncure
Date:


On Mon, Mar 13, 2023 at 3:33 PM Dave Cramer <davecramer@gmail.com> wrote:

Dave Cramer


On Sat, 4 Mar 2023 at 19:39, Dave Cramer <davecramer@gmail.com> wrote:


On Sat, 4 Mar 2023 at 19:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2023-03-04 at 18:04 -0500, Dave Cramer wrote:
>> Most of the clients know how to decode the builtin types. I'm not
>> sure there is a use case for binary encode types that the clients
>> don't have a priori knowledge of.

> The client could, in theory, have a priori knowledge of a non-builtin
> type.

I don't see what's "in theory" about that.  There seems plenty of
use for binary I/O of, say, PostGIS types.  Even for built-in types,
do we really want to encourage people to hard-wire their OIDs into
applications?

How does a client read these? I'm pretty narrowly focussed. The JDBC API doesn't really have a way to read a non built-in type.  There is a facility to read a UDT, but the user would have to provide that transcoder. I guess I'm curious how other clients read binary UDT's ?

I don't see a big problem with driving this off a GUC, but I think
it should be a list of type names not OIDs.  We already have plenty
of precedent for dealing with that sort of thing; see search_path
for the canonical example.  IIRC, there's similar caching logic
for temp_tablespaces.

I have no issue with allowing names, OID's were compact, but we could easily support both

Attached is a preliminary patch that takes a list of OID's. I'd like to know if this is going in the right direction.

Next step would be to deal with type names as opposed to OID's. 
This will be a bit more challenging as type names are schema specific.

OIDs are a pain to deal with IMO.   They will not survive a dump style restore, and are hard to keep synchronized between databases...type names don't have this problem.   OIDs are an implementation artifact that ought not need any extra dependency.  

This seems like a protocol or even a driver issue rather than a GUC issue. Why does the server need to care what format the client might want to prefer on a query by query basis?  I just don't see it. The resultformat switch in libpq works pretty well, except that it's "all in" on getting data from the server, with the dead simple workaround of casting to text which might even be able to be managed from within the driver itself. 

merlin


Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:



On Mon, 20 Mar 2023 at 19:10, Merlin Moncure <mmoncure@gmail.com> wrote:


On Mon, Mar 13, 2023 at 3:33 PM Dave Cramer <davecramer@gmail.com> wrote:

Dave Cramer


On Sat, 4 Mar 2023 at 19:39, Dave Cramer <davecramer@gmail.com> wrote:


On Sat, 4 Mar 2023 at 19:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2023-03-04 at 18:04 -0500, Dave Cramer wrote:
>> Most of the clients know how to decode the builtin types. I'm not
>> sure there is a use case for binary encode types that the clients
>> don't have a priori knowledge of.

> The client could, in theory, have a priori knowledge of a non-builtin
> type.

I don't see what's "in theory" about that.  There seems plenty of
use for binary I/O of, say, PostGIS types.  Even for built-in types,
do we really want to encourage people to hard-wire their OIDs into
applications?

How does a client read these? I'm pretty narrowly focussed. The JDBC API doesn't really have a way to read a non built-in type.  There is a facility to read a UDT, but the user would have to provide that transcoder. I guess I'm curious how other clients read binary UDT's ?

I don't see a big problem with driving this off a GUC, but I think
it should be a list of type names not OIDs.  We already have plenty
of precedent for dealing with that sort of thing; see search_path
for the canonical example.  IIRC, there's similar caching logic
for temp_tablespaces.

I have no issue with allowing names, OID's were compact, but we could easily support both

Attached is a preliminary patch that takes a list of OID's. I'd like to know if this is going in the right direction.

Next step would be to deal with type names as opposed to OID's. 
This will be a bit more challenging as type names are schema specific.

OIDs are a pain to deal with IMO.   They will not survive a dump style restore, and are hard to keep synchronized between databases...type names don't have this problem.   OIDs are an implementation artifact that ought not need any extra dependency.
AFAIK, OID's for built-in types don't change. 
Clearly we need more thought on how to deal with UDT's  
  

This seems like a protocol or even a driver issue rather than a GUC issue. Why does the server need to care what format the client might want to prefer on a query by query basis? 

Actually this isn't a query by query basis. The point of this is that the client wants all the results for given OID's in binary. 
 
I just don't see it. The resultformat switch in libpq works pretty well, except that it's "all in" on getting data from the server, with the dead simple workaround of casting to text which might even be able to be managed from within the driver itself. 

merlin


Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:

Dave Cramer


On Mon, 20 Mar 2023 at 15:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dave Cramer <davecramer@gmail.com> writes:
> On Mon, 20 Mar 2023 at 13:05, Jeff Davis <pgsql@j-davis.com> wrote:
>> 2. Easy to confuse psql:
>>
>> CREATE TABLE a(d date, t timestamptz);
>> SET format_binary='25,1082,1184';
>> SELECT * FROM a;
>> d | t
>> ---+---
>> ! |
>> (1 row)
>>
>> Well I'm guessing psql doesn't know how to read date or timestamptz in
>> binary. This is not a failing of the code.

What it is is a strong suggestion that controlling this via a GUC is
not a great choice.  There are many inappropriate (wrong abstraction
level) ways to change a GUC and thereby break a client that's not
expecting binary output.  I think Jeff's suggestion that we should
treat this as a protocol extension might be a good idea.

If I recall the protocol-extension design correctly, such a setting
could only be set at session start, which could be annoying --- at the
very least we'd have to tolerate entries for unrecognized data types,
since clients couldn't be expected to have checked the list against
the current server in advance.

As mentioned for connection pools we need to be able to set these after the session starts.
I'm not sure how useful the protocol extension mechanism works given that it can only be used on startup.  

                        regards, tom lane

Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:



On Mon, 20 Mar 2023 at 15:09, Jeff Davis <pgsql@j-davis.com> wrote:
On Mon, 2023-03-20 at 14:36 -0400, Dave Cramer wrote:
> Thanks for the review. I'm curious what system you are running on as
> I don't see any of these errors. 

Are asserts enabled?

> Well I'm guessing psql doesn't know how to read date or timestamptz
> in binary. This is not a failing of the code.

It seems strange, and potentially dangerous, to send binary data to a
client that's not expecting it. It feels too easy to cause confusion by
changing the GUC mid-session.

Also, it seems like DISCARD ALL is not resetting it, which I think is a
bug.
Thanks yes, this is a bug 

>
> This is an interesting question. If the type isn't visible then it's
> not visible to the query so 

I don't think that's true -- the type could be in a different schema
from the table.

Good point. This seems to be the very difficult part. 
 

> >
> > 5. There's a theoretical invalidation problem. It might also be a
> > practical problem in some testing setups with long-lived
> > connections
> > that are recreating user-defined types.
> >
>
> UDT's seem to be a problem here which candidly have very little use
> case for binary output. 

I mostly agree with that, but it also might not be hard to support
UDTs. Is there a design problem here or is it "just a matter of code"?

>
> I didn't try to solve it as Tom was OK with using a GUC. Using a
> startup GUC is interesting, 
> but how would that work with pools where we want to reset the
> connection when we return it and then
> set the binary format on borrow ? By using a GUC when a client
> borrows a connection from a pool the client
> can reconfigure the oids it wants formatted in binary.

That's a good point. How common is it to share a connection pool
between different clients (some of which might support a binary format,
and others which don't)? And would the connection pool put connections
with and without the property in different pools?

For JAVA pools it's probably OK, but for pools like pgbouncer we have no control of who is going to get the connection next.
 

>
> I really hadn't considered supporting type names. I have asked Paul
> Ramsey  about PostGIS and he doesn't see PostGIS using this.

One of the things I like about Postgres is that the features all work
together, and that user-defined objects are generally as good as built-
in ones. Sometimes there's a reason to make a special case (e.g. syntax
support or something), but in this case it seems like we could support
user-defined types just fine, right? It's also just more friendly and
readable to use type names, especially if it's a GUC.

Regards,
        Jeff Davis

Re: Request for comment on setting binary format output per session

From
Merlin Moncure
Date:


On Mon, Mar 20, 2023 at 7:11 PM Dave Cramer <davecramer@gmail.com> wrote:



On Mon, 20 Mar 2023 at 19:10, Merlin Moncure <mmoncure@gmail.com> wrote:


On Mon, Mar 13, 2023 at 3:33 PM Dave Cramer <davecramer@gmail.com> wrote:

OIDs are a pain to deal with IMO.   They will not survive a dump style restore, and are hard to keep synchronized between databases...type names don't have this problem.   OIDs are an implementation artifact that ought not need any extra dependency.
AFAIK, OID's for built-in types don't change. 
Clearly we need more thought on how to deal with UDT's  

Yeah.  Not having a solution that handles arrays and composites though would feel pretty incomplete since they would be the one of the main beneficiaries from a performance standpoint.    I guess minimally you'd need to expose some mechanic to look up oids, but being able to specify "foo"."bar", in the GUC would be pretty nice (albeit a lot more work).
 
This seems like a protocol or even a driver issue rather than a GUC issue. Why does the server need to care what format the client might want to prefer on a query by query basis? 

Actually this isn't a query by query basis. The point of this is that the client wants all the results for given OID's in binary. 

Yep.  Your rationale is starting to click.  How would this interact with existing code bases?  I get that JDBC is the main target, but how does this interact with libpq code that explicitly sets resultformat? Perhaps the answer should be as it shouldn't change documented behavior, and a hypothetical resultformat=2 could be reserved to default to text but allow for server control, and 3 as the same but default to binary.

merlin

 

Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:



On Tue, 21 Mar 2023 at 07:35, Merlin Moncure <mmoncure@gmail.com> wrote:


On Mon, Mar 20, 2023 at 7:11 PM Dave Cramer <davecramer@gmail.com> wrote:



On Mon, 20 Mar 2023 at 19:10, Merlin Moncure <mmoncure@gmail.com> wrote:


On Mon, Mar 13, 2023 at 3:33 PM Dave Cramer <davecramer@gmail.com> wrote:

OIDs are a pain to deal with IMO.   They will not survive a dump style restore, and are hard to keep synchronized between databases...type names don't have this problem.   OIDs are an implementation artifact that ought not need any extra dependency.
AFAIK, OID's for built-in types don't change. 
Clearly we need more thought on how to deal with UDT's  

Yeah.  Not having a solution that handles arrays and composites though would feel pretty incomplete since they would be the one of the main beneficiaries from a performance standpoint.   
I don't think arrays of built-in types are a problem; drivers already know how to deal with these.
 
I guess minimally you'd need to expose some mechanic to look up oids, but being able to specify "foo"."bar", in the GUC would be pretty nice (albeit a lot more work).

As Jeff mentioned there is a visibility problem if the search path is changed. The simplest solution IMO is to look up the OID at the time the format is requested and use the OID going forward to format the output as binary. If the search path changes and a type with the same name is now first in the search path then the data would be returned in text. 

 
 
This seems like a protocol or even a driver issue rather than a GUC issue. Why does the server need to care what format the client might want to prefer on a query by query basis? 

Actually this isn't a query by query basis. The point of this is that the client wants all the results for given OID's in binary. 

Yep.  Your rationale is starting to click.  How would this interact with existing code bases? 
Actually JDBC wasn't the first to ask for this.  Default result formats should be settable per session · postgresql-interfaces/enhancement-ideas · Discussion #5 (github.com) I've tested it with JDBC and it requires no code changes on our end. Jack tested it and it required no code changes on his end either. He did some performance tests and found "At 100 rows the text format takes 48% longer than the binary format."  https://github.com/postgresql-interfaces/enhancement-ideas/discussions/5#discussioncomment-3188599

I get that JDBC is the main target, but how does this interact with libpq code that explicitly sets resultformat?
Honestly I have no idea how it would function with libpq. I presume if the client did not request binary format then things would work as they do today.


Dave
 

Re: Request for comment on setting binary format output per session

From
Merlin Moncure
Date:

On Tue, Mar 21, 2023 at 8:22 AM Dave Cramer <davecramer@gmail.com> wrote:

On Tue, 21 Mar 2023 at 07:35, Merlin Moncure <mmoncure@gmail.com> wrote:


On Mon, Mar 20, 2023 at 7:11 PM Dave Cramer <davecramer@gmail.com> wrote:



On Mon, 20 Mar 2023 at 19:10, Merlin Moncure <mmoncure@gmail.com> wrote:


On Mon, Mar 13, 2023 at 3:33 PM Dave Cramer <davecramer@gmail.com> wrote:

OIDs are a pain to deal with IMO.   They will not survive a dump style restore, and are hard to keep synchronized between databases...type names don't have this problem.   OIDs are an implementation artifact that ought not need any extra dependency.
AFAIK, OID's for built-in types don't change. 
Clearly we need more thought on how to deal with UDT's  

Yeah.  Not having a solution that handles arrays and composites though would feel pretty incomplete since they would be the one of the main beneficiaries from a performance standpoint.   
I don't think arrays of built-in types are a problem; drivers already know how to deal with these.
 
I guess minimally you'd need to expose some mechanic to look up oids, but being able to specify "foo"."bar", in the GUC would be pretty nice (albeit a lot more work).

As Jeff mentioned there is a visibility problem if the search path is changed.

Only if the name is not fully qualified. By allowing OID to bypass visibility, it stands to reason visibility ought to be bypassed for type requests as well, or at least be able to be.  If we are setting things in GUC, that suggests we can establish things in postgresql.conf, and oids feel out of place there.

Yep.  Your rationale is starting to click.  How would this interact with existing code bases? 
Actually JDBC wasn't the first to ask for this.  Default result formats should be settable per session · postgresql-interfaces/enhancement-ideas · Discussion #5 (github.com) I've tested it with JDBC and it requires no code changes on our end. Jack tested it and it required no code changes on his end either. He did some performance tests and found "At 100 rows the text format takes 48% longer than the binary format."  https://github.com/postgresql-interfaces/enhancement-ideas/discussions/5#discussioncomment-3188599

Yeah, the general need is very clear IMO.
 
I get that JDBC is the main target, but how does this interact with libpq code that explicitly sets resultformat?
Honestly I have no idea how it would function with libpq. I presume if the client did not request binary format then things would work as they do today.

I see your argument here, but IMO this is another can of nudge away from GUC, unless you're willing to establish that behavior.  Thinking here is that the GUC wouldn't do anything for libpq, uses cases, and couldn't, since resultformat would be overriding the behavior in all interesting cases...it seems odd to implement server side specified behavior that the client library doesn't implement. 

merlin


Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Mon, 2023-03-20 at 20:18 -0400, Dave Cramer wrote:
> For JAVA pools it's probably OK, but for pools like pgbouncer we have
> no control of who is going to get the connection next.

Can pgbouncer use different pools for different settings of
format_binary?

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Tue, 21 Mar 2023 at 11:52, Jeff Davis <pgsql@j-davis.com> wrote:
On Mon, 2023-03-20 at 20:18 -0400, Dave Cramer wrote:
> For JAVA pools it's probably OK, but for pools like pgbouncer we have
> no control of who is going to get the connection next.

Can pgbouncer use different pools for different settings of
format_binary?


My concern here is that if I can only change binary format in the startup parameter then when I return the connection to the pool I would expect the pool to reset all session level settings including binary format. 
The next time I borrow the connection I can no longer set binary format.


Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Tue, 2023-03-21 at 09:22 -0400, Dave Cramer wrote:
> As Jeff mentioned there is a visibility problem if the search path is
> changed. The simplest solution IMO is to look up the OID at the time
> the format is requested and use the OID going forward to format the
> output as binary. If the search path changes and a type with the same
> name is now first in the search path then the data would be returned
> in text. 

The binary format parameter would ordinarily be set by the maintainer
of the client library, who knows nothing about the schema the client
might be accessing, and nothing about the search_path that might be
set. They would only know which binary parsers they've already written
and included with their client library.

With that in mind, using search_path at all seems weird. Why would a
change in search_path affect which types the client library knows how
to parse? If the client library knows how to parse "foo.mytype"'s
binary representation, and you change the search path such that it
finds "bar.mytype" instead, did the client library all of a sudden
forget how to parse "foo.mytype" and learn to parse "bar.mytype"?

If there's some extension that offers type "mytype", and perhaps allows
it to be installed in any schema, then it seems that the client library
would know how to parse all instances of "mytype" regardless of the
schema or search_path.

Of course, a potential problem is that ordinary users can create types
(e.g. enum types) and so you'd have to be careful about some tricks
where someone shadows a well-known extension in order to confuse the
client with unexpected binary data (not sure if that's a security
concern or not, just thinking out loud).

One solution might be that unqualified type names would work on all
types of that name (in any schema) that are owned by a superuser,
regardless of search_path. Most extension scripts will be run as
superuser anyway. It would feel a little magical, which I don't like,
but would work in any practical case I can think of.

Another solution would be to have some extra catalog field in pg_type
that would be a "binary format identifier" and use that rather than the
type name to match up binary parsers with the proper type.

Am I over-thinking this?

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Peter Eisentraut
Date:
On 04.03.23 17:35, Jeff Davis wrote:
> On Thu, 2023-03-02 at 09:13 -0500, Dave Cramer wrote:
>> I'd like to open up this discussion again so that we can
>> move forward. I prefer the GUC as it is relatively simple and as
>> Peter mentioned it works, but I'm not married to the idea.
> 
> It's not very friendly to extensions, where the types are not
> guaranteed to have stable OIDs. Did you consider any proposals that
> work with type names?

Sending type names is kind of useless if what comes back with the result 
(RowDescription) are OIDs anyway.

The client would presumably have some code like

if (typeoid == 555)
     parseThatType();

So it already needs to know about the OIDs of all the types it is 
interested in.




Re: Request for comment on setting binary format output per session

From
Peter Eisentraut
Date:
On 20.03.23 18:04, Jeff Davis wrote:
> 2. Easy to confuse psql:
> 
>    CREATE TABLE a(d date, t timestamptz);
>    SET format_binary='25,1082,1184';
>    SELECT * FROM a;
>     d | t
>    ---+---
>     ! |
>    (1 row)

You can already send binary data to psql using DECLARE BINARY CURSOR. 
It might be sensible to have psql check that the data it is getting is 
text format before trying to print it.




Re: Request for comment on setting binary format output per session

From
Peter Eisentraut
Date:
On 20.03.23 18:04, Jeff Davis wrote:
> Also, if we're going to make the binary format more practical to use,
> can we document the expectations better? It seems the expecatation is
> that the binary format just never changes, and that if it does, that's
> a new type name.

I've been thinking that we need some new kind of identifier to allow 
clients to process types in more sophisticated ways.

For example, each type could be (self-)assigned a UUID, which is fixed 
for that type no matter in which schema or under what extension name or 
with what OID it is installed.  Client libraries could then hardcode 
that UUID for processing the types.  Conversely, the UUID could be 
changed if the wire format of the type is changed, without having to 
change the type name.




Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:
If there's some extension that offers type "mytype", and perhaps allows
it to be installed in any schema, then it seems that the client library
would know how to parse all instances of "mytype" regardless of the
schema or search_path.

I may be overthinking this.

Dave Cramer


On Tue, 21 Mar 2023 at 17:47, Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2023-03-21 at 09:22 -0400, Dave Cramer wrote:
> As Jeff mentioned there is a visibility problem if the search path is
> changed. The simplest solution IMO is to look up the OID at the time
> the format is requested and use the OID going forward to format the
> output as binary. If the search path changes and a type with the same
> name is now first in the search path then the data would be returned
> in text. 

The binary format parameter would ordinarily be set by the maintainer
of the client library, who knows nothing about the schema the client
might be accessing, and nothing about the search_path that might be
set. They would only know which binary parsers they've already written
and included with their client library.

With that in mind, using search_path at all seems weird. Why would a
change in search_path affect which types the client library knows how
to parse? If the client library knows how to parse "foo.mytype"'s
binary representation, and you change the search path such that it
finds "bar.mytype" instead, did the client library all of a sudden
forget how to parse "foo.mytype" and learn to parse "bar.mytype"?

If there's some extension that offers type "mytype", and perhaps allows
it to be installed in any schema, then it seems that the client library
would know how to parse all instances of "mytype" regardless of the
schema or search_path.

Of course, a potential problem is that ordinary users can create types
(e.g. enum types) and so you'd have to be careful about some tricks
where someone shadows a well-known extension in order to confuse the
client with unexpected binary data (not sure if that's a security
concern or not, just thinking out loud).

One solution might be that unqualified type names would work on all
types of that name (in any schema) that are owned by a superuser,
regardless of search_path. Most extension scripts will be run as
superuser anyway. It would feel a little magical, which I don't like,
but would work in any practical case I can think of.

Another solution would be to have some extra catalog field in pg_type
that would be a "binary format identifier" and use that rather than the
type name to match up binary parsers with the proper type.

Am I over-thinking this?

Regards,
        Jeff Davis

Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:
If I recall the protocol-extension design correctly, such a setting
could only be set at session start, which could be annoying --- at the
very least we'd have to tolerate entries for unrecognized data types,
since clients couldn't be expected to have checked the list against
the current server in advance.

The protocol extension design has the drawback that it can only be set at startup. 
What if we were to allow changes to the setting after startup if the client passed the cancel key as a unique identifier that only the driver would know?
 
Dave Cramer




Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Wed, 2023-03-22 at 10:21 +0100, Peter Eisentraut wrote:
> I've been thinking that we need some new kind of identifier to allow
> clients to process types in more sophisticated ways.
>
> For example, each type could be (self-)assigned a UUID, which is
> fixed
> for that type no matter in which schema or under what extension name
> or
> with what OID it is installed.  Client libraries could then hardcode
> that UUID for processing the types.  Conversely, the UUID could be
> changed if the wire format of the type is changed, without having to
> change the type name.

That sounds reasonable to me. It could also be useful for other
extension objects (or the extension itself) to avoid other kinds of
weirdness from name collisions or major version updates or extensions
that depend on other extensions.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Wed, 2023-03-22 at 10:12 +0100, Peter Eisentraut wrote:
> Sending type names is kind of useless if what comes back with the
> result
> (RowDescription) are OIDs anyway.
>
> The client would presumably have some code like
>
> if (typeoid == 555)
>      parseThatType();
>
> So it already needs to know about the OIDs of all the types it is
> interested in.

Technically it's still an improvement because you can avoid an extra
round-trip. The client library can pipeline a query like:

   SELECT typname, oid FROM pg_type
     WHERE typname IN (...list of supported type names...);

when the client first connects, and then go ahead and send whatever
queries you want without waiting for the response. When you get back
the result of the pg_type query, you cache the mapping, and use it to
process any other results you get.

That avoids introducing an extra round-trip. I'm not sure if that's a
reasonable thing to expect the client to do, so I agree that we should
offer a better way.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2023-03-22 at 10:21 +0100, Peter Eisentraut wrote:
>> I've been thinking that we need some new kind of identifier to allow 
>> clients to process types in more sophisticated ways.
>> For example, each type could be (self-)assigned a UUID, which is fixed 
>> for that type no matter in which schema or under what extension name or 
>> with what OID it is installed.  Client libraries could then hardcode 
>> that UUID for processing the types.  Conversely, the UUID could be 
>> changed if the wire format of the type is changed, without having to 
>> change the type name.

> That sounds reasonable to me. It could also be useful for other
> extension objects (or the extension itself) to avoid other kinds of
> weirdness from name collisions or major version updates or extensions
> that depend on other extensions.

This isn't going to help much unless we change the wire protocol
so that RowDescription messages carry these UUIDs instead of
(or in addition to?) the OIDs of the column datatypes.  While
that's not completely out of the question, it's a heavy lift
that will affect multiple layers of client code along with the
server.

Also, what about container types?  I doubt it's sane for
array-of-foo to have a UUID that's unrelated to the one for foo.
Composites and ranges would need some intelligence too if we
don't want them to be unduly complicated to process.

            regards, tom lane



Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Wed, 2023-03-22 at 14:42 -0400, Tom Lane wrote:
> This isn't going to help much unless we change the wire protocol
> so that RowDescription messages carry these UUIDs instead of
> (or in addition to?) the OIDs of the column datatypes.  While
> that's not completely out of the question, it's a heavy lift
> that will affect multiple layers of client code along with the
> server.

I'm not sure that's a hard requirement. I pointed out a similar
solution for type names here:

https://www.postgresql.org/message-id/4297b9e310172b9a1e6d737e21ad8796d0ab7b03.camel@j-davis.com

In other words: if the Bind message depends on knowing the OID
mappings, that forces an extra round-trip; but if the client doesn't
need the mapping until it receives its first result, then it can use
pipelining to avoid the extra round-trip.

(I haven't actually tried it and I don't know if it's very reasonable
to expect the client to do this.)

> Also, what about container types?  I doubt it's sane for
> array-of-foo to have a UUID that's unrelated to the one for foo.
> Composites and ranges would need some intelligence too if we
> don't want them to be unduly complicated to process.

That's a good point. I don't know if that is a major design issue or
not; but it certainly adds complexity to the proposal and/or clients
implementing it.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Wed, 22 Mar 2023 at 15:23, Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2023-03-22 at 14:42 -0400, Tom Lane wrote:
> This isn't going to help much unless we change the wire protocol
> so that RowDescription messages carry these UUIDs instead of
> (or in addition to?) the OIDs of the column datatypes.  While
> that's not completely out of the question, it's a heavy lift
> that will affect multiple layers of client code along with the
> server.

I'm not sure that's a hard requirement. I pointed out a similar
solution for type names here:

https://www.postgresql.org/message-id/4297b9e310172b9a1e6d737e21ad8796d0ab7b03.camel@j-davis.com

In other words: if the Bind message depends on knowing the OID
mappings, that forces an extra round-trip; but if the client doesn't
need the mapping until it receives its first result, then it can use
pipelining to avoid the extra round-trip.

This overcomplicates things for the JDBC driver. We don't pipeline queries, well we do for batch queries but those are special.


(I haven't actually tried it and I don't know if it's very reasonable
to expect the client to do this.)

> Also, what about container types?  I doubt it's sane for
> array-of-foo to have a UUID that's unrelated to the one for foo.
> Composites and ranges would need some intelligence too if we
> don't want them to be unduly complicated to process.

That's a good point. I don't know if that is a major design issue or
not; but it certainly adds complexity to the proposal and/or clients
implementing it.

So where do we go from here ?

I can implement using type names as well as OID's

Dave

Re: Request for comment on setting binary format output per session

From
Merlin Moncure
Date:
On Tue, Mar 21, 2023 at 4:47 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2023-03-21 at 09:22 -0400, Dave Cramer wrote:
> As Jeff mentioned there is a visibility problem if the search path is
> changed. The simplest solution IMO is to look up the OID at the time
> the format is requested and use the OID going forward to format the
> output as binary. If the search path changes and a type with the same
> name is now first in the search path then the data would be returned
> in text. 

Am I over-thinking this?

I think so.  Dave's idea puts a lot of flexibility into the client side, and that's good.  Search path mechanics are really well understood and well integrated with extensions already (create extension ..schema) assuming that the precise time UDT are looked up in an unqualified way is very clear to- or invoked via- the client code.  I'll say it again though; OIDs really ought to be considered a transient cache of type information rather than a permanent identifier. 

Regarding UDT, lots of common and useful scenarios (containers, enum, range, etc), do not require special knowledge to parse beyond the kind of type it is.  Automatic type creation from tables is one of the most genius things about postgres and directly wiring client structures to them through binary is really nifty.  This undermines the case that binary parsing requires special knowledge IMO, UDT might in fact be the main long term target...I could see scenarios where java classes might be glued directly to postgres tables by the driver...this would be a lot more efficient than using json which is how everyone does it today.  Then again, maybe *I* might be overthinking this.

merlin

Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Fri, 2023-03-24 at 07:52 -0500, Merlin Moncure wrote:
> I think so.  Dave's idea puts a lot of flexibility into the client
> side, and that's good.  Search path mechanics are really well
> understood and well integrated with extensions already (create
> extension ..schema) assuming that the precise time UDT are looked up
> in an unqualified way is very clear to- or invoked via- the client
> code.  I'll say it again though; OIDs really ought to be considered a
> transient cache of type information rather than a
> permanent identifier. 

I'm not clear on what proposal you are making and/or endorising?

> Regarding UDT, lots of common and useful scenarios (containers, enum,
> range, etc), do not require special knowledge to parse beyond the
> kind of type it is.  Automatic type creation from tables is one of
> the most genius things about postgres and directly wiring client
> structures to them through binary is really nifty.

Perhaps not special knowledge, but you need to know the structure. If
you have a query like "SELECT '...'::sometable", you still need to know
the structure of sometable to parse it.

>   This undermines the case that binary parsing requires special
> knowledge IMO, UDT might in fact be the main long term target...I
> could see scenarios where java classes might be glued directly to
> postgres tables by the driver...this would be a lot more efficient
> than using json which is how everyone does it today.  Then again,
> maybe *I* might be overthinking this.

Wouldn't that only work if someone is doing a "SELECT *"?

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Thu, 2023-03-23 at 15:37 -0400, Dave Cramer wrote:
> So where do we go from here ?
>
> I can implement using type names as well as OID's

My current thought is that you should use the protocol extension and
make type names work (in addition to OIDs) at least for fully-qualified
type names. I don't really like the GUC -- perhaps I could be convinced
it's OK, but until we find a problem with protocol extensions, it looks
like a protocol extension is the way to go here.

I like Peter's idea for some kind of global identifier, but we can do
that independently at a later time.

If search path works fine and we're all happy with it, we could also
support unqualified type names. It feels slightly off to me to use
search_path for something like that, though.

There's still the problem about the connection pools. pgbouncer could
consider the binary formats to be an important parameter like the
database name, where the connection pooler would not mingle connections
with different settings for binary_formats. That would work, but it
would be weird because if a new version of a driver adds new binary
format support, it could cause worse connection pooling performance
until all the other drivers also support that binary format. I'm not
sure if that's a major problem or not. Another idea would be for the
connection pooler to also have a binary_formats config, and it would do
some checking to make sure all connecting clients understand some
minimal set of binary formats, so that it could still mingle the
connections. Either way, I think this is solvable by the connection
pooler.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:

Dave Cramer


On Sat, 25 Mar 2023 at 19:06, Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2023-03-23 at 15:37 -0400, Dave Cramer wrote:
> So where do we go from here ?
>
> I can implement using type names as well as OID's

My current thought is that you should use the protocol extension and
make type names work (in addition to OIDs) at least for fully-qualified
type names. I don't really like the GUC -- perhaps I could be convinced
it's OK, but until we find a problem with protocol extensions, it looks
like a protocol extension is the way to go here.

Well as I said if I use any external pool that shares connections with multiple clients, some of which may not know how to decode binary data then we have to have a way to set the binary format after the connection is established. I did float the idea of using the cancel key as a unique identifier that passed with the parameter would allow setting the parameter after connection establishmen.

I like Peter's idea for some kind of global identifier, but we can do
that independently at a later time.

If search path works fine and we're all happy with it, we could also
support unqualified type names. It feels slightly off to me to use
search_path for something like that, though.

There's still the problem about the connection pools. pgbouncer could
consider the binary formats to be an important parameter like the
database name, where the connection pooler would not mingle connections
with different settings for binary_formats. That would work, but it
would be weird because if a new version of a driver adds new binary
format support, it could cause worse connection pooling performance
until all the other drivers also support that binary format. I'm not
sure if that's a major problem or not. Another idea would be for the
connection pooler to also have a binary_formats config, and it would do
some checking to make sure all connecting clients understand some
minimal set of binary formats, so that it could still mingle the
connections. Either way, I think this is solvable by the connection
pooler.

Well that means that connection poolers have to all be fixed. There are more than just pgbouncer.
Seems rather harsh that a new feature breaks a connection pooler or makes the pooler unusable.

Dave

Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Sat, 2023-03-25 at 19:58 -0400, Dave Cramer wrote:
> Well that means that connection poolers have to all be fixed. There
> are more than just pgbouncer.
> Seems rather harsh that a new feature breaks a connection pooler or
> makes the pooler unusable.

Would it actually break connection poolers as they are now? Or would,
for example, pgbouncer just not set the binary_format parameter on the
outbound connection, and therefore just return everything as text until
they add support to configure it?

I'll admit that GUCs wouldn't have this problem at all, but it would be
nice to know how much of a problem it is before we decide between a
protocol extension and a GUC.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Sun, 26 Mar 2023 at 14:00, Jeff Davis <pgsql@j-davis.com> wrote:
On Sat, 2023-03-25 at 19:58 -0400, Dave Cramer wrote:
> Well that means that connection poolers have to all be fixed. There
> are more than just pgbouncer.
> Seems rather harsh that a new feature breaks a connection pooler or
> makes the pooler unusable.

Would it actually break connection poolers as they are now? Or would,
for example, pgbouncer just not set the binary_format parameter on the
outbound connection, and therefore just return everything as text until
they add support to configure it?

Well I was presuming that they would just pass the parameter on. If they didn't then binary_format won't work with them. In the case that they do pass it on, then DISCARD_ALL will reset it and future borrows of the connection will have no way to set it again; effectively making this a one time setting.

So while it may not break them it doesn't seem like it is a very useful solution.

Dave

Re: Request for comment on setting binary format output per session

From
Tom Lane
Date:
Dave Cramer <davecramer@gmail.com> writes:
> Well I was presuming that they would just pass the parameter on. If they
> didn't then binary_format won't work with them. In the case that they do
> pass it on, then DISCARD_ALL will reset it and future borrows of the
> connection will have no way to set it again; effectively making this a one
> time setting.

I would not expect DISCARD ALL to reset a session-level property.

            regards, tom lane



Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:

Dave Cramer


On Sun, 26 Mar 2023 at 18:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dave Cramer <davecramer@gmail.com> writes:
> Well I was presuming that they would just pass the parameter on. If they
> didn't then binary_format won't work with them. In the case that they do
> pass it on, then DISCARD_ALL will reset it and future borrows of the
> connection will have no way to set it again; effectively making this a one
> time setting.

I would not expect DISCARD ALL to reset a session-level property.

Well if we can't reset it with DISCARD ALL how would that work with pgbouncer, or any pool for that matter since it doesn't know which client asked for which (if any) OID's to be binary. 

Dave

Re: Request for comment on setting binary format output per session

From
Tom Lane
Date:
Dave Cramer <davecramer@gmail.com> writes:
> On Sun, 26 Mar 2023 at 18:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I would not expect DISCARD ALL to reset a session-level property.

> Well if we can't reset it with DISCARD ALL how would that work with
> pgbouncer, or any pool for that matter since it doesn't know which client
> asked for which (if any) OID's to be binary.

Well, it'd need to know that, just like it already needs to know
which clients asked for which database or which login role.
Having DISCARD ALL reset those session properties is obviously silly.

The way I'm imagining this working is that it fits into the framework
for protocol options (cf commits ae65f6066 and bbf9c282c), whereby
the client and server negotiate whether they can handle this feature.
A non-updated pooler would act like a server that doesn't know the
feature, and the client would have to fall back to not using it,
just as it would with an older server.

I doubt that this would crimp a pooler's freedom of action very much.
In any given environment there will probably be only a few values of
the set-of-okay-types in use.

            regards, tom lane



Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Sun, 26 Mar 2023 at 21:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dave Cramer <davecramer@gmail.com> writes:
> On Sun, 26 Mar 2023 at 18:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I would not expect DISCARD ALL to reset a session-level property.

> Well if we can't reset it with DISCARD ALL how would that work with
> pgbouncer, or any pool for that matter since it doesn't know which client
> asked for which (if any) OID's to be binary.

Well, it'd need to know that, just like it already needs to know
which clients asked for which database or which login role.

OK, IIUC what you are proposing here is that there would be a separate pool for 
database, user, and OIDs. This doesn't seem too flexible. For instance if I create a UDT and then want it to be returned 
as binary then I have to reconfigure the pool to be able to accept a new list of OID's.

Am I mis-understanding how this would potentially work?

Dave

Re: Request for comment on setting binary format output per session

From
"Gregory Stark (as CFM)"
Date:
FYI I attached this thread to
https://commitfest.postgresql.org/42/3777 which I believe is the same
issue. I mistakenly had this listed as a CF entry with no discussion
for a long time due to that missing link.


--
Gregory Stark
As Commitfest Manager



Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Tue, 2023-03-28 at 10:22 -0400, Dave Cramer wrote:
> OK, IIUC what you are proposing here is that there would be a
> separate pool for 
> database, user, and OIDs. This doesn't seem too flexible. For
> instance if I create a UDT and then want it to be returned 
> as binary then I have to reconfigure the pool to be able to accept a
> new list of OID's.

There are two ways that I could imagine the connection pool working:

1. Accept whatever clients connect, and pass along the binary_formats
setting to the outbound (server) connection. The downside here is that
if you have many different clients (or different versions) that have
different binary_formats settings, then it creates too many pools and
doesn't share well enough.

2. Some kind of configuration setting (or maybe it can be done
automatically) that organizes based on a common subset of binary
formats that many clients can understand.

These can evolve once the protocol extension is in place.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Tue, 28 Mar 2023 at 19:01, Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2023-03-28 at 10:22 -0400, Dave Cramer wrote:
> OK, IIUC what you are proposing here is that there would be a
> separate pool for 
> database, user, and OIDs. This doesn't seem too flexible. For
> instance if I create a UDT and then want it to be returned 
> as binary then I have to reconfigure the pool to be able to accept a
> new list of OID's.

There are two ways that I could imagine the connection pool working:

1. Accept whatever clients connect, and pass along the binary_formats
setting to the outbound (server) connection. The downside here is that
if you have many different clients (or different versions) that have
different binary_formats settings, then it creates too many pools and
doesn't share well enough.
As I understand it, pools create connections before the client actually requests the connection.
This would necessitate having the binary format information in the configuration file.

I'm starting to wonder about the utility of the protocol extension mechanism? 
It would seem that you would need to add the new feature into all pools ? 

2. Some kind of configuration setting (or maybe it can be done
automatically) that organizes based on a common subset of binary
formats that many clients can understand.
 
Well that would bring us back to just providing a list of OID's of well known types as I first proposed instead of trying to accomodate UDT's 

Dave
 

Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Wed, 2023-03-29 at 08:17 -0400, Dave Cramer wrote:
> I'm starting to wonder about the utility of the protocol extension
> mechanism? 

I'm starting to agree that the awkwardness around connection poolers is
a problem. If that's the case, I'm wondering if the protocol extensions
will ever be useful.

What I'm worried about with the GUC is that an attacker may be able to
start with a SQL injection attack, and then use the GUC to confuse a
client in a way that further escalates privileges. Is that a reasonable
fear?

A couple ideas to mitigate that concern with the GUC:

1. Fix our own clients, like psql, to check for binary data they can't
process.

2. Communicate (after the patch is committed) with client library
maintainers to see that they behave sanely when they receive binary
data unexpectedly.

3. Require that the binary_formats parameter is set very early, either
during connection startup or right after a DISCARD statement. A bit of
a hack, but may help. Not sure it really solves my security concern
because they'd just need to modify their SQL injection to also include
a DISCARD statement.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Merlin Moncure
Date:
On Wed, Mar 29, 2023 at 11:04 AM Jeff Davis <pgsql@j-davis.com> wrote:
 I'm not clear on what proposal you are making and/or endorsing?

ha -- was just backing up dave's GUC idea.
 
1. Fix our own clients, like psql, to check for binary data they can't
process.

This ought to be impossible IMO.  All libpq routines except PQexec have an explicit expectation on format (via resultformat parameter) that should not be overridden.  PQexec ought to be explicitly documented and wired to only request text format data.

resultfomat can be extended now or later to allow participating clients to receive GUC configured format.  I do not think that libpq's result format being able to be overridden by GUC is a good idea at all, the library has to to participate, and I think can be made to so so without adjusting the interface (say, by resultFormat = 3).  Similarly, in JDBC world, it ought to be up to the driver to determine when it want the server to flex wire formats but must be able to override the server's decision.

merlin

Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Thu, 2023-03-30 at 07:06 -0500, Merlin Moncure wrote:
> This ought to be impossible IMO.  All libpq routines except PQexec
> have an explicit expectation on format (via resultformat parameter)
> that should not be overridden.  PQexec ought to be explicitly
> documented and wired to only request text format data.

Right now it's clearly documented[1] which formats will be returned for
a given Bind message. That can be seen as the root of the problem with
psql -- we are breaking the protocol by returning binary when psql can
rightfully expect text.

It is a minor break, because something needed to send the "SET
binary_formats='...'" command, but the protocol docs would need to be
updated for sure.

> participating clients to receive GUC configured format.  I do not
> think that libpq's result format being able to be overridden by GUC
> is a good idea at all, the library has to to participate, and I
> think can be made to so so without adjusting the interface (say, by
> resultFormat = 3).

Interesting idea. I suppose you'd need to specify 3 for all result
columns? That is a protocol change, but wouldn't "break" older clients.
The newer clients would need to make sure that they're connecting to
v16+, so using the protocol version alone wouldn't be enough. Hmm.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/docs/current/protocol-message-formats.html#PROTOCOL-MESSAGE-FORMATS-BIND




Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:

Dave Cramer


On Thu, 30 Mar 2023 at 15:40, Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2023-03-30 at 07:06 -0500, Merlin Moncure wrote:
> This ought to be impossible IMO.  All libpq routines except PQexec
> have an explicit expectation on format (via resultformat parameter)
> that should not be overridden.  PQexec ought to be explicitly
> documented and wired to only request text format data.

Right now it's clearly documented[1] which formats will be returned for
a given Bind message. That can be seen as the root of the problem with
psql -- we are breaking the protocol by returning binary when psql can
rightfully expect text.

It is a minor break, because something needed to send the "SET
binary_formats='...'" command, but the protocol docs would need to be
updated for sure.

> participating clients to receive GUC configured format.  I do not
> think that libpq's result format being able to be overridden by GUC
> is a good idea at all, the library has to to participate, and I
> think can be made to so so without adjusting the interface (say, by
> resultFormat = 3).

Interesting idea. I suppose you'd need to specify 3 for all result
columns? That is a protocol change, but wouldn't "break" older clients.
The newer clients would need to make sure that they're connecting to
v16+, so using the protocol version alone wouldn't be enough. Hmm.

I'm confused. How does using resultFormat=3 change anything ?
Dave 

Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:
> participating clients to receive GUC configured format.  I do not
> think that libpq's result format being able to be overridden by GUC
> is a good idea at all, the library has to to participate, and I
> think can be made to so so without adjusting the interface (say, by
> resultFormat = 3).

Interesting idea. I suppose you'd need to specify 3 for all result
columns? That is a protocol change, but wouldn't "break" older clients.
The newer clients would need to make sure that they're connecting to
v16+, so using the protocol version alone wouldn't be enough. Hmm.


So this only works with extended protocol and not simple queries. 
Again, as Peter mentioned it's already easy enough to confuse psql using binary cursors so 
it makes sense to fix psql either way.

If you use resultFormat (3) I think you'd still end up doing the Describe (which we are trying to avoid) to make sure you could receive all the columns in binary.

Dave 

Re: Request for comment on setting binary format output per session

From
Merlin Moncure
Date:
On Mon, Apr 3, 2023 at 11:29 AM Dave Cramer <davecramer@gmail.com> wrote:
> participating clients to receive GUC configured format.  I do not
> think that libpq's result format being able to be overridden by GUC
> is a good idea at all, the library has to to participate, and I
> think can be made to so so without adjusting the interface (say, by
> resultFormat = 3).

Interesting idea. I suppose you'd need to specify 3 for all result
columns? That is a protocol change, but wouldn't "break" older clients.
The newer clients would need to make sure that they're connecting to
v16+, so using the protocol version alone wouldn't be enough. Hmm.


So this only works with extended protocol and not simple queries. 
Again, as Peter mentioned it's already easy enough to confuse psql using binary cursors so 
it makes sense to fix psql either way.

If you use resultFormat (3) I think you'd still end up doing the Describe (which we are trying to avoid) to make sure you could receive all the columns in binary.

Can you elaborate on why Describe would have to be passed?  Agreed that would be a dealbreaker if so.   If you pass a 3 sending it in, the you'd be checking PQfformat on data coming back as 0/1, or at least that's be smart since you're indicating the server is able to address the format.   This addresses the concern libpq clients currently passing resultfomat zero could not have that decision overridden by the server which I also think is a dealbreaker.  There might be other reasons why a describe message may be forced however.

merlin


Re: Request for comment on setting binary format output per session

From
Robert Haas
Date:
On Wed, Mar 29, 2023 at 12:04 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I'm starting to agree that the awkwardness around connection poolers is
> a problem. If that's the case, I'm wondering if the protocol extensions
> will ever be useful.

In the case at hand, it seems like the problem could easily be solved
by allowing the property to be changed after connection startup.
Instead of using the protocol extension mechanism to negotiate a
specific value for the property, we can use it to negotiate about
whether or not some new protocol message that can be used to change
that property is supported. If it is, then a new value can be set
whenever, and a connection pooler can switch the active value when it
associates the server's session with a different client session.

Alternatively, the protocol extension mechanism can be used to
negotiate an initial value for the property, with the understanding
that if any initial value is negotiated, that also implies that the
server will accept some new protocol message later in the session to
change the value. If no initial value is negotiated, the client can't
assume that the server even knows about that property and can't try to
change it.

Backing up a level, the purpose of the protocol extension mechanism is
to help us agree on the communication protocol -- that is, the set of
messages that we can send and receive on a certain connection. The
question for the protocol extension mechanism isn't "which types
should always be sent in binary format?" but "would it be ok if I
wanted you to always send certain types in binary format?", with the
idea that if the answer is yes it will still be necessary for the
client to let the server know which ones, but that's easy to do if
we've agreed on the concept that it's OK for me to ask the server for
that. And if it's OK for me to ask that once, it should also be OK for
me to later ask for something different.

This could, perhaps, be made even more general yet. We could define a
concept of "protocol session parameters" and make "which types are
always sent in binary format?" one of those parameters. So then the
conversation could go like this:

C: Hello! Do you know about protocol session parameters?
S: Why yes, actually I do.
C: Cool. I would like to set the protocol session parameter
types_always_in_binary_format=timestamptz. Does that work for you?
S: Sure thing! (or alternatively: Sadly, I've not heard of that
particular protocol session parameter, sorry to disappoint.)

The reason why I suggest this is that I feel like there could be a
bunch of things like this. The set of things to be sent in binary
format feels like a property of the wire protocol, not something
SQL-level that should be configured via SET.  Clients, drivers, and
connection poolers aren't going to want to have to worry about some
user screwing up the session by changing that property inside of a
function or procedure or whatever. But there could also be a bunch of
different things like this that we want to support. For example, one
that would be really useful for connection poolers is the session
user. The pooler would like to change the session user whenever the
connection is changed to talk to a different client, and it would like
that to happen in a way that can't be reversed by issuing any SQL
command. I expect in time we may find a bunch of others.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Mon, 2023-04-17 at 12:17 -0400, Robert Haas wrote:
> In the case at hand, it seems like the problem could easily be solved
> by allowing the property to be changed after connection startup.
> Instead of using the protocol extension mechanism to negotiate a
> specific value for the property, we can use it to negotiate about
> whether or not some new protocol message that can be used to change
> that property is supported.

...

> Backing up a level, the purpose of the protocol extension mechanism
> is
> to help us agree on the communication protocol

Thank you, that seems like a better approach to me.

It involves introducing new message types which I didn't really
consider. We might want to be careful about how many kinds of messages
we introduce so that the one-letter codes are still managable. I've
been frustrated in the past that we don't have separate symbols in the
source code to refer to the message types (we just use literal 'S',
etc.).

Maybe we should have a single new message type 'x' to indicate a
message for a protocol extension, and then have a sub-message-type? It
might make error handling better for unexpected messages.

Also, is there any reason we'd want this concept to integrate with
connection strings/URIs? Probably not a good idea to turn on features
that way, but perhaps we'd want to support disabling protocol
extensions from a URI? This could be used to restrict authentication
methods or sources of authentication information.

> The reason why I suggest this is that I feel like there could be a
> bunch of things like this.

What's the trade-off between having one protocol extension (e.g.
_pq_protocol_session_parameters) that tries to work for multiple cases
(e.g. binary_formats and session_user) vs just having two protocol
extensions (_pq_set_session_user and _pq_set_binary_formats)?


> For example, one
> that would be really useful for connection poolers is the session
> user. The pooler would like to change the session user whenever the
> connection is changed to talk to a different client, and it would
> like
> that to happen in a way that can't be reversed by issuing any SQL
> command.

That sounds valuable to me whether we generalize with "protocol session
parameters" or not.

Regards,
    Jeff Davis




Re: Request for comment on setting binary format output per session

From
Robert Haas
Date:
On Mon, Apr 17, 2023 at 1:55 PM Jeff Davis <pgsql@j-davis.com> wrote:
> It involves introducing new message types which I didn't really
> consider. We might want to be careful about how many kinds of messages
> we introduce so that the one-letter codes are still managable. I've
> been frustrated in the past that we don't have separate symbols in the
> source code to refer to the message types (we just use literal 'S',
> etc.).

Right. That was part of the thinking behind the protocol session
parameter thing I was throwing out there.

> Maybe we should have a single new message type 'x' to indicate a
> message for a protocol extension, and then have a sub-message-type? It
> might make error handling better for unexpected messages.

I'm somewhat skeptical that we want every protocol extension in the
universe to use a single message type. I think that could lead to
munging together all sorts of messages that are actually really
different from each other. On the other hand, in a certain sense, we
don't really have a choice. The type byte for a protocol message can
only take on one of 256 possible values, and some of those are already
used, so if we add a bunch of stuff to the protocol, we're eventually
going to run short of byte values. In fact, even if we said, well, 'x'
means that it's an extended message and then there's a type byte as
the first byte of the payload, that only doubles the number of
possible message types before we run out of room, and maybe there's a
world where we eventually have thousands upon thousands of message
types. We'd need a longer type code than 1 byte to really get out from
under the problem, so if we add a message like what you're talking
about, we should probably do that.

But I don't know if we need to be too paranoid about this. For
example, suppose we were to agree on adding protocol session
parameters and make this the first one. To do that, suppose we add two
new messages to the protocol, ProtocolSessionParameterSet and
ProtocolSessionParameterReponse. And suppose we just pick single
letter codes for those, like we have right now. How much use would
such a mechanism get? It seems possible that we'd add as many as 5 or
10 such parameters in the next half-decade, but they'd all only need
those two new message types. We'd only need a different message type
if somebody wanted to customize something about the protocol that
didn't fit into that model, and that might happen, but I bet it
wouldn't happen that often. I feel like if we're careful to make sure
that the new protocol messages that we add are carefully designed to
be reasonably general, we'd add them very slowly. It seems very
possible that we could go a century or more without running out of
possible values. We could then decide to leave it to future hackers to
decide what to do about it when the remaining bit space starts to get
tight.

The point of this thought experiment is to help us estimate how
careful we need to be. I think that if we added messages with 1-byte
type codes for things as specific as SetTypesWithBinaryOutputAlways,
there would be a significant chance that we would run out of 1-byte
type codes while some of us are still around to be sad about it. Maybe
it wouldn't happen, but it seems risky. Furthermore, such messages are
FAR more specific than existing protocol messages like Query or
Execute or ErrorResponse which cover HUGE amounts of territory. I
think we need to be a level of abstraction removed. Something like
ProtocolSessionParameterSet seems good enough to me - I think we'll
run out of codes like that soon enough to matter. I don't think it
would be wrong to take that as far as you propose here, and just add
one new message type to cover all future developments, but it feels
like it might not really help anyone. A lot of code would probably
have to drill down and look at what type of extended message it was
before deciding what to do with it, which seems a bit annoying.

One thing to keep in mind is that it's possible that in the future we
might want protocol extensions for things that are very
performance-sensitive. For instance, I think it might be advantageous
to have something that is intermediate between the simple and extended
query protocol. The simple query protocol doesn't let you set
parameters, but the extended query protocol requires you to send a
whole series of messages (Parse-Bind-Describe-Execute-Sync) which
doesn't seem to be particularly efficient for either the client or the
server. I think it would be nice to have a way to send a single
message that says "run this query with these parameters." But, if we
had that, some clients might use it Really A Lot. They would therefore
want the message to be as short as possible, which means that using up
a single byte code for it would probably be desirable. On the other
hand, the kinds of things we're talking about here really shouldn't be
subjected to that level of use, and so if for this purpose we pick a
message format that is longer and wordier and more extensible, that
should be fine. If your connection pooler is switching your connection
back and forth between a bunch of end clients that all have different
ideas about binary format parameters, it should be running at least
one query after each such change, and probably more than that. And
that query probably has some results, so a few extra bytes of overhead
in the message format shouldn't cost much even in fairly extreme
cases.

> Also, is there any reason we'd want this concept to integrate with
> connection strings/URIs? Probably not a good idea to turn on features
> that way, but perhaps we'd want to support disabling protocol
> extensions from a URI? This could be used to restrict authentication
> methods or sources of authentication information.

I don't really see why the connection string/URI has any business
disabling anything. It might require something to be enabled, though.
For instance, if we added a protocol extension to encrypt all result
sets returned to the client using rot13, we might also add a
connection parameter to control that behavior. If the user requested
that behavior using a connection parameter, libpq might then try to
enable it via a protocol extension -- it would have to, else it would
otherwise be unable to deliver the requested behavior. But the user
shouldn't get to say "please enable the protocol extension that would
enable you to turn on rot13 even though I don't actually want to use
rot13" nor should they be able to say "please give me rot13 without
using the protocol extension that would let you ask for that". Those
requests aren't sensible. The connection parameter interface is a way
for the user to request certain behaviors that they might want, and
then it's up to libpq, or some other connector, to decide what needs
to happen at a protocol level to implement those requests.

And that might change over time. We could introduce a new major
protocol version (v4!) or somebody could eventually say "hey, these
six protocol extensions are now universally supported by literally
every bit of code that we can find that speaks the PG wire protocol,
let's just start sending all these messages unconditionally and the
counterparty can error out if they're a fossil from the Jurassic era".
It's kind of hard to imagine that happening from where we are now, but
times change.

> > The reason why I suggest this is that I feel like there could be a
> > bunch of things like this.
>
> What's the trade-off between having one protocol extension (e.g.
> _pq_protocol_session_parameters) that tries to work for multiple cases
> (e.g. binary_formats and session_user) vs just having two protocol
> extensions (_pq_set_session_user and _pq_set_binary_formats)?

Well, it seems related to the message types issue mentioned above.
Presumably if we were going to have one set of message types for both
features, we'd want one protocol extension to enable that set of
message types. And if we were going to have separate message types for
each feature, we'd want separate protocol extensions to enable them.
There are probably other ways it could work, but that seems like the
most natural idea.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Request for comment on setting binary format output per session

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Apr 17, 2023 at 1:55 PM Jeff Davis <pgsql@j-davis.com> wrote:
>> Maybe we should have a single new message type 'x' to indicate a
>> message for a protocol extension, and then have a sub-message-type? It
>> might make error handling better for unexpected messages.

> ...
> The point of this thought experiment is to help us estimate how
> careful we need to be.

I tend to agree with the proposition that we aren't going to add new
message types very often, as long as we're careful to make them general
purpose.  Don't forget that adding a new message type isn't just a matter
of writing some spec text --- there has to be code backing it up.  We
will never introduce thousands of new message types, or if we do,
somebody factored it wrong and put data into the type code.

The fact that we've gotten away without adding *any* new message types
for about twenty years suggests to me that the growth rate isn't such
that we need sub-message-types yet.  I'd keep the structure the same
until such time as we can't choose a plausible code value for a new
message, and then maybe add the "x-and-subtype" convention Jeff suggests.

            regards, tom lane



Re: Request for comment on setting binary format output per session

From
Robert Haas
Date:
On Mon, Apr 17, 2023 at 4:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The fact that we've gotten away without adding *any* new message types
> for about twenty years suggests to me that the growth rate isn't such
> that we need sub-message-types yet.  I'd keep the structure the same
> until such time as we can't choose a plausible code value for a new
> message, and then maybe add the "x-and-subtype" convention Jeff suggests.

One thing I think we should do in this area is introduce #defines for
all the message type codes and use those instead of having hard-coded
constants everywhere.

I'm not brave enough to tackle that day, but the only reason the
current situation isn't a disaster is because every place we use e.g.
'Z' we generally also have a comment that mentions ReadyForQuery. If
it weren't for that, this would be pretty un-greppable.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Request for comment on setting binary format output per session

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> One thing I think we should do in this area is introduce #defines for
> all the message type codes and use those instead of having hard-coded
> constants everywhere.

+1, but I wonder where we should put those exactly.  My first thought
was postgres_ext.h, but the charter for that is

 *     This file contains declarations of things that are visible everywhere
 *  in PostgreSQL *and* are visible to clients of frontend interface libraries.
 *  For example, the Oid type is part of the API of libpq and other libraries.

so picayune details of the wire protocol probably don't belong there.
Maybe we need a new header concerned with the wire protocol?

            regards, tom lane



Re: Request for comment on setting binary format output per session

From
Robert Haas
Date:
On Tue, Apr 18, 2023 at 11:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > One thing I think we should do in this area is introduce #defines for
> > all the message type codes and use those instead of having hard-coded
> > constants everywhere.
>
> +1, but I wonder where we should put those exactly.  My first thought
> was postgres_ext.h, but the charter for that is
>
>  *     This file contains declarations of things that are visible everywhere
>  *  in PostgreSQL *and* are visible to clients of frontend interface libraries.
>  *  For example, the Oid type is part of the API of libpq and other libraries.
>
> so picayune details of the wire protocol probably don't belong there.
> Maybe we need a new header concerned with the wire protocol?

Yeah. I sort of thought maybe one of the files in src/include/libpq
would be the right place, but it doesn't look like it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Tue, 18 Apr 2023 at 12:24, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 18, 2023 at 11:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > One thing I think we should do in this area is introduce #defines for
> > all the message type codes and use those instead of having hard-coded
> > constants everywhere.
>
> +1, but I wonder where we should put those exactly.  My first thought
> was postgres_ext.h, but the charter for that is
>
>  *     This file contains declarations of things that are visible everywhere
>  *  in PostgreSQL *and* are visible to clients of frontend interface libraries.
>  *  For example, the Oid type is part of the API of libpq and other libraries.
>
> so picayune details of the wire protocol probably don't belong there.
> Maybe we need a new header concerned with the wire protocol?

Yeah. I sort of thought maybe one of the files in src/include/libpq
would be the right place, but it doesn't look like it.

If we at least created the defines and replaced occurrences with the same, then we can litigate where to put them later.

I think I'd prefer this in a different patch, but I'd be willing to take a run at it.

Dave

Re: Request for comment on setting binary format output per session

From
Greg Stark
Date:
On Mon, 17 Apr 2023 at 16:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I tend to agree with the proposition that we aren't going to add new
> message types very often, as long as we're careful to make them general
> purpose.  Don't forget that adding a new message type isn't just a matter
> of writing some spec text --- there has to be code backing it up.  We
> will never introduce thousands of new message types, or if we do,
> somebody factored it wrong and put data into the type code.

Well the way I understood Robert's proposal would be that you would
set a protocol option which could be some name like
SuperDuperExtension and then later send an extended message like X
SuperDuper Extension ...

The point being not so much that it saves on message types but that it
becomes possible for the wire protocol code to recognize the message
type and know which extension's code to call back to. Presumably a
callback was registered when the option was negotiated.

> The fact that we've gotten away without adding *any* new message types
> for about twenty years suggests to me that the growth rate isn't such
> that we need sub-message-types yet.  I'd keep the structure the same
> until such time as we can't choose a plausible code value for a new
> message, and then maybe add the "x-and-subtype" convention Jeff suggests.

Fwiw I've had at least two miniprojects that would eventually have led
to protocol extensions. Like most of my projects they're not finished
but one day...

Progress reporting on queries in progress -- I had things hacked to
send the progress report in an elog but eventually it would have made
sense to put it in a dedicated message type that the client would know
the structure of the content of.

Distributed tracing -- to pass the trace span id for each query and
any other baggage. Currently people either stuff it in application_id
or in SQL comments but they're both pretty awful.

-- 
greg



Re: Request for comment on setting binary format output per session

From
Robert Haas
Date:
On Tue, Apr 18, 2023 at 3:54 PM Greg Stark <stark@mit.edu> wrote:
> Well the way I understood Robert's proposal would be that you would
> set a protocol option which could be some name like
> SuperDuperExtension and then later send an extended message like X
> SuperDuper Extension ...
>
> The point being not so much that it saves on message types but that it
> becomes possible for the wire protocol code to recognize the message
> type and know which extension's code to call back to. Presumably a
> callback was registered when the option was negotiated.

That's not what I was talking about. I meant extending the protocol in
core, and dealing with version differences between the client and the
server, not loading extensions that extend the protocol. Such a thing
could possibly be done, but it seems fairly tricky to make useful.
Defining the message format is just a small part of the problem. If
for example the message is one to be sent from server to client, you
need a server side hook that's called at the right point to allow you
to inject those messages, and then you need something on the libpq
side to, I guess, intercept those messages and call a user-defined
handler when they show up. It might make sense for things like
progress reporting and tracing to piggyback on e.g. NoticeResponse,
which already has existing libpq-side handling, rather than inventing
something altogether new. Or if we are going to invent something new,
say because we want to send structured data rather than a string, then
we invent one new message type for that which can be used by multiple
facilities e.g. StructuredNoticeResponse with a content-type (e.g.
json) and a payload.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:



On Tue, 18 Apr 2023 at 12:31, Dave Cramer <davecramer@gmail.com> wrote:


On Tue, 18 Apr 2023 at 12:24, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 18, 2023 at 11:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > One thing I think we should do in this area is introduce #defines for
> > all the message type codes and use those instead of having hard-coded
> > constants everywhere.
>
> +1, but I wonder where we should put those exactly.  My first thought
> was postgres_ext.h, but the charter for that is
>
>  *     This file contains declarations of things that are visible everywhere
>  *  in PostgreSQL *and* are visible to clients of frontend interface libraries.
>  *  For example, the Oid type is part of the API of libpq and other libraries.
>
> so picayune details of the wire protocol probably don't belong there.
> Maybe we need a new header concerned with the wire protocol?

Yeah. I sort of thought maybe one of the files in src/include/libpq
would be the right place, but it doesn't look like it.

If we at least created the defines and replaced occurrences with the same, then we can litigate where to put them later.

I think I'd prefer this in a different patch, but I'd be willing to take a run at it.

As promised here is a patch with defines for all of the protocol messages.
I created a protocol.h file and put it in src/includes
I'm fairly sure that some of the names I used may need to be changed but the grunt work of finding and replacing everything is done.

Dave Cramer 

Dave
Attachment

Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:



Backing up a level, the purpose of the protocol extension mechanism is
to help us agree on the communication protocol -- that is, the set of
messages that we can send and receive on a certain connection. The
question for the protocol extension mechanism isn't "which types
should always be sent in binary format?" but "would it be ok if I
wanted you to always send certain types in binary format?", with the
idea that if the answer is yes it will still be necessary for the
client to let the server know which ones, but that's easy to do if
we've agreed on the concept that it's OK for me to ask the server for
that. And if it's OK for me to ask that once, it should also be OK for
me to later ask for something different.

This could, perhaps, be made even more general yet. We could define a
concept of "protocol session parameters" and make "which types are
always sent in binary format?" one of those parameters. So then the
conversation could go like this:

C: Hello! Do you know about protocol session parameters?
S: Why yes, actually I do.
C: Cool. I would like to set the protocol session parameter
types_always_in_binary_format=timestamptz. Does that work for you?
S: Sure thing! (or alternatively: Sadly, I've not heard of that
particular protocol session parameter, sorry to disappoint.)

The reason why I suggest this is that I feel like there could be a
bunch of things like this. The set of things to be sent in binary
format feels like a property of the wire protocol, not something
SQL-level that should be configured via SET.  Clients, drivers, and
connection poolers aren't going to want to have to worry about some
user screwing up the session by changing that property inside of a
function or procedure or whatever. But there could also be a bunch of
different things like this that we want to support. For example, one
that would be really useful for connection poolers is the session
user. The pooler would like to change the session user whenever the
connection is changed to talk to a different client, and it would like
that to happen in a way that can't be reversed by issuing any SQL
command. I expect in time we may find a bunch of others.



Ok, this looks like the way to go. I have some questions about implementation.

Client sends _pq_.format_binary 
server doesn't object so now the client implicitly knows that they can send a new protocol message.
At this point the client sends some new message 'F" for example, with OID's the client wants in binary for the remainder of the session.

Ideally, I'd like to avoid this second message. Is the above correct ?

Dave

Re: Request for comment on setting binary format output per session

From
Merlin Moncure
Date:


On Thu, Apr 20, 2023 at 2:52 PM Dave Cramer <davecramer@gmail.com> wrote:

As promised here is a patch with defines for all of the protocol messages.
I created a protocol.h file and put it in src/includes
I'm fairly sure that some of the names I used may need to be changed but the grunt work of finding and replacing everything is done.

In many cases, converting inline character to macro eliminates the need for inline comment, e.g.:
+ case SIMPLE_QUERY: /* simple query */

...that's more work obviously, do you agree and if so would you like some help going through that?

merlin

Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:



On Mon, 24 Apr 2023 at 19:18, Merlin Moncure <mmoncure@gmail.com> wrote:


On Thu, Apr 20, 2023 at 2:52 PM Dave Cramer <davecramer@gmail.com> wrote:

As promised here is a patch with defines for all of the protocol messages.
I created a protocol.h file and put it in src/includes
I'm fairly sure that some of the names I used may need to be changed but the grunt work of finding and replacing everything is done.

In many cases, converting inline character to macro eliminates the need for inline comment, e.g.:
+ case SIMPLE_QUERY: /* simple query */

...that's more work obviously, do you agree and if so would you like some help going through that?

I certainly agree. I left them there mostly for reviewers. I expected some minor adjustments to names of the macro's

So if you have suggestions I'll make changes. 

I'll remove the comments if they are no longer necessary.

Dave 

Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Tue, 25 Apr 2023 at 07:26, Dave Cramer <davecramer@gmail.com> wrote:



On Mon, 24 Apr 2023 at 19:18, Merlin Moncure <mmoncure@gmail.com> wrote:


On Thu, Apr 20, 2023 at 2:52 PM Dave Cramer <davecramer@gmail.com> wrote:

As promised here is a patch with defines for all of the protocol messages.
I created a protocol.h file and put it in src/includes
I'm fairly sure that some of the names I used may need to be changed but the grunt work of finding and replacing everything is done.

In many cases, converting inline character to macro eliminates the need for inline comment, e.g.:
+ case SIMPLE_QUERY: /* simple query */

...that's more work obviously, do you agree and if so would you like some help going through that?

I certainly agree. I left them there mostly for reviewers. I expected some minor adjustments to names of the macro's

So if you have suggestions I'll make changes. 

I'll remove the comments if they are no longer necessary.

Patch attached with comments removed
 

Dave 
Attachment

Re: Request for comment on setting binary format output per session

From
Daniel Gustafsson
Date:
> On 25 Apr 2023, at 16:47, Dave Cramer <davecramer@gmail.com> wrote:

> Patch attached with comments removed

This patch no longer applies, please submit a rebased version on top of HEAD.

--
Daniel Gustafsson




Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:

Dave Cramer


On Mon, 10 Jul 2023 at 03:56, Daniel Gustafsson <daniel@yesql.se> wrote:
> On 25 Apr 2023, at 16:47, Dave Cramer <davecramer@gmail.com> wrote:

> Patch attached with comments removed

This patch no longer applies, please submit a rebased version on top of HEAD.

Rebased see attached

 

--
Daniel Gustafsson

Attachment

Re: Request for comment on setting binary format output per session

From
Peter Eisentraut
Date:
On 31.07.23 18:27, Dave Cramer wrote:
> On Mon, 10 Jul 2023 at 03:56, Daniel Gustafsson <daniel@yesql.se 
> <mailto:daniel@yesql.se>> wrote:
> 
>      > On 25 Apr 2023, at 16:47, Dave Cramer <davecramer@gmail.com
>     <mailto:davecramer@gmail.com>> wrote:
> 
>      > Patch attached with comments removed
> 
>     This patch no longer applies, please submit a rebased version on top
>     of HEAD.
> 
> 
> Rebased see attached

I have studied this thread now.  It seems it has gone through the same 
progression with the same (non-)result as my original patch on the subject.

I have a few intermediate conclusions:

- Doing it with a GUC is challenging.  It's questionable layering to 
have the GUC system control protocol behavior.  It would allow weird 
behavior where a GUC set, maybe for a user or a database, would confuse, 
say, psql or pg_dump.  We probably should make some of those more robust 
in any case.  Also, handling of GUCs through connection poolers is a 
challenge.  It does work, but it's more like opt-in, and so can't be 
fully relied on for protocol correctness.

- Doing it with a session-level protocol-level setting is challenging. 
We currently don't have that kind of thing.  It's not clear how 
connection poolers would/should handle it.  Someone would have to work 
all this out before this could be used.

- In either case, there are issues like what if there is a connection 
pooler and types have different OIDs in different databases.  (Or, 
similarly, an extension is upgraded during the lifetime of a session and 
a type's OID changes.)  Also, maybe, what if types are in different 
schemas on different databases.

- We could avoid some of the session-state issues by doing this per 
request, like extending the Bind message somehow by appending the list 
of types to be sent in binary.  But the JDBC driver currently lists 24 
types for which it supports binary, so that would require adding 24*4=96 
bytes per request, which seems untenable.

I think intuitively, this facility ought to work like client_encoding. 
There, the client declares its capabilities, and the server has to 
format the output according to the client's capabilities.  That works, 
and it also works through connection poolers.  (It is a GUC.)  If we can 
model it like that as closely as possible, then we have a chance of 
getting it working reliably.  Notably, the value space for 
client_encoding is a globally known fixed list of strings.  We need to 
figure out what is the right way to globally identify types, like either 
by fully-qualified name, by base name, some combination, how does it 
work with extensions, or do we need a new mechanism like UUIDs.  I think 
that is something we need to work out, no matter which protocol 
mechanism we end up using.




Re: Request for comment on setting binary format output per session

From
Merlin Moncure
Date:
On Wed, Oct 4, 2023 at 9:17 AM Peter Eisentraut <peter@eisentraut.org> wrote: 
I think intuitively, this facility ought to work like client_encoding.
There, the client declares its capabilities, and the server has to
format the output according to the client's capabilities.  That works,
and it also works through connection poolers.  (It is a GUC.)  If we can
model it like that as closely as possible, then we have a chance of
getting it working reliably.  Notably, the value space for
client_encoding is a globally known fixed list of strings.  We need to
figure out what is the right way to globally identify types, like either
by fully-qualified name, by base name, some combination, how does it
work with extensions, or do we need a new mechanism like UUIDs.  I think
that is something we need to work out, no matter which protocol
mechanism we end up using.

 Fantastic write up.  

> globally known fixed list of strings
Are you suggesting that we would have a client/server negotiation such as, 'jdbc<version>', 'all', etc where that would identify which types are done which way?  If you did that, why would we need to promote names/uuid to permanent global space?

merlin





Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Wed, 4 Oct 2023 at 10:17, Peter Eisentraut <peter@eisentraut.org> wrote:
On 31.07.23 18:27, Dave Cramer wrote:
> On Mon, 10 Jul 2023 at 03:56, Daniel Gustafsson <daniel@yesql.se
> <mailto:daniel@yesql.se>> wrote:
>
>      > On 25 Apr 2023, at 16:47, Dave Cramer <davecramer@gmail.com
>     <mailto:davecramer@gmail.com>> wrote:
>
>      > Patch attached with comments removed
>
>     This patch no longer applies, please submit a rebased version on top
>     of HEAD.
>
>
> Rebased see attached

I have studied this thread now.  It seems it has gone through the same
progression with the same (non-)result as my original patch on the subject.

I have a few intermediate conclusions:

- Doing it with a GUC is challenging.  It's questionable layering to
have the GUC system control protocol behavior.  It would allow weird
behavior where a GUC set, maybe for a user or a database, would confuse,
say, psql or pg_dump.  We probably should make some of those more robust
in any case.  Also, handling of GUCs through connection poolers is a
challenge.  It does work, but it's more like opt-in, and so can't be
fully relied on for protocol correctness.

- Doing it with a session-level protocol-level setting is challenging.
We currently don't have that kind of thing.  It's not clear how
connection poolers would/should handle it.  Someone would have to work
all this out before this could be used.

- In either case, there are issues like what if there is a connection
pooler and types have different OIDs in different databases.  (Or,
similarly, an extension is upgraded during the lifetime of a session and
a type's OID changes.)  Also, maybe, what if types are in different
schemas on different databases.

- We could avoid some of the session-state issues by doing this per
request, like extending the Bind message somehow by appending the list
of types to be sent in binary.  But the JDBC driver currently lists 24
types for which it supports binary, so that would require adding 24*4=96
bytes per request, which seems untenable.

I think intuitively, this facility ought to work like client_encoding.
There, the client declares its capabilities, and the server has to
format the output according to the client's capabilities.  That works,
and it also works through connection poolers.  (It is a GUC.)  If we can
model it like that as closely as possible, then we have a chance of
getting it working reliably.  Notably, the value space for
client_encoding is a globally known fixed list of strings.  We need to
figure out what is the right way to globally identify types, like either
by fully-qualified name, by base name, some combination, how does it
work with extensions, or do we need a new mechanism like UUIDs.  I think
that is something we need to work out, no matter which protocol
mechanism we end up using.

So how is this different than the GUC that I proposed ?

Dave 

Re: Request for comment on setting binary format output per session

From
Robert Haas
Date:
On Wed, Oct 4, 2023 at 10:17 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> I think intuitively, this facility ought to work like client_encoding.

I hadn't really considered client_encoding as a precedent for this
setting. A lot of my discomfort with the proposed mechanism also
applies to client_encoding, namely, suppose you call some function or
procedure or whatever and it changes client_encoding on your behalf
and now your communication with the server is all screwed up. That
seems very unpleasant. Yet it's also existing behavior. I think one
could conclude on these facts either that (a) client_encoding is fine
and the problems with controlling behavior using that kind of
mechanism are mostly theoretical or (b) that we messed up with
client_encoding and shouldn't add any more mistakes of the same ilk or
(c) that we should really be looking at redesigning the way
client_encoding works, too.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Request for comment on setting binary format output per session

From
Peter Eisentraut
Date:
On 04.10.23 18:26, Merlin Moncure wrote:
> On Wed, Oct 4, 2023 at 9:17 AM Peter Eisentraut <peter@eisentraut.org 
> <mailto:peter@eisentraut.org>> wrote:
> 
>     I think intuitively, this facility ought to work like client_encoding.
>     There, the client declares its capabilities, and the server has to
>     format the output according to the client's capabilities.  That works,
>     and it also works through connection poolers.  (It is a GUC.)  If we
>     can
>     model it like that as closely as possible, then we have a chance of
>     getting it working reliably.  Notably, the value space for
>     client_encoding is a globally known fixed list of strings.  We need to
>     figure out what is the right way to globally identify types, like
>     either
>     by fully-qualified name, by base name, some combination, how does it
>     work with extensions, or do we need a new mechanism like UUIDs.  I
>     think
>     that is something we need to work out, no matter which protocol
>     mechanism we end up using.
> 
> 
>   Fantastic write up.
> 
>  > globally known fixed list of strings
> Are you suggesting that we would have a client/server negotiation such 
> as, 'jdbc<version>', 'all', etc where that would identify which types 
> are done which way?  If you did that, why would we need to promote 
> names/uuid to permanent global space?

No, I don't think I meant anything like that.




Re: Request for comment on setting binary format output per session

From
Peter Eisentraut
Date:
On 04.10.23 20:30, Dave Cramer wrote:
>     We need to
>     figure out what is the right way to globally identify types, like
>     either
>     by fully-qualified name, by base name, some combination, how does it
>     work with extensions, or do we need a new mechanism like UUIDs.  I
>     think
>     that is something we need to work out, no matter which protocol
>     mechanism we end up using.
> 
> 
> So how is this different than the GUC that I proposed ?

The last patch I see from you in this thread uses OIDs, which I have 
argued is not the right solution.



Re: Request for comment on setting binary format output per session

From
Peter Eisentraut
Date:
On 04.10.23 21:10, Robert Haas wrote:
> On Wed, Oct 4, 2023 at 10:17 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> I think intuitively, this facility ought to work like client_encoding.
> 
> I hadn't really considered client_encoding as a precedent for this
> setting. A lot of my discomfort with the proposed mechanism also
> applies to client_encoding, namely, suppose you call some function or
> procedure or whatever and it changes client_encoding on your behalf
> and now your communication with the server is all screwed up. That
> seems very unpleasant. Yet it's also existing behavior. I think one
> could conclude on these facts either that (a) client_encoding is fine
> and the problems with controlling behavior using that kind of
> mechanism are mostly theoretical or (b) that we messed up with
> client_encoding and shouldn't add any more mistakes of the same ilk or
> (c) that we should really be looking at redesigning the way
> client_encoding works, too.

Yeah I agree with all three of these points, but I don't have a strong 
opinion which is the best one.




Re: Request for comment on setting binary format output per session

From
Jelte Fennema
Date:
On Wed, 4 Oct 2023 at 21:10, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Oct 4, 2023 at 10:17 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> > I think intuitively, this facility ought to work like client_encoding.
>
> I hadn't really considered client_encoding as a precedent for this
> setting. A lot of my discomfort with the proposed mechanism also
> applies to client_encoding, namely, suppose you call some function or
> procedure or whatever and it changes client_encoding on your behalf
> and now your communication with the server is all screwed up. That
> seems very unpleasant. Yet it's also existing behavior. I think one
> could conclude on these facts either that (a) client_encoding is fine
> and the problems with controlling behavior using that kind of
> mechanism are mostly theoretical or (b) that we messed up with
> client_encoding and shouldn't add any more mistakes of the same ilk or
> (c) that we should really be looking at redesigning the way
> client_encoding works, too.

With my PgBouncer maintainer hat on: I think the GUC approach would be
quite alright, i.e. option (a). The nice thing is that it would be
very simple to make it work with connection poolers, because the same
approach could be reused that is currently used for client_encoding.
NOTE: This does require that the new GUC has the GUC_REPORT flag set
(just like client_encoding). By adding the GUC_REPORT flag clients
could also take into account any changes to the setting even when they
did not change it themselves (simplest way to handle a change would be
by throwing an error and closing the connection).

To clarify how PgBouncer currently handles client_encoding: For each
client PgBouncer keeps track of the current value for a list of GUCs,
one of which is client_encoding. This is done by listening for the
ParameterStatus responses it gets from the server while the client is
connected. Then if later a client is assigned another server
connection, and that server has different values for (some of) these
GUCs, before actually forwarding the client its query some SET
commands are sent to correctly set the GUCs.

The resultFormat = 3 trick might be nice for backwards compatibility
of clients. That way old clients would continue to get text or binary
output even when the new GUC is set. To be clear resultFormat=3 would
mean: Use binary format when the new GUC indicates that it should.
Upthread I see that Dave mentioned that this would require an extra
Describe, but I don't understand why. If you set 3 for all columns and
you know the value of the GUC, then you know which columns will be
encoded in binary.



Re: Request for comment on setting binary format output per session

From
Jelte Fennema
Date:
On Fri, 6 Oct 2023 at 13:11, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 04.10.23 20:30, Dave Cramer wrote:
> >     We need to
> >     figure out what is the right way to globally identify types, like
> >     either
> >     by fully-qualified name, by base name, some combination, how does it
> >     work with extensions, or do we need a new mechanism like UUIDs.  I
> >     think
> >     that is something we need to work out, no matter which protocol
> >     mechanism we end up using.
> >
> >
> > So how is this different than the GUC that I proposed ?
>
> The last patch I see from you in this thread uses OIDs, which I have
> argued is not the right solution.

Since the protocol already returns OIDs in the ParameterDescription
and RowDescription messages I don't see why using OIDs for this GUC
would cause any additional problems. Clients already need to know OIDs
and how to encode/decode them. So I don't see a big reason why we
should allow passing in "schema"."type" as well. Clients already need
a mapping from typename to OID for user defined types to be able to
parse ParameterDescription and RowDescription messages.

With my Citus hat on: I would very much like something like the UUID
or typename approach. With Citus the same user defined type can have
different OIDs on each of the servers in the cluster. So it sounds
like currently using a transaction pooler that does load balancing
across the workers in the cluster would cause issues for user defined
types. Having a cluster global unique identifier for a type within a
database would be able to solve those issues. But that would require
that the protocol actually sent those cluster global unique
identifiers instead of OIDs. As far as I can tell similar issues would
be present with zero-downtime upgrades using pg_upgrade + logical
replication, and probably also in solutions like BDR. i.e. this is an
issue when clients get transparently re-connected to a new host where
an OIDs of user defined types might be different.

So I think OIDs are a good choice for the newly proposed GUC, because
that's what the protocol uses currently. But I do think it would be
good to keep in mind what it would look like if we'd change the
protocol to report and accept UUIDs/typenames instead of OIDs.
UUIDs/typenames and OIDs have a clearly different string
representation though. So, I think we could easily expand the new GUC
to support both OIDs and UUIDs/typenames when we change the protocol
to do so too, even when supporting just OIDs now.



Re: Request for comment on setting binary format output per session

From
Robert Haas
Date:
On Mon, Oct 9, 2023 at 11:09 AM Jelte Fennema <postgres@jeltef.nl> wrote:
> Since the protocol already returns OIDs in the ParameterDescription
> and RowDescription messages I don't see why using OIDs for this GUC
> would cause any additional problems.

...but then...

> With Citus the same user defined type can have
> different OIDs on each of the servers in the cluster.

I realize that your intention here may be to say that this is not an
*additional* problem but one we have already. But it seems like one
that we ought to be trying to solve, rather than propagating a
problematic solution into more places.

Decisions we make about the wire protocol are some of the most
long-lasting and painful decisions we make, right up there with the
on-disk format. Maybe worse, in some ways.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:

On Mon, 9 Oct 2023 at 15:00, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Oct 9, 2023 at 11:09 AM Jelte Fennema <postgres@jeltef.nl> wrote:
> Since the protocol already returns OIDs in the ParameterDescription
> and RowDescription messages I don't see why using OIDs for this GUC
> would cause any additional problems.

...but then...

> With Citus the same user defined type can have
> different OIDs on each of the servers in the cluster.

I realize that your intention here may be to say that this is not an
*additional* problem but one we have already. But it seems like one
that we ought to be trying to solve, rather than propagating a
problematic solution into more places.

Decisions we make about the wire protocol are some of the most
long-lasting and painful decisions we make, right up there with the
on-disk format. Maybe worse, in some ways.

So if we use <schema>.<type> would it be possible to have something like <builtin> which represents a set of well known types? 
My goal here is to reduce the overhead of naming all the types the client  wants in binary. The list of well known types is pretty long.
Additionally we could have a shorthand for removing a well known type. 

Dave

Re: Request for comment on setting binary format output per session

From
Jeff Davis
Date:
On Wed, 2023-10-04 at 15:10 -0400, Robert Haas wrote:
> I hadn't really considered client_encoding as a precedent for this
> setting. A lot of my discomfort with the proposed mechanism also
> applies to client_encoding, namely, suppose you call some function or
> procedure or whatever and it changes client_encoding on your behalf
> and now your communication with the server is all screwed up.

This may have some security implications, but we've had lots of
discussion about the general topic of executing malicious code, and the
ability to mess with the on-the-wire formats might not be any worse
than what can already happen. (Though expanding it to binary formats
might slightly increase the attack surface area.)

> That
> seems very unpleasant. Yet it's also existing behavior.

The binary format setting is better in some ways and worse in other
ways.

For text encoding, usually it's expecting a single encoding and so a
single setting at the start of the session makes sense. For binary
formats, the client is likely to support some values in binary and
others not; and user-defined types make it even messier.

On the other hand, at least the results are marked as being binary
format, so if something unexpected happens, a well-written client is
more likely to see that something went wrong. For text encoding, the
client would have to be a bit more defensive.

Another thing to consider is that using a GUC for binary formats is a
protocol change in a way that client_encoding is not. The existing
documentation for the protocol already specifies when binary formats
will be used, and a GUC would change that behavior. We absolutely would
need to update the documentation, and clients (like psql) really should
be updated.

> I think one
> could conclude on these facts either that (a) client_encoding is fine
> and the problems with controlling behavior using that kind of
> mechanism are mostly theoretical or 

I'm not clear on the exact rules for a protocol version bump and why a
GUC helps us avoid one. If we have a binary_formats GUC, the client
would need to know the server version and check that it's >=17 before
sending the "SET binary_formats='...'" commmand, right? What's the
difference between that and making it an explicit protocol message that
only >=17 understand?

In any case, I think clients and connection poolers can work around the
problems, and they are mostly minor in practice, but I wouldn't call
them "theoretical". If there's enough utility in the binary_formats
parameter, we can decide to put up with the problems; which is
different than saying there aren't any.

> (b) that we messed up with
> client_encoding and shouldn't add any more mistakes of the same ilk
> or
> (c) that we should really be looking at redesigning the way
> client_encoding works, too.

(b) doesn't seem like a very helpful perspective without some ideas
toward (c). I think (c) is worth discussing but we don't have to block
on it.

Regards,
    Jeff Davis





Re: Request for comment on setting binary format output per session

From
Jelte Fennema
Date:
On Mon, 9 Oct 2023 at 21:00, Robert Haas <robertmhaas@gmail.com> wrote:
> ...but then...
>
> > With Citus the same user defined type can have
> > different OIDs on each of the servers in the cluster.
>
> I realize that your intention here may be to say that this is not an
> *additional* problem but one we have already. But it seems like one
> that we ought to be trying to solve, rather than propagating a
> problematic solution into more places.

Yes, I probably should have emphasized the word *additional*. i.e.
starting from scratch I wouldn't use OIDs in this GUC nor in
ParameterDescription or RowDescription, but blocking the addition of
this GUC on addressing that seems unnecessary. When we fix it we can
fix this too. I'd rather use OIDs (with all their problems)
consistently now for communication of types with regards to protocol
related things. Then we can at some point change all places in bulk to
some better identifier than OIDs.

> Decisions we make about the wire protocol are some of the most
> long-lasting and painful decisions we make, right up there with the
> on-disk format. Maybe worse, in some ways.

Yes, I agree. I just don't think using OIDs makes changing the
protocol in this regard any less painful than it already is currently.



Re: Request for comment on setting binary format output per session

From
Jelte Fennema
Date:
On Mon, 9 Oct 2023 at 22:25, Jeff Davis <pgsql@j-davis.com> wrote:
> We absolutely would
> need to update the documentation, and clients (like psql) really should
> be updated.

+1

> > I think one
> > could conclude on these facts either that (a) client_encoding is fine
> > and the problems with controlling behavior using that kind of
> > mechanism are mostly theoretical or
>
> I'm not clear on the exact rules for a protocol version bump and why a
> GUC helps us avoid one. If we have a binary_formats GUC, the client
> would need to know the server version and check that it's >=17 before
> sending the "SET binary_formats='...'" commmand, right?

I agree that we'd probably still want to do a protocol minor version
bump. FYI there is another thread trying to introduce protocol change
which needs a minor version bump. Patch number 0003 in that patchset
is meant to actually make libpq handle minor version increases
correctly. If we need a version bump than that would be useful[1].


> What's the
> difference between that and making it an explicit protocol message that
> only >=17 understand?

Honestly I think the main difference is the need to introduce this
explicit protocol message. If we do, I think it might be best to have
this be a way of setting a GUC at the Protocol level, and expand the
GucContext enum to have a way to disallow setting it from SQL (e.g.
PGC_PROTOCOL), while still allowing PgBouncer (or other poolers) to
change the GUC as part of the connection handoff, in a way that's
similar to what's being done for client_encoding now. We might even
want to make client_encoding PGC_PROTOCOL too (eventually).

Actually, for connection poolers there's other reasons to want to set
GUC values at the protocol level instead of SQL. Because the value of
the ParameterStatus response is sadly not a valid SQL string... That's
why in PgBouncer we have to re-quote the value [2], which is a problem
for any GUC_LIST_QUOTE type, which search_path is. This GUC_LIST_QUOTE
logic is actually not completely correct in PgBouncer and only handles
"" (empty search_path), because for search_path that's the only
reasonable problematic case that people might hit (e.g. truncating to
NAMELEN is another problem, but elements in search_path should already
be at most NAMELEN). But still it would be good not to have to worry
about that. And being able to send the value in ParameterStatus back
verbatim to the server would be quite helpful for PgBouncer.

[1]:
https://www.postgresql.org/message-id/flat/CAFj8pRAX48WH5Y6BbqnZbUSzmtEaQZ22rY_6cYw%3DE9QkoVvL0A%40mail.gmail.com#643c91f84ae33b316c0fed64e19c8e49
[2]: https://github.com/pgbouncer/pgbouncer/blob/60708022d5b934fa53c51849b9f02d87a7881b11/src/varcache.c#L172-L183



Re: Request for comment on setting binary format output per session

From
Jelte Fennema
Date:
On Mon, 9 Oct 2023 at 21:08, Dave Cramer <davecramer@gmail.com> wrote:
> So if we use <schema>.<type> would it be possible to have something like <builtin> which represents a set of well
knowntypes?
 
> My goal here is to reduce the overhead of naming all the types the client  wants in binary. The list of well known
typesis pretty long.
 
> Additionally we could have a shorthand for removing a well known type.

You're only setting this once in the lifetime of the connection right,
i.e. right at the start (although pgbouncer could set it once per
transaction in the worst case). It seems like it shouldn't really
matter much to optimize the size of the "SET format_binary=..."
command, I'd expect it to be at most 1 kilobyte. I'm not super opposed
to having a shorthand for some of the most commonly wanted built-in
types, but then we'd need to decide on what those are, which would add
even more discussion/bikeshedding to this thread. I'm not sure the win
in size is worth that effort.



Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Mon, 9 Oct 2023 at 17:11, Jelte Fennema <postgres@jeltef.nl> wrote:
On Mon, 9 Oct 2023 at 21:08, Dave Cramer <davecramer@gmail.com> wrote:
> So if we use <schema>.<type> would it be possible to have something like <builtin> which represents a set of well known types?
> My goal here is to reduce the overhead of naming all the types the client  wants in binary. The list of well known types is pretty long.
> Additionally we could have a shorthand for removing a well known type.

You're only setting this once in the lifetime of the connection right,

Correct 
i.e. right at the start (although pgbouncer could set it once per
transaction in the worst case). It seems like it shouldn't really
matter much to optimize the size of the "SET format_binary=..."
command, I'd expect it to be at most 1 kilobyte. I'm not super opposed
to having a shorthand for some of the most commonly wanted built-in
types, but then we'd need to decide on what those are, which would add
even more discussion/bikeshedding to this thread. I'm not sure the win
in size is worth that effort.
It's worth the effort if we use schema.typename, if we use oids then I'm not that invested in this approach.

Dave

Re: Request for comment on setting binary format output per session

From
Robert Haas
Date:
On Mon, Oct 9, 2023 at 4:25 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Another thing to consider is that using a GUC for binary formats is a
> protocol change in a way that client_encoding is not. The existing
> documentation for the protocol already specifies when binary formats
> will be used, and a GUC would change that behavior. We absolutely would
> need to update the documentation, and clients (like psql) really should
> be updated.

I think the idea of using a new parameterFormat value is a good one.
Let 0 and 1 continue to mean what they mean, and let clients opt in to
the new mechanism if they're aware of it.

I think it's a pretty bad idea to dump new protocol behavior on
clients who have in no way indicated that they know about it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Request for comment on setting binary format output per session

From
Dave Cramer
Date:


On Tue, 10 Oct 2023 at 10:25, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Oct 9, 2023 at 4:25 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Another thing to consider is that using a GUC for binary formats is a
> protocol change in a way that client_encoding is not. The existing
> documentation for the protocol already specifies when binary formats
> will be used, and a GUC would change that behavior. We absolutely would
> need to update the documentation, and clients (like psql) really should
> be updated.

I think the idea of using a new parameterFormat value is a good one.
Let 0 and 1 continue to mean what they mean, and let clients opt in to
the new mechanism if they're aware of it.

Correct me if I am wrong, but the client has to request this. So I'm not sure how we would be surprised ?

Dave

Re: Request for comment on setting binary format output per session

From
Robert Haas
Date:
On Mon, Oct 9, 2023 at 5:02 PM Jelte Fennema <postgres@jeltef.nl> wrote:
> Honestly I think the main difference is the need to introduce this
> explicit protocol message. If we do, I think it might be best to have
> this be a way of setting a GUC at the Protocol level, and expand the
> GucContext enum to have a way to disallow setting it from SQL (e.g.
> PGC_PROTOCOL), while still allowing PgBouncer (or other poolers) to
> change the GUC as part of the connection handoff, in a way that's
> similar to what's being done for client_encoding now. We might even
> want to make client_encoding PGC_PROTOCOL too (eventually).

That's an idea worth considering, IMHO. I'm not saying it's the best
or only idea, but it seems to have some real advantages.

The pooler case is actually a really important one here. If the client
is connected directly to the server, the difference between whether
something is controlled via the protocol or via SQL is just whether it
could be set inside some function. I think that's a thing to be
concerned about, but when you add the pooler to the equation then you
have the additional question of whether a certain value should be
controlled by the end-client or by the pooler. A really obvious
example of where you might want the latter behavior is
session_authorization. You'd like the pooler to be able to set that in
such a way that the end-client can't tinker with it by any means.
Right now we don't have a way to do that, but maybe someday we will.
This issue is perhaps a bit less critical, but it still feels bad if
the end-client can effectively pull the rug out from under the
pooler's wire protocol expectations. I'm not exactly sure what the
right policy is here concretely, so I'm not ready to argue for exactly
what we should do just yet, but I do want to argue that we should be
thinking carefully about these issues.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Request for comment on setting binary format output per session

From
Robert Haas
Date:
On Tue, Oct 10, 2023 at 10:30 AM Dave Cramer <davecramer@gmail.com> wrote:
> Correct me if I am wrong, but the client has to request this. So I'm not sure how we would be surprised ?

Consider an application, a connection pooler, and a stored procedure
or function on the server. If this is controlled by a GUC, any of them
could set it at any point in the session. That could lead to the
application and/or the connection pooler being out of step with the
server behavior.

--
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, 31 Jul 2023 at 21:58, Dave Cramer <davecramer@gmail.com> wrote:
>
>
> Dave Cramer
>
>
> On Mon, 10 Jul 2023 at 03:56, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>> > On 25 Apr 2023, at 16:47, Dave Cramer <davecramer@gmail.com> wrote:
>>
>> > Patch attached with comments removed
>>
>> This patch no longer applies, please submit a rebased version on top of HEAD.
>
>
> Rebased see attached

CFBot shows that the patch does not apply anymore as in [1]:
=== Applying patches on top of PostgreSQL commit ID
fba2112b1569fd001a9e54dfdd73fd3cb8f16140 ===
=== applying patch ./0001-Created-protocol.h.patch
patching file src/backend/access/common/printsimple.c
Hunk #1 succeeded at 22 with fuzz 2 (offset 1 line).
Hunk #2 FAILED at 33.
Hunk #3 FAILED at 66.
2 out of 3 hunks FAILED -- saving rejects to file
src/backend/access/common/printsimple.c.rej
patching file src/backend/access/transam/parallel.c
Hunk #1 succeeded at 34 (offset 1 line).
Hunk #2 FAILED at 1128.
Hunk #3 FAILED at 1138.
Hunk #4 FAILED at 1184.
Hunk #5 succeeded at 1205 (offset 4 lines).
Hunk #6 FAILED at 1218.
Hunk #7 FAILED at 1373.
Hunk #8 FAILED at 1551.
6 out of 8 hunks FAILED -- saving rejects to file
src/backend/access/transam/parallel.c.rej

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_3777.log

Regards,
Vignesh



Re: Request for comment on setting binary format output per session

From
vignesh C
Date:
On Sat, 27 Jan 2024 at 07:45, vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 31 Jul 2023 at 21:58, Dave Cramer <davecramer@gmail.com> wrote:
> >
> >
> > Dave Cramer
> >
> >
> > On Mon, 10 Jul 2023 at 03:56, Daniel Gustafsson <daniel@yesql.se> wrote:
> >>
> >> > On 25 Apr 2023, at 16:47, Dave Cramer <davecramer@gmail.com> wrote:
> >>
> >> > Patch attached with comments removed
> >>
> >> This patch no longer applies, please submit a rebased version on top of HEAD.
> >
> >
> > Rebased see attached
>
> CFBot shows that the patch does not apply anymore as in [1]:
> === Applying patches on top of PostgreSQL commit ID
> fba2112b1569fd001a9e54dfdd73fd3cb8f16140 ===
> === applying patch ./0001-Created-protocol.h.patch
> patching file src/backend/access/common/printsimple.c
> Hunk #1 succeeded at 22 with fuzz 2 (offset 1 line).
> Hunk #2 FAILED at 33.
> Hunk #3 FAILED at 66.
> 2 out of 3 hunks FAILED -- saving rejects to file
> src/backend/access/common/printsimple.c.rej
> patching file src/backend/access/transam/parallel.c
> Hunk #1 succeeded at 34 (offset 1 line).
> Hunk #2 FAILED at 1128.
> Hunk #3 FAILED at 1138.
> Hunk #4 FAILED at 1184.
> Hunk #5 succeeded at 1205 (offset 4 lines).
> Hunk #6 FAILED at 1218.
> Hunk #7 FAILED at 1373.
> Hunk #8 FAILED at 1551.
> 6 out of 8 hunks FAILED -- saving rejects to file
> src/backend/access/transam/parallel.c.rej
>
> Please post an updated version for the same.
>
> [1] - http://cfbot.cputube.org/patch_46_3777.log

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh