Thread: Protocol buffer support for Postgres

Protocol buffer support for Postgres

From: 陈天舟
I am interested in adding Protocol Buffer support for Postgres. Protocol Buffers occupy less space than JSON. More importantly, they have a schema and are forward/backward compatible. All of this makes them a very good format for persistence.

Here are two rough ideas I have right now:

Approach 1:

Creating a datatype "PROTOBUF" similar to "JSON" and implementing all the similar support (e.g. indexing) as is done for the "JSON" type.
(1) Since each protocol buffer column requires a schema, I am not sure where the best place to store that schema info is. Should it be in a CONSTRAINT (though I am not able to find any documentation describing custom constraints), in the COMMENT, or somewhere else? (A rough sketch of the COMMENT option is below.)
(2) The input/output will be already-serialized protocol buffers. The protocol buffer schema will be used to validate the column value when storing/retrieving the data. It may be possible to skip the check on retrieval; instead, the check can be done whenever the column schema is changed.
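
To make point (1) a bit more concrete, here is a rough sketch of the COMMENT option; the bytea stand-in for a real PROTOBUF type and the validation hook are of course made up:

    -- Store the already-serialized bytes; hang the .proto text off the column.
    CREATE TABLE events (payload bytea);
    COMMENT ON COLUMN events.payload IS
        'message Event { string name = 1; int64 ts = 2; }';

    -- A validation function (not shown) could fetch the schema back with:
    SELECT col_description('events'::regclass, 1);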

Approach 2:

Still creating a datatype "PROTOBUF", but internally it uses the "JSON" type, so PROTOBUF is just a wrapper type doing the conversion and schema validation on the fly. The benefit is that it can leverage the existing JSON support. The downside is that it can only support the features common to both Protocol Buffers and JSON; e.g. Protocol Buffer annotations are a quite useful feature that has no JSON equivalent.

I would like to hear some thoughts about having Protocol Buffer support in Postgres, as well as about the approach for tackling it.

Thanks
Tianzhou

Re: Protocol buffer support for Postgres

From: Craig Ringer
On 26 April 2016 at 14:06, 陈天舟 <tianzhouchen@gmail.com> wrote:
I am interested in adding Protocol Buffer support for Postgres. Protocol Buffers occupy less space than JSON. More importantly, they have a schema and are forward/backward compatible. All of this makes them a very good format for persistence.

Here are two rough ideas I have right now:

Approach 1:

Creating a datatype "PROTOBUF" similar to "JSON" and implementing all the similar support (e.g. indexing) as is done for the "JSON" type.
(1) Since each protocol buffer column requires a schema, I am not sure where the best place to store that schema info is. Should it be in a CONSTRAINT (though I am not able to find any documentation describing custom constraints), in the COMMENT, or somewhere else?

I can't really imagine how you'd do that without adding a new catalog like we have for enum members. A typmod isn't sufficient since you need a whole lot more than an integer, and typmods aren't tracked throughout the server that well.

That'll make it hard to do it with an extension.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Protocol buffer support for Postgres

From: José Luis Tallón
On 04/26/2016 08:06 AM, 陈天舟 wrote:
> I am interested in adding Protocol Buffer support for Postgres.
> Protocol Buffers occupy less space than JSON. More importantly, they
> have a schema and are forward/backward compatible. All of this makes
> them a very good format for persistence.

Have you investigated JSONB vs ProtoBuf space usage? (the key being
the "B" -- Postgres' own binary JSON implementation)
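
As a quick and admittedly unscientific check, pg_column_size() shows the
stored size of individual values; the bytea below is just a hand-encoded
protobuf of the same two fields:

    SELECT pg_column_size('{"user_id": 42, "name": "alice"}'::json)  AS json_bytes,
           pg_column_size('{"user_id": 42, "name": "alice"}'::jsonb) AS jsonb_bytes,
           pg_column_size('\x082a1205616c696365'::bytea)             AS protobuf_bytes;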


The "per-column schema" thing sounds difficult to do without major 
changes to the core unless/until we have generalized user-defined 
metadata for objects ....


Just my .02€
    / J.L.




Re: Protocol buffer support for Postgres

From: Tom Lane
Craig Ringer <craig@2ndquadrant.com> writes:
> On 26 April 2016 at 14:06, 陈天舟 <tianzhouchen@gmail.com> wrote:
>> (1) Since each protocol buffer column requires a schema, I am not sure
>> where the best place to store that schema info is. Should it be in a
>> CONSTRAINT (though I am not able to find any documentation describing
>> custom constraints), in the COMMENT, or somewhere else?

> I can't really imagine how you'd do that without adding a new catalog like
> we have for enum members. A typmod isn't sufficient since you need a whole
> lot more than an integer, and typmods aren't tracked throughout the server
> that well.

> That'll make it hard to do it with an extension.

PostGIS manages to reference quite a lot of schema-like information via
a geometry column's typmod.  Maybe there's a reason why their approach
wouldn't be a good fit for this, but it'd be worth investigating.
        regards, tom lane



Re: Protocol buffer support for Postgres

From: Paul Ramsey
On Tue, Apr 26, 2016 at 6:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
>> On 26 April 2016 at 14:06, 陈天舟 <tianzhouchen@gmail.com> wrote:
>>> (1) Since each protocol buffer column requires a schema, I am not sure
>>> where the best place to store that schema info is. Should it be in a
>>> CONSTRAINT (though I am not able to find any documentation describing
>>> custom constraints), in the COMMENT, or somewhere else?
>
>> I can't really imagine how you'd do that without adding a new catalog like
>> we have for enum members. A typmod isn't sufficient since you need a whole
>> lot more than an integer, and typmods aren't tracked throughout the server
>> that well.
>
>> That'll make it hard to do it with an extension.
>
> PostGIS manages to reference quite a lot of schema-like information via
> a geometry column's typmod.  Maybe there's a reason why their approach
> wouldn't be a good fit for this, but it'd be worth investigating.

We pack a short type number, two flags and 24 bits of SRID number into
an integer. The SRID number is in turn a foreign key into the
spatial_ref_sys table where the fully spelled out spatial reference
definition lives.  It's not very nice, and it's quite breakable since
there's no foreign key integrity between the typmod and the
spatial_ref_sys pk.
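
To make that visible from SQL (assuming PostGIS is installed), the whole
reference is carried in the single atttypmod integer:

    CREATE TABLE places (geom geometry(Point, 4326));

    SELECT atttypmod, format_type(atttypid, atttypmod)
      FROM pg_attribute
     WHERE attrelid = 'places'::regclass AND attname = 'geom';

    -- The SRID packed into that typmod refers to spatial_ref_sys(srid),
    -- but nothing enforces the link:
    SELECT srid, auth_name FROM spatial_ref_sys WHERE srid = 4326;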

P.



Re: Protocol buffer support for Postgres

From: David Fetter
On Mon, Apr 25, 2016 at 11:06:11PM -0700, 陈天舟 wrote:
> I am interested in adding Protocol Buffer support for Postgres. Protocol
> Buffers occupy less space than JSON. More importantly, they have a schema
> and are forward/backward compatible. All of this makes them a very good
> format for persistence.

Thanks for the idea.  There are many different serializations, some of
which are listed below, each with their own qualities, which can be
good or bad with a very strong dependency on context:

https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats

Should we see about making a more flexible serialization
infrastructure?  What we have is mostly /ad hoc/, and has already
caused real pain to the PostGIS folks, this even after some pretty
significant and successful efforts were made in this direction.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Protocol buffer support for Postgres

From: Flavius Anton
On Tue, Apr 26, 2016 at 2:40:49PM -0700, David Fetter wrote:
> Should we see about making a more flexible serialization
> infrastructure?  What we have is mostly /ad hoc/, and has already
> caused real pain to the PostGIS folks, this even after some pretty
> significant and successful efforts were made in this direction.

Hi all. Is anybody working on this right now? I would like to pick
this task for the summer. First of all, what do you think about what
David said? Should we try and design a generic infrastructure for
similar serialization datatypes? If so, will we need to refactor some
pieces from the JSON/XML implementation? I looked over the code and it
seems nicely decoupled, but I am not sure what this would involve.
I've done this before for MySQL[1] (not yet completed), but I'd love
to try it for PostgreSQL too.

On Tue, Apr 26, 2016 at 11:23:11AM -0700, José Luis Tallón wrote:
> Have you investigated JSONB vs ProtoBuf space usage? (the key being
> the "B" -- Postgres' own binary JSON implementation)

This is something I can further investigate, but another (possibly
major) benefit of Protocol Buffers over JSON is that they *still*
have a schema. I think they combine advantages of both structured
and schemaless data.

My best guess is that we shouldn't focus on abstracting *any*
serialization paradigm, but only the ones that have a schema (like
Thrift or Protocol Buffers). Speaking of schemas, where is the best
place to keep them? For MySQL I opted for a plain text file similar to
.trg files (the ones MySQL uses for keeping triggers).

I'd love to talk more about this.

Thank you.
Flavius Anton

[1] https://github.com/google/mysql-protobuf



Re: Protocol buffer support for Postgres

From: Greg Stark
On Thu, Jun 23, 2016 at 8:15 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
>
> I'd love to talk more about this.

I thought quite a bit about this a few years ago but never really
picked it up to work on.

There are different use cases where this could be useful, and
different approaches suit the different use cases.

One idea was to have a protocol buffer data type which used the typmod
as an identifier to specify the schema, perhaps as an oid primary key
from some other table, or something like how enums are handled. That
would be used much the way JSON is used by languages where that's
natural, and it might be useful if you're in an environment where data
is being provided in protobufs and you need to send them on in
protobufs.
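
A very rough sketch of that idea, with a made-up registry table and type
name purely to show the shape:

    -- Made-up registry holding .proto message definitions, one row per schema.
    CREATE TABLE protobuf_schema (
        id    serial PRIMARY KEY,
        proto text NOT NULL
    );
    INSERT INTO protobuf_schema (proto)
        VALUES ('message Event { string name = 1; int64 ts = 2; }');

    -- Hypothetical usage, if such a type existed: the typmod "1" would refer
    -- to protobuf_schema.id, much as a PostGIS geometry typmod carries an
    -- SRID that points at spatial_ref_sys:
    --
    --   CREATE TABLE events (payload protobuf(1));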

Another option would be to allow the output of your select query to be
read in a protobuf. That might be a feature to add in libpq rather
than in the server, or perhaps as a new protocol feature that libpq
would then just switch to which might make it easier to use from other
languages. That might make it easier to use Postgres as a data store
for an environment where everything is in protobufs without losing the
full power of SQL schemas in the database.

As an aside, have you seen Cap’n Proto? It looks more interesting to
me than protobufs. It fixes a lot of the corner cases where protobufs
end up with unbounded computational complexity and seems a lot
cleaner. I haven't thought about it much but I wonder if it would be
possible to actually use Cap'n Proto capabilities to grant access to
Postgres RLS labels too.

--
greg



Re: Protocol buffer support for Postgres

From: Flavius Anton
On Thu, Jun 23, 2016 at 1:50 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Jun 23, 2016 at 8:15 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
>>
>> I'd love to talk more about this.
>
> I thought quite a bit about this a few years ago but never really
> picked it up to work on.
>
> Another option would be to allow the output of your select query to be
> read in a protobuf. That might be a feature to add in libpq rather
> than in the server, or perhaps as a new protocol feature that libpq
> would then just switch to which might make it easier to use from other
> languages. That might make it easier to use Postgres as a data store
> for an environment where everything is in protobufs without losing the
> full power of SQL schemas in the database.

I agree on this one; I think it's the most /natural/ way of doing things.

> As an aside, have you seen Cap’n Proto? It looks more interesting to
> me than protobufs. It fixes a lot of the corner cases where protobufs
> end up with unbounded computational complexity and seems a lot
> cleaner.

I've seen it around for quite some time, but my fear is that it is not
(yet?) widely adopted. Protocol Buffers themselves are not that
popular, let alone Cap'n Proto. By the way, there's also the newer
FlatBuffers[1] project, from Google too. They seem a lot like Cap'n
Proto, though. I think it's doable to provide some sort of abstraction
for these protocols in PostgreSQL, so that it can support all of them
eventually, with minimum effort for adding a new one. However, I am
skeptical about the practical advantages of having /all/ of them
supported.

--
Flavius

[1] https://github.com/google/flatbuffers



Re: Protocol buffer support for Postgres

From: Flavius Anton
On Thu, Jun 23, 2016 at 2:54 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
> On Thu, Jun 23, 2016 at 1:50 PM, Greg Stark <stark@mit.edu> wrote:
>> On Thu, Jun 23, 2016 at 8:15 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
>>>
>>> I'd love to talk more about this.
>>
>> I thought quite a bit about this a few years ago but never really
>> picked it up to work on.

Any other thoughts on this? My guess is that it might be an important
addition to Postgres that can attract even more users, but I am not
sure if there's enough interest from the community. If I want to pick
this task, how should I move forward? Do I need to write a design
document or similar or should I come up with a patch that implements a
draft prototype? I am new to this community and don't know the code
yet, so I'd appreciate some guidance from an older, more experienced
member.

Thanks.

--
Flavius



Re: Protocol buffer support for Postgres

From: Álvaro Hernández Tortosa

On 24/06/16 14:23, Flavius Anton wrote:
> On Thu, Jun 23, 2016 at 2:54 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
>> On Thu, Jun 23, 2016 at 1:50 PM, Greg Stark <stark@mit.edu> wrote:
>>> On Thu, Jun 23, 2016 at 8:15 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
>>>> I'd love to talk more about this.
>>> I thought quite a bit about this a few years ago but never really
>>> picked it up to work on.
> Any other thoughts on this? My guess is that it might be an important
> addition to Postgres that can attract even more users, but I am not
> sure if there's enough interest from the community. If I want to pick
> this task, how should I move forward? Do I need to write a design
> document or similar or should I come up with a patch that implements a
> draft prototype? I am new to this community and don't know the code
> yet, so I'd appreciate some guidance from an older, more experienced
> member.
>
>
    Other than protobuf, there are also other serialization formats
that might be worth considering. Not too long ago I suggested working
specifically on serialization formats for the json/jsonb types:
https://www.postgresql.org/message-id/56CB8A62.40100%408kdata.com
I believe this effort is in the same boat.

    Álvaro

-- 

Álvaro Hernández Tortosa


-----------
8Kdata




Re: Protocol buffer support for Postgres

From: Flavius Anton
On Fri, Jun 24, 2016 at 11:35 AM, Álvaro Hernández Tortosa
<aht@8kdata.com> wrote:
>
>
> On 24/06/16 14:23, Flavius Anton wrote:
>>
>> On Thu, Jun 23, 2016 at 2:54 PM, Flavius Anton <f.v.anton@gmail.com>
>> wrote:
>>>
>>> On Thu, Jun 23, 2016 at 1:50 PM, Greg Stark <stark@mit.edu> wrote:
>>>>
>>>> On Thu, Jun 23, 2016 at 8:15 PM, Flavius Anton <f.v.anton@gmail.com>
>>>> wrote:
>>>>>
>>>>> I'd love to talk more about this.
>>>>
>>>> I thought quite a bit about this a few years ago but never really
>>>> picked it up to work on.
>>
>> Any other thoughts on this? My guess is that it might be an important
>> addition to Postgres that can attract even more users, but I am not
>> sure if there's enough interest from the community. If I want to pick
>> this task, how should I move forward? Do I need to write a design
>> document or similar or should I come up with a patch that implements a
>> draft prototype? I am new to this community and don't know the code
>> yet, so I'd appreciate some guidance from an older, more experienced
>> member.
>>
>>
>
>     Other than protobuf, there are also other serialization formats that
> might be worth considering. Not too long ago I suggested working
> specifically on serialization formats for the json/jsonb types:
> https://www.postgresql.org/message-id/56CB8A62.40100%408kdata.com I believe
> this effort is in the same boat.

Sure, there are a bunch of these, some of the most popular being:
* Cap'n Proto
* Flatbuffers
* Thrift

A longer list is available here[1].

Meanwhile, I came across another interesting idea. What if, for
starters, we don't introduce a completely new serialization format
like Protocol Buffers, but rather make the existing JSON support
stronger and more interesting? What I am thinking of is having a JSON
schema (optionally) associated with a JSON column. That way, you
wouldn't have to store the keys on disk anymore, and the database
would also check JSON validity at INSERT time. I think those are two
big advantages. What do you think about this one?
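
Just to illustrate the space angle: if the schema fixes the key set, a
document could in principle be stored as a positional array and the keys
reattached on output. A toy comparison:

    -- {"user_id": 42, "name": "alice"} stored with keys vs. keys elided
    SELECT pg_column_size('{"user_id": 42, "name": "alice"}'::jsonb) AS with_keys,
           pg_column_size('[42, "alice"]'::jsonb)                    AS keys_elided;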

--
Flavius

[1] https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats



Re: Protocol buffer support for Postgres

From: Robert Haas
On Fri, Jun 24, 2016 at 2:23 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
> Any other thoughts on this? My guess is that it might be an important
> addition to Postgres that can attract even more users, but I am not
> sure if there's enough interest from the community. If I want to pick
> this task, how should I move forward? Do I need to write a design
> document or similar or should I come up with a patch that implements a
> draft prototype? I am new to this community and don't know the code
> yet, so I'd appreciate some guidance from an older, more experienced
> member.

It's not very clear to me from this discussion what you actually want
to implement, and I think your chances of implementing it successfully
and your chances of getting that committed depend heavily on what you
actually intend to do.  If you want to implement a new datatype, that
can be done as a contrib module and, if it doesn't go into the core
distribution, it can always go up on GitHub and people can use it if
they like it.  If you want to implement a tool for validating JSON
documents against a JSON schema, you can just write that as a function
in a loadable module and people who want to use it can add a CHECK
(json_validate_schema(mycolumn, 'schema or a reference thereunto goes
here')) constraint.  Again, if core doesn't want it, you can still
distribute it and people can use it with unmodified PostgreSQL.  And
similarly, if you want to convert records from PostgreSQL's native
format to some new serialization format, there is precedent for that
with JSON, and it can all be done as an extension to start.
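
To spell that out a little, here is a minimal sketch of the extension-only
route; json_validate_schema is the hypothetical function named above, with a
deliberately toy implementation (real JSON Schema validation would be much
more involved):

    -- Toy validator: require that every key listed in js_schema->'required'
    -- is present at the top level of the document.
    CREATE FUNCTION json_validate_schema(doc jsonb, js_schema jsonb)
    RETURNS boolean LANGUAGE sql IMMUTABLE AS $$
        SELECT NOT EXISTS (
            SELECT 1
              FROM jsonb_array_elements_text(js_schema->'required') AS r(key)
             WHERE NOT doc ? r.key)
    $$;

    CREATE TABLE profiles (
        data jsonb,
        CHECK (json_validate_schema(data, '{"required": ["user_id", "name"]}'))
    );

    -- And for the "convert to another serialization" part, the existing JSON
    -- precedent looks like:
    --   SELECT row_to_json(p) FROM profiles p;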

If you do something that touches the wire protocol or adds new formats
in which data can be returned to the client, I think your chances of
getting that committed to core are extremely low.  We don't really
like to change the wire protocol, and every time we add some extension
to it, the amount of actual use that people get out of it is
contingent on all of the various reimplementations of that protocol in
various language-specific drivers also being extended to support that
same thing.  It's not worth going through that kind of hassle very
often, and I don't think this would come close to having enough value
to make that level of pain worthwhile, given that there are so many
other ways of accomplishing the same thing: you can use a function in
the query to transform the data before sending it, or you can have the
client transform it afterwards.  Moreover, if your work doesn't go
into core then it's probably going to die on the vine because who
wants a modified server version that breaks wire protocol
compatibility?

I'm not taking any position on the intrinsic value of protocol buffer
support beyond the broad statement that I don't personally have any
reason to care about it.  That is, I neither support nor oppose the idea
of accepting code that does stuff related to protocol buffers.  I am
only offering a general opinion that something that can exist as an
extension is likely (a) easier to implement, (b) easier to get
committed, and (c) still potentially useful if it doesn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company