Thread: Protocol buffer support for Postgres
I am interested in adding Protocol Buffer support for Postgres. Protocol Buffers occupy less space than JSON. More importantly, they have a schema and are forward/backward compatible. All of this makes them a very good format for persistence.
Here are two rough ideas I have right now:
Approach 1:
Create a datatype "PROTOBUF", similar to "JSON", and implement all of the analogous support (e.g. indexing) that exists for the "JSON" type.
(1) Since each protocol buffer column requires a schema, I am not sure where the best place is to store that schema info. Should it be in a CONSTRAINT (though I cannot find documentation for any custom constraint), in the COMMENT, or somewhere else?
(2) The input/output will be already-serialized protocol buffers. The protocol buffer schema will be used to validate the column value when storing/retrieving data. It may be possible to skip the check on retrieval; instead, the check could be run whenever the column schema is changed.
Approach 2:
Still create a datatype "PROTOBUF", but internally use the "JSON" type, so PROTOBUF is just a wrapper type doing the conversion and schema validation on the fly. The benefit is that it can leverage the existing JSON support. The downside is that it can only support the features common to both Protocol Buffers and JSON; e.g. Protocol Buffer annotations are quite useful and have no JSON equivalent.
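For context on the space claim above, here is a rough Python sketch of the protobuf wire format (a hand-rolled varint encoder for illustration only, not the official library): a two-field integer message takes 5 bytes on the wire, versus 18 bytes as minimal JSON text.

```python
import json

def encode_varint(n):
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_field(field_number, value):
    """Encode one varint-typed field: key = (field_number << 3) | wire_type 0."""
    return encode_varint(field_number << 3) + encode_varint(value)

# A message with two int fields: field 1 = 150, field 2 = 3
msg = encode_field(1, 150) + encode_field(2, 3)
print(len(msg))                              # 5 bytes on the wire
print(len(json.dumps({"a": 150, "b": 3})))   # 18 bytes as JSON text
```

The saving comes both from the varint encoding and from field numbers replacing key names; a JSONB comparison (as suggested later in this thread) would narrow but not eliminate the gap, since JSONB still stores keys.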
I would like to hear some thoughts on having Protocol Buffer support in Postgres, as well as on the approach for tackling it.
Thanks
Tianzhou
On 26 April 2016 at 14:06, 陈天舟 <tianzhouchen@gmail.com> wrote:
> I am interested in adding Protocol Buffer support for Postgres. Protocol
> Buffer occupies less space than JSON. More importantly, it has schema and
> is forward/backward compatible. All these make it a very good format for
> persistency.
>
> Here are two rough ideas I have right now:
>
> Approach 1:
> Creating a datatype "PROTOBUF" similar as "JSON" and implement all the
> similar support (e.g. indexing) as done for "JSON" Type.
>
> (1) Since each protocol buffer column requires a schema. I am not sure
> where is the best place to store that schema info. Should it be in a
> CONSTRAINT (but I am not able to find the doc referring any custom
> constraint), or should it be in the COMMENT or somewhere else?
I can't really imagine how you'd do that without adding a new catalog like we have for enum members. A typmod isn't sufficient since you need a whole lot more than an integer, and typmods aren't tracked throughout the server that well.
That'll make it hard to do it with an extension.
On 04/26/2016 08:06 AM, 陈天舟 wrote:
> I am interested in adding Protocol Buffer support for Postgres.
> Protocol Buffer occupies less space than JSON. More importantly, it
> has schema and is forward/backward compatible. All these make it a
> very good format for persistency.

Have you investigated JSONB vs ProtoBuf space usage? (the key being the "B" -- Postgres' own binary JSON implementation)

The "per-column schema" thing sounds difficult to do without major changes to the core unless/until we have generalized user-defined metadata for objects....

Just my .02€

/ J.L.
Craig Ringer <craig@2ndquadrant.com> writes:
> On 26 April 2016 at 14:06, 陈天舟 <tianzhouchen@gmail.com> wrote:
>> (1) Since each protocol buffer column requires a schema. I am not sure
>> where is the best place to store that schema info. Should it be in a
>> CONSTRAINT (but I am not able to find the doc referring any custom
>> constraint), or should it be in the COMMENT or somewhere else?

> I can't really imagine how you'd do that without adding a new catalog like
> we have for enum members. A typmod isn't sufficient since you need a whole
> lot more than an integer, and typmods aren't tracked throughout the server
> that well.

> That'll make it hard to do it with an extension.

PostGIS manages to reference quite a lot of schema-like information via a geometry column's typmod. Maybe there's a reason why their approach wouldn't be a good fit for this, but it'd be worth investigating.

			regards, tom lane
On Tue, Apr 26, 2016 at 6:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
>> I can't really imagine how you'd do that without adding a new catalog like
>> we have for enum members. A typmod isn't sufficient since you need a whole
>> lot more than an integer, and typmods aren't tracked throughout the server
>> that well.
>>
>> That'll make it hard to do it with an extension.
>
> PostGIS manages to reference quite a lot of schema-like information via
> a geometry column's typmod. Maybe there's a reason why their approach
> wouldn't be a good fit for this, but it'd be worth investigating.

We pack a short type number, two flags, and 24 bits of SRID number into an integer. The SRID number is in turn a foreign key into the spatial_ref_sys table, where the fully spelled-out spatial reference definition lives. It's not very nice, and it's quite breakable, since there's no foreign-key integrity between the typmod and the spatial_ref_sys pk.

P.
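The packing Paul describes can be sketched roughly like this (the field widths and bit positions here are illustrative, not PostGIS's exact layout):

```python
def pack_typmod(geom_type, flag_z, flag_m, srid):
    """Pack fields into one integer, in the spirit of the PostGIS typmod:
    24 bits of SRID, two flag bits, and a small type number.
    (Illustrative layout; PostGIS's actual bit assignments differ.)"""
    assert 0 <= srid < (1 << 24) and 0 <= geom_type < (1 << 6)
    return (srid << 8) | (flag_z << 7) | (flag_m << 6) | geom_type

def unpack_typmod(typmod):
    """Recover the packed fields as (srid, flag_z, flag_m, geom_type)."""
    return ((typmod >> 8) & 0xFFFFFF,   # 24-bit SRID
            (typmod >> 7) & 1,          # Z flag
            (typmod >> 6) & 1,          # M flag
            typmod & 0x3F)              # type number

tm = pack_typmod(geom_type=3, flag_z=1, flag_m=0, srid=4326)
assert unpack_typmod(tm) == (4326, 1, 0, 3)
```

The SRID then acts as the lookup key into spatial_ref_sys, which is exactly the fragility Paul notes: nothing enforces that the packed SRID actually exists in that table.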
On Mon, Apr 25, 2016 at 11:06:11PM -0700, 陈天舟 wrote:
> I am interested in adding Protocol Buffer support for Postgres. Protocol
> Buffer occupies less space than JSON. More importantly, it has schema and
> is forward/backward compatible. All these make it a very good format for
> persistency.

Thanks for the idea.

There are many different serializations, some of which are listed below, each with their own qualities, which can be good or bad with a very strong dependency on context:

https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats

Should we see about making a more flexible serialization infrastructure? What we have is mostly /ad hoc/, and has already caused real pain to the PostGIS folks, this even after some pretty significant and successful efforts were made in this direction.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter  Skype: davidfetter
XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, Apr 26, 2016 at 2:40:49PM -0700, David Fetter wrote:
> Should we see about making a more flexible serialization
> infrastructure? What we have is mostly /ad hoc/, and has already
> caused real pain to the PostGIS folks, this even after some pretty
> significant and successful efforts were made in this direction.

Hi all. Is anybody working on this right now? I would like to pick up this task for the summer.

First of all, what do you think about what David said? Should we try to design a generic infrastructure for similar serialization datatypes? If so, will we need to refactor some pieces of the JSON/XML implementation? I looked over the code and it seems nicely decoupled, but I am not sure what this would involve. I've done this before for MySQL[1] (not yet completed), but I'd love to try it for PostgreSQL too.

On Tue, Apr 26, 2016 at 11:23:11AM -0700, José Luis Tallón wrote:
> Have you investigated JSONB vs ProtoBuf space usage?
> (the key being the "B" -- Postgres' own binary JSON implementation)

This is something I can investigate further, but another (possibly major) benefit of Protocol Buffers over JSON is that they *still* have a schema. I think they combine advantages from both structured and schemaless data. My best guess is that we shouldn't focus on abstracting *any* serialization paradigm, but only the ones that have a schema (like Thrift or Protocol Buffers).

Speaking of schemas, where is the best place to keep them? For MySQL I opted for a plain text file, similar to the .trg files MySQL uses for storing triggers.

I'd love to talk more about this. Thank you.

Flavius Anton

[1] https://github.com/google/mysql-protobuf
On Thu, Jun 23, 2016 at 8:15 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
>
> I'd love to talk more about this.

I thought quite a bit about this a few years ago but never really picked it up to work on. There are different use cases where this could be useful, and different approaches suited to those different use cases.

One idea was to have a protocol buffer data type which used the typmod as an identifier to specify the schema -- perhaps as an oid primary key from some other table, or something like how enums are handled. That would be used much like JSON is used by languages where that's natural, but might be useful if you're in an environment where data is being provided in protobufs and you need to send them on in protobufs.

Another option would be to allow the output of your select query to be returned as a protobuf. That might be a feature to add in libpq rather than in the server, or perhaps as a new protocol feature that libpq would then just switch to, which might make it easier to use from other languages. That might make it easier to use Postgres as a data store for an environment where everything is in protobufs, without losing the full power of SQL schemas in the database.

As an aside, have you seen Cap'n Proto? It looks more interesting to me than protobufs. It fixes a lot of the corner cases where protobufs end up with unbounded computational complexity, and seems a lot cleaner. I haven't thought about it much, but I wonder if it would be possible to actually use Cap'n Proto capabilities to grant access to Postgres RLS labels too.

--
greg
On Thu, Jun 23, 2016 at 1:50 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Jun 23, 2016 at 8:15 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
>>
>> I'd love to talk more about this.
>
> I thought quite a bit about this a few years ago but never really
> picked it up to work on.
>
> Another option would be to allow the output of your select query to be
> returned as a protobuf. That might be a feature to add in libpq rather
> than in the server, or perhaps as a new protocol feature that libpq
> would then just switch to, which might make it easier to use from other
> languages. That might make it easier to use Postgres as a data store
> for an environment where everything is in protobufs, without losing the
> full power of SQL schemas in the database.

I agree on this one; I think it's the most /natural/ way of doing things.

> As an aside, have you seen Cap'n Proto? It looks more interesting to
> me than protobufs. It fixes a lot of the corner cases where protobufs
> end up with unbounded computational complexity, and seems a lot
> cleaner.

I've seen it around for quite some time, but my fear is that it is not (yet?) widely adopted. Protocol Buffers themselves are not that popular, let alone Cap'n Proto. By the way, there's also the newer FlatBuffers[1] project, also from Google. They seem a lot like Cap'n Proto, though.

I think it's doable to provide some sort of abstraction for these protocols in PostgreSQL, so that it can eventually support all of them with minimal effort for adding a new one. However, I am skeptical about the practical advantages of having /all/ of them supported.

--
Flavius

[1] https://github.com/google/flatbuffers
On Thu, Jun 23, 2016 at 2:54 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
> On Thu, Jun 23, 2016 at 1:50 PM, Greg Stark <stark@mit.edu> wrote:
>> On Thu, Jun 23, 2016 at 8:15 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
>>>
>>> I'd love to talk more about this.
>>
>> I thought quite a bit about this a few years ago but never really
>> picked it up to work on.

Any other thoughts on this? My guess is that it might be an important addition to Postgres that could attract even more users, but I am not sure there's enough interest from the community. If I want to pick up this task, how should I move forward? Do I need to write a design document, or should I come up with a patch implementing a draft prototype? I am new to this community and don't know the code yet, so I'd appreciate some guidance from an older, more experienced member.

Thanks.

--
Flavius
On 24/06/16 14:23, Flavius Anton wrote:
> Any other thoughts on this? My guess is that it might be an important
> addition to Postgres that can attract even more users, but I am not
> sure if there's enough interest from the community. If I want to pick
> this task, how should I move forward? Do I need to write a design
> document or similar or should I come up with a patch that implements a
> draft prototype? I am new to this community and don't know the code
> yet, so I'd appreciate some guidance from an older, more experienced
> member.

Other than protobuf, there are also other serialization formats that might be worth considering. Not too long ago I suggested working specifically on serialization formats for the json/jsonb types:

https://www.postgresql.org/message-id/56CB8A62.40100%408kdata.com

I believe this effort is in the same boat.

Álvaro

--
Álvaro Hernández Tortosa
8Kdata
On Fri, Jun 24, 2016 at 11:35 AM, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
> Other than protobuf, there are also other serialization formats that
> might be worth considering. Not too long ago I suggested working
> specifically on serialization formats for the json/jsonb types:
> https://www.postgresql.org/message-id/56CB8A62.40100%408kdata.com
> I believe this effort is in the same boat.

Sure, there are a bunch of these, some of the most popular being:

* Cap'n Proto
* FlatBuffers
* Thrift

A longer list has already been compiled here[1].

Meanwhile, I came across another interesting idea. What if, for starters, we don't introduce a completely new serialization format like Protocol Buffers, but rather make the existing JSON support stronger and more interesting? What I am thinking of is a JSON schema (optionally) associated with a JSON column. That way, you no longer have to store the keys on disk, and the database can check JSON validity at INSERT time. I think those are two big advantages. What do you think about this one?

--
Flavius

[1] https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats
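A toy Python sketch of that idea (the schema representation here, an ordered list of (key, type) pairs, is invented for illustration and is not a proposal for the on-disk format): validate at write time, then store only the values positionally, so key names never reach disk.

```python
import json

# Hypothetical per-column schema: ordered (key, type) pairs.
SCHEMA = [("id", int), ("name", str), ("active", bool)]

def to_storage(doc, schema=SCHEMA):
    """Validate against the schema and serialize values positionally,
    so key names are never written out."""
    if set(doc) != {k for k, _ in schema}:
        raise ValueError("keys do not match schema")
    for key, typ in schema:
        if not isinstance(doc[key], typ):
            raise ValueError(f"{key}: expected {typ.__name__}")
    return json.dumps([doc[k] for k, _ in schema])

def from_storage(raw, schema=SCHEMA):
    """Reattach the key names from the schema on the way out."""
    return dict(zip((k for k, _ in schema), json.loads(raw)))

stored = to_storage({"id": 1, "name": "ada", "active": True})
assert stored == '[1, "ada", true]'
assert from_storage(stored) == {"id": 1, "name": "ada", "active": True}
```

This captures both claimed advantages: the stored form carries no keys, and a document with missing keys or wrong types is rejected at write time.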
On Fri, Jun 24, 2016 at 2:23 PM, Flavius Anton <f.v.anton@gmail.com> wrote:
> Any other thoughts on this? My guess is that it might be an important
> addition to Postgres that can attract even more users, but I am not
> sure if there's enough interest from the community. If I want to pick
> this task, how should I move forward? Do I need to write a design
> document or similar or should I come up with a patch that implements a
> draft prototype? I am new to this community and don't know the code
> yet, so I'd appreciate some guidance from an older, more experienced
> member.

It's not very clear to me from this discussion what you actually want to implement, and I think your chances of implementing it successfully and of getting it committed depend heavily on what you actually intend to do.

If you want to implement a new datatype, that can be done as a contrib module and, if it doesn't go into the core distribution, it can always go up on GitHub and people can use it if they like it. If you want to implement a tool for validating JSON documents against a JSON schema, you can just write that as a function in a loadable module, and people who want to use it can add a CHECK (json_validate_schema(mycolumn, 'schema or a reference thereunto goes here')) constraint. Again, if core doesn't want it, you can still distribute it and people can use it with unmodified PostgreSQL. And similarly, if you want to convert records from PostgreSQL's native format to some new serialization format, there is precedent for that with JSON, and it can all be done as an extension to start.

If you do something that touches the wire protocol or adds new formats in which data can be returned to the client, I think your chances of getting that committed to core are extremely low. We don't really like to change the wire protocol, and every time we add some extension to it, the amount of actual use that people get out of it is contingent on all of the various reimplementations of that protocol in various language-specific drivers also being extended to support that same thing. It's not worth going through that kind of hassle very often, and I don't think this would come close to having enough value to make that level of pain worthwhile, given that there are so many other ways of accomplishing the same thing: you can use a function in the query to transform the data before sending it, or you can have the client transform it afterwards. Moreover, if your work doesn't go into core, then it's probably going to die on the vine, because who wants a modified server version that breaks wire protocol compatibility?

I'm not taking any position on the intrinsic value of protocol buffer support beyond the broad statement that I don't personally have any reason to care about it. That is, I neither support nor oppose the idea of accepting code that does stuff related to protocol buffers. I am only offering a general opinion that something that can exist as an extension is likely (a) easier to implement, (b) easier to get committed, and (c) still potentially useful if it doesn't get committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
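Robert's json_validate_schema is a name he coined for the example, not an existing function; a minimal Python sketch of what such a validator might do (using a toy key-to-type-name mapping for the schema, not the actual JSON Schema specification):

```python
import json

def json_validate_schema(document, schema):
    """Toy validator: schema maps key -> expected JSON type name.
    A real implementation would follow the JSON Schema spec."""
    types = {"string": str, "number": (int, float), "boolean": bool,
             "object": dict, "array": list}
    doc = json.loads(document)
    if not isinstance(doc, dict):
        return False
    return all(
        key in doc and isinstance(doc[key], types[expected])
        for key, expected in schema.items()
    )

schema = {"id": "number", "name": "string"}
assert json_validate_schema('{"id": 7, "name": "x"}', schema)
assert not json_validate_schema('{"id": "oops", "name": "x"}', schema)
```

Wrapped in a loadable module (or even PL/Python), this is the kind of boolean function that could back the CHECK constraint in Robert's example, entirely outside core.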