Thread: GSoC 2018: thrift encoding format

From
Charles Cui
Date:
Hi Aleksander,
   
   I have started studying the Thrift encoding format (binary protocol) and found this document (https://erikvanoosten.github.io/thrift-missing-specification/#_struct_encoding). I have one question: can I assume that data in Thrift is always sent inside a struct? Otherwise, some other information would be needed to distinguish a number from a string (for example). I think this question is also valid for protobuf?


Thanks, Charles.

Re: GSoC 2018: thrift encoding format

From
Aleksander Alekseev
Date:
Hello Charles,

> Can I assume the data in thrift is always send inside a struct?

Sure!

> I think this question also valid for protobuf?

Right, pg_protobuf assumes that data is always a message which is an
equivalent of Thrift's struct.

--
Best regards,
Aleksander Alekseev
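
To illustrate why the struct assumption answers the question: in the binary protocol described in the document Charles links, every struct field starts with a one-byte type ID and a two-byte field ID, so a reader can tell a number from a string without outside information. The following is only a sketch in Python (the extension itself would be written in C); the function name decode_simple_struct and the hand-built sample bytes are hypothetical, only simple types are handled, and strings are assumed to be UTF-8 text.

```python
import struct

# Type IDs used by the Thrift binary protocol (simple types only).
T_STOP, T_BOOL, T_BYTE, T_DOUBLE = 0, 2, 3, 4
T_I16, T_I32, T_I64, T_STRING = 6, 8, 10, 11

# struct format codes for the fixed-width types (all big-endian).
FIXED = {T_BYTE: ">b", T_DOUBLE: ">d", T_I16: ">h", T_I32: ">i", T_I64: ">q"}


def decode_simple_struct(data: bytes) -> dict:
    """Decode a struct of simple fields into {field_id: value}."""
    fields = {}
    pos = 0
    while True:
        type_id = data[pos]              # 1-byte type ID tells us what follows
        pos += 1
        if type_id == T_STOP:            # a zero byte ends the struct
            return fields
        (field_id,) = struct.unpack_from(">h", data, pos)
        pos += 2
        if type_id == T_BOOL:
            value = data[pos] != 0
            pos += 1
        elif type_id in FIXED:
            fmt = FIXED[type_id]
            (value,) = struct.unpack_from(fmt, data, pos)
            pos += struct.calcsize(fmt)
        elif type_id == T_STRING:
            # 4-byte length prefix, then the bytes (assumed UTF-8 here).
            (length,) = struct.unpack_from(">i", data, pos)
            pos += 4
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unsupported type id {type_id}")
        fields[field_id] = value


# Hand-built example: field 1 is an i32 (42), field 2 is a string ("hi").
sample = (
    b"\x08\x00\x01\x00\x00\x00\x2a"      # type i32, field id 1, value 42
    b"\x0b\x00\x02\x00\x00\x00\x02hi"    # type string, field id 2, length 2
    b"\x00"                              # stop byte
)
print(decode_simple_struct(sample))      # {1: 42, 2: 'hi'}
```

A bare value outside a struct carries no such header, which is why the reader would otherwise need to be told the type in advance.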

Re: GSoC 2018: thrift encoding format

From
Charles Cui
Date:
Thanks for confirming, Aleksander!
I am also thinking about how to deal with complex
data structures like map, list, or set. I guess one possible
solution is to get the raw data bytes for these data structures?
Otherwise it could be hard to wrap them into a Datum.

2018-05-02 8:38 GMT-07:00 Aleksander Alekseev <a.alekseev@postgrespro.ru>:
> Hello Charles,
>
> > Can I assume the data in thrift is always send inside a struct?
>
> Sure!
>
> > I think this question also valid for protobuf?
>
> Right, pg_protobuf assumes that data is always a message which is an
> equivalent of Thrift's struct.
>
> --
> Best regards,
> Aleksander Alekseev

Re: GSoC 2018: thrift encoding format

From
Aleksander Alekseev
Date:
Hello Charles,

> Thanks for your confirm Aleksander!
> Also I am thinking of how to deal with complex
> data structure like map, list, or set. I guess one possible
> solution is to get raw data bytes for these data structure?
> Otherwise it could be hard to wrap into a Datum.

Personally I think raw data bytes are OK if functions for getting all
keys and values from this data are provided. Another possibility is just
converting Thrift to JSONB and vice versa. In this case only two
procedures are required and all the rest is available out-of-the-box.

--
Best regards,
Aleksander Alekseev
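
As a rough illustration of the Thrift-to-JSON direction mentioned above, here is a sketch in Python (not the extension's actual code, which would be C producing JSONB). It assumes the binary protocol layout: a list or set starts with a one-byte element type and a four-byte count, and a map starts with one-byte key and value types plus a four-byte count. The names read_value and thrift_to_json are hypothetical, only a few element types are supported, and field IDs and map keys are turned into strings because JSON object keys must be strings.

```python
import json
import struct

T_BOOL, T_I32, T_STRING = 2, 8, 11
T_STRUCT, T_MAP, T_SET, T_LIST = 12, 13, 14, 15


def read_value(data: bytes, pos: int, type_id: int):
    """Return (value, new_pos) for one value of the given type."""
    if type_id == T_BOOL:
        return data[pos] != 0, pos + 1
    if type_id == T_I32:
        return struct.unpack_from(">i", data, pos)[0], pos + 4
    if type_id == T_STRING:
        (length,) = struct.unpack_from(">i", data, pos)
        start = pos + 4
        return data[start:start + length].decode("utf-8"), start + length
    if type_id in (T_LIST, T_SET):
        # Header: 1-byte element type, 4-byte element count.
        elem_type, count = struct.unpack_from(">Bi", data, pos)
        pos += 5
        items = []
        for _ in range(count):
            item, pos = read_value(data, pos, elem_type)
            items.append(item)
        return items, pos
    if type_id == T_MAP:
        # Header: 1-byte key type, 1-byte value type, 4-byte pair count.
        key_type, val_type, count = struct.unpack_from(">BBi", data, pos)
        pos += 6
        result = {}
        for _ in range(count):
            key, pos = read_value(data, pos, key_type)
            val, pos = read_value(data, pos, val_type)
            result[str(key)] = val       # JSON object keys must be strings
        return result, pos
    if type_id == T_STRUCT:
        result = {}
        while data[pos] != 0:            # zero byte marks the end of the struct
            field_type = data[pos]
            (field_id,) = struct.unpack_from(">h", data, pos + 1)
            value, pos = read_value(data, pos + 3, field_type)
            result[str(field_id)] = value
        return result, pos + 1
    raise ValueError(f"unsupported type id {type_id}")


def thrift_to_json(data: bytes) -> str:
    """Convert one top-level struct to a JSON string."""
    value, _ = read_value(data, 0, T_STRUCT)
    return json.dumps(value, sort_keys=True)


# Hand-built example: field 1 = list<i32> [1, 2], field 2 = map<string,i32> {"a": 7}.
sample = (
    b"\x0f\x00\x01" b"\x08\x00\x00\x00\x02"
    b"\x00\x00\x00\x01" b"\x00\x00\x00\x02"
    b"\x0d\x00\x02" b"\x0b\x08\x00\x00\x00\x01"
    b"\x00\x00\x00\x01a" b"\x00\x00\x00\x07"
    b"\x00"
)
print(thrift_to_json(sample))            # {"1": [1, 2], "2": {"a": 7}}
```

Note that this direction loses information (sets become arrays, field types are not recorded), so the reverse conversion would need a schema or extra type hints.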

Re: GSoC 2018: thrift encoding format

From
Vladimir Sitnikov
Date:
> Personally I think raw data bytes are OK if functions for getting all
> keys and values from this data are provided

What is the purpose of using Thrift "encoding" if it turns out to be a simple wrapper for existing binary data?

Do you mean the goal is to have "get/set" functions to fetch data out of a bytea field?

Frankly speaking, I can hardly imagine why one would want to store a MAP Datum inside Thrift inside bytea.

Vladimir

Re: GSoC 2018: thrift encoding format

From
Aleksander Alekseev
Date:
Hello Vladimir,

>> Personally I think raw data bytes are OK if functions for getting all
>> keys and values from this data are provided
>
> What is the purpose of using Thrift "encoding" if it turns out to be a
> simple wrapper for existing binary data?
>
> Do you mean the goal is to have "get/set" functions to fetch data out of
> bytea field?

I mean Charles is free to choose the interface for the extension he
believes is right. There would not be much learning left in the project
if all design decisions were made beforehand.

Personally I would probably just write a Thrift<->JSONB converter. But
there are pros and cons to this approach. For instance, the CPU and memory
overhead of creating and storing a temporary JSONB object is an obvious
drawback. On the other hand there are time limits for this project and
thus it makes sense to implement a feature as fast and as simple as
possible, and optimize it later (if necessary).

Maybe Charles likes to optimize everything. In this case he may choose
to implement all the getters and setters from scratch. This doesn't
exclude the possibility of implementing the Thrift<->JSONB converter later.

Should Thrift objects be represented in the DBMS as a special Thrift
type, or as raw bytea? Personally I don't care. Once again, there are
pros and cons. It's good to have a bit of additional type safety. On the
other hand, it's not convenient to cast Thrift<->bytea all the time, and
if we add implicit casting there will be little type safety left. In
the pg_protobuf extension I chose to store Protobuf as bytea, but if
Charles prefers to introduce a separate type that's fine by me.

--
Best regards,
Aleksander Alekseev

Re: GSoC 2018: thrift encoding format

From
Vladimir Sitnikov
Date:
> I mean Charles is free to choose the interface for the extension he
> believes is right

I'm just trying to figure out what the use cases for that Thrift extension are.

For instance, it would be interesting if Thrift were an alternative way to transfer data between the client and the database. I guess it could simplify building efficient clients.

Vladimir

Re: GSoC 2018: thrift encoding format

From
Aleksander Alekseev
Date:
Hello Vladimir,

> I'm just trying to figure out what are the use cases for using that Thrift
> extension.

You can find an answer in the project description:

https://wiki.postgresql.org/wiki/GSoC_2018#Thrift_datatype_support_.282018.29

--
Best regards,
Aleksander Alekseev

Re: GSoC 2018: thrift encoding format

From
Stephen Frost
Date:
Greetings,

* Aleksander Alekseev (a.alekseev@postgrespro.ru) wrote:
> >> Personally I think raw data bytes are OK if functions for getting all
> >> keys and values from this data are provided
> >
> > What is the purpose of using Thrift "encoding" if it turns out to be a
> > simple wrapper for existing binary data?
> >
> > Do you mean the goal is to have "get/set" functions to fetch data out of
> > bytea field?
>
> I mean Charles is free to choose the interface for the extension he
> believes is right. There would be no much learning left in the project
> if all design decisions were made beforehand.

Perhaps the design decisions aren't all made beforehand, but they also
shouldn't be made in a vacuum- there should be discussions on -hackers
about what the right decision is for a given aspect and that's what
should be worked towards.

> Personally I would probably just write a Thrift<->JSONB converter. But
> there are pros and cons of this approach. For instance, CPU and memory
> overhead for creating and storing temporary JSONB object is an obvious
> drawback. On the other hand there are time limits for this project and
> thus it makes sense to implement a feature as fast and as simple as
> possible, and optimize it later (if necessary).

Just having such a converter would reduce the usefulness of this
extension dramatically, wouldn't it?  Considering the justification for
the extension used on the GSoC project page, it certainly strikes me as
losing most of the value if we just convert to JSONB.

> Maybe Charles likes to optimize everything. In this case he may choose
> to implement all the getters and setters from scratch. This doesn't
> exclude possibility of implementing the Thrift<->JSONB converter later.

Having a way to cast between the two is entirely reasonable, imv, but
that's very different from having the data only able to be stored as
JSONB..

> Should Thrift objects be represented in the DBMS as a special Thrift
> type, or as raw bytea? Personally I don't care. Once again, there are
> pros and cons. It's good to have a bit of additional type safety. On the
> other hand, it's not convenient to cast Thrift<->bytea all the time, and
> if we add implicit casting there will be little type safety left. In
> pg_protobuf extension I choose to store Protobuf as bytea, but if
> Charles prefer to introduce a separate type that's fine by me.

I understand that you're open to having it as a new data type or as a
bytea, but I don't agree.  This should be a new data type, just as json
is a distinct data type and so is jsonb.

Thanks!

Stephen

Re: GSoC 2018: thrift encoding format

From
Aleksander Alekseev
Date:
Hello Stephen,

> Perhaps the design decisions aren't all made beforehand, but they also
> shouldn't be made in a vacuum- there should be discussions on -hackers
> about what the right decision is for a given aspect and that's what
> should be worked towards.

+1, agree.

> > Personally I would probably just write a Thrift<->JSONB converter. But
> > there are pros and cons of this approach. For instance, CPU and memory
> > overhead for creating and storing temporary JSONB object is an obvious
> > drawback. On the other hand there are time limits for this project and
> > thus it makes sense to implement a feature as fast and as simple as
> > possible, and optimize it later (if necessary).
>
> Just having such a convertor would reduce the usefulness of this
> extension dramatically, wouldn't it?  Considering the justification for
> the extension used on the GSoC project page, it certainly strikes me as
> losing most of the value if we just convert to JSONB.
>
> > Maybe Charles likes to optimize everything. In this case he may choose
> > to implement all the getters and setters from scratch. This doesn't
> > exclude possibility of implementing the Thrift<->JSONB converter later.
>
> Having a way to cast between the two is entirely reasonable, imv, but
> that's very different from having the data only able to be stored as
> JSONB..

Good point.

> I understand that you're open to having it as a new data type or as a
> bytea, but I don't agree.  This should be a new data type, just as json
> is a distinct data type and so is jsonb.

Could you please explain in a little more detail why you believe so?
Also I wonder whether in your opinion the extension should provide
implicit casts from/to bytea as well.

--
Best regards,
Aleksander Alekseev

Re: GSoC 2018: thrift encoding format

From
Stephen Frost
Date:
Greetings,

* Aleksander Alekseev (a.alekseev@postgrespro.ru) wrote:
> > I understand that you're open to having it as a new data type or as a
> > bytea, but I don't agree.  This should be a new data type, just as json
> > is a distinct data type and so is jsonb.
>
> Could you please explain in a little more detail why you believe so?

As mentioned elsewhere, there are multiple ways to encode thrift, no?  We
should pick which one makes sense and make that the interface to the
data type, and then we might actually store the data differently. Not to
mention that we'll likely want to add things like indexing
capabilities to this data type, as we have for jsonb, and that's much
cleaner to do with a proper data type than if everyone has to use bytea
to store the data and then functional indexes (if we could even make
that happen...  I'm not thrilled with such an idea in any case).

Data validation is another thing- if it's a thrift data type then we can
validate that it's correct on the way in, and depend on that correctness
on the way out (to some extent- obviously we have to be wary of
corruption possibilities and such).

We could toss out all of our data types and store everything as bytea's
if we wanted to, but we don't, and for quite a few good reasons, these
are just a couple that I'm thinking of off-hand.

> Also I wonder whether in your opinion the extension should provide
> implicit casts from/to bytea as well.

I wouldn't make them implicit...

Thanks!

Stephen
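
The two points above (validate on the way in, and keep casts to and from bytea explicit) can be sketched outside PostgreSQL as a wrapper type. This is only an analogy in Python, not how a PostgreSQL data type is implemented: the class name ThriftValue and the helpers _need and _skip are hypothetical, and the byte layout assumed is the Thrift binary protocol with one top-level struct per value.

```python
import struct

# Byte widths of the fixed-size binary protocol types
# (bool, byte, double, i16, i32, i64).
FIXED_WIDTH = {2: 1, 3: 1, 4: 8, 6: 2, 8: 4, 10: 8}
T_STRING, T_STRUCT, T_MAP, T_SET, T_LIST = 11, 12, 13, 14, 15


def _need(data: bytes, pos: int, n: int) -> None:
    """Raise ValueError unless n more bytes are available at pos."""
    if n < 0 or pos + n > len(data):
        raise ValueError(f"truncated or malformed input at offset {pos}")


def _skip(data: bytes, pos: int, type_id: int) -> int:
    """Return the offset just past one value, checking bounds on the way."""
    if type_id in FIXED_WIDTH:
        _need(data, pos, FIXED_WIDTH[type_id])
        return pos + FIXED_WIDTH[type_id]
    if type_id == T_STRING:
        _need(data, pos, 4)
        (length,) = struct.unpack_from(">i", data, pos)
        _need(data, pos + 4, length)
        return pos + 4 + length
    if type_id in (T_LIST, T_SET):
        _need(data, pos, 5)
        elem_type, count = struct.unpack_from(">Bi", data, pos)
        _need(data, pos + 5, 0 if count >= 0 else -1)   # reject negative counts
        pos += 5
        for _ in range(count):
            pos = _skip(data, pos, elem_type)
        return pos
    if type_id == T_MAP:
        _need(data, pos, 6)
        key_type, val_type, count = struct.unpack_from(">BBi", data, pos)
        _need(data, pos + 6, 0 if count >= 0 else -1)   # reject negative counts
        pos += 6
        for _ in range(count):
            pos = _skip(data, pos, key_type)
            pos = _skip(data, pos, val_type)
        return pos
    if type_id == T_STRUCT:
        while True:
            _need(data, pos, 1)
            field_type = data[pos]
            if field_type == 0:          # stop byte ends the struct
                return pos + 1
            _need(data, pos + 1, 2)      # 2-byte field id
            pos = _skip(data, pos + 3, field_type)
    raise ValueError(f"unknown type id {type_id} at offset {pos}")


class ThriftValue:
    """Hypothetical wrapper: bytes are validated once, on the way in."""

    def __init__(self, raw: bytes):
        if _skip(raw, 0, T_STRUCT) != len(raw):
            raise ValueError("trailing bytes after struct")
        self._raw = bytes(raw)

    def to_bytes(self) -> bytes:
        """Explicit cast back to raw bytes; nothing converts implicitly."""
        return self._raw


ok = ThriftValue(b"\x08\x00\x01\x00\x00\x00\x2a\x00")   # one i32 field, then stop
print(len(ok.to_bytes()))                               # 8
```

Code that receives a ThriftValue can rely on the structure having been checked already, whereas code that receives plain bytes has to re-check every time, which is roughly the argument for a distinct type over bytea.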

Re: GSoC 2018: thrift encoding format

From
Charles Cui
Date:
Thanks, guys, for your ideas! I feel it is easier for a Postgres
beginner like me to follow pg_protobuf's approach when designing and
implementing pg_thrift. I can refer to pg_protobuf's way of
using functions, writing tests, etc. I will reconsider what the
returned format for lists, sets, structs, etc. should be when I get to that
part. Right now, I assume the input is a series of Thrift bytes, and
I am trying to implement the decoding logic for simple types.

2018-05-04 8:42 GMT-07:00 Stephen Frost <sfrost@snowman.net>:
> Greetings,
>
> * Aleksander Alekseev (a.alekseev@postgrespro.ru) wrote:
> > > I understand that you're open to having it as a new data type or as a
> > > bytea, but I don't agree.  This should be a new data type, just as json
> > > is a distinct data type and so is jsonb.
> >
> > Could you please explain in a little more detail why you believe so?
>
> As mentioned elsewhere, there's multiple ways to encode thrift, no?  We
> should pick which one makes sense and make that the interface to the
> data type and then we might actually store the data differently, not to
> mention that we'll likely want to build on things like indexing
> capabilities to this data type, as we have for jsonb, and that's much
> cleaner to do with a proper data type than if everyone has to use bytea
> to store the data and then functional indexes (if we could even make
> that happen...  I'm not thrilled with such an idea in any case).
>
> Data validation is another thing- if it's a thrift data type then we can
> validate that it's correct on the way in, and depend on that correctness
> on the way out (to some extent- obviously we have to be wary of
> corruption possibilities and such).
>
> We could toss out all of our data types and store everything as bytea's
> if we wanted to, but we don't, and for quite a few good reasons, these
> are just a couple that I'm thinking of off-hand.
>
> > Also I wonder whether in your opinion the extension should provide
> > implicit casts from/to bytea as well.
>
> I wouldn't make them implicit...
>
> Thanks!
>
> Stephen