Re: [HACKERS] GSOC - TOAST'ing in slices - Mailing list pgsql-hackers

From George Papadrosou
Subject Re: [HACKERS] GSOC - TOAST'ing in slices
Date
Msg-id 8F4E7EF9-153A-42F8-913E-5A3A2EE01473@gmail.com
In response to Re: [HACKERS] GSOC Introduction / Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions (Stephen Frost <sfrost@snowman.net>)
Responses Re: [HACKERS] GSOC - TOAST'ing in slices  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] GSOC - TOAST'ing in slices  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
List pgsql-hackers
Hello!

Thank you for your message. I was just about to send this email when I got yours. 

I don't recall seeing an email from you about this yet?  My apologies if
I missed it

My apologies for the inconvenience. I wish I could have started on this earlier, but there was so much coursework reaching its deadline.

I have prepared a very basic proposal for the TOAST project, which I am attaching below. As you will notice, it is still quite rough. I would appreciate some guidance on how we could dig deeper into the project’s details so that I can elaborate on them in the proposal.

Also, I haven’t considered the PostGIS project when thinking of TOAST-able data types, so I will study it a bit in the meantime.

Please find the proposal draft below. 
Thanks!
George

Abstract

In PostgreSQL, a large field value is compressed and then stored in chunks in a separate table called a TOAST table [1]. Currently there is no record of which part of the original data ended up in which chunk. If only a subset of the value is needed, all of the chunks have to be fetched, reassembled, and decompressed to recover the original value.

The project’s idea is to implement different slicing approaches according to the value’s data type. For example, a text field could be split on character boundaries, while a JSON document would be split in a way that allows fast access to its keys or values.
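
As a rough illustration of slicing text on character boundaries, the Python sketch below compresses each slice independently and records the character offset at which it starts, so a substring read only touches the overlapping slices. The slice size, the (offset, blob) layout and the function names are all assumptions made for this example, not a proposed on-disk format.

```python
import zlib
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical slice size, counted in characters rather than bytes.
SLICE_CHARS = 1000


def slice_text(value: str) -> List[Tuple[int, bytes]]:
    """Split on character boundaries and compress each slice on its own.

    Returns (starting character offset, compressed slice) pairs.
    """
    return [
        (i, zlib.compress(value[i:i + SLICE_CHARS].encode("utf-8")))
        for i in range(0, len(value), SLICE_CHARS)
    ]


def read_substring(slices: List[Tuple[int, bytes]],
                   start: int, length: int) -> Tuple[str, int]:
    """Return (value[start:start+length], number of slices decompressed)."""
    if length <= 0 or not slices:
        return "", 0
    offsets = [off for off, _ in slices]
    # Index of the last slice whose starting offset is <= start.
    first = max(bisect_right(offsets, start) - 1, 0)
    end = start + length
    parts = []
    touched = 0
    for off, blob in slices[first:]:
        if off >= end:
            break  # remaining slices lie entirely past the requested range
        parts.append(zlib.decompress(blob).decode("utf-8"))
        touched += 1
    base = offsets[first]
    return "".join(parts)[start - base:end - base], touched
```

Because slices are cut by character count, a multi-byte UTF-8 character is never split across two slices, and the number of slices decompressed depends on the requested range rather than on the size of the whole value.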

Benefits to the PostgreSQL Community

Knowing about the data that each chunk holds, we could keep important chunks closer to computations as well as store them in indices.

Project details

?

Deliverables 

- Implement “semantic” slicing for data types whose values can be sliced into TOAST tables: the Text, Array, JSON/JSONB, and XML data types.

- Include the important chunks in the indices? (I am not yet sure what data the indices contain.)

Timeline

- Until May 30: Study Postgres internals and on-disk data structures, review the relevant code and algorithms, define slicing approaches, and agree on implementation details.

- Until June 26: Implement the slicing approaches for the Text, Array, JSON/JSONB, and XML data types

- June 26 - 30: Student/Mentor evaluations period and safety buffer 

- Until July 24: Make indices take advantage of the new slicing approaches

- July 24 - 28: Student/Mentor evaluations period and safety buffer

- Until August 21: Improve testing and documentation

- August 21 - 29: Submit code and final evaluations

Bio 

Contact 
Name, email, phone etc




On 15 Mar 2017, at 03:39, Stephen Frost <sfrost@snowman.net> wrote:

George,

* George Papadrosou (gpapadrosou@gmail.com) wrote:
I understand your efforts and I am willing to back down. This is not the only project that appeals to me :)

Thank you very much for your willingness to adapt. :)

Mr. Frost, Mr. Munro, thank you for your suggestions. I am now deciding between the TOAST’ing in slices project and the predicate locking project. I like that the “toasting” project is related to on-disk data structures, so I will probably send you an email about that later today.

I have added Alexander Korotkov to the CC list as he was
also listed as a possible mentor for TOAST'ing in slices.

As it relates to TOAST'ing in slices, it would be good to think through
how we would represent and store the information about how a particular
object has been split up.  Note that PostgreSQL is very extensible in
its type system and therefore we would need a way for new data types
which are added to the system to be able to define how data of that data
type is to be split and a way to store the information they need to
regarding such a split.
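
One way to picture this requirement, purely as an illustration and not something proposed in this thread, is a per-type registry of slicing callbacks, each returning the slices together with whatever split information that type needs to keep. The Python sketch below models the idea; register_slicer, slice_value, the "demo_points" type and the shape of the split information are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A slicer maps a serialized value to (slices, split_info).
# Both the callback shape and split_info are assumptions for this sketch.
Slicer = Callable[[bytes], Tuple[List[bytes], dict]]

_slicers: Dict[str, Slicer] = {}


def register_slicer(type_name: str, slicer: Slicer) -> None:
    """Let a (possibly extension-defined) data type supply its own slicer."""
    _slicers[type_name] = slicer


def fixed_size_slicer(value: bytes, size: int = 2000) -> Tuple[List[bytes], dict]:
    """Fallback for types without a slicer: plain fixed-size pieces."""
    slices = [value[i:i + size] for i in range(0, len(value), size)]
    return slices, {"kind": "fixed", "size": size}


def point_list_slicer(value: bytes) -> Tuple[List[bytes], dict]:
    """Example type-specific slicer: keep ';'-separated points whole."""
    points = value.split(b";")
    per_slice = 100
    slices = [
        b";".join(points[i:i + per_slice])
        for i in range(0, len(points), per_slice)
    ]
    return slices, {"kind": "points", "points_per_slice": per_slice}


def slice_value(type_name: str, value: bytes) -> Tuple[List[bytes], dict]:
    """Dispatch to the registered slicer, or fall back to fixed-size pieces."""
    return _slicers.get(type_name, fixed_size_slicer)(value)
```

After register_slicer("demo_points", point_list_slicer), values of that made-up type are split between points, while unregistered types fall back to fixed-size pieces. The returned split information is what would have to be stored alongside the slices.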

In particular, the PostGIS project adds multiple data types which are
variable in length and often end up TOAST'd because they are large
geospatial objects; anything we come up with for TOAST'ing in slices
will need to be something that the PostGIS project could leverage.

In general, I would like to undertake a project interesting enough and important for Postgres. Also, I could take into account if you favor one over another, so please let me know. I understand that these projects should be strictly defined to fit in the GSOC period, however the potential for future improvements or challenges is what drives and motivates me.

We are certainly very interested in having you continue on and work with
the PostgreSQL community moving forward, though we do need to be sure to
scope the project goals within the GSOC requirements.

Thanks!

Stephen
