Thread: [HACKERS] GSOC Introduction / Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions

Hello psql hackers,

my name is George Papadrosou, this is my first semester as graduate student at Georgia Tech and would like to submit a
proposalto Google Summer of Code, for the project "Eliminate O(N^2) scaling from rw-conflict tracking in serializable
transactions”.

A short bio, I have a CS undergraduate degree from Athens University of Economics and Business. I had taken two
databasescourses where the first one was about sql, relational algebra, xpath and generally using an RDBMS while the
secondone was more about the internals, like  storage, indexing, query planning and transactions. 

I have 3+ years professional experience in web technologies with a focus on the backend and I recently started my
masterwith specialization in computing systems. One of my first courses that I am finishing this May is High
PerformanceComputing(parallel algorithms), which seems to be closely related to this GSOC project. 

I have not done any research on databases yet but I regard this project as an opportunity to make an initial contact
withpostgres' internals until I dive more into database algorithms. My future goal is to work on databases full 
time.

I am going to prepare a draft proposal for this project and share it with you soon. The project’s description is pretty
clear,do you think it should be more strictly defined in the proposal? 

Until then, I would like to familiarize myself a bit with the codebase and fix some bug/todo. I didn’t find many [E]
markedtasks in the todo list so the task I was thinking is "\s without arguments (display history) fails with libedit,
doesn'tuse pager either - psql \s not working on OSX”. However, it works on my OSX El Capitan laptop with Postgres
9.4.4.Would you suggest some other starter task? 

Looking forward to hearing from you soon.
Best regards,
George



George,

* George Papadrosou (gpapadrosou@gmail.com) wrote:
> my name is George Papadrosou, this is my first semester as graduate student at Georgia Tech and would like to submit
aproposal to Google Summer of Code, for the project "Eliminate O(N^2) scaling from rw-conflict tracking in serializable
transactions”.

Fantastic!  I'll let Kevin comment on your questions regarding that
project.

> Until then, I would like to familiarize myself a bit with the codebase and fix some bug/todo. I didn’t find many [E]
markedtasks in the todo list so the task I was thinking is "\s without arguments (display history) fails with libedit,
doesn'tuse pager either - psql \s not working on OSX”. However, it works on my OSX El Capitan laptop with Postgres
9.4.4.Would you suggest some other starter task? 

One of the best things which you can do to start learning the PostgreSQL
code base is to review existing patches which have been proposed for
inclusion.  Now is a great time to be doing that as the feature freeze
for the next version of PostgreSQL (PG v10) is at the end of the month
and there's a ton of patches which need reviewing.

The patches which need reviewing can be seen here:

https://commitfest.postgresql.org/13/

There's lots of information on our wiki and other places about
developing PostgreSQL, you can start here:

https://wiki.postgresql.org/wiki/Development_information

and in particular:

https://wiki.postgresql.org/wiki/So,_you_want_to_be_a_developer%3F

We look forward to working with you!  Welcome!

Thanks!

Stephen

On Fri, Mar 10, 2017 at 2:04 PM, Stephen Frost <sfrost@snowman.net> wrote:
> George,
>
> * George Papadrosou (gpapadrosou@gmail.com) wrote:
>> my name is George Papadrosou, this is my first semester as graduate student at Georgia Tech and would like to submit
aproposal to Google Summer of Code, for the project "Eliminate O(N^2) scaling from rw-conflict tracking in serializable
transactions”.
>
> Fantastic!  I'll let Kevin comment on your questions regarding that
> project.

+1, welcome.  It would be very cool to see SERIALIZABLE get faster.

>> Until then, I would like to familiarize myself a bit with the codebase and fix some bug/todo. I didn’t find many [E]
markedtasks in the todo list so the task I was thinking is "\s without arguments (display history) fails with libedit,
doesn'tuse pager either - psql \s not working on OSX”. However, it works on my OSX El Capitan laptop with Postgres
9.4.4.Would you suggest some other starter task? 
>
> One of the best things which you can do to start learning the PostgreSQL
> code base is to review existing patches which have been proposed for
> inclusion.  Now is a great time to be doing that as the feature freeze
> for the next version of PostgreSQL (PG v10) is at the end of the month
> and there's a ton of patches which need reviewing.
>
> The patches which need reviewing can be seen here:
>
> https://commitfest.postgresql.org/13/

+1

This is a large commitfest so there's a pretty wide range of
interesting stuff to help out with, and many ways to help including
documentation, testing and code review.

<plug>
One small patch that just might interest you, based on your interest
in SERIALIZABLE and also in parallelism, is my fledgling attempt to
connect those two features:

https://commitfest.postgresql.org/13/1004/

Unfortunately SERIALIZABLE support has not yet been included with a
couple of really important recent Postgres features: parallelism and
streaming replication.  I suspect the latter may be quite hard to fix
and the former quite easy.  If you're currently studying the
SERIALIZABLE internal data structures then you might like to review
that patch and tell me if my optimism is entirely misplaced, and
figure out how to test and break it...
</plug>

That said, please pick anything that interests you!

--
Thomas Munro
http://www.enterprisedb.com



[including Mengxing Liu in response, for reasons that should become
obvious below...]

Hi George,

On Thu, Mar 9, 2017 at 6:49 PM, George Papadrosou <gpapadrosou@gmail.com> wrote:

> my name is George Papadrosou, this is my first semester as
> graduate student at Georgia Tech and would like to submit a
> proposal to Google Summer of Code, for the project "Eliminate
> O(N^2) scaling from rw-conflict tracking in serializable
> transactions”.

I was recently contacted off-list by Mengxing Liu, who has been
looking at the same project, and said he was planning to submit a
GSoC proposal.  Rather than have one of you sit this out, do either
of you feel comfortable taking a different project instead?  Since
you've both been looking at the serializable code and supporting
documents, perhaps one of you could change to the other suggested
Serializable project?

> I am going to prepare a draft proposal for this project and share
> it with you soon. The project’s description is pretty clear, do
> you think it should be more strictly defined in the proposal?

At a minimum, the proposal should include list of milestones you
expect to reach along the way, and a timeline indicating when you
expect to reach them.  Some description of the benchmarks you intend
to run would be also be very good.

> Until then, I would like to familiarize myself a bit with the
> codebase and fix some bug/todo. I didn’t find many [E] marked
> tasks in the todo list so the task I was thinking is "\s without
> arguments (display history) fails with libedit, doesn't use pager
> either - psql \s not working on OSX”. However, it works on my OSX
> El Capitan laptop with Postgres 9.4.4. Would you suggest some
> other starter task?

There is a CommitFest in progress; reviewing patches is a good way
to become involved and familiar with the community processes.

https://wiki.postgresql.org/wiki/CommitFest

--
Kevin Grittner



Hi George,

I am Mengxing Liu. Happy to meet someone with the same idea : )

I have been concentrating on it for a long time, reading papers, reading source codes, and discussing details with Mr
Grittner. So I really understand your passion on it. But definitely I don't want all these effects to be in vain. So,
maybea little ruthless, would you mind to consider transferring to the other one? 


> -----原始邮件-----
> 发件人: "Kevin Grittner" <kgrittn@gmail.com>
> 发送时间: 2017-03-10 22:57:03 (星期五)
> 收件人: "George Papadrosou" <gpapadrosou@gmail.com>, "刘梦醒" <liu-mx15@mails.tsinghua.edu.cn>
> 抄送: "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
> 主题: Re: [HACKERS] GSOC Introduction / Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions
>
> [including Mengxing Liu in response, for reasons that should become
> obvious below...]
>
> Hi George,
>
> On Thu, Mar 9, 2017 at 6:49 PM, George Papadrosou <gpapadrosou@gmail.com> wrote:
>
> > my name is George Papadrosou, this is my first semester as
> > graduate student at Georgia Tech and would like to submit a
> > proposal to Google Summer of Code, for the project "Eliminate
> > O(N^2) scaling from rw-conflict tracking in serializable
> > transactions”.
>
> I was recently contacted off-list by Mengxing Liu, who has been
> looking at the same project, and said he was planning to submit a
> GSoC proposal.  Rather than have one of you sit this out, do either
> of you feel comfortable taking a different project instead?  Since
> you've both been looking at the serializable code and supporting
> documents, perhaps one of you could change to the other suggested
> Serializable project?
>
> > I am going to prepare a draft proposal for this project and share
> > it with you soon. The project’s description is pretty clear, do
> > you think it should be more strictly defined in the proposal?
>
> At a minimum, the proposal should include list of milestones you
> expect to reach along the way, and a timeline indicating when you
> expect to reach them.  Some description of the benchmarks you intend
> to run would be also be very good.
>
> > Until then, I would like to familiarize myself a bit with the
> > codebase and fix some bug/todo. I didn’t find many [E] marked
> > tasks in the todo list so the task I was thinking is "\s without
> > arguments (display history) fails with libedit, doesn't use pager
> > either - psql \s not working on OSX”. However, it works on my OSX
> > El Capitan laptop with Postgres 9.4.4. Would you suggest some
> > other starter task?
>
> There is a CommitFest in progress; reviewing patches is a good way
> to become involved and familiar with the community processes.
>
> https://wiki.postgresql.org/wiki/CommitFest
>
> --
> Kevin Grittner


--
Mengxing Liu











> [two people interested in the same GSoC project]

It's an interesting problem to have.

If neither of you chooses to voluntarily back down, the obvious
resolution is for the mentors to vote on the better proposal.  If we
do that early enough during the student application period, there
might still be time for the person whose proposal wasn't chosen to
submit a proposal for an alternative project.  As I see it, that
means:
- I would tend to favor a proposal submitted on the first day  (beginning March 20 16:00 UTC), if only one is.
- I would push for very early voting by the PostgreSQL mentors.

--
Kevin Grittner



Kevin,

* Kevin Grittner (kgrittn@gmail.com) wrote:
> > [two people interested in the same GSoC project]
>
> It's an interesting problem to have.
>
> If neither of you chooses to voluntarily back down, the obvious
> resolution is for the mentors to vote on the better proposal.  If we
> do that early enough during the student application period, there
> might still be time for the person whose proposal wasn't chosen to
> submit a proposal for an alternative project.  As I see it, that
> means:
>
>  - I would tend to favor a proposal submitted on the first day
>    (beginning March 20 16:00 UTC), if only one is.
>
>  - I would push for very early voting by the PostgreSQL mentors.

Agreed.  Many thanks!

Stephen

Hi all and thank you for your quick replies.

> [two people interested in the same GSoC project]

Mr. Grittner thank you for sharing this ahead of time.


Liu(is this your first name?),

> I have been concentrating on it for a long time, reading papers, reading source codes, and discussing details with Mr
Grittner.  

I understand your efforts and I am willing to back down. This is not the only project that appeals to me :)


Mr. Frost, Mr. Munro,  thank you for your suggestions. I am now between the TOAST’ing slices and the predicate locking
project.I am keen on the fact the “toasting” project is related to on-disk data structures so I will probably send you
anemail about that later today. 

In general, I would like to undertake a project interesting enough and important for Postgres. Also, I could take into
accountif you favor one over another, so please let me know. I understand that these projects should be strictly
definedto fit in the GSOC period, however the potential for future improvements or challenges is what drives and
motivatesme. 

Thank you!
George





George,

* George Papadrosou (gpapadrosou@gmail.com) wrote:
> I understand your efforts and I am willing to back down. This is not the only project that appeals to me :)

Thank you very much for your willingness to adapt. :)

> Mr. Frost, Mr. Munro,  thank you for your suggestions. I am now between the TOAST’ing slices and the predicate
lockingproject. I am keen on the fact the “toasting” project is related to on-disk data structures so I will probably
sendyou an email about that later today. 

I don't recall seeing an email from you about this yet?  My apologies if
I missed it.  I have added Alexander Korotkov to the CC list as he was
also listed as a possible mentor for TOAST'ing in slices.

As it relates to TOAST'ing in slices, it would be good to think through
how we would represent and store the information about how a particular
object has been split up.  Note that PostgreSQL is very extensible in
its type system and therefore we would need a way for new data types
which are added to the system to be able to define how data of that data
type is to be split and a way to store the information they need to
regarding such a split.

In particular, the PostGIS project adds multiple data types which are
variable in length and often end up TOAST'd because they are large
geospatial objects, anything we come up with for TOAST'ing in slices
will need to be something that the PostGIS project could leverage.

> In general, I would like to undertake a project interesting enough and important for Postgres. Also, I could take
intoaccount if you favor one over another, so please let me know. I understand that these projects should be strictly
definedto fit in the GSOC period, however the potential for future improvements or challenges is what drives and
motivatesme. 

We are certainly very interested in having you continue on and work with
the PostgreSQL community moving forward, though we do need to be sure to
scope the project goals within the GSOC requirements.

Thanks!

Stephen

Re: [HACKERS] GSOC - TOAST'ing in slices

From
George Papadrosou
Date:
Hello!

Thank you for your message. I was just about to send this email when I got yours. 

I don't recall seeing an email from you about this yet?  My apologies if
I missed it

My apologies for the inconvenience, I wish I could start earlier with this but there was so much coursework reaching it’s deadline.

I have prepared a very basic proposal for the TOAST project which I am attaching below.  You will notice that the proposal is too basic. I would appreciate some guidance on how we could dive more into the project’s details so I can elaborate more in the proposal.

Also, I haven’t considered the PostGIS project when thinking of toast’able data types, so I will study it a bit in the meanwhile. 

Please find the proposal draft below. 
Thanks!
George

Abstract

In PostgreSQL, a field value is compressed and then stored in chunks in a separate table called TOAST table [1]. Currently there is no indication of which piece of the original data made it to which chunk in the TOAST table. If a subset of the value is needed, all of the chunks have to be re-formed and decompressed to the original value.

The project’s idea is implement different slicing approaches according to the value’s datatype. For example a text field could be split upon character boundaries while a JSON document would be split in a way that allows fast access to it’s keys or values.

Benefits to the PostgreSQL Community

Knowing about the data that each chunk holds, we could keep important chunks closer to computations as well as store them in indices.

Project details

?

Deliverables 

- Implement “semantic” slicing for datatypes that support slicing into TOAST tables. These datatypes will be the Text, Array, JSON/JSONb  and XML data types.

- Include the important chunks in the indices? (Not really sure about the data that indices contain at this time)

Timeline

- Until May 30: Study about Postgres internals, on-disk data structures, review relevant code and algorithms used, define slicing approaches and agree on implementation details .

- Until June 26: Implement the slicing approaches for the Text, Array, JSON/JSONb, XML

- June 26 - 30: Student/Mentor evaluations period and safety buffer 

- Until July 24: Make indices take advantage of the new slicing approaches

- July  24 - 28: Student/Mentor evaluations period and safety buffer  

- Until August 21: Improve testing and documentation

- August  21 - 29: Submit code and final evaluations

Bio 

Contact 
Name, email, phone etc




On 15 Μαρ 2017, at 03:39, Stephen Frost <sfrost@snowman.net> wrote:

George,

* George Papadrosou (gpapadrosou@gmail.com) wrote:
I understand your efforts and I am willing to back down. This is not the only project that appeals to me :)

Thank you very much for your willingness to adapt. :)

Mr. Frost, Mr. Munro,  thank you for your suggestions. I am now between the TOAST’ing slices and the predicate locking project. I am keen on the fact the “toasting” project is related to on-disk data structures so I will probably send you an email about that later today.

.  I have added Alexander Korotkov to the CC list as he was
also listed as a possible mentor for TOAST'ing in slices.

As it relates to TOAST'ing in slices, it would be good to think through
how we would represent and store the information about how a particular
object has been split up.  Note that PostgreSQL is very extensible in
its type system and therefore we would need a way for new data types
which are added to the system to be able to define how data of that data
type is to be split and a way to store the information they need to
regarding such a split.

In particular, the PostGIS project adds multiple data types which are
variable in length and often end up TOAST'd because they are large
geospatial objects, anything we come up with for TOAST'ing in slices
will need to be something that the PostGIS project could leverage.

In general, I would like to undertake a project interesting enough and important for Postgres. Also, I could take into account if you favor one over another, so please let me know. I understand that these projects should be strictly defined to fit in the GSOC period, however the potential for future improvements or challenges is what drives and motivates me.

We are certainly very interested in having you continue on and work with
the PostgreSQL community moving forward, though we do need to be sure to
scope the project goals within the GSOC requirements.

Thanks!

Stephen

Re: [HACKERS] GSOC - TOAST'ing in slices

From
Robert Haas
Date:
On Tue, Mar 14, 2017 at 10:03 PM, George Papadrosou
<gpapadrosou@gmail.com> wrote:
> The project’s idea is implement different slicing approaches according to
> the value’s datatype. For example a text field could be split upon character
> boundaries while a JSON document would be split in a way that allows fast
> access to it’s keys or values.

Hmm.  So if you had a long text field containing multibyte characters,
and you split it after, say, every 1024 characters rather than after
every N bytes, then you could do substr() without detoasting the whole
field.  On the other hand, my guess is that you'd waste a fair amount
of space in the TOAST table, because it's unlikely that the chunks
would be exactly the right size to fill every page of the table
completely.  On balance it seems like you'd be worse off, because
substr() probably isn't all that common an operation.

Now, in contrast, slicing JSON is a very common operation, so a
smarter slicing scheme might well pay off, but the question is - what
kind of a splitting method would actually allow fast access to the
keys or values?  It strikes me that this might be a difficult problem.
Tabula raza, you could design a serialization format that was aware
that it might get toasted and was constructed in such a way that as to
contain boundaries that are actually referenced from within the
format, so that, say, after reading the toplevel keys and values, you
could know that you next need chunk #103.  But unless the existing
jsonb binary format was designed with that in mind, it doesn't seem
likely to end up being true just by chance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GSOC - TOAST'ing in slices

From
Alexander Korotkov
Date:
On Wed, Mar 15, 2017 at 5:03 AM, George Papadrosou <gpapadrosou@gmail.com> wrote:
Deliverables 

- Implement “semantic” slicing for datatypes that support slicing into TOAST tables. These datatypes will be the Text, Array, JSON/JSONb  and XML data types.
 
That looks too much comprehensive GSoC project for me.  I would recommend you to focus on more local task.  For instance, providing better slicing for jsonb with support of all relevant functions/operators looks pretty enough work amount for GSoC project.
Remember, that the major target of GSoC project is to get it complete.  I.e. to produce patch in committable shape, not just a prototype.  That's quite ambitious, because this project also assumes some research.  Thus, there could be various approaches for slicing jsonb with different advantages and disadvantages.  So, I advice you to thought out approaches you're going to try, and specify them in proposal.  Schedule should include prototyping each of these approaches, then testing and benchmarking to selecting between them, then work on bringing selected approach to committable shape.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] GSOC - TOAST'ing in slices

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 14, 2017 at 10:03 PM, George Papadrosou
> <gpapadrosou@gmail.com> wrote:
>> The project’s idea is implement different slicing approaches according to
>> the value’s datatype. For example a text field could be split upon character
>> boundaries while a JSON document would be split in a way that allows fast
>> access to it’s keys or values.

> Hmm.  So if you had a long text field containing multibyte characters,
> and you split it after, say, every 1024 characters rather than after
> every N bytes, then you could do substr() without detoasting the whole
> field.  On the other hand, my guess is that you'd waste a fair amount
> of space in the TOAST table, because it's unlikely that the chunks
> would be exactly the right size to fill every page of the table
> completely.  On balance it seems like you'd be worse off, because
> substr() probably isn't all that common an operation.

Keep in mind also that slicing on "interesting" boundaries rather than
with the current procrustean-bed approach could save you at most one or
two chunk fetches per access.  So the upside seems limited.  Moreover,
how are you going to know whether a given toast item has been stored
according to your newfangled approach?  I doubt we're going to accept
forcing a dump/reload for this.

IMO, the real problem here is to be able to predict which chunk(s) to
fetch at all, and I'd suggest focusing on that part of the problem rather
than changes to physical storage.  It's hard to see how to do anything
very smart for text (except in the single-byte-encoding case, which is
already solved).  But the JSONB format was designed with some thought
to this issue, so you might be able to get some traction there.
        regards, tom lane



Re: [HACKERS] GSOC - TOAST'ing in slices

From
George Papadrosou
Date:
Hello all, 

thank you for your replies.  I agree with Alexander Korotkov that it is important to have a quality patch at the end of the summer. 

Stephen, you mentioned PostGIS, but the conversation seems to lean towards JSONB. What are your thoughts?

Also, if I am to include some ideas/approaches in the proposal, it seems I should really focus on understanding how a specific data type is used, queried and indexed, which is a lot of exploring for a newcomer in postgres code.

In the meanwhile, I am trying to find how jsonb is indexed and queried. After I grasp the current situation I will be to think about new approaches.

Regards,
George 

On 15 Μαρ 2017, at 15:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Mar 14, 2017 at 10:03 PM, George Papadrosou
<gpapadrosou@gmail.com> wrote:
The project’s idea is implement different slicing approaches according to
the value’s datatype. For example a text field could be split upon character
boundaries while a JSON document would be split in a way that allows fast
access to it’s keys or values.

Hmm.  So if you had a long text field containing multibyte characters,
and you split it after, say, every 1024 characters rather than after
every N bytes, then you could do substr() without detoasting the whole
field.  On the other hand, my guess is that you'd waste a fair amount
of space in the TOAST table, because it's unlikely that the chunks
would be exactly the right size to fill every page of the table
completely.  On balance it seems like you'd be worse off, because
substr() probably isn't all that common an operation.

Keep in mind also that slicing on "interesting" boundaries rather than
with the current procrustean-bed approach could save you at most one or
two chunk fetches per access.  So the upside seems limited.  Moreover,
how are you going to know whether a given toast item has been stored
according to your newfangled approach?  I doubt we're going to accept
forcing a dump/reload for this.

IMO, the real problem here is to be able to predict which chunk(s) to
fetch at all, and I'd suggest focusing on that part of the problem rather
than changes to physical storage.  It's hard to see how to do anything
very smart for text (except in the single-byte-encoding case, which is
already solved).  But the JSONB format was designed with some thought
to this issue, so you might be able to get some traction there.

regards, tom lane

Re: [HACKERS] GSOC - TOAST'ing in slices

From
Stephen Frost
Date:
George,

* George Papadrosou (gpapadrosou@gmail.com) wrote:
> Stephen, you mentioned PostGIS, but the conversation seems to lean towards JSONB. What are your thoughts?

Both are important.  I brought up PostGIS specifically because it's an
external project which could benefit from this work and to explain that
PostgreSQL can be extended to beyond the built-in data types.  Focusing
on JSONB to start seems alright but keep in mind that we'll want to have
a way to let PostGIS do whatever it is we do for JSONB in core.

> Also, if I am to include some ideas/approaches in the proposal, it seems I should really focus on understanding how a
specificdata type is used, queried and indexed, which is a lot of exploring for a newcomer in postgres code.
 

This is true, but, really, the sooner the better. :)  While it's not a
small amount to go through it's also really pretty clean code, in
general, so hopefully you won't have too much trouble.  I would
recommend jumping on irc.freenode.net to the #postgresql channel where
you can find a number of PostgreSQL hacker who are generally quite happy
to answer specific questions you may have as you go through the code.

> In the meanwhile, I am trying to find how jsonb is indexed and queried. After I grasp the current situation I will be
tothink about new approaches.
 

I would suggest you make sure that you first understand how TOAST works
generally and review the on-disk storage format before jumping in to try
and understand jsonb indexing and queries.  Would be good to also
understand how the PGLZ compression works.

Thanks!

Stephen