Re: [HACKERS] GSOC - TOAST'ing in slices - Mailing list pgsql-hackers

From George Papadrosou
Subject Re: [HACKERS] GSOC - TOAST'ing in slices
Date
Msg-id A9D27575-54EF-41D5-A0E9-036A679515BD@gmail.com
In response to Re: [HACKERS] GSOC - TOAST'ing in slices  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] GSOC - TOAST'ing in slices  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
Hello all, 

thank you for your replies.  I agree with Alexander Korotkov that it is important to have a quality patch at the end of the summer. 

Stephen, you mentioned PostGIS, but the conversation seems to lean towards JSONB. What are your thoughts?

Also, if I am to include some ideas/approaches in the proposal, it seems I should really focus on understanding how a specific data type is used, queried and indexed, which is a lot of exploring for a newcomer to the Postgres code.

In the meantime, I am trying to find out how jsonb is indexed and queried. Once I grasp the current situation, I will be able to think about new approaches.

Regards,
George 

On 15 Mar 2017, at 15:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Mar 14, 2017 at 10:03 PM, George Papadrosou
<gpapadrosou@gmail.com> wrote:
The project’s idea is to implement different slicing approaches according to
the value’s datatype. For example a text field could be split upon character
boundaries while a JSON document would be split in a way that allows fast
access to its keys or values.

Hmm.  So if you had a long text field containing multibyte characters,
and you split it after, say, every 1024 characters rather than after
every N bytes, then you could do substr() without detoasting the whole
field.  On the other hand, my guess is that you'd waste a fair amount
of space in the TOAST table, because it's unlikely that the chunks
would be exactly the right size to fill every page of the table
completely.  On balance it seems like you'd be worse off, because
substr() probably isn't all that common an operation.
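
To make the character-boundary idea concrete, here is a minimal sketch of the arithmetic it would enable. It is only an illustration, not PostgreSQL code: the function name and the 1024-characters-per-chunk constant are hypothetical, taken from the figure in the paragraph above, and start positions below 1 are simply rejected to keep the sketch short.

```python
# Hypothetical: every chunk holds exactly this many characters
# (the "every 1024 characters" figure from the discussion above).
CHARS_PER_CHUNK = 1024


def chunks_for_substr(start, length, chars_per_chunk=CHARS_PER_CHUNK):
    """Return (first_chunk, last_chunk) needed for substr(start, length).

    `start` is 1-based, as in SQL substr(); chunk numbers are 0-based.
    """
    if start < 1 or length < 1:
        raise ValueError("this sketch only handles start >= 1 and length >= 1")
    first_char = start - 1                # 0-based index of the first character
    last_char = first_char + length - 1   # 0-based index of the last character
    return first_char // chars_per_chunk, last_char // chars_per_chunk


if __name__ == "__main__":
    # A 10-character slice that straddles the first chunk boundary.
    print(chunks_for_substr(1020, 10))
```

With character-aligned chunks the chunk numbers follow from integer division alone, whatever the encoding. The space cost mentioned above is not modelled here.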

Keep in mind also that slicing on "interesting" boundaries rather than
with the current procrustean-bed approach could save you at most one or
two chunk fetches per access.  So the upside seems limited.  Moreover,
how are you going to know whether a given toast item has been stored
according to your newfangled approach?  I doubt we're going to accept
forcing a dump/reload for this.

IMO, the real problem here is to be able to predict which chunk(s) to
fetch at all, and I'd suggest focusing on that part of the problem rather
than changes to physical storage.  It's hard to see how to do anything
very smart for text (except in the single-byte-encoding case, which is
already solved).  But the JSONB format was designed with some thought
to this issue, so you might be able to get some traction there.
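
As a rough illustration of why the single-byte case is easy and the multibyte case is not, consider the sketch below. It is not PostgreSQL code: the function names are hypothetical, and the 1996-byte chunk size is an assumption (believed to be the usual chunk payload with default 8 kB pages, but treat it as a placeholder).

```python
# Assumed chunk payload size in bytes; a placeholder, not a guaranteed value.
CHUNK_BYTES = 1996


def chunk_range_single_byte(start_char, nchars, chunk_bytes=CHUNK_BYTES):
    """Chunks needed for `nchars` characters starting at 0-based `start_char`.

    In a single-byte encoding the character offset equals the byte offset,
    so the chunk numbers can be predicted without reading any data.
    """
    first_byte = start_char
    last_byte = start_char + nchars - 1
    return first_byte // chunk_bytes, last_byte // chunk_bytes


def byte_offset_utf8(text, char_index):
    """Byte offset of character `char_index` in the UTF-8 form of `text`.

    This needs the preceding characters themselves, which is exactly what
    is unavailable before the value has been fetched.
    """
    return len(text[:char_index].encode("utf-8"))


if __name__ == "__main__":
    print(chunk_range_single_byte(1990, 10))
    print(byte_offset_utf8("aéb", 2))
```

The first function depends only on the requested offsets. The second depends on the content, so with byte-sized chunks and a multibyte encoding the chunk holding a given character cannot be computed up front.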

regards, tom lane
