Re: GSoC proposal - Mailing list pgsql-hackers

From Florian Pflug
Subject Re: GSoC proposal
Date
Msg-id F0CE4715-12FB-48DA-A059-9D0AD61F2FAA@phlo.org
In response to GSoC proposal  (Tan Tran <tankimtran@gmail.com>)
List pgsql-hackers
On Feb28, 2014, at 05:29 , Tan Tran <tankimtran@gmail.com> wrote:
> I'm applying for GSoC 2014 with Postgresql and would appreciate your comments
> on my proposal (attached).
> <pg_gsoc2014_TanTran.pdf>

First, please include your proposal as plain, inline text next time.
That makes it easier to quote the relevant parts when replying, and
also allows your mail to be indexed correctly by the mailing list
archive.

Regarding your proposal, I think you need to explain what exactly it
is you want to achieve in more detail.

> In particular, text and bytea are EXTERNAL by default, so that substring
> operations can seek straight to the exact slice (which is O(1)) instead
> of de-toasting the whole datum (which is O(file size)). Specifically,
> varlena.c’s text_substring(...) and bytea_substring(...) call
> DatumGetTextPSlice(...), which retrieves only the slice(s) at an
> easily-computed offset.
>
> ...
>
> 1. First, I will optimize array element retrieval and UTF-8 substring
> retrieval. Both are straightforward, as they involve calculating slice
> numbers and using similar code to above.

I'm confused by that - text_substring *already* attempts to only fetch
the relevant slice in the case of UTF-8. It can't do so precisely - it
needs to use a conservative estimate - but I fail to see how that can
be avoided. Since UTF-8 maps a character to anything from 1 to 4 bytes,
you can't compute the byte offset of a given character index precisely.

You could store a constant number of *characters* per slice, instead of
a constant number of *bytes*, but due to the rather large worst case of
4 bytes per character, that would increase the storage and access overhead
4-fold for languages which can largely be represented with 1 byte per
character. That's not going to go down well...
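
To make the two points above concrete, here is a minimal Python sketch. It
is not PostgreSQL code: the function names are made up for illustration, and
the worst-case character width is passed in as a parameter rather than
hard-coded. The first two functions fetch a conservative byte prefix and then
count characters from the start; the last two compare chunk counts when every
chunk must reserve worst-case room for a fixed number of characters.

```python
import math


def conservative_prefix_len(start, length, max_bytes_per_char):
    """Upper bound on the bytes, counted from byte 0, that cover
    characters start .. start+length-1 (1-based)."""
    return (start - 1 + length) * max_bytes_per_char


def substring_via_prefix(data, start, length, max_bytes_per_char):
    """Return the substring while reading only a leading prefix of `data`.

    `data` is assumed to be valid UTF-8.
    """
    prefix = data[:conservative_prefix_len(start, length, max_bytes_per_char)]
    # The prefix may stop in the middle of a multi-byte character;
    # errors="ignore" drops that incomplete trailing sequence.
    text = prefix.decode("utf-8", errors="ignore")
    return text[start - 1:start - 1 + length]


def chunks_by_bytes(total_bytes, chunk_bytes):
    """Chunks needed when each chunk holds a fixed number of bytes."""
    return math.ceil(total_bytes / chunk_bytes)


def chunks_by_chars(total_chars, chunk_bytes, max_bytes_per_char):
    """Chunks needed when each chunk holds a fixed number of characters
    and must leave worst-case room for every one of them."""
    chars_per_chunk = chunk_bytes // max_bytes_per_char
    return math.ceil(total_chars / chars_per_chunk)
```

Taking 4 bytes as the worst case (the RFC 3629 limit) and an illustrative
chunk size of 2000 bytes, characters 2 through 4 are only guaranteed to end
within the first 16 bytes, and 1,000,000 one-byte characters need 500 chunks
when chunked by bytes but 2000 chunks when chunked by characters.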

I haven't looked at how we currently handle arrays, but the problems
there are similar. For arrays containing variable-length types, you can't
compute the byte offset from the index. It's even worse than for varchar,
because the range of possible element lengths is much larger - one array
element might be only a few bytes long, while another may be 1kB or more...
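
A toy illustration of why the index alone is not enough, as a minimal Python
sketch. The layout here (each element preceded by a 4-byte little-endian
length) and the function names are assumptions made for the example, not
PostgreSQL's actual array format. Finding element i means reading the length
of every element before it.

```python
import struct


def pack_elements(elements):
    """Concatenate byte strings, each preceded by a 4-byte length header."""
    return b"".join(struct.pack("<I", len(e)) + e for e in elements)


def element_offset(buf, index):
    """Byte offset of the header of element `index` (0-based).

    There is no closed-form answer: every preceding header must be read.
    """
    offset = 0
    for _ in range(index):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4 + length
    return offset


def get_element(buf, index):
    """Return the payload of element `index` (0-based)."""
    offset = element_offset(buf, index)
    (length,) = struct.unpack_from("<I", buf, offset)
    return buf[offset + 4:offset + 4 + length]
```

With elements of 2, 1000 and 4 bytes, the third element starts at byte 1010,
a number that cannot be derived from the index 2 without knowing the first
two lengths.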

> 2. Second, I will implement a SPLITTER clause for the CREATE TYPE
> statement. As 1 proposes, one would define a type, for example:
>   CREATE TYPE my_xml
>     LIKE xml
>     SPLITTER my_xml_splitter;

As far as I can tell, the idea is to allow a datatype to influence how
it's split into chunks for TOASTing so that functions can fetch only
the required slices more easily. To judge whether that is worthwhile or
not, you'd have to provide a concrete example of when such a facility
would be useful.

best regards,
Florian Pflug



