Thread: GSoC proposal

GSoC proposal

From
Tan Tran
Date:
Hi developers,

I'm applying for GSoC 2014 with PostgreSQL and would appreciate your comments on my proposal (attached). I'm looking for technical corrections/comments and your opinions on the project's viability. In particular, if the community has doubts about its usefulness, I would start working on an extra proposal from https://wiki.postgresql.org/wiki/GSoC_2014, perhaps on the RETURNING clause as a student named Karlik did last year.

Thanks,
Tan Tran

Re: GSoC proposal

From
Florian Pflug
Date:
On Feb28, 2014, at 05:29 , Tan Tran <tankimtran@gmail.com> wrote:
> I'm applying for GSoC 2014 with PostgreSQL and would appreciate your comments
> on my proposal (attached).
> <pg_gsoc2014_TanTran.pdf>

First, please include your proposal as plain, inline text next time.
That makes it easier to quote the relevant parts when replying, and
also allows your mail to be indexed correctly by the mailing list
archive.

Regarding your proposal, I think you need to explain what exactly it
is you want to achieve in more detail.

> In particular, text and bytea are EXTERNAL by default, so that substring
> operations can seek straight to the exact slice (which is O(1)) instead
> of de-toasting the whole datum (which is O(file size)). Specifically,
> varlena.c’s text_substring(...) and bytea_substring(...) call
> DatumGetTextPSlice(...), which retrieves only the slice(s) at an
> easily-computed offset.
>
> ...
>
> 1. First, I will optimize array element retrieval and UTF-8 substring
> retrieval. Both are straightforward, as they involve calculating slice
> numbers and using similar code to above.

I'm confused by that - text_substring *already* attempts to only fetch
the relevant slice in the case of UTF-8. It can't do so precisely - it
needs to use a conservative estimate - but I fail to see how that can
be avoided. Since UTF-8 maps a character to anything from 1 to 6 bytes,
you can't compute the byte offset of a given character index precisely.
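
To illustrate the conservative estimate, here is a minimal Python sketch
(not taken from varlena.c). It assumes the only safe approach is to fetch
a prefix of (start + count) * max_bytes_per_char bytes, beginning at byte
zero, and then walk the characters inside that prefix. The function names
and the default of 6 bytes per character (the figure used in this mail)
are assumptions made for illustration.

```python
def prefix_bytes_needed(start, count, max_bytes_per_char=6):
    """Upper bound on the bytes that must be fetched so that characters
    [start, start + count) are guaranteed to be fully contained."""
    return (start + count) * max_bytes_per_char


def substring_via_prefix(data, start, count, max_bytes_per_char=6):
    """Return `count` characters starting at 0-based character index
    `start`, reading only a conservatively sized prefix of `data`."""
    # Slicing stands in for fetching only the leading TOAST slices.
    prefix = data[:prefix_bytes_needed(start, count, max_bytes_per_char)]

    # A UTF-8 character starts at every byte that is not a continuation
    # byte (continuation bytes look like 0b10xxxxxx).
    boundaries = [i for i, b in enumerate(prefix) if (b & 0xC0) != 0x80]
    boundaries.append(len(prefix))  # sentinel: end of the fetched prefix

    last = len(boundaries) - 1
    lo = boundaries[min(start, last)]
    hi = boundaries[min(start + count, last)]
    return prefix[lo:hi].decode("utf-8")


data = "aé€😀bc".encode("utf-8")   # characters of 1, 2, 3 and 4 bytes
print(substring_via_prefix(data, 1, 3))
```

The prefix may end in the middle of a character, but that character always
lies beyond the requested range, so it is never decoded.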

You could store a constant number of *characters* per slice, instead of
a constant number of *bytes*, but due to the rather large worst-case of
6 bytes per character, that would increase the storage and access overhead
6 fold for languages which can largely be represented with 1 byte per
character. That's not going to go down well...
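
A rough way to put a number on that overhead, as a Python sketch. It
assumes a hypothetical chunk capacity of 2000 bytes (not PostgreSQL's
actual TOAST chunk size) and takes the worst-case character width as a
parameter. The default of 6 follows the figure above; RFC 3629 limits
UTF-8 to 4 bytes per character, which would make the factor about 4.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def chunks_byte_sliced(n_chars, bytes_per_char, chunk_bytes=2000):
    """Chunks needed when every chunk holds a constant number of bytes."""
    return ceil_div(n_chars * bytes_per_char, chunk_bytes)


def chunks_char_sliced(n_chars, max_bytes_per_char=6, chunk_bytes=2000):
    """Chunks needed when every chunk holds a constant number of
    characters, sized so that the worst case still fits in one chunk."""
    chars_per_chunk = chunk_bytes // max_bytes_per_char
    return ceil_div(n_chars, chars_per_chunk)


n = 1_000_000  # one million characters, all 1 byte wide
print(chunks_byte_sliced(n, 1), chunks_char_sliced(n))
```

Under these assumptions, one million single-byte characters need 500
chunks when sliced by bytes and 3004 when sliced by characters.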

I haven't looked at how we currently handle arrays, but the problems
there are similar. For arrays containing variable-length types, you can't
compute the byte offset from the index. It's even worse than for varchar,
because the range of possible element lengths is much larger - one array
element might be only a few bytes long, while another may be 1kB or more...
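
The difference can be shown with a toy layout in Python. This is not
PostgreSQL's on-disk array format; it simply assumes each element is
stored behind a 4-byte length prefix. All names are hypothetical.

```python
import struct


def pack_varlen(elements):
    """Pack byte strings back to back, each behind a 4-byte length."""
    return b"".join(struct.pack("<I", len(e)) + e for e in elements)


def offset_of_fixed(index, width):
    """Fixed-width elements: the offset follows directly from the index."""
    return index * width


def offset_of_varlen(buf, index):
    """Variable-length elements: every preceding length must be read."""
    offset = 0
    for _ in range(index):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4 + length
    return offset


buf = pack_varlen([b"ab", b"x" * 1000, b"cde"])
print(offset_of_fixed(2, 8), offset_of_varlen(buf, 2))
```

Finding element 2 in the variable-length case touches the headers of
elements 0 and 1, which may live in different slices.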

> 2. Second, I will implement a SPLITTER clause for the CREATE TYPE
> statement. As 1 proposes, one would define a type, for example:
>   CREATE TYPE my_xml
>     LIKE xml
>     SPLITTER my_xml_splitter;

As far as I can tell, the idea is to allow a datatype to influence how
it's split into chunks for TOASTing so that functions can fetch only
the required slices more easily. To judge whether that is worthwhile or
not, you'd have to provide a concrete example of when such a facility
would be useful.

best regards,
Florian Pflug




Re: GSoC proposal

From
Albe Laurenz
Date:
> I'm applying for GSoC 2014 with PostgreSQL and would appreciate your comments on my proposal
> (attached). I'm looking for technical corrections/comments and your opinions on the project's
> viability. In particular, if the community has doubts about its usefulness, I would start working on
> an extra proposal from https://wiki.postgresql.org/wiki/GSoC_2014, perhaps on the RETURNING clause as
> a student named Karlik did last year.

I am sure that Simon had his reasons when he proposed
http://www.postgresql.org/message-id/CA+U5nMJGgJNt5VXqkR=crtDqXFmuyzwEF23-fD5NuSns+6N5dA@mail.gmail.com
but I cannot help asking some questions:

1) Why limit the feature to UTF8 strings?  Shouldn't the technique work for all multibyte server encodings?

2) There is probably something that makes this necessary, but why should the decision how TOAST is sliced be attached
to the data type? My (probably naive) idea would be to add a new TOAST strategy (e.g. SLICED) to PLAIN, MAIN, EXTERNAL
and EXTENDED.
 

The feature only makes sense for string data types, right?

Yours,
Laurenz Albe