Thread: GSoC proposal
Hi developers,
I'm applying for GSoC 2014 with Postgresql and would appreciate your comments on my proposal (attached). I'm looking for technical corrections/comments and your opinions on the project's viability. In particular, if the community has doubts about its usefulness, I would start working on an extra proposal from https://wiki.postgresql.org/wiki/GSoC_2014, perhaps on the RETURNING clause as a student named Karlik did last year.
Thanks,
Tan Tran
On Feb 28, 2014, at 05:29, Tan Tran <tankimtran@gmail.com> wrote:
> I'm applying for GSoC 2014 with Postgresql and would appreciate your comments
> on my proposal (attached).
> <pg_gsoc2014_TanTran.pdf>

First, please include your proposal as plain, inline text next time. That makes it easier to quote the relevant parts when replying, and also allows your mail to be indexed correctly by the mailing list archive.

Regarding your proposal, I think you need to explain what exactly it is you want to achieve in more detail.

> In particular, text and bytea are EXTERNAL by default, so that substring
> operations can seek straight to the exact slice (which is O(1)) instead
> of de-toasting the whole datum (which is O(file size)). Specifically,
> varlena.c’s text_substring(...) and bytea_substring(...) call
> DatumGetTextPSlice(...), which retrieves only the slice(s) at an
> easily-computed offset.
>
> ...
>
> 1. First, I will optimize array element retrieval and UTF-8 substring
> retrieval. Both are straightforward, as they involve calculating slice
> numbers and using similar code to above.

I'm confused by that - text_substring *already* attempts to only fetch the relevant slice in the case of UTF-8. It can't do so precisely - it needs to use a conservative estimate - but I fail to see how that can be avoided.

Since UTF-8 maps a character to anything from 1 to 6 bytes, you can't compute the byte offset of a given character index precisely. You could store a constant number of *characters* per slice, instead of a constant number of *bytes*, but due to the rather large worst-case of 6 bytes per character, that would increase the storage and access overhead 6 fold for languages which can largely be represented with 1 byte per character. That's not going to go down well...

I haven't looked at how we currently handle arrays, but the problems there are similar. For arrays containing variable-length types, you can't compute the byte offset from the index. It's even worse than for varchar, because the range of possible element lengths is much larger - one array element might be only a few bytes long, while another may be 1kB or more...

> 2. Second, I will implement a SPLITTER clause for the CREATE TYPE
> statement. As 1 proposes, one would define a type, for example:
> CREATE TYPE my_xml
> LIKE xml
> SPLITTER my_xml_splitter;

As far as I can tell, the idea is to allow a datatype to influence how it's split into chunks for TOASTing so that functions can fetch only the required slices more easily. To judge whether that is worthwhile or not, you'd have to provide a concrete example of when such a facility would be useful.

best regards,
Florian Pflug
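
To make the point about UTF-8 offsets concrete, here is a minimal Python sketch. It is an illustration only, not PostgreSQL code: the function names byte_offset_of_char and conservative_byte_range are hypothetical, and the bound shown is a generic one rather than a copy of what text_substring computes. It shows that finding the exact byte offset of a character index requires scanning the preceding bytes, while a conservative byte range can be derived from the index alone if a worst-case width per character is assumed.

```python
def byte_offset_of_char(data: bytes, char_index: int) -> int:
    """Return the byte offset at which character `char_index` starts.

    Hypothetical helper: it has to walk every preceding character,
    because each UTF-8 character may occupy a different number of bytes.
    """
    offset = 0
    for _ in range(char_index):
        if offset >= len(data):
            raise IndexError("character index past end of data")
        lead = data[offset]
        if lead < 0x80:            # 0xxxxxxx: single-byte character
            width = 1
        elif lead >> 5 == 0b110:   # 110xxxxx: two-byte character
            width = 2
        elif lead >> 4 == 0b1110:  # 1110xxxx: three-byte character
            width = 3
        elif lead >> 3 == 0b11110: # 11110xxx: four-byte character
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        offset += width
    return offset


def conservative_byte_range(start_char: int, length_chars: int,
                            max_bytes_per_char: int) -> tuple[int, int]:
    """Half-open byte range guaranteed to contain the requested characters.

    Assumes every character takes at least 1 and at most
    `max_bytes_per_char` bytes; no scanning of the data is needed.
    """
    return start_char, (start_char + length_chars) * max_bytes_per_char


# Characters of 1, 2, 3, 4 and 1 bytes respectively.
data = "aé€😀z".encode("utf-8")

exact_start = byte_offset_of_char(data, 3)   # needs a scan of the data
exact_end = byte_offset_of_char(data, 4)
# Worst case of 6 bytes per character, as assumed in the mail above.
low, high = conservative_byte_range(3, 1, 6)
print(exact_start, exact_end, low, high)
```

The exact range can only be found after reading the bytes before it, whereas the conservative range is available up front but is much wider than necessary, which is the trade-off described above.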
> I'm applying for GSoC 2014 with Postgresql and would appreciate your comments on my proposal
> (attached). I'm looking for technical corrections/comments and your opinions on the project's
> viability. In particular, if the community has doubts about its usefulness, I would start working on
> an extra proposal from https://wiki.postgresql.org/wiki/GSoC_2014, perhaps on the RETURNING clause as
> a student named Karlik did last year.

I am sure that Simon had his reasons when he proposed
http://www.postgresql.org/message-id/CA+U5nMJGgJNt5VXqkR=crtDqXFmuyzwEF23-fD5NuSns+6N5dA@mail.gmail.com
but I cannot help asking some questions:

1) Why limit the feature to UTF8 strings? Shouldn't the technique work for all multibyte server encodings?

2) There is probably something that makes this necessary, but why should the decision how toast is sliced be attached to the data type? My (probably naive) idea would be to add a new TOAST strategy (e.g. SLICED) to PLAIN, MAIN, EXTERNAL and EXTENDED.

The feature only makes sense for string data types, right?

Yours,
Laurenz Albe