Thread: Slicing TOAST
I'm proposing this now as a possible GSoC project:

In 1-byte character encodings (i.e. not UTF-8), SUBSTR() is optimised to seek straight to the exact slice when retrieving a large toasted value. This reduces I/O considerably when you have large toasted values, since it is an O(1) operation rather than O(N).

This is possible because the slicing of toasted values is predictable in 1-byte encodings.

It would be useful to have a predictable function perform the slicing, so that we could use that knowledge later to optimise searches in a wider range of situations. More specifically, since UTF-8 is so common, this would allow optimisations in that encoding for common data: text, XML, JSON.

e.g. if we knew that an XML document has a required element called TITLE that occurs only once and always in the first slice, that would be useful information for search functions. (Not sure, but it may be possible to assign non-consecutive slice numbers to allow variable-length data mid-way through a column value if needed.)

e.g. in UTF-8 free text we could put 500 characters in each slice, so that even though a slice could then be anywhere between 500 and 2000 bytes, it would still fit just fine.

e.g. for arrays, if we put say 200 elements per slice, then accessing a particular element would require only one slice retrieval.

Doing this would *possibly* reduce packing density, but not certainly so. And it would greatly improve access times to large structured toast values.

Implementation would be a slicing function that gets called iteratively on a column value until it returns no further slices.

There is no proposal here for search functions; it would be up to each search function to confirm the details of the slicing function before using that knowledge in a search. We'd need a way to check that the function's inputs matched the slicing of the column, so we'd need some requirement for input on the function to be matched against metadata on the column.
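A toy sketch of the access patterns described above (illustrative Python with assumed slice sizes; none of this is PostgreSQL's actual TOAST code, and the function names are hypothetical):

```python
# Hypothetical illustration of predictable TOAST slicing -- not real code.
TOAST_SLICE_BYTES = 2000  # assumed fixed slice size in bytes

def slice_for_byte_offset(offset: int) -> int:
    """In a 1-byte encoding, character n is byte n, so SUBSTR() can
    compute the slice holding it directly: an O(1) seek, not O(N)."""
    return offset // TOAST_SLICE_BYTES

CHARS_PER_SLICE = 500  # assumed splitter rule: 500 characters per slice

def utf8_splitter(text: str):
    """Predictable splitter for UTF-8 text: 500 characters per slice.
    Encoded slices vary between 500 and 2000 bytes, but the slice
    number for character k stays computable."""
    for i in range(0, len(text), CHARS_PER_SLICE):
        yield text[i:i + CHARS_PER_SLICE].encode("utf-8")

def slice_for_char(k: int) -> int:
    """Which slice holds character k, under the splitter rule above."""
    return k // CHARS_PER_SLICE
```

The same arithmetic covers the array case: with 200 elements per slice, element i lives in slice i // 200, so one slice retrieval suffices.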
So presumably that means some decoration of the input parameters of functions, which sounds like a little too much difficulty. The proposal, then, would be to provide the slicing/chunking function at the datatype level, not the column level. The user would create a binary-compatible type that is effectively XML or whatever, just with extra constraints on usage for slicing.

Now to consider the syntax. I'll call it a splitter function, since "slicer" and "chunker" sound silly to me:

CREATE TYPE my_xml LIKE xml SPLITTER my_toast_function;

Search functions would then be designed to take such datatypes as input and would be able to rely with certainty upon the toast slicing algorithm when retrieving data. Doing it this way means that different XML or JSON schemas could have specific search functions optimised for them.

I'm proposing this now as a possible GSoC project; I don't propose to actively work on it myself.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
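To illustrate the kind of search function this would enable (toy Python; the slice store and the guarantee that TITLE lands in the first slice are both assumptions for the example, not anything PostgreSQL provides today):

```python
import re

def fetch_slice(slices, n):
    """Toy stand-in for detoasting a single numbered slice."""
    return slices[n]

def title_from_first_slice(slices):
    """A search function written against a splitter whose contract says
    the <title> element always falls in slice 0: it detoasts only that
    one slice instead of reassembling the whole document."""
    first = fetch_slice(slices, 0).decode("utf-8")
    m = re.search(r"<title>(.*?)</title>", first)
    return m.group(1) if m else None
```

The point of tying the splitter to the datatype is exactly this: the search function can trust the contract without re-checking column metadata on every call.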
On 05/14/2013 10:05 AM, Simon Riggs wrote:
> I'm proposing this now as a possible GSoC project:
>
> In 1-byte character encodings (i.e. not UTF-8), SUBSTR() is optimised
> to allow seeking straight to the exact slice when retrieving a large
> toasted value. This reduces I/O considerably when you have large
> toasted values since it is an O(1) action rather than an O(N).
>
> This is possible because the slicing of toasted values is predictable
> on 1 byte encodings.
>
> It would be useful to have a predictable function perform the slicing,
> so we could use that knowledge later to optimise searches in a wider
> range of situations. More specifically, since UTF-8 is so common, to
> allow optimisations in that encoding of common data: text, XML, JSON.
>
> e.g. if we knew that an XML document has a required element called
> TITLE and that occurs only once and always in the first slice, it
> would be useful information to use in search functions. (Not sure, but
> it may be possible to assign non-consecutive slice numbers to allow
> variable data mid-way through a column value if needed).
>
> e.g. in UTF-8 free text we could put 500 characters in each slice, so
> that even if that could be anywhere between 500 and 2000 bytes it
> would still fit just fine.
>
> e.g. for arrays, if we put say 200 elements per slice, then accessing
> particular elements would require only 1 slice retrieval.
>
> Doing this would *possibly* reduce packing density, but not certainly
> so. But it would greatly improve access times to large structured
> toast values.

On the contrary: it would enable us to pack the chunks more densely, fitting more on the page, especially for compressible data :)

That is, first chunk into N bytes, then compress each chunk.

-----------------
Hannu
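The chunk-then-compress order Hannu describes can be sketched as follows (zlib is used purely as an illustrative compressor; PostgreSQL's TOAST actually uses its own LZ-family scheme, and the chunk size is an assumption):

```python
import zlib

CHUNK_BYTES = 2000  # assumed pre-compression chunk size

def chunk_then_compress(data: bytes):
    """Split at a fixed, predictable boundary FIRST, then compress each
    chunk independently.  Chunk numbers stay predictable from offsets
    in the uncompressed value, while the compressed chunks can pack
    more densely on the page."""
    return [zlib.compress(data[i:i + CHUNK_BYTES])
            for i in range(0, len(data), CHUNK_BYTES)]

def fetch_chunk(chunks, n):
    """Only the one chunk holding the target data needs decompressing."""
    return zlib.decompress(chunks[n])
```

Compressing per chunk trades some compression ratio (each chunk starts with an empty dictionary) for the ability to decompress a single chunk in isolation.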
On 14 May 2013 08:05, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm proposing this now as a possible GSoC project:

Unfortunately the deadline for project submissions for students was 3rd May. If this isn't worked on before next year, it can of course be put forward as an idea for GSoC 2014.

--
Thom
On 14 May 2013 18:21, Thom Brown <thom@linux.com> wrote:
> On 14 May 2013 08:05, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I'm proposing this now as a possible GSoC project:
>
> Unfortunately the deadline for project submissions for students was
> 3rd May. If this isn't worked on before next year, it can of course
> be put forward as an idea for GSoC 2014.

Having reviewed the list of project ideas, I thought I'd submit an alternative, in case we have difficulties.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 14, 2013 at 06:34:55PM +0100, Simon Riggs wrote:
> On 14 May 2013 18:21, Thom Brown <thom@linux.com> wrote:
> > On 14 May 2013 08:05, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> I'm proposing this now as a possible GSoC project:
> >
> > Unfortunately the deadline for project submissions for students was
> > 3rd May. If this isn't worked on before next year, it can of course
> > be put forward as an idea for GSoC 2014.
>
> Having reviewed the list of project ideas, I thought I'd submit an
> alternative, in case we have difficulties.

It being past 3 May, we have two and only two actual options:

1. Make a go of it with what we have, with a high risk of failure for some of the projects.
2. Give up slots.

Option 1 is preferable in my mind to option 2 because it will not impact our ability to get slots next year, or if it does, it will not impact them as severely.

Simon, is there anyone in your organization who can mentor one or more of the projects we have which doesn't currently have a mentor?

Cheers,
David.

--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On 14 May 2013 21:09, David Fetter <david@fetter.org> wrote:
> Is there anyone in your organization who can mentor one or more of the
> projects we have which doesn't currently have a mentor?

Depends upon the project. At least half the projects looked like things I would vote -1 for, which is why I proposed something straightforward and useful.

Where is the list of projects/mentors?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> Depends upon the project. At least half the projects looked like
> things I would vote -1 for, which is why I proposed something
> straightforward and useful.
>
> Where is the list of projects/mentors?

You'd need to register as a mentor on the GSoC app so that we can let you see the proposals.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com