Thread: Support of partial decompression for datums
The attached patch adds support for partial decompression of datums.
It will be useful in many cases where extracting part of the data is
enough for big varlena structures.

It is especially useful for expanded datums, because it provides
storage for partial results.

I have another patch, which removes the 1 MB limit on tsvector using
this feature.

Usage:

Assert(VARATT_IS_COMPRESSED(attr));

evh->data = (struct varlena *) palloc(TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);
SET_VARSIZE(evh->data, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);

/* Extract size of tsvector */
res = toast_decompress_datum_partial(attr, evh->data, evh->dcState,
                                     sizeof(int32));
if (res == -1)
    elog(ERROR, "compressed tsvector is corrupted");

evh->count = TS_COUNT((TSVector) evh->data);

/* Extract entries of tsvector */
res = toast_decompress_datum_partial(attr, evh->data, evh->dcState,
                                     sizeof(int32) + sizeof(WordEntry) * evh->count);
if (res == -1)
    elog(ERROR, "compressed tsvector is corrupted");

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
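From the usage above one can guess roughly what the interface looks
like. A minimal sketch follows; the state struct, its fields, and the
exact signature are assumptions for illustration, not necessarily what
the patch actually defines:

/*
 * Hypothetical resumable-decompression state: enough bookkeeping is
 * kept between calls for a later call to continue where the previous
 * one stopped, instead of restarting from the first compressed byte.
 */
typedef struct ToastDecompressState
{
    const char *srcp;           /* next unread compressed byte */
    char       *destp;          /* next unwritten output byte */
    int32       written;        /* raw bytes produced so far */
} ToastDecompressState;

/*
 * Decompress "attr" into "dest" until at least "size" raw bytes are
 * available, resuming from "state".  Returns the number of raw bytes
 * produced so far, or -1 if the compressed data is corrupt.
 */
extern int32 toast_decompress_datum_partial(struct varlena *attr,
                                            struct varlena *dest,
                                            ToastDecompressState *state,
                                            int32 size);

With an interface like this, the second call in the usage above would
pick up from where the first one stopped, so the int32 size word is
never decompressed twice.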
On Fri, Dec 4, 2015 at 9:47 PM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:
> The attached patch adds support for partial decompression of datums.
> It will be useful in many cases where extracting part of the data is
> enough for big varlena structures.
>
> It is especially useful for expanded datums, because it provides
> storage for partial results.
>
> I have another patch, which removes the 1 MB limit on tsvector using
> this feature.

-1 for changing the shape of pglz_decompress directly, and particularly
for using metadata in it. The current format of those routines is close
to what lz4 offers in terms of compression and decompression of a
string; let's not break that. We had a hard enough time in the 9.5
cycle getting something clean.

By the way, why don't you compress multiple chunks and store the
related metadata at a higher level? There is no need to put that in
pglz itself.
--
Michael
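For concreteness, the chunked alternative suggested here could look
something like the sketch below; the layout and every name are invented
for illustration, nothing like this exists in pglz or in the patch:

/*
 * Each chunk of CHUNK_RAW_SIZE raw bytes is compressed independently
 * with pglz, and an offset table is stored in front of the chunks.
 * Any byte range can then be fetched by decompressing only the chunks
 * that cover it, while pglz itself stays untouched.
 */
#define CHUNK_RAW_SIZE      (64 * 1024)

typedef struct ChunkedCompressedHeader
{
    int32   rawsize;        /* total uncompressed size */
    int32   nchunks;        /* number of compressed chunks */
    int32   chunk_start[FLEXIBLE_ARRAY_MEMBER];     /* byte offset of
                                                     * each compressed
                                                     * chunk */
} ChunkedCompressedHeader;

/*
 * To read raw bytes [off, off + len), decompress only chunks
 * off / CHUNK_RAW_SIZE through (off + len - 1) / CHUNK_RAW_SIZE.
 */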
On Fri, 4 Dec 2015 22:13:58 +0900
Michael Paquier <michael.paquier@gmail.com> wrote:

> On Fri, Dec 4, 2015 at 9:47 PM, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
> > The attached patch adds support for partial decompression of datums.
> > It will be useful in many cases where extracting part of the data is
> > enough for big varlena structures.
> >
> > It is especially useful for expanded datums, because it provides
> > storage for partial results.
> >
> > I have another patch, which removes the 1 MB limit on tsvector using
> > this feature.
>
> -1 for changing the shape of pglz_decompress directly, and particularly
> for using metadata in it. The current format of those routines is close
> to what lz4 offers in terms of compression and decompression of a
> string; let's not break that. We had a hard enough time in the 9.5
> cycle getting something clean.

The metadata is not used by the current code, only in the partial
decompression case.

> By the way, why don't you compress multiple chunks and store the
> related metadata at a higher level? There is no need to put that in
> pglz itself.

Yes, but the idea with chunks means creating a whole new structure,
which can't be used when you want to optimize the current structures.
For example, you can't just change arrays that way, and I think there
are places where partial decompression can be used as-is.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On 4 December 2015 at 13:47, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:

> The attached patch adds support for partial decompression of datums.
> It will be useful in many cases where extracting part of the data is
> enough for big varlena structures.
>
> It is especially useful for expanded datums, because it provides
> storage for partial results.

This isn't enough for anyone else to follow your thoughts and agree
enough to commit.

Please explain the whole idea, starting from what problem you are
trying to solve and how well this does it, why you did it this way, and
the other ways you tried or decided not to pursue. Thanks.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 5, 2015 at 12:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 4 December 2015 at 13:47, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
>
>> The attached patch adds support for partial decompression of datums.
>> It will be useful in many cases where extracting part of the data is
>> enough for big varlena structures.
>>
>> It is especially useful for expanded datums, because it provides
>> storage for partial results.
>
> This isn't enough for anyone else to follow your thoughts and agree
> enough to commit.
>
> Please explain the whole idea, starting from what problem you are
> trying to solve and how well this does it, why you did it this way,
> and the other ways you tried or decided not to pursue. Thanks.

Yeah, I would imagine that what Ildus is trying to achieve is something
close to LZ4_decompress_safe_partial: being able to stop decompression
after getting a certain amount of data decompressed, and to continue
working once again afterwards.

And actually I think I get the idea. With his test case, what we get
first is a size, and then we reuse this size to extract only the number
of items we need from the tsvector. But that's actually linked to the
length of the compressed chunk, and at the end we would perhaps still
need to decompress the whole string; it is not possible to be sure
using the information provided.

Ildus, using your patch for tsvector, are you aiming at being able to
complete an operation by only using a portion of the compressed data?
Or are you planning to use that to improve the speed of detection of
corrupted data in the chunk? If it's the latter, we would still need to
decompress the whole string anyway, so a routine able to decompress
only up to a given position is not necessary, and based on the example
given upthread it is not possible to know what you are trying to
achieve. Hence could you share your thoughts regarding your work on
tsvector?

Changing pglz_decompress's shape is still a bad idea anyway; I guess we
had better have something new like pglz_decompress_partial instead.
--
Michael
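For reference, lz4's entry point for this is
LZ4_decompress_safe_partial(), which stops once a requested number of
output bytes has been produced. A pglz counterpart along the lines
suggested here might look like the sketch below; the signature is only
a guess, modeled on the pglz_decompress() in src/common with one extra
parameter:

/*
 * Hypothetical variant of pglz_decompress() that can stop early:
 * write decompressed bytes into "dest" and return as soon as at least
 * "target" bytes have been produced.  Returns the number of bytes
 * written, or -1 on corrupt input.  pglz_decompress() itself keeps
 * its current shape.
 */
extern int32 pglz_decompress_partial(const char *source, int32 slen,
                                     char *dest, int32 rawsize,
                                     int32 target);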
On Sat, 5 Dec 2015 06:14:07 +0900
Michael Paquier <michael.paquier@gmail.com> wrote:

> Yeah, I would imagine that what Ildus is trying to achieve is
> something close to LZ4_decompress_safe_partial: being able to stop
> decompression after getting a certain amount of data decompressed,
> and to continue working once again afterwards.
>
> And actually I think I get the idea. With his test case, what we get
> first is a size, and then we reuse this size to extract only the
> number of items we need from the tsvector. But that's actually linked
> to the length of the compressed chunk, and at the end we would
> perhaps still need to decompress the whole string; it is not possible
> to be sure using the information provided.
>
> Ildus, using your patch for tsvector, are you aiming at being able to
> complete an operation by only using a portion of the compressed data?
> Or are you planning to use that to improve the speed of detection of
> corrupted data in the chunk? If it's the latter, we would still need
> to decompress the whole string anyway, so a routine able to
> decompress only up to a given position is not necessary, and based on
> the example given upthread it is not possible to know what you are
> trying to achieve. Hence could you share your thoughts regarding your
> work on tsvector?
>
> Changing pglz_decompress's shape is still a bad idea anyway; I guess
> we had better have something new like pglz_decompress_partial
> instead.

Yes, you've got the idea. First we get the size of the entries in the
tsvector; then, with that size, we can get the WordEntry values. A
WordEntry contains the offset of a lexeme in the data and the length of
the lexeme. The information in these entries is enough to calculate the
offset up to which we need to decompress the tsvector varlena.

So, for example, in a binary search we will decompress up to half of
the lexeme data (lexemes in a tsvector are sorted). If the search then
goes left, we just reuse that decompressed block, and we don't need the
other part of the tsvector. If the search goes right, we decompress
half of the remaining part using the saved state, and so on. So in half
of the cases we will decompress only half of the lexemes in the
tsvector, and even when we need to decompress more, in most cases we
will not decompress the whole tsvector.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
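To make that access pattern concrete, here is a sketch of such a binary
search. ExpandedTSHeader and ensure_decompressed() are hypothetical
stand-ins for whatever the tsvector patch provides (the thread only
shows the evh fields used below), and the sketch assumes the WordEntry
array has already been decompressed, as in the usage example upthread:

#include "postgres.h"
#include "tsearch/ts_type.h"

/* Hypothetical expanded-tsvector state, mirroring "evh" upthread. */
typedef struct ExpandedTSHeader
{
    struct varlena *data;       /* partially decompressed tsvector */
    void           *dcState;    /* saved decompression state */
    int32           count;      /* number of WordEntry items */
} ExpandedTSHeader;

/* Hypothetical: resume decompression until "upto" raw bytes exist. */
extern void ensure_decompressed(ExpandedTSHeader *evh, Size upto);

static WordEntry *
tsvector_bsearch_partial(ExpandedTSHeader *evh,
                         const char *lexeme, int lexlen)
{
    TSVector    tsv = (TSVector) evh->data;
    WordEntry  *entries = ARRPTR(tsv);
    int         low = 0;
    int         high = evh->count - 1;

    while (low <= high)
    {
        int         mid = low + (high - low) / 2;
        WordEntry  *we = &entries[mid];
        int         cmp;

        /*
         * Decompress just far enough that this entry's lexeme bytes are
         * available: the size word, the WordEntry array, then this
         * lexeme's offset plus length within the string area.  A later
         * probe to the left reuses what is already decompressed; a probe
         * to the right decompresses a bit more from the saved state.
         */
        ensure_decompressed(evh,
                            sizeof(int32)
                            + evh->count * sizeof(WordEntry)
                            + we->pos + we->len);

        /* memcmp order with length tiebreak, as tsvector lexemes sort */
        cmp = memcmp(lexeme, STRPTR(tsv) + we->pos, Min(lexlen, we->len));
        if (cmp == 0 && lexlen != we->len)
            cmp = (lexlen < we->len) ? -1 : 1;

        if (cmp < 0)
            high = mid - 1;
        else if (cmp > 0)
            low = mid + 1;
        else
            return we;          /* found */
    }
    return NULL;
}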