Thread: Support of partial decompression for datums


From: Ildus Kurbangaliev
The attached patch adds support for partial decompression of datums.
It is useful in many cases where extracting only part of the data is
enough, in particular for big varlena structures.

It is especially useful for expanded datums, because an expanded datum
provides storage for the partial results.

I have another patch that uses this feature to remove the 1 MB size
limit on tsvectors.

Usage:

    Assert(VARATT_IS_COMPRESSED(attr));
    evh->data = (struct varlena *)
        palloc(TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);
    SET_VARSIZE(evh->data, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ);

    /* Extract size of tsvector */
    res = toast_decompress_datum_partial(attr, evh->data,
        evh->dcState, sizeof(int32));
    if (res == -1)
        elog(ERROR, "compressed tsvector is corrupted");

    evh->count = TS_COUNT((TSVector) evh->data);

    /* Extract entries of tsvector */
    res = toast_decompress_datum_partial(attr, evh->data,
        evh->dcState, sizeof(int32) + sizeof(WordEntry) * evh->count);
    if (res == -1)
        elog(ERROR, "compressed tsvector is corrupted");
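To illustrate the mechanism the usage above relies on, here is a minimal sketch of a resumable partial decompressor. It uses run-length encoding instead of pglz for brevity; the `DecompressState` struct and `decompress_partial` name are illustrative assumptions, not the patch's actual API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Saved decompression state, analogous in spirit to the patch's dcState. */
typedef struct DecompressState
{
    size_t src_pos;   /* next unread byte in the compressed input */
    size_t dst_pos;   /* bytes of output produced so far */
} DecompressState;

/*
 * Decompress RLE-encoded (count, byte) pairs from src until at least
 * target_len bytes of output exist, resuming from *state.
 * Returns the total bytes produced, or -1 on corrupt input.
 */
static int
decompress_partial(const uint8_t *src, size_t src_len,
                   uint8_t *dst, size_t dst_cap,
                   DecompressState *state, size_t target_len)
{
    while (state->dst_pos < target_len)
    {
        uint8_t count;
        uint8_t byte;

        if (state->src_pos + 2 > src_len)
            return -1;          /* input exhausted before target: corrupt */
        count = src[state->src_pos];
        byte = src[state->src_pos + 1];
        if (state->dst_pos + count > dst_cap)
            return -1;          /* would overflow the caller's buffer */
        memset(dst + state->dst_pos, byte, count);
        state->dst_pos += count;
        state->src_pos += 2;
    }
    return (int) state->dst_pos;
}
```

A caller can first ask for just enough bytes to read a size header, then resume with the same state to fetch only the entries it needs, mirroring the two-step tsvector usage above.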


--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: Support of partial decompression for datums

From: Michael Paquier
On Fri, Dec 4, 2015 at 9:47 PM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:
> Attached patch adds support of partial decompression for datums.
> It will be useful in many cases when extracting part of data is
> enough for big varlena structures.
>
> It is especially useful for expanded datums, because it provides
> storage for partial results.
>
> I have another patch, which removes the 1 Mb limit on tsvector using
> this feature.

-1 for changing the shape of pglz_decompress directly, and particularly
for putting metadata in it. The current format of those routines is
close to what lz4 offers for compression and decompression of a string;
let's not break that. We had a hard enough time in the 9.5 cycle getting
something clean.

By the way, why don't you compress the data in multiple chunks and
store the related metadata at a higher level? There is no need to put
that in pglz itself.
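The chunked alternative suggested above can be sketched like this: compress fixed-size chunks independently and keep a small directory above the codec, so a reader decompresses only the chunks covering the range it needs. The names here are illustrative assumptions, and an identity copy stands in for the real pglz codec:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_RAW_SIZE 4        /* tiny, for illustration only */

/* Higher-level metadata: where each compressed chunk lives in the blob. */
typedef struct ChunkDirectory
{
    size_t  nchunks;
    size_t *offsets;            /* offsets[i] = start of chunk i in the blob */
    size_t *lengths;            /* compressed length of chunk i */
} ChunkDirectory;

/* Stand-in for pglz_decompress on one chunk: an identity copy. */
static void
codec_decompress(const uint8_t *in, size_t len, uint8_t *out)
{
    memcpy(out, in, len);
}

/* Decompress only the chunks covering [pos, pos + len) of the raw data. */
static void
read_range(const uint8_t *blob, const ChunkDirectory *dir,
           size_t pos, size_t len, uint8_t *out)
{
    size_t first = pos / CHUNK_RAW_SIZE;
    size_t last = (pos + len - 1) / CHUNK_RAW_SIZE;
    uint8_t raw[CHUNK_RAW_SIZE];
    size_t  c, i;

    for (c = first; c <= last; c++)
    {
        codec_decompress(blob + dir->offsets[c], dir->lengths[c], raw);
        for (i = 0; i < CHUNK_RAW_SIZE; i++)
        {
            size_t abs = c * CHUNK_RAW_SIZE + i;

            if (abs >= pos && abs < pos + len)
                out[abs - pos] = raw[i];
        }
    }
}
```

The design trade-off Michael points at: this needs no change to pglz at all, but the chunk boundaries and directory become part of the stored format, which is why it suits a new on-disk layout better than retrofitting an existing one.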
-- 
Michael



Re: Support of partial decompression for datums

From: Ildus Kurbangaliev
On Fri, 4 Dec 2015 22:13:58 +0900
Michael Paquier <michael.paquier@gmail.com> wrote:

> On Fri, Dec 4, 2015 at 9:47 PM, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
> > Attached patch adds support of partial decompression for datums.
> > It will be useful in many cases when extracting part of data is
> > enough for big varlena structures.
> >
> > It is especially useful for expanded datums, because it provides
> > storage for partial results.
> >
> > I have another patch, which removes the 1 Mb limit on tsvector using
> > this feature.  
> 
> -1 for changing the shape of pglz_decompress directly and particularly
> use metadata in it. The current format of those routines is close to
> what lz4 offers in terms of compression and decompression of a string,
> let's not break that we had a time hard enough in 9.5 cycle to get
> something clean.

The metadata is not used by the current code, only in the partial
decompression case.

> 
> By the way, why don't you compress the multiple chunks and store the
> related metadata at a higher level? There is no need to put that in
> pglz itself.

Yes, but the idea with chunks means creating a whole new structure, so
it can't be used when you want to optimize existing structures. For
example, you can't just change arrays that way, and I think there are
places where partial decompression could be used.

-- 
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Support of partial decompression for datums

From: Simon Riggs
On 4 December 2015 at 13:47, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
 
Attached patch adds support of partial decompression for datums.
It will be useful in many cases when extracting part of data is
enough for big varlena structures.

It is especially useful for expanded datums, because it provides
storage for partial results.

This isn't enough for anyone else to follow your thoughts and agree enough to commit.

Please explain the whole idea, starting from what problem you are trying to solve and how well this does it, why you did it this way and the other ways you tried/decided not to pursue. Thanks.
 
--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support of partial decompression for datums

From: Michael Paquier
On Sat, Dec 5, 2015 at 12:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 4 December 2015 at 13:47, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
>
>>
>> Attached patch adds support of partial decompression for datums.
>> It will be useful in many cases when extracting part of data is
>> enough for big varlena structures.
>>
>> It is especially useful for expanded datums, because it provides
>> storage for partial results.
>
>
> This isn't enough for anyone else to follow your thoughts and agree enough
> to commit.
>
> Please explain the whole idea, starting from what problem you are trying to
> solve and how well this does it, why you did it this way and the other ways
> you tried/decided not to pursue. Thanks.

Yeah, I would imagine that what Ildus is trying to achieve is
something close to LZ4_decompress_safe_partial: being able to stop
decompression after a certain amount of data has been produced, and to
continue from that point later.

And actually I think I get the idea. With his test case, what we get
first is a size, and then we reuse this size to extract only the
number of items we need from the tsvector. But that is actually linked
to the length of the compressed chunk, and in the end we might still
need to decompress the whole string; it is not possible to be sure
using the information provided.

Ildus, using your patch for tsvector, are you aiming at being able to
complete an operation using only a portion of the compressed data? Or
are you planning to use it to detect corrupted data in the chunk more
quickly? If it's the latter, we would still need to decompress the
whole string anyway, so a routine able to decompress only up to a
given position is not necessary, and based on the example given
upthread it is not possible to know which of the two you are after.
Hence, could you share your thoughts regarding your work on tsvector?

Changing the shape of pglz_decompress is still a bad idea anyway; I
guess we had better add something new like pglz_decompress_partial
instead.
-- 
Michael



Re: Support of partial decompression for datums

From: Ildus Kurbangaliev
On Sat, 5 Dec 2015 06:14:07 +0900
Michael Paquier <michael.paquier@gmail.com> wrote:

> On Sat, Dec 5, 2015 at 12:10 AM, Simon Riggs <simon@2ndquadrant.com>
> wrote:
> > On 4 December 2015 at 13:47, Ildus Kurbangaliev
> > <i.kurbangaliev@postgrespro.ru> wrote:
> >  
> >>
> >> Attached patch adds support of partial decompression for datums.
> >> It will be useful in many cases when extracting part of data is
> >> enough for big varlena structures.
> >>
> >> It is especially useful for expanded datums, because it provides
> >> storage for partial results.  
> >
> >
> > This isn't enough for anyone else to follow your thoughts and agree
> > enough to commit.
> >
> > Please explain the whole idea, starting from what problem you are
> > trying to solve and how well this does it, why you did it this way
> > and the other ways you tried/decided not to pursue. Thanks.  
> 
> Yeah, I would imagine that what Ildus is trying to achieve is
> something close to LZ4_decompress_safe_partial, by being able to stop
> compression after getting a certain amount of data decompressed, and
> continue working once again after.
> 
> And actually I think I get the idea. With his test case, what we get
> first is a size, and then we reuse this size to extract only what we
> need to fetch only a number of items from the tsvector. But that's
> actually linked to the length of the compressed chunk, and at the end
> we would still need to decompress the whole string perhaps, but it is
> not possible to be sure using the information provided.
> 
> Ildus, using your patch for tsvector, are you aiming at being able to
> complete an operation by only using a portion of the compressed data?
> Or are you planning to use that to improve the speed of detection of
> corrupted data in the chunk? If that's the latter, we would need to
> still decompress the whole string anyway, so having a routine able to
> decompress only until a given position is not necessary, and based on
> the example given upthread it is not possible to know what you are
> trying to achieve. Hence could you share your thoughts regarding your
> stuff with tsvector?
> 
> Changing pglz_decompress shape is still a bad idea anyway, I guess we
> had better have something new like pglz_decompress_partial instead.

Yes, you've got the idea. First we get the number of entries in the
tsvector; then, with that number, we can read the WordEntry values.
Each WordEntry contains the offset and length of its lexeme within the
data area. The information in these entries is enough to calculate how
far into the tsvector varlena we need to decompress.
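The offset arithmetic described above can be sketched as follows, with a simplified stand-in for tsvector's actual WordEntry layout (the struct and function names here are illustrative assumptions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for tsvector's WordEntry: the offset and length
 * of one lexeme within the data area that follows the entry array. */
typedef struct WordEntryStub
{
    uint32_t pos;   /* lexeme start, relative to the data area */
    uint32_t len;   /* lexeme length in bytes */
} WordEntryStub;

/*
 * How many raw bytes must be decompressed so the lexeme behind entry i
 * is fully available: the count header, the whole entry array, and the
 * data area up to the end of that lexeme.
 */
static size_t
bytes_needed_for_lexeme(const WordEntryStub *entries, size_t count, size_t i)
{
    return sizeof(int32_t)
        + count * sizeof(WordEntryStub)
        + entries[i].pos + entries[i].len;
}
```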

So, for example, in a binary search we decompress up to the middle of
the lexeme data (lexemes in a tsvector are sorted). If the search then
goes left, we just reuse the already decompressed block and never need
the rest of the tsvector. If the search goes right, we decompress half
of the remaining part using the saved state, and so on.

So in half of the cases we will decompress only half of the lexemes in
the tsvector, and even when we need to decompress more, in most cases
we will not decompress the whole tsvector.
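The lazy binary search described above can be sketched like this: track how much of the buffer has been materialized, and before probing an index, extend decompression only far enough to cover that index. The types and helper names are illustrative, and a prefix copy stands in for the real partial decompression call:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct LazyVector
{
    const int *compressed;      /* stand-in for the compressed datum */
    int       *buf;             /* partially decompressed entries */
    size_t     nloaded;         /* entries materialized so far */
    size_t     count;           /* total entries (from the size header) */
} LazyVector;

/* Extend decompression so entries [0, idx] are available; a real
 * implementation would call toast_decompress_datum_partial here. */
static void
ensure_loaded(LazyVector *v, size_t idx)
{
    if (idx >= v->nloaded)
    {
        memcpy(v->buf + v->nloaded,
               v->compressed + v->nloaded,
               (idx + 1 - v->nloaded) * sizeof(int));
        v->nloaded = idx + 1;
    }
}

/* Binary search that decompresses lazily: when the search goes left,
 * the already materialized prefix is simply reused. Returns 1 if key
 * is present, 0 otherwise. */
static int
lazy_search(LazyVector *v, int key)
{
    size_t lo = 0;
    size_t hi = v->count;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        ensure_loaded(v, mid);
        if (v->buf[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo >= v->count)
        return 0;
    ensure_loaded(v, lo);
    return v->buf[lo] == key;
}
```

Note how a search that terminates in the left half never touches the tail of the vector, which is exactly the saving claimed above.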

-- 
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company