Re: Support of partial decompression for datums - Mailing list pgsql-hackers

From Ildus Kurbangaliev
Subject Re: Support of partial decompression for datums
Date
Msg-id 20151207124141.202c4ec4@lp
In response to Re: Support of partial decompression for datums  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
On Sat, 5 Dec 2015 06:14:07 +0900
Michael Paquier <michael.paquier@gmail.com> wrote:

> On Sat, Dec 5, 2015 at 12:10 AM, Simon Riggs <simon@2ndquadrant.com>
> wrote:
> > On 4 December 2015 at 13:47, Ildus Kurbangaliev
> > <i.kurbangaliev@postgrespro.ru> wrote:
> >  
> >>
> >> Attached patch adds support of partial decompression for datums.
> >> It will be useful in many cases when extracting part of data is
> >> enough for big varlena structures.
> >>
> >> It is especially useful for expanded datums, because it provides
> >> storage for partial results.  
> >
> >
> > This isn't enough for anyone else to follow your thoughts and agree
> > enough to commit.
> >
> > Please explain the whole idea, starting from what problem you are
> > trying to solve and how well this does it, why you did it this way
> > and the other ways you tried/decided not to pursue. Thanks.  
> 
> Yeah, I would imagine that what Ildus is trying to achieve is
> something close to LZ4_decompress_safe_partial, by being able to stop
> compression after getting a certain amount of data decompressed, and
> continue working once again after.
> 
> And actually I think I get the idea. With his test case, what we get
> first is a size, and then we reuse this size to extract only what we
> need to fetch only a number of items from the tsvector. But that's
> actually linked to the length of the compressed chunk, and at the end
> we would still need to decompress the whole string perhaps, but it is
> not possible to be sure using the information provided.
> 
> Ildus, using your patch for tsvector, are you aiming at being able to
> complete an operation by only using a portion of the compressed data?
> Or are you planning to use that to improve the speed of detection of
> corrupted data in the chunk? If that's the latter, we would need to
> still decompress the whole string anyway, so having a routine able to
> decompress only until a given position is not necessary, and based on
> the example given upthread it is not possible to know what you are
> trying to achieve. Hence could you share your thoughts regarding your
> stuff with tsvector?
> 
> Changing pglz_decompress shape is still a bad idea anyway, I guess we
> had better have something new like pglz_decompress_partial instead.

Yes, you've got the idea. First we read the number of entries in the
tsvector; with that count we can read the WordEntry values. Each
WordEntry contains the offset of a lexeme in the data and the length of
that lexeme. The information in these entries is enough to calculate
the offset up to which we need to decompress the tsvector varlena.
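
The offset calculation described above can be sketched roughly as
follows. This is only an illustration, not code from the patch:
DemoWordEntry is a simplified stand-in for WordEntry (the real struct
packs its fields into bitfields), and DEMO_HEADER_BYTES and the demo_*
function names are hypothetical. It assumes the entry array follows an
8-byte header and that lexeme data follows the entry array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for WordEntry; not the real PostgreSQL layout. */
typedef struct
{
    uint32_t len;   /* length of the lexeme in bytes */
    uint32_t pos;   /* offset of the lexeme from the start of lexeme data */
} DemoWordEntry;

/* Assumed header: varlena length word plus the entry count. */
#define DEMO_HEADER_BYTES (2 * sizeof(int32_t))

/* Bytes of the raw datum needed to read the whole entry array. */
static size_t
demo_bytes_for_entries(uint32_t nentries)
{
    return DEMO_HEADER_BYTES + (size_t) nentries * sizeof(DemoWordEntry);
}

/* Bytes of the raw datum needed so that lexeme idx is fully available. */
static size_t
demo_bytes_for_lexeme(const DemoWordEntry *entries, uint32_t nentries,
                      uint32_t idx)
{
    assert(idx < nentries);
    return demo_bytes_for_entries(nentries)
        + entries[idx].pos + entries[idx].len;
}
```

A caller would first decompress DEMO_HEADER_BYTES to learn the entry
count, then demo_bytes_for_entries() bytes to read the entries, and
only then as far as demo_bytes_for_lexeme() says for the lexeme it
wants to look at.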

For example, in a binary search we decompress up to the middle of the
lexeme data (lexemes in a tsvector are sorted). If the search goes
left, we just reuse the block that is already decompressed and never
need the rest of the tsvector. If the search goes right, we decompress
half of the remaining part using the saved state, and so on.
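
A minimal sketch of that search, under stated assumptions: the
"decompressor" here just copies bytes and stands in for a resumable
pglz routine, only the lexeme data is modelled (offsets are relative
to its start), lexemes are assumed to be sorted byte-wise, and every
Demo*/demo_* name is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct
{
    size_t pos;     /* offset of the lexeme in the lexeme data */
    size_t len;     /* length of the lexeme in bytes */
} DemoEntry;

/* Saved state of a hypothetical resumable decompressor. */
typedef struct
{
    const char *src;    /* stand-in for the compressed lexeme data */
    size_t      total;  /* size of the fully decompressed data */
    char       *dst;    /* output buffer of at least total bytes */
    size_t      done;   /* bytes decompressed so far */
} DemoPartialState;

/* Extend the decompressed prefix to upto bytes; no-op if already there. */
static void
demo_decompress_upto(DemoPartialState *st, size_t upto)
{
    assert(upto <= st->total);
    if (upto > st->done)
    {
        /* A real implementation would resume decoding here. */
        memcpy(st->dst + st->done, st->src + st->done, upto - st->done);
        st->done = upto;
    }
}

/* Byte-wise comparison, shorter string first on a common prefix. */
static int
demo_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int    c = memcmp(a, b, n);

    if (c != 0)
        return c;
    return (alen > blen) - (alen < blen);
}

/* Binary search; returns the index of key, or -1 if it is absent. */
static int
demo_find_lexeme(DemoPartialState *st, const DemoEntry *entries,
                 int nentries, const char *key)
{
    size_t keylen = strlen(key);
    int    lo = 0;
    int    hi = nentries - 1;

    while (lo <= hi)
    {
        int              mid = lo + (hi - lo) / 2;
        const DemoEntry *e = &entries[mid];
        int              c;

        /* Decompress only as far as the end of the probed lexeme. */
        demo_decompress_upto(st, e->pos + e->len);
        c = demo_compare(key, keylen, st->dst + e->pos, e->len);
        if (c == 0)
            return mid;
        if (c < 0)
            hi = mid - 1;   /* left: data is already decompressed */
        else
            lo = mid + 1;   /* right: next probe extends the prefix */
    }
    return -1;
}
```

After a search, st->done shows how much of the data had to be
decompressed; a key in the left half never pushes it past the end of
the first probed lexeme.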

So in half of the cases we decompress only half of the lexemes in the
tsvector, and even when we need to decompress more, in most cases we
still do not decompress the whole tsvector.

-- 
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


