Thread: Re: pgsql: Optimize pglz compressor for small inputs.

Re: pgsql: Optimize pglz compressor for small inputs.

From
Stephen Frost
Date:
Heikki,

* Heikki Linnakangas (heikki.linnakangas@iki.fi) wrote:
> This patch alleviates that in two ways. First, instead of storing pointers
> in the hash table, store 16-bit indexes into the hist_entries array. That
> slashes the size of the hash table to 1/2 or 1/4 of the original, depending
> on the pointer width. Secondly, adjust the size of the hash table based on
> input size. For very small inputs, you don't need a large hash table to
> avoid collisions.
 The coverity scanner has a bit of an issue with this patch which, at least on first blush, looks like a valid concern.
While the change in pg_lzcompress.c:pglz_find_match() to loop on:  while (hent != INVALID_ENTRY_PTR) {  const char *ip
=input;  const char *hp = hent->pos; 
 looks good, and INVALID_ENTRY_PTR is the address of the first entry in the array (and can't be NULL), towards the end
ofthe loop we do: 
 hent = hent->next; if (hent)  ...
 Should we really be checking for 'hent != INVALID_ENTRY_PTR' here?  If not, and hent really can end up as NULL, then
we'regoing to segfault on the next loop due to the unchecked 'hent->pos' early in the loop. If hent can never be NULL,
thenwe probably don't need this check at all. 
     Thanks,
    Stephen

Re: pgsql: Optimize pglz compressor for small inputs.

From
Heikki Linnakangas
Date:
On 14.07.2013 20:12, Stephen Frost wrote:
> * Heikki Linnakangas (heikki.linnakangas@iki.fi) wrote:
>> This patch alleviates that in two ways. First, instead of storing pointers
>> in the hash table, store 16-bit indexes into the hist_entries array. That
>> slashes the size of the hash table to 1/2 or 1/4 of the original, depending
>> on the pointer width. Secondly, adjust the size of the hash table based on
>> input size. For very small inputs, you don't need a large hash table to
>> avoid collisions.
>
>    The coverity scanner has a bit of an issue with this patch which, at
>    least on first blush, looks like a valid concern.
>
>    While the change in pg_lzcompress.c:pglz_find_match() to loop on:
>
>    while (hent != INVALID_ENTRY_PTR)
>    {
>       const char *ip = input;
>       const char *hp = hent->pos;
>
>    looks good, and INVALID_ENTRY_PTR is the address of the first entry in
>    the array (and can't be NULL), towards the end of the loop we do:
>
>    hent = hent->next;
>    if (hent)
>       ...
>
>    Should we really be checking for 'hent != INVALID_ENTRY_PTR' here?  If
>    not, and hent really can end up as NULL, then we're going to segfault
>    on the next loop due to the unchecked 'hent->pos' early in the loop.
>    If hent can never be NULL, then we probably don't need this check at
>    all.

hent can never be NULL, the code should indeed check for 'hent != 
INVALID_ENTRY_PTR'. The check is not required from a correctness point 
of view, the idea is just to avoid calculating the 'good_match' 
variable, if you're going to fall out of the loop anyway.

I'm actually a bit surprised the compiler doesn't optimize it that way 
anyway, without the explicit if-block, but at least "gcc -O2" (version 
4.7.3) seem to do that. So, I guess we should keep the check.

Committed '(hent)' -> '(hent != INVALID_ENTRY_PTR)'. Thanks for the report!

- Heikki