Re: GIN improvements part2: fast scan - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: GIN improvements part2: fast scan
Date
Msg-id 52EEF472.9020106@fuzzy.cz
Whole thread Raw
In response to Re: GIN improvements part2: fast scan  (Tomas Vondra <tv@fuzzy.cz>)
Responses Re: GIN improvements part2: fast scan  (Oleg Bartunov <obartunov@gmail.com>)
Re: GIN improvements part2: fast scan  (Jesper Krogh <jesper@krogh.cc>)
List pgsql-hackers
On 3.2.2014 00:13, Tomas Vondra wrote:
> On 2.2.2014 11:45, Heikki Linnakangas wrote:
>> On 01/30/2014 01:53 AM, Tomas Vondra wrote:
>>> (3) A file with explain plans for 4 queries suffering ~2x slowdown,
>>>      and explain plans with 9.4 master and Heikki's patches is available
>>>      here:
>>>
>>>        http://www.fuzzy.cz/tmp/gin/queries.txt
>>>
>>>      All the queries have 6 common words, and the explain plans look
>>>      just fine to me - exactly like the plans for other queries.
>>>
>>>      Two things now caught my eye. First some of these queries actually
>>>      have words repeated - either exactly like "database & database" or
>>>      in negated form like "!anything & anything". Second, while
>>>      generating the queries, I use "dumb" frequency, where only exact
>>>      matches count. I.e. "write != written" etc. But the actual number
>>>      of hits may be much higher - for example "write" matches exactly
>>>      just 5% documents, but using @@ it matches more than 20%.
>>>
>>>      I don't know if that's the actual cause though.
>>
>> Ok, here's another variant of these patches. Compared to git master, it
>> does three things:
>>
>> 1. It adds the concept of ternary consistent function internally, but no
>> catalog changes. It's implemented by calling the regular boolean
>> consistent function "both ways".
>>
>> 2. Use a binary heap to get the "next" item from the entries in a scan.
>> I'm pretty sure this makes sense, because arguably it makes the code
>> more readable, and reduces the number of item pointer comparisons
>> significantly for queries with a lot of entries.
>>
>> 3. Only perform the pre-consistent check to try skipping entries, if we
>> don't already have the next item from the entry loaded in the array.
>> This is a tradeoff, you will lose some of the performance gain you might
>> get from pre-consistent checks, but it also limits the performance loss
>> you might get from doing useless pre-consistent checks.
>>
>> So taken together, I would expect this patch to make some of the
>> performance gains less impressive, but also limit the loss we saw with
>> some of the other patches.
>>
>> Tomas, could you run your test suite with this patch, please?
>
> Sure, will do. Do I get it right that this should be applied instead of
> the four patches you've posted earlier?

So, I was curious and did a basic testing - I've repeated the tests on
current HEAD and 'HEAD with the new patch'. The complete data are
available at [http://www.fuzzy.cz/tmp/gin/gin-scan-benchmarks.ods] and
I've updated the charts at [http://www.fuzzy.cz/tmp/gin/] too.

Look for branches named 9.4-head-2 and 9.4-heikki-2.

To me it seems that:

(1) The main issue was that with common words, it used to be much
    slower than HEAD (or 9.3). This seems to be fixed, i.e. it's not
    slower than before. See

      http://www.fuzzy.cz/tmp/gin/3-common-words.png (previous patch)
      http://www.fuzzy.cz/tmp/gin/3-common-words-new.png (new patch)

    for comparison vs. 9.4 HEAD. With the new patch there's no slowdown,
    which seems nice. Compared to 9.3 it looks like this:

      http://www.fuzzy.cz/tmp/gin/3-common-words-new-vs-93.png

    so there's a significant speedup (thanks to the modified layout).

(2) The question is whether the new patch works fine on rare words. See
    this for comparison of the patches against HEAD:

      http://www.fuzzy.cz/tmp/gin/3-rare-words.png
      http://www.fuzzy.cz/tmp/gin/3-rare-words-new.png

    and this is the comparison of the two patches:

      http://www.fuzzy.cz/tmp/gin/patches-rare-words.png

    That seems fine to me - some queries are slower, but we're talking
    about queries taking 1 or 2 ms, so the measurement error is probably
    the main cause of the differences.

(3) With higher numbers of frequent words, the differences (vs. HEAD or
    the previous patch) are not that dramatic as in (1) - the new patch
    is consistently by ~20% faster.

Tomas

Attachment

pgsql-hackers by date:

Previous
From: Craig Ringer
Date:
Subject: Re: narwhal and PGDLLIMPORT
Next
From: Fujii Masao
Date:
Subject: Re: pg_basebackup and pg_stat_tmp directory