Re: [HACKERS] [PATCH] kNN for SP-GiST - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: [HACKERS] [PATCH] kNN for SP-GiST
Date
Msg-id CAPpHfdvGtEzwRHqOUM67ixvqLo+ZbGhV+KEk_YMLZZM_WTCy0g@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] [PATCH] kNN for SP-GiST  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses Re: [HACKERS] [PATCH] kNN for SP-GiST
List pgsql-hackers
Hi!

On Tue, Aug 28, 2018 at 12:50 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Thu, Jul 26, 2018 at 8:39 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > I'm not sure in architectural point of view: supporting two ways (list and heap) to store result seems may be a bit
heavy,but OK. At least, it has meaningful benefits.
 
>
> It seems that Nikita have reworked it that way after my suspect that
> switching regular scans to pairing heap *might* cause a regression.
> However, I didn't insist that we need to support two ways.  For
> instance, if we can prove that there is no regression then it's fine
> to use just heap...

So, I decided to check does it really matter to support both list and
pairing heap?  Or can we just always use pairing heap?

I wrote following simple test.

Points are uniformly distributed in box (0,0)-(1,1).
# create table spgist_test as (select point(random(), random()) p from
generate_series(1,1000000) i);
# create index spgist_test_idx on spgist_test using spgist(p);

spgist_bench() walks over this box with given step and queries it each
time with boxes of given size.
CREATE FUNCTION spgist_bench(step float8, size float8) RETURNS void AS $$
DECLARE
    x float8;
    y float8;
BEGIN
    y := 0.0;
    WHILE y <= 1.0 LOOP
        x := 0.0;
        WHILE x <= 1.0 LOOP
            PERFORM * FROM spgist_test WHERE p <@ box(point(x,y),
point(x+size, y+size));
            x := x + step;
        END LOOP;
        y := y + step;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

It fist I've compared current patch with modified patch, which I made
to to always queue scan.

# Current patch (use list)
x     run 1 run 2 run 3
0.1    1206  1230  1231
0.01   2862  2813  2802
0.003 13915 13911 13900

# Always use queue
x     run 1 run 2 run 3
0.1    1238  1189  1170
0.01   2740  2852  2769
0.003 13627 13751 13757

And it appears that difference is less than statistical error.  But
then I compared that with master.  And it appears that master is much
faster.

# Master
x     run 1 run 2 run 3
0.1     725   691   700
0.01   1488  1480  1495
0.003  6696  6210  6295

It's probably because we're anyway allocating separate queue memory
context.  I'll explore this more and come up with details.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pgsql-hackers by date:

Previous
From: Yugo Nagata
Date:
Subject: Re: pg_verify_checksums -d option (was: Re: pg_verify_checksums -roption)
Next
From: Amit Langote
Date:
Subject: Re: speeding up planning with partitions