Re: Yet another fast GiST build - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Yet another fast GiST build
Date
Msg-id 08173bd0-488d-da76-a904-912c35da446b@iki.fi
Whole thread Raw
In response to Re: Yet another fast GiST build  (Pavel Borisov <pashkin.elfe@gmail.com>)
Responses Re: Yet another fast GiST build  (Oleg Bartunov <obartunov@postgrespro.ru>)
List pgsql-hackers
On 07/09/2020 13:59, Pavel Borisov wrote:
>>>> I suppose there is a big jump in integer value (whether signed or
>>>> unsigned) as you cross from positive to negative floats, and then the
>>>> sort order is reversed.  I have no idea if either of those things is a
>>>> problem worth fixing.  That made me wonder if there might also be an
>>
>> I took a stab at fixing this, see attached patch (applies on top of your
>> patch v14).
>>
>> To evaluate this, I used the other attached patch to expose the zorder
>> function to SQL, and plotted points around zero with gnuplot. See the
>> attached two images, one with patch v14, and the other one with this patch.
> 
> I'd made testing of sorted SpGist build in cases of points distributed only
> in 2d quadrant and points in all 4 quadrants and it appears that this
> abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
> is really nice way to avoid what can be avoided and I'd like it is included
> together with Andrey's patch.

Thanks! Did you measure the quality of the built index somehow? The 
ordering shouldn't make any difference to the build speed, but it 
affects the shape of the resulting index and the speed of queries 
against it.

I played with some simple queries like this:

explain (analyze, buffers) select count(*) from points_good where p <@ 
box(point(50, 50), point(75, 75));

and looking at the "Buffers" line for how many pages were accessed. 
There doesn't seem to be any consistent difference between v14 and my 
fix. So I concur it doesn't seem to matter much.


I played some more with plotting the curve. I wrote a little python 
program to make an animation of it, and also simulated how the points 
would be divided into pages, assuming that each GiST page can hold 200 
tuples (I think the real number is around 150 with default page size). 
In the animation, the leaf pages appear as rectangles as it walks 
through the Z-order curve. This is just a simulation by splitting all 
the points into batches of 200 and drawing a bounding box around each 
batch. I haven't checked the actual pages as the GiST creates, but I 
think this gives a good idea of how it works.

The animation shows that there's quite a lot of overlap between the 
pages. It's not necessarily this patch's business to try to improve 
that, and the non-sorting index build isn't perfect either. But it 
occurs to me that there's maybe one pretty simple trick we could do: 
instead of blindly filling the leaf pages in Z-order, collect tuples 
into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples, 
or something in that ballpark, or maybe go all the way up to work_mem. 
When the buffer fills up, call the picksplit code to divide the buffer 
into the actual pages, and flush them to disk. If you look at the 
animation and imagine that you would take a handful of pages in the 
order they're created, and re-divide the points with the split 
algorithm, there would be much less overlap.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Juan José Santamaría Flecha
Date:
Subject: Re: A micro-optimisation for walkdir()
Next
From: Konstantin Knizhnik
Date:
Subject: Re: Improving connection scalability: GetSnapshotData()