Thread: WIP: Fast GiST index build

WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hackers,

A WIP patch for fast GiST index build is attached. The code is dirty and comments are lacking, but it works. It is now ready for first benchmarks, which should prove the efficiency of the selected technique. It's time to compare the fast GiST index build with the repeated-insert build on large enough datasets (datasets which don't fit in cache). The aims of testing are as follows:
1) Measure the acceleration of index build.
2) Measure the change in index quality.
I'm going to do the first testing using synthetic datasets. Everybody who has interesting real-life datasets for testing is welcome.
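
(Roughly the kind of measurement intended: an illustrative psql session; the table and index names here match ones used later in the thread.)

-- Time the index build and record the resulting index size,
-- once with the patch applied and once without.
\timing on
CREATE INDEX test_idx ON test USING gist (v);
SELECT pg_size_pretty(pg_relation_size('test_idx'));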

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 03.06.2011 14:02, Alexander Korotkov wrote:
> Hackers,
>
> A WIP patch for fast GiST index build is attached. The code is dirty and
> comments are lacking, but it works. It is now ready for first benchmarks,
> which should prove the efficiency of the selected technique. It's time to
> compare the fast GiST index build with the repeated-insert build on large
> enough datasets (datasets which don't fit in cache). The aims of testing
> are as follows:
> 1) Measure the acceleration of index build.
> 2) Measure the change in index quality.
> I'm going to do the first testing using synthetic datasets. Everybody who
> has interesting real-life datasets for testing is welcome.

I did some quick performance testing of this. I installed PostGIS 1.5,
and loaded an extract of the OpenStreetMap data covering Finland. The
biggest GiST index in that data set is the idx_nodes_geom index on the
nodes table. I have maintenance_work_mem and shared_buffers both set to
512 MB, and this laptop has 4GB of RAM.

Without the patch, reindexing the index takes about 170 seconds and the 
index size is 321 MB. And with the patch, it takes about 150 seconds, 
and the resulting index size is 319 MB.

The nodes table is 618MB in size, so it fits in RAM. I presume the gain 
would be bigger if it doesn't, as the random I/O to update the index 
starts to hurt more. But this shows that even when it does, this patch 
helps a little bit, and the resulting index size is comparable.
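
(For reference, a sketch of how such a measurement can be taken, assuming psql's \timing; the index name is from the PostGIS schema above.)

\timing on
REINDEX INDEX idx_nodes_geom;
SELECT pg_size_pretty(pg_relation_size('idx_nodes_geom'));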

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 03.06.2011 14:02, Alexander Korotkov wrote:
> Hackers,
>
> A WIP patch for fast GiST index build is attached. The code is dirty and
> comments are lacking, but it works. It is now ready for first benchmarks,
> which should prove the efficiency of the selected technique. It's time to
> compare the fast GiST index build with the repeated-insert build on large
> enough datasets (datasets which don't fit in cache). The aims of testing
> are as follows:
> 1) Measure the acceleration of index build.
> 2) Measure the change in index quality.
> I'm going to do the first testing using synthetic datasets. Everybody who
> has interesting real-life datasets for testing is welcome.

I ran another test with a simple table generated with:

CREATE TABLE pointtest (p point);
INSERT INTO pointtest SELECT point(random(), random()) FROM 
generate_series(1,50000000);

Generating a gist index with:

CREATE INDEX i_pointtest ON pointtest USING gist (p);

took about 15 hours without the patch, and 2 hours with it. That's quite 
dramatic.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 06.06.2011 10:42, Heikki Linnakangas wrote:
> On 03.06.2011 14:02, Alexander Korotkov wrote:
>> Hackers,
>>
>> A WIP patch for fast GiST index build is attached. The code is dirty and
>> comments are lacking, but it works. It is now ready for first benchmarks,
>> which should prove the efficiency of the selected technique. It's time to
>> compare the fast GiST index build with the repeated-insert build on large
>> enough datasets (datasets which don't fit in cache). The aims of testing
>> are as follows:
>> 1) Measure the acceleration of index build.
>> 2) Measure the change in index quality.
>> I'm going to do the first testing using synthetic datasets. Everybody who
>> has interesting real-life datasets for testing is welcome.
>
> I ran another test with a simple table generated with:
>
> CREATE TABLE pointtest (p point);
> INSERT INTO pointtest SELECT point(random(), random()) FROM
> generate_series(1,50000000);
>
> Generating a gist index with:
>
> CREATE INDEX i_pointtest ON pointtest USING gist (p);
>
> took about 15 hours without the patch, and 2 hours with it. That's quite
> dramatic.

Oops, that was a rounding error, sorry. The run took about 2.7 hours 
with the patch, which of course should be rounded to 3 hours, not 2. 
Anyway, it is still a very impressive improvement.

I'm glad you could get the patch ready for benchmarking this quickly. 
Now you just need to get the patch into shape so that it can be 
committed. That is always the more time-consuming part, so I'm glad you 
have plenty of time left for it.

Could you please create a TODO list on the wiki page, listing all the 
missing features, known bugs etc. that will need to be fixed? That'll 
make it easier to see how much work there is left. It'll also help 
anyone looking at the patch to know which issues are known issues.

Meanwhile, it would still be very valuable if others could test this 
with different workloads. And Alexander, it would be good if at some 
point you could write some benchmark scripts too, and put them on the 
wiki page, just to see what kind of workloads have been taken into 
consideration and tested already. Do you think there's some worst-case 
data distributions where this algorithm would perform particularly badly?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hi!

On Mon, Jun 6, 2011 at 2:51 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 06.06.2011 10:42, Heikki Linnakangas wrote:
I ran another test with a simple table generated with:

CREATE TABLE pointtest (p point);
INSERT INTO pointtest SELECT point(random(), random()) FROM
generate_series(1,50000000);

Generating a gist index with:

CREATE INDEX i_pointtest ON pointtest USING gist (p);

took about 15 hours without the patch, and 2 hours with it. That's quite
dramatic.

Oops, that was a rounding error, sorry. The run took about 2.7 hours with the patch, which of course should be rounded to 3 hours, not 2. Anyway, it is still a very impressive improvement.
I have similar results on 100 million rows: 21.6 hours without the patch and 2 hours with it. But I found a problem: index quality is worse. See the following query plans, where test is a relation whose index was built the ordinary way, and test2 is a relation where the patch was used.

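(The plans below were produced by queries of this form, reconstructed from the recheck conditions; the same EXPLAIN invocation appears verbatim later in the thread.)

EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM test  WHERE v <@ '(0.903,0.203),(0.9,0.2)'::box;
EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM test2 WHERE v <@ '(0.903,0.203),(0.9,0.2)'::box;
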
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=4391.01..270397.31 rows=100000 width=20) (actual time=1.257..2.147 rows=838 loops=1)
   Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
   Buffers: shared hit=968
   ->  Bitmap Index Scan on test_idx  (cost=0.00..4366.01 rows=100000 width=0) (actual time=1.162..1.162 rows=838 loops=1)
         Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
         Buffers: shared hit=131
 Total runtime: 2.214 ms
(7 rows)

                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test2  (cost=4370.84..270377.13 rows=100000 width=20) (actual time=5.252..6.056 rows=838 loops=1)
   Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
   Buffers: shared hit=1458
   ->  Bitmap Index Scan on test2_idx  (cost=0.00..4345.84 rows=100000 width=0) (actual time=5.155..5.155 rows=838 loops=1)
         Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
         Buffers: shared hit=621
 Total runtime: 6.121 ms
(7 rows)

                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=4391.01..270397.31 rows=100000 width=20) (actual time=2.148..2.977 rows=850 loops=1)
   Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
   Buffers: shared hit=1099
   ->  Bitmap Index Scan on test_idx  (cost=0.00..4366.01 rows=100000 width=0) (actual time=2.052..2.052 rows=850 loops=1)
         Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
         Buffers: shared hit=249
 Total runtime: 3.033 ms
(7 rows)

                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test2  (cost=4370.84..270377.13 rows=100000 width=20) (actual time=6.806..7.602 rows=850 loops=1)
   Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
   Buffers: shared hit=1615
   ->  Bitmap Index Scan on test2_idx  (cost=0.00..4345.84 rows=100000 width=0) (actual time=6.709..6.709 rows=850 loops=1)
         Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
         Buffers: shared hit=773
 Total runtime: 7.667 ms
(7 rows)

We can see that the index scan requires reading several times more pages. The original paper notes such an effect, explaining it by less optimally arranged routing rectangles. But the effect wasn't so dramatic in the tests provided in the paper. So, I have the following thoughts about this problem:

1) The number of pages read from the index is too large even with the ordinary index build. Querying a small area requires reading hundreds of pages. This is probably caused by the picksplit implementation. I have a version of the picksplit algorithm which seems to be much more efficient; I'll run some benchmarks with it. I hope the difference in index quality will then be not so dramatic.

2) I can try some enhancements in the fast build algorithm which could improve tree quality. In the original paper a Hilbert heuristic was used to achieve even better tree quality than a tree created the ordinary way. But since we use GiST we are restricted by its interface (or we would have to create new interface function(s), which I would like to avoid). I would like to try some ordering by penalty value in the buffer-emptying process, and buffer relocation on split.

3) Probably, there is some bug which affects tree quality.
 
Could you please create a TODO list on the wiki page, listing all the missing features, known bugs etc. that will need to be fixed? That'll make it easier to see how much work there is left. It'll also help anyone looking at the patch to know which issues are known issues.
Sure, I'll create such a list on the wiki page. I believe that currently the most important issue is index quality.
 
Meanwhile, it would still be very valuable if others could test this with different workloads. And Alexander, it would be good if at some point you could write some benchmark scripts too, and put them on the wiki page, just to see what kind of workloads have been taken into consideration and tested already. Do you think there's some worst-case data distributions where this algorithm would perform particularly badly?
I don't expect bad cases in terms of IO. My main worry is index quality, which is strongly related to the data distribution.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Mon, Jun 6, 2011 at 2:51 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Do you think there's some worst-case data distributions where this algorithm would perform particularly badly?
I think there could be some worst-case GiST applications. Right now the GiST fast build algorithm invokes more penalty calls than the repeated-insert algorithm. If I succeed then it will invoke even more such calls. So, if the penalty function is very slow, then GiST fast build will be slower than repeated insert.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Mon, Jun 6, 2011 at 4:14 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
If I succeed then it will invoke even more such calls.
I meant here that if I succeed with the enhancements which improve index quality, then the fast build algorithm will invoke even more such calls.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
I've tried sorting index tuples by penalty function before buffer relocation on split, but without any success: index quality became even worse than without sorting.
The next thing I've tried is buffer relocation between all neighbor buffers. The results of the first tests are much more promising: the number of page accesses during an index scan is similar to that without fast index build. I'm going to stick with this approach.

test=# create index test_idx on test using gist(v);
NOTICE:  Level step = 1, pagesPerBuffer = 406
CREATE INDEX
Time: 10002590,469 ms

test=# select pg_size_pretty(pg_relation_size('test_idx'));
 pg_size_pretty 
----------------
 6939 MB
(1 row)

test=# explain (analyze, buffers) select * from test where v <@ '(0.903,0.203),(0.9,0.2)'::box;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=4366.78..258752.22 rows=100000 width=16) (actual time=1.412..2.295 rows=897 loops=1)
   Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
   Buffers: shared hit=1038
   ->  Bitmap Index Scan on test_idx  (cost=0.00..4341.78 rows=100000 width=0) (actual time=1.311..1.311 rows=897 loops=1)
         Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
         Buffers: shared hit=141
 Total runtime: 2.375 ms
(7 rows)

test=# explain (analyze, buffers) select * from test where v <@ '(0.503,0.503),(0.5,0.5)'::box;

                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=4366.78..258752.22 rows=100000 width=16) (actual time=2.113..2.972 rows=855 loops=1)
   Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
   Buffers: shared hit=1095
   ->  Bitmap Index Scan on test_idx  (cost=0.00..4341.78 rows=100000 width=0) (actual time=2.016..2.016 rows=855 loops=1)
         Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
         Buffers: shared hit=240
 Total runtime: 3.043 ms
(7 rows)


------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Wed, Jun 15, 2011 at 11:21 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I've tried sorting index tuples by penalty function before buffer relocation on split, but without any success: index quality became even worse than without sorting.
The next thing I've tried is buffer relocation between all neighbor buffers. The results of the first tests are much more promising: the number of page accesses during an index scan is similar to that without fast index build. I'm going to stick with this approach.

test=# create index test_idx on test using gist(v);
NOTICE:  Level step = 1, pagesPerBuffer = 406
CREATE INDEX
Time: 10002590,469 ms
I forgot to say that the build time increases by about 40%, but it is still about 10 times faster than a regular build.

------
With best regards,
Alexander Korotkov.  

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 15.06.2011 10:24, Alexander Korotkov wrote:
> On Wed, Jun 15, 2011 at 11:21 AM, Alexander Korotkov
> <aekorotkov@gmail.com>wrote:
>
>> I've tried sorting index tuples by penalty function before buffer
>> relocation on split, but without any success: index quality became even
>> worse than without sorting.
>> The next thing I've tried is buffer relocation between all neighbor
>> buffers. The results of the first tests are much more promising: the
>> number of page accesses during an index scan is similar to that without
>> fast index build. I'm going to stick with this approach.
>>
>> test=# create index test_idx on test using gist(v);
>> NOTICE:  Level step = 1, pagesPerBuffer = 406
>> CREATE INDEX
>> Time: 10002590,469 ms
>>
> I forgot to say that the build time increases by about 40%, but it is
> still about 10 times faster than a regular build.

Is this relocation mechanism something that can be tuned, for different 
tradeoffs between index quality and build time? In any case, it seems 
that we're going to need a lot of testing with different data sets to 
get a better picture of how this performs. But at least for now, it 
looks like this approach is going to be acceptable.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Wed, Jun 15, 2011 at 12:03 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Is this relocation mechanism something that can be tuned, for different tradeoffs between index quality and build time?
Yes, it can. I believe it could be an index parameter.
In any case, it seems that we're going to need a lot of testing with different data sets to get a better picture of how this performs.
Sure. My problem is that I don't have large enough real-life datasets. The picture from synthetic datasets can be unrepresentative of real-life cases, and on the smaller datasets that I have I can really only compare index quality. Also, tests with large datasets take a long time, especially without fast build. A possible solution is to limit the cache size during testing; that should allow measuring the I/O benefit even on relatively small datasets. But so far I don't know how to do that on Linux.

------
With best regards,
Alexander Korotkov.  

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Actually, I would like to measure CPU and IO load independently, for more comprehensive benchmarks. Can you advise me on some appropriate tools for that?

------
With best regards,
Alexander Korotkov.  

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
My current idea is to measure the number of IO accesses with pg_stat_statements and to measure CPU usage via /proc/PID/stat. Any thoughts?
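
(If going the pg_stat_statements route, an illustrative query; the module must be in shared_preload_libraries, and the buffer-usage columns assume a 9.0+ server.)

-- Top statements by shared blocks actually read (i.e. not found in shared_buffers)
SELECT query, calls, shared_blks_read, shared_blks_hit
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 5;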

------
With best regards,
Alexander Korotkov.  


On Thu, Jun 16, 2011 at 1:33 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Actually, I would like to measure CPU and IO load independently, for more comprehensive benchmarks. Can you advise me on some appropriate tools for that?

------
With best regards,
Alexander Korotkov.  

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 16.06.2011 21:13, Alexander Korotkov wrote:
> My current idea is to measure the number of IO accesses with
> pg_stat_statements and to measure CPU usage via /proc/PID/stat. Any thoughts?

Actually, you get both of those very easily with:

set log_statement_stats=on

LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
        !       0.000990 elapsed 0.000000 user 0.000000 system sec
        !       [0.000000 user 0.008000 sys total]
        !       0/0 [32/0] filesystem blocks in/out
        !       0/0 [0/959] page faults/reclaims, 0 [0] swaps
        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
        !       0/0 [10/1] voluntary/involuntary context switches
STATEMENT:  SELECT generate_series(1,100);


--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Oh, it's actually that easy. Thanks.

------
With best regards,
Alexander Korotkov.  

On Thu, Jun 16, 2011 at 10:26 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 16.06.2011 21:13, Alexander Korotkov wrote:
My current idea is to measure the number of IO accesses with pg_stat_statements
and to measure CPU usage via /proc/PID/stat. Any thoughts?

Actually, you get both of those very easily with:

set log_statement_stats=on

LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
       !       0.000990 elapsed 0.000000 user 0.000000 system sec
       !       [0.000000 user 0.008000 sys total]
       !       0/0 [32/0] filesystem blocks in/out
       !       0/0 [0/959] page faults/reclaims, 0 [0] swaps
       !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
       !       0/0 [10/1] voluntary/involuntary context switches
STATEMENT:  SELECT generate_series(1,100);



--
 Heikki Linnakangas
 EnterpriseDB   http://www.enterprisedb.com

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
New version of the patch. There are some bugfixes, minor refactoring, and comments (unfortunately, not all of the code is covered by comments yet). Also, a "fastbuild" parameter was added to the GiST index; it allows testing index building with and without fast build, without recompiling postgres.
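
(Presumably exposed as a reloption; an illustrative sketch assuming the standard WITH (...) syntax, with the parameter name taken from the description above.)

-- Build using the fast (buffering) algorithm
CREATE INDEX test_idx ON test USING gist (v) WITH (fastbuild = on);
-- Build the ordinary way, for comparison
CREATE INDEX test_idx_plain ON test USING gist (v) WITH (fastbuild = off);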

------
With best regards,
Alexander Korotkov.  

On Thu, Jun 16, 2011 at 10:35 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Oh, it's actually that easy. Thanks.

------
With best regards,
Alexander Korotkov.  

On Thu, Jun 16, 2011 at 10:26 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 16.06.2011 21:13, Alexander Korotkov wrote:
My current idea is to measure the number of IO accesses with pg_stat_statements
and to measure CPU usage via /proc/PID/stat. Any thoughts?

Actually, you get both of those very easily with:

set log_statement_stats=on

LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
       !       0.000990 elapsed 0.000000 user 0.000000 system sec
       !       [0.000000 user 0.008000 sys total]
       !       0/0 [32/0] filesystem blocks in/out
       !       0/0 [0/959] page faults/reclaims, 0 [0] swaps
       !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
       !       0/0 [10/1] voluntary/involuntary context switches
STATEMENT:  SELECT generate_series(1,100);



--
 Heikki Linnakangas
 EnterpriseDB   http://www.enterprisedb.com


Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hi!

I've created a section about testing on the project wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results
Do you have any notes about the table structure?
As you can see, I found that CPU usage might be much higher with gist_trgm_ops. I believe it's due to the relatively expensive penalty method in that opclass. But the index build can probably still be faster when the index doesn't fit in cache, even for gist_trgm_ops. Also, with that opclass the index quality is slightly worse, but the difference is not dramatic.

------
With best regards,
Alexander Korotkov.  

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 21.06.2011 13:08, Alexander Korotkov wrote:
> I've created a section about testing on the project wiki page:
> http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results
> Do you have any notes about the table structure?

It would be nice to have links to the datasets and scripts used, so that 
others can reproduce the tests.

It's surprising that the search time differs so much between the 
point_ops tests with uniformly random data with 100M and 10M rows. Just 
to be sure I'm reading it correctly: a small search time is good, right? 
You might want to spell that out explicitly.

> As you can see, I found that CPU usage might be much higher
> with gist_trgm_ops.

Yeah, that is a bit worrisome. 6 minutes without the patch and 18
minutes with it.

> I believe it's due to the relatively expensive penalty
> method in that opclass.

Hmm, I wonder if it could be optimized. I did a quick test, creating a 
gist_trgm_ops index on a list of English words from 
/usr/share/dict/words. oprofile shows that with the patch, 60% of the 
CPU time is spent in the makesign() function.

> But the index build can probably still be faster when
> the index doesn't fit in cache, even for gist_trgm_ops.

Yep.

> Also, with that opclass the index
> quality is slightly worse, but the difference is not dramatic.

A 5-10% difference should be acceptable.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Fri, Jun 24, 2011 at 12:40 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 21.06.2011 13:08, Alexander Korotkov wrote:
I've created a section about testing on the project wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results
Do you have any notes about the table structure?

It would be nice to have links to the datasets and scripts used, so that others can reproduce the tests.
Done.
 
It's surprising that the search time differs so much between the point_ops tests with uniformly random data with 100M and 10M rows. Just to be sure I'm reading it correctly: a small search time is good, right? You might want to spell that out explicitly.
Yes, you're reading it correctly. A detailed explanation has been added to the wiki page. It's surprising for me too; I need some more insight into the causes of the index quality difference.

I have now found some large enough real-life datasets (thanks to Oleg Bartunov) and I'm running tests on them.

------
With best regards,
Alexander Korotkov.  

Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 24.06.2011 11:40, Heikki Linnakangas wrote:
> On 21.06.2011 13:08, Alexander Korotkov wrote:
>> I believe it's due to the relatively expensive penalty
>> method in that opclass.
>
> Hmm, I wonder if it could be optimized. I did a quick test, creating a
> gist_trgm_ops index on a list of English words from
> /usr/share/dict/words. oprofile shows that with the patch, 60% of the
> CPU time is spent in the makesign() function.

I couldn't resist looking into this, and came up with the attached
patch. I tested this with:

CREATE TABLE words (word text);
COPY words FROM '/usr/share/dict/words';
CREATE INDEX i_words ON words USING gist (word gist_trgm_ops);

And then ran "REINDEX INDEX i_words" a few times with and without the
patch. Without the patch, reindex takes about 4.7 seconds. With the
patch, 3.7 seconds. That's a worthwhile gain on its own, but becomes
even more important with Alexander's fast GiST build patch, which calls
the penalty function more.

I used the attached showsign-debug.patch to verify that the patched
makesign function produces the same results as the existing code. I
haven't tested the big-endian code, however.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

From
Robert Haas
Date:
On Fri, Jun 24, 2011 at 12:51 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 24.06.2011 11:40, Heikki Linnakangas wrote:
>>
>> On 21.06.2011 13:08, Alexander Korotkov wrote:
>>>
>>> I believe it's due to the relatively expensive penalty
>>> method in that opclass.
>>
>> Hmm, I wonder if it could be optimized. I did a quick test, creating a
>> gist_trgm_ops index on a list of English words from
>> /usr/share/dict/words. oprofile shows that with the patch, 60% of the
>> CPU time is spent in the makesign() function.
>
> I couldn't resist looking into this, and came up with the attached patch. I
> tested this with:
>
> CREATE TABLE words (word text);
> COPY words FROM '/usr/share/dict/words';
> CREATE INDEX i_words ON words USING gist (word gist_trgm_ops);
>
> And then ran "REINDEX INDEX i_words" a few times with and without the patch.
> Without the patch, reindex takes about 4.7 seconds. With the patch, 3.7
> seconds. That's a worthwhile gain on its own, but becomes even more
> important with Alexander's fast GiST build patch, which calls the penalty
> function more.
>
> I used the attached showsign-debug.patch to verify that the patched makesign
> function produces the same results as the existing code. I haven't tested
> the big-endian code, however.

Out of curiosity (and because there is no comment or Assert here), how
can you be so sure of the input alignment?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 24.06.2011 21:24, Robert Haas wrote:
> Out of curiosity (and because there is no comment or Assert here), how
> can you be so sure of the input alignment?

The input TRGM to makesign() is a varlena, so it must be at least 4-byte 
aligned. If it was not for some reason, the existing VARSIZE invocation 
(within GETARR()) would already fail on platforms that are strict about 
alignment.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

From
Robert Haas
Date:
On Fri, Jun 24, 2011 at 3:01 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 24.06.2011 21:24, Robert Haas wrote:
>>
>> Out of curiosity (and because there is no comment or Assert here), how
>> can you be so sure of the input alignment?
>
> The input TRGM to makesign() is a varlena, so it must be at least 4-byte
> aligned. If it was not for some reason, the existing VARSIZE invocation
> (within GETARR()) would already fail on platforms that are strict about
> alignment.

Hmm, OK.  Might be worth adding a comment, anyway...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Fast GiST index build

From
Jesper Krogh
Date:
On 2011-06-06 09:42, Heikki Linnakangas wrote:
> took about 15 hours without the patch, and 2 hours with it. That's 
> quite dramatic.

Robust consumer-class SSD drives can now be found in sizes where they
can actually fit "many" database usage scenarios. A PostgreSQL version
is not likely to hit the streets before 50% of PostgreSQL users are
sitting on "some kind" of flash-based storage (for the cases where the
entire dataset doesn't fit in memory any more). Point is:

* Wouldn't it be natural to measure the performance benefits of disc-bound tests in an SSD setup?

... my understanding of fast gi(n|st) index build is that it is
more or less a challenge to transform a lot of random-IO workload
into something more sequential, and to collapse multiple changes into fewer.

In terms of random IO an SSD can easily be 100x better than rotating
drives, and it would be a shame to optimize "against" that world?

-- 
Jesper


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Sat, Jun 25, 2011 at 11:03 AM, Jesper Krogh <jesper@krogh.cc> wrote:
* Wouldn't it be natural to measure the performance benefits of
  disc-bound tests in an SSD setup?
Sure, it would be great to run performance tests on SSD drives too. Unfortunately, I don't have a corresponding test platform right now.

... my understanding of fast gi(n|st) index build is that it is
more or less a challenge to transform a lot of random-IO workload
into something more sequential, and to collapse multiple changes into fewer.
The main benefit of the proposed algorithm is that it greatly reduces the number of IO operations during index build, by dealing with a great number of index tuples simultaneously. It also makes some IO more sequential. I don't have precise measurements yet, but I believe the contribution of making IO more sequential is not very significant.
 
In terms of random IO an SSD can easily be 100x better than rotating
drives, and it would be a shame to optimize "against" that world?
Actually, I'm not sure that IO is the bottleneck of GiST index build on SSD drives. It seems more likely to me that CPU becomes the bottleneck in that case, so optimizing IO can't give much benefit. But anyway, the value of this work can also lie in producing a better index in some cases, and in extending SSD drive lifetime thanks to fewer IO operations.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 25.06.2011 11:23, Alexander Korotkov wrote:
> On Sat, Jun 25, 2011 at 11:03 AM, Jesper Krogh<jesper@krogh.cc>  wrote:
>
>> * Wouldn't it be natural to measure the performance benefits of
>>    disc-bound tests in an SSD setup?
>>
> Sure, it would be great to run performance tests on SSD drives too.
> Unfortunately, I don't have a corresponding test platform right now.

Anyone have an SSD setup to run some quick tests with this?

>> In terms of random IO an SSD can easily be 100x better than rotating
>> drives, and it would be a shame to optimize "against" that world?
>
> Actually, I'm not sure that IO is the bottleneck of GiST index build on SSD
> drives. It seems more likely to me that CPU becomes the bottleneck in that
> case, so optimizing IO can't give much benefit. But anyway, the value of this
> work can also lie in producing a better index in some cases, and in extending
> SSD drive lifetime thanks to fewer IO operations.

Yeah, this patch probably doesn't give much benefit on SSDs, not the 
order of magnitude improvements it gives on HDDs anyway. I would expect 
there to still be a small gain, however. If you look at the comparison 
of CPU times on Alexander's tests, the patch doesn't add that much CPU 
overhead: about 5% on the point_ops tests. I/O isn't free on SSDs 
either, so I would expect the patch to buy back that 5% increase in CPU 
overhead by reduced time spent on I/O even on a SSD.

It's much worse on the gist_trgm_ops test case, so this clearly depends 
a lot on the opclass, but even that should be possible to optimize quite 
a bit.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
I've added information about testing on a real-life dataset to the wiki page. This dataset has a peculiarity: the data inside it is ordered. In this case the tradeoff was the inverse of what was expected from the "fast build" algorithm: the index build takes longer, but index quality is significantly better. I think the high speed of the regular index build is because sequential inserts land in nearby parts of the tree, which keeps the number of actual page reads and writes low. The difference in tree quality I can't convincingly explain right now.
I've also run tests with a shuffled version of this dataset; in that case the results were similar to those for randomly generated data.

------
With best regards,
Alexander Korotkov.

Optimizing box_penalty (Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 27.06.2011 13:45, Alexander Korotkov wrote:
> I've added information about testing on a real-life dataset to the wiki
> page. This dataset has a peculiarity: the data inside it is ordered. In
> this case the tradeoff was the inverse of what was expected from the "fast
> build" algorithm: the index build takes longer, but index quality is
> significantly better. I think the high speed of the regular index build is
> because sequential inserts land in nearby parts of the tree, which keeps
> the number of actual page reads and writes low. The difference in tree
> quality I can't convincingly explain right now.
> I've also run tests with a shuffled version of this dataset; in that case
> the results were similar to those for randomly generated data.

Hmm, I assume the CPU overhead is coming from the penalty calls in this
case too. There's some low-hanging optimization fruit in
gist_box_penalty(), see attached patch. I tested this with:

CREATE TABLE points (a point);
CREATE INDEX i_points ON points using gist (a);
INSERT INTO points SELECT point(random(), random()) FROM
generate_series(1,1000000);

and running "checkpoint; reindex index i_points;" a few times with and
without the patch. The patch reduced the runtime from about 17.5 s to
15.5 s. oprofile confirms that the time spent in gist_box_penalty() and
rt_box_union() is reduced significantly.

This is all without the fast GiST index build patch, so this is
worthwhile on its own. If penalty function is called more, then this
becomes even more significant.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 27.06.2011 13:45, Alexander Korotkov wrote:
> I've added information about testing on a real-life dataset to the wiki
> page. This dataset has a peculiarity: the data inside it is ordered. In
> this case the tradeoff was the inverse of what was expected from the "fast
> build" algorithm: the index build takes longer, but index quality is
> significantly better. I think the high speed of the regular index build is
> because sequential inserts land in nearby parts of the tree, which keeps
> the number of actual page reads and writes low. The difference in tree
> quality I can't convincingly explain right now.
> I've also run tests with a shuffled version of this dataset; in that case
> the results were similar to those for randomly generated data.

Once again, interesting results.

The penalty function is called whenever a tuple is routed to the next 
level down, and the final tree has the same depth with and without the 
patch, so I would expect the number of penalty calls to be roughly the 
same. But clearly there's something wrong with that logic; can you 
explain in layman's terms why the patch adds so many gist penalty calls? 
And how many calls does it actually add, can you gather some numbers on 
that? Any ideas on how to mitigate that, or do we just have to live with
it? Or maybe use some heuristic to use the existing insertion method 
when the patch is not expected to be helpful?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Mon, Jun 27, 2011 at 6:34 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
The penalty function is called whenever a tuple is routed to the next level down, and the final tree has the same depth with and without the patch, so I would expect the number of penalty calls to be roughly the same. But clearly there's something wrong with that logic; can you explain in layman's terms why the patch adds so many gist penalty calls? And how many calls does it actually add, can you gather some numbers on that? Any ideas on how to mitigate that, or do we just have to live with it? Or maybe use some heuristic to use the existing insertion method when the patch is not expected to be helpful?
In short, because many index tuples are routed in parallel, the routing can differ. In the fast build algorithm index tuples accumulate in node buffers. When the corresponding node splits, we have to relocate the index tuples from its buffer. In the original algorithm we relocate the node buffer contents into the buffers of the new nodes produced by the split; even this requires additional penalty calls.
But to improve index quality I modified the algorithm. With my modification, an index tuple from a split node's buffer can also be relocated into other node buffers of the same parent. That produces more penalty calls.
I don't have an estimate yet, but I'm working on it. Unfortunately, I don't have any idea about mitigating it except turning off my modification.
A heuristic is possible, but I see the following problems. First, we would need to somehow estimate the length of varlena keys. I avoid this estimate in the fast algorithm itself by just assuming the worst case, but I believe we need something more precise for a good heuristic. Second, the right decision depends strongly on concurrent load. When there is no concurrent load (as in my experiments), the fraction of the tree which fits in the effective cache is a reasonable basis for estimating the benefit of the IO economy. But under high concurrent load, the part of the cache occupied by the tree will be considerably smaller than the whole effective cache.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Mon, Jun 27, 2011 at 10:32 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I don't have an estimate yet, but I'm working on it.

Now, it seems that I have an estimate.
N - total number of itups
B - avg. number of itups per page
H - height of the tree
K - avg. number of itups fitting in a node buffer
step - level step of buffers

K = 2 * B^step
avg. number of internal pages with buffers = 2*N/((2*B)^step - 1) (assuming pages are half-filled)
avg. itups in a node buffer = K / 2 (assuming node buffers are half-filled)
Each internal page with buffers can be produced by the split of another internal page with buffers.
So, the number of additional penalty calls = 2*N/((2*B)^step - 1) * K / 2, which is approximately 2*N*(1/2)^step,
while the number of regular penalty calls is H*N.

It seems that the fraction of additional penalty calls should decrease as the level step increases (though I haven't done experiments with level step != 1 yet).
Also, we can try to break the K = 2 * B^step equation. That could increase the number of IOs, but decrease the number of additional penalty calls and probably increase tree quality in some cases.
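
(Restating the estimate compactly, with the same symbols and half-filled assumptions as above, in LaTeX for clarity:)

    K = 2B^{step}, \qquad
    \text{additional calls} \approx \frac{2N}{(2B)^{step}-1}\cdot\frac{K}{2}
      \approx 2N\left(\frac{1}{2}\right)^{step}, \qquad
    \frac{\text{additional}}{\text{regular}}
      \approx \frac{2N(1/2)^{step}}{HN} = \frac{2}{H\,2^{step}}

So with step = 1 there is roughly one extra penalty call per index tuple, about 1/H of the regular calls, and each increase of step by one halves that estimate.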

------
With best regards,
Alexander Korotkov.
 

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
New version of the patch. A bug which caused failures on trees with a large number of levels was fixed. Also some more comments and refactoring.

------
With best regards,
Alexander Korotkov.

Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
New version of the patch. Now it seems that all the code is covered by comments. I'm also going to write a readme with a general description of the algorithm. Some bugs were fixed.
More options were added. The options are described below.
1) fastbuild - whether to use the fast build algorithm. Default is true.
2) levelstep - the step between levels on which buffers exist (if levelstep == 1 then there are buffers on each internal page, if levelstep == 2 then buffers are only on odd levels, and so on). By default it's calculated from maintenance_work_mem and the indexed types.
3) buffersize - size of the buffers, in pages. By default it's calculated from levelstep and the indexed types.
4) neighborrelocation - whether, on split, to also relocate buffer contents between neighbor buffers (my modification of the original algorithm). Improves tree quality, but produces additional penalty calls. Default is true.
Varying these options should allow me to understand the tradeoffs better.
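
(For instance, an illustrative build command using these reloptions; the values are placeholders, with buffersize echoing the pagesPerBuffer notice earlier in the thread.)

CREATE INDEX test_idx ON test USING gist (v)
    WITH (fastbuild = on, levelstep = 1, buffersize = 406,
          neighborrelocation = on);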

------
With best regards,
Alexander Korotkov.

Attachment

Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I couldn't resist looking into this, and came up with the attached
> patch. I tested this with:

> CREATE TABLE words (word text);
> COPY words FROM '/usr/share/dict/words';
> CREATE INDEX i_words ON words USING gist (word gist_trgm_ops);

> And then ran "REINDEX INDEX i_words" a few times with and without the
> patch. Without the patch, reindex takes about 4.7 seconds. With the
> patch, 3.7 seconds. That's a worthwhile gain on its own, but becomes
> even more important with Alexander's fast GiST build patch, which calls
> the penalty function more.

I tested this on HPPA (big-endian, picky about alignment) to verify that
you had that code path right.  It's right, but on that platform the
speedup in the "reindex dict/words" test is only about 6%.  I'm afraid
that the benefit is a lot more machine-specific than we could wish.

I tweaked the patch a bit further (don't pessimize the boundary case
where there's exactly 4n+1 trigrams, avoid forcing trg1 into memory,
improve the comment) and attach that version below.  This is a little
bit faster but I still wonder if it's worth the price of adding an
obscure dependency on the header size.

> I used the attached showsign-debug.patch to verify that the patched
> makesign function produces the same results as the existing code. I
> haven't tested the big-endian code, however.

That didn't look terribly convincing.  I added a direct validation that
old and new code give the same results, a la

      if (ISARRKEY(newval))
      {
          BITVEC        sign;
+         BITVEC        osign;

          makesign(sign, newval);
+         origmakesign(osign, newval);
+         Assert(memcmp(sign, osign, sizeof(BITVEC)) == 0);

          if (ISALLTRUE(origval))
              *penalty = ((float) (SIGLENBIT - sizebitvec(sign))) / (float) (SIGLENBIT + 1);

and ran the regression tests and the dict/words example with that.

            regards, tom lane

diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index b328a09f41fee50beb96a28835e15ef835222cd6..7cea6d68fdcde71e0fb033682d77c21978c3406b 100644
*** a/contrib/pg_trgm/trgm_gist.c
--- b/contrib/pg_trgm/trgm_gist.c
*************** gtrgm_out(PG_FUNCTION_ARGS)
*** 84,100 ****
  static void
  makesign(BITVECP sign, TRGM *a)
  {
!     int4        k,
!                 len = ARRNELEM(a);
      trgm       *ptr = GETARR(a);
!     int4        tmp = 0;

      MemSet((void *) sign, 0, sizeof(BITVEC));
      SETBIT(sign, SIGLENBIT);    /* set last unused bit */
!     for (k = 0; k < len; k++)
      {
!         CPTRGM(((char *) &tmp), ptr + k);
!         HASH(sign, tmp);
      }
  }

--- 84,178 ----
  static void
  makesign(BITVECP sign, TRGM *a)
  {
!     int4        len = ARRNELEM(a);
      trgm       *ptr = GETARR(a);
!     char       *p;
!     char       *endptr;
!     uint32        w1,
!                 w2,
!                 w3;
!     uint32        trg0 = 0,
!                 trg1,
!                 trg2,
!                 trg3,
!                 trg4;
!     uint32       *p32;

      MemSet((void *) sign, 0, sizeof(BITVEC));
      SETBIT(sign, SIGLENBIT);    /* set last unused bit */
!
!     if (len <= 0)
!         return;
!
!     /*----------
!      * We have to extract each trigram into a uint32, and calculate the HASH.
!      * This would be a lot easier if the trigrams were aligned on 4-byte
!      * boundaries, but they're not.  The simple way would be to copy each
!      * trigram byte-by-byte, but that is quite slow, and this function is a
!      * hotspot in penalty calculations.
!      *
!      * The first trigram in the array doesn't begin at a 4-byte boundary, as
!      * the flags byte comes first; but the next one does.  So we fetch the
!      * first trigram as a special case, and after that each four trigrams fall
!      * onto 4-byte words like this:
!      *
!      *  w1   w2   w3
!      * AAAB BBCC CDDD
!      *
!      * As long as there's at least four trigrams left to process, we fetch
!      * the next three words and extract the trigrams from them with bit
!      * operations, per the above diagram.  The last few trigrams are handled
!      * one at a time with byte-by-byte fetching.
!      *
!      * Note that this code yields different results on big-endian and
!      * little-endian machines, because the bytes of each trigram are loaded
!      * into a uint32 in memory order and left-justified.  That's probably
!      * undesirable, but changing this behavior would break existing indexes.
!      *----------
!      */
!     endptr = (char *) (ptr + len);
!     p32 = (uint32 *) (((char *) ptr) - 1);
!
!     /* Fetch and extract the initial word */
!     w1 = *(p32++);
! #ifdef WORDS_BIGENDIAN
!     trg1 = w1 << 8;
! #else
!     trg1 = w1 >> 8;
! #endif
!     HASH(sign, trg1);
!
!     while ((char *) p32 <= endptr - 3 * sizeof(uint32))
      {
!         w1 = *(p32++);
!         w2 = *(p32++);
!         w3 = *(p32++);
!
! #ifdef WORDS_BIGENDIAN
!         trg1 = w1 & 0xFFFFFF00;
!         trg2 = (w1 << 24) | ((w2 & 0xFFFF0000) >> 8);
!         trg3 = ((w2 & 0x0000FFFF) << 16) | ((w3 & 0xFF000000) >> 16);
!         trg4 = w3 << 8;
! #else
!         trg1 = w1 & 0x00FFFFFF;
!         trg2 = (w1 >> 24) | ((w2 & 0x0000FFFF) << 8);
!         trg3 = ((w2 & 0xFFFF0000) >> 16) | ((w3 & 0x000000FF) << 16);
!         trg4 = w3 >> 8;
! #endif
!
!         HASH(sign, trg1);
!         HASH(sign, trg2);
!         HASH(sign, trg3);
!         HASH(sign, trg4);
!     }
!
!     /* Handle the remaining 0-3 trigrams the slow way */
!     p = (char *) p32;
!     while (p < endptr)
!     {
!         CPTRGM(((char *) &trg0), p);
!         HASH(sign, trg0);
!         p += 3;
      }
  }


Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

From
Tom Lane
Date:
I wrote:
> I tweaked the patch a bit further (don't pessimize the boundary case
> where there's exactly 4n+1 trigrams, avoid forcing trg1 into memory,
> improve the comment) and attach that version below.  This is a little
> bit faster but I still wonder if it's worth the price of adding an
> obscure dependency on the header size.

It occurred to me that we could very easily remove that objection by
making the code dynamically detect when it's reached a suitably aligned
trigram.  The attached version of the patch does it that way.  It seems
to be a percent or so slower than my previous version, but I think that
making the function robust against header changes is probably well worth
that price.

BTW, I also tried wrapping the first two loops in an "if (len > 4)"
test, reasoning that the added complexity is useless unless the main
loop will be able to iterate at least once, and surely most words are
less than 15 bytes long.  While this did indeed make back the extra
percent on my HPPA box, it made things a percent or so slower yet on my
Intel box with gcc 4.4.5.  I think the compiler must be getting confused
about what to optimize.  So I left that out of this version of the
patch, but perhaps it deserves closer investigation.

            regards, tom lane

diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index b328a09f41fee50beb96a28835e15ef835222cd6..e024caceeed0d656e8569d3a8f1bd2c1b6d72821 100644
*** a/contrib/pg_trgm/trgm_gist.c
--- b/contrib/pg_trgm/trgm_gist.c
*************** gtrgm_out(PG_FUNCTION_ARGS)
*** 84,100 ****
  static void
  makesign(BITVECP sign, TRGM *a)
  {
!     int4        k,
!                 len = ARRNELEM(a);
      trgm       *ptr = GETARR(a);
!     int4        tmp = 0;

      MemSet((void *) sign, 0, sizeof(BITVEC));
      SETBIT(sign, SIGLENBIT);    /* set last unused bit */
!     for (k = 0; k < len; k++)
      {
!         CPTRGM(((char *) &tmp), ptr + k);
!         HASH(sign, tmp);
      }
  }

--- 84,182 ----
  static void
  makesign(BITVECP sign, TRGM *a)
  {
!     int4        len = ARRNELEM(a);
      trgm       *ptr = GETARR(a);
!     char       *p;
!     char       *endptr;
!     uint32        w1,
!                 w2,
!                 w3;
!     uint32        trg0 = 0,
!                 trg1,
!                 trg2,
!                 trg3,
!                 trg4;
!     uint32       *p32;

      MemSet((void *) sign, 0, sizeof(BITVEC));
      SETBIT(sign, SIGLENBIT);    /* set last unused bit */
!
!     if (len <= 0)
!         return;
!
!     /*----------
!      * We have to extract each trigram into a uint32, and calculate the HASH.
!      * This would be a lot easier if the trigrams were aligned on 4-byte
!      * boundaries, but they're not.  The simple way would be to copy each
!      * trigram byte-by-byte, but that is quite slow, and this function is a
!      * hotspot in penalty calculations.
!      *
!      * The first trigram in the array might not begin at a 4-byte boundary,
!      * because the TRGM header is of odd length, and anyway we'd prefer not
!      * to hard-wire an assumption about header size here.  So we fetch
!      * trigrams byte-by-byte until we reach one that does start on a 4-byte
!      * boundary.  After that, each four trigrams fall onto 4-byte words like
!      * this:
!      *
!      *  w1   w2   w3
!      * AAAB BBCC CDDD
!      *
!      * As long as there's at least four trigrams left to process, we fetch
!      * the next three words and extract the trigrams from them with bit
!      * operations, per the above diagram.  The last few trigrams are handled
!      * one at a time with byte-by-byte fetching.
!      *
!      * Note that this code yields different results on big-endian and
!      * little-endian machines, because the bytes of each trigram are loaded
!      * into a uint32 in memory order and left-justified.  That's probably
!      * undesirable, but changing this behavior would break existing indexes.
!      *----------
!      */
!     p = (char *) ptr;
!     endptr = (char *) (ptr + len);
!
!     /* Handle trigrams the slow way until we reach a uint32 boundary */
!     while (p < endptr &&
!            ((intptr_t) p % sizeof(uint32)) != 0)
      {
!         CPTRGM(((char *) &trg0), p);
!         HASH(sign, trg0);
!         p += 3;
!     }
!
!     /* Now handle trigrams in groups of four */
!     p32 = (uint32 *) p;
!     while ((char *) p32 <= endptr - 3 * sizeof(uint32))
!     {
!         w1 = *(p32++);
!         w2 = *(p32++);
!         w3 = *(p32++);
!
! #ifdef WORDS_BIGENDIAN
!         trg1 = w1 & 0xFFFFFF00;
!         trg2 = (w1 << 24) | ((w2 & 0xFFFF0000) >> 8);
!         trg3 = ((w2 & 0x0000FFFF) << 16) | ((w3 & 0xFF000000) >> 16);
!         trg4 = w3 << 8;
! #else
!         trg1 = w1 & 0x00FFFFFF;
!         trg2 = (w1 >> 24) | ((w2 & 0x0000FFFF) << 8);
!         trg3 = ((w2 & 0xFFFF0000) >> 16) | ((w3 & 0x000000FF) << 16);
!         trg4 = w3 >> 8;
! #endif
!
!         HASH(sign, trg1);
!         HASH(sign, trg2);
!         HASH(sign, trg3);
!         HASH(sign, trg4);
!     }
!
!     /* Handle the remaining 0-3 trigrams the slow way */
!     p = (char *) p32;
!     while (p < endptr)
!     {
!         CPTRGM(((char *) &trg0), p);
!         HASH(sign, trg0);
!         p += 3;
      }
  }


Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 02.07.2011 21:07, Tom Lane wrote:
> I wrote:
>> I tweaked the patch a bit further (don't pessimize the boundary case
>> where there's exactly 4n+1 trigrams, avoid forcing trg1 into memory,
>> improve the comment) and attach that version below.  This is a little
>> bit faster but I still wonder if it's worth the price of adding an
>> obscure dependency on the header size.
>
> It occurred to me that we could very easily remove that objection by
> making the code dynamically detect when it's reached a suitably aligned
> trigram.  The attached version of the patch does it that way.  It seems
> to be a percent or so slower than my previous version, but I think that
> making the function robust against header changes is probably well worth
> that price.

Ah, that opens the door to do something I considered earlier but
rejected because of alignment: instead of three 32-bit word fetches, we
could fetch one 64-bit word and 32-bit word. Might shave a few more
cycles...

Meanwhile, I experimented with optimizing the other part of the loop:
the HASH() macros for setting the right bits in the signature. With the
default compile-time settings, the signature array is 95 bits.
Currently, it's stored in a BITVEC, which is a byte array, but we could
store it in two 64-bit integers, which makes it possible to write SETBIT
differently. I experimented with a few approaches, first was essentially
this:

+ /* Set the nth bit in the signature, in s1 or s2 */
+ #define HASH_S(h) \
+     do {                                    \
+         unsigned int n = HASHVAL(h);        \
+         if (((uint64) (n)) < 64)            \
+             s1 |= (uint64) 1<<(n);            \
+         else                                \
+             s2 |= (uint64) 1<<((n) - 64);    \
+     } while(0)

That was a bit faster on my x64 laptop, but slightly slower on my ia64
HP-UX box. My second try was to use lookup tables, patch attached. That
was yet faster on x64, and a small win on the ia64 box too. I'm not sure
it's worth the added code complexity, though.

Here's a summary of the timings I got with different versions:

ia64 HP-UX (anole):

unpatched: 11194.038 ms
fast_makesign-tom: 10064.980 ms
fast_makesign-2int: 10649.726 ms
fast_makesign-tbl: 9951.547 ms

x64 laptop:

unpatched: 4595,209 ms
fast_makesign-tom: 3346,548 ms
fast_makesign-2int: 3102,874 ms
fast_makesign-tbl: 2997,854 ms

I used the same "REINDEX INDEX i_words" test I used earlier, repeated
each run a couple of times, and took the lowest number.
fast_makesign-tom is the first patch you posted, I haven't tested your
latest one. fast_makesign-2int is with the HASH_S() macro above, and
fast_makesign-tbl is with the attached patch.

PS. In case you wonder why the HP-UX box is so much slower than my
laptop: this box isn't really meant for performance testing. It's just
something I happen to have access to, I think it's a virtual machine of
some sort. The numbers are very repeatable, however.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Ah, that opens the door to do something I considered earlier but 
> rejected because of alignment: instead of three 32-bit word fetches, we 
> could fetch one 64-bit word and 32-bit word. Might shave a few more 
> cycles...

Hm ... I suspect that might be a small win on natively-64-bit machines,
but almost certainly a loss on 32-bitters.

> Meanwhile, I experimented with optimizing the other part of the loop: 
> the HASH() macros for setting the right bits in the signature.

Yeah, I was eyeing that too, but I'm hesitant to hard-wire assumptions
about the size of the signature.
        regards, tom lane


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
New version of the patch, with a readme and some bugfixes.
Some new tests with variations of the fast-build parameters are on the wiki page. I hesitate to interpret some of the test results because they look strange to me. In particular, I can't explain why decreasing the buffer size affects index quality so much (my expectation was that decreasing the buffer size should bring all build parameters closer to a regular build). I'm going to recheck my experiments; probably I'm missing something.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
I found that the results of the previous tests with USNOA2 were not fully correct. The tables for tests with various parameters contained tuples in different orders. I assumed that the query
CREATE TABLE tab2 AS (SELECT * FROM tab1);
creates tab2 as an exact copy of tab1, i.e. that tab2 contains tuples in the same order as tab1. But actually, it isn't always so. Combined with the small number of test cases used, this can cause significant error.
I'm going to use a more thought-out testing method. Probably a more precise index quality measure exists (even for R-trees).

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> I found that the results of the previous tests with USNOA2 were not fully
> correct. The tables for tests with various parameters contained tuples in
> different orders. I assumed that the query
> CREATE TABLE tab2 AS (SELECT * FROM tab1);
> creates tab2 as an exact copy of tab1, i.e. that tab2 contains tuples in the
> same order as tab1. But actually, it isn't always so. Combined with the small
> number of test cases used, this can cause significant error.

For test purposes, you could turn off synchronize_seqscans to prevent
that.
        regards, tom lane


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Fri, Jul 8, 2011 at 6:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
For test purposes, you could turn off synchronize_seqscans to prevent
that.

Thanks, it helps. I'm rerunning tests now.

------
With best regards,
Alexander Korotkov. 

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
New version of the patch, with a little more refactoring and comments.

------
With best regards,
Alexander Korotkov. 
Attachment

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
Hi,

Looking at the performance test results again on the wiki page 
(http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results), 
the patch can be summarized like this: it reduces the number of disk 
accesses, at the cost of some extra CPU work.

Is it possible to switch to the new buffering method in the middle of an 
index build? We could use the plain insertion method until the index 
grows to a certain size (effective_cache_size?), and switch to the 
buffering method after that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hi!
 
On Wed, Jul 13, 2011 at 12:33 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Is it possible to switch to the new buffering method in the middle of an index build? We could use the plain insertion method until the index grows to a certain size (effective_cache_size?), and switch to the buffering method after that.
Yes, it seems to be possible. It would also be great to somehow detect the case of ordered data, where a regular index build performs well.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Wed, Jul 13, 2011 at 12:40 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Jul 13, 2011 at 12:33 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Is it possible to switch to the new buffering method in the middle of an index build? We could use the plain insertion method until the index grows to a certain size (effective_cache_size?), and switch to the buffering method after that.
Yes, it seems to be possible.
It would also give us the possibility to estimate varlena sizes from real data before the buffering method starts.
 
It would also be great to somehow detect the case of ordered data, where a regular index build performs well.
I think this case can be detected when most of the index tuples are being inserted into a few recently used leaf pages.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 12.07.2011 11:34, Alexander Korotkov wrote:
> New version of the patch, with a little more refactoring and comments.

Great! The README helps tremendously to understand this, thanks for that.

One thing that caught my eye is that when you empty a buffer, you load 
the entire subtree below that buffer, down to the next buffered or leaf 
level, into memory. Every page in that subtree is kept pinned. That is a 
problem; in the general case, the buffer manager can only hold a modest 
number of pages pinned at a time. Consider that the minimum value for 
shared_buffers is just 16. That's unrealistically low for any real 
system, but the default is only 32MB, which equals just 4096 buffers. 
A subtree could easily be larger than that.

I don't think you're benefiting at all from the buffering that BufFile 
does for you, since you're reading/writing a full block at a time 
anyway. You might as well use the file API in fd.c directly, ie. 
OpenTemporaryFile/FileRead/FileWrite.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Wed, Jul 13, 2011 at 5:59 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
One thing that caught my eye is that when you empty a buffer, you load the entire subtree below that buffer, down to the next buffered or leaf level, into memory. Every page in that subtree is kept pinned. That is a problem; in the general case, the buffer manager can only hold a modest number of pages pinned at a time. Consider that the minimum value for shared_buffers is just 16. That's unrealistically low for any real system, but the default is only 32MB, which equals to just 4096 buffers. A subtree could easily be larger than that.
With level step = 1 we need only 2 levels in the subtree. With the minimum index tuple size (12 bytes), each page can hold at most 675 index tuples. Thus I think the default shared_buffers is enough for level step = 1. I believe it's enough to add a check that we have sufficient shared_buffers, isn't it?
 
I don't think you're benefiting at all from the buffering that BufFile does for you, since you're reading/writing a full block at a time anyway. You might as well use the file API in fd.c directly, ie. OpenTemporaryFile/FileRead/FileWrite.
BufFile distributes temporary data across several files. AFAICS postgres avoids working with files larger than 1GB, and the size of the tree buffers can easily be greater. Without BufFile I would need to maintain the set of files manually.

------
With best regards,
Alexander Korotkov. 

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 14.07.2011 11:33, Alexander Korotkov wrote:
> On Wed, Jul 13, 2011 at 5:59 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> One thing that caught my eye is that when you empty a buffer, you load the
>> entire subtree below that buffer, down to the next buffered or leaf level,
>> into memory. Every page in that subtree is kept pinned. That is a problem;
>> in the general case, the buffer manager can only hold a modest number of
>> pages pinned at a time. Consider that the minimum value for shared_buffers
>> is just 16. That's unrealistically low for any real system, but the default
>> is only 32MB, which equals to just 4096 buffers. A subtree could easily be
>> larger than that.
>>
> With level step = 1 we need only 2 levels in the subtree. With the minimum
> index tuple size (12 bytes), each page can hold at most 675 index tuples.
> Thus I think the default shared_buffers is enough for level step = 1.

Hundreds of buffer pins is still a lot. And with level_step = 2, the 
number of pins required explodes to 675^2 = 455625.

Pinning a buffer that's already in the shared buffer cache is cheap, so I 
doubt you're gaining much by keeping the private hash table in front of 
the buffer cache. Also, it's possible that not all of the subtree is 
actually required during the emptying, so in the worst case pre-loading 
it is counter-productive.

> I believe it's enough
> to add a check that we have sufficient shared_buffers, isn't it?

Well, what do you do if you deem that shared_buffers is too small? Fall 
back to the old method? Also, shared_buffers is shared by all backends, 
so you can't assume that you get to use all of it for the index build. 
You'd need a wide safety margin.

>> I don't think you're benefiting at all from the buffering that BufFile does
>> for you, since you're reading/writing a full block at a time anyway. You
>> might as well use the file API in fd.c directly, ie.
>> OpenTemporaryFile/FileRead/**FileWrite.
>
> BufFile distributes temporary data across several files. AFAICS
> postgres avoids working with files larger than 1GB, and the size of the tree
> buffers can easily be greater. Without BufFile I would need to maintain the
> set of files manually.

Ah, I see. Makes sense.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Thu, Jul 14, 2011 at 12:42 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Pinning a buffer that's already in the shared buffer cache is cheap, I doubt you're gaining much by keeping the private hash table in front of the buffer cache.
Yes, I see. Pinning a lot of buffers doesn't give significant benefits but produces a lot of problems. Also, removing the hash table can simplify the code.

Also, it's possible that not all of the subtree is actually required during the emptying, so in the worst case pre-loading them is counter-productive.
What do you think about prefetching the whole subtree? It requires actually loading level_step - 1 levels of it. In some cases it can still be counter-productive, but probably it is productive on average?
 
Well, what do you do if you deem that shared_buffers is too small? Fall back to the old method? Also, shared_buffers is shared by all backends, so you can't assume that you get to use all of it for the index build. You'd need a wide safety margin.
I intended to check whether there are enough shared_buffers before switching to the buffering method. But concurrent backends make this approach unsafe. 

There are other difficulties with concurrent backends: it would be nice to estimate the effective-cache usage of other backends before switching to the buffering method. If we don't take care of that, we might fail to switch to the buffering method in cases where it could give a significant benefit.

------
With best regards,
Alexander Korotkov. 

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Do you think using the "rightlink" as a pointer to the parent page is possible during index build? It would allow the code to be simplified significantly, because there would be no more need to maintain in-memory structures for remembering parents. 

------
With best regards,
Alexander Korotkov. 

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 14.07.2011 23:41, Alexander Korotkov wrote:
> Do you think using the "rightlink" as a pointer to the parent page is possible
> during index build? It would allow the code to be simplified significantly,
> because there would be no more need to maintain in-memory structures for
> remembering parents.

I guess, but where do you store the rightlink, then? Would you need a 
final pass through the index to fix all the rightlinks?

I think you could use the NSN field. It's used to detect concurrent page 
splits, but those can't happen during index build, so you don't need 
that field during index build. You just have to make it look like an 
otherwise illegal NSN, so that it won't be mistaken for a real NSN after 
the index is built. Maybe add a new flag to mean that the NSN is 
actually invalid.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Fri, Jul 15, 2011 at 12:53 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 14.07.2011 23:41, Alexander Korotkov wrote:
Do you think using the "rightlink" as a pointer to the parent page is possible during index build? It would allow the code to be simplified significantly, because there would be no more need to maintain in-memory structures for remembering parents.

I guess, but where do you store the rightlink, then? Would you need a final pass through the index to fix all the rightlinks?

I think you could use the NSN field. It's used to detect concurrent page splits, but those can't happen during index build, so you don't need that field during index build. You just have to make it look like an otherwise illegal NSN, so that it won't be mistaken for a real NSN after the index is built. Maybe add a new flag to mean that the NSN is actually invalid.

Thank you for the advice. But I didn't take into account that in this case I would need to update the parent link in many pages (which might not be in cache) on split. It seems that I still need to maintain some in-memory structures for parent finding.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hi!

New version of the patch is attached. There are the following changes.
1) Since the proposed technique is not always a "fast" build, it was renamed everywhere in the patch to "buffering" build.
2) Parameter "buffering" now has 3 possible values: "yes", "no" and "auto". "auto" means automatic switching from a regular index build to a buffering one. Currently it just switches when the index size exceeds maintenance_work_mem (see the sketch after this list).
3) Holding many buffers pinned is avoided.
4) Rebased with head.
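
A minimal sketch of the automatic switch in item 2 (illustrative names; not
the actual patch code):

/* Switch from regular to buffering build once the index outgrows cache. */
static void
maybe_switch_to_buffering(GISTBuildState *state, BlockNumber index_pages)
{
    uint64 index_bytes = (uint64) index_pages * BLCKSZ;

    /* maintenance_work_mem is in kilobytes */
    if (state->buffering_mode == GIST_BUFFERING_AUTO &&
        index_bytes > (uint64) maintenance_work_mem * 1024)
    {
        init_node_buffers(state);       /* hypothetical setup step */
        state->buffering_mode = GIST_BUFFERING_ACTIVE;
    }
}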

TODO:
1) Take care of ordered datasets in automatic switching.
2) Take care of concurrent backends in automatic switching.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hi!

A patch with my attempt to detect ordered datasets is attached. The implemented idea is described below.
Index tuples are divided into chunks of 128. For each chunk, we measure how many of the leaf pages where index tuples were inserted don't match those of the previous chunk. Based on statistics over several chunks, we estimate the distribution of accesses between leaf pages (an exponential distribution law is assumed, and that seems to be an error). After that we can estimate the portion of index tuples which can be processed without actual IO. If this estimate exceeds a threshold, then we should switch to the buffering build.
My implementation now successfully detects randomly mixed datasets and well-ordered datasets, but it seems to be too optimistic about intermediate cases. I believe that's due to the wrong assumption about the distribution law.
Do you think this approach is acceptable? Probably there is some research on the distribution law for such cases (though I didn't find anything relevant on Google Scholar)?
As an alternative, I can propose taking into account the actual average number of IO operations per tuple rather than an estimate.
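
A minimal sketch of the heuristic (the bookkeeping, the smoothing, and the
threshold here are assumptions, not the actual patch):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;       /* stands in for PostgreSQL's type */

#define CHUNK_SIZE      128
#define COLD_THRESHOLD  0.5         /* assumed cutoff */

typedef struct OrderDetector
{
    BlockNumber prev_chunk[CHUNK_SIZE]; /* leaf pages hit by previous chunk */
    BlockNumber cur_chunk[CHUNK_SIZE];
    int         n;                      /* tuples seen in the current chunk */
    double      miss_ratio;             /* smoothed share of "cold" inserts */
} OrderDetector;

static bool
chunk_contains(const BlockNumber *chunk, BlockNumber blk)
{
    for (int i = 0; i < CHUNK_SIZE; i++)
        if (chunk[i] == blk)
            return true;
    return false;
}

/* Feed one insertion; returns true once buffering build looks worthwhile. */
static bool
order_detector_add(OrderDetector *d, BlockNumber leaf_blk)
{
    d->cur_chunk[d->n++] = leaf_blk;
    if (d->n == CHUNK_SIZE)
    {
        int misses = 0;

        for (int i = 0; i < CHUNK_SIZE; i++)
            if (!chunk_contains(d->prev_chunk, d->cur_chunk[i]))
                misses++;

        /* exponential smoothing across chunks (assumed, not from the patch) */
        d->miss_ratio = 0.75 * d->miss_ratio +
                        0.25 * ((double) misses / CHUNK_SIZE);
        memcpy(d->prev_chunk, d->cur_chunk, sizeof(d->prev_chunk));
        d->n = 0;
    }
    return d->miss_ratio > COLD_THRESHOLD;
}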

------
With best regards,
Alexander Korotkov.

On Mon, Jul 18, 2011 at 10:00 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi!

New version of the patch is attached. There are the following changes.
1) Since the proposed technique is not always a "fast" build, it was renamed everywhere in the patch to "buffering" build.
2) Parameter "buffering" now has 3 possible values: "yes", "no" and "auto". "auto" means automatic switching from a regular index build to a buffering one. Currently it just switches when the index size exceeds maintenance_work_mem.
3) Holding many buffers pinned is avoided.
4) Rebased with head.

TODO:
1) Take care of ordered datasets in automatic switching.
2) Take care of concurrent backends in automatic switching.

------
With best regards,
Alexander Korotkov.

Attachment

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 22.07.2011 12:38, Alexander Korotkov wrote:
> A patch with my attempt to detect ordered datasets is attached. The
> implemented idea is described below.
> Index tuples are divided into chunks of 128. For each chunk, we measure how
> many of the leaf pages where index tuples were inserted don't match those of
> the previous chunk. Based on statistics over several chunks, we estimate the
> distribution of accesses between leaf pages (an exponential distribution law
> is assumed, and that seems to be an error). After that we can estimate the
> portion of index tuples which can be processed without actual IO. If this
> estimate exceeds a threshold, then we should switch to the buffering build.
> My implementation now successfully detects randomly mixed datasets and well-
> ordered datasets, but it seems to be too optimistic about intermediate cases.
> I believe that's due to the wrong assumption about the distribution law.
> Do you think this approach is acceptable? Probably there is some research on
> the distribution law for such cases (though I didn't find anything relevant
> on Google Scholar)?

Great! It would be nice to find a more scientific approach to this, but 
that's probably fine for now. It's time to start cleaning up the patch 
for eventual commit.

You got rid of the extra page pins, which is good, but I wonder why you 
still pre-create all the GISTLoadedPartItem structs for the whole 
subtree in loadTreePart() ? Can't you create those structs on-the-fly, 
when you descend the tree? I understand that it's difficult to update 
all the parent-pointers as trees are split, but it feels like there's 
way too much bookkeeping going on. Surely it's possible to simplify it 
somehow...

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 25.07.2011 21:52, Heikki Linnakangas wrote:
> On 22.07.2011 12:38, Alexander Korotkov wrote:
>> A patch with my attempt to detect ordered datasets is attached. The
>> implemented idea is described below.
>> Index tuples are divided into chunks of 128. For each chunk, we measure
>> how many of the leaf pages where index tuples were inserted don't match
>> those of the previous chunk. Based on statistics over several chunks, we
>> estimate the distribution of accesses between leaf pages (an exponential
>> distribution law is assumed, and that seems to be an error). After that we
>> can estimate the portion of index tuples which can be processed without
>> actual IO. If this estimate exceeds a threshold, then we should switch to
>> the buffering build.
>> My implementation now successfully detects randomly mixed datasets and
>> well-ordered datasets, but it seems to be too optimistic about intermediate
>> cases. I believe that's due to the wrong assumption about the distribution
>> law.
>> Do you think this approach is acceptable? Probably there is some research
>> on the distribution law for such cases (though I didn't find anything
>> relevant on Google Scholar)?
>
> Great! It would be nice to find a more scientific approach to this, but
> that's probably fine for now. It's time to start cleaning up the patch
> for eventual commit.
>
> You got rid of the extra page pins, which is good, but I wonder why you
> still pre-create all the GISTLoadedPartItem structs for the whole
> subtree in loadTreePart() ? Can't you create those structs on-the-fly,
> when you descend the tree? I understand that it's difficult to update
> all the parent-pointers as trees are split, but it feels like there's
> way too much bookkeeping going on. Surely it's possible to simplify it
> somehow...

That was quite an off-the-cuff remark, so I took the patch and culled out
the loaded-tree bookkeeping. A lot of other changes fell off from that, so
it took me quite some time to get it working again, but here it is. This
is a *lot* smaller patch, although that's partly explained by the fact
that I left out some features: the prefetching and neighbor relocation
code is gone.

I'm pretty exhausted by this, so I just wanted to send this out without
further analysis. Let me know if you have questions on the approach
taken. I'm also not sure how this performs compared to your latest
patch, I haven't done any performance testing. Feel free to use this as
is, or as a source of inspiration :-).

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
That was quite an off-the-cuff remark, so I took the patch and culled out the loaded-tree bookkeeping. A lot of other changes fell off from that, so it took me quite some time to get it working again, but here it is. This is a *lot* smaller patch, although that's partly explained by the fact that I left out some features: the prefetching and neighbor relocation code is gone.

I'm pretty exhausted by this, so I just wanted to send this out without further analysis. Let me know if you have questions on the approach taken. I'm also not sure how this performs compared to your latest patch, I haven't done any performance testing. Feel free to use this as is, or as a source of inspiration :-).
I was also going to try to avoid keeping the loaded-tree hash. This might help me a lot. Thanks.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
I found a problem with WAL in this patch. My patch uses a simplified insert algorithm which doesn't insert downlinks one by one but inserts them all at once. Thus the FollowRight flag is left uncleared when redoing from WAL, because only one flag can be cleared per WAL record. Do you think a modification of the WAL record structure is possible, or do I have to insert downlinks one by one in the buffering build too?

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 27.07.2011 15:29, Alexander Korotkov wrote:
> I found a problem with WAL in this patch. My patch uses a simplified insert
> algorithm which doesn't insert downlinks one by one but inserts them all at
> once. Thus the FollowRight flag is left uncleared when redoing from WAL,
> because only one flag can be cleared per WAL record. Do you think a
> modification of the WAL record structure is possible, or do I have to insert
> downlinks one by one in the buffering build too?

Dunno, both approaches seem reasonable to me. There's no rule against 
changing WAL record structure across major releases, if that's what you 
were worried about.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Wed, Jul 27, 2011 at 6:05 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Dunno, both approaches seem reasonable to me. There's no rule against changing WAL record structure across major releases, if that's what you were worried about.
 
OK, thanks. I also found behaviour of GiST (without the patch) with streaming replication that seems strange to me. On the master only a few rightlinks are InvalidBlockNumber, while on the slave a lot of them are. I hacked gevel to get the index structure on the slave (accessing the tree without AccessExclusiveLock).

On master:
# create table test as (select point(random(),random()) from generate_series(1,100000));
# create index test_idx on test using gist(point);
# \copy (select gist_tree('test_idx')) to 'tree1r.txt';

On slave:
# \copy (select gist_tree('test_idx')) to 'tree2r.txt';

In bash:
# cat tree1r.txt | sed 's/\\n/\n/g' > tree1.txt
# cat tree2r.txt | sed 's/\\n/\n/g' > tree2.txt
# diff tree1.txt tree2.txt

2,89c2,89
<     1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:637 (OK)
<         1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:319 (OK)
<         2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:153 (OK)
<         3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:551 (OK)
<         4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:555 (OK)
<         5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:74 (OK)
<         6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:294 (OK)
<         7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:567 (OK)
.....
---
>     1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295 (InvalidBlockNumber)
>         1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:4294967295 (InvalidBlockNumber)
>         2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:4294967295 (InvalidBlockNumber)
>         3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:4294967295 (InvalidBlockNumber)
>         4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:4294967295 (InvalidBlockNumber)
>         5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:4294967295 (InvalidBlockNumber)
>         6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:4294967295 (InvalidBlockNumber)
>         7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:4294967295 (InvalidBlockNumber)
.....

Isn't it a bug?

------
With best regards,
Alexander Korotkov. 

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 27.07.2011 17:43, Alexander Korotkov wrote:
>>      1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295
> (InvalidBlockNumber)
>>          1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%)
> rightlink:4294967295 (InvalidBlockNumber)
> .....
>
> Isn't it a bug?

Yeah, looks like a bug. I must've messed up the WAL logging in my recent 
changes to this. I'll look into that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Let me know if you have questions on the approach taken. 

I realized that the approach which replaces the loaded-subtrees bookkeeping is unclear to me. We save paths between node buffers, but those paths can become invalid on page splits. It seems to me that approximately the same volume of code for maintaining parent links would have to be added to this version of the patch in order to get it working correctly.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 28.07.2011 23:57, Alexander Korotkov wrote:
> On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>>
>> Let me know if you have questions on the approach taken.
>
>
> I realized that the approach which replaces the loaded-subtrees bookkeeping
> is unclear to me. We save paths between node buffers, but those paths can
> become invalid on page splits. It seems to me that approximately the same
> volume of code for maintaining parent links would have to be added to this
> version of the patch in order to get it working correctly.

gistFindCorrectParent() should take care of that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Fri, Jul 29, 2011 at 1:10 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
gistFindCorrectParent() should take care of that.
OK, now it seems that I understand. I need to verify the amount of memory needed for the paths, because it seems that they tend to accumulate. I also need to verify the final emptying, because the IO guarantees of the original paper are based on strictly descending final emptying.
 
------
With best regards,
Alexander Korotkov.

Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 27.07.2011 17:43, Alexander Korotkov wrote:
> OK, thanks. I also found behaviour of GiST (without the patch) with streaming
> replication that seems strange to me. On the master only a few
> rightlinks are InvalidBlockNumber, while on the slave a lot of them are. I
> hacked gevel to get the index structure on the slave (accessing the tree
> without AccessExclusiveLock).
>
> On master:
> # create table test as (select point(random(),random()) from
> generate_series(1,100000));
> # create index test_idx on test using gist(point);
> # \copy (select gist_tree('test_idx')) to 'tree1r.txt';
>
> On slave:
> # \copy (select gist_tree('test_idx')) to 'tree2r.txt';
>
> In bash:
> # cat tree1r.txt | sed 's/\\n/\n/g'>  tree1.txt
> # cat tree2r.txt | sed 's/\\n/\n/g'>  tree2.txt
> # diff tree1.txt tree2.txt
>
> 2,89c2,89
> <      1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:637 (OK)
> <          1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:319
> (OK)
> <          2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:153
> (OK)
> <          3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:551
> (OK)
> <          4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:555
> (OK)
> <          5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:74
> (OK)
> <          6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:294
> (OK)
> <          7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:567
> (OK)
> .....
> ---
>>      1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295
> (InvalidBlockNumber)
>>          1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%)
> rightlink:4294967295 (InvalidBlockNumber)
>>          7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%)
> rightlink:4294967295 (InvalidBlockNumber)
> .....
>
> Isn't it a bug?

Yeah, it sure looks like a bug. I was certain that I had broken this in
the recent changes to GiST handling of page splits, but in fact it has
been like that forever.

The rightlinks are not needed after crash recovery, because all the
downlinks should be there. A scan will find all pages through the
downlinks, and doesn't need to follow any rightlinks. I'm not sure why
we explicitly clear them; it's not like the rightlinks would do any harm
if left in place. For crash recovery, at least, the clearing is harmless.

But a scan during hot standby can see those intermediate states, just
like concurrent scans can on the master. The locking on replay of page
split needs to be fixed, too. At the moment, it locks and writes out
each page separately, so a concurrent scan could "overtake" the WAL
replay while following rightlinks, and miss tuples on the right half.
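
In outline, the fix (a simplified sketch; RestoreSplitPageLocked and the
types are made-up names, not the attached patch):

static void
gistRedoPageSplitLocked(SplitPageInfo *split, int nsplit)  /* illustrative */
{
    Buffer  buffers[MAX_SPLIT_PAGES];   /* assumed bound */
    int     i;

    /* restore every page of the split first, keeping all the locks */
    for (i = 0; i < nsplit; i++)
        buffers[i] = RestoreSplitPageLocked(&split[i]);    /* hypothetical */

    /*
     * Release only when the whole split is replayed, so a hot standby
     * scan following rightlinks can never overtake the replay.
     */
    for (i = 0; i < nsplit; i++)
        UnlockReleaseBuffer(buffers[i]);
}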

Attached is a patch for that for 9.1/master. The 9.0 GiST replay code
was quite different, it will require a separate patch.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Simon Riggs
Date:
On Mon, Aug 1, 2011 at 10:38 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 27.07.2011 17:43, Alexander Korotkov wrote:
>> [gist_tree output comparison of master and slave snipped; see the message
>> quoted above]
>> Isn't it a bug?
>
> Yeah, it sure looks like a bug. I was certain that I had broken this in the
> recent changes to GiST handling of page splits, but in fact it has been like
> that forever.
>
> The rightlinks are not needed after crash recovery, because all the
> downlinks should be there. A scan will find all pages through the downlinks,
> and doesn't need to follow any rightlinks. I'm not sure why we explicitly
> clear them, it's not like the rightlinks would do any harm either, but for
> crash recovery that's harmless.
>
> But a scan during hot standby can see those intermediate states, just like
> concurrent scans can on the master. The locking on replay of page split
> needs to be fixed, too. At the moment, it locks and writes out each page
> separately, so a concurrent scan could "overtake" the WAL replay while
> following rightlinks, and miss tuples on the right half.
>
> Attached is a patch for that for 9.1/master. The 9.0 GiST replay code was
> quite different, it will require a separate patch.

Hmm, I was assured no changes would be required for Hot Standby for
GIST and GIN. Perhaps we should check GIN code also.

Does the order of locking of the buffers matter? I'm sure it does.

Did you want me to write the patch for 9.0?

And what does NSN stand for? :-)

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 01.08.2011 13:13, Simon Riggs wrote:
> On Mon, Aug 1, 2011 at 10:38 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> Attached is a patch for that for 9.1/master. The 9.0 GiST replay code was
>> quite different, it will require a separate patch.
>
> Hmm, I was assured no changes would be required for Hot Standby for
> GIST and GIN. Perhaps we should check GIN code also.

Yeah, we probably should.

> Does the order of locking of the buffers matter? I'm sure it does.

Yep.

> Did you want me to write the patch for 9.0?

I'm looking at it now.

> And what does NSN stand for? :-)

Hmm, I don't actually know what NSN is an acronym for :-). But in 
case you want an explanation of what it does:

The NSN is a field in the GiST page header, used to detect concurrent page 
splits. Whenever a page is split, its NSN is set to the LSN of the page 
split record. To be precise: the NSN of the resulting left page(s) is 
set, the resulting rightmost half keeps the NSN of the original page.

When you scan a parent page and decide to move down, it's possible that 
the child page is split before you read it, but after you read the 
parent page. So you didn't see the downlink for the right half when you 
scanned the parent page. To reach the right half, you need to follow the 
rightlink from the left page, but first you need to detect that the page 
was split. To do that, when you scan the parent page you remember the 
LSN on the parent. When you scan the child, you compare the parent's LSN 
you saw with the NSN of the child. If the child's NSN > parent's LSN, 
the page was split after you scanned the parent, so you need to follow 
the rightlink.

The B-tree code has similar move-right logic, but it uses the "high" key 
on each page to decide when it needs to move right. There's no high key 
on GiST pages, so we rely on the NSN for that.

In 9.1, I added the F_FOLLOW_RIGHT flag to handle the case of an 
incomplete split correctly. If the flag is set on a page, its rightlink 
needs to be followed regardless of the NSN.
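
In rough code, the check at the child looks like this (a sketch; the macro
names approximate the 9.1 sources):

/*
 * After descending from a parent whose LSN we remembered as parentlsn,
 * check the child for a concurrent (or incompletely replayed) split.
 */
if (GistFollowRight(childpage) ||
    XLByteLT(parentlsn, GistPageGetOpaque(childpage)->nsn))
{
    /* the child was split after we read the parent: move right */
    blkno = GistPageGetOpaque(childpage)->rightlink;
    continue;
}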

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Thom Brown
Date:
On 1 August 2011 11:44, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 01.08.2011 13:13, Simon Riggs wrote:
>> And what does NSN stand for? :-)
>
> Hmm, I don't actually know what NSN is an acronym for :-).

Node Sequence Number.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Simon Riggs
Date:
On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown <thom@linux.com> wrote:
> On 1 August 2011 11:44, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> On 01.08.2011 13:13, Simon Riggs wrote:
>>> And what does NSN stand for? :-)
>>
>> Hmm, I don't actually know what NSN is an acronym for :-).
>
> Node Sequence Number.

Do you have a reference to support that explanation?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Thom Brown
Date:
On 1 August 2011 12:25, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown <thom@linux.com> wrote:
>> On 1 August 2011 11:44, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> On 01.08.2011 13:13, Simon Riggs wrote:
>>>> And what does NSN stand for? :-)
>>>
>>> Hmm, I don't actually know what NSN is an acronym for :-).
>>
>> Node Sequence Number.
>
> Do you have a reference to support that explanation?

Here's one reference to it:
http://archives.postgresql.org/pgsql-hackers/2005-06/msg00294.php

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Alexander Korotkov
Date:
On Mon, Aug 1, 2011 at 3:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown <thom@linux.com> wrote:
> On 1 August 2011 11:44, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> On 01.08.2011 13:13, Simon Riggs wrote:
>>> And what does NSN stand for? :-)
>>
>> Hmm, I don't actually know what NSN is an acronym for :-).
>
> Node Sequence Number.

Do you have a reference to support that explanation?

See "Access Methods for Next-Generation Database Systems" by Marcel Kornacker, Chapter 4 "Concurrency and Recovery for GiSTs".

------
With best regards,
Alexander Korotkov.

Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Simon Riggs
Date:
On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

>> Does the order of locking of the buffers matter? I'm sure it does.
>
> Yep.

Do you mean that the BlockNumbers are already in correct sequence, or
that you will be adding this code to redo?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 01.08.2011 14:35, Simon Riggs wrote:
> On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>
>>> Does the order of locking of the buffers matter? I'm sure it does.
>>
>> Yep.
>
> Do you mean that the BlockNumbers are already in correct sequence, or
> that you will be adding this code to redo?

I just meant that yes, the order of locking of the buffers does matter.

I believe the code acquires the locks in the right order already, and the 
patch I posted fixes the premature release of locks at page split.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Simon Riggs
Date:
On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 01.08.2011 14:35, Simon Riggs wrote:
>>
>> On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>
>>>> Does the order of locking of the buffers matter? I'm sure it does.
>>>
>>> Yep.
>>
>> Do you mean that the BlockNumbers are already in correct sequence, or
>> that you will be adding this code to redo?
>
> I just meant that yes, the order of locking of the buffers does matter.
>
> I believe the code acquires the locks in the right order already, and the patch I
> posted fixes the premature release of locks at page split.

Your patch is good, but it does rely on the idea that we're logging
the blocks in the same order they were originally locked. That's a
good assumption, but I would like to see that documented for general
sanity, or just mine at least.

I can't really see anything in the master-side code that attempts to
lock things in a specific sequence, which bothers me also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 01.08.2011 17:26, Simon Riggs wrote:
> On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> I believe the code acquires the locks in the right order already, and the patch I
>> posted fixes the premature release of locks at page split.
>
> Your patch is good, but it does rely on the idea that we're logging
> the blocks in the same order they were originally locked. That's a
> good assumption, but I would like to see that documented for general
> sanity, or just mine at least.
>
> I can't really see anything in the master-side code that attempts to
> lock things in a specific sequence, which bothers me also.

All but the first page are unused pages, grabbed with either P_NEW or 
from the FSM. gistNewBuffer() uses ConditionalLockBuffer() to guard against 
the case that someone else chooses the same victim buffer, and picks 
another page.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Simon Riggs
Date:
On Mon, Aug 1, 2011 at 3:34 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 01.08.2011 17:26, Simon Riggs wrote:
>>
>> On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>>
>>> I believe the code acquires the locks in the right order already, and the patch I
>>> posted fixes the premature release of locks at page split.
>>
>> Your patch is good, but it does rely on the idea that we're logging
>> the blocks in the same order they were originally locked. That's a
>> good assumption, but I would like to see that documented for general
>> sanity, or just mine at least.
>>
>> I can't really see anything in the master-side code that attempts to
>> lock things in a specific sequence, which bothers me also.
>
> All but the first page are unused pages, grabbed with either P_NEW or from
> the FSM. gistNewBuffer() uses ConditionalLockBuffer() to guard against the case
> that someone else chooses the same victim buffer, and picks another page.

Seems good. Thanks for checking some more for me.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 01.08.2011 13:44, Heikki Linnakangas wrote:
> On 01.08.2011 13:13, Simon Riggs wrote:
>> Did you want me to write the patch for 9.0?
>
> I'm looking at it now.

So, in 9.0, we currently leave the rightlink and NSN invalid when 
replaying a page split. To set them correctly, we'd need the old 
rightlink and NSN from the page being split, but the WAL record doesn't 
currently include them. I can see two solutions to this:

1. Add them to the page split WAL record. That's straightforward, but 
breaks WAL format compatibility with older minor versions.

2. Read the old page version, and copy the rightlink and NSN from there. 
Since we're restoring what's basically a full-page image of the page 
after the split, in crash recovery the old contents might be a torn 
page, or a newer version of the page. I believe that's harmless, because 
we only care about the NSN and rightlink in hot standby mode, but it's a 
bit ugly.

If we change the WAL record, we have to make it so that the new version 
can still read the old format, which complicates the implementation a 
bit. Nevertheless, I'm leaning towards option 1.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Simon Riggs
Date:
On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 01.08.2011 13:44, Heikki Linnakangas wrote:
>>
>> On 01.08.2011 13:13, Simon Riggs wrote:
>>>
>>> Did you want me to write the patch for 9.0?
>>
>> I'm looking at it now.
>
> So, in 9.0, we currently leave the rightlink and NSN invalid when replaying
> a page split. To set them correctly, we'd need the old rightlink and NSN
> from the page being split, but the WAL record doesn't currently include
> them. I can see two solutions to this:
>
> 1. Add them to the page split WAL record. That's straightforward, but breaks
> WAL format compatibility with older minor versions.
>
> 2. Read the old page version, and copy the rightlink and NSN from there.
> Since we're restoring what's basically a full-page image of the page after
> the split, in crash recovery the old contents might be a torn page, or a
> newer version of the page. I believe that's harmless, because we only care
> about the NSN and rightlink in hot standby mode, but it's a bit ugly.
>
> If we change the WAL record, we have to make it so that the new version can
> still read the old format, which complicates the implementation a bit.
> Nevertheless, I'm leaning towards option 1.

We may as well do (1), with two versions of the WAL record.

Hmm, the biggest issue is actually that existing GIST indexes are
corrupted, from the perspective of being unusable during HS.

We can fix the cause but that won't repair the existing damage. So the
requirement is for us to re/create new indexes, which can then use a
new WAL record format. We probably need to store some information in
the metapage saying whether or not the index has been maintained only
with v2 WAL records, or with a mixture of v1 and v2 records. If the
latter, then issue a WARNING to rebuild the index.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 02.08.2011 14:36, Simon Riggs wrote:
> On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> If we change the WAL record, we have to make it so that the new version can
>> still read the old format, which complicates the implementation a bit.
>> Nevertheless, I'm leaning towards option 1.
>
> We may as well do (1), with two versions of the WAL record.

Actually I think we can append the new information to the end of the 
page split record, so that an old-version server can read WAL generated 
by a new version, too. It just won't set the rightlink and NSN correctly, 
so hot standby will be broken like it is today.

> Hmm, the biggest issue is actually that existing GIST indexes are
> corrupted, from the perspective of being unusable during HS.
>
> We can fix the cause but that won't repair the existing damage. So the
> requirement is for us to re/create new indexes, which can then use a
> new WAL record format.

No-no, it's not that bad. The rightlinks and NSNs are only needed to 
handle scans concurrent with page splits. The existing indexes are fine; 
you only have a problem if you run queries in hot standby mode while 
you replay page splits on it. As soon as you upgrade the master and 
standby to a new minor version with the fix, that will work too.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Simon Riggs
Date:
On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 02.08.2011 14:36, Simon Riggs wrote:
>>
>> On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>>
>>> If we change the WAL record, we have to make it so that the new version
>>> can
>>> still read the old format, which complicates the implementation a bit.
>>> Nevertheless, I'm leaning towards option 1.
>>
>> We may as well do (1), with two versions of the WAL record.
>
> Actually I think we can append the new information to the end of the page
>> split record, so that an old-version server can read WAL generated by a new
> version, too.

Not sure how that would work. Lengths, CRCs?

Or do you mean we will support 2 versions, have them both called the
same thing, and just resolve which is which by the record length? I don't
like that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hi!

I'm now working on adding features to your version of the patch. The current version is attached. Somehow this version produces a huge amount of WAL, and that makes it slow, even though the count and average length of the WAL records are similar to those of the non-buffering build.

test=# create table points as (select point(random(),random()) from generate_series(1,1000000));
SELECT 1000000
test=# select pg_xlogfile_name_offset(pg_current_xlog_location());       
       pg_xlogfile_name_offset       
-------------------------------------
 (000000010000004000000073,15005048)
(1 row)

test=# create index points_idx on points using gist (point) with (buffering=off);CREATE INDEX
test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
       pg_xlogfile_name_offset       
-------------------------------------
 (00000001000000400000007E,13764024)
(1 row)

test=# create index points_idx2 on points using gist (point) with (buffering=on, neighborrelocation=off);
INFO:  Level step = 1, pagesPerBuffer = 406
NOTICE:  final emptying
NOTICE:  final emptying
NOTICE:  final emptying
NOTICE:  final emptying
CREATE INDEX
test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
       pg_xlogfile_name_offset       
-------------------------------------
 (0000000100000040000000D2,10982288)
(1 row)

Maybe you have some ideas about it?

------
With best regards,
Alexander Korotkov.
Attachment

Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 02.08.2011 15:18, Simon Riggs wrote:
> On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 02.08.2011 14:36, Simon Riggs wrote:
>> Actually I think we can append the new information to the end of the page
>> split record, so that an old-version server can read WAL generated by a new
>> version, too.
>
> Not sure how that would work. Lengths, CRCs?
>
> Or do you mean we will support 2 versions, have them both called the
> same thing, and just resolve which is which by the record length? I don't
> like that.

Here's a patch to do what I meant. The new fields are stored at the very
end of the WAL record, and you check the length to see if they're there
or not. The nice thing about this is that it's compatible in both
directions.
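
Schematically (struct and field names here are assumptions, not the actual
patch):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* stands in for PostgreSQL's type */

typedef struct GistSplitXlogRec
{
    uint32_t    npage;          /* fields present in both formats */

    /* new, optional trailing fields: */
    BlockNumber orig_rlink;     /* rightlink of the page before the split */
    uint64_t    orig_nsn;       /* NSN of the page before the split */
} GistSplitXlogRec;

#define SPLIT_REC_BASE_SIZE offsetof(GistSplitXlogRec, orig_rlink)

/* An old record lacks the trailing fields; detect them by length alone. */
static bool
split_rec_has_orig_fields(uint32_t record_len)
{
    return record_len > SPLIT_REC_BASE_SIZE;
}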

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

From
Heikki Linnakangas
Date:
On 02.08.2011 20:06, Alvaro Herrera wrote:
> Excerpts from Heikki Linnakangas's message of mar ago 02 11:59:24 -0400 2011:
>> On 02.08.2011 15:18, Simon Riggs wrote:
>>> On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
>>> <heikki.linnakangas@enterprisedb.com>   wrote:
>>>> On 02.08.2011 14:36, Simon Riggs wrote:
>>>> Actually I think we can append the new information to the end of the page
>>>> split record, so that an old version server can read WAL generated by new
>>>> version, too.
>>>
>>> Not sure how that would work. Lengths, CRCs?
>>>
>>> Or do you mean we will support 2 versions, have them both called the
>>> same thing, and just resolve which is which by the record length? I don't
>>> like that.
>>
>> Here's a patch to do what I meant. The new fields are stored at the very
>> end of the WAL record, and you check the length to see if they're there
>> or not. The nice thing about this is that it's compatible in both
>> directions.
>
> Err, did you attach the wrong patch?

Yes, sorry about that. Here's the right patch.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
I found that in the previous version of the patch I missed PageSetLSN and PageSetTLI, but the huge amount of WAL is still there. I also found that the huge amount of WAL appears only with -O2. With -O0 the amount of WAL is OK, but messages like "FATAL:  xlog flush request BFF11148/809A600 is not satisfied --- flushed only to 44/9C518750" appear. It seems that there is some totally wrong use of WAL if even the optimization level matters...

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: Fast GiST index build

From
Robert Haas
Date:
On Wed, Aug 3, 2011 at 4:18 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I found that in the previous version of the patch I missed PageSetLSN
> and PageSetTLI, but the huge amount of WAL is still there. I also found that the
> huge amount of WAL appears only with -O2. With -O0 the amount of WAL is OK, but
> messages like "FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
> flushed only to 44/9C518750" appear. It seems there is some totally wrong
> use of WAL if even the optimization level matters...

Try setting wal_debug=true to see what records are getting emitted.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 03.08.2011 11:18, Alexander Korotkov wrote:
> I found that in the previous version of the patch I missed PageSetLSN
> and PageSetTLI, but the huge amount of WAL is still there. I also found that the
> huge amount of WAL appears only with -O2. With -O0 the amount of WAL is OK, but
> messages like "FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
> flushed only to 44/9C518750" appear. It seems there is some totally wrong
> use of WAL if even the optimization level matters...

Try this:

diff --git a/src/backend/access/gist/gistbuild.c 
b/src/backend/access/gist/gistbuild.c
index 5099330..5a441e0 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -478,7 +478,7 @@ bufferingbuildinsert(GISTInsertState *state,
 		/* Write the WAL record */
 		if (RelationNeedsWAL(state->r))
 		{
-			gistXLogUpdate(state->r->rd_node, buffer, oldoffnum, noldoffnum,
+			recptr = gistXLogUpdate(state->r->rd_node, buffer, oldoffnum, noldoffnum,
 							itup, ntup, InvalidBuffer);
 			PageSetLSN(page, recptr);
 			PageSetTLI(page, ThisTimeLineID);


--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Uhh, my bad, really stupid bug. Many thanks.

------
With best regards,
Alexander Korotkov. 


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hi!

Here is the latest version of the patch. The list of the most significant changes compared to your version of the patch:
1) Reference counting of path items was added. It should help against possible accumulation of path items.
2) Neighbor relocation was added.
3) Subtree prefetching was added. 
4) The final emptying algorithm was reverted to the original one. My experiments show that the typical number of passes during final emptying in your version of the patch is about 5, which may be significant in itself. Also, I have no estimate of the number of passes and no guarantee that it will not be high in some corner cases. In short, I prefer the more predictable single-pass algorithm even though it's a little more complex.
5) Even the last page of a node buffer is now unloaded from main memory to disk. Imagine that with levelstep = 1 each inner node has a buffer; it seems to me that keeping one page of each buffer in memory may be too memory-consuming. 

Open items I see at this moment:
1) I dislike my method of switching to the buffering build because it's based on very unproven assumptions, and I haven't found more reliable assumptions in the scientific papers so far. I would like to replace it with something much simpler, for example switching to the buffering build when the regular build actually starts to produce a lot of IO. To implement this approach I need to somehow detect actual IO (not just buffer reads, but misses of the OS cache).
2) I'm worried about the possible size of nodeBuffersTab and the path items. If we imagine an extremely large tree with levelstep = 1, the size of these data structures will be significant. And it's hard to predict that size without knowing the tree size.

------
With best regards,
Alexander Korotkov.  
Attachment

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 07.08.2011 22:28, Alexander Korotkov wrote:
> Here is the latest version of the patch. The list of the most significant
> changes compared to your version of the patch:
> 1) Reference counting of path items was added. It should help against
> possible accumulation of path items.

Ok.

> 2) Neighbor relocation was added.

Ok. I think we're going to need some sort of a heuristic on when to 
enable neighbor relocation. If I remember the performance tests 
correctly, it improves the quality of the resulting index, but incurs a 
significant CPU overhead.

Actually, looking at the performance numbers on the wiki page again 
(http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011), it 
looks like neighbor relocation doesn't help very much with the index 
quality - sometimes it even results in a slightly worse index. Based on 
those results, shouldn't we just remove it? Or is there some other data 
set where it helps significantly?

> 3) Subtree prefetching was added.

I'm inclined to leave out the prefetching code for now. Unless you have 
some performance numbers that show that it's critical for the overall 
performance. But I don't think that was the case, it's just an 
additional optimization for servers with big RAID arrays. So, please 
separate that into an add-on patch. It needs to be performance tested and 
reviewed separately.

> 4) The final emptying algorithm was reverted to the original one. My
> experiments show that the typical number of passes during final emptying in
> your version of the patch is about 5, which may be significant in itself. Also,
> I have no estimate of the number of passes and no guarantee that it will
> not be high in some corner cases. In short, I prefer the more predictable
> single-pass algorithm even though it's a little more complex.

I was trying to get rid of that complexity during index build. Some 
extra code in the final pass would be easier to understand than extra 
work that needs to be done through the index build. It's not a huge 
amount of code, but still.

I'm not worried about the extra CPU overhead of scanning the data 
structures at the final pass. I guess in my patch you had to do extra 
I/O as well, because the buffers were not emptied in strict top-down 
order, so let's avoid that. How about:

Track all buffers in the lists, not only those that are non-empty. Add 
the buffer to the right list at getNodeBuffer(). That way in the final 
stage, you need to scan through all buffers instead of just the 
non-empty ones. But the overhead of that should be minimal in 
practice, scanning some in-memory data structures is pretty cheap 
compared to building an index. That way you wouldn't need to maintain 
the lists during the index build, except for adding each buffer to 
correct lists in getNodeBuffer().

BTW, please use List for the linked lists. No need to re-implement the 
wheel.
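
Roughly like this (just a sketch; the struct fields and exact signature are illustrative, not necessarily what the patch has):

	static GISTNodeBuffer *
	getNodeBuffer(GISTBuildBuffers *gfbb, BlockNumber blkno, int level)
	{
		GISTNodeBuffer *nodeBuffer;
		bool		found;

		nodeBuffer = (GISTNodeBuffer *)
			hash_search(gfbb->nodeBuffersTab, &blkno, HASH_ENTER, &found);
		if (!found)
		{
			/* new buffer: initialize it... */
			nodeBuffer->blocksCount = 0;

			/* ...and register it in the list of all buffers on its level */
			gfbb->buffersOnLevels[level] =
				lcons(nodeBuffer, gfbb->buffersOnLevels[level]);
		}
		return nodeBuffer;
	}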

> 5) Even the last page of a node buffer is now unloaded from main memory to
> disk. Imagine that with levelstep = 1 each inner node has a buffer; it
> seems to me that keeping one page of each buffer in memory may be too
> memory-consuming.
>
> Open items I see at this moment:
> 1) I dislike my method of switching to the buffering build because it's based
> on very unproven assumptions, and I haven't found more reliable assumptions in
> the scientific papers so far. I would like to replace it with something much
> simpler, for example switching to the buffering build when the regular build
> actually starts to produce a lot of IO. To implement this approach
> I need to somehow detect actual IO (not just buffer reads, but misses of the OS
> cache).

Yeah, that's a surprisingly hard problem. I don't much like the method 
used in the patch either.

> 2) I'm worried about the possible size of nodeBuffersTab and the path items. If
> we imagine an extremely large tree with levelstep = 1, the size of these
> data structures will be significant. And it's hard to predict that size
> without knowing the tree size.

I'm not very worried about that in practice. If you have a very large 
index, you presumably have a fair amount of memory too. Otherwise the 
machine is horrendously underpowered to build or do anything useful with 
the index anyway. Nevertheless it would be nice to have some figures on 
that. If you have, say, an index of 1 TB in size, how much memory will 
building the index need?

Miscellaneous observations:

* Please run pgindent over the code, there's a lot of spurious 
whitespace in the patch.
* How about renaming GISTLoadedPartItem to something like 
GISTBulkInsertStack, to resemble the GISTInsertStack struct used in the 
normal insertion code. The "loaded part" nomenclature is obsolete, as 
the patch doesn't explicitly load parts of the tree into memory anymore. 
Think about the names of other structs, variables and functions too, 
GISTLoadedPartItem just caught my eye first but there's probably others 
that could have better names.
* Any user-visible options need to be documented in the user manual.
* And of course, make sure comments and the readme are up-to-date.
* Compiler warning:

reloptions.c:259: warning: initializer-string for array of chars is too long
reloptions.c:259: warning: (near initialization for 
‘stringRelOpts[0].default_val’)

I don't think there's a way to add an entry to stringRelOpts in a way 
that works correctly. That's a design flaw in the reloptions.c code that 
has never come up before, as there hasn't been any string-formatted 
relopts before (actually buffering option might be better served by an 
enum reloption too, if we had that). Please start a new thread on that 
on pgsql-hackers.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Mon, Aug 8, 2011 at 1:23 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
2) Neighbor relocation was added.

Ok. I think we're going to need some sort of a heuristic on when to enable neighbor relocation. If I remember the performance tests correctly, it improves the quality of the resulting index, but incurs a significant CPU overhead.

Actually, looking at the performance numbers on the wiki page again (http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011), it looks like neighbor relocation doesn't help very much with the index quality - sometimes it even results in a slightly worse index. Based on those results, shouldn't we just remove it? Or is there some other data set where it helps significantly?
Oh, actually I didn't add the results with neighborrelocation = off. I would like to rerun some tests with the current version of the patch. 
 
3) Subtree prefetching was added.

I'm inclined to leave out the prefetching code for now. Unless you have some performance numbers that show that it's critical for the overall performance. But I don't think that was the case, it's just an additional optimization for servers with big RAID arrays. So, please separate that into an add-on patch. It needs to be performance tested and reviewed separately.
I thought that prefetch helps even on separate hard disks by ordering the IOs.
 
4) The final emptying algorithm was reverted to the original one. My
experiments show that the typical number of passes during final emptying in
your version of the patch is about 5, which may be significant in itself. Also,
I have no estimate of the number of passes and no guarantee that it will
not be high in some corner cases. In short, I prefer the more predictable
single-pass algorithm even though it's a little more complex.

I was trying to get rid of that complexity during index build. Some extra code in the final pass would be easier to understand than extra work that needs to be done through the index build. It's not a huge amount of code, but still.

I'm not worried about the extra CPU overhead of scanning the data structures at the final pass. I guess in my patch you had to do extra I/O as well, because the buffers were not emptied in strict top-down order, so let's avoid that. How about:

Track all buffers in the lists, not only those that are non-empty. Add the buffer to the right list at getNodeBuffer(). That way in the final stage, you need to scan through all buffers instead of just the non-empty ones. But the overhead of that should be minimal in practice, scanning some in-memory data structures is pretty cheap compared to building an index. That way you wouldn't need to maintain the lists during the index build, except for adding each buffer to correct lists in getNodeBuffer().

BTW, please use List for the linked lists. No need to re-implement the wheel.
Ok.
 
5) Even the last page of a node buffer is now unloaded from main memory to disk.
Imagine that with levelstep = 1 each inner node has a buffer. It
seems to me that keeping one page of each buffer in memory may be too
memory-consuming.

Open items I see at this moment:
1) I dislike my method of switching to the buffering build because it's based
on very unproven assumptions, and I haven't found more reliable assumptions in
the scientific papers so far. I would like to replace it with something much
simpler, for example switching to the buffering build when the regular build
actually starts to produce a lot of IO. To implement this approach
I need to somehow detect actual IO (not just buffer reads, but misses of the OS
cache).

Yeah, that's a surprisingly hard problem. I don't much like the method used in the patch either.
Is it possible to make the buffering build a user-defined option until we have a better idea?
 
2) I'm worried about the possible size of nodeBuffersTab and the path items. If
we imagine an extremely large tree with levelstep = 1, the size of these
data structures will be significant. And it's hard to predict that size
without knowing the tree size.

I'm not very worried about that in practice. If you have a very large index, you presumably have a fair amount of memory too. Otherwise the machine is horrendously underpowered to build or do anything useful with the index anyway. Nevertheless it would be nice to have some figures on that. If you have, say, an index of 1 TB in size, how much memory will building the index need?
I think with points it would be about 1 million buffers and about 100-300 megabytes of RAM, depending on space utilization. That may be OK, because 1 TB is a really huge index. But if maintenance_work_mem is low we can run out of it, though a low maintenance_work_mem is quite strange for a system with 1 TB indexes.
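Rough arithmetic behind that estimate (my own assumptions: 8 kB pages and on the order of 100 index tuples per page): 1 TB / 8 kB is about 134 million pages, of which roughly 1/100 are internal pages, i.e. about 1 million. With levelstep = 1 every internal page owns a buffer, so about 1 million buffers; at 100-300 bytes of bookkeeping per buffer that gives the 100-300 megabytes above.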
 
Miscellaneous observations:

* Please run pgindent over the code, there's a lot of spurious whitespace in the patch.
* How about renaming GISTLoadedPartItem to something like GISTBulkInsertStack, to resemble the GISTInsertStack struct used in the normal insertion code. The "loaded part" nomenclature is obsolete, as the patch doesn't explicitly load parts of the tree into memory anymore. Think about the names of other structs, variables and functions too, GISTLoadedPartItem just caught my eye first but there's probably others that could have better names.
* Any user-visible options need to be documented in the user manual.
* And of course, make sure comments and the readme are up-to-date.
* Compiler warning:

reloptions.c:259: warning: initializer-string for array of chars is too long
reloptions.c:259: warning: (near initialization for ‘stringRelOpts[0].default_val’)

I don't think there's a way to add an entry to stringRelOpts in a way that works correctly. That's a design flaw in the reloptions.c code that has never come up before, as there hasn't been any string-formatted relopts before (actually buffering option might be better served by an enum reloption too, if we had that). Please start a new thread on that on pgsql-hackers.
Ok. 

------
With best regards,
Alexander Korotkov.  

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hi!

Here is the latest version of the patch.
List of changes:
1) Neighbor relocation and prefetch were removed. They will be supplied as separate patches.
2) Final emptying now uses standard lists of all buffers by level.
3) Automatic switching again uses a simple comparison of index size and effective_cache_size.
4) Some renames. In particular GISTLoadedPartItem to GISTBufferingInsertStack.
5) Some comments were corrected and some were added.
6) pgindent
7) rebased with head

Readme update and user documentation coming soon.

------
With best regards,
Alexander Korotkov.  
Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Manual and readme updates.

------
With best regards,
Alexander Korotkov.  
Attachment

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 10.08.2011 13:19, Alexander Korotkov wrote:
> Hi!
>
> Here is the latest version of the patch.
> List of changes:
> 1) Neighbor relocation and prefetch were removed. They will be supplied as
> separate patches.

unloadNodeBuffers() is now dead code.

> 2) Final emptying now uses standard lists of all buffers by level.
> 3) Automatic switching again uses a simple comparison of index size and
> effective_cache_size.

LEAF_PAGES_STATS_* are unused now. We should avoid calling smgrnblocks() on 
every tuple; the overhead of that could add up.

> 4) Some renames. In particular GISTLoadedPartItem
> to GISTBufferingInsertStack.
> 5) Some comments were corrected and some were added.
> 6) pgindent
> 7) rebased with head
>
> Readme update and user documentation coming soon.

I wonder, how hard would it be to merge gistBufferingBuildPlaceToPage() 
with the gistplacetopage() function used in the main codepath? There's 
very little difference between them, and it would be nice to maintain 
just one function. At the very least I think there should be a comment 
in both along the lines of "NOTE: if you change this function, make sure 
you update XXXX (the other function) as well!"

In gistbuild(), in the final emptying stage, there's this special-case 
handling for the root block before looping through the buffers in the 
buffersOnLevels lists:

>         for (;;)
>         {
>             nodeBuffer = getNodeBuffer(gfbb, &buildstate.giststate, GIST_ROOT_BLKNO,
>                                        InvalidOffsetNumber, NULL, false);
>             if (!nodeBuffer || nodeBuffer->blocksCount <= 0)
>                 break;
>             MemoryContextSwitchTo(gfbb->context);
>             gfbb->bufferEmptyingStack = lcons(nodeBuffer, gfbb->bufferEmptyingStack);
>             MemoryContextSwitchTo(buildstate.tmpCtx);
>             processEmptyingStack(&buildstate.giststate, &insertstate);
>         }

What's the purpose of that? Wouldn't the loop through buffersOnLevels 
lists take care of the root node too?

The calculations in initBuffering() desperately need comments. As does 
the rest of the code too, but the heuristics in that function are 
particularly hard to understand without some explanation.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
Split of an internal node works like this:

1. Gather all the existing tuples on the page, plus the new tuple being 
inserted.
2. Call picksplit on the tuples, to divide them into pages
3. Go through all tuples on the buffer associated with the page, and 
divide them into buffers on the new pages. This is done by calling 
penalty function on each buffered tuple.

I wonder if it would be better for index quality to pass the buffered 
tuples to picksplit in the 2nd step, so that they too can affect the 
split decision. Maybe it doesn't make much difference in practice..
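
To spell out step 3 with a toy example (1-D intervals standing in for the opclass; this is illustrative code, not the patch's):

	#include <stdio.h>

	typedef struct { double lo, hi; } Interval;

	/* how much "key" must grow to cover "tup" - a stand-in penalty */
	static double
	penalty(Interval key, Interval tup)
	{
		double	lo = tup.lo < key.lo ? tup.lo : key.lo;
		double	hi = tup.hi > key.hi ? tup.hi : key.hi;

		return (hi - lo) - (key.hi - key.lo);
	}

	int
	main(void)
	{
		Interval	left = {0.0, 4.0};		/* union keys of the two halves */
		Interval	right = {6.0, 10.0};
		Interval	buffered[] = {{1, 2}, {7, 8}, {3, 5}, {9, 9}};
		int			i;

		/* step 3: route each buffered tuple to the cheaper half */
		for (i = 0; i < 4; i++)
			printf("tuple [%g, %g] -> %s buffer\n",
				   buffered[i].lo, buffered[i].hi,
				   penalty(left, buffered[i]) <= penalty(right, buffered[i])
				   ? "left" : "right");
		return 0;
	}

The buffered tuples never influence where the split boundary itself ends up; they are only routed after the fact.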

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Thu, Aug 11, 2011 at 10:21 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Split of an internal node works like this:

1. Gather all the existing tuples on the page, plus the new tuple being inserted.
2. Call picksplit on the tuples, to divide them into pages
3. Go through all tuples on the buffer associated with the page, and divide them into buffers on the new pages. This is done by calling penalty function on each buffered tuple.

I wonder if it would be better for index quality to pass the buffered tuples to picksplit in the 2nd step, so that they too can affect the split decision. Maybe it doesn't make much difference in practice..

I had this idea. But:
1) A buffer contains many more tuples than a page plus the new tuple.
2) The picksplit method can easily be quadratic, for example. 

Let's look at the complexity of the picksplit algorithms:
1) geometric datatypes (point, box etc) - O(n) (BTW, I have serious doubts about it, i.e. an O(n*log(n)) algorithm can be several times better in many cases)
2) pg_trgm and fts - O(n^2)
3) seg - O(n*log(n))
4) cube - O(n^2)

Thus, I believe such a feature should be optional. We can try it as an add-on patch.

------
With best regards,
Alexander Korotkov.  

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Wed, Aug 10, 2011 at 11:45 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
unloadNodeBuffers() is now dead code.
processEmptyingStack calls it.

LEAF_PAGES_STATS_* are unused now.
Removed.
 
Should avoid calling smgrnblocks() on every tuple, the overhead of that could add up.
Now it's called only every BUFFERING_MODE_SWITCH_CHECK_STEP (256) tuples.
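Schematically it is like this (helper and field names here are approximate, not the exact patch code):

	/* cheap check on most tuples; smgrnblocks() only every 256th one */
	if (buildstate->indtuples % BUFFERING_MODE_SWITCH_CHECK_STEP == 0)
	{
		RelationOpenSmgr(index);
		if (smgrnblocks(index->rd_smgr, MAIN_FORKNUM) > switchThreshold)
			gistInitBuffering(buildstate);	/* switch to buffering build */
	}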
 
I wonder, how hard would it be to merge gistBufferingBuildPlaceToPage() with the gistplacetopage() function used in the main codepath? There's very little difference between them, and it would be nice to maintain just one function. At the very least I think there should be a comment in both along the lines of "NOTE: if you change this function, make sure you update XXXX (the other function) as well!"
I doubt they can be effectively merged, but will try.
 
In gistbuild(), in the final emptying stage, there's this special-case handling for the root block before looping through the buffers in the buffersOnLevels lists:

               for (;;)
               {
                       nodeBuffer = getNodeBuffer(gfbb, &buildstate.giststate, GIST_ROOT_BLKNO,
                                                                          InvalidOffsetNumber, NULL, false);
                       if (!nodeBuffer || nodeBuffer->blocksCount <= 0)
                               break;
                       MemoryContextSwitchTo(gfbb->context);
                       gfbb->bufferEmptyingStack = lcons(nodeBuffer, gfbb->bufferEmptyingStack);
                       MemoryContextSwitchTo(buildstate.tmpCtx);
                       processEmptyingStack(&buildstate.giststate, &insertstate);
               }

What's the purpose of that? Wouldn't the loop through buffersOnLevels lists take care of the root node too?
I was worried about node buffer deletion from a list while gistbuild() is scanning that list. That's why I avoided deletion from the lists. 
Now I've added an additional check for the root node in the loop over the list.
 
The calculations in initBuffering() desperately need comments. As does the rest of the code too, but the heuristics in that function are particularly hard to understand without some explanation.
Some comments were added. I'm working on more of them.
 
------
With best regards,
Alexander Korotkov.  
Attachment

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 10.08.2011 22:44, Alexander Korotkov wrote:
> Manual and readme updates.

Thanks, I'm reviewing these now.

Do we want to expose the level-step and buffersize parameters to users? 
They've been useful during testing, but I'm thinking we should be able 
to guess good enough values for them automatically, and just remove the 
options. It's pretty much impossible for a user to tune them correctly, 
it would require deep knowledge of the buffering algorithm.

I'm thinking that even when you explicitly turn buffering on, we should 
still process the first 10000 or so tuples with simple inserts. That way 
we always have a sample of tuples to calculate the average tuple size 
from. It's plausible that if the input data is ordered, looking at the 
first N tuples will give skewed sample, but I don't think there's much 
danger of that in practice. Even if the data is ordered, the length of 
GiST tuples shouldn't vary much.

What happens if we get the levelstep and pagesPerBuffer estimates wrong? 
How sensitive is the algorithm to that? Or will we run out of memory? 
Would it be feasible to adjust those in the middle of the index build, 
if we e.g exceed the estimated memory usage greatly?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 10.08.2011 22:44, Alexander Korotkov wrote:
> Manual and readme updates.

I went through these, and did some editing and rewording. Attached is an
updated README, and an updated patch of the doc changes. Let me know if
I screwed up something.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 10.08.2011 22:44, Alexander Korotkov wrote:
Manual and readme updates.

Thanks, I'm reviewing these now.

Do we want to expose the level-step and buffersize parameters to users? They've been useful during testing, but I'm thinking we should be able to guess good enough values for them automatically, and just remove the options. It's pretty much impossible for a user to tune them correctly, it would require deep knowledge of the buffering algorithm.

I'm thinking that even when you explicitly turn buffering on, we should still process the first 10000 or so tuples with simple inserts. That way we always have a sample of tuples to calculate the average tuple size from. It's plausible that if the input data is ordered, looking at the first N tuples will give skewed sample, but I don't think there's much danger of that in practice. Even if the data is ordered, the length of GiST tuples shouldn't vary much.

What happens if we get the levelstep and pagesPerBuffer estimates wrong? How sensitive is the algorithm to that? Or will we run out of memory? Would it be feasible to adjust those in the middle of the index build, if we e.g exceed the estimated memory usage greatly?

I see the following risks.

For levelstep:
Too small: not as great an IO benefit as there could be
Too large: 
  1) If the subtree doesn't fit in effective_cache, much more IO than there should be (because of cache misses during buffer emptying)
  2) If the last pages of buffers don't fit in maintenance_work_mem, possible OOM

For buffersize:
Too small: less IO benefit, because the buffer size is relatively small in comparison with the sub-tree size.
Too large: greater CPU overhead (because of more penalty calls) than there could be with the same IO benefit.

Thereby I propose the following.
1) A too-large levelstep is the greatest risk. Let's use a pessimistic estimate for it. The pessimistic estimate has the following logic:
largest sub-tree => maximal tuples per page => minimal tuple size
Thereby, by always using the minimal tuple size in the levelstep calculation we exclude the greatest risks.
2) The risks of buffersize are comparable and not too critical. That's why I propose to use the size of the first 10000 tuples for the estimate.

------
With best regards,
Alexander Korotkov.  

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 11.08.2011 23:30, Alexander Korotkov wrote:
> On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> On 10.08.2011 22:44, Alexander Korotkov wrote:
>>
>>> Manual and readme updates.
>>>
>>
>> Thanks, I'm reviewing these now.
>>
>> Do we want to expose the level-step and buffersize parameters to users?
>> They've been useful during testing, but I'm thinking we should be able to
>> guess good enough values for them automatically, and just remove the
>> options. It's pretty much impossible for a user to tune them correctly, it
>> would require deep knowledge of the buffering algorithm.
>>
>> I'm thinking that even when you explicitly turn buffering on, we should
>> still process the first 10000 or so tuples with simple inserts. That way we
>> always have a sample of tuples to calculate the average tuple size from.
>> It's plausible that if the input data is ordered, looking at the first N
>> tuples will give skewed sample, but I don't think there's much danger of
>> that in practice. Even if the data is ordered, the length of GiST tuples
>> shouldn't vary much.
>>
>> What happens if we get the levelstep and pagesPerBuffer estimates wrong?
>> How sensitive is the algorithm to that? Or will we run out of memory? Would
>> it be feasible to adjust those in the middle of the index build, if we e.g
>> exceed the estimated memory usage greatly?
>
>
> I see the following risks.
>
> For levelstep:
> Too small: not as great an IO benefit as there could be
> Too large:
>    1) If the subtree doesn't fit in effective_cache, much more IO than there
> should be (because of cache misses during buffer emptying)
>    2) If the last pages of buffers don't fit in maintenance_work_mem, possible
> OOM

Hmm, we could avoid running out of memory if we used a LRU cache 
replacement policy on the buffer pages, instead of explicitly unloading 
the buffers. 1) would still apply, though.

> For buffersize:
> Too small: less IO benefit, because the buffer size is relatively small in
> comparison with the sub-tree size.
> Too large: greater CPU overhead (because of more penalty calls) than there
> could be with the same IO benefit.
>
> Thereby I propose the following.
> 1) A too-large levelstep is the greatest risk. Let's use a pessimistic estimate
> for it. The pessimistic estimate has the following logic:
> largest sub-tree =>  maximal tuples per page =>  minimal tuple size
> Thereby, by always using the minimal tuple size in the levelstep calculation we
> exclude the greatest risks.
> 2) The risks of buffersize are comparable and not too critical. That's why I
> propose to use the size of the first 10000 tuples for the estimate.

Yep, sounds reasonable.

I think it would also be fairly simple to decrease levelstep and/or 
adjust buffersize on-the-fly. The trick would be in figuring out the 
heuristics on when to do that.

Another thing occurred to me while looking at the buffer emptying 
process: At the moment, we stop emptying after we've flushed 1/2 buffer 
size worth of tuples. The point of that is to avoid overfilling a 
lower-level buffer, in the case that the tuples we emptied all landed on 
the same lower-level buffer. Wouldn't it be fairly simple to detect that 
case explicitly, and stop the emptying process only if one of the 
lower-level buffers really fills up? That should be more efficient, as 
you would have "swap" between different subtrees less often.
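
A toy model of the stopping rule I have in mind, just to spell it out (plain standalone C, nothing from the patch):

	#include <stdio.h>

	#define BUFFER_MAX 4			/* toy node buffer capacity, in tuples */

	int
	main(void)
	{
		int		parent = 10;		/* tuples waiting in the parent buffer */
		int		child[2] = {2, 3};	/* fill level of two child buffers */
		int		moved = 0;

		while (parent > 0)
		{
			int		target = moved % 2;	/* stand-in for penalty routing */

			parent--;
			child[target]++;
			moved++;
			/* stop only when some child really fills up; it will be
			 * emptied on its own turn */
			if (child[target] >= BUFFER_MAX)
				break;
		}
		printf("moved %d tuples, %d left in parent\n", moved, parent);
		return 0;
	}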

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Fri, Aug 12, 2011 at 12:23 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
I think it would also be fairly simple to decrease levelstep and/or adjust buffersize on-the-fly. The trick would be in figuring out the heuristics on when to do that.
It would be simple to decrease levelstep to one of its divisors. It seems quite hard to decrease it, for example, from 3 to 2. Also, it's pretty hard to detect that a sub-tree actually doesn't fit in the cache. I don't see much difficulty in run-time tuning of buffersize.
  
Another thing occurred to me while looking at the buffer emptying process: At the moment, we stop emptying after we've flushed 1/2 buffer size worth of tuples. The point of that is to avoid overfilling a lower-level buffer, in the case that the tuples we emptied all landed on the same lower-level buffer. Wouldn't it be fairly simple to detect that case explicitly, and stop the emptying process only if one of the lower-level buffers really fills up? That should be more efficient, as you would have "swap" between different subtrees less often.
 Yes, it seems reasonable to me.

------
With best regards,
Alexander Korotkov.  

Re: WIP: Fast GiST index build

From
Robert Haas
Date:
On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> [ new patch ]

Some random comments:

- It appears that the "noFollowFight" flag is really supposed to be
called "noFollowRight".

- In gist_private.h you've written "halt-filled" where you really mean
"half-filled".

- It seems like you're using reloptions to set parameters that are
only going to do anything at index creation time.  IIUC, "BUFFERING",
"LEVELSTEP" and "BUFFERSIZE" have no permanent meaning for that index;
they're just used ephemerally while constructing it.  If we're going
to expose such things as options, maybe they should be GUCs, not
reloptions.

- Function names should begin with "gist" or some other, appropriate
prefix, especially if they are non-static.  decreasePathRefcount(),
getNodeBuffer(), relocateBuildBuffersOnSplit(), adn
getNodeBufferBusySize() violate this rule, and it might be good to
change the static functions to follow it, too, just for consistency,
and to avoid renaming things if something that's currently static
later needs to be made non-static.

- validateBufferOption needs to use ereport(), not elog().

- This needs a bit of attention:

+               /* TODO: Write the WAL record */
+               if (RelationNeedsWAL(state->r))
+                       recptr = gistXLogSplit(state->r->rd_node,
blkno, is_leaf,
+                                                               dist,
oldrlink, oldnsn, InvalidBuffer, true);
+               else
+                       recptr = GetXLogRecPtrForTemp();
+

I don't think the comment matches the code, since gistXLogSplit() does
in fact call XLogInsert().  Also, you should probably move the
RelationNeedsWAL() test inside gistXLogSplit().  Otherwise, every
single caller of gistXLogSplit() is going to need to do the same
dance.

- In gistBufferingPlaceToPage, you've got a series of loops that look like this:

+               for (ptr = dist; ptr; ptr = ptr->next)

The first loop allocates a bunch of buffers.  The second loop sets up
downlink pointers.   Then there's some other code.  Then there's a
third loop, which adds items to each page in turn and sets up right
links.  Then there's a fourth loop, which marks all those buffers
dirty.  Then you write XLOG.  Then there's a fifth loop, which sets
all the LSNs and TLIs, and a sixth loop, which does
UnlockReleaseBuffer() on each valid buffer in the list.  All of this
seems like it could be simplified.  In particular, the third and
fourth loops can certainly be merged - you should set the dirty bit at
the same time you're adding items to the page.  And the fifth and
sixth loops can also be merged.  You certainly don't need to set all
the LSNs and TLIs before releasing any of the buffer locks & pins.
I'm not sure if there's any more merging that can be done than that,
but you might want to have a look.

I'm also wondering how long this linked list can be.  It's not good to
have large numbers of buffers locked for a long period of time.  At
the very least, some comments are in order here.

Another general comment about this function is that it seems like it
is backwards.  The overall flow of the function is:

if (is_split)
{   /* complicated stuff */
}
else
{   /* simple case */
}

It seems like it might be better to flip that around and do this:

if (!is_split)
{   /* simple case */   return result;
}
/* complicated stuff */

It's easier to read and avoids having the indentation get too deep.

- As I look at this more, I see that a lot of the logic in
gistBufferingBuildPlaceToPage is copied from gistplacetopage().  It
would be nice to move the common bits to common subroutines that both
functions can call, instead of duplicating the code.

- On a related note, gistBufferingBuildPlaceToPage needs to do
START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
sequence, as gistplacetopage() does.

- gistFindCorrectParent() seems to rely heavily on the assumption that
there's no concurrent activity going on in this index.  Otherwise,
it's got to be unsafe to release the buffer lock before using the
answer the function computes.  Some kind of comment seems like it
would be a good idea.

- On a more algorithmic note, I don't really understand why we attach
buffers to all pages on a level or none of them.  If it isn't
necessary to have buffers on every internal page in the tree, why do
we have them on every other level or every third level rather than,
say, creating them on the fly in whatever parts of the tree end up
busy?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hi!

Thank you for your notes.

On Fri, Aug 12, 2011 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> [ new patch ]

Some random comments:

- It appears that the "noFollowFight" flag is really supposed to be
called "noFollowRight".
Fixed.

- In gist_private.h you've written "halt-filled" where you really mean
"half-filled".
Fixed.
 
- It seems like you're using reloptions to set parameters that are
only going to do anything at index creation time.  IIUC, "BUFFERING",
"LEVELSTEP" and "BUFFERSIZE" have no permanent meaning for that index;
they're just used ephemerally while constructing it.  If we're going
to expose such things as options, maybe they should be GUCs, not
reloptions.
Having these as index parameters may be helpful when you reindex. It's likely that you would like to rebuild the index with the same parameters it was created with. Actually, we have the same situation with FILLFACTOR: it is used only during index creation.
 
- Function names should begin with "gist" or some other, appropriate
prefix, especially if they are non-static.  decreasePathRefcount(),
getNodeBuffer(), relocateBuildBuffersOnSplit(), and
getNodeBufferBusySize() violate this rule, and it might be good to
change the static functions to follow it, too, just for consistency,
and to avoid renaming things if something that's currently static
later needs to be made non-static.
Fixed.
 
- validateBufferOption needs to use ereport(), not elog().
Fixed.
 
- This needs a bit of attention:

+               /* TODO: Write the WAL record */
+               if (RelationNeedsWAL(state->r))
+                       recptr = gistXLogSplit(state->r->rd_node,
blkno, is_leaf,
+                                                               dist,
oldrlink, oldnsn, InvalidBuffer, true);
+               else
+                       recptr = GetXLogRecPtrForTemp();
+

I don't think the comment matches the code, since gistXLogSplit() does
in fact call XLogInsert(). Also, you should probably move the
RelationNeedsWAL() test inside gistXLogSplit().  Otherwise, every
single caller of gistXLogSplit() is going to need to do the same
dance.

- In gistBufferingPlaceToPage, you've got a series of loops that look like this:

+               for (ptr = dist; ptr; ptr = ptr->next)

The first loop allocates a bunch of buffers.  The second loop sets up
downlink pointers.   Then there's some other code.  Then there's a
third loop, which adds items to each page in turn and sets up right
links.  Then there's a fourth loop, which marks all those buffers
dirty.  Then you write XLOG.  Then there's a fifth loop, which sets
all the LSNs and TLIs, and a sixth loop, which does
UnlockReleaseBuffer() on each valid buffer in the list.  All of this
seems like it could be simplified.  In particular, the third and
fourth loops can certainly be merged - you should set the dirty bit at
the same time you're adding items to the page.  And the fifth and
sixth loops can also be merged.  You certainly don't need to set all
the LSNs and TLIs before releasing any of the buffer locks & pins.
I'm not sure if there's any more merging that can be done than that,
but you might want to have a look.

I'm also wondering how long this linked list can be.  It's not good to
have large numbers of buffers locked for a long period of time.  At
the very least, some comments are in order here.

Another general comment about this function is that it seems like it
is backwards.  The overall flow of the function is:

if (is_split)
{
   /* complicated stuff */
}
else
{
   /* simple case */
}

It seems like it might be better to flip that around and do this:

if (!is_split)
{
   /* simple case */
   return result;
}
/* complicated stuff */

It's easier to read and avoids having the indentation get too deep.

- As I look at this more, I see that a lot of the logic in
gistBufferingBuildPlaceToPage is copied from gistplacetopage().  It
would be nice to move the common bits to common subroutines that both
functions can call, instead of duplicating the code.

- On a related note, gistBufferingBuildPlaceToPage needs to do
START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
sequence, as gistplacetopage() does.
Meanwhile, I've merged gistplacetopage() and gistBufferingBuildPlaceToPage(). Now I'm trying some more refactoring.
 
- gistFindCorrectParent() seems to rely heavily on the assumption that
there's no concurrent activity going on in this index.  Otherwise,
it's got to be unsafe to release the buffer lock before using the
answer the function computes.  Some kind of comment seems like it
would be a good idea.
A corresponding comment was added.
 
- On a more algorithmic note, I don't really understand why we attach
buffers to all pages on a level or none of them.  If it isn't
necessary to have buffers on every internal page in the tree, why do
we have them on every other level or every third level rather than,
say, creating them on the fly in whatever parts of the tree end up
busy?
The idea of having buffers on levels with some step is the following. We have enough cache for a sub-tree of some height to fit in it. Once we have loaded such a sub-tree, we can process index tuples inside it efficiently (without actual IO). During buffer emptying we're flushing index tuples to underlying buffers or leaf pages. By having buffers on levels with a step, we guarantee that flushing doesn't require loading (and writing) more than such a sub-tree (which fits in cache). Thus, if we've processed enough index tuples during emptying, it's IO-efficient. It's possible that some more effective distribution of buffers exists, but it's currently unclear to me.
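For example (my own rough numbers): with 100 index tuples per page and levelstep = 2, a sub-tree under a buffered node spans at most 1 + 100 + 100^2 = 10101 pages, which is about 80 MB with 8 kB pages; that is the amount which has to stay cached for emptying to remain cheap.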

Other changes:
1) Levelstep and buffersize user options were removed.
2) Buffer size is now tuned at run time.
3) Buffer emptying now stops when some child can't take index tuples anymore.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
I found that I forgot to remove levelstep and buffersize from reloptions.c. Updated patch is attached.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
Looking at the calculation of levelStep:

> +     /*
> +      * Calculate levelStep from the available amount of memory. We should be
> +      * able to load into main memory one page of each underlying node buffer
> +      * (which are levelStep levels below). That gives a constraint over
> +      * maintenance_work_mem. Also we should be able to have a subtree of
> +      * levelStep levels in cache. That gives a constraint over
> +      * effective_cache_size.
> +      *
> +      * The i'th underlying level of a sub-tree can consist of
> +      * maxIndexTuplesPerPage^i pages at maximum. So, a subtree of levelStep
> +      * levels can't be greater than 2 * maxIndexTuplesPerPage ^ levelStep
> +      * pages. We use some extra reserve because we probably can't take the
> +      * whole effective cache, and use the formula 4 * maxIndexTuplesPerPage ^
> +      * levelStep = effectiveCache. We use similar logic with
> +      * maintenance_work_mem. We should be able to store at least the last
> +      * pages of all buffers that we are emptying the current buffer into.
> +      */
> +     effectiveMemory = Min(maintenance_work_mem * 1024 / BLCKSZ,
> +                           effective_cache_size);
> +     levelStep = (int) log((double) effectiveMemory / 4.0) /
> +         log((double) maxIndexTuplesPerPage);
> +

I can see that that's equal to the formula given in the paper, 
log_B(M/4B), but I couldn't see any explanation for that formula in the 
paper. Your explanation makes sense, but where did it come from?

It seems a bit pessimistic: while it's true that a subtree can't be 
larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter 
upper bound on it. The max size of a subtree of depth n can be 
calculated as the geometric series:

r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)

where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but 
for a large r and small n (which is typical), the 2 * 
maxIndexTuplesPerPage^levelStep formula gives a value that's almost 
twice as large as the real max size of a subtree.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Tue, Aug 16, 2011 at 4:04 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
I can see that that's equal to the formula given in the paper, log_B(M/4B), but I couldn't see any explanation for that formula in the paper. Your explanation makes sense, but where did it come from?
I didn't find it either. But it has to reserve memory for both the sub-tree and the active buffers, while we're reserving memory for the sub-tree against effective_cache_size and memory for the last pages of buffers against maintenance_work_mem.
 
It seems a bit pessimistic: while it's true that a subtree can't be larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter upper bound on it. The max size of a subtree of depth n can be calculated as the geometric series:

r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)

where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but for a large r and small n (which is typical), the 2 * maxIndexTuplesPerPage^levelStep formula gives a value that's almost twice as large as the real max size of a subtree.
Thus, we can calculate:
levelstep = min(log_r(1 + effective_cache_size_in_pages*(r - 1)) - 1, log_r(maintenance_work_mem_in_pages - 1))
and get a more precise result. But we also need at least a very rough estimate of the memory occupied by the node buffers hash table and the path items.
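
For reference, the first term is just the geometric series bound: requiring the whole sub-tree to fit in a cache of C pages means

    (r^(levelstep + 1) - 1) / (r - 1) <= C   =>   levelstep <= log_r(1 + C*(r - 1)) - 1

with r = maxIndexTuplesPerPage; the second term similarly bounds the number of buffer last-pages held in memory (roughly r^levelstep of them) by maintenance_work_mem.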

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
Why is there ever a buffer on the root node? It seems like a waste of 
time to load N tuples from the heap into the root buffer, only to empty 
the buffer after it fills up. You might as well pull tuples directly 
from the heap.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Tue, Aug 16, 2011 at 9:43 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Why is there ever a buffer on the root node? It seems like a waste of time to load N tuples from the heap into the root buffer, only to empty the buffer after it fills up. You might as well pull tuples directly from the heap.
Yes, seems reasonable. The buffer on the root node was in the paper. But now I don't see the need for it either.

------
With best regards,
Alexander Korotkov. 

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 16.08.2011 21:46, Alexander Korotkov wrote:
> On Tue, Aug 16, 2011 at 9:43 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> Why is there ever a buffer on the root node? It seems like a waste of time
>> to load N tuples from the heap into the root buffer, only to empty the
>> buffer after it fills up. You might as well pull tuples directly from the
>> heap.
>>
> Yes, seems reasonable. The buffer on the root node was in the paper. But now I
> don't see the need for it either.

Here's an version of the patch with a bunch of minor changes:

* No more buffer on root node. Aside from the root buffer being 
pointless, this simplifies gistRelocateBuildBuffersOnSplit slightly as 
it doesn't need the special case for root block anymore.

* Moved the code to create new root item from gistplacetopage() to 
gistRelocateBuildBuffersOnSplit(). Seems better to keep the 
buffering-related code away from the normal codepath, for the sake of 
readability.

* Changed the levelStep calculation to use the more accurate upper bound 
on subtree size that we discussed.

* Changed the levelStep calculation so that it doesn't just take 
min(maintenance_work_mem, effective_cache_size) and calculate the 
levelStep from that. maintenance_work_mem determines the max. 
number of page buffers that can be held in memory at a time, while 
effective_cache_size determines the max size of the subtree. Those are 
subtly different things.

* Renamed NodeBuffer to GISTNodeBuffer, to avoid cluttering the namespace

* Plus misc comment, whitespace, formatting and naming changes.

I think this patch is in pretty good shape now. Could you re-run the 
performance tests you have on the wiki page, please, to make sure the 
performance hasn't regressed? It would also be nice to get some testing 
on the levelStep and pagesPerBuffer estimates, and the point where we 
switch to the buffering method. I'm particularly interested to know if 
there are any corner cases with very skewed data distributions or strange 
GUC settings, where the estimates fail badly.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 16.08.2011 22:10, Heikki Linnakangas wrote:
> Here's an version of the patch with a bunch of minor changes:

And here it really is, this time with an attachment...

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 16.08.2011 22:10, Heikki Linnakangas wrote:
Here's an version of the patch with a bunch of minor changes:

And here it really is, this time with an attachment...
Thanks a lot. I'm going to start rerunning the tests now.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Wed, Aug 17, 2011 at 11:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 16.08.2011 22:10, Heikki Linnakangas wrote:
Here's an version of the patch with a bunch of minor changes:

And here it really is, this time with an attachment...
Thanks a lot. I'm going to start rerunning the tests now.

The first bunch of test results will be available soon (running the tests and processing the results take some time). Meanwhile, here is a patch with a few small bugfixes.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
I've added some testing results to the wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011
These are not all the results I planned for the first chunk, because testing takes more time than I expected. 
Some notes about them.

Now I see two causes which accelerate the regular build of GiST indexes:
1) As was noted before, a regular index build on a pretty well ordered dataset is fast.
2) I found that a worse index is faster to build. By a worse index I mean an index with higher overlap. The gistchoose function selects the first index tuple with zero penalty, if any. Thus, with higher overlap in the root page only a few of its index tuples will ever be chosen for inserts. And, recursively, only a small part of the tree will be used for actual inserts. And that part of the tree can more easily fit in the cache. Thus, high overlap makes inserts cheaper to the same degree that it makes searches more expensive.
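To illustrate the effect (a simplified model of the selection loop, not the actual gistchoose code):

	/* scan the downlinks; the first zero-penalty entry wins */
	int
	choose_downlink(const double *penalty, int n)
	{
		int		which = 0;
		double	best = penalty[0];
		int		i;

		for (i = 1; i < n; i++)
		{
			if (penalty[i] < best)
			{
				best = penalty[i];
				which = i;
			}
			if (best == 0.0)
				break;
		}
		return which;
	}

With heavy overlap many downlinks have zero penalty, but only the first of them is ever returned, so most inserts funnel into the same few subtrees.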

In the tests on the first version of the patch I found the index quality of the regular build much better than that of the buffering build (without neighbor relocation). Now they are similar, though that's because the index quality of the regular index build has become worse. Thereby, in the current tests the regular index build is faster than in the previous ones. I see the following possible causes:
1) I didn't save the source random data, so this is new random data.
2) Some environment parameters of my test setup may have changed, though I doubt it.
Despite these possible explanations it seems quite strange to me. 

In order to compare the index build methods on higher-quality indexes, I've tried to build indexes with my double sorting split method (see: http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). With it, search on a uniform dataset is about 10 times faster! And, as expected, the regular index build becomes much slower: it has been running for more than 60 hours and only about 50% of the index is complete (estimated by file sizes).

Also, automatic switching to the buffering build shows better index quality results in all the tests, though it's hard for me to explain that.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 24.08.2011 16:57, Alexander Korotkov wrote:
> I've added some testing results to the wiki page:
> http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011
> These are not all the results I planned for the first chunk, because testing
> takes more time than I expected.
> Some notes about it.
>
> Now I see two causes which accelerate the regular build of GiST indexes:
> 1) As was noted before, a regular index build on a pretty well ordered dataset
> is fast.
> 2) I found that a worse index is faster to build. By a worse index I mean an
> index with higher overlap. The gistchoose function selects the first index
> tuple with zero penalty, if any. Thus, with higher overlap in the root page
> only a few of its index tuples will ever be chosen for inserts. And,
> recursively, only a small part of the tree will be used for actual inserts.
> And that part of the tree can more easily fit in the cache. Thus, high overlap
> makes inserts cheaper to the same degree that it makes searches more expensive.

As an extreme case, a trivial penalty function that just always returns 
0 will make index build fast - but the index will be useless for querying.

> In the tests on the first version of the patch I found the index quality of the
> regular build much better than that of the buffering build (without neighbor
> relocation). Now they are similar, though that's because the index quality of
> the regular index build has become worse. Thereby, in the current tests the
> regular index build is faster than in the previous ones. I see the following
> possible causes:
>   1) I didn't save the source random data, so this is new random data.
> 2) Some environment parameters of my test setup may have changed, though I
> doubt it.
> Despite these possible explanations it seems quite strange to me.

That's pretty surprising. Assuming the data is truly random, I wouldn't 
expect a big difference in the index quality of one random data set over 
another. If the index quality depends so much on, say, the distribution 
of the few first tuples that are inserted to it, that's a quite 
interesting find on its own, and merits some further research.

> In order to compare the index build methods on higher-quality indexes, I've
> tried to build indexes with my double sorting split method (see:
> http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). With
> it, search on a uniform dataset is about 10 times faster! And, as expected,
> the regular index build becomes much slower: it has been running for more than
> 60 hours and only about 50% of the index is complete (estimated by file sizes).
>
> Also, automatic switching to the buffering build shows better index quality
> results in all the tests, though it's hard for me to explain that.

Hmm, makes me a bit uneasy that we're testing with a modified page 
splitting algorithm. But if the new algorithm is that good, could you 
post that as a separate patch, please?

That said, I don't see any new evidence that the buffering build 
algorithm would be significantly worse. There's the case of ordered data 
that we already knew about, and will have to just accept for now.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 22.08.2011 13:23, Alexander Korotkov wrote:
> On Wed, Aug 17, 2011 at 11:11 AM, Alexander Korotkov
> <aekorotkov@gmail.com>wrote:
>
>> On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas<
>> heikki.linnakangas@enterprisedb.com>  wrote:
>>
>>> On 16.08.2011 22:10, Heikki Linnakangas wrote:
>>>
>>>> Here's an version of the patch with a bunch of minor changes:
>>>>
>>>
>>> And here it really is, this time with an attachment...
>>
>> Thanks a lot. I'm going to start rerunning the tests now.
>>
>
> The first bunch of test results will be available soon (running the tests
> and processing the results takes some time). Meanwhile, here is a patch
> with a few small bugfixes.

I've been mulling this through, and will continue working on this
tomorrow, but wanted to share this version meanwhile:

* Moved all the buffering build logic from gistplacetopage() to a new
function in gistbuild.c. There are almost no changes to gistplacetopage()
now: it returns the SplitInfo struct as usual, and the new function
deals with that and handles the call to
gistRelocateBuildBuffersOnSplit(), and the recursion to insert downlinks.

* Simplified the handling of buffersOnLevels lists a bit. There's now an
entry in the buffersOnLevels array for all levels, even those that don't
have buffers because levelStep > 1. That wastes a few bytes in the
array, but it's easier to debug and understand that way. Also, there
are no separate Len and Count variables for it anymore.

* Moved validateBufferingOption() to gistbuild.c

* Moved the code to add buffer to emptying queue to
gistPushItupToNodeBuffer() (was handled by the callers previously)

* Removed gistGetNodeBufferBusySize(), it was unused

* A lot of comment changes

Could you share the test scripts, patches and data sets etc. needed to
reproduce the tests you've been running? I'd like to try them out on a
test server.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Could you share the test scripts, patches and data sets etc. needed to reproduce the tests you've been running? I'd like to try them out on a test server.

1) I've updated links to the datasets on the wiki page.
2) The script for index quality testing, fastbuild_test.php, is in the attachment. In order to run it you need PHP with the pdo and pdo_pgsql modules. The plantuner module is also required (it is used to force the planner to use a specific index). After running the script, the following query returns a relative score of index quality:

select indexname,
       avg(count::real /
           (select count from test_result a2
             where a2.indexname = 'usnoa2_idx3'
               and a2.predicate = a1.predicate
               and a2.tablename = a1.tablename)::real)
  from test_result a1
 where a1.tablename = 'usnoa2'
 group by indexname;

where 'usnoa2' is the table name and 'usnoa2_idx3' is the name of the index whose quality is taken as 1.
3) A patch which makes plantuner work with HEAD is also in the attachment.
4) A patch with my split algorithm implementation is attached. Its current form is appropriate only for testing purposes.
5) For index creation I use a simple script which is attached as 'indexes.sql'. I also run a similar script, with different index names, together with my split patch.

Feel free to ask questions about all this stuff.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Thu, Aug 25, 2011 at 10:53 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

In the tests on the first version of the patch, the index quality of regular
build was much better than that of buffering build (without neighbor
relocation). Now they are similar, but only because the index quality of
regular build has become worse. As a consequence, regular index build is
faster in the current tests than in the previous ones. I see the following
possible causes:
1) I didn't save the source random data, so the tests now run on newly generated random data.
2) Some environment parameters of my test setup may have changed, though I doubt it.
Despite these possible explanations, it still seems quite strange to me.

That's pretty surprising. Assuming the data is truly random, I wouldn't expect a big difference in the index quality of one random data set over another. If the index quality depends so much on, say, the distribution of the few first tuples that are inserted to it, that's a quite interesting find on its own, and merits some further research.
Yeah, it's pretty strange. Using the same random datasets in different tests would help to exclude one possible cause of the difference.
 
In order to compare index build methods on higher-quality indexes, I've
tried building indexes with my double sorting split method (see:
http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36).
On a uniform dataset, search is about 10 times faster! And, as expected,
regular index build becomes much slower: it has been running for more than
60 hours and only about 50% of the index is complete (estimated from file
sizes).

Also, automatic switching to buffering build shows better index quality
results in all the tests, although I find that hard to explain.

Hmm, makes me a bit uneasy that we're testing with a modified page splitting algorithm. But if the new algorithm is that good, could you post that as a separate patch, please?
I've posted it in another message and I will try to get it into a more appropriate form. Let me clarify this a little. I don't think my split algorithm is 10 times better than state-of-the-art algorithms. I think that the currently used new linear split shows unreasonably bad results in many cases. For example, uniformly distributed data is a pretty easy case, and with almost any splitting algorithm we can get an index with almost zero overlap. But new linear split produces huge overlaps in this case. That's why I decided to make some experiments with another split algorithm. 

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 26.08.2011 17:18, Alexander Korotkov wrote:
> On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> Could you share the test scripts, patches and data sets etc. needed to
>> reproduce the tests you've been running? I'd like to try them out on a test
>> server.
>>
>
> 1) I've updated links to the datasets on the wiki page.
> 2) The script for index quality testing, fastbuild_test.php, is in the
> attachment. In order to run it you need PHP with the pdo and pdo_pgsql
> modules. The plantuner module is also required (it is used to force the
> planner to use a specific index). After running the script, the following
> query returns a relative score of index quality:
>
> select indexname, avg(count::real/(select count from test_result a2 where
> a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
> a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
> 'usnoa2' group by indexname;
>
> where 'usnoa2' is the table name and 'usnoa2_idx3' is the name of the index
> whose quality is taken as 1.
> 3) A patch which makes plantuner work with HEAD is also in the attachment.
> 4) A patch with my split algorithm implementation is attached. Its current
> form is appropriate only for testing purposes.
> 5) For index creation I use a simple script which is attached as
> 'indexes.sql'. I also run a similar script, with different index names,
> together with my split patch.
>
> Feel free to ask questions about all this stuff.

Thanks! Meanwhile, I hacked together a script of my own to do 
performance testing. I let it run over the weekend, but I just realized 
that I forgot to vacuum the test tables so the results are not worth 
much. I'm rerunning them now, stay tuned!

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 26.08.2011 17:18, Alexander Korotkov wrote:
> On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> Could you share the test scripts, patches and data sets etc. needed to
>> reproduce the tests you've been running? I'd like to try them out on a test
>> server.
>>
>
> 1) I've updated links to the datasets on the wiki page.
> 2) The script for index quality testing, fastbuild_test.php, is in the
> attachment. In order to run it you need PHP with the pdo and pdo_pgsql
> modules. The plantuner module is also required (it is used to force the
> planner to use a specific index). After running the script, the following
> query returns a relative score of index quality:
>
> select indexname, avg(count::real/(select count from test_result a2 where
> a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
> a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
> 'usnoa2' group by indexname;
>
> where 'usnoa2' is the table name and 'usnoa2_idx3' is the name of the index
> whose quality is taken as 1.
> 3) A patch which makes plantuner work with HEAD is also in the attachment.
> 4) A patch with my split algorithm implementation is attached. Its current
> form is appropriate only for testing purposes.
> 5) For index creation I use a simple script which is attached as
> 'indexes.sql'. I also run a similar script, with different index names,
> together with my split patch.
>
> Feel free to ask questions about all this stuff.

Thanks. Meanwhile, I hacked together my own set of test scripts, and let
them run over the weekend. I'm still running tests with ordered data,
but here are some preliminary results:

            testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
  points unordered auto       | 250000000 | 08:08:39.174956 |  3757848
  points unordered buffered   | 250000000 | 09:29:16.47012  |  4049832
  points unordered unbuffered | 250000000 | 03:48:10.999861 |  4564986

As you can see, the results are very disappointing :-(. The buffered
builds take a lot *longer* than unbuffered ones. I was expecting the
buffering to be very helpful at least in these unordered tests. On the
positive side, the buffering made index quality somewhat better
(accesses column, smaller is better), but that's not what we're aiming at.

What's going on here? This data set was large enough to not fit in RAM,
the table was about 8.5 GB in size (and I think the index is even larger
than that), and the box has 4GB of RAM. Does the buffering only help
with even larger indexes that exceed the cache size even more?


Test methodology
----------------

These tests consist of creating a gist index using the point datatype.

      Table "public.points"
   Column |  Type   | Modifiers
--------+---------+-----------
   x      | integer |
   y      | integer |

CREATE INDEX testindex ON points_ordered USING gist (point(x,y)) WITH
(buffering = 'on');

The points in the table are uniformly distributed. In the 'unordered'
tests, they are in random order. The ordered tests use the exact same
data, but sorted by x, y coordinates.

The 'accesses' column measures the quality of the produced index.
Smaller is better. It is calculated by performing a million queries on
the table, selecting points within a small square at evenly spaced
locations. Like:

(SELECT COUNT(*) FROM points WHERE point(x,y) <@ box(point(xx-20,
yy-20), point(xx+20, yy+20)));

The number of index pages touched by those queries is counted from
pg_statio_user_indexes, and that number is reported in the 'accesses'
column.

I've attached the whole script used. Pass the number of rows to use in
the test as argument, and the script does the rest.

--
    Heikki Linnakangas
    EnterpriseDB   http://www.enterprisedb.com


Attachment

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 30.08.2011 12:08, Heikki Linnakangas wrote:
> What's going on here? This data set was large enough to not fit in RAM,
> the table was about 8.5 GB in size (and I think the index is even larger
> than that), and the box has 4GB of RAM. Does the buffering only help
> with even larger indexes that exceed the cache size even more?

The tests are still running, so I decided to try oprofile. The build is 
in the final emptying phase, according to the log, and has been for over 
half an hour now. Oprofile output looks very interesting:

samples  %        image name               symbol name
228590   30.3045  postgres                 pg_qsort
200558   26.5882  postgres                 gistBuffersFreeBlocksCmp
49397     6.5486  postgres                 gistchoose
45563     6.0403  postgres                 gist_box_penalty
31425     4.1661  postgres                 AllocSetAlloc
24182     3.2058  postgres                 FunctionCall3Coll
22671     3.0055  postgres                 rt_box_union
20147     2.6709  postgres                 gistpenalty
17007     2.2546  postgres                 DirectFunctionCall2Coll
15790     2.0933  no-vmlinux               /no-vmlinux
14148     1.8756  postgres                 XLogInsert
10612     1.4068  postgres                 gistdentryinit
10542     1.3976  postgres                 MemoryContextAlloc
9466      1.2549  postgres                 FunctionCall1Coll
9190      1.2183  postgres                 gist_box_decompress
6681      0.8857  postgres                 med3
4941      0.6550  libc-2.12.so             isnanf

So, over 50% of the CPU time is spent in choosing a block from the 
temporary files. That should be pretty easy to improve..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
So, over 50% of the CPU time is spent in choosing a block from the temporary files. That should be pretty easy to improve..
Yes, probably we can just remove free blocks sorting.

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

Thanks. Meanwhile, I hacked together my own set of test scripts, and let them run over the weekend. I'm still running tests with ordered data, but here are some preliminary results:

          testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
 points unordered auto       | 250000000 | 08:08:39.174956 |  3757848
 points unordered buffered   | 250000000 | 09:29:16.47012  |  4049832
 points unordered unbuffered | 250000000 | 03:48:10.999861 |  4564986

As you can see, the results are very disappointing :-(. The buffered builds take a lot *longer* than unbuffered ones. I was expecting the buffering to be very helpful at least in these unordered tests. On the positive side, the buffering made index quality somewhat better (accesses column, smaller is better), but that's not what we're aiming at.

What's going on here? This data set was large enough to not fit in RAM, the table was about 8.5 GB in size (and I think the index is even larger than that), and the box has 4GB of RAM. Does the buffering only help with even larger indexes that exceed the cache size even more?
This seems pretty strange to me. The time of the unbuffered index build shows that there is no I/O bottleneck, which radically differs from my experiments. I'm going to try your test script on my test setup.
For now my only offhand guess is that the random function turns out to be not random enough, so the unordered dataset behaves like the ordered one. Can you rerun the tests on your setup with the dataset generated on the backend, like this?
CREATE TABLE points AS (SELECT point(random(), random()) FROM generate_series(1,10000000));

------
With best regards,
Alexander Korotkov.

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 30.08.2011 13:29, Alexander Korotkov wrote:
> On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> So, over 50% of the CPU time is spent in choosing a block from the
>> temporary files. That should be pretty easy to improve..
>>
> Yes, probably we can just remove free blocks sorting.

I'm re-running the tests with that change now. It seems like using the 
list of free blocks as a simple stack would be better anyway; it 
probably yields a better cache hit ratio when we re-use blocks that have 
just been released.
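
Roughly what I have in mind (a sketch with invented names and no error
handling, not actual patch code):

#include <stdlib.h>

/* Keep free temp-file blocks in LIFO order: the block released most
 * recently is reused first, which costs nothing (no qsort) and is more
 * likely to hit pages the OS still has cached. */
typedef struct FreeBlockStack
{
    long   *blocks;     /* block numbers of free pages in the temp file */
    int     nblocks;    /* current depth of the stack */
    int     capacity;   /* allocated size of the array */
} FreeBlockStack;

static void
pushFreeBlock(FreeBlockStack *stack, long blkno)
{
    if (stack->nblocks == stack->capacity)
    {
        stack->capacity = (stack->capacity > 0) ? stack->capacity * 2 : 16;
        stack->blocks = realloc(stack->blocks,
                                stack->capacity * sizeof(long));
    }
    stack->blocks[stack->nblocks++] = blkno;
}

static long
popFreeBlock(FreeBlockStack *stack)
{
    /* Returns -1 when empty; the caller then extends the temp file. */
    return (stack->nblocks > 0) ? stack->blocks[--stack->nblocks] : -1;
}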

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 30.08.2011 13:38, Alexander Korotkov wrote:
> On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>>
>> Thanks. Meanwhile, I hacked together my own set of test scripts, and let
>> them run over the weekend. I'm still running tests with ordered data, but
>> here are some preliminary results:
>>
>>            testname           |   nrows   |    duration     | accesses
>> -----------------------------+-----------+-----------------+----------
>>   points unordered auto       | 250000000 | 08:08:39.174956 |  3757848
>>   points unordered buffered   | 250000000 | 09:29:16.47012  |  4049832
>>   points unordered unbuffered | 250000000 | 03:48:10.999861 |  4564986
>>
>> As you can see, the results are very disappointing :-(. The buffered builds
>> take a lot *longer* than unbuffered ones. I was expecting the buffering to
>> be very helpful at least in these unordered tests. On the positive side, the
>> buffering made index quality somewhat better (accesses column, smaller is
>> better), but that's not what we're aiming at.
>>
>> What's going on here? This data set was large enough to not fit in RAM, the
>> table was about 8.5 GB in size (and I think the index is even larger than
>> that), and the box has 4GB of RAM. Does the buffering only help with even
>> larger indexes that exceed the cache size even more?
>>
> This seems pretty strange to me. The time of the unbuffered index build
> shows that there is no I/O bottleneck, which radically differs from my
> experiments. I'm going to try your test script on my test setup.
> For now my only offhand guess is that the random function turns out to be
> not random enough, so the unordered dataset behaves like the ordered one.

Oh. Doing a simple "SELECT * FROM points LIMIT 10", it looks pretty 
random to me. The data should be uniformly distributed in a rectangle 
from (0, 0) to (100000, 100000).

>  Can you
> rerun the tests on your setup with the dataset generated on the backend,
> like this?
> CREATE TABLE points AS (SELECT point(random(), random()) FROM
> generate_series(1,10000000));

Ok, I'll queue up that test after the ones I'm running now.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 30.08.2011 13:29, Alexander Korotkov wrote:
> On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> So, over 50% of the CPU time is spent in choosing a block from the
>> temporary files. That should be pretty easy to improve..
>>
> Yes, probably we can just remove free blocks sorting.

Ok, the first results are in for that:
         testname          |   nrows   |    duration     | accesses
---------------------------+-----------+-----------------+----------
 points unordered buffered | 250000000 | 06:00:23.707579 |  4049832
 
From the previous test runs, the unbuffered index build took under 4 
hours, so even though this is a lot better than with the sorting, it's 
still a loss compared to non-buffered build.

I had vmstat running during most of this index build. At a quick glance, 
it doesn't seem to be CPU bound anymore. I suspect the buffers in the 
temporary file get very fragmented. Or, we're reading it in backwards 
order because the buffers work in a LIFO fashion. The system seems to be 
doing about 5 MB/s of I/O during the build, which sounds like a figure 
you'd get for more or less random I/O.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Tue, Aug 30, 2011 at 9:29 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 30.08.2011 13:29, Alexander Korotkov wrote:
On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com>  wrote:

So, over 50% of the CPU time is spent in choosing a block from the
temporary files. That should be pretty easy to improve..

Yes, probably we can just remove free blocks sorting.

Ok, the first results are in for that:

        testname          |   nrows   |    duration     | accesses
---------------------------+-----------+-----------------+----------
 points unordered buffered | 250000000 | 06:00:23.707579 |  4049832

From the previous test runs, the unbuffered index build took under 4 hours, so even though this is a lot better than with the sorting, it's still a loss compared to non-buffered build.

I had vmstat running during most of this index build. At a quick glance, it doesn't seem to be CPU bound anymore. I suspect the buffers in the temporary file gets very fragmented. Or, we're reading it in backwards order because the buffers work in a LIFO fashion. The system seems to be doing about 5 MB/s of I/O during the build, which sounds like a figure you'd get for more or less random I/O.

So, we still have two questions:
1) Why doesn't the buffering build algorithm show any benefit in these tests?
2) Why do the results on your test setup differ from the results on mine?

I can propose the following answers now:
1) I think it's because of high overlap in the tree. As I mentioned before, high overlap can cause only a fraction of the tree to be used for actual inserts. For comparison, with my split algorithm (which produces almost no overlap on a uniform dataset) the buffering index build took 4 hours, while the regular build is still running (already more than 8 days = 192 hours)!
2) Probably it's because of different OS cache behaviour. For example, on my test setup the OS evicts unused pages from the cache rather slowly, so the buffering algorithm showed a benefit nevertheless.

Also, it seems I'm starting to understand the problem with the new linear splitting algorithm. On a dataset with 1M rows it produces almost no overlap, while on 10M rows it already produces significant overlap (drama!). Probably nobody tested it on large enough datasets (neither in the original research nor before commit). I'll dig into it in more detail and provide some testing results.

------
With best regards,
Alexander Korotkov. 

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 30.08.2011 13:38, Alexander Korotkov wrote:
> On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>>
>> Thanks. Meanwhile, I hacked together my own set of test scripts, and let
>> them run over the weekend. I'm still running tests with ordered data, but
>> here are some preliminary results:
>>
>>            testname           |   nrows   |    duration     | accesses
>> -----------------------------+-----------+-----------------+----------
>>   points unordered auto       | 250000000 | 08:08:39.174956 |  3757848
>>   points unordered buffered   | 250000000 | 09:29:16.47012  |  4049832
>>   points unordered unbuffered | 250000000 | 03:48:10.999861 |  4564986
>>
>> As you can see, the results are very disappointing :-(. The buffered builds
>> take a lot *longer* than unbuffered ones. I was expecting the buffering to
>> be very helpful at least in these unordered tests. On the positive side, the
>> buffering made index quality somewhat better (accesses column, smaller is
>> better), but that's not what we're aiming at.
>>
>> What's going on here? This data set was large enough to not fit in RAM, the
>> table was about 8.5 GB in size (and I think the index is even larger than
>> that), and the box has 4GB of RAM. Does the buffering only help with even
>> larger indexes that exceed the cache size even more?
>>
> This seems pretty strange to me. The time of the unbuffered index build
> shows that there is no I/O bottleneck, which radically differs from my
> experiments. I'm going to try your test script on my test setup.
> For now my only offhand guess is that the random function turns out to be
> not random enough, so the unordered dataset behaves like the ordered one.
> Can you rerun the tests on your setup with the dataset generated on the
> backend, like this?
> CREATE TABLE points AS (SELECT point(random(), random()) FROM
> generate_series(1,10000000));

So I changed the test script to generate the table as:

CREATE TABLE points AS SELECT random() as x, random() as y FROM 
generate_series(1, $NROWS);

The unordered results are in:
          testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
 points unordered buffered   | 250000000 | 05:56:58.575789 |  2241050
 points unordered auto       | 250000000 | 05:34:12.187479 |  2246420
 points unordered unbuffered | 250000000 | 04:38:48.663952 |  2244228
 

Although the buffered build doesn't lose as badly as it did with more 
overlap, it still doesn't look good :-(. Any ideas?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
So I changed the test script to generate the table as:

CREATE TABLE points AS SELECT random() as x, random() as y FROM generate_series(1, $NROWS);

The unordered results are in:

         testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
 points unordered buffered   | 250000000 | 05:56:58.575789 |  2241050
 points unordered auto       | 250000000 | 05:34:12.187479 |  2246420
 points unordered unbuffered | 250000000 | 04:38:48.663952 |  2244228

Although the buffered build doesn't lose as badly as it did with more overlap, it still doesn't look good :-(. Any ideas?

But that's still a lot of overlap: about 220 accesses per small-area query, which is roughly 10-20 times more than there should be without overlap. If we roughly assume that 10 times more overlap means only 1/10 of the tree is used for actual inserts, then that part of the tree can easily fit into the cache.
You could try my splitting algorithm on your test setup (in this case I advise starting with a smaller number of rows, 100M for example).
I'm asking Oleg for real-life datasets which cause trouble in practice. Probably those datasets are even larger, or new linear split produces less overlap on them.
 
------
With best regards,
Alexander Korotkov. 

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 01.09.2011 12:23, Alexander Korotkov wrote:
> On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> So I changed the test script to generate the table as:
>>
>> CREATE TABLE points AS SELECT random() as x, random() as y FROM
>> generate_series(1, $NROWS);
>>
>> The unordered results are in:
>>
>>           testname           |   nrows   |    duration     | accesses
>> -----------------------------+-----------+-----------------+----------
>>   points unordered buffered   | 250000000 | 05:56:58.575789 |  2241050
>>   points unordered auto       | 250000000 | 05:34:12.187479 |  2246420
>>   points unordered unbuffered | 250000000 | 04:38:48.663952 |  2244228
>>
>> Although the buffered build doesn't lose as badly as it did with more
>> overlap, it still doesn't look good :-(. Any ideas?
>
>
> But that's still a lot of overlap: about 220 accesses per small-area query,
> which is roughly 10-20 times more than there should be without overlap.

Hmm, those "accesses" numbers are actually quite bogus for this test. I 
changed the creation of the table as you suggested, so that all x and y 
values are in the range 0.0 - 1.0, but I didn't change the loop to 
calculate those accesses, so it still queried for boxes in the range 0 - 
100000. That makes me wonder, why does it need 220 accesses on average 
to satisfy queries most of which lie completely outside the range of 
actual values in the index? I would expect such queries to just look at 
the root node, conclude that there can't be any matching tuples, and 
return immediately.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 01.09.2011 12:23, Alexander Korotkov wrote:
> On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> So I changed the test script to generate the table as:
>>
>> CREATE TABLE points AS SELECT random() as x, random() as y FROM
>> generate_series(1, $NROWS);
>>
>> The unordered results are in:
>>
>>           testname           |   nrows   |    duration     | accesses
>> -----------------------------+-----------+-----------------+----------
>>   points unordered buffered   | 250000000 | 05:56:58.575789 |  2241050
>>   points unordered auto       | 250000000 | 05:34:12.187479 |  2246420
>>   points unordered unbuffered | 250000000 | 04:38:48.663952 |  2244228
>>
>> Although the buffered build doesn't lose as badly as it did with more
>> overlap, it still doesn't look good :-(. Any ideas?
>
> But that's still a lot of overlap: about 220 accesses per small-area query,
> which is roughly 10-20 times more than there should be without overlap. If
> we roughly assume that 10 times more overlap means only 1/10 of the tree is
> used for actual inserts, then that part of the tree can easily fit into the
> cache.
> You could try my splitting algorithm on your test setup (in this case I
> advise starting with a smaller number of rows, 100M for example).
> I'm asking Oleg for real-life datasets which cause trouble in practice.
> Probably those datasets are even larger, or new linear split produces less
> overlap on them.

I made a small tweak to the patch, and got much better results (this is
with my original method of generating the data):

           testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
  points unordered buffered   | 250000000 | 03:34:23.488275 |  3945486
  points unordered auto       | 250000000 | 02:55:10.248722 |  3767548
  points unordered unbuffered | 250000000 | 04:02:04.168138 |  4564986

The tweak I made was to the way buffers are emptied in the final
emptying phase. Previously, it repeatedly looped through all the buffers
at a level, until there were no more non-empty buffers at the level.
When a buffer was split while it was being emptied, processing that
buffer stopped, and the emptying process moved on to the next buffer. I
changed it so that when a buffer splits, we continue emptying that
buffer until it's completely empty. That behavior is much more
cache-friendly, which shows as much better overall performance.
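
In rough pseudocode, the emptying pass now works like this (all types and
helper names here are invented stand-ins for the patch's real ones; see the
attached patch for the actual code):

typedef struct NodeBuffer NodeBuffer;   /* one buffer per internal page */

extern int  numBuffersOnLevel(void);
extern NodeBuffer *getBufferOnLevel(int i);
extern int  bufferIsEmpty(NodeBuffer *buf);
extern void moveOneTupleDown(NodeBuffer *buf);  /* may split index pages */

static void
emptyBuffersOnLevel(void)
{
    int     foundTuples = 1;

    while (foundTuples)
    {
        foundTuples = 0;
        for (int i = 0; i < numBuffersOnLevel(); i++)
        {
            NodeBuffer *buf = getBufferOnLevel(i);

            /* Drain this buffer completely, even if the index page
             * under it splits mid-emptying. Stopping at the first
             * split and moving on (the old behavior) throws away the
             * cache locality we just built up. */
            while (!bufferIsEmpty(buf))
            {
                moveOneTupleDown(buf);
                foundTuples = 1;
            }
        }
    }
}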

I probably changed that behavior for the worse in my previous rounds of
cleanup. Anyway, attached is the patch I used to get the above numbers.
Now that the performance problem is fixed, I'll continue reviewing and
cleaning up the patch.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 05.09.2011 14:10, Heikki Linnakangas wrote:
> On 01.09.2011 12:23, Alexander Korotkov wrote:
>> On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
>> heikki.linnakangas@enterprisedb.com> wrote:
>>
>>> So I changed the test script to generate the table as:
>>>
>>> CREATE TABLE points AS SELECT random() as x, random() as y FROM
>>> generate_series(1, $NROWS);
>>>
>>> The unordered results are in:
>>>
>>> testname | nrows | duration | accesses
>>> -----------------------------+-----------+-----------------+----------
>>>
>>> points unordered buffered | 250000000 | 05:56:58.575789 | 2241050
>>> points unordered auto | 250000000 | 05:34:12.187479 | 2246420
>>> points unordered unbuffered | 250000000 | 04:38:48.663952 | 2244228
>>>
>>> Although the buffered build doesn't lose as badly as it did with more
>>> overlap, it still doesn't look good :-(. Any ideas?
>>
>> But that's still a lot of overlap: about 220 accesses per small-area
>> query, which is roughly 10-20 times more than there should be without
>> overlap.
>> If we roughly assume that 10 times more overlap means only 1/10 of the
>> tree is used for actual inserts, then that part of the tree can easily
>> fit into the cache.
>> You could try my splitting algorithm on your test setup (in this case I
>> advise
>> starting with a smaller number of rows, 100M for example).
>> I'm asking Oleg for real-life datasets which cause trouble in practice.
>> Probably those datasets are even larger, or new linear split produces
>> less overlap on them.
>
> I made a small tweak to the patch, and got much better results (this is
> with my original method of generating the data):
>
> testname | nrows | duration | accesses
> -----------------------------+-----------+-----------------+----------
> points unordered buffered | 250000000 | 03:34:23.488275 | 3945486
> points unordered auto | 250000000 | 02:55:10.248722 | 3767548
> points unordered unbuffered | 250000000 | 04:02:04.168138 | 4564986

The full results of this test are in:
          testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
 points unordered buffered   | 250000000 | 03:34:23.488275 |  3945486
 points unordered auto       | 250000000 | 02:55:10.248722 |  3767548
 points unordered unbuffered | 250000000 | 04:02:04.168138 |  4564986
 points ordered buffered     | 250000000 | 02:00:10.467914 |  5572906
 points ordered auto         | 250000000 | 02:16:01.859039 |  5435673
 points ordered unbuffered   | 250000000 | 03:23:18.061742 |  1875826
 
(6 rows)

Interestingly, in this test case the buffered build was significantly 
faster even in the case of ordered input - but the quality of the 
produced index was much worse. I suspect it's because of the 
last-in-first-out nature of the buffering: tuples that were pushed into 
buffers first are flushed to lower levels last. Tweaking the data 
structures to make the buffer flushing a FIFO process might help with 
that, but I'm afraid that might make our cache hit ratio worse when 
reading pages from the temporary file.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum doesn't exceed maximal offset number. 

------
With best regards,
Alexander Korotkov. 
Attachment

Re: WIP: Fast GiST index build

From
Stefan Keller
Date:
Hi,

Unlogged tables seem to me to pursue a similar goal. Obviously GiST
indexes are not supported there.
Do you know the technical reason?
Do you see some synergy in your work on fast GiST index building and
unlogged tables?

Yours, Stefan

2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:
> Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
> doesn't exceed maximal offset number.
> ------
> With best regards,
> Alexander Korotkov.
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Hi!
 
Unlogged tables seem to me to pursue a similar goal. Obviously GiST
indexes are not supported there.
Do you know the technical reason?
GiST uses serial numbers of operations for concurrency. In the current implementation, xlog record IDs are used in the capacity of those numbers. For an unlogged table no xlog records are produced, so we have no serial numbers of operations. AFAIK, it's enough to provide some other source of serial numbers in order to make GiST work with unlogged tables.
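
Something along these lines might be enough (a sketch with invented names,
just to show the idea; a real version would keep the counter in shared
memory under a spinlock or atomic increment):

#include <stdint.h>

/* Hypothetical WAL-independent source of GiST operation sequence
 * numbers for unlogged tables; none of these names exist in PostgreSQL.
 * The only property GiST needs is that the values increase
 * monotonically, the way XLOG record pointers do for logged
 * operations. */
static uint64_t unloggedGistNSN = 0;

static uint64_t
GetUnloggedNSN(void)
{
    return ++unloggedGistNSN;   /* locking omitted in this sketch */
}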
 
Do you see some synergy in your work on fast GiST index building and
unlogged tables?
With the technique discussed in this thread, a GiST build on an unlogged table would benefit just as much as a regular build does.
 
------
With best regards,
Alexander Korotkov. 

Re: WIP: Fast GiST index build

From
Heikki Linnakangas
Date:
On 06.09.2011 01:18, Alexander Korotkov wrote:
> Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
> doesn't exceed maximal offset number.

I've committed the patch now, including that fix. Thanks for a great 
GSoC project!

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Robert Haas
Date:
On Thu, Sep 8, 2011 at 10:59 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 06.09.2011 01:18, Alexander Korotkov wrote:
>>
>> Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
>> doesn't exceed maximal offset number.
>
> I've committed the patch now, including that fix. Thanks for a great GSoC
> project!

Wow, major congrats, Alexander!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Fast GiST index build - further improvements

From
Heikki Linnakangas
Date:
Now that the main GiST index build patch has been committed, there's a 
few further improvements that could make it much faster still:

Better management of the buffer pages on disk. At the moment, the 
temporary file is used as a heap of pages belonging to all the buffers 
in random order. I think we could speed up the algorithm considerably by 
reading/writing the buffer pages sequentially. For example, when an 
internal page is split, and all the tuples in its buffer are relocated, 
that would be a great chance to write the new pages of the new buffers 
in sequential order, instead of writing them back to the pages freed up 
by the original buffer, which can be spread all around the temp file. I 
wonder if we could use a separate file for each buffer? Or at least, a 
separate file for all buffers that are larger than, say 100 MB in size.

Better management of in-memory buffer pages. When we start emptying a 
buffer, we currently flush all the buffer pages in memory to the 
temporary file, to make room for new buffer pages. But that's a waste of 
time, if some of the pages we had in memory belonged to the buffer we're 
about to empty next, or that we empty tuples to. Also, if while emptying 
a buffer, all the tuples go to just one or two lower level buffers, it 
would be beneficial to keep more than one page in-memory for those buffers.

Buffering leaf pages. This I posted on a separate thread already: 
http://archives.postgresql.org/message-id/4E5350DB.3060209@enterprisedb.com


Also, at the moment there is one issue with the algorithm that we have 
glossed over this far: For each buffer, we keep some information in 
memory, in the hash table, and in the auxiliary lists. That means that 
the amount of memory needed for the build scales with the size of the 
index. If you're dealing with very large indexes, hopefully you also 
have a lot of RAM in your system, so I don't think this is a problem in 
practice. Still, it would be nice to do something about that. A 
straightforward idea would be to swap some of the information to disk. 
Another idea, simpler to implement, would be to completely destroy 
a buffer, freeing all the memory it uses, when it becomes completely 
empty. Then, if you're about to run out of memory (as defined by 
maintenance_work_mem), you can empty some low level buffers to disk to 
free up some.
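
A minimal sketch of that last idea, with invented buffer fields (the real
bookkeeping would be more involved):

#include "postgres.h"
#include "utils/hsearch.h"

/* NodeBuffer and its fields are invented for illustration. */
typedef struct NodeBuffer
{
    long    nodeBlkno;      /* hash key: block number of the index page */
    int     ntuples;        /* tuples currently queued in this buffer */
    void   *pages;          /* in-memory pages belonging to the buffer */
} NodeBuffer;

/* Free everything a buffer holds as soon as it becomes empty, so that
 * memory use stops scaling with the number of buffers ever created. */
static void
maybeDestroyBuffer(HTAB *bufferHash, NodeBuffer *buf)
{
    if (buf->ntuples == 0)
    {
        pfree(buf->pages);                      /* in-memory page list */
        hash_search(bufferHash, &buf->nodeBlkno,
                    HASH_REMOVE, NULL);         /* drop the hash entry */
    }
}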

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: Fast GiST index build

From
Oleg Bartunov
Date:
My congratulations too, Alexander! Hope to work on SP-GiST together!

Oleg

On Thu, 8 Sep 2011, Heikki Linnakangas wrote:

> On 06.09.2011 01:18, Alexander Korotkov wrote:
>> Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
>> doesn't exceed maximal offset number.
>
> I've committed the patch now, including that fix. Thanks for a great GSoC 
> project!
>
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: WIP: Fast GiST index build

From
Alexander Korotkov
Date:
Thanks for the congratulations!
Thanks to Heikki for mentoring and for his work on the patch!

------
With best regards,
Alexander Korotkov. 

Re: WIP: Fast GiST index build

From
Stefan Keller
Date:
Robert,

2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:
> GiST uses serial numbers of operations for concurrency. In the current
> implementation, xlog record IDs are used in the capacity of those numbers.
> For an unlogged table no xlog records are produced, so we have no serial
> numbers of operations. AFAIK, it's enough to provide some other source of
> serial numbers in order to make GiST work with unlogged tables.

GiST is IMHO quite broadly used. I use it, for example, for indexing
geometry and hstore types, and there's no other choice there.
Do you know whether the unlogged option in CREATE TABLE will support GiST
in the next release?

Stefan


Re: WIP: Fast GiST index build

From
Stefan Keller
Date:
I'm about to open a ticket for hash indexes (adding WAL support) anyway:
may I open a ticket for adding GiST support for unlogged tables?

Stefan

2011/9/14 Stefan Keller <sfkeller@gmail.com>:
> Robert,
>
> 2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:
>> GiST uses serial numbers of operations for concurrency. In the current
>> implementation, xlog record IDs are used in the capacity of those numbers.
>> For an unlogged table no xlog records are produced, so we have no serial
>> numbers of operations. AFAIK, it's enough to provide some other source of
>> serial numbers in order to make GiST work with unlogged tables.
>
> GiST is IMHO quite broadly used. I use it, for example, for indexing
> geometry and hstore types, and there's no other choice there.
> Do you know whether the unlogged option in CREATE TABLE will support GiST
> in the next release?
>
> Stefan
>


Re: WIP: Fast GiST index build

From
Robert Haas
Date:
On Tue, Sep 13, 2011 at 5:00 PM, Stefan Keller <sfkeller@gmail.com> wrote:
> 2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:
>> GiST uses serial numbers of operations for concurrency. In the current
>> implementation, xlog record IDs are used in the capacity of those numbers.
>> For an unlogged table no xlog records are produced, so we have no serial
>> numbers of operations. AFAIK, it's enough to provide some other source of
>> serial numbers in order to make GiST work with unlogged tables.
>
> GiST is IMHO quite broadly used. I use it, for example, for indexing
> geometry and hstore types, and there's no other choice there.
> Do you know whether the unlogged option in CREATE TABLE will support GiST
> in the next release?

It's probably not a difficult patch to write, but I don't have any
current plans to work on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Fast GiST index build - further improvements

From
Alexander Korotkov
Date:
On Thu, Sep 8, 2011 at 8:35 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Now that the main GiST index build patch has been committed, there's a few further improvements that could make it much faster still:

Better management of the buffer pages on disk. At the moment, the temporary file is used as a heap of pages belonging to all the buffers in random order. I think we could speed up the algorithm considerably by reading/writing the buffer pages sequentially. For example, when an internal page is split, and all the tuples in its buffer are relocated, that would be a great chance to write the new pages of the new buffers in sequential order, instead of writing them back to the pages freed up by the original buffer, which can be spread all around the temp file. I wonder if we could use a separate file for each buffer? Or at least, a separate file for all buffers that are larger than, say 100 MB in size.

Better management of in-memory buffer pages. When we start emptying a buffer, we currently flush all the buffer pages in memory to the temporary file, to make room for new buffer pages. But that's a waste of time, if some of the pages we had in memory belonged to the buffer we're about to empty next, or that we empty tuples to. Also, if while emptying a buffer, all the tuples go to just one or two lower level buffers, it would be beneficial to keep more than one page in-memory for those buffers.

Buffering leaf pages. This I posted on a separate thread already: http://archives.postgresql.org/message-id/4E5350DB.3060209@enterprisedb.com


Also, at the moment there is one issue with the algorithm that we have glossed over this far: For each buffer, we keep some information in memory, in the hash table, and in the auxiliary lists. That means that the amount of memory needed for the build scales with the size of the index. If you're dealing with very large indexes, hopefully you also have a lot of RAM in your system, so I don't think this is a problem in practice. Still, it would be nice to do something about that. A straightforward idea would be to swap some of the information to disk. Another idea, simpler to implement, would be to completely destroy a buffer, freeing all the memory it uses, when it becomes completely empty. Then, if you're about to run out of memory (as defined by maintenance_work_mem), you can empty some low level buffers to disk to free up some.
Unfortunately, I haven't had enough time to implement any of this before the 9.2 release. Work on my PhD thesis and web projects takes too much time.

But I think there is one thing we should fix before the 9.2 release. We assume that the GiST index build has at least effective_cache_size/4 of cache available. That assumption can easily be false on a highly concurrent system. I don't see a way to make a convincing estimate here, but we could document this behaviour, so that users can tune effective_cache_size for GiST index builds under high concurrency.

------
With best regards,
Alexander Korotkov.