Thread: Google Summer of Code: question about GiST API advancement project

Google Summer of Code: question about GiST API advancement project

From
GUO Rui
Date:
Hi, hackers,

I'm Rui Guo, a PhD student focusing on database at the University of California, Irvine. I'm interested in the "GiST API advancement" project for the Google Summer of Code 2019 which is listed at https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29 .

I'm still reading about RR*-tree, GiST and the PostgreSQL source code to have a better idea on my proposal. Meanwhile, I have a very basic and simple question:

Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are heuristic and somehow greedy (e.g. pick the MBB that needs to enlarge the least), is it possible to apply machine learning algorithm to improve it? The only related reference I got is to use deep learning in database join operation (
https://arxiv.org/abs/1808.03196). Is it not suitable to use machine learning here or someone already did?

Thanks,
Rui Guo

Re: Google Summer of Code: question about GiST API advancementproject

From
Andrey Borodin
Date:
Hi!

> 31 марта 2019 г., в 14:58, GUO Rui <ruig2@uci.edu> написал(а):
>
> I'm Rui Guo, a PhD student focusing on database at the University of California, Irvine. I'm interested in the "GiST
APIadvancement" project for the Google Summer of Code 2019 which is listed at
https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29. 
>
> I'm still reading about RR*-tree, GiST and the PostgreSQL source code to have a better idea on my proposal.
Meanwhile,I have a very basic and simple question: 
>
> Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are heuristic and somehow greedy (e.g. pick the MBB
thatneeds to enlarge the least), is it possible to apply machine learning algorithm to improve it? The only related
referenceI got is to use deep learning in database join operation (https://arxiv.org/abs/1808.03196). Is it not
suitableto use machine learning here or someone already did? 

If you are interested in ML and DBs you should definitely look into [0]. You do not have to base your proposal on
mentorideas, you can use your own. Implementing learned indexes - seems reasonable. 

RR*-tree algorithms are heuristic in some specific parts, but in general they are designed to optimize very clear
metrics.Generally, ML algorithms tend to compose much bigger pile of heuristics and solve less mathematically clear
tasksthan splitting subtrees or choosing subtree for insertion. 
R*-tree algorithms are heuristic only to be faster.

Best regards, Andrey Borodin.

[0] https://arxiv.org/pdf/1712.01208.pdf


Dear Andrey Borodin,

I discussed the above topic with the professors at my school, and got the following points:

1. The case that the volume of an MBB is 0 should be very rare, and if the data is skewed (e.g. only a few nodes have non-NULL value on a dimension) then the data can be pre-proceeded and normalized before it goes to the database, thus the storage and query can be much faster;
2. The performance of the R-tree family may depend on the specific data set and even the order of the data insertions, so one algorithm may be better on one dataset and slower on another, thus the benchmark should include different datasets;

I totally agree that by adopting the RR*-tree algorithm we can improve the performance of PostgreSQL. For my proposal, I'll:
1. Document the benchmarks I found available online (e.g. 
https://github.com/ambling/rtree-benchmark), and then state how we'd like to generate data ourselves (e.g. data with a Gaussian distribution, or the same dataset but different insertion order...) to test with for a wilder coverage;
2. Create tools to generate a report on current PostgreSQL performance with the benchmark;
3. Plan to improve the R-tree and GiST part of PostgreSQL. For the discussion in the email thread https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail.gmail.com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg@mail.gmail.com I prefer to do a scale-based trick rather than using bits in a float or creating a new struct;
4. Generate a performance report on PostgreSQL with the above R-tree patch;
The following would be marked as optional:
5. Optimize GiST with New APIs (e.g. non-penalty-based choose subtree function, also discussed in the above email thread);
6. For skewed data, try to warn the user, and then suggest methods to cook the data (e.g. the normalization algorithms in ML); pre-proceeding the data should not be the duty of the database;
7. Other advanced features of RR*-tree and GiST bulk loading;

Any comments or feedback on the above ideas? I'll work on a draft proposal ASAP.

Many thanks,
Rui Guo




On Sun, Mar 31, 2019 at 10:53 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Hi!

> 31 марта 2019 г., в 14:58, GUO Rui <ruig2@uci.edu> написал(а):
>
> I'm Rui Guo, a PhD student focusing on database at the University of California, Irvine. I'm interested in the "GiST API advancement" project for the Google Summer of Code 2019 which is listed at https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29 .
>
> I'm still reading about RR*-tree, GiST and the PostgreSQL source code to have a better idea on my proposal. Meanwhile, I have a very basic and simple question:
>
> Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are heuristic and somehow greedy (e.g. pick the MBB that needs to enlarge the least), is it possible to apply machine learning algorithm to improve it? The only related reference I got is to use deep learning in database join operation (https://arxiv.org/abs/1808.03196). Is it not suitable to use machine learning here or someone already did?

If you are interested in ML and DBs you should definitely look into [0]. You do not have to base your proposal on mentor ideas, you can use your own. Implementing learned indexes - seems reasonable.

RR*-tree algorithms are heuristic in some specific parts, but in general they are designed to optimize very clear metrics. Generally, ML algorithms tend to compose much bigger pile of heuristics and solve less mathematically clear tasks than splitting subtrees or choosing subtree for insertion.
R*-tree algorithms are heuristic only to be faster.

Best regards, Andrey Borodin.

[0] https://arxiv.org/pdf/1712.01208.pdf
I drafted my proposal about the above topic at https://docs.google.com/document/d/1X7Lw-c0rLYuSjwLNfw6qXpN5Cf1_0u2gXtgEgLkNezA/edit?usp=sharing  . Looking forward to your feedback.

On Wed, Apr 3, 2019 at 10:55 PM GUO Rui <ruig2@uci.edu> wrote:
Dear Andrey Borodin,

I discussed the above topic with the professors at my school, and got the following points:

1. The case that the volume of an MBB is 0 should be very rare, and if the data is skewed (e.g. only a few nodes have non-NULL value on a dimension) then the data can be pre-proceeded and normalized before it goes to the database, thus the storage and query can be much faster;
2. The performance of the R-tree family may depend on the specific data set and even the order of the data insertions, so one algorithm may be better on one dataset and slower on another, thus the benchmark should include different datasets;

I totally agree that by adopting the RR*-tree algorithm we can improve the performance of PostgreSQL. For my proposal, I'll:
1. Document the benchmarks I found available online (e.g. 
https://github.com/ambling/rtree-benchmark), and then state how we'd like to generate data ourselves (e.g. data with a Gaussian distribution, or the same dataset but different insertion order...) to test with for a wilder coverage;
2. Create tools to generate a report on current PostgreSQL performance with the benchmark;
3. Plan to improve the R-tree and GiST part of PostgreSQL. For the discussion in the email thread https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail.gmail.com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg@mail.gmail.com I prefer to do a scale-based trick rather than using bits in a float or creating a new struct;
4. Generate a performance report on PostgreSQL with the above R-tree patch;
The following would be marked as optional:
5. Optimize GiST with New APIs (e.g. non-penalty-based choose subtree function, also discussed in the above email thread);
6. For skewed data, try to warn the user, and then suggest methods to cook the data (e.g. the normalization algorithms in ML); pre-proceeding the data should not be the duty of the database;
7. Other advanced features of RR*-tree and GiST bulk loading;

Any comments or feedback on the above ideas? I'll work on a draft proposal ASAP.

Many thanks,
Rui Guo




On Sun, Mar 31, 2019 at 10:53 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Hi!

> 31 марта 2019 г., в 14:58, GUO Rui <ruig2@uci.edu> написал(а):
>
> I'm Rui Guo, a PhD student focusing on database at the University of California, Irvine. I'm interested in the "GiST API advancement" project for the Google Summer of Code 2019 which is listed at https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29 .
>
> I'm still reading about RR*-tree, GiST and the PostgreSQL source code to have a better idea on my proposal. Meanwhile, I have a very basic and simple question:
>
> Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are heuristic and somehow greedy (e.g. pick the MBB that needs to enlarge the least), is it possible to apply machine learning algorithm to improve it? The only related reference I got is to use deep learning in database join operation (https://arxiv.org/abs/1808.03196). Is it not suitable to use machine learning here or someone already did?

If you are interested in ML and DBs you should definitely look into [0]. You do not have to base your proposal on mentor ideas, you can use your own. Implementing learned indexes - seems reasonable.

RR*-tree algorithms are heuristic in some specific parts, but in general they are designed to optimize very clear metrics. Generally, ML algorithms tend to compose much bigger pile of heuristics and solve less mathematically clear tasks than splitting subtrees or choosing subtree for insertion.
R*-tree algorithms are heuristic only to be faster.

Best regards, Andrey Borodin.

[0] https://arxiv.org/pdf/1712.01208.pdf

Re: Google Summer of Code: question about GiST API advancementproject

From
Andrey Borodin
Date:
Hi!

> 5 апр. 2019 г., в 14:07, GUO Rui <ruig2@uci.edu> написал(а):
>
> I drafted my proposal about the above topic at
https://docs.google.com/document/d/1X7Lw-c0rLYuSjwLNfw6qXpN5Cf1_0u2gXtgEgLkNezA/edit?usp=sharing . Looking forward to
yourfeedback. 
I'd recommend planning some time to review other patches. If you plan to post your patches to commitfest, you are
expectedto review work some work of comparable complexity of others. 

> 1. The case that the volume of an MBB is 0 should be very rare
Nope, if you index data points tend to cluster on grid-alligned planes (which happens with geo data, all points have
sameheight above ocean) if always have zero volume of MBB. 

BTW look at PostGIS, they are main GiST users, so indexing their data is important.

Best regards, Andrey Borodin.




I added more details about GiST use cases (PostGIS and ScalaGiST) in my proposal and created one more entry for reviewing other patches in the time table.

I'll try to polish the proposal in the remaining three days to the GSoC deadline. I don't think I have much time to modify PostgreSQL source code and test against it myself though ;(. Many thanks to your feedback.

On Fri, Apr 5, 2019 at 3:13 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Hi!

> 5 апр. 2019 г., в 14:07, GUO Rui <ruig2@uci.edu> написал(а):
>
> I drafted my proposal about the above topic at https://docs.google.com/document/d/1X7Lw-c0rLYuSjwLNfw6qXpN5Cf1_0u2gXtgEgLkNezA/edit?usp=sharing  . Looking forward to your feedback.
I'd recommend planning some time to review other patches. If you plan to post your patches to commitfest, you are expected to review work some work of comparable complexity of others.

> 1. The case that the volume of an MBB is 0 should be very rare
Nope, if you index data points tend to cluster on grid-alligned planes (which happens with geo data, all points have same height above ocean) if always have zero volume of MBB.

BTW look at PostGIS, they are main GiST users, so indexing their data is important.

Best regards, Andrey Borodin.

Re: Google Summer of Code: question about GiST API advancementproject

From
Andrey Borodin
Date:
Hi!

> 6 апр. 2019 г., в 6:38, GUO Rui <ruig2@uci.edu> написал(а):
>
> I added more details about GiST use cases (PostGIS and ScalaGiST) in my proposal and created one more entry for
reviewingother patches in the time table. 

ScalaGiST is quite distant project. It is not PostgreSQL part, it is MR subsystem.

Best regards, Andrey Borodin.


Yes, it is a different project, and we cannot run it on top of PostgreSQL directly.

Maybe we can learn from it by:
1. Study its benchmark. The benchmark used is YCSB [1], and maybe we can generate data and run queries in a similar way as YCSB in our project. YCSB already discussed the order to insert data and the distribution models of data.
2. Study how they use R-tree and GiST in their system, and perhaps we can be inspired and be able to borrow some ideas when we implement our patch. 

I'm still learning the codebase of PostgreSQL and how R-tree/GiST can be implemented. Does that make sense?

[1] Cooper, Brian F., et al. "Benchmarking cloud serving systems with YCSB." Proceedings of the 1st ACM symposium on Cloud computing. ACM, 2010.

On Fri, Apr 5, 2019 at 10:44 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Hi!

> 6 апр. 2019 г., в 6:38, GUO Rui <ruig2@uci.edu> написал(а):
>
> I added more details about GiST use cases (PostGIS and ScalaGiST) in my proposal and created one more entry for reviewing other patches in the time table.

ScalaGiST is quite distant project. It is not PostgreSQL part, it is MR subsystem.

Best regards, Andrey Borodin.