Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google) - Mailing list pgsql-hackers

From Stefan Keller
Subject Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
Date
Msg-id CAFcOn28TpR4oydF1NdMEuKFVVRiRwAwG4dGunu2kVxHZw3ViEA@mail.gmail.com
In response to Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)  (Oleg Ivanov <o.ivanov@postgrespro.ru>)
Responses Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)  (Andrey Borodin <x4mmm@yandex-team.ru>)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)  (Oleg Bartunov <obartunov@postgrespro.ru>)
List pgsql-hackers
Dear Olegs, dear Nikolay, dear all

Allow me to revive this thread:

Are there any advances in a learned index for PostgreSQL?

Background: I'm trying to benchmark those experimental indices. For
this I did some bibliography work (see below). Fun fact: it is not
only Postgres people who love high-proof drinks, sorry, high-proof
index names: see "From WiscKey to Bourbon" :-).

My main discovery is a discussion report - which included Stonebraker
- entitled "ML-In-Databases: Assessment and Prognosis" [1]. The first
two takeaways are "#1: The initial comparisons of learned indices with
optimized traditional indices should be further expanded to include
concurrency control and multi-user settings. (...) #2: A key benefit
of learned indices may come from the learned structures requiring
lower space utilization, rather than a reduction in search time."
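To make takeaway #2 concrete, here is a toy sketch of my own (in Python, not taken from any of the implementations below, and the class name is mine): a learned index in miniature. One linear model approximates the CDF of the sorted keys, and a lookup predicts a position and binary-searches only inside the model's maximum error window. Note that the whole "index" is two floats plus an error bound, which is exactly the space argument.

```python
import bisect
import math

# Toy learned index: a single least-squares line approximating the CDF
# of the sorted keys, with a worst-case error bound for the search window.
class LinearLearnedIndex:
    def __init__(self, keys):
        self.keys = keys  # must be sorted
        n = len(keys)
        # least-squares fit of position i ~ slope * key + intercept
        mean_k = sum(keys) / n
        mean_i = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in keys)
        cov = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_i - self.slope * mean_k
        # worst-case prediction error bounds the binary search window
        self.err = math.ceil(max(abs(self._predict(k) - i)
                                 for i, k in enumerate(keys)))

    def _predict(self, key):
        return self.slope * key + self.intercept

    def lookup(self, key):
        p = round(self._predict(key))
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        j = lo + bisect.bisect_left(self.keys[lo:hi], key)
        if j < len(self.keys) and self.keys[j] == key:
            return j
        return None

idx = LinearLearnedIndex(list(range(0, 1000, 2)))
print(idx.lookup(500))  # -> 250
print(idx.lookup(501))  # -> None (absent key)
```

On a uniform key distribution like this one the model is near-perfect and the search window is tiny; the real systems below (RMI, RadixSpline, ALEX) differ mainly in how they keep that error bound small on messy distributions.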

Yours, Stefan


[1] Kraska, T., Minhas, U. F., Neumann, T., Papaemmanouil, O., Patel,
J. M., Ré, C., & Stonebraker, M. (2021). ML-In-Databases: Assessment
and Prognosis. Data Engineering, 3. Web access:
https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf


Bibliography (without any claim for completeness):

Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018,
May). The case for learned index structures. In Proceedings of the
2018 International Conference on Management of Data (pp. 489-504). Web
access: https://dl.acm.org/doi/pdf/10.1145/3183713.3196909

"Recursive model index, a learned index structure" (based on Kraska et
al. 2017? 2018). Source code: https://github.com/BenJoyenConseil/rmi
(Go language), https://github.com/learnedsystems/RMI (Rust)

Kaur, T. (2018). Learned Index Structures: Practical Implementations
and Future Directions. Master Thesis. Web access:
https://wwwiti.cs.uni-magdeburg.de/iti_db/publikationen/ps/auto/kaur2018learned.pdf
 (claims to be an open-source C++ implementation, but no code was found)

Kipf, A., Marcus, R., van Renen, A., Stoian, M., Kemper, A., Kraska,
T., & Neumann, T. (2020, June). RadixSpline: a single-pass learned
index. In Proceedings of the Third International Workshop on
Exploiting Artificial Intelligence Techniques for Data Management (pp.
1-5). Web access: https://dl.acm.org/doi/10.1145/3401071.3401659.
Source code: https://github.com/learnedsystems/RadixSpline (C++).

Dai, Y., Xu, Y., Ganesan, A., Alagappan, R., Kroth, B.,
Arpaci-Dusseau, A., & Arpaci-Dusseau, R. (2020). From wisckey to
bourbon: A learned index for log-structured merge trees. In 14th
USENIX Symposium on Operating Systems Design and Implementation
(OSDI 20) (pp. 155-171). Web access:
https://www.usenix.org/system/files/osdi20-dai_0.pdf

Ding, J., Minhas, U. F., Yu, J., Wang, C., Do, J., Li, Y., ... &
Kraska, T. (2020, June). ALEX: an updatable adaptive learned index. In
Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data (pp. 969-984). Web access:
https://doi.org/10.1145/3318464.3389711 /
https://dblp.org/rec/conf/sigmod/DingMYWDLZCGKLK20 . Source code:
https://github.com/microsoft/ALEX (C++, MIT license)

On Tue, Dec 12, 2017 at 18:02, Oleg Ivanov
<o.ivanov@postgrespro.ru> wrote:
>
> On 12/12/2017 04:33 PM, Oleg Bartunov wrote:
> > On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
> > <samokhvalov@gmail.com> wrote:
> >> Very interesting read: https://arxiv.org/abs/1712.01208
> >>
> >> HN discussion: https://news.ycombinator.com/item?id=15894896
> >>
> >> Some of the comments (from Twitter
> >> https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
> >> at GOOG just released a paper showing how machine-learned indexes can
> >> replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
> >> B-Trees, 10-100x less space. Executes on GPU, which are getting faster
> >> unlike CPU. Amazing."
> >>
> >> Can those ideas be applied to Postgres in its current state? Or it's not
> >> really down-to-earth?
> > Oleg did some analysis of the paper.
> If the keys are comparable, the data distribution is simple enough
> (which seems to happen rather often), and the data does not change, we
> can learn the distribution, and so perform the search faster while
> also saving memory. That is what the authors wrote, and in my opinion
> that is what should work.
>
> The limitations of the approach, which the authors mention in their work:
> + The data does not change. The proposed method can be extended more
> or less straightforwardly to nearly append-only workloads and to
> workloads in which the data distribution does not change or changes
> very slowly (the data itself or its amount may still change,
> nevertheless). Otherwise it is challenging, because the key positions
> can no longer be treated as independent values, but it still looks
> possible. The other option the authors propose is to store new objects
> in a buffer and retrain the model in the background, but that does not
> look like a clean solution.
> + GPUs are fast, but the latency of operations on the GPU almost
> certainly negates all speed benefits for such small models. The only
> case in which it is reasonable to use a GPU is training the models
> (i.e. building such indexes).
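(Just to make the buffer-and-retrain idea above concrete, here is a toy sketch of my own, in Python, not from the paper, with names of my choosing: a small sorted delta buffer sits in front of a read-only main structure; lookups check the buffer first, and once the buffer grows past a threshold it is merged, at which point a real system would retrain the model.)

```python
import bisect

# Toy delta-buffer index: "main" stands in for the read-only learned
# structure; new keys land in a small sorted buffer that is merged
# (and, in a real system, the model retrained) past a threshold.
class BufferedIndex:
    def __init__(self, keys, threshold=4):
        self.main = sorted(keys)
        self.buffer = []
        self.threshold = threshold

    def insert(self, key):
        bisect.insort(self.buffer, key)
        if len(self.buffer) >= self.threshold:
            self._merge()

    def _merge(self):
        # merge buffer into main; retraining would happen here
        self.main = sorted(self.main + self.buffer)
        self.buffer = []

    def contains(self, key):
        for arr in (self.buffer, self.main):
            i = bisect.bisect_left(arr, key)
            if i < len(arr) and arr[i] == key:
                return True
        return False

bidx = BufferedIndex([1, 5, 9])
bidx.insert(3)
print(bidx.contains(3))  # -> True (found in the buffer)
print(bidx.contains(4))  # -> False
```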
>
> The following is left unclear to me from the paper:
> + How effectively the approach can be implemented, taking into account
> the limitations above.
> + For hash indexes, the authors propose replacing the hash function
> with the learned CDF, which can decrease the number of collisions and
> thereby the memory consumption. That is reasonable, but this hash
> function implicitly uses the information about the order of the keys.
> I suppose that, using the ordering of the data and access to its
> histogram, one could achieve a similar decrease in memory consumption.
> + The proposed hierarchical learning looks natural for the problem,
> but it is not clear that it is the best option. For example, one might
> use an end-to-end learning procedure with weight-sparsity regularizers
> as well. I wonder whether the latter approach can obtain results
> competitive with the former, and if not, then why. In any case, I have
> a feeling that the proposed method has room for improvement.
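(The CDF-as-hash point above is easy to demonstrate; the following is a toy sketch of my own in Python, not from the paper. Bucketing keys by their approximate empirical CDF, i.e. their rank in a sorted sample, spreads even a heavily skewed key distribution almost perfectly evenly over the buckets, which is where the collision and memory savings would come from.)

```python
import bisect
import random

# Toy "CDF as hash function": map a key to a bucket via its approximate
# rank in a sorted sample, so the bucket index tracks CDF(key).
def make_cdf_hash(sample, num_buckets):
    sample = sorted(sample)
    n = len(sample)
    def h(key):
        rank = bisect.bisect_left(sample, key)  # ~ CDF(key) * n
        return min(num_buckets - 1, rank * num_buckets // n)
    return h

random.seed(0)
keys = [random.gauss(0.0, 1.0) for _ in range(10000)]  # heavily non-uniform
h = make_cdf_hash(keys, 100)
counts = [0] * 100
for k in keys:
    counts[h(k)] += 1
print(min(counts), max(counts))  # buckets stay near 100 keys each
```

A generic hash would of course also spread these keys, but it could not then support range scans; the point is that the CDF mapping preserves key order while still giving near-uniform bucket occupancy.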
>
> In general, the approach still needs additional research (mostly
> regarding changing data), and there are some points to think through
> about how to implement it properly (paging, for example), but it looks
> promising enough nevertheless.
>


