From: Alexander Korotkov
Subject: Re: LSM tree for Postgres
Msg-id: CAPpHfdvkFFKd3-JvVX_aWALMb8xjhkLA9juFG1uAk6aZ7JVRSA@mail.gmail.com
In response to: Re: LSM tree for Postgres (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List: pgsql-hackers
On Sat, Aug 8, 2020 at 11:49 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> On 08.08.2020 21:18, Alexander Korotkov wrote:
> > On Sat, Aug 8, 2020 at 5:07 PM Konstantin Knizhnik
> > <k.knizhnik@postgrespro.ru> wrote:
> >> I agree with you that losing the sequential order of B-tree pages may
> >> have a negative impact on performance.
> >> But it is first of all critical for order-by and range queries, when we
> >> should traverse several subsequent leaf pages.
> >> It is less critical for exact-search or delete/insert operations.
> >> Efficiency of merge operations mostly depends on how many keys
> >> will be stored at the same B-tree page.
> >
> > What do you mean by "mostly"?  Given PostgreSQL has quite small (8k)
> > pages, sequential read is several times faster than random read on SSDs
> > (dozens of times on HDDs).  I don't think this is something to
> > neglect.
>
> When you insert one record into a B-tree, the order of pages doesn't
> matter at all.
> If you insert ten records into one leaf page, then order is also not so
> important.
> If you insert 100 records, 50 go to one page and 50 to the next page,
> then insertion may be faster if the second page follows the first one
> on disk.
> But such insertion may cause a page split and thus allocation of a new
> page, so sequential write order can still be violated.

Sorry, I've no idea what you're getting at.

> >> And it is first of all
> >> determined by the size of the top index and the key distribution.
> >
> > How can you be sure that the top index can fit in memory?  On
> > production systems, there are typically multiple consumers of memory:
> > other tables, indexes, other LSMs.  This is one of the reasons why LSM
> > implementations have multiple levels: they don't know in advance which
> > levels fit in memory.  Another reason is dealing with very large
> > datasets.  And I believe there is a quite strong reason to keep page
> > order sequential within a level.
>
> There is no guarantee that the top index is kept in memory.
> But since top index pages are frequently accessed, I hope that the
> buffer manager's cache replacement algorithm does its best to keep them
> in memory.

So, the top index should be small enough that we can safely assume it
wouldn't be evicted from cache on a heavily loaded production system.  I
think it's evident that it should be orders of magnitude smaller than the
total amount of server RAM.

> > I'm OK with your design for a third-party extension.  It's very cool
> > to have.  But I'm -1 for something like this to get into core
> > PostgreSQL, assuming it's feasible to push some effort and get a
> > state-of-the-art LSM there.
>
> I realize that it is not a true LSM.
> But still I want to note that it is able to provide a ~10x increase in
> insert speed when the size of the index is comparable with the RAM size.
> And the "true LSM" from RocksDB shows similar results.

It's very far from being shown.  All you've shown is a naive benchmark.
I don't object that your design can work out in some cases.  And it's
great that we have the lsm3 extension now.  But I think for PostgreSQL
core we should think about a better design.
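To make the scheme under discussion concrete, here is a rough sketch of
the two-B-tree design: a small top index absorbs inserts and is merged
into the base index once it outgrows a threshold.  This is a toy model
only; the dicts standing in for B-trees, the names, and the threshold
value are illustrative assumptions, not the actual lsm3 code.

    # Sketch of the lsm3-style two-level scheme discussed in this thread.
    # Plain dicts stand in for B-trees; all names and sizes are illustrative.
    class TwoLevelIndex:
        def __init__(self, top_limit=10_000):
            self.top = {}      # small, hot index; assumed to stay cached
            self.base = {}     # stands in for the large on-disk B-tree
            self.top_limit = top_limit

        def insert(self, key, value):
            self.top[key] = value
            if len(self.top) >= self.top_limit:
                self.merge()

        def merge(self):
            # Merge in key order: the more consecutive keys that land on
            # the same base-index page, the cheaper the merge -- the
            # merge-efficiency point debated above.
            for key in sorted(self.top):
                self.base[key] = self.top[key]
            self.top.clear()

        def lookup(self, key):
            # A point lookup must consult both indexes, top first.
            return self.top.get(key, self.base.get(key))

The disagreement above reduces to two properties of this sketch: whether
self.top really stays cached under production load, and how many merged
keys fall onto each base-index page.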
> Maybe if the size of the index is 100 times larger than the size of
> RAM, RocksDB will be significantly faster than Lsm3.  But modern
> servers have 0.5-1TB of RAM.
> I can't believe that there are databases with 100TB indexes.

Comparing the whole RAM size to the size of a single index looks plain
wrong to me.  I think we can roughly compare the whole RAM size to the
whole database size.  But also, not the whole of RAM is always available
for caching data.  Let's assume half of RAM is used for caching data.
So, a modern server with 0.5-1TB of RAM, which suffers from random
B-tree insertions and badly needs an LSM-like data structure, runs a
database of 25-50TB.
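Spelling that estimate out (the 100x data-to-cache factor is borrowed
from Konstantin's hypothetical above; halving RAM for the data cache is
the assumption just stated):

    # Back-of-envelope arithmetic for the paragraph above.
    for ram_tb in (0.5, 1.0):
        cache_tb = ram_tb / 2      # assume half of RAM caches data
        db_tb = 100 * cache_tb     # data ~100x larger than the cache
        print(f"{ram_tb} TB RAM -> {db_tb:.0f} TB database")
    # 0.5 TB RAM -> 25 TB database
    # 1.0 TB RAM -> 50 TB database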
Frankly speaking, there is nothing counterintuitive in that for me.

------
Regards,
Alexander Korotkov