Thread: LSM tree for Postgres
Hi hackers,
I want to share the results of my recent research on implementing an LSM index in Postgres.
Many modern databases (RocksDB, MongoDB, Tarantool, ...) use an LSM tree instead of the classical B-Tree.
On the one hand, the amount of RAM in modern servers makes it possible to keep the whole database in memory.
This leads to the anti-caching approach proposed by Michael Stonebraker:
https://15721.courses.cs.cmu.edu/spring2016/papers/hstore-anticaching.pdf
On the other hand, maintaining indexes is one of the most important factors limiting database performance.
Postgres can insert records into a table without indexes at almost sequential disk write speed (hundreds of megabytes per second).
But if the table has indexes, the inserted keys have random values, and the indexes do not fit in memory, then we observe a dramatic degradation of performance.
Average HDD access time is about 10 ms, which corresponds to roughly 100 random reads per second. If a table has several indexes
and is large enough not to fit in memory, insert speed can drop to tens of TPS. SSDs certainly provide much better random access times,
but random reads are still slow.
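As a quick back-of-envelope check of that claim (assuming, purely for illustration, that every index insert costs about one uncached random page read and that the heap write itself is sequential):

-- ~10 ms per random HDD read gives ~100 random reads per second;
-- with e.g. 3 cold indexes an insert needs ~3 random reads,
-- so the table as a whole sustains only a few tens of inserts per second.
select 1000 / 10     as random_reads_per_sec,
       1000 / 10 / 3 as approx_insert_tps_with_3_cold_indexes;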
The LSM approach tries to address this problem.
I started my experiments with RocksDB (perhaps the most popular LSM-based key-value store, used by many databases).
There is an FDW for RocksDB from the VidarDB project: https://github.com/vidardb/pgrocks
Since RocksDB is a multithreaded embedded database engine and Postgres is based on a multiprocess architecture,
they used an interesting "server inside server" approach: there is a bgworker process which works with RocksDB, and
backends send requests to it through a shared memory queue.
I have significantly rewritten their FDW implementation: the original RocksDB server implementation was single-threaded.
I have made it multithreaded, making it possible to run multiple RocksDB requests in parallel.
My implementation can be found here:
https://github.com/postgrespro/lsm
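For context, consuming such an FDW-based table from SQL follows the usual foreign data wrapper pattern. The sketch below uses hypothetical object names (the real extension, wrapper and option names are defined in the repository's README and may differ):

-- all names here (lsm, lsm_fdw, lsm_server) are illustrative placeholders
create extension lsm;
create server lsm_server foreign data wrapper lsm_fdw;
create foreign table kv(k bigint, v bigint) server lsm_server;  -- data is stored in RocksDB, ordered by key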
Some benchmark results.
The benchmark is simply the insertion of 250 million records with random keys into an inclusive (covering) index containing one bigint key and 8 bigint included fields.
The size of the index is about 20GB and the target system has 16GB of RAM:

Index               Clients   TPS
Inclusive B-Tree    1         9387
Inclusive B-Tree    10        18761
RocksDB FDW         1         138350
RocksDB FDW         10        122369
RocksDB             1         166333
RocksDB             10        141482

As you can see, there is about a 10x difference.
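For reference, a minimal pgbench script that would generate this kind of random-key insert load could look like the following (a sketch: the table name t, the key range, and the way the original benchmark was actually driven are assumptions):

-- insert.sql: random-key inserts into the benchmark table
\set k random(1, 1000000000)
insert into t values (:k, 0, 0, 0, 0, 0, 0, 0, 0);

It would then be run with something like: pgbench -n -c 10 -T 600 -M prepared -f insert.sql postgres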
Maybe somebody will find this idea of an IOT (index organized table) based on RocksDB in Postgres useful.
But this approach violates all ACID properties of Postgres:
there is no atomicity or consistency (in principle RocksDB supports 2PC, but it is not used here),
isolation corresponds to something like "read uncommitted",
and as for durability, it is entirely up to RocksDB; I have serious doubts that it would survive a failure, especially with sync write mode disabled.
So I consider this project mostly as a prototype for estimating the efficiency of the LSM approach.
Then I thought about implementing LSM ideas using standard Postgres nbtree indexes.
We need two indexes: one small index for fast inserts and another big (main) index. The top index is small enough to fit in memory,
so inserts into this index are very fast.
Periodically we merge data from the top index into the base index and truncate the top index. To prevent blocking of inserts into the table
while we are merging indexes, we can add ... one more index, which will be used during the merge.
So the final architecture of Lsm3 is the following:
two top indexes used in a cyclic way and one main index. When the active top index reaches some threshold size,
we initiate a merge with the main index (done by a bgworker) and switch to the other top index.
Since merging of the indexes is done in the background, it doesn't affect insert speed.
Unfortunately the Postgres index AM has no bulk insert operation, so we have to perform normal inserts.
But the inserted data is already sorted by key, which should improve access locality and partly solve the random read problem for the base index.
Of course, to perform a search in Lsm3 we have to make lookups in all three indexes and merge the search results
(in the case of a unique index we can avoid the extra lookups if the searched value is found in a top index).
It can happen that during the merge of the top and base indexes the same TID is present in both of them.
But such duplicates can be easily eliminated while merging the search results.
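Conceptually (this is not how the AM implements it, and the three underlying B-Trees are not separately visible at the SQL level), the result of one Lsm3 lookup behaves like a union of the three per-tree lookups with duplicate heap TIDs dropped:

-- purely illustrative: t_top1, t_top2 and t_base are placeholders for the three underlying trees
select distinct on (heap_tid) *
from (
    select ctid as heap_tid, * from t_top1 where k = 42
    union all
    select ctid as heap_tid, * from t_top2 where k = 42
    union all
    select ctid as heap_tid, * from t_base where k = 42
) merged;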
Since we are using standard nbtree indexes, there is no need to worry about logging information in the WAL:
there is no need to use inefficient "generic WAL records" or to patch the core by adding our own WAL records.
The implementation of the Lsm3 index AM as a standard Postgres extension is available here:
https://github.com/postgrespro/lsm3
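Basic usage is plain DDL. The index definition below is the one used in the benchmark later in this thread; "create extension lsm3" is my assumption about the installation step:

create extension lsm3;  -- assumed installation step
create table t(k bigint, v1 bigint, v2 bigint, v3 bigint, v4 bigint, v5 bigint, v6 bigint, v7 bigint, v8 bigint);
create index lsm_index on t using lsm3(k) include (v1,v2,v3,v4,v5,v6,v7,v8);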
I have performed the same benchmark with random inserts (described above) for Lsm3:

Index               Clients   TPS
Inclusive B-Tree    1         9387
Inclusive B-Tree    10        18761
RocksDB FDW         1         138350
RocksDB FDW         10        122369
RocksDB             1         166333
RocksDB             10        141482
Lsm3                1         151699
Lsm3                10        65997

The size of the nbtree is about 29GB; single-client performance is even higher than that of the RocksDB FDW, but the parallel results are significantly worse.
So Lsm3 can provide a significant performance improvement for large indexes that do not fit in main memory.
And the larger the ratio between index size and RAM size, the larger the benefit in insertion speed.
Lsm3 is just a standard Postgres extension, fully integrated into the Postgres infrastructure (MVCC, WAL, backups, ...).
So I hope it can be useful when standard indexes become a bottleneck.
I will be glad to receive any feedback, maybe some change requests or proposals.
Best regards,
Konstantin
Hi! On Tue, Aug 4, 2020 at 11:22 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > I want to share results of my last research of implementing LSM index in Postgres. > Most of modern databases (RocksDB, MongoDB, Tarantool,...) are using LSM tree instead of classical B-Tree. I wouldn't say it that way. I would say they are providing LSM in addition to the B-tree. For instance WiredTiger (which is the main engine for MongoDB) provides both B-tree and LSM. As I know Tarantool provides at least an in-memory B-tree. If RocksDB is used as an engine for MySQL, then it's also an addition to B-tree, which is provided by InnoDB. Also, implementation of B-tree's in mentioned DBMSes are very different. I would say none of them is purely classical. > LSM approach tries to address this problem. LSM has great use-cases for sure. > I have significantly rewriten their FDW implementation: original RocksDB server implementation was single threaded. > I have made it multitheaded making it possible to run multiple RocksDB requests in parallel. > My implementation can be found there: > https://github.com/postgrespro/lsm Great, thank you for your work. > Some results of benchmarking. > Benchmark is just insertion of 250 millions of records with random key in inclusive index containing one bigint key and8 bigint fields. > SIze of index is about 20Gb and target system has 16GB of RAM: What storage do you use? > As you can see there is about 10 times difference. > May be somebody will find useful this idea of using IOT (index organized table) based on RocksDB in Postgres. > But this approach violates all ACID properties of Postgres: > there is no atomicity and consistency (in principle RocksDB supports 2PC, but it is not used here), > isolation corresponds to something like "read uncommitted", > and concerning durability - it is all up to RocksDB and I have serious doubts that it will survive failure especiallywith sync write mode disabled. > So I considered this project mostly as prototype for estimating efficiency of LSM approach. Yes, integration of WAL and snapshots between Postgres and RocksDB is problematic. I also doubt that RocksDB can use the full power of Postgres extendable type system. > Then I think about implementing ideas of LSM using standard Postgres nbtree. > > We need two indexes: one small for fast inserts and another - big (main) index. This top index is small enough to fit inmemory > so inserts in this index are very fast. > Periodically we will merge data from top index to base index and truncate the top index. To prevent blocking of insertsin the table > while we are merging indexes we can add ... on more index, which will be used during merge. > > So final architecture of Lsm3 is the following: > two top indexes used in cyclic way and one main index. When top index reaches some threshold value > we initiate merge with main index, done by bgworker and switch to another top index. > As far as merging indexes is done in background, it doesn't affect insert speed. > Unfortunately Postgres Index AM has not bulk insert operation, so we have to perform normal inserts. > But inserted data is already sorted by key which should improve access locality and partly solve random reads problem forbase index. > > Certainly to perform search in Lsm3 we have to make lookups in all three indexes and merge search results. > (in case of unique index we can avoid extra searches if searched value is found in top index). 
> It can happen that during merge of top and base indexes the same TID can be found in both of them. > But such duplicates can be easily eliminated during merge of search results. You use a fixed number of trees. Is this a limitation of prototype or intention of design? I guess if the base index is orders of magnitude bigger than RAM, this scheme can degrade greatly. > As far as we are using standard nbtree indexes there is no need to worry about logging information in WAL. > There is no need to use inefficient "generic WAL records" or patch kernel by adding own WAL records. As I get the merge operations are logged in the same way as ordinal inserts. This seems acceptable, but a single insert operation would eventually cause in times more WAL than it does in B-tree (especially if we'll have an implementation of a flexible number of trees). In principle that could be evaded if recovery could replay logic of trees merge on its side. But this approach hardly can fit Postgres in many ways. > I have performed the same benchmark with random inserts (described above) for Lsm3: > > Index Clients TPS > Inclusive B-Tree 1 9387 > Inclusive B-Tree 10 18761 > RocksDB FDW 1 138350 > RocksDB FDW 10 122369 > RocksDB 1 166333 > RocksDB 10 141482 > Lsm3 1 151699 > Lsm3 10 65997 > > Size of nbtree is about 29Gb, single client performance is even higher than of RocksDB FDW, but parallel results are signficantlyworser. Did you investigate the source of degradation? Such degradation doesn't yet look inevitable for me. Probably, things could be improved. > I will be glad to receive any feedback, may be some change requests or proposals. As I get you benchmarked just inserts. But what about vacuum? I think the way Postgres vacuum works for now isn't optimal for lsm. Postgres vacuum requires full scan of index, because it provides a bitmap of tids to be deleted without information of index keys. For lsm, it would be better if vacuum would push delete requests to the top level of lsm (as specially marked tuples of something). Thanks to that index deletes could be as efficient as inserts. This is especially important for lsm with many levels and/or aggressive vacuum. ------ Regards, Alexander Korotkov
On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote: >Hi hackers, > >I want to share results of my last research of implementing LSM index >in Postgres. >Most of modern databases (RocksDB, MongoDB, Tarantool,...) are using >LSM tree instead of classical B-Tree. > I was under the impression that LSM is more an alternative primary storage, not for indexes. Or am I wrong / confused? >From one side, capacity of RAM at modern servers allows to keep the >whole database in memory. >It leads to the anti-caching approach proposed by Michael Stonebraker >https://15721.courses.cs.cmu.edu/spring2016/papers/hstore-anticaching.pdf > >From the other side: maintaining if indexes is one of the most >important factor limiting database performance. >Postgres is able to insert records in table without indexes almost >with linear disk write speed(hundred megabytes per second). >But if table contains indexes, inserted keys have random values and >indexes don't fill in memory then we observer dramatic degradation of >performance. Average HDD access time is about 10msec, which >corresponds to 100 random reads per second. If table has several >indexes >and is large enough not to fit in memory, then insert speed can be as >low as tens TPS. Certainly SSD can provide much better random access >time, >but still random reads are slow. > True. Indexes (the way we do them) almost inevitably cause random I/O. >LSM approach tries to address this problem. >First of all I started my experiments with RocksDB (may be most >popular LSM-based key-value storage, used in many databases). >There was FDW for RocksDB from VidarDB project: >https://github.com/vidardb/pgrocks >As far as RocksDB is multuthreaded embedded database engine and >Postgres is based on multiprocess architecture, >them used interesting approach "server inside server": there is >bgworker process which works with RocksDB and >backends sendind requests to it through shared memory queue. > >I have significantly rewriten their FDW implementation: original >RocksDB server implementation was single threaded. >I have made it multitheaded making it possible to run multiple RocksDB >requests in parallel. >My implementation can be found there: >https://github.com/postgrespro/lsm > >Some results of benchmarking. >Benchmark is just insertion of 250 millions of records with random key >in inclusive index containing one bigint key and 8 bigint fields. >SIze of index is about 20Gb and target system has 16GB of RAM: > > >Index Clients TPS >Inclusive B-Tree 1 9387 >Inclusive B-Tree 10 18761 >RocksDB FDW 1 138350 >RocksDB FDW 10 122369 >RocksDB 1 166333 >RocksDB 10 141482 > Interesting, although those are just writes, right? Do you have any numbers for read? Also, what are the numbers when you start with "larger than RAM" data (i.e. ignoring the initial period when the index fits into memory)? > >As you can see there is about 10 times difference. >May be somebody will find useful this idea of using IOT (index >organized table) based on RocksDB in Postgres. >But this approach violates all ACID properties of Postgres: >there is no atomicity and consistency (in principle RocksDB supports >2PC, but it is not used here), >isolation corresponds to something like "read uncommitted", >and concerning durability - it is all up to RocksDB and I have >serious doubts that it will survive failure especially with sync write >mode disabled. >So I considered this project mostly as prototype for estimating >efficiency of LSM approach. 
> Yeah, I think in general FDW to a database with different consistency model is not going to get us very far ... Good for PoC experiments, but not really designed for stuff like this. Also, in my experience FDW has siginficant per-row overhead. >Then I think about implementing ideas of LSM using standard Postgres nbtree. > >We need two indexes: one small for fast inserts and another - big >(main) index. This top index is small enough to fit in memory >so inserts in this index are very fast. >Periodically we will merge data from top index to base index and >truncate the top index. To prevent blocking of inserts in the table >while we are merging indexes we can add ... on more index, which will >be used during merge. > >So final architecture of Lsm3 is the following: >two top indexes used in cyclic way and one main index. When top index >reaches some threshold value >we initiate merge with main index, done by bgworker and switch to >another top index. >As far as merging indexes is done in background, it doesn't affect >insert speed. >Unfortunately Postgres Index AM has not bulk insert operation, so we >have to perform normal inserts. >But inserted data is already sorted by key which should improve access >locality and partly solve random reads problem for base index. > >Certainly to perform search in Lsm3 we have to make lookups in all >three indexes and merge search results. >(in case of unique index we can avoid extra searches if searched value >is found in top index). >It can happen that during merge of top and base indexes the same TID >can be found in both of them. >But such duplicates can be easily eliminated during merge of search results. > >As far as we are using standard nbtree indexes there is no need to >worry about logging information in WAL. >There is no need to use inefficient "generic WAL records" or patch >kernel by adding own WAL records. > Makes sense, I guess. I always imagined we could do something like this by adding "buffers" into the btree directly, and instead of pushing them all the way down to the leaf pages we'd only insert them into the first buffer (and full buffers would get "propagated" in the background). How many such "buffers" we'd need / in which places in the btree is an open question - I suppose we could have a buffer for every internal page Typical indexes have ~1% in internal pages, so even if each 8kB internal page has an associated 8kB buffer page, it's not going to increase the size significantly. Of course, it's going to make lookups more expensive because you have to search all the buffers on the way to the leaf page (I wonder if we could improve this by keeping a a tiny bloom filters for those buffers, representing data in the subtree). Not sure if this would be simpler/cheaper than maintaining multiple separate indexes, which is what you propose. BTW how would your approach work with unique indexes, speculative inserts etc.? >Implementation of Lsm3 Index AM as standart Postgres extension is >available here: >https://github.com/postgrespro/lsm3 > >I have performed the same benchmark with random inserts (described >above) for Lsm3: > >Index Clients TPS >Inclusive B-Tree 1 9387 >Inclusive B-Tree 10 18761 >RocksDB FDW 1 138350 >RocksDB FDW 10 122369 >RocksDB 1 166333 >RocksDB 10 141482 >Lsm3 1 151699 >Lsm3 10 65997 > > >Size of nbtree is about 29Gb, single client performance is even higher >than of RocksDB FDW, but parallel results are signficantly worser. 
>So Lsm3 can provide significant improve of performance for large >indexes not fitting in main memory. >And the larger ratio between index size and RAM size is, the larger >benefit in insertion speed you get. >Lsm3 is just standard postgres extension, fully integrated in Postgres >infrastructure (MVCC, WAL, backups,...). >SO I hope it can be useful when standard indexes becomes bottleneck. > Isn't it a bit suspicious that with more clients the throughput actually drops significantly? Is this merely due to PoC stage, or is there some inherent concurrency bottleneck? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote: > >two top indexes used in cyclic way and one main index. When top index > >reaches some threshold value > >we initiate merge with main index, done by bgworker and switch to another > >top index. > >As far as merging indexes is done in background, it doesn't affect insert > >speed. > >Unfortunately Postgres Index AM has not bulk insert operation, so we have > >to perform normal inserts. > >But inserted data is already sorted by key which should improve access > >locality and partly solve random reads problem for base index. > > > >Certainly to perform search in Lsm3 we have to make lookups in all three > >indexes and merge search results. > >(in case of unique index we can avoid extra searches if searched value is > >found in top index). > >It can happen that during merge of top and base indexes the same TID can > >be found in both of them. > >But such duplicates can be easily eliminated during merge of search results. > > > >As far as we are using standard nbtree indexes there is no need to worry > >about logging information in WAL. > >There is no need to use inefficient "generic WAL records" or patch kernel > >by adding own WAL records. > > Makes sense, I guess. I always imagined we could do something like this > by adding "buffers" into the btree directly, and instead of pushing them > all the way down to the leaf pages we'd only insert them into the first > buffer (and full buffers would get "propagated" in the background). I get that it's not quite the same, but this all is reminding me of the GIN pending list and making me wonder if there's some way to generalize that (or come up with something new that would work for GIN too). Independently while considering this, I don't think the issues around how to deal with unique btrees properly has really been considered- you certainly can't stop your search on the first tuple you find even if the index is unique, since the "unique" btree could certainly have multiple entries for a given key and you might need to find a different one. Thanks, Stephen
On Tue, Aug 4, 2020 at 6:11 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote: > >Hi hackers, > > > >I want to share results of my last research of implementing LSM index > >in Postgres. > >Most of modern databases (RocksDB, MongoDB, Tarantool,...) are using > >LSM tree instead of classical B-Tree. > > > > I was under the impression that LSM is more an alternative primary > storage, not for indexes. Or am I wrong / confused? As I understand, there are different use-cases. We can use LSM for index, and this is good already. Such indexes would be faster for insertions and probably even vacuum if we redesign it (see my previous message), but slower for search. But for updates/deletes you still have to do random access to the heap. And you also need to find a heap record to update/delete, probably using the LSM index (and it's slower for search than B-tree). LSM as a primary storage can do more advanced tricks. For instance, some updates/inserts_on_conflict could be also just pushed to the top level of LSM without fetching the affected record before. So, in my point of view LSM as an index AM is far not a full power LSM for PostgreSQL, but it's still useful. Large insert-only tables can benefit from LSM. Large tables with many indexes could also benefit, because non-HOT updates will become cheaper. ------ Regards, Alexander Korotkov
On 04.08.2020 18:04, Alexander Korotkov wrote: > Hi! > > On Tue, Aug 4, 2020 at 11:22 AM Konstantin Knizhnik > <k.knizhnik@postgrespro.ru> wrote: >> I want to share results of my last research of implementing LSM index in Postgres. >> Most of modern databases (RocksDB, MongoDB, Tarantool,...) are using LSM tree instead of classical B-Tree. > I wouldn't say it that way. I would say they are providing LSM in > addition to the B-tree. For instance WiredTiger (which is the main > engine for MongoDB) provides both B-tree and LSM. As I know Tarantool > provides at least an in-memory B-tree. If RocksDB is used as an > engine for MySQL, then it's also an addition to B-tree, which is > provided by InnoDB. Also, implementation of B-tree's in mentioned > DBMSes are very different. I would say none of them is purely > classical. I am not suggesting to completely replace B-Tree with LSM. My experiments shows tah Postgres nbtree is faster than RocksDB when index is small and fits in memory. Definitely I have suggest to use LSM only for huge tables which indexes are much larger than size of available memory. > Some results of benchmarking. >> Benchmark is just insertion of 250 millions of records with random key in inclusive index containing one bigint key and8 bigint fields. >> SIze of index is about 20Gb and target system has 16GB of RAM: > What storage do you use? It is my notebook with 16GB of RAM and SSD. Certainly it should be tested at more serious hardware. But there is a "problem": powerful servers now have hundreds of gigabytes of memory. To let LSM index demonstrates it advantages we need to create index not fitting in memory. And CPU speed of very expensive servers is not significantly faster than of my notebook. Performing may inserts in parallel also will not significantly increase population of table with data: multiple bottlenecks in Postgres do not allow to reach liner scalability even thoughwe have hundreds of CPU cores. > Yes, integration of WAL and snapshots between Postgres and RocksDB is > problematic. I also doubt that RocksDB can use the full power of > Postgres extendable type system. This implementation support only basic scala Postgres types. Actually RocksDB is dealing only with string key-value pairs. So we need to serialize Postgres type into array of bytes (and provide right ordering)! > > You use a fixed number of trees. Is this a limitation of prototype or > intention of design? I guess if the base index is orders of magnitude > bigger than RAM, this scheme can degrade greatly. I do not understand why do we need multiple indexes. We need one "hot" index which fits in memory to perform fast inserts. But it should not be too small to be able to accumulate substantial amount of records to provide efficient bulk insert. I expect that top index can be efficiently merger with based index because of better access locality. I.e. we will insert multiple entries into one B-Tree page and so minimize slowdown of random reads. Third index is needed to perform parallel merge (while merge is in progress top index will be blocked and we can not perform inserts in it). I do not understand benefits of performing more than one merge in parallel: it will only increase fraction of random reads. Degradation certainly takes place. But it is not so critical as in case of standard nbtree. It is possible to tune threshold for top index size to make merge most efficient. But we can not truncate and swap index before we complete the merge. 
So if merge operation takes long time, then it will cause exhaustion of top index and it will not fit in memory any more. It will lead to further slowdown (so we have negative feedback here). Certainly it is possible to create more top indexes, keeping their size small enough to fit in memory. But in this case search will require merge not of 3, but of N indexes. I think that it may cause unacceptable slowdown of search operations. And it is highly undesirable, taken in account that most of application send more select-only queries than updates. > >> As far as we are using standard nbtree indexes there is no need to worry about logging information in WAL. >> There is no need to use inefficient "generic WAL records" or patch kernel by adding own WAL records. > As I get the merge operations are logged in the same way as ordinal > inserts. This seems acceptable, but a single insert operation would > eventually cause in times more WAL than it does in B-tree (especially > if we'll have an implementation of a flexible number of trees). In > principle that could be evaded if recovery could replay logic of trees > merge on its side. But this approach hardly can fit Postgres in many > ways. Yes, this approach increase mount of logged data twice: we need to write in WAL inserts in top and base indexes. And it will certainly have negative influence on performance. Unfortunately I have no idea how to avoid it without patching Postgres core. > >> I have performed the same benchmark with random inserts (described above) for Lsm3: >> >> Index Clients TPS >> Inclusive B-Tree 1 9387 >> Inclusive B-Tree 10 18761 >> RocksDB FDW 1 138350 >> RocksDB FDW 10 122369 >> RocksDB 1 166333 >> RocksDB 10 141482 >> Lsm3 1 151699 >> Lsm3 10 65997 >> >> Size of nbtree is about 29Gb, single client performance is even higher than of RocksDB FDW, but parallel results are signficantlyworser. > Did you investigate the source of degradation? Such degradation > doesn't yet look inevitable for me. Probably, things could be > improved. As I explained above insertion in Lsm3 now is process with negative feedback. If insertion rate is higher than merge speed then top index is blown and doesn't fit in memory any more which in turn cause more slowdown of merge and as a result - further increase of top index size. Several parallel clients performing inserts can fill top index faster than single client does and faster than it can be merged with main index. I have tried several approaches which try to slowdown inserts to prevent undesired growth of top index (I have considered two criteria: size of top index and time of merge operation). But none of my attempts was successful: its leads to even worse performance. > >> I will be glad to receive any feedback, may be some change requests or proposals. > As I get you benchmarked just inserts. But what about vacuum? I > think the way Postgres vacuum works for now isn't optimal for lsm. > Postgres vacuum requires full scan of index, because it provides a > bitmap of tids to be deleted without information of index keys. For > lsm, it would be better if vacuum would push delete requests to the > top level of lsm (as specially marked tuples of something). Thanks to > that index deletes could be as efficient as inserts. This is > especially important for lsm with many levels and/or aggressive > vacuum. Right now vacuum process Lsm3 indexes in usual way. Removing records from top indexes may be not needed at all (just because this indexes will be truncated in any case). 
But since the size of the top index is expected to be small enough, vacuuming it should not take a long time, so I didn't try to avoid it (although it should not be difficult - just disable ambulkdelete for the corresponding nbtree wrappers). Concerning deletes from the main index - I do not understand how they can be optimized.
On 04.08.2020 18:11, Tomas Vondra wrote: > On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote: >> Hi hackers, >> >> I want to share results of my last research of implementing LSM index >> in Postgres. >> Most of modern databases (RocksDB, MongoDB, Tarantool,...) are using >> LSM tree instead of classical B-Tree. >> > > I was under the impression that LSM is more an alternative primary > storage, not for indexes. Or am I wrong / confused? Yes, originally I considered LSM for IOT (index organized table). And RocksDB FDW is actually such implementation of IOT. But implement IOT using existed nbtree indexes is more challenging task: I have thought about it but have not tried it yet. > Interesting, although those are just writes, right? Do you have any > numbers for read? Also, what are the numbers when you start with "larger > than RAM" data (i.e. ignoring the initial period when the index fits > into memory)? When benchmark starts, insertion speed is almost the same for Lsm3 and standard nbtree (we have to insert record twice, but second insertion s done in background). At the end of benchmark - when we close to 250 million records, Lsm3 shows TPS about 20 times faster than nbtree. Finally it gives about 6 times difference in elapsed time. > Yeah, I think in general FDW to a database with different consistency > model is not going to get us very far ... Good for PoC experiments, but > not really designed for stuff like this. Also, in my experience FDW has > siginficant per-row overhead. From my experience the main drawback of FDW is lack of support of parallel operations. But it is important mostly for OLAP, not for OLTP. > BTW how would your approach work with unique indexes, speculative > inserts etc.? Unique indexes are not supported now. And I do not see some acceptable solution here. If we will have to check presence of duplicate at the time of insert then it will eliminate all advantages of LSM approach. And if we postpone to the moment of merge, then... I afraid that it will be too late. > > Isn't it a bit suspicious that with more clients the throughput actually > drops significantly? Is this merely due to PoC stage, or is there some > inherent concurrency bottleneck? > My explaination is the following (I am not 100% sure that it is true): multiple clients insert records faster than merge bgworker is able to merge them to main index. It cause blown of top index and as a result it doesn't fir in memory any more. So we loose advantages of fast inserts. If we have N top indexes instead of just 2, we can keep size of each top index small enough. But in this case search operations will have to merge N indexes and so search is almost N times slow (the fact that each top index fits in memory doesn't mean that all of the fits in memory at the same time, so we still have to read pages from disk during lookups in top indexes).
On 04.08.2020 18:18, Stephen Frost wrote:
> Independently while considering this, I don't think the issues around
> how to deal with unique btrees properly has really been considered- you
> certainly can't stop your search on the first tuple you find even if the
> index is unique, since the "unique" btree could certainly have multiple
> entries for a given key and you might need to find a different one.
But the search locates not just ANY record with the specified key in the top index, but the record which satisfies the snapshot of the transaction. Why would we need more records if we know that there are no duplicates?
On Tue, Aug 04, 2020 at 08:18:01PM +0300, Konstantin Knizhnik wrote: > > >On 04.08.2020 18:11, Tomas Vondra wrote: >>On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote: >>>Hi hackers, >>> >>>I want to share results of my last research of implementing LSM >>>index in Postgres. >>>Most of modern databases (RocksDB, MongoDB, Tarantool,...) are >>>using LSM tree instead of classical B-Tree. >>> >> >>I was under the impression that LSM is more an alternative primary >>storage, not for indexes. Or am I wrong / confused? > >Yes, originally I considered LSM for IOT (index organized table). >And RocksDB FDW is actually such implementation of IOT. >But implement IOT using existed nbtree indexes is more challenging task: >I have thought about it but have not tried it yet. > >>Interesting, although those are just writes, right? Do you have any >>numbers for read? Also, what are the numbers when you start with "larger >>than RAM" data (i.e. ignoring the initial period when the index fits >>into memory)? > >When benchmark starts, insertion speed is almost the same for Lsm3 and >standard nbtree >(we have to insert record twice, but second insertion s done in background). >At the end of benchmark - when we close to 250 million records, Lsm3 >shows TPS about 20 times faster than nbtree. >Finally it gives about 6 times difference in elapsed time. > IMO the 6x difference is rather misleading, as it very much depends on the duration of the benchmark and how much data it ends up with. I think it's better to test 'stable states' i.e. with small data set that does not exceed RAM during the whole test, and large ones that already starts larger than RAM. Not sure if it makes sense to make a difference between cases that fit into shared buffers and those that exceed shared buffers but still fit into RAM. > >>Yeah, I think in general FDW to a database with different consistency >>model is not going to get us very far ... Good for PoC experiments, but >>not really designed for stuff like this. Also, in my experience FDW has >>siginficant per-row overhead. > >From my experience the main drawback of FDW is lack of support of >parallel operations. >But it is important mostly for OLAP, not for OLTP. > True. There are other overheads, though - having to format/copy the tuples is not particularly cheap. >>BTW how would your approach work with unique indexes, speculative >>inserts etc.? > >Unique indexes are not supported now. >And I do not see some acceptable solution here. >If we will have to check presence of duplicate at the time of insert >then it will eliminate all advantages of LSM approach. >And if we postpone to the moment of merge, then... I afraid that it >will be too late. > Ummm, but in your response to Stephen you said: But search locates not ANY record with specified key in top index but record which satisfies snapshot of the transaction. Why do we need more records if we know that there are no duplicates? So how do you know there are no duplicates, if unique indexes are not supported (and may not be for LSM)? >> >>Isn't it a bit suspicious that with more clients the throughput actually >>drops significantly? Is this merely due to PoC stage, or is there some >>inherent concurrency bottleneck? >> >My explaination is the following (I am not 100% sure that it is true): >multiple clients insert records faster than merge bgworker is able to >merge them to main index. It cause blown of top index and as a result >it doesn't fir in memory any more. >So we loose advantages of fast inserts. 
If we have N top indexes >instead of just 2, we can keep size of each top index small enough. >But in this case search operations will have to merge N indexes and so >search is almost N times slow (the fact that each top index fits in >memory >doesn't mean that all of the fits in memory at the same time, so we >still have to read pages from disk during lookups in top indexes). > Hmmm, maybe. Should be easy to verify by monitoring the size of the top index, and limiting it to some reasonable value to keep good performance. Something like gin_pending_list_size I guess. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
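For reference, the GIN analogue mentioned here is an existing pair of knobs (shown only for comparison; Lsm3 itself does not use them, and the table/column names below are made up):

-- GIN fast-update: new entries are queued in a pending list and merged into the index later
create index docs_words_idx on docs using gin (words) with (fastupdate = on);
set gin_pending_list_size = '16MB';  -- flush the pending list once it exceeds this size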
On 04.08.2020 20:44, Tomas Vondra wrote: > Unique indexes are not supported now. >> And I do not see some acceptable solution here. >> If we will have to check presence of duplicate at the time of insert >> then it will eliminate all advantages of LSM approach. >> And if we postpone to the moment of merge, then... I afraid that it >> will be too late. >> > > Ummm, but in your response to Stephen you said: > > But search locates not ANY record with specified key in top index > but record which satisfies snapshot of the transaction. Why do we > need more records if we know that there are no duplicates? > > So how do you know there are no duplicates, if unique indexes are not > supported (and may not be for LSM)? > In index AM I marked Lsm3 index as not supporting unique constraint. So it can not be used to enforce unique contraint. But it is possible to specify "unique" in index properties. In this case it is responsibility of programmer to guarantee that there are no duplicates in the index. This option allows to use this search optimization - locate first record satisfying snapshot and not touch other indexes. >>> >>> Isn't it a bit suspicious that with more clients the throughput >>> actually >>> drops significantly? Is this merely due to PoC stage, or is there some >>> inherent concurrency bottleneck? >>> >> My explaination is the following (I am not 100% sure that it is >> true): multiple clients insert records faster than merge bgworker is >> able to merge them to main index. It cause blown of top index and as >> a result it doesn't fir in memory any more. >> So we loose advantages of fast inserts. If we have N top indexes >> instead of just 2, we can keep size of each top index small enough. >> But in this case search operations will have to merge N indexes and >> so search is almost N times slow (the fact that each top index fits >> in memory >> doesn't mean that all of the fits in memory at the same time, so we >> still have to read pages from disk during lookups in top indexes). >> > > Hmmm, maybe. Should be easy to verify by monitoring the size of the top > index, and limiting it to some reasonable value to keep good > performance. Something like gin_pending_list_size I guess. > Lsm3 provides functions for getting size of active top index, explicitly force merge of top index and wait completion of merge operation. Once of use cases of Lsm3 may be delayed update of indexes. For some application insert speed is very critical: them can not loose data which is received at high rate. In this case in working hours we insert data in small index and at night initiate merge of this index with main index.
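In SQL, that delayed-merge scenario would look something like the following; the function names are purely illustrative placeholders, the real signatures being defined by the lsm3 extension:

-- hypothetical function names, for illustration only
select lsm3_top_index_size('lsm_index'::regclass);        -- how large is the active top index?
select lsm3_start_merge('lsm_index'::regclass);           -- e.g. at night: force a merge into the base index
select lsm3_wait_merge_completion('lsm_index'::regclass);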
On Tue, Aug 4, 2020 at 7:56 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > I do not understand why do we need multiple indexes. > We need one "hot" index which fits in memory to perform fast inserts. > But it should not be too small to be able to accumulate substantial > amount of records to provide efficient bulk insert. > I expect that top index can be efficiently merger with based index > because of better access locality. > I.e. we will insert multiple entries into one B-Tree page and so > minimize slowdown of random reads. > > Third index is needed to perform parallel merge (while merge is in > progress top index will be blocked and we can not perform inserts in it). > I do not understand benefits of performing more than one merge in > parallel: it will only increase fraction of random reads. > > Degradation certainly takes place. But it is not so critical as in case > of standard nbtree. > It is possible to tune threshold for top index size to make merge most > efficient. > But we can not truncate and swap index before we complete the merge. > So if merge operation takes long time, then it will cause exhaustion of > top index and it will not fit in memory any more. > It will lead to further slowdown (so we have negative feedback here). > > Certainly it is possible to create more top indexes, keeping their size > small enough to fit in memory. > But in this case search will require merge not of 3, but of N indexes. > I think that it may cause unacceptable slowdown of search operations. > And it is highly undesirable, taken in account that most of application > send more select-only queries than updates. The things you're writing makes me uneasy. I initially understood lsm3 as a quick and dirty prototype, while you're probably keeping some design in your mind (for instance, original design of LSM). However, your message makes me think you're trying to defend the approach currently implemented in lsm3 extension. Therefore, I've to criticise this approach. 1) The base index can degrade. At first, since merge can cause page splits. Therefore logical ordering of pages will become less correlated with their physical ordering with each merge. 2) If your workload will include updates and/or deletes, page utilization may also degrade. 3) While base index degrades, merge performance also degrades. Traverse of base index in logical order will require more and more random reads (at some point almost every page read will be random). While the base index becomes large and/or bloat, you push less top index tuples to a single base index page (at some point you will push one tuple per page). Original LSM design implies strict guarantees over average resources spent per index operation. Your design doesn't. Moreover, I bet lsm3 will degrade significantly even on insert-only workload. It should degrade to the performance level of B-tree once you insert enough data. Try something like number_of_merges = numer_of_tuples_per_index_page * 2 and you should see this degradation. Real LSM doesn't degrade that way. > Yes, this approach increase mount of logged data twice: > we need to write in WAL inserts in top and base indexes. > And it will certainly have negative influence on performance. > Unfortunately I have no idea how to avoid it without patching Postgres core. Huh, I didn't mean "without patching Postgres core" :) I mean it's difficult in principle, assuming PostgreSQL recovery is single-process and doesn't have access to system catalog (because it might be inconsistent). 
> Right now vacuum process Lsm3 indexes in usual way. > Removing records from top indexes may be not needed at all (just because > this indexes will be truncated in any case). > But as far as size of top index is expected to be small enough > vacuumming it should not take a long time, > so I didn't to avoid it (although it should not be difficult - just > disable ambulkdelete for correspondent nbtree wrappers). It doesn't seem important, but I don't get your point here. Postgres expects ambulkdelete to delete TIDs from index. If you don't delete it from the top index, this TID will be merged to the base index. And that could lead wrong query answers unless you eliminate those TIDs in a different way (during the merge stage or something). > Concerning deletes from main index - I do not understand how it can be > optimized. This is a trick you can learn from almost every LSM implementation. For instance, check docs for leveldb [1] about "delete marker". For sure, that requires some redesign of the vacuum and can't be done in extension (at least in the reasonable way). But, frankly speaking, I think core modifications are inevitable to utilize the power of LSM in PostgreSQL. Links 1. https://github.com/google/leveldb/blob/master/doc/impl.md ------ Regards, Alexander Korotkov
On Tue, Aug 4, 2020 at 8:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote: > So, in my point of view LSM as an index AM is far not a full power LSM > for PostgreSQL, but it's still useful. Large insert-only tables can > benefit from LSM. Large tables with many indexes could also benefit, > because non-HOT updates will become cheaper. Right -- this is why you usually have to choose one or the other. An LSM design typically subsumes not just indexing and table storage, but also checkpointing -- you cannot really compare an LSM to a B-Tree because you really have to talk about other components to make a sensible comparison (at which point you're actually comparing two totally different *storage engines*). Roughly speaking, the compaction process is the equivalent of checkpointing. So you either use (say) InnoDB or RocksDB everywhere -- you usually can't have it both ways. Well, maybe you can kind of get the benefits of both, but in practice LSMs are usually highly optimized for the things that they're good at, at the expense of other things. So in practice you kind of have to make an up-front choice. An LSM is definitely not a natural fit for the index access method interface in Postgres. One thing that I don't think anyone else made reference to on the thread (which is surprising) is that the performance of an LSM is usually not measured using any of the conventional metrics that we care about. For example, consider the Facebook MyRocks paper "Optimizing Space Amplification in RocksDB" [1]. The reported RocksDB throughput for an LSM-sympathetic workload is not really any faster than InnoDB, and sometimes slower. That's not the point, though; the main advantages of using an LSM are reductions in space amplification and write amplification, particularly the latter. This isn't so much about performance as it is about efficiency -- it enabled Facebook to get a lot more out of the inexpensive flash storage that they use. It lowered the total cost of ownership by *a lot*. I personally think that we should care about efficiency in this sense a lot more than we do now, but the fact remains that it hasn't really been considered an independent problem that could be addressed by accepting a tradeoff similar to the tradeoff LSMs make quite explicit (apparently you can tune LSMs to get less write amplification but more read amplification, or vice versa). In general, we almost always just talk about throughout and latency without considering efficiency specifically. I'm not suggesting that we need an LSM, but an appreciation of LSMs could be helpful -- it could lead to better designs elsewhere. Mark Callaghan's blog is a pretty good resource for learning about LSMs [2] (perhaps you've heard of him?). He wrote a bunch of stuff about Postgres recently, which I enjoyed. [1] http://cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf [2] https://smalldatum.blogspot.com/ -- Peter Geoghegan
On 05.08.2020 02:59, Alexander Korotkov wrote: > > The things you're writing makes me uneasy. I initially understood > lsm3 as a quick and dirty prototype, while you're probably keeping > some design in your mind (for instance, original design of LSM). > However, your message makes me think you're trying to defend the > approach currently implemented in lsm3 extension. Therefore, I've to > criticise this approach. > > 1) The base index can degrade. At first, since merge can cause page > splits. Therefore logical ordering of pages will become less > correlated with their physical ordering with each merge. > 2) If your workload will include updates and/or deletes, page > utilization may also degrade. > 3) While base index degrades, merge performance also degrades. > Traverse of base index in logical order will require more and more > random reads (at some point almost every page read will be random). > While the base index becomes large and/or bloat, you push less top > index tuples to a single base index page (at some point you will push > one tuple per page). > > Original LSM design implies strict guarantees over average resources > spent per index operation. Your design doesn't. Moreover, I bet lsm3 > will degrade significantly even on insert-only workload. It should > degrade to the performance level of B-tree once you insert enough > data. Try something like number_of_merges = > numer_of_tuples_per_index_page * 2 and you should see this > degradation. Real LSM doesn't degrade that way. I mostly agree with your critices. My Lsm3 is not true LSM, but from my point of view it preserves basic principles of LSM: fast and small top index and bulk updates of main index. My experiments with RocksDB shows that degradation also takes place in this case. More experiments are needed to compare two approaches. Concerning degrade of basic index - B-Tree itself is balanced tree. Yes, insertion of random keys can cause split of B-Tree page. In the worst case half of B-Tree page will be empty. So B-Tree size will be two times larger than ideal tree. It may cause degrade up to two times. But that is all. There should not be infinite degrade of speed tending to zero. > > Right now vacuum process Lsm3 indexes in usual way. > Removing records from top indexes may be not needed at all (just because > this indexes will be truncated in any case). > But as far as size of top index is expected to be small enough > vacuumming it should not take a long time, > so I didn't to avoid it (although it should not be difficult - just > disable ambulkdelete for correspondent nbtree wrappers). > It doesn't seem important, but I don't get your point here. Postgres > expects ambulkdelete to delete TIDs from index. If you don't delete > it from the top index, this TID will be merged to the base index. And > that could lead wrong query answers unless you eliminate those TIDs in > a different way (during the merge stage or something). Yes, your are right. It is not possible to avoid delete TIDs from top indexes. >> Concerning deletes from main index - I do not understand how it can be >> optimized. > This is a trick you can learn from almost every LSM implementation. > For instance, check docs for leveldb [1] about "delete marker". For > sure, that requires some redesign of the vacuum and can't be done in > extension (at least in the reasonable way). But, frankly speaking, I > think core modifications are inevitable to utilize the power of LSM in > PostgreSQL. 
The main idea of Lsm3 was to investigate whether it is possible to achieve the same result as with "classical" LSM using standard Postgres nbtree indexes.
Right now it seems to me that the answer is positive, but I have not performed exhaustive measurements.
For example, I have not measured vacuum overhead (it was enabled, so vacuuming takes place in my benchmark, but I have not tried to separate its overhead and influence on performance), index search speed, ...
> Links
> 1. https://github.com/google/leveldb/blob/master/doc/impl.md
>
> ------
> Regards,
> Alexander Korotkov
On 04.08.2020 20:44, Tomas Vondra wrote:
> IMO the 6x difference is rather misleading, as it very much depends on
> the duration of the benchmark and how much data it ends up with. I think
> it's better to test 'stable states', i.e. with a small data set that does
> not exceed RAM during the whole test, and large ones that already start
> larger than RAM. Not sure if it makes sense to make a difference between
> cases that fit into shared buffers and those that exceed shared buffers
> but still fit into RAM.
I have changed the benchmark scenario.
Now I insert 200 million records with sequential keys: it is fast enough and makes the index size about 19GB.
Then I perform 1 million random inserts.
-- init schema
create table t(k bigint, v1 bigint, v2 bigint, v3 bigint, v4 bigint, v5 bigint, v6 bigint, v7 bigint, v8 bigint);
create index lsm_index on t using lsm3(k) include (v1,v2,v3,v4,v5,v6,v7,v8);
create table t2(k bigint, v1 bigint, v2 bigint, v3 bigint, v4 bigint, v5 bigint, v6 bigint, v7 bigint, v8 bigint);
create index on t2(k) include (v1,v2,v3,v4,v5,v6,v7,v8);
-- fill with sequential data
insert into t values (generate_series(1,200000000),0,0,0,0,0,0,0,0);
Time: 520655,635 ms (08:40,656)
insert into t2 values (generate_series(1,200000000),0,0,0,0,0,0,0,0);
Time: 372245,093 ms (06:12,245)
-- random inserts
insert into t (v1,k,v2,v3,v4,v5,v6,v7,v8) values (generate_series(1,1000000),(random()*1000000000)::bigint,0,0,0,0,0,0,0);
Time: 3781,614 ms (00:03,782)
insert into t2 (v1,k,v2,v3,v4,v5,v6,v7,v8) values (generate_series(1,1000000),(random()*1000000000)::bigint,0,0,0,0,0,0,0);
Time: 39034,574 ms (00:39,035)
Then I perform random selects:
select.sql:
\set k random(1, 1000000000)
select * from t where k=:k;
select2.sql:
\set k random(1, 1000000000)
select * from t2 where k=:k;
pgbench -n -T 100 -P 10 -M prepared -f select.sql postgres
tps = 11372.821006 (including connections establishing)
pgbench -n -T 100 -P 10 -M prepared -f select2.sql postgres
tps = 10392.729026 (including connections establishing)
So as you can see, the insertion speed of Lsm3 is ten times higher, while the select speed is the same as with nbtree.
> On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
>
> Then I think about implementing ideas of LSM using standard Postgres
> nbtree.
>
> We need two indexes: one small for fast inserts and another - big
> (main) index. This top index is small enough to fit in memory so
> inserts in this index are very fast. Periodically we will merge data
> from top index to base index and truncate the top index. To prevent
> blocking of inserts in the table while we are merging indexes we can
> add ... on more index, which will be used during merge.
>
> So final architecture of Lsm3 is the following: two top indexes used
> in cyclic way and one main index. When top index reaches some
> threshold value we initiate merge with main index, done by bgworker
> and switch to another top index. As far as merging indexes is done in
> background, it doesn't affect insert speed. Unfortunately Postgres
> Index AM has not bulk insert operation, so we have to perform normal
> inserts. But inserted data is already sorted by key which should
> improve access locality and partly solve random reads problem for base
> index.
>
> Certainly to perform search in Lsm3 we have to make lookups in all
> three indexes and merge search results.

Thanks for sharing this! In fact this reminds me more of partitioned
b-trees [1] (and the older [2]) rather than LSM as it is (although it could
be that the former was influenced by the latter). What could be interesting
is that quite often in these and many other whitepapers (e.g. [3]), to
address the lookup overhead the design includes bloom filters in one way or
another to avoid searching non-relevant parts of an index. Tomas mentioned
them in this thread as well (in a different context); probably the design
suggested here could also benefit from them?

[1]: Riegger Christian, Vincon Tobias, Petrov Ilia. Write-optimized
indexing with partitioned b-trees. (2017). 296-300. 10.1145/3151759.3151814.
[2]: Graefe Goetz. Write-Optimized B-Trees. (2004). 672-683.
10.1016/B978-012088469-8/50060-7.
[3]: Huanchen Zhang, David G. Andersen, Andrew Pavlo, Michael Kaminsky,
Lin Ma, and Rui Shen. Reducing the Storage Overhead of Main-Memory OLTP
Databases with Hybrid Indexes. (2016). 1567–1581. 10.1145/2882903.2915222.
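The bloom-filter idea suggested here is roughly: keep a small filter per
index part (per top index, or per merged run), and consult it before
descending into that part, so most point lookups only search the parts that
can actually contain the key. A minimal sketch follows; the filter size, the
number of hashes and the hash mix are arbitrary choices for illustration and
are not taken from lsm3 or the cited papers.

/*
 * Minimal bloom filter used to decide whether an index part has to be
 * searched at all.  "false" means the key is definitely absent from the
 * part; "true" means it may be present and the part must be searched.
 */
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define FILTER_BITS (1 << 16)       /* 64 Kbit = 8 kB per filter */
#define NUM_HASHES  3

typedef struct
{
    uint8_t bits[FILTER_BITS / 8];
} BloomFilter;

/* cheap 64-bit mix, seeded differently for each of the k probes */
static uint32_t
bloom_hash(uint64_t key, uint32_t seed)
{
    uint64_t    h = key * 0x9E3779B97F4A7C15ULL + seed;

    h ^= h >> 33;
    h *= 0xFF51AFD7ED558CCDULL;
    h ^= h >> 33;
    return (uint32_t) (h % FILTER_BITS);
}

static void
bloom_add(BloomFilter *f, uint64_t key)
{
    for (uint32_t s = 0; s < NUM_HASHES; s++)
    {
        uint32_t    bit = bloom_hash(key, s);

        f->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
    }
}

static bool
bloom_may_contain(const BloomFilter *f, uint64_t key)
{
    for (uint32_t s = 0; s < NUM_HASHES; s++)
    {
        uint32_t    bit = bloom_hash(key, s);

        if ((f->bits[bit / 8] & (1 << (bit % 8))) == 0)
            return false;           /* definitely not in this index part */
    }
    return true;                    /* maybe present: still have to search */
}

int
main(void)
{
    BloomFilter top_filter;

    memset(&top_filter, 0, sizeof(top_filter));

    /* pretend these keys currently live in the top index */
    for (uint64_t k = 1000; k < 2000; k++)
        bloom_add(&top_filter, k);

    /* lookup path: consult the filter before descending into the part */
    printf("1500: %s\n", bloom_may_contain(&top_filter, 1500) ? "search top index" : "skip");
    printf("42:   %s\n", bloom_may_contain(&top_filter, 42) ? "search top index" : "skip");
    return 0;
}

A negative answer is authoritative and lets the lookup skip that index part
entirely; a positive answer only means the part still has to be searched, so
the false-positive rate determines how much of the extra lookup cost the
filter actually removes.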
Wed, 5 Aug 2020, 09:13 Konstantin Knizhnik <k.knizhnik@postgrespro.ru>:
> Concerning degradation of the base index: a B-Tree is itself a balanced
> tree. Yes, insertion of random keys can cause B-Tree page splits.
> In the worst case half of a B-Tree page will be empty, so the B-Tree will
> be about twice as large as the ideal tree.
> That may slow things down by up to a factor of two, but that is all: there
> should be no unbounded degradation of speed tending to zero.

My concerns are not just about space utilization. My main concern is about
the order of the pages. After the first merge the base index will be filled
in key order, so physical page ordering perfectly matches their logical
ordering. After the second merge some pages of the base index split, and new
pages are added to the end of the index. Splits also happen in key order. So,
now physical and logical orderings match within two extents corresponding to
the first and second merges, but not within the whole tree. While there are
only a few such extents, disk page reads may in fact be mostly sequential,
thanks to OS cache and readahead. But finally, after many merges, we can end
up with mostly random page reads. For instance, leveldb doesn't have a
problem of ordering degradation, because it stores levels in sorted files.

------
Regards,
Alexander Korotkov
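This ordering degradation can be reproduced with a toy model (my own sketch,
not a measurement of lsm3; LEAF_CAP, INIT_LEAVES, BATCHES and BATCH_SIZE are
arbitrary model parameters): leaves are bulk-loaded in key order, then
batches of random inserts cause splits, and each split allocates the new
right sibling at the physical end of the file. The fraction of logically
adjacent leaves that remain physically adjacent drops with every batch.

/*
 * Toy model of base-index ordering degradation.  Each leaf is identified
 * by its physical (allocation-order) position; lnext[] gives its logical
 * successor in key order.  Splits always append the new leaf at the end
 * of the file, so physical and logical order drift apart over time.
 */
#include <stdio.h>
#include <stdint.h>

#define LEAF_CAP    100     /* keys per leaf before it splits */
#define INIT_LEAVES 10000   /* leaves created by the initial bulk load */
#define MAX_LEAVES  2000000
#define BATCHES     20
#define BATCH_SIZE  500000  /* random inserts per simulated "merge" */

static int  nkeys[MAX_LEAVES];  /* number of keys on each leaf */
static int  lnext[MAX_LEAVES];  /* logical successor, -1 for the last leaf */
static int  nleaves;

static uint64_t rng_state = 88172645463325252ULL;

static uint64_t
xorshift64(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* fraction of logically adjacent leaf pairs that are physically adjacent */
static double
physical_adjacency(void)
{
    int     hits = 0, total = 0;

    for (int p = 0; p != -1; p = lnext[p])
    {
        if (lnext[p] != -1)
        {
            total++;
            if (lnext[p] == p + 1)
                hits++;
        }
    }
    return (double) hits / total;
}

int
main(void)
{
    /* bulk load: leaves allocated in key order, physically sequential */
    nleaves = INIT_LEAVES;
    for (int p = 0; p < nleaves; p++)
    {
        nkeys[p] = LEAF_CAP / 2;
        lnext[p] = (p + 1 < nleaves) ? p + 1 : -1;
    }
    printf("after bulk load: adjacent leaf pairs = %.1f%%\n",
           100.0 * physical_adjacency());

    for (int b = 1; b <= BATCHES; b++)
    {
        for (int i = 0; i < BATCH_SIZE; i++)
        {
            /* simplification: a random key hits a uniformly chosen leaf */
            int     p = (int) (xorshift64() % (uint64_t) nleaves);

            if (++nkeys[p] > LEAF_CAP)
            {
                /* split: new right sibling goes to the end of the file */
                int     q = nleaves++;

                nkeys[q] = nkeys[p] / 2;
                nkeys[p] -= nkeys[q];
                lnext[q] = lnext[p];
                lnext[p] = q;
            }
        }
        printf("after merge %2d:  adjacent leaf pairs = %.1f%%\n",
               b, 100.0 * physical_adjacency());
    }
    return 0;
}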
On 07.08.2020 15:31, Alexander Korotkov wrote:
> Wed, 5 Aug 2020, 09:13 Konstantin Knizhnik <k.knizhnik@postgrespro.ru>:
>> Concerning degradation of the base index: a B-Tree is itself a balanced
>> tree. Yes, insertion of random keys can cause B-Tree page splits.
>> In the worst case half of a B-Tree page will be empty, so the B-Tree will
>> be about twice as large as the ideal tree.
>> That may slow things down by up to a factor of two, but that is all:
>> there should be no unbounded degradation of speed tending to zero.
> My concerns are not just about space utilization. My main concern is
> about the order of the pages. After the first merge the base index
> will be filled in key order, so physical page ordering perfectly
> matches their logical ordering. After the second merge some pages of
> the base index split, and new pages are added to the end of the index.
> Splits also happen in key order. So, now physical and logical
> orderings match within two extents corresponding to the first and second
> merges, but not within the whole tree. While there are only a few such
> extents, disk page reads may in fact be mostly sequential, thanks to
> OS cache and readahead. But finally, after many merges, we can end up
> with mostly random page reads. For instance, leveldb doesn't have a
> problem of ordering degradation, because it stores levels in sorted
> files.

I agree with you that losing the sequential order of B-Tree pages may have a
negative impact on performance. But it is first of all critical for order-by
and range queries, where we have to traverse several subsequent leaf pages.
It is less critical for exact-search or delete/insert operations.

The efficiency of merge operations mostly depends on how many keys end up on
the same B-Tree page, and that is first of all determined by the size of the
top index and the key distribution.
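The claim that merge efficiency is governed by the top index size can be made
quantitative with a simple balls-into-bins estimate. This is a
back-of-the-envelope sketch with illustrative numbers, not a measurement: if
one merge pushes K uniformly distributed keys into a base index with N leaf
pages, the expected number of distinct leaf pages touched is
N * (1 - (1 - 1/N)^K), so the average number of keys landing on each touched
page is K divided by that. The N below is only a rough guess derived from the
19GB index and 8kB pages in the benchmark above.

/*
 * Expected merge locality for K uniform keys pushed into N base leaf
 * pages (balls into bins).  Build with -lm.
 */
#include <stdio.h>
#include <math.h>

int
main(void)
{
    const double    N = 2500000.0;  /* base leaf pages, roughly 19GB / 8kB */
    const double    ratios[] = {0.01, 0.1, 0.5, 1.0, 2.0, 10.0};

    printf("%14s %16s %16s\n", "K (top keys)", "pages touched", "keys per page");
    for (int i = 0; i < 6; i++)
    {
        double  K = ratios[i] * N;
        double  touched = N * (1.0 - pow(1.0 - 1.0 / N, K));

        printf("%14.0f %16.0f %16.2f\n", K, touched, K / touched);
    }
    return 0;
}

Under this model the top index has to accumulate a number of keys comparable
to the number of base leaf pages before a merge places noticeably more than
one key per touched page; below that the merge is essentially one random page
write per key.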
On Sat, Aug 8, 2020 at 5:07 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> I agree with you that losing the sequential order of B-Tree pages may have
> a negative impact on performance.
> But it is first of all critical for order-by and range queries, where we
> have to traverse several subsequent leaf pages.
> It is less critical for exact-search or delete/insert operations.
> The efficiency of merge operations mostly depends on how many keys end up
> on the same B-Tree page.

What do you mean by "mostly"? Given PostgreSQL has quite small (8k) pages,
sequential read is several times faster than random read on SSDs (dozens of
times on HDDs). I don't think this is something to neglect.

> and that is first of all determined by the size of the top index and the
> key distribution.

How can you be sure that the top index can fit memory? On production
systems, typically there are multiple consumers of memory: other tables,
indexes, other LSMs. This is one of the reasons why LSM implementations have
multiple levels: they don't know in advance which levels fit memory. Another
reason is dealing with very large datasets. And I believe there is a quite
strong reason to keep the page order sequential within a level.

I'm OK with your design for a third-party extension. It's very cool to have.
But I'm -1 for something like this to get into core PostgreSQL, assuming
it's feasible to push some effort and get a state-of-the-art LSM there.

------
Regards,
Alexander Korotkov
On 08.08.2020 21:18, Alexander Korotkov wrote:
> On Sat, Aug 8, 2020 at 5:07 PM Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> I agree with you that losing the sequential order of B-Tree pages may
>> have a negative impact on performance.
>> But it is first of all critical for order-by and range queries, where we
>> have to traverse several subsequent leaf pages.
>> It is less critical for exact-search or delete/insert operations.
>> The efficiency of merge operations mostly depends on how many keys end
>> up on the same B-Tree page.
> What do you mean by "mostly"? Given PostgreSQL has quite small (8k)
> pages, sequential read is several times faster than random read on SSDs
> (dozens of times on HDDs). I don't think this is something to
> neglect.

When you insert one record in a B-Tree, the order of pages doesn't matter at
all. If you insert ten records into one leaf page, the order is also not so
important. If you insert 100 records, 50 going to one page and 50 to the
next page, then insertion may be faster if the second page follows the first
one on disk. But such insertion may cause a page split and so allocation of
a new page, so sequential write order can still be violated.

>> and that is first of all determined by the size of the top index and
>> the key distribution.
> How can you be sure that the top index can fit memory? On production
> systems, typically there are multiple consumers of memory: other
> tables, indexes, other LSMs. This is one of the reasons why LSM
> implementations have multiple levels: they don't know in advance which
> levels fit memory. Another reason is dealing with very large
> datasets. And I believe there is a quite strong reason to keep the page
> order sequential within a level.

There is no guarantee that the top index is kept in memory. But since the
top index pages are frequently accessed, I hope that the buffer manager's
cache replacement algorithm does its best to keep them in memory.

> I'm OK with your design for a third-party extension. It's very cool
> to have. But I'm -1 for something like this to get into core
> PostgreSQL, assuming it's feasible to push some effort and get a
> state-of-the-art LSM there.

I realize that it is not a true LSM. But still I want to note that it is
able to provide a ~10 times increase in insert speed when the size of the
index is comparable with the RAM size. And the "true LSM" from RocksDB shows
similar results. Maybe if the size of the index were 100 times larger than
the size of RAM, RocksDB would be significantly faster than Lsm3. But modern
servers have 0.5-1TB of RAM. I can't believe that there are databases with
100TB indexes.
On Sat, Aug 8, 2020 at 11:49 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> On 08.08.2020 21:18, Alexander Korotkov wrote:
> > On Sat, Aug 8, 2020 at 5:07 PM Konstantin Knizhnik
> > <k.knizhnik@postgrespro.ru> wrote:
> >> I agree with you that losing the sequential order of B-Tree pages may
> >> have a negative impact on performance.
> >> But it is first of all critical for order-by and range queries, where
> >> we have to traverse several subsequent leaf pages.
> >> It is less critical for exact-search or delete/insert operations.
> >> The efficiency of merge operations mostly depends on how many keys end
> >> up on the same B-Tree page.
> > What do you mean by "mostly"? Given PostgreSQL has quite small (8k)
> > pages, sequential read is several times faster than random read on SSDs
> > (dozens of times on HDDs). I don't think this is something to
> > neglect.
>
> When you insert one record in a B-Tree, the order of pages doesn't matter
> at all.
> If you insert ten records into one leaf page, the order is also not so
> important.
> If you insert 100 records, 50 going to one page and 50 to the next page,
> then insertion may be faster if the second page follows the first one on
> disk.
> But such insertion may cause a page split and so allocation of a new page,
> so sequential write order can still be violated.

Sorry, I've no idea what you're getting at.

> >> and that is first of all determined by the size of the top index and
> >> the key distribution.
> > How can you be sure that the top index can fit memory? On production
> > systems, typically there are multiple consumers of memory: other
> > tables, indexes, other LSMs. This is one of the reasons why LSM
> > implementations have multiple levels: they don't know in advance which
> > levels fit memory. Another reason is dealing with very large
> > datasets. And I believe there is a quite strong reason to keep the page
> > order sequential within a level.
>
> There is no guarantee that the top index is kept in memory.
> But since the top index pages are frequently accessed, I hope that the
> buffer manager's cache replacement algorithm does its best to keep them
> in memory.

So, the top index should be small enough that we can safely assume it
wouldn't be evicted from cache on a heavily loaded production system. I
think it's evident that it should be orders of magnitude smaller than the
total amount of server RAM.

> > I'm OK with your design for a third-party extension. It's very cool
> > to have. But I'm -1 for something like this to get into core
> > PostgreSQL, assuming it's feasible to push some effort and get a
> > state-of-the-art LSM there.
> I realize that it is not a true LSM.
> But still I want to note that it is able to provide a ~10 times increase
> in insert speed when the size of the index is comparable with the RAM
> size.
> And the "true LSM" from RocksDB shows similar results.

It's very far from being shown. All the things you've shown is a naive
benchmark. I don't object that your design can work out some cases. And
it's great that we have the lsm3 extension now. But I think for PostgreSQL
core we should think about a better design.

> Maybe if the size of the index were 100 times larger than the size of
> RAM, RocksDB would be significantly faster than Lsm3. But modern servers
> have 0.5-1TB of RAM.
> I can't believe that there are databases with 100TB indexes.

Comparison of the whole RAM size to a single index size looks plain wrong to
me. I think we can roughly compare the whole RAM size to the whole database
size. But also not the whole RAM size is always available for caching data.
Let's assume half of RAM is used for caching data. So, a modern server with
0.5-1TB of RAM, which suffers from random B-tree insertions and badly needs
an LSM-like data structure, runs a database of 25-50TB. Frankly speaking,
there is nothing counterintuitive for me.

------
Regards,
Alexander Korotkov
On 09.08.2020 04:53, Alexander Korotkov wrote:
>> I realize that it is not a true LSM.
>> But still I want to note that it is able to provide a ~10 times increase
>> in insert speed when the size of the index is comparable with the RAM
>> size.
>> And the "true LSM" from RocksDB shows similar results.
> It's very far from being shown. All the things you've shown is a
> naive benchmark. I don't object that your design can work out some
> cases. And it's great that we have the lsm3 extension now. But I
> think for PostgreSQL core we should think about a better design.

Sorry, I meant that on this particular benchmark and hardware, Lsm3 and
RocksDB show similar performance. It definitely doesn't mean that this will
be true in all other cases. This is one of the reasons why I have published
the Lsm3 and RocksDB FDW extensions: anybody can try to test them on their
workload. It will be very interesting to me to learn those results, because
I certainly understand that measuring random insert performance on a dummy
table is not enough to draw conclusions.

And I certainly do not want to say that we do not need a "right" LSM
implementation inside the Postgres core. It just requires an order of
magnitude more effort, and there are many questions and challenges. For
example, the Postgres buffer size (8kB) seems to be too small for LSM.
Should an LSM implementation bypass the Postgres buffer cache? There are
pros and cons...

Another issue is logging. Should we just log all operations on the LSM in
WAL in the usual way (as is done for nbtree and Lsm3)? It seems to me that
alternative and more efficient solutions may be proposed for LSM. For
example, we may not log inserts into the top index at all and just replay
them during recovery, assuming that this operation on a small index is fast
enough. And the merge of the top index with the base index can be done
atomically and so also doesn't require WAL.

As far as I know, Anastasia Lubennikova implemented an LSM for Postgres
several years ago. There were some performance issues (with concurrent
access?). This is why the first thing I want to clarify for myself is what
the bottlenecks of the LSM architecture are and whether they are caused by
LSM itself or by its integration into the Postgres infrastructure.

In any case, before thinking about the details of an in-core LSM
implementation for Postgres, I think it is necessary to demonstrate
workloads at which RocksDB (or any other existing DBMS with LSM) shows
significant performance advantages compared with Postgres with nbtree/Lsm3.

>> Maybe if the size of the index were 100 times larger than the size of
>> RAM, RocksDB would be significantly faster than Lsm3. But modern servers
>> have 0.5-1TB of RAM.
>> I can't believe that there are databases with 100TB indexes.
> Comparison of the whole RAM size to a single index size looks plain wrong
> to me. I think we can roughly compare the whole RAM size to the whole
> database size. But also not the whole RAM size is always available
> for caching data. Let's assume half of RAM is used for caching data.
> So, a modern server with 0.5-1TB of RAM, which suffers from random
> B-tree insertions and badly needs an LSM-like data structure, runs a
> database of 25-50TB. Frankly speaking, there is nothing
> counterintuitive for me.

There is actually nothing counterintuitive. I just mean that there are not
so many 25-50TB OLTP databases.
Dmitry Dolgov wrote:
>> On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
>>
>> Then I think about implementing ideas of LSM using standard Postgres
>> nbtree.
>>
>> We need two indexes: one small for fast inserts and another - big
>> (main) index. This top index is small enough to fit in memory so
>> inserts in this index are very fast. Periodically we will merge data
>> from top index to base index and truncate the top index. To prevent
>> blocking of inserts in the table while we are merging indexes we can
>> add ... on more index, which will be used during merge.
>>
>> So final architecture of Lsm3 is the following: two top indexes used
>> in cyclic way and one main index. When top index reaches some
>> threshold value we initiate merge with main index, done by bgworker
>> and switch to another top index. As far as merging indexes is done in
>> background, it doesn't affect insert speed. Unfortunately Postgres
>> Index AM has not bulk insert operation, so we have to perform normal
>> inserts. But inserted data is already sorted by key which should
>> improve access locality and partly solve random reads problem for base
>> index.
>>
>> Certainly to perform search in Lsm3 we have to make lookups in all
>> three indexes and merge search results.
>
> Thanks for sharing this! In fact this reminds me more of partitioned
> b-trees [1] (and the older [2]) rather than LSM as it is (although it
> could be that the former was influenced by the latter). What could be
> interesting is that quite often in these and many other whitepapers
> (e.g. [3]), to address the lookup overhead the design includes bloom
> filters in one way or another to avoid searching non-relevant parts of
> an index. Tomas mentioned them in this thread as well (in a different
> context); probably the design suggested here could also benefit from
> them?
>
> [1]: Riegger Christian, Vincon Tobias, Petrov Ilia. Write-optimized
> indexing with partitioned b-trees. (2017). 296-300.
> 10.1145/3151759.3151814.
> [2]: Graefe Goetz. Write-Optimized B-Trees. (2004). 672-683.
> 10.1016/B978-012088469-8/50060-7.
> [3]: Huanchen Zhang, David G. Andersen, Andrew Pavlo, Michael Kaminsky,
> Lin Ma, and Rui Shen. Reducing the Storage Overhead of Main-Memory OLTP
> Databases with Hybrid Indexes. (2016). 1567–1581. 10.1145/2882903.2915222.

I found this 2019 paper recently; it might be worth a skim read for some
different ideas (too technical for me :)):
"Jungle: Towards Dynamically Adjustable Key-Value Store by Combining
LSM-Tree and Copy-On-Write B+-Tree"
https://www.usenix.org/system/files/hotstorage19-paper-ahn.pdf