Re: Berkeley DB... - Mailing list pgsql-hackers
| From | Michael A. Olson |
|---|---|
| Subject | Re: Berkeley DB... |
| Date | |
| Msg-id | 200005211839.LAA00177@triplerock.olsons.net |
| In response to | Berkeley DB... ("Mikheev, Vadim" <vmikheev@SECTORBASE.COM>) |
| Responses | Re: Berkeley DB... |
| List | pgsql-hackers |
At 06:43 PM 5/20/00 -0700, Vadim Mikheev wrote:

> 1. Using RECNO db for heap.
> For relational DB over-writing smgr means ability to re-use space after
> DELETE/UPDATE operations (without vacuum -:)). RECNO (btree by nature,
> with record number as key) will not give us this ability. To insert record
> into RECNO db one has either to provide "put" method with record number
> (where to store new record) or specify DB_APPEND in flags, to add new record
> to the end of db (without space re-using). So, the problem (one of two base
> problems of over-writing smgr for us) "where to store new tuple" (ie - where
> in data file there is free space for new tuple) is not resolved.
> => we can't use SDB smgr: there are no required features - space re-using
> and MVCC support.

All of the Berkeley DB access methods reuse space. We return free space to a pool and allocate from the pool in the ordinary course of operation. We have no notion of vacuum.

Empty pages get appended to a free list, and will be reused on the next page allocation. Empty space on pages (from deleted tuples) where the rest of the page isn't empty will get reused the next time the page is visited. So you do get space reuse. We don't return blocks to the file system automatically (that requires reorganization, which is hard). "Appending" means appending in key space; that may or may not be physically at the end of the file. We do, however, do reverse splits of underfull nodes, so we're aggressive about getting empty pages back onto the free list.

In short, I think the space-reuse story of Berkeley DB is better than the current space-reuse story in PostgreSQL, even for heaps, because the current heap AM doesn't opportunistically coalesce pages to make free pages available for reuse by new inserts.

We don't have multi-version concurrency control. It's a feature we'd like to see added, but it certainly represents a substantial new engineering effort. As I've said before, we'd be glad to support you in that project if you decide to undertake it.

> 2. SDB' btree-s support only one key, but we have multi-key btree-s...

This is a misunderstanding. Berkeley DB allows you to use arbitrary data structures as keys. You define your own comparison function, which understands your key structure and is capable of doing comparisons between keys. It's precisely equivalent to the support you've got in PostgreSQL now, since your comparator has to understand the key schema (including the presence or absence of nulls). You'd define your own comparator and your own key type. You'd hand (key, value) pairs to Berkeley DB, and we'd call your comparator to compare keys during tree descent. The key you hand us is an arbitrarily complex data structure, but we don't care. (A sketch of such a comparator appears below.)

You get another benefit from Berkeley DB: we eliminate the 8K limit on tuple size. For large records, we break them into page-sized chunks for you, and we reassemble them on demand. Neither PostgreSQL nor the user needs to worry about this; it's a service that just works. A single record or a single key may be up to 4GB in size.

> 3. How can we implement gist, rtree AND (multi-key) BTREE access methods
> using btree and hash access methods provided by SDB?!

You'd build gist and rtree on top of the current buffer manager, much as rtree is currently implemented on top of the lower-level page manager in PostgreSQL. Multi-key btree support is there already, as is multi-key extended linear hashing. In exchange for having to build a new rtree AM, you'd get high-performance persistent queues for free (also sketched below).
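To make the multi-key point concrete, here is a minimal sketch of what such a comparator might look like, assuming the Berkeley DB 3.x-era C API (later releases add a transaction argument to open()). The two-column key layout, the names two_col_compare and multikey.db, and the sample values are all hypothetical illustrations; a real PostgreSQL comparator would also have to handle nulls and variable-length columns.

```c
/* Sketch: a multi-column btree key under a custom comparator. */
#include <string.h>
#include <db.h>

/* Hypothetical two-column key: an integer followed by short text. */
struct two_col_key {
    int  col1;              /* first key column                  */
    char col2[32];          /* second key column, NUL-padded     */
};

/* Berkeley DB calls this during tree descent; only this function
 * understands the key structure -- Berkeley DB itself does not. */
static int
two_col_compare(DB *db, const DBT *a, const DBT *b)
{
    const struct two_col_key *ka = a->data;
    const struct two_col_key *kb = b->data;

    if (ka->col1 != kb->col1)
        return ka->col1 < kb->col1 ? -1 : 1;
    return strcmp(ka->col2, kb->col2);
}

int
main(void)
{
    DB *db;
    DBT key, data;
    struct two_col_key k = { 42, "abc" };
    char val[] = "tuple body";

    db_create(&db, NULL, 0);
    db->set_bt_compare(db, two_col_compare);  /* install before open() */
    db->open(db, "multikey.db", NULL, DB_BTREE, DB_CREATE, 0644);

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = &k;    key.size = sizeof(k);   /* opaque key bytes */
    data.data = val;  data.size = sizeof(val);

    db->put(db, NULL, &key, &data, 0);        /* store the (key, value) pair */
    db->close(db, 0);
    return 0;
}
```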
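The persistent queues mentioned above are the DB_QUEUE access method. A minimal sketch under the same API assumption: queue records are fixed length (set with set_re_len() before open()), DB_APPEND enqueues at the tail, and DB_CONSUME atomically dequeues the head. The file name queue.db and the record contents are, again, hypothetical.

```c
/* Sketch: a persistent queue with DB_QUEUE. */
#include <string.h>
#include <db.h>

int
main(void)
{
    DB *q;
    DBT key, data;
    db_recno_t recno;
    char msg[64] = "work item";

    db_create(&q, NULL, 0);
    q->set_re_len(q, sizeof(msg));            /* queue records are fixed length */
    q->open(q, "queue.db", NULL, DB_QUEUE, DB_CREATE, 0644);

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = &recno;                        /* receives the allocated record number */
    key.ulen = sizeof(recno);
    key.flags = DB_DBT_USERMEM;
    data.data = msg;
    data.size = sizeof(msg);

    q->put(q, NULL, &key, &data, DB_APPEND);  /* enqueue at the tail */
    q->get(q, NULL, &key, &data, DB_CONSUME); /* atomically take the head */

    q->close(q, 0);
    return 0;
}
```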
I'd argue that queues are more generally useful than rtrees. I understand that you have users who need rtrees. I wrote that access method in Postgres, and used it extensively for geospatial indexing during the Sequoia 2000 project. I'm a big fan. Nevertheless, there are more database customers looking for fast queues than are looking for spatial indices.

> 1,2,3 => we have to preserve our access methods (and ability to add new!).

Again, you can add new access methods in Berkeley DB in the same way that you do for PostgreSQL now.

> Now, about WAL. What is WAL? WAL *mostly* is set of functions to
> write/read log (90% implemented) + *access method specific* redo/undo
> functions... to be implemented anyway, because of conclusion above.

You wouldn't need to rewrite the current access-method undo and redo functions in Berkeley DB; they're there, and they work. You'd need to do that work for the new access methods you want to define, but as you note, that work is required whether you roll your own or use Berkeley DB.

I encourage you to think hard about the amount of work that's really required to produce a commercial-grade recovery and transaction system. This stuff is extremely hard to get right: you need to design, code, and test for very high-concurrency, complex workloads. The log is a new source of contention, and will be a gate on performance. The log is also a new way to consume space endlessly, so you'll want to think about backup and checkpoint support. With Berkeley DB, you get both today. Our backup support permits you to do on-line backups; backups don't acquire locks and don't force a shutdown.

Testing this stuff is tricky. For example, you need to prove that you're able to survive a crash that interrupts the three internal page writes that the btree access method does on a page split. Postgres (when I wrote the Btree AM) carefully ordered those writes to guarantee no loss of data, but it was possible to crash with the children written and the parent lost. The result is an inefficiency in the tree structure that you'll never recover, but that you can deal with at read time. This is an example of a case that Berkeley DB gets right.

The advantage of Berkeley DB is that we've got five years of commercial deployment behind us. The code is in production use, and is known to support terabyte-sized databases, hundreds of concurrent threads with arbitrary read/write mixes, and system and application crashes. By putting the log and the data on separate spindles, we are able to survive the loss of either device without losing committed data. Big companies have made significant bets on the software by deploying it in mission-critical applications. It works.

Plus, we're continuing to work on the code, and we're paid real money to do that. We're able to deliver significant new features and performance improvements about three times a year.

All of that said, I'd boil Vadim's message down to this:

+ With Berkeley DB, you'd need to reimplement multi-version concurrency control, and that's an opportunity to introduce new bugs.

+ With PostgreSQL, you'll need to implement logging and recovery, and that's an opportunity to introduce new bugs.

I don't think that either alternative presents insurmountable difficulties. Which you choose depends on the technical issues, and on your willingness to integrate code from outside the project into PostgreSQL's internals to a degree that you've never done before.

Regards,

mike