RE: Berkeley DB... - Mailing list pgsql-hackers
From | Mikheev, Vadim |
---|---|
Subject | RE: Berkeley DB... |
Date | |
Msg-id | 8F4C99C66D04D4118F580090272A7A23018BF6@SECTORBASE1 Whole thread Raw |
In response to | Berkeley DB... ("Mikheev, Vadim" <vmikheev@SECTORBASE.COM>) |
Responses |
Re: Berkeley DB...
|
List | pgsql-hackers |
> Empty pages get appended to a free list, and will be reused > on next page allocation. Empty space on pages (from deleted > tuples) where the rest of the page isn't empty will get reused > the next time the page is visited. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If visited for update... but how can I know with what RECNO I have to insert new tuple to store it on half-empty page? > We do, however, do reverse splits of underfull nodes, so > we're aggressive at getting empty pages back on the free list. You can't merge two 49% empty pages in one. So, how to reuse this 49%? How will we able to implement feature that good databases have: one can specify while table creation - "insert new tuples on pages which N% empty"? More of that, you can't just re-use empty space (even if you know where in the tree there is space for new tuple) - you have to assign *right* recno to new tuple. What if 2k tuple on a page was updated to 0.2k and no tuple was deleted on this page? You can't re-use this empty-space without tree reorg, what 1. requires additional writes; 2. is not always possible at all. ...Oh, we can work it out, by using somehing like current TID (blkno + itemid) as index key - this will require some changes in SDB btree code (new RECNO-like AM)... so, this was just calculation of work required -:) And, while we are on heap subject - using index (RECNO) for heap means that all our secondary-index scans will performe TWO index scans - first, to find recno in secondary-index, and second, to find heap tuple using recno (now indices give us TID, which is physical address). > > 2. SDB' btree-s support only one key, but we have multi-key > btree-s... > > This is a misunderstanding. Berkeley DB allows you to use > arbitrary data structures as keys. You define your own comparison > function, which understands your key structure and is capable of > doing comparisons between keys. Oh, you're right. Though, I'm unhappy with DB_SET_RANGE The DB_SET_RANGE flag is identical to the DB_SET flag, except that the key is returned as well as thedata item, and, in the case of the Btree access method, the returned key/data pair is the smallest key greaterthan or equal to the specified key ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ we would need differentiate > and >= : it's not good for key > 2 to read from disk all items with key == 2. But, this could be done. (as determined by the comparison function), permitting partial key matches and range searches. > You get another benefit from Berkeley DB -- we eliminate the 8K limit > on tuple size. For large records, we break them into page-sized > chunks for you, and we reassemble them on demand. Neither PostgreSQL > nor the user needs to worry about this, it's a service that > just works. But what I've seen in code - you *always* copy data to memory, what is not good for both seq scans (when we have to evaluate query qual) and index ones (with range quals). See also comments from Chris Bitmead about the case when some of data are not in select target list... Probably, we could change DB->(c_)get methods and provide them with two functions (to check qual and to determine what exactly should be returned)... > A single record or a single key may be up to 4GB in size. > > > 3. How can we implement gist, rtree AND (multi-key) BTREE > > access methods using btree and hash access methods provided by SDB?! > > You'd build gist and rtree on top of the current buffer manager, much > as rtree is currently implemented on top of the lower-level page manager > in PostgreSQL. Oh, so we have to implement redo/undo for them. Ok. WAL: > I encourage you to think hard about the amount of work that's really > required to produce a commercial-grade recovery and transaction system. > This stuff is extremely hard to get right -- you need to design, code, > and test for very high-concurrency, complex workloads. The log is a > new source of contention, and will be a gate to performance. The log > is also a new way to consume space endlessly, so you'll want to think > about backup and checkpoint support. With Berkeley DB, you get both > today. Our backup support permits you to do on-line backups. Backups > don't acquire locks and don't force a shutdown. As for design and coding, 90% is already done (though, with SDB we could avoid heap/btree/hash redo/undo implementation). As for stability/testing - as I already said, - after rewriting ~50% of system to use SDB, nothing in PostgreSQL will be "well tested". > Testing this stuff is tricky. For example, you need to prove > that you're able to survive a crash that interrupts the three > internal page writes that you do in the btree access method on > a page split. Oh, testing of this case is very easy - I'll just stop backend using gdb in critical points and will turn power off -:)) I've run 2-3 backends under gdb to catch some concurrency-related bug in buffer manager - this technique works very well -:) > All of that said, I'd boil Vadim's message down to this: > > + With Berkeley DB, you'd need to reimplement multi-version > concurrency control, and that's an opportunity to introduce > new bugs. Unfortunately, this must be done *before* we could migrate to SDB -:( So, some of us will have to stop PG development and switch to SDB code... which is hard for soul -:) > + With PostgreSQL, you'll need to implement logging > and recovery, and that's an opportunity to introduce new bugs. But... we know where in our code set up breakpoints under gdb -:)) Oh, well.. let's continue to think... Regards,Vadim
pgsql-hackers by date: