Re: Berkeley DB... - Mailing list pgsql-hackers

From Michael A. Olson
Subject Re: Berkeley DB...
Date
Msg-id 200005211839.LAA00177@triplerock.olsons.net
Whole thread Raw
In response to Berkeley DB...  ("Mikheev, Vadim" <vmikheev@SECTORBASE.COM>)
Responses Re: Berkeley DB...
List pgsql-hackers
At 06:43 PM 5/20/00 -0700, Vadim Mikheev wrote:

> 1. Using RECNO db for heap.
> For relational DB over-writing smgr means ability to re-use space after
> DELETE/UPDATE operations (without vacuum -:)). RECNO (btree by nature,
> with record number as key) will not give us this ability. To insert record
> into RECNO db one has either to provide "put" method with record number
> (where to store new record) or specify DB_APPEND in flags, to add new record
> to the end of db (without space re-using). So, the problem (one of two base
> problems of over-writing smgr for us) "where to store new tuple" (ie - where
> in data file there is free space for new tuple) is not resolved.
> => we can't use SDB smgr: there are no required features - space re-using
> and MVCC support.

All of the Berkeley DB access methods reuse space.  We return free space
to a pool and allocate from the pool in the ordinary course of operation.
We have no notion of vacuum.

Empty pages get appended to a free list, and will be reused on next page
allocation.  Empty space on pages (from deleted tuples) where the rest
of the page isn't empty will get reused the next time the page is
visited.  So you do get space reuse.  We don't return blocks to the
file system automatically (requires reorg, which is hard).  "Appending"
means appending in key space; that may or may not be physically at the
end of the file.

We do, however, do reverse splits of underfull nodes, so we're aggressive
at getting empty pages back on the free list.

In short, I think the space reuse story of Berkeley DB is better than
the current space reuse story in PostgreSQL, even for heaps.  This is
because the current heap AM doesn't opportunistically coalesce pages
to make free pages available for reuse by new inserts.

We don't have multi-version concurrency control.  It's a feature we'd like
to see added, but it certainly represents a substantial new engineering
effort.  As I've said before, we'd be glad to support you in that project
if you decide to undertake it.

> 2. SDB' btree-s support only one key, but we have multi-key btree-s...

This is a misunderstanding.  Berkeley DB allows you to use arbitrary

data structures as keys.  You define your own comparison function, which
understands your key structure and is capable of doing comparisons between
keys.  It's precisely equivalent to the support you've got in PostgreSQL
now, since your comparator has to understand key schema (including the
presence or absence of nulls).

You'd define your own comparator and your own key type.  You'd hand
(key, value) pairs to Berkeley DB, and we'd call your comparator to
compare keys during tree descent.  The key you hand us is an arbitrarily
complex data structure, but we don't care.

You get another benefit from Berkeley DB -- we eliminate the 8K limit
on tuple size.  For large records, we break them into page-sized
chunks for you, and we reassemble them on demand.  Neither PostgreSQL
nor the user needs to worry about this, it's a service that just works.

A single record or a single key may be up to 4GB in size.

> 3. How can we implement gist, rtree AND (multi-key) BTREE access methods
> using btree and hash access methods provided by SDB?!

You'd build gist and rtree on top of the current buffer manager, much
as rtree is currently implemented on top of the lower-level page manager
in PostgreSQL.  Multi-key btree support is there already, as is multi-
key extended linear hashing.  In exchange for having to build a new
rtree AM, you'd get high-performance persistent queues for free.

I'd argue that queues are more generally useful than rtrees.  I understand
that you have users who need rtrees.  I wrote that access method in
Postgres, and used it extensively for geospatial indexing during the
Sequoia 2000 project.  I'm a big fan.  Nevertheless, there are more
database customers looking for fast queues than are looking for spatial
indices.

> 1,2,3 => we have to preserve our access methods (and ability to add new!).

Again, you can add new access methods in Berkeley DB in the same way
that you do for PostgreSQL now.


> Now, about WAL. What is WAL? WAL *mostly* is set of functions to 
> write/read log (90% implemented) + *access method specific* redo/undo
> functions... to be implemented anyway, because of conclusion above.

You wouldn't need to rewrite the current access-method undo and redo
functions in Berkeley DB; they're there, and they work.  You'd need to
do that work for the new access methods you want to define, but as you
note, that work is required whether you roll your own or use Berkeley
DB.

I encourage you to think hard about the amount of work that's really
required to produce a commercial-grade recovery and transaction system.
This stuff is extremely hard to get right -- you need to design, code,
and test for very high-concurrency, complex workloads.  The log is a
new source of contention, and will be a gate to performance.  The log
is also a new way to consume space endlessly, so you'll want to think
about backup and checkpoint support.  With Berkeley DB, you get both
today.  Our backup support permits you to do on-line backups.  Backups
don't acquire locks and don't force a shutdown.

Testing this stuff is tricky.  For example, you need to prove that you're
able to survive a crash that interrupts the three internal page writes
that you do in the btree access method on a page split.  Postgres (when
I wrote the Btree AM) carefully ordered those writes to guarantee no
loss of data, but it was possible to crash with the children written and
the parent lost.  The result is an inefficiency in the tree structure
that you'll never recover, but that you can deal with at read time.  This
is an example of a case that Berkeley DB gets right.

The advantage of Berkeley DB is that we've got five years of commercial
deployment behind us.  The code is in production use, and is known to
support terabyte-sized databases, hundreds of concurrent threads with
arbitrary read/write mixes, and system and application crashes.  By
putting the log and the data on separate spindles, we are able to survive
loss of either device without losing committed data.  Big companies have
made significant bets on the software by deploying it in mission-critical
applications.  It works.

Plus, we're continuing to work on the code, and we're paid real money to
do that.  We're able to deliver significant new features and performance
improvements about three times a year.

All of that said, I'd boil Vadim's message down to this:
+  With Berkeley DB, you'd need to reimplement multi-version   concurrency control, and that's an opportunity to
introduce  new bugs.
 
+  With PostgreSQL, you'll need to implement logging and recovery,   and that's an opportunity to introduce new bugs.

I don't think that either alternative presents insurmountable difficulties.
Which you choose depends on the technical issues and on your willingness
to integrate code from outside the project into PostreSQL's internals, to
a degree that you've never done before.

Regards,                mike



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Re: MySQL crashme test and PostgreSQL
Next
From: Tom Lane
Date:
Subject: Re: Performance (was: The New Slashdot Setup (includes MySql server))