Thread: Berkeley DB...

Berkeley DB...

From
"Mikheev, Vadim"
Date:
Well, I've read SDB code/doc for a few hours...

1. Using RECNO db for heap.
For relational DB over-writing smgr means ability to re-use space after
DELETE/UPDATE operations (without vacuum -:)). RECNO (btree by nature,
with record number as key) will not give us this ability. To insert record
into RECNO db one has either to provide "put" method with record number
(where to store new record) or specify DB_APPEND in flags, to add new record
to the end of db (without space re-using). So, the problem (one of two base
problems of over-writing smgr for us) "where to store new tuple" (ie - where
in data file there is free space for new tuple) is not resolved.
=> we can't use SDB smgr: there are no required features - space re-using
and MVCC support.

2. SDB' btree-s support only one key, but we have multi-key btree-s...

3. How can we implement gist, rtree AND (multi-key) BTREE access methods
using btree and hash access methods provided by SDB?!

1,2,3 => we have to preserve our access methods (and ability to add new!).

Now, about WAL. What is WAL? WAL *mostly* is set of functions to 
write/read log (90% implemented) + *access method specific* redo/undo
functions... to be implemented anyway, because of conclusion above.

Comments?

Vadim


Re: Berkeley DB...

From
"Michael A. Olson"
Date:
At 06:43 PM 5/20/00 -0700, Vadim Mikheev wrote:

> 1. Using RECNO db for heap.
> For relational DB over-writing smgr means ability to re-use space after
> DELETE/UPDATE operations (without vacuum -:)). RECNO (btree by nature,
> with record number as key) will not give us this ability. To insert record
> into RECNO db one has either to provide "put" method with record number
> (where to store new record) or specify DB_APPEND in flags, to add new record
> to the end of db (without space re-using). So, the problem (one of two base
> problems of over-writing smgr for us) "where to store new tuple" (ie - where
> in data file there is free space for new tuple) is not resolved.
> => we can't use SDB smgr: there are no required features - space re-using
> and MVCC support.

All of the Berkeley DB access methods reuse space.  We return free space
to a pool and allocate from the pool in the ordinary course of operation.
We have no notion of vacuum.

Empty pages get appended to a free list, and will be reused on next page
allocation.  Empty space on pages (from deleted tuples) where the rest
of the page isn't empty will get reused the next time the page is
visited.  So you do get space reuse.  We don't return blocks to the
file system automatically (requires reorg, which is hard).  "Appending"
means appending in key space; that may or may not be physically at the
end of the file.

We do, however, do reverse splits of underfull nodes, so we're aggressive
at getting empty pages back on the free list.

In short, I think the space reuse story of Berkeley DB is better than
the current space reuse story in PostgreSQL, even for heaps.  This is
because the current heap AM doesn't opportunistically coalesce pages
to make free pages available for reuse by new inserts.

We don't have multi-version concurrency control.  It's a feature we'd like
to see added, but it certainly represents a substantial new engineering
effort.  As I've said before, we'd be glad to support you in that project
if you decide to undertake it.

> 2. SDB' btree-s support only one key, but we have multi-key btree-s...

This is a misunderstanding.  Berkeley DB allows you to use arbitrary

data structures as keys.  You define your own comparison function, which
understands your key structure and is capable of doing comparisons between
keys.  It's precisely equivalent to the support you've got in PostgreSQL
now, since your comparator has to understand key schema (including the
presence or absence of nulls).

You'd define your own comparator and your own key type.  You'd hand
(key, value) pairs to Berkeley DB, and we'd call your comparator to
compare keys during tree descent.  The key you hand us is an arbitrarily
complex data structure, but we don't care.

You get another benefit from Berkeley DB -- we eliminate the 8K limit
on tuple size.  For large records, we break them into page-sized
chunks for you, and we reassemble them on demand.  Neither PostgreSQL
nor the user needs to worry about this, it's a service that just works.

A single record or a single key may be up to 4GB in size.

> 3. How can we implement gist, rtree AND (multi-key) BTREE access methods
> using btree and hash access methods provided by SDB?!

You'd build gist and rtree on top of the current buffer manager, much
as rtree is currently implemented on top of the lower-level page manager
in PostgreSQL.  Multi-key btree support is there already, as is multi-
key extended linear hashing.  In exchange for having to build a new
rtree AM, you'd get high-performance persistent queues for free.

I'd argue that queues are more generally useful than rtrees.  I understand
that you have users who need rtrees.  I wrote that access method in
Postgres, and used it extensively for geospatial indexing during the
Sequoia 2000 project.  I'm a big fan.  Nevertheless, there are more
database customers looking for fast queues than are looking for spatial
indices.

> 1,2,3 => we have to preserve our access methods (and ability to add new!).

Again, you can add new access methods in Berkeley DB in the same way
that you do for PostgreSQL now.


> Now, about WAL. What is WAL? WAL *mostly* is set of functions to 
> write/read log (90% implemented) + *access method specific* redo/undo
> functions... to be implemented anyway, because of conclusion above.

You wouldn't need to rewrite the current access-method undo and redo
functions in Berkeley DB; they're there, and they work.  You'd need to
do that work for the new access methods you want to define, but as you
note, that work is required whether you roll your own or use Berkeley
DB.

I encourage you to think hard about the amount of work that's really
required to produce a commercial-grade recovery and transaction system.
This stuff is extremely hard to get right -- you need to design, code,
and test for very high-concurrency, complex workloads.  The log is a
new source of contention, and will be a gate to performance.  The log
is also a new way to consume space endlessly, so you'll want to think
about backup and checkpoint support.  With Berkeley DB, you get both
today.  Our backup support permits you to do on-line backups.  Backups
don't acquire locks and don't force a shutdown.

Testing this stuff is tricky.  For example, you need to prove that you're
able to survive a crash that interrupts the three internal page writes
that you do in the btree access method on a page split.  Postgres (when
I wrote the Btree AM) carefully ordered those writes to guarantee no
loss of data, but it was possible to crash with the children written and
the parent lost.  The result is an inefficiency in the tree structure
that you'll never recover, but that you can deal with at read time.  This
is an example of a case that Berkeley DB gets right.

The advantage of Berkeley DB is that we've got five years of commercial
deployment behind us.  The code is in production use, and is known to
support terabyte-sized databases, hundreds of concurrent threads with
arbitrary read/write mixes, and system and application crashes.  By
putting the log and the data on separate spindles, we are able to survive
loss of either device without losing committed data.  Big companies have
made significant bets on the software by deploying it in mission-critical
applications.  It works.

Plus, we're continuing to work on the code, and we're paid real money to
do that.  We're able to deliver significant new features and performance
improvements about three times a year.

All of that said, I'd boil Vadim's message down to this:
+  With Berkeley DB, you'd need to reimplement multi-version   concurrency control, and that's an opportunity to
introduce  new bugs.
 
+  With PostgreSQL, you'll need to implement logging and recovery,   and that's an opportunity to introduce new bugs.

I don't think that either alternative presents insurmountable difficulties.
Which you choose depends on the technical issues and on your willingness
to integrate code from outside the project into PostreSQL's internals, to
a degree that you've never done before.

Regards,                mike



Re: Berkeley DB...

From
Chris Bitmead
Date:
"Michael A. Olson" wrote:

> You get another benefit from Berkeley DB -- we eliminate the 8K limit
> on tuple size.  For large records, we break them into page-sized
> chunks for you, and we reassemble them on demand.  Neither PostgreSQL
> nor the user needs to worry about this, it's a service that just works.
> 
> A single record or a single key may be up to 4GB in size.

That's certainly nice. But if you don't access a BIG column, you have to
retrieve the whole record? A very nice idea of the Postgres TOAST idea
is that you don't. You can have...
CREATE TABLE image (name TEXT, size INTEGER, giganticTenMegImage GIF);
As long as you don't select the huge column you don't lift it off disk.
That's pretty nice. In other databases I've had to do some annoying
refactoring of data models to avoid this.


Re: Berkeley DB...

From
Tatsuo Ishii
Date:
Can I ask you a simple question? Does Berkeley DB support encodings
other than ASCII?
--
Tatsuo Ishii



Re: Berkeley DB...

From
"Michael A. Olson"
Date:
At 10:14 AM 5/22/00 +0900, Tatsuo Ishii wrote:

> Can I ask you a simple question? Does Berkeley DB support encodings
> other than ASCII?

Berkeley DB is entirely agnostic on data types.  We store and retrieve
keys and values; you define the types and assign semantics to them.
We've got a number of customers storing wide character data in various
encodings and character sets in Berkeley DB.

Our default btree comparator and hash function are simple bit string
operators.  You'd need to write a comparison function for btrees that
understood the collating sequence of the character set you store.
                mike



RE: Berkeley DB...

From
"Mikheev, Vadim"
Date:
> Empty pages get appended to a free list, and will be reused 
> on next page allocation. Empty space on pages (from deleted
> tuples) where the rest of the page isn't empty will get reused
> the next time the page is visited.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If visited for update... but how can I know with what RECNO I have
to insert new tuple to store it on half-empty page?

> We do, however, do reverse splits of underfull nodes, so 
> we're aggressive at getting empty pages back on the free list.

You can't merge two 49% empty pages in one. So, how to reuse
this 49%? How will we able to implement feature that good
databases have: one can specify while table creation -
"insert new tuples on pages which N% empty"?

More of that, you can't just re-use empty space (even if you know
where in the tree there is space for new tuple) - you have to assign
*right* recno to new tuple. What if 2k tuple on a page was updated
to 0.2k and no tuple was deleted on this page? You can't re-use
this empty-space without tree reorg, what 1. requires additional
writes; 2. is not always possible at all.
...Oh, we can work it out, by using somehing like current TID
(blkno + itemid) as index key - this will require some changes
in SDB btree code (new RECNO-like AM)... so, this was just
calculation of work required -:)

And, while we are on heap subject - using index (RECNO) for heap
means that all our secondary-index scans will performe TWO
index scans - first, to find recno in secondary-index, and
second, to find heap tuple using recno (now indices give us
TID, which is physical address).

> > 2. SDB' btree-s support only one key, but we have multi-key 
> btree-s...
> 
> This is a misunderstanding. Berkeley DB allows you to use
> arbitrary data structures as keys. You define your own comparison 
> function, which understands your key structure and is capable of
> doing comparisons between keys.

Oh, you're right.
Though, I'm unhappy with

DB_SET_RANGE    The DB_SET_RANGE flag is identical to the DB_SET flag, except that    the key is returned as well as
thedata item, and, in the case of    the Btree access method, the returned key/data pair is     the smallest key
greaterthan or equal to the specified key
 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
we would need differentiate > and >= : it's not good for key > 2 to read
from disk all items with key == 2. But, this could be done.
    (as determined by the comparison function), permitting partial key    matches and range searches. 

> You get another benefit from Berkeley DB -- we eliminate the 8K limit
> on tuple size.  For large records, we break them into page-sized
> chunks for you, and we reassemble them on demand.  Neither PostgreSQL
> nor the user needs to worry about this, it's a service that 
> just works.

But what I've seen in code - you *always* copy data to memory,
what is not good for both seq scans (when we have to evaluate
query qual) and index ones (with range quals). See also
comments from Chris Bitmead about the case when some of data
are not in select target list...
Probably, we could change DB->(c_)get methods and provide them
with two functions (to check qual and to determine what exactly
should be returned)...

> A single record or a single key may be up to 4GB in size.
> 
> > 3. How can we implement gist, rtree AND (multi-key) BTREE 
> > access methods using btree and hash access methods provided by SDB?!
> 
> You'd build gist and rtree on top of the current buffer manager, much
> as rtree is currently implemented on top of the lower-level page manager
> in PostgreSQL.

Oh, so we have to implement redo/undo for them. Ok.

WAL:
> I encourage you to think hard about the amount of work that's really
> required to produce a commercial-grade recovery and transaction system.
> This stuff is extremely hard to get right -- you need to design, code,
> and test for very high-concurrency, complex workloads. The log is a
> new source of contention, and will be a gate to performance. The log
> is also a new way to consume space endlessly, so you'll want to think
> about backup and checkpoint support. With Berkeley DB, you get both
> today. Our backup support permits you to do on-line backups. Backups
> don't acquire locks and don't force a shutdown.

As for design and coding, 90% is already done (though, with SDB we could
avoid heap/btree/hash redo/undo implementation). As for stability/testing
- as I already said, - after rewriting ~50% of system to use SDB,
nothing in PostgreSQL will be "well tested".

> Testing this stuff is tricky.  For example, you need to prove 
> that you're able to survive a crash that interrupts the three
> internal page writes that you do in the btree access method on
> a page split.  

Oh, testing of this case is very easy - I'll just stop backend
using gdb in critical points and will turn power off -:))
I've run 2-3 backends under gdb to catch some concurrency-related
bug in buffer manager - this technique works very well -:)

> All of that said, I'd boil Vadim's message down to this:
> 
> +  With Berkeley DB, you'd need to reimplement multi-version
>    concurrency control, and that's an opportunity to introduce
>    new bugs.

Unfortunately, this must be done *before* we could migrate to
SDB -:( So, some of us will have to stop PG development and
switch to SDB code... which is hard for soul -:)

> +  With PostgreSQL, you'll need to implement logging 
>    and recovery, and that's an opportunity to introduce new bugs.

But... we know where in our code set up breakpoints under gdb -:))

Oh, well.. let's continue to think...

Regards,Vadim


Re: Berkeley DB...

From
Bruce Momjian
Date:
> And, while we are on heap subject - using index (RECNO) for heap
> means that all our secondary-index scans will performe TWO
> index scans - first, to find recno in secondary-index, and
> second, to find heap tuple using recno (now indices give us
> TID, which is physical address).

Yes, that was one of my questions.  Why use recno at all?  We already
have heap access which is very fast.  Why switch to SDB which gives us
a recno ordering of heap that doesn't do us any real good, except to
allow tuple update without changing indexes.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


RE: Berkeley DB...

From
"Mikheev, Vadim"
Date:
> > And, while we are on heap subject - using index (RECNO) for heap
> > means that all our secondary-index scans will performe TWO
> > index scans - first, to find recno in secondary-index, and
> > second, to find heap tuple using recno (now indices give us
> > TID, which is physical address).
> 
> Yes, that was one of my questions.  Why use recno at all?  We already
> have heap access which is very fast.  Why switch to SDB which gives us
> a recno ordering of heap that doesn't do us any real good, except to
> allow tuple update without changing indexes.

But if we'll use our heap AM, then we'll have to implement redo/undo
for it... no sence to switch to SDB for btree/hash WAL support -:)

Vadim


Re: Berkeley DB...

From
Bruce Momjian
Date:
[Charset iso-8859-1 unsupported, filtering to ASCII...]
> > > And, while we are on heap subject - using index (RECNO) for heap
> > > means that all our secondary-index scans will performe TWO
> > > index scans - first, to find recno in secondary-index, and
> > > second, to find heap tuple using recno (now indices give us
> > > TID, which is physical address).
> > 
> > Yes, that was one of my questions.  Why use recno at all?  We already
> > have heap access which is very fast.  Why switch to SDB which gives us
> > a recno ordering of heap that doesn't do us any real good, except to
> > allow tuple update without changing indexes.
> 
> But if we'll use our heap AM, then we'll have to implement redo/undo
> for it... no sence to switch to SDB for btree/hash WAL support -:)

Yes, SDB would give us redo/undo in heap, and that would make things
easier.  However, if there is the overhead of a double-index lookup when
using indexes, it seems like a very high cost.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


RE: Berkeley DB...

From
"Mikheev, Vadim"
Date:
> > Yes, that was one of my questions.  Why use recno at all?  
> > We already have heap access which is very fast. Why switch
> > to SDB which gives us a recno ordering of heap that doesn't
> > do us any real good, except to allow tuple update without
> > changing indexes.
> 
> But if we'll use our heap AM, then we'll have to implement redo/undo
> for it... no sence to switch to SDB for btree/hash WAL support -:)

Also, I think that our native index logging will require less space
in log, because of we can do not write *key values* to log!
Index tuple insertion will be logged as "index tuple pointing to
heap TID was added to page BLKNO at position ITEMID".
The same for index page split...

Vadim


Re: Berkeley DB...

From
"Michael A. Olson"
Date:
I'm responding in a single message to several questions prompted by my
message of this morning.

Chris Bitmead asked:

> > A single record or a single key may be up to 4GB in size.
> 
> That's certainly nice. But if you don't access a BIG column, you have to
> retrieve the whole record?

You can do partial reads, but only at the expense of complicating the
code in the PostgreSQL server.  We provide some reasonable interfaces
for fetching only part of large records.  However, you need to know
whether you should call them or not.  As a result, you'd have to
record in the system catalog somewhere that a particular table contained
big tuples, and you'd have to write your fetch code to read only the
byte range you care about.

That would complicate the server, since you'd want to do simple
fetches for the simple case, too.

Vadim Mikheev made some good points on space reuse.  Unless a page is
empty space on the page for keys in the right range.  For append-only 
workloads (like increasing heap tids), that's not what you want.

Vadim then asked:

> You can't merge two 49% empty pages in one. So, how to reuse
> this 49%? How will we able to implement feature that good
> databases have: one can specify while table creation -
> "insert new tuples on pages which N% empty"?


We already recognize the special case of in-order insersions (as in
the case of increasing heap tids).  We split pages so that the right
child is nearly empty and left is nearly full.  That gives you close
to 100% space utilization at build time.  Adding a fill factor to
the initialization code would be very easy.

> And, while we are on heap subject - using index (RECNO) for heap
> means that all our secondary-index scans will performe TWO
> index scans - first, to find recno in secondary-index, and
> second, to find heap tuple using recno (now indices give us
> TID, which is physical address).

We're not going to resolve this question without building both
systems and measuring their performance.  The non-leaf levels of
btrees are pretty much always in the cache because they're hot.
Whether your fetch-a-tuple code path is shorter than my fetch-
a-tuple code path is undecided.

Frankly, based on my experience with Berkeley DB, I'd bet on mine.
I can do 2300 tuple fetches per CPU per second, with linear scale-
up to at least four processors (that's what we had on the box we
used).  That's 9200 fetches a second.  Performance isn't going
to be the deciding issue.

(The test system was a mid-range Solaris box -- reasonable, but not
extravagant, clock speed, memory, and disk.)

On testing failure at critical points in the code, Vadim wrote:

> Oh, testing of this case is very easy - I'll just stop backend
> using gdb in critical points and will turn power off -:))
> I've run 2-3 backends under gdb to catch some concurrency-related
> bug in buffer manager - this technique works very well -:)

For limited concurrency and fairly simple problems, that technique
works well.  You should plan to test PostgreSQL with hundreds of
concurrent backends with a complex workload for days in order to
convince people that the system works correctly.  This is what the
commercial vendors (including Sleepycat!) do.  Your testing
strategy should include randomly killing the system to demonstrate
that you recover correctly.

I'm only warning you to be careful and to take this seriously.  It's
very hard to do the kind of testing you should.  The recovery system
is generally the most poorly-exercised part of the system, but it's
the one piece that absolutely has to work flawlessly.  It only runs
after your system has crashed, and your customer is already angry.

Finally, Vadim makes the point that switching to Berkeley DB forces
you to stop working on code you understand, and to pick up a new
package altogether.  Worse, you'd need to do some pretty serious
engineering to get multi-version concurrency control into Berkeley
DB before you could use it.

This is a pretty compelling argument to me.  I've been busy
explaining how you *could* make the switch, but the real question
is whether you *should*.  I can answer all of Vadim's questions
reasonably.  Frankly, though, if I were in charge of the engineering
effort for PostgreSQL, I'd be disinclined to use Berkeley DB on the
strength of the interface changes it requires and the effort that
would be required to implement MVCC.

I say this in the spirit of complete disclosure -- we'd like to
see you use our software, but you need to make a business decision
here.  If you hadn't already done MVCC, I'd be arguing the other
side, but you have.

Regards,                mike



RE: Berkeley DB...

From
"Mikheev, Vadim"
Date:
> Times for the insert loop:
> 14   MySQL-MyISAM
> 23   PostgreSQL (no fsync)
> 53   MySQL-BDB (with fsync -- don't know how to turn it off yet)

PostgreSQL 6.5.3; -B 16384; SUN Ultra10 with some IDE disk.

1. with fsync, all inserts in single transaction:     73 sec
2. with fsync, use COPY:                               3 sec
3. without fsync, use COPY:                            3 sec
4. without fsync, all inserts in single transaction:  71 sec
5. without fsync, each insert in own transaction:    150 sec

Do you see difference for INSERT/COPY in 1./2.? Shouldn't we try
to speed up our PARSER/PLANNER, keeping in mind that WAL will speed
up our storage sub-system?!

Also, 4. & 5. show that transaction begin/commit take too long time.
Could you run your test for all inserts in single transaction?

(If we want to test storage sub-system, let's do our test un-affected
by other ones...)

> The select:
> 0.75  MySQL-MyISAM
> 0.77  MySQL-BDB
> 2.43  PostgreSQL

select a from foo order by a

didn't use index in my case, so I've run

select a from foo where a >= 0 also.

1. ORDER:                0.74
2. A >= 0 with index:    0.73
3. A >= 0 without index: 0.56

Note that I used -B 16384 (very big pool) and run queries *twice* to get
all data into pool. What size of pool did you use? 64 (default) * 8192 =
512Kb,
but size of foo is 1.5Mb...

2. & 3. show that index slows data retrieval... as it should -:)
Also, does MySQL read table itself if it can get all required
columns from index?! I mean - did your query really read *both*
index and *table*? 
PostgreSQL has to read table anyway...

Vadim


RE: Berkeley DB...

From
"Mikheev, Vadim"
Date:
> > ... that being said (and I took a quick test with 10000 
> > randomly-inserted
> > records and fetched them in index order) if the data's in the 
> > cache, the
> > speed difference is insignificant. 
> 
> As long as everything fits into the system cache and is 
> already in there, this test is moot.

Oh, if we want to test how cache is affected by index-nature of
MySQL-BDB then 10000 rows table is toooo small. It adds just
2 levels of internal pages (including root) and ~25 8k pages
to ~ 190 pages of true heap table: additional pages will be
cached very fast, just while fetching first 25 rows from table -:) 
Now create 10000000 rows table (~ 25000 additional pages, 3 internal
levels) and fetch random 10000 rows...

...And again - does MySQL-BDB really read *table* for query like
select a from foo order by a? I remember that MySQL is smart to
read only index when possible...

Vadim


Re: Berkeley DB...

From
"Matthias Urlichs"
Date:
Hi,

Mikheev, Vadim:
> Also, does MySQL read table itself if it can get all required
> columns from index?! I mean - did your query really read *both*
> index and *table*? 

Yes, and yes.

Note that this "benchmark" was much too quick-and-dirty and didn't
really say anything conclusive... we'll have to wait a bit for that.

-- 
Matthias Urlichs  |  noris network GmbH   |   smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.       |        http://smurf.noris.de/
-- 
The best way to preserve a right is to exercise it,
and the right to smoke is a right worth dying for.


Re: Berkeley DB...

From
Mike Mascari
Date:
Matthias Urlichs wrote:
> 
> Hi,
> 
> Mikheev, Vadim:
> > Also, does MySQL read table itself if it can get all required
> > columns from index?! I mean - did your query really read *both*
> > index and *table*?
> 
> Yes, and yes.
> 
> Note that this "benchmark" was much too quick-and-dirty and didn't
> really say anything conclusive... we'll have to wait a bit for that.
> 
> --
> Matthias Urlichs  |  noris network GmbH   |   smurf@noris.de  |  ICQ: 20193661
> The quote was selected randomly. Really.       |        http://smurf.noris.de/
> --

Although I am a PostgreSQL zealot, I have to admit that many
PostgreSQL users have hidden behind the use of transactions in
justifying the sometimes 2 - 3 times slower execution speeds in
DML statements vs. MySQL. As Vadim points out in his comparison
of COPY vs. INSERT, something is *wrong* with the time it takes
for PostgreSQL to parse, plan, rewrite, and optimize. Now that
MySQL has transactions through Berkley DB, I think its going to
be harder to justify the pre-executor execution times. 

Just my two cents, 

Mike Mascari


Re: Berkeley DB...

From
Hannu Krosing
Date:
Mike Mascari wrote:
> 
> 
> Although I am a PostgreSQL zealot, I have to admit that many
> PostgreSQL users have hidden behind the use of transactions in
> justifying the sometimes 2 - 3 times slower execution speeds in
> DML statements vs. MySQL. As Vadim points out in his comparison
> of COPY vs. INSERT, something is *wrong* with the time it takes
> for PostgreSQL to parse, plan, rewrite, and optimize. Now that
> MySQL has transactions through Berkley DB, I think its going to
> be harder to justify the pre-executor execution times.

We can always justify it by referring to extensibility of postgres,
which is surely part of the story

Sure we will be able to do cacheing to improve speed of 
serial inserts.
> Just my two cents,
> 
> Mike Mascari


Re: Berkeley DB...

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
>> As Vadim points out in his comparison
>> of COPY vs. INSERT, something is *wrong* with the time it takes
>> for PostgreSQL to parse, plan, rewrite, and optimize.

We might have part of the story in the recently noticed fact that
each insert/update query begins by doing a seqscan of pg_index.

I have done profiles of INSERT in the past and not found any really
spectacular bottlenecks (but I was looking at a test table with no
indexes, so I failed to see the pg_index problem :-().  Last time
I did it, I had these top profile entries for inserting 100,000 rows
of 30 columns apiece:
 %   cumulative   self              self     total           time   seconds   seconds    calls  ms/call  ms/call  name
 30.08    290.79   290.79                             _mcount 6.48    353.46    62.67 30702766     0.00     0.00
AllocSetAlloc5.27    404.36    50.90   205660     0.25     0.25  write 3.06    433.97    29.61 30702765     0.00
0.00 MemoryContextAlloc 2.74    460.45    26.48   100001     0.26     0.74  yyparse 2.63    485.86    25.41 24300077
0.00     0.00  newNode 2.22    507.33    21.47  3900054     0.01     0.01  yylex 1.63    523.04    15.71 30500751
0.00    0.00  PortalHeapMemoryAlloc 1.31    535.68    12.64  5419526     0.00     0.00  hash_search 1.18    547.11
11.43 9900000     0.00     0.00  expression_tree_walker 1.01    556.90     9.79  3526752     0.00     0.00
SpinRelease

While the time spent in memory allocation is annoying, that's only about
ten mallocs per parsed data expression, so it's unlikely that we will be
able to improve on it very much.  (We could maybe avoid having *three*
levels of subroutine call to do an alloc, though ;-).)  Unless you are
smarter than the flex and bison guys you are not going to be able to
improve on the lex/parse times either.  The planner isn't even showing
up for a simple INSERT.  Not much left, unless you can figure out how
to write and commit a tuple with less than two disk writes.

But, as I said, this was a case with no indexes to update.

I intend to do something about caching pg_index info ASAP in the 7.1
cycle, and then we can see how much of a difference that makes...
        regards, tom lane


RE: Berkeley DB...

From
"Mikheev, Vadim"
Date:
> We might have part of the story in the recently noticed fact that
> each insert/update query begins by doing a seqscan of pg_index.
> 
> I have done profiles of INSERT in the past and not found any really
> spectacular bottlenecks (but I was looking at a test table with no
> indexes, so I failed to see the pg_index problem :-().  Last time
> I did it, I had these top profile entries for inserting 100,000 rows
> of 30 columns apiece:

Well, I've dropped index but INSERTs still take 70 sec and 
COPY just 1sec -:(((

Vadim


Re: Berkeley DB...

From
Mike Mascari
Date:
Tom Lane wrote:
> 
> Hannu Krosing <hannu@tm.ee> writes:
> >> As Vadim points out in his comparison
> >> of COPY vs. INSERT, something is *wrong* with the time it takes
> >> for PostgreSQL to parse, plan, rewrite, and optimize.
> 
> We might have part of the story in the recently noticed fact that
> each insert/update query begins by doing a seqscan of pg_index.
> 
> I have done profiles of INSERT in the past and not found any really
> spectacular bottlenecks (but I was looking at a test table with no
> indexes, so I failed to see the pg_index problem :-().  Last time
> I did it, I had these top profile entries for inserting 100,000 rows
> of 30 columns apiece:
> 
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  30.08    290.79   290.79                             _mcount
>   6.48    353.46    62.67 30702766     0.00     0.00  AllocSetAlloc
>   5.27    404.36    50.90   205660     0.25     0.25  write
>   3.06    433.97    29.61 30702765     0.00     0.00  MemoryContextAlloc
>   2.74    460.45    26.48   100001     0.26     0.74  yyparse
>   2.63    485.86    25.41 24300077     0.00     0.00  newNode
>   2.22    507.33    21.47  3900054     0.01     0.01  yylex
>   1.63    523.04    15.71 30500751     0.00     0.00  PortalHeapMemoryAlloc
>   1.31    535.68    12.64  5419526     0.00     0.00  hash_search
>   1.18    547.11    11.43  9900000     0.00     0.00  expression_tree_walker
>   1.01    556.90     9.79  3526752     0.00     0.00  SpinRelease
> 
> While the time spent in memory allocation is annoying, that's only about
> ten mallocs per parsed data expression, so it's unlikely that we will be
> able to improve on it very much.  (We could maybe avoid having *three*
> levels of subroutine call to do an alloc, though ;-).)  Unless you are
> smarter than the flex and bison guys you are not going to be able to
> improve on the lex/parse times either.  The planner isn't even showing
> up for a simple INSERT.  Not much left, unless you can figure out how
> to write and commit a tuple with less than two disk writes.
> 
> But, as I said, this was a case with no indexes to update.
> 
> I intend to do something about caching pg_index info ASAP in the 7.1
> cycle, and then we can see how much of a difference that makes...
> 
>                         regards, tom lane

It will be interesting to see the speed differences between the
100,000 inserts above and those which have been PREPARE'd using
Karel Zak's PREPARE patch. Perhaps a generic query cache could be
used to skip the parsing/planning/optimizing stage when multiple
exact queries are submitted to the database? I suppose the cached
plans could then be discarded whenever a DDL statement or a
VACUUM ANALYZE is executed? The old Berkeley Postgres docs spoke
about cached query plans *and* results (as well as 64-bit oids,
amongst other things). I'm looking forward to when the 7.1 branch
occurs... :-)

Mike Mascari


RE: Berkeley DB...

From
"Mikheev, Vadim"
Date:
> > Well, I've dropped index but INSERTs still take 70 sec and 
> > COPY just 1sec -:(((
> 
> Well, for those that have fsync turned off we could actually 
> avoid most of the writes, could'nt we ? Just leave the page
> marked dirty. We would only need to write each new page once.
> The problem as I see it is, that we don't have a good place 
> where the writes would actually be done. Now they are obviously
> done after each insert.

I've run test without fsync and with all inserts in *single*
transaction - there should be no write after each insert...

Vadim



Re: Berkeley DB...

From
Bruce Momjian
Date:
[ Charset ISO-8859-1 unsupported, converting... ]
> > > Well, I've dropped index but INSERTs still take 70 sec and 
> > > COPY just 1sec -:(((
> > 
> > Well, for those that have fsync turned off we could actually 
> > avoid most of the writes, could'nt we ? Just leave the page
> > marked dirty. We would only need to write each new page once.
> > The problem as I see it is, that we don't have a good place 
> > where the writes would actually be done. Now they are obviously
> > done after each insert.
> 
> I've run test without fsync and with all inserts in *single*
> transaction - there should be no write after each insert...

Watch out.  I think Vadim is settled into San Franciso and is getting
fired up again...  :-)

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


RE: Berkeley DB...

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@hub.org 
> [mailto:pgsql-hackers-owner@hub.org]On Behalf Of Mikheev, Vadim
> 
> > We might have part of the story in the recently noticed fact that
> > each insert/update query begins by doing a seqscan of pg_index.
> > 
> > I have done profiles of INSERT in the past and not found any really
> > spectacular bottlenecks (but I was looking at a test table with no
> > indexes, so I failed to see the pg_index problem :-().  Last time
> > I did it, I had these top profile entries for inserting 100,000 rows
> > of 30 columns apiece:
> 
> Well, I've dropped index but INSERTs still take 70 sec and 
> COPY just 1sec -:(((
>

Did you run vacuum after dropping indexes ?
Because DROP INDEX doesn't update relhasindex of pg_class,
planner/executer may still look up pg_index.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp 


RE: Berkeley DB...

From
"Mikheev, Vadim"
Date:
> > Well, I've dropped index but INSERTs still take 70 sec and 
> > COPY just 1sec -:(((
> >
> 
> Did you run vacuum after dropping indexes ?
> Because DROP INDEX doesn't update relhasindex of pg_class,
> planner/executer may still look up pg_index.

Actually, I dropped and re-created table without indices...

Vadim


Re: Berkeley DB...

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>>>> Well, I've dropped index but INSERTs still take 70 sec and 
>>>> COPY just 1sec -:(((

Mebbe so, but you can't blame it all on parse/plan overhead.

I did some experimentation on this with current sources, using a test
case of inserting 100,000 rows of 16 columns (the regression "tenk1"
table's contents repeated 10 times).  Each test was started with a
freshly created empty table.  The initial runs were done with all
postmaster options except -F defaulted.  All numbers are wall-clock time
in seconds; the "+" column is the time increase from the previous case:

load via COPY, fsync off:
0 indexes    24.45s
1 index        48.88s        + 24.43
2 indexes    62.65s        + 13.77
3 indexes    96.84s        + 34.19
4 indexes    134.09s        + 37.25

load via INSERTs, fsync off, one xact (begin/end around all inserts):
0 indexes    194.95s
1 index        247.21s        + 52.26
2 indexes    269.69s        + 22.48
3 indexes    307.33s        + 37.64
4 indexes    352.72s        + 45.39

load via INSERTs, fsync off, separate transaction for each insert:
0 indexes    236.53s
1 index        295.96s        + 59.43
2 indexes    323.40s        + 27.44
[ got bored before doing 3/4 index cases ... ]

load via INSERTs, fsync on, separate transactions:
0 indexes    5189.99s
[ don't want to know how long it will take with indexes :-( ]

So while the parse/plan overhead looks kinda bad next to a bare COPY,
it's not anything like a 70:1 penalty.  But an fsync per insert is
that bad and worse.

I then recompiled with -pg to learn more about where the time was going.
One of the useful places to look at is calls to FileSeek, since mdread,
mdwrite, and mdextend all call it.  To calibrate these numbers, the
table being created occupies 2326 pages and the first index is 343
pages.

Inserts (all in 1 xact), no indexes:               0.00    0.00       1/109528      init_irels [648]               0.00
  0.00      85/109528      mdread [592]               0.01    0.00    2327/109528      mdextend [474]
0.01   0.00    2343/109528      mdwrite [517]               0.23    0.00  104772/109528      _mdnblocks [251]
 
[250]    0.0    0.24    0.00  109528         FileSeek [250]
Inserts (1 xact), 1 index:               0.00    0.00       1/321663      init_irels [649]               0.00    0.00
2667/321663      mdextend [514]               0.10    0.00   55478/321663      mdread [277]               0.11    0.00
58096/321663      mdwrite [258]               0.38    0.00  205421/321663      _mdnblocks [229]
 
[213]    0.1    0.60    0.00  321663         FileSeek [213]
COPY, no indexes:               0.00    0.00       1/109527      init_irels [431]               0.00    0.00
84/109527     mdread [404]               0.00    0.00    2327/109527      mdextend [145]               0.00    0.00
2343/109527     mdwrite [178]               0.07    0.00  104772/109527      _mdnblocks [77]
 
[83]     0.0    0.07    0.00  109527         FileSeek [83]
COPY, 1 index:               0.00    0.00       1/218549      init_irels [382]               0.00    0.00
2667/218549     mdextend [220]               0.07    0.00   53917/218549      mdread [106]               0.08    0.00
56542/218549     mdwrite [99]               0.14    0.00  105422/218549      _mdnblocks [120]
 
[90]     0.0    0.30    0.00  218549         FileSeek [90]

The extra _mdnblocks() calls for the inserts/1index case seem to be from
the pg_index scans in ExecOpenIndices (which is called 100000 times in
the inserts case but just once in the COPY case).  We know how to fix
that.  Otherwise the COPY and INSERT paths seem to be pretty similar as
far as actual I/O calls go.  The thing that jumps out here, however, is
that it takes upwards of 50000 page reads and writes to prepare a
343-page index.  Most of the write calls turn out to be from
BufferReplace, which is pretty conclusive evidence that the default
setting of -B 64 is not enough for this example; we need more buffers.

At -B 128, inserts/0index seems about the same, inserts/1index traffic is               0.00    0.00       1/270331
init_irels [637]               0.01    0.00    2667/270331      mdextend [510]               0.06    0.00
29798/270331     mdread [354]               0.06    0.00   32444/270331      mdwrite [277]               0.40    0.00
205421/270331     _mdnblocks [229]
 
[223]    0.1    0.52    0.00  270331         FileSeek [223]
At -B 256, inserts/1index traffic is               0.00    0.00       1/221849      init_irels [650]               0.00
  0.00    2667/221849      mdextend [480]               0.01    0.00    5556/221849      mdread [513]
0.01   0.00    8204/221849      mdwrite [460]               0.37    0.00  205421/221849      _mdnblocks [233]
 
[240]    0.0    0.40    0.00  221849         FileSeek [240]
At -B 512, inserts/1index traffic is               0.00    0.00       1/210788      init_irels [650]               0.00
  0.00      25/210788      mdread [676]               0.00    0.00    2667/210788      mdextend [555]
0.00   0.00    2674/210788      mdwrite [564]               0.27    0.00  205421/210788      _mdnblocks [248]
 
[271]    0.0    0.28    0.00  210788         FileSeek [271]

So as long as the -B setting is large enough to avoid thrashing, there
shouldn't be much penalty to making an index.  I didn't have time to run
the COPY cases but I expect they'd be about the same.

Bottom line is that where I/O costs are concerned, the parse/plan
overhead for INSERTs is insignificant except for the known problem
of wanting to rescan pg_index for each INSERT.  The CPU overhead is
significant, at least if you're comparing no-fsync performance ...
but as I commented before, I doubt we can do a whole lot better in
that area for simple INSERTs.
        regards, tom lane


Re: Berkeley DB...

From
"Zeugswetter Andreas"
Date:
> >   %   cumulative   self              self     total
> >  time   seconds   seconds    calls  ms/call  ms/call  name
> >  30.08    290.79   290.79                             _mcount
> >   6.48    353.46    62.67 30702766     0.00     0.00  AllocSetAlloc
> >   5.27    404.36    50.90   205660     0.25     0.25  write
> >   3.06    433.97    29.61 30702765     0.00     0.00  MemoryContextAlloc
> >   2.74    460.45    26.48   100001     0.26     0.74  yyparse
> >   2.63    485.86    25.41 24300077     0.00     0.00  newNode
> >   2.22    507.33    21.47  3900054     0.01     0.01  yylex
> >   1.63    523.04    15.71 30500751     0.00     0.00  PortalHeapMemoryAlloc
> >   1.31    535.68    12.64  5419526     0.00     0.00  hash_search
> >   1.18    547.11    11.43  9900000     0.00     0.00  expression_tree_walker
> >   1.01    556.90     9.79  3526752     0.00     0.00  SpinRelease
> > 
> > While the time spent in memory allocation is annoying, that's only about
> > ten mallocs per parsed data expression, so it's unlikely that we will be
> > able to improve on it very much.  (We could maybe avoid having *three*
> > levels of subroutine call to do an alloc, though ;-).)  Unless you are
> > smarter than the flex and bison guys you are not going to be able to
> > improve on the lex/parse times either.  The planner isn't even showing
> > up for a simple INSERT.  Not much left, unless you can figure out how
> > to write and commit a tuple with less than two disk writes.

> It will be interesting to see the speed differences between the
> 100,000 inserts above and those which have been PREPARE'd using
> Karel Zak's PREPARE patch

If we believe the above output, the win won't be very noticeable. It is the 
writes (and the Allocs) we have to get rid of.
The above is much faster if you do:
begin work;
100000 inserts ....;
commit work;

Andreas



Re: Berkeley DB...

From
Peter Eisentraut
Date:
Tom Lane writes:

> I have done profiles of INSERT in the past and not found any really
> spectacular bottlenecks

I am still at a loss on how to make profiles. The latest thing that
happened to me is that the postmaster gave me a `Profiling timer expired'
message and never started up. Any idea?


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: Berkeley DB...

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> I am still at a loss on how to make profiles. The latest thing that
> happened to me is that the postmaster gave me a `Profiling timer expired'
> message and never started up. Any idea?

Dunno ... PROFILE=-pg works for me ...

Normally there's a special startup file that the compiler is supposed to
know to link instead of the usual crt0.o, when you link with -pg.
Possibly there's something wrong with yours?
        regards, tom lane


Re: Berkeley DB...

From
"Matthias Urlichs"
Date:
Hi,

Mike Mascari:
> DML statements vs. MySQL. As Vadim points out in his comparison
> of COPY vs. INSERT, something is *wrong* with the time it takes
> for PostgreSQL to parse, plan, rewrite, and optimize.

In my throughly unscientific opinion, the problem may well be the fact
that PostgreSQL recurses the whole process, i.e. it is looking up
attributes of one table in a bunch of other tables.

MySQL, by contrast, has three files per table -- one with the data,
one with _all_ the indices, and one .frm file with all the other
metadata you would ever want to know about a table.

That metadata file is mapped into shared memory space by the first task
that opens a table, and it stays there. The data and index files also
stay open until they're flushed.

Since MySQL is multithreaded, opening a new connection is extremely
cheap. By contrast, PostgreSQL does more than 30 open() calls when I
connect to it.(*) It's still lots faster than some other databases I might
mention, though...

Access control is done by a bunch of tables in the "mysql" database,
but these are 100% cached.

One nice side effect of this is that it's very easy to access tables
from another database. Just say "select * from foo.bar".


(*) The list:
/data//pg_options
/etc/passwd
/etc/group
/data//PG_VERSION
/data//pg_database
/data//base/test/PG_VERSION
/data//base/test/pg_internal.init
/data//pg_log
/data//pg_variable
/data//base/test/pg_class
/data//base/test/pg_class_relname_index
/data//base/test/pg_attribute
/data//base/test/pg_attribute_relid_attnum_index
/data//base/test/pg_trigger
/data//base/test/pg_am
/data//base/test/pg_index
/data//base/test/pg_amproc
/data//base/test/pg_amop
/data//base/test/pg_operator
/data//base/test/pg_index_indexrelid_index
/data//base/test/pg_operator_oid_index
/data//base/test/pg_index_indexrelid_index
/data//base/test/pg_trigger_tgrelid_index
/data//pg_shadow
/data//pg_database
/data//base/test/pg_proc
/data//base/test/pg_proc_proname_narg_type_index
/data//base/test/pg_type
/data//base/test/pg_type_oid_index
/data//base/test/pg_proc_oid_index
/data//base/test/pg_rewrite
/data//base/test/pg_user
/data//base/test/pg_attribute_relid_attnam_index
/data//base/test/pg_operator_oprname_l_r_k_index
/data//base/test/pg_class_oid_index
/data//base/test/pg_statistic
/data//base/test/pg_statistic_relid_att_index

-- 
Matthias Urlichs  |  noris network GmbH   |   smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.       |        http://smurf.noris.de/
-- 
Man with hand in pocket feel cocky all day.    -- Confucius


RE: Berkeley DB...

From
Andreas Zeugswetter
Date:
On Fri, 26 May 2000, Mikheev, Vadim wrote:
> > > Well, I've dropped index but INSERTs still take 70 sec and 
> > > COPY just 1sec -:(((
> > 
> > Well, for those that have fsync turned off we could actually 
> > avoid most of the writes, could'nt we ? Just leave the page
> > marked dirty. We would only need to write each new page once.
> > The problem as I see it is, that we don't have a good place 
> > where the writes would actually be done. Now they are obviously
> > done after each insert.
> 
> I've run test without fsync and with all inserts in *single*
> transaction - there should be no write after each insert...

Yes, but if you don't do the inserts in one big transaction
and don't issue transaction statements ( no begin or commit )
then you get the behavior I described.

Andreas


RE: Berkeley DB...

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]
> 
> > > Well, I've dropped index but INSERTs still take 70 sec and 
> > > COPY just 1sec -:(((
> > >
> > 
> > Did you run vacuum after dropping indexes ?
> > Because DROP INDEX doesn't update relhasindex of pg_class,
> > planner/executer may still look up pg_index.
> 
> Actually, I dropped and re-created table without indices...
>

Oops,aren't you testing in 6.5.3 ?
ExecOpenIndices() always refers to pg_index in 6.5.x.
Currently it doesn't refer to pg_index if relhasindex is
false. 

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp 


Re: Berkeley DB...

From
Karel Zak
Date:
> It will be interesting to see the speed differences between the
> 100,000 inserts above and those which have been PREPARE'd using
> Karel Zak's PREPARE patch. Perhaps a generic query cache could be

My test:
postmaster:    -F -B 2000    rows:        100,000 table:        create table (data text);data:        37B for eache
line---all is in one transaction
 
native insert:        66.522sprepared insert:    59.431s        - 11% faster    

IMHO parsing/optimizing is relative easy for a simple INSERT.
The query (plan) cache will probably save time for complicated SELECTs 
with functions ...etc. (like query that for parsing need look at to system
tables). For example:
insert into tab values ('some data' || 'somedata' || 'some data');
native insert:        91.787sprepared insert:    45.077s     - 50% faster
(Note: This second test was faster, because I stop X-server andpostgres had more memory :-)

The best way for large and simple data inserting is (forever) COPY, not
exist faster way. 
pg's path(s) of query:native insert:        parser -> planner -> executor -> storageprepared insert:    parser (for
executestmt) -> executor -> storagecopy:            utils (copy) -> storage
 

> amongst other things). I'm looking forward to when the 7.1 branch
> occurs... :-)
I too.
                        Karel



RE: Berkeley DB...

From
"Mikheev, Vadim"
Date:
> So while the parse/plan overhead looks kinda bad next to a bare COPY,
> it's not anything like a 70:1 penalty.  But an fsync per insert is

Isn't it because of your table has 16 columns and my table has only 2?

> that bad and worse.

Of course -:)

2Hiroshi - yes, I've used 6.5.3...

Vadim