Thread: why postgresql over other RDBMS

why postgresql over other RDBMS

From: "Jasbinder Singh Bali"
Hi
I was wondering: apart from extensive procedural language support and being free,
what are the other major advantages of PostgreSQL over other major RDBMSs like Oracle and SQL Server?

Any pointers would be highly appreciated.

Thanks,
~Jas

Re: why postgresql over other RDBMS

From: Tom Allison
You've addressed cost and performance.
Not much left.

Try it out for yourself and see if it works for you.

On May 24, 2007, at 4:06 AM, Jasbinder Singh Bali wrote:

> Hi
> I was wondering, apart from extensive procedural language support
> and being free,
> what are other major advantages of Postgresql over other major
> RDBMS like oracle and sql server.
>
> Any pointers would be highly appreciated.
>
> Thanks,
> ~Jas


Re: why postgresql over other RDBMS

From: Richard Huxton
Tom Allison wrote:
> You've addressed cost and performance.
> Not much left.
>
> Try it out for yourself and see if it works for you.

+ elephant in logo
- unpronounceable name
+ excellent mailing lists
+ excellent developer community
- you can download as many copies as you like and a salesman still won't
take you out to lunch

If you want specifics vs. each DB...

MSSQL obviously integrates nicely with the rest of the MS developer
tools. If you're planning a .NET (TM) deployment over a range of Windows
(TM) systems and have a lot of experienced MS developers in-house, then
it's perhaps the default choice.

Oracle has much more experience running on top-end hardware than PG. If
you've got the in-depth knowledge and/or the money then you can push it
further.

On the other hand, I can buy a 2-cpu x 4-core machine with 16GB RAM and
half-a-dozen disks for £5k from Dell (not that you necessarily would).
That's a lot of oomph for the money - think what it would have cost five
years ago.

Add Debian + PostgreSQL: total cost = £5k.

Add Windows + SQL Server: total cost = £12k.

--
   Richard Huxton
   Archonet Ltd

Re: why postgresql over other RDBMS

From: Wiebe Cazemier
On Thursday 24 May 2007 10:06, Jasbinder Singh Bali wrote:

> Hi
> I was wondering, apart from extensive procedural language support and being
> free,
> what are other major advantages of Postgresql over other major RDBMS like
> oracle and sql server.
>
> Any pointers would be highly appreciated.
>
> Thanks,
> ~Jas

A free (as in free speech) license. You can still use PostgreSQL if the
developers decide to stop making it, or stop making it free, since it is
likely it will be forked and continued by others (the way XFree86 was
forked into Xorg).

And you won't run into annoying licence agreements saying that you can't
use it on machines with two CPUs...

In my experience as a sysadmin, I much prefer free-licensed (open
source) software over commercial software.

Something to think about.

Re: why postgresql over other RDBMS

From: "Ben Trewern"
Better support!

Where else but here can you get feedback from the actual programmers,
sometimes within minutes of writing a message?

Ben

> Hi
> I was wondering, apart from extensive procedural language support
> and being free,
> what are other major advantages of Postgresql over other major
> RDBMS like oracle and sql server.
>
> Any pointers would be highly appreciated.
>
> Thanks,
> ~Jas



Re: why postgresql over other RDBMS

From: "Alexander Staubo"
On 5/24/07, Richard Huxton <dev@archonet.com> wrote:
> Tom Allison wrote:
> > You've addressed cost and performance.
> > Not much left.
> >
> > Try it out for yourself and see if it works for you.
>
> + elephant in logo
> - unpronounceable name
> + excellent mailing lists
> + excellent developer community
> - you can download as many copies as you like and a salesman still won't
> take you out to lunch

+ Friendly toolset in the box. [1]
+ Transactional DDL. [2]
+ Table inheritance, if you care to use unportable features.
+ Extensibility. [3]
+ PostGIS for spatial extensions.
- Replication offerings suck.
- Warm standby involves a lot of manual labour.
- Point-in-time recovery involves a lot of manual labour.

[1] psql+readline, pg_dump etc. are a breeze compared to crusty Oracle
tools; psql feels distinctly modern compared to MySQL's crummy
interpreter.

[2] Nobody else has this, I believe, except possibly Ingres and
NonStop SQL. This means you can do a "begin transaction", then issue
"create table", "alter table", etc. ad nauseam, and in the meantime
concurrent transactions will just work. Beautiful for atomically
upgrading a production server. Oracle, of course, commits after each
DDL statement.
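
For illustration, a minimal sketch of what this buys you (table and
column names invented):

    BEGIN;
    CREATE TABLE audit_log (id serial PRIMARY KEY, note text);
    ALTER TABLE audit_log ADD COLUMN created_at timestamptz DEFAULT now();
    -- Concurrent transactions see none of this yet.  Either:
    COMMIT;     -- apply both DDL statements atomically
    -- ...or ROLLBACK; and it is as if they never happened.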

[3] PostgreSQL can be extended with new domain types, stored-procedure
languages (e.g., PL/Python, PL/Perl), functions (e.g., dblink,
fuzzystrmatch, cube), and index types (GiST, GIN). Together this allows
projects such as TSearch2 and PostGIS to be implemented as separate
extensions to PostgreSQL.
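
For a small, hedged taste of that extensibility (names invented; the
function is plain SQL rather than one of the optional PLs):

    -- A domain layers a CHECK constraint over an existing type:
    CREATE DOMAIN us_postal_code AS text
        CHECK (VALUE ~ '^[0-9]{5}(-[0-9]{4})?$');

    -- A user-defined function: clamp(value, low, high)
    CREATE FUNCTION clamp(integer, integer, integer) RETURNS integer
        AS $$ SELECT greatest($2, least($1, $3)) $$
        LANGUAGE sql IMMUTABLE;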

Alexander.

Re: why postgresql over other RDBMS

From: Ron Johnson
On 05/24/07 10:30, Alexander Staubo wrote:
> On 5/24/07, Richard Huxton <dev@archonet.com> wrote:
>> Tom Allison wrote:
>> > You've addressed cost and performance.
>> > Not much left.
>> >
>> > Try it out for yourself and see if it works for you.
>>
>> + elephant in logo
>> - unpronounceable name
>> + excellent mailing lists
>> + excellent developer community
>> - you can download as many copies as you like and a salesman still won't
>> take you out to lunch
>
> + Friendly toolset in the box. [1]
> + Transactional DDL. [2]
> + Table inheritance, if you care to use unportable features.
> + Extensibility. [3]
> + PostGIS for spatial extensions.
> - Replication offerings suck.
> - Warm standby involve a lot of manual labour.
> - Point-in-time recovery involve a lot of manual labour.
>
> [1] psql+readline, pg_dump etc. are a breeze compared to crusty Oracle
> tools; psql feels distinctly modern compared to MySQL's crummy
> interpreter.
>
> [2] Nobody else has this, I believe, except possibly Ingres and
> NonStop SQL. This means you can do a "begin transaction", then issue
> "create table", "alter table", etc. ad nauseum, and in the mean time
> concurrent transactions will just work. Beautiful for atomically
> upgrading a production server. Oracle, of course, commits after each
> DDL statements.

Rdb/VMS and CODASYL DBMS (both Oracle, formerly DEC, products) also
have transactional DDL.  Actually, I was quite stunned to discover
that Oracle doesn't do that.

Interbase/Firebird probably also has transactional DDL.

> [3] PostgreSQL can be extended with new domain types, stored-procedure
> languages (eg., PL/Python, PL/Perl), functions (eg., dblink,
> fuzzystrmatch, cube), and indexes (GiST, GIN). Together this allows
> projects such as TSearch2 and PostGIS to be implemented as separate
> extensions to PostgreSQL.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: why postgresql over other RDBMS

From: "Ericson Smith"
For us?

Stability, pure and simple. It turned out that the site was faster too.
One issue with PostgreSQL is that connection time is slower than with
MySQL. Otherwise, most everything else is just as fast or faster.

With Ruby on Rails there are persistent connections from the container,
so that disadvantage disappears. If anything in the site is slow, it's
Ruby template rendering; database access is generally 10% or less of
the total time it takes to generate our most popular pages.

Ericson Smith
Developer
http://www.funadvice.com


Re: why postgresql over other RDBMS

From: "Alexander Staubo"
On 5/24/07, Ron Johnson <ron.l.johnson@cox.net> wrote:
> > [2] Nobody else has this, I believe, except possibly Ingres and
> > NonStop SQL. This means you can do a "begin transaction", then issue
> > "create table", "alter table", etc. ad nauseum, and in the mean time
> > concurrent transactions will just work. Beautiful for atomically
> > upgrading a production server. Oracle, of course, commits after each
> > DDL statements.
>
> Rdb/VMS and CODASYL DBMS (both Oracle, formerly DEC, products) also
> have transactional DDL.

Not exactly mainstream databases, and Codasyl isn't even a relational
database as far as I remember.

> Interbase/Firebird probably also has transactional DDL.

Apparently. I'm surprised, given how InterBase and Firebird otherwise
feel so awfully ancient.

All right, so not unique, but very rare.

Alexander.

Re: why postgresql over other RDBMS

From: Erik Jones
On May 24, 2007, at 4:39 AM, Richard Huxton wrote:

> - unpronounceable name

post-gres-queue-el

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



Re: why postgresql over other RDBMS

From: Ron Johnson
On 05/24/07 12:48, Alexander Staubo wrote:
> On 5/24/07, Ron Johnson <ron.l.johnson@cox.net> wrote:
>> > [2] Nobody else has this, I believe, except possibly Ingres and
>> > NonStop SQL. This means you can do a "begin transaction", then issue
>> > "create table", "alter table", etc. ad nauseum, and in the mean time
>> > concurrent transactions will just work. Beautiful for atomically
>> > upgrading a production server. Oracle, of course, commits after each
>> > DDL statements.
>>
>> Rdb/VMS and CODASYL DBMS (both Oracle, formerly DEC, products) also
>> have transactional DDL.
>
> Not exactly mainstream databases, and Codasyl isn't even a relational
> database as far as I remember.

True, they are "legacy systems", but they are still actively developed
by Oracle for Alpha and Itanium VMS.  (The VAX products are actually
still maintained, but only for bug fixes.)

And yes, DBMS uses the network model.  (Since both it and Rdb use
the same back end and on-disk structures, it's not that much extra
effort to also keep actively developing DBMS.)

>> Interbase/Firebird probably also has transactional DDL.
>
> Apparently. I'm surprised, given how InterBase and Firebird otherwise
> feel so awfully ancient.

InterBase was developed by a man who worked for DEC at the same time
that Rdb was in the works.  His DECisms are still apparent in Firebird.

God, I'm going to hate it when we migrate to Oracle...

> All right, so not unique, but very rare.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: why postgresql over other RDBMS

From: "Leif B. Kristensen"
On Thursday 24. May 2007 19:57, Erik Jones wrote:
>On May 24, 2007, at 4:39 AM, Richard Huxton wrote:
>> - unpronounceable name
>
>post-gres-queue-el

Somebody probably wants to put that pot-grass-kewl thing in his pipe and
smoke it.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

Re: why postgresql over other RDBMS

From: Wiebe Cazemier
On Thursday 24 May 2007 17:30, Alexander Staubo wrote:

> [2] Nobody else has this, I believe, except possibly Ingres and
> NonStop SQL. This means you can do a "begin transaction", then issue
> "create table", "alter table", etc. ad nauseum, and in the mean time
> concurrent transactions will just work. Beautiful for atomically
> upgrading a production server. Oracle, of course, commits after each
> DDL statements.

If this is such a rare feature, I'm very glad we chose PostgreSQL. I use
it all the time, and wouldn't know what to do without it. We circumvented
Ruby on Rails' migrations and just implemented them in SQL. Writing
migrations is a breeze this way, and you don't have to hassle with
atomicity, or suffer the pain of discovering that a migration doesn't
work on the production server.
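
As a sketch of what such a hand-written migration can look like (table,
column, and version names invented), the whole change either applies or
vanishes:

    BEGIN;

    -- Schema change and data backfill succeed or fail as one unit.
    ALTER TABLE users ADD COLUMN email text;
    UPDATE users SET email = login || '@example.invalid' WHERE email IS NULL;
    ALTER TABLE users ALTER COLUMN email SET NOT NULL;

    -- Record the migration so it is never applied twice.
    INSERT INTO schema_migrations (version) VALUES ('20070524120000');

    COMMIT;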



Re: why postgresql over other RDBMS

From: "A.M."
On May 24, 2007, at 14:29 , Wiebe Cazemier wrote:

> On Thursday 24 May 2007 17:30, Alexander Staubo wrote:
>
>> [2] Nobody else has this, I believe, except possibly Ingres and
>> NonStop SQL. This means you can do a "begin transaction", then issue
>> "create table", "alter table", etc. ad nauseum, and in the mean time
>> concurrent transactions will just work. Beautiful for atomically
>> upgrading a production server. Oracle, of course, commits after each
>> DDL statements.
>
> If this is such a rare feature, I'm very glad we chose postgresql.
> I use it all
> the time, and wouldn't know what to do without it. We circumvented
> Ruby on
> Rails' migrations, and just implemented them in SQL. Writing
> migrations is a
> breeze this way, and you don't have to hassle with atomicity, or
> the pain when
> you discover the migration doesn't work on the production server.

Indeed. Wouldn't it be a cool feature to persist transaction state
across connections, so that a new connection could get access to a
sub-transaction state? That way, you could make your schema changes and
test them with any number of test clients (which designate the state
to connect with), and then you would commit when everything works.

Unfortunately, the PostgreSQL architecture wouldn't lend itself well
to this. Still, it seems like a basic extension of the notion of
subtransactions.

Cheers,
M

Re: why postgresql over other RDBMS

From: Alvaro Herrera
A.M. wrote:
>
> On May 24, 2007, at 14:29 , Wiebe Cazemier wrote:
>
> >On Thursday 24 May 2007 17:30, Alexander Staubo wrote:
> >
> >>[2] Nobody else has this, I believe, except possibly Ingres and
> >>NonStop SQL. This means you can do a "begin transaction", then issue
> >>"create table", "alter table", etc. ad nauseum, and in the mean time
> >>concurrent transactions will just work. Beautiful for atomically
> >>upgrading a production server. Oracle, of course, commits after each
> >>DDL statements.
> >
> >If this is such a rare feature, I'm very glad we chose postgresql.
> >I use it all
> >the time, and wouldn't know what to do without it. We circumvented
> >Ruby on
> >Rails' migrations, and just implemented them in SQL. Writing
> >migrations is a
> >breeze this way, and you don't have to hassle with atomicity, or
> >the pain when
> >you discover the migration doesn't work on the production server.
>
> Indeed. Wouldn't it be a cool feature to persists transaction states
> across connections so that a new connection could get access to a sub-
> transaction state? That way, you could make your schema changes and
> test them with any number of test clients (which designate the state
> to connect with) and then you would commit when everything works.
>
> Unfortunately, the postgresql architecture wouldn't lend itself well
> to this. Still, it seems like a basic extension of the notion of sub-
> transactions.

Hmm, doesn't this Just Work with two-phase commit?
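
(For reference, a minimal sketch of the two-phase commit syntax being
alluded to; the transaction name is invented, and the server must have
max_prepared_transactions set above zero:)

    -- Session 1 does the work, then detaches from it:
    BEGIN;
    CREATE TABLE new_feature (id integer);
    PREPARE TRANSACTION 'schema-upgrade-1';

    -- Any later session, even after session 1 disconnects, can finish it:
    COMMIT PREPARED 'schema-upgrade-1';
    -- ...or: ROLLBACK PREPARED 'schema-upgrade-1';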

--
Alvaro Herrera       Valdivia, Chile   ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
Oh-h-h, the Galacian girls will do it for pearls,
And the Arrakeen for water! But if you desire dames
like consuming flames, try a Caladanin daughter! (Gurney Halleck)

Re: why postgresql over other RDBMS

From: "A.M."
On May 24, 2007, at 15:57 , Alvaro Herrera wrote:

> A.M. wrote:
>>
>> Indeed. Wouldn't it be a cool feature to persists transaction states
>> across connections so that a new connection could get access to a
>> sub-
>> transaction state? That way, you could make your schema changes and
>> test them with any number of test clients (which designate the state
>> to connect with) and then you would commit when everything works.
>>
>> Unfortunately, the postgresql architecture wouldn't lend itself well
>> to this. Still, it seems like a basic extension of the notion of sub-
>> transactions.
>
> Hmm, doesn't this Just Work with two-phase commit?

2PC requires that the modifications already be set in concrete. What I
suggest is a method for a new connection to insert itself into an
existing (sub-)transaction SQL stream, make changes, and commit to
the root or parent transaction.

In the scenario where changes are pending, only one connection can
test the changes: it must be the connection that opened the
transaction. Concurrency issues cannot be tested before committing,
for example.

The implementation could be as simple as decoupling connections from
transactions; then a connection could make serialized requests to
other backends. A proof-of-concept could certainly be cobbled
together with pipes and pl/perl, but the real beef would be the
ability to "pass" responsibility for transactions from one connection
to another.

Cheers,
M

Re: why postgresql over other RDBMS

From: "Alexander Staubo"
We're derailing the thread, but...

On 5/24/07, A.M. <agentm@themactionfaction.com> wrote:
> 2PC requires that the modifications already be in concrete. What I
> suggest is a method for a new connection to insert itself into an
> existing (sub-)transaction SQL stream, make changes, and commit to
> the root or parent transaction.

The problem with long-running transactions is that they need to avoid
locking the resources they touch. Short-running transactions are bad
enough as they stand -- until fairly recently (8.1? 8.2?), merely
inserting or updating a row that had a foreign-key reference to
another table row would lock the referenced row until the end of the
transaction, by issuing an implicit "select ... for update".

Although a mechanism whereby multiple connections can share a single
session/transaction is probably easy to implement, using long-running
transactions to isolate DDL changes is not feasible at the moment
because PostgreSQL currently acquires an AccessExclusiveLock on the
modified table until the transaction ends, which means that concurrent
transactions would be blocked from even querying the table.
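
To make that concrete, a sketch of the blocking behaviour just
described (table name invented):

    -- Session A:
    BEGIN;
    ALTER TABLE orders ADD COLUMN note text;  -- takes an ACCESS EXCLUSIVE lock
    -- (transaction deliberately left open)

    -- Session B, meanwhile:
    SELECT count(*) FROM orders;  -- blocks until session A commits or aborts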

I don't know PostgreSQL's internals, so I can only postulate that this
locking occurs because PostgreSQL holds a single copy of the schema
and related bookkeeping structures in memory.

Alexander.

Re: why postgresql over other RDBMS

From: PFC
>> Indeed. Wouldn't it be a cool feature to persists transaction states
>> across connections so that a new connection could get access to a sub-
>> transaction state? That way, you could make your schema changes and
>> test them with any number of test clients (which designate the state
>> to connect with) and then you would commit when everything works.

    Actually you can hack this by starting your webserver with only 1 thread,
using persistent connections, and disabling all commits in the application.
    But I'd call that "a very hackish hack".

Re: why postgresql over other RDBMS

From: "A.M."
On May 24, 2007, at 18:12 , PFC wrote:

>
>>> Indeed. Wouldn't it be a cool feature to persists transaction states
>>> across connections so that a new connection could get access to a
>>> sub-
>>> transaction state? That way, you could make your schema changes and
>>> test them with any number of test clients (which designate the state
>>> to connect with) and then you would commit when everything works.
>
>     Actually you can hack this by starting your webserver with only 1
> thread, use persistent connections, and disable all commits in the
> application.
>     But I'd call that "a very hackish hack".

Not really: then I can't use any transactions, in which case I might
as well use MySQL. I would like to be able to pass transaction state
between connections.

-M


Re: why postgresql over other RDBMS

From: Chris Browne
agentm@themactionfaction.com ("A.M.") writes:
> On May 24, 2007, at 14:29 , Wiebe Cazemier wrote:
>
>> On Thursday 24 May 2007 17:30, Alexander Staubo wrote:
>>
>>> [2] Nobody else has this, I believe, except possibly Ingres and
>>> NonStop SQL. This means you can do a "begin transaction", then issue
>>> "create table", "alter table", etc. ad nauseum, and in the mean time
>>> concurrent transactions will just work. Beautiful for atomically
>>> upgrading a production server. Oracle, of course, commits after each
>>> DDL statements.
>>
>> If this is such a rare feature, I'm very glad we chose postgresql.
>> I use it all
>> the time, and wouldn't know what to do without it. We circumvented
>> Ruby on
>> Rails' migrations, and just implemented them in SQL. Writing
>> migrations is a
>> breeze this way, and you don't have to hassle with atomicity, or
>> the pain when
>> you discover the migration doesn't work on the production server.
>
> Indeed. Wouldn't it be a cool feature to persists transaction states
> across connections so that a new connection could get access to a sub-
> transaction state? That way, you could make your schema changes and
> test them with any number of test clients (which designate the state
> to connect with) and then you would commit when everything works.
>
> Unfortunately, the postgresql architecture wouldn't lend itself well
> to this. Still, it seems like a basic extension of the notion of sub-
> transactions.

Jan Wieck had a proposal to a similar effect, namely to give some way
to get one connection to duplicate the state of another one.

This would permit doing a neat parallel decomposition of pg_dump: you
could do a 4-way parallelization of it that would function something
like the following:

- connection 1 opens, establishes the usual serialized mode transaction

- connection 1 dumps the table metadata into one or more files in a
  specified directory

- then it forks 3 more connections, and seeds them with the same
  serialized mode state

- it then goes through and can dump 4 tables concurrently at a time,
  one apiece to a file in the directory.

This could considerably improve speed of dumps, possibly of restores,
too.

Note that this isn't related to subtransactions...
--
output = reverse("ofni.secnanifxunil" "@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/internet.html
"Unless  there really are  chrono-synclastic infundibula. (Roll on the
Church of God the Utterly Indifferent...)" -- Will Deakin

Re: why postgresql over other RDBMS

From: Thomas Kellerer
Alexander Staubo wrote on 24.05.2007 17:30:
> [2] Nobody else has this, I believe, except possibly Ingres and
> NonStop SQL. This means you can do a "begin transaction", then issue
> "create table", "alter table", etc. ad nauseum, and in the mean time
> concurrent transactions will just work. Beautiful for atomically
> upgrading a production server. Oracle, of course, commits after each
> DDL statements.

I do have to commit a CREATE TABLE in SQL Server (Express) 2005 (and I believe
in 2000 as well), and I can rollback a DROP TABLE.
I haven't checked how this behaves with concurrent access though.

Thomas



Re: why postgresql over other RDBMS

From: Alvaro Herrera
Chris Browne wrote:
> agentm@themactionfaction.com ("A.M.") writes:

> Jan Wieck had a proposal to a similar effect, namely to give some way
> to get one connection to duplicate the state of another one.
>
> This would permit doing a neat parallel decomposition of pg_dump: you
> could do a 4-way parallelization of it that would function something
> like the following:

Interesting ...

> Note that this isn't related to subtransactions...

I think what needs to happen is that the snapshot is duplicated.
Probably not that hard to do actually ... the difficult thing is coming
up with an API that makes sense.

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Estoy de acuerdo contigo en que la verdad absoluta no existe...
El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)

Re: why postgresql over other RDBMS

From: "A.M."
On May 24, 2007, at 18:21 , Chris Browne wrote:
>
> Jan Wieck had a proposal to a similar effect, namely to give some way
> to get one connection to duplicate the state of another one.
>
> This would permit doing a neat parallel decomposition of pg_dump: you
> could do a 4-way parallelization of it that would function something
> like the following:
>
> - connection 1 opens, establishes the usual serialized mode
> transaction
>
> - connection 1 dumps the table metadata into one or more files in a
>   specified directory
>
> - then it forks 3 more connections, and seeds them with the same
>   serialized mode state
>
> - it then goes thru and can dump 4 tables concurrently at a time,
>   one apiece to a file in the directory.
>
> This could considerably improve speed of dumps, possibly of restores,
> too.
>
> Note that this isn't related to subtransactions...

Well, I was thinking that since transactions are now serializable, it
should be possible to move the state between existing open transactions.

-M

Parallel backups (was Re: why postgresql over other RDBMS)

From: Ron Johnson
On 05/24/07 17:21, Chris Browne wrote:
[snip]
>
> This would permit doing a neat parallel decomposition of pg_dump: you
> could do a 4-way parallelization of it that would function something
> like the following:
>
> - connection 1 opens, establishes the usual serialized mode transaction
>
> - connection 1 dumps the table metadata into one or more files in a
>   specified directory
>
> - then it forks 3 more connections, and seeds them with the same
>   serialized mode state
>
> - it then goes thru and can dump 4 tables concurrently at a time,
>   one apiece to a file in the directory.
>
> This could considerably improve speed of dumps, possibly of restores,
> too.

What about a master thread that "establishes the usual serialized
mode transaction" and then issues N asynchronous requests to the
database and, as they return with data, pipes the data to N
corresponding "writer" threads?  Matching N to the number of tape
drives comes to mind.

Yes, the master thread would be the choke point, but CPUs and RAM
are still a heck of a lot faster than disks, so maybe it wouldn't be
such a problem after all.

Of course, if libpq(??) doesn't handle async IO, then it's not such
a good idea after all.

> Note that this isn't related to subtransactions...

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: why postgresql over other RDBMS

From: Andrew Sullivan
On Thu, May 24, 2007 at 03:25:52PM -0400, A.M. wrote:
> Wouldn't it be a cool feature to persists transaction states
> across connections so that a new connection could get access to a sub-
> transaction state?

You could do this using an incredibly evil, carefully implemented
hack in a connection pool.  I'm shuddering at the thought of it, to
be honest, so details are left as an exercise for the reader.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
A certain description of men are for getting out of debt, yet are
against all taxes for raising money to pay it off.
        --Alexander Hamilton

Re: why postgresql over other RDBMS

From: "A.M."
On May 24, 2007, at 20:39 , Andrew Sullivan wrote:

> On Thu, May 24, 2007 at 03:25:52PM -0400, A.M. wrote:
>> Wouldn't it be a cool feature to persists transaction states
>> across connections so that a new connection could get access to a
>> sub-
>> transaction state?
>
> You could do this using an incredibly evil, carefully implemented
> hack in a connection pool.  I'm shuddering at the thought of it, to
> be honest, so details are left as an exervisse for the reader.

Actually, a sample implementation could be done using stored
procedures and some IPC. It would, however, require that the receiver
poll for requests; the API would probably look very similar to dblink.

-M

Re: why postgresql over other RDBMS

From: Tom Allison
I think the general topic got lost in the weeds there...

But it would be helpful if you identified what you care about in a
database.  That drives a lot of these decisions.

Example: my brother uses MySQL and I use PostgreSQL.

His reasons for MySQL:
- More books on it at the bookstore.
- It was already a default build in Rails.
- Came installed on his SuSE build.
- Didn't like the pg_hba.conf file and preferred to do all the
  permissions via SQL tables.

My reasons for PostgreSQL:
- Very focused on consistent data and not losing information when
  things go wrong.
- The performance curve as you add users is much flatter than MySQL's.
  Under MySQL you have better speed with ~3 users, but at 50 users
  PostgreSQL is a clear winner on my hardware.  In general I found it
  to scale more smoothly, without a sweet spot.
- I prefer pg_hba.conf over tables.
- Zero maintenance & great tools.
- It's closer to the SQL standard, so what I do on Oracle I can
  generally do on PostgreSQL and vice versa.

None of this has anything to do with Oracle or SQL Server, but these
are examples of what is important to each of us.

What's important to you?

On May 24, 2007, at 4:06 AM, Jasbinder Singh Bali wrote:

> Hi
> I was wondering, apart from extensive procedural language support
> and being free,
> what are other major advantages of Postgresql over other major
> RDBMS like oracle and sql server.
>
> Any pointers would be highly appreciated.
>
> Thanks,
> ~Jas


Re: why postgresql over other RDBMS

From: Erik Jones
On May 24, 2007, at 5:21 PM, Chris Browne wrote:

> agentm@themactionfaction.com ("A.M.") writes:
> [snip]
>
> Jan Wieck had a proposal to a similar effect, namely to give some way
> to get one connection to duplicate the state of another one.
>
> This would permit doing a neat parallel decomposition of pg_dump: you
> could do a 4-way parallelization of it that would function something
> like the following:
>
> - connection 1 opens, establishes the usual serialized mode
> transaction
>
> - connection 1 dumps the table metadata into one or more files in a
>   specified directory
>
> - then it forks 3 more connections, and seeds them with the same
>   serialized mode state
>
> - it then goes thru and can dump 4 tables concurrently at a time,
>   one apiece to a file in the directory.
>
> This could considerably improve speed of dumps, possibly of restores,
> too.
>
> Note that this isn't related to subtransactions...

Interesting.  That's actually pretty close to the reindexing
strategy/script that I use, and I've been planning on extending it to
a vacuum strategy.  So, I'll add my support for someone building this
kind of functionality into pg_dump/restore.

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



Re: why postgresql over other RDBMS

From: Chris Browne
erik@myemma.com (Erik Jones) writes:
> On May 24, 2007, at 5:21 PM, Chris Browne wrote:
>> Jan Wieck had a proposal to a similar effect, namely to give some way
>> to get one connection to duplicate the state of another one.
>>
>> This would permit doing a neat parallel decomposition of pg_dump: you
>> could do a 4-way parallelization of it that would function something
>> like the following [elided]:
>
> Interesting.  That's actually pretty close to the reindexing strategy/
> script that I use and I've been planning on extending it to a vacuum
> strategy.  So, I will add my support into someone building this kind
> of support into pg_dump/restore.

Well, I think that particular idea is dead for 8.3, as there wasn't
agreement that there were enough relevant use-cases.

If discussion gets bombarded with "yes, yes, that's useful for me
too!" responses the next time it gets proposed, then that will
increase the chances of acceptance.

We seem to be suffering, as the community and the patch queue grow,
from the problem that features regarded as useful only to small sets
of users are seeing greater reluctance for acceptance.
--
output = ("cbbrowne" "@" "acm.org")
http://linuxdatabases.info/info/linuxxian.html
"Is your pencil Y2K certified?  Do you know the possible effects if it
isn't?"

Re: why postgresql over other RDBMS

From: Andrew Sullivan
On Fri, May 25, 2007 at 12:36:20PM -0400, Chris Browne wrote:
> We seem to be suffering, as the community, and patch queue, grows,
> from the problem that features that are regarded as being useful only
> to small sets of users are seeing greater reluctance for acceptance.

Another way of expressing that is to regard it as a positive benefit:
given finite numbers of developers, testers, and patch reviewers, we
as a community have to make decisions about how big a feature set we
can realistically support, and the value that contributes to the user
community.  A small potential user community probably means a lower
estimation of the value of the feature.  So features that seem sort
of boutique are to be regarded at least with scepticism, in order to
keep the code useful for everyone.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
When my information changes, I alter my conclusions.  What do you do sir?
        --attr. John Maynard Keynes

Re: why postgresql over other RDBMS

From: Ron Johnson
On 05/25/07 12:18, Andrew Sullivan wrote:
> On Fri, May 25, 2007 at 12:36:20PM -0400, Chris Browne wrote:
>> We seem to be suffering, as the community, and patch queue, grows,
>> from the problem that features that are regarded as being useful only
>> to small sets of users are seeing greater reluctance for acceptance.
>
> Another way of expressing that regards it as a positive benefit:
> given finite numbers of developers, testers, and patch reviewers, we
> as a community have to make decisions about how big a feature set we
> can realistically support, and the value that contributes to the user
> community.  A small potential user community probably means a lower
> estimation of the value of the feature.  So features that seem sort
> of boutique are to be regarded at least with scepticism, in order to
> keep the code useful for everyone.

Except that seemingly "boutique" features can be road-blocks to
implementing projects, which means that you never hear from those users.

In my case, there are two such road-blocks:
1. transaction failure on statement failure[0], and
2. single-threaded backups[1].


[0] Savepoints are a work-around, but there's a lot of existing code
that would have to be modified.  And you need a savepoint for every
INSERT and UPDATE.

[1] Tarballing data directories and saving WAL files works around
that, but a pg_dump file is, in itself, a transactionally consistent
database.  Shipping a bunch of tarballs and WALs to the development
team is much more complicated than a single (or multiple, if that
ever comes to pass) dump file.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: why postgresql over other RDBMS

From: Andrew Sullivan
On Fri, May 25, 2007 at 12:59:25PM -0500, Ron Johnson wrote:
> Except that seemingly "boutique" features can be road-blocks to
> implementing projects, which means that you never hear from them.

Yes.  That's a risk that free software projects take, alas.  If you
can't force your users to tell you what they're doing, then you can't
be sure you have the right picture of what they're doing.

> 1. transaction failure on statement failure[0], and

I personally regard that as a feature, not a bug, so I'd be opposed
to changing it.

> 2. single-threaded backups[1].

This one has always seemed like a nasty limitation to me, but given
the desire (which you have, apparently) to get transaction-bounds
consistency and the design of postgres, I don't see an easy path to
fixing this.

Something that is important to note, however, is that a group of
users who have enough of a desire for a feature that they code it up,
offer to support it, and make sure it gets reviewed (this one is
important!) are going to have an easier time of it.  What this means
is that users need to dedicate resources to things that aren't
obviously in their own immediate interests, like patch review, so
that when later they say, "This patch works," there is a stronger
probability the community will take seriously their claims that the
patch works correctly.  Free software never comes without cost, and
this is one of them.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Information security isn't a technological problem.  It's an economics
problem.
        --Bruce Schneier

Re: why postgresql over other RDBMS

From: Ron Johnson
On 05/25/07 13:09, Andrew Sullivan wrote:
> On Fri, May 25, 2007 at 12:59:25PM -0500, Ron Johnson wrote:
>> Except that seemingly "boutique" features can be road-blocks to
>> implementing projects, which means that you never hear from them.
>
> Yes.  That's a risk that free software projects take, alas.  If you
> can't force your users to tell you what they're doing, then you can't
> be sure you have the right picture of what they're doing.
>
>> 1. transaction failure on statement failure[0], and
>
> I personally regard that as a feature, not a bug, so I'd be opposed
> to changing it.

Why?  Not failing the whole transaction lets me decide how to handle
that *one* statement error, without having to code for retrying the
whole transaction.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: why postgresql over other RDBMS

From: Andrew Sullivan
On Fri, May 25, 2007 at 01:26:02PM -0500, Ron Johnson wrote:
> >> 1. transaction failure on statement failure[0], and
> >
> > I personally regard that as a feature, not a bug, so I'd be opposed
> > to changing it.
>
> Why?  Not failing the whole transaction lets me decide how to handle
> that *one* statement error, without having to code for retrying the
> whole transaction.

Because it's entailed by the definition of atomicity.  Codd put it
this way: "Such a block constitutes a transaction if, during its
execution, either all parts succeed or none succeeds."  If you want
to get around this, you can put a subtransaction around every
statement.

You say you don't want to do the latter, but there's no reason your
client couldn't do it for you, in much the same way we have
AUTOCOMMIT modes.  Indeed, PL/pgSQL actually does this sort of trick
in order to get exception handling.
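
A minimal sketch of that per-statement subtransaction trick (table name
invented; assume id is the primary key):

    BEGIN;
    INSERT INTO accounts (id, balance) VALUES (1, 100);

    SAVEPOINT before_risky;
    INSERT INTO accounts (id, balance) VALUES (1, 200);  -- fails: duplicate key
    ROLLBACK TO SAVEPOINT before_risky;                  -- rescues the transaction

    -- The first INSERT survives and can still be committed.
    COMMIT;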

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of saying
November.
        --H.W. Fowler

Re: why postgresql over other RDBMS

From: Erik Jones
On May 25, 2007, at 11:36 AM, Chris Browne wrote:

> erik@myemma.com (Erik Jones) writes:
>> On May 24, 2007, at 5:21 PM, Chris Browne wrote:
>>> Jan Wieck had a proposal to a similar effect, namely to give some
>>> way
>>> to get one connection to duplicate the state of another one.
>>>
>>> This would permit doing a neat parallel decomposition of pg_dump:
>>> you
>>> could do a 4-way parallelization of it that would function something
>>> like the following [elided]:
>>
>> Interesting.  That's actually pretty close to the reindexing
>> strategy/
>> script that I use and I've been planning on extending it to a vacuum
>> strategy.  So, I will add my support into someone building this kind
>> of support into pg_dump/restore.
>
> Well, I think that particular idea is dead for 8.3, as there wasn't
> agreement that there were enough relevant use-cases.
>
> If discussion gets bombarded with "yes, yes, that's useful for me
> too!" responses the next time it gets proposed, then that will
> increase the chances of acceptance.
>
> We seem to be suffering, as the community, and patch queue, grows,
> from the problem that features that are regarded as being useful only
> to small sets of users are seeing greater reluctance for acceptance.
> --

Well, in the current context, I'm less interested in shared
transaction state across processes and more in the ability to speed
up dumps with processes working in parallel.  However, given that
shared transaction state would be necessary for a transactionally
consistent dump, I guess you could say that my interest there lies in
it as a means to an end.

As far as use cases go, it's really only useful for large databases
(in our case, shortening our dump time to significantly less than 12
hours), and from what I've heard, until recently postgres hadn't seen
much of a user base with seriously large databases, at least ones that
share their stats with the community.  Now that postgres has garnered
a reputation as a rock-solid database, and more and more books and
online tutorials are pushing postgres (as opposed to the previous de
facto choice, mysql), that will most definitely change.

And, to finish up, is there any reason that pg_restore couldn't
already work with separate processes working in parallel?

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



Re: why postgresql over other RDBMS

From: Alvaro Herrera
Erik Jones wrote:

> And, to finish up, is there any reason that pg_restore couldn't
> already work with separate processes working in parallel?

The problem is that the ordering of objects in the dump is the only
thing that makes the dump consistent with regard to the dependencies
of objects.  So pg_restore cannot make any assumptions about the
parallelisability of restoring the objects in the dump.

pg_dump is the only one that has the dependency information.

If that information were saved in the dump, then maybe pg_restore
could work in parallel.  But it seems a fairly non-trivial thing to do.

Mind you, while I'm idly toying with this idea, it seems that just
having multiple processes generating a dump is not such a hot idea by
itself, because you then have no clue about how to order the
restoration of the multiple files that are going to result.

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"Para tener más hay que desear menos"

Re: why postgresql over other RDBMS

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Erik Jones wrote:
>> And, to finish up, is there any reason that pg_restore couldn't
>> already work with separate processes working in parallel?

> The problem is that the ordering of objects in the dump is the only
> thing that makes the dump consistent with regards to the dependencies of
> objects.  So pg_restore cannot make any assumptions of parallelisability
> of the restoration process of objects in the dump.

That's true at the level of DDL operations, but AFAIK we could
parallelize table-loading and index-creation steps pretty effectively
--- and that's where all the time goes.

A more interesting question is what sort of hardware you need for that
actually to be a win, though.  Loading a few tables in parallel sounds
like an ideal recipe for oversaturating your disk bandwidth...

            regards, tom lane

Re: why postgresql over other RDBMS

From: Andrew Sullivan
On Fri, May 25, 2007 at 05:28:43PM -0400, Tom Lane wrote:
> That's true at the level of DDL operations, but AFAIK we could
> parallelize table-loading and index-creation steps pretty effectively
> --- and that's where all the time goes.

I made a presentation at OSCON a few years ago about how we did it
that way when we imported .org.  We had limited time to work in, and
we had to do a lot of validation, so getting the data in quickly was
important.  So we split the data files up into segments and loaded
them in parallel (Chris Browne did most of the implementation of
this.)  It was pretty helpful for loading, anyway.

> A more interesting question is what sort of hardware you need for that
> actually to be a win, though.  Loading a few tables in parallel sounds
> like an ideal recipe for oversaturating your disk bandwidth...

Right, you need to be prepared for that.  But of course, if you're in
the situation where you have to get a given database up and running,
who cares about the disk bandwidth? -- you don't have the database
running yet.  The kind of system that is busy enough to have that
size of database and that urgency of recovery is also the kind that
is likely to have dedicated storage hardware for that database.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful
than reformatting your hard drive when it gets out of whack.
        --Scott Morris

Re: why postgresql over other RDBMS

From: Chris Browne
alvherre@commandprompt.com (Alvaro Herrera) writes:
> Erik Jones wrote:
>
>> And, to finish up, is there any reason that pg_restore couldn't
>> already work with separate processes working in parallel?
>
> The problem is that the ordering of objects in the dump is the only
> thing that makes the dump consistent with regards to the dependencies of
> objects.  So pg_restore cannot make any assumptions of parallelisability
> of the restoration process of objects in the dump.
>
> pg_dump is the only one who has the dependency information.
>
> If that information were to be saved in the dump, then maybe pg_restore
> could work in parallel.  But it seems a fairly non-trivial thing to do.
>
> Mind you, while I am idling at this idea, it seems that just having
> multiple processes generating a dump is not such a hot idea by itself,
> because you then have no clue on how to order the restoration of the
> multiple files that are going to result.

I think it's less bad than you think.

The really time-consuming bits of "pg_restore" are:

1. the loading of table data
2. creating indices on those tables
[distant] 3. setting up R/I constraints

If you look at the present structure of pg_dump output, those are all
pretty visibly separate steps.

pg_dump output [loosely] consists of:
- Type definitions & such
- Table definitions
- loading table data  (e.g. - 1)
- stored function definitions
- indices             (e.g. - parts of 2)
- primary keys        (e.g. - the rest of 2)
- triggers + rules    (including 3)

Thus, a "parallel load" would start by doing some things in a serial
fashion, namely creating types and tables.  This isn't a
parallelizable step, but so what?  It shouldn't take very long.

The parallel load can load as many tables concurrently as you choose;
since there are no indices or R/I triggers, those are immaterial
factors.

Generating indices and primary keys could, again, be parallelized
pretty heavily, and have (potentially) heavy benefit.

Furthermore, an interesting thing to do might be to use the same
approach that Slony-I does, at present, for subscriptions.  It
temporarily deactivates triggers and indices while loading the data,
then reactivates them, and requests a re-index.  That would permit
loading the *entire* schema, save for data, and then loading and
indexing with pretty much the maximum possible efficiency.
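
A hedged sketch of the trigger half of that trick (names invented;
ALTER TABLE ... DISABLE TRIGGER exists as of 8.1, and the index half is
only approximated here by reindexing afterwards):

    -- Load data with the table's triggers (including R/I triggers) off:
    ALTER TABLE orders DISABLE TRIGGER ALL;
    COPY orders FROM '/backup/orders.dat';
    ALTER TABLE orders ENABLE TRIGGER ALL;

    -- Rebuild the table's indexes in one pass after the bulk load:
    REINDEX TABLE orders;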

That seems like a not-completely-frightening "SMOP" (simple matter of
programming).  Not completely trivial, but not frighteningly
non-trivial...
--
(format nil "~S@~S" "cbbrowne" "cbbrowne.com")
http://linuxdatabases.info/info/linuxdistributions.html
Developmental Psychology
"Schoolyard behavior resembles adult primate behavior because "Ontogeny
Recapitulates Phylogeny" doesn't stop at birth."
-- Mark Miller

Re: why postgresql over other RDBMS

From: Ron Johnson
On 05/25/07 23:02, Chris Browne wrote:
> alvherre@commandprompt.com (Alvaro Herrera) writes:
>> [snip]
>
> I think it's less bad than you think.
>
> The really timeconsuming bits of "pg_restore" are:
>
> 1. the loading of table data
> 2. creating indices on those tables
> [distant] 3. setting up R/I constraints
>
> If you look at the present structure of pg_dump output, those are all
> pretty visibly separate steps.
>
> pg_dump output [loosely] consists of:
> - Type definitions & such
> - Table definitions
> - loading table data  (e.g. - 1)
> - stored function definitions
> - indices             (e.g. - parts of 2)
> - primary keys        (e.g. - the rest of 2)
> - triggers + rules    (including 3)
>
> Thus, a "parallel load" would start by doing some things in a serial
> fashion, namely creating types and tables.  This isn't a
> parallelizable step, but so what?  It shouldn't take very long.

Which would be sped up by having pg_dump create multiple output files.

Of course, as I see it, this is only of real benefit when you are
using tablespaces spread across multiple RAID devices on a SAN or
multiple SCSI cards.  But then, organizations with lots of data
usually have that kind of h/w.

> The parallel load can load as many tables concurrently as you choose;
> since there are no indices or R/I triggers, those are immaterial
> factors.
>
> Generating indices and primary keys could, again, be parallelized
> pretty heavily, and have (potentially) heavy benefit.
>
> Furthermore, an interesting thing to do might be to use the same
> approach that Slony-I does, at present, for subscriptions.  It
> temporarily deactivates triggers and indices while loading the data,
> then reactivates them, and requests a re-index.  That would permit
> loading the *entire* schema, save for data, and then load and index
> with fairly much maximum possible efficiency.
>
> That seems like a not-completely-frightening "SMOP" (simple matter of
> programming).  Not completely trivial, but not frighteningly
> non-trivial...

pg_dump would have to be smart enough to rationally split the data
into N output files, and that would get tricky (impossible?) if most
of your data is in one *huge* unpartitioned table in a single
tablespace.  Que sera.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: why postgresql over other RDBMS

From: Stefan Kaltenbrunner
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Erik Jones wrote:
>>> And, to finish up, is there any reason that pg_restore couldn't
>>> already work with separate processes working in parallel?
>
>> The problem is that the ordering of objects in the dump is the only
>> thing that makes the dump consistent with regards to the dependencies of
>> objects.  So pg_restore cannot make any assumptions of parallelisability
>> of the restoration process of objects in the dump.
>
> That's true at the level of DDL operations, but AFAIK we could
> parallelize table-loading and index-creation steps pretty effectively
> --- and that's where all the time goes.

Yes, loading the data and creating the indexes are the most
time-consuming parts of a large dump-and-reload cycle.

>
> A more interesting question is what sort of hardware you need for that
> actually to be a win, though.  Loading a few tables in parallel sounds
> like an ideal recipe for oversaturating your disk bandwidth...

You don't actually need that much disk bandwidth: both COPY and CREATE
INDEX are CPU-bottlenecked on modern boxes with reasonable disk
subsystems, so spreading their work over multiple cores/processes can
give big benefits.
For example, I have managed to load ~2B rows (5 integer columns, no
indexes) at a rate of about 320,000 rows/s on a modern (but already
1.5-year-old) 4-core 2.6GHz Opteron box (with 12 disks and BBWC, IIRC).
Using 4 concurrent processes to load the data resulted in about
930,000 rows/s (hitting the disk I/O limit at that rate).
So having the ability to parallelize those operations at both the dump
and the restore level would be a huge win.
A manual experiment I did a while back (i.e. splitting the dump by hand
and feeding it in parallel with a concurrency of 2) on a copy of a
production database brought the restore time down from 3h+ to a bit
less than 2 hours.
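
For the curious, a minimal sketch of that manual split-and-feed
approach, assuming the COPY data for the big table has been extracted
to a flat file (the file names, table name and line count are
illustrative only):

    split -l 500000 data.txt part.                     # carve the data into chunks
    for f in part.*; do
        psql -d mydb -c "\copy big_table from '$f'" &  # one loader per core
    done
    wait                                               # all chunks loaded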


Stefan

Re: why postgresql over other RDBMS

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> A more interesting question is what sort of hardware you need for that
>> actually to be a win, though.  Loading a few tables in parallel sounds
>> like an ideal recipe for oversaturating your disk bandwidth...

> you don't actually need that much of disk bandwidth both COPY and CREATE
> INDEX are CPU bottlenecked on modern boxes and reasonable disk
> subsystems - spreading their work over multiple cores/processes can give
> big benefits.

Hmm ... I wonder if that's true for COPY BINARY ...

            regards, tom lane

Re: why postgresql over other RDBMS

From
"Harpreet Dhaliwal"
Date:
Is the host-based configuration methodology in Postgres superior to
other RDBMS? Is this something novel that Postgres has come up with?

~Harpreet

On 5/26/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Stefan Kaltenbrunner < stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> A more interesting question is what sort of hardware you need for that
>> actually to be a win, though.  Loading a few tables in parallel sounds
>> like an ideal recipe for oversaturating your disk bandwidth...

> you don't actually need that much of disk bandwidth both COPY and CREATE
> INDEX are CPU bottlenecked on modern boxes and reasonable disk
> subsystems - spreading their work over multiple cores/processes can give
> big benefits.

Hmm ... I wonder if that's true for COPY BINARY ...

                        regards, tom lane


Re: why postgresql over other RDBMS

From
Zoltan Boszormenyi
Date:
If you ask me, yes. When I had to choose between MySQL 3.x and
PostgreSQL 6.5 a long time ago, and I was able to lock the DB superuser
out with REVOKE CONNECT in MySQL, I said "no, thanks".
I did it on purpose to prove that you can the external configuration
is better in this case.

And apart from fixing pg_hba.conf after you move the machine,
PostgreSQL is quite location-agnostic network-wise.
You can modify the IP address[es] and FQDN of the machine,
which is not easily doable if you use e.g. Informix, where the hostname
is stored deep inside the DB and some subsystems break if it changes.
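
To illustrate what "external" buys you: the access rules live in a plain
text file read by the server, not inside the database itself, so you can
always edit them and reload, even with every user locked out.  A
hypothetical pg_hba.conf fragment (addresses and auth methods are
examples only):

    # TYPE  DATABASE  USER      CIDR-ADDRESS     METHOD
    local   all       postgres                   trust   # superuser via local socket
    host    all       all       192.168.0.0/24   md5     # LAN clients, password auth
    host    all       all       0.0.0.0/0        reject  # refuse everyone else

After editing the file, "pg_ctl reload" puts the new rules into
effect - no SQL, and no working login, required.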

Harpreet Dhaliwal wrote:
> is the host base configuration methodology in postgres superior to
> other RDBMS.
> is this something novel that postgres has come up with?
>
> ~Harpreet
>
> On 5/26/07, * Tom Lane* <tgl@sss.pgh.pa.us <mailto:tgl@sss.pgh.pa.us>>
> wrote:
>
>     Stefan Kaltenbrunner < stefan@kaltenbrunner.cc
>     <mailto:stefan@kaltenbrunner.cc>> writes:
>     > Tom Lane wrote:
>     >> A more interesting question is what sort of hardware you need
>     for that
>     >> actually to be a win, though.  Loading a few tables in parallel
>     sounds
>     >> like an ideal recipe for oversaturating your disk bandwidth...
>
>     > you don't actually need that much of disk bandwidth both COPY
>     and CREATE
>     > INDEX are CPU bottlenecked on modern boxes and reasonable disk
>     > subsystems - spreading their work over multiple cores/processes
>     can give
>     > big benefits.
>
>     Hmm ... I wonder if that's true for COPY BINARY ...
>
>                             regards, tom lane
>


--
----------------------------------
Zoltán Böszörményi
Cybertec Geschwinde & Schönig GmbH
http://www.postgresql.at/


Re: why postgresql over other RDBMS

From
Zoltan Boszormenyi
Date:
Zoltan Boszormenyi wrote:
> If you ask me, yes. When I had to choose between MySQL 3.x and
> PostgreSQL 6.5 a long ago and I was able to exclude the DB superuser
> with REVOKE CONNECT from MySQL, I said "no, thanks".
> I did it on purpose to prove that you can the external configuration
> is better in this case.

I wanted to write "you can reenable the superuser to fix problems later,
so the external configuration is better".

And sorry for the top-posting.

> And apart from fixing pg_hba.conf after you move the machine,
> PostgreSQL is quite location agnostic network-wise.
> You can modify the IP address[es] and FQDN of  the machine,
> which is not easily doable if you use e.g. Informix where the hostname
> is stored deep inside the DB and some subsystems break if it changes.
>
> Harpreet Dhaliwal wrote:
>> is the host base configuration methodology in postgres superior to
>> other RDBMS.
>> is this something novel that postgres has come up with?
>>
>> ~Harpreet
>>
>> On 5/26/07, * Tom Lane* <tgl@sss.pgh.pa.us
>> <mailto:tgl@sss.pgh.pa.us>> wrote:
>>
>>     Stefan Kaltenbrunner < stefan@kaltenbrunner.cc
>>     <mailto:stefan@kaltenbrunner.cc>> writes:
>>     > Tom Lane wrote:
>>     >> A more interesting question is what sort of hardware you need
>>     for that
>>     >> actually to be a win, though.  Loading a few tables in parallel
>>     sounds
>>     >> like an ideal recipe for oversaturating your disk bandwidth...
>>
>>     > you don't actually need that much of disk bandwidth both COPY
>>     and CREATE
>>     > INDEX are CPU bottlenecked on modern boxes and reasonable disk
>>     > subsystems - spreading their work over multiple cores/processes
>>     can give
>>     > big benefits.
>>
>>     Hmm ... I wonder if that's true for COPY BINARY ...
>>
>>                             regards, tom lane
>>
>
>


--
----------------------------------
Zoltán Böszörményi
Cybertec Geschwinde & Schönig GmbH
http://www.postgresql.at/


Re: why postgresql over other RDBMS

From
Tom Lane
Date:
Zoltan Boszormenyi <zb@cybertec.at> writes:
Harpreet Dhaliwal wrote:
>> is the host base configuration methodology in postgres superior to
>> other RDBMS.

> If you ask me, yes. When I had to choose between MySQL 3.x and
> PostgreSQL 6.5 a long ago and I was able to exclude the DB superuser
> with REVOKE CONNECT from MySQL, I said "no, thanks".
> I did it on purpose to prove that you can the external configuration
> is better in this case.

> And apart from fixing pg_hba.conf after you move the machine,
> PostgreSQL is quite location agnostic network-wise.

MySQL has a related problem, which is that they have embedded IPv4
addressing rather deeply into their client authentication logic (by
making userids be user@host not just a username).  This is probably why
they still haven't got IPv6 support:
http://bugs.mysql.com/bug.php?id=8836
I wonder what their plans are for fixing that ...

            regards, tom lane

Re: why postgresql over other RDBMS

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> A more interesting question is what sort of hardware you need for that
>>> actually to be a win, though.  Loading a few tables in parallel sounds
>>> like an ideal recipe for oversaturating your disk bandwidth...
>
>> you don't actually need that much of disk bandwidth both COPY and CREATE
>> INDEX are CPU bottlenecked on modern boxes and reasonable disk
>> subsystems - spreading their work over multiple cores/processes can give
>> big benefits.
>
> Hmm ... I wonder if that's true for COPY BINARY ...

Not sure on that - I was simply trying to say that even a simple
parallel dump & restore capability could result in a serious improvement
for people running large databases (especially considering that one can
now buy 1U boxes with 8+ cores, or 2U boxes that can hold 14+ disks) ;-)


Stefan

Re: why postgresql over other RDBMS

From
PFC
Date:

> MySQL has a related problem, which is that they have embedded IPv4
> addressing rather deeply into their client authentication logic (by
> making userids be user@host not just a username).  This is probably why
> they still haven't got IPv6 support:
> http://bugs.mysql.com/bug.php?id=8836
> I wonder what their plans are for fixing that ...
>
>             regards, tom lane

    Well, I have now almost finished my spare-time forum benchmark of MySQL
versus Postgres...
    It was a really interesting experience, and by turning a few stones I
discovered some really nasty stuff about MySQL... well, about InnoDB
really. Here's one that you probably didn't know about:

    - auto_increment isn't concurrent: all inserts into a table which
has an auto_increment column are serialized
    - yes, this means INSERT INTO ... SELECT locks out all other inserts
    - the lock is held during ON INSERT triggers, which means the INSERT
triggers on a given table can't execute concurrently
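
    To make the first point concrete, here's roughly what that
serialization looks like (hypothetical table names; this illustrates
the behaviour described above, not a claim about every InnoDB version):

    -- session 1: holds the table's auto-increment lock for the whole statement
    INSERT INTO posts_archive SELECT * FROM posts;

    -- session 2, at the same time: blocks until session 1 finishes,
    -- merely because posts_archive has an auto_increment column
    INSERT INTO posts_archive (title) VALUES ('hello');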

Re: why postgresql over other RDBMS

From
Lincoln Yeoh
Date:
At 03:25 AM 5/25/2007, A.M. wrote:

>Indeed. Wouldn't it be a cool feature to persist transaction states
>across connections so that a new connection could get access to a
>sub-transaction state? That way, you could make your schema changes and
>test them with any number of test clients (which designate the state
>to connect with) and then you would commit when everything works.
>
>Unfortunately, the postgresql architecture wouldn't lend itself well
>to this. Still, it seems like a basic extension of the notion of
>sub-transactions.

I've proposed this for postgresql before (7 years ago?), but for a
different reason. Didn't seem to get much interest though.

The idea was that people wouldn't have to reinvent/reimplement
transactions for stuff like webapps. Of course you might have to have
a separate database for "shopping cart" or other "persistent"
transactions, so that outstanding transactions don't cause lots of
unrelated transient rows to be kept unreclaimable.

And also I was thinking it would be good to decouple the maximum
number of concurrent transactions supported by postgresql from the
maximum number of concurrent connections supported.

Issues: what should happen if multiple connections try to continue
the _same_ transaction.

Also should keep in mind clustering support though - this might
create interesting problems and features in a clustering scenario :).

Regards,
Link.


Re: why postgresql over other RDBMS

From
Andrew Sullivan
Date:
On Sat, May 26, 2007 at 02:33:05PM -0400, Tom Lane wrote:
> making userids be user@host not just a username).  This is probably why
> they still haven't got IPv6 support:
> http://bugs.mysql.com/bug.php?id=8836
> I wonder what their plans are for fixing that ...

Panic at the last second, like everyone else?  It's actually
remarkable how many systems still have poor or dodgy IPv6 support.
And the management tools for IPv6 remain sort of awful, given the
immense address space that people are going to have to deal with
(there are 128 bits in an IPv6 address).  But I guess MySQL AB will
have to figure out something soon, since even ARIN is on the IPv6
wagon now:

http://www.arin.net/announcements/20070521.html

(In the American-continents Internet operations world, this has
caused quite a kerfuffle, because many people have mostly ignored
IPv6 for some time.  And there are a lot of grotty corners to IPv6
for which there are no analogies in IPv4.  Uh, scoped addresses,
anyone ;-)

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Information security isn't a technological problem.  It's an economics
problem.
        --Bruce Schneier

Re: why postgresql over other RDBMS

From
Vivek Khera
Date:
On May 25, 2007, at 5:28 PM, Tom Lane wrote:

> That's true at the level of DDL operations, but AFAIK we could
> parallelize table-loading and index-creation steps pretty effectively
> --- and that's where all the time goes.

I would be happy with parallel builds of the indexes of a given
table.  That way you have just one scan of the whole table to build
all its indexes.


Re: why postgresql over other RDBMS

From
PFC
Date:
On Thu, 31 May 2007 22:20:09 +0200, Vivek Khera <vivek@khera.org> wrote:

>
> On May 25, 2007, at 5:28 PM, Tom Lane wrote:
>
>> That's true at the level of DDL operations, but AFAIK we could
>> parallelize table-loading and index-creation steps pretty effectively
>> --- and that's where all the time goes.
>
> I would be happy with parallel builds of the indexes of a given table.
> That way you have just one scan of the whole table to build all its
> indexes.

    Will the synchronized seq scan patch be able to do this by issuing all
the CREATE INDEX commands at the same time from several different database
connections?

Re: why postgresql over other RDBMS

From
PFC
Date:
On Thu, 31 May 2007 23:36:32 +0200, PFC <lists@peufeu.com> wrote:

> On Thu, 31 May 2007 22:20:09 +0200, Vivek Khera <vivek@khera.org> wrote:
>
>>
>> On May 25, 2007, at 5:28 PM, Tom Lane wrote:
>>
>>> That's true at the level of DDL operations, but AFAIK we could
>>> parallelize table-loading and index-creation steps pretty effectively
>>> --- and that's where all the time goes.
>>
>> I would be happy with parallel builds of the indexes of a given table.
>> That way you have just one scan of the whole table to build all its
>> indexes.

    Just did a test:

    - large table (does not fit in RAM)
    - rows with a text column (forum posts)
    - about 700K rows

Time to create 3 indexes: 61 s
Time to create 3 indexes with 3 simultaneous connections: 22 s

    That's what you would expect...
    vmstat shows the data really is loaded from disk: once with the 3
threads, 3 times when the indexes are created one at a time.
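
    Something like the following reproduces the setup (database, index
and column names are illustrative):

    # three connections, started together so their seq scans overlap
    psql -d forum -c "CREATE INDEX posts_author_idx ON posts(author_id)" &
    psql -d forum -c "CREATE INDEX posts_topic_idx ON posts(topic_id)" &
    psql -d forum -c "CREATE INDEX posts_date_idx ON posts(posted_at)" &
    wait    # all three builds share (roughly) one pass over the table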

Re: why postgresql over other RDBMS

From
Bruce Momjian
Date:
PFC wrote:
> On Thu, 31 May 2007 22:20:09 +0200, Vivek Khera <vivek@khera.org> wrote:
>
> >
> > On May 25, 2007, at 5:28 PM, Tom Lane wrote:
> >
> >> That's true at the level of DDL operations, but AFAIK we could
> >> parallelize table-loading and index-creation steps pretty effectively
> >> --- and that's where all the time goes.
> >
> > I would be happy with parallel builds of the indexes of a given table.
> > That way you have just one scan of the whole table to build all its
> > indexes.
>
>     Will the synchronized seq scan patch be able to do this by issuing all
> the CREATE INDEX commands at the same time from several different database
> connections ?

No, but it could someday.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: why postgresql over other RDBMS

From
Ron Johnson
Date:
On 06/01/07 11:22, Bruce Momjian wrote:
> PFC wrote:
>> On Thu, 31 May 2007 22:20:09 +0200, Vivek Khera <vivek@khera.org> wrote:
>>
>>> On May 25, 2007, at 5:28 PM, Tom Lane wrote:
>>>
>>>> That's true at the level of DDL operations, but AFAIK we could
>>>> parallelize table-loading and index-creation steps pretty effectively
>>>> --- and that's where all the time goes.
>>> I would be happy with parallel builds of the indexes of a given table.
>>> That way you have just one scan of the whole table to build all its
>>> indexes.
>>     Will the synchronized seq scan patch be able to do this by issuing all
>> the CREATE INDEX commands at the same time from several different database
>> connections ?
>
> No, but it could someday.

Or would a "CREATE MANY INDEXES" (where in one statement you specify
all the indexes on a single table) command be easier to implement?

This way also the process reads the table once, building separate
sortwork files on-the-fly.  Too bad child processes can't inherit
transaction state.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: why postgresql over other RDBMS

From
PFC
Date:
>>>     Will the synchronized seq scan patch be able to do this by issuing
>>> all  the CREATE INDEX commands at the same time from several different
>>> database  connections ?
>>  No, but it could someday.

    Actually, I tested it, and it works right now, albeit unconsciously (pg
doesn't do anything to synchronize the scans, but if you launch the
concurrent connections at the same time and issue all your "create index"
commands at the same time, only 1 table scan is needed). Maybe if the
tables were bigger, it would lose sync between the 3 concurrent scans and
end up going slower. That's why I spoke about the "synchronized scan" patch.

> Or would a "CREATE MANY INDEXES" (where in one statement you specify all
> the indexes on a single table) command be easier to implement?

    You can get the effect right now by using concurrent connections, it
seems. Not very practical in a psql script, though...

Re: why postgresql over other RDBMS

From
Ron Johnson
Date:
On 06/01/07 16:38, PFC wrote:
>
>>>>     Will the synchronized seq scan patch be able to do this by
>>>> issuing all  the CREATE INDEX commands at the same time from several
>>>> different database  connections ?
>>>  No, but it could someday.
>
>     Actually I tested, it does it right now, albeit unconsciously (pg
> doesn't do anything to synchronize the scans, but if you launch the
> concurrent connections at the same time and issue all your "create
> index" at the same time, only 1 table scan is needed). Maybe if the
> tables were bigger, it would lose sync between the 3 concurrent scans
> and would end up going slower. That's why I spoke about the
> "synchronized scan" patch.

How much of this, though, is from the OS's disk cache?  Or do seq
scans use O_DIRECT and bypass the OS cache?

>> Or would a "CREATE MANY INDEXES" (where in one statement you specify
>> all the indexes on a single table) command be easier to implement?
>
>     You can get the effect right now by using concurrent connections it
> seems. Not very practical in a psql script, though...

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: why postgresql over other RDBMS

From
PFC
Date:
On Sat, 02 Jun 2007 00:14:28 +0200, Ron Johnson <ron.l.johnson@cox.net>
wrote:

> On 06/01/07 16:38, PFC wrote:
>>
>>>>>     Will the synchronized seq scan patch be able to do this by
>>>>> issuing all  the CREATE INDEX commands at the same time from several
>>>>> different database  connections ?
>>>>  No, but it could someday.
>>      Actually I tested, it does it right now, albeit unconsciously (pg
>> doesn't do anything to synchronize the scans, but if you launch the
>> concurrent connections at the same time and issue all your "create
>> index" at the same time, only 1 table scan is needed). Maybe if the
>> tables were bigger, it would lose sync between the 3 concurrent scans
>> and would end up going slower. That's why I spoke about the
>> "synchronized scan" patch.
>
> How much of this, though, is from the OS's disk cache?  Or are Seq Scans
> O_DIRECT and bypass the OS cache?


    Well, the file was larger than the disk cache, and I checked vmstat's
count of actual bytes read from disk...
    Three threads read the table once; one thread reads the table 3 times.

    So it works right now, except that it doesn't (yet) have the
infrastructure to keep the scans synchronized, and psql can't open
several connections (yet).

Re: why postgresql over other RDBMS

From
Jeremy Harris
Date:
On 06/01/07 11:22, Bruce Momjian wrote:
> PFC wrote:
>> On Thu, 31 May 2007 22:20:09 +0200, Vivek Khera <vivek@khera.org> wrote:
>>
>>> On May 25, 2007, at 5:28 PM, Tom Lane wrote:
>>>
>>>> That's true at the level of DDL operations, but AFAIK we could
>>>> parallelize table-loading and index-creation steps pretty effectively
>>>> --- and that's where all the time goes.
>>> I would be happy with parallel builds of the indexes of a given table.
>>> That way you have just one scan of the whole table to build all its
>>> indexes.
>>     Will the synchronized seq scan patch be able to do this by issuing all
>> the CREATE INDEX commands at the same time from several different database
>> connections ?
>
> No, but it could someday.

Would it be possible to track statistics sufficient for a
cost/benefit-based automatic rebuild of all indexes on a table whenever
a full-table scan occurs, whether due to a commanded index rebuild or not?

Cheers,
   Jeremy Harris

Re: why postgresql over other RDBMS

From
Scott Ribe
Date:
> So it works right now, except it doesn't have (yet) the infrastructure to
> keep the scans synchronized

Perhaps you only got one read of the table because the process is
essentially self-synchronizing. Whenever one process "gets ahead", it
requires a disk read for the next page, which causes it to block for a
relatively long time, during which time the other two processes either
proceed reading rows from cache, or come to the end of the cache and block
waiting for the same page to be read from disk. Obviously not a guarantee,
as indexing a relatively more expensive type COULD cause one process to get
multiple pages behind, and memory usage by other processes COULD cause
intervening pages to be flushed from cache. But I have a suspicion that the
experiment was not just a happy fluke, that there will be a strong tendency
for multiple simultaneous index operations to stay sufficiently closely
synch'd that the table will only be read from disk once. (Especially when
such operations are done while the database is otherwise quiescent, as would
be the typical case during a restore.)

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



Re: why postgresql over other RDBMS

From
PFC
Date:
    I believe you have made quite a good description of what happens.

    Index-building isn't very CPU-intensive for integers (geometric types
and tsearch are another matter, of course), so building all the indexes
of a large table in one pass is a possibility that works now, provided
you issue all the CREATE INDEX commands in concurrent connections at
roughly the same time.

    I don't think pg_restore does this, though.


>> So it works right now, except it doesn't have (yet) the infrastructure
>> to keep the scans synchronized
>
> Perhaps you only got one read of the table because the process is
> essentially self-synchronizing. Whenever one process "gets ahead", it
> requires a disk read for the next page, which causes it to block for a
> relatively long time, during which time the other two processes either
> proceed reading rows from cache, or come to the end of the cache and
> block waiting for the same page to be read from disk. Obviously not a
> guarantee, as indexing a relatively more expensive type COULD cause one
> process to get multiple pages behind, and memory usage by other
> processes COULD cause intervening pages to be flushed from cache. But I
> have a suspicion that the experiment was not just a happy fluke, that
> there will be a strong tendency for multiple simultaneous index
> operations to stay sufficiently closely synch'd that the table will
> only be read from disk once. (Especially when such operations are done
> while the database is otherwise quiescent, as would be the typical case
> during a restore.)
>



Re: why postgresql over other RDBMS

From
Tom Lane
Date:
Scott Ribe <scott_ribe@killerbytes.com> writes:
>> So it works right now, except it doesn't have (yet) the infrastructure to
>> keep the scans synchronized

> Perhaps you only got one read of the table because the process is
> essentially self-synchronizing.

Right.  Multiple seqscans that are anywhere near reading the same block
of a table will tend to self-synchronize.  There is a patch under
consideration for 8.3 that helps this along by making seqscans run
"circularly" --- that is, not always from block 0 to block N, but from
block M to N and then 0 to M-1, where the start point M can be chosen
by looking to see where any other concurrent seqscan is presently
reading.  Once you've got a reasonable start point, you don't have to do
anything else.

            regards, tom lane

Re: why postgresql over other RDBMS

From
Alban Hertroys
Date:
Tom Lane wrote:
> Right.  Multiple seqscans that are anywhere near reading the same block
> of a table will tend to self-synchronize.  There is a patch under
> consideration for 8.3 that helps this along by making seqscans run
> "circularly" --- that is, not always from block 0 to block N, but from
> block M to N and then 0 to M-1, where the start point M can be chosen
> by looking to see where any other concurrent seqscan is presently
> reading.  Once you've got a reasonable start point, you don't have to do
> anything else.
>
>             regards, tom lane

Interesting concept (as expected from you guys).

Would that imply that the sequential scan of one connection could place
data into the disk cache that another parallel seq scan would need soon?
Would that speed up parallel seq scans? Or am I being optimistic here?

Regards,
--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //

Re: why postgresql over other RDBMS

From
Bruce Momjian
Date:
Added to TODO:

* Allow multiple indexes to be created concurrently, ideally via a
  single heap scan, and have a restore of a pg_dump somehow use it

  http://archives.postgresql.org/pgsql-general/2007-05/msg01274.php


---------------------------------------------------------------------------

Ron Johnson wrote:
> On 06/01/07 11:22, Bruce Momjian wrote:
> > PFC wrote:
> >> On Thu, 31 May 2007 22:20:09 +0200, Vivek Khera <vivek@khera.org> wrote:
> >>
> >>> On May 25, 2007, at 5:28 PM, Tom Lane wrote:
> >>>
> >>>> That's true at the level of DDL operations, but AFAIK we could
> >>>> parallelize table-loading and index-creation steps pretty effectively
> >>>> --- and that's where all the time goes.
> >>> I would be happy with parallel builds of the indexes of a given table.
> >>> That way you have just one scan of the whole table to build all its
> >>> indexes.
> >>     Will the synchronized seq scan patch be able to do this by issuing all
> >> the CREATE INDEX commands at the same time from several different database
> >> connections ?
> >
> > No, but it could someday.
>
> Or would a "CREATE MANY INDEXES" (where in one statement you specify
> all the indexes on a single table) command be easier to implement?
>
> This way also the process reads the table once, building separate
> sortwork files on-the-fly.  Too bad child processes can't inherit
> transaction state.
>
> --
> Ron Johnson, Jr.
> Jefferson LA  USA
>
> Give a man a fish, and he eats for a day.
> Hit him with a fish, and he goes away for good!
>
>

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: why postgresql over other RDBMS

From
"Matthew T. O'Connor"
Date:
Bruce Momjian wrote:
> Added to TODO:
>
> * Allow multiple indexes to be created concurrently, ideally via a
>   single heap scan, and have a restore of a pg_dump somehow use it
>
>   http://archives.postgresql.org/pgsql-general/2007-05/msg01274.php


Would it not also make sense to use this ability for a
non-index-specific REINDEX command?

Re: why postgresql over other RDBMS

From
Bruce Momjian
Date:
Matthew T. O'Connor wrote:
> Bruce Momjian wrote:
> > Added to TODO:
> >
> > * Allow multiple indexes to be created concurrently, ideally via a
> >   single heap scan, and have a restore of a pg_dump somehow use it
> >
> >   http://archives.postgresql.org/pgsql-general/2007-05/msg01274.php
>
>
> Would it not also make sense to use this ability for a
> non-index-specific REINDEX command?

Not sure, but I suppose.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: why postgresql over other RDBMS

From
Gregory Stark
Date:
"Bruce Momjian" <bruce@momjian.us> writes:

> Matthew T. O'Connor wrote:
>> Bruce Momjian wrote:
>> >
>> > * Allow multiple indexes to be created concurrently, ideally via a
>> >   single heap scan, and have a restore of a pg_dump somehow use it

Actually, the sync scan patch ought to make this more or less happen
magically. If you start a bunch of concurrent index builds they will try to
scan the table together.

There's no useful way for pg_dump to make use of this, since it only has
one backend. And you still need to generate n copies of the data for
sorting. And performing n sorts in parallel won't be as cache-efficient
as doing them one after the other. So there's still a use case for the
TODO.

But the hole is not nearly as urgent as before. You can get most of the
benefit, if you really need it, by rolling your own. And the cool thing
is that some people have already rolled their own, and they'll just
magically see an improvement. They don't have to do anything they
weren't already doing to turn it on.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: why postgresql over other RDBMS

From
Bruce Momjian
Date:
Gregory Stark wrote:
> "Bruce Momjian" <bruce@momjian.us> writes:
>
> > Matthew T. O'Connor wrote:
> >> Bruce Momjian wrote:
> >> >
> >> > * Allow multiple indexes to be created concurrently, ideally via a
> >> >   single heap scan, and have a restore of a pg_dump somehow use it
>
> Actually, the sync scan patch ought to make this more or less happen
> magically. If you start a bunch of concurrent index builds they will try to
> scan the table together.
>
> There's no useful way for pg_dump to make use of this since it only has one
> backend. And you still need to generate n copies of the data for sorting. And
> performing n sorts in parallel won't be as cache efficient as doing them one
> after the other. So there's still a use case for the TODO
>
> But the hole is not nearly as urgent as before. You can get most of the
> benefit if you really need it by rolling your own. And the cool thing is some
> people already have rolled their own and they'll just magically see an
> improvement. They don't have to do anything they weren't doing already to turn
> it on.

They could roll their own a lot easier if you had finished the psql
concurrent patch.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: why postgresql over other RDBMS

From
Gregory Stark
Date:
"Bruce Momjian" <bruce@momjian.us> writes:

> They could roll their own a lot easier if you had finished the psql
> concurrent patch.

I did. But you decided you didn't want it.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: why postgresql over other RDBMS

From
Bruce Momjian
Date:
Gregory Stark wrote:
>
> "Bruce Momjian" <bruce@momjian.us> writes:
>
> > They could roll their own a lot easier if you had finished the psql
> > concurrent patch.
>
> I did. But you decided you didn't want it.

As far as I know, we asked for a libpq API change and you ignored
multiple requests.  You want the URLs?

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: why postgresql over other RDBMS

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Gregory Stark wrote:
> >
> > "Bruce Momjian" <bruce@momjian.us> writes:
> >
> > > They could roll their own a lot easier if you had finished the psql
> > > concurrent patch.
> >
> > I did. But you decided you didn't want it.
>
> As far as I know, we asked for a libpq API change and you ignored
> multiple requests.  You want the URLs?

Never mind.  You ignored requests.  I am not going to address this
further.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: why postgresql over other RDBMS

From
Gregory Stark
Date:
"Bruce Momjian" <bruce@momjian.us> writes:

>> As far as I know, we asked for a libpq API change and you ignored
>> multiple requests.  You want the URLs?

No. One suggestion was made regarding an internal technical issue
(polling versus using select/poll on the sockets). I made that change
long ago, though I had questions about the best way to do it which were
never answered, so even the way I made that change might not be
acceptable.

In any case, this was just one suggestion, based on discussion that
happened to turn up on-list, made without actually reading the rest of
the code. If you're interested in reviewing the patch, I'm sure you
would find dozens of issues. I would be happy to rework it along
whatever lines you want.

But I would prefer to see people focus on reviewing major features like
HOT, clustered indexes, and GII (which I would suggest calling
index-organized tables, since that's effectively what they are).

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: why postgresql over other RDBMS

From
Naz Gassiep
Date:
Surely such a use case could, and more to the point *should*, be met using PITR?
Regards,
- Naz.

Alvaro Herrera wrote:
> A.M. wrote:
>> On May 24, 2007, at 14:29 , Wiebe Cazemier wrote:
>>> On Thursday 24 May 2007 17:30, Alexander Staubo wrote:
>>>> [2] Nobody else has this, I believe, except possibly Ingres and
>>>> NonStop SQL. This means you can do a "begin transaction", then issue
>>>> "create table", "alter table", etc. ad nauseam, and in the mean time
>>>> concurrent transactions will just work. Beautiful for atomically
>>>> upgrading a production server. Oracle, of course, commits after each
>>>> DDL statement.
>>> If this is such a rare feature, I'm very glad we chose postgresql. I
>>> use it all the time, and wouldn't know what to do without it. We
>>> circumvented Ruby on Rails' migrations, and just implemented them in
>>> SQL. Writing migrations is a breeze this way, and you don't have to
>>> hassle with atomicity, or the pain when you discover the migration
>>> doesn't work on the production server.
>> Indeed. Wouldn't it be a cool feature to persist transaction states
>> across connections so that a new connection could get access to a
>> sub-transaction state? That way, you could make your schema changes
>> and test them with any number of test clients (which designate the
>> state to connect with) and then you would commit when everything
>> works.
>>
>> Unfortunately, the postgresql architecture wouldn't lend itself well
>> to this. Still, it seems like a basic extension of the notion of
>> sub-transactions.
> Hmm, doesn't this Just Work with two-phase commit?