Thread: RE: Database cluster?

RE: Database cluster?

From
"Nathan Barnett"
Date:
This is done through what is called shared-nothing clustering, which is
employed by IBM's DB2 and Microsoft SQL Server 2000.  Either of these
products will work in the manner that you are looking for.

---------------------------------------------
Nathan Barnett

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of Gordan Bobic
Sent: Thursday, November 30, 2000 4:02 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database cluster?


> > > I am considering splitting the database into tables residing on
> > > separate machines, and connect them on one master node.
> > >
> > > The question I have is:
> > >
> > > 1) How can I do this using PostgreSQL?
> >
> > You can't.
>
>    I'll jump in with a bit more info.  Splitting tables across multiple
> machines would do nothing more than make the entire system run at a
> snail's pace.  Yes, it would slow it down immensely, because you just
> couldn't move data between machines quickly enough.

I don't believe that is the case. In my case, queries typically return
comparatively small amounts of data. Around 100 records at most. The amount
of data that needs to be transferred is comparatively small, and even over
10 Mb ethernet, it would take at most about a second to transfer. This is a
much smaller delay than the query time itself, which can take 10 seconds or
more. Remember that I said there are tables with over 30M records? Doing
multi-table joins on things like this takes a long time...

So, splitting the data in such a way that one table is queried, and then
tables joined from it are queried in parallel, would cause a significant
speed-up.

For example, say we have tables T1, T2 and T3.

T1 has fields F1.1, F1.2 and F1.3. T2 has F2.1 and T3 has F3.1 (at least -
probably lots of other fields, too).

Say I want to do
SELECT *
FROM T1, T2, T3
WHERE F1.1 = F2.1 AND F1.2 = F3.1 AND F1.3 = 'somedata';

Then F1.3 could be searched for 'somedata'. When the records are found,
this could be cross-matched remotely, in parallel for F1.1=F2.1 and
F1.2=F3.1, on different machines.
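
To make this concrete, here's a rough Python sketch of how a coordinating
node might fire off the two cross-matches in parallel. The query() helper,
the node names and the flattened column names (F2_1, F3_1) are all made up
for illustration - this is not an existing PostgreSQL facility:

from concurrent.futures import ThreadPoolExecutor

def query(node, sql):
    # Stand-in for a per-node connection; a real version would use a
    # PostgreSQL client library to run sql on the given machine.
    raise NotImplementedError

def cross_match(matches):
    # 'matches' holds the (F1.1, F1.2) pairs found by scanning T1 locally
    # for F1.3 = 'somedata'; probe T2 and T3 on their own machines in
    # parallel, then let the caller stitch the row sets together.
    in_t2 = ",".join(repr(m[0]) for m in matches)
    in_t3 = ",".join(repr(m[1]) for m in matches)
    with ThreadPoolExecutor() as pool:
        t2 = pool.submit(query, "node2", "SELECT * FROM T2 WHERE F2_1 IN (%s)" % in_t2)
        t3 = pool.submit(query, "node3", "SELECT * FROM T3 WHERE F3_1 IN (%s)" % in_t3)
        return t2.result(), t3.result()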

This means that depending on the type, configuration and the usage of the
database, a potentially massive improvement in performance could be
achieved, especially on multi-table joins which span lots of BIG tables.

Somebody mentioned the fact that postgres uses IPC for communicating
between processes. I think there are tools for clustering (I am not sure if
Mosix supports transparently allowing IPC across nodes) which can work
around that.

>   Why?  Well, whenever you join two tables that are on different
> machines, the tables have to go across whatever sort of connection you
> have between the machines.  Even if you use gigabit ethernet, you are
> still running at a mere fraction of the bandwidth of the computer's
> internal bus - and at orders of magnitude greater latency.  You'd have
> lots of CPU's sitting around, doing absolutely nothing, waiting for data
> to come across the wire.

Gigabit ethernet has around the same bandwidth as the PCI bus. I suppose it
all depends on what machine you have running this. This would be true in
the case that the database server is a nice big Alpha with several CPUs.

>    There are alternatives, such as IP-over-SCSI.  That reduces the
> latency of ethernet quite a bit, and gives you much more bandwidth (say,
> up to 160 megabytes/second).  However, that's still a pittance compared
> to the main system bus inside your computer.

But SCSI is still 160 MB/s burst (not sustained, unless you're using very
expensive arrays). And Gigabit ethernet is 133 MB/s, albeit with greater
latency.

>     That's one of the greatest hurdles to distributed computing.  That's
> why the applications that are best adapted to distributed computing are
> those that don't require much data over the wire - which certainly
> doesn't apply to databases. : )

I think it depends whether the amount of data is the problem, or fitting it
together.

Somebody please explain to me further why I am wrong in all this?

Regards.

Gordan



Re: Database cluster?

From
"Gordan Bobic"
Date:
Thanks.

I have just had another thought. If all the tables are split across several
computers, this would help as well.

For example, if we have 100 records and 2 database servers, each server
could have 50 of those 100 records on it. When a selection is required,
each server would look through its much smaller database, and report back
the "hits". This would, effectively, provide a near linear speedup in the
query time, while introducing only a minor network overhead (or a major
one, depending on how much data is transferred).
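
As a sketch of how that split might be decided - the modulo rule below is
just one illustrative scheme, not a real PostgreSQL feature:

def home_server(record_id, n_servers=2):
    # With 100 records and 2 servers, each server ends up holding ~50
    # records; a key lookup only needs the one server the key maps to,
    # while a full search fans out to all servers at once.
    return record_id % n_servers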

Some extra logic could then be implemented for related tables that would
allow the most closely related records from the different tables to be
"clustered" (as in kind of remotely similar to the CLUSTER command) on the
same server, for faster response time and minimized network usage
requirements. The "vacuum" or "cluster" features could be used overnight to
re-optimize the distribution of records across the servers.

In all this, a "master" node could be used for coordinating the whole
operation. We could ask the master node to do a query, and it would
automatically, knowing what slaves it has, fire off that query on them.
Each slave would then in parallel, execute a query, and return a subset of
the data we were looking for. This data would then be joined into one
recordset before it is returned to the client that requested it.
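
A rough Python sketch of that master/slave flow, with an invented query()
helper standing in for a real per-node connection (none of this exists in
PostgreSQL today):

from concurrent.futures import ThreadPoolExecutor

SLAVES = ["slave1", "slave2"]  # hypothetical host names

def query(node, sql):
    raise NotImplementedError  # stand-in for a real per-node connection

def scatter_gather(sql):
    with ThreadPoolExecutor(max_workers=len(SLAVES)) as pool:
        # Fire the same query at every slave in parallel...
        partials = list(pool.map(lambda node: query(node, sql), SLAVES))
    # ...then join the partial result sets into one recordset for the client.
    return [row for partial in partials for row in partial]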

As far as I can see, as long as the amounts of data shifted aren't huge
enough to cause problems with network congestion, and the query time is
dominant over the data transfer time across the network, this should
provide a rather scalable system. I understand that the form of database
clustering I am mentioning here is fairly rudimentary and unsophisticated,
but it would certainly be a very useful feature.

Are there any plans to implement this sort of functionality in PostgreSQL?
Or is this a lot more complicated than it seems...

Regards.

Gordan

----- Original Message -----
From: "Nathan Barnett" <nbarnett@cellularphones.com>
To: "'Gordan Bobic'" <gordan@freeuk.com>; <pgsql-general@postgresql.org>
Sent: Thursday, November 30, 2000 2:34 PM
Subject: RE: [GENERAL] Database cluster?


> This is done through what is called shared-nothing clustering, which is
> employed by IBM's DB2 and Microsoft SQL Server 2000.  Either of these
> products will work in the manner that you are looking for.
>
> [snip]


Re: Database cluster?

From
Doug Semig
Date:
You're almost describing a Teradata DBM.

What an amazing machine!  Last I heard about 6 years ago, though, AT&T was
rewriting it as an NT app instead of running on proprietary hardware.  The
proprietary hardware was essentially a cluster of 80486 computers (at the
time).

What they had done was implemented a pyramid structure of 80486 computers.
The lowest level of computers had hard disks and stored the data.  Two of
the lowest level computers would "report" to a single higher up computer.
Two of these higher up computers would "report" to yet another single
higher up computer until there was only one higher up computer to report to.

The thing that impressed me the most about this architecture was that
sorting was practically built in.  So all an intermediary computer had to
do was merge the sorted result sets from its lower level computers.  Blazing!
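
In other words, each intermediary node only has to do a linear merge of two
already-sorted streams. A tiny Python sketch of that step (illustrative
only, obviously not the Teradata code):

import heapq

def merge_children(left_rows, right_rows):
    # Both inputs arrive sorted from the two child nodes, so the merge is
    # a single O(n) pass - no re-sorting at any level of the pyramid.
    return list(heapq.merge(left_rows, right_rows))

print(merge_children([1, 4, 9], [2, 3, 10]))  # [1, 2, 3, 4, 9, 10]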

And data was stored on a couple of leaf-level computers for redundancy.

I miss that old beast.  But I certainly cannot afford the multimillion
dollars required to get one for myself.  We lovingly called the one we
worked with the "Pteradactyl," which is the old name for that bird-like
dinosaur (evidently there's a new name for the bird-like dinosaur, the
pteranodon or something?).

Doug

At 02:44 PM 11/30/00 -0000, Gordan Bobic wrote:
>[snip]



Re: Database cluster?

From
"Gordan Bobic"
Date:
> You're almost describing a Teradata DBM.

I knew someone must have thought of it before. ;-)

[snip]

> The thing that impressed me the most about this architecture was that
> sorting was practically built in.  So all an intermediary computer had
> to do was merge the sorted result sets from its lower level computers.
> Blazing!

They effectively implemented a binary tree in hardware. One hell of an
indexing mechanism. :-)

> I miss that old beast.  But I certainly cannot afford the multimillion
> dollars required to get one for myself.

I suppose it would depend on how many computers you want to have in this
cluster. The main reason why clusters have been getting popular recently
(albeit not yet for databases, or so it would seem) is that they are
cheaper than anything else with similar performance.

The main question remains - are there any plans to implement something
similar to this with PostgreSQL? I would volunteer to help with some
coding, if a "group" was formed to work on this "clustering" module.

Regards.

Gordan


Re: Database cluster?

From
Doug Semig
Date:
I actually analyzed it once.  I came to the conclusion that to do it right
it would be easier to make an almost entirely new db but use the same
external interfaces as PostgreSQL.  To do a kludge of it, one might just
implement a tier that sits between the user and a bunch of standard
PostgreSQL backends.

It'd make a neat companion project, though.  Like PG/Enterprise or
PG/Warehouse or something.

Doug

At 04:02 PM 11/30/00 -0000, Gordan Bobic wrote:
>[snip]



Re: Database cluster?

From
Peter Korsgaard
Date:
On Thu, 30 Nov 2000, Doug Semig wrote:

> I actually analyzed it once.  I came to the conclusion that to do it right
> it would be easier to make an almost entirely new db but use the same
> external interfaces as PostgreSQL.  To do a kludge of it, one might just
> implement a tier that sits between the user and a bunch of standard
> PostgreSQL backends.
>
> It'd make a neat companion project, though.  Like PG/Enterprise or
> PG/Warehouse or something.

I'm currently developing a simple version of such a system as a
university project. It is a fairly simple approach, with a proxy or
distributor in front of a bunch of standard postgresql database servers.

The proxy monitors and forwards the requests from the clients to the
database servers. If it is a read-only request, the query is forwarded to
the database server currently experiencing the lowest load/most free
memory; otherwise it is sent to all database servers.

This approach obviously only performs well in systems with a high ratio of
read-only queries, such as search engines and so on.
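
The core routing rule is tiny. A hedged Python sketch (the SELECT-prefix
test and the load map are simplifications invented here, not our actual
code):

def route(sql, servers, load):
    # Read-only queries go to the single least-loaded server; anything
    # that modifies data is broadcast so all the copies stay identical.
    if sql.lstrip().upper().startswith("SELECT"):
        return [min(servers, key=lambda s: load[s])]
    return list(servers)

print(route("SELECT 1", ["db1", "db2"], {"db1": 0.9, "db2": 0.1}))
# -> ['db2']
print(route("INSERT ...", ["db1", "db2"], {"db1": 0.9, "db2": 0.1}))
# -> ['db1', 'db2']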

--
Bye, Peter Korsgaard


Re: Database cluster?

From
"Steve Wolfe"
Date:
> The proxy monitors and forwards the requests from the clients to the
> database servers. If it is a read-only request, the query is forwarded to
> the database server currently experiencing the lowest load/most free
> memory; otherwise it is sent to all database servers.
>
> This approach obviously only performs well in systems with a high ratio of
> read-only queries, such as search engines and so on.

  The tough part is synchronisation, should one of the machines drop out of
the cluster and need to be re-added without bringing the others down.  In
order to get around that, each query needs to be logged on the master node
with a timestamp, so that the failed node can "catch up" in real-time.
That brings about other considerations, as well....
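
A sketch of that logging idea in Python (the data structures are invented
for illustration, not an existing facility):

import time

query_log = []  # (timestamp, sql) pairs, kept on the master node

def log_write(sql):
    # Every data-modifying query is recorded before being broadcast.
    query_log.append((time.time(), sql))

def catch_up(last_applied_ts, replay):
    # A node rejoining the cluster replays everything logged since the
    # last query it is known to have applied; 'replay' executes one
    # statement on the recovering node.
    for ts, sql in query_log:
        if ts > last_applied_ts:
            replay(sql)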

steve



Re: Database cluster?

From
Peter Korsgaard
Date:
On Thu, 30 Nov 2000, Steve Wolfe wrote:

>   The tough part is synchronisation, should one of the machines drop out
> of the cluster and need to be re-added without bringing the others down.
> In order to get around that, each query needs to be logged on the master
> node with a timestamp, so that the failed node can "catch up" in
> real-time.  That brings about other considerations, as well....

Exactly! That's also why I have decided not to implement that feature, as
the deadline is in 3 weeks ;) If a database server fails, it is simply
discarded from the cluster and can only be reconnected by taking the
system offline and doing the synchronisation manually.

--
Bye, Peter Korsgaard


Re: Database cluster?

From
"Valter Mazzola"
Date:
I've successfully patched linux kernel 2.2.17 with DIPC and modified
postgresql's src (src/backend/storage/ipc/ipc.c) to create distributed shm
and sem.

The strategy is then to start a postgresql that creates shm and sem on ONE
machine, then start other postgres on other machines in the cluster that
create NO shared structures (there is a command line flag to do this).

Then you can connect to any of the postgres instances in your cluster, for
example: round robin.
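
Round robin just means handing out the cluster's hosts in turn as clients
connect; a minimal Python sketch (host names invented):

import itertools

NODES = itertools.cycle(["node1", "node2", "node3"])

def pick_node():
    # Each new client connects to the next node in the cycle; any node
    # works, since DIPC gives them all the same shm and sem.
    return next(NODES)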

Another issue is datafiles; GFS seems promising. But postgresql uses
fcntl, and GFS (globalfilesystem.org) doesn't support it yet. A
distributed filesystem with locking etc. is required. Ideas?


Another issue is that DIPC doesn't have a failover mechanism.

This is a shared-all approach; it's not the best, but it's probably the
fastest (if crude) solution to implement, with only small modifications
(4-5 lines) to the postgresql sources.

This system can give a sort of single-system-image, useful to distribute
other software beyond postgresql.

Also Mariposa (http://s2k-ftp.cs.berkeley.edu:8000/mariposa/) seems
interesting,
but it's not maintained and it's for an old postgresql version.

hoping for clustering...
valter mazzola.


>From: Alain Toussaint <nailed@videotron.ca>
>To: PostgreSQL general list <pgsql-general@postgresql.org>
>Subject: Re: [GENERAL] Database cluster?
>Date: Thu, 30 Nov 2000 15:05:16 -0500 (EST)
>
> > Somebody mentioned the fact that postgres uses IPC for communicating
> > between processes. I think there are tools for clustering (I am not
> > sure if Mosix supports transparently allowing IPC across nodes) which
> > can work around that.
>
>one of those tools is distributed ipc <http://wallybox.cei.net/dipc/> but
>it only works with Linux, AFAIK; the software there is just a patch to the
>Linux kernel and a daemon.
>
>Alain
>



Re: Database cluster?

From
Tim Kientzle
Date:
> > > > I am considering splitting the database into
> > > > tables residing on separate machines, and connect
> > > > them on one master node.
> >
> > Splitting tables across multiple machines would do
> > nothing more than make the entire system run at a
> > snail's pace . . . because you just couldn't move
> > data between machines quickly enough.
>
> In my case, queries typically return ... [a]round 100
> records at most. [E]ven over 10 Mb ethernet, it would
> take at most about a second to transfer. This is a
> much smaller delay than the query time itself, which
> can take 10 seconds or more.
>
> So, splitting the data in such a way that one table is
> queried, and then tables joined from it are queried in
> parallel, would cause a signifficant speed-up.

Then do exactly that: run separate PostgreSQL databases on
multiple machines and build a data abstraction layer that
does the join manually.  If there really are only small
numbers of rows, then it's just as fast to transfer the
data to your application machine as to transfer it all
to a common PostgreSQL machine and then to your application.
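
A sketch of what such an abstraction layer might look like in Python,
assuming a hypothetical query(machine, sql) helper and made-up table and
column names:

def query(machine, sql):
    raise NotImplementedError  # stand-in for a real per-machine connection

def manual_join():
    # Fetch the small row set from T1 on its own machine...
    t1 = query("machine-a", "SELECT f1, f2 FROM T1 WHERE f3 = 'somedata'")
    keys = ",".join(repr(r[0]) for r in t1)
    # ...then fetch only the matching T2 rows from the other machine...
    t2 = {r[0]: r for r in query("machine-b",
                                 "SELECT f1, payload FROM T2 WHERE f1 IN (%s)" % keys)}
    # ...and do the join here in the application.
    return [(r, t2.get(r[0])) for r in t1]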

If you really need application transparency, then things get
a bit uglier; it shouldn't be too hard to build your own
middleware layer that lets you treat the data storage as
a single entity.

        - Tim

Re: Database cluster?

From
"Gordan Bobic"
Date:
> I actually analyzed it once.  I came to the conclusion that to do it
> right it would be easier to make an almost entirely new db but use the
> same external interfaces as PostgreSQL.

I admit that I am not really too up-to-date on database theory, but I am a
bit surprised at that...

> To do a kludge of it, one might just implement a tier that sits between
> the user and a bunch of standard PostgreSQL backends.

That is precisely what I was thinking about. There would have to be a
"master" node that controls what goes where, and distributes the load.
This "shouldn't" be too difficult (although I am not totally sure what I
mean by that). The nasty bit would probably be hacking the optimizer, the
SQL command "CLUSTER", and VACUUM to take account of and efficiently use
all the extra room for improving the performance.

Automating a "near-optimal" distribution of tables across machines could be
a bit of a difficult problem from a theory side, but it ought to be
possible. There are several options here.

One could just put one table on each server, which is unlikely to be all
that beneficial, although in a multi-table join, you'd want to search the
smallest tables first.

Then, there's the option of just splitting each table across multiple
machines. There is also the possibility of having some records overlap
between machines, if the on-line optimizer decides that that would be
useful for performance, and then sorting out the syncing somehow.

Or, one could set up an even more sophisticated system where only the
tables and data that would benefit from being together would be on the same
server, so there could be a section of two tables on one server, the rest
of those two tables and a section of another table on another server, etc.
Basically, make both the table and record allocations completely dynamic
between the servers.

I am not sure how useful each of these splits would be, but it is certainly
something well worth exploring theoretically before the actual
implementation, because I reserve the right to be wrong in thinking that
any of these methods would produce an actual improvement in performance.

And, of course, there would be the bit of getting the optimizer and partial
replication to work properly across servers, which may not be an easy task.

> It'd make a neat companion project, though.  Like PG/Enterprise or
> PG/Warehouse or something.

I agree. It would be really neat. Something like Mosix, but for databases.
And it just sounds like something that would be really useful for large
databases, especially as we start reaching the steep part of the
price/performance curve for database servers.

Regards.

Gordan




Re: Database cluster?

From
"Gordan Bobic"
Date:
> > I actually analyzed it once.  I came to the conclusion that to do it
> > right it would be easier to make an almost entirely new db but use the
> > same external interfaces as PostgreSQL.  To do a kludge of it, one
> > might just implement a tier that sits between the user and a bunch of
> > standard PostgreSQL backends.
> >
> > It'd make a neat companion project, though.  Like PG/Enterprise or
> > PG/Warehouse or something.
>
> I'm currently developing a simple version of such a system as a
> university project. It is a fairly simple approach, with a proxy or
> distributor in front of a bunch of standard postgresql database servers.
>
> The proxy monitors and forwards the requests from the clients to the
> database servers. If it is a read-only request, the query is forwarded to
> the database server currently experiencing the lowest load/most free
> memory; otherwise it is sent to all database servers.

Surely, you have to send off a query to all servers that the specific
tables you are searching are split across. How are you handling splitting
of the data?

> This approach obviously only performs well in systems with a high ratio
> of read-only queries, such as search engines and so on.

This implies lots of servers with identical data on them. Am I right?

Could you tell us a bit more about your project (if it's not breaching any
non-disclosure agreements, that is)? It could be a good starting point for
what we are talking about implementing.

Regards.

Gordan


Re: Database cluster?

From
Peter Korsgaard
Date:
On Fri, 1 Dec 2000, Gordan Bobic wrote:

> > The proxy monitors and forwards the requests from the clients to the
> > database servers. If it is a read-only request the query is forwarded to
> > the databaseserver currently experiencing the lowest load/most free
> > memory, otherwise it is sent to all database servers.
>
> Surely, you have to send off a query to all servers that the specific
> tables you are searching are split across. How are you handling splitting
> of the data?

I'm not ;) All the database servers in the cluster simply each contain
a copy of all the data. I know this isn't an optimal solution, but this
project is only running for a very short period (around 2 months).

> > read-only queries, such as search engines and so on.
>
> This implies lots of servers with identical data on them. Am I right?
>
> Could you tell us a bit more about your project (if it's not breaching any
> non-disclosure agreements, that is)? It could be a good starting point for
> what we are talking about implementing.

There are no NDAs involved in this project. We are planning on releasing
the software under the GPL after the deadline (22nd of December).

There isn't much more to tell about the system at this time. Currently we
are finishing off the lower levels of the system, like the database driver
(currently only postgres), status daemons through snmp and so on. At this
time the actual scheduling hasn't been implemented.

--
Bye, Peter Korsgaard


Re: Database cluster?

From
"Gordan Bobic"
Date:
> On Thu, 30 Nov 2000, Steve Wolfe wrote:
>
> >   The tough part is synchronisation, should one of the machines drop
> > out of the cluster and need to be re-added without bringing the others
> > down.  In order to get around that, each query needs to be logged on
> > the master node with a timestamp, so that the failed node can "catch
> > up" in real-time.  That brings about other considerations, as well....
>
> Exactly! That's also why I have decided not to implement that feature, as
> the deadline is in 3 weeks ;)

Fair!

> If a database server fails, it is simply discarded from the cluster and
> can only be reconnected by taking the system offline and doing the
> synchronisation manually.

Where's all that enthusiasm for a new project gone? Ever thought of making
a "new and improved" version 2 or something? I think it's a great idea,
and if we can get a few people together to implement it, it would be a
worthwhile exercise. And right now, it looks like you have a bit of a head
start. ;-)

Regards.

Gordan


Re: Database cluster?

From
"Gordan Bobic"
Date:
> > The proxy monitors and forwards the requests from the clients to the
> > database servers. If it is a read-only request, the query is forwarded
> > to the database server currently experiencing the lowest load/most free
> > memory; otherwise it is sent to all database servers.
> >
> > This approach obviously only performs well in systems with a high ratio
> > of read-only queries, such as search engines and so on.
>
>   The tough part is synchronisation, should one of the machines drop out
> of the cluster and need to be re-added without bringing the others down.
> In order to get around that, each query needs to be logged on the master
> node with a timestamp, so that the failed node can "catch up" in
> real-time.  That brings about other considerations, as well....

What about using something like a RAID-type setup on the database servers?
So that, for example, all data (records or tables, depending on the split)
exists on more than one server. That way, if a server goes down, all the
data would still exist on the other servers. A server failure could trigger
a replication operation that would re-distribute the data across the
remaining servers in such a way that if another one fails, the data
integrity would still be preserved. Similarly, when a new server is added,
this should trigger an operation that would re-distribute records in such a
way as to ensure redundancy with the new number of servers (in this case,
move some records from the old servers to the new one and load balance it).
This, again, "shouldn't" be too hard to implement, if we could sort out the
design first.
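
A Python sketch of one such placement rule (the neighbour-mirror scheme
below is invented purely for illustration; a real system would be smarter
about rebalancing):

def replicas(record_id, n_servers):
    # Every record lives on two different servers, so losing any single
    # server loses no data; its records can be re-copied from the mirrors.
    primary = record_id % n_servers
    return primary, (primary + 1) % n_servers

print(replicas(7, 4))  # (3, 0): record 7 lives on servers 3 and 0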

However, this could be (theoretically) difficult to combine with a
"near-optimal" table/record distribution. The two might go against each
other. There should be a way of enforcing the redundancy part, and then
making a "best effort" to do it in such a way as to gain the most
performance improvement. This _could_ be done, and off the top of my head,
I cannot see any reason why it wouldn't work reasonably well. Obviously, a
server failure would cause the whole thing to slow down for a while until
the records were re-distributed and the optimizer re-assessed the
situation, but that is true of any redundant system that has to remain at
least operational during a failure. After the replication is completed, it
should resume at its full speed (dictated by the number of available
servers).

Any opinions on this?

Regards.

Gordan


Re: Database cluster?

From
"Gordan Bobic"
Date:
> I've successfully patched linux kernel 2.2.17 with DIPC and modified
> postgresql's src (src/backend/storage/ipc/ipc.c) to create distributed
> shm and sem.

Please forgive my ignorance (I haven't used Postgres for that long), but
what are shm and sem?

> The strategy is then to start a postgresql that creates shm and sem on
> ONE machine, then start other postgres on other machines in the cluster
> that create NO shared structures (there is a command line flag to do
> this).

So, one "master" and lots of "slaves", right?

> Then you can connect to any of the postgres instances in your cluster,
> for example: round robin.

Hmm... But is this really what we want to do? This is less than ideal for
several reasons (if I understand what you're saying correctly). Replication
is off-line for a start, and it only works well for a system that has few
inserts and lots of selects, probably from a lot of different users.
Probably a good thing for applications like web search engines, but not
necessarily for much else.

> Another issue is datafiles; GFS seems promising. But postgresql uses
> fcntl, and GFS (globalfilesystem.org) doesn't support it yet. A
> distributed filesystem with locking etc. is required. Ideas?

Hmm... I am not sure that a distributed file system is what we want here. I
think it might be better to have separate postgres databases on separate
local file systems, and handle putting the data together at a higher level.
I think this would be better for both performance and scalability. Having
one big file system is likely to incur heavy network traffic penalties, and
that is not necessary, as it can be avoided by just having the distribution
done at the database level, rather than the file system level.

But then again, the distributed file system can be seen as a "neater"
solution, and it might work rather well, if they get the caching right,
with the correct "near-line" distribution of data across the network file
system to make sure that the data is where it is most useful. In other
words, make sure that the files (or even better, inodes) that are
frequently accessed by a computer are on that computer.

Still there is the issue of replication and redundancy. I just think that
for a database application, this would be best done on the database level,
rather than a file system level, unless the distributed file system in use
was designed with all the database-useful features in mind.

> Another issue is that DIPC doesn't have a failover mechanism.

Again, for a database, it might be best to handle it at a higher level.

> This is a shared-all approach; it's not the best, but it's probably the
> fastest (if crude) solution to implement, with only small modifications
> (4-5 lines) to the postgresql sources.

Indeed. As such, it should probably be the first thing to do toward
"clustering" a database. Still, it would be good to have a clear
development path, even if on that path we kludge things slightly at
various steps in order to have a usable system now, as opposed to a
"perfect" system later.

A shared-all approach is not necessarily that bad. It is (as far as I can
tell) not better or worse than a "share nothing" approach. They both have
pros and cons. Ideally, we should work toward coming up with an idea for a
hybrid system that would pick the best of both worlds.

> This system can give a sort of single-system-image, useful to distribute
> other software beyond postgresql.

Indeed. This is always a good thing for scalability for most applications,
but databases have their specific requirements which may not be best
catered for by standard means of distributed processing. Still, what you
are suggesting would be a major improvement, from where I'm looking at it,
but I am probably biased by looking at it from the point of view of my
particular application.

> Also Mariposa (http://s2k-ftp.cs.berkeley.edu:8000/mariposa/) seems
> interesting, but it's not maintained and it's for an old postgresql
> version.

Hmm... Most interesting. There could be something recyclable in there. Must
look at the specs and some source later...

Regards.

Gordan


Re: Database cluster?

From
"Gordan Bobic"
Date:
> > > The proxy monitors and forwards the requests from the clients to the
> > > database servers. If it is a read-only request, the query is
> > > forwarded to the database server currently experiencing the lowest
> > > load/most free memory; otherwise it is sent to all database servers.
> >
> > Surely, you have to send off a query to all servers that the specific
> > tables you are searching are split across. How are you handling
> > splitting of the data?
>
> I'm not ;)

That's what I thought you were implying...

> All the database servers in the cluster simply each contain
> a copy of all the data. I know this isn't an optimal solution, but this
> project is only running for a very short period (around 2 months).

It's actually a rather fair compromise, given the time scale (IMHO). It's
like mirroring disks with RAID. You get the same write performance, but
multiply the read performance across multiple disks. It sounds like you're
doing the same thing with databases.

And as your application is web search engine, this should work quite well,
especially under heavy load / lots of hits.

> > > read-only queries, such as search engines and so on.
> >
> > This implies lots of servers with identical data on them. Am I right?
> >
> > Could you tell us a bit more about your project (if it's not breaching
> > any non-disclosure agreements, that is)? It could be a good starting
> > point for what we are talking about implementing.
>
> There are no NDAs involved in this project. We are planning on releasing
> the software under the GPL after the deadline (22nd of December).

Hmm... I'll be looking forward to having a look at that. Could you post us
a URL, once it is released? Are you intending to continue development of
this? I'm just thinking about how well this approach would work with
something like what Valter has suggested here recently...

Regards.

Gordan


Re: Database cluster?

From
"Gordan Bobic"
Date:
> > > > > I am considering splitting the database into
> > > > > tables residing on separate machines, and connect
> > > > > them on one master node.
> > >
> > > Splitting tables across multiple machines would do
> > > nothing more than make the entire system run at a
> > > snail's pace . . . because you just couldn't move
> > > data between machines quickly enough.
> >
> > In my case, queries typically return ... [a]round 100
> > records at most. [E]ven over 10 Mb ethernet, it would
> > take at most about a second to transfer. This is a
> > much smaller delay than the query time itself, which
> > can take 10 seconds or more.
> >
> > So, splitting the data in such a way that one table is
> > queried, and then tables joined from it are queried in
> > parallel, would cause a signifficant speed-up.
>
> Then do exactly that: run separate PostgreSQL databases on
> multiple machines and build a data abstraction layer that
> does the join manually.  If there really are only small
> numbers of rows, then it's just as fast to transfer the
> data to your application machine as to transfer it all
> to a common PostgreSQL machine and then to your application.

I was hoping to do precisely that. But I never got the answer to the
question: does postgres support linking to tables external to its
database? Is it possible to attach a table from another postgres database
that is on a different server?

> If you really need application transparency

Which would always be nice...

> then things get a bit uglier;

I know.

> it shouldn't be too hard to build your own middleware layer that
> lets you treat the data storage as a single entity.

Yes, but if plans are formed to do something like an "official" postgres
based project to do this, then I'd rather spend time working on the
development of that, rather than spend time building my own proprietary
solution.

Regards.

Gordan


Re: Database cluster?

From
Dave Smith
Date:
Have you looked at intermezzo? http://www.inter-mezzo.org/

Valter Mazzola wrote:

> [snip]


Re: Database cluster?

From
"Valter Mazzola"
Date:


>From: "Gordan Bobic" To: Subject: Re: [GENERAL] Database cluster? Date:
>Fri, 1 Dec 2000 10:13:55 -0000
>
> > I've successfully patched linux kernel 2.2.17 with DIPC and modified
> > postgresql's src (src/backend/storage/ipc/ipc.c) to create distributed
> > shm and sem.
>
>Please forgive my ignorance (I haven't used Postgres for that long), but
>what are shm and sem?
>

shared memory and semaphores

> > The strategy is then to start a postgresql that creates shm and sem on
> > ONE machine, then start other postgres on other machines in the cluster
> > that create NO shared structures (there is a command line flag to do
> > this).
>
>So, one "master" and lots of "slaves", right?
>

No, every machine is totally similar to the others; the only difference is
that only ONE machine creates the (network-distributed, via DIPC) shared
memory and semaphores.


> > Then you can connect to any of the postgres instances in your cluster,
> > for example: round robin.
>
>Hmm... But is this really what we want to do? This is less than ideal for
>several reasons (if I understand what you're saying correctly). Replication
>is off-line for a start, and it only works well for a system that has few
>inserts and lots of selects, probably from a lot of different users.
>Probably a good thing for applications like web search engines, but not
>necessarily for much else.

*** it isn't replication. It's that your cluster behaves like a
single-computer. You modify the 'OS' (GFS + DIPC), not postgresql.


>
> > Another issue is datafiles; GFS seems promising. But postgresql uses
> > fcntl, and GFS (globalfilesystem.org) doesn't support it yet. A
> > distributed filesystem with locking etc. is required. Ideas?
>
>Hmm... I am not sure that a distributed file system is what we want here. I
>think it might be better to have separate postgres databases on separate
>local file systems, and handle putting the data together at a higher level.
>I think this would be better for both performance and scalability. Having

***yes... but WHEN can we have these features? No one has done it till
now; I've requested and searched, but almost no reply.

>one big file system is likely to incur heavy network traffic penalties, and
>that is not necessary, as it can be avoided by just having the distribution
>done at the database level, rather than the file system level.
>
>But then again, the distributed file system can be seen as a "neater"
>solution, and it might work rather well, if they get the caching right with
>the correct "near-line" distribution of data across the network file system
>to make sure that the data is where it is most useful. In other words, make
>sure that the files (or even better, inodes) that are frequently accessed
>by a computer are on that computer.
>
>Still there is the issue of replication and redundancy.

***GFS does it transparently.

>I just think that
>for a database application, this would be best done on the database level,
>rather than a file system level, unless the distributed file system in use
>was designed with all the database-useful features in mind.
>
> > Another issue is that DIPC doesn't have a failover mechanism.
>
>Again, for a database, it might be best to handle it at a higher level.
>
> > This is a shared-all approach; it's not the best, but it's probably the
> > fastest (if crude) solution to implement, with only small modifications
> > (4-5 lines) to the postgresql sources.
>
>Indeed. As such, it should probably be the first thing to do toward
>"clustering" a database. Still, it would be good to have a clear
>development path, even if on that path we kludge things slightly at
>various steps in order to have a usable system now, as opposed to a
>"perfect" system later.
>

*** yes, I want clustering now... and I'm alone.
In my opinion, if GFS will do fcntl (and we can ask the GFS people, I
think), the stuff in this email can be done rapidly.


>A shared all approach is not necessarily that bad. It is (as far as I can
>tell), not better or worse than a "share nothing" approach. They both have
>pros and cons. Ideally, we should work toward coming up with an idea for a
>hybrid system that would pick the best of both worlds.
>
> > This system can give a sort of single-system-image, useful to distribute
> > other software beyond postgresql.
>
>Indeed. This is always a good thing for scalability for most applications,
>but databases have their specific requirements which may not be best
>catered for by standard means of distributed processing. Still, what you
>are suggesting would be a major improvement, from where I'm looking at it,
>but I am probably biased by looking at it from the point of view of my
>particular application.
>
> > Also Mariposa (http://s2k-ftp.cs.berkeley.edu:8000/mariposa/) seems
> > interesting, but it's not maintained and it's for an old postgresql
> > version.
>
>Hmm... Most interesting. There could be something recyclable in there. Must
>look at the specs and some source later...
>

*** I've compiled it, but with no results.
An idea is to get a diff against the corresponding pure postgresql version
(6.4/5?), then study the patch, and grab the secrets to fuse into the
current version. The research papers seem very good. Perhaps one of the
guys that did Mariposa can help...

My goal is to have a clustered open source database with the least effort
possible, now.

The project to do good stuff (i.e. code) in this field is very long...

I hope that some guy will start a real thing... one idea is to start a
project on cosource or similar to receive funding $$.
This project is very important for the OpenSource world.


valter

>Regards.
>
>Gordan
>


Re: Database cluster?

From
"Gordan Bobic"
Date:
> > Then you can connect to any of the postgres instances in your cluster,
> > for example: round robin.
> >
> >Hmm... But is this really what we want to do? This is less than ideal
> >for several reasons (if I understand what you're saying correctly).
> >Replication is off-line for a start, and it only works well for a system
> >that has few inserts and lots of selects, probably from a lot of
> >different users. Probably a good thing for applications like web search
> >engines, but not necessarily for much else.
>
> *** it isn't replication. It's that your cluster behaves like a
> single-computer. You modify the 'OS' (GFS + DIPC), not postgresql.

OK, that makes sense. Kind of like Mosix, then. But like Mosix, this would
require lots of network bandwidth - or not, depending on how good GFS is at
figuring out what goes where.

> > >Another issue is datafiles; GFS seems promising. But postgresql uses
> > >fcntl, and GFS (globalfilesystem.org) doesn't support it yet. A
> > >distributed filesystem with locking etc. is required. Ideas?
> >
> >Hmm... I am not sure that a distributed file system is what we want
> >here. I think it might be better to have separate postgres databases on
> >separate local file systems, and handle putting the data together at a
> >higher level. I think this would be better for both performance and
> >scalability. Having
>
> ***yes... but WHEN can we have these features? No one has done it till
> now; I've requested and searched, but almost no reply.

Well, if you come up with a detailed design, I'm quite happy to help with
coding individual functions...

> >one big file system is likely to incur heavy network traffic penalties,
> >and that is not necessary, as it can be avoided by just having the
> >distribution done at the database level, rather than the file system
> >level.
> >
> >But then again, the distributed file system can be seen as a "neater"
> >solution, and it might work rather well, if they get the caching right,
> >with the correct "near-line" distribution of data across the network
> >file system to make sure that the data is where it is most useful. In
> >other words, make sure that the files (or even better, inodes) that are
> >frequently accessed by a computer are on that computer.
> >
> >Still there is the issue of replication and redundancy.
>
> ***GFS does it transparently.

But wouldn't this all be incredibly network intensive? Could we implement
something that would make a process go to the data, instead of the other
way around? In a database, the data is typically bigger than the process
accessing it...

> >Indeed. As such, it should probably be the first thing to do toward
> >"clustering" a database. Still, it would be good to have a clear
> >development path, even if on that path we kludge things slightly at
> >various steps in order to have a usable system now, as opposed to a
> >"perfect" system later.
> >
>
> *** yes, I want clustering now... and I'm alone.

No, you're not. I NEED clustering now. Eventually the number of records and
tables comes and bites you, no matter how much you optimize your
application. And for most of us mere mortals, buying a Cray for running a
database is just not a viable option...

> In my opinion, if GFS will do fcntl (and we can ask the GFS people, I
> think), the stuff in this email can be done rapidly.

Well, I think you've just volunteered to contact them. :-)

> >A shared-all approach is not necessarily that bad. It is (as far as I
> >can tell) not better or worse than a "share nothing" approach. They both
> >have pros and cons. Ideally, we should work toward coming up with an
> >idea for a hybrid system that would pick the best of both worlds.
> >
> > > This system can give a sort of single-system-image, useful to
> > > distribute other software beyond postgresql.
> >
> >Indeed. This is always a good thing for scalability for most
> >applications, but databases have their specific requirements which may
> >not be best catered for by standard means of distributed processing.
> >Still, what you are suggesting would be a major improvement, from where
> >I'm looking at it, but I am probably biased by looking at it from the
> >point of view of my particular application.
> >
> > >Also Mariposa (http://s2k-ftp.cs.berkeley.edu:8000/mariposa/) seems
> > >interesting, but it's not maintained and it's for an old postgresql
> > >version.
> >
> >Hmm... Most interesting. There could be something recyclable in there.
> >Must look at the specs and some source later...
> >
>
> *** I've compiled it, but with no results.
> An idea is to get a diff against the corresponding pure postgresql
> version (6.4/5?), then study the patch, and grab the secrets to fuse into
> the current version. The research papers seem very good. Perhaps one of
> the guys that did Mariposa can help...

See above comment...

> My goal is to have a clustered open source database with the least
> effort possible, now.
>
> The project to do good stuff (i.e. code) in this field is very long...

Indeed. There has to be a feasible starting point that yields modest
improvements at modest cost (in time and effort, in this case).

> I hope that some guy will start a real thing... one idea is to start a
> project on cosource or similar to receive funding $$.
> This project is very important for the OpenSource world.

I agree. Having a fully clustered database with very little network
overhead would be a major success, both for Postgres and for Open Source.
Here's an obvious question - how good is clustering support on Oracle
(does it even exist)?

Regards.

Gordan


Re: Database cluster?

From
Alain Toussaint
Date:
> Please forgive my ignorance (I haven't used Postgres for that long), but
> what are shm and sem?

shared memory and semaphores. An interesting tidbit on Linux (2.4.0-test
series at least) is that shared memory can be a nodev filesystem (like
proc or devfs):

/dev/ide/host0/bus0/target0/lun0/part2 on / type ext2 (rw,errors=remount-ro,errors=remount-ro)
proc on /proc type proc (rw)
/shm on /shm type shm (rw)

and there's at least a few other nodev filesystem in the latest kernel
series:

nodev    shm
nodev    sockfs
nodev    pipefs
nodev    proc
    ext2
nodev    devfs

I'm doing some heavy research into these filesystems, but so far there's a
BIG lack of docs.

Alain