Re: Database cluster? - Mailing list pgsql-general

From Gordan Bobic
Subject Re: Database cluster?
Date
Msg-id 005701c05adc$0c7e90a0$8000000a@localdomain
Whole thread Raw
In response to RE: Database cluster?  ("Nathan Barnett" <nbarnett@cellularphones.com>)
Responses Re: Database cluster?
List pgsql-general
Thanks.

I have just had another thought. If all the tables are split across several
computers, this would help as well.

For example, if we have 100 records and 2 database servers, each server
could hold 50 of those 100 records. When a selection is required, each
server would look through its much smaller share of the data and report
back the "hits". This would effectively provide a near-linear speedup in
query time, while introducing only a minor network overhead (or a major
one, depending on how much data is transferred).
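A minimal sketch of that idea, with plain Python lists standing in for the two database servers (all names here are illustrative, not an actual PostgreSQL mechanism):

```python
# Rows hash-partitioned across two "servers" (lists standing in for real
# database nodes); a selection scans each partition and merges the hits.

NUM_SERVERS = 2

def shard_for(record_id):
    """Pick the server that holds a given record (simple modulo hash)."""
    return record_id % NUM_SERVERS

# Distribute 100 records across the servers: 50 each.
servers = [[] for _ in range(NUM_SERVERS)]
for record_id in range(100):
    servers[shard_for(record_id)].append({"id": record_id, "value": record_id * 2})

def select_where(predicate):
    """Run the selection on every server and merge the 'hits'."""
    hits = []
    for shard in servers:  # each scan could run in parallel on its own machine
        hits.extend(r for r in shard if predicate(r))
    return hits

matches = select_where(lambda r: r["value"] > 190)
print(len(servers[0]), len(servers[1]))  # 50 50
print(sorted(r["id"] for r in matches))  # [96, 97, 98, 99]
```

Each scan touches only half the data, which is where the near-linear speedup would come from; only the matching rows cross the network.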

Some extra logic could then be implemented for related tables, so that the
most closely related records from the different tables are "clustered" (as
in kind of remotely similar to the CLUSTER command) on the same server, for
faster response times and minimized network usage. The "vacuum" or
"cluster" features could be used overnight to re-optimize the distribution
of records across the servers.
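One way to picture the co-location idea (a sketch, not what PostgreSQL's CLUSTER actually does): if two related tables are partitioned on the same join key, related rows always land on the same node, so joining them never crosses the network. Table and column names below are made up:

```python
# If both tables hash on the same join key, related rows are co-located,
# so the join for any one key is node-local.

NUM_NODES = 3

def node_for(key):
    return hash(key) % NUM_NODES  # same key -> same node, for any table

orders    = [("o1", "alice"), ("o2", "bob"), ("o3", "alice")]
customers = [("alice",), ("bob",)]

# Partition both tables on the customer key.
order_nodes    = {order_id: node_for(cust) for order_id, cust in orders}
customer_nodes = {c: node_for(c) for (c,) in customers}

# Every order row ends up on the same node as its customer row.
for order_id, cust in orders:
    assert order_nodes[order_id] == customer_nodes[cust]
print("all joins are node-local")
```

The overnight re-optimization pass would amount to re-hashing and moving rows whose key distribution has drifted.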

In all this, a "master" node could be used for coordinating the whole
operation. We would send a query to the master node, and it would
automatically, knowing what slaves it has, fire off that query on them.
Each slave would then execute the query in parallel and return a subset of
the data we were looking for. The subsets would be joined into one
recordset before being returned to the client that requested it.
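The master/slave flow above can be sketched roughly as follows, with slaves simulated by in-process functions (a real version would issue SQL over network connections; everything here is illustrative):

```python
# The "master" fans one query out to every slave in parallel and joins
# the partial results into a single recordset.

from concurrent.futures import ThreadPoolExecutor

# Each slave holds a disjoint subset of the data.
slave_data = [
    [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}],
    [{"id": 3, "name": "ann"}, {"id": 4, "name": "cat"}],
]

def query_slave(rows, name):
    """What one slave does: filter its local subset."""
    return [r for r in rows if r["name"] == name]

def master_query(name):
    """Fire the query at all slaves in parallel, then merge the subsets."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda rows: query_slave(rows, name), slave_data)
    merged = [row for partial in partials for row in partial]
    return sorted(merged, key=lambda r: r["id"])

print(master_query("ann"))  # [{'id': 1, 'name': 'ann'}, {'id': 3, 'name': 'ann'}]
```

The client only ever talks to the master; the parallelism across slaves is invisible to it.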

As far as I can see, as long as the amounts of data shifted aren't large
enough to cause network congestion, and the query time dominates the data
transfer time over the network, this should provide a rather scalable
system. I understand that the form of database clustering I am mentioning
here is fairly rudimentary and unsophisticated, but it would certainly be a
very useful feature.

Are there any plans to implement this sort of functionality in PostgreSQL?
Or is this a lot more complicated than it seems...

Regards.

Gordan

----- Original Message -----
From: "Nathan Barnett" <nbarnett@cellularphones.com>
To: "'Gordan Bobic'" <gordan@freeuk.com>; <pgsql-general@postgresql.org>
Sent: Thursday, November 30, 2000 2:34 PM
Subject: RE: [GENERAL] Database cluster?


> This system is in use through what is called shared-nothing clustering,
> which is employed by IBM's DB2 and Microsoft SQL Server 2000.  Either of
> these products will work in the manner that you are looking for.
>
> ---------------------------------------------
> Nathan Barnett
>
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Gordan Bobic
> Sent: Thursday, November 30, 2000 4:02 AM
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Database cluster?
>
>
> > > > I am considering splitting the database into tables residing on
> separate
> > > > machines, and connect them on one master node.
> > > >
> > > > The question I have is:
> > > >
> > > > 1) How can I do this using PostgreSQL?
> > >
> > > You can't.
> >
> >    I'll jump in with a bit more info.  Splitting tables across multiple
> > machines would do nothing more than make the entire system run at a
> > snail's pace.  Yes, it would slow it down immensely, because you just
> > couldn't move data between machines quickly enough.
>
> I don't believe that is the case. In my case, queries typically return
> comparatively small amounts of data. Around 100 records at most. The
> amount of data that needs to be transferred is comparatively small, and
> even over 10 Mb ethernet, it would take at most about a second to
> transfer. This is a much smaller delay than the query time itself, which
> can take 10 seconds or more. Remember that I said there are tables with
> over 30M records? Doing multi-table joins on things like this takes a
> long time...
>
> So, splitting the data in such a way that one table is queried, and then
> tables joined from it are queried in parallel, would cause a significant
> speed-up.
>
> For example, say we have tables T1, T2 and T3.
>
> T1 has fields F1.1, F1.2 and F1.3. T2 has F2.1 and T3 has F3.1 (at
> least; probably lots of other fields as well).
>
> Say I want to do
> SELECT *
> FROM T1, T2, T3
> WHERE F1.1 = F2.1 AND F1.2 = F3.1 AND F1.3 = 'somedata';
>
> Then F1.3 could be searched for 'somedata'. When the records are found,
> this could be cross-matched remotely, in parallel for F1.1=F2.1 and
> F1.2=F3.1, on different machines.
>
> This means that depending on the type, configuration and the usage of the
> database, a potentially massive improvement in performance could be
> achieved, especially on multi-table joins which span lots of BIG tables.
>
> Somebody mentioned the fact that postgres uses IPC for communicating
> between processes. I think there are tools for clustering (I am not sure
> if Mosix supports transparently allowing IPC across nodes) which can
> work around that.
>
> >   Why?  Well, whenever you join two tables that are on different
> > machines, the tables have to go across whatever sort of connection you
> > have between the machines.  Even if you use gigabit ethernet, you are
> > still running at a mere fraction of the bandwidth of the computer's
> > internal bus - and at orders of magnitude greater latency.  You'd have
> > lots of CPU's sitting around, doing absolutely nothing, waiting for
> > data to come across the wire.
>
> Gigabit ethernet has around the same bandwidth as the PCI bus. I suppose
> it all depends on what machine you have running this. This would be true
> in the case that the database server is a nice big Alpha with several
> CPUs.
>
> >    There are alternatives, such as IP-over-SCSI.  That reduces the
> > latency of ethernet quite a bit, and gives you much more bandwidth
> > (say, up to 160 megabytes/second).  However, that's still a pittance
> > compared to the main system bus inside your computer.
>
> But SCSI is still 160 MB/s burst (not sustained, unless you're using
> very expensive arrays). And gigabit ethernet is roughly 125 MB/s, albeit
> with greater latency.
>
> >     That's one of the greatest hurdles to distributed computing.
> > That's why the applications that are best adapted to distributed
> > computing are those that don't require much data over the wire - which
> > certainly doesn't apply to databases. : )
>
> I think it depends on whether the amount of data is the problem, or
> fitting it together.
>
> Could somebody please explain to me further why I am wrong in all this?
>
> Regards.
>
> Gordan
>

