Re: Database cluster? - Mailing list pgsql-general

From Gordan Bobic
Subject Re: Database cluster?
Date
Msg-id 004b01c05aac$2bfdc520$8000000a@localdomain
In response to Re: Database cluster?  (Peter Eisentraut <peter_e@gmx.net>)
Responses Re: Database cluster?
RE: Database cluster?
List pgsql-general
> > > I am considering splitting the database into tables residing on separate
> > > machines, and connect them on one master node.
> > >
> > > The question I have is:
> > >
> > > 1) How can I do this using PostgreSQL?
> >
> > You can't.
>
>    I'll jump in with a bit more info.  Splitting tables across multiple
> machines would do nothing more than make the entire system run at a snail's
> pace.  Yes, it would slow it down immensely, because you just couldn't move
> data between machines quickly enough.

I don't believe that is the case. In my case, queries typically return
comparatively small amounts of data - around 100 records at most. The amount
of data that needs to be transferred is small, and even over 10 Mbit ethernet
it would take at most about a second to move. That is a much smaller delay
than the query time itself, which can take 10 seconds or more. Remember that I
said there are tables with over 30M records? Doing multi-table joins on things
like this takes a long time...
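
To put a rough number on it (assuming something like 1 KB per record, which is
only a guess): 100 records x 1 KB = 100 KB, and 10 Mbit ethernet moves roughly
1 MB/s after protocol overhead, so the transfer itself is on the order of 0.1
seconds. Even at ten times that record size it stays well below the
multi-second query times.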

So, splitting the data in such a way that one table is queried first, and the
tables joined to it are then queried in parallel, would cause a significant
speed-up.

For example, say we have tables T1, T2 and T3.

T1 has fields F1.1, F1.2 and F1.3; T2 has F2.1 and T3 has F3.1 (at least -
they probably have lots of other fields too).

Say I want to do
SELECT *
FROM T1, T2, T3
WHERE F1.1 = F2.1 AND F1.2 = F3.1 AND F1.3 = 'somedata';

Then F1.3 could be searched for 'somedata' first. Once those records are
found, the F1.1 = F2.1 and F1.2 = F3.1 matches could be done remotely, in
parallel, on different machines.
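
Purely as an illustration of the decomposition I have in mind (schematic
names, with F1.1 written as F1_1 so it is legal SQL, and made-up literal key
values standing in for whatever step 1 actually returns - this is not
something PostgreSQL will do for you by itself):

-- Step 1, on the node holding T1: find the qualifying rows.
SELECT F1_1, F1_2 FROM T1 WHERE F1_3 = 'somedata';

-- Step 2, fired off in parallel at the nodes holding T2 and T3,
-- substituting the values returned by step 1 (shown here as literals):
SELECT * FROM T2 WHERE F2_1 IN (101, 102, 103);
SELECT * FROM T3 WHERE F3_1 IN (201, 202);

-- Step 3, back on the master node: combine the three result sets.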

This means that, depending on the type, configuration and usage of the
database, a potentially massive improvement in performance could be achieved,
especially on multi-table joins which span lots of BIG tables.

Somebody mentioned that postgres uses IPC for communicating between processes.
I think there are clustering tools which can work around that (I am not sure
whether Mosix supports IPC transparently across nodes).

>   Why?  Well, whenever you join two tables that are on different machines,
> the tables have to go across whatever sort of connection you have between
> the machines.  Even if you use gigabit ethernet, you are still running at a
> mere fraction of the bandwidth of the computer's internal bus - and at
> orders of magnitude greater latency.  You'd have lots of CPU's sitting
> around, doing absolutely nothing, waiting for data to come across the wire.

Gigabit ethernet has around the same bandwidth as the PCI bus. I suppose it
all depends on what machine you have running this; that would be true if the
database server is a nice big Alpha with several CPUs and a correspondingly
faster internal bus.

>    There are alternatives, such as IP-over-SCSI.  That reduces the latency
> of ethernet quite a bit, and gives you much more bandwidth (say, up to 160
> megabytes/second).  However, that's still a pittance compared to the main
> system bus inside your computer.

But SCSI is still 160 MB/s burst (not sustained, unless you're using very
expensive arrays). And gigabit ethernet is about 125 MB/s of raw bandwidth,
albeit with greater latency.
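
To put those numbers side by side: gigabit ethernet is 1000 Mbit/s / 8 = 125
MB/s, 32-bit/33 MHz PCI is 33 MHz x 4 bytes = 132 MB/s, and Ultra160 SCSI is
160 MB/s burst. They are all within the same order of magnitude, which is
really my point.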

>     That's one of the greatest hurdles to distributed computing.  That's why
> the applications that are best adapted to distributed computing are those
> that don't require much data over the wire - which certainly doesn't apply
> to databases. : )

I think it depends on whether the problem is the sheer amount of data, or the
work of fitting it together.

Could somebody please explain further why I am wrong about all this?

Regards.

Gordan

