Thread: Re: Hardware/OS recommendations for large databases ( 5TB)

Re: Hardware/OS recommendations for large databases ( 5TB)

From: "James Mello"
Date: November 15, 2005 11:07 AM
Unless there was a way to guarantee consistency, it would be hard at
best to make this work. Convergence on large data sets across boxes is
non-trivial, and diffing databases is difficult at best. Unless there
was some form of automated way to ensure consistency, going 8 ways into
separate boxes is *very* hard. I do suppose that if you have fancy
storage (EMC, Hitachi) you could do BCV or Shadow copies. But in terms
of commodity stuff, I'd have to agree with Merlin.

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of William Yu
Sent: Tuesday, November 15, 2005 10:57 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Hardware/OS recommendations for large databases (
5TB)

Merlin Moncure wrote:
>>You could instead buy 8 machines that total 16 cores, 128GB RAM and
>
> It's hard to say what would be better.  My gut says the 5u box would
> be a lot better at handling high cpu/high concurrency problems...like
> your typical business erp backend.  This is pure speculation of
> course...I'll defer to the experts here.

In this specific case (data warehouse app), multiple machines is the
better bet. Load data on 1 machine, copy to other servers and then use a
middleman to spread out SQL statements to each machine.

I was going to suggest pgpool as the middleman, but I believe it's
limited to 2 machines max at this time. I suppose you could daisy-chain
pgpools running on every machine.
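
For a bulk-load/read-only setup, the middleman itself doesn't have to
be fancy. A minimal sketch of the idea in Python, assuming the psycopg2
driver and some made-up replica hostnames (for illustration only -- this
isn't anything pgpool actually provides):

    import itertools
    import psycopg2

    # Hypothetical replicas, each loaded with an identical copy of the data.
    REPLICAS = ["db1.example.com", "db2.example.com", "db3.example.com"]

    # One persistent connection per replica, rotated round-robin.
    conns = [psycopg2.connect(host=h, dbname="warehouse", user="report")
             for h in REPLICAS]
    rotation = itertools.cycle(conns)

    def run_query(sql, params=None):
        """Send one read-only statement to the next replica in rotation."""
        conn = next(rotation)
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()

Since every replica holds the same data, it doesn't matter which one
answers a given SELECT.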


Re: Hardware/OS recommendations for large databases ( 5TB)

From: "Luke Lonergan"
Date:
James,


On 11/15/05 11:07 AM, "James Mello" <james@haydrian.com> wrote:

> Unless there was a way to guarantee consistency, it would be hard at
> best to make this work. Convergence on large data sets across boxes is
> non-trivial, and diffing databases is difficult at best. Unless there
> was some form of automated way to ensure consistency, going 8 ways into
> separate boxes is *very* hard. I do suppose that if you have fancy
> storage (EMC, Hitachi) you could do BCV or Shadow copies. But in terms
> of commodity stuff, I'd have to agree with Merlin.

It's a matter of good software that handles the distribution / parallel
query optimization / distributed transactions and management features.
Combine that with a gigabit ethernet switch and it works -- we routinely
get 50x speedup over SMP on OLAP / Decision Support workloads.

Regards,

- Luke

Re: Hardware/OS recommendations for large databases ( 5TB)

From: William Yu
Date:
James Mello wrote:
> Unless there was a way to guarantee consistency, it would be hard at
> best to make this work. Convergence on large data sets across boxes is
> non-trivial, and diffing databases is difficult at best. Unless there
> was some form of automated way to ensure consistency, going 8 ways into
> separate boxes is *very* hard. I do suppose that if you have fancy
> storage (EMC, Hitachi) you could do BCV or Shadow copies. But in terms
> of commodity stuff, I'd have to agree with Merlin.

If you're talking about data consistency, I don't see why that's an
issue in a bulk-load/read-only setup. Either bulk load on 1 server and
then do a file copy to all the others -- or simultaneously bulk load on
all servers.
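
The load-once-then-copy step is easy to script. A rough sketch -- the
hostnames and data directory path are made up, all servers must run
identical PostgreSQL binaries, and the clusters have to be shut down
during the copy (a raw file copy of a running cluster wouldn't be
consistent):

    import subprocess

    PRIMARY_DATADIR = "/var/lib/pgsql/data/"          # hypothetical path
    REPLICAS = ["db2.example.com", "db3.example.com"]

    # Push the (stopped) primary's data directory to each replica.
    for host in REPLICAS:
        subprocess.run(
            ["rsync", "-a", "--delete",
             PRIMARY_DATADIR, f"{host}:{PRIMARY_DATADIR}"],
            check=True)  # raise if any copy fails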

If you're talking about consistency in directing queries to the
appropriate servers, I agree that's a more complicated issue but not
insurmountable. If you don't use persistent connections, you can
probably get pretty good routing using DNS -- monitor servers by looking
at top/iostat/memory info/etc. and continually change the DNS zonemaps
to direct traffic to the less busy servers. (I use this method for our
global load balancers -- it's pretty easy to script via
Perl/Python/etc.) Mind you, since you need a dual-processor motherboard
anyway to get PCI-X, every machine would be a 2x dual-core, so there's
enough CPU power to handle the cases where 2 or 3 queries get sent to
the same server back-to-back. Of course, I/O would take a hit in this
case -- but I/O would take a hit in every case on a single 16-core mega
system.
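
The zonemap trick sketches out to something like the following -- the
hostnames, IPs, zone-file path, and the choice of BIND/rndc are all
illustrative assumptions:

    import subprocess

    SERVERS = {"db1.example.com": "10.0.0.1",
               "db2.example.com": "10.0.0.2",
               "db3.example.com": "10.0.0.3"}

    def load_of(host):
        """1-minute load average, probed remotely over ssh."""
        out = subprocess.check_output(
            ["ssh", host, "cat", "/proc/loadavg"], text=True)
        return float(out.split()[0])

    # Keep only the two least-loaded servers in the DNS rotation.
    ranked = sorted(SERVERS, key=load_of)[:2]
    with open("/etc/bind/db.cluster-fragment", "w") as zone:
        for host in ranked:
            # Short 60s TTL so clients re-resolve quickly.
            zone.write(f"sql  60  IN  A  {SERVERS[host]}\n")

    subprocess.run(["rndc", "reload"], check=True)  # have BIND reload

Run that from cron every minute or so and non-persistent clients drift
toward the quieter servers on their next lookup.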

If you use persistent connections, it'll definitely require extra
programming beyond simple scripting. Take one of the open-source
projects like pgpool or SQLRelay and alter it so it monitors all the
servers to see which one is least busy before passing a query on.
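
As a sketch of what that monitoring hook might boil down to (this is
not pgpool's or SQLRelay's actual code; psycopg2 and the hostnames are
assumptions for illustration): count each server's active backends via
pg_stat_activity and hand the query to the quietest one.

    import psycopg2

    # Hypothetical pool: one persistent connection per server.
    conns = {h: psycopg2.connect(host=h, dbname="warehouse", user="report")
             for h in ["db1.example.com", "db2.example.com",
                       "db3.example.com"]}

    def active_backends(conn):
        """How many backends this server is currently running."""
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            return cur.fetchone()[0]

    def run_on_least_busy(sql, params=None):
        """Route one statement to the server with the fewest backends."""
        host = min(conns, key=lambda h: active_backends(conns[h]))
        with conns[host].cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()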