Re: Partitioning / Clustering - Mailing list pgsql-performance
| From | Mischa Sandberg |
|---|---|
| Subject | Re: Partitioning / Clustering |
| Date | |
| Msg-id | 1115762155.42812deb506d7@webmail.telus.net |
| In response to | Re: Partitioning / Clustering (Alex Stapleton <alexs@advfn.com>) |
| Responses | Re: Partitioning / Clustering |
| List | pgsql-performance |
Quoting Alex Stapleton <alexs@advfn.com>:

> This is why I mention partitioning. It solves this issue by storing
> different data sets on different machines under the same schema.
> These separate chunks of the table can then be replicated as well for
> data redundancy and so on. MySQL are working on these things, but PG
> just has a bunch of third party extensions, I wonder why these are
> not being integrated into the main trunk :/ Thanks for pointing me to
> PGCluster though. It looks like it should be better than Slony at
> least.

Across a decade or two of projects, including creating a federated database engine for Simba, I've become rather dubious of horizontal partitions (across disks or servers), either to improve performance or just to scale up without losing performance. [[The one exception is for <emphasis>non-time-critical read-only</emphasis> systems, with Slony-style replication.]]

The most successful high-volume systems I've seen have broken up databases functionally, like a pipeline, where different applications use different sections of the pipe.

The highest-volume system I've worked on is Acxiom's gigantic data-cleansing system. This is the central clearinghouse for every scrap of demographic data that can be associated with some North American, somewhere. Think of D&B for 300M people (some dead). The volumes are just beyond belief, for both updates and queries. At Acxiom, the datasets are so large, even after partitioning, that they just constantly cycle them through memory, and commands are executed in convoys --- sort of like riding a paternoster.

..........

Anybody been tracking what Mr Stonebraker's been up to lately? Datastream management. Check it out. Like replication, everybody hand-rolled their own datastream systems until finally somebody generalized it well enough that it didn't have to be built from scratch every time. Datastream systems require practically no locking, let alone distributed transactions. They give you some really strong guarantees on transaction elapsed time and throughput.

.......

Where is this all leading? Well, for scaling data like this, the one feature you need is the ability of procedures/rules on one server to perform queries/procedures on another. MSSQL has linked servers and (blech) OPENQUERY. This lets you do reasonably efficient work when you only deal with one table at a time. Do NOT try anything fancy with multi-table joins; timeouts are unavoidable, and painful.

Postgres has a natural advantage in such a distributed server system: all table/index stats are openly available through the SQL interface, so one server can make rational query plans involving another server's resources. God! I would have killed for that when I was writing a federated SQL engine; the kludges you need to do this at arm's length from that information are true pain.

So where should I go look, to see what's been done so far, on a Postgres that can treat another PG server as a new table type?
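To illustrate the point about openly available statistics: the estimates the Postgres planner works from can be read with plain SELECTs against the system catalogs, which is exactly what a remote federating server could do. A minimal sketch (the table name `orders` is a hypothetical example; `pg_class` and the `pg_stats` view are standard catalogs):

```sql
-- Row-count and page-count estimates the planner uses, per table.
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relkind = 'r'
  AND relname = 'orders';          -- 'orders' is a hypothetical table name

-- Per-column statistics: distinct-value estimate, null fraction,
-- and physical ordering correlation, as gathered by ANALYZE.
SELECT attname, n_distinct, null_frac, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'orders';
```

A remote server could issue these over an ordinary connection before planning a cross-server join, rather than guessing at cardinalities at arm's length.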