Thread: RE: Database cluster?
This is done through what is called shared-nothing clustering, which is employed by IBM's DB2 and Microsoft SQL Server 2000. Either of these products will work in the manner that you are looking for.

---------------------------------------------
Nathan Barnett

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Gordan Bobic
Sent: Thursday, November 30, 2000 4:02 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database cluster?

> > > I am considering splitting the database into tables residing on separate machines, and connecting them on one master node.
> > >
> > > The question I have is:
> > >
> > > 1) How can I do this using PostgreSQL?
> >
> > You can't.
>
> I'll jump in with a bit more info. Splitting tables across multiple machines would do nothing more than make the entire system run at a snail's pace. Yes, it would slow it down immensely, because you just couldn't move data between machines quickly enough.

I don't believe that is the case. In my case, queries typically return comparatively small amounts of data - around 100 records at most. The amount of data that needs to be transferred is comparatively small, and even over 10 Mb ethernet it would take at most about a second to transfer. This is a much smaller delay than the query time itself, which can take 10 seconds or more. Remember that I said there are tables with over 30M records? Doing multi-table joins on things like this takes a long time...

So, splitting the data in such a way that one table is queried, and then the tables joined from it are queried in parallel, would cause a significant speed-up.

For example, say we have tables T1, T2 and T3. T1 has fields F1.1, F1.2 and F1.3. T2 has F2.1 and T3 has F3.1 (plus, probably, lots of other fields). Say I want to do:

SELECT *
FROM T1, T2, T3
WHERE F1.1 = F2.1 AND F1.2 = F3.1 AND F1.3 = 'somedata';

Then F1.3 could be searched for 'somedata'. When the records are found, this could be cross-matched remotely, in parallel, for F1.1 = F2.1 and F1.2 = F3.1 on different machines.

This means that, depending on the type, configuration and usage of the database, a potentially massive improvement in performance could be achieved, especially on multi-table joins which span lots of BIG tables.

Somebody mentioned the fact that postgres uses IPC for communicating between processes. I think there are tools for clustering (I am not sure if Mosix supports transparently allowing IPC across nodes) which can work around that.

> Why? Well, whenever you join two tables that are on different machines, the tables have to go across whatever sort of connection you have between the machines. Even if you use gigabit ethernet, you are still running at a mere fraction of the bandwidth of the computer's internal bus - and at orders of magnitude greater latency. You'd have lots of CPUs sitting around, doing absolutely nothing, waiting for data to come across the wire.

Gigabit ethernet has around the same bandwidth as the PCI bus. I suppose it all depends on what machine you have running this. This would be true in the case that the database server is a nice big Alpha with several CPUs.

> There are alternatives, such as IP-over-SCSI. That reduces the latency of ethernet quite a bit, and gives you much more bandwidth (say, up to 160 megabytes/second). However, that's still a pittance compared to the main system bus inside your computer.
But SCSI is still 160 MB/s burst (not sustained, unless you're using very expensive arrays). And Gigabit ethernet is around 125 MB/s, albeit with greater latency.

> That's one of the greatest hurdles to distributed computing. That's why the applications that are best adapted to distributed computing are those that don't require much data over the wire - which certainly doesn't apply to databases. : )

I think it depends on whether the amount of data is the problem, or fitting it together.

Somebody please explain to me further why I am wrong in all this?

Regards.

Gordan
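As a rough sketch of what the parallel cross-match described above could look like from the application side - the psycopg2 driver, the db1/db2/db3 host names and the underscored column names (f1_1 standing in for F1.1, and so on) are assumptions made purely for illustration, not anything from the thread:

import threading
import psycopg2

def fetch(dsn, sql, params, out, key):
    # Run one query against one remote server and stash the rows.
    conn = psycopg2.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        out[key] = cur.fetchall()
    finally:
        conn.close()

# Step 1: restrict T1 on the node that holds it.
local = psycopg2.connect("host=db1 dbname=app")
cur = local.cursor()
cur.execute("SELECT f1_1, f1_2 FROM t1 WHERE f1_3 = %s", ("somedata",))
t1_rows = cur.fetchall()
local.close()

# Step 2: probe T2 and T3 on their own servers, in parallel
# (assumes the key sets are small and non-empty, as in the thread).
keys1 = tuple(r[0] for r in t1_rows)
keys2 = tuple(r[1] for r in t1_rows)
results = {}
jobs = [
    threading.Thread(target=fetch, args=("host=db2 dbname=app",
        "SELECT * FROM t2 WHERE f2_1 IN %s", (keys1,), results, "t2")),
    threading.Thread(target=fetch, args=("host=db3 dbname=app",
        "SELECT * FROM t3 WHERE f3_1 IN %s", (keys2,), results, "t3")),
]
for j in jobs:
    j.start()
for j in jobs:
    j.join()

# Step 3: join t1_rows with results["t2"] and results["t3"] in memory;
# with ~100 matching rows this is cheap compared to the table scans.

The only point of the sketch is that the two remote probes overlap in time; whether that wins depends on the matching key set staying small, exactly as argued above.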
Thanks.

I have just had another thought. If all the tables are split across several computers, this would help as well.

For example, if we have 100 records and 2 database servers, each server could have 50 of those 100 records on it. When a selection is required, each server would look through its much smaller database, and report back the "hits". This would, effectively, provide a near linear speedup in the query time, while introducing only a minor network overhead (or a major one, depending on how much data is transferred).

Some extra logic could then be implemented for related tables that would allow the most closely related records from the different tables to be "clustered" (as in kind of remotely similar to the CLUSTER command) on the same server, for faster response time and minimized network usage requirements. The "vacuum" or "cluster" features could be used overnight to re-optimize the distribution of records across the servers.

In all this, a "master" node could be used for coordinating the whole operation. We could ask the master node to do a query, and it would automatically, knowing what slaves it has, fire off that query on them. Each slave would then, in parallel, execute the query and return a subset of the data we were looking for. This data would then be joined into one recordset before it is returned to the client that requested it.

As far as I can see, as long as the amounts of data shifted aren't huge enough to cause problems with network congestion, and the query time is dominant over the data transfer time over the network, this should provide a rather scalable system. I understand that the form of database clustering I am mentioning here is fairly rudimentary and unsophisticated, but it would certainly be a very useful feature.

Are there any plans to implement this sort of functionality in PostgreSQL? Or is this a lot more complicated than it seems...

Regards.

Gordan

----- Original Message -----
From: "Nathan Barnett" <nbarnett@cellularphones.com>
To: "'Gordan Bobic'" <gordan@freeuk.com>; <pgsql-general@postgresql.org>
Sent: Thursday, November 30, 2000 2:34 PM
Subject: RE: [GENERAL] Database cluster?

[snip]
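The scatter-gather step described above (fire the same query at every slave, merge the hits on the master) can be sketched in a few lines. The shard list, the psycopg2 driver and the example query are assumptions made only for illustration:

from concurrent.futures import ThreadPoolExecutor
import psycopg2

SHARDS = ["host=db1 dbname=app", "host=db2 dbname=app"]

def query_shard(dsn, sql, params):
    # Each shard scans only its own (smaller) slice of the table.
    conn = psycopg2.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()
    finally:
        conn.close()

def scatter_gather(sql, params=()):
    # Fan the query out in parallel, then concatenate the hits; the
    # union of the per-shard results is the full answer as long as
    # every row lives on exactly one shard.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        parts = list(pool.map(lambda dsn: query_shard(dsn, sql, params),
                              SHARDS))
    rows = []
    for part in parts:
        rows.extend(part)
    return rows

hits = scatter_gather("SELECT * FROM t1 WHERE f1_3 = %s", ("somedata",))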
You're almost describing a Teradata DBM. What an amazing machine! Last I heard, about 6 years ago though, AT&T was rewriting it as an NT app instead of running on proprietary hardware.

The proprietary hardware was essentially a cluster of 80486 computers (at the time). What they had done was implement a pyramid structure of 80486 computers. The lowest level of computers had hard disks and stored the data. Two of the lowest level computers would "report" to a single higher-up computer. Two of these higher-up computers would "report" to yet another single higher-up computer, until there was only one higher-up computer left to report to.

The thing that impressed me the most about this architecture was that sorting was practically built in. So all the intermediary computers had to do was merge the sorted result sets from their lower level computers. Blazing! And data was stored on a couple of leaf-level computers for redundancy.

I miss that old beast. But I certainly cannot afford the multimillion dollars required to get one for myself. We lovingly called the one we worked with the "Pteradactyl," which is the old name for that bird-like dinosaur (evidently there's a new word for the bird-like dinosaur, the pteranodon or something?).

Doug

At 02:44 PM 11/30/00 -0000, Gordan Bobic wrote:
[snip]
> You're almost describing a Teradata DBM.

I knew someone must have thought of it before. ;-)

[snip]

> The thing that impressed me the most about this architecture was that sorting was practically built in. So all the intermediary computers had to do was merge the sorted result sets from their lower level computers. Blazing!

They effectively implemented a binary tree in hardware. One hell of an indexing mechanism. :-)

> I miss that old beast. But I certainly cannot afford the multimillion dollars required to get one for myself.

I suppose it would depend on how many computers you want to have in this cluster. The main reason why clusters are getting popular recently (albeit not yet for databases, or so it would seem) is because it is cheaper than anything else with similar performance.

The main question remains - are there any plans to implement something similar to this with PostgreSQL? I would volunteer to help with some coding, if a "group" was formed to work on this "clustering" module.

Regards.

Gordan
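The "merge already-sorted streams" trick that makes that tree structure so effective is easy to show in isolation; the per-node rows below are made-up sample data, used only to illustrate the idea:

import heapq

# Each leaf node returns its slice already sorted on the key, so the
# coordinator (or an intermediate node) never re-sorts - it only merges.
node_a = [(1, "row"), (4, "row"), (9, "row")]
node_b = [(2, "row"), (3, "row"), (8, "row")]
node_c = [(5, "row"), (6, "row"), (7, "row")]

merged = list(heapq.merge(node_a, node_b, node_c, key=lambda r: r[0]))
# merged is globally ordered; the merge costs O(total rows * log(nodes)).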
I actually analyzed it once. I came to the conclusion that to do it right it would be easier to make an almost entirely new db but use the same external interfaces as PostgreSQL. To do a kludge of it, one might just implement a tier that sits between the user and a bunch of standard PostgreSQL backends.

It'd make a neat companion project, though. Like PG/Enterprise or PG/Warehouse or something.

Doug

At 04:02 PM 11/30/00 -0000, Gordan Bobic wrote:
[snip]
On Thu, 30 Nov 2000, Doug Semig wrote:

> I actually analyzed it once. I came to the conclusion that to do it right it would be easier to make an almost entirely new db but use the same external interfaces as PostgreSQL. To do a kludge of it, one might just implement a tier that sits between the user and a bunch of standard PostgreSQL backends.
>
> It'd make a neat companion project, though. Like PG/Enterprise or PG/Warehouse or something.

I'm currently developing a simple version of such a system as a university project. It is a fairly simple approach with a proxy or a distributor in front of a bunch of standard PostgreSQL database servers.

The proxy monitors and forwards the requests from the clients to the database servers. If it is a read-only request, the query is forwarded to the database server currently experiencing the lowest load/most free memory; otherwise it is sent to all database servers.

This approach obviously only performs well in systems with a high ratio of read-only queries, such as search engines and so on.

--
Bye, Peter Korsgaard
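The routing rule described above fits on a page. This is only a sketch of the idea - the replica list, the load metric and the psycopg2 driver are all assumptions, and real load figures would have to come from something like the status daemons mentioned later in the thread:

import psycopg2

REPLICAS = ["host=db1 dbname=app",
            "host=db2 dbname=app",
            "host=db3 dbname=app"]

def current_load(dsn):
    # Placeholder: a real proxy would ask a monitoring daemon on each
    # node for load average / free memory.
    return 0.0

def is_read_only(sql):
    return sql.lstrip().upper().startswith("SELECT")

def route(sql, params=()):
    # Reads go to the least-loaded replica; writes are broadcast to all.
    # Note there is no two-phase commit here, so a node that misses a
    # write is out of sync - the synchronisation problem discussed below.
    targets = [min(REPLICAS, key=current_load)] if is_read_only(sql) else REPLICAS
    rows = None
    for dsn in targets:
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            if is_read_only(sql):
                rows = cur.fetchall()
            else:
                conn.commit()
        finally:
            conn.close()
    return rows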
> The proxy monitors and forwards the requests from the clients to the database servers. If it is a read-only request the query is forwarded to the database server currently experiencing the lowest load/most free memory, otherwise it is sent to all database servers.
>
> This approach obviously only performs well in systems with a high ratio of read-only queries, such as search engines and so on.

The tough part is synchronisation, should one of the machines drop out of the cluster and need to be re-added without bringing the others down. In order to get around that, each query needs to be logged on the master node with a timestamp, so that the failed node can "catch up" in real time. That brings about other considerations, as well....

steve
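A minimal sketch of that catch-up log - the file location, the format and the execute() helper are hypothetical, purely to make the idea concrete:

import time

LOG = "/var/log/pgcluster/statements.log"   # hypothetical location

def log_write(sql):
    # The master appends every write statement with a timestamp.
    with open(LOG, "a") as f:
        f.write("%.6f\t%s\n" % (time.time(), sql.replace("\n", " ")))

def replay_since(last_seen_ts, execute):
    # A node rejoining the cluster replays everything newer than the
    # last statement it saw; execute() is whatever applies one
    # statement to that node (hypothetical helper).
    with open(LOG) as f:
        for line in f:
            ts, sql = line.rstrip("\n").split("\t", 1)
            if float(ts) > last_seen_ts:
                execute(sql)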
On Thu, 30 Nov 2000, Steve Wolfe wrote:

> The tough part is synchronisation, should one of the machines drop out of the cluster and need to be re-added without bringing the others down. In order to get around that, each query needs to be logged on the master node with a timestamp, so that the failed node can "catch up" in real time. That brings about other considerations, as well....

Exactly! That's also why I have decided not to implement that feature, as the deadline is in 3 weeks ;) If a database server fails it is simply discarded from the cluster and can only be reconnected by taking the system offline and doing the synchronisation manually.

--
Bye, Peter Korsgaard
I've successfully patched linux kernel 2.2.17 with DIPC and modified postgresql's src (src/backend/storage/ipc/ipc.c) to create distributed shm and sem.

The strategy is then to start a postgresql that creates shm and sem on ONE machine, then start other postgres processes on the other machines in the cluster that create NO shared structures (there is a command line flag to do this).

Then you can connect to any of the postgres instances on your cluster, for example: round robin.

Another issue is the datafiles. GFS seems promising, but postgresql uses fcntl, and GFS (globalfilesystem.org) doesn't support it yet. A distributed filesystem with locking etc. is required. Ideas?

Another issue is that DIPC doesn't have a failover mechanism.

This is a shared-all approach. It's not the best, but it's probably the fastest (if crude) solution to implement, with only small modifications (4-5 lines) to the postgresql sources.

This system can give a sort of single-system-image, useful for distributing other software beyond postgresql.

Also Mariposa (http://s2k-ftp.cs.berkeley.edu:8000/mariposa/) seems interesting, but it's not maintained and it's for an old postgresql version.

hoping for clustering...
valter mazzola.

> From: Alain Toussaint <nailed@videotron.ca>
> To: PostgreSQL general list <pgsql-general@postgresql.org>
> Subject: Re: [GENERAL] Database cluster?
> Date: Thu, 30 Nov 2000 15:05:16 -0500 (EST)
>
> > Somebody mentioned the fact that postgres uses IPC for communicating between processes. I think there are tools for clustering (I am not sure if Mosix supports transparently allowing IPC across nodes) which can work around that.
>
> one of those tools is distributed ipc <http://wallybox.cei.net/dipc/> but it only works with Linux, AFAIK; the software there is just a patch to the Linux kernel and a daemon.
>
> Alain
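The one piece of that setup that lives above the OS is the client-side round robin; a trivial sketch, with the node names and the psycopg2 driver assumed only for illustration:

import itertools
import psycopg2

# With DIPC + a shared filesystem making the backends equivalent,
# clients can simply rotate across the nodes.
NODES = itertools.cycle(["host=node1 dbname=app",
                         "host=node2 dbname=app",
                         "host=node3 dbname=app"])

def connect_round_robin():
    # Each call hands back a connection to the next node in the ring.
    return psycopg2.connect(next(NODES))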
> > > > I am considering splitting the database into
> > > > tables residing on separate machines, and connect
> > > > them on one master node.
>
> > Splitting tables across multiple machines would do
> > nothing more than make the entire system run at a
> > snail's pace . . . because you just couldn't move
> > data between machines quickly enough.
>
> In my case, queries typically return ... [a]round 100
> records at most. [E]ven over 10 Mb ethernet, it would
> take at most about a second to transfer. This is a
> much smaller delay than the query time itself, which
> can take 10 seconds or more.
>
> So, splitting the data in such a way that one table is
> queried, and then tables joined from it are queried in
> parallel, would cause a significant speed-up.

Then do exactly that: run separate PostgreSQL databases on multiple machines and build a data abstraction layer that does the join manually. If there really are only small numbers of rows, then it's just as fast to transfer the data to your application machine as to transfer it all to a common PostgreSQL machine and then to your application.

If you really need application transparency, then things get a bit uglier; it shouldn't be too hard to build your own middleware layer that lets you treat the data storage as a single entity.

- Tim
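A sketch of the kind of data abstraction layer meant here: the application talks to one object, the layer knows which machine holds which table and does the join itself by shipping only the matching keys. The table-to-host mapping, the column names and the psycopg2 driver are illustrative assumptions:

import psycopg2

class Storage:
    # Which machine holds which table (assumed layout).
    TABLE_HOST = {"t1": "host=db1 dbname=app",
                  "t2": "host=db2 dbname=app",
                  "t3": "host=db3 dbname=app"}

    def fetch(self, table, where_sql, params=()):
        conn = psycopg2.connect(self.TABLE_HOST[table])
        try:
            cur = conn.cursor()
            # The table name comes from our own mapping, so formatting
            # it into the SQL string is safe here.
            cur.execute("SELECT * FROM %s WHERE %s" % (table, where_sql),
                        params)
            return cur.fetchall()
        finally:
            conn.close()

store = Storage()
t1_rows = store.fetch("t1", "f1_3 = %s", ("somedata",))

# "Join manually": ship only the matching keys to the machine that
# holds T2, rather than shipping either table anywhere.
keys = tuple(r[0] for r in t1_rows)
t2_rows = store.fetch("t2", "f2_1 IN %s", (keys,)) if keys else []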
> I actually analyzed it once. I came to the conclusion that to do it right it would be easier to make an almost entirely new db but use the same external interfaces as PostgreSQL.

I admit that I am not really too up-to-date on database theory, but I am a bit surprised at that...

> To do a kludge of it, one might just implement a tier that sits between the user and a bunch of standard PostgreSQL backends.

That is precisely what I was thinking about. There would have to be a "master" node that controls what goes where, and distributes the load. This "shouldn't" be too difficult (although I am not totally sure what I mean by that). The nasty bit would probably be hacking the optimizer, the SQL command "CLUSTER", and VACUUM to take account of, and efficiently use, all the extra room for improving the performance.

Automating a "near-optimal" distribution of tables across machines could be a bit of a difficult problem from the theory side, but it ought to be possible. There are several options here. One could just put one table on each server, which is unlikely to be all that beneficial, although in a multi-table join you'd want to search the smallest tables first. Then there's the option of just splitting each table across multiple machines. There is also the possibility of having some records overlap between machines, if the on-line optimizer decides that that would be useful for performance, and then sorting out the syncing somehow. Or one could set up an even more sophisticated system where only the tables and data that would benefit from being together would be on the same server, so there could be a section of two tables on one server, the rest of those two tables and a section of another table on another server, etc. Basically, make both the table and record allocations completely dynamic between the servers.

I am not sure how useful each of these splits would be, but it is certainly something well worth exploring theoretically before the actual implementation, because I reserve the right to be wrong in thinking that any of these methods would produce an actual improvement in performance. And, of course, there would be the bit about getting the optimizer and partial replication to work properly across servers, which may not be an easy task.

> It'd make a neat companion project, though. Like PG/Enterprise or PG/Warehouse or something.

I agree. It would be really neat. Something like Mosix, but for databases. And it just sounds like something that would be really useful for large databases, especially as we start reaching the steep part of the price/performance curve for database servers.

Regards.

Gordan
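The second of the options listed above (splitting each table row-wise across all the machines) is the easiest one to make concrete. A toy sketch, with the server names and the choice of hash being assumptions:

import zlib

SERVERS = ["db1", "db2", "db3"]

def home_server(table, key):
    # Stable hash of (table, primary key) -> one of the N servers,
    # so each server ends up holding roughly 1/N of every table.
    h = zlib.crc32(("%s:%s" % (table, key)).encode("utf-8"))
    return SERVERS[h % len(SERVERS)]

# An insert or a lookup that names the key goes straight to
# home_server(table, key); a query that cannot name a key falls back
# to asking all servers (the scatter-gather case sketched earlier).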
> > I actually analyzed it once. I came to the conclusion that to do it right it would be easier to make an almost entirely new db but use the same external interfaces as PostgreSQL. To do a kludge of it, one might just implement a tier that sits between the user and a bunch of standard PostgreSQL backends.
> >
> > It'd make a neat companion project, though. Like PG/Enterprise or PG/Warehouse or something.
>
> I'm currently developing a simple version of such a system as a university project. It is a fairly simple approach with a proxy or a distributor in front of a bunch of standard PostgreSQL database servers.
>
> The proxy monitors and forwards the requests from the clients to the database servers. If it is a read-only request the query is forwarded to the database server currently experiencing the lowest load/most free memory, otherwise it is sent to all database servers.

Surely, you have to send off a query to all servers that the specific tables you are searching are split across. How are you handling splitting of the data?

> This approach obviously only performs well in systems with a high ratio of read-only queries, such as search engines and so on.

This implies lots of servers with identical data on them. Am I right?

Could you tell us a bit more about your project (if it's not breaching any non-disclosure agreements, that is)? It could be a good starting point for what we are talking about implementing.

Regards.

Gordan
On Fri, 1 Dec 2000, Gordan Bobic wrote:

> > The proxy monitors and forwards the requests from the clients to the database servers. If it is a read-only request the query is forwarded to the database server currently experiencing the lowest load/most free memory, otherwise it is sent to all database servers.
>
> Surely, you have to send off a query to all servers that the specific tables you are searching are split across. How are you handling splitting of the data?

I'm not ;) All the database servers of the cluster simply each contain a copy of all the data. I know this isn't an optimal solution, but this project is only running for a very short period (around 2 months).

> > read-only queries, such as search engines and so on.
>
> This implies lots of servers with identical data on them. Am I right?
>
> Could you tell us a bit more about your project (if it's not breaching any non-disclosure agreements, that is)? It could be a good starting point for what we are talking about implementing.

There are no NDAs involved in this project. We are planning on releasing the software under the GPL after the deadline (22nd of December).

There isn't much more to tell about the system at this time. Currently we are finishing off the lower levels of the system, like the database driver (currently only postgres), status daemons through SNMP and so on. At this time the actual scheduling hasn't been implemented.

--
Bye, Peter Korsgaard
> On Thu, 30 Nov 2000, Steve Wolfe wrote:
>
> > The tough part is synchronisation, should one of the machines drop out of the cluster and need to be re-added without bringing the others down. In order to get around that, each query needs to be logged on the master node with a timestamp, so that the failed node can "catch up" in real time. That brings about other considerations, as well....
>
> Exactly! That's also why I have decided not to implement that feature, as the deadline is in 3 weeks ;)

Fair!

> If a database server fails it is simply discarded from the cluster and can only be reconnected by taking the system offline and doing the synchronisation manually.

Where's all that enthusiasm for a new project gone? Ever thought of making a "new and improved" version 2 or something? I think it's a great idea, and if we can get a few people together to implement it, it should be a worthwhile exercise. And right now, it looks like you have a bit of a head start. ;-)

Regards.

Gordan
> > The proxy monitors and forwards the requests from the clients to the database servers. If it is a read-only request the query is forwarded to the database server currently experiencing the lowest load/most free memory, otherwise it is sent to all database servers.
> >
> > This approach obviously only performs well in systems with a high ratio of read-only queries, such as search engines and so on.
>
> The tough part is synchronisation, should one of the machines drop out of the cluster and need to be re-added without bringing the others down. In order to get around that, each query needs to be logged on the master node with a timestamp, so that the failed node can "catch up" in real time. That brings about other considerations, as well....

What about using something like a RAID-type setup on the database servers? So that, for example, all data (records or tables, depending on the split) exists on more than one server. That way, if a server goes down, all the data would still exist on the other servers. A server failure could trigger a replication operation that would re-distribute the data across the remaining servers in such a way that if another one fails, the data integrity would still be preserved.

Similarly, when a new server is added, this should trigger an operation that would re-distribute records in such a way as to ensure redundancy with the new number of servers (in this case, move some records from the old servers to the new one and load balance it). This, again, "shouldn't" be too hard to implement, if we could sort out the design first.

However, this could be (theoretically) difficult to combine with "near-optimal" table/record distribution. The two might go against each other. There should be a way of enforcing the redundancy part, and then doing a "best effort" to do it in such a way as to gain the most performance improvement.

This _could_ be done, and off the top of my head, I cannot see any reason why it wouldn't work reasonably well. Obviously, a server failure would cause the whole thing to slow down for a while until the records were re-distributed and the optimizer re-assesses the situation, but that is true of any redundant system that has to remain at least operational during a failure. After the replication is completed, it should resume at its full speed (dictated by the number of available servers).

Any opinions on this?

Regards.

Gordan
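A toy sketch of that RAID-like placement - every shard of data kept on two different servers, and a list of re-replication moves computed when one server fails. The server names, shard count and replication factor are assumptions for illustration only:

SERVERS = ["db1", "db2", "db3", "db4"]
COPIES = 2   # each shard lives on two servers

def placement(shard_id, servers=SERVERS):
    # Shard k lives on servers k mod N and (k+1) mod N, so losing any
    # single server still leaves one full copy of everything.
    n = len(servers)
    return [servers[(shard_id + i) % n] for i in range(COPIES)]

def rereplication_moves(shard_count, failed):
    # For every shard that lost a copy, pick the surviving copy as the
    # source and some other live server as the new home.  Assumes more
    # live servers remain than the number of copies.
    alive = [s for s in SERVERS if s != failed]
    moves = []
    for shard in range(shard_count):
        homes = placement(shard)
        if failed in homes:
            source = next(s for s in homes if s != failed)
            target = next(s for s in alive if s not in homes)
            moves.append((shard, source, target))
    return moves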
> I've successfully patched linux kernel 2.2.17 with DIPC and modified postgresql's src (src/backend/storage/ipc/ipc.c) to create distributed shm and sem.

Please forgive my ignorance (I haven't used Postgres for that long), but what are shm and sem?

> The strategy is then to start a postgresql that creates shm and sem on ONE machine, then start other postgres processes on the other machines in the cluster that create NO shared structures (there is a command line flag to do this).

So, one "master" and lots of "slaves", right?

> Then you can connect to any of the postgres instances on your cluster, for example: round robin.

Hmm... But is this really what we want to do? This is less than ideal for several reasons (if I understand what you're saying correctly). Replication is off-line for a start, and it only works well for a system that has few inserts and lots of selects, probably from a lot of different users. Probably a good thing for applications like web search engines, but not necessarily for much else.

> Another issue is the datafiles. GFS seems promising, but postgresql uses fcntl, and GFS (globalfilesystem.org) doesn't support it yet. A distributed filesystem with locking etc. is required. Ideas?

Hmm... I am not sure that a distributed file system is what we want here. I think it might be better to have separate postgres databases on separate local file systems, and handle putting the data together at a higher level. I think this would be better for both performance and scalability. Having one big file system is likely to incur heavy network traffic penalties, and that is not necessary, as it can be avoided by just having the distribution done at the database level, rather than the file system level.

But then again, the distributed file system can be seen as a "neater" solution, and it might work rather well, if they get the caching right, with the correct "near-line" distribution of data across the network file system to make sure that the data is where it is most useful. In other words, make sure that the files (or even better, inodes) that are frequently accessed by a computer are on that computer.

Still, there is the issue of replication and redundancy. I just think that for a database application this would be best done at the database level, rather than the file system level, unless the distributed file system in use was designed with all the database-useful features in mind.

> Another issue is that DIPC doesn't have a failover mechanism.

Again, for a database, it might be best to handle it at a higher level.

> This is a shared-all approach. It's not the best, but it's probably the fastest (if crude) solution to implement, with only small modifications (4-5 lines) to the postgresql sources.

Indeed. As such, it should probably be the first thing to do toward "clustering" a database. Still, it would be good to have a clear development path, even though on that path we kludge things slightly at various steps in order to have a usable system now, as opposed to a "perfect" system later.

A shared-all approach is not necessarily that bad. It is (as far as I can tell) not better or worse than a "shared nothing" approach. They both have pros and cons. Ideally, we should work toward coming up with an idea for a hybrid system that would pick the best of both worlds.

> This system can give a sort of single-system-image, useful for distributing other software beyond postgresql.

Indeed.
This is always a good thing for scalability for most applications, but databases have their specific requirements which may not be best catered for by standard means of distributed processing. Still, what you are suggesting would be a major improvement, from where I'm looking at it, but I am probably biased by looking at it from the point of view of my particular application.

> Also Mariposa (http://s2k-ftp.cs.berkeley.edu:8000/mariposa/) seems interesting, but it's not maintained and it's for an old postgresql version.

Hmm... Most interesting. There could be something recyclable in there. Must look at the specs and some source later...

Regards.

Gordan
> > > The proxy monitors and forwards the requests from the clients to the database servers. If it is a read-only request the query is forwarded to the database server currently experiencing the lowest load/most free memory, otherwise it is sent to all database servers.
> >
> > Surely, you have to send off a query to all servers that the specific tables you are searching are split across. How are you handling splitting of the data?
>
> I'm not ;)

That's what I thought you were implying...

> All the database servers of the cluster simply each contain a copy of all the data. I know this isn't an optimal solution, but this project is only running for a very short period (around 2 months).

It's actually a rather fair compromise, given the time scale (IMHO). It's like mirroring disks with RAID: you get the same write performance, but multiply the read performance across multiple disks. It sounds like you're doing the same thing with databases. And as your application is a web search engine, this should work quite well, especially under heavy load / lots of hits.

> > > read-only queries, such as search engines and so on.
> >
> > This implies lots of servers with identical data on them. Am I right?
> >
> > Could you tell us a bit more about your project (if it's not breaching any non-disclosure agreements, that is)? It could be a good starting point for what we are talking about implementing.
>
> There are no NDAs involved in this project. We are planning on releasing the software under the GPL after the deadline (22nd of December).

Hmm... I'll be looking forward to having a look at that. Could you post us a URL, once it is released?

Are you intending to continue development of this? I'm just thinking about how well this approach would work with something like what Valter has suggested here recently...

Regards.

Gordan
> > > > > I am considering splitting the database into
> > > > > tables residing on separate machines, and connect
> > > > > them on one master node.
>
> > > Splitting tables across multiple machines would do
> > > nothing more than make the entire system run at a
> > > snail's pace . . . because you just couldn't move
> > > data between machines quickly enough.
> >
> > In my case, queries typically return ... [a]round 100
> > records at most. [E]ven over 10 Mb ethernet, it would
> > take at most about a second to transfer. This is a
> > much smaller delay than the query time itself, which
> > can take 10 seconds or more.
> >
> > So, splitting the data in such a way that one table is
> > queried, and then tables joined from it are queried in
> > parallel, would cause a significant speed-up.
>
> Then do exactly that: run separate PostgreSQL databases on multiple machines and build a data abstraction layer that does the join manually. If there really are only small numbers of rows, then it's just as fast to transfer the data to your application machine as to transfer it all to a common PostgreSQL machine and then to your application.

I was hoping to do precisely that. But I never got an answer to the question of whether postgres supports linking to tables external to its database. Is it possible to attach a table from another postgres database that is on a different server?

> If you really need application transparency

Which would always be nice...

> then things get a bit uglier;

I know.

> it shouldn't be too hard to build your own middleware layer that lets you treat the data storage as a single entity.

Yes, but if plans are formed for something like an "official" postgres-based project to do this, then I'd rather spend time working on the development of that than spend time building my own proprietary solution.

Regards.

Gordan
Have you looked at InterMezzo? http://www.inter-mezzo.org/

Valter Mazzola wrote:
[snip]
> From: "Gordan Bobic"
> Subject: Re: [GENERAL] Database cluster?
> Date: Fri, 1 Dec 2000 10:13:55 -0000
>
> > I've successfully patched linux kernel 2.2.17 with DIPC and modified postgresql's src (src/backend/storage/ipc/ipc.c) to create distributed shm and sem.
>
> Please forgive my ignorance (I haven't used Postgres for that long), but what are shm and sem?

Shared memory and semaphores.

> > The strategy is then to start a postgresql that creates shm and sem on ONE machine, then start other postgres processes on the other machines in the cluster that create NO shared structures (there is a command line flag to do this).
>
> So, one "master" and lots of "slaves", right?

No, every machine is totally similar to the others; the only difference is that only ONE machine creates the (network-distributed, by DIPC) shared memory and semaphores.

> > Then you can connect to any of the postgres instances on your cluster, for example: round robin.
>
> Hmm... But is this really what we want to do? This is less than ideal for several reasons (if I understand what you're saying correctly). Replication is off-line for a start, and it only works well for a system that has few inserts and lots of selects, probably from a lot of different users. Probably a good thing for applications like web search engines, but not necessarily for much else.

*** It isn't replication. It's that your cluster behaves like a single computer. You modify the 'OS' (GFS + DIPC), not postgresql.

> > Another issue is the datafiles. GFS seems promising, but postgresql uses fcntl, and GFS (globalfilesystem.org) doesn't support it yet. A distributed filesystem with locking etc. is required. Ideas?
>
> Hmm... I am not sure that a distributed file system is what we want here. I think it might be better to have separate postgres databases on separate local file systems, and handle putting the data together at a higher level. I think this would be better for both performance and scalability.

*** Yes... but WHEN can we have these features? No one has done it till now; I've asked around and searched, but got almost no reply.

> Having one big file system is likely to incur heavy network traffic penalties, and that is not necessary, as it can be avoided by just having the distribution done at the database level, rather than the file system level.
>
> But then again, the distributed file system can be seen as a "neater" solution, and it might work rather well, if they get the caching right, with the correct "near-line" distribution of data across the network file system to make sure that the data is where it is most useful. In other words, make sure that the files (or even better, inodes) that are frequently accessed by a computer are on that computer.
>
> Still, there is the issue of replication and redundancy.

*** GFS does it transparently.

> I just think that for a database application this would be best done at the database level, rather than the file system level, unless the distributed file system in use was designed with all the database-useful features in mind.
>
> > Another issue is that DIPC doesn't have a failover mechanism.
>
> Again, for a database, it might be best to handle it at a higher level.
>
> > This is a shared-all approach. It's not the best, but it's probably the fastest (if crude) solution to implement, with only small modifications (4-5 lines) to the postgresql sources.
>
> Indeed. As such, it should probably be the first thing to do toward "clustering" a database.
> Still, it would be good to have a clear development path, even though on that path we kludge things slightly at various steps in order to have a usable system now, as opposed to a "perfect" system later.

*** Yes, I want clustering now... and I'm alone. In my opinion, if GFS will do fcntl (and we can ask the GFS people, I think), the stuff in this email can be done rapidly.

> A shared-all approach is not necessarily that bad. It is (as far as I can tell) not better or worse than a "shared nothing" approach. They both have pros and cons. Ideally, we should work toward coming up with an idea for a hybrid system that would pick the best of both worlds.
>
> > This system can give a sort of single-system-image, useful for distributing other software beyond postgresql.
>
> Indeed. This is always a good thing for scalability for most applications, but databases have their specific requirements which may not be best catered for by standard means of distributed processing. Still, what you are suggesting would be a major improvement, from where I'm looking at it, but I am probably biased by looking at it from the point of view of my particular application.
>
> > Also Mariposa (http://s2k-ftp.cs.berkeley.edu:8000/mariposa/) seems interesting, but it's not maintained and it's for an old postgresql version.
>
> Hmm... Most interesting. There could be something recyclable in there. Must look at the specs and some source later...

*** I've compiled it, but with no results. An idea is to get a diff against the corresponding pure postgresql version (6.4/5?), then study the patch and grab the secrets to fuse into the current version. The research papers seem very good. Perhaps one of the people who worked on Mariposa can help...

My goal is to have a clustered open source database with the least effort possible, now.

The project to do good stuff (i.e. code) in this field is very long... I hope that somebody will start a real thing... one idea is to start a project on cosource or similar to receive funding $$. This project is very important for the OpenSource world.

valter

> Regards.
>
> Gordan
> > > Then you can connect to any of the postgres instances on your cluster, for example: round robin.
> >
> > Hmm... But is this really what we want to do? This is less than ideal for several reasons (if I understand what you're saying correctly). Replication is off-line for a start, and it only works well for a system that has few inserts and lots of selects, probably from a lot of different users. Probably a good thing for applications like web search engines, but not necessarily for much else.
>
> *** It isn't replication. It's that your cluster behaves like a single computer. You modify the 'OS' (GFS + DIPC), not postgresql.

OK, that makes sense. Kind of like Mosix, then. But like Mosix, this would require lots of network bandwidth - or not, depending on how good GFS is at figuring out what goes where.

> > > Another issue is the datafiles. GFS seems promising, but postgresql uses fcntl, and GFS (globalfilesystem.org) doesn't support it yet. A distributed filesystem with locking etc. is required. Ideas?
> >
> > Hmm... I am not sure that a distributed file system is what we want here. I think it might be better to have separate postgres databases on separate local file systems, and handle putting the data together at a higher level. I think this would be better for both performance and scalability.
>
> *** Yes... but WHEN can we have these features? No one has done it till now; I've asked around and searched, but got almost no reply.

Well, if you come up with a detailed design, I'm quite happy to help with coding individual functions...

> > Having one big file system is likely to incur heavy network traffic penalties, and that is not necessary, as it can be avoided by just having the distribution done at the database level, rather than the file system level.
> >
> > But then again, the distributed file system can be seen as a "neater" solution, and it might work rather well, if they get the caching right, with the correct "near-line" distribution of data across the network file system to make sure that the data is where it is most useful. In other words, make sure that the files (or even better, inodes) that are frequently accessed by a computer are on that computer.
> >
> > Still, there is the issue of replication and redundancy.
>
> *** GFS does it transparently.

But wouldn't this all be incredibly network intensive? Could we implement something that would make a process go to the data, instead of the other way around? In a database, the data is typically bigger than the process accessing it...

> > Indeed. As such, it should probably be the first thing to do toward "clustering" a database. Still, it would be good to have a clear development path, even though on that path we kludge things slightly at various steps in order to have a usable system now, as opposed to a "perfect" system later.
>
> *** Yes, I want clustering now... and I'm alone.

No, you're not. I NEED clustering now. Eventually the number of records and tables comes and bites you, no matter how much you optimize your application. And for most of us mere mortals, buying a Cray for running a database is just not a viable option...

> In my opinion, if GFS will do fcntl (and we can ask the GFS people, I think), the stuff in this email can be done rapidly.

Well, I think you've just volunteered to contact them. :-)

> > A shared-all approach is not necessarily that bad. It is (as far as I can tell) not better or worse than a "shared nothing" approach. They both have pros and cons.
> > Ideally, we should work toward coming up with an idea for a hybrid system that would pick the best of both worlds.
> >
> > > This system can give a sort of single-system-image, useful for distributing other software beyond postgresql.
> >
> > Indeed. This is always a good thing for scalability for most applications, but databases have their specific requirements which may not be best catered for by standard means of distributed processing. Still, what you are suggesting would be a major improvement, from where I'm looking at it, but I am probably biased by looking at it from the point of view of my particular application.
> >
> > > Also Mariposa (http://s2k-ftp.cs.berkeley.edu:8000/mariposa/) seems interesting, but it's not maintained and it's for an old postgresql version.
> >
> > Hmm... Most interesting. There could be something recyclable in there. Must look at the specs and some source later...
>
> *** I've compiled it, but with no results. An idea is to get a diff against the corresponding pure postgresql version (6.4/5?), then study the patch and grab the secrets to fuse into the current version. The research papers seem very good. Perhaps one of the people who worked on Mariposa can help...

See above comment...

> My goal is to have a clustered open source database with the least effort possible, now.
>
> The project to do good stuff (i.e. code) in this field is very long...

Indeed. There has to be a feasible starting point that yields modest improvements at modest cost (in time and effort, in this case).

> I hope that somebody will start a real thing... one idea is to start a project on cosource or similar to receive funding $$. This project is very important for the OpenSource world.

I agree. Having a fully clustered database with very little network overhead would be a major success, both for Postgres and OpenSource.

Here's an obvious question - how good is (does it exist?) clustering support on Oracle?

Regards.

Gordan
> Please forgive my ignorance (I haven't used Postgres for that long), but what are shm and sem?

Shared memory and semaphores. An interesting tidbit on Linux (2.4.0-test series at least) is that shared memory can be a nodev filesystem (like proc or devfs):

/dev/ide/host0/bus0/target0/lun0/part2 on / type ext2 (rw,errors=remount-ro,errors=remount-ro)
proc on /proc type proc (rw)
/shm on /shm type shm (rw)

and there are at least a few other nodev filesystems in the latest kernel series:

nodev shm
nodev sockfs
nodev pipefs
nodev proc
      ext2
nodev devfs

I'm doing some heavy research into these filesystems but so far there's a BIG lack of docs.

Alain