Thread: Partitioning / Clustering
What is the status of Postgres support for any sort of multi-machine scaling? What are you meant to do once you've upgraded your box and tuned the conf files as much as you can, but your query load is still too high for a single machine?

Upgrading stock Dell boxes (I know we could be using better machines, but I am trying to tackle the real issue) is not a hugely price-efficient way of getting extra performance, nor particularly scalable in the long term.

So, when is PG meant to be getting a decent partitioning system, if at all? MySQL is getting one (eventually) which is apparently meant to be similar to Oracle's, according to the docs. Clusgres does not appear to be widely used, or used at all, and info on it seems pretty thin on the ground, so I am not too keen on going with that.

Is the real solution to multi-machine partitioning (as in, not like MySQL's MERGE tables) on PostgreSQL actually doing it in our application API? This seems like a less than perfect solution once we want to add redundancy and things into the mix.
Alex Stapleton wrote:
> What is the status of Postgres support for any sort of multi-machine
> scaling support? [...]
> Upgrading stock Dell boxes [...] is not a hugely price
> efficient way of getting extra performance, nor particularly scalable
> in the long term.

Switch from Dell Xeon boxes and go to Opterons. :) Seriously, Dell is far away from Big Iron. I don't know what performance you are looking for, but you can easily get into inserting 10M rows/day with quality hardware.

But actually, is it your SELECT load that is too high, or your INSERT load, or something in between? Because Slony is around if it is a SELECT problem:
http://gborg.postgresql.org/project/slony1/projdisplay.php

Basically, Slony is a master/slave replication system. So if you have INSERTs going into the master, you can have as many replicated slaves as you like, which can handle your SELECT load. Slony is an asynchronous replicator, so there is a time delay from the INSERT until it shows up on a slave, but that time can be pretty small.

This would require some application-level support, since an INSERT goes to a different place than a SELECT. But there has been some discussion about pg_pool being able to spread the query load, being aware of the difference between a SELECT and an INSERT, and routing each query to the correct host. The biggest problem is that functions could cause a SELECT func() to actually insert a row, which pg_pool wouldn't know about. There are two possible solutions: a) don't do that when you are using this system, or b) add some sort of comment hint so that pg_pool can understand that the select is actually an INSERT and needs to be done on the master.

> So, when/is PG meant to be getting a decent partitioning system?
> MySQL is getting one (eventually) which is apparently meant to be
> similar to Oracle's according to the docs. [...]

There is also PGCluster:
http://pgfoundry.org/projects/pgcluster/

which is trying to be more of a synchronous multi-master system. I haven't heard of Clusgres, so I'm guessing it is an older attempt which has been overtaken by PGCluster.

Just realize that clusters don't necessarily scale like you would want them to. Because at some point you have to insert into the same table, which means you need to hold a lock which prevents the other machine from doing anything. And with synchronous replication, you have to wait for all of the machines to get a copy of the data before you can say it has been committed, which does *not* scale well with the number of machines.

If you can make it work, I think having a powerful master server which can finish an INSERT quickly, and then a bunch of Slony slaves with a middleman (like pg_pool) to do load balancing among them, is the best way to scale up. There are still some requirements, like not having to see the results of an INSERT instantly (though if you are using hinting to pg_pool, you could hint that a query must be done on the master, realizing that the more you do it, the more you slow everything down).

John =:->

PS> I don't know what functionality has actually been implemented in pg_pool, just that it was discussed in the past. Slony-II is also in the works.
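The SELECT-versus-write routing described above, including the comment-hint workaround for functions that write rows, can be sketched roughly as follows. This is an illustration only: the `/*master*/` hint syntax is an invented convention, not anything pg_pool actually implemented.

```python
import re

MASTER_HINT = re.compile(r"/\*\s*master\s*\*/", re.IGNORECASE)
SELECT_STMT = re.compile(r"SELECT\b", re.IGNORECASE)

def route_query(sql: str) -> str:
    """Decide whether a statement can go to a read-only slave.

    Returns "slave" only for plain SELECTs carrying no explicit master
    hint; everything else (INSERT/UPDATE/DELETE, and SELECTs tagged
    with the hint because they call a function that writes) goes to
    the master.
    """
    stripped = sql.lstrip()
    if MASTER_HINT.match(stripped):
        return "master"
    if SELECT_STMT.match(stripped):
        return "slave"
    return "master"
```

A real middleman would of course need a proper SQL parser rather than regexes, but the routing decision itself is this simple.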
I think that perhaps he was trying to avoid having to buy "Big Iron" at all.

With all the Opteron v. Xeon talk around here, and talk of $30,000 machines, perhaps it would be worth exploring the option of buying 10 cheapass machines for $300 each. At the moment, that $300 buys you, from Dell, a 2.5GHz Pentium 4 w/ 256MB of RAM, a 40GB hard drive, and gigabit ethernet. The aggregate CPU and bandwidth is pretty stupendous, but not as easy to harness as a single machine.

For those of us looking at batch and data warehousing applications, it would be really handy to be able to partition databases, tables, and processing load across banks of cheap hardware.

Yes, clustering solutions can distribute the data, and can even do it on a per-table basis in some cases. This still leaves it up to the application's logic to handle reunification of the data.

Ideas:

1. Create a table/storage type that consists of a select statement on another machine. While I don't think the current executor is capable of working on multiple nodes of an execution tree at the same time, it would be great if it could offload a select of tuples from a remote table to an entirely different server and merge the resulting data into the current execution. I believe MySQL has this, and Oracle may implement it in another way.

2. There is no #2 at this time, but I'm sure one can be hypothesized.

...Google and other companies have definitely proved that one can harness huge clusters of cheap hardware. It can't be _that_ hard, can it? :)
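Idea #1 above, a local relation backed by a SELECT that runs on another machine, might look something like the following minimal sketch. Every name here is hypothetical; `execute` stands in for whatever remote connection layer (dblink or otherwise) would really carry the query.

```python
class RemoteTable:
    """A relation whose tuples live on another server.

    Scanning it composes a SELECT, pushes any restriction down so only
    matching tuples cross the network, and yields the resulting rows
    into the local execution.
    """

    def __init__(self, execute, table):
        self.execute = execute   # callable: sql string -> iterable of rows
        self.table = table

    def scan(self, where=None):
        sql = f"SELECT * FROM {self.table}"
        if where:
            sql += f" WHERE {where}"
        return self.execute(sql)
```

The hard part the sketch glosses over is exactly what the thread discusses: joining the returned tuples against local tables, which the executor would have to do itself.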
On 10 May 2005, at 15:41, John A Meinel wrote:
> Switch from Dell Xeon boxes, and go to Opterons. :) Seriously, Dell is
> far away from Big Iron. I don't know what performance you are looking
> for, but you can easily get into inserting 10M rows/day with quality
> hardware.

Better hardware = more efficient != more scalable.

<snip>

> Just realize that clusters don't necessarily scale like you would want
> them to. Because at some point you have to insert into the same table,
> which means you need to hold a lock which prevents the other machine
> from doing anything. And with synchronous replication, you have to wait
> for all of the machines to get a copy of the data before you can say it
> has been committed, which does *not* scale well with the number of
> machines.

This is why I mention partitioning. It solves this issue by storing different data sets on different machines under the same schema. These separate chunks of the table can then be replicated as well, for data redundancy and so on. MySQL are working on these things, but PG just has a bunch of third-party extensions; I wonder why these are not being integrated into the main trunk :/

Thanks for pointing me to PGCluster though. It looks like it should be better than Slony at least.
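Storing different data sets on different machines under one schema needs, at minimum, a stable mapping from the partitioning key to a host. A minimal sketch of that mapping follows; the hostnames are placeholders, and a stable checksum is used instead of an in-process hash so every client computes the same answer.

```python
import zlib

# Hypothetical partition servers; the real list would come from config.
SERVERS = ["db0.example.com", "db1.example.com", "db2.example.com"]

def server_for(client_id: int) -> str:
    """Map a partitioning key to the machine holding that chunk.

    crc32 is deterministic across processes and restarts, unlike
    Python's built-in hash(), so all clients agree on the placement.
    """
    h = zlib.crc32(str(client_id).encode())
    return SERVERS[h % len(SERVERS)]
```

Modulo hashing like this does force a large reshuffle whenever a server is added, which is one reason real systems tend toward range partitioning or consistent hashing instead.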
On 10 May 2005, at 16:02, Adam Haberlach wrote:
> I think that perhaps he was trying to avoid having to buy "Big
> Iron" at all.

You would be right. Although we are not against paying a bit more than $300 for a server ;)

> Yes, clustering solutions can distribute the data, and can even do
> it on a per-table basis in some cases. This still leaves it up to the
> application's logic to handle reunification of the data.

If you're going to be programming that sort of logic into your API at the beginning, it's not too much more work to add basic replication, load balancing and partitioning into it either. But the DB should be able to do it for you; adding that stuff in later is often more difficult and less likely to get done.

> 1. Create a table/storage type that consists of a select statement
> on another machine. [...] I believe MySQL has this, and Oracle may
> implement it in another way.

MySQL sort of has this; it's not as good as Oracle's though. Apparently there is a much better version of it in 5.1, which should make it to stable sometime next year, I imagine.

> 2. There is no #2 at this time, but I'm sure one can be
> hypothesized.

I would have thought a particularly smart version of pg_pool could do it. It could partition data to different servers if it knew which columns to key by on each table.

> ...Google and other companies have definitely proved that one can
> harness huge clusters of cheap hardware. It can't be _that_ hard, can it. :)

I shudder to think how much the "Big Iron" equivalent of a Google data-center would cost.
pgsql-performance-owner@postgresql.org wrote on 05/10/2005 10:02:50 AM:
> exploring the option of buying 10 cheapass
> machines for $300 each. At the moment, that $300 buys you, from Dell, a
> 2.5GHz Pentium 4 [...]

Buy cheaper-ass Dells with an AMD 64 3000+. Beats the crap out of the 2.5 GHz Pentium, especially for PostgreSQL. See the thread "Whence the Opterons" for more....

Rick
Adam Haberlach wrote:
> Yes, clustering solutions can distribute the data, and can even do it on a
> per-table basis in some cases. This still leaves it up to the application's
> logic to handle reunification of the data.

Sure. A lot of this is application dependent, though. For instance, foreign key constraints: in a general cluster solution, you would allow foreign keys across partitions. I have a feeling this would be extra slow, and hard to do correctly. Multi-machine transactions are also a difficulty, since WAL now has to take all machines into account, and you have to wait for fsync on all of them. I'm not sure how Oracle does it, but these things seem like they prevent clustering from really scaling very well.

> ...Google and other companies have definitely proved that one can harness
> huge clusters of cheap hardware. It can't be _that_ hard, can it. :)

Again, it depends on the application. A generic database with lots of cross-reference integrity checking does not work on a cluster very well. A very distributed db where you don't worry about cross references does scale. Google made a point of making their application work in a distributed manner.

In the other post he mentions that pg_pool could naturally split out the rows onto different machines based on partitioning, etc. I would argue that it is more of a custom pool daemon based on the overall application, because you have to start dealing with things like cross-machine joins. Who handles that? The pool daemon has to, since it is the only thing that talks to both tables. I think you could certainly write a reasonably simple application-specific daemon that all of the clients send their queries to, which figures out where they need to go and aggregates them as necessary. But a fully generic one is *not* simple at all, and I think it is far out of the scope of something like pg_pool. I'm guessing that PGCluster is looking at working on that, and it might be true that pg_pool is thinking about it. But just think about the very simple query:

SELECT row1, row2
  FROM table1_on_machine_a
  NATURAL JOIN table2_on_machine_b
 WHERE restrict_table_1 AND restrict_table_2
   AND restrict_1_based_on_2;

This needs to be broken into something like:

SELECT row1 FROM table1_on_machine_a
 WHERE restrict_table_1
 ORDER BY join_column;

SELECT row2 FROM table2_on_machine_b
 WHERE restrict_table_2
 ORDER BY join_column;

Then these rows need to be merge-joined, and the restrict_1_based_on_2 needs to be applied. This is in no way trivial, and I think it is outside the scope of pg_pool.

Now maybe if you restrict yourself so that each query stays within one machine, you can make it work. Or write your own app to handle some of this transparently for the clients. But I would expect that to make the problem feasible, it would not be a generic solution.

Maybe I'm off base; I don't really keep track of pg_pool/PGCluster/etc. But I can see that the problem is very difficult. And note that this is a simple query, with no optimization considered. You might actually prefer the above to be done in a nestloop style, where table_1 is selected and then, for each row, you do a single index select on table_2. But how is your app going to know that? It has to have the statistics from the backend databases, and if it has to place an extra query to get those statistics, you hurt your scalability even more. Whereas big iron already has all the statistics and can optimize the query plan.

Perhaps PGCluster will handle this, by maintaining full statistics across the cluster on each machine, so that more optimal queries can be performed. I don't really know.

John =:->
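The merge-join step described above, combining two partial results that each arrive sorted on the join column, is itself simple enough to sketch. This toy version assumes the join key is unique on each side and leaves restrict_1_based_on_2 to be applied as a filter afterwards.

```python
def merge_join(left, right, key=lambda row: row[0]):
    """Merge-join two row streams that are each sorted on the join
    column -- the step a pool daemon would have to perform itself
    after fetching the two ORDER BY'd partial results."""
    left, right = iter(left), iter(right)
    l = next(left, None)
    r = next(right, None)
    out = []
    while l is not None and r is not None:
        if key(l) < key(r):
            l = next(left, None)
        elif key(l) > key(r):
            r = next(right, None)
        else:
            # Matching keys: emit the joined row (drop the duplicated
            # join column from the right side, as NATURAL JOIN would).
            out.append(l + r[1:])
            # Simplification: assumes unique keys, so advance both.
            l = next(left, None)
            r = next(right, None)
    return out
```

The code is short, but the surrounding problem is not: choosing merge join versus a nestloop with remote index probes is exactly the planner decision that needs backend statistics, which is the scalability trap pointed out above.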
> SELECT row1, row2 FROM table1_on_machine_a NATURAL JOIN
> table2_on_machine_b
> WHERE restrict_table_1 AND restrict_table_2
> AND restrict_1_based_on_2;

I don't think that's ever going to be efficient...

What would be efficient would be, for instance, a join of part of a table against part of another table, where both parts happen to be on the same machine because the partitioning was done with this in mind (e.g. partitioning on client_id and keeping the information for each client on the same machine).

You could build your smart pool daemon in pl/pgsql and use dblink! At least you have the query parser built in.

I wonder how Oracle does it ;)
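The client_id scheme PFC sketches needs nothing more than a deterministic routing function in front of the databases; a toy version, with hypothetical host names and a hard-coded partition count:

```python
# Hypothetical machine list; a real setup would read this from
# configuration rather than hard-coding it.
HOSTS = ["db-a.example.com", "db-b.example.com", "db-c.example.com"]

def host_for_client(client_id):
    """Send every query for a given client to the same machine, so a
    join restricted to that client's data never crosses machines."""
    return HOSTS[client_id % len(HOSTS)]
```

A real router would also have to handle rebalancing when the host list changes, which is where the hard part actually lives.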
Alex,

> This is why I mention partitioning. It solves this issue by storing
> different data sets on different machines under the same schema.

That's clustering, actually. Partitioning is simply dividing up a table into chunks and using the chunks intelligently. Putting those chunks on separate machines is another thing entirely.

We're working on partitioning through the Bizgres sub-project: www.bizgres.org / http://pgfoundry.org/projects/bizgres/ ... and will be pushing it to the main PostgreSQL when we have something. I invite you to join the mailing list.

> These seperate chunks of the table can then be replicated as well for
> data redundancy and so on. MySQL are working on these things,

Don't hold your breath. MySQL, to judge by their first "clustering" implementation, has a *long* way to go before they have anything usable. In fact, at OSCON their engineers were asking Jan Wieck for advice.

If you have $$$ to shell out, my employer (GreenPlum) has a multi-machine distributed version of PostgreSQL. It's proprietary, though. www.greenplum.com.

If you have more time than money, I understand that Stanford is working on this problem: http://www-db.stanford.edu/~bawa/

But, overall, some people on this list are very mistaken in thinking it's an easy problem. GP has devoted something like 5 engineers for 3 years to develop their system. Oracle spent over $100 million to develop RAC.

> but PG
> just has a bunch of third party extensions, I wonder why these are
> not being integrated into the main trunk :/

Because it represents a host of complex functionality which is not applicable to most users? Because there are 4 types of replication and 3 kinds of clustering, and not all users want the same kind?

--
Josh Berkus
Aglio Database Solutions
San Francisco
Quoting Richard_D_Levine@raytheon.com: > > exploring the option of buying 10 cheapass > > machines for $300 each. At the moment, that $300 buys you, from > Dell, a > > 2.5Ghz Pentium 4 > > Buy cheaper ass Dells with an AMD 64 3000+. Beats the crap out of > the 2.5 > GHz Pentium, especially for PostgreSQL. Whence "Dells with an AMD 64" ?? Perhaps you skimmed: http://www.thestreet.com/tech/kcswanson/10150604.html or http://www.eweek.com/article2/0,1759,1553822,00.asp
On Tue, May 10, 2005 at 07:29:59PM +0200, PFC wrote:
> I wonder how Oracle does it ;)

Oracle *clustering* demands shared storage. So you've shifted your money from big-iron CPUs to big-iron disk arrays. Oracle replication works similarly to Slony, though it supports a lot more modes (i.e. synchronous).

--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Quoting Alex Stapleton <alexs@advfn.com>:

> This is why I mention partitioning. It solves this issue by storing
> different data sets on different machines under the same schema.
> These seperate chunks of the table can then be replicated as well for
> data redundancy and so on. MySQL are working on these things, but PG
> just has a bunch of third party extensions, I wonder why these are
> not being integrated into the main trunk :/ Thanks for pointing me to
> PGCluster though. It looks like it should be better than Slony at
> least.

Across a decade or two of projects, including creating a federated database engine for Simba, I've become rather dubious of horizontal partitions (across disks or servers), either to improve performance, or just to scale up and not lose performance. (The one exception is for *non-time-critical, read-only* systems, with Slony-style replication.)

The most successful high-volume systems I've seen have broken up databases functionally, like a pipeline, where different applications use different sections of the pipe.

The highest-volume system I've worked on is Acxiom's gigantic data-cleansing system. This is the central clearinghouse for every scrap of demographic data that can be associated with some North American, somewhere. Think of D&B for 300M people (some dead). The volumes are just beyond belief, for both updates and queries. At Acxiom, the datasets are so large, even after partitioning, that they just constantly cycle them through memory, and commands are executed in convoys --- sort of like riding a paternoster.

..........

Anybody been tracking what Mr Stonebraker's been up to lately? Datastream management. Check it out. Like replication, everybody hand-rolled their own datastream systems until finally somebody else generalized it well enough that it didn't have to be built from scratch every time. Datastream systems require practically no locking, let alone distributed transactions.
They give you some really strong guarantees on transaction elapsed time and throughput.

.......

Where is this all leading? Well, for scaling data like this, the one feature that you need is the ability of procedures/rules on one server to perform queries/procedures on another. MSSQL has linked servers and (blech) OpenQuery. This lets you do reasonably efficient work when you only deal with one table at a time. Do NOT try anything fancy with multi-table joins; timeouts are unavoidable, and painful.

Postgres has a natural advantage in such a distributed server system: all table/index stats are openly available through the SQL interface, for one server to make rational query plans involving another server's resources. God! I would have killed for that when I was writing a federated SQL engine; the kluges you need to do this at arm's length from that information are true pain.

So where should I go look, to see what's been done so far, on a Postgres that can treat another PG server as a new table type?
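That openness of the stats is the crux: with remote row counts in hand, even a crude cost model can choose between the nested-loop and merge strategies discussed earlier in the thread. A toy comparison (the cost formulas here are invented for illustration and are nothing like Postgres's real costing):

```python
import math

def choose_join(outer_rows, inner_rows, inner_has_index):
    """Pick a join strategy from row-count statistics alone -- the kind
    of decision a federated front end could only make if the remote
    servers expose their stats."""
    # Nested loop: one probe of the inner table per outer row
    # (an index probe if there is one, a full scan if not).
    probe = math.log2(inner_rows + 2) if inner_has_index else inner_rows
    nestloop_cost = outer_rows * probe
    # Merge join: sort both sides, then a single linear pass.
    merge_cost = (outer_rows * math.log2(outer_rows + 2)
                  + inner_rows * math.log2(inner_rows + 2)
                  + outer_rows + inner_rows)
    return "nestloop" if nestloop_cost < merge_cost else "mergejoin"
```

With a tiny outer side and an indexed inner table this picks the nested loop; with two large unindexed tables it picks the merge join. Without access to the remote statistics, the front end is just guessing.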
On Tue, May 10, 2005 at 02:55:55PM -0700, Mischa Sandberg wrote:
> just beyond belief, for both updates and queries. At Acxiom, the
> datasets are so large, even after partitioning, that they just
> constantly cycle them through memory, and commands are executed in
> convoys --- sort of like riding a paternoster.

Speaking of which... what's the status of the patch that allows seqscans to piggyback on already-running seqscans on the same table?

> So where should I go look, to see what's been done so far, on a Postgres
> that can treat another PG server as a new table type?

To the best of my knowledge no such work has been done. There is a project (whose name escapes me) that lets you run queries against a remote PostgreSQL server from a PostgreSQL connection to a different server, which could serve as the basis for what you're proposing.

BTW, given your experience, you might want to check out Bizgres (http://pgfoundry.org/projects/bizgres/). I'm sure your insights would be most welcome.

--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
> This is why I mention partitioning. It solves this issue by storing > different data sets on different machines under the same schema. These > seperate chunks of the table can then be replicated as well for data > redundancy and so on. MySQL are working on these things *laff* Yeah, like they've been working on views for the last 5 years, and still haven't released them :D :D :D Chris
Quoting Christopher Kings-Lynne <chriskl@familyhealth.com.au>: > > This is why I mention partitioning. It solves this issue by storing > > different data sets on different machines under the same schema. > > These seperate chunks of the table can then be replicated as well for > > data redundancy and so on. MySQL are working on these things > *laff* > Yeah, like they've been working on views for the last 5 years, and > still haven't released them :D :D :D ? http://dev.mysql.com/doc/mysql/en/create-view.html ...for MySQL 5.0.1+ ?
Mischa Sandberg wrote: > Quoting Christopher Kings-Lynne <chriskl@familyhealth.com.au>: > > >>>This is why I mention partitioning. It solves this issue by storing >>>different data sets on different machines under the same schema. >>>These seperate chunks of the table can then be replicated as well for >>>data redundancy and so on. MySQL are working on these things >> >>*laff* >>Yeah, like they've been working on views for the last 5 years, and >>still haven't released them :D :D :D > > > ? > http://dev.mysql.com/doc/mysql/en/create-view.html > ...for MySQL 5.0.1+ ? Yes but MySQL 5 isn't out yet (considered stable). Sincerely, Joshua D. Drake > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org
>>*laff* >>Yeah, like they've been working on views for the last 5 years, and >>still haven't released them :D :D :D > > > ? > http://dev.mysql.com/doc/mysql/en/create-view.html > ...for MySQL 5.0.1+ ? Give me a call when it's RELEASED. Chris
Quoting Christopher Kings-Lynne <chriskl@familyhealth.com.au>: > > >>*laff* > >>Yeah, like they've been working on views for the last 5 years, and > >>still haven't released them :D :D :D > > > > ? > > http://dev.mysql.com/doc/mysql/en/create-view.html > > ...for MySQL 5.0.1+ ? > > Give me a call when it's RELEASED. :-) Touche'
Quoting "Jim C. Nasby" <decibel@decibel.org>:

> To the best of my knowledge no such work has been done. There is a
> project (whose name escapes me) that lets you run queries against a
> remote postgresql server from a postgresql connection to a different
> server, which could serve as the basis for what you're proposing.

Okay, if the following looks right to the powers that be, I'd like to start a project. Here's the proposition:

"servername.dbname.schema.object" would change RangeVar, which would affect much code. "dbname.schema.object" itself is not implemented in 8.0. So, simplicity dictates something like:

table pg_remote(schemaname text, connectby text, remoteschema text)

The pg_statistic info from a remote server cannot be cached in local pg_statistic without inventing pseudo-reloids as well as a pseudo-schema. Probably cleaner to cache it somewhere else. I'm still reading down the path that puts pg_statistic data where costsize can get at it.

First step: find out whether one can link libpq.so to postmaster :-)
Josh Berkus wrote: > Don't hold your breath. MySQL, to judge by their first "clustering" > implementation, has a *long* way to go before they have anything usable. Oh? What's wrong with MySQL's clustering implementation? -Neil
Neil Conway wrote:
> Josh Berkus wrote:
>
>> Don't hold your breath. MySQL, to judge by their first "clustering"
>> implementation, has a *long* way to go before they have anything usable.
>
> Oh? What's wrong with MySQL's clustering implementation?

RAM-only tables :)
On Tue, May 10, 2005 at 08:02:50 -0700, Adam Haberlach <adam@mediariffic.com> wrote: > > > With all the Opteron v. Xeon around here, and talk of $30,000 machines, > perhaps it would be worth exploring the option of buying 10 cheapass > machines for $300 each. At the moment, that $300 buys you, from Dell, a > 2.5Ghz Pentium 4 w/ 256mb of RAM and a 40Gb hard drive and gigabit ethernet. > The aggregate CPU and bandwidth is pretty stupendous, but not as easy to > harness as a single machine. That isn't going to be ECC ram. I don't think you really want to use non-ECC ram in a critical database.
Joshua D. Drake wrote: > Neil Conway wrote: >> Oh? What's wrong with MySQL's clustering implementation? > > Ram only tables :) Sure, but that hardly makes it not "usable". Considering the price of RAM these days, having enough RAM to hold the database (distributed over the entire cluster) is perfectly acceptable for quite a few people. (Another deficiency is in 4.0, predicates in queries would not be pushed down to storage nodes -- so you had to stream the *entire* table over the network, and then apply the WHERE clause at the frontend query node. That is fixed in 5.0, though.) -Neil
Neil, > Sure, but that hardly makes it not "usable". Considering the price of > RAM these days, having enough RAM to hold the database (distributed over > the entire cluster) is perfectly acceptable for quite a few people. The other problem, as I was told it at OSCON, was that these were not high-availability clusters; it's impossible to add a server to an existing cluster, and a server going down is liable to take the whole cluster down. Mind you, I've not tried that aspect of it myself; once I saw the ram-only rule, we switched to something else. -- Josh Berkus Aglio Database Solutions San Francisco
On Tue, 2005-05-10 at 11:03 +0100, Alex Stapleton wrote: > So, when/is PG meant to be getting a decent partitioning system? ISTM that your question seems to confuse where code comes from. Without meaning to pick on you, or reply rudely, I'd like to explore that question. Perhaps it should be a FAQ entry. All code is written by someone, and those people need to eat. Some people are fully or partly funded to perform their tasks on this project (coding, patching, etc). Others contribute their time for a variety of reasons where involvement has a positive benefit. You should ask these questions: - Is anyone currently working on (Feature X)? - If not, Can I do it myself? - If not, and I still want it, can I fund someone else to build it for me? Asking "when is Feature X going to happen" is almost certainly going to get the answer "never" otherwise, if the initial development is large and complex. There are many TODO items that have lain untouched for years, even though adding the feature has been discussed and agreed. Best Regards, Simon Riggs
Josh Berkus wrote: > The other problem, as I was told it at OSCON, was that these were not > high-availability clusters; it's impossible to add a server to an existing > cluster Yeah, that's a pretty significant problem. > a server going down is liable to take the whole cluster down. That's news to me. Do you have more information on this? -Neil
For an interesting look at scalability, clustering, caching, etc for a large site, have a look at how livejournal did it: http://www.danga.com/words/2004_lisa/lisa04.pdf

They have 2.6 million active users, posting 200 new blog entries per minute, plus many comments and countless page views.

Although this system is of a different sort to the type I work on, it's interesting to see how they've made it scale.

They use mysql on dell hardware! And found single master replication did not scale. There's a section on multimaster replication; not sure if they use it. The main approach they use is to partition users into specific database clusters. Caching is done using memcached at the application level to avoid hitting the db for rendered pageviews.

It's interesting that the solution livejournal have arrived at is quite similar in ways to the way google is set up.

David
On 11 May 2005, at 08:16, Simon Riggs wrote: > On Tue, 2005-05-10 at 11:03 +0100, Alex Stapleton wrote: > >> So, when/is PG meant to be getting a decent partitioning system? >> > > ISTM that your question seems to confuse where code comes from. > Without > meaning to pick on you, or reply rudely, I'd like to explore that > question. Perhaps it should be a FAQ entry. > > All code is written by someone, and those people need to eat. Some > people are fully or partly funded to perform their tasks on this > project > (coding, patching, etc). Others contribute their time for a variety of > reasons where involvement has a positive benefit. > > You should ask these questions: > - Is anyone currently working on (Feature X)? > - If not, Can I do it myself? > - If not, and I still want it, can I fund someone else to build it for > me? > > Asking "when is Feature X going to happen" is almost certainly > going to > get the answer "never" otherwise, if the initial development is large > and complex. There are many TODO items that have lain untouched for > years, even though adding the feature has been discussed and agreed. > > Best Regards, Simon Riggs > Acceptable Answers to 'So, when/is PG meant to be getting a decent partitioning system?': 1. Person X is working on it I believe. 2. It's on the list, but nobody has done anything about it yet 3. Your welcome to take a stab at it, I expect the community would support your efforts as well. 4. If you have a huge pile of money you could probably buy the Moon. Thinking along those lines, you can probably pay someone to write it for you. 5. It's a stupid idea, and it's never going to work, and heres why.............. Unacceptable Answers to the same question: 1. Yours. Be more helpful, and less arrogant please. Everyone else who has contributed to this thread has been very helpful in clarifying the state of affairs and pointing out what work is and isn't being done, and alternatives to just waiting for PG do it for you. 
On 11 May 2005, at 08:57, David Roussel wrote:

> For an interesting look at scalability, clustering, caching, etc for a
> large site have a look at how livejournal did it.
> http://www.danga.com/words/2004_lisa/lisa04.pdf

I have implemented similar systems in the past; it's a pretty good technique. Unfortunately it's not very "Plug-and-Play", as you have to base most of your API on memcached (I imagine MySQL's NDB tables might work as well actually) for it to work well.

> They have 2.6 Million active users, posting 200 new blog entries per
> minute, plus many comments and countless page views.
>
> Although this system is of a different sort to the type I work on it's
> interesting to see how they've made it scale.
>
> They use mysql on dell hardware! And found single master replication did
> not scale. There's a section on multimaster replication, not sure if
> they use it. The main approach they use is to parition users into
> spefic database clusters. Caching is done using memcached at the
> application level to avoid hitting the db for rendered pageviews

I don't think they are storing pre-rendered pages (or bits of them) in memcached, but are principally storing the data for the pages in it. Gluing pages together is not a hugely intensive process usually :)

The only problem with memcached is that the client's clustering/partitioning system will probably break if a node dies, and probably get confused if you add new nodes onto it as well. Easily extensible clustering (no complete redistribution of data required when you add/remove nodes) with the data distributed across nodes seems to be nothing but a pipe dream right now.

> It's interesting that the solution livejournal have arrived at is quite
> similar in ways to the way google is set up.

Don't Google use indexing servers which keep track of where data is?
So that you only need to update them when you add or move data; deletes don't even have to be propagated among indexes immediately, really, because you'll find out if data isn't there when you visit where it should be. Or am I talking crap?
On 11 May 2005, at 09:50, Alex Stapleton wrote:

> Don't Google use indexing servers which keep track of where data is? So
> that you only need to update them when you add or move data, deletes
> don't even have to be propagated among indexes immediately really
> because you'll find out if data isn't there when you visit where it
> should be. Or am I talking crap?

That will teach me to RTFA first ;) OK, so LJ maintain an index of which cluster each user is on, kind of like Google do :)
> Acceptable Answers to 'So, when/is PG meant to be getting a decent
> partitioning system?':
>
> 1. Person X is working on it I believe.
> 2. It's on the list, but nobody has done anything about it yet
> 3. Your welcome to take a stab at it, I expect the community would
> support your efforts as well.
> 4. If you have a huge pile of money you could probably buy the
> Moon. Thinking along those lines, you can probably pay someone to write
> it for you.
> 5. It's a stupid idea, and it's never going to work, and heres
> why..............
>
> Unacceptable Answers to the same question:
>
> 1. Yours.
>
> Be more helpful, and less arrogant please. Everyone else who has
> contributed to this thread has been very helpful in clarifying the
> state of affairs and pointing out what work is and isn't being done,
> and alternatives to just waiting for PG do it for you.

Please YOU be more helpful and less arrogant. I thought your initial email was arrogant, demanding and insulting. Your followup email has done nothing to dispel my impression.

Simon (one of PostgreSQL's major contributors AND one of the very few people working on partitioning in PostgreSQL, as you requested) told you all the reasons clearly and politely.

Chris
Mischa Sandberg <mischa.sandberg@telus.net> writes: > So, simplicity dictates something like: > table pg_remote(schemaname text, connectby text, remoteschema text) Previous discussion of this sort of thing concluded that we wanted to follow the SQL-MED standard. regards, tom lane
David,

> It's interesting that the solution livejournal have arrived at is quite
> similar in ways to the way google is set up.

Yes, although again, they're using memcached as pseudo-clustering software, and as a result are limited to what fits in RAM (RAM on 27 machines, but it's still RAM). And due to limitations in memcached, the whole thing blows whenever a server goes out (the memcached project is working on this). But any LJ user could tell you that it's a low-availability system.

However, memcached (and for us, pg_memcached) is an excellent way to improve horizontal scalability by taking disposable data (like session information) out of the database and putting it in protected RAM. On some websites, adding memcached can result in as much as a 60% decrease in database traffic.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
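The cache-aside pattern Josh describes is simple enough to sketch. Here a plain dict stands in for a memcached client, and `load_session_from_db` is a hypothetical database fallback; the point is that the database is only touched on a miss:

```python
cache = {}  # stand-in for a memcached client

def load_session_from_db(session_id):
    """Hypothetical database fallback -- only reached on a cache miss."""
    return {"id": session_id, "from": "db"}

def get_session(session_id):
    """Cache-aside lookup: serve disposable session data from RAM,
    touching the database only when the cache has nothing."""
    hit = cache.get(session_id)
    if hit is not None:
        return hit
    data = load_session_from_db(session_id)
    cache[session_id] = data  # later requests never reach the db
    return data
```

Every request after the first is served entirely from RAM, which is where the claimed reduction in database traffic comes from.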
Alex Stapleton <alexs@advfn.com> writes: > Acceptable Answers to 'So, when/is PG meant to be getting a decent > partitioning system?': ... > 3. Your welcome to take a stab at it, I expect the community would > support your efforts as well. As long as we're being curt all around, this one's not acceptable on the basis that it's not grammatical. -- greg
On Wed, 2005-05-11 at 17:13 +0800, Christopher Kings-Lynne wrote: > > Alex Stapleton wrote > > Be more helpful, and less arrogant please. > > Simon told you all the reasons clearly and politely. Thanks Chris for your comments. PostgreSQL can always do with one more developer and my sole intent was to encourage Alex and other readers to act themselves. If my words seem arrogant, then I apologise to any and all that think so. Best Regards, Simon Riggs
On Wed, May 11, 2005 at 08:57:57AM +0100, David Roussel wrote: > For an interesting look at scalability, clustering, caching, etc for a > large site have a look at how livejournal did it. > http://www.danga.com/words/2004_lisa/lisa04.pdf > > They have 2.6 Million active users, posting 200 new blog entries per > minute, plus many comments and countless page views. Neither of which is that horribly impressive. 200 TPM is less than 4TPS. While I haven't run high transaction rate databases under PostgreSQL, I suspect others who have will say that 4TPS isn't that big of a deal. > Although this system is of a different sort to the type I work on it's > interesting to see how they've made it scale. > > They use mysql on dell hardware! And found single master replication did > not scale. There's a section on multimaster replication, not sure if Probably didn't scale because they used to use MyISAM. > they use it. The main approach they use is to parition users into > spefic database clusters. Caching is done using memcached at the Which means they've got a huge amount of additional code complexity, not to mention how many times you can't post something because 'that cluster is down for maintenance'. > application level to avoid hitting the db for rendered pageviews. Memcached is about the only good thing I've seen come out of livejournal. > It's interesting that the solution livejournal have arrived at is quite > similar in ways to the way google is set up. Except that unlike LJ, google stays up and it's fast. Though granted, LJ is quite a bit faster than it was 6 months ago. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
> However, memcached (and for us, pg_memcached) is an excellent way to > improve > horizontal scalability by taking disposable data (like session > information) > out of the database and putting it in protected RAM. So, what is the advantage of such a system versus, say, a "sticky sessions" system where each session is assigned to ONE application server (not PHP then) which keeps it in RAM as native objects instead of serializing and deserializing it on each request ? I'd say the sticky sessions should perform a lot better, and if one machine dies, only the sessions on this one are lost. But of course you can't do it with PHP as you need an app server which can manage sessions. Potentially the savings are huge, though. On Google, their distributed system spans a huge number of PCs and it has redundancy, ie. individual PC failure is a normal thing and is a part of the system, it is handled gracefully. I read a paper on this matter, it's pretty impressive. The google filesystem has nothing to do with databases though, it's more a massive data store / streaming storage.
On 11 May 2005, at 23:35, PFC wrote:

>> However, memcached (and for us, pg_memcached) is an excellent way
>> to improve horizontal scalability by taking disposable data (like
>> session information) out of the database and putting it in
>> protected RAM.
>
> So, what is the advantage of such a system versus, say, a "sticky
> sessions" system where each session is assigned to ONE application
> server (not PHP then) which keeps it in RAM as native objects instead
> of serializing and deserializing it on each request ?
> I'd say the sticky sessions should perform a lot better, and if one
> machine dies, only the sessions on this one are lost.
> But of course you can't do it with PHP as you need an app server
> which can manage sessions. Potentially the savings are huge, though.

There's no reason it couldn't be done with PHP, to be fair, as long as you could ensure that the client was always routed back to the same machines. Which has its own set of issues entirely. I am not entirely sure that memcached actually does serialize data when it's committed into memcached either, although I could be wrong; I have not looked at the source.

Certainly if you can ensure that a client always goes back to the same machine you can simplify the whole thing hugely. It's generally not that easy though; you need a proxy server of some description capable of understanding the HTTP traffic and maintaining a central session lookup table to redirect with. Which isn't really solving the problem so much as moving it somewhere else. Instead of needing huge memcached pools, you need hardcore load balancers. Load balancers tend to cost $$$$$ in comparison.

Distributed sticky sessions are a rather nice idea; I would like to hear a way of implementing them cheaply (and on PHP) as well. I may have to give that some thought in fact. Oh yeah, and load balancer software often sucks in annoying (if not always important) ways.
> On Google, their distributed system spans a huge number of PCs > and it has redundancy, ie. individual PC failure is a normal thing > and is a part of the system, it is handled gracefully. I read a > paper on this matter, it's pretty impressive. The google filesystem > has nothing to do with databases though, it's more a massive data > store / streaming storage. > Since when did Massive Data stores have nothing to do with DBs? Isn't Oracle Cluster entirely based on forming an enormous scalable disk array to store your DB on?
> machines. Which has its own set of issues entirely. I am not entirely
> sure that memcached actually does serialize data when it's committed into

I think it does, i.e. it's a simple mapping of [string key] => [string value].

> memcached either, although I could be wrong, I have not looked at the
> source. Certainly if you can ensure that a client always goes back to
> the same machine you can simplify the whole thing hugely. It's generally
> not that easy though, you need a proxy server of some description
> capable of understanding the HTTP traffic and maintaining a central

Yes... You could implement it by mapping servers to the hash of the user session id. Statistically, the servers would get the same number of sessions on each of them, but you have to trust statistics... It does eliminate the lookup table though.

> idea, I would like to hear a way of implementing them cheaply (and on
> PHP) as well. I may have to give that some thought in fact. Oh yeah, and
> load balancer software often sucks in annoying (if not always
> important) ways.

You can use lighttpd as a load balancer; I believe it has a sticky sessions plugin (or you could code one in, it's open source after all). It definitely supports simple round-robin load balancing, acting as a proxy to any number of independent servers.

>> matter, it's pretty impressive. The google filesystem has nothing to do
>> with databases though, it's more a massive data store / streaming
>> storage.
>
> Since when did Massive Data stores have nothing to do with DBs? Isn't
> Oracle Cluster entirely based on forming an enormous scalable disk array
> to store your DB on?

Um, well, the Google Filesystem is (like its name implies) a filesystem designed to store huge files in a distributed and redundant manner. Files are structured as a stream of records (which are themselves big in size), and it's designed to support appending records to these stream files efficiently and without worrying about locking.
It has no querying features, however, which is why I said it was not a database. I wish I could find the whitepaper; I think the URL was posted to this list at some point. Maybe it's on Google's site?
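The hash-on-session-id routing suggested a couple of messages up (no central lookup table, just trust statistics) can be sketched in a few lines. A minimal sketch; the server names are placeholders, not anything from the thread:

```python
import hashlib

# Hypothetical app-server pool; the hostnames are placeholders.
SERVERS = ["app1.example.com", "app2.example.com", "app3.example.com"]

def pick_server(session_id, servers=SERVERS):
    """Route a session to one app server by hashing its id.

    The same id always maps to the same server, so no lookup table
    is needed, and a decent hash spreads sessions evenly across the
    pool -- statistically, at least."""
    h = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
    return servers[h % len(servers)]
```

One caveat of the simple modulo scheme: adding or removing a server remaps most existing sessions. Consistent hashing avoids that, at the cost of more bookkeeping.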
Having local sessions is unnecessary, and here is my logic:

Generally most people have less than 100Mb of bandwidth to the internet.

If you make the assertion that you are transferring equal or less session data between your session server (let's say an RDBMS) and the app server than between the app server and the client, an out-of-band 100Mb network for session information is plenty of bandwidth. This also represents OLTP-style traffic, which PostgreSQL is pretty good at. You should easily be able to get over 100 TPS. 100 hits per second is an awful lot of traffic, more than any website I've managed will ever see.

Why solve the complicated clustered sessions problem when you don't really need to?

Alex Turner
netEconomist

On 5/11/05, PFC <lists@boutiquenumerique.com> wrote:
>
>> However, memcached (and for us, pg_memcached) is an excellent way to
>> improve horizontal scalability by taking disposable data (like session
>> information) out of the database and putting it in protected RAM.
>
> So, what is the advantage of such a system versus, say, a "sticky
> sessions" system where each session is assigned to ONE application server
> (not PHP then) which keeps it in RAM as native objects instead of
> serializing and deserializing it on each request ?
> I'd say the sticky sessions should perform a lot better, and if one
> machine dies, only the sessions on this one are lost.
> But of course you can't do it with PHP as you need an app server which
> can manage sessions. Potentially the savings are huge, though.
>
> On Google, their distributed system spans a huge number of PCs and it has
> redundancy, ie. individual PC failure is a normal thing and is a part of
> the system, it is handled gracefully. I read a paper on this matter, it's
> pretty impressive. The google filesystem has nothing to do with databases
> though, it's more a massive data store / streaming storage.
> > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >
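Alex's bandwidth claim is easy to sanity-check with a back-of-envelope calculation. A sketch under assumed numbers (the per-hit session size is a guess, not a figure from the thread):

```python
# Does shipping session state over a dedicated 100 Mb/s link keep up
# with the claimed load?
HITS_PER_SEC = 100
SESSION_BYTES = 8 * 1024               # assume ~8 KB of session state per hit
LINK_BYTES_PER_SEC = 100_000_000 // 8  # 100 Mb/s is 12.5 MB/s

needed = HITS_PER_SEC * SESSION_BYTES   # bytes/sec of session traffic (~800 KB/s)
headroom = LINK_BYTES_PER_SEC / needed  # how many times more capacity than needed
```

Even at 8 KB per hit, 100 hits/sec uses well under a tenth of the link, which supports the "plenty of bandwidth" argument.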
On 12 May 2005, at 15:08, Alex Turner wrote:

> Having local sessions is unnecessary, and here is my logic:
>
> Generally most people have less than 100Mb of bandwidth to the internet.
>
> If you make the assertion that you are transferring equal or less
> session data between your session server (let's say an RDBMS) and the
> app server than between the app server and the client, an out-of-band
> 100Mb network for session information is plenty of bandwidth. This
> also represents OLTP-style traffic, which PostgreSQL is pretty good
> at. You should easily be able to get over 100 TPS. 100 hits per
> second is an awful lot of traffic, more than any website I've managed
> will ever see.
>
> Why solve the complicated clustered sessions problem when you don't
> really need to?

100 hits a second = 8,640,000 hits a day. I work on a site which does more than 100 million dynamic pages a day. In comparison, Yahoo probably does more than 100,000,000,000 (100 billion) views a day, if I am interpreting Alexa's charts correctly. Which is about 1,150,000 a second.

Now considering the site I work on is not even in the top 1000 on Alexa, there's a lot of sites out there which need to solve this problem, I would assume.

There are also only so many hash table lookups a single machine can do, even if it's a Quad Opteron behemoth.

> Alex Turner
> netEconomist
>
> On 5/11/05, PFC <lists@boutiquenumerique.com> wrote:
>> [snip]
Ok - my common sense alarm is going off here...

There are only 6.446 billion people worldwide. 100 billion page views would require every person in the world to view around 16 pages of Yahoo every day. Not very likely.

http://www.internetworldstats.com/stats.htm
suggests that there are around 1 billion people actually on the internet. That means each and every person on the internet has to view 100 pages per day of Yahoo.

Pretty unlikely, IMHO. I for one don't even use Yahoo ;)

100 million page views per day suggests that 1 in 100 people on the internet each viewed 10 pages of a site. That's a pretty high percentage if you ask me.

If I visit 20 web sites in a day, and see an average of 10 pages per site, that means only about 2000 or so sites generate 100 million page views in a day or better.

100 million pageviews averages to 1157/sec, which we'll double for peak load to 2314. I can easily see a system doing 2314 hash lookups per second. Hell, I wrote a system that could do a thousand times that four years ago on a single 1GHz Athlon. Heck - you can get 2314 lookups/sec on a 486 ;)

Given that session information doesn't _have_ to persist to storage and can be kept in RAM, a single server could readily manage session information for even very large sites (of course, over a million concurrent users could really start chewing into RAM, but if you are Yahoo, you can probably afford a box with 100GB of RAM ;).

We get over 1000 TPS on a dual Opteron with a couple of mid-size RAID arrays on 10k discs with fsync on for small transactions. I'm sure that could easily be bettered with a few more dollars.

Maybe my numbers are off, but somehow it doesn't seem like that many people need a highly complex session solution to me.
Alex Turner
netEconomist

On 5/12/05, Alex Stapleton <alexs@advfn.com> wrote:
> [snip]
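The "chewing into RAM" aside is easy to quantify in the same back-of-envelope style. A sketch under assumed numbers (the per-session size is a guess):

```python
# How much RAM does it take to keep every concurrent session resident?
CONCURRENT_SESSIONS = 1_000_000
BYTES_PER_SESSION = 4 * 1024   # assume ~4 KB of state per session

total_bytes = CONCURRENT_SESSIONS * BYTES_PER_SESSION
total_gb = total_bytes / 2**30  # roughly 3.8 GB for a million sessions
```

A few GB for a million concurrent sessions is modest, which backs the point that a single well-specced box can hold session state for very large sites.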
Alex Turner wrote:

> Ok - my common sense alarm is going off here...
>
> There are only 6.446 billion people worldwide. 100 billion page views
> would require every person in the world to view around 16 pages of
> Yahoo every day. Not very likely.
>
> http://www.internetworldstats.com/stats.htm
> suggests that there are around 1 billion people actually on the internet.
>
> That means each and every person on the internet has to view 100 pages
> per day of Yahoo.
>
> Pretty unlikely, IMHO. I for one don't even use Yahoo ;)
>
> 100 million page views per day suggests that 1 in 100 people on the
> internet each viewed 10 pages of a site. That's a pretty high
> percentage if you ask me.

In general I think your point is valid. Just remember that it probably also matters how you count page views. Because technically images are a separate page (and this thread did discuss serving up images), if there are 20 graphics on a specific page, that is 20 server hits just for that one page.

I could easily see an image-heavy site getting 100 hits / page, which starts meaning that if 1M users hit 10 pages, then you get 1M*10*100 = 1G.

I still think 100G views on a single website is a lot, but 100M is certainly possible.

John
=:->
> 100 hits a second = 8,640,000 hits a day. I work on a site which does
> more than 100 million dynamic pages a day. In comparison, Yahoo
> probably does more than 100,000,000,000 (100 billion) views a day, if
> I am interpreting Alexa's charts correctly. Which is about 1,150,000 a
> second.

Read the help on Alexa's site... ;)
People, > In general I think your point is valid. Just remember that it probably > also matters how you count page views. Because technically images are a > separate page (and this thread did discuss serving up images). So if > there are 20 graphics on a specific page, that is 20 server hits just > for that one page. Also, there's bots and screen-scrapers and RSS, web e-mails, and web services and many other things which create hits but are not "people". I'm currently working on clickstream for a site which is nowhere in the top 100, and is getting 3 million real hits a day ... and we know for a fact that at least 1/4 of that is bots. Regardless, the strategy you should be employing for a high traffic site is that if your users hit the database for anything other than direct interaction (like filling out a webform) then you're lost. Use memcached, squid, lighttpd caching, ASP.NET caching, pools, etc. Keep the load off the database except for the stuff that only the database can do. -- Josh Berkus Aglio Database Solutions San Francisco
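The "keep the load off the database" strategy Josh describes is essentially read-through caching. A minimal sketch, using a plain dict where a real deployment would use memcached or squid; the function and key names are made up for illustration:

```python
import time

cache = {}  # stand-in for memcached: a shared {key: (timestamp, value)} map

def cached(key, ttl_seconds, compute):
    """Return a cached value, recomputing (i.e. hitting the database)
    only when the entry is missing or older than ttl_seconds."""
    entry = cache.get(key)
    now = time.time()
    if entry is not None and now - entry[0] < ttl_seconds:
        return entry[1]          # served from RAM, no database hit
    value = compute()            # the only path that touches the database
    cache[key] = (now, value)
    return value
```

The point of the thread stands in the code: only direct interactions (form submissions and the like) need to reach the database; everything disposable gets served from RAM.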
On 12 May 2005, at 18:33, Josh Berkus wrote:

> People,
>
>> In general I think your point is valid. Just remember that it probably
>> also matters how you count page views. Because technically images are a
>> separate page (and this thread did discuss serving up images). So if
>> there are 20 graphics on a specific page, that is 20 server hits just
>> for that one page.
>
> Also, there's bots and screen-scrapers and RSS, web e-mails, and web
> services and many other things which create hits but are not "people".
> I'm currently working on clickstream for a site which is nowhere in
> the top 100, and is getting 3 million real hits a day ... and we know
> for a fact that at least 1/4 of that is bots.

I doubt bots are generally Alexa toolbar enabled.

> Regardless, the strategy you should be employing for a high traffic
> site is that if your users hit the database for anything other than
> direct interaction (like filling out a webform) then you're lost. Use
> memcached, squid, lighttpd caching, ASP.NET caching, pools, etc. Keep
> the load off the database except for the stuff that only the database
> can do.

This is the approach I would take as well. There is no point storing stuff in a DB if you're only doing direct lookups on it and it isn't the sort of data that you care so much about the integrity of.

> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
Ross,

> Memcached is a PG memory store, I gather,

Nope. It's a hyperfast resident-in-memory hash that allows you to stash stuff like user session information and even materialized query set results. Thanks to SeanC, we even have a plugin, pgmemcached.

> but...what is squid, lighttpd? anything directly PG-related?

No. These are all related to making the web server do more. The idea is NOT to hit the database every time you have to serve up a web page, and possibly not to hit the web server either. For example, you can use squid 3 for "reverse" caching in front of your web server, and serve far more page views than you could with Apache alone.

--
Josh Berkus
Aglio Database Solutions
San Francisco
> If you make the assertion that you are transferring equal or less
> session data between your session server (lets say an RDBMS) and the
> app server than you are between the app server and the client, an out
> of band 100Mb network for session information is plenty of bandwidth.

So if you count on a mean page size of 6-8 kbytes gzipped, that will prevent you from caching the N first results of the Big Slow Search Query in a native object in the user session state (say, a list of integers indicating which rows match), so you will have to redo the Big Slow Search Query every time the user clicks on Next Page instead of grabbing a set of cached row id's and doing a fast SELECT WHERE id IN ...

This is the worst case... I'd gzip() the row id's and stuff them in the session; that's always better than blowing up the database with the Big Slow Search Query every time someone does Next Page...

> This also represents OLTP style traffic, which postgresql is pretty
> good at. You should easily be able to get over 100Tps. 100 hits per
> second is an awful lot of traffic, more than any website I've managed
> will ever see.

On the latest anandtech benchmarks, 100 hits per second on a blog/forum software is a big bi-opteron server running dotNET, at 99% load... it's a lot if you count only dynamic page hits.
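The gzipped-row-ids idea above can be sketched like this. A minimal sketch, not the poster's actual code; the `items` table name and 4-byte integer ids are assumptions:

```python
import struct
import zlib

def pack_ids(ids):
    """Compress the full list of matching row ids so it fits in a
    session record, instead of re-running the Big Slow Search Query."""
    return zlib.compress(struct.pack("%di" % len(ids), *ids))

def unpack_ids(blob):
    """Recover the id list from the compressed session blob."""
    raw = zlib.decompress(blob)
    return list(struct.unpack("%di" % (len(raw) // 4), raw))

def page_query(ids, page, per_page=20):
    """Build the fast follow-up query for one page of results.
    Inlining the values is safe only because the ids are integers
    we produced ourselves; anything user-supplied should be passed
    as query parameters instead."""
    chunk = ids[page * per_page:(page + 1) * per_page]
    return "SELECT * FROM items WHERE id IN (%s)" % ",".join(map(str, chunk))
```

Each Next Page click then costs one decompress plus one indexed `IN` lookup, rather than a repeat of the expensive search.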