Thread: PostgreSQL clustering VS MySQL clustering
Dear community,

My company, which I represent here, is a fervent user of PostgreSQL. We have been building all our applications on PostgreSQL for more than 5 years. We usually write classical client/server applications under Linux, with web interfaces (PHP, Perl, C/C++). We also manage public web services with 10-15 million records and up to 8 million page views per month.

Now we are facing a new need for which we have not found a good solution with PostgreSQL. We need to build a sort of directory of millions of records, growing by about 4-8 million per month, usable by many users from the web. To do this, our solution needs to cope with many inserts and many selects (a select is done before each insert, and selects are also issued by web site visitors). We will also need a search engine over those millions of records (140-150 million records right from the start) ... No, it's not Google, but the volume of data stored in the main table is similar.

We have run some tests with the servers we have here, such as a dual Xeon 2.8 GHz with 4 GB of RAM, and the cumulative insert and select traffic slows the service down really quickly (the load average on the database climbs to 10 almost immediately).

So we started thinking about a cluster solution ... We saw many pages on the Internet about cluster solutions using MySQL ... but nothing about PostgreSQL ... The idea is to use several servers to make a sort of big virtual server, using the disk space of each server as one and having the ability to use the CPU and RAM of every server in order to maintain good service performance ... one can imagine it as something like a GFS, but dedicated to PostgreSQL ...

Is there any solution with PostgreSQL matching these needs ... ?
Do we have to backport our development to MySQL for this kind of problem ?
Is there any other solution than a Cluster for our problem ?

Looking forward to your replies,

Regards,
--
Hervé
On Thu, 20 Jan 2005 15:03:31 +0100, Hervé Piedvache <herve@elma.fr> wrote:
> We were at this moment thinking about a Cluster solution ... We saw on the
> Internet many solution talking about Cluster solution using MySQL ... but
> nothing about PostgreSQL ... the idea is to use several servers to make a
> sort of big virtual server using the disk space of each server as one, and
> having the ability to use the CPU and RAM of each servers in order to
> maintain good service performance ...one can imagin it is like a GFS but
> dedicated to postgreSQL...

Forget MySQL Cluster for now. We have a small database of 500 MB. It was not possible to load this database on a machine with 2 GB of RAM, and loading the whole database into RAM is required. So we shrank the database down to 350 MB so that it fits in the 2 GB of RAM. First performance tests on a basic query: 500x slower. Yes, 500x. This issue was reported to the MySQL team, but no answer (and no fix) so far.

Currently the solution runs with a replicated database: one node for write requests and all the other nodes for read requests, with round-robin load balancing.

--
Jean-Max Reymond
CKR Solutions Nice France http://www.ckr-solutions.com
> Is there any solution with PostgreSQL matching these needs ... ? You want: http://www.slony.info/ > Do we have to backport our development to MySQL for this kind of problem ? > Is there any other solution than a Cluster for our problem ? Well, Slony does replication which is basically what you want :) Only master->slave though, so you will need to have all inserts go via the master server, but selects can come off any server. Chris
* Hervé Piedvache (herve@elma.fr) wrote:
> Is there any solution with PostgreSQL matching these needs ... ?

You might look into pg_pool. Another possibility would be slony, though I'm not sure it's to the point you need it at yet, depends on if you can handle some delay before an insert makes it to the slave select systems.

> Do we have to backport our development to MySQL for this kind of problem ?

Well, hopefully not. :)

> Is there any other solution than a Cluster for our problem ?

Bigger server, more CPUs/disks in one box. Try to partition up your data some way such that it can be spread across multiple machines, then if you need to combine the data have it be replicated using slony to a big box that has a view which joins all the tables and do your big queries against that.

Just some thoughts.

Stephen
On Thursday 20 January 2005 15:24, Christopher Kings-Lynne wrote:
> > Is there any solution with PostgreSQL matching these needs ... ?
>
> You want: http://www.slony.info/
>
> > Do we have to backport our development to MySQL for this kind of problem
> > ? Is there any other solution than a Cluster for our problem ?
>
> Well, Slony does replication which is basically what you want :)
>
> Only master->slave though, so you will need to have all inserts go via
> the master server, but selects can come off any server.

Sorry, but I don't agree with this ... Slony is a replication solution ... I don't need replication ... What will I do when my database grows to 50 GB? Will I need more than 50 GB of RAM on each server? This solution is not very realistic for me ...

I need a cluster solution, not a replication one; or else explain to me in detail how I will manage the scalability of my database ...

regards,
--
Hervé
> Sorry but I don't agree with this ... Slony is a replication solution ... I > don't need replication ... what will I do when my database will grow up to 50 > Gb ... I'll need more than 50 Gb of RAM on each server ??? > This solution is not very realistic for me ... > > I need a Cluster solution not a replication one or explain me in details how I > will do for managing the scalabilty of my database ... Buy Oracle
On Thursday 20 January 2005 15:30, Stephen Frost wrote:
> * Hervé Piedvache (herve@elma.fr) wrote:
> > Is there any solution with PostgreSQL matching these needs ... ?
>
> You might look into pg_pool. Another possibility would be slony, though
> I'm not sure it's to the point you need it at yet, depends on if you can
> handle some delay before an insert makes it to the slave select systems.

I think not ... pgpool and slony are replication solutions ... but, as I said to Christopher Kings-Lynne, how will I manage the scalability of the database? I'll need several servers able to hold a database that keeps growing and growing while still giving good performance ...

> > Do we have to backport our development to MySQL for this kind of problem
> > ?
>
> Well, hopefully not. :)

I hope so ;o)

> > Is there any other solution than a Cluster for our problem ?
>
> Bigger server, more CPUs/disks in one box. Try to partition up your
> data some way such that it can be spread across multiple machines, then
> if you need to combine the data have it be replicated using slony to a
> big box that has a view which joins all the tables and do your big
> queries against that.

But I'll hit the limit of a single box quickly, I think: a 4-processor machine with 64 GB of RAM ... and after that?

regards,
--
Hervé
On Thursday 20 January 2005 15:38, Christopher Kings-Lynne wrote:
> > Sorry but I don't agree with this ... Slony is a replication solution ...
> > I don't need replication ... what will I do when my database will grow up
> > to 50 Gb ... I'll need more than 50 Gb of RAM on each server ???
> > This solution is not very realistic for me ...
> >
> > I need a Cluster solution not a replication one or explain me in details
> > how I will do for managing the scalabilty of my database ...
>
> Buy Oracle

I don't think that is the solution for me ... sorry, I'm talking about finding a PostgreSQL solution ...

--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902
* Hervé Piedvache (herve@elma.fr) wrote:
> Le Jeudi 20 Janvier 2005 15:30, Stephen Frost a écrit :
> > * Hervé Piedvache (herve@elma.fr) wrote:
> > > Is there any solution with PostgreSQL matching these needs ... ?
> >
> > You might look into pg_pool. Another possibility would be slony, though
> > I'm not sure it's to the point you need it at yet, depends on if you can
> > handle some delay before an insert makes it to the slave select systems.
>
> I think not ... pgpool or slony are replication solutions ... but as I have
> said to Christopher Kings-Lynne how I'll manage the scalabilty of the
> database ? I'll need several servers able to load a database growing and
> growing to get good speed performance ...

They're both replication solutions, but they also help distribute the load. For example:

pg_pool will distribute the select queries among the servers. They'll all get the inserts, so that hurts, but at least the select queries are distributed.

slony is similar, but your application level does the load distribution of select statements instead of pg_pool. Your application needs to know to send insert statements to the 'main' server, and select from the others.

> > > Is there any other solution than a Cluster for our problem ?
> >
> > Bigger server, more CPUs/disks in one box. Try to partition up your
> > data some way such that it can be spread across multiple machines, then
> > if you need to combine the data have it be replicated using slony to a
> > big box that has a view which joins all the tables and do your big
> > queries against that.
>
> But I'll arrive to limitation of a box size quickly I thing a 4 processors
> with 64 Gb of RAM ... and after ?

Go to non-x86 hardware after that if you're going to continue to increase the size of the server. Personally I think your better bet might be to figure out a way to partition up your data (isn't that what google does anyway?).

Stephen
Hervé Piedvache wrote:
> Dear community,
>
> My company, which I actually represent, is a fervent user of PostgreSQL.
> We used to make all our applications using PostgreSQL for more than 5 years.
> We usually do classical client/server applications under Linux, and Web
> interface (php, perl, C/C++). We used to manage also public web services with
> 10/15 millions records and up to 8 millions pages view by month.

Depending on your needs, either Slony (www.slony.info) or Replicator (www.commandprompt.com) will do what you want. Replicator is easier to set up, but Slony is free.

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
On Jan 20, 2005, at 9:36 AM, Hervé Piedvache wrote:
> Sorry but I don't agree with this ... Slony is a replication solution ... I
> don't need replication ... what will I do when my database will grow up to 50
> Gb ... I'll need more than 50 Gb of RAM on each server ???

Slony doesn't use much RAM. The MySQL clustering product, NDB I believe it is called, requires that all data fit in RAM. (At least, it used to.) What you'll need is disk space.

As for a cluster, I think you are thinking of multi-master replication.

You should look into what others have said about trying to partition data among several boxes and then join the results together.

Or you could fork over hundreds of thousands of dollars for Oracle's RAC.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
Stephen Frost wrote: >* Herv? Piedvache (herve@elma.fr) wrote: > > >>Le Jeudi 20 Janvier 2005 15:30, Stephen Frost a écrit : >> >> >>>* Herv? Piedvache (herve@elma.fr) wrote: >>> >>> >>>>Is there any solution with PostgreSQL matching these needs ... ? >>>> >>>> >>>You might look into pg_pool. Another possibility would be slony, though >>>I'm not sure it's to the point you need it at yet, depends on if you can >>>handle some delay before an insert makes it to the slave select systems. >>> >>> >>I think not ... pgpool or slony are replication solutions ... but as I have >>said to Christopher Kings-Lynne how I'll manage the scalabilty of the >>database ? I'll need several servers able to load a database growing and >>growing to get good speed performance ... >> >> > >They're both replication solutions, but they also help distribute the >load. For example: > >pg_pool will distribute the select queries amoung the servers. They'll >all get the inserts, so that hurts, but at least the select queries are >distributed. > >slony is similar, but your application level does the load distribution >of select statements instead of pg_pool. Your application needs to know >to send insert statements to the 'main' server, and select from the >others. > > You can put pgpool in front of replicator or slony to get load balancing for reads. > > >>>>Is there any other solution than a Cluster for our problem ? >>>> >>>> >>>Bigger server, more CPUs/disks in one box. Try to partition up your >>>data some way such that it can be spread across multiple machines, then >>>if you need to combine the data have it be replicated using slony to a >>>big box that has a view which joins all the tables and do your big >>>queries against that. >>> >>> >>But I'll arrive to limitation of a box size quickly I thing a 4 processors >>with 64 Gb of RAM ... and after ? >> >> Opteron. > >Go to non-x86 hardware after if you're going to continue to increase the >size of the server. Personally I think your better bet might be to >figure out a way to partition up your data (isn't that what google >does anyway?). > > Stephen > > -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
>>> Sorry but I don't agree with this ... Slony is a replication solution ...
>>> I don't need replication ... what will I do when my database will grow up
>>> to 50 Gb ... I'll need more than 50 Gb of RAM on each server ???
>>> This solution is not very realistic for me ...
>>>
>>> I need a Cluster solution not a replication one or explain me in details
>>> how I will do for managing the scalabilty of my database ...
>>
>> Buy Oracle
>
> I think this is not my solution ... sorry I'm talking about finding a
> PostgreSQL solution ...

My point is that there is no free solution. There simply isn't. I don't know why you insist on keeping all your data in RAM, but the MySQL cluster requires that ALL data MUST fit in RAM all the time.

PostgreSQL has replication, but not partitioning (which is what you want).

So, your only option is Oracle or another very expensive commercial database.

Chris
On Thursday 20 January 2005 15:48, Jeff wrote:
> On Jan 20, 2005, at 9:36 AM, Hervé Piedvache wrote:
> > Sorry but I don't agree with this ... Slony is a replication solution ... I
> > don't need replication ... what will I do when my database will grow up to 50
> > Gb ... I'll need more than 50 Gb of RAM on each server ???
>
> Slony doesn't use much ram. The mysql clustering product, ndb I believe
> it is called, requires all data fit in RAM. (At least, it used to).
> What you'll need is disk space.

Slony does not use RAM ... but PostgreSQL will need RAM to access a 50 GB database ... so having two servers with the same configuration replicated by Slony does not solve the problem of the scalability of the database ...

> As for a cluster I think you are thinking of multi-master replication.

No, I'm really thinking about a cluster solution ... having several servers acting as one big virtual server, to get several processors and lots of RAM across many boxes ...

> You should look into what others have said about trying to partiition
> data among several boxes and then join the results together.

??? Who talked about this?

> Or you could fork over hundreds of thousands of dollars for Oracle's
> RAC.

No, please do not talk about this again ... I'm looking for a PostgreSQL solution ... I know RAC ... and I'm not able to pay for a RAC-certified hardware configuration plus a RAC licence.

Regards,
--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902
>>Or you could fork over hundreds of thousands of dollars for Oracle's >>RAC. > > > No please do not talk about this again ... I'm looking about a PostgreSQL > solution ... I know RAC ... and I'm not able to pay for a RAC certify > hardware configuration plus a RAC Licence. There is absolutely zero PostgreSQL solution... You may have to split the data yourself onto two independent db servers and combine the results somehow in your application. Chris
Joshua,

On Thursday 20 January 2005 15:44, Joshua D. Drake wrote:
> Hervé Piedvache wrote:
> > My company, which I actually represent, is a fervent user of PostgreSQL.
> > We used to make all our applications using PostgreSQL for more than 5
> > years. We usually do classical client/server applications under Linux,
> > and Web interface (php, perl, C/C++). We used to manage also public web
> > services with 10/15 millions records and up to 8 millions pages view by
> > month.
>
> Depending on your needs either:
>
> Slony: www.slony.info
>
> or
>
> Replicator: www.commandprompt.com
>
> Will both do what you want. Replicator is easier to setup but
> Slony is free.

No ... as I have said ... how will I manage a database with a table of maybe 250,000,000 records? I'll need incredible servers to get quick access or fast index reads, no?

So what we would like to get is a pool of small servers able to act as one virtual server ... that is what is called a cluster, no?

I know they are not using PostgreSQL ... but how does a company like Google get a database of such incredible size with such quick access?

regards,
--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902
On Thursday 20 January 2005 15:51, Christopher Kings-Lynne wrote:
> >>> Sorry but I don't agree with this ... Slony is a replication solution
> >>> ... I don't need replication ... what will I do when my database will
> >>> grow up to 50 Gb ... I'll need more than 50 Gb of RAM on each server
> >>> ??? This solution is not very realistic for me ...
> >>>
> >>> I need a Cluster solution not a replication one or explain me in details
> >>> how I will do for managing the scalabilty of my database ...
> >>
> >> Buy Oracle
> >
> > I think this is not my solution ... sorry I'm talking about finding a
> > PostgreSQL solution ...
>
> My point being is that there is no free solution. There simply isn't.
> I don't know why you insist on keeping all your data in RAM, but the
> mysql cluster requires that ALL data MUST fit in RAM all the time.

I don't insist on keeping all the data in RAM ... but when you use PostgreSQL with a big database, you know that for quick access, even just to read the index files, it's better to have as much RAM as possible ... I just want to be able to keep quick access with a growing and growing database ...

> PostgreSQL has replication, but not partitioning (which is what you want).

:o(

> So, your only option is Oracle or another very expensive commercial
> database.

That's not good news ...

--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902
> >No please do not talk about this again ... I'm looking about a PostgreSQL >solution ... I know RAC ... and I'm not able to pay for a RAC certify >hardware configuration plus a RAC Licence. > > What you want does not exist for PostgreSQL. You will either have to build it yourself or pay somebody to build it for you. Sincerely, Joshua D. Drake >Regards, > > -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
>So what we would like to get is a pool of small servers able to make one >virtual server ... for that is called a Cluster ... no ? > >I know they are not using PostgreSQL ... but how a company like Google do to >get an incredible database in size and so quick access ? > > You could use dblink with multiple servers across data partitions within PostgreSQL but I don't know how fast that would be. J >regards, > > -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
Christopher Kings-Lynne wrote:
>>> Or you could fork over hundreds of thousands of dollars for Oracle's
>>> RAC.
>>
>> No please do not talk about this again ... I'm looking about a
>> PostgreSQL solution ... I know RAC ... and I'm not able to pay for a
>> RAC certify hardware configuration plus a RAC Licence.
>
> There is absolutely zero PostgreSQL solution...

I just replied the same thing but then I was thinking. Couldn't he use multiple databases over multiple servers with dblink?

It is not exactly how I would want to do it, but it would provide what he needs I think???

Sincerely,

Joshua D. Drake

> You may have to split the data yourself onto two independent db
> servers and combine the results somehow in your application.
>
> Chris

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
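For what it's worth, here is a minimal sketch of what the dblink idea could look like, assuming the rows have already been split by ID range across two servers; the host, table and column names are hypothetical, and contrib/dblink has to be installed on the querying node:

    -- local table holds one partition; the other partition lives on "server2"
    CREATE VIEW all_records AS
        SELECT id, payload FROM records_part1
        UNION ALL
        SELECT id, payload
          FROM dblink('host=server2 dbname=directory user=app',
                      'SELECT id, payload FROM records_part2')
               AS remote(id integer, payload text);

    -- reads can go through the view; inserts still have to be sent by the
    -- application to whichever server owns the relevant partition
    SELECT count(*) FROM all_records WHERE id = 12345;

Note that the remote side never sees the outer WHERE clause, so every query through the view ships the whole remote partition over the network and filters it locally, which is probably why Joshua is unsure how fast this would be.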
* Hervé Piedvache (herve@elma.fr) wrote:
> I know they are not using PostgreSQL ... but how a company like Google do to
> get an incredible database in size and so quick access ?

They segment their data across multiple machines and have an algorithm which tells the application layer which machine to contact for what data.

Stephen
On Thursday 20 January 2005 16:05, Joshua D. Drake wrote:
> Christopher Kings-Lynne wrote:
> >>> Or you could fork over hundreds of thousands of dollars for Oracle's
> >>> RAC.
> >>
> >> No please do not talk about this again ... I'm looking about a
> >> PostgreSQL solution ... I know RAC ... and I'm not able to pay for a
> >> RAC certify hardware configuration plus a RAC Licence.
> >
> > There is absolutely zero PostgreSQL solution...
>
> I just replied the same thing but then I was thinking. Couldn't he use
> multiple databases over multiple servers with dblink?
>
> It is not exactly how I would want to do it, but it would provide what
> he needs I think???

Yes, that seems to be the only solution ... but I'm a little disappointed about this ... could you explain to me why this kind of functionality does not exist ... it seems to be a real need for big applications, no?

Thanks all for your answers ...

--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902
* Christopher Kings-Lynne (chriskl@familyhealth.com.au) wrote: > PostgreSQL has replication, but not partitioning (which is what you want). It doesn't have multi-server partitioning.. It's got partitioning within a single server (doesn't it? I thought it did, I know it was discussed w/ the guy from Cox Communications and I thought he was using it :). > So, your only option is Oracle or another very expensive commercial > database. Or partition the data at the application layer. Stephen
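For readers wondering what the single-server partitioning Stephen mentions can look like in practice, here is a minimal inheritance-based sketch (table names and ranges are invented); in this setup the application or a rule has to direct each insert to the right child table, and a query against the parent still scans every child:

    CREATE TABLE records (
        id      bigint,
        created date,
        payload text
    );

    -- one child per month, each carrying a CHECK constraint that documents its range
    CREATE TABLE records_2005_01 (
        CHECK (created >= '2005-01-01' AND created < '2005-02-01')
    ) INHERITS (records);

    CREATE TABLE records_2005_02 (
        CHECK (created >= '2005-02-01' AND created < '2005-03-01')
    ) INHERITS (records);

    -- inserts go straight into the child that owns the range...
    INSERT INTO records_2005_01 VALUES (1, '2005-01-20', '...');

    -- ...and queries can hit one child or everything through the parent
    SELECT count(*) FROM records_2005_01 WHERE created = '2005-01-20';
    SELECT count(*) FROM records;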
>> then I was thinking. Couldn't he use >>multiple databases >>over multiple servers with dblink? >> >>It is not exactly how I would want to do it, but it would provide what >>he needs I think??? >> >> > >Yes seems to be the only solution ... but I'm a little disapointed about >this ... could you explain me why there is not this kind of >functionnality ... it seems to be a real need for big applications no ? > > Because it is really, really hard to do correctly and hard equals expensive. Sincerely, Joshua D. Drake >Thanks all for your answers ... > > -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
Hervé Piedvache wrote:
> No ... as I have said ... how I'll manage a database getting a table of may be
> 250 000 000 records ? I'll need incredible servers ... to get quick access or
> index reading ... no ?
>
> So what we would like to get is a pool of small servers able to make one
> virtual server ... for that is called a Cluster ... no ?
>
> I know they are not using PostgreSQL ... but how a company like Google do to
> get an incredible database in size and so quick access ?

Probably by carefully partitioning their data. I can't imagine anything being fast on a single table in the 250,000,000-tuple range. Nor can I really imagine any database that efficiently splits a single table across multiple machines (or even inefficiently, unless some internal partitioning is being done).

So, you'll have to do some work at your end and not just hope that a "magic bullet" is available.

Once you've got the data partitioned, the question becomes one of how to enhance performance/scalability. Have you considered RAIDb?

--
Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.
> No please do not talk about this again ... I'm looking about a PostgreSQL
> solution ... I know RAC ... and I'm not able to pay for a RAC certify
> hardware configuration plus a RAC Licence.

Are you totally certain you can't solve your problem with a single server solution?

How about: price out a 4-way Opteron 4U rackmount server with 64-bit Linux, stuffed with hard drives (like 40) set up in a complex raid configuration (multiple raid controllers), allowing you (with tablespaces) to divide up your database.

You can drop in dual core Opterons at some later point for an easy upgrade. Let's say this server costs 20k$ ... are you sure this will not be enough to handle your load?

Merlin
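To make the tablespace part of Merlin's suggestion concrete, here is a small sketch, assuming PostgreSQL 8.0 and made-up mount points, one per RAID set:

    -- one tablespace per RAID array
    CREATE TABLESPACE data_a  LOCATION '/raid1/pgdata';
    CREATE TABLESPACE data_b  LOCATION '/raid2/pgdata';
    CREATE TABLESPACE fastidx LOCATION '/raid3/pgdata';

    -- big tables and their indexes can then be spread over the arrays
    CREATE TABLE records (
        id      bigint,
        payload text
    ) TABLESPACE data_a;

    CREATE INDEX records_id_idx ON records (id) TABLESPACE fastidx;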
Google uses something called the google filesystem, look it up in google. It is a distributed file system.
Dave
-- Dave Cramer http://www.postgresintl.com 519 939 0336 ICQ#14675561
On Thursday 20 January 2005 16:14, Steve Wampler wrote:
> Once you've got the data partitioned, the question becomes one of
> how to inhance performance/scalability. Have you considered RAIDb?

No, but it seems very interesting ... close to what Joshua explained ... but done automatically ... Thanks!

--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902
On Thursday 20 January 2005 16:16, Merlin Moncure wrote:
> > No please do not talk about this again ... I'm looking about a PostgreSQL
> > solution ... I know RAC ... and I'm not able to pay for a RAC certify
> > hardware configuration plus a RAC Licence.
>
> Are you totally certain you can't solve your problem with a single server
> solution?
>
> How about:
> Price out a 4 way Opteron 4u rackmount server with 64 bit linux, stuffed
> with hard drives (like 40) set up in a complex raid configuration (multiple
> raid controllers) allowing you (with tablespaces) to divide up your
> database.
>
> You can drop in dual core opterons at some later point for an easy upgrade.
> Let's say this server costs 20k$...are you sure this will not be enough to
> handle your load?

I'm not, as I said in my mail. I want to build a cluster of servers ... :o)

--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902
On Thursday 20 January 2005 16:23, Dave Cramer wrote:
> Google uses something called the google filesystem, look it up in
> google. It is a distributed file system.

Yes, that's another point I'm working on ... making a cluster of servers using GFS ... and getting PostgreSQL to run on it ...

But I have not finished my tests ... and maybe people here already have experience with this ...

Regards,
--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902
Hervé Piedvache wrote: > Le Jeudi 20 Janvier 2005 16:23, Dave Cramer a écrit : > >>Google uses something called the google filesystem, look it up in >>google. It is a distributed file system. > > > Yes that's another point I'm working on ... make a cluster of server using > GFS ... and making PostgreSQL running with it ... A few years ago I played around with GFS, but not for postgresql. I don't think it's going to help - logically there's no difference between putting PG on GFS and putting PG on NFS - in both cases the filesystem doesn't provide any support for distributing the task at hand - and a PG database server isn't written to be distributed across hosts regardless of the distribution of the data across filesystems. -- Steve Wampler -- swampler@noao.edu The gods that smiled on your birth are now laughing out loud.
> Probably by carefully partitioning their data. I can't imagine anything > being fast on a single table in 250,000,000 tuple range. Nor can I > really imagine any database that efficiently splits a single table > across multiple machines (or even inefficiently unless some internal > partitioning is being done). Ah, what about partial indexes - those might help. As a kind of 'semi-partition'. Chris
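To illustrate the partial-index idea with made-up column names: each index only covers one slice of the big table, so it stays small enough to be cached, as long as the queries repeat (or imply) the WHERE condition of the index:

    -- small per-slice indexes instead of one enormous index
    CREATE INDEX records_recent_idx ON records (customer_id)
        WHERE created >= '2005-01-01';

    CREATE INDEX records_active_idx ON records (customer_id)
        WHERE status = 'active';

    -- the planner will only use records_recent_idx when the query
    -- restates the index predicate
    SELECT * FROM records
     WHERE customer_id = 42 AND created >= '2005-01-01';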
I think maybe a SAN in conjunction with tablespaces might be the answer. Still need one honking server.

Rick

On 01/20/2005 10:08 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Christopher Kings-Lynne (chriskl@familyhealth.com.au) wrote:
> PostgreSQL has replication, but not partitioning (which is what you want).

It doesn't have multi-server partitioning.. It's got partitioning within a single server (doesn't it? I thought it did, I know it was discussed w/ the guy from Cox Communications and I thought he was using it :).

> So, your only option is Oracle or another very expensive commercial
> database.

Or partition the data at the application layer.

Stephen
On Thu, 2005-01-20 at 15:36 +0100, Hervé Piedvache wrote:
> Le Jeudi 20 Janvier 2005 15:24, Christopher Kings-Lynne a écrit :
> > > Is there any solution with PostgreSQL matching these needs ... ?
> >
> > You want: http://www.slony.info/
> >
> > > Do we have to backport our development to MySQL for this kind of problem
> > > ? Is there any other solution than a Cluster for our problem ?
> >
> > Well, Slony does replication which is basically what you want :)
> >
> > Only master->slave though, so you will need to have all inserts go via
> > the master server, but selects can come off any server.
>
> Sorry but I don't agree with this ... Slony is a replication solution ... I
> don't need replication ... what will I do when my database will grow up to 50
> Gb ... I'll need more than 50 Gb of RAM on each server ???
> This solution is not very realistic for me ...

Slony has some other issues with databases > 200GB in size as well (well, it hates long running transactions -- and pg_dump is a regular long running transaction).

However, you don't need RAM on each server for this, you simply need enough disk space.

Have a Master which takes writes, a "replicator" which you can consider to be a hot-backup of the master, and have N slaves replicate off of the otherwise untouched "replicator" machine.

For your next trick, have the application send read requests for Clients A-C to slave 1, D-F to slave 2, ...

You need enough memory to hold the index sections for clients A-C on slave 1. The rest of the index can remain on disk. It's available should it be required (the D-F box crashed, so your application is now feeding those read requests to the A-C machine)...

Go to more slaves and smaller segments as you require. Use the absolute cheapest hardware you can find for the slaves that gives reasonable performance. They don't need to be reliable, so RAID 0 on IDE drives is perfectly acceptable.

PostgreSQL can do the replication portion quite nicely. You need to implement the "cluster" part in the application side.

--
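The "A-C to slave 1, D-F to slave 2" routing has to live in the application, but the rule itself can be trivial; purely as an illustration (all names invented), it could even sit in a small lookup table the web tier loads at startup:

    CREATE TABLE read_node_map (
        range_start char(1),
        range_end   char(1),
        node_host   text
    );

    INSERT INTO read_node_map VALUES ('a', 'c', 'slave1.example.com');
    INSERT INTO read_node_map VALUES ('d', 'f', 'slave2.example.com');

    -- which slave serves reads for a given client? (writes always go to the master)
    SELECT node_host
      FROM read_node_map
     WHERE lower(substr('Dupont', 1, 1)) BETWEEN range_start AND range_end;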
Christopher Kings-Lynne wrote:
>> Probably by carefully partitioning their data. I can't imagine anything
>> being fast on a single table in 250,000,000 tuple range. Nor can I
>> really imagine any database that efficiently splits a single table
>> across multiple machines (or even inefficiently unless some internal
>> partitioning is being done).
>
> Ah, what about partial indexes - those might help. As a kind of
> 'semi-partition'.

He could also use schemas to partition out the information within the same database.

J

> Chris

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
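A quick sketch of the schema idea (segment names invented): each schema holds an identically shaped table, most queries address only one segment, and the rare global query unions them explicitly:

    CREATE SCHEMA seg_2005_01;
    CREATE SCHEMA seg_2005_02;

    CREATE TABLE seg_2005_01.records (id bigint, payload text);
    CREATE TABLE seg_2005_02.records (id bigint, payload text);

    -- a typical query touches a single segment...
    SELECT count(*) FROM seg_2005_01.records;

    -- ...and the occasional global one unions them
    SELECT * FROM seg_2005_01.records
    UNION ALL
    SELECT * FROM seg_2005_02.records;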
What you want is some kind of huge parallel computing, isn't it? I have heard that several groups of Japanese PostgreSQL developers have done it, but they discuss it on Japanese websites and of course in Japanese. I can name one of them, "Asushi Mitani", and his website http://www.csra.co.jp/~mitani/jpug/pgcluster/en/index.html and you may contact him directly.

Amrit
Thailand
* Richard_D_Levine@raytheon.com (Richard_D_Levine@raytheon.com) wrote:
> I think maybe a SAN in conjunction with tablespaces might be the answer.
> Still need one honking server.

That's interesting- can a PostgreSQL partition be spread across multiple tablespaces?

Stephen
Steve Wampler <swampler@noao.edu> writes: > Hervé Piedvache wrote: > > > No ... as I have said ... how I'll manage a database getting a table of may > > be 250 000 000 records ? I'll need incredible servers ... to get quick access > > or index reading ... no ? > > Probably by carefully partitioning their data. I can't imagine anything > being fast on a single table in 250,000,000 tuple range. Why are you all so psyched out by the size of the table? That's what indexes are for. The size of the table really isn't relevant here. The important thing is the size of the working set. Ie, How many of those records are required to respond to queries. As long as you tune your application so every query can be satisfied by reading a (very) limited number of those records and have indexes to speed access to those records you can have quick response time even if you have terabytes of raw data. I would start by looking at the plans for the queries you're running and seeing if you have any queries that are reading more than hundred records or so. If so then you have to optimize them or rethink your application design. You might need to restructure your data so you don't have to scan too many records for any query. No clustering system is going to help you if your application requires reading through too much data. If every query is designed to not have to read more than a hundred or so records then there's no reason you can't have sub-100ms response time even if you had terabytes of raw data. If the problem is just that each individual query is fast but there's too many coming for a single server then something like slony is all you need. It'll spread the load over multiple machines. If you spread the load in an intelligent way you can even concentrate each server on certain subsets of the data. But that shouldn't even really be necessary, just a nice improvement. -- greg
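Greg's "look at the plans" step is just EXPLAIN ANALYZE; a sketch of what to run and what to look for (the query and table name are invented):

    EXPLAIN ANALYZE
    SELECT * FROM records
     WHERE customer_id = 42 AND created >= '2005-01-01';

    -- in the output, check that:
    --   * the big table is reached through an Index Scan, not a Seq Scan
    --   * the actual "rows=" counts stay in the tens or hundreds, not millions
    --   * the total actual time of the top node fits your latency budget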
Hervé Piedvache wrote:
> Sorry but I don't agree with this ... Slony is a replication solution ... I
> don't need replication ... what will I do when my database will grow up to 50
> Gb ... I'll need more than 50 Gb of RAM on each server ???
> This solution is not very realistic for me ...

Have you confirmed you need a 1:1 RAM:data ratio? Of course more memory gets more speed, but often at a diminishing rate of return. Unless every record of your 50GB is used in every query, only the most commonly used elements of your DB need to be in RAM. This is the very idea of caching.
The problem is very large amounts of data that need to be both read and updated. If you replicate a system, you will need to intelligently route the reads to the server that has the data in RAM or you will always be hitting disk, which is slow. This kind of routing AFAIK is not possible with current database technology, and you are still stuck for writes.

Writes are always going to be the bane of any cluster. Clustering can give better parallel read performance, i.e. a large number of clients accessing data simultaneously, but your write performance is always going to be bound by the underlying disk infrastructure; not even Oracle RAC can get around this (it uses multiple read nodes accessing the same set of database files underneath).

Google solved the problem by building this intelligence into the middle tier, and using a distributed file system. Java Entity Beans are supposed to solve this problem somewhat by distributing the data across multiple servers in a cluster and allowing you to defer write syncing, but it really doesn't work all that well.

The only way I know to solve this at the RDBMS layer is to configure a very powerful disk layer, which basically means going to a SAN mesh with multiple cards on a single system with multiple IO boards, or an OS that clusters at the base level, thinking HP Superdome or z900. Even an Opteron with PCI-X cards has a limit of about 400MB/sec throughput on a single IO channel, and there are only two independent channels on any boards I know about.

The other solution is to do what Google did: implement your own middle tier that knows how to route queries to the appropriate place. Each node can then have its own independent database with its own independent disk subsystem, and your throughput is only limited by your network interconnects and your internet pipe. This kind of middle tier is really not that hard to build if your data can easily be segmented. Each node runs its own query, sort and filter steps independently, and supplies the result to the central data broker, which then collates the results and supplies them back to the user. Updates work in a similar fashion: the update comes into the central broker, which decides which nodes it will affect, and then issues updates to those nodes.

I've built this kind of architecture; if you want to do it, don't use Java unless you want to pay top dollar for your programmers, because it's hard to make it work well in Java (most JMS implementations suck; look at MQueue or a custom queue implementation, and forget XML, it's too slow to serialize and deserialize requests).

Alex Turner
NetEconomist

On Thu, 20 Jan 2005 11:13:25 -0500, Stephen Frost <sfrost@snowman.net> wrote:
> * Richard_D_Levine@raytheon.com (Richard_D_Levine@raytheon.com) wrote:
> > I think maybe a SAN in conjunction with tablespaces might be the answer.
> > Still need one honking server.
>
> That's interesting- can a PostgreSQL partition be acress multiple
> tablespaces?
>
> Stephen
On Thu, 20 Jan 2005 16:32:27 +0100, Hervé Piedvache wrote: > Le Jeudi 20 Janvier 2005 16:23, Dave Cramer a écrit : >> Google uses something called the google filesystem, look it up in >> google. It is a distributed file system. > > Yes that's another point I'm working on ... make a cluster of server using > GFS ... and making PostgreSQL running with it ... Did you read the GFS whitepaper? It really works differently from other filesystems with regard to latency and consistency. You'll probably have better success with Lustre (http://www.clusterfs.com/) or RedHat's Global File System (http://www.redhat.com/software/rha/gfs/). If you're looking for a 'cheap, free and easy' solution you can just as well stop right now. :-) -h
On January 20, 2005 06:49 am, Joshua D. Drake wrote:
> Stephen Frost wrote:
> > * Hervé Piedvache (herve@elma.fr) wrote:
> >> Le Jeudi 20 Janvier 2005 15:30, Stephen Frost a écrit :
> >>> * Hervé Piedvache (herve@elma.fr) wrote:
> >>>> Is there any solution with PostgreSQL matching these needs ... ?
> >>>
> >>> You might look into pg_pool. Another possibility would be slony, though
> >>> I'm not sure it's to the point you need it at yet, depends on if you can
> >>> handle some delay before an insert makes it to the slave select systems.
> >>
> >> I think not ... pgpool or slony are replication solutions ... but as I
> >> have said to Christopher Kings-Lynne how I'll manage the scalabilty of
> >> the database ? I'll need several servers able to load a database growing
> >> and growing to get good speed performance ...
> >
> > They're both replication solutions, but they also help distribute the
> > load. For example:
> >
> > pg_pool will distribute the select queries amoung the servers. They'll
> > all get the inserts, so that hurts, but at least the select queries are
> > distributed.
> >
> > slony is similar, but your application level does the load distribution
> > of select statements instead of pg_pool. Your application needs to know
> > to send insert statements to the 'main' server, and select from the
> > others.
>
> You can put pgpool in front of replicator or slony to get load
> balancing for reads.

Last time I checked, load-balanced reads were only available in pgpool if you were using pgpool's internal replication. Has something changed recently?

> >>>> Is there any other solution than a Cluster for our problem ?
> >>>
> >>> Bigger server, more CPUs/disks in one box. Try to partition up your
> >>> data some way such that it can be spread across multiple machines, then
> >>> if you need to combine the data have it be replicated using slony to a
> >>> big box that has a view which joins all the tables and do your big
> >>> queries against that.
> >>
> >> But I'll arrive to limitation of a box size quickly I thing a 4
> >> processors with 64 Gb of RAM ... and after ?
>
> Opteron.

IBM Z-series, or other big iron.

> > Go to non-x86 hardware after if you're going to continue to increase the
> > size of the server. Personally I think your better bet might be to
> > figure out a way to partition up your data (isn't that what google
> > does anyway?).
> >
> > Stephen

--
Darcy Buskermolen
Wavefire Technologies Corp.
ph: 250.717.0200
fx: 250.763.1759
http://www.wavefire.com
On January 20, 2005 06:51 am, Christopher Kings-Lynne wrote:
> >>> Sorry but I don't agree with this ... Slony is a replication solution
> >>> ... I don't need replication ... what will I do when my database will
> >>> grow up to 50 Gb ... I'll need more than 50 Gb of RAM on each server
> >>> ??? This solution is not very realistic for me ...
> >>>
> >>> I need a Cluster solution not a replication one or explain me in details
> >>> how I will do for managing the scalabilty of my database ...
> >>
> >> Buy Oracle
> >
> > I think this is not my solution ... sorry I'm talking about finding a
> > PostgreSQL solution ...
>
> My point being is that there is no free solution. There simply isn't.
> I don't know why you insist on keeping all your data in RAM, but the
> mysql cluster requires that ALL data MUST fit in RAM all the time.
>
> PostgreSQL has replication, but not partitioning (which is what you want).
>
> So, your only option is Oracle or another very expensive commercial
> database.

Another option to consider would be pgmemcache. That way you just build the farm out of lots of large-memory, diskless boxes for keeping the whole database in memory in the whole cluster. More information on it can be found at: http://people.freebsd.org/~seanc/pgmemcache/

> Chris

--
Darcy Buskermolen
Wavefire Technologies Corp.
ph: 250.717.0200
fx: 250.763.1759
http://www.wavefire.com
Could you explain to us what you have in mind for that solution? I mean, forget the PostgreSQL (or any other database) restrictions and explain to us how this hardware would be set up. Where would the data be stored?

I've something in mind for you, but first I need to understand your needs!

C ya.
Bruno Almeida do Lago
On Thu, 20 Jan 2005 09:33:42 -0800, Darcy Buskermolen <darcy@wavefire.com> wrote: > > Another Option to consider would be pgmemcache. that way you just build the > farm out of lots of large memory, diskless boxes for keeping the whole > database in memory in the whole cluster. More information on it can be found > at: http://people.freebsd.org/~seanc/pgmemcache/ Which brings up another question: why not just cluster at the hardware layer? Get an external fiberchannel array, and cluster a bunch of dual Opterons, all sharing that storage. In that sense you would be getting one big PostgreSQL 'image' running across all of the servers. Or is that idea too 90's? ;-) -- Mitch
On January 20, 2005 10:42 am, Mitch Pirtle wrote:
> On Thu, 20 Jan 2005 09:33:42 -0800, Darcy Buskermolen
> <darcy@wavefire.com> wrote:
> > Another Option to consider would be pgmemcache. that way you just build
> > the farm out of lots of large memory, diskless boxes for keeping the
> > whole database in memory in the whole cluster. More information on it
> > can be found at: http://people.freebsd.org/~seanc/pgmemcache/
>
> Which brings up another question: why not just cluster at the hardware
> layer? Get an external fiberchannel array, and cluster a bunch of dual
> Opterons, all sharing that storage. In that sense you would be getting
> one big PostgreSQL 'image' running across all of the servers.

It doesn't quite work that way, thanks to shared memory and the kernel disk cache (among other things).

> Or is that idea too 90's? ;-)

--
Darcy Buskermolen
Wavefire Technologies Corp.
ph: 250.717.0200
fx: 250.763.1759
http://www.wavefire.com
Mitch Pirtle wrote: > Which brings up another question: why not just cluster at the hardware > layer? Get an external fiberchannel array, and cluster a bunch of dual > Opterons, all sharing that storage. In that sense you would be getting > one big PostgreSQL 'image' running across all of the servers. This isn't as easy as it sounds. Simply sharing the array among hosts with a 'standard' file system won't work because of cache inconsistencies. So, you need to put a shareable filesystem (such as GFS or Lustre) on it. But that's not enough, because you're going to be running separate postgresql backends on the different hosts, and there are definitely consistency issues with trying to do that. So far as I know (right, experts?) postgresql isn't designed with providing distributed consistency in mind (isn't shared memory used for consistency, which restricts all the backends to a single host?). -- Steve Wampler -- swampler@noao.edu The gods that smiled on your birth are now laughing out loud.
On Thursday 20 January 2005 19:09, Bruno Almeida do Lago wrote:
> Could you explain us what do you have in mind for that solution? I mean,
> forget the PostgreSQL (or any other database) restrictions and explain us
> how this hardware would be. Where the data would be stored?
>
> I've something in mind for you, but first I need to understand your needs!

I just want to make a big database, as explained in my first mail ... At the beginning we will have approximately 150,000,000 records ... each month we will add about 4-8 million new rows, in a constant flow during the day ... and at the same time web users will access the database in order to read those data.

The stored data are quite close to the data stored by Google ... (we are not making a Google clone ... just a lot of data, many small values and some big ones ... that's why I'm comparing with Google for data storage).

Then we will have a search engine searching through those data ...

Regarding the hardware, for the moment we have only a dual Xeon 2.8 GHz with 4 GB of RAM ... and we saw we had bad performance results ... so we are thinking about a new solution, with maybe several servers (the server design may vary from one to another) ... to get a kind of cluster and better performance ...

Am I clear?

Regards,
--
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902
On Thu, 20 Jan 2005 12:13:17 -0700, Steve Wampler <swampler@noao.edu> wrote:
> Mitch Pirtle wrote:
> But that's not enough, because you're going to be running separate
> postgresql backends on the different hosts, and there are
> definitely consistency issues with trying to do that. So far as
> I know (right, experts?) postgresql isn't designed with providing
> distributed consistency in mind (isn't shared memory used for
> consistency, which restricts all the backends to a single host?).

Yes, you're right: you'll need a Distributed Lock Manager and an application able to manage it. Postgres?
Hervé Piedvache <herve@elma.fr> writes:
> Le Jeudi 20 Janvier 2005 19:09, Bruno Almeida do Lago a écrit :
> > Could you explain us what do you have in mind for that solution? I mean,
> > forget the PostgreSQL (or any other database) restrictions and explain us
> > how this hardware would be. Where the data would be stored?
> >
> > I've something in mind for you, but first I need to understand your needs!
>
> I just want to make a big database as explained in my first mail ... At the
> beginning we will have aprox. 150 000 000 records ... each month we will add
> about 4/8 millions new rows in constant flow during the day ... and in same
> time web users will access to the database in order to read those data.
> Stored data are quite close to data stored by google ... (we are not making a
> google clone ... just a lot of data many small values and some big ones ...
> that's why I'm comparing with google for data storage).
> Then we will have a search engine searching into those data ...

You're concentrating on the data within the database. That's only half the picture. What are you going to *do* with the data in the database? You need to analyze what "we will have a search engine searching into those data" means in more detail.

Postgres is more than capable of storing 150Gb of data. There are people with terabyte databases on this list. You need to define what types of queries you need to perform, how much data they need to manipulate, and what your performance requirements are for those queries.

--
greg
Two-way Xeons are as fast as a single Opteron; 150M rows isn't a big deal.
Clustering isn't really the solution, I fail to see how clustering actually helps since it has to slow down file access.
Dave
-- Dave Cramer http://www.postgresintl.com 519 939 0336 ICQ#14675561
> Dealing about the hardware, for the moment we have only a bi-pentium Xeon
> 2.8Ghz with 4 Gb of RAM ... and we saw we had bad performance results ... so
> we are thinking about a new solution with maybe several servers (server
> design may vary from one to other) ... to get a kind of cluster to get better
> performance ...
>
> Am I clear ?

yes. Clustering is not the answer to your problem. You need to build a bigger, faster box with lots of storage.

Clustering:
A: is a headache
B: will cost you more, not less
C: is not designed for what you are trying to do.

Going the x86 route, for about $20k you can get a quad Opteron with 1-2 terabytes of storage (SATA), depending on how you configure your RAID. This is the best bang for the buck you are going to get, period. Replicate for redundancy, not performance. If you are doing a fair amount of writes, you will not be able to make a faster system than this for a similar amount of cash. You can drop the price a bit by pushing optional upgrades out to the future... If this is not good enough for you, it's time to start thinking about a mid-range server.

Merlin
Hervé Piedvache wrote:
> Dealing about the hardware, for the moment we have only a bi-pentium Xeon
> 2.8Ghz with 4 Gb of RAM ... and we saw we had bad performance results ... so
> we are thinking about a new solution with maybe several servers (server
> design may vary from one to other) ... to get a kind of cluster to get better
> performance ...

The poor performance may not necessarily be: i) attributable to the hardware or, ii) solved by clustering.

I would recommend determining *why* you got the slowdown. A few possible reasons are:

i) not vacuuming often enough, free space map settings too small.
ii) postgresql.conf settings very non-optimal.
iii) index and/or data design not optimal for PG.

My suspicions would start at iii). Other posters have pointed out that 250000000 records in itself is not necessarily a problem, so this sort of data size is manageable.

regards

Mark
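For anyone wanting to check points i) and iii) before buying new hardware, a minimal sketch of the kind of diagnostics being suggested (the table and column names here are hypothetical):

-- Per-table dead tuple report; a database-wide VACUUM VERBOSE also
-- summarizes free space map usage at the end of its output.
VACUUM VERBOSE ANALYZE directory_entries;

-- Check that the lookup done before each insert really uses an index;
-- a sequential scan over 150M rows would by itself explain a load
-- average of 10.
EXPLAIN ANALYZE
SELECT 1 FROM directory_entries WHERE entry_key = 'example-key';

If VACUUM reports millions of dead tuples, or EXPLAIN shows a sequential scan, that points at reasons i) or iii) rather than at the hardware.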
Merlin Moncure wrote:
> ...You need to build a bigger, faster box with lots of storage...
> Clustering ...
> B: will cost you more, not less

Is this still true when you get to 5-way or 17-way systems?

My (somewhat outdated) impression is that up to about 4-way systems they're price competitive; but beyond that, I thought multiple cheap servers scale much more affordably than large servers. Certainly at the point of a 129-CPU system I bet you're better off with a network of cheap servers.

> A: a headache

Agreed if you mean clustering as in making it look like one single database to the end user. However, in my experience a few years ago, if you can partition the data in a way managed by the application, it'll not only be less of a headache, but probably provide a more flexible solution.

Currently I'm working on a pretty big GIS database, and we're looking to partition our data in a manner similar to the Microsoft whitepaper on scaling TerraServer that can be found here:

http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2002-53

I think this paper is a very nice analysis of many aspects of larger-server&SAN vs. application-partitioned clusters, including looking at cost, reliability, manageability, etc. After reading that paper, we started very seriously looking into application-level partitioning.
Ron Mayer wrote: > http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2002-53 Wrong link... http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&id=812 This is the one that discusses scalability, price, performance, failover, power consumption, hardware components, etc. Bottom line was that the large server with SAN had $1877K hardware costs while the application-partitioned cluster had $110K hardware costs -- but it's apples-to-oranges since they were deployed in different years. Still a big advantage for the small systems.
I was thinking the same! I'd like to know how other databases such as Oracle do it.

-----Original Message-----
From: Mitch Pirtle
Sent: Thursday, January 20, 2005 4:42 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

On Thu, 20 Jan 2005 09:33:42 -0800, Darcy Buskermolen <darcy@wavefire.com> wrote:
> Another Option to consider would be pgmemcache. that way you just build the
> farm out of lots of large memory, diskless boxes for keeping the whole
> database in memory in the whole cluster. More information on it can be found
> at: http://people.freebsd.org/~seanc/pgmemcache/

Which brings up another question: why not just cluster at the hardware layer? Get an external fiberchannel array, and cluster a bunch of dual Opterons, all sharing that storage. In that sense you would be getting one big PostgreSQL 'image' running across all of the servers.

Or is that idea too 90's? ;-)

-- Mitch
Bruno,

> Which brings up another question: why not just cluster at the hardware
> layer? Get an external fiberchannel array, and cluster a bunch of dual
> Opterons, all sharing that storage. In that sense you would be getting
> one big PostgreSQL 'image' running across all of the servers.
>
> Or is that idea too 90's? ;-)

No, it just doesn't work. Multiple postmasters can't share one database.

LinuxLabs (as I've gathered) tried to go one better by using a tool that allows shared memory to bridge multiple networked servers -- in other words, one postmaster controlling 4 or 5 servers. The problem is that IPC via this method is about 1,000 times slower than IPC on a single machine, wiping out all of the scalability gains from having the cluster in the first place.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
On Thu, Jan 20, 2005 at 10:08:47AM -0500, Stephen Frost wrote: > * Christopher Kings-Lynne (chriskl@familyhealth.com.au) wrote: > > PostgreSQL has replication, but not partitioning (which is what you want). > > It doesn't have multi-server partitioning.. It's got partitioning > within a single server (doesn't it? I thought it did, I know it was > discussed w/ the guy from Cox Communications and I thought he was using > it :). No, PostgreSQL doesn't support any kind of partitioning, unless you write it yourself. I think there's some work being done in this area, though. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
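The "write it yourself" route usually means inheritance-based partitioning, which works today even though nothing automates it. A rough sketch, with made-up table and column names, and with the application (or a rule) responsible for picking the right child table on insert:

-- Parent table defines the schema; it normally holds no rows itself.
CREATE TABLE entries (
    id         bigint,
    created_on date,
    payload    text
);

-- One child per month, each with a CHECK constraint documenting its range.
CREATE TABLE entries_2005_01 (
    CHECK (created_on >= DATE '2005-01-01' AND created_on < DATE '2005-02-01')
) INHERITS (entries);

CREATE TABLE entries_2005_02 (
    CHECK (created_on >= DATE '2005-02-01' AND created_on < DATE '2005-03-01')
) INHERITS (entries);

-- Inserts go straight to the child for the current month ...
INSERT INTO entries_2005_01 VALUES (1, DATE '2005-01-21', 'some value');

-- ... while a SELECT on the parent scans every child. The planner does not
-- prune children using the CHECK constraints, so a query that can be
-- narrowed to one month should name that child directly.
SELECT count(*) FROM entries;

The children can live in different tablespaces, and old months can be dropped instantly instead of deleted and vacuumed, which is a large part of the appeal.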
On Thu, Jan 20, 2005 at 10:40:02PM -0200, Bruno Almeida do Lago wrote: > > I was thinking the same! I'd like to know how other databases such as Oracle > do it. > In a nutshell, in a clustered environment (which iirc in oracle means shared disks), they use a set of files for locking and consistency across machines. So you better have fast access to the drive array, and the array better have caching of some kind. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
> On January 20, 2005 06:49 am, Joshua D. Drake wrote: > > Stephen Frost wrote: > > >* Herv? Piedvache (herve@elma.fr) wrote: > > >>Le Jeudi 20 Janvier 2005 15:30, Stephen Frost a écrit : > > >>>* Herv? Piedvache (herve@elma.fr) wrote: > > >>>>Is there any solution with PostgreSQL matching these needs ... ? > > >>> > > >>>You might look into pg_pool. Another possibility would be slony, though > > >>>I'm not sure it's to the point you need it at yet, depends on if you can > > >>>handle some delay before an insert makes it to the slave select systems. > > >> > > >>I think not ... pgpool or slony are replication solutions ... but as I > > >> have said to Christopher Kings-Lynne how I'll manage the scalabilty of > > >> the database ? I'll need several servers able to load a database growing > > >> and growing to get good speed performance ... > > > > > >They're both replication solutions, but they also help distribute the > > >load. For example: > > > > > >pg_pool will distribute the select queries amoung the servers. They'll > > >all get the inserts, so that hurts, but at least the select queries are > > >distributed. > > > > > >slony is similar, but your application level does the load distribution > > >of select statements instead of pg_pool. Your application needs to know > > >to send insert statements to the 'main' server, and select from the > > >others. > > > > You can put pgpool in front of replicator or slony to get load > > balancing for reads. > > Last time I checked load ballanced reads was only available in pgpool if you > were using pgpools's internal replication. Has something changed recently? Yes. However it would be pretty easy to modify pgpool so that it could cope with Slony-I. I.e. 1) pgpool does the load balance and sends query to Slony-I's slave and master if the query is SELECT. 2) pgpool sends query only to the master if the query is other than SELECT. Remaining problem is that Slony-I is not a sync replication solution. Thus you need to prepare that the load balanced query results might differ among servers. If there's enough demand, I would do such that enhancements to pgpool. -- Tatsuo Ishii > > >>>>Is there any other solution than a Cluster for our problem ? > > >>> > > >>>Bigger server, more CPUs/disks in one box. Try to partition up your > > >>>data some way such that it can be spread across multiple machines, then > > >>>if you need to combine the data have it be replicated using slony to a > > >>>big box that has a view which joins all the tables and do your big > > >>>queries against that. > > >> > > >>But I'll arrive to limitation of a box size quickly I thing a 4 > > >> processors with 64 Gb of RAM ... and after ? > > > > Opteron. > > IBM Z-series, or other big iron. > > > > > >Go to non-x86 hardware after if you're going to continue to increase the > > >size of the server. Personally I think your better bet might be to > > >figure out a way to partition up your data (isn't that what google > > >does anyway?). > > > > > > Stephen > > -- > Darcy Buskermolen > Wavefire Technologies Corp. > ph: 250.717.0200 > fx: 250.763.1759 > http://www.wavefire.com > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >
On Thu, Jan 20, 2005 at 07:12:42AM -0800, Joshua D. Drake wrote: > > >>then I was thinking. Couldn't he use > >>multiple databases > >>over multiple servers with dblink? > >> > >>It is not exactly how I would want to do it, but it would provide what > >>he needs I think??? > >> > >> > > > >Yes seems to be the only solution ... but I'm a little disapointed about > >this ... could you explain me why there is not this kind of > >functionnality ... it seems to be a real need for big applications no ? > > > > > Because it is really, really hard to do correctly and hard > equals expensive. To expand on what Josh said, the expense in this case is development resources. If you look on the developer site you'll see a huge TODO list and a relatively small list of PostgreSQL developers. To develop a cluster solution similar to RAC would probably take the efforts of the entire development team for a year or more, during which time very little else would be done. I'm glad to see your persistance in wanting to use PostgreSQL, and there might be some kind of limited clustering scheme that could be implemented without a great amount of effort by the core developers. In that case I think there's a good chance you could find people willing to work on it. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Oracle's RAC is good, but I think it's best to view it as a step in the high availability direction rather than a performance enhancer. While it can help your application scale up, that depends on the usage pattern. Also it's not 100% transparent to the application for example you can't depend on a sequence numbers being allocated uniquely as there can be delays propagating them to all nodes. So in clusters where insert rates are high this means you should explicitly check for unique key violations and try again. Dealing with propagation delays comes with the clustering technology I guess. Nonetheless, I would love to see this kind of functionality in postgres. Regards Iain ----- Original Message ----- From: "Jim C. Nasby" <decibel@decibel.org> To: "Bruno Almeida do Lago" <teolupus@gmail.com> Cc: "'Mitch Pirtle'" <mitch.pirtle@gmail.com>; <pgsql-performance@postgresql.org> Sent: Friday, January 21, 2005 10:30 AM Subject: Re: [PERFORM] PostgreSQL clustering VS MySQL clustering > On Thu, Jan 20, 2005 at 10:40:02PM -0200, Bruno Almeida do Lago wrote: >> >> I was thinking the same! I'd like to know how other databases such as >> Oracle >> do it. >> > In a nutshell, in a clustered environment (which iirc in oracle means > shared disks), they use a set of files for locking and consistency > across machines. So you better have fast access to the drive array, and > the array better have caching of some kind. > -- > Jim C. Nasby, Database Consultant decibel@decibel.org > Give your computer some brain candy! www.distributed.net Team #1828 > > Windows: "Where do you want to go today?" > Linux: "Where do you want to go tomorrow?" > FreeBSD: "Are you guys coming, or what?" > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
>1) pgpool does the load balance and sends query to Slony-I's slave and > master if the query is SELECT. > >2) pgpool sends query only to the master if the query is other than > SELECT. > >Remaining problem is that Slony-I is not a sync replication >solution. Thus you need to prepare that the load balanced query >results might differ among servers. > >If there's enough demand, I would do such that enhancements to pgpool. > > Well I know that Replicator could also use this functionality. Sincerely, Joshua D. Drake >-- >Tatsuo Ishii > > > >>>>>>>Is there any other solution than a Cluster for our problem ? >>>>>>> >>>>>>> >>>>>>Bigger server, more CPUs/disks in one box. Try to partition up your >>>>>>data some way such that it can be spread across multiple machines, then >>>>>>if you need to combine the data have it be replicated using slony to a >>>>>>big box that has a view which joins all the tables and do your big >>>>>>queries against that. >>>>>> >>>>>> >>>>>But I'll arrive to limitation of a box size quickly I thing a 4 >>>>>processors with 64 Gb of RAM ... and after ? >>>>> >>>>> >>>Opteron. >>> >>> >>IBM Z-series, or other big iron. >> >> >> >>>>Go to non-x86 hardware after if you're going to continue to increase the >>>>size of the server. Personally I think your better bet might be to >>>>figure out a way to partition up your data (isn't that what google >>>>does anyway?). >>>> >>>> Stephen >>>> >>>> >>-- >>Darcy Buskermolen >>Wavefire Technologies Corp. >>ph: 250.717.0200 >>fx: 250.763.1759 >>http://www.wavefire.com >> >>---------------------------(end of broadcast)--------------------------- >>TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >> >> >> > >---------------------------(end of broadcast)--------------------------- >TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) > > -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
Tatsuo, > Yes. However it would be pretty easy to modify pgpool so that it could > cope with Slony-I. I.e. > > 1) pgpool does the load balance and sends query to Slony-I's slave and > master if the query is SELECT. > > 2) pgpool sends query only to the master if the query is other than > SELECT. > > Remaining problem is that Slony-I is not a sync replication > solution. Thus you need to prepare that the load balanced query > results might differ among servers. Yes, please, some of us are already doing the above ad-hoc. The simple workaround to replication lag is to calculate the longest likely lag (<3 seconds if Slony is tuned right) and have the dispatcher (pgpool) send all requests from that connection to the master for that period. Then it switches back to "pool" mode where the slaves may be used. Of course, all of the above is only useful if you're doing a web app where 96% of query activity is selects. For additional scalability, put all of your session maintenance in memcached, so that you're not doing database writes every time a page loads. -- Josh Berkus Aglio Database Solutions San Francisco
I have no experience with PGCluster, but I found this: "PGCluster is a multi-master and synchronous replication system that supports load balancing of PostgreSQL."

http://www.software-facilities.com/databases-software/pgcluster.php

Maybe somebody has some experience with this tool?

----- Original Message -----
From: "Christopher Kings-Lynne" <chriskl@familyhealth.com.au>
Sent: Thursday, January 20, 2005 4:58 PM
Subject: Re: [PERFORM] PostgreSQL clustering VS MySQL clustering

>>> Or you could fork over hundreds of thousands of dollars for Oracle's
>>> RAC.
>>
>> No please do not talk about this again ... I'm looking about a PostgreSQL
>> solution ... I know RAC ... and I'm not able to pay for a RAC certify
>> hardware configuration plus a RAC Licence.
>
> There is absolutely zero PostgreSQL solution...
>
> You may have to split the data yourself onto two independent db servers
> and combine the results somehow in your application.
>
> Chris
This idea won't work with PostgreSQL: only one instance can operate on a datastore at a time.
Dave
Bruno Almeida do Lago wrote:
> I was thinking the same! I'd like to know how other databases such as
> Oracle do it.
>
> -----Original Message-----
> From: Mitch Pirtle
> Sent: Thursday, January 20, 2005 4:42 PM
> Subject: Re: [PERFORM] PostgreSQL clustering VS MySQL clustering
>
> On Thu, 20 Jan 2005 09:33:42 -0800, Darcy Buskermolen <darcy@wavefire.com> wrote:
> > Another Option to consider would be pgmemcache. that way you just build the
> > farm out of lots of large memory, diskless boxes for keeping the whole
> > database in memory in the whole cluster. More information on it can be found
> > at: http://people.freebsd.org/~seanc/pgmemcache/
>
> Which brings up another question: why not just cluster at the hardware
> layer? Get an external fiberchannel array, and cluster a bunch of dual
> Opterons, all sharing that storage. In that sense you would be getting
> one big PostgreSQL 'image' running across all of the servers.
>
> Or is that idea too 90's? ;-)
>
> -- Mitch
> Tatsuo, > > > Yes. However it would be pretty easy to modify pgpool so that it could > > cope with Slony-I. I.e. > > > > 1) pgpool does the load balance and sends query to Slony-I's slave and > > master if the query is SELECT. > > > > 2) pgpool sends query only to the master if the query is other than > > SELECT. > > > > Remaining problem is that Slony-I is not a sync replication > > solution. Thus you need to prepare that the load balanced query > > results might differ among servers. > > Yes, please, some of us are already doing the above ad-hoc. > > The simple workaround to replication lag is to calculate the longest likely > lag (<3 seconds if Slony is tuned right) and have the dispatcher (pgpool) > send all requests from that connection to the master for that period. Then > it switches back to "pool" mode where the slaves may be used. Can I ask a question? Suppose table A gets updated on the master at time 00:00. Until 00:03 pgpool needs to send all queries regarding A to the master only. My question is, how can pgpool know a query is related to A? -- Tatsuo Ishii > Of course, all of the above is only useful if you're doing a web app where 96% > of query activity is selects. For additional scalability, put all of your > session maintenance in memcached, so that you're not doing database writes > every time a page loads. > > -- > Josh Berkus > Aglio Database Solutions > San Francisco >
Presumably it can't _ever_ know without being explicitly told, because even for a plain SELECT there might be triggers involved that update tables, or it might be a select of a stored proc, etc. So in the general case, you can't assume that a select doesn't cause an update, and you can't be sure that the table list in an update is a complete list of the tables that might be updated. Tatsuo Ishii wrote: >Can I ask a question? > >Suppose table A gets updated on the master at time 00:00. Until 00:03 >pgpool needs to send all queries regarding A to the master only. My >question is, how can pgpool know a query is related to A? >-- >Tatsuo Ishii > > >
Matt Clark wrote: > Presumably it can't _ever_ know without being explicitly told, because > even for a plain SELECT there might be triggers involved that update > tables, or it might be a select of a stored proc, etc. So in the > general case, you can't assume that a select doesn't cause an update, > and you can't be sure that the table list in an update is a complete > list of the tables that might be updated. Uhmmm no :) There is no such thing as a select trigger. The closest you would get is a function that is called via select which could be detected by making sure you are prepending with a BEGIN or START Transaction. Thus yes pgPool can be made to do this. Sincerely, Joshua D. Drake > > > > Tatsuo Ishii wrote: > >> Can I ask a question? >> >> Suppose table A gets updated on the master at time 00:00. Until 00:03 >> pgpool needs to send all queries regarding A to the master only. My >> question is, how can pgpool know a query is related to A? >> -- >> Tatsuo Ishii >> >> >> > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
Joshua D. Drake wrote: > Matt Clark wrote: > >> Presumably it can't _ever_ know without being explicitly told, because >> even for a plain SELECT there might be triggers involved that update >> tables, or it might be a select of a stored proc, etc. So in the >> general case, you can't assume that a select doesn't cause an update, >> and you can't be sure that the table list in an update is a complete >> list of the tables that might be updated. > > > Uhmmm no :) There is no such thing as a select trigger. The closest you > would get > is a function that is called via select which could be detected by > making sure > you are prepending with a BEGIN or START Transaction. Thus yes pgPool > can be made > to do this. SELECT SETVAL() is another case. I'd really love to see pgpool do this. I am also curious about Slony-II development, Tom mentioned a first meeting about it :) Regards, Bjoern
> Uhmmm no :) There is no such thing as a select trigger. The closest you > would get > is a function that is called via select which could be detected by > making sure > you are prepending with a BEGIN or START Transaction. Thus yes pgPool > can be made > to do this. Technically, you can also set up a rule to do things on a select with DO ALSO. However putting update statements in there would be considered (at least by me) very bad form. Note that this is not a trigger because it does not operate at the row level [I know you knew that already :-)]. Merlin
Yes, I wasn't really choosing my examples particularly carefully, but I think the conclusion stands: pgpool (or anyone/thing except for the server) cannot in general tell from the SQL it is handed by the client whether an update will occur, nor which tables might be affected. That's not to say that pgpool couldn't make a good guess in the majority of cases! M Joshua D. Drake wrote: > Matt Clark wrote: > >> Presumably it can't _ever_ know without being explicitly told, >> because even for a plain SELECT there might be triggers involved that >> update tables, or it might be a select of a stored proc, etc. So in >> the general case, you can't assume that a select doesn't cause an >> update, and you can't be sure that the table list in an update is a >> complete list of the tables that might be updated. > > > Uhmmm no :) There is no such thing as a select trigger. The closest > you would get > is a function that is called via select which could be detected by > making sure > you are prepending with a BEGIN or START Transaction. Thus yes pgPool > can be made > to do this. > > Sincerely, > > Joshua D. Drake > > >
Tatsuo, > Suppose table A gets updated on the master at time 00:00. Until 00:03 > pgpool needs to send all queries regarding A to the master only. My > question is, how can pgpool know a query is related to A? Well, I'm a little late to head off tangental discussion about this, but .... The systems where I've implemented something similar are for web applications. In the case of the web app, you don't care if a most users see data which is 2 seconds out of date; with caching and whatnot, it's often much more than that! The one case where it's not permissable for a user to see "old" data is the case where the user is updating the data. Namely: (1) 00:00 User A updates "My Profile" (2) 00:01 "My Profile" UPDATE finishes executing. (3) 00:02 User A sees "My Profile" re-displayed (6) 00:04 "My Profile":UserA cascades to the last Slave server So in an application like the above, it would be a real problem if User A were to get switched over to a slave server immediately after the update; she would see the old data, assume that her update was not saved, and update again. Or send angry e-mails to webmaster@. However, it makes no difference what User B sees: (1) 00:00 User A updates "My Profile"v1 Master (2) 00:01 "My Profile" UPDATE finishes executing. Master (3) 00:02 User A sees "My Profile"v2 displayed Master (4) 00:02 User B requests "MyProfile":UserA Slave2 (5) 00:03 User B sees "My Profile"v1 Slave2 (6) 00:04 "My Profile"v2 cascades to the last Slave server Slave2 If the web application is structured properly, the fact that UserB is seeing UserA's information which is 2 seconds old is not a problem (though it might be for web auctions, where it could result in race conditions. Consider memcached as a helper). This means that pgPool only needs to monitor "update switching" by *connection* not by *table*. Make sense? -- Josh Berkus Aglio Database Solutions San Francisco
This is probably a lot easier than you would think. You say that your DB will have lots of data, lots of updates and lots of reads. Very likely the disk bottleneck is mostly index reads and writes, with some critical WAL fsync() calls. In the grand scheme of things, the actual data is likely not accessed very often. The indexes can be put on a RAM disk tablespace and that's the end of index problems -- just make sure you have enough memory available. Also make sure that the machine can restart correctly after a crash: the tablespace is dropped and recreated, along with the indexes. This will cause a machine restart to take some time. After that, if the WAL fsync() calls are becoming a problem, put the WAL files on a fast RAID array, etiher a card or external enclosure, that has a good amount of battery-backed write cache. This way, the WAL fsync() calls will flush quickly to the RAM and Pg can move on while the RAID controller worries about putting the data to disk. With WAL, low access time is usually more important than total throughput. The truth is that you could have this running for not much money. Good Luck, Marty > Le Jeudi 20 Janvier 2005 19:09, Bruno Almeida do Lago a écrit : > > Could you explain us what do you have in mind for that solution? I mean, > > forget the PostgreSQL (or any other database) restrictions and > explain us > > how this hardware would be. Where the data would be stored? > > > > I've something in mind for you, but first I need to understand your > needs! > > I just want to make a big database as explained in my first mail ... At the > beginning we will have aprox. 150 000 000 records ... each month we will > add > about 4/8 millions new rows in constant flow during the day ... and in same > time web users will access to the database in order to read those data. > Stored data are quite close to data stored by google ... (we are not > making a > google clone ... just a lot of data many small values and some big ones ... > that's why I'm comparing with google for data storage). > Then we will have a search engine searching into those data ... > > Dealing about the hardware, for the moment we have only a bi-pentium Xeon > 2.8Ghz with 4 Gb of RAM ... and we saw we had bad performance results > ... so > we are thinking about a new solution with maybe several servers (server > design may vary from one to other) ... to get a kind of cluster to get > better > performance ... > > Am I clear ? > > Regards,
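As a concrete illustration of the tablespace half of this suggestion (8.0 syntax; the mount point, table and index names are all made up, and anything placed on a RAM disk is gone after a reboot):

-- The directory must already exist and be writable by the postgres user.
CREATE TABLESPACE ram_indexes LOCATION '/mnt/ramdisk/pg_indexes';

-- Build (or rebuild) the hot index inside the RAM-backed tablespace.
CREATE INDEX directory_entries_key_idx
    ON directory_entries (entry_key)
    TABLESPACE ram_indexes;

As Marty notes, the restart procedure after a crash then has to drop and recreate the tablespace and the indexes inside it before normal service resumes.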
> > Technically, you can also set up a rule to do things on a select with DO
> > ALSO. However putting update statements in there would be considered (at
> > least by me) very bad form. Note that this is not a trigger because it
> > does not operate at the row level [I know you knew that already :-)].
>
> Unfortunately, you can't. Select operations only allow a single rule,
> and it must be a DO INSTEAD rule, unless this has changed in 8.0 and I
> missed it in the docs. However, you can do this in a view by calling a
> function either in the row definition or in the where clause.

You're right ... I forgot about that. Heh, the DO INSTEAD rule could be a set-returning function which could (besides returning the set) do almost anything! So in theory it makes no difference ... disclaimer: never tried doing this!

Merlin
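To make the view point concrete, here is a small sketch (every object name in it is invented) of a view whose row definition calls a function with a side effect; it is exactly the kind of statement a SELECT-prefix check cannot classify correctly:

CREATE TABLE profile (
    id   integer PRIMARY KEY,
    name text
);

CREATE TABLE profile_read_log (
    profile_id integer,
    read_at    timestamptz DEFAULT now()
);

-- A SQL function that records the access and hands the id back.
CREATE FUNCTION log_profile_read(integer) RETURNS integer AS '
    INSERT INTO profile_read_log (profile_id) VALUES ($1);
    SELECT $1;
' LANGUAGE sql VOLATILE;

CREATE VIEW profile_with_audit AS
    SELECT id, name, log_profile_read(id) AS audited
    FROM profile;

-- Looks like a pure read, but writes one log row per profile returned.
SELECT * FROM profile_with_audit;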
Tatsuo, What would happen with SELECT queries that, through a function or some other mechanism, updates data in the database? Would those need to be passed to pgpool in some special way? Thanks, Peter Darley -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Tatsuo Ishii Sent: Thursday, January 20, 2005 5:40 PM To: darcy@wavefire.com Cc: jd@www.commandprompt.com; sfrost@snowman.net; herve@elma.fr; pgsql-performance@postgresql.org Subject: Re: [PERFORM] PostgreSQL clustering VS MySQL clustering > On January 20, 2005 06:49 am, Joshua D. Drake wrote: > > Stephen Frost wrote: > > >* Herv? Piedvache (herve@elma.fr) wrote: > > >>Le Jeudi 20 Janvier 2005 15:30, Stephen Frost a écrit : > > >>>* Herv? Piedvache (herve@elma.fr) wrote: > > >>>>Is there any solution with PostgreSQL matching these needs ... ? > > >>> > > >>>You might look into pg_pool. Another possibility would be slony, though > > >>>I'm not sure it's to the point you need it at yet, depends on if you can > > >>>handle some delay before an insert makes it to the slave select systems. > > >> > > >>I think not ... pgpool or slony are replication solutions ... but as I > > >> have said to Christopher Kings-Lynne how I'll manage the scalabilty of > > >> the database ? I'll need several servers able to load a database growing > > >> and growing to get good speed performance ... > > > > > >They're both replication solutions, but they also help distribute the > > >load. For example: > > > > > >pg_pool will distribute the select queries amoung the servers. They'll > > >all get the inserts, so that hurts, but at least the select queries are > > >distributed. > > > > > >slony is similar, but your application level does the load distribution > > >of select statements instead of pg_pool. Your application needs to know > > >to send insert statements to the 'main' server, and select from the > > >others. > > > > You can put pgpool in front of replicator or slony to get load > > balancing for reads. > > Last time I checked load ballanced reads was only available in pgpool if you > were using pgpools's internal replication. Has something changed recently? Yes. However it would be pretty easy to modify pgpool so that it could cope with Slony-I. I.e. 1) pgpool does the load balance and sends query to Slony-I's slave and master if the query is SELECT. 2) pgpool sends query only to the master if the query is other than SELECT. Remaining problem is that Slony-I is not a sync replication solution. Thus you need to prepare that the load balanced query results might differ among servers. If there's enough demand, I would do such that enhancements to pgpool. -- Tatsuo Ishii > > >>>>Is there any other solution than a Cluster for our problem ? > > >>> > > >>>Bigger server, more CPUs/disks in one box. Try to partition up your > > >>>data some way such that it can be spread across multiple machines, then > > >>>if you need to combine the data have it be replicated using slony to a > > >>>big box that has a view which joins all the tables and do your big > > >>>queries against that. > > >> > > >>But I'll arrive to limitation of a box size quickly I thing a 4 > > >> processors with 64 Gb of RAM ... and after ? > > > > Opteron. > > IBM Z-series, or other big iron. > > > > > >Go to non-x86 hardware after if you're going to continue to increase the > > >size of the server. 
Personally I think your better bet might be to > > >figure out a way to partition up your data (isn't that what google > > >does anyway?). > > > > > > Stephen > > -- > Darcy Buskermolen > Wavefire Technologies Corp. > ph: 250.717.0200 > fx: 250.763.1759 > http://www.wavefire.com > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
Peter, Tatsuo: would happen with SELECT queries that, through a function or some > other mechanism, updates data in the database? Would those need to be > passed to pgpool in some special way? Oh, yes, that reminds me. It would be helpful if pgPool accepted a control string ... perhaps one in a SQL comment ... which indicated that the statement to follow was, despite appearances, an update. For example: --STATEMENT_IS_UPDATE\n The alternative is, of course, that pgPool direct all explicit transactions to the master ... which is a good idea anyway. So you could do: BEGIN; SELECT some_update_function(); COMMIT; -- --Josh Josh Berkus Aglio Database Solutions San Francisco
> Tatsuo, > > > Suppose table A gets updated on the master at time 00:00. Until 00:03 > > pgpool needs to send all queries regarding A to the master only. My > > question is, how can pgpool know a query is related to A? > > Well, I'm a little late to head off tangental discussion about this, but .... > > The systems where I've implemented something similar are for web applications. > In the case of the web app, you don't care if a most users see data which is > 2 seconds out of date; with caching and whatnot, it's often much more than > that! > > The one case where it's not permissable for a user to see "old" data is the > case where the user is updating the data. Namely: > > (1) 00:00 User A updates "My Profile" > (2) 00:01 "My Profile" UPDATE finishes executing. > (3) 00:02 User A sees "My Profile" re-displayed > (6) 00:04 "My Profile":UserA cascades to the last Slave server > > So in an application like the above, it would be a real problem if User A were > to get switched over to a slave server immediately after the update; she > would see the old data, assume that her update was not saved, and update > again. Or send angry e-mails to webmaster@. > > However, it makes no difference what User B sees: > > (1) 00:00 User A updates "My Profile"v1 Master > (2) 00:01 "My Profile" UPDATE finishes executing. Master > (3) 00:02 User A sees "My Profile"v2 displayed Master > (4) 00:02 User B requests "MyProfile":UserA Slave2 > (5) 00:03 User B sees "My Profile"v1 Slave2 > (6) 00:04 "My Profile"v2 cascades to the last Slave server Slave2 > > If the web application is structured properly, the fact that UserB is seeing > UserA's information which is 2 seconds old is not a problem (though it might > be for web auctions, where it could result in race conditions. Consider > memcached as a helper). This means that pgPool only needs to monitor > "update switching" by *connection* not by *table*. > > Make sense? I'm not clear what "pgPool only needs to monitor "update switching" by *connection* not by *table*" means. In your example: > (1) 00:00 User A updates "My Profile" > (2) 00:01 "My Profile" UPDATE finishes executing. > (3) 00:02 User A sees "My Profile" re-displayed > (6) 00:04 "My Profile":UserA cascades to the last Slave server I think (2) and (3) are on different connections, thus pgpool cannot judge if SELECT in (3) should go only to the master or not. To solve the problem you need to make pgpool understand "web sessions" not "database connections" and it seems impossible for pgpool to understand "sessions". -- Tatsuo Ishii
IMO the bottle neck is not WAL but table/index bloat. Lots of updates on large tables will produce lots of dead tuples. Problem is, There' is no effective way to reuse these dead tuples since VACUUM on huge tables takes longer time. 8.0 adds new vacuum delay paramters. Unfortunately this does not help. It just make the execution time of VACUUM longer, that means more and more dead tuples are being made while updating. Probably VACUUM works well for small to medium size tables, but not for huge ones. I'm considering about to implement "on the spot salvaging dead tuples". -- Tatsuo Ishii > This is probably a lot easier than you would think. You say that your > DB will have lots of data, lots of updates and lots of reads. > > Very likely the disk bottleneck is mostly index reads and writes, with > some critical WAL fsync() calls. In the grand scheme of things, the > actual data is likely not accessed very often. > > The indexes can be put on a RAM disk tablespace and that's the end of > index problems -- just make sure you have enough memory available. Also > make sure that the machine can restart correctly after a crash: the > tablespace is dropped and recreated, along with the indexes. This will > cause a machine restart to take some time. > > After that, if the WAL fsync() calls are becoming a problem, put the WAL > files on a fast RAID array, etiher a card or external enclosure, that > has a good amount of battery-backed write cache. This way, the WAL > fsync() calls will flush quickly to the RAM and Pg can move on while the > RAID controller worries about putting the data to disk. With WAL, low > access time is usually more important than total throughput. > > The truth is that you could have this running for not much money. > > Good Luck, > Marty > > > Le Jeudi 20 Janvier 2005 19:09, Bruno Almeida do Lago a écrit : > > > Could you explain us what do you have in mind for that solution? I mean, > > > forget the PostgreSQL (or any other database) restrictions and > > explain us > > > how this hardware would be. Where the data would be stored? > > > > > > I've something in mind for you, but first I need to understand your > > needs! > > > > I just want to make a big database as explained in my first mail ... At the > > beginning we will have aprox. 150 000 000 records ... each month we will > > add > > about 4/8 millions new rows in constant flow during the day ... and in same > > time web users will access to the database in order to read those data. > > Stored data are quite close to data stored by google ... (we are not > > making a > > google clone ... just a lot of data many small values and some big ones ... > > that's why I'm comparing with google for data storage). > > Then we will have a search engine searching into those data ... > > > > Dealing about the hardware, for the moment we have only a bi-pentium Xeon > > 2.8Ghz with 4 Gb of RAM ... and we saw we had bad performance results > > ... so > > we are thinking about a new solution with maybe several servers (server > > design may vary from one to other) ... to get a kind of cluster to get > > better > > performance ... > > > > Am I clear ? > > > > Regards, > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >
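For reference, these are the 8.0 vacuum delay knobs in question; the numbers below are only illustrative and the table name is made up:

-- Throttle this session's VACUUM: sleep vacuum_cost_delay milliseconds
-- each time roughly vacuum_cost_limit cost units of page accesses have
-- accumulated. The I/O is smoothed out, but the VACUUM takes
-- correspondingly longer to finish, which is the trade-off described above.
SET vacuum_cost_delay = 10;
SET vacuum_cost_limit = 200;

VACUUM ANALYZE hot_table;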
> Peter, Tatsuo:
>
> > would happen with SELECT queries that, through a function or some
> > other mechanism, updates data in the database? Would those need to be
> > passed to pgpool in some special way?
>
> Oh, yes, that reminds me. It would be helpful if pgPool accepted a control
> string ... perhaps one in a SQL comment ... which indicated that the
> statement to follow was, despite appearances, an update. For example:
> --STATEMENT_IS_UPDATE\n

Actually, the way pgpool judges whether a query is a "pure" SELECT or not is very simple: pgpool just checks if the SQL statement exactly begins with "SELECT" (case insensitive, of course). So, for example, you could insert an SQL comment such as "/* this SELECT has side effects */" at the beginning of the line to indicate that pgpool should not send the query to a slave.

> The alternative is, of course, that pgPool direct all explicit transactions to
> the master ... which is a good idea anyway. So you could do:
>
> BEGIN;
> SELECT some_update_function();
> COMMIT;

Yes. pgpool already does this for load balancing. Expanding this for Slony-I is pretty easy.

--
Tatsuo Ishii
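Put concretely (the function name is invented), the two ways to keep a side-effecting select on the master under that prefix check look like this:

-- The statement no longer begins with the literal word SELECT, so the
-- prefix check will not route it to a slave.
/* this SELECT has side effects */ SELECT refresh_profile_cache(42);

-- Or rely on explicit transaction blocks, which go to the master anyway.
BEGIN;
SELECT refresh_profile_cache(42);
COMMIT;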
On Sat, 22 Jan 2005 12:13:00 +0900 (JST), Tatsuo Ishii <t-ishii@sra.co.jp> wrote:
> IMO the bottle neck is not WAL but table/index bloat. Lots of updates
> on large tables will produce lots of dead tuples. Problem is, There'
> is no effective way to reuse these dead tuples since VACUUM on huge
> tables takes longer time. 8.0 adds new vacuum delay
> paramters. Unfortunately this does not help. It just make the
> execution time of VACUUM longer, that means more and more dead tuples
> are being made while updating.
>
> Probably VACUUM works well for small to medium size tables, but not
> for huge ones. I'm considering about to implement "on the spot
> salvaging dead tuples".

Quick thought -- would it be possible to implement a 'partial VACUUM', by analogy with partial indexes? It would then be possible to do:

VACUUM footable WHERE footime < current_date - 60;

after a statement to DELETE all/some rows older than 60 days. The VACUUM would check visibility of columns which are mentioned in an index (in this case: footable_footime_index ;)).

Of course it is not a great solution, but it could be great for doing housecleaning after a large update/delete in a known range. ...and it should be relatively simple to implement, I guess (maybe without the 'ANALYZE' part).

Regards,
Dawid
Dawid Kuroczko <qnex42@gmail.com> writes: > Quick thought -- would it be to possible to implement a 'partial VACUUM' > per analogiam to partial indexes? No. But it gave me another idea. Perhaps equally infeasible, but I don't see why. What if there were a map of modified pages. So every time any tuple was marked deleted it could be marked in the map as modified. VACUUM would only have to look at these pages. And if it could mark as free every tuple that was marked as deleted then it could unmark the page. The only downside I see is that this could be a source of contention on multi-processor machines running lots of concurrent update/deletes. -- greg
On Sat, Jan 22, 2005 at 12:13:00 +0900, Tatsuo Ishii <t-ishii@sra.co.jp> wrote: > > Probably VACUUM works well for small to medium size tables, but not > for huge ones. I'm considering about to implement "on the spot > salvaging dead tuples". You are probably vacuuming too often. You want to wait until a significant fraction of a large table is dead tuples before doing a vacuum. If you are scanning a large table and only marking a few tuples as deleted, you aren't getting much bang for your buck.
On Sat, 2005-01-22 at 12:41 -0600, Bruno Wolff III wrote: > On Sat, Jan 22, 2005 at 12:13:00 +0900, > Tatsuo Ishii <t-ishii@sra.co.jp> wrote: > > > > Probably VACUUM works well for small to medium size tables, but not > > for huge ones. I'm considering about to implement "on the spot > > salvaging dead tuples". > > You are probably vacuuming too often. You want to wait until a significant > fraction of a large table is dead tuples before doing a vacuum. If you are > scanning a large table and only marking a few tuples as deleted, you aren't > getting much bang for your buck. The big problem occurs when you have a small set of hot tuples within a large table. In the time it takes to vacuum a table with 200M tuples one can update a small subset of that table many many times. Some special purpose vacuum which can target hot spots would be great, but I've always assumed this would come in the form of table partitioning and the ability to vacuum different partitions independently of each-other. --
From http://developer.postgresql.org/todo.php: Maintain a map of recently-expired rows This allows vacuum to reclaim free space without requiring a sequential scan On Sat, Jan 22, 2005 at 12:20:53PM -0500, Greg Stark wrote: > Dawid Kuroczko <qnex42@gmail.com> writes: > > > Quick thought -- would it be to possible to implement a 'partial VACUUM' > > per analogiam to partial indexes? > > No. > > But it gave me another idea. Perhaps equally infeasible, but I don't see why. > > What if there were a map of modified pages. So every time any tuple was marked > deleted it could be marked in the map as modified. VACUUM would only have to > look at these pages. And if it could mark as free every tuple that was marked > as deleted then it could unmark the page. > > The only downside I see is that this could be a source of contention on > multi-processor machines running lots of concurrent update/deletes. > > -- > greg > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Tatsuo Ishii <t-ishii@sra.co.jp> writes: > Probably VACUUM works well for small to medium size tables, but not > for huge ones. I'm considering about to implement "on the spot > salvaging dead tuples". That's impossible on its face, except for the special case where the same transaction inserts and deletes a tuple. In all other cases, the transaction deleting a tuple cannot know whether it will commit. regards, tom lane
In an attempt to throw the authorities off his trail, herve@elma.fr (Hervé Piedvache) transmitted: > Le Jeudi 20 Janvier 2005 15:24, Christopher Kings-Lynne a écrit : >> > Is there any solution with PostgreSQL matching these needs ... ? >> >> You want: http://www.slony.info/ >> >> > Do we have to backport our development to MySQL for this kind of problem >> > ? Is there any other solution than a Cluster for our problem ? >> >> Well, Slony does replication which is basically what you want :) >> >> Only master->slave though, so you will need to have all inserts go via >> the master server, but selects can come off any server. > > Sorry but I don't agree with this ... Slony is a replication > solution ... I don't need replication ... what will I do when my > database will grow up to 50 Gb ... I'll need more than 50 Gb of RAM > on each server ??? This solution is not very realistic for me ... Huh? Why on earth do you imagine that Slony-I requires a lot of memory? It doesn't. A fairly _large_ Slony-I process is about 10MB. There will be some demand for memory on the DB servers, but you don't need an enormous quantity of extra memory to run it. There is a MySQL "replicating/clustering" system that uses an in-memory database which means that if your DB is 50GB in size, you need something like 200GB of RAM. If you're thinking of that, that's not relevant to PostgreSQL or Slony-I... > I need a Cluster solution not a replication one or explain me in > details how I will do for managing the scalabilty of my database ... I'm not sure you understand clustering if you imagine it doesn't involve replication. There are numerous models for clustering, much as there are numerous RAID models. But the only sorts of clustering cases where you get to NOT do replication are the cases where all you're looking for from clustering is improved speed, and you're willing for any breakage on any host to potentially destroy your cluster. Perhaps you need to describe what you _think_ you mean by a "cluster solution." It may be that it'll take further thought to determine what you actually need... -- output = ("cbbrowne" "@" "gmail.com") http://www3.sympatico.ca/cbbrowne/postgresql.html "Not me, guy. I read the Bash man page each day like a Jehovah's Witness reads the Bible. No wait, the Bash man page IS the bible. Excuse me..." (More on confusing aliases, taken from comp.os.linux.misc)
In the last exciting episode, herve@elma.fr (Hervé Piedvache) wrote: > Le Jeudi 20 Janvier 2005 16:05, Joshua D. Drake a écrit : >> Christopher Kings-Lynne wrote: >> >>> Or you could fork over hundreds of thousands of dollars for Oracle's >> >>> RAC. >> >> >> >> No please do not talk about this again ... I'm looking about a >> >> PostgreSQL solution ... I know RAC ... and I'm not able to pay for a >> >> RAC certify hardware configuration plus a RAC Licence. >> > >> > There is absolutely zero PostgreSQL solution... >> >> I just replied the same thing but then I was thinking. Couldn't he use >> multiple databases >> over multiple servers with dblink? >> >> It is not exactly how I would want to do it, but it would provide what >> he needs I think??? > > Yes seems to be the only solution ... but I'm a little disapointed about > this ... could you explain me why there is not this kind of > functionnality ... it seems to be a real need for big applications no ? If this is what you actually need, well, it's something that lots of people would sort of like to have, but it's really DIFFICULT to implement it. Partitioning data onto different servers appears like it ought to be a good idea. Unfortunately, getting _exactly_ the right semantics is hard. If the data is all truly independent, then it's no big deal; just have one server for one set of data, and another for the other. But reality normally is that if you _think_ you need a cluster, that's because some of the data needs to be _shared_, which means you need to either: a) Have queries that run across two databases, or b) Replicate the shared data between the systems. We're likely back to the need for replication. -- If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me http://www3.sympatico.ca/cbbrowne/rdbms.html "It is the user who should parameterize procedures, not their creators." -- Alan Perlis
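For completeness, the dblink idea quoted above (contrib/dblink installed on the querying node; the connection string, table and column names are all invented) amounts to queries like this:

-- Combine a local partition with one held on a second server. dblink()
-- returns SETOF record, so the remote column list must be spelled out.
SELECT id, payload
  FROM entries_part1
UNION ALL
SELECT *
  FROM dblink('host=server2 dbname=directory user=app',
              'SELECT id, payload FROM entries_part2')
       AS remote(id bigint, payload text);

It works, but every cross-server join or aggregate has to be written by hand this way, which is the "queries that run across two databases" headache described above.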
Quoth Ron Mayer <rm_pg@cheapcomplexdevices.com>: > Merlin Moncure wrote: >> ...You need to build a bigger, faster box with lots of storage... >> Clustering ... B: will cost you more, not less > > > Is this still true when you get to 5-way or 17-way systems? > > My (somewhat outdated) impression is that up to about 4-way systems > they're price competitive; but beyond that, I thought multiple cheap > servers scales much more afordably than large servers. Certainly > at the point of a 129-CPU system I bet you're better off with a > network of cheap servers. Not necessarily. If you have 129 boxes that you're trying to keep synced, it is likely that the cost of syncing them will be greater than the other write load. If the problem being addressed is that a 4-way box won't handle the transaction load, it is unlikely that building a cluster of _smaller_ machines will help terribly much. The reason to "cluster" in the context of a transactional system is that you need improved _reliability_. Since communications between servers is _thousands_ of times slower than communicating with local memory, you have to be willing to live with an ENORMOUS degradation of performance when hosts are synchronized. And if "real estate" has a cost, where you have to pay for rack space, having _fewer_ machines is preferable to having more. -- output = ("cbbrowne" "@" "gmail.com") http://www.ntlug.org/~cbbrowne/postgresql.html If con is the opposite of pro, is Congress the opposite of progress?
A long time ago, in a galaxy far, far away, gsstark@mit.edu (Greg Stark) wrote: > Dawid Kuroczko <qnex42@gmail.com> writes: > >> Quick thought -- would it be to possible to implement a 'partial VACUUM' >> per analogiam to partial indexes? > > No. > > But it gave me another idea. Perhaps equally infeasible, but I don't see why. > > What if there were a map of modified pages. So every time any tuple > was marked deleted it could be marked in the map as modified. VACUUM > would only have to look at these pages. And if it could mark as free > every tuple that was marked as deleted then it could unmark the > page. > > The only downside I see is that this could be a source of contention > on multi-processor machines running lots of concurrent > update/deletes. I was thinking the same thing after hearing fairly extensive "pooh-poohing" of the notion of vacuuming based on all the pages in the shared cache. This "hot list page table" would probably need to be a hash table. It rather parallels the FSM, including the way that it would need to be limited in size. -- wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','gmail.com'). http://cbbrowne.com/info/lsf.html Rules of the Evil Overlord #57. "Before employing any captured artifacts or machinery, I will carefully read the owner's manual." <http://www.eviloverlord.com/>
After a long battle with technology, herve@elma.fr (Hervé Piedvache), an earthling, wrote: > Joshua, > > Le Jeudi 20 Janvier 2005 15:44, Joshua D. Drake a écrit : >> Hervé Piedvache wrote: >> > >> >My company, which I actually represent, is a fervent user of PostgreSQL. >> >We used to make all our applications using PostgreSQL for more than 5 >> > years. We usually do classical client/server applications under Linux, >> > and Web interface (php, perl, C/C++). We used to manage also public web >> > services with 10/15 millions records and up to 8 millions pages view by >> > month. >> >> Depending on your needs either: >> >> Slony: www.slony.info >> >> or >> >> Replicator: www.commandprompt.com >> >> Will both do what you want. Replicator is easier to setup but >> Slony is free. > > No ... as I have said ... how I'll manage a database getting a table > of may be 250 000 000 records ? I'll need incredible servers ... to > get quick access or index reading ... no ? > > So what we would like to get is a pool of small servers able to make > one virtual server ... for that is called a Cluster ... no ? The term "cluster" simply indicates the use of multiple servers. There are numerous _DIFFERENT_ forms of "clusters," so that for someone to say "I want a cluster" commonly implies that since they didn't realize the need to specify things further, they really don't know what they want in a usefully identifiable way. > I know they are not using PostgreSQL ... but how a company like > Google do to get an incredible database in size and so quick access > ? Google has built a specialized application that evidently falls into the category known as "embarrassingly parallel." <http://c2.com/cgi/wiki?EmbarrassinglyParallel> There are classes of applications that are amenable to parallelization. Those tend to be applications completely different from those implemented atop transactional data stores like PostgreSQL. If your problem is "embarrassingly parallel," then I'd bet lunch that PostgreSQL (and all other SQL databases) are exactly the _wrong_ tool for implementing its data store. If your problem is _not_ "embarrassingly parallel," then you'll almost certainly discover that the cheapest way to make it fast involves fitting all the data onto _one_ computer so that you do not have to pay the costs of transmitting data over slow inter-computer communications links. -- let name="cbbrowne" and tld="gmail.com" in String.concat "@" [name;tld];; http://www.ntlug.org/~cbbrowne/ It isn't that physicists enjoy physics more than they enjoy sex, its that they enjoy sex more when they are thinking of physics.
On Sat, 2005-01-22 at 16:10 -0500, Tom Lane wrote: > Tatsuo Ishii <t-ishii@sra.co.jp> writes: > > Probably VACUUM works well for small to medium size tables, but not > > for huge ones. I'm considering about to implement "on the spot > > salvaging dead tuples". > > That's impossible on its face, except for the special case where the > same transaction inserts and deletes a tuple. In all other cases, the > transaction deleting a tuple cannot know whether it will commit. Perhaps Tatsuo has an idea... As Tom says, if you have only a single row version and then you update that row to create a second version, then we must not remove the first version, since it is effectively the Undo copy. However, if there were already 2+ row versions, then as Tatsuo suggests, it might be possible to use on the spot salvaging of dead tuples. It might be worth checking the Xid of the earlier row version(s), to see if they are now expired and could be removed immediately. However, if you had a high number of concurrent updaters, this extra effort would not be that useful, since the other row versions might still be transaction-in-progress versions. That would mean implementing this idea would often be useful, but not in the case of repeatedly updated rows. Changing the idea slightly might be better: if a row update would cause a block split, then if there is more than one row version then we vacuum the whole block first, then re-attempt the update. That way we wouldn't do this work on every row update, just when it becomes a problem. I'm suggesting putting a call to vacuum_page() into heap_update(), immediately before any call to RelationGetBufferForTuple(). We already know that page splitting is an expensive operation, so doing some work to try to avoid that could frequently pay off. This would be isolated to updating. This wouldn't remove the need for vacuuming, but it would act to prevent severe performance degradation caused by frequent re-updating. What do you think? -- Best Regards, Simon Riggs
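To make the shape of Simon's proposal a bit more concrete, here is a rough, self-contained C sketch. This is not PostgreSQL source: the toy types and the prune_page() routine are invented for illustration, and the real heap_update()/RelationGetBufferForTuple() interfaces look nothing like this. It only shows where the opportunistic, page-local vacuum would sit relative to the expensive "find another page" path.

    #include <stdbool.h>
    #include <stddef.h>

    /* Toy page model, invented for illustration only. */
    typedef struct ToyTuple { bool dead; size_t len; } ToyTuple;
    typedef struct ToyPage  { ToyTuple tup[64]; int ntup; size_t free_bytes; } ToyPage;

    /* Hypothetical page-local vacuum: reclaim only tuples already dead
     * to every running transaction (the xmax-below-GlobalXmin test in
     * the real system). */
    static void prune_page(ToyPage *page)
    {
        for (int i = 0; i < page->ntup; i++)
            if (page->tup[i].dead) {
                page->free_bytes += page->tup[i].len;
                page->tup[i].len = 0;
                page->tup[i].dead = false;   /* slot reclaimed */
            }
    }

    /* The proposal in miniature: before giving up on the current page and
     * asking for a new one (the expensive path), try a one-page vacuum and
     * retry.  Only updates that would otherwise spill to another page pay
     * the extra cost. */
    static bool place_new_version(ToyPage *page, size_t newtup_len)
    {
        if (page->free_bytes < newtup_len)
            prune_page(page);                 /* opportunistic, page-local */
        if (page->free_bytes >= newtup_len) {
            page->free_bytes -= newtup_len;   /* new version fits here */
            return true;
        }
        return false;   /* caller falls back to locating space elsewhere */
    }

Tom's reply below spells out the main costs of doing even this much in the foreground path, the index entries for the reclaimed tuples being the big one.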
Simon Riggs <simon@2ndquadrant.com> writes: > Changing the idea slightly might be better: if a row update would cause > a block split, then if there is more than one row version then we vacuum > the whole block first, then re-attempt the update. "Block split"? I think you are confusing tables with indexes. Chasing down prior versions of the same row is not very practical anyway, since there is no direct way to find them. One possibility is, if you tried to insert a row on a given page but there's not room, to look through the other rows on the same page to see if any are deletable (xmax below the GlobalXmin event horizon). This strikes me as a fairly expensive operation though, especially when you take into account the need to get rid of their index entries first. Moreover, the check would often be unproductive. The real issue with any such scheme is that you are putting maintenance costs into the critical paths of foreground processes that are executing user queries. I think that one of the primary advantages of the Postgres storage design is that we keep that work outside the critical path and delegate it to maintenance processes that can run in the background. We shouldn't lightly toss away that advantage. There was some discussion in Toronto this week about storing bitmaps that would tell VACUUM whether or not there was any need to visit individual pages of each table. Getting rid of useless scans through not-recently-changed areas of large tables would make for a significant reduction in the cost of VACUUM. regards, tom lane
For reference, here's the discussion about this that took place on hackers: http://lnk.nu/archives.postgresql.org/142.php On Sun, Jan 23, 2005 at 01:16:20AM -0500, Christopher Browne wrote: > A long time ago, in a galaxy far, far away, gsstark@mit.edu (Greg Stark) wrote: > > Dawid Kuroczko <qnex42@gmail.com> writes: > > > >> Quick thought -- would it be to possible to implement a 'partial VACUUM' > >> per analogiam to partial indexes? > > > > No. > > > > But it gave me another idea. Perhaps equally infeasible, but I don't see why. > > > > What if there were a map of modified pages. So every time any tuple > > was marked deleted it could be marked in the map as modified. VACUUM > > would only have to look at these pages. And if it could mark as free > > every tuple that was marked as deleted then it could unmark the > > page. > > > > The only downside I see is that this could be a source of contention > > on multi-processor machines running lots of concurrent > > update/deletes. > > I was thinking the same thing after hearing fairly extensive > "pooh-poohing" of the notion of vacuuming based on all the pages in > the shared cache. > > This "hot list page table" would probably need to be a hash table. It > rather parallels the FSM, including the way that it would need to be > limited in size. > -- > wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','gmail.com'). > http://cbbrowne.com/info/lsf.html > Rules of the Evil Overlord #57. "Before employing any captured > artifacts or machinery, I will carefully read the owner's manual." > <http://www.eviloverlord.com/> > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Sun, Jan 23, 2005 at 03:40:03PM -0500, Tom Lane wrote: > There was some discussion in Toronto this week about storing bitmaps > that would tell VACUUM whether or not there was any need to visit > individual pages of each table. Getting rid of useless scans through > not-recently-changed areas of large tables would make for a significant > reduction in the cost of VACUUM. FWIW, that's already on the TODO. See also http://lnk.nu/archives.postgresql.org/142.php. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Tatsuo, > I'm not clear what "pgPool only needs to monitor "update switching" by > > *connection* not by *table*" means. In your example: > > (1) 00:00 User A updates "My Profile" > > (2) 00:01 "My Profile" UPDATE finishes executing. > > (3) 00:02 User A sees "My Profile" re-displayed > > (6) 00:04 "My Profile":UserA cascades to the last Slave server > > I think (2) and (3) are on different connections, thus pgpool cannot > judge if SELECT in (3) should go only to the master or not. > > To solve the problem you need to make pgpool understand "web sessions" > not "database connections" and it seems impossible for pgpool to > understand "sessions". Depends on your connection pooling software, I suppose. Most connection pooling software only returns connections to the pool after a user has been inactive for some period ... generally more than 3 seconds. So connection continuity could be trusted. -- Josh Berkus Aglio Database Solutions San Francisco
> The real issue with any such scheme is that you are putting maintenance > costs into the critical paths of foreground processes that are executing > user queries. I think that one of the primary advantages of the > Postgres storage design is that we keep that work outside the critical > path and delegate it to maintenance processes that can run in the > background. We shouldn't lightly toss away that advantage. As a rather naive user, I'd consider modifying the FSM so that it has pages with 'possibly freeable' space on them, as well as those with free space. This way, when the list of pages with actually free space is depleted, the list of 'possibly freeable' pages could be vacuumed (as a batch for that relation) and then placed on the actually-free list, like vacuum currently does. Since there is concern about critical path performance, there could be an extra backend process that would wake up periodically (or on a signal) and vacuum the pages, so they're not processed inline with some transaction. Then grabbing a page with free space is the same as it is currently. Actually I was hoping to find some time to investigate this myself, but my employer is keeping me busy with other tasks ;/. Our particular data management problems could be mitigated much better with a data partitioning approach, anyway. On another note, is anybody investigating backing up the FSM with disk files so that when the FSM size exceeds the memory allocated, the appropriate data is swapped to disk? At least since 7.4 you no longer need a VACUUM when postgres starts, to learn about free space ;) - Guy Thornley
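As a way of picturing Guy's suggestion, here is a small illustrative C fragment. It is not PostgreSQL code; the structure and function names are made up. The idea is simply that each per-page entry grows a second counter for space that is only 'possibly' freeable, and a background process, rather than the foreground transaction, converts it into genuinely free space in batches.

    #include <stddef.h>

    typedef unsigned int BlockNumber;

    /* Invented for illustration: an FSM-style entry tracking both space
     * known to be free and space tied up in dead tuples that a batch,
     * page-targeted vacuum could reclaim later. */
    typedef struct PageSpaceEntry {
        BlockNumber blkno;
        size_t      free_bytes;        /* usable right now */
        size_t      maybe_free_bytes;  /* dead-tuple space, reclaimable later */
    } PageSpaceEntry;

    typedef struct RelSpaceMap {
        PageSpaceEntry *entries;
        int             nentries;
    } RelSpaceMap;

    /* Hypothetical background-worker step: when foreground callers stop
     * finding pages with enough free_bytes, sweep only the listed pages
     * for one relation, turning maybe_free_bytes into free_bytes outside
     * any transaction's critical path. */
    void reclaim_possibly_free(RelSpaceMap *map,
                               size_t (*vacuum_one_page)(BlockNumber blkno))
    {
        for (int i = 0; i < map->nentries; i++) {
            PageSpaceEntry *e = &map->entries[i];
            if (e->maybe_free_bytes > 0) {
                e->free_bytes += vacuum_one_page(e->blkno); /* prune just this page */
                e->maybe_free_bytes = 0;
            }
        }
    }

The appeal of this arrangement is the same one Tom argues for: the pruning work still happens in a background process, it is just targeted at the pages known to need it instead of the whole table.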
> Tatsuo, > > > I'm not clear what "pgPool only needs to monitor "update switching" by > > > > *connection* not by *table*" means. In your example: > > > (1) 00:00 User A updates "My Profile" > > > (2) 00:01 "My Profile" UPDATE finishes executing. > > > (3) 00:02 User A sees "My Profile" re-displayed > > > (6) 00:04 "My Profile":UserA cascades to the last Slave server > > > > I think (2) and (3) are on different connections, thus pgpool cannot > > judge if SELECT in (3) should go only to the master or not. > > > > To solve the problem you need to make pgpool understand "web sessions" > > not "database connections" and it seems impossible for pgpool to > > understand "sessions". > > Depends on your connection pooling software, I suppose. Most connection > pooling software only returns connections to the pool after a user has been > inactive for some period ... generally more than 3 seconds. So connection > continuity could be trusted. Not sure what you mean by "most connection pooling software", but I'm sure that pgpool behaves differently. -- Tatsuo Ishii
> Tatsuo Ishii <t-ishii@sra.co.jp> writes: > > Probably VACUUM works well for small to medium size tables, but not > > for huge ones. I'm considering about to implement "on the spot > > salvaging dead tuples". > > That's impossible on its face, except for the special case where the > same transaction inserts and deletes a tuple. In all other cases, the > transaction deleting a tuple cannot know whether it will commit. Of course. We need to keep a list of such tuples until commit or abort. -- Tatsuo Ishii
Faster and more frequent VACUUM (was PostgreSQL clustering VS MySQL clustering)
From: Simon Riggs
On Sun, 2005-01-23 at 15:40 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Changing the idea slightly might be better: if a row update would cause > > a block split, then if there is more than one row version then we vacuum > > the whole block first, then re-attempt the update. > > "Block split"? I think you are confusing tables with indexes. Terminologically loose, as ever. :( I meant both tables and indexes and was referring to the part of the algorithm that is entered when we have a block-full situation. > Chasing down prior versions of the same row is not very practical > anyway, since there is no direct way to find them. > > One possibility is, if you tried to insert a row on a given page but > there's not room, to look through the other rows on the same page to see > if any are deletable (xmax below the GlobalXmin event horizon). This > strikes me as a fairly expensive operation though, especially when you > take into account the need to get rid of their index entries first. Thats what I was suggesting, vac the whole page, not just those rows. Doing it immediately greatly increases the chance that the index blocks would be in cache also. > Moreover, the check would often be unproductive. > The real issue with any such scheme is that you are putting maintenance > costs into the critical paths of foreground processes that are executing > user queries. I think that one of the primary advantages of the > Postgres storage design is that we keep that work outside the critical > path and delegate it to maintenance processes that can run in the > background. We shouldn't lightly toss away that advantage. Completely agree. ...which is why I was trying to find a place for such an operation in-front-of another expensive operation which is also currently on the critical path. That way there might be benefit rather than just additional overhead. > There was some discussion in Toronto this week about storing bitmaps > that would tell VACUUM whether or not there was any need to visit > individual pages of each table. Getting rid of useless scans through > not-recently-changed areas of large tables would make for a significant > reduction in the cost of VACUUM. ISTM there are two issues here, which are only somewhat related: - speed of VACUUM on large tables - ability to run VACUUM very frequently on very frequently updated tables The needs-maintenance bitmap idea hits both, whilst the on-the-spot idea only hits the second one, even if it does it +/- better. Gut feel says we would implement only one idea...so... On balance that indicates the need-maintenance bitmap is a better idea, and one for which we already have existing code. A few questions... - wouldn't we need a bitmap per relation? - wouldn't all the extra bitmaps need to be cached in shared_buffers, which could use up a good proportion of buffer cache space - maybe we should use a smaller block size and a different cache for it - how would we update the bitmap without creating a new LWlock that needs to be acquired for every block write and so reducing scalability? - would this be implemented as an option for each table, so that we could avoid the implementation overhead? (Or perhaps don't have a bitmap if table is less than 16 blocks?) -- Best Regards, Simon Riggs
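To put rough numbers on the sizing questions: at one bit per 8 KB heap block, a needs-maintenance bitmap costs about 16 KB per gigabyte of table, so even a 100 GB relation needs only around 1.6 MB of bitmap. A minimal sketch of such a per-relation bitmap follows (illustrative C only, not PostgreSQL code; in the real system it would live in shared memory and the set/test calls would need a lock or atomic operations, which is exactly the contention question raised above).

    #include <stdlib.h>

    typedef unsigned int BlockNumber;

    /* Illustrative per-relation "needs maintenance" bitmap: one bit per
     * heap block. */
    typedef struct NeedsVacuumMap {
        unsigned char *bits;
        BlockNumber    nblocks;
    } NeedsVacuumMap;

    NeedsVacuumMap *nvm_create(BlockNumber nblocks)
    {
        NeedsVacuumMap *map = malloc(sizeof(*map));
        map->nblocks = nblocks;
        map->bits = calloc((nblocks + 7) / 8, 1);   /* ~16 KB per GB of heap */
        return map;
    }

    /* Conceptually called whenever a tuple on blkno is deleted or updated. */
    void nvm_mark_dirty(NeedsVacuumMap *map, BlockNumber blkno)
    {
        map->bits[blkno / 8] |= (unsigned char) (1 << (blkno % 8));
    }

    /* VACUUM skips blocks whose bit is clear, and clears the bit once a
     * block has been cleaned. */
    int nvm_needs_vacuum(const NeedsVacuumMap *map, BlockNumber blkno)
    {
        return (map->bits[blkno / 8] >> (blkno % 8)) & 1;
    }

At that size, keeping a bitmap per relation in ordinary shared memory looks plausible even for large installations; the per-write locking cost is the part only a real implementation would settle.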
Tatsuo, I agree completely that vacuum falls apart on huge tables. We could probably do the math and figure out what the ratio of updated rows per total rows is each day, but on a constantly growing table, that ratio gets smaller and smaller, making the impact of dead tuples in the table proportionately less and less. If multi-version indexes are handled the same way as table rows, then the indexes will also suffer the same fate, if not worse. For huge tables, the b-tree depth can get fairly large. When a b-tree is of depth X and the machine holds the first Y levels of the b-tree in memory, then each table row selected requires a MINIMUM of (X-Y) disk accesses *before* the table row is accessed. Substitute any numbers you want for X and Y, but you will find that huge tables require many index reads. Index updates are even worse. A table row update requires only a copy of the row. An index update requires at least a copy of the leaf node, and possibly more nodes if nodes must be split or collapsed. These splits and collapses can cascade, causing many nodes to be affected. This whole process takes place for each and every index affected by the change, which is every index on the table when a row is added or deleted. All of this monkeying around takes place above and beyond the simple change of the row data. Further, each and every affected index page is dumped to WAL. Assuming the indexes have the same MVCC properties as row data, then the indexes would get dead tuples at a rate far higher than that of the table data. So yes, vacuuming is a problem on large tables. It is a bigger problem for indexes. On large tables, index I/O comprises most of the I/O mix. Don't take my word for it. Run a benchmark on Pg. Then, soft-link the index files and the WAL directories to a RAM disk. Rerun the benchmark and you will find that Pg is far faster, much faster than if only the data were on the RAM disk. Marty Tatsuo Ishii wrote: > IMO the bottle neck is not WAL but table/index bloat. Lots of updates > on large tables will produce lots of dead tuples. Problem is, There' > is no effective way to reuse these dead tuples since VACUUM on huge > tables takes longer time. 8.0 adds new vacuum delay > paramters. Unfortunately this does not help. It just make the > execution time of VACUUM longer, that means more and more dead tuples > are being made while updating. > > Probably VACUUM works well for small to medium size tables, but not > for huge ones. I'm considering about to implement "on the spot > salvaging dead tuples". > -- > Tatsuo Ishii > > >>This is probably a lot easier than you would think. You say that your >>DB will have lots of data, lots of updates and lots of reads. >> >>Very likely the disk bottleneck is mostly index reads and writes, with >>some critical WAL fsync() calls. In the grand scheme of things, the >>actual data is likely not accessed very often. >> >>The indexes can be put on a RAM disk tablespace and that's the end of >>index problems -- just make sure you have enough memory available. Also >>make sure that the machine can restart correctly after a crash: the >>tablespace is dropped and recreated, along with the indexes. This will >>cause a machine restart to take some time. >> >>After that, if the WAL fsync() calls are becoming a problem, put the WAL >>files on a fast RAID array, etiher a card or external enclosure, that >>has a good amount of battery-backed write cache. 
This way, the WAL >>fsync() calls will flush quickly to the RAM and Pg can move on while the >>RAID controller worries about putting the data to disk. With WAL, low >>access time is usually more important than total throughput. >> >>The truth is that you could have this running for not much money. >> >>Good Luck, >>Marty >> >> >>>Le Jeudi 20 Janvier 2005 19:09, Bruno Almeida do Lago a écrit : >>> > Could you explain us what do you have in mind for that solution? I mean, >>> > forget the PostgreSQL (or any other database) restrictions and >>>explain us >>> > how this hardware would be. Where the data would be stored? >>> > >>> > I've something in mind for you, but first I need to understand your >>>needs! >>> >>>I just want to make a big database as explained in my first mail ... At the >>>beginning we will have aprox. 150 000 000 records ... each month we will >>>add >>>about 4/8 millions new rows in constant flow during the day ... and in same >>>time web users will access to the database in order to read those data. >>>Stored data are quite close to data stored by google ... (we are not >>>making a >>>google clone ... just a lot of data many small values and some big ones ... >>>that's why I'm comparing with google for data storage). >>>Then we will have a search engine searching into those data ... >>> >>>Dealing about the hardware, for the moment we have only a bi-pentium Xeon >>>2.8Ghz with 4 Gb of RAM ... and we saw we had bad performance results >>>... so >>>we are thinking about a new solution with maybe several servers (server >>>design may vary from one to other) ... to get a kind of cluster to get >>>better >>>performance ... >>> >>>Am I clear ? >>> >>>Regards, >> >> >> >> >>---------------------------(end of broadcast)--------------------------- >>TIP 2: you can get off all lists at once with the unregister command >> (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >> >
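To put sample numbers behind Marty's (X-Y) argument above: with a fan-out of roughly 250 index entries per 8 KB B-tree page (an assumed figure for illustration, not a measured PostgreSQL value), an index over 150 million rows is about four levels deep, so caching the top two levels still leaves around two uncached page reads per probe. A small back-of-the-envelope C snippet:

    #include <math.h>
    #include <stdio.h>

    /* Estimated B-tree depth X for nkeys entries at a given fan-out, and
     * the uncached reads (X - Y) left per index probe when the top
     * cached_levels levels fit in RAM.  The fan-out of ~250 entries per
     * page is an assumption for illustration. */
    int main(void)
    {
        const double nkeys = 150e6;     /* rows in the table discussed upthread */
        const double fanout = 250.0;    /* assumed entries per index page */
        const int cached_levels = 2;    /* Y: assumed to fit in RAM */

        int depth = (int) ceil(log(nkeys) / log(fanout));   /* X, leaf level included */
        int uncached = depth - cached_levels;               /* X - Y */
        if (uncached < 0)
            uncached = 0;

        printf("depth X = %d levels, uncached reads per probe = %d\n",
               depth, uncached);
        return 0;
    }

Compiled with -lm this prints a depth of 4 and 2 uncached reads per probe; multiply those residual reads by a high insert/select rate and Marty's point about index I/O dominating the mix follows directly.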
Tatsuo, > > Depends on your connection pooling software, I suppose. Most connection > > pooling software only returns connections to the pool after a user has > > been inactive for some period ... generally more than 3 seconds. So > > connection continuity could be trusted. > > Not sure what you mean by "most connection pooling software", but I'm > sure that pgpool behaves differently. Ah, clarity problem here. I'm talking about connection pooling tools from the client (webserver) side, such as Apache::DBI, PHP's pg_pconnect, Jakarta's connection pools, etc. Not pooling on the database server side, which is what pgPool provides. Most of these tools allocate a database connection to an HTTP/middleware client, and only release it after a specific period of inactivity. This means that you *could* count on "web-user==connection" for purposes of switching back and forth to the master -- as long as the connection-recycling timeout were set higher than the pgPool switch-off period. -- Josh Berkus Aglio Database Solutions San Francisco
On Mon, 2005-01-24 at 09:52 -0800, Josh Berkus wrote: > [about keeping connections open in web context] > Ah, clarity problem here. I'm talking about connection pooling tools from > the client (webserver) side, such as Apache::DBI, PHP's pg_pconnect, > Jakarta's connection pools, etc. Not pooling on the database server side, > which is what pgPool provides. note that these sometimes do not provide connection pooling as such, just persistent connections (Apache::DBI) > Most of these tools allocate a database connection to an HTTP/middleware > client, and only release it after a specific period of inactivity. This > means that you *could* count on "web-user==connection" for purposes of > switching back and forth to the master -- as long as the connection-recycling > timeout were set higher than the pgPool switch-off period. no. you can only count on web-server-process==connection, but not web-user==connection, unless you can guarantee that the same user client always connects to the same web-server process. am i missing something ? gnari
Ragnar, > note that these sometimes do not provide connection pooling as such, > just persistent connections (Apache::DBI) Yes, right. > no. you can only count on web-server-process==connection, but not > web-user==connection, unless you can garantee that the same user > client always connects to same web-server process. Are there ones that you use which might use several different connections to send a series of queries from a single web-user, less than 5 seconds apart? -- --Josh Josh Berkus Aglio Database Solutions San Francisco
> On Mon, 2005-01-24 at 09:52 -0800, Josh Berkus wrote: > > [about keeping connections open in web context] > > Ah, clarity problem here. I'm talking about connection pooling tools from > > the client (webserver) side, such as Apache::DBI, PHP's pg_pconnect, > > Jakarta's connection pools, etc. Not pooling on the database server side, > > which is what pgPool provides. > > note that these sometimes do not provide connection pooling as such, > just persistent connections (Apache::DBI) Right. The same thing can be said of pg_pconnect. > > Most of these tools allocate a database connection to an HTTP/middleware > > client, and only release it after a specific period of inactivity. This > > means that you *could* count on "web-user==connection" for purposes of > > switching back and forth to the master -- as long as the connection-recycling > > timeout were set higher than the pgPool switch-off period. > > no. you can only count on web-server-process==connection, but not > web-user==connection, unless you can garantee that the same user > client always connects to same web-server process. I have the same opinion. > am i missing something ? -- Tatsuo Ishii
On Sun, Jan 23, 2005 at 03:40:03PM -0500, Tom Lane wrote: > The real issue with any such scheme is that you are putting maintenance > costs into the critical paths of foreground processes that are executing > user queries. I think that one of the primary advantages of the > Postgres storage design is that we keep that work outside the critical > path and delegate it to maintenance processes that can run in the > background. We shouldn't lightly toss away that advantage. To pull out the oft-used "show me the numbers" card... has anyone done a study to see if keeping this stuff out of the 'critical path' actually helps overall system performance? While the current scheme initially speeds up transactions, eventually you have to run vacuum, which puts a big load on the system. If you can put off vacuuming until off-hours (assuming your system has off-hours), then this doesn't matter, but more and more we're seeing systems where vacuum is a big performance issue (hence recent work with the delay in vacuum so as not to swamp the IO system). If you vacuum as part of the transaction it's going to be a more efficient use of resources, because you have more of what you need right there (ie: odds are that you're on the same page as the old tuple). In cases like that it very likely makes a lot of sense to take a small hit in your transaction time up-front, instead of a larger hit doing a vacuum down the road. Of course, without numbers this is a bunch of hand-waving, but I don't think it's valid to assume that minimizing the amount of work you do in a transaction means better throughput without considering what it will cost to do the work you're putting off until later. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Mon, 2005-01-24 at 15:45 -0800, Josh Berkus wrote: > [about keeping open DB connections between web-client connections] > [I wrote:] > > no. you can only count on web-server-process==connection, but not > > web-user==connection, unless you can garantee that the same user > > client always connects to same web-server process. > > Are there ones that you use which might use several different connections to > send a series of queries from a single web-user, less than 5 seconds apart? Actually, it had never occurred to me to test all browsers in this regard, but I can think of LWP::UserAgent. gnari
On a fine day (Monday, 24 January 2005, 11:52+0900), Tatsuo Ishii wrote: > > Tatsuo Ishii <t-ishii@sra.co.jp> writes: > > > Probably VACUUM works well for small to medium size tables, but not > > > for huge ones. I'm considering about to implement "on the spot > > > salvaging dead tuples". > > > > That's impossible on its face, except for the special case where the > > same transaction inserts and deletes a tuple. In all other cases, the > > transaction deleting a tuple cannot know whether it will commit. > > Of course. We need to keep a list of such that tuples until commit or > abort. What about other transactions, which may have started before the current one and still be running when the current one commits ? I once proposed an extra parameter added to VACUUM FULL which determines how much free space to leave in each page vacuumed. If there were room the new tuple could be placed near the old one in most cases and thus avoid lots of disk head movement when updating huge tables in one go. ------------ Hannu Krosing <hannu@tm.ee>
On a fine day (Thursday, 20 January 2005, 16:00+0100), Hervé Piedvache wrote: > > Will both do what you want. Replicator is easier to setup but > > Slony is free. > > No ... as I have said ... how I'll manage a database getting a table of may be > 250 000 000 records ? I'll need incredible servers ... to get quick access or > index reading ... no ? > > So what we would like to get is a pool of small servers able to make one > virtual server ... for that is called a Cluster ... no ? > > I know they are not using PostgreSQL ... but how a company like Google do to > get an incredible database in size and so quick access ? They use lots of boxes and lots of custom software to implement a very specific kind of cluster. > regards, -- Hannu Krosing <hannu@tm.ee>
On a fine day (Thursday, 20 January 2005, 11:02-0500), Rod Taylor wrote: > Slony has some other issues with databases > 200GB in size as well > (well, it hates long running transactions -- and pg_dump is a regular > long running transaction) IIRC it hates pg_dump mainly on the master. If you are able to run pg_dump from a slave, it should be ok. -- Hannu Krosing <hannu@tm.ee>
On a fine day (Sunday, 23 January 2005, 15:40-0500), Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Changing the idea slightly might be better: if a row update would cause > > a block split, then if there is more than one row version then we vacuum > > the whole block first, then re-attempt the update. > > "Block split"? I think you are confusing tables with indexes. > > Chasing down prior versions of the same row is not very practical > anyway, since there is no direct way to find them. > > One possibility is, if you tried to insert a row on a given page but > there's not room, to look through the other rows on the same page to see > if any are deletable (xmax below the GlobalXmin event horizon). This > strikes me as a fairly expensive operation though, especially when you > take into account the need to get rid of their index entries first. Why is removing index entries essential ? In pg you always have to visit the data page, so finding the wrong tuple there could just produce the same result as a deleted tuple (which in this case it actually is). The cleaning of index entries could be left to the real vacuum. -- Hannu Krosing <hannu@tm.ee>
> > > Tatsuo Ishii <t-ishii@sra.co.jp> writes: > > > > Probably VACUUM works well for small to medium size tables, but not > > > > for huge ones. I'm considering about to implement "on the spot > > > > salvaging dead tuples". > > > > > > That's impossible on its face, except for the special case where the > > > same transaction inserts and deletes a tuple. In all other cases, the > > > transaction deleting a tuple cannot know whether it will commit. > > > > Of course. We need to keep a list of such that tuples until commit or > > abort. > > what about other transactions, which may have started before current one > and be still running when current one commites ? Then dead tuples should be left. Perhaps in this case we could register them in FSM or whatever for later processing. -- Tatsuo Ishii > I once proposed an extra parameter added to VACUUM FULL which determines > how much free space to leave in each page vacuumed. If there were room > the new tuple could be placed near the old one in most cases and thus > avoid lots of disk head movement when updating huge tables in one go. > > ------------ > > Hannu Krosing <hannu@tm.ee> > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org >
Hannu Krosing <hannu@tm.ee> writes: > Why is removing index entries essential ? Because once you re-use the tuple slot, any leftover index entries would be pointing to the wrong rows. regards, tom lane
Josh, Please excuse how my client quotes things... > Are there ones that you use which might use several different connections to > send a series of queries from a single web-user, less than 5 seconds apart? Using Apache/Perl I often have a situation where we're sending several queries from the same user (web client) within seconds, or even simultaneously, that use different connections. When someone logs in to our system they get a frameset that has 5 windows, each of which is filled with data from queries. Since the pages in the frames are requested separately by the client, the system doesn't ensure that they go to the same process, and consequently they are not necessarily served by the same db connection. Session information is stored in the database (so it's easily persistent across server processes), so it would be bad if a request for a page was served by a db server that didn't yet have information about the user (such as that they're logged in, etc.). If we ever have enough traffic to warrant it, we're going to go to a load balancer that passes requests to different, identical web servers, at which point we won't even be getting requests from the same machine, much less the same connection. Thanks, Peter Darley -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Josh Berkus Sent: Monday, January 24, 2005 3:46 PM To: Ragnar Hafstað Cc: pgsql-performance@postgresql.org; Tatsuo Ishii Subject: Re: [PERFORM] PgPool changes WAS: PostgreSQL clustering VS MySQL Ragnar, > note that these sometimes do not provide connection pooling as such, > just persistent connections (Apache::DBI) Yes, right. > no. you can only count on web-server-process==connection, but not > web-user==connection, unless you can garantee that the same user > client always connects to same web-server process. Are there ones that you use which might use several different connections to send a series of queries from a single web-user, less than 5 seconds apart? -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Peter, Ragnar, > > Are there ones that you use which might use several different connections > > to send a series of queries from a single web-user, less than 5 seconds > > apart? > > Using Apache/Perl I often have a situation where we're sending several > queries from the same user (web client) within seconds, or even > simultaneously, that use different connections. So from the sound of it, the connection methods I've been using are the exception rather than the rule. Darn, it worked well for us. :-( What this would point to is NOT being able to use Slony-I for database server pooling for most web applications. Yes? Users should look to pgCluster and C-JDBC instead. BTW, Tatsuo, what's the code relationship between pgPool and pgCluster, if any? --Josh -- Josh Berkus Aglio Database Solutions San Francisco
> Peter, Ragnar, > > > > Are there ones that you use which might use several different connections > > > to send a series of queries from a single web-user, less than 5 seconds > > > apart? > > > > Using Apache/Perl I often have a situation where we're sending several > > queries from the same user (web client) within seconds, or even > > simultaneously, that use different connections. > > So from the sound of it, the connection methods I've been using are the > exception rather than the rule. Darn, it worked well for us. :-( > > What this would point to is NOT being able to use Slony-I for database server > pooling for most web applications. Yes? Users should look to pgCluster and > C-JDBC instead. Yup. That's the limitation of async replication solutions. > BTW, Tatsuo, what's the code relationship between pgPool and pgCluster, if > any? PGCluster consists of three kinds of servers, "load balance server", "cluster server" (modified PostgreSQL backend) and "replication server". I believe some of the pgpool code is used in the load balance server to avoid "re-inventing the wheel". This is the beauty of open source software projects. -- Tatsuo Ishii
On a fine day (Tuesday, 25 January 2005, 10:41-0500), Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > Why is removing index entries essential ? > > Because once you re-use the tuple slot, any leftover index entries would > be pointing to the wrong rows. That much I understood ;) But can't clearing up the index be left for "later" ? An index scan has to check the data tuple anyway, at least for visibility. Would adding a check for field sameness between index and data tuples be too big a performance hit ? > regards, tom lane -- Hannu Krosing <hannu@tm.ee>
Hannu Krosing <hannu@tm.ee> writes: > But can't clearing up the index be left for "later" ? Based on what? Are you going to store the information about what has to be cleaned up somewhere else, and if so where? > Indexscan has to check the data tuple anyway, at least for visibility. > would adding the check for field sameness in index and data tuples be > too big performance hit ? It does pretty much suck, especially when you think about functional indexes on expensive functions. regards, tom lane
http://borg.postgresql.org/docs/8.0/interactive/storage-page-layout.html > If you vacuum as part of the transaction it's going to be more efficient > of resources, because you have more of what you need right there (ie: > odds are that you're on the same page as the old tuple). In cases like > that it very likely makes a lot of sense to take a small hit in your > transaction time up-front, instead of a larger hit doing a vacuum down > the road. Some pros would be that you're going to make a disk write anyway because the page is modified, so why not vacuum that page while it's there. If the machine is CPU bound you lose, if it's IO bound you save some IO, but the cost of index updates has to be taken into account... It prompted a few questions : Note : temp contains 128k (131072) values generated from a sequence. create table test (id serial primary key, a integer, z integer, e integer, r integer, t integer, y integer ) without oids; insert into test (id,a,z,e,r,t,y) select id,0,0,0,0,0,0 from temp; INSERT 0 131072 explain analyze update test set y=1; Seq Scan on test (cost=0.00..2226.84 rows=126284 width=30) (actual time=0.047..617.553 rows=131072 loops=1) Total runtime: 4432.509 ms explain analyze update test set y=1; Seq Scan on test (cost=0.00..4453.68 rows=252568 width=30) (actual time=52.198..611.594 rows=131072 loops=1) Total runtime: 5739.064 ms explain analyze update test set y=1; Seq Scan on test (cost=0.00..6680.52 rows=378852 width=30) (actual time=127.301..848.762 rows=131072 loops=1) Total runtime: 6548.206 ms Gets slower as more and more dead tuples accumulate... normal as this is a seq scan. Note the row estimations getting bigger with the table size... vacuum full test; explain analyze update test set y=1; Seq Scan on test (cost=0.00..2274.72 rows=131072 width=30) (actual time=0.019..779.864 rows=131072 loops=1) Total runtime: 5600.311 ms vacuum full test; explain analyze update test set y=1; Seq Scan on test (cost=0.00..2274.72 rows=131072 width=30) (actual time=0.039..1021.847 rows=131072 loops=1) Total runtime: 5126.590 ms -> Seems vacuum full does its job.... vacuum test; explain analyze update test set y=1; Seq Scan on test (cost=0.00..3894.08 rows=196608 width=30) (actual time=36.491..860.135 rows=131072 loops=1) Total runtime: 7293.698 ms vacuum test; explain analyze update test set y=1; Seq Scan on test (cost=0.00..3894.08 rows=196608 width=30) (actual time=0.044..657.125 rows=131072 loops=1) Total runtime: 5934.141 ms vacuum analyze test; explain analyze update test set y=1; Seq Scan on test (cost=0.00..3894.08 rows=196608 width=30) (actual time=0.018..871.132 rows=131072 loops=1) Total runtime: 5548.053 ms -> here vacuum is about as slow as vacuum full (which is normal as the whole table is updated) however the row estimation is still off even after ANALYZE. Let's create a few indices : vacuum full test; create index testa on test(a); create index testz on test(z); create index teste on test(e); create index testr on test(r); create index testt on test(t); -- we don't create an index on y vacuum full test; explain analyze update test set a=id; Seq Scan on test (cost=0.00..2274.72 rows=131072 width=30) (actual time=0.044..846.102 rows=131072 loops=1) Total runtime: 14998.307 ms We see that the index updating time has made this query a lot slower. 
This is normal, but : vacuum full test; explain analyze update test set a=id; Seq Scan on test (cost=0.00..2274.72 rows=131072 width=30) (actual time=0.045..1387.626 rows=131072 loops=1) Total runtime: 17644.368 ms Now, we updated ALL rows but didn't actually change a single value. However it took about the same time as the first one. I guess the updates all really took place, even if all it did was copy the rows with new transaction ID's. Now, let's update a column which is not indexed : vacuum full test; explain analyze update test set y=id; Seq Scan on test (cost=0.00..2274.72 rows=131072 width=30) (actual time=0.046..964.590 rows=131072 loops=1) Total runtime: 15628.143 ms Takes 'bout the same time : the indexes still have to be updated to reference the new rows after all. So, here is something annoying with the current approach : Updating rows in a table bloats ALL indices, not just those whose indexed values have been actually updated. So if you have a table with many indexed fields and you often update some obscure timestamp field, all the indices will bloat, which will of course be corrected by VACUUM, but vacuum will have extra work to do. I don't have suggestions, just questions : Is there a way that an update to the indices can be avoided if the indexed values do not change ? Would it depend if an updated tuple can be stored on the same page it was before (along with the old version) ? If the answer is Yes : - would saving the cost of updating the indexes pay off over vacuuming the page on the run to try to squeeze the new tuple version in ? - would it be interesting to specify for each table a target % of free space ('air holes') in pages for vacuum to try to achieve, in order to be able to insert updated row versions on the same page they were before, and save index updates ? Regards...
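On the last question raised above, the comparison itself would be cheap; here is a toy C illustration of "only touch the indexes whose key columns actually changed" (invented types, nothing to do with the real executor):

    #include <stdbool.h>
    #include <string.h>

    /* Toy model, not PostgreSQL code: decide which indexes would need new
     * entries if entries could be skipped whenever their key columns are
     * unchanged between the old and the new row version. */
    #define NCOLS 8

    typedef struct RowVersion { const char *col[NCOLS]; } RowVersion; /* values as strings */
    typedef struct IndexDef   { int nkeys; int keycol[NCOLS]; } IndexDef;

    static bool index_key_changed(const IndexDef *idx,
                                  const RowVersion *oldv, const RowVersion *newv)
    {
        for (int i = 0; i < idx->nkeys; i++) {
            int c = idx->keycol[i];
            if (strcmp(oldv->col[c], newv->col[c]) != 0)
                return true;    /* key changed: this index needs a new entry */
        }
        return false;           /* key unchanged: the entry could, in principle, be reused */
    }

The hard part, as Tom and Kevin point out elsewhere in the thread, is not detecting the unchanged key but the fact that the new row version normally lives at a new physical location, so an untouched index entry would still point only at the old version; skipping the index insert only works if the new version can be placed somewhere the existing entry can still reach, which circles back to the free-space-on-the-same-page questions above.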
On Thu, Jan 20, 2005 at 04:02:39PM +0100, Hervé Piedvache wrote: > > I don't insist about have data in RAM .... but when you use PostgreSQL with > big database you know that for quick access just for reading the index file > for example it's better to have many RAM as possible ... I just want to be > able to get a quick access with a growing and growind database ... Well, in any case, you need much better hardware than you're looking at. I mean, a dual Xeon with 2 Gig is hardly big iron. Why don't you try benchmarking on a honking big box -- IBM P690 or a big Sun (I'd counsel against that, though) or something like that? Or even some Opterons. Dual Xeon is probably your very worst choice at the moment. A -- Andrew Sullivan | ajs@crankycanuck.ca Information security isn't a technological problem. It's an economics problem. --Bruce Schneier
On Thu, Jan 20, 2005 at 10:40:02PM -0200, Bruno Almeida do Lago wrote: > > I was thinking the same! I'd like to know how other databases such as Oracle > do it. You mean "how Oracle does it". They're the only ones in the market that really have this technology. A -- Andrew Sullivan | ajs@crankycanuck.ca This work was visionary and imaginative, and goes to show that visionary and imaginative work need not end up well. --Dennis Ritchie
On Thu, Jan 20, 2005 at 03:54:23PM +0100, Hervé Piedvache wrote: > Slony do not use RAM ... but PostgreSQL will need RAM for accessing a database > of 50 Gb ... so having two servers with the same configuration replicated by > slony do not slove the problem of the scalability of the database ... You could use SSD for your storage. That'd make it go rather quickly even if it had to seek on disk. A -- Andrew Sullivan | ajs@crankycanuck.ca The plural of anecdote is not data. --Roger Brinner
On Thu, Jan 20, 2005 at 04:07:51PM +0100, Hervé Piedvache wrote: > Yes seems to be the only solution ... but I'm a little disapointed about > this ... could you explain me why there is not this kind of > functionnality ... it seems to be a real need for big applications no ? I hate to be snarky, but the reason there isn't this kind of system just hanging around is that it's a Very Hard Problem. I spent 2 days last week in a room with some of the smartest people I know, and there was widespread agreement that what you want is a very tough problem. A -- Andrew Sullivan | ajs@crankycanuck.ca In the future this spectacle of the middle classes shocking the avant- garde will probably become the textbook definition of Postmodernism. --Brad Holland
On Mon, Jan 24, 2005 at 01:28:29AM +0200, Hannu Krosing wrote: > > IIRC it hates pg_dump mainly on master. If you are able to run pg_dump > from slave, it should be ok. For the sake of the archives, that's not really a good idea. There is some work afoot to solve it, but at the moment dumping from a slave gives you a useless database dump. A -- Andrew Sullivan | ajs@crankycanuck.ca The fact that technology doesn't work is no bar to success in the marketplace. --Philip Greenspun
At this point I will interject a couple of benchmark numbers based on a new system we just configured as food for thought. System A (old system): Compaq Proliant Dual Pentium III 933 with Smart Array 5300, one RAID 1, one 3 Disk RAID 5 on 10k RPM drives, 2GB PC133 RAM. Original Price: $6500 System B (new system): Self Built Dual Opteron 242 with 2x3ware 9500S-8MI SATA, one RAID 1 (OS), one 4 drive RAID 10 (pg_xlog), one 6 drive RAID 10 (data) on 10k RPM Raptors, 4GB PC3200 RAM. Current price $7200 System A for our large insert job: 125 minutes System B for our large insert job: 10 minutes. There is no logical way there should be a 12x performance difference between these two systems, maybe 2x or even 4x, but not 12x. Bad controller cards/configuration will seriously ruin your day. 3ware escalade cards are very well supported on linux, and work excellently. Compaq smart array cards are not. Bonnie++ benchmarks show a 9MB/sec write, 29MB/sec read on the RAID 5, but a 172MB/sec write on the 6xRAID 10, and 66MB/sec write on the RAID 1 on the 3ware. With the right configuration you can get very serious throughput. The new system is processing over 2500 insert transactions per second. We don't need more RAM with this config. The disks are fast enough. 2500 transactions/second is pretty damn fast. Alex Turner On Fri, 28 Jan 2005 10:31:38 -0500, Andrew Sullivan <ajs@crankycanuck.ca> wrote: > On Thu, Jan 20, 2005 at 10:40:02PM -0200, Bruno Almeida do Lago wrote: > > > > I was thinking the same! I'd like to know how other databases such as Oracle > > do it. > > You mean "how Oracle does it". They're the only ones in the market > that really have this technology. > > A > > -- > Andrew Sullivan | ajs@crankycanuck.ca > This work was visionary and imaginative, and goes to show that visionary > and imaginative work need not end up well. > --Dennis Ritchie
On Fri, 28 Jan 2005 10:59:58 -0500 Alex Turner <armtuk@gmail.com> wrote: > At this point I will interject a couple of benchmark numbers based on > a new system we just configured as food for thought. > > System A (old system): > Compaq Proliant Dual Pentium III 933 with Smart Array 5300, one RAID > 1, one 3 Disk RAID 5 on 10k RPM drives, 2GB PC133 RAM. Original > Price: $6500 > > System B (new system): > Self Built Dual Opteron 242 with 2x3ware 9500S-8MI SATA, one RAID 1 > (OS), one 4 drive RAID 10 (pg_xlog), one 6 drive RAID 10 (data) on 10k > RPM Raptors, 4GB PC3200 RAM. Current price $7200 > > System A for our large insert job: 125 minutes > System B for our large insert job: 10 minutes. > > There is no logical way there should be a 12x performance difference > between these two systems, maybe 2x or even 4x, but not 12x > > Bad controler cards/configuration will seriously ruin your day. 3ware > escalade cards are very well supported on linux, and work excellently. > Compaq smart array cards are not. Bonnie++ benchmarks show a 9MB/sec > write, 29MB/sec read on the RAID 5, but a 172MB/sec write on the > 6xRAID 10, and 66MB/sec write on the RAID 1 on the 3ware. > > With the right configuration you can get very serious throughput. The > new system is processing over 2500 insert transactions per second. We > don't need more RAM with this config. The disks are fast enough. > 2500 transaction/second is pretty damn fast. I agree that badly supported or configured cards can ruin your performance. However, don't you think moving pg_xlog onto a separate RAID and increasing your number of spindles from 3 to 6 on the data RAID would also have a significant impact on performance, no matter what card was used? I'm not sure you can give all the credit to the card on this one. --------------------------------- Frank Wiles <frank@wiles.org> http://www.wiles.org ---------------------------------
ajs@crankycanuck.ca (Andrew Sullivan) writes: > On Mon, Jan 24, 2005 at 01:28:29AM +0200, Hannu Krosing wrote: >> >> IIRC it hates pg_dump mainly on master. If you are able to run pg_dump >> from slave, it should be ok. > > For the sake of the archives, that's not really a good idea. There > is some work afoot to solve it, but at the moment dumping from a > slave gives you a useless database dump. That overstates things a tad; I think it's worth elaborating on a bit. There's a problem with the results of dumping the _schema_ from a Slony-I 'subscriber' node; you want to get the schema from the origin node. The problem has to do with triggers; Slony-I suppresses RI triggers and such like on subscriber nodes in a fashion that leaves the dumped schema a bit broken with regard to triggers. But there's nothing wrong with the idea of using "pg_dump --data-only" against a subscriber node to get you the data without putting a load on the origin. And then pulling the schema from the origin, which oughtn't be terribly expensive there. -- "cbbrowne","@","ca.afilias.info" <http://dev6.int.libertyrms.com/> Christopher Browne (416) 673-4124 (land)
PFC wrote: > So, here is something annoying with the current approach : Updating rows > in a table bloats ALL indices, not just those whose indexed values have > been actually updated. So if you have a table with many indexed fields and > you often update some obscure timestamp field, all the indices will bloat, > which will of course be corrected by VACUUM, but vacuum will have extra > work to do. The MVCC approach probably doesn't leave you with many choices here. The index entries point directly to the rows in the table, and since an update creates a new row (it's the equivalent of doing an insert then a delete), all indexes have to be updated to reflect the location of the new row. Unless my understanding of how this works is completely off... -- Kevin Brown kevin@sysexperts.com
Hervé Piedvache wrote: >>My point being is that there is no free solution. There simply isn't. >>I don't know why you insist on keeping all your data in RAM, but the >>mysql cluster requires that ALL data MUST fit in RAM all the time. > > > I don't insist about have data in RAM .... but when you use PostgreSQL with > big database you know that for quick access just for reading the index file > for example it's better to have many RAM as possible ... I just want to be > able to get a quick access with a growing and growind database ... If it's an issue of RAM and not CPU power, think about this scenario. Let's just say you *COULD* partition your DB over multiple servers. What are your plans then? Are you going to buy 4 Dual Xeon servers? Ok, let's price that out. For a full-blown rackmount server w/ RAID, 6+ SCSI drives and so on, you are looking at roughly $4000 per machine. So now you have 4 machines -- total of 16GB of RAM over the 4 machines. On the other hand, let's say you spent that money on a Quad Opteron instead. 4x850 will cost you roughly $8000. 16GB of RAM using 1GB DIMMs is $3000. If you went with 2GB DIMMs, you could stuff 32GB of RAM onto that machine for $7500. Let's review the math: 4X server cluster, total 16GB RAM = $16K 1 beefy server w/ 16GB RAM = $11K 1 beefy server w/ 32GB RAM = $16K I know what I would choose. I'd get the mega server w/ a ton of RAM and skip all the trickiness of partitioning a DB over multiple servers. Yes your data will grow to a point where even the XXGB can't cache everything. On the other hand, memory prices drop just as fast. By that time, you can ebay your original 16/32GB and get 64/128GB.
On Fri, 28 Jan 2005 11:54:57 -0500, Christopher Weimann <cweimann@k12hq.com> wrote: > On 01/28/2005-10:59AM, Alex Turner wrote: > > At this point I will interject a couple of benchmark numbers based on > > a new system we just configured as food for thought. > > > > System A (old system): > > Compaq Proliant Dual Pentium III 933 with Smart Array 5300, one RAID > > 1, one 3 Disk RAID 5 on 10k RPM drives, 2GB PC133 RAM. Original > > Price: $6500 > > > > System B (new system): > > Self Built Dual Opteron 242 with 2x3ware 9500S-8MI SATA, one RAID 1 > > (OS), one 4 drive RAID 10 (pg_xlog), one 6 drive RAID 10 (data) on 10k > > RPM Raptors, 4GB PC3200 RAM. Current price $7200 > > > > System A for our large insert job: 125 minutes > > System B for our large insert job: 10 minutes. > > > > There is no logical way there should be a 12x performance difference > > between these two systems, maybe 2x or even 4x, but not 12x > > > > Your system A has the absolute worst case Raid 5, 3 drives. The more > drives you add to Raid 5 the better it gets but it will never beat Raid > 10. On top of it being the worst case, pg_xlog is not on a separate > spindle. > True for writes, but not for reads. > Your system B has a MUCH better config. Raid 10 is faster than Raid 5 to > begin with but on top of that you have more drives involved plus pg_xlog > is on a separate spindle. I absolutely agree, it is a much better config, that's why we bought it ;).. In system A, the xlog was actually on the RAID 1, so it was in fact on a separate spindle set. > > I'd say I am not surprised by your performance difference. > I'm not surprised at all that the new system outperformed the old, it's more the factor of improvement. 12x is a _VERY_ big performance jump. > > Bad controler cards/configuration will seriously ruin your day. 3ware > > escalade cards are very well supported on linux, and work excellently. > > Compaq smart array cards are not. Bonnie++ benchmarks show a 9MB/sec > > write, 29MB/sec read on the RAID 5, but a 172MB/sec write on the > > 6xRAID 10, and 66MB/sec write on the RAID 1 on the 3ware. > > > > What does bonnie say about the Raid 1 on the Compaq? Comparing the two > Raid 1s is really the only valid comparison that can be made between > these two machines. Other than that you are comparing apples to > snow shovels. > > My main point is that you can spend $7k on a server and believe you have a fast system. The person who bought the original system was under the delusion that it would make a good DB server. For the same $7k a different configuration can yield a vastly different performance output. This means that it's not quite apples to snow shovels. People who _believe_ they have an adequate config are often sorely mistaken, and ask misguided questions about needing 20GB of RAM because the system can't page to disk fast enough, when what they really need is a good RAID 10 with a high quality controller. A six drive RAID 10 is going to run a bit less than 20G of SSD. Alex Turner NetEconomist
William Yu <wyu@talisys.com> writes: > 1 beefy server w/ 32GB RAM = $16K > > I know what I would choose. I'd get the mega server w/ a ton of RAM and skip > all the trickyness of partitioning a DB over multiple servers. Yes your data > will grow to a point where even the XXGB can't cache everything. On the > otherhand, memory prices drop just as fast. By that time, you can ebay your > original 16/32GB and get 64/128GB. a) What do you do when your calculations show you need 256G of ram? [Yes such machines exist but you're no longer in the realm of simply "add more RAM". Administering such machines is nigh as complex as clustering] b) What do you do when you find you need multiple machines anyways to divide the CPU or I/O or network load up. Now you need n big beefy servers when n servers 1/nth as large would really have sufficed. This is a big difference when you're talking about the difference between colocating 16 1U boxen with 4G of ram vs 16 4U opterons with 64G of RAM... All that said, yes, speaking as a user I think the path of least resistance is to build n complete slaves using Slony and then just divide the workload. That's how I'm picturing going when I get to that point. Even if I just divide the workload randomly it's easier than building a machine with n times the cpu and i/o. And if I divide the workload up in a way that correlates with data in the database I can probably get close to the same performance as clustering. The actual cost of replicating the unused data is slight. And the simplicity of master-slave makes it much more appealing than full on clustering. -- greg
>>I know what I would choose. I'd get the mega server w/ a ton of RAM and skip >>all the trickyness of partitioning a DB over multiple servers. Yes your data >>will grow to a point where even the XXGB can't cache everything. On the >>otherhand, memory prices drop just as fast. By that time, you can ebay your >>original 16/32GB and get 64/128GB. > > > a) What do you do when your calculations show you need 256G of ram? [Yes such > machines exist but you're not longer in the realm of simply "add more RAM". > Administering such machines is nigh as complex as clustering] If you need that much memory, you've got enough customers paying you cash to pay for anything. :) Technology always increases -- 8X Opterons would double your memory capacity, higher capacity DIMMs, etc. > b) What do you do when you find you need multiple machines anyways to divide > the CPU or I/O or network load up. Now you need n big beefy servers when n > servers 1/nth as large would really have sufficed. This is a big difference > when you're talking about the difference between colocating 16 1U boxen with > 4G of ram vs 16 4U opterons with 64G of RAM... > > All that said, yes, speaking as a user I think the path of least resistance is > to build n complete slaves using Slony and then just divide the workload. > That's how I'm picturing going when I get to that point. Replication is good for uptime and high read systems. The problem is that if your system has a high volume of writes and you need near realtime data syncing, clusters don't get you anything. A write on one server means a write on every server. Spreading out the damage over multiple machines doesn't help a bit. Plus the fact that we don't have multi-master replication yet is quite a bugaboo. That requires writing quite extensive code if you can't afford to have 1 server be your single point of failure. We wrote our own multi-master replication code at the client app level and it's quite a chore making sure the replication acts logically. Every table needs to have separate logic to parse situations like "voucher was posted on server 1 but voided after on server 2, what's the correct action here?" So I've got a slew of complicated if-then-else statements that not only have to take into account the type of update being made but also the sequence. And yes, I tried doing realtime locks over a VPN link between our servers in SF and VA. Ugh...latency was absolutely horrible and made transactions run 1000X slower.
On 01/28/2005-10:59AM, Alex Turner wrote: > At this point I will interject a couple of benchmark numbers based on > a new system we just configured as food for thought. > > System A (old system): > Compaq Proliant Dual Pentium III 933 with Smart Array 5300, one RAID > 1, one 3 Disk RAID 5 on 10k RPM drives, 2GB PC133 RAM. Original > Price: $6500 > > System B (new system): > Self Built Dual Opteron 242 with 2x3ware 9500S-8MI SATA, one RAID 1 > (OS), one 4 drive RAID 10 (pg_xlog), one 6 drive RAID 10 (data) on 10k > RPM Raptors, 4GB PC3200 RAM. Current price $7200 > > System A for our large insert job: 125 minutes > System B for our large insert job: 10 minutes. > > There is no logical way there should be a 12x performance difference > between these two systems, maybe 2x or even 4x, but not 12x. > Your system A has the absolute worst-case RAID 5: 3 drives. The more drives you add to RAID 5 the better it gets, but it will never beat RAID 10. On top of it being the worst case, pg_xlog is not on a separate spindle. Your system B has a MUCH better config. RAID 10 is faster than RAID 5 to begin with, but on top of that you have more drives involved, plus pg_xlog is on a separate spindle. I'd say I am not surprised by your performance difference. > Bad controller cards/configuration will seriously ruin your day. 3ware > Escalade cards are very well supported on Linux, and work excellently. > Compaq Smart Array cards are not. Bonnie++ benchmarks show a 9MB/sec > write, 29MB/sec read on the RAID 5, but a 172MB/sec write on the > 6xRAID 10, and 66MB/sec write on the RAID 1 on the 3ware. > What does bonnie say about the RAID 1 on the Compaq? Comparing the two RAID 1s is really the only valid comparison that can be made between these two machines. Other than that you are comparing apples to snow shovels.
On 01/28/2005-05:57PM, Alex Turner wrote: > > > > Your system A has the absolute worst-case RAID 5: 3 drives. The more > > drives you add to RAID 5 the better it gets, but it will never beat RAID > > 10. On top of it being the worst case, pg_xlog is not on a separate > > spindle. > > > > True for writes, but not for reads. > Good point. > > My main point is that you can spend $7k on a server and believe you have a fast system. > The person who bought the original system was under the delusion that it would make a good DB server. > For the same $7k a different configuration can yield vastly different performance. > This means that it's not quite apples to snow shovels. That point is definitely made. I primarily wanted to point out that the controllers involved were not the only difference. In my experience with SQL servers of various flavors, fast disks and getting things onto separate spindles matter more than just about anything else. Depending on the size of your 'hot' dataset, RAM could be more important, and CPU never is.
On Friday 21 January 2005 19:18, Marty Scholes wrote: > The indexes can be put on a RAM disk tablespace and that's the end of > index problems -- just make sure you have enough memory available. Also > make sure that the machine can restart correctly after a crash: the > tablespace is dropped and recreated, along with the indexes. This will > cause a machine restart to take some time. Tell me if I am wrong, but it sounds to me like an endless problem... This solution may work with small indexes (less than 4GB), but what happens when the indexes grow? You would add more memory to your server? But there will come a moment when you cannot add more, so what's next?
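For anyone following along, the approach being questioned relies on the tablespace feature new in 8.0. A minimal sketch, assuming the administrator has already mounted a RAM-backed filesystem at /mnt/ramdisk (the path and all object names are invented):

-- Illustrative sketch only: keep indexes on a RAM-backed tablespace.
CREATE TABLESPACE ram_ix LOCATION '/mnt/ramdisk/pg_ix';

-- "bigtable" stands in for whatever large table is being indexed.
CREATE INDEX bigtable_created_ix ON bigtable (created) TABLESPACE ram_ix;

-- The catch raised above: the RAM disk is empty after a crash or reboot,
-- so the tablespace and its indexes must be recreated before the database
-- is fully usable again, which makes restarts slow and requires scripting.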
> Tell me if I am wrong, but it sounds to me like > an endless problem... Agreed. Such it is with caching. After doing some informal benchmarking with 8.0 under Solaris, I am convinced that our major choke point is WAL synchronization, at least for applications with a high commit rate. We have noticed a substantial improvement in performance with 8.0 vs 7.4.6. All of the update/insert problems seem to have gone away, save WAL syncing. I may have to take back what I said about indexes. Olivier Sirven wrote: > On Friday 21 January 2005 19:18, Marty Scholes wrote: > >>The indexes can be put on a RAM disk tablespace and that's the end of >>index problems -- just make sure you have enough memory available. Also >>make sure that the machine can restart correctly after a crash: the >>tablespace is dropped and recreated, along with the indexes. This will >>cause a machine restart to take some time. > > Tell me if I am wrong, but it sounds to me like an endless problem... This solution may work with small indexes (less than 4GB), but what happens when the indexes grow? You would add more memory to your server? But there will come a moment when you cannot add more, so what's next?
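For what it's worth, these are roughly the knobs available in 8.0 around commit-time WAL syncing; the values shown are examples to experiment with, not recommendations:

-- Illustrative only: settings that influence how often WAL is fsync'd.
SHOW wal_sync_method;   -- fsync, fdatasync, open_sync, ... (platform dependent)
SHOW wal_buffers;       -- set in postgresql.conf; changing it needs a restart

-- "Group commit": delay slightly before syncing so that several concurrent
-- commits can share one fsync.  Both settings can be tried per session.
SET commit_delay = 100;    -- microseconds (example value only)
SET commit_siblings = 5;   -- only delay when at least this many backends are active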
On 1/20/2005 9:23 AM, Jean-Max Reymond wrote: > On Thu, 20 Jan 2005 15:03:31 +0100, Hervé Piedvache <herve@elma.fr> wrote: > >> We were at this moment thinking about a Cluster solution ... We saw on the >> Internet many solution talking about Cluster solution using MySQL ... but >> nothing about PostgreSQL ... the idea is to use several servers to make a >> sort of big virtual server using the disk space of each server as one, and >> having the ability to use the CPU and RAM of each servers in order to >> maintain good service performance ...one can imagin it is like a GFS but >> dedicated to postgreSQL... >> > > forget mysql cluster for now. Sorry for the late reply. I'd second that. I was just at Solutions Linux in Paris and spoke with MySQL people. I had some questions about the new NDB cluster tables, so I stopped by their booth. My question whether there were any plans to add foreign key support to NDB cluster tables was answered with "it will definitely be in the next version, which is the one containing NDB cluster, so yes, it will support foreign keys from the start". Back home I found some more time to investigate and found this forum article http://lists.mysql.com/cluster/1442 posted by a MySQL AB senior software architect, where he says exactly the opposite. I don't know about your application, but trust me that maintaining proper referential integrity at the application level against a multi-master clustered database isn't that easy. So this is in fact a very important question. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
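To spell out what is at stake: the one constraint below (table names invented) is the kind of thing a declared foreign key enforces for you, and against a multi-master cluster without foreign keys every writing client has to reimplement it, including the race between the check and the insert:

-- What a declared foreign key gives you for free (names are illustrative):
ALTER TABLE orders
    ADD CONSTRAINT orders_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id)
    ON DELETE RESTRICT;

-- Without it, every insert path must do something like
--     SELECT 1 FROM customers WHERE id = ...;
--     INSERT INTO orders (customer_id, ...) VALUES (...);
-- and somehow guarantee that no other node deletes the customer in between.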
On 1/28/2005 2:49 PM, Christopher Browne wrote: > But there's nothing wrong with the idea of using "pg_dump --data-only" > against a subscriber node to get you the data without putting a load > on the origin. And then pulling the schema from the origin, which > oughtn't be terribly expensive there. And there is a script in the current CVS head that extracts the schema from the origin in a clean, slony-traces-removed state. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Josh Berkus wrote: > Tatsuo, > > >>Yes. However it would be pretty easy to modify pgpool so that it could >>cope with Slony-I. I.e. >> >>1) pgpool does the load balancing and sends the query to Slony-I's slave and >> master if the query is a SELECT. >> >>2) pgpool sends the query only to the master if the query is anything other than >> a SELECT. Don't you think that this is unsafe? SELECT foo(id), id FROM bar; where foo has side effects. Is pgpool able to detect that and perform this SELECT on the master? Regards Gaetano Mendola
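A minimal sketch of the concern, with invented object names: a statement that looks like a pure read but performs a write on every call:

-- Illustrative only: a SELECT with a side effect.
CREATE TABLE bar (id integer PRIMARY KEY);
CREATE TABLE access_log (id integer, seen timestamptz DEFAULT now());

CREATE FUNCTION foo(integer) RETURNS integer AS '
    INSERT INTO access_log (id) VALUES ($1);   -- the side effect: a write
    SELECT $1;
' LANGUAGE SQL;

-- Keying only on the leading SELECT, a balancer would happily send this to a
-- read-only slave, where the write either fails or never reaches the master:
SELECT foo(id), id FROM bar;

Whether a balancer can recognize this reliably, without understanding every function a query might call, is exactly the open question being raised here.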
Jim C. Nasby wrote: > On Thu, Jan 20, 2005 at 10:08:47AM -0500, Stephen Frost wrote: > >>* Christopher Kings-Lynne (chriskl@familyhealth.com.au) wrote: >> >>>PostgreSQL has replication, but not partitioning (which is what you want). >> >>It doesn't have multi-server partitioning.. It's got partitioning >>within a single server (doesn't it? I thought it did, I know it was >>discussed w/ the guy from Cox Communications and I thought he was using >>it :). > > > No, PostgreSQL doesn't support any kind of partitioning, unless you > write it yourself. I think there's some work being done in this area, > though. Given my recent attempts at horizontal partitioning, I have to say that Postgres does not support it even if you try to write it yourself (see my post "horizontal partion"). Regards Gaetano Mendola
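For reference, the usual do-it-yourself approach to horizontal partitioning on a single 8.0 server was inheritance plus rewrite rules. A rough sketch with invented names, along with the limitations that lead to the conclusion above:

-- Illustrative sketch only: manual range "partitions" via inheritance.
CREATE TABLE directory (
    id      bigint NOT NULL,
    created date   NOT NULL
);

CREATE TABLE directory_2005_01 (
    CHECK (created >= '2005-01-01' AND created < '2005-02-01')
) INHERITS (directory);

CREATE TABLE directory_2005_02 (
    CHECK (created >= '2005-02-01' AND created < '2005-03-01')
) INHERITS (directory);

-- One rewrite rule per child to steer inserts to the right place.
CREATE RULE directory_ins_2005_01 AS
    ON INSERT TO directory
    WHERE new.created >= '2005-01-01' AND new.created < '2005-02-01'
    DO INSTEAD
    INSERT INTO directory_2005_01 VALUES (new.id, new.created);

-- SELECTs on "directory" see every child, but the 8.0 planner still scans
-- all of the children regardless of the WHERE clause, and none of this
-- spreads the data across more than one server, which is what was asked for.
SELECT count(*) FROM directory WHERE created >= '2005-01-15';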