Thread: Need help in setting optimal configuration for a huge database.
Hi All,
I am Kishore, a freelance developer of J2EE applications.
We switched to PostgreSQL recently because of the advantages it has over commercial databases. All went well until recently, when we began working on an application that needs to maintain a huge database.
I am describing the problem we are facing below. Can you please take a look at the case and help me configure PostgreSQL?
We have only two tables of note: one contains 97% of the data and the other 2.8%. All the other tables together hold the remaining 0.2% and exist only to support these two big tables. Currently we have 9 million records in the first table and 0.2 million records in the second.
We need to insert into the bigger table almost every second, throughout the application's lifetime. In addition, we receive at least 200,000 records a day at a fixed time.
We are facing a critical situation because of the performance of the database. Even a basic query like select count(*) from bigger_table takes about 4 minutes to return.
The following is the system configuration.
Database: PostgreSQL 7.3
OS: Red Hat Linux
Processor: Athlon
Memory: 2 GB
We expect that at least 200 active connections will need to be maintained throughout the day.
Can any of you please suggest the best configuration to satisfy the above requirements?
Thanks in advance.
Thank you,
Kishore.
Re: Need help in setting optimal configuration for a huge database.
On Sun, Oct 23, 2005 at 02:45:25AM +0530, Kishore B wrote:
> Database : Postgresql 7.3

You definitely want to upgrade this if you can.

> Memory : 2 GB

For 2GB of RAM, your effective_cache_size (100000) is a bit low (try
doubling it), and sort_mem (2048) is probably a bit too low as well.

/* Steinar */
--
Homepage: http://www.sesse.net/
On 10/22/2005 04:15:25 PM, Kishore B wrote:
> Can any of you please suggest the best configuration to satisfy the above
> requirements?

You've increased shared memory buffers, told the kernel to allow more
shared memory (lots), and otherwise increased the parameters associated
with memory? If so you might want to post the relevant configs here.

If the basic tuning does not help enough you may want to upgrade to 8.0
as it has significant performance improvements.

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                -- Robert A. Heinlein
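For reference, the kind of settings Karl is alluding to would look roughly like the lines below on a dedicated 2 GB Linux box. The numbers are illustrative assumptions for discussion, not values taken from the thread:

  # /etc/sysctl.conf -- allow a larger shared memory segment (apply with `sysctl -p`)
  kernel.shmmax = 536870912        # e.g. 512 MB

  # postgresql.conf (7.3-era parameter names)
  shared_buffers = 30000           # in 8 kB pages, roughly 234 MB; must fit inside shmmax
  max_connections = 250            # headroom above the expected 200 active sessions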
Re: Need help in setting optimal configuration for a huge database.
Kishore B <kishorebh@gmail.com> writes:
> Even a basic query like select count(*) from bigger_table is
> taking about 4 minutes to return.

You do realize that "select count(*)" requires a full table scan in
Postgres?  It's never going to be fast.  If that's not where your
performance problem really is, you need to show us some of the actual
problem queries.  If it is, you should rethink why your application
needs an exact row count.

			regards, tom lane
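To confirm the full scan for yourself, and to gather the detail the list will ask for on the genuinely slow queries, EXPLAIN and EXPLAIN ANALYZE are the usual starting point (bigger_table is the table from the original post; the count query is just an example):

  -- Plan only: the count should show a Seq Scan over the whole table under the Aggregate node
  EXPLAIN SELECT count(*) FROM bigger_table;

  -- Actually executes the query and reports real timings per plan node
  EXPLAIN ANALYZE SELECT count(*) FROM bigger_table;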
[please send replies to the list, not to me directly]
On Sun, Oct 23, 2005 at 03:19:39AM +0530, Kishore B wrote:
> You definitely want to upgrade this if you can.
>
> > Memory : 2 GB
>
> We can move up to 12 GB if needed.
I was referring to your PostgreSQL version, not your RAM. More RAM is almost
always an improvement, but for your data set, 2GB sounds quite good. (700k
rows is not really a "huge database", BTW -- I've seen people here have
several billion rows a _day_.)
> For now, let us set the configuration parameters for 2GB.
> I failed to mention earlier, that we have a dedicated server for database.
> Can I set the effective_cache_size to 200000?
Yes, that should work fine.
> Can I set the sort_mem size to 4096?
This depends a bit on the queries you're running. Remember that up to this much
memory (sort_mem is measured in kB, so 4096 means 4 MB) can be allocated for
each and every sort you do.
Some tuning of your queries against this would probably be useful.
> Will the performance suffer, if I set these parameters too high?
Yes, you can easily run into allocating too much RAM with too high sort_mem,
which could kill your performance. Overestimating effective_cache_size is
AFAIK not half as bad, though -- it is merely a hint to the planner, it does
not actually allocate memory.
/* Steinar */
--
Homepage: http://www.sesse.net/
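Put together, the settings agreed on in this exchange would look something like the following in postgresql.conf (7.3 parameter names; treat the numbers as a starting point rather than a definitive recommendation):

  effective_cache_size = 200000    # in 8 kB pages, i.e. roughly 1.6 GB of expected OS cache;
                                   # a planner hint only, no memory is actually allocated
  sort_mem = 4096                  # in kB, i.e. 4 MB per sort/hash operation per backend;
                                   # with ~200 connections this can add up quickly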
On Sun, 23 Oct 2005, Kishore B wrote:

> We need to insert into the bigger table almost every second, throughout
> the application's lifetime. In addition, we receive at least 200,000
> records a day at a fixed time.
>
> We are facing a critical situation because of the performance of the
> database. Even a basic query like select count(*) from bigger_table is
> taking about 4 minutes to return.

Count(*) like that always scans the full table, but 4 minutes still sounds
like a lot. How often do you vacuum? Could it be that the disk is full of
garbage due to not enough vacuum? A query like this can help find bloat:

  SELECT oid::regclass, reltuples, relpages FROM pg_class ORDER BY 3 DESC;

I assume you do updates and deletes as well, and not just inserts?

--
/Dennis Björklund
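If vacuuming turns out to be the issue, note that 7.3 has no built-in autovacuum, so it has to be run by hand or scheduled externally. A minimal sketch, assuming a database named mydb (the name is a placeholder):

  -- From psql: reclaim dead row versions and refresh planner statistics for the big table
  VACUUM ANALYZE bigger_table;

  -- Or from cron, e.g. nightly at 03:00, for the whole database:
  -- 0 3 * * *  vacuumdb --analyze mydb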
> We are facing a critical situation because of the performance of the
> database. Even a basic query like select count(*) from
> bigger_table is taking about 4 minutes to return.

Several other replies have mentioned that COUNT() requires a full table scan, but this point can't be emphasized enough: Don't do it! People who are migrating from other environments (Oracle or MySQL) are used to COUNT(), MIN() and MAX() returning almost instantaneously, certainly on indexed columns. But for reasons that have something to do with transactions, these operations are unbelievably slow in PostgreSQL.

Here are the alternatives that I've learned.

COUNT() -- There is no good substitute. What I do is create a new column, "ROW_NUM" with an auto-incrementing sequence. Every time I insert a row, it gets a new value. Unfortunately, this doesn't work if you ever delete a row. The alternative is a more complex pair of triggers, one for insert and one for delete, that maintains the count in a separate one-row table. It's a nuisance, but it's a lot faster than doing a full table scan for every COUNT().

MIN() and MAX() -- These are surprisingly slow, because they seem to do a full table scan EVEN ON AN INDEXED COLUMN! I don't understand why, but happily there is an effective substitute:

  select mycolumn from mytable order by mycolumn limit 1;       -- same as MIN()

  select mycolumn from mytable order by mycolumn desc limit 1;  -- same as MAX()

For a large table, MIN or MAX can take 5-10 minutes, where the above "select..." replacements can return in one millisecond.

You should carefully examine your entire application for COUNT, MIN, and MAX, and get rid of them EVERYWHERE. This may be the entire source of your problem. It was in my case.

This is, in my humble opinion, the only serious flaw in PostgreSQL. I've been totally happy with it in every other way, and once I understood these shortcomings, my application runs faster than ever on PostgreSQL.

Craig
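For concreteness, the trigger pair Craig describes (maintaining the count in a separate one-row table) could be sketched as follows. The table and function names are invented for the example, and plpgsql must already be installed in the database:

  CREATE TABLE bigger_table_rowcount (n bigint NOT NULL);
  INSERT INTO bigger_table_rowcount VALUES (0);

  CREATE OR REPLACE FUNCTION rowcount_ins() RETURNS trigger AS '
  BEGIN
    UPDATE bigger_table_rowcount SET n = n + 1;
    RETURN NEW;
  END;
  ' LANGUAGE plpgsql;

  CREATE OR REPLACE FUNCTION rowcount_del() RETURNS trigger AS '
  BEGIN
    UPDATE bigger_table_rowcount SET n = n - 1;
    RETURN OLD;
  END;
  ' LANGUAGE plpgsql;

  CREATE TRIGGER bigger_table_count_ins AFTER INSERT ON bigger_table
    FOR EACH ROW EXECUTE PROCEDURE rowcount_ins();
  CREATE TRIGGER bigger_table_count_del AFTER DELETE ON bigger_table
    FOR EACH ROW EXECUTE PROCEDURE rowcount_del();

  -- Instead of SELECT count(*) FROM bigger_table:
  SELECT n FROM bigger_table_rowcount;

One caveat: every insert then updates the same counter row, so concurrent writers serialize on it; the delta-table variant described in the next reply avoids that contention.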
On Sun, Oct 23, 2005 at 09:31:44AM -0700, Craig A. James wrote:
> COUNT() -- There is no good substitute. What I do is create a new column,
> "ROW_NUM" with an auto-incrementing sequence. Every time I insert a row,
> it gets a new value. Unfortunately, this doesn't work if you ever delete a
> row. The alternative is a more complex pair of triggers, one for insert
> and one for delete, that maintains the count in a separate one-row table.
> It's a nuisance, but it's a lot faster than doing a full table scan for
> every COUNT().

This will sometimes give you wrong results if your transactions ever roll
back, for instance. The correct way to do it is to maintain a table of
deltas, and insert a new positive record every time you insert rows, and a
negative one every time you delete them (using a trigger, of course). Then
you can query it for SUM(). (To keep the table small, run a SUM() in a cron
job or such to combine the deltas.)

There has, IIRC, been talk of supporting fast (index-only) scans on
read-only (ie. archived) partitions of tables, but it doesn't look like
this is coming in the immediate future. I guess others know more than me
here :-)

> MIN() and MAX() -- These are surprisingly slow, because they seem to do a
> full table scan EVEN ON AN INDEXED COLUMN! I don't understand why, but
> happily there is an effective substitute:

They are slow because PostgreSQL has generalized aggregates, ie. MAX() gets
fed exactly the same data as SUM() would. PostgreSQL 8.1 (soon-to-be
released) can rewrite a MAX() or MIN() to an appropriate LIMIT form, though,
which solves the problem.

/* Steinar */
--
Homepage: http://www.sesse.net/
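A rough sketch of the delta-table approach Steinar describes; again, all object names are invented for the example:

  CREATE TABLE bigger_table_count_deltas (delta bigint NOT NULL);

  CREATE OR REPLACE FUNCTION record_count_delta() RETURNS trigger AS '
  BEGIN
    IF TG_OP = ''INSERT'' THEN
      INSERT INTO bigger_table_count_deltas VALUES (1);
      RETURN NEW;
    ELSE
      INSERT INTO bigger_table_count_deltas VALUES (-1);
      RETURN OLD;
    END IF;
  END;
  ' LANGUAGE plpgsql;

  CREATE TRIGGER bigger_table_track_count AFTER INSERT OR DELETE ON bigger_table
    FOR EACH ROW EXECUTE PROCEDURE record_count_delta();

  -- Current row count, without scanning bigger_table:
  SELECT COALESCE(SUM(delta), 0) AS row_count FROM bigger_table_count_deltas;

  -- Periodic consolidation (e.g. from cron) to keep the delta table small;
  -- writers to bigger_table will block on the lock until this commits
  BEGIN;
  LOCK TABLE bigger_table_count_deltas IN ACCESS EXCLUSIVE MODE;
  CREATE TEMP TABLE consolidated AS
    SELECT COALESCE(SUM(delta), 0) AS delta FROM bigger_table_count_deltas;
  DELETE FROM bigger_table_count_deltas;
  INSERT INTO bigger_table_count_deltas SELECT delta FROM consolidated;
  DROP TABLE consolidated;
  COMMIT;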
On Sunday, 23-10-2005 at 09:31 -0700, Craig A. James wrote:
<cut>
> MIN() and MAX() -- These are surprisingly slow, because they seem to do a
> full table scan EVEN ON AN INDEXED COLUMN!

In 8.1 this is no longer true, see the changelog.

> I don't understand why, but happily there is an effective substitute:
>
> select mycolumn from mytable order by mycolumn limit 1; -- same as MIN()
>
> select mycolumn from mytable order by mycolumn desc limit 1; -- same as MAX()

In 8.1 these queries are equivalent:

  select mycolumn from mytable order by mycolumn limit 1;
  select min(mycolumn) from mytable;

--
Tomasz Rybak <bogomips@post.pl>
"Craig A. James" <cjames@modgraph-usa.com> writes: > MIN() and MAX() -- These are surprisingly slow, because they seem to do a full table scan EVEN ON AN INDEXED COLUMN! Idon't understand why, but happily there is an effective substitute: > select mycolumn from mytable order by mycolumn limit 1; -- same as MIN() > select mycolumn from mytable order by mycolumn desc limit 1; -- same as MAX() BTW, Postgres does know to do that for itself as of 8.1. regards, tom lane