Thread: Need help in setting optimal configuration for a huge database.

Need help in setting optimal configuration for a huge database.

From
Kishore B
Date:

Hi All,

 

I am Kishore, a freelance developer of J2EE applications.

 

We recently switched to PostgreSQL because of the advantages it has over commercial databases. All went well until we began working on an application that needs to maintain a huge database.

 

I have described the problem we are facing below. Can you please take a look at the case and help me configure PostgreSQL?

 

We have two main tables: one contains 97% of the data and the other 2.8%. All other tables together hold only the remaining 0.2% of the data and exist to support these two big tables. Currently we have 9 million records in the first table and 0.2 million records in the second.

We need to insert into the bigger table almost every second, throughout the application's lifetime. In addition, we receive at least 200,000 records a day at a fixed time.

We are facing a critical situation because of the performance of the database. Even a basic query like select count(*) from bigger_table is taking about 4 minutes to return.

The following is the system configuration.

Database : PostgreSQL 7.3
OS : Red Hat Linux
Processor : Athlon
Memory : 2 GB

We expect that at least 200 active connections will need to be maintained throughout the day.

Can any of you please suggest the best configuration to satisfy the above requirements?

Thanks in advance.

 

Thank you,

Kishore.

 


Re: Need help in setting optimal configuration for a huge database.

From
"Steinar H. Gunderson"
Date:
On Sun, Oct 23, 2005 at 02:45:25AM +0530, Kishore B wrote:
> Database : PostgreSQL 7.3

You definitely want to upgrade this if you can.

> Memory : 2 GB

For 2GB of RAM, your effective_cache_size (100000) is a bit low (try doubling
it), and sort_mem (2048) is probably a bit too low as well.
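
As a rough sketch for a 2GB box (these numbers are only a starting point, not a
recommendation), you could experiment at the session level before putting
anything into postgresql.conf:

   -- illustrative values only; make them permanent in postgresql.conf and reload
   SET effective_cache_size = 200000;   -- in 8kB pages, roughly 1.6GB
   SET sort_mem = 4096;                 -- in kB, per sort operation
   SHOW effective_cache_size;
   SHOW sort_mem;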

/* Steinar */
--
Homepage: http://www.sesse.net/


Re: Need help in setting optimal configuration for a huge database.

From
"Karl O. Pinc"
Date:
On 10/22/2005 04:15:25 PM, Kishore B wrote:

> Can any of you please suggest the best configuration to satisfy the above
> requirements?

Have you increased shared memory buffers, told the kernel
to allow more shared memory (lots), and otherwise increased
the parameters associated with memory?

If so, you might want to post the relevant configs
here.

If the basic tuning does not help enough you may
want to upgrade to 8.0 as it has significant
performance improvements.

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                  -- Robert A. Heinlein


Re: Need help in setting optimal configuration for a huge database.

From
"Steinar H. Gunderson"
Date:
[please send replies to the list, not to me directly]

On Sun, Oct 23, 2005 at 03:19:39AM +0530, Kishore B wrote:
> You definitely want to upgrade this if you can.
>
> > Memory : 2 GB
>
> We can move up to 12 GB if need be.

I was referring to your PostgreSQL version, not your RAM. More RAM is almost
always an improvement, but for your data set, 2GB sounds quite good. (700k
rows is not really a “huge database”, BTW -- I've seen people here have
several billion rows a _day_.)

>  For now, let us set the configuration parameters for 2GB.
> I failed to mention earlier, that we have a dedicated server for database.
>  Can I set the effective_cache_size to 200000?

Yes, that should work fine.

> Can I set the sort_mem size to 4096?

This depends a bit on the queries you're running. Remember that for each and
every sort you do, up to this much memory (the value is in kilobytes) can get
allocated. Some tuning of your queries against this would probably be useful.
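
As a rough back-of-the-envelope illustration (the figures are assumptions, not
measurements): if all 200 of the connections you expect were each running one
sort at sort_mem = 4096, that alone could ask for roughly 200 x 4 MB = 800 MB,
and queries with several sorts apiece multiply that further, so the setting has
to be chosen with concurrency in mind.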

>   Will the performance suffer, if I set these parameters too high?

Yes, you can easily run into allocating too much RAM with too high sort_mem,
which could kill your performance. Overestimating effective_cache_size is
AFAIK not half as bad, though -- it is merely a hint to the planner, it does
not actually allocate memory.

/* Steinar */
--
Homepage: http://www.sesse.net/


Re: Need help in setting optimal configuration for a huge database.

From
Tom Lane
Date:
Kishore B <kishorebh@gmail.com> writes:
> Even a basic query like select count(*) from bigger_table is
> taking about 4 minutes to return.

You do realize that "select count(*)" requires a full table scan in
Postgres?  It's never going to be fast.

If that's not where your performance problem really is, you need to
show us some of the actual problem queries.  If it is, you should
rethink why your application needs an exact row count.
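
For example, something along these lines is what's useful to post (the column
name and WHERE clause here are invented placeholders; substitute one of your
real slow queries):

   EXPLAIN ANALYZE
   SELECT * FROM bigger_table WHERE some_timestamp_col > '2005-10-01';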

            regards, tom lane

Re: Need help in setting optimal configuration for a huge database.

From
Kishore B
Date:
Hi Gunderson,
 
> Can I set the effective_cache_size to 200000?
> Yes, that should work fine.

 
Do you mean that I can set effective_cache_size to about 1.5 GB out of the 2 GB of memory in the current system?
 
Can I set sort_mem to 3072? We need to generate reports that make heavy use of GROUP BY and ORDER BY clauses.
 
Please note further that we need to execute up to 10 data-centric queries at any given instant. Based on the 2 GB of available memory and these requirements, how would you allocate memory across the following configuration parameters?
 
shared_buffers       (current setting: 48000, i.e. 375 MB)
sort_mem             (current setting: 2048 kB, i.e. 2 MB)
effective_cache_size (current setting: 100000, i.e. about 800 MB)
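
For reference, the live values can be double-checked from psql:

   SHOW shared_buffers;
   SHOW sort_mem;
   SHOW effective_cache_size;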
 


 


Re: Need help in setting optimal configuration for a huge database.

From
Dennis Bjorklund
Date:
On Sun, 23 Oct 2005, Kishore B wrote:

> We need to insert into the bigger table almost every second, throughout the
> application's lifetime. In addition, we receive at least 200,000 records a
> day at a fixed time.
>
> We are facing a critical situation because of the performance of the
> database. Even a basic query like select count(*) from bigger_table is
> taking about 4 minutes to return.

Count(*) like that always scans the full table, but 4 minutes still sounds
like a lot.  How often do you vacuum? Could it be that the disk is full of
garbage due to not enough vacuuming?

A query like this can help find bloat:

 SELECT oid::regclass, reltuples, relpages FROM pg_class ORDER BY 3 DESC;

I assume you do updates and deletes as well, and not just inserts?
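
If the big table has not been vacuumed regularly, something along these lines
is worth trying (VACUUM FULL reclaims space more aggressively but takes an
exclusive lock, so run it in a quiet period):

 VACUUM ANALYZE bigger_table;
 -- VACUUM FULL bigger_table;   -- only if the table is badly bloated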

--
/Dennis Björklund


Re: Need help in setting optimal configuration for a huge database.

From
"Craig A. James"
Date:
> We are facing a critical situation because of the performance of the
> database. Even a basic query like select count(*) from
> bigger_table is taking about 4 minutes to return.

Several other replies have mentioned that COUNT() requires a full table scan, but this point can't be emphasized
enough: Don't do it!  People who are migrating from other environments (Oracle or MySQL) are used to COUNT(), MIN() and
MAX() returning almost instantaneously, certainly on indexed columns.  But for reasons that have something to do with
transactions, these operations are unbelievably slow in PostgreSQL.

Here are the alternatives that I've learned.

COUNT() -- There is no good substitute.  What I do is create a new column, "ROW_NUM" with an auto-incrementing
sequence.  Every time I insert a row, it gets a new value.  Unfortunately, this doesn't work if you ever delete a row.
The alternative is a more complex pair of triggers, one for insert and one for delete, that maintains the count in a
separate one-row table.  It's a nuisance, but it's a lot faster than doing a full table scan for every COUNT().
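
A rough sketch of that one-row-table idea (all names here are invented, it assumes the plpgsql language
is installed in the database, and it uses a single trigger with TG_OP instead of a separate pair):

   -- one-row table holding the current count
   CREATE TABLE bigger_table_count (n bigint NOT NULL);
   INSERT INTO bigger_table_count VALUES (0);

   CREATE FUNCTION maintain_count() RETURNS trigger AS '
   BEGIN
       IF TG_OP = ''INSERT'' THEN
           UPDATE bigger_table_count SET n = n + 1;
       ELSE
           UPDATE bigger_table_count SET n = n - 1;
       END IF;
       RETURN NULL;   -- return value is ignored for AFTER triggers
   END;
   ' LANGUAGE plpgsql;

   CREATE TRIGGER bigger_table_count_trig
       AFTER INSERT OR DELETE ON bigger_table
       FOR EACH ROW EXECUTE PROCEDURE maintain_count();

   SELECT n FROM bigger_table_count;   -- instead of count(*)

(Every insert and delete then contends for that single row, which is part of the nuisance.)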

MIN() and MAX() -- These are surprisingly slow, because they seem to do a full table scan EVEN ON AN INDEXED COLUMN!
I don't understand why, but happily there is an effective substitute:

   select mycolumn from mytable order by mycolumn limit 1;  -- same as MIN()

   select mycolumn from mytable order by mycolumn desc limit 1;  -- same as MAX()

For a large table, MIN or MAX can take 5-10 minutes, where the above "select..." replacements can return in one
millisecond.

You should carefully examine your entire application for COUNT, MIN, and MAX, and get rid of them EVERYWHERE.  This may
be the entire source of your problem.  It was in my case.

This is, in my humble opinion, the only serious flaw in PostgreSQL.  I've been totally happy with it in every other
way, and once I understood these shortcomings, my application runs faster than ever on PostgreSQL.

Craig

Re: Need help in setting optimal configuration for a huge database.

From
"Steinar H. Gunderson"
Date:
On Sun, Oct 23, 2005 at 09:31:44AM -0700, Craig A. James wrote:
> COUNT() -- There is no good substitute.  What I do is create a new column,
> "ROW_NUM" with an auto-incrementing sequence.  Every time I insert a row,
> it gets a new value.  Unfortunately, this doesn't work if you ever delete a
> row.  The alternative is a more complex pair of triggers, one for insert
> and one for delete, that maintains the count in a separate one-row table.
> It's a nuisance, but it's a lot faster than doing a full table scan for
> every COUNT().

This will sometimes give you wrong results if your transactions ever roll
back, for instance. The correct way to do it is to maintain a table of
deltas, and insert a new positive record every time you insert rows, and a
negative one every time you delete them (using a trigger, of course). Then
you can query it for SUM(). (To keep the table small, run a SUM() in a cron
job or such to combine the deltas.)
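
A minimal sketch of the delta variant (names invented again; only the trigger
body differs from the one-row sketch earlier in the thread):

   CREATE TABLE bigger_table_deltas (delta integer NOT NULL);

   CREATE FUNCTION record_delta() RETURNS trigger AS '
   BEGIN
       IF TG_OP = ''INSERT'' THEN
           INSERT INTO bigger_table_deltas VALUES (1);
       ELSE
           INSERT INTO bigger_table_deltas VALUES (-1);
       END IF;
       RETURN NULL;
   END;
   ' LANGUAGE plpgsql;

   CREATE TRIGGER bigger_table_delta_trig
       AFTER INSERT OR DELETE ON bigger_table
       FOR EACH ROW EXECUTE PROCEDURE record_delta();

   SELECT sum(delta) FROM bigger_table_deltas;   -- current count, no scan of bigger_table
   -- a cron job can periodically collapse the deltas into a single summary row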

There has, IIRC, been talk of supporting fast (index-only) scans on
read-only (i.e. archived) partitions of tables, but it doesn't look like this
is coming in the immediate future. I guess others know more than me here :-)

> MIN() and MAX() -- These are surprisingly slow, because they seem to do a
> full table scan EVEN ON AN INDEXED COLUMN!  I don't understand why, but
> happily there is an effective substitute:

They are slow because PostgreSQL has generalized aggregates, ie. MAX() gets
fed exactly the same data as SUM() would. PostgreSQL 8.1 (soon-to-be
released) can rewrite a MAX() or MIN() to an appropriate LIMIT form, though,
which solves the problem.

/* Steinar */
--
Homepage: http://www.sesse.net/


Re: Need help in setting optimal configuration for a huge database.

From
Tomasz Rybak
Date:
On Sunday, 23-10-2005, at 09:31 -0700, Craig A. James wrote:
<cut>
> MIN() and MAX() -- These are surprisingly slow, because they seem to do a full table scan EVEN ON AN INDEXED COLUMN!
In 8.1 this is no longer true; see the changelog.

> I don't understand why, but happily there is an effective substitute:
>
>    select mycolumn from mytable order by mycolumn limit 1;  -- same as MIN()
>
>    select mycolumn from mytable order by mycolumn desc limit 1;  -- same as MAX()

In 8.1 these queries are equivalent:

select mycolumn from mytable order by mycolumn limit 1;
select min(mycolumn) from mytable;

--
Tomasz Rybak <bogomips@post.pl>


Re: Need help in setting optimal configuration for a huge database.

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> MIN() and MAX() -- These are surprisingly slow, because they seem to do a full table scan EVEN ON AN INDEXED COLUMN!
> I don't understand why, but happily there is an effective substitute:

>    select mycolumn from mytable order by mycolumn limit 1;  -- same as MIN()

>    select mycolumn from mytable order by mycolumn desc limit 1;  -- same as MAX()

BTW, Postgres does know to do that for itself as of 8.1.

            regards, tom lane

Re: Need help in setting optimal configuration for a huge database.

From
Kishore B
Date:
Hi Craig,
 
Thank you very much for your response.
 
It really covered a great point.
 
Thank you,
Kishore.

 