Thread: Re: Performance for relative large DB

Re: Performance for relative large DB

From

Chris Browne

Date:

23 August 2005, 13:16:10

"tobbe" <tobbe@tripnet.se> writes:
> The company that I'm working for are surveying the djungle of DBMS
> since we are due to implement the next generation of our system.
>
> The companys buissnes is utilizing the DBMS to store data that are
> accessed trough the web at daytime (only SELECTs, sometimes with joins,
> etc). The data is a collection of bjects that are for sale. The data
> consists of basic text information about theese togheter with some
> group information, etc.
>
> The data is updated once every night.

How much data is updated per night?  The whole 4M "posts"?  Or just
some subset?

> There are about 4 M posts in the database (one table) and is
> expected to grow with atleast 50% during a reasonable long time.

So you're expecting to have ~6M entries in the 'posts' table?

> How well would PostgreSQL fit our needs?
>
> We are using Pervasive SQL today and suspect that it is much to small.
> We have some problems with latency. Esp. when updating information,
> complicated conditions in selects and on concurrent usage.

If you're truly updating all 4M/6M rows each night, *that* would turn
out to be something of a bottleneck, as every time you update a tuple,
this creates a new copy, leaving the old one to be later cleaned away
via VACUUM.

That strikes me as unlikely: I expect instead that you update a few
thousand or a few tens of thousands of entries per day, in which case
the "vacuum pathology" won't be a problem.

I wouldn't expect PostgreSQL to be "too small;" it can and does cope
well with complex queries.

And the use of MVCC allows there to be a relatively minimal amount of
locking done even though there may be a lot of concurrent users, the
particular merit there being that you can essentially eliminate most
read locks.  That is, you can get consistent reports without having to
lock rows or tables.

One table with millions of rows isn't that complex a scenario :-).
--
output = ("cbbrowne" "@" "cbbrowne.com")
http://cbbrowne.com/info/spiritual.html
Appendium to  the Rules  of the  Evil Overlord #1:  "I will  not build
excessively integrated  security-and-HVAC systems. They  may be Really
Cool, but are far too vulnerable to breakdowns."

Re: Performance for relative large DB

From

Chris Browne

Date:

24 August 2005, 13:43:53

"tobbe" <tobbe@tripnet.se> writes:
> Hi Chris.
>
> Thanks for the answer.
> Sorry that i was a bit unclear.
>
> 1) We update around 20.000 posts per night.

No surprise there; I would have been surprised to see 100/nite or
6M/nite...

> 2) What i meant was that we suspect that the DBMS called PervasiveSQL
> that we are using today is much to small. That's why we're looking for
> alternatives.
>
> Today we base our solution much on using querry-specific tables created
> at night, so instead of doing querrys direct on the "post" table (with
> 4-6M rows) at daytime, we have the data pre-aligned in several much
> smaller tables. This is just to make the current DBMS coop with our
> amount of data.
>
> What I am particulary interested in is if we can expect to run all our
> select querrys directly from the "post" table with PostgreSQL.

Given a decent set of indices, I'd expect that to work OK...  Whether
4M or 6M rows, that's pretty moderate in size.

If there are specific states that rows are in which are "of interest,"
then you can get big wins out of having partial indices...  Consider...

create index partial_post_status on posts where status in ('Active', 'Pending', 'Locked');
-- When processing of postings are completely finished, they wind up with 'Closed' status

We have some 'stateful' tables in our environment where the
interesting states are 'P' (where work is "pending") and 'C' (where
all the work has been completed and the records are never of interest
again except as ancient history); the partial index "where status =
'P'" winds up being incredibly helpful.

It's worth your while to dump data out from Pervasive and load it into
a PostgreSQL instance and to do some typical sorts of queries on the
PostgreSQL side.

Do "EXPLAIN ANALYZE [some select statement];" and you'll get a feel
for how PostgreSQL is running the queries.

Fiddling with indices to see how that affects things will also be a
big help.

You may find there are columns with large cardinalities (quite a lot
of unique values) where you want to improve the stats analysis via...

  alter posts alter column [whatever] set statistics 100;
           -- Default is 10 bins
  analyze posts;
           -- then run ANALYZE to update statistics

> 3) How well does postgres work with load balancing environments. Is
> it built-in?

Load balancing means too many things.  Can you be more specific about
what you consider it to mean?

For Internet registry operations, we use replication (Slony-I) to
create replicas used to take particular sorts of load off the "master"
systems.

But you might be referring to something else...

For instance, connection pools, whether implemented inside
applications (everyone doing Java has one or more favorite Java
connection pool implementations) or in web servers (Apache has a DB
connection pool manager) or in an outside application (pgpool, a
C-based connection pool manager) are also sometimes used for load
balancing.
--
(reverse (concatenate 'string "gro.mca" "@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/postgresql.html
In case you weren't aware, "ad homineum" is not latin for "the user of
this technique is a fine debater." -- Thomas F. Burdick

Re: Performance for relative large DB

From

"tobbe"

Date:

27 August 2005, 02:11:19

Hi Chris.

Thanks for the answer.
Sorry that i was a bit unclear.

1) We update around 20.000 posts per night.

2) What i meant was that we suspect that the DBMS called PervasiveSQL
that we are using today is much to small. That's why we're looking for
alternatives.

Today we base our solution much on using querry-specific tables created
at night, so instead of doing querrys direct on the "post" table (with
4-6M rows) at daytime, we have the data pre-aligned in several much
smaller tables. This is just to make the current DBMS coop with our
amount of data.

What I am particulary interested in is if we can expect to run all our
select querrys directly from the "post" table with PostgreSQL.

3) How well does postgres work with load balancing environments. Is it
built-in?

Best Regards
Robert Bengtsson
Project Manager

Re: Performance for relative large DB

From

Chris Travers

Date:

27 August 2005, 03:30:16

tobbe wrote:

>Hi Chris.
>
>Thanks for the answer.
>Sorry that i was a bit unclear.
>
>1) We update around 20.000 posts per night.
>
>2) What i meant was that we suspect that the DBMS called PervasiveSQL
>that we are using today is much to small. That's why we're looking for
>alternatives.
>
>Today we base our solution much on using querry-specific tables created
>at night, so instead of doing querrys direct on the "post" table (with
>4-6M rows) at daytime, we have the data pre-aligned in several much
>smaller tables. This is just to make the current DBMS coop with our
>amount of data.
>
>What I am particulary interested in is if we can expect to run all our
>select querrys directly from the "post" table with PostgreSQL.
>
>
20k transactions per day?  Doesn't seem too bad.  That amounts to how
many transactions per second during peak times?  Personally I don't
think it will be a problem, but you might want to clarify what sort of
load you are expecting during its peak time.

>3) How well does postgres work with load balancing environments. Is it
>built-in?
>
>
There is no load balancing "built in."  You would need to use Slony-I
and possibly Pg-Pool for that.  I don't know about Pg-Pool, but Slony-I
was written in large part by member(s?) of the core development team so
even if it is not "built in" it is not as if it is a team of outsiders
who wrote it.

If you need something proprietary, there are similar solutions with
replication built in which are based on PostgreSQL and licensed under
proprietary licenses.

Best Wishes,
Chris Travers
Metatron Technology Consulting

Re: Performance for relative large DB

From

"Jim C. Nasby"

Date:

29 August 2005, 18:26:13

On Tue, Aug 23, 2005 at 11:25:02PM -0700, tobbe wrote:
> Hi Chris.
>
> Thanks for the answer.
> Sorry that i was a bit unclear.
>
> 1) We update around 20.000 posts per night.
Doesn't seem like a lot at all.

> 2) What i meant was that we suspect that the DBMS called PervasiveSQL
> that we are using today is much to small. That's why we're looking for
> alternatives.

Just so no one gets confused, PervasiveSQL is our Btrieve-based
database; it has nothing to do with Pervasive Posgres or PosgreSQL.
Also, feel free to contact me off-list if you'd like our help with this.

> Today we base our solution much on using querry-specific tables created
> at night, so instead of doing querrys direct on the "post" table (with
> 4-6M rows) at daytime, we have the data pre-aligned in several much
> smaller tables. This is just to make the current DBMS coop with our
> amount of data.
>
> What I am particulary interested in is if we can expect to run all our
> select querrys directly from the "post" table with PostgreSQL.

Probably, depending on what those queries are, what hardware you have
and how the table is laid out. Unless you've got a really high query
load I suspect you could handle this on some fairly mundane hardware...

> 3) How well does postgres work with load balancing environments. Is it
> built-in?

As Chris said, there is no built-in solution. PGCluster
(http://pgfoundry.org/projects/pgcluster/) is a possible solution should
you need clustering/load balancing, but as I mentioned I suspect you
should be ok without it.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software        http://pervasive.com        512-569-9461