Thread: how good is PostgreSQL

how good is PostgreSQL

From
"Arnold Gamboa"
Date:
Hi,

For users of large PostgreSQL installations and PostgreSQL developers, this is for you.

I'm having a terrible time deciding now. :(

We're about to build a "huge" website now.  I got tied up in signing the
contract without really getting enough information about PgSQL, since this is
what we plan to implement with PHP (normally we use MySQL, but I guess it
does not fit huge databases like that).

Here's my problem: we're about to build a site like hitbox.com, which requires
a large amount of database work.  If, say, there are 100,000 users with
1,000 page hits per day each, and everything is logged, you can imagine how
huge this will be.  I'm just so "nervous" (really, that's the term) that we'll
implement this and later on experience a slowdown or, worse than that, a
crash of the server.

My questions are:
1. What is the limit on the number of records in a table WITHOUT SUFFERING A
SLOWDOWN?
2. ... the limit on the number of tables per database?
3. ... the limit on the number of databases?

Thanks for your comments.  I would really appreciate every comment I
receive regarding this.

Arnold


Re: how good is PostgreSQL

From
Gunnar Rønning
Date:
"Arnold Gamboa" <arnold@php4us.com> writes:

> We're about to build a "huge" website now.  I got tied up in signing the
> contract without really getting enough information about PgSQL since this
> what we plan to implement with PHP (normally we use mySQL but i guess it
> does not fit for huge databases like that).

Can you do connection pooling and client-side caching of database queries
in PHP? From working with Java, this is where we really improve speed.
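
I don't know the PHP API well, but the kind of thing I mean looks roughly
like this untested sketch (the connection string, table, and column names
are only illustrations):

<?php
// Untested sketch: a persistent backend connection plus a per-request memo
// of query results, so one page never runs the same SELECT twice.
$conn = pg_pconnect("dbname=site user=www");   // reused across requests under mod_php

$cache = array();                              // lives only for this request

function cached_query($conn, $sql) {
    global $cache;
    if (isset($cache[$sql]))                   // already ran this exact query?
        return $cache[$sql];
    $res  = pg_exec($conn, $sql);
    $rows = array();
    for ($i = 0; $i < pg_numrows($res); $i++)
        $rows[] = pg_fetch_array($res, $i);
    $cache[$sql] = $rows;
    return $rows;
}

$users = cached_query($conn, "SELECT id, name FROM users WHERE active = 't'");
?>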

>
> Here's my problem.. We're about to build a site like hitbox.com where there
> is a large amount of database required.. If say there is 100,000 users with
> 1000 page hits per day for each, and everything will be logged, you could
> imagine how huge this will be.  I'm just so "nervous" (really, that's the
> term) if we implement this and later on experience a slow down or worse than
> that, crash in the server.

How many database queries do you have per page hit?
How many database inserts/updates do you have per page hit?

Are you using the database for httpd access logging, or is it some
application-level logging? Either way, you might want to look into an
architecture where you have a dedicated box for the logging.

But most important, test with real data. Populate your database and run
stress tests.

I was doing some testing on a portal my company has developed with
PostgreSQL as the backend database. Running on my Linux laptop (a P466 with
128MB of RAM, Apache JServ, PostgreSQL 7.0.2), I managed to get about 20
page views a second. Each page view had on average 4 queries and 1 insert.

But measure for yourself. Remember that you can gain a lot by tuning the
application, the database, and the OS.
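
If you want a crude starting point for such a stress test, something like
this PHP loop is the idea (untested; the table names and query mix are
invented - substitute your real schema and run it from several clients at
once):

<?php
// Crude stress loop: time N simulated page views, each doing the same mix
// of work as a real page would (here 4 selects and 1 insert).
$conn  = pg_connect("dbname=test user=www");
$n     = 10000;
$start = time();
for ($i = 0; $i < $n; $i++) {
    pg_exec($conn, "SELECT * FROM articles WHERE id = " . ($i % 500));
    pg_exec($conn, "SELECT * FROM users WHERE id = " . ($i % 100));
    pg_exec($conn, "SELECT count(*) FROM comments WHERE article_id = " . ($i % 500));
    pg_exec($conn, "SELECT * FROM ads WHERE id = " . ($i % 20));
    pg_exec($conn, "INSERT INTO hits (user_id, article_id, stamp) VALUES ("
                   . ($i % 100) . ", " . ($i % 500) . ", now())");
}
$elapsed = time() - $start;
if ($elapsed < 1) $elapsed = 1;                // time() only has 1-second resolution
echo ($n / $elapsed) . " simulated page views per second\n";
?>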

regards,

    Gunnar

Re: how good is PostgreSQL

From
markw
Date:
Arnold Gamboa wrote:

> Hi,
>
> For users of large PostgreSQL and PostgreSQL builders, this is for you.
>
> I'm having a terrible time deciding now. :(
>
> We're about to build a "huge" website now.  I got tied up in signing the
> contract without really getting enough information about PgSQL since this
> what we plan to implement with PHP (normally we use mySQL but i guess it
> does not fit for huge databases like that).
>
> Here's my problem.. We're about to build a site like hitbox.com where there
> is a large amount of database required.. If say there is 100,000 users with
> 1000 page hits per day for each, and everything will be logged, you could
> imagine how huge this will be.  I'm just so "nervous" (really, that's the
> term) if we implement this and later on experience a slow down or worse than
> that, crash in the server.

That is a LOT of work for any system: 100,000 users at 1,000 hits each is
100 million hits a day, which averages out to over 1,100 page views a second,
or under 900 microseconds each.  A standard Pentium III system serving static
pages would have problems with that.

If you look at search engines, to get that performance with read-only data,
they usually cluster multiple systems and load balance across them. You may
need to segment your data and have multiple SQL servers perform different
functions.

Also, that 1,100 page views per second assumes an even distribution of
traffic, which does not happen on a web server. If you average that much,
chances are there will be periods of twice that.

Look into a Cisco LocalDirector, an Alteon switch, or even LVS (Linux
Virtual Server).

>
>
> My questions are:
> 1. What is the limit for number of records in a table WITHOUT SUFFERING SLOW
> DOWN.

> 2. ....limit in number of tables per database
> 3. ... limit in number of database.

There are a couple of factors involved; it's more complex than a simple
answer.

Use multiple databases and put each on a separate disk, with its own
controller. Better yet, have multiple load-balanced web boxes do a lot of
the processing in PHP, offloading much of the CPU-bound SQL work to the
"cheap" web boxes, and have multiple SQL databases in the back handling
various independent tasks.

In a web site I worked on, we had multiple front end web servers, load balanced
with an Alteon. Each web server had its own SQL database which provided SQL
access to "static" data which was updated each week.  We had an additional
single SQL database backend which all the Web servers accessed for synchronized
dynamic data.

If you are serious about the load you expect to put on this system, you must
be careful:
Do not create any indexes you do not need.
Do not use the "foreign key" constraint, as it forces a trigger for each insert.

Make sure you index the keys by which you will access data.
Avoid searching by strings; try to use keys.
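
For example (untested; the hit-log table and column names are invented, but
this is the pattern):

<?php
$conn = pg_connect("dbname=site user=www");

// Index the integer key you will actually look rows up by...
pg_exec($conn, "CREATE INDEX hits_user_id_idx ON hits (user_id)");

// ...so that lookups like this can use the index:
$res = pg_exec($conn, "SELECT count(*) FROM hits WHERE user_id = 12345");

// ...whereas an unanchored string search like this forces a full table scan:
$res = pg_exec($conn, "SELECT count(*) FROM hits WHERE browser LIKE '%Mozilla%'");
?>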

Even after that, you have a long way to go before you will hit 1000
transactions per second from any SQL database.

If you are betting your business on this implementation, you have a lot of
homework to do.

>
>
> Thanks for you comments.  I would really appreciate every comment that I'll
> receive regarding this.
>
> Arnold


Re: how good is PostgreSQL

From
"Steve Wolfe"
Date:
> Even after that, you have a long way to go before you will hit 1000
> transactions per second from any SQL database.

   I guess they could always buy a few Sun E10000s for the backend, and a
large room full of rack-mountable PCs for web/CGI serving.  Nothing like
plopping down ten or twenty million dollars on hardware. : )

steve



Re: how good is PostgreSQL

From
Lamar Owen
Date:
Steve Wolfe wrote:
>
> > Even after that, you have a long way to go before you will hit 1000
> > transactions per second from any SQL database.

>    I guess they could always buy a few Sun E10000's on the backend, and a
> large room of rack-mountable PC's for web/CGI serving.  Nothing like
> plopping down ten or twenty million dollars on hardware. : )

Or they could buy a single IBM S/390 and run Linux/390 and PostgreSQL on
that. It would probably cost less and be more reliable.  And they can
always load another Linux/390 VM -- an S/390 can run something like
41,000 virtual machines, each running Linux/390 and Apache.

However, if you want to see the architecture of a _large_
database-backed website, see the story behind Digital City at
www.aolserver.com.  While they're using Sybase instead of PostgreSQL,
the architecture is the same.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Re: how good is PostgreSQL

From
"Steve Wolfe"
Date:
> Or they could buy a single IBM S/390, run Linux/390 and PostgreSQL on
> that. Probably would cost less, and be more reliable.  And they can
> always load another Linux/390 VM -- an S/390 can run something like
> 41,000 virtual machines each running Linux/390 and Apache.

  Yeah....  I'm very optimistic about IBM's new chips that are coming out
next year.  Each "processor module" will have 4 processors, but each
processor will have 2 cores - so in effect, each "processor module" has 8
processors on it.  All processors will have copper interconnects, and
depending on the source, will debut at anywhere from 1.3 to 2 gigahertz.  I
think that will certainly help them get a larger share of the high-end
market!

steve



Re: how good is PostgreSQL

From
"Steve Wolfe"
Date:
> Even after that, you have a long way to go before you will hit 1000
> transactions per second from any SQL database.

     Since my last post probably wasn't too useful, here's some information
that might be a little more help.  It's a little long, I know, but hopefully
it will be of use to someone.

     As programmers, we naturally want to throw things into databases for
three reasons.  First, it's easy to get data in.  Second, it's easy to get
relevant data out.  And third, it's "cool".  We don't want to work with flat
files, now do we?  ; )

     However, in some cases, using the database to get data out ends up
costing us a lot of time and money.  Sometimes we run the same nasty query so
often that we end up purchasing bigger hardware to make the system work
reasonably.  Why?  Because it was easier for us to write a program that did:

GetDataFromDatabase();
PrepareData();
PrintData();

     Each time, the database server does the work.  But it doesn't
necessarily have to be that way.  In our company, we've found two techniques
that have enabled us to save a LOT of processing power on our machines
(read: increase the capacity of our servers by 30% or more, with fairly
minor changes).

     The first case is that of rarely-changing data.  Some of our datasets
probably have around 50,000 to 1,000,000 views (selects) for each update
(insert/delete).  Having the database repeat the query every time is a
waste.  So, we began writing our programs such that they grab the data
from the database once and generate the HTML for every page, along with the
indexes.  Then, when an update is made to the database (via the
administrative tools), the program simply rewrites *the relevant HTML files*
and changes the indexes pointing to them.  (There are also some other very
large advantages to this sort of thing, but I'm not allowed to say what they
are. ; )  )
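
A stripped-down sketch of the idea (not our actual code; the paths, table,
and columns are invented for the example):

<?php
// Rebuild the static page for one category whenever the admin tool changes
// the data, instead of running the query on every page view.
$conn = pg_connect("dbname=site user=www");

function rebuild_category_page($conn, $cat_id) {
    $res  = pg_exec($conn, "SELECT id, title FROM products WHERE category = $cat_id ORDER BY title");
    $html = "<html><body><ul>\n";
    for ($i = 0; $i < pg_numrows($res); $i++) {
        $row   = pg_fetch_array($res, $i);
        $html .= "<li><a href=\"/product.php?id=" . $row["id"] . "\">"
               . $row["title"] . "</a></li>\n";
    }
    $html .= "</ul></body></html>\n";

    $fp = fopen("/www/htdocs/static/category-$cat_id.html", "w");  // served as a plain file
    fwrite($fp, $html);
    fclose($fp);
}

// The admin tool calls this right after its INSERT/UPDATE/DELETE on products:
rebuild_category_page($conn, 42);
?>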

       The second case is that of often-repeated queries.  One of the
offerings on our site is an online directory, which gets a pretty fair
amount of traffic.  Unfortunately, it uses a proprietary program that was
purchased by management before they spoke with us.  Grr....   It was the
most utterly inefficient program I've ever seen.  It would *not* allow the
database to do joins; it would grab entire tables, then try to do the joins
itself, in Perl.

       We rewrote the program to let PostgreSQL do the joins, and that sped
it up.   Then we realized that a very small number of queries (those for the
first one or two levels of pages) accounted for a huge portion of the
usage.  So, we replaced the front page with a static HTML page (the front
page doesn't change...), and saw another terrific drop in our system loads.


   Overall, by only modifying a couple of our more heavily-used programs,
our server loads dropped by about 30%-40%.  If we went to the trouble to
modify some others, it would drop even more.  But we're going to rewrite
them completely for other reasons. : )


   In any event, there are ways like this to save a LOT of CPU and disk I/O.
Most web servers can serve out several hundred static pages with the
resources that would otherwise deliver one dynamically-created,
database-driven page.  It also allows you to cluster the web servers with
cheap commodity hardware, instead of using big-iron on the database.  And if
you have a big-iron machine running the back-end, this can severely lighten
the load on it, keeping you from dropping a few hundred grand on the next
step up. ; )


   (Incidentally, we've toyed around with developing a query-caching system
that would sit between PostgreSQL and our DB libraries.  However, it seems
like it could be done *much* more efficiently in PostgreSQL itself, as it
would be much easier to keep track of which tables have changed, etc.
Anybody know if this sort of functionality is planned?  It would be terrific
to simply give the machine another 256 megs of RAM, and tell it to use it as
a DB cache...)

steve




Re: how good is PostgreSQL

From
"Kevin O'Gorman"
Date:
markw wrote:
>
> Arnold Gamboa wrote:
>
> > Hi,
> >
> > For users of large PostgreSQL and PostgreSQL builders, this is for you.

..snip..

>
> Also, that 1100 page view per second is assuming an even distribution of
> traffic, which does not happen in a web server. If you average that much,
> chances are there will be periods of twice that.
>

That's excessively optimistic.  If your daily average is 1,100 per second, you'll
have a 2,200 average for many of the hours in that day, 5,500 for a few hours, and
some 10-minute periods with 11,000, certainly once in a while.

++ kevin


--
Kevin O'Gorman  (805) 650-6274  mailto:kogorman@pacbell.net
Permanent e-mail forwarder:  mailto:Kevin.O'Gorman.64@Alum.Dartmouth.org
At school: mailto:kogorman@cs.ucsb.edu
Web: http://www.cs.ucsb.edu/~kogorman/index.html
Web: http://trixie.kosman.via.ayuda.com/~kevin/index.html

"There is a freedom lying beyond circumstance,
derived from the direct intuition that life can
be grounded upon its absorption in what is
changeless amid change"
   -- Alfred North Whitehead

Re: how good is PostgreSQL

From
KuroiNeko
Date:
>      As programmers, we naturally want to throw things into databases for
> three reasons. First, it's easy to get  data in. Second, it's easy to get
> relevant data  out. And third,  it's "cool". We  don't want to  work with
> flat files, now do we? ; )

 Kiddin', eh? :)  Actually, the third reason seems to dominate the younger
developers' minds. People often tend to keep everything in a poor DBMS until
it begins to kick back. And this has an impact on the customers. Does your
system use a database? No, why should it? You mean you'll keep our dearly
beloved banner ads as flat files? Yes, this is where they belong. Sorry,
we'll look for someone more advanced. Good luck.
 Of course, hardware vendors jump for joy :) Maybe I don't get it, but
IMHO there's no reason to put into the DB something that can't be indexed
and used in a WHERE clause.

> It would *not* allow the
> database to  do joins, it  would grab entire tables,  then try to  do the
> joins
> itself, in Perl.

 Umh.... Yeah....  Well.... To  keep compatibility  with other  Open Source
Databases and ESR/RMS, you know :)

>    (Incidentally,  we've toyed  around with  developping a  query-caching
> system that would sit betwen PostgreSQL and our DB libraries.

 Sounds amazing, but requires some research, I guess. However, in many
cases one would be more than happy with cached connections. Of course,
cached query results can be naturally added to that, but just connections
are OK to start with. Security....


--

 contaminated fish and microchips
  huge supertankers on Arabian trips
 oily propaganda from the leaders' lips
  all about the future
 there's people over here, people over there
  everybody's looking for a little more air
 crossing all the borders just to take their share
  planning for the future

 Rainbow, Difficult to Cure

Query caching

From
"Steve Wolfe"
Date:
> >    (Incidentally, we've toyed around with developing a query-caching
> > system that would sit between PostgreSQL and our DB libraries.
>
>  Sounds amazing, but requires some research, I guess. However, in many
> cases one would be more than happy with cached connections. Of course,
> cached query results can be naturally added to that, but just connections
> are OK to start with. Security....

    To me, it doesn't sound like it would be that difficult of a project, at
least not for the likes of the PostgreSQL developers.  It also doesn't seem
like it would really introduce any security problems, not if it were done
inside of PostgreSQL.  Long ago, I got sidetracked from my endeavors in C,
and so I don't feel that I'm qualified to do it.  (otherwise, I would have
done it already. : ) )   If you wanted it done in Perl or Object Pascal, I
could help. : )

    Here's a simple design that I was tossing back and forth.  Please
understand that I'm not saying this is the best way to do it, or even a good
way to do it.  Just a possible way to do it.  I haven't been able to give it
as much thought as I would like to.  Here goes.

------------
Implementation

    Upon starting, the PostgreSQL engine could allocate a chunk of memory,
sized according to the administrator's desire.  That chunk would be used
solely for query caching.

    When a query came in that was not cached (say, the first query), the
database engine would process it as normal.  It would then return the result
to the user and add it to the cache.  "Adding it to the cache" would mean
storing the query itself, the result set, and a list of which tables the
query relied upon.  The query that is stored could be either the query
coming from the user or the query after it goes through the optimizer.
Each has pros and cons; I would probably favor using the query that comes
from the user.

     When another query comes along, the caching engine would quickly look
in the hash table, and see if it already had the cached results of the
query.  If so, it returns them, and wham.  You've just avoided all of the
work of optimizing, parsing, and executing, not to mention the disk I/O.  A
hash lookup seems extremely cheap compared to the work of actually
processing a query.

     When an update/delete/insert comes along, the engine would analyze
which tables were affected, and clear the cache entries that relied upon
those tables.
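
To make the bookkeeping concrete, here's a toy version of the same idea done
at the application level in PHP (untested; in the engine the table list would
come from the parser rather than being passed in by hand, and every name here
is invented):

<?php
$conn        = pg_connect("dbname=site user=www");
$query_cache = array();   // sql text => cached result rows
$table_users = array();   // table name => list of cached sql strings that read it

function cache_select($conn, $sql, $tables) {
    global $query_cache, $table_users;
    if (isset($query_cache[$sql]))
        return $query_cache[$sql];             // hit: skip parse/plan/execute and the disk I/O
    $res  = pg_exec($conn, $sql);
    $rows = array();
    for ($i = 0; $i < pg_numrows($res); $i++)
        $rows[] = pg_fetch_array($res, $i);
    $query_cache[$sql] = $rows;
    foreach ($tables as $t)
        $table_users[$t][] = $sql;             // remember what to invalidate later
    return $rows;
}

function cache_write($conn, $sql, $tables) {
    global $query_cache, $table_users;
    $res = pg_exec($conn, $sql);               // run the insert/update/delete as usual
    foreach ($tables as $t) {                  // discard every cached query on those tables
        if (isset($table_users[$t])) {
            foreach ($table_users[$t] as $cached_sql)
                unset($query_cache[$cached_sql]);
            $table_users[$t] = array();
        }
    }
    return $res;
}

$rows = cache_select($conn, "SELECT id, title FROM products WHERE category = 3",
                     array("products"));
cache_write($conn, "UPDATE products SET title = 'New title' WHERE id = 7",
            array("products"));
?>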

-----------------
Cache Clearing

     Cache clearing would be achieved via an LRU-based algorithm, which
would also take into account the amount of RAM used by each query in the
cache.
-----------------
Performance Impact

   The potential performance differences range from a minuscule decrease to
a tremendous increase.  And it's a lot cheaper to throw an extra half gig of
RAM in a machine than to upgrade processors and disk subsystems!

------------------
Possible Changes

     One potential drawback is that when a table is modified, the queries
that rely upon it would be discarded.  Where a table is updated frequently,
that could greatly reduce the performance benefit.  One possible alternative
is to store the query cost with each query in the cache.  When a table is
updated, those queries are marked as "dirty".  If the system load is below a
certain amount, or the system has been idle, it could then re-execute those
queries and update the cache.  Which queries it re-executed would be
determined by a combination of query cost and how frequently those cache
entries were used.
-------------------

   The reason I would prefer it done in the PostgreSQL engine (as opposed to
in a middleware application) is that the caching engine needs to know (a)
which tables a query relies upon, and (b) which tables get changed.  It
seems that it would significantly reduce overhead to do those inside of
PostgreSQL (which is already doing the query parsing and analysis).

    This could certainly give PostgreSQL a huge advantage over other
database systems, too.   It could save administrators a very large chunk of
cash that they would otherwise have to spend on large systems.  And it would
just be cool.   ; )

steve



Re: how good is PostgreSQL

From
"Bryan White"
Date:
>      Whenever a query is executed (not found in cache, etc.), the caching
> system would simply store the query, the results, and a list of tables
> queried.   When a new query came in, it would do a quick lookup in the query
> hash to see if it already had the results.  If so, whammo. Whenever an
> insert/delete/update was sensed, it would look at the tables being affected,
> and the caching mechanism would clear out the entries depending on those
> tables.

It seems to me that tracking the list of cached queries and watching for
queries that might invalidate them adds a lot of complexity to the back end,
and the front end still has to establish the connection and wait for the
data to transfer over the socket.

On a more practical level, a backend solution would require someone with
fairly detailed knowledge of the internals of the backend.  A front-end
solution is more likely to be implemented by someone not as knowledgeable.

One of the big advantages of your technique is that there is no code change
at the application level.  This means less database lock-in.  Maybe that is
a disadvantage too. ;-)



Re: how good is PostgreSQL

From
"Steve Wolfe"
Date:
> It seems to me that tracking the list of cached queries and watching for
> queries that might invalidate them adds a lot of complexity to the back end,
> and the front end still has to establish the connection and wait for the
> data to transfer over the socket.

  I really don't think that it would.  Checking to see if you already have a
query cached (a hash lookup) is very, very cheap relative to normally
processing a query, I would think.

  And invalidating cache entries would also be very, very cheap compared to
the normal activity of the database.  Assuming hashes are done correctly, it
would probably be done much, much faster than any query could execute.  If
software caches can increase the performance of disk drives that have
latencies in thousandths of seconds, I'm sure they could help with queries
that take hundredths or tenths of seconds.  ; )

> On a more practical level, a backend solution would require someone with
> fairly detailed knowledge of the internals of the backend.  A front-end
> solution is more likely to be implemented by someone not as knowledgeable.

  Yeah.  I was hoping that one of the developers would say "oooh... that
would rock.  We should do that." : )

> One of the big advantages of your technique is there is no code change at
> the application level.  This means less database lock-in.  Maybe that is a
> disadvantage too. ;-)

  I'm sure that someone with a better understanding of the theory associated
with cache invalidation would design a better algorithm than I would, but it
seems that even a fairly rudimentary implementation would seriously increase
performance.

steve




Re: Query caching

From
KuroiNeko
Date:
>     Here's a  simple design  that I  was tossing  back and  forth. Please
> understand that I'm not  saying this is the best way to do  it, or even a
> good way to do it. Just a possible way to do it.

 Sounds interesting. I certainly have reasons to play bad guy, but that's
what I always do, so never mind :)
 However, there's one major point where I disagree. Not that I have real
reasons to, or observation or analysis to back my position, just a
feeling. And the feeling is that the connection/query cache should be
separate from the DBMS server itself.
 Several things come to mind right off, like the possibility of caching
connections to different sources, like PGSQL and Oracle, as well as the
chance to run this cache on a separate box that will perform various
additional functions, like load balancing. But that's right on the surface.
 Still in doubt....


--

 contaminated fish and microchips
  huge supertankers on Arabian trips
 oily propaganda from the leaders' lips
  all about the future
 there's people over here, people over there
  everybody's looking for a little more air
 crossing all the borders just to take their share
  planning for the future

 Rainbow, Difficult to Cure

Re: Query caching

From
"Steve Wolfe"
Date:
>  Sounds interesting. I certainly have reasons to play bad guy, but that's
> what I always do, so never mind :)

  That's OK.  Somebody has to be a realist. : )

>  However, there's one major point where I disagree. Not that I have real
> reasons to, or observation or analysis to back my position, just a
> feeling. And the feeling is that the connection/query cache should be
> separate from the DBMS server itself.
>  Several things come to mind right off, like the possibility of caching
> connections to different sources, like PGSQL and Oracle,

   That would be a benefit if you're running multiple DBMSs in house -
and you're certainly welcome to do something like that as a standalone
package.  ; )   I think it would be terrific if PostgreSQL could have the
feature added to it, which would (a) give it a big performance benefit, (b)
let it take advantage of already-written code, and (c) mean one less machine
and service to administer.

> as well as the
> chance to run this cache on a separate box that will perform various
> additional functions, like load balancing. But that's right on the surface.
>  Still in doubt....
>  Still in doubt....

  Yes, load-balancing would be another good factor.  However, to my (very
limited) knowledge, there aren't any truly good ways of splitting up
database work.  If you're doing nothing but selects, it would be easy.  But
when updates come around, it gets hairier - and when you try for dynamic
continuity-checking and database rebuilding, it gets very ugly.   If there
are any systems that get around those without huge performance hits, I'd
love to hear about them.

   (Of course, if you have lots of money, a Beowulf-style cluster with
high-bandwidth, low-latency interconnects becomes desirable.  But that's a
different ballgame.)

   However, there is one other possibility:  With caching, your servers
might see enough of a performance increase that you wouldn't need to
load-balance them. : )

steve


Re: Query caching

From
"Poul L. Christiansen"
Date:

KuroiNeko wrote:
>
> >     Here's a  simple design  that I  was tossing  back and  forth. Please
> > understand that I'm not  saying this is the best way to do  it, or even a
> > good way to do it. Just a possible way to do it.
>
>  Sounds interesting, I  certainly have reasons to play bad  guy, but that's
> what I always do, so nevermind :)

I think this feature deserves to be put on the TODO list under exotic
features.

This feature would probably also be a threat to MySQL's dominance in the
web scripting area for websites with medium to high traffic ;)

Poul L. Christiansen

Re: Query caching

From
Alfred Perlstein
Date:
* Steve Wolfe <steve@iboats.com> [001031 13:47] wrote:
>
> > >    (Incidentally, we've toyed around with developing a query-caching
> > > system that would sit between PostgreSQL and our DB libraries.
> >
> >  Sounds amazing, but requires some research, I guess. However, in many
> > cases one would be more than happy with cached connections. Of course,
> > cached query results can be naturally added to that, but just connections
> > are OK to start with. Security....
>
>     To me, it doesn't sound like it would be that difficult of a project, at
> least not for the likes of the PostgreSQL developers.  It also doesn't seem
> like it would really introduce any security problems, not if it were done
> inside of PostgreSQL.  Long ago, I got sidetracked from my endeavors in C,
> and so I don't feel that I'm qualified to do it.  (Otherwise, I would have
> done it already. : ) )   If you wanted it done in Perl or Object Pascal, I
> could help. : )
>
>     Here's a simple design that I was tossing back and forth.  Please
> understand that I'm not saying this is the best way to do it, or even a good
> way to do it.  Just a possible way to do it.  I haven't been able to give it
> as much thought as I would like to.  Here goes.
>
> ------------
> Implementation
>

[snip]

Karel Zak <zakkr@zf.jcu.cz> implemented stored procedures for
PostgreSQL, but no one has approached him about integrating them.

You can find his second attempt to get a response from the developers
here:

http://people.freebsd.org/~alfred/karel-pgsql.txt

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

Re: Query caching

From
KuroiNeko
Date:
> I think this feature deserves to be put on the TODO list under exotic
> features.

 Well, it's kinda implemented already, I believe, with decades of being run
unattended :)

> This feature would probably also be a threat to MySQL dominance in the
> web scripting area for websites with medium to high traffic ;)

 Dominance? Who needs it, anyway?


--

 contaminated fish and microchips
  huge supertankers on Arabian trips
 oily propaganda from the leaders' lips
  all about the future
 there's people over here, people over there
  everybody's looking for a little more air
 crossing all the borders just to take their share
  planning for the future

 Rainbow, Difficult to Cure

Re: how good is PostgreSQL

From
Tim Kientzle
Date:
a) Don't log to a database.  Log data should be sent to a process
   that collects any needed on-the-fly statistics and then writes
   to disk files (rotating hourly or daily depending on your needs).
   This model is becoming pretty standard with Apache now; look at
   rotatelogs in the Apache distribution for an example.

b) Number of records isn't really the issue.  Query complexity and
   number of queries are more pertinent.  Generally, for example, a
   single SELECT that pulls in multiple rows is much faster than
   a bunch of small SELECTs (see the sketch after this list).

c) For very high traffic, you are going to have multiple front-end
   servers.  If you design the system carefully, you can have a single
   shared network disk used by all of your front ends, then just stack
   boxes in front of it.  This doesn't give you endless scalability, though;
   at some point you'll saturate your network file server and/or database
   box.

d) PHP may not be a great choice.  It doesn't provide a lot of hooks
   for effective caching of database connections and/or results.
   mod_perl or Java servlets may be better, depending on the details.

                - Tim Kientzle

Arnold Gamboa wrote:
>
> Hi,
>
> For users of large PostgreSQL and PostgreSQL builders, this is for you.
>
> I'm having a terrible time deciding now. :(
>
> We're about to build a "huge" website now.  I got tied up in signing the
> contract without really getting enough information about PgSQL since this
> what we plan to implement with PHP (normally we use mySQL but i guess it
> does not fit for huge databases like that).
>
> Here's my problem.. We're about to build a site like hitbox.com where there
> is a large amount of database required.. If say there is 100,000 users with
> 1000 page hits per day for each, and everything will be logged, you could
> imagine how huge this will be.  I'm just so "nervous" (really, that's the
> term) if we implement this and later on experience a slow down or worse than
> that, crash in the server.
>
> My questions are:
> 1. What is the limit for number of records in a table WITHOUT SUFFERING SLOW
> DOWN.
> 2. ....limit in number of tables per database
> 3. ... limit in number of database.
>
> Thanks for you comments.  I would really appreciate every comment that I'll
> receive regarding this.
>
> Arnold

Re: how good is PostgreSQL

From
"Steve Wolfe"
Date:
> d) PHP may not be a great choice.  It doesn't provide a lot of hooks
>    for effective caching of database connections and/or results.
>    mod_perl or Java servlets may be better, depending on the details.

  One of our competitors spent a very, very large amount of money on high-end
Sun equipment so that they could write their CGI stuff in Java servlets.
It still ran slow.  We run Perl on machines that pale compared to theirs,
and get far better performance. : )

steve


Re: how good is PostgreSQL

From
KuroiNeko
Date:
> We run Perl on machines that pale compared to theirs,
> and get far better performance. : )

 Well, don't get me wrong, I'm not going to start a war here :) But CGI is so
simple and straightforward that anything more than C is quite an overkill
(think assembly).
 Myself, I'm planning to port all my Perl stuff eventually. Yes, Perl is
great for string handling, but when you spend a couple of weeks on BugTraq,
you'll suddenly feel that it's still too much. When you only let `known
good' values in, lex or regexp libs will do.
 Sorry for the off-topic; anyone interested is welcome to email me in
private.

 Ed


--

 contaminated fish and microchips
  huge supertankers on Arabian trips
 oily propaganda from the leaders' lips
  all about the future
 there's people over here, people over there
  everybody's looking for a little more air
 crossing all the borders just to take their share
  planning for the future

 Rainbow, Difficult to Cure

Re: how good is PostgreSQL

From
Gunnar Rønning
Date:
"Steve Wolfe" <steve@iboats.com> writes:

>   One of our competitors spent a very, very large deal of money on high-end
> Sun equipment, so that they could write their CGI stuff in Java servlets.
> It still ran slow.  We run Perl on machines that pale compared to theirs,
> and get far better performance. : )

You can always make it slow if you don't design properly. A former customer
saved a lot of hardware and maintenance cost by migrating from a Perl-based
publishing system to a Java-based one. Less hardware, better performance, and
more functionality. ;-) The old Perl system had been developed and maintained
over a 4-year period; the initial development of the new Java-based system
took about 9 months.

regards,

    Gunnar

Re: how good is PostgreSQL

From
Jim Mercer
Date:
i have a client which merged two companies, one running perl, the other
running java.

what to do?

i convinced them to port both the perl and java code to INTERCAL, and run
the whole system on an array of C-64's.

works better than either of the perl or java stuff.

On Wed, Nov 01, 2000 at 01:58:55AM +0100, Gunnar Rønning wrote:
> "Steve Wolfe" <steve@iboats.com> writes:
> >   One of our competitors spent a very, very large deal of money on high-end
> > Sun equipment, so that they could write their CGI stuff in Java servlets.
> > It still ran slow.  We run Perl on machines that pale compared to theirs,
> > and get far better performance. : )
>
> You can always do it slow if you don't design properly. A former customer
> saved a lot hardware and maintenance cost by migrating from a perl based
> publishing system to a Java based one. Less hardware, better performance and
> more functionality. ;-) The old perl system had been developed and maintained
> over a 4 year period - the initial development of the new Java based system
> took about 9 months.

--
[ Jim Mercer                 jim@reptiles.org              +1 416 410-5633 ]
[          Reptilian Research -- Longer Life through Colder Blood          ]
[  Don't be fooled by cheap Finnish imitations; BSD is the One True Code.  ]

Re: how good is PostgreSQL

From
Alex Pilosov
Date:
On Tue, 31 Oct 2000, Jim Mercer wrote:

> i convinced them to port both the perl and java code to INTERCAL, and run
> the whole system on an array of C-64's.
But there are no bindings from PostgreSQL to INTERCAL! ;)

(I hope I didn't just give a bad idea to someone...;)

-alex



Re: Query caching

From
Daniel Freedman
Date:
On the topic of query cache (or maybe this is just tangential and I'm
confused):

I've always heard that Oracle has the ability to essentially suck as much
of the database into RAM as you have memory to allow it, and can then
just run its queries on that in-RAM database (or db subset) without doing
disk I/O (which I would imagine is one of the more expensive parts of a
given SQL command).  I've looked for references to PostgreSQL's ability to
do something like this, but I've never been certain if it's possible.  Can
PostgreSQL do this?  And, if not, does it have to hit the disk for every
SQL instruction (I would assume so)?

I would imagine that an actual query cache would be somewhat orthogonal
to this in-RAM database cache, inasmuch as it would actually store the
results of specific queries, rather than the complete tuple set on which
to run queries.  However, I would imagine that both schemes would provide
performance increases.

Also, as KuroiNeko writes below about placing the query cache outside the
actual DBMS, don't some webservers (or at least specific custom coding
implementations of them) just cache common query results themselves?
(Not that it would necessarily be bad for the DBMS to do so, I
wouldn't know enough about this to surmise.)

I'd appreciate any pointers to more information on specific performance
tuning in this area.  (IMHO, it would probably be a boon to the PostgreSQL
database and its community if there existed some reference like
O'Reilly's _Oracle Performance Tuning_ that was focused on PostgreSQL.)

Thanks for any extra info,

Daniel


On Tue, 31 Oct 2000, KuroiNeko wrote:

> >     Here's a  simple design  that I  was tossing  back and  forth. Please
> > understand that I'm not  saying this is the best way to do  it, or even a
> > good way to do it. Just a possible way to do it.
>
>  Sounds interesting, I  certainly have reasons to play bad  guy, but that's
> what I always do, so nevermind :)
>  However, there's  one major point where  I disagree. Not that  I have real
> reasons to,  or observation or analysis  to background my position,  just a
> feeling. And the feeling is  that connection/query cache should be separate
> from DBMS server itself.
>  Several things  come to the  mind right  off, like possibilities  to cache
> connections  to different  sources, like  PGSQL and  Oracle, as  well as  a
> chance  to run  this cache  on  a separate  box that  will perform  various
> additional functions, like load balancing. But that's right on the surface.
>  Still in doubt....
>



Re: Query caching

From
KuroiNeko
Date:
> I've looked for references as to
> Postgresql's ability to do something like this, but I've never been
> certain if it's possible. Can postgresql do this, please? And, if not,
> does it have to hit the disk for every SQL instruction (I would assume
> so)?

 Doing so, as you might guess, is quite dangerous. E.g., RAM failures are
extremely rare, with probability very close to 0, but nothing is
absolutely reliable.
 From my (quite limited) experience, I can tell that PGSQL relies more on
file caching (or whatever the term is), provided by the OS, than on
slurping relations into RAM. See the recent discussion of [f]sync(); maybe
it sheds more light.

> I would imagine that the actual query cache would be slightly orthogonal
> to this in-RAM database cache

 Actually, there are several ways to keep the data in memory, each having
its advantages, drawbacks, and reasons. To name just a few: caching pages and
files, mapping files, and storing `internal' structures (like the tuples in
your example) in shared memory areas.
 Apologists and enemies of each method come in all shapes, but real life
is even worse. Often these methods interfere with each other, and a
careless combination (you cache the pages but overlook the file caching
performed by the OS) may easily become a bottleneck.

> I'd appreciate any pointers to more information on specific performance
> tuning in this area (IMHO, it would probably be a boon to the postgresql
> database and its community, if there existed some reference like
> O'Reilly's _Oracle Performance Tuning_ that was focused on Postgresql.)

 As I see it, performance tuning with PGSQL should be concentrated around
quality design of your DB and queries. I may be wrong, but there's not much
to play with where the PGSQL server touches the system.
 Maybe it's bad, but I like it. General suggestions about fs performance
apply to PGSQL, and you don't have to re-invent the wheel. There are just
files. Play with sync, install a RAID of SCSI drives, keep your swap on a
separate controller. Nothing really special that would impact or, more
importantly, interfere with other services running on the same box.
 Change must come from inside :) Here, inside means DB design.

 Ed


--

 contaminated fish and microchips
  huge supertankers on Arabian trips
 oily propaganda from the leaders' lips
  all about the future
 there's people over here, people over there
  everybody's looking for a little more air
 crossing all the borders just to take their share
  planning for the future

 Rainbow, Difficult to Cure

Re: Query caching

From
"Poul L. Christiansen"
Date:
Daniel Freedman wrote:
>
> On the topic of query cache (or maybe this is just tangential and I'm
> confused):
>
> I've always heard that Oracle has the ability to essentially suck in as
> much of the database into RAM as you have memory to allow it, and can then
> just run its queries on that in-RAM database (or db subset) without doing
> disk I/O (which I would probably imagine is one of the more expensive
> parts of a given SQL command).  I've looked for references as to
> Postgresql's ability to do something like this, but I've never been
> certain if it's possible.  Can postgresql do this, please?  And, if not,
> does it have to hit the disk for every SQL instruction (I would assume
> so)?

PostgreSQL hits the disk on UPDATE/DELETE/INSERT operations. SELECTs
are served from the shared buffer cache, but the default cache is only
½MB of RAM. You can change this to whatever you want.

I'm using Cold Fusion, and it can cache queries itself, so no database
action is necessary. But I don't think PHP and the others have this
capability. And Cold Fusion costs $1300 :(

Poul L. Christiansen

Re: Query caching

From
Denis Perchine
Date:
> PostgreSQL hits the disk on UPDATE/DELETE/INSERT operations. SELECT's
> are cached, but the default cache is only ½MB of RAM. You can change
> this to whatever you want.
>
> I'm using Cold Fusion and it can cache queries itself, so no database
> action is necessary. But I don't think PHP and others have this
> possibility. But Cold Fusion costs 1300$ :(

No, PHP has this.

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

Re: Query caching

From
Frank Joerdens
Date:
On Wed, Nov 01, 2000 at 10:16:58AM +0000, Poul L. Christiansen wrote:
> PostgreSQL hits the disk on UPDATE/DELETE/INSERT operations. SELECT's
> are cached, but the default cache is only ½MB of RAM. You can change
> this to whatever you want.

That sounds like a very cool thing to do, and the default seems awfully
conservative, given the average server's RAM nowadays. If you
have a small Linux server with 128 MB of RAM, it would be interesting to
see what happens, performance-wise, if you increase the cache for
selects to, for instance, 64 MB. Has anyone tried to benchmark this? How
would you benchmark it? Where do you change this cache size? How do you
keep the cache from being swapped out to disk (which would presumably
all but eradicate the benefits of such a measure)?

Cheers Frank

--
frank joerdens

joerdens new media
urbanstr. 116
10967 berlin
germany

e: frank@joerdens.de
t: +49 (0)30 69597650
f: +49 (0)30 7864046
h: http://www.joerdens.de

pgp public key: http://www.joerdens.de/pgp/frank_joerdens.asc

Re: Query caching

From
"Poul L. Christiansen"
Date:
Frank Joerdens wrote:
>
> On Wed, Nov 01, 2000 at 10:16:58AM +0000, Poul L. Christiansen wrote:
> > PostgreSQL hits the disk on UPDATE/DELETE/INSERT operations. SELECT's
> > are cached, but the default cache is only ½MB of RAM. You can change
> > this to whatever you want.
>
> That sound like a very cool thing to do, and the default seems awfully
> conservative, given the average server´s RAM equipment nowadays. If you
> have a small Linux server with 128 MB of RAM, it would be interesting to
> see what happens, performance-wise, if you increase the cache for
> selects to, for instance, 64 MB. Has anyone tried to benchmark this? How
> would you benchmark it? Where do you change this cache size? How do you
> keep the cache from being swapped out to disk (which would presumably
> all but eradicate the benefits of such a measure)?

I have a PostgreSQL server with 80MB of RAM running Red Hat Linux 7.0, and
in my /etc/rc.d/init.d/postgresql start script I have these two lines that
start the postmaster:

echo 67108864 > /proc/sys/kernel/shmmax

su -l postgres -c "/usr/bin/pg_ctl -D $PGDATA -p /usr/bin/postmaster -o '-i -B 4096 -o -F' start >/dev/null 2>&1" < /dev/null

The first line increases the maximum shared memory segment size to 64MB.
The "-B 4096" gives the postmaster 4096 * 8kB = 32MB of shared buffers.

I haven't benchmarked it, but I know it's MUCH faster.

Poul L. Christiansen

Re: Query caching

From
"Steve Wolfe"
Date:
> How do you
> keep the cache from being swapped out to disk (which would presumably
> all but eradicate the benefits of such a measure)?

  You make sure that you have enough RAM that you aren't using swap. : )

  Seriously, as cheap as RAM is today, if a machine uses swap more than
occasionally, an upgrade is in order.

steve



Re: how good is PostgreSQL

From
Joseph Shraibman
Date:
Performance depends on a lot of factors.  Shelling out $$$ for Sun
hardware doesn't guarantee good performance.  They might have been better
off buying a Tru64 system with Compaq's JDK.

Steve Wolfe wrote:
>
> > d) PHP may not be a great choice.  It doesn't provide a lot of hooks
> >    for effective caching of database connections and/or results.
> >    mod_perl or Java servlets may be better, depending on the details.
>
>   One of our competitors spent a very, very large deal of money on high-end
> Sun equipment, so that they could write their CGI stuff in Java servlets.
> It still ran slow.  We run Perl on machines that pale compared to theirs,
> and get far better performance. : )
>
> steve

--
Joseph Shraibman
jks@selectacast.net
Increase signal to noise ratio.  http://www.targabot.com

Re: how good is PostgreSQL

From
"Steve Wolfe"
Date:
> Performance depends on a lot of factors.  Shelling out $$$ for Sun
> hardware doesn't garuntee good performance.  They might have been better
> off buying a Tru64 system with Compaq's jdk.

   Yeah, it could be.  But comparing the $7,000 Intel machine I built
against a $20,000 Alpha, I'm still very happy with Intel.  Yes, the Alpha
was faster on a per-processor basis.  But it also cost more than twice as
much on a dollars-per-transaction basis. ; )

steve