Thread: Scaling concerns

Scaling concerns

From: tsuraan
I'm writing a webmail-type application that is meant to be used in a
corporate environment.  The core of my system is a Postgres database
that is used as a message header cache.  The two (relevant) tables
being used are pasted into the end of this message.  My problem is
that, as the messages table increases to tens of millions of rows,
pgsql slows down considerably.  Even an operation like "select
count(*) from messages" can take minutes, with a totally idle system.
Postgres seems to be the most scalable Free database out there, so I
must be doing something wrong.

As for the most common strategy of having a slower (more rows)
"archival" database and a smaller, faster "live" database, all the
clients in the company are using their normal corporate email server
for day-to-day email handling.  The webmail is used to access email
that's no longer on the corporate server, so it's not really simple to
say which emails should be considered live and which are really
out-of-date.

My postgres settings are entirely default with the exception of
shared_buffers being set to 40,000 and max_connections set to 400.
I'm not sure what the meaning of most of the other settings are, so I
haven't touched them.  The machines running the database servers are
my home desktop (a dual-core athlon 3200+ with 2GB RAM and a 120GB
SATA II drive), and a production server with two dual-core Intel
chips, 4 GB RAM, and a RAID 5 array of SATA II drives on a 3Ware 9550
controller.  Both machines are running Gentoo Linux with a 2.6.1x
kernel, and both exhibit significant performance degradation when I
start getting tens of millions of records.

Any advice would be most appreciated.  Thanks in advance!

Tables:

CREATE TABLE EmailAddresses (
  emailid   SERIAL PRIMARY KEY,     --  The unique identifier of this address
  name      TEXT NOT NULL,          --  The friendly name in the address
  addrspec  TEXT NOT NULL,          --  The user@domain part of the address
  UNIQUE(name, addrspec)
);

and

CREATE TABLE Messages (
  -- Store info:
  msgkey    BIGSERIAL PRIMARY KEY,  --  Unique identifier for a message
  path      TEXT NOT NULL,          --  Where the message is on the file system
  inserted  TIMESTAMP DEFAULT now(),--  When the message was fetched
  -- Message Info:
  msgid     TEXT UNIQUE NOT NULL,   --  Message's Message-Id field
  mfrom     INTEGER                 --  Who sent the message
            REFERENCES EmailAddresses
            DEFAULT NULL,
  mdate     TIMESTAMP DEFAULT NULL, --  Message "date" header field
  replyto   TEXT DEFAULT NULL,      --  Message-ID of replied-to message
  subject   TEXT DEFAULT NULL,      --  Message "subject" header field
  numatch   INTEGER DEFAULT NULL,   --  Number of attachments
  UNIQUE(path)
);

Re: Scaling concerns

From: "Steinar H. Gunderson"
On Sat, Dec 16, 2006 at 11:26:02AM -0600, tsuraan wrote:
> Even an operation like "select count(*) from messages" can take minutes,
> with a totally idle system.  Postgres seems to be the most scalable Free
> database out there, so I must be doing something wrong.

Unqualified SELECT COUNT(*) FROM foo is one of the most expensive operations
you can do on your system, since the visibility information has to be checked
on disk for each row. Instead, try real queries on real data, and post here
if some are too slow for you.
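
For illustration, a "real" query in this sense is one the planner can satisfy
mostly from an index, for example fetching one user's most recent headers.
This is only a sketch: the supporting index below is an assumption, not part
of the posted schema.

  -- Hypothetical supporting index (not in the schema above).
  CREATE INDEX messages_mfrom_mdate_idx ON Messages (mfrom, mdate);

  -- A qualified query: only the matching rows are visited, instead of
  -- visibility-checking every row in the table.
  SELECT msgkey, subject, mdate
    FROM Messages
   WHERE mfrom = 42
   ORDER BY mdate DESC
   LIMIT 50;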

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Scaling concerns

From: tsuraan
> Unqualified SELECT COUNT(*) FROM foo is one of the most expensive operations
> you can do on your system, since the visibility information has to be checked
> on disk for each row. Instead, try real queries on real data, and post here
> if some are too slow for you.

Ok, that's a bad example.  I'm learning :-)  Is insert ... select also
really expensive then?  I have a table loaded with message-id and path
information of currently-existing messages.  It has ~20 million rows.
Trying to do "INSERT INTO Messages(path, msgid) SELECT (path, msgid)
FROM tmpMessages" took a really long time before psql died with an
out-of-memory error.  Is there a more sane way to do a table copy, or
should I have just dropped all the indices from the Message table and
loaded into that?
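
For reference, one common way to do this kind of bulk copy, sketched against
the schema above (the constraint names are guesses at the auto-generated ones;
check them with \d messages before copying this):

  BEGIN;
  -- Drop the secondary unique constraints while loading, so the bulk insert
  -- only has to maintain the primary key.
  ALTER TABLE Messages DROP CONSTRAINT messages_msgid_key;
  ALTER TABLE Messages DROP CONSTRAINT messages_path_key;

  -- Note: no parentheses after SELECT; "SELECT (path, msgid)" builds a single
  -- row-valued column rather than two columns.
  INSERT INTO Messages (path, msgid)
  SELECT path, msgid FROM tmpMessages;

  -- Re-create the constraints once the data is loaded.
  ALTER TABLE Messages ADD CONSTRAINT messages_msgid_key UNIQUE (msgid);
  ALTER TABLE Messages ADD CONSTRAINT messages_path_key UNIQUE (path);
  COMMIT;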

Thanks!

Re: Scaling concerns

From: Dimitri Fontaine
On Saturday, December 16, 2006 at 18:32, Steinar H. Gunderson wrote:
> Instead, try real queries on real data,
> and post here if some are too slow for you.

To quickly find out a subset of slow queries on your production system, you
can use the pgfouine tool:
  http://pgfouine.projects.postgresql.org/

If you then want to measure PostgreSQL performance under different settings
and compare the results, consider using the tsung tool (and maybe its
tsung-ploter companion tool to graph several benchmarks onto the same chart
for comparison):
  http://pgfouine.projects.postgresql.org/tsung.html
  http://tsung.erlang-projects.org/
  http://debian.dalibo.org/unstable/

This latter link also contains a .tar.gz archive of tsung-ploter in case
you're not running a Debian system.  Its dependencies are Python and matplotlib.

Regards,
--
Dimitri Fontaine
http://www.dalibo.com/


Re: Scaling concerns

From: Andreas Kostyrka
* tsuraan <tsuraan@gmail.com> [061216 18:26]:
> I'm writing a webmail-type application that is meant to be used in a
> corporate environment.  The core of my system is a Postgres database
> that is used as a message header cache.  The two (relevant) tables
> being used are pasted into the end of this message.  My problem is
> that, as the messages table increases to tens of millions of rows,
> pgsql slows down considerably.  Even an operation like "select
> count(*) from messages" can take minutes, with a totally idle system.
> Postgres seems to be the most scalable Free database out there, so I
> must be doing something wrong.

select count(*) from table is the worst case in PostgreSQL (and in MVCC
systems in general, I guess).

If you really need to run count(*), think about the required isolation
level of these operations and maintain an aggregate table yourself.

(Btw, select aggregate(*) from bigtable is something that no database
likes; it's just that the degree of slowness sometimes differs.)
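
A minimal sketch of such an aggregate table for the Messages table in this
thread, maintained by triggers (the table, function, and trigger names are
made up for illustration, and plpgsql is assumed to be installed):

  CREATE TABLE message_counts (n BIGINT NOT NULL);
  INSERT INTO message_counts SELECT count(*) FROM Messages;

  CREATE OR REPLACE FUNCTION messages_count_trig() RETURNS trigger AS $$
  BEGIN
    IF TG_OP = 'INSERT' THEN
      UPDATE message_counts SET n = n + 1;
    ELSIF TG_OP = 'DELETE' THEN
      UPDATE message_counts SET n = n - 1;
    END IF;
    RETURN NULL;  -- AFTER trigger, so the return value is ignored
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER messages_count
    AFTER INSERT OR DELETE ON Messages
    FOR EACH ROW EXECUTE PROCEDURE messages_count_trig();

  -- "SELECT n FROM message_counts" now replaces "SELECT count(*) FROM Messages".
  -- Caveat: every insert/delete contends on this single row, so a very
  -- write-heavy load may want to batch or shard the counter instead.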

For scaling you should consider Slony. Either hang out in #slony on
Freenode.net or ask on the mailing list if you have questions.

> As for the most common strategy of having a slower (more rows)
> "archival" database and a smaller, faster "live" database, all the
> clients in the company are using their normal corporate email server
> for day-to-day email handling.  The webmail is used to access email
> that's no longer on the corporate server, so it's not really simple to
> say which emails should be considered live and which are really
> out-of-date.
>
> My postgres settings are entirely default with the exception of
> shared_buffers being set to 40,000 and max_connections set to 400.
> I'm not sure what the meaning of most of the other settings are, so I
> haven't touched them.  The machines running the database servers are
> my home desktop (a dual-core athlon 3200+ with 2GB RAM and a 120GB
> SATA II drive), and a production server with two dual-core Intel

Intel chips => be more specific. There are Intel boxes known to have
issues with PostgreSQL under specific load scenarios (again, with
specific versions). To make it more fun, these are really, really hard
to track down ;)

> chips, 4 GB RAM, and a RAID 5 array of SATA II drives on a 3Ware 9550
> controller.  Both machines are running Gentoo Linux with a 2.6.1x
> kernel, and both exhibit significant performance degradation when I
> start getting tens of millions of records.
>
> Any advice would be most appreciated.  Thanks in advance!

Cluster: one box that applies changes, and multiple boxes that read
the data.

If you cannot afford multiple boxes from the start, still design your
application to work with two connections: one connecting as a user with
read/write permissions, and one connecting as a user with only SELECT
permissions => that way you can later easily add a load balancer to the
mix and use multiple Postgres boxes for reads.
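
On the database side, that could look something like the following sketch
(the role names are made up; the application would open its read-only
connections as webmail_ro):

  -- Writer role, used by the single read/write connection.
  CREATE USER webmail_rw PASSWORD 'changeme';
  GRANT SELECT, INSERT, UPDATE, DELETE ON Messages, EmailAddresses TO webmail_rw;
  -- (the sequences behind the serial columns also need privileges for the writer)

  -- Read-only role; these connections can later be pointed at replicas.
  CREATE USER webmail_ro PASSWORD 'changeme';
  GRANT SELECT ON Messages, EmailAddresses TO webmail_ro;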

Andreas

Re: Scaling concerns

From: "Luke Lonergan"
Tsuraan,

"Select count(*) from bigtable" is testing your disk drive speed up till
about 300MB/s, after which it is CPU limited in Postgres.

My guess is that your system has very slow I/O, either due to faulty
drivers/hardware or due to how it is configured.

The first thing you should do is run a simple I/O test on your data
directory - write a file twice the size of memory using dd like this:

  time bash -c "dd if=/dev/zero of=data_directory/bigfile bs=8k count=(2 *
memory_size / 8192) && sync"

  time dd if=data_directory/bigfile of=/dev/null bs=8k
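
For example, with the production box's 4 GB of RAM (2 * 4 GiB / 8 KiB =
1048576 blocks) and assuming the data directory is /var/lib/postgresql/data
(a made-up path; substitute your own), that works out to:

  time bash -c "dd if=/dev/zero of=/var/lib/postgresql/data/bigfile bs=8k count=1048576 && sync"
  time dd if=/var/lib/postgresql/data/bigfile of=/dev/null bs=8k
  rm /var/lib/postgresql/data/bigfile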

Then report the times here.

- Luke

On 12/16/06 9:26 AM, "tsuraan" <tsuraan@gmail.com> wrote:

> I'm writing a webmail-type application that is meant to be used in a
> corporate environment.  The core of my system is a Postgres database
> that is used as a message header cache.  The two (relevant) tables
> being used are pasted into the end of this message.  My problem is
> that, as the messages table increases to tens of millions of rows,
> pgsql slows down considerably.  Even an operation like "select
> count(*) from messages" can take minutes, with a totally idle system.
> Postgres seems to be the most scalable Free database out there, so I
> must be doing something wrong.
>
> As for the most common strategy of having a slower (more rows)
> "archival" database and a smaller, faster "live" database, all the
> clients in the company are using their normal corporate email server
> for day-to-day email handling.  The webmail is used to access email
> that's no longer on the corporate server, so it's not really simple to
> say which emails should be considered live and which are really
> out-of-date.
>
> My postgres settings are entirely default with the exception of
> shared_buffers being set to 40,000 and max_connections set to 400.
> I'm not sure what the meaning of most of the other settings are, so I
> haven't touched them.  The machines running the database servers are
> my home desktop (a dual-core athlon 3200+ with 2GB RAM and a 120GB
> SATA II drive), and a production server with two dual-core Intel
> chips, 4 GB RAM, and a RAID 5 array of SATA II drives on a 3Ware 9550
> controller.  Both machines are running Gentoo Linux with a 2.6.1x
> kernel, and both exhibit significant performance degradation when I
> start getting tens of millions of records.
>
> Any advice would be most appreciated.  Thanks in advance!
>
> Tables:
>
> CREATE TABLE EmailAddresses (
>   emailid   SERIAL PRIMARY KEY,     --  The unique identifier of this address
>   name      TEXT NOT NULL,          --  The friendly name in the address
>   addrspec  TEXT NOT NULL,          --  The user@domain part of the address
>   UNIQUE(name, addrspec)
> );
>
> and
>
> CREATE TABLE Messages (
>   -- Store info:
>   msgkey    BIGSERIAL PRIMARY KEY,  --  Unique identifier for a message
>   path      TEXT NOT NULL,          --  Where the message is on the file
> system
>   inserted  TIMESTAMP DEFAULT now(),--  When the message was fetched
>   -- Message Info:
>   msgid     TEXT UNIQUE NOT NULL,   --  Message's Message-Id field
>   mfrom     INTEGER                 --  Who sent the message
>             REFERENCES EmailAddresses
>             DEFAULT NULL,
>   mdate     TIMESTAMP DEFAULT NULL, --  Message "date" header field
>   replyto   TEXT DEFAULT NULL,      --  Message-ID of replied-to message
>   subject   TEXT DEFAULT NULL,      --  Message "subject" header field
>   numatch   INTEGER DEFAULT NULL,   --  Number of attachments
>   UNIQUE(path)
> );

Re: Scaling concerns

From: Bill Moran
tsuraan <tsuraan@gmail.com> wrote:
>
> I'm writing a webmail-type application that is meant to be used in a
> corporate environment.  The core of my system is a Postgres database
> that is used as a message header cache.  The two (relevant) tables
> being used are pasted into the end of this message.  My problem is
> that, as the messages table increases to tens of millions of rows,
> pgsql slows down considerably.  Even an operation like "select
> count(*) from messages" can take minutes, with a totally idle system.
> Postgres seems to be the most scalable Free database out there, so I
> must be doing something wrong.
>
> As for the most common strategy of having a slower (more rows)
> "archival" database and a smaller, faster "live" database, all the
> clients in the company are using their normal corporate email server
> for day-to-day email handling.  The webmail is used to access email
> that's no longer on the corporate server, so it's not really simple to
> say which emails should be considered live and which are really
> out-of-date.
>
> My postgres settings are entirely default with the exception of
> shared_buffers being set to 40,000 and max_connections set to 400.
> I'm not sure what the meaning of most of the other settings are, so I
> haven't touched them.  The machines running the database servers are
> my home desktop (a dual-core athlon 3200+ with 2GB RAM and a 120GB
> SATA II drive), and a production server with two dual-core Intel
> chips, 4 GB RAM, and a RAID 5 array of SATA II drives on a 3Ware 9550
> controller.  Both machines are running Gentoo Linux with a 2.6.1x
> kernel, and both exhibit significant performance degradation when I
> start getting tens of millions of records.
>
> Any advice would be most appreciated.  Thanks in advance!
>
> Tables:
>
> CREATE TABLE EmailAddresses (
>   emailid   SERIAL PRIMARY KEY,     --  The unique identifier of this address
>   name      TEXT NOT NULL,          --  The friendly name in the address
>   addrspec  TEXT NOT NULL,          --  The user@domain part of the address
>   UNIQUE(name, addrspec)
> );
>
> and
>
> CREATE TABLE Messages (
>   -- Store info:
>   msgkey    BIGSERIAL PRIMARY KEY,  --  Unique identifier for a message
>   path      TEXT NOT NULL,          --  Where the message is on the file system
>   inserted  TIMESTAMP DEFAULT now(),--  When the message was fetched
>   -- Message Info:
>   msgid     TEXT UNIQUE NOT NULL,   --  Message's Message-Id field
>   mfrom     INTEGER                 --  Who sent the message
>             REFERENCES EmailAddresses
>             DEFAULT NULL,
>   mdate     TIMESTAMP DEFAULT NULL, --  Message "date" header field
>   replyto   TEXT DEFAULT NULL,      --  Message-ID of replied-to message
>   subject   TEXT DEFAULT NULL,      --  Message "subject" header field
>   numatch   INTEGER DEFAULT NULL,   --  Number of attachments
>   UNIQUE(path)
> );

You might benefit from adding some performance-specific changes to your
schema.

For example, you could create a separate table for each emailid (call them
Messages_1, Messages_2, etc.).  I expect that no single user will have an
unbearable number of messages, so each user should see reasonable
performance when working with their own mailbox.

You can handle the management of this entirely in your application logic,
but it might be better if you wrote stored procedures to access the message
tables, to make things easier on the application side.
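
One way to flesh that out, as a sketch only (Bill doesn't name a mechanism;
the inheritance-based layout and the accessor function here are assumptions):

  -- Per-user child table.  With constraint_exclusion = on, queries on the
  -- parent that filter by mfrom only touch the matching children.  Note that
  -- UNIQUE/PRIMARY KEY constraints are not inherited and would have to be
  -- re-declared on each child.
  CREATE TABLE Messages_1 (
    CHECK (mfrom = 1)
  ) INHERITS (Messages);
  CREATE INDEX messages_1_mdate_idx ON Messages_1 (mdate);

  -- A simple accessor so the application never names a child table directly.
  CREATE OR REPLACE FUNCTION messages_for(p_emailid integer)
  RETURNS SETOF Messages AS $$
    SELECT * FROM Messages WHERE mfrom = p_emailid;
  $$ LANGUAGE sql STABLE;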

-Bill

Re: Scaling concerns

From: tsuraan
> To quickly find out a subset of slow queries on your production system, you
> can use the pgfouine tool:
>   http://pgfouine.projects.postgresql.org/
>
> If you then want to measure PostgreSQL performance under different settings
> and compare the results, consider using the tsung tool (and maybe its
> tsung-ploter companion tool to graph several benchmarks onto the same chart
> for comparison):
>   http://pgfouine.projects.postgresql.org/tsung.html
>   http://tsung.erlang-projects.org/
>   http://debian.dalibo.org/unstable/

Thanks for all the links!  I'll check these out when I get back to work.

Re: Scaling concerns

From: tsuraan
> For scaling you should consider Slony. Either hang out in #slony on
> Freenode.net or ask on the mailing list if you have questions.

For some reason I had thought Slony was really immature, but it
actually looks quite usable.

> Intel chips => be more specific. There are Intel boxes known to have
> issues with PostgreSQL under specific load scenarios (again, with
> specific versions). To make it more fun, these are really, really hard
> to track down ;)

I can't access the Intel machine right now, but it's the current
fastest Intel dual-core.  I'll figure it out tomorrow.

> Cluster: one box that applies changes, and multiple boxes that read
> the data.
>
> If you cannot afford multiple boxes from the start, still design your
> application to work with two connections: one connecting as a user with
> read/write permissions, and one connecting as a user with only SELECT
> permissions => that way you can later easily add a load balancer to the
> mix and use multiple Postgres boxes for reads.

I think I'll see what I can do for that.  Is there an aggregation type
of clustering for Postgres?  I'm thinking of something where the data is
divided between machines, rather than shared among them as in Slony;
sort of like the difference between RAID 0 and RAID 1.
Since my application is constantly adding to the database (far more is
written than is ever read), it would be nice to have a multiple-write,
single reader solution, if such a thing exists.

Re: Scaling concerns

From: Greg Smith
On Sun, 17 Dec 2006, tsuraan wrote:

> Since my application is constantly adding to the database (far more is
> written than is ever read), it would be nice to have a multiple-write,
> single reader solution, if such a thing exists.

You seem to be working from the assumption that you have a scaling issue,
and that therefore you should be researching how to scale your app to more
machines.  I'm not so sure you do; I would suggest that you drop that
entire idea for now, spend some time doing basic performance tuning for
Postgres instead, and only after then consider adding more machines.  It
does little good to add more incorrectly set up servers to the mix, and
solving the multiple-write problem is hard.  Let's take a quick tour
through your earlier messages:

> My postgres settings are entirely default with the exception of
> shared_buffers being set to 40,000 and max_connections set to 400. I'm
> not sure what the meaning of most of the other settings are, so I
> haven't touched them.

http://www.powerpostgresql.com/Downloads/annotated_conf_80.html is a good
intro to the various parameters you might set, with some valuable hints on
the effective range you should be considering.  I'd suggest you use that
to identify the most likely things to increase, then read the manuals at
http://www.postgresql.org/docs/8.2/interactive/runtime-config.html for
more detail on what you're actually adjusting.  To get you started,
consider increasing effective_cache_size, checkpoint_segments, and
work_mem; those are three whose defaults are very low for your
application, relative to your hardware.  The thought of how your poor
database is suffering when trying to manage a heavy write load with the
default checkpoint_segments in particular makes me sad, especially when we
add:

> The machines running the database servers are my home desktop (a
> dual-core athlon 3200+ with 2GB RAM and a 120GB SATA II drive), and a
> production server with two dual-core Intel chips, 4 GB RAM, and a RAID 5
> array of SATA II drives on a 3Ware 9550 controller.

One big RAID 5 volume is probably the worst setup available for what
you're doing.  Luke already gave you a suggestion for testing write speed;
you should run that test, but I wouldn't expect happy numbers there.  You
might be able to get by with the main database running like that, but
think about what you'd need to do to add more disks (or reorganize the
ones you have) so that you could dedicate a pair to a RAID-1 volume for
holding the WAL.  If you're limited by write performance, I think you'd
find adding a separate WAL drive set a dramatically more productive
upgrade than trying to split the app to another machine.  Try it on your
home machine first; adding another SATA drive there is a cheap upgrade, and
you should see a marked improvement (especially once you get the server
parameters set to more appropriate values).
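
To put those parameter suggestions into postgresql.conf terms, a rough sketch
for a 4 GB machine of that era might look like this (the numbers are
illustrative guesses, not values recommended in this thread):

  shared_buffers = 40000          # 8 kB pages (~320 MB); already raised by the poster
  effective_cache_size = 262144   # 8 kB pages (~2 GB); tells the planner about the OS cache
  work_mem = 16384                # kB per sort/hash; be conservative with 400 connections
  checkpoint_segments = 32        # spreads checkpoints out under a heavy write load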

I'd also suggest that you'd probably be able to get more help from people
here if you posted a snippet of output from vmstat and iostat -x with a
low interval (say 5 seconds) during a period where the machine was busy;
that's helpful for figuring out where the bottleneck on your machine
really is.
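
That is, something along the lines of the following, run while the load is on,
with a representative chunk of the output pasted into a reply:

  vmstat 5
  iostat -x 5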

> Trying to do "INSERT INTO Messages(path, msgid) SELECT (path, msgid)
> FROM tmpMessages" took a really long time before psql died with an
> out-of-memory error.

Do you have the exact text of the error?  I suspect you're falling victim
to the default parameters being far too low here as well, but without the
error it's hard to know exactly which.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Scaling concerns

From: tsuraan
> http://www.powerpostgresql.com/Downloads/annotated_conf_80.html is a good
> intro to the various parameters you might set, with some valuable hints on
> the effective range you should be considering.  I'd suggest you use that
> to identify the most likely things to increase, then read the manuals at
> http://www.postgresql.org/docs/8.2/interactive/runtime-config.html for
> more detail on what you're actually adjusting.  To get you started,
> consider increasing effective_cache_size, checkpoint_segments, and
> work_mem; those are three whose defaults are very low for your

I'll play with these - I've definitely seen the (for me) confusing use
of seqscan rather than index scan that the annotated page says is
caused by too little effective_cache_size.  That first link is really
great; I can't believe I've never seen it before.

> One big RAID 5 volume is probably the worst setup available for what
> you're doing.  Luke already gave you a suggestion for testing write speed;
> you should run that test, but I wouldn't expect happy numbers there.  You

I've run dstat with a really busy postgres and seen 94 MB/s of reads and
writes simultaneously for a few seconds.  I think our RAID cards have
16MB of RAM, so unless something really freakish was going on, I probably
wasn't seeing pure cache access.  I'll try some tests with dd tomorrow
when I get to work.

> might be able to get by with the main database running like that, but
> think about what you'd need to do to add more disks (or reorganize the
> ones you have) so that you could dedicate a pair to a RAID-1 volume for
> holding the WAL.  If you're limited by write performance, I think you'd
> find adding a separate WAL drive set a dramatically more productive
> upgrade than trying to split the app to another machine.  Try it on your
> home machine first; that's a cheap upgrade, to add another SATA drive to
> there, and you should see a marked improvement (especially once you get
> the server parameters set to more appropriate values).

Is the WAL at the same location as the xlog (transaction log?)?  The
checkpoint_segments doc says increasing that value is really only
useful if the xlog is separate from the data, so do I put both WAL and
xlog on the separate drive, or is that automatic (or redundant; I
don't know what I'm talking about...)?
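
For reference, the usual way to move the WAL onto its own spindle is to
relocate the pg_xlog directory and leave a symlink behind, with the server
stopped (the paths here are examples only):

  # stop the postmaster first
  mv /var/lib/postgresql/data/pg_xlog /mnt/waldisk/pg_xlog
  ln -s /mnt/waldisk/pg_xlog /var/lib/postgresql/data/pg_xlog
  # then start it again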

> I'd also suggest that you'd probably be able to get more help from people
> here if you posted a snippet of output from vmstat and iostat -x with a
> low interval (say 5 seconds) during a period where the machine was busy;
> that's helpful for figuring out where the bottleneck on your machine
> really is.

I'll try to stress a machine and get some real stats soon.

> Do you have the exact text of the error?  I suspect you're falling victim
> to the default parameters being far too low here as well, but without the
> error it's hard to know exactly which.

Well, I tried to repeat it on my home machine with 20 million rows,
and it worked fine in about two minutes.  I'll have to see what's
going on on that other system...

Thanks for the help!

Re: Scaling concerns

From: Tom Lane
tsuraan <tsuraan@gmail.com> writes:
> Is the WAL at the same location as the xlog (transaction log?)?

Same thing.

> The checkpoint_segments doc says increasing that value is really only
> useful if the xlog is separate from the data,

Dunno where you read that, but it's utter bilge.  If you've got a
write-intensive workload, you want to crank checkpoint_segments as high
as you can stand.  With the default settings on a modern machine it's
not hard at all to push it into checkpointing every dozen or so seconds,
which will completely kill performance.  (Disk space for pg_xlog/ and
potential delay during crash restart are the only negatives here.  If
you are willing to push the average inter-checkpoint interval past five
minutes then you need to increase checkpoint_timeout too.)
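
In configuration terms, that point amounts to something like the following
(the numbers are only an illustration):

  checkpoint_segments = 64     # ~1 GB of WAL written between forced checkpoints
  checkpoint_timeout = 900     # seconds; raise this too once checkpoints are >5 minutes apart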

            regards, tom lane