Thread: sustained update load of 1-2k/sec

From: Mark Cotner

Hi all,
I bet you get tired of the same ole questions over and
over.

I'm currently working on an application that will poll
thousands of cable modems per minute, and I would like
to use PostgreSQL to maintain state between polls of
each device.  This requires a very large number of
in-place updates on a reasonably large table (100k-500k
rows, ~7 columns, mostly integer/bigint).  Each row
will be refreshed every 15 minutes, or at least that's
how fast I can poll via SNMP.  I hope I can tune the
DB to keep up.

The app is threaded and will likely have well over 100
concurrent db connections.  Temp tables for storage
aren't a preferred option, since this is designed as a
shared-nothing approach and I will likely have
several polling processes.

Here are some of my assumptions so far . . .

HUGE WAL
Vacuum hourly if not more often

I'm getting 1700 tx/sec from MySQL and I would REALLY
prefer to use PG.  I don't need to match that number,
just get close.

Is there a global temp table option?  In-memory tables
would be very beneficial in this case.  I could just
flush to disk occasionally with an INSERT INTO blah
SELECT FROM the memory table.

Any help or creative alternatives would be greatly
appreciated.  :)

'njoy,
Mark


--
Writing software requires an intelligent person,
creating functional art requires an artist.
-- Unknown


Re: sustained update load of 1-2k/sec

From: Bob Ippolito

On Aug 18, 2005, at 10:24 PM, Mark Cotner wrote:

> I'm currently working on an application that will poll
> thousands of cable modems per minute, and I would like
> to use PostgreSQL to maintain state between polls of
> each device.  This requires a very large number of
> in-place updates on a reasonably large table (100k-500k
> rows, ~7 columns, mostly integer/bigint).  Each row
> will be refreshed every 15 minutes, or at least that's
> how fast I can poll via SNMP.  I hope I can tune the
> DB to keep up.
>
> The app is threaded and will likely have well over 100
> concurrent db connections.  Temp tables for storage
> aren't a preferred option, since this is designed as a
> shared-nothing approach and I will likely have
> several polling processes.

Somewhat OT, but..

The easiest way to speed that up is to use fewer threads.  You're
adding a whole TON of overhead with that many threads that you just
don't want or need.  You should probably be using something
event-driven to solve this problem, with just a few database threads
to store all that state.  Less is definitely more in this case.  See
<http://www.kegel.com/c10k.html> (and there's plenty of other
literature out there saying that event-driven is an extremely good
way to do this sort of thing).

Here are some frameworks to look at for this kind of network code:
(Python) Twisted - <http://twistedmatrix.com/>
(Perl) POE - <http://poe.perl.org/>
(Java) java.nio (not familiar enough with the Java thing to know
whether or not there's a high-level wrapper)
(C++) ACE - <http://www.cs.wustl.edu/~schmidt/ACE.html>
(Ruby) IO::Reactor - <http://www.deveiate.org/code/IO-Reactor.html>
(C) libevent - <http://monkey.org/~provos/libevent/>

.. and of course, you have select/poll/kqueue/WaitNextEvent/whatever
that you could use directly, if you wanted to roll your own solution,
but don't do that.

If you don't want to optimize the whole application, I'd at least
just push the DB operations down to a very small number of
connections (*one* might even be optimal!), waiting on some kind of
thread-safe queue for updates from the rest of the system.  This way
you can easily batch those updates into transactions and you won't be
putting so much unnecessary synchronization overhead into your
application and the database.
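
A minimal sketch of that pattern in Python; psycopg2, the DSN, the
modem_state table, its columns, and the batch size are all illustrative
assumptions, not anything from the original poster's code:

    import queue
    import threading

    import psycopg2

    updates = queue.Queue()  # thread-safe; pollers put (modem_id, status, rx_power) here

    def db_writer():
        conn = psycopg2.connect("dbname=noc")    # hypothetical DSN
        cur = conn.cursor()
        while True:
            batch = [updates.get()]              # block until at least one update arrives
            while len(batch) < 1000 and not updates.empty():
                batch.append(updates.get())      # drain up to ~1000 more without blocking
            for modem_id, status, rx_power in batch:
                cur.execute(
                    "UPDATE modem_state SET status = %s, rx_power = %s"
                    " WHERE modem_id = %s",
                    (status, rx_power, modem_id))
            conn.commit()                        # one transaction per batch, not per update

    threading.Thread(target=db_writer, daemon=True).start()

Committing once per batch amortizes the per-transaction fsync across
hundreds of updates, which is where most of the win comes from.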

Generally, once you have more worker threads (or processes) than
CPUs, you're going to get diminishing returns in a bad way, assuming
those threads are making good use of their time.

-bob


Re: sustained update load of 1-2k/sec

From: Mark Cotner

Excellent feedback.  Thank you.  Please do keep in mind I'm storing the
results of SNMP queries.  The majority of the time each thread is in a wait
state, listening on a UDP port for a return packet.  The number of threads
is high because, in order to sustain poll speed, I need to minimize the
impact of timeouts and all this waiting for return packets.

I had intended to have a fallback plan which would build a thread-safe
queue for db stuffs, but the application isn't currently architected that
way.  It's not completely built yet, so now is the time for change.  I
hadn't thought of building up a batch of queries and creating a transaction
from them.

I've been looking into memcached as a persistent object store as well and
hadn't seen the reactor pattern yet.  Still trying to get my puny brain
around that one.

Again, thanks for the help.

'njoy,
Mark



Re: sustained update load of 1-2k/sec

From: Bob Ippolito

On Aug 19, 2005, at 12:14 AM, Mark Cotner wrote:

> Excellent feedback.  Thank you.  Please do keep in mind I'm storing
> the results of SNMP queries.  The majority of the time each thread is
> in a wait state, listening on a UDP port for a return packet.  The
> number of threads is high because, in order to sustain poll speed, I
> need to minimize the impact of timeouts and all this waiting for
> return packets.

Asynchronous IO via select/poll/etc. basically says: "given these 100
sockets, wake me up when any of them has something to tell me, or
wake me up anyway in N milliseconds".  From one thread, you can
usually deal with thousands of connections without breaking a sweat,
whereas with thread-per-connection you have so much overhead just for
the threads that you probably run out of RAM before your network is
throttled.  The reactor pattern basically just abstracts this a bit
so that you worry about what to do when the sockets have something to
say, and it also lets you schedule timed events, rather than having
to worry about how to implement that correctly *and* write your
application.
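
A bare-bones illustration of that loop in Python, using select() directly
with UDP sockets; the handler and the timeout bookkeeping are hypothetical
stubs, just to show the shape:

    import select
    import socket

    def handle_response(addr, data):
        # hypothetical: parse the SNMP response PDU and hand off the result
        pass

    socks = []
    for _ in range(100):                      # 100 sockets, one thread
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setblocking(False)
        socks.append(s)

    while True:
        # sleep until any socket is readable, or 500 ms at most
        readable, _, _ = select.select(socks, [], [], 0.5)
        for s in readable:
            data, addr = s.recvfrom(65535)    # one response datagram
            handle_response(addr, data)
        # hypothetical: retransmit or give up on requests that are overdue

A reactor framework owns exactly this loop for you, plus the timer
bookkeeping, which is what makes it worth using.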

With 100 threads you are basically invoking a special case of the
same mechanism that only looks at one socket, but this makes for 100
different data structures that end up in both userspace and kernel
space, plus the thread stacks (which can easily be a few megs each)
and context switching whenever any of them wakes up.  You're throwing
a lot of RAM and CPU cycles out the window with this design.

Also, preemptive threads are hard.

> I had intended to have a fallback plan which would build a thread-safe
> queue for db stuffs, but the application isn't currently architected
> that way.  It's not completely built yet, so now is the time for
> change.  I hadn't thought of building up a batch of queries and
> creating a transaction from them.

It should be *really* easy to just swap out the implementation of
your "change this record" function with one that simply puts its
arguments on a queue, with another thread that gets them from the
queue and actually does the work.
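
Continuing the queue sketch from the earlier message, the swap really can
be a one-liner; the function and field names are again hypothetical:

    def change_record(modem_id, status, rx_power):
        # was: open a cursor and UPDATE directly from the polling thread
        updates.put((modem_id, status, rx_power))   # now: enqueue for the db_writer

The pollers never block on the database, and all the transaction logic
lives in one place.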

> I've been looking into memcached as a persistent object store as well
> and hadn't seen the reactor pattern yet.  Still trying to get my puny
> brain around that one.

memcached is RAM-based; it's not persistent at all... unless you are
sure all of your nodes will be up at all times and will never go
down.  IIRC, it also just starts throwing away data once you hit its
size limit.  Of course, this isn't really any different from MySQL's
MyISAM tables if you hit the row limit, but I think that memcached
might not even give you an error when this happens.  Also, memcached
is just key/value pairs over a network; there's not much of a
database going on there.

If you can fit all this data in RAM and you don't care so much about
the integrity, you might not benefit much from a RDBMS at all.
However, I don't really know what you're doing with the data once you
have it so I might be very wrong here...

-bob



Re: sustained update load of 1-2k/sec

From: Alex Turner

I have managed tx speeds that high from PostgreSQL, going even as high
as 2500/sec for small tables, but it does require a good RAID
controller card (yes, I'm even running with fsync on).  I'm using a
3ware 9500S-8MI with Raptor drives in multiple RAID 10s.  The box
wasn't too $$$ at just around $7k.  I have two independent controllers
on two independent PCI buses to give max throughput, one with a
6-drive RAID 10 and the other with two 4-drive RAID 10s.

Alex Turner
NetEconomist


Re: sustained update load of 1-2k/sec

From: Tom Lane

Bob Ippolito <bob@redivi.com> writes:
> If you don't want to optimize the whole application, I'd at least
> just push the DB operations down to a very small number of
> connections (*one* might even be optimal!), waiting on some kind of
> thread-safe queue for updates from the rest of the system.

While I agree that hundreds of threads seems like overkill, I think the
above advice might be going too far in the other direction.  The problem
with single-threaded operation is that any delay affects the whole
system --- eg, if you're blocked waiting for disk I/O, the CPU doesn't
get anything done either.  You want enough DB connections doing things
in parallel to make sure that there's always something else useful to do
for each major component.  This is particularly important for Postgres,
which doesn't do any internal query parallelization (not that it would
help much anyway for the sorts of trivial queries you are worried about).
If you have, say, a 4-way CPU you want at least 4 active connections to
make good use of the CPUs.

I'd suggest trying to build the system so that it uses a dozen or two
active database connections.  If that doesn't match up to the number of
polling activities you want to have in flight at any instant, then you
can do something like what Bob suggested on the client side to bridge
the gap.
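
In terms of the queue-based sketch from earlier in the thread, that just
means starting more than one writer, each with its own connection; the
pool size here is illustrative:

    for _ in range(16):       # "a dozen or two" active DB connections
        threading.Thread(target=db_writer, daemon=True).start()

Each writer blocks on its own commits independently, so the CPUs and disks
stay busy even while one connection is waiting on I/O.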

As far as the question "can PG do 1-2k xact/sec", the answer is "yes
if you throw enough hardware at it".  Spending enough money on the
disk subsystem is the key ...

            regards, tom lane

Re: sustained update load of 1-2k/sec

From: Andreas Pflug

Tom Lane wrote:

>As far as the question "can PG do 1-2k xact/sec", the answer is "yes
>if you throw enough hardware at it".  Spending enough money on the
>disk subsystem is the key ...

The 1-2k xact/sec for MySQL seems suspicious; it sounds very much like
write-back caching, not write-through, especially considering that heavy
concurrent write access isn't said to be MySQL's strength...

I wonder if preserving the database after a fatal crash is really
necessary, since the data stored sounds quite volatile; in this case,
fsync=false might be sufficient.

Regards,
Andreas


Re: sustained update load of 1-2k/sec

From: Tom Lane

Andreas Pflug <pgadmin@pse-consulting.de> writes:
> Tom Lane wrote:
>> As far as the question "can PG do 1-2k xact/sec", the answer is "yes
>> if you throw enough hardware at it".  Spending enough money on the
>> disk subsystem is the key ...
>>
> The 1-2k xact/sec for MySQL seems suspicious; it sounds very much like
> write-back caching, not write-through, especially considering that heavy
> concurrent write access isn't said to be MySQL's strength...

> I wonder if preserving the database after a fatal crash is really
> necessary, since the data stored sounds quite volatile; in this case,
> fsync=false might be sufficient.

Yeah, that's something to think about.  If you do need full transaction
safety, then you *must* have a decent battery-backed-write-cache setup,
else your transaction commit rate will be limited by disk rotation
speed --- for instance, a single connection can commit at most 250 xacts
per second if the WAL log is on a 15000RPM drive.  (You can improve this
to the extent that you can spread activity across multiple connections,
but I'm not sure you can expect to reliably have 8 or more connections
ready to commit each time the disk goes 'round.)
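
(The arithmetic: 15000 RPM is 250 revolutions per second, and a lone
committing connection must wait for the WAL position to come back around
under the head, so it gets at most one commit per revolution; hence the
250/sec ceiling.)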

            regards, tom lane

Re: sustained update load of 1-2k/sec

From: Ron

Alex mentions a nice setup, but I'm pretty sure I know how to beat
that IO subsystem HW's performance by at least 1.5x or 2x.  Possibly
more.  (No, I do NOT work for any vendor I'm about to discuss.)

Start by replacing the WD Raptors with Maxtor Atlas 15K II's.
At 5.5ms average access, 97.4MB/s outer track throughput, 85.9MB/s
average, and 74.4 MB/s inner track throughput, they have the best
performance characteristics of any tested shipping HDs I know
of.  (Supposedly the new SAS versions will _sustain_ ~98MB/s, but
I'll believe that only if I see it under independent testing).
In comparison, the numbers on the WD740GD are 8.1ms average access,
71.8, 62.9, and 53.9 MB/s outer, average and inner track throughputs
respectively.

Be prepared to use as many of them as possible (read: as many as you
can afford) if you want to maximize transaction rates, particularly
for the small transactions this application seems to involve.

Next, use a better RAID card.  The TOL enterprise stuff (Xyratex,
Engino, Dot-hill) is probably too expensive, but in the commodity
market, benchmarks indicate that Areca's 1GB-buffer RAID cards
currently outperform all the other commodity RAID stuff.

9 Atlas II's per card in a RAID 5 set, or 16 per card in a RAID 10
set, should max the RAID card's throughput and come very close to, if
not attaining, the real world peak bandwidth of the 64b 133MHz PCI-X
bus they are plugged into.  Say somewhere in the 700-800MB/s range.
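
(Checking that estimate against the drive numbers above: nine Atlas 15K
II's at 85.9MB/s average sustained is roughly 773MB/s, right in the quoted
700-800MB/s range.)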

Repeat the above for as many independent PCI-X buses as you have for
a very fast commodity RAID IO subsystem.

Two such configured cards used in the same manner as mentioned by
Alex should easily attain 1.5x-2x the transaction numbers mentioned
by Alex, unless there's a bottleneck somewhere else in the system design.

Hope this helps,
Ron Peacetree



Re: sustained update load of 1-2k/sec

From: Ron

At 09:58 AM 8/19/2005, Andreas Pflug wrote:

>The 1-2k xact/sec for MySQL seems suspicious; it sounds very much like
>write-back caching, not write-through, especially considering that heavy
>concurrent write access isn't said to be MySQL's strength...

Don't be suspicious.

I haven't seen the code under discussion, but I have seen MySQL
easily achieve these kinds of numbers using the MyISAM storage engine
in write-through cache mode.

MyISAM can be =FAST=.  Particularly when decent HW is thrown at it.

Ron



Re: sustained update load of 1-2k/sec

From: Jeffrey W. Baker

On Fri, 2005-08-19 at 10:54 -0400, Ron wrote:
> Maxtor Atlas 15K II's.

> Areca's 1GB buffer RAID cards

The former are SCSI disks and the latter is an SATA controller.  The
combination would have a transaction rate of approximately 0.

I can vouch for the Areca controllers, however.  You can certainly
achieve pgbench transaction rates in the hundreds per second even with
only 5 7200RPM disks and 128MB cache.

Don't forget to buy the battery.

-jwb

Re: sustained update load of 1-2k/sec

From: J. Andrew Rogers

On 8/19/05 1:24 AM, "Mark Cotner" <mcotner@yahoo.com> wrote:
> I'm currently working on an application that will poll
> thousands of cable modems per minute, and I would like
> to use PostgreSQL to maintain state between polls of
> each device.  This requires a very large number of
> in-place updates on a reasonably large table (100k-500k
> rows, ~7 columns, mostly integer/bigint).  Each row
> will be refreshed every 15 minutes, or at least that's
> how fast I can poll via SNMP.  I hope I can tune the
> DB to keep up.
>
> The app is threaded and will likely have well over 100
> concurrent db connections.  Temp tables for storage
> aren't a preferred option, since this is designed as a
> shared-nothing approach and I will likely have
> several polling processes.


Mark,

We have PostgreSQL databases on modest hardware doing exactly what you are
attempting (a massive scalable SNMP monitoring system).  The monitoring
volume for a single database server appears to exceed what you are trying
to do by a few orders of magnitude, with no scaling or performance issues,
so I can state without reservation that PostgreSQL can easily handle your
application in theory.

However, that is predicated on having a well-architected system that
minimizes resource contention and unnecessary blocking, and based on your
description you may be going about it a bit wrong.

The biggest obvious bottleneck is the use of threads and massive
process-level parallelization.  As others have pointed out, async queues are
your friends, as is partitioning the workload horizontally rather than
vertically through the app stack.  A very scalable high-throughput engine
for SNMP polling only requires two or three threads handling different parts
of the workload to saturate the network, and by choosing what each thread
does carefully you can all but eliminate blocking when there is work to be
done.

We only use a single database connection to insert all the data into
PostgreSQL, and that process/thread receives its data from a work queue.
Depending on how you design your system, you can batch many records in your
queue as a single transaction.  In our case, we also use very few updates,
mostly just inserts, which is probably advantageous in terms of throughput
if you have the disk for it.  The insert I/O load is easily handled, and our
disk array is a modest 10k SCSI rig.  The only thing that really hammers the
server is when multiple reporting processes are running, which frequently
touch several million rows each (the database is much larger than the system
memory), and even this is manageable with clever database design.


In short, what you are trying to do is easily doable on PostgreSQL in
theory.  However, restrictions on design choices may pose significant
hurdles.  We did not start out with an ideal system either; it took a fair
amount of re-engineering to solve all the bottlenecks and problems that pop
up.

Good luck,

J. Andrew Rogers
jrogers@neopolitan.com



Re: sustained update load of 1-2k/sec

From: Ron

At 12:34 PM 8/19/2005, Jeffrey W. Baker wrote:
>On Fri, 2005-08-19 at 10:54 -0400, Ron wrote:
> > Maxtor Atlas 15K II's.
>
> > Areca's 1GB buffer RAID cards
>
>The former are SCSI disks and the latter is an SATA controller.  The
>combination would have a transaction rate of approximately 0.

You are evidently thinking of the Areca ARC-11xx controllers (and you
are certainly right for that HW combination ;-) ).  Those are not the
only product Areca makes that can be upgraded to a 1GB cache.

Until SAS infrastructure is good enough, U320 SCSI and FC HD's remain
the top performing HD's realistically available.  At the most
fundamental, your DBMS is only as good as your HD IO subsystem, and
your HD IO subsystem is only as good as your HDs.  As others have
said here, skimping on your HDs is _not_ a good design choice where
DBMSs are concerned.

As an aside, the Atlas 15K II's are now available in SAS:

http://www.maxtor.com/portal/site/Maxtor/menuitem.ba88f6d7cf664718376049b291346068/?channelpath=/en_us/Products/SCSI%20Hard%20Drives/Atlas%2015K%20Family/Atlas%2015K%20II%20SAS

I haven't seen independent benches on them, so I explicitly
referenced the U320 Atlas 15K II's known performance numbers
instead.  As I said, Maxtor is claiming even better for the SAS
version of the Atlas 15K II.

None of the SAS <-> PCI-X or PCI-E RAID cards I know of are ready for
mass market yet, although a few are in beta.


>I can vouch for the Areca controllers, however.  You can certainly
>achieve pgbench transaction rates in the hundreds per second even with
>only 5 7200RPM disks and 128MB cache.
>
>Don't forget to buy the battery.

Agreed.

Hope this is helpful,
Ron Peacetree



Re: sustained update load of 1-2k/sec

From: PFC

> While I agree that hundreds of threads seems like overkill, I think the
> above advice might be going too far in the other direction.  The problem
> with single-threaded operation is that any delay affects the whole
> system --- eg, if you're blocked waiting for disk I/O, the CPU doesn't

    You use UDP, which is a connectionless protocol... then why use threads?

    I'd advise this:

    Use asynchronous network code (one thread) to do your network stuff.  This
will lower the CPU used by this code immensely.
    Every minute, dump a file containing everything to insert into the table.
    Use another thread to COPY it into the DB, in a temporary table if you
wish, and then INSERT INTO ... SELECT.
    This should be well adapted to your requirements.
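
A minimal sketch of that flow using psycopg2's copy_from; the staging and
modem_state table names and their columns are illustrative assumptions:

    import io

    import psycopg2

    def flush_batch(conn, rows):
        buf = io.StringIO()
        for modem_id, status, rx_power in rows:        # hypothetical columns
            buf.write("%d\t%s\t%d\n" % (modem_id, status, rx_power))
        buf.seek(0)
        cur = conn.cursor()
        # the temp table vanishes when the transaction commits
        cur.execute("CREATE TEMP TABLE staging (LIKE modem_state) ON COMMIT DROP")
        cur.copy_from(buf, "staging")                  # bulk load, far cheaper than INSERTs
        cur.execute("INSERT INTO modem_state SELECT * FROM staging")
        conn.commit()

COPY bypasses most of the per-statement overhead, so a once-a-minute flush
stays cheap even at tens of thousands of rows.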

Re: sustained update load of 1-2k/sec

From: Alex Turner

Don't forget that Ultra 320 is the speed of the bus, not of each drive.
No matter how many honking 15k disks you put on a 320MB/s bus, you can
only get 320MB/sec total, and only so many outstanding IOs on the bus.

Not so with SATA!  Each drive is on its own bus, and you are only
limited by the speed of your PCI-X bus, which can be as high as
800MB/sec at 133MHz/64bit.

It's cheap and it's fast; all you have to do is pay for the
enclosure, which can be a bit pricey, but there are some nice 24-bay
and even 40-bay enclosures out there for SATA.

Yes, a 15k RPM drive will give you better seek time and better peak
throughput, but put them all on a single U320 bus and you won't see
much return past a stripe size of 3 or 4.
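
(To put numbers on that: at the ~86MB/s average sustained throughput quoted
for the Atlas 15K II earlier in the thread, 320MB/s of shared bus divided
by ~86MB/s per drive is about 3.7 drives, which is exactly where the 3-4
drive knee comes from.)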

If it's raw transactions per second, data-warehouse style, it's all
about the xlog, baby, which is sequential writes, and all about large
block reads, which is sequential reads.

Alex Turner
NetEconomist
P.S. Sorry if I'm a bit punchy, I've been up since yesterday with
server upgrade nightmares that continue ;)


Re: sustained update load of 1-2k/sec

From: Ron

At 03:31 PM 8/19/2005, Alex Turner wrote:
>Don't forget that Ultra 320 is the speed of the bus, not of each drive.
>No matter how many honking 15k disks you put on a 320MB/s bus, you can
>only get 320MB/sec total, and only so many outstanding IOs on the bus.

Of course.  This is exactly why multi-channel SCSI and multichannel
Fibre Channel cards exist; and why external RAID enclosures usually
have multiple such cards in them...

Even moderately acceptable U320 SCSI cards are dual channel at this
point (think Adaptec dual channel AHAxxxx's), and Quad channel ones
are just as common.  The Quads will, of course, saturate a 64b 133MHz
PCI-X bus.  _IF_ the chipset on them can keep up.

The current kings of RAID card performance are all Fibre Channel
based, and all the ones I know of are theoretically capable of
saturating a 64b 133MHz PCI-X bus.  Again, _IF_ the chipset on them
can keep up.

Most commodity RAID card have neither adequate CPU nor enough
buffer.  Regardless of the peripheral IO technology they use.


>Not so with SATA!  Each drive is on its own bus, and you are only
>limited by the speed of your PCI-X bus, which can be as high as
>800MB/sec at 133MHz/64bit.

That's the Theory anyway, and latency should be lower as well.  OTOH,
as my wife likes to say "In theory, Theory and Practice are the
same.  In practice, they almost never are."

You are only getting the performance you mention as long as your card
can keep up with multiplexing N IO streams, crunching RAID 5 XORs
(assuming you are using RAID 5), etc, etc.  As I'm sure you know,
"The chain is only as strong as its weakest link.".

Most commodity SATA RAID cards brag about being able to pump 300MB/s
(they were all over LW SF bragging about this!?), which in this
context is woefully unimpressive.  Sigh.

I'm impressed with the Areca cards because they usually have CPUs
that actually can come close to pushing the theoretical IO limit of
the bus they are plugged into; and they can be upgraded to (barely)
acceptable buffer amounts <rant>(come on, manufacturers! 4GB of DDR
PC3200 is only -2- DIMMs, and shortly that will be enough to hold 8GB
of DDR PC3200.  Give us more buffer!)</rant>.


>It's cheap and it's fast; all you have to do is pay for the
>enclosure, which can be a bit pricey, but there are some nice 24-bay
>and even 40-bay enclosures out there for SATA.

I've even seen 48 bay ones.  However, good enclosures, particularly
for larger numbers of HDs, are examples of non-trivial engineering
and priced accordingly.  Too many times I see people buy "bargain"
enclosures and set themselves and their organizations up for some
_very_ unpleasant times that could easily have been avoided by being
careful to buy quality products.  "Pay when you buy or pay much more later."


>Yes, a 15k RPM drive will give you better seek time and better peak
>throughput, but put them all on a single U320 bus and you won't see
>much return past a stripe size of 3 or 4.

Agreed.  Same holds for 2Gbps FC.  Haven't tested 4Gbps FC personally
yet, but I'm told the limit is higher in the manner you'd expect.


>If it's raw transactions per second, data-warehouse style, it's all
>about the xlog, baby, which is sequential writes, and all about large
>block reads, which is sequential reads.
>
>Alex Turner
>NetEconomist
>P.S. Sorry if I'm a bit punchy, I've been up since yesterday with
>server upgrade nightmares that continue ;)

My condolences and sympathies.  I've definitely been there and done that.

Ron Peacetree



Re: sustained update load of 1-2k/sec

From: Mark Cotner

:)  Most of the people on this list are systems programmers; however, I am
not.  The tool of choice for this app is Ruby, and the libraries don't
support async SNMP at the moment.

I've done a good deal of async SNMP, and the libraries that actually pull it
off generally aren't that good (Net-SNMP and Perl's Net::SNMP).  Granted, UDP
is connectionless to an extent, but you still have to send the PDU, and bind
to the return socket and wait.  If you batch the outgoing PDUs then you can
get away with sending them out synchronously and listening on the returning
socket synchronously, but it would require that your libraries support this.
I understand the concepts well enough; maybe I'll put together a patch.  It
would be much lower overhead than managing all those threads.  Looks like
it's gonna be a fun weekend.

Thanks again for all the great feedback.

'njoy,
Mark





Re: sustained update load of 1-2k/sec

From: Mark Cotner

Thanks again everyone for the excellent suggestions.

I looked into IO::Reactor, but after a few hours of fiddling decided I was
getting the kind of performance I wanted from a slightly more than modest
number of threads, and decided (due to dev timelines) to come back later to
patching the SNMP libraries for Ruby to do async using a reactor.

I am unfortunately stuck with updates, but I think (with your suggestions)
I've made it work for me.

MySQL = 1500 updates/sec
PostgreSQL w/10k tx per commit using a single thread = 1400 updates/sec

Given the update-heavy nature of this table, I felt it was necessary to
test during a vacuum.  Turns out the hit wasn't that bad . . .

PostgreSQL w/10k tx per commit using a single thread during a vacuum =
1300 updates/sec

100-200 updates/sec is a small price to pay for mature stored procedures,
more stored-procedure language options, ACID compliance, MVCC, very few if
any corrupt tables (I get about 2 a week from MySQL on the 40 DBs I manage),
a more crash-resistant db (about one crash a month on one of my 40 MySQL
dbs), and replication that actually works for more than a day before
quitting for no apparent reason ;) [/flame off]

For those of you with Cox Communications cable modems, look forward to
better customer service and cable plant management.  :)

And if anyone's curious, here's the app I'm rebuilding/updating:
http://www.mysql.com/customers/customer.php?id=16
We won runner-up behind the Sabre airline reservation system for MySQL app
of the year.  Needless to say, they weren't too happy when they heard we
might be switching DBs.

'njoy,
Mark
