Re: sustained update load of 1-2k/sec - Mailing list pgsql-performance

From Bob Ippolito
Subject Re: sustained update load of 1-2k/sec
Date
Msg-id 9852B15E-8F9D-46F5-B4B4-9EAAE26F1AF7@redivi.com
In response to Re: sustained update load of 1-2k/sec  (Mark Cotner <mcotner@yahoo.com>)
List pgsql-performance
On Aug 19, 2005, at 12:14 AM, Mark Cotner wrote:

> Excellent feedback.  Thank you.  Please do keep in mind I'm storing the
> results of SNMP queries.  The majority of the time each thread is in a
> wait state, listening on a UDP port for return packet.  The number of
> threads is high because in order to sustain poll speed I need to
> minimize the impact of timeouts and all this waiting for return packets.

Asynchronous IO via select/poll/etc. basically says: "given these 100
sockets, wake me up when any of them has something to tell me, or
wake me up anyway in N milliseconds".  From one thread, you can
usually deal with thousands of connections without breaking a sweat,
where with thread-per-connection you have so much overhead just for
the threads that you probably run out of RAM before your network is
throttled.  The reactor pattern basically just abstracts this a bit
so that you worry about what to do when the sockets have something to
say, and it also lets you schedule timed events, rather than having
to worry about how to implement that correctly *and* write your
application.
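
As a rough illustration, here's a minimal single-threaded sketch using
Python's standard select module; the socket count, the 500 ms timeout,
and the handle_response/handle_timeouts helpers are made-up placeholders,
not anything from your application:

import select
import socket

# One UDP socket per outstanding SNMP request (numbers are made up).
socks = []
for _ in range(100):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(('', 0))          # ephemeral port for the reply to come back on
    s.setblocking(False)
    socks.append(s)

def handle_response(sock, data, addr):
    # parse the SNMP return packet and update in-memory state
    pass

def handle_timeouts():
    # resend or give up on requests that have waited too long
    pass

while True:
    # Wake up when any socket is readable, or after 500 ms regardless.
    readable, _, _ = select.select(socks, [], [], 0.5)
    for s in readable:
        data, addr = s.recvfrom(65535)
        handle_response(s, data, addr)
    handle_timeouts()

A real reactor (Twisted et al.) is just this loop with the bookkeeping
done for you.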

With 100 threads you are basically invoking a special case of the
same mechanism that only looks at one socket, but this makes for 100
different data structures that end up in both userspace and kernel
space, plus the thread stacks (which can easily be a few megs each)
and context switching whenever any of them wakes up.  You're throwing
a lot of RAM and CPU cycles out the window by using this design.

Also, preemptive threads are hard.

> I had intended to have a fallback plan which would build a thread
> safe queue for db stuffs, but the application isn't currently
> architected that way.  It's not completely built yet so now is the
> time for change.  I hadn't thought of building up a batch of queries
> and creating a transaction from them.

It should be *really* easy to just swap out the implementation of
your "change this record" function with one that simply puts its
arguments on a queue, with another thread that gets them from the
queue and actually does the work.
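
Something along these lines, strictly as a sketch -- the table name,
column names, and psycopg2 connection string are invented for
illustration, not taken from your schema:

import queue
import threading
import psycopg2   # any DB-API driver would do

updates = queue.Queue()

def change_record(device_id, in_octets, out_octets):
    # Same call the pollers already make; now it just enqueues the work.
    updates.put((device_id, in_octets, out_octets))

def db_writer(dsn, batch_size=500):
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    while True:
        batch = [updates.get()]            # block until there is work
        while len(batch) < batch_size:
            try:
                batch.append(updates.get_nowait())
            except queue.Empty:
                break
        for device_id, in_octets, out_octets in batch:
            cur.execute("UPDATE modem_state"
                        "   SET in_octets = %s, out_octets = %s"
                        " WHERE device_id = %s",
                        (in_octets, out_octets, device_id))
        conn.commit()                      # one transaction per batch

threading.Thread(target=db_writer, args=("dbname=poller",),
                 daemon=True).start()

One writer thread (or a very small pool) gets you batching for free and
keeps the number of database connections tiny.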

> I've been looking into memcached as a persistent object store as well
> and hadn't seen the reactor pattern yet.  Still trying to get my puny
> brain around that one.

memcached is RAM based, it's not persistent at all... unless you are
sure all of your nodes will be up at all times and will never go
down.  IIRC, it also just starts throwing away data once you hit its
size limit.  Of course, this isn't really any different from MySQL's
MyISAM tables if you hit the row limit, but I think that memcached
might not even give you an error when this happens.  Also, memcached
is just key/value pairs over a network, not much of a database going
on there.
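
To make that concrete, with the python-memcached client the whole
interface really is just set/get on keys (the key name and values here
are made up), and everything disappears when the daemon restarts or
evicts:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])
mc.set('modem:00:11:22:33:44:55', {'ifInOctets': 123456, 'ts': 1124400000})
state = mc.get('modem:00:11:22:33:44:55')   # None if evicted or expired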

If you can fit all this data in RAM and you don't care so much about
the integrity, you might not benefit much from a RDBMS at all.
However, I don't really know what you're doing with the data once you
have it so I might be very wrong here...

-bob

>
> Again, thanks for the help.
>
> 'njoy,
> Mark
>
>
> On 8/19/05 5:09 AM, "Bob Ippolito" <bob@redivi.com> wrote:
>
>
>>
>> On Aug 18, 2005, at 10:24 PM, Mark Cotner wrote:
>>
>>
>>> I'm currently working on an application that will poll
>>> thousands of cable modems per minute and I would like
>>> to use PostgreSQL to maintain state between polls of
>>> each device.  This requires a very heavy amount of
>>> updates in place on a reasonably large table (100k-500k
>>> rows, ~7 columns mostly integers/bigint).  Each row
>>> will be refreshed every 15 minutes, or at least that's
>>> how fast I can poll via SNMP.  I hope I can tune the
>>> DB to keep up.
>>>
>>> The app is threaded and will likely have well over 100
>>> concurrent db connections.  Temp tables for storage
>>> aren't a preferred option since this is designed to be
>>> a shared nothing approach and I will likely have
>>> several polling processes.
>>>
>>
>> Somewhat OT, but..
>>
>> The easiest way to speed that up is to use fewer threads.  You're
>> adding a whole TON of overhead with that many threads that you just
>> don't want or need.  You should probably be using something event-
>> driven to solve this problem, with just a few database threads to
>> store all that state.  Less is definitely more in this case.  See
>> <http://www.kegel.com/c10k.html> (and there's plenty of other
>> literature out there saying that event driven is an extremely good
>> way to do this sort of thing).
>>
>> Here are some frameworks to look at for this kind of network code:
>> (Python) Twisted - <http://twistedmatrix.com/>
>> (Perl) POE - <http://poe.perl.org/>
>> (Java) java.nio (not familiar enough with the Java thing to know
>> whether or not there's a high-level wrapper)
>> (C++) ACE - <http://www.cs.wustl.edu/~schmidt/ACE.html>
>> (Ruby) IO::Reactor - <http://www.deveiate.org/code/IO-Reactor.html>
>> (C) libevent - <http://monkey.org/~provos/libevent/>
>>
>> .. and of course, you have select/poll/kqueue/WaitNextEvent/whatever
>> that you could use directly, if you wanted to roll your own solution,
>> but don't do that.
>>
>> If you don't want to optimize the whole application, I'd at least
>> just push the DB operations down to a very small number of
>> connections (*one* might even be optimal!), waiting on some kind of
>> thread-safe queue for updates from the rest of the system.  This way
>> you can easily batch those updates into transactions and you won't be
>> putting so much unnecessary synchronization overhead into your
>> application and the database.
>>
>> Generally, once you have more worker threads (or processes) than
>> CPUs, you're going to get diminishing returns in a bad way, assuming
>> those threads are making good use of their time.
>>
>> -bob
>>
>>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>

