Thread: "large" spam tables and performance: postgres memory parameters

"large" spam tables and performance: postgres memory parameters

From: Gary Warner

Hello,

I've been lurking on this list a couple weeks now, and have asked some "side questions" to some of the list members, who have been gracious, and helpful, and encouraged me to just dive in and participate on the list.

I'll not tell you the whole story right off the bat, but let me give you a basic outline.

I dual report into two academic departments at the University of Alabama at Birmingham - Computer & Information Sciences and Justice Sciences.  Although my background is on the CS side, I specialize in cybercrime investigations and our research focuses on things that help to train or equip cybercrime investigators.

One of our research projects is called the "UAB Spam Data Mine".  Basically, we collect spam, use it to detect emerging malware threats, phishing sites, or spam campaigns, and share our findings with law enforcement and our corporate partners.

We started off small, with only around 10,000 to 20,000 emails per day running on a smallish server.  Once we had our basic workflow down, we moved to nicer hardware, and opened the floodgates a bit.  We're currently receiving about 1.2 million emails per day, and hope to very quickly grow that to more than 5 million emails per day.

I've got very nice hardware - many TB of very fast disks, and several servers with 12GB of RAM and 8 pentium cores
each.  

For the types of investigative support we do, some of our queries are of the 'can you tell me what this botnet was spamming for the past six months' variety, but most of them are more "real time", of the "what is the top spam campaign today?" or "what domains are being spammed by this botnet RIGHT NOW" variety.

We currently use 15 minute batch queues, where we parse between 10,000 and 20,000 emails every 15 minutes.  Each message is assigned a unique message_id, which is a combination of the date-and-time "batch" it is in, followed by a sequential number, so the most recent batch processed this morning starts with "10Jan07.0" and goes through "10Jan07.13800".

OK, you've done the math . . . we're at 60 million records in the spam table.  The other "main" table is "spam_links", which has the URL information.  It's got 170 million records and grows by more than 3 million per day currently.
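
For concreteness, here is a heavily simplified sketch of the sort of layout involved; the table and column definitions below are illustrative only, not our actual schema:

-- Illustrative sketch only; real column names/types will differ.
CREATE TABLE spam (
    message_id   text PRIMARY KEY,      -- e.g. '10Jan07.1234' (batch date/time + sequence)
    batch_time   timestamp NOT NULL,    -- which 15-minute batch the message arrived in
    sender       text,
    subject      text,
    raw_headers  text
);

CREATE TABLE spam_links (
    link_id      bigserial PRIMARY KEY,
    message_id   text NOT NULL REFERENCES spam (message_id),
    url          text NOT NULL,
    domain       text,
    resolved_ip  inet                   -- filled in later by the IP-fetching routines
);

CREATE INDEX spam_batch_time_idx   ON spam (batch_time);
CREATE INDEX spam_links_domain_idx ON spam_links (domain);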

Believe it or not, many law enforcement cases actually seek evidence from two or more years ago when it's time to go to trial.  We're looking at a potential max retention size of a billion emails and 3 billion URLs.

-------------

I don't know what this list considers "large databases", but I'm going to go ahead and call 170 million records a "large" table.

-------------

I'll have several questions, but I'll limit this thread to:

 - if you have 8 pentium cores, 12GB of RAM and "infinite" diskspace, what sorts of memory settings would you have in your startup configuration?

  My biggest question mark there really has to do with how many users I have and how that might alter the results.  My research team has about 12 folks who might be using the UAB Spam Data Mine at any given time, plus we have the "parser" running pretty much constantly, and routines that are fetching IP addresses for all spammed URLs and nameservers for all spammed domains and constantly updating the databases with that information.  In the very near future, we'll be accepting queries directly from law enforcement through a web interface and may have as many as another 12 simultaneous users, so maybe 25 max users.  We plan to limit "web users" to a "recent" subset of the data, probably allowing "today", "previous 24 hours" and "previous 7 days" as query options within the web interface.  The onsite researchers will bang the heck out of much larger datasets.


(I'll let this thread run a bit, and then come back to ask questions about "vacuum analyze" and "partitioned tables" as a second and third round of questions.)

--

----------------------------------------------------------

Gary Warner
Director of Research in Computer Forensics
The University of Alabama at Birmingham
Department of Computer & Information Sciences
& Department of Justice Sciences
205.934.8620             205.422.2113
gar@cis.uab.edu        gar@askgar.com

-----------------------------------------------------------

Re: "large" spam tables and performance: postgres memory parameters

From: "Kevin Grittner"

Welcome out of the shadows, Gary!  ;-)

Gary Warner <gar@cis.uab.edu> wrote:

> My biggest question mark there really has to do with how many
> users I have and how that might alter the results.

In benchmarks I've run with our software load, I've found that I get
best throughput when I use a connection pool which limits the active
database transaction count to (2 * CPU_count) + effective_spindles.
CPU_count should be fairly obvious; effective_spindles is
essentially "what's the maximum number of random read requests your
disk subsystem can productively handle concurrently?"
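
For example, on an 8-CPU box whose array can productively handle, say, 10 concurrent random reads, that works out to a pool of (2 * 8) + 10 = 26 active transactions.  (The spindle figure there is just an illustrative guess; the real number depends on your particular disk subsystem.)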

One or two others commented that their benchmark results seemed to
fit with that formula.  I don't know just how far to trust it as a
generalization, but in the absence of anything else, it probably
isn't a horrible rule of thumb.  If you expect to have more logical
connections than that number, you might want to establish a
connection pool which limits to that number.  Be sure that if it is
"full" when a request to start a transaction comes in, the request
queues instead of failing.

To convince yourself that this really helps, picture a hypothetical
machine which can only make progress on one request at a time, but
will task-switch among as many requests as are presented.  Then
picture 100 requests being presented simultaneously, each of which
needs one second of time to complete.  Even without figuring in the
overhead of task switching or the cache effects, it's clear that a
connection "pool" of one connection improve response time by almost
50% with no cost in throughput. When you factor in the very real
costs of having large numbers of requests competing, both throughput
and response time win with connection pooling above some threshold.
Of course, with multiple CPUs, multiple spindles, network latency,
etc., the pool should be large enough to tend to keep them all busy.

Of course, the exact point at which a connection pool gives optimal
performance depends on so many factors that the only way to *really*
get it right is to test with a realistic load through your actual
software.  The above is just intended to suggest a reasonable
starting point.

-Kevin

Re: "large" spam tables and performance: postgres memory parameters

From: Scott Marlowe

On Thu, Jan 7, 2010 at 8:23 AM, Gary Warner <gar@cis.uab.edu> wrote:
> Hello,
>
> I've been lurking on this list a couple weeks now, and have asked some "side questions" to some of the list members, who have been gracious, and helpful, and encouraged me to just dive in and participate on the list.
>
> I'll not tell you the whole story right off the bat, but let me give you a basic outline.
>
> I dual report into two academic departments at the University of Alabama at Birmingham - Computer & Information Sciences and Justice Sciences.  Although my background is on the CS side, I specialize in cybercrime investigations and our research focuses on things that help to train or equip cybercrime investigators.
>
> One of our research projects is called the "UAB Spam Data Mine".  Basically, we collect spam, use it to detect emerging malware threats, phishing sites, or spam campaigns, and share our findings with law enforcement and our corporate partners.
>
> We started off small, with only around 10,000 to 20,000 emails per day running on a smallish server.  Once we had our basic workflow down, we moved to nicer hardware, and opened the floodgates a bit.  We're currently receiving about 1.2 million emails per day, and hope to very quickly grow that to more than 5 million emails per day.
>
> I've got very nice hardware - many TB of very fast disks, and several servers with 12GB of RAM and 8 pentium cores each.

Are you running 8.3.x?  I'd go with that as a minimum for now.

You'll want to make sure those fast disks are under a fast RAID setup
like RAID-10, perhaps with a high quality RAID controller with battery
backed cache as well.  I/O is going to be your real issue here, not
CPU, most likely.  Also look at increasing RAM to 48 or 96Gig if
you can afford it.  I assume your pentium cores are Nehalem since
you've got 12G of RAM (multiple of 3).  Those are a good choice here,
they're fast and have good memory access.

> For the types of investigative support we do, some of our queries are of the 'can you tell me what this botnet was spamming for the past six months' variety, but most of them are more "real time", of the "what is the top spam campaign today?" or "what domains are being spammed by this botnet RIGHT NOW" variety.

Then you'll probably want to look at partitioning your data.
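
To make that concrete, here is a rough sketch of the inheritance-based partitioning available in 8.3/8.4; the table and column names are the illustrative ones from earlier in the thread, not your actual schema:

-- One child table per day; the parser inserts directly into the current day's child.
CREATE TABLE spam_2010_01_07 (
    CHECK (batch_time >= DATE '2010-01-07' AND batch_time < DATE '2010-01-08')
) INHERITS (spam);

CREATE INDEX spam_2010_01_07_batch_time_idx ON spam_2010_01_07 (batch_time);

-- With constraint_exclusion enabled ('on' in 8.3, 'partition' in 8.4), a query like
--     SELECT count(*) FROM spam WHERE batch_time >= now() - interval '24 hours';
-- only scans the child tables whose CHECK constraints can possibly match.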

> We currently use 15 minute batch queues, where we parse between 10,000 and 20,000 emails every 15 minutes.  Each message is assigned a unique message_id, which is a combination of the date-and-time "batch" it is in, followed by a sequential number, so the most recent batch processed this morning starts with "10Jan07.0" and goes through "10Jan07.13800".

> I don't know what this list considers "large databases", but I'm going to go ahead and call 170 million records a "large" table.


>  - if you have 8 pentium cores, 12GB of RAM and "infinite" diskspace, what sorts of memory settings would you have in your startup configuration?

crank up shared_buffers and effective_cache_size.  Probably could use
a bump for work_mem, something in the 16 to 32 Meg range.  Especially
since you're only looking at a dozen or so, not hundreds, of
concurrent users.  work_mem is per sort, so it can get out of hand
fast if you crank it up too high, and for most users higher settings
won't help anyway.
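
Since work_mem can also be raised for a single session, a heavy analytical query can get more room without changing the global default.  A sketch, using the illustrative table names from earlier in the thread and a made-up 256MB figure:

SET work_mem = '256MB';    -- for this session only
SELECT domain, count(*)
  FROM spam_links
 WHERE message_id LIKE '10Jan07.%'
 GROUP BY domain
 ORDER BY count(*) DESC
 LIMIT 20;
RESET work_mem;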

>  My biggest question mark there really has to do with how many users I have and how that might alter the results.  My research team has about 12 folks who might be using the UAB Spam Data Mine at any given time, plus we have the "parser" running pretty much constantly, and routines that are fetching IP addresses for all spammed URLs and nameservers for all spammed domains and constantly updating the databases with that information.  In the very near future, we'll be accepting queries directly from law enforcement through a web interface and may have as many as another 12 simultaneous users, so maybe 25 max users.  We plan to limit "web users" to a "recent" subset of the data, probably allowing "today", "previous 24 hours" and "previous 7 days" as query options within the web interface.  The onsite researchers will bang the heck out of much larger datasets.

You might want to look at pre-rolling common requests if that isn't
what you're already doing with the routines you're mentioning up
there.
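
As a purely illustrative sketch (hypothetical table and column names), the parser could maintain a small summary table at the end of each 15-minute batch, so the "top campaign today" style queries never have to touch the big tables:

CREATE TABLE domain_daily_counts (
    day     date   NOT NULL,
    domain  text   NOT NULL,
    hits    bigint NOT NULL,
    PRIMARY KEY (day, domain)
);

-- Refresh today's rows after each batch (no upsert in 8.3/8.4, so delete-then-insert):
DELETE FROM domain_daily_counts WHERE day = current_date;
INSERT INTO domain_daily_counts (day, domain, hits)
SELECT current_date, l.domain, count(*)
  FROM spam s
  JOIN spam_links l USING (message_id)
 WHERE s.batch_time >= current_date
 GROUP BY l.domain;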

> (I'll let this thread run a bit, and then come back to ask questions about "vacuum analyze" and "partitioned tables" as a second and third round of questions.)

You shouldn't have a lot of vacuum problems since you won't be
deleting anything.  You might wanna crank up autovacuum aggressiveness
as regards analyze though.  Partitioning is in your future.
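
One simple way to do that with constant bulk loads is to run ANALYZE explicitly at the end of each batch rather than waiting for autovacuum to notice the new rows (lowering autovacuum_analyze_scale_factor in postgresql.conf is the other knob).  The table names below are the hypothetical ones from the sketches above:

-- At the end of each 15-minute load:
ANALYZE spam_2010_01_07;   -- the current day's child table
ANALYZE spam_links;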

Re: "large" spam tables and performance: postgres memory parameters

From: Stephen Frost

* Gary Warner (gar@cis.uab.edu) wrote:
>  - if you have 8 pentium cores, 12GB of RAM and "infinite" diskspace, what sorts of memory settings would you have in your startup configuration?

If the PG database is the only thing on the system, I'd probably go with
something like:

shared_buffers = 4GB
temp_buffers = 1GB
work_mem = 128MB # Maybe adjust this during a session if you have
                 # big/complex queries
maintenance_work_mem = 256MB
checkpoint_segments = 20 # Maybe more..
effective_cache_size = 8GB # Maybe more if you have a SAN which is doing
                           # caching for you too..


>   My biggest question mark there really has to do with how many users I have and how that might alter the results.

Presuming what you mean by this is "how would the number of users change
the settings I'm suggesting above", I'd say "probably not much for the
number of users you're talking about."  Really, 12-25 users just isn't
all that many.  You probably need a queueing system to handle requests
that are going to take a long time to complete (as in, don't expect the
user or their web browser to stick around while you run a query that
takes half an hour to complete...).

> (I'll let this thread run a bit, and then come back to ask questions about "vacuum analyze" and "partitioned tables" as a second and third round of questions.)

You should definitely be looking to do partitioning based on the type of
data and the way you want to restrict the queries.  I'm not convinced
you'd actually *need* to restrict the queries to recent things if you
partition correctly; you might restrict the total *range* to be
something small enough that it won't take too long.

It sounds like you have a number of systems, in which case you might
consider sharding if you get a large number of users (where large is a
whole lot bigger than 25...) or you find that users really do need
real-time results on very large ranges.  It involves a fair bit of code
to do and do well though, so you really have to consider it carefully
and make sure it will help your important use cases (and not too badly
impact your other use cases) before going that route.

autovacuum is your friend..  though you might need to tune it.

    Thanks,

        Stephen
