"large" spam tables and performance: postgres memory parameters - Mailing list pgsql-performance
From | Gary Warner
---|---
Subject | "large" spam tables and performance: postgres memory parameters
Date |
Msg-id | 1370423997.141099.1262877797658.JavaMail.root@zimbra.cis.uab.edu
Responses | Re: "large" spam tables and performance: postgres memory parameters
 | Re: "large" spam tables and performance: postgres memory parameters
 | Re: "large" spam tables and performance: postgres memory parameters
List | pgsql-performance
Hello, I've been lurking on this list a couple of weeks now, and have asked some "side questions" of some of the list members, who have been gracious and helpful and encouraged me to just dive in and participate on the list.

I'll not tell you the whole story right off the bat, but let me give you a basic outline. I report into two academic departments at the University of Alabama at Birmingham - Computer & Information Sciences and Justice Sciences. Although my background is on the CS side, I specialize in cybercrime investigations, and our research focuses on things that help to train or equip cybercrime investigators.

One of our research projects is called the "UAB Spam Data Mine". Basically, we collect spam, use it to detect emerging malware threats, phishing sites, and spam campaigns, and share our findings with law enforcement and our corporate partners.

We started off small, with only around 10,000 to 20,000 emails per day running on a smallish server. Once we had our basic workflow down, we moved to nicer hardware and opened the floodgates a bit. We're currently receiving about 1.2 million emails per day, and hope to very quickly grow that to more than 5 million emails per day.

I've got very nice hardware - many TB of very fast disks, and several servers with 12GB of RAM and 8 Pentium cores each.

For the types of investigative support we do, some of our queries are of the sort "can you tell me what this botnet was spamming for the past six months", but most of them are more "real time", of the sort "what is the top spam campaign today?" or "what domains are being spammed by this botnet RIGHT NOW?"

We currently use 15-minute batch queues, where we parse between 10,000 and 20,000 emails every 15 minutes. Each message is assigned a unique message_id, which is a combination of which date-and-time "batch" it is in, followed by a sequential number, so the most recent batch processed this morning starts with "10Jan07.0" and goes through "10Jan07.13800". OK, you've done the math . . .
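For anyone who wants to see the id scheme concretely, here is a minimal sketch of how such a batch message_id could be built (the function name is hypothetical; the format simply mirrors the "10Jan07.13800" examples above):

```python
from datetime import datetime

def batch_message_id(batch_time: datetime, seq: int) -> str:
    """Build an id like '10Jan07.13800': two-digit year, abbreviated
    month, two-digit day, then a dot and the sequential number."""
    return f"{batch_time.strftime('%y%b%d')}.{seq}"

# The first and last messages of the batch described above:
print(batch_message_id(datetime(2010, 1, 7), 0))      # 10Jan07.0
print(batch_message_id(datetime(2010, 1, 7), 13800))  # 10Jan07.13800
```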
we're at 60 million records in the spam table. The other "main" table is "spam_links", which has the URL information. It's got 170 million records and grows by more than 3 million per day currently. Believe it or not, many law enforcement cases actually seek evidence from two or more years ago when it's time to go to trial. We're looking at a potential max retention size of a billion emails and 3 billion URLs.

-------------
I don't know what this list considers "large databases", but I'm going to go ahead and call 170 million records a "large" table.
-------------

I'll have several questions, but I'll limit this thread to:

- if you have 8 Pentium cores, 12GB of RAM, and "infinite" disk space, what sorts of memory settings would you have in your startup configuration?

My biggest question mark there really has to do with how many users I have and how that might alter the results. My research team has about 12 folks who might be using the UAB Spam Data Mine at any given time, plus we have the "parser" running pretty much constantly, and routines that are fetching IP addresses for all spammed URLs and nameservers for all spammed domains and constantly updating the databases with that information. In the very near future, we'll be accepting queries directly from law enforcement through a web interface and may have as many as another 12 simultaneous users, so maybe 25 max users. We plan to limit "web users" to a "recent" subset of the data, probably allowing "today", "previous 24 hours", and "previous 7 days" as query options within the web interface. The onsite researchers will bang the heck out of much larger datasets.

(I'll let this thread run a bit, and then come back to ask questions about "vacuum analyze" and "partitioned tables" as a second and third round of questions.)
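To make the question concrete, here is the sort of postgresql.conf fragment I mean. The values below are only placeholders based on common rules of thumb for a 12GB machine, not anything I've settled on - I'd welcome corrections:

```
# Illustrative starting values for 12GB RAM, ~25 concurrent users.
# These are assumptions for discussion, not recommendations.
shared_buffers = 2GB            # ~25% of RAM is an often-cited starting point
effective_cache_size = 8GB      # rough estimate of what the OS will cache
work_mem = 64MB                 # per sort/hash, per backend -- kept modest
                                # because 25 users could each use several
maintenance_work_mem = 512MB    # for VACUUM and index builds on big tables
max_connections = 50            # headroom above the ~25 expected users
```

In particular I'm unsure how work_mem should scale down as the number of simultaneous users goes up, since it is allocated per sort or hash operation rather than per server.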
--
----------------------------------------------------------
Gary Warner
Director of Research in Computer Forensics
The University of Alabama at Birmingham
Department of Computer & Information Sciences
& Department of Justice Sciences
205.934.8620
205.422.2113
gar@cis.uab.edu
gar@askgar.com
-----------------------------------------------------------