Thread: system log audit/reporting and psql
Hello group,

I need to design and develop a web reporting system that lets users query and view syslog files on a Unix host. For now I am concentrating on the authentication log, which records user logon (success/failure) and logoff events, e.g.:

  2005-01-01 | 00:27:55 | firewood | ssh | Login Successful | None | local | user9819 | 192.168.1.31

The log file is rotated by logrotate every week or so. My reporting system parses the log entries and puts the results into a PostgreSQL database (I am proposing PostgreSQL as the backend). Since this deals with a multi-year archive, I believe partitioning is an ideal feature for this problem. Here is the design scheme:

  CREATE TABLE logon_success(
      name  varchar(32) not null,
      srcip inet not null,
      date  date not null,
      time  time not null,
      ...
  );

  CREATE TABLE logon_success_yy${year}mm${month}(
      CHECK (date >= DATE '$year-$month-01'
         AND date <  DATE '$next_year-$next_month-01')
  ) INHERITS ($tname);

As you can see from the sample code, I use Perl to generate the child tables dynamically as I parse log files in a daily cron job. Once a log file is analyzed and archived in the database, a simple web UI lets the sysadmin select and view user logon events. I have built a sample framework and it works so far. Keep in mind that this reporting system is not limited to user logons; it should also handle system events such as service failures/startups, hardware failures, etc.

My initial testing has not shown any significant difference between the partitioning approach and a plain (all entries in the master table) approach. My test was based on two artificial tables with 1700 records per day from 2004-02-01 to 2007-04-27, around 2 million entries, identical in both tables.
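Not part of the original post: the Perl templating described above could be sketched like this (shown here in Python for brevity; the table and column names are taken from the scheme above, everything else is made up). It emits the DDL for one monthly child table, taking care of the December-to-January rollover in the CHECK constraint:

```python
from datetime import date

def partition_ddl(parent: str, year: int, month: int) -> str:
    """Emit DDL for one monthly child of `parent`, mirroring the
    inheritance + CHECK-constraint scheme from the post."""
    start = date(year, month, 1)
    # First day of the following month (handles the December rollover).
    if month == 12:
        end = date(year + 1, 1, 1)
    else:
        end = date(year, month + 1, 1)
    child = f"{parent}_yy{year:04d}mm{month:02d}"
    return (
        f"CREATE TABLE {child}(\n"
        f"    CHECK (date >= DATE '{start}' AND date < DATE '{end}')\n"
        f") INHERITS ({parent});"
    )

print(partition_ddl("logon_success", 2005, 12))
```

Note that with this scheme the planner only skips non-matching children when constraint exclusion is enabled (constraint_exclusion in postgresql.conf on the 8.x series); without it every child is scanned regardless of the CHECK ranges.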
My test script:

  echo Testing database $t1 time based
  time psql -p 5583 netilla postgres << EOF
  select count(date) from $t1
  where date > '2005-03-01' and date < '2006-12-11';
  \q
  EOF

  echo Testing database $t2 time based
  time psql -p 5583 netilla postgres << EOF
  select count(date) from $t2
  where date > '2005-03-01' and date < '2006-12-11';
  \q
  EOF

Result:

  ./timing_test.sh
  Testing database logon_test time based
    count
  ---------
  1121472
  (1 row)
  0.00user 0.00system 0:02.92elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+456minor)pagefaults 0swaps

  Testing database logon_test2 time based
    count
  ---------
  1121472
  (1 row)
  0.00user 0.00system 0:02.52elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+456minor)pagefaults 0swaps

But the numbers are not stable, and logon_test2 (with partitioning) sometimes behaves worse than logon_test.

Now here are my questions:

1) Should I use a database to implement such a reporting system? Are there any alternative architectures or designs?

2) Is partitioning a good approach to speed up log queries? The user comments on partitioning in the PostgreSQL manual seem to indicate that a partitioned table may be slower than a non-partitioned one under certain circumstances.

3) How do I avoid scanning log entries repeatedly, given that my cron job runs daily but logrotate runs weekly? As it stands, my script will parse duplicate entries on every run.

4) When parsing log files, it is quite possible to encounter identical entries, for example when a user logs in twice in rapid succession, producing two or more identical records. In this case can I still use a primary key or index at all? If so, how should I design the primary key or index to speed up queries?

5) What are the most glaring limitations and flaws in my design?

6) What are the best approaches to analyzing PostgreSQL query performance, and how do I improve it?

Thank you for taking the time to review and answer my questions! Let me know if I am not clear on any specific detail.

Fei
Hi!

Fei Liu schrieb:
[...]
> Now here are my questions:

These are a lot of questions, and some of them are not related to pgsql-performance or even PostgreSQL. I'll try to answer some of them, because I'm currently experimenting with partitioned tables, too.

> 2) Is partitioning a good approach to speed up log queries? The user
> comments on partitioning in the PostgreSQL manual seem to indicate that
> a partitioned table may be slower than a non-partitioned one under
> certain circumstances.

You can look at table partitioning from several points of view. My experience with table partitioning under the aspect of "performance" is:

*) It is a benefit for large tables only, and the definition of "large" depends on your data. I did some testing for an application we are developing, and there it showed that table partitioning does not seem to make sense for tables with fewer than 10 million rows (perhaps even more).

*) The performance benefit depends on your queries. Some queries get a big improvement, but some queries might even run significantly slower.

*) Depending on how you set up your system, inserts can be much slower with partitioned tables (e.g. if you are using triggers to create your partitions automatically on demand).

> 3) How do I avoid scanning log entries repeatedly, given that my cron
> job runs daily but logrotate runs weekly? As it stands, my script will
> parse duplicate entries on every run.

This has nothing to do with Postgres, but I wrote something similar years ago. Here's what I did and what you could do: remember the last line of your logfile in some external file (or even the database). On the next run, read the logfile again line by line and skip all lines until you have found the line you saved on the last run.

* If you find the line that way, just start parsing the logfile beginning at the next line.

* If you cannot find your line and you reach EOF, start parsing again at the beginning of the logfile.
* If this is your first run and you don't have a line stored yet, start parsing at the beginning of the logfile.

When you are finished, you have to remember the last line of the logfile again at some place.

> 6) What are the best approaches to analyzing PostgreSQL query
> performance, and how do I improve it?

Here are some general recommendations for performance testing from my experience:

*) Test with real data. For table partitioning this means you have to create really large datasets for your tests to be useful. You should write a small program to generate your test data if possible, or use some other means to create your test database. You also need time: creating a test database and importing 100 million rows of test data will take several hours or even days!

*) Test with the real queries from your application! Testing with just a few simple standard queries will almost certainly not be sufficient to get the right numbers for the performance you will see in your application later on. Look at the thread "Query performance problems with partitioned tables" I started on pgsql-performance just yesterday to see what I mean!

*) Use EXPLAIN ANALYZE and look at the "cost" and "actual time" numbers it gives you. It will also show you the query plan PostgreSQL uses when executing your query. You sometimes might be surprised what is going on behind the scenes...

*) If you are just using a stopwatch to time your queries, be aware of other factors which might significantly influence your results: caching, other jobs running on the machine in parallel, cosmic rays, ...

*) Before running your tests you should always try to get to some well-defined starting point (this might even mean rebooting your server before each test), and you should always repeat each test several times and then calculate a mean value (and standard deviation, to see how "good" your results are...).
*) Document your test setup and procedure as well as your results (otherwise two days later you won't remember which test produced which result).

HTH

- andreas

-- 
Andreas Haumer         | mailto:andreas@xss.co.at
*x Software + Systeme  | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71