Thread: Speeding up this function
Hello,

I thought it would be nice to index certain aspects of my Apache log files for analysis. I've used several different techniques and have something usable now, but I'd like to tweak it one step further.

My first performance optimization was to change the log format to CSV. I processed the log files with PHP and PL/pgSQL stored procedures; unfortunately, it took more than 24 hours to process one day's worth of logs. I've now switched to C# (using Mono) and hash tables to do almost all of the pre-processing, which has brought the time down to about 3 hours. In fact, if I take out one step, the process drops to about 6 minutes, which is a tremendous improvement. That one step adds 2.5+ hours to the job and is not easily done in C#, as far as I know.

Once the mostly-normalized data has been put into a table called usage_raw_access, I run this query:

  insert into usage_access
    select *, usage_normalize_session(accountid, client, atime) as sessionid
    from usage_raw_access;

All it does is try to "link" pageviews together into a session. Here's the function:

  create or replace function usage_normalize_session (varchar(12), inet, timestamptz)
  returns integer as '
  DECLARE
      -- $1 = Account ID, $2 = IP Address, $3 = Time
      RecordSet record;
  BEGIN
      SELECT INTO RecordSet DISTINCT sessionid FROM usage_access ua
          WHERE ua.accountid = $1
            AND ua.client = $2
            AND ua.atime <= ($3 - ''20 min''::interval)::timestamptz;

      if found
          then return RecordSet.sessionid;
      end if;

      return nextval(''usage_session_ids'');
  END;'
  language plpgsql;

And the table usage_access looks like this:

            Table "public.usage_access"
    Column   |           Type           | Modifiers
  -----------+--------------------------+-----------
   [snip]
   client    | inet                     |
   atime     | timestamp with time zone |
   accountid | character varying(12)    |
   sessionid | integer                  |
  Indexes: usage_acccess_req_url btree (req_url),
           usage_access_accountid btree (accountid),
           usage_access_atime btree (atime),
           usage_access_hostid btree (hostid),
           usage_access_sessionid btree (sessionid),
           usage_access_sessionlookup btree (accountid, client, atime)

As you can see, this looks for clients who have visited the same site within 20 minutes. If there is no match, a unique sessionid is assigned from a sequence; if there is a recent visit, the sessionid already assigned to it is reused. I'm only able to process about 25 records per second with this setup. My window for this job is 3-4 hours, and the shorter the better.
Here is an explain analyze of the query I do (note I limited it to 1000 rows):

  EXPLAIN ANALYZE insert into usage_access
    select *, usage_normalize_session(accountid, client, atime) as sessionid
    from usage_raw_access limit 1000;

                                                             QUERY PLAN
  ----------------------------------------------------------------------------------------------------------------------------------
   Subquery Scan "*SELECT*"  (cost=0.00..20.00 rows=1000 width=196) (actual time=51.63..47634.22 rows=1000 loops=1)
     ->  Limit  (cost=0.00..20.00 rows=1000 width=196) (actual time=51.59..47610.23 rows=1000 loops=1)
           ->  Seq Scan on usage_raw_access  (cost=0.00..20.00 rows=1000 width=196) (actual time=51.58..47606.14 rows=1001 loops=1)
   Total runtime: 48980.54 msec

I also did an explain of the query that's performed inside the function:

  EXPLAIN ANALYZE select sessionid from usage_access ua
    where ua.accountid = 'XYZ'
      and ua.client = '64.68.88.45'::inet
      and ua.atime <= '2003-11-02 04:50:01-05'::timestamptz;

                                                             QUERY PLAN
  ----------------------------------------------------------------------------------------------------------------------------------
   Index Scan using usage_access_sessionlookup on usage_access ua  (cost=0.00..6.02 rows=1 width=4) (actual time=0.29..0.29 rows=0 loops=1)
     Index Cond: ((accountid = 'XYZ'::character varying) AND (client = '64.68.88.45'::inet) AND (atime <= '2003-11-02 04:50:01-05'::timestamp with time zone))
   Total runtime: 0.35 msec
  (3 rows)

What I'd really like to know is whether anyone knows a way to do any of the following:

  a: Make the INSERT INTO ... SELECT *, usage_normalize_session(...) work faster.
  b: Make usage_normalize_session() smarter, better, etc.
  c: Do this in C# using a hash table or some other procedure that would be quicker.
  d: Find an algorithm that creates the sessionid without doing any database or hash-table lookups at all. As the dataset gets bigger it won't fit in RAM, the lookup queries will become I/O bound, and everything will slow down drastically.

d: is my first choice. For some reason I just can't seem to get my mind around the data. I wonder if there's some way to create a unique value from the client IP address, the accountid and the period of time, so that all visits by that IP for that account in that period would match. I thought of using the date_part function to create a unique period, but it would be a hack: if someone visits at 11:50 pm and continues to browse for an hour, they would be counted as two sessions. That's not the end of the world, but some of my customers in drastically different time zones would always have skewed results.

I tried and tried to get C# to turn the Apache date string into a usable time but could not, so I just leave the date intact and let PostgreSQL handle it when I do the COPY. Therefore, though I'd like to do this in my C# program, I'll likely have to do the sessionid code in a stored procedure.

I'd really love some feedback on ways to optimize this. Any suggestions are greatly appreciated.

--
Matthew Nuzum       | Makers of "Elite Content Management System"
www.followers.net   | View samples of Elite CMS in action
matt@followers.net  | http://www.followers.net/portfolio/
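A minimal sketch of the bucketing idea in (d), assuming a fixed 20-minute bucket derived from the epoch with extract/date_part; the key format here is arbitrary and purely illustrative:

  -- hypothetical: derive a session key with no lookups at all
  SELECT *,
         accountid || '/' || host(client) || '/'
                   || floor(extract(epoch FROM atime) / 1200)::int AS session_key
  FROM usage_raw_access;

Two hits a couple of minutes apart can still straddle a bucket boundary and receive different keys, which is exactly the boundary problem described above.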
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Matt Nuzum
> Sent: Tuesday, October 19, 2004 3:35 PM
> To: pgsql-performance
> Subject: [PERFORM] Speeding up this function
>
> <snip>
>
> All it does is try to "link" pageviews together into a session.
> Here's the function:
>
>   create or replace function usage_normalize_session (varchar(12), inet, timestamptz)
>   returns integer as '
>   DECLARE
>       -- $1 = Account ID, $2 = IP Address, $3 = Time
>       RecordSet record;
>   BEGIN
>       SELECT INTO RecordSet DISTINCT sessionid FROM usage_access ua
>           WHERE ua.accountid = $1
>             AND ua.client = $2
>             AND ua.atime <= ($3 - ''20 min''::interval)::timestamptz;
>
>       if found
>           then return RecordSet.sessionid;
>       end if;
>
>       return nextval(''usage_session_ids'');
>   END;'
>   language plpgsql;

This is probably a stupid question, but why are you trying to create sessions after the fact? Since it appears that users of your site must log in, why not just assign a sessionID at login time and keep it in the URL for the duration of the session? Then it would be easy to track where they've been.

- Jeremy
On Tue, 19 Oct 2004 15:49:45 -0400, Jeremy Dunn <jdunn@autorevenue.com> wrote:
> <snip>
>
> This is probably a stupid question, but why are you trying to create
> sessions after the fact? Since it appears that users of your site must
> log in, why not just assign a sessionID at login time and keep it in
> the URL for the duration of the session? Then it would be easy to
> track where they've been.
>
> - Jeremy

You don't have to log in to visit the sites; these log files actually cover many domains. Right now we do the logging with a web bug, and it does handle sessions, but it relies on JavaScript and we want to track a lot more than we do now. Plus, that code is in JavaScript, and one of our primary motivations is to ditch MySQL completely.

--
Matthew Nuzum       | Makers of "Elite Content Management System"
www.followers.net   | View samples of Elite CMS in action
matt@followers.net  | http://www.followers.net/portfolio/
How many lines do you have in your daily logfiles?

> As you can see, this looks for clients who have visited the same site
> within 20 min. If there is no match, a unique sessionid is assigned
> from a sequence. If there is a visit, the session id assigned to them
> is used. I'm only able to process about 25 records per second with my
> setup. My window to do this job is 3-4 hours and the shorter the
> better.

I'd say your function is flawed, because if a client stays more than 20 minutes he'll get two sessions.

I'd propose the following:

* Solution with postgres (variant #1):

  - Insert everything into the big table.
  - SELECT make_session(...) FROM bigtable GROUP BY accountid. (You may or may not wish to use the IP address as well: using it will split sessions for people behind anonymizing Crowds-style proxies, while not using it will merge sessions from the same user coming from two different IPs. I'd not use it.) Use an index-powered GroupAggregate, maybe.

Now the data is well ordered, i.e. all accesses from the same account are grouped, so you just have to find 'gaps' of more than 20 minutes in the atimes to merge or make sessions. This is done by the aggregate 'make_session', whose internal state is a list of sessions of the form:

  - session:
      - session start time
      - session end time

All the aggregate does is check whether the atime of the incoming row is < (session end time + 20 min):

  - if it is, update the session, setting its end time to atime;
  - if not, create a new session with start time = end time = atime and append it to the session list.

So you get a table of session arrays; you just have to assign them ids and track back to the URLs to mark them. If an aggregate can issue INSERT or UPDATE queries, it can even generate session ids on the fly in a table, which simplifies its internal state.

* Solution with postgres (variant #2):

  - Insert everything into raw_table.
  - CREATE TABLE sorted_table, just like raw_table but with an "id SERIAL PRIMARY KEY" column added.
  - INSERT INTO sorted_table SELECT * FROM raw_table ORDER BY accountid, atime;

The aggregate above was basically comparing the atimes of two adjacent rows to detect a gap of more than 20 minutes, so you could also do a join between rows a and b where

  b.id = a.id + 1 AND (b.accountid != a.accountid
                       OR b.atime > a.atime + '20 min'::interval
                       OR b does not exist)

This gives you the rows which mark a session start; then you join again to update all the rows in that session (BETWEEN ids) with the session id.

* Solution without postgres:

Take advantage of the fact that sessions are created and then die, so you only need RAM for the active sessions. Read the logfile sequentially. You'll need to parse the date; if you can't do that, use another language, change your Apache date format output, or write a parser.

Basically you're doing event-driven programming, as in a logic simulator, where the events are session expirations. As you read the file:

  - keep a hash of sessions indexed on accountid, and
  - keep a sorted (btree) list of sessions indexed on the session expiry time.

It's very important that this list allows fast insertion even in the middle, which is why a tree structure is better. Try a red-black tree.
For each record do:

  - Look up accountid in the hashtable and find the expiry date of its session.
    - If the session is still alive, you're in that session: update the session expiry date and the btree index accordingly, and append the url and other info to a list in the session if you want to keep them.
    - Otherwise, expire the session and start a new one; insert it into the hash and the btree, store the expired session on disk and remove it from memory. You don't need it anymore!

And, as you see the atime advancing, scan the btree for sessions to expire. It's ordered by expiry date, so that's fast. For each expired session found, expire the session, store it on disk and remove it from memory. You don't need it anymore!

That'd be my preferred solution. You'll need a good implementation of a sorted tree; you can find one in open source.

* Solution with postgres (variant #3):

Just like variant #2, but instead of an aggregate use a plpgsql procedure which reads the logs ordered by accountid, atime, keeping a copy of the last row and detecting session expirations on the fly; a rough sketch follows.
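A minimal sketch of variant #3, assuming a hypothetical function name and keeping the client IP in the ordering to mirror the original usage_normalize_session() (leave it out to follow the advice above); the columns hidden by [snip] in the table definition are omitted from the INSERT:

  create or replace function usage_assign_sessions () returns integer as '
  DECLARE
      r           record;
      last_acct   varchar(12);
      last_client inet;
      last_atime  timestamptz;
      cur_session integer;
      n           integer := 0;
  BEGIN
      FOR r IN SELECT accountid, client, atime FROM usage_raw_access
               ORDER BY accountid, client, atime LOOP
          -- new visitor, or a gap of more than 20 minutes: start a new session
          IF last_acct IS NULL
             OR r.accountid <> last_acct
             OR r.client    <> last_client
             OR r.atime     >  last_atime + ''20 min''::interval THEN
              cur_session := nextval(''usage_session_ids'');
          END IF;

          -- the snipped columns of usage_access would be carried over here as well
          INSERT INTO usage_access (accountid, client, atime, sessionid)
              VALUES (r.accountid, r.client, r.atime, cur_session);

          last_acct   := r.accountid;
          last_client := r.client;
          last_atime  := r.atime;
          n := n + 1;
      END LOOP;
      RETURN n;
  END;'
  language plpgsql;

Called once per batch (SELECT usage_assign_sessions();), it makes a single ordered pass over usage_raw_access instead of one index lookup per row, and a session stays open as long as consecutive hits are less than 20 minutes apart.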