Home > mailing lists

efficient data reduction (and deduping) - Mailing list pgsql-performance

From	Alessandro Gagliardi
Subject	efficient data reduction (and deduping)
Date	March 1, 2012 14:27:44
Msg-id	CAAB3BBJQxQ1VVd-mub90a=2H6ezNXx1_bRL28VoxZH5EXWE0Lw@mail.gmail.com Whole thread
Responses	Re: efficient data reduction (and deduping) Re: efficient data reduction (and deduping) Re: efficient data reduction (and deduping) Re: efficient data reduction (and deduping)
List	pgsql-performance

Tree view

Hi folks,

I have a system that racks up about 40M log lines per day. I'm able to COPY the log files into a PostgreSQL table that looks like this:

CREATE TABLE activity_unlogged

(

user_id character(24) NOT NULL,

client_ip inet,

hr_timestamp timestamp without time zone,

locale character varying,

log_id character(36),

method character varying(6),

server_ip inet,

uri character varying,

user_agent character varying

)

Now, I want to reduce that data to get the last activity that was performed by each user in any given hour. It should fit into a table like this:

CREATE TABLE hourly_activity

(

activity_hour timestamp without time zone NOT NULL,

user_id character(24) NOT NULL,

client_ip inet,

hr_timestamp timestamp without time zone,

locale character varying,

log_id character(36),

method character varying(6),

server_ip inet,

uri character varying,

user_agent character varying,

CONSTRAINT hourly_activity_pkey PRIMARY KEY (activity_hour , user_id )

)

where activity_hour is date_trunc('hour', hr_timestamp); (N.B. the primary key constraint)

I am attempting to do that with the following:

INSERT INTO hourly_activity

SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour, activity_unlogged.user_id,

client_ip, hr_timestamp, locale, log_id, method, server_ip, uri, user_agent

FROM activity_unlogged,

(SELECT user_id, MAX(hr_timestamp) AS last_timestamp

FROM activity_unlogged GROUP BY user_id, date_trunc('hour', hr_timestamp)) AS last_activity

WHERE activity_unlogged.user_id = last_activity.user_id AND activity_unlogged.hr_timestamp = last_activity.last_timestamp;

I have two problems:

It's incredibly slow (like: hours). I assume this is because I am scanning through a huge unindexed table twice. I imagine there is a more efficient way to do this, but I can't think of what it is. If I were doing this in a procedural programming language, it might look something like:
for row in activity_unlogged:
if (date_trunc('hour', hr_timestamp), user_id) in hourly_activity[(activity_hour, user_id)]:
if hr_timestamp > hourly_activity[(date_trunc('hour', hr_timestamp), user_id)][hr_timestamp]:
hourly_activity <- row # UPDATE
else:
hourly_activity <- row # INSERT
I suspect some implementation of this (hopefully my pseudocode is at least somewhat comprehensible) would be very slow as well, but at least it would only go through activity_unlogged once. (Then again, it would have to rescan hourly_activity each time, so it really wouldn't be any faster at all, would it?) I feel like there must be a more efficient way to do this in SQL though I can't put my finger on it.
Turns out (hr_timestamp, user_id) is not unique. So selecting WHERE activity_unlogged.user_id = last_activity.user_id AND activity_unlogged.hr_timestamp = last_activity.last_timestamp leads to multiple records leading to a primary key collision. In such cases, I don't really care which of the two rows are picked, I just want to make sure that no more than one row is inserted per user per hour. In fact, though I would prefer to get the last row for each hour, I could probably get much the same effect if I just limited it to one per hour. Though I don't know if that really helps at all.

pgsql-performance by date:

From: Craig James
Date: 01 March 2012, 14:19:53
Subject: Re: [planner] Ignore "order by" in subselect if parrent do count(*)

From: Claudio Freire
Date: 01 March 2012, 14:35:42
Subject: Re: efficient data reduction (and deduping)

efficient data reduction (and deduping) - Mailing list pgsql-performance

Previous

Next