Including unique users in huge data warehouse in Postgresql... - Mailing list pgsql-general
From | Mark Jensen |
---|---|
Subject | Including unique users in huge data warehouse in Postgresql... |
Date | |
Msg-id | 20061128175544.65232.qmail@web82108.mail.mud.yahoo.com Whole thread Raw |
Responses |
Re: Including unique users in huge data warehouse in Postgresql...
|
List | pgsql-general |
So i've been given the task of designing a data warehouse in either Postgresql or Mysql for our clickstream data for our sites. I started with Mysql but the joins in Mysql are just way too slow compared to Postgresql when playing with star schemas. I can't say which sites i'm working on, but we get close to 3-5 million uniques users per day, so over time, that's a lot of unique users to keep around and de-dup your fact tables by. Need to be able to query normal analytics like: - Unique users per time (day/month/week/ etc) - Unique users per area of the site - Unique users by anything really that we have in the dimension tables and that we know about. it's all about unique users at all times with them. so any report i do, unique users are usually included in them. so it's hard to keep summary tables around since it's mostly adhoc from the raw fact tables. Let me just get to the point. I'm wondering if anyone has had any experience doing a clickstream data warehouse in postgresql. Only way I can think of doing this is to keep a user id in the fact table that you would run a count(distinct(uu_id)) on to get the number of unique users for that query. so to keep this simple, this is what i have so far, and the fact table is probably close to 1 billion rows for about 20 days of data. but of course, this takes forever when you want to dedup by uu_id by each element. i can include visits in there as well, but most of the sales/business guys only really want unique users. it's pretty fast when you query against one time_id or hour/day, but when you have to scan the whole fact table to get LTV so far, it's crazy. i've made a lot of optimizations in postgresql.conf by playing with work_mem and shared_buffers and such and i think the database is using as much as it can disk/memory/cpu wise. also vacuuming the tables as much as possible. just wondering if anyone had any suggestions or could point out anything i could be doing wrong, or make it better to get at uniques. doing simple queries by not including uu_id (uniques) is pretty fast and that's no problem. I've also gotten 3 books from Kimball about data warehousing including the clickstream one. i've also tried bizgres version of postgresql using bitmap indexes, but didn't see a huge difference for what i need, so i'm back to using the new beta3 of postgresql right now, since i love the new copy command you can include queries in. here's a sample query that takes a while to run... just a simple report that shows gender by area of the site. select A.gender as gender, B.area as area, sum(C.imps) as imps, sum(C.clicks) as clicks, count(distinct(C.uu_id)) as users from uus as A, areas as B, daily_area_fact as C where A.uu_id = C.uu_id and B.area_id = C.area_id group by gender,area; so by just having one day of data, with 3,168,049 rows in the user dimension table (uus), 17,213,420 in the daily_area_fact table that joins all the dimension tables, takes about 15 minutes. if i had 30-90 days in this fact table, who knows how long this would take... i know doing a distinct on uu_id is very expensive, so that's the main problem here i guess and would want to know if anyone else is doing it this way or better. Total query runtime: 878595 ms. Data retrieval runtime: 1361 ms. 163 rows retrieved. here's the explain: QUERY PLAN ------------------------------------------------------------------------------------------------------ GroupAggregate (cost=3846163.34..4104788.14 rows=24200 width=33) -> Sort (cost=3846163.34..3889196.89 rows=17213420 width=33) Sort Key: a.gender, b.area -> Hash Join (cost=132111.19..1306685.84 rows=17213420 width=33) Hash Cond: (c.uu_id = a.uu_id) -> Hash Join (cost=86.58..701292.17 rows=17213420 width=25) Hash Cond: (c.area_id = b.area_id) -> Seq Scan on daily_area_fact c (cost=0.00..356937.20 rows=17213420 width=16) -> Hash (cost=78.26..78.26 rows=3326 width=17) -> Seq Scan on areas b (cost=0.00..78.26 rows=3326 width=17) -> Hash (cost=124104.49..124104.49 rows=3168049 width=12) -> Seq Scan on uus a (cost=0.00..124104.49 rows=3168049 width=12) (12 rows) So here is a snapshot of my user dimension table for each user (uus table) dw_big=# \d uus Table "public.uus" Column | Type | Modifiers --------------+------------------------+----------------------------------------------------- uu_id | integer | not null default nextval('uus_uu_id_seq'::regclass) uu | character(50) | not null imps | integer | clicks | integer | gca_clicks | integer | convs | integer | imp_rev | numeric(10,6) | click_rev | numeric(10,6) | total_rev | numeric(10,6) | geo_id | integer | reg | integer | usernum | integer | gender | character(1) | age | integer | age_grp | character(10) | reg_date | date | vis | integer | first_date | date | first_hour | integer | last_date | date | utm_campaign | character varying(100) | utm_medium | character varying(100) | utm_source | character varying(100) | utm_content | character varying(255) | utm_keyword | character varying(255) | Indexes: "uus_pkey" PRIMARY KEY, btree (uu) "uus_geo_id_index" btree (geo_id) "uus_usernum_id_index" btree (usernum) "uus_uu_id_index" btree (uu_id) here is my "area" of the site dimension table (areas table) dw_big=# \d areas; Table "public.areas" Column | Type | Modifiers ---------+------------------------+--------------------------------------------------------- area_id | integer | not null default nextval('areas_area_id_seq'::regclass) site | character varying(100) | not null area | character varying(100) | not null subarea | character varying(100) | not null size | character varying(50) | not null pos | character varying(50) | not null Indexes: "areas_pkey" PRIMARY KEY, btree (site, subarea, area, size, pos) "areas_area_id_index" btree (area_id) and here is my huge fact table which joins all the dimension tables together since they want to be able to query, cut, and slice anything by anything at any time: dw_big=# \d daily_area_fact; Table "public.daily_area_fact" Column | Type | Modifiers -----------+---------+----------- time_id | integer | not null ccid | integer | not null area_id | integer | not null uu_id | integer | not null geo_id | integer | not null imps | integer | clicks | integer | imp_rev | numeric | click_rev | numeric | total_rev | numeric | Indexes: "daily_area_fact_pkey" PRIMARY KEY, btree (time_id, ccid, area_id, uu_id, geo_id) ____________________________________ Mark Jensen
pgsql-general by date: