Including unique users in huge data warehouse in Postgresql... - Mailing list pgsql-general

From Mark Jensen
Subject Including unique users in huge data warehouse in Postgresql...
Date
Msg-id 20061128175544.65232.qmail@web82108.mail.mud.yahoo.com
Whole thread Raw
Responses Re: Including unique users in huge data warehouse in Postgresql...  (Richard Huxton <dev@archonet.com>)
List pgsql-general
So i've been given the task of designing a data warehouse in
either Postgresql or Mysql for our clickstream data for our sites.  I
started with Mysql but the joins in Mysql are just way too slow
compared to Postgresql when playing with star schemas.  I can't say
which sites i'm working on, but we get close to 3-5 million uniques
users per day, so over time, that's a lot of unique users to keep
around and de-dup your fact tables by.  Need to be able to query normal
analytics like:

- Unique users per time (day/month/week/ etc)
- Unique users per area of the site
-
Unique users by anything really that we have in the dimension tables
and that we know about.  it's all about unique users at all times with
them.  so any report i do, unique users are usually included in
them.  so it's hard to keep summary tables around since it's mostly
adhoc from the raw fact tables.

Let me just get to the
point.  I'm wondering if anyone has had any experience doing a
clickstream data warehouse in postgresql.  Only way I can think of
doing this is to keep a user id in the fact table that you would run a
count(distinct(uu_id)) on to get the number of unique users for that
query.  so to keep this simple, this is what i have so far, and the
fact table is probably close to 1 billion rows for about 20 days of
data.  but of course, this takes forever when you want to dedup by
uu_id by each element.  i can include visits in there as well, but most
of the sales/business guys only really want unique users.  it's pretty
fast when you query against one time_id or hour/day, but when you have
to scan the whole fact table to get LTV so far, it's crazy.

i've
made a lot of optimizations in postgresql.conf by playing with work_mem
and shared_buffers and such and i think the database is using as much
as it can disk/memory/cpu wise.  also vacuuming the tables as much as
possible.  just wondering if anyone had any suggestions or could point
out anything i could be doing wrong, or make it better to get at
uniques.  doing simple queries by not including uu_id (uniques) is
pretty fast and that's no problem.  I've also gotten 3 books from
Kimball about data warehousing including the clickstream one.  i've
also tried bizgres version of postgresql using bitmap indexes, but
didn't see a huge difference for what i need, so i'm back to using the
new beta3 of postgresql right now, since i love the new copy command
you can include queries in.

here's a sample query that takes a while to run... just a simple report that shows gender by area of the site.

select A.gender as gender, B.area as area, sum(C.imps) as imps, sum(C.clicks) as clicks, count(distinct(C.uu_id)) as
users
from uus as A, areas as B, daily_area_fact as C
where A.uu_id = C.uu_id
and B.area_id = C.area_id
group by gender,area;

so
by just having one day of data, with 3,168,049 rows in the user
dimension table (uus), 17,213,420 in the daily_area_fact table that
joins all the dimension tables, takes about 15 minutes.  if i had 30-90
days in this fact table, who knows how long this would take... i know
doing a distinct on uu_id is very expensive, so that's the main problem
here i guess and would want to know if anyone else is doing it this way
or better.

Total query runtime: 878595 ms.
Data retrieval runtime: 1361 ms.
163 rows retrieved.

here's the explain:

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=3846163.34..4104788.14 rows=24200 width=33)
   ->  Sort  (cost=3846163.34..3889196.89 rows=17213420 width=33)
         Sort Key: a.gender, b.area
         ->  Hash Join  (cost=132111.19..1306685.84 rows=17213420 width=33)
               Hash Cond: (c.uu_id = a.uu_id)
               ->  Hash Join  (cost=86.58..701292.17 rows=17213420 width=25)
                     Hash Cond: (c.area_id = b.area_id)
                     ->  Seq Scan on daily_area_fact c  (cost=0.00..356937.20 rows=17213420 width=16)
                     ->  Hash  (cost=78.26..78.26 rows=3326 width=17)
                           ->  Seq Scan on areas b  (cost=0.00..78.26 rows=3326 width=17)
               ->  Hash  (cost=124104.49..124104.49 rows=3168049 width=12)
                     ->  Seq Scan on uus a  (cost=0.00..124104.49 rows=3168049 width=12)
(12 rows)

So here is a snapshot of my user dimension table for each user (uus table)

dw_big=# \d uus
                                     Table "public.uus"
    Column    |          Type          |                      Modifiers
--------------+------------------------+-----------------------------------------------------
 uu_id        | integer                | not null default nextval('uus_uu_id_seq'::regclass)
 uu           | character(50)          | not null
 imps         | integer                |
 clicks       | integer                |
 gca_clicks   | integer                |
 convs        | integer                |
 imp_rev      | numeric(10,6)          |
 click_rev    | numeric(10,6)          |
 total_rev    | numeric(10,6)          |
 geo_id       | integer                |
 reg          | integer                |
 usernum      | integer                |
 gender       | character(1)           |
 age          | integer                |
 age_grp      | character(10)          |
 reg_date     | date                   |
 vis          | integer                |
 first_date   | date                   |
 first_hour   | integer                |
 last_date    | date                   |
 utm_campaign | character varying(100) |
 utm_medium   | character varying(100) |
 utm_source   | character varying(100) |
 utm_content  | character varying(255) |
 utm_keyword  | character varying(255) |
Indexes:
    "uus_pkey" PRIMARY KEY, btree (uu)
    "uus_geo_id_index" btree (geo_id)
    "uus_usernum_id_index" btree (usernum)
    "uus_uu_id_index" btree (uu_id)

here is my "area" of the site dimension table (areas table)

dw_big=# \d areas;
                                    Table "public.areas"
 Column  |          Type          |                        Modifiers
---------+------------------------+---------------------------------------------------------
 area_id | integer                | not null default nextval('areas_area_id_seq'::regclass)
 site    | character varying(100) | not null
 area    | character varying(100) | not null
 subarea | character varying(100) | not null
 size    | character varying(50)  | not null
 pos     | character varying(50)  | not null
Indexes:
    "areas_pkey" PRIMARY KEY, btree (site, subarea, area, size, pos)
    "areas_area_id_index" btree (area_id)

and
here is my huge fact table which joins all the dimension tables
together since they want to be able to query, cut, and slice anything
by anything at any time:

dw_big=# \d daily_area_fact;
 Table "public.daily_area_fact"
  Column   |  Type   | Modifiers
-----------+---------+-----------
 time_id   | integer | not null
 ccid      | integer | not null
 area_id   | integer | not null
 uu_id     | integer | not null
 geo_id    | integer | not null
 imps      | integer |
 clicks    | integer |
 imp_rev   | numeric |
 click_rev | numeric |
 total_rev | numeric |
Indexes:
    "daily_area_fact_pkey" PRIMARY KEY, btree (time_id, ccid, area_id, uu_id, geo_id)



____________________________________
Mark Jensen





pgsql-general by date:

Previous
From: novnov
Date:
Subject: Re: Editing contrib modules which are loaded by default?
Next
From: Jeff Davis
Date:
Subject: Re: How to implement backup protocol