Re: Average New Users Per DOW - Mailing list pgsql-general

From Michael Nolan
Subject Re: Average New Users Per DOW
Date
Msg-id CAOzAquKw4ALjN_P_P_3M+3cRrBVwx6_rEbr6k3_7UezUjA11Pg@mail.gmail.com
Whole thread Raw
In response to Re: Average New Users Per DOW  ("David G. Johnston" <david.g.johnston@gmail.com>)
Responses Re: Average New Users Per DOW
List pgsql-general


On Mon, Jul 6, 2015 at 5:50 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Mon, Jul 6, 2015 at 6:16 PM, Michael Nolan <htfoot@gmail.com> wrote:
> But you can see it wont give correct results since (for example) Monday's
> with no new users will not be counted in the average as 0.

One way to handle this is to union your query with one that has a
generate_series (0,6) for the DOW column and nulls for the other
columns, then treat both that and your original query as a subquery
and do your averages, since nulls are not included in either count()
or average() aggregates:

select dow, count(*), avg(some_column) from (
select extract ('dow' from some_date) as dow, some_number from some_table
union select generate_series(0,6) as dow, null as some_number) as x
group by 1 order by 1

​I'm not seeing how this is at all useful.

As you said, the average function ignores the null introduced by the union so the final answer with and without the union is the same.

No matter how you work a "generate_series(0,6)" based query it will never be able to give a correct answer expect accidentally.  Each actual missing date contributes a ZERO to the numerator and a ONE to the denominator in the final division that constitutes the mean-average.  You must have those dates.  

In a series with four Mondays ( 4, 0, 0, 8 ) the average desired is 3, not 6 (or 4).  There is no way to make the denominator (number of Mondays) 4 instead of 3 by using generate_series(0,6).

David J.


 

Ah, you're right.  The problem is that avg() is going to treat missing data as missing (of course.)  It will either be necessary to add in the missing days as a zero value (but ONLY the missing days, requiring some kind of 'not exists' select, I suppose) or to 'roll your own' average function by adding in the missing days as I did with a union in my earlier post.

The real problem is the DOW is not the field where the missing data is, it is in the underlying date field. 

I created a test dataset.  It has 1 day missing in a two-week period from June 1st through June 14th (Sunday, June 7th).  Here's what the OP's SQL generates:

Day       New Users      
--- ----------------------
Sun     2.0000000000000000
Mon     4.5000000000000000
Tue     2.0000000000000000
Wed     4.5000000000000000
Thu 1.00000000000000000000
Fri     3.0000000000000000
Sat     3.0000000000000000

Here's the SQL to generate the missing day and do the average function by hand:

select "Day", "New Users" from (
select dow, "Day", sum(total) / count(distinct created) as "New Users"from
(select extract(dow from created) as dow,
to_char(created,'Dy') as "Day", created, created2, total from

(select created, created as created2, count(*) as total from users
group by 1, 2
union
(select generate_series('2015-06-01 00:00'::timestamp,
'2015-06-14'::timestamp,'1 day')::date, null, 0) ) as x) as y
group by 1, 2) as z
order by dow


Day       New Users       
--- ----------------------
Sun 1.00000000000000000000
Mon     4.5000000000000000
Tue     2.0000000000000000
Wed     4.5000000000000000
Thu 1.00000000000000000000
Fri     3.0000000000000000
Sat     3.0000000000000000

--
Mike Nolan
nolan@tssi.com

pgsql-general by date:

Previous
From: Adrian Klaver
Date:
Subject: Re: database-level lockdown
Next
From: "Edson F. Lidorio"
Date:
Subject: Re: Download PostgreSQL 9.5 Alpha