Thread: counting query
I have a table definition such as:

CREATE TABLE attendance
(
  attendanceid serial primary key,
  entered date DEFAULT current_date NOT NULL,
  absent boolean,
  "authorization" text default 'N',  -- quoted: AUTHORIZATION is a reserved word
  timeperiod char(2) check(timeperiod in ('AM','PM')),
  days varchar(10),
  studentid int,
  unique(entered,timeperiod,studentid)
)

This is used to record school attendance data. I am now trying to write a query to identify trends in absences by counting the days column and returning any student that has repeated absences on certain days. I am struggling to return anything that does not need further manipulation in Python before being useful. Does anyone have any ideas?
>-----Original Message-----
>From: pgsql-general-owner@postgresql.org
>[mailto:pgsql-general-owner@postgresql.org] On Behalf Of garry saddington
>Sent: Sunday, January 28, 2007 14:06
>To: pgsql-general@postgresql.org
>Subject: [GENERAL] counting query
>
>I have a table definition such as:
>
>CREATE TABLE attendance
>(
>  attendanceid serial primary key,

Why do you have this? You already have (entered,timeperiod,studentid), which you can use since it must be unique too. Try to avoid surrogate keys as much as possible (avoiding them really improves performance and ease of writing for complex queries!).

>  entered date DEFAULT current_date NOT NULL,
>  absent boolean,
>  "authorization" text default 'N',
>  timeperiod char(2) check(timeperiod in ('AM','PM')),
>  days varchar(10),
>  studentid int,
>  unique(entered,timeperiod,studentid)
>)

I'm guessing at the meaning a bit; it is not too self-explanatory.

>Which is used to record school attendance data. I am now
>trying to write a query to identify trends in absences by
>counting the days column and returning any student that has
>repeated absences on certain days. I am struggling to return
>anything that does not need further manipulation in Python
>before being useful.
>Does anyone have any ideas?

Yes. Before starting, you must have a well-defined idea of what you want to know. What should the result look like?

In most situations, start with simple things and eventually combine them into something more complex, but always know what you are querying. You should have some idea of what results you will be getting. Of course, this is a process of discovery rather than one of following set rules. Some background in probability and statistics really helps. Personally, I like visualization quite a lot to help me with this.

Personally, I've found nothing that beats Excel for doing data analysis. Learn to use pivot tables and pivot charts; they are extremely powerful. However, it can be a bit tricky to transform the input into something the tool can use. A good starting point is to split the dates into separate year, month, day, and week values (a sketch follows this message).

Some ideas that might work to get started:
* A graph of days vs. occurrences (count).
* A graph of total sick days per student vs. occurrences.
* A graph of the occurrences per month/day/week.

Next, try filtering the data to form sequential periods, and more interesting things. I believe all of this can be done with plain SQL; you don't need Python or PL/pgSQL or other languages.

- Joris Dobbelsteen
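A minimal sketch of that date split, assuming the attendance table from the original post (the column aliases and WHERE clause are illustrative only):

SELECT studentid,
       timeperiod,
       EXTRACT(YEAR  FROM entered) AS year,
       EXTRACT(MONTH FROM entered) AS month,
       EXTRACT(WEEK  FROM entered) AS week,
       EXTRACT(DOW   FROM entered) AS weekday  -- 0 = Sunday .. 6 = Saturday
FROM attendance
WHERE absent = true;

Exported as-is, each row then carries ready-made grouping columns for a pivot table.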
Joris Dobbelsteen wrote:
>> CREATE TABLE attendance
>> (
>>   attendanceid serial primary key,
>
> Why do you have this? You already have (entered,timeperiod,studentid),
> which you can use since it must be unique too. Try to avoid surrogate
> keys as much as possible (avoiding them really improves performance
> and ease of writing for complex queries!).

Correct me if I am wrong, but wouldn't a simple number be a lot easier to look up than a composite key?
>-----Original Message-----
>From: pgsql-general-owner@postgresql.org
>[mailto:pgsql-general-owner@postgresql.org] On Behalf Of John Meyer
>Sent: Sunday, January 28, 2007 15:36
>To: pgsql-general@postgresql.org
>Subject: Re: [GENERAL] counting query
>
>Correct me if I am wrong, but wouldn't a simple number be a
>lot easier to look up than a composite key?

No, it is not. Better: it might be, until you go just a bit larger.

I learned this when I built a production database (one that is still used in production and still performs excellently; at the time I was only 15 to 17 years old, which was 7 to 10 years ago, and I've learned a lot since then). The trouble is, the database consists of well over 40 tables (nearly all with surrogate keys), and joining a bit of data from the far ends of the database requires you to join around 10 to 15 tables. If you are doing something complex, you will get lost at some point and really need graphical tools just to grasp what you are trying to query. So a somewhat complex query easily results in an excessive number of tables that must be queried.

Why did I do this? At the time I was not aware that you could build a primary key consisting of multiple columns (a sketch follows this message). And if you look around the Internet, you see (nearly) all databases of free 'web applications' making excessive use of surrogate keys. This resulted in a lot of surrogate keys that could easily have been avoided and were not a required artifact of the inconsistent data I had to import.

What would have been better without surrogate keys all over:
* Easier to write complex queries, with many fewer tables to be queried.
* Much faster query performance, as fewer tables need to be referenced.
* Better integrity enforcement with simple foreign key constraints.

In fact, in (I'd guess) 50% of the queries I could have avoided at least 2 table joins!

Think big. That is why...

- Joris Dobbelsteen
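For reference, a minimal sketch of the feature mentioned above (an invented table, not his schema): a primary key spanning multiple columns is declared with a table-level constraint.

CREATE TABLE attendance_nk (
    entered    date,
    timeperiod char(2),
    studentid  int,
    absent     boolean,
    PRIMARY KEY (entered, timeperiod, studentid)  -- composite natural key
);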
"Joris Dobbelsteen" <Joris@familiedobbelsteen.nl> writes: > What would have been better without surrogate keys all-over: > * Easier to write complex queries with much fewer tables to be queried. > * Much faster query performance, as fewer tables need to be referenced. > * Better integrity enforcement with simple foreign key constraints. Not this debeta again. ;) Surrugate vs natural keys shouldn't make a difference in how many yables you have--they depends on the degree of normalization. Sounds like you denormalized your database and happened to eliminate surrogate keys at the same time. Using that to say "surrogate keys are bad" is kind of misleading. -Doug
On 01/28/07 08:36, John Meyer wrote:
> Joris Dobbelsteen wrote:
>>> CREATE TABLE attendance
>>> (
>>>   attendanceid serial primary key,
>> Why do you have this? You already have (entered,timeperiod,studentid),
>> which you can use since it must be unique too. Try to avoid surrogate
>> keys as much as possible (avoiding them really improves performance
>> and ease of writing for complex queries!).
>
> Correct me if I am wrong, but wouldn't a simple number be a lot easier
> to look up than a composite key?

This is the great synthetic-vs-natural key debate.

Sure, it's easier to write queries that join on a synthetic integer field. However, adding 3 extra fields to a few other tables is not onerous, and it adds useful information to those tables, since (entered,timeperiod,studentid) is what *really* makes a record unique.

Also, synthetic keys mean that you have to do more joins: if you want to know about "entered" together with table T_FOO, you'd have to join "attendance" to T_FOO (a sketch follows this message).
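A hypothetical sketch of that last point (T_FOO and its columns are invented, not from the thread): with the natural key propagated, the related table carries "entered" itself, and the composite foreign key targets the unique(entered,timeperiod,studentid) constraint on attendance.

CREATE TABLE t_foo (
    entered    date,
    timeperiod char(2),
    studentid  int,
    note       text,
    FOREIGN KEY (entered, timeperiod, studentid)
        REFERENCES attendance (entered, timeperiod, studentid)
);

-- A question about dates can now be answered from t_foo alone,
-- with no join back to attendance:
SELECT note FROM t_foo WHERE entered = DATE '2007-01-22';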
On 01/28/07 07:05, garry saddington wrote:
> I have a table definition such as:
>
> CREATE TABLE attendance
> (
>   attendanceid serial primary key,
>   entered date DEFAULT current_date NOT NULL,
>   absent boolean,
>   "authorization" text default 'N',
>   timeperiod char(2) check(timeperiod in ('AM','PM')),
>   days varchar(10),
>   studentid int,
>   unique(entered,timeperiod,studentid)
> )
>
> Which is used to record school attendance data. I am now trying to write
> a query to identify trends in absences by counting the days column and
> returning any student that has repeated absences on certain days. I am
> struggling to return anything that does not need further manipulation in
> Python before being useful.
> Does anyone have any ideas?

When you say "certain days", do you mean "days of the week"?

If so, create a view like:

CREATE VIEW V_DAY_ABSENCES AS
SELECT ENTERED,
       "authorization",
       TIMEPERIOD,
       DAYS,
       STUDENTID,
       EXTRACT(DOW FROM ENTERED) AS WEEKDAY
FROM ATTENDANCE
WHERE ABSENT = TRUE;

Then, this query should do what you want:

SELECT STUDENTID,
       TIMEPERIOD,
       WEEKDAY,
       COUNT(*)
FROM V_DAY_ABSENCES
GROUP BY STUDENTID,
         TIMEPERIOD,
         WEEKDAY
HAVING COUNT(*) > 3;
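For reference, a quick sketch of the weekday numbering the view relies on: PostgreSQL's EXTRACT(DOW FROM ...) returns 0 for Sunday through 6 for Saturday (the VALUES list below is purely illustrative).

-- Two known dates as a sanity check:
SELECT d, EXTRACT(DOW FROM d) AS weekday
FROM (VALUES (DATE '2007-01-28'),   -- a Sunday -> 0
             (DATE '2007-01-29'))   -- a Monday -> 1
     AS t(d);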
On 1/28/07, Ron Johnson <ron.l.johnson@cox.net> wrote:
> This is the great synthetic-vs-natural key debate.

Truly. But what the heck!
Surrogate keys are not evil, and they do have value. I see no value in proclaiming "surrogate keys are evil, do not use them".
Surrogate keys do have advantages:
- Disassociation of natural data from other referencing tables (which can also be "confusing").
  Imagine a social security number, driver's license number, or any other natural key. Now imagine that key value changes for a specific person, and you have used it as a key throughout your data structures. (And they do change; see the sketch after this list.)
- Reduced storage requirements (which yields better performance).
  It is cheaper to store a 50-byte field plus a 4-byte surrogate key once, and repeat only the 4-byte key a million times, than to store the 50-byte field a million times (see the sketch after this message):
      (surrogate key) 54 bytes + (4 bytes * 1 million) ≈ 4 MB
      vs.
      (natural key) 50 bytes * 1 million = 50 MB
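A hypothetical sketch of the first point (the person/paycheck tables and the SSN values are invented): when the natural key is the join key, a corrected value must propagate to every table that stores it.

CREATE TABLE person (
    ssn  char(11) PRIMARY KEY,  -- natural key, duplicated into referencing tables
    name text
);

CREATE TABLE paycheck (
    ssn    char(11) REFERENCES person (ssn) ON UPDATE CASCADE,
    amount numeric(10,2)
);

-- This UPDATE rewrites every matching paycheck row as well; with a
-- surrogate key the SSN would be stored, and changed, in one place:
UPDATE person SET ssn = '123-45-6789' WHERE ssn = '123-45-6780';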
Natural keys are not evil either, and they have their own advantages. But when you're modeling very large databases (around the 10 TB range), you absolutely have to consider every single decision, and a natural key (in my opinion) is not always a good choice, as a single natural key could result in another 100 GB of storage requirements.
There should be some thought when you are modeling, and these are some of the things to consider. I don't see a 10-table join being a major performance penalty, especially when 8 of the tables may be a few MB in size.
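A hypothetical sketch of the storage arithmetic above (the school/enrolment tables are invented): the wide natural value is stored once in a lookup table, and the million-row table repeats only the 4-byte key.

CREATE TABLE school (
    schoolid   serial PRIMARY KEY,          -- 4-byte surrogate key
    schoolname varchar(50) UNIQUE NOT NULL  -- ~50-byte natural value, stored once
);

CREATE TABLE enrolment (
    studentid int NOT NULL,
    schoolid  int NOT NULL REFERENCES school (schoolid)  -- 4 bytes per row
);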
--
Chad
http://www.postgresqlforums.com/
>-----Original Message-----
>From: Douglas McNaught [mailto:doug@mcnaught.org]
>Sent: Sunday, January 28, 2007 16:29
>To: Joris Dobbelsteen
>Cc: John Meyer; pgsql-general@postgresql.org
>Subject: Re: [GENERAL] counting query
>
>Not this debate again. ;)
>
>Surrogate vs. natural keys shouldn't make a difference in how
>many tables you have--that depends on the degree of
>normalization. It sounds like you denormalized your database
>and happened to eliminate surrogate keys at the same time.
>Using that to say "surrogate keys are bad" is kind of misleading.

I have perhaps formulated it rather extremely; such an extreme stance was not intended. I apologize if it leads to such a discussion. I'll try to be a bit more careful next time.

My point is only: be careful with surrogate keys, and try not to use them for everything. In my sole opinion, they are used too much. But then again, what is good and what is wrong will always be subjective.

So perhaps "What would have been better without surrogate keys all over" should have been "My database, where I extremely overdid it with surrogate keys".

Let's leave it at this.

- Joris
"Joris Dobbelsteen" <Joris@familiedobbelsteen.nl> writes: > So perhaps <quote>What would have been better without surrogate keys > all-over<quote> should have been "My database where I extremely overdid > it with surrogate keys". Fair enough. It's generally true that going to extremes with anything causes problems. :) -Doug
On Sun, 2007-01-28 at 09:57 -0600, Ron Johnson wrote:
> On 01/28/07 07:05, garry saddington wrote:
> > I have a table definition such as:
[snip]
> When you say "certain days", do you mean "days of the week"?
>
> If so, create a view like:
>
> CREATE VIEW V_DAY_ABSENCES AS
> SELECT ENTERED,
>        "authorization",
>        TIMEPERIOD,
>        DAYS,
>        STUDENTID,
>        EXTRACT(DOW FROM ENTERED) AS WEEKDAY
> FROM ATTENDANCE
> WHERE ABSENT = TRUE;
>
> Then, this query should do what you want:
>
> SELECT STUDENTID,
>        TIMEPERIOD,
>        WEEKDAY,
>        COUNT(*)
> FROM V_DAY_ABSENCES
> GROUP BY STUDENTID,
>          TIMEPERIOD,
>          WEEKDAY
> HAVING COUNT(*) > 3;

Thank you, this works great. But I have another problem: is it possible to identify absences in consecutive weeks on the same day? E.g., if a pupil has a pattern of having every Monday AM off school, how could that be identified?

Regards
Garry
On 01/28/07 15:18, garry saddington wrote:
> On Sun, 2007-01-28 at 09:57 -0600, Ron Johnson wrote:
>> On 01/28/07 07:05, garry saddington wrote:
[snip]
> Thank you, this works great. But I have another problem: is it
> possible to identify absences in consecutive weeks on the same day?
> E.g., if a pupil has a pattern of having every Monday AM off school,
> how could that be identified?

I'd use a T_CALENDAR table, modified for your purposes. (It's a "static" table that we create on every database; we populate it with 22 years of dates.) You'll have to write a small procedure to populate it (a sketch follows this message).

CREATE TABLE T_CALENDAR (
    DATE_ANSI     DATE,
    YEAR_NUM      SMALLINT,
    MONTH_NUM     SMALLINT,
    DAY_OF_MONTH  SMALLINT,
    DAY_OF_WEEK   SMALLINT,
    JULIAN_DAY    SMALLINT,
    IS_SCHOOL_DAY BOOL,
    SCHOOL_YEAR   SMALLINT,  -- "2006" for the 2006/07 school year
    SCHOOL_MONTH  SMALLINT   -- 1 for August, 2 for September, etc.
);

Then join T_CALENDAR to ATTENDANCE:

WHERE DATE_ANSI = ENTERED
  AND DAY_OF_WEEK = 1
  AND IS_SCHOOL_DAY = TRUE
  AND SCHOOL_YEAR = 2006;

Make that join into a view and then, as Joris suggested, connect it to a spreadsheet.
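A minimal sketch of that population step (my assumption, not from the thread): PostgreSQL's generate_series can fill the date-derived columns, while IS_SCHOOL_DAY, SCHOOL_YEAR, and SCHOOL_MONTH would still need local school-calendar rules.

-- Fill 22 years of dates; the school-specific columns are left NULL
-- for a later UPDATE that encodes the local calendar.
INSERT INTO t_calendar (date_ansi, year_num, month_num,
                        day_of_month, day_of_week, julian_day)
SELECT d::date,
       EXTRACT(YEAR  FROM d)::smallint,
       EXTRACT(MONTH FROM d)::smallint,
       EXTRACT(DAY   FROM d)::smallint,
       EXTRACT(DOW   FROM d)::smallint,  -- 0 = Sunday .. 6 = Saturday
       EXTRACT(DOY   FROM d)::smallint   -- day of year
FROM generate_series(DATE '2000-01-01',
                     DATE '2021-12-31',
                     INTERVAL '1 day') AS g(d);

And, assuming the V_DAY_ABSENCES view from earlier in the thread, a hedged sketch for the consecutive-weeks question itself: self-join the absences seven days apart, so each row pairs an absence with one by the same student, in the same timeperiod, one week later.

-- date + integer adds days in PostgreSQL, so entered + 7 is the
-- same weekday in the following week.
SELECT a.studentid, a.timeperiod, a.weekday,
       a.entered AS week_1, b.entered AS week_2
FROM v_day_absences a
JOIN v_day_absences b
  ON b.studentid  = a.studentid
 AND b.timeperiod = a.timeperiod
 AND b.entered    = a.entered + 7;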
Joris Dobbelsteen wrote:
> Personally, I've found nothing that beats Excel for doing data
> analysis. Learn to use pivot tables and pivot charts; they are
> extremely powerful.

Funny, there is an ongoing discussion about this on one of our internal mailing lists. Excel is perhaps okay for simple analysis, and drawing pretty pictures, but the statistics routines, necessary for any kind of detailed analysis, have historically had lots of problems. Here are a few explications of the many ways Excel seems to fall down:

http://www.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf
http://pages.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf

- John Burger