Thread: Need help building this query

Need help building this query

From
Rihad
Date:
Hi, folks. I currently need to join two tables that lack primary keys, 
and columns used to distinguish each record can be duplicated. I need to 
build statistics over the data in those tables. Consider this:


TableA:
row 1: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
row 2: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
row 3: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date

TableB:
row 1: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date


Columns foo + bar + baz are used to distinguish a performed "operation":
TableA.date_of_op isn't, because it can lag behind TableB.

Not all different "operations" are in table B.
Table B is just there so we know which "operations" are complete, so to 
speak (happening under external means and not under any of my control).

Now, for each operation (foo+bar+baz) in table A, only *one* row should 
be matched in table B, because it only has one matching row there.
The other two in TableA should be considered unmatched.

Now the query should be able to get count(*) and sum(amount) every day 
for that day, considering that matched and unmatched operations should 
be counted separately. The report would look something like this:

TableA.date_of_op  TableB.date_of_op
2012-06-21            [empty]                  [count(*) and sum(amount) 
of all data in TableA for this day unmatched in TableB]
2012-06-21            2012-06-20            [count(*) and sum(amount) of 
all data in TableA matched in TableB for the 20-th]
2012-06-21            2012-06-19            [count(*) and sum(amount) of 
all data in TableA matched in TableB for the 19-th]


Can this awkward thing be done in pure SQL, or I'd be better off using 
programming for this?

Thanks, I hope I could explain this.


Re: Need help building this query

From
"David Johnston"
Date:
> -----Original Message-----
> From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-
> owner@postgresql.org] On Behalf Of Rihad
> Sent: Thursday, June 21, 2012 1:49 PM
> To: pgsql-sql@postgresql.org
> Subject: [SQL] Need help building this query
> 
> Hi, folks. I currently need to join two tables that lack primary keys, and
> columns used to distinguish each record can be duplicated. I need to build
> statistics over the data in those tables. Consider this:
> 
> 
> TableA:
> row 1: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date row
2:
> foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date row 3: foo:
123,
> bar: 456, baz: 789, amount: 10.99, date_of_op: date
> 
> TableB:
> row 1: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
> 
> 
> Columns foo + bar + baz are used to distinguish a performed "operation":
> TableA.date_of_op isn't, because it can lag behind TableB.
> 
> Not all different "operations" are in table B.
> Table B is just there so we know which "operations" are complete, so to
> speak (happening under external means and not under any of my control).
> 
> Now, for each operation (foo+bar+baz) in table A, only *one* row should be
> matched in table B, because it only has one matching row there.
> The other two in TableA should be considered unmatched.
> 
> Now the query should be able to get count(*) and sum(amount) every day
> for that day, considering that matched and unmatched operations should be
> counted separately. The report would look something like this:
> 
> TableA.date_of_op  TableB.date_of_op
> 2012-06-21            [empty]                  [count(*) and sum(amount)
> of all data in TableA for this day unmatched in TableB]
> 2012-06-21            2012-06-20            [count(*) and sum(amount) of
> all data in TableA matched in TableB for the 20-th]
> 2012-06-21            2012-06-19            [count(*) and sum(amount) of
> all data in TableA matched in TableB for the 19-th]
> 
> 
> Can this awkward thing be done in pure SQL, or I'd be better off using
> programming for this?
> 
> Thanks, I hope I could explain this.
> 


You seem to be describing a straight reconciliation between two tables.  My
current means of doing this are programmatically but for the simple case
pure SQL should be doable.  The main thing is that you have to distinguish
between "duplicate" records first and then match them up:

TableA Keys:

AA
AA
AA
AB
AB
AC

TableB Keys:
AA
AA
AB

First you use "ROW_NUMBER() OVER (PARTITION BY key)" to assign an integer
"sub-id" to every set of possible keys in both tables:

TableA-Sub:
AA-1
AA-2
AA-3
AB-1
AB-2
AC-1

TableB-Sub:
AA-1
AA-2
AB-1

Now, with these newly constructed key+sub-key values in place, you can
perform a simple LEFT (or possibly FULL) JOIN between tables A & B.

This makes no allowances for any of kind of desired date restriction on the
matching nor does it consider the eventual report that you wish to generate.
What this gives you is a listing of ALL rows in both tables with matched
records joined together into a single (NULL-less) row while unmatched
records will have one of the two resultant columns NULLed

SELECT tableA.subid_a, tableB.subid_b
FROM tableA FULL OUTER JOIN tableB ON (tableA.subid_a = tableB.subid_b)

Requires at least version 8.4

David J.




Re: Need help building this query

From
"Oliver d'Azevedo Christina"
Date:
For matching triples (foo, bar, baz) the date in table B shouldnt always be 
after any date in table A, as table B contains complete operations?



Best,
Oliver

----- Original Message ----- 
From: "Rihad" <rihad@stream.az>
To: <pgsql-sql@postgresql.org>
Sent: Thursday, June 21, 2012 6:48 PM
Subject: [SQL] Need help building this query


> Hi, folks. I currently need to join two tables that lack primary keys, and 
> columns used to distinguish each record can be duplicated. I need to build 
> statistics over the data in those tables. Consider this:
>
>
> TableA:
> row 1: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
> row 2: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
> row 3: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
>
> TableB:
> row 1: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
>
>
> Columns foo + bar + baz are used to distinguish a performed "operation":
> TableA.date_of_op isn't, because it can lag behind TableB.
>
> Not all different "operations" are in table B.
> Table B is just there so we know which "operations" are complete, so to 
> speak (happening under external means and not under any of my control).
>
> Now, for each operation (foo+bar+baz) in table A, only *one* row should be 
> matched in table B, because it only has one matching row there.
> The other two in TableA should be considered unmatched.
>
> Now the query should be able to get count(*) and sum(amount) every day for 
> that day, considering that matched and unmatched operations should be 
> counted separately. The report would look something like this:
>
> TableA.date_of_op  TableB.date_of_op
> 2012-06-21            [empty]                  [count(*) and sum(amount) 
> of all data in TableA for this day unmatched in TableB]
> 2012-06-21            2012-06-20            [count(*) and sum(amount) of 
> all data in TableA matched in TableB for the 20-th]
> 2012-06-21            2012-06-19            [count(*) and sum(amount) of 
> all data in TableA matched in TableB for the 19-th]
>
>
> Can this awkward thing be done in pure SQL, or I'd be better off using 
> programming for this?
>
> Thanks, I hope I could explain this.
>
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
> 



Re: Need help building this query

From
"Oliver d'Azevedo Christina"
Date:
If I understand correctly,

You have table A with a record for each operation performed, perhaps 
duplicated.
And you have table B with one record for each operation completed,
Is my understanding correct?

I fail to understand what is the report you are trying to obtain, exactly.

For example,
This line
>> TableA.date_of_op  TableB.date_of_op
>> 2012-06-21            2012-06-20            [count(*) and sum(amount) of 
>> all data in TableA matched in TableB for the 20-th]

Does it mean that you have operations records in table A from day 21 that 
match records from table B for day 20?
The operations shouldn't have already been completed and, thus, without any 
record on table A ...?

Could you please kindly elucidate me?

Thank you

Best,
Oliver

----- Original Message ----- 
From: "Oliver d'Azevedo Christina" <oliveiros.cristina@asperger-talents.com>
To: "Rihad" <rihad@stream.az>; <pgsql-sql@postgresql.org>
Sent: Thursday, June 21, 2012 7:50 PM
Subject: Re: [SQL] Need help building this query


> For matching triples (foo, bar, baz) the date in table B shouldnt always 
> be after any date in table A, as table B contains complete operations?
>
>
>
> Best,
> Oliver
>
> ----- Original Message ----- 
> From: "Rihad" <rihad@stream.az>
> To: <pgsql-sql@postgresql.org>
> Sent: Thursday, June 21, 2012 6:48 PM
> Subject: [SQL] Need help building this query
>
>
>> Hi, folks. I currently need to join two tables that lack primary keys, 
>> and columns used to distinguish each record can be duplicated. I need to 
>> build statistics over the data in those tables. Consider this:
>>
>>
>> TableA:
>> row 1: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
>> row 2: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
>> row 3: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
>>
>> TableB:
>> row 1: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
>>
>>
>> Columns foo + bar + baz are used to distinguish a performed "operation":
>> TableA.date_of_op isn't, because it can lag behind TableB.
>>
>> Not all different "operations" are in table B.
>> Table B is just there so we know which "operations" are complete, so to 
>> speak (happening under external means and not under any of my control).
>>
>> Now, for each operation (foo+bar+baz) in table A, only *one* row should 
>> be matched in table B, because it only has one matching row there.
>> The other two in TableA should be considered unmatched.
>>
>> Now the query should be able to get count(*) and sum(amount) every day 
>> for that day, considering that matched and unmatched operations should be 
>> counted separately. The report would look something like this:
>>
>> TableA.date_of_op  TableB.date_of_op
>> 2012-06-21            [empty]                  [count(*) and sum(amount) 
>> of all data in TableA for this day unmatched in TableB]
>> 2012-06-21            2012-06-20            [count(*) and sum(amount) of 
>> all data in TableA matched in TableB for the 20-th]
>> 2012-06-21            2012-06-19            [count(*) and sum(amount) of 
>> all data in TableA matched in TableB for the 19-th]
>>
>>
>> Can this awkward thing be done in pure SQL, or I'd be better off using 
>> programming for this?
>>
>> Thanks, I hope I could explain this.
>>
>> -- 
>> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-sql
>>
>
>
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
> 



Re: Need help building this query

From
rihad@stream.az
Date:
> For matching triples (foo, bar, baz) the date in table B shouldnt always
> be
> after any date in table A, as table B contains complete operations?
>
Operations in Table B can usually be obtained a day after, when Table A
gets the updates. Table B contains the physical date of operation. The
date in Table A are today's dates because this is how the reports feeding
them is generated. Those are only dates, no time parts. Date_of_op is
actually a settlement date (banking term) and TableA is more recent than
TableB for reasons irrelevant to the problem (and which suck).



>
> Best,
> Oliver
>
> ----- Original Message -----
> From: "Rihad" <rihad@stream.az>
> To: <pgsql-sql@postgresql.org>
> Sent: Thursday, June 21, 2012 6:48 PM
> Subject: [SQL] Need help building this query
>
>
>> Hi, folks. I currently need to join two tables that lack primary keys,
>> and
>> columns used to distinguish each record can be duplicated. I need to
>> build
>> statistics over the data in those tables. Consider this:
>>
>>
>> TableA:
>> row 1: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
>> row 2: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
>> row 3: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
>>
>> TableB:
>> row 1: foo: 123, bar: 456, baz: 789, amount: 10.99, date_of_op: date
>>
>>
>> Columns foo + bar + baz are used to distinguish a performed "operation":
>> TableA.date_of_op isn't, because it can lag behind TableB.
>>
>> Not all different "operations" are in table B.
>> Table B is just there so we know which "operations" are complete, so to
>> speak (happening under external means and not under any of my control).
>>
>> Now, for each operation (foo+bar+baz) in table A, only *one* row should
>> be
>> matched in table B, because it only has one matching row there.
>> The other two in TableA should be considered unmatched.
>>
>> Now the query should be able to get count(*) and sum(amount) every day
>> for
>> that day, considering that matched and unmatched operations should be
>> counted separately. The report would look something like this:
>>
>> TableA.date_of_op  TableB.date_of_op
>> 2012-06-21            [empty]                  [count(*) and sum(amount)
>> of all data in TableA for this day unmatched in TableB]
>> 2012-06-21            2012-06-20            [count(*) and sum(amount) of
>> all data in TableA matched in TableB for the 20-th]
>> 2012-06-21            2012-06-19            [count(*) and sum(amount) of
>> all data in TableA matched in TableB for the 19-th]
>>
>>
>> Can this awkward thing be done in pure SQL, or I'd be better off using
>> programming for this?
>>
>> Thanks, I hope I could explain this.
>>
>> --
>> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-sql
>>
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>




Re: Need help building this query

From
rihad@stream.az
Date:
> You seem to be describing a straight reconciliation between two tables.
> My
> current means of doing this are programmatically but for the simple case
> pure SQL should be doable.  The main thing is that you have to distinguish
> between "duplicate" records first and then match them up:
>
> TableA Keys:
>
> AA
> AA
> AA
> AB
> AB
> AC
>
> TableB Keys:
> AA
> AA
> AB
>
> First you use "ROW_NUMBER() OVER (PARTITION BY key)" to assign an integer
> "sub-id" to every set of possible keys in both tables:
>
> TableA-Sub:
> AA-1
> AA-2
> AA-3
> AB-1
> AB-2
> AC-1
>
> TableB-Sub:
> AA-1
> AA-2
> AB-1
>
> Now, with these newly constructed key+sub-key values in place, you can
> perform a simple LEFT (or possibly FULL) JOIN between tables A & B.
>
> This makes no allowances for any of kind of desired date restriction on
> the
> matching nor does it consider the eventual report that you wish to
> generate.
> What this gives you is a listing of ALL rows in both tables with matched
> records joined together into a single (NULL-less) row while unmatched
> records will have one of the two resultant columns NULLed
>
> SELECT tableA.subid_a, tableB.subid_b
> FROM tableA FULL OUTER JOIN tableB ON (tableA.subid_a = tableB.subid_b)
>
> Requires at least version 8.4
>

This looks really promising, thanks. I'll read more about it (WINDOWING in
general) and see how it fits our model. This might allow me to distinguish
"operation" rows that exist in both TableA and TableB, and whether TableA
has more of them, in which case they would be marked unmatched. Using
sub-selects this seems doable.



Re: Need help building this query

From
rihad@stream.az
Date:
> You seem to be describing a straight reconciliation between two tables.
> My
> current means of doing this are programmatically but for the simple case
> pure SQL should be doable.  The main thing is that you have to distinguish
> between "duplicate" records first and then match them up:
>
> TableA Keys:
>
> AA
> AA
> AA
> AB
> AB
> AC
>
> TableB Keys:
> AA
> AA
> AB
>
> First you use "ROW_NUMBER() OVER (PARTITION BY key)" to assign an integer
> "sub-id" to every set of possible keys in both tables:
>
> TableA-Sub:
> AA-1
> AA-2
> AA-3
> AB-1
> AB-2
> AC-1
>
> TableB-Sub:
> AA-1
> AA-2
> AB-1
>
> Now, with these newly constructed key+sub-key values in place, you can
> perform a simple LEFT (or possibly FULL) JOIN between tables A & B.
>
> This makes no allowances for any of kind of desired date restriction on
> the
> matching nor does it consider the eventual report that you wish to
> generate.
> What this gives you is a listing of ALL rows in both tables with matched
> records joined together into a single (NULL-less) row while unmatched
> records will have one of the two resultant columns NULLed
>
> SELECT tableA.subid_a, tableB.subid_b
> FROM tableA FULL OUTER JOIN tableB ON (tableA.subid_a = tableB.subid_b)
>
> Requires at least version 8.4

This looks really promising, thanks. I'll read more about it (WINDOWING in
general) and see how it fits our model. This might allow me to distinguish
"operation" rows that exist in both TableA and TableB, and whether TableA
has more of them, in which case they would be marked unmatched. Using
sub-selects this seems doable.