Thread: [BUGS] BUG #14648: counts for queries using array unnesting is incorrect

[BUGS] BUG #14648: counts for queries using array unnesting is incorrect

From
sebastian.calbaza@hgdata.com
Date:
The following bug has been logged on the website:

Bug reference:      14648
Logged by:          Sebastian Calbaza
Email address:      sebastian.calbaza@hgdata.com
PostgreSQL version: 9.6.2
Operating system:   Ubuntu 14.04 AWS
Description:

Below is a query that we are using to calculate some counts:   * first version of the query lacks ```unnest(ids) as
id,```,but second 
one has it  * ```companies``` count value is incorrect for the second one, first
query has the correct value

```
mydb=#           select  count(company) as available,count(distinct
matchedCompany) as matchedCompanies,count(distinct company) as companies
from            (                SELECT  F.urlx as company,
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url as matchedCompany                from                 (select unnest(urls) as
urlxfrom 
hg_data_discovery_2017_04_10.GroupedFirmographics                    where              (                    (TRUE
              AND TRUE                     AND revenueRangeMin >= 1                     AND employeesRangeMin >= 1
              AND revenueRangeMax <= 1783792664                     AND employeesRangeMax <= 4999000
)                   OR                     FALSE                  )       ) as I                  inner join (
         select unnest(urls) as urlx from 
hg_data_discovery_2017_04_10.GroupedInstallsWithoutDateSignalScore where
productId IN (562) and signalScoreId IN (1,2,3)                    ) as F using(urlx)                left outer join
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_aggon 
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url=F.urlx where TRUE and TRUE            ) as P                ;available |
matchedcompanies| companies 
 
-----------+------------------+-----------   496493 |            28503 |    495799
(1 row)

Time: 7974.053 ms
mydb=#              select count(id) as people, count(company) as
available,count(distinct matchedCompany) as matchedCompanies,count(distinct
company) as companies from            (                SELECT unnest(ids) as id, F.urlx as company,
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url as matchedCompany                from                 (select unnest(urls) as
urlxfrom 
hg_data_discovery_2017_04_10.GroupedFirmographics                    where              (                    (TRUE
              AND TRUE                     AND revenueRangeMin >= 1                     AND employeesRangeMin >= 1
              AND revenueRangeMax <= 1783792664                     AND employeesRangeMax <= 4999000
)                   OR                     FALSE                  )       ) as I                  inner join (
         select unnest(urls) as urlx from 
hg_data_discovery_2017_04_10.GroupedInstallsWithoutDateSignalScore where
productId IN (562) and signalScoreId IN (1,2,3)                    ) as F using(urlx)                left outer join
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_aggon 
CB_RDPPKLCPQDVACOMSTEFANHGDATACOM_agg.url=F.urlx where TRUE and TRUE            ) as P                ;people |
available| matchedcompanies | companies 
 
--------+-----------+------------------+-----------689905 |    689905 |            28503 |     28503

```




--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

sebastian.calbaza@hgdata.com writes:
> Below is a query that we are using to calculate some counts: 
>    * first version of the query lacks ```unnest(ids) as id,```, but second
> one has it
>    * ```companies``` count value is incorrect for the second one, first
> query has the correct value

My first suggestion would be to see if updating to 9.6.3 fixes it.
If not, please try to create a self-contained test case.  These
queries are unreadable, and without the underlying data, nobody
else can even tell whether the answers are wrong or not.

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14648: counts for queries using array unnesting is incorrect

From
Sebastion Calbaza
Date:
Thanks for replying.... I know it's complicated to follow it (even if formatted ) , but probably I did not emphasize enough that by using unnest(ids) in the second query, the distinct count for company is not computed correctly anymore.
Focusing on the used select clauses  is important, they are just distinct counts, it would have been logical to stay the same for both queries, ignoring the unnest(ids)(the data set is in the milions of rows so you need to take my word that the first query return the correct values :) ).

Anyway, will try to test it with latest version, also will try to provide some sql test case for this, but again... There surely is an issue with the unnesting arrays (at least in the relase that I'm using)

Seb

On Fri, May 12, 2017 at 5:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
sebastian.calbaza@hgdata.com writes:
> Below is a query that we are using to calculate some counts:
>    * first version of the query lacks ```unnest(ids) as id,```, but second
> one has it
>    * ```companies``` count value is incorrect for the second one, first
> query has the correct value

My first suggestion would be to see if updating to 9.6.3 fixes it.
If not, please try to create a self-contained test case.  These
queries are unreadable, and without the underlying data, nobody
else can even tell whether the answers are wrong or not.

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

                        regards, tom lane

Re: [BUGS] BUG #14648: counts for queries using array unnesting is incorrect

From
"David G. Johnston"
Date:
On Fri, May 12, 2017 at 7:55 AM, Sebastion Calbaza <sebastian.calbaza@hgdata.com> wrote:
Thanks for replying.... I know it's complicated to follow it (even if formatted ) , but probably I did not emphasize enough that by using unnest(ids) in the second query, the distinct count for company is not computed correctly anymore.
Focusing on the used select clauses  is important, they are just distinct counts, it would have been logical to stay the same for both queries, ignoring the unnest(ids)(the data set is in the milions of rows so you need to take my word that the first query return the correct values :) ).

​In the first query the number of distinct companies ​is greater than the number of matched companies.  In the second the numbers are equal.  If the only difference between the two is the "unnest(ids)" then its presence is causing every unmatched company to be discarded from the result.  Since "SELECT unnest(null::text[])" is the empty set it would remove the corresponding row from your subquery output.  You might try writing the following which will convert the empty set to a NULL and thus not discard records.

(SELECT unnest(ids)) AS id, F.urlx as company ...

David J.

Re: [BUGS] BUG #14648: counts for queries using array unnesting is incorrect

From
Sebastion Calbaza
Date:
You are correct, this is how it behaves.
For my dev mind (probably for others ) this was pretty non intuitive. In the end I used array_length and sum to compute the count.

Still,  is this the intended behaviour (I guess so from a mathematical point of view ?? )? If it is I can argue that is not too intuitive and adding a doc note near the unnest function description would be good.

Thanks a lot,
   Sebi C.

On Fri, May 12, 2017 at 6:46 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Fri, May 12, 2017 at 7:55 AM, Sebastion Calbaza <sebastian.calbaza@hgdata.com> wrote:
Thanks for replying.... I know it's complicated to follow it (even if formatted ) , but probably I did not emphasize enough that by using unnest(ids) in the second query, the distinct count for company is not computed correctly anymore.
Focusing on the used select clauses  is important, they are just distinct counts, it would have been logical to stay the same for both queries, ignoring the unnest(ids)(the data set is in the milions of rows so you need to take my word that the first query return the correct values :) ).

​In the first query the number of distinct companies ​is greater than the number of matched companies.  In the second the numbers are equal.  If the only difference between the two is the "unnest(ids)" then its presence is causing every unmatched company to be discarded from the result.  Since "SELECT unnest(null::text[])" is the empty set it would remove the corresponding row from your subquery output.  You might try writing the following which will convert the empty set to a NULL and thus not discard records.

(SELECT unnest(ids)) AS id, F.urlx as company ...

David J.


Re: [BUGS] BUG #14648: counts for queries using array unnesting is incorrect

From
"David G. Johnston"
Date:
On Sun, May 14, 2017 at 11:29 PM, Sebastion Calbaza <sebastian.calbaza@hgdata.com> wrote:
You are correct, this is how it behaves.
For my dev mind (probably for others ) this was pretty non intuitive. In the end I used array_length and sum to compute the count.

Still,  is this the intended behaviour (I guess so from a mathematical point of view ?? )? If it is I can argue that is not too intuitive and adding a doc note near the unnest function description would be good.


Yes, its intentional, and it isn't specific to just the unnest function so documenting it just there doesn't seem correct.  I don't know where it is documented but I suspect that even just reading the documentation for this would be of minimal help - I think its likely best learned after experiencing the aforementioned problem.

David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> Yes, its intentional, and it isn't specific to just the unnest function so
> documenting it just there doesn't seem correct.  I don't know where it is
> documented but I suspect that even just reading the documentation for this
> would be of minimal help - I think its likely best learned after
> experiencing the aforementioned problem.

https://www.postgresql.org/docs/current/static/xfunc-sql.html#XFUNC-SQL-FUNCTIONS-RETURNING-SET

The last example in section 36.4.8 covers this specifically.

I've felt more than once that having these sorts of details about function
semantics underneath the "extending SQL" chapter isn't right, but I'm not
sure what organization would be better.  A lot of the examples would be
tough to do without use of custom-made functions.
        regards, tom lane


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs