Thread: Query to return normalized floats

Query to return normalized floats

From

Kip Warner

Date:

28 February 2016, 06:15:14

Hey list,

I am new to PostgreSQL, and it's been a while since I had to do some
complicated SQL queries. I have a table like so...

CREATE TABLE my_table
(
    id  SERIAL NOT NULL,
    PRIMARY KEY(id),

    some_value_1 float NOT NULL,
    ...
    some_value_N float NOT NULL
);

The some_value_* fields contain floating point data and are at least a
dozen in number.

I would like to be able to perform queries on the table on the
normalized versions of these values. By normalized I don't mean in the
database nomenclature, but where all some_value_i's are in the range of
[0,1].

To do this, I must find the min() and max() of each row's some_value_i
within the table and divide each some_value_i by the absolute
difference of these two values.

As an example, if a row contained the lowest some_value_3 of -4.0 and
the largest row 1.5, then the normalized version of any some_value_3
field is some_value_3 / (1.5 - -4.0) or some_value_3 / 5.5.

I am having difficulty expressing this as a query to just list every
row in the table for starters (e.g. SELECT * FROM my_table;). I
considered creating a VIEW, my_table_normalized, but I'm not sure if
that is the appropriate strategy here.

The my_table table also contains a large number of rows. I am therefore
worried about every query on the normalized variant taking a while to
find the min() and the max() before it can do anything else.

Any help appreciated.

--
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com

Attachment

signature.asc

Re: Query to return normalized floats

From

Andreas Kretschmer

Date:

28 February 2016, 08:21:41

Kip Warner <kip@thevertigo.com> wrote:

> The some_value_* fields contain floating point data and are at least a
> dozen in number.

maybe a wrong table design, but that's another question.


>
> I would like to be able to perform queries on the table on the
> normalized versions of these values. By normalized I don't mean in the
> database nomenclature, but where all some_value_i's are in the range of
> [0,1].
>
> To do this, I must find the min() and max() of each row's some_value_i
> within the table and divide each some_value_i by the absolute
> difference of these two values.
>
> As an example, if a row contained the lowest some_value_3 of -4.0 and
> the largest row 1.5, then the normalized version of any some_value_3
> field is some_value_3 / (1.5 - -4.0) or some_value_3 / 5.5.
>
> I am having difficulty expressing this as a query to just list every
> row in the table for starters (e.g. SELECT * FROM my_table;). I
> considered creating a VIEW, my_table_normalized, but I'm not sure if
> that is the appropriate strategy here.

simple example:

test=*# select * from bla;
 id |       col1        |        col2
----+-------------------+--------------------
  1 | -27.3061781190336 |   9.23637737520039
  2 | -34.9188138730824 |   4.02728125452995
  3 |  27.7425193693489 |  -1.71850152313709
  4 |  18.2173402048647 |   1.78571328520775
  5 | -49.3932147044688 |   3.25902994722128
  6 | -21.3868645019829 | 0.0395399890840054
  7 |  48.6888256389648 | -0.219368590041995
  8 |  -26.943267416209 |  -2.84633947536349
  9 | -47.2060812171549 |   1.46993971429765
 10 |  -16.008263733238 |   5.15772333368659
 11 |   12.651920504868 |   4.91552650928497
 12 |  38.5760291945189 |   6.94333815015852
 13 |   -47.87487979047 |  -4.18941779062152
 14 | -2.24363747984171 |  0.686697596684098
 15 |  -2.6916132774204 |   7.83255377784371
 16 | -4.97196828946471 |    4.0004417207092
 17 |  35.3446557652205 |    2.0218435768038
 18 | -9.86138512380421 |   2.61743502691388
 19 | -48.4832897316664 |  -8.32880898378789
 20 | -27.9842584393919 |  -9.57530088722706
(20 rows)

test=*# create view my_factor as select max(col1) - min(col1) as c1,
max(col2) - min(col2) as c2 from bla;
CREATE VIEW
test=*# select id, col1 / f.c1, col2 / f.c2 from bla cross join
my_factor f;
 id |      ?column?       |      ?column?
----+---------------------+---------------------
  1 |  -0.278401407876724 |   0.490991672638172
  2 |  -0.356016389451263 |   0.214084102351125
  3 |   0.282850145370229 | -0.0913529085052159
  4 |   0.185735738582485 |   0.094925782819407
  5 |  -0.503590815724457 |   0.173245039690613
  6 |  -0.218050770835282 | 0.00210188525087517
  7 |   0.496409184275543 | -0.0116612982096414
  8 |  -0.274701334942333 |  -0.151307046381315
  9 |  -0.481291794622778 |   0.078139743503564
 10 |  -0.163212996764598 |   0.274176671625737
 11 |   0.128993243417117 |   0.261301859446679
 12 |   0.393303698204536 |   0.369097219997987
 13 |   -0.48811056155476 |  -0.222703032242957
 14 | -0.0228751101831245 |  0.0365037923307268
 15 | -0.0274424682438878 |   0.416366560632055
 16 | -0.0506919337327751 |   0.212657353847013
 17 |   0.360358080250589 |   0.107478107407462
 18 |  -0.100542210268818 |   0.139138836546109
 19 |  -0.494313633381836 |  -0.442746727197808
 20 |  -0.285314807291989 |  -0.509008327361828
(20 rows)

test=*#



is that okay? (you should define aliases for the columns...)



>
> The my_table table also contains a large number of rows. I am therefore
> worried about every query on the normalized variant taking a while to
> find the min() and the max() before it can do anything else.


create indexes on the columns, so it should use indexes for the min/max.



Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°

Re: Query to return normalized floats

From

Kip Warner

Date:

12 April 2016, 00:17:14

On Sun, 2016-02-28 at 09:21 +0100, Andreas Kretschmer wrote:
> Kip Warner <kip@thevertigo.com> wrote:
>
> > The some_value_* fields contain floating point data and are at
> > least a
> > dozen in number.
>
> maybe a wrong table design, but that's another question.

Hey Andreas. I am very sorry for taking so long to acknowledge and
respond to your thoughtful answer. The software I am working on had
another aspect of it that sucked me away for the last few weeks before
I could return to the database schema.

> > I would like to be able to perform queries on the table on the
> > normalized versions of these values. By normalized I don't mean in
> > the database nomenclature, but where all some_value_i's are in the
> > range of [0,1].
> >
> > To do this, I must find the min() and max() of each row's
> > some_value_i within the table and divide each some_value_i by the
> > absolute difference of these two values.
> >
> > As an example, if a row contained the lowest some_value_3 of -4.0
> > and the largest row 1.5, then the normalized version of any
> > some_value_3 field is some_value_3 / (1.5 - -4.0) or some_value_3 /
> > 5.5.
> >
> > I am having difficulty expressing this as a query to just list
> > every row in the table for starters (e.g. SELECT * FROM my_table;).
> > I considered creating a VIEW, my_table_normalized, but I'm not sure
> > if that is the appropriate strategy here.
>
> simple example:
>
> test=*# select * from bla;
>  id |       col1        |        col2
> ----+-------------------+--------------------
>   1 | -27.3061781190336 |   9.23637737520039
>   2 | -34.9188138730824 |   4.02728125452995
>   3 |  27.7425193693489 |  -1.71850152313709
>   4 |  18.2173402048647 |   1.78571328520775
>   5 | -49.3932147044688 |   3.25902994722128
>   6 | -21.3868645019829 | 0.0395399890840054
>   7 |  48.6888256389648 | -0.219368590041995
>   8 |  -26.943267416209 |  -2.84633947536349
>   9 | -47.2060812171549 |   1.46993971429765
>  10 |  -16.008263733238 |   5.15772333368659
>  11 |   12.651920504868 |   4.91552650928497
>  12 |  38.5760291945189 |   6.94333815015852
>  13 |   -47.87487979047 |  -4.18941779062152
>  14 | -2.24363747984171 |  0.686697596684098
>  15 |  -2.6916132774204 |   7.83255377784371
>  16 | -4.97196828946471 |    4.0004417207092
>  17 |  35.3446557652205 |    2.0218435768038
>  18 | -9.86138512380421 |   2.61743502691388
>  19 | -48.4832897316664 |  -8.32880898378789
>  20 | -27.9842584393919 |  -9.57530088722706
> (20 rows)
>
> test=*# create view my_factor as select max(col1) - min(col1) as c1,
> max(col2) - min(col2) as c2 from bla;
> CREATE VIEW
> test=*# select id, col1 / f.c1, col2 / f.c2 from bla cross join
> my_factor f;
>  id |      ?column?       |      ?column?
> ----+---------------------+---------------------
>   1 |  -0.278401407876724 |   0.490991672638172
>   2 |  -0.356016389451263 |   0.214084102351125
>   3 |   0.282850145370229 | -0.0913529085052159
>   4 |   0.185735738582485 |   0.094925782819407
>   5 |  -0.503590815724457 |   0.173245039690613
>   6 |  -0.218050770835282 | 0.00210188525087517
>   7 |   0.496409184275543 | -0.0116612982096414
>   8 |  -0.274701334942333 |  -0.151307046381315
>   9 |  -0.481291794622778 |   0.078139743503564
>  10 |  -0.163212996764598 |   0.274176671625737
>  11 |   0.128993243417117 |   0.261301859446679
>  12 |   0.393303698204536 |   0.369097219997987
>  13 |   -0.48811056155476 |  -0.222703032242957
>  14 | -0.0228751101831245 |  0.0365037923307268
>  15 | -0.0274424682438878 |   0.416366560632055
>  16 | -0.0506919337327751 |   0.212657353847013
>  17 |   0.360358080250589 |   0.107478107407462
>  18 |  -0.100542210268818 |   0.139138836546109
>  19 |  -0.494313633381836 |  -0.442746727197808
>  20 |  -0.285314807291989 |  -0.509008327361828
> (20 rows)
>
> test=*#
>
> is that okay?

Almost. I realized that the normalization is not calculated correctly
and should be calculated as...

    norm(x) = (x - min)/(max - min)

This also ensures the values are in the range of [0.0, 1.0].

It would also be nice if the view contained the fully normalized
results. I've started with this...

    CREATE VIEW my_view AS
        SELECT
            id,
            (col1 - MIN(col1) / (MAX(col1) - MIN(col1)) AS col1_norm,
            (col2 - MIN(col2) / (MAX(col2) - MIN(col2)) AS col2_norm
        FROM my_table GROUP_BY id;

The problem is when I try to see what it's populated with via...

    SELECT * FROM my_view;

I get an ERROR: division by zero. I can see how that would happen if
MAX(x) - MIN(x) was zero, but I've made sure it wasn't from the sample
rows in my_table. I suspect my_view schema is not declared correctly.

> (you should define aliases for the columns...)

Yes. Hopefully I'm using the AS keyword correctly here.

> create indexes on the columns, so it should use indexes for the
> min/max.

Done.

--
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com

Attachment

signature.asc

Re: Query to return normalized floats

From

Kip Warner

Date:

14 April 2016, 08:21:51

Hey Andreas,

I figured it out with the help of some folks on IRC. It turns out the
view schema was almost correct. It needed some adjustments, in
particular safe handling of divide by zero errors.

CREATE VIEW my_view AS
    SELECT
        id,
        COALESCE((col1 - (SELECT MIN(col1) FROM my_table)) / NULLIF((SELECT MAX(col1) FROM my_table) - (SELECT
MIN(col1)FROM my_table), 0), 0) AS col1_norm, 
        COALESCE((col2 - (SELECT MIN(col2) FROM my_table)) / NULLIF((SELECT MAX(col2) FROM my_table) - (SELECT
MIN(col2)FROM my_table), 0), 0) AS col2_norm, 
        COALESCE((col3 - (SELECT MIN(col3) FROM my_table)) / NULLIF((SELECT MAX(col3) FROM my_table) - (SELECT
MIN(col3)FROM my_table), 0), 0) AS col3_norm 
    FROM my_table GROUP_BY id;

Since my_table contains hundreds of thousands of rows, it was also
suggested to me to create indices for every column in my_table. e.g.

    CREATE INDEX col1_index ON my_table(col1);
    CREATE INDEX col2_index ON my_table(col2);
    CREATE INDEX col3_index ON my_table(col3);

Hopefully that will not only work, but will also be efficient too.

--
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com

Attachment

signature.asc

Re: Query to return normalized floats

From

Yaroslav

Date:

14 April 2016, 23:36:50

Kip Warner wrote
>  GROUP_BY id;

Are you sure you really need this? I see this in your table definition:

Kip Warner wrote
> PRIMARY KEY(id),





-----
WBR, Yaroslav Schekin.
--
View this message in context: http://postgresql.nabble.com/Re-Query-to-return-normalized-floats-tp5889604p5898853.html
Sent from the PostgreSQL - novice mailing list archive at Nabble.com.

Re: Query to return normalized floats

From

Kip Warner

Date:

15 April 2016, 21:19:47

On Thu, 2016-04-14 at 16:36 -0700, Yaroslav wrote:
> Kip Warner wrote
> >  GROUP_BY id;
>
> Are you sure you really need this? I see this in your table
> definition:
>
> Kip Warner wrote
> > PRIMARY KEY(id),

Indeed I do not. I removed that recently and should have noted that in
my post. Thanks for pointing that out though.

--
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com

Attachment

signature.asc