Thread: Query to return normalized floats
Hey list, I am new to PostgreSQL, and it's been a while since I had to do some complicated SQL queries. I have a table like so... CREATE TABLE my_table ( id SERIAL NOT NULL, PRIMARY KEY(id), some_value_1 float NOT NULL, ... some_value_N float NOT NULL ); The some_value_* fields contain floating point data and are at least a dozen in number. I would like to be able to perform queries on the table on the normalized versions of these values. By normalized I don't mean in the database nomenclature, but where all some_value_i's are in the range of [0,1]. To do this, I must find the min() and max() of each row's some_value_i within the table and divide each some_value_i by the absolute difference of these two values. As an example, if a row contained the lowest some_value_3 of -4.0 and the largest row 1.5, then the normalized version of any some_value_3 field is some_value_3 / (1.5 - -4.0) or some_value_3 / 5.5. I am having difficulty expressing this as a query to just list every row in the table for starters (e.g. SELECT * FROM my_table;). I considered creating a VIEW, my_table_normalized, but I'm not sure if that is the appropriate strategy here. The my_table table also contains a large number of rows. I am therefore worried about every query on the normalized variant taking a while to find the min() and the max() before it can do anything else. Any help appreciated. -- Kip Warner -- Senior Software Engineer OpenPGP encrypted/signed mail preferred http://www.thevertigo.com
Attachment
Kip Warner <kip@thevertigo.com> wrote: > The some_value_* fields contain floating point data and are at least a > dozen in number. maybe a wrong table design, but that's another question. > > I would like to be able to perform queries on the table on the > normalized versions of these values. By normalized I don't mean in the > database nomenclature, but where all some_value_i's are in the range of > [0,1]. > > To do this, I must find the min() and max() of each row's some_value_i > within the table and divide each some_value_i by the absolute > difference of these two values. > > As an example, if a row contained the lowest some_value_3 of -4.0 and > the largest row 1.5, then the normalized version of any some_value_3 > field is some_value_3 / (1.5 - -4.0) or some_value_3 / 5.5. > > I am having difficulty expressing this as a query to just list every > row in the table for starters (e.g. SELECT * FROM my_table;). I > considered creating a VIEW, my_table_normalized, but I'm not sure if > that is the appropriate strategy here. simple example: test=*# select * from bla; id | col1 | col2 ----+-------------------+-------------------- 1 | -27.3061781190336 | 9.23637737520039 2 | -34.9188138730824 | 4.02728125452995 3 | 27.7425193693489 | -1.71850152313709 4 | 18.2173402048647 | 1.78571328520775 5 | -49.3932147044688 | 3.25902994722128 6 | -21.3868645019829 | 0.0395399890840054 7 | 48.6888256389648 | -0.219368590041995 8 | -26.943267416209 | -2.84633947536349 9 | -47.2060812171549 | 1.46993971429765 10 | -16.008263733238 | 5.15772333368659 11 | 12.651920504868 | 4.91552650928497 12 | 38.5760291945189 | 6.94333815015852 13 | -47.87487979047 | -4.18941779062152 14 | -2.24363747984171 | 0.686697596684098 15 | -2.6916132774204 | 7.83255377784371 16 | -4.97196828946471 | 4.0004417207092 17 | 35.3446557652205 | 2.0218435768038 18 | -9.86138512380421 | 2.61743502691388 19 | -48.4832897316664 | -8.32880898378789 20 | -27.9842584393919 | -9.57530088722706 (20 rows) test=*# create view my_factor as select max(col1) - min(col1) as c1, max(col2) - min(col2) as c2 from bla; CREATE VIEW test=*# select id, col1 / f.c1, col2 / f.c2 from bla cross join my_factor f; id | ?column? | ?column? ----+---------------------+--------------------- 1 | -0.278401407876724 | 0.490991672638172 2 | -0.356016389451263 | 0.214084102351125 3 | 0.282850145370229 | -0.0913529085052159 4 | 0.185735738582485 | 0.094925782819407 5 | -0.503590815724457 | 0.173245039690613 6 | -0.218050770835282 | 0.00210188525087517 7 | 0.496409184275543 | -0.0116612982096414 8 | -0.274701334942333 | -0.151307046381315 9 | -0.481291794622778 | 0.078139743503564 10 | -0.163212996764598 | 0.274176671625737 11 | 0.128993243417117 | 0.261301859446679 12 | 0.393303698204536 | 0.369097219997987 13 | -0.48811056155476 | -0.222703032242957 14 | -0.0228751101831245 | 0.0365037923307268 15 | -0.0274424682438878 | 0.416366560632055 16 | -0.0506919337327751 | 0.212657353847013 17 | 0.360358080250589 | 0.107478107407462 18 | -0.100542210268818 | 0.139138836546109 19 | -0.494313633381836 | -0.442746727197808 20 | -0.285314807291989 | -0.509008327361828 (20 rows) test=*# is that okay? (you should define aliases for the columns...) > > The my_table table also contains a large number of rows. I am therefore > worried about every query on the normalized variant taking a while to > find the min() and the max() before it can do anything else. create indexes on the columns, so it should use indexes for the min/max. Andreas -- Really, I'm not out to destroy Microsoft. That will just be a completely unintentional side effect. (Linus Torvalds) "If I was god, I would recompile penguin with --enable-fly." (unknown) Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°
On Sun, 2016-02-28 at 09:21 +0100, Andreas Kretschmer wrote: > Kip Warner <kip@thevertigo.com> wrote: > > > The some_value_* fields contain floating point data and are at > > least a > > dozen in number. > > maybe a wrong table design, but that's another question. Hey Andreas. I am very sorry for taking so long to acknowledge and respond to your thoughtful answer. The software I am working on had another aspect of it that sucked me away for the last few weeks before I could return to the database schema. > > I would like to be able to perform queries on the table on the > > normalized versions of these values. By normalized I don't mean in > > the database nomenclature, but where all some_value_i's are in the > > range of [0,1]. > > > > To do this, I must find the min() and max() of each row's > > some_value_i within the table and divide each some_value_i by the > > absolute difference of these two values. > > > > As an example, if a row contained the lowest some_value_3 of -4.0 > > and the largest row 1.5, then the normalized version of any > > some_value_3 field is some_value_3 / (1.5 - -4.0) or some_value_3 / > > 5.5. > > > > I am having difficulty expressing this as a query to just list > > every row in the table for starters (e.g. SELECT * FROM my_table;). > > I considered creating a VIEW, my_table_normalized, but I'm not sure > > if that is the appropriate strategy here. > > simple example: > > test=*# select * from bla; > id | col1 | col2 > ----+-------------------+-------------------- > 1 | -27.3061781190336 | 9.23637737520039 > 2 | -34.9188138730824 | 4.02728125452995 > 3 | 27.7425193693489 | -1.71850152313709 > 4 | 18.2173402048647 | 1.78571328520775 > 5 | -49.3932147044688 | 3.25902994722128 > 6 | -21.3868645019829 | 0.0395399890840054 > 7 | 48.6888256389648 | -0.219368590041995 > 8 | -26.943267416209 | -2.84633947536349 > 9 | -47.2060812171549 | 1.46993971429765 > 10 | -16.008263733238 | 5.15772333368659 > 11 | 12.651920504868 | 4.91552650928497 > 12 | 38.5760291945189 | 6.94333815015852 > 13 | -47.87487979047 | -4.18941779062152 > 14 | -2.24363747984171 | 0.686697596684098 > 15 | -2.6916132774204 | 7.83255377784371 > 16 | -4.97196828946471 | 4.0004417207092 > 17 | 35.3446557652205 | 2.0218435768038 > 18 | -9.86138512380421 | 2.61743502691388 > 19 | -48.4832897316664 | -8.32880898378789 > 20 | -27.9842584393919 | -9.57530088722706 > (20 rows) > > test=*# create view my_factor as select max(col1) - min(col1) as c1, > max(col2) - min(col2) as c2 from bla; > CREATE VIEW > test=*# select id, col1 / f.c1, col2 / f.c2 from bla cross join > my_factor f; > id | ?column? | ?column? > ----+---------------------+--------------------- > 1 | -0.278401407876724 | 0.490991672638172 > 2 | -0.356016389451263 | 0.214084102351125 > 3 | 0.282850145370229 | -0.0913529085052159 > 4 | 0.185735738582485 | 0.094925782819407 > 5 | -0.503590815724457 | 0.173245039690613 > 6 | -0.218050770835282 | 0.00210188525087517 > 7 | 0.496409184275543 | -0.0116612982096414 > 8 | -0.274701334942333 | -0.151307046381315 > 9 | -0.481291794622778 | 0.078139743503564 > 10 | -0.163212996764598 | 0.274176671625737 > 11 | 0.128993243417117 | 0.261301859446679 > 12 | 0.393303698204536 | 0.369097219997987 > 13 | -0.48811056155476 | -0.222703032242957 > 14 | -0.0228751101831245 | 0.0365037923307268 > 15 | -0.0274424682438878 | 0.416366560632055 > 16 | -0.0506919337327751 | 0.212657353847013 > 17 | 0.360358080250589 | 0.107478107407462 > 18 | -0.100542210268818 | 0.139138836546109 > 19 | -0.494313633381836 | -0.442746727197808 > 20 | -0.285314807291989 | -0.509008327361828 > (20 rows) > > test=*# > > is that okay? Almost. I realized that the normalization is not calculated correctly and should be calculated as... norm(x) = (x - min)/(max - min) This also ensures the values are in the range of [0.0, 1.0]. It would also be nice if the view contained the fully normalized results. I've started with this... CREATE VIEW my_view AS SELECT id, (col1 - MIN(col1) / (MAX(col1) - MIN(col1)) AS col1_norm, (col2 - MIN(col2) / (MAX(col2) - MIN(col2)) AS col2_norm FROM my_table GROUP_BY id; The problem is when I try to see what it's populated with via... SELECT * FROM my_view; I get an ERROR: division by zero. I can see how that would happen if MAX(x) - MIN(x) was zero, but I've made sure it wasn't from the sample rows in my_table. I suspect my_view schema is not declared correctly. > (you should define aliases for the columns...) Yes. Hopefully I'm using the AS keyword correctly here. > create indexes on the columns, so it should use indexes for the > min/max. Done. -- Kip Warner -- Senior Software Engineer OpenPGP encrypted/signed mail preferred http://www.thevertigo.com
Attachment
Hey Andreas, I figured it out with the help of some folks on IRC. It turns out the view schema was almost correct. It needed some adjustments, in particular safe handling of divide by zero errors. CREATE VIEW my_view AS SELECT id, COALESCE((col1 - (SELECT MIN(col1) FROM my_table)) / NULLIF((SELECT MAX(col1) FROM my_table) - (SELECT MIN(col1)FROM my_table), 0), 0) AS col1_norm, COALESCE((col2 - (SELECT MIN(col2) FROM my_table)) / NULLIF((SELECT MAX(col2) FROM my_table) - (SELECT MIN(col2)FROM my_table), 0), 0) AS col2_norm, COALESCE((col3 - (SELECT MIN(col3) FROM my_table)) / NULLIF((SELECT MAX(col3) FROM my_table) - (SELECT MIN(col3)FROM my_table), 0), 0) AS col3_norm FROM my_table GROUP_BY id; Since my_table contains hundreds of thousands of rows, it was also suggested to me to create indices for every column in my_table. e.g. CREATE INDEX col1_index ON my_table(col1); CREATE INDEX col2_index ON my_table(col2); CREATE INDEX col3_index ON my_table(col3); Hopefully that will not only work, but will also be efficient too. -- Kip Warner -- Senior Software Engineer OpenPGP encrypted/signed mail preferred http://www.thevertigo.com
Attachment
Kip Warner wrote > GROUP_BY id; Are you sure you really need this? I see this in your table definition: Kip Warner wrote > PRIMARY KEY(id), ----- WBR, Yaroslav Schekin. -- View this message in context: http://postgresql.nabble.com/Re-Query-to-return-normalized-floats-tp5889604p5898853.html Sent from the PostgreSQL - novice mailing list archive at Nabble.com.
On Thu, 2016-04-14 at 16:36 -0700, Yaroslav wrote: > Kip Warner wrote > > GROUP_BY id; > > Are you sure you really need this? I see this in your table > definition: > > Kip Warner wrote > > PRIMARY KEY(id), Indeed I do not. I removed that recently and should have noted that in my post. Thanks for pointing that out though. -- Kip Warner -- Senior Software Engineer OpenPGP encrypted/signed mail preferred http://www.thevertigo.com