Thread: Postgresql GROUP BY "SIMILAR" but not equal values
I wanted to ask you the following question to all experts here. Let's say I have this table foo ID|G1|T1| 1|2|ABC| 1|2|ABCD| 1|2|DEF| 1|2|DEFG| SELECT * FROM foo GROUP BY ID,G1,T1 RETURNS exactly the same table. Is there a way in SQL or PostgreSQL in general to group by values than are not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on some distance function (levenshtein for example) if the distance is within some threshold (i.e., 1) My intuition is that SQL cannot support such queries but I was wondering if there was some hack around it. The problem as I see it that distance functions require 2 values but GROUP BY only checks equality. Another subproblem that might help is can we overload an operator of a custom type, so that equals operator is more "relaxed" and is calculated by a function? OR can we use GROUP BY with a custom comparator for a data type? I hope my question makes sense. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Postgresql-GROUP-BY-SIMILAR-but-not-equal-values-tp5790860.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
alexandros_e <alexandros.ef@gmail.com> writes: > Is there a way in SQL or PostgreSQL in general to group by values than are > not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on > some distance function (levenshtein for example) if the distance is within > some threshold (i.e., 1) Well, you can GROUP BY the result of a function. You are going to have to think harder than the above in any case. For example, it's not hard to imagine a "similarity" operator that says that A is similar to B, and B is similar to C, but if you ask it to compare A to C it says they're not similar (enough). Now what? Are A,B,C all part of the same group? If you take the transitive closure of such an operator you probably end up with everything in one group; but if you don't, it's hard to see a principled result at all. If you can cast your problem as transformation of the values into some canonical or representative form, then you can do that and then group on simple equality of the canonical values. For instance case-insensitive grouping is customarily done with GROUP BY lower(x) regards, tom lane
On 6 February 2014 16:18, alexandros_e <alexandros.ef@gmail.com> wrote: > Let's say I have this table foo > > ID|G1|T1| > 1|2|ABC| > 1|2|ABCD| > 1|2|DEF| > 1|2|DEFG| > > SELECT * FROM foo > GROUP BY ID,G1,T1 > Is there a way in SQL or PostgreSQL in general to group by values than are > not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on > some distance function (levenshtein for example) if the distance is within > some threshold (i.e., 1) Perhaps there is: You can calculate the levenshtein distance between those values using a self-join and then GROUP BY the result of that expression and limit the results with HAVING. For example: SELECT foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1) FROM foo foo1 INNER JOIN foo foo2 ON (foo2.ID = foo1.ID AND foo2.G1 = foo1.G1) GROUP BY foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1) HAVING levenshtein(foo1.T1, foo2.T1) > 1 Is that what you're looking for? -- If you can't see the forest for the trees, Cut the trees and you'll see there is no forest.
On Thu, Feb 6, 2014 at 7:41 AM, Alban Hertroys <haramrae@gmail.com> wrote: > On 6 February 2014 16:18, alexandros_e <alexandros.ef@gmail.com> wrote: >> Is there a way in SQL or PostgreSQL in general to group by values than are >> not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on >> some distance function (levenshtein for example) if the distance is within >> some threshold (i.e., 1) > > Perhaps there is: You can calculate the levenshtein distance between > those values using a self-join and then GROUP BY the result of that > expression and limit the results with HAVING. > > For example: > SELECT foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1) > FROM foo foo1 > INNER JOIN foo foo2 ON (foo2.ID = foo1.ID AND foo2.G1 = foo1.G1) > GROUP BY foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1) > HAVING levenshtein(foo1.T1, foo2.T1) > 1 From my understanding of the question, probably, adding another levenshtein distance to some base value will make more sense: GROUP BY levenshtein(foo1.T1, foo2.T1), levenshtein(foo1.T1, 'A') Though, it looks like a clusterization task for me, and therefore I would recommend OP to look at the PL/R http://www.joeconway.com/plr/doc/index.html. -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA http://www.linkedin.com/in/grayhemp +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979 gray.ru@gmail.com
What about a regexp match ? -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane Sent: Thursday, February 06, 2014 10:32 AM To: alexandros_e Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Postgresql GROUP BY "SIMILAR" but not equal values alexandros_e <alexandros.ef@gmail.com> writes: > Is there a way in SQL or PostgreSQL in general to group by values than > are not exactly the same but are quite similar (like 'ABC' and 'ABCD') > based on some distance function (levenshtein for example) if the > distance is within some threshold (i.e., 1) Well, you can GROUP BY the result of a function. You are going to have to think harder than the above in any case. For example, it's not hard to imagine a "similarity" operator that says that A is similar to B, and B is similar to C, butif you ask it to compare A to C it says they're not similar (enough). Now what? Are A,B,C all part of the same group? If you take the transitive closure of such an operator you probably end up with everythingin one group; but if you don't, it's hard to see a principled result at all. If you can cast your problem as transformation of the values into some canonical or representative form, then you can dothat and then group on simple equality of the canonical values. For instance case-insensitive grouping is customarilydone with GROUP BY lower(x) regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Conceptually, Tom (as always) is right. But Alban's hack help. DROP TABLE foo; CREATE TABLE IF NOT EXISTS foo(ID INTEGER,G1 INTEGER, T1 TEXT, ID2 SERIAL PRIMARY KEY); INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABC'); INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABCD'); INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABDC'); INSERT INTO foo(ID,G1,T1) VALUES(1,2,'DEF'); INSERT INTO foo(ID,G1,T1) VALUES(1,2,'DEFH'); /* A little editing to remove duplicates a to b and b to a */ SELECT foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1),foo2.T1 FROM foo foo1 INNER JOIN foo foo2 ON (foo2.ID = foo1.ID AND foo2.G1 = foo1.G1) WHERE foo1.ID2<foo2.ID2 GROUP BY foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1),foo2.T1 HAVING levenshtein(foo1.T1, foo2.T1) <2; RETURNS ID|G1|foo1.T1|foo2.T1 1;2;"ABC";1;"ABCD" 1;2;"ABC";1;"ABDC" 1;2;"ABCD";2;"ABDC" 1;2;"DEF";1;"DEFH" Then it requires a second grouping but as Tom suggested it would be hard to somehow group all similar cases together because then it becomes a clustering problem. With a second grouping we will have 3 records instead of 4, so it is better than the initial case by 25%. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Postgresql-GROUP-BY-SIMILAR-but-not-equal-values-tp5790860p5790876.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.