Thread: Postgresql GROUP BY "SIMILAR" but not equal values

Postgresql GROUP BY "SIMILAR" but not equal values

From
alexandros_e
Date:
I wanted to ask you the following question to all experts here.

Let's say I have this table foo

ID|G1|T1|
1|2|ABC|
1|2|ABCD|
1|2|DEF|
1|2|DEFG|

SELECT * FROM foo
GROUP BY ID,G1,T1

RETURNS exactly the same table.

Is there a way in SQL or PostgreSQL in general to group by values than are
not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on
some distance function (levenshtein for example) if the distance is within
some threshold (i.e., 1)

My intuition is that SQL cannot support such queries but I was wondering if
there was some hack around it. The problem as I see it that distance
functions require 2 values but GROUP BY only checks equality.

Another subproblem that might help is can we overload an operator of a
custom type, so that equals operator is more "relaxed" and is calculated by
a function?  OR can we use GROUP BY with a custom comparator for a data
type?

I hope my question makes sense.



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Postgresql-GROUP-BY-SIMILAR-but-not-equal-values-tp5790860.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Postgresql GROUP BY "SIMILAR" but not equal values

From
Tom Lane
Date:
alexandros_e <alexandros.ef@gmail.com> writes:
> Is there a way in SQL or PostgreSQL in general to group by values than are
> not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on
> some distance function (levenshtein for example) if the distance is within
> some threshold (i.e., 1)

Well, you can GROUP BY the result of a function.

You are going to have to think harder than the above in any case.
For example, it's not hard to imagine a "similarity" operator that
says that A is similar to B, and B is similar to C, but if you ask
it to compare A to C it says they're not similar (enough).  Now what?
Are A,B,C all part of the same group?  If you take the transitive
closure of such an operator you probably end up with everything in
one group; but if you don't, it's hard to see a principled result
at all.

If you can cast your problem as transformation of the values into
some canonical or representative form, then you can do that and then
group on simple equality of the canonical values.  For instance
case-insensitive grouping is customarily done with

      GROUP BY lower(x)

            regards, tom lane


Re: Postgresql GROUP BY "SIMILAR" but not equal values

From
Alban Hertroys
Date:
On 6 February 2014 16:18, alexandros_e <alexandros.ef@gmail.com> wrote:
> Let's say I have this table foo
>
> ID|G1|T1|
> 1|2|ABC|
> 1|2|ABCD|
> 1|2|DEF|
> 1|2|DEFG|
>
> SELECT * FROM foo
> GROUP BY ID,G1,T1

> Is there a way in SQL or PostgreSQL in general to group by values than are
> not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on
> some distance function (levenshtein for example) if the distance is within
> some threshold (i.e., 1)

Perhaps there is: You can calculate the levenshtein distance between
those values using a self-join and then GROUP BY the result of that
expression and limit the results with HAVING.

For example:
SELECT foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1)
FROM foo foo1
INNER JOIN foo foo2 ON (foo2.ID = foo1.ID AND foo2.G1 = foo1.G1)
GROUP BY foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1)
HAVING levenshtein(foo1.T1, foo2.T1) > 1

Is that what you're looking for?

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


Re: Postgresql GROUP BY "SIMILAR" but not equal values

From
Sergey Konoplev
Date:
On Thu, Feb 6, 2014 at 7:41 AM, Alban Hertroys <haramrae@gmail.com> wrote:
> On 6 February 2014 16:18, alexandros_e <alexandros.ef@gmail.com> wrote:
>> Is there a way in SQL or PostgreSQL in general to group by values than are
>> not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on
>> some distance function (levenshtein for example) if the distance is within
>> some threshold (i.e., 1)
>
> Perhaps there is: You can calculate the levenshtein distance between
> those values using a self-join and then GROUP BY the result of that
> expression and limit the results with HAVING.
>
> For example:
> SELECT foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1)
> FROM foo foo1
> INNER JOIN foo foo2 ON (foo2.ID = foo1.ID AND foo2.G1 = foo1.G1)
> GROUP BY foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1)
> HAVING levenshtein(foo1.T1, foo2.T1) > 1

From my understanding of the question, probably, adding another
levenshtein distance to some base value will make more sense:

GROUP BY levenshtein(foo1.T1, foo2.T1), levenshtein(foo1.T1, 'A')

Though, it looks like a clusterization task for me, and therefore I
would recommend OP to look at the PL/R
http://www.joeconway.com/plr/doc/index.html.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com


Re: Postgresql GROUP BY "SIMILAR" but not equal values

From
"Gauthier, Dave"
Date:
What about a regexp match ?

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Thursday, February 06, 2014 10:32 AM
To: alexandros_e
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Postgresql GROUP BY "SIMILAR" but not equal values

alexandros_e <alexandros.ef@gmail.com> writes:
> Is there a way in SQL or PostgreSQL in general to group by values than
> are not exactly the same but are quite similar (like 'ABC' and 'ABCD')
> based on some distance function (levenshtein for example) if the
> distance is within some threshold (i.e., 1)

Well, you can GROUP BY the result of a function.

You are going to have to think harder than the above in any case.
For example, it's not hard to imagine a "similarity" operator that says that A is similar to B, and B is similar to C,
butif you ask it to compare A to C it says they're not similar (enough).  Now what? 
Are A,B,C all part of the same group?  If you take the transitive closure of such an operator you probably end up with
everythingin one group; but if you don't, it's hard to see a principled result at all. 

If you can cast your problem as transformation of the values into some canonical or representative form, then you can
dothat and then group on simple equality of the canonical values.  For instance case-insensitive grouping is
customarilydone with 

      GROUP BY lower(x)

            regards, tom lane


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: Postgresql GROUP BY "SIMILAR" but not equal values

From
alexandros_e
Date:
Conceptually, Tom (as always) is right. But Alban's hack help.

DROP TABLE foo;
CREATE TABLE IF NOT EXISTS foo(ID INTEGER,G1 INTEGER, T1 TEXT, ID2 SERIAL
PRIMARY KEY);

INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABC');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABCD');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABDC');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'DEF');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'DEFH');

/* A little editing to remove duplicates a to b and b to a */
SELECT foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1),foo2.T1
FROM foo foo1
INNER JOIN foo foo2 ON (foo2.ID = foo1.ID AND foo2.G1 = foo1.G1)
WHERE foo1.ID2<foo2.ID2
GROUP BY foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1),foo2.T1
HAVING levenshtein(foo1.T1, foo2.T1) <2;

RETURNS

ID|G1|foo1.T1|foo2.T1
1;2;"ABC";1;"ABCD"
1;2;"ABC";1;"ABDC"
1;2;"ABCD";2;"ABDC"
1;2;"DEF";1;"DEFH"

Then it requires a second grouping but as Tom suggested it would be hard to
somehow group all similar cases together because then it becomes a
clustering problem. With a second grouping we will have 3 records instead of
4, so it is better than the initial case by 25%.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Postgresql-GROUP-BY-SIMILAR-but-not-equal-values-tp5790860p5790876.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.