Re: Recursive merging of overlapping arrays in a column - Mailing list pgsql-sql

From David G. Johnston
Subject Re: Recursive merging of overlapping arrays in a column
Date
Msg-id CAKFQuwZaOBndHjiyD9-b2bE3y2Mbd5ttQajT4BQ48cXbZZwxrQ@mail.gmail.com
Whole thread Raw
In response to Re: Recursive merging of overlapping arrays in a column  (dave <audiotecture@web.de>)
Responses Re: Recursive merging of overlapping arrays in a column
List pgsql-sql
On Sunday, September 20, 2015, dave <audiotecture@web.de> wrote:
Sorry, here is the post again in plain text...

i have the following Table:

CREATE TABLE arrays (id SERIAL, arr INT[]);
INSERT INTO arrays (arr) VALUES (ARRAY[1,3,6,9]);
INSERT INTO arrays (arr) VALUES (ARRAY[2,4]);
INSERT INTO arrays (arr) VALUES (ARRAY[3,10,40]);
INSERT INTO arrays (arr) VALUES (ARRAY[3,18,44]);
INSERT INTO arrays (arr) VALUES (ARRAY[63,140,420]);
INSERT INTO arrays (arr) VALUES (ARRAY[42,102,420]);
INSERT INTO arrays (arr) VALUES (ARRAY[2,7]);
INSERT INTO arrays (arr) VALUES (ARRAY[1,3,11]);
INSERT INTO arrays (arr) VALUES (ARRAY[8,12,19]);


I want to merge the arrays which have overlapping elements, so that I get
the result which doesn't contain overlapping arrays anymore:

           arr
--------------------------
 {1,3,6,9,10,11,18,40,44}
 {2,4,7}
 {8,12,19}
 {42,63,102,140,420}


I am not an expert in SQL and it took me a long time to come up with this
solution:

WITH RECURSIVE clusters AS (
        select DISTINCT uniq(sort_asc(array_cat(a1.arr, a2.arr))) AS arr
        from arrays a1 cross join arrays a2
        where a1.arr && a2.arr AND
        least(a1.id,a2.id) != greatest(a1.id, a2.id)

        UNION

        select DISTINCT uniq(sort_asc(array_cat(a1.arr, a2.arr))) AS arr
        from arrays a1 cross join clusters a2
        where a1.arr && a2.arr AND
        a1.arr != a2.arr
)

SELECT arr FROM (
        SELECT * FROM (
                SELECT DISTINCT ON (arr[1]) arr
                FROM clusters
                ORDER BY arr[1], array_length(arr, 1) DESC
        ) AS c

        UNION

        SELECT arr FROM arrays WHERE id NOT IN (
                SELECT DISTINCT a1.id
                FROM arrays a1 CROSS JOIN arrays a2
                WHERE a1.arr && a2.arr AND
                least(a1.id,a2.id) != greatest(a1.id, a2.id)
        )
) AS clustertable
ORDER BY arr;


Which gives me the result:

           arr
--------------------------
 {1,3,6,9,10,11,18,40,44}
 {2,4,7}
 {3,10,18,40,44}
 {8,12,19}
 {42,63,102,140,420}
(5 rows)


Result number 3 is contained in number one and shouldn't be in the output
anymore, because I only want non overlapping arrays in the result.

Another problem I encountered is that the performance of this query seems to
be very bad. I tried running it on a larger table (~400000 arrays) and it is
still running after ~10h.

I would appreciate any input on this problems, so it would be nice if anyone
could give me a hint how to get only the merged arrays without overlaps in
the resultset and maybe how to build a more elegant and efficient query.


Thanks in advance,

Dave



While I've never actually done this personally...

Convert the data into nodes and edges and import it into a graph database.

David J.

pgsql-sql by date:

Previous
From: dave
Date:
Subject: Re: Recursive merging of overlapping arrays in a column
Next
From: dave
Date:
Subject: Re: Recursive merging of overlapping arrays in a column