Thread: Overlapping ranges

Overlapping ranges

From
Jason Long
Date:
I have a large table of access logs to an application.

I want is to find all rows that overlap startdate and enddate with any
other rows.

The query below seems to work, but does not finish unless I specify a
single id.

select distinct a1.id
from t_access a1,
        t_access a2
where tstzrange(a1.startdate, a1.enddate) &&
      tstzrange(a2.startdate, a2.enddate)




Re: Overlapping ranges

From
Rob Sargent
Date:
On 06/18/2014 05:47 PM, Jason Long wrote:
I have a large table of access logs to an application.  

I want is to find all rows that overlap startdate and enddate with any
other rows.

The query below seems to work, but does not finish unless I specify a
single id.  

select distinct a1.id
from t_access a1,        t_access a2 
where tstzrange(a1.startdate, a1.enddate) &&      tstzrange(a2.startdate, a2.enddate) 




I'm sure you're best bet is a windowing function, but your descriptions suggests there is no index on start/end date columns.  Probably want those in any event.

Re: Overlapping ranges

From
Jason Long
Date:
On Wed, 2014-06-18 at 18:08 -0600, Rob Sargent wrote:
On 06/18/2014 05:47 PM, Jason Long wrote:

I have a large table of access logs to an application.  

I want is to find all rows that overlap startdate and enddate with any
other rows.

The query below seems to work, but does not finish unless I specify a
single id.  

select distinct a1.id
from t_access a1,        t_access a2 
where tstzrange(a1.startdate, a1.enddate) &&      tstzrange(a2.startdate, a2.enddate) 




I'm sure you're best bet is a windowing function, but your descriptions suggests there is no index on start/end date columns.  Probably want those in any event.

There are indexs on startdate and enddate.
If I specify a known a1.id=1234 then the query returns all records that overlap it, but this takes 1.7 seconds.

There are about 2 million records in the table.

I will see what I come up with on the window function.

If anyone else has some suggestions let me know.

I get with for EXPLAIN ANALYZE the id specified.

Nested Loop  (cost=0.43..107950.50 rows=8825 width=84) (actual time=2803.932..2804.558 rows=11 loops=1)
   Join Filter: (tstzrange(a1.startdate, a1.enddate) && tstzrange(a2.startdate, a2.enddate))
   Rows Removed by Join Filter: 1767741
   ->  Index Scan using t_access_pkey on t_access a1  (cost=0.43..8.45 rows=1 width=24) (actual time=0.016..0.019 rows=1 loops=1)
         Index Cond: (id = 1928761)
   ->  Seq Scan on t_access a2  (cost=0.00..77056.22 rows=1764905 width=60) (actual time=0.006..1200.657 rows=1767752 loops=1)
         Filter: (enddate IS NOT NULL)
         Rows Removed by Filter: 159270
Total runtime: 2804.599 ms


and for EXPLAIN without the id specified.  EXPLAIN ANALYZE will not complete without the id specified.

Nested Loop  (cost=0.00..87949681448.20 rows=17005053815 width=84)
   Join Filter: (tstzrange(a1.startdate, a1.enddate) && tstzrange(a2.startdate, a2.enddate))
   ->  Seq Scan on t_access a2  (cost=0.00..77056.22 rows=1764905 width=60)
         Filter: (enddate IS NOT NULL)
   ->  Materialize  (cost=0.00..97983.33 rows=1927022 width=24)
         ->  Seq Scan on t_access a1  (cost=0.00..77056.22 rows=1927022 width=24)

Re: Overlapping ranges

From
Alban Hertroys
Date:
On 19 Jun 2014, at 1:47, Jason Long <mailing.lists@octgsoftware.com> wrote:

> I have a large table of access logs to an application.
>
> I want is to find all rows that overlap startdate and enddate with any
> other rows.
>
> The query below seems to work, but does not finish unless I specify a
> single id.
>
> select distinct a1.id
> from t_access a1,
>        t_access a2
> where tstzrange(a1.startdate, a1.enddate) &&
>      tstzrange(a2.startdate, a2.enddate)

You’re comparing overlapping records twice there; you compare all records in a1 to all records in a2. You’ll want to
skipthe records that you already compared. 

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: Overlapping ranges

From
Steve Crawford
Date:
On 06/18/2014 04:47 PM, Jason Long wrote:
> I have a large table of access logs to an application.
>
> I want is to find all rows that overlap startdate and enddate with any
> other rows.
>
> The query below seems to work, but does not finish unless I specify a
> single id.
>
> select distinct a1.id
> from t_access a1,
>          t_access a2
> where tstzrange(a1.startdate, a1.enddate) &&
>        tstzrange(a2.startdate, a2.enddate)
>
>
>
>
I'm not sure what you mean by "specify a single id" but a couple comments.

1. This query will return all ids since there is no constraint to
prevent a1 from finding the matching record in a2 which will, of course,
overlap. You need to add something like ...and a1.id != a2.id...

2. Even without the above issue there is a great potential to have this
query run a very long time - especially if the indexes are such that
each row on a1 requires scanning all rows in a2. I'd test it on a small
table to make sure it gives the results you want and read up on what
indexes are most appropriate to help speed it up. (I can't help much
here as I haven't yet experimented enough with indexing on range types.)

Cheers,
Steve