Thread: Finding sequential records

Finding sequential records

From
Steve Midgley
Date:
Hi,

I've been kicking this around today and I can't think of a way to solve 
my problem in "pure SQL" (i.e. I can only do it with a 
looping/cursor-type solution and some variables).

Given a table with this DDL/data script:

drop table if exists dummy;
create table dummy (  id integer primary key,  name varchar(255),  fkey_id integer  )
;
insert into dummy (id, name, fkey_id) values (502163,'3Br/3Ba Pool 
Villa in Westin St. John, USVI- Summer 2008',500100);
insert into dummy (id, name, fkey_id) values (502164,'3Br/3Ba Pool 
Villa in Westin St. John, USVI- Summer 2008',500100);
insert into dummy (id, name, fkey_id) values (502169,'Lazy Bear 
Lodge',105);
-- not sequential id to previous
insert into dummy (id, name, fkey_id) values (502199,'Lazy Bear 
Lodge',105);
insert into dummy (id, name, fkey_id) values (502170,'3 Bed, 1 Bath 
Cottage Less Than a Mile from West Dennis Beach',500089);
insert into dummy (id, name, fkey_id) values (502171,'3 Bed, 1 Bath 
Cottage Less Than a Mile from West Dennis Beach',500089);
-- not sequential id nor duplicate fkey_id to previous
insert into dummy (id, name, fkey_id) values (502175,'3 Bed, 1 Bath 
Cottage Less Than a Mile from West Dennis Beach',500102);
insert into dummy (id, name, fkey_id) values (502213,'Sea 
Watch',500128);
-- not duplicate fkey_id to previous
insert into dummy (id, name, fkey_id) values (502214,'Sea 
Watch',500130);

Find all instances where * name is duplicated * fkey_id is the same (for the any set of duplicated name fields) * id is
sequential(for any set of duplicated name fields)
 

The system should return

502163
502164
502170
502171

Here's as far as I got:

select id
from dummy
where
name in (  select name from dummy  group by name  having count(name)>1
)
order by id

I can't figure out how to test for duplicate fkey_id when name is the 
same, nor to test for sequential id's when name is the same.

Having a method for either would be great, and both would be a bonus!

It seems like there's a clever way to do this without cursors but I 
can't figure it out!

Thanks for any help!

Steve



Re: Finding sequential records

From
"Richard Broersma"
Date:
On Fri, Sep 26, 2008 at 10:39 AM, Steve Midgley <science@misuse.org> wrote:
> drop table if exists dummy;
> create table dummy (
>  id integer primary key,
>  name varchar(255),
>  fkey_id integer
>  )
> ;

> The system should return
>
> 502163
> 502164
> 502170
> 502171


--first get all of the duplicated ids
SELECT id    FROM Dummy
GROUP BY name, fkey_id


--Next from this list find check to see if there are any sibling
immediate above or below it.

SELECT A.* FROM ( SELECT ID                 FROM Dummy            GROUP BY name, fkey_id ) AS A
INNER JOIN Dummy AS D             ON A.id - 1 = D.id             OR A.id + 1 = D.id;

-- 
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug


Re: Finding sequential records

From
"Oliveiros Cristina"
Date:
Can this be what you need?

Best,
Oliveiros

SELECT id
FROM dummy a
NATURAL JOIN 
(
SELECT fkey_id,name
FROM dummy
GROUP BY fkey_id,name
HAVING COUNT(*) > 1 
AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - MIN(id) + 1) / 2
) b
ORDER BY id;


----- Original Message ----- 
From: "Steve Midgley" <science@misuse.org>
To: <pgsql-sql@postgresql.org>
Sent: Friday, September 26, 2008 6:39 PM
Subject: [SQL] Finding sequential records


> Hi,
> 
> I've been kicking this around today and I can't think of a way to solve 
> my problem in "pure SQL" (i.e. I can only do it with a 
> looping/cursor-type solution and some variables).
> 
> Given a table with this DDL/data script:
> 
> drop table if exists dummy;
> create table dummy (
>   id integer primary key,
>   name varchar(255),
>   fkey_id integer
>   )
> ;
> insert into dummy (id, name, fkey_id) values (502163,'3Br/3Ba Pool 
> Villa in Westin St. John, USVI- Summer 2008',500100);
> insert into dummy (id, name, fkey_id) values (502164,'3Br/3Ba Pool 
> Villa in Westin St. John, USVI- Summer 2008',500100);
> insert into dummy (id, name, fkey_id) values (502169,'Lazy Bear 
> Lodge',105);
> -- not sequential id to previous
> insert into dummy (id, name, fkey_id) values (502199,'Lazy Bear 
> Lodge',105);
> insert into dummy (id, name, fkey_id) values (502170,'3 Bed, 1 Bath 
> Cottage Less Than a Mile from West Dennis Beach',500089);
> insert into dummy (id, name, fkey_id) values (502171,'3 Bed, 1 Bath 
> Cottage Less Than a Mile from West Dennis Beach',500089);
> -- not sequential id nor duplicate fkey_id to previous
> insert into dummy (id, name, fkey_id) values (502175,'3 Bed, 1 Bath 
> Cottage Less Than a Mile from West Dennis Beach',500102);
> insert into dummy (id, name, fkey_id) values (502213,'Sea 
> Watch',500128);
> -- not duplicate fkey_id to previous
> insert into dummy (id, name, fkey_id) values (502214,'Sea 
> Watch',500130);
> 
> Find all instances where
>  * name is duplicated
>  * fkey_id is the same (for the any set of duplicated name fields)
>  * id is sequential (for any set of duplicated name fields)
> 
> The system should return
> 
> 502163
> 502164
> 502170
> 502171
> 
> Here's as far as I got:
> 
> select id
> from dummy
> where
> name in (
>   select name from dummy
>   group by name
>   having count(name)>1
> )
> order by id
> 
> I can't figure out how to test for duplicate fkey_id when name is the 
> same, nor to test for sequential id's when name is the same.
> 
> Having a method for either would be great, and both would be a bonus!
> 
> It seems like there's a clever way to do this without cursors but I 
> can't figure it out!
> 
> Thanks for any help!
> 
> Steve
> 
> 
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql


Re: Finding sequential records

From
"Richard Broersma"
Date:
oops I noticed I forgot the having clause:

>  SELECT id
>     FROM Dummy
> GROUP BY name, fkey_id
Having count(*) > 1;


> SELECT A.*
>  FROM ( SELECT ID
>                  FROM Dummy
>             GROUP BY name, fkey_id              HAVING count(*) > 1 ) AS A
> INNER JOIN Dummy AS D
>              ON A.id - 1 = D.id
>              OR A.id + 1 = D.id;


-- 
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug


Re: Finding sequential records

From
Steve Midgley
Date:
Wow.

Thanks to both Richard and Oliveiros.

Out of the box Oliveiros' solution does what I want but I don't 
understand why!

>SELECT id
>FROM dummy a
>NATURAL JOIN (
>SELECT fkey_id,name
>FROM dummy
>GROUP BY fkey_id,name
>HAVING COUNT(*) > 1 AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - 
>MIN(id) + 1) / 2
>) b
>ORDER BY id;

What's going on here with the sum(id) equaling the average product of 
the min and max? I gather that's to match id's with id's that are one 
bigger than itself? Can anyone clarify how that is working?

Richard's sql is very interesting to me in concept - but it's not 
getting me the results correctly:

>SELECT A.*
>   FROM ( SELECT ID
>                   FROM Dummy
>              GROUP BY name, fkey_id ) AS A
>INNER JOIN Dummy AS D
>               ON A.id - 1 = D.id
>               OR A.id + 1 = D.id;

This returns an error:

ERROR: column "dummy.id" must appear in the GROUP BY clause or be used 
in an aggregate function
SQL state: 42803

I'm not sure how to setup that "from select" to produce id's without 
adding id to the group by (which would cause the query to return too 
many rows). Perhaps a natural join like in Oliveiros' sql would do the 
job?

Thanks for any advice on either of these solutions. I'm going to learn 
a lot here if someone can pound it into my head.

Thanks,

Steve

It seems to be returning any records that have sequential id's 
regardless
At 11:02 AM 9/26/2008, Richard Broersma wrote:
>On Fri, Sep 26, 2008 at 10:39 AM, Steve Midgley <science@misuse.org> 
>wrote:
> > drop table if exists dummy;
> > create table dummy (
> >  id integer primary key,
> >  name varchar(255),
> >  fkey_id integer
> >  )
> > ;
>
> > The system should return
> >
> > 502163
> > 502164
> > 502170
> > 502171
>
>
>--first get all of the duplicated ids
>
>  SELECT id
>      FROM Dummy
>GROUP BY name, fkey_id
>
>
>--Next from this list find check to see if there are any sibling
>immediate above or below it.
>
>SELECT A.*
>   FROM ( SELECT ID
>                   FROM Dummy
>              GROUP BY name, fkey_id ) AS A
>INNER JOIN Dummy AS D
>               ON A.id - 1 = D.id
>               OR A.id + 1 = D.id;
>
>--
>Regards,
>Richard Broersma Jr.
>
>Visit the Los Angeles PostgreSQL Users Group (LAPUG)
>http://pugs.postgresql.org/lapug



Re: Finding sequential records

From
"Oliveiros Cristina"
Date:
<div dir="ltr">Howdy, Steve.<br /><br /> SELECT id<br /> FROM dummy a<br /> NATURAL JOIN (<br /> SELECT fkey_id,name<br
/>FROM dummy<br /> GROUP BY fkey_id,name<br /> HAVING COUNT(*) > 1 AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) -
MIN(id)+ 1) / 2<br /> ) b<br /> ORDER BY id;<br /><br /><br />The GROUP BY clause is to associate records that have the
samefkey_id and name<br />The COUNT(*) > 1 eliminates the situations when there is just one.<br />Now, about the
equality,now i am thinking and maybe it is a bazooka to kill a fly. :)<br /> In your table you just have duplicates? Or
youmay have triplicates? And quadruplicates? And in general n-uplicates? At the time, I thought you might have
n-uplicates,so I designed the query to be as general as possible to handle all that cases, from which duplicates are a
particularcase, but now i am wondering if you don't have more than duplicates.<br /><br />Well, anyway the idea is as
follows<br/>The sum of a sequence is given by first + last / 2 * n, with n = last - first + 1, OK ?<br /><br />So, if
theset of ids is sequencial, its sum must equal that expression. It's basically that.<br /><br />But I am now wondering
now that I might have misunderstood what your requests were...<br /><br />If you just have duplicates, then maybe it is
cleanerto substitute that clause by something simpler, like MAX(id) - MIN(id) = 1 <br /><br />I dunno if I fully
answeredyour questions, but if I didn't feel free to ask<br /><br /><br />Best, Oliveiros<br /><br /><div
class="gmail_quote"><br/><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt
0pt0pt 0.8ex; padding-left: 1ex;"><br /><br /></blockquote></div><br clear="all" /><br />-- <br />We are going to have
peaceeven if we have to fight for it. - General Dwight D. Eisenhower<br /><br />Teremos paz, nem que tenhamos de lutar
porela<br />- General Dwight D. Eisenhower<br /></div> 

Re: Finding sequential records

From
"Richard Broersma"
Date:
On Fri, Sep 26, 2008 at 3:25 PM, Steve Midgley <science@misuse.org> wrote:

> This returns an error:
>
> ERROR: column "dummy.id" must appear in the GROUP BY clause or be used in an
> aggregate function
> SQL state: 42803

Oops that what I get for trying air code :(

This works instead:

SELECT D1.*     FROM Dummy AS D1
INNER JOIN Dummy AS D2       ON (D1.name,D1.fkey_id)=(D2.name,D2.fkey_id)      AND (D1.id = D2.id + 1 OR D1.id = D2.id
-1 )
 
ORDER BY D1.id;


-- 
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug


Re: Finding sequential records

From
Steve Midgley
Date:
At 05:38 PM 9/26/2008, Oliveiros Cristina wrote:
>In-Reply-To: <20080926222618.4DD3664FC01@postgresql.org>
>References: <20080926173921.EFDA164FC00@postgresql.org>
>         <396486430809261102j73869b8es6b325621bcfe1ea6@mail.gmail.com>
>         <20080926222618.4DD3664FC01@postgresql.org>
>Howdy, Steve.
>
>SELECT id
>FROM dummy a
>NATURAL JOIN (
>SELECT fkey_id,name
>FROM dummy
>GROUP BY fkey_id,name
>HAVING COUNT(*) > 1 AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - 
>MIN(id) + 1) / 2
>) b
>ORDER BY id;
>
>In your table you just have duplicates? Or you may have triplicates? 
>And quadruplicates? And in general n-uplicates? At the time, I thought 
>you might have n-uplicates, so I designed the query to be as general 
>as possible to handle all that cases, from which duplicates are a 
>particular case, but now i am wondering if you don't have more than 
>duplicates.

In my specific case it turns out I only had duplicates, but there could 
have been n-plicates, so your code is still correct for my use-case 
(though I didn't say that in my OP).

>Well, anyway the idea is as follows
>The sum of a sequence is given by first + last / 2 * n, with n = last 
>- first + 1, OK ?

I *love* your application of that formula. It's rare for me to be able 
to use "real" math in SQL, so this was a pleasure to read (and 
understand!)

Thanks again to Richard and Oliveiros for a truly educating experience! 
I hope some others were similarly enlightened.

With gratitude,

Steve



Re: Finding sequential records

From
"Richard Broersma"
Date:
On Mon, Sep 29, 2008 at 7:48 PM, Steve Midgley <science@misuse.org> wrote:

> In my specific case it turns out I only had duplicates, but there could have
> been n-plicates, so your code is still correct for my use-case (though I
> didn't say that in my OP).

Ya there are a lot of neat queries that you can construct.  If you
have a good background in math and set theory (which I don't have) you
can develop all sorts of powerful analysis queries.

On a side note, I thought that I should mention that unwanted
duplicates are an example where some ~have gotten bitten~ with a
purely surrogate key approach.  To make matter worse, is when some
users update part of one duplicate and another updates a different
duplicated on a another field(s).  Then once the designer discovers
the duplicate problem, she/he has to figure out some way of merging
these non-exact duplicates.  So even if the designer has no intention
of implementing natural primary/foreign keys, he/she will still
benefit from a natural key consideration in that a strategy can be
designed to prevent getting bitten by duplicated data.

I only mention this because db designers get bitten by this all the
time.  Well at least the ones that subscribe to www.utteraccess.com
get bitten.  From what I've seen not one day has gone by without
someone posting a question to this site about how to both find and
remove all but one of the duplicates.

-- 
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug


Re: Finding sequential records

From
Steve Midgley
Date:
At 09:50 PM 9/29/2008, Richard Broersma wrote:
>On Mon, Sep 29, 2008 at 7:48 PM, Steve Midgley <science@misuse.org> 
>wrote:
>
> > In my specific case it turns out I only had duplicates, but there 
> could have
> > been n-plicates, so your code is still correct for my use-case 
> (though I
> > didn't say that in my OP).
>
>Ya there are a lot of neat queries that you can construct.  If you
>have a good background in math and set theory (which I don't have) you
>can develop all sorts of powerful analysis queries.
>
>On a side note, I thought that I should mention that unwanted
>duplicates are an example where some ~have gotten bitten~ with a
>purely surrogate key approach.  To make matter worse, is when some
>users update part of one duplicate and another updates a different
>duplicated on a another field(s).  Then once the designer discovers
>the duplicate problem, she/he has to figure out some way of merging
>these non-exact duplicates.  So even if the designer has no intention
>of implementing natural primary/foreign keys, he/she will still
>benefit from a natural key consideration in that a strategy can be
>designed to prevent getting bitten by duplicated data.
>
>I only mention this because db designers get bitten by this all the
>time.  Well at least the ones that subscribe to www.utteraccess.com
>get bitten.  From what I've seen not one day has gone by without
>someone posting a question to this site about how to both find and
>remove all but one of the duplicates.

Truly. I have worked with some school districts around the US and this 
duplicate record problem is more than theoretical. Some of the 
gnarliest, dirtiest, n-plicate data I've ever seen comes out of the US 
public education system.

More generally where I have seen a need for natural keys, I've always 
taken the "best of both worlds" approach. So I always stick an 
integer/serial PK into any table - why not - they're cheap and 
sometimes are handy. And then for tables along the lines of your 
description, I add a compound unique index which serves the business 
rule of "no dupes along these lines."

Am I following your point? Any reason why using serial PK's with 
"compound natural unique indices" is better/worse than just using 
natural PK's?

Steve



Re: Finding sequential records

From
"Richard Broersma"
Date:
On Mon, Sep 29, 2008 at 11:05 PM, Steve Midgley <science@misuse.org> wrote:

> Any reason why using serial PK's with "compound
> natural unique indices" is better/worse than just using natural PK's?

Not really, surrogate keys will always work well so long as unwanted
duplicates are constrained.  Surrogate Keys will allow as much
flexibility and versatility as is possible.  However as the case of
unwanted duplicates illustrates, sometimes flexibility and versatility
is always wanted.   So if flexibility and versatility is valued less
than the ability to add inflexible constraints (beyond preventing
duplicates) across multiple relationships, then using natural primary
keys becomes very attractive.


-- 
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug