Thread: Finding sequential records
Hi,

I've been kicking this around today and I can't think of a way to solve my problem in "pure SQL" (i.e. I can only do it with a looping/cursor-type solution and some variables).

Given a table with this DDL/data script:

drop table if exists dummy;
create table dummy (
  id integer primary key,
  name varchar(255),
  fkey_id integer
);
insert into dummy (id, name, fkey_id) values (502163,'3Br/3Ba Pool Villa in Westin St. John, USVI- Summer 2008',500100);
insert into dummy (id, name, fkey_id) values (502164,'3Br/3Ba Pool Villa in Westin St. John, USVI- Summer 2008',500100);
insert into dummy (id, name, fkey_id) values (502169,'Lazy Bear Lodge',105);
-- not sequential id to previous
insert into dummy (id, name, fkey_id) values (502199,'Lazy Bear Lodge',105);
insert into dummy (id, name, fkey_id) values (502170,'3 Bed, 1 Bath Cottage Less Than a Mile from West Dennis Beach',500089);
insert into dummy (id, name, fkey_id) values (502171,'3 Bed, 1 Bath Cottage Less Than a Mile from West Dennis Beach',500089);
-- not sequential id nor duplicate fkey_id to previous
insert into dummy (id, name, fkey_id) values (502175,'3 Bed, 1 Bath Cottage Less Than a Mile from West Dennis Beach',500102);
insert into dummy (id, name, fkey_id) values (502213,'Sea Watch',500128);
-- not duplicate fkey_id to previous
insert into dummy (id, name, fkey_id) values (502214,'Sea Watch',500130);

Find all instances where
* name is duplicated
* fkey_id is the same (for any set of duplicated name fields)
* id is sequential (for any set of duplicated name fields)

The system should return

502163
502164
502170
502171

Here's as far as I got:

select id
from dummy
where name in (
  select name from dummy
  group by name
  having count(name) > 1
)
order by id

I can't figure out how to test for duplicate fkey_id when name is the same, nor how to test for sequential id's when name is the same.

Having a method for either would be great, and both would be a bonus!

It seems like there's a clever way to do this without cursors, but I can't figure it out!

Thanks for any help!

Steve
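[A note for readers on newer PostgreSQL: window functions (added in 8.4, so not yet available when this thread ran in 2008) give a direct answer. A sketch using lag()/lead() over each (name, fkey_id) group:

SELECT id
FROM (
  SELECT id,
         lag(id)  OVER w AS prev_id,   -- previous id within the same (name, fkey_id) group
         lead(id) OVER w AS next_id    -- next id within the same group
  FROM dummy
  WINDOW w AS (PARTITION BY name, fkey_id ORDER BY id)
) t
WHERE prev_id = id - 1
   OR next_id = id + 1
ORDER BY id;

Against the sample data this returns 502163, 502164, 502170, 502171. Rows with no neighbor in their group get NULL from lag()/lead() and drop out of the WHERE clause.]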
On Fri, Sep 26, 2008 at 10:39 AM, Steve Midgley <science@misuse.org> wrote:
> drop table if exists dummy;
> create table dummy (
>   id integer primary key,
>   name varchar(255),
>   fkey_id integer
> );
>
> The system should return
>
> 502163
> 502164
> 502170
> 502171

-- first get all of the duplicated ids
SELECT id
  FROM Dummy
GROUP BY name, fkey_id

-- Next, from this list, check to see if there are any siblings immediately above or below.
SELECT A.*
  FROM ( SELECT ID
           FROM Dummy
         GROUP BY name, fkey_id ) AS A
 INNER JOIN Dummy AS D
    ON A.id - 1 = D.id
    OR A.id + 1 = D.id;

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug
Can this be what you need?

Best,
Oliveiros

SELECT id
FROM dummy a
NATURAL JOIN (
  SELECT fkey_id, name
  FROM dummy
  GROUP BY fkey_id, name
  HAVING COUNT(*) > 1
     AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - MIN(id) + 1) / 2
) b
ORDER BY id;

----- Original Message -----
From: "Steve Midgley" <science@misuse.org>
To: <pgsql-sql@postgresql.org>
Sent: Friday, September 26, 2008 6:39 PM
Subject: [SQL] Finding sequential records

> [snip - original message quoted in full at the top of the thread]
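[A quick check of the HAVING arithmetic against the sample data shows why this works:

-- "Pool Villa" group (ids 502163, 502164):
--   SUM(id) = 502163 + 502164 = 1004327
--   (MAX + MIN) * (MAX - MIN + 1) / 2 = 1004327 * 2 / 2 = 1004327   -> equal, rows kept
-- "Lazy Bear Lodge" group (ids 502169, 502199):
--   SUM(id) = 1004368
--   (MAX + MIN) * (MAX - MIN + 1) / 2 = 1004368 * 31 / 2 = 15567704 -> not equal, rows dropped

The test only passes when the group's ids form an unbroken run.]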
Oops, I noticed I forgot the HAVING clause:

SELECT id
  FROM Dummy
GROUP BY name, fkey_id
HAVING count(*) > 1;

SELECT A.*
  FROM ( SELECT ID
           FROM Dummy
         GROUP BY name, fkey_id
         HAVING count(*) > 1 ) AS A
 INNER JOIN Dummy AS D
    ON A.id - 1 = D.id
    OR A.id + 1 = D.id;

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug
Wow. Thanks to both Richard and Oliveiros. Out of the box, Oliveiros' solution does what I want, but I don't understand why!

> SELECT id
> FROM dummy a
> NATURAL JOIN (
>   SELECT fkey_id, name
>   FROM dummy
>   GROUP BY fkey_id, name
>   HAVING COUNT(*) > 1 AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - MIN(id) + 1) / 2
> ) b
> ORDER BY id;

What's going on here with the sum(id) equaling that expression built from the min and max? I gather that's to match id's with id's that are one bigger than themselves? Can anyone clarify how that is working?

Richard's sql is very interesting to me in concept, but it's not getting me the results correctly:

> SELECT A.*
>   FROM ( SELECT ID
>            FROM Dummy
>          GROUP BY name, fkey_id ) AS A
> INNER JOIN Dummy AS D
>    ON A.id - 1 = D.id
>    OR A.id + 1 = D.id;

This returns an error:

ERROR: column "dummy.id" must appear in the GROUP BY clause or be used in an aggregate function
SQL state: 42803

I'm not sure how to set up that "from select" to produce id's without adding id to the group by, which would cause the query to return too many rows - it seems to return any records that have sequential id's, regardless of name or fkey_id. Perhaps a natural join like in Oliveiros' sql would do the job?

Thanks for any advice on either of these solutions. I'm going to learn a lot here if someone can pound it into my head.

Thanks,

Steve

At 11:02 AM 9/26/2008, Richard Broersma wrote:
> [snip]
<div dir="ltr">Howdy, Steve.<br /><br /> SELECT id<br /> FROM dummy a<br /> NATURAL JOIN (<br /> SELECT fkey_id,name<br />FROM dummy<br /> GROUP BY fkey_id,name<br /> HAVING COUNT(*) > 1 AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - MIN(id)+ 1) / 2<br /> ) b<br /> ORDER BY id;<br /><br /><br />The GROUP BY clause is to associate records that have the samefkey_id and name<br />The COUNT(*) > 1 eliminates the situations when there is just one.<br />Now, about the equality,now i am thinking and maybe it is a bazooka to kill a fly. :)<br /> In your table you just have duplicates? Or youmay have triplicates? And quadruplicates? And in general n-uplicates? At the time, I thought you might have n-uplicates,so I designed the query to be as general as possible to handle all that cases, from which duplicates are a particularcase, but now i am wondering if you don't have more than duplicates.<br /><br />Well, anyway the idea is as follows<br/>The sum of a sequence is given by first + last / 2 * n, with n = last - first + 1, OK ?<br /><br />So, if theset of ids is sequencial, its sum must equal that expression. It's basically that.<br /><br />But I am now wondering now that I might have misunderstood what your requests were...<br /><br />If you just have duplicates, then maybe it is cleanerto substitute that clause by something simpler, like MAX(id) - MIN(id) = 1 <br /><br />I dunno if I fully answeredyour questions, but if I didn't feel free to ask<br /><br /><br />Best, Oliveiros<br /><br /><div class="gmail_quote"><br/><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt0pt 0.8ex; padding-left: 1ex;"><br /><br /></blockquote></div><br clear="all" /><br />-- <br />We are going to have peaceeven if we have to fight for it. - General Dwight D. Eisenhower<br /><br />Teremos paz, nem que tenhamos de lutar porela<br />- General Dwight D. Eisenhower<br /></div>
On Fri, Sep 26, 2008 at 3:25 PM, Steve Midgley <science@misuse.org> wrote:
> This returns an error:
>
> ERROR: column "dummy.id" must appear in the GROUP BY clause or be used in an
> aggregate function
> SQL state: 42803

Oops, that's what I get for trying air code :( This works instead:

SELECT D1.*
  FROM Dummy AS D1
 INNER JOIN Dummy AS D2
    ON (D1.name, D1.fkey_id) = (D2.name, D2.fkey_id)
   AND (D1.id = D2.id + 1 OR D1.id = D2.id - 1)
ORDER BY D1.id;

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug
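[A side note on this self-join: with triplicates (say ids n, n+1, n+2 sharing a name and fkey_id), the middle row matches both neighbors and would appear twice in the output. Adding DISTINCT guards against that:

SELECT DISTINCT D1.*
  FROM Dummy AS D1
 INNER JOIN Dummy AS D2
    ON (D1.name, D1.fkey_id) = (D2.name, D2.fkey_id)
   AND (D1.id = D2.id + 1 OR D1.id = D2.id - 1)
ORDER BY D1.id;]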
At 05:38 PM 9/26/2008, Oliveiros Cristina wrote:
> SELECT id
> FROM dummy a
> NATURAL JOIN (
>   SELECT fkey_id, name
>   FROM dummy
>   GROUP BY fkey_id, name
>   HAVING COUNT(*) > 1 AND SUM(id) = (MAX(id) + MIN(id)) * (MAX(id) - MIN(id) + 1) / 2
> ) b
> ORDER BY id;
>
> In your table do you just have duplicates? Or might you have triplicates?
> And quadruplicates? And, in general, n-uplicates? At the time, I thought
> you might have n-uplicates, so I designed the query to be as general as
> possible to handle all those cases, of which duplicates are a particular
> case, but now I am wondering if you don't have more than duplicates.

In my specific case it turns out I only had duplicates, but there could have been n-plicates, so your code is still correct for my use-case (though I didn't say that in my OP).

> Well, anyway, the idea is as follows. The sum of an arithmetic sequence
> is given by (first + last) / 2 * n, with n = last - first + 1, OK?

I *love* your application of that formula. It's rare for me to be able to use "real" math in SQL, so this was a pleasure to read (and understand!)

Thanks again to Richard and Oliveiros for a truly educating experience! I hope some others were similarly enlightened.

With gratitude,

Steve
On Mon, Sep 29, 2008 at 7:48 PM, Steve Midgley <science@misuse.org> wrote:
> In my specific case it turns out I only had duplicates, but there could have
> been n-plicates, so your code is still correct for my use-case (though I
> didn't say that in my OP).

Ya, there are a lot of neat queries that you can construct. If you have a good background in math and set theory (which I don't have), you can develop all sorts of powerful analysis queries.

On a side note, I thought I should mention that unwanted duplicates are an example of where some have gotten bitten by a purely surrogate key approach. To make matters worse, some users update part of one duplicate while other users update a different duplicate on other fields. Then, once the designer discovers the duplicate problem, she/he has to figure out some way of merging these non-exact duplicates. So even if the designer has no intention of implementing natural primary/foreign keys, he/she will still benefit from a natural key consideration, in that a strategy can be designed to prevent getting bitten by duplicated data.

I only mention this because db designers get bitten by this all the time. Well, at least the ones that subscribe to www.utteraccess.com get bitten. From what I've seen, not one day has gone by without someone posting a question to that site about how to both find and remove all but one of the duplicates.

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug
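[For what it's worth, the usual pattern for that find-and-remove-all-but-one problem, shown here against this thread's dummy table (a sketch; it keys on exact (name, fkey_id) matches rather than id adjacency, and assumes the lowest id in each group is the row to keep):

-- Remove every row that has a lower-id twin with the same name and fkey_id,
-- keeping the lowest-id row of each duplicate group.
DELETE FROM dummy d
WHERE EXISTS (
  SELECT 1
  FROM dummy k
  WHERE k.name = d.name
    AND k.fkey_id = d.fkey_id
    AND k.id < d.id
);]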
At 09:50 PM 9/29/2008, Richard Broersma wrote:
> On a side note, I thought I should mention that unwanted duplicates are
> an example of where some have gotten bitten by a purely surrogate key
> approach.
> [snip]
> I only mention this because db designers get bitten by this all the time.

Truly. I have worked with some school districts around the US, and this duplicate record problem is more than theoretical. Some of the gnarliest, dirtiest, n-plicate data I've ever seen comes out of the US public education system.

More generally, where I have seen a need for natural keys, I've always taken the "best of both worlds" approach. So I always stick an integer/serial PK into any table - why not - they're cheap and sometimes handy. And then, for tables along the lines of your description, I add a compound unique index which serves the business rule of "no dupes along these lines."

Am I following your point? Any reason why using serial PK's with "compound natural unique indices" is better/worse than just using natural PK's?

Steve
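[Concretely, the "best of both worlds" layout Steve describes would look something like this for the thread's example table (a sketch; note the thread's actual sample data contains such duplicates on purpose - this shows how you'd prevent them going forward, assuming (name, fkey_id) is the natural business key):

create table dummy (
  id serial primary key,       -- cheap surrogate key, handy for joins
  name varchar(255) not null,
  fkey_id integer not null,
  unique (name, fkey_id)       -- compound natural unique constraint: "no dupes along these lines"
);]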
On Mon, Sep 29, 2008 at 11:05 PM, Steve Midgley <science@misuse.org> wrote:
> Any reason why using serial PK's with "compound
> natural unique indices" is better/worse than just using natural PK's?

Not really; surrogate keys will always work well so long as unwanted duplicates are constrained. Surrogate keys allow as much flexibility and versatility as possible. However, as the case of unwanted duplicates illustrates, flexibility and versatility aren't always what you want. So if flexibility and versatility are valued less than the ability to add inflexible constraints (beyond preventing duplicates) across multiple relationships, then using natural primary keys becomes very attractive.

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug
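[One reading of the cross-relationship constraints Richard mentions, as a hedged sketch (the table and column names here are hypothetical, not from the thread): with a natural composite key, a child table can reference, and thus enforce agreement on, more than one attribute at a time, whereas a surrogate-key reference only pins down a single id column.

create table property (
  name varchar(255),
  fkey_id integer,
  primary key (name, fkey_id)
);

create table listing (
  listing_id serial primary key,
  name varchar(255),
  fkey_id integer,
  -- the composite foreign key forces every listing to carry a
  -- (name, fkey_id) pair that actually exists in property
  foreign key (name, fkey_id) references property (name, fkey_id)
);]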