Thread: Need help: Find dirty rows, Update, Delete SQL
Hello, I need a bit of help with some SQL. I have two tables, call them Page and Bookmark. Each row in Page can have many Bookmarks pointing to it, and they are joined via a FK (Page.id = Bookmark.page_id). Page has a 'url' column: Page.url, which has a unique index on it. My Page.url column got a little dirty, and I need to clean it up, and that's what I need help with. Here is an example of dirtiness: Page: id=1 url = 'http://example.com/' id=2 url = 'http://example.com/#' -- dirty id=3 url = 'http://example.com/#foo' -- dirty The last two rows are dirty. Normally I normalize URLs before inserting them, but these got in, and now I need to clean them. The problem is that rows in Bookmark table may point to dirty rows in Page, so I can't just remove the dirty rows, and I can't just update 'url' column in Page to 'http://example.com/', because that column is unique. Is there some fancy SQL that I can use them to find the dirty rows in page (... where url like '%#%') and then find rows in Bookmark table that point to them, then point those rows to good rows in Page (e.g. id=1 row above), and finally remove the dirty rows from Page? Any help would be greatly appreciated. I'm using Pg 8.0.3 Thanks, Otis
Am Samstag, 18. Februar 2006 18:41 schrieb ogjunk-pgjedan@yahoo.com: > Hello, > > I need a bit of help with some SQL. > I have two tables, call them Page and Bookmark. > Each row in Page can have many Bookmarks pointing to it, and > they are joined via a FK (Page.id = Bookmark.page_id). > > Page has a 'url' column: Page.url, which has a unique index on it. > > My Page.url column got a little dirty, and I need to clean it up, > and that's what I need help with. > > Here is an example of dirtiness: > > Page: > > id=1 url = 'http://example.com/' > id=2 url = 'http://example.com/#' -- dirty > id=3 url = 'http://example.com/#foo' -- dirty > > The last two rows are dirty. Normally I normalize URLs before > inserting them, but these got in, and now I need to clean them. > > The problem is that rows in Bookmark table may point to dirty > rows in Page, so I can't just remove the dirty rows, and I can't > just update 'url' column in Page to 'http://example.com/', > because that column is unique. > > Is there some fancy SQL that I can use them to find the dirty > rows in page (... where url like '%#%') and then find rows in > Bookmark table that point to them, then point those rows to > good rows in Page (e.g. id=1 row above), and finally remove the > dirty rows from Page? try this. But please check if it really does its job. I just wrote it down in a minute or two. There will be an easier way or nicer written SQL but a sit is just a one time operation you shoudn't care too much. One more hint: you should add a CHECK clause to your page_url like "page_url text NOT NULL UNIQUE CHECK (page_url !~ '#')" here is my test code CREATE TABLE pages ( page_id SERIAL PRIMARY KEY, page_url text NOT NULL UNIQUE ); CREATE TABLE bookmarks ( bm_id SERIAL PRIMARY KEY, bm_text text not null, page_id int4 NOT NULL REFERENCES pages (page_id) ); INSERT INTO pages (page_url) VALUES ('http://example.com/'); INSERT INTO pages (page_url) VALUES ('http://example.com/#'); INSERT INTO pages (page_url) VALUES ('http://example.com/#foo'); INSERT INTO pages (page_url) VALUES ('http://example2.com/#foo'); INSERT INTO pages (page_url) VALUES ('http://example3.com/#foobar'); insert into bookmarks (bm_text, page_id) values ('test1', 1); insert into bookmarks (bm_text, page_id) values ('test2', 1); insert into bookmarks (bm_text, page_id) values ('test3', 2); insert into bookmarks (bm_text, page_id) values ('test4', 2); insert into bookmarks (bm_text, page_id) values ('test5', 3); insert into bookmarks (bm_text, page_id) values ('test6', 3); insert into bookmarks (bm_text, page_id) values ('test7', 4); BEGIN; UPDATE bookmarks set page_id = pages2.page_id FROM pages AS pages1, pages AS pages2 WHERE pages1.page_id = bookmarks.page_id AND pages2.page_url = split_part(pages1.page_url, '#', 1) ; DELETE FROM pages WHERE page_id IN ( SELECT pages1.page_id FROM pages AS pages1 JOIN pages AS pages2 ON ( pages1.page_id!= pages2.page_id AND pages2.page_url = split_part(pages1.page_url, '#', 1) ) WHERE position('#' inpages1.page_url) > 0 AND pages1.page_id NOT IN (SELECT page_id FROM bookmarks) ); ; UPDATE pages SET page_url = split_part(page_url, '#', 1) WHERE position('#' in pages.page_url) > 0 ; select * from bookmarks; select * from pages; COMMIT; kind regards, janning
Janning Vygen wrote: >Am Samstag, 18. Februar 2006 18:41 schrieb ogjunk-pgjedan@yahoo.com: > > >>Hello, >> >>I need a bit of help with some SQL. >>I have two tables, call them Page and Bookmark. >>Each row in Page can have many Bookmarks pointing to it, and >>they are joined via a FK (Page.id = Bookmark.page_id). >> >>Page has a 'url' column: Page.url, which has a unique index on it. >> >>My Page.url column got a little dirty, and I need to clean it up, >>and that's what I need help with. >> >>Here is an example of dirtiness: >> >>Page: >> >>id=1 url = 'http://example.com/' >>id=2 url = 'http://example.com/#' -- dirty >>id=3 url = 'http://example.com/#foo' -- dirty >> >>The last two rows are dirty. Normally I normalize URLs before >>inserting them, but these got in, and now I need to clean them. >> >>The problem is that rows in Bookmark table may point to dirty >>rows in Page, so I can't just remove the dirty rows, and I can't >>just update 'url' column in Page to 'http://example.com/', >>because that column is unique. >> >>Is there some fancy SQL that I can use them to find the dirty >>rows in page (... where url like '%#%') and then find rows in >>Bookmark table that point to them, then point those rows to >>good rows in Page (e.g. id=1 row above), and finally remove the >>dirty rows from Page? >> >> > >try this. But please check if it really does its job. I just wrote it down in >a minute or two. There will be an easier way or nicer written SQL but a sit >is just a one time operation you shoudn't care too much. One more hint: you >should add a CHECK clause to your page_url like "page_url text NOT NULL >UNIQUE CHECK (page_url !~ '#')" > >here is my test code > >CREATE TABLE pages ( > page_id SERIAL PRIMARY KEY, > page_url text NOT NULL UNIQUE >); > >CREATE TABLE bookmarks ( > bm_id SERIAL PRIMARY KEY, > bm_text text not null, > page_id int4 NOT NULL REFERENCES pages (page_id) >); > >INSERT INTO pages (page_url) VALUES ('http://example.com/'); >INSERT INTO pages (page_url) VALUES ('http://example.com/#'); >INSERT INTO pages (page_url) VALUES ('http://example.com/#foo'); >INSERT INTO pages (page_url) VALUES ('http://example2.com/#foo'); >INSERT INTO pages (page_url) VALUES ('http://example3.com/#foobar'); > >insert into bookmarks (bm_text, page_id) values ('test1', 1); >insert into bookmarks (bm_text, page_id) values ('test2', 1); >insert into bookmarks (bm_text, page_id) values ('test3', 2); >insert into bookmarks (bm_text, page_id) values ('test4', 2); >insert into bookmarks (bm_text, page_id) values ('test5', 3); >insert into bookmarks (bm_text, page_id) values ('test6', 3); >insert into bookmarks (bm_text, page_id) values ('test7', 4); > >BEGIN; >UPDATE bookmarks set page_id = pages2.page_id >FROM > pages AS pages1, > pages AS pages2 >WHERE > pages1.page_id = bookmarks.page_id > AND pages2.page_url = split_part(pages1.page_url, '#', 1) >; > >DELETE FROM pages WHERE page_id IN ( > SELECT > pages1.page_id > FROM > pages AS pages1 > JOIN pages AS pages2 ON ( > pages1.page_id != pages2.page_id > AND pages2.page_url = split_part(pages1.page_url, '#', 1) > ) > WHERE position('#' in pages1.page_url) > 0 > AND pages1.page_id NOT IN (SELECT page_id FROM bookmarks) >); >; > >UPDATE pages SET page_url = split_part(page_url, '#', 1) >WHERE position('#' in pages.page_url) > 0 >; >select * from bookmarks; >select * from pages; >COMMIT; > > >kind regards, >janning > >---------------------------(end of broadcast)--------------------------- >TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > > > why do you consider as dirty perfectly honest URLs as http://example.com#foo ? Such a construct points to a specific part (foo) of a specific document (http://example.com)