Thread: Help: Function for splitting VARCHAR column and migrating its data to 2 new tables
I am restructuring my DB schema and need help migrating data from 1 column of an existing table to two new tables. I have some Java code that can do this for me, but it's very slow, and I am now hoping I can migrate this data with some clever SQL instead. Here are my 3 tables: user_data (existing, old table) --------- id (PK), user_id (FK) keywords VARCHAR(256) -- this contains comma separated keywords --e.g. "new york,san francisco, dallas, food" -- also "keywords without strings are really just 1 keyword" add_date TIMESTAMP So now I'm trying to migrate this "keywords" VARCHAR column to a more normalized schema: user_data_keyword (new lookup table to populate) ----------------- id (PK) -- I may change PK to PK(user_data_id, keyword_id) user_data_id (FK) keyword_id (FK) keyword (new table to populate) ------- id (PK) name VARCHAR(64) NOT NULL UNIQUE add_date TIMEZONE I just found http://www.postgresql.org/docs/current/static/functions-string.html , but if anyone could lend me a hand by getting me started with writing a function for this, I'd really appreciate it. Thanks, Otis
Re: Help: Function for splitting VARCHAR column and migrating its data to 2 new tables
From
Tony Wasson
Date:
On 5/23/05, ogjunk-pgjedan@yahoo.com <ogjunk-pgjedan@yahoo.com> wrote: > I am restructuring my DB schema and need help migrating data from 1 > column of an existing table to two new tables. I have some Java code > that can do this for me, but it's very slow, and I am now hoping I can > migrate this data with some clever SQL instead. > > Here are my 3 tables: > > user_data (existing, old table) > --------- > id (PK), > user_id (FK) > keywords VARCHAR(256) > -- this contains comma separated keywords > -- e.g. "new york,san francisco, dallas, food" > -- also "keywords without strings are really just 1 keyword" > add_date TIMESTAMP > > > So now I'm trying to migrate this "keywords" VARCHAR column to a more > normalized schema: > > user_data_keyword (new lookup table to populate) > ----------------- > id (PK) -- I may change PK to PK(user_data_id, keyword_id) > user_data_id (FK) > keyword_id (FK) > > > keyword (new table to populate) > ------- > id (PK) > name VARCHAR(64) NOT NULL UNIQUE > add_date TIMEZONE > > > I just found > http://www.postgresql.org/docs/current/static/functions-string.html , > but if anyone could lend me a hand by getting me started with writing a > function for this, I'd really appreciate it. > > Thanks, > Otis > I am not aware of any extremely clever SQL to make this ALL happen. However you can do everything with pl/pgsql. My contribution below will build the keyword table for you. Once you have this working and you understand it, you could extend it to build your user_data_keywordtable. CREATE OR REPLACE FUNCTION split_on_commas(TEXT) RETURNS SETOF TEXT LANGUAGE 'plpgsql' AS ' DECLARE mystring ALIAS FOR $1; incomma BOOLEAN := FALSE; -- ## Catch the first word endpos INTEGER; -- endcharacter startpos INTEGER := 0; pos INTEGER; outstring TEXT; BEGIN SELECT INTO endpos CHAR_LENGTH(mystring); FOR pos IN 1 .. endpos LOOP -- There are 2 single quotes, a space,and 2 single quotes below: IF SUBSTRING(mystring,pos,1) = '','' THEN incomma := TRUE; outstring:= SUBSTRING(mystring,startpos,pos-startpos); RETURN NEXT outstring; ELSE IF incomma IS TRUETHEN incomma := FALSE; startpos := pos; END IF; END IF; END LOOP; IF incommaIS FALSE THEN -- ## Catch the last phrase outstring := SUBSTRING(mystring,startpos,endpos); RETURN NEXToutstring; END IF; RETURN; END; '; -- try it! You will get a set of keyword all split out SELECT * FROM split_on_commas('foo1,bar2,foo3'); From here, we need a migration function that will loop through row(s) in your table. I also noticed that some of your words may have leading and trailing spaces. The function drop_first_and_last_space should take care of that. Notice that I use this function below in the build_keyword_table function. CREATE OR REPLACE FUNCTION drop_first_and_last_space(TEXT) RETURNS TEXT LANGUAGE 'plpgsql' AS ' DECLARE in_string ALIAS FOR $1; out_msg TEXT; BEGIN out_msg := in_string; IF SUBSTRING(out_msg,1,1) = '' '' THEN RAISE NOTICE ''dropping leading space''; out_msg := substring(out_msg,2,length(out_msg)); END IF; --and the last space IF SUBSTRING(out_msg,length(out_msg),1)= '' '' THEN RAISE NOTICE ''dropping trailing space''; out_msg := substring(out_msg,1,length(out_msg)-1); END IF; RETURN out_msg; END '; -- This function uses the INSERT ((SELECT)EXCEPT(SELECT)); syntax mentioned in http://www.varlena.com/varlena/GeneralBits/19.htm. Note that I am unsure why I didn't need to use FOR r in EXECUTE ''SELECT... syntax. But this did work in my small scale test. CREATE OR REPLACE FUNCTION build_keyword_table(TEXT) RETURNS TEXT LANGUAGE 'plpgsql' AS ' DECLARE match ALIAS FOR $1; rec RECORD; r RECORD; out TEXT := ''done''; BEGIN FOR rec IN SELECT keywords FROM user_data WHERE id SIMILAR TO match ORDER BY id LOOP RAISE NOTICE'' working on %'',rec.keywords; FOR r IN SELECT drop_first_and_last_space(split_on_commas) AS kw FROM split_on_commas(rec.keywords) LOOP RAISE NOTICE ''trying to insert - %'',r.kw; INSERT INTOkeyword (name) ( (SELECT r.kw) EXCEPT (SELECT name FROM keyword WHERE name=r.kw)); END LOOP; END LOOP; RETURN out; END; '; Then try it all together, like so: SELECT build_keyword_table('%'); The input goes to a SIMILAR TO on id - and % means all elements. You can match a single row by using something like SELECT build_keyword_table('123'); I also used this as my test data... It worked for me! CREATE TABLE user_data (id SERIAL,user_id INTEGER,keywords VARCHAR(256) NOT NULL,add_date TIMESTAMP,PRIMARYKEY(id) ); INSERT INTO user_data (keywords) VALUES ('new york,san francisco, dallas, food'); INSERT INTO user_data (keywords) VALUES ('phoenix, hot, summer, fun'); CREATE TABLE keyword (name VARCHAR(64) NOT NULL,id SERIAL,add_date TIMESTAMP,PRIMARY KEY(name) ); -- todo put a UNIQUE INDEX on keyword (id) I hope this helps. Tony Wasson