Thread: Help: Function for splitting VARCHAR column and migrating its data to 2 new tables

I am restructuring my DB schema and need help migrating data from 1
column of an existing table to two new tables.  I have some Java code
that can do this for me, but it's very slow, and I am now hoping I can
migrate this data with some clever SQL instead.

Here are my 3 tables:

user_data (existing, old table)
--------- id           (PK), user_id      (FK) keywords     VARCHAR(256)   -- this contains comma separated keywords
--e.g. "new york,san francisco, dallas, food"   -- also "keywords without strings are really just 1 keyword" add_date
 TIMESTAMP
 


So now I'm trying to migrate this "keywords" VARCHAR column to a more
normalized schema:

user_data_keyword (new lookup table to populate)
----------------- id           (PK) -- I may change PK to PK(user_data_id, keyword_id) user_data_id (FK) keyword_id
(FK)


keyword (new table to populate)
------- id           (PK) name         VARCHAR(64) NOT NULL UNIQUE add_date     TIMEZONE


I just found
http://www.postgresql.org/docs/current/static/functions-string.html ,
but if anyone could lend me a hand by getting me started with writing a
function for this, I'd really appreciate it.

Thanks,
Otis



On 5/23/05, ogjunk-pgjedan@yahoo.com <ogjunk-pgjedan@yahoo.com> wrote:
> I am restructuring my DB schema and need help migrating data from 1
> column of an existing table to two new tables.  I have some Java code
> that can do this for me, but it's very slow, and I am now hoping I can
> migrate this data with some clever SQL instead.
>
> Here are my 3 tables:
>
> user_data (existing, old table)
> ---------
>   id           (PK),
>   user_id      (FK)
>   keywords     VARCHAR(256)
>     -- this contains comma separated keywords
>     -- e.g. "new york,san francisco, dallas, food"
>     -- also "keywords without strings are really just 1 keyword"
>   add_date     TIMESTAMP
>
>
> So now I'm trying to migrate this "keywords" VARCHAR column to a more
> normalized schema:
>
> user_data_keyword (new lookup table to populate)
> -----------------
>   id           (PK) -- I may change PK to PK(user_data_id, keyword_id)
>   user_data_id (FK)
>   keyword_id   (FK)
>
>
> keyword (new table to populate)
> -------
>   id           (PK)
>   name         VARCHAR(64) NOT NULL UNIQUE
>   add_date     TIMEZONE
>
>
> I just found
> http://www.postgresql.org/docs/current/static/functions-string.html ,
> but if anyone could lend me a hand by getting me started with writing a
> function for this, I'd really appreciate it.
>
> Thanks,
> Otis
>

I am not aware of any extremely clever SQL to make this ALL happen.
However you can do everything with pl/pgsql. My contribution below
will build the keyword table for you. Once you have this working and
you understand it, you could extend it to build your user_data_keywordtable.

CREATE OR REPLACE FUNCTION split_on_commas(TEXT) RETURNS SETOF TEXT
LANGUAGE 'plpgsql' AS '
DECLARE  mystring  ALIAS FOR $1;  incomma   BOOLEAN := FALSE;       -- ## Catch the first word  endpos    INTEGER; --
endcharacter  startpos  INTEGER := 0;  pos       INTEGER;  outstring TEXT; 
BEGIN  SELECT INTO endpos CHAR_LENGTH(mystring);  FOR pos IN 1 .. endpos  LOOP      -- There are 2 single quotes, a
space,and 2 single quotes below:      IF SUBSTRING(mystring,pos,1) =  '','' THEN          incomma := TRUE;
outstring:= SUBSTRING(mystring,startpos,pos-startpos);          RETURN NEXT outstring;      ELSE          IF incomma IS
TRUETHEN              incomma := FALSE;              startpos := pos;          END IF;      END IF;  END LOOP;  IF
incommaIS FALSE THEN     -- ## Catch the last phrase      outstring := SUBSTRING(mystring,startpos,endpos);      RETURN
NEXToutstring;  END IF; 
RETURN;
END;
';

-- try it! You will get a set of keyword all split out
SELECT * FROM split_on_commas('foo1,bar2,foo3');
From here, we need a migration function that will loop through row(s)
in your table. I also noticed that some of your words may have leading
and trailing spaces. The function drop_first_and_last_space should
take care of that. Notice that I use this function below in the
build_keyword_table function.

CREATE OR REPLACE FUNCTION drop_first_and_last_space(TEXT) RETURNS
TEXT LANGUAGE 'plpgsql' AS '
DECLARE  in_string ALIAS FOR $1;  out_msg TEXT;
BEGIN  out_msg := in_string;  IF SUBSTRING(out_msg,1,1) =  '' '' THEN      RAISE NOTICE ''dropping leading space'';
out_msg := substring(out_msg,2,length(out_msg));  END IF;  --and the last space  IF
SUBSTRING(out_msg,length(out_msg),1)=  '' '' THEN      RAISE NOTICE ''dropping trailing space'';      out_msg :=
substring(out_msg,1,length(out_msg)-1); END IF; 
RETURN out_msg;
END
';

-- This function uses the INSERT ((SELECT)EXCEPT(SELECT)); syntax
mentioned in http://www.varlena.com/varlena/GeneralBits/19.htm. Note
that I am unsure why I didn't need to use FOR r in EXECUTE ''SELECT...
syntax. But this did work in my small scale test.

CREATE OR REPLACE FUNCTION build_keyword_table(TEXT) RETURNS TEXT
LANGUAGE 'plpgsql' AS '
DECLARE  match   ALIAS FOR $1;  rec     RECORD;  r       RECORD;  out     TEXT := ''done'';
BEGIN  FOR rec IN SELECT keywords      FROM user_data      WHERE id SIMILAR TO match      ORDER BY id  LOOP      RAISE
NOTICE'' working on %'',rec.keywords;      FOR r IN SELECT drop_first_and_last_space(split_on_commas) AS 
kw FROM split_on_commas(rec.keywords)      LOOP          RAISE NOTICE ''trying to insert - %'',r.kw;          INSERT
INTOkeyword (name) ( (SELECT r.kw) EXCEPT (SELECT 
name FROM keyword WHERE name=r.kw));      END LOOP;  END LOOP;
RETURN out;
END;
';

Then try it all together, like so:
SELECT build_keyword_table('%');
The input goes to a SIMILAR TO on id - and % means all elements. You
can match a single row by using something like SELECT
build_keyword_table('123');

I also used this as my test data... It worked for me!

CREATE TABLE user_data (id           SERIAL,user_id      INTEGER,keywords     VARCHAR(256) NOT NULL,add_date
TIMESTAMP,PRIMARYKEY(id) 
);

INSERT INTO user_data (keywords) VALUES ('new york,san francisco,
dallas, food');
INSERT INTO user_data (keywords) VALUES ('phoenix, hot, summer, fun');

CREATE TABLE keyword (name         VARCHAR(64) NOT NULL,id           SERIAL,add_date     TIMESTAMP,PRIMARY KEY(name)
);
-- todo put a UNIQUE INDEX on keyword (id)

I hope this helps.
Tony Wasson