Thread: SUBSTRING for a regular expression
I can't seem to get right the regular expression for parsing data like these four sample rows (names and addresses changed to ficticious values) from a text-type column: Yolanda Harris, 38, of 40 South Main St., Newtown City, was charged Sunday with breach of peace and interfering with a police officer. Allen K. George, 30, of 88 Beverly Court was charged Saturday with possession of marijuana, third-degree criminal mischief, breach of peace, evading responsibility, interfering with a police officer, driving with a suspended license, driving under the influence of drugs or alcohol, failure to drive right and failure to have proper insurance. Brain T. Grafton, 18, of 97 Bristol Ave. was charged Sunday with possession of marijuana, possession of alcohol by a minor and failure to wear a seat belt. Brian D. Sptizer Jr., 18, of 20 Walling Ave., Northford, was charged Sunday with driving under the influence of drugs or alcohol, evading responsibility and following too closely. Into separate columns for: name, age, address, charge. For example the first record would have name='Yolanda Harris' age=38 address='40 South Main St., Newtown City' charge='was charged Sunday with breach of peace and interfering with a police officer.' To get the name, for instance, I tried SELECT SUBSTRING(description FROM '^([:alnum:]*), \d{2}, .*$') FROM police_log; or the age value SELECT SUBSTRING(description FROM '^[:alnum:]*, (\d{2}), .*$') FROM police_log; But return values are all NULL. Can anyone give me some RE help, please? --Berend Tober
What language are you using for this plperl? That command listed in the documentation certainly works in psql or plpgsql but I don't know about plperl. Are you assigning the result to a variable and then printing the result of that variable before the function ends or printing based on what the function returns? I think you will probably need to show us the rest of the code in your function. On Mon, 2004-07-05 at 19:40, btober@computer.org wrote: > I can't seem to get right the regular expression for parsing data like > these four sample rows (names and addresses changed to ficticious values) > from a text-type column: > > Yolanda Harris, 38, of 40 South Main St., Newtown City, was charged > Sunday with breach of peace and interfering with a police officer. > > Allen K. George, 30, of 88 Beverly Court was charged Saturday with > possession of marijuana, third-degree criminal mischief, breach of peace, > evading responsibility, interfering with a police officer, driving with a > suspended license, driving under the influence of drugs or alcohol, > failure to drive right and failure to have proper insurance. > > Brain T. Grafton, 18, of 97 Bristol Ave. was charged Sunday with > possession of marijuana, possession of alcohol by a minor and failure to > wear a seat belt. > > Brian D. Sptizer Jr., 18, of 20 Walling Ave., Northford, was charged > Sunday with driving under the influence of drugs or alcohol, evading > responsibility and following too closely. > > Into separate columns for: name, age, address, charge. For example the > first record would have > > name='Yolanda Harris' > age=38 > address='40 South Main St., Newtown City' > charge='was charged Sunday with breach of peace and interfering with a > police officer.' > > To get the name, for instance, I tried > > SELECT SUBSTRING(description FROM '^([:alnum:]*), \d{2}, .*$') FROM > police_log; > > or the age value > > SELECT SUBSTRING(description FROM '^[:alnum:]*, (\d{2}), .*$') FROM > police_log; > > But return values are all NULL. Can anyone give me some RE help, please? > > --Berend Tober > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match
In article <64617.206.53.65.243.1089074434.squirrel@$HOSTNAME>, <btober@computer.org> writes: > I can't seem to get right the regular expression for parsing data like > these four sample rows (names and addresses changed to ficticious values) > from a text-type column: > Yolanda Harris, 38, of 40 South Main St., Newtown City, was charged > Sunday with breach of peace and interfering with a police officer. > Allen K. George, 30, of 88 Beverly Court was charged Saturday with > possession of marijuana, third-degree criminal mischief, breach of peace, > evading responsibility, interfering with a police officer, driving with a > suspended license, driving under the influence of drugs or alcohol, > failure to drive right and failure to have proper insurance. > Brain T. Grafton, 18, of 97 Bristol Ave. was charged Sunday with > possession of marijuana, possession of alcohol by a minor and failure to > wear a seat belt. > Brian D. Sptizer Jr., 18, of 20 Walling Ave., Northford, was charged > Sunday with driving under the influence of drugs or alcohol, evading > responsibility and following too closely. > Into separate columns for: name, age, address, charge. For example the > first record would have > name='Yolanda Harris' > age=38 > address='40 South Main St., Newtown City' > charge='was charged Sunday with breach of peace and interfering with a > police officer.' > To get the name, for instance, I tried > SELECT SUBSTRING(description FROM '^([:alnum:]*), \d{2}, .*$') FROM > police_log; > or the age value > SELECT SUBSTRING(description FROM '^[:alnum:]*, (\d{2}), .*$') FROM > police_log; > But return values are all NULL. Can anyone give me some RE help, please? Could you use Perl? A Perl regexp for that would be /^(.+), (\d+), of (.+?),? (was charged.+)$/
> > SELECT > SUBSTRING(description FROM '(.*), \\d{1,3},') AS vname, > SUBSTRING(description FROM '.*, (\\d{1,3}),') AS age, > SUBSTRING(description FROM '\\d{1,3}, of (.*?),? was charged') AS > address, SUBSTRING(description FROM ' was charged ([^ ]+)') AS dow, > SUBSTRING(description FROM ' was charged [^ ]+ with (.+)') AS charge > FROM police_log; Aha! The old double-slash escape. Thank you very much. --Berend Tober
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 SELECT SUBSTRING(description FROM '(.*), \\d{1,3},') AS vname, SUBSTRING(description FROM '.*, (\\d{1,3}),') AS age, SUBSTRING(description FROM '\\d{1,3}, of (.*?),? was charged') AS address, SUBSTRING(description FROM ' was charged ([^ ]+)') AS dow, SUBSTRING(description FROM ' was charged [^ ]+ with (.+)') AS charge FROM police_log; - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200407062103 -----BEGIN PGP SIGNATURE----- iD8DBQFA60wAvJuQZxSWSsgRAuKPAJ0QAeG0hdoJ/Ofqq/lXVtwMjyzQjACgoer3 kwPy0xvRiZxwr3cgPq6Rjwc= =mF/C -----END PGP SIGNATURE-----
> SELECT > SUBSTRING(description FROM '(.*), \\d{1,3},') AS vname, > SUBSTRING(description FROM '.*, (\\d{1,3}),') AS age, > SUBSTRING(description FROM '\\d{1,3}, of (.*?),? was charged') > AS address, > SUBSTRING(description FROM ' was charged ([^ ]+)') AS dow, > SUBSTRING(description FROM ' was charged [^ ]+ with (.+)') AS charge > FROM police_log; > > - -- > Greg Sabino Mullane greg@turnstep.com > PGP Key: 0x14964AC8 200407062103 Thanks Greg, I was hopeful that this would work, since I had missed the need to double the back-slash escape character in my original work, but something still isn't right. First I got an error message that psql didn't like the "?" characters in the RE, so I eliminated them and wrote SELECT SUBSTRING(description FROM '(.*), \\d{1,3},') AS vname, SUBSTRING(description FROM '.*, (\\d{1,3}),') AS age, SUBSTRING(description FROM '\\d{1,3}, of (.*), was charged') AS address, SUBSTRING(description FROM ' was charged ([^ ]+)') AS dow, SUBSTRING(description FROM ' was charged [^ ]+ with (.+)') AS charge FROM police_log; This modified version of your suggestion gets the dow, and charge columns right, but I'm still not seeing the vname, age, and address columns -- they return null. You've gotten me part way there, and I appreciate that. Any further ideas? --Berend Tober
> >> I was hopeful that this would work, since I had missed the need to >> double the back-slash escape character in my original work, but >> something still isn't right. >> >> First I got an error message that psql didn't like the "?" characters >> in the RE, so I eliminated them and wrote > > > Sounds like you may be using an older version of Pg than me. What > version are you running? Only that can tell us how limiting your regex > capabilitites are.... Not that old (I think)! crime=# select version(); version -------------------------------------------------------------------------------- ----------------------------------------------- PostgreSQL 7.3.4 on i386-redhat-linux-gnu, compiled by GCC i386-redhat-linux-gc c (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5) (1 row) --Berend Tober
<btober@computer.org> writes: >> Sounds like you may be using an older version of Pg than me. What >> version are you running? Only that can tell us how limiting your regex >> capabilitites are.... > Not that old (I think)! > PostgreSQL 7.3.4 on i386-redhat-linux-gnu, compiled by GCC > i386-redhat-linux-gc c (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5) Nonetheless too old --- 7.4 contains an entirely new regex package. The one in 7.3 is quite brain-dead compared to any modern regexes... regards, tom lane