Re: importing db as text files - Mailing list pgsql-general
From | expect |
---|---|
Subject | Re: importing db as text files |
Date | |
Msg-id | 20030813211417.46be5abe.expect@ihubbell.com Whole thread Raw |
In response to | Re: importing db as text files ("Gregory S. Williamson" <gsw@globexplorer.com>) |
Responses |
query tuning
query tuning |
List | pgsql-general |
On Wed, 13 Aug 2003 14:59:29 -0700 "Gregory S. Williamson" <gsw@globexplorer.com> wrote: > I tend to use perl to preprocess dumps (in our case from Informix). It tends to work much faster than shell scripts (althoughyour mileage may vary). I have to fix missing numeric values (if the column allows nulls why can't the copy commandaccept an empty field, I wonder?), missing dates. In other cases is massages date formats, spatial data, etc. Forexample, a crude program below to clean DOQQ metadata: I'm not having any problems massaging the data. I believe the problem is that pg is unable to use the default value when a value is not present. BTW FWIW I took a few liberties with your perl code below for ease of reading. For $out_line you can use the little append operator like: $out_line = $t_source_filename . "|"; $out_line .= $t_quadrangle_name . "|"; etc., etc. > > Greg W. > ================================== > firenze% more infxunl2psql > #!/usr/dist/bin/perl -w > > $FILE = $ARGV[0]; > $OUTPUT = $ARGV[1]; > $MODE = $ARGV[2]; > > open (INFILE,"$FILE"); > open (OUTFILE,">$OUTPUT"); > > foreach $line (<INFILE>) > { > chop($line); > chop($line); > if (($MODE cmp "DOQ") == 0) { ($t_source_filename, $t_quadrangle_name, $t_west_longitude, $t_east_longitude, $t_north_latitude, $t_south_latitude, $t_production_date, $t_raster_order, $t_band_organization, $t_band_content, $t_bits_per_pixel, $t_samples_and_lines, $t_horizontal_datum, $t_horizontal_coordinate_system, $t_coordinate_zone, $t_horizontal_units, $t_horizontal_resolution, $t_secondary_horizontal_datum, $t_xy_origin, $t_secondary_xy_origin, $t_nw_quad_corner_xy, $t_ne_quad_corner_xy, $t_se_quad_corner_xy, $t_sw_quad_corner_xy, $t_secondary_nw_quad_xy, $t_secondary_ne_quad_xy, $tsecondary_se_quad_xy, $t_secondary_sw_quad_xy, $t_rmse_xy, $t_image_source, $t_source_dem_date, $t_agency, $t_producer, $t_production_system, $t_standard_version, $t_metadata_date, $t_data_file_size, $byte_count) = split(/\|/,$line); > if (length($t_production_date) == 0) { > $t_production_date = "\\N"; # psql seems to dump a blank data with this nomenclature > } > if (length($t_coordinate_zone) == 0) { # an integer > $t_coordinate_zone = 0; > } > if (length($t_band_content) == 0) { > $t_band_content = 0; # also an int > } > if (length($t_bits_per_pixel) == 0) { > $t_bits_per_pixel = 0; # reasonable default for an int ? > } > if (length($t_horizontal_resolution) == 0) { > $t_horizontal_resolution = 0.0; > } > if (length($t_secondary_horizontal_datum) == 0) { > $t_secondary_horizontal_datum = "\'\'"; > } > if (length($t_rmse_xy) == 0) { > $t_rmse_xy = 0.0; > } > if (length($t_metadata_date) == 0) { > $t_metadata_date = "\\N"; > } > if (length($t_data_file_size) == 0) { > $t_data_file_size = 0; # a big int > } > if (length($byte_count) == 0) { > $byte_count = 0; # reasonable default ? for an int > } > $out_line = $t_source_filename . "|" . $t_quadrangle_name . "|" . $t_west_longitude . "|" . $t_east_longitude ."|" . $t_north_latitude . "|" . $t_south_latitude . "|" . $t_production_date . "|" . $t_raster_order . "|" . $t_band_org > anization . "|" . $t_band_content . "|" . $t_bits_per_pixel . "|" . $t_samples_and_lines . "|" . $t_horizontal_datum ."|" . $t_horizontal_coordinate_system . "|" . $t_coordinate_zone . "|" . $t_horizontal_units . "|" . $t_horizontal_resolu > tion . "|" . $t_secondary_horizontal_datum . "|" . $t_xy_origin . "|" . $t_secondary_xy_origin . "|" . $t_nw_quad_corner_xy. "|" . $t_ne_quad_corner_xy . "|" . $t_se_quad_corner_xy . "|" . $t_sw_quad_corner_xy . "|" . $t_secondary_nw_quad_xy. "|" . $t_secondary_ne_quad_xy . "|" . $tsecondary_se_quad_xy . "|" . $t_secondary_sw_quad_xy . "|". $t_rmse_xy . "|" . $t_image_source . "|" . $t_source_dem_date . "|" . $t_agency . "|" . $t_producer . "|" . $t_production_system. "|" . $t_standard_version . "|" . $t_metadata_date . "|" . $t_data_file_size . "|" . $byte_count; > print OUTFILE "$out_line\n"; > } > else { > print OUTFILE "$line\n"; > } > } > close INFILE; > close OUTFILE; > > -----Original Message----- > From: Jason Godden [mailto:jasongodden@optushome.com.au] > Sent: Wednesday, August 13, 2003 2:35 PM > To: expect; pgsql-general@postgresql.org > Subject: Re: [GENERAL] importing db as text files > > > Hi expect, > > Best way in my opinion is to use the copy table command. This way Pg will > actually 'massage' the data (string escapes and all) for you. > > If you have complex files best to use a combo of sed/awk to pre-massage the > field delimiters and import the data to a temporary table. In most instances > it is best to pipe the data to the psql command using copy table from stdin > as from file requires that you are the process owner. > > Always use a temporary import table and perform validation/further conversion > in that table (IMO). > > eg (from one of my earlier posts) > > cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | sed "s/.$//g" > | psql -dmydatabase -c "copy importtable from stdin delimiter '\t';" > > The first sed will replace all instances of "," with a tab character > (I achieve tab in my shell scripts with a line like TAB=$(printf "\t")) > > The second sed will remove the start and end " because the first only checks > for ",". > > The third sed will remove the Windows carriage returns (if any - depends on > the platform the file was generated from. > > Note here that my file format never contains any " in a field so I can safely > run the second sed. If your import file contains these then you will need to > change the regex or use awk. > > Regards, > > Jason > > On Thu, 14 Aug 2003 07:14 am, expect wrote: > > What's the big deal with importing text files? I have a 70 MB file to > > import and it's been one problem after another. I used the copy command > > and it appears that it's just not possible. I finally massaged the file > > into a .sql file and ran that using \i db.sql but that failed too because I > > overlooked ' in names like D'Adario. The other problem I encountered was > > that a numeric field had to have data in it, pg would not default to the > > default value. So instead of massaging all the data again I decided to > > change the data type for that column. This is my first experience with > > postgresql and I'm wondering if I should expect to encounter similar pain > > as I go further into this? So far it's been very painful trying to do what > > I thought would be easy and what I think should be easy. > > > > PostgreSQL 7.3.4 on linux redhat 9 > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 7: don't forget to increase your free space map settings > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > >
pgsql-general by date: