Re: importing db as text files - Mailing list pgsql-general

From expect
Subject Re: importing db as text files
Msg-id 20030813211701.031090a6.expect@ihubbell.com
In response to Re: importing db as text files  ("Gregory S. Williamson" <gsw@globexplorer.com>)
Responses Re: importing db as text files  (Jason Godden <jasongodden@optushome.com.au>)
List pgsql-general
On Wed, 13 Aug 2003 14:59:29 -0700
"Gregory S. Williamson" <gsw@globexplorer.com> wrote:

> I tend to use perl to preprocess dumps (in our case from Informix). It tends
> to work much faster than shell scripts (although your mileage may vary). I
> have to fix missing numeric values (if the column allows nulls why can't the
> copy command accept an empty field, I wonder?), missing dates. In other cases

This seems to be the question I need answered as well.  Does anyone know the
answer?
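For what it's worth, COPY's null marker defaults to \N, but it can be overridden so that empty fields load as NULL. A sketch, assuming a pipe-delimited dump; the database, table, and file names here are made up:

```shell
# Make COPY read empty fields as NULL instead of rejecting them
# (mydb, mytable, and dump.txt are hypothetical names):
psql -d mydb -c "COPY mytable FROM stdin WITH DELIMITER '|' NULL AS '';" < dump.txt
```

Note this loads NULL, not the column's default: COPY inserts exactly the value it is given for each listed column, so an empty-but-present field never falls back to a DEFAULT clause.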



> it massages date formats, spatial data, etc. For example, a crude program
> below to clean DOQQ metadata:
>
> Greg W.
> ==================================
> firenze% more infxunl2psql
> #!/usr/dist/bin/perl -w
>
> $FILE = $ARGV[0];
> $OUTPUT = $ARGV[1];
> $MODE = $ARGV[2];
>
> open (INFILE,"$FILE") or die "cannot open $FILE: $!";
> open (OUTFILE,">$OUTPUT") or die "cannot open $OUTPUT: $!";
>
> foreach $line (<INFILE>)
> {
> chop($line);    # strip the newline
> chop($line);    # strip the trailing carriage return (DOS-format input)
> if (($MODE cmp "DOQ") == 0) {
>         ($t_source_filename, $t_quadrangle_name, $t_west_longitude,
>         $t_east_longitude, $t_north_latitude, $t_south_latitude,
>         $t_production_date, $t_raster_order, $t_band_organization,
>         $t_band_content, $t_bits_per_pixel, $t_samples_and_lines,
>         $t_horizontal_datum, $t_horizontal_coordinate_system,
>         $t_coordinate_zone, $t_horizontal_units, $t_horizontal_resolution,
>         $t_secondary_horizontal_datum, $t_xy_origin, $t_secondary_xy_origin,
>         $t_nw_quad_corner_xy, $t_ne_quad_corner_xy, $t_se_quad_corner_xy,
>         $t_sw_quad_corner_xy, $t_secondary_nw_quad_xy,
>         $t_secondary_ne_quad_xy, $t_secondary_se_quad_xy,
>         $t_secondary_sw_quad_xy, $t_rmse_xy, $t_image_source,
>         $t_source_dem_date, $t_agency, $t_producer, $t_production_system,
>         $t_standard_version, $t_metadata_date, $t_data_file_size, $byte_count)
>         = split(/\|/,$line);
>         if (length($t_production_date) == 0) {
>                 $t_production_date = "\\N";     # psql dumps a blank date with this nomenclature
>                 }
>         if (length($t_coordinate_zone) == 0) {  # an integer
>                 $t_coordinate_zone = 0;
>                 }
>         if (length($t_band_content) == 0) {
>                 $t_band_content = 0;    # also an int
>                 }
>         if (length($t_bits_per_pixel) == 0) {
>                 $t_bits_per_pixel = 0;  # reasonable default for an int ?
>                 }
>         if (length($t_horizontal_resolution) == 0) {
>                 $t_horizontal_resolution = 0.0;
>                 }
>         if (length($t_secondary_horizontal_datum) == 0) {
>                 $t_secondary_horizontal_datum = "\'\'";
>                 }
>         if (length($t_rmse_xy) == 0) {
>                 $t_rmse_xy = 0.0;
>                 }
>         if (length($t_metadata_date) == 0) {
>                 $t_metadata_date = "\\N";
>                 }
>         if (length($t_data_file_size) == 0) {
>                 $t_data_file_size = 0;  # a big int
>                 }
>         if (length($byte_count) == 0) {
>                 $byte_count = 0;        # reasonable default ? for an int
>                 }
>         $out_line = join("|", $t_source_filename, $t_quadrangle_name,
>                 $t_west_longitude, $t_east_longitude, $t_north_latitude,
>                 $t_south_latitude, $t_production_date, $t_raster_order,
>                 $t_band_organization, $t_band_content, $t_bits_per_pixel,
>                 $t_samples_and_lines, $t_horizontal_datum,
>                 $t_horizontal_coordinate_system, $t_coordinate_zone,
>                 $t_horizontal_units, $t_horizontal_resolution,
>                 $t_secondary_horizontal_datum, $t_xy_origin,
>                 $t_secondary_xy_origin, $t_nw_quad_corner_xy,
>                 $t_ne_quad_corner_xy, $t_se_quad_corner_xy,
>                 $t_sw_quad_corner_xy, $t_secondary_nw_quad_xy,
>                 $t_secondary_ne_quad_xy, $t_secondary_se_quad_xy,
>                 $t_secondary_sw_quad_xy, $t_rmse_xy, $t_image_source,
>                 $t_source_dem_date, $t_agency, $t_producer,
>                 $t_production_system, $t_standard_version, $t_metadata_date,
>                 $t_data_file_size, $byte_count);
>         print OUTFILE "$out_line\n";
>         }
> else {
>         print OUTFILE "$line\n";
>         }
> }
> close INFILE;
> close OUTFILE;
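A hypothetical invocation of the script above might look like this; the file, database, and table names are all assumptions, not something from the thread:

```shell
# Convert the Informix unload, then load the cleaned file with COPY
# (infxunl2psql takes: input file, output file, mode):
perl infxunl2psql doqq_meta.unl doqq_meta.out DOQ
psql -d gisdb -c "COPY doqq_metadata FROM stdin DELIMITER '|';" < doqq_meta.out
```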
>
> -----Original Message-----
> From: Jason Godden [mailto:jasongodden@optushome.com.au]
> Sent: Wednesday, August 13, 2003 2:35 PM
> To: expect; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] importing db as text files
>
>
> Hi expect,
>
> Best way in my opinion is to use the copy table command.  This way Pg will
> actually 'massage' the data (string escapes and all) for you.
>
> If you have complex files best to use a combo of sed/awk to pre-massage the
> field delimiters and import the data to a temporary table.  In most instances
> it is best to pipe the data to the psql command using copy table from stdin
> as from file requires that you are the process owner.
>
> Always use a temporary import table and perform validation/further conversion
> in that table (IMO).
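That staging-table flow might look something like this; every table, column, and file name below is an assumption for illustration:

```shell
# Hypothetical staging workflow: load raw text, validate/cast, then drop.
psql -d mydatabase -c "CREATE TABLE import_raw (name text, qty text);"
psql -d mydatabase -c "COPY import_raw FROM stdin DELIMITER '|';" < clean.txt
# Only rows whose qty is all digits survive the cast into the real table:
psql -d mydatabase -c "INSERT INTO people (name, qty)
                       SELECT name, qty::integer
                       FROM import_raw WHERE qty ~ '^[0-9]+\$';"
psql -d mydatabase -c "DROP TABLE import_raw;"
```

Keeping every staging column as text means COPY can never reject a row on type grounds; all the type enforcement happens in the INSERT ... SELECT where you can see (and query) what failed.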
>
> eg (from one of my earlier posts)
>
> cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | \
>     sed "s/.$//g" | psql -dmydatabase -c "copy importtable from stdin delimiter '\t';"
>
> The first sed will replace all instances of "," with a tab character
> (I achieve tab in my shell scripts with a line like TAB=$(printf "\t"))
>
> The second sed will remove the start and end " because the first only checks
> for ",".
>
> The third sed will remove the Windows carriage returns (if any - depends on
> the platform the file was generated from).
>
> Note here that my file format never contains any " in a field so I can safely
> run the second sed.  If your import file contains these then you will need to
> change the regex or use awk.
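On a one-line made-up sample, Jason's three-sed pipeline behaves like this:

```shell
# Demonstrate the quoted-CSV -> tab-delimited conversion on sample data:
TAB=$(printf "\t")
printf '"Smith","42","NY"\r\n' \
  | sed "s/\",\"/$TAB/g" \
  | sed "s/\"//g" \
  | sed "s/.$//g"
# → Smith<TAB>42<TAB>NY  (tab-separated, trailing CR removed)
```

Note the last sed drops the final character of every line unconditionally, so it is only safe on CRLF-terminated input; on Unix-format files it would eat a real character.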
>
> Regards,
>
> Jason
>
> On Thu, 14 Aug 2003 07:14 am, expect wrote:
> > What's the big deal with importing text files?  I have a 70 MB file to
> > import and it's been one problem after another.  I used the copy command
> > and it appears that it's just not possible. I finally massaged the file
> > into a .sql file and ran that using \i db.sql but that failed too because I
> > overlooked ' in names like D'Adario.  The other problem I encountered was
> > that a numeric field had to have data in it, pg would not default to the
> > default value.  So instead of massaging all the data again I decided to
> > change the data type for that column. This is my first experience with
> > postgresql and I'm wondering if I should expect to encounter similar pain
> > as I go further into this?  So far it's been very painful trying to do what
> > I thought would be easy and what I think should be easy.
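If you do take the .sql route, single quotes inside values have to be doubled for SQL string literals (COPY's format avoids this problem entirely, since it doesn't use SQL quoting). A quick sed pass handles names like D'Adario:

```shell
# Double embedded single quotes so names survive inside SQL literals:
printf "D'Adario\n" | sed "s/'/''/g"
# → D''Adario
```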
> >
> > PostgreSQL 7.3.4 on linux redhat 9
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 7: don't forget to increase your free space map settings
>
