Re: importing db as text files - Mailing list pgsql-general

From expect
Subject Re: importing db as text files
Date
Msg-id 20030813211417.46be5abe.expect@ihubbell.com
Whole thread Raw
In response to Re: importing db as text files  ("Gregory S. Williamson" <gsw@globexplorer.com>)
Responses query tuning
query tuning
List pgsql-general
On Wed, 13 Aug 2003 14:59:29 -0700
"Gregory S. Williamson" <gsw@globexplorer.com> wrote:

> I tend to use perl to preprocess dumps (in our case from Informix). It tends to work much faster than shell scripts
(althoughyour mileage may vary). I have to fix missing numeric values (if the column allows nulls why can't the copy
commandaccept an empty field, I wonder?), missing dates. In other cases is massages date formats, spatial data, etc.
Forexample, a crude program below to clean DOQQ metadata: 

I'm not having any problems massaging the data.  I believe the problem is that
pg is unable to use the default value when a value is not present.



BTW FWIW I took a few liberties with your perl code below for ease of reading.

For $out_line you can use the little append operator like:

$out_line  = $t_source_filename . "|";
$out_line .= $t_quadrangle_name . "|";
etc., etc.



>
> Greg W.
> ==================================
> firenze% more infxunl2psql
> #!/usr/dist/bin/perl -w
>
> $FILE = $ARGV[0];
> $OUTPUT = $ARGV[1];
> $MODE = $ARGV[2];
>
> open (INFILE,"$FILE");
> open (OUTFILE,">$OUTPUT");
>
> foreach $line (<INFILE>)
> {
> chop($line);
> chop($line);
> if (($MODE cmp "DOQ") == 0)
  {
        ($t_source_filename, $t_quadrangle_name, $t_west_longitude,
    $t_east_longitude, $t_north_latitude, $t_south_latitude,
    $t_production_date, $t_raster_order, $t_band_organization,
    $t_band_content, $t_bits_per_pixel, $t_samples_and_lines,
    $t_horizontal_datum, $t_horizontal_coordinate_system,
    $t_coordinate_zone, $t_horizontal_units, $t_horizontal_resolution,
    $t_secondary_horizontal_datum, $t_xy_origin, $t_secondary_xy_origin,
    $t_nw_quad_corner_xy, $t_ne_quad_corner_xy, $t_se_quad_corner_xy,
    $t_sw_quad_corner_xy, $t_secondary_nw_quad_xy, $t_secondary_ne_quad_xy,
    $tsecondary_se_quad_xy, $t_secondary_sw_quad_xy, $t_rmse_xy, $t_image_source,
    $t_source_dem_date, $t_agency, $t_producer, $t_production_system,
    $t_standard_version, $t_metadata_date, $t_data_file_size, $byte_count)
    = split(/\|/,$line);

>         if (length($t_production_date) == 0) {
>                 $t_production_date = "\\N";    # psql seems to dump a blank data with this nomenclature
>                 }
>         if (length($t_coordinate_zone) == 0) {  # an integer
>                 $t_coordinate_zone = 0;
>                 }
>         if (length($t_band_content) == 0) {
>                 $t_band_content = 0;    # also an int
>                 }
>         if (length($t_bits_per_pixel) == 0) {
>                 $t_bits_per_pixel = 0;  # reasonable default for an int ?
>                 }
>         if (length($t_horizontal_resolution) == 0) {
>                 $t_horizontal_resolution = 0.0;
>                 }
>         if (length($t_secondary_horizontal_datum) == 0) {
>                 $t_secondary_horizontal_datum = "\'\'";
>                 }
>         if (length($t_rmse_xy) == 0) {
>                 $t_rmse_xy = 0.0;
>                 }
>         if (length($t_metadata_date) == 0) {
>                 $t_metadata_date = "\\N";
>                 }
>         if (length($t_data_file_size) == 0) {
>                 $t_data_file_size = 0;  # a big int
>                 }
>         if (length($byte_count) == 0) {
>                 $byte_count = 0;        # reasonable default ? for an int
>                 }
>         $out_line = $t_source_filename . "|" . $t_quadrangle_name . "|" . $t_west_longitude . "|" . $t_east_longitude
."|" . $t_north_latitude . "|" . $t_south_latitude . "|" . $t_production_date . "|" . $t_raster_order . "|" .
$t_band_org
> anization . "|" . $t_band_content . "|" . $t_bits_per_pixel . "|" . $t_samples_and_lines . "|" . $t_horizontal_datum
."|" . $t_horizontal_coordinate_system . "|" . $t_coordinate_zone . "|" . $t_horizontal_units . "|" .
$t_horizontal_resolu
> tion . "|" . $t_secondary_horizontal_datum . "|" . $t_xy_origin . "|" . $t_secondary_xy_origin . "|" .
$t_nw_quad_corner_xy. "|" . $t_ne_quad_corner_xy . "|" . $t_se_quad_corner_xy . "|" . $t_sw_quad_corner_xy . "|" .
$t_secondary_nw_quad_xy. "|" . $t_secondary_ne_quad_xy . "|" . $tsecondary_se_quad_xy . "|" . $t_secondary_sw_quad_xy .
"|". $t_rmse_xy . "|" . $t_image_source . "|" . $t_source_dem_date . "|" . $t_agency . "|" . $t_producer . "|" .
$t_production_system. "|" . $t_standard_version . "|" . $t_metadata_date . "|" . $t_data_file_size . "|" . $byte_count; 
>         print OUTFILE "$out_line\n";
>         }
> else {
>         print OUTFILE "$line\n";
>         }
> }
> close INFILE;
> close OUTFILE;
>
> -----Original Message-----
> From: Jason Godden [mailto:jasongodden@optushome.com.au]
> Sent: Wednesday, August 13, 2003 2:35 PM
> To: expect; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] importing db as text files
>
>
> Hi expect,
>
> Best way in my opinion is to use the copy table command.  This way Pg will
> actually 'massage' the data (string escapes and all) for you.
>
> If you have complex files best to use a combo of sed/awk to pre-massage the
> field delimiters and import the data to a temporary table.  In most instances
> it is best to pipe the data to the psql command using copy table from stdin
> as from file requires that you are the process owner.
>
> Always use a temporary import table and perform validation/further conversion
> in that table (IMO).
>
> eg (from one of my earlier posts)
>
> cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | sed "s/.$//g"
> | psql -dmydatabase -c "copy importtable from stdin delimiter '\t';"
>
> The first sed will replace all instances of "," with a tab character
> (I achieve tab in my shell scripts with a line like TAB=$(printf "\t"))
>
> The second sed will remove the start and end " because the first only checks
> for ",".
>
> The third sed will remove the Windows carriage returns (if any - depends on
> the platform the file was generated from.
>
> Note here that my file format never contains any " in a field so I can safely
> run the second sed.  If your import file contains these then you will need to
> change the regex or use awk.
>
> Regards,
>
> Jason
>
> On Thu, 14 Aug 2003 07:14 am, expect wrote:
> > What's the big deal with importing text files?  I have a 70 MB file to
> > import and it's been one problem after another.  I used the copy command
> > and it appears that it's just not possible. I finally massaged the file
> > into a .sql file and ran that using \i db.sql but that failed too because I
> > overlooked ' in names like D'Adario.  The other problem I encountered was
> > that a numeric field had to have data in it, pg would not default to the
> > default value.  So instead of massaging all the data again I decided to
> > change the data type for that column. This is my first experience with
> > postgresql and I'm wondering if I should expect to encounter similar pain
> > as I go further into this?  So far it's been very painful trying to do what
> > I thought would be easy and what I think should be easy.
> >
> > PostgreSQL 7.3.4 on linux redhat 9
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 7: don't forget to increase your free space map settings
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>
>

pgsql-general by date:

Previous
From: expect
Date:
Subject: Why the duplicate messages to pgsql-general?
Next
From: Bruce Momjian
Date:
Subject: Re: Why the duplicate messages to pgsql-general?