These seem to work. I'm posting them now before I'm done with
them before I start messing them up with specifics to my database.
Any comment is welcome. However, I've no intention of publishing
production quality code or documentation, these are my notes
and I've used published them here in the event that
somebody finds them useful.
Postgresql developers might note the hackery required to accomodate
COPY's insertion of escape sequences (and the escape
sequences in the syntax of literal strings, be nice to
turn off \'s per-database.)
This uses the XBase perl module to convert from .dbf to
delimited text files.
--------------------------
file:INSTALL
Postgresql can import delmited text fields (COPY). This means that the
field
separator and record separator delimters chosen cannot appear in the
data. I choose tab as my field separator and newline as my record
separator. This may not work with all data.
Use the perl program dbfdump to convert individual files.
If your distribution does not contain dbfdump you can hack an
installation of the program in a temporary location by following this
procedure:
Obtain the code for the XBase module (DBD-XBase-0.241.tar.gz) from
http://www.cpan.org. Place the tarfile in some temporary directory
and cd into that directory.
Install the XBase module (see
http://www.cpan.org/misc/cpan-faq.html#How_install_private) with: (do
not do this as root)
tar -xzf DBD-XBase-0.241.tar.gz
(# Install XBase module
cd DBD-XBase-0.241
perl Makefile.PL LIB=$(dirname $(pwd)) PREFIX=$(dirname $(pwd))
make
make test
make install
)
(# Customize dbfdump to account for our strange install
cd bin
foo=foo.$$ head -n 1 dbfdump | sed "s#\$# -I $(dirname $(pwd))#" >
$foo tail -n +2 dbfdump >> $foo chmod u+w dbfdump cat $foo >
dbfdump chmod u-w dbfdump rm -f $foo )
The XBase version number will vary with the version you have installed.
XBase requires on the perl DBD module. If you don't have that I
imagine you can so something similar to install it.
You should now be able to run the dbfdump program by explictly
supplying a full pathname. Test this by typing:
yourtempdir/bin/dbfdump --help
When you're done with dbfdump you can uninstall by deleting your
temporary directory.
The remainder of this document does not fully qualify the path to the
dbfdump program. If you would like to simply cut and paste the code,
you can (temporarly) alter your $PATH as follows:
export PATH=$PATH:yourtempdir/bin
-----------------
file:dbf2txt
#!/bin/sh
#
# Convert foxpro files to tab delimited text files
#
#
# Put all your dbf related files into a directory, in this case dbfdir.
# Make a directory for converted output, in this case txtdir.
export PATH=$PATH:./xbase_bin
for suffix in DBF dbf ; do
for f in $(ls dbfdir/*.$suffix) ; do
# Choose a character that doesn't appear in the data as the
# field separator. I choose tab.
dbfdump -fs $'\t' $f > txtdir/$(basename $f $suffix)txt done
done
-------------------------------------------------------------
file: loader
#!/bin/sh
# load the data from text files
# Needs directories (in the pwd):
# dbfdir all the foxpro .dbf files
# txtdir creates tab delimited text files here
# rejects tab delimited text files of rows that won't convert
# We deal with bad data as follows:
# The raw data row is inserted into a temporary table.
# Then we start a transaction, insert the data into the
# real table and delete the raw data from the temporary table.
# If the real data does not insert, then the whole tranaction
# is rolled back and the raw data is left in the temporary
# table. After completing the load, the bad data is left
# in the temporary table and we dump that back out to the
# filesystem.
function regular_load () {
export table=$1
#total hack, sed command to change old col names to new
export colfilter="$2"
if [ -e dbfdir/${table}.dbf ] ; then
export t=$table.dbf
else
export t=$table.DBF
fi
# Get the columns and the order in which they appear.
export cols="$(
./xbase_bin/dbfdump --info dbfdir/$t \
| awk '/[1-9][0-9]*\./ {printf("%s%s", sep, $2); sep=", "}'
)"
if [ -n "$colfilter" ] ; then
# Convert the old column names to new column names.
cols="$(echo $cols | $colfilter)";
fi
# Get the data types that go with the columns.
export types="$(
./xbase_bin/dbfdump --info dbfdir/$t \
| awk '/[1-9][0-9]*\./ {printf("%s%s", sep, $3); sep=" "}'
)"
awk -v cols="$cols" -v types="$types" -v table="$table" -v
pwd="$(pwd)" '
function fix_escapes(s) {
# Double all backslashes and single quotes.
# See the GNU awk users guide
#
http://www.gnu.org/software/gawk/manual/html_node/Gory-Details.html#Gory%20Details
return gensub("\\\\", "\\\\\\\\", "g",
gensub("\x27", "\x27\x27", "g", s));
}
BEGIN {split(types, t);
FS = "\t";
# Create the temporary table.
print "CREATE TEMP TABLE rejects(id INT, rawdata TEXT);";
id = 1;
}
{# First insert the raw data into the temp table.
printf("INSERT INTO rejects (id, rawdata)
values(%s,\x27%s\x27);\n" \
, id \
, fix_escapes($0));
# Then update the table in a transaction that also removes
# the raw row. Either it all succeeds or it does not.
print "BEGIN TRANSACTION;";
printf("insert into " table "(" cols ") values (");
sep = "";
for (i = 1; i <= NF; i++) {
printf("%s", sep);
if ($i == "") {
# No data, try inserting a NULL value
printf("NULL");
} else {
# There is data.
if (t[i] != "N") {
# Data is not numeric, needs quotes.
printf("\x27"); # single quote char
};
# Ta Da! The data.
# Double all backslashes and single quotes.
printf("%s", fix_escapes($i));
if (t[i] != "N") {
# Data is not numeric, needs closing quotes.
printf("\x27"); # single quote char
};
}
sep = ", ";
};
print ");";
# Delete the raw data from the temp table.
printf("DELETE FROM rejects WHERE id = %s;\n", id);
id++;
print "END TRANSACTION;";
}
END {# Dump the rejects back to a file.
print "\\copy rejects (rawdata) TO \x27" pwd "/rejects/" \
table ".txt\x27";
}' \
txtdir/$table.txt \
| psql -d example_db -h server.example.com -f - -U example_user
# Convert rejected data from sql COPY TO format back to tab separated
# data so we can fix it and feed it right back to this program.
# (Welcome to escape hell.)
# Note that this assumes that there are no vertical tabs, newlines,
# or other such characters that COPY would mung.
tfile=$table.txt.$$
mv rejects/$table.txt rejects/$tfile
sed 's/\\\\/\\/g' rejects/$tfile | sed 's/\\t/'$'\t''/g' >
rejects/$table.txt
rm rejects/$tfile
}
regular_load acts 'sed s/OLD/RETIRED/'
for t in STATUSES BSTATS MSTATS dcauses rnktypes ; do
regular_load $t
done
Karl <kop@meme.com>
Free Software: "You don't pay back, you pay forward."
-- Robert A. Heinlein