Thread: Some CSV file-import questions

Some CSV file-import questions

From
Ron Johnson
Date:
Hello,

If the csv file generated by the application that I am
importing from contains quotes around each field, must I
write a program to strip these "field-level" quotes before
sending the file to COPY?
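
For reference: newer PostgreSQL releases (8.0 and later) let COPY parse
quoted CSV directly, so no pre-processing pass is needed there; older
versions do need the quotes stripped first.  A minimal sketch, assuming
a table named foo and a made-up file path:

-- Server-side COPY in 8.0-and-later syntax.
COPY foo FROM '/home/ron/foo.csv' WITH CSV;
-- Or, from psql without superuser rights:
-- \copy foo from '/home/ron/foo.csv' with csv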

As we know, COPY is a single transaction.  Therefore, it
would be "unpleasant" if, say, the process that is doing the
importing dies 90% of the way through a 10,000,000 row table.
Is there a checkpoint mechanism that would do a COMMIT, for
example, every 10,000 rows?  Then, if the process that is doing
the importing dies 90% of the way through that 10,000,000 row
table, when you restart the COPY, it skips over the rows that
were already inserted.
Here is an example from the RDBMS that I currently use:
$ bulkload -load -log -commit=10000 -tran=exclusive -db=test \
     -table=foo ~/foo.csv
Then, if something happens after inserting 9,000,000 rows,
it can be restarted by:
$ bulkload -load -log -commit=10000 -skip=9000000 -db=test \
     -tran=exclusive -table=foo ~/foo.csv

From what I've seen in the documentation and the mailing
list archives, the solution to both of these questions is
to roll my own bulk loader.

Ron
--
+---------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81 |
|                                                         |
| "I have created a government of whirled peas..."        |
|   Maharishi Mahesh Yogi, 12-May-2002,                   |
|   CNN, Larry King Live                                  |
+---------------------------------------------------------+


Re: Some CSV file-import questions

From
Andrew McMillan
Date:
On Sun, 2002-05-19 at 23:01, Ron Johnson wrote:
>
> If the csv file generated by the application that I am
> importing from contains quotes around each field, must I
> write a program to strip these "field-level" quotes before
> sending the file to COPY?
>
> As we know, COPY is a single transaction.  Therefore, it
> would be "unpleasant" if, say, the process that is doing the
> importing dies 90% of the way through a 10,000,000 row table.
> Is there a checkpoint mechanism that would do a COMMIT, for
> example, every 10,000 rows?  Then, if the process that is doing
> the importing dies 90% of the way through that 10,000,000 row
> table, when you restart the COPY, it skips over the rows that
> were already inserted.
> Here is an example from the RDBMS that I currently use:
> $ bulkload -load -log -commit=10000 -tran=exclusive -db=test \
>      -table=foo ~/foo.csv
> Then, if something happens after inserting 9,000,000 rows,
> it can be restarted by:
> $ bulkload -load -log -commit=10000 -skip=9000000 -db=test \
>      -tran=exclusive -table=foo ~/foo.csv
>
> From what I've seen in the documentation and the mailing
> list archives, the solution to both of these questions is
> to roll my own bulk loader.

Yes, or to borrow one someone else has already done.

I have a perl script I use for this sort of thing, and although it
handles the full range of field-quoting possibilities, it only loads the
whole file as a single transaction, or as one transaction per line.

You are welcome to it if you wish.  It shouldn't be hard to extend it to
allow groups of transactions to be checkpointed - I will probably even
do it myself before the end of the year.
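
For the checkpoint part, here is a rough shell sketch of the idea (not
the perl script itself; the file names, table name, and progress-log
convention are all made up, and "with csv" needs a release whose COPY
understands CSV): split the file into chunks, COPY each chunk in its
own transaction, and log each chunk as it commits, so a failed load can
resume where it left off.

$ split -l 10000 foo.csv chunk_
$ touch loaded.log
$ for f in chunk_*; do
    grep -qx "$f" loaded.log && continue   # already committed; skip it
    psql -d test -c "\copy foo from '$f' with csv" \
        && echo "$f" >> loaded.log         # record the checkpoint
  done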

Regards,
                    Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201    MOB: +64(21)635-694    OFFICE: +64(4)499-2267
       Are you enrolled at http://schoolreunions.co.nz/ yet?


Is a randomized default value primary key possible?

From
April L
Date:
After reading about data types, create table, and other search results for
"default" and "random" in the postgres docs, I haven't found a way to do
this:

Instead of a sequentially auto-incrementing primary key, I would like a
random 16-byte character value (presumably hex) to be automatically created
as the default value for each new record.

I guess I could just use a large random number, created with the postgres
random() function - however I don't see any way of seeding it, or know if
that is necessary.

The purpose of using a random rather than a sequential number is to prevent
people from accessing others' records by guessing.

Has anyone else encountered this challenge, and do you know of a way to
generate a random default value for the primary key?

Thank you,

- April

Re: Is a randomized default value primary key possible?

From
Ron Johnson
Date:
On Sun, 2002-05-19 at 11:10, April L wrote:
> After reading about data types, create table, and other search results for
> "default" and "random" in the postgres docs, I haven't found a way to do
> this:
>
> Instead of a sequentially auto-incrementing primary key, I would like a
> random 16-byte character value (presumably hex) to be automatically created
> as the default value for each new record.
>
> I guess I could just use a large random number, created with the postgres
> random() function - however I don't see any way of seeding it, or know if
> that is necessary.
>
> The purpose of using a random rather than a sequential number is to prevent
> people from accessing others' records by guessing.
>
> Has anyone else encountered this challenge, and do you know of a way to
> generate a random default value for the primary key?

Have you considered appending (or prepending) a check digit
to the sequence number before inserting into the database?

That way, it is unlikely someone will get at an account just by
typing in a random number.  Also, and more importantly, it will
ensure that fat-fingering by the clerical staff doesn't
accidentally bring up the wrong account.
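
A minimal sketch of one (deliberately simple) check-digit scheme, in
PL/pgSQL with modern dollar-quoting; a real application might prefer
the Luhn algorithm, and the sequence name below is made up:

CREATE FUNCTION with_check_digit(n bigint) RETURNS bigint AS $$
DECLARE
    s   text := n::text;
    tot int  := 0;
BEGIN
    -- Sum the digits, then append (sum mod 10) as a check digit.
    FOR i IN 1 .. length(s) LOOP
        tot := tot + substr(s, i, 1)::int;
    END LOOP;
    RETURN n * 10 + (tot % 10);
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Usage: SELECT with_check_digit(nextval('account_seq'));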

--
+---------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81 |
|                                                         |
| "I have created a government of whirled peas..."        |
|   Maharishi Mahesh Yogi, 12-May-2002,                   |
|   CNN, Larry King Live                                  |
+---------------------------------------------------------+


Re: Is a randomized default value primary key possible?

From
Josh Berkus
Date:
April,

> Instead of a sequentially auto-incrementing primary key, I would like a
> random 16-byte character value (presumably hex) to be automatically created
> as the default value for each new record.
>
> I guess I could just use a large random number, created with the postgres
> random() function - however I don't see any way of seeding it, or know if
> that is necessary.

First, let me point out that there is no reason for the "user key" you are
trying to implement to be the same value as the primary key of the table.
There are, in fact, a number of good arguments against it, not the least of
which is that a 16-byte string will take up 4x the sort memory of a 4-byte
integer.  I would suggest that you give the table an actual, hidden primary
key based on a simple sequence, and a separate unique "user key" for lookups.
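
A minimal sketch of that layout (the table and column names are made
up):

CREATE TABLE account (
    account_id serial PRIMARY KEY,     -- internal key, used for joins
    user_key   bigint NOT NULL UNIQUE  -- random key exposed for lookups
               DEFAULT trunc(random() * 10^15)::bigint,
    name       text
);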

This is actually easy to implement through custom functions, triggers, or
rules.  However, there are some performance implications if your table gets
very large, as you would have to prescan for accidental duplicates (in a
truly random distribution, duplicates are nearly certain given enough
records, even with a 16-byte value).

> The purpose of using a random rather than a sequential number is to prevent
> people from accessing others' records by guessing.
>
> Has anyone else encountered this challenge, and do you know of a way to
> generate a random default value for the primary key?

Generally, a pseudo-random number is more than adequate.  For example, one of
my applications generates a pseudo-random session key based on a calculation
involving the user_id of the modifying user and the epoch timestamp at which
the record was locked.  This appears random to the casual eye and is
near-impossible to guess.
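
Not the actual formula, but one way such a key might be derived in SQL,
assuming a release with the built-in md5() function (42 stands in for
the modifying user's user_id):

SELECT md5(42::text || ':' || extract(epoch FROM now())::text);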

--
-Josh Berkus


Re: Is a randomized default value primary key

From
April L
Date:
I made the primary key

"authkey"    bigint DEFAULT trunc(random()*10^15) NOT NULL

Does that seem reasonable? bigint is 8 bytes. I came up with this prior to
receiving your reply.

Since I do have to use the authkey to find records, it seems I would still
benefit from having an index on it even if I had a separate 4-byte primary
key - so I don't understand how it would save resources or increase
performance to avoid making this column the primary key.  Admittedly, I
don't understand indexes in depth yet; I just assumed that every additional
index means additional housekeeping each time a record is changed or added.

Thank you,

- April

At 01:49 PM 5/19/2002 -0700, Josh Berkus wrote:
>April,
>
>> Instead of a sequentially auto-incrementing primary key, I would like a
>> random 16-byte character value (presumably hex) to be automatically created
>> as the default value for each new record.
>>
>> I guess I could just use a large random number, created with the postgres
>> random() function - however I don't see any way of seeding it, or know if
>> that is necessary.
>
>First, let me point out that there is no reason for the "user key" you are
>trying to implement to be the same value as the primary key of the table.
>There are, in fact, a number of good arguments against it, not the least of
>which is that a 16-byte string will take up 4x the sort memory of a 4-byte
>integer.  I would suggest that you give the table an actual, hidden primary
>key based on a simple sequence, and a separate unique "user key" for lookups.
>
>This is actually easy to implement through custom functions, triggers, or
>rules.  However, there are some performance implications if your table gets
>very large, as you would have to prescan for accidental duplicates (in a
>truly random distribution, duplicates are nearly certain given enough
>records, even with a 16-byte value).
>
>> The purpose of using a random rather than a sequential number is to prevent
>> people from accessing others' records by guessing.
>>
>> Has anyone else encountered this challenge, and do you know of a way to
>> generate a random default value for the primary key?
>
>Generally, a pseudo-random number is more than adequate.  For example, one of
>my applications generates a pseudo-random session key based on a calculation
>involving the user_id of the modifying user and the epoch timestamp at which
>the record was locked.  This appears random to the casual eye and is
>near-impossible to guess.
>
>--
>-Josh Berkus

Re: Is a randomized default value primary key

From
"Joel Burton"
Date:
> I made the primary key
>
> "authkey"    bigint DEFAULT trunc(random()*10^15) NOT NULL
>
> Does that seem reasonable? bigint is 8 bytes. I came up with this prior to
> receiving your reply.
>
> Since I do have to use the authkey to find records, it seems I would still
> benefit from having an index on it even if I had a separate 4-byte primary
> key - so I don't understand how it would save resources or increase
> performance to avoid making this column the primary key.  Admittedly, I
> don't understand indexes in depth yet; I just assumed that every additional
> index means additional housekeeping each time a record is changed or added.
>
> Thank you,
>
> - April

Be sure to build into your app some way of handling the error that will
happen if a random primary key is already in the table -- resubmitting
should work fine (the odds of this happening twice are _extremely_ low,
unless your table is very large).
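
A sketch of doing that retry server-side with PL/pgSQL exception
trapping (a feature of later releases); the table name "accounts" is an
assumption, and authkey matches April's column:

CREATE FUNCTION insert_with_retry() RETURNS bigint AS $$
DECLARE
    key bigint;
BEGIN
    LOOP
        key := trunc(random() * 10^15)::bigint;
        BEGIN
            INSERT INTO accounts (authkey) VALUES (key);
            RETURN key;    -- success
        EXCEPTION WHEN unique_violation THEN
            NULL;          -- collision: draw a new key and try again
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;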

Anyone know off the top of their head what the period of PG's random()
function is?

- J.



Re: Is a randomized default value primary key

From
ghaverla@freenet.edmonton.ab.ca
Date:
On Sun, 19 May 2002, Joel Burton wrote:

> > I made the primary key
> >
> > "authkey"    bigint DEFAULT trunc(random()*10^15) NOT NULL

> Be sure to build into your app some way of handling the error that will
> happen if a random primary key is already in the table -- resubmitting
> should work fine (the odds of this happening twice are _extremely_ low,
> unless your table is very large).

If you know about how many rows you will be inserting, why not
build a table of random ints drawn from a finite pool without
replacement?  Then, you just have to pick the next unused random
int from the random pool table to use in your other table.  If
you get to the point where the table is exhausted (surely the
number of rows in the pool is less than the period), you just
add some more random ints.
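
A sketch of that pool in SQL (all names are made up; generate_series,
ON CONFLICT, and UPDATE ... RETURNING are features of later releases):

CREATE TABLE key_pool (
    draw_order serial PRIMARY KEY,
    authkey    bigint UNIQUE,
    used       boolean NOT NULL DEFAULT false
);

-- Fill the pool; duplicate draws are simply dropped.
INSERT INTO key_pool (authkey)
SELECT trunc(random() * 10^15)::bigint
FROM generate_series(1, 100000)
ON CONFLICT (authkey) DO NOTHING;

-- Take the next unused key from the pool.
UPDATE key_pool
SET used = true
WHERE draw_order = (SELECT min(draw_order) FROM key_pool WHERE NOT used)
RETURNING authkey;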

> Anyone know off the top of their heads what the period for PG's random()
> function is?

Sorry, not I.

Gord

Matter Realisations     http://www.materialisations.com/
Gordon Haverland, B.Sc. M.Eng. President
101  9504 182 St. NW    Edmonton, AB, CA  T5T 3A7
780/481-8019            ghaverla @ freenet.edmonton.ab.ca
780/993-1274 (cell)



Re: Is a randomized default value primary key

From
Josh Berkus
Date:
April,

> Since I do have to use the authkey to find records, it seems I would still
> benefit from having an index on it even if I had a separate 4-byte primary
> key - so I don't understand how it would save resources or increase
> performance to avoid making this column the primary key.  Admittedly, I
> don't understand indexes in depth yet; I just assumed that every additional
> index means additional housekeeping each time a record is changed or added.

Oh, yeah, you're right.  I should have added the caveat that having the two
keys makes sense if you are doing a lot of JOINing on the primary key of this
table.  If this table does not have a lot of key-related data in other
tables, the dual keying does not make sense.  Since you seemed to be talking
about a central data table, I just assumed that it would have child tables.

And Joel has a good point, which I made somewhat obliquely:  you need to be
prepared for key-violation errors with your method.  If you expect many
thousands of records, you should do the key-checking up front:

1. Create a PL/pgSQL trigger function called random_authkey();
2. Have the function generate a random number, scan the table for duplicates,
and, if one is found, generate a replacement number and return it as
NEW.authkey;
3. Create a BEFORE INSERT trigger which calls this function for each row.

This should automate the process for you, although it will slow down your
inserts significantly due to the extra index scan required for each insert.
Depending on your level of activity, though, you will probably not even
notice.
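
A minimal sketch of that trigger, using modern dollar-quoting (the
table name "accounts" is an assumption; authkey matches April's
column):

CREATE FUNCTION random_authkey() RETURNS trigger AS $$
BEGIN
    IF NEW.authkey IS NULL THEN
        NEW.authkey := trunc(random() * 10^15)::bigint;
    END IF;
    -- Re-draw until no duplicate exists (the extra index scan).
    WHILE EXISTS (SELECT 1 FROM accounts WHERE authkey = NEW.authkey) LOOP
        NEW.authkey := trunc(random() * 10^15)::bigint;
    END LOOP;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_random_authkey
    BEFORE INSERT ON accounts
    FOR EACH ROW EXECUTE PROCEDURE random_authkey();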


--
-Josh Berkus




Re: Is a randomized default value primary key

From
Ron Johnson
Date:
On Sun, 2002-05-19 at 17:19, April L wrote:
> I made the primary key
>
> "authkey"    bigint DEFAULT trunc(random()*10^15) NOT NULL
>
> Does that seem reasonable? bigint is 8 bytes. I came up with this prior to
> receiving your reply.
>
> Since I do have to use the authkey to find records, it seems I would still
> benefit from having an index on it even if I had a separate 4-byte primary
> key - so I don't understand how it would save resources or increase
> performance to avoid making this column the primary key.  Admittedly, I
> don't understand indexes in depth yet; I just assumed that every additional
> index means additional housekeeping each time a record is changed or added.

You are right that more indexes mean more work.  Another
benefit of pseudo-random numbers is that keys will be inserted
into the tree in, well, pseudo-random order...  So what?  Keys
that are inserted in ascending order all land on the right side
of the tree, so postgres must do extra work to keep the tree
balanced.  (That's the B in b-tree.)  Random keys get inserted
all over the tree, minimizing the work needed to keep the tree
balanced.

If your transactions are SERIALIZABLE then, since ascending-order
keys all get inserted into the right side of the tree, all users
end up inserting into the same nodes, causing rollbacks.  Keys
that land all over the tree minimize this problem.

--
+---------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81 |
|                                                         |
| "I have created a government of whirled peas..."        |
|   Maharishi Mahesh Yogi, 12-May-2002,                   |
|   CNN, Larry King Live                                  |
+---------------------------------------------------------+