Thread: Some CSV file-import questions
Hello,

If the csv file generated by the application that I am importing from contains quotes around each field, must I write a program to strip these "field-level" quotes before sending the file to COPY?

As we know, COPY is a single transaction. Therefore, it would be "unpleasant" if, say, the process that is doing the importing dies 90% of the way through a 10,000,000 row table. Is there a checkpoint mechanism that would do a COMMIT every 10,000 rows, for example? Then, if the importing process dies 90% of the way through that 10,000,000 row table, when you restart the COPY it skips over the already-inserted rows.

Here is an example from the RDBMS that I currently use:

$ bulkload -load -log -commit=10000 -tran=exclusive -db=test \
    -table=foo ~/foo.csv

Then, if something happens after inserting 9,000,000 rows, it can be restarted by:

$ bulkload -load -log -commit=10000 -skip=9000000 -db=test \
    -tran=exclusive -table=foo ~/foo.csv

From what I've seen in the documentation and the mailing list archives, the solution to both of these questions is to roll my own bulk loader.

Ron
-- 
+---------------------------------------------------------+
| Ron Johnson, Jr.       Home: ron.l.johnson@cox.net       |
| Jefferson, LA  USA     http://ronandheather.dhs.org:81   |
|                                                          |
| "I have created a government of whirled peas..."         |
|    Maharishi Mahesh Yogi, 12-May-2002,                   |
|    CNN, Larry King Live                                  |
+---------------------------------------------------------+
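For reference, later PostgreSQL releases taught COPY to parse quoted CSV directly, so no quote-stripping program is needed there; at the time of this thread the quotes did have to be handled outside COPY. A minimal sketch, assuming a version whose COPY accepts the CSV option (the table name and server-side path are illustrative):

    -- Load a file whose fields are wrapped in double quotes.
    -- Assumes a PostgreSQL release where COPY understands CSV.
    COPY foo FROM '/data/foo.csv' WITH CSV;

    -- Client-side equivalent from psql (the file is read by the client):
    -- \copy foo FROM 'foo.csv' WITH CSV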
On Sun, 2002-05-19 at 23:01, Ron Johnson wrote:
> If the csv file generated by the application that I am
> importing from contains quotes around each field, must I
> write a program to strip these "field-level" quotes before
> sending the file to COPY?
>
> As we know, COPY is a single transaction. Therefore, it
> would be "unpleasant" if, say, the process that is doing the
> importing dies 90% of the way through a 10,000,000 row table.
> Is there a checkpoint mechanism that would do a COMMIT every
> 10,000 rows, for example? Then, if the importing process dies
> 90% of the way through that 10,000,000 row table, when you
> restart the COPY it skips over the already-inserted rows.
>
> Here is an example from the RDBMS that I currently use:
>
> $ bulkload -load -log -commit=10000 -tran=exclusive -db=test \
>     -table=foo ~/foo.csv
>
> Then, if something happens after inserting 9,000,000 rows,
> it can be restarted by:
>
> $ bulkload -load -log -commit=10000 -skip=9000000 -db=test \
>     -tran=exclusive -table=foo ~/foo.csv
>
> From what I've seen in the documentation and the mailing
> list archives, the solution to both of these questions is
> to roll my own bulk loader.

Yes, or to borrow one someone else has already done.

I have a perl script I use for this sort of thing, and although it handles the full range of field quoting, it only loads the whole file as a single transaction, or as one transaction per line. You are welcome to it if you wish.

It shouldn't be hard to extend it to commit in checkpointed batches - I will probably even do that myself before the end of the year.

Regards,
Andrew.
-- 
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd,  PO Box 11-053, Manners St,  Wellington
WEB: http://catalyst.net.nz/            PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201     MOB: +64(21)635-694     OFFICE: +64(4)499-2267
Are you enrolled at http://schoolreunions.co.nz/ yet?
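The -commit/-skip behaviour can also be approximated without writing a full loader: split the file externally, issue one COPY per chunk, and wrap each chunk in its own transaction; on restart, simply skip the chunk files that already committed. A rough sketch under the same assumptions as above (illustrative names and paths, a CSV-aware COPY):

    -- One transaction per chunk; if the load dies, re-run starting at the
    -- first chunk that did not COMMIT (the earlier chunks are already in).
    BEGIN;
    COPY foo FROM '/data/chunks/foo_0000.csv' WITH CSV;
    COMMIT;

    BEGIN;
    COPY foo FROM '/data/chunks/foo_0001.csv' WITH CSV;
    COMMIT;

    -- ...and so on for the remaining chunks.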
After reading about data types, CREATE TABLE, and other search results for "default" and "random" in the postgres docs, I haven't found a way to do this:

Instead of a sequentially auto-incrementing primary key, I would like a random 16 byte character value (presumably hex) to be automatically created as the default value for each new record.

I guess I could just use a large random number, created with the postgres random() function - however I don't see any way of seeding it, or know whether that is necessary.

The purpose of using a random rather than sequential number is to prevent people being able to access others' records by guessing.

Has anyone else encountered this challenge, and do you know of a way to generate a random default value for the primary key?

Thank you,

- April
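For reference, a literal random 16-byte hex value can be produced with the pgcrypto contrib module, assuming it is installed; the table and column names below are only illustrative.

    -- gen_random_bytes() returns cryptographically strong random bytes;
    -- encode(..., 'hex') turns 16 bytes into a 32-character hex string.
    SELECT encode(gen_random_bytes(16), 'hex');

    -- Used as a column default:
    CREATE TABLE hexkey_demo (
        authkey char(32) PRIMARY KEY DEFAULT encode(gen_random_bytes(16), 'hex'),
        payload text
    );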
On Sun, 2002-05-19 at 11:10, April L wrote:
> After reading about data types, CREATE TABLE, and other search results
> for "default" and "random" in the postgres docs, I haven't found a way
> to do this:
>
> Instead of a sequentially auto-incrementing primary key, I would like a
> random 16 byte character value (presumably hex) to be automatically
> created as the default value for each new record.
>
> I guess I could just use a large random number, created with the
> postgres random() function - however I don't see any way of seeding it,
> or know whether that is necessary.
>
> The purpose of using a random rather than sequential number is to
> prevent people being able to access others' records by guessing.
>
> Has anyone else encountered this challenge, and do you know of a way to
> generate a random default value for the primary key?

Have you considered appending (or prepending) a check digit to the sequence number before inserting into the database? That way, it is unlikely someone will get at an account just by typing in a random number. Also, and more importantly, it will ensure that any fat-fingering by the clerical staff doesn't accidentally bring up the wrong account.

-- 
+---------------------------------------------------------+
| Ron Johnson, Jr.       Home: ron.l.johnson@cox.net       |
| Jefferson, LA  USA     http://ronandheather.dhs.org:81   |
|                                                          |
| "I have created a government of whirled peas..."         |
|    Maharishi Mahesh Yogi, 12-May-2002,                   |
|    CNN, Larry King Live                                  |
+---------------------------------------------------------+
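For illustration only, a check digit could be appended with something like the PL/pgSQL function below. The function name and the digit-sum scheme are made up for this sketch; a Luhn-style algorithm would catch more transposition errors.

    -- Append a simple digit-sum check digit to a sequence value, e.g.
    -- with_check_digit(12345) => 123455  (digits sum to 15, so the
    -- check digit is (10 - 15 % 10) % 10 = 5).
    CREATE OR REPLACE FUNCTION with_check_digit(n bigint) RETURNS bigint AS $$
    DECLARE
        s integer := 0;
        d bigint  := n;
    BEGIN
        WHILE d > 0 LOOP
            s := s + (d % 10)::integer;
            d := d / 10;
        END LOOP;
        RETURN n * 10 + (10 - s % 10) % 10;
    END;
    $$ LANGUAGE plpgsql IMMUTABLE;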
April,

> Instead of a sequentially auto-incrementing primary key, I would like a
> random 16 byte character value (presumably hex) to be automatically
> created as the default value for each new record.
>
> I guess I could just use a large random number, created with the
> postgres random() function - however I don't see any way of seeding it,
> or know whether that is necessary.

First, let me point out that there is no reason for the "user key" you are trying to implement to be the same value as the primary key of the table. There are, in fact, a number of good arguments against it, not the least of which is that a 16-byte string will take up 4x the sort memory of a 4-byte integer. I would suggest that you give the table an actual, hidden primary key based on a simple sequence, and a separate unique "user key" for lookups.

This is actually easy to implement through custom functions, triggers, or rules. However, there are some performance implications if your table gets very large, as you would have to prescan for accidental duplicates (in a truly random distribution, collisions become nearly certain given enough records, even with a 16-byte value).

> The purpose of using a random rather than sequential number is to
> prevent people being able to access others' records by guessing.
>
> Has anyone else encountered this challenge, and do you know of a way to
> generate a random default value for the primary key?

Generally, a pseudo-random number is more than adequate. For example, one of my applications generates a pseudo-random session key based on a calculation involving the user_id of the modifying user and the epoch timestamp at which the record was locked. This appears random to the casual eye, and is near-impossible to guess.

-- 
-Josh Berkus
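A minimal sketch of the two-key layout Josh suggests: a hidden sequential primary key for joins, plus a separate random lookup key. The table and column names are illustrative, the md5()/clock_timestamp() default assumes a release that provides those functions, and the UNIQUE constraint is what flags the rare accidental duplicate:

    CREATE TABLE accounts (
        id        serial PRIMARY KEY,           -- internal key, used for joins
        user_key  char(32) NOT NULL UNIQUE      -- random hex key handed to users
                  DEFAULT md5(random()::text || clock_timestamp()::text),
        name      text
    );

    -- The application looks records up by the random key:
    -- SELECT * FROM accounts WHERE user_key = '...';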
I made the primary key

  "authkey" bigint DEFAULT trunc(random()*10^15) NOT NULL

Does that seem reasonable? bigint is 8 bytes. I came up with this prior to receiving your reply.

Since I do have to use the authkey to find records, it seems I would still benefit from having an index on it even if I had a separate 4 byte primary key - so I don't understand how it would save resources or increase performance to avoid making this column the primary key? Admittedly, I don't understand indexes in depth yet; I just assumed that every additional index means additional housekeeping activity each time a record is changed or added.

Thank you,

- April

At 01:49 PM 5/19/2002 -0700, Josh Berkus wrote:
>April,
>
>> Instead of a sequentially auto-incrementing primary key, I would like a
>> random 16 byte character value (presumably hex) to be automatically
>> created as the default value for each new record.
>>
>> I guess I could just use a large random number, created with the
>> postgres random() function - however I don't see any way of seeding it,
>> or know whether that is necessary.
>
>First, let me point out that there is no reason for the "user key" you
>are trying to implement to be the same value as the primary key of the
>table. There are, in fact, a number of good arguments against it, not
>the least of which is that a 16-byte string will take up 4x the sort
>memory of a 4-byte integer. I would suggest that you give the table an
>actual, hidden primary key based on a simple sequence, and a separate
>unique "user key" for lookups.
>
>This is actually easy to implement through custom functions, triggers,
>or rules. However, there are some performance implications if your table
>gets very large, as you would have to prescan for accidental duplicates
>(in a truly random distribution, collisions become nearly certain given
>enough records, even with a 16-byte value).
>
>> The purpose of using a random rather than sequential number is to
>> prevent people being able to access others' records by guessing.
>>
>> Has anyone else encountered this challenge, and do you know of a way
>> to generate a random default value for the primary key?
>
>Generally, a pseudo-random number is more than adequate. For example,
>one of my applications generates a pseudo-random session key based on a
>calculation involving the user_id of the modifying user and the epoch
>timestamp at which the record was locked. This appears random to the
>casual eye, and is near-impossible to guess.
>
>--
>-Josh Berkus
> I made the primary key
>
>   "authkey" bigint DEFAULT trunc(random()*10^15) NOT NULL
>
> Does that seem reasonable? bigint is 8 bytes. I came up with this prior
> to receiving your reply.
>
> Since I do have to use the authkey to find records, it seems I would
> still benefit from having an index on it even if I had a separate 4 byte
> primary key - so I don't understand how it would save resources or
> increase performance to avoid making this column the primary key?
> Admittedly, I don't understand indexes in depth yet; I just assumed that
> every additional index means additional housekeeping activity each time
> a record is changed or added.
>
> Thank you,
>
> - April

Be sure to build into your app some way of handling the error that will happen if a random primary key is already in the table -- resubmitting it should work fine (the odds of this happening twice become _extremely_ unlikely, unless your table is very large).

Anyone know off the top of their heads what the period for PG's random() function is?

- J.
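Alternatively, the retry can live in a small server-side helper instead of the application. This is only a sketch with made-up names (a hypothetical members table with a bigint authkey column), and it assumes a PL/pgSQL version that supports exception handling; it keeps drawing new random keys until the INSERT stops tripping the unique constraint.

    CREATE OR REPLACE FUNCTION insert_member(p_name text) RETURNS bigint AS $$
    DECLARE
        key bigint;
    BEGIN
        LOOP
            key := trunc(random() * 10^15)::bigint;
            BEGIN
                INSERT INTO members (authkey, name) VALUES (key, p_name);
                RETURN key;          -- success: hand the key back to the caller
            EXCEPTION WHEN unique_violation THEN
                NULL;                -- collision: loop and draw a new key
            END;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;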
On Sun, 19 May 2002, Joel Burton wrote:

> > I made the primary key
> >
> >   "authkey" bigint DEFAULT trunc(random()*10^15) NOT NULL
>
> Be sure to build into your app some way of handling the error that will
> happen if a random primary key is already in the table -- resubmitting
> it should work fine (the odds of this happening twice become _extremely_
> unlikely, unless your table is very large).

If you know about how many rows you will be inserting, why not build a table of random ints drawn from a finite pool without replacement? Then, you just have to pick the next unused random int from the random pool table to use in your other table. If you get to the point where the pool is exhausted (surely the number of rows in the pool is less than the period), you just add some more random ints.

> Anyone know off the top of their heads what the period for PG's random()
> function is?

Sorry, not I.

Gord

Matter Realisations                     http://www.materialisations.com/
Gordon Haverland, B.Sc. M.Eng. President
101  9504 182 St. NW    Edmonton, AB, CA  T5T 3A7       780/481-8019
ghaverla @ freenet.edmonton.ab.ca                       780/993-1274 (cell)
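A rough sketch of that key-pool idea, purely for illustration; the table, column, and batch size are made up, and it leans on features (generate_series, UPDATE ... RETURNING) from later PostgreSQL releases than the one discussed in this thread.

    -- The pool itself: a finite supply of pre-generated random keys.
    CREATE TABLE key_pool (
        authkey bigint PRIMARY KEY,
        used    boolean NOT NULL DEFAULT false
    );

    -- Top the pool up in batches; DISTINCT removes duplicates within the
    -- batch, and the primary key rejects the rare clash with an existing
    -- row (just re-run the statement if that happens).
    INSERT INTO key_pool (authkey)
    SELECT DISTINCT trunc(random() * 10^15)::bigint
      FROM generate_series(1, 10000);

    -- Claim the next unused key for a new record; min() is fine because
    -- the stored values are already random.  Concurrent claimers would
    -- also want explicit row locking.
    UPDATE key_pool
       SET used = true
     WHERE authkey = (SELECT min(authkey) FROM key_pool WHERE used = false)
    RETURNING authkey;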
April,

> Since I do have to use the authkey to find records, it seems I would
> still benefit from having an index on it even if I had a separate 4 byte
> primary key - so I don't understand how it would save resources or
> increase performance to avoid making this column the primary key?
> Admittedly, I don't understand indexes in depth yet; I just assumed that
> every additional index means additional housekeeping activity each time
> a record is changed or added.

Oh, yeah, you're right. I should have added the caveat that having the two keys makes sense if you are doing a lot of JOINing on the primary key of this table. If this table does not have a lot of key-related data in other tables, the dual keying does not make sense. Since you seemed to be talking about a central data table, I just assumed that it would have child tables.

And Joel has a good point, which I made somewhat obliquely: you need to be prepared for key violation messages with your method. If you expect many thousands of records, you should do the key-checking up front:

1. Create a PL/pgSQL trigger function called random_authkey();
2. Have the function generate a random number, scan the table for duplicates, and if one is found generate a replacement number and return it as NEW.authkey;
3. Create a BEFORE trigger which calls this function for each INSERT.

This should automate the process for you, although it will slow down your inserts significantly due to the extra index scan required for each insert. Depending on your level of activity, though, you will probably not even notice.

-- 
-Josh Berkus
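For concreteness, a minimal sketch of the trigger approach Josh outlines. All names are illustrative (the same hypothetical members table with a bigint authkey column as in the earlier sketch), and the dollar-quoted PL/pgSQL syntax assumes a later release than the one current at the time of this thread.

    -- Generate a random key and re-roll until no existing row uses it.
    -- The EXISTS probe is the "extra index scan per insert" mentioned above;
    -- keep a UNIQUE constraint on authkey anyway, as a backstop against a
    -- concurrent insert picking the same key.
    CREATE OR REPLACE FUNCTION random_authkey() RETURNS trigger AS $$
    BEGIN
        IF NEW.authkey IS NULL THEN
            NEW.authkey := trunc(random() * 10^15)::bigint;
        END IF;
        WHILE EXISTS (SELECT 1 FROM members WHERE authkey = NEW.authkey) LOOP
            NEW.authkey := trunc(random() * 10^15)::bigint;
        END LOOP;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER members_random_authkey
        BEFORE INSERT ON members
        FOR EACH ROW EXECUTE PROCEDURE random_authkey();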
On Sun, 2002-05-19 at 17:19, April L wrote:
> I made the primary key
>
>   "authkey" bigint DEFAULT trunc(random()*10^15) NOT NULL
>
> Does that seem reasonable? bigint is 8 bytes. I came up with this prior
> to receiving your reply.
>
> Since I do have to use the authkey to find records, it seems I would
> still benefit from having an index on it even if I had a separate 4 byte
> primary key - so I don't understand how it would save resources or
> increase performance to avoid making this column the primary key?
> Admittedly, I don't understand indexes in depth yet; I just assumed that
> every additional index means additional housekeeping activity each time
> a record is changed or added.

You are right that more indexes means more work.

Another benefit of pseudo-random numbers is that keys will be inserted into the tree in, well, pseudo-random order... So what?

Keys that are inserted into the tree in ascending order all get inserted into the right side of the tree. Therefore, postgres must do extra work to keep the tree balanced. (That's the B in b-tree.) Random numbers get inserted all over the tree, thus minimizing the work needed to keep the tree balanced.

If your transactions are SERIALIZABLE, then, since ascending-order keys all get inserted into the right side of the tree, all users are trying to insert into the same nodes, thus causing rollbacks. Keys that go all over the tree will minimize this problem.

-- 
+---------------------------------------------------------+
| Ron Johnson, Jr.       Home: ron.l.johnson@cox.net       |
| Jefferson, LA  USA     http://ronandheather.dhs.org:81   |
|                                                          |
| "I have created a government of whirled peas..."         |
|    Maharishi Mahesh Yogi, 12-May-2002,                   |
|    CNN, Larry King Live                                  |
+---------------------------------------------------------+