Thread: populate table with large csv file
have the table "numbercheck"

 Attribute |    Type    | Modifier
-----------+------------+----------
 svcnumber | integer    | not null
 svcqual   | varchar(9) |
 svcequip  | char(1)    |
 svctroub  | varchar(6) |
 svcrate   | varchar(4) |
 svcclass  | char(1)    |
 trailer   | varchar(3) |
Index: numbercheck_pkey

also have a csv file

7057211380,Y,,,3,B
7057216800,Y,,,3,B
7057265038,Y,,,3,B
7057370261,Y,,,3,B
7057374613,Y,,,3,B
7057371832,Y,,,3,B
4166336554,Y,,,3,B
4166336863,Y,,,3,B
7057201148,Y,,,3,B

aside from parsing the csv file through a PHP interface, what is the easiest way
to get that csv data imported into the postgres database?

thoughts?

thanks

Dave
Dave [Hawk-Systems] wrote:
> aside from parsing the csv file through a PHP interface, what is the easiest way
> to get that csv data imported into the postgres database. thoughts?

see COPY:
http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=sql-copy.html

Joe
Dave [Hawk-Systems] wrote:
> aside from parsing the csv file through a PHP interface, what is the easiest way
> to get that csv data imported into the postgres database. thoughts?

Assuming the CSV file data is well formed, use psql and the COPY command.

In psql, create the table.  Then issue the command:

copy <tablename> from 'filename' using delimiters ',';

--
P. J. "Josh" Rovero                             Sonalysts, Inc.
Email: rovero@sonalysts.com                     www.sonalysts.com
215 Parkway North                               Work: (860)326-3671 or 442-4355
Waterford CT 06385
***********************************************************************
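A minimal sketch of the psql session Josh describes; the table name and file
path below are placeholders rather than details from the thread, and on
servers of this era COPY reads the file on the database server (psql's \copy
is the client-side alternative):

    -- create the target table first, then load it in one shot
    CREATE TABLE mytable (
        key_col  bigint,
        flag_col varchar(9)
    );

    COPY mytable FROM '/absolute/path/to/data.csv' USING DELIMITERS ',';

    -- or, to read the file from the psql client machine instead of the server:
    -- \copy mytable from 'data.csv' using delimiters ','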
On Thu, 2003-09-25 at 11:38, Dave [Hawk-Systems] wrote:
> have the table "numbercheck"
>
>  Attribute |    Type    | Modifier
> -----------+------------+----------
>  svcnumber | integer    | not null
>  svcqual   | varchar(9) |
>  svcequip  | char(1)    |
>  svctroub  | varchar(6) |
>  svcrate   | varchar(4) |
>  svcclass  | char(1)    |
>  trailer   | varchar(3) |
> Index: numbercheck_pkey
>
> also have a csv file
> 7057211380,Y,,,3,B
> 7057216800,Y,,,3,B
> 7057265038,Y,,,3,B
> 7057370261,Y,,,3,B
> 7057374613,Y,,,3,B
> 7057371832,Y,,,3,B
> 4166336554,Y,,,3,B
> 4166336863,Y,,,3,B
> 7057201148,Y,,,3,B
>
> aside from parsing the csv file through a PHP interface, what is the easiest way
> to get that csv data imported into the postgres database? thoughts?

No matter what you do, it's going to barf: svcnumber is a 32-bit integer,
and 7,057,211,380 is significantly out of range.

Once you change svcnumber to bigint, the COPY command will easily suck in
the csv file.

--
-----------------------------------------------------------------
Ron Johnson, Jr.                     ron.l.johnson@cox.net
Jefferson, LA  USA

"Python is executable pseudocode; Perl is executable line noise"
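A sketch of the fix Ron suggests, using the table from the original post; on a
7.3-era server the column type cannot simply be altered in place, so the table
is recreated here, and both the file path and the CSV-to-column mapping are
assumptions:

    -- drop and recreate with a 64-bit key (hypothetical file path)
    DROP TABLE numbercheck;

    CREATE TABLE numbercheck (
        svcnumber bigint NOT NULL PRIMARY KEY,   -- was integer, too small for 7057211380
        svcqual   varchar(9),
        svcequip  char(1),
        svctroub  varchar(6),
        svcrate   varchar(4),
        svcclass  char(1),
        trailer   varchar(3)
    );

    -- the sample rows have six fields, so name the six columns they appear to map to
    COPY numbercheck (svcnumber, svcqual, svcequip, svctroub, svcrate, svcclass)
        FROM '/tmp/numbercheck.csv' USING DELIMITERS ',';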
>> aside from parsing the csv file through a PHP interface, what is the
>> easiest way to get that csv data imported into the postgres database.
>> thoughts?
>
> Assuming the CSV file data is well formed, use psql and
> the COPY command.
>
> In psql, create the table.  Then issue the command:
>
> copy <tablename> from 'filename' using delimiters ',';

Perfect solution that was overlooked.

Unfortunately, processing the 143 MB file, which would result in a database
size of approx 500 MB, takes an eternity. As luck would have it, we can get
away with just dropping to an exec and doing a cat/grep for any data we
need... takes 2-3 seconds.

The copy command is definitely a keeper, as I am now looking at replacing
code elsewhere with a simpler model using it.

Thanks

Dave
On Fri, 2003-09-26 at 06:58, Dave [Hawk-Systems] wrote:
[snip]
> Unfortunately, processing the 143 MB file, which would result in a database
> size of approx 500 MB, takes an eternity. As luck would have it, we can get
> away with just

Something's not right, then.  I loaded 30GB in about 8 hours, on a slow
system, with non-optimized IO.

Did you drop the indexes first?

--
-----------------------------------------------------------------
Ron Johnson, Jr.                     ron.l.johnson@cox.net
Jefferson, LA  USA

"As the night fall does not come at once, neither does oppression. It is in
such twilight that we must all be aware of change in the air - however
slight - lest we become unwitting victims of the darkness."
Justice William O. Douglas
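The bulk-load pattern Ron is hinting at, sketched against the same table; the
constraint name follows the original post, but the exact statements and the
file path are illustrative rather than taken from the thread:

    -- drop the primary key index, load, then rebuild and re-analyze
    ALTER TABLE numbercheck DROP CONSTRAINT numbercheck_pkey;

    COPY numbercheck FROM '/tmp/numbercheck.csv' USING DELIMITERS ',';

    ALTER TABLE numbercheck ADD CONSTRAINT numbercheck_pkey PRIMARY KEY (svcnumber);
    ANALYZE numbercheck;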
We are doing some performance testing among various databases (Oracle, MySQL,
and Postgres). One of the queries is showing Postgres lagging quite a bit:

SELECT count(*)
FROM commercial_entity, country, user_account, address_list
LEFT JOIN state_province ON address_list.state_province_id =
    state_province.state_province_id
LEFT JOIN contact_info ON address_list.contact_info_id =
    contact_info.contact_info_id
WHERE address_list.address_type_id = 101
AND commercial_entity.commercial_entity_id = address_list.commercial_entity_id
AND address_list.country_id = country.country_id
AND commercial_entity.user_account_id = user_account.user_account_id
AND user_account.user_role_id IN (101, 101);

I ran a "vacuum analyze" after realizing that I had loaded all the data into
the database without redoing the statistics; the query jumped from 19 seconds
to 41 seconds _after_ the analyze.

I'd also like to make sure my query is written correctly - I want the count of
records where commercial_entity matches user_account, address_list, and
country, with left outer joins from address_list to state_province and from
address_list to contact_info.

Finally, I read some posts on shared_buffers; they stated that shared_buffers
should be set to 1/4 to 1/5 of total memory available. Is that correct? I give
the MySQL/InnoDB buffers about 70% of the 2 gig on the machine.

Here's the explain (I'm not too familiar with reading a Postgres explain...):

--------------------------------------------------------------------------------
 Aggregate  (cost=52951.09..52951.09 rows=1 width=116)
   ->  Merge Join  (cost=52941.61..52950.83 rows=105 width=116)
         Merge Cond: ("outer".country_id = "inner".country_id)
         ->  Index Scan using country_pkey on country  (cost=0.00..7.54 rows=231 width=11)
         ->  Sort  (cost=52941.61..52941.88 rows=105 width=105)
               Sort Key: address_list.country_id
               ->  Merge Join  (cost=52729.54..52938.07 rows=105 width=105)
                     Merge Cond: ("outer".commercial_entity_id = "inner".commercial_entity_id)
                     ->  Sort  (cost=8792.01..8792.52 rows=201 width=36)
                           Sort Key: commercial_entity.commercial_entity_id
                           ->  Nested Loop  (cost=0.00..8784.31 rows=201 width=36)
                                 ->  Index Scan using usr_acc_usr_role_id_i on user_account  (cost=0.00..2403.08 rows=1401 width=12)
                                       Index Cond: (user_role_id = 101::numeric)
                                 ->  Index Scan using comm_ent_usr_acc_id_i on commercial_entity  (cost=0.00..4.54 rows=1 width=24)
                                       Index Cond: (commercial_entity.user_account_id = "outer".user_account_id)
                     ->  Sort  (cost=43937.53..44173.84 rows=94526 width=69)
                           Sort Key: address_list.commercial_entity_id
                           ->  Merge Join  (cost=29019.03..32585.73 rows=94526 width=69)
                                 Merge Cond: ("outer".contact_info_id = "inner".contact_info_id)
                                 ->  Index Scan using contact_info_pkey on contact_info  (cost=0.00..3366.76 rows=56435 width=12)
                                 ->  Sort  (cost=29019.03..29255.34 rows=94526 width=57)
                                       Sort Key: address_list.contact_info_id
                                       ->  Merge Join  (cost=16930.18..18354.55 rows=94526 width=57)
                                             Merge Cond: ("outer".state_province_id = "inner".state_province_id)
                                             ->  Index Scan using state_province_pkey on state_province  (cost=0.00..3.81 rows=67 width=11)
                                             ->  Sort  (cost=16930.18..17166.50 rows=94526 width=46)
                                                   Sort Key: address_list.state_province_id
                                                   ->  Seq Scan on address_list  (cost=0.00..6882.52 rows=94526 width=46)
                                                         Filter: (address_type_id = 101::numeric)

What's the "Sort (cost...)"?
I noticed that joining address_list to country was slow; there was no index on
just country_id (there were composite indexes on multiple columns), so I added
one and did a vacuum analyze on the table, and got:

 Aggregate  (cost=54115.74..54115.74 rows=1 width=116)
   ->  Merge Join  (cost=54105.91..54115.46 rows=109 width=116)
         Merge Cond: ("outer".country_id = "inner".country_id)
         ->  Index Scan using country_pkey on country  (cost=0.00..7.54 rows=231 width=11)
         ->  Sort  (cost=54105.91..54106.19 rows=110 width=105)
               Sort Key: address_list.country_id
               ->  Merge Join  (cost=53884.34..54102.18 rows=110 width=105)
                     Merge Cond: ("outer".commercial_entity_id = "inner".commercial_entity_id)
                     ->  Sort  (cost=8792.01..8792.52 rows=201 width=36)
                           Sort Key: commercial_entity.commercial_entity_id
                           ->  Nested Loop  (cost=0.00..8784.31 rows=201 width=36)
                                 ->  Index Scan using usr_acc_usr_role_id_i on user_account  (cost=0.00..2403.08 rows=1401 width=12)
                                       Index Cond: (user_role_id = 101::numeric)
                                 ->  Index Scan using comm_ent_usr_acc_id_i on commercial_entity  (cost=0.00..4.54 rows=1 width=24)
                                       Index Cond: (commercial_entity.user_account_id = "outer".user_account_id)
                     ->  Sort  (cost=45092.32..45335.37 rows=97221 width=69)
                           Sort Key: address_list.commercial_entity_id
                           ->  Merge Join  (cost=29770.81..33338.09 rows=97221 width=69)
                                 Merge Cond: ("outer".contact_info_id = "inner".contact_info_id)
                                 ->  Index Scan using contact_info_pkey on contact_info  (cost=0.00..3366.76 rows=56435 width=12)
                                 ->  Sort  (cost=29770.81..30013.86 rows=97221 width=57)
                                       Sort Key: address_list.contact_info_id
                                       ->  Merge Join  (cost=17271.79..18731.55 rows=97221 width=57)
                                             Merge Cond: ("outer".state_province_id = "inner".state_province_id)
                                             ->  Index Scan using state_province_pkey on state_province  (cost=0.00..3.81 rows=67 width=11)
                                             ->  Sort  (cost=17271.79..17514.84 rows=97221 width=46)
                                                   Sort Key: address_list.state_province_id
                                                   ->  Seq Scan on address_list  (cost=0.00..6882.52 rows=97221 width=46)
                                                         Filter: (address_type_id = 101::numeric)

No difference. Note that all the keys used in the joins are numeric(10)'s, so
there shouldn't be any cast issues.

When you create a primary key on a table, is an index created? (I seem to
remember a message going by stating that an index would be added.)

For comparison, our production Oracle database (running on nearly identical
hardware - the Postgres machine has IDE RAID-5 and the Oracle machine has RAID
mirroring) takes between 1 and 2 seconds.

I've got one last question, and I really hope responses don't get sidetracked
by it; I see a lot of negative comments towards MySQL, many of them stating
that it's a database layer on top of the file system. Can someone explain why
Postgres is better than MySQL 4.0.14 using InnoDB? MySQL, on the above query,
with one less index (on address_list.country), takes 0.20 seconds.

David.
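Presumably the added index and analyze were something along these lines; the
index name here is a guess, not taken from the thread:

    CREATE INDEX addr_list_country_id_i ON address_list (country_id);
    VACUUM ANALYZE address_list;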
On Sunday 28 September 2003 09:19, David Griffiths wrote:
> No difference. Note that all the keys used in the joins are
> numeric(10)'s, so there shouldn't be any cast issues.

Can you make them bigint and see? It might make some difference, perhaps.

Checking the plan in the meantime..

BTW, what tuning did you do to postgresql? Check
http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html, assuming you
haven't seen it earlier..

HTH

Shridhar
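A sketch of the sort of postgresql.conf settings the varlena tuning page
covers; the numbers below are only illustrative starting points for a 2 GB
machine of that era, not recommendations made anywhere in this thread:

    # shared_buffers and effective_cache_size are in 8 KB pages, sort_mem in KB
    shared_buffers = 32768          # ~256 MB; nowhere near the 70% given to InnoDB
    sort_mem = 16384                # ~16 MB per sort; the plans above sort ~100k rows
    effective_cache_size = 131072   # ~1 GB; tells the planner how much the OS caches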
On Sat, 27 Sep 2003, David Griffiths wrote:

> We are doing some performance testing among various databases (Oracle,
> MySQL, and Postgres).
>
> One of the queries is showing Postgres lagging quite a bit:
>
> SELECT count(*)
> FROM commercial_entity, country, user_account, address_list
> LEFT JOIN state_province ON address_list.state_province_id =
>     state_province.state_province_id
> LEFT JOIN contact_info ON address_list.contact_info_id =
>     contact_info.contact_info_id
> WHERE address_list.address_type_id = 101
> AND commercial_entity.commercial_entity_id = address_list.commercial_entity_id
> AND address_list.country_id = country.country_id
> AND commercial_entity.user_account_id = user_account.user_account_id
> AND user_account.user_role_id IN (101, 101);

I guess that this question has been discussed very often - but I cannot
remember why exactly. Is there a pointer to a technical explanation? Has it
something to do with MVCC? But is it one of MVCC's benefits that we can make
a consistent online backup without archiving redo logs (Oracle can't, DB2
can)? Is DB2 slower than Oracle in such cases (count(*)) as well?

Workaround:
We can sometimes fake a bit to avoid such costly queries and set up a trigger
that calls a function that increases a counter in a separate counter table.
Then we are lightning-fast.

But many users complain about PostgreSQL's poor count(*) performance. That's
true, and it can be critical when someone wants to replace another database
product with PostgreSQL.
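A sketch of the counter-table workaround Holger describes; all the names are
illustrative, it assumes PL/pgSQL is installed in the database, and the
quoted function body reflects pre-8.0 syntax (no dollar quoting):

    CREATE TABLE row_counter (tablename text PRIMARY KEY, n bigint NOT NULL);
    INSERT INTO row_counter VALUES ('address_list', 0);

    CREATE FUNCTION bump_row_counter() RETURNS trigger AS '
    BEGIN
        IF TG_OP = ''INSERT'' THEN
            UPDATE row_counter SET n = n + 1 WHERE tablename = TG_RELNAME;
        ELSE
            UPDATE row_counter SET n = n - 1 WHERE tablename = TG_RELNAME;
        END IF;
        RETURN NULL;   -- AFTER trigger, so the return value is ignored
    END;
    ' LANGUAGE plpgsql;

    CREATE TRIGGER address_list_count
        AFTER INSERT OR DELETE ON address_list
        FOR EACH ROW EXECUTE PROCEDURE bump_row_counter();

    -- the cheap substitute for SELECT count(*) FROM address_list:
    SELECT n FROM row_counter WHERE tablename = 'address_list';

One caveat: every insert and delete now updates the same counter row, so
heavily concurrent writers will serialize on it.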
Holger Marzen <holger@marzen.de> writes:
> But many users complain about PostgreSQL's poor count(*) performance,

I don't think that's relevant here.  Some other DBs have shortcuts for
determining the total number of rows in a single table, that is they can do
"SELECT COUNT(*) FROM a_table" quickly, but David's query is messy enough
that I can't believe anyone can actually do it without forming the join
result.

What I'd ask for is EXPLAIN ANALYZE output.  Usually, if a complex query is
slower than it should be, it's because the planner is picking a bad plan.
So you need to look at how its estimates diverge from reality.  But plain
EXPLAIN doesn't show the reality, only the estimates ...

David, could we see EXPLAIN ANALYZE for the query, and also the table
schemas (psql \d displays would do)?  Also, please take it to
pgsql-performance, it's not really on-topic for pgsql-general.

			regards, tom lane
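What Tom is asking for would look like this in a psql session, using the query
from earlier in the thread (note that EXPLAIN ANALYZE actually executes the
query, so it takes as long as the query itself):

    EXPLAIN ANALYZE
    SELECT count(*)
    FROM commercial_entity, country, user_account, address_list
    LEFT JOIN state_province
           ON address_list.state_province_id = state_province.state_province_id
    LEFT JOIN contact_info
           ON address_list.contact_info_id = contact_info.contact_info_id
    WHERE address_list.address_type_id = 101
      AND commercial_entity.commercial_entity_id = address_list.commercial_entity_id
      AND address_list.country_id = country.country_id
      AND commercial_entity.user_account_id = user_account.user_account_id
      AND user_account.user_role_id IN (101, 101);

    -- plus the schemas, one per table:
    \d commercial_entity
    \d address_list
    \d user_account
    \d country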
> I guess that this question has been discussed very often - but I cannot
> remember why exactly. Is there a pointer to a technical explanation? Has it
> something to do with MVCC? But is it one of MVCC's benefits that we can make
> a consistent online backup without archiving redo logs (Oracle can't, DB2
> can)? Is DB2 slower than Oracle in such cases (count(*)) as well?
>
> Workaround:
> We can sometimes fake a bit to avoid such costly queries and set up a
> trigger that calls a function that increases a counter in a separate
> counter table. Then we are lightning-fast.
>
> But many users complain about PostgreSQL's poor count(*) performance.
> That's true, and it can be critical when someone wants to replace another
> database product with PostgreSQL.

This is but one of many tests we're doing. The count(*) performance is not
the deciding factor. This query was pulled from our production system, and
I've extracted the exact tables and data from the production system to test.

MySQL with MyISAM does in fact cheat on the count(*). InnoDB does not,
however. The "explain" indicates that it's doing the work, and analyzing the
tables dropped the cost of the query from .35 seconds to .20 seconds.

Here's the same query, but selecting data (to test the database's ability to
find a single row quickly):

SELECT current_timestamp;

SELECT company_name, address_1, address_2, address_3, city,
       address_list.state_province_id, state_province_short_desc, country_desc,
       zip_code, address_list.country_id, contact_info.email,
       commercial_entity.user_account_id, phone_num_1, phone_num_fax, website,
       boats_website
FROM commercial_entity, country, user_account, address_list
LEFT JOIN state_province ON address_list.state_province_id =
    state_province.state_province_id
LEFT JOIN contact_info ON address_list.contact_info_id =
    contact_info.contact_info_id
WHERE address_list.address_type_id = 101
AND commercial_entity.commercial_entity_id = 225528
AND commercial_entity.commercial_entity_id = address_list.commercial_entity_id
AND address_list.country_id = country.country_id
AND commercial_entity.user_account_id = user_account.user_account_id
AND user_account.user_role_id IN (101, 101);

SELECT current_timestamp;

Postgres takes about 33 seconds to get the row back.
Here's the "EXPLAIN": Nested Loop (cost=0.00..64570.33 rows=1 width=385) -> Nested Loop (cost=0.00..64567.30 rows=1 width=361) -> Nested Loop (cost=0.00..64563.97 rows=1 width=349) Join Filter: ("outer".commercial_entity_id = "inner".commercial_entity_id) -> Index Scan using commercial_entity_pkey on commercial_entity (cost=0.00..5.05 rows=1 width=94) Index Cond: (commercial_entity_id = 225528::numeric) -> Materialize (cost=63343.66..63343.66 rows=97221 width=255) -> Merge Join (cost=0.00..63343.66 rows=97221 width=255) Merge Cond: ("outer".contact_info_id = "inner".contact_info_id) -> Nested Loop (cost=0.00..830457.52 rows=97221 width=222) Join Filter: ("outer".state_province_id = "inner".state_province_id) -> Index Scan using addr_list_ci_id_i on address_list (cost=0.00..586676.65 rows=97221 width=205) Filter: (address_type_id = 101::numeric) -> Seq Scan on state_province (cost=0.00..1.67 rows=67 width=17) -> Index Scan using contact_info_pkey on contact_info (cost=0.00..3366.76 rows=56435 width=33) -> Index Scan using user_account_pkey on user_account (cost=0.00..3.32 rows=1 width=12) Index Cond: ("outer".user_account_id = user_account.user_account_id) Filter: (user_role_id = 101::numeric) -> Index Scan using country_pkey on country (cost=0.00..3.01 rows=1 width=24) Index Cond: ("outer".country_id = country.country_id) (20 rows) David.
> David, could we see EXPLAIN ANALYZE for the query, and also the table
> schemas (psql \d displays would do)?  Also, please take it to
> pgsql-performance, it's not really on-topic for pgsql-general.
>
> 			regards, tom lane

Will do.

Thanks,
David.
Holger Marzen wrote:

> On Sat, 27 Sep 2003, David Griffiths wrote:
>
>> We are doing some performance testing among various databases (Oracle,
>> MySQL, and Postgres).
>>
>> One of the queries is showing Postgres lagging quite a bit:
>>
>> SELECT count(*)
>> FROM commercial_entity, country, user_account, address_list
>> LEFT JOIN state_province ON address_list.state_province_id =
>>     state_province.state_province_id
>> LEFT JOIN contact_info ON address_list.contact_info_id =
>>     contact_info.contact_info_id
>> WHERE address_list.address_type_id = 101
>> AND commercial_entity.commercial_entity_id = address_list.commercial_entity_id
>> AND address_list.country_id = country.country_id
>> AND commercial_entity.user_account_id = user_account.user_account_id
>> AND user_account.user_role_id IN (101, 101);
>
> I guess that this question has been discussed very often - but I cannot
> remember why exactly. Is there a pointer to a technical explanation? Has it
> something to do with MVCC? But is it one of MVCC's benefits that we can make
> a consistent online backup without archiving redo logs (Oracle can't, DB2
> can)? Is DB2 slower than Oracle in such cases (count(*)) as well?
>
> Workaround:
> We can sometimes fake a bit to avoid such costly queries and set up a
> trigger that calls a function that increases a counter in a separate
> counter table. Then we are lightning-fast.
>
> But many users complain about PostgreSQL's poor count(*) performance.
> That's true, and it can be critical when someone wants to replace another
> database product with PostgreSQL.

Yup, it'd be nice to have faster count(*) performance.