Thread: How to improve insert speed with index on text column
Hi all,

I am using a PostgreSQL database for our project and doing some performance
testing. We need to insert millions of records with indexed columns. The
table has 5 columns. With an index on the integer column only, performance
is good, but when I created an index on the text column as well, performance
dropped to 1/8th of what it was. My question is: how can I improve
performance when inserting data with an index on the text column?

Thanks,
Saurabh
On Mon, Jan 30, 2012 at 6:27 AM, Saurabh <saurabh.b85@gmail.com> wrote:
> Hi all,
>
> I am using Postgresql database for our project and doing some
> performance testing. We need to insert millions of record with indexed
> columns. We have 5 columns in table. I created index on integer only
> then performance is good but when I created index on text column as
> well then the performance reduced to 1/8th times. My question is how I
> can improve performance when inserting data using index on text
> column?

Post all the necessary details: schema, table and index sizes, some config.

Assuming your text column is a long one, this results in really big
indices. Assuming you only search by equality, you can make it a lot
faster by hashing. Last time I checked, hash indices were quite limited
and performed badly, but I've heard they've improved quite a bit. If hash
indices don't work for you, you can always build the equivalent on top of
btree indices by indexing on the expression hash(column) and comparing as
hash(value) = hash(column) AND value = column.

On a table indexed by URL I have, this improved things immensely. Both
lookup and insertion times improved.
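A sketch of the hash-on-expression approach Claudio describes (table and
column names are hypothetical; `hashtext()` is an internal, undocumented
PostgreSQL function, so `md5()` is a documented if slower alternative):

```sql
-- Hypothetical table indexed by URL. Instead of a btree on the full
-- text, index the hash of the text: the index entries stay small and
-- fixed-width no matter how long the URLs get.
CREATE TABLE pages (id bigint, url text);

CREATE INDEX pages_url_hash_idx ON pages (hashtext(url));

-- Equality lookup: the hash comparison uses the index; the value
-- comparison weeds out hash collisions.
SELECT id
  FROM pages
 WHERE hashtext(url) = hashtext('http://example.com/')
   AND url = 'http://example.com/';
```

Note this only supports equality searches; range or LIKE queries on the
text column cannot use such an index.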
On Mon, Jan 30, 2012 at 1:27 AM, Saurabh <saurabh.b85@gmail.com> wrote:
> Hi all,
>
> I am using Postgresql database for our project and doing some
> performance testing. We need to insert millions of record with indexed
> columns. We have 5 columns in table. I created index on integer only
> then performance is good but when I created index on text column as
> well then the performance reduced to 1/8th times.

Inserting into an indexed table causes a lot of random access to the
underlying index (unless the data is inserted in an order which
corresponds to the index order of all indexes, which is not likely to
happen with multiple indexes). As soon as your indexes don't fit in
cache, your performance will collapse.

What if you don't have the integer index, but just the text one? What is
the average length of the data in the text field? Is your system
CPU-limited or IO-limited during the load?

> My question is how I
> can improve performance when inserting data using index on text
> column?

The only "magic" answer is to drop the index and rebuild it after the
insert. If that doesn't work for you, then you have to identify your
bottleneck and fix it. That can't be done with just the information you
provided.

Cheers,
Jeff
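The drop-and-rebuild approach Jeff mentions, sketched in SQL (table,
index, and file names are made up; note the drop and the rebuild are
deliberately *not* wrapped in one transaction, for reasons that come up
later in the thread):

```sql
DROP INDEX IF EXISTS companies_name_idx;

-- Bulk-load without index maintenance overhead.
COPY companies FROM '/tmp/load.csv' WITH (FORMAT csv);

-- Building the index in one pass is far cheaper than maintaining it
-- row by row; give the build plenty of sort memory first.
SET maintenance_work_mem = '512MB';
CREATE INDEX companies_name_idx ON companies (company_name);
```

Queries that relied on the index will still run while it is gone, just
slowly, via sequential scans.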
Thank you for the information.

Schema of the table is:

    ID           bigint
    company_name text
    data_set     text
    time         timestamp
    Date         date

The length of company_name is not known in advance, so it is of type text.
I need to build the index on company_name and ID and then insert the
records. I cannot create the index after insertion, because users can
search the data while the insert is still running.

The machine has 8 cores, runs CentOS 6, and has 8 GB of RAM. Here is my
configuration:

    max_connections = 100
    shared_buffers = 32MB
    wal_buffers = 1024KB
    checkpoint_segments = 3
On Mon, Jan 30, 2012 at 2:46 PM, Saurabh <saurabh.b85@gmail.com> wrote:
> max_connections = 100
> shared_buffers = 32MB
> wal_buffers = 1024KB
> checkpoint_segments = 3

That's a default config, isn't it? You'd do well to try and optimize it
for your system. The defaults are really, really conservative.

You should also consider normalizing. I'm assuming company_name could be
a company_id (i.e., each company will have many rows)? Otherwise I cannot
see how you'd expect to be *constantly* inserting millions of rows. If
it's a one-time initialization, just drop the indices and recreate them
afterwards, as has been suggested. If you create new records all the
time, I'd bet you'll also have many rows with the same company_name, so
normalizing would be a clear win.
On 1/30/2012 3:27 AM, Saurabh wrote:
> Hi all,
>
> I am using Postgresql database for our project and doing some
> performance testing. We need to insert millions of record with indexed
> columns. We have 5 columns in table. I created index on integer only
> then performance is good but when I created index on text column as
> well then the performance reduced to 1/8th times. My question is how I
> can improve performance when inserting data using index on text
> column?
>
> Thanks,
> Saurabh

Do it in a single transaction, and use COPY.

-Andy
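Andy's two suggestions, sketched together (table and file names are
hypothetical):

```sql
BEGIN;
-- COPY is the bulk-load path: one command, minimal per-row overhead,
-- and one commit instead of millions. Server-side COPY needs superuser
-- and a file readable by the server process; from psql, the client-side
-- equivalent is \copy.
COPY companies (id, company_name) FROM '/tmp/load.csv' WITH (FORMAT csv);
COMMIT;
```

Compared to autocommitted single-row INSERTs, this avoids a WAL flush
per row.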
On Mon, Jan 30, 2012 at 9:46 AM, Saurabh <saurabh.b85@gmail.com> wrote:
> Thank you for the information.
>
> Schema of table is:
>
> ID bigint
> company_name text
> data_set text
> time timestamp
> Date date
>
> Length of company_name is not known so it is of datatype text. I need
> to build the index on company_name and ID. And then insert the
> records. I can not create the index after insertion because user can
> search the data as well while insertion.
>
> Machine is of 8 core, os centos6 and 8 GB of RAM.
>
> Here is my configuration:
>
> shared_buffers = 32MB

That is very small for your server. I'd use at least 512MB, and maybe 2GB.

> wal_buffers = 1024KB

If you are using 9.1, I would leave this at the default of -1 and let the
database decide for itself what to use.
I changed the configuration in postgresql.conf. Following are the changed
parameters:

    shared_buffers = 1GB
    maintenance_work_mem = 50MB
    checkpoint_segments = 64
    wal_buffers = 5MB
    autovacuum = off

I inserted the records into the database and got very good performance:
it increased by a factor of 6.

Can you please tell me the purpose of the shared_buffers and
maintenance_work_mem parameters?

Thanks,
Saurabh
On 31 Leden 2012, 10:29, Saurabh wrote:
> I changed the configuration in postgresql.conf. Following are the
> changed parameters:
>
> shared_buffers = 1GB
> maintenance_work_mem = 50MB
> checkpoint_segments = 64
> wal_buffers = 5MB
> autovacuum = off
>
> Insert the records in the database and got a very good performance it
> is increased by 6 times.
>
> Can you please tell me the purpose of shared_buffer and
> maintenance_work_mem parameter?

shared_buffers is the cache maintained by PostgreSQL; all the data that
you read or write has to go through shared buffers.

maintenance_work_mem specifies how much memory "maintenance tasks" (e.g.
autovacuum, REINDEX, etc.) may use. It is analogous to work_mem for
regular queries (sorting, grouping, ...).

Tomas
> Shared buffers is the cache maintained by PostgreSQL. All the data
> that you read/write needs to go through shared buffers.

While this is technically true, I need to point out that you generally
increase shared_buffers for high concurrency and for reads, not for
writes, especially for row-at-a-time inserts. There's just not that much
memory required (although more than the out-of-the-box defaults).

I'd suggest increasing wal_buffers to 16MB, which is the maximum useful
amount, rather than 5MB.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Jan 31, 2012 at 12:46 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Shared buffers is the cache maintained by PostgreSQL. All the data
>> that you read/write needs to go through shared buffers.
>
> While this is technically true, I need to point out that you generally
> increase shared_buffers for high concurrency, and for reads, not for
> writes, especially for row-at-a-time inserts. There's just not that
> much memory required (although more than the out-of-the-box defaults).
>
> I'd suggest increasing wal_buffers to 16MB, which is the maximum useful
> amount, rather than 5MB.

yeah -- postgresql.conf settings are not going to play a big role here
unless:
*) you defer index build to the end of the load, and do CREATE INDEX
   and crank maintenance_work_mem
*) you are doing lots of transactions and relax your sync policy via
   synchronous_commit

what's almost certainly happening here is that the text index is writing
out a lot more data. what's the average length of your key?

If I'm inspecting sizes of tables/indexes which start with 'foo', I can
do this:

    postgres=# select relname, pg_size_pretty(pg_relation_size(relname::text))
                 from pg_class where relname like 'foo%';
      relname  | pg_size_pretty
    -----------+----------------
     foo       | 40 kB
     foo_i_idx | 40 kB
     foo_t_idx | 40 kB

We'd like to see the numbers for your table/indexes in question.

merlin
On Tue, Jan 31, 2012 at 10:46 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> Shared buffers is the cache maintained by PostgreSQL. All the data
>> that you read/write needs to go through shared buffers.
>
> While this is technically true, I need to point out that you generally
> increase shared_buffers for high concurrency, and for reads, not for
> writes, especially for row-at-a-time inserts. There's just not that
> much memory required (although more than the out-of-the-box defaults).

When inserting rows in bulk (even just with inserts in a tight loop)
into indexed tables, I often see performance collapse soon after the
active index size exceeds shared_buffers. Or at least, shared_buffers
plus however much dirty data the kernel is willing to tolerate. But that
latter value is hard to predict. Increasing shared_buffers can really
help a lot here.

I'm sure this behavior depends greatly on your IO subsystem.

Cheers,
Jeff
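One way to watch for the collapse point Jeff describes is to compare the
growing index against the cache size as the load runs (the index name
here is hypothetical):

```sql
-- Size of the index being maintained during the load...
SELECT pg_size_pretty(pg_relation_size('companies_name_idx'));

-- ...versus the configured cache.
SHOW shared_buffers;
```

Once the first number passes the second, insert throughput typically
starts to degrade.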
On Mon, Jan 30, 2012 at 9:46 AM, Saurabh <saurabh.b85@gmail.com> wrote:
> I can not create the index after insertion because user can
> search the data as well while insertion.

Remember, DDL is transactional in PostgreSQL. In principle, you should
be able to drop the index, do your inserts, and re-create the index
without affecting concurrent users, if you do all of that inside an
explicit transaction. Doing the inserts inside a transaction may speed
them up, as well.

rls

-- 
:wq
On Wed, Feb 1, 2012 at 12:29 AM, Rosser Schwarz
<rosser.schwarz@gmail.com> wrote:
> Remember, DDL is transactional in PostgreSQL. In principle, you
> should be able to drop the index, do your inserts, and re-create the
> index without affecting concurrent users, if you do all of that inside
> an explicit transaction. Doing the inserts inside a transaction may
> speed them up, as well.

Creating an index requires an update lock on the table, and an exclusive
lock on the system catalog. Even with "CONCURRENTLY" it's held for a
short while. So it does affect concurrent users.
On Wed, Feb 1, 2012 at 12:49 AM, Claudio Freire
<klaussfreire@gmail.com> wrote:
> On Wed, Feb 1, 2012 at 12:29 AM, Rosser Schwarz
> <rosser.schwarz@gmail.com> wrote:
>> Remember, DDL is transactional in PostgreSQL. In principle, you
>> should be able to drop the index, do your inserts, and re-create the
>> index without affecting concurrent users, if you do all of that inside
>> an explicit transaction. Doing the inserts inside a transaction may
>> speed them up, as well.
>
> Creating an index requires an update lock on the table, and an
> exclusive lock on the system catalog.
> Even though with "CONCURRENTLY" it's only for a short while.
> So it does affect concurrent users.

Forgot to mention that if you don't commit the drop, you see no
performance increase. So:

    begin
    drop
    insert
    create
    commit

does not work to improve performance. At all.
On Tue, Jan 31, 2012 at 1:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> yeah -- postgresql.conf settings are not going to play a big role here unless:
> *) you defer index build to the end of the load, and do CREATE INDEX
> and crank maintenance_work_mem
> *) you are doing lots of transactions and relax your sync policy via
> synchronous_commit

Raising checkpoint_segments also sometimes helps when loading large
amounts of data.

> what's almost certainly happening here is that the text index is
> writing out a lot more data. what's the average length of your key?

Yeah, the OP really needs to switch to hashes for the indexes. And as
another poster mentioned, it's often faster to use a btree of hashes
than the built-in hash index type.
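For reference, relaxing the sync policy that Merlin's quoted advice
mentions is a one-liner per session; unlike fsync = off, it cannot
corrupt the database, though the last few commits can be lost on a
crash:

```sql
-- Commits return before their WAL is flushed to disk; each flush then
-- covers a batch of transactions instead of one.
SET synchronous_commit = off;
-- ... run the inserts ...
```

It can also be set per transaction, or globally in postgresql.conf.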