Thread: How to improve insert speed with index on text column
Hi all,

I am using a PostgreSQL database for our project and doing some performance
testing. We need to insert millions of records with indexed columns. The
table has 5 columns. With an index on the integer column only, performance
is good, but when I created an index on the text column as well, performance
dropped to 1/8th of what it was. My question is: how can I improve
performance when inserting data with an index on the text column?

Thanks,
Saurabh
On Mon, Jan 30, 2012 at 6:27 AM, Saurabh <saurabh.b85@gmail.com> wrote:
> Hi all,
>
> I am using Postgresql database for our project and doing some
> performance testing. We need to insert millions of record with indexed
> columns. We have 5 columns in table. I created index on integer only
> then performance is good but when I created index on text column as
> well then the performance reduced to 1/8th times. My question is how I
> can improve performance when inserting data using index on text
> column?

Post all the necessary details: schema, table and index sizes, some config.

Assuming your text column is a long one, this results in really big
indices. Assuming you only search by equality, you can make it a lot
faster by hashing. Last time I checked, hash indices were quite limited
and performed badly, but I've heard they've improved quite a bit. If hash
indices don't work for you, you can always build the equivalent on top of
btree indices by indexing on the expression hash(column) and comparing as
hash(value) = hash(column) AND value = column.

On a table indexed by URL I have, this improved things immensely. Both
lookup and insertion times improved.
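A sketch of the hash-on-expression approach Claudio describes (table and
column names are hypothetical; `hashtext()` is an internal, undocumented
PostgreSQL function, so `md5()` is a documented if slower alternative):

```sql
-- Hypothetical table indexed by URL. Instead of a btree on the full
-- text, index the hash of the text: the index entries stay small and
-- fixed-width no matter how long the URLs get.
CREATE TABLE pages (id bigint, url text);

CREATE INDEX pages_url_hash_idx ON pages (hashtext(url));

-- Equality lookup: the hash comparison uses the index; the value
-- comparison weeds out hash collisions.
SELECT id
  FROM pages
 WHERE hashtext(url) = hashtext('http://example.com/')
   AND url = 'http://example.com/';
```

Note this only supports equality searches; range or LIKE queries on the
text column cannot use such an index.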
On Mon, Jan 30, 2012 at 1:27 AM, Saurabh <saurabh.b85@gmail.com> wrote:
> Hi all,
>
> I am using Postgresql database for our project and doing some
> performance testing. We need to insert millions of record with indexed
> columns. We have 5 columns in table. I created index on integer only
> then performance is good but when I created index on text column as
> well then the performance reduced to 1/8th times.

Inserting into an indexed table causes a lot of random access to the
underlying index (unless the data is inserted in an order which
corresponds to the index order of all indexes, which is not likely to
happen with multiple indexes). As soon as your indexes don't fit in
cache, your performance will collapse.

What if you don't have the integer index, but just the text one? What is
the average length of the data in the text field? Is your system
CPU-limited or IO-limited during the load?

> My question is how I
> can improve performance when inserting data using index on text
> column?

The only "magic" answer is to drop the index and rebuild it after the
insert. If that doesn't work for you, then you have to identify your
bottleneck and fix it. That can't be done with just the information you
provided.

Cheers,
Jeff
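The drop-and-rebuild approach Jeff mentions, sketched in SQL (table,
index, and file names are made up; note the drop and the rebuild are
deliberately *not* wrapped in one transaction, for reasons that come up
later in the thread):

```sql
DROP INDEX IF EXISTS companies_name_idx;

-- Bulk-load without index maintenance overhead.
COPY companies FROM '/tmp/load.csv' WITH (FORMAT csv);

-- Building the index in one pass is far cheaper than maintaining it
-- row by row; give the build plenty of sort memory first.
SET maintenance_work_mem = '512MB';
CREATE INDEX companies_name_idx ON companies (company_name);
```

Queries that relied on the index will still run while it is gone, just
slowly, via sequential scans.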
Thank you for the information.

Schema of the table is:

    ID           bigint
    company_name text
    data_set     text
    time         timestamp
    Date         date

The length of company_name is not known in advance, so it is of type text.
I need to build the index on company_name and ID and then insert the
records. I cannot create the index after insertion, because users can
search the data while the insert is still running.

The machine has 8 cores, runs CentOS 6, and has 8 GB of RAM. Here is my
configuration:

    max_connections = 100
    shared_buffers = 32MB
    wal_buffers = 1024KB
    checkpoint_segments = 3
On Mon, Jan 30, 2012 at 2:46 PM, Saurabh <saurabh.b85@gmail.com> wrote:
> max_connections = 100
> shared_buffers = 32MB
> wal_buffers = 1024KB
> checkpoint_segments = 3

That's a default config, isn't it? You'd do well to try and optimize it
for your system. The defaults are really, really conservative.

You should also consider normalizing. I'm assuming company_name could be
a company_id (i.e., each company will have many rows)? Otherwise I cannot
see how you'd expect to be *constantly* inserting millions of rows. If
it's a one-time initialization, just drop the indices and recreate them
afterwards, as has been suggested. If you create new records all the
time, I'd bet you'll also have many rows with the same company_name, so
normalizing would be a clear win.
On 1/30/2012 3:27 AM, Saurabh wrote:
> Hi all,
>
> I am using Postgresql database for our project and doing some
> performance testing. We need to insert millions of record with indexed
> columns. We have 5 columns in table. I created index on integer only
> then performance is good but when I created index on text column as
> well then the performance reduced to 1/8th times. My question is how I
> can improve performance when inserting data using index on text
> column?
>
> Thanks,
> Saurabh

Do it in a single transaction, and use COPY.

-Andy
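Andy's two suggestions, sketched together (table and file names are
hypothetical):

```sql
BEGIN;
-- COPY is the bulk-load path: one command, minimal per-row overhead,
-- and one commit instead of millions. Server-side COPY needs superuser
-- and a file readable by the server process; from psql, the client-side
-- equivalent is \copy.
COPY companies (id, company_name) FROM '/tmp/load.csv' WITH (FORMAT csv);
COMMIT;
```

Compared to autocommitted single-row INSERTs, this avoids a WAL flush
per row.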
On Mon, Jan 30, 2012 at 9:46 AM, Saurabh <saurabh.b85@gmail.com> wrote:
> Thank you for the information.
>
> Schema of table is:
>
> ID bigint
> company_name text
> data_set text
> time timestamp
> Date date
>
> Length of company_name is not known so it is of datatype text. I need
> to build the index on company_name and ID. And then insert the
> records. I can not create the index after insertion because user can
> search the data as well while insertion.
>
> Machine is of 8 core, os centos6 and 8 GB of RAM.
>
> Here is my configuration:
>
> shared_buffers = 32MB

That is very small for your server. I'd use at least 512MB, and maybe 2GB.

> wal_buffers = 1024KB

If you are using 9.1, I would leave this at the default of -1 and let the
database decide for itself what to use.
I changed the configuration in postgresql.conf. Following are the changed
parameters:

    shared_buffers = 1GB
    maintenance_work_mem = 50MB
    checkpoint_segments = 64
    wal_buffers = 5MB
    autovacuum = off

I inserted the records into the database and got very good performance:
it increased by a factor of 6.

Can you please tell me the purpose of the shared_buffers and
maintenance_work_mem parameters?

Thanks,
Saurabh
On 31 Leden 2012, 10:29, Saurabh wrote:
> I changed the configuration in postgresql.conf. Following are the
> changed parameters:
>
> shared_buffers = 1GB
> maintenance_work_mem = 50MB
> checkpoint_segments = 64
> wal_buffers = 5MB
> autovacuum = off
>
> Insert the records in the database and got a very good performance it
> is increased by 6 times.
>
> Can you please tell me the purpose of shared_buffer and
> maintenance_work_mem parameter?

shared_buffers is the cache maintained by PostgreSQL; all the data that
you read or write has to go through shared buffers.

maintenance_work_mem specifies how much memory "maintenance tasks" (e.g.
autovacuum, REINDEX, etc.) may use. It is analogous to work_mem for
regular queries (sorting, grouping, ...).

Tomas
> Shared buffers is the cache maintained by PostgreSQL. All the data
> that you read/write needs to go through shared buffers.

While this is technically true, I need to point out that you generally
increase shared_buffers for high concurrency and for reads, not for
writes, especially for row-at-a-time inserts. There's just not that much
memory required (although more than the out-of-the-box defaults).

I'd suggest increasing wal_buffers to 16MB, which is the maximum useful
amount, rather than 5MB.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Jan 31, 2012 at 12:46 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Shared buffers is the cache maintained by PostgreSQL. All the data
>> that you read/write needs to go through shared buffers.
>
> While this is technically true, I need to point out that you generally
> increase shared_buffers for high concurrency, and for reads, not for
> writes, especially for row-at-a-time inserts. There's just not that
> much memory required (although more than the out-of-the-box defaults).
>
> I'd suggest increasing wal_buffers to 16MB, which is the maximum useful
> amount, rather than 5MB.

yeah -- postgresql.conf settings are not going to play a big role here
unless:
*) you defer index build to the end of the load, and do CREATE INDEX
   and crank maintenance_work_mem
*) you are doing lots of transactions and relax your sync policy via
   synchronous_commit

what's almost certainly happening here is that the text index is writing
out a lot more data. what's the average length of your key?

If I'm inspecting sizes of tables/indexes which start with 'foo', I can
do this:

    postgres=# select relname, pg_size_pretty(pg_relation_size(relname::text))
                 from pg_class where relname like 'foo%';
      relname  | pg_size_pretty
    -----------+----------------
     foo       | 40 kB
     foo_i_idx | 40 kB
     foo_t_idx | 40 kB

We'd like to see the numbers for your table/indexes in question.

merlin
On Tue, Jan 31, 2012 at 10:46 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> Shared buffers is the cache maintained by PostgreSQL. All the data
>> that you read/write needs to go through shared buffers.
>
> While this is technically true, I need to point out that you generally
> increase shared_buffers for high concurrency, and for reads, not for
> writes, especially for row-at-a-time inserts. There's just not that
> much memory required (although more than the out-of-the-box defaults).

When inserting rows in bulk (even just with inserts in a tight loop)
into indexed tables, I often see performance collapse soon after the
active index size exceeds shared_buffers. Or at least, shared_buffers
plus however much dirty data the kernel is willing to tolerate. But that
latter value is hard to predict. Increasing shared_buffers can really
help a lot here.

I'm sure this behavior depends greatly on your IO subsystem.

Cheers,
Jeff
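One way to watch for the collapse point Jeff describes is to compare the
growing index against the cache size as the load runs (the index name
here is hypothetical):

```sql
-- Size of the index being maintained during the load...
SELECT pg_size_pretty(pg_relation_size('companies_name_idx'));

-- ...versus the configured cache.
SHOW shared_buffers;
```

Once the first number passes the second, insert throughput typically
starts to degrade.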
On Mon, Jan 30, 2012 at 9:46 AM, Saurabh <saurabh.b85@gmail.com> wrote:
> I can not create the index after insertion because user can
> search the data as well while insertion.

Remember, DDL is transactional in PostgreSQL. In principle, you should
be able to drop the index, do your inserts, and re-create the index
without affecting concurrent users, if you do all of that inside an
explicit transaction. Doing the inserts inside a transaction may speed
them up, as well.

rls

-- 
:wq
On Wed, Feb 1, 2012 at 12:29 AM, Rosser Schwarz
<rosser.schwarz@gmail.com> wrote:
> Remember, DDL is transactional in PostgreSQL. In principle, you
> should be able to drop the index, do your inserts, and re-create the
> index without affecting concurrent users, if you do all of that inside
> an explicit transaction. Doing the inserts inside a transaction may
> speed them up, as well.

Creating an index requires an update lock on the table, and an exclusive
lock on the system catalog. Even with "CONCURRENTLY" it's held for a
short while. So it does affect concurrent users.
On Wed, Feb 1, 2012 at 12:49 AM, Claudio Freire
<klaussfreire@gmail.com> wrote:
> On Wed, Feb 1, 2012 at 12:29 AM, Rosser Schwarz
> <rosser.schwarz@gmail.com> wrote:
>> Remember, DDL is transactional in PostgreSQL. In principle, you
>> should be able to drop the index, do your inserts, and re-create the
>> index without affecting concurrent users, if you do all of that inside
>> an explicit transaction. Doing the inserts inside a transaction may
>> speed them up, as well.
>
> Creating an index requires an update lock on the table, and an
> exclusive lock on the system catalog.
> Even though with "CONCURRENTLY" it's only for a short while.
> So it does affect concurrent users.

Forgot to mention that if you don't commit the drop, you see no
performance increase. So:

    begin
    drop
    insert
    create
    commit

does not work to improve performance. At all.
On Tue, Jan 31, 2012 at 1:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> yeah -- postgresql.conf settings are not going to play a big role here unless:
> *) you defer index build to the end of the load, and do CREATE INDEX
> and crank maintenance_work_mem
> *) you are doing lots of transactions and relax your sync policy via
> synchronous_commit

Raising checkpoint_segments also sometimes helps when loading large
amounts of data.

> what's almost certainly happening here is that the text index is
> writing out a lot more data. what's the average length of your key?

Yeah, the OP really needs to switch to hashes for the indexes. And as
another poster mentioned, it's often faster to use a btree of hashes
than the built-in hash index type.
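For reference, relaxing the sync policy that Merlin's quoted advice
mentions is a one-liner per session; unlike fsync = off, it cannot
corrupt the database, though the last few commits can be lost on a
crash:

```sql
-- Commits return before their WAL is flushed to disk; each flush then
-- covers a batch of transactions instead of one.
SET synchronous_commit = off;
-- ... run the inserts ...
```

It can also be set per transaction, or globally in postgresql.conf.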