Batch update of indexes on data loading - Mailing list pgsql-hackers

From ITAGAKI Takahiro
Subject Batch update of indexes on data loading
Date
Msg-id 20080221130554.B2BA.52131E4D@oss.ntt.co.jp
Whole thread Raw
Responses Re: Batch update of indexes on data loading
Re: Batch update of indexes on data loading
Re: Batch update of indexes on data loading
List pgsql-hackers
This is a proposal of fast data loading using batch update of indexes for 8.4.
It is a part of pg_bulkload (http://pgbulkload.projects.postgresql.org/) and
I'd like to integrate it in order to cooperate with other parts of postgres.

The basic concept is spooling new coming data, and merge the spool and
the existing indexes into a new index at the end of data loading. It is 
5-10 times faster than index insertion per-row, that is the way in 8.3.


One of the problem is locking; Index building in bulkload is similar to
REINDEX rather than INSERT, so we need ACCESS EXCLUSIVE LOCK during it.
Bulkloading is not a upper compatible method, so I'm thinking about
adding a new "WITH LOCK" option for COPY command.
 COPY tbl FROM 'datafile' WITH LOCK;

If the LOCK option is specified, the behavior of COPY will be changed
as follows:

1. Lock the target table in ACCESS EXCLUSIVE mode instead of ROW EXCLUSIVE.
2. Prepare spooler (BTSpool) for each indexes.
3. For each new row, put index entries into the spools (_bt_spool)  instead of index_insert.
4. At the end of COPY, merge the spool and the existing indexes into a new  index file. The relfilenode of the index is
changedlike REINDEX.
 

However, there might be better interfaces for bulk index creation.
For example, if we want to use it with pgloader, we might need
"bulkload mode" for indexes. pgloader commits every 10000 rows,
so the index spooler must keep alive until end of the session
over transactions. (or end of the transaction over sub-transactions)

I'm working toward the simple "COPY WITH LOCK" approach for now,
but if there are other better ideas, I want to use them.
Advices and suggestions welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: "Joshua D. Drake"
Date:
Subject: Re: Permanent settings
Next
From: ITAGAKI Takahiro
Date:
Subject: Re: ANALYZE to be ignored by VACUUM