Thread: pg_bulkload ON_DUPLICATE_MERGE

pg_bulkload ON_DUPLICATE_MERGE

From
Benjamin Johnson
Date:
PG Gurus,

I have a table like this:

CREATE TABLE filemods (
  guid                  BIGINT NOT NULL UNIQUE,
  filepath_guid         BIGINT NOT NULL,
  createtime            TIMESTAMP WITH TIME ZONE DEFAULT NULL,
  writetime             TIMESTAMP WITH TIME ZONE DEFAULT NULL,
  deletetime            TIMESTAMP WITH TIME ZONE DEFAULT NULL,
);

One "event" might have (1, '2012-01-25 11:00:00', NULL, NULL) and
another event might have (1, NULL, '2012-01-25 11:05:00', NULL) and the
end result should be:
(1, '2012-01-25 11:00:00', '2012-01-25 11:05:00', NULL).


I'm trying to modify pg_bulkload to "merge" two rows together.  The
changes I have made seem to be working, although I would like input on
what I am doing that is unsafe or terribly wrong.  You can be brutal.

I've seen incredible write speed with using pg_bulkload.  If I can just
get it to "consolidate" our rows based on the unique key it will remove
a lot of complexity in our software.

Also, I'm not entirely sure this mailing list is the correct one, but
with all the internals you all know, I'm hoping you can help point out
nasty flaws in my algorithm.  This is the first legitimate attempt I
have made at modifying PG source, so I'm not real familiar with the
proper way of loading pages and tuples and updating heaps and all that
pg core stuff.

Here's the modifications to pg_btree.c (from pg_bulkload HEAD):

http://pastebin.com/U23CapvR

I also attached the patch.

Thank you!!

Ben


--
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612

Attachment

Re: pg_bulkload ON_DUPLICATE_MERGE

From
Robert Haas
Date:
On Wed, Jan 25, 2012 at 2:49 PM, Benjamin Johnson
<benjamin.johnson@getcarbonblack.com> wrote:
> PG Gurus,
>
> I have a table like this:
>
> CREATE TABLE filemods (
>  guid                  BIGINT NOT NULL UNIQUE,
>  filepath_guid         BIGINT NOT NULL,
>  createtime            TIMESTAMP WITH TIME ZONE DEFAULT NULL,
>  writetime             TIMESTAMP WITH TIME ZONE DEFAULT NULL,
>  deletetime            TIMESTAMP WITH TIME ZONE DEFAULT NULL,
> );
>
> One "event" might have (1, '2012-01-25 11:00:00', NULL, NULL) and
> another event might have (1, NULL, '2012-01-25 11:05:00', NULL) and the
> end result should be:
> (1, '2012-01-25 11:00:00', '2012-01-25 11:05:00', NULL).
>
>
> I'm trying to modify pg_bulkload to "merge" two rows together.  The
> changes I have made seem to be working, although I would like input on
> what I am doing that is unsafe or terribly wrong.  You can be brutal.
>
> I've seen incredible write speed with using pg_bulkload.  If I can just
> get it to "consolidate" our rows based on the unique key it will remove
> a lot of complexity in our software.
>
> Also, I'm not entirely sure this mailing list is the correct one, but
> with all the internals you all know, I'm hoping you can help point out
> nasty flaws in my algorithm.  This is the first legitimate attempt I
> have made at modifying PG source, so I'm not real familiar with the
> proper way of loading pages and tuples and updating heaps and all that
> pg core stuff.
>
> Here's the modifications to pg_btree.c (from pg_bulkload HEAD):
>
> http://pastebin.com/U23CapvR
>
> I also attached the patch.

I am not sure who maintains pg_bulkload, but it's not part of the core
distribution, so this is the wrong mailing list....  you may want to
look here:

http://pgfoundry.org/mail/?group_id=1000261

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company