Thread: Loading 500m json files to database

Loading 500m json files to database

From
pinker
Date:
Hi, do you maybe have an idea how to make the loading process faster?

I have 500 million JSON files (one JSON document per file) that I need to load
into the database.
My test set is "only" 1 million files.

What I came up with now is:

time for i in datafiles/*; do
  psql -c "\copy json_parts(json_data) FROM $i"&
done

which is the fastest so far, but it's not what I expected. Loading 1M files
takes me ~3h, so loading 500 times more is just unacceptable.

Some facts:
* the target DB is in the cloud, so there is no option for tricks like turning
fsync off
* PostgreSQL version 11
* I can spin up a huge Postgres instance if necessary in terms of CPU/RAM
* I already tried hash partitioning (writing to 10 different tables instead
of 1)


Any ideas?






Re: Loading 500m json files to database

From
Ertan Küçükoğlu
Date:
> On 23 Mar 2020, at 13:20, pinker <pinker@onet.eu> wrote:
>
> Hi, do you have maybe idea how to make loading process faster?
>
> I have 500 millions of json files (1 json per file) that I need to load to
> db.
> My test set is "only" 1 million files.
>
> What I came up with now is:
>
> time for i in datafiles/*; do
>  psql -c "\copy json_parts(json_data) FROM $i"&
> done
>
> which is the fastest so far. But it's not what i expect. Loading 1m of data
> takes me ~3h so loading 500 times more is just unacceptable.
>
> some facts:
> * the target db is on cloud so there is no option to do tricks like turning
> fsync off
> * version postgres 11
> * i can spin up huge postgres instance if necessary in terms of cpu/ram
> * i tried already hash partitioning (to write to 10 different tables instead
> of 1)
>
>
> Any ideas?
Hello,

I may not be knowledgeable enough to answer your question.

However, if possible, you may consider using a local physical computer to do all the loading, and afterwards do a
backup/restore onto the cloud system.

A compressed backup will be far less internet traffic compared to direct data inserts.

Moreover, you can then do the additional tricks you mentioned.

Thanks & regards,
Ertan





Re: Loading 500m json files to database

From
Andrei Zhidenkov
Date:
Try writing a stored procedure (probably pl/python) that accepts an array of JSON objects, so that it's possible
to load the data in chunks (of 100-1000 files), which should be faster.
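
A minimal sketch of that idea (untested; the function name is made up here, and since plpythonu may not be available on a managed instance, plain SQL is enough for the insert itself):

psql <<'SQL'
CREATE OR REPLACE FUNCTION load_json_parts(docs jsonb[]) RETURNS void
LANGUAGE sql AS $$
    INSERT INTO json_parts (json_data)
    SELECT unnest(docs);
$$;
-- example call, loading a batch of two documents in one statement:
SELECT load_json_parts(ARRAY['{"a": 1}', '{"b": 2}']::jsonb[]);
SQL

The client still has to read the files and build each array, but every call then inserts a whole batch instead of a single row.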

> On 23. Mar 2020, at 12:49, Ertan Küçükoğlu <ertan.kucukoglu@1nar.com.tr> wrote:
>
>
>> On 23 Mar 2020, at 13:20, pinker <pinker@onet.eu> wrote:
>>
>> Hi, do you have maybe idea how to make loading process faster?
>>
>> I have 500 millions of json files (1 json per file) that I need to load to
>> db.
>> My test set is "only" 1 million files.
>>
>> What I came up with now is:
>>
>> time for i in datafiles/*; do
>> psql -c "\copy json_parts(json_data) FROM $i"&
>> done
>>
>> which is the fastest so far. But it's not what i expect. Loading 1m of data
>> takes me ~3h so loading 500 times more is just unacceptable.
>>
>> some facts:
>> * the target db is on cloud so there is no option to do tricks like turning
>> fsync off
>> * version postgres 11
>> * i can spin up huge postgres instance if necessary in terms of cpu/ram
>> * i tried already hash partitioning (to write to 10 different tables instead
>> of 1)
>>
>>
>> Any ideas?
> Hello,
>
> I may not be knowledge enough to answer your question.
>
> However, if possible, you may think of using a local physical computer to do all uploading and after do
> backup/restore on cloud system.
>
> Compressed backup will be far less internet traffic compared to direct data inserts.
>
> Moreover you can do additional tricks as you mentioned.
>
> Thanks & regards,
> Ertan
>
>
>
>




Re: Loading 500m json files to database

From
Rob Sargent
Date:

> On Mar 23, 2020, at 5:59 AM, Andrei Zhidenkov <andrei.zhidenkov@n26.com> wrote:
>
> Try to write a stored procedure (probably pl/python) that will accept an array of JSON objects so it will be
> possible to load data in chunks (by 100-1000 files) which should be faster.
>
>>> On 23. Mar 2020, at 12:49, Ertan Küçükoğlu <ertan.kucukoglu@1nar.com.tr> wrote:
>>>
>>>
>>>> On 23 Mar 2020, at 13:20, pinker <pinker@onet.eu> wrote:
>>>
>>> Hi, do you have maybe idea how to make loading process faster?
>>>
>>> I have 500 millions of json files (1 json per file) that I need to load to
>>> db.
>>> My test set is "only" 1 million files.
>>>
>>> What I came up with now is:
>>>
>>> time for i in datafiles/*; do
>>> psql -c "\copy json_parts(json_data) FROM $i"&
>>> done
>>>
>>> which is the fastest so far. But it's not what i expect. Loading 1m of data
>>> takes me ~3h so loading 500 times more is just unacceptable.
>>>
>>> some facts:
>>> * the target db is on cloud so there is no option to do tricks like turning
>>> fsync off
>>> * version postgres 11
>>> * i can spin up huge postgres instance if necessary in terms of cpu/ram
>>> * i tried already hash partitioning (to write to 10 different tables instead
>>> of 1)
>>>
>>>
>>> Any ideas?
>> Hello,
>>
>> I may not be knowledge enough to answer your question.
>>
>> However, if possible, you may think of using a local physical computer to do all uploading and after do
>> backup/restore on cloud system.
>>
>> Compressed backup will be far less internet traffic compared to direct data inserts.
>>
>> Moreover you can do additional tricks as you mentioned.
>>
>> Thanks & regards,
>> Ertan
>>
>>

Drop any and all indices

>>
>>
>
>
>



Re: Loading 500m json files to database

From
Christopher Browne
Date:


On Mon, 23 Mar 2020 at 06:24, pinker <pinker@onet.eu> wrote:
Hi, do you have maybe idea how to make loading process faster?

I have 500 millions of json files (1 json per file) that I need to load to
db.
My test set is "only" 1 million files.

What I came up with now is:

time for i in datafiles/*; do
  psql -c "\copy json_parts(json_data) FROM $i"&
done

which is the fastest so far. But it's not what i expect. Loading 1m of data
takes me ~3h so loading 500 times more is just unacceptable.

some facts:
* the target db is on cloud so there is no option to do tricks like turning
fsync off
* version postgres 11
* i can spin up huge postgres instance if necessary in terms of cpu/ram
* i tried already hash partitioning (to write to 10 different tables instead
of 1)


Any ideas?


Well, you're paying for a lot of overhead in that, as you're
establishing a psql command, connecting to a database, spawning a backend
process, starting a transaction, committing a transaction, closing the backend
process, disconnecting from the database, and cleaning up after the launching
of the psql command.  And you're doing that 500 million times.

The one thing I left out of that list was the loading of a single tuple into json_parts.

What you could do to improve things quite a lot would be to group some number
of those files together, so that each time you pay for the overhead, you at least
get the benefit of loading several entries into json_parts.

So, loosely, I'd recommend using /bin/cat (or similar) to assemble several files together
into one, and then \copy that one file in.

Having 2 tuples loaded at once drops overhead by 50%
Having 10 tuples loaded at once drops overhead by 90%
Having 100 tuples loaded at once drops overhead by 99%
Having 1000 tuples loaded at once drops overhead by 99.9%

There probably isn't too much real value to going past 1000 tuples per batch; the
overhead, by that point, is getting pretty immaterial.

Reducing that overhead is the single most important thing you can do.

It is also quite likely that you could run such streams in parallel, although
it would require quite a bit more information about the I/O capabilities of your
hardware to know if that would do any good.
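
For example, a rough shell sketch of that batching (untested; it assumes each file is a single line of valid COPY text input, which the original per-file \copy already assumed):

# pstdin = psql's standard input, i.e. the pipe coming from cat
find datafiles -type f -print0 |
  xargs -0 -n 1000 sh -c \
    'cat "$@" | psql -c "\copy json_parts(json_data) FROM pstdin"' _

Each psql invocation then loads 1000 rows in one COPY; adding e.g. -P 4 to xargs would give a few such streams in parallel, subject to the I/O caveat above.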
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

Re: Loading 500m json files to database

From
Rob Sargent
Date:

On 3/23/20 4:24 AM, pinker wrote:
> Hi, do you have maybe idea how to make loading process faster?
>
> I have 500 millions of json files (1 json per file) that I need to load to
> db.
> My test set is "only" 1 million files.
>
> What I came up with now is:
>
> time for i in datafiles/*; do
>    psql -c "\copy json_parts(json_data) FROM $i"&
> done
>
> which is the fastest so far. But it's not what i expect. Loading 1m of data
> takes me ~3h so loading 500 times more is just unacceptable.
>
> some facts:
> * the target db is on cloud so there is no option to do tricks like turning
> fsync off
> * version postgres 11
> * i can spin up huge postgres instance if necessary in terms of cpu/ram
> * i tried already hash partitioning (to write to 10 different tables instead
> of 1)
>
>
> Any ideas?
>
Most advanced languages have a bulk copy implementation.  I've found this to be blindingly fast when the receiving
table has no indices or constraints.  It's not clear how large your files are, but you might take this time to
"normalize" them: extract any id, datatype, etc. into table attributes.
 

> --
> Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
>
>




Re: Loading 500m json files to database

From
Adrian Klaver
Date:
On 3/23/20 3:24 AM, pinker wrote:
> Hi, do you have maybe idea how to make loading process faster?
> 
> I have 500 millions of json files (1 json per file) that I need to load to
> db.
> My test set is "only" 1 million files.
> 
> What I came up with now is:
> 
> time for i in datafiles/*; do
>    psql -c "\copy json_parts(json_data) FROM $i"&
> done
> 
> which is the fastest so far. But it's not what i expect. Loading 1m of data
> takes me ~3h so loading 500 times more is just unacceptable.

Aggregating the JSON files as others have suggested would help greatly.

Knowing what is happening in json_parts() might help folks provide 
further tips.


> 
> some facts:
> * the target db is on cloud so there is no option to do tricks like turning
> fsync off
> * version postgres 11
> * i can spin up huge postgres instance if necessary in terms of cpu/ram
> * i tried already hash partitioning (to write to 10 different tables instead
> of 1)
> 
> 
> Any ideas?
> 
> 
> 
> --
> Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Loading 500m json files to database

From
pinker
Date:
Christopher Browne-3 wrote
> Well, you're paying for a lot of overhead in that, as you're
> establishing a psql command, connecting to a database, spawning a backend
> process, starting a transactions, committing a transaction, closing the
> backend
> process, disconnecting from the database, and cleaning up after the
> launching
> of the psql command.  And you're doing that 500 million times.
> 
> The one thing I left off that was the loading of a single tuple into
> json_parts.
> 
> What you could do to improve things quite a lot would be to group some
> number
> of those files together, so that each time you pay for the overhead, you
> at
> least
> get the benefit of loading several entries into json_parts.
> 
> So, loosely, I'd commend using /bin/cat (or similar) to assemble several
> files together
> into one, and then \copy that one file in.
> 
> Having 2 tuples loaded at once drops overhead by 50%
> Having 10 tuples loaded at once drops overhead by 90%
> Having 100 tuples loaded at once drops overhead by 99%
> Having 1000 tuples loaded at once drops overhead by 99.9%
> 
> There probably isn't too much real value to going past 1000 tuples per
> batch; the
> overhead, by that point, is getting pretty immaterial.
> 
> Reducing that overhead is the single most important thing you can do.

Yes, I was thinking about that, but I have no idea how to do it right now. Like
some kind of outer loop to concatenate those files, adding a delimiter
between them?


Christopher Browne-3 wrote
> It is also quite likely that you could run such streams in parallel,
> although
> it would require quite a bit more information about the I/O capabilities
> of
> your
> hardware to know if that would do any good.

I can spin up an instance of any size.








Re: Loading 500m json files to database

From
pinker
Date:
Ertan Küçükoğlu wrote
> However, if possible, you may think of using a local physical computer to
> do all uploading and after do backup/restore on cloud system.
>
> Compressed backup will be far less internet traffic compared to direct
> data inserts.

I was thinking about that, but the data source is blob storage, so downloading
it first and then loading it locally means a couple of extra days of processing :/
It's not that fast even when I'm doing it locally... so that would be like
two extra steps of overhead :/






Re: Loading 500m json files to database

From
pinker
Date:
It's in the cloud and no plpythonu extension is available, unfortunately.






Re: Loading 500m json files to database

From
pinker
Date:
There are no indexes, foreign keys, or any other constraints.






Re: Loading 500m json files to database

From
pinker
Date:
Hi,
json_parts is just a single table with 2 columns:

                                     Table "public.json_parts"
  Column   |  Type   | Collation | Nullable |                Default                 | Storage | Stats target | Description
-----------+---------+-----------+----------+----------------------------------------+---------+--------------+-------------
 id        | integer |           | not null | nextval('json_parts_id_seq'::regclass) | plain   |              |
 json_data | jsonb   |           |          |                                        |         |              |

No indexes, constraints or anything else.






Re: Loading 500m json files to database

From
Adrian Klaver
Date:
On 3/23/20 5:26 PM, pinker wrote:
> Hi,
> json_parts it's just single table with 2 column:

Well I misread that.

> 
>                                      Table "public.json_parts"
>   Column   |  Type   | Collation | Nullable |                Default                 | Storage | Stats target | Description
> -----------+---------+-----------+----------+----------------------------------------+---------+--------------+-------------
>  id        | integer |           | not null | nextval('json_parts_id_seq'::regclass) | plain   |              |
>  json_data | jsonb   |           |          |                                        |         |              |
> 
> no indexes, constraints or anything else
> 
> 
> 
> --
> Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Loading 500m json files to database

From
Adrian Klaver
Date:
On 3/23/20 5:23 PM, pinker wrote:
> it's a cloud and no plpythonu extension avaiable unfortunately

I presume Python itself is available, so would it not be possible to 
create a program that concatenates the files into batches and COPY(s) 
that data into Postgres using the Psycopg2 COPY functions:

https://www.psycopg.org/docs/usage.html#using-copy-to-and-copy-from

> 
> 
> 
> --
> Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Loading 500m json files to database

From
"David G. Johnston"
Date:
On Mon, Mar 23, 2020 at 3:24 AM pinker <pinker@onet.eu> wrote:
time for i in datafiles/*; do
  psql -c "\copy json_parts(json_data) FROM $i"&
done

Don't know whether this is faster but it does avoid spinning up a connection multiple times.

#bash, linux
    function append_each_split_file_to_etl_load_script() {
        for filetoload in ./*; do
            ronumber="$(basename $filetoload)"
            # only process files since subdirs can be present
            if [[ -f "$filetoload" ]]; then
                echo ""
                echo "\set invoice"' `cat '"'""$filetoload""'"'`'
                echo ", ('$ronumber',:'invoice')"
            fi >> "$PSQLSCRIPT"
        done

        echo ""  >> "$PSQLSCRIPT"
        echo ";" >> "$PSQLSCRIPT"
        echo ""  >> "$PSQLSCRIPT"
    }

There is a bit of other related code that is needed (for my specific usage) but this is the core of it.  Use psql variables to capture the contents of each file into a variable and then just perform a normal insert (specifically, a VALUES (...), (...) variant).  Since you can intermix psql and SQL, you basically output a bloody long script (which has memory issues at scale, but you can divide and conquer) and then run "psql --file bloody_long_script_part_1_of_100000.psql".

David J.

Re: Loading 500m json files to database

From
Rob Sargent
Date:


On Mar 23, 2020, at 7:11 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Mon, Mar 23, 2020 at 3:24 AM pinker <pinker@onet.eu> wrote:
time for i in datafiles/*; do
  psql -c "\copy json_parts(json_data) FROM $i"&
done

Don't know whether this is faster but it does avoid spinning up a connection multiple times.

#bash, linux
    function append_each_split_file_to_etl_load_script() {
        for filetoload in ./*; do
            ronumber="$(basename $filetoload)"
            # only process files since subdirs can be present
            if [[ -f "$filetoload" ]]; then
                echo ""
                echo "\set invoice"' `cat '"'""$filetoload""'"'`'
                echo ", ('$ronumber',:'invoice')"
            fi >> "$PSQLSCRIPT"
        done

        echo ""  >> "$PSQLSCRIPT"
        echo ";" >> "$PSQLSCRIPT"
        echo ""  >> "$PSQLSCRIPT"
    }

There is a bit other related code that is needed (for my specific usage) but this is the core of it.  Use psql variables to capture the contents of each file into a variable and then just perform a normal insert (specifically, a VALUES (...), (...) variant).  Since you can intermix psql and SQL you basically output a bloody long script, that has memory issues at scale - but you can divide and conquer - and then "psql --file bloody_long_script_part_1_of_100000.psql".

David J.

Can one put 550M files in a single directory?  I thought it topped out at 16M or so.



Re: Loading 500m json files to database

From
pinker
Date:
It's in blob storage in Azure. I'm testing with the 1M files that I have locally.






Re: Loading 500m json files to database

From
pinker
Date:
Hmm, now I'm thinking: maybe setting up pgbouncer in front of Postgres in
statement mode would help?






Re: Loading 500m json files to database

From
Reid Thompson
Date:
On Mon, 2020-03-23 at 03:24 -0700, pinker wrote:
> 
> Hi, do you have maybe idea how to make loading process faster?
> 
> I have 500 millions of json files (1 json per file) that I need to load to
> db.
> My test set is "only" 1 million files.
> 
> What I came up with now is:
> 
> time for i in datafiles/*; do
>   psql -c "\copy json_parts(json_data) FROM $i"&
> done
> 
> which is the fastest so far. But it's not what i expect. Loading 1m of data
> takes me ~3h so loading 500 times more is just unacceptable.
> 
> some facts:
> * the target db is on cloud so there is no option to do tricks like turning
> fsync off
> * version postgres 11
> * i can spin up huge postgres instance if necessary in terms of cpu/ram
> * i tried already hash partitioning (to write to 10 different tables instead
> of 1)
> 
> 
> Any ideas?


https://www.gnu.org/software/parallel/ 
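
For instance, an untested sketch along those lines (-m packs as many filenames as fit into each job; assumes plain filenames without spaces or shell metacharacters, and the same COPY-format caveats as the original loop):

find datafiles -type f | parallel -j 8 -m \
  'cat {} | psql -c "\copy json_parts(json_data) FROM pstdin"'

That gives both batching and a handful of parallel streams in one line; tune -j to whatever the instance's I/O can take.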




Re: Loading 500m json files to database

From
"Peter J. Holzer"
Date:
On 2020-03-23 17:18:45 -0700, pinker wrote:
> Christopher Browne-3 wrote
> > Well, you're paying for a lot of overhead in that, as you're
> > establishing a psql command, connecting to a database, spawning a
> > backend process, starting a transactions, committing a transaction,
> > closing the backend process, disconnecting from the database, and
> > cleaning up after the launching of the psql command.  And you're
> > doing that 500 million times.
> >
> > The one thing I left off that was the loading of a single tuple into
> > json_parts.
[...]
> > Reducing that overhead is the single most important thing you can do.
>
> Yes, I was thinking about that but no idea now how to do it right now.

Do you know any programming language (Python, Perl, ...)? You'll
probably get a huge gain from writing a script that just opens the
connection once and then inserts each file.

Copy usually is even faster by a fair amount, but since you have to read
the data for each row from a different file (and, if I understood you
correctly, a remote one at that), the additional speedup is probably not
that great in this case.

Splitting the work into batches and executing several batches in parallel
probably helps.
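
Even without a programming language, the single-connection part can be approximated with psql alone; a rough sketch (assuming plain filenames with no spaces or quote characters):

find datafiles -type f |
  sed 's|.*|\\copy json_parts(json_data) FROM &|' |
  psql

That keeps one connection open and runs one \copy per file; each \copy is still its own little transaction, so batching several files per COPY, as discussed above, remains the bigger win.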

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


RE: Loading 500m json files to database

From
Kevin Brannen
Date:
From: pinker <pinker@onet.eu>

> it's a cloud and no plpythonu extension avaiable unfortunately


You're misunderstanding him. See David's post for an example, but the point was that you can control all of this from
an *external* Perl, Python, Bash, whatever program on the command line at the shell.

In pseudo-code, probably fed by a "find" command piping filenames to it:

while more files
do { read in a file name & add to list } while (list.length < 1000);
process entire list with \copy commands to 1 psql command

I've left all kinds of checks out of that, but that's the basic thing that you need; implement it in whatever scripting
language you're comfortable with.
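
One minimal bash rendering of that pseudo-code, as a sketch with the checks still left out (requires bash 4 for mapfile, and assumes plain filenames):

find datafiles -type f |
while mapfile -t -n 1000 batch && ((${#batch[@]})); do
    # one psql run and one transaction per 1000-file batch
    printf '\\copy json_parts(json_data) FROM %s\n' "${batch[@]}" | psql -1 -f -
done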

HTH,
Kevin



Re: Loading 500m json files to database

From
Rob Sargent
Date:

On 3/24/20 11:29 AM, Kevin Brannen wrote:
> From: pinker <pinker@onet.eu>
>
>> it's a cloud and no plpythonu extension avaiable unfortunately
>
> You're misunderstanding him. See David's post for an example, but the point was that you can control all of this from
> an *external* Perl, Python, Bash, whatever program on the command line at the shell.
 
>
> In pseudo-code, probably fed by a "find" command piping filenames to it:
>
> while more files
> do { read in a file name & add to list } while (list.length < 1000);
> process entire list with \copy commands to 1 psql command
>
> I've left all kinds of checks out of that, but that's the basic thing that you need, implement in whatever scripting
> language you're comfortable with.
 
>
> HTH,
> Kevin

Sorry if I missed it, but have we seen the size range of these json files?




RE: Loading 500m json files to database

From
Kevin Brannen
Date:
From: Rob Sargent <robjsargent@gmail.com>

> Sorry if I missed it, but have we seen the size range of these json files?

Not that I've seen, but that's an implementation detail for whoever is doing the work. As someone else pointed out,
pick the value as needed, whether that's 10, 100, 1000, or whatever. But issuing 1000 lines of "\copy file" sort of
commands at a time isn't a big deal by itself. OTOH, you have a good point that 1000 could be too much work for the
server to handle, especially if the "-1" flag is also used. As always: test, test, test... 😊
 

Kevin

Re: Loading 500m json files to database

From
Rob Sargent
Date:

On 3/24/20 11:53 AM, Kevin Brannen wrote:
> From: Rob Sargent <robjsargent@gmail.com>
>
>> Sorry if I missed it, but have we seen the size range of these json files?
> Not that I've seen, but that's an implementation detail for whoever is doing the work. As someone else pointed out,
> pick the value as needed, whether that's 10, 100, 1000, or whatever. But issuing 1000 lines of "\copy file" sort of
> commands at a time isn't a big deal by itself. OTOH, you have a good point that 1000 could be too much work for the
> server to handle, especially if the "-1" flag is also used. As always: test, test, test... 😊
 
>

My fear is this: the community helps the OP load 500M "records" in a
reasonable timeframe, then the OP's organization complains Postgres is slow
once they start using it... because the transition from file system to
RDBMS was ill conceived. Are the JSON files large documents, arbitrary
content, or small data structures of recognizable types? And so on.