Thread: Loading 500m json files to database
Hi, do you maybe have an idea how to make the loading process faster?

I have 500 million JSON files (one JSON document per file) that I need to load into a database. My test set is "only" 1 million files.

What I came up with so far is:

    time for i in datafiles/*; do
      psql -c "\copy json_parts(json_data) FROM $i"&
    done

which is the fastest approach yet. But it's not what I expected: loading 1 million files takes me ~3 hours, so loading 500 times more is just unacceptable.

Some facts:
* the target db is in the cloud, so there is no option for tricks like turning fsync off
* version: Postgres 11
* I can spin up a huge Postgres instance if necessary in terms of CPU/RAM
* I already tried hash partitioning (writing to 10 different tables instead of 1)

Any ideas?
> On 23 Mar 2020, at 13:20, pinker <pinker@onet.eu> wrote:
>
> Hi, do you maybe have an idea how to make the loading process faster?
> [...]
> Any ideas?

Hello,

I may not be knowledgeable enough to answer your question.

However, if possible, you may think of using a local physical computer to do all the uploading and afterwards do a backup/restore on the cloud system.

A compressed backup will be far less internet traffic compared to direct data inserts. Moreover, you can then do the additional tricks you mentioned.

Thanks & regards,
Ertan
Try to write a stored procedure (probably PL/Python) that will accept an array of JSON objects, so it will be possible to load data in chunks (of 100-1000 files), which should be faster.

> On 23. Mar 2020, at 12:49, Ertan Küçükoğlu <ertan.kucukoglu@1nar.com.tr> wrote:
>
> However, if possible, you may think of using a local physical computer to do all the uploading and afterwards do a backup/restore on the cloud system.
> [...]
> On Mar 23, 2020, at 5:59 AM, Andrei Zhidenkov <andrei.zhidenkov@n26.com> wrote:
>
> Try to write a stored procedure (probably PL/Python) that will accept an array of JSON objects, so it will be possible to load data in chunks (of 100-1000 files), which should be faster.
> [...]

Drop any and all indices.
On Mon, 23 Mar 2020 at 06:24, pinker <pinker@onet.eu> wrote:
> Hi, do you maybe have an idea how to make the loading process faster?
>
>     time for i in datafiles/*; do
>       psql -c "\copy json_parts(json_data) FROM $i"&
>     done
>
> which is the fastest approach yet. But it's not what I expected: loading 1 million files takes me ~3 hours, so loading 500 times more is just unacceptable.
> [...]
> Any ideas?
Well, you're paying for a lot of overhead in that approach: you're launching a psql command, connecting to the database, spawning a backend
process, starting a transaction, committing the transaction, closing the backend
process, disconnecting from the database, and cleaning up after the launching
of the psql command. And you're doing that 500 million times.
The one thing I left off that list was the loading of a single tuple into json_parts.
What you could do to improve things quite a lot would be to group some number
of those files together, so that each time you pay for the overhead, you at least
get the benefit of loading several entries into json_parts.
So, loosely, I'd recommend using /bin/cat (or similar) to assemble several files together
into one, and then \copy that one file in.
Having 2 tuples loaded at once drops overhead by 50%
Having 10 tuples loaded at once drops overhead by 90%
Having 100 tuples loaded at once drops overhead by 99%
Having 1000 tuples loaded at once drops overhead by 99.9%
There probably isn't too much real value to going past 1000 tuples per batch; the
overhead, by that point, is getting pretty immaterial.
Reducing that overhead is the single most important thing you can do.
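For instance, a rough, untested sketch of such a batching loop (it assumes each file holds exactly one line of JSON with no raw tabs or backslashes that COPY's text format would mis-handle, and that connection settings come from the usual PG* environment variables):

    #!/bin/bash
    # Load files in batches of 1000 rows per \copy instead of one psql call per file.
    batch_size=1000
    batch=()

    flush() {
      [ ${#batch[@]} -eq 0 ] && return
      # "awk 1" concatenates the files and guarantees one trailing newline per
      # file, so each file becomes exactly one row in json_parts.
      awk 1 "${batch[@]}" | psql -v ON_ERROR_STOP=1 \
          -c "\copy json_parts(json_data) FROM pstdin"
      batch=()
    }

    for f in datafiles/*; do
      batch+=("$f")
      [ ${#batch[@]} -ge "$batch_size" ] && flush
    done
    flush

Each psql invocation then pays the connection and transaction overhead once per 1000 files instead of once per file.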
It is also quite likely that you could run such streams in parallel, although
it would require quite a bit more information about the I/O capabilities of your
hardware to know if that would do any good.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
question, "How would the Lone Ranger handle this?"
On 3/23/20 4:24 AM, pinker wrote:
> Hi, do you maybe have an idea how to make the loading process faster?
> [...]
> Any ideas?

Most advanced languages have a bulk copy implementation. I've found this to be blindingly fast when the receiving table has no indices or constraints.

It's not clear how large your files are, but you might take this opportunity to "normalize" them: extract any id, datatype, etc. into table attributes.
On 3/23/20 3:24 AM, pinker wrote:
> Hi, do you maybe have an idea how to make the loading process faster?
> [...]
> which is the fastest approach yet. But it's not what I expected: loading 1 million files takes me ~3 hours, so loading 500 times more is just unacceptable.

Aggregating the JSON files as others have suggested would help greatly.

Knowing what is happening in json_parts() might help folks provide further tips.

--
Adrian Klaver
adrian.klaver@aklaver.com
Christopher Browne-3 wrote:
> Well, you're paying for a lot of overhead in that approach [...]
> Reducing that overhead is the single most important thing you can do.

Yes, I was thinking about that, but I have no idea how to do it right now. Some kind of outer loop to concatenate those files, with a delimiter added between them?

Christopher Browne-3 wrote:
> It is also quite likely that you could run such streams in parallel, although it would require quite a bit more information about the I/O capabilities of your hardware to know if that would do any good.

I can spin up any size of instance.
Ertan Küçükoğlu wrote:
> However, if possible, you may think of using a local physical computer to do all the uploading and afterwards do a backup/restore on the cloud system.
>
> A compressed backup will be far less internet traffic compared to direct data inserts.

I was thinking about that, but the data source is blob storage, so downloading it first and then loading it locally would be a couple of extra days of processing :/ It's not that fast even when I'm doing it locally, so that would be like two extra steps of overhead :/
It's in the cloud and no plpythonu extension is available, unfortunately.
There are no indexes, foreign keys, or any other constraints.
Hi,
json_parts is just a single table with 2 columns:

                                      Table "public.json_parts"
  Column   |  Type   | Collation | Nullable |                Default                 | Storage | Stats target | Description
-----------+---------+-----------+----------+----------------------------------------+---------+--------------+-------------
 id        | integer |           | not null | nextval('json_parts_id_seq'::regclass) | plain   |              |
 json_data | jsonb   |           |          |                                        |         |              |

There are no indexes, constraints, or anything else.
On 3/23/20 5:26 PM, pinker wrote:
> Hi,
> json_parts is just a single table with 2 columns:
> [...]
> There are no indexes, constraints, or anything else.

Well, I misread that.

--
Adrian Klaver
adrian.klaver@aklaver.com
On 3/23/20 5:23 PM, pinker wrote:
> It's in the cloud and no plpythonu extension is available, unfortunately.

I presume Python itself is available, so would it not be possible to create a program that concatenates the files into batches and COPYs that data into Postgres using the psycopg2 COPY functions?

https://www.psycopg.org/docs/usage.html#using-copy-to-and-copy-from

--
Adrian Klaver
adrian.klaver@aklaver.com
On Mon, Mar 23, 2020 at 3:24 AM pinker <pinker@onet.eu> wrote:
>     time for i in datafiles/*; do
>       psql -c "\copy json_parts(json_data) FROM $i"&
>     done
Don't know whether this is faster but it does avoid spinning up a connection multiple times.
#bash, linux
function append_each_split_file_to_etl_load_script() {
  for filetoload in ./*; do
    ronumber="$(basename $filetoload)"
    # only process files since subdirs can be present
    if [[ -f "$filetoload" ]]; then
      echo ""
      echo "\set invoice"' `cat '"'""$filetoload""'"'`'
      echo ", ('$ronumber',:'invoice')"
    fi >> "$PSQLSCRIPT"
  done
  echo "" >> "$PSQLSCRIPT"
  echo ";" >> "$PSQLSCRIPT"
  echo "" >> "$PSQLSCRIPT"
}
There is a bit of other related code that is needed (for my specific usage), but this is the core of it. Use psql variables to capture the contents of each file into a variable and then just perform a normal insert (specifically, a VALUES (...), (...) variant). Since you can intermix psql meta-commands and SQL, you basically output a bloody long script - which has memory issues at scale, but you can divide and conquer - and then run "psql --file bloody_long_script_part_1_of_100000.psql".
David J.
Can one put 550M files in a single directory? I thought it topped out at 16M or so.

> On Mar 23, 2020, at 7:11 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
>
> Don't know whether this is faster but it does avoid spinning up a connection multiple times.
> [...]
It's in blob storage in Azure. I'm testing with the 1m files that I have locally.
Hmm, now I'm thinking: maybe setting up pgbouncer in front of Postgres in statement pooling mode would help?
On Mon, 2020-03-23 at 03:24 -0700, pinker wrote:
> Hi, do you maybe have an idea how to make the loading process faster?
> [...]
> Any ideas?

https://www.gnu.org/software/parallel/
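For example, combining batching with parallelism - a rough, untested sketch using xargs (the same batch-and-run-in-parallel idea GNU parallel offers with more control; the batch size of 500, the 8 jobs, the datafiles/ path, and reliance on PG* environment variables for the connection are all assumptions to tune):

    find datafiles -type f -print0 |
      xargs -0 -n 500 -P 8 sh -c '
        # "awk 1" prints every file with a guaranteed trailing newline,
        # so each single-line JSON file becomes exactly one COPY row.
        awk 1 "$@" | psql -v ON_ERROR_STOP=1 -c "\copy json_parts(json_data) FROM pstdin"
      ' _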
On 2020-03-23 17:18:45 -0700, pinker wrote:
> Christopher Browne-3 wrote:
> > Well, you're paying for a lot of overhead in that approach [...]
> > Reducing that overhead is the single most important thing you can do.
>
> Yes, I was thinking about that, but I have no idea how to do it right now.

Do you know any programming language (Python, Perl, ...)? You'll probably get a huge gain from writing a script that just opens the connection once and then inserts each file.

COPY is usually even faster by a fair amount, but since you have to read the data for each row from a different file (and - if I understood you correctly - a remote one at that), the additional speedup is probably not that great in this case.

Splitting the work into batches and executing several batches in parallel probably helps.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
From: pinker <pinker@onet.eu>
> It's in the cloud and no plpythonu extension is available, unfortunately.

You're misunderstanding him. See David's post for an example, but the point was that you can control all of this from an *external* Perl, Python, Bash, whatever program on the command line at the shell.

In pseudo-code, probably fed by a "find" command piping filenames to it:

    while more files
        do { read in a file name & add to list } while (list.length < 1000);
        process entire list with \copy commands to 1 psql command

I've left all kinds of checks out of that, but that's the basic thing that you need; implement it in whatever scripting language you're comfortable with.

HTH,
Kevin
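A very rough, untested bash rendering of that pseudo-code (it streams the whole file list to a single psql process instead of building 1000-file lists in memory; the --single-transaction flag and the assumption that filenames contain no single quotes are guesses, and connection settings are taken from the environment):

    # Emit one \copy command per file and feed them all to a single psql
    # process, so the connection (and transaction) is set up only once.
    for f in datafiles/*; do
      printf '\\copy json_parts(json_data) FROM %s\n' "'$f'"
    done | psql -v ON_ERROR_STOP=1 --single-transaction

That opens the connection once, though each \copy still loads only one file, so combining it with the concatenation approach upthread should still help.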
On 3/24/20 11:29 AM, Kevin Brannen wrote:
> You're misunderstanding him. See David's post for an example, but the point was that you can control all of this from an *external* Perl, Python, Bash, whatever program on the command line at the shell.
> [...]

Sorry if I missed it, but have we seen the size range of these JSON files?
From: Rob Sargent <robjsargent@gmail.com>
> Sorry if I missed it, but have we seen the size range of these JSON files?

Not that I've seen, but that's an implementation detail for whoever is doing the work. As someone else pointed out, pick the value as needed, whether that's 10, 100, 1000, or whatever. But issuing 1000 lines of "\copy file" sort of commands at a time isn't a big deal by itself. OTOH, you have a good point that 1000 could be too much work for the server to handle, especially if the "-1" flag is also used. As always: test, test, test... 😊

Kevin
On 3/24/20 11:53 AM, Kevin Brannen wrote:
> Not that I've seen, but that's an implementation detail for whoever is doing the work. As someone else pointed out, pick the value as needed, whether that's 10, 100, 1000, or whatever.
> [...]

My fear is this: the community helps the OP load 500M "records" in a reasonable timeframe, then the OP's organization complains that Postgres is slow once they start using it... because the transition from file system to RDBMS was ill-conceived. Are the JSON files large documents, arbitrary content, or small data structures of recognizable types? And so on.