Thread: PostgreSQL VS MongoDB: a use case comparison

PostgreSQL VS MongoDB: a use case comparison

From: Fabio Pardi
Hi list,


For your consideration, I'm submitting research I did together with my
colleague Wouter.

We compared the two databases for our specific use case: medical data
stored using the FHIR standard.

This is admittedly a very narrow use case, and moreover a read-only one.
We are aware of that.

In the near future we might enlarge the scope of the research to include
read and write operations, more use cases, partitioning, and/or scaling
Mongo.

This should be considered a starting point; we are sharing our results in
the hope of getting constructive feedback, and perhaps fixing any mistakes
or inaccuracies you might find.

This is the link:

https://portavita.github.io/2018-10-31-blog_A_JSON_use_case_comparison_between_PostgreSQL_and_MongoDB/

We are open to any kind of feedback and we hope you enjoy the reading.


Regards,

Fabio Pardi and Wouter van Teijlingen

Re: PostgreSQL VS MongoDB: a use case comparison

From: Stephen Frost
Greetings,

* Fabio Pardi (f.pardi@portavita.eu) wrote:
> We are open to any kind of feedback and we hope you enjoy the reading.

Looks like a lot of the difference being seen and the comments made
about one being faster than the other are because one system is
compressing *everything*, while PG (quite intentionally...) only
compresses the data sometimes- once it hits the TOAST limit.  That
likely also contributes to why you're seeing the on-disk size
differences that you are.
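
A quick way to check how much of your jsonb is actually being compressed
is to compare the stored datum size with the raw text size, something
along these lines (table and column names are made up here, of course):

    -- pg_column_size() reports the bytes used on disk (possibly compressed),
    -- octet_length() of the text form approximates the uncompressed size.
    SELECT avg(pg_column_size(data))     AS avg_stored_bytes,
           avg(octet_length(data::text)) AS avg_raw_bytes
    FROM   documents;

If those two numbers come out close to each other then most of your
documents are below the TOAST threshold and are being stored uncompressed.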

Of course, if you want to see where PG will really shine, you'd stop
thinking of data as just blobs of JSON and actually define individual
fields in PG instead of just one 'jsonb' column, especially when you
know that field will always exist (which is obviously the case if you're
building an index on it, such as your MarriageDate) and then remove
those fields from the jsonb and just reconstruct the JSON when you
query.  Doing that you'll get the size down dramatically.
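
As a rough sketch, with invented names since I don't know your schema:

    -- Promote the always-present, indexed field to a real column and keep
    -- the rest of the document as jsonb.
    CREATE TABLE patient_doc (
        id            bigint PRIMARY KEY,
        marriage_date date   NOT NULL,
        data          jsonb  NOT NULL
    );

    CREATE INDEX ON patient_doc (marriage_date);

    -- Reconstruct the full JSON document at query time.
    SELECT data || jsonb_build_object('marriageDate', marriage_date)
    FROM   patient_doc
    WHERE  marriage_date BETWEEN '2000-01-01' AND '2000-12-31';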

And that's without even going to that next-level stuff of actual
normalization where you pull out duplicate data from across the JSON
and have just one instance of that data in another, smaller, table and
use a JOIN to bring it all back together.  Even better is when you
realize that then you only have to update one row in this other table
when something changes in that subset of data, unlike when you
repeatedly store that data in individual JSON entries all across the
system and such a change requires rewriting every single JSON object in
the entire system...
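
Continuing the sketch above with the same invented names:

    -- Duplicated organization details live in one small table; the
    -- documents only carry a reference to them.
    CREATE TABLE organization (
        org_id  bigint PRIMARY KEY,
        details jsonb  NOT NULL
    );

    ALTER TABLE patient_doc ADD COLUMN org_id bigint REFERENCES organization;

    -- A change to an organization is now a single-row UPDATE there, and the
    -- JOIN puts the document back together on the way out.
    SELECT d.data || jsonb_build_object('organization', o.details)
    FROM   patient_doc d
    JOIN   organization o USING (org_id);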

Lastly, as with any performance benchmark, please include full details-
all scripts used, all commands run, all data used, so that others can
reproduce your results.  I'm sure it'd be fun to take your json data and
create actual tables out of it and see what it'd be like then.

Thanks!

Stephen


Re: PostgreSQL VS MongoDB: a use case comparison

From: Fabio Pardi
Hi Stephen,

thanks for your feedback.

I agree with you that compression is playing a role in the comparison.
There is probably a toll to pay when the load is high and the CPU is
stressed by de/compressing data. If we are able to take our studies that
far, this is definitely something we would like to measure.

I also agree with you that at the moment Postgres really shines on
relational data. To be honest, after seeing the outcome of our research,
we are actually considering decoupling some (or all) fields from their
JSON structure. There will be a toll to be paid there too, since we
receive the data in JSON format.
The toll will be the time spent delivering such a solution, and also the
time spent by the engine doing the conversion. It might not be that
convenient after all.

Anyway, bringing data from JSON to a relational model is off topic for
the current discussion, since what we are actually questioning is whether
Postgres is a good replacement for Mongo when handling JSON data.

As for sharing the dataset: as mentioned in the post, we are handling
medical data. Even though the content is anonymized, we are not keen to
share the data structure either, for security reasons.
That's a pity, I know, but I cannot do anything about it.
The queries we ran and the commands we used are mentioned in the blog
post, but if you see gaps, feel free to ask.

regards,

fabio pardi



On 11/19/18 6:26 PM, Stephen Frost wrote:
> Greetings,
>
> * Fabio Pardi (f.pardi@portavita.eu) wrote:
>> We are open to any kind of feedback and we hope you enjoy the reading.
>
> Looks like a lot of the difference being seen and the comments made
> about one being faster than the other are because one system is
> compressing *everything*, while PG (quite intentionally...) only
> compresses the data sometimes- once it hits the TOAST limit.  That
> likely also contributes to why you're seeing the on-disk size
> differences that you are.
>
> Of course, if you want to see where PG will really shine, you'd stop
> thinking of data as just blobs of JSON and actually define individual
> fields in PG instead of just one 'jsonb' column, especially when you
> know that field will always exist (which is obviously the case if you're
> building an index on it, such as your MarriageDate) and then remove
> those fields from the jsonb and just reconstruct the JSON when you
> query.  Doing that you'll get the size down dramatically.
>
> And that's without even going to that next-level stuff of actual
> normalization where you pull out duplicate data from across the JSON
> and have just one instance of that data in another, smaller, table and
> use a JOIN to bring it all back together.  Even better is when you
> realize that then you only have to update one row in this other table
> when something changes in that subset of data, unlike when you
> repeatedly store that data in individual JSON entries all across the
> system and such a change requires rewriting every single JSON object in
> the entire system...
>
> Lastly, as with any performance benchmark, please include full details-
> all scripts used, all commands run, all data used, so that others can
> reproduce your results.  I'm sure it'd be fun to take your json data and
> create actual tables out of it and see what it'd be like then.
>
> Thanks!
>
> Stephen
>



Re: PostgreSQL VS MongoDB: a use case comparison

From: Stephen Frost
Greetings,

* Fabio Pardi (f.pardi@portavita.eu) wrote:
> thanks for your feedback.

We prefer on these mailing lists to not top-post but instead to reply
inline, as I'm doing here.  This helps the conversation by eliminating
unnecessary dialogue and being able to make comments regarding specific
points clearly.

> I agree with you that compression is playing a role in the comparison.
> There is probably a toll to pay when the load is high and the CPU is
> stressed by de/compressing data. If we are able to take our studies that
> far, this is definitely something we would like to measure.

I was actually thinking of the compression as having more of an impact
with regard to the 'cold' cases because you're pulling fewer blocks when
it's compressed.  The decompression cost on CPU is typically much, much
less than the cost to pull the data off of the storage medium.  When
things are 'hot' and in cache then it might be interesting to question
if the compression/decompression is worth the cost.
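
One way to see where the time actually goes is EXPLAIN (ANALYZE, BUFFERS),
run once cold and once warm (table and predicate invented here):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT data
    FROM   documents
    WHERE  data @> '{"resourceType": "Patient"}';

The "shared read" counts show how many blocks really came off the storage,
which is exactly where a smaller compressed footprint helps.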

> I also agree with you that at the moment Postgres really shines on
> relational data. To be honest, after seeing the outcome of our research,
> we are actually considering decoupling some (or all) fields from their
> JSON structure. There will be a toll to be paid there too, since we
> receive the data in JSON format.

PostgreSQL has tools to help with this, you might look into
'json_to_record' and friends.
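
For example, something like this (the field names are invented):

    -- Turn each document into a typed row, picking only the keys you need;
    -- keys are matched against the column names in the AS clause.
    SELECT r.*
    FROM   documents d,
           jsonb_to_record(d.data) AS r(id text, "birthDate" date, gender text);

There's also jsonb_to_recordset() for arrays of objects and
jsonb_populate_record() when you already have a matching composite type.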

> The toll will be the time spent delivering such a solution, and also the
> time spent by the engine doing the conversion. It might not be that
> convenient after all.

Oh, the kind of reduction you'd see in space from both an on-disk and
in-memory footprint would almost certainly be worth the tiny amount of
CPU overhead from this.

> Anyway, bringing data from JSON to a relational model is off topic for
> the current discussion, since what we are actually questioning is whether
> Postgres is a good replacement for Mongo when handling JSON data.

This narrow viewpoint isn't really sensible though- what you should be
thinking about is what's appropriate for your *data*.  JSON is just a
data format, and while it's alright as a system inter-exchange format,
it's rather terrible as a storage format.

> As for sharing the dataset: as mentioned in the post, we are handling
> medical data. Even though the content is anonymized, we are not keen to
> share the data structure either, for security reasons.

If you really want people to take your analysis seriously, others must
be able to reproduce your results.  I certainly appreciate that there
are very good reasons that you can't share this actual data, but your
testing could be done with completely generated data which happens to be
similar in structure to your data and have similar frequency of values.

The way to approach generating such a data set would be to aggregate up
the actual data to a point where the appropriate committee/board agree
that it can be shared publicly, and then you build a randomly generated
set of data which aggregates to the same result and then use that for
testing.
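
As a sketch of the idea, generating a million fake documents into a
hypothetical documents(data jsonb) table, with whatever shape and value
frequencies your aggregates say are representative:

    INSERT INTO documents (data)
    SELECT jsonb_build_object(
               'resourceType', 'Patient',
               'id',           i,
               'gender',       (ARRAY['male','female'])[1 + (random() < 0.5)::int],
               'birthDate',    (date '1930-01-01' + (random() * 25000)::int)::text
           )
    FROM generate_series(1, 1000000) AS g(i);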

> That's a pity, I know, but I cannot do anything about it.
> The queries we ran and the commands we used are mentioned in the blog
> post, but if you see gaps, feel free to ask.

There were a lot of gaps that I saw when I looked through the article-
starting with things like the actual CREATE TABLE command you used, and
the complete size/structure of the JSON object, but really what a paper
like this should include is a full script which creates all the tables,
loads all the data, runs the analysis, calculates the results, etc.
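
Even something as small as this, end to end, would make the numbers much
easier to interpret and reproduce (again, names invented):

    CREATE TABLE documents (
        id   bigserial PRIMARY KEY,
        data jsonb NOT NULL
    );

    -- jsonb_path_ops is smaller and faster than the default GIN opclass,
    -- but only supports the @> containment operator.
    CREATE INDEX documents_data_gin ON documents USING gin (data jsonb_path_ops);

    -- ... COPY/INSERT the data set here ...

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM documents WHERE data @> '{"resourceType": "Patient"}';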

Thanks!

Stephen


Re: PostgreSQL VS MongoDB: a use case comparison

From: Fabio Pardi
Hi again,

On 11/20/18 2:34 PM, Stephen Frost wrote:

>> I agree with you that compression is playing a role in the comparison.
>> There is probably a toll to pay when the load is high and the CPU is
>> stressed by de/compressing data. If we are able to take our studies that
>> far, this is definitely something we would like to measure.
>
> I was actually thinking of the compression as having more of an impact
> with regard to the 'cold' cases because you're pulling fewer blocks when
> it's compressed.  The decompression cost on CPU is typically much, much
> less than the cost to pull the data off of the storage medium.  When
> things are 'hot' and in cache then it might be interesting to question
> if the compression/decompression is worth the cost.
>

True.

When the data is already in RAM, Postgres is then faster because, as you
say, compression no longer gives the benefit of pulling fewer blocks from
disk.

In my statement above about the CPU, I was speculating whether the speed
Mongo gains from block compression would act as a double-edged sword
under warm-cache scenarios and heavy load.



>> I also agree with you that at the moment Postgres really shines on
>> relational data. To be honest, after seeing the outcome of our research,
>> we are actually considering decoupling some (or all) fields from their
>> JSON structure. There will be a toll to be paid there too, since we
>> receive the data in JSON format.
>
> PostgreSQL has tools to help with this, you might look into
> 'json_to_record' and friends.
>

That might turn out to be useful to us if we normalize our data, thanks.

>> Anyway, bringing data from JSON to a relational model is off topic for
>> the current discussion, since what we are actually questioning is whether
>> Postgres is a good replacement for Mongo when handling JSON data.
>
> This narrow viewpoint isn't really sensible though- what you should be
> thinking about is what's appropriate for your *data*.  JSON is just a
> data format, and while it's alright as a system inter-exchange format,
> it's rather terrible as a storage format.
>

I did not want to narrow the viewpoint; I'm exploring possibilities.
Since Postgres supports JSON, it would have been nice to know how far
one can go in storing data without transforming it.

When we started our research the only question was:
Is it possible to replace Postgres with Mongo 1 to 1?
All other considerations came later and, as a matter of fact, as already
mentioned, we are now considering (maybe partially) transforming the data
to a relational model.

Maybe we did not look around enough, but we could not find all the
answers to our questions on the internet, so we started something
ourselves.


>> As for sharing the dataset: as mentioned in the post, we are handling
>> medical data. Even though the content is anonymized, we are not keen to
>> share the data structure either, for security reasons.
>
> If you really want people to take your analysis seriously, others must
> be able to reproduce your results.  I certainly appreciate that there
> are very good reasons that you can't share this actual data, but your
> testing could be done with completely generated data which happens to be
> similar in structure to your data and have similar frequency of values.
>
> The way to approach generating such a data set would be to aggregate up
> the actual data to a point where the appropriate committee/board agree
> that it can be shared publicly, and then you build a randomly generated
> set of data which aggregates to the same result and then use that for
> testing.
>
In hindsight, I would probably generate data that is shareable with
everybody, to give people the opportunity to play with it and get them
more involved.

The fact is that we started very small and ended up with quite a lot of
information we felt like sharing.
Time is a tyrant, and at the moment we cannot re-run everything with
shareable data, so we all have to live with it. It is not optimal and not
perfectly academic, but in my opinion it is still better than not sharing
at all.


One good thing is that while testing and learning I found a similar
investigation which led to similar results (unfortunately, there too you
can argue that the dataset, the scripts and all the rest are not shared).

In the jsquery section of the blog post there is a link pointing to:

https://github.com/postgrespro/jsquery/blob/master/README.md

which in turn points to

http://www.sai.msu.su/~megera/postgres/talks/pgconfeu-2014-jsquery.pdf

On page 18 there are some results which are close to what we obtained,
even though the paper is from 2014 and a lot has changed in the landscape
since then.

This is to say that I suspect that if we generate random JSON data, we
will probably draw the same conclusions.
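
For readers not familiar with jsquery, the kind of query it covers looks
roughly like this (written from memory of the README linked above, with
invented table and field names, so please check the repository for the
exact syntax):

    CREATE EXTENSION jsquery;

    -- GIN opclass provided by the extension
    CREATE INDEX ON documents USING gin (data jsonb_path_value_ops);

    -- @@ matches a jsonb value against a jsquery expression
    SELECT count(*)
    FROM   documents
    WHERE  data @@ 'resourceType = "Patient" AND gender = "female"';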


>> That's a pity, I know, but I cannot do anything about it.
>> The queries we ran and the commands we used are mentioned in the blog
>> post, but if you see gaps, feel free to ask.
>
> There were a lot of gaps that I saw when I looked through the article-
> starting with things like the actual CREATE TABLE command you used,

You are right: the post only contained the command I used to transform
the table to jsonb.

It is a small detail, but I have updated the post for clarity.


> and
> the complete size/structure of the JSON object, but really what a paper
> like this should include is a full script which creates all the tables,
> loads all the data, runs the analysis, calculates the results, etc.
>

The queries are shared, but without the data, sharing the rest is quite
useless in my opinion.


regards,

fabio pardi



Re: PostgreSQL VS MongoDB: a use case comparison

From: Merlin Moncure
On Mon, Nov 19, 2018 at 11:26 AM Stephen Frost <sfrost@snowman.net> wrote:
> Looks like a lot of the difference being seen and the comments made
> about one being faster than the other are because one system is
> compressing *everything*, while PG (quite intentionally...) only
> compresses the data sometimes- once it hits the TOAST limit.  That
> likely also contributes to why you're seeing the on-disk size
> differences that you are.

Hm.  It may be intentional, but is it ideal?  Employing datum
compression in the 1kb-8kb range with a faster but less compressing
algorithm could give benefits.

merlin


Re: PostgreSQL VS MongoDB: a use case comparison

From: Stephen Frost
Greetings,

* Merlin Moncure (mmoncure@gmail.com) wrote:
> On Mon, Nov 19, 2018 at 11:26 AM Stephen Frost <sfrost@snowman.net> wrote:
> > Looks like a lot of the difference being seen and the comments made
> > about one being faster than the other are because one system is
> > compressing *everything*, while PG (quite intentionally...) only
> > compresses the data sometimes- once it hits the TOAST limit.  That
> > likely also contributes to why you're seeing the on-disk size
> > differences that you are.
>
> Hm.  It may be intentional, but is it ideal?  Employing datum
> compression in the 1kb-8kb range with a faster but less compressing
> algorithm could give benefits.

Well, pglz is actually pretty fast and not as good at compression as
other things.  I could certainly see an argument for allowing a column
to always be (or at least attempted to be) compressed.
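
Right now the closest per-column knob is the TOAST storage strategy, which
controls whether compression is attempted at all but not the threshold,
e.g.:

    -- EXTENDED (the default for jsonb) allows compression and out-of-line
    -- storage; MAIN prefers keeping the value compressed in the main table;
    -- EXTERNAL and PLAIN disable compression.  None of these makes values
    -- below the TOAST threshold get compressed, though.
    ALTER TABLE documents ALTER COLUMN data SET STORAGE MAIN;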

There's been a lot of discussion around supporting alternative
compression algorithms but making that happen is a pretty big task.

Thanks!

Stephen


Re: PostgreSQL VS MongoDB: a use case comparison

From: Merlin Moncure
On Tue, Nov 20, 2018 at 10:43 AM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Merlin Moncure (mmoncure@gmail.com) wrote:
> > On Mon, Nov 19, 2018 at 11:26 AM Stephen Frost <sfrost@snowman.net> wrote:
> > > Looks like a lot of the difference being seen and the comments made
> > > about one being faster than the other are because one system is
> > > compressing *everything*, while PG (quite intentionally...) only
> > > compresses the data sometimes- once it hits the TOAST limit.  That
> > > likely also contributes to why you're seeing the on-disk size
> > > differences that you are.
> >
> > Hm.  It may be intentional, but is it ideal?  Employing datum
> > compression in the 1kb-8kb range with a faster but less compressing
> > algorithm could give benefits.
>
> Well, pglz is actually pretty fast and not as good at compression as
> other things.  I could certainly see an argument for allowing a column
> to always be (or at least attempted to be) compressed.
>
> There's been a lot of discussion around supporting alternative
> compression algorithms but making that happen is a pretty big task.

Yeah; pglz is closer to zlib.  There's much faster stuff out
there...Andres summed it up pretty well;
https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de

There are also some interesting discussions on jsonb-specific compression
approaches.

merlin


Re: PostgreSQL VS MongoDB: a use case comparison

From: Stephen Frost
Greetings,

* Merlin Moncure (mmoncure@gmail.com) wrote:
> On Tue, Nov 20, 2018 at 10:43 AM Stephen Frost <sfrost@snowman.net> wrote:
> > * Merlin Moncure (mmoncure@gmail.com) wrote:
> > > On Mon, Nov 19, 2018 at 11:26 AM Stephen Frost <sfrost@snowman.net> wrote:
> > > > Looks like a lot of the difference being seen and the comments made
> > > > about one being faster than the other are because one system is
> > > > compressing *everything*, while PG (quite intentionally...) only
> > > > compresses the data sometimes- once it hits the TOAST limit.  That
> > > > likely also contributes to why you're seeing the on-disk size
> > > > differences that you are.
> > >
> > > Hm.  It may be intentional, but is it ideal?  Employing datum
> > > compression in the 1kb-8kb range with a faster but less compressing
> > > algorithm could give benefits.
> >
> > Well, pglz is actually pretty fast and not as good at compression as
> > other things.  I could certainly see an argument for allowing a column
> > to always be (or at least attempted to be) compressed.
> >
> > There's been a lot of discussion around supporting alternative
> > compression algorithms but making that happen is a pretty big task.
>
> Yeah; pglz is closer to zlib.  There's much faster stuff out
> there...Andres summed it up pretty well;
> https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de
>
> There are also some interesting discussions on jsonb-specific compression
> approaches.

Oh yes, having a dictionary would be a great start to reducing the size
of the jsonb data, though it could then become a contention point if
there's a lot of new values being inserted and such.  Naturally there
would also be a cost to pulling that data back out as well but likely it
would be well worth the benefit of not having to store the field names
repeatedly.

Then again, taken far enough, what you end up with are tables... :)

Thanks!

Stephen


Re: PostgreSQL VS MongoDB: a use case comparison

From: Merlin Moncure
On Tue, Nov 20, 2018 at 11:28 AM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Merlin Moncure (mmoncure@gmail.com) wrote:
> > On Tue, Nov 20, 2018 at 10:43 AM Stephen Frost <sfrost@snowman.net> wrote:
> > > * Merlin Moncure (mmoncure@gmail.com) wrote:
> > > > On Mon, Nov 19, 2018 at 11:26 AM Stephen Frost <sfrost@snowman.net> wrote:
> > > > > Looks like a lot of the difference being seen and the comments made
> > > > > about one being faster than the other are because one system is
> > > > > compressing *everything*, while PG (quite intentionally...) only
> > > > > compresses the data sometimes- once it hits the TOAST limit.  That
> > > > > likely also contributes to why you're seeing the on-disk size
> > > > > differences that you are.
> > > >
> > > > Hm.  It may be intentional, but is it ideal?  Employing datum
> > > > compression in the 1kb-8kb range with a faster but less compressing
> > > > algorithm could give benefits.
> > >
> > > Well, pglz is actually pretty fast and not as good at compression as
> > > other things.  I could certainly see an argument for allowing a column
> > > to always be (or at least attempted to be) compressed.
> > >
> > > There's been a lot of discussion around supporting alternative
> > > compression algorithms but making that happen is a pretty big task.
> >
> > Yeah; pglz is closer to zlib.  There's much faster stuff out
> > there...Andres summed it up pretty well;
> > https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de
> >
> > There are also some interesting discussions on jsonb-specific compression
> > approaches.
>
> Oh yes, having a dictionary would be a great start to reducing the size
> of the jsonb data, though it could then become a contention point if
> there's a lot of new values being inserted and such.  Naturally there
> would also be a cost to pulling that data back out as well but likely it
> would be well worth the benefit of not having to store the field names
> repeatedly.

Yes, the biggest concern with a shared dictionary ought to be
concurrency type problems.

merlin


Re: PostgreSQL VS MongoDB: a use case comparison

From: Stephen Frost
Greetings,

* Merlin Moncure (mmoncure@gmail.com) wrote:
> On Tue, Nov 20, 2018 at 11:28 AM Stephen Frost <sfrost@snowman.net> wrote:
> > Oh yes, having a dictionary would be a great start to reducing the size
> > of the jsonb data, though it could then become a contention point if
> > there's a lot of new values being inserted and such.  Naturally there
> > would also be a cost to pulling that data back out as well but likely it
> > would be well worth the benefit of not having to store the field names
> > repeatedly.
>
> Yes, the biggest concern with a shared dictionary ought to be
> concurrency type problems.

Hmmm, I wonder if we could do something like have a dictionary per
page..  Or perhaps based on some hash of the toast ID..  Not sure. :)

Thanks!

Stephen


Re: PostgreSQL VS MongoDB: a use case comparison

From: Thomas Kellerer
Stephen Frost wrote on 20.11.2018 at 18:28:
> Oh yes, having a dictionary would be a great start to reducing the size
> of the jsonb data, though it could then become a contention point if
> there's a lot of new values being inserted and such.  Naturally there
> would also be a cost to pulling that data back out as well but likely it
> would be well worth the benefit of not having to store the field names
> repeatedly.

There is an extension for a dictionary based JSONB compression:

https://github.com/postgrespro/zson
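
Roughly, it is used like this (from memory of that README, so check the
repository for the exact calls; the table and column names here are just
examples):

    CREATE EXTENSION zson;

    -- Learn a dictionary from existing jsonb data...
    SELECT zson_learn('{{"documents", "data"}}');

    -- ...then use the zson type in place of jsonb; values are transparently
    -- compressed against that dictionary.
    CREATE TABLE documents_zson (
        id   bigint PRIMARY KEY,
        data zson
    );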



Re: PostgreSQL VS MongoDB: a use case comparison

From: Oleg Bartunov
On Wed, Nov 21, 2018 at 9:48 AM Thomas Kellerer <spam_eater@gmx.net> wrote:
>
> Stephen Frost wrote on 20.11.2018 at 18:28:
> > Oh yes, having a dictionary would be a great start to reducing the size
> > of the jsonb data, though it could then become a contention point if
> > there's a lot of new values being inserted and such.  Naturally there
> > would also be a cost to pulling that data back out as well but likely it
> > would be well worth the benefit of not having to store the field names
> > repeatedly.
>
> There is an extension for a dictionary based JSONB compression:
>
> https://github.com/postgrespro/zson

That was a 'toy' experiment.  We did several experiments on jsonb
compression and presented the results, for example:
http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconf.us-2017.pdf

Also, check this thread on custom compression:

https://www.postgresql.org/message-id/flat/CAF4Au4xop7FqhCKgabYWymUS0yUk9i%3DbonPnmVUBbpoKsFYnLA%40mail.gmail.com#d01913d3b939b472ea5b38912bf3cbe4

There is now a YCSB-JSON benchmark available, and it would be worth
running it for Postgres:
https://dzone.com/articles/ycsb-json-implementation-for-couchbase-and-mongodb
We are pretty busy, so you may contribute.

For better indexing we are working on parameters for opclasses, and I
really wanted to have that for PG 12.
http://www.sai.msu.su/~megera/postgres/talks/opclass_pgcon-2018.pdf

My recommendation for performance testing: always run concurrent queries,
and the distribution of queries should in most cases be zipfian (support
for which we added in PG 11).
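
With pgbench that can be as simple as a custom script using
random_zipfian() (the table and parameters below are only an example):

    -- bench.sql
    \set id random_zipfian(1, 1000000, 1.07)
    SELECT data FROM documents WHERE id = :id;

and then run it with concurrent clients, e.g.
pgbench -n -c 16 -j 4 -T 300 -f bench.sql mydb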


-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: PostgreSQL VS MongoDB: a use case comparison

From: Nicolas Paris
On Tue, Nov 20, 2018 at 08:34:20AM -0500, Stephen Frost wrote:
> > Anyway, bringing data from JSON to a relational model is off topic for
> > the current discussion, since what we are actually questioning is whether
> > Postgres is a good replacement for Mongo when handling JSON data.
> 
> This narrow viewpoint isn't really sensible though- what you should be
> thinking about is what's appropriate for your *data*.  JSON is just a
> data format, and while it's alright as a system inter-exchange format,
> it's rather terrible as a storage format.

I would add that *FHIR* itself is an inter-exchange format rather than a
storage format. The FHIR spec evolves, and its JSON representation with
it. When the data is kept in a relational format, only the serialization
process (e.g. json_tuple & co) has to change, not the data itself. When
FHIR is stored as JSON, the information is frozen at one version of the
spec and becomes complicated to evolve.
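
As a small sketch of what I mean (names invented): the stored model stays
relational and the FHIR JSON is produced at query time, so a new FHIR
version only changes this serialization query:

    SELECT jsonb_build_object(
               'resourceType', 'Patient',
               'id',           p.id,
               'gender',       p.gender,
               'birthDate',    p.birth_date
           ) AS fhir_patient
    FROM   patient p;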


-- 
nicolas