Thread: PostgreSQL Volume Question
> > Hi, I'm new to the community.
> >
> > Recently, I've been involved in a project that develops a social network
> > data analysis service (and my client's DBMS is based on PostgreSQL).
> > I need to gather a huge volume of unstructured raw data for this project,
> > and the problem is that with PostgreSQL, it would be so difficult to
> > handle this kind of data. Are there any PG extension modules or methods
> > that are recommended for my project?

Can you give a number for "huge volume", and how did you conclude that PG
cannot handle it?
On 06/14/2018 02:33 PM, Data Ace wrote:
> Hi, I'm new to the community.
>
> Recently, I've been involved in a project that develops a social network
> data analysis service (and my client's DBMS is based on PostgreSQL).
> I need to gather a huge volume of unstructured raw data for this project,
> and the problem is that with PostgreSQL, it would be so difficult to
> handle this kind of data. Are there any PG extension modules or methods
> that are recommended for my project?

In addition to Ravi's questions:

What does the data look like?

What Postgres version?

How is the data going to get from A <--> B, local or remotely or both?

Is there another database or program involved in the process?

> Thanks in advance.
--
Adrian Klaver
adrian.klaver@aklaver.com
In addition to Ravi's and Adrian's questions:

What is the hardware configuration?

--
Melvin Davidson
Maj. Database & Exploration Specialist
Universe Exploration Command – UXC
Employment by invitation only!
On Thu, 14 Jun 2018 14:33:54 -0700
Data Ace <dataace9@gmail.com> wrote:

> Hi, I'm new to the community.
>
> Recently, I've been involved in a project that develops a social
> network data analysis service (and my client's DBMS is based on
> PostgreSQL). I need to gather a huge volume of unstructured raw data
> for this project, and the problem is that with PostgreSQL, it would
> be so difficult to handle this kind of data. Are there any PG
> extension modules or methods that are recommended for my project?

"Huge" by modern standards is petabytes, which might require some
specialized database service for a data lake.

Short of that, look up the "jsonb" data type in Postgres. The nice thing
about using PG for this is that you can keep enough identifying data and
metadata in a relational system, where it is easier to query, and the
documents in jsonb, where they are still accessible.

--
Steven Lembark                                     1505 National Ave
Workhorse Computing                               Rockford, IL 61103
lembark@wrkhors.com                                 +1 888 359 3508
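A minimal sketch of the layout Steven describes, with invented table and
column names (nothing here comes from the original post):

    -- Relational columns hold the identifying fields and metadata that are
    -- queried most often; the raw document rides along in a jsonb column.
    CREATE TABLE raw_posts (
        post_id      bigserial PRIMARY KEY,
        author_id    bigint      NOT NULL,
        collected_at timestamptz NOT NULL DEFAULT now(),
        source       text        NOT NULL,   -- e.g. which network / API
        doc          jsonb       NOT NULL    -- the unstructured payload
    );

    -- A GIN index lets you filter on keys inside the document with @>.
    CREATE INDEX raw_posts_doc_idx ON raw_posts USING gin (doc jsonb_path_ops);

    -- Example: posts whose document carries a given hashtag.
    SELECT post_id, author_id
    FROM raw_posts
    WHERE doc @> '{"hashtags": ["postgres"]}';

The relational columns stay cheap to index and join on, while the raw
document is only unpacked when a query actually needs something inside it.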
Well, I think my question strayed from my intention because of my poor
understanding and phrasing :(

Actually, I have 1TB of data and hardware specs sufficient to handle that
amount, but the problem is that it needs too many join operations and the
analysis process is going too slowly right now.

I've searched and found that a graph model fits network data like social
data nicely in terms of query performance.

Should I change my DB (I mean the DB used for analysis), or do I need some
other solution or extension?
Thanks
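A brief illustration, not from the thread: multi-hop traversals are what
make plain SQL join-heavy for this kind of data. Assuming a hypothetical
edges(follower_id, followee_id) table (names invented for the sketch),
each extra hop in the social graph is another self-join, or one more level
of a recursive CTE:

    -- Hypothetical follower graph: one row per "follower_id follows followee_id".
    CREATE TABLE edges (
        follower_id bigint NOT NULL,
        followee_id bigint NOT NULL,
        PRIMARY KEY (follower_id, followee_id)
    );
    CREATE INDEX edges_followee_idx ON edges (followee_id);

    -- Everyone reachable from user 42 within three hops; each hop is
    -- effectively another join against the same table.
    WITH RECURSIVE reachable AS (
        SELECT followee_id AS user_id, 1 AS depth
        FROM edges
        WHERE follower_id = 42
      UNION
        SELECT e.followee_id, r.depth + 1
        FROM reachable r
        JOIN edges e ON e.follower_id = r.user_id
        WHERE r.depth < 3
    )
    SELECT DISTINCT user_id FROM reachable;

A graph-oriented engine expresses the same traversal as a path pattern
rather than stacked joins, which is where the query-performance argument
for a graph model comes from.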
> What Postgres version?
> How is the data going to get from A <--> B, local or remotely or both?
> Is there another database or program involved in the process?
So what is the structure of the tables involved, including indexes?
What is the actual query?
--
Maj. Database & Exploration Specialist
Universe Exploration Command – UXC
Employment by invitation only!
> Well, I think my question strayed from my intention because of my poor
> understanding and phrasing :(
>
> Actually, I have 1TB of data and hardware specs sufficient to handle that
> amount, but the problem is that it needs too many join operations and the
> analysis process is going too slowly right now.
>
> I've searched and found that a graph model fits network data like social
> data nicely in terms of query performance.

If your data is hierarchical, then storing it in a network database is
perfectly reasonable. I'm not sure, though, that there are many network
databases for Linux. Raima is the only one I can think of.

> Should I change my DB (I mean the DB used for analysis), or do I need some
> other solution or extension?
>
> Thanks
Angular momentum makes the world go 'round.
As with base backups, the easiest way to produce a standalone hot backup is to use the pg_basebackup tool. If you include the -X
parameter when calling it, all the write-ahead log required to use the backup will be included in the backup automatically, and no special action is required to restore the backup.
Hi Pierre,

On Tue, Jun 19, 2018 at 12:03:58PM +0000, Pierre Timmermans wrote:
> Here is the doc, the sentence that I find misleading is "There are
> backups that cannot be used for point-in-time recovery", also
> mentioning that they are faster than pg_dumps add to confusion (since
> pg_dumps cannot be used for PITR):
> https://www.postgresql.org/docs/current/static/continuous-archiving.html
Yes, it is indeed perfectly possible to use such backups to do a PITR
as long as you have a WAL archive able to replay up to the point where
you want the replay to happen, so I agree that this is a bit confusing.
This part of the documentation has been there since the beginning of time,
well, 6559c4a2 to be exact. Perhaps we would want to reword this
sentence as follows:
"These are backups that could be used for point-in-time recovery if
combined with a WAL archive able to recover up to the wanted recovery
point. These backups are typically much faster to backup and restore
than pg_dump for large deployments but can result as well in larger
backup sizes, so the speed of one method or the other is to evaluate
carefully first."
I am open to better suggestions of course.
--
Michael
Data Ace schrieb am 15.06.2018 um 18:26:
> Well, I think my question strayed from my intention because of my poor
> understanding and phrasing :(
>
> Actually, I have 1TB of data and hardware specs sufficient to handle that
> amount, but the problem is that it needs too many join operations and the
> analysis process is going too slowly right now.
>
> I've searched and found that a graph model fits network data like social
> data nicely in terms of query performance.
>
> Should I change my DB (I mean the DB used for analysis), or do I need some
> other solution or extension?

AgensGraph is a Postgres fork implementing a graph database that supports
Cypher as the query language while at the same time still supporting SQL
(and even queries mixing both).

I have never used it, but maybe it's worth a try.

http://bitnine.net/agensgraph/

Thomas
Hi Pierre,

On Wed, Jun 20, 2018 at 08:06:31AM +0000, Pierre Timmermans wrote:
> Hi Michael

You should avoid top-posting on the Postgres lists, this is not the
usual style used by people around :)

> Thanks for the confirmation. Your rewording removes the confusion. I
> would maybe take the opportunity to re-instate that pg_dump cannot be
> used for PITR, so in the line of
>
> "These are backups that could be used for point-in-time recovery if
> combined with a WAL archive able to recover up to the wanted recovery
> point. These backups are typically much faster to backup and restore
> than pg_dump for large deployments but can result as well in larger
> backup sizes, so the speed of one method or the other is to evaluate
> carefully first. Consider also that pg_dump backups cannot be used for
> point-in-time recovery."

Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.

> Maybe the confusion stems from the fact that if you restore a
> standalone (self-contained) pg_basebackup then - by default - recovery
> is done with the recovery_target immediate option, so if one needs
> point-in-time recovery he has to edit the recovery.conf and bring in
> the archives.

Perhaps. There is really nothing preventing one from adding a recovery.conf
afterwards, which is also why pg_basebackup -R exists. I do that as
well for some of the frameworks I work with and maintain.

--
Michael
[snip]
> Attached is a patch which includes your suggestion. What do you think?
> As that's an improvement, only HEAD would get that clarification.
You've *got* to be kidding.
Fixing an ambiguously or poorly worded bit of documentation should obviously be pushed to all affected versions.
Angular momentum makes the world go 'round.
> You should avoid top-posting on the Postgres lists, this is not the
> usual style used by people around :)

Will do, but Yahoo Mail! does not seem to like that, so I am typing the > myself.

> Attached is a patch which includes your suggestion. What do you think?
> As that's an improvement, only HEAD would get that clarification.

Yes, I think it is now perfectly clear. Much appreciated to have the chance
to contribute to the doc, by the way; it is very nice.

> Perhaps. There is really nothing preventing one from adding a recovery.conf
> afterwards, which is also why pg_basebackup -R exists. I do that as
> well for some of the frameworks I work with and maintain.
On 21/06/18 07:27, Michael Paquier wrote:
> Attached is a patch which includes your suggestion. What do you think?
> As that's an improvement, only HEAD would get that clarification.

Say what?  If the clarification applies to previous versions, as it
does, it should be backpatched.  This isn't a change in behavior, it's a
change in the description of existing behavior.
--
Vik Fearing                                        +33 6 46 75 15 36
http://2ndQuadrant.fr      PostgreSQL : Expertise, Formation et Support
On Thu, Jun 21, 2018 at 04:42:00PM -0400, Ravi Krishna wrote:
> Same here even though I use Mac mail. But it is not yahoo alone.
> Most of the web email clients have resorted to top posting. I miss
> the old days of Outlook Express which was so '>' friendly. I think
> Gmail allows '>' when you click on the dots to expand the mail you
> are replying to, but it messes up the justification and formatting.

Those products have good practices when it comes to breaking and redefining
what the concept behind emails is...
--
Michael
On Thu, Jun 21, 2018 at 04:50:38PM -0700, David G. Johnston wrote:
> Generally only actual bug fixes get back-patched; but I'd have to say
> this looks like it could easily be classified as one.

Everybody is against me here ;)

> Some comments on the patch itself:
>
> "recover up to the wanted recovery point." - "desired recovery point" reads
> better to me
>
> ====
> "These backups are typically much faster to backup and restore" - "These
> backups are typically much faster to create and restore"; avoid repeated
> use of the word backup

Okay.

> "but can result as well in larger backup sizes" - "but can result in larger
> backup sizes", drop the unnecessary 'as well'

Okay.

> I like adding "cold backup" here to help contrast and explain why a base
> backup is considered a "hot backup". The rest is style to make that flow
> better.

Indeed. The section uses hot backups a lot.

What do folks here think about the updated version attached?
--
Michael
On Thu, Jun 21, 2018 at 04:50:38PM -0700, David G. Johnston wrote:
> On Thu, Jun 21, 2018 at 4:26 PM, Vik Fearing <vik.fearing@2ndquadrant.com>
> wrote:
> > On 21/06/18 07:27, Michael Paquier wrote:
> > > Attached is a patch which includes your suggestion. What do you think?
> > > As that's an improvement, only HEAD would get that clarification.
> >
> > Say what?  If the clarification applies to previous versions, as it
> > does, it should be backpatched.  This isn't a change in behavior, it's a
> > change in the description of existing behavior.
>
> Generally only actual bug fixes get back-patched; but I'd have to say this
> looks like it could easily be classified as one.

FYI, in recent discussions on the docs list:

    https://www.postgresql.org/message-id/CABUevEyumGh3r05U3_mhRrEU=dfacdRr2HEw140MvN7FSBMSyw@mail.gmail.com

there was the conclusion that:

    If it's a clean backpatch I'd say it is -- people who are using
    PostgreSQL 9.6 will be reading the documentation for 9.6 etc, so they
    will not know about the fix then.

    If it's not a clean backpatch I can certainly see considering it, but if
    it's not a lot of effort then I'd say it's definitely worth it.

so the rule I have been using for backpatching doc stuff has changed
recently.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
On Mon, Jun 25, 2018 at 12:51:10PM -0400, Bruce Momjian wrote:
> FYI, in recent discussions on the docs list:
>
> https://www.postgresql.org/message-id/CABUevEyumGh3r05U3_mhRrEU=dfacdRr2HEw140MvN7FSBMSyw@mail.gmail.com

I did not recall this one. Thanks for the reminder, Bruce.

> There was the conclusion that:
>
>     If it's a clean backpatch I'd say it is -- people who are using
>     PostgreSQL 9.6 will be reading the documentation for 9.6 etc, so they
>     will not know about the fix then.
>
>     If it's not a clean backpatch I can certainly see considering it, but if
>     it's not a lot of effort then I'd say it's definitely worth it.
>
> so the rule I have been using for backpatching doc stuff has changed
> recently.

In the case of this thread, I think that the patch applies cleanly anyway,
as this comes from the period when hot standbys were introduced, so that
would not be a lot of work...

Speaking of which, it would be nice to be sure about the wording folks here
would prefer before fixing anything ;p
--
Michael