Thread: Ideas for building a system that parses medical research publications/articles

Ideas for building a system that parses medical research publications/articles

From
Achilleas Mantzios
Date:
Hello

I am imagining a system that can parse papers from various sources 
(web, files, etc.) and in various formats (text, PDF, etc.) and can store 
metadata for each paper: some kind of global ID if applicable, authors, 
areas of research, whether the paper is "new", "highlighted", or 
"historical", type (e.g. case reports, clinical trials), symptoms (e.g. 
tics, GI pain, psychological changes, anxiety), and other key 
attributes (dynamic, I guess). It must also be full-text searchable, etc.
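
As a rough illustration of the metadata described above, here is a minimal Python sketch of one record. Every name in it (`PaperRecord`, `extra`, `search_text`, the example URL) is hypothetical and only meant to make the list of fields concrete. DOI and PMID are mentioned merely as possible kinds of global ID, which is an assumption and not something the message specifies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The three status values named in the message.
ALLOWED_STATUSES = ("new", "highlighted", "historical")


@dataclass
class PaperRecord:
    """Metadata for one paper; the full text itself is not kept here."""
    title: str
    source_url: str
    global_id: Optional[str] = None      # assumption: could be a DOI or PMID
    authors: List[str] = field(default_factory=list)
    research_areas: List[str] = field(default_factory=list)
    status: str = "new"
    paper_type: Optional[str] = None     # e.g. "case report", "clinical trial"
    symptoms: List[str] = field(default_factory=list)
    extra: Dict[str, str] = field(default_factory=dict)  # dynamic attributes

    def __post_init__(self) -> None:
        # Reject statuses outside the three categories.
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def search_text(self) -> str:
        """Flatten the textual fields into one lowercase string for indexing."""
        parts = [self.title, *self.authors, *self.research_areas, *self.symptoms]
        if self.paper_type:
            parts.append(self.paper_type)
        parts.extend(self.extra.values())
        return " ".join(parts).lower()


record = PaperRecord(
    title="Example paper",
    source_url="https://example.org/paper-1",  # made-up URL
    symptoms=["tics", "anxiety"],
    extra={"cohort": "pediatric"},
)
print(record.search_text())
```

The free-form `extra` dictionary is one way to hold the "dynamic" attributes. In PostgreSQL a comparable choice would be a `jsonb` column next to the fixed columns.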

I am at the very beginning of this, and it is being done on a fully volunteer 
basis.

Lots of questions: is there any scientific/scholarly analysis software 
already available? If there is, and it is really good and open source, then it 
will influence the rest of the decisions. Otherwise, I'll have to form a 
team that can write one, in which case I'll have to decide on the DB, language, 
etc. I have worked with pgsql for 20 years, so it is the natural choice for any kind 
of data; I just ask this for the sake of completeness.

All ideas welcome.




‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Saturday, 5 June 2021 10:49, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

> Hello
>
> I am imagining a system that can parse papers from various sources
> (web/files/etc) and in various formats (text, pdf, etc) and can store
> metadata for this paper ,some kind of global ID if applicable, authors,
> areas of research, whether the paper is "new", "highlighted",
> "historical", type (e.g. Case reports, Clinical trials), symptoms (e.g.
> tics, GI pain, psychological changes, anxiety, ), and other key
> attributes (I guess dynamic), it must be full text searchable, etc.

Hello Achilleas

Not wishing to be discouraging, but you have very ambitious goals for what sounds like a one-person project?

You are effectively looking at competing with platforms such as Elsevier Scopus/Scival which are market-leaders in the
area for good reason (i.e. it takes a lot of manpower to write algorithms, manage metadata etc., and the only way to
consistently maintain that manpower is to employ people, lots of them). There are also things like Google Scholar
around the place.

I think before starting on the technical side of Postgres etc., the honest truth is that you need to do more planning,
both in terms of implementation and long-term sustainability.

For example, before we even get to metadata, you talk of various sources and formats. Have you considered licensing
issues? Have you considered how to keep the dataset clean? (If you are thinking you can just scrape the web, then
you'll be in for a surprise).

Laura



Re: Ideas for building a system that parses medical research publications/articles

From
Achilleas Mantzios
Date:
On 5/6/21 1:52 PM, Laura Smith wrote:

> Hello Achilleas
>
> Not wishing to be discouraging, but you have very ambitious goals for what sounds like a one-person project?
>
> You are effectively looking at competing with platforms such as Elsevier Scopus/Scival which are market-leaders in
> the area for good reason (i.e. it takes a lot of manpower to write algorithms, manage metadata etc., and the only way to
> consistently maintain that manpower is to employ people, lots of them). There are also things like Google Scholar
> around the place.
>
> I think before starting on the technical side of Postgres etc., the honest truth is that you need to do more
> planning, both in terms of implementation and long-term sustainability.
>
> For example, before we even get to metadata, you talk of various sources and formats. Have you considered licensing
> issues? Have you considered how to keep the dataset clean? (If you are thinking you can just scrape the web, then
> you'll be in for a surprise).

All I have so far are some very vague descriptions coming from people on 
either the advocacy side or the medical side.

I have no idea about the legal status of those documents; as you know, some 
are covered by the artistic license (a few in PubMed), some are not.

I am not a lawyer. The data are not to be stored locally AFAIK, so only 
metadata will be kept locally, and it can be reset, refreshed, amended, etc.

Parsing will be equivalent to a one-off reading of the article on the 
web by a human. There is a lawyer handling all of that. Of the whole network of people 
interested in this endeavor, I am the only one with DB/software 
knowledge, which is why I volunteered.

I know it's a huge amount of work, but you are missing a point. Nobody wishes to 
compete with anyone. This is about a project, a parent-advocacy 
non-profit that *ONLY* aims to save the sick children (or maybe also 
very young adults) of a certain spectrum. So the goal is to make the 
right tools for researchers, clinicians and parents. This market is too 
small to even consider making any money out of it, but the research is 
still very expensive and the progress slower than optimal.

> Laura



> I am imagining a system that can parse papers from various sources
> (web/files/etc) and in various formats (text, pdf, etc) and can store
> metadata for this paper ,some kind of global ID if applicable, authors,
> areas of research, whether the paper is "new", "highlighted",
> "historical", type

Those three categories won't help much. I'm sure, though, that you had
something specific in mind with them?

Karsten





Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Saturday, 5 June 2021 12:14, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:


>
> I know it's a huge work, but you are missing a point. Nobody wishes to
> compete with anyone. This is about a project, a parent-advocacy
> non-profit that ONLY aims to save the sick children (or maybe also
> very young adults) of a certain spectrum. So the goal is to make the
> right tools for researchers, clinicians and parents. This market is too
> small to even consider making any money out of it, but the research is
> still very expensive and the progress slower than optimum.


Unfortunately I'm not "missing a point", your final paragraph summarises your position.

You have been taken in by the very charitable goal of saving sick children.

Unfortunately your head has been disconnected from your heart.

If we put the charitable purpose to one side and take a purely objective view of what you want to do, my original
statement still stands, i.e. the certainty that you are grossly underestimating the technical and practical complexities
of what you want to achieve.



To get started with collecting doc metadata: it looks like this tool can help you get started.
Postgres does support fuzzy text search, so I think dumping the metadata/abstracts into PostgreSQL and then using extensions such as trigram, tsearch, etc. should work well for a POC.
This being a pg mailing list :) what would be your expectations for the type of data, the growth of the data, and your queries?
If you store data to support multilingual papers, will PostgreSQL be able to handle it?
Ideally the docs would be stored somewhere on object storage etc., and a link to them would be stored in the DB for when someone requests to read the whole paper.
Long ago I read this
So if this could work, your POC should too :) with PostgreSQL.
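
To make the trigram idea above concrete, here is a small pure-Python sketch of trigram similarity. It only approximates what the pg_trgm extension is documented to do (lowercase the text, pad each word with two leading spaces and one trailing space, then compare the sets of three-character substrings). It is not the extension's actual code, and the function names and the 0.3 cut-off used here are choices made for this illustration.

```python
import re
from typing import List, Set


def trigrams(text: str) -> Set[str]:
    """Return the set of trigrams for text, padding every word as '  word '."""
    grams: Set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fuzzy_search(query: str, candidates: List[str], threshold: float = 0.3) -> List[str]:
    """Return candidates whose similarity to query reaches the threshold, best first."""
    scored = sorted(((similarity(query, c), c) for c in candidates), reverse=True)
    return [c for score, c in scored if score >= threshold]


# A misspelled query still finds the intended term.
print(fuzzy_search("anxeity", ["anxiety", "GI pain", "tics"]))
```

In PostgreSQL itself the equivalent pieces are the `similarity()` function and the `%` operator from pg_trgm, which can be backed by a GIN or GiST index, so a POC would normally rely on those and not on application code like this.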



--
Thanks,
Vijay
Mumbai, India



On 6/5/21 2:49 AM, Achilleas Mantzios wrote:
> Lots of questions : is there any scientific/scholar analysis software 
> already available? If yes and is really good and open source , then this 
> will influence the rest of decisions. Otherwise , I'll have to form a 
> team that can write one, in this case I'll have to decide DB, language, 
> etc. I work 20 years with pgsql so it is the natural choice for any kind 
> of data, I just ask this for the sake of completeness.
> 
> All ideas welcome.

A quick search found this:

https://solutionsreview.com/data-management/the-best-open-source-data-catalog-tools-to-consider/

Might be a good starting point on what is already out there.

There is also this:

The Directory of Open Access Journals
https://doaj.org/

It seems to be a service, not downloadable software.


> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Ideas for building a system that parses medical research publications/articles

From
Achilleas Mantzios
Date:
On 5/6/21 6:34 PM, Adrian Klaver wrote:
> A quick search found this:
>
> https://solutionsreview.com/data-management/the-best-open-source-data-catalog-tools-to-consider/ 
>
>
> Might be a good starting point on what is already out there.

This is interesting, so the keywords are "Data Catalog"?

>
> There is also this:
>
> The Directory of Open Access Journals
> https://doaj.org/
>
This seems very, very poor. Just try a search there and then repeat it in 
PMC (PubMed Central).
> It seems to be a service, not downloadable software.



Re: Ideas for building a system that parses medical research publications/articles

From
Achilleas Mantzios
Date:


On 5/6/21 4:45 PM, Vijaykumar Jain wrote:

> To get started with collecting doc metadata: it looks like this tool can help you get started.

I checked; it behaves better with downloaded PDFs than with PDFs fetched by URL, where the metadata are poor.

It does not work with NIH articles (but this is a general problem, not Tika's).


On 6/5/21 9:56 AM, Achilleas Mantzios wrote:
> 
> On 5/6/21 6:34 PM, Adrian Klaver wrote:
>> A quick search found this:
>>
>> https://solutionsreview.com/data-management/the-best-open-source-data-catalog-tools-to-consider/ 
>>
>>
>> Might be a good starting point on what is already out there.
> 
> This is interesting, so the keywords are "Data Catalog" ?

What I searched on was 'open source article catalog'.

> 
>>
>> There is also this:
>>
>> The Directory of Open Access Journals
>> https://doaj.org/
>>
> This seems very very poor. Just try a search there and then repeat in 
> PMC (PubMed Central).

This is down to copyright issues I'm sure. For PubMed Central see:

https://www.ncbi.nlm.nih.gov/pmc/about/copyright/

for the ifs/ands/buts that restrict what you can do with the information 
and stay legal.

>> It seems to be a service, not downloadable software.


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Ideas for building a system that parses medical research publications/articles

From
Achilleas Mantzios
Date:
On 5/6/21 8:03 PM, Adrian Klaver wrote:
> This is down to copyright issues I'm sure. For PubMed Central see:
>
> https://www.ncbi.nlm.nih.gov/pmc/about/copyright/
>
> for the if/ands/buts that restrict what you can do with the 
> information and stay legal.

Maybe, but still:

https://www.ncbi.nlm.nih.gov/pmc/?term=open+access%5Bfilter%5D+PANDAS+IVIG

https://doaj.org/search/articles?ref=homepage-box&source=%7B%22query%22%3A%7B%22query_string%22%3A%7B%22query%22%3A%22IVIG%20PANDAS%22%2C%22default_operator%22%3A%22AND%22%7D%7D%7D




On 6/5/21 10:39 AM, Achilleas Mantzios wrote:
> maybe but still :
> 
> https://www.ncbi.nlm.nih.gov/pmc/?term=open+access%5Bfilter%5D+PANDAS+IVIG

Yeah it is nice to have the resources of the NIH behind you. Still I 
would point out under Copyright and License information:

"This article is made available via the PMC Open Access Subset for 
unrestricted research re-use and secondary analysis in any form or by 
any means with acknowledgement of the original source. These permissions 
are granted for the duration of the World Health Organization (WHO) 
declaration of COVID-19 as a global pandemic."

Further on PMC Open Access Subset:

https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

Again more ifs/ands/buts.

The point being, dealing with articles is a descent into legalese.  I am 
not saying this is a show stopper, just that it will consume considerable 
resources to sort out. I for one applaud your effort, and given what I 
have seen you do with the shipping software over the years, I don't see 
this project as out of the realm of possibility.



-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Ideas for building a system that parses medical research publications/articles

From
Achilleas Mantzios
Date:
On 5/6/21 10:12 PM, Adrian Klaver wrote:
> The point being, dealing with articles is a descent into legalese.  I 
> am not saying this is show stopper, just that it will consume 
> considerable resources to sort out. I for one applaud your effort and 
> given what I have seen you do with the shipping software over the 
> years I don't see this project as out of the realm of possibility.

Thank you, Adrian. There is no money in this project, but the stakes are 
much, much higher.



I think the key word here that will help you is biocuration. It's an established field involving people with
scientific, computational, and linguistic backgrounds who are familiar with the problem space, so I would suggest talking
to people working in this area first to get an idea of what's feasible, what's already out there, etc., as they will
know this better than the Postgres community.


You can see an example of the sort of annotation that is fully automated at the moment here:

https://monarchinitiative.org/tools/text-annotate

Given the potential impact on human health, some level of manual involvement in annotation is frequently part of the
workflow.
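
For readers unfamiliar with text annotation, the simplest automated form is dictionary matching: scan the text for known terms and record where each one occurs. The sketch below is only that and says nothing about how the tool linked above works internally. The vocabulary is a hypothetical three-entry list built from the symptoms named in the original message, and `annotate` is an illustrative name.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical vocabulary: surface form -> label to attach.
SYMPTOM_TERMS: Dict[str, str] = {
    "tics": "tics",
    "gi pain": "GI pain",
    "anxiety": "anxiety",
}

Annotation = Tuple[int, int, str, str]  # (start, end, matched text, label)


def annotate(text: str, vocabulary: Dict[str, str] = SYMPTOM_TERMS) -> List[Annotation]:
    """Find whole-word, case-insensitive matches of vocabulary terms in text."""
    hits: List[Annotation] = []
    for term, label in vocabulary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), match.group(0), label))
    return sorted(hits)  # ordered by position in the text


sample = "Patient presented with tics and severe anxiety; GI pain was also reported."
for start, end, matched, label in annotate(sample):
    print(start, end, matched, "->", label)
```

Matching like this misses synonyms, abbreviations and negations ("no tics observed" would still be tagged), which is one reason manual review usually stays in the loop.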

Daniel

-----Original Message-----
From: Achilleas Mantzios <achill@matrix.gatewaynet.com> 
Sent: 05 June 2021 10:49
To: pgsql-general@lists.postgresql.org
Subject: Ideas for building a system that parses medical research publications/articles [EXT]


--
 The Wellcome Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.