Re: [NOVICE] Database of articles, LaTeX code and pictures - Mailing list pgsql-novice

From Kevin Grittner
Subject Re: [NOVICE] Database of articles, LaTeX code and pictures
Msg-id CACjxUsOxtLTiTvNfJxPE375O9zcWZcA1h3UtFwi6L=Jxs1d8Cw@mail.gmail.com
In response to [NOVICE] Database of articles, LaTeX code and pictures  (philolilou <philolilou@free.fr>)
List pgsql-novice
On Wed, Jan 18, 2017 at 12:34 PM, philolilou <philolilou@free.fr> wrote:

> I wish to build a place to store articles (with pictures) that can be
> accessed easily later by searching.
>
> The articles I wish to store are magazine articles and some
> interesting articles from the internet.
>
> For the magazine articles, I thought I would scan all the interesting
> pages, run OCR (character recognition) on them, eventually convert them
> into LaTeX-formatted code, and insert everything into a PostgreSQL
> database.
>
> Once this is done, a PHP-driven website will run searches against the
> PostgreSQL database.
>
> For this application, I want to search in the following ways:
>
> -> I give a word or a topic, and the database searches the plain text
> of all the documents and ranks the results by relevance, that is, by
> the number of times that word or topic appears in the article
>
> -> by keywords: I specify keywords in the search and get all the
> articles matching those keywords

I'm not clear on what you see as the difference between these.

> Once the search is done and results are found, a single click delivers
> the PDF of the selected article, or displays it in the browser as the
> final rendered document (like the original article).
>
>
> Questions:
>
> 1. Is this possible with PostgreSQL? (store LaTeX code and images,
> search the text of the code plus keywords, and retrieve all of this
> for a LaTeX program to compile again)

I helped do something very like this, except that the documents
were text-based PDFs, and we used the poppler library to pull the
text out for processing.  Your job will be easier, because LaTeX is
already text, so you probably won't need to use anything as messy
to work with from within PostgreSQL as the C++ based poppler
library.  You might get away with parsing the LaTeX source
directly, and if not, a PL/Perl function should be fairly easy to
write (or adapt from LaTeX2HTML).
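
To give a rough idea of what "parsing the LaTeX source directly"
could look like, here is a minimal sketch in Python.  The function
name latex_to_text and the regular expressions are my own
assumptions, not anything from the system described here, and the
approach is deliberately crude: it only aims to produce text good
enough to feed to a full-text indexer, not to render LaTeX
faithfully.

```python
import re

# Hypothetical helper: crude LaTeX-to-text reduction, for indexing only.
_COMMENT = re.compile(r"(?<!\\)%.*")            # unescaped % to end of line
_MATH = re.compile(r"\$[^$]*\$")                # inline math between $...$
_ENV = re.compile(r"\\(?:begin|end)\{[^}]*\}")  # \begin{...} and \end{...}
_COMMAND = re.compile(r"\\[a-zA-Z]+\*?(?:\[[^\]]*\])?")  # \cmd, \cmd*, \cmd[opt]
_BRACES = re.compile(r"[{}]")
_SPACE = re.compile(r"\s+")


def latex_to_text(src: str) -> str:
    """Strip common LaTeX markup, keeping the words inside command arguments."""
    text = _COMMENT.sub("", src)
    text = _MATH.sub(" ", text)
    text = _ENV.sub(" ", text)
    text = _COMMAND.sub(" ", text)
    text = _BRACES.sub(" ", text)
    return _SPACE.sub(" ", text).strip()


sample = (
    "\\section{Power of attorney} % editorial note\n"
    "The \\emph{agent} signs $x^2$ here."
)
print(latex_to_text(sample))
```

Anything this does not handle (verbatim blocks, display math, macros
with unusual arguments) would need extra rules or a real converter.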

> 2. If the answer of question 1 is yes, how can i structure database for
> can use search in the latex code?

After experimenting and benchmarking different options, we chose to
store the document and a tsvector derived from the text of the
document as two columns in a single table, creating a GIN index on
the tsvector column.  We needed some special parsing capabilities,
and found the custom parsing feature unusable; so we achieved that
by using regular expressions to find the necessary information,
which we built as text and cast to tsvector, concatenating the
result with the output of the normal parser/dictionary processing.
The dictionary chain included stop word processing, a snowball
stemmer, and a thesaurus for legal terms (e.g. "power of attorney"
is a phrase which should match that exact sequence of words on a
search much more closely than just having "power" and "attorney"
somewhere near each other in the document).  We were able to give
words in the title of the document higher priority by concatenating
the to_tsvector() of the title (given a higher weight with
setweight()) with that of the body.  Of course, triggers were used
to maintain the tsvector column.
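
As an illustration of the layout described above, here is a sketch
under stated assumptions.  It is written as Python that only holds
the SQL as strings, so it can be handed to whatever driver you use.
The table, column, function, and trigger names, the 'english'
configuration, the extra_terms column, the regular expression, and
the %(terms)s placeholder style are all hypothetical.  Title
weighting is done here with setweight(), and the hand-built terms
are text cast to tsvector.  Such a cast does no normalization, so
the helper lowercases the tokens itself.

```python
import re

# Hypothetical schema: document, derived tsvector, GIN index, and trigger.
SCHEMA_SQL = """
CREATE TABLE article (
    id          bigserial PRIMARY KEY,
    title       text NOT NULL,
    latex_src   text NOT NULL,   -- original LaTeX, kept for recompiling
    body_text   text NOT NULL,   -- plain text extracted from latex_src
    extra_terms text,            -- hand-built tsvector literal, may be NULL
    search_vec  tsvector
);

CREATE INDEX article_search_idx ON article USING gin (search_vec);

CREATE FUNCTION article_search_vec_update() RETURNS trigger AS $$
BEGIN
    NEW.search_vec :=
        setweight(to_tsvector('english', NEW.title), 'A')
        || to_tsvector('english', NEW.body_text)
        || coalesce(NEW.extra_terms, '')::tsvector;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_search_vec_trg
    BEFORE INSERT OR UPDATE ON article
    FOR EACH ROW EXECUTE PROCEDURE article_search_vec_update();
"""

# Ranked search; %(terms)s is a driver-side placeholder (an assumption).
SEARCH_SQL = """
SELECT id, title, ts_rank(search_vec, query) AS rank
FROM article, plainto_tsquery('english', %(terms)s) AS query
WHERE search_vec @@ query
ORDER BY rank DESC
LIMIT 20;
"""

# Hypothetical pattern for identifiers the normal parser might split up.
CODE_RE = re.compile(r"\b[A-Z]{2,}-\d+\b")


def tsvector_literal(tokens):
    """Build text such as "'a':1 'b':2" that can be cast to tsvector."""
    parts = []
    for pos, tok in enumerate(tokens, start=1):
        # Inside a quoted lexeme, backslashes and quotes must be doubled.
        lexeme = tok.lower().replace("\\", "\\\\").replace("'", "''")
        parts.append(f"'{lexeme}':{pos}")
    return " ".join(parts)


print(tsvector_literal(CODE_RE.findall("See ISO-9001 and RFC-2119.")))
```

The string returned by tsvector_literal() would be stored in
extra_terms, and the trigger then folds it into search_vec on every
insert or update.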

We got results accurate enough that the users liked them, with an
average query time of about 300 ms for real-world searches against
a large database of legal documents.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

