Thread: Greenplum MapReduce
Hi all,
Has anybody worked on Greenplum MapReduce programming?
I am facing a problem while trying to execute the Greenplum MapReduce program below, written in YAML (in blue).
The error is thrown at the 7th line:
Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7 (in red)
Could somebody explain this error and suggest a potential solution?
%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT:
      NAME: doc
      TABLE: documents
  - INPUT:
      NAME: kw
      TABLE: keywords
  - MAP:
      NAME: doc_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in data.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        for term in terms:
          yield([doc_id, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - doc_id integer
        - data text
      RETURNS:
        - doc_id integer
        - term text
        - positions text
  - MAP:
      NAME: kw_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in keyword.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        yield([keyword_id, i, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - keyword_id integer
        - keyword text
      RETURNS:
        - keyword_id integer
        - nterms integer
        - term text
        - positions text
  - TASK:
      NAME: doc_prep
      SOURCE: doc
      MAP: doc_map
  - TASK:
      NAME: kw_prep
      SOURCE: kw
      MAP: kw_map
  - INPUT:
      NAME: term_join
      QUERY: |
        SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
               doc.positions as doc_positions,
               kw.positions as kw_positions
          FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
  - REDUCE:
      NAME: term_reducer
      TRANSITION: term_transition
      FINALIZE: term_finalizer
  - TRANSITION:
      NAME: term_transition
      LANGUAGE: python
      PARAMETERS:
        - state text
        - term text
        - nterms integer
        - doc_positions text
        - kw_positions text
      FUNCTION: |
        if state:
          kw_split = state.split(':')
        else:
          kw_split = []
          for i in range(0,nterms):
            kw_split.append('')
        for kw_p in kw_positions.split(','):
          kw_split[int(kw_p)-1] = doc_positions
        outstate = kw_split[0]
        for s in kw_split[1:]:
          outstate = outstate + ':' + s
        return outstate
  - FINALIZE:
      NAME: term_finalizer
      LANGUAGE: python
      RETURNS:
        - count integer
      MODE: MULTI
      FUNCTION: |
        if not state:
          return 0
        kw_split = state.split(':')
        previous = None
        for i in range(0,len(kw_split)):
          isplit = kw_split[i].split(',')
          if any(map(lambda(x): x == '', isplit)):
            return 0
          adjusted = set(map(lambda(x): int(x)-i, isplit))
          if (previous):
            previous = adjusted.intersection(previous)
          else:
            previous = adjusted
        if previous:
          return len(previous)
        return 0
  - TASK:
      NAME: term_match
      SOURCE: term_join
      REDUCE: term_reducer
  - INPUT:
      NAME: final_output
      QUERY: |
        SELECT doc.*, kw.*, tm.count
          FROM documents doc, keywords kw, term_match tm
         WHERE doc.doc_id = tm.doc_id
           AND kw.keyword_id = tm.keyword_id
           AND tm.count > 0
EXECUTE:
  - RUN:
      SOURCE: final_output
      TARGET: STDOUT
Regards,
Suvankar Roy
=====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you
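Since the parser only reports "line 7", one way to see what it is objecting to is to print that line with repr(), which makes invisible characters such as tabs show up as escapes. This is only a rough sketch: the helper name show_line and the sample text are made up for illustration, and the tab before "- INPUT:" is placed there deliberately, since the thread does not show which character was actually in the file.

```python
def show_line(text, lineno):
    """Return line `lineno` (1-based) of `text` in repr() form,
    so that tabs and other invisible characters become visible escapes."""
    lines = text.splitlines()
    return repr(lines[lineno - 1])


# Hypothetical sample: the first seven lines of the posted file, with a
# tab deliberately put in front of "- INPUT:" on line 7 for illustration.
sample = (
    "%YAML 1.1\n"
    "---\n"
    "VERSION: 1.0.0.1\n"
    "DATABASE: test_db1\n"
    "USER: gpadmin\n"
    "DEFINE:\n"
    "\t- INPUT:\n"
)

print(show_line(sample, 7))  # prints '\t- INPUT:'
```

If the printed line starts with \t rather than spaces, that would be consistent with the tab explanation given in the replies.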
Suvankar Roy wrote:
>
> Hi all,
>
> Has anybody worked on Greenplum MapReduce programming ?
It's a commercial product, you need to contact greenplum.
--
Postgresql & php tutorials
http://www.designmagick.com/
Suvankar Roy wrote:
> Hi all,
>
> Has anybody worked on Greenplum MapReduce programming ?
>
> I am facing a problem while trying to execute the below Greenplum
> Mapreduce program written in YAML (in blue).
The other poster suggested contacting Greenplum and I can only agree.
> The error is thrown in the 7th line as:
> Error: YAML syntax error - found character that cannot start any token
> while scanning for the next token, at line 7 (in red)
There is no red, particularly if viewing messages as plain text (which
most people do on mailing lists). Consider indicating a line some other
way next time (commonly below the line you put something like "this is
line 7 ^^^^^")
The most common problem I get with YAML files though is when a tab is
accidentally inserted instead of spaces at the start of a line.
--
Richard Huxton
Archonet Ltd
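The tab check suggested above can be automated. Here is a minimal sketch that lists every line whose leading whitespace contains a tab. The function name find_leading_tabs and the sample text are invented for illustration and are not part of Greenplum or any YAML library.

```python
def find_leading_tabs(text):
    """Return the 1-based numbers of lines whose indentation contains a tab."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # The indentation is whatever spaces/tabs precede the first other character.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            hits.append(lineno)
    return hits


# Hypothetical sample: line 2 is indented with a tab, line 4 mixes spaces and a tab.
sample = "DEFINE:\n\t- INPUT:\n      NAME: doc\n  \tTABLE: documents\n"

print(find_leading_tabs(sample))  # prints [2, 4]
```

To check a real file, read its contents into a string first and pass that to the function. Tabs after the first non-blank character are deliberately ignored here, since the advice above is about tabs at the start of a line.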
Suvankar Roy wrote:
> Hi Richard,
>
> I sincerely regret the inconvenience caused.....
No big inconvenience, but the lists can be very busy sometimes and the
easier you make it for people to answer your questions the better the
answers you will get.
> %YAML 1.1
> ---
> VERSION: 1.0.0.1
> DATABASE: test_db1
> USER: gpadmin
> DEFINE:
> - INPUT: #****** This the line which is causing the error ******#
> NAME: doc
> TABLE: documents
If it looks fine, always check for tabs. Oh, and you could have cut out
all the rest of the file, really.
> I have learnt that unnecessary TABs can the cause of this, so trying to
> overcome that, hopefully the problem will subside then....
I'm always getting this. It's easy to accidentally introduce a tab
character when reformatting YAML. It might be worth checking if your
text editor has an option to always replace tabs with spaces.
--
Richard Huxton
Archonet Ltd
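If the editor has no such option, the same replacement can be done with a small script. The sketch below rewrites only the indentation of each line using Python's built-in str.expandtabs and leaves tabs elsewhere alone. The function name expand_leading_tabs and the default width of 2 are assumptions for illustration; the right width depends on how the file was meant to be indented, so the result should still be checked by eye.

```python
def expand_leading_tabs(text, width=2):
    """Replace tabs in each line's indentation with spaces.

    `width` is the assumed tab stop; everything after the indentation
    is left untouched.
    """
    out = []
    for line in text.splitlines(keepends=True):
        body = line.lstrip(" \t")
        indent = line[: len(line) - len(body)]
        out.append(indent.expandtabs(width) + body)
    return "".join(out)


# Hypothetical sample with tab-indented lines.
sample = "DEFINE:\n\t- INPUT:\n\t\tNAME: doc\n"
fixed = expand_leading_tabs(sample)

print("\t" in fixed)  # prints False
```

Note that this only removes the tabs. It cannot tell whether the resulting number of spaces matches the nesting the YAML file needs.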
Hi Richard,
I sincerely regret the inconvenience caused.....
%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT: #****** This is the line which is causing the error ******#
      NAME: doc
      TABLE: documents
I have learnt that unnecessary TABs can be the cause of this, so I am trying to remove them; hopefully the problem will go away then.
Regards,
Suvankar Roy
Richard Huxton <dev@archonet.com> 08/03/2009 02:55 PM
Suvankar Roy wrote:
> Hi all,
>
> Has anybody worked on Greenplum MapReduce programming ?
>
> I am facing a problem while trying to execute the below Greenplum
> Mapreduce program written in YAML (in blue).
The other poster suggested contacting Greenplum and I can only agree.
> The error is thrown in the 7th line as:
> Error: YAML syntax error - found character that cannot start any token
> while scanning for the next token, at line 7 (in red)
There is no red, particularly if viewing messages as plain text (which
most people do on mailing lists). Consider indicating a line some other
way next time (commonly below the line you put something like "this is
line 7 ^^^^^")
The most common problem I get with YAML files though is when a tab is
accidentally inserted instead of spaces at the start of a line.
--
Richard Huxton
Archonet Ltd
Hi Robert,
Thank you very much for your valuable input.
This spaces-and-tabs problem is killing me; it is pretty cumbersome, to say the least.
Regards,
Suvankar Roy
"Robert Mah" <rmah@pobox.com> Sent by: Robert Mah <robert.mah@gmail.com> 08/02/2009 10:52 PM
Suvankar:
Check your file for spaces vs tabs (one of them is bad and yes, it matters).
And as a personal aside, this is yet another reason I hate YAML.
Cheers,
Rob
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Suvankar Roy
Sent: Thursday, July 30, 2009 8:25 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Greenplum MapReduce