spark-corenlp

Spark DataFrame wrapper methods for the Stanford CoreNLP Simple API annotators. These methods were tested with Spark 2.3.1 and Stanford CoreNLP 3.9.1.

To use the methods, add the static import: import static com.ziad.spark.nlp.functions.*;

tokenize: Splits the text into roughly “words”, using rules or methods suitable for the language being processed.
ssplit: Splits a sequence of tokens into sentences.
lemmas: Generates the word lemmas for all tokens in the corpus.
ner: Generates the named entity tags of the text.
sentiment: Measures the sentiment of an input sentence on a scale of 0 (strong negative) to 4 (strong positive).

Note: You need to add the CoreNLP models jar for your language to your classpath. Models are available for Arabic, Chinese, English, English (KBP), French, German, and Spanish.

Example of usage:

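import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static com.ziad.spark.nlp.functions.*;

// Collection of strings (text) to parse; "session" is an existing SparkSession
List<String> data = Arrays.asList("first text", "second text");
/*
1. Create a Dataset from the String collection
2. Apply the UserDefinedFunction named "sentiment" to score each input sentence
3. Print a table with the results
 */
Dataset<Row> df = session.createDataset(data, Encoders.STRING()).toDF();
df.select(col("value"), sentiment.apply(col("value")).as("sentiment"))
  .show();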

Output:

+-----------+----------+
|      value| sentiment|
+-----------+----------+
| first text|         2|
|second text|         2|
+-----------+----------+
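
The sentiment column holds the raw score on the 0 (strong negative) to 4 (strong positive) scale, so both rows above come out at the midpoint. If a readable sentiment type is more useful than the number, a small helper can translate it. The sketch below is not part of spark-corenlp: the class name SentimentLabels, the method label, and the label strings are hypothetical. The strings are modelled on the five class names CoreNLP's sentiment model uses, which is an assumption about what you want to display.

```java
// Hypothetical helper, not part of spark-corenlp: maps the 0-4 sentiment
// score produced by the "sentiment" function to a readable label.
final class SentimentLabels {
    // Index = score. Label wording is an assumption for illustration.
    private static final String[] LABELS = {
        "Very negative", "Negative", "Neutral", "Positive", "Very positive"
    };

    private SentimentLabels() {}

    // Returns the label for a score, rejecting values outside 0-4.
    static String label(int score) {
        if (score < 0 || score >= LABELS.length) {
            throw new IllegalArgumentException("sentiment score out of range 0-4: " + score);
        }
        return LABELS[score];
    }

    public static void main(String[] args) {
        // Print the label for every score on the scale.
        for (int score = 0; score < LABELS.length; score++) {
            System.out.println(score + " -> " + label(score));
        }
    }
}
```

In a Spark job, one way to use this would be to wrap label in your own UDF and apply it to the sentiment column. That wiring is left out here because it depends on how your session and UDF registration are set up.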
