Twitter NLP Example: How to Scale Part-of-Speech Tagging with MPP (Part 2)

October 24, 2014 Srivatsan Ramanujam

In our previous post, we introduced part-of-speech (POS) tagging, established why it is important, and gave a sense of the challenges involved in making it work with conversational texts such as tweets. We also introduced user-defined functions (UDFs) on Pivotal’s MPP platform and explained how they are well suited for data-parallel problems such as POS tagging. In this post, we’ll dive deeper into PL/Java UDFs and write a wrapper around ArkTweetNLP to perform POS tagging at scale with Pivotal’s MPP data platform.

The figure below shows our workflow. We start by describing our input data, which is stored in a distributed database table. Then we describe the SQL UDFs and the Java components of the code, which use ArkTweetNLP for tokenization and tagging at scale. PL/Java is the glue that binds the SQL components to the Java components.

[Figure: GP_Ark_Tweet workflow diagram]

Setting up the Data

We will use a table of tweets like the one shown below for this demonstration. You can obtain Twitter data from a provider like GNIP or collect it yourself through the Twitter Streaming API. Tweet datasets are also available elsewhere on the web, such as TREC Tweets-2011.

We stored the dataset in a table called ‘training_data’ in the schema ‘posdemo’. The column of primary interest is the “tweet_body” column, and we will apply our tool to retrieve POS tags for each token in the tweet.

Table "posdemo.training_data"
   Column   |            Type             | Modifiers | Storage  | Description
------------+-----------------------------+-----------+----------+-------------
 id         | bigint                      |           | plain    |
 ts         | timestamp without time zone |           | plain    |
 poster     | text                        |           | extended |
 tweet_body | text                        |           | extended |
Has OIDs: no
Distributed by: (id)

Here are some sample rows from the table:

vatsandb=# select id, tweet_body from posdemo.training_data limit 5;
id         | tweet_body
-----------+------------------------------------------------------------------
1467820906 | @localtweeps Wow, tons of replies from you, may have to unfollow so I can see my friends’ tweets, you’re scrolling the feed a lot.
1467862806 | @MySteezRadio I’m goin’ to follow u, since u didn’t LOL GO ANGELS!
1467891880 | Argh! I was suuuper sleepy an hour ago, now I’m wide awake. Hope I don’t stay up all night. :-/
1467896211 | michigan state you make me sad
1467911846 | @bananaface IM SORRY I GOT YOU SICK. lol. going to bed too. NIGHT!
(5 rows)
Time: 329.507 ms

The UDF and Java Components

We first define the data structure to hold the results of POS tagging of tweets. Given a tweet as input, our UDF will invoke a method in our Java wrapper, which in turn will invoke ArkTweetNLP. Our UDF will return a set of rows where each row will consist of a token, the token’s ordinal position in the tweet, and the corresponding POS tag of the token.

We declare a user-defined composite type (UDCT) for the result returned by our POS tagger. The UDCT is in essence a record whose first field is an integer (index), followed by a text field (token) and another text field (POS tag).

-- Define a type to hold [token_index, token, tag] items
DROP TYPE IF EXISTS token_tag;
CREATE TYPE token_tag
AS
(
    indx int,
    token text,
    tag text
);

Next, we define the PL/Java UDF that invokes our Java wrapper method postagger.nlp.POSTagger.tagTweet, which is defined later.

DROP FUNCTION IF EXISTS posdemo.tag_pos(varchar);
CREATE FUNCTION posdemo.tag_pos(varchar)
RETURNS SETOF token_tag
AS
    'postagger.nlp.POSTagger.tagTweet'
IMMUTABLE LANGUAGE PLJAVAU;

This UDF accepts a varchar (string) as input and returns rows of composite type token_tag that we defined earlier as output. The keyword “SETOF” indicates that there are multiple rows of output for each input string to the UDF. The body of the UDF is of the form “<PACKAGE NAME>.<CLASS NAME>.<METHOD NAME>” (in this case postagger.nlp.POSTagger.tagTweet), and this refers to the Java wrapper written to invoke ArkTweetNLP. The Java component is exported as a JAR file and distributed across all segments of the MPP database for parallelization. PL/Java will invoke the Java functions we write at runtime when a query is executed.
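
To make the naming convention concrete, here is a minimal sketch of how a body string of that form can be split into a class name and a method name and resolved with Java reflection. It only illustrates the convention and is not PL/Java's actual lookup code. The class UdfBodyResolver and its methods are hypothetical, and java.lang.Integer.parseInt stands in for a wrapper method so that the example runs without a database.

```java
import java.lang.reflect.Method;

/** Hypothetical helper: resolves "<package>.<class>.<method>" to a method call. */
public class UdfBodyResolver {

    /** Returns {className, methodName}, splitting the body at its last dot. */
    public static String[] split(String body) {
        int lastDot = body.lastIndexOf('.');
        if (lastDot <= 0 || lastDot == body.length() - 1) {
            throw new IllegalArgumentException("Expected <class>.<method>: " + body);
        }
        return new String[] {body.substring(0, lastDot), body.substring(lastDot + 1)};
    }

    /** Looks up a public method taking one String and calls it as a static method. */
    public static Object call(String body, String argument) {
        String[] parts = split(body);
        try {
            Method method = Class.forName(parts[0]).getMethod(parts[1], String.class);
            return method.invoke(null, argument);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot invoke " + body, e);
        }
    }

    public static void main(String[] args) {
        // The UDF body from this post splits into a class name and a method name.
        String[] parts = split("postagger.nlp.POSTagger.tagTweet");
        System.out.println(parts[0] + " / " + parts[1]);
        // Integer.parseInt(String) stands in for a real wrapper method.
        System.out.println(call("java.lang.Integer.parseInt", "42"));
    }
}
```

The point of the sketch is that the string in the UDF body must name a static method whose parameter types match the SQL signature, which is why the JAR containing that class has to be visible on every segment.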

Here are the Java components. The triplet {index, token, POS tag} that we defined earlier in SQL through a user-defined composite type (UDCT) must also be defined in Java, and the Java class TaggedResult defines this triplet.

package postagger.util;

/**
 * A class to hold the {index, token, tag} triplet.
 * @author Srivatsan Ramanujam<vatsan.cs@utexas.edu>
 */
public class TaggedResult {
    private int index;
    private String token;
    private String tag;

    /* Constructor to initialize the triplet */
    public TaggedResult(int index, String tok, String tg) {
        this.index = index;
        this.token = tok;
        this.tag = tg;
    }

    /**
     * Accessor method for index of the token
     * @return the ordinal position of the token in the tweet
     */
    public int getIndex() {
        return index;
    }

    /**
     * Accessor method for the token
     * @return the token text
     */
    public String getToken() {
        return token;
    }

    /**
     * Accessor method for part-of-speech tag of the token
     * @return the part-of-speech tag
     */
    public String getTag() {
        return tag;
    }
}

In PL/Java, functions returning multiple rows of a base type such as int, text, or float can simply return a java.util.Iterator. Functions returning multiple rows of a composite type have to return an object that implements the ResultSetProvider interface. We implement this interface and iterate over the set of rows, writing the individual columns of each output record. This is implemented in a second Java class, TaggedResultProvider, as follows:

package postagger.util;

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.postgresql.pljava.ResultSetProvider;

/**
 * Class implementing PL/Java's ResultSetProvider interface, to handle the return of rows of
 * composite types
 * @author Srivatsan Ramanujam<vatsan.cs@utexas.edu>
 */
public class TaggedResultProvider implements ResultSetProvider {
    private List<TaggedResult> taggedItems;
    private final Iterator<TaggedResult> itemIter;

    /* Default constructor */
    public TaggedResultProvider() {
        taggedItems = new ArrayList<TaggedResult>();
        itemIter = taggedItems.iterator();
    }

    /* Initialize with an existing List */
    public TaggedResultProvider(List<TaggedResult> taggedResult) {
        taggedItems = taggedResult;
        itemIter = taggedItems.iterator();
    }

    /**
     * Any procedure returning multiple rows will invoke this function once for each row.
     * Return false if all data has been consumed and true if a row was supplied.
     * @param receiver the object receiving values for the current row.
     * @param currRow The currRow value is 0 for the first row and gets incremented
     * for each subsequent row.
     */
    @Override
    public boolean assignRowValues(ResultSet receiver, int currRow)
            throws SQLException {
        if (!itemIter.hasNext()) {
            return false;
        }
        TaggedResult item = itemIter.next();
        // The strings "indx", "token" and "tag" are the column names of the
        // composite type defined in SQL
        receiver.updateInt("indx", item.getIndex());
        receiver.updateString("token", item.getToken());
        receiver.updateString("tag", item.getTag());
        return true;
    }

    /* Method invoked after the last row has been returned */
    @Override
    public void close() throws SQLException {
        // Do nothing
    }
}
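
The calling contract of assignRowValues can be easier to follow when the caller is visible too. The sketch below simulates that caller in plain Java so it runs without PL/Java or a database. Everything in it is an assumption made for illustration: RowSource is a local stand-in with the same method shape (not the PL/Java interface itself), the receiver is a java.lang.reflect.Proxy that merely records the update calls, and the real PL/Java runtime is certainly implemented differently.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowProtocolDemo {

    /** Local stand-in with the same method shape as assignRowValues (hypothetical). */
    public interface RowSource {
        boolean assignRowValues(ResultSet receiver, int currRow) throws SQLException;
    }

    /** A source that emits one (indx, token) row per token. */
    public static RowSource tokenSource(List<String> tokens) {
        Iterator<String> iter = tokens.iterator();
        return (receiver, currRow) -> {
            if (!iter.hasNext()) {
                return false; // all rows consumed
            }
            receiver.updateInt("indx", currRow);
            receiver.updateString("token", iter.next());
            return true;
        };
    }

    /** Plays the caller's role: requests rows until the source returns false. */
    public static List<Map<String, Object>> drain(RowSource source) {
        List<Map<String, Object>> rows = new ArrayList<>();
        try {
            for (int currRow = 0; ; currRow++) {
                Map<String, Object> row = new LinkedHashMap<>();
                // Record updateXxx(columnName, value) calls instead of writing to a database.
                InvocationHandler recorder = (proxy, method, args) -> {
                    if (method.getName().startsWith("update") && args != null && args.length == 2) {
                        row.put(String.valueOf(args[0]), args[1]);
                    }
                    return null;
                };
                ResultSet receiver = (ResultSet) Proxy.newProxyInstance(
                        RowProtocolDemo.class.getClassLoader(),
                        new Class<?>[] {ResultSet.class}, recorder);
                if (!source.assignRowValues(receiver, currRow)) {
                    return rows;
                }
                rows.add(row);
            }
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(drain(tokenSource(Arrays.asList("is", "upset", "that"))));
    }
}
```

Each call fills exactly one output row, and the first false return ends the set. This is why a provider only needs an iterator over its results and no knowledge of how many rows the caller expects.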

Next, we implement the Java wrapper for ArkTweetNLP’s tagging method. This function returns a TaggedResultProvider object of the class that we defined above:

package postagger.nlp;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.postgresql.pljava.ResultSetProvider;

import postagger.util.TaggedResult;
import postagger.util.TaggedResultProvider;

// CMU Ark Tweet NLP package (GPL v2 license)
import cmu.arktweetnlp.*;
import cmu.arktweetnlp.Tagger.TaggedToken;

/**
 * Demonstrate part-of-speech tagging for tweets using the CMU Ark-Tweet-NLP package.
 * Refer to : http://www.ark.cs.cmu.edu/TweetNLP/ for more information about the toolkit and the papers
 * @author Srivatsan Ramanujam <vatsan.cs@utexas.edu>
 */
public class POSTagger {
    // The model file is already packaged with the arktweetnlp.jar
    public static final String modelFileName = "/cmu/arktweetnlp/model.20120919";

    // The Tagger class is defined in ArkTweetNLP.
    // This class loads a model file and can
    // then be used to invoke part-of-speech tagging on input tweets
    public static final Tagger tagger = new Tagger();

    static {
        // Load the pre-trained model once, when the class is first loaded
        try {
            tagger.loadModel(modelFileName);
        } catch (IOException e) {
            // Log the error
            e.printStackTrace();
        }
    }

    /**
     * Return a tuple containing the tokenized tweet and the corresponding
     * part-of-speech tags
     * @param tweet The body of a tweet
     * @return a provider of [token index, token, tag] rows, or null for a null tweet
     */
    public static ResultSetProvider tagTweet(String tweet) {
        if (tweet == null) {
            return null;
        }
        // Tokenize and tag the input tweet
        List<TaggedToken> taggedTokens = tagger.tokenizeAndTag(tweet);
        List<TaggedResult> result = new ArrayList<TaggedResult>();
        int idx = 0;
        // Return a set of [token index, token, tag] triplets, encapsulated by the
        // TaggedResult class defined earlier
        for (TaggedToken tt : taggedTokens) {
            result.add(new TaggedResult(idx, tt.token, tt.tag));
            idx++;
        }
        return new TaggedResultProvider(result);
    }
}

The model file /cmu/arktweetnlp/model.20120919 is a pre-trained POS tagger model provided by ArkTweetNLP. We call the Java API of ArkTweetNLP to load its model file and invoke the POS tagger on an input tweet. The result is returned as a TaggedResultProvider object, which PL/Java iterates over to produce the output rows when the UDF is invoked from SQL.

Again, this Java package is exported as a jar file and distributed across all segments of our MPP database. The runtime search path of PL/Java should be able to find and load this jar file when invoked through a PL/Java UDF. Detailed installation instructions are available on this GitHub page—gp-ark-tweet-nlp.

Usage in Practice—Scaling the Process

Now that we’ve got all the plumbing described in our workflow diagram, we can perform POS tagging of the tweets like so:

select id,
       (t).indx,
       (t).token,
       (t).tag
from
(
    select id,
           posdemo.tag_pos(tweet_body) as t
    from posdemo.training_data
) q;

In the inner query, we invoke the POS tagger and the results are rows of the composite type token_tag. We extract the token, its index, and the corresponding part-of-speech tag in the outer query. The result of the above query looks like this:

id         | indx | token    | tag
-----------+------+----------+-----
1467810672 |    0 | is       | V
1467810672 |    1 | upset    | A
1467810672 |    2 | that     | P
1467810672 |    3 | he       | O
1467810672 |    4 | can’t    | V
1467810672 |    5 | update   | V
1467810672 |    6 | his      | D
1467810672 |    7 | Facebook | ^
1467810672 |    8 | by       | P
1467810672 |    9 | texting  | V

By building on state-of-the-art open source software and extending it to run on the MPP architecture of our stack, we scale the tagger by several orders of magnitude with little additional effort. Our experimental system was a full-rack Data Computing Appliance (DCA) with 192 segment processes, and we achieved near-linear scalability by tagging tweets in parallel across every segment.
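
Tagging is data parallel because each tweet is tagged independently of every other tweet, so rows can be spread across segments and processed without coordination. The sketch below illustrates the idea in plain Java under several assumptions. segmentFor uses a simple hash of the id (the table is distributed by id, but this is not Greenplum's actual hash function), countTokens is a whitespace tokenizer standing in for the tagger, and a parallel stream stands in for the segment processes. The class name and all method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SegmentParallelSketch {

    /** Illustrative placement of a row by its distribution key (not Greenplum's hash). */
    public static int segmentFor(long id, int numSegments) {
        return Math.floorMod(Long.hashCode(id), numSegments);
    }

    /** Stand-in for the tagger: counts whitespace-separated tokens in one tweet. */
    public static int countTokens(String tweet) {
        String trimmed = tweet.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Partitions tweets by segment, processes the partitions in parallel, sums the counts. */
    public static int totalTokens(Map<Long, String> tweets, int numSegments) {
        Map<Integer, List<String>> bySegment = tweets.entrySet().stream()
                .collect(Collectors.groupingBy(
                        e -> segmentFor(e.getKey(), numSegments),
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        return bySegment.values().parallelStream()
                .mapToInt(rows -> rows.stream().mapToInt(SegmentParallelSketch::countTokens).sum())
                .sum();
    }

    /** Two sample rows from the table shown earlier (apostrophes written in ASCII). */
    public static Map<Long, String> sampleTweets() {
        Map<Long, String> tweets = new LinkedHashMap<>();
        tweets.put(1467896211L, "michigan state you make me sad");
        tweets.put(1467862806L, "@MySteezRadio I'm goin' to follow u, since u didn't LOL GO ANGELS!");
        return tweets;
    }

    public static void main(String[] args) {
        Map<Long, String> tweets = sampleTweets();
        // The total does not depend on how many segments share the work.
        System.out.println(totalTokens(tweets, 1));
        System.out.println(totalTokens(tweets, 4));
    }
}
```

Because no partition needs data from another, adding segments changes only how the work is divided and not the result. That independence is the property that makes near-linear scaling plausible for this workload.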

The full power of data science can only be realized when you have a platform to support the scale. Pivotal’s focus on creating a Big Data Suite that is flexible in supporting the use of open source tools at scale is key in helping us solve challenging NLP problems for our customers in short timeframes.

Learn More: Installation/Usage Instructions

Please visit my GitHub page gp-ark-tweet-nlp to download this toolkit and for instructions on how to set this up with Pivotal’s Greenplum/Pivotal HAWQ engines or the open source PostgreSQL database. You’ll also be able to browse the full source. If you’d like to contribute a feature, please send me a pull request!
