Species Similarity Text Analysis

Text analysis is very interesting to me, and I don’t get around to doing it as often as I would like. Word relationships are especially interesting because the connections in word use can paint a surprising picture. In the following project I used discussion posts on the Snapshot Wisconsin Zooniverse trail camera photo classification website to determine the relationships between common species. This worked very well for species that look similar or are otherwise related, such as predator and prey.


Below is the knit Text Analysis R Markdown HTML:

Install necessary packages

The tidyverse is a very powerful set of packages that improves handling of data. The dplyr package is particularly useful, as it provides the pipe operator (%>%) and query functions such as filter() and select(). Additionally, the view() command is quite nice when dealing with wide dataframes.
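The knitted output does not show the package setup itself, so here is a minimal sketch of a pre-flight check. The package list is my inference from the functions called later in this post (for example, SnowballC is assumed because tm relies on it for stemming); it is not a list given in the original notebook. The check only reports what is missing and installs nothing.

```r
# Packages this walkthrough appears to rely on (inferred from the calls used
# later in the post, not taken from the original notebook).
needed <- c("tidyverse", "tm", "SnowballC", "proxy", "ape", "RColorBrewer")

# Report which ones are not installed yet, without installing anything.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```

Anything reported as missing can then be installed with install.packages() before loading it with library().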


Load the dataset.

The initial dataset was collected from the Zooniverse “Lab” section, which is only accessible to administrators of a Zooniverse project. The exported file was in JSON format. As I am much more familiar with wrangling JSON in Python, I wrote a quick Python script to convert the comment JSON file to a CSV.

Note: The Python script below was initially run on Google’s Jupyter Notebook implementation called Colaboratory, or Google Colab for short.

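import csv
import json

# Load the exported comments; the file is assumed to hold a JSON array of objects.
filename = 'sw-comments.json'
with open(filename) as f:
  data = json.load(f)

# Write one CSV row per comment, taking the header from the first comment's keys.
# Assumes every comment has the same keys in the same order.
with open('sw-comments.csv', 'w', newline='') as comment_file:
  csvwriter = csv.writer(comment_file)
  count = 0
  for comment in data:
    if count == 0:
      header = comment.keys()
      csvwriter.writerow(header)
      count += 1
    csvwriter.writerow(comment.values())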

The output CSV file from the above Python script is loaded in R below,

df <- read.csv("sw-comments.csv")

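and summary(df) then gives an overview of the loaded data. The factor-style counts shown here rely on read.csv() converting strings to factors, which was the default before R 4.0.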

##     board_id                              board_title   
##  Min.   :391.0   Notes                          :64239  
##  1st Qu.:391.0   Chat                           :  725  
##  Median :391.0   Moderator Discussions          :  701  
##  Mean   :391.5   FAQ and Help                   :  524  
##  3rd Qu.:391.0   Science                        :  114  
##  Max.   :881.0   Trail Camera Host Message Board:   55  
##                  (Other)                        :   25  
##                                                                                                                                                              board_description
##  Comments about specific photos                                                                                                                                       :64239  
##  General discussion about wildlife or anything else you want to talk about!                                                                                           :  725  
##  A place for moderators to communicate                                                                                                                                :  701  
##  A place to ask questions about the classification interface, report bugs, and get help using the Snapshot Wisconsin site                                             :  524  
##  A place to talk about the science behind Snapshot Wisconsin and related research                                                                                     :  114  
##  This board is a place for current Snapshot Wisconsin trail camera hosts to communicate with each other, as well as with Zooniverse volunteers and the project team.  :   55  
##  (Other)                                                                                                                                                              :   25  
##  discussion_id    
##  Min.   :  51589  
##  1st Qu.: 175689  
##  Median : 500920  
##  Mean   : 530891  
##  3rd Qu.: 834965  
##  Max.   :1193161  
##                                                  discussion_title
##  Birds                                                   :   46  
##  Subject 3862238                                         :   38  
##  Help us test a new platform for future researcher chats?:   35  
##  Season 6 launch this week!                              :   34  
##  Introductions                                           :   33  
##  Snapshot Wisconsin #supersnap of the year!              :   32  
##  (Other)                                                 :66165  
##    comment_id                             comment_body  
##  Min.   : 102545   Thank you!                   :  514  
##  1st Qu.: 307716   #bobcat                      :  209  
##  Median : 837470   #coyote                      :  207  
##  Mean   : 888350   This comment has been deleted:  185  
##  3rd Qu.:1381292   #coyote                      :  182  
##  Max.   :1942885   #elk                         :  178  
##                    (Other)                      :64908  
##  comment_focus_id   comment_focus_type comment_user_id  
##  Min.   :  727249          : 2053      Min.   :      6  
##  1st Qu.: 5032539   Subject:64330      1st Qu.:1298797  
##  Median :14466026                      Median :1492595  
##  Mean   :16405494                      Mean   :1318877  
##  3rd Qu.:28682222                      3rd Qu.:1553608  
##  Max.   :36602539                      Max.   :1951069  
##  NA's   :2053                                           
##    comment_user_login                comment_created_at
##  gardenmaeve: 5533    2016-04-27T16:14:32.755Z:    1   
##  momsabina  : 4928    2016-04-27T16:35:00.349Z:    1   
##  smeurett   : 3733    2016-04-27T16:36:42.561Z:    1   
##  Swamp-eye  : 2549    2016-04-27T16:37:25.160Z:    1   
##  Snowdigger : 2434    2016-05-12T20:32:40.207Z:    1   
##  enog       : 2394    2016-05-12T20:33:19.324Z:    1   
##  (Other)    :44812    (Other)                 :66377


We start with 66,383 comments and posts; however, many of them are not related to discussion about actual photos. I had a hunch that board_title and board_description are attributes we may find useful in distinguishing photo-related comments from general discussion posts. Let’s have a look at the first few rows of data and a summary or two.

head(df, 10)

The first thing I notice from the head() call is that a particular board_id and board_title seem to go together with the board_description “Comments about specific photos”, as well as with discussion_titles that start with “Subject” followed by a number. The number is a reference to a photo ID, which may come in handy later. This tells me that we likely want to use board_id as the identifier for filtering, and that we can remove board_title, board_description, and discussion_title from the dataframe we will use for analysis. I prefer board_id over board_title because it is much smaller and easier to work with as far as storage and processing are concerned. Similarly, we drop unnecessary columns to save on storage and processing effort. Let’s take a look at the summaries for board_title and board_id to make sure we have the same counts before we move forward with cleaning.

##                            Chat                       Education 
##                             725                              19 
##                    FAQ and Help           Moderator Discussions 
##                             524                             701 
##                           Notes                         Science 
##                           64239                             114 
## Trail Camera Host Message Board  Welcome to Snapshot Wisconsin! 
##                              55                               6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   391.0   391.0   391.0   391.5   391.0   881.0

Well, that wasn’t what we were looking for from the summary of board_id. In the dataframe, board_id is stored as an integer rather than a factor, so summary() reports quartiles instead of counts. Let’s set board_id as a factor and try again.

df$board_id <- as.factor(df$board_id)
##   391   393   394   395   397   690   697   881 
## 64239   524   114   701   725     6    55    19

Now that looks a lot better. At first it may be a little tough to see what is going on in the summary of board_id, but believe me, it is a good sign. The top row lists the board_id factor levels and the second row gives the count for each level. In this case we are interested in level 391, as that is what was associated with rows containing the board_description “Comments about specific photos”. Now we can see that the board_title “Notes” and the board_id “391” both have the same count of 64,239. This is not proof that they are always associated, but it is good enough to begin cleaning the dataframe. How would you determine if these two columns were always associated? I’ll let you figure that out if you feel up to it. Now that we are fairly certain that board_id 391 is always associated with photo posts, let’s clean up the dataframe using tools from the tidyverse package!

df %>%
  filter(board_id == "391") %>%
  #filter(year == "2018") %>%
  select(comment_body) -> df_notes
##       comment_body  
##  Thank you! :  509  
##  #bobcat    :  209  
##  #coyote    :  207  
##  #coyote    :  182  
##  #elk       :  178  
##  #supersnap :  178  
##  (Other)    :62776

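WOOHOO! Now we have a dataframe (df_notes) which contains the 64,239 comments posted to board 391, the “Comments about specific photos” board. Because the select() call keeps only comment_body, the other columns are dropped from df_notes. That includes comment_focus_type, which is closely tied to board_id: the earlier summary shows 64,330 rows with a comment_focus_type of “Subject”, only slightly more than the 64,239 rows on this board.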
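As for the earlier question of whether board_id and board_title are always associated, one possible check is a cross-tabulation. The sketch below uses a tiny made-up data frame (toy, with invented rows) rather than the real df, so the names and values are illustrative only. If the mapping is one-to-one, every row and every column of the table has exactly one non-zero cell.

```r
# Toy stand-in for df; the rows are invented for illustration.
toy <- data.frame(
  board_id    = c(391, 391, 393, 397, 391),
  board_title = c("Notes", "Notes", "FAQ and Help", "Chat", "Notes")
)

# Cross-tabulate the two columns: rows are board_id, columns are board_title.
xtab <- table(toy$board_id, toy$board_title)
print(xtab)

# A one-to-one mapping leaves exactly one non-zero cell per row and per column.
one_to_one <- all(rowSums(xtab > 0) == 1) && all(colSums(xtab > 0) == 1)
print(one_to_one)
```

On the real data the same idea would be table(df$board_id, df$board_title), assuming both columns are still present in df.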

More cleaning

You may be thinking to yourself, “We’re done!” On the contrary, we are finally ready to begin with some more cleaning.

Our first task is to prepare each and every comment_body. Let’s create a function that we will apply to each row to clean up the comment_body. This function takes one row of the dataframe and outputs a vector of cleaned, lowercase words for that row. When we call this function, we will use apply() so that each row is handled individually and the comments stay separate.

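prep_text <- function(x) {
  # Lowercase the text, then split on every non-word character.
  x %>%
    tolower %>%
    strsplit("\\W") %>%
    unlist -> temp_text
  # Drop the empty strings left behind by consecutive separators.
  out_text <- temp_text[which(temp_text != "")]
  return(out_text)
}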
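To see what this kind of cleaning does to a single comment, here is a small pipe-free sketch of the same idea in base R. The helper name prep_text_base and the sample comment are made up for illustration; it is meant to mirror the steps described above, not to replace the function used in the analysis.

```r
# Base R equivalent of the cleaning steps: lowercase, split on non-word
# characters, and drop the empty strings produced by adjacent separators.
prep_text_base <- function(x) {
  words <- unlist(strsplit(tolower(x), "\\W"))
  words[words != ""]
}

# Invented sample comment.
sample_words <- prep_text_base("Nice #bobcat, isn't it?")
print(sample_words)
```

Note that splitting on "\\W" also breaks contractions apart ("isn't" becomes "isn" and "t"), which is one reason short, meaningless tokens show up in the word counts.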

Now that we have the function in place, let’s use it and add a new column to the dataframe. We will also do 2 more tasks in the following lines of code. 1) Create a new variable that will contain a list of all of the comment words for aggregate analysis. 2) Use the combined vector of words to create a frequency table of these words. Let’s check it out.

df_notes$clean_text <- apply(df_notes, 1, prep_text)
words_combined.v <- unlist(df_notes$clean_text)
words_freq.t <- table(words_combined.v)

Now, if we wanted to see a sorted list of the top 10 words and their frequencies, we could do something like this…

sort(words_freq.t, decreasing = TRUE)[1:10]
## words_combined.v
##   the     a     i    to    is    of   and    it    in  this 
## 49845 40041 28328 21690 20199 19424 19122 18830 17446 15721

We can see here that “the” is the most common word across all of the comments, at 49,845 occurrences. Another thing to note is that most of the top ten words are not very meaningful by themselves. This calls for some additional cleaning! For now we are going to turn our focus back to the dataframe. Let’s now use the tm (text mining) package to create the corpus and do some cleaning.

## Attaching package: 'proxy'
## The following objects are masked from 'package:stats':
##     as.dist, dist
## The following object is masked from 'package:base':
##     as.matrix
# Strip anything that starts with "http" up to the next whitespace.
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)

corpus <- Corpus(VectorSource(df_notes$comment_body))
corpus %>%
  # Custom functions are wrapped in content_transformer() so they act on the document text.
  tm_map(content_transformer(function(x) iconv(x, to = "UTF-8", sub = "byte"))) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(content_transformer(removeURL)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, c("subject")) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument, language = "english") %>%
  tm_map(stripWhitespace) %>%
  tm_filter(function(x) x != "") -> corpus_cleaned

To cluster words, we need to create a matrix of all words used. This matrix will be very sparse at first, as many words are used only once or twice across all comments. Sparseness in such a large dataset can cause all sorts of issues. To reduce it, we can use removeSparseTerms(). Even with a very permissive threshold of 99.5% sparsity (sparse = 0.995), which only drops terms that are absent from at least 99.5% of the comments, we see immediate gains. Without removeSparseTerms(), each successive manipulation command could take 15-30 minutes to complete. Let’s set up the matrix.

dtm <- DocumentTermMatrix(corpus_cleaned)
dtm <- removeSparseTerms(dtm, sparse = 0.995)
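To make the sparsity threshold more concrete, the sketch below works through the idea on a tiny hand-made matrix in base R. It reflects my understanding of what removeSparseTerms() does (keep terms whose share of empty documents is below the threshold) rather than tm's actual implementation, and the matrix, term names, and the 0.8 threshold are invented for illustration.

```r
# Toy document-term matrix: 6 documents (rows) x 3 terms (columns).
# Counts are invented for illustration.
toy_dtm <- matrix(
  c(1, 0, 0,
    2, 0, 0,
    0, 1, 0,
    1, 0, 0,
    1, 0, 1,
    3, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), c("deer", "fisher", "marten"))
)

# Sparsity of a term = share of documents in which it never appears.
term_sparsity <- colMeans(toy_dtm == 0)
print(round(term_sparsity, 3))

# Keep only the terms whose sparsity is below the threshold.
threshold <- 0.8
kept_terms <- names(term_sparsity)[term_sparsity < threshold]
print(kept_terms)
```

Here "deer" is missing from only one of six documents and is kept, while "fisher" and "marten" each appear in a single document and are dropped. With 64,239 comments and a threshold of 0.995, a term would need to appear in more than roughly 0.5% of the comments to survive.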

In this particular case, we are interested in evaluating a particular set of animal groups. Keep in mind that because of the stemming process, some names may not have a complete spelling. Let’s create a vector of animal names (with stemming in mind) which we can use.

word_list.animals_all <- c("badger", "bear", "beaver", "bobcat", "cottontail", "cougar", "coyot", "crane", "deer", "elk", "fisher", "fox", "grous", "jackrabbit", "lynx", "marten", "mink", "moos", "muskrat", "opossum", "bird", "pheasant", "pig", "porcupin", "raccoon", "skunk", "snowsho", "turkey", "weasel", "wolf", "wolverin", "woodchuck")

So far we have been cleaning comments while keeping each one intact. In the following blocks of code we will use our previous cleaning techniques to find words associated with our target animal groups, and then repeat some of the above processes to cluster on a much smaller, more consolidated set of text. First we will find associations in the matrix for the list of animals we created above. We will also create a function which pulls out the names of the associated words. Let’s complete these two tasks and then run the associations through the function we create.

#findAssocs(dtm, word_list.animals_all, corlimit = 0.05)
animal_ass <- findAssocs(dtm, word_list.animals_all, corlimit = 0.01)

animal_word_parse <- function(x) {
  temp_words <- vector()
  # For each animal, collapse the names of its associated words into one string.
  for (i in seq_along(x)) {
    animal_name <- names(x[i])
    ass_words <- names(x[[i]])
    if (!is.null(ass_words)) {
      temp_words[animal_name] <- paste(ass_words, collapse = " ")
    }
  }
  return(temp_words)
}

parsed_words <- animal_word_parse(animal_ass)
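Before moving on, a brief aside on what the association step measures. As far as I understand it, findAssocs() scores each pair of terms by the correlation of their columns in the document-term matrix and keeps those at or above corlimit. The base R sketch below imitates that on a tiny invented matrix; the data, the 0.5 limit, and the variable names are assumptions for illustration and not the tm implementation.

```r
# Toy document-term matrix: 5 documents x 3 stemmed terms (invented counts).
toy_dtm <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 0, 1,
    1, 0, 0,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("wolf", "coyot", "turkey"))
)

# Correlate the target term's column with every other term's column.
target <- "wolf"
others <- setdiff(colnames(toy_dtm), target)
assoc <- cor(toy_dtm[, target], toy_dtm[, others])[1, ]

# Keep terms at or above the correlation limit, strongest first.
corlimit <- 0.5
strong <- sort(assoc[assoc >= corlimit], decreasing = TRUE)
print(round(strong, 2))
```

In this toy case "coyot" tends to appear in the same documents as "wolf" and is kept, while "turkey" never co-occurs with it and is filtered out. A very low limit such as the 0.01 used above keeps almost any term that co-occurs with the animal more often than chance.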

We now have a vector of strings that contain associated words for each animal group of interest. As before, let’s now create a corpus for each string of condensed words.

animal_corpus <- Corpus(VectorSource(parsed_words))

With a corpus, we will create another matrix. This time the matrix will be used to determine the distribution of words so that we can perform hierarchical clustering and then plot a dendrogram of the animal groups. We’re so close, let’s check it out!

animal_dtm <- DocumentTermMatrix(animal_corpus)
animal_dtm.m <- as.matrix(animal_dtm)
animal_dtm.dist <- dist(animal_dtm.m)
animal_cluster <- hclust(animal_dtm.dist)

plot(animal_cluster, main = "Dendrogram of Animal Group by comment word association", ylab = "Height", xlab = "Animal Group")

The above dendrogram shows how each animal group is associated with the others through the comments collected. It is quite amazing that analyzing only the comments and their word associations is enough to group animals this closely. Let’s use the ape package to make a better visualization.

colors = brewer.pal(6, "Dark2")
clus6 = cutree(animal_cluster, 6)
plot(main = "Dendrogram of Animal Group by comment word association",
     as.phylo(animal_cluster), type = "cladogram", tip.color = colors[clus6],
     label.offset = 0.2, cex = 0.95)
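To make the clustering steps above more concrete, here is a small sketch on an invented matrix using only base R. Each row stands for an animal and each column for an associated word (1 means the word is associated with that animal). The words and values are made up, and the sketch uses the default stats::dist() rather than the proxy version loaded in this post.

```r
# Toy matrix: one row per animal, one column per associated word (1 = present).
toy_words <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 0, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("wolf", "coyot", "turkey", "grous"),
                  c("pack", "howl", "field", "feather"))
)

toy_dist <- dist(toy_words)               # Euclidean distance by default
toy_cluster <- hclust(toy_dist)           # complete linkage by default
toy_groups <- cutree(toy_cluster, k = 2)  # cut the tree into two groups
print(toy_groups)
```

Animals that share most of their associated words end up close together, so "wolf" and "coyot" land in one group and "turkey" and "grous" in the other. This is the same mechanism that drives the grouping in the dendrograms above, just on a much smaller scale.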

This concludes the code portion of my Text Analysis project.

Helpful References.

I found the following sites helpful throughout the process of building this R code:

Dendrogram beautification http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning

Text Mining package help https://uc-r.github.io/word_relationships