Scroll down to see how to interpret a plot created by Scattertext, a great tool for comparing two classes and their corpora.
There are only a few good ways to visualize text data, compared with the many good ways to visualize other forms of data. Text data is messy and requires a vast amount of cleaning before it can be analyzed and visualized. This isn’t all bad; the preprocessing phase offers a tremendous number of decision-making opportunities, and a wide variety of tools and techniques exists to keep you adaptable to the data you are working with. That variety can make an otherwise mundane process quite fun. Once you begin to prepare visualizations of your corpus data, however, the options diminish rapidly. At this stage you are often left with very few intuitive visualizations for analyzing and presenting your data, a problem that compounds when you are presenting to a non-technical audience.
Luckily, as a relatively new student in the world of data science, I stumbled upon an extraordinarily useful tool for visualizing and analyzing text from a corpus. This tool is specifically meant to analyze data with two categories. Scattertext is built on the scatterplot, the age-old Cartesian plotting technique used by many to visualize the relationship between two variables. Scattertext was created by Jason Kessler, who has gained quite a bit of recognition for his work in developing this resource. He demonstrates the capabilities of Scattertext to data-minded audiences through many trade shows and video presentations. He has also created a detailed and easy-to-read whitepaper on the topic, which I found quite helpful in understanding this tool and just how powerful it can be in a data professional’s toolkit.
Scattertext Plot (Example Below)
Below is a Scattertext plot that I created to give you a quick idea of how beautiful this resource is. This plot details text pulled from two separate Subreddits: r/Homebrewing and r/Winemaking. The data was cleaned to remove null values, spam ads, duplicate posts, and entries where ‘[deleted]’ or ‘[removed]’ was the value of the post. The data was also preprocessed by removing symbols, numbers, whitespace, and stop-words. Based on what I’ve read about Scattertext, many of these steps may not be necessary, but I wanted clean data going in, since I was also using it in a modeling process that required thorough preprocessing.
How to Interpret Scattertext Plots
It helps first to understand the plot. Most of you reading this already know that this plot uses a two-dimensional Cartesian coordinate system, where each point is defined by two coordinates, one from the x-axis and one from the y-axis. The coordinates for each point are derived from the term frequencies in each class, where “terms” are words and phrases (Scattertext can also include two-word n-grams if you choose). The y-axis represents term frequencies for the Winemaking class, whereas the x-axis represents term frequencies for the Homebrewing class. For instance, the word ‘sugar’ at the very top right of Figure 1 has a term frequency of 195 in the Winemaking class and 71 in the Homebrewing class. These frequencies are its plot coordinates (71, 195), where 71 is the x-coordinate and 195 is the y-coordinate. That top-right region of the plot, where ‘sugar’ sits, is an area where term frequency is high for both classes.
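The coordinate logic itself is simple enough to sketch in a few lines of plain Python. The toy posts and counts below are made up for illustration, not the actual Subreddit data:

```python
from collections import Counter

# Toy stand-in for the two cleaned corpora (lists of tokenized posts)
wine_posts = [['sugar', 'yeast', 'must'], ['sugar', 'winemaking']]
beer_posts = [['sugar', 'stout', 'malt'], ['stout', 'hops']]

wine_freq = Counter(tok for post in wine_posts for tok in post)
beer_freq = Counter(tok for post in beer_posts for tok in post)

# Each term's plot coordinates: (frequency in Homebrewing, frequency in Winemaking)
coords = {term: (beer_freq[term], wine_freq[term])
          for term in set(wine_freq) | set(beer_freq)}

print(coords['sugar'])       # → (1, 2): x from Homebrewing, y from Winemaking
print(coords['stout'])       # → (2, 0): a beer-only term sits on the x-axis
print(coords['winemaking'])  # → (0, 1): a wine-only term sits on the y-axis
```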
There is more information to ascertain here, so I will show a few more examples. Take a look at the word ‘stout’ in the lower right-hand corner of Figure 1. It sits at the very bottom of the y-axis but far right on the x-axis. With coordinates of (65, 0), this word appears heavily in the Homebrewing class and not at all in the Winemaking class. Conversely, the word ‘winemaking’ has coordinates of (0, 95): it is not represented at all in the Homebrewing class.
In his whitepaper, Kessler states that:
Precision is a “word’s discriminative power regardless of its frequency. A term that appears once in the categorized corpus will have perfect precision. This (and subsequent metrics) presuppose a balanced class distribution. Words close to the x and y-axis in Scattertext have high precision.”
These terms (‘stout’ and ‘winemaking’) have a high degree of precision and are good illustrations of how Kessler lets us surface the distinguishing characteristics of each class.
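As a toy illustration of that definition, the sketch below treats precision as the fraction of a term’s total occurrences that fall in one class, using the frequencies quoted above for ‘sugar’, ‘stout’, and ‘winemaking’ (this is a simplification of Kessler’s formulation, with all other terms omitted):

```python
from collections import Counter

# Term frequencies quoted in the article (all other terms omitted)
wine_counts = Counter({'sugar': 195, 'winemaking': 95})
beer_counts = Counter({'sugar': 71, 'stout': 65})

def precision(term, class_counts, other_counts):
    """Fraction of a term's total occurrences that fall in the given class."""
    total = class_counts[term] + other_counts[term]
    return class_counts[term] / total if total else 0.0

print(precision('stout', beer_counts, wine_counts))            # → 1.0 (perfect precision)
print(precision('winemaking', wine_counts, beer_counts))       # → 1.0 (perfect precision)
print(round(precision('sugar', wine_counts, beer_counts), 2))  # → 0.73 (shared term)
```

A class-exclusive term like ‘stout’ scores a perfect 1.0, while a shared term like ‘sugar’ lands in between.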
Interpreting the Middle
The blue color representing Winemaking and the red color representing Homebrewing provide an easy-to-discern visual that allows the viewer to quickly identify where differences exist in the text. The yellow and orangish colors on the plot are an easy way to identify terms most shared between the two classes. In this case, as you go toward the top right of the chart you will find the most frequent of the most-shared terms, and the bottom left is where you will find the least frequent of the most-shared terms.
Not surprisingly, Kessler also highlights recall in his whitepaper. Almost as if the two are harmoniously married, precision often cannot be highlighted without recall being present. Kessler describes recall as the “frequency a word appears in a particular class, or P(word|class).” He describes the relationship between precision’s variance and recall, noting that variance usually decreases as recall increases. Another note that was a great epiphany for me was his observation that “extremely high recall words tend to be stop words.” This could do wonders for anyone needing to visualize the effectiveness of their stop-word lists. Most importantly for interpreting this plot, high-recall words tend toward the top-right corner of the chart (see Figure 1).
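Recall as P(word|class) is also easy to sketch: divide a word’s count in a class by the total word count of that class. The counts below are hypothetical, apart from ‘sugar’ at 195 from Figure 1:

```python
from collections import Counter

# Hypothetical token counts for the Winemaking class ('sugar' from Figure 1)
wine_counts = Counter({'sugar': 195, 'yeast': 120, 'winemaking': 95})

def recall(term, class_counts):
    """P(word | class): the word's count over all word occurrences in the class."""
    return class_counts[term] / sum(class_counts.values())

print(round(recall('sugar', wine_counts), 3))  # → 0.476 with these toy counts
```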
With just these few examples you can begin to see that a Scattertext plot is ‘plotted’ much like a scatterplot. What makes it so visually appealing is that Kessler replaces the ‘jitter’ feature of the scatterplot with a method that breaks ties alphabetically. This method uses the plot’s whitespace more efficiently and accurately to show the relationships between the two classes.
As you can see, this plot lets you interpret not only word frequency and cross-corpus similarities in word frequency but also metrics such as precision and recall. Combined, these features create a lovely visual that is easy to read.
When you implement Scattertext fully, the plot is interactive. As you hover over dots on the plane, a pop-up appears with statistics: the word frequency per 25,000 words for each class, and a Scaled F-Score. The word-frequency metric is easy to discern; it is what Scattertext uses as the coordinates for each point. You can see that metric represented below as 195:71 per 25k words.
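The per-25,000-words normalization is simple to reproduce: scale the raw count by 25,000 over the class’s total word count. The corpus size below is hypothetical, chosen so the example lands on the 195-per-25k figure:

```python
def per_25k(raw_count, total_words_in_class):
    """Normalize a raw term count to a frequency per 25,000 words."""
    return round(raw_count * 25_000 / total_words_in_class)

# Hypothetical: 'sugar' occurring 312 times in a 40,000-word Winemaking corpus
print(per_25k(312, 40_000))  # → 195 per 25k words
```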
The other metric here is score: 0.06247. This is called a “Scaled F-Score.” There is a bit of math involved in determining this score, but Kessler describes the math and even highlights his reasoning behind the score in this GitHub repo and this Jupyter Notebook (also found in the repo). The most important thing to take away about interpreting this score in relation to what you have plotted is this:
The score is on a scale of -1 to 1. Scores near zero indicate word frequencies that are similar in both classes (these are the yellow and orange dots). Scores near 1 indicate word frequencies dominated by the positive class (in blue), and scores near -1 indicate word frequencies dominated by the negative class (in red). The darker the red or blue, the closer the score is to -1 or 1.
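Kessler’s exact Scaled F-Score computation is laid out in his notebook; as a rough, simplified illustration of the idea only, the sketch below scales a term’s precision and log-frequency to [0, 1] with a normal CDF, takes their harmonic mean for each class, and uses the difference between the two classes to land in (-1, 1). All the numbers and distribution parameters here are made up for illustration:

```python
from statistics import NormalDist

def hmean(a, b):
    """Harmonic mean of two non-negative numbers (0 if either is 0)."""
    return 2 * a * b / (a + b) if a + b else 0.0

def class_score(precision, log_freq, prec_stats, freq_stats):
    """CDF-scale precision and log-frequency to [0, 1], then combine."""
    p = NormalDist(*prec_stats).cdf(precision)
    f = NormalDist(*freq_stats).cdf(log_freq)
    return hmean(p, f)

# Hypothetical distribution parameters (mean, sd) and term values
prec_stats, freq_stats = (0.5, 0.2), (3.0, 1.5)
wine = class_score(0.73, 5.3, prec_stats, freq_stats)  # term leans Winemaking
beer = class_score(0.27, 4.3, prec_stats, freq_stats)

score = wine - beer  # near 0 => shared term; near +1/-1 => class-dominated
print(round(score, 3))
```

This is only a sketch of the intuition; for the real formula, see the repo and notebook linked above.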
The hover feature is not the only piece of interactivity; there is another very satisfying feature to supplement your analysis. The ‘Search The Chart’ feature can be used in two ways. One is the query box provided either below the bottom left or next to the top right of the chart, depending on how your browser is oriented. The other is simply clicking on a colored coordinate dot. It really doesn’t get more intuitive than this with plotting tools.
The output is really what makes this feature useful. When you use the query box or click on a word’s dot, you are given frequency metrics broken down per word (as seen in the pop-up) and per 1,000 docs (a doc in this case being a Reddit post). But wait, there is more: you are also given a nicely formatted list of the docs/posts where the word was mentioned. This is an outstanding tool that data professionals could even deploy for novices or non-data-science colleagues to use in their own research or curious exploration.
Other Highlights of Scattertext
As you have seen, Scattertext is fairly easy to interpret and makes an elegant visual for a variety of data-related presentations. There is much more depth to explore, and many items of interest can be found in Jason Kessler’s GitHub and whitepaper, along with coding notebooks and demonstrations covering the use and implementation of Scattertext. I will also likely publish more posts on Medium regarding this topic, so follow me @jamesopacich or direct message me for more information.