In our daily life, we come across a variety of publications with text content: books, journal articles, tweets and the like. The primary purpose of these publications is to communicate ideas, information and knowledge. However, apart from reading a book and understanding the topic it discusses, we can also subject it to analysis. This means critically examining the book by looking at its separate parts to determine its theme (insight or hidden meaning), the author’s overall message behind the story, her writing style etc.
Different methods are used to conduct textual analysis. It could be a sentiment analysis (to identify whether the overall sentiment of the text is positive, negative or neutral); it could be a study of the style of the text (easy, simple etc.); it could be a study of the change in an author’s writing style over time; it could be identifying different usages of the same word; or it could be creating a list of words grouped by frequency of occurrence within a text or a collection of texts (text corpus).
An advantage of the digital age is that most text content is available in machine-readable form, and with the help of a computer one can easily read millions of books at the click of a button. This opens up a whole lot of possibilities for a researcher. Instead of analysing just one or two publications, she can now extend the analysis to a variety of text materials published over several decades. The availability of several textual analysis tools further facilitates this process. One such tool that has gained immense attention among textual analysts is Google’s Ngram Viewer.
Ngram Viewer is a simple text analysis tool that allows you to look up word frequencies and see larger themes, concepts and trends over time. The application uses the massive corpus of millions of books from Google Books. It scans this corpus and collects billions of words from hundreds of years of books. The application computes the frequency of a word over time by dividing the number of instances of the word in a given year by the total number of words in the corpus in that year. The resulting graph shows how the popularity of the word changes over time.
Perhaps you are wondering why this word frequency application is called an Ngram viewer. Here ‘gram’ means ‘word’ and ‘n’ represents the number of words in a sequence. So, an Ngram is just a sequence of one or more words. If the sequence contains only one word (e.g. ‘awesome’), we call it a unigram or one-gram. If the sequence is a phrase that contains two words (like ‘free software’), it is called a bigram or two-gram; if it contains three words (e.g. ‘a great writer’), we call it a trigram or three-gram, and so on.
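The calculation described above can be sketched in a few lines of Python. This is only an illustration on a made-up toy corpus, not Google’s implementation; the names toy_corpus and yearly_frequency are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

# Hypothetical toy corpus (invented for illustration):
# year -> all the text "published" in that year.
toy_corpus = {
    1990: "free software is free as in freedom",
    1991: "software wants to be free and free software grows",
}

def yearly_frequency(word, corpus):
    """Return {year: occurrences of `word` / total words in that year}."""
    result = {}
    for year, text in corpus.items():
        tokens = text.lower().split()  # naive whitespace tokenisation
        counts = Counter(tokens)
        result[year] = counts[word.lower()] / len(tokens) if tokens else 0.0
    return result

freq = yearly_frequency("free", toy_corpus)
for year in sorted(freq):
    print(year, round(freq[year], 4))
```

Plotting these per-year values against the year gives the kind of curve the viewer draws.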
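To make the idea of unigrams, bigrams and trigrams concrete, here is a minimal Python sketch that slides a window of n words over a sentence. The helper name ngrams and the sample sentence are hypothetical, and words are again split naively on whitespace.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in `text`, in order."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "she is a great writer"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence of five words yields five unigrams, four bigrams and three trigrams; asking for more words than the sentence contains yields an empty list.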
The beauty of Ngram Viewer is that it can be used to measure the changing popularity of phrases (two-grams, three-grams etc.) too. In Google’s case, an Ngram can contain a maximum of five words. The frequency of a phrase (n-word phrase) for a particular year is measured by dividing the number of instances of the specific Ngram in a given year by the total number of instances of all N-word phrases in that year.
When you open the Ngram Viewer in the browser, you will see that Google has already provided an example for you. This example compares the frequencies of the phrases ‘Albert Einstein’, ‘Sherlock Holmes’ and ‘Frankenstein’.
To read the above plot, note that the frequencies run along the left side of the chart and time is spread out along the bottom.
This ability to quantify textual material enables researchers to detect cultural changes and trends in human thought, and to make inferences about historical events. This practice of applying computational methods to a corpus of text to explore cultural trends and study human behaviour is popularly known as Culturomics.
Critics, however, have pointed out certain shortcomings in the Google Books corpus that powers the Ngram Viewer (like OCR errors, bias in the corpus towards scientific literature etc.). Nevertheless, the tool continues to be popular among digital humanities experts and other researchers. Its simplicity and ease of use could be one of the reasons for this widespread acceptance. We will demonstrate this aspect and the utility of the tool with a few examples.
Let us start with the word ‘Existentialism’ and see how the application plots it.
The term ‘Existentialism’ refers to a philosophical movement that was popular during the mid-20th century. The plot tells us that the term peaked during the early 1970s and is on the way out now.
If you scroll down, you will find some date ranges (screenshot above) separated out by the application; when you click on those, it will take you to the books cited in its frequency list.
Let us take another example. We know that the Net has facilitated collaborative work, and the popularity gained by the free software movement is an offshoot of this cultural change. Let us see how the phrases ‘collaborative work’ and ‘free software’ show up in the Ngram Viewer.
The above plot clearly shows that both ‘collaborative work’ and ‘free software’ have been climbing since the early 1990s.
Like Google’s search engine, Ngram Viewer also allows us to modify the input phrase with certain commands/keywords. The part-of-speech tag, which lets us separate the different grammatical forms of a word, is a good example. Depending on the context, a word can assume different meanings. You can be a host or host a party. Here, the word ‘host’ is used both as a noun and as a verb. Now, do you want to know the form in which the word ‘host’ is used most often? If so, distinguish between these forms by appending the tags _VERB and _NOUN to ‘host’, i.e. host_VERB and host_NOUN (screenshot below).
As shown above, the word ‘host’ is rarely used as a verb.
One interesting feature of Ngram Viewer is that we can even do maths on the Ngrams. Let us illustrate this with an example. We know that the word ‘engineering’ can be preceded by different words, as in industrial engineering, genetic engineering, social engineering, software engineering, etc. Now, if we want to know what share of the mentions of the word ‘engineering’ is preceded by the word ‘genetic’, simply divide one Ngram by the other, i.e. enter “genetic engineering”/engineering (screenshot below).
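Outside the viewer, the arithmetic behind such a ratio can be sketched in Python by counting directly in a piece of text. This is a simplified illustration only: the sample sentence is invented, the function name share_preceded_by is hypothetical, and it works on raw counts in one text, whereas the viewer divides the yearly frequencies of the two Ngrams.

```python
def share_preceded_by(text, target, preceding):
    """Percentage of occurrences of `target` that directly follow `preceding`."""
    tokens = text.lower().split()  # naive whitespace tokenisation
    total = tokens.count(target)
    preceded = sum(
        1
        for prev, cur in zip(tokens, tokens[1:])
        if cur == target and prev == preceding
    )
    return 100 * preceded / total if total else 0.0

# Invented sample text, for illustration only.
sample = (
    "genetic engineering and software engineering grew while "
    "social engineering and genetic engineering drew debate"
)
pct = share_preceded_by(sample, "engineering", "genetic")
print(pct)  # 2 of the 4 mentions follow 'genetic', so 50.0
```

Swapping ‘genetic’ for ‘industrial’, ‘social’ or ‘software’ gives the corresponding share for each of the other phrases.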