Text Mining Northeastern University’s World History Dissertations

As I was exposed to more world history in class this semester, I kept returning to the same questions: what separates empire history from world history? Does empire’s ability to serve as a bridge to world history explain its frequency? Is empire actually a high-frequency topic within world history? To answer these questions, I decided to explore the dissertations from Northeastern’s World History Ph.D. program, founded in 1994.

My first hurdle came with the realization that Northeastern University only makes dissertations published in 2008 or later available online, forcing me to go to the special collections. Surprisingly, the collections are not organized by topic or department, but by last name. Luckily, there were only seven dissertations published before 2008. After scanning these dissertations, I had everything I needed to create a corpus of plain text files. These text files were prepared for text analysis by converting all text to lower case and removing all punctuation, numbers, and common English stop words (a, and, also, the, etc.).
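As a rough illustration of this cleaning step, here is a minimal Python sketch. It is not the code used for this project, the stop-word list is a tiny stand-in for a real one, and the function name `clean_text` and its `extra_stop_words` parameter are hypothetical.

```python
import re

# Illustrative stop-word list only; a real analysis would use a much longer one.
STOP_WORDS = {"a", "and", "also", "the", "of", "in", "to", "is"}

def clean_text(text, extra_stop_words=()):
    """Lowercase the text, strip punctuation and digits, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a-z or whitespace with a space.
    # Note: this also drops accented letters, which may matter for some names.
    text = re.sub(r"[^a-z\s]", " ", text)
    stop = STOP_WORDS | set(extra_stop_words)
    return [word for word in text.split() if word not in stop]

sample = "The Empire, founded in 1994, and also the World!"
print(clean_text(sample))  # ['empire', 'founded', 'world']
```

The optional `extra_stop_words` argument is one way to filter out corpus-specific words, such as a surname that dominates a single dissertation.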

I wanted an initial look at the data, so I used the WordCounter tool to get a sense of what was being discussed in the dissertations. For this analysis, I used a single text file containing all of the abstracts; I did not want to overwhelm the tool with too much text. The word cloud is below:

The high frequency of the words ‘world’ and ‘global’ comes as no surprise, as these are all world history dissertations. Just as I expected, ‘empire’ is also a word of high frequency, as are ‘war,’ ‘political,’ and ‘cultural.’ Interestingly, the words ‘colonial’ and ‘imperialism,’ which are associated with empire, are also featured in the word cloud.

The majority of the text mining and analysis was undertaken in RStudio using R. For R, the corpus featured the abstract, introduction, and conclusion of every dissertation. After installing the necessary packages, I finalized the preprocessing of the text by removing sparse words, that is, words used very infrequently. To begin, I wanted to see the fourteen most frequently used terms and how often each appears. They are shown below:
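The idea behind dropping sparse terms and then ranking what is left can be sketched in a few lines of Python. This is an illustration rather than the R code used here: the toy documents, the helper names, and the `min_share` threshold (keep terms found in at least half of the documents) are all assumptions.

```python
from collections import Counter

def term_counts(docs):
    """Return total counts per term and the number of documents containing each term."""
    totals, doc_freq = Counter(), Counter()
    for tokens in docs:
        totals.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    return totals, doc_freq

def drop_sparse(totals, doc_freq, n_docs, min_share=0.5):
    """Keep only terms that appear in at least min_share of the documents."""
    return Counter({term: count for term, count in totals.items()
                    if doc_freq[term] / n_docs >= min_share})

# Toy corpus of already-cleaned documents (hypothetical data).
docs = [
    ["world", "war", "world", "empire"],
    ["world", "war", "trade"],
    ["war", "global", "world"],
]
totals, doc_freq = term_counts(docs)
kept = drop_sparse(totals, doc_freq, len(docs))
print(kept.most_common(2))  # [('world', 4), ('war', 3)]
```

Changing the argument to `most_common` to 14 would give a top-fourteen list like the one discussed here. Note that a sparsity filter works on how many documents contain a term, not on its total count, so a word used heavily in a single dissertation can still be dropped.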

I was surprised that empire, or words related to empire, were not on this list. It is fascinating that fewer than five mentions separate ‘world’ from ‘war,’ perhaps indicating a tendency of Northeastern’s dissertations to use the framework of war as a means of discussing world history. This chart also forced me to consider whether it is useful to include words like ‘one’ and ‘new’ in the corpus.

After creating a word cloud using the WordCounter website, I decided to try to create word clouds of my own in R. One of my first attempts underscored the importance of cleaning documents before analysis. As seen below, the word cloud picked out ‘Kuklinski’ as a term of high frequency; this was the last name of the man who was the focus of one of the dissertations. This forced me to go back through the documents and remove any words that appeared frequently but were not useful for my analysis, such as Kuklinski.

My second attempt at a word cloud, following a more rigorous cleaning of the texts, produced a cloud that was vastly different from the one created using the WordCounter website. However, a long list of terms that fit within the frequency window were excluded from the cloud because they ‘could not be fit on the page.’ Being a beginner in R, I was not sure of the explanation for this, nor how to fix it. Therefore, this different word cloud is interesting, but not necessarily reflective of the corpus.

Next, I used ggplot2 to create a graph of all the words that appear at least 200 times in the corpus. The word clouds and table above are not as easy to read as this graph, shown below:

The similarity between ‘world’ and ‘war’ is even more striking in this visualization, which also highlights an interesting pattern of nation-state terms: ‘Soviet,’ ‘American,’ ‘Chinese,’ ‘French,’ and ‘British.’ This is particularly interesting because world history’s goal is to move past the boundaries of the nation-state, and yet these terms are still among the most frequent across the Northeastern University dissertations. While ‘Soviet,’ ‘American,’ and ‘Chinese’ do not refer to official empires, they represent big players on the world stage. Perhaps this selection of terms reflects which nation-states have been most influential at the global scale.

I also tried to create a cluster dendrogram that would highlight the common clusters of words, with their frequency and patterns. Unfortunately, I was unable to format it in a way that was legible. The attempt is included below:

Following my interest in the terms ‘war’ and ‘empire,’ I wanted to identify associated words within the corpus. I looked for words that have a correlation of at least 0.85 with each term; a correlation of 1.0 would mean that two words rise and fall together perfectly across the documents. The output is shown below:
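To make the idea of a correlation threshold concrete, here is a small Python sketch that treats each word as a list of per-document counts and computes the Pearson correlation between two such lists. It is an illustration under assumptions, not the R code behind the output in this post: the counts are invented, and `pearson` and `associations` are hypothetical helper names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of counts."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a term with constant counts has no defined correlation
    return cov / sqrt(var_x * var_y)

def associations(counts, term, limit):
    """Return the other terms whose correlation with `term` is at least `limit`."""
    target = counts[term]
    return {other: round(pearson(target, vec), 2)
            for other, vec in counts.items()
            if other != term and pearson(target, vec) >= limit}

# Invented per-document counts for four documents (hypothetical data).
counts = {
    "empire":   [4, 0, 2, 0],
    "imperial": [8, 0, 4, 0],
    "colonial": [0, 3, 0, 3],
    "war":      [1, 2, 1, 2],
}
print(associations(counts, "empire", 0.85))  # {'imperial': 1.0}
```

In this toy data, ‘colonial’ appears only in documents where ‘empire’ does not, so their correlation is negative even though the two words are conceptually related. With a small number of documents, very high correlations can also arise by chance, which is worth keeping in mind when reading unexpected pairings.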

The correlation between empire and imperial makes sense, though I was surprised that colonial was not also included. I found the correlation between ‘war’ and ‘cemetery’ fascinating, and I wonder if that demonstrates a connection between the World History Ph.D. program and the Public History MA program at Northeastern; memorialization is an important topic in Public History. There were too few results for empire, in my opinion, so I ran it again looking for a correlation of at least 0.75 with empire.

Even with the widened search criteria, colonial is still absent from the list; my association of empire and colonial is not reflected in the textual corpus. That being said, I found the correlated terms that did show up to be very interesting, especially ‘ubiquitous.’ It makes sense that an adjective like ubiquitous would be commonly used when discussing empire in the context of the world and world history.

Overall, I learned more about text mining in R than about the state of the field of world history. However, some of my assumptions were proven wrong through my analysis, namely the association of empire and colonial. That being said, my belief that empire is commonly discussed in world history proved accurate, though empire is perhaps not as pervasive as I thought. The preponderance of ‘war,’ especially in comparison with the term ‘world,’ was the most interesting takeaway. Empires inherently cross nation-state boundaries, which was the foundation of my belief that they would be influential in world history, but I failed to consider wars, which also cross nation-state boundaries in most cases. Finally, the frequency of nation-state terms in the corpus demonstrates that there is still far to go to fully escape the confines of the nation-state.
