Annie Champagne Queloz, PhD. ETH Zürich



Mining “Origin of Species” with Iramuteq

Recently, I discovered Iramuteq (R Interface for multidimensional analysis of texts and questionnaires), developed by the Laboratoire d’Études et de Recherches Appliquées en Sciences Sociales at Toulouse University, France. This free text mining software can provide basic text analyses such as word frequencies (word clouds), or more complex ones such as descending hierarchical classification, post-hoc correspondence analysis and similarity analysis. Iramuteq is relatively simple to use. It is an interface built on the R and Python languages. The software offers complete English and French dictionaries. Other languages are also available, but in beta versions only. For example, in German, plural words and adjectives are not handled, so lemmatization (reducing words to their root forms: infinitive verbs, singular nouns, masculine singular adjectives) is not done.


Origin of Species by Darwin: Analysis of the First Chapter

“À la bonne franquette” (that is, keeping things simple), I will describe a basic example of how Iramuteq can be used. The first chapter of the Origin of Species by Darwin will be my text corpus for the analysis (available here). Before starting an analysis, you should review the text so that spelling mistakes or other errors are not counted as different words (this matters most for open-question surveys). In addition, all acronyms and abbreviations must be written consistently. Then, you can load the text corpus into Iramuteq. I will not describe the technical details of how a corpus is segmented or how the algorithms work. I could not explain them better than the information available on the website or the help available on the Iramuteq forum. Moreover, questions posted on the forum are usually answered quite quickly.


Let’s start easy!

First, you have to load your text corpus and select its language. Next, I usually start with a similarity analysis, which identifies co-occurrences of words. It reveals clusters of words based on how often they are associated, and it gives you a pretty co-occurrence tree (Figure 1). In our example, we can observe that “breed” is the most frequent word in the first chapter of Origin of Species and that it is often associated with “domestic”, “animal”, or “pigeons” (Figure 1). In Figure 2, you can see the parameters that I selected. Here, I restricted the analysis to words that occur more than 10 times in the text. Of course, the longer and more elaborate the text, the more complex the interpretation of this type of graph. The size of each vertex is proportional to the word’s frequency. It is also possible to simply create a word cloud illustrating the word frequencies (Figure 3).

Figure 1: Word similarity tree


Figure 2: Parameters to generate the similarity tree


Figure 3: Word cloud
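To make the idea of co-occurrence concrete, here is a minimal Python sketch of the counting behind a similarity analysis: word frequencies, a frequency threshold, and the number of text segments in which two words appear together. It is not Iramuteq's own code. The function names, the `min_freq` parameter and the three toy segments are assumptions made for illustration, and no lemmatization or stop-word filtering is done.

```python
from collections import Counter
from itertools import combinations
import re


def tokenize(segment):
    """Lowercase a text segment and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", segment.lower())


def cooccurrences(segments, min_freq=1):
    """Return word frequencies and, for each pair of retained words,
    the number of segments in which both words appear."""
    tokenized = [tokenize(s) for s in segments]
    freq = Counter(word for tokens in tokenized for word in tokens)
    # Keep only words that reach the frequency threshold.
    kept = {word for word, n in freq.items() if n >= min_freq}
    pairs = Counter()
    for tokens in tokenized:
        # Each word counts once per segment; sorting gives stable pair keys.
        present = sorted(set(tokens) & kept)
        pairs.update(combinations(present, 2))
    return freq, pairs


# Toy segments (hypothetical, not taken from the real corpus).
segments = [
    "domestic breed of pigeon",
    "breed of domestic animal",
    "wild pigeon and domestic pigeon",
]
freq, pairs = cooccurrences(segments, min_freq=2)

for (w1, w2), n in pairs.most_common(3):
    print(w1, w2, n)
```

With a real chapter, you would raise `min_freq` (I used a threshold of 10 in Iramuteq) and draw the strongest pairs as edges of a graph, with vertex sizes taken from `freq`.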

A little bit more complex…

Another cool analysis offered by Iramuteq is word clustering (a friendly name for descending hierarchical classification, or the Reinert method) (Figure 4). This classification is based on chi-squared tests of association. A dendrogram is generated showing how the classes split and how they relate to each other. For each class, we obtain the most strongly associated words. For our chapter, we observe 5 classes of words, grouped into two branches (Classes 2 and 3; Classes 1, 4 and 5). Note that a word can be found in different classes.



Figure 4: Word clustering
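As an illustration of the chi-squared idea, the sketch below computes the Pearson chi-squared statistic for a single word and a single class from a 2×2 table of text segments. It only shows how the association between a word and a class can be scored; it does not reproduce the full descending classification. The function name and all counts are invented for the example.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for a 2x2 table.

    a: segments in the class that contain the word
    b: segments in the class that do not contain the word
    c: segments outside the class that contain the word
    d: segments outside the class that do not contain the word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # An empty row or column carries no information about association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts: the word appears in 20 of the 50 segments of a class,
# but in only 10 of the 150 segments outside that class.
associated = chi_squared_2x2(20, 30, 10, 140)

# Same proportion of segments inside and outside the class: no association.
independent = chi_squared_2x2(10, 40, 30, 120)

print(round(associated, 2), independent)
```

The larger the statistic, the more strongly the word is tied to the class, which is why the words listed under each class are usually ranked by this kind of score.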

With Iramuteq, a correspondence analysis (CA) can be easily done (Figure 5). Briefly, CA is often used to represent and model categorical or categorized data as “clouds” of points in a multidimensional Euclidean space. It is really useful for illustrating associations between variables. The variables are expressed as vectors, and correlations as angles between vectors drawn from the origin of the graph. A small angle between two vectors indicates a strong correlation between the variables. In Figure 5, we can see the 5 classes of words in distinct colours. For example, “seeds”, “plants”, “cultivate”, “flowers” and “variety” (in pink) are closely associated. In the upper part of the graph, “pigeon”, “wild”, “birds”, “domestic” and “descend” (in red) are associated. Word clustering and CA are often used to analyze the discourse of different people or groups of people (here is an example).


Figure 5: Correspondence analysis graph
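For readers curious about what happens under the hood, here is a minimal sketch of a textbook correspondence analysis computed with NumPy, using a singular value decomposition of the standardized residuals of a contingency table. It is a generic formulation and not necessarily the exact procedure Iramuteq runs. The small words-by-classes table and the function name are made up for illustration.

```python
import numpy as np


def correspondence_analysis(table):
    """Return row coordinates, column coordinates and inertia per axis
    for a contingency table (rows = words, columns = classes)."""
    counts = np.asarray(table, dtype=float)
    P = counts / counts.sum()          # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    expected = np.outer(r, c)          # values expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]    # principal row coordinates
    cols = (Vt.T * sv) / np.sqrt(c)[:, None] # principal column coordinates
    return rows, cols, sv ** 2


# Hypothetical counts of four words across three classes.
words = ["pigeon", "breed", "seed", "flower"]
table = np.array([
    [20, 2, 1],
    [15, 3, 2],
    [1, 18, 4],
    [2, 3, 16],
])
rows, cols, inertia = correspondence_analysis(table)

# The first two columns of `rows` are the positions of the words on the usual 2D map.
for word, (x, y) in zip(words, rows[:, :2]):
    print(f"{word:8s} {x: .3f} {y: .3f}")
```

Words with similar profiles across the classes end up close to each other on the map, and the share of `inertia` carried by each axis tells you how much of the overall association that axis displays.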

Have fun!

Iramuteq is really cool software for text mining. On the website, you can find a tutorial in English that explains how to use it and how to analyze the results. It is quite simple. However, interpreting the results can sometimes be complex, especially with long texts. Have fun trying it! It is well worth discovering!










