
Topic Modeling


The results on this page were generated with the open-source software MALLET. According to the MALLET page on the University of Massachusetts Amherst site: "MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text." A tutorial on how to run your own topic model and generate visuals can be found at this repository. The visuals on this page were generated with Excel. To create proper visuals, refer to the tutorial linked above.



The image to the left depicts a typical MALLET command. This command is entered into the command line, prompting MALLET to retrieve a series of text files and process them. MALLET works by processing each word in the corpus of documents and placing it into one of a set number of topics. At first this placement is random, but as more words are processed MALLET picks up on patterns within the text and begins to sort words based on how similar they are to the words already sorted into each topic. One run through the corpus is one iteration. When another iteration is run (in this case 800 were run), MALLET takes each word out of its topic and re-sorts it, and so on. Thus, topics become more specific at higher numbers of iterations (though there is a practical limit). (Figure 1)
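The re-sorting process described above can be sketched as a toy Gibbs sampler. This is a simplified illustration in Python, not MALLET's actual code: the function name, the alpha and beta smoothing values, and the tiny sample corpus are my own assumptions.

```python
import random
from collections import Counter

def gibbs_lda(docs, num_topics, iterations, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only, not MALLET)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Step 1: place every word token into a random topic.
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(a) for a in assign]             # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    for doc, a in zip(docs, assign):
        for w, t in zip(doc, a):
            topic_word[t][w] += 1
            topic_total[t] += 1
    # Step 2: each iteration takes every token out of its topic and re-sorts it.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assign[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # A topic is favoured if this document already uses it a lot
                # and if this word already appears in it a lot.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    return assign, topic_word

# Made-up miniature corpus, only to show the call.
docs = [["church", "sermon", "god"],
        ["canvas", "paint", "studio"],
        ["god", "church", "paint"]]
assign, topic_word = gibbs_lda(docs, num_topics=2, iterations=20)
```

With only three short documents the topics will not mean much; the point is that the first pass is random and every later pass reuses the counts built up so far.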


The image above shows the command that follows the previous one and generates the results for analysis, which are shown below. (Figure 2)
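Since the commands appear only as images, here is a rough sketch of what a two-step MALLET run typically looks like, written as a small Python helper that builds the two command lines. The directory and file names (letters, output, letters.mallet, and so on) are placeholders of mine, not the ones used for this project. The flags are standard MALLET options, but the exact options shown in Figures 1 and 2 may differ.

```python
def mallet_commands(corpus_dir, work_dir, num_topics=10, num_iterations=800):
    """Build the import and training command lines for a MALLET run.

    All paths are hypothetical placeholders. On Windows the launcher is
    usually written bin\\mallet instead of bin/mallet.
    """
    model = f"{work_dir}/letters.mallet"
    # Step 1: read a folder of text files into MALLET's own format.
    import_cmd = [
        "bin/mallet", "import-dir",
        "--input", corpus_dir,
        "--output", model,
        "--keep-sequence",       # keep word order, needed for topic modeling
        "--remove-stopwords",    # drop very common English words
    ]
    # Step 2: train the topic model and write the result files.
    train_cmd = [
        "bin/mallet", "train-topics",
        "--input", model,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--output-topic-keys", f"{work_dir}/topic_keys.txt",
        "--output-doc-topics", f"{work_dir}/doc_topics.txt",
    ]
    return import_cmd, train_cmd

import_cmd, train_cmd = mallet_commands("letters", "output")
for cmd in (import_cmd, train_cmd):
    print(" ".join(cmd))
```

The defaults of 10 topics and 800 iterations match the settings used on this page. The printed lines can be pasted into a terminal opened in the MALLET install folder.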

These results come from inputting all of the markup into MALLET with the specifications of 10 topics and 800 iterations. I found Topics 1, 3, 5, 7, and 9 to be the most significant. These topics are highlighted on the chart to the left. Topic 1 contains words associated with religion, a major influence on Van Gogh in his earlier years. Topics 3 and 5 consist of words associated with Van Gogh's work as an artist as well as his inspirations. Topic 7 includes words linked with nature. Nature was a source of inspiration for many of Van Gogh's pieces. Topic 9 contains words associated with his home life and family. (Figure 3)
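A chart like this one is normally built from MALLET's topic-keys output. As a minimal sketch, assuming the usual layout of that file (topic number, a weight parameter, then the top words, separated by tabs), the words for each topic could be read like this. The function name and the sample line are mine; the sample words are the Topic 9 words quoted on this page, and the 0.5 value is made up.

```python
def parse_topic_keys(lines):
    """Return {topic id: [top words]} from topic-keys lines.

    Assumed layout per line: <topic id> TAB <weight parameter> TAB <words>.
    MALLET numbers topics from 0, so ids here are presumably one lower
    than the Topic 1..10 numbering used on this page.
    """
    topics = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        topics[int(parts[0])] = parts[2].split()
    return topics

# One illustrative line; a real file would have one line per topic.
sample = ["8\t0.5\tevening uncle home letter write\n"]
topics = parse_topic_keys(sample)
```

Reading the words into a dictionary makes it easy to print them next to each topic number when deciding what a topic is about.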


The above graph is an example of how the data from MALLET would look without some sort of smoothing method (cubic Bézier, polynomial interpolation). As you can see, it is very hard to interpret this figure due to the jagged and cluttered nature of the lines. (Figure 4)
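To give an idea of what smoothing does to a jagged line, here is a small sketch of a least-squares polynomial fit in plain Python. I am assuming this is close to what a polynomial trend line in a spreadsheet does; it is not necessarily the exact method used for the figures on this page, and the function names and the degree of 3 are my own choices.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    n = degree + 1
    # Normal equations: (X^T X) c = X^T y
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs

def smooth(xs, ys, degree=3):
    """Replace each y value with the fitted polynomial's value at x."""
    coeffs = polyfit(xs, ys, degree)
    return [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]

# Made-up jagged topic weights for ten consecutive letters.
xs = list(range(10))
ys = [0.10, 0.40, 0.05, 0.35, 0.20, 0.50, 0.15, 0.45, 0.30, 0.60]
smoothed = smooth(xs, ys)
```

Using the letter's position in the corpus (0, 1, 2, ...) for x rather than a raw year keeps the numbers small, which helps the fit stay stable. A low degree gives a gentle curve; a high degree starts to follow the noise again.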

Figure 5

The graph below shows how polynomial interpolation can make MALLET results much easier to read. This graph depicts the topic model of the entire corpus of letters written by Van Gogh. To understand this graph you must refer back to Figure 3. For clarification, the Y axis measures Topic Weight: how present a topic is in a given letter. The most significant parts of this graph are marked by number. The gray line, Topic 9, represents Van Gogh's connection with his family and his home. This can be determined by analyzing the 20 words that are grouped into Topic 9. The connection with his family and home is highlighted by the words evening, uncle, home, letter, and write. It might be hard for the unacquainted reader to make much sense of the words in the topic, as they have not read the letters Van Gogh wrote to his brother. This is a problem with presenting MALLET data. The results are very subjective, as MALLET is not aware of what any of the words mean. It sorts words into topics algorithmically (more information can be found in the tutorial linked above). It is usually up to the person generating the results to interpret the meaning of each topic. It can be seen that Topic 9 is very present in letters from 1872-1880 but not present in letters from 1880-1890. This represents his separation from his home: his career as a painter caused him to move around Europe and took up most of his time. Lines 5 and 3 are also notable because these topics are linked to Van Gogh's work through words such as paintings, canvas, studies, colours, portrait, studio, and model. It can be seen that these two topics rise in weight from 1880 to 1890. This is indicative of Van Gogh producing more works. Topic 7 is also of note because it refers to Van Gogh's connection with nature through the words sky, winter, trees, and landscape. Nature was one of Van Gogh's greatest joys and inspirations.
It can be seen that nature is very present between 1880 and 1888 but becomes less present from 1888-1890 (during this time Van Gogh was in and out of mental institutions). The way I read this data is that as Van Gogh progressed in his career he lost connection with his family. Work started to take over his life and eventually took the place of his joys and even his inspirations (nature). These factors can be linked with his descent into mental instability.
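The comparison of Topic Weight between periods can also be checked with numbers rather than by eye. The sketch below assumes MALLET's doc-topics output in its newer layout (document number, file name, then one weight per topic, separated by tabs); older MALLET versions write topic/weight pairs instead and would need a different parser. The function names, the file names, the year lookup, and all the sample weights are made up for illustration.

```python
def parse_doc_topics(lines):
    """Return {document name: [weight of topic 0, topic 1, ...]}."""
    weights = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip a header line or blank lines
        parts = line.rstrip("\n").split("\t")
        weights[parts[1]] = [float(p) for p in parts[2:]]
    return weights

def mean_topic_weight(weights, year_of, topic, start, end):
    """Average weight of one topic over letters dated start..end inclusive."""
    values = [w[topic] for name, w in weights.items()
              if start <= year_of[name] <= end]
    return sum(values) / len(values) if values else 0.0

# Made-up two-topic example; real output would have ten weights per letter.
lines = [
    "0\tletter_a.txt\t0.7\t0.3",
    "1\tletter_b.txt\t0.5\t0.5",
    "2\tletter_c.txt\t0.1\t0.9",
]
year_of = {"letter_a.txt": 1875, "letter_b.txt": 1878, "letter_c.txt": 1885}
weights = parse_doc_topics(lines)
early = mean_topic_weight(weights, year_of, 0, 1872, 1879)
late = mean_topic_weight(weights, year_of, 0, 1880, 1890)
```

A clear gap between the early and late averages for a topic would back up what the smoothed line suggests. How each letter gets its year is left open here, since that depends on how the corpus files are named or tagged.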


Topic Model of Period 1


The image above depicts the 10 topics generated by MALLET. These results were produced by constraining MALLET to 10 topics and 800 iterations. (Figure 6)

Figure 7

The graph below depicts the trends of the 10 topics presented in the chart above. The lines of most importance are linked with the rows whose first cell is filled with color (topics 2, 3, 5, and 6). Topic 2 represents Van Gogh's connection with his family and home. Topics 3 and 5 represent Van Gogh's devotion to Christ, a constant theme through this period as he was preparing to become a pastor. Topic 6 is connected with Van Gogh's constant walks throughout the city, which were a source of happiness as well as inspiration for him. This visual is not a great indicator of mental health because this period was not a time of instability for Van Gogh. The data generated can be used to understand what Van Gogh's focuses were during this period. It seems that they were God, family, some painting (present in topics 4 and 9), and the world around him.


Topic Model of Period 2

The image above depicts the 10 topics generated by MALLET. These results were produced by constraining MALLET to 10 topics and 800 iterations. (Figure 8)

Figure 9

The graph below depicts the trends of the 10 topics presented in the chart above. The lines of most importance are linked with the rows whose first cell is filled with color (topics 3, 4, 6, and 10). Topic 3 is connected with Van Gogh's work as a painter and his brother Theo's work as an art dealer. During this period Theo and Van Gogh conducted their business together; Van Gogh would send works to Theo and Theo would attempt to sell them. This arrangement seems convenient but in fact created a lot of stress for both of them. Van Gogh became financially dependent on Theo. During this period it would have been impossible for Van Gogh to fund the vast majority of his painting without Theo's help. This type of financial stress can be connected to Topic 10 by the words money, order, francs, and pay. Following the line plotted for Topic 10, it is clear that this money stress was present throughout this period. Topic 6 is not very present in this collection of letters, but I thought that the topic indicated some sort of emotional and spiritual fight that Van Gogh was trying to overcome, evident through the words love, woman, god, reality, experience, and sympathy.


Topic Model of Period 3


The image above depicts the 10 topics generated by MALLET. These results were produced by constraining MALLET to 10 topics and 800 iterations. (Figure 10)

Figure 11

The graph below depicts the trends of the 10 topics presented in the chart above. The lines of most importance are linked with the rows whose first cell is filled with color (topics 3, 4, 6, and 9). This is the period in which Van Gogh spent some stretches of time in mental institutions. Topic 6 can be interpreted as referencing the stress over money that Van Gogh had during this time. This is evident through the words Gauguin, money, send, francs, and business. Some sources say that during this time Van Gogh was in trouble with Gauguin over money he owed him. There are many conspiracy theories that Van Gogh's ear was actually cut off by Gauguin due to these debts. This period was actually Van Gogh's most productive time. Work is referenced in topics 2, 3, 4, 5, 7, and 10, with topic 4 being the most notable. It seems that topic 9 indicates poor health due to the inclusion of the words hospital and illness.



For this project MALLET was a good tool for deconstructing the corpus. The topics generated for each period helped me understand the core elements of Van Gogh's life during that period. The figures that are most helpful for answering our research question are Figures 5 and 11. Through these figures it seems that the main drivers of Van Gogh's mental instability were his stresses with money, stress from work, and lack of connection with family, friends, and God.