I got into machine learning in a very unusual way. In my undergraduate years, I studied comparative Indo-European linguistics. Comparative linguistics is one of the most intriguing fields: it reverse engineers how the proto-language of Greek, Latin, Sanskrit, and many other languages sounded thousands of years ago. Up until the 20th century, most linguists, as well as many prominent polymaths of the time such as the Grimm brothers and Hermann Grassmann, were interested in reconstructing this language, called Proto-Indo-European.
I got so fascinated by Proto-Indo-European that I spent my undergraduate studies majoring in comparative Indo-European linguistics. To conduct research in comparative linguistics, one first has to master old Indo-European languages. At the University of Ljubljana, Slovenia, where I did my undergraduate studies, I studied Greek, Sanskrit, Old Church Slavic, and Hittite. In fact, I spent 10 semesters studying Hittite during my undergraduate years. I am pretty sure that no other program offers so much coursework in Hittite. At the end of the training, we could read and write in Hittite Cuneiform. To gain knowledge of other Indo-European languages, I studied at Leiden University, the University of Vienna, the University of Pavia, and the Free University of Berlin, among others. There, I learnt Old Irish, Hieroglyphic and Cuneiform Luwian, Old Persian, Avestan, Tocharian, and other fascinating languages and literatures that very few people study these days.
A Tocharian A inscription.
My linguistic education fully recapitulates the development of linguistic science. Modern linguistics was born out of the desire to reconstruct the Indo-European proto-language. When colonialists became familiar with India and found Sanskrit, they were shocked by the fact that the most sacred language of India is strikingly similar to what they believed was the most amazing language of all time: Ancient Greek. At first they didn’t know what to make of it, but they eventually concluded that Greek, Latin, and Sanskrit all go back to a common mother language. Until the 20th century, most of linguistic science was preoccupied with figuring out how Proto-Indo-European, the language of the ancestors of Homer, Virgil, and the Rigvedic poets, sounded.
Ferdinand de Saussure was the first to shift this perspective: instead of focusing exclusively on the history of language, he argued that studying language in its current state is equally important. Speakers of English, for example, have no awareness of how Proto-Indo-European sounded, so according to de Saussure, we need to build an understanding of the internal workings of language in human individuals, here and now. While this line of thought was losing popularity in the US, Ljubljana remained heavily influenced by structuralism, in part due to the influence of the Prague linguistic circle and others. This is why I received extensive training in structuralism as part of my double-major degree.
De Saussure’s structuralism was heavily criticized by another important figure in the history of linguistics, Noam Chomsky, who started a new school of thought within linguistics known as generative grammar. Boston, where I continued my education as a Ph.D. student at Harvard, is one of the most generative towns in the world. After all, Chomsky was a long-time professor at MIT. By moving from Ljubljana to Boston, I necessarily got trained in this new approach to language.
Linguistics is constantly evolving, just like the object of its study—language. Linguistics now encompasses fields as diverse as sociology, philosophy, cognitive science, and machine learning, to name just a few. As a faculty member first at the University of Washington and now at UC Berkeley, I focus on using language to better understand how artificial intelligence learns. Humans have been studying language for centuries, amassing a rich understanding of it. Language is the perfect medium to untangle the complex inner workings of current AI models. As the Linguistics Lead of Project CETI, I aim to demonstrate that integrating linguistics and AI not only deepens our understanding of human language, but also sheds light on the communication systems of animals.
My linguistic education gave me two crucial skills that I now apply in my AI research. First, I got used to being almost obsessed with details. Reverse engineering a proto-language demands meticulous attention to detail, perhaps even more so than other fields. In today's world of big data, data points that diverge from global trends might be discarded. However, in reconstructing a proto-language, these divergent data points are invaluable. For example, the English plural of ox, oxen, gives us a window into past stages of the English language that are lost in the more common way of forming plurals (by adding an -s). Of all my papers, I’m still most proud of the one in which I propose a rule in Vedic Sanskrit poetic meter. It has nothing to do with machine learning, but it uncovers a detail that was overlooked for hundreds of years.
Secondly, linguistics is an unsung hero among sciences. It has initiated several groundbreaking trends in global history, yet it seldom receives the acknowledgment it deserves. The idea that similarities in Greek, Latin, and Sanskrit can only be explained by a proto-language that no longer exists predates Darwin’s evolutionary theory by several decades. In fact, Darwin used examples from the proto-language to support his evolutionary theories. If languages like Greek and Sanskrit could have evolved from a common ancestor, why not species? A few decades later, structuralism, championed by de Saussure, shaped philosophy and comparative literature and even gave birth to new fields like semiotics. It had a profound impact on the culture and philosophy of the time.
Because my education followed the evolution of the field of linguistics so closely, it gave me a unique insight into how ideas from the humanities can influence the world of natural science and society in general. In other words, linguistics is proof that a relatively small field obsessed with details can have a global impact on our understanding of humanity.