October 22nd 2025

How big data is changing how we study languages

Big data is enriching the field of language study, but data access needs to be opened up more for academics to scrutinize the figures properly.

Do women really talk more than men? How does disfluency vary with sex and age? Have sentences in formal written English become shorter and simpler over the past few hundred years? Using available digital resources, we can get answers to questions like these in just a few minutes.
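To give a rough sense of what such a quick check might look like, here is a minimal Python sketch that compares average sentence length across text samples. It is an illustration only, not the method used by any particular study: the function name `mean_sentence_length`, the crude punctuation-based sentence splitter, and the two toy samples are all assumptions invented for this example, and a real analysis would use dated corpora and a proper tokenizer.

```python
import re


def mean_sentence_length(text: str) -> float:
    """Return the average number of words per sentence.

    Assumption: sentences end at '.', '!' or '?' followed by whitespace.
    This is a deliberately crude splitter, good enough for a sketch.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    # Count runs of letters (and apostrophes) as words.
    word_counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)


# Toy samples invented for illustration; they are not real corpus data.
samples = {
    "older": (
        "It is a truth which, though often repeated, is seldom examined "
        "with the care that it deserves. We shall examine it here."
    ),
    "newer": "We looked at the data. It was clear. Sentences got shorter.",
}

for label, text in samples.items():
    print(f"{label}: {mean_sentence_length(text):.2f} words per sentence")
```

Run over large collections of texts from different periods instead of two invented strings, the same kind of counting is what lets a question about sentence length be answered in minutes rather than months.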

 From the perspective of a linguist, today's vast archives of digital text and speech, along with new analysis techniques and inexpensive computation, look like a wonderful new scientific instrument, a modern equivalent of the 17th-century invention of the telescope and the microscope. We can now observe linguistic patterns in space, time, and cultural context, on a scale three to six orders of magnitude greater than in the past, and simultaneously in much greater detail than before.

 Of course, our observations may not be correct or general, because they depend on counting things in specific datasets with specific characteristics. But the same problem exists even more seriously for the answers we get from any other methods. And as long as we have data from a variety of different settings – personal conversations and broadcast interviews and classroom discussions and so on – it's easy to check the generality of our results.

 At least, it's easy if all that digital data is accessible.

 Luckily, we now have access to quite a bit of relevant linguistic data. This is partly because so much of our communication is now mediated by networked digital computing devices. But it's also because shared linguistic datasets played a central and critical role in the research behind the linguistic technology, science, and scholarship that we have today.

 This has had several important consequences for science and the humanities. The most important is that we now have algorithms for the automatic analysis of text and speech, algorithms that can be applied to the even larger digital archives now emerging.

 Another important outcome has been to underline the value of reproducible research on accessible data.

 When research datasets are available, there's more research because barriers to entry are lowered. When research datasets are shared, the research is better, because results can be replicated, and algorithms and theories can be compared. In addition, shared datasets are typically much bigger and more expensive than any individual researcher's time and money would permit. And when datasets are associated with well-defined research questions, the whole field gets better, because the people who work on the "common tasks" form a community of practice within which ideas and tools circulate rapidly.

 We might call this process the data reformation, since it emphasises the spread of unmediated access to the primary material needed to discover truth. More familiar names for the trend are the open data and reproducible research movements. Under whatever name, this trend is making increasing amounts of digital data – including speech and language data – accessible to many researchers worldwide.

Source: www.theguardian.com