Lin, J. (July 2008)
This paper explores the challenge of scaling up language processing algorithms to increasingly large datasets. While cluster computing has been available in industrial environments for several years, academic researchers have fallen behind in their ability to work on large datasets. We discuss two challenges contributing to this problem: lack of a suitable programming model for managing concurrency and difficulty in obtaining access to hardware. Hadoop, an open-source implementation of Google’s MapReduce framework, provides a compelling solution to both issues. Its simple programming model hides system-level details from the developer, and its ability to run on commodity hardware puts cluster computing within reach of many academic research groups. This paper illustrates these points with a case study on building word co-occurrence matrices from large corpora. We conclude with an analysis of an alternative computing model based on renting instead of buying computer clusters.