Lin, J., Bahety, A., Konda, S., Mahindrakar, S. (January 2009)
Hadoop is an open-source implementation of Google's MapReduce programming model that has recently gained popularity as a practical approach to distributed information processing. This work explores the use of memcached, an open-source distributed in-memory object caching system, to provide low-latency, high-throughput access to static global resources in Hadoop. Such a capability is essential to a large class of MapReduce algorithms that require, for example, querying language model probabilities, accessing model parameters in iterative algorithms, or performing joins across relational datasets. Experimental results on a simple demonstration application illustrate that memcached provides a feasible general-purpose solution for rapidly accessing global key-value pairs from within Hadoop programs. Our proposed architecture exhibits the desirable scaling characteristic of a linear increase in throughput with respect to cluster size. To our knowledge, this application of memcached in Hadoop is novel. Although considerable opportunities for increased performance remain, this work enables implementation of algorithms that do not have satisfactory solutions at scale today.
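
As a rough illustration of the access pattern described above, the following Python sketch shows a map function querying a static global key-value resource that is partitioned across several cache servers. It is a minimal sketch under stated assumptions, not the implementation evaluated in this work: `ShardedKeyValueStore` and `map_record` are hypothetical names, the store is an in-process stand-in rather than a real memcached client, and the CRC32-modulo key placement is only one simple way a client might choose a server.

```python
import zlib
from typing import Dict, Iterator, List, Optional, Tuple


class ShardedKeyValueStore:
    """In-memory stand-in for a pool of cache servers.

    Hypothetical class for illustration; this is not the memcached API.
    """

    def __init__(self, num_servers: int) -> None:
        if num_servers < 1:
            raise ValueError("num_servers must be at least 1")
        # One dictionary per simulated server.
        self.servers: List[Dict[str, float]] = [{} for _ in range(num_servers)]

    def _server_for(self, key: str) -> Dict[str, float]:
        # Assumption: the client picks a server with a deterministic hash,
        # so every mapper resolves the same key to the same server.
        index = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        return self.servers[index]

    def set(self, key: str, value: float) -> None:
        self._server_for(key)[key] = value

    def get(self, key: str) -> Optional[float]:
        # Returns None on a miss.
        return self._server_for(key).get(key)


def map_record(store: ShardedKeyValueStore, line: str) -> Iterator[Tuple[str, float]]:
    """Map function: look up a global value (e.g. a probability) per token."""
    for token in line.split():
        value = store.get(token)
        if value is not None:
            yield token, value


# Usage: load the static resource once, then query it from map calls.
store = ShardedKeyValueStore(num_servers=3)
store.set("hadoop", 0.25)
store.set("cache", 0.5)
results = list(map_record(store, "hadoop uses cache daily"))
```

Because key placement depends only on the key and the number of servers, adding servers spreads both storage and lookup load, which is the intuition behind throughput growing with cluster size.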