Ahmed Elgohary wins best paper at VLDB 2016

PhD student Ahmed Elgohary is a bit of a Renaissance individual.  He has worked with Professor Amol Deshpande in databases, and he is now working with Assistant Professor Marine Carpuat in machine learning, natural language processing, and natural language inference.

In a brief interview in late August, he noted that a good thing about the University of Maryland is the opportunity for students to explore different fields and work with different advisors as they decide the best area in CS for them to conduct research.

He originally thought he was going to focus on databases, but he took a course in Computational Linguistics with Professor Philip Resnick (with whom he has also recently won a best paper) and then realized that he wanted to study Natural Language Processing. He looked at research subjects that Carpuat works on—problems related to machine translation, natural language inference, focusing on the cross-lingual settings— and found that they were compelling to him.

Elgohary has won a best paper at the 42nd Annual Conference on Very Large Databases 2016 (VLDB) that is going on next week from 9/5-9/9 in New Delhi, India.  This best paper is the result of his internship work at IBM Almaden Lab, where he worked with computer scientists from a variety of research backgrounds including systems, databases, algorithms, and machine learning. His mentors for this project were Mathais Boehm and Frederick R. Reiss. His paper is entitled  “Compressed Linear Algebra for Large-Scale Machine Learning,” and he authored with the paper with his advisors as well as Peter J. Haas and Berthold Reinwald. 

“I think I owe a huge part of the award to Professor Amol Deshpande,” he said. “He was my advisor at the time I got the internship. What I learnt from him during my first year at UMD and his graduate database class were my most valuable assistants to do that work. “

Elgohary described his work in System ML at IBM, that resulted in this best paper.  “The idea behind the paper is to compress machine learning data sets in a way that will allow us to do operations into a compressed domain. We wanted to design compression schemes to do that, and we wanted to decide in an efficient way, what is the best compression scheme to use for each dataset as efficiently as possible.”

In a longer conversation, Elgohary described the work in greater detail (edited for clarity):  

“We want developers to write Machine Learning scripts in a declarative way--like R or a very simple language. We don’t want them to have to worry about how to execute their programs.  Our job is to execute their programs on a very large data matrix that cannot be stored on the memory of a single node. The matrix is distributed, and we are going to do a distributed execution—this work is not trivial.”

According to Elgohary, developers of machine learning scripts shouldn’t have to worry about how the execution is going to happen. “So they give us the scripts, and the system will take care of planning the execution of the machine learning workload.  The planning will make decisions such as—‘that matrix is not really huge, so let’s fit it into the memory of a single node and then to the execution.’”

His project was to speed up the execution if it could be entire done in main memory.  He explained that It is better if to fit the data matrix into aggregate clusters of memory—the memory of the whole node. “We don’t want the disk to read memory at each iteration, he said, “If your data is huge, then you have to store it on a disk, and access the memory and then store it back on the disk—that [process] is really slow.”  

Over the summer he attempted to compress data, fit it into main memory and speed up the process, and not read anything from the disk. But he found a problem with compression. “If you want to do an operation—like matrix vector multiplication—and the matrix is compressed by any general matrix compression algorithm, you’ll have to decompress before execution, that process is slow.”

He and his team wanted to have data compressed in a particular way that the domain would not endure any decompression—and this is the main contribution of the paper. “Designing a compression scheme to execute operations, and finally how to make decisions about how to do the compressions,” were the main parts of the paper.  But he emphasized that “These matrices are machine learning data sets, and no two data sets are the same. They are different”

He continued, “Machine learning data sets are different—even the features are different. The features can be binary, or numeric, but span a very small range. There is a best practice compression for specific features, and there are questions about what are the best compression schemes for that.  This was one of the most complicated parts of the work. One must make decisions that go over the entire matrix (or dataset), read it, collect some statistics and then decide, which compression scheme to use for each feature. This is bad because you want to go over the entire dataset, but that is going to take a lot of time. The idea is to take a sample, if you have a billion rows, reduce that number to 10K rows, and take statistics from that sample.”

This is certainly the first of many best papers to come from Elgohary.

The Department welcomes comments, suggestions and corrections.  Send email to editor [at] cs [dot] umd [dot] edu.