- DataHub: Collaborative data science and dataset version management at scale;
Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya Parameswaran;
CIDR 2015.
Relational databases have limited support for data collaboration,
where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a)
a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform,
... [more]
Relational databases have limited support for data collaboration,
where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a)
a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform,
DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.
- VERTEXICA: Your Relational Friend for Graph Analytics!;
Alekh Jindal, Praynaa Rawlani, Eugene Wu, Samuel Madden, Amol Deshpande, Mike Stonebraker;
VLDB Demo 2014.
- NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
CoRR Technical Report arXiv:1405.1499.
There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego
network analysis, motif counting, finding social circles, personalized recommendations, analyzing influence cascades, etc. These tasks are not well served by the existing vertex-centric graph processing frameworks, whose computation, execution models limit the
user program to directly access the state of a single vertex; this results in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks typically ignore the challenges in extracting the
... [more]
There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego
network analysis, motif counting, finding social circles, personalized recommendations, analyzing influence cascades, etc. These tasks are not well served by the existing vertex-centric graph processing frameworks, whose computation, execution models limit the
user program to directly access the state of a single vertex; this results in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks typically ignore the challenges in extracting the
relevant portion of the graph that an analysis task needs, and loading it onto distributed memory. In this paper, we describe NScale, a novel end-to-end graph processing framework that enables the distributed execution of complex subgraph-centric analytics over
large-scale graphs in the cloud. NScale enables users to write programs at the level of subgraphs, and to specify the subgraphs of interest declaratively. NScale uses Apache YARN, a state-of-the-art framework for efficient and fault-tolerant distribution of data
and computation. It features GEL, a novel graph extraction and loading phase, that extracts the relevant portions of the graph and utilizes a cost-based optimizer to partition and load the graph onto distributed memory using as few machines as possible to
minimize the communication cost. It utilizes novel techniques for the distributed execution of user computation that minimize memory consumption by exploiting overlap among the subgraphs of interest. Our experimental results show orders-of-magnitude improvements
in performance, and drastic reductions in the cost of analytics, over vertex-centric approaches.
- NScale: Neighborhood-centric Analytics on Large Graphs;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
VLDB Demo 2014.
- SWORD: Workload-aware Data Placement and Replica Selection for Cloud Data Management Systems;
Ashwin Kumar Kayyoor, Abdul Quamar, Amol Deshpande, Samir Khuller;
VLDB Journal (Special Issue on Data-Intensive Cloud Infrastructure), 23(6): 845-870, 2014.
- PStore: An Efficient Storage Framework for Managing Scientific Data;
Souvik Bhattacherjee, Amol Deshpande, and Alan Sussman;
SSDBM 2014.
In this paper, we present the design, implementation, and evaluation of PStore, a no-overwrite storage framework for managing large volumes of array data
generated by scientific simulations. PStore comprises of two modules, a data ingestion module and a query processing module, that respectively address two of the
key challenges in scientific simulation data management. The data ingestion module is geared toward handling the high volumes of simulation data generated at a
... [more]
In this paper, we present the design, implementation, and evaluation of PStore, a no-overwrite storage framework for managing large volumes of array data
generated by scientific simulations. PStore comprises of two modules, a data ingestion module and a query processing module, that respectively address two of the
key challenges in scientific simulation data management. The data ingestion module is geared toward handling the high volumes of simulation data generated at a
very rapid rate, which often makes it impossible to offload the data onto storage devices; the module is responsible for selecting an appropriate compression
scheme for the data at hand, chunking the data, and then compressing it before sending it to the storage nodes. On the other hand, the query processing module is
in charge of efficiently executing different types of queries over the stored data; in this paper, we specifically focus on slice (also called range) queries.
PStore provides a suite of compression schemes that leverage existing techniques while extending some of them to provide support for diverse scientific
simulation data. To efficiently execute queries over such compressed data, PStore adopts and extends a two-level chunking scheme by incorporating the effect of
compression, and hides expensive disk latencies for long running range queries by exploiting chunk prefetching. In addition, we also parallelize the query
processing module to further speed up execution. We evaluate PStore on a 140 GB dataset obtained from real-world simulations using the regional climate model
CWRF. In this paper, we use both 3D and 4D datasets and demonstrate high performance through extensive experiments.
- Optimization Techniques for "Scaling Down" Hadoop on Multi-Core, Shared-Memory Systems;
K. Ashwin Kumar, Jonathan Gluck, Amol Deshpande, Jimmy Lin;
EDBT 2014.
The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a
single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale datastores are increasingly common, it is unclear whether ``typical''
analytics tasks require more than a single high-end server. Additionally, we are seeing increased sophistication in analytics, e.g., machine learning, which
... [more]
The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a
single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale datastores are increasingly common, it is unclear whether ``typical''
analytics tasks require more than a single high-end server. Additionally, we are seeing increased sophistication in analytics, e.g., machine learning, which
generally operate over smaller and more refined datasets. To address these trends, we propose ``scaling down'' Hadoop to run on shared-memory machines. This
paper presents a
prototype runtime called Hone, intended to be both API and binary compatible with standard (distributed) Hadoop. That is, Hone can take an existing Hadoop jar
and run it, without modification, on a multi-core shared-memory machine. This allows us to take existing Hadoop algorithms and find the most suitable runtime
environment for execution on datasets of varying sizes. Our experiments show that Hone order of magnitude faster than Hadoop pseudo-distributed mode (PDM); on
dataset sizes that fit into memory, Hone outperforms a fully-distributed 15-node Hadoop cluster in some cases as well.
- SPARSI: Partitioning Sensitive Data amongst Multiple Adversaries;
Theodoros Rekatsinas, Amol Deshpande, and Ashwin Machanavajjhala;
PVLDB 2013, to be presented at VLDB 2014 (also CoRR Technical Report arXiv:1302.6556).
We present SPARSI, a theoretical framework for partitioning sensitive data across multiple non-colluding adversaries. Most work in privacy-aware
data sharing has considered disclosing summaries where the aggregate information about the data is preserved, but sensitive user information is
protected. Nonetheless, there are applications, including online advertising, cloud computing and crowdsourcing markets, where detailed and
... [more]
We present SPARSI, a theoretical framework for partitioning sensitive data across multiple non-colluding adversaries. Most work in privacy-aware
data sharing has considered disclosing summaries where the aggregate information about the data is preserved, but sensitive user information is
protected. Nonetheless, there are applications, including online advertising, cloud computing and crowdsourcing markets, where detailed and
fine-grained user-data must be disclosed. We consider a new data sharing paradigm and introduce the problem of privacy-aware data partitioning,
where a sensitive dataset must be partitioned among k untrusted parties (adversaries). The goal is to maximize the utility derived by partitioning
and distributing the dataset, while minimizing the amount of sensitive information disclosed. The data should be distributed so that an adversary,
without colluding with other adversaries, cannot draw additional inferences about the private information, by linking together multiple pieces of
information released to her. The assumption of no collusion is both reasonable and necessary in the above application domains that require release
of private user information. SPARSI enables us to formally define privacy-aware data partitioning using the notion of sensitive properties for
modeling private information and a hypergraph representation for describing the interdependencies between data entries and private information. We
show that solving privacy-aware partitioning is, in general, NP-hard, but for specific information disclosure functions, good approximate solutions
can be found using relaxation techniques. Finally, we present a local search algorithm applicable to generic information disclosure functions. We
apply SPARSI together with the proposed algorithms on data from a real advertising scenario and show that we can partition data with no disclosure
to any single advertiser.
- Hone: "Scaling Down" Hadoop on Shared-Memory Systems;
K. Ashwin Kumar, Jonathan Gluck, Amol Deshpande, Jimmy Lin;
VLDB Demo 2013.
- Data Placement and Replica Selection for Improving Co-location in Distributed Environments;
K. Ashwin Kumar, Amol Deshpande, and Samir Khuller;
CoRR Technical Report arXiv:1302.4168.
Increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data
management systems, both parallel relational databases, and systems that support alternative frameworks like MapReduce. There is thus an increasing
contention on scarce data center resources like network bandwidth; further, the energy requirements for powering the computing equipment are also
... [more]
Increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data
management systems, both parallel relational databases, and systems that support alternative frameworks like MapReduce. There is thus an increasing
contention on scarce data center resources like network bandwidth; further, the energy requirements for powering the computing equipment are also
growing dramatically. As we show empirically, increasing the execution parallelism by spreading out data across a large number of machines may
achieve the intended goal of decreasing query latencies, but in most cases, may increase the total resource and energy consumption significantly. For
many analytical workloads, however, minimizing query latencies is often not critical; in such scenarios, we argue that we should instead focus on
minimizing the average query span, i.e., the average number of machines that are involved in processing of a query, through colocation of data items
that are frequently accessed together. In this work, we exploit the fact that most distributed environments need to use replication for fault
tolerance, and we devise workload-driven replica selection and placement algorithms that attempt to minimize the average query span. We model a
historical query workload trace as a hypergraph over a set of data items, and formulate and analyze the problem of replica placement by drawing
connections to several well-studied graph theoretic concepts. We develop a series of algorithms to decide which data items to replicate, and where to
place the replicas. We show effectiveness of our proposed approach by presenting results on a collection of synthetic and real workloads. Our
experiments show that careful data placement and replication can dramatically reduce the average query spans resulting in significant reductions in
the resource consumption.
- Algorithms for the Thermal Scheduling Problem;
Koyel Mukherjee, Samir Khuller, and Amol Deshpande;
IPDPS 2013.
The energy costs for cooling a data center constitute a significant portion of the overall running costs. Thermal imbalance and hot spots that arise due to
imbalanced workloads lead to significant wasted cooling effort -- in order to ensure that no equipment is operating above a certain temperature, the data center
may be cooled more than necessary. Therefore it is desirable to schedule the workload in a data center in a "thermally aware" manner, assigning jobs to machines
... [more]
The energy costs for cooling a data center constitute a significant portion of the overall running costs. Thermal imbalance and hot spots that arise due to
imbalanced workloads lead to significant wasted cooling effort -- in order to ensure that no equipment is operating above a certain temperature, the data center
may be cooled more than necessary. Therefore it is desirable to schedule the workload in a data center in a "thermally aware" manner, assigning jobs to machines
not just based on local load of the machines, but based on the overall thermal profile of the data center. This is challenging because of the spatial
cross-interference between machines, where a job assigned to a machine may impact not only that machine's temperature, but also nearby machines. Here, we
continue formal analysis of the "thermal scheduling" problem that we initiated recently~\cite{sig-abstract}. There we introduced the notion of "effective load of
a machine" which is a function of the local load on the machine as well as the load on nearby machines, and presented optimal scheduling policies for a simple
model (where cross-effects are restricted within a rack) under the assumption that jobs can be split among different machines. Here we consider the more
realistic problem of "integral" assignment of jobs, and allow for cross-interference among different machines in adjacent racks in the data center. The integral
assignment problem with cross-interference is NP-hard, even for a simple two machine model. We consider three different heat flow models, and give constant
factor approximation algorithms for maximizing the number (or total profit) of jobs assigned in each model, without violating thermal constraints. We also
consider the problem of minimizing the maximum temperature on any machine when all jobs need to be assigned, and give constant factor algorithms for this
problem.
- SWORD: Scalable Workload-Aware Data Placement for Transactional Workloads;
Abdul Quamar, K.Ashwin Kumar and Amol Deshpande;
EDBT 2013.
[pdf]
In this paper, we address the problem of transparently scaling out transactional (OLTP)
workloads on relational databases, to support "database-as-a-service" in cloud computing
environment. The primary challenges in supporting such workloads include choosing how to
... [more]
In this paper, we address the problem of transparently scaling out transactional (OLTP)
workloads on relational databases, to support "database-as-a-service" in cloud computing
environment. The primary challenges in supporting such workloads include choosing how to
"partition" the data across a large number of machines, minimizing the number of
"distributed transactions", providing high data availability, and tolerating failures gracefully.
Capturing and modeling the transactional workload over a period of time, and then exploiting
that information for data placement and replication has been shown to provide
significant benefits in performance, both in terms of transaction latencies and overall
throughput. However, such workload-aware data placement approaches can incur
very high overheads, and further, may perform worse than naive approaches if the workload changes.
In this work, we propose SWORD, a scalable workload-aware data partitioning and placement approach for
OLTP workloads, that incorporates a suite of novel techniques to significantly reduce the overheads
incurred both during the initial placement, and during query execution at runtime.
We model the workload as a hypergraph over the data items, and propose using a "hypergraph compression"
technique to reduce the overheads of partitioning. To deal with workload changes,
we propose an incremental data repartitioning technique that modifies data placement in small steps without
resorting to complete workload repartitioning.
We have built a workload-aware "active replication" mechanism in
SWORD to increase availability and enable load balancing. We propose the use of "fine-grained quorums" defined at the level of
groups of tuples to control the cost of distributed updates, improve throughput, and provide adaptability to different workloads.
To our knowledge, SWORD is the first system that uses fine-grained quorums in this context. The results of our experimental evaluation on
SWORD deployed on an Amazon EC2 cluster show that our techniques result in orders-of-magnitude reductions in the partitioning and
book-keeping overheads, and improve
tolerance to failures and workload changes; we also show that choosing quorums based on the query access patterns enables us to better handle query workloads
with different read and write access patterns.
- Saving on Cooling: The Thermal Scheduling Problem;
Koyel Mukherjee, Samir Khuller, and Amol Deshpande;
SigMetrics 2012 (Poster paper).