- Efficient
Constraint Monitoring Using Adaptive Thresholds (with
Jai Ramamirtham, Rajeev Rastogi and Pushpraj Shukla). In Proceedings
of the the 24th IEEE International Conference on Data Engineering (ICDE) 2008.
Abstract: Detecting constraint violations in
large-scale distributed systems has recently attracted plenty of
attention from the research community due to its varied
applications (security, network monitoring, etc.). Communication
efficiency of these systems is a critical concern and determines
their practicality. In this paper, we introduce a new set of
methods called non-zero slack schemes to implement distributed
SUM queries efficiently. We show, both analytically and
empirically, that these methods can lead to a considerable
reduction in the amount of communication. We propose three
adaptive non-zero slack schemes that adapt to changing data
distributions; our best scheme is a lightweight reactive scheme
that probabilistically adjusts local constraints based on the
occurrence of certain events (using only a periodic probability
estimation). We conduct an extensive experimental study using
real-life and synthetic data sets, and show that our non-zero
slack schemes incur significantly less communication overhead compared to the state of the art zero slack scheme (over a 60% savings).
- Efficient
Gossip-Based Aggregate Computation (with
Supratim
Deb, K.V.M. Naidu, Rajeev Rastogi and Anand Srinivasan). In Proceedings
of
the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database
Systems (PODS) 2006.
Abstract: There has been a growing interest in
gossip-based protocols that employ randomized communication to ensure
robust information dissemination. In this paper we present a novel
gossip-based scheme using which all nodes in an n-node overlay network
can compute the common aggregates of MIN, MAX, SUM, AVERAGE and RANK of
their values
using O(n loglog n) messages in O(log n loglog n) rounds of communication.
To the best of ours knowledge, ours is the first result that shows how to
compute these aggregates with high probability using only O(n loglog n)
messages. In contrast, the best known gossip-based algorithm for computing
these aggregates requires O(n log n) messages and O(log n) rounds. Thus,
our algorithm allows system designers to trade off a small increase in
round complexity with a significant reduction in message complexity. This
can lead to dramatically lower network congestion and longer node
lifetimes in wireless and sensor networks, where channel bandwidth and
battery life are severly constrained.
- Fast
Reconfiguartion of Data Placement in Parallel Disks (with
Samir Khuller, Yung-Chun (Justin) Wan and Leana
Golubchik). In Proceedings of the 8th ALENEX 2006.
Abstract: The “How much information?” study produced by the school of information management and systems at the University of California at Berkeley [10], estimates that about 5 exabytes of new information was produced in 2002. It estimates that the amount of stored information doubled in the period between 1999 and 2002. It is believed that more data will be created in the next five years than in the history of the world. Clearly we live in an era of data explosion. This data explosion necessitates the use of large storage systems. Storage Area Networks (or SANs) are the leading [13] infrastructure for enterprise storage.
A SAN essentially allows multiple processors to access several storage devices. They typically access the storage medium as though it were one large shared repository. One of the crucial functions of such a storage system is that of deciding the placement of data within the system. This data placement is dependent on demand pattern for the data. For instance, if a particular data item is very popular the storage system might want to host it on a disk with high bandwidth or make multiple copies of the item. The storage system needs to be capable of handling flash crowds [8]. During events triggered by such flash crowds, the demand distribution becomes highly skewed and different from the normal demand distribution.
In this paper we consider a new approach to dealing with the problem of changes in the demand pattern. We ask the following question: In a given number of migration rounds, can we obtain a layout by making changes to the existing layout, so that the resulting layout will be the best possible layout that we can obtain within the specified number of rounds? Of course, such a layout is interesting only if it is significantly better than the existing layout for the new demand pattern. We approach the problem of finding a good layout that can be reached in a specified number of rounds by trying to find a sequence of layouts. Each layout in the sequence can be transformed to the next layout in the sequence by applying a small set of changes to the current layout. These changes are computed so that they can be applied within one round of migration (a disk may be involved in at most one transfer per round).
- Similarity Searching in P2P databases (with
Indrajit Bhattacharya and Srinivasan Parthasarathy). In
Proceedings of the 25th IEEE ICDCS 2005.
Abstract: We consider the problem of handling similarity
queries in peer-to-peer databases. We propose an indexing and
searching mechanism which, given a query object, returns the set
of objects in the database that are semantically related to the
query. We propose an indexing scheme which clusters data such
that semantically related objects are partitioned into a small
set of clusters, allowing for a simple and efficient similarity
search strategy. Our indexing scheme also decouples object and
node locations. Our adaptive replication and randomized lookup
schemes exploit this feature and ensure that the number of
copies of an object is proportional to its popularity and all
replicas are equally likely to serve a given query, thus
achieving perfect load balancing. The techniques developed in
this work are oblivious to the underlying DHT topology and can
be implemented on a variety of structured overlays such as
CAN,CHORD, Pastry, and Tapestry. We also present DHT independent
analytical guarantees for the performance of our algorithms in
terms of search accuracy, cost, and loadbalance; the
experimental results from our simulations confirm the insights derived from these analytical models.
- Node
Ranking in Labeled Directed Graphs (with
Krishna Chitrapura). In Proceedings of the 13th ACM CIKM 2004.
Abstract: Our work is motivated by the problem of ranking hyperlinked
documents for a given query. Given an arbitrary directed graph
with edge and node labels, we present a new flow-based model and
an efficient method to dynamically rank the nodes of this graph
with respect to any of the original labels. Ranking documents
for a given query in a hyperlinked document set and ranking of
authors/articles for a given topic in a citation database are
some typical applications of our method. We outline the
structural conditions that the graph must satisfy for our
ranking to be different from the traditional PageRank.
We have built a system using two indices that is capable of dynamically
ranking documents for any given query. We validate our system
and method using experiments on a few datasets: a crawl of the
IBM Intranet (12 million pages), a crawl of the www (30 million
pages) and the DBLP citation dataset. We compare our method to
existing schemes for topic-biased ranking that require a
classifier and the traditional PageRank. In these experiments,
we demonstrate that our method is well suited for fine-grained
ranking and that our method performs better than the existing
schemes. We also demonstrate that our system can obtain an
improved ranking with very little impact on query time.
- Approximation algorithms for non-uniform size data
placement on parallel disks. (with Samir Khuller). FSTTCS
2003. Journal of Algorithms 2006.
Abstract: Consider a set of disks that are part of a storage
system. Each disk has a storage capacity and a load
capacity. The storage capacity limits the total
size of items that can be stored on the disk. The load capacity
limits the total amount of demand that can be routed to the
disk. We are also given a set of items - each item has a size
and a demand. The data placement problem is that of finding a
placement of these items on the disks and an assignment of
demand for each of the items stored on the disk such that the
total (over all the disks) assigned demand is maximized.
Previous work on this problem has assumed that all data items have unit size. We present the
first successful attempt at removing the assumption that all
items have same size. We develop a polynomial time approximation
scheme (PTAS) for case where the maximum item size is a
constant. For the case where we have two distinct item sizes, we
develop an algorithm with a tight performance
guarantee. Specifically, no algorithm can guarantee (for all
distributions) to pack a larger fraction of items than our algorithm.
|