-
The Dropper Effect: Insights into Malware Distribution with Downloader Graph Analytics
BumJun Kwon, Jayanta Mondal, Jiyong Jang, Leyla Bilge, Tudor Dumitras;
To appear in ACM SIGSAC (CCS), 2015
[pdf][abstract][blog]
Malware remains an important security threat, as miscreants continue to deliver a variety of malicious programs to hosts around the world. At the heart of all the malware delivery techniques are executable files (known as downloader trojans or droppers) that download other malware. Because the act of downloading software components from the Internet is not inherently malicious, benign and malicious downloaders are difficult to distinguish based only on their content and behavior. In this paper, we introduce the downloader-graph abstraction, which captures the download activity on end hosts, and we explore the growth patterns of benign and malicious graphs. Downloader graphs have the potential of exposing large parts of the malware download activity, which may otherwise remain undetected. By combining telemetry from anti-virus and intrusion-prevention systems, we reconstruct and analyze 19 million graphs from 5 million real hosts. We identify several strong indicators of malicious activity, such as the growth rate, the diameter and the Internet access patterns of downloader graphs. Building on these insights, we implement and evaluate a machine learning system for malware detection. Our system achieves a 96.0% true-positive rate, with a 1.0% false-positive rate, and is able to detect malware an average of 9.24 days earlier than leading anti-virus products. We also perform an external validation of our system by examining some of the unlabeled samples predicted to be malicious and we find that 41.41% are blocked by anti-virus products.
-
EAGr: Supporting Continuous Ego-centric Aggregate Queries over Large Dynamic Graphs;
Jayanta Mondal, Amol Deshpande;
Appeared in SIGMOD 2014
[pdf][abstract][arXiv]
In this paper, we present EAGr, a system for supporting large numbers of continuous neighborhood-based ("ego-centric")
aggregate queries over large, highly dynamic, rapidly evolving graphs. Examples of such queries include computation of personalized, tailored trends in social networks, anomaly or event detection in communication or financial transaction networks, local search and alerts in spatio-temporal networks, to name a few. Key challenges in supporting such continuous queries include very high update rates typically seen in these situations, large numbers of queries that need to be executed simultaneously, stringent low latency requirements. In this paper, we propose a flexible, general, extensible in-memory framework for executing different types of ego-centric aggregate queries over large dynamic graphs with low latencies. Our framework is built around the notion of an aggregation overlay graph, a pre-compiled data structure that encodes the computations to be performed when an update or a query is received. The overlay graph enables sharing of partial aggregates across different ego centric queries (corresponding to different nodes in the graph), also allows partial pre-computation of the aggregates to minimize the query latencies. We present several highly scalable techniques for constructing an overlay graph given an aggregation function, also design incremental algorithms for handling changes to the structure of the underlying graph itself, that may result in significant changes to the neighborhoods on which queries are posed. We also present an optimal, polynomial-time algorithm for making the pre-computation decisions given an overlay graph. Although our approach is naturally parallelizable, we focus on a single-machine deployment in this paper and show that our techniques can easily handle graphs of size up to 320 million nodes and edges, achieve update and query throughputs of over 500,000/s using a single, powerful machine.
-
Stream Querying and Reasoning on Social Data;
Jayanta Mondal, Amol Deshpande;
Appeared in Encyclopedia of Social Network Analysis and Mining, Springer, 2014
[abstract]
In this paper, we present an introduction to the new research area of "stream querying and reasoning" over social data. This area combines aspects from several well-studied research areas, chief among them, social network analysis, graph databases, and data streams. We provide a formal definition of the problem, survey the related prior work, and discuss some of the key research challenges that need to be addressed (and some of the solutions that have been proposed). We note that we use the term "stream reasoning" in this paper to encompass a broad range of tasks including various types of analytics, probabilistic reasoning, statistical inference, and logical reasoning. We contrast our use of this term with the recent work where this term has been used more specifically to refer to integration of logical reasoning systems with data streams in the context of the Semantic Web. Given the vast amount of work on this and related topics, it is not our intention to be comprehensive in this brief overview. Rather we aim to cover some of the key ideas and representative work.
-
Managing Large Dynamic Graphs Efficiently;
Jayanta Mondal, Amol Deshpande;
Appeared
in SIGMOD 2012.
[pdf] [abstract]
There is an increasing need to ingest, manage, and query large volumes of
graph-structured data arising in applications like social networks,
communication networks, biological networks etc. Graph databases that can
explicitly reason about the graphical nature of the data, that can support
flexible schemas and node/edge-centric analysis and querying, are ideal for
storing such data. However, although there is much work on single-site graph
databases and on efficiently executing specific types of queries over large
graphs, to date there is little work on understanding the challenges in
distributed dynamic graph data management, needed to handle the large scale of
such data. In this paper, we propose the design of an in-memory, distributed
graph data management system aimed at managing a large dynamically changing
graph and supporting low-latency query processing over it. The key challenge in
a distributed graph database is that, partitioning a graph across a set of
machines inherently results in a large number of distributed traversals across
partitions to answer even simple queries. We propose a suite of novel techniques
to minimize the communication bandwidth and the storage requirements. We have
implemented our framework as a middle-ware on top of an open-source key-value
store. We evaluate our system on a social graph, and show that our system is
able to handle very large graphs efficiently, and that it reduces the network
bandwidth consumption significantly
|