Invited Speakers

Stephen E. Fienberg, CMU
Aristides Gionis, Yahoo! Research
Thomas Gärtner, University of Bonn and Fraunhofer IAIS
Jennifer Neville, Purdue University
Padhraic Smyth, UC Irvine
Chris Volinsky, AT&T Labs
Eric Xing, CMU

Presentation Details

Graphs for Machine Learning: Useful Metaphor or Statistical Reality
Stephen E. Fienberg, Carnegie Mellon
Slides: pdf (1.7mb)
Graphs play an important role in two dual representations of statistical models---one where the nodes correspond to units and the edges to variables that relate them to one another, and the other where the nodes are variables and the edges represent relationships among them. In the former, units are inherently dependent and relationships may or may not be, whereas in the latter the units are inherently independent and the focus is on independence relationships among the variables. In this talk I describe both types of representations and how to think about them in the context of large-scale data examples, especially those involving discrete relationships or variables.

Stephen E. Fienberg is Maurice Falk University Professor of Statistics and Social Science at Carnegie Mellon University, with appointments in the Department of Statistics, the Machine Learning Department, Cylab, and i-lab. He has served as Dean of the College of Humanities and Social Sciences at Carnegie Mellon and as Vice President for Academic Affairs at York University, in Toronto, Canada, as well as on the faculties of the University of Chicago and the University of Minnesota. He was founding co-editor of Chance and served as the Coordinating and Applications Editor of the Journal of the American Statistical Association. He is currently one of the founding editors of the Annals of Applied Statistics and is co-founder of the new online Journal of Privacy and Confidentiality, based in Cylab. He has been Vice President of the American Statistical Association and President of the Institute of Mathematical Statistics and the International Society for Bayesian Analysis. His research includes the development of statistical methods, especially tools for categorical data analysis, from both likelihood and Bayesian perspectives. Fienberg is the author or editor of over 20 books and 400 papers and related publications. His 1975 book on categorical data analysis with Bishop and Holland, Discrete Multivariate Analysis: Theory and Practice, and his 1980 book The Analysis of Cross-Classified Categorical Data are both Citation Classics and were recently reprinted by Springer. He is a member of the U. S. National Academy of Sciences, and a fellow of the Royal Society of Canada, the American Academy of Arts and Sciences, and the American Academy of Political and Social Science.

Efficient tools for mining large graphs: Indexing, sampling, counting, and predicting
Aristides Gionis, Yahoo! Research
Slides: pdf (2.9mb)
Graphs provide a general framework for modeling entities and their relationships, and they are routinely used to describe a wide variety of data such as the Internet, the Web, social networks, biological data, citation networks, and more. To deal with large graphs one needs not only to understand which graph features to mine for the application at hand, but also to develop efficient tools that cope with graphs having millions of nodes. In this talk we will review some recent work in this area. We will discuss algorithms for indexing distances in graphs, sampling and counting patterns, finding frequent patterns of evolution, and classifying nodes on a graph. We motivate the problems we address with real applications.

Aristides Gionis is a senior research scientist in Yahoo! Research, Barcelona. He received his Ph.D. from the Computer Science department of Stanford University in 2003, and between 2003 and 2006 he was a senior researcher at the Basic Research Unit of Helsinki Institute of Information Technology, Finland. His research interests include algorithms for data analysis and applications in the Web domain.

Kernel Methods for Structured Inputs and Outputs
Thomas Gärtner, University of Bonn and Fraunhofer IAIS
Slides: pdf (623k)
In this talk I will introduce the principles of kernel methods and show how this popular class of learning algorithms can be extended to handle structured inputs and outputs. I will concentrate on highlighting conceptual differences and similarities rather than their technical details.

Thomas Gärtner is the head of an Emmy-Noether research group at the University of Bonn and lead scientist for machine learning at the Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS. He holds a PhD from the University of Bonn, an MSc from the University of Bristol, and a Diplom from the University of Cooperative Education in Mannheim. During his career he was employed by the University of Bonn, Fraunhofer IAIS, the University of Bristol, GMD IPSI, and Alcatel SEL. His work on kernels for structured data and structured output prediction is highly cited and earned him several awards. He serves as an action editor for the Machine Learning Journal; has given tutorials as well as invited talks at premier venues such as ICML; has served as a program committee member for many major conferences on Machine Learning, and as an area chair for ECML/PKDD. This year he was a member of the senior program committee of AAAI and an area chair of ICML.

Evaluation Strategies for Network Classification
Jennifer Neville, Purdue University
Slides: pdf (1.6mb)
A central methodological question in machine learning research is how to accurately compare two learning algorithms and assess whether the observed performance difference is significant. We investigate this issue in the context of collective classification in networks, where there are dependencies among both the labeled (training) and unlabeled (test) instances. These dependencies can complicate the direct application of conventional statistical tests, which assume independent samples. Empirical exploration of potential sources of bias due to network dependencies shows, surprisingly, that a commonly used form of evaluation can result in unacceptably high levels of Type I error. In other words, as much as 50% of the time an observed difference between algorithms may be incorrectly determined to be significant when it is not. We propose two solutions to this bias---the first is a network cross-validation sampling method and the second is an analytical correction to conventional t-tests. We evaluate the corrections on both synthetic and real-world data, with simulated and real classifiers, showing that the tests successfully adjust for the bias while maintaining reasonable levels of statistical power.

Jennifer Neville is an assistant professor at Purdue University with a joint appointment in the Departments of Computer Science and Statistics. She received her PhD from the University of Massachusetts Amherst in 2006. She received a DARPA IPTO Young Investigator Award in 2003 and was selected as a member of the DARPA Computer Science Study Group in 2007. In 2008, she was chosen by IEEE as one of "AI's 10 to watch." Her research focuses on developing data mining and machine learning techniques for relational domains, including citation analysis, fraud detection, and social network analysis.

Network Event Data over Time: Prediction and Latent Variable Modeling
Padhraic Smyth, UC Irvine
Slides: pdf (2.1mb)
In this talk I discuss the problem of modeling and prediction of relational event data in the form of time-stamped events involving a set of actors. This type of data is increasingly common in a number of different application contexts, such as email and blogging. The talk will begin by motivating the problem of modeling such data, discussing for example the difference between discrete-time aggregated network representations and continuous-time event-based representations. We will review some of the basic strategies in building statistical models for such data, starting with models for static (non-temporal) data and moving to temporal models. In particular we will focus on latent-variable models which are emerging as a broadly applicable and flexible framework for network modeling. Recent ideas in this area will be discussed as well as new ongoing work. We will also emphasize the importance of predictive evaluation in network modeling and discuss a number of issues that arise in this context. Experimental results will be presented comparing different modeling approaches using a variety of real-world event-based network data sets. The talk will conclude with some speculative comments on future research directions.

Joint work with Arthur Asuncion, Chris DuBois, and Jimmy Foulds.

Padhraic Smyth is a Professor in the Department of Computer Science and also serves as Director of the Center for Machine Learning and Intelligent Systems, both at the University of California, Irvine. He also has joint appointments in the Statistics and Biomedical Engineering Departments at UC Irvine. His research interests include machine learning, data mining, pattern recognition, and applied statistics. He was a recipient of best paper awards at the 2002 and 1997 ACM SIGKDD Conferences, received the NSF CAREER award in 1999, the ACM SIGKDD Innovation Award in 2009, and is a AAAI Fellow. He is co-author of Modeling the Internet and the Web: Probabilistic Methods and Algorithms (with Pierre Baldi and Paolo Frasconi in 2003), and was also co-author of Principles of Data Mining, MIT Press, August 2001, with David Hand and Heikki Mannila. He received a first class honors degree in Electronic Engineering from University College Galway (National University of Ireland) in 1984, and the MSEE and PhD degrees from the Electrical Engineering Department at the California Institute of Technology in 1985 and 1988 respectively. From 1988 to 1996 he was a Technical Group Leader at the Jet Propulsion Laboratory, Pasadena, and has been on the faculty at UC Irvine since 1996. In addition to his academic research he is also active in industry consulting, working with companies such as Netflix (on the Netflix Prize), eBay, Oracle, Yahoo!, Nokia, and AT&T.

Mining Massive Graphs for Telecommunication Applications
Chris Volinsky, AT&T Labs
Slides: pdf (1.3mb)
Telecommunications data is all about networks - packet delivery networks, cell tower networks, fiber optic networks. But perhaps the most interesting network is the virtual one created by billions of telephony transactions every day. This callgraph network represents hundreds of millions of devices and the billions of connections between them. How do we make sense of such a massive graph? How do we find communities, or look for influential members? In this talk I will present various applications of callgraphs at AT&T, from fraud detection to customer loyalty to targeted marketing. I will cover our ego-centric representation of the graph (Communities of Interest) and discuss how it helps us to analyze the graph at speed and scale.

Chris Volinsky is Executive Director of the Statistics Research Department at AT&T Labs-Research in Florham Park, N.J. Chris received his PhD from the University of Washington in 1997, studying Bayesian Model Averaging. He joined AT&T in 1997 and became Director of the Statistics Research Department in 2004. His research at AT&T focuses on large-scale data mining: recommendation systems, social networks, statistical computation, and anomaly detection. In 2009, Chris was a member of the 7-person, 4-country team BellKor's Pragmatic Chaos that won the $1M Netflix Prize, an open competition for improving Netflix's online recommendation system.

Dynamic Network Analysis: Model, Algorithm, Theory, and Application
Eric Xing, CMU
Slides: pdf (29mb)
Across the sciences, a fundamental setting for representing and interpreting information about entities, the structure and organization of communities, and changes in these over time, is a stochastic network that is topologically rewiring and semantically evolving over time, or over a genealogy. While there is a rich literature in modeling invariant networks, until recently, little has been done toward modeling the dynamic processes underlying rewiring networks, and on recovering such networks when they are not observable.

In this talk, I will present two recent developments in analyzing what we refer to as the dynamic tomography of evolving networks. I will first present new sparse-coding algorithms for estimating the topological structures of latent evolving networks underlying nonstationary time-series or tree-series of nodal attributes, along with theoretical results on the asymptotic sparsistency of the proposed methods; then, I will present a new Bayesian model for estimating and visualizing the trajectories of latent multi-functionality of nodal states in the evolving networks.

I will show some promising empirical results on recovering and analyzing the latent evolving social networks in the US Senate and the Enron corporation, and the evolving gene network of the fruit fly during aging, at a time resolution limited only by sampling frequency. In all cases, our methods reveal interesting dynamic patterns in the networks.

Dr. Eric Xing is an associate professor in the School of Computer Science at Carnegie Mellon University. His principal research interests lie in the development of machine learning and statistical methodology, especially for solving problems involving automated learning, reasoning, and decision-making in high-dimensional and dynamic possible worlds, and for building quantitative models and predictive understandings of biological systems. Professor Xing received a Ph.D. in Molecular Biology from Rutgers University, and another Ph.D. in Computer Science from UC Berkeley. His current work involves 1) foundations of statistical learning, including theory and algorithms for estimating time/space varying-coefficient models, sparse structured input/output models, and nonparametric Bayesian models; 2) computational and statistical analysis of gene regulation, genetic variation, and disease associations; and 3) application of statistical learning in social networks, data mining, and vision. Professor Xing has published over 100 peer-reviewed papers; he is an action editor of the Machine Learning Journal and an associate editor of the Annals of Applied Statistics and the PLoS Journal of Computational Biology. He is a recipient of the NSF Career Award, the Alfred P. Sloan Research Fellowship in Computer Science, and the United States Air Force Young Investigator Award.
