ABSTRACT

Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a single-object type and link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly other semantic information). Examples of homogeneous networks include single mode social networks, such as people connected by friendship links, or the WWW, a collection of linked web pages. Examples of heterogeneous networks include those in medical domains describing patients, diseases, treatments and contacts, or in bibliographic domains describing publications, authors, and venues. Link mining refers to data mining techniques that explicitly consider these links when building predictive or descriptive models of the linked data. Commonly addressed link mining tasks include object ranking, group detection, collective classification, link prediction and subgraph discovery. While network analysis has been studied in depth in particular areas such as social network analysis, hypertext mining, and web analysis, only recently has there been a cross-fertilization of ideas among these different communities. This is an exciting, rapidly expanding area. In this article, we review some of the common emerging themes

1. INTRODUCTION

“Links,” or more generically relationships, among data instances are ubiquitous. These links often exhibit patterns that can indicate properties of the data instances such as the importance, rank, or category of the object. In some cases, not all links will be observed; therefore, we may be interested in predicting the existence of links between instances. In other domains, where the links are evolving over time, our goal may be to predict whether a link will exist in the future, given the previously observed links. By taking links into account, more complex patterns arise as well. This leads to other challenges focused on discovering substructures, such as communities, groups, or common subgraphs. Traditional data mining algorithms such as association rule mining, market basket analysis, and cluster analysis commonly attempt to find patterns in a dataset characterized by a collection of independent instances of a single relation. This is consistent with the classical statistical inference problem of trying to identify a model given a independent, identically distributed (IID) sample. One can think of this process as learning a model for the node attributes of a homogeneous graph while ignoring the links between the nodes.

A key emerging challenge for data mining is tackling the problem of mining richly structured, heterogeneous datasets. These kinds of datasets are best described as networks or graphs. The domains often consist of a variety of object types; the objects can be linked in a variety of ways. Thus, the graph may have different node and edge (or hyperedge) types. Naively applying traditional statistical inference procedures, which assume that instances are independent, can lead to inappropriate conclusions about the data [57]. Care must be taken that potential correlations due to links are handled appropriately. In fact, object linkage is knowledge that should be exploited. This information can be used to improve the predictive accuracy of the learned models: attributes of linked objects are often correlated, and links are more likely to exist between objects that have some commonality. In addition, the graph structure itself may be an important element to include in the model. Structural properties such as degree and connectivity can be important indicators.

Link mining is a newly emerging research area that is at the intersection of the work in link analysis [58; 40], hypertext and web mining [16], relational learning and inductive logic programming [38], and graph mining [23]. We use the term link mining to put a special emphasis on the links—moving them up to first-class citizens in the data analysis endeavor. In recent years, there have been several workshop series devoted to topics related to link mining. One of the earliest workshops was the 1998 AAAI Fall Symposium on AI and Link Analysis [58]. Other workshop series include the workshops on Statistical Relational Learning [48; 49; 28], MultiRelational Data Mining [65; 39; 36; 37], LinkKDD [35; 1; 3], Link Analysis, Counter-terrorism and Security [104; 26; 103], and Mining Graphs, Trees and Sequences [94; 66; 85]. The objective of this survey is to provide a perspective on research within the relevant communities that are addressing current link mining challenges. Link mining encompasses a wide range of tasks; therefore, our review will cover the core challenges addressed by a majority of ongoing research in the field. We begin by describing some of the challenges in data representation for link mining. Then we progress through eight link mining tasks that can be broadly categorized as tasks that focus on objects, links, or graphs (Table 1). Finally, we close with a discussion of areas that we believe have not yet received sufficient attention.