Lise's INQuisitive Students
Machine Learning Research Group @ UMD

  Home | People | Projects | Publications | Contact

Link-based Classification


Our Approaches

Traditional machine learning classification algorithms aim to label entities on the basis of their attribute values. Many real-world datasets, however, contain interlinked entities and exhibit correlations among labels of the interlinked entities. Link-based classification aims to improve classification accuracy by exploiting such correlations in the link structure besides utilizing the attribute values of each entity.

Our research in this area explores various aspects of this problem. We have proposed the use of simple aggregation functions such as sum, mode etc. that summarize the distribution of class labels in the Markov blanket of each entity and learn classifiers using these features and features based on attribute values to improve the classification accuracy. In addition, we have devised simple yet effective iterative approximate inference techniques that, in many cases, perform as well as more sophisticated approximate inference methods such as loopy belief propagation and relaxation labeling. We have also explored the utility of learning with unlabeled entities. We have tested the efficacy of the proposed techniques on various real-world datasets such scientific publication datasets and social network datasets.

In addition to link-based classification where each misclassification is considered to be equally costly, we have also developed techniques to handle unequal misclassification costs. Consider the example of a bank loan officer. The loan officer's task is to classify each application into one of 'grant' or 'deny'. If the officer grants a loan that should have been denied then the bank stands to lose the principal amount of the loan in question (assuming the applicant does not pay any of the loan back). On the other hand, if the officer denies a loan that should have been granted, not only does the bank lose the interest that could have been earned on the loan but also the amount that the applicant might have in her/his account with the bank (assuming the disgruntled applicant decides to take her/his business elsewhere). Such unequal misclassification costs occur in many real-world scenarios. Now consider the case when the bank allows joint accounts. Consider joint account J whose account holders A1 and A2 apply for two separate loans. The amount in J will be lost if either of A1's or A2's application gets denied. This is an example of a misclassification cost that is related to groups of related misclassifications that may arise in scenarios where entities to be labeled are inter-connected via relations. As part of our research on link-based classifiers, we extended classifiers based on Markov networks to handle such varied misclassification costs and demonstrated their efficacy on synthetic and real-world datasets.


Publications


Datasets

  • Document Classification Datasets:

    • CiteSeer: The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words. The README file in the dataset provides more details. Click here to download the tarball containing the dataset.
    • Cora: The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details. Click here to download the tarball containing the dataset.
    • WebKB: The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1703 unique words. The README file in the dataset provides more details. Click here to download the tarball containing the dataset.
  • Social Network Datasets:

    • Terrorists: This dataset contains information about terrorists and their relationships. Unlike the previous datasets, this dataset was designed for classification experiments aimed at classifying the relationships among terrorists. The dataset contains 851 relationships, each described by a 0/1-valued vector of attributes where each entry indicates the absence/presence of a feature. There are a total of 1224 distinct features. Each relationship can be assigned one or more labels out of a maximum of four labels making this dataset suitable for multi-label classification tasks. The README file provides more details. Click here to download the tarball containing the dataset.
    • Terrorist Attacks: This dataset consists of 1293 terrorist attacks each assigned one of 6 labels indicating the type of the attack. Each attack is described by a 0/1-valued vector of attributes whose entries indicate the absence/presence of a feature. There are a total of 106 distinct features. The files in the dataset can be used to create two distinct graphs. The README file in the dataset provides more details. Click here to download the tarball containing the dataset.