Traditional machine learning classification algorithms aim to label entities on the basis of their attribute values. Many real-world datasets, however, contain interlinked entities and exhibit correlations among labels of the interlinked entities. Link-based classification aims to improve classification accuracy by exploiting such correlations in the link structure besides utilizing the attribute values of each entity.
Our research in this area explores various aspects of this problem. We have proposed the use of simple aggregation functions such as sum, mode etc. that summarize the distribution of class labels in the Markov blanket of each entity and learn classifiers using these features and features based on attribute values to improve the classification accuracy. In addition, we have devised simple yet effective iterative approximate inference techniques that, in many cases, perform as well as more sophisticated approximate inference methods such as loopy belief propagation and relaxation labeling. We have also explored the utility of learning with unlabeled entities. We have tested the efficacy of the proposed techniques on various real-world datasets such scientific publication datasets and social network datasets.
In addition to link-based classification where each misclassification is considered to be equally costly, we have also developed techniques to handle unequal misclassification costs. Consider the example of a bank loan officer. The loan officer's task is to classify each application into one of 'grant' or 'deny'. If the officer grants a loan that should have been denied then the bank stands to lose the principal amount of the loan in question (assuming the applicant does not pay any of the loan back). On the other hand, if the officer denies a loan that should have been granted, not only does the bank lose the interest that could have been earned on the loan but also the amount that the applicant might have in her/his account with the bank (assuming the disgruntled applicant decides to take her/his business elsewhere). Such unequal misclassification costs occur in many real-world scenarios. Now consider the case when the bank allows joint accounts. Consider joint account J whose account holders A1 and A2 apply for two separate loans. The amount in J will be lost if either of A1's or A2's application gets denied. This is an example of a misclassification cost that is related to groups of related misclassifications that may arise in scenarios where entities to be labeled are inter-connected via relations. As part of our research on link-based classifiers, we extended classifiers based on Markov networks to handle such varied misclassification costs and demonstrated their efficacy on synthetic and real-world datasets.