June 1, 2022

by Rishikanth

Literature Survey on Multi-Label Classification

Extreme Multi-Label Text Classification (XMTC) is the problem of assigning a document the most relevant set of labels from an extremely large label space (several hundred thounsands to millions). XMTC posses certain challenges such as data sparsity and scalability. Multi-label classification differs from multi-class classification in that, the assumption of mutual exclusivity among lables doesn’t hold in the former setting.

Related work

Prior works in XMTC focus on modelling the correlation between labels and can be broadly classified into 3 approaches: target embeddings, tree based ensemble and deep learning methods.

In this article I’ll summarize some of the relevant and representative approaches under each category with more emphasis on deep learning methods which are superior in performance compared to the other 2 categories.

Target Embeddings

This class of works address the data sparsity issue by finding a set of low-dimensional embedding of the label vectors from the original target space.

Let the training data be represented as:

$$ \{(x_i, y_i)\}^n_{i = 1} : x_i \in \mathbb{R}^D, y_i \in \{0, 1\}^L $$

\(L\) can be very large making the process of finding a mapping from \(x_i \to y_i\) very challenging. To combat this, these methods comppress the label space from \(L\) dimensions to \(\hat{L}\) dimensions where \(\hat{L} \lt\lt L\). Standard classifiers such as SVMs etc are then used to find a mapping between \(x \to \hat{L}\). For inference, the predicted \(\hat{L}\) is decompressed or projected back to the original \(L\) diemnsional target space.

SLEEC

SLEEC is the most representative of target embedding methods. SLEEC learns a \(\hat{L}\) dimensional embedding \(z_i \in \mathbb{R}\) that captures non-linear correlations by preserving the pairwise distances between the closest label vectors. Formally, \(d(z_i, z_j) \approx d(y_i, y_j) \) only if \(i\) is among \(j\)’s \(k\) nearest neighbours under some distance metric \(d(.,.)\). Regressors \(V \in \mathbb{R} ^{\hat{L} \times D} : V x_i \approx z_i \) are learnt to map \(x_i \to z_i\).

SLEEC groups data into multiple clusters and learns an embedding for each cluster. Daring inference a document \(x^* \in \mathbb{R}^D\) is projected to the embedding space using which kNN search is performed within the cluster that \(x^* \) belongs to. To improve stability an ensemble of SLEEC models are used.

Tree Ensemble Methods

This class of methods work similar to decision trees in that, they recursively partition the data space at each non-leaf node using a classifier at each node. Each node focuses on a sub-set of active labels at that node. Unlike the classical decision tree which partitions the space at each node using entropy on a single feature, for XMTC, prior works learn a hyperplane at each node which partitions the space using a weighted combination of all features. Tree based methods typically learn an ensemble of trees, each of which is induced using randomized subset of features at each node.

FastXML

Fast XML is considered the top-benchmark fore tree-based ensemble methods for XMTC. A parameterized hyperplane \(w \in \mathbb{R}\) is learnt at each node s.t. the data is partitioned into a sub-sets and also jointly learn the label ranks within each sub-set. The documents in each sub-set share similar label distributions which are represented as the set-specific label rankings learnt by optimizing the NDCG score of the label rankings.

During inference the document of interest is processed by each tree and the label distributions from all leaf nodes reached are aggregated to output the labels. In practice an ensemble of trees are used. The estimated prediction cost is given by \(\mathcal{O}(TD\log{L})\). where \(T\) is the no. of trees induced and \(L\) is the dimensionality of the labels.