CoClust: A Python Package for Co-Clustering

François Role, Stanislas Morbieu, Mohamed Nadif
2019 Journal of Statistical Software  
Co-clustering (also known as biclustering) is an important extension of cluster analysis, since it allows objects and features in a matrix to be grouped simultaneously, resulting in row and column clusters that are both more accurate and easier to interpret. This paper presents the theory underlying several effective diagonal and non-diagonal co-clustering algorithms, and describes CoClust, a package which provides implementations of these algorithms. The quality of the results produced by the implemented algorithms is demonstrated through extensive tests performed on datasets of various sizes and balance. CoClust has been designed to complement and easily interface with popular Python machine learning libraries such as scikit-learn.

Charrad, Lechevallier, Ahmed, and Saporta 2009; George and Merugu 2005; Deodhar and Ghosh 2010) and text mining (Dhillon 2001; Dhillon, Mallela, and Modha 2003), and various co-clustering algorithms have been proposed over the years (recent surveys can be found in Freitas, Ayadi, Elloumi, Oliveira, and Hao 2012; Eren, Deveci, Küçüktunç, and Çatalyürek 2013; Henriques, Antunes, and Madeira 2015). While many implementations of co-clustering (biclustering) algorithms have been developed for gene expression data, such as biclust (Kaiser and Leisch 2008), BicAT (Barkow, Bleuler, Prelić, Zimmermann, and Zitzler 2006) and bibench (Eren et al. 2013), few are available for co-clustering co-occurrence matrices such as the document-term matrices used in text mining applications. The CoClust package presented in this paper therefore provides implementations of algorithms designed to efficiently handle such matrices. Depending on the method used, algorithms for co-clustering co-occurrence matrices can broadly be divided into several categories:

Spectral methods: Spectral co-clustering methods treat the input data matrix as a bipartite graph between documents and words, and approximate the normalized cut of this graph using a real relaxation. scikit-learn currently supports two spectral co-clustering algorithms: (1) the well-known "spectral co-clustering" (Dhillon 2001) and (2) the "spectral biclustering" (Kluger, Basri, Chang, and Gerstein 2003), which is also available in the biclust R package.
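As a rough illustration of the spectral approach, the sketch below runs scikit-learn's SpectralCoclustering estimator on a tiny document-term count matrix. The matrix is made-up toy data with two planted blocks (it does not come from the paper), and the choice of two clusters is an assumption made for the example.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy document-term count matrix (illustrative data, not from the paper):
# documents 0-2 mostly use terms 0-2, documents 3-4 mostly use terms 3-5.
X = np.array([
    [5, 4, 6, 0, 0, 1],
    [4, 6, 5, 1, 0, 0],
    [6, 5, 4, 0, 1, 0],
    [0, 1, 0, 5, 6, 4],
    [1, 0, 0, 6, 4, 5],
])

# Fit the spectral co-clustering estimator with two co-clusters.
# random_state fixes the randomized SVD and the k-means initialization.
model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)

row_labels = model.row_labels_      # one cluster label per document
col_labels = model.column_labels_   # one cluster label per term

print("document clusters:", row_labels)
print("term clusters:", col_labels)
```

Which block is labelled 0 and which is labelled 1 is arbitrary, so only the grouping of rows and columns should be interpreted, not the label values themselves.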
Model-based methods: With respect to probabilistic co-clustering methods, two model-based co-clustering methods are implemented in the blockcluster (Singh Bhatia, Iovleff, and Govaert 2017) and blockmodels (Leger 2016) R packages. The first relies on latent block models (LBM), especially Gaussian, Bernoulli and Poisson LBMs. The derived algorithms are of the expectation-maximization type; for details see, for instance, Govaert and Nadif (2003, 2005, 2006) and Nadif and Govaert (2010). The second relies on the stochastic block model and the latent block model, with or without covariates. Both models have been extended to valued networks with optional covariates on the edges.

Matrix factorization based methods: Matrix factorization based methods are also used in the clustering and co-clustering fields. However, while packages exist for document clustering based on non-negative matrix factorization (e.g., the NMF R package, Gaujoux and Seoighe 2010, which includes different NMF methods) leading to clustering (see for instance Ding, Li, Peng, and Park 2006; Ding and Li 2007), there is unfortunately no package implementing non-negative matrix tri-factorization for co-clustering.

Information-theoretic based methods: Information-theoretic based methods are used to co-cluster two-way contingency tables. In this approach, a joint probability distribution is first derived from the two-way contingency matrix. The loss function to minimize is then the loss in mutual information between this joint probability distribution and a distribution defined on a reduced contingency table obtained by collapsing the rows and the columns according to the partitions yielded by the co-clustering program. Notable algorithms in this area include those in Dhillon et al. (2003) and Govaert and Nadif (2013).
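The information-theoretic criterion described above can be sketched in a few lines of plain Python. This is only an illustration of the loss, not the optimization procedure of any cited algorithm: the helper names (mutual_information, collapse, mutual_information_loss) and the toy table are hypothetical, and the loss is taken to be the mutual information of the original table minus that of the collapsed table.

```python
from math import log

def mutual_information(table):
    """Mutual information (in nats) of the joint distribution obtained
    by normalizing a two-way contingency table."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # empty cells contribute nothing
                # p_ij * log(p_ij / (p_i. * p_.j)), written with raw counts
                mi += (n_ij / total) * log(n_ij * total / (row_sums[i] * col_sums[j]))
    return mi

def collapse(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the cells of `table` within each (row cluster, column cluster) pair."""
    reduced = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            reduced[row_labels[i]][col_labels[j]] += n_ij
    return reduced

def mutual_information_loss(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Mutual information lost when the table is collapsed by the given partitions."""
    reduced = collapse(table, row_labels, col_labels, n_row_clusters, n_col_clusters)
    return mutual_information(table) - mutual_information(reduced)

# Toy contingency table with two clean blocks (illustrative data only).
table = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 3, 3],
]

# Partitions that follow the blocks versus partitions that cut across them.
loss_good = mutual_information_loss(table, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
loss_bad = mutual_information_loss(table, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)
print(loss_good, loss_bad)
```

On this toy table, the partitions aligned with the blocks lose essentially no mutual information, while the partitions that mix the blocks lose all of it. A co-clustering program in this family searches for row and column partitions that make this loss as small as possible.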
Modularity-based methods: The use of bipartite graph modularity as a criterion to co-cluster matrices was pioneered by Labiod and Nadif (2011) and has since been further investigated in Ailem, Nadif (2015, 2016). This method allows one to co-cluster binary
doi:10.18637/jss.v088.i07