Consiglio Nazionale delle Ricerche - iit.cnr.it 09_2016.pdf · C Consiglio Nazionale delle Ricerche...

C

Consiglio Nazionale delle Ricerche

Arbitrary shape clustering via NMF factorization

P. Favati, G. Lotti, O. Menchi, F. Romani

IIT TR-09/2016

Technical Report

Giugno 2016

Iit

Istituto di Informatica e Telematica

Arbitrary shape clustering via NMF factorization

P. Favati G. Lotti O. Menchi F. Romani

Abstract

Organizing data into clusters is a key task for data mining problems.In this paper we address the problem of arbitrarily shaped clusteringof points belonging to a linear space. A model based on the Euclideandistance is assumed to define the similarity among the points. Analgorithm, based on the symmetric nonnegative matrix factorization(NMF) of the similarity matrix, is proposed. The main contribution ofour approach consists in the merging technique of the clusters whichexploits information already contained in the matrix obtained by NMF.Extensive experimentation shows that the proposed algorithm is effec-tive and robust also for noisy data.

1 Introduction

Clustering, i.e. the partitioning of the data into groups of similar objects,is a key step for many data mining problems. A model based on a distancemeasure is assumed to define the similarity among data objects, so that theclustering can be performed. Data of multidimensional numeric type canbe represented on the Euclidean space, allowing suitable distance functions.Also nonnumeric data can be treated, as long as an appropriate distanceamong objects can be defined. The measure must assign to pairs of closelyrelated objects a higher similarity than to objects which are only weaklyrelated. With this assumption, we can construct the graph of the distancesbetween pairs of objects and the clustering is performed taking into accountthe weighted adjacency matrix of the graph, the so called similarity matrix.The clusters on the nodes of the graph are then used to map back the clustersof the given data.

In this paper we assume data of multidimensional numeric type andconsider the Euclidean distance function on the objects, with the goal ofrecognize clusters of possibly non-convex shape. This is not easily achievedin an automatic way, because many clustering algorithms are biased towardfinding convex-shaped clusters, like for example representative-based algo-rithms (see [1], chap. 6) which detect clusters where the objects gatheraround selected representatives. Other algorithms perform first a dimen-sionality reduction by applying suitable matrix factorizations. In this waythe data are embedded into spaces where the clustering problem can be more

1

easily dealt with, but even in this case arbitrarily shaped clusters are noteasily found.

A commonly used technique for the dimensionality reduction exploitsthe Nonnegative Matrix Factorization (NMF), a procedure which finds anapproximate factorization of a nonnegative matrix M by means of two low-rank nonnegative matrices V and W in the form M ∼ V W T . For theclustering problem, NMF is applied to the similarity matrix A, which forthe Euclidean distance is n × n nonnegative symmetric. Hence the NMFrepresentation can be found symmetric in the form A ∼ WW T , where W isan n× k matrix, with k much smaller than n. Thanks to the nonnegativity,W provides the connection of objects to clusters. In fact, the elements of therows of W , normalized in such a way that the sums of the rows are equal toone, can be seen as the probabilities of the assignment of the objects to thedifferent clusters. This can be assumed as a soft solution of the clusteringproblem, which can be converted to a hard solution by connecting eachobject to the cluster for which it has the largest assignment probability.

When the matrix A is a kernel, symmetric NMF with the additional or-thogonality condition W TW = I can be seen equivalent to kernel k-meansclustering [12]. A theoretical foundation for using symmetric NMF for clus-tering can be found also in the close relationship to probabilistic model-basedclustering methods with a generative Gaussian distribution [18].

Largely applied procedures find arbitrarily shaped clusters using agglom-erative steps that merge elementary possibly convex subclusters of the data.This is the case of the hierarchical algorithms where the elementary startingsets reduce to single objects, but of course the sets to be merged can be ofany cardinality, possibly obtained by a representative-based method. Boththe number k of the elementary sets and the linkage rules applied in themerging steps are critical for the quality of the final results. For example, atoo small k can lead to incorrectly merged elementary clusters, which can-not be separated by the subsequent agglomerating steps. The linkage rulesdepend heavily on density estimations and on the presence of outliers points.Moreover, noisy points must be treated appropriately to avoid undesirablechaining effects.

In this paper we assume that the number kgoal of required, possibly non-convex, clusters is a priori given. We propose a method that deals with thepreviously discussed issues in an unitary way. It consists of two steps: firstwe detect a sufficiently large number k ≥ kgoal of elementary clusters (ingeneral convex) by applying NMF, then we suitably merge close elementaryclusters obtaining the kgoal arbitrarily shaped clusters. The novelty of ourapproach is that we perform the merging phase without resorting to geomet-rical issues such as distances and densities, but exploiting the informationalready coded in the matrix W obtained by NMF. Our algorithm resultsrobust also for noisy data.

A sketchy description of the proposed algorithm is given in Section 2

2

and the pseudo-code of the important functions is given in Section 3, wheredifferent aspects are analyzed. A special attention is paid to importantissues such as the construction of the similarity matrix, the implementationof the symmetric NMF, the validation of clusters and the merging rule.The numerical experimentation of Section 4, performed on both convex andnonconvex datasets, validates the algorithm.

2 The algorithm

Let xi, i = 1, . . . , n, be the given points in Rm. When dealing with datathat can be seen as vectors, it is natural to measure the distance betweenpoints using a norm. We use here the Euclidean norm d(x,y) = ∥x−y∥ forthe vectors and the compatible Frobenius norm for matrices.

Our clustering objective is to arrange the data in kgoal arbitrarily shapedclusters, for a given kgoal, by applying symmetric NMF to solve problems ofthe form

minW≥O

Ψ(W ), with Ψ(W ) = 12 ∥A−W W T ∥2, (1)

where A is a similarity nonnegative symmetric matrix and W is a nonnega-tive low rank matrix.

Let k0 ≥ kgoal be a starting value for a sequence of working dimensionsk = k0 2

j , j = 0, 1, . . . For each j we consider two sets of parameters: a set Σof scales for the construction of suitable similarity matrices A associated tothe given points, and a set Ω of seeds for the random generation of startingpoints of the iterative method used to compute the solution of problems (1).

The algorithm we propose computes a sequence Sj of storing structures,where each Sj contains all the clusterings C formed by exactly kgoal clusters,obtained through the solution of (1) with W ∈ Rn×k

+ . For any pair ofparameters p = (σ, ω) with σ ∈ Σ and ω ∈ Ω, the clustering C is obtainedby constructing the similarity matrix A corresponding to σ, solving (1) withan iterative method which starts with theW0 generated from ω and applyinga merging rule to the resulting W . For each j, a suitably stopping conditionis monitored which takes into consideration all the clusterings already storedin

∪r=0,...,j Sr.

3 Detailed description

We examine here in details the different issues of the algorithm.

3.1 The similarity matrix

The similarity matrix A is associated to an undirected graph, whose nodescorrespond to the given points xi, i = 1, . . . , n, and the edge weights are thesimilarity values between pairs of nodes.

3

Many different ways have been proposed according to the type of thedata. Our algorithm is based on edge weights expressed through a Gaussiankernel depending on a scale parameter σ > 0

ei,j = exp(− ∥xi − xj∥2

µσ

), (2)

where µ = maxi,j ∥xi−xj∥2. The similarity matrix A = Aσ is then obtainedby normalization

ai,j = d−1/2i ei,jd

−1/2j where di =

n∑r=1

ei,r, for i = 1, . . . , n. (3)

The parameter σ, which somehow measures when two points should be con-sidered similar, has a high impact on clustering and the choice of a suitablevalue may be critical. Larger values of σ tend to reduce the variability of Aand to merge the clusters, while smaller values of σ tend to separate morethe clusters. There is a strong correlation between the value of σ and thenumber of clusters that the algorithm can find. In our context we let σ todecrease when k increases.

3.2 The Symmetric Nonnegative Matrix Factorization

The optimization problem (1) aims not only at a dimensional reduction butalso at a soft clusterization of the points. Since the matrix A is nonneg-ative, we apply Nonnegative Matrix Factorization (NMF), which was firstproposed in [16] as a dimensional reduction algorithm in the context of dataanalysis and has proven to be well suited for clustering. Moreover, since Ais symmetric, we use a symmetric version of NMF.

By solving (1) we obtain a new representation of the columns of A in thebasis formed by the columns wj of W with nonnegative coefficients givenby the rows of W , i.e. the ith column ai of A is approximated by thelinear combination

∑kj=1wi,jwj . Hence the rows of W , if not numerically

zero, furnish a kind of soft clustering of the columns of A, where wj isconsidered the representative of the jth cluster. A hard clustering is obtainedby connecting the ith column to the jth cluster such that the correspondingcoefficient wi,j is the largest entry of the ith row (see [13]). This result canbe read as the partitioning of the indices In = 1, . . . , n into k subsetsΠ = π1, . . . , πk. The nonnegativity of the coefficients wi,j is fundamentalfor the clustering assignment. This excludes the use of different factorizationof A, like for example the Singular Value Decomposition.

If the matrix WW T represents a good factorization of A, we map thiscluster structure to the given points xi, i = 1, . . . , n, i. e. we assume that therows of W give the index partitioning also for the points. This assignmentrule is valid because the columns of A contain the similarity values amongthe points.

4

In literature [12, 13] this symmetric approach is described to efficientlysolve the clustering problem and it is shown to be competitive with thewidely used spectral clustering techniques which perform a dimensionalityreduction by means of the eigenvectors corresponding to the leading eigen-values of A.

Problem (1) has a fourth-order nonconvex objective function and opti-mization algorithms guarantee only the stationarity of the limit points, soone looks for local minima. Standard gradient algorithms lead to stationarysolutions, but suffer from slow convergence. Newton-like algorithms, whichhave a better rate of convergence, are computationally expensive. Following[12], we suggest to solve (1) through a nonsymmetric penalty problem of theform

minV,W≥O

Φλ(V,W ), with Φλ(V,W ) = 12

(∥A− VW T ∥2 + λ∥V −W∥2

), (4)

λ being a positive parameter which acts on the violation of the symmetry.To solve (4) we suggest the widely used alternating nonnegative least

squares (ANLS) method, which belongs to the block coordinate descentframework [11] of nonlinear optimization. Naturally the value of λ influencesthe symmetry of the solution (Vλ,Wλ) of (4): the largest λ, the smallest∥Vλ − Wλ∥, but a too large λ could lead to the trivial null solution Vλ =Wλ = O. In [5] we suggest to couple ANLS with an heuristic which, startingfrom an initial assigned λ(0), generates a convergent sequence whose valuesλ(ν) oscillate in progressively shrinking intervals. The experimental analysishas shown that the heuristic is effective and that the choice of λ(0) is notcritical (in the experiments we assume λ(0) = 1).

ANLS is an iterative algorithm which computes alternatively the solutionof the two subproblems

W (ν) = argminW≥0

12

∥∥∥∥ [ A√λ(ν−1) V (ν−1)T

]−

[V (ν−1)

√λ(ν−1) Ik

]W T

∥∥∥∥2,V (ν) = argmin

V≥0

12

∥∥∥∥ [ A√λ(ν−1) W (ν)T

]−

[W (ν)

√λ(ν−1) Ik

]V T

∥∥∥∥2.(5)

The choice of the initial matrix V (0) is critical, due to the fact that only alocal minimum is obtained, which obviously depends on this choice. Typi-cally, the algorithm is run several times with different initial matrices andthe run which produces the best clustering is chosen.

Although problem (4) is nonconvex, both problems (5) are convex andcan be dealt with by applying any procedure for constrained quadratic opti-mization, for example an Active-Set-like method [2, 8, 9, 10, 14]. Accordingto [6], the requirement that the nonnegativity constraints are exactly satis-fied at each step is necessary for convergence but makes the algorithm ratherslow at large dimensions. In our context, where ANLS is only a constitutive

5

brick of a broader algorithm, computing the exact solution of problems (5)is not so worthwhile.

The pseudo-code of a function Sym ANLS which implements the symmet-ric version of ANLS algorithm for the solution of problem (1) is shown inFigure 1. It requires a function for solving the constrained quadratic opti-mization problems (5) (we suggest the Greedy Coordinate Descent methodproposed in [7]) and a heuristic to update the parameter λ (we suggest theone in [5]). The stopping condition monitors the decreasing of the errormeasured through Ψ(W (ν)), given a tolerance η.

The function Sym ANLS is called by the function Trial (core of our algo-rithm, see Section 3.5).

function Sym ANLS(A,W, V, λ, νmax

)applies ANLS to problem (4). Matrix A depends on σ. No more than νmax

iterations allowed.

W (0) = W ; V (0) = V ; λ(0) = λ; ν = 0; cond = True;while cond

ν = ν + 1;

compute W (ν) and V (ν) by solving problems (5);

update λ(ν) by applying the heuristic;

ϵ(ν) = ∥A−W (ν)W (ν)T ∥2/∥A∥2;stop =

∣∣ϵ(ν) − ϵ(ν−1)∣∣ ≤ η ϵ(ν);

cond = not stop and ν < νmax;end while;

return W (ν), V (ν), λ(ν), stop;

Figure 1: Algorithm to solve problem (4). In the experiments the tolerancehas been set to η = 10−5.

3.3 The cluster validation

As already said, by solving problem (1) with Sym ANLS and different initialmatrices, we typically obtain different local minima. Thus we need a tool tochoose the best clustering Π among those corresponding to the different ob-tained matrices W . In general, the clustering validity is measured by meansof selected indices based on geometrical properties, such as distances of thepoints belonging to a same cluster or to different clusters. Over the years,dozens of indices have been proposed to measure the validity of a cluster-ing (see [4] and its references). Most of them take into account both thecompactness which measures the intra-cluster distances and the separability

6

which measures the inter-cluster distances. In our context, separability de-serves less attention than compactness in view of the subsequent merging ofclose clusters. Then for us a good clustering should be associated mainly tosmall intra-cluster distances.

The partitioning of the set of indices In = 1, . . . , n into the clusteringΠ = π1, . . . , πk is performed by function Partition(W ) which assignsindex i to πj if wi,j = ℓi where ℓi = maxr=1,...,nwi,r. Let γ be the vectorof k components, whose ith component is the average distance of all theelements in the ith cluster from its centroid ci, i.e.

γi =1

ni

∑j∈πi

d(xj , ci

), with ni = #πi, ci =

1

ni

∑j∈πi

xj ,

where the symbol # denotes the cardinality of a vector or of a set. Theindex we consider is

χ =1

k

k∑i=1

γi + maxi=1,...,k

γi. (6)

The smaller value of χ the more compact the clusters. When k increases,the corresponding index χ decreases until becoming zero for k = n. Thisindex is derived from the canonical DB index [3], by neglecting the distancesbetween centroids. Index χ is computed by the function Index (Π), whereΠ is the partitioning induced by W .

3.4 The merging rule

As already said, we want to perform the merging phase without resorting togeometrical issues such as distances and densities of the elementary clusters,but exploiting only information already coded in the matrix W . To this aim,the algorithm requests the identification of a soft clustering of the givenpoints through a suitable assignment rule.

Let Π = π1, . . . , πk be the clusters detected by W . The rows of W arefirst normalized by setting wi,j = wi,j/ℓi, where ℓi = maxr=1,...,nwi,r. Givena positive α < 1 we define a matrix B whose elements are

bi,j =

1 if wi,j ≥ α,0 otherwise,

and the matrix T = BTB. The points xi for which wi,j = 0 are said to beα-close to cluster πj (in particular they may belong to πj). The entry ti,iof T counts the number of elements which are α-close to πi and the entryti,j counts the number of elements which are α-close to both πi and πj . Theentry bi,j = 1 suggests a possible merging of πi and πj , but this merging iseffectively performed only when the number of elements which are α-closeto both πi and πj , normalized with respect to the product of the numbers of

7

elements which are α-close to either πi or πj , is sufficiently high, i.e. whent2i,j ≥ β ti,itj,j for a small parameter β. This condition has been set to avoidundesirable chaining effects.

The merged clustering C is computed using the above rules by the func-tion Merge (W ), which returns also the number nC of the clusters of C. Inthe experiments we set α = 0.8 and β = 0.0002.

3.5 The function Trial

The function Trial is applied to solve problem (1) for a value of the workingdimension k and pairs p = (σ, ω) ∈ Σ × Ω depending on k. The stoppingcondition stop used in Sym ANLS monitors the decrease of the error betweenmatrix A and the product W (ν)W (ν)T . However, for our purpose it is ex-pedient to include also a check on the stabilization of the clustering bycontrolling that the partitioning Π(ν) induced by W (ν) does not change fora sufficient number of iterations. The control makes use of the index χ de-fined in Section 3.3 to evaluate the clustering validity. To this aim, the runof Sym ANLS is partitioned into segments of length τ by performing only τiterations and computing χ at the end of each segment. The first segmentis computed with λ(0) = 1 and a randomly generated matrix V (0). Thesubsequent segments inherit the previous computed W (ν), V (ν) and λ(ν).The iteration stops when the index χ remains stable for enough consecutivesegments. In the experiments we set τ = 10 and request that χ does notchange for two consecutive segments (we have verified that in general fur-ther iterations produce insignificant changes). The splitting in segments ofeach run of Sym ANLS corresponding to a pair p is identified by a time indext = 1, . . . , tmax, with tmax sufficiently large to assure clusters stabilizationin the majority of cases.

In a straightforward approach the different runs of Sym ANLS, which de-pend on selected pairs p, would be exhaustively processed by increment-ing t until the clustering stabilization. The heuristic approach we proposeachieves a substantial saving of the computational cost by modifying theprocessing order of the various segments and interlacing the computationof segments of different runs. This strategy requires that several segmentscorresponding to different values of σ and ω and t are active at any instant ofthe computation. The processing of the most promising segments is carriedon and the other segments are discarded.

The strategy can be accomplished using a priority queue Q whose itemshave the form

Z = p, t,W, V, λ, χ, C, nC

where

– p = (σ, ω), where σ is the scale parameter chosen in a set Σ, and ω isthe seed for the random generation of the initial matrix chosen in a

8

set Ω of seeds uniformly distributed between 0 and 1, hence there areq = #(Σ× Ω) different pairs,

– t is the time of the segment,

– W , V , λ are the matrices and the penalty parameter obtained at time t,

– χ is the validity index induced by W ,

– C is the arbitrary shape clustering obtained by applying the mergingrule to W ,

– nC is the number of clusters of C.

At the beginning the queue is formed by the q initial items

p, 0, V0, O, 1, 0, In, 1,

p varying in Σ×Ω and V0 being the initial matrix randomly generated withseed ω.

The following operations are defined routinely on priority queues

– Y = Dequeue(Q) returns a item Y chosen according to a minimum policyand remove it from Q.

– Enqueue(Q, Y ) insert item Y into Q.

– Is empty(Q) returns True if Q is empty, False otherwise.

– Clear(Q) empties Q.

The queue is processed by choosing among the items with the minimumtime t the one with the minimum value of χ. The process evolves as follows:when item Y is extracted from the queue, the time is set to t = t(Y ) +1, itsparameters σ and ω are identified and function Sym ANLS is applied updatingthe variables W , V , λ and the boolean stop. The current index χ is inducedby the newly computed W and the merging rule is applied to W , producinga merged clustering C with nC clusters. The returned variables form anitem Z, which is saved in a suitable data structure Γ only if nC is equalto kgoal. The survival of items is controlled by the value χmin which is theminimum of the indices χ corresponding to merged clusterings C saved inΓ . The item Z is inserted back into Q unless one of the following conditionsoccurs.

(1) The execution of the run is terminated. This possibility is monitoredby the Boolean end which becomes True if the error decreases too slowor the cluster induced by W (ν) does not change for two iterations.

9

(2) The execution of the run is abandoned. This possibility is monitoredby the Boolean poor which is True when the current value of the indexχ is more than twice χmin, or when it is a little more than χmin butlarger than the mean of all the previous saved χ’s.

(3) Enough positive results have been obtained. This possibility is moni-tored by the Boolean close which is True when the last processed itemhas been saved and all the initial items have been run and the mini-mum χmin has been obtained at least 5 times or has not been modifiedin the last 10 steps. In this case Q is abruptly cleared.

The queue is emptied by processing all the items or by the clearing dueto the close condition. When the queue is empty, the saved items in Γ areselected according to the index χ, i.e. for each pair p having at least oneassociated item, only the item with the lowest index is returned.

A pseudo-code of the function Trial is given in Figure 2. Note thatthe processing of the items can be easily parallelized. Moreover, if thenumber of processors is greater than the cardinality q of Σ × Ω, all theitems are processed independently in parallel. An interesting issue concernswhether the number of processors can affect the survival policy of the items.Numerical experiments of Section 4 are devoted also to this issue: indeed thenumber of processors appears to influence somewhat the queue processing,but the algorithm is robust enough to guarantee success in all the cases.

3.6 The function Clu

The final clustering with kgoal arbitrary shaped clusters is computed by thefunction Clu, which rules the increase of the working dimensions k = k0 2

j ,j = 0, 1, . . . and the corresponding decrease of the scales σ for the similaritymatrices. For each k two scales σ1 and σ2 and a set Ω of seeds for thegeneration of V0 are considered (in the experiments #Ω = 8, hence q = 16).Then the function Clu invokes function Trial to produce a shortlist Sj ofat most q merged clusterings. To associate with each merged clustering C acharacterizing signature θ, we use again function Index applied directly tothe partition C. In this way a vector v of signatures is obtained by applyingfunction Index to each merged clustering C ∈

∪r=0,...,j Sr and a choice

by majority is performed on v. To this aim the function Multi(v), whichcomputes the multiplicity of distinct entries of the vector v, is applied.

Among all the clusterings produced, the final clustering C is chosenby applying a voting step based on the following majority criteria. If atleast two clusterings have been found and they have the same signature θ(unanimity), one of them is chosen. If clusterings having different signatureshave been found, the chosen clustering has an absolute majority or has areasonable majority over the others. If no clustering satisfies these criteria,

10

function Trial(k)computes a set of clusterings with kgoal arbitrary shaped clusters

initialize Q; χmin = ∞; χeq = 0; χgt = 0;

while not Is empty(Q)

Y = Dequeue(Q);

t = t(Y ) + 1;

p = (σ, ω) = p(Y );(W,V, λ, stop

)= Sym ANLS

(Aσ,W(Y ), V(Y ), λ(Y ), τ

);

χ(p,t) = χ = Index(Partition(W ));

(C, nC) = Merge(W );

Z = p, t,W, V, λ, χ, C, nC;if nC eq kgoal then

Save (Z) in Γ ;µ = the mean of all saved χ;

if χ < χmin then χmin = χ; χeq = χgt = 0;

else if χ eq χmin then χeq = χeq + 1; else χgt = χgt + 1;

stab = (χ(p,t−1) eq χ);

end = stop or stab or t ≥ tmax;

poor = χ > 2χmin or χ > max(µ, 1.1χmin

);

close = end and t > 1 and(χeq ≥ 5 or χgt ≥ 10

);

if not poor and not end then Enqueue (Q, Z);

if close then Clear (Q);

end while;

for each pair p, let Cp the clustering with the smaller χ among thesaved items with first component p;

return the set of all Cp.

Figure 2: The algorithm Trial depends on the working dimension k andproduces a shortlist of possible clusterings.

the function Clu proceeds by doubling the value of k and by halving thevalues of σ1 and σ2.

A pseudo-code of the function Clu is given in Figure 3. The number ofdoublings is kept under control by means of the condition k ≤ kmax. Theoutput failed is returned when the working dimension k exceeds kmax with-out finding a clustering that meets the majority criteria. In the experimentsthe value of kmax = 80 has been set but never reached.

11

function Clu

computes a solution of problem (1). The function Multi(v) computes avector, ordered decreasingly, containing the multiplicity of the distinctentries of v.

j = 0; k = k0; σ1 = 1/(20 k0);

while k ≤ kmax;

σ2 = 2σ1;

compute Aσ1 and Aσ2 using (2) and (3);

construct Ω;

Sj = Trial(k);

S =∪

r=0,j Sr;

s = #S;

fill a vector v with the indices θ = Index(C) for C ∈ S;

m = Multi(v);

if m1 > 1 then

let C be a clustering with index θ having multiplicity m1;

if #m eq 1 then return C;

if m1 ≥ 3 and (m1 > s/2 or m1 > 1.8m2) then return C;

j ++; k = 2 k; σ1 = σ1/2;

end while;

return failed.

Figure 3: If the majority criteria are satisfied, the corresponding clusteringis returned, otherwise a failed condition is raised. In the experiments k0 = 5has been set, being kgoal ≤ 5 for all the considered problems.

4 Numerical experimentation

A large experimentation to validate the clustering efficiency of the proposedalgorithm has been performed. For the purpose of a proper visualization thetest was carried out on synthetic datasets composed of points in R2. Twodifferent types of datasets were generated. The first type T1 (see Figure 4)corresponds to clusters whose convex envelopes are separable, and the secondtype T2 (see Figure 5) corresponds to clusters whose convex envelopes are notseparable. Overall, 16 datasets were considered, most of them suggested by[15, 17]. Actually, 17 clustering problems have been solved, since for datasetSC two different values of kgoal have been set. In order to evaluate thealgorithm in various situations, clusters of different cardinality and different

12

density were considered and in a few cases noise points were added.

WS WSN SC OR

DD SK MATH XX

Figure 4: Datasets of type T1.

C2 C2N CC CS

C3 CP SPI SPIL

Figure 5: Datasets of type T2.

The computation has been performed with a 4GHz Intel four core i7processor. To test if the queue processing of trial depends on the numberof processors, a first run was performed elaborating simultaneously 4 items,then a second run was performed accessing the queue in sequential way (i.e.with only one active core). For a reference, the algorithm has been testedalso on a server with 20 cores, sufficient to elaborate concurrently all theactive items.

Table 1 lists the characteristics of each problem, i.e. the number of points

13

and the requested kgoal, together with the results of the run performed withfour cores, i.e. the maximum number jmax of the required doubling of theworking dimension, the result of the voting step and the execution time inseconds. The result of the voting step is expressed by the ratio between thenumber m1 of votes of the winning clustering and the total number s of theclusterings which took part in the voting.

Problem n kgoal jmax m1/s sec.

WS 2000 5 0 15/15 6.4

WSN 2000 5 0 5/13 11.9

SC 2000 5 0 13/13 8.2

SC 2000 3 1 6/6 100.9

OR 2000 5 1 8/9 76.8

DD 2000 3 0 8/16 17.8

SK 2000 3 1 13/25 60.5

XX 3000 3 0 8/8 29.7

MATH 3998 4 0 12/12 82.6

C2 1980 2 0 10/10 37.6

C2N 1980 2 0 10/12 23.1

CC 2000 2 0 9/9 19.1

CS 2020 2 1 13/19 84.1

CP 2004 3 1 9/9 113.9

C3 3000 3 1 2/2 362.5

SPI 5320 4 3 8/8 5014.2

SPIL 7020 2 3 7/15 15256.1

Table 1: Results of the experiments on the problems of Fig.4 (upper part)and Fig.5 (lower part).

The value of jmax is an index of the difficulty of the problem. In par-ticular, the value jmax = 0 indicates that the s items which survived fromthe q = 16 items initially forming the queue, were enough to designate awinner of the voting. The value jmax = 1 indicates that the doubling of thedimensions was performed once, before finding a winner among the s itemssurviving from the 32 initially generated. For the spirals three doublingswere necessary with only 8 or 15 items surviving from the 64 initially gen-erated. Of course, the execution time raises when a dimension doubling isrequired and it is on average larger for the problems of type T2. The high

14

time of MATH with no dimension doubling is due to a larger number ofpoints.

In all cases the runs ended with a success, returning clusters which matchthe originally constructed ones. By comparing the results of problems WSand WSN it seems that the presence of the noise has a limited consequenceon the performance, designating the same winner even if with a reducedmajority. By comparing the results of problem SC with kgoal = 5 and withkgoal = 3 we see that in the second case we are really forcing a solutionwhich the program accepts with difficulty. Moreover, the presence of largescattered points, as in problems WSN, DD, SK and CS, appears to influencenegatively the ratio m1/s.

The voting results of the runs performed with only one active core orwith 20 cores are very similar to those listed in the table. The differencesconcern mainly the numbers m1 and s separately, while the ratios m1/s arebarely affected. This shows that the survival policy of items may indeedvary somewhat with the number of processors, but that the algorithm isrobust enough to guarantee in all cases the same successes.

5 Conclusions

In this paper a technique, based on the symmetric NMF factorization ofthe similarity matrix, is proposed to deal with arbitrarily shaped clustering.The main contribution of our approach consists in the merging techniqueof the clusters which exploits information already contained in the matrixW obtained by NMF, instead of geometrical issues such as distances anddensities of the points. The resulting algorithm can be easily parallelized.The experimentation, applied to problems with various characteristics, hasshown that the algorithm results robust also for noisy data.

References

[1] C. C. Aggarwal, Data Mining: the textbook, Springer, 2014.

[2] A. Bjorck, Numerical Methods for least squares problems, SIAM,Philadelphia, 1996.

[3] D. L. Davis and D. W. Bouldin, A cluster separation measure, IEEETransactions on Pattern Analysis and Machine Intelligence, PAMI, 1(1979), pp. 224–227.

[4] B. Desgraupes, Clustering Indices. https://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf

[5] P. Favati, G. Lotti, O. Menchi and F. Romani, Adaptive sym-metric NMF for graph clustering, IIT-CNR TR-05/2016.

15

[6] L. Grippo and M. Sciandrone, On the convergence of the blocknonlinear Gauss-Seidel method under convex constraints, Oper. Res.Lett., 26 (2000), pp. 127–136.

[7] C. J. Hsieh and I. S. Dhillon, Fast coordinate descent methods withvariable selection for non-negative matrix factorization, Proceedings ofthe 17th ACM SIGKDD International Conference on Knowledge Dis-covery and Data Mining, (2011), pp. 1064–1072.

[8] H. Kim and H. Park, Nonnegative matrix factorization based on al-ternating nonnegativity constrained least squares and active set method,SIAM J. Matrix Anal. Appl., 30 (2008), pp. 713–730.

[9] H. Kim and H. Park, Toward faster nonnegative matrix factoriza-tion: a new algorithm and comparisons, Proceedings of the 8th IEEEInternational Conference on Data Mining (ICDM), (2008), pp. 353–362.

[10] H. Kim and H. Park, Fast nonnegative matrix factorization: anactive-set-like method and comparisons, SIAM J. on Scientific Com-puting, 33 (2011), pp. 3261–3281.

[11] J. Kim, Y. He and H. Park, Algorithms for nonnegative matrix andtensor factorization: an unified view based on block coordinate descentframework, J. Glob. Optim., 58 (2014), pp. 285–319.

[12] D. Kuang, S. Yun and H. Park, SymNMF: nonnegative low-rankapproximation of a similarity matrix for graph clustering, J. Glob. Op-tim., 62 (2015), pp. 545–574.

[13] D. Kuang, C. Ding and H. Park, Symmetric nonnegative matrixfactorization for graph clustering, SDM’12: Proc. of SIAM Int. Conf.on Data Mining, (2012), pp. 106–117.

[14] C. L. Lawson and R. J. Hanson, Solving least squares problems,Prentice-Hall, Englewood Cliffs, N.Y., 1974.

[15] Y. Liu, Z. Li,H. Xiong, X. Gao and J. Wu Understanding of inter-nal clustering validation measures, IEEE International Conference onData Mining, (2010), pp. 911–916.

[16] P. Paatero and U. Tappert, Positive Matrix Factorization: a non-negative factor model with optimal solution of error estimates of datavalues, Environmetrics, 5 (1994), pp. 111–126.

[17] L. Zelnik-Manor and P. Perona, Self-tuning spectral clustering,In Advances in Neural Information Processing Systems, 17 (2004),pp. 1601–1608.

16

[18] H. Zhao, P. Poupart, Y. Zhang and M. Lysy, SoF: Soft-Cluster Matrix Factorization for Probabilistic Clustering, Proceed-ings of the Twenty-Ninth AAAI Conference on Artificial Intelligence,www.aaai.org, (2015), pp. 3188–3195.

17

Consiglio Nazionale delle Ricerche - iit.cnr.it 09_2016.pdf · C Consiglio Nazionale delle Ricerche...

Documents

Transcript of Consiglio Nazionale delle Ricerche - iit.cnr.it 09_2016.pdf · C Consiglio Nazionale delle Ricerche...