R - Most representative instance of a cluster
After performing a cluster analysis on my dataset (a data frame named data.matrix), I added a new column, named cluster, at the end (col 27) containing the name of the cluster each instance belongs to.
What I want now is a representative instance from each cluster. I tried to find the instance having the smallest Euclidean distance from the cluster's centroid (and to repeat the procedure for each one of the clusters).
This is what I did. Can you think of other, perhaps more elegant, ways? (Assume numeric columns and no nulls.)
    clusters <- levels(factor(data.matrix$cluster))
    cluster_col <- 27

    for (j in seq_along(clusters)) {
      # subset cluster j
      data <- data.matrix[data.matrix$cluster == clusters[j], ]
      # remove the cluster column
      data <- data[, -cluster_col]
      # calculate the centroid (one mean per column)
      cent <- colMeans(data)
      # copy data to data.matrix_cl, attaching a distance column at the end
      data.matrix_cl <- cbind(data, dist = apply(data, 1, function(x) sqrt(sum((x - cent)^2))))
      # instances with the minimum distance
      candidates <- data.matrix_cl[data.matrix_cl$dist == min(data.matrix_cl$dist), ]
      # print their row names
      print(paste("Candidates for cluster", j))
      print(rownames(candidates))
    }
First, I am not sure your distance formula is right. I think it should be sqrt(sum((x-cent)^2)) or sum(abs(x-cent)). I assumed the first. Second, printing the solution straight away is not a good idea, so I first compute, then print. Third, I recommend using plyr, but I give both solutions (with and without plyr).
    # Simulated data:
    n <- 100
    data.matrix <- cbind(
      data.frame(matrix(runif(26 * n), n, 26)),
      cluster = sample(letters[1:6], n, replace = TRUE)
    )
    cluster_col <- which(names(data.matrix) == "cluster")

    # With plyr:
    library(plyr)
    candidates <- dlply(data.matrix, "cluster", function(data) {
      # squared distance of each instance from the centroid
      # (laply returns one row per column of data, hence colSums)
      dists <- colSums(laply(data[, -cluster_col], function(x) (x - mean(x))^2))
      rownames(data)[dists == min(dists)]
    })

    l_ply(names(candidates), function(c_name, c_list = candidates[[c_name]]) {
      print(paste("Candidates for cluster", c_name))
      print(c_list)
    })

    # Without plyr:
    candidates <- tapply(
      1:nrow(data.matrix),
      data.matrix$cluster,
      function(id, data = data.matrix[id, ]) {
        # squared distance of each instance from the centroid
        dists <- rowSums(sapply(data[, -cluster_col], function(x) (x - mean(x))^2))
        rownames(data)[dists == min(dists)]
      }
    )

    invisible(lapply(names(candidates), function(c_name, c_list = candidates[[c_name]]) {
      print(paste("Candidates for cluster", c_name))
      print(c_list)
    }))
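For comparison, here is a minimal base R sketch of the same idea that lets you switch between the two distance formulas mentioned above. The function name closest_to_centroid, its arguments, and the toy data frame are made up for illustration and are not part of the original post. It assumes every column other than the cluster column is numeric and has no missing values.

```r
# Hypothetical helper (not from the original post): for each cluster, return
# the row name(s) of the instance(s) closest to that cluster's centroid.
closest_to_centroid <- function(df, cluster_name = "cluster",
                                method = c("euclidean", "manhattan")) {
  method <- match.arg(method)
  # keep only the feature columns, then split the rows by cluster label
  features <- df[, names(df) != cluster_name, drop = FALSE]
  groups <- split(features, df[[cluster_name]])
  lapply(groups, function(g) {
    m <- as.matrix(g)
    # subtract the centroid (column means) from every row
    centred <- sweep(m, 2, colMeans(m))
    d <- if (method == "euclidean") {
      sqrt(rowSums(centred^2))   # sqrt(sum((x - cent)^2))
    } else {
      rowSums(abs(centred))      # sum(abs(x - cent))
    }
    # ties are kept, so more than one row name may be returned
    rownames(g)[d == min(d)]
  })
}

# Toy data, small enough to check by hand
toy <- data.frame(
  x = c(0, 1, 2, 10, 11, 15),
  y = c(0, 1, 2, 10, 11, 15),
  cluster = c("a", "a", "a", "b", "b", "b")
)
reps <- closest_to_centroid(toy)
reps_manhattan <- closest_to_centroid(toy, method = "manhattan")
print(reps)   # reps$a is "2" and reps$b is "5"
```

Working on a matrix with sweep() and rowSums() also handles a cluster that contains a single instance, which the sapply-based version does not. Computing first and printing afterwards follows the same split as in the answer above.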