r - Most representative instance of a cluster -

- May 15, 2011

After performing a cluster analysis on my dataset (a data frame named data.matrix), I added a new column named cluster at the end (column 27) containing the name of the cluster each instance belongs to.

What I want now is a representative instance from each cluster. I tried to find the instance with the smallest Euclidean distance to the cluster's centroid (and to repeat the procedure for each of the clusters).

This is what I did. Can you think of other, perhaps more elegant, ways? (Assume all columns are numeric, with no nulls.)

clusters <- levels(data.matrix$cluster)
cluster_col = c(27)

for (j in 1:length(clusters)) {
    # subset of cluster j
    data = data.matrix[data.matrix$cluster == clusters[j],]

    # remove the cluster column
    data <- data[,-cluster_col]

    # calculate the centroid (column means; mean() on a data frame no longer does this)
    cent <- colMeans(data)

    # copy data to data.matrix_cl, attaching a distance column at the end
    data.matrix_cl <- cbind(data, dist = apply(data, 1, function(x) {sqrt(sum((x - cent)^2))}))

    # instances with the minimum distance
    candidates <- data.matrix_cl[data.matrix_cl$dist == min(data.matrix_cl$dist),]

    # print the row names
    print(paste("candidates for cluster ",j))
    print(rownames(candidates))
}

First, I don't know whether your distance formula is all right. I think it should be either sqrt(sum((x-cent)^2)) or sum(abs(x-cent)); I assumed the first. Second, printing the solution directly is not a good idea, so I first compute and then print. Third, I recommend using plyr, but I give both solutions (with and without plyr).
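To make the difference between the two formulas concrete, here is a minimal sketch on a made-up cluster of four points. The data frame `toy` and the vectors `euclid` and `manhattan` are hypothetical names used only for illustration; they are not part of the original question.

```r
# Toy cluster of four 2-D points (made-up values for illustration only)
toy <- data.frame(x = c(0, 1, 2, 9), y = c(0, 1, 2, 1))
rownames(toy) <- c("p1", "p2", "p3", "p4")

# Centroid of the cluster: x = 3, y = 1
cent <- colMeans(toy)

# Distance of every row to the centroid under each formula
euclid    <- apply(toy, 1, function(r) sqrt(sum((r - cent)^2)))
manhattan <- apply(toy, 1, function(r) sum(abs(r - cent)))

# Row(s) closest to the centroid; all ties are returned
names(euclid)[euclid == min(euclid)]           # "p3"
names(manhattan)[manhattan == min(manhattan)]  # "p2" "p3"
```

In this toy case the Euclidean distance picks a single instance while the absolute-difference (Manhattan) distance produces a tie, so the choice of formula can change which instances count as representative.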

# simulated data:
n <- 100
data.matrix <- cbind(
  data.frame(matrix(runif(26*n), n, 26)),
  cluster=sample(letters[1:6], n, replace=TRUE)
)
cluster_col <- which(names(data.matrix)=="cluster")

# with plyr:
require(plyr)
candidates <- dlply(data.matrix, "cluster", function(data) {
  # squared distance of each row to the cluster centroid
  dists <- colSums(laply(data[, -cluster_col], function(x) (x-mean(x))^2))
  rownames(data)[dists==min(dists)]
})

l_ply(names(candidates), function(c_name, c_list=candidates[[c_name]]) {
    print(paste("candidates for cluster ",c_name))
    print(c_list)
})

# without plyr:
candidates <- tapply(
  1:nrow(data.matrix),
  data.matrix$cluster,
  function(id, data=data.matrix[id, ]) {
    # squared distance of each row to the cluster centroid
    dists <- rowSums(sapply(data[, -cluster_col], function(x) (x-mean(x))^2))
    rownames(data)[dists==min(dists)]
  }
)

invisible(lapply(names(candidates), function(c_name, c_list=candidates[[c_name]]) {
    print(paste("candidates for cluster ",c_name))
    print(c_list)
}))
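As one more possible variant, the same idea can be sketched in base R with split() and sweep(). The helper `closest_to_centroid` and the names `sim`, `num_cols` and `reps` are hypothetical, and the data are simulated. The sketch assumes every non-cluster column is numeric, and it compares squared Euclidean distances, which gives the same minimum as taking the square root.

```r
# Hypothetical helper: row names of the instance(s) nearest to the centroid
# of a single cluster, given a data frame of numeric columns only
closest_to_centroid <- function(df) {
  m <- as.matrix(df)
  # subtract the column means from every row, then sum the squares per row
  d2 <- rowSums(sweep(m, 2, colMeans(m))^2)
  rownames(df)[d2 == min(d2)]
}

# Simulated data: 30 instances, 5 numeric columns, 3 clusters
set.seed(1)
sim <- data.frame(matrix(runif(5 * 30), 30, 5),
                  cluster = sample(LETTERS[1:3], 30, replace = TRUE))
num_cols <- setdiff(names(sim), "cluster")

# One list element per cluster, holding the candidate row names
reps <- lapply(split(sim[num_cols], sim$cluster), closest_to_centroid)
reps
```

Because the helper works on a matrix, it should also cope with a cluster that contains a single instance, in which case that instance is returned.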
