agnes                package:cluster                R Documentation

_A_g_g_l_o_m_e_r_a_t_i_v_e _N_e_s_t_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     Computes agglomerative hierarchical clustering of the dataset.

_U_s_a_g_e:

     agnes(x, diss = inherits(x, "dist"), metric = "euclidean",
           stand = FALSE, method = "average", par.method,
           keep.diss = n < 100, keep.data = !diss)

_A_r_g_u_m_e_n_t_s:

       x: data matrix or data frame, or dissimilarity matrix, depending
          on the value of the 'diss' argument.

          In case of a matrix or data frame, each row corresponds to an
          observation, and each column corresponds to a variable. All
          variables must be numeric. Missing values (NAs) are allowed.

          In case of a dissimilarity matrix, 'x' is typically the
          output of 'daisy' or 'dist'. Also a vector with length
          n*(n-1)/2 is allowed (where n is the number of observations),
          and will be interpreted in the same way as the output of the
          above-mentioned functions. Missing values (NAs) are not
          allowed. 

    diss: logical flag: if TRUE (default for 'dist' or 'dissimilarity'
          objects), then 'x' is assumed to be a dissimilarity matrix. 
          If FALSE, then 'x' is treated as a matrix of observations by
          variables. 

  metric: character string specifying the metric to be used for
          calculating dissimilarities between observations. The
          currently available options are "euclidean" and "manhattan".
          Euclidean distances are root sum-of-squares of differences,
          and manhattan distances are the sum of absolute differences.
          If 'x' is already a dissimilarity matrix, then this argument
          will be ignored. 

   stand: logical flag: if TRUE, then the measurements in 'x' are
          standardized before calculating the dissimilarities.
          Measurements are standardized for each variable (column), by
          subtracting the variable's mean value and dividing by the
          variable's mean absolute deviation.  If 'x' is already a
          dissimilarity matrix, then this argument will be ignored. 

  method: character string defining the clustering method.  The six
          methods implemented are "average" (group average method),
          "single" (single linkage), "complete" (complete linkage),
          "ward" (Ward's method), "weighted" (weighted average
          linkage), and its generalization "flexible", which uses (a
          constant version of) the Lance-Williams formula and the
          'par.method' argument.  Default is "average". 

par.method: if 'method == "flexible"', numeric vector of length 1, 3,
          or 4; see the Details section. 

keep.diss, keep.data: logicals indicating if the dissimilarities and/or
          input data 'x' should be kept in the result.  Setting these
          to 'FALSE' can give much smaller results and hence even save
          memory allocation _time_.
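
     The 'stand' and 'metric' descriptions above can be illustrated
     with a small base R sketch. It follows the wording of this page
     (mean and mean absolute deviation, then Euclidean or Manhattan
     distance) and is not taken from the 'agnes' source; the helper
     'standardize_mad' and the toy values are hypothetical.

```r
# Toy data: 4 observations, 2 variables (hypothetical values).
x <- matrix(c(1, 2, 4, 9,
              10, 20, 20, 50), ncol = 2)

# Standardize each column: subtract the mean, then divide by the
# mean absolute deviation (not the standard deviation).
standardize_mad <- function(m) {
  apply(m, 2, function(v) {
    centered <- v - mean(v)
    centered / mean(abs(centered))
  })
}
z <- standardize_mad(x)

# Dissimilarity between observations 1 and 2 under each metric.
d <- z[1, ] - z[2, ]
euclid <- sqrt(sum(d^2))   # root sum-of-squares of differences
manhat <- sum(abs(d))      # sum of absolute differences
```

     After standardization every column has mean 0 and mean absolute
     deviation 1, so variables measured on different scales
     contribute comparably to the dissimilarities.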

_D_e_t_a_i_l_s:

     'agnes' is fully described in chapter 5 of Kaufman and Rousseeuw
     (1990). Compared to other agglomerative clustering methods such as
     'hclust', 'agnes' has the following features: (a) it yields the
     agglomerative coefficient (see 'agnes.object') which measures the
     amount of clustering structure found; and (b) apart from the usual
     tree it also provides the banner, a novel graphical display (see
     'plot.agnes').
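
     As a rough illustration of the comparison with 'hclust', the
     sketch below clusters the same dissimilarities with both
     functions and then reads the agglomerative coefficient from the
     'agnes' result. The random data are hypothetical, and the
     component names 'height' and 'ac' are assumed to be as
     documented in 'agnes.object'.

```r
library(cluster)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)   # 20 hypothetical observations
d <- dist(x)                       # Euclidean dissimilarities

agn <- agnes(d, method = "average")
hc  <- hclust(d, method = "average")

# Both implement group-average linkage, so the sorted merge
# heights are expected to agree.
heights_match <- isTRUE(all.equal(sort(agn$height), sort(hc$height)))

# Extra information provided by agnes: the agglomerative coefficient.
agn$ac
```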

     The 'agnes' algorithm constructs a hierarchy of clusterings. At
     first, each observation is a small cluster by itself. Clusters
     are then merged until only one large cluster remains, which
     contains all the observations.  At each stage the two _nearest_
     clusters are combined to form one larger cluster.
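
     The merging process described above can be sketched as a naive
     loop in base R. This is only an illustration using the
     group-average distance, not the algorithm used inside 'agnes';
     the function 'naive_agglomerate' and the data are hypothetical.

```r
# Naive agglomeration on a full dissimilarity matrix.
naive_agglomerate <- function(dmat) {
  clusters <- as.list(seq_len(nrow(dmat)))   # each observation alone
  heights <- numeric(0)
  while (length(clusters) > 1) {
    best <- c(1, 2)
    best_d <- Inf
    # Find the two nearest clusters (group-average distance).
    for (i in seq_len(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        dij <- mean(dmat[clusters[[i]], clusters[[j]]])
        if (dij < best_d) {
          best_d <- dij
          best <- c(i, j)
        }
      }
    }
    # Merge them into one larger cluster and record the height.
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
    heights <- c(heights, best_d)
  }
  heights
}

x <- c(0, 1, 3, 7)                       # hypothetical 1-D data
h <- naive_agglomerate(as.matrix(dist(x)))
h   # merge heights: 1, 2.5, 17/3
```

     With n observations the loop performs n - 1 merges, one per
     level of the hierarchy.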

     For 'method="average"', the distance between two clusters is the
     average of the dissimilarities between the points in one cluster
     and the points in the other cluster. For 'method="single"', it
     is the smallest dissimilarity between a point in the first
     cluster and a point in the second cluster (nearest neighbor
     method). For 'method="complete"', it is the largest
     dissimilarity between a point in the first cluster and a point
     in the second cluster (furthest neighbor method).
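
     For two fixed clusters, the three between-cluster distances
     defined above can be computed directly from the dissimilarity
     matrix. The data and the cluster memberships 'A' and 'B' below
     are hypothetical.

```r
x <- c(0, 1, 3, 7, 8)            # hypothetical 1-D observations
dmat <- as.matrix(dist(x))

A <- c(1, 2)                     # cluster holding the points 0 and 1
B <- c(4, 5)                     # cluster holding the points 7 and 8
between <- dmat[A, B]            # all pairwise dissimilarities A x B

link <- c(average  = mean(between),   # group average
          single   = min(between),    # nearest neighbor
          complete = max(between))    # furthest neighbor
link   # average 7, single 6, complete 8
```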

     'method = "flexible"' allows (and requires) more details: The
     Lance-Williams formula specifies how dissimilarities are updated
     when clusters are agglomerated (equation (32) in K. & R., p.237).
     If clusters C_1 and C_2 are agglomerated into a new cluster, the
     dissimilarity between their union and another cluster Q is given
     by

          D(C_1 U C_2, Q) = alpha_1 * D(C_1, Q) + alpha_2 * D(C_2, Q)
                            + beta * D(C_1, C_2)
                            + gamma * |D(C_1, Q) - D(C_2, Q)|,

     where the four coefficients (alpha_1, alpha_2, beta, gamma) are
     specified by the vector 'par.method':

     If 'par.method' is of length 1, say = alpha, 'par.method' is
     extended to give the "Flexible Strategy" (K. & R., p.236 f) with
     Lance-Williams coefficients (alpha_1 = alpha_2 = alpha, beta = 1 -
     2 * alpha, gamma = 0). If 'par.method' is of length 3, it
     supplies (alpha_1, alpha_2, beta) and gamma = 0 is used.
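
     The expansion of 'par.method' and the Lance-Williams update can
     be written out as a short base R sketch. The helpers
     'expand_par_method' and 'lance_williams' are hypothetical names
     that follow the description on this page, not functions of the
     cluster package; the numeric inputs are made up.

```r
# Expand 'par.method' to (alpha_1, alpha_2, beta, gamma).
expand_par_method <- function(p) {
  switch(as.character(length(p)),
         "1" = c(p, p, 1 - 2 * p, 0),   # Flexible Strategy
         "3" = c(p, 0),                 # gamma = 0
         "4" = p,
         stop("'par.method' must have length 1, 3, or 4"))
}

# Dissimilarity between the union of C_1 and C_2 and a cluster Q.
lance_williams <- function(d1q, d2q, d12, coef) {
  coef[1] * d1q + coef[2] * d2q + coef[3] * d12 +
    coef[4] * abs(d1q - d2q)
}

coef_flex <- expand_par_method(0.6)     # 0.6, 0.6, -0.2, 0
d_flex <- lance_williams(d1q = 4, d2q = 6, d12 = 2, coef_flex)

# A length-4 vector reaches other linkages; these coefficients
# reproduce single linkage, i.e. min(d1q, d2q).
d_single <- lance_williams(4, 6, 2, expand_par_method(c(0.5, 0.5, 0, -0.5)))
```

     Here 'd_flex' is 0.6 * 4 + 0.6 * 6 - 0.2 * 2 = 5.6, and
     'd_single' equals 4, the smaller of the two input
     dissimilarities.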

     *Care* and expertise are probably needed when using 'method =
     "flexible"', particularly when 'par.method' has length greater
     than one. The _weighted average_ ('method="weighted"') is the
     same as 'method="flexible", par.method = 0.5'.
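
     The stated equivalence between the weighted average method and
     the flexible method with 'par.method = 0.5' can be checked on a
     small example. The random data below are hypothetical, and the
     check only compares the 'height' component of the two results.

```r
library(cluster)

set.seed(42)
x <- matrix(rnorm(60), ncol = 3)   # 20 hypothetical observations

agn_w <- agnes(x, method = "weighted")
agn_f <- agnes(x, method = "flexible", par.method = 0.5)

# With alpha = 0.5 the coefficients are (0.5, 0.5, 0, 0), so both
# fits are expected to give the same merge heights.
same_heights <- isTRUE(all.equal(agn_w$height, agn_f$height))
same_heights
```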

_V_a_l_u_e:

     an object of class '"agnes"' representing the clustering. See
     'agnes.object' for details.
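
     A brief sketch of inspecting the returned object follows. The
     component names ('ac', 'merge') are assumed to be as documented
     in 'agnes.object', and the choice of two clusters is arbitrary.

```r
library(cluster)

data(agriculture)
agn <- agnes(agriculture)

class(agn)       # includes "agnes"
agn$ac           # agglomerative coefficient
dim(agn$merge)   # one row per merge: (n - 1) x 2

# Convert to an 'hclust' object to cut the tree into 2 clusters.
groups <- cutree(as.hclust(agn), k = 2)
table(groups)
```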

_B_A_C_K_G_R_O_U_N_D:

     Cluster analysis divides a dataset into groups (clusters) of
     observations that are similar to each other.

     _H_i_e_r_a_r_c_h_i_c_a_l _m_e_t_h_o_d_s like 'agnes', 'diana', and 'mona' construct a
          hierarchy of clusterings, with the number of clusters ranging
          from one to the number of observations.

     _P_a_r_t_i_t_i_o_n_i_n_g _m_e_t_h_o_d_s like 'pam', 'clara', and 'fanny' require that
          the number of clusters be given by the user.

_R_e_f_e_r_e_n_c_e_s:

     Kaufman, L. and Rousseeuw, P.J. (1990). _Finding Groups in Data:
     An Introduction to Cluster Analysis_. Wiley, New York.

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1996). Clustering in
     an Object-Oriented Environment. _Journal of Statistical Software_,
     *1*. <URL: http://www.stat.ucla.edu/journals/jss/>

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997). Integrating
     Robust Clustering Techniques in S-PLUS, _Computational Statistics
     and Data Analysis_, *26*, 17-37.

_S_e_e _A_l_s_o:

     'agnes.object', 'daisy', 'diana', 'dist', 'hclust', 'plot.agnes',
     'twins.object'.

_E_x_a_m_p_l_e_s:

     data(votes.repub)
     agn1 <- agnes(votes.repub, metric = "manhattan", stand = TRUE)
     agn1
     plot(agn1)

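     op <- par(mfrow=c(2,2))  # 2 x 2 grid: two plots per agnes object
     agn2 <- agnes(daisy(votes.repub), diss = TRUE, method = "complete")
     plot(agn2)
     ## Flexible strategy with alpha = 0.6 (argument name spelled out)
     agnS <- agnes(votes.repub, method = "flexible", par.method = 0.6)
     plot(agnS)
     par(op)                  # restore the previous graphics settings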

     data(agriculture)
     ## Plot similar to Figure 7 in ref
     ## Not run:  plot(agnes(agriculture), ask = TRUE)

