Reef Fish science forum Guru Wannabe
Joined: 28 Apr 2005
Posts: 200
|
Posted: Tue Apr 04, 2006 5:34 pm Post subject:
Why Clustering and MDS are Methodologically Incompatible
|
|
|
In a related thread, I wrote
| Quote: | I'll post a separate post to explain WHY those two methods
are INCOMPATIBLE, in GOAL or METHOD, and why trying to do
both all at once under some "hybrid" model can only make it worse
|
Let's start with why MDS is not compatible with, or suitable for, finding
clusters, as in "cluster analysis". I had already given one reason in
the related post in question; it is (1) below:
(1) A "proper and useful" cluster solution does NOT depend
on the representation (or even existence) in any dimension, but it
does depend heavily on the clustering "criterion"; different criteria
produce many DIFFERENT solutions.
(2) There are literally HUNDREDS of clustering algorithms based on
at least a dozen different criteria for "clusters". Often
a particular single criterion can be carried out by different kinds
of algorithms:
For example,
(a) Agglomerative (start with n points as n clusters and combine
two at a time until the final cluster has n points)
(b) Divisive (start with n points as 1 cluster and split one each
time until there are n clusters of single points at the end).
(c) partition n points into a specified number of clusters (groups)
by optimizing certain criteria, such as minimum Within Groups
SS, Ward's Centroid method, etc.
(d) Iterative (if the number of clusters is specified in advance)
such as various k-means methods.
(e) Jardine-Sibson and others where clusters may be overlapping.
None of the above defines what a "cluster" is, or what constitutes a
cluster. They are based on some global-partition (a-d) or a global
criterion that allows overlapping sets (hence not a partition).
Clustering (or grouping of objects) is an easy and intuitive concept
to grasp, but nearly impossible to define or find. Throw a bunch
(n pieces, n moderately large) of candy on the floor, and any 4-year
old can "cluster" them better than any of your clustering algorithms
can!
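To make the point in (1) and (2) concrete, here is a minimal sketch (my own toy illustration, not taken from any of the papers cited in this thread) of a naive agglomerative procedure run twice on the same six 1-D points, once with the single-linkage criterion and once with the complete-linkage criterion. The function name `agglomerate` and the data are assumptions made up for the example.

```python
from itertools import combinations


def agglomerate(points, n_clusters, linkage):
    """Naive agglomerative clustering of 1-D points.

    `linkage` is applied to all pairwise distances between two clusters:
    pass `min` for single linkage, `max` for complete linkage.
    """
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(
                abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return sorted(sorted(c) for c in clusters)


# Toy data: a chain whose gaps grow slowly (10, 11, 12, 13, 14).
points = [0, 10, 21, 33, 46, 60]
single = agglomerate(points, 2, min)
complete = agglomerate(points, 2, max)
print(single)    # one long chain plus one isolated point
print(complete)  # two more compact groups
```

Same data, same number of clusters, two different answers: single linkage chains five points together and isolates the last one, while complete linkage cuts the chain in a different place.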
(f) Ling (1972) "On the theory and construction of K-Clusters",
Computer Journal, 15, 326-332.
Ling (1973) "A Probability Theory of Cluster Analysis," JASA
68, 159-164.
made a "cluster" a well-defined object, defined by two
parameters, K and delta: each point in the cluster is within
a distance delta of at least K other points in the cluster, and
the cluster is well isolated (separated) from other clusters.
K = 1 would yield the "single-linkage" clusters, while larger
integer values of K require the clusters to be more and more
"compact", and less "stringy".
(g) There are many other cluster criteria and algorithms. You
can find numerous books titled, or with key phrases, "Cluster
Analysis" or "Clustering algorithms". There are also journal
articles reviewing and comparing these criteria and methods.
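As a rough illustration of the K = 1 case mentioned in (f), the sketch below links every pair of points that lie within a distance delta of each other and reports the connected groups, which are the single-linkage clusters at that threshold. This is only the K = 1 special case under my own simplifying assumptions; it does not implement the general K-cluster construction or the isolation index from the cited papers, and the name `threshold_components` is hypothetical.

```python
from math import dist


def threshold_components(points, delta):
    """Groups of point indices connected by chains of steps <= delta."""
    unvisited = set(range(len(points)))
    components = []
    while unvisited:
        stack = [unvisited.pop()]
        component = set(stack)
        while stack:
            i = stack.pop()
            # Every unvisited point within delta of point i joins the group.
            near = {j for j in unvisited if dist(points[i], points[j]) <= delta}
            unvisited -= near
            component |= near
            stack.extend(near)
        components.append(sorted(component))
    return sorted(components)


# Toy data: a "stringy" chain of four points and a separate tight pair.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (10, 1)]
groups = threshold_components(points, delta=1.0)
print(groups)
```

The chain of four points is kept as one cluster even though its end points are 3 units apart, which is the "stringy" behaviour that larger values of K are meant to rule out.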
(3) In view of (1) and (2) above, an INCISIVE analysis of
clusters can only be done via a specific clustering criterion
and method (or several different ones). It is unfortunate
that most of the methods are just algorithms, leaving the
concept of a "cluster" ill-defined. But that is the present
DEFECT within the global subject of Cluster Analysis, and
the existence of 100+ different algorithms and methods (often
delivering clusters with very DIFFERENT characteristics) is
the best one can do.
(4) MDS (Multidimensional Scaling) is not suitable for Clustering
because it forces the representation of n objects into n points
in a p-dimensional Euclidean space, by matching the INPUT
n x n matrix of observed Dissimilarities (not necessarily
distances) with the pairwise Distance matrix of the configuration
in p-space, often distorting the distances between points so
much that it LOSES any clustering phenomenon that is clear
via clustering methods.
(5) There are numerous books and journal articles giving examples
of the application of MDS to real problems. NONE of those
examples was concerned with, or dealt with, the existence or
non-existence of clusters! Just pick up any one of those
applications and you'll see WHY.
(6) Last but not least, the INPUT data for MDS may not be suitable
at all for clustering. For example, the famous Rothkopf data
record the % of trainees who, after listening to a pair of Morse
Code signals for letters and numbers, answered that the two
signals were the "same" rather than "different". The trainees were
confused even when the same symbol was played twice in
succession. Thus none of the A-A, B-B, etc. was 100%, and
most of the A-B and B-A and other pairs have different values
for their "same" answer, and hence the matrix is asymmetric,
and, read as distances, its diagonal entries are not equal to 0.
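To see what item (6) means in practice, here is a small sketch with a made-up 3 x 3 table of "% judged same" values. These numbers are hypothetical and are NOT Rothkopf's data. Before most clustering methods could accept such a table, one would have to force it into a symmetric, zero-diagonal dissimilarity matrix; averaging the two directions and taking 100 minus the result is one common choice, shown here only as an assumption, not as anything prescribed in this thread.

```python
def is_symmetric(matrix):
    """True if matrix[i][j] == matrix[j][i] for every pair of indices."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def to_dissimilarity(percent_same):
    """Average the two directions, convert to 100 - s, force diagonal to 0.

    Forcing the diagonal to 0 throws away the self-confusion information
    (the A-A, B-B entries), so this step is a judgment call.
    """
    n = len(percent_same)
    return [
        [
            0.0
            if i == j
            else 100.0 - (percent_same[i][j] + percent_same[j][i]) / 2.0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical "% judged same" values for three signals (not real data).
percent_same = [
    [92.0, 4.0, 6.0],
    [5.0, 84.0, 37.0],
    [4.0, 38.0, 87.0],
]
dissimilarity = to_dissimilarity(percent_same)
print(is_symmetric(percent_same), is_symmetric(dissimilarity))
```

The raw table is neither symmetric nor 100% on the diagonal; the converted matrix is usable as a distance-like input, but only because information was discarded along the way.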
In short, the general class of methods for MDS seeks a Euclidean
representation of an n x n similarity/dissimilarity matrix by a set of
points in a p-dimensional Euclidean space, in search of some
interpretable AXES in some low dimension, with complete disregard
of any clustering properties -- a goal similar to that of Factor
Analysis and Principal Components Analysis, neither of which is
concerned with clusters either.
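The distortion that comes from forcing objects into p dimensions can be measured. The sketch below computes a simple normalized stress (a metric, raw-dissimilarity variant in the spirit of Kruskal's stress, chosen by me for illustration) for three objects that are all mutually at dissimilarity 1. The function name `stress` and both configurations are assumptions for the example; the 1-D placement is just one candidate, not necessarily the best one.

```python
from itertools import combinations
from math import dist, sqrt


def stress(dissimilarities, configuration):
    """Normalized mismatch between input dissimilarities and the
    Euclidean distances of a configuration (list of coordinate tuples)."""
    pairs = list(combinations(range(len(configuration)), 2))
    fitted = [dist(configuration[i], configuration[j]) for i, j in pairs]
    target = [dissimilarities[i][j] for i, j in pairs]
    numerator = sum((d - t) ** 2 for d, t in zip(fitted, target))
    denominator = sum(d ** 2 for d in fitted)
    return sqrt(numerator / denominator)


# Three objects, every pair at dissimilarity 1.
dissimilarities = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

on_a_line = [(0.0,), (0.5,), (1.0,)]                      # p = 1
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, sqrt(3) / 2)]   # p = 2

stress_1d = stress(dissimilarities, on_a_line)
stress_2d = stress(dissimilarities, triangle)
print(round(stress_1d, 3), round(stress_2d, 3))
```

In two dimensions the equilateral triangle reproduces the input exactly (stress essentially 0). On a line no placement can keep three points mutually equidistant, so some pairwise distances must be distorted, and the stress of this placement is about 0.577.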
Given the above characteristics of Clustering and why MDS is not
suitable, it should be quite clear that Cluster Analysis is NOT
suitable for discovering any interpretable axes in the MDS sense.
In my (1973) JASA example of clustering a set of the 60 brightest stars
by their "apparent" pairwise distances in the sky, using the single-
linkage equivalent of my definition (K = 1) and the "isolation index"
associated with the probability theory of "significant clusters" in a
random graph model, I was able to correctly identify clusters (using
the probability theory alone) for 52 of the 60 stars that belonged to
named constellations, with five of the constellations perfectly
identified, including UMa (Big Dipper) and Cyg (Swan).
The constellation that was least accounted for (hence least
well-defined) was Draco, which is a serpent-like constellation that
goes between UMa (Big Dipper) and UMi (Little Dipper) but is not
well separated from the stars in those constellations.
That was perhaps an extreme example in which the goal was to
identify the stringy-connected and isolated clusters to see how
well they corresponded to the constellations that are well-known
because people have long clustered them "by eye". This was
also a problem in which there was absolutely no attempt to seek
any "interpretable dimensions" formed by the bright stars.
My bottom line:
If you want to find CLUSTERS, use clustering methods on
your original data.
If you want to seek the "structure" of "interpretable dimensions"
embedded in a dataset, there are many multivariate methods
for that, including MDS, which works for certain data (such as
Rothkopf's Morse Code data) that would not work with other
methods that require (among other things) a metric or
symmetry of the input matrix.
The above is my quick attempt to summarize and distinguish two
HUGE areas of multivariate analysis, each of which has a history
of 40 or more years of development.
-- Bob. |
Data Matter science forum beginner
Joined: 26 May 2005
Posts: 7
|
Posted: Thu Apr 06, 2006 5:51 am Post subject:
Re: Why Clustering and MDS are Methodologically Incompatible
|
|
|
Very informative post, thanks. Most cluster analysis texts skim over
the issue of mixed numeric and nominal variables, usually prescribing
the standard solution of turning nominal variables into a string of
dummy variables. For many reasons, I find this solution
unsatisfactory, and particularly so when the nominal variables have
large numbers of values (say "US State"). Do you have a suggestion for
some algorithms to try in cases with mixed variables where nominal
variables dominate and may have dozens of values each?
DM
Reef Fish science forum Guru Wannabe
Joined: 28 Apr 2005
Posts: 200
|
Posted: Thu Apr 06, 2006 12:13 pm Post subject:
Re: Why Clustering and MDS are Methodologically Incompatible
|
|
|
Data Matter wrote:
| Quote: | Very informative post, thanks.
|
You're welcome, and I appreciate the acknowledgment from someone
who, as I know from reading these groups, has had much experience
with these methods.
| Quote: | Most cluster analysis texts skim over
the issue of mixed numeric and nominal variables, usually prescribing
the standard solution of turning nominal variables into a string of
dummy variables. For many reasons, I find this solution
unsatisfactory, and particularly so when the nominal variables have
large numbers of values (say "US State").
|
In clustering, the available variables are often collapsed into an
n x n similarity or distance matrix through various metrics and
combinations of metrics. If one of the variables is categorical with
a large number of values (such as US States), that's a very
special problem -- one that may be handled by condensing the values
into a smaller number of categories, such as "regions", before proceeding.
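As one possible way of "collapsing" mixed variables into pairwise dissimilarities, here is a sketch of a Gower-style coefficient: numeric variables contribute a range-scaled absolute difference and nominal variables contribute 0 (match) or 1 (mismatch). This is a technique I am swapping in for illustration, not one named in this thread, and the records and the state-to-region mapping are hypothetical.

```python
def gower_dissimilarity(a, b, numeric_ranges):
    """Average per-variable dissimilarity between two records (dicts).

    Keys listed in numeric_ranges are numeric: |a - b| / range.
    All other keys are treated as nominal: 0 if equal, 1 otherwise.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical coarsening of a many-valued nominal variable.
STATE_TO_REGION = {"NY": "Northeast", "NJ": "Northeast", "CA": "West"}

records = [
    {"income": 40.0, "state": "NY"},
    {"income": 60.0, "state": "NJ"},
    {"income": 45.0, "state": "CA"},
]
ranges = {"income": 20.0}  # max - min of income in this toy data

by_state = gower_dissimilarity(records[0], records[1], ranges)

# Replace each state by its region, then recompute the same pair.
coarse = [{**r, "state": STATE_TO_REGION[r["state"]]} for r in records]
by_region = gower_dissimilarity(coarse[0], coarse[1], ranges)
print(by_state, by_region)
```

With raw states, NY and NJ count as a complete mismatch; after condensing to regions they match, and the pair's dissimilarity drops from 1.0 to 0.5. How the variables are combined changes the input matrix before any clustering algorithm is even chosen.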
Other times, there are methods (Hartigan and others) that cluster
the p variables AND the n objects simultaneously -- that is NOT
in any shape or form a multidimensional scaling method -- but
a simultaneous clustering method -- which is actually better than
using the nxn distance matrix most of the time, if doable.
| Quote: | Do you have a suggestion for
some algorithms to try in cases with mixed variables where nominal
variables dominate and may have dozens of values each?
DM
|
It's no longer a question of an "algorithm". Here your problem is
how to COMBINE the available information in your variables
BEFORE you start finding clusters.
There are no algorithms to handle these problems. Moreover, you
have to deal with the subject matter: what OBJECTS you are trying
to cluster, and WHY you are seeking clusters.
Often when you don't have a natural n x n pairwise distance matrix
to begin with, but an n x p matrix of many variables, there are
other, often more suitable methods, of analyzing the problem,
such as those in market segmentation where the geographical
regions do come into play.
Sorry I can't give you a more specific response than this.
-- Bob.
Phil Sherrod science forum beginner
Joined: 08 Jun 2005
Posts: 37
|
Posted: Thu Apr 06, 2006 12:31 pm Post subject:
Re: Why Clustering and MDS are Methodologically Incompatible
|
|
|
On 6-Apr-2006, "Data Matter" <fungile@gmail.com> wrote:
| Quote: | Most cluster analysis texts skim over
the issue of mixed numeric and nominal variables, usually prescribing
the standard solution of turning nominal variables into a string of
dummy variables. For many reasons, I find this solution
unsatisfactory, and particularly so when the nominal variables have
large numbers of values (say "US State"). Do you have a suggestion for
some algorithms to try in cases with mixed variables where nominal
variables dominate and may have dozens of values each?
|
Decision tree based models handle categorical variables with large numbers of
categories well because they don't have to generate dummy variables. They just
partition the categories. I have customers building decision tree models with
categorical variables including state, zipcode, automobile model, etc.
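For readers unfamiliar with how a tree can "partition the categories": a generic binary split on a nominal variable sends some subset of its categories one way and the rest the other way. The sketch below (my own illustration, not a description of any particular product) enumerates those splits and counts them; for k categories there are 2^(k-1) - 1 distinct ones, so actual implementations may rely on shortcuts rather than exhaustive search.

```python
def binary_split_count(n_categories):
    """Number of ways to split n categories into two non-empty groups."""
    return 2 ** (n_categories - 1) - 1


def binary_splits(categories):
    """Enumerate every such split for a small list of categories."""
    first, rest = categories[0], categories[1:]
    splits = []
    # Keep the first category on the left so mirror images are not repeated;
    # the all-ones mask is skipped because it would leave the right side empty.
    for mask in range(2 ** len(rest) - 1):
        left = [first] + [c for k, c in enumerate(rest) if (mask >> k) & 1]
        right = [c for k, c in enumerate(rest) if not (mask >> k) & 1]
        splits.append((left, right))
    return splits


small = binary_splits(["a", "b", "c"])
print(small)
print(binary_split_count(50))  # a 50-level variable such as US state
```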
--
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com (decision tree and SVM modeling)
http://www.nlreg.com (nonlinear regression) |
Data Matter science forum beginner
Joined: 26 May 2005
Posts: 7
|
Posted: Thu Apr 06, 2006 8:18 pm Post subject:
Re: Why Clustering and MDS are Methodologically Incompatible
|
|
|
Your customers must have response variables to work with. Do you have
a way to do unsupervised learning with decision trees?
The efficacy of tree methods to deal with categorical variables with
many categories deserves a separate topic. The ability to do it does
not imply anything about the effectiveness. It is still better to
reduce the number of categories first before throwing them into the
tree algorithm.
Phil Sherrod science forum beginner
Joined: 08 Jun 2005
Posts: 37
|
Posted: Thu Apr 06, 2006 8:28 pm Post subject:
Re: Why Clustering and MDS are Methodologically Incompatible
|
|
|
On 6-Apr-2006, "Data Matter" <fungile@gmail.com> wrote:
| Quote: | Your customers must have response variables to work with. Do you have
a way to do unsupervised learning with decision trees?
|
Yes. I'm sorry, I missed the beginning of this thread.
| Quote: | The efficacy of tree methods to deal with categorical variables with
many categories deserves a separate topic. The ability to do it does
not imply anything about the effectiveness. It is still better to
reduce the number of categories first before throwing them into the
tree algorithm.
|
Why would you reduce the number of categories? That just adds extra steps
and removes potentially useful information.
--
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com (decision tree and SVM predictive modeling)
http://www.nlreg.com (nonlinear regression) |
Data Matter science forum beginner
Joined: 26 May 2005
Posts: 7
|
Posted: Sat Apr 08, 2006 5:32 am Post subject:
Re: Why Clustering and MDS are Methodologically Incompatible
|
|
|
Reef Fish wrote:
| Quote: | Data Matter wrote:
Very informative post, thanks.
You're welcome, and I appreciate the acknowledgment from someone
who, as I know from reading these groups, has had much experience
with these methods.
|
You can never learn enough from analyzing real data.
| Quote: |
Other times, there are methods (Hartigan and others) that cluster
the p variables AND the n objects simultaneously -- that is NOT
in any shape or form a multidimensional scaling method -- but
a simultaneous clustering method -- which is actually better than
using the nxn distance matrix most of the time, if doable.
|
Can you suggest references for these methods? I'm not sure which specific
methods you are talking about.
| Quote: | There are no algorithms to handle these problems. Moreover, you
have to deal with subject matter of what OBJECTS you are trying
to cluster, and WHY are you seeking clusters.
|
For the particular application I'm thinking about, I'm hoping to find a
parsimonious partitioning for a huge (large n, moderate p) data set.
I'm not looking for "natural" clusters or clusters that can be explained.
Although p is moderate, many of the variables are categorical with
dozens of levels, which, as you suggested, should really be grouped first.
DM |
Data Matter science forum beginner
Joined: 26 May 2005
Posts: 7
|
Posted: Sat Apr 08, 2006 5:34 am Post subject:
Re: Why Clustering and MDS are Methodologically Incompatible
|
|
|
Reducing the number of categories is a pretty standard practice. When
the categorical variable has so many levels, it is very likely that it
is highly skewed. There will be many levels with few data values and
thus not so useful for modeling. |
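A minimal sketch of the kind of reduction described above: count how often each level occurs and lump every level seen fewer than `min_count` times into a single catch-all level. The function name, the threshold, and the toy data are assumptions for illustration; choosing the threshold (or grouping by subject-matter knowledge instead, e.g. states into regions) is a separate decision.

```python
from collections import Counter


def lump_rare_levels(values, min_count, other_label="Other"):
    """Replace levels that occur fewer than min_count times by other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Skewed toy variable: two common levels and three levels seen once each.
states = ["CA"] * 5 + ["TX"] * 4 + ["VT", "WY", "ND"]
lumped = lump_rare_levels(states, min_count=3)
print(Counter(lumped))
```

Five levels become three, and the three sparse levels no longer each claim their own category.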