Why Clustering and MDS are Methodologically Incompatible
Reef Fish
science forum Guru Wannabe


Joined: 28 Apr 2005
Posts: 200

Posted: Tue Apr 04, 2006 5:34 pm    Post subject: Why Clustering and MDS are Methodologically Incompatible

In a related thread, I wrote

Quote:
I'll post a separate post to explain WHY those two methods
are INCOMPATIBLE, in GOAL or METHOD, and why trying to do
both all at once under some "hybrid" model can only make it worse.

Let's start with why MDS is not compatible with, or suitable for, finding
clusters, as in "cluster analysis". I had already given one reason in
the related post in question; it is (1) below:

(1) A "proper and useful" cluster solution does NOT depend
on the representation (or even existence) in any dimension, but it
does depend heavily on the clustering "criterion"; different
criteria produce many DIFFERENT solutions.

(2) There are literally HUNDREDS of clustering algorithms based on
at least a dozen different criteria for "clusters". Often
a particular single criterion can be carried out by different kinds
of algorithms:

For example,

(a) Agglomerative (start with n points as n clusters and combine
two at a time until the final cluster has n points)

(b) Divisive (start with n points as 1 cluster and split one each
time until there are n clusters of single points at the end).

(c) Partition n points into a specified number of clusters (groups)
by optimizing certain criteria, such as minimum Within Groups
SS, Ward's method, the Centroid method, etc.

(d) Iterative (if the number of clusters is specified in advance)
such as various k-means methods.

(e) Jardine-Sibson and others where clusters may be overlapping.

None of the above defines what a "cluster" is, or what constitutes a
cluster. They are based on some global-partition (a-d) or a global
criterion that allows overlapping sets (hence not a partition).
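
To see how much the criterion matters, here is a minimal sketch of the agglomerative scheme in (a), run twice on the same six points with two different between-cluster distances (single linkage and complete linkage). The data, the function name `agglomerate` and the choice of these two linkages are assumptions made for illustration only; this is a naive toy, not a reference implementation of any published algorithm.

```python
from itertools import combinations

def agglomerate(points, n_clusters, linkage):
    """Naive agglomerative clustering of 1-D points.

    `linkage` is applied to all pairwise distances between two clusters:
    min gives single linkage, max gives complete linkage.
    """
    clusters = [[p] for p in points]  # start with n clusters of one point
    while len(clusters) > n_clusters:
        # pick the pair of clusters (i < j) with the smallest linkage distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(
                abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Hypothetical data: a chain of points with slowly growing gaps.
points = [0, 10, 21, 33, 46, 61]

single = agglomerate(points, 2, min)    # [[0, 10, 21, 33, 46], [61]]
complete = agglomerate(points, 2, max)  # [[0, 10, 21, 33], [46, 61]]
print(single)
print(complete)
```

Same data, same number of clusters, same agglomerative scheme, and the two criteria still disagree about where the split belongs.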

Clustering (or grouping of objects) is an easy and intuitive concept
to grasp, but nearly impossible to define or find. Throw a bunch
(n pieces, n moderately large) of candy on the floor, and any 4-year
old can "cluster" them better than any of your clustering algorithms
can!

(f) Ling (1972) "On the theory and construction of K-Clusters",
Computer Journal, 15, 326-332.
Ling (1973) "A Probability Theory of Cluster Analysis," JASA
68, 159-164.

made a "cluster" a well-defined object, defined by two
parameters, K and delta: every point in the cluster lies within
a distance delta of at least K other points in the cluster, and
the cluster is well isolated (separated) from other clusters.
K = 1 would yield the "single-linkage" clusters, while larger
integer values of K require the clusters to be more and more
"compact", and less "stringy".

(g) There are many other cluster criteria and algorithms. You
can find numerous books titled, or with key phrases, "Cluster
Analysis" or "Clustering algorithms". There are also journal
articles reviewing and comparing these criteria and methods.
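
The K and delta idea in (f) can be sketched roughly as follows. This is only a loose reading of the description in this post, not the formal definition from the 1972 and 1973 papers: with K = 1, candidate clusters are taken to be the connected components of the graph that links any two points no more than delta apart, and `min_degree` reports the smallest number of within-delta neighbours any member has inside its own cluster, as a crude indicator of "compact" versus "stringy". The points, the value of delta and both function names are hypothetical.

```python
from math import dist

def threshold_clusters(points, delta):
    """Connected components of the graph joining points at distance <= delta."""
    parent = list(range(len(points)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= delta:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def min_degree(points, cluster, delta):
    """Smallest count of within-delta neighbours a member has inside `cluster`."""
    return min(
        sum(1 for j in cluster if j != i and dist(points[i], points[j]) <= delta)
        for i in cluster
    )

# Hypothetical data: a stringy chain (indices 0-3) and a compact triangle (4-6).
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0), (10.5, 0.8)]
clusters = threshold_clusters(points, delta=1.5)
degrees = [min_degree(points, c, 1.5) for c in clusters]
print(clusters)  # [[0, 1, 2, 3], [4, 5, 6]]
print(degrees)   # [1, 2]
```

Both groups come out as clusters at K = 1, but only the triangle would survive a requirement of K = 2, because the end points of the chain each have a single neighbour within delta.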

(3) In view of (1) and (2) above, an INCISIVE analysis of
clusters can only be done via a specific clustering criterion
and method (or several different ones). It is unfortunate
that most of the methods are just algorithms, leaving the
concept of a "cluster" ill-defined. But that is the present
DEFECT within the global subject of Cluster Analysis, and
working with the 100+ different algorithms and methods (often
delivering clusters with very DIFFERENT characteristics) is
the best one can do.

(4) MDS (Multidimensional Scaling) is not suitable for Clustering
because it forces the representation of n objects into n points
in a p-dimensional Euclidean space, by matching the INPUT
n x n matrix of observed Dissimilarities (not necessarily
distances) with the pairwise Distance matrix of the configuration
in p-space, often distorting the distances between points so
much that it LOSES any clustering phenomenon that is clear
via clustering methods.

(5) There are numerous books and journal articles giving examples
of the application of MDS to real problems. NONE of those
examples was concerned with, or dealt with, the existence or
non-existence of clusters! Just pick up any one of those
applications and you'll see WHY.

(6) Last but not least, the INPUT data for MDS may not be suitable
at all for clustering. For example, the famous Rothkopf data
give the % of trainees who, having listened to a pair of Morse
Code signals for letters and numbers, answered "same" when asked
whether the two signals were the "same" or "different". The
trainees were confused even when the same symbol was played
twice in succession. Thus none of the A-A, B-B, etc. entries
was 100%, and most of the A-B and B-A and other pairs have
different values for their "same" answer; hence the matrix is
asymmetric, with diagonals not equal to 0 when read as distances.
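
A small sketch of what point (6) means in practice. The 3 x 3 table below is invented for illustration and is NOT the Rothkopf data; averaging the two orders of presentation and forcing the diagonal to zero is just one common convention (an assumption here) for turning such a table into something distance-like.

```python
# Hypothetical % of "same" responses: rows = first signal, columns = second.
# Invented numbers for illustration only; NOT the Rothkopf data.
same_pct = [
    [92.0, 4.0, 6.0],
    [5.0, 84.0, 37.0],
    [4.0, 38.0, 87.0],
]

def is_symmetric(m):
    """True if m[i][j] == m[j][i] for every pair of indices."""
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(n))

def to_dissimilarity(sim):
    """Average the two orders of each pair and map % 'same' to 100 - %.

    The diagonal is forced to 0, which discards the fact that identical
    signals were not always judged "same".
    """
    n = len(sim)
    return [
        [0.0 if i == j else 100.0 - (sim[i][j] + sim[j][i]) / 2.0
         for j in range(n)]
        for i in range(n)
    ]

dissim = to_dissimilarity(same_pct)
print(is_symmetric(same_pct))  # False
print(is_symmetric(dissim))    # True
print(dissim[1][2])            # 62.5
```

Any method that insists on a symmetric matrix with a zero diagonal needs a step like this first, and that step throws away exactly the asymmetry and the imperfect diagonal described above.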


In short, the general class of methods for MDS seeks a Euclidean
representation of an n x n similarity/dissimilarity matrix by a set of
points in a p-dimensional Euclidean space, in search of some
interpretable AXES in some low dimension, with complete disregard
of any clustering properties -- a goal similar to that of Factor
Analysis and Principal Components Analysis, neither of which is
concerned with clusters either.
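
The "matching" of input dissimilarities against configuration distances can be quantified with a badness-of-fit measure such as Kruskal's stress-1. The sketch below only evaluates two hand-picked candidate configurations; an actual MDS program would search for the configuration minimizing such a measure, and non-metric MDS would first transform the dissimilarities monotonically. The toy dissimilarities, the configurations and the function name `stress1` are assumptions for illustration.

```python
from itertools import combinations
from math import dist, sqrt

def stress1(dissim, config):
    """Kruskal stress-1 of `config` against raw dissimilarities (metric case)."""
    pairs = list(combinations(range(len(config)), 2))
    d = {(i, j): dist(config[i], config[j]) for i, j in pairs}
    num = sum((d[i, j] - dissim[i][j]) ** 2 for i, j in pairs)
    den = sum(d[i, j] ** 2 for i, j in pairs)
    return sqrt(num / den)

# Toy input: four objects, every pair equally dissimilar.
dissim = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]

# A 2-D candidate (unit square) cannot make all six distances equal ...
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
# ... while a regular tetrahedron in 3-D reproduces them exactly.
tetra = [
    (0, 0, 0),
    (1, 0, 0),
    (0.5, sqrt(3) / 2, 0),
    (0.5, sqrt(3) / 6, sqrt(2 / 3)),
]

stress_2d = stress1(dissim, square)
stress_3d = stress1(dissim, tetra)
print(round(stress_2d, 3))  # 0.207
print(stress_3d < 1e-9)     # True
```

In the plane the two diagonal pairs look farther apart than the four side pairs, although the input says all six pairs are equally dissimilar. That is the kind of distortion a low-dimensional picture can introduce.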

Given the above characteristics of Clustering and why MDS is not
suitable for it, it should be equally clear that Cluster Analysis is
NOT suitable for discovering any interpretable axes in the MDS sense.

In my (1973) JASA example of clustering a set of the 60 brightest stars
by their "apparent" pairwise distances in the sky, using the
single-linkage equivalent of my definition with K = 1, and the
"isolation index" associated with the probability theory of
"significant clusters" in a random graph model, I was able to
correctly identify (using the probability theory alone) 52 of the
60 stars that belonged to named constellations, with five of the
star constellations perfectly identified, including UMa (Big Dipper)
and Cyg (Swan).

The constellation that was least accounted for (hence least
well-defined) was Draco, a serpent-like constellation that
winds between UMa (Big Dipper) and UMi (Little Dipper) but is not
well separated from the stars in those constellations.

That was perhaps an extreme example in which the goal was to
identify the stringy-connected and isolated clusters to see how
well they corresponded to the constellations that are well-known
because people have long clustered them "by eye". This was
also a problem in which there was absolutely no attempt to seek
any "interpretable dimensions" formed by the bright stars.

My bottom line:

If you want to find CLUSTERS, use clustering methods on
your original data.

If you want to seek the "structure" of "interpretable dimensions"
imbedded in a dataset, there are many multivariate methods
for that, including MDS, which works for certain data (such as
Rothkopf's Morse Code data) that would not work with other
methods that require (among other things) a metric or
symmetry of the input matrix.

The above is my quick attempt to summarize and distinguish two
HUGE areas of multivariate analysis, each of which has a history
of at least 40 years of development.

-- Bob.
Data Matter
science forum beginner


Joined: 26 May 2005
Posts: 7

Posted: Thu Apr 06, 2006 5:51 am    Post subject: Re: Why Clustering and MDS are Methodologically Incompatible

Very informative post, thanks. Most cluster analysis texts skim over
the issue of mixed numeric and nominal variables, usually prescribing
the standard solution of turning nominal variables into a string of
dummy variables. For many reasons, I find this solution
unsatisfactory, and particularly so when the nominal variables have
large numbers of values (say "US State"). Do you have a suggestion for
some algorithms to try in cases with mixed variables where nominal
variables dominate and may have dozens of values each?

DM


Reef Fish
science forum Guru Wannabe


Joined: 28 Apr 2005
Posts: 200

Posted: Thu Apr 06, 2006 12:13 pm    Post subject: Re: Why Clustering and MDS are Methodologically Incompatible

Data Matter wrote:
Quote:
Very informative post, thanks.

You're welcome, and I appreciate the acknowledgment from someone
who, as I know from reading these groups, has had much experience
with these methods.


Quote:
Most cluster analysis texts skim over
the issue of mixed numeric and nominal variables, usually prescribing
the standard solution of turning nominal variables into a string of
dummy variables. For many reasons, I find this solution
unsatisfactory, and particularly so when the nominal variables have
large numbers of values (say "US State").

In clustering, the available variables are often collapsed into an
n x n similarity or distance matrix through various metrics and
combinations of metrics. If one of the variables has a large
number of categories (such as US States), that's a very
special problem -- the categories may be condensed into a smaller
number, such as "regions", before proceeding.
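
As a rough sketch of that collapsing step: condense the many-valued variable first, then combine per-variable dissimilarities into one n x n matrix. The averaging below is in the spirit of Gower's general coefficient (range-scaled differences for numeric variables, simple matching for nominal ones); the state-to-region mapping, the records and all names are hypothetical, and how to weight the variables remains a subject-matter decision.

```python
# Hypothetical, partial state-to-region mapping, for illustration only.
REGION = {"NY": "Northeast", "MA": "Northeast", "TX": "South",
          "GA": "South", "CA": "West"}

def mixed_dissimilarity(a, b, numeric_ranges):
    """Average of per-variable dissimilarities, each scaled to [0, 1]."""
    parts = []
    for key, va in a.items():
        vb = b[key]
        if key in numeric_ranges:
            parts.append(abs(va - vb) / numeric_ranges[key])  # numeric variable
        else:
            parts.append(0.0 if va == vb else 1.0)            # nominal variable
    return sum(parts) / len(parts)

# Hypothetical raw records.
raw = [
    {"state": "NY", "age": 30},
    {"state": "MA", "age": 35},
    {"state": "TX", "age": 50},
]

# Step 1: condense the many-valued nominal variable into fewer categories.
records = [{"region": REGION[r["state"]], "age": r["age"]} for r in raw]

# Step 2: collapse all variables into an n x n dissimilarity matrix.
ages = [r["age"] for r in records]
ranges = {"age": max(ages) - min(ages)}
D = [[mixed_dissimilarity(a, b, ranges) for b in records] for a in records]
print(D[0][1], D[0][2], D[1][2])  # 0.125 1.0 0.875
```

The resulting matrix D can then be handed to whichever clustering criterion suits the problem.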

Other times, there are methods (Hartigan and others) that cluster
the p variables AND the n objects simultaneously -- that is NOT
in any shape or form a multidimensional scaling method -- but
a simultaneous clustering method -- which is actually better than
using the nxn distance matrix most of the time, if doable.


Quote:
Do you have a suggestion for
some algorithms to try in cases with mixed variables where nominal
variables dominate and may have dozens of values each?

DM

It's no longer a question of an "algorithm". Here your problem is
how to COMBINE the available information in your variables
BEFORE you start finding clusters.

There are no algorithms to handle these problems. Moreover, you
have to deal with the subject matter: what OBJECTS you are trying
to cluster, and WHY you are seeking clusters.

Often when you don't have a natural n x n pairwise distance matrix
to begin with, but an n x p matrix of many variables, there are
other, often more suitable methods, of analyzing the problem,
such as those in market segmentation where the geographical
regions do come into play.

Sorry I can't give you a more specific response than this.

-- Bob.








Phil Sherrod
science forum beginner


Joined: 08 Jun 2005
Posts: 37

Posted: Thu Apr 06, 2006 12:31 pm    Post subject: Re: Why Clustering and MDS are Methodologically Incompatible

On 6-Apr-2006, "Data Matter" <fungile@gmail.com> wrote:

Quote:
Most cluster analysis texts skim over
the issue of mixed numeric and nominal variables, usually prescribing
the standard solution of turning nominal variables into a string of
dummy variables. For many reasons, I find this solution
unsatisfactory, and particularly so when the nominal variables have
large numbers of values (say "US State"). Do you have a suggestion for
some algorithms to try in cases with mixed variables where nominal
variables dominate and may have dozens of values each?

Decision tree based models handle categorical variables with large numbers of
categories well because they don't have to generate dummy variables. They just
partition the categories. I have customers building decision tree models with
categorical variables including state, zipcode, automobile model, etc.

--
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com (decision tree and SVM modeling)
http://www.nlreg.com (nonlinear regression)
Data Matter
science forum beginner


Joined: 26 May 2005
Posts: 7

Posted: Thu Apr 06, 2006 8:18 pm    Post subject: Re: Why Clustering and MDS are Methodologically Incompatible

Your customers must have response variables to work with. Do you have
a way to do unsupervised learning with decision trees?

The efficacy of tree methods to deal with categorical variables with
many categories deserves a separate topic. The ability to do it does
not imply anything about the effectiveness. It is still better to
reduce the number of categories first before throwing them into the
tree algorithm.

Phil Sherrod
science forum beginner


Joined: 08 Jun 2005
Posts: 37

Posted: Thu Apr 06, 2006 8:28 pm    Post subject: Re: Why Clustering and MDS are Methodologically Incompatible

On 6-Apr-2006, "Data Matter" <fungile@gmail.com> wrote:

Quote:
Your customers must have response variables to work with. Do you have
a way to do unsupervised learning with decision trees?

Yes. I'm sorry, I missed the beginning of this thread.

Quote:
The efficacy of tree methods to deal with categorical variables with
many categories deserves a separate topic. The ability to do it does
not imply anything about the effectiveness. It is still better to
reduce the number of categories first before throwing them into the
tree algorithm.

Why would you reduce the number of categories? That just adds extra steps
and removes potentially useful information.

--
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com (decision tree and SVM predictive modeling)
http://www.nlreg.com (nonlinear regression)
Data Matter
science forum beginner


Joined: 26 May 2005
Posts: 7

Posted: Sat Apr 08, 2006 5:32 am    Post subject: Re: Why Clustering and MDS are Methodologically Incompatible

Reef Fish wrote:
Quote:
Data Matter wrote:
Very informative post, thanks.

You're welcome, and I appreciate the acknowledgment from someone
I know, from reading these groups, who have had much experience
with these methods.

you can never learn enough from analyzing real data


Quote:


Other times, there are methods (Hartigan and others) that cluster
the p variables AND the n objects simultaneously -- that is NOT
in any shape or form a multidimensional scaling method -- but
a simultaneous clustering method -- which is actually better than
using the nxn distance matrix most of the time, if doable.

Can you suggest references for these methods? I'm not sure which
specific methods you are talking about.

Quote:
There are no algorithms to handle these problems. Moreover, you
have to deal with subject matter of what OBJECTS you are trying
to cluster, and WHY are you seeking clusters.


For the particular application I'm thinking about, I'm hoping to find a
parsimonious partitioning for a huge (large n, moderate p) data set.
I'm not looking for "natural" clusters or clusters that can be explained.
Although p is moderate, many of the variables are categorical with
dozens of levels which, as you suggested, should really be grouped first.

DM
Data Matter
science forum beginner


Joined: 26 May 2005
Posts: 7

Posted: Sat Apr 08, 2006 5:34 am    Post subject: Re: Why Clustering and MDS are Methodologically Incompatible

Reducing the number of categories is a pretty standard practice. When
the categorical variable has so many levels, it is very likely that it
is highly skewed. There will be many levels with few data values and
thus not so useful for modeling.
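
A minimal sketch of one such reduction, assuming a simple frequency cutoff: levels with fewer than `min_count` observations are lumped into a single "Other" level. The cutoff, the label, the data and the function name are all hypothetical; grouping by subject-matter knowledge (states into regions, say) is an alternative.

```python
from collections import Counter

def lump_rare_levels(values, min_count, other="Other"):
    """Replace levels seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Hypothetical skewed variable: a few common levels, several rare ones.
states = ["CA"] * 5 + ["TX"] * 4 + ["NY"] * 3 + ["WY", "VT", "AK"]

lumped = lump_rare_levels(states, min_count=3)
level_counts = Counter(lumped)
print(level_counts)
```

Six levels become four, and the three singletons end up sharing one level with three observations.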