Clustering: if one cluster is huge, the other one is small...
b83503104@yahoo.com
science forum beginner
Joined: 24 Jul 2005
Posts: 4

Posted: Sat Jun 03, 2006 7:42 am    Post subject: Clustering: if one cluster is huge, the other one is small...

I have a doubt in mind (meaning I haven't run experiments on this;
I'm purely guessing):

Suppose the data naturally consist of N clusters, but some of them have
K points and others have only L << K points.

My conjecture is that k-means clustering will not find the N clusters
(even if the correct number of clusters, N, is specified).

Reason for the conjecture: the small clusters have very little influence
on the objective function, so k-means will partition the large clusters
into subclusters and more or less ignore the small ones.

My real concern is: under such circumstances, is there any method by
which the N clusters could still be discovered, either with another
clustering algorithm or with some clever preprocessing?
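
A quick way to probe the conjecture without a full experiment is to compare costs directly. The sketch below is only an illustration on made-up 1-D data (1000 evenly spaced points on [0, 10] plus 5 points near 12; the data and all names are assumptions, not from any real dataset). Because the optimal 2-means partition of sorted 1-D data is a contiguous split, scanning every split point finds the global minimum of the k-means objective, which can then be compared with the cost of the "natural" large/small partition.

```python
def sse(points):
    """Sum of squared deviations from the mean (k-means cost of one cluster)."""
    if not points:
        return 0.0
    mean = sum(points) / len(points)
    return sum((x - mean) ** 2 for x in points)


# Hypothetical data: a wide cluster of 1000 points on [0, 10]
# and a tight cluster of 5 points near 12.
large = [10.0 * i / 999 for i in range(1000)]
small = [12.0 + 0.05 * i for i in range(5)]
data = sorted(large + small)

# Cost of the "natural" partition: large cluster vs. small cluster.
natural_cost = sse(large) + sse(small)

# In 1-D the optimal 2-means partition is a contiguous split of the
# sorted data, so trying every split point finds the global optimum.
best_cost, best_split = min(
    (sse(data[:i]) + sse(data[i:]), i) for i in range(1, len(data))
)

print(f"natural partition cost: {natural_cost:.1f}")
print(f"best 2-means cost:      {best_cost:.1f}")
print(f"best split: {best_split} points left, {len(data) - best_split} right")
```

On this particular data the minimum-cost split falls inside the large cluster, which is consistent with the conjecture. It says nothing about other geometries, e.g. when the small cluster lies much farther away.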

guerif@gmail.com
science forum beginner
Joined: 03 Jun 2006
Posts: 1

Posted: Sat Jun 03, 2006 3:16 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

b83503104@yahoo.com wrote:

Quote:
I have this doubt in mind (that means, I didn't run experiments on
this, but purely guessing):

If the data is naturally consisted of N clusters, but some of them have
K points, some of them have L << K points.

My conjecture is, k-means clustering will not find the N clusters (even
if the correct number of clusters, N, is specified).

Reason of conjecture: the small clusters have very little influence on
the objective function, so k-means will partition the large clusters
into subclusters, and kind of ignore the small clusters.

My real concern is, under such circumstances, is there any method so
that the N clusters could still be discovered? Either by another
clustering algorithm, or by some clever preprocessing?

J.J. Verbeek has proposed a greedy variant of the k-means algorithm
called global k-means. It proceeds by progressively adding new centers.
At the beginning, there is only one center (the centroid of the data
samples). Then the data point that yields the largest decrease in the
k-means cost function is chosen as the second center, and so on, until
the required number of clusters is obtained.
See pp. 66-68 of his thesis:
http://lear.inrialpes.fr/~verbeek/papers/thesis_verbeek.pdf
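
One reading of the procedure described above can be sketched as follows. This is a plain-Python illustration, not Verbeek's code: each data point is tried as the starting position of the new center, ordinary k-means is run from there, and the trial with the lowest cost is kept. The function names (`kmeans`, `global_kmeans`) and the toy data are assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm from the given initial centers; returns (centers, cost)."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # An empty group keeps its old center.
        new_centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost


def global_kmeans(points, k):
    """Greedy scheme: start from the overall centroid, add one center at a time."""
    centers = [centroid(points)]
    for _ in range(1, k):
        best = None
        for candidate in points:  # try every data point as the new center
            trial = kmeans(points, centers + [candidate])
            if best is None or trial[1] < best[1]:
                best = trial
        centers = best[0]
    return centers


# Hypothetical toy data: 64 points on a small grid and 4 points far away.
large = [(0.1 * i, 0.1 * j) for i in range(8) for j in range(8)]
small = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
centers = global_kmeans(large + small, 2)
print(sorted(centers))
```

On this toy data, where the 4-point cluster is far from the 64-point one, the two returned centers are the two cluster centroids. Each added center costs one k-means run per data point, so the scheme gets expensive on large data sets.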

My first intuition is that if the correct number of clusters is known
and the small clusters are well separated from the biggest one, global
k-means should succeed. Nevertheless, this is pure guessing too...

Hope this helps!
Sebastien

--
Sebastien Guerif
http://www-lipn.univ-paris13.fr/~guerif

Reef Fish
science forum Guru Wannabe
Joined: 28 Apr 2005
Posts: 200

Posted: Sat Jun 03, 2006 4:20 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

b83503104@yahoo.com wrote:
Quote:

My conjecture is, k-means clustering will not find the N clusters (even
if the correct number of clusters, N, is specified).

Reason of conjecture: the small clusters have very little influence on
the objective function, so k-means will partition the large clusters
into subclusters, and kind of ignore the small clusters.

Therein lies the crux of the problem in the whole field of cluster
analysis.
Nobody (present company excepted) has tried to define what constitutes
a "cluster" and then find it/them.

Most clustering methods are based on GLOBAL optimization criteria,
while clustering is usually a LOCAL phenomenon.

To compound the problem of not defining what constitutes a cluster,
the market is flooded with ALGORITHMS that have no guarantee of
finding anything that suits anybody. But they always deliver
clusters (whether there is a clustering phenomenon or not). GIGO
(Garbage In, Garbage Out).

That was about how things stood when I left the field about 20 years
ago. I don't believe the field of cluster analysis has made any
progress since.

Quote:

My real concern is, under such circumstances, is there any method so
that the N clusters could still be discovered? Either by another
clustering algorithm, or by some clever preprocessing?

As the story goes ... if you have a million monkeys (cluster analysts)
on millions of typewriters (clustering algorithms) typing for a million
years, you are almost guaranteed to discover some clusters that make
some monkey (or non-monkey) happy to have found something without
even knowing what they were looking for.

-- Bob.

Russell.Martin@wdn.com
science forum beginner
Joined: 13 Sep 2005
Posts: 24

Posted: Sat Jun 03, 2006 5:11 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

b83503104@yahoo.com wrote:
Quote:
I have this doubt in mind (that means, I didn't run experiments on
this, but purely guessing):

If the data is naturally consisted of N clusters, but some of them have
K points, some of them have L << K points.

My conjecture is, k-means clustering will not find the N clusters (even
if the correct number of clusters, N, is specified).

Reason of conjecture: the small clusters have very little influence on
the objective function, so k-means will partition the large clusters
into subclusters, and kind of ignore the small clusters.

My real concern is, under such circumstances, is there any method so
that the N clusters could still be discovered? Either by another
clustering algorithm, or by some clever preprocessing?

There has been some work in climatology using neural networks
to define sets of locations with similar climates, which is an
application for which clustering has also been used. I reviewed
a paper on the subject some years ago, so I read up on the
literature. The results were good, but whether or not they would
pass Reef Fish's "monkey" or GIGO tests, I don't know. I could
probably hunt up some references for you if you really want
them, but I'd have to do that from my office.

Cheers,
Russell

Reef Fish
science forum Guru Wannabe
Joined: 28 Apr 2005
Posts: 200

Posted: Sat Jun 03, 2006 5:32 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

Russell.Martin@wdn.com wrote:
Quote:
b83503104@yahoo.com wrote:
I have this doubt in mind (that means, I didn't run experiments on
this, but purely guessing):

If the data is naturally consisted of N clusters, but some of them have
K points, some of them have L << K points.

My conjecture is, k-means clustering will not find the N clusters (even
if the correct number of clusters, N, is specified).

Reason of conjecture: the small clusters have very little influence on
the objective function, so k-means will partition the large clusters
into subclusters, and kind of ignore the small clusters.

My real concern is, under such circumstances, is there any method so
that the N clusters could still be discovered? Either by another
clustering algorithm, or by some clever preprocessing?

There has been some work in climatology using neural networks
to define sets of locations with similar climates, which is an
application for which clustering has also been used. I reviewed
a paper on the subject some years ago, so I read up on the
literature. The results were good, but whether ot not they would
pass Reef Fish's "monkey" or GIGO tests, I don't know.

Those can easily be tested by ONE simple litmus test:

1. Is a "cluster" an EXPLICITLY well-defined object which
satisfies certain properties (before one starts looking for it)?

If (1) is not satisfied, then any method seeking clusters is just
monkeys on a witch hunt.

Quote:
I could
probably hunt up some references for you if your really want
them, but I'd have to do that from my office.

Cheers,
Russell

Russell or anyone else, I would like to know of ANY method that
passes the litmus test of giving a well-defined meaning to a
"cluster", in any field, BEFORE the monkeys start searching for
them.

Life is too short to read another thousand papers on the subject of
clustering, on top of those I have already squandered my time
reading. ;0(

-- Bob.

b83503104@yahoo.com
science forum beginner
Joined: 24 Jul 2005
Posts: 4

Posted: Sat Jun 03, 2006 6:27 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

Quote:
Those can easily be tested by ONE simple litmus-test:

1. Is a "cluster" an EXPLICITLY well-defined object which
satisfies certain properties (before one starts looking for it)?

If (1) is not satisfied, then any method seeking clusters is just
monkey's on a witch hunt.

I guess the problem is how to "explicitly" define a cluster. If the
definition is too explicit (such as providing the algorithm with
input-output pairs), then the problem becomes classification/regression.
By definition, clustering does not provide "labels", hence it is meant
to be implicit. So the goal should be to define an objective function
that can somehow achieve the explicit goal. I have the feeling that this
is somewhat similar to the concept of a "prior" in Bayesian statistics.

So I think the OP's question is still worth discussing: is there
perhaps a way to define an objective function so that small clusters
still get a chance to be discovered?

Russell.Martin@wdn.com
science forum beginner
Joined: 13 Sep 2005
Posts: 24

Posted: Sat Jun 03, 2006 7:18 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

Reef Fish wrote:
Quote:
Russell.Martin@wdn.com wrote:
b83503104@yahoo.com wrote:
I have this doubt in mind (that means, I didn't run experiments on
this, but purely guessing):

If the data is naturally consisted of N clusters, but some of them have
K points, some of them have L << K points.

My conjecture is, k-means clustering will not find the N clusters (even
if the correct number of clusters, N, is specified).

Reason of conjecture: the small clusters have very little influence on
the objective function, so k-means will partition the large clusters
into subclusters, and kind of ignore the small clusters.

My real concern is, under such circumstances, is there any method so
that the N clusters could still be discovered? Either by another
clustering algorithm, or by some clever preprocessing?

There has been some work in climatology using neural networks
to define sets of locations with similar climates, which is an
application for which clustering has also been used. I reviewed
a paper on the subject some years ago, so I read up on the
literature. The results were good, but whether ot not they would
pass Reef Fish's "monkey" or GIGO tests, I don't know.

Those can easily be tested by ONE simple litmus-test:

1. Is a "cluster" an EXPLICITLY well-defined object which
satisfies certain properties (before one starts looking for it)?

The idea, at least in climatology, is to have the method define
which elements of the set are similar in a multivariate situation,
followed by a human sanity check. IOW the joint properties
which are to be satisfied are not necessarily known a priori, e.g.
which values of temperature versus precipitation versus humidity
versus etc. constitute "similar". So if I understand your criterion,
the answer is no, but IMO that does not necessarily mean the
approach is without value as an exploratory technique, which
is how it is generally used in my field. I agree that the
techniques often leave something to be desired.

Quote:

If (1) is not satisfied, then any method seeking clusters is just
monkey's on a witch hunt.

Monkeys hunt witches? I thought monkeys worked for
witches hunting Dorothies (at least in "The Wizard of OZ"). ;-)

Cheers,
Russell

Reef Fish
science forum Guru Wannabe
Joined: 28 Apr 2005
Posts: 200

Posted: Sat Jun 03, 2006 7:33 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

b83503104@yahoo.com wrote:
Quote:
Those can easily be tested by ONE simple litmus-test:

1. Is a "cluster" an EXPLICITLY well-defined object which
satisfies certain properties (before one starts looking for it)?

If (1) is not satisfied, then any method seeking clusters is just
monkey's on a witch hunt.

I guess the problem is how to "explicitly" define a cluster.

It usually means at least these two properties:

A. It must be well SEPARATED from other points and clusters (so that
it is an identifiable cluster: you can LOOK at it and see that
it's there by itself).

B. The points within a cluster must be somehow close to each other
to be called a cluster.

Now you can start from there.
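
Properties A and B can be turned into a checkable definition in many ways. The sketch below takes one possible formalization, which is an assumption of this example and not something stated in the post: a group of points counts as a cluster if its diameter (largest distance between two members) is smaller than its gap (smallest distance from a member to a non-member). The helper names and the toy data are hypothetical.

```python
from itertools import combinations
from math import dist


def diameter(cluster):
    """Property B: largest distance between two members (0 for a single point)."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def gap(cluster, others):
    """Property A: smallest distance from a member to any non-member."""
    return min((dist(p, q) for p in cluster for q in others), default=float("inf"))


def is_well_defined(cluster, others):
    """Assumed criterion: members are closer to each other than to outsiders."""
    return diameter(cluster) < gap(cluster, others)


# Hypothetical toy data: 64 points on a small grid and 4 points far away.
large = [(0.1 * i, 0.1 * j) for i in range(8) for j in range(8)]
small = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]

# An arbitrary cut through the middle of the grid, for comparison.
left_half = [p for p in large if p[0] < 0.35]
right_half = [p for p in large if p[0] >= 0.35]

print("small cluster passes:", is_well_defined(small, large))
print("large cluster passes:", is_well_defined(large, small))
print("left half passes:    ", is_well_defined(left_half, right_half + small))
```

Under this definition the check depends only on the distances, not on how many points a group has, so the 4-point group passes just as the 64-point group does, while the arbitrary half of the grid does not.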

Quote:
If it is
too explicit (such as providing the algorithm input-output pairs) then
the problem becomes classification/regression.

You are already mis-stepping when you put an algorithm before
a definition.


Quote:
By definition, clustering does not provided "labels", hence it is
meant to be implicit.

What does LABELS have to do with clustering? You can always
label later, if you have the definition for what constitutes a cluster
and find those. I think you misunderstood the meaning of
implicit.

Quote:
So the goal should be to define an objective function that
can somehow achieve the explicit goal. I have the feeling that this is
somewhat similar to the "prior" concept in Bayesian statistics.

You're just parroting some really old wheels.


Quote:
So I think the OP's question is still a valid discussion: Is there
perhaps a way to define an objective function, so that small clusters
still get a chance to be discovered.

It's a valid question that has no easy valid answers until you can
tell what constitutes a CLUSTER.

A "small cluster" is itself an undefined concept. It must be small
relative to something, but not too small. Otherwise, it's easy
to make a definition under which every single point is its own
cluster, except for points lying within an infinitesimal distance
of several others. Then you'll get ALL small clusters.

Now what?

-- Bob.

Reef Fish
science forum Guru Wannabe
Joined: 28 Apr 2005
Posts: 200

Posted: Sat Jun 03, 2006 7:49 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

Russell.Martin@wdn.com wrote:
Quote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
b83503104@yahoo.com wrote:
I have this doubt in mind (that means, I didn't run experiments on
this, but purely guessing):

If the data is naturally consisted of N clusters, but some of them have
K points, some of them have L << K points.

My conjecture is, k-means clustering will not find the N clusters (even
if the correct number of clusters, N, is specified).

Reason of conjecture: the small clusters have very little influence on
the objective function, so k-means will partition the large clusters
into subclusters, and kind of ignore the small clusters.

My real concern is, under such circumstances, is there any method so
that the N clusters could still be discovered? Either by another
clustering algorithm, or by some clever preprocessing?

There has been some work in climatology using neural networks
to define sets of locations with similar climates, which is an
application for which clustering has also been used. I reviewed
a paper on the subject some years ago, so I read up on the
literature. The results were good, but whether ot not they would
pass Reef Fish's "monkey" or GIGO tests, I don't know.

Those can easily be tested by ONE simple litmus-test:

1. Is a "cluster" an EXPLICITLY well-defined object which
satisfies certain properties (before one starts looking for it)?

The idea, at least in climatology, is to have the method define
which elements of the set are similar in a multivariate situation,
followed by a human sanity check. IOW the joint properties
which are to be satisfied are not necessarily known a priori, e.g.
which values of temperature versus precipitation versus humidity
versus etc. constitute "similar".

I had assumed that everyone who is in this discussion would have
known that (with few exceptions <such as simultaneous clustering
of cases and variables>) nearly ALL clustering methods start with
an nxn matrix of pairwise DISTANCES or SIMILARITIES between
every pair of points.

When the data come in as vectors of multiple observations, most
clustering methods provide choices to DEFINE the similarity or
distance to be used for the nxn pairs. There are HUNDREDS of
ways of defining similarity measures, and most clustering
routines offer several to dozens of choices for them.
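
To make the point about the nxn matrix concrete, here is a small sketch that builds the pairwise distance matrix under a selectable measure. The three measures and the three sample points are illustrative choices, not taken from any particular clustering package.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical menu of measures, keyed by name.
MEASURES = {"euclidean": dist, "manhattan": manhattan, "chebyshev": chebyshev}


def distance_matrix(observations, measure="euclidean"):
    """Return the n x n matrix of pairwise distances under the chosen measure."""
    d = MEASURES[measure]
    return [[d(a, b) for b in observations] for a in observations]


# Three made-up observations, each a vector of two variables.
points = [(0.0, 0.0), (3.0, 3.0), (0.0, 4.0)]

euclid = distance_matrix(points, "euclidean")
cheb = distance_matrix(points, "chebyshev")

for name, matrix in (("euclidean", euclid), ("chebyshev", cheb)):
    print(name)
    for row in matrix:
        print("  ", [round(v, 3) for v in row])
```

With these points, the nearest neighbor of the first point is the third under the Euclidean measure but the second under the Chebyshev measure, so the choice of measure already changes what counts as "similar" before any clustering algorithm runs.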

With that status quo, if "similarity" is not even defined until you
start looking -- that would belong to the class of DOUBLE-
WITCHING methods, beyond just monkeys looking for similar
bunches of nuts.


Quote:
So if I understand your criteria
the answer no, but IMO that does not necessarily mean the
approach is without value as an exploratory technique, which
is how it is generally used to my field. I agree that the
techniques often leave something to be desired.

From what you said, I would say your field has NOT used any of
the methods that are understood in the literature as "clustering
methods". Because there, the two minimal criteria are (1)
measures of similarity or dissimilarity; and (2) criterion or
algorithm for clustering.
Quote:


If (1) is not satisfied, then any method seeking clusters is just
monkey's on a witch hunt.

Monkeys hunt witches? I thought monkeys worked for
witches hunting Dorothies (at least in "The Wizard of OZ"). ;-)

Okay, I concede that the simile or metaphor is a crummy one,
but in clustering, the monkeys are actually looking for bunches
of nuts which they think the witches have in their pockets. :-)

-- Bob.

Russell.Martin@wdn.com
science forum beginner
Joined: 13 Sep 2005
Posts: 24

Posted: Sat Jun 03, 2006 8:25 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

Reef Fish wrote:
Quote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
b83503104@yahoo.com wrote:
I have this doubt in mind (that means, I didn't run experiments on
this, but purely guessing):

If the data is naturally consisted of N clusters, but some of them have
K points, some of them have L << K points.

My conjecture is, k-means clustering will not find the N clusters (even
if the correct number of clusters, N, is specified).

Reason of conjecture: the small clusters have very little influence on
the objective function, so k-means will partition the large clusters
into subclusters, and kind of ignore the small clusters.

My real concern is, under such circumstances, is there any method so
that the N clusters could still be discovered? Either by another
clustering algorithm, or by some clever preprocessing?

There has been some work in climatology using neural networks
to define sets of locations with similar climates, which is an
application for which clustering has also been used. I reviewed
a paper on the subject some years ago, so I read up on the
literature. The results were good, but whether ot not they would
pass Reef Fish's "monkey" or GIGO tests, I don't know.

Those can easily be tested by ONE simple litmus-test:

1. Is a "cluster" an EXPLICITLY well-defined object which
satisfies certain properties (before one starts looking for it)?

The idea, at least in climatology, is to have the method define
which elements of the set are similar in a multivariate situation,
followed by a human sanity check. IOW the joint properties
which are to be satisfied are not necessarily known a priori, e.g.
which values of temperature versus precipitation versus humidity
versus etc. constitute "similar".

I had assumed that everyone who is in this discussion would have
known that (with few exceptions <such as simultaneous clustering
of cases and variables>) nearly ALL clustering methods start with
an nxn matrix of pairwise DISTANCES or SIMILARITIES between
every pair of points.

When the data came in as a vector of multiple observations, most
clustering methods provided choices to DEFINE the similarity or
distances to be used in the nxn pairs. There are HUNDREDS of
ways of defining the similarity measures, and most clustering
routines have several to dozens of choices for them.

With that status quo, if "similarity" is not even defined until you
start looking

Of course the measure and algorithm are defined, but the
clusters are not, at least explicitly, until the analysis is run.
After the analysis is run one can examine the results to see
what ranges of values of the observations are "thought" by
the system to constitute "clusters". That was what I meant.
Sorry if that wasn't clear.

Quote:
-- that would belong to the class of DOUBLE-
WITCHING methods, beyond just monkeys looking for similar
bunches of nuts.


So if I understand your criteria
the answer no, but IMO that does not necessarily mean the
approach is without value as an exploratory technique, which
is how it is generally used to my field. I agree that the
techniques often leave something to be desired.

From what you said,

If I'd known you were looking for a complete dissertation
on the subject, I still wouldn't have provided it but I would
have at least expected that objection. But what I say on
Usenet should never be construed to be complete, or
even necessarily correct. :-)

Quote:
I would say your field has NOT used any of
the methods that are understood in the literature as "clustering
methods".

I guess I'll have to go up a floor and tell Dan Wilks that chapter
14 of his book _Statistical Methods in the Atmospheric
Sciences_, 2nd ed., should not have been titled "Cluster
Analysis". :-)

Quote:
Because there, the two minimal criteria are (1)
measures of similarity or dissimilarity; and (2) criterion or
algorithm for clustering.

Of course there are measures and algorithms. My point was
that knowing the measure and algorithm being used does not
tell you what the clusters are, unless you can run the data
through your mind like the computer.

Cheers,
Russell


Reef Fish
science forum Guru Wannabe
Joined: 28 Apr 2005
Posts: 200

Posted: Sat Jun 03, 2006 8:52 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

Russell.Martin@wdn.com wrote:
Quote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
b83503104@yahoo.com wrote:
I have this doubt in mind (that means, I didn't run experiments on
this, but purely guessing):

If the data is naturally consisted of N clusters, but some of them have
K points, some of them have L << K points.

My conjecture is, k-means clustering will not find the N clusters (even
if the correct number of clusters, N, is specified).

Reason of conjecture: the small clusters have very little influence on
the objective function, so k-means will partition the large clusters
into subclusters, and kind of ignore the small clusters.

My real concern is, under such circumstances, is there any method so
that the N clusters could still be discovered? Either by another
clustering algorithm, or by some clever preprocessing?

There has been some work in climatology using neural networks
to define sets of locations with similar climates, which is an
application for which clustering has also been used. I reviewed
a paper on the subject some years ago, so I read up on the
literature. The results were good, but whether ot not they would
pass Reef Fish's "monkey" or GIGO tests, I don't know.

Those can easily be tested by ONE simple litmus-test:

1. Is a "cluster" an EXPLICITLY well-defined object which
satisfies certain properties (before one starts looking for it)?

The idea, at least in climatology, is to have the method define
which elements of the set are similar in a multivariate situation,
followed by a human sanity check. IOW the joint properties
which are to be satisfied are not necessarily known a priori, e.g.
which values of temperature versus precipitation versus humidity
versus etc. constitute "similar".

This is what made me think that a priori you DON'T have a measure
of "similarity", but then you reversed that notion in the ultimate
paragraph of this post:

RF> > Because there, the two minimal criteria are (1)
RF> > measures of similarity or dissimilarity; and (2) criterion or
RF> > algorithm for clustering.
Quote:

RM> Of course there are measures and algorithms.


Perhaps you overlooked that those measures were COMBINED into
some measure of similarity before clustering began!


That was why I said,

Quote:
I had assumed that everyone who is in this discussion would have
known that (with few exceptions <such as simultaneous clustering
of cases and variables>) nearly ALL clustering methods start with
an nxn matrix of pairwise DISTANCES or SIMILARITIES between
every pair of points.

When the data came in as a vector of multiple observations, most
clustering methods provided choices to DEFINE the similarity or
distances to be used in the nxn pairs. There are HUNDREDS of
ways of defining the similarity measures, and most clustering
routines have several to dozens of choices for them.

With that status quo, if "similarity" is not even defined until you
start looking

Of course the measure and algorithm is defined, but the
clusters are not, at least explicitly, until the analysis is run.

At which point you have taken the Grand Tour back to where I STARTED,
by saying that's virtually what ALL the clustering methods do:
mindlessly looking for something not well-defined, via
algorithms that always deliver clusters whether they exist or not.


Quote:
After the analysis is run one can examine the results to see
what ranges of values of the observations are 'thought" by
the system to consitute "clusters". That was what I meant.
Sorry if that wasn't clear.

It is clear now. That's what OTHER cluster analysts do too.
No different from the Factor Analyst. After they have found the
"clusters" (which are often just as random a scatter as the
original data before clustering), they start making all kinds of
"sense" out of the newly found sack of nuts from the witch's
pocket.


Quote:
-- that would belong to the class of DOUBLE-
WITCHING methods, beyond just monkeys looking for similar
bunches of nuts.


So if I understand your criteria
the answer no, but IMO that does not necessarily mean the
approach is without value as an exploratory technique, which
is how it is generally used to my field. I agree that the
techniques often leave something to be desired.

From what you said,

If I'd known you were looking for a complete dissertation
on the subject, I still wouldn't have provided it but I would
have at least expected that objection. But what I say on
Usenet should never be construed to be complete, or
even necessarily correct. :-)

But you are expected to express YOURSELF correctly,
right or wrong. :-) IMHO, you mis-expressed yourself the
first time, but have now clarified that you didn't really mean
what you said before, and that these two ingredients below WERE
there in the first place:

RF> > Because there, the two minimal criteria are (1)
RF> > measures of similarity or dissimilarity; and (2) criterion or
RF> > algorithm for clustering.


Quote:
I would say your field has NOT used any of
the methods that are understood in the literature as "clustering
methods".

Had those two ingredients NOT been BOTH there.

Quote:

I guess I'll have to go up a floor and tell Dan Wilks that chapter
14 of his book _Statistical Methods in the Atmospheric
Sciences_, 2nd ed., should not have been titled "Cluster
Analysis". :-)

Then Dan would just tell you what I have just told you, that he
HAD the two key ingredients, though he may be too modest to
admit that he is a monkey looking for nuts in the witch's
pockets. :-)

Quote:
Because there, the two minimal criteria are (1)
measures of similarity or dissimilarity; and (2) criterion or
algorithm for clustering.

Of course there are measures and algorithms. My point was
that knowing the measure and algorithm being used does not
tell you what the clusters are, unless you can run the data
through your mind like the computer.

Round and round the circle we go ... you have just added
confirmation of WHY "cluster analysis" is such a mess
today, as it was 40 years ago when I started in it.

Nobody knows what they are doing. Algorithms are everywhere.
They ALL deliver "clusters" whether they are "clusters" or not.
Everyone thinks at the end of a cluster analysis that they've found
a sack full of nuts in the witch hunt, and they write a paper (for
the eggheads) or show the boss that they are worth what they
are overpaid (by showing the computer-delivered "clusters"),
until the NEXT project on the same, and everyone finds
different sacks of nuts, and the merry-go-round starts all
over again, and again, and again ... for easily 50+ years. ;-/

-- Bob.

Russell.Martin@wdn.com
science forum beginner


Joined: 13 Sep 2005
Posts: 24

Posted: Sat Jun 03, 2006 9:39 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

Reef Fish wrote:
Quote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
b83503104@yahoo.com wrote:
I have this doubt in mind (that means, I didn't run experiments on
this, but purely guessing):

If the data is naturally consisted of N clusters, but some of them have
K points, some of them have L << K points.

My conjecture is, k-means clustering will not find the N clusters (even
if the correct number of clusters, N, is specified).

Reason of conjecture: the small clusters have very little influence on
the objective function, so k-means will partition the large clusters
into subclusters, and kind of ignore the small clusters.

My real concern is, under such circumstances, is there any method so
that the N clusters could still be discovered? Either by another
clustering algorithm, or by some clever preprocessing?
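[Editorial sketch] The reasoning behind the conjecture can be checked directly against the k-means objective (the within-cluster sum of squares). The data below are hypothetical and one-dimensional: a wide cluster of 1000 points and a tight cluster of 10 points nearby. This only compares the cost of two hand-picked partitions; it does not run k-means, and it says nothing about cases where the clusters are far apart relative to their spread.

```python
def sse(points):
    """Within-cluster sum of squared deviations from the cluster mean."""
    mean = sum(points) / len(points)
    return sum((x - mean) ** 2 for x in points)

# Hypothetical data: 1000 evenly spaced points on [0, 10) and
# 10 tightly packed points just above 12.
big = [10 * i / 1000 for i in range(1000)]
small = [12 + 0.01 * i for i in range(10)]

# Partition A: the "natural" grouping (big cluster, small cluster).
cost_natural = sse(big) + sse(small)

# Partition B: split the big cluster at 5 and let the right half
# absorb the small cluster.
left = [x for x in big if x < 5]
right = [x for x in big if x >= 5] + small
cost_split = sse(left) + sse(right)

print(f"natural partition SSE: {cost_natural:.1f}")
print(f"split partition SSE:   {cost_split:.1f}")
```

In this construction the split partition has the lower cost, so an optimizer of this objective with two clusters has no incentive to isolate the small group.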

There has been some work in climatology using neural networks
to define sets of locations with similar climates, which is an
application for which clustering has also been used. I reviewed
a paper on the subject some years ago, so I read up on the
literature. The results were good, but whether or not they would
pass Reef Fish's "monkey" or GIGO tests, I don't know.

Those can easily be tested by ONE simple litmus-test:

1. Is a "cluster" an EXPLICITLY well-defined object which
satisfies certain properties (before one starts looking for it)?

The idea, at least in climatology, is to have the method define
which elements of the set are similar in a multivariate situation,
followed by a human sanity check. IOW the joint properties
which are to be satisfied are not necessarily known a priori, e.g.
which values of temperature versus precipitation versus humidity
versus etc. constitute "similar".

This is what made me think that a priori you DON'T have a measure
of "similarity", but then you reversed that notion in your ultimate
paragraph of this post:

RF> > Because there, the two minimal criteria are (1)
RF> > measures of similarity or dissimilarity; and (2) criterion or
RF> > algorithm for clustering.

RM> Of course there are measures and algorithms.

Perhaps you overlooked that those measures were COMBINED into
some measure of similarity before clustering began!


That was why I said,

I had assumed that everyone who is in this discussion would have
known that (with few exceptions <such as simultaneous clustering
of cases and variables>) nearly ALL clustering methods start with
an nxn matrix of pairwise DISTANCES or SIMILARITIES between
every pair of points.

When the data came in as a vector of multiple observations, most
clustering methods provided choices to DEFINE the similarity or
distances to be used in the nxn pairs. There are HUNDREDS of
ways of defining the similarity measures, and most clustering
routines have several to dozens of choices for them.
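[Editorial sketch] To make the point about the n x n matrix concrete, here is a minimal example that builds pairwise dissimilarity matrices from raw observation vectors under two common choices of measure. The observations and the function names are hypothetical; real clustering routines offer many more measures than these two.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two observation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def distance_matrix(rows, metric):
    """n x n matrix of pairwise dissimilarities under the given metric."""
    n = len(rows)
    return [[metric(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Hypothetical standardized observations, e.g. (temperature, precipitation).
obs = [(0.0, 0.0), (3.0, 3.0), (0.0, 5.0)]

d_euc = distance_matrix(obs, euclidean)
d_man = distance_matrix(obs, manhattan)

# Which point is "most similar" to obs[0] depends on the measure chosen.
print("Euclidean row 0:", [round(d, 2) for d in d_euc[0]])
print("Manhattan row 0:", d_man[0])
```

Under the Euclidean measure the second point is the nearer neighbour of the first; under the Manhattan measure the third point is. The choice of measure is therefore part of the definition of what gets grouped together.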

With that status quo, if "similarity" is not even defined until you
start looking

Of course the measure and algorithm is defined, but the
clusters are not, at least explicitly, until the analysis is run.

At which point you have taken the Grand Tour of where I STARTED
by saying that's virtually what ALL the clustering methods do,
mindlessly looking for something not well-defined, via
algorithms that always deliver clusters whether they exist or not.
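[Editorial sketch] The claim that the algorithms return clusters whether or not any exist is easy to illustrate. Below is a bare-bones one-dimensional k-means (plain Lloyd iterations, written here for illustration and not taken from any package) run on evenly spaced points, which have no cluster structure at all. The starting centers are arbitrary assumptions.

```python
def kmeans_1d(data, centers, max_iter=100):
    """Plain Lloyd iterations in one dimension; returns (centers, groups)."""
    centers = list(centers)
    groups = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its old center).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers, groups

# Structureless data: 100 evenly spaced points.
data = [float(i) for i in range(100)]
centers, groups = kmeans_1d(data, [10.0, 20.0, 30.0])

print("cluster sizes:", [len(g) for g in groups])
print("centers:", [round(c, 1) for c in centers])
```

The routine dutifully reports three non-empty "clusters" even though the data are a featureless grid; nothing in the output flags that the grouping is arbitrary.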


After the analysis is run one can examine the results to see
what ranges of values of the observations are "thought" by
the system to constitute "clusters". That was what I meant.
Sorry if that wasn't clear.

It is clear now. That's what OTHER cluster analysts do too.
No different from the Factor Analyst. After they found the
"clusters" (which often are just as random a scatter as the
original data before clustering) they start making all kind of
"sense" out of the newly found sack of nuts from the witch's
pocket.

If sense can be made in terms of the field, then in some
sense it doesn't matter from what source or by what route
the conclusion comes. IOW nuts are just as tasty no matter
how they are found. More on this below.

Quote:


-- that would belong to the class of DOUBLE-
WITCHING methods, beyond just monkeys looking for similar
bunches of nuts.


So if I understand your criteria
the answer is no, but IMO that does not necessarily mean the
approach is without value as an exploratory technique, which
is how it is generally used in my field. I agree that the
techniques often leave something to be desired.

From what you said,

If I'd known you were looking for a complete dissertation
on the subject, I still wouldn't have provided it but I would
have at least expected that objection. But what I say on
Usenet should never be construed to be complete, or
even necessarily correct. :-)

But you are expected to express YOURSELF correctly,
right or wrong. :-) IMHO, you mis-expressed yourself the
first time, but now clarified it that you didn't really mean what
you said before -- that these two ingredients below WERE
there in the first place:

RF> > Because there, the two minimal criteria are (1)
RF> > measures of similarity or dissimilarity; and (2) criterion or
RF> > algorithm for clustering.


I would say your field has NOT used any of
the methods that are understood in the literature as "clustering
methods".

Had those two ingredients NOT been BOTH there.


I guess I'll have to go up a floor and tell Dan Wilks that chapter
14 of his book _Statistical Methods in the Atmospheric
Sciences_, 2nd ed., should not have been titled "Cluster
Analysis". :-)

Then Dan would just tell you what I have just told you, that he
HAD the two key ingredients, though he may be quite modest
not to admit that he is a monkey looking for nuts in the witches
pockets. :-)

Because there, the two minimal criteria are (1)
measures of similarity or dissimilarity; and (2) criterion or
algorithm for clustering.

Of course there are measures and algorithms. My point was
that knowing the measure and algorithm being used does not
tell you what the clusters are, unless you can run the data
through your mind like the computer.

Round and round the circle we go ... you have just added
confirmation to the fact WHY "cluster analysis" is such a mess
today, as it was 40 years ago when I started in it.

Nobody knows what they are doing. Algorithms are everywhere.
They ALL deliver "clusters" whether they are "clusters" or not,
Everyone thinks at the end of a cluster analysis they've found a
sack full of nuts found in the witch hunt, and write a paper (for
the eggheads) or show the boss that they are worth what they
are overpaid for (by showing the computer delivered "clusters"),
until the NEXT project on the same, and everyone finds
different sacks of nuts, and the merry-go-round starts all
over again, and again, and again ... for easily 50+ years. ;-/

-- Bob.

I think we are miscommunicating because your training
and mine are fundamentally different. Yes, as an applied
physical scientist, not a statistician, I am interested in
getting results, clusters or whatever the analysis is
supposed to produce, out of the computer. But these
results need to make sense relative to the field of study,
and if they do then whether or not a pure statistician is
satisfied with the foundations of the methodology is of
secondary (but not zero) importance. This is not a bad
thing, IMO. For instance, Dirac delta and Heaviside
functions were used in physics before the mathematics
was finally formalized. Actually I sat in on portions of
Dan's class last fall, and my impression is that he is as
unimpressed with cluster analysis as you are. But the
fact is that several methods give similar results in certain
applications (albeit often with some annoying need to
make somewhat ad hoc choices about things like stage
number at which to stop the analysis), results which
make sense to someone trained in the field. Therefore
as a way to present or summarize multivariate data it
can be useful, just like presenting the mean and variance
of a sample from a (presumably) normal distribution is
useful even if the observed distribution varies slightly
point by point from perfectly normal. IOW I personally
would use results from cluster analysis as descriptive or
exploratory, as I said earlier. I would not try to claim
some fundamental importance to the results per se
without separate justification on fundamental grounds.
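[Editorial sketch] The remark that several methods give similar results can be made measurable. One standard way to quantify agreement between two partitions of the same objects is the Rand index: the fraction of object pairs on which the two partitions agree (both put the pair together, or both keep it apart). The label vectors below are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated the same way by both partitions."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
        total += 1
    return agree / total

# Hypothetical output of two clustering methods on six objects.
# Label values are arbitrary names; only the grouping matters.
method_1 = [0, 0, 0, 1, 1, 1]
method_2 = [1, 1, 1, 0, 0, 2]

print(f"Rand index: {rand_index(method_1, method_2):.3f}")
```

A value of 1 means the two partitions are identical up to relabeling. The raw index is not corrected for chance agreement, so a high value alone does not show that either partition reflects real structure.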

Cheers,
Russell
Reef Fish
science forum Guru Wannabe


Joined: 28 Apr 2005
Posts: 200

Posted: Sat Jun 03, 2006 10:35 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

Russell.Martin@wdn.com wrote:
Quote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:

It is clear now. That's what OTHER cluster analysts do too.
No different from the Factor Analyst. After they found the
"clusters" (which often are just as random a scatter as the
original data before clustering) they start making all kind of
"sense" out of the newly found sack of nuts from the witch's
pocket.

If sense can be made in terms of the field, then in some
sense it doesn't matter from what source or by what route
the conclusion comes.

But often the same sense (or nonsense -- depending on one's
orientation) can be made without the data, the wasted computer
resources, and the imaginary sack full of nuts -- but who would
believe them, if they don't have some witchcraft to lean on?

Quote:
IOW nuts are just as tasty no matter
how they are found. More on this below.

IF they are indeed the edible, tasty nuts; rather than some
seemingly tasty, but poisonous nuts.


Quote:
Of course there are measures and algorithms. My point was
that knowing the measure and algorithm being used does not
tell you what the clusters are, unless you can run the data
through your mind like the computer.

Round and round the circle we go ... you have just added
confirmation to the fact WHY "cluster analysis" is such a mess
today, as it was 40 years ago when I started in it.

Since I've related the phenomenon to the witchcraft in Factor
Analysis -- I might add that Factor Analysis had about a 30-year
head start. Look where it is now? The vast majority of the
scientists are finally waking up to the fact that the nuts found by
the Factor Analysts were just figments of their imagination, from
making millions of monkeys do the millions of factor rotations to
get some "solution" that seems to make sense ... the rest is
history. I know of no REAL scientific discovery that arose from
the result of a Factor Analysis that was not, or would not have
been found, without Factor Analysis.

Cluster Analysis is heading down the same rosy path.

Been there. Done that. Jumped off that train as I saw it was
accelerating down the cliff of no return.


Quote:
Nobody knows what they are doing. Algorithms are everywhere.
They ALL deliver "clusters" whether they are "clusters" or not,
Everyone thinks at the end of a cluster analysis they've found a
sack full of nuts found in the witch hunt, and write a paper (for
the eggheads) or show the boss that they are worth what they
are overpaid for (by showing the computer delivered "clusters"),
until the NEXT project on the same, and everyone finds
different sacks of nuts, and the merry-go-round starts all
over again, and again, and again ... for easily 50+ years. ;-/

-- Bob.

I think we are miscommunicating because your training
and mine are fundamentally different.

Au contraire. We ARE communicating.

Quote:
Yes, as an applied
physical scientist, not a statistician, I am interested in
getting results, clusters or whatever the analysis is
supposed to produce, out of the computer. But these
results need to make sense relative to the field of study,

Why do you make a statistician, or a SENSIBLE person in
application in any field, an exception to a physical scientist
in that regard?

Quote:
and if they do then whether or not a pure statistician is
satisfied with the foundations of the methodology is of
secondary (but not zero) importance. This is not a bad
thing, IMO.

I actually agree. In fact, most of the most important
discoveries in science are ACCIDENTAL discoveries!

In Cluster Analysis (as in Factor Analysis), I would say
tens of thousands of papers (scientific or otherwise) have
been published on the "findings".

Name ONE that you consider an important (or useful)
discovery, out of the past decades, that can be attributed
to the use of Cluster Analysis.

That would be the proof in the pudding (or some such
expression) to this statistician with a scientist's hat on.


Quote:
For instance, Dirac delta and Heaviside
functions were used in physics before the mathematics
was finally formalized. Actually I sat in on portions of
Dan's class last fall, and my impression is that he is as
unimpressed with cluster analysis as you are. But the
fact is that several methods give similar results in certain
applications (albeit often with some annoying need to
make somewhat ad hoc choices about things like stage
number at which to stop the analysis), results which
make sense to someone trained in the field.

Surely you don't mean to say Dirac attributed his Dirac
delta discovery to cluster analysis!


Quote:
Therefore
as a way to present or summarize multivariate data it
can be useful, just like presenting the mean and variance
of a sample from a (presumably) normal distribution is
useful even if the observed distribution varies slightly
point by point from perfectly normal.

Except Cluster Analysis is NOT a way of summarizing
multivariate data! It is a way of grouping OBJECTS (points)
in multivariate data without even a way of characterizing
what properties each group HAS -- hence the ultimate
defect (and disaster) of NOT having a definition as to what
constitutes a "cluster". The only seemingly useful, or
demonstrably credible results are those that anyone with
any lick of sense would have found the same groups or
clusters WITHOUT any "cluster analysis", "factor analysis",
"discriminant analysis" or any other multivariate methods
of analysis, as in the famous Iris example of Fisher's
species of Setosa, Virginica, and two other species that
are so dramatically different in their raw measurements
that only a blind data analyst would not have found the
different groups from the measurements. :-)


Quote:
IOW I personally
would use results from cluster analysis as descriptive or
exploratory, as I said earlier. I would not try to claim
some fundamental importance to the results per se
without separate justification on fundamental grounds.

But look at your "typical" cluster analysis result, and ask
yourself in what way was it "descriptive" other than telling
you SOME of them form "groups" or "clusters", while you are
unable to say WHY each group is different from the other
groups other than that's what "the computer" tells you.

-- Bob.
Russell.Martin@wdn.com
science forum beginner


Joined: 13 Sep 2005
Posts: 24

Posted: Sat Jun 03, 2006 11:17 pm    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

Reef Fish wrote:
Quote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:

It is clear now. That's what OTHER cluster analysts do too.
No different from the Factor Analyst. After they found the
"clusters" (which often are just as random a scatter as the
original data before clustering) they start making all kind of
"sense" out of the newly found sack of nuts from the witch's
pocket.

If sense can be made in terms of the field, then in some
sense it doesn't matter from what source or by what route
the conclusion comes.

But often the same sense (or nonsense -- depending on one's
orientation) can be made without the data, the wasted computer
resources, and the imaginary sack full of nuts -- but who would
believe them, if they don't have some witchcraft to lean on?

IOW nuts are just as tasty no matter
how they are found. More on this below.

IF they are indeed the edible, tasty nuts; rather than some
seemingly tasty, but poisonous nuts.


Of course there are measures and algorithms. My point was
that knowing the measure and algorithm being used does not
tell you what the clusters are, unless you can run the data
through your mind like the computer.

Round and round the circle we go ... you have just added
confirmation to the fact WHY "cluster analysis" is such a mess
today, as it was 40 years ago when I started in it.

Since I've related the phenomenon to the witchcraft in Factor
Analysis -- I might add that Factor Analysis had about a 30-year
head start. Look where it is now? The vast majority of the
scientists are finally waking up to the fact that the nuts found by
the Factor Analysts were just figments of their imagination, from
making millions of monkeys do the millions of factor rotations to
get some "solution" that seems to make sense ... the rest is
history. I know of no REAL scientific discovery that arose from
the result of a Factor Analysis that was not, or would not have
been found, without Factor Analysis.

Cluster Analysis is heading down the same rosy path.

Been there. Done that. Jumped off that train as I saw it was
accelerating down the cliff of no return.

I defer to your greater antiquity on that point. ;-)

Quote:


Nobody knows what they are doing. Algorithms are everywhere.
They ALL deliver "clusters" whether they are "clusters" or not,
Everyone thinks at the end of a cluster analysis they've found a
sack full of nuts found in the witch hunt, and write a paper (for
the eggheads) or show the boss that they are worth what they
are overpaid for (by showing the computer delivered "clusters"),
until the NEXT project on the same, and everyone finds
different sacks of nuts, and the merry-go-round starts all
over again, and again, and again ... for easily 50+ years. ;-/

-- Bob.

I think we are miscommunicating because your training
and mine are fundamentally different.

Au contraire. We ARE communicating.

We are exchanging symbols, I'm not so sure we are
communicating. :-)

Quote:

Yes, as an applied
physical scientist, not a statistician, I am interested in
getting results, clusters or whatever the analysis is
supposed to produce, out of the computer. But these
results need to make sense relative to the field of study,

Why do you make a statistician, or a SENSIBLE person in
application in any field, an exception to a physical scientist
in that regard?

I didn't, I just think that statisticians may have different
criteria with which they evaluate a statistical technique
than the people who apply the technique because the
user is more focused on the results while the statistician
is more focused on the process. I think that is how it
should be, with an interplay between the two that is
potentially healthy for both.

Quote:
and if they do then whether or not a pure statistician is
satisfied with the foundations of the methodology is of
secondary (but not zero) importance. This is not a bad
thing, IMO.

I actually agree. In fact, most of the most important
discoveries in science are ACCIDENTAL discoveries!

In Cluster Analysis (as in Factor Analysis), I would say
tens of thousands of papers (scientific or otherwise) have
been published on the "findings".

Name ONE that you consider an important (or useful)
discovery, out of the past decades, that can be attributed
to the use of Cluster Analysis.

I can't, but neither can I give an important or useful discovery
that is based solely on the fact that the observations have
a sample mean and variance. That doesn't detract from
the usefulness of the sample mean and variance.

Quote:

That would be the proof in the pudding (or some such
expression) to this statistician with a scientist's hat on.


For instance, Dirac delta and Heaviside
functions were used in physics before the mathematics
was finally formalized. Actually I sat in on portions of
Dan's class last fall, and my impression is that he is as
unimpressed with cluster analysis as you are. But the
fact is that several methods give similar results in certain
applications (albeit often with some annoying need to
make somewhat ad hoc choices about things like stage
number at which to stop the analysis), results which
make sense to someone trained in the field.

Surely you don't mean to say Dirac attributed his Dirac
delta discovery to cluster analysis!

No, I'm just pointing out cases where physical scientists
were ahead of the mathematicians in applying a useful
concept that the mathematicians were uneasy about
because they had not yet managed to formalize it.

Quote:


Therefore
as a way to present or summarize multivariate data it
can be useful, just like presenting the mean and variance
of a sample from a (presumably) normal distribution is
useful even if the observed distribution varies slightly
point by point from perfectly normal.

Except Cluster Analysis is NOT a way of summarizing
multivariate data!

It takes a large amount of data and extracts a smaller
number of items which represent (or at least purports to
represent) aspects of the data. The results can thus be
used as a type of summary. In fact, as I've suggested,
that is the most useful application I've found of the method.
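[Editorial sketch] What using the result "as a type of summary" might look like in practice: given cluster labels from any method, reduce each group to its size and centroid. The observations, labels, and function name are hypothetical, and the sketch assumes the labels are already available; it does not address whether the groups themselves are meaningful.

```python
from collections import defaultdict
from statistics import mean

def summarize(rows, labels):
    """Per-cluster size and centroid for rows grouped by their labels."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: {
            "n": len(members),
            # zip(*members) regroups the values variable by variable.
            "centroid": tuple(mean(col) for col in zip(*members)),
        }
        for label, members in groups.items()
    }

# Hypothetical observations (two variables) and cluster labels.
rows = [(1.0, 10.0), (3.0, 14.0), (10.0, 0.0)]
labels = [0, 0, 1]

summary = summarize(rows, labels)
for label, info in sorted(summary.items()):
    print(label, info)
```

The centroid table is a compact description of the groups, but it describes whatever partition was supplied, good or bad.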

Quote:
It is a way of grouping OBJECTS (points)
in multivariate data without even a way of characterizing
what properties each group HAS -- hence the ultimate
defect (and disaster) of NOT having a definition as to what
constitutes a "cluster". The only seemingly useful, or
demonstrably credible results are those that anyone with
any lick of sense would have found the same groups or
clusters WITHOUT any "cluster analysis", "factor analysis",
"discriminant analysis" or any other multivariate methods
of analysis, as in the famous Iris example of Fisher's
species of Setosa, Virginica, and two other species that
are so dramatically different in their raw measurements
that only a blind data analyst would not have found the
different groups from the measurements. :-)

True to a point, but IMO an exaggeration. More below.

Quote:


IOW I personally
would use results from cluster analysis as descriptive or
exploratory, as I said earlier. I would not try to claim
some fundamental importance to the results per se
without separate justification on fundamental grounds.

But look at your "typical" cluster analysis result, and ask
yourself in what way was it "descriptive" other than telling
you SOME of them form "groups" or "clusters", while you are
unable to say WHY each group is different from the other
groups other than that's what "the computer" tells you.

-- Bob.

Ahh, in situations where the results make sense I can
say why the clusters differ. And I could even determine
close to the same clusters (and there are often borderline
cases that algorithms disagree on, which does not in and
of itself denigrate the general idea or my analysis) if I
wanted to spend an unnecessary amount of time staring
at or analysing the data by hand, but a major point behind
designing algorithms is so they can be programmed into a
computer to do the grunt work.

Cheers,
Russell
Reef Fish
science forum Guru Wannabe


Joined: 28 Apr 2005
Posts: 200

Posted: Sun Jun 04, 2006 1:35 am    Post subject: Re: Clustering: if one cluster is huge, the other one is small...

Russell.Martin@wdn.com wrote:
Quote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:
Reef Fish wrote:
Russell.Martin@wdn.com wrote:

But often the same sense (or nonsense -- depending on one's
orientation) can be made without the data, the wasted computer
resources, and the imaginary sack full of nuts -- but who would
believe them, if they don't have some witchcraft to lean on?

IOW nuts are just as tasty no matter
how they are found. More on this below.

IF they are indeed the edible, tasty nuts; rather than some
seemingly tasty, but poisonous nuts.


Quote:
Nobody knows what they are doing. Algorithms are everywhere.
They ALL deliver "clusters" whether they are "clusters" or not,
Everyone thinks at the end of a cluster analysis they've found a
sack full of nuts found in the witch hunt, and write a paper (for
the eggheads) or show the boss that they are worth what they
are overpaid for (by showing the computer delivered "clusters"),
until the NEXT project on the same, and everyone finds
different sacks of nuts, and the merry-go-round starts all
over again, and again, and again ... for easily 50+ years. ;-/

Cutting the rhetorical arguments down to the nuts and bolts ...


Quote:
In Cluster Analysis (as in Factor Analysis), I would say
tens of thousands of papers (scientific or otherwise) have
been published on the "findings".

Name ONE that you consider an important (or useful)
discovery, out of the past decades, that can be attributed
to the use of Cluster Analysis.

I can't, but neither can I give an important or useful discovery
that is based solely on the fact that the observations have
a sample mean and variance. That doesn't detract from
the usefulness of the sample mean and variance.

You can't think of a SINGLE case where these techniques
that are supposed to discover STRUCTURES in the
multivariate space through the use of data?

That pretty much says it all.

Quote:
That would be the proof in the pudding (or some such
expression) to this statistician with a scientist's hat on.

I guess there ain't no pudding for this scientist today. :-)

A sample mean and variance are not exactly meant for the
same purpose, and even so, the sample mean is quickly
losing favor to the sample median, which is a much more
descriptive measure than the mean for most everyday statistics
on distributions that are known to be highly skewed.
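[Editorial sketch] A small illustration of the mean-versus-median point on a skewed sample. The numbers are made up (think of incomes with one very large value).

```python
import statistics

# Hypothetical right-skewed sample: nine modest values and one extreme one.
sample = [20, 22, 25, 27, 30, 32, 35, 40, 45, 1000]

sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)
below_mean = sum(1 for x in sample if x < sample_mean)

print(f"mean:   {sample_mean}")
print(f"median: {sample_median}")
print(f"values below the mean: {below_mean} of {len(sample)}")
```

The single extreme value drags the mean above nine of the ten observations, while the median stays in the middle of the bulk of the data.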


Quote:
Therefore
as a way to present or summarize multivariate data it
can be useful, just like presenting the mean and variance
of a sample from a (presumably) normal distribution is
useful even if the observed distribution varies slightly
point by point from perfectly normal.

Except Cluster Analysis is NOT a way of summarizing
multivariate data!

It takes a large amount of data and extracts a smaller
number of items which represent (or at least purports to
represent) aspects of the data. The results can thus be
used as a type of summary. In fact, as I've suggested,
that is the most useful application I've found of the method.

It is a way of grouping OBJECTS (points)
in multivariate data without even a way of characterizing
what properties each group HAS -- hence the ultimate
defect (and disaster) of NOT having a definition as to what
constitutes a "cluster". The only seemingly useful, or
demonstrably credible results are those that anyone with
any lick of sense would have found the same groups or
clusters WITHOUT any "cluster analysis", "factor analysis",
"discriminant analysis" or any other multivariate methods
of analysis, as in the famous Iris example of Fisher's
species of Setosa, Virginica, and two other species that
are so dramatically different in their raw measurements
that only a blind data analyst would not have found the
different groups from the measurements. :-)

True to a point, but IMO an exaggeration. More below.

No exaggeration. The more the Iris data is re-examined (BTW,
I forgot about the Versicolor species) through just simple graphics,
the less one is impressed by what Fisher or any of the other
methods re-discover. It made a good textbook example, because
it's one of the few examples that actually had a "correct answer" :-)
but alas, the correct answer was as obvious as 3+3, though that
answer is not necessarily obvious to many.
Quote:



IOW I personally
would use results from cluster analysis as descriptive or
exploratory, as I said earlier.

But look at your "typical" cluster analysis result, and ask
yourself in what way was it "descriptive" other than telling
you SOME of them form "groups" or "clusters", while you are
unable to say WHY each group is different from the other
groups other than that's what "the computer" tells you.

-- Bob.

Ahh, in situations where the results make sense I can
say why the clusters differ. And I could even determine
close to the same clusters (and there are often borderline
cases that algorithms disagree on, which does not in and
of itself denigrate the general idea or my analysis) if I
wanted to spend an unnecessary amount of time staring
at or analysing the data by hand, but a major point behind
designing algorithms is so they can be programmed into a
computer to do the grunt work.

It would be anticlimactic if I didn't ask WHICH algorithm (among
the hundreds with which I am familiar) do you use for your grunt
work?

-- Bob.