
### A simple clustering method for point sets, thank you


```Dear all,

I have a simple question. In an X-Y plane, I have some points (x, y coordinates). I want to divide these points into different clusters. Is there a MATLAB function that can perform this operation? I checked kmeans(), but it needs the number of clusters as an input, and I don't know in advance how many clusters there should be. I guess this is a very simple question, but I am not familiar with this topic. Thank you very much.
```

```Digistar You wrote:

> I have a simple question. In an X-Y plane, I have some points (x, y
> coordinates). I want to divide these points into different clusters. Is
> there a MATLAB function that can perform this operation? I checked
> kmeans(), but it needs the number of clusters as an input, and I don't
> know in advance how many clusters there should be. I guess this is a
> very simple question, but I am not familiar with this topic. Thank you
> very much.

The trivial solution is to put every point into its own individual
cluster: then the clustering error is 0. Or to put all points into the
same cluster, which trivially minimizes the number of clusters.

If you are expecting a non-trivial number of clusters, there is *always*
a trade-off between clustering errors and number of clusters, so you
cannot do any clustering operation unless you specify either the number
of clusters or the formula that you want to use to calculate error, and
to trade error vs number of clusters.

For example, in one dimension, suppose my points are at (1), (2), and
(3). What is the best clustering? Now suppose the (2) value is not
-exactly- 2 internally but has been subject to round-off error: should a
tiny error in its coordinate yank (2) between being clustered with (1)
or with (3), depending on which side of exactly 2 the error happens to
fall? What if the points are at (1), "near" (2), and (3), with a normal
distribution of some variance around the nominal "near (2)" position?
The better clustering might be well defined if you knew the nominal
position of the center point, but which side of (2) it actually lands on
is probabilistic rather than set in stone. So even if you end up with
(1), (2.5), (3), the "true" clustering might put the first two points
together and the third apart, because the nominal position of the second
point might be (say) 1.8 and the center point's deviation in this
particular sample happened to be 0.7.

There is no algorithm that can just be given a set of actual points and
come up with a non-trivial best "real" clustering without at least being
given information about the "intent" of the position of each of the
points.  Even if the points are (1), (1.003), and (1000), if there are
calculation limitations or probabilities involved in the determination
of the (1.003), there is a real possibility that the (1.003) should
really be clustered with the (1000).

To phrase this a different way: I've seen cats that looked much more
like dogs than they looked like cats -- but they were still cats.
```
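Walter's point about the error/cluster-count trade-off is easy to demonstrate. In the sketch below (Python rather than MATLAB, purely illustrative; `sse` and `best_sse` are brute-force helpers, not library functions), the best achievable within-cluster squared error can only shrink as the number of clusters grows, reaching zero when every point is its own cluster, so error alone cannot select a number of clusters.

```python
# Illustrative sketch (not from the thread): for 1-D data, the best
# achievable within-cluster squared error can only shrink as the
# number of clusters k grows, hitting 0 when every point is alone.
from itertools import combinations

def sse(group):
    """Sum of squared deviations from the group's mean."""
    m = sum(group) / len(group)
    return sum((v - m) ** 2 for v in group)

def best_sse(points, k):
    """Minimum total SSE over partitions of the sorted points into k
    contiguous groups (contiguity is optimal for 1-D squared error)."""
    pts = sorted(points)
    n = len(pts)
    best = float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        total = sum(sse(pts[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best

data = [1.0, 1.2, 2.0, 2.1, 9.0, 9.3]
errors = [best_sse(data, k) for k in range(1, len(data) + 1)]
# errors never increases with k, so minimizing error alone always
# drives you toward the trivial one-point-per-cluster answer
```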

```Walter Roberson <roberson@hushmail.com> wrote in message <hq7grh$hps$1@canopus.cc.umanitoba.ca>...
>
> To phrase this a different way: I've seen cats that looked much more
> like dogs than they looked like cats -- but they were still cats.

Hum, I'm pretty sure many of those are actually mice.

Bruno
```

```Walter Roberson <roberson@hushmail.com> wrote in message <hq7grh$hps$1@canopus.cc.umanitoba.ca>...
> To phrase this a different way: I've seen cats that looked much more
> like dogs than they looked like cats -- but they were still cats.

Dear Walter,

Thank you so much for the detailed explanation. Actually, the question is very simple in my case. I put an image (http://drop.io/digistar#) to show that the points cluster very well. So I just want to know which MATLAB function can group these points into different clusters and give me the labels of these points. Thank you very much.
```

```Digistar You wrote:

> Thank you so much for the detailed explanation. Actually, the question
> is very simple in my case. I put an image (http://drop.io/digistar#) to
> show that the points cluster very well. So I just want to know which
> MATLAB function can group these points into different clusters and give
> me the labels of these points. Thank you very much.

I have examined the image, and based upon the inter-point distances, *I* would
cluster the points in the upper right corner into one cluster, and cluster
everything else into another cluster, total of two clusters.

You may have been thinking perhaps of 6 clusters, but if so that is because
you have in mind a different error calculation and different penalty function
for increased number of clusters. Matlab does not know, and _cannot_ know,
which of us is "right" on such a matter.

Other algorithms I know of would likely tend to choose 3 clusters total given
that data, grouping the top-left data into one cluster, the bottom-left data
into a second cluster, and the top-right data into a third cluster. And those
algorithms would be right -- under the terms of the assumptions that had been
programmed into them.

So again I must reiterate that there is not and cannot be a "simple" Matlab
function that meaningfully clusters your data non-trivially without being told
either the number of clusters, or the error-calculation and tradeoff penalty
formulae.
```

```Walter Roberson wrote:

> So again I must reiterate that there is not and cannot be a "simple"
> Matlab function that meaningfully clusters your data non-trivially
> without being told either the number of clusters, or the
> error-calculation and tradeoff penalty formulae.

Having said that: if you were to "draw" the points into a matrix, and then use
imdilate(), and then use bwlabel(), you would get blobs that would identify
the approximate positions of the clusters. The region properties you would be
able to extract would give you an approximate centroid and bounding box; you
could then run through all of your points determining which of the boxes they
fit in.

Also, though I have not checked, I suspect there is likely something akin to a
2D histogram in the Matlab File Exchange.
```
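The draw-dilate-label idea can be sketched outside MATLAB as well. The Python below (stdlib only, purely illustrative; `grid_cluster` is our name, not a library function) stands in for imdilate()/bwlabel() with a coarse occupancy grid plus a flood fill, then averages the actual points that land in each connected region.

```python
# Rough stdlib-only sketch of the grid-and-label idea: rasterize the
# points onto a coarse occupancy grid, flood-fill connected occupied
# cells in place of imdilate()/bwlabel(), then average the points that
# fall into each labeled region.
def grid_cluster(points, gs=10):
    """Group 2-D points by connected occupied cells of a gs-by-gs grid.
    Returns a list of (centroid_x, centroid_y, member_points)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    minx, miny = min(xs), min(ys)
    span = max(max(xs) - minx, max(ys) - miny) or 1.0
    cell = lambda p: (min(gs - 1, int((p[0] - minx) * gs / span)),
                      min(gs - 1, int((p[1] - miny) * gs / span)))
    occupied = {cell(p) for p in points}
    labels = {}
    label = 0
    for start in sorted(occupied):
        if start in labels:
            continue
        label += 1
        stack = [start]
        while stack:                      # flood fill, 8-connected
            c = stack.pop()
            if c in labels or c not in occupied:
                continue
            labels[c] = label
            stack.extend((c[0] + dx, c[1] + dy)
                         for dx in (-1, 0, 1) for dy in (-1, 0, 1))
    groups = {}                           # average the actual points
    for p in points:
        groups.setdefault(labels[cell(p)], []).append(p)
    return [(sum(q[0] for q in g) / len(g),
             sum(q[1] for q in g) / len(g), g)
            for g in groups.values()]
```

Here the grid resolution `gs` plays the role of the dilation size: a coarser grid merges nearby groups, a finer grid splits them, which is exactly the implicit assumption Walter describes.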

```Walter Roberson <roberson@hushmail.com> wrote in message <hq7qek$2q1$1@canopus.cc.umanitoba.ca>...
> Having said that: if you were to "draw" the points into a matrix, and then use
> imdilate(), and then use bwlabel(), you would get blobs that would identify
> the approximate positions of the clusters.

Thank you Walter. Hmm... maybe I should not have called this a clustering issue. As you can see from the image, my question is just to find 6 points, each of which is the centroid of a set of points around it. So, in this case, I would cluster the points in this image into 6 groups and find the centroid of each group - that is my question.
```

```Digistar You wrote:

> Hmm... maybe I should not have called this a clustering
> issue. As you can see from the image, my question is just to find 6
> points, each of which is the centroid of a set of points around it.
> So, in this case, I would cluster the points in this image into 6
> groups and find the centroid of each group - that is my question.

Are you choosing to redefine the question at this point? In the original
problem statement, you did not wish to use kmeans() because kmeans() requires
specifying the number of clusters. Your problem specification quoted above is
to find *6* centroids, which is an exact number that could be passed to
kmeans() and would probably give you the output you were hoping for.

But if you want an algorithm in which you do not have to specify the number of
clusters ahead of time (e.g., because the number varies), then there cannot be
such a general algorithm that does accurate clustering and centroid detection
-- not without assumptions (possibly inapplicable ones) or explicit
information about how to decide whether a set of points represents one bigger
cluster or two smaller clusters.

In the imdilate() / bwlabel() version that I proposed, the implicit
assumptions would be coded into the dilation size and the grid size.

Here is some sample code, assuming the coordinates are in row vectors x and y:

scatter(x,y,'b.');

GS = 10;   %number of grid divisions

minx = min(0,min(x));
miny = min(0,min(y));
divs = ceil(max(max(x)-minx,max(y)-miny));

gridded = reshape( accumarray( sub2ind([GS,GS], 1+fix((y-miny)*GS/divs), ...
    1+fix((x-minx)*GS/divs)).', 1, [GS.^2,1], @any).', GS, GS);

labeled = bwlabel(gridded);
cents = regionprops(labeled,'centroid');

cxy = (vertcat(cents.Centroid) - 1) * divs/GS;   % grid indices back to data units
cxy(:,1) = cxy(:,1) + minx;
cxy(:,2) = cxy(:,2) + miny;

hold on
scatter(cxy(:,1),cxy(:,2),'rd')
hold off

The original points will now be in blue dots, and the centroids will be red
diamonds on the graph.

The _approximate_ (x,y) coordinates of the centroids are in cxy. The accuracy
of the approximation can be altered by changing the grid size variable, GS, at
the expense of an increase in memory.

If you want to view the labeled gridded approximation of the original data,
display flipud(labeled) .

This code has been designed so that, to the extent practical, the matrices are
like an approximated drawing of the data; however, because regionprops counts
from the upper left corner instead of from the bottom left corner, they cannot
be represented exactly the same way... or rather, they could, but you
would have to flipud() the value being assigned to gridded, and you would then
have to flipud(labeled) when you passed it in to regionprops. I figured that
the efficiency of not doing the double flip outweighed the ease of viewing the
matrices step by step... just remember to flipud() the matrices for viewing.

If the data has a minimum notably different from 0, then you may wish to
replace the minx and miny lines to not take min of 0 and the value: not doing
the min against 0 will give a finer granularity in the binning, at the price
of altering the aspect ratio of the displayed matrices compared to what would
be plotted. It would be a better computation, but more difficult to understand
what was going on... and since I had to fiddle somewhat with the various
transposes and order of parameters, having a similar aspect ratio to the graph
was a big help.

Kinda interesting... first time I've ever used bwlabel or regionprops. Or, for
that matter, accumarray for anything serious.
```

```In article <hq89mg$p73$1@canopus.cc.umanitoba.ca>,
Walter Roberson <roberson@hushmail.com> wrote:

> Digistar You wrote:
>
> >  Emm... maybe I should not say this is a clustering
> > issue. As you can see from that image, my question is just to find 6
> > points, each of which is the centroid of a set of points around this
> > centroid. So, in this case, I would cluster the points in this image
> > into 6 groups. Each centroid of each group would be found - this is my
> > question.
>
> Are you choosing to redefine the question at this point? In the original
> problem statement, you did not wish to use kmeans() because kmeans() requires
> specifying the number of clusters. Your problem specification quoted above is
> to find *6* centroids, which is an exact number that could be passed to
> kmeans() and would probably give you the output you were hoping for.

I think it's pretty clear that Digistar meant there are obviously 6
clusters for this data set, but other data sets might result in a
different number.

I would just run kmeans multiple times for k over a reasonable range for
your problem.  Then look at the third or fourth output arguments of
kmeans to see which of those clusterings gave the smallest average error
or maximum error.

The difficulty comes when two of your clusters happen to lie close to
each other.  This can be especially frustrating when it is obvious to
your eye how you would like to cluster the data, but kmeans does
something else.  It can be a challenging problem.

If your clusters tend to be approximately round or at least well
separated then kmeans should work well.

--
Doug Schwarz
dmschwarz&ieee,org
Make obvious changes to get real email address.
```
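Doug's scan over a range of k looks roughly like this in Python (the `kmeans` below is a bare hand-rolled Lloyd's iteration, nothing like the full MATLAB function with its replicates and empty-cluster safeguards; only the outer loop over k and the error comparison matter here):

```python
# Run k-means for several k and compare the total within-cluster error,
# standing in for reading MATLAB kmeans()'s sum-of-distances output.
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, total squared error)."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest current centroid
            _, i = min((sum((a - b) ** 2 for a, b in zip(p, c)), i)
                       for i, c in enumerate(cents))
            groups[i].append(p)
        # move each centroid to the mean of its group (keep it if empty)
        cents = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                 for g, c in zip(groups, cents)]
    err = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in cents)
              for p in points)
    return cents, err

pts = [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9), (10, 0), (10.1, 0.2)]
errs = {k: kmeans(pts, k)[1] for k in range(1, 5)}
# the error typically falls sharply until k reaches the natural count
# (3 for these points) and flattens afterwards; that knee is what you
# inspect by eye, since the error alone keeps shrinking as k grows
```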

```There is a nice overview of various cluster analysis techniques here:
http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm

If the number of clusters is unknown beforehand, then the QT
clustering algorithm might work:
http://en.wikipedia.org/wiki/Cluster_analysis
```
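For reference, the QT (quality threshold) idea from that page can be sketched like this (a simplified, stdlib-only Python rendering; the real algorithm adds the point that least increases the candidate's diameter, which this version approximates with a max-distance test):

```python
# Greedy QT-style clustering: grow a candidate cluster around every
# remaining point while its diameter stays under `diam`, keep the
# largest candidate, remove its points, and repeat.
def qt_cluster(points, diam):
    """Cluster 2-D points so no cluster's diameter exceeds diam."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    remaining = list(points)
    clusters = []
    while remaining:
        best = []
        for seed in remaining:          # grow a candidate around each seed
            cand = [seed]
            pool = [p for p in remaining if p != seed]
            while pool:
                nxt = min(pool, key=lambda q: max(dist(q, c) for c in cand))
                if max(dist(nxt, c) for c in cand) > diam:
                    break               # adding anything would exceed diam
                cand.append(nxt)
                pool.remove(nxt)
            if len(cand) > len(best):
                best = cand
        clusters.append(best)           # keep the largest candidate
        remaining = [p for p in remaining if p not in best]
    return clusters
```

Note that the number of clusters falls out of the diameter threshold, which is exactly the kind of extra information Walter insisted must be supplied one way or another.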

```Doug Schwarz <see@sig.for.address.edu> wrote in message <see-DEF7AF.22121015042010@news.frontiernet.net>...
> I would just run kmeans multiple times for k over a reasonable range for
> your problem.  Then look at the third or fourth output arguments of
> kmeans to see which of those clusterings gave the smallest average error
> or maximum error.

Thank you all very much, friends, for the valuable solutions. I tried a few approaches, including running kmeans several times, but I found they did not work well. Finally, I wrote the function below. It might be naive, but it works very well for my problem, provided a suitable threshold T is chosen. Fortunately, such a threshold is easy to find.

function labels = points_cluster(FP, T)
% Cluster the L-by-2 point list FP with a simple distance threshold T:
% the first point not yet assigned to a cluster starts a new one and
% captures every later unassigned point within distance T of it.

L = size(FP, 1);
labels = zeros(1, L);
buff = true(1, L);      % true = point not yet assigned to a cluster
index = 1;
for i = 1 : L
    if buff(i)
        point = FP(i, :);
        labels(i) = index;
        for j = i + 1 : L
            if buff(j) && ...
               sqrt((point(1) - FP(j,1))^2 + (point(2) - FP(j,2))^2) < T
                buff(j) = false;
                labels(j) = index;
            end
        end
        index = index + 1;
    end
end
```
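The same threshold scheme transcribes naturally to Python (our transcription, purely illustrative): the first point not yet assigned opens a new cluster and captures every later unassigned point within distance T of it.

```python
# Threshold clustering, transcribed from the MATLAB post: each
# still-unlabeled point seeds a new cluster and absorbs every later
# unlabeled point within distance t of the seed.
def points_cluster(fp, t):
    """fp: list of (x, y) tuples; t: distance threshold.
    Returns one integer label (1, 2, ...) per point, in input order."""
    n = len(fp)
    labels = [0] * n
    index = 0
    for i in range(n):
        if labels[i]:                  # captured by an earlier seed
            continue
        index += 1                     # point i seeds a new cluster
        labels[i] = index
        xi, yi = fp[i]
        for j in range(i + 1, n):
            if not labels[j]:
                xj, yj = fp[j]
                if ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5 < t:
                    labels[j] = index
    return labels
```

Like the original, this is sensitive to point order: a point within T of two different seeds goes to whichever seed appears first.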