A simple cluster method for point sets, thank you

  • Follow


Dear all,

I have a simple question. In a X-Y plane, I got some points (x, y coordinates). I want to divide these points into different clusters. Is there any MATLAB function that can perform this operation? I checked kmeans(), but it needs the cluster number as an input. However, I don't know how many clusters would be generated. I guess this would be a very simple question, but I am not familiar with this issue. Thank you very much.
0
Reply Digistar 4/15/2010 4:27:20 PM

Digistar You wrote:

> I have a simple question. In a X-Y plane, I got some points (x, y 
> coordinates). I want to divide these points into different clusters. Is 
> there any MATLAB function that can perform this operation? I checked 
> kmeans(), but it needs the cluster number as an input. However, I don't 
> know how many clusters would be generated. I guess this would be a very 
> simple question, but I am not familiar with this issue. Thank you very 
> much.

The simple solution is to put every point into its own individual 
cluster: then the clustering error is 0. Or to put all points into the 
same cluster; that also has a clustering error of 0.

If you are expecting a non-trivial number of clusters, there is *always* 
a trade-off between clustering errors and number of clusters, so you 
cannot do any clustering operation unless you specify either the number 
of clusters or the formula that you want to use to calculate error, and 
to trade error vs number of clusters.

For example, in one dimension, suppose my points are at (1), (2), and 
(3). What is the best clustering? Now suppose that (2) value is not 
-exactly- 2 internally but has been subject to round-off error: should a 
tiny tiny error in its coordinate yank (2) between being clustered with 
(1) or with (3) depending on which side of exactly 2 the error happens 
to lead? What if the points are at (1), "near" (2) and (3), with normal 
distribution and some variance around the nominal "near (2)" position: 
the better clustering might be well defined if you knew the nominal 
position for the center point, but which side of (2) it is actually 
going to land on is probabilistic rather than set in stone, so even if 
you end up with (1), (2.5), (3), the "true" clustering might be the 
first two points together and the third distinct, because the nominal 
position for the second point might be (say) 1.8 and randomly the center 
point's deviation was 0.7 in the particular sampling.

There is no algorithm that can just be given a set of actual points and 
come up with a non-trivial best "real" clustering without at least being 
given information about the "intent" of the position of each of the 
points.  Even if the points are (1), (1.003), and (1000), if there is 
calculation limitations or probabilities involved in the determination 
of the (1.003), there is a real possibility that the (1.003) should 
really be clustered with the (1000).

To phrase this a different way: I've seen cats that looked much more 
like dogs than they looked like cats -- but they were still cats.
0
Reply Walter 4/15/2010 5:02:40 PM


Walter Roberson <roberson@hushmail.com> wrote in message <hq7grh$hps$1@canopus.cc.umanitoba.ca>...
>
> To phrase this a different way: I've seen cats that looked much more 
> like dogs than they looked like cats -- but they were still cats.

Hum, I'm pretty sure many of those are actually mice.

Bruno
0
Reply Bruno 4/15/2010 5:23:05 PM

Walter Roberson <roberson@hushmail.com> wrote in message <hq7grh$hps$1@canopus.cc.umanitoba.ca>...
> Digistar You wrote:
> 
> > I have a simple question. In a X-Y plane, I got some points (x, y 
> > coordinates). I want to divide these points into different clusters. Is 
> > there any MATLAB function that can perform this operation? I checked 
> > kmeans(), but it needs the cluster number as an input. However, I don't 
> > know how many clusters would be generated. I guess this would be a very 
> > simple question, but I am not familiar with this issue. Thank you very 
> > much.
> 
> The simple solution is to put every point into its own individual 
> cluster: then the clustering error is 0. Or to put all points into the 
> same cluster; that also has a clustering error of 0.
> 
> If you are expecting a non-trivial number of clusters, there is *always* 
> a trade-off between clustering errors and number of clusters, so you 
> cannot do any clustering operation unless you specify either the number 
> of clusters or the formula that you want to use to calculate error, and 
> to trade error vs number of clusters.
> 
> For example, in one dimension, suppose my points are at (1), (2), and 
> (3). What is the best clustering? Now suppose that (2) value is not 
> -exactly- 2 internally but has been subject to round-off error: should a 
> tiny tiny error in its coordinate yank (2) between being clustered with 
> (1) or with (3) depending on which side of exactly 2 the error happens 
> to lead? What if the points are at (1), "near" (2) and (3), with normal 
> distribution and some variance around the nominal "near (2)" position: 
> the better clustering might be well defined if you knew the nominal 
> position for the center point, but which side of (2) it is actually 
> going to land on is probabilistic rather than set in stone, so even if 
> you end up with (1), (2.5), (3), the "true" clustering might be the 
> first two points together and the third distinct, because the nominal 
> position for the second point might be (say) 1.8 and randomly the center 
> point's deviation was 0.7 in the particular sampling.
> 
> There is no algorithm that can just be given a set of actual points and 
> come up with a non-trivial best "real" clustering without at least being 
> given information about the "intent" of the position of each of the 
> points.  Even if the points are (1), (1.003), and (1000), if there is 
> calculation limitations or probabilities involved in the determination 
> of the (1.003), there is a real possibility that the (1.003) should 
> really be clustered with the (1000).
> 
> To phrase this a different way: I've seen cats that looked much more 
> like dogs than they looked like cats -- but they were still cats.

Dear Walter,

Thank you so much for the detailed explanation. Actually, the question is very simple in my case. I put an image (http://drop.io/digistar#) to show that, all points are cluster very well. So I just want to which MATLAB function can group these points into different clusters and give me the labels of these points. Thank you very much.
0
Reply Digistar 4/15/2010 6:02:22 PM

Digistar You wrote:

> Thank you so much for the detailed explanation. Actually, the question 
> is very simple in my case. I put an image (http://drop.io/digistar#) to 
> show that, all points are cluster very well. So I just want to which 
> MATLAB function can group these points into different clusters and give 
> me the labels of these points. Thank you very much.

I have examined the image, and based upon the intra-point distances, *I* would 
cluster the points in the upper right corner into one cluster, and cluster 
everything else into another cluster, total of two clusters.

You may have been thinking perhaps of 6 clusters, but if so that is because 
you have in mind a different error calculation and different penalty function 
for increased number of clusters. Matlab does not know, and _cannot_ know, 
which of us is "right" on such a matter.

Other algorithms I know of would likely tend to choose 3 clusters total given 
that data, grouping the top-left data into one cluster, the bottom-left data 
into a second cluster, and the top-right data into a third cluster. And those 
algorithms would be right -- under the terms of the assumptions that had been 
programmed into them.

So again I must iterate that there is not and cannot be a "simple" Matlab 
function that meaningfully clusters your data non-trivially without being told 
either the number of clusters, or the error-calculation and tradeoff penalty 
formulae.
0
Reply Walter 4/15/2010 7:02:31 PM

Walter Roberson wrote:

> So again I must iterate that there is not and cannot be a "simple" 
> Matlab function that meaningfully clusters your data non-trivially 
> without being told either the number of clusters, or the 
> error-calculation and tradeoff penalty formulae.

Having said that: if you were to "draw" the points into a matrix, and then use 
imdilate(), and then use bwlabel(), you would get blobs that would identify 
the approximate positions of the clusters. The region properties you would be 
able to extract would give you an approximate centroid and bounding box; you 
could then run through all of your points determining which of the boxes they 
fit in.

Also, though I have not checked, I suspect there is likely something akin to a 
2D histogram in the Matlab File Exchange.
0
Reply Walter 4/15/2010 7:46:26 PM

Walter Roberson <roberson@hushmail.com> wrote in message <hq7qek$2q1$1@canopus.cc.umanitoba.ca>...
> Walter Roberson wrote:
> 
> > So again I must iterate that there is not and cannot be a "simple" 
> > Matlab function that meaningfully clusters your data non-trivially 
> > without being told either the number of clusters, or the 
> > error-calculation and tradeoff penalty formulae.
> 
> Having said that: if you were to "draw" the points into a matrix, and then use 
> imdilate(), and then use bwlabel(), you would get blobs that would identify 
> the approximate positions of the clusters. The region properties you would be 
> able to extract would give you an approximate centroid and bounding box; you 
> could then run through all of your points determining which of the boxes they 
> fit in.
> 
> Also, though I have not checked, I suspect there is likely something akin to a 
> 2D histogram in the Matlab File Exchange.

Thank you Walter. Emm... maybe I should not say this is a clustering issue. As you can see from that image, my question is just to find 6 points, each of which is the centroid of a set of points around this centroid. So, in this case, I would cluster the points in this image into 6 groups. Each centroid of each group would be found - this is my question.
0
Reply Digistar 4/15/2010 8:10:21 PM

Digistar You wrote:

>  Emm... maybe I should not say this is a clustering 
> issue. As you can see from that image, my question is just to find 6 
> points, each of which is the centroid of a set of points around this 
> centroid. So, in this case, I would cluster the points in this image 
> into 6 groups. Each centroid of each group would be found - this is my 
> question.

Are you choosing to redefine the question at this point? In the original 
problem statement, you did not wish to use kmeans() because kmeans() requires 
specifying the number of clusters. Your problem specification quoted above is 
to find *6* centroids, which is an exact number that could be passed to 
kmeans() and would probably give you the output you were hoping for.

But if you want an algorithm in which you do not have to specify the number of 
clusters ahead of time (e.g., because the number varies), then there cannot be 
such a general algorithm that does accurate clustering and centroid detection 
-- not without assumptions (possibly inapplicable ones) or explicit 
information about how to decide whether a set of points represents one bigger 
cluster or two smaller clusters.

In the imdilate() / bwlabel() version that proposed, the implicit assumption 
would be coded into the dilation size and the grid size.


Here is some sample code, assuming coordinates in x and y:


scatter(x,y,'b.');

GS = 10;   %number of grid divisions

minx = min(0,min(x));
miny = min(0,min(y));
divs = ceil(max(max(x)-minx,max(y)-miny));

gridded = reshape( accumarray( sub2ind([GS,GS], 1+fix((y-miny)*GS/divs), 
1+fix((x-minx)*GS/divs)).', 1, [GS.^2,1], @any).', GS, GS);

labeled = bwlabel(gridded);
cents = regionprops(labeled,'centroid');

cxy = GS * (vertcat(cents.Centroid) - 1);
cxy(:,1) = cxy(:,1) + minx;
cxy(:,2) = cxy(:,2) + miny;

hold on
scatter(cxy(:,1),cxy(:,2),'rd')
hold off

The original points will now be in blue dots, and the centroids will be red 
diamonds on the graph.

The _approximate_ (x,y) coordinates of the centroids are in cxy. The accuracy 
of the approximation can be altered by changing the grid size variable, GS, at 
the expense of an increase in memory.

If you want to view the labeled gridded approximation of the original data, 
display flipud(labeled) .

This code has been designed so that, to the extent practical, the matrices are 
like an approximated drawing of the data; however, because regionprops counts 
from the upper left corner instead of from the bottom left corner, they cannot 
be represented exactly the same way... or rather, that they could, but you 
would have to flipud() the value being assigned to gridded, and you would then 
have to flipud(labeled) when you passed it in to regionprops. I figured that 
the efficiency of not doing the double flip outweighed the ease of viewing the 
matrices step by step... just remember to flipud() the matrices for viewing.


If the data has a minimum notably different from 0, then you may wish to 
replace the minx and miny lines to not take min of 0 and the value: not doing 
  the min against 0 will give a finer granularity in the binning, at the price 
of altering the aspect ratio of the displayed matrices compared to what would 
be plotted. It would be a better computation, but more difficult to understand 
what was going on... and since I had to fiddle somewhat with the various 
transposes and order of parameters, having a similar aspect ratio to the graph 
was a big help.


Kinda interesting... first time I've ever used bwlabel or regionprops. Or, for 
that matter, accumarray for anything serious.
0
Reply Walter 4/16/2010 12:06:37 AM

In article <hq89mg$p73$1@canopus.cc.umanitoba.ca>,
 Walter Roberson <roberson@hushmail.com> wrote:

> Digistar You wrote:
> 
> >  Emm... maybe I should not say this is a clustering 
> > issue. As you can see from that image, my question is just to find 6 
> > points, each of which is the centroid of a set of points around this 
> > centroid. So, in this case, I would cluster the points in this image 
> > into 6 groups. Each centroid of each group would be found - this is my 
> > question.
> 
> Are you choosing to redefine the question at this point? In the original 
> problem statement, you did not wish to use kmeans() because kmeans() requires 
> specifying the number of clusters. Your problem specification quoted above is 
> to find *6* centroids, which is an exact number that could be passed to 
> kmeans() and would probably give you the output you were hoping for.


I think it's pretty clear that Digistar meant there are obviously 6 
clusters for this data set, but other data sets might result in a 
different number.

I would just run kmeans multiple times for k over a reasonable range for 
your problem.  Then look at the third or fourth output arguments of 
kmeans to see which of those clusterings gave the smallest average error 
or maximum error.

The difficulty comes when two of your clusters happen to lie close to 
each other.  This can be especially frustrating when it is obvious to 
your eye how you would like to cluster the data, but kmeans does 
something else.  It can be a challenging problem.

If your clusters tend to be approximately round or at least well 
separated then kmeans should work well.

-- 
Doug Schwarz
dmschwarz&ieee,org
Make obvious changes to get real email address.
0
Reply Doug 4/16/2010 2:12:10 AM

There is a nice overview of various cluster analysis techniques here:
http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm

If the number of clusters is unknown beforehand, then maybe the QT
clustering algorithm might work:
http://en.wikipedia.org/wiki/Cluster_analysis
0
Reply ImageAnalyst 4/16/2010 2:51:55 AM

Doug Schwarz <see@sig.for.address.edu> wrote in message <see-DEF7AF.22121015042010@news.frontiernet.net>...
> In article <hq89mg$p73$1@canopus.cc.umanitoba.ca>,
>  Walter Roberson <roberson@hushmail.com> wrote:
> 
> > Digistar You wrote:
> > 
> > >  Emm... maybe I should not say this is a clustering 
> > > issue. As you can see from that image, my question is just to find 6 
> > > points, each of which is the centroid of a set of points around this 
> > > centroid. So, in this case, I would cluster the points in this image 
> > > into 6 groups. Each centroid of each group would be found - this is my 
> > > question.
> > 
> > Are you choosing to redefine the question at this point? In the original 
> > problem statement, you did not wish to use kmeans() because kmeans() requires 
> > specifying the number of clusters. Your problem specification quoted above is 
> > to find *6* centroids, which is an exact number that could be passed to 
> > kmeans() and would probably give you the output you were hoping for.
> 
> 
> I think it's pretty clear that Digistar meant there are obviously 6 
> clusters for this data set, but other data sets might result in a 
> different number.
> 
> I would just run kmeans multiple times for k over a reasonable range for 
> your problem.  Then look at the third or fourth output arguments of 
> kmeans to see which of those clusterings gave the smallest average error 
> or maximum error.
> 
> The difficulty comes when two of your clusters happen to lie close to 
> each other.  This can be especially frustrating when it is obvious to 
> your eye how you would like to cluster the data, but kmeans does 
> something else.  It can be a challenging problem.
> 
> If your clusters tend to be approximately round or at least well 
> separated then kmeans should work well.
> 
> -- 
> Doug Schwarz
> dmschwarz&ieee,org
> Make obvious changes to get real email address.

Thank you all very much, friends, for the valuable solutions. I tried a few approaches, including using kmenas for some times. However, I found they worked not well. Finally, I wrote a function below, it might be stupid, but I find it very good for my question, if a suitable threshold T was assigned. Fortunately, such threshold is easy to find.

function [labels] = points_cluster(FP, T)

L = size(FP, 1);

i = 1;
index = 1;
buff = logical(1 : L);
labels(1) = 1;
while i < L
    if buff(i)
        point = FP(i, :);
        for j = 1 : i - 1
            if (sqrt((point(1) - FP(j, 1)) ^ 2 + (point(2) - FP(j, 2)) ^ 2)) >= T
                labels(i) = index;
            end
        end
        for j = i + 1 : L
            if (sqrt((point(1) - FP(j, 1)) ^ 2 + (point(2) - FP(j, 2)) ^ 2)) < T
                buff(j) = false;
                labels(j) = index;
            end
        end
        index = index + 1;
    end
    i = i + 1;
end
0
Reply Digistar 4/16/2010 3:38:08 PM

10 Replies
230 Views

(page loaded in 0.078 seconds)


Reply: