The most exciting clustering you have ever seen

in #steemstem · 6 years ago

Clustering in mathematics is about as exciting as bread. We all use it, without really enjoying the process, and whatever we do, the result is more or less the same. At least that's what I thought, until a mathematician wizard sent me his latest state-of-the-art paper.

Boring clustering


The very first association with clustering is certainly k-means clustering, invented in the 1950s by several authors independently. The idea is very simple: you specify the number of clusters you want, and the algorithm finds the points that will serve as the cluster centres, placing them so that the distance of the data points from the centre of their cluster is as small as possible.
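A minimal sketch of the idea in Python with scikit-learn (the toy data below is invented purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated blobs, made up for the example
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# You pick the number of clusters; the algorithm places the centres so that
# the within-cluster sum of squared distances is as small as possible.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the three cluster centres
print(km.labels_[:10])       # cluster membership of the first ten points
```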

If you want to learn more about it, visit the StatSoft STATISTICA Help centre; it is explained in a very clear manner and you will master this skill in less than an hour.

If you are a biologist, you are certainly familiar with tree clustering, the method you will find in every book related to taxonomy or evolution. The logic behind the algorithm is again very intuitive and easy to understand: take your data points, compute the distances between them, look at the linkage distances and decide where you want to cut the tree.
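A rough sketch of that recipe with SciPy (again on invented data), where the "cut" is just a threshold on the linkage distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Build the tree: pairwise distances combined with a linkage rule
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree at a chosen linkage distance to obtain flat clusters
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```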

And there is nothing spectacular you can do beyond that. You can play with the method for calculating the distance, from the classic Euclidean distance, to its squared version, or, if you are feeling particularly adventurous, the Manhattan distance metric.
You can also add some weights to the data points, and that's about it...
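Just to make those options concrete, here is how the three distances compare for a single pair of points (plain NumPy, nothing more than the definitions):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
squared   = np.sum((a - b) ** 2)            # squared Euclidean
manhattan = np.sum(np.abs(a - b))           # city-block / Manhattan

print(euclidean, squared, manhattan)
```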

So, what's the problem?


Well, the first one is very obvious even from this Wiki example:


PD image, stolen from here

Your gestalt brain can clearly see three circles, but try explaining that in mathematical language using only the distances from the centres.

And, just for fun, imagine "doughnut-shaped" data. How could that be clustered using classical k-means? Or, even worse, two concentric circles? There is simply no way to do it properly.
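You can see the failure for yourself with scikit-learn's synthetic two-circles dataset; the parameters below are arbitrary, the point is only that k-means cannot recover the rings:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans

# Two concentric circles: the textbook shape that defeats k-means
X, y_true = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true rings hovers near chance, because k-means can
# only split this plane with a straight boundary between two centres.
agreement = max(np.mean(labels == y_true), np.mean(labels != y_true))
print(f"agreement with the true circles: {agreement:.2f}")
```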

Wizard(s)... KNSC-Ncut (kernel-based nonnegative subspace clustering)


Type this with quotation marks into Google and you will get one single hit, the paper by Ivica Kopriva (if you ever join Steemit, the reward from this post is yours).

The whole idea emerged from NMF, non-negative matrix factorization, one of the best methods for dimensionality reduction (tested in the real world). The main reason for this can be found in the name: all the elements of the factor matrices are non-negative. Translated into ordinary language, all the components present in a spectrum, for example, contribute only positively. There is no "anti-component".
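A small, hedged sketch of plain NMF with scikit-learn, just to show that additive, parts-based flavour; the "spectra" here are random numbers standing in for real measurements:

```python
import numpy as np
from sklearn.decomposition import NMF

# Non-negative data matrix: say, 100 spectra with 50 channels each
rng = np.random.default_rng(0)
V = rng.random((100, 50))

# Factorise V ~ W @ H with W >= 0 and H >= 0:
# every component can only add to the signal, never subtract from it
model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)   # (100, 5) component weights per spectrum
H = model.components_        # (5, 50) the non-negative components themselves

print(W.min(), H.min())      # both are >= 0 by construction
```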

Now, instead of using the data in its "as is" format, a weighted affinity graph is constructed first
(each node represents a data point and each weighted edge represents the similarity, derived from the distance, between the points).
If you have no idea what an affinity graph is...

Spectral clustering then finds the cluster membership of the data points by using the spectrum of an affinity graph.
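In code, the textbook recipe behind that sentence (not the paper's algorithm, just standard normalised spectral clustering) looks roughly like this:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Affinity graph: Gaussian (RBF) kernel on the pairwise distances
sigma = 0.2                                  # kernel width, picked by hand here
D = cdist(X, X)
A = np.exp(-(D ** 2) / (2 * sigma ** 2))
np.fill_diagonal(A, 0.0)

# Normalised graph Laplacian  L = I - D^{-1/2} A D^{-1/2}
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(len(X)) - D_inv_sqrt @ A @ D_inv_sqrt

# The "spectrum": eigenvectors of the k smallest eigenvalues
k = 2
eigvals, eigvecs = eigh(L)
U = eigvecs[:, :k]                           # note: entries can be negative
U = U / np.linalg.norm(U, axis=1, keepdims=True)

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
print(np.bincount(labels))
```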

But... this approach can give mixed-sign results, due to the eigenvalue decomposition of the Laplacian.

So, instead of applying the eigenvalue decomposition, use NMF...
And now, if you like math, you can see what could be done ;)
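One way to picture that swap, as a rough sketch of the general idea rather than Kopriva's actual KNSC-Ncut algorithm: factorise the non-negative affinity matrix itself with NMF, so the factors stay non-negative and each row can be read directly as a soft cluster membership, with no mixed signs to interpret.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.decomposition import NMF
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# The same kind of Gaussian affinity matrix as above: non-negative by construction
D = cdist(X, X)
A = np.exp(-(D ** 2) / (2 * 0.2 ** 2))

# Factorise the affinity matrix with NMF instead of eigendecomposing its Laplacian
model = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = model.fit_transform(A)       # (n_points, 2), every entry >= 0

# No mixed signs: the largest entry in each row gives the cluster label directly
labels = W.argmax(axis=1)
print(np.bincount(labels))
```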

How does it work in practice?


The authors tested several publicly available datasets, with matrix sizes ranging from 214 samples x 9 dimensions for 6 clusters, up to 400 samples x 10,304 dimensions(!) for 40 clusters.

Now you know that clustering can be interesting, that it can solve some incredibly difficult problems, and that fine tweaking of the algorithms can make them even better for your particular case.

Comments

If I understand correctly, this method, versus a simple method such as k-means, doesn't only examine the closeness of points in phase space. For example, Figure 1 of the ArXiv paper shows NMF clustering of interior and exterior points. (I don't understand the details, but I'm trying to get a grip on the results.)

If this works, can we use NMF based clustering to identify outliers in datasets? I ask this because this is a significant element of my work. I have large datasets with sometimes questionable data and we want to early on identify outliers that don't fit with the rest. Typically we need to work through a series of regressions prior to identifying those samples that aren't with the program.

What else can we do with these?

Also, will NMF clustering methods reproduce the simple clustering? I mean, if we were to replace all our clustering methods across the board with this, would there be any disadvantages?


Your gestalt brain can clearly see 3 circles

Nope, I see Mickey Mouse.

The first thing I saw, too.

What's wrong with me... :)

Hah!

Me too. A very colored Mickey!

Clustering in mathematics is as exciting as a bread.

Hey! Some of us try to make it exciting. ;)

PCA was my first love :D

@alexs1320 I was taught this clustering in my school days. It was too boring and I never understood it... hehehe. But I passed the exam anyway.



