Aktualisiert: 25. Okt 2020
Content: Very much information to supervised and unsupervised learning, like clustering, classification and regression and the different algorithms and evaluation methods. This blog post explains formulas and tries to give a first approximation of the processes.
Introduction: When talking about machine learning it gets very fast very complicated. To understand the different concepts one has to dive deep into the world of problem classes, algorithms and evaluation methods. Often one has to read several, different and complicated papers for a brighter understanding. Therefore below we try to make a first step towards machine learning and we evaluate the different concepts.
This blogpost will have a look at the following topics:
What's the difference between supervised learning and unsupervised learning?
The difference is, if you have a labelled dataset (supervised learning) or an unlabelled dataset (unsupervised learning). In different words, if you have X for the input and Y for the output (supervised learning), or if you only have the input X (unsupervised learning). Or from another perspective you could also say, that in supervised learning the algorithm is able to learn on the labelled dataset and from this provides an answer key, which the algorithm can use to evaluate its accuracy on the training data. Whereas the unsupervised learning model is providing, as mentioned before, the unlabelled data, which means that the algorithm is extracting patters and features and is trying this way to make sense on its own. If you were asking what a feature is: we call the X -> input -> feature and the Y -> output -> target.
What does fully labelled means?
We talk about a fully labelled data sets in supervised learning if all the elements from a dataset which are used for training are tagged with the answer, which the algorithm is supposed to come up with on its own in unsupervised learning.
So which Problem classes do exist and how do they work?
The classes are called problem classes, because they help solving a problem. Now we need to think of what kind of different problems could occur, which we would like the algorithm to solve for us.
Lets start with the supervised learning, where we have classification and regression problems. How do we differentiate a classification from a regression problem? The answer lies in the Y, because in the classification problem we have a finite number of Ys, the so called classes in which we classify something, and in the regression problem we have an infinite number of Ys as we have real numbers, which means the output Y could be anything, it is not bound to a limited number of classes as in the classification problem.
Alright, lets have a short look at some examples and if they are classification problems or regression problems:
1) If we have customer data & transactions as an input and we want the salary as an output, is it a classification or regression problem?
2) If we have the network traffic as an input and we want to elaborate if the output is a hacker attack or not, is it a classification or regression problem?
3) If we have an image of a skin anomaly and we want to know the size of the tumor, is it a classification or regression problem?
4) If we have customer data and we want to know the churn detection as an output, is it a classification or regression problem?
5) If we have customer data as an input and we want to know if the customer has read the e-mail, is it a classification or regression problem?
The solutions are:
1) Regression 2) Classification 3) Regression 4) Classification 5) Classification. => We always ask if there is a finite / limited output (classification) or an infinite / unlimited output (regression).
Now with unsupervised learning we face the fact, that we do not live in a dream world, where datasets are perfectly labelled and where we also do not exactly always know, what answer to the question we expect. So therefore we use unsupervised learning, specifically deep learning and neural networks, which analyse and structure the data for us. There are various problems like clustering, anomaly detection, association or autoencoders. Today we'll have a detailed look at clustering and the algorithms behind it.
What is clustering?
Clustering is the process where the model is looking at all the data and building clusters out of similar data. As an example, this way you can separate the data into species by looking at indicators like size or color. We will see more below, lets have a look at the linear regression, decision trees, K-nearest neighbour, logistic regression, K-Mean and DBSCAN.
Supervised learning -> Regression -> What is the linear regression?
The linear regression is a machine learning algorithm where there is a linear relationship between the independet X and the dependent Y variable. With the given data points we plot a line that models the points best, by creating a straight best fit line into the graph. The formula for the linear regression is:
Where w0 is defining the height of Y on X = 0 and w is the weighting.
Supervised learning -> Regression -> What is the loss function and mean squared error?
In the next step we need to evaluate how well our algorithm has modelled the data (the line it made into the graph). There is the possibility that the predictions our model made deviate too much from actual results. Therefore we do the loss function, which measures this deviation. The more it deviates , the bigger the number which the loss function shows. Now with the help of an optimization function, the loss function learns to reduce the error in prediction.
Mean squared error
The mean squared error is measuring the average of the squares of the errors.
We use the following formula to determine the MSE:
We need to calculate the mean squared error loss function, because the derivation of the loss function is needed for the gradient descent.
What is the gradient and the gradient descent?
With the gradient descent we find the local minimum of a differentiable function. What we do is, that we take steps proportional to the negative of the gradient of the function at the current point., to find a local minimum. So the gradient is the mathematical derivation and it shows into the direction of the highest slope.
To calculate the minimum of the three weights (w) we check out how much we were wrong, to then square it. This is the math behind it:
What are decision trees:
Here we use a tree like model as a support tool of decisions and their possible consequences. As an example, if we want to know if we should go and play tennis, we'll have a look at the different input data which is sorted by day, outlook, temperature, humidity and wind. The decision tree will take the data and structure it, firstly it will have a look at the outlook, where it can be sunny, overcast or rainy. It will put the data into the associated category, where it then in the next step will categorize into humidity and wind, and if it is high / normal or strong / weak:
Which algorithm do we use?
For decision trees we can use the ID3, iterative dichotomiser 3, algorithm. It starts with the original data set D as the root node. Then, on each iteration of the algorithm, it iterates through every unused attribute of the set D and calculates the entropy H(D) and the information gain IG(D) of the attribute. In a next step it selects the attribute with the smallest entropy or the largest information gain. The set S is then split by the selected attribute to produce subsets of the data, where a node can be split into child nodes based upon the subsets and this will be continued by the algorithm, until it has recursed on each subset (considering only attributes never selected before).
How do we calculate the entropy?
The entropy is the average level of information-surprise-uncertainty inherent in the variable's possible outcomes. At the probability of 0 it is certain to never occur and 1 means that the result is certain. So in machine learning the entropy is the measure of randomness in the information which is processed, where 0 means very certain and 1 means very uncertain. Or in other words: when we build a tree in machine learning, which has to decide which features to split on and where it has to asses its splitting point, we are looking at each stage of such a recursive partitioning for the splitting point, which minimizes the combined entropy of the resulting subsets. So we are searching the data with the absolute purity.
With the following formula we calculate the entropy:
We make an example of absolute purity and one with bad value below:
The value zero means absolute purity of D, where as 1 indicates that there are o and x, which shows that we could do it better with the first example where the entropy is 0.
What is the information gain?
Here we talk about the information gained about a random variable or signal from observing another variable which is random. Through information gain we calculate how the entropy was and how is it now. Lets make an example, this is our data:
Our tree would look like that:
and to ask this we do the following calculations:
STEP 1: We calculate the entropy of the parent, which was 1 in our example above, because it is perfectly contaminated / unclean, because we had the same amount of x as well as o.
STEP 2: We calculate the entropy of child 1 and 2. For child 2 it is 0 because we only have d3, so that's easy, but for child 1 we need to do some math:
STEP 3: Now we calculate the average after weighting:
What have we learned? Before it was entropy 1 which was bad because it was perfectly contaminated. Now with the information gain we got 0,675 which shows that we have improved by 0,325. So our goal is to be able to maximise our information gain, which means we choose our attribute, we split and the we further split, until we have achieved our goal. The maximal information gain is there once we have achieved entropies of 0.
What is K-nearest neighbour?
Here we have a non-parametric method which helps us to classify. Our input consists of k and our output is a class membership. The object is classified by being assigned to the class most common among its k nearest neighbors. We also call this algorithm lazy learning algorithm, because it doesn't learn a discriminative function from the training data bit just memorizes the training dataset instead.
It is important to know that with the k-nearest neighbour we need to ensure to have a good data distribution, as if there is too much information of one class, the k-nearest neighbour algorithm will just put it always in this class.
What is the logistic regression?
The logistic regression algorithm helps us with the classification task, by predicting the probability of a target variable. How are we doing this?
The sigmoid function
We take the sigmoid function as a logistic function as it is and what we do is that we give an input and the function gives us an output between 0 and 1. As an example:
Now what we need to know about sigmoid is, that when x is zero, sigmoid is 0,5 because it looks like that:
And this is how the classification works, because in the end - lets say we want to classify if it is beer or wine - we define beer as class 0 and wine as class 1, and the function classifies everything under 0,5 to beer and everything above to wine. Everything with sigmoid 0,5 is on the decision boundary, which means that the AI can not decide if it is beer or wine. Lets make an example:
What is a confusion matrix?
In times of covid-19 we can do a great example with the covid-19 tests. If we make a test we have everything: people that have corona, but it is not showing that the person has it, or people that have no corona but the test is showing that one has. We assume that we have 1'000 people, with 900 negative and 100 positive corona cases:
Now we need to understand what is written in the matrix. 90% actually or truly do not have covid-19, while 10% have it for real. The term true negative means, that we predicted that they do not have it and actually they don't have the disease. The word true positive meanwhile means, that we have predicted they have it, but they actually don't have it. Where false negative our worst case is, as they do have corona but we predicted they do not have it. And true positive is when they have covid-19 and we predicted they have it.
How do we measure our results?
Accuracy: The accuracy is telling us how often the classifier is correct overall:
Precision: Is telling us, how often it is correct, when it is saying that someone has the disease.
Recall: Ability to find all relevant instances in a dataset, so when it is actually yes, how often does it predict yes?
Be aware that the accuracy can be very much biased! When one class is unbalanced, it will show you a great accuracy but that's not exactly the reality. Accuracy is great if you have a balanced dataset. Precision and recall are against each other - if you adjust one it goas at the expense of the other value.
What is K-Mean?
The clustering method K-Mean starts with a first group of randomly selected centroids, which are used as the beginning point for every cluster and then performs iterative, which means repetitive, calculations to optimize the positions of the centroids. It is a simple algorithm for group formation where the Euclidean distance defines the similarity and actually it works in any dimension.
What is Euclidean distance?
It defines the measured distance between two points: 1 dimensional = difference and 2 dimesional = theorem of pythagoras. So if we are looking for someting similar, we look if he have similar elements in our radius:
In the example above we took two dimensions, we could do the same with four:
Never forget, distance can not be negative, therefore we calculate to the power of two and then immediately pull the roots again.
We do the following steps when using K-mean:
1. We calculate the center of gravity: all x coordinates and all y coordinates are added. 2. Move the group to the new optimum. 3. If there is data in the r of the other group, we assign the data points to the new / different group 4. Calculate the new center point of all members of the group 5. Adjust groups to center point 6. Found the optimum? Yes, you have, when the center of gravity is no longer recalculated and it can no longer be done, because then the optimum has been found. Or alternatively, one could say from a defined number that one simply stops.
What are the characteristics of K-Means?
K-Means is determined by a fixed number of clusters and is fast even with large amounts of data. However, it cannot handle outliers! The clusters have a round shape. K-Means looks at what is the next group (K) and assigns it, i.e. the outliers are in the wrong group inserted and that's not good.
What is the elbow method and what is distortion?
Distortion is the sum of the distance between the point and the cluster center (K) over all points. So what does that mean? We can have one cluster (K) or we can have several clusters (K). The distortion is larger with one K than with two Ks, because more points are added and larger distances are included there. So the smaller it is, the better because the distances are less.
Why is the “elbow” the best value for K?
In the case of the elbow, the kink, it is a trade-off between extreme cases with distortion 0 where every point represents clusters and a high distortion.
If K is too small, there are more clusters in the data than there are possible centers. I.e. Many points are assigned in a "distant" cluster => high distortion
If K is too large, good / dense clusters are divided even further, but the distortion will not sink very much.
What is DBSCAN?
The name comes from the term "Density-Based Spatial Clustering of Applications with Noise" and what DBSCAN does is, it finds "core" points of regions with great density and expands from these to form clusters. Two important parameters are:
1. E: radius of the "neighborhood" 2. minPts: minimum number of points per neighborhood
DBSCAN has the following important elements:
- Corepoints, the centers.
- Borderpoints, join corepoints if they themselves do not meet the requirements of Corepoints.
- Outlier, if there are neither corepoints nor borderpoints.
All connected corepoints + all borderpoints form together a cluster.
The properties of DBSCAN are: The number of clusters is variable, although some difficulty arises with clusters with different densities. Because at the beginning we determined how big the radius is and how many points it contains. I.e. depending on what we have set, a less dense density is problematic because it will not build clusters even if they are. It recognizes outlier. It recognizes actually any cluster shape. It is sensitive to the setting of the parameters. How do you do it with an inference, which is when a new point is joining the cluster later? It can no longer become a core point itself, so you have to look within the radius of the new point where the next core point is, and if it lies within the radius, then it belongs to this cluster.
How do we evaluate clusters?
Roughly there are three approaches:
- With the internal analysis - With the external analysis - With a manual / subjective analysis
We calculate purely on the point that we have. I.e. the metric gets purely on the data (X) calculated. There are indexes and coefficients that all have the same goal of measuring that within the clusters between the different points there have to be great similarities between the points and everything outside has to be different.
Intracluster must be similar and intercluster if possible different, no mixing!
What is the disadvantage of internal metrics?
If you only evaluate on the points, you could use this metric directly in the clustering algorithm. Good internal metrics don't have to result in a real problem being solved! So you just take data, calculate how far apart are they and how similar etc. and you see that there is missing external information. That means the problems may not be solved with the solutions!
External analysis is when you measure the performance of the cluster on the real problem and you have to make labels. For example a field study has to be done, you have customers and the clusters, and now you have to check the labels by e.g. doing surveys and such.
1) Labeling by field study: sports and cat lovers, gamers and introverts, etc. 2) You give it to experts and they label it for you
Manual / subjective / visual analysis
You kinda look at it and decide by rational thinking and gut feeling if it is ok. For the visual analysis you could work with dimensionality reduction by using t-SNE, t-distribution stochastic neighbour embedding, which totally helps to visually display complex context. Disadvantage: You lose information because of the different dimensions, which are being reduced, so thereby levels with information and with it away.
In the next blogpost we'll have a short look out of business perspective for several important things, like the machine learning process in general, the data flywheel and the seven patterns of AI.