Importance of Distance Metrics in Machine Learning
You read the title right! You might be wondering how ‘Distance’ and ‘Machine Learning’ are related. Am I going out of context?
No. If you have hands-on experience with Machine Learning algorithms, you have undoubtedly come across ‘Distance’ as a parameter. This blog focuses on the need for distance metrics in Machine Learning and their use cases, so that you appreciate the concept more.
Distance Metrics
A metric is a function that defines a concept of distance between any two members of a set, which are usually called points.
A metric satisfies a few simple properties.
* Non-negativity:
d(i,j)≥0: Distance is a non-negative number.
* Identity of indiscernibles:
d(i,j)=0 if and only if i=j: The distance of an object to itself is 0, and two distinct objects are never at distance 0.
* Symmetry:
d(i,j)=d(j,i): Distance is a symmetric function.
* Triangle Inequality:
d(i,j)≤d(i,k)+d(k,j): Going directly from i to j is never longer than making a detour through k.
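To make the four properties concrete, here is a minimal Python sketch that spot-checks them on a handful of sample points. The helper names (`euclidean`, `squared_euclidean`, `satisfies_metric_axioms`) and the sample points are my own choices for illustration. Note that this only tests a finite sample; it is not a proof that a function is a metric.

```python
import math
from itertools import product

def euclidean(a, b):
    """Straight-line distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Squared straight-line distance; used here as a counter-example."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def satisfies_metric_axioms(d, points, tol=1e-9):
    """Spot-check the four metric properties on a finite set of points."""
    for i, j in product(points, repeat=2):
        if d(i, j) < 0:                      # non-negativity
            return False
        if (d(i, j) <= tol) != (i == j):     # identity of indiscernibles
            return False
        if abs(d(i, j) - d(j, i)) > tol:     # symmetry
            return False
    for i, j, k in product(points, repeat=3):
        if d(i, j) > d(i, k) + d(k, j) + tol:  # triangle inequality
            return False
    return True

sample_points = [(0, 0), (3, 4), (1, 1), (-2, 5)]
collinear_points = [(0,), (1,), (2,)]

print(satisfies_metric_axioms(euclidean, sample_points))
print(satisfies_metric_axioms(squared_euclidean, collinear_points))
```

The squared Euclidean distance is included because it looks like a distance but breaks the triangle inequality on the three collinear points: going from 0 to 2 directly costs 4, while the detour through 1 costs only 1 + 1.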
Intuitively, distance can be thought of as a measure of dissimilarity: the smaller the distance between two observations, the more similar they are, so observations that lie close together can be categorized similarly.
Supervised algorithms such as K-Nearest Neighbors and unsupervised algorithms such as K-means clustering and hierarchical clustering use distance metrics to understand patterns between data points.
Each distance metric has its own use cases, so it is important to know them when choosing one for a given problem; a poor choice can hurt both accuracy and interpretability.
The next section of the blog discusses some commonly used distance metrics in Machine Learning.
Euclidean Distance
You must remember the Pythagorean theorem from your school days.
The Pythagorean formula can be extended to get the Euclidean distance.
Euclidean distance (also known as the Pythagorean metric) is the length of the ordinary straight line between two points in a coordinate system.
The general formula for Euclidean distance in n dimensions is
d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)
A real-life use case of Euclidean Distance is estimating the straight-line distance of a flight between two countries.
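As a quick illustration, Euclidean distance can be sketched in a few lines of Python. The function name `euclidean_distance` and the sample points are mine; the standard library's `math.dist` (available from Python 3.8) computes the same quantity and is used here as a cross-check.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p1 = (0, 0)
p2 = (3, 4)

print(euclidean_distance(p1, p2))  # the classic 3-4-5 right triangle: 5.0
print(math.dist(p1, p2))           # standard-library equivalent
```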
Applications in Machine Learning
- Cluster Analysis: This metric is commonly used in clustering algorithms such as K-means
- Data Science: It is used as a simple metric to measure the similarity between two data points
Manhattan Distance
Also known as taxicab geometry, Manhattan distance calculates the distance between two points as the sum of the absolute differences of their Cartesian coordinates:
d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
Where,
- n is the number of dimensions
- xi, yi are the i-th coordinates of the two data points x and y
A real-life use case of Manhattan Distance is estimating the cab fare between two locations in a city whose streets form a grid.
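The same idea in code: a minimal Python sketch of Manhattan distance, with a function name and sample points chosen by me for illustration.

```python
def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences between two points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))

pickup = (0, 0)
dropoff = (3, 4)

# A cab on a grid drives 3 blocks one way and 4 blocks the other: 7 blocks.
print(manhattan_distance(pickup, dropoff))
```

For the same pair of points the straight-line (Euclidean) distance is 5, so the grid route is longer than the flight path.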
Euclidean Distance Vs Manhattan Distance
- Euclidean distance is commonly used as it is more intuitive. Other distance metrics may be more appropriate in special circumstances.
- The Manhattan (L1) norm is used in linear regression with L1 regularisation (also known as lasso regression); ridge regression uses the L2 norm instead.
- For high-dimensional vectors, you might find that Manhattan Distance works better than Euclidean Distance. In high dimensions, Euclidean distances between points tend to become nearly equal, so the metric discriminates poorly between near and far neighbours. This is one aspect of what is commonly referred to as the Curse of Dimensionality.
[Figure: an illustration of the difference between Euclidean and Manhattan Distance]
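The high-dimensional behaviour of Euclidean distance can be observed with a small experiment. The sketch below is my own setup, not a standard benchmark: it draws 200 random points in the unit cube and measures the "relative contrast", i.e. how much farther the farthest point is from a query than the nearest one, relative to the nearest. The function name, the number of points and the seed are all arbitrary choices.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    query = [rng.random() for _ in range(dim)]
    dists = [
        euclidean(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(f"dimensions={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

You should see the contrast shrink as the number of dimensions grows: in 1000 dimensions the nearest and farthest points are almost equally far away, which is why nearest-neighbour reasoning becomes less informative there.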
Applications in Machine Learning
- Regression analysis: It is used in least absolute deviations (L1) regression to find a straight line that fits a given set of points.
- Face Recognition: Manhattan Distance along with Image Segmentation can be used for facial recognition. For further details, you can read this.
Minkowski Distance
Minkowski distance is a distance measurement between two points in a normed vector space (here, N-dimensional real space). The definition sounds very mathematical. Let me break it down for you: a normed vector space is a vector space in which every vector has a length (its norm), so the distance between two points is the length of the vector that connects them.
Let’s assume that a map is a vector space. The map behaves like a normed vector space because we can draw a vector that connects two cities on it, and we can combine multiple vectors to create a route that connects more than two cities. Now, the adjective “normed.” It means that every vector has a length and no vector has a negative length. That constraint is met too, because if we draw a line between two cities on the map, we can measure its length.
The formula for Minkowski Distance is given by
d(x, y) = (|x1 − y1|^p + |x2 − y2|^p + … + |xn − yn|^p)^(1/p)
Minkowski Distance, also known as the Lp norm, generalizes both Manhattan and Euclidean distance: each is obtained by choosing a value of p.
For p=1, we get the L1 norm, which is Manhattan Distance.
For p=2, we get the L2 norm, which is Euclidean Distance.
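A short Python sketch makes the role of p visible. The function name `minkowski_distance` and the sample points are mine; with p=1 the function returns the sum of absolute differences (Manhattan), and with p=2 it returns the straight-line length (Euclidean).

```python
def minkowski_distance(a, b, p):
    """Minkowski (Lp) distance; p must be at least 1 for this to be a metric."""
    if p < 1:
        raise ValueError("p must be >= 1")
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a = (0, 0)
b = (3, 4)

print(minkowski_distance(a, b, p=1))  # Manhattan: 3 + 4
print(minkowski_distance(a, b, p=2))  # Euclidean: sqrt(9 + 16)
print(minkowski_distance(a, b, p=3))  # an in-between member of the family
```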
Applications in Machine Learning
- Fuzzy Clustering with Minkowski Distance Functions
- A Framework for a Minkowski Distance Based Multi Metric Quality of Service Monitoring Infrastructure for Mobile Ad Hoc Networks
Hamming Distance
As we know, environmental interference and physical defects in the communication medium can cause random bit errors during data transmission. Error coding is a method of detecting and correcting these errors to ensure information is transferred intact from its source to its destination.
Hamming distance is the measure that such error-correcting codes rely on. To measure the distance between two codewords, we just count the number of bits that differ between them. If we are doing this in hardware or software, we can XOR the two codewords and count the number of 1 bits in the result. This count is called the Hamming distance.
Suppose there are two bit strings, 11011001 and 10011101.
11011001 ⊕ 10011101 = 01000100. Since this contains two 1s, the Hamming distance d(11011001, 10011101) is 2.
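The XOR-and-count recipe translates directly into Python. This is a minimal sketch for bit strings of equal length; the function name `hamming_distance` is my own.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of differing bits between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("codewords must have the same length")
    xor = int(a, 2) ^ int(b, 2)   # 1 bits mark the positions that differ
    return bin(xor).count("1")    # count those 1 bits

print(hamming_distance("11011001", "10011101"))  # 2, matching the example above
```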
Here are a few useful references:
- Error Correction with Hamming Codes
- Calculating the Hamming Code
- Applying Hamming Code to blocks of data
Cosine Distance and Cosine Similarity
These two terms are closely related and widely used in recommendation systems. Cosine similarity measures the similarity between two vectors A and B by calculating the cosine of the angle θ between them:
cos θ = (A · B) / (‖A‖ ‖B‖)
Cosine distance = 1 − Cosine similarity. So, similarity increases when distance decreases and vice versa.
Since the range of cos θ is [-1, +1], Cosine Similarity lies between -1 and +1, and Cosine distance lies between 0 and 2.
- 1 for vectors pointing in the same direction, i.e. the data points are highly similar.
- 0 for orthogonal vectors, i.e. the data points are unrelated.
- -1 for vectors pointing in opposite directions, i.e. the data points are opposites.
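The three cases above can be reproduced with a small Python sketch. The function names are mine, and the code assumes that neither vector is the zero vector, since the angle is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1 - cosine_similarity(a, b)

print(cosine_similarity((1, 0), (2, 0)))    # same direction
print(cosine_similarity((1, 0), (0, 3)))    # orthogonal
print(cosine_similarity((1, 0), (-4, 0)))   # opposite direction
print(cosine_distance((1, 0), (-4, 0)))     # largest possible distance
```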
Euclidean Distance VS Cosine Distance for Recommendations
The crux of this whole discussion is to choose the best metric for recommendation systems.
Let’s say I have a database of users who rate movies on a scale of 1–10, and I want a list of the top 5 most-watched movies that match my profile. My first approach to finding similar users was to use Cosine Similarity and just treat user ratings as vector components. The main problem with this approach is that it only measures the angle between vectors and doesn’t take the rating scale or magnitude into consideration. Magnitude matters in this case because I want the top 5 movies that are closest to my user profile averages. Euclidean Distance can be used in such a scenario as it accounts for magnitude as well.
Cosine Similarity is more useful when you do not want magnitude to skew the results. This is most useful for word vectors built from documents, because ignoring magnitude makes a long document comparable to a short one. The Euclidean distance between documents of very different lengths will be large even when their content is similar, which would skew your results.
Also, cosine similarity finds its application in text mining.
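Here is a toy Python sketch of the document-length effect. The two "documents" are invented for illustration: the long one is simply the short one repeated ten times, so their content is identical and only the length differs. All function and variable names are my own.

```python
import math
from collections import Counter

def word_count_vectors(doc_a, doc_b):
    """Turn two texts into word-count vectors over their shared vocabulary."""
    counts_a, counts_b = Counter(doc_a.split()), Counter(doc_b.split())
    vocabulary = sorted(set(counts_a) | set(counts_b))
    return ([counts_a[w] for w in vocabulary],
            [counts_b[w] for w in vocabulary])

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

short_doc = "cats chase mice"
long_doc = " ".join([short_doc] * 10)   # same content, ten times longer

vec_short, vec_long = word_count_vectors(short_doc, long_doc)
print("Euclidean distance:", euclidean_distance(vec_short, vec_long))
print("Cosine similarity: ", cosine_similarity(vec_short, vec_long))
```

The Euclidean distance is large because the counts differ by a factor of ten, while the cosine similarity is essentially 1 because both vectors point in the same direction.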
Conclusion
Summing everything up, we can search for alternatives instead of mindlessly employing predefined metrics. We are not bound to use handcrafted designs since we can make machines come up with tailored solutions for our specific problems; this is what machine learning is about!
Treating the distance metric as a hyperparameter is a useful optimisation lever. There are times when it will come in handy, and times when it won’t. The typical use cases are combating the curse of dimensionality, saving computational costs, interpreting data, and improving accuracy.
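To see what "distance metric as a hyperparameter" can look like in practice, here is a bare-bones K-Nearest Neighbors sketch in plain Python where the Minkowski parameter p is passed in like any other setting. The function names, the two training points and their labels are all made up for illustration; a real project would more likely use a library implementation and pick p by cross-validation.

```python
from collections import Counter

def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train, query, k, p):
    """Majority label among the k training points closest to the query.

    train is a list of (point, label) pairs; p selects the distance metric.
    """
    neighbours = sorted(train, key=lambda item: minkowski(item[0], query, p))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((3, 3), "A"), ((0, 5), "B")]
query = (0, 0)

for p in (1, 2):
    print(f"p={p}: predicted label {knn_predict(train, query, k=1, p=p)}")
```

With p=1 the point (0, 5) is nearer (5 versus 6), so the prediction is "B"; with p=2 the point (3, 3) is nearer (about 4.24 versus 5), so the prediction flips to "A". The same data gives a different answer depending only on the metric.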