Computational Complexity of SVM
If you are pursuing Computer Science, you have probably come across the notation of time complexity.
There is hardly a technical interview round that does not question you on the running-time complexity of an algorithm.
The time complexity of an algorithm is the total amount of time it requires to complete its execution. In simple words, every piece of code we write takes time to execute, and the time taken by a piece of code to run is known as its time complexity. The lower the time complexity, the faster the execution.
WHY SHOULD WE CARE ABOUT TIME?
One might ask why this even matters when we have powerful processors today. Well, optimising code is still a necessity for a programmer. In the real world, you are dealing with thousands or even millions of data points, and writing an algorithm with poor running time can hurt you or the organisation you are working for.
Time complexity is often measured in terms of:
- Big Oh (O): worst-case running time
- Big Omega (Ω): best-case running time
- Big Theta (Θ): a tight bound, where the best and worst-case running times grow at the same rate
Usually, we are concerned with the upper bound of a computer program, i.e. the worst-case running time.
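As a quick illustration (not specific to SVM, just a hypothetical example of worst-case reasoning), consider a linear search: in the worst case the target is not present, so every one of the n elements gets checked, giving O(n).

```python
def linear_search(items, target):
    """Return the index of target in items, or -1 if absent.

    Worst case: target is not in the list, so all n elements are
    inspected -> O(n) comparisons.
    Best case: target is the first element -> O(1).
    """
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1

print(linear_search([4, 8, 15, 16, 23, 42], 99))  # -1, after scanning all 6 elements
```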
The same argument applies to machine learning algorithms. Selecting an ML algorithm with suitable computational complexity for a given problem is an important part of an ML practitioner's job. This article focuses specifically on SVM.
Both the soft-margin and hard-margin formulations of SVM are convex quadratic programming problems with linear constraints.
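For reference, the standard textbook primal form of the soft-margin problem looks like this (the hard-margin version simply drops the slack variables ξᵢ and the penalty term):

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad\text{subject to}\quad
y_{i}\,(w^{\top}x_{i} + b) \ge 1 - \xi_{i},\qquad \xi_{i}\ge 0,\qquad i = 1,\dots,n
```

The objective is quadratic in w and every constraint is linear in (w, b, ξ), which is exactly what makes this a convex QP.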
General-purpose quadratic programming solvers are very slow on large problems, and even full (batch) gradient descent is computationally expensive; this is where Stochastic Gradient Descent comes into play. Although SGD takes more steps to converge, each individual step uses only a single training example (or a small batch), so it is far cheaper than a full gradient descent step.
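To make that concrete, here is a minimal sketch (my own illustration, not taken from the article's references) of a single SGD step on the regularised hinge loss, the per-example objective a linear SVM minimises; the learning rate and regularisation values are just placeholders.

```python
import numpy as np

def sgd_step(w, x, y, lr=0.01, lam=0.01):
    """One SGD update for lam/2 * ||w||^2 + max(0, 1 - y * w.x)
    evaluated on a single example (x, y).

    Cost of the step is O(d), where d = len(x): one dot product plus
    one vector update -- independent of the dataset size n.
    """
    margin = y * np.dot(w, x)
    grad = lam * w                      # gradient of the regulariser
    if margin < 1:                      # hinge loss is active
        grad -= y * x                   # subgradient of max(0, 1 - y*w.x)
    return w - lr * grad

w = np.zeros(3)
w = sgd_step(w, x=np.array([1.0, 2.0, -1.0]), y=1)
print(w)
```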
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning.
SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, scikit-learn's SGD-based classifiers easily scale to problems with more than 10⁵ training examples and more than 10⁵ features.
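As a minimal sketch of this approach in scikit-learn (the synthetic dataset and hyperparameter values below are only for illustration), SGDClassifier with the hinge loss trains a linear SVM by SGD:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a large, high-dimensional problem.
X, y = make_classification(n_samples=20_000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# loss="hinge" gives a linear SVM; each SGD step touches one example.
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```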
For kernel SVMs, on the other hand, training internally uses SMO (Sequential Minimal Optimisation). libsvm is the most widely used library implementing SMO, and even Scikit-learn uses libsvm for training its SVM models (SVC).
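A short sketch of that libsvm-backed path in scikit-learn (again on assumed synthetic data): fitting an SVC calls into libsvm's SMO solver under the hood.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small synthetic dataset; SVC.fit() runs libsvm's SMO solver internally.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)
print("number of support vectors per class:", clf.n_support_)
```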
Training time for SVM can be given as O(nd²) when d < n, where d is the number of dimensions (features) and n is the number of data points.
When you are dealing with a large dataset, the training time will be high, so SVM is typically not used in such scenarios.
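One rough way to see this growth empirically (a sketch only: timings depend on your machine, and the synthetic data is an assumption) is to train a kernel SVC on increasingly large subsets and watch the fit time climb faster than linearly:

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=16_000, n_features=20, random_state=0)

# Fit on growing subsets and record the wall-clock training time.
for n in (1_000, 2_000, 4_000, 8_000, 16_000):
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    start = time.perf_counter()
    clf.fit(X[:n], y[:n])
    print(f"n={n:>6}  fit time = {time.perf_counter() - start:.2f}s")
```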
The run time (prediction time) can be given as O(kd), where k is the number of support vectors (0 < k ≤ n) and d is the number of dimensions.
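As a small sketch of where O(kd) comes from (again on assumed synthetic data): the decision function of a kernel SVM is a sum over the k support vectors, and each kernel evaluation touches all d features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

sv = clf.support_vectors_          # shape (k, d): the k support vectors
k, d = sv.shape
print(f"k = {k} support vectors, d = {d} features")

# Manual decision function for one point x:
# k kernel evaluations, each an O(d) distance computation -> O(kd) overall.
x = X[0]
kernel_vals = np.exp(-0.5 * np.sum((sv - x) ** 2, axis=1))   # RBF with gamma = 0.5
manual = clf.dual_coef_ @ kernel_vals + clf.intercept_
print(np.allclose(manual, clf.decision_function([x])))        # True
```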
References:
- http://cs229.stanford.edu/materials/smo.pdf
- https://www.cs.utah.edu/~zhe/pdf/lec-19-2-svm-sgd-upload.pdf