Grew out of work in AI
New capability for computers
Database mining: Large datasets from growth of automation/web. E.g. web click data, medical records, biology, engineering.
Applications we can't program by hand. E.g. autonomous helicopter, handwriting recognition, most of Natural Language Processing (NLP), Computer Vision.
Self-customizing programs. E.g. Amazon, Netflix product recommendations.
Understanding human learning (brain, real AI).
Machine Learning Definition
Arthur Samuel (1959). Machine learning: Field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
E.g. Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam.
T: classifying emails as spam or not spam.
E: watching you label emails as spam or not spam.
P: The number (or fraction) of emails correctly classified as spam/not spam.
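The performance measure P from the spam example can be sketched in a few lines. This is only an illustration: the function name fraction_correct and the label lists are made up, with True standing for "spam".

```python
def fraction_correct(predicted, actual):
    """P: fraction of emails whose predicted label matches the user's label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical labels (True = spam). E is the list of labels the user gave.
actual = [True, False, True, True, False]
predicted = [True, False, False, True, False]

p = fraction_correct(predicted, actual)  # 4 of the 5 labels match
```

The program "learns" in Mitchell's sense if p goes up as it sees more labeled emails.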
Machine learning algorithms:
Supervised learning
Unsupervised learning
Others: Reinforcement learning, recommender systems. Also: practical advice for applying learning algorithms.
Supervised learning: "right answers" given
Regression: Predict continuous valued output
Classification: Discrete valued output (0 or 1)
In supervised learning, we are given a data set and already know what the correct output should look like; we assume there is a relationship between the input and the output.
Supervised learning problems are categorized into regression and classification problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
We can derive this structure by clustering the data based on relationships among the variables in the data.
With unsupervised learning there is no feedback based on the prediction results.
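As one concrete way to cluster unlabeled data, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional points. The notes do not name a specific clustering method, so k-means, the function name kmeans_1d, the data, and the starting centers are all assumptions chosen for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data; `centers` are the initial guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Made-up unlabeled data with two visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])
```

No labels are supplied anywhere: the grouping comes only from distances between the points themselves.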
Model and Cost function
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is: the training set is fed to a learning algorithm, which outputs h; h then takes an input x and produces a predicted y.
When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.
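A hypothesis is just a function from inputs to predictions. The sketch below assumes a linear form h(x) = theta0 + theta1 * x for the regression case (consistent with the theta0 and theta1 parameters used later in these notes) and a simple threshold rule for the classification case. The parameter values and the threshold are made up.

```python
def make_hypothesis(theta0, theta1):
    """Return h(x) = theta0 + theta1 * x for fixed parameters."""
    def h(x):
        return theta0 + theta1 * x
    return h

# Regression: continuous output (hypothetical price from living area).
h = make_hypothesis(50.0, 0.1)
predicted_price = h(2000.0)

def classify(living_area, threshold=1500.0):
    """Classification: discrete output. The threshold is an arbitrary assumption."""
    return "house" if living_area >= threshold else "apartment"

label_small = classify(800.0)
label_large = classify(2000.0)
```

The same input (living area) feeds both functions; what differs is whether the output is a number on a continuous scale or one of a small set of categories.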
We can measure the accuracy of our hypothesis function by using a cost function. This takes an average (actually half the mean) of the squared differences between the hypothesis's predictions on the inputs x and the actual outputs y: J(theta0, theta1) = (1/(2m)) * sum over i = 1..m of (h(x_i) - y_i)^2, where m is the number of training examples.
This function is otherwise called the "Squared error function", or "Mean squared error". The mean is halved as a convenience for the computation of gradient descent: differentiating the squared term produces a factor of 2, which cancels the 1/2.
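The halved mean squared error can be written out directly. This is a minimal sketch assuming a linear hypothesis h(x) = theta0 + theta1 * x; the function name squared_error_cost and the three data points are made up for illustration.

```python
def squared_error_cost(theta0, theta1, xs, ys):
    """Half the mean of squared differences between h(x) and y."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

# Made-up training set lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

perfect_fit = squared_error_cost(0.0, 1.0, xs, ys)  # every prediction is exact
poor_fit = squared_error_cost(0.0, 0.5, xs, ys)     # errors 0.5, 1.0, 1.5
```

A lower cost means the hypothesis fits the training data better; a cost of zero means every prediction matches its target.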
We have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in the hypothesis.
Imagine that we graph the cost as a function of the parameters theta0 and theta1 of our hypothesis. We are not graphing x and y themselves, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.
We put theta0 on the x axis and theta1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.
We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, that is, when its value is at a minimum.
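The cost surface can also be tabulated numerically by evaluating the cost at each (theta0, theta1) pair on a grid and looking for the lowest point. This sketch assumes a linear hypothesis and the halved mean squared error; the data and the grid values are made up.

```python
def cost(theta0, theta1, xs, ys):
    """Halved mean squared error of h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

theta0_values = [-1.0, 0.0, 1.0]
theta1_values = [0.0, 1.0, 2.0, 3.0]

# One cost value (the height on the z axis) per (theta0, theta1) point.
surface = {
    (t0, t1): cost(t0, t1, xs, ys)
    for t0 in theta0_values
    for t1 in theta1_values
}

# The lowest point on this grid.
best = min(surface, key=surface.get)
```

A grid only finds the best of the points tried, and gets expensive as the grid gets finer, which is one motivation for stepping downhill instead.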
The way we do this is by taking the derivative of our cost function (the slope of the tangent line to the function). The slope of the tangent at a point is the derivative at that point, and it tells us which direction to move in. We take steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter alpha, which is called the learning rate.
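The stepping procedure can be sketched as follows, assuming a linear hypothesis h(x) = theta0 + theta1 * x and the halved mean squared error as the cost. The function name, the starting point (0, 0), the data, and the values of alpha and steps are assumptions for illustration.

```python
def gradient_descent(xs, ys, alpha=0.1, steps=5000):
    """Repeatedly step theta0 and theta1 against the gradient of the cost."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/(2m)) * sum(error^2); the 2 from the
        # square cancels the 1/2.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters together, using gradients from the same point.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Made-up data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this data the parameters approach theta0 = 1 and theta1 = 2. If alpha is too large the steps can overshoot and the cost can grow instead of shrink; if it is too small, convergence is slow.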