Machine learning for the impatient: algorithms tuning algorithms
July 24, 2012 § 1 Comment
I recently blogged about achieving 82.5% accuracy predicting winners and losers of matchups in the 2012 Men’s College Basketball season using machine learning. I’ve only used data acquired prior to the predicted match, resulting in a valid representation of how the algorithm would be able to to predict this coming season. This experiment was really me dipping my toes back in the water after recently leaving Zynga. I doubt I will proceed much further down the sports prediction path, much to the chagrin of my friends who are attempting to live out their dreams of an MIT blackjack team style payout through me. My interest is in the technology, and I am hoping to find enough time to blog about and eventually open source as much of it as possible.
In this post I am actively attempting to ignore the inner workings of the algorithms used and instead focus on them as “black box” components. I speak to their use and their usefulness but not a lot about their actual mechanism. This is in no way fair to these elegant algorithms whose inventors are much smarter men than myself. My personal interest lies in producing interesting (useful?) effects via manipulation of these algorithms and my goal is to help explain at a high level how I approached the problem. I will also be ignoring the technical specifics such as library and language choices as they really will just complicate the message for now.
Edit A few people have pointed out that using the testing set for tuning demands that final measure of effectiveness be doing using a validation test set which is not part of either the training or testing datasets. This is due for the very real potential of over fitting. Also – apparently this technique is called “Hyper-Parameter Optimization.” A helpful commenter over at Hacker News supplied the following resources:
Predicting a game of basketball
The general form of our solution which predicts the winner of a given basketball game will look like:
What this shows is that we will take the known statistics about both teams in a given matchup, perform some unknown transformation on them, and then produce a prediction about the winning team. Most solutions to this problem would take this general form, so there are no major surprises here. The real trick is what’s in the question mark box: a Support Vector Machine (SVM).
What’s an SVM?
An SVM is a specialized form of Neural Network. The high level picture of what an SVM can do looks something like this:
What the SVM for the above table has done is classify animals into two output categories (Dog or Cat) based on 4 highly scientific input parameters (features). You’ll see that it was able to correctly use snarkiness as the key differentiating factor and classify the animals accordingly. An SVM functions as an Optimal Margin Classifier. I like to think of an SVM as drawing a multi-dimensional “line” (hyper-plane) through the input when the feature set is mapped into a higher-dimensional grid. In this case, it would be four dimensional. In two dimensions this looks like the picture below:
In this two dimensional case the algorithm is trying to find a dividing line which will separate the “squares” from the “circles” with the widest possible margin. The name support vectors comes from the line drawn from the features (squares, circles) nearest to the dividing line, those vectors “support” the margin in that they constrain it by being nearest to the line. Therefore, Support Vector Machines. We will be doing non-linear classification so we will have to do a bit of additional work to massage the data into a format that can be used for classification – namely by using a kernel function. This will be touched on later.
One distinction you should be aware of is the difference between classification and regression. In the preceding examples we were Classifying. We were transforming input into distinct possible output values (Features –> Dogs/Cats). If instead we were to produce a continuous output value, say we instead wanted to guess the weight of the animal given the other input features – we would use Regression. Regression can be done with Support Vector Machines. It is referred to as Support Vector Regression. It is very interesting and useful, but beyond the scope of this post.
General Form with SVM
Going back to the general form of our problem, and adding in an SVM we have now all of the major components to perform the Win/Loss classification as shown below:
The problem with this picture is that the SVM is completely untrained. Previously I referred to SVM as a specialized form of Neural Network, and just like a Neural Network the model must be built using training data before it can be used to perform classifications. Before we dive into the training of the SVM let’s briefly discuss the stats (features) we will be using for this particular problem.
The dataset I was able to pull together provided statistics including scores on a per-half basis for each matchup during the Men’s College D1 2012 season. A sampling of these statistics is “Field Goals Made”, “Offensive Rebounds”, “Turn Overs”, etc. I know next to nothing about basketball, besides the whole ball goes in hoop part. The way I chose to do this model was to use aggregate stats about all games that both the Home Team and the Away Team played during the 2012 season PRIOR to the current matchup, as well as their stats from the previous game they played. I reasoned that using their average performance in combination with their last recorded performance I could make a fairly educated guess about their ability to perform this next game. So my input feature were a vector roughly containing:
[ Home Team Average 2012 Stats First Half, Home Team Average 2012 Stats Second Half, Home Team Stats Last Game First Half, Home Team Stats Last Game Second Half, Away Team Average 2012 Stats First Half, Away Team Average 2012 Stats Second Half, Away Team Stats Last Game First Half, Away Team Stats Last Game Second Half ]
This vector consists of approximately 50 features which will be provided as input to our SVM. We will further split our input during the training phase into two different groups: training Data and testing Data. The training data will be used during the actual training process to “teach” the SVM to properly model and classify the supplied dataset – potentially discovering highly dimensional features which are not obvious or detectable to manual analysis. The general form of the input will be:
Desired Output Classification : [ Input Vector ]
For our animal example our training data might take the form:
Dog: [4,3,45,0] Cat: [4,2,12,2] Dog: [4,1,75,0]
The training data is used to tell the SVM “Given this input vector, you should produce this output classification.” The testing data is used on a trained model to test its ability to predict as of yet unknown classifications, and is used after training to evaluate the effectiveness of our trained SVM.
Training the SVM
Once we’ve trained our SVM we’ll end up with a trained SVM model. Generally the way these get used is by saving them as a file to disk and then loading them again into the SVM library at runtime.
Training an SVM consists of supplying a set of tuning parameters which control the type and properties of the SVM Algorithm, as well as our set of training data.
Once the input is applied to the specified tuned SVM algorithm, we will have produced an SVM model which can be used to “predict” future classifications. This is cornerstone of our actual prediction engine.
What are these tuning parameters?
Without going into too much detail – they come in four basic flavors: Type of SVM, SVM Cost Variable, Type of Kernel Function, and Kernel Parameters.
The two types of SVM you’ll most often run into are C-SVM & nu-SVM classifiers. The distinction affects the SVM Cost variable which comes in two forms C and nu. There are additional SVM variables possible, such as a slack variable – but we’re going to disregard those here.
Kernel functions are a fancy way of saying a function that maps an input vector into highly dimensional space called feature space. They transform the input vector into a set of coordinates in space that allow the partitioning to take place. Some of the most common kernel functions are Gaussian, Sigmoid, and Polynomial. The biggest task here is choosing the most effective of the available kernel functions for our problem domain. There is also the option to compute your own kernel in which you supply the coordinates in highly dimensional space rather than the actual input vector to the SVM.
In addition to kernel choice there are various kernel variables that must be tuned properly depending on the kernel function chosen. Some of these are polynomial degree and gamma. They are arithmetic arguments. I prefer to wield math like a hammer so we’ll just ignore them for the moment.
Understanding of these is not initially that important – just understand that they exist and have massive impact on the effectiveness of the SVM.
How do we choose tuning parameters?
Good question. Before we can do much we’ll need a way to tell if a set of tuning parameters is effective at classifying our dataset – chiefly by testing and measuring accuracy. If we had a way of measuring the effectiveness our parameters we could conceivably just guess values until we found some that were effective. Remember when we divided our dataset into training and testing samples? Now we get to use the testing set too!
In addition to training the model we’re going to now supply that model back into the SVM engine along with our testing dataset and measure how frequently the trained model correctly classifies the testing dataset. Essentially we’re checking to see how well our model can guess predict basketball match ups at this point. The measure of accuracy we’ll use is the standard Mean Squared Error calculation (MSE). MSE is a simple calculation over the resulting output that produces a representation of our error rate, smaller is better.
Okay so…how do we choose tuning parameters?
Randomly guess! Not really but…yeah, kinda. Our options are actually fairly limited. If you were a lot smarter than me you might try to use a grid-search to choose an effective range for your C/Nu values as well as some starting point values for your kernel functions. The problem with that is
- I don’t really trust that I’m getting optimal results
- It’s boring.
If you’re like me you don’t particularly want to sit around running grid searches and testing out the parameters. You could automate that process but still you’ll run into questions like:
- Input normalization – Is it better to scale the input so that it is normalized over [0,1] or leave it as is
- Feature selection – Which inputs do we want the SVM to use?
Disappointingly, the answer to these questions has really just been: try it and see what works better.
Also, if you’re REALLY like me – you didn’t even ask the above series of logical questions – instead you just immediately started guessing randomly. That algorithm looks something like this:
I like to think of this algorithm as O(?) and it is critically flawed in that I never got anywhere useful. What we really want is an algorithm that looks something like this:
Enter Genetic Algorithms
A genetic algorithm is essentially a search in which the goal is to find the set of input strings which produce the best possible output. It is domain agnostic as long as you can represent the input as a set of strings and “grade” the output to determine how “fit” the input is. The word genetic comes from the darwinian inspiration and workings of the internals of the algorithm in which many solutions are tried and over time the best solutions emerge using software versions of biological concepts such as genetic selection, crossover and mutation. They are unique amongst machine learning algorithms because you can say things like “and then they have sex” – and it makes perfect sense. You’re still a creep though. They work according to this highly scientific flowchart:
A genetic algorithm attempts to be a microcosm of evolution. Using intelligent selection of individuals to reproduce and a domain specific fitness function it tries to evolve towards the most “fit” individual.
At a much higher level you can think of a GA as a combination of 3 inputs and resulting in a single output which is the “best” input candidate as created via evolution:
These parameters are described below.
Form of input candidate For our purposes the “form” of the input candidate is a string that represents all of the possible Tuning Parameters for the SVM as well as a binary on/off flag for each of the possible basketball stats we could provide as input to the SVM. The addition of the on/off flag for each of the basketball stats allows for feature selection. In effect, the GA is also deciding which statistics are relevant and which are not. One subtlety is that we will also need to provide an acceptable range (bounds) for each of the parameters. The bounds for the feature selection parameters are simply 0 or 1. The bounds for the SVM Tuning Parameters had to be determined on a case-by-case basis in which I chose reasonable ranges of values for each parameter.
GA Parameters Black magic. Truthfully we will experiment some here and there are some guidelines we can apply to help divine a better set of parameters which will cause convergence quicker and prevent the algorithm from getting “stuck” within a certain narrow band of solution candidates.
Fitness Function The Fitness Function is the magic part. This is the code that actually evaluates how “fit” a particular set of Tuning Parameters is. For us this is the MSE of the training Set when evaluated given the selected Tuning Parameters – or everything leading up to the resulting Accuracy in our “guess randomly” algorithm:
For us the return value of the fitness function is the Accuracy of the SVM Model produced by the given candidate set of Tuning Parameters.
Putting it all together
So after pulling all of the parts together what we have is:
It is a Genetic Algorithm tuning a Support Vector machine which is being used as a Classifier to look at statistics about a basketball matchup and predict the result as either a Win or a Loss for the home team.
The steps we followed to get this far were:
- Acquire the datasetThis is actually one of the hardest problems in practice. For my dataset I had to beg and plead, and promise not to give it away. I have found UCI and Austin’s own infochimps to be helpful during my experiments, but obviously for anything real world you’ll need to acquire your own elsewhere.
- Frame the problemFraming a question correctly in such a way that it can be answered by machine learning techniques is a very difficult task. The majority of my failures have come from an inability to ask the correct question of my algorithms. When doing classification you must be able to frame the question in such a way that it falls into one of several discrete buckets. Furthermore these buckets must be well formed or your results will be poor. If you cannot formulate distinct buckets, or have continuous output problems such as say oh…predicting the point spread of a given basketball game – you may instead want to look at regression solutions. Perhaps even Support Vector Regression.
- Format the datasetData processing is a necessary pain in which you get to manipulate your input data into whatever format your SVM requires, or at least into a standardized format that it can be easily retrieved and manipulated from in order to feed the SVM. My advise is to take it seriously, do a good job, and then just move on. At this point I have simple utility libraries and have done it enough times that I can avoid wasting a lot of time, but when I was first getting started I could easily spend days in this phase. I currently use sqlite3 databases for small datasets or proofs of concept. For real world big-data cases this will be an ongoing area of development.
- Choose the genomeUsing the method outlined in this post the choice of genome will just be a list of valid Tuning Parameters and their associated ranges, and a binary flag for each possible input feature to allow for feature selection. The actual work here consists of writing a “generator” and a “bounder”. The generator will produce a new random genome and the bounder will ensure that any genome stays within valid ranges for each of its fields.
- Create training & testing setsSplit the dataset up into portions to be used for training and a portion to be used for evaluation. This step is important in that your training set and testing sets should be representative and mutually exclusive of each other. In my experimentation I have frequently used reserved 20% of my available dataset for testing and used the remaining 80% for training. I adjust these numbers depending on the size of the available dataset.
- Create a fitness functionCalled by the Genetic Algorithm on a candidate set of SVM Tuning Parameters this function should run train the SVM against the training set and evaluate the SVM against the testing set according to the specifications of the given Tuning Parameters.
- Run the GARun the Genetic Algorithm to produce the best Tuning Parameters for an SVM that will correctly classify our datasets and save this as a trained SVM Model. This will be the majority of your computation time and will be the subject of significant experimentation in my limited experience.
- Consult the SVMThis is the “usage” step of the whole process. Prepare the input statistics for both teams in a chosen basketball matchup and provide them as input into the trained SVM model. Receive from the model a classification of the game as either a Win or a Loss for the home team.
- Framing the question is the hard part
- You’ll spend most of your time dealing with your dataset somehow, so take it seriously
- Don’t expect a silver bullet. Beating an SVM to death with a GA will not get you THAT far.
For those that are interested I would suggest that they start looking into some of the better freely available SVM and GA libraries. Python is a great language to start with simply because of the wealth of libraries available and the relative high penetration into both the academic and the business worlds. I personally use a toolkit I’ve created which allows me to easily deploy genetic algorithms into EC2 based clusters for quicker results. My GA library of choice is Inspyred. Currently I’m using a band-aided version of libsvm and its python bindings, but going forward I will probably be trying to replace them with something more stable.
I’ve been unable to find a strong Machine Learning community with significant concentration in any one place. There seems to be traction in some of the different communities for specific types of ML such as the HyperNEAT community’s mailing list. I really wish I could find a forum in which practitioners and hobbyists were discussing techniques openly, please someone let me know if this exists!
I hope to elaborate more on my basketball results, and the actual technology choices/packages I made along the way in a future blog post. I won’t be releasing the dataset but am working to release the actual source code.
The information in this post is just a tiny glimpse into a huge field and has ignored many techniques that will be almost immediately necessary for any semi-serious practitioner – such as cross-validation and regression. The field of machine learning is broad and subtle with many extremely fascinating areas of intense research such as Neuroinformatics, Cortical Algorithms, Coupled Oscillators, Central Pattern Generators, Composited Pattern Producing Networks, Evolutionary Computing and undoubtedly hundreds more I’ve never even heard of. The power of biologically inspired algorithms and the availability of processing power and software packages which make handling large amounts of data feasible make this an exciting time to be an AI nerd.
Source : Amir Elaguizy (http://www.aelag.com)