July 28, 2012 § Leave a comment
I read this wonderful article in Reader's Digest, so I thought of sharing it. How true it is that we Indians still don't know many of the places we could visit. I have already started saving up, planning a trip to the north of India. Sadly, I haven't been farther north than Gujarat.
Here is the article…
The India You Don’t Know
Travellers in India usually have their itinerary all mapped out—it’s generally the tried and tested routes. The Golden Triangle (Delhi-Agra-Jaipur) or Goa. And since unstable Kashmir is out, Kerala is in. That is an Indian holiday in a nutshell. There are a few who do special interest tours.
Lakshadweep and Andamans for the diving, Kipling Country for jungle safaris, the Buddhist pilgrim trail, the heritage train rides. But beyond these busy pockets, there is a vast treasure trove of secret places.
Talk to any Indian about a favourite childhood memory and he or she will wax poetic about their “native place.” Ponds they used to swim in, fruit eaten straight off the tree, family feasts, temple festivals. They may also speak of memorable holidays to special destinations, often very close to home but still unexplored, preserved as if in amber. Here are seven spots off the tourist map but well worth seeking out.
Lucknow, Uttar Pradesh
For Mumbai-based model Ashutosh Singh, Lucknow is home. “Whenever I return, it’s as if I’ve never been away. There is an old world courtesy unique to my town.” He says that the frantic development that characterizes other Indian towns hasn’t altered Lucknow’s essential structure. The Old City still preserves the fading glories of this capital of the Nawabs of Awadh.
Towering gates, domes and arches define the cityscape. Even the Charbagh railway station looks like something out of the Arabian Nights. There are also charming havelis with intimate courtyards and interconnected rooms, just like the one where Ashutosh’s own family still stays. In the evenings people would stroll out unhurriedly to socialize over Lucknow’s famous chaat, sweets or paan.
Many of Lucknow’s iconic landmarks have made their presence felt in films like Umrao Jaan and Shatranj ke Khilari: The Bara and Chota Imambaras, Rumi Darwaza, the labyrinthine Bhool Bhulaiyaa, Chattar Manzil and Jama Masjid. The Bara Imambara complex, which also houses the famous maze, is essentially a Shia Muslim shrine. This grand project was undertaken by 18th-century Nawab Asaf ud Daula to generate employment during a time of famine. While the common people worked during the day, the equally impoverished but unskilled nobility were secretly hired to destroy what was constructed during the night, so that the task would continue till the crisis was over. He was the general architect of much of what we see today. “The magnificent Lucknow University buildings are an architectural marvel, with a vast campus,” says Ashutosh, “I’m proud to have studied there.”
Delhi-based writer and filmmaker Vandana Natu Ghana fell in love with Lucknow while she was a student there. She recommends the old markets of Chowk and Aminabad for delicate shadow embroidery (chikan), rich zardozi and badla work in silver and gold threads. This bustling area also houses the legendary Tunde ke Kebab shop, over a century old. “You can base yourself in Lucknow and do some fascinating day trips out of the city. Barabanki, with its ancient Mahabharat connections, and Malihabad, famous for its mango orchards, are redolent of a bygone era and only 25 kilometres away from the city centre,” she suggests. There is also the village of Kakori, which has given its name to silken smooth kebabs, created to indulge a toothless nawab. Lucknow is also very much a gourmet destination. Vandana, who has an Army background, advises that I not miss the British Residency, said to be haunted by ghosts of the 1857 Mutiny and siege, and the long drive through the cantonment area to the War Memorial, fringed by laburnum and gulmohar trees. “In summer, the road becomes a carpet of red and yellow flowers. People tend to visit Delhi, Agra and Varanasi and bypass Lucknow altogether. They don’t realize what they’re missing,” she sighs.
Kasauli and other cantonment towns
I have always liked cantonments. They stave off rampant development, preserve heritage structures and are often in beautiful locations. If you’re interested in old churches, military graveyards and history, you will definitely have a sense of stepping back in time.
Married to officers of the Indian Army’s Gurkha Regiment, Naji Sudarshan and Daphne Chauhan live in Delhi, but have had homes in cantonment towns all over the country. “It is a world all its own,” says Naji. “We are a stone’s throw away from chaotic towns and crowded metros, but the instant you enter Army territory, everything is disciplined and beautifully maintained.” A cantonment town is a time machine. And still properly British. You need a dinner jacket to dine at clubs where the menus have been the same for generations. Gardeners maintain seasonal flowerbeds with military precision and since wooded areas are protected, you find an astounding variety of birdlife.
Self-contained cantonment towns like Ranikhet, Lansdowne and Deolali have a quaint character all their own. Foreigners are not permitted to visit Chakrata in Uttarakhand, which is a restricted-access area, while Mhow, near Indore, is actually an acronym for Military Headquarters of War. There are artillery and combat schools, sanatoriums, military colleges and regimental headquarters scattered through all of these.
Army families keep getting posted to far-flung stations, but everything remains reassuringly familiar within the cantonment. “So while you get to discover a different place every time you are transferred, the set-up never really changes. Cocooned within the Army, you couldn’t be more secure,” adds Naji.
Kasauli in Himachal Pradesh is one of Naji’s favourites, a flower basket of a hill station with its typical upper and lower mall roads, a delightful bazaar and Victorian cottages with roses around the door. It is also across the hill from Subathu, where the Gurkha regiment has its headquarters. Daphne returned recently to Wellington, home of the Madras Regimental Centre in the Nilgiri Hills, where they had been posted 20 years ago. “Nothing has changed. It is still the same sleepy town, with perfect weather. Yet it is close enough to the social whirl of Ooty,” says Daphne. “A good place to base yourself for treks and tea gardens. Not many hotels, but there are home stays and farms in Wellington as well as in nearby Coonoor.”
Ashtamudi, Kerala
Ashtamudi is a sprawling expanse of water, the second largest and deepest wetland ecosystem in Kerala.
Like an octopus, it is eight-armed (ashtamudi literally means eight locks of hair). Vembanad (which includes Kumarakom) is larger and much promoted by Kerala Tourism, but lesser known Ashtamudi has much to offer. All the canals and creeks of these backwaters converge at Neendakara, a hub of the state’s fishing industry.
For Naresh Narendran, a rubber businessman in nearby Kollam (formerly Quilon), Ashtamudi is home territory. “Unlike the other backwaters, you see dense stands of coconut trees, rather than the usual scene of rice paddies,” he says. “There are also sand bars in the estuary which fishermen use. From a distance, it looks like the man is actually walking on water.”
I remember visiting an uncle whose backyard extended to the water’s edge. We could buy karimeen (pearl spot fish) and river mussels straight off the fishing boats. For fresh coconut water or toddy, a man would be immediately despatched up a coconut palm. Much of what we ate was picked from the kitchen garden. Naresh himself is proud of his own “little farm” not far from here, where he experiments with varieties of banana, yam, fruit, and vegetables. This is quintessential, picture-postcard Kerala with palm-fringed lagoons and dense tropical vistas in a hundred shades of green. “You could rent a boat and go around,” suggests Naresh. “But there are commuter ferry services to Alleppey at a fraction of the cost, which will give you much the same views.”
The much-photographed Chinese-style fishing nets of Cochin are seen around Ashtamudi as well. You could use the ferries to visit neighbouring islands, villages and lesser-known towns in and around the backwaters, much as the locals do. There are temples, sacred groves and churches to discover. Water birds like cormorants and herons abound. “I love photographing the backwaters in its many moods. In the monsoon it is quite spectacular,” says Naresh. “A few resorts are coming up here but it is still largely unspoilt.”
Kollam itself is a historic port town worth exploring. The coir and cashew industries made it prosperous but it was well known on ancient trade routes. Marco Polo came here, as did Ibn Battuta, the famed Islamic scholar and traveller. Not far from Kollam town is Thangassery, a little Anglo-Indian enclave that was once settled by both Dutch and Portuguese colonizers. It has a layout reminiscent of towns in Goa, beaches and a stately lighthouse. But the Anglo Indian community which gave it much of its character has largely emigrated.
July 27, 2012 § Leave a comment
The ties between OS X and iOS have never been tighter. But is it enough to reverse a slide in iPhone sales?
Apple officially released OS X (10.8) Mountain Lion, the company’s biggest effort yet to bridge iOS, the operating system that powers the iPhone and iPad, with its desktop cousin. And more than ever, Apple is relying on the cloud to strengthen those cross-OS ties.
Mountain Lion, available today for $19.99 in the Mac App Store, is packed with features aimed at knocking down the barriers between Apple's mobile, tablet, laptop and desktop experiences. Key to making those features work is iCloud.
The new (to Mac) iMessage software extends the iPhone and iPad’s messaging service to Mac desktops. Using iCloud to circumvent cellular carriers, iMessage allows users to text, share photos and videos or initiate FaceTime sessions with iOS users. Similarly, iCloud is used to keep email, contacts, reminders and calendars in sync between iOS devices and OS X machines.
Another iOS feature making the leap to the laptop is Notification Center. Debuting first on iOS — and borrowing a page from Android's notification system handbook — it displays a breakdown of scheduled calendar events, recent emails and other alerts from supported apps that persist across compatible iOS and OS X devices.
And with Mountain Lion, OS X becomes a more social animal. Tighter Twitter and Facebook integration makes it possible to send tweets or Facebook updates from within supported apps without switching to a browser or dedicated app.
But will these changes be enough to erase the cracks that are beginning to show in Apple’s wildly successful iOS ecosystem?
iPhone Sales Slow
Last quarter (Q3 2012) Apple sold 26 million iPhones, a 28 percent increase over the same period last year. By any measure, that’s a huge feat and one that helped the company post $35 billion in revenue and bank $8.8 billion, or $9.32 per share, in profit.
Unfortunately for Apple, the company missed Wall Street analysts' expectations of 28 million iPhones sold and a dollar more per share. iPhone sales dropped 26 percent from the previous quarter and generated 28 percent less revenue. During yesterday's earnings call, both Apple CEO Tim Cook and CFO Peter Oppenheimer hinted that anticipation of the iPhone 5 is crimping sales of current handsets.
It’s not all bad news on the mobile front. One bright spot is the market leading iPad.
The company sold 17 million iPads in Q3 2012, an 84 percent year-on-year increase and a 44 percent boost over the previous quarter. Rumors continue to swirl that Apple is preparing an “iPad Mini” to continue the iPad’s meteoric sales momentum in the face of small tablets like Google’s sold-out Nexus 7 and to grow market share in the face of new challengers like Microsoft’s upcoming Surface slate.
July 27, 2012 § Leave a comment
Last Wednesday I attended an online seminar on Amazon Web Services. I'm excited to see how cloud computing can solve tomorrow's problems. There is a lot to cover under cloud computing, and I will be writing in more detail under the Cloud Computing section.
I'm already reading the Cloud Computing Bible, which talks about how and why cloud computing is the future and how it will benefit our business needs.
Besides that, I have signed up for Amazon Web Services and Windows Azure, and I'm already planning to build apps for the cloud.
July 27, 2012 § Leave a comment
12 resolutions for programmers
It’s important for programmers to challenge themselves.
Creative and technical stagnation is the only alternative.
In the spirit of the new year, I’ve compiled twelve month-sized resolutions.
Each month is an annually renewable technical or personal challenge:
- Go analog.
- Stay healthy.
- Embrace the uncomfortable.
- Learn a new programming language.
- Learn more mathematics.
- Focus on security.
- Back up your data.
- Learn more theory.
- Engage the arts and humanities.
- Learn new software.
- Complete a personal project.
July 26, 2012 § 2 Comments
What is Hybrid Computing?
A hybrid computing platform lets customers connect the packaged small business software applications that they run on their own internal desktops or servers to applications that run in the cloud.
As discussed in What is Cloud Computing and Why Should You Care?, more software vendors are deciding to develop and deliver new applications as cloud-based, software-as-a-service (SaaS) solutions. This model helps them reach a broader market and serve customers more efficiently and cost-effectively. And, because cloud computing can often provide significant cost, time and ease-of-use benefits, more companies are choosing to buy and deploy cloud computing solutions instead of conventional on-premise software as new solutions needs arise.
However, most companies will continue to use a combination of both traditional on-premise software and cloud-based SaaS solutions. Think about it: You are unlikely to get rid of an application you're running in-house just to swap in a SaaS solution. But if you need a new solution, you're likely to look at a range of options, including SaaS applications, to fit the bill.
In some newer areas — such as email marketing or social media management — this may be the only way solutions are even available. In cases where you have a choice, you may simply decide that the SaaS model makes more sense, or that traditional deployment will work better for your company.
Why Should You Care?
Many software vendors with a strong presence and customer base in the traditional packaged or “on-premise” software world are developing platforms that provide new SaaS solutions that extend and integrate with their traditional on-premise applications. Some vendors provide app stores or marketplaces to make it easier for you to find solutions that will work well with those you already have.
For instance, Intuit has developed a platform and Intuit’s Workplace App Center so that customers can find and try applications that work with QuickBooks and with each other. Microsoft’s Software + Service strategy is designed to connect a myriad of Microsoft’s traditional software applications to Web-based SaaS solutions.
Recently, Sage launched its Connected Services offerings, designed to connect users of its traditional packaged software offerings with online SaaS services. The Sage e-Marketing application, for example, connects ACT and SalesLogix users with online email marketing services, while many of Sage's accounting solutions connect with its new Sage Exchange online payment processing.
These vendors realize that most companies will use a mix of on-premise and SaaS solutions for a very long time. While companies can get some value from using some point solutions in a standalone fashion, in many cases, you'll need to integrate the new SaaS solution with an existing on-premise application — such as integrating payroll with accounting and HR, or social media management with a contact or customer management application — to get the value and efficiencies you need.
From the standpoint of their own corporate interests, vendors can increase revenues and profitability by selling existing customers new SaaS services (either their own or those of their partners) to connect to and extend on-premise solutions they’re already using. Having a strong SaaS play that is integrated with their on-premise solutions also helps them protect against competitive SaaS-only vendors that could steadily encroach on their turf.
More altruistically, these vendors want to offer their customers the means to bridge between the on-premise and SaaS solution worlds more easily. After all, it can be very confusing to even sort through and differentiate between all the solutions in a given category, and expensive and time-consuming to integrate them so they work well and easily with what you’re already using.
What to Consider
Most small businesses run at least a couple of on-premise software applications that are critical to their business. For instance, it’s a good bet that accounting and financials are on this list. Other applications will vary depending on the business you’re in, but could include things such as solutions to manage contacts and customers, projects, human resources, logistics or a function specific to your industry.
As you identify and prioritize new requirements to streamline and automate additional tasks, think about the overlaps they’ll require with workflows in the core on-premise solutions and processes that you’re using. For instance, if you decide you want to streamline payments processing, does your accounting software vendor provide a payments processing service that can easily snap into the accounting application?
By taking advantage of the SaaS offerings available from a vendor’s hybrid computing platform, your new solution will generally be up, running and integrated with the core application much more quickly. However, keep in mind that as you snap more services into that core on-premise application, your reliance on that anchor application will grow — arguably making it harder to switch should your needs change.
July 24, 2012 § 1 Comment
I recently blogged about achieving 82.5% accuracy predicting winners and losers of matchups in the 2012 Men's College Basketball season using machine learning. I've only used data acquired prior to the predicted match, resulting in a valid representation of how the algorithm would be able to predict this coming season. This experiment was really me dipping my toes back in the water after recently leaving Zynga. I doubt I will proceed much further down the sports prediction path, much to the chagrin of my friends who are attempting to live out their dreams of an MIT-blackjack-team-style payout through me. My interest is in the technology, and I am hoping to find enough time to blog about and eventually open source as much of it as possible.
In this post I am actively attempting to ignore the inner workings of the algorithms used and instead focus on them as “black box” components. I speak to their use and their usefulness but not a lot about their actual mechanism. This is in no way fair to these elegant algorithms whose inventors are much smarter men than myself. My personal interest lies in producing interesting (useful?) effects via manipulation of these algorithms and my goal is to help explain at a high level how I approached the problem. I will also be ignoring the technical specifics such as library and language choices as they really will just complicate the message for now.
Edit: A few people have pointed out that using the testing set for tuning demands that the final measure of effectiveness be done using a validation set which is not part of either the training or testing datasets. This is due to the very real potential of overfitting. Also, apparently this technique is called "hyper-parameter optimization." A helpful commenter over at Hacker News supplied the following resources:
Predicting a game of basketball
The general form of our solution which predicts the winner of a given basketball game will look like:
What this shows is that we will take the known statistics about both teams in a given matchup, perform some unknown transformation on them, and then produce a prediction about the winning team. Most solutions to this problem would take this general form, so there are no major surprises here. The real trick is what’s in the question mark box: a Support Vector Machine (SVM).
What’s an SVM?
An SVM is a specialized form of Neural Network. The high level picture of what an SVM can do looks something like this:
What the SVM for the above table has done is classify animals into two output categories (Dog or Cat) based on 4 highly scientific input parameters (features). You’ll see that it was able to correctly use snarkiness as the key differentiating factor and classify the animals accordingly. An SVM functions as an Optimal Margin Classifier. I like to think of an SVM as drawing a multi-dimensional “line” (hyper-plane) through the input when the feature set is mapped into a higher-dimensional grid. In this case, it would be four dimensional. In two dimensions this looks like the picture below:
In this two dimensional case the algorithm is trying to find a dividing line which will separate the “squares” from the “circles” with the widest possible margin. The name support vectors comes from the line drawn from the features (squares, circles) nearest to the dividing line, those vectors “support” the margin in that they constrain it by being nearest to the line. Therefore, Support Vector Machines. We will be doing non-linear classification so we will have to do a bit of additional work to massage the data into a format that can be used for classification – namely by using a kernel function. This will be touched on later.
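To make the margin idea concrete, here is a minimal sketch of the two-dimensional case. It assumes scikit-learn (the post doesn't name a library), and the points are invented for illustration:

```python
# A linear SVM separating "squares" from "circles" in two dimensions,
# using scikit-learn's SVC (the points are made up for illustration).
from sklearn.svm import SVC

# Two well-separated clusters of 2D points.
X = [[1.0, 1.0], [1.5, 0.5], [0.5, 1.5],   # squares (class 0)
     [4.0, 4.0], [4.5, 3.5], [3.5, 4.5]]   # circles (class 1)
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)

# The support vectors are the points nearest the dividing line; they alone
# constrain the maximum-margin hyperplane.
print(clf.support_vectors_)
print(clf.predict([[1.0, 1.5], [4.0, 3.0]]))  # -> [0 1]
```

The fitted model exposes exactly the objects discussed above: the support vectors and a maximum-margin boundary that new points are classified against.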
One distinction you should be aware of is the difference between classification and regression. In the preceding examples we were Classifying. We were transforming input into distinct possible output values (Features –> Dogs/Cats). If instead we were to produce a continuous output value, say we instead wanted to guess the weight of the animal given the other input features – we would use Regression. Regression can be done with Support Vector Machines. It is referred to as Support Vector Regression. It is very interesting and useful, but beyond the scope of this post.
General Form with SVM
Going back to the general form of our problem, and adding in an SVM we have now all of the major components to perform the Win/Loss classification as shown below:
The problem with this picture is that the SVM is completely untrained. Previously I referred to SVM as a specialized form of Neural Network, and just like a Neural Network the model must be built using training data before it can be used to perform classifications. Before we dive into the training of the SVM let’s briefly discuss the stats (features) we will be using for this particular problem.
The dataset I was able to pull together provided statistics including scores on a per-half basis for each matchup during the Men's College D1 2012 season. A sampling of these statistics is "Field Goals Made", "Offensive Rebounds", "Turn Overs", etc. I know next to nothing about basketball, besides the whole ball-goes-in-hoop part. The way I chose to do this model was to use aggregate stats about all games that both the Home Team and the Away Team played during the 2012 season PRIOR to the current matchup, as well as their stats from the previous game they played. I reasoned that by using their average performance in combination with their last recorded performance I could make a fairly educated guess about their ability to perform in the next game. So my input features were a vector roughly containing:
[ Home Team Average 2012 Stats First Half, Home Team Average 2012 Stats Second Half, Home Team Stats Last Game First Half, Home Team Stats Last Game Second Half, Away Team Average 2012 Stats First Half, Away Team Average 2012 Stats Second Half, Away Team Stats Last Game First Half, Away Team Stats Last Game Second Half ]
This vector consists of approximately 50 features which will be provided as input to our SVM. We will further split our input during the training phase into two different groups: training data and testing data. The training data will be used during the actual training process to "teach" the SVM to properly model and classify the supplied dataset – potentially discovering highly dimensional features which are not obvious or detectable through manual analysis. The general form of the input will be:
Desired Output Classification : [ Input Vector ]
For our animal example our training data might take the form:
Dog: [4,3,45,0] Cat: [4,2,12,2] Dog: [4,1,75,0]
The training data is used to tell the SVM “Given this input vector, you should produce this output classification.” The testing data is used on a trained model to test its ability to predict as of yet unknown classifications, and is used after training to evaluate the effectiveness of our trained SVM.
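Those "label: [vector]" rows map directly onto a library call. A hedged sketch (assuming scikit-learn; the feature values are invented) of splitting the data, training, and then evaluating on the held-out testing set:

```python
# Turn the Dog/Cat training rows into a fitted model plus a held-out
# testing set (assuming scikit-learn; values are illustrative).
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = [[4, 3, 45, 0], [4, 2, 12, 2], [4, 1, 75, 0],
     [4, 2, 10, 3], [4, 3, 60, 0], [4, 1, 8, 2]]
y = ["dog", "cat", "dog", "cat", "dog", "cat"]

# Hold out a third of the rows; they play no part in training and are
# used only to evaluate the trained model afterwards.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

clf = SVC(kernel="linear").fit(X_train, y_train)
print(clf.score(X_test, y_test))  # fraction of held-out rows classified correctly
```

The same pattern applies unchanged when the rows are ~50-feature basketball vectors instead of four animal features.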
Training the SVM
Once we’ve trained our SVM we’ll end up with a trained SVM model. Generally the way these get used is by saving them as a file to disk and then loading them again into the SVM library at runtime.
Training an SVM consists of supplying a set of tuning parameters which control the type and properties of the SVM Algorithm, as well as our set of training data.
Once the input is applied to the specified tuned SVM algorithm, we will have produced an SVM model which can be used to "predict" future classifications. This is the cornerstone of our actual prediction engine.
What are these tuning parameters?
Without going into too much detail – they come in four basic flavors: Type of SVM, SVM Cost Variable, Type of Kernel Function, and Kernel Parameters.
The two types of SVM you’ll most often run into are C-SVM & nu-SVM classifiers. The distinction affects the SVM Cost variable which comes in two forms C and nu. There are additional SVM variables possible, such as a slack variable – but we’re going to disregard those here.
Kernel functions are a fancy way of saying a function that maps an input vector into highly dimensional space called feature space. They transform the input vector into a set of coordinates in space that allow the partitioning to take place. Some of the most common kernel functions are Gaussian, Sigmoid, and Polynomial. The biggest task here is choosing the most effective of the available kernel functions for our problem domain. There is also the option to compute your own kernel in which you supply the coordinates in highly dimensional space rather than the actual input vector to the SVM.
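In practice the kernel is just another tuning parameter you pass in. A hedged sketch (assuming scikit-learn, which the post doesn't name) comparing three common kernels on a toy dataset that no straight line can separate:

```python
# Compare Gaussian (RBF), sigmoid, and polynomial kernels on XOR-style
# data, which is not linearly separable (assuming scikit-learn).
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [0, 1], [1, 0]]  # XOR layout
y = [0, 0, 1, 1]

scores = {}
for kernel in ["rbf", "sigmoid", "poly"]:
    clf = SVC(kernel=kernel, gamma="scale")
    clf.fit(X, y)
    scores[kernel] = clf.score(X, y)  # training accuracy, for illustration only
print(scores)
```

Different kernels map the same input into feature space differently, so their accuracy on the same data can differ substantially; that is exactly why kernel choice is part of the tuning problem.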
In addition to kernel choice there are various kernel variables that must be tuned properly depending on the kernel function chosen. Some of these are polynomial degree and gamma. They are arithmetic arguments. I prefer to wield math like a hammer so we’ll just ignore them for the moment.
Understanding of these is not initially that important – just understand that they exist and have massive impact on the effectiveness of the SVM.
How do we choose tuning parameters?
Good question. Before we can do much we’ll need a way to tell if a set of tuning parameters is effective at classifying our dataset – chiefly by testing and measuring accuracy. If we had a way of measuring the effectiveness our parameters we could conceivably just guess values until we found some that were effective. Remember when we divided our dataset into training and testing samples? Now we get to use the testing set too!
In addition to training the model we're going to now supply that model back into the SVM engine along with our testing dataset and measure how frequently the trained model correctly classifies the testing dataset. Essentially we're checking to see how well our model can predict basketball matchups at this point. The measure of accuracy we'll use is the standard Mean Squared Error calculation (MSE). MSE is a simple calculation over the resulting output that produces a representation of our error rate, smaller is better.
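The MSE calculation itself is only a few lines (pure Python sketch; the win/loss vectors are invented):

```python
# Mean squared error over predicted vs. actual outcomes. For win/loss
# labels encoded as 1/0, each squared error is 0 or 1, so MSE reduces
# to the misclassification rate.
def mean_squared_error(actual, predicted):
    errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

actual    = [1, 0, 1, 1, 0]  # 1 = home win, 0 = home loss
predicted = [1, 0, 0, 1, 0]  # one wrong guess out of five

print(mean_squared_error(actual, predicted))  # -> 0.2
```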
Okay so…how do we choose tuning parameters?
Randomly guess! Not really but…yeah, kinda. Our options are actually fairly limited. If you were a lot smarter than me you might try to use a grid-search to choose an effective range for your C/Nu values as well as some starting point values for your kernel functions. The problem with that is
- I don’t really trust that I’m getting optimal results
- It’s boring.
If you’re like me you don’t particularly want to sit around running grid searches and testing out the parameters. You could automate that process but still you’ll run into questions like:
- Input normalization – Is it better to scale the input so that it is normalized over [0,1], or leave it as is?
- Feature selection – Which inputs do we want the SVM to use?
Disappointingly, the answer to these questions has really just been: try it and see what works better.
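For completeness, the automated version of "try it and see" might look like this grid search, which also folds in the [0,1] normalization question (a hedged sketch assuming scikit-learn; the dataset is synthetic):

```python
# Grid search over C and gamma for an RBF-kernel SVM, with the input
# scaled to [0, 1] first (assuming scikit-learn; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", MinMaxScaler()),   # normalize each feature to [0, 1]
    ("svm", SVC(kernel="rbf")),
])
grid = GridSearchCV(pipe, {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [0.01, 0.1, 1],
}, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

It works, but it only ever explores the points you enumerated in the grid, which is part of the dissatisfaction described above.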
Also, if you’re REALLY like me – you didn’t even ask the above series of logical questions – instead you just immediately started guessing randomly. That algorithm looks something like this:
I like to think of this algorithm as O(?) and it is critically flawed in that I never got anywhere useful. What we really want is an algorithm that looks something like this:
Enter Genetic Algorithms
A genetic algorithm is essentially a search in which the goal is to find the set of input strings which produce the best possible output. It is domain agnostic as long as you can represent the input as a set of strings and "grade" the output to determine how "fit" the input is. The word genetic comes from the Darwinian inspiration and workings of the internals of the algorithm, in which many solutions are tried and over time the best solutions emerge using software versions of biological concepts such as genetic selection, crossover and mutation. They are unique amongst machine learning algorithms because you can say things like "and then they have sex" – and it makes perfect sense. You're still a creep though. They work according to this highly scientific flowchart:
A genetic algorithm attempts to be a microcosm of evolution. Using intelligent selection of individuals to reproduce and a domain specific fitness function it tries to evolve towards the most “fit” individual.
At a much higher level you can think of a GA as a combination of 3 inputs and resulting in a single output which is the “best” input candidate as created via evolution:
These parameters are described below.
Form of input candidate: For our purposes the "form" of the input candidate is a string that represents all of the possible Tuning Parameters for the SVM as well as a binary on/off flag for each of the possible basketball stats we could provide as input to the SVM. The addition of the on/off flag for each of the basketball stats allows for feature selection. In effect, the GA is also deciding which statistics are relevant and which are not. One subtlety is that we will also need to provide an acceptable range (bounds) for each of the parameters. The bounds for the feature selection parameters are simply 0 or 1. The bounds for the SVM Tuning Parameters had to be determined on a case-by-case basis in which I chose reasonable ranges of values for each parameter.
GA Parameters: Black magic. Truthfully, we will experiment some here, and there are some guidelines we can apply to help divine a better set of parameters which will cause quicker convergence and prevent the algorithm from getting "stuck" within a narrow band of solution candidates.
Fitness Function: The Fitness Function is the magic part. This is the code that actually evaluates how "fit" a particular set of Tuning Parameters is. For us this is the MSE of the training set when evaluated given the selected Tuning Parameters – or everything leading up to the resulting Accuracy in our "guess randomly" algorithm:
For us the return value of the fitness function is the Accuracy of the SVM Model produced by the given candidate set of Tuning Parameters.
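The post doesn’t include the fitness code itself, so here is a hedged sketch of what such a function might look like. It uses scikit-learn’s SVC as a stand-in for the libsvm bindings mentioned later, random placeholder data instead of real basketball stats, and the made-up candidate layout [log2 C, log2 gamma, feature flags].

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder data: 200 "games", 4 stats each, labeled win (1) / loss (0)
# by a learnable toy rule so the SVM has something to find.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, y_train = X[:160], y[:160]
X_test, y_test = X[160:], y[160:]

def fitness(candidate):
    """Accuracy of an SVM trained with the candidate's tuning parameters.

    candidate = [log2_C, log2_gamma, flag, flag, flag, flag]; the flags
    select which stat columns the SVM gets to see (feature selection).
    """
    log2_c, log2_gamma, *flags = candidate
    cols = [i for i, f in enumerate(flags) if f >= 0.5]
    if not cols:
        return 0.0                     # no features selected: worthless
    model = SVC(C=2.0 ** log2_c, gamma=2.0 ** log2_gamma)
    model.fit(X_train[:, cols], y_train)
    return float((model.predict(X_test[:, cols]) == y_test).mean())

score = fitness([2.0, -3.0, 1, 1, 0, 0])
```

The GA only ever sees the scalar score, so anything measurable (accuracy, MSE, profit) can be swapped in here without touching the rest of the pipeline.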
Putting it all together
So after pulling all of the parts together what we have is:
It is a Genetic Algorithm tuning a Support Vector machine which is being used as a Classifier to look at statistics about a basketball matchup and predict the result as either a Win or a Loss for the home team.
The steps we followed to get this far were:
- Acquire the dataset: This is actually one of the hardest problems in practice. For my dataset I had to beg and plead, and promise not to give it away. I have found UCI and Austin’s own infochimps to be helpful during my experiments, but obviously for anything real-world you’ll need to acquire your own elsewhere.
- Frame the problem: Framing a question correctly, in such a way that it can be answered by machine learning techniques, is a very difficult task. The majority of my failures have come from an inability to ask the correct question of my algorithms. When doing classification you must be able to frame the question so that the answer falls into one of several discrete buckets, and these buckets must be well formed or your results will be poor. If you cannot formulate distinct buckets, or you have a continuous output problem, such as, say, predicting the point spread of a given basketball game, you may instead want to look at regression solutions, perhaps even Support Vector Regression.
- Format the dataset: Data processing is a necessary pain in which you get to manipulate your input data into whatever format your SVM requires, or at least into a standardized format from which it can be easily retrieved and fed to the SVM. My advice is to take it seriously, do a good job, and then just move on. At this point I have simple utility libraries and have done it enough times that I can avoid wasting a lot of time, but when I was first getting started I could easily spend days in this phase. I currently use sqlite3 databases for small datasets or proofs of concept; for real-world big-data cases this will be an ongoing area of development.
- Choose the genome: Using the method outlined in this post, the genome is just a list of valid Tuning Parameters with their associated ranges, plus a binary flag for each possible input feature to allow for feature selection. The actual work here consists of writing a “generator” and a “bounder”: the generator produces a new random genome, and the bounder ensures that any genome stays within the valid range for each of its fields.
- Create training & testing sets: Split the dataset into a portion used for training and a portion used for evaluation. This step is important in that your training and testing sets should be representative of the data and mutually exclusive of each other. In my experimentation I have frequently reserved 20% of the available dataset for testing and used the remaining 80% for training, adjusting these numbers depending on the size of the available dataset.
- Create a fitness function: Called by the Genetic Algorithm on a candidate set of SVM Tuning Parameters, this function should train the SVM against the training set and evaluate it against the testing set, according to the specifications of the given Tuning Parameters.
- Run the GA: Run the Genetic Algorithm to produce the best Tuning Parameters for an SVM that will correctly classify our datasets, and save this as a trained SVM Model. This will be the majority of your computation time and, in my limited experience, the subject of significant experimentation.
- Consult the SVM: This is the “usage” step of the whole process. Prepare the input statistics for both teams in a chosen basketball matchup and provide them as input into the trained SVM model. Receive from the model a classification of the game as either a Win or a Loss for the home team.
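Two of the steps above, choosing the genome and splitting the dataset, reduce to small helpers. A sketch with made-up bounds (the generator/bounder split mirrors what GA libraries such as Inspyred expect, though the exact API differs):

```python
import random

# Illustrative genome: two SVM tuning parameters (log-scaled bounds)
# followed by three binary feature-selection flags.
BOUNDS = [(-5.0, 15.0), (-15.0, 3.0), (0, 1), (0, 1), (0, 1)]

def generator(rng):
    """Produce a fresh random genome with each gene inside its bounds."""
    return [rng.uniform(lo, hi) for lo, hi in BOUNDS]

def bounder(genome):
    """Clamp every gene back into its legal range after crossover/mutation."""
    return [min(max(g, lo), hi) for g, (lo, hi) in zip(genome, BOUNDS)]

def split(dataset, test_fraction=0.2, seed=0):
    """Shuffle, then reserve a fraction for testing; the rest trains.

    Shuffling first keeps both sets representative of the data;
    slicing keeps them mutually exclusive.
    """
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]    # (training, testing)

genome = generator(random.Random(42))
clamped = bounder([100.0, -99.0, 0.4, 1.7, -0.2])
train, test = split(range(100))
```

Keeping the bounds in one table means the generator, the bounder, and any later decoding code can never drift out of sync with each other.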
A few takeaways:
- Framing the question is the hard part.
- You’ll spend most of your time dealing with your dataset somehow, so take it seriously.
- Don’t expect a silver bullet. Beating an SVM to death with a GA will not get you THAT far.
For those who are interested, I would suggest starting with some of the better freely available SVM and GA libraries. Python is a great language to start with, simply because of the wealth of libraries available and its relatively high penetration into both the academic and business worlds. I personally use a toolkit I’ve created that lets me easily deploy genetic algorithms onto EC2-based clusters for quicker results. My GA library of choice is Inspyred. Currently I’m using a band-aided version of libsvm and its Python bindings, but going forward I will probably try to replace them with something more stable.
I’ve been unable to find a strong machine learning community concentrated in any one place. There seems to be traction in some communities for specific types of ML, such as the HyperNEAT community’s mailing list. I really wish I could find a forum in which practitioners and hobbyists discuss techniques openly; please let me know if such a place exists!
I hope to elaborate more on my basketball results, and the actual technology choices/packages I made along the way in a future blog post. I won’t be releasing the dataset but am working to release the actual source code.
The information in this post is just a tiny glimpse into a huge field, and it has ignored many techniques that will be almost immediately necessary for any semi-serious practitioner, such as cross-validation and regression. The field of machine learning is broad and subtle, with many extremely fascinating areas of intense research: Neuroinformatics, Cortical Algorithms, Coupled Oscillators, Central Pattern Generators, Compositional Pattern-Producing Networks, Evolutionary Computing, and undoubtedly hundreds more I’ve never even heard of. The power of biologically inspired algorithms, together with the processing power and software packages that make handling large amounts of data feasible, makes this an exciting time to be an AI nerd.
Source : Amir Elaguizy (http://www.aelag.com)
July 19, 2012 § Leave a comment
In text retrieval, full text search refers to techniques for searching a single computer-stored document or a collection in a full text database. Full text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections or bibliographical references).
In a full text search, the search engine examines all of the words in every stored document as it tries to match search criteria (e.g., words supplied by a user). Full text searching techniques became common in online bibliographic databases in the 1990s. Many web sites and application programs (such as word processing software) provide full-text search capabilities. Some web search engines, such as AltaVista, employ full text search techniques, while others index only a portion of the web pages examined by their indexing systems.
When dealing with a small number of documents it is possible for the full-text search engine to directly scan the contents of the documents with each query, a strategy called serial scanning. This is what some rudimentary tools, such as grep, do when searching.
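A serial scan is only a few lines of code. This toy sketch (document contents are made up) reads every document on every query, just like grep:

```python
def serial_scan(documents, term):
    """Grep-style serial scan: check every document's full text per query."""
    term = term.lower()
    return [name for name, text in documents.items() if term in text.lower()]

docs = {
    "a.txt": "Full text search scans everything.",
    "b.txt": "Metadata search looks at titles only.",
    "c.txt": "Indexes trade build time for query speed.",
}
hits = serial_scan(docs, "search")
```

The cost grows with the total size of the corpus on every single query, which is exactly why larger collections switch to the index-then-search approach described next.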
However, when the number of documents to search is potentially large or the quantity of search queries to perform is substantial, the problem of full text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms, often called an index, but more correctly named a concordance. In the search stage, when performing a specific query, only the index is referenced rather than the text of the original documents.
The indexer will make an entry in the index for each term or word found in a document and possibly its relative position within the document. Usually the indexer will ignore stop words, such as the English “the”, which are both too common and carry too little meaning to be useful for searching. Some indexers also employ language-specific stemming on the words being indexed, so for example any of the words “drives”, “drove”, or “driven” will be recorded in the index under a single concept word “drive”.
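A minimal indexer along those lines, with a stop-word list and a deliberately crude suffix-stripping stemmer standing in for a real one such as Porter’s:

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "in", "is"}

def stem(word):
    """Crude stand-in for real stemming: strip a few common suffixes
    so variants like 'drives' and 'driving' collapse together."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(documents):
    """Map each stemmed, non-stop term to its (document, position) postings."""
    index = defaultdict(list)
    for name, text in documents.items():
        for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            if word not in STOP_WORDS:
                index[stem(word)].append((name, pos))
    return index

index = build_index({"d1": "The driver drives the car",
                     "d2": "She drove and is driving still"})
```

Note the limits of naive stemming: “drives” and “driving” both land under “driv”, but an irregular form like “drove” does not, which is why production systems use proper language-specific stemmers or lemmatizers.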
The precision vs. recall tradeoff
Recall measures the quantity of results returned by a search and precision is the measure of the quality of the results returned. Recall is the ratio of relevant results returned divided by all relevant results. Precision is the number of relevant results returned divided by the total number of results returned.
The diagram at right represents a low-precision, low-recall search. In the diagram the red and green dots represent the total population of potential search results for a given search. Red dots represent irrelevant results, and green dots represent relevant results. Relevancy is indicated by the proximity of search results to the center of the inner circle. Of all possible results shown, those that were actually returned by the search are shown on a light-blue background. In the example only one relevant result of three possible relevant results was returned, so the recall is a very low ratio of 1/3 or 33%. The precision for the example is a very low 1/4 or 25%, since only one of the four results returned was relevant.
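The two ratios are straightforward to compute given the set of returned results and the set of relevant documents. Reproducing the diagram’s example (document names are invented):

```python
def precision_recall(returned, relevant):
    """Precision = relevant returned / all returned;
    recall = relevant returned / all relevant."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    return hits / len(returned), hits / len(relevant)

# The diagram's example: four results returned, three relevant documents
# exist in total, and only one of the returned results is relevant.
p, r = precision_recall(returned={"d1", "d2", "d3", "d4"},
                        relevant={"d1", "d5", "d6"})
```

This reproduces the 25% precision and 33% recall figures from the example above.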
Due to the ambiguities of natural language, full text search systems typically include options such as stop words to increase precision and stemming to increase recall. Controlled-vocabulary searching also helps alleviate low-precision issues by tagging documents in such a way that ambiguities are eliminated. The trade-off between precision and recall is simple: an increase in precision can lower overall recall, while an increase in recall lowers precision.
Free text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives. The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language. In the sample diagram at right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).
Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of “football”, clustering can be used to categorize the document/data universe into “American football”, “corporate football”, etc. Depending on the occurrences of words relevant to these categories, a search result can be placed in one or more of them. This technique is being extensively deployed in the e-discovery domain.
The deficiencies of free text searching have been addressed in two ways: By providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.
Improved querying tools
- Keywords. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.
- Field-restricted search. Some search engines enable users to limit free text searches to a particular field within a stored data record, such as “Title” or “Author.”
- Boolean queries. Searches that use Boolean operators (for example, “encyclopedia” AND “online” NOT “Encarta”) can dramatically increase the precision of a free text search. The AND operator says, in effect, “Do not retrieve any document unless it contains both of these terms.” The NOT operator says, in effect, “Do not retrieve any document that contains this word.” If the retrieval list retrieves too few documents, the OR operator can be used to increase recall; consider, for example, “encyclopedia” AND “online” OR “Internet” NOT “Encarta”. This search will retrieve documents about online encyclopedias that use the term “Internet” instead of “online.” This increase in precision is very commonly counter-productive since it usually comes with a dramatic loss of recall.
- Phrase search. A phrase search matches only those documents that contain a specified phrase, such as “Wikipedia, the free encyclopedia.”
- Concept search. A search that is based on multi-word concepts, for example Compound term processing. This type of search is becoming popular in many e-Discovery solutions.
- Concordance search. A concordance search produces an alphabetical list of all principal words that occur in a text with their immediate context.
- Proximity search. A proximity search matches only those documents that contain two or more words separated by a specified number of words; a search for “Wikipedia” WITHIN2 “free” would retrieve only those documents in which the words “Wikipedia” and “free” occur within two words of each other.
- Regular expression. A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.
- Fuzzy search. A fuzzy search matches documents that contain the given terms or close variations of them (using, for instance, an edit-distance threshold on the variation).
- Wildcard search. A search that substitutes one or more characters in a search query for a wildcard character such as an asterisk. For example using the asterisk in a search query “s*n” will find “sin”, “son”, “sun”, etc. in a text.
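Several of the query types above, Boolean queries in particular, reduce to set algebra over an inverted index. A toy sketch with an invented four-document index:

```python
# AND / OR / NOT as set algebra over an inverted index that maps
# each term to the set of documents containing it.
index = {
    "encyclopedia": {"d1", "d2", "d3"},
    "online":       {"d1", "d3"},
    "internet":     {"d2"},
    "encarta":      {"d3"},
}
ALL_DOCS = {"d1", "d2", "d3", "d4"}

def docs(term):
    return index.get(term, set())

def AND(a, b): return a & b            # both terms required
def OR(a, b):  return a | b            # either term suffices
def NOT(a):    return ALL_DOCS - a     # exclude documents with the term

# "encyclopedia" AND ("online" OR "internet") NOT "encarta"
result = AND(docs("encyclopedia"),
             AND(OR(docs("online"), docs("internet")),
                 NOT(docs("encarta"))))
```

Because only the postings sets are touched, the original document text is never re-read at query time, which is the whole point of the indexing stage described earlier.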
Improved search algorithms
The PageRank algorithm developed by Google gives more prominence to documents to which other Web pages have linked.
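The idea can be illustrated with a small power-iteration sketch (the link graph is invented; a real implementation handles far larger, sparser graphs):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of outbound links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outbound in links.items():
            if outbound:
                # Each page shares its damped rank across its outbound links.
                share = damping * rank[page] / len(outbound)
                for target in outbound:
                    new[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
        rank = new
    return rank

# "hub" is linked to by every other page, so it should rank highest.
rank = pagerank({"a": ["hub"], "b": ["hub"], "c": ["hub", "a"], "hub": ["a"]})
```

A page’s prominence thus comes not from its own text but from how much rank its in-linking pages have to give, which is what distinguishes link-based ranking from the purely textual methods above.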
The following is a partial list of available software products whose predominant purpose is to perform full text indexing and searching. Some of these are accompanied by detailed descriptions of their theory of operation or internal algorithms, which can provide additional insight into how full text search may be accomplished.
Free and open source software
- Apache Solr
- Clusterpoint Server (freeware licence for a single-server)
- ElasticSearch (Apache License, Version 2.0)
- Hyper Estraier
Commercial software
- Autonomy Corporation
- BA Insight
- Clusterpoint Server (cluster license)
- Concept Searching Limited
- Fast Search & Transfer
- Lucid Imagination