Chapter 1: Introduction to Collective Intelligence Content preview·Buy reprint rights for this chapter Netflix is an online DVD rental company that lets people choose movies to be sent to their homes, and makes recommendations based on the movies that customers have previously rented. In late 2006 it announced a prize of $1 million to the first person to improve the accuracy of its recommendation system by 10 percent, along with progress prizes of $50,000 to the current leader each year for as long as the contest runs. Thousands of teams from all over the world entered and, as of April 2007, the leading team has managed to score an improvement of 7 percent. By using data about which movies each customer enjoyed, Netflix is able to recommend movies to other customers that they may never have even heard of and keep them coming back for more. Any way to improve its recommendation system is worth a lot of money to Netflix. The search engine Google was started in 1998, at a time when there were already several big search engines, and many assumed that a new player would never be able to take on the giants. The founders of Google, however, took a completely new approach to ranking search results by using the links on millions of web sites to decide which pages were most relevant. Google's search results were so much better than those of the other players that by 2004 it handled 85 percent of searches on the Web. Its founders are now among the top 10 richest people in the world. What do these two companies have in common? They both drew new conclusions and created new business opportunities by using sophisticated algorithms to combine data collected from many different people. The ability to collect information and the computational power to interpret it has enabled great collaboration opportunities and a better understanding of users and customers. This sort of work is happening all over the place—dating sites want to help people find their best match more quickly, companies that predict changes in airplane ticket prices are cropping up, and just about everyone wants to understand their customers better in order to create more targeted advertising. These are just a few examples in the exciting field of collective intelligence, and the proliferation of new services means there are new opportunities appearing every day. I believe that understanding machine learning and statistical methods will become ever more important in a wide variety of fields, but particularly in interpreting and organizing the vast amount of information that is being created by people all over the world. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! What Is Collective Intelligence? Content preview·Buy reprint rights for this chapter People have used the phrase collective intelligence for decades, and it has become increasingly popular and more important with the advent of new communications technologies. Although the expression may bring to mind ideas of group consciousness or supernatural phenomena, when technologists use this phrase they usually mean the combining of behavior, preferences, or ideas of a group of people to create novel insights. Collective intelligence was, of course, possible before the Internet. You don't need the Web to collect data from disparate groups of people, combine it, and analyze it. One of the most basic forms of this is a survey or census. Collecting answers from a large group of people lets you draw statistical conclusions about the group that no individual member would have known by themselves. Building new conclusions from independent contributors is really what collective intelligence is all about. A well-known example is financial markets, where a price is not set by one individual or by a coordinated effort, but by the trading behavior of many independent people all acting in what they believe is their own best interest. Although it seems counterintuitive at first, futures markets, in which many participants trade contracts based on their beliefs about future prices, are considered to be better at predicting prices than experts who independently make projections. This is because these markets combine the knowledge, experience, and insight of thousands of people to create a projection rather than relying on a single person's persepective. Although methods for collective intelligence existed before the Internet, the ability to collect information from thousands or even millions of people on the Web has opened up many new possibilities. At all times, people are using the Internet for making purchases, doing research, seeking out entertainment, and building their own web sites. All of this behavior can be monitored and used to derive information without ever having to interrupt the user's intentions by asking him questions. There are a huge number of ways this information can be processed and interpreted. Here are a couple of key examples that show the contrasting approaches: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! What Is Machine Learning? Content preview·Buy reprint rights for this chapter Machine learning is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn. What this means, in most cases, is that an algorithm is given a set of data and infers information about the properties of the data—and that information allows it to make predictions about other data that it might see in the future. This is possible because almost all nonrandom data contains patterns, and these patterns allow the machine to generalize. In order to generalize, it trains a model with what it determines are the important aspects of the data. To understand how models come to be, consider a simple example in the otherwise complex field of email filtering. Suppose you receive a lot of spam that contains the words "online pharmacy." As a human being, you are well equipped to recognize patterns, and you quickly determine that any message with the words "online pharmacy" is spam and should be moved directly to the trash. This is a generalization—you have, in fact, created a mental model of what is spam. After you report several of these messages as spam, a machine-learning algorithm designed to filter spam should be able to make the same generalization. There are many different machine-learning algorithms, all with different strengths and suited to different types of problems. Some, such as decision trees, are transparent, so that an observer can totally understand the reasoning process undertaken by the machine. Others, such as neural networks, are black box, meaning that they produce an answer, but it's often very difficult to reproduce the reasoning behind it. Many machine-learning algorithms rely heavily on mathematics and statistics. According to the definition I gave earlier, you could even say that simple correlation analysis and regression are both basic forms of machine learning. This book does not assume that the reader has a lot of knowledge of statistics, so I have tried to explain the statistics used in as straightforward a manner as possible. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Limits of Machine Learning Content preview·Buy reprint rights for this chapter Machine learning is not without its weaknesses. The algorithms vary in their ability to generalize over large sets of patterns, and a pattern that is unlike any seen by the algorithm before is quite likely to be misinterpreted. While humans have a vast amount of cultural knowledge and experience to draw upon, as well as a remarkable ability to recognize similar situations when making decisions about new information, machine-learning methods can only generalize based on the data that has already been seen, and even then in a very limited manner. The spam-filtering method you'll see in this book is based on the appearance of words or phrases without any regard to what they mean or to sentence structures. Although it's theoretically possible to build an algorithm that would take grammar into account, this is rarely done in practice because the effort required would be disproportionately large compared to the improvement in the algorithm. Understanding the meaning of words or their relevance to a person's life would require far more information than spam filters, in their current incarnation, can access. In addition, although they vary in their propensity for doing so, all machine-learning methods suffer from the possibility of overgeneralizing. As with most things in life, strong generalizations based on a few examples are rarely entirely accurate. It's certainly possible that you could receive an important email message from a friend that contains the words "online pharmacy." In this case, you would tell the algorithm that the message is not spam, and it might infer that messages from that particular friend are acceptable. The nature of many machine-learning algorithms is that they can continue to learn as new information arrives. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Real-Life Examples Content preview·Buy reprint rights for this chapter There are many sites on the Internet currently collecting data from many different people and using machine learning and statistical methods to benefit from it. Google is likely the largest effort—it not only uses web links to rank pages, but it constantly gathers information on when advertisements are clicked by different users, which allows Google to target the advertising more effectively. In you'll learn about search engines and the PageRank algorithm, an important part of Google's ranking system. Other examples include web sites with recommendation systems. Sites like Amazon and Netflix use information about the things people buy or rent to determine which people or items are similar to one another, and then make recommendations based on purchase history. Other sites like Pandora and Last.fm use your ratings of different bands and songs to create custom radio stations with music they think you will enjoy. covers ways to build recommendation systems. Prediction markets are also a form of collective intelligence. One of the most well known of these is the Hollywood Stock Exchange (), where people trade stocks on movies and movie stars. You can buy or sell a stock at the current price knowing that its ultimate value will be one millionth of the movie's actual opening box office number. Because the price is set by trading behavior, the value is not chosen by any one individual but by the behavior of the group, and the current price can be seen as the whole group's prediction of box office numbers for the movie. The predictions made by the Hollywood Stock Exchange are routinely better than those made by individual experts. Some dating sites, such as eHarmony, use information collected from participants to determine who would be a good match. Although these companies tend to keep their methods for matching people secret, it is quite likely that any successful approach would involve a constant reevaluation based on whether the chosen matches were successful or not. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Other Uses for Learning Algorithms Content preview·Buy reprint rights for this chapter The methods described in this book are not new, and although the examples focus on Internet-based collective intelligence problems, knowledge of machine-learning algorithms can be helpful for software developers in many other fields. They are particularly useful in areas that deal with large datasets that can be searched for interesting patterns, for example: Biotechnology Advances in sequencing and screening technology have created massive datasets of many different kinds, such as DNA sequences, protein structures, compound screens, and RNA expression. Machine-learning techniques are applied extensively to all of these kinds of data in an effort to find patterns that can increase understanding of biological processes. Financial fraud detection Credit card companies are constantly searching for new ways to detect if transactions are fraudulent. To this end, they have employed such techniques as neural networks and inductive logic to verify transactions and catch improper usage. Machine vision Interpreting images from a video camera for military or surveillance purposes is an active area of research. Many machine-learning techniques are used to try to automatically detect intruders, identify vehicles, or recognize faces. Particularly interesting is the use of unsupervised techniques like independent component analysis, which finds interesting features in large datasets. Product marketing For a very long time, understanding demographics and trends was more of an art form than a science. Recently, the increased ability to collect data from consumers has opened up opportunities for machine-learning techniques such as clustering to better understand the natural divisions that exist in markets and to make better predictions about future trends. Supply chain optimization Large organizations can save millions of dollars by having their supply chains run effectively and accurately predict demand for products in different areas. The number of ways in which a supply chain can be constructed is massive, as is the number of factors that can potentially affect demand. Optimization and learning techniques are frequently used to analyze these datasets. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 2: Making Recommendations Content preview·Buy reprint rights for this chapter To begin the tour of collective intelligence, I'm going to show you ways to use the preferences of a group of people to make recommendations to other people. There are many applications for this type of information, such as making product recommendations for online shopping, suggesting interesting web sites, or helping people find music and movies. This chapter shows you how to build a system for finding people who share tastes and for making automatic recommendations based on things that other people like. You've probably come across recommendation engines before when using an online shopping site like Amazon. Amazon tracks the purchasing habits of all its shoppers, and when you log onto the site, it uses this information to suggest products you might like. Amazon can even suggest movies you might like, even if you've only bought books from it before. Some online concert ticket agencies will look at the history of shows you've seen before and alert you to upcoming shows that might be of interest. Sites like reddit.com let you vote on links to other web sites and then use your votes to suggest other links you might find interesting. From these examples, you can see that preferences can be collected in many different ways. Sometimes the data are items that people have purchased, and opinions about these items might be represented as yes/no votes or as ratings from one to five. In this chapter, we'll look at different ways of representing these cases so that they'll all work with the same set of algorithms, and we'll create working examples with movie critic scores and social bookmarking. You know that the low-tech way to get recommendations for products, movies, or entertaining web sites is to ask your friends. You also know that some of your friends have better "taste" than others, something you've learned over time by observing whether they usually like the same things as you. As more and more options become available, it becomes less practical to decide what you want by asking a small group of people, since they may not be aware of all the options. This is why a set of techniques called Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Collaborative Filtering Content preview·Buy reprint rights for this chapter You know that the low-tech way to get recommendations for products, movies, or entertaining web sites is to ask your friends. You also know that some of your friends have better "taste" than others, something you've learned over time by observing whether they usually like the same things as you. As more and more options become available, it becomes less practical to decide what you want by asking a small group of people, since they may not be aware of all the options. This is why a set of techniques called collaborative filtering was developed. A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to yours. It looks at other things they like and combines them to create a ranked list of suggestions. There are several different ways of deciding which people are similar and combining their choices to make a list; this chapter will cover a few of these. The term collaborative filtering was first used by David Goldberg at Xerox PARC in 1992 in a paper called "Using collaborative filtering to weave an information tapestry." He designed a system called Tapestry that allowed people to annotate documents as either interesting or uninteresting and used this information to filter documents for other people. There are now hundreds of web sites that employ some sort of collaborative filtering algorithm for movies, music, books, dating, shopping, other web sites, podcasts, articles, and even jokes. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Collecting Preferences Content preview·Buy reprint rights for this chapter The first thing you need is a way to represent different people and their preferences. In Python, a very simple way to do this is to use a nested dictionary. If you'd like to work through the example in this section, create a file called recommendations.py, and insert the following code to create the dataset: # A dictionary of movie critics and their ratings of a small # set of movies critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}, 'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5}, 'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0, 'Superman Returns': 3.5, 'The Night Listener': 4.0}, 'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'The Night Listener': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 2.5}, 'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 2.0}, 'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5}, 'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}} You will be working with Python interactively in this chapter, so you should save recommendations.py somewhere where the Python interactive interpreter can find it. This could be in the python/Lib directory, but the easiest way to do it is to start the Python interpreter in the same directory in which you saved the file. This dictionary uses a ranking from 1 to 5 as a way to express how much each of these movie critics (and I) liked a given movie. No matter how preferences are expressed, you need a way to map them onto numerical values. If you were building a shopping site, you might use a value of 1 to indicate that someone had bought an item in the past and a value of 0 to indicate that they had not. For a site where people vote on news stories, values of −1, 0, and 1 could be used to represent "disliked," "didn't vote," and "liked," as shown in . Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Finding Similar Users Content preview·Buy reprint rights for this chapter After collecting data about the things people like, you need a way to determine how similar people are in their tastes. You do this by comparing each person with every other person and calculating a similarity score. There are a few ways to do this, and in this section I'll show you two systems for calculating similarity scores: Euclidean distance and Pearson correlation. One very simple way to calculate a similarity score is to use a Euclidean distance score, which takes the items that people have ranked in common and uses them as axes for a chart. You can then plot the people on the chart and see how close together they are, as shown in . Figure : People in preference space This figure shows the people charted in preference space. Toby has been plotted at 4.5 on the Snakes axis and at 1.0 on the Dupree axis. The closer two people are in the preference space, the more similar their preferences are. Because the chart is two-dimensional, you can only look at two rankings at a time, but the principle is the same for bigger sets of rankings. To calculate the distance between Toby and LaSalle in the chart, take the difference in each axis, square them and add them together, then take the square root of the sum. In Python, you can use the pow(n,2) function to square a number and take the square root with the sqrt function: >> from math import sqrt >> sqrt(pow(5−4,2)+pow(4−1,2)) 3.1622776601683795 This formula calculates the distance, which will be smaller for people who are more similar. However, you need a function that gives higher values for people who are similar. This can be done by adding 1 to the function (so you don't get a division-by-zero error) and inverting it: >> 1/(1+sqrt(pow(5−4,2)+pow(4−1,2))) 0.2402530733520421 This new function always returns a value between 0 and 1, where a value of 1 means that two people have identical preferences. You can put everything together to create a function for calculating similarity. Add the following code to recommendations.py: from math import sqrt # Returns a distance-based similarity score for person1 and person2 def sim_distance(prefs,person1,person2): # Get the list of shared_items si={} for item in prefs[person1]: if item in prefs[person2]: si[item]=1 # if they have no ratings in common, return 0 if len(si)==0: return 0 # Add up the squares of all the differences sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in prefs[person1] if item in prefs[person2]]) return 1/(1+sum_of_squares) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Recommending Items Content preview·Buy reprint rights for this chapter Finding a good critic to read is great, but what I really want is a movie recommendation right now. I could just look at the person who has tastes most similar to mine and look for a movie he likes that I haven't seen yet, but that would be too permissive. Such an approach could accidentally turn up reviewers who haven't reviewed some of the movies that I might like. It could also return a reviewer who strangely liked a movie that got bad reviews from all the other critics returned by topMatches. To solve these issues, you need to score the items by producing a weighted score that ranks the critics. Take the votes of all the other critics and multiply how similar they are to me by the score they gave each movie. shows how this process works. Table : Creating recommendations for Toby Critic Similarity Night S.xNight Lady S.xLady Luck S.xLuck Rose 0.99 3.0 2.97 2.5 2.48 3.0 2.97 Seymour 0.38 3.0 1.14 3.0 1.14 1.5 0.57 Puig 0.89 4.5 4.02 3.0 2.68 LaSalle 0.92 3.0 2.77 3.0 2.77 2.0 1.85 Matthews 0.66 3.0 1.99 3.0 1.99 Total 12.89 8.38 8.07 Sim.Sum 3.84 2.95 3.18 Total/Sim. Sum 3.35 2.83 2.53 This table shows correlation scores for each critic and the ratings they gave the three movies (The Night Listener, Lady in the Water, and Just My Luck) that I haven't rated. The columns beginning with S.x give the similarity multiplied by the rating, so a person who is similar to me will contribute more to the overall score than a person who is different from me. The Total row shows the sum of all these numbers. You could just use the totals to calculate the rankings, but then a movie reviewed by more people would have a big advantage. To correct for this, you need to divide by the sum of all the similarities for critics that reviewed that movie (the Sim. Sum row in the table). Because The Night Listener was reviewed by everyone, its total is divided by the sum of all the similarities. Lady in the Water, however, was not reviewed by Puig, so the movie's score is divided by the sum of all the other similarities. The last row shows the results of this division. The code for this is pretty straightforward, and it works with either the Euclidean distance or the Pearson correlation score. Add it to Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Matching Products Content preview·Buy reprint rights for this chapter Now you know how to find similar people and recommend products for a given person, but what if you want to see which products are similar to each other? You may have encountered this on shopping web sites, particularly when the site hasn't collected a lot of information about you. A section of Amazon's web page for the book Programming Python is shown in . Figure : Amazon shows products that are similar to Programming Python In this case, you can determine similarity by looking at who liked a particular item and seeing the other things they liked. This is actually the same method we used earlier to determine similarity between people—you just need to swap the people and the items. So you can use the same methods you wrote earlier if you transform the dictionary from: {'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5}, 'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5}} to: {'Lady in the Water':{'Lisa Rose':2.5,'Gene Seymour':3.0}, 'Snakes on a Plane':{'Lisa Rose':3.5,'Gene Seymour':3.5}} etc.. Add a function to recommendations.py to do this transformation: def transformPrefs(prefs): result={} for person in prefs: for item in prefs[person]: result.setdefault(item,{}) # Flip item and person result[item][person]=prefs[person][item] return result And now call the topMatches function used earlier to find the set of movies most similar to Superman Returns: >> reload(recommendations) >> movies=recommendations.transformPrefs(recommendations.critics) >> recommendations.topMatches(movies,'Superman Returns') [(0.657, 'You, Me and Dupree'), (0.487, 'Lady in the Water'), (0.111, 'Snakes on a Plane'), (−0.179, 'The Night Listener'), (−0.422, 'Just My Luck')] Notice that in this example there are actually some negative correlation scores, which indicate that those who like Superman Returns tend to dislike Just My Luck, as shown in . Figure : Superman Returns and Just My Luck have a negative correlation To twist things around even more, you can get recommended critics for a movie. Maybe you're trying to decide whom to invite to a premiere? Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Building a del.icio.us Link Recommender Content preview·Buy reprint rights for this chapter This section shows you how to retrieve data from one of the most popular online bookmarking sites, and how to use that data to find similar users and recommend links they haven't seen before. This site, which you can access at , allows people to set up an account and post links that interest them for later reference. You can visit the site and look at links that other people have posted, and also browse "popular" links that have been posted by many different people. A sample page from del.icio.us is shown in . Figure : The del.icio.us popular page for programming Unlike some link-sharing sites, del.icio.us doesn't (at the time of writing) include any way to find similar people or recommend links you might like. Fortunately, you can use the techniques discussed in this chapter to add that functionality yourself. Data from del.icio.us is made available through an API that returns data in XML format. To make things even easier for you, there is a Python API that you can download from or . To work through the example in this section, you'll need to download the latest version of this library and put it in your Python library path. (See for more information on installing this library.) This library has several simple calls to get links that people have submitted. For example, to get a list of recent popular posts about programming, you can use the get_popular call: >> import pydelicious >> pydelicious.get_popular(tag='programming') [{'count': '', 'extended': '', 'hash': '', 'description': u'How To Write Unmaintainable Code', 'tags': '', 'href': u'http://thc.segfault.net/root/phun/ unmaintain.html', 'user': u'dorsia', 'dt': u'2006-08-19T09:48:56Z'}, {'count': '', 'extended': '', 'hash': '', 'description': u'Threading in C#', 'tags': '', 'href': u'http://www.albahari.com/threading/', 'user': u'mmihale', 'dt': u'2006-05-17T18: 09:24Z'}, ...etc... You can see that it returns a list of dictionaries, each one containing a URL, description, and the user who posted it. Since you are working from live data, your results will look different from the examples. There are two other calls you'll be using, Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Item-Based Filtering Content preview·Buy reprint rights for this chapter The way the recommendation engine has been implemented so far requires the use of all the rankings from every user in order to create a dataset. This will probably work well for a few thousand people or items, but a very large site like Amazon has millions of customers and products—comparing a user with every other user and then comparing every product each user has rated can be very slow. Also, a site that sells millions of products may have very little overlap between people, which can make it difficult to decide which people are similar. The technique we have used thus far is called user-based collaborative filtering. An alternative is known as item-based collaborative filtering. In cases with very large datasets, item-based collaborative filtering can give better results, and it allows many of the calculations to be performed in advance so that a user needing recommendations can get them more quickly. The procedure for item-based filtering draws a lot on what we have already discussed. The general technique is to precompute the most similar items for each item. Then, when you wish to make recommendations to a user, you look at his top-rated items and create a weighted list of the items most similar to those. The important difference here is that, although the first step requires you to examine all the data, comparisons between items will not change as often as comparisons between users. This means you do not have to continuously calculate each item's most similar items—you can do it at low-traffic times or on a computer separate from your main application. To compare items, the first thing you'll need to do is write a function to build the complete dataset of similar items. Again, this does not have to be done every time a recommendation is needed—instead, you build the dataset once and reuse it each time you need it. To generate the dataset, add the following function to recommendations.py: def calculateSimilarItems(prefs,n=10): # Create a dictionary of items showing which other items they # are most similar to. result={} # Invert the preference matrix to be item-centric itemPrefs=transformPrefs(prefs) c=0 for item in itemPrefs: # Status updates for large datasets c+=1 if c%100==0: print "%d / %d" % (c,len(itemPrefs)) # Find the most similar items to this one scores=topMatches(itemPrefs,item,n=n,similarity=sim_distance) result[item]=scores return result Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Using the MovieLens Dataset Content preview·Buy reprint rights for this chapter For the final example, let's look at a real dataset of movie ratings called MovieLens. MovieLens was developed by the GroupLens project at the University of Minnesota. You can download the dataset from . There are two datasets here. Download the 100,000 dataset in either tar.gz format or zip format, depending on your platform. The archive contains several files, but the ones of interest are u.item, which contains a list of movie IDs and titles, and u.data, which contains actual ratings in this format: 196 242 3 881250949 186 302 3 891717742 22 377 1 878887116 244 51 2 880606923 166 346 1 886397596 298 474 4 884182806 Each line has a user ID, a movie ID, the rating given to the movie by the user, and a timestamp. You can get the movie titles, but the user data is anonymous, so you'll just be working with user IDs in this section. The set contains ratings of 1,682 movies by 943 users, each of whom rated at least 20 movies. Create a new method called loadMovieLens in recommendations.py to load this dataset: def loadMovieLens(path='/data/movielens'): # Get movie titles movies={} for line in open(path+'/u.item'): (id,title)=line.split('|')[0:2] movies[id]=title # Load data prefs={} for line in open(path+'/u.data'): (user,movieid,rating,ts)=line.split('\t') prefs.setdefault(user,{}) prefs[user][movies[movieid]]=float(rating) return prefs In your Python session, load the data and look at some ratings for any arbitrary user: >>> reload(recommendations) >>> prefs=recommendations.loadMovieLens( ) >>> prefs['87'] {'Birdcage, The (1996)': 4.0, 'E.T. the Extra-Terrestrial (1982)': 3.0, 'Bananas (1971)': 5.0, 'Sting, The (1973)': 5.0, 'Bad Boys (1995)': 4.0, 'In the Line of Fire (1993)': 5.0, 'Star Trek: The Wrath of Khan (1982)': 5.0, 'Speechless (1994)': 4.0, etc... Now you can get user-based recommendations: >>> recommendations.getRecommendations(prefs,'87')[0:30] [(5.0, 'They Made Me a Criminal (1939)'), (5.0, 'Star Kid (1997)'), (5.0, 'Santa with Muscles (1996)'), (5.0, 'Saint of Fort Washington (1993)'), etc...] Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! User-Based or Item-Based Filtering? Content preview·Buy reprint rights for this chapter Item-based filtering is significantly faster than user-based when getting a list of recommendations for a large dataset, but it does have the additional overhead of maintaining the item similarity table. Also, there is a difference in accuracy that depends on how "sparse" the dataset is. In the movie example, since every critic has rated nearly every movie, the dataset is dense (not sparse). On the other hand, it would be unlikely to find two people with the same set of del.icio.us bookmarks—most bookmarks are saved by a small group of people, leading to a sparse dataset. Item-based filtering usually outperforms user-based filtering in sparse datasets, and the two perform about equally in dense datasets. To learn more about the difference in performance between these algorithms, check out a paper called "Item-based Collaborative Filtering Recommendation Algorithms" by Sarwar et al. at . Having said that, user-based filtering is simpler to implement and doesn't have the extra steps, so it is often more appropriate with smaller in-memory datasets that change very frequently. Finally, in some applications, showing people which other users have preferences similar to their own has its own value—maybe not something you would want to do on a shopping site, but possibly on a link-sharing or music recommendation site. You've now learned how to calculate similarity scores and how to use these to compare people and items. This chapter covered two different recommendation algorithms, user-based and item-based, along with ways to persist people's preferences and use the del.icio.us API to build a link recommendation system. In , you'll see how to build on some of the ideas from this chapter by finding groups of similar people using unsupervised clustering algorithms. will look at alternative ways to match people when you already know the sort of people they like. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Exercises Content preview·Buy reprint rights for this chapter 1. Tanimoto score. Find out what a Tanimoto similarity score is. In what cases could this be used as the similarity metric instead of Euclidean distance or Pearson coefficient? Create a new similarity function using the Tanimoto score. 2. Tag similarity. Using the del.icio.us API, create a dataset of tags and items. Use this to calculate similarity between tags and see if you can find any that are almost identical. Find some items that could have been tagged "programming" but were not. 3. User-based efficiency. The user-based filtering algorithm is inefficient because it compares a user to all other users every time a recommendation is needed. Write a function to precompute user similarities, and alter the recommendation code to use only the top five other users to get recommendations. 4. Item-based bookmark filtering. Download a set of data from del.icio.us and add it to the database. Create an item-item table and use this to make item-based recommendations for various users. How do these compare to the user-based recommendations? 5. Audioscrobbler. Take a look at , a dataset containing music preferences for a large set of users. Use their web services API to get a set of data for making and building a music recommendation system. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 3: Discovering Groups Content preview·Buy reprint rights for this chapter discussed ways to find things that are closely related, so, for example, you could find someone who shares your taste in movies. This chapter expands on those ideas and introduces data clustering, a method for discovering and visualizing groups of things, people, or ideas that are all closely related. In this chapter, you'll learn: how to prepare data from a variety of sources; two different clustering algorithms; more on distance metrics; simple graphical visualization code for viewing the generated groups; and finally, a method for projecting very complicated datasets into two dimensions. Clustering is used frequently in data-intensive applications. Retailers who track customer purchases can use this information to automatically detect groups of customers with similar buying patterns, in addition to regular demographic information. People of similar age and income may have vastly different styles of dress, but with the use of clustering, "fashion islands" can be discovered and used to develop a retail or marketing strategy. Clustering is also heavily used in computational biology to find groups of genes that exhibit similar behavior, which might indicate that they respond to a treatment in the same way or are part of the same biological pathway. Since this book is about collective intelligence, the examples in this chapter come from sources in which many people contribute different information. The first example will look at blogs, the topics they discuss, and their particular word usage to show that blogs can be grouped according to their text and that words can be grouped by their usage. The second example will look at a community site where people list things they own and things they would like to own, and we will use this information to show how people's desires can be grouped into clusters. Techniques that use example inputs and outputs to learn how to make predictions are known as supervised learning methods. We'll explore many supervised learning methods in this book, including neural networks, decision trees, support-vector machines, and Bayesian filtering. Applications using these methods "learn" by examining a set of inputs and expected outputs. When we want to extract information using one of these methods, we enter a set of inputs and expect the application to produce an output based on what it has learned so far. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Supervised versus Unsupervised Learning Content preview·Buy reprint rights for this chapter Techniques that use example inputs and outputs to learn how to make predictions are known as supervised learning methods. We'll explore many supervised learning methods in this book, including neural networks, decision trees, support-vector machines, and Bayesian filtering. Applications using these methods "learn" by examining a set of inputs and expected outputs. When we want to extract information using one of these methods, we enter a set of inputs and expect the application to produce an output based on what it has learned so far. Clustering is an example of unsupervised learning. Unlike a neural network or a decision tree, unsupervised learning algorithms are not trained with examples of correct answers. Their purpose is to find structure within a set of data where no one piece of data is the answer. In the fashion example given earlier, the clusters don't tell the retailers what an individual is likely to buy, nor do they make predictions about which fashion island a new person fits into. The goal of clustering algorithms is to take the data and find the distinct groups that exist within it. Other examples of unsupervised learning include non-negative matrix factorization, which will be discussed in , and self-organizing maps. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Word Vectors Content preview·Buy reprint rights for this chapter The normal way of preparing data for clustering is to determine a common set of numerical attributes that can be used to compare the items. This is very similar to what was shown in , when critics' rankings were compared over a common set of movies, and when the presence or absence of a bookmark was translated to a 1 or a 0 for del.icio.us users. This chapter will work through a couple of example datasets. In the first dataset, the items that will be clustered are a set of 120 of the top blogs, and the data they'll be clustered on is the number of times a particular set of words appears in each blog's feed. A small subset of what this looks like is shown in . Table : Subset of blog word frequencies "china" "kids" "music" "yahoo" Gothamist 0 3 3 0 GigaOM 6 0 0 2 Quick Online Tips 0 2 2 22 By clustering blogs based on word frequencies, it might be possible to determine if there are groups of blogs that frequently write about similar subjects or write in similar styles. Such a result could be very useful in searching, cataloging, and discovering the huge number of blogs that are currently online. To generate this dataset, you'll be downloading the feeds from a set of blogs, extracting the text from the entries, and creating a table of word frequencies. If you'd like to skip the steps for creating the dataset, you can download it from . Almost all blogs can be read online or via their RSS feeds. An RSS feed is a simple XML document that contains information about the blog and all the entries. The first step in generating word counts for each blog is to parse these feeds. Fortunately, there is an excellent module for doing this called Universal Feed Parser, which you can download from . This module makes it easy to get the title, links, and entries from any RSS or Atom feed. The next step is to create a function that will extract all the words from a feed. Create a new file called generatefeedvector.py and insert the following code: import feedparser import re # Returns title and dictionary of word counts for an RSS feed def getwordcounts(url): # Parse the feed d=feedparser.parse(url) wc={} # Loop over all the entries for e in d.entries: if 'summary' in e: summary=e.summary else: summary=e.description # Extract a list of words words=getwords(e.title+' '+summary) for word in words: wc.setdefault(word,0) wc[word]+=1 return d.feed.title,wc Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Hierarchical Clustering Content preview·Buy reprint rights for this chapter Hierarchical clustering builds up a hierarchy of groups by continuously merging the two most similar groups. Each of these groups starts as a single item, in this case an individual blog. In each iteration this method calculates the distances between every pair of groups, and the closest ones are merged together to form a new group. This is repeated until there is only one group. shows this process. Figure : Hierarchical clustering in action In the figure, the similarity of the items is represented by their relative locations—the closer two items are, the more similar they are. At first, the groups are just individual items. In the second step, you can see that A and B, the two items closest together, have merged to form a new group whose location is halfway between the two. In the third step, this new group is merged with C. Since D and E are now the two closest items, they form a new group. The final step unifies the two remaining groups. After hierarchical clustering is completed, you usually view the results in a type of graph called a dendrogram, which displays the nodes arranged into their hierarchy. The dendrogram for the example above is shown in . Figure : A dendrogram is a visualization of hierarchical clustering This dendrogram not only uses connections to show which items ended up in each cluster, it also uses the distance to show how far apart the items were. The AB cluster is a lot closer to the individual A and B items than the DE cluster is to the individual D and E items. Rendering the graph this way can help you determine how similar the items within a cluster are, which could be interpreted as the tightness of the cluster. This section will show you how to cluster the blogs dataset to generate a hierarchy of blogs, which, if successful, will group them thematically. First, you'll need a method to load in the data file. Create a file called clusters.py and add this function to it: def readfile(filename): lines=[line for line in file(filename)] # First line is the column titles colnames=lines[0].strip( ).split('\t')[1:] rownames=[] data=[] for line in lines[1:]: p=line.strip( ).split('\t') # First column in each row is the rowname rownames.append(p[0]) # The data for this row is the remainder of the row data.append([float(x) for x in p[1:]]) return rownames,colnames,data Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Drawing the Dendrogram Content preview·Buy reprint rights for this chapter You can interpret the clusters more clearly by viewing them as a dendrogram. Hierarchical clustering results are usually viewed this way, since dendrograms pack a lot of information into a relatively small space. Since the dendrograms will be graphical and saved as JPGs, you'll need to download the Python Imaging Library (PIL), which is available at . This library comes with an installer for Windows and source distributions for other platforms. More information on downloading and installing the PIL is available in . The PIL makes it very easy to generate images with text and lines, which is all you'll really need to construct a dendrogram. Add the import statement to the beginning of clusters.py: from PIL import Image,ImageDraw The first step is to use a function that returns the total height of a given cluster. When determining the overall height of the image, and where to put the various nodes, it's necessary to know their total heights. If this cluster is an endpoint (i.e., it has no branches), then its height is 1; otherwise, its height is the sum of the heights of its branches. This is easily defined as a recursive function, which you can add to clusters.py: def getheight(clust): # Is this an endpoint? Then the height is just 1 if clust.left==None and clust.right==None: return 1 # Otherwise the height is the same of the heights of # each branch return getheight(clust.left)+getheight(clust.right) The other thing you need to know is the total error of the root node. Since the length of the lines will be scaled to how much error is in each node, you'll be generating a scaling factor based on how much total error there is. The error depth of a node is just the maximum possible error from each of its branches: def getdepth(clust): # The distance of an endpoint is 0.0 if clust.left==None and clust.right==None: return 0 # The distance of a branch is the greater of its two sides # plus its own distance return max(getdepth(clust.left),getdepth(clust.right))+clust.distance The drawdendrogram function creates a new image allowing 20 pixels in height and a fixed width for each final cluster. The scaling factor is determined by dividing the fixed width by the total depth. The function creates a draw object for this image and then calls Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Column Clustering Content preview·Buy reprint rights for this chapter It's often necessary to cluster on both the rows and the columns. In a marketing study, it can be interesting to group people to find demographics and products, or perhaps to determine shelf locations of items that are commonly bought together. In the blog dataset, the columns represent words, and it's potentially interesting to see which words are commonly used together. The easiest way to do this using the functions you've written thus far is to rotate the entire dataset so that the columns (the words) become rows, each with a list of numbers indicating how many times that particular word appears in each of the blogs. Add this function to clusters.py: def rotatematrix(data): newdata=[] for i in range(len(data[0])): newrow=[data[j][i] for j in range(len(data))] newdata.append(newrow) return newdata You can now rotate the matrix and run the same operations for clustering and drawing the dendrogram. As there are many more words than blogs, this will take longer than running the blog clustering. Remember that since the matrix has been rotated, the words rather than the blogs are now the labels. >> reload(clusters) >> rdata=clusters.rotatematrix(data) >> wordclust=clusters.hcluster(rdata) >> clusters.drawdendrogram(wordclust,labels=words,jpeg='wordclust.jpg') One important thing to realize about clustering is that if you have many more items than variables, the likelihood of nonsensical clusters increases. There are many more words than there are blogs, so you'll notice more reasonable patterns in the blog clustering than in the word clustering. However, some interesting clusters definitely emerge, as shown in . Figure : Word cluster showing online-service-related words The cluster obviously shows that a set of words is often used together in blogs to discuss online services or Internet-related topics. It's possible to find clusters elsewhere that reflect usage patterns, such as "fact," "us," "say," "very," and "think," which indicate that a blog writes in an opinionated style. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! K-Means Clustering Content preview·Buy reprint rights for this chapter Hierarchical clustering gives a nice tree as a result, but it has a couple of disadvantages. The tree view doesn't really break the data into distinct groups without additional work, and the algorithm is extremely computationally intensive. Because the relationship between every pair of items must be calculated and then recalculated when items are merged, the algorithm will run slowly on very large datasets. An alternative method of clustering is K-means clustering. This type of algorithm is quite different from hierarchical clustering because it is told in advance how many distinct clusters to generate. The algorithm will determine the size of the clusters based on the structure of the data. K-means clustering begins with k randomly placed centroids (points in space that represent the center of the cluster), and assigns every item to the nearest one. After the assignment, the centroids are moved to the average location of all the nodes assigned to them, and the assignments are redone. This process repeats until the assignments stop changing. shows this process in action for five items and two clusters. Figure : K-means clustering with two clusters In the first frame, the two centroids (shown as dark circles) are placed randomly. Frame 2 shows that each of the items is assigned to the nearest centroid—in this case, A and B are assigned to the top centroid and C, D, and E are assigned to the bottom centroid. In the third frame, each centroid has been moved to the average location of the items that were assigned to it. When the assignments are calculated again, it turns out that C is now closer to the top centroid, while D and E remain closest to the bottom one. Thus, the final result is reached with A, B, and C in one cluster, and D and E in the other. The function for doing K-means clustering takes the same data rows as input as does the hierarchical clustering algorithm, along with the number of clusters (k) that the caller would like returned. Add this code to clusters.py: import random def kcluster(rows,distance=pearson,k=4): # Determine the minimum and maximum values for each point ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) for i in range(len(rows[0]))] # Create k randomly placed centroids clusters=[[random.random( )*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)] lastmatches=None for t in range(100): print 'Iteration %d' % t bestmatches=[[] for i in range(k)] # Find which centroid is the closest for each row for j in range(len(rows)): row=rows[j] bestmatch=0 for i in range(k): d=distance(clusters[i],row) if d0: for rowid in bestmatches[i]: for m in range(len(rows[rowid])): avgs[m]+=rows[rowid][m] for j in range(len(avgs)): avgs[j]/=len(bestmatches[i]) clusters[i]=avgs return bestmatches Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Clusters of Preferences Content preview·Buy reprint rights for this chapter One of the best things about the growing interest in social networking sites is that big sets of data are becoming available, all contributed voluntarily by people. One such site is called Zebo (), which encourages people to create accounts and make lists of things that they own and things that they would like to own. From an advertiser's or social critic's perspective, this is very interesting information, as it can allow them to determine the way that expressed preferences naturally group together. This section will go through the process of creating a dataset from the Zebo web site. It involves downloading many pages from the site and parsing them to extract what each user says they want. If you would like to skip this section, you can download a precreated dataset from . Beautiful Soup is an excellent library for parsing a web page and building a structured representation. It allows you to access any element of the page by type, ID, or any of its properties, and get a string representation of its contents. Beautiful Soup is also very tolerant of web pages with broken HTML, which is useful when generating datasets from web sites. You can download Beautiful Soup from . It comes as a single Python file, which you can put in your Python library path or in the path where you'll be working and starting the Python interpreter. Once you've installed Beautiful Soup, you can see it in action in your interpreter: >> import urllib2 >> from BeautifulSoup import BeautifulSoup >> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html') >> soup=BeautifulSoup(c.read( )) >> links=soup('a') >> links[10] algorithms >> links[10]['href'] u'/wiki/Algorithm.html' To construct a soup, which is Beautiful Soup's way of representing a web page, just initialize it with the contents of the page. You can call the soup with a tag type, such as a, and it will return a list of objects with that type. Each of these is also addressable, allowing you to drill down into properties and other objects beneath them in the hierarchy. The structure of the search page on Zebo is fairly complex, but it's easy to determine which parts of the page are the lists of items because they all have the class Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Viewing Data in Two Dimensions Content preview·Buy reprint rights for this chapter The clustering algorithms in this chapter have been demonstrated using a stylized visualization of data in two dimensions, with the difference between the various items indicated by how far apart they are in the diagram. Since most real-life examples of items you would want to cluster have more than two numbers, you can't just take the data as-is and plot it in two dimensions. However, to understand the relationship between the various items, it would be very useful to see them charted on a page with closer distances indicating similarity. This section will introduce a technique called multidimensional scaling, which will be used to find a two-dimensional representation of the dataset. The algorithm takes the difference between every pair of items and tries to make a chart in which the distances between the items match those differences. To do this, the algorithm first calculates the target distances between all the items. In the blog dataset, Pearson correlation was used to compare the items. An example of this is shown in . Table : Sample distance matrix A B C D A 0.0 0.2 0.8 0.7 B 0.2 0.0 0.9 0.8 C 0.8 0.9 0.0 0.1 D 0.7 0.8 0.1 0.0 Next, all the items (blogs, in this case) are placed randomly on the two-dimensional chart, as shown in . Figure : Starting locations of the 2D projection The current distances between all the items are calculated using the actual distance (the sum of the differences of the squares), as shown in . Figure : Distances between items For every pair of items, the target distance is compared to the current distance and an error term is calculated. Every item is moved a small amount closer or further in proportion to the error between the two items. shows the forces acting on item A. The distance between A and B in the chart is 0.5, but the target distance is only 0.2, so A has to be moved closer to B. At the same time, A is also being pushed away by C and D because it is too close. Figure : Forces acting on item A Every node is moved according to the combination of all the other nodes pushing and pulling on it. Each time this happens, the difference between the current distances and the target distances gets a bit smaller. This procedure is repeated many times until the total amount of error cannot be reduced by moving the items any more. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Other Things to Cluster Content preview·Buy reprint rights for this chapter This chapter has looked at two datasets, but there are many other things that can be done. The del.icio.us dataset from can also be clustered to find groups of users or bookmarks. In the same way that the blog feeds were transformed into word vectors, any set of pages that is download can be reduced to just the words. These ideas can be extended to many different areas to find interesting things—message boards based on word usage, companies from Yahoo! Finance based on various statistics, or top reviewers on Amazon according to what they like. It would also be interesting to look at a large social network like MySpace and cluster people according to who their friends are, or possibly use other information they provide about themselves (favorite bands, foods, etc.). The concept of imagining items in space depending on their parameters will be a recurring theme in this book. Using multidimensional scaling is an effective way to take a dataset and actually view it in a way that's easy to interpret. It's important to realize that some information is lost in the process of scaling, but the result should help you understand the algorithms better. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Exercises Content preview·Buy reprint rights for this chapter 1. Using the del.icio.us API from , create a dataset of bookmarks suitable for clustering. Run hierarchical and K-means clustering on it. 2. Modify the blog parsing code to cluster individual entries instead of entire blogs. Do entries from the same blog cluster together? What about entries from the same date? 3. Try using actual (Pythagorean) distance for blog clustering. How does this change the results? 4. Find out what Manhattan distance is. Create a function for it and see how it changes the results for the Zebo dataset. 5. Modify the K-means clustering function to return, along with the cluster results, the total distance between all the items and their respective centroids. 6. After completing Exercise 5, create a function that runs K-means clustering over different values of k. How does the total distance change as the number of clusters increases? At what point does the improvement from having more clusters become very small? 7. Multidimensional scaling in two dimensions is easy to print, but scaling can be done in any number of dimensions. Try changing the code to scale in one dimension (all the points on a line). Now try making it work for three dimensions. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 4: Searching and Ranking Content preview·Buy reprint rights for this chapter This chapter covers full-text search engines, which allow people to search a large set of documents for a list of words, and which rank results according to how relevant the documents are to those words. Algorithms for full-text searches are among the most important collective intelligence algorithms, and many fortunes have been made by new ideas in this field. It is widely believed that Google's rapid rise from an academic project to the world's most popular search engine was based largely on the PageRank algorithm, a variation that you'll learn about in this chapter. Information retrieval is a huge field with a long history. This chapter will only be able to cover a few key concepts, but we'll go through the construction of a search engine that will index a set of documents and leave you with ideas on how to improve things further. Although the focus will be on algorithms for searching and ranking rather than on the infrastructure requirements for indexing large portions of the Web, the search engine you build should have no problem with collections of up to 100,000 pages. Throughout this chapter, you'll learn all the necessary steps to crawl, index, and search a set of pages, and even rank their results in many different ways. The first step in creating a search engine is to develop a way to collect the documents. In some cases, this will involve crawling (starting with a small set of documents and following links to others) and in other cases it will begin with a fixed collection of documents, perhaps from a corporate intranet. After you collect the documents, they need to be indexed. This usually involves creating a big table of the documents and the locations of all the different words. Depending on the particular application, the documents themselves do not necessarily have to be stored in a database; the index simply has to store a reference (such as a file system path or URL) to their locations. The final step is, of course, returning a ranked list of documents from a query. Retrieving every document with a given set of words is fairly straightforward once you have an index, but the real magic is in how the results are sorted. A huge number of metrics can be generated, and there is no shortage of ways you can tweak them to change the sort order. Just learning all the different metrics might make you wish that the big search engines would let you control more of them ("Why can't I tell Google that my words must be close together?"). This chapter will look at several metrics based on the content of the page, such as word frequency, and then cover metrics based on information external to the content of the page, such as the PageRank algorithm, which looks at how other pages link to the page in question. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! What's in a Search Engine? Content preview·Buy reprint rights for this chapter The first step in creating a search engine is to develop a way to collect the documents. In some cases, this will involve crawling (starting with a small set of documents and following links to others) and in other cases it will begin with a fixed collection of documents, perhaps from a corporate intranet. After you collect the documents, they need to be indexed. This usually involves creating a big table of the documents and the locations of all the different words. Depending on the particular application, the documents themselves do not necessarily have to be stored in a database; the index simply has to store a reference (such as a file system path or URL) to their locations. The final step is, of course, returning a ranked list of documents from a query. Retrieving every document with a given set of words is fairly straightforward once you have an index, but the real magic is in how the results are sorted. A huge number of metrics can be generated, and there is no shortage of ways you can tweak them to change the sort order. Just learning all the different metrics might make you wish that the big search engines would let you control more of them ("Why can't I tell Google that my words must be close together?"). This chapter will look at several metrics based on the content of the page, such as word frequency, and then cover metrics based on information external to the content of the page, such as the PageRank algorithm, which looks at how other pages link to the page in question. Finally, you'll build a neural network for ranking queries. The neural network will learn to associate searches with results based on what links people click on after they get a list of search results. The neural network will use this information to change the ordering of the results to better reflect what people have clicked on in the past. To work through the examples in this chapter, you'll need to create a Python module called searchengine, which has two classes: one for crawling and creating the database, and the other for doing full-text searches by querying the database. The examples will use Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! A Simple Crawler Content preview·Buy reprint rights for this chapter I'll assume for now that you don't have a big collection of HTML documents sitting on your hard drive waiting to be indexed, so I'll show you how to build a simple crawler. It will be seeded with a small set of pages to index and will then follow any links on that page to find other pages, whose links it will also follow. This process is called crawling or spidering. To do this, your code will have to download the pages, pass them to the indexer (which you'll build in the next section), and then parse the pages to find all the links to the pages that have to be crawled next. Fortunately, there are a couple of libraries that can help with this process. For the examples in this chapter, I have set up a copy of several thousand files from Wikipedia, which will remain static at . You're free to run the crawler on any set of pages you like, but you can use this site if you want to compare your results to those in this chapter. urllib2 is a library bundled with Python that makes it easy to download pages—all you have to do is supply the URL. You'll use it in this section to download the pages that will be indexed. To see it in action, start up your Python interpreter and try this: >> import urllib2 >> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html') >> contents=c.read( ) >> print contents[0:50] '0: tablelist+=',' clauselist+=' and ' clauselist+='w%d.urlid=w%d.urlid and ' % (tablenumber-1,tablenumber) fieldlist+=',w%d.location' % tablenumber tablelist+='wordlocation w%d' % tablenumber clauselist+='w%d.wordid=%d' % (tablenumber,wordid) tablenumber+=1 # Create the query from the separate parts fullquery='select %s from %s where %s' % (fieldlist,tablelist,clauselist) cur=self.con.execute(fullquery) rows=[row for row in cur] return rows,wordids This function looks a bit complicated, but it's just creating a reference to the wordlocation table for each word in the list and joining them all on their URL IDs (). Figure : Table joins for getmatchrows So a query for two words with the IDs 10 and 17 becomes: select w0.urlid,w0.location,w1.location Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Content-Based Ranking Content preview·Buy reprint rights for this chapter So far you've managed to retrieve pages that match the queries, but the order in which they are returned is simply the order in which they were crawled. In a large set of pages, you would be stuck sifting through a lot of irrelevant content for any mention of each of the query terms in order to find the pages that are really related to your search. To address this issue, you need ways to give pages a score for a given query, as well as the ability to return them with the highest scoring results first. This section will look at several ways to calculate a score based only on the query and the content of the page. These scoring metrics include: Word frequency The number of times the words in the query appear in the document can help determine how relevant the document is. Document location The main subject of a document will probably appear near the beginning of the document. Word distance If there are multiple words in the query, they should appear close together in the document. The earliest search engines often worked with only these types of metrics and were able to give usable results. Later sections will cover ways to improve results with information external to the page, such as the number and quality of incoming links. First, you'll need a new method that will take a query, get the rows, put them in a dictionary, and display them in a formatted list. Add these functions to your searcher class: def getscoredlist(self,rows,wordids): totalscores=dict([(row[0],0) for row in rows]) # This is where you'll later put the scoring functions weights=[] for (weight,scores) in weights: for url in totalscores: totalscores[url]+=weight*scores[url] return totalscores def geturlname(self,id): return self.con.execute( "select url from urllist where rowid=%d" % id).fetchone( )[0] def query(self,q): rows,wordids=self.getmatchrows(q) scores=self.getscoredlist(rows,wordids) rankedscores=sorted([(score,url) for (url,score) in scores.items( )],reverse=1) for (score,urlid) in rankedscores[0:10]: print '%f\t%s' % (score,self.geturlname(urlid)) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Using Inbound Links Content preview·Buy reprint rights for this chapter The scoring metrics discussed so far have all been based on the content of the page. Although many search engines still work this way, the results can often be improved by considering information that others have provided about the page, specifically, who has linked to the page and what they have said about it. This is particularly useful when indexing pages of dubious value or pages that might have been created by spammers, as these are less likely to be linked than pages with real content. The crawler that you built at the beginning of the chapter already captures all the important information about the links, so there's no need to change it. The links table has the URL IDs for the source and target of every link that it has encountered, and the linkwords table connects the words with the links. The easiest thing to do with inbound links is to count them on each page and use the total number of links as a metric for the page. Academic papers are often rated in this way, with their importance tied to the number of other papers that reference them. The scoring function below creates a dictionary of counts by querying the link table for every unique URL ID in rows, and then it returns the normalized scores: def inboundlinkscore(self,rows): uniqueurls=set([row[0] for row in rows]) inboundcount=dict([(u,self.con.execute( \ 'select count(*) from link where toid=%d' % u).fetchone( )[0]) \ for u in uniqueurls]) return self.normalizescores(inboundcount) Obviously, using this metric by itself will simply return all the pages containing the search terms, ranked solely on how many inbound links they have. In the dataset, "Programming language" has many more inbound links than "Python," but you'd rather see "Python" first in the results if that's what you searched for. To combine relevance with ranking, you need to use the inbound-links metric in combination with one of the metrics shown earlier. This algorithm also weights every inbound link equally, which, while nice and egalitarian, is open to manipulation because someone can easily set up several sites pointing to a page whose score they want to increase. It's also possible that people are more interested in results that have attracted the attention of very popular sites. Next, you'll see how to make links from popular pages worth more in calculating rankings. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Learning from Clicks Content preview·Buy reprint rights for this chapter One of the major advantages of online applications is that they receive constant feedback in the form of user behavior. In the case of a search engine, each user will immediately provide information about how much he likes the results for a given search by clicking on one result and choosing not to click on the others. This section will look at a way to record when a user clicks on a result after a query, and how that record can be used to improve the rankings of the results. To do this, you're going to build an artificial neural network that you'll train by giving it the words in the query, the search results presented to the user, and what the user decided to click. Once the network has been trained with many different queries, you can use it to change the ordering of the search results to better reflect what users actually clicked on in the past. While there are many different kinds of neural networks, they all consist of a set of nodes (the neurons) and connections between them. The network you'll learn how to build is called a multilayer perceptron (MLP) network. This type of network consists of multiple layers of neurons, the first of which takes the input—in this case, the words entered by the user. The last layer gives the output, which in this example is a list of weightings for the different URLs that were returned. There can be multiple middle layers, but the network in this example will just use a single one. This is called the hidden layer because the outside world never interacts with it directly, and it responds to combinations of inputs. In this case, a combination of inputs is a set of words, so you could also think of this as the query layer. shows the structure of the network. All the nodes in the input layer are connected to all the nodes in the hidden layer, and all the nodes in the hidden layer are connected to all the nodes in the output layer. Figure : Design of a click-tracking neural network To ask the neural network to get the best results for a query, the input nodes for the words in that query have their values set to 1. The outputs of those nodes are turned on and they attempt to activate the hidden layer. In turn, the nodes in the hidden layer that get a strong enough input will turn on their outputs and try to activate nodes in the output layer. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Exercises Content preview·Buy reprint rights for this chapter 1. Word separation. The separatewords method currently considers any nonalphanumeric character to be a separator, meaning it will not properly index entries like "C++," "$20," "Ph.D.," or "617-555-1212." What is a better way to separate words? Does using whitespace as a separator work? Write a better word separation function. 2. Boolean operations. Many search engines support Boolean queries, which allow users to construct searches like "python OR perl." An OR search can work by doing the queries separately and combining the results, but what about "python AND (program OR code)"? Modify the query methods to support some basic Boolean operations. 3. Exact matches. Search engines often support "exact match" queries, where the words in the page must match the words in the query in the same order with no additional words in between. Create a new version of getrows that only returns results that are exact matches. (Hint: you can use subtraction in SQL to get the difference between the word locations.) 4. Long/short document search. Sometimes the length of a page will be a determining factor in whether it is relevant to a particular search application or user. A user may be interested in finding a long article about a difficult subject or a quick reference page for a command-line tool. Write a weighting function that will give preference to longer or shorter documents depending on its parameters. 5. Word frequency bias. The "word count" metric is biased to favor longer documents, since a long document has more words and can therefore contain the target words more often. Write a new metric that calculates frequency as a percentage of the number of words in the document. 6. Inbound link searching. Your code can rank items based on the text of the inbound links, but they must already be results based on the content of the page. Sometimes the most relevant page doesn't contain the query text at all, but rather a lot of links with the text pointing to it—this is often the case with links to images. Modify the search code to also include results where an inbound link contains some of the search terms. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 5: Optimization Content preview·Buy reprint rights for this chapter This chapter will look at how to solve collaboration problems using a set of techniques called stochastic optimization. Optimization techniques are typically used in problems that have many possible solutions across many variables, and that have outcomes that can change greatly depending on the combinations of these variables. These optimization techniques have a wide variety of applications: we use them in physics to study molecular dynamics, in biology to predict protein structures, and in computer science to determine the worst possible running time of an algorithm. NASA even uses optimization techniques to design antennas that have the right operating characteristics, which look unlike anything a human designer would create. Optimization finds the best solution to a problem by trying many different solutions and scoring them to determine their quality. Optimization is typically used in cases where there are too many possible solutions to try them all. The simplest but least effective method of searching for solutions is just trying a few thousand random guesses and seeing which one is best. More effective methods, which will be discussed in this chapter, involve intelligently modifying the solutions in a way that is likely to improve them. The first example in this chapter concerns group travel planning. Anyone who has planned a trip for a group of people, or perhaps even for an individual, realizes that there are a lot of different inputs required, such as what everyone's flight schedule should be, how many cars should be rented, and which airport is easiest. Many outputs must also be considered, such as total cost, time spent waiting at airports, and time taken off work. Because the inputs can't be mapped to the outputs with a simple formula, the problem of finding the best solution lends itself to optimization. The other examples in the chapter show the flexibility of optimization by considering two completely different problems: how to allocate limited resources based on people's preferences, and how to visualize a social network with minimal crossed lines. By the end of the chapter, you'll be able to spot other types of problems that can be solved using optimization. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Group Travel Content preview·Buy reprint rights for this chapter Planning a trip for a group of people (the Glass family in this example) from different locations all arriving at the same place is always a challenge, and it makes for an interesting optimization problem. To begin, create a new file called optimization.py and insert the following code: import time import random import math people = [('Seymour','BOS'), ('Franny','DAL'), ('Zooey','CAK'), ('Walt','MIA'), ('Buddy','ORD'), ('Les','OMA')] # LaGuardia airport in New York destination='LGA' The family members are from all over the country and wish to meet up in New York. They will all arrive on the same day and leave on the same day, and they would like to share transportation to and from the airport. There are dozens of flights per day to New York from any of the family members' locations, all leaving at different times. The flights also vary in price and in duration. You can download a sample file of flight data from . This file contains origin, destination, departure time, arrival time, and price for a set of flights in a comma-separated format: LGA,MIA,20:27,23:42,169 MIA,LGA,19:53,22:21,173 LGA,BOS,6:39,8:09,86 BOS,LGA,6:17,8:26,89 LGA,BOS,8:23,10:28,149 Load this data into a dictionary with the origin and destination (dest) as the keys and a list of potential flight details as the values. Add this code to load the data into optimization.py: flights={} # for line in file('schedule.txt'): origin,dest,depart,arrive,price=line.strip( ).split(',') flights.setdefault((origin,dest),[]) # Add details to the list of possible flights flights[(origin,dest)].append((depart,arrive,int(price))) It's also useful at this point to define a utility function, getminutes, which calculates how many minutes into the day a given time is. This makes it easy to calculate flight times and waiting times. Add this function to optimization.py: def getminutes(t): x=time.strptime(t,'%H:%M') return x[3]*60+x[4] The challenge now is to decide which flight each person in the family should take. Of course, keeping total price down is a goal, but there are many other possible factors that the optimal solution will take into account and try to minimize, such as total waiting time at the airport or total flight time. These other factors will be discussed in more detail shortly. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Representing Solutions Content preview·Buy reprint rights for this chapter When approaching a problem like this, it's necessary to determine how a potential solution will be represented. The optimization functions you'll see later are generic enough to work on many different types of problems, so it's important to choose a simple representation that's not specific to the group travel problem. A very common representation is a list of numbers. In this case, each number can represent which flight a person chooses to take, where 0 is the first flight of the day, 1 is the second, and so on. Since each person needs an outbound flight and a return flight, the length of this list is twice the number of people. For example, the list: [1,4,3,2,7,3,6,3,2,4,5,3] Represents a solution in which Seymour takes the second flight of the day from Boston to New York, and the fifth flight back to Boston on the day he returns. Franny takes the fourth flight from Dallas to New York, and the third flight back. Because it will be difficult to interpret solutions from this list of numbers, you'll need a routine that prints all the flights that people decide to take in a nice table. Add this function to optimization.py: def printschedule(r): for d in range(len(r)/2): name=people[d][0] origin=people[d][1] out=flights[(origin,destination)][r[d]] ret=flights[(destination,origin)][r[d+1]] print '%10s%10s %5s-%5s $%3s %5s-%5s $%3s' % (name,origin, out[0],out[1],out[2], ret[0],ret[1],ret[2]) This will print a line containing each person's name and origin, as well as the departure time, arrival time, and price for the outgoing and return flights. Try this function in your Python session: >>> import optimization >>> s=[1,4,3,2,7,3,6,3,2,4,5,3] >>> optimization.printschedule(s) Seymour Boston 12:34-15:02 $109 12:08-14:05 $142 Franny Dallas 12:19-15:25 $342 9:49-13:51 $229 Zooey Akron 9:15-12:14 $247 15:50-18:45 $243 Walt Miami 15:34-18:11 $326 14:08-16:09 $232 Buddy Chicago 14:22-16:32 $126 15:04-17:23 $189 Les Omaha 15:03-16:42 $135 6:19- 8:13 $239 Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! The Cost Function Content preview·Buy reprint rights for this chapter The cost function is the key to solving any problem using optimization, and it's usually the most difficult thing to determine. The goal of any optimization algorithm is to find a set of inputs—flights, in this case—that minimizes the cost function, so the cost function has to return a value that represents how bad a solution is. There is no particular scale for badness; the only requirement is that the function returns larger values for worse solutions. Often it is difficult to determine what makes a solution good or bad across many variables. Consider a few of the things that can be measured in the group travel example: Price The total price of all the plane tickets, or possibly a weighted average that takes financial situations into account. Travel time The total time that everyone has to spend on a plane. Waiting time Time spent at the airport waiting for the other members of the party to arrive. Departure time Flights that leave too early in the morning may impose an additional cost by requiring travelers to miss out on sleep. Car rental period If the party rents a car, they must return it earlier in the day than when they rented it, or be forced to pay for a whole extra day. It's not too hard to think of even more aspects of a particular schedule that could make the experience more or less pleasant. Any time you're faced with finding the best solution to a complicated problem, you'll need to decide what the important factors are. Although this can be difficult, the big advantage is that once it's done, you can use the optimization algorithms in this chapter on almost any problem with minimal modification. After choosing some variables that impose costs, you'll need to determine how to combine them into a single number. In this example, it's necessary to decide, for instance, how much money that time on the plane or time waiting in the airport is worth. You might decide that it's worth spending $1 for every minute saved on air travel (this translates into spending an extra $90 for a direct flight that saves an hour and a half), and $0.50 for every minute saved waiting in the airport. You could also add the cost of an extra day of car rental if everyone returns to the airport at a later time of the day than when they first rented the car. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Random Searching Content preview·Buy reprint rights for this chapter Random searching isn't a very good optimization method, but it makes it easy to understand exactly what all the algorithms are trying to do, and it also serves as a baseline so you can see if the other algorithms are doing a good job. The function takes a couple of parameters. Domain is a list of 2-tuples that specify the minimum and maximum values for each variable. The length of the solution is the same as the length of this list. In the current example, there are nine outbound flights and nine inbound flights for every person, so the domain in the list is (0,8) repeated twice for each person. The second parameter, costf, is the cost function, which in this example will be schedulecost. This is passed as a parameter so that the function can be reused for other optimization problems. This function randomly generates 1,000 guesses and calls costf on them. It keeps track of the best guess (the one with the lowest cost) and returns it. Add it to optimization.py: def randomoptimize(domain,costf): best=999999999 bestr=None for i in range(1000): # Create a random solution r=[random.randint(domain[i][0],domain[i][1]) for i in range(len(domain))] # Get the cost cost=costf(r) # Compare it to the best one so far if cost>> reload(optimization) >>> domain=[(0,8)]*(len(optimization.people)*2) >>> s=optimization.randomoptimize(domain,optimization.schedulecost) >>> optimization.schedulecost(s) 3328 >>> optimization.printschedule(s) Seymour Boston 12:34-15:02 $109 12:08-14:05 $142 Franny Dallas 12:19-15:25 $342 9:49-13:51 $229 Zooey Akron 9:15-12:14 $247 15:50-18:45 $243 Walt Miami 15:34-18:11 $326 14:08-16:09 $232 Buddy Chicago 14:22-16:32 $126 15:04-17:23 $189 Les Omaha 15:03-16:42 $135 6:19- 8:13 $239 Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Hill Climbing Content preview·Buy reprint rights for this chapter Randomly trying different solutions is very inefficient because it does not take advantage of the good solutions that have already been discovered. In our example, a schedule with a low overall cost is probably similar to other schedules that have a low cost. Because random optimization jumps around, it won't automatically look at similar schedules to locate the good ones that have already been found. An alternate method of random searching is called hill climbing. Hill climbing starts with a random solution and looks at the set of neighboring solutions for those that are better (have a lower cost function). This is analogous to going down a hill, as shown in . Figure : Seeking the lowest cost on a hill Imagine you are the person shown in the figure, having been randomly dropped into this landscape. You want to reach the lowest point to find water. To do this, you might look in each direction and walk toward wherever the land slopes downward most steeply. You would continue to walk in the most steeply sloping direction until you reached a point where the terrain was flat or began sloping uphill. You can apply this hill climbing approach to the task of finding the best travel schedule for the Glass family. Start with a random schedule and find all the neighboring schedules. In this case, that means finding all the schedules that have one person on a slightly earlier or slightly later flight. The cost is calculated for each of the neighboring schedules, and the one with the lowest cost becomes the new solution. This process is repeated until none of the neighboring schedules improves the cost. To implement this, add hillclimb to optimization.py: def hillclimb(domain,costf): # Create a random solution sol=[random.randint(domain[i][0],domain[i][1]) for i in range(len(domain))] # Main loop while 1: # Create list of neighboring solutions neighbors=[] for j in range(len(domain)): # One away in each direction if sol[j]>domain[j][0]: neighbors.append(sol[0:j]+[sol[j]+1]+sol[j+1:]) if sol[j]domain[i][0]: return vec[0:i]+[vec[i]-step]+vec[i+1:] elif vec[i]>> import xml.dom.minidom >>> dom=xml.dom.minidom.parseString('Hello!') >>> dom >>> r=dom.getElementsByTagName('rec') >>> r [] >>> r[0].firstChild >>> r[0].firstChild.data u'Hello!' Because many web sites now offer a way to access information through an XML interface, learning how to use the Python XML packages is very useful for collective intelligence programming. Here are the important methods of DOM objects that you'll be using for the Kayak API: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Optimizing for Preferences Content preview·Buy reprint rights for this chapter You've seen one example of a problem that optimization can be used to solve, but there are many seemingly unrelated problems that can be attacked using the same methods. Remember, the primary requirements for solving with optimization are that the problem has a defined cost function and that similar solutions tend to yield similar results. Not every problem with these properties will be solvable with optimization, but there's a good chance that optimization will return some interesting results that you hadn't considered. This section will consider a different problem, one that clearly lends itself to optimization. The general problem is how to allocate limited resources to people who have expressed preferences and make them all as happy as possible (or, depending on their dispositions, annoy them as little as possible). The example problem in this section is that of assigning students to dorms depending on their first and second choices. Although this is a very specific example, it's easy to generalize this case to other problems—the exact same code can be used to assign tables to players in an online card game, assign bugs to developers in a large coding project, or even to assign housework to household members. Once again, the purpose is to take information from individuals and combine it to produce the optimal result. There are five dorms in our example, each with two spaces available and ten students vying for spots. Each student has first and second choices. Create a new file called dorm.py and add the list of dorms and the list of people, along with their top two choices: import random import math # The dorms, each of which has two available spaces dorms=['Zeus','Athena','Hercules','Bacchus','Pluto'] # People, along with their first and second choices prefs=[('Toby', ('Bacchus', 'Hercules')), ('Steve', ('Zeus', 'Pluto')), ('Andrea', ('Athena', 'Zeus')), ('Sarah', ('Zeus', 'Pluto')), ('Dave', ('Athena', 'Bacchus')), ('Jeff', ('Hercules', 'Pluto')), ('Fred', ('Pluto', 'Athena')), ('Suzie', ('Bacchus', 'Hercules')), ('Laura', ('Bacchus', 'Hercules')), ('Neil', ('Hercules', 'Athena'))] Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Network Visualization Content preview·Buy reprint rights for this chapter The final example in this chapter shows another way in which optimization can be used on problems that are completely unrelated to one another. In this case, the problem is the visualization of networks. A network in this case is any set of things that are connected together. A good example in online applications is a social network like MySpace, Facebook, or LinkedIn, where people are connected because they are friends or have a professional relationship. Each member of the site chooses to whom they are connected, and collectively this creates a network of people. It is interesting to visualize such networks to determine their structure, perhaps in order to find the people who are connectors (those who know a lot of people or who serve as a link between otherwise self-contained cliques). When drawing a network to visualize a big group of people and the links between them, one problem is deciding where each name (or icon) should be placed in the picture. For example, consider the network in . Figure : A confusing network layout In this figure, you can see that Augustus is friends with Willy, Violet, and Miranda. But the layout of the network is a bit messy, and adding more people would make it very confusing. A much cleaner layout is shown in . Figure : A clean network layout This section will look at how optimization can be used to create better, less confusing visuals. To begin, create a new file called socialnetwork.py and add some facts about a subsection of the social network: import math people=['Charlie','Augustus','Veruca','Violet','Mike','Joe','Willy','Miranda'] links=[('Augustus', 'Willy'), ('Mike', 'Joe'), ('Miranda', 'Mike'), ('Violet', 'Augustus'), ('Miranda', 'Willy'), ('Charlie', 'Mike'), ('Veruca', 'Joe'), ('Miranda', 'Augustus'), ('Willy', 'Augustus'), ('Joe', 'Charlie'), ('Veruca', 'Augustus'), ('Miranda', 'Joe')] The goal here is to create a program that can take a list of facts about who is friends with whom and generate an easy-to-interpret network diagram. This is usually done with a Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Other Possibilities Content preview·Buy reprint rights for this chapter This chapter has shown three completely different applications for optimization algorithms, but that's only a small fraction of what is possible. As stated throughout the chapter, the important steps are deciding on a representation and a cost function. If you can do these things, there's a good chance you can use optimization to find solutions to your problem. An interesting activity might be to take a large group of people and divide them into teams in which the skills of the members are evenly divided. In a trivia contest, it might be desirable to create teams from a set of people so that each team has adequate knowledge of sports, history, literature, and television. Another possibility is to assign tasks in group projects by taking a combination of people's skills into account. Optimization can determine the best way to divide the tasks so that the task list is completed in the shortest possible time. Given a long list of web sites tagged with keywords, it might be interesting to find an optimal group of web sites for a user-supplied set of keywords. The optimal group would contain a set of web sites that don't have many keywords in common with each other but represent as many of the user-supplied keywords as possible. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Exercises Content preview·Buy reprint rights for this chapter 1. Group travel cost function. Add total flight time as a cost equal to $0.50 per minute on the plane. Next try adding a penalty of $20 for making anyone get to the airport before 8 a.m. 2. Annealing starting points. The outcome of simulated annealing depends heavily on the starting point. Build a new optimization function that does simulated annealing from multiple starting solutions and returns the best one. 3. Genetic optimization stopping criteria. A function in this chapter runs the genetic optimizer for a fixed number of iterations. Change it so that it stops when there has been no improvement in any of the best solutions for 10 iterations. 4. Round-trip pricing. The function for getting flight data from Kayak right now only looks for one-way flights. Prices are probably cheaper when buying round-trip tickets. Modify the code to get round-trip prices, and modify the cost function to use a price lookup for a particular pair of flights instead of just summing their one-way prices. 5. Pairing students. Imagine if instead of listing dorm preferences, students had to express their preferences for a roommate. How would you represent solutions to pairing students? What would the cost function look like? 6. Line angle penalization. Add an additional cost to the network layout algorithm cost function when the angle between two lines attached to the same person is very small. (Hint: you can use the vector cross-product.) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 6: Document Filtering Content preview·Buy reprint rights for this chapter This chapter will demonstrate how to classify documents based on their contents, a very practical application of machine intelligence and one that is becoming more widespread. Perhaps the most useful and well-known application of document filtering is the elimination of spam. A big problem with the wide availability of email and the extremely low cost of sending email messages is that anyone whose address gets into the wrong hands is likely to receive unsolicited commercial email messages, making it difficult for them to read the messages that are actually of interest. The problem of spam does not just apply to email, of course. Web sites have gotten more interactive over time, soliciting comments from users or asking them to create original content, which has compounded the spam problem. Public message boards like Yahoo! Groups and Usenet have long been victims of postings that are unrelated to the board's subject or that hawk dubious products. Blogs and Wikis are now experiencing the same problem. When building an application that allows the general public to contribute, you should always have a strategy for dealing with spam. The algorithms described in this chapter are not specific to dealing with spam. Since they solve the more general problem of learning to recognize whether a document belongs in one category or another, they can be used for less unsavory purposes. One example might be automatically dividing your inbox into social and work-related email, based on the contents of the messages. Another possibility is identifying email messages that request information and automatically forwarding them to the most competent person to answer them. The example at the end of this chapter will demonstrate automatically filtering entries from an RSS feed into different categories. Early attempts to filter spam were all rule-based classifiers, where a person would design a set of rules that was supposed to indicate whether or not a message was spam. Rules typically included things like overuse of capital letters, words related to pharmaceutical products, or particularly garish HTML colors. The problems with rule-based classifiers quickly became apparent—spammers learned all the rules and stopped exhibiting the obvious behaviors to get around the filters, while people whose parents never learned to turn off the Caps Lock key found their good email messages being classified as spam. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Filtering Spam Content preview·Buy reprint rights for this chapter Early attempts to filter spam were all rule-based classifiers, where a person would design a set of rules that was supposed to indicate whether or not a message was spam. Rules typically included things like overuse of capital letters, words related to pharmaceutical products, or particularly garish HTML colors. The problems with rule-based classifiers quickly became apparent—spammers learned all the rules and stopped exhibiting the obvious behaviors to get around the filters, while people whose parents never learned to turn off the Caps Lock key found their good email messages being classified as spam. The other problem with rule-based filters is that what can be considered spam varies depending on where it's being posted and for whom it is being written. Keywords that would strongly indicate spam for one particular user, message board, or Wiki may be quite normal for others. To solve this problem, this chapter will look at programs that learn, based on you telling them what is spam email and what isn't, both initially and as you receive more messages. By doing this, you can create separate instances and datasets for individual users, groups, or sites that will each develop their own ideas about what is spam and what isn't. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Documents and Words Content preview·Buy reprint rights for this chapter The classifier that you will be building needs features to use for classifying different items. A feature is anything that you can determine as being either present or absent in the item. When considering documents for classification, the items are the documents and the features are the words in the document. When using words as features, the assumption is that some words are more likely to appear in spam than in nonspam, which is the basic premise underlying most spam filters. Features don't have to be individual words, however; they can be word pairs or phrases or anything else that can be classified as absent or present in a particular document. Create a new file called docclass.py, and add a function called getwords to extract the features from the text: import re import math def getwords(doc): splitter=re.compile('\\W*') # Split the words by non-alpha characters words=[s.lower( ) for s in splitter.split(doc) if len(s)>2 and len(s)<20] # Return the unique set of words only return dict([(w,1) for w in words]) This function breaks up the text into words by dividing the text on any character that isn't a letter. This leaves only actual words, all converted to lowercase. Determining which features to use is both very tricky and very important. The features must be common enough that they appear frequently, but not so common that they appear in every single document. In theory, the entire text of the document could be a feature, but that would almost certainly be useless unless you receive the exact same email message over and over. At the other extreme, the features could be individual characters, but since they would all likely appear in every email message, they would do a poor job of separating wanted from unwanted documents. Even the choice to use words as features poses questions of exactly how to divide words, which punctuation to include, and whether header information should be included. The other thing to consider when deciding on features is how well they will divide the set of documents into the target categories. For example, the code for Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Training the Classifier Content preview·Buy reprint rights for this chapter The classifiers discussed in this chapter learn how to classify a document by being trained. Many of the other algorithms in this book, such as the neural network you saw in , learn by reading examples of correct answers. The more examples of documents and their correct classifications it sees, the better the classifier will get at making predictions. The classifier is also specifically designed to start off very uncertain and increase in certainty as it learns which features are important for making a distinction. The first thing you'll need is a class to represent the classifier. This class will encapsulate what the classifier has learned so far. The advantage of structuring the module this way is that you can instantiate multiple classifiers for different users, groups, or queries, and train them differently to respond to a particular group's needs. Create a class called classifier in docclass.py: class classifier: def __init_ _(self,getfeatures,filename=None): # Counts of feature/category combinations self.fc={} # Counts of documents in each category self.cc={} self.getfeatures=getfeatures The three instance variables are fc, cc, and getfeatures. The fc variable will store the counts for different features in different classifications. For example: {'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}} This indicates that the word "the" has appeared in documents classified as bad three times, and in documents that were classified as good three times. The word "Python" has only appeared in good documents. The cc variable is a dictionary of how many times every classification has been used. This is needed for the probability calculations that we'll discuss shortly. The final instance variable, getfeatures, is the function that will be used to extract the features from the items being classified—in this example, it is the getwords function you just defined. The methods in the class won't use the dictionaries directly because this restricts potential options for storing the training data in a file or database. Create these helper methods to increment and get the counts: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Calculating Probabilities Content preview·Buy reprint rights for this chapter You now have counts for how often an email message appears in each category, so the next step is to convert these numbers into probabilities. A probability is a number between 0 and 1, indicating how likely an event is. In this case, you can calculate the probability that a word is in a particular category by dividing the number of times the word appears in a document in that category by the total number of documents in that category. Add a method called fprob to the classifier class: def fprob(self,f,cat): if self.catcount(cat)==0: return 0 # The total number of times this feature appeared in this # category divided by the total number of items in this category return self.fcount(f,cat)/self.catcount(cat) This is called conditional probability, and is usually written as Pr(A | B) and spoken "the probability of A given B." In this example, the numbers you have now are Pr(word | classification); that is, for a given classification you calculate the probability that a particular word appears. You can test this function in your Python session: >>> reload(docclass) >>> cl=docclass.classifier(docclass.getwords) >>> docclass.sampletrain(cl) >>> cl.fprob('quick','good') 0.66666666666666663 You can see that the word "quick" appears in two of the three documents classified as good, which means there's a probability of Pr(quick | good) = 0.666 (a 2/3 chance) that a good document will contain that word. The fprob method gives an accurate result for the features and classifications it has seen so far, but it has a slight problem—using only the information it has seen so far makes it incredibly sensitive during early training and to words that appear very rarely. In the sample training data, the word "money" only appears in one document and is classified as bad because it is a casino ad. Since the word "money" is in one bad document and no good ones, the probability that it will appear in the good category using fprob is now 0. This is a bit extreme, since "money" might be a perfectly neutral word that just happens to appear first in a bad document. It would be much more realistic for the value to gradually approach zero as a word is found in more and more documents with the same category. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! A Naïve Classifier Content preview·Buy reprint rights for this chapter Once you have the probabilities of a document in a category containing a particular word, you need a way to combine the individual word probabilities to get the probability that an entire document belongs in a given category. This chapter will consider two different classification methods. Both of them work in most situations, but they vary slightly in their level of performance for specific tasks. The classifier covered in this section is called a naïve Bayesian classifier. This method is called naïve because it assumes that the probabilities being combined are independent of each other. That is, the probability of one word in the document being in a specific category is unrelated to the probability of the other words being in that category. This is actually a false assumption, since you'll probably find that documents containing the word "casino" are much more likely to contain the word "money" than documents about Python programming are. This means that you can't actually use the probability created by the naïve Bayesian classifier as the actual probability that a document belongs in a category, because the assumption of independence makes it inaccurate. However, you can compare the results for different categories and see which one has the highest probability. In real life, despite the underlying flawed assumption, this has proven to be a surprisingly effective method for classifying documents. To use the naïve Bayesian classifier, you'll first have to determine the probability of an entire document being given a classification. As discussed earlier, you're going to assume the probabilities are independent, which means you can calculate the probability of all of them by multiplying them together. For example, suppose you've noticed that the word "Python" appears in 20 percent of your bad documents—Pr(Python | Bad) = 0.2—and that the word "casino" appears in 80 percent of your bad documents (Pr(Casino | Bad) = 0.8). You would then expect the independent probability of both words appearing in a bad document—Pr(Python & Casino | Bad)—to be 0.8 × 0.2 = 0.16. From this you can see that calculating the entire document probability is just a matter of multiplying together all the probabilities of the individual words in that document. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! The Fisher Method Content preview·Buy reprint rights for this chapter The Fisher method, named for R. A. Fisher, is an alternative method that's been shown to give very accurate results, particularly for spam filtering. This is the method used by SpamBayes, an Outlook plug-in written in Python. Unlike the naïve Bayesian filter, which uses the feature probabilities to create a whole document probability, the Fisher method calculates the probability of a category for each feature in the document, then combines the probabilities and tests to see if the set of probabilities is more or less likely than a random set. This method also returns a probability for each category that can be compared to the others. Although this is a more complex method, it is worth learning because it allows much greater flexibility when choosing cutoffs for categorization. With the naïve Bayesian filter discussed earlier, you combined all of the Pr(feature | category) results to get an overall document probability, and then flipped it around at the end. In this section, you'll begin by calculating how likely it is that a document fits into a category given that a particular feature is in that document—that is, Pr(category | feature). If the word "casino" appears in 500 documents, and 499 of those are in the bad category, "casino" will get a score very close to 1 for bad. The normal way to calculate Pr(category | feature) would be: (number of documents in this category with the feature) / (total number of documents with the feature) This calculation doesn't take into account the possibility that you may have received far more documents in one category than in another. If you have many good documents and only a few bad ones, a word that appears in all your bad documents will likely have a high probability for bad, even though the message is just as likely to be good. The methods perform better when they assume that in the future you will receive equal numbers of documents in each category, because this allows them to take advantage of the features that distinguish the categories. To perform this normalization, the method calculates three things: * clf = Pr(feature | category) for this category Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Persisting the Trained Classifiers Content preview·Buy reprint rights for this chapter In any real-world application, it's unlikely that all the training and classification will be done entirely in one session. If the classifier is used as part of a web-based application, it will probably have to save any training that the user does while using the application, and then restore the training data the next time the user logs on. This section will show you how to persist the training information for your classifier using a database, in this case, SQLite. If your application involves many users concurrently training and querying the classifier, it's probably wise to store the counts in a database. SQLite is the same database we used in . You'll need to download and install pysqlite if you haven't already; details on how to do this are in . Accessing SQLite from Python is similar to accessing other databases, so this should adapt quite easily. To import pysqlite, add this statement to the top of docclass.py: from pysqlite2 import dbapi2 as sqlite The code in this section will replace the dictionary structures currently in the classifier class with a persistent data store. Add a classifier method that opens a database for this classifier and creates tables if necessary. The tables match the structure of the dictionaries that they replace: def setdb(self,dbfile): self.con=sqlite.connect(dbfile) self.con.execute('create table if not exists fc(feature,category,count)') self.con.execute('create table if not exists cc(category,count)') If you're planning to adapt this for another database, you may need to modify the create table statements to work with the system you're using. You'll have to replace all the helper methods for getting and incrementing the counts: def incf(self,f,cat): count=self.fcount(f,cat) if count==0: self.con.execute("insert into fc values ('%s','%s',1)" % (f,cat)) else: self.con.execute( "update fc set count=%d where feature='%s' and category='%s'" % (count+1,f,cat)) def fcount(self,f,cat): res=self.con.execute( 'select count from fc where feature="%s" and category="%s"' %(f,cat)).fetchone( ) if res==None: return 0 else: return float(res[0]) def incc(self,cat): count=self.catcount(cat) if count==0: self.con.execute("insert into cc values ('%s',1)" % (cat)) else: self.con.execute("update cc set count=%d where category='%s'" % (count+1,cat)) def catcount(self,cat): res=self.con.execute('select count from cc where category="%s"' %(cat)).fetchone( ) if res==None: return 0 else: return float(res[0]) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Filtering Blog Feeds Content preview·Buy reprint rights for this chapter To try out the classifier on real data and show the different ways it can be used, you can apply it to entries from a blog or other RSS feed. To do this, you'll need to get the Universal Feed Parser, which we used in . If you haven't already downloaded it, you can get it from . More information on installing the Feed Parser is given in . Although a blog will not necessarily contain spam in its entries, many blogs contain some articles that interest you and some that don't. This can be because you only want to read articles in a certain category or by a certain writer, but it's often more complicated than that. Again, you can set up specific rules for things that do and do not interest you—maybe you read a gadget blog and are not interested in entries that contain the word "cell phone"—but it's much less work to use the classifier you've built to figure out these rules for you. A benefit of classifying entries in an RSS feed is that if you use a blog-searching tool like Google Blog Search, you can set up the results of your searches in a feed reader. Many people do this to track products, things that interest them, even their own names. You'll find, though, that spam-based or useless blogs trying to make money from blog traffic can also appear in these searches. For this example, you can use any feed you like, although many feeds have too few entries to do any effective training. This particular example uses the results of a Google Blog Search for the word "Python" in RSS format. You can download these results from . Create a new file called feedfilter.py and add the following code: import feedparser import re # Takes a filename of URL of a blog feed and classifies the entries def read(feed,classifier): # Get feed entries and loop over them f=feedparser.parse(feed) for entry in f['entries']: print print '-----' # Print the contents of the entry print 'Title: '+entry['title'].encode('utf-8') print 'Publisher: '+entry['publisher'].encode('utf-8') print print entry['summary'].encode('utf-8') # Combine all the text to create one item for the classifier fulltext='%s\n%s\n%s' % (entry['title'],entry['publisher'],entry['summary']) # Print the best guess at the current category print 'Guess: '+str(classifier.classify(fulltext)) # Ask the user to specify the correct category and train on that cl=raw_input('Enter category: ') classifier.train(fulltext,cl) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Improving Feature Detection Content preview·Buy reprint rights for this chapter In all the examples so far, the function for creating the list of features uses just a simple nonalphanumeric split to break up the words. The function also converts all words to lowercase, so there's no way to detect the overuse of uppercase words. There are several different ways this can be improved: * Without actually making uppercase and lowercase tokens completely distinct, use the fact that there are many uppercase words as a feature. * Use sets of words in addition to individual words. * Capture more metainformation, such as who sent an email message or what category a blog entry was posted under, and annotate it as metainformation. * Keep URLs and numbers intact. Remember that it's not simply a matter of making the features more specific. Features have to occur in multiple documents for them to be of any use to the classifier. The classifier class will take any function as getfeatures and run it on the items passed in, expecting a list or dictionary of all the features for that item to be returned. Because it is so generic, you can easily create a function that works on types more complicated than just strings. For example, when classifying entries in a blog feed, you can use a function that takes the whole entry instead of the extracted text and annotates where the different words come from. You can also pull out word pairs from the body of the text and only the individual words from the subject. It's also probably pointless to tokenize the creator field, since the postings of someone named "John Smith" will not likely tell you anything about the postings of someone else with the first name John. Add this new feature-extraction function to feedfilter.py. Notice that it expects a feed entry and not a string as its parameter: def entryfeatures(entry): splitter=re.compile('\\W*') f={} # Extract the title words and annotate titlewords=[s.lower( ) for s in splitter.split(entry['title']) if len(s)>2 and len(s)<20] for w in titlewords: f['Title:'+w]=1 # Extract the summary words summarywords=[s.lower( ) for s in splitter.split(entry['summary']) if len(s)>2 and len(s)<20] # Count uppercase words uc=0 for i in range(len(summarywords)): w=summarywords[i] f[w]=1 if w.isupper( ): uc+=1 # Get word pairs in summary as features if i0.3: f['UPPERCASE']=1 return f Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Using Akismet Content preview·Buy reprint rights for this chapter Akismet is a slight detour from the study of text-classification algorithms, but for a specific class of applications, it may solve your spam-filtering needs with minimal effort and eliminate the need for you to build your own classifier. Akismet started out as a WordPress plug-in that allowed people to report spam comments posted on their blogs, and to filter new comments based on their similarity to spam reported by other people. Now the API is open and you can query Akismet with any string to find out if Akismet thinks the string is spam. The first thing you'll need is an Akismet API key, which you can get at . These keys are free for personal use and there are several options available for commercial use. The Akismet API is called with regular HTTP requests, and libraries have been written for various languages. The one used in this section is available at . Download akismet.py and put it in your code directory or in your Python Libraries directory. Using the API is very simple. Create a new file called akismettest.py and add this function: import akismet defaultkey = "YOURKEYHERE" pageurl="http://yoururlhere.com" defaultagent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) " defaultagent+="Gecko/20060909 Firefox/1.5.0.7" def isspam(comment,author,ipaddress, agent=defaultagent, apikey=defaultkey): try: valid = akismet.verify_key(apikey,pageurl) if valid: return akismet.comment_check(apikey,pageurl, ipaddress,agent,comment_content=comment, comment_author_email=author,comment_type="comment") else: print 'Invalid key' return False except akismet.AkismetError, e: print e.response, e.statuscode return False You now have a method you can call with any string to see if it is similar to those in blog comments. Try it in your Python session: >>> import akismettest >>> msg='Make money fast! Online Casino!' >>> akismettest.isspam(msg,'spammer@spam.com','127.0.0.1') True Experiment with different usernames, agents, and IP addresses to see how the results vary. Because Akismet is primarily used for spam comments posted on blogs, it may not perform well on other types of documents, such as email messages. Also, unlike the classifier, it doesn't allow any tweaking of parameters, nor does it let you see the calculations it uses to come up with the answer. It is, however, very accurate for spam-comment filtering, and it's worth trying on your applications if you are receiving a similar kind of spam because Akismet has a far larger collection of documents for comparison than you are likely to have gathered. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Alternative Methods Content preview·Buy reprint rights for this chapter Both of the classifiers built in this chapter are examples of supervised learning methods, methods that are trained with correct results and gradually get better at making predictions. The artificial neural network described in for weighting search results for ranking purposes was another example of supervised learning. That neural network can be adapted to work on the same problems in this chapter by using the features as inputs and having outputs representing each of the possible classifications. Likewise, support vector machines, which are described in , can be applied to the problems in this chapter. The reason Bayesian classifiers are often used for document classification is that they require far less computing power than other methods do. An email message might have hundreds or even thousands of words in it, and simply updating the counts takes vastly less memory and processor cycles than training a neural network of that size does; as shown, it can be done entirely within a database. Depending on the speed required for training and querying, and on the environment in which it is run, a neural network may be a viable alternative. The complexity of a neural network also brings with it a lack of interpretability; in this chapter you were able to look at the word probabilities and see exactly how much they contribute to the final score, while the connection strengths between the neurons in a network has no equally simple interpretation. On the other hand, neural networks and support-vector machines have one big advantage over the classifiers presented in this chapter: they can capture more complex relationships between the input features. In a Bayesian classifier, every feature has a probability for each category, and you combine the probabilities to get an overall likelihood. In a neural network, the probability of a feature can change depending on the presence or absence of other features. It may be that you're trying to block online-casino spam but you're also interested in horse betting, in which case the word "casino" is bad unless the word "horse" is somewhere else in the email message. Naïve Bayesian classifiers cannot capture this interdependence, and neural networks can. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Exercises Content preview·Buy reprint rights for this chapter 1. Varying assumed probabilities. Change the classifier class so it supports different assumed probabilities for different features. Change the init method so that it will take another classifier and start with a better guess than 0.5 for the assumed probabilities. 2. Calculate Pr(Document). In the naïve Bayesian classifier, the calculation of Pr(Document) was skipped since it wasn't required to compare the probabilities. In cases where the features are independent, it can actually be used to calculate the overall probability. How would you calculate Pr(Document)? 3. A POP-3 email filter. Python comes with a library called poplib for downloading email messages. Write a script that downloads email messages from a server and attempts to classify them. What are the different properties of an email message, and how might you build a feature-extraction function to take advantage of these? 4. Arbitrary phrase length. This chapter showed you how to extract word pairs as well as individual words. Make the feature extraction configurable to extract up to a specified number of words as a single feature. 5. Preserving IP addresses. IP addresses, phone numbers, and other numerical information can be helpful in identifying spam. Modify the feature-extraction function to return these items as features. (IP addresses have periods embedded in them, but you still need to get rid of the periods between sentences.) 6. Other virtual features. There are many virtual features like UPPERCASE that can be useful in classifying documents. Documents of excessive length or with a preponderance of long words may also be clues. Implement these as features. Can you think of any others? 7. Neural network classifier. Modify the neural network from to be used for document classification. How do its results compare? Write a program that classifies and trains on documents thousands of times. Time how long it takes with each of the algorithms. How do they compare? Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 7: Modeling with Decision Trees Content preview·Buy reprint rights for this chapter You've now seen a few different automatic classifiers, and this chapter will expand on them by introducing a very useful method called decision tree learning. Unlike most other classifiers, the models produced by decision trees are easy to interpret—the list of numbers in a Bayesian classifier will tell you how important each word is, but you really have to do the calculation to know what the outcome will be. A neural network is even more difficult to interpret, since the weight of the connection between two neurons has very little meaning on its own. You can understand the reasoning process of a decision tree just by looking at it, and you can even convert it to a simple series of if-then statements. This chapter will cover three different examples that employ decision trees. The first shows how to predict which of a site's users are likely to pay for premium access. Many online applications that are priced by subscription or on a per-use basis offer users a way to try the applications before spending money. In the case of subscriptions, the sites usually offer a time-limited free trial or a feature-limited free version. Sites that employ per-use pricing may offer a free session or similar incentive. The other examples, covered later in the chapter, will use decision trees to model housing prices and "hotness." Sometimes when a high-traffic site links to a new application that offers free accounts and subscription accounts, the application will get thousands of new users. Many of these users are driven by curiosity and are not really looking for that particular type of application, so there is a very low likelihood that they will become paying customers. This makes it difficult to distinguish and follow up with likely customers, so many sites resort to mass-emailing everyone who has signed up, rather than using a more targeted approach. To help with this problem, it would be useful to be able to predict the likelihood that a user will become a paying customer. You know by now that you can use a Bayesian classifier or neural network to do this. However, clarity is very important in this case—if you know the factors that indicate a user will become a customer, you can use that information to guide an advertising strategy, to make certain aspects of the site more accessible, or to use other strategies that will help increase the number of paying customers. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Predicting Signups Content preview·Buy reprint rights for this chapter Sometimes when a high-traffic site links to a new application that offers free accounts and subscription accounts, the application will get thousands of new users. Many of these users are driven by curiosity and are not really looking for that particular type of application, so there is a very low likelihood that they will become paying customers. This makes it difficult to distinguish and follow up with likely customers, so many sites resort to mass-emailing everyone who has signed up, rather than using a more targeted approach. To help with this problem, it would be useful to be able to predict the likelihood that a user will become a paying customer. You know by now that you can use a Bayesian classifier or neural network to do this. However, clarity is very important in this case—if you know the factors that indicate a user will become a customer, you can use that information to guide an advertising strategy, to make certain aspects of the site more accessible, or to use other strategies that will help increase the number of paying customers. For this example, imagine an online application that offers a free trial. Users sign up for the trial and use the site for a certain number of days, after which they can choose to upgrade to a basic or premium service. As users sign up for the free trial, information about them is collected, and at the end of the trial, the site owners note which users chose to become paying customers. To minimize annoyance for users and sign them up as quickly as possible, the site doesn't ask them a lot of questions about themselves—instead, it collects information from the server logs, such as the site that referred them, their geographical location, how many pages they viewed before signing up, and so on. If you collect the data and put it in a table, it might look like . Table : User behavior and final purchase decision for a web site Referrer Location Read FAQ Pages viewed Service chosen Slashdot USA Yes 18 None Google France Yes 23 Premium Digg USA Yes 24 Basic Kiwitobes France Yes 23 Basic Google UK No 21 Premium Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Introducing Decision Trees Content preview·Buy reprint rights for this chapter Decision trees are one of the simpler machine-learning methods. They are a completely transparent method of classifying observations, which, after training, look like a series of if-then statements arranged into a tree. shows an example of a decision tree for classifying fruit. Figure : Example decision tree Once you have a decision tree, it's quite easy to see how it makes all of its decisions. Just follow the path down the tree that answers each question correctly and you'll eventually arrive at an answer. Tracing back from the node where you ended up gives a rationale for the final classification. This chapter will look at a way to represent a decision tree, at code for constructing the tree from real data, and at code for classifying new observations. The first step is to create a representation of a tree. Create a new class called decisionnode, which represents each node in the tree: class decisionnode: def __init_ _(self,col=-1,value=None,results=None,tb=None,fb=None): self.col=col self.value=value self.results=results self.tb=tb self.fb=fb Each node has five instance variables, all of which may be set in the initializer: * col is the column index of the criteria to be tested. * value is the value that the column must match to get a true result. * tb and fb are decisionnodes, which are the next nodes in the tree if the result is true or false, respectively. * results stores a dictionary of results for this branch. This is None for everything except endpoints. The functions that create a tree return the root node, which can be traversed by following its True or False branches until a branch with results is reached. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Training the Tree Content preview·Buy reprint rights for this chapter This chapter uses an algorithm called CART (Classification and Regression Trees). To build the decision tree, the algorithm first creates a root node. By considering all the observations in the table, it chooses the best variable to divide up the data. To do this, it looks at all the different variables and decides which condition (for example, "Did the user read the FAQ?") would separate the outcomes (which service the user signed up for) in a way that makes it easier to guess what the user will do. divideset is a function that divides the rows into two sets based on the data in a specific column. This function takes a list of rows, a column number, and a value to divide into the column. In the case of Read FAQ, the possible values are Yes or No, and for Referrer, there are several possibilities. It then returns two lists of rows: the first containing the rows where the data in the specified column matches the value, and the second containing the rows where it does not. # Divides a set on a specific column. Can handle numeric # or nominal values def divideset(rows,column,value): # Make a function that tells us if a row is in # the first group (true) or the second group (false) split_function=None if isinstance(value,int) or isinstance(value,float): split_function=lambda row:row[column]>=value else: split_function=lambda row:row[column]==value # Divide the rows into two sets and return them set1=[row for row in rows if split_function(row)] set2=[row for row in rows if not split_function(row)] return (set1,set2) The code creates a function to divide the data called split_function, which depends on knowing if the data is numerical or not. If it is, the true criterion is that the value in this column is greater than value. If the data is not numeric, split_function simply determines whether the column's value is the same as value. It uses this function to divide the data into two sets, one where split_function returns true and one where it returns false. Start a Python session and try dividing the results by the Read FAQ column: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Choosing the Best Split Content preview·Buy reprint rights for this chapter Our casual observation that the chosen variable isn't very good may be accurate, but to choose which variable to use in a software solution, you need a way to measure how mixed a set is. What you want to do is find the variable that creates the two sets with the least possible mixing. The first function you'll need is one to get the counts of each result in a set. Add this to treepredict.py: # Create counts of possible results (the last column of # each row is the result) def uniquecounts(rows): results={} for row in rows: # The result is the last column r=row[len(row)-1] if r not in results: results[r]=0 results[r]+=1 return results uniquecounts finds all the different possible outcomes and returns them as a dictionary of how many times they each appear. This is used by the other functions to calculate how mixed a set is. There are a few different metrics for measuring this, and two will be considered here: Gini impurity and entropy. Gini impurity is the expected error rate if one of the results from a set is randomly applied to one of the items in the set. If every item in the set is in the same category, the guess will always be correct, so the error rate is 0. If there are four possible results evenly divided in the group, there's a 75 percent chance that the guess would be incorrect, so the error rate is 0.75. The function for Gini impurity looks like this: # Probability that a randomly placed item will # be in the wrong category def giniimpurity(rows): total=len(rows) counts=uniquecounts(rows) imp=0 for k1 in counts: p1=float(counts[k1])/total for k2 in counts: if k1==k2: continue p2=float(counts[k2])/total imp+=p1*p2 return imp This function calculates the probability of each possible outcome by dividing the number of times that outcome occurs by the total number of rows in the set. It then adds up the products of all these probabilities. This gives the overall chance that a row would be randomly assigned to the wrong outcome. The higher this probability, the worse the split. A probability of zero is great because it tells you that everything is already in the right set. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Recursive Tree Building Content preview·Buy reprint rights for this chapter To see how good an attribute is, the algorithm first calculates the entropy of the whole group. Then it tries dividing up the group by the possible values of each attribute and calculates the entropy of the two new groups. To determine which attribute is the best to divide on, the information gain is calculated. Information gain is the difference between the current entropy and the weighted-average entropy of the two new groups. The algorithm calculates the information gain for every attribute and chooses the one with the highest information gain. After the condition for the root node has been decided, the algorithm creates two branches corresponding to true or false for that condition, as shown in . Figure : Decision tree after a single split The observations are divided into those that meet the condition and those that don't. For each branch, the algorithm then determines if the branch can be divided further or if it has reached a solid conclusion. If one of the new branches can be divided, the same method as above is used to determine which variable to use. The second division is shown in . Figure : Decision tree after two splits The branches keep dividing, creating a tree by calculating the best attribute for each new node. A branch stops dividing when the information gain from splitting a node is not more than zero. Create a new function called buildtree in treepredict.py. This is a recursive function that builds the tree by choosing the best dividing criteria for the current set: def buildtree(rows,scoref=entropy): if len(rows)==0: return decisionnode( ) current_score=scoref(rows) # Set up some variables to track the best criteria best_gain=0.0 best_criteria=None best_sets=None column_count=len(rows[0])-1 for col in range(0,column_count): # Generate the list of different values in # this column column_values={} for row in rows: column_values[row[col]]=1 # Now try dividing the rows up for each value # in this column for value in column_values.keys( ): (set1,set2)=divideset(rows,col,value) # Information gain p=float(len(set1))/len(rows) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) if gain>best_gain and len(set1)>0 and len(set2)>0: best_gain=gain best_criteria=(col,value) best_sets=(set1,set2) # Create the subbranches if best_gain>0: trueBranch=buildtree(best_sets[0]) falseBranch=buildtree(best_sets[1]) return decisionnode(col=best_criteria[0],value=best_criteria[1], tb=trueBranch,fb=falseBranch) else: return decisionnode(results=uniquecounts(rows)) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Displaying the Tree Content preview·Buy reprint rights for this chapter So now that you have a tree, what should you do with it? Well, one thing you'll definitely want to do is look at it. printtree is a simple function for displaying the tree in plain text. The output isn't pretty, but it's a simple way to view small trees: def printtree(tree,indent=''): # Is this a leaf node? if tree.results!=None: print str(tree.results) else: # Print the criteria print str(tree.col)+':'+str(tree.value)+'? ' # Print the branches print indent+'T->', printtree(tree.tb,indent+' ') print indent+'F->', printtree(tree.fb,indent+' ') This is another recursive function. It takes a tree returned by buildtree and traverses down it, and it knows it has reached the end of a branch when it reaches the node with results. Until it reaches that point, it prints the criteria for the True and False branches and calls printtree on each of them, each time increasing the indent string. Call this function with the tree you just built, and you'll get something like this: >>> reload(treepredict) >>> treepredict.printtree(tree) 0:google? T-> 3:21? T-> {'Premium': 3} F-> 2:yes? T-> {'Basic': 1} F-> {'None': 1} F-> 0:slashdot? T-> {'None': 3} F-> 2:yes? T-> {'Basic': 4} F-> 3:21? T-> {'Basic': 1} F-> {'None': 3} This is a visual representation of the process that the decision tree will go through when trying to make a new classification. The condition on the root node is "is Google in column 0?" If this condition is met, it proceeds to the T-> branch and finds that anyone referred from Google will become a paid subscriber if they have viewed 21 pages or more. If the condition is not met, it jumps to the F-> branch and evaluates the condition "is Slashdot in column 0?" This continues until it reaches a branch that has a result. As mentioned earlier, the ability to view the logic behind the reasoning process is one of the big advantages of decision trees. The textual display of the tree is fine for small trees, but as they get larger, visually tracking your way through the tree can be quite difficult. Here you'll see how to make a graphical representation of the tree that will be useful for viewing trees you'll build in later sections. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Classifying New Observations Content preview·Buy reprint rights for this chapter Now you'll need a function that takes a new observation and classifies it according to the decision tree. Add this function to treepredict.py: def classify(observation,tree): if tree.results!=None: return tree.results else: v=observation[tree.col] branch=None if isinstance(v,int) or isinstance(v,float): if v>=tree.value: branch=tree.tb else: branch=tree.fb else: if v==tree.value: branch=tree.tb else: branch=tree.fb return classify(observation,branch) This function traverses the tree in much the same manner as printtree. After each call, it checks to see if it has reached the end of this branch by looking for results. If not, it evaluates the observation to see if the column matches the value. If it does, it calls classify again on the True branch; if not, it calls classify on the False branch. Now you can call classify to get the prediction for a new observation: >>> reload(treepredict) >>> treepredict.classify(['(direct)','USA','yes',5],tree) {'Basic': 4} You now have functions for creating a decision tree from any dataset, for displaying and interpreting the tree, and for classifying new results. These functions can be applied to any dataset that consists of multiple rows, each containing a set of observations and an outcome. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Pruning the Tree Content preview·Buy reprint rights for this chapter One problem with training the tree using the methods described so far is that it can become overfitted—that is, it can become too specific to the training data. An overfitted tree may give an answer as being more certain than it really is by creating branches that decrease entropy slightly for the training set, but whose conditions are actually completely arbitrary. Because decision trees are so easy to interpret, they are among the most widely used data-mining methods in business analysis, medical decision-making, and policy-making. Often, a decision tree is created automatically, and an expert uses it to understand the key factors and then refines it to better match her beliefs. This process allows machines to assist experts and to clearly show the reasoning process so that individuals can judge the quality of the prediction. Decision trees have been used in this manner for such wide-ranging applications as customer profiling, financial risk analysis, assisted diagnosis, and traffic prediction. Since the algorithm above continually splits the branches until it can't reduce the entropy any further, one possibility is to stop splitting when the entropy is not reduced by a minimum amount. This strategy is employed frequently, but it suffers from a minor drawback—it is possible to have a dataset where the entropy is not reduced much by one split but is reduced greatly by subsequent splits. An alternative strategy is to build the entire tree as described earlier, and then try to eliminate superfluous nodes. This process is known as pruning. Pruning involves checking pairs of nodes that have a common parent to see if merging them would increase the entropy by less than a specified threshold. If so, the leaves are merged into a single node with all the possible outcomes. This helps avoid overfitting and stops the tree from making predictions that are more confident than what can really be gleaned from the data. Add a new function to treepredict.py for pruning the tree: def prune(tree,mingain): # If the branches aren't leaves, then prune them if tree.tb.results==None: prune(tree.tb,mingain) if tree.fb.results==None: prune(tree.fb,mingain) # If both the subbranches are now leaves, see if they # should merged if tree.tb.results!=None and tree.fb.results!=None: # Build a combined dataset tb,fb=[],[] for v,c in tree.tb.results.items( ): tb+=[[v]]*c for v,c in tree.fb.results.items( ): fb+=[[v]]*c # Test the reduction in entropy delta=entropy(tb+fb)-(entropy(tb)+entropy(fb)/2) if delta=tree.value: branch=tree.tb else: branch=tree.fb else: if v==tree.value: branch=tree.tb else: branch=tree.fb return mdclassify(observation,branch) The only difference is at the end where, if the important piece of data is missing, the results for each branch are calculated and then combined with their respective weightings. Try out mdclassify on a row with a crucial piece of information missing and see how your results look: >>> reload(treepredict) >>> Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Dealing with Numerical Outcomes Content preview·Buy reprint rights for this chapter The user behavior example and the fruit tree were both classification problems, since the outcomes were categories rather than numbers. The remaining examples in this chapter, home prices and hotness, are both problems with numerical outcomes. While it's possible to run buildtree on a dataset with numbers as outcomes, the result probably won't be very good. If all the numbers are treated as different categories, the algorithm won't take into account the fact that some numbers are close together and others are far apart; they will all be treated as completely separate. To deal with this, when you have a tree with numerical outcomes, you can use variance as a scoring function instead of entropy or Gini impurity. Add the variance function to treepredict.py: def variance(rows): if len(rows)==0: return 0 data=[float(row[len(row)-1]) for row in rows] mean=sum(data)/len(data) variance=sum([(d-mean)**2 for d in data])/len(data) return variance This function is a possible parameter for buildtree, and it calculates the statistical variance for a set of rows. A low variance means that the numbers are all very close together, and a high variance means that they are widely dispersed. When building a tree using variance as the scoring function, node criteria will be picked that split the numbers so that higher values are on one side and lower values are on the other. Splitting the data this way reduces the overall variance on the branches. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Modeling Home Prices Content preview·Buy reprint rights for this chapter There are many potential uses for decision trees, but they are most useful when there are several possible variables and you're interested in the reasoning process. In some cases, you already know the outcomes, and the interesting part is modeling the outcomes to understand why they are as they are. One area in which this is potentially very interesting is understanding prices of goods, particularly those that have a lot of variability in measurable ways. This section will look at building decision trees for modeling real estate prices, because houses vary greatly in price and have many numerical and nominal variables that are easily measured. Zillow is a free web service that tracks real estate prices and uses this information to create price estimates for other houses. It works by looking at comps (similar houses) and using their values to predict a new value, which is similar to what real estate appraisers do. A section of a Zillow web page showing information about a house and its estimate value is shown in . Figure : Screenshot from zillow.com Fortunately, Zillow also has an API that lets you get details and the estimated value of houses. The page for the Zillow API is . You'll need to get a developer key to access the API, which is free and available from the web site. The API itself is quite simple—it involves requesting a URL with all your search parameters in the query, and then parsing the returned XML to get details like number of bedrooms and estimated price. Create a new file called zillow.py and add the following code: import xml.dom.minidom import urllib2 zwskey="X1-ZWz1chwxis15aj_9skq6" As you did in , you're going to use the minidom API to parse XML results of your queries. The function getaddressdata takes an address and a city, and constructs the URL to query Zillow for property information. It parses the results and extracts the important information, which it returns as a tuple of results. Add this function to zillow.py: def getaddressdata(address,city): escad=address.replace(' ','+') # Construct the URL url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) # Parse resulting XML doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read( )) code=doc.getElementsByTagName('code')[0].firstChild.data # Code 0 means success; otherwise, there was an error if code!='0': return None # Extract the info about this property try: zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data use=doc.getElementsByTagName('useCode')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data except: return None return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Modeling "Hotness" Content preview·Buy reprint rights for this chapter Hot or Not is a site that allows users to upload photos of themselves. Its original concept was to let users rank other users on their physical appearance, and to aggregate the results to create a score between 1 and 10 for each person. It has since evolved into a dating site, and now has an open API that allows you to get demographic information about members along with their "hotness" rating. This makes it an interesting test case for a decision tree model because there is a set of input variables, an output variable, and a possibly interesting reasoning process. The site itself is also a good example of what might be considered collective intelligence. Again, you'll need to get an application key to access the API. You can sign up and get one at . The Hot or Not API works in much the same way as the other APIs that have been covered. You simply pass the parameters of a query to a URL and parse the XML that is returned. To get started, create a new file called hotornot.py and add the import statements and your key definition: import urllib2 import xml.dom.minidom api_key="479NUNJHETN" Next, get a list of random people to make up the dataset. Fortunately, Hot or Not provides an API call that returns a list of people with specified criteria. In this example, the only criteria will be that the people have "meet me" profiles, since only from these profiles can you get other information like location and interests. Add this function to hotornot.py: def getrandomratings(c): # Construct URL for getRandomProfile url="http://services.hotornot.com/rest/?app_key=%s" % api_key url+="&method=Rate.getRandomProfile&retrieve_num=%d" % c url+="&get_rate_info=true&meet_users_only=true" f1=urllib2.urlopen(url).read( ) doc=xml.dom.minidom.parseString(f1) emids=doc.getElementsByTagName('emid') ratings=doc.getElementsByTagName('rating') # Combine the emids and ratings together into a list result=[] for e,r in zip(emids,ratings): if r.firstChild!=None: result.append((e.firstChild.data,r.firstChild.data)) return result Once you've generated a list of user IDs and ratings, you'll need another function to download information about people—in this case, gender, age, location, and keywords. Having all 50 states as possible location variables will lead to too many branching possibilities. In order to reduce the number of possibilities for location, you can divide the states into regions. Add the following code to specify regions: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! When to Use Decision Trees Content preview·Buy reprint rights for this chapter Probably the biggest advantage of decision trees is how easy it is to interpret a trained model. After running the algorithm on our example problem, we not only end up with a tree that can make predictions about new users, we also get the list of questions used to make those determinations. From this you can see that, for instance, users who find the site through Slashdot never become paid subscribers, but users who find the site through Google and view at least 20 pages are likely to become premium subscribers. This, in turn, might allow you to alter your advertising strategy to target sites that give you the highest quality traffic. We also learn that certain variables, such as the user's country of origin, are not important in determining the outcome. If data is difficult or expensive to collect and we learn that it is not important, we know that we can stop collecting it. Unlike some other machine-learning algorithms, decision trees can work with both categorical and numerical data as inputs. In the first example problem, we used a combination of pages viewed with several categorical inputs. Furthermore, while many algorithms require you to prepare or normalize data before you can run them, the code in this chapter will take any list of data containing category or numerical data and build the appropriate decision tree. Decision trees also allow for probabilistic assignment of data. With some problems, there is not enough information to always make a correct distinction—a decision tree may have a node that has several possibilities and can't be divided any more. The code in this chapter returns a dictionary of the counts of different outcomes, and this information can help us decide how confident we are in the results. Not all algorithms can estimate the probability of an uncertain result. However, there are definitely drawbacks to the decision tree algorithm used here. While it can be very effective for problems with only a few possible results, it can't be used effectively on datasets with many possibilities. In the first example, the only outcomes are none, basic, and premium. If there were hundreds of outcomes instead, the decision tree would grow very complicated and would probably make poor predictions. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Exercises Content preview·Buy reprint rights for this chapter 1. Result probabilities. Currently, the classify and mdclassify functions give their results as total counts. Modify them to give the probabilities of the results being one of the categories. 2. Missing data ranges. mdclassify allows the use of "None" to specify a missing value. For numerical values the result may not be completely unknown, but may be known to be in a range. Modify mdclassify to allow a tuple such as (20,25) in place of a value and traverse down both branches when necessary. 3. Early stopping. Rather than pruning the tree, buildtree can just stop dividing when it reaches a point where the entropy is not reduced enough. This may not be ideal in some cases, but it does save an extra step. Modify buildtree to take a minimum gain parameter and stop dividing the branch if this condition is not met. 4. Building with missing data. You built a function that can classify a row with missing data, but what if there is missing data in the training set? Modify buildtree so that it will check for missing data and, in cases where it's not possible to send a result down a particular branch, will send it down both branches. 5. Multiway splits. (Hard) All the trees built in this chapter are binary decision trees. However, some datasets might create simpler trees if they allowed a node to split into more than two branches. How would you represent this? How would you train the tree? Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 8: Building Price Models Content preview·Buy reprint rights for this chapter So far we have examined several classifiers, most of which are well suited for predicting to which category a new piece of data belongs. However, Bayesian classifiers, decision trees, and support-vector machines (which you'll see in the next chapter) are not the best algorithms for making predictions about numerical data based on many different attributes, such as prices. This chapter will look at algorithms that can be trained to make numerical predictions based on examples they have seen before, and even display probability distributions for the predictions to help the user interpret how the prediction is being made. We'll be looking at how to use these algorithms for building models that predict prices. Economists consider prices, particularly auction prices, to be a good method of using collective intelligence to determine the real value of something; in a large market with many buyers and sellers, the price will usually reach the optimal value for both sides of the transaction. Price prediction is also a good test for algorithms of this kind, since there are usually many different factors to consider when determining a price. When considering bidding on a laptop, for example, you have to take into account processor speed, installed RAM, hard drive size, screen resolution, and other factors. An important part of making numerical predictions is determining which variables are important and in what combinations. In the laptop example, there are likely to be several variables that will barely, if at all, affect the price, such as free accessories or some bundled software. Further, the screen size may have a greater effect on the final price than the hard drive size. You'll be using the optimization techniques developed in to automatically determine the best weights for the variables. A challenging dataset for testing a numerical prediction algorithm should have a few properties that make the dataset more difficult to make predictions from. If you are looking at TVs, it's easy to infer that bigger is better, and such problems can more easily be solved with traditional statistical techniques. For this reason, it's more interesting to look at a dataset where price doesn't simply increase in proportion to size or the number of characteristics. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Building a Sample Dataset Content preview·Buy reprint rights for this chapter A challenging dataset for testing a numerical prediction algorithm should have a few properties that make the dataset more difficult to make predictions from. If you are looking at TVs, it's easy to infer that bigger is better, and such problems can more easily be solved with traditional statistical techniques. For this reason, it's more interesting to look at a dataset where price doesn't simply increase in proportion to size or the number of characteristics. In this section, you'll create a dataset of wine prices based on a simple artificial model. The prices are based on a combination of the rating and the age of the wine. The model assumes that wine has a peak age, which is older for good wines and almost immediate for bad wines. A high-rated wine will start at a high price and increase in value until its peak age, and a low-rated wine will start cheap and get cheaper. To model this, create a new file called numpredict.py and add the wineprice function: from random import random,randint import math def wineprice(rating,age): peak_age=rating-50 # Calculate price based on rating price=rating/2 if age>peak_age: # Past its peak, goes bad in 5 years price=price*(5-(age-peak_age)) else: # Increases to 5x original value as it # approaches its peak price=price*(5*((age+1)/peak_age)) if price<0: price=0 return price You'll also need a function to build a dataset of wine prices. The following function generates 200 bottles of wine and calculates their prices from the model. It then randomly adds or subtracts 20 percent to capture things like taxes and local variations in prices, and also to make the numerical prediction a bit more difficult. Add wineset1 to numpredict.py: def wineset1( ): rows=[] for i in range(300): # Create a random age and rating rating=random( )*50+50 age=random( )*50 # Get reference price price=wineprice(rating,age) # Add some noise price*=(random( )*0.4+0.8) # Add to the dataset rows.append({'input':(rating,age), 'result':price}) return rows Start up a Python session, and test some wine prices and build a new dataset: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! k-Nearest Neighbors Content preview·Buy reprint rights for this chapter The easiest approach to our wine pricing problem is the same one you would use if you were trying to price something manually—that is, to find a few of the most similar items and assume the prices will be roughly the same. By finding a set of items similar to the item that interests you, the algorithm can average their prices and make a guess at what the price should be for this item. This approach is called k-nearest neighbors (kNN). The k in kNN refers to the number of items that will be averaged to get the final result. If the data were perfect, you could use k=1, meaning you would just pick the nearest neighbor and use its price as the answer. But in real-world situations, there are always aberrations. In the example, you deliberately add "noise" to simulate this (the random addition or subtraction of 20 percent). Someone might get a great deal, or an uninformed customer might drastically overpay for the nearest neighbor. For this reason, it's better to take a few neighbors and average them to reduce any noise. To visualize the problem of choosing too few neighbors, consider a problem where there's only one descriptive variable, say, age. shows a graph of price (on the y-axis) versus age (on the x-axis). Also on the graph is the line that you get if you only use a single nearest neighbor. Figure : kNN using too few neighbors Notice how the predicted price is far too dependent on random variations. If you were using the squiggly line to make a prediction, you would decide that there's a big price jump between wine that is 15 years old and wine that is 16 years old, when that's really just the result of variation in the prices of two particular bottles. On the other hand, choosing too many neighbors will reduce accuracy because the algorithm will be averaging in data from items that are not at all similar to the query. shows the same dataset; the line averages 20 of the nearest neighbors. Figure : kNN using too many neighbors It's clear that averaging too many prices greatly underestimates the prices of wine around the 25-year mark. Choosing the correct number of neighbors can be done manually for different datasets, or it can be done with optimization. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Weighted Neighbors Content preview·Buy reprint rights for this chapter One way to compensate for the fact that the algorithm may be using neighbors that are too far away is to weight them according to their distance. This is similar to the method used in , where people's preferences were weighted according to how similar they were to the preferences of a person seeking a recommendation. The more similar the items are, the smaller the distance between them, so you'll need a way of converting distances to weights. There are a few different ways of doing this, each with advantages and drawbacks. This section will look at three functions that you can use. The function you used in to convert distances to weights was an inverse function. shows what this looks like if you plot weight on one axis and price on the other. Figure : Inverse weight function The simplest form of this function returns 1 divided by the distance. However, in some cases, items are exactly the same or very close, leading to a very high or infinite weight. For this reason, it's necessary to add a small number to the distance before inverting it. Add the inverseweight function to numpredict.py: def inverseweight(dist,num=1.0,const=0.1): return num/(dist+const) This function is fast and easy to implement, and you can experiment with different values of num to see what produces good results. Its main potential drawback is that it applies very heavy weights to items that are close and falls off quickly after that. This may be desirable, but in some cases it will make the algorithm much more sensitive to noise. A second option is a subtraction function, the graph for which is shown in . Figure : Subtraction weight function This is a simple function that subtracts the distance from a constant. The weight is the result of this subtraction if the result is greater than zero; otherwise, the result is zero. Add the subtractweight function to numpredict.py: def subtractweight(dist,const=1.0): if dist>const: return 0 else: return const-dist This function overcomes the potential issue of overweighting close items, but it has its own limitation. Because the weight eventually falls to 0, it's possible that there will be nothing close enough to be considered a close neighbor, which means that for some items the algorithm won't make a prediction at all. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Cross-Validation Content preview·Buy reprint rights for this chapter Cross-validation is the name given to a set of techniques that divide up data into training sets and test sets. The training set is given to the algorithm, along with the correct answers (in this case, prices), and becomes the set used to make predictions. The algorithm is then asked to make predictions for each item in the test set. The answers it gives are compared to the correct answers, and an overall score for how well the algorithm did is calculated. Usually this procedure is performed several times, dividing the data up differently each time. Typically, the test set will be a small portion, perhaps 5 percent of the all the data, with the remaining 95 percent making up the training set. To start, create a function called dividedata in numpredict.py, which divides up the dataset into two smaller sets given a ratio that you specify: def dividedata(data,test=0.05): trainset=[] testset=[] for row in data: if random( )>> reload(numpredict) >>> data=numpredict.wineset2( ) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Optimizing the Scale Content preview·Buy reprint rights for this chapter In this case, it's not difficult to choose good parameters for scaling because you know in advance which variables are important. However, most of the time you'll be working with datasets that you didn't build yourself, and you won't necessarily know which variables are unimportant and which ones have a significant impact. In theory, you could try out a lot of numbers in different combinations until you found one that worked well enough, but there might be hundreds of variables to consider and it would be very tedious. Fortunately, if you worked through , you already know how to automatically find a good solution when there are many input variables to consider—by using optimization. You'll recall that optimization simply requires you to specify a domain that gives the number of variables, a range, and a cost function. The crossvalidate function returns a higher value for a worse solution, so it's already essentially a cost function. The only thing you need to do is wrap it so that it takes a list of values as its parameter, rescales the data, and calculates the cross-validation error. Add createcostfunction to numpredict.py: def createcostfunction(algf,data): def costf(scale): sdata=rescale(data,scale) return crossvalidate(algf,sdata,trials=10) return costf The domain is the range of weights for each dimension. In this case, the lowest possible value is 0 because negative numbers will just create a mirror image of the data, which for distance calculations doesn't change anything. In theory, the weights can be as high as you want, but for practical purposes, let's restrict them to 20 for now. Add this line to numpredict.py: weightdomain=[(0,20)]*4 You now have everything you need to automatically optimize the weights. Make sure optimization.py, the file you created in , is in your current directory and try an annealing optimization in your Python session: >>> import optimization >>> reload(numpredict) >>> costf=numpredict.createcostfunction(numpredict.knnestimate,data) >>> optimization.annealingoptimize(numpredict.weightdomain,costf,step=2) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Uneven Distributions Content preview·Buy reprint rights for this chapter So far we've been assuming that if you take an average or weighted average of the data, you'll get a pretty good estimate of the final price. In many cases this will be accurate, but in some situations there may be an unmeasured variable that can have a big effect on the outcome. Imagine that in the wine example there were buyers from two separate groups: people who bought from the liquor store, and people who bought from a discount store and received a 50 percent discount. Unfortunately, this information isn't tracked in the dataset. The createhiddendataset function creates a dataset that simulates these properties. It drops some of the complicating variables and just focuses on the original ones. Add this function to numericalpredictor.py: def wineset3( ): rows=wineset1( ) for row in rows: if random( )<0.5: # Wine was bought at a discount store row['result']*=0.6 return rows Consider what will happen if you ask for an estimate of the price of a different item using the kNN or weighted kNN algorithms. Since the dataset doesn't actually contain any information about whether the buyer bought from the liquor store or a discount store, the algorithm won't be able to take this into account, so it will bring in the nearest neighbors regardless of where the purchase was made. The result is that it will give the average of items from both groups, perhaps representing a 25 percent discount. You can verify this by trying it in your Python session: >>> reload(numpredict) >>> data=numpredict.wineset3( ) >>> numpredict.wineprice(99.0,20.0) 106.07142857142857 >>> numpredict.weightedknn(data,[99.0,20.0]) 83.475441632209339 >>> numpredict.crossvalidate(numpredict.weightedknn,data) 599.51654107008562 While this is not a bad way to make an estimate if you just want a single number, it does not accurately reflect what someone will actually end up paying for an item. In order to get beyond averages, you need a way to look closer at the data at that point. Rather than taking the weighted average of the neighbors and getting a single price estimate, it might be interesting in this case to know the probability that an item falls within a certain price range. In the example, given inputs of 99 percent and 20 years, you'd like a function that tells you there's a 50 percent chance that the price is between $40 and $80, and a 50 percent chance that it's between $80 and $100. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Using Real Data—the eBay API Content preview·Buy reprint rights for this chapter eBay is an online auction site and one of the most popular sites on the Internet. It has millions of listings and millions of users bidding and jointly setting prices, making it a great example of collective intelligence. As it happens, eBay also has a free XML-based API that you can use to perform searches, get detailed item information, and even post items for sale. In this section, you'll see how to use the eBay API to get price data and convert the data so that the algorithms in this chapter can be used for prediction. The process of accessing eBay's API takes several steps, but it's relatively simple and automatic. A good overview of the process is in the Quick Start Guide, which is online at . This guide will take you through the process of creating a developer account, getting your production keys, and creating a token. When you're finished, you should have four strings that will be needed for the example in this chapter: * A developer key * An application key * A certificate key * An authentication token, which is very long Create a new file called ebaypredict.py and add the following code, which imports some modules and includes the abovementioned strings: import httplib from xml.dom.minidom import parse, parseString, Node devKey = 'developerkey' appKey = 'applicationkey' certKey = 'certificatekey' userToken = 'token' serverUrl = 'api.ebay.com' There is no official Python API for eBay, but there is an XML API that you can access using httplib and minidom. This section will only cover two calls to this API, GetSearchResults and GetItem, but much of the code given here can be reused for other calls, too. For more information on all the calls supported by the API, you can look at the full documentation at . Once you have your keys, it's time to set up a connection to the eBay API. The API requires that you pass a lot of headers, which include the keys and the call you're going to make. To do this, create a function called getHeaders to take the call name and return a dictionary of headers that can be passed to httplib. Add this function to ebaypredict.py: def getHeaders(apicall,siteID="0",compatabilityLevel = "433"): headers = {"X-EBAY-API-COMPATIBILITY-LEVEL": compatabilityLevel, "X-EBAY-API-DEV-NAME": devKey, "X-EBAY-API-APP-NAME": appKey, "X-EBAY-API-CERT-NAME": certKey, "X-EBAY-API-CALL-NAME": apicall, "X-EBAY-API-SITEID": siteID, "Content-Type": "text/xml"} return headers Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! When to Use k-Nearest Neighbors Content preview·Buy reprint rights for this chapter The k-nearest neighbors method has a few disadvantages. Making predictions is very computationally intensive because the distance to every point has to be calculated. Furthermore, in a dataset with many variables, it can be difficult to determine the appropriate weights or whether some variables should be eliminated. Optimization can help with this, but it can take a very long time to find a good solution with big datasets. Still, as you've seen in this chapter, kNN offers a number of advantages over other methods. The flip side to the computational intensity of making a prediction is that new observations can be added to the data without any computational effort. It's also easy to interpret exactly what's happening because you know it's using the weighted value of other observation to make its predictions. Although determining weights can be tricky, once the best weights have been determined, you can use them to better understand the characteristics of the dataset. Finally, you can create probability functions for times when you suspect there are other unmeasured variables in the dataset. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Exercises Content preview·Buy reprint rights for this chapter 1. Optimizing the number of neighbors. Create a cost function for optimization that determines the ideal number of neighbors for a simple dataset. 2. Leave-one-out cross-validation. Leave-one-out cross-validation is an alternative method of calculating prediction error that treats every row in the dataset individually as a test set, and treats the rest of the data as a training set. Implement a function to do this. How does it compare to the method described in this chapter? 3. Eliminating variables. Rather than trying to optimize variable scales for a large set of variables that are probably useless, you could try to eliminate variables that make the prediction much worse before doing anything else. Can you think of a way to do this? 4. Varying ss for graphing probability. The ss parameter in probabilityguess dictates how smoothly the probability is graphed. What happens if this number is too high? Too low? Can you think of any way to determine what a good value will be without looking at the graph? 5. The laptop dataset. Try running the optimization on the laptop dataset from eBay. Which variables are important? Now try running the functions for graphing probability density. Are there any noticeable peaks? 6. Other item types. Which other items on eBay have suitable numerical attributes? iPods, cell phones, and cars all have a lot of interesting information. Try building another dataset for numerical prediction. 7. Search attributes. The eBay API has a lot of functionality that was not covered in this chapter. The GetSearchResults call has many options, including the ability to restrict searches to certain attributes. Modify the function to support this and try finding only Core Duo laptops. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 9: Advanced Classification: Kernel Methods and SVMs Content preview·Buy reprint rights for this chapter Previous chapters have considered several classifiers, including decision trees, Bayesian classifiers, and neural networks. This chapter will introduce the concept of linear classifiers and kernel methods as a prelude to covering one of the most advanced classifiers, and one that remains an active area of research, called support-vector machines (SVMs). The dataset used throughout much of the chapter pertains to matching people on a dating site. Given information about two people, can we predict whether they will be a good match? This is an interesting problem because there are many variables, both numerical and nominal, and many nonlinear relationships. This dataset will be used to demonstrate some of the weaknesses of the previously described classifiers, and to show how the dataset can be tweaked to work better with these algorithms. An important thing to take away from this chapter is that it's rarely possible to throw a complex dataset at an algorithm and expect it to learn how to classify things accurately. Choosing the right algorithm and preprocessing the data appropriately is often required to get good results. I hope that going through the process of tweaking this dataset will give you ideas for how to modify others in the future. At the end of the chapter, you'll learn how to build a dataset of real people from Facebook, a popular social networking site, and you'll use the algorithms to predict whether people with certain characteristics are likely to be friends. The dataset you'll use in this chapter is based on an imaginary online dating site. Most dating sites collect a lot of interesting information about their members, including demographic information, interests, and behavior. Imagine that this site collects the following information: * Age * Smoker? * Want children? * List of interests * Location Furthermore, this site collects information about whether two people have made a good match, whether they initially made contact, and if they decided to meet in person. This data is used to create the matchmaker dataset. There are two files to download: The matchmaker.csv Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Matchmaker Dataset Content preview·Buy reprint rights for this chapter The dataset you'll use in this chapter is based on an imaginary online dating site. Most dating sites collect a lot of interesting information about their members, including demographic information, interests, and behavior. Imagine that this site collects the following information: * Age * Smoker? * Want children? * List of interests * Location Furthermore, this site collects information about whether two people have made a good match, whether they initially made contact, and if they decided to meet in person. This data is used to create the matchmaker dataset. There are two files to download: The matchmaker.csv file looks like this: 39,yes,no,skiing:knitting:dancing,220 W 42nd St New York NY,43,no,yes,soccer:reading:scrabble,824 3rd Ave New York NY,0 23,no,no,football:fashion,102 1st Ave New York NY,30,no,no,snowboarding:knitting:computers:shopping:tv:travel, 151 W 34th St New York NY,1 50,no,no,fashion:opera:tv:travel,686 Avenue of the Americas New York NY,49,yes,yes,soccer:fashion:photography:computers: camping:movies:tv,824 3rd Ave New York NY,0 Each row has information about a man and a woman and, in the final column, a 1 or a 0 to indicate whether or not they are considered a good match. (Your author is aware of the simplifying assumption made here; computer models are never quite as complicated as real life.) For a site with a large number of profiles, this information might be used to build a predictive algorithm that assists users in finding other people who are likely to be good matches. It might also indicate particular types of people that the site is lacking, which would be useful in strategies for promoting the site to new members. The agesonly.csv file has match information based only on age, which we will use to illustrate how the classifiers work, since two variables are much easier to visualize. The first step is to build a function for loading this dataset. This is just a matter of reading all the fields into a list, but for experimental purposes, the function will have one optional parameter to load only in certain fields. Create a new file called advancedclassify.py Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Difficulties with the Data Content preview·Buy reprint rights for this chapter Two interesting aspects of this dataset are the nonlinearity and the interplay of the variables. If you installed matplotlib () in you can visualize some of the variables using advancedclassify and generating a couple of lists from it. (This step is not necessary to work through the rest of the chapter.) Try this in your Python session: from pylab import * def plotagematches(rows): xdm,ydm=[r.data[0] for r in rows if r.match==1],\ [r.data[1] for r in rows if r.match==1] xdn,ydn=[r.data[0] for r in rows if r.match==0],\ [r.data[1] for r in rows if r.match==0] plot(xdm,ydm,'go') plot(xdn,ydn,'ro') show( ) Call this method from your Python session: >>> reload(advancedclassify) >>> advancedclassify.plotagematches(agesonly) This will generate a scatter plot of the man's age versus the woman's age. The points will be O if the people are a match and X if they are not. You'll get a window like the one shown in . Figure : Generated age-age scatter plot Although there are obviously many other factors that determine whether two people are a match, this figure is based on the simplified age-only dataset, and it shows an obvious boundary that indicates people do not go far outside their own age range. The boundary also appears to curve and become less defined as people get older, showing that people will tolerate greater age differences as they get older. covered decision tree classifiers, which try to automatically classify data using a tree. The decision tree algorithm described there will split the data based on numerical boundaries. This presents a problem when the dividing line can be expressed more accurately as a function of two variables. In this case, the difference between the ages is a much better variable for prediction. You can imagine that training a decision directly on the data would give you a result like the one shown in . Figure : Decision tree for curved boundary This is obviously quite useless for interpretation. It may work for automatic classification, but it is very messy and rigid. Had the other variables besides age been considered, the result would have been even more confusing. To understand what the decision tree is doing, consider the scatter plot and the Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Basic Linear Classification Content preview·Buy reprint rights for this chapter This is one of the simplest classifiers to construct, but it's a good basis for further work. It works by finding the average of all the data in each class and constructing a point that represents the center of the class. It can then classify new points by determining to which center point they are closest. To do this, you'll first need a function that calculates the average point in the classes. In this case, the classes are just 0 and 1. Add lineartrain to advancedclassify.py: def lineartrain(rows): averages={} counts={} for row in rows: # Get the class of this point cl=row.match averages.setdefault(cl,[0.0]*(len(row.data))) counts.setdefault(cl,0) # Add this point to the averages for i in range(len(row.data)): averages[cl][i]+=float(row.data[i]) # Keep track of how many points in each class counts[cl]+=1 # Divide sums by counts to get the averages for cl,avg in averages.items( ): for i in range(len(avg)): avg[i]/=counts[cl] return averages You can run this function in your Python session to get the averages: >>> reload(advancedclassify) >>> avgs=advancedclassify.lineartrain(agesonly) To see why this is useful, consider again the plot of the age data, shown in . Figure : Linear classifier using averages The Xs in the figure represent the average points as calculated by lineartrain. The line dividing the data is halfway between the two Xs. This means that all the points on the left side of the line are closer to the "no match" average point, and all the points on the right side are closer to the "match" average. Whenever you have a new pair of ages and want to guess if two people will be a match, you can just imagine the point on this chart and see to which average it's closer. There are a couple of ways to determine the closeness of a new point. You learned about Euclidean distance in previous chapters; one approach would be to calculate the distance from the point to the averages for each of the classes, and choose the one with the smaller distance. While this approach will work for this classifier, in order to extend it later, you'll need to take a different approach using Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Categorical Features Content preview·Buy reprint rights for this chapter The matchmaker dataset contains numerical data and categorical data. Some classifiers, like the decision tree, can handle both types without any preprocessing, but the classifiers in the remainder of this chapter work only with numerical data. To handle this, you'll need a way to turn data into numbers so that it will be useful to the classifier. The simplest thing to convert to a number is a yes/no question because you can turn a "yes" into 1 and a "no" into −1. This also leaves the option of converting missing or ambiguous data (such as "I don't know") to 0. Add the yesno function to advancedclassify.py to do this conversion for you: def yesno(v): if v=='yes': return 1 elif v=='no': return −1 else: return 0 There are a couple of different ways you can record people's interests in the dataset. The simplest is to treat every possible interest as a separate numerical variable, and assign a 0 if the person has that interest and a 1 if he doesn't. If you are dealing with individual people, that is the best approach. In this case, however, you have pairs of people, so a more intuitive approach is to use the number of common interests as a variable. Add a new function called matchcount to advancedclassify.py, which returns the number of matching items in a list as a float: def matchcount(interest1,interest2): l1=interest1.split(':') l2=interest2.split(':') x=0 for v in l1: if v in l2: x+=1 return x The number of common interests is an interesting variable, but it definitely eliminates some potentially useful information. It's possible that certain combinations of different interests work well together, such as skiing and snowboarding or drinking and dancing. A classifier that wasn't trained on the original data would never be able to learn these combinations. An alternative to creating a new variable for every interest, leading to a lot of variables and thus to much more complex classifiers, is to arrange interests in a hierarchy. You could say, for example, that skiing and snowboarding are both examples of snow sports, which is a subcategory of sports; a pair who are both interested in snow sports but not in the same one might get 0.8 added to their Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Scaling the Data Content preview·Buy reprint rights for this chapter When you were only making comparisons based on people's ages, it was fine to keep the data as it was and to use averages and distances, since it makes sense to compare variables that mean the same thing. However, now you've introduced some new variables that aren't really comparable to age, since their values are much smaller. Having differing opinions about children—a gap of 2, between 1 and −1—may be much more significant in reality than having an age gap of six years, but if you used the data as is, the age difference would count for three times as much. To resolve this issue, it's a good idea to put all the data on a common scale so that differences are comparable in every variable. You can do this by determining the lowest and highest values for every variable, and scaling the data so that the lowest value is now 0 and the highest value is now 1, with all the other values somewhere between 0 and 1. Add scaledata to advancedclassifier.py: def scaledata(rows): low=[999999999.0]*len(rows[0].data) high=[−999999999.0]*len(rows[0].data) # Find the lowest and highest values for row in rows: d=row.data for i in range(len(d)): if d[i]high[i]: high[i]=d[i] # Create a function that scales data def scaleinput(d): return [(d.data[i]−low[i])/(high[i]−low[i]) for i in range(len(low))] # Scale all the data newrows=[matchrow(scaleinput(row.data)+[row.match]) for row in rows] # Return the new data and the function return newrows,scaleinput This function defines an internal function, scaleinput, which finds the lowest value and subtracts that amount from all the values to bring the range to a 0 starting point. It then divides the values by the difference between the lowest and highest values to convert them all to values between 0 and 1. The function applies scaleinput to every row in the dataset and returns the new dataset along with the function so you can also scale your queries. Now you can try the linear classifier on a bigger set of variables: >>> reload(advancedclassify) >>> Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Understanding Kernel Methods Content preview·Buy reprint rights for this chapter Consider what would happen if you tried to use the linear classifier on a dataset similar to the one in . Figure : A class encircling another class Where would the average points be for each class? They would both be in exactly the same place! Even though it's clear to you and me that anything inside the circle is an X and everything outside the circle is an O, the linear classifier is unable to distinguish these two classes. But consider what happens if you square every x and y value first. A point that was at (−1,2) would now be at (1,4), a point that was at (0.5,1) would now be at (0.25,1), and so on. The new plot would look like . Figure : Moving the points into a different space All the Xs have moved into the corner and all the Os are outside that corner. It's now very easy to divide the Xs and Os with a straight line, and any time a new piece of data has to be classified, you can just square its x and y values and see on which side of the line it falls. This example shows that by transforming the points first, it's possible to create a new dataset that can be divided with a straight line. However, this example was chosen precisely because it can be transformed very easily; in real problems, the transformation will likely be a lot more complicated and will involve transforming the data into more dimensions. For example, you might take a dataset of x and y coordinates and create a new dataset with a, b, and c coordinates where a=x^2, b=x*y, and c=y^2. Once the data has been put into more dimensions, it's easier to find the dividing line between two classes. While you could write code to transform the data into a new space like this, it isn't usually done in practice because finding a dividing line when working with real datasets can require casting the data into hundreds or thousands of dimensions, and this is quite impractical to implement. However, with any algorithm that uses dot-products—including the linear classifier—you can use a technique called the kernel trick. The kernel trick involves replacing the dot-product function with a new function that returns what the dot-product would have been if the data had first been transformed to a higher dimensional space using some mapping function. There is no limit to the number of possible transformations, but only a few are actually used in practice. The one that is usually recommended (and the one that you'll use here) is called the Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Support-Vector Machines Content preview·Buy reprint rights for this chapter Consider again the challenge of finding a straight line that divides two classes. illustrates an example. The averages for each class are shown in the figure, along with the dividing line that they imply. Figure : Linear average classifier misclassifies points Notice that the dividing line calculated by using the averages misclassifies two of the points because they are much closer to the line than the majority of the data. The problem is that since most of the data is far away from the line, it's not relevant for including in the dividing line. Support-vector machines are a well known set of methods for creating classifiers that solve this problem. They do this by trying to find the line that is as far away as possible from each of the classes. This line is called the maximum-margin hyperplane, and is shown in . Figure : Finding the best dividing line The dividing line has been chosen so that the parallel lines that touch the items from each class are as far from it as possible. Again, you can determine into which class a new data point fits simply by seeing which side of the line it is on. Notice that only the points at the margin are needed to determine the placement of the dividing line; you could get rid of all the other data and the line would be in the same place. The points near the line are called the support vectors. The algorithm that finds the support vectors and uses them to find the dividing line is the support-vector machine. You have already seen how a linear classifier can be turned into a nonlinear classifier by using the kernel trick, as long as it uses dot-products for comparisons. Support-vector machines also use dot-products so they can be used with kernels to perform nonlinear classification. Because support-vector machines work well with high-dimensional datasets, they are most often applied to data-intensive scientific problems and other problems that deal with very complex sets of data. Some examples include: * Classifying facial expressions * Detecting intruders using military datasets * Predicting the structure of proteins from their sequences Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Using LIBSVM Content preview·Buy reprint rights for this chapter The explanation in the previous section should help you understand how and why support-vector machines work, but the algorithm for training a support-vector machine involves mathematical concepts that are very computationally intensive and are beyond the scope of this chapter. For these reasons, this section will introduce an open-source library called LIBSVM, which can train an SVM model, make predictions, and test the predictions within a dataset. It even has built-in support for the radial-basis function and other kernel methods. You can download LIBSVM from . LIBSVM is written in C++ and includes a version written in Java. The download package includes a Python wrapper called svm.py. In order to use svm.py, you need compiled versions of LIBSVM for your platform. If you're using Windows, a DLL called svmc.dll is included. (Python 2.5 requires that you rename this file to svmc.pyd because it can't import libraries with DLL extensions.) The documentation for LIBSVM explains how to compile the library for other platforms. Once you have a compiled version of LIBSVM, put it and svm.py in your Python path or working directory. You can now import the library in your Python session and try a simple problem: >>> from svm import * The first step is to create a simple dataset. LIBSVM reads the data from a tuple containing two lists. The first list contains the classes and the second list contains the input data. Try creating a simple dataset with two possible classes: >>> prob = svm_problem([1,−1],[[1,0,1],[−1,0,−1]]) You also need to specify which kernel you want to use by creating svm_parameter: >>> param = svm_parameter(kernel_type = LINEAR, C = 10) Next, you can train the model: >>> m = svm_model(prob, param) * optimization finished, #iter = 1 nu = 0.025000 obj = −0.250000, rho = 0.000000 nSV = 2, nBSV = 0 Total nSV = 2 And finally, use it to create predictions about new classes: >>> m.predict([1, 1, 1]) 1.0 This shows all the functionality of LIBSVM that you need to create a model from training data and use it to make predictions. LIBSVM includes another very nice feature, the ability to load and save the trained models you've built: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Matching on Facebook Content preview·Buy reprint rights for this chapter Facebook is a popular social networking site that was originally for college students but eventually opened up to a larger audience. Like other social networking sites, it allows users to make profiles, enter demographic information about themselves, and connect to their friends on the site. Facebook also includes an API that lets you query information about people and find out if two people are friends or not. By doing this, you can build a set similar to the matchmaker dataset using real people. As of this writing, Facebook has remained very committed to privacy, so you can only view the profiles of people who are your friends. The API applies the same rules, requiring a user to log in and only allowing queries, so unfortunately, you'll only be able to work through this section if you have a Facebook account and have connected to at least 20 people. If you have a Facebook account, you can sign up for a developer key on the Facebook developer site at . You'll get two strings, an API key and a "secret" key. The API key is used to identify you, and the secret key is used to encrypt your requests in the hash function that you'll see later. To start, create a new file called facebook.py, import some modules you'll need, and set up some constants: import urllib,md5,webbrowser,time from xml.dom.minidom import parseString apikey="Your API Key" secret="Your Secret Key" FacebookSecureURL = "https://api.facebook.com/restserver.php" There are two convenience methods to add: getsinglevalue gets the next value from inside a named node, and callid returns a number based on the system time. def getsinglevalue(node,tag): nl=node.getElementsByTagName(tag) if len(nl)>0: tagNode=nl[0] if tagNode.hasChildNodes( ): return tagNode.firstChild.nodeValue return '' def callid( ): return str(int(time.time( )*10)) Some of the Facebook calls require that you send a sequence number, which can be any number as long as it is higher than the last sequence number you used. Using the system time is an easy way to get consistently higher numbers. The procedure for creating a session on Facebook is actually intended for you to create an application for other people to use without ever learning their login information. This is accomplished in several steps: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Exercises Content preview·Buy reprint rights for this chapter 1. Bayesian classifier. Can you think of ways that the Bayesian classifier you built in could be used on the matchmaker dataset? What would be good examples of features? 2. Optimizing a dividing line. Do you think it's possible to choose a dividing line using the optimization methods you learned in , instead of just using the averages? What cost function would you use? 3. Choosing the best kernel parameters. Write a function that loops over different values for gamma and determines what the best value for a given dataset is. 4. Hierarchy of interests. Design a simple hierarchy of interests, along with a data structure to represent it. Alter the matchcount function to use the hierarchy to give partial points for matches. 5. Other LIBSVM kernels. Look through the LIBSVM documentation and see what other kernels are available. Try a polynomial kernel. Does the prediction quality improve? 6. Other Facebook predictions. Look at all the fields available through the Facebook API. What datasets can be built for individuals? Could you use an SVM model to predict if someone listed a movie as a favorite based on the school she went to? What other things could you make predictions about? Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 10: Finding Independent Features Content preview·Buy reprint rights for this chapter Most of the chapters so far have focused primarily on supervised classifiers, except , which was about unsupervised techniques called clustering. This chapter will look at ways to extract the important underlying features from sets of data that are not labeled with specific outcomes. Like clustering, these methods do not seek to make predictions as much as they try to characterize the data and tell you interesting things about it. You'll recall from that clustering assigns every row in a dataset to a group or point in a hierarchy—each item fits into exactly one group that represents the average of the members. Feature extraction is a more general form of this idea; it tries to find new data rows that can be used in combination to reconstruct rows of the original dataset. Rather than belonging to one cluster, each row is created from a combination of the features. One of the classic problems illustrating the need to find independent features is known as the cocktail party problem, the problem of interpreting conversation when many people are talking. A remarkable feature of the human auditory system is our ability to focus on a single voice in a room full of people talking, despite the fact that a mixture of all the different voices is reaching our ears. The brain is quite adept at separating the independent sounds that create all the noise it's hearing. By using algorithms like the one described in this chapter, and getting the input from multiple microphones placed in a room, it's possible for a computer to do the same thing—take a cacophony of sounds and separate them without any prior knowledge of what they are. Another interesting use of feature extraction is identifying recurring word-usage patterns in a corpus of documents, which can help identify themes that are independently present in various combinations. In this chapter, you'll build a system that downloads news articles from a variety of feeds and identifies the key themes that emerge from a set of articles. You may find that articles contain more than one theme, and you will certainly find that themes apply to more than one article. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! A Corpus of News Content preview·Buy reprint rights for this chapter To begin, you'll need a set of news articles to work with. These should be from a variety of sources so that the themes being discussed in different places are easier to discern. Fortunately, most of the major news services and web sites provide RSS or Atom feeds, either for all the articles or for individual categories. You've used the Universal Feed Parser in previous chapters to parse RSS and Atom feeds for blogs, and you can use the same parser to download news. If you don't already have the parser, you can download it from . There are thousands of sources of what can be considered "news," from major news wires and newspapers to political blogs. Some ideas include: * Reuters * The Associated Press * The New York Times * Google News * salon.com * Fox News * Forbes magazine * CNN International This is just a small sample of the possibilities. Combining sources from different points on the political spectrum and those that use different writing styles is a better test for the algorithm, since it should be able to find the important features and ignore irrelevant sections. It's also possible, given the right set of data, for a feature-extraction algorithm to identify a feature that is present in stories with a particular political slant and assign that feature to the story, in addition to the features that describe the subject of the story. Create a new file called newsfeatures.py and add the following code to import some libraries and give a list of sources: import feedparser import re feedlist=['http://today.reuters.com/rss/topNews', 'http://today.reuters.com/rss/domesticNews', 'http://today.reuters.com/rss/worldNews', 'http://hosted.ap.org/lineups/TOPHEADS-rss_2.0.xml', 'http://hosted.ap.org/lineups/USHEADS-rss_2.0.xml', 'http://hosted.ap.org/lineups/WORLDHEADS-rss_2.0.xml', 'http://hosted.ap.org/lineups/POLITICSHEADS-rss_2.0.xml', 'http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml', 'http://www.nytimes.com/services/xml/rss/nyt/International.xml', 'http://news.google.com/?output=rss', 'http://feeds.salon.com/salon/news', 'http://www.foxnews.com/xmlfeed/rss/0,4313,0,00.rss', 'http://www.foxnews.com/xmlfeed/rss/0,4313,80,00.rss', 'http://www.foxnews.com/xmlfeed/rss/0,4313,81,00.rss', 'http://rss.cnn.com/rss/edition.rss', 'http://rss.cnn.com/rss/edition_world.rss', 'http://rss.cnn.com/rss/edition_us.rss'] Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Previous Approaches Content preview·Buy reprint rights for this chapter In previous chapters, you've looked at different ways of dealing with word counts for textual data. For purposes of comparison, it's useful to try these first and see what sort of results you get, then compare them with the results of feature extraction. If you have the code that you wrote for those chapters, you can import those modules and try them here on your feeds. If not, don't worry—this section illustrates how these methods work on the sample data. Bayesian classification is, as you've seen, a supervised learning method. If you were to try to use the classifier built in , you would first be required to classify several examples of stories to train the classifier. The classifier would then be able to put later stories into your predefined categories. Besides the obvious downside of having to do the initial training, this approach also suffers from the limitation that the developer has to decide what all the different categories are. All the classifiers you've seen so far, such as decision trees and support-vector machines, will have this same limitation when applied to a dataset of this kind. If you'd like to try the Bayesian classifier on this dataset, you'll need to place the module you built in in your working directory. You can use the articlewords dictionary as is for the feature set of each article. Try this in your Python session: >>> def wordmatrixfeatures(x): ... return [wordvec[w] for w in range(len(x)) if x[w]>0] ... >>> wordmatrixfeatures(wordmatrix[0]) ['forces', 'said', 'security', 'attacks', 'iraq', 'its', 'pentagon',...] >>> import docclass >>> classifier=docclass.naivebayes(wordmatrixfeatures) >>> classifier.setdb('newstest.db') >>> artt[0] u'Attacks in Iraq at record high: Pentagon' >>> # Train this as an 'iraq' story >>> classifier.train(wordmatrix[0],'iraq') >>> artt[1] u'Bush signs U.S.-India nuclear deal' >>> # Train this as an 'india' story >>> classifier.train(wordmatrix[1],'india') >>> artt[2] u'Fatah, Hamas men abducted freed: sources' >>> # How is this story classified? >>> classifier.classify(wordmatrix[1]) u'iraq' With the examples used, there are many possible themes and only a few stories in each one. The Bayesian classifier will eventually learn all the themes, but since it requires training on several examples per theme, this classifier is better suited to fewer categories with more examples in each category. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Non-Negative Matrix Factorization Content preview·Buy reprint rights for this chapter The technique for extracting the important features of the data is called non-negative matrix factorization (NMF). This is one of the most sophisticated techniques covered in this book, and it requires a bit more explanation and a quick introduction to linear algebra. Everything you need to know will be covered in this section. To understand what NMF is doing, you'll first need to understand a bit about matrix multiplication. If you're already familiar with linear algebra, you can safely skip this section. An example of matrix multiplication is shown in . Figure : Example of matrix multiplication This figure shows how two matrices are multiplied together. When multiplying matrices, the first matrix (Matrix A in the figure) must have the same number of columns as the second matrix (Matrix B) has rows. In this case, Matrix A has two columns and Matrix B has two rows. The resulting matrix (Matrix C) will have the same number of rows as Matrix A and the same number of columns as Matrix B. The value for each cell in Matrix C is calculated by multiplying the values from the same row in Matrix A by the values from the same column in Matrix B and adding all the products together. Looking at the value in the top left corner of Matrix C, you can see that the values from the first row of Matrix A are multiplied by the corresponding values from the first column of Matrix B. These are added together to get the final value. The other cells in Matrix C are all calculated in the same way. Another common matrix operation is transposing. This means that the columns become rows and the rows become columns. It's usually indicated with a "T," as shown in . Figure : Transposing a matrix You'll be using the transpose and multiplication operations in the implementation of the NMF algorithm. So far, what you have is a matrix of articles with word counts. The goal is to factorize this matrix, which means finding two smaller matrices that can be multiplied together to reconstruct this one. The two smaller matrices are: The features matrix This matrix has a row for each feature and a column for each word. The values indicate how important a word is to a feature. Each feature should represent a theme that emerged from a set of articles, so you might expect an article about a new TV show to have a high weight for the word "television." Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Displaying the Results Content preview·Buy reprint rights for this chapter Exactly how you view the results is a little complicated. Every feature in the features matrix has a weighting that indicates how strongly each word applies to that feature, so you can try displaying the top five or ten words in each feature to see what the most important words are in that feature. The equivalent column in the weights matrix tells you how much this particular feature applies to each of the articles, so it's also interesting to show the top three articles and see how strongly this feature applies to all of them. Add a new function called showfeatures to newsfeatures.py: from numpy import * def showfeatures(w,h,titles,wordvec,out='features.txt'): outfile=file(out,'w') pc,wc=shape(h) toppatterns=[[] for i in range(len(titles))] patternnames=[] # Loop over all the features for i in range(pc): slist=[] # Create a list of words and their weights for j in range(wc): slist.append((h[i,j],wordvec[j])) # Reverse sort the word list slist.sort( ) slist.reverse( ) # Print the first six elements n=[s[1] for s in slist[0:6]] outfile.write(str(n)+'\n') patternnames.append(n) # Create a list of articles for this feature flist=[] for j in range(len(titles)): # Add the article with its weight flist.append((w[j,i],titles[j])) toppatterns[j].append((w[j,i],i,titles[j])) # Reverse sort the list flist.sort( ) flist.reverse( ) # Show the top 3 articles for f in flist[0:3]: outfile.write(str(f)+'\n') outfile.write('\n') outfile.close( ) # Return the pattern names for later use return toppatterns,patternnames This function loops over each of the features and creates a list of all the word weights and words from the word vector. It sorts the list so that the most heavily weighted words for the feature appear at the start of the list, and then it prints the first 10 of these words. This should give you a good idea of the theme represented by this particular feature. The function returns the top patterns and pattern names, so they only have to be calculated once and can be used again in the Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Using Stock Market Data Content preview·Buy reprint rights for this chapter As well as dealing with somewhat nominal data like word counts, NMF is suited for problems involving true numerical data. This section will show how the same algorithm can be applied to trading volume in the United States stock market using data downloaded from Yahoo! Finance. This data may show patterns of important trading days or the ways that underlying factors can drive the volume of multiple stocks. Financial markets are considered a quintessential example of collective intelligence because they have a great number of participants acting independently based on different information and biases and producing a small set of outputs, such as price and volume. It has proven extremely difficult for individuals to do a better job than the collective in predicting future prices. There is a large body of academic research on the ways that groups of people are more successful at setting prices in a financial market than any individual could possibly be. The trading volume for a specific stock is the number of shares that are bought and sold within a given period, usually one day. shows a chart of Yahoo! stock, which has the ticker symbol YHOO. The line at the top is the closing price, the price of the last transaction of the day. The bar chart below shows the trading volume. Figure : Stock chart showing price and trading volume You'll notice that the volume tends to be much higher on days when there is a big change in the stock price. This often happens when a company makes a major announcement or releases financial results. Spikes can also occur due to news about the company or the industry. In the absence of external events, trading volume is usually, but not always, constant for a given stock. In this section, you'll be looking at volume for a set of stocks in a time series. This will let you search for patterns of changes in volume that affect multiple stocks at once, or for events that are so influential they become their own features. Volume is used instead of closing price because NMF tries to find positive features that can be added together; prices often move down in response to events, and NMF will not find negative features. Volume, however, is more easily modeled as having a basal level that can increase in response to outside influence, making it suitable for positive matrices. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Exercises Content preview·Buy reprint rights for this chapter 1. Varying news sources. The example in this chapter used mostly pure news sources. Try adding some top political blogs. (is a good place to find blogs.) How does this affect the results? Are there features that apply strongly to political commentary? Are news stories with related commentary grouped easily? 2. K-means clustering. Hierarchical clustering was used on the articles matrix, but what happens if you use K-means clustering? How many clusters do you need to get good separation of different stories? How does this compare to the number of features you need to use to extract all the themes? 3. Optimizing for factorization. Can you use the optimization code that you built in to factorize the matrix? Is this a lot faster or slower? How do the results compare? 4. Stopping criteria. The NMF algorithm in this chapter stops when the cost has dropped to 0 or when it reaches the maximum number of iterations. Sometimes improvement will almost entirely stop once a very good though not perfect solution has been reached. Modify the code so it stops when the cost is not improving by more than one percent per iteration. 5. Alternative display methods. The functions given in this chapter for displaying results are simple and show important features, but they lose a lot of context. Can you think of other ways of displaying results? Try writing a function that displays the articles in their original text with key words from each feature highlighted, or perhaps a trading volume chart with important dates clearly shown. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 11: EVOLVING INTELLIGENCE Content preview·Buy reprint rights for this chapter Throughout this book you've seen a number of different problems, and in each case you used an algorithm that was suited to solve that particular problem. In some of the examples, you had to tweak the parameters or use optimization to search for a good set of parameters. This chapter will look at a different way to approach problems. Instead of choosing an algorithm to apply to a problem, you'll make a program that attempts to automatically build the best program to solve a problem. Essentially, you'll be creating an algorithm that creates algorithms. To do this, you will use a technique called genetic programming. Since this is the last chapter in which you'll learn a completely new type of algorithm, I've picked a topic that is new, exciting, and being actively researched. This chapter is a little different from the others because it doesn't use any open APIs or public datasets, and because programs that can modify themselves based on their interactions with many people are an interesting and different kind of collective intelligence. Genetic programming is a very large topic about which many books have been written, so you'll only get an introduction here, but I hope it's enough to get you excited about the possibilities and perhaps to research and experiment on your own. The two problems in this chapter are recreating a mathematical function given a dataset, and automatically creating an AI (artificial intelligence) player for a simple board game. This is just a very small sampling of the possibilities of genetic programming—computational power is really the only constraint on the types of problems it can be used to solve. Genetic programming is a machine-learning technique inspired by the theory of biological evolution. It generally works by starting with a large set of programs (referred to as the population), which are either randomly generated or hand-designed and are known to be somewhat good solutions. The programs are then made to compete in some user-defined task. This may be a game in which the programs compete against each other directly, or it may be an individual test to see which program performs better. After the competition, a ranked list of the programs from best to worst can be determined. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! What Is Genetic Programming? Content preview·Buy reprint rights for this chapter Genetic programming is a machine-learning technique inspired by the theory of biological evolution. It generally works by starting with a large set of programs (referred to as the population), which are either randomly generated or hand-designed and are known to be somewhat good solutions. The programs are then made to compete in some user-defined task. This may be a game in which the programs compete against each other directly, or it may be an individual test to see which program performs better. After the competition, a ranked list of the programs from best to worst can be determined. Next—and here's where evolution comes in—the best programs are replicated and modified in two different ways. The simpler way is mutation, in which certain parts of the program are altered very slightly in a random manner in the hope that this will make a good solution even better. The other way to modify a program is through crossover (sometimes referred to as breeding), which involves taking a portion of one of the best programs and replacing it with a portion of one of the other best programs. This replication and modification procedure creates many new programs that are based on, but different from, the best programs. At each stage, the quality of the programs is calculated using a fitness function. Since the size of the population is kept constant, many of the worst programs are eliminated from the population to make room for the new programs. The new population is referred to as "the next generation," and the whole procedure is then repeated. Because the best programs are being kept and modified, it is expected that with each generation they will get better and better, in much the same way that teenagers can be smarter than their parents. New generations are created until a termination condition is reached, which, depending on the problem, can be that: * The perfect solution has been found. * A good enough solution has been found. * The solution has not improved for several generations. * The number of generations has reached a specified limit. For some problems, such as determining a mathematical function that correctly maps a set of inputs to an output, a perfect solution is possible. For others, such as a board game, there may not be a perfect solution, since the quality of a solution depends on the strategy of the program's adversary. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Programs As Trees Content preview·Buy reprint rights for this chapter In order to create programs that can be tested, mutated, and bred, you'll need a way to represent and run them from within your Python code. The representation has to lend itself to easy modification and, more importantly, has to be guaranteed to be an actual program—which means generating random strings and trying to treat them as Python code won't work. Researchers have come up with a few different ways to represent programs for genetic programming, and the most commonly used is a tree representation. Most programming languages, when compiled or interpreted, are first turned into a parse tree, which is very similar to what you'll be working with here. (The programming language Lisp and its variants are essentially ways of entering a parse tree directly.) An example of a parse tree is shown in . Figure : Example program tree Each node represents either an operation on its child nodes or an endpoint, such as a parameter with a constant value. For example, the circular node is a sum operation on its two branches, in this case, the values Y and 5. Once this point is evaluated, it is given to the node above it, which in turn applies its own operation to its branches. You'll also notice that one of the nodes has the operation "if," which specifies that if its leftmost branch evaluates to true, return its center branch; if it doesn't, return its rightmost branch. Traversing the complete tree, you can see that it corresponds to the Python function: def func(x,y) if x>3: return y + 5 else: return y − 2 At first, it might appear that these trees can only be used to build very simple functions. There are two things to consider here—first, the nodes that compose the tree can potentially be very complex functions, such as distance measures or Gaussians. The second thing is that trees can be made recursive by referring to nodes higher up in the tree. Creating trees like this allows for loops and other more complicated control structures. You're now ready to construct tree programs in Python. The trees are made up of nodes, which, depending on the functions associated with them, have some number of child nodes. Some of the nodes will return parameters passed to the program, others will return constants, and the most interesting ones will return operations on their child nodes. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Creating the Initial Population Content preview·Buy reprint rights for this chapter Although it's possible to hand-create programs for genetic programming, most of the time the initial population consists of a set of random programs. This makes the process easier to start, since it's not necessary to design several programs that almost solve a problem. It also creates much more diversity in the initial population—a set of programs designed by a single programmer to solve a problem are likely to be very similar, and although they may give answers that are almost correct, the ideal solution make look quite different. You'll learn more about the importance of diversity shortly. Creating a random program consists of creating a root node with a random associated function and then creating as many random child nodes as necessary, which in turn may have their own associated random child nodes. Like most functions that work with trees, this is most easily defined recursively. Add a new function, makerandomtree, to gp.py: def makerandomtree(pc,maxdepth=4,fpr=0.5,ppr=0.6): if random( )0: f=choice(flist) children=[makerandomtree(pc,maxdepth−1,fpr,ppr) for i in range(f.childcount)] return node(f,children) elif random( )>> random1=gp.makerandomtree(2) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Testing a Solution Content preview·Buy reprint rights for this chapter You would now have everything you'd need to build programs automatically, if you could just generate random programs until one is correct. Obviously, this would be ridiculously impractical because there are infinite possible programs and it's highly unlikely that you would stumble across a correct one in any reasonable time frame. However, at this point it is worth looking at ways to test a solution to see if it's correct, and if it's not, to determine how close it is. One of the easiest tests for genetic programming is to reconstruct a simple mathematical function. Imagine you were given a table of inputs and an output that looked like . Table : Data and result for an unknown function X Y Result 26 35 829 8 24 141 20 1 467 33 11 1215 37 16 1517 There is some function that maps X and Y to the result, but you're not told what it is. A statistician might see this and try to do a regression analysis, but that requires guessing the structure of the formula first. Another option is to build a predictive model using k-nearest neighbors as you did in , but that requires keeping all the data. In some cases, you just need a formula, perhaps to codify in another much simpler program or to describe to other people what's going on. I'm sure you're in suspense, so I'll tell you what the function is. Add hiddenfunction to gp.py: def hiddenfunction(x,y): return x**2+2*y+3*x+5 You're going to use this function to build a dataset against which you can test your generated programs. Add a new function, buildhiddenset, which creates the dataset: def buildhiddenset( ): rows=[] for i in range(200): x=randint(0,40) y=randint(0,40) rows.append([x,y,hiddenfunction(x,y)]) return rows And use this to create a dataset in your Python session: >>> reload(gp) >>> hiddenset=gp.buildhiddenset( ) Of course, you know what the function used to generate the dataset looks like, but the real test is whether genetic programming can reproduce it without being told. As with optimization, it's necessary to come up with a way to measure how good a solution is. In this case, you're testing a program against a numerical outcome, so an easy way to test a program is to see how close it gets to the correct answers for the dataset. Add Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Mutating Programs Content preview·Buy reprint rights for this chapter After the best programs are chosen, they are replicated and modified for the next generation. As mentioned earlier, mutation takes a single program and alters it slightly. The tree programs can be altered in a number of ways—by changing the function on a node or by altering its branches. A function that changes the number of required child nodes either deletes or adds new branches, as shown in . Figure : Mutation by changing node functions The other way to mutate is by replacing a subtree with an entirely new one, as shown in . Figure : Mutation by replacing subtrees Mutation is not something that should be done too much. You would not, for instance, mutate the majority of nodes in a tree. Instead, you can assign a relatively small probability that any node will be modified. Beginning at the top of the tree, if a random number is lower than that probability, the node is mutated in one of the ways described above; otherwise, the test is performed again on its child nodes. To keep things simple, the code given here only performs the second kind of mutation. Create a new function called mutate to perform this operation: def mutate(t,pc,probchange=0.1): if random( )>> random2.display( ) subtract 7 multiply isgreater p0 p1 if multiply p1 p1 p0 2 >>> muttree=gp.mutate(random2,2) >>> muttree.display( ) subtract 7 multiply isgreater p0 p1 if multiply p1 p1 p0 p1 See if the result of scorefunction has changed significantly, for better or worse, after the tree has been mutated: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Crossover Content preview·Buy reprint rights for this chapter The other type of program modification is crossover or breeding. This involves taking two successful programs and combining them to create a new program, usually by replacing a branch from one with a branch from another. shows an example of how this works. Figure : Crossover operation The function for performing a crossover takes two trees as inputs and traverses down both of them. If a randomly selected threshold is reached, the function returns a copy of the first tree with one of its branches replaced by a branch in the second tree. By traversing both trees at once, the crossover happens at approximately the same level on each tree. Add the crossover function to gp.py: def crossover(t1,t2,probswap=0.7,top=1): if random( )>> random1=gp.makerandomtree(2) >>> random1.display( ) multiply subtract p0 8 isgreater p0 isgreater p1 5 >>> random2=gp.makerandomtree(2) >>> random2.display( ) if 8 p1 2 >>> cross=gp.crossover(random1,random2) >>> cross.display( ) multiply subtract p0 8 2 You'll probably notice that swapping out branches can radically change what the program does. You may also notice that programs may be close to being correct for completely different reasons, so merging them produces a result that's very different from either of its predecessors. Again, the hope is that some crossovers will improve the solution and be kept around for the next generation. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Building the Environment Content preview·Buy reprint rights for this chapter Armed with a measure of success and two methods of modifying the best programs, you're now ready to set up a competitive environment in which programs can evolve. The steps are shown in the flowchart in . Essentially, you create a set of random programs and select the best ones for replication and modification, repeating this process until some stopping criteria is reached. Create a new function called evolve to carry out this procedure: def evolve(pc,popsize,rankfunction,maxgen=500, mutationrate=0.1,breedingrate=0.4,pexp=0.7,pnew=0.05): # Returns a random number, tending towards lower numbers. The lower pexp # is, more lower numbers you will get def selectindex( ): return int(log(random( ))/log(pexp)) # Create a random initial population population=[makerandomtree(pc) for i in range(popsize)] for i in range(maxgen): scores=rankfunction(population) print scores[0][0] if scores[0][0]==0: break # The two best always make it newpop=[scores[0][1],scores[1][1]] # Build the next generation while len(newpop)pnew: newpop.append(mutate( crossover(scores[selectindex( )][1], scores[selectindex( )][1], probswap=breedingrate), pc,probchange=mutationrate)) else: # Add a random node to mix things up newpop.append(makerandomtree(pc)) population=newpop scores[0][1].display( ) return scores[0][1] This function creates an initial random population. It then loops up to maxgen times, each time calling rankfunction to rank the programs from best to worst. The best program is automatically passed through to the next generation unaltered, which is sometimes referred to as elitism. The rest of the next generation is constructed by randomly choosing programs that are near the top of the ranking, and then breeding and mutating them. This process repeats until either a program has a perfect score of 0 or maxgen is reached. The function has several parameters, which are used to control various aspects of the environment. They are: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! A Simple Game Content preview·Buy reprint rights for this chapter A more interesting problem for genetic programming is building an AI for a game. You can force the programs to evolve by having them compete against each other and against real people, and giving the ones that win the most a higher chance of making it to the next generation. In this section, you'll create a simulator for a very simple game called Grid War, which is depicted in . Figure : Grid War example The game has two players who take turns moving around on a small grid. Each player can move in one of four directions, and the board is limited so if a player attempts to move off one side, he forfeits his turn. The object of the game is to capture the other player by moving onto the same square as his on your turn. The only additional constraint is that you automatically lose if you try to move in the same direction twice in a row. This game is very basic but since it pits two players against each other, it will let you explore more competitive aspects of evolution. The first step is to create a function that uses two players and simulates a game between them. The function passes the location of the player and the opponent to each program in turn, along with the last move made by the player, and takes the return value as the move. The move should be a number from 0 to 3, indicating one of four possible directions, but since these are random programs that can return any integer, the function has to handle values outside this range. To do this, it uses modulo 4 on the result. Random programs are also liable to do things like create a player that moves in a circle, so the number of moves is limited to 50 before a tie is declared. Add gridgame to gp.py: def gridgame(p): # Board size max=(3,3) # Remember the last move for each player lastmove=[−1,−1] # Remember the player's locations location=[[randint(0,max[0]),randint(0,max[1])]] # Put the second player a sufficient distance from the first location.append([(location[0][0]+2)%4,(location[0][1]+2)%4]) # Maximum of 50 moves before a tie for o in range(50): # For each player for i in range(2): locs=location[i][:]+location[1-i][:] locs.append(lastmove[i]) move=p[i].evaluate(locs)%4 # You lose if you move the same direction twice in a row if lastmove[i]==move: return 1−i lastmove[i]=move if move==0: location[i][0]−=1 # Board limits if location[i][0]<0: location[i][0]=0 if move==1: location[i][0]+=1 if location[i][0]>max[0]: location[i][0]=max[0] if move==2: location[i][1]−=1 if location[i][1]<0: location[i][1]=0 if move==3: location[i][1]+=1 if location[i][1]>max[1]: location[i][1]=max[1] # If you have captured the other player, you win if location[i]==location[1−i]: return i return n−1 Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Further Possibilities Content preview·Buy reprint rights for this chapter This chapter is just an introduction to genetic programming, which is a huge and rapidly advancing field. You've used it so far to approach simple problems in which programs are built in minutes rather than days, but the principles can be extended to much more complex problems. The number of programs in the populations here have been very small compared to those used in more complex problems—a population of thousands or tens of thousands is more typical. You are encouraged to come up with more difficult problems and try larger population sizes, but you may have to wait hours or days while the programs run. The following section outlines a few ways in which the simple genetic programming model can be extended for different applications. We have used a very small set of functions to construct the programs so far. This limits the scope of what a simple program can do—for more complicated problems, it's necessary to greatly increase the number of functions available to build a tree. Here are some possible functions to add: * Trigonometric functions like sine, cosine, and tangent * Other mathematical functions like power, square root, and absolute value * Statistical distributions, such as a Gaussian * Distance metrics, like Euclidean and Tanimoto distances * A three-parameter function that returns 1 if the first parameter is between the second and third * A three-parameter function that returns 1 if the difference between the first two parameters is less than the third These can get as complicated as you like, and they are often tailored to specific problems. Trigonometric functions may be a necessity when working in a field like signal processing, but they are not much use in a game like the one you built in this chapter. The programs in this chapter are almost entirely reactive; they give a result based solely on their inputs. This is the right approach for solving mathematical functions, but it doesn't allow the programs to work from a longer-term strategy. The chasing game passes the programs the last move they made—mostly so the programs learn they can't make the same move twice in a row—but this is simply the output of the program, not something they set themselves. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Exercises Content preview·Buy reprint rights for this chapter 1. More function types. We started with a very short list of functions. What other functions can you think of? Implement a Euclidean distance node with four parameters. 2. Replacement mutation. Implement a mutation procedure that chooses a random node on the tree and changes it. Make sure it deals with function, constant, and parameter nodes. How is evolution affected by using this function instead of the branch replacement? 3. Random crossover. The current crossover function chooses branches from two trees at the same level. Write a different crossover function that crosses any two random branches. How does this affect evolution? 4. Stopping evolution. Add an additional criteria to evolve that stops the process and returns the best result if the best score hasn't improved within X generations. 5. Hidden functions. Try creating other mathematical functions for the programs to guess. What sort of functions can be found easily, and which are more difficult? 6. Grid War player. Try to hand-design your own tree program that does well at Grid War. If you find this easy, try to write another completely different one. Instead of having a completely random initial population, make it mostly random, with your hand-designed programs included. How do they compare to random programs, and can they be improved with evolution? 7. Tic-tac-toe. Build a tic-tac-toe simulator for your programs to play. Set up a tournament similar to the Grid War tournament. How well do the programs do? Can they ever learn to play perfectly? 8. Nodes with datatypes. Some ideas were provided in this chapter about implementing nodes with mixed datatypes. Implement this and see if you can evolve a program that learns to return the second, third, sixth, and seventh characters of a string (e.g., "genetic" becomes "enic"). Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Chapter 12: Algorithm Summary Content preview·Buy reprint rights for this chapter This book has introduced a number of different algorithms, and if you've been working through the examples, you now have Python code that implements many of them. The earlier chapters are structured around working through an example problem with algorithms and variations introduced throughout the chapter. This chapter will be a reference for the algorithms covered, so when you want to do some data mining or machine learning on a new dataset, you can look at the algorithms here, decide which one is appropriate, and use the code you've already written to analyze your data. To save you from going back through the book to find the details of an algorithm, I'll provide a description of each one, a high-level overview of how it works, what sort of datasets you can apply it to, and how you would use the code you've previously written to run it. I'll also mention some of the strengths and weaknesses of each algorithm (or, if you like, how to sell the idea to your boss). In some cases, I'll use examples to help explain the properties of the algorithm. These examples are greatly simplified—most are so simple you can solve them just by looking at the data yourself—but they are useful for illustration. Supervised learning methods, which guess a classification or a value based on training examples, will be covered first. Bayesian classifiers were covered in . In that chapter, you saw how to create a document classification system, such as those used for spam filtering or dividing up a set of documents based on an ambiguous keyword search. Although all the examples dealt with documents, the Bayesian classifier described in will work on any dataset that can be turned into lists of features. A feature is simply something that is either present or absent for a given item. In the case of documents, the features are the words in the document, but they could also be characteristics of an unidentified object, symptoms of a disease, or anything else that can be said to be present of absent. Like all supervised methods, a Bayesian classifier is trained with examples. Each example is a list of an item's features and the classification for that item. Suppose you're trying to train a classifier to recognize if a document containing the word "python" is about the programming language or the snake. A sample training set might look like . Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Bayesian Classifier Content preview·Buy reprint rights for this chapter Bayesian classifiers were covered in . In that chapter, you saw how to create a document classification system, such as those used for spam filtering or dividing up a set of documents based on an ambiguous keyword search. Although all the examples dealt with documents, the Bayesian classifier described in will work on any dataset that can be turned into lists of features. A feature is simply something that is either present or absent for a given item. In the case of documents, the features are the words in the document, but they could also be characteristics of an unidentified object, symptoms of a disease, or anything else that can be said to be present of absent. Like all supervised methods, a Bayesian classifier is trained with examples. Each example is a list of an item's features and the classification for that item. Suppose you're trying to train a classifier to recognize if a document containing the word "python" is about the programming language or the snake. A sample training set might look like . Table : Features and classifications for a set of documents Features Classification Pythons are constrictors that feed on birds and mammals Snake Python was originally developed as a scripting language Language A 49-ft.-long python was found in Indonesia Snake Python has a dynamic type system Language Python with vivid scales Snake Open source project Language The classifier keeps track of all the features it has seen so far, along with numerical probabilities that the features are associated with a particular classification. The classifier is trained by receiving examples one by one. After each example, the classifier updates the probabilities for the features and classifier in that example, generating the probability that a document about a certain category will contain a given word. For example, trained on a set of documents like those in , you might end up with a set of probabilities like those in . Table : Probabilities of words for a given category Feature Language Snake Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Decision Tree Classifier Content preview·Buy reprint rights for this chapter Decision trees were introduced in to show you how to build a model of user behavior from server logs. Decision trees are notable for being extremely easy to understand and interpret. An example of a decision tree is shown in . Figure : Example decision tree It should be clear from the figure what a decision tree does when faced with the task of classifying a new item. Beginning at the node at the top of the tree, it checks the item against the node's criteria—if the item matches the criteria, it follows the Yes branch; otherwise, it follows the No branch. This process is repeated until an endpoint is reached, which is the predicted category. Classifying in a decision tree is quite simple; training it is trickier. The algorithm described in built the tree from the top, choosing an attribute at each step that would divide the data in the best possible manner. To illustrate this, consider the fruit dataset shown in . This will be referred to as the original set. Table : Fruit data Diameter Color Fruit 4 Red Apple 4 Green Apple 1 Red Cherry 1 Green Grape 5 Red Apple There are two possible variables on which this data can be divided, either Diameter or Color, to create the top node of the tree. The first step is to try each of them in order to decide which of these variables divides the data best. Dividing the set on Color gives the results shown in . Table : Fruit data divided by Color Red Green Apple Apple Cherry Grape Apple The data is still pretty mixed. However, if you divide the dataset by Diameter (less than four inches and greater than or equal to four inches), the results divide much more cleanly (referred to as Subset 1 on the left and Subset 2 on the right). This division is shown in . Table : Fruit data divided by Diameter Diameter < 4in Diameter ≥ 4in Cherry Apple Grape Apple This is obviously a much better result, since Subset 1 contains all the Apple entries from the original set. Although the better variable is clear in this example, larger datasets will not always have such clean divisions. introduced the concept of entropy (the amount of disorder in a set) to measure how good a division is: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Neural Networks Content preview·Buy reprint rights for this chapter showed how to build a simple neural network for altering the ranking of search results based on what links users have clicked in the past. That neural network was able to learn which words in which combinations were important, and also which words were unimportant to a particular query. Neural networks can be applied to both classification and numerical prediction problems. The neural network in was used as a classifier—it gave a number for every link, predicting that the link with the highest number would be the one that the user would click. Because it gave numbers for every link, you could use all the numbers to change the rankings of the search results. There are many different kinds of neural networks. The one covered in this book is known as a multilayer perceptron network, so named because it has a layer of input neurons that feed into one or more layers of hidden neurons. The basic structure is shown in . Figure : Basic neural network structure This network has two layers of neurons. The layers of neurons are connected to each other by synapses, which each have an associated weight. The outputs from one set of neurons are fed to the next layer through the synapses. The higher the weight of a synapse leading from one neuron to the next, the more influence it will have on the output of that neuron. As a simple example, consider again the problem of spam filtering that was described in the earlier section "." In our simplified email world, an email message can contain the word "online," the word "pharmacy," or both. To determine which of these messages is spam, you might use a neural network that looks like . Figure : Neural network for spam classification In this figure, the weights on the synapses are already set to solve the problem. (You'll see how they are set in the next section.) The neurons in the first layer respond to the words that are used as input—if a word is present in the email message, then the neurons that are strongly connected to that word become active. The second layer is fed by the first layer, so it responds to combinations of words. Finally, these neurons feed their results to the outputs, and particular combinations may be strongly or weakly associated with the possible results. In the end, the final decision is whichever output is strongest. shows how the network reacts to the word "online" when it is Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Support-Vector Machines Content preview·Buy reprint rights for this chapter Support-vector machines (SVMs) were introduced in , and are probably the most sophisticated classification method covered by this book. SVMs take datasets with numerical inputs and try to predict which category they fall into. You might, for example, want to decide positions for a basketball team from a list of people's heights and running speeds. To simplify, consider just two possibilities—front-court positions in which tall players are required, and back-court positions where you need the faster movers. An SVM builds a predictive model by finding the dividing line between the two categories. If you plot a set of values for height versus speed and the best position for each person, you get a graph like the one shown in . Front-court players are shown as Xs and back-court players are shown as Os. Also shown on the graph are a few lines that separate the data into the two categories. Figure : Plot of basketball players and dividing lines A support-vector machine finds the line that separates the data most cleanly, which means that it is the greatest possible distance from points near the dividing line. In , although all the different lines separate the data, the one that does this best is the one labeled "Best." The only points necessary to determine where the line should be are the points closest to it, and these are known as the support vectors. After the dividing line has been found, classifying new items is just a matter of plotting them on the graph and seeing on which side of the line they fall. There's no need to go through the training data to classify new points once the line has been found, so classification is very fast. Support-vector machines, along with other linear classifiers that use vector dot-products, often take advantage of a technique called the kernel trick. To understand this, consider how the problem would change if the classification you were trying to predict was not position, but rather, whether the players would be appropriate for an amateur team in which the positions are often switched around. This is more interesting because the division is not linear. You don't want players who are too tall or too fast because they would make the game too difficult for others, but you don't want them to be too short or too slow either. shows what this might look like, where an O indicates that a player is appropriate for the team and an X indicates that he isn't. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! k-Nearest Neighbors Content preview·Buy reprint rights for this chapter covered the topic of numerical prediction using an algorithm called k-nearest neighbors (kNN), and used it to show how you could build models for predicting prices given a set of examples. The recommendation algorithm in for predicting how much someone would like a movie or a link was also a simple version of kNN. kNN works by taking a new item for which you want a numerical prediction and comparing it to a set of items for which it already has values. It finds the ones most similar to the item in question and averages their values to get a predicted value. shows a list of digital cameras, along with their megapixel rating, zoom power, and sale price. Table : Digital cameras and prices Camera Megapixels Zoom Price C1 7.1 3.8x $399 C2 5.0 2.4x $299 C3 6.0 4.0x $349 C4 6.0 12.0x $399 C5 10.0 3x $449 Suppose you want to guess the price for a new camera with a six megapixels and a 6x zoom lens. The first thing you'll need is a way to measure how similar two items are. used Euclidean distance, and you've seen many other distance metrics throughout this book, such as Pearson correlation and Tanimoto score. Using Euclidean distance in this example reveals that the closest item in the table is C3. To visualize this, imagine plotting the items on a chart with megapixels as the x-axis and zoom as the y-axis. The items themselves are identified by their prices in . Figure : Camera prices in zoom-megapixel space You could just take the price of $349 as the answer (it is the closest match, after all), but you'd never know if its price was just an anomaly. For this reason, it's better to take more than one of the best matches and average them. The k in k-nearest neighbors refers to how many of the top matches you use for averaging. For example, if you took the best three matches and averaged them, this would be kNN with k=3. An extension to the basic averaging is to use a weighted average based on how far away the neighbors are. A very close neighbor would count for more than an item that was further away. The weights would be in proportion to the total distance. covered different functions for determining the weights. In the example, you might use the price of $349 weighted most heavily, with the two $399 prices weighted less. For example: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Clustering Content preview·Buy reprint rights for this chapter Hierarchical clustering and K-means clustering are unsupervised learning techniques, meaning they don't require examples for training data because they don't attempt to make predictions. looked at how to take a list of top bloggers and automatically cluster them so you could see which ones naturally fell into groups that write about similar subjects or use similar words. Clustering works on any set of items that have one or more numerical properties. The example in used word counts for the different blogs, but any set of numbers can be used for clustering. To demonstrate how the hierarchical clustering algorithm works, consider a simple table of items (some letters of the alphabet) and some numerical properties (). Table : Simple table for clustering Item P1 P2 A 1 8 B 3 8 C 2 6 D 1.5 1 E 4 2 shows the process of clustering these items. In the first pane, the items have been plotted in two dimensions, with P1 on the x-axis and P2 on the y-axis. Hierarchical clustering works by finding the two items that are closest together and merging them into a cluster. In the second pane, you can see that the closest items, A and B, have been grouped together. The "location" of this cluster is the average of the two items in it. In the next pane, it turns out that the closest items are C and the new A-B cluster. This process continues until the final pane in which everything is contained in one big cluster. Figure : Process of hierarchical clustering This process creates a hierarchy, which can be illustrated as a dendrogram, a tree-like structure that shows which items and groups are closest together. The dendrogram for the example dataset is shown in . Figure : Dendrogram of clustered letters The two closest items, A and B, are connected together at the end. C is connected to the combination of A and B. From the dendrogram you can pick any branch point and decide if it is a group of interest. In , you saw branches consisting almost entirely of political blogs and other branches of technology blogs and so on. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Multidimensional Scaling Content preview·Buy reprint rights for this chapter Another method covered in and applied to blogs was multidimensional scaling. Like clustering, this is an unsupervised technique that does not attempt to make predictions, but instead makes it easier to understand how different items are related. It creates a lower-dimensional representation of a dataset where the distances are as close to the original dataset as possible. For display on a screen or paper, this usually means scaling data from many dimensions down to two dimensions. Imagine, for example, that you have the four-dimensional dataset shown in (every item has four associated values). Table : Simple four-dimensional table for scaling A 0.5 0.0 0.3 0.1 B 0.4 0.15 0.2 0.1 C 0.2 0.4 0.7 0.8 D 1.0 0.3 0.6 0.0 Using the Euclidean distance formula, you can get a distance value for every pair of items. For example, the distance between A and B is sqrt(0.1^2+0.15^2+0.1^2+0.0^2) = 0.2. The full matrix of all the pairwise distances is given in . Table : Sample distance matrix A B C D A 0.0 0.2 0.9 0.8 B 0.2 0.0 0.9 0.7 C 0.9 0.9 0.0 1.1 D 0.8 0.7 1.1 0.0 The goal here is to draw all the items in a two-dimensional chart so that the distances in two dimensions are as close as possible to their distances in four dimensions. All the items are placed randomly on the chart, and the current distances between all the items are calculated, as shown in . Figure : Distances between items For every pair of items, the target distance is compared to the current distance and an error term is calculated. Every item is moved a small amount closer or farther away in proportion to the error between the two items. shows the forces acting on item A. The distance between A and B in the chart is 0.5, but the target distance is only 0.2, so A has to be moved closer to B. At the same time, A is also being pushed away by C and D because it is too close. Figure : Forces acting on item A Every node is moved according the combination of all the other nodes pushing and pulling on it. Each time this is done, the difference between the current distances and the target distances should get a little smaller. This procedure is repeated many times until the total amount of error cannot be reduced by moving the items any more. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Non-Negative Matrix Factorization Content preview·Buy reprint rights for this chapter covered an advanced technique called non-negative matrix factorization (NMF), which is a way to break down a set of numerical observations into their component parts. This method was used to show how news stories could be composed of separate themes and how the trading volume of various stocks could be broken down into events that affected individual stocks or multiple stocks at once. This is also an unsupervised algorithm, since it helps characterize data rather than making predictions about categories or values. To understand what NMF does, consider the set of values shown in : Table : Simple table for NMF Observation Number A B 1 29 29 2 43 33 3 15 25 4 40 28 5 24 11 6 29 29 7 37 23 8 21 6 Assume that observations A and B are composed of some combination of two pairs of numbers (the features), but you don't know what those pairs are or how much of each pair (the weights) is used to create each observation. NMF can find possible values for both the features and the weights. When working with news stories in , the observations were the stories and the columns were the words in the stories. With stock trading volumes, the observations were the days and the columns were various stock tickers. In each case, the algorithm attempted to find a smaller number of parts that could be added together in different amounts to get these observations. One possible answer for the data in the table is that the two pairs are (3, 5) and (7, 2). Using these parts, you can see that the observations can be recreated by combining the pairs in different amounts, such as the following: 5*(3, 5) + 2*(7, 2) = (29, 29) 5*(3, 5) + 4*(7, 2) = (43, 33) This can also be viewed as a matrix multiplication, as shown in . Figure : Factorizing a dataset into weights and features The goal of NMF is to automatically find the features matrix and the weights matrix. To do this, it starts with random matrices for each and updates them according to the update rules. The rules generate four new update matrices. In these descriptions, the original matrix is referred to as the data matrix: Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Optimization Content preview·Buy reprint rights for this chapter Optimization, covered in , is a little different from the other methods; instead of working with a dataset, it attempts to select values that minimize the output of a cost function. showed several examples of cost functions, such as planning group travel using a combination of price and waiting time at the airport, assigning students to the most appropriate dorm, and optimizing the layout of a simple graph. Once the cost function was designed, the same algorithms could be used to solve these three different problems. Two algorithms were covered: simulated annealing and genetic algorithms. A cost function is any function that takes a guess at a solution and returns a value that is higher for worse solutions and lower for better solutions. Optimization algorithms use this function to test solutions and to search possible solutions for the best one. The cost functions you use with optimization often have many variables to consider, and it's not always clear which is the best one to change in order to improve the result. However, for illustration, consider a function with only one variable, defined as: y = 1/x * sin(x) shows the graph of this function. Figure : Graph of 1/x * sin x Because the function has only one variable, it's easy to see from the graph where the lowest point is. We're using this for illustration so you can see how the optimization works; in reality, you wouldn't be able to simply graph complicated functions with many variables in order to find the lowest point. What is interesting about this function is that it has several local minima. These are points that are lower than all the surrounding points but are not necessarily the lowest point overall. This means the problem can't necessarily be solved by trying a random solution and moving down the slope, because it can lead to getting stuck in a local minimum and never finding the global minimum. Simulated annealing, which was inspired by alloy cooling in physics, starts with a random guess at a solution. It tries to improve the solution by determining the cost for a similar solution that's a small distance away and in a random direction from the solution in question. If the cost is lower, this becomes the new solution. If the cost is higher, it becomes the new solution with a certain probability, depending on the current Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Appendix : Third-Party Libraries Content preview·Buy reprint rights for this chapter This book has introduced a number of third-party libraries that we used to collect, store, and analyze data. This appendix covers download and installation instructions, along with some examples of usage. Universal Feed Parser Python Imaging Library Beautiful Soup pysqlite NumPy matplotlib pydelicious Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Universal Feed Parser Content preview·Buy reprint rights for this chapter The Universal Feed Parser is a Python library written by Mark Pilgrim for parsing RSS and Atom feeds. This library is used throughout the book for easily downloading blog posts and articles from online news sources. The home page of the library is . The download page for the library is . Download the latest version of the file named feedparser-X.Y.zip. Extract the contents of the zip file into an empty directory. At a command prompt, enter: c:\download\feedparser>python setup.py install This will locate your Python installation and install the library there. After it is installed, you can enter import feedparser at your Python prompt to begin using it. Example usage of the library is provided at . Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Python Imaging Library Content preview·Buy reprint rights for this chapter The Python Imaging Library (PIL) is an open-source library that adds image creation and processing capabilities to Python. It supports a wide variety of drawing operations and file formats. The home page for this library is . PIL has a Windows installer available for download. On the library home page, scroll to the Downloads section and download the latest Windows executable for your version of Python. Run this file and follow the on-screen instructions. For platforms other than Windows, you'll need to build the library from the sources. The sources are available for download on the library's home page and will work with any recent version of Python. To install, download the sources for the latest version and enter the following at the command prompt, replacing 1.1.6 with the version you have downloaded: $ gunzip Imaging-1.1.6.tar.gz $ tar xvf Imaging-1.1.6.tar $ cd Imaging-1.1.6 $ python setup.py install This will compile the extensions and install the library in your Python directory. This example creates a small image, draws some lines, and writes a message. It then saves the image as a JPEG file. >>> from PIL import Image,ImageDraw >>> img=Image.new('RGB',(200,200),(255,255,255)) # 200×200 white background >>> draw=ImageDraw.Draw(img) >>> draw.line((20,50,150,80),fill=(255,0,0)) # Red line >>> draw.line((150,150,20,200),fill=(0,255,0)) # Green line >>> draw.text((40,80),'Hello!',(0,0,0)) # Black text >>> img.save('test.jpg','JPEG') # Save to test.jpg A more extensive set of examples is available at . Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Beautiful Soup Content preview·Buy reprint rights for this chapter Beautiful Soup is a Python parser for HTML and XML documents. It is designed to work with poorly written web pages. It is used in this book to create datasets from web sites that do not have APIs, and to find all the text on pages for indexing. The home page for this library is . Beautiful Soup is available as a single file source download. Near the bottom of the home page, there is a link to download BeautifulSoup.py. Simply download this and put it in either your working directory or your Python/Lib directory. This example parses the HTML of the Google home page, and shows how to extract elements from the DOM and search for links. >>> from BeautifulSoup import BeautifulSoup >>> from urllib import urlopen >>> soup=BeautifulSoup(urlopen('http://google.com')) >>> soup.head.title Google >>> links=soup('a') >>> len(links) 21 >>> links[0] iGoogle >>> links[0].contents[0] u'iGoogle' A more extensive set of examples is available at . Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! pysqlite Content preview·Buy reprint rights for this chapter pysqlite is a Python interface to the SQLite embedded database. Unlike traditional databases, an embedded database does not run in a separate server process, so there's much less to install and set up. SQLite also stores the entire database in a single file. This book uses pysqlite to illustrate how to persist some of the data being collected. The home page for pysqlite is . The home page has links to download a binary installer for Windows. Simply download this file and run it. It will ask you where your installation of Python is and will install itself there. For platforms other than Windows, you'll need to install pysqlite from the source. The sources are available as a tarball on the pysqlite home page. Download the latest version and enter the following at the command prompt, replacing 2.3.3 with the version you have downloaded: $ gunzip pysqlite-2.3.3.tar.gz $ tar xvf pysqlite-2.3.3.tar.gz $ cd pysqlite-2.3.3 $ python setup.py build $ python setup.py install This example creates a new table, adds a row to it, and commits the change. The table is then queried for the new row: >>> from pysqlite2 import dbapi2 as sqlite >>> con=sqlite.connect('test1.db') >>> con.execute('create table people (name,phone,city)') >>> con.execute('insert into people values ("toby","555-1212","Boston")') >>> con.commit( ) >>> cur=con.execute('select * from people') >>> cur.next( ) (u'toby', u'555-1212', u'Boston') Notice that field types are optional with SQLite. To make this work with a more traditional database, you might have to add field types to your declarations when creating tables. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! NumPy Content preview·Buy reprint rights for this chapter NumPy is a mathematical library for Python that provides an array object, linear algebra functions, and Fourier transforms. It is a very popular way to do scientific computing in Python and it's gaining popularity, in some cases replacing more purpose-built tools like MATLAB. NumPy is used in to implement the NMF algorithm. The home page for NumPy is . There is a Windows binary installer available for NumPy at . Download the .exe file that matches your version of Python and run it. It will ask you for the directory where Python is installed and will install itself there. On other platforms, NumPy can be installed from the sources, which can also be downloaded at . Download the tar.gz file appropriate for your version of Python. To install from the source, use the following, replacing 1.0.2 with the version you have downloaded: $ gunzip numpy-1.0.2.tar.gz $ tar xvf numpy-1.0.2.tar.gz $ cd numpy-1.0.2 $ python setup.py install This example shows you how to create matrices, multiply them together, and then transpose and flatten operations: >>> from numpy import * >>> a=matrix([[1,2,3],[4,5,6]]) >>> b=matrix([[1,2],[3,4],[5,6]]) >>> a*b matrix([[22, 28], [49, 64]]) >>> a.transpose( ) matrix([[1, 4], [2, 5], [3, 6]]) >>> a.flatten( ) matrix([[1, 2, 3, 4, 5, 6]]) Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! matplotlib Content preview·Buy reprint rights for this chapter matplotlib is a 2D graphics library for Python that is much better for creating mathematical graphs than the Python Imaging Library. The figures it produces are intended to be of high enough quality to be used in publications. Before installing matplotlib, you'll need to install NumPy, as described in the previous section. matplotlib has binary builds for all major platforms, including Windows, Mac OS X, RPM-based Linux distributions, and Debian-based distributions. You can find detailed instructions for installing matplotlib on any of the platforms at . This example will use orange circles to plot four points at (1,1), (2,4), (3,9), and (4,16). It will then save the output to a file and display it in a window on the screen. >>> from pylab import * >>> plot([1,2,3,4], [1,4,9,16], 'ro') [] >>> savefig('test1.png') >>> show( ) A great collection of usage examples is available at . Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! pydelicious Content preview·Buy reprint rights for this chapter pydelicious is a library for retrieving data from the del.icio.us social bookmarking site. del.icio.us has an official API that is used for some calls, but pydelicious adds some extra features that we used in to build the recommendation engine. Pydelicious is now hosted at Google code; you can find it at . Getting the latest version of pydelicious is easiest if you have the Subversion version control software installed. If you do, you can just enter the following at a command prompt: svn checkout http://pydelicious.googlecode.com/svn/trunk/pydelicious.py If you don't have Subversion, the files can be downloaded at . After you have the files, just run python setup.py install from the directory where you downloaded them. This will install pydelicious in your Python directory. pydelicious has a number of calls to get popular bookmarks or bookmarks for a specific user. It also allows you to add new bookmarks to your own account: >> import pydelicious >> pydelicious.get_popular(tag='programming') [{'count': '', 'extended': '', 'hash': '', 'description': u'How To Write Unmaintainable Code', 'tags': '', 'href': u'http://thc.segfault.net/root/phun/unmaintain.html', 'user': u'dorsia', 'dt': u'2006-08-19T09:48:56Z'}, {'count': '', 'extended': '', 'hash': '', 'description': u'Threading in C#', 'tags': '', 'href':u'http://www.albahari.com/threading/', etc... >> pydelicious.get_userposts('dorsia') [{'count': '', 'extended': '', 'hash': '', 'description': u'How To Write Unmaintainable Code', 'tags': '', 'href': u'http://thc.segfault.net/root/phun/unmaintain.html', 'user': u'dorsia', 'dt': u'2006-08-19T09:48:56Z'}, etc... >>> a = pydelicious.apiNew(user, passwd) >>> a.posts_add(url="http://my.com/", desciption="my.com", extended="the url is my.moc", tags="my com") True Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Appendix : Mathematical Formulas Content preview·Buy reprint rights for this chapter Throughout the book I have introduced a number of mathematical concepts. This appendix covers selected concepts and gives a description, relevant formulas, and code for each of them. Euclidean Distance Pearson Correlation Coefficient Weighted Mean Tanimoto Coefficient Conditional Probability Gini Impurity Entropy Variance Gaussian Function Dot-Products Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Euclidean Distance Content preview·Buy reprint rights for this chapter Euclidean distance finds the distance between two points in multidimensional space, which is the kind of distance you measure with a ruler. If the points are written as (p1, p2, p3, p4, ...) and (q1, q2, q3, q4, ...), then the formula for Euclidean distance can be expressed as shown in . Figure : Euclidean distance A clear implementation of this formula is shown here: def euclidean(p,q): sumSq=0.0 # add up the squared differences for i in range(len(p)): sumSq+=(p[i]-q[i])**2 # take the square root return (sumSq**0.5) Euclidean distance is used in several places in this book to determine how similar two items are. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Pearson Correlation Coefficient Content preview·Buy reprint rights for this chapter The Pearson correlation coefficient is a measure of how highly correlated two variables are. It is a value between 1 and −1, where 1 indicates that the variables are perfectly correlated, 0 indicates no correlation, and −1 means they are perfectly inversely correlated. shows the Pearson correlation coefficient. Figure : Pearson correlation coefficient This can be implemented with the following code: def pearson(x,y): n=len(x) vals=range(n) # Simple sums sumx=sum([float(x[i]) for i in vals]) sumy=sum([float(y[i]) for i in vals]) # Sum up the squares sumxSq=sum([x[i]**2.0 for i in vals]) sumySq=sum([y[i]**2.0 for i in vals]) # Sum up the products pSum=sum([x[i]*y[i] for i in vals]) # Calculate Pearson score num=pSum-(sumx*sumy/n) den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5 if den==0: return 0 r=num/den return r We used the Pearson correlation in to calculate the level of similarity between people's preferences. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Weighted Mean Content preview·Buy reprint rights for this chapter The weighted mean is a type of average that has a weight for every observation being averaged. It is used in this book to make numerical predictions based on similarity scores. The weighted mean has the formula shown in , where x1...xn are the observations and w1...wn are the weights. Figure : Weighted mean A simple implementation of this formula that takes a list of values and weights is given here: def weightedmean(x,w): num=sum([x[i]*w[i] for i in range(len(w))]) den=sum([w[i] for i in range(len(w))]) return num/den In , weighted means are used to predict how much you'll enjoy a movie. This is done by calculating an average rating from other people, weighted by how similar their tastes are to yours. In , weighted means are used to predict prices. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Tanimoto Coefficient Content preview·Buy reprint rights for this chapter The Tanimoto coefficient is a measure of the similarity of two sets. It is used in this book to calculate how similar two items are based on lists of properties. If you have two sets, A and B, where: A = [shirt, shoes, pants, socks] B = [shirt, skirt, shoes] Then the intersection (overlapping) set, which I'll call C, is [shirt, shoes]. The Tanimoto coefficient is shown in , where Na is the number of items in A, Nb is the number of items in B, and Nc is the number of items in C, the intersection. In this case the Tanimoto coefficient is 2/(4+3-2) = 2/5 = 0.4. Figure : Tanimoto coefficient Here is a simple function that takes two lists and calculates the Tanimoto coefficient: def tanimoto(a,b): c=[v for v in a if v in b] return float(len(c))/(len(a)+len(b)-len(c)) The Tanimoto coefficient is used in to calculate similarities between people for clustering. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Conditional Probability Content preview·Buy reprint rights for this chapter Probability is a way of measuring how likely something is to occur. It is usually written as Pr(A) = x, where A is an event. For example, we might say that there's a 20 percent chance of rain today, which would be written as Pr(rain) = 0.2. If we were to note that it's already cloudy right now, then we might conclude there's a higher chance of rain later today. This is called conditional probability, which is the chance of A given that we know B. This is written as Pr(A | B), so in this case, it's Pr(rain | cloudy). The formula for conditional probability is the probability of both events happening divided by the chance of the given condition, as shown in . Figure : Conditional probability So if 10 percent of the time it's cloudy in the morning and rains later, and 25 percent of the time it's cloudy in the morning, then Pr(rain | cloudy) = 0.1/0.25 = 0.4. Since this is just a simple division, no function is given here. Conditional probability is used in for document filtering. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Gini Impurity Content preview·Buy reprint rights for this chapter Gini impurity is a measure of how impure a set is. If you have a set of items, such as [A, A, B, B, B, C], then Gini impurity tells you the probability that you would be wrong if you picked one item and randomly guessed its label. If the set were all As, you would always guess A and never be wrong, so the set would be totally pure. shows the formula for Gini impurity. Figure : Gini impurity This function takes a list of items and calculates the Gini impurity: def giniimpurity(l): total=len(l) counts={} for item in l: counts.setdefault(item,0) counts[item]+=1 imp=0 for j in l: f1=float(counts[j])/total for k in l: if j==k: continue f2=float(counts[k])/total imp+=f1*f2 return imp In , Gini impurity is used in decision tree modeling to determine if dividing a set makes it more pure. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Entropy Content preview·Buy reprint rights for this chapter Entropy is another way to see how mixed a set is. It comes from information theory, and it measures the amount of disorder in a set. Loosely defined, entropy is how surprising a randomly selected item from the set is. If the entire set were As, you would never be surprised to see an A, so the entropy would be 0. The formula is shown in . Figure : Entropy This function takes a list of items and calculates the entropy: def entropy(l): from math import log log2=lambda x:log(x)/log(2) total=len(l) counts={} for item in l: counts.setdefault(item,0) counts[item]+=1 ent=0 for i in counts: p=float(counts[i])/total ent-=p*log2(p) return ent In , Entropy is used in decision tree modeling to determine if dividing a set reduces the amount of disorder. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Variance Content preview·Buy reprint rights for this chapter Variance measures how much a list of numbers varies from the mean (average) value. It is frequently used in statistics to measure how large the differences are in a set of numbers. It is calculated by averaging the squared difference of every number from the mean, as shown by the formula in . Figure : Variance This is a simple function to implement: def variance(vals): mean=float(sum(vals))/len(vals) s=sum([(v-mean)**2 for v in vals]) return s/len(vals) In , variance is used in regression tree modeling to determine how to best divide a set to make the subsets more tightly distributed. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Gaussian Function Content preview·Buy reprint rights for this chapter The Gaussian function is the probability density function of the normal curve. It is used in this book as a weighting function for weighted k-nearest neighbors, since it starts high and falls off quickly but never reaches 0. The formula for a Gaussian with a variance of σis shown in . Figure : Gaussian function This can be implemented as a two-line function directly translating the formula: import math def gaussian(dist,sigma=10.0): exp=math.e**(-dist**2/(2*sigma**2)) return (1/(sigma*(2*math.pi)**.5))*exp In , the Gaussian function is given as a possible weighting function for building a numerical predictor. Additional content appearing in this section has been removed. Purchase this book now or read it online at Safari to get the whole thing! Dot-Products