AI Blueprints: Implementing content-based recommendations using Python
Want to build successful AI business applications? In this excerpt from “AI Blueprints”, Dr. Joshua Eckroth explains how developers can use Python to create content-based recommendations in the enterprise. Get ready for your AI recommendations with this proven Python workflow!
This is an excerpt from Packt’s latest book, AI Blueprints, written by Dr. Joshua Eckroth.
In this article, we’ll have a look at how you can implement a content-based recommendation system using Python and the scikit-learn library. But before diving straight into this, it’s important to have some prerequisite knowledge of the different ways by which recommendation systems can recommend an item to users.
There are two ways to recommend an item:
- Content-based: A content-based recommendation finds items similar to a given item by examining the item’s properties, such as its title or description, category, or dependencies on other items (for example, electronic toys require batteries). These kinds of recommendations do not use any information about ratings, purchases, or other user behavior. For example, suppose we know that a user is viewing a particular camera or a particular blues musician. We can generate recommendations by examining that item’s properties and the user’s stated interests: a database query could select lenses compatible with the camera, or musicians in the same genre or in a genre the user has selected in their profile. Similarly, we can examine the items’ descriptions and find close matches to the item the user is viewing. These are all kinds of content-based recommendation.
- Collaborative: Collaborative filtering uses feedback from other users to help determine the recommendation for this user. Other users may contribute ratings, “likes,” purchases, views, and so on. Sometimes, websites, such as Amazon, will include a phrase like, “Customers who bought this item also bought…” Such a phrase is a clear indication of collaborative filtering. In practice, collaborative filtering is a means for predicting how much the user in question will like each item, and then filtering down to the few items with the highest-scoring predictions.
Now that we have an idea of what content-based recommendation is, let’s go through its implementation.
Suppose we wish to find similar items by their titles and descriptions. In other words, we want to examine the words used in each item to find items with similar words. We will represent each item as a vector and compare them with a distance metric to see how similar they are, where a smaller distance means they are more similar.
We can use the bag of words technique to convert an item’s title and description into a vector of numbers. This approach is common for any situation where text needs to be converted to a vector. Furthermore, each item’s vector will have the same dimension (same number of values), so we can easily compute the distance metric on any two item vectors.
The bag of words technique constructs a vector for each item that has as many values as there are unique words among all the items. If there are, say, 1000 unique words mentioned in the titles and descriptions of 100 items, then each of the 100 items will be represented by a 1000-dimension vector. The values in the vector are the counts of the number of times an item uses each particular word. If we have an item vector that starts <3, 0, 2, …>, and the 1000 unique words are “aardvark, aback, abandoned, …”, then we know the item uses the word aardvark 3 times, the word aback 0 times, the word abandoned 2 times, and so on. Also, we often eliminate “stop words,” or common words in the English language, such as “and,” “the,” or “get,” that have little meaning.
Given two item vectors, we can compute their distance in multiple ways. One common way is Euclidean distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where $x_i$ and $y_i$ refer to each value from the first and second items’ vectors. Euclidean distance is less accurate if the item titles and descriptions have a dramatically different number of words, so we often use cosine similarity instead. This metric measures the angle between the vectors. It is easiest to visualize when our vectors have two dimensions, but it works equally well in any number of dimensions. In two dimensions, the angle between two item vectors is the angle between the lines connecting the origin (0, 0) and each item’s vector values $\langle x, y \rangle$. Cosine similarity is calculated as

$$d(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\|x\| \, \|y\|}$$

where $x$ and $y$ are $n$-dimensional vectors and $\|x\|$ and $\|y\|$ refer to the “magnitude” of a vector, that is, its distance from the origin:

$$\|x\| = \sqrt{\sum_{i=1}^{n} x_i^2}$$
Unlike Euclidean distance, larger values are better with cosine similarity, because a larger value indicates the angle between the two vectors is smaller, so the vectors are closer, or more similar, to each other (recall that the graph of cosine starts at 1.0 at angle 0.0). Two identical vectors will have a cosine similarity of 1.0. It is called cosine similarity because we can recover the actual angle by taking the inverse cosine of $d$: $\theta = \cos^{-1}(d)$. We have no reason to do so, however, since $d$ works just fine as a similarity value.
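Cosine similarity is easy to compute directly with NumPy. Here is a minimal sketch, reusing the item vector <3, 0, 2, …> from earlier (the second vector is made up for the example):

```python
import numpy as np

x = np.array([3.0, 0.0, 2.0])   # item uses "aardvark" 3 times, "abandoned" 2 times
y = np.array([1.0, 1.0, 1.0])

# cosine similarity: dot product divided by the product of the magnitudes
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```

Here `cos_sim` comes out to roughly 0.80, meaning the two vectors point in fairly similar directions; computing it for `x` against itself gives exactly 1.0.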
Now we have a way of representing each item’s title and description as a vector, and we can compute how similar two vectors are with cosine similarity. Unfortunately, we have a problem. Two items will be considered highly similar if they use many of the same words, even if those words are very common. For example, if all video items in our store have the words “Video” and “[DVD]” at the end of their titles, then every video might be considered similar to every other. To resolve this problem, we want to penalize (reduce) the values in the item vectors that represent common words.
A popular way to penalize common words in bag of words vectors is known as Term Frequency-Inverse Document Frequency (TF-IDF). We recompute each value by multiplying it by a weight that factors in how common the word is. There are multiple variations of this reweighting formula, but a common one works as follows. Each value $x_i$ in the vector is changed to

$$\hat{x}_i = x_i \left(1 + \log \frac{N}{F(x_i)}\right)$$

where $N$ is the number of items (say, 100 total items) and $F(x_i)$ gives the count of items (out of the 100) that contain the word $x_i$. A word that is common will have a smaller $N / F(x_i)$ factor, so its weighted value $\hat{x}_i$ will be smaller than the original $x_i$. We use the log() function to ensure the multiplier does not get excessively large for uncommon words. It is worth noting that $N / F(x_i) \geq 1$, and in the case when a word is found in every item, $\log(N / F(x_i)) = \log(1) = 0$, so the $1 +$ in front of the log() ensures the word is still counted by leaving $x_i$ unchanged.
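This reweighting can be sketched in a few lines of NumPy. The count matrix below is toy data; note that scikit-learn’s TfidfTransformer, used later in the pipeline, applies a slightly different smoothed variant of this formula:

```python
import numpy as np

# rows are items, columns are words; values are raw word counts (toy data)
counts = np.array([[3, 0, 2],
                   [1, 1, 0],
                   [1, 2, 1],
                   [2, 1, 1]], dtype=float)

N = counts.shape[0]               # number of items
F = (counts > 0).sum(axis=0)      # how many items contain each word
weights = 1 + np.log(N / F)       # the 1 + log(N / F(x_i)) multiplier
tfidf = counts * weights          # reweighted item vectors
```

The first word appears in all four items, so N / F(x_i) = 1 and its multiplier is exactly 1, leaving those counts unchanged; the rarer words get multipliers greater than 1, so they dominate the similarity computation.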
Now that we have properly weighted item vectors and a similarity metric, the last task is to find similar items using this information. Suppose we are given a query item and want to find the three most similar items, that is, the items with the largest cosine similarity to the query item. This is known as a nearest neighbor search. If coded naively, a nearest neighbor search requires computing the similarity from the query item to every other item. A better approach is to use a highly efficient library such as Facebook’s faiss library. faiss organizes the vectors into an efficient index, and it can also use the GPU to compute similarities in parallel and find nearest neighbors extremely quickly.
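For comparison, the naive full-scan search described above takes only a few lines of NumPy (with random stand-in vectors, normalized to unit length so the dot product equals the cosine similarity):

```python
import numpy as np

rng = np.random.default_rng(1)
items = rng.standard_normal((50, 8))                    # 50 toy item vectors
items /= np.linalg.norm(items, axis=1, keepdims=True)   # unit length

query = items[5]
sims = items @ query               # cosine similarity of every item to the query
best = np.argsort(-sims)[1:4]      # top 3 neighbors, skipping the query itself
```

This computes all 50 similarities for every single query; faiss exists precisely to avoid that full scan when there are millions of items.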
There is one last complication. The bag of words vectors, even with stop words removed, are very large; it is not uncommon to have vectors with 10k to 50k values, given how many English words may be used in item titles and descriptions. The faiss library does not work well with such large vectors. We can limit the number of words, or number of “features,” with a parameter to the bag of words processor. However, this parameter keeps only the most common words, which is not necessarily what we want; instead, we want to keep the most important words. We can reduce the size of the vectors to just 100 values using matrix factorization, specifically the singular-value decomposition (SVD).
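The SVD reduction step might look like the following sketch, where the sparse matrix is a random stand-in for a real TF-IDF-weighted bag of words matrix:

```python
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import random as sparse_random

# stand-in for a 200-item x 5000-word TF-IDF matrix
X = sparse_random(200, 5000, density=0.01, random_state=0)

svd = TruncatedSVD(n_components=100, random_state=0)
X_small = svd.fit_transform(X)     # each item is now a dense 100-value vector
```

TruncatedSVD keeps the 100 directions of greatest variance, so the reduced vectors preserve the most important word patterns rather than just the most common words.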
With all this in mind, we can use some simple Python code and the scikit-learn library to implement a content-based recommendation system. In this example, we will use the Amazon Review dataset, which contains 66 million reviews of 6.8 million products, gathered from May 20, 1996, to July 23, 2014. (Note that due to memory constraints, we will process only the first 3.0 million products.) For content-based recommendation, we will ignore the reviews and just use the product titles and descriptions. The product data is made available in a JSON file, where each line is a separate JSON string for one product. We extract the title and description and add them to a list. We also add the product identifier (“asin”) to a list. Then we feed this list of strings into scikit-learn’s CountVectorizer to construct the bag of words vector for each string, recalculate these vectors using TF-IDF, and reduce the size of the vectors using singular-value decomposition. These three steps are collected into a scikit-learn “pipeline,” so we can run a single fit_transform function to execute all of the steps in sequence.
```python
import json

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# bag of words -> TF-IDF reweighting -> SVD reduction, as one pipeline
pipeline = make_pipeline(
    CountVectorizer(stop_words='english', max_features=10000),
    TfidfTransformer(),
    TruncatedSVD(n_components=128))

product_asin = []
product_text = []
with open('metadata.json', encoding='utf-8') as f:
    for line in f:
        try:
            p = json.loads(line)
            s = p['title']
            if 'description' in p:
                s += ' ' + p['description']
            product_text.append(s)
            product_asin.append(p['asin'])
        except:
            pass

d = pipeline.fit_transform(product_text, product_asin)
```
d is a matrix of all of the item vectors. Next, we configure faiss for an efficient nearest neighbor search. Recall that we wish to take our reduced bag-of-words vectors and find items similar to a given item using cosine similarity on these vectors. The three most similar vectors will give us our content-based recommendations:
```python
import faiss

ncols = d.shape[1]   # dimensionality of the reduced item vectors (128)

gpu_resources = faiss.StandardGpuResources()
index = faiss.GpuIndexIVFFlat(gpu_resources, ncols, 400,
                              faiss.METRIC_INNER_PRODUCT)
```
Note that faiss may also be configured without a GPU:
```python
quantizer = faiss.IndexFlat(ncols)
index = faiss.IndexIVFFlat(quantizer, ncols, 400,
                           faiss.METRIC_INNER_PRODUCT)
```
Then we “train” faiss so that it learns the distribution of the values in the vectors and then “add” our vectors. (Technically, we only need to train on a representative subset of the full dataset.)
Finally, we can find the nearest neighbor by “searching” the index. A search can be performed on multiple items at once, and the result is a list of distances and item indexes. We will use the indexes to retrieve each item’s asin and title/description. For example, suppose we want to find a neighbor of a particular item:
```python
# find 3 neighbors of item #5 (faiss expects float32 queries)
distances, indexes = index.search(d[5:6].astype('float32'), 3)
for idx in indexes[0]:   # search returns one row of results per query item
    print((product_asin[idx], product_text[idx]))
```
After processing 3.0 million products, here are some example recommendations. Italicized recommendations are less-than-ideal:
It is clear this approach mostly works. Content-based recommendations are an important kind of recommendation, particularly for new users who do not have a purchase history. Many recommendation systems will mix in content-based recommendations with collaborative filtering recommendations. Content-based recommendations are good at suggesting related items based on the item itself, while collaborative filtering recommendations are best for suggesting items that are often purchased by the same people but otherwise have no intrinsic relatedness, such as camping gear and travel guidebooks.
In this article, we looked at the different ways of recommending an item to a user and learned how to develop a content-based recommendation system that finds similar items based on the items’ titles and descriptions.
AI Blueprints published by Packt and written by Dr. Joshua Eckroth gives you a working framework and the techniques to build your own successful AI business applications.
You can read a preview of the book here.