An introduction to using neural networks

Deep learning for search: Using word2vec

Tommaso Teofili

What is word2vec? This neural network algorithm has a number of interesting use cases, especially for search. In this excerpt from Deep Learning for Search, Tommaso Teofili explains how you can use word2vec to map datasets with neural networks.

Word2vec is a neural network algorithm. Although it’s fairly easy to understand its basics, it’s also fascinating to see the good results, in terms of capturing the semantics of words in a text, that you can get out of it. But what does it do, and how is it useful for our synonym expansion use case?

  • Word2vec takes a piece of text and outputs a series of vectors, one for each word in the text
  • When the output vectors of word2vec are plotted on a two-dimensional graph, vectors whose words are similar, in terms of semantics, are close to one another
  • We can use distance measures like cosine distance to find the most similar words with respect to a certain word
  • We can use this technique to find synonyms of a given word

What we want to do is set up a word2vec model, feed it with the text of the song lyrics we want to index, get an output vector for each word, and use those vectors to find synonyms.

You might have heard about the usage of vectors in the context of search. In a sense, word2vec also generates a vector space model whose vectors (one for each word) are weighted by the neural network during the learning process. Let’s use an example of the Aeroplane song; if we feed its text to word2vec we’ll have a vector for each of our words:

Listing 1: Two-dimensional vectors for Aeroplane words

0.7976110753441061, -1.300175666666296, i
-1.1589942649711316, 0.2550385962680938, like
-1.9136814615251492, 0.0, pleasure
-0.178102361461314, -5.778459658617458, spiked
0.11344064895365787, 0.0, with
0.3778008406249243, -0.11222894354254397, pain
-2.0494382050792344, 0.5871714329463343, and
-1.3652666102221962, -0.4866885862322685, music
-12.878251690899361, 0.7094618209959707, is
0.8220355668636578, -1.2088098678855501, my
-0.37314503461270637, 0.4801501371764839, aeroplane 

We can see them in the coordinate plane.

Image 1: Coordinate plane for Aeroplane words

In the example output above I used two dimensions to make the vectors easy to plot on a graph. In practice it’s common to use a higher number, like one hundred or more, because more dimensions allow the vectors to capture more information as the amount of data grows. Dimensionality reduction algorithms like Principal Component Analysis or t-SNE can then be used to obtain two- or three-dimensional vectors that are easier to plot.

If we use cosine similarity to measure how close the generated vectors are to one another, we find some interesting results.
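To make the measure concrete, here is a minimal sketch in plain Java that computes cosine similarity over a few of the two-dimensional vectors from Listing 1 and ranks the nearest words. The class and method names (CosineSketch, cosine, nearest, sampleVectors) are illustrative and not part of any library; cosine distance is simply one minus this similarity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helper (not part of any library): cosine similarity over word vectors. */
final class CosineSketch {

    /** Cosine of the angle between a and b: dot(a, b) / (|a| * |b|). */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the k words whose vectors are most similar to the vector of the given word. */
    static List<String> nearest(Map<String, double[]> vectors, String word, int k) {
        double[] query = vectors.get(word);
        List<String> candidates = new ArrayList<>(vectors.keySet());
        candidates.remove(word);
        // Sort by descending similarity to the query vector.
        candidates.sort((x, y) -> Double.compare(
                cosine(query, vectors.get(y)), cosine(query, vectors.get(x))));
        return candidates.subList(0, Math.min(k, candidates.size()));
    }

    /** A few of the two-dimensional vectors from Listing 1. */
    static Map<String, double[]> sampleVectors() {
        Map<String, double[]> v = new LinkedHashMap<>();
        v.put("i", new double[]{0.7976110753441061, -1.300175666666296});
        v.put("my", new double[]{0.8220355668636578, -1.2088098678855501});
        v.put("pleasure", new double[]{-1.9136814615251492, 0.0});
        v.put("with", new double[]{0.11344064895365787, 0.0});
        v.put("music", new double[]{-1.3652666102221962, -0.4866885862322685});
        v.put("aeroplane", new double[]{-0.37314503461270637, 0.4801501371764839});
        return v;
    }

    public static void main(String[] args) {
        Map<String, double[]> v = sampleVectors();
        System.out.println("cos(i, my) = " + cosine(v.get("i"), v.get("my")));
        System.out.println("nearest to 'i': " + nearest(v, "i", 2));
    }
}
```

Vectors pointing in almost the same direction (like i and my here) score close to 1, while vectors pointing in opposite directions score close to -1.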

Listing 2: word2vec similarity on Aeroplane text

music -> song, view
looking -> view, better
in -> the, like
sitting -> turning, could 

As you can see, we extracted the two nearest vectors for a few random words; some results are good, some not so much.

  • music and song are close terms in semantics; we could even say they’re synonyms, but it’s not the same for view
  • looking and view are related, but better has little to do with looking
  • in, the and like aren’t close to each other
  • sitting and turning are both verbs in the -ing form but their semantics are loosely coupled; could is still a verb but hasn’t much more to do with sitting

What’s the problem here; is word2vec not up to the task? Two things are at play here:

  • The number of dimensions (2) of the generated word vectors (or word embeddings) is probably too low
  • Feeding the word2vec model with the text of a single song probably doesn’t provide enough context for each of the words to come up with an accurate representation; the model needs more examples of the contexts in which the words better and view occur.

Let’s assume we again build the word2vec model, this time by using one hundred dimensions and a larger set of song lyrics taken from the Billboard Hot 100 Dataset, found here.

Listing 3: word2vec similarity with 100 dimensions and a larger dataset

music -> song, sing
view -> visions, gaze
sitting -> hanging, lying
in -> with, into
looking -> lookin, lustin 

We can see now that the results are much better and more appropriate: we can use almost all of them as synonyms in the context of search. You can imagine using such a technique either at query or indexing time. There’d be no more dictionaries or vocabularies to keep up to date; the search engine could learn to generate synonyms from the data it handles.

A couple of questions you might have right about now: how does word2vec work? How can I integrate it, in practice, in my search engine? The original paper (found here) from Tomas Mikolov and others describes two different neural network models for learning such word representations: one is called Continuous Bag of Words and the other is called Continuous Skip-gram Model.

Word2vec performs unsupervised learning of word representations, which is good because no labeled data is needed; these models do need to be fed a sufficiently large, properly encoded text. The main concept behind word2vec is that the neural network is given a piece of text, which is split into fragments of a certain size (also called the window). Every fragment is fed to the network as a pair of target word and context. In the case below the target word is aeroplane and the context is composed of the words music, is, my.

Image 2: Model of word2vec

The hidden layer of the network contains a set of weights (in the case above, 11, the number of neurons in the hidden layer) for each of the words. These vectors are used as the word representations when learning ends.

An important trick about word2vec is that we don’t care too much about the outputs of the neural network. Instead, we extract the internal state of the hidden layer at the end of the training phase, which yields exactly one vector representation for each word.

During such training, a portion of each fragment is used as the target word while the rest is used as context. In the case of the Continuous Bag of Words model, the target word is used as the output of the network, and the remaining words of the text fragment (the context) are used as inputs. It’s the opposite in the Continuous Skip-gram Model: the target word is used as input and the context words as outputs (as in the example above).

For example, given the text “she keeps Moet et Chandon in her pretty cabinet let them eat cake she says” from the song “Killer Queen” by Queen and a window of 5, a word2vec model based on CBOW receives a sample for each five-word fragment in there. For the fragment | she | keeps | moet | et | chandon |, the input is made of the words | she | keeps | et | chandon | and the output consists of the word moet.
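The fragment splitting can be sketched in a few lines of plain Java. This is only an illustration under two assumptions that the text doesn’t spell out: the window slides one word at a time, and the middle word of each fragment is the target (which matches the moet example). CbowSamples and Sample are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch: cut a text into fixed-size fragments and split each into target and context. */
final class CbowSamples {

    /** One CBOW training sample: the context words are the input, the target word is the expected output. */
    static final class Sample {
        final List<String> context;
        final String target;

        Sample(List<String> context, String target) {
            this.context = context;
            this.target = target;
        }
    }

    /** Slides a window over the tokens; the middle word of each fragment is taken as the target. */
    static List<Sample> samples(String text, int window) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<Sample> result = new ArrayList<>();
        for (int start = 0; start + window <= tokens.length; start++) {
            int middle = start + window / 2;
            List<String> context = new ArrayList<>();
            for (int i = start; i < start + window; i++) {
                if (i != middle) {
                    context.add(tokens[i]);
                }
            }
            result.add(new Sample(context, tokens[middle]));
        }
        return result;
    }

    public static void main(String[] args) {
        String lyrics = "she keeps Moet et Chandon in her pretty cabinet let them eat cake she says";
        for (Sample s : samples(lyrics, 5)) {
            System.out.println(s.context + " -> " + s.target);
        }
    }
}
```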

Image 3: CBOW model based on Killer Queen

As you can see from the figure above, the neural network is composed of an input layer, a hidden layer and an output layer. This kind of neural network, with only one hidden layer, is called shallow, as opposed to the ones having more than one hidden layer, which are called deep neural networks.

The neurons in the hidden layer have no activation function; they linearly combine weights and inputs (multiply each input by its weight and sum all of these results together). For each input word, the input layer has a number of neurons equal to the number of distinct words in the text, because word2vec requires each word to be represented as a one-hot encoded vector.

Now let’s see what a one-hot encoded vector looks like. Imagine we’ve got a dataset with three words [cat, dog, mouse]; we’ll have three vectors, each one of them having all the values set to zero except one, which is set to one, and it’s the one that identifies that specific word.

Listing 4: 3 one-hot encoded vectors

dog   : [0,0,1]
cat   : [0,1,0]
mouse : [1,0,0] 

If we add the word ‘lion’ to the dataset, the one-hot encoded vectors for this dataset have 4 dimensions:

Listing 5: 4 one-hot encoded vectors

lion  : [0,0,0,1]
dog   : [0,0,1,0]
cat   : [0,1,0,0]
mouse : [1,0,0,0] 
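As a quick illustration, the encoding can be produced with a few lines of plain Java. OneHotSketch is a hypothetical helper, and the vocabulary order [mouse, cat, dog, lion] is chosen only so that the result matches Listing 5.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: build one-hot vectors for a small vocabulary. */
final class OneHotSketch {

    /** The i-th word of the vocabulary gets a vector with a 1 at position i and 0 elsewhere. */
    static Map<String, int[]> encode(List<String> vocabulary) {
        Map<String, int[]> encoded = new LinkedHashMap<>();
        int size = vocabulary.size();
        for (int i = 0; i < size; i++) {
            int[] vector = new int[size]; // all zeros by default
            vector[i] = 1;
            encoded.put(vocabulary.get(i), vector);
        }
        return encoded;
    }

    public static void main(String[] args) {
        Map<String, int[]> vectors = encode(Arrays.asList("mouse", "cat", "dog", "lion"));
        vectors.forEach((word, vector) -> System.out.println(word + " : " + Arrays.toString(vector)));
    }
}
```

Every new word adds one dimension to every vector, which is why the vectors grow with the vocabulary.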

If you have one hundred distinct words in your input text, each word is represented as a 100-dimensional vector. Consequently, in the CBOW model, you’ll have one hundred input neurons multiplied by the window parameter minus one. If you’ve got a window of 4, you’ll have 300 input neurons.

The hidden layer instead has a number of neurons equal to the desired dimensionality of the resulting word vectors. This is a parameter that must be set by whoever sets up the network.

The size of the output layer is equal to the number of distinct words in the input text, in this example one hundred. A word2vec CBOW model for an input text with one hundred distinct words, dimensionality equal to fifty and window parameter set to 4 has 300 input neurons, fifty hidden neurons and one hundred output neurons.
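The arithmetic above fits in a tiny helper. This is just a sketch of the layer sizes as described in this article; CbowLayerSizes is a hypothetical name.

```java
/** Illustrative sketch: layer sizes of the CBOW network as described in the text. */
final class CbowLayerSizes {
    final int input;
    final int hidden;
    final int output;

    private CbowLayerSizes(int input, int hidden, int output) {
        this.input = input;
        this.hidden = hidden;
        this.output = output;
    }

    /**
     * vocabularySize: number of distinct words; dimensions: size of the word vectors;
     * window: fragment size, of which one word is the target and the rest is context.
     */
    static CbowLayerSizes of(int vocabularySize, int dimensions, int window) {
        int input = vocabularySize * (window - 1); // one one-hot vector per context word
        return new CbowLayerSizes(input, dimensions, vocabularySize);
    }

    public static void main(String[] args) {
        CbowLayerSizes sizes = of(100, 50, 4);
        System.out.println(sizes.input + " input, " + sizes.hidden + " hidden, " + sizes.output + " output");
    }
}
```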

In the word2vec CBOW model, inputs are propagated through the network by first multiplying the one-hot encoded vectors of the input words by their input-to-hidden weights; you can imagine a matrix containing a weight for each connection between an input and a hidden neuron. The results are then combined (multiplied) with the hidden-to-output weights, producing the outputs, which are passed through a softmax function. Softmax “squashes” a K-dimensional vector (our output vector) of arbitrary real values into a K-dimensional vector of real values in the range (0, 1) that add up to 1, representing a probability distribution. The network is telling us the probability of each output word being selected, given the context (the network input).
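Here is a minimal sketch of that forward pass in plain Java. It makes one simplifying assumption that differs from the layout described above: all context positions share a single input-to-hidden matrix and their rows are averaged, as the original word2vec CBOW does, rather than each position having its own block of input neurons. CbowForward and its parameter names are illustrative.

```java
/** Illustrative sketch of a CBOW forward pass with shared input weights (an assumption). */
final class CbowForward {

    /** Softmax: exponentiate and normalise so the values lie in (0, 1) and sum to 1. */
    static double[] softmax(double[] z) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : z) {
            max = Math.max(max, v);
        }
        double sum = 0;
        double[] out = new double[z.length];
        for (int i = 0; i < z.length; i++) {
            out[i] = Math.exp(z[i] - max); // subtracting the max avoids overflow
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    /**
     * inputToHidden is [vocabulary][dimensions], hiddenToOutput is [dimensions][vocabulary],
     * contextIds holds the vocabulary indices of the context words.
     * Returns a probability for each word of the vocabulary.
     */
    static double[] forward(double[][] inputToHidden, double[][] hiddenToOutput, int[] contextIds) {
        int dims = hiddenToOutput.length;
        int vocab = hiddenToOutput[0].length;
        double[] hidden = new double[dims];
        // Multiplying a one-hot vector by the weight matrix selects one row, so we read rows directly.
        for (int id : contextIds) {
            for (int d = 0; d < dims; d++) {
                hidden[d] += inputToHidden[id][d] / contextIds.length;
            }
        }
        // No activation function in the hidden layer: just a linear combination.
        double[] scores = new double[vocab];
        for (int w = 0; w < vocab; w++) {
            for (int d = 0; d < dims; d++) {
                scores[w] += hidden[d] * hiddenToOutput[d][w];
            }
        }
        return softmax(scores);
    }
}
```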

After this forward propagation, the back-propagation learning algorithm adjusts the weights of each neuron in the different layers a little bit, so that the network produces a more accurate result for each new fragment.

With the objective of reducing the output error, the network comes up with a probability distribution over the output words, which is compared with the actual target word the network was given; the difference is used to adjust the weights going backwards. After the learning process has been completed for all the text fragments with the configured window, the weights connecting each word’s one-hot input to the hidden layer give the vector representation for that word.
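To give a feeling for what “adjusting the weights going backwards” means, here is a deliberately partial sketch in plain Java: a single gradient-descent step on the hidden-to-output weights only, for one sample. Full back-propagation would also update the input-to-hidden weights; the class name, the learning rate and the plain gradient-descent update are assumptions made for illustration.

```java
/** Illustrative sketch: one gradient-descent step on the hidden-to-output weights. */
final class OutputWeightUpdate {

    /** Probabilities over the vocabulary for a given hidden state. */
    static double[] predict(double[] hidden, double[][] hiddenToOutput) {
        int vocab = hiddenToOutput[0].length;
        double[] out = new double[vocab];
        double sum = 0;
        for (int w = 0; w < vocab; w++) {
            double score = 0;
            for (int d = 0; d < hidden.length; d++) {
                score += hidden[d] * hiddenToOutput[d][w];
            }
            out[w] = Math.exp(score);
            sum += out[w];
        }
        for (int w = 0; w < vocab; w++) {
            out[w] /= sum; // softmax normalisation
        }
        return out;
    }

    /** Nudges the weights so that the target word becomes more probable next time. */
    static void step(double[] hidden, double[][] hiddenToOutput, int targetId, double learningRate) {
        double[] probabilities = predict(hidden, hiddenToOutput);
        for (int w = 0; w < probabilities.length; w++) {
            // Output error: predicted probability minus the expected value (1 for the target, 0 otherwise).
            double error = probabilities[w] - (w == targetId ? 1.0 : 0.0);
            for (int d = 0; d < hidden.length; d++) {
                hiddenToOutput[d][w] -= learningRate * error * hidden[d];
            }
        }
    }
}
```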

The Continuous Skip-gram Model looks reversed with respect to the CBOW model.

Image 4: Skip-gram for “She keeps Moet et Chandon”

The same concepts apply to Skip-gram. The input vectors are one-hot encoded (one for each word), so the input layer has a number of neurons equal to the number of distinct words in the input text. The hidden layer has the dimensionality of the desired resulting word vectors, and the output layer has a number of neurons equal to the number of distinct words multiplied by the window size minus one. Using the example we used for CBOW, having the text “she keeps moet et chandon in her pretty cabinet let them eat cake she says” and a window of 5, a word2vec model based on the Continuous Skip-gram model receives a first sample for | she | keeps | moet | et | chandon | where the input is made of the word moet and the output consists of the words | she | keeps | et | chandon |.
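The reversed roles can be sketched as (input word, output word) pairs in plain Java. As before, this assumes a window that slides one word at a time with the middle word as the target; SkipGramPairs is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch: Skip-gram training pairs, where the target word is the input. */
final class SkipGramPairs {

    /** Each returned pair is {input word, one of its context words}. */
    static List<String[]> pairs(String text, int window) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String[]> result = new ArrayList<>();
        for (int start = 0; start + window <= tokens.length; start++) {
            int middle = start + window / 2;
            for (int i = start; i < start + window; i++) {
                if (i != middle) {
                    result.add(new String[]{tokens[middle], tokens[i]});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String lyrics = "she keeps moet et chandon in her pretty cabinet let them eat cake she says";
        for (String[] pair : pairs(lyrics, 5)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```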

Here’s an example excerpt of word vectors calculated by word2vec on the text of the Hot 100 Billboard dataset. It shows a small subset of words plotted, for the sake of appreciating some word semantics being expressed geometrically.

Image 5: word2vec of Hot 100 Billboard dataset

You can notice the expected regularities between me and my with respect to you and your. You can also have a look at groups of similar words, or words that at least are used in similar contexts, which are good candidates for synonyms.

Now that we’ve learned a bit about how the word2vec algorithm works, let’s get some code and see it in action. Then we’ll be able to combine it with our search engine for synonym expansion.


Deeplearning4j (DL4J) is a deep learning library for the JVM. It has good adoption among Java developers and a not-too-steep learning curve for early adopters. It also comes with an Apache 2 license, which is handy if you want to use it within a company and include it in a possibly non-open-source product. DL4J also has tools to import models created with other frameworks such as Keras, Caffe, TensorFlow, Theano, etc.

Setting up word2vec in Deeplearning4j

Deeplearning4j can be used to implement neural network based algorithms; let’s see how we can use it to set up a word2vec model. DL4J has an out-of-the-box implementation of word2vec, based on the Continuous Skip-gram model. What we need to do is set up its configuration parameters and pass the input text we want to index in our search engine.

Keeping our song lyrics use case in mind, we’re going to feed word2vec with the Billboard Hot 100 text file.

Listing 6: DL4J word2vec example

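import java.util.Collection;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
// ClassPathResource also needs an import; its package depends on the DL4J version in use

String filePath =
    new ClassPathResource("billboard_lyrics_1964-2015.txt").getFile().getAbsolutePath(); // 1
SentenceIterator iter = new BasicLineIterator(filePath); // 2

Word2Vec vec = new Word2Vec.Builder() // 3
        .layerSize(100) // 4
        .windowSize(5) // 5
        .iterate(iter) // 6
        .build();
vec.fit(); // train the model; nearest-word queries need a fitted model

String[] words = new String[]{"guitar", "love", "rock"};
for (String w : words) {
    Collection<String> lst = vec.wordsNearest(w, 2); // 7
    System.out.println("2 Words closest to '" + w + "': " + lst); // 8
}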

1 read the corpus of text containing the lyrics

2 set up an iterator over the corpus

3 create a configuration for word2vec

4 set the number of dimensions the vector representations needs

5 set the window parameter

6 set word2vec to iterate over the selected corpus

7 obtain the closest words to an input word

8 print the nearest words

We obtain the following output, which sounds good enough.

Listing 7: word2vec sample output

2 Words closest to 'guitar': [giggle, piano]
2 Words closest to 'love': [girl, baby]
2 Words closest to 'rock': [party, hips] 

As you can see it’s straightforward to set up such a model and obtain results in a reasonable amount of time (training of the word2vec model took around 30s on a “normal” laptop). Keep in mind that we aim to use this in conjunction with the search engine, which should give us a better synonym expansion algorithm.

Word2vec based synonym expansion

Now that we have this powerful tool in our hands we need to be careful! When using WordNet we have a constrained set of synonyms, which prevents blowing up the index. With our word vectors generated by word2vec we may ask the model to return the closest words for each word to be indexed. This might be unacceptable from the performance perspective (for both runtime and storage), and we must come up with a strategy for how to use word2vec responsibly. One thing we can do is constrain the type of words we send to word2vec to get their nearest words.

In natural language processing, it is common to tag each word with a part of speech (PoS), which labels its syntactic role in a sentence. Common parts of speech are NOUN, VERB, ADJ, but also more fine-grained ones like NP or NC (proper or common noun). We could decide to use word2vec only for words whose PoS is either NC or VERB and avoid bloating the index with synonyms for adjectives. Another technique is to have a prior look at how much information is found in the document. A short text is likely to have a relatively low probability of matching a query because it’s composed of only a few terms. We could decide to focus more on such documents and be eager to expand synonyms there, rather than in longer documents.

On the other hand, the “informativeness” of a document doesn’t depend only on its size. Therefore, other techniques could be used, such as looking at term weights (the number of times a term appears in a piece of text) and skipping the terms with a low weight. Additionally, we can use word2vec results only if they have a good similarity score. If we use cosine distance to find the nearest neighbors of a certain word vector, such neighbors could be too far away (a low similarity score) but still be the nearest ones. In that case we can decide not to use those neighbors.
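Two of these safeguards, the part-of-speech gate and the similarity threshold, can be sketched in plain Java. This is only one possible way to combine them: SynonymSelection, the tag set and the cap on the number of synonyms are assumptions for illustration, and the neighbour scores would come from the word2vec model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch: decide which words to expand and which neighbours to keep. */
final class SynonymSelection {

    // Only common nouns and verbs are expanded, as suggested in the text.
    private static final Set<String> EXPANDABLE_TAGS = new HashSet<>(Arrays.asList("NC", "VERB"));

    static boolean isExpandable(String posTag) {
        return EXPANDABLE_TAGS.contains(posTag);
    }

    /**
     * neighbours maps a candidate synonym to its similarity score.
     * Keeps at most maxSynonyms candidates, best first, dropping any below minSimilarity.
     */
    static List<String> select(Map<String, Double> neighbours, double minSimilarity, int maxSynonyms) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(neighbours.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries) {
            if (selected.size() == maxSynonyms || entry.getValue() < minSimilarity) {
                break; // the rest are either beyond the cap or too dissimilar
            }
            selected.add(entry.getKey());
        }
        return selected;
    }
}
```

With a high enough threshold the selection can be empty, which is the intended behaviour when even the nearest neighbours are too far away.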

A word2vec synonym filter with Lucene and DL4J

A token filter takes the terms provided by a tokenizer and optionally performs some operations on them, like filtering them out or, as in this case, adding other terms to be indexed. A Lucene TokenFilter is based on the incrementToken API, which returns a boolean value that is false at the end of the token stream; implementers of this API consume one token at a time (e.g. by filtering or expanding a certain token). Earlier in this article you saw a diagram of how word2vec synonym expansion works. Before being able to use word2vec we need to configure and train it using the data to be indexed. In the song lyrics example, we decided to use the Billboard Hot 100 Dataset, which we pass as plain text to the word2vec algorithm, as shown in the previous code listing.

Once we’re done with word2vec training, we create a synonym filter that uses the learned model to predict term synonyms during filtering. Lucene APIs for token filtering require us to implement the incrementToken method. By API contract this method returns true if there are still tokens to consume from the token stream and false if there are no more tokens left to consider for filtering. The basic idea is that our token filter returns true for all the original tokens and for all the related synonyms that we get from word2vec.

Listing 8: A word2vec based synonym expansion filter

protected W2VSynonymFilter(TokenStream input, Word2Vec word2Vec) { // 1
  super(input);
  this.word2Vec = word2Vec;
}

@Override
public boolean incrementToken() throws IOException { // 2
  if (!outputs.isEmpty()) {
    ... // 3
  }

  if (!SynonymFilter.TYPE_SYNONYM.equals(typeAtt.type())) { // 4
    // toString() returns only the valid part of the term buffer
    String word = termAtt.toString();

    List<String> list = word2Vec.similarWordsInVocabTo(word, minAccuracy); // 5
    int i = 0;
    for (String syn : list) {
      if (i == 2) { // 6
        break;
      }
      if (!syn.equals(word)) {
        CharsRefBuilder charsRefBuilder = new CharsRefBuilder();
        CharsRef cr = charsRefBuilder.append(syn).get(); // 7
        State state = captureState(); // 8
        outputs.add(new PendingOutput(state, cr)); // 9
        i++;
      }
    }
  }
  return !outputs.isEmpty() || input.incrementToken();
}

1 create a token filter that takes an already trained word2vec model

2 implement the Lucene API for token filtering

3 add cached synonyms to the token stream (see next code listing)

4 only expand a token if it’s not a synonym itself (to avoid loops in expansion)

5 for each term find the closest words using word2vec which have an accuracy higher than a minimum (e.g. 0.35)

6 don’t record more than two synonyms for each token

7 record the synonym value

8 record the current state of the original term (not the synonym) in the token stream (e.g. starting and ending position)

9 create an object to contain the synonyms to be added to the token stream after all the original terms have been consumed

The above piece of code traverses all the terms, and when it finds a synonym it puts it in a list of pending outputs to expand (the outputs List). We apply those pending terms to be added (the actual synonyms) after each original term has been processed, in the code below.

Listing 9: Expanding pending synonyms

  if (!outputs.isEmpty()) {
    PendingOutput output = outputs.remove(0); // 1
    restoreState(output.state); // 2
    termAtt.copyBuffer(output.charsRef.chars,
        output.charsRef.offset, output.charsRef.length); // 3
    typeAtt.setType(SynonymFilter.TYPE_SYNONYM); // 4
    return true;
  }

1 get the first pending output to expand

2 retrieve the state of the original term, including its text, its position in the text stream, etc.

3 set the synonym text to the one given by word2vec and previously saved in the pending output

4 set the type of the term as synonym
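To see the pending-output mechanism in isolation, here is a simplified plain-Java analogue with no Lucene dependency. None of these types are Lucene classes: ExpandingTokenStream is a hypothetical stand-in for the filter, the synonyms map stands in for the word2vec lookup, and attribute state is reduced to the term text and a synonym flag.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Plain-Java analogue of the synonym filter; not Lucene API. */
final class ExpandingTokenStream {
    private final Iterator<String> input;
    private final Map<String, List<String>> synonyms; // stand-in for the word2vec lookup
    private final Deque<String> pending = new ArrayDeque<>();
    private String current;
    private boolean currentIsSynonym;

    ExpandingTokenStream(Iterator<String> input, Map<String, List<String>> synonyms) {
        this.input = input;
        this.synonyms = synonyms;
    }

    /** Mirrors the incrementToken contract: true while a token is available, false at the end. */
    boolean incrementToken() {
        if (!pending.isEmpty()) {
            // Emit a queued synonym before moving on to the next original token.
            current = pending.removeFirst();
            currentIsSynonym = true;
            return true;
        }
        if (!input.hasNext()) {
            return false;
        }
        current = input.next();
        currentIsSynonym = false;
        // Only original tokens are expanded, so synonyms never trigger further expansion.
        pending.addAll(synonyms.getOrDefault(current, Collections.emptyList()));
        return true;
    }

    String term() {
        return current;
    }

    boolean isSynonym() {
        return currentIsSynonym;
    }

    /** Consumes the whole stream and returns the emitted terms in order. */
    static List<String> drain(ExpandingTokenStream stream) {
        List<String> terms = new ArrayList<>();
        while (stream.incrementToken()) {
            terms.add(stream.term());
        }
        return terms;
    }
}
```

Feeding it the tokens i, like, music with the synonyms song and sing registered for music yields the original tokens followed by the two synonyms, which is the order in which they would reach the index.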


We have applied the word2vec technique so that we use its output results as synonyms only if they have an accuracy that is above a certain threshold. If you want to learn more about applying deep learning techniques to search, read the first chapter of Deep Learning for Search here, and see this slide deck.


Tommaso Teofili

Tommaso Teofili is a software engineer who is passionate and deeply involved in the open source world. His ideal job would include R&D on information retrieval, natural language processing and machine learning; he’d also like to put his skills to work for other people and for his job to be ethically fair. Follow him on Twitter @tteofili.
