Building your own in-house HTTP service for natural language processing

Machine learning and data sovereignty in the age of GDPR

Ricardo Wölker and Dr. Alan Nichol

Do you know where your data is moving? Dr. Alan Nichol and Ricardo Wölker will show you how to build and run your own GDPR-compliant Natural Language Understanding (NLU) service with the open-source Rasa NLU library. You can query it over HTTP without any Python knowledge, and it leaves you fully in charge of your data.

With the GDPR taking effect in the EU on May 25, it becomes all-important for bot developers to know exactly where their data moves. Ideally, this means being in control of every part of your service, from owning the training data to customising your machine-learning models. In this post, we want to:

  1. Show you how to build an in-house NLU system with minimal effort
  2. Demonstrate how to use it over HTTP
  3. Explain how a customizable and GDPR-compliant NLU system can be an advantage

Building your own NLU system

To show you how to build your own NLU system from scratch, let’s dive straight into an example: Say you operate a service for people to ask about the weather. A typical query someone might have looks something like this:

What is the weather like today in London?

In order to respond to a question like this correctly, your bot needs to understand three things:

  1. the intent of the question is to ask about the weather
  2. the specified time is today
  3. the location in this weather query is London
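In other words, the service should turn the free-text question into structured data along these lines (a hand-written illustration of the idea, not Rasa NLU's exact output format, which we'll see later):

```python
# Sketch of the structured data we want the NLU service to extract
# from "What is the weather like today in London?"
parsed = {
    "intent": "ask_weather",   # what the user wants
    "entities": {
        "time": "today",       # when
        "location": "London",  # where
    },
}

print(parsed["intent"], parsed["entities"])
```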

Rasa NLU uses sentences like the one above to train a machine-learning model that is able to generalise to new sentences. Let’s create a file called data.yml and add training examples with the entities location and time marked. Here are a few:

data: |
  ## intent:ask_weather
  - what is the weather like [today](time) in [London](location)?
  - can you tell me the weather in [new york](location) [next week](time)?
  - is it going to be sunny in [Berlin](location) [tomorrow](time)?
  - weather in [Antwerp](location) [tomorrow](time)
  - please tell me it won't rain in [Prague](location) on [saturday](time)

  ## intent:greet
  - hey
  - hello
  - good morning

  ## intent:goodbye
  - bye
  - see you
  - goodbye

You’ll see that we have also thrown in a few examples for the model to understand greetings and goodbyes.
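The [value](entity) markup in these examples is simple enough to inspect programmatically. Here is a small sketch (the regex is our own, not part of Rasa NLU) that pulls the annotated entities out of a single training line:

```python
import re

# Matches the markdown-style entity annotation: [value](entity_name)
ENTITY_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def extract_annotations(line):
    """Return the (value, entity) pairs annotated in a training example."""
    return ENTITY_PATTERN.findall(line)

example = "what is the weather like [today](time) in [London](location)?"
print(extract_annotations(example))
# [('today', 'time'), ('London', 'location')]
```

A check like this is handy for catching malformed annotations before you send the file off for training.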


You will also need to supply some information specifying the language and the machine-learning pipeline to be used. In this case, our project is in English and we’ll use the scikit-learn and spaCy machine-learning libraries. Let’s add these two lines to our data.yml from before:

language: "en"
pipeline: "spacy_sklearn" 
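Putting it all together, the complete data.yml we will send for training looks like this:

```yaml
language: "en"
pipeline: "spacy_sklearn"

data: |
  ## intent:ask_weather
  - what is the weather like [today](time) in [London](location)?
  - can you tell me the weather in [new york](location) [next week](time)?
  - is it going to be sunny in [Berlin](location) [tomorrow](time)?
  - weather in [Antwerp](location) [tomorrow](time)
  - please tell me it won't rain in [Prague](location) on [saturday](time)

  ## intent:greet
  - hey
  - hello
  - good morning

  ## intent:goodbye
  - bye
  - see you
  - goodbye
```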

Now, all that’s left to do is train your language model.

Training your model

Rasa NLU ships with a full HTTP API that lets you use it without Python. Assuming you want to save your model into a directory called projects, this is how you start the server:

$ python -m rasa_nlu.server --path projects 

This will start a local server on port 5000 that can be reached over a REST API. The first endpoint you’ll want to use is POST /train. This command takes the file you’ve just created and sends it to Rasa NLU for training:

$ curl -XPOST -H "Content-Type: application/x-yml" \
	'localhost:5000/train?project=weather_bot' --data-binary @data.yml
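If you’d rather drive this from Python than from the shell, the same request can be assembled with the standard library. This is only a sketch: it builds the request without sending it, since sending assumes the Rasa NLU server above is running on localhost:5000 (pass req to urllib.request.urlopen to actually send it):

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_train_request(project, payload, host="http://localhost:5000"):
    """Assemble the POST /train call shown above; send with urlopen(req)."""
    url = "{}/train?{}".format(host, urlencode({"project": project}))
    return Request(url, data=payload,
                   headers={"Content-Type": "application/x-yml"},
                   method="POST")

# In practice the payload is the training file: open("data.yml", "rb").read()
payload = b'language: "en"\npipeline: "spacy_sklearn"\n'  # abridged here
req = build_train_request("weather_bot", payload)
print(req.full_url)  # http://localhost:5000/train?project=weather_bot
```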

After a minute or so, you’ll see a message confirming that the training has finished.


Querying your model over HTTP

Now that you’ve trained your model, let’s look at how to evaluate sentences on it and get back structured data. We will use the POST /parse endpoint of our Rasa NLU server for this. Here’s an example asking about the weather in Philadelphia:

$ curl -XPOST localhost:5000/parse -d \
	'{"q":"hey how is the weather in Philadelphia on Sunday?", "project": "weather_bot"}' 

The result looks like this:

{
  "project": "weather_bot",
  "entities": [
    {
      "extractor": "ner_crf",
      "confidence": 0.6644960352482667,
      "end": 33,
      "value": "philadelphia",
      "entity": "location",
      "start": 26
    },
    {
      "extractor": "ner_crf",
      "confidence": 0.5581670818608013,
      "end": 43,
      "value": "sunday",
      "entity": "time",
      "start": 37
    }
  ],
  "intent": {
    "confidence": 0.5720685112669681,
    "name": "ask_weather"
  },
  "text": "hey how is the weather in Philadelphia on Sunday?",
  "model": "model_20180515-103505",
  "intent_ranking": [
    {"confidence": 0.5720685112669681, "name": "ask_weather"},
    {"confidence": 0.2403181999618597, "name": "greet"},
    {"confidence": 0.18761328877117212, "name": "goodbye"}
  ]
}

You can see that our simple model has correctly interpreted a query it hasn’t seen before: It is a question of type ask_weather, the location was understood as Philadelphia and the time asked for is Sunday! Here, we’ve only trained on five examples with the ask_weather intent, and are already seeing a good result. We recommend you use at least ten examples for each intent for better performance in your actual project.
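Back in your application, this JSON is easy to post-process. Here’s a short sketch that reduces the parse result above (abridged to the fields we use) to the three pieces of information we cared about:

```python
import json

# The /parse response from above, trimmed to intent and entities
response = json.loads("""
{
  "intent": {"name": "ask_weather", "confidence": 0.572},
  "entities": [
    {"entity": "location", "value": "philadelphia", "confidence": 0.664},
    {"entity": "time", "value": "sunday", "confidence": 0.558}
  ]
}
""")

intent = response["intent"]["name"]
slots = {e["entity"]: e["value"] for e in response["entities"]}
print(intent, slots)  # ask_weather {'location': 'philadelphia', 'time': 'sunday'}
```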

The Rasa NLU server comes with other endpoints you can use for things like benchmarking a dataset on your NLU model. The full API is documented in the Rasa NLU Server Docs.


Customize your AI and control your data

Rasa NLU lets you fully customize your language model to your needs. It’s easy to see how this can lead to a competitive advantage: a system that is perfectly tweaked to your training data and use case can perform much better than any out-of-the-box NLU solution.

Under the upcoming EU General Data Protection Regulation (GDPR), data-residency rules require you to know exactly where your data is located. Cloud-based machine-learning services make it difficult to trace where your data is stored and processed. Machine learning in NLU requires processing potentially personal information, making GDPR compliance crucial in two areas: you will want to know (1) where your machine-learning models are trained, and (2) where your text queries are evaluated against your trained model. The approach we’ve shown you lets you stay in charge of both your training data and your customers’ queries.

Taking your NLU in-house

We hope you’ve seen that it can be very straightforward to build your own NLU system. It makes GDPR compliance easy, as you remain completely in charge of your data. Even better, you need little to no Python to run it and can easily integrate it into your systems over HTTP.

This post has dealt with only one of the two parts of building conversational systems. We find that developers often start with NLU to handle simple queries, but quickly see the need for more advanced dialogue handling. Rasa’s open-source library Rasa Core uses machine learning to predict the responses and actions your system should take. Much like Rasa NLU, it trains on example conversations to generalise to new situations.


Ricardo Wölker and Dr. Alan Nichol

Ricardo Wölker is a machine learning engineer at Rasa, and a contributor to the open-source libraries Rasa NLU and Rasa Core. He is currently working towards a PhD in high-energy physics at the University of Oxford, where he studies particle collisions at the LHC in Geneva.

Dr. Alan Nichol is the co-founder and CTO of Rasa, and a maintainer of Rasa NLU and Rasa Core, the leading open source libraries for building conversational AI. He is also the author of the DataCamp course “Building Chatbots in Python”. He holds a PhD in machine learning from the University of Cambridge and has years of experience building AI products in industry. Follow him on Twitter @alanmnichol.
