Machine Learning algorithms: Working with text data
Deep down, ML is a pure numbers game. With very few exceptions, the actual input to an ML model is always a collection of float values. We talked with Christoph Henkelmann about how ML algorithms work with words and letters, the difference between images and text, and how to handle textual input properly.
JAXenter: What is the difference between images and text from a machine’s point of view?
Christoph Henkelmann: Almost all ML methods, especially neural networks, want tensors (multidimensional arrays of numbers) as input. For an image the transformation is obvious: we already have a three-dimensional array of pixels (width x height x color channel), so apart from minor preprocessing the image is already “bite-sized”. There is no such obvious representation for text. Text and words exist at a higher level of meaning. If, for example, you simply feed Unicode-encoded letters into the network as numbers, the jump from encoding to semantics is too large.
We also expect systems that work with text to perform semantically more demanding tasks. If a machine recognizes a cat in an image, that’s impressive. But it is not impressive if a machine detects the word “cat” in a sentence.
JAXenter: Why do problems arise concerning Unicode normalization?
Christoph Henkelmann: One would like to think that Unicode does not have to be normalized at all. After all, it was intended to finally solve all the encoding problems from the early days of word processing. But the devil is in the details. Unicode is enormously complex because language is enormously complex. There are more than a dozen different space characters in Unicode. If you use the standard methods of some programming languages to split text from different sources, you suddenly wonder why words still stick together.
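The splitting problem can be reproduced in a few lines of Python. This is only a minimal sketch: the sample string and the particular space characters in it are chosen for illustration and do not come from the interview.

```python
# Sample text containing three different Unicode space characters
# (illustrative choice): U+00A0 NO-BREAK SPACE, U+2009 THIN SPACE,
# U+3000 IDEOGRAPHIC SPACE, plus one ordinary ASCII space.
text = "machine\u00a0learning\u2009with\u3000text data"

# Splitting on the ASCII space only: most words stay glued together.
naive = text.split(" ")

# str.split() without arguments splits on any run of Unicode whitespace.
robust = text.split()

# U+200B ZERO WIDTH SPACE is not classified as whitespace, so it survives
# even the argument-free split and has to be replaced explicitly.
tricky = "text\u200bdata"
still_glued = tricky.split()
cleaned = tricky.replace("\u200b", " ").split()

print(len(naive), len(robust), len(still_glued), len(cleaned))
```

Which characters count as a separator depends on the language and the method used, so it is worth checking the splitting behavior against real samples from every data source.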
Also, the representation of words is not unique. For example, there are two Unicode encodings of the word “München”, because the “ü” can be stored either as a single code point or as a “u” followed by a combining diaeresis. If you compare character by character, “München” suddenly no longer equals “München”. If you forget a step like this in preprocessing, you train a system on unclean data, and of course this does not give a good result.
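A word with an umlaut, such as the German spelling “München”, makes the ambiguity concrete. The sketch below uses Python's standard unicodedata module. Choosing the NFC normalization form and wrapping it in a helper called normalize are assumptions made for this illustration, not something prescribed in the interview.

```python
import unicodedata

# Two spellings that render identically on screen:
composed = "M\u00fcnchen"     # "ü" as a single code point (U+00FC)
decomposed = "Mu\u0308nchen"  # "u" followed by U+0308 COMBINING DIAERESIS

# A plain comparison treats them as different strings of different length.
same_before = composed == decomposed
lengths = (len(composed), len(decomposed))


def normalize(text: str) -> str:
    """Hypothetical helper: bring text into one canonical form (NFC)."""
    return unicodedata.normalize("NFC", text)


# After normalization both spellings compare equal.
same_after = normalize(composed) == normalize(decomposed)

print(same_before, lengths, same_after)
```

Whichever form is chosen, the important point is to apply the same normalization to training data and to later input.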
JAXenter: You speak of different ways of representing text. How many are there, and what are they all about?
Christoph Henkelmann: Since we do not have such an “obvious” representation of text, there are many different ways to feed text to an ML system, and you can choose between different “granularities”. At the lowest level, a number is simply assigned to each letter, basically the same as in a text file. One level up, individual words are encoded as the smallest unit. At the other end are methods that generate a single tensor from an entire document, which is really more of a “fingerprint” of the document. There are also a number of technical variants of each of these approaches. The complication is that there is no single best approach; you have to choose the right one for your problem.
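The three granularities can be sketched with plain Python. The two example sentences, the vocabulary construction and the bag-of-words “fingerprint” are illustrative assumptions. They stand in for one simple variant per level and are not the specific methods covered in the talk.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# 1) Character level: each letter becomes its Unicode code point.
char_ids = [ord(ch) for ch in "cat"]

# 2) Word level: each distinct word gets an index in a vocabulary,
#    assigned in order of first appearance.
vocab = {}
for doc in docs:
    for word in doc.split():
        vocab.setdefault(word, len(vocab))
word_ids = [[vocab[w] for w in doc.split()] for doc in docs]


# 3) Document level: a bag-of-words count vector. Every document maps to
#    a vector of the same length, regardless of how long the text is.
def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]


doc_vectors = [bag_of_words(doc) for doc in docs]

print(char_ids)
print(word_ids)
print(doc_vectors)
```

The character and word encodings keep the order of the text but vary in length, while the document vector has a fixed length and discards word order. That trade-off is one reason why no single approach fits every problem.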
JAXenter: Is word2vec also about coding semantics?
Christoph Henkelmann: Exactly. Much more than with images or audio, the preprocessing of text affects the semantic level at which the method operates. Sometimes preprocessing is itself a kind of machine learning, so that we can already answer questions purely because we have encoded the text differently. The best-known and currently much-discussed example is word2vec.
Once you have created a word2vec encoding, you can answer semantic questions like “King – Man + Woman = ?”. Here you can read the answer “Queen” directly from the word2vec encoding. Word analogies can also be solved, e.g. “Berlin is to Germany as Rome is to ?”. word2vec delivers the answer “Italy”. The semantic meaning results purely from the mathematical distance between the encodings. The system “does not know” what a country or a capital is; it only knows the (high-dimensional) distance between the words. This is an incredibly useful representation of words for ML systems and therefore also the final part of my presentation at the ML Conference in Munich.
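The vector arithmetic behind such analogies can be illustrated with a toy example. The six-dimensional vectors below are made up by hand so that the arithmetic works out exactly. They are not real word2vec output, which is learned from large text corpora, typically has hundreds of dimensions and only matches approximately. Cosine similarity is used as the closeness measure here, and the function name analogy is hypothetical.

```python
import math

# Hand-made toy vectors (NOT learned embeddings). Dimensions, by assumption:
# [royal, male, female, capital, german, italian]
vectors = {
    "king":    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "queen":   [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "man":     [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "woman":   [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "berlin":  [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    "germany": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    "rome":    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    "italy":   [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(plus, minus):
    """Add the 'plus' vectors, subtract the 'minus' vectors and return the
    closest remaining word. The query words themselves are excluded."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in plus:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in minus:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in plus and w not in minus]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# "King - Man + Woman = ?"
answer_royal = analogy(plus=["king", "woman"], minus=["man"])
# "Berlin is to Germany as Rome is to ?"  ->  Germany - Berlin + Rome
answer_country = analogy(plus=["germany", "rome"], minus=["berlin"])

print(answer_royal, answer_country)
```

The code contains no notion of “country” or “capital”. The answers follow only from where the vectors sit relative to each other.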