RoboHub. Incorporating body language into NLP (or, More notes on the design of automated body language)

This article discusses how body language is a part of natural language, personality, and NLP design. The article covers various methods for approaching this problem and makes recommendations for the real-time generation of animation to accompany natural language for avatars and robots.

It’s hard to communicate with words alone. Some researchers claim that almost half of our communication relies on channels other than the words themselves: prosody (the tone, pitch, and speed of speech), facial expression, hand gesture, stance, and posture, none of which is conveyed by text. This may explain the often-cited claim that about 40% of emails are misunderstood. As designers of robots (or avatars) we need to take these statistics seriously and consider how to integrate body language into natural language communication. Therefore Geppetto Labs has built a platform to automatically generate body language and coordinate it with what a robot (or avatar) is saying.

Most NLP systems today, be they Siri or Watson, amount to conducting chat through the thin pipe of a text interface. Siri doesn’t have much choice in the matter, since Apple had to simplify the complexity of communication, but that text interface reduces the communication itself. If you think Natural Language Processing is only about text, step away from the computer, go to a café or bar, and watch people interact for half an hour.

Videos do an excellent job of conveying the importance of body language. A great one to watch is The History Channel’s Secrets of Body Language. This documentary looks at politicians, cops, athletes, and others, interviewing body language experts to decode everyone from Richard Nixon to Marion Jones. Gesture, expression, and tone of voice are all treated as valuable and important data channels.

This is why face-to-face meetings are so much more productive. Each party can better understand the other because there is a higher throughput of communication. In a lovers’ relationship, or in a family’s relationships, body language is even more important than in a business meeting. Consider the fact that it’s the most intimate relationships (between lovers, primarily, but also between family members, close friends, and others) that involve the most touching. These are also the relationships that rely the most on body language, because body language actually defines the proximal closeness and intimacy.

So if we want people to engage emotionally with robots, or avatars (or any other kind of character that is rigged up to an NLP system), we need to consider using body language as part of that system. We humans are hardwired that way.


At Geppetto Labs we have begun considering Body Language Processing as a subset of Natural Language Processing. So just as Natural Language Processing has NLU (understanding) and NLG (generation), we can consider Body Language to have BLU and BLG. I’ll be focusing on the generation side, but others, such as Skip Rizzo and Noldus Information Technology, are looking at the understanding of body language and facial expressions.

Generating body language requires coordination with the textual components of Natural Language Processing. A gesture or animation has to last the same amount of time, happen at the same moment, and carry the same emotional content, or affect, as the message it accompanies. “Hi” should, of course, be accompanied by a gesture that is about one second long — a friendly-looking signifier that’s commonly understood. Raising the hand and wagging it back and forth usually gets the job done. But building this can be tricky. It gets more complicated when there is a sentence like this one that doesn’t have clear emotional content, isn’t the kind of thing you hear as often as “Hi,” and is long enough that the animation needs to be at least ten seconds long.

At Geppetto Labs we’ve developed the ACTR platform in order to accomplish this. The core process, at least as it relates to text, is to generate body language (as opposed to voice output) as follows:

First, we take the raw NL text and determine three variables: Duration (timing), Affect (emotion), and Signifiers (specific gestures):

1) Duration, or timing. How long is the sound or string of text we’re dealing with? This is the easiest variable to calculate directly from the text. Most spoken conversation ranges between 150 and 175 words per minute, though that rate speeds up or slows down with the emotion of the speaker, among other factors. Let’s call it 150 words per minute. A “word” in these kinds of standards is counted as five characters, or five bytes. So most of us speak at around 750 bytes per minute; backing that out, around 12 bytes leave the system per second, and this rate is then used to calculate the duration of a given animation. We’ll call this integer between one and 150 a “duration tag.”
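This back-of-the-envelope arithmetic is easy to sketch in code. The function below is illustrative, not part of the ACTR platform; it simply applies the 150-words-per-minute (about 12.5 bytes per second) rule of thumb to a text string:

```python
# Rough speaking-duration estimate for a text string, following the
# 150-words-per-minute rule of thumb described above.
# Names here are illustrative, not ACTR's actual API.

BYTES_PER_SECOND = 150 * 5 / 60  # 150 wpm * 5 chars per "word" = 12.5 bytes/sec

def duration_tag(text: str) -> int:
    """Return the estimated speaking duration in whole seconds (minimum 1)."""
    seconds = len(text) / BYTES_PER_SECOND
    return max(1, round(seconds))

print(duration_tag("Hi"))                                # 1 second
print(duration_tag("How in the world do we build that?"))  # 3 seconds
```

Note that the 34-character sentence used later in this article comes out at about three seconds, matching the figure below.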

2) Affect, or emotion. What is the emotional value of that source string of text? This is the second factor we need in order to calculate an animation, and it’s harder than just measuring the letters in a line: it requires realtime sentiment analysis and/or a pre-built library that identifies the emotional content of a word. One solution is WordNet-Affect. Words in WordNet-Affect are derived from Princeton’s fantastic WordNet project and have been flagged with labels that indicate a range of values, most of which relate to what kind of psychosomatic reaction a word might cause or what kind of state it might indicate. Simple examples would be happiness, fear, and cold. There’s a ton of really sticky material in this labyrinth of language called “affect,” and the ways that words link to one another make it all even stickier. But for this explanation, let’s say that any given word falls into one of nine emotional buckets, so we give it a value from one to nine. Fear is a one. Happiness is a nine. If we then take the average affect of the text string in question (again, speaking very simply) we end up with a number that represents the emotion of that sentence. We’ll call this integer between one and nine an “affect tag.”
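As a sketch of that averaging step, here is a toy affect tagger. The nine-point scale follows the text, but the tiny lexicon below is a made-up stand-in for a real resource such as WordNet-Affect:

```python
# Toy affect tagger on a 1..9 scale (1 = fear, 9 = happiness).
# The lexicon is an illustrative stand-in for WordNet-Affect.

AFFECT_LEXICON = {
    "fear": 1, "afraid": 1, "terror": 1,
    "sad": 3, "cold": 4,
    "calm": 5,
    "warm": 6, "glad": 8,
    "happy": 9, "happiness": 9, "joy": 9,
}
NEUTRAL = 5  # words not in the lexicon count as emotionally neutral

def affect_tag(text: str) -> int:
    """Average the per-word affect values and round to an integer 1..9."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    values = [AFFECT_LEXICON.get(w, NEUTRAL) for w in words]
    return round(sum(values) / len(values)) if values else NEUTRAL

print(affect_tag("How in the world do we build that?"))  # all neutral words: 5
```

A real system would also have to handle negation, intensifiers, and word-sense ambiguity, which is exactly the “sticky” territory mentioned above.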

Before we go on, let’s pause, because we now have enough to make an animation match our sentence.

“How in the world do we build that?” is an eight-word sentence, so we know the duration would be about three seconds. The affect is harder to measure, but for this example let’s say it ends up being a value of 5. So we have Duration=3, Affect=5. These two bits of information, alone, are enough to calculate a rough animation, but first we need to build a small bucket of animations. They are probably keyframe animations, because we want to interpolate between them so that they form a chain. We make them in various durations (1 second, 2 seconds, 3 seconds, etc.) so that if we want a three-second chain we can combine a 1-second and a 2-second animation, or, if we want to avoid replaying the same animation, we can reverse the order of these links and combine the 2-second and then the 1-second animation. And we make sure that we have these various animation links ready in separate buckets, one for each affect value. So if we get Duration=3 and Affect=5, we go into the bucket labeled Affect #5 and dig up the animation links that add up to three seconds.
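The bucket lookup and chaining might be sketched like this. The clip names and bucket layout are made up for illustration, and the sketch assumes every affect bucket contains at least a 1-second clip so any whole-second duration can be filled:

```python
# Sketch of chaining animation "links" from an affect bucket so their
# durations sum to the target. Clip names are hypothetical.

ANIMATION_BUCKETS = {
    5: [("neutral_shrug", 2), ("neutral_nod", 1)],  # (clip name, seconds)
}

def build_chain(duration: int, affect: int) -> list[str]:
    """Fill the target duration with clips from the affect bucket,
    longest clips first, cycling so the chain varies."""
    bucket = sorted(ANIMATION_BUCKETS[affect], key=lambda clip: -clip[1])
    chain, remaining, i = [], duration, 0
    while remaining > 0:
        name, secs = bucket[i % len(bucket)]
        if secs <= remaining:
            chain.append(name)
            remaining -= secs
        i += 1
    return chain

print(build_chain(3, 5))  # ['neutral_shrug', 'neutral_nod']
```

For Duration=3, Affect=5 this picks the 2-second shrug plus the 1-second nod, exactly the combination described above.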

The longer the duration, the trickier it gets. If you have a twelve-second animation you might then have to chain together that two-second animation six times, or your one-second animation twelve times, to get the proper duration.  Does that make sense?

No. I hope that at this point you’ve stopped and said, “Wait, no, that would be really dumb. To play an animation twelve times would just look like the character is convulsing. That’s bad body language, Mark!”

ROILA: Robot Interaction Language

ROILA, Robot Interaction Language, is a spoken language for robots. It is constructed to make it easy for humans to learn, but also easy for the robots to understand. ROILA is optimized for the robots’ automatic speech recognition and understanding.

The number of robots in our society is increasing rapidly. The number of service robots that interact with everyday people already outnumbers industrial robots. The easiest way to communicate with these service robots, such as Roomba or Nao, would be natural speech. But current speech recognition technology has not yet reached a level at which it would be easy to use. Robots often misunderstand words or are unable to make sense of them, and some researchers argue that speech recognition will never reach the level of humans.

Palm Inc. faced a similar problem with handwriting recognition for their handheld computers. They invented Graffiti, an artificial alphabet that was easy to learn and easy for the computer to recognize. ROILA takes a similar approach by offering an artificial language that is easy for humans to learn and easy for robots to understand. The Oxford Encyclopedia defines an artificial language as one deliberately invented or constructed, especially as a means of communication in computing or information technology.

We reviewed the most successful artificial and natural languages across the dimensions of morphology and phonology (see the overview in the form of a large table) and composed a language that is extremely easy to learn. The grammar is simple, with no irregularities, and the words are composed of phonemes shared by the majority of natural languages; the set of major phonemes was generated from that overview. Moreover, we wrote a genetic algorithm that generates ROILA’s words in a way that makes them easy to pronounce. The same algorithm makes sure that the words in the dictionary sound as different from each other as possible, which helps the speech recognizer accurately understand the human speaker.
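To make the objective concrete, here is a toy version of the idea: build candidate words from consonant-vowel syllables and keep the vocabulary whose words are most distinct, scored by minimum pairwise Levenshtein distance. ROILA’s actual genetic algorithm and phoneme inventory are more involved; this sketch uses random search rather than a true genetic algorithm, and the consonant set is a guess, not ROILA’s:

```python
# Toy vocabulary search: maximize the minimum pairwise edit distance
# between candidate words, a simple proxy for "sounding different".
import itertools
import random

CONSONANTS = "bfjklmnpstw"   # illustrative guess, not ROILA's phoneme set
VOWELS = "aeiou"

def make_word(syllables=2, rng=random):
    """Build a pronounceable word from consonant-vowel syllables."""
    return "".join(rng.choice(CONSONANTS) + rng.choice(VOWELS)
                   for _ in range(syllables))

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def vocabulary_score(words):
    """Minimum pairwise distance: higher means words are easier to tell apart."""
    return min(levenshtein(a, b) for a, b in itertools.combinations(words, 2))

rng = random.Random(0)
candidates = [[make_word(rng=rng) for _ in range(5)] for _ in range(200)]
best = max(candidates, key=vocabulary_score)
print(best, vocabulary_score(best))
```

A genetic algorithm would replace the random search with mutation and crossover over vocabularies, but the fitness function it optimizes is the same kind of distinctness score.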

Most previously developed artificial languages have not been able to attract many human speakers, with the exception of Esperanto. However, with the rise of robots a new community is forming on our planet, and there is no reason why robots should not have their own language. Soon there will be millions of robots to which you can talk in the ROILA language. In summary, we aim to design a “Robot Interaction Language” that addresses the problems associated with speech interaction in natural languages. Our language is constructed around two goals: first, it should be easy for the user to learn, and second, it should be optimized for efficient recognition by a robot.

ROILA is free to use for everybody and we offer all the technical tools and manuals to make your robot understand and speak ROILA. At the same time we offer courses for humans to learn the ROILA language.

Automated Installer and Java Library, March 17, 2013. Development versions of the Automated Installer and Java Library are currently available on GitHub; please report any issues to Josh. The automated installer and Java library are designed to make it easier to work with ROILA. They are still a work in progress, so some features won’t work fully (especially in the library). GitHub will be updated with improved copies in the coming months, including a fix for a bug in the downloading of the pre-compiled library. More features are planned, so keep an eye on GitHub.

We are currently developing courses in ROILA, available in our ROILA Academy. We are also giving an introductory ROILA course to Dutch high school students at the Huygens College Eindhoven. The short course will consist of three lessons followed by a ROILA final exam, and will be part of their science curriculum. The homework curriculum for this course is uploaded in the ROILA Academy and also on an external website. The vocabulary for this course is uploaded here; you can also find a similar dictionary in the ROILA Academy. We will post videos and PowerPoint PDFs of each lesson given at the school. We have removed parts of the video that were only relevant to the students (such as administration of the course, etc.).

Lesson 1

Lesson 1 Powerpoint PDF

Homework requirement: Topics 1, 2, 3, 4

Lesson 1 – November 15, 2010

Lesson 1 – November 19, 2010

Lesson 2

Lesson 2 Powerpoint PDF

Homework requirement: Topics 5, 6, 7

Lesson 2 – November 22, 2010


We have published several articles about ROILA: