This article discusses how body language is a part of natural language, personality, and NLP design. The article covers various methods for approaching this problem and makes recommendations for the real-time generation of animation to accompany natural language for avatars and robots.
It’s hard to communicate with words alone. Some researchers claim that almost half of our communication relies on things that aren’t words: body language, tone of voice, and other signals that text simply doesn’t convey. This includes prosody (tone, pitch, and speed of speech), facial expression, hand gesture, stance, and posture. This probably explains why about 40% of emails are misunderstood. As designers of robots (or avatars), we need to take these statistics seriously and work out how to integrate body language into natural language communication. To that end, Geppetto Labs has built a platform to automatically generate body language and coordinate it with what a robot (or avatar) is saying.
Most NLP systems today, be they Siri or Watson, amount to conducting chat via the thin pipe of a text interface. Siri doesn’t have a lot of choice in the matter, since Apple had to simplify the complexity of communication, but this text interface reduced the communication itself. If you think that Natural Language Processing is only about text, then step away from the computer, go to a café or bar, and watch people interact for half an hour.
Videos do an excellent job of conveying the importance of body language. A great one to watch is The History Channel’s Secrets of Body Language. This documentary looks at politicians, cops, athletes, and others, interviewing body language experts to decode everyone from Richard Nixon to Marion Jones. Gesture, expression, and tone of voice are all treated as valuable and important data channels.
This is why face-to-face meetings are so much more productive. Each party can better understand the other because there is a higher throughput of communication. In a lovers’ relationship, or in a family’s relationships, body language is even more important than in a business meeting. Consider the fact that it’s the most intimate relationships (between lovers, primarily, but also between family members, close friends, and others) that involve the most touching. These are also the relationships that rely the most on body language, because body language actually defines the proximal closeness and intimacy.
So if we want people to engage emotionally with robots, or avatars (or any other kind of character that is rigged up to an NLP system), we need to consider using body language as part of that system. We humans are hardwired that way.
At Geppetto Labs we have begun considering Body Language Processing as a subset of Natural Language Processing. So just as Natural Language Processing has NLU (understanding) and NLG (generation), we can consider Body Language to have BLU and BLG. I’ll be focusing on the generation side, but others, such as Skip Rizzo and Noldus Information Technology, are also looking at the understanding of body language and facial expressions.
Generating body language requires coordination with the textual components of Natural Language Processing. A gesture or animation has to have the same duration of time, happen at the same moment, and include the same emotional content, or affect, as the message conveyed. “Hi” should, of course, be accompanied by a gesture that is about one second long — a friendly-looking signifier that’s commonly understood. Raising the hand and wagging it back and forth usually gets the job done. But building this can be tricky. It gets more complicated when there is a sentence like this one that doesn’t have clear emotional content, isn’t the kind of thing you hear as often as “Hi,” and is long enough that the animation needs to be at least ten seconds long.
At Geppetto Labs we’ve developed the ACTR platform in order to accomplish this. The core process, at least as it relates to text, is to generate body language (as opposed to voice output) as follows:
First, we take the raw NL text and determine three variables: Duration (timing), Affect (emotion), and Signifiers (specific gestures):
1) Duration, or timing. How long is the sound or string of text we’re dealing with? This is the easiest to calculate directly from the text. Most spoken conversation runs between 150 and 175 words per minute, though that can speed up or slow down depending on the emotion of the speaker and other factors. Let’s call it 150 words per minute. A “word” in these kinds of standards is counted as five characters, which for plain ASCII text encoded as UTF-8 is also five bytes. So most of us speak at around 750 bytes per minute. Backing this out, about 12 bytes (750 / 60 = 12.5) should leave the system per second, and this rate is then used to calculate the duration of a given animation. We’ll call the resulting duration in seconds, an integer between one and 150, a “duration tag.”
2) Affect, or emotion. What is the emotional value of that source string of text? This is the second factor we need to know in order to calculate an animation, and it’s harder than just measuring the letters in a line: it requires real-time sentiment analysis, a pre-built library that identifies the emotional content of a word, or both. One solution is WordNet-Affect. Words in WordNet-Affect are derived from Princeton’s fantastic WordNet project and have been flagged with particular meanings that indicate a range of values, most of which relate to what kind of psychosomatic reaction that word might cause or what kind of state it might indicate. Some simple examples would be happiness, fear, cold, etc. There’s a ton of really sticky material in this labyrinth of language called “affect,” and the ways that words link to one another make it all even stickier. But for this explanation, let’s say that we can take a given word and that word will fall into one of nine emotion buckets. So we give it a value from one to nine. Fear is a one. Happiness is a nine. If we then take the average affect of the words in the text string in question (again, speaking very simply) we end up with a number that stands for the emotion of that sentence. We’ll call this integer between one and nine an “affect tag.”
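To make the two tags above concrete, here is a minimal Python sketch. It is not the ACTR code, just an illustration under stated assumptions: the function names, the tiny lexicon, and the choice to score unknown words as a neutral 5 are all made up for this example, and a real system might derive its word scores from something like WordNet-Affect. It also counts characters rather than bytes, which is the same thing for plain ASCII text.

```python
WORDS_PER_MINUTE = 150  # speaking rate used in the text
CHARS_PER_WORD = 5      # standard "word" length used in the text
CHARS_PER_SECOND = WORDS_PER_MINUTE * CHARS_PER_WORD / 60  # 12.5

# Hypothetical toy lexicon: word -> affect value, 1 (fear) to 9 (happiness).
# These entries are invented for illustration only.
TOY_AFFECT_LEXICON = {
    "afraid": 1,
    "terrible": 2,
    "fine": 5,
    "great": 8,
    "happy": 9,
}
NEUTRAL_AFFECT = 5  # assumption: words missing from the lexicon count as neutral


def duration_tag(text: str) -> int:
    """Estimate speaking time in whole seconds (at least 1) from character count."""
    seconds = len(text) / CHARS_PER_SECOND
    return max(1, round(seconds))


def affect_tag(text: str) -> int:
    """Average the per-word affect values and round to an integer from 1 to 9."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return NEUTRAL_AFFECT
    scores = [TOY_AFFECT_LEXICON.get(w, NEUTRAL_AFFECT) for w in words]
    return round(sum(scores) / len(scores))
```

With these assumptions, duration_tag("Hi") comes out as 1, and “How in the world do we build that?” gets a duration tag of 3 and an affect tag of 5, because none of its words appear in the toy lexicon.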
(Before we go on I want to take a break because we now have enough to make an animation match our sentence.
“How in the world do we build that?” is an eight-word sentence, so we know the duration would be about three seconds. The affect is harder to measure, but for this example let’s say that it ends up being a value of 5. So we have Duration=3, Affect=5. These two bits of information, alone, are enough to calculate a rough animation, but first we need to build a small bucket of animations. They are probably keyframes because we want to interpolate them so that they form a chain. We make them of various durations (1 second, 2 seconds, 3 seconds, etc.) so that if we want a three-second chain we can combine 1-second and 2-second animations, or, if we want to avoid replaying the same sequence, we can reverse the order of these links and play the 2-second animation and then the 1-second one. And we make sure that we have these various animation links ready in separate buckets, one for each affect value. So if we get a Duration=3 and Affect=5 we go into the bucket labeled Affect #5 and dig up the animation links that add up to three seconds.
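The bucket lookup just described can be sketched in a few lines of Python. Again, this is only an illustration of the naive approach, not the ACTR implementation: the bucket layout, the clip names, and the build_chain function are hypothetical, and the random pick with a "don't repeat the previous clip" rule is one simple way to vary the chain.

```python
import random

# Hypothetical buckets: affect tag -> {clip length in seconds: [clip names]}.
BUCKETS = {
    5: {
        1: ["neutral_nod_1s", "neutral_shrug_1s"],
        2: ["neutral_lean_2s"],
        3: ["neutral_sweep_3s"],
    },
}


def build_chain(duration, affect, buckets=BUCKETS, rng=None):
    """Return a list of (clip_name, seconds) whose lengths sum to `duration`."""
    rng = rng or random.Random()
    bucket = buckets[affect]
    chain = []
    remaining = duration
    while remaining > 0:
        # Every clip in this affect bucket that still fits in the time left.
        candidates = [
            (name, secs)
            for secs, names in bucket.items()
            if secs <= remaining
            for name in names
        ]
        if not candidates:
            raise ValueError("no clip short enough to fill the remaining time")
        # Avoid replaying the clip we just used when an alternative exists.
        fresh = [c for c in candidates if not chain or c != chain[-1]]
        choice = rng.choice(fresh or candidates)
        chain.append(choice)
        remaining -= choice[1]
    return chain
```

Calling build_chain(3, 5) returns some mix of clips from the Affect #5 bucket that adds up to three seconds, for example a 1-second clip followed by a 2-second one. Passing a seeded random.Random as rng makes the result repeatable.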
The longer the duration, the trickier it gets. If you have a twelve-second animation you might then have to chain together that two-second animation six times, or your one-second animation twelve times, to get the proper duration. Does that make sense?
No. I hope that at this point you’ve stopped and said, “Wait, no, that would be really dumb. To play an animation twelve times would just look like the character is convulsing. That’s bad body language, Mark!”