JCP1 - A Java based DIY robot!
Hardware and Software
What can it do?
The Turing Test
There is an interesting test called the Turing Test, whose goal is to "trick" a person into believing that there is a real human being "inside the box" and not an autonomous computer system. Although my goal is not to make the robot pass the Turing Test, I still want to make an effort to make it feel more "human". This requires the robot to do many things that are not obvious and often not strictly necessary. The idea is to equip the robot with enough features, behaviours and responses so that it can operate autonomously within certain boundaries that are comfortable for the people around it.
In 1991 Hugh Loebner started a competition with a prize for the first entry to pass the Turing Test. This was partly done to spark some serious interest in research on artificial intelligence. So far no one has won the first or second prize, but recently a couple of entries have won the third category prize. The test has attracted a lot of controversy, since passing it does not necessarily mean better artificial intelligence, only better simulation of human dialogue. One could argue that this is not real intelligence, since you basically only need a big enough variation of inputs and outputs for a human to be fooled by the computer system. But it is also an important step towards an interaction with a computer that feels more natural, as if you were addressing another human. And for many mundane tasks an AI could be much better than a human. Considering how fast it can process large amounts of data, it could be a very practical addition to our lifestyle. Many prefer to call these systems Intelligent Agents (IA) rather than Artificial Intelligence. The idea is that an agent can perform a programmed task and has enough features to interact naturally with the human using it. This would generally require a very solid voice recognition engine and a large enough variation in both the input grammar and the output formulations. The agent would then "feel like a human being", but with exceptional skills in information retrieval and processing.
Just to give you some ideas, here are a few scenarios where an agent can help you out:
Simple searches. Let's say you were watching a movie with Harrison Ford, you started discussing the actor, and someone wondered how old he is. You ask your agent: "How old is Harrison Ford?". You could easily do this manually with a simple web search, but the question is already well defined and should be simple for an agent to answer. The question is also generic: "How old is X?". The grammar involved is quite simple, but it requires a large database (or dictation support) for the name. The search is quite simple as well, using for example Wikipedia. The only real challenge for this agent is to parse the resulting web page and retrieve the birth date of the actor. In isolation the agent has limited functionality, but it can easily be expanded by providing more grammars for the same question, such as "What is the age of X?", or totally new questions, such as "What is the latest movie X acted in?". If the number of supported questions is large enough, the agent becomes quite useful in cases where specific information is needed.
Emergency situations. You are at a bus stop with an elderly man who suddenly collapses in front of you. Your agent, in the form of a mobile phone client and a handsfree set, lets you ask it: "Emergency situation. A man has collapsed in front of me. What do I do?". The agent immediately tells you over the handsfree what to do, and also calls the ambulance (and provides your GPS location as well). The grammar involved here is much harder and probably has thousands of variations, which would require heavy work to get right. A more general approach is to link keywords to a probable agent response: emergency + collapse + man.
Calendar. Computers are very useful for keeping track of appointments, birthdays, etc. But they are only useful if you learn to use one regularly and have access to it everywhere. Many people have this on their mobile phones today, and it is a natural part of their lives. A mobile robot could help keep you aware of your appointments and provide simple grammars like "What are my appointments today?" or "When is my mother's birthday?", or the robot could announce important dates a day in advance so you don't forget. Keeping track of things can be a very useful feature, and for some it can be interesting to find out when a certain event happened. For example, you could very easily add log entries like a diary: "Today our child had an asthma reaction". The general idea for any agent is that it follows you like a diary and logs whatever you feel is worthy or safe to log. The agent then becomes an extra "brain" that stores information you would normally have forgotten the day after.
Image search. Let's say you are looking for a car and would like to see some pictures of a particular brand. This is another case of a simple search using a simple grammar: "Show me a Mercedes". The challenge is to find a database of pictures that can reliably give you a correct picture. In many cases Google image search will do that most of the time for simple searches. But the agent can also take other things into account, which makes it more "intelligent". The idea is that the agent learns about the user. You might previously have told the robot that you like the colour blue, in which case a picture search would first try to show you a blue car. Perhaps you like a special type of car, say a station wagon, in which case that is used as well. In time the number of variables that affect a search can grow, much like a search engine shows ads based on what you are looking for. The simplicity of a user interface that seemingly understands what you want to achieve is quite surprising and feels very futuristic, but it is really just a matter of software engineers getting their heads together and creating some good agents for retrieving commonly used information. As the number of agents adds up, you will soon value them as a necessary tool alongside all the other things you take for granted in your daily life.
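The first two scenarios above can be sketched in a few lines of Java. This is only a minimal illustration, not the robot's actual code: the class name AgentGrammars, the two regular-expression templates and the prefix-based keyword test are all assumptions made for the example, and the lookup of the birth date itself (the web page parsing) is left out.

```java
import java.time.LocalDate;
import java.time.Period;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of template grammars with a slot X, plus a keyword fallback. */
class AgentGrammars {

    // Hypothetical grammar templates; each one captures the name X in group 1.
    private static final List<Pattern> AGE_QUESTIONS = Arrays.asList(
        Pattern.compile("how old is (.+?)\\??", Pattern.CASE_INSENSITIVE),
        Pattern.compile("what is the age of (.+?)\\??", Pattern.CASE_INSENSITIVE));

    /** Returns the name X if the utterance matches one of the age grammars. */
    static Optional<String> ageQuestionSubject(String utterance) {
        for (Pattern pattern : AGE_QUESTIONS) {
            Matcher matcher = pattern.matcher(utterance.trim());
            if (matcher.matches()) {
                return Optional.of(matcher.group(1).trim());
            }
        }
        return Optional.empty();
    }

    /** Age in whole years on a given day, once the birth date has been looked up. */
    static int ageOn(LocalDate birthDate, LocalDate today) {
        return Period.between(birthDate, today).getYears();
    }

    /**
     * Keyword fallback: true if every keyword is the start of some word in the
     * utterance, so that "collapse" also matches "collapsed". This is a crude
     * stand-in for real stemming ("man" would also match "many").
     */
    static boolean containsAllKeywords(String utterance, Set<String> keywords) {
        String[] words = utterance.toLowerCase().split("[^a-z]+");
        return keywords.stream().allMatch(
            keyword -> Arrays.stream(words).anyMatch(word -> word.startsWith(keyword)));
    }
}
```

With these templates, both "How old is Harrison Ford?" and "What is the age of Harrison Ford" yield the subject "Harrison Ford", and the bus stop sentence matches the keyword set emergency, collapse, man. Adding a new phrasing of the same question is then just one more pattern in the list.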
So face recognition works fine in OpenCV, but so what? What do I use it for? Note that the face recognizer only detects that there is a face, not whose face it is, so it cannot distinguish between people. It only knows when it is looking at a face. The most interesting part comes first: trying to figure out intelligent things to do with the fact that the robot can recognize a face. There are a number of things that spring to mind, and head/face tracking is one that is immediately possible.
Looking you in the eyes is an interesting function. Knowing where the rectangle of the detected face is gives me enough information to move the servos in the robot's head so that it centers the rectangle of the face it is looking at. It would feel like the robot was watching you when you face it (the recognizer only recognizes frontal faces with the training set supplied).
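A minimal sketch of that centering step could look like the following. Everything numeric here is an assumption for illustration (frame size, degrees per pixel, gain, dead zone), as is the sign convention that a positive pan turns the camera towards the right of the image and a positive tilt towards the bottom. The class name FaceCentering is hypothetical as well.

```java
/** Sketch: turn the offset of a face rectangle into small pan/tilt corrections. */
class FaceCentering {
    // All values below are illustrative assumptions, not measurements from the robot.
    static final int FRAME_WIDTH = 320;
    static final int FRAME_HEIGHT = 240;
    static final double DEGREES_PER_PIXEL = 0.15; // depends on the camera's field of view
    static final double GAIN = 0.5;               // move only part of the way per frame
    static final int DEAD_ZONE_PIXELS = 10;       // ignore tiny offsets to avoid jitter

    /** Returns {panDelta, tiltDelta} in degrees for a face rectangle (x, y, width, height). */
    static double[] correction(int x, int y, int width, int height) {
        int offsetX = (x + width / 2) - FRAME_WIDTH / 2;
        int offsetY = (y + height / 2) - FRAME_HEIGHT / 2;
        return new double[] { step(offsetX), step(offsetY) };
    }

    /** Proportional step for one axis, with a dead zone around the image center. */
    private static double step(int offsetPixels) {
        if (Math.abs(offsetPixels) <= DEAD_ZONE_PIXELS) {
            return 0.0;
        }
        return offsetPixels * DEGREES_PER_PIXEL * GAIN;
    }
}
```

The returned deltas would be added to the current servo angles on every frame. Using a gain below 1 means the head approaches the face over several frames instead of overshooting, and the dead zone keeps it from twitching when the face is already roughly centered.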
The face recognition works fine for finding a face, but it doesn't tell you whose face it is. To figure this out you need a different algorithm, something like a complex OCR algorithm that is able to distinguish between faces. The good thing is that the face detector seems to return the same rectangular area of your face every time, so it can very easily be used as input data for another algorithm for face identification. A simple identifier only compares images by subtracting them, but I am afraid that some sort of alignment will have to be done first, as well as a light correction. I can imagine that the image needs to be intensity "leveled" so that the image database and the input image use the same intensity span. I also assume that the database should contain a selection of images of the person under different light conditions. Another method I can think of is to run an edge finder on all database images and make each "pixel" wider so that the lines are thicker. The input image is run through the same preprocessing before a subtraction test is done. The idea here is to reduce the impact of light conditions and base the comparison more on the outline of the facial features (eyes, nose, mouth, head shape). If I can get it to work and identify 5-6 different faces I would be very happy.
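The intensity leveling and subtraction idea can be sketched as below. This assumes the face rectangles have already been cropped, scaled to the same size, converted to greyscale (0 to 255) and flattened into int arrays. The class FaceCompare and its method names are made up for this example. It does no alignment and no edge processing, so it is only the simplest variant described above.

```java
/** Sketch: compare equally sized greyscale face crops after leveling their intensity. */
class FaceCompare {

    /** Stretches grey values so that the darkest pixel becomes 0 and the brightest 255. */
    static int[] levelIntensity(int[] gray) {
        int min = 255;
        int max = 0;
        for (int value : gray) {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        int[] leveled = new int[gray.length];
        if (max == min) {
            return leveled; // flat image: nothing to stretch, leave it all zero
        }
        for (int i = 0; i < gray.length; i++) {
            leveled[i] = (gray[i] - min) * 255 / (max - min);
        }
        return leveled;
    }

    /** Mean absolute difference per pixel after leveling; 0 means identical. */
    static double difference(int[] a, int[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("images must be non-empty and equally sized");
        }
        int[] leveledA = levelIntensity(a);
        int[] leveledB = levelIntensity(b);
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(leveledA[i] - leveledB[i]);
        }
        return (double) sum / a.length;
    }

    /** Index of the database image closest to the input, or -1 if the database is empty. */
    static int bestMatch(int[] input, int[][] database) {
        int best = -1;
        double bestDifference = Double.MAX_VALUE;
        for (int i = 0; i < database.length; i++) {
            double d = difference(input, database[i]);
            if (d < bestDifference) {
                bestDifference = d;
                best = i;
            }
        }
        return best;
    }
}
```

Because both images are stretched to the same span before subtraction, a uniformly brighter or higher-contrast copy of the same face gives a difference of 0. Uneven lighting, such as a shadow across half the face, is not handled by this, which is where several database images per person or the edge-based variant would come in.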
One thing that I find important about the robot is the idea of acting in the context of something, a sort of continuous state switching. The face detection results can be one of the inputs to this. For example, if the robot is idle and then sees a face, it can choose to act on it. It can move closer to the person (based on the size of the face rectangle) and/or it can greet the person. When the robot has not seen any faces for a while, it can revert to an idle mode (where it does its own autonomous things). The degree of reaction has to be parameterised and controlled, so that you can affect the robot's behaviour by telling it to lower its rate of communication. This is necessary to make the feature nice rather than annoying. This kind of "fuzzy logic" parameter is an interesting aspect of the robot and will be part of the behaviour modules I choose to add to it. It might require different relations to different people, since some like to hear the robot greet them while others don't.
Taking time into account
Any object detection has to have some degree of accuracy and consistency for the robot to react to it. Otherwise its head would ping-pong towards anything that resembles a face. The idea is to take time into consideration. If a face is recognized and its rectangle doesn't move too far between frames, the robot can be fairly certain that there is a face. I have noticed that the face cascade in OpenCV sometimes recognizes other things as faces, and usually these are only there for one frame of the camera input, simply because shadows and other things in the room happened to line up in a way that made the face recognizer report a positive. Since the robot's head is moving and a person's head is never completely still, we have to allow for some movement when trying to locate a "real" face. This algorithm will also be a bit fuzzy, and I have to experiment with the deltas calculated between face recognition rectangles. It will also be important for making the robot focus on one particular head for a certain amount of time in cases where it can see several faces. The robot might still want to switch its "eye" between faces at certain intervals, though, since that would feel a bit more "human" too.
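One simple way to take time into account is to count how many consecutive frames contain a detection that stays close to the previous one. The sketch below does that for a single face. The class FaceStabilizer and both of its parameters are assumptions for illustration, and the right values for the allowed shift and the number of frames would have to be found by experiment.

```java
/** Sketch: accept a face only after it has stayed roughly in place for several frames. */
class FaceStabilizer {
    private final int maxShiftPixels; // largest allowed movement between two frames
    private final int framesRequired; // consecutive consistent frames needed
    private int lastCenterX;
    private int lastCenterY;
    private int streak = 0;

    FaceStabilizer(int maxShiftPixels, int framesRequired) {
        this.maxShiftPixels = maxShiftPixels;
        this.framesRequired = framesRequired;
    }

    /** Feed the face center for one frame; returns true once the face counts as real. */
    boolean onDetection(int centerX, int centerY) {
        boolean closeToLast = streak > 0
            && Math.abs(centerX - lastCenterX) <= maxShiftPixels
            && Math.abs(centerY - lastCenterY) <= maxShiftPixels;
        streak = closeToLast ? streak + 1 : 1; // a big jump starts a new candidate
        lastCenterX = centerX;
        lastCenterY = centerY;
        return streak >= framesRequired;
    }

    /** Call for frames where the detector found nothing. */
    void onNoDetection() {
        streak = 0;
    }
}
```

With new FaceStabilizer(20, 3), a false positive that shows up for a single frame never reaches the required streak, while a real face that drifts a few pixels per frame is accepted on the third frame. The allowed shift has to be generous enough to cover the robot's own head movement.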
Smooth servo movements
I-am-a-robot! Many hobby robotics projects seem to be built with little or no use of math, and that often shows in how rigidly and linearly all the servos move. While the jerky robotic movements of the old movies are a novel and romantic idea, I really don't see why people still do it. If you look at e.g. the Sony Qrio you can see how it is supposed to be. I will certainly write the servo code so that a move accelerates away from its start position and decelerates into its target position. This will make the robot feel smoother and more "human".
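One common way to get that accelerate-then-decelerate behaviour is an easing curve. The sketch below uses the "smoothstep" polynomial 3t^2 - 2t^3, which starts and ends with zero speed. This is a swapped-in, well-known technique chosen for the example, not necessarily what the robot's servo code ends up using, and the class ServoEasing is hypothetical.

```java
/** Sketch: eased servo positions that accelerate out of the start and brake into the target. */
class ServoEasing {

    /** Smoothstep easing: maps progress t in [0, 1] to an eased fraction in [0, 1]. */
    static double ease(double t) {
        double clamped = Math.max(0.0, Math.min(1.0, t));
        return clamped * clamped * (3.0 - 2.0 * clamped);
    }

    /** Positions (e.g. in degrees) to send to the servo at equal time intervals. */
    static double[] profile(double from, double to, int steps) {
        if (steps < 1) {
            throw new IllegalArgumentException("steps must be at least 1");
        }
        double[] positions = new double[steps + 1];
        for (int i = 0; i <= steps; i++) {
            positions[i] = from + (to - from) * ease((double) i / steps);
        }
        return positions;
    }
}
```

For a move from 0 to 90 degrees in 10 steps, the first step covers about 2.5 degrees and the steps around the middle about 13 degrees, so the servo starts gently, moves fastest halfway and slows down again before it stops. A linear move would cover 9 degrees in every step.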
When does the robot listen?
Human beings are excellent at filtering out information from all our senses. Our ears are quite remarkable at this, since we can pick out one conversation among others in a crowd of people or with all sorts of noise around us. For a robot to do the same is a very difficult task, one that is not really solvable within a hobby budget. However, there are a number of things we can do to grab the robot's attention. First we need to have a good microphone setup for speech recognition.
But this is only the beginning. The robot will very easily pick up any sound it hears as a possible command. We also need some additional algorithms for it to know when it is actually being spoken to.
Speech recognition attention states
As the robot will be equipped with a number of grammars it can respond to, the number of words it can react to grows quickly. Simple tests have shown that even a strict grammar lets similar sounding words through as accepted grammar matches. For example, I don't have to say "stop" to make the robot stop; it also reacts to "blopp" or "crop". These sound very similar, and since there is no grammar accepting those words it will think I was saying "stop". This is very good in cases where I don't say the words exactly right, and it allows untrained voices to tell it things, but it can also trigger random events. As an example, I was speaking to my wife and suddenly the robot started moving around. It had picked up some words that it accepted as movement commands. The recognition strictness can be adjusted in MS SAPI 5.1, but I really want to explore the idea of context a bit further.
The idea is that if the robot has been idle for a while, it goes into a state where you have to begin the command with its name (or a nickname) to wake it up again. So although the grammar reacts to "move forward", the command will only be accepted if you say e.g. "robot move forward". After this point the robot is in a new context, and you no longer have to say "robot" or anything else to get its attention. It will then react to a plain "move forward" command. After a minute of idleness it reverts to the state where it needs its name again. This will greatly limit the number of false commands it recognizes. Chances are that I will add an LED somewhere on the robot that is on whenever I have its attention and off when I do not. This gives natural feedback about when I have to say "robot" in front of the command.
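The attention state can be sketched as a small filter that sits between the speech recognizer and the command handler. The class AttentionGate and its method names are assumptions for this example. The wake word and the timeout are constructor parameters, and the current time is passed in so the logic can be tried out without waiting a real minute.

```java
import java.util.Optional;

/** Sketch: only let commands through while the robot is paying attention. */
class AttentionGate {
    private final String wakeWord;
    private final long timeoutMillis;
    private boolean attentive = false;
    private long lastAcceptedMillis = 0;

    AttentionGate(String wakeWord, long timeoutMillis) {
        this.wakeWord = wakeWord.toLowerCase();
        this.timeoutMillis = timeoutMillis;
    }

    /** True while commands are accepted without the wake word (could drive the LED). */
    boolean isAttentive(long nowMillis) {
        return attentive && nowMillis - lastAcceptedMillis <= timeoutMillis;
    }

    /** Returns the command to execute, or empty if the phrase should be ignored. */
    Optional<String> filter(String recognizedPhrase, long nowMillis) {
        attentive = isAttentive(nowMillis); // drop attention after the idle timeout
        String phrase = recognizedPhrase.trim().toLowerCase();
        String command;
        if (phrase.startsWith(wakeWord + " ")) {
            command = phrase.substring(wakeWord.length() + 1).trim();
        } else if (attentive) {
            command = phrase;
        } else {
            return Optional.empty(); // not addressed to the robot
        }
        attentive = true;
        lastAcceptedMillis = nowMillis;
        return Optional.of(command);
    }
}
```

With new AttentionGate("robot", 60000), a stray "move forward" is ignored, "robot move forward" is accepted as "move forward", and a following "stop" within the next minute is accepted without the name. In this sketch, saying only "robot" with no command after it does not wake the robot up. That is a simplification that could easily be changed.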
The context switching also alters the acceptable grammar dynamically, so that the robot responds to shorter grammars in the context of a question or other activities. This also makes the conversation more fluid. Furthermore, the system stores dynamic parameters in a short term memory that can be referred to. For example, you might have said something like "My mother's name is Anne" and next "She is 50 years old". The short term memory here holds that a person named Anne was mentioned. A relational link named "mother" is made between a person node called Anne and the current speaker, and the mother link in turn has an "is" link to female/she. The word "She" then refers to Anne through the relations in a short term memory map (based on when each node was last touched). This is very much how humans communicate, since we can take lots of shortcuts in our language when the listener knows the context of the conversation. The grammar switching itself is quite simple.
The short term memory can also be used to ask related questions. The system then needs a question form for every piece of information that the user can provide directly. For example, the input statement "X is N years old" would have the question "How old is X?". Both can be used as input grammars by the robot, but the question grammar can be used as output as well. A normal conversation would require a very large number of grammars related to a particular agent topic if it is to feel natural. While generic relational data can be input blindly without the agent knowing anything about the words, the more each agent is programmed for specific topics, the more natural it will feel, since you get a feeling that it understands what it is being told. For example, if you tell the agent "I have a sister and a brother", it can ask two questions: "What is your sister's name?" and "What is your brother's name?". After this it can use their names interchangeably with the words brother/sister, asking "How old is your sister?", and when it asks "How old is your brother?" you can say things like "2 years younger than my sister".
The calculations are simple, but finding the grammars can be a daunting task, especially since there are so many things that can be said. It becomes more interesting the moment you link other information to the facts you just stated. For example, the agent can calculate when your sister was born and then inform you that "Your sister was born in the same year as X". Or, based on someone's age, it can ask a related question: "Does your brother work somewhere?". Information gathering and discussion can then become rather interesting as the relational database builds up with all your inputs. Adding a rich grammar is hard work, though.
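The "she refers to Anne" lookup can be sketched as a list of recently mentioned people, each with a pronoun and a "last touched" counter. This is a much reduced stand-in for the relational map described above: the class ShortTermMemory is hypothetical, it stores no relations such as "mother", and it assumes the pronoun for each person is already known when the person is mentioned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Sketch: resolve "she"/"he" to the most recently mentioned matching person. */
class ShortTermMemory {

    private static final class Person {
        final String name;
        final String pronoun;
        long lastTouched;

        Person(String name, String pronoun, long lastTouched) {
            this.name = name;
            this.pronoun = pronoun;
            this.lastTouched = lastTouched;
        }
    }

    private final List<Person> people = new ArrayList<>();
    private long clock = 0; // increases with every mention, newest is highest

    /** Records a mention, e.g. mention("Anne", "she") after "My mother's name is Anne". */
    void mention(String name, String pronoun) {
        for (Person person : people) {
            if (person.name.equals(name)) {
                person.lastTouched = ++clock; // touch an existing node
                return;
            }
        }
        people.add(new Person(name, pronoun, ++clock));
    }

    /** Returns the name of the most recently touched person with this pronoun, if any. */
    Optional<String> resolve(String pronoun) {
        Person best = null;
        for (Person person : people) {
            if (person.pronoun.equalsIgnoreCase(pronoun)
                    && (best == null || person.lastTouched > best.lastTouched)) {
                best = person;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.name);
    }
}
```

After mention("Anne", "she"), the statement "She is 50 years old" can be attached to Anne because resolve("She") returns her name. If another woman is mentioned in between, "she" moves to that person until Anne is touched again, which matches the "when was the node last touched" rule.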
Servo motor hum
A problem I came across with my robot is that since my microphone is on the web camera, it easily picks up the sound of the servos when the robot moves its head about. It took a while to understand this. At times I noticed that the recognition rate was lower than it should be, and I had to repeat my commands. But what really made me understand the problem was when I had added some short grammars in a context for controlling the wheels, and the robot would sometimes start moving about after it had moved its head. The servo sound was just enough for the recognizer to take it as a command. I think the recognition accuracy setting was a bit low, which is why it kept turning servo motor buzzing into commands. This is very unfortunate, since I really want my robot to keep moving its head, following the speaker's head, looking around the place and generally shifting its head a bit "to make it more alive". One option is of course to turn off the microphone during servo movements, but that would stop me from giving any commands while it is moving its head. It's clear that I need to dampen the sound of the servos, or get higher grade servos with less noise if those are available. Using a Bluetooth microphone is also a solution, but I would also like to be able to speak to the robot without a remote microphone.
Another problem is that the wires from the web camera and speaker are fairly thick, which makes the servos work harder to move the head. For the movement itself this is not a problem, but the wires push or pull on the servos, which makes them struggle to hold a certain position, and you get a constant hum from them as they try to correct the position. I probably need to remove the plastic around the wires and find some way of insulating them, as I would think the motors in the servos can generate noise. Perhaps I could try to insulate the motors instead. Chances are that I also need to get a big heatsink for the CPU, since it now relies on a small fan spinning at high speed. Although it's not very audible inside the robot, I'd like it to be as quiet as possible, and any noise from it contributes to a lower speech recognition rate.