
The wiki pages contained in this section provide a summary of work completed over the summer of 2008 towards the implementation of a talking face in the DIVA project. For background on the DIVA project, please see the list of publications, accessible from the left-hand menu of this wiki. For the purpose of this discussion, it suffices to know that the DIVA project is a gesture-to-speech system based on phoneme recognition.

Background

The idea of creating a talking face is not a new one, and among the research projects conducted to this end, two principal methods of simulation exist. The first is audio-driven, and relies on a large database of video clips to piece together the desired facial motions; for an example of this method, see the Video Rewrite project. The second method is text- or phoneme-driven, and uses a set of static images called "visemes", one for each phoneme or sound, interpolating between them to generate smooth motions. The MikeTalk project is an example of a talking face that uses the viseme method. Since the DIVA system is based upon phoneme recognition, the viseme method will be used, and its requirements are discussed below.
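To make the viseme method concrete, the following is a minimal sketch of interpolating between two viseme parameter vectors. The class name, the eight-element vectors, and the numeric values are purely illustrative assumptions, not data from MikeTalk or the DIVA project.

    import java.util.Arrays;

    /**
     * Illustrative sketch of viseme interpolation: each viseme is a vector of
     * facial parameters, and intermediate frames are produced by blending the
     * starting vector toward the target vector. All names and values here are
     * hypothetical.
     */
    public class VisemeInterpolation {

        /** Linearly blend two parameter vectors; t runs from 0.0 (start) to 1.0 (target). */
        public static double[] interpolate(double[] start, double[] target, double t) {
            double[] frame = new double[start.length];
            for (int i = 0; i < start.length; i++) {
                frame[i] = (1.0 - t) * start[i] + t * target[i];
            }
            return frame;
        }

        public static void main(String[] args) {
            double[] visemeAh = {0.8, 0.1, 0.0, 0.3, 0.0, 0.0, 0.2, 0.0}; // made-up mouth shape
            double[] visemeOo = {0.2, 0.7, 0.1, 0.0, 0.4, 0.0, 0.0, 0.1}; // made-up mouth shape
            // Generate five intermediate frames between the two shapes.
            for (int step = 0; step <= 5; step++) {
                System.out.println(Arrays.toString(interpolate(visemeAh, visemeOo, step / 5.0)));
            }
        }
    }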

The existing DIVA system, as of May 2008, consisted of three principal modes of operation:

i) Training

Allows multiple users to specify gestures for each phoneme, and store these "training sessions" in session files.

ii) Accents

The user selects their best training sessions, and saves an "accent" consisting of a complete mapping of hand gestures to their corresponding phonemes (see the sketch after this list).

iii) Perform

The performance window. Here the user loads an accent, and launches the system in performance mode, which will then generate audio output from glove data.
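As a point of reference for the discussion below, here is a purely hypothetical sketch of how an accent could be represented in code: a lookup table from gesture codes to phoneme labels. The class name, the key format, and the use of Java serialization are illustrative assumptions, not the actual DIVA session or accent file formats.

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    /**
     * Hypothetical sketch of an "accent": a complete mapping from hand-gesture
     * codes to phoneme labels, assembled from the user's best training sessions.
     * The key and value types are placeholders, not the real DIVA formats.
     */
    public class Accent implements Serializable {

        private final Map<String, String> gestureToPhoneme = new HashMap<>();

        /** Record the gesture chosen for a phoneme during training. */
        public void put(String gestureCode, String phoneme) {
            gestureToPhoneme.put(gestureCode, phoneme);
        }

        /** Look up the phoneme for an incoming gesture at performance time. */
        public String phonemeFor(String gestureCode) {
            return gestureToPhoneme.get(gestureCode);
        }
    }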

Requirements

In order to implement a talking face based on phoneme data, three things are required:

i) An appropriate 3-d face model which can be controllably stretched in real time

This requirement will be met by the KuraFace model, a 3-d mesh constructed from a set of nodes which can be controllably stretched by adjusting one or more of eight parameter values. The parameters are called Principal Components, or PCs, and the set of eight values is known as a PC vector. KuraFace is a model contained in the Artisynth project, a suite of 3-d modeling software that runs in its own main window.
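The following is a minimal sketch of how the DIVA side might hold an eight-value PC vector before sending it to the face model. The class name and the clamping range are assumptions made for illustration and do not come from the Artisynth API.

    import java.util.Arrays;

    /**
     * Minimal sketch of a PC vector: eight Principal Component values used to
     * deform the KuraFace mesh. The fixed size of eight comes from the text
     * above; the clamping range and class name are assumptions.
     */
    public class PCVector {

        public static final int NUM_PCS = 8;
        private final double[] values = new double[NUM_PCS];

        public void set(int index, double value) {
            // Clamp to an assumed normalized range so that extreme values cannot
            // push the mesh into implausible shapes.
            values[index] = Math.max(-1.0, Math.min(1.0, value));
        }

        public double get(int index) {
            return values[index];
        }

        /** Copy of the raw values, in the order expected by the face model. */
        public double[] toArray() {
            return Arrays.copyOf(values, NUM_PCS);
        }
    }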

ii) A set of tools enabling the user to create their own phoneme-to-viseme mappings

This will consist of a new mode in the DIVA project, called "Vizeme Tools". Specifically, the user must be able to create and store a set of facial shapes, one for each phoneme, and load this mapping at performance time.
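As an illustration of what such a mapping might look like in code, here is a hedged sketch of a phoneme-to-viseme table with simple save and load methods. The plain-text file format, class name, and method names are assumptions, not the format the "Vizeme Tools" mode actually uses.

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;

    /**
     * Hypothetical sketch of a phoneme-to-viseme mapping: one set of eight PC
     * values per phoneme, written to and read from a plain text file. The file
     * format is an assumption made for illustration.
     */
    public class VisemeMap {

        private final Map<String, double[]> phonemeToPcs = new HashMap<>();

        public void setViseme(String phoneme, double[] pcValues) {
            phonemeToPcs.put(phoneme, pcValues.clone());
        }

        public double[] getViseme(String phoneme) {
            return phonemeToPcs.get(phoneme);
        }

        /** Write one line per phoneme: the label followed by its PC values. */
        public void save(File file) throws IOException {
            try (PrintWriter out = new PrintWriter(new FileWriter(file))) {
                for (Map.Entry<String, double[]> e : phonemeToPcs.entrySet()) {
                    StringBuilder line = new StringBuilder(e.getKey());
                    for (double v : e.getValue()) {
                        line.append(' ').append(v);
                    }
                    out.println(line);
                }
            }
        }

        /** Load a mapping previously written by save(). */
        public static VisemeMap load(File file) throws IOException {
            VisemeMap map = new VisemeMap();
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.trim().split("\\s+");
                    if (parts.length < 2) continue;
                    double[] pcs = new double[parts.length - 1];
                    for (int i = 0; i < pcs.length; i++) {
                        pcs[i] = Double.parseDouble(parts[i + 1]);
                    }
                    map.setViseme(parts[0], pcs);
                }
            }
            return map;
        }
    }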

iii) An additional object in performance mode to process phoneme data and drive the face model

This object will need to launch the KuraFace model in the Artisynth window, connect to it, and then send it eight parameter values several times a second to simulate movement of the face. The parameter values will be calculated by applying the phoneme-to-viseme mapping to the incoming phoneme data during performance.
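A minimal sketch of such a driver follows, assuming a periodic update loop that eases the current parameters toward the viseme of the most recently recognized phoneme. The FaceConnection interface, the 30 Hz update rate, and the blending factor are assumptions made for illustration; they are not the actual Artisynth interface or the DIVA implementation.

    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /**
     * Hypothetical sketch of the performance-mode driver object. At a fixed
     * rate it blends the current face parameters toward the viseme of the most
     * recently recognized phoneme and pushes the result to the face model.
     */
    public class FaceDriver {

        /** Stand-in for whatever call actually delivers PC values to the KuraFace. */
        public interface FaceConnection {
            void sendParameters(double[] pcValues);
        }

        private static final int UPDATE_RATE_HZ = 30;  // assumed; "several times a second"
        private static final double BLEND = 0.25;      // fraction of the remaining distance covered per update

        private final Map<String, double[]> phonemeToViseme; // phoneme label -> eight PC values
        private final FaceConnection face;
        private final double[] current = new double[8];
        private volatile double[] target = new double[8];

        public FaceDriver(Map<String, double[]> phonemeToViseme, FaceConnection face) {
            this.phonemeToViseme = phonemeToViseme;
            this.face = face;
        }

        /** Called whenever performance mode recognizes a new phoneme. */
        public void onPhoneme(String phoneme) {
            double[] pcs = phonemeToViseme.get(phoneme);
            if (pcs != null) {
                target = pcs.clone();
            }
        }

        /** Start the periodic update loop that animates the face. */
        public void start() {
            ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
            timer.scheduleAtFixedRate(() -> {
                double[] goal = target;
                for (int i = 0; i < current.length; i++) {
                    current[i] += BLEND * (goal[i] - current[i]); // ease toward the target viseme
                }
                face.sendParameters(current.clone());
            }, 0, 1000 / UPDATE_RATE_HZ, TimeUnit.MILLISECONDS);
        }
    }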