ANIMAZIONE TRIDIMENSIONALE REALISTICA DEL MOVIMENTO...
POLITECNICO DI TORINO, III Faculty of Engineering
Degree Course in Computer Engineering
DEGREE THESIS
ANIMAZIONE TRIDIMENSIONALE REALISTICA DEL MOVIMENTO LABIALE
IN AMBIENTE MAYA
(Realistic three-dimensional animation of lip movement in the Maya environment)
Supervisor: Ing. Fulvio CORNO
Candidate: Daniela ROSSO
January 2002
Summary
Chapter 1: Introduction
Chapter 2: Alias|Wavefront Maya 4.0 Unlimited
  2.1 A deeper look at the API and MEL
  2.2 Moving within the Maya environment
  2.3 Blendshape deformer
    2.3.1 Target and base objects
    2.3.2 Target shapes, base shapes, and blend shapes
    2.3.3 Targets
    2.3.4 Keyframing
    2.3.5 Using keyframing animation
Chapter 3: Lip Synch issues
  3.1 Lip synch steps
  3.2 Getting to know phonemes
  3.3 Getting to know visual phonemes
  3.4 The lip synch rules
    3.4.1 Record the dialog first
    3.4.2 Never animate behind synch
    3.4.3 Don't exaggerate
    3.4.4 Rules were made to be broken
  3.5 The lip synch process
    3.5.1 Break down the speech pattern
    3.5.2 Analyze the audible dialog to determine phonemes
    3.5.3 Use a timing chart to set keyframes
    3.5.4 Tweak the finished animation
    3.5.5 Phoneme dropping guidelines
  3.6 Thinking as an animator
  3.7 Text to phoneme
    3.7.1 Focusing on language
  3.8 Synchronization and artistic freedom
Chapter 4: Solutions
  4.1 Interface realization
    4.1.1 First step: creating the data structure
    4.1.2 Second and third tab: collecting information
  4.2 Interaction between MEL and the API
  4.3 Algorithm
    4.3.1 Text-to-phoneme translation
    4.3.2 Keyframing
Chapter 5: Software description
  5.1 Speak U2 installation
  5.2 Plug-in integration
  5.3 User interface
  5.4 Tips & tricks
  5.5 The demo animation
    5.5.1 Making the animation
Chapter 6: Conclusions
Bibliography
Chapter 1
Introduction
In recent years Computer Graphics and Visual Effects have grown
exponentially, appearing more and more frequently in Hollywood
productions as well as in European ones.
Computer Graphics in particular has found its natural role not only in
cinematographic contexts, but also in industrial activities. The continuous
demand for 3D models and characters animated with computer graphics techniques
confirms this new attitude even in Italian TV advertising, where, for example,
one can find stylized humans emerging from a tube of toothpaste, friendly 3D
ants playing with toilet paper, or visual effects such as boats surfing on a
desert or lions swimming in the abyss.
Far from the cinematographic context of films and advertising spots, 3D models
are often used to generate virtual prototypes of industrial products of any
kind, in order to test the possibility of introducing them into the real market
without spending any material or working time.
Even the public sector benefits from the contribution of three-dimensional
modeling; building engineering, for example, uses it to preview the
environmental impact of imposing structures such as bridges, being able to
simulate the atmospheric agents these buildings will be exposed to. Moreover,
computer graphics becomes useful to evaluate the impact that such modern
buildings could have on the environment.
Nowadays, years of research and computer experiments, carried out in public and
private labs all over the world, have led to the spread of new and different
ways of representation. Computer-generated images are a large part of every
branch of communication. But beyond the technological evolution of these
products, they are even more important for their cultural impact. The
improvements reached in this area have marked a great perceptual turning point
that is gradually modifying our aesthetic taste and our daily lifestyle.
In spite of these other uses, it is in film making that Computer Graphics
finds its greatest application and challenge, and its development keeps
increasing thanks to the parallel growth of hardware technology and of more
powerful "special purpose" software.
Cinema's history is principally based on character animation, i.e. the creation
of movement and life for figures modelled and drawn with various materials.
Many characters, some human, some animal, some alien, most of them endearing,
have been created with rolling eyes, rounded noses, hair defying gravity, or
exceptionally elastic body parts.
3D animation, given over to the fun of puppets, naturally leans toward
imperfect models: the toy, the soldier, the rubber puppet, in general the
normally inanimate object that comes to life, since these can play the funny
character without abandoning the pursuit of photorealism.
The exponential improvement in technology has allowed the development of more
specialized programs able to relieve the animator of many difficulties.
Making a modelled and drawn figure act in a realistic manner is very difficult.
In fact, early Computer Graphics commonly used dark scenes, because everything
hidden in the dark does not show the imperfections of its production. Today
these figures are usually on stage in foregrounds and full-length films. But
the challenge has always been the same: to model and to animate in a realistic
manner a human character, an anthropomorphic being.
Considering the popular movies of these days, it becomes evident that there is
a tendency to search for human features when animating characters that are
normally inanimate or unreal. An example is Pixar's "Toy Story", with little
toys as leading characters; while some of them maintain their mechanical
movements, the main ones move in a typically human manner.
Another example is the recent DreamWorks "Shrek", focusing on the adventures of
a good ogre whose purpose is to save an anthropomorphic princess, helped by a
friendly donkey. Even if the characters are animated in a perfect way, they are
not really human, and princess Fiona too is part of an animation film that
cannot pretend to be so realistic as to make the viewer think its actors are
real instead of digital ones.
But watching animation films, one observes that to endow a character with human
features and movements, even a non-anthropomorphic one, it is necessary to care
not only about the body, but especially about the face.
In fact, the appearance of a character is strongly determined by its face and
by its ability to express emotions through facial movements.
Figure 1.1: The Shrek cast standing
The representation of a human face is thus surely one of the most complex
aspects of computer animation. At present, the movie considered the "state of
the art" as far as the realism of digital characters is concerned is "Final
Fantasy: The Spirits Within" by Squaresoft. In this movie, people really
believe they are seeing real actors instead of digital ones, thanks to perfect
modeling work, to exact body movement and especially to the impressive realism
of the characters' faces.
To make a face real, it is fundamental to give it the ability to express
emotions with a combination of eyebrow and eye movements as well as facial
muscle movements.
But the element distinguishing a realistic anthropomorphic character from an
obviously digital one is the lip movement during speech. In fact, a character
whose lips are not synchronized with the sound it speaks immediately appears
artificial and loses its aim of being real; if the character is to be
realistic, its lips have to actually describe the movements necessary to utter
that sound.
Figure 1.2: Final Fantasy: Dr. Aki Ross
To notice this, it is enough to think of cartoon characters, which simply open
and close their mouths according to the speech, without describing the real lip
movements.
Good lip synchronization, called "lip synch" in the Computer Graphics and
Animation context, is based on carefully creating the movements of lips,
tongue, jaw and cheek muscles; these features, together with the facial
expression made of changing eyes and eyebrows, give the character realism.
The goal of this thesis is to analyze the methodology used in lip synching, to
find an algorithm that automates this process, and to implement it, obtaining
an animation tool, integrated into the 3D Alias|Wavefront Maya 4.0 package,
that helps achieve a good lip synch result faster and more easily. The purpose
of this plug-in is to spare the animator, who normally cannot waste time
understanding concepts such as phonemes, the tedious operations of keyframing,
translating words, and trying lip movements in front of a mirror. This plug-in,
named Speak U2, makes every step of the lip synch animation really easy,
leaving the animator only to listen to the audio speech and enter the
pronunciation times and spoken words through an easy interface.
The next chapter describes Alias|Wavefront Maya 4.0 Unlimited, one of the best
known and most widely used 3D animation and visual effects packages. The third
chapter analyzes the lip synch methodology by which a correct lip
synchronization can be reached.
The fourth chapter covers the Speak U2 implementation and its technical
aspects, while the fifth describes the user interface and explains how to use
the plug-in. The last chapter collects the conclusions about the work and the
obtained results.
Chapter 2
Alias|Wavefront Maya 4.0 Unlimited
Maya Unlimited is the most powerful and advanced solution for creation of
computer-generated animation and special effects. All Maya Unlimited features are
perfectly integrated into a single environment that is completely customizable and
optimized for maximum productivity. Innovative user interaction techniques in Maya
Unlimited deliver the smoothest possible workflow. Maya Unlimited is for the
advanced digital content creator who wants to work with the ultimate creative tools for
maximum creative possibilities. Maya Unlimited includes:
• Modeling: industry-leading NURBS and polygon modeling tools, and Maya
Advanced Modeling functionality unique to Maya Unlimited
• Artisan: integrated brush interface for digital sculpting and attribute
painting
• Paint Effects: groundbreaking paint technology for adding amazing natural
detail on a 2D canvas or in true 3D space
• Character Animation Tools: general keyframing, expressions, inverse
kinematics, powerful character skinning and advanced deformation tools
• Dynamics: integrated particle system plus high-speed rigid-body dynamics
• Rendering: film-quality rendering with advanced visual effects and
interactive photorealistic rendering
• API: access to Maya's internal data structures, enabling programmers to
enhance and complement Maya for their own production needs
• MEL (Maya Embedded Language): open interface for customizing and scripting
any aspect of Maya's functionality
• Advanced Modeling: additional NURBS and subdivision surface features
• Live: precision matchmoving that allows the marriage of live-action footage
with 3D elements rendered in Maya Unlimited
• Cloth: the fastest, most accurate solution for simulating a wide variety of
digital clothing and other fabric objects
• Fur: incredibly realistic styling and rendering of fur and short hair
• Batch Rendering: two supplemental rendering licenses to increase rendering
productivity and maximize throughput.
2.1 A deeper look at the API and MEL
The API (Application Programming Interface) is the key to the openness of the
Maya architecture at the lowest level. The API provides the most direct access
to all of Maya's internal data. Through the API, highly efficient plug-ins can
also be created to extend the capability of the system. Maya was built to
ensure that features available through the user interface are exposed
identically in the API and produce identical results for programmers.
MEL provides higher-level extensibility that is accessible to users who may not
be experienced programmers. MEL scripts can be used to quickly perform tedious
repetitive tasks. MEL also provides an efficient way to create prototypes and
test new tools which might ultimately be implemented via the API. With a full
complement of flow control commands, a C-like programming syntax, and a broad
set of user interface generation utilities, MEL can also be used to build very
detailed new functionality, including custom character setup, new particle
effects, and even specialty animation systems.
Together, the API and MEL offer methods and tools for extending Maya's
functionality from the highest to the lowest level. Through the inclusion of
both a C++ API and a robust scripting language with a well-understood syntax,
the opportunity to build extensions is available to game programmers and game
artists alike.
Figure 2.1: Maya programming architecture (layers, from the user down: User
Interface, Plug-in, MEL, API, Engine)
2.2 Moving within the Maya environment
The programmer who wants to extend Maya's functionality has two ways to do it:
build a stand-alone application or create a plug-in integrated into Maya's own
structure.
Implementing a stand-alone application would have the advantage of obtaining a
module independent from Maya itself, allowing it to interact with the Microsoft
Foundation Classes and to use other applications, such as database programs, to
store and modify data. These features may be much appreciated when building an
animation tool such as a lip synch application, because the powerful features
of external software can simplify complex tasks. On the contrary, building a
stand-alone application would disturb the workflow philosophy on which Maya is
based. In fact, in Maya any plug-in is considered an internal module and the 3D
artist never has to leave Maya itself. This is a great feature of this 3D
animation software, because it allows the artist to remain within the same
environment without losing concentration or getting frustrated by moving
between application windows. Moreover, a well-built Maya plug-in recalls the
same look and structure of its host software, giving the user the impression of
using a native Maya tool. Because of these considerations, it was decided to
make Speak U2 a Maya plug-in instead of a stand-alone application.
Even if Maya provides the possibility of creating plug-ins or extensions using
the API or MEL, it was not conceived to ease the programmer's work, above all
where MEL is concerned. In fact, using this scripting language is very tedious
and frustrating because of the complete absence of a debugger and the need to
restart Maya every time a change occurs. Moreover, the script editor internal
to Maya lacks adequate word wrap and offers no automatic syntax coloring, which
is generally useful to avoid typing mistakes. Figure 2.2 shows the Maya script
editor.
Figure 2.2: Maya's script editor
The lower part of the script editor is the editable one, where it is possible
to write commands or procedures; to execute them it is necessary to select them
and press the Enter key. The result message of the execution is displayed in
the upper part, but the real value of this half of the window is that it echoes
all the commands, which can be useful when the documentation does not make
clear how to proceed with some command.
The only way to ease this manner of working is to use an external text editor,
writing programs and then loading or sourcing them from the script editor.
Fortunately some of these programs are better suited to programming languages,
offering syntax coloring, word wrap and tabulation. For example, ES-Computing
EditPlus version 2.10c can import and set the MEL syntax, making it less
frustrating to write MEL code, even if only executing it within Maya can inform
the programmer that an error occurred; this requires Maya to be restarted, with
a long loss of time and of CPU and memory resources.
MEL can be used to automate repetitive sequences of commands, to build personal
user interfaces or to customize the existing one, but it is not suited to
complex calculations or to procedures that do not refer to the graphic
interface. On the other hand, the Maya API and C++ are used when it is
necessary to implement more complex algorithms such as those needed in a lip
synch plug-in. Maya requires its plug-ins to be built within Microsoft Visual
C++ 6.0, so the programmer can use a solid debugger to test his code. Building
the C++ files generates an .mll file that has to be loaded within Maya through
the Plug-in Manager, shown in the figure below.
Figure 2.3: The Maya Plug-in Manager
To be able to write a plug-in for Maya Unlimited 4.0 it was necessary not only
to understand MEL and the API, but also to become confident with concepts
typical of 3D animation and modeling such as keyframing, blendshape
deformation, interpolation and tangents. These are basic and fundamental
concepts for acquiring the know-how needed to plan the work and write the
program code.
2.3 Blendshape deformer
Blendshape deformers enable deforming a NURBS (Non-Uniform Rational B-Splines)
or polygonal object into the shapes of other NURBS or polygonal objects. It is
possible to blend shapes with the same or a different number of NURBS control
vertices. In character setup, a typical use of a blendshape deformer is to set
up poses for facial animation. Unlike the other deformers, the blendshape
deformer has an editor that enables you to control all the blendshape deformers
in the scene. The editor can be used to control the influence of the targets of
each blendshape deformer, create new blendshape deformers, set keys, and so on.
Generally speaking, or in other software packages, what Maya provides with the
blendshape deformer is indicated with terms such as "morph", "morphing" or
"shape interp".
2.3.1 Target and base objects
When creating a blendshape deformer, it is necessary to identify one or more
objects whose shapes will be used to deform the shape of another object. These
objects are called target objects, and the object being deformed is called the
base object.
2.3.2 Target shapes, base shapes, and blend shapes
The shapes of the target objects are called target shapes, or target object shapes. The
base object’s resulting deformed shape is called the blend shape, whereas its original
shape is called the base shape, or base object shape.
2.3.3 Targets
A blendshape deformer includes a keyable attribute (channel) for evaluating each
target object shape's influence on the base object's shape. These attributes are called
targets, though by default they are named after the various target objects. Each target
specifies the influence, or weight, of a given shape independently of the other targets.
Depending on how the blendshape deformer is created or edited, however, a target can
represent the influence of a series of target object shapes instead of just one shape.
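The way targets combine can be pictured as the base shape plus a weighted sum of each target's offset from it. The following is a minimal illustrative sketch in plain Python (not Maya API code), with hypothetical vertex lists standing in for real geometry:

```python
# Illustrative sketch of blend shape evaluation: the deformed result is the
# base shape plus each target's delta from the base, scaled by that target's
# weight (the keyable attribute described above).
def blend_shape(base, targets, weights):
    """base: list of (x, y, z) vertices; targets: vertex lists with the same
    topology as base; weights: one influence value per target."""
    result = []
    for i, (bx, by, bz) in enumerate(base):
        x, y, z = bx, by, bz
        for target, w in zip(targets, weights):
            tx, ty, tz = target[i]
            # each target contributes only its offset from the base
            x += w * (tx - bx)
            y += w * (ty - by)
            z += w * (tz - bz)
        result.append((x, y, z))
    return result
```

With all weights at 0 the base shape is returned unchanged; with one weight at 1 the base fully assumes that target's shape, which matches the behavior of the targets described above.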
2.3.4 Keyframing
Setting keys is the process of creating the keys that specify timing and
motion. Animation is the process of creating and editing the properties of
objects that change over time. Keys are arbitrary markers that designate the
property values of an object at a particular time. Once an object to be
animated is created, it is necessary to set keys that record when the
attributes of that object change during the animation. Setting a key involves
moving to the time where a value for an attribute is to be established, setting
that value, then placing a key there. In effect, it is just like recording a
snapshot of the attribute at that time.
2.3.5 Using keyframing animation
Keyframe animation creates actions from keys set on attributes at various times
(or frames). A key specifies the value of an attribute at a particular time.
Maya interpolates how the attribute changes its value from one key to the next.
Each key specifies a defining characteristic of the action. To better
understand the concept of interpolation, consider the following example.
Imagine a sphere modeled as a ball and suppose its translate attribute on the y
axis is 0 at frame 0, while at frame 10 the same attribute is fixed at a value
of 20; playing the animation, Maya will interpolate the values of this
attribute, moving the ball upward.
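The interpolation Maya performs between keys can be pictured, in its simplest linear form, with a small plain-Python sketch. This is only an illustration of the concept: Maya actually supports several interpolation types, controlled by tangents, and this sketch covers just the linear case.

```python
# Illustrative sketch: linear interpolation between animation keys.
# keys is a sorted list of (frame, value) pairs, e.g. translateY keys.
def interpolate(keys, frame):
    f0, v0 = keys[0]
    if frame <= f0:
        return v0              # before the first key: hold its value
    for f1, v1 in keys[1:]:
        if frame <= f1:
            t = (frame - f0) / (f1 - f0)   # fraction between the two keys
            return v0 + t * (v1 - v0)
        f0, v0 = f1, v1
    return v0                  # after the last key: hold its value

# The ball example above: translateY is 0 at frame 0 and 20 at frame 10.
ball_keys = [(0, 0.0), (10, 20.0)]
print(interpolate(ball_keys, 5))   # halfway through the motion: 10.0
```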
Chapter 3
Lip Synch issues
Without a well-defined methodology, it is very difficult for an animator to
succeed in creating the exact lip movement of a 3D model synchronized with the
recorded audio. In fact, the only strategy he could follow is simply to sit in
front of a mirror and analyze the lip positions while speaking, in order to
connect them into the complex mouth movement. This sort of approach to the lip
synch problem demands a great deal of work from the animator, yields results
strongly affected by the observation capabilities of the artist, and wastes a
lot of time.
For these reasons it was necessary to focus on a precise methodology of lip
synching. Thus a long research was made into how written words correspond to a
sequence of phonemes and how each phoneme is always expressed with the same
mouth shape and tongue position, identifying in this way the fundamental steps
of the lip synch process.
3.1 Lip synch steps
First, a library of character model variations is built once and for all,
including the basic mouth shapes necessary for speech (phonemes) and expressive
variations such as lifted brows, an angry scowl, or a grimace. This is usually
done in the modelling portion of the 3D package.
The next step is to break down the recorded dialog, which is the process of
translating what is heard in the dialog track into a list of facial shapes
that, when run in sequence, will create the illusion that the character is
producing the recorded sound. The exact facial shapes and the keyframe numbers
they will occupy are entered into a timing chart.
Finally, the facial shapes built in the first step are arranged according to
the sequence listed in the timing chart.
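A timing chart can be pictured as a simple list of keyframe/shape pairs. The sketch below is purely illustrative (the frame numbers and viseme names are hypothetical), using the word "food", whose phonetic breakdown is discussed in Section 3.2:

```python
# Hypothetical timing chart: (keyframe, viseme) pairs for the word "food",
# phonetically "F-UH-D", ending on a neutral rest pose.
timing_chart = [
    (1, "F"),      # lower lip against the upper teeth
    (4, "UH"),     # rounded lips for the "oo" sound
    (8, "D"),      # tongue against the palate
    (12, "rest"),  # mouth returns to the neutral pose
]

def shape_at(chart, frame):
    """Return the facial shape active at a given frame: the shape of the
    most recent key at or before that frame."""
    current = chart[0][1]
    for key_frame, viseme in chart:
        if key_frame <= frame:
            current = viseme
    return current
```

Arranging the facial shapes then means stepping through the chart in order and setting a key for each listed shape at its listed frame.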
While it may appear to be a mechanical process, there is a great deal of
creativity in deciding how the character's face will transform throughout the
animation. Lip synch is a part of acting, so the personality of the character
shapes the message being delivered. The spoken words can take on several
meanings depending on the nuances given to the character, such as eye movements
and facial expressions.
The first step in the lip synch process is to understand the foundation, i.e.
phonemes. Phonemes are the most misunderstood aspect of facial animation. Much
of the information available to animators is not correct, making it challenging
to create really dynamic lip synch animation.
3.2 Getting to know phonemes
A phoneme is the smallest part of a grammatical system that distinguishes one
utterance from another in a language or dialect. Basically, it is the sound we
hear in speech patterns. In phonetic speech, words are created by combining
phonemes, rather than the actual letters in the word. For example, in the word
"food", the "oo" sound would be represented by the "UH" phoneme. The phonetic
spelling of the word would be "F-UH-D". It looks a bit odd, but phonemes are
the backbone of speech and therefore paramount to the success of lip synch.
When creating a lip synch animation, the facial movements of the 3D character
are synched to the recorded dialog. When the phonemes are spoken, the mouth
changes shape to form the sound being spoken.
As a matter of fact, phonemes do not actually look like the printed word, but
when spoken they sound identical. The rule for determining whether a unit of
speech is a phoneme is that replacing it in a word results in a change of
meaning. For example, "pin" becomes "bin" when the "p" is replaced, therefore
the "p" is a phoneme.
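The replacement test just described can be sketched in a few lines of Python; the tiny lexicon below is a hypothetical stand-in for a real pronunciation dictionary:

```python
# Illustrative sketch of the minimal-pair test: a unit of speech is a phoneme
# if replacing it in a word produces a different valid word (a change of
# meaning). The lexicon here is a toy example, not real linguistic data.
LEXICON = {"pin", "bin", "tin", "fin"}

def is_phoneme(word, position, replacement, lexicon):
    """True if swapping the unit at `position` yields another word in the
    lexicon, i.e. the swap changes the meaning."""
    candidate = word[:position] + replacement + word[position + 1:]
    return candidate != word and candidate in lexicon
```

With this lexicon, replacing the "p" of "pin" with "b" gives "bin", a different word, so "p" passes the test.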
3.3 Getting to know visual phonemes
Visual phonemes are the mouth positions that represent the sounds heard in
speech. These are the building blocks for lip synch animation. When creating a
3D character lip synch, the starting point is modelling the phonemes. The
important thing to identify is how many visual phonemes there really are. The
common myth is that there are ten visual phonemes. While ten can be used to
create adequate lip synch animation, there are actually sixteen visual
phonemes.
Each visual phoneme is associated with the audible phonemes that require that
specific mouth position; even if some look very similar, a closer inspection
shows that the tongue is in different positions.
The tongue may seem an insignificant element in lip synch, but it is very
important for creating truly realistic dialog. If the tongue movement is
unnatural because too few visual phonemes were used, the animation will look
unrealistic.
A larger number of visual phonemes guarantees more detailed lip movement,
because more phonemes have their own corresponding mouth position; thus the
same viseme, covering fewer phonemes, is less often used more than once within
the same word, avoiding the lip-vibration effect that this may cause.
The choice of using more or fewer visemes, i.e. a schema based on ten or on
sixteen visemes, is therefore very relevant to the degree of realism the
animation aims to achieve. Generally, a small number of visual phonemes is
preferred when the character does not appear in the foreground during the
animation, while more accuracy is necessary for important characters making
long speeches that attract the viewer's attention.
Creating sixteen visual phonemes instead of ten requires a big modeling effort,
not only to generate a greater number of mouth positions, but also because of
the greater precision required.
Visual phonemes are listed below in Figure 3.1, Figure 3.2 and Figure 3.3,
divided into the sixteen- and ten-viseme schemas.
Figure 3.1: First 8 visual phonemes of the 16-viseme list
Figure 3.2: Last 8 visual phonemes of the 16-viseme list
Figure 3.3: Short list of visual phonemes in the 10-viseme schema
In the ten-element list, several of the visual phonemes with similar exterior
appearances have been combined. Of course, the subtle tongue movement will not
be accurate, but that is not always important. There are times when it is
necessary to shorten the visual phoneme list to expedite the editing process;
however, for someone who truly understands phonemes, using the long list does
not take any longer: it only makes the modelling phase a little longer, not the
lip synch process. In fact, once the spoken word is translated into its phoneme
sequence, it is just a matter of selecting the corresponding visual phonemes
according to the adopted schema, without particular calculations.
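Choosing a schema then amounts to a simple lookup from phonemes to visemes. The sketch below illustrates the idea of collapsing a detailed inventory into a shorter one; the mapping itself is hypothetical, the thesis's actual 16- and 10-viseme tables being those of Figures 3.1 to 3.3:

```python
# Illustrative sketch: collapsing a detailed viseme inventory into a shorter
# schema by merging shapes with similar exterior appearance. The entries are
# hypothetical examples, not the thesis's real tables.
COLLAPSE = {
    "F": "F", "V": "F",             # lip-against-teeth shapes merged
    "D": "D", "T": "D", "L": "D",   # tongue-to-palate shapes merged
    "UH": "UH",                     # rounded-lip shape kept as is
}

def to_short_schema(visemes):
    """Map a detailed viseme sequence onto the collapsed schema; unknown
    visemes pass through unchanged."""
    return [COLLAPSE.get(v, v) for v in visemes]
```

Since the collapse happens only at this lookup stage, the breakdown of the dialog itself is identical whichever schema is adopted, which is why the long list costs extra modelling time but no extra lip synch time.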
3.4 The lip synch rules
Creating the illusion that a character is actually speaking the dialog is
challenging, but following some rules of lip synch animation can help.
3.4.1 Record the dialog first
There are two reasons to record the dialog before animating:
• It is far easier to match a character's facial expression to the dialog than
it is to find voice talent that can accurately dub an existing animation.
This avoids spending hours in a recording studio trying to match words
to a preexisting animation.
• The recorded dialog will help determine where the keyframes go in the
animation. Suitable sound editing software will even give a visual
representation of the dialog's shape, making it easier to break down
audio tracks for lip synch animation.
3.4.2 Never animate behind synch
There are occasions when a lip synch will work better if it is actually one or two
frames ahead of the dialog, but you should never try to synch behind the dialog. It is
best to start by animating exactly on synch. Then, if necessary, parts of the animation
can always be moved a frame forward to see if they work better.
3.4.3 Don’t exaggerate
This is another important rule that is often overlooked. The actual range of mouth
movement is fairly limited, and overly pronounced poses will look forced and unnatural.
It is far better to underplay than to overdo. Americans naturally talk in an almost
abbreviated manner, and it is easy to notice that the mouth does not open very much at
all during speech, so the visual phonemes should not be exaggerated.
3.4.4 Rules were made to be broken
Many consonants, and occasionally vowels, are actually pronounced in the transition
between the preceding and the following sounds. As a result, the full pose for that
sound never occurs. Also, as mentioned above, Americans have a habit of abbreviating
their speech. Depending on local dialect, syllables are often slurred or deleted entirely.
So it is fundamental to pay attention to the character's pronunciation and animate it as
the dialog dictates.
The goal is to make the movements of the mouth appear natural and lifelike. If it is
necessary to skip a consonant to maintain the tempo and avoid contorting the mouth, it
is far better to do so than to force the mouth into unnatural poses or lose the
rhythm of the dialog.
3.5 The lip synch process
Using the phoneme chart to translate what is heard in the dialog into audible
phonemes, and then translating those phonemes into a predetermined set of visual
phonemes: these are the fundamental steps of the lip synch process.
The steps of animating lip synch are:
• Break down the speech pattern.
• Analyze the audible dialog and enter the phonemes into the timing
chart.
• Use the timing chart to set the keyframes in the animation.
• Test the animation for synching and tweak the animation where
necessary.
3.5.1 Break down the speech pattern
The first step in lip synch is to determine the speech pattern of the dialog.
Because of the abbreviations and dialect influences in everyday speech, it is often
not possible to properly assign phonemes to a dialog without first converting it into
a rough phonetic translation. This does not refer to the concept of phonemes; it
rather means a transcription of how the dialog sounds, using the normal alphabet.
This is an important element, since it is necessary to assign phonemes to the actual
heard sound, not the seen text. For example, the word "every" is often spoken in a
contracted manner, so when listening to the recorded audio it is important to notice
that and consider the spoken word in its contracted form, "evry". The next chapter
will describe how these matters are handled in Speak U2.
After this procedure of understanding what is really spoken, it is always necessary
to phonetically translate the text before it is actually possible to start assigning
the phonemes.
3.5.2 Analyze the audible dialog to determine phonemes
The first thing to do is to load the audio file into a sound-editor program; any will
do as long as it allows identifying the actual times when sounds occur and relating
them to frames. Another important feature is a scrub tool that lets you drag the audio
forward and backward in order to obtain greater precision.
After loading the audio file, the usual lip synch methodology suggests listening to the
dialog and writing down the phonemes; but, as will be described in the next chapters,
the strategy followed here is to write down the real spoken words, with their
abbreviations or contractions, and not the phonemes themselves, because the algorithm
will perform that translation. After this, by scrubbing the sound back and forth it is
possible to determine the exact location of each sound in the audio file. An audio
tool is thus useful because it also provides a visual representation of the sound,
which helps to determine the points where words, and so phonemes, are recorded.
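Relating the times read in the sound editor to animation frames is a simple conversion at the scene's frame rate. A minimal illustrative helper, not part of the thesis code (the frame rates in the comment are only examples):

```cpp
#include <cassert>
#include <cmath>

// Illustrative helper: convert a time measured in the audio editor to the
// nearest animation frame at a given frame rate (e.g. 24 or 25 fps).
int timeToFrame(double seconds, double fps) {
    return (int)std::lround(seconds * fps);
}
```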
3.5.3 Use timing chart to set keyframes
Once the sounds have been translated into their visual phonemes and a frame
identified for each, it is necessary to morph the visual phonemes from one to another
at the appropriate frames in the animation.
There are two types of morphing, straight and weighted. Straight morphing simply
morphs the object in a linear progression from one object to another. The morph can
take any value from 0 to 100%. The only issue is that it is limited to a single morph
object.
On the other hand, weighted morphing allows blending multiple objects in a single
morph. This is very useful when adding facial expressions and emotions to the lip
synch animation. It is possible to add not only subtle eye blinks but also complete
changes in the character's personality through facial expressions. Of course, this
greatly complicates the animation process, but the end result is well worth the effort.
3.5.4 Tweak the finished animation
Lip-synching is not an exact science and it will almost always require a little
tweaking. This may involve altering certain poses or tinkering with the timing. In
some cases, the synch may appear more realistic if it is a frame or so ahead of the
dialog, so that the mouth is moving as the viewer's brain is processing the sound
rather than when the ear is receiving it. These are subtle things, but lip-synching is
an exercise in subtlety. If the aim is to make the audience suspend disbelief and
accept the character as a living being, it is important not to stop until it actually
seems as if the character is speaking the dialog.
Speaking of tweaking, there are times when it will be necessary to drop a phoneme
to make the animation flow smoothly. Not all letters are pronounced in normal speech,
particularly if an accent is present.
3.5.5 Phoneme dropping guidelines
After obtaining the animation, it is often necessary to drop some phonemes, but
there are some rules to be observed.
Never drop a phoneme at the beginning of a word. It is possible to drop a consonant
at the end of a word but never at the beginning, because that would change the visual
phonetic pronunciation of the word; a consonant at the end of a word, by contrast,
can be dropped without much impact.
Another rule is to drop nasal visual phonemes to smooth transitions. This is an
important issue when animating. The most troublesome nasal phoneme is the "M", since
it requires a closed mouth, which can be a problem given that in reality the mouth
movement is nearly undetectable. It is simply necessary to go through the timing
chart, identify the nasal phonemes, and delete them where their location qualifies.
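These guidelines can be summarized as a small predicate. The following C++ sketch is a hypothetical encoding of the rules above, treating a word as a sequence of phoneme strings; it is not part of Speak U2:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical encoding of the phoneme-dropping guidelines (illustrative):
// never drop the first phoneme of a word, final consonants may be dropped,
// and mid-word the nasals are the preferred candidates.
bool isNasal(const std::string& p) {
    return p == "m" || p == "n" || p == "NG";
}

bool mayDrop(const std::vector<std::string>& word, size_t index) {
    if (index == 0) return false;               // never drop at word start
    if (index == word.size() - 1) return true;  // word-final phonemes may go
    return isNasal(word[index]);                // mid-word: nasals only
}
```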
3.6 Thinking as an animator
Approaching the goal of creating a lip synch plug-in, it is very important above all
to understand who will use it. Speak U2, being a plug-in designed to help in
synching a facial animation, obviously addresses the 3D animator.
What does an animator really want to know, and what does he not need to care
about, when using a piece of software? It is important to consider that an animator is
not a programmer, so he is not familiar with anything that is not a 3D animation
graphics concept, and he really just wants to work with an easy user interface. It is
important to understand what can frustrate the user of a piece of software. For
example, the possibility of asking the animator to collect information himself by
editing a text file, inserting data between tags, was discarded. The concept of a tag,
as in HTML, is a very simple one, but it is just something that could be very annoying
for an animator; moreover, he would have to spend time learning how to build the right
text file on which to apply the algorithm. To solve this problem, the right approach
was to imagine not being a computer engineer, but to put myself on the other side of
the problem. This was an interesting challenge to focus on. As shown in the fourth
chapter, where there is a complete overview of Speak U2, it was necessary to make this
step transparent for the animator, who creates the tagged file through the interface,
simply filling in some fields and pressing some buttons. In this manner, the artist
does not have to edit a file or care about the format of the file itself, and any
changes he wants to apply to a preexisting file are made through the same interface.
The user interface is the means through which the animator will work, so it is
fundamental to give much importance and attention to building it. A 3D artist really
needs a very simple interface, with a workflow clear enough that reading the
instruction guide should not really be necessary, even though one is provided as
online documentation. In that way the animator needs to know just what to do, which
steps he has to follow to reach the result, and how to modify it.
Moreover, we have to consider that a lip synch plug-in is based on particular and
difficult concepts such as visemes and phonemes, which need not burden the animator or
the modeler: once again, they have to know only what they really need. The modeler has
to take care of the sixteen or ten mouth positions in order to create the visual
phonemes that will be mapped to phonemes to create the animation. The animator, on the
other side, does not have to worry about visual phonemes; he only needs to know which
phonemes are associated with each viseme in order to make the map between them. In
the end, neither the animator nor the modeler really needs to know what a phoneme is
or how a written or spoken word translates into a sequence of phonemes.
The figure below represents the workflow and roles during the process.
Figure 3.4 : workflow and roles diagram
[Diagram: within Maya, the modeler models the 16 or 10 mouth positions corresponding
to the 16 or 10 phoneme groups; within audio software, the animator listens to the
recorded audio, writing down the spoken words and their times; within Speak U2, the
animator maps visemes to phonemes, creating the correspondence used by the algorithm,
then inserts the data and executes the algorithm, obtaining the keyframed scene.]
3.7 Text to phoneme
Approaching the project of programming a lip synch plug-in, the need for a text-to-
phoneme algorithm became immediately evident. At first the idea was simply to build a
mouth shape for each alphabetic letter and map the written text directly onto them.
This concept is not as wrong as it seems, but it presents many concrete problems. In
fact, the thought of making something correspond to a mouth shape in order to obtain
lip synchronization is the right one, but that something is not the single letter. It
is evident that making the mouth shape change at the pronunciation of every letter
would create an extremely mechanical lip synch, like that of a robot.
A translation between the written word and its sequence of sounds thus becomes vital
for lip synching in a realistic manner. Analyzing the movement of the mouth in daily
human speech, it becomes clear that the lips move following the pronounced sounds
rather than the written text of the spoken word, and this observation implies the need
to base the lip synch plug-in on an algorithm of translation from text to sound, that
is, from text to phonemes.
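As a toy illustration of rule-based text-to-phoneme translation, the C++ sketch below applies two invented letter-context rules ("ph" sounds like "f", "ee" like the phoneme IY); the actual English rule set used by Speak U2 is far larger, and these rules are examples only:

```cpp
#include <cassert>
#include <string>

// Toy illustration of rule-based text-to-phoneme translation. The two rules
// are invented for the example; real rule sets handle many more contexts.
// Multi-letter rules are tried before falling back to the single letter.
std::string toyTranslate(const std::string& word) {
    std::string out;
    for (size_t i = 0; i < word.size(); ) {
        if (word.compare(i, 2, "ph") == 0)      { out += "f";  i += 2; }
        else if (word.compare(i, 2, "ee") == 0) { out += "IY"; i += 2; }
        else                                    { out += word[i]; i += 1; }
    }
    return out;
}
```

The point of the sketch is that the translation depends on letter context, not on a one-to-one letter mapping, which is exactly why a per-letter mouth shape cannot work.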
3.7.1 Focusing on language
Given the need to implement such an algorithm, it was necessary to take a step back
and decide which language the Speak U2 plug-in should support. Even though Speak U2
will probably be used only for Italian productions and animations, the choice fell
upon the English language because of some important considerations characteristic of
the film-making environment.
First of all, there was a strong preference for developing a lip synch plug-in based
on the English language, to allow the possibility of reaching foreign markets. The
level of Italian computer graphics is so much lower than in other nations that it is
very challenging to introduce our products into foreign contexts, and usually the best
Italian artists, with much experience and capability, have to emigrate to other
countries to fulfill their potential. The American attitude, especially, is to seek
out talented artists in other countries and keep them within their productions, backed
by powerful money and Hollywood prestige.
Unfortunately, there is no real future in Italy for computer graphics artists, so it
was really unthinkable to develop a plug-in for Italian lip synch, even if at this
time it could not be said whether Speak U2 would become known in other countries.
Second, there was a practical reason due to film-making conventions. Studying film
production, it is common to notice that films are made with English-speaking actors,
and that is true even in computer graphics animation, which in this era is growing in
complexity and duration. A significant example of this is Ridley Scott's American film
Hannibal, where Italian actors such as Giancarlo Giannini and Francesca Neri had to
speak English during the takes. Only afterwards is the film dubbed into other
languages, and paradoxically even the Italian actors are dubbed again.
Moreover, the Speak U2 user interface is logically in English simply because it
grows up within an entirely English software package. Giving it an Italian interface
would be in contrast with the philosophy of integration with Maya; on the other hand,
it would be a paradox to give it an English structure while lip synching the Italian
language only, above all because in that case probably only Italian animators would
use it.
For all these reasons, Speak U2 is based on English grammar. This choice imposed
spending a long time understanding the language in order to create a text-to-phoneme
translation algorithm valid for any word. An always-working algorithm was preferred to
a dictionary of words with their corresponding phoneme sequences, because the latter
would require periodic updates and longer CPU time, depending on the search algorithm
implemented.
3.8 Synchronization and artistic freedom
Once the algorithm has generated the phoneme sequences, these must be set at the
right times within the range of the word's pronunciation. This is not as secondary a
job as it could seem, because even if phonemes and visemes are in the right sequence,
if they are not mapped to the right frames the lips will move out of synch with the
recorded audio. Moreover, some phonemes may persist a little longer than others, so
this final step in lip synch is as important as the previous ones. It should also be
considered that an algorithm based on precise calculation can never fully reflect the
complexity and randomness of human speech. The purpose is to obtain near-perfect
synchronization between lip movement and audio with no action other than entering the
spoken word and its start and end times in the user interface. This allows obtaining a
good result with little time and effort, without affecting the freedom and creativity
of the artist, who can later characterize his speaking creature in more detail.
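As a sketch of this final placement step, the helper below spaces n phoneme keys evenly between the start and end frames entered for a word. The even spacing is an illustrative assumption; as noted above, some phonemes persist longer than others, so the real algorithm may weight the durations differently:

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch only: place n phoneme keys evenly between the start
// and end frames entered for a word. Speak U2's actual spacing may differ.
std::vector<int> placeKeys(int startFrame, int endFrame, int n) {
    std::vector<int> frames;
    if (n == 1) { frames.push_back(startFrame); return frames; }
    double step = double(endFrame - startFrame) / (n - 1);
    for (int i = 0; i < n; ++i)
        frames.push_back(startFrame + (int)(i * step + 0.5)); // round to frame
    return frames;
}
```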
Chapter 4
Solutions
The analysis of the lip synch methodology was fundamental to understanding the
workflow needed to achieve this kind of animation. The lip synch process can be
divided into a few fundamental steps. Leaving aside the modeling phase of the 16 or 10
visemes, because it is not really part of the animation process, being an exercise in
modeling, and thus cannot be considered within Speak U2, there are three basic steps.
The first step consists of creating a correspondence between phonemes and
visemes, while the second step covers the phase of entering data concerning the speech
to be synchronized.
The third step is surely the most complex one in its implementation, yet the
easiest one for the animator, being totally automatic: it executes the text-to-phoneme
translation algorithm and sets the blendshape keys at the right times and values.
Even if these are the fundamental steps of lip synching, Speak U2 considers the
workflow in its totality, so it includes two more steps, preview and retouching, even
though they are only logical phases and do not directly concern the animation process.
Chapter 4 - Solutions
34
4.1 Interface realization
In Speak U2, great attention is paid to making the whole lip synch process as easy
as possible, and the user interface reflects this aim. The interface, divided into
five steps, each covered by a specific tab containing all the controls for that
purpose, follows the structure of the workflow.
It was built this way in order to keep clear to the animator at which point of the
process he is and what he has to do.
4.1.1 First step: creating data structure
As illustrated in more depth in the next chapter, dedicated to the user interface and
software description, the goal of the first step is to collect fundamental information
on the scene and to initialise the variables used by the text-to-phoneme translation
algorithm.
Created using MEL, this first phase of the lip synch process lets the user choose
between the schemas based on ten or sixteen visual phonemes; according to this
selection, only one of the two global variables containing the lists of phonemes is
used by the actual algorithm, and its content is displayed in a scrolling list.
The code below presents the two global variables, where it is possible to see the
phonemes grouped by the visual phoneme that covers them.
global string $su2Phonemes[] = {
"m,b,p",
"n,l,t,d",
"f,v",
"TH,DH",
"k,g,NG",
"SH,ZH,CH,j",
"y,OY,YU,w,WW,UH,ER,r",
"IH,EY,EH,AH,AY,AW,AE,AN,h,HH,s,z",
"AA,AO,OW,UW,AX",
"IY"
};
global string $su2Phonemes2[] = {
"m,p,b",
"n",
"l,t,d",
"f,v",
"TH,DH",
"k,g,NG",
"SH,ZH,CH",
"j",
"y",
"WW,UH,ER",
"r",
"IH,EY,EH,AH,AY,AW,AE,AN,h,HH",
"s,z",
"AA,AO",
"OW,UW,AX,OY,YU,w",
"IY"
};
Another scrolling list shows the visual phonemes present in the scene according to
the selected blendshape, making it possible to create a correspondence with the
phonemes. This map is obtained by a procedure call that simply saves the name of the
viseme in an array at the same index as in the phonemes' array. Keeping phonemes and
the corresponding visual phonemes at the same positions in the two arrays makes it
easy to find them without errors.
The following is the code generating the map.
global proc su2Map()
{
    global string $viseme[];
    int $selectedPhonIndex[] = `textScrollList -q -selectIndexedItem phonemeList`;
    string $vis[] = `textScrollList -q -selectItem mapToList`;
    // store the viseme name at the same index as the selected phoneme group
    $viseme[$selectedPhonIndex[0] - 1] = $vis[0];
    print $viseme;
}
The figure below shows a conceptual schema of the workflow for the first step.
[Diagram: choose between the use of 16 or 10 visemes; from the list of blendshapes
present in the scene, select the one corresponding to the visemes; the list of
attributes associated with the viseme blendshape and the list of phonemes, divided
into 16 or 10 groups according to the selection, feed the mapping procedure call.]
Figure 4.1 : First step workflow
4.1.2 Second and third tab: collecting information
The second step focuses on the analysis of the audio track to collect information
on the speech. This is an important point in the algorithmic solution devised for
Speak U2. In fact, asking the user to write down the phoneme sequence of the speech
was considered impracticable, above all for an animator. Understanding phonemes while
listening to recorded audio is not as simple a task as it might seem; moreover, as
said before, the animator, or a 3D artist in general, does not have to know concepts
such as phonemes or English grammar.
Chapter 4 - Solutions
38
For these reasons, the idea was to write down not the phonemes but the real spoken
words, which are immediately understandable; moreover, a script is always provided
with the audio. What the animator really has to do is write down the times, in frame
units, at which the word's pronunciation begins and ends. These are some of the data
the algorithm needs to work on, and they are all entered in this step by filling
dedicated fields after listening to the audio track and scrubbing along it, using
tools present in the Speak U2 interface or within external dedicated audio software.
All the information collected in this phase, which concerns a single word, is saved
and added to a data structure containing information about the entire speech and
displayed in the part of the interface covering the third step. The save procedure
also allows modifying previously saved information on a word. The code below
represents the "save" function:
global proc su2CollectInfo()
{
    string $sWord = `textFieldButtonGrp -q -text startWord`;
    string $eWord = `textFieldButtonGrp -q -text endWord`;
    string $spWord = `textFieldGrp -q -text spoken`;
    string $emotion = `optionMenuGrp -q -value emotionGrp`;
    string $total = $spWord + "\\" + $sWord + "\\" + $eWord + "\\" + $emotion;
    int $many = `textScrollList -q -numberOfSelectedItems totalList`;
    if ( $many == 0)
        textScrollList -edit -append $total totalList;
    else
    {
        int $pos[] = `textScrollList -q -selectIndexedItem totalList`;
        textScrollList -edit -removeIndexedItem $pos[0] totalList;
        textScrollList -edit -appendPosition $pos[0] $total -deselectAll totalList;
    }
    textFieldGrp -edit -text "" spoken;
}
$sWord and $eWord are the variables used to store the start and end times of the
word's pronunciation, while the spoken word is saved in the $spWord variable.
$emotion is meant to hold the emotion affecting the pronunciation; it is a variable
that at this time is not used by the algorithm but will probably be fundamental in
further developments of the plug-in. These items are grouped in a string variable,
separated by a backslash (the MEL escape "\\"), and then displayed in the third tab's
scrolling list. The $many variable checks whether any line of the list collecting
information on the entire speech is selected, and thus the if test decides whether
the data are to be appended at the end of the list or whether they refer to an
existing line that has to be edited and overwritten.
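When the plug-in later reads these lines back, each record has to be split into its four fields again. The C++ sketch below is illustrative rather than the plug-in's actual parsing code; it assumes the separator is the single backslash character stored by the MEL escape "\\" in su2CollectInfo:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Illustrative sketch: split one saved record back into its four fields
// (spoken word, start frame, end frame, emotion), separated by backslashes.
std::vector<std::string> splitRecord(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, '\\'))   // '\\' is a single backslash
        fields.push_back(field);
    return fields;
}
```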
Generally, an audio track for a computer graphics animation can be very long, and it
would be very difficult to complete the analysis of the recorded speech in a single
session. For this reason, it was necessary to include in the user interface covering
the third step a save and a reload function. This job is performed by two procedures
that write to and read from text files, respectively. The saving procedure is the more
complex one because of some annoying characteristics of Maya itself. The code below
represents this function:
global proc su2WriteFile()
{
    global string $txtFile;
    int $fileId;
    string $temp[];
    string $sceneN[];
    string $wrongDir;
    string $correctDir;
    int $j;

    // convert the backslash-separated scene path into a slash-separated one
    $wrongDir = `file -q -sceneName`;
    tokenize ( $wrongDir, "\\", $temp);
    $correctDir = $temp[0];
    for ( $j = 1; $j < size( $temp); $j++)
        $correctDir = $correctDir + "/" + $temp[$j];

    // replace the scene extension (".mb") with ".txt"
    tokenize ( $correctDir, ".", $sceneN);
    $txtFile = ( $sceneN[0] + ".txt");

    $fileId = `fopen $txtFile "w"`;
    string $contents[] = `textScrollList -q -allItems totalList`;
    string $riga;
    string $riga1;
    int $z;
    $riga1 = $contents[0];
    fprint $fileId $riga1;
    for ( $z = 1; $z < size( $contents); $z++)
    {
        $riga = ( "\n" + $contents[$z]);
        fprint $fileId $riga;
    }
    fclose $fileId;
    button -edit -enable true okBtn;
}
The text file generated by the saving function has the same name as the scene the
animator is working on and is stored in the same directory. Unfortunately, the
Alias|Wavefront Maya scene-name query can return directory paths using backslashes,
while MEL commands expect the generally used slashes; this characteristic imposed the
creation of a converting function.
To achieve this result, the $wrongDir variable stores the queried name of the opened
scene, and through tokenizing operations the "\" characters are substituted with "/",
while the ".mb" extension is converted into ".txt". The new path is stored in the
$txtFile variable, which is global because it has to be passed as an argument to the
C++ compiled plug-in. The $contents array stores the list of all the items of the
speech analysis, which are written into the text file one per line.
The load procedure simply performs the inverse process on the text file:
global proc su2LoadFile()
{
    string $works = `workspace -q -fullName`;
    setWorkingDirectory $works "" "";
    fileBrowserDialog -mode 0 -fileCommand "su2UpdateScrollList"
        -fileType "text" -actionName "Load previous saved speech analysis"
        -operationMode "Import";
}
and then updates the list through the su2UpdateScrollList procedure call:
global proc su2UpdateScrollList( string $filename, string $fileType )
{
    int $fileId2;
    $fileId2 = `fopen $filename "r"`;
    string $line;
    $line = `fgetline $fileId2`;
    textScrollList -edit -removeAll totalList;
    while ( size( $line ) > 0 )
    {
        if ( $line != "\n")
            textScrollList -edit -append $line totalList;
        $line = `fgetline $fileId2`;
    }
    fclose $fileId2;
}
The variables $filename and $fileType are default arguments of the
fileBrowserDialog command. Figure 4.2 shows a conceptual schema of the workflow for
the second and third steps.
[Diagram. Second step: listen to the recorded audio; enter the times and the spoken
word; save the information on the current word. Third step: list of the information
collected on all the words analyzed so far; load the list from a txt file; save the
list to a txt file; edit the value of a selected line; start the algorithm execution.]
Figure 4.2 : workflow schema of steps two and three
4.2 Interaction between MEL and API
The third step also provides the final call to the plug-in command, which processes
all the data and sets the keyframes for the animation. The code below belongs to the
MEL procedure associated with this call:
global proc su2DoSynch()
{
global string $su2Phonemes[];
global string $su2Phonemes2[];
global string $viseme[];
global string $txtFile;
global string $parameter;
global string $numvis;
global int $dominance;
string $par1, $par2, $par3, $par4, $par5, $par6, $par7, $par8,
$par9, $par10;
string $par11, $par12, $par13, $par14, $par15, $par16, $par17,
$par18, $par19, $par20;
string $par21, $par22, $par23, $par24, $par25, $par26, $par27,
$par28, $par29, $par30, $par31, $par32;
if ( $numvis == "dieci")
{
$par1 = $su2Phonemes[0];
$par2 = $su2Phonemes[1];
$par3 = $su2Phonemes[2];
$par4 = $su2Phonemes[3];
$par5 = $su2Phonemes[4];
$par6 = $su2Phonemes[5];
$par7 = $su2Phonemes[6];
$par8 = $su2Phonemes[7];
$par9 = $su2Phonemes[8];
$par10 = $su2Phonemes[9];
$par11 = $viseme[0];
$par12 = $viseme[1];
$par13 = $viseme[2];
$par14 = $viseme[3];
$par15 = $viseme[4];
$par16 = $viseme[5];
$par17 = $viseme[6];
$par18 = $viseme[7];
$par19 = $viseme[8];
$par20 = $viseme[9];
su2LipSynch $txtFile $numvis $par1 $par2 $par3 $par4 $par5
$par6 $par7 $par8 $par9 $par10 $par11 $par12 $par13 $par14
$par15 $par16 $par17 $par18 $par19 $par20 $parameter
$dominance;
}
else
[.....]
}
As the code shows, many variables are global strings because their values are
obtained by other procedures that query them through the user interface. The actual
call to the plug-in command is "su2LipSynch", followed by all the needed arguments,
as explained in the table below:
Variable name    Meaning
$txtFile         Name of the file storing all data about the analyzed speech
$numvis          Number of visemes to be used, 16 or 10
$par             Element of the phoneme and viseme arrays
$parameter       Name of the blendshape of the visual phonemes
$dominance       Value at which to set the keys
Table 4.1 : Variable meanings
This procedure may seem implemented in a terrible way, but unfortunately it has to
be considered that MEL is just a scripting language and, above all, that the Maya API
classes are really poor at parsing arguments. In fact, there is no way to pass a
string array argument directly to the C++ code. This unexpected problem was solved by
passing each element of the array as a single argument, assigning it a new $par
variable, and rebuilding the array within the C++ code. Of course, it was not strictly
necessary to use so many variables, but this choice was made to have a clearer
correspondence between the MEL and the API code. The code below shows the C++
argument-parsing code, dual to the MEL code that passes the arguments:
[…]
par_file = args.asString( 0 );
fname = par_file.asChar();
fp1 = fopen( fname, "r");
par1 = args.asString( 1 );
strcpy( based , par1.asChar());
if ( strcmp( based, "dieci") == 0)
{
for (i = 2; i <= 11; i++)
{
par1 = args.asString( i );
p = i -2;
strcpy( parPhoneme[p], par1.asChar());
}
for ( i = 12; i <= 21; i++)
{
par1 = args.asString( i );
p = i - 12;
strcpy( parViseme[p], par1.asChar());
}
[...]
par11 = args.asString( 22 );
blendStep = args.asInt( 23 );
}
The par_file variable contains the name of the txt file passed as the first argument
and is used to open the file and process its contents, the analyzed speech. The
second argument is assigned to the based variable and represents the number of
visemes and phonemes used. This number is fundamental not only for the animation
itself but also to know the size of the arrays. The next two for loops rebuild the
phoneme and viseme arrays that form the main data structure of the algorithm.
The last variable, blendStep, needs a specific discussion. While variables such as
par1, based or par_file are all strings, and the Maya API provides an asString method
in the MArgList class, paradoxically an asFloat method is altogether absent. This
deficiency creates a disconcerting problem, because it was not possible to pass the
value directly as a float type, as originally intended. The value is obtained from
the graphical user interface through a slider.
The first idea was to create a float slider, since the value of a blendshape is a
floating-point number between 0.0 and 1.0; but, given the impossibility of getting a
float value in the C++ API, this simple idea had to be twisted. The only way to
preserve the intuitive slider in the interface and still pass its value to the C++
code was to have the user set the percentage value of the blendshape, pass it as an
int to the API, and create an appropriate procedure to convert the integer value into
a float one.
This conversion is simply made by a division using the standard C library div_t
type; it is then necessary to construct the right MEL sequence of characters for the
command to be executed. The results of the division are the quotient and its
remainder: the percentage value indicated by the user through the graphical interface
is divided by 100, and its quotient and remainder are joined by the decimal point to
obtain a real floating-point number. This is clearer in the code below:
div_t step;
step = div( blendStep, 100 );
cmdAttr2 = cmdAttr2 + parViseme[x] + "\" " + step.quot + "." + step.rem;
MGlobal::executeCommand( cmdAttr2, true, false);
It is possible to see that it was very annoying to build up the entire MEL string, even adding the "." to separate the integer part of the number from its decimal part. For clarity, executeCommand is a method of the Maya API class MGlobal that allows executing MEL commands from C++ code.
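The integer-to-string workaround can be reproduced as a small standalone sketch. The function name pctToMelValue is hypothetical; in the plug-in the quotient and remainder are concatenated directly into the MEL command string:

```cpp
#include <cstdlib>   // div, div_t
#include <string>

// Hypothetical sketch of the percentage workaround: the slider passes an
// integer percentage; dividing by 100 yields a quotient and a remainder,
// which are joined with a "." to spell out the floating value that will
// appear inside the MEL setAttr command.
std::string pctToMelValue(int blendStep) {
    div_t step = div(blendStep, 100);
    return std::to_string(step.quot) + "." + std::to_string(step.rem);
}
```

For a slider value of 50 this yields the string "0.50", which is then appended to the setAttr command string.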
In spite of all these observations, it was not so complicated to solve these problems; but the documentation is too poor to allow deep learning of Maya programming and interaction, and there are some surprising deficiencies for a software that aims to excel in its possibilities of being programmed and extended by professional users.
4.3 Algorithm
Speak U2 is based on a linear but complex algorithm composed of two main parts. The first part handles the translation from text to phonemes of English grammar, while the second concerns the selection of which visual phonemes are of interest for building the map, and the calculation of the times at which to set keys for the interested visual phonemes.
The code lines about parsing arguments seen before are all inserted in the first part of the C++ code, which has to follow a precise structure in order to be loaded within Maya as a command plug-in. In fact, a plug-in has to implement some mandatory methods to be initialized within Maya. These methods are initializePlugin and its dual uninitializePlugin, which have a standard structure simply to make Maya know the names of the newly generated commands:
MStatus initializePlugin( MObject obj )
{
    MStatus status;
    MFnPlugin plugin( obj, "Speak U 2", "4.0", "Any");
    status = plugin.registerCommand( "su2LipSynch", su2LipSynch::creator );
    if (!status) {
        status.perror("registerCommand");
        return status;
    }
    return status;
}
The first parameter in the plugin.registerCommand call is the name attributed to the new command, while the second is the name of another obligatory method of the MPxCommand class, from which the su2LipSynch class of the Speak U2 plug-in inherits, and which simply returns a new instance of the su2LipSynch class. The uninitializePlugin method is just the dual of initializePlugin and differs only in the call to the plugin.deregisterCommand method.
The most important method of the base class is the doIt method, which is a sort of main of a C program in the sense that, as its name suggests, it invokes all the other methods following the workflow of the program. Thus in the doIt method, besides the argument parsing that rebuilds the data structure, there is also the analysis of the text file resulting from the use of the interface and containing all data about the spoken words and their times of pronunciation. As seen before, all the information is collected in a data structure where each line refers to a single word; a slash separates the spoken word from its beginning and ending frame times, thus the decision was to generate three separate files, each one about a single datum. In this way the algorithm can work on a text file containing the complete list of spoken words, each one blank-separated from the others, another file containing the frame numbers at which each word begins, and another with the frames at which each word ends. These files are named respectively "intermedio_parole.txt", "intermedio_start.txt" and "intermedio_end.txt" and are opened in write mode in the doIt method, where they are generated, and then opened in read mode by the other methods that use their contents.
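The slash-separated line format can be illustrated with a small standalone parser; the names splitLine and WordTiming are assumptions for illustration, since the plug-in actually writes the three fields directly into the three intermediate files:

```cpp
#include <sstream>
#include <string>

// Hypothetical sketch: split one line of the analysed-speech file,
// "word/startFrame/endFrame", into the three fields that the doIt method
// routes to intermedio_parole.txt, intermedio_start.txt and
// intermedio_end.txt respectively.
struct WordTiming {
    std::string word;
    int start;
    int end;
};

WordTiming splitLine(const std::string& line) {
    WordTiming w;
    std::istringstream in(line);
    std::string field;
    std::getline(in, w.word, '/');   // spoken word, up to the first slash
    std::getline(in, field, '/');    // starting frame
    w.start = std::stoi(field);
    std::getline(in, field);         // ending frame, rest of the line
    w.end = std::stoi(field);
    return w;
}
```

A line such as "hello/10/25" thus decomposes into the word "hello", starting frame 10 and ending frame 25.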
4.3.1 Text to phonemes translation
The text-to-phoneme algorithm has a simple structure even if it is made up of many steps. The main process is to check, for each letter of the spoken word, which one comes before and which comes after, according to English grammar. Thus there was the need to create a file containing rules for the automatic translation of English into phonetics. Rules are made up of four parts: the left context, the text to match, the right context and finally the phonemes to substitute for the matched text.
The algorithm starts reading from "intermedio_parole.txt" until it finds the end of file EOF. For each word, it proceeds by separating each block of letters and adding a blank on each side. After selecting the word to work on, it gets the starting frame and computes the duration of the pronunciation by subtracting it from the ending frame value. This information is used in the second part of the algorithm, where it decides at which time to set the appropriate viseme keyframe. These actions of dividing each word, putting a starting and ending space, and determining the time and duration of the word are all included in the haveLetter method. This method is called any time the character read from the "intermedio_parole.txt" file is a letter; it prepares an array containing the word for the translation, terminating it with the symbol "\0", which is interpreted as a string end, to pass it to the translateWord method.
During the text-to-phonemes translation it is also necessary to count the number of phonemes that compose the analysed word, because this datum is also useful and indispensable in the second part of the algorithm. This task is assigned to the translateWord method, which has a counter that is incremented any time a new phoneme is translated from the word. In fact, within this method, while the terminating character is not found, each letter of the word is passed to the findRule method which, for each letter in the word, looks through the rules whose text to match starts with that letter. If the text to match is found and the right and left context patterns also match, it outputs the phonemes for that rule and skips to the next unmatched letter. Below is a rule example:
static Rule A_rules[] =
{
    {Anything, "A", Nothing, "AX"},
    {Nothing, "ARE", Nothing, "AAr"},
    {Nothing, "AR", "O", "AXr"},
    {Anything, "AR", "#", "EHr"},
    {"^", "AS", "#", "EYs"},
    {Anything, "A", "WA", "AX"},
    {Anything, "AW", Anything, "AO"},
    {" :", "ANY", Anything, "EHnIY"},
    {Anything, "A", "^+#", "EY"},
    {"#:", "ALLY", Anything, "AXlIY"},
    {Nothing, "AL", "#", "AXl"},
    {Anything, "AGAIN", Anything, "AXgEHn"},
    {"#:", "AG", "E", "IHj"},
    {Anything, "A", "^+:#", "AE"},
    {" :", "A", "^+ ", "EY"},
    {Anything, "A", "^%", "EY"},
    {Nothing, "ARR", Anything, "AXr"},
    {Anything, "ARR", Anything, "AEr"},
    {" :", "AR", Nothing, "AAr"},
    {Anything, "AR", Nothing, "ER"},
    {Anything, "AR", Anything, "AAr"},
    {Anything, "AIR", Anything, "EHr"},
    {Anything, "AI", Anything, "EY"},
    {Anything, "AY", Anything, "EY"},
    {Anything, "AU", Anything, "AO"},
    {"#:", "AL", Nothing, "AXl"},
    {"#:", "ALS", Nothing, "AXlz"},
    {Anything, "ALK", Anything, "AOk"},
    {Anything, "AL", "^", "AOl"},
    {" :", "ABLE", Anything, "EYbAXl"},
    {Anything, "ABLE", Anything, "AXbAXl"},
    {Anything, "ANG", "+", "EYnj"},
    {Anything, "A", Anything, "AE"},
    {Anything, 0, Anything, Silent},
};
The four elements in each row correspond respectively to the left part, the match part, the right part and the output part. The output part is exactly the one or more phonemes that translate that word or part of a word. As seen in the A-rules example, there are some special characters used by the algorithm to signal particular situations that can occur in words. These special characters are:
Symbol   Meaning
#        One or more vowels
:        Zero or more consonants
^        One consonant
.        B, D, V, G, J, L, M, N, R, W, Z
+        E, I or Y
%        Error: bad character
Table 4.2 : special characters used for the left match
Symbol   Meaning
#        One or more vowels
:        Zero or more consonants
^        One consonant
.        B, D, V, G, J, L, M, N, R, W, Z
+        E, I or Y
%        ER, E, ES, ED, ING, ELY
Table 4.3 : special characters used for the right match
The table below shows the 44 American English phonemes used for the
translation:
Phoneme Example Phoneme Example
IY bEEt IH bIt
EY gAte EH gEt
AE fAt AA fAther
AO lAWn OW lOne
UH fUll UW fOOl
ER mURdER AX About
AH bUt AY hIde
AW hOW OY tOY
p Pack b Back
t Time d Dime
k Coat g Goat
f Fault v Vault
TH eTHer DH eiTHer
s Sue z Zoo
SH leaSH ZH leiSure
HH How m suM
n suN NG suNG
l Laugh w Wear
y Young r Rate
CH CHar j Jar
WH WHere
Table 4.4 : American English phonemes
Using this data structure, which resides in a separate file named regole.cpp containing rules for every letter of the English alphabet, the findRule method figures out the right phoneme translation for the word, matching what precedes and follows the letter in the word against the rule for that letter and delegating this job to the special methods leftMatch and rightMatch. The method then returns the phonemes and passes them to the translateWord method, which is the right place to maintain the counter for the number of phonemes composing a word.
The leftMatch method, like the rightMatch method that is its dual for the right part of the match, is made up of a long switch structure that discerns groups of letters found on the left part of the match. The switch cases are the special characters such as + and # listed in table 4.2, because this allows a single character to stand for many letters. There is a counter that is initialised to the pattern length and decremented any time a good match with the context is found. The leftMatch method returns TRUE or FALSE depending on whether all the letters of the pattern match and, if successful, in the findRule method the right phoneme is obtained using the fourth field of the static rule structure: output = (*rule)[3];. At this point, within the findRule method, the algorithm has produced one of the many phonemes composing the entire word, thus it is necessary to store it in a list of that word's phonemes.
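A minimal sketch of the context-matching idea can be written standalone, handling only the '#' (one or more vowels) and '^' (one consonant) symbols; the name rightMatchSketch is hypothetical, and the real leftMatch/rightMatch methods cover the full symbol tables 4.2 and 4.3:

```cpp
#include <cstring>
#include <string>

// Hypothetical sketch of rightMatch: walk the pattern against the context
// to the right of the matched text. Only '#' (one or more vowels) and
// '^' (exactly one consonant) are handled here; literal letters must match.
static bool isVowel(char c) { return std::strchr("AEIOU", c) != nullptr; }

bool rightMatchSketch(const std::string& pattern, const std::string& context) {
    size_t ci = 0;
    for (char p : pattern) {
        if (p == '#') {                       // one or more vowels
            if (ci >= context.size() || !isVowel(context[ci])) return false;
            while (ci < context.size() && isVowel(context[ci])) ++ci;
        } else if (p == '^') {                // exactly one consonant
            if (ci >= context.size() || isVowel(context[ci])) return false;
            ++ci;
        } else {                              // literal character
            if (ci >= context.size() || context[ci] != p) return false;
            ++ci;
        }
    }
    return true;                              // the whole pattern matched
}
```

With pattern "#" and context "ELLO", the leading vowel E satisfies the match; with context "LLO" it fails, which is the behaviour the prose describes.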
This new task is assigned to the workPhoneme method, which tries to understand which phonemes were selected. Looking at the example rules for the letter A shown before, it is possible to see that sometimes the output part of that data structure is not made of a single phoneme. In fact, some outputs are single phonemes but others are sequences of them, and this difference is signalled by the fact that phonemes translating vowels are written in upper-case letters while consonant phonemes are written in lower-case. Moreover, vowel phonemes are all made up of two letters while consonant ones are made of a single letter.
Because of these features it is quite simple to discover the situations in which the given output is in reality more than one phoneme. The workPhoneme method determines these cases by analysing the length of the char variable resulting from the findRule procedure and containing the phonemes: if that length is less than three, then it is a single phoneme; otherwise it is a sequence of vowel and consonant phonemes that are discerned by considering their case. After incrementing a phoneme counter, all phonemes are put in a single cell of an array that is then passed to the findPhoneme method.
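The case-based splitting performed by workPhoneme can be sketched standalone; splitPhonemes is a hypothetical name, and the sketch follows exactly the convention stated in the text (two upper-case letters per vowel phoneme, one lower-case letter per consonant phoneme):

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical sketch of the workPhoneme splitting rule: vowel phonemes are
// written as two upper-case letters, consonant phonemes as a single
// lower-case letter, so a compound output such as "AXgEHn" decomposes by
// looking at the case of each character.
std::vector<std::string> splitPhonemes(const std::string& output) {
    std::vector<std::string> result;
    // an output shorter than three characters is already a single phoneme
    if (output.size() < 3) {
        result.push_back(output);
        return result;
    }
    for (size_t i = 0; i < output.size(); ) {
        if (std::isupper(static_cast<unsigned char>(output[i]))) {
            result.push_back(output.substr(i, 2));  // two-letter vowel
            i += 2;
        } else {
            result.push_back(output.substr(i, 1));  // one-letter consonant
            i += 1;
        }
    }
    return result;
}
```

The output "AXgEHn" of the AGAIN rule thus splits into AX, g, EH, n, matching the four phonemes the text describes.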
The findPhoneme method finally looks for each of the phonemes contained in the array received as argument in the main phoneme array initialised at the beginning in the doIt method while rebuilding the data structure passed by the MEL code. Processing each element of the phonemes array, which contains grouped phonemes separated by commas, the findPhoneme procedure searches for the matching phoneme and, after finding it, calls the findViseme method, which simply gets the name of the associated viseme at the same index of the visual phonemes array. The viseme name is then copied into another string array named wordViseme, at the position indicated by a counter that starts from 0 and is incremented within the findViseme procedure itself, in order to obtain a sequential memorization. This string array wordViseme is thus used to contain the visemes to be keyframed for the current word, in the order corresponding to the list of translated phonemes. This memorization is indispensable because at this time only a part of the word is translated, thus it is necessary to maintain the order of the visemes that will be keyframed later in the algorithm.
From findViseme and findPhoneme, the process returns to the workPhoneme procedure and then directly to the findRule method, which is called in a loop within the translateWord function until the string-terminating character.
At this point everything concerning a single word is known: not only its translation into phonemes but also their number and the visemes used to cover them; thus the algorithm enters its second phase to compute the times at which to generate keys for the animation and to set them.
In order to really understand what has happened so far during the process of converting letters to sounds, it can be useful to follow all the steps for a simple word.
Analysing the word "IN", for example, it is soon discovered that this word is in reality just one of those words whose translation into several phonemes is directly present within the rules of the corresponding alphabetical letter. The findRule procedure has to check the left and right parts just once, because the word is exactly " IN " and not a longer one containing it; thus the right and left matches are blank spaces, which correspond to a return condition within leftMatch or rightMatch because there is nothing to be converted. In fact, looking at the first few lines of the I_rules, it is possible to see that the word IN is present in its entirety among the rules:
static Rule I_rules[] =
{
    {Nothing, "IN", Anything, "IHn" },
thus this word directly has its sequence of phonemes, IHn, as its translation: the rule applies if the text is preceded and followed by a blank, which means it really is the word IN and not the sequence IN within a longer word.
To better follow the translation steps it can be interesting to analyse another word, such as "HELLO". In translateWord the letter whose rule has to be searched in the rules file is detected. The first letter is an "H", and then the findRule method is called on the H_rules, shown below:
static Rule H_rules[] =
{
    {Nothing, "HAV", Anything, "hAEv"},
    {Nothing, "HERE", Anything, "hIYr"},
    {Nothing, "HOUR", Anything, "AWER"},
    {Anything, "HOW", Anything, "hAW"},
    {Anything, "H", "#", "h"},
    […]
Within the findRule method the match variable gets each time the value of the match in the H_rules, that is the second column; thus the first time it gets "HAV". At this point it compares the word to be translated with this match, discovering that the A doesn't match the E of "HELLO". The algorithm goes on considering the next match of the rules and finds "HERE". The loop breaks because of the mismatch between the "R" and the "L" of "HELLO", and thus it goes on to the next match. Finally it reaches the match "H" and calls the leftMatch and rightMatch procedures on the "Anything" and "#" patterns. The leftMatch method returns almost immediately, because the "H" is the first letter of the word, so anything matches. The rightMatch method instead gets as context the word "ELLO" and as pattern the "#" symbol, which means one or more vowels as indicated in table 4.3. The "E" in "ELLO" satisfies the match, thus the context is advanced and the test is executed again; but no other vowels follow in the remaining context "LLO", thus rightMatch returns and the output variable gets the translated phoneme, which is "h". This is the process followed for each letter in the word. The findRule code is shown below:
int su2LipSynch::findRule( char word[], int index, Rule rules[])
{
    […]
    for (;;)
    {
        rule = rules++;
        match = (*rule)[1];
        […]
        for ( remainder = index; *match != '\0'; match++, remainder++)
        {
            if (*match != word[remainder])
                break;
        }
        if ( *match != '\0')
            continue;
        left = (*rule)[0];
        right = (*rule)[2];
        if ( !leftMatch( left, &word[index-1]))
            continue;
        if ( !rightMatch( right, &word[remainder]))
            continue;
        […]
This is part of the findRule code, where the innermost cycle performs the comparison between the word to be translated and the match found in the rule of the letter; then the left and right match procedures are called.
4.3.2 Keyframing
The last step of translateWord is to call the makeKeyframes method. It is a little complicated, because there are many operations to be performed, and in different ways according to the selected schema. The general purpose of this method is to set the key for the right visual phoneme at the right time and at the right value. As stated in the second chapter, setting a key means fixing the value of a precise attribute at a precise frame; thus in this case it means setting the blendshape's value for the interested visual phoneme, where a blendshape is that particular Maya deformation that morphs between two positions, creating mouth poses.
Like other methods seen before, makeKeyframes too has a dual structure for the ten and sixteen schemata, changing only the array dimension and the loop length. Thus studying the portion of the keyframing algorithm regarding the adoption of 10 visual phonemes is sufficient to understand the complete process.
This method makes large use of an API class, MGlobal, because this class implements a method called executeCommand which allows executing any MEL command from a C++ call. Obviously, even if this is easier thanks to the simple syntax of the MEL language, it also restricts the program to the possibilities of the MEL script itself; thus from the C++ code it is necessary to build up exactly the same command that would be given within MEL, but inside a constructed string variable, and then execute that string as a command. This often obliges concatenating strings with inverted commas, commas or points to obtain the precise sequence of a MEL command, as shown below:
MString cmdAttr = "setAttr \"" + par11 + ".";
MString cmdKey = "setKeyframe " + par11 + ".";
MString cmdTime = "currentTime -edit ";
cmdAttr2 = cmdAttr;
cmdKey2 = cmdKey;
cmdAttr2 = cmdAttr2 + parViseme[j] + "\" " + nostep;
MGlobal::executeCommand( cmdAttr2, true, false);
cmdKey2 = cmdKey2 + parViseme[j];
MGlobal::executeCommand( cmdKey2, true, false);
cmdTime2 = cmdTime + tempo;
MGlobal::executeCommand( cmdTime2, true, false);
These are the MEL commands initialised by the C++ code, where the par11 variable contains the name of the blendshape under which all visemes are created. In fact, the real MEL code is something like this:
setAttr "visemi.viseme1" 0.5;
setKeyframe visemi.viseme1;
The first line sets the attribute named viseme1 of the group of blendshapes named visemi to the value 0.5, while the second command creates the key for the viseme1 attribute, according to the value set before, at the current time in the timeslider. These command strings are passed as parameters to the MGlobal::executeCommand method in order to have Maya execute them.
The MString variables cmdAttr and cmdKey store the part of the command string that never changes; any time a new command is necessary, they are copied into the working strings cmdAttr2 and cmdKey2 to obtain the right command each time.
The same explanation applies to the cmdTime and cmdTime2 variables, which are used to set and change the time at which to save the key for the animation.
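The string-assembly pattern can be mirrored with std::string outside Maya; the blendshape and viseme names are examples, and in the plug-in the resulting strings are passed to MGlobal::executeCommand rather than returned:

```cpp
#include <string>

// Hypothetical sketch of the MEL command assembly: the invariant prefix is
// kept in one string, and a fresh command is produced by appending the
// viseme name and the value, quoting the attribute as MEL requires.
std::string buildSetAttr(const std::string& blendShape,
                         const std::string& viseme,
                         const std::string& value) {
    // e.g. setAttr "visemi.viseme1" 0.5
    return "setAttr \"" + blendShape + "." + viseme + "\" " + value;
}

std::string buildSetKeyframe(const std::string& blendShape,
                             const std::string& viseme) {
    // e.g. setKeyframe visemi.viseme1
    return "setKeyframe " + blendShape + "." + viseme;
}
```

Keeping the prefix separate from the per-command suffix is exactly the role played by the cmdAttr/cmdAttr2 pair in the plug-in.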
At the start of this part of the algorithm, the known data are the number of phonemes, and hence of visual phonemes to be keyed, and the duration, expressed in frames, of the word's pronunciation, representing the time within which all keys for the current word are to be set. Thus the first operation of this part of the lip synch process is to divide the duration time of the word by the number of visemes, and so assign a regular and uniform time of persistence to each viseme in the animation. Actually, the real study of phonetics suggests that not all phonemes have the same importance in the pronunciation of a word, such as the "t" or "d" sounds. Moreover, it is not true that these differences in sound dominance can be collected into a rigid rule, because not all words sound the same way even if the same phoneme composes them.
For this reason, Speak U2 makes a standard attribution of time, giving the same uniform time to each phoneme translating a word, and then invites the animator to correct or modify some keyframes' times where necessary.
Thinking about the synch of the first word in a speech, it is evident that every viseme value has to be keyed at zero because of Maya's own interpolation. In fact, Maya automatically interpolates during the animation when the same attribute has keys at different values in different frames; the gap between the two times is covered by Maya itself, which interpolates from the value of the first key to the second. Thus for the first word of a speech, or for a word pronounced after a pause, it is important to set all the viseme keys to zero; but for all other words during the speech it could be very dangerous to perform this action.
In fact, the makeKeyframes algorithm, knowing the number of phonemes composing a word and the total time of the word's duration, assigns the same number of frames to each phoneme; the zero key has to be set at the starting pronunciation frame minus this uniform quantity, because at the real beginning of the word the keys must already be at a value different from zero in order to get the mouth open. Thus, if this is not a word with a pause before it, setting keys to zero could damage the ending of the previous word.
Because of these motivations, the algorithm has two different behaviours according to the position of the word in the speech. This control is made by an if test that verifies whether the time at which it is necessary to put the first key falls before the end of the previous word. If this happens, keys are not set to zero; only the interested viseme key is set to a value that is half of the value at which the animation will be built. A half value allows obtaining a smooth transition between the two phonemes without closing the mouth. The code below makes this concept easier to understand:
parity = div( durata , total);
forcing = value - parity.quot;
Parity and forcing are two variables used for this control: parity stores the uniform number of frames assigned to the visemes, while forcing is the time at which it is necessary to put the keys at zero to obtain the interpolation. Last is the variable that stores the last frame at which the previous word ends.
As can be seen from the code below, if the forcing variable is greater than or equal to the last variable, it is possible to set keys at zero because this doesn't affect the previous word's keyframing, and so all visemes are keyed at zero at the time equal to the forcing variable's value:
if ( forcing > last || forcing == last)
{
    cmdTime2 = cmdTime + forcing;
    MGlobal::executeCommand( cmdTime2, true, false);
    cmdAttr2 = cmdAttr; /* to clear cmdAttr2 every time */
    cmdKey2 = cmdKey;
    […]
    if ( strcmp ( based, "dieci" ) == 0 )
    {
        for ( j = 0; j < 10; j++)
        {
            cmdAttr2 = cmdAttr2 + parViseme[j] + "\" " + nostep;
            MGlobal::executeCommand( cmdAttr2, true, false);
            cmdKey2 = cmdKey2 + parViseme[j];
            MGlobal::executeCommand( cmdKey2, true, false);
            cmdAttr2 = cmdAttr;
            cmdKey2 = cmdKey;
        }
On the contrary, if the word does not have sufficient space from the previous one, a correction is applied to the viseme corresponding to the first phoneme of that word:
if ( strcmp( wordViseme[0], parViseme[j]) == 0 )
{
    cmdAttr2 = cmdAttr2 + parViseme[j] + "\" ." + correction;
where correction is half of the blendshape value set for the animation.
Moreover, in case the division of the duration time by the number of phonemes is not exact, one more frame is assigned to the first viseme, because the start of a word is the most important part of the lip synch.
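The uniform time attribution, including the extra frames for the first viseme when the division is not exact, can be sketched as follows; framesPerViseme is a hypothetical helper around the same div call used in the plug-in, and giving the whole remainder to the first viseme is an assumption generalising the "one more frame" rule of the text:

```cpp
#include <cstdlib>   // div, div_t
#include <vector>

// Hypothetical sketch of the uniform timing rule: the word duration (in
// frames) is divided by the number of phonemes; each viseme gets the
// quotient, and any leftover frames go to the first viseme, since the
// start of a word matters most in the lip synch.
std::vector<int> framesPerViseme(int durata, int total) {
    div_t parity = div(durata, total);
    std::vector<int> frames(total, parity.quot);
    frames[0] += parity.rem;   // remainder frames to the first viseme
    return frames;
}
```

A 13-frame word with 4 phonemes thus yields 4 frames for the first viseme and 3 for each of the others.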
For the mapping of phonemes in intermediate positions, the algorithm gets the name of the next viseme to be keyed and looks for it in the visemes array; any time it encounters a visual phoneme with a different name, it means it is not the viseme to be keyed at that time and its value is set to zero; when it finds the viseme it is looking for, it makes a key for it at the value decided at the beginning through the blendshape dominance parameter. This is performed by this code:
for ( h = 0; h < total; h++)
{
    cmdAttr2 = cmdAttr;
    cmdKey2 = cmdKey;
    tempo = tempo + parity.quot;
    for ( x = 0; x < 10; x++)
    {
        if ( strcmp( wordViseme[h], parViseme[x]) == 0)
        {
            cmdAttr2 = cmdAttr2 + parViseme[x] + "\" " + step.quot + "." + step.rem;
            MGlobal::executeCommand( cmdAttr2, true, false);
            cmdKey2 = cmdKey2 + parViseme[x];
            MGlobal::executeCommand( cmdKey2, true, false);
            cmdAttr2 = cmdAttr;
            cmdKey2 = cmdKey;
        }
        else
        {
            cmdAttr2 = cmdAttr2 + parViseme[x] + "\" " + nostep;
            MGlobal::executeCommand( cmdAttr2, true, false);
            cmdKey2 = cmdKey2 + parViseme[x];
            MGlobal::executeCommand( cmdKey2, true, false);
            cmdAttr2 = cmdAttr;
            cmdKey2 = cmdKey;
        }
    }
    cmdTime2 = cmdTime + tempo;
    MGlobal::executeCommand( cmdTime2, true, false);
}
The outermost cycle loops over the number of phonemes into which the word was translated, while the internal for loop controls the value at which the blendshape has to be set, according to whether it is the viseme to be mapped at that time or not. These different values are set using the variable nostep, which stands for a zero key, or step.quot + "." + step.rem, which translates for example to 0.5.
This code is valid for the ten-viseme schema, but a specular version of it refers to the sixteen-based schema.
Chapter 5
Software description
The Speak U2 interface is the result of a careful analysis of the lip synch methodology and of the understanding of its fundamental steps. For this reason, using the MEL scripting language as stated in the previous chapter, the user interface was designed to reflect the workflow of the lip synch process, in order to itself be a guide to the animator in his work.
Speak U2 was developed with a specific target user in mind, such as an animator or a video postproduction firm, but in spite of this it was planned to be a complete software, thus it has its own online guide and a simple installation manual. The description of a software and of its use is a very important aspect even for an animation plug-in; the complete understanding of its use is fundamental, and in fact today there are many useful plug-ins that are not used during the production phase because they are too hard to understand.
5.1 Speak U2 installation
The Speak U2 plug-in has a very simple installation, like any other Maya plug-in. It is formed by three MEL files, named userSetup.mel, speakU2Register.mel and speakU2createUI.mel. All these files implement the user interface, while the compiled C++ file has the .mll extension, which is the specific extension for any Maya plug-in. This file is called SpeakU2.mll. It is important to know the names of these files because the installation of a plug-in normally consists of several copy-and-paste actions of files into specific directories.
In the case of Speak U2, the MEL files are to be copied into the user directory, following the path Maya\4.0\script. This is valid for Microsoft Windows 2000 and Windows NT 4.0 too, even if the user directory differs between the two Microsoft operating systems: in one case it is under Document and Settings\user\, while under NT 4.0 it is under winnt\profiles.
In case a userSetup.mel file is already present in that directory, it is important to select all the content of the file provided with Speak U2 and append it at the end of the preexisting one.
The speakU2.mll file is to be copied into the Maya installation directory AW\Maya4.0\bin\plug-ins, where all mll files are located. At the first launch of the Alias|Wavefront application it is necessary to go to the menu Window\Settings/Preferences\Plug-in Manager to activate the loading of the plug-in, and possibly check its autoload box in order to have Speak U2 always present at every Maya start-up.
5.2 Plug-in integration
Speak U2 is perfectly integrated into the Maya graphic interface. In fact, after its setup, any time the application is started it is possible to notice in the main Maya window, among all the menus in the menu bar, the presence of a menu called SpeakU2.
Moreover, that menu also appears in the hotbox, which is a particular Maya tool obtained by pressing the spacebar and which groups all of Maya's menus for quicker selection. The images below show this integration:
Figure 5.1 : Maya Main Window
The Speak U2 menu is present among the Maya menus and is located on the right part of the menu bar.
Figure 5.2 : Maya Hotbox
In the hotbox the Speak U2 menu appears in the top right position among the general menus, while the menus below refer to Maya configurations other than the default one, which is generally Animation. Thus the Speak U2 menu is always present in the hotbox, independently of Maya's working configuration, such as rendering or modeling.
5.3 User interface
Figure 5.3 : First tab of Speak U2 interface
The first tab of the user interface, as seen before, refers to the phase of building the data structure of phonemes and visemes. Figure 5.3 presents this first tab.
In the first line there is an option box that allows choosing between an animation based on ten or sixteen visemes covering the 44 American phonemes. In the "List of blendshapes present in the scene:" are listed all the blendshapes modelled in the scene, and among these blendshape names the user has to select the deformer linked with the visemes. Once something is selected in this list, all its attributes are displayed in the "List of blendshape attributes:" scrolling list.
All these attributes are grouped under the main one selected in the first list; according to the decisions of the modeller, it may happen that blendshapes not representing visual phonemes are grouped together with those corresponding to mouth shapes. Obviously it is not good practice to mix blendshapes referring to mouth positions with others regarding different facial animation aspects, but it can happen.
Because of this considerations, the user interface consider to allow the artist to
choose among this attributes blendshapes to select the only ones corresponding to
visual phonemes; selecting them and pressing the “Add Viseme” button, the selected
visemes are appended into the below scrolling list named “List of Viseme:” where the
animator will introduce ten or sixteen mouth’s shapes according to the schema he wants
to adopt for the animation.
The choice of schema determines how the "Grouped American English Phonemes"
list is filled with ten or sixteen groups of phonemes; if the visual
phonemes were entered following the order of this scrolling list, the mapping between the
grouped phonemes and the visemes becomes quite evident and simple. Moreover, the
"Delete Viseme" and "Delete All Visemes" buttons make it possible to correct the list of
visual phonemes to be mapped. The mapping itself is performed with the "Map between"
button, which connects the selected group of phonemes to the selected blendshape viseme.
The first time, each group of phonemes has to be mapped to the corresponding
viseme; the map can then be saved to, and reloaded from, a map10.txt or map16.txt file
according to the selected schema.
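The mapping this tab constructs can be pictured as a dictionary from a phoneme group to a viseme blendshape name, saved to and reloaded from a plain text file. The sketch below, in Python, uses hypothetical group and viseme names and an assumed one-entry-per-line layout, since the actual format of map10.txt/map16.txt is not documented here:

```python
# Sketch of the phoneme-group -> viseme map built with the "Map between"
# button. Group and viseme names are hypothetical examples.

def save_map(mapping, path):
    """Write one 'group=viseme' line per entry (assumed file format)."""
    with open(path, "w") as f:
        for group, viseme in mapping.items():
            f.write("%s=%s\n" % (group, viseme))

def load_map(path):
    """Rebuild the dictionary from a previously saved map file."""
    mapping = {}
    with open(path) as f:
        for line in f:
            group, _, viseme = line.strip().partition("=")
            if group:
                mapping[group] = viseme
    return mapping

viseme_map = {
    "AA AO OW": "viseme_AH",   # open-mouth group (example)
    "M B P":    "viseme_MBP",  # closed-lips group (example)
}
save_map(viseme_map, "map16.txt")
assert load_map("map16.txt") == viseme_map  # round-trip check
```

A real implementation inside Maya would be written in MEL, but the data flow of saving and reloading the map is the same.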
Figure 5.4 : Second tab of Speak U2 user interface
The second tab, shown in figure 5.4, is called "Working on speech"
because it is used to collect information about the words during the analysis. In the first line,
an option box selects the source of the audio file to be synchronized:
choosing the "From disk" option activates the "Browse" button, so that the
wav file can be picked from the hard disk. If the "From timeline" option is
selected, the audio file displayed within the Speak U2 interface is the
same one visible in the Maya timeline, and thus the file already present in the scene
is used. The audio file has to be a wave file sampled at 44 100 Hz (44.1 kHz),
as Maya requires; moreover, to play the sound back inside Maya, the playback
speed must be set so that the audio plays during the animation. Not every playback speed
performs this task: it has to be chosen according to the video format adopted for the
animation; for example, real-time 25 frames per second is the
standard for Italian movies.
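The correspondence between positions in the 44 100 Hz wave and animation frames at the chosen playback speed is a simple proportion; a minimal Python sketch, assuming the 25 fps rate mentioned above:

```python
SAMPLE_RATE = 44100  # Hz, the wave sampling rate Maya expects
FPS = 25             # real-time playback speed for PAL/Italian movies

def sample_to_frame(sample_index, rate=SAMPLE_RATE, fps=FPS):
    """Map a sample index in the wave file to an animation frame."""
    return int(sample_index / rate * fps)

def frame_to_seconds(frame, fps=FPS):
    """Time in seconds at which a given frame is played."""
    return frame / float(fps)

# Two seconds into the audio corresponds to frame 50 at 25 fps.
assert sample_to_frame(2 * SAMPLE_RATE) == 50
assert frame_to_seconds(50) == 2.0
```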
The sound time section displays the audio waveform and lets the user move
through the sound, listen to it, and determine the exact frame at which a word begins
or ends. He can also scrub through the audio, where scrubbing means dragging along the wave
file and listening to it frame by frame. The buttons below the waveform make
this navigation easier. To facilitate the task of writing down the spoken words and their
timing, an option box also allows the playback mode to be set to a
continuous loop instead of playing once. It remains true that Maya offers few
instruments for working with audio, so it is generally preferred to use an
external tool with richer features, such as Syntrillium Cool Edit Pro 2000 or Adobe
Premiere 6.0.
Once the artist is satisfied with the timing he has detected, he can insert all these
values into the corresponding fields below the waveform; "Word's
pronunciation beginning" and "Word's pronunciation ending" can be filled by
pressing the "Keep start frame" or "Keep end frames" button while positioned on the chosen
frame.
The emotion selection list was included in the user interface even though it is not
functional in this version of Speak U2. It would be used to add a particular emotion to the
spoken word, for example happiness or sadness, or to give the character an
interrogative expression. This addition requires further study to understand and plan how
to insert this information among the others and how to make the algorithm process it.
The last button is named "Save" and it has to be pressed every time the artist has
entered all the information about a single word; the data are then stored in the third tab, as
shown in figure 5.5.
Figure 5.5 : Third tab of Speak U2 user interface
The third tab was created to collect all the information inserted in the
previous tab. Each line of the "Word\Start frame\End frame\Emotion" scrolling list
refers to a single word and contains the four information fields its name suggests.
From this table it is possible to verify that all words have been entered, and
to delete or modify them. To change a value of a line, the user selects
the line and goes back to the second tab, where the data of the selected line are
automatically displayed; after the changes, the "Save" button overwrites the
previous values. To delete one line or all lines, it is enough to press the "Delete Line" or
"Delete ALL Line" button.
Moreover, a further button, "Insert Line", introduces a new line
before the one selected in the list; the new line is filled with default values,
xxx\0\1\None, ready to be modified.
In the last section of this tab there are two buttons, labelled "Load Work from a txt
file" and "Save Work on a txt file", which save all the information to a
txt file named after the scene currently open in Maya, or load the information back from a
saved file. This is an important feature because, with a long speech, it often
happens that not all the words are analysed at the same time. Moreover, analysing one
short part of a long speech at a time can be a normal way of working, in order to
verify the position of the words in the timeline and how the algorithm reacts when synchronizing
them.
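The word table and its save/load buttons can be sketched as a list of records serialized one per line; the backslash-separated layout below simply mirrors the header of the scrolling list, the real file format of Speak U2 being an assumption here:

```python
DEFAULT_LINE = ("xxx", 0, 1, "None")  # default values of a newly inserted line

def save_work(entries, path):
    """One 'word\\start\\end\\emotion' line per analysed word (assumed format)."""
    with open(path, "w") as f:
        for word, start, end, emotion in entries:
            f.write("%s\\%d\\%d\\%s\n" % (word, start, end, emotion))

def load_work(path):
    """Rebuild the word table from a previously saved work file."""
    entries = []
    with open(path) as f:
        for line in f:
            word, start, end, emotion = line.strip().split("\\")
            entries.append((word, int(start), int(end), emotion))
    return entries

work = [("Hi", 10, 18, "None"), ("myname's", 19, 40, "None")]
save_work(work, "scene_work.txt")
assert load_work("scene_work.txt") == work  # round-trip check
```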
After saving the work, the "Ok – Generate Keys" button becomes active; it
performs the last, longest and most complex task of the lip synch process,
starting the translation and keyframing algorithm described in the previous chapter.
Figure 5.6 shows the output window that appears during the algorithm
execution and makes it possible to follow the lip synch actions as they proceed.
This output window also allows the mapping between visemes and phonemes to be verified,
which is very important: a wrong correspondence means a completely unrealistic
animation, because a sound would be covered by a different mouth shape instead of the
appropriate one.
Figure 5.6 : Speak U2 output window
In the first part of the output window, the grouped phonemes and the name of the viseme
mapped onto them are shown, while in the second part the animator can observe the
words being translated at that moment.
From the programming point of view, the first three tabs, as analysed
in chapter 4, are the most complex and also the most important ones,
because they cover the three main steps of the lip synching process.
The fourth tab was conceived to preview the lip animation obtained with
Speak U2 in a smaller perspective view instead of the Maya one.
It is therefore not a fundamental tab of the user interface, in the sense that it reproduces
something already available in Maya. The animator can choose where to play the
preview, even though this task requires considerable computer resources, so that even
powerful PCs have trouble playing a complex animation within Maya. Because
of this, it is normally a good choice to perform a "playblast", that is, to
create an AVI file that reproduces exactly the animation as played
within Maya. The bottom part of the fourth tab therefore contains a button that opens
the Maya playblast options, allowing the artist to customize its resolution, size
and other preferences. Figure 5.7 focuses on this part of the graphic interface.
Figure 5.7 : Fourth tab of Speak U2 interface
Because of these features, the fourth tab was called "Preview Animation". The
last conceptual step of the lip synch process, after keyframing and preview,
is generally the retouching phase, in which possible imperfections of the mouth movements
or timing errors are corrected.
Figure 5.8 : Fifth tab of the Speak U2 user interface
As Figure 5.8 shows, one immediate way to correct wrong situations is
to use the Blendshape Editor, which is included in this tab of the user interface for
easier access, instead of having to locate it inside Maya. In the Blendshape Editor it is
possible to change a viseme value at any frame of the animation, and thus also to eliminate
visemes by setting their value to zero. Obviously, moving through the Maya
timeline and the channel box (the instrument that collects all the attributes of the
selected shape and their values), it is also possible to completely eliminate a key at a
certain frame. Another possible correction is shifting keys from one
frame to another, even though a good timing chart makes this kind of correction unnecessary.
In the bottom part of this tab the user finds two more links to useful Maya
tools: the Graph Editor, which can be used to control the animation tangents and thus
the influence of the interpolation performed by Maya, and the
Hypergraph, which can be used to inspect the structure of the shapes and select them
with more precision.
5.4 Tips & tricks
Like any software, Speak U2 hides some tricks that are useful to know,
above all for the animator. These tips come from a long period of testing the
plug-in and trying to synchronize many different speeches.
The most important thing to understand is that, as explained in the
previous chapters, the algorithm works on the written word as if it were the spoken one. In
reality, the actual pronunciation very often does not match the grammatically
correct written form: English and American speech in particular is rich in
abbreviations, and cutting away parts of words is usual behaviour.
Sometimes the quickest way to solve problems that emerge after using
this plug-in is to operate on the list of spoken words to be
translated. After keyframing, the animator may notice that some words,
especially the longer ones, generate a sort of lip-vibration effect. This is
common when a word is formed by a sequence of phonemes represented by an
alternation of the same few visemes: the lip movements are identical and they are
repeated within a very small number of frames, giving the impression of a vibration.
There is no error in this on the part of the algorithm, which is working exactly as
usual; but if a few mouth shapes cover many phonemes and a word is composed of
a repetition of the same visemes, no modification of the code can solve the problem.
One solution is to model more visual phonemes, so that fewer phonemes are grouped
under the same viseme; passing from the ten-viseme schema to the sixteen-viseme one
can therefore be a good idea.
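A word likely to produce the vibration effect can be recognized mechanically: its key sequence alternates between very few distinct visemes over many keys. A small diagnostic sketch, where the viseme names and the thresholds are illustrative rather than taken from Speak U2:

```python
def looks_vibrating(viseme_keys, max_distinct=2, min_keys=5):
    """Flag a word whose keys bounce between very few distinct visemes,
    which on screen reads as a lip vibration."""
    return (len(viseme_keys) >= min_keys
            and len(set(viseme_keys)) <= max_distinct)

# A long word alternating between two mouth shapes reads as a vibration...
assert looks_vibrating(["vis_E", "vis_T", "vis_E", "vis_T", "vis_E"])
# ...while a varied viseme sequence does not.
assert not looks_vibrating(["vis_M", "vis_AH", "vis_T", "vis_O", "vis_S"])
```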
As anticipated above, another important measure is to pay attention to
the written word to be translated, according to the rules presented in the first chapter.
Some letters, such as the nasals m and n, or p, often have to be dropped, but this
does not necessarily mean that the artist has to physically delete the corresponding keys
from the timeline: he can remove the letter directly from the written word and then apply the
algorithm. In many cases this method has generated a better lip movement.
It is also possible to use an apostrophe in place of some letters in a word, in order
to keep the viseme of the previous letter and avoid the viseme of
the substituted one.
Moreover, "t" and "d" are letters that should not simply be deleted, but they persist
very little during pronunciation, so they can sometimes be substituted or
erased; where they are fundamental, the animator can shift keys to assign them
fewer frames than the algorithm allocated.
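These letter-dropping and apostrophe tricks amount to a small text preprocessing step applied before the word is fed to the algorithm; a hypothetical helper (which letters to drop or to replace with an apostrophe remains the animator's choice):

```python
def preprocess(word, drop="", apostrophize=""):
    """Delete some letters entirely, and replace others with an apostrophe
    so that the previous letter's viseme is held in their place."""
    out = []
    for ch in word:
        if ch in drop:
            continue          # letter removed, as with weak nasals
        if ch in apostrophize:
            out.append("'")   # hold the previous viseme
        else:
            out.append(ch)
    return "".join(out)

# Dropping the final "n" yields "plugi", as entered in the demo speech.
assert preprocess("plugin", drop="n") == "plugi"
# Apostrophizing "v" yields "e'ery", in the spirit of the demo's "ev'ry".
assert preprocess("every", apostrophize="v") == "e'ery"
```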
Another important trick concerns the process of writing down the time table. Observing
a waveform, it is easy to notice points where the wave dies out and
then restarts because of a pause in the speech. Testing showed that when two or more
words are pronounced within the same wave, so that they do not generate separate waves but
share a single one, the best choice in Speak U2 is usually to join those words into
a single spoken word, even if it does not exist.
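Joining words that share a single continuous wave can be sketched in the same spirit: if the gap between the end of one word and the start of the next is below a few frames, the two entries are concatenated into one pseudo-word. The gap threshold below is illustrative:

```python
def merge_close_words(entries, max_gap=2):
    """entries: list of (word, start_frame, end_frame) tuples.
    Concatenate consecutive words separated by at most max_gap frames."""
    if not entries:
        return []
    merged = [list(entries[0])]
    for word, start, end in entries[1:]:
        if start - merged[-1][2] <= max_gap:
            merged[-1][0] += word      # join into one pseudo-word
            merged[-1][2] = end        # extend its end frame
        else:
            merged.append([word, start, end])
    return [tuple(e) for e in merged]

# "my" and "name's" share one wave, so they collapse into "myname's".
assert merge_close_words([("my", 10, 14), ("name's", 15, 25),
                          ("Anne", 40, 50)]) \
    == [("myname's", 10, 25), ("Anne", 40, 50)]
```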
5.5 The demo animation
At the beginning of the development of this Maya plug-in there was the idea
of making a demo animation, and this for several reasons. The first need was to test
Speak U2, and this was done using many audio files, with female and male voices
alike, moving simple mouth shapes or complete face models, to check not only
the quality of the algorithm and of the plug-in itself, but also to discover bugs and
shortcomings. This process allowed some parts of the Speak U2 graphic interface to be
improved, adding options and functions. Obviously this is only the first version of
Speak U2, implemented without a fully comprehensive view of Maya and all its
possibilities.
While this first reason is the most important one, it was also thought that a demo
animation would be useful to present Speak U2 itself; the animation thus became
part of the plug-in, allowing it to introduce itself to a potential buyer or at an occasion
such as the presentation of a thesis. Seeing a movie before trying Speak U2, or
before reading about it, can focus people's attention and suggest its potential.
5.5.1 Making the animation
The demo animation, entitled "She was speechless", was built ad hoc to
introduce Speak U2 as a lip synch plug-in for the English language, but the aim of this little
movie was also to present the result of the work carried out during these months.
To obtain an animation, it is always necessary to write a storyboard
collecting the information and the general schema the animation will follow. In this
case, the text of the speech was written first, because it was the most important
element of an animation meant to give prominence to the synchronism of the lips
during speaking. Moreover, beside the plain text, many annotations were added to
characterize the manner of speaking, such as facial expressions and emotions, or movements
of the head and eyes.
With this information, the audio track was recorded in a facility for
professional audio recording, and then the lip synch process was started. Having
the written text made the phase of writing down the spoken words unnecessary, and the speech
was listened to many times, moving through the sound waveform, to note the beginning
and ending frame of each word. In this task, all the tips and tricks became
important and were applied in finding the right timing table. The section below
shows the plain text spoken by the speaker, followed by the text that was
introduced, word by word, into Speak U2:
“Hi, my name is Anne. Wow! Finally I was born! And I speak too!
I can speak every language based on English grammar thanks to SpeakU2, the lip
synch plug-in for Maya four realized by Daniela Rosso and Logic Image, as thesis in
computer engineering for Politecnico University of Turin.
Yes, I speak! Look at my lips! They move, they create words without any troubles!
And what about my face?
It looks very nice, isn’t it?
Thank you! Thank you! Oh God! Thank all of you, thank you Logic Image, thank you
Mom, thank you Dad and thank you to all my family and my friends.”
In order to obtain a correct lip synch, as seen before, some of these words were linked
together or modified according to their pronunciation. Below are the words
entered in the Speak U2 interface:
“Hi myname's Anne wow finally iwasborn andi speak too I canspeak ev'ry 'aguage
based o'Eglishgrama thanks to speakUtwo the'ip synch plugi for Maya four realized by
Da'ie'a Roso an'Logic Image as tesis in copu'e egine'rig fo Po'i'ecnico u'ivesity of Turin
YesI Speak lookat mylips theyMove they create word without any troubles
Andwha aboumy face it look verynice isn'tit”
Moreover, this first part was produced by launching Speak U2 with the blendshape
dominance value set to 60%, so that keys are created for blendshapes posed at a value
of 0.6. Speak U2 was then run again on the same scene, keeping the keys created the
first time, to synchronize the second part of the speech with the blendshape dominance
fixed at 75%:
“Thankyou oh thankyou oh God thankallof you thanku Logic Image thanku Mom thaku
Dad and thanku to allmy fa'ily anmy frieds”
As can be seen, many words are linked together to create a single word,
while others have fewer letters or apostrophes substituting some of them; moreover,
some words use a single letter to create a sound instead of the correct spelling,
such as the "u" playing the role of the "thank you" sound.
All these tricks are really important because, with little work and
attention, they spare the animator from making changes and corrections after the lip synch
process by editing key times or values. In fact, to realize the demo animation, the animator
simply edited the blendshape value of the key corresponding to the word
"Wow", in order to obtain a wider open mouth suggesting a more expressive
exclamation. No other interventions were necessary to obtain a realistic lip synch
animation.
For the rest of the animation, the 3D artist added controls for the movements
of the eyes, the head and everything else concerning the characterization, in order to obtain
effects in line with the speech and the storyboard.
Figure 5.9 : Anne, character of “She was speechless”
Chapter 6
Conclusions
Speak U2 has achieved good results that were unexpected at the beginning of the
work. The obtained lip movement is very satisfactory: it is realistic
and rarely needs an intervention from the animator in order to improve some
movements.
Without a tool like Speak U2, an animator who wants to realize an animation including lip
synch can only put himself in front of a mirror and try speaking, observe his lip
movements, and then create the mouth shapes. Or, knowing the lip synch methodology
explained in this thesis, he can use visemes, but he has to translate the written
words into their phoneme sequences himself and, moreover, set all the animation keys
manually.
It is evident that both of these solutions are impracticable during a production,
because they take too long to perform. Without a precise methodology,
hours may be needed to synchronize a single word. The second approach relies on the
animator's ability to translate text into phonemes, which is a hard task even
for an expert in linguistics, and thus a job unsuited to a 3D artist without the
right know-how. Moreover, the keys to be set are innumerable, so a manual keyframing
effort would steal too many hours from the production.
A realistic estimation of Speak U2 performance leads to the following figures
for synching 15 seconds of audio: about 30 minutes for the artist to
recognize the spoken words and their start and end times; about 5 minutes of CPU time to
execute Speak U2 on a complex facial model with the 16-viseme
schema; and about 30 minutes to run Speak U2 again on adjusted
timing charts and word data structures.
In a little more than one hour, the animator can successfully synchronize 15
seconds of speech, with the possibility of improving the animation by testing different
configurations, and thus obtain a final result.
This means that in about four hours of work, an entire minute of speech is
synchronized, so roughly 2 minutes of animation are finished in a day of work. These are
very meaningful values, because they imply that a movie of the standard 90-minute
length can be synched in about one month and a half of work by a single
animator, and in a shorter time if several animators work on it.
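The production estimate above follows from simple arithmetic, reproduced here as a sanity check (the 8-hour working day is an assumption not stated in the text):

```python
# Time needed per 15 seconds of audio (in minutes of work):
recognize = 30     # writing down the spoken words and their start/end frames
cpu = 5            # Speak U2 execution, complex model, 16-viseme schema
retries = 30       # re-running Speak U2 on adjusted timing charts and words
per_15s = recognize + cpu + retries        # "a little more than one hour"

per_minute = 4 * per_15s                   # minutes of work per minute of speech
workday = 8 * 60                           # assumed 8-hour working day

assert per_15s == 65                       # ~65 minutes for 15 s of speech
assert per_minute / 60.0 < 4.5             # "about four hours" per minute
assert round(workday / float(per_minute), 1) == 1.8   # ~2 minutes per day
assert round(90 * per_minute / float(workday)) == 49  # ~1.5 months of workdays
```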
The quality reached by the animation is strictly connected to the quality of the
3D model, but in any case it is surely better than any result obtained with manual
work, because it achieves the precision of a linguistic translation and of scientific
calculation.
Obviously Speak U2, being at its first version, leaves room for improvements and
additions. A future development will surely include the option of saving all text files
to a path chosen by the user instead of a default one, and the possibility of applying
a different blendshape dominance to each single word instead of a single dominance
for the whole speech.
Moreover, to allow the characterization of a model with emotions, the use of the special
field named "emotions", already present in the user interface, is planned, in order
to set keys for other blendshapes corresponding to the positions of the eyebrows,
eyes and eyelids.
From the implementation point of view, more complex additions are possible,
such as support for the Italian language and the integration of an audio tool to
analyse the recorded speech directly, in order to compute the start and end
times of the spoken words automatically.
These two main additions require research work to study their algorithms,
so they will probably be included in future versions of Speak U2.