ANIMAZIONE TRIDIMENSIONALE REALISTICA DEL MOVIMENTO...
POLITECNICO DI TORINO, III Faculty of Engineering
Degree Course in Computer Engineering
DEGREE THESIS
ANIMAZIONE TRIDIMENSIONALE REALISTICA DEL MOVIMENTO LABIALE
IN AMBIENTE MAYA
(Realistic three-dimensional animation of lip movement in the Maya environment)
Supervisor: Ing. Fulvio CORNO
Candidate: Daniela ROSSO
January 2002
Summary
Chapter 1: Introduction
Chapter 2: Alias|Wavefront Maya 4.0 Unlimited
  2.1 A deeper look at the API and MEL
  2.2 Moving within the Maya environment
  2.3 Blendshape deformer
    2.3.1 Target and base objects
    2.3.2 Target shapes, base shapes, and blend shapes
    2.3.3 Targets
    2.3.4 Keyframing
    2.3.5 Using keyframing animation
Chapter 3: Lip Synch issues
  3.1 Lip synch steps
  3.2 Getting to know phonemes
  3.3 Getting to know visual phonemes
  3.4 The lip synch rules
    3.4.1 Record the dialog first
    3.4.2 Never animate behind synch
    3.4.3 Don't exaggerate
    3.4.4 Rules were made to be broken
  3.5 The lip synch process
    3.5.1 Break down the speech pattern
    3.5.2 Analyze the audible dialog to determine phonemes
    3.5.3 Use a timing chart to set keyframes
    3.5.4 Tweak the finished animation
    3.5.5 Phoneme dropping guidelines
  3.6 Thinking as an animator
  3.7 Text to phoneme
    3.7.1 Focusing on language
  3.8 Synchronization and artistic freedom
Chapter 4: Solutions
  4.1 Interface realization
    4.1.1 First step: creating the data structure
    4.1.2 Second and third tab: collecting information
  4.2 Interaction between MEL and the API
  4.3 Algorithm
    4.3.1 Text-to-phoneme translation
    4.3.2 Keyframing
Chapter 5: Software description
  5.1 Speak U2 installation
  5.2 Plug-in integration
  5.3 User interface
  5.4 Tips & tricks
  5.5 The demo animation
    5.5.1 Making the animation
Chapter 6: Conclusions
Bibliography
Chapter 1
Introduction
In recent years Computer Graphics and Visual Effects have grown
exponentially, appearing more and more frequently in Hollywood
productions as well as in European ones.
Computer Graphics in particular has found its natural role not only in
cinematographic contexts, but also in industrial activities. The continuous
demand for 3D models and characters animated with computer graphics techniques
confirms this new attitude even in Italian TV advertising, where, for example,
one can find stylized humans emerging from a tube of toothpaste, friendly 3D
ants playing with toilet paper, or visual effects such as boats surfing on a
desert or lions swimming in the abyss.
Far from the cinematographic context of films and advertising spots, 3D models
are often used to generate virtual prototypes of industrial products of any
kind, in order to test the possibility of introducing them into the real market
without spending any material or working time.
Even the public sector benefits from the contribution of three-dimensional
modeling; building engineering, for example, uses it to preview the
environmental impact of imposing structures such as bridges, being able to
simulate the atmospheric agents these buildings will be exposed to. Moreover,
computer graphics becomes useful to evaluate the impact that such modern
buildings could have on the environment.
Nowadays, years of research and computer experiments, carried out in public and
private labs all over the world, have led to the spread of new and different
ways of representation. Computer-generated images are a large part of every
branch of communication. But beyond the technological evolution of these
products, they are even more important for their cultural impact. The
improvements reached in this area have marked a great perceptual turning point
that is gradually modifying our aesthetic taste and our daily lifestyle.
In spite of these other uses, it is in film making that Computer Graphics
finds its greatest application and challenge, and its development keeps
increasing thanks to the parallel growth of hardware technology and of more
powerful "special purpose" software.
Cinema's history is principally based on character animation, i.e. the creation
of movement and life for figures modelled and drawn with various materials.
Many characters, some human, some animal, some alien, most of them endearing,
have been created with rolling eyes, rounded noses, hair defying gravity, or
exceptionally elastic body parts.
3D animation, given over to the fun of puppets, naturally leans toward
imperfect models: the toy, the soldier, the rubber puppet, in general the
normally inanimate object that comes to life, since these can play the funny
character without abandoning the pursuit of photorealism.
The exponential improvement in technology has allowed the development of more
specialized programs able to relieve the animator of many difficulties.
Making a modelled and drawn figure act in a realistic manner is very difficult.
In fact, early Computer Graphics commonly used dark scenes, because everything
hidden in the dark does not show the imperfections of its production. Today
these figures are usually on stage in foregrounds and full-length films. But
the challenge has always been the same: to model and to animate in a realistic
manner a human character, an anthropomorphic being.
Considering the popular movies of these days, it becomes evident that there is
a tendency to search for human features when animating characters that are
normally inanimate or unreal. An example is Pixar's "Toy Story", with little
toys as leading characters; while some of them maintain their mechanical
movements, the main ones move in a typically human manner.
Another example is the recent DreamWorks "Shrek", focusing on the adventures of
a good ogre whose purpose is to save an anthropomorphic princess, helped by a
friendly donkey. Even if the characters are animated in a perfect way, they are
not really human, and princess Fiona too is part of an animation film that
cannot pretend to be so realistic as to make the viewer think its actors are
real instead of digital ones.
But watching animation films, one observes that to endow a character with human
features and movements, even a non-anthropomorphic one, it is necessary to care
not only about the body, but especially about the face.
In fact, the appearance of a character is strongly determined by its face and
by its ability to express emotions through facial movements.
Figure 1.1: The Shrek cast standing
The representation of a human face is thus surely one of the most complex
aspects of computer animation. At present, the movie considered the "state of
the art" as far as the realism of digital characters is concerned is "Final
Fantasy: The Spirits Within" by Squaresoft. In this movie, people really
believe they are seeing real actors instead of digital ones, thanks to perfect
modeling work, to exact body movement and especially to the impressive realism
of the characters' faces.
To make a face real, it is fundamental to give it the ability to express
emotions with a combination of eyebrow and eye movements as well as facial
muscle movements.
But the element distinguishing a realistic anthropomorphic character from an
obviously digital one is the lip movement during speech. In fact, a character
whose lips are not synchronized with the sound it speaks immediately appears
artificial and loses its aim of being real; if the character is to be
realistic, its lips have to actually describe the movements necessary to utter
that sound.
Figure 1.2: Final Fantasy: Dr. Aki Ross
To notice this, it is enough to think of cartoon characters, which simply open
and close their mouths according to the speech, without describing the real lip
movements.
Good lip synchronization, called "lip synch" in the Computer Graphics and
Animation context, is based on carefully creating the movements of lips,
tongue, jaw and cheek muscles; these features, together with the facial
expression made of changing eyes and eyebrows, give the character realism.
The goal of this thesis is to analyze the methodology used in lip synching, to
find an algorithm that automates this process, and to implement it, obtaining
an animation tool, integrated into the 3D Alias|Wavefront Maya 4.0 package,
that helps achieve a good lip synch result faster and more easily. The purpose
of this plug-in is to spare the animator, who normally cannot waste time
understanding concepts such as phonemes, the tedious operations of keyframing,
translating words, and trying lip movements in front of a mirror. This plug-in,
named Speak U2, makes every step of the lip synch animation really easy,
leaving the animator only to listen to the audio speech and enter the
pronunciation times and spoken words through an easy interface.
The next chapter describes Alias|Wavefront Maya 4.0 Unlimited, one of the best
known and most widely used 3D animation and visual effects packages. The third
chapter analyzes the lip synch methodology by which a correct lip
synchronization can be reached.
The fourth chapter covers the Speak U2 implementation and its technical
aspects, while the fifth describes the user interface and explains how to use
the plug-in. The last chapter collects the conclusions about the work and the
obtained results.
Chapter 2
Alias|Wavefront Maya 4.0 Unlimited
Maya Unlimited is the most powerful and advanced solution for creation of
computer-generated animation and special effects. All Maya Unlimited features are
perfectly integrated into a single environment that is completely customizable and
optimized for maximum productivity. Innovative user interaction techniques in Maya
Unlimited deliver the smoothest possible workflow. Maya Unlimited is for the
advanced digital content creator who wants to work with the ultimate creative tools for
maximum creative possibilities. Maya Unlimited includes:
• Modeling: industry-leading NURBS and polygon modeling tools, and Maya
Advanced Modeling functionality unique to Maya Unlimited
• Artisan: integrated brush interface for digital sculpting and attribute
painting
• Paint Effects: groundbreaking paint technology for adding amazing natural
detail on a 2D canvas or in true 3D space
• Character Animation Tools: general keyframing, expressions, inverse
kinematics, powerful character skinning and advanced deformation tools
• Dynamics: integrated particle system plus high-speed rigid-body dynamics
• Rendering: film-quality rendering with advanced visual effects and
interactive photorealistic rendering
• API: access to Maya's internal data structures, enabling programmers to
enhance and complement Maya for their own production needs
• MEL (Maya Embedded Language): open interface for customizing and scripting
any aspect of Maya's functionality
• Advanced Modeling: additional NURBS and subdivision surface features
• Live: precision matchmoving that allows the marriage of live-action footage
with 3D elements rendered in Maya Unlimited
• Cloth: the fastest, most accurate solution for simulating a wide variety of
digital clothing and other fabric objects
• Fur: incredibly realistic styling and rendering of fur and short hair
• Batch Rendering: two supplemental rendering licenses to increase rendering
productivity and maximize throughput.
2.1 A deeper look at the API and MEL
The API (Application Programming Interface) is the key to the openness of the
Maya architecture at the lowest level. The API provides the most direct access
to all of Maya's internal data. Through the API, highly efficient plug-ins can
also be created to extend the capability of the system. Maya was built to
ensure that features available through the user interface are exposed
identically in the API and produce identical results for programmers.
MEL provides higher-level extensibility that is accessible to users who may not
be experienced programmers. MEL scripts can be used to quickly perform tedious
repetitive tasks. MEL also provides an efficient way to create prototypes and
test new tools which might ultimately be implemented via the API. With a full
complement of flow control commands, a C-like programming syntax, and a broad
set of user interface generation utilities, MEL can also be used to build very
detailed new functionality, including custom character setup, new particle
effects, and even specialty animation systems.
Together, the API and MEL offer methods and tools for extending Maya's
functionality from the highest to the lowest level. Through the inclusion of
both a C++ API and a robust scripting language with a well-understood syntax,
the opportunity to build extensions is available to game programmers and game
artists alike.
Figure 2.1: Maya programming architecture (layers, from the user down: User
Interface, Plug-in, MEL, API, Engine)
2.2 Moving within the Maya environment
The programmer who wants to extend Maya's functionality has two ways to do it:
build a stand-alone application or create a plug-in integrated into Maya's own
structure.
Implementing a stand-alone application would have the advantage of obtaining a
module independent from Maya itself, allowing it to interact with the Microsoft
Foundation Classes and to use other applications, such as database programs, to
store and modify data. These features may be much appreciated when building an
animation tool such as a lip synch application, because the powerful features
of external software can simplify complex tasks. On the contrary, building a
stand-alone application would disturb the workflow philosophy on which Maya is
based. In fact, in Maya any plug-in is considered an internal module and the 3D
artist never has to leave Maya itself. This is a great feature of this 3D
animation software, because it allows the artist to remain within the same
environment without losing concentration or getting frustrated by moving
between application windows. Moreover, a well-built Maya plug-in recalls the
same look and structure of its host software, giving the user the impression of
using a native Maya tool. Because of these considerations, it was decided to
make Speak U2 a Maya plug-in instead of a stand-alone application.
Even if Maya provides the possibility of creating plug-ins or extensions using
the API or MEL, it was not conceived to ease the programmer's work, above all
where MEL is concerned. In fact, using this scripting language is very tedious
and frustrating because of the complete absence of a debugger and the need to
restart Maya every time a change occurs. Moreover, the script editor internal
to Maya lacks adequate word wrap and offers no automatic syntax coloring, which
is generally useful to avoid typing mistakes. Figure 2.2 shows the Maya script
editor.
Figure 2.2: Maya's script editor
The lower part of the script editor is the editable one, where it is possible
to write commands or procedures; to execute them it is necessary to select them
and press the Enter key. The result message of the execution is displayed in
the upper part, but the real value of this half of the window is that it echoes
all the commands, which can be useful when the documentation does not make
clear how to proceed with some command.
The only way to ease this manner of working is to use an external text editor,
writing programs and then loading or sourcing them from the script editor.
Fortunately some of these programs are better suited to programming languages,
offering syntax coloring, word wrap and tabulation. For example, ES-Computing
EditPlus version 2.10c can import and set the MEL syntax, making it less
frustrating to write MEL code, even if only executing it within Maya can inform
the programmer that an error occurred; this requires Maya to be restarted, with
a long loss of time and of CPU and memory resources.
MEL can be used to automate repetitive sequences of commands, to build personal
user interfaces or to customize the existing one, but it is not suited to
complex calculations or to procedures that do not refer to the graphic
interface. On the other hand, the Maya API and C++ are used when it is
necessary to implement more complex algorithms such as those needed in a lip
synch plug-in. Maya requires its plug-ins to be built within Microsoft Visual
C++ 6.0, so the programmer can use a solid debugger to test his code. Building
the C++ files generates an .mll file that has to be loaded within Maya through
the Plug-in Manager, shown in the figure below.
Figure 2.3: The Maya Plug-in Manager
To be able to write a plug-in for Maya Unlimited 4.0 it was necessary not only
to understand MEL and the API, but also to become confident with concepts
typical of 3D animation and modeling such as keyframing, blendshape
deformation, interpolation and tangents. These are basic and fundamental
concepts for acquiring the know-how needed to plan the work and write the
program code.
2.3 Blendshape deformer
Blendshape deformers enable deforming a NURBS (Non-Uniform Rational B-Splines)
or polygonal object into the shapes of other NURBS or polygonal objects. It is
possible to blend shapes with the same or a different number of NURBS control
vertices. In character setup, a typical use of a blendshape deformer is to set
up poses for facial animation. Unlike the other deformers, the blendshape
deformer has an editor that enables you to control all the blendshape deformers
in the scene. The editor can be used to control the influence of the targets of
each blendshape deformer, create new blendshape deformers, set keys, and so on.
Generally speaking, or in other software packages, what Maya provides with the
blendshape deformer is indicated with terms such as "morph", "morphing" or
"shape interp".
2.3.1 Target and base objects
When creating a blendshape deformer, it is necessary to identify one or more
objects whose shapes will be used to deform the shape of another object. These
objects are called target objects, and the object being deformed is called the
base object.
2.3.2 Target shapes, base shapes, and blend shapes
The shapes of the target objects are called target shapes, or target object shapes. The
base object’s resulting deformed shape is called the blend shape, whereas its original
shape is called the base shape, or base object shape.
2.3.3 Targets
A blendshape deformer includes a keyable attribute (channel) for evaluating each
target object shape's influence on the base object's shape. These attributes are called
targets, though by default they are named after the various target objects. Each target
specifies the influence, or weight, of a given shape independently of the other targets.
Depending on how the blendshape deformer is created or edited, however, a target can
represent the influence of a series of target object shapes instead of just one shape.
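The way targets combine can be pictured as the base shape plus a weighted sum of each target's offset from it. The following is a minimal illustrative sketch in plain Python (not Maya API code), with hypothetical vertex lists standing in for real geometry:

```python
# Illustrative sketch of blend shape evaluation: the deformed result is the
# base shape plus each target's delta from the base, scaled by that target's
# weight (the keyable attribute described above).
def blend_shape(base, targets, weights):
    """base: list of (x, y, z) vertices; targets: vertex lists with the same
    topology as base; weights: one influence value per target."""
    result = []
    for i, (bx, by, bz) in enumerate(base):
        x, y, z = bx, by, bz
        for target, w in zip(targets, weights):
            tx, ty, tz = target[i]
            # each target contributes only its offset from the base
            x += w * (tx - bx)
            y += w * (ty - by)
            z += w * (tz - bz)
        result.append((x, y, z))
    return result
```

With all weights at 0 the base shape is returned unchanged; with one weight at 1 the base fully assumes that target's shape, which matches the behavior of the targets described above.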
2.3.4 Keyframing
Setting keys is the process of creating the keys that specify timing and
motion. Animation is the process of creating and editing the properties of
objects that change over time. Keys are arbitrary markers that designate the
property values of an object at a particular time. Once an object to be
animated is created, it is necessary to set keys that record when the
attributes of that object change during the animation. Setting a key involves
moving to the time where a value for an attribute is to be established, setting
that value, then placing a key there. In effect, it is just like recording a
snapshot of the attribute at that time.
2.3.5 Using keyframing animation
Keyframe animation creates actions from keys set on attributes at various times
(or frames). A key specifies the value of an attribute at a particular time.
Maya interpolates how the attribute changes its value from one key to the next.
Each key specifies a defining characteristic of the action. To better
understand the concept of interpolation, consider the following example.
Imagine a sphere modeled as a ball and suppose its translate attribute on the y
axis is 0 at frame 0, while at frame 10 the same attribute is fixed at a value
of 20; playing the animation, Maya will interpolate the values of this
attribute, moving the ball upward.
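The interpolation Maya performs between keys can be pictured, in its simplest linear form, with a small plain-Python sketch. This is only an illustration of the concept: Maya actually supports several interpolation types, controlled by tangents, and this sketch covers just the linear case.

```python
# Illustrative sketch: linear interpolation between animation keys.
# keys is a sorted list of (frame, value) pairs, e.g. translateY keys.
def interpolate(keys, frame):
    f0, v0 = keys[0]
    if frame <= f0:
        return v0              # before the first key: hold its value
    for f1, v1 in keys[1:]:
        if frame <= f1:
            t = (frame - f0) / (f1 - f0)   # fraction between the two keys
            return v0 + t * (v1 - v0)
        f0, v0 = f1, v1
    return v0                  # after the last key: hold its value

# The ball example above: translateY is 0 at frame 0 and 20 at frame 10.
ball_keys = [(0, 0.0), (10, 20.0)]
print(interpolate(ball_keys, 5))   # halfway through the motion: 10.0
```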
Chapter 3
Lip Synch issues
Without a well-defined methodology, it is very difficult for an animator to
succeed in creating the exact lip movement of a 3D model synchronized with the
recorded audio. In fact, the only strategy he could follow is simply to sit in
front of a mirror and analyze the lip positions while speaking, in order to
connect them into the complex mouth movement. This sort of approach to the lip
synch problem demands a great deal of work from the animator, yields results
strongly affected by the observation capabilities of the artist, and wastes a
lot of time.
For these reasons it was necessary to focus on a precise methodology of lip
synching. Thus a long research was made into how written words correspond to a
sequence of phonemes and how each phoneme is always expressed with the same
mouth shape and tongue position, identifying in this way the fundamental steps
of the lip synch process.
3.1 Lip synch steps
First, a library of character model variations is built once and for all,
including the basic mouth shapes necessary for speech (phonemes) and expressive
variations such as lifted brows, an angry scowl, or a grimace. This is usually
done in the modelling portion of the 3D package.
The next step is to break down the recorded dialog, which is the process of
translating what is heard in the dialog track into a list of facial shapes
that, when run in sequence, will create the illusion that the character is
producing the recorded sound. The exact facial shapes and the keyframe numbers
they will occupy are entered into a timing chart.
Finally, the facial shapes built in the first step are arranged according to
the sequence listed in the timing chart.
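A timing chart can be pictured as a simple list of keyframe/shape pairs. The sketch below is purely illustrative (the frame numbers and viseme names are hypothetical), using the word "food", whose phonetic breakdown is discussed in Section 3.2:

```python
# Hypothetical timing chart: (keyframe, viseme) pairs for the word "food",
# phonetically "F-UH-D", ending on a neutral rest pose.
timing_chart = [
    (1, "F"),      # lower lip against the upper teeth
    (4, "UH"),     # rounded lips for the "oo" sound
    (8, "D"),      # tongue against the palate
    (12, "rest"),  # mouth returns to the neutral pose
]

def shape_at(chart, frame):
    """Return the facial shape active at a given frame: the shape of the
    most recent key at or before that frame."""
    current = chart[0][1]
    for key_frame, viseme in chart:
        if key_frame <= frame:
            current = viseme
    return current
```

Arranging the facial shapes then means stepping through the chart in order and setting a key for each listed shape at its listed frame.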
While it may appear to be a mechanical process, there is a great deal of
creativity in deciding how the character's face will transform throughout the
animation. Lip synch is a part of acting, so the personality of the character
shapes the message being delivered. The spoken words can take on several
meanings depending on the nuances given to the character, such as eye movements
and facial expressions.
The first step in the lip synch process is to understand the foundation, i.e.
phonemes. Phonemes are the most misunderstood aspect of facial animation. Much
of the information available to animators is not correct, making it challenging
to create really dynamic lip synch animation.
3.2 Getting to know phonemes
A phoneme is the smallest part of a grammatical system that distinguishes one
utterance from another in a language or dialect. Basically, it is the sound we
hear in speech patterns. In phonetic speech, words are created by combining
phonemes, rather than the actual letters in the word. For example, in the word
"food", the "oo" sound would be represented by the "UH" phoneme. The phonetic
spelling of the word would be "F-UH-D". It looks a bit odd, but phonemes are
the backbone of speech and therefore paramount to the success of lip synch.
When creating a lip synch animation, the facial movements of the 3D character
are synched to the recorded dialog. When the phonemes are spoken, the mouth
changes shape to form the sound being spoken.
As a matter of fact, phonemes do not actually look like the printed word, but
when spoken they sound identical. The rule for determining whether a unit of
speech is a phoneme is that replacing it in a word results in a change of
meaning. For example, "pin" becomes "bin" when the "p" is replaced, therefore
the "p" is a phoneme.
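The replacement test just described can be sketched in a few lines of Python; the tiny lexicon below is a hypothetical stand-in for a real pronunciation dictionary:

```python
# Illustrative sketch of the minimal-pair test: a unit of speech is a phoneme
# if replacing it in a word produces a different valid word (a change of
# meaning). The lexicon here is a toy example, not real linguistic data.
LEXICON = {"pin", "bin", "tin", "fin"}

def is_phoneme(word, position, replacement, lexicon):
    """True if swapping the unit at `position` yields another word in the
    lexicon, i.e. the swap changes the meaning."""
    candidate = word[:position] + replacement + word[position + 1:]
    return candidate != word and candidate in lexicon
```

With this lexicon, replacing the "p" of "pin" with "b" gives "bin", a different word, so "p" passes the test.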
3.3 Getting to know visual phonemes
Visual phonemes are the mouth positions that represent the sounds heard in
speech. These are the building blocks for lip synch animation. When creating a
3D character lip synch, the starting point is modelling the phonemes. The
important thing to identify is how many visual phonemes there really are. The
common myth is that there are ten visual phonemes. While ten can be used to
create adequate lip synch animation, there are actually sixteen visual
phonemes.
Each visual phoneme is associated with the audible phonemes that require that
specific mouth position; even if some look very similar, a closer inspection
shows that the tongue is in different positions.
The tongue may seem an insignificant element in lip synch, but it is very
important for creating truly realistic dialog. If the tongue movement is
unnatural because too few visual phonemes were used, the animation will look
unrealistic.
A larger number of visual phonemes guarantees more detailed lip movement,
because more phonemes have their own corresponding mouth position; thus the
same viseme, covering fewer phonemes, is less often used more than once within
the same word, avoiding the lip-vibration effect that this may cause.
The choice of using more or fewer visemes, i.e. a schema based on ten or on
sixteen visemes, is therefore very relevant to the degree of realism the
animation aims to achieve. Generally, a small number of visual phonemes is
preferred when the character does not appear in the foreground during the
animation, while more accuracy is necessary for important characters making
long speeches that attract the viewer's attention.
Creating sixteen visual phonemes instead of ten requires a big modeling effort,
not only to generate a greater number of mouth positions, but also because of
the greater precision required.
Visual phonemes are listed below in Figure 3.1, Figure 3.2 and Figure 3.3,
divided into the sixteen- and ten-viseme schemas.
Figure 3.1: First 8 visual phonemes of the 16-viseme list
Figure 3.2: Last 8 visual phonemes of the 16-viseme list
Figure 3.3: Short list of visual phonemes in the 10-viseme schema
In the ten-element list, several of the visual phonemes with similar exterior
appearances have been combined. Of course, the subtle tongue movement will not
be accurate, but that is not always important. There are times when it is
necessary to shorten the visual phoneme list to expedite the editing process;
however, for someone who truly understands phonemes, using the long list does
not take any longer: it only makes the modelling phase a little longer, not the
lip synch process. In fact, once the spoken word is translated into its phoneme
sequence, it is just a matter of selecting the corresponding visual phonemes
according to the adopted schema, without particular calculations.
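Choosing a schema then amounts to a simple lookup from phonemes to visemes. The sketch below illustrates the idea of collapsing a detailed inventory into a shorter one; the mapping itself is hypothetical, the thesis's actual 16- and 10-viseme tables being those of Figures 3.1 to 3.3:

```python
# Illustrative sketch: collapsing a detailed viseme inventory into a shorter
# schema by merging shapes with similar exterior appearance. The entries are
# hypothetical examples, not the thesis's real tables.
COLLAPSE = {
    "F": "F", "V": "F",             # lip-against-teeth shapes merged
    "D": "D", "T": "D", "L": "D",   # tongue-to-palate shapes merged
    "UH": "UH",                     # rounded-lip shape kept as is
}

def to_short_schema(visemes):
    """Map a detailed viseme sequence onto the collapsed schema; unknown
    visemes pass through unchanged."""
    return [COLLAPSE.get(v, v) for v in visemes]
```

Since the collapse happens only at this lookup stage, the breakdown of the dialog itself is identical whichever schema is adopted, which is why the long list costs extra modelling time but no extra lip synch time.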
3.4 The lip synch rules
Creating the illusion that a character is actually speaking the dialog is
challenging, but following some rules of lip synch animation can help.
3.4.1 Record the dialog first
There are two reasons to record the dialog before animating:
• It is far easier to match a character's facial expression to the dialog than
it is to find voice talent that can accurately dub an existing animation.
This avoids spending hours in a recording studio trying to match words
to a preexisting animation.
• The recorded dialog will help determine where the keyframes go in the
animation. Suitable sound editing software will even give a visual
representation of the dialog's shape, making it easier to break down
audio tracks for lip synch animation.
3.4.2 Never animate behind synch
There are occasions when a lip synch will work better if it is actually one or two
frames ahead of the dialog, but you should never try to synch behind the dialog. It is
best to start by animating exactly on synch. Then, if necessary, parts of the animation
can always be moved a frame forward to see if they work better.
3.4.3 Don’t exaggerate
This is another important rule that is often overlooked. The actual range of mouth
movement is fairly limited, and overly pronounced poses will look forced and unnatural.
It is far better to underplay than to overdo. Americans naturally talk in an almost
abbreviated manner, and it is easy to notice that the mouth does not open very much at
all during speech, so the visual phonemes should not be exaggerated.
3.4.4 Rules were made to be broken
Many consonants, and occasionally vowels, are actually pronounced in the transition
between the preceding and the following sounds. As a result, the full pose for that
sound never occurs. Also, as mentioned above, Americans have a habit of abbreviating
their speech. Depending on local dialect, syllables are often slurred or deleted entirely.
So it is fundamental to pay attention to the character's pronunciation and animate it as
the dialog dictates.
The goal is to make the movements of the mouth appear natural and lifelike. If it is
necessary to skip a consonant to maintain the tempo and avoid contorting the mouth, it
is far better to do so than to force the mouth into unnatural poses or lose the
rhythm of the dialog.
3.5 The lip synch process
Using the phoneme chart to translate what is heard in the dialog into audible
phonemes, and then translating those phonemes into a predetermined set of visual
phonemes: these are the fundamental steps of the lip synch process.
The steps of animating lip synch are:
• Break down the speech pattern.
• Analyze the audible dialog and enter the phonemes into the timing
chart.
• Use the timing chart to set the keyframes in the animation.
• Test the animation for synching and tweak the animation where
necessary.
3.5.1 Break down the speech pattern
The first step in lip synch is to determine the speech pattern of the dialog.
Because of the abbreviations and dialect influences in everyday speech, it is often
not possible to properly assign phonemes to a dialog without first converting it into
a rough phonetic translation. This does not refer to the concept of phonemes; it
rather means a transcription of how the dialog sounds, using the normal alphabet.
This is an important element, since it is necessary to assign phonemes to the actual
heard sound, not the seen text. For example, the word "every" is often spoken in a
contracted manner, so when listening to the recorded audio it is important to notice
that and consider the spoken word in its contracted form, "evry". The next chapter
will describe how these matters are handled in Speak U2.
After this procedure of understanding what is really spoken, it is always necessary
to phonetically translate the text before it is actually possible to start assigning
the phonemes.
3.5.2 Analyze the audible dialog to determine phonemes
The first thing to do is to load the audio file into a sound-editor program; any will
do as long as it allows identifying the actual times when sounds occur and relating
them to frames. Another important feature is a scrub tool that lets you drag the audio
forward and backward in order to obtain greater precision.
After loading the audio file, the usual lip synch methodology suggests listening to the
dialog and writing down the phonemes; but, as will be described in the next chapters,
the strategy followed here is to write down the real spoken words, with their
abbreviations or contractions, and not the phonemes themselves, because the algorithm
will perform that translation. After this, by scrubbing the sound back and forth it is
possible to determine the exact location of each sound in the audio file. An audio
tool is thus useful because it also provides a visual representation of the sound,
which helps to determine the points where words, and so phonemes, are recorded.
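Relating the times read in the sound editor to animation frames is a simple conversion at the scene's frame rate. A minimal illustrative helper, not part of the thesis code (the frame rates in the comment are only examples):

```cpp
#include <cassert>
#include <cmath>

// Illustrative helper: convert a time measured in the audio editor to the
// nearest animation frame at a given frame rate (e.g. 24 or 25 fps).
int timeToFrame(double seconds, double fps) {
    return (int)std::lround(seconds * fps);
}
```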
3.5.3 Use timing chart to set keyframes
Once the sounds have been translated into their visual phonemes and a frame
identified for each, it is necessary to morph the visual phonemes from one to another
at the appropriate frames in the animation.
There are two types of morphing, straight and weighted. Straight morphing simply
morphs the object in a linear progression from one object to another. The morph can
take any value from 0 to 100%. The only issue is that it is limited to a single morph
object.
On the other hand, weighted morphing allows blending multiple objects in a single
morph. This is very useful when adding facial expressions and emotions to the lip
synch animation. It is possible to add not only subtle eye blinks but also complete
changes in the character's personality through facial expressions. Of course, this
greatly complicates the animation process, but the end result is well worth the effort.
3.5.4 Tweak the finished animation
Lip-synching is not an exact science and it will almost always require a little
tweaking. This may involve altering certain poses or tinkering with the timing. In
some cases, the synch may appear more realistic if it is a frame or so ahead of the
dialog, so that the mouth is moving as the viewer's brain is processing the sound
rather than when the ear is receiving it. These are subtle things, but lip-synching is
an exercise in subtlety. If the aim is to make the audience suspend disbelief and
accept the character as a living being, it is important not to stop until it actually
seems as if the character is speaking the dialog.
Speaking of tweaking, there are times when it will be necessary to drop a phoneme
to make the animation flow smoothly. Not all letters are pronounced in normal speech,
particularly if an accent is present.
3.5.5 Phoneme dropping guidelines
After obtaining the animation, it is often necessary to drop some phonemes, but
there are some rules to be observed.
Never drop a phoneme at the beginning of a word. It is possible to drop a consonant
at the end of a word but never at the beginning, because that would change the visual
phonetic pronunciation of the word; a consonant at the end of a word, by contrast,
can be dropped without much impact.
Another rule is to drop nasal visual phonemes to smooth transitions. This is an
important issue when animating. The most troublesome nasal phoneme is the "M", since
it requires a closed mouth, which can be a problem given that in reality the mouth
movement is nearly undetectable. It is simply necessary to go through the timing
chart, identify the nasal phonemes, and delete them where their location qualifies.
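These guidelines can be summarized as a small predicate. The following C++ sketch is a hypothetical encoding of the rules above, treating a word as a sequence of phoneme strings; it is not part of Speak U2:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical encoding of the phoneme-dropping guidelines (illustrative):
// never drop the first phoneme of a word, final consonants may be dropped,
// and mid-word the nasals are the preferred candidates.
bool isNasal(const std::string& p) {
    return p == "m" || p == "n" || p == "NG";
}

bool mayDrop(const std::vector<std::string>& word, size_t index) {
    if (index == 0) return false;               // never drop at word start
    if (index == word.size() - 1) return true;  // word-final phonemes may go
    return isNasal(word[index]);                // mid-word: nasals only
}
```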
3.6 Thinking as an animator
Approaching the goal of creating a lip synch plug-in, it is very important above all
to understand who will use it. Speak U2, being a plug-in designed to help in
synching a facial animation, obviously addresses the 3D animator.
What does an animator really want to know, and what does he not need to care
about, when using a piece of software? It is important to consider that an animator is
not a programmer, so he is not familiar with anything that is not a 3D animation
graphics concept, and he really just wants to work with an easy user interface. It is
important to understand what can frustrate the user of a piece of software. For
example, the possibility of asking the animator to collect information himself by
editing a text file, inserting data between tags, was discarded. The concept of a tag,
as in HTML, is a very simple one, but it is just something that could be very annoying
for an animator; moreover, he would have to spend time learning how to build the right
text file on which to apply the algorithm. To solve this problem, the right approach
was to imagine not being a computer engineer, but to put myself on the other side of
the problem. This was an interesting challenge to focus on. As shown in the fourth
chapter, where there is a complete overview of Speak U2, it was necessary to make this
step transparent for the animator, who creates the tagged file through the interface,
simply filling in some fields and pressing some buttons. In this manner, the artist
does not have to edit a file or care about the format of the file itself, and any
changes he wants to apply to a preexisting file are made through the same interface.
The user interface is the means through which the animator will work, so it is
fundamental to give much importance and attention to building it. A 3D artist really
needs a very simple interface, with a workflow clear enough that reading the
instruction guide should not really be necessary, even though one is provided as
online documentation. In that way the animator needs to know just what to do, which
steps he has to follow to reach the result, and how to modify it.
Moreover, we have to consider that a lip synch plug-in is based on particular and
difficult concepts such as visemes and phonemes, which need not burden the animator or
the modeler: once again, they have to know only what they really need. The modeler has
to take care of the sixteen or ten mouth positions in order to create the visual
phonemes that will be mapped to phonemes to create the animation. The animator, on the
other side, does not have to worry about visual phonemes; he only needs to know which
phonemes are associated with each viseme in order to make the map between them. In
the end, neither the animator nor the modeler really needs to know what a phoneme is
or how a written or spoken word translates into a sequence of phonemes.
The figure below represents the workflow and roles during the process.
Figure 3.4 : workflow and roles diagram
[Diagram: within Maya, the modeler models the 16 or 10 mouth positions corresponding
to the 16 or 10 phoneme groups; within audio software, the animator listens to the
recorded audio, writing down the spoken words and their times; within Speak U2, the
animator maps visemes to phonemes, creating the correspondence used by the algorithm,
then inserts the data and executes the algorithm, obtaining the keyframed scene.]
3.7 Text to phoneme
Approaching the project of programming a lip synch plug-in, the need for a text-to-
phoneme algorithm became immediately evident. At first the idea was simply to build a
mouth shape for each alphabetic letter and map the written text directly onto them.
This concept is not as wrong as it seems, but it presents many concrete problems. In
fact, the thought of making something correspond to a mouth shape in order to obtain
lip synchronization is the right one, but that something is not the single letter. It
is evident that making the mouth shape change at the pronunciation of every letter
would create an extremely mechanical lip synch, like that of a robot.
A translation between the written word and its sequence of sounds thus becomes vital
for lip synching in a realistic manner. Analyzing the movement of the mouth in daily
human speech, it becomes clear that the lips move following the pronounced sounds
rather than the written text of the spoken word, and this observation implies the need
to base the lip synch plug-in on an algorithm of translation from text to sound, that
is, from text to phonemes.
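As a toy illustration of rule-based text-to-phoneme translation, the C++ sketch below applies two invented letter-context rules ("ph" sounds like "f", "ee" like the phoneme IY); the actual English rule set used by Speak U2 is far larger, and these rules are examples only:

```cpp
#include <cassert>
#include <string>

// Toy illustration of rule-based text-to-phoneme translation. The two rules
// are invented for the example; real rule sets handle many more contexts.
// Multi-letter rules are tried before falling back to the single letter.
std::string toyTranslate(const std::string& word) {
    std::string out;
    for (size_t i = 0; i < word.size(); ) {
        if (word.compare(i, 2, "ph") == 0)      { out += "f";  i += 2; }
        else if (word.compare(i, 2, "ee") == 0) { out += "IY"; i += 2; }
        else                                    { out += word[i]; i += 1; }
    }
    return out;
}
```

The point of the sketch is that the translation depends on letter context, not on a one-to-one letter mapping, which is exactly why a per-letter mouth shape cannot work.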
3.7.1 Focusing on language
Given the need to implement such an algorithm, it was necessary to take a step back
and decide which language the Speak U2 plug-in should support. Even though Speak U2
will probably be used only for Italian productions and animations, the choice fell
upon the English language because of some important considerations characteristic of
the film-making environment.
First of all, there was a strong preference for developing a lip synch plug-in based
on the English language, to allow the possibility of reaching foreign markets. The
level of Italian computer graphics is so much lower than in other nations that it is
very challenging to introduce our products into foreign contexts, and usually the best
Italian artists, with much experience and capability, have to emigrate to other
countries to fulfill their potential. The American attitude, especially, is to seek
out talented artists in other countries and keep them within their productions, backed
by powerful money and Hollywood prestige.
Unfortunately, there is no real future in Italy for computer graphics artists, so it
was really unthinkable to develop a plug-in for Italian lip synch, even if at this
time it could not be said whether Speak U2 would become known in other countries.
Second, there was a practical reason due to film-making conventions. Studying film
production, it is common to notice that films are made with English-speaking actors,
and that is true even in computer graphics animation, which in this era is growing in
complexity and duration. A significant example of this is Ridley Scott's American film
Hannibal, where Italian actors such as Giancarlo Giannini and Francesca Neri had to
speak English during the takes. Only afterwards is the film dubbed into other
languages, and paradoxically even the Italian actors are dubbed again.
Moreover, the Speak U2 user interface is logically in English simply because it
grows up within an entirely English software package. Giving it an Italian interface
would be in contrast with the philosophy of integration with Maya; on the other hand,
it would be a paradox to give it an English structure while lip synching the Italian
language only, above all because in that case probably only Italian animators would
use it.
For all these reasons, Speak U2 is based on English grammar. This choice imposed
spending a long time understanding the language in order to create a text-to-phoneme
translation algorithm valid for any word. An always-working algorithm was preferred to
a dictionary of words with their corresponding phoneme sequences, because the latter
would require periodic updates and longer CPU time, depending on the search algorithm
implemented.
3.8 Synchronization and artistic freedom
Once the algorithm has generated the phoneme sequences, these must be set at the
right times within the range of the word's pronunciation. This is not as secondary a
job as it could seem, because even if phonemes and visemes are in the right sequence,
if they are not mapped to the right frames the lips will move out of synch with the
recorded audio. Moreover, some phonemes may persist a little longer than others, so
this final step in lip synch is as important as the previous ones. It should also be
considered that an algorithm based on precise calculation can never fully reflect the
complexity and randomness of human speech. The purpose is to obtain near-perfect
synchronization between lip movement and audio with no action other than entering the
spoken word and its start and end times in the user interface. This allows obtaining a
good result with little time and effort, without affecting the freedom and creativity
of the artist, who can later characterize his speaking creature in more detail.
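As a sketch of this final placement step, the helper below spaces n phoneme keys evenly between the start and end frames entered for a word. The even spacing is an illustrative assumption; as noted above, some phonemes persist longer than others, so the real algorithm may weight the durations differently:

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch only: place n phoneme keys evenly between the start
// and end frames entered for a word. Speak U2's actual spacing may differ.
std::vector<int> placeKeys(int startFrame, int endFrame, int n) {
    std::vector<int> frames;
    if (n == 1) { frames.push_back(startFrame); return frames; }
    double step = double(endFrame - startFrame) / (n - 1);
    for (int i = 0; i < n; ++i)
        frames.push_back(startFrame + (int)(i * step + 0.5)); // round to frame
    return frames;
}
```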
Chapter 4
Solutions
The analysis of the lip synch methodology was fundamental to understanding the
workflow needed to achieve this kind of animation. The lip synch process can be
divided into a few fundamental steps. Leaving aside the modeling phase of the 16 or 10
visemes, because it is not really part of the animation process, being an exercise in
modeling, and thus cannot be considered within Speak U2, there are three basic steps.
The first step consists of creating a correspondence between phonemes and
visemes, while the second step covers the phase of entering data concerning the speech
to be synchronized.
The third step is surely the most complex one in its implementation, yet the
easiest one for the animator, being totally automatic: it executes the text-to-phoneme
translation algorithm and sets the blendshape keys at the right times and values.
Even if these are the fundamental steps of lip synching, Speak U2 considers the
workflow in its totality, so it includes two more steps, preview and retouching, even
though they are only logical phases and do not directly concern the animation process.
Chapter 4 - Solutions
34
4.1 Interface realization
In Speak U2, great attention is paid to making the whole lip synch process as easy
as possible, and the user interface reflects this aim. The interface, divided into
five steps, each covered by a specific tab containing all the controls for that
purpose, follows the structure of the workflow.
It was built this way in order to keep clear to the animator at which point of the
process he is and what he has to do.
4.1.1 First step: creating data structure
As illustrated in more depth in the next chapter, dedicated to the user interface and
software description, the goal of the first step is to collect fundamental information
on the scene and to initialise the variables used by the text-to-phoneme translation
algorithm.
Created using MEL, this first phase of the lip synch process lets the user choose
between the schemas based on ten or sixteen visual phonemes; according to this
selection, only one of the two global variables containing the lists of phonemes is
used by the actual algorithm, and its content is displayed in a scrolling list.
The code below presents the two global variables, where it is possible to see the
phonemes grouped by the visual phoneme that covers them.
global string $su2Phonemes[] = {
"m,b,p",
"n,l,t,d",
"f,v",
"TH,DH",
"k,g,NG",
"SH,ZH,CH,j",
"y,OY,YU,w,WW,UH,ER,r",
"IH,EY,EH,AH,AY,AW,AE,AN,h,HH,s,z",
"AA,AO,OW,UW,AX",
"IY"
};
global string $su2Phonemes2[] = {
"m,p,b",
"n",
"l,t,d",
"f,v",
"TH,DH",
"k,g,NG",
"SH,ZH,CH",
"j",
"y",
"WW,UH,ER",
"r",
"IH,EY,EH,AH,AY,AW,AE,AN,h,HH",
"s,z",
"AA,AO",
"OW,UW,AX,OY,YU,w",
"IY"
};
Another scrolling list shows the visual phonemes present in the scene according to
the selected blendshape, making it possible to create a correspondence with the
phonemes. This map is obtained by a procedure call that simply saves the name of the
viseme in an array at the same index as in the phonemes' array. Keeping phonemes and
the corresponding visual phonemes at the same positions in the two arrays makes it
easy to find them without errors.
The following is the code generating the map.
global proc su2Map()
{
    global string $viseme[];
    int $selectedPhonIndex[] = `textScrollList -q -selectIndexedItem phonemeList`;
    string $vis[] = `textScrollList -q -selectItem mapToList`;
    // store the viseme name at the same index as the selected phoneme group
    $viseme[$selectedPhonIndex[0] - 1] = $vis[0];
    print $viseme;
}
The figure below shows a conceptual schema of the workflow for the first step.
[Diagram: choose between the use of 16 or 10 visemes; from the list of blendshapes
present in the scene, select the one corresponding to the visemes; the list of
attributes associated with the viseme blendshape and the list of phonemes, divided
into 16 or 10 groups according to the selection, feed the mapping procedure call.]
Figure 4.1 : First step workflow
4.1.2 Second and third tab: collecting information
The second step focuses on the analysis of the audio track to collect information
on the speech. This is an important point in the algorithmic solution devised for
Speak U2. In fact, asking the user to write down the phoneme sequence of the speech
was considered impracticable, above all for an animator. Understanding phonemes while
listening to recorded audio is not as simple a task as it might seem; moreover, as
said before, the animator, or a 3D artist in general, does not have to know concepts
such as phonemes or English grammar.
Chapter 4 - Solutions
38
For these reasons, the idea was to write down not the phonemes but the real spoken
words, which are immediately understandable; moreover, a script is always provided
with the audio. What the animator really has to do is write down the times, in frame
units, at which the word's pronunciation begins and ends. These are some of the data
the algorithm needs to work on, and they are all entered in this step by filling
dedicated fields after listening to the audio track and scrubbing along it, using
tools present in the Speak U2 interface or within external dedicated audio software.
All the information collected in this phase, which concerns a single word, is saved
and added to a data structure containing information about the entire speech and
displayed in the part of the interface covering the third step. The save procedure
also allows modifying previously saved information on a word. The code below
represents the "save" function:
global proc su2CollectInfo()
{
    string $sWord = `textFieldButtonGrp -q -text startWord`;
    string $eWord = `textFieldButtonGrp -q -text endWord`;
    string $spWord = `textFieldGrp -q -text spoken`;
    string $emotion = `optionMenuGrp -q -value emotionGrp`;
    string $total = $spWord + "\\" + $sWord + "\\" + $eWord + "\\" + $emotion;
    int $many = `textScrollList -q -numberOfSelectedItems totalList`;
    if ( $many == 0)
        textScrollList -edit -append $total totalList;
    else
    {
        int $pos[] = `textScrollList -q -selectIndexedItem totalList`;
        textScrollList -edit -removeIndexedItem $pos[0] totalList;
        textScrollList -edit -appendPosition $pos[0] $total -deselectAll totalList;
    }
    textFieldGrp -edit -text "" spoken;
}
$sWord and $eWord are the variables used to store the start and end times of the
word's pronunciation, while the spoken word is saved in the $spWord variable.
$emotion is meant to hold the emotion affecting the pronunciation; it is a variable
that at this time is not used by the algorithm but will probably be fundamental in
further developments of the plug-in. These items are grouped in a string variable,
separated by a backslash (the MEL escape "\\"), and then displayed in the third tab's
scrolling list. The $many variable checks whether any line of the list collecting
information on the entire speech is selected, and thus the if test decides whether
the data are to be appended at the end of the list or whether they refer to an
existing line that has to be edited and overwritten.
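When the plug-in later reads these lines back, each record has to be split into its four fields again. The C++ sketch below is illustrative rather than the plug-in's actual parsing code; it assumes the separator is the single backslash character stored by the MEL escape "\\" in su2CollectInfo:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Illustrative sketch: split one saved record back into its four fields
// (spoken word, start frame, end frame, emotion), separated by backslashes.
std::vector<std::string> splitRecord(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, '\\'))   // '\\' is a single backslash
        fields.push_back(field);
    return fields;
}
```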
Generally, an audio track for a computer graphics animation can be very long, and it
would be very difficult to complete the analysis of the recorded speech in a single
session. For this reason, it was necessary to include in the user interface covering
the third step a save and a reload function. This job is performed by two procedures
that write to and read from text files, respectively. The saving procedure is the more
complex one because of some annoying characteristics of Maya itself. The code below
represents this function:
global proc su2WriteFile()
{
    global string $txtFile;
    int $fileId;
    string $temp[];
    string $sceneN[];
    string $wrongDir;
    string $correctDir;
    int $j;

    // convert the backslash-separated scene path into a slash-separated one
    $wrongDir = `file -q -sceneName`;
    tokenize ( $wrongDir, "\\", $temp);
    $correctDir = $temp[0];
    for ( $j = 1; $j < size( $temp); $j++)
        $correctDir = $correctDir + "/" + $temp[$j];

    // replace the scene extension (".mb") with ".txt"
    tokenize ( $correctDir, ".", $sceneN);
    $txtFile = ( $sceneN[0] + ".txt");

    $fileId = `fopen $txtFile "w"`;
    string $contents[] = `textScrollList -q -allItems totalList`;
    string $riga;
    string $riga1;
    int $z;
    $riga1 = $contents[0];
    fprint $fileId $riga1;
    for ( $z = 1; $z < size( $contents); $z++)
    {
        $riga = ( "\n" + $contents[$z]);
        fprint $fileId $riga;
    }
    fclose $fileId;
    button -edit -enable true okBtn;
}
The text file generated by the saving function has the same name as the scene the
animator is working on and is stored in the same directory. Unfortunately, the
Alias|Wavefront Maya scene-name query can return directory paths using backslashes,
while MEL commands expect the generally used slashes; this characteristic imposed the
creation of a converting function.
To achieve this result, the $wrongDir variable stores the queried name of the opened
scene, and through tokenizing operations the "\" characters are substituted with "/",
while the ".mb" extension is converted into ".txt". The new path is stored in the
$txtFile variable, which is global because it has to be passed as an argument to the
C++ compiled plug-in. The $contents array stores the list of all the items of the
speech analysis, which are written into the text file one per line.
The load procedure simply performs the inverse process on the text file:
global proc su2LoadFile()
{
    string $works = `workspace -q -fullName`;
    setWorkingDirectory $works "" "";
    fileBrowserDialog -mode 0 -fileCommand "su2UpdateScrollList"
        -fileType "text" -actionName "Load previous saved speech analysis"
        -operationMode "Import";
}
and then updates the list through the su2UpdateScrollList procedure call:
global proc su2UpdateScrollList( string $filename, string $fileType )
{
    int $fileId2;
    $fileId2 = `fopen $filename "r"`;
    string $line;
    $line = `fgetline $fileId2`;
    textScrollList -edit -removeAll totalList;
    while ( size( $line ) > 0 )
    {
        if ( $line != "\n")
            textScrollList -edit -append $line totalList;
        $line = `fgetline $fileId2`;
    }
    fclose $fileId2;
}
The variables $filename and $fileType are default arguments of the
fileBrowserDialog command. Figure 4.2 shows a conceptual schema of the workflow for
the second and third steps.
[Diagram. Second step: listen to the recorded audio; enter the times and the spoken
word; save the information on the current word. Third step: list of the information
collected on all the words analyzed so far; load the list from a txt file; save the
list to a txt file; edit the value of a selected line; start the algorithm execution.]
Figure 4.2 : workflow schema of steps two and three
4.2 Interaction between MEL and API
The third step also provides the final call to the plug-in command, which processes
all the data and sets the keyframes for the animation. The code below belongs to the
MEL procedure associated with this call:
global proc su2DoSynch()
{
global string $su2Phonemes[];
global string $su2Phonemes2[];
global string $viseme[];
global string $txtFile;
global string $parameter;
global string $numvis;
global int $dominance;
string $par1, $par2, $par3, $par4, $par5, $par6, $par7, $par8,
$par9, $par10;
string $par11, $par12, $par13, $par14, $par15, $par16, $par17,
$par18, $par19, $par20;
string $par21, $par22, $par23, $par24, $par25, $par26, $par27,
$par28, $par29, $par30, $par31, $par32;
if ( $numvis == "dieci")
{
$par1 = $su2Phonemes[0];
$par2 = $su2Phonemes[1];
$par3 = $su2Phonemes[2];
$par4 = $su2Phonemes[3];
$par5 = $su2Phonemes[4];
$par6 = $su2Phonemes[5];
$par7 = $su2Phonemes[6];
$par8 = $su2Phonemes[7];
$par9 = $su2Phonemes[8];
$par10 = $su2Phonemes[9];
$par11 = $viseme[0];
$par12 = $viseme[1];
$par13 = $viseme[2];
$par14 = $viseme[3];
$par15 = $viseme[4];
$par16 = $viseme[5];
$par17 = $viseme[6];
$par18 = $viseme[7];
$par19 = $viseme[8];
$par20 = $viseme[9];
su2LipSynch $txtFile $numvis $par1 $par2 $par3 $par4 $par5
$par6 $par7 $par8 $par9 $par10 $par11 $par12 $par13 $par14
$par15 $par16 $par17 $par18 $par19 $par20 $parameter
$dominance;
}
else
[.....]
}
As the code shows, many variables are global strings because their values are
obtained by other procedures that query them through the user interface. The actual
call to the plug-in command is "su2LipSynch", followed by all the needed arguments,
as explained in the table below:
Variable name    Meaning
$txtFile         Name of the file storing all data about the analyzed speech
$numvis          Number of visemes to be used, 16 or 10
$par             Element of the phoneme and viseme arrays
$parameter       Name of the blendshape of the visual phonemes
$dominance       Value at which to set the keys
Table 4.1 : Variable meanings
This procedure may seem implemented in a terrible way, but unfortunately it has to
be considered that MEL is just a scripting language and, above all, that the Maya API
classes are really poor at parsing arguments. In fact, there is no way to pass a
string array argument directly to the C++ code. This unexpected problem was solved by
passing each element of the array as a single argument, assigning it a new $par
variable, and rebuilding the array within the C++ code. Of course, it was not strictly
necessary to use so many variables, but this choice was made to have a clearer
correspondence between the MEL and the API code. The code below shows the C++
argument-parsing code, dual to the MEL code that passes the arguments:
[…]
par_file = args.asString( 0 );
fname = par_file.asChar();
fp1 = fopen( fname, "r");
par1 = args.asString( 1 );
strcpy( based , par1.asChar());
if ( strcmp( based, "dieci") == 0)
{
for (i = 2; i <= 11; i++)
{
par1 = args.asString( i );
p = i -2;
strcpy( parPhoneme[p], par1.asChar());
}
for ( i = 12; i <= 21; i++)
{
par1 = args.asString( i );
p = i - 12;
strcpy( parViseme[p], par1.asChar());
}
[...]
par11 = args.asString( 22 );
blendStep = args.asInt( 23 );
}
The par_file variable contains the name of the txt file passed as the first argument
and is used to open the file and process its contents, the analyzed speech. The
second argument is assigned to the based variable and represents the number of
visemes and phonemes used. This number is fundamental not only for the animation
itself but also to know the size of the arrays. The next two for loops rebuild the
phoneme and viseme arrays that form the main data structure of the algorithm.
The last variable, blendStep, needs a specific discussion. While variables such as
par1, based or par_file are all strings, and the Maya API provides an asString method
in the MArgList class, paradoxically an asFloat method is altogether absent. This
deficiency creates a disconcerting problem, because it was not possible to pass the
value directly as a float type, as originally intended. The value is obtained from
the graphical user interface through a slider.
The first idea was to create a float slider, since the value of a blendshape is a
floating-point number between 0.0 and 1.0; but, given the impossibility of getting a
float value in the C++ API, this simple idea had to be twisted. The only way to
preserve the intuitive slider in the interface and still pass its value to the C++
code was to have the user set the percentage value of the blendshape, pass it as an
int to the API, and create an appropriate procedure to convert the integer value into
a float one.
This conversion is simply made by a division using the standard C library div_t
type; it is then necessary to construct the right MEL sequence of characters for the
command to be executed. The results of the division are the quotient and its
remainder: the percentage value indicated by the user through the graphical interface
is divided by 100, and its quotient and remainder are joined by the decimal point to
obtain a real floating-point number. This is clearer in the code below:
div_t step;
step = div( blendStep, 100 );
cmdAttr2 = cmdAttr2 + parViseme[x] + "\" " + step.quot + "." + step.rem;
MGlobal::executeCommand( cmdAttr2, true, false);
It is possible to see that it was very annoying to build up the entire MEL string, even adding the "." to separate the integer part of the number from its decimal part. For clarity, executeCommand is a method of the Maya API class MGlobal that allows executing MEL commands from C++ code.
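The integer-to-string workaround can be reproduced as a small standalone sketch. The function name pctToMelValue is hypothetical; in the plug-in the quotient and remainder are concatenated directly into the MEL command string:

```cpp
#include <cstdlib>   // div, div_t
#include <string>

// Hypothetical sketch of the percentage workaround: the slider passes an
// integer percentage; dividing by 100 yields a quotient and a remainder,
// which are joined with a "." to spell out the floating value that will
// appear inside the MEL setAttr command.
std::string pctToMelValue(int blendStep) {
    div_t step = div(blendStep, 100);
    return std::to_string(step.quot) + "." + std::to_string(step.rem);
}
```

For a slider value of 50 this yields the string "0.50", which is then appended to the setAttr command string.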
In spite of all these observations, it was not so complicated to solve these problems; but the documentation is too poor to allow deep learning of Maya programming and interaction, and there are some surprising deficiencies for a software that aims to excel in its possibilities of being programmed and extended by professional users.
4.3 Algorithm
Speak U2 is based on a linear but complex algorithm composed of two main parts. The first part handles the translation from text to phonemes of English grammar, while the second concerns the selection of which visual phonemes are of interest for building the map, and the calculation of the times at which to set keys for the interested visual phonemes.
The code lines about parsing arguments seen before are all inserted in the first part of the C++ code, which has to follow a precise structure in order to be loaded within Maya as a command plug-in. In fact, a plug-in has to implement some mandatory methods to be initialized within Maya. These methods are initializePlugin and its dual uninitializePlugin, which have a standard structure simply to make Maya know the names of the newly generated commands:
MStatus initializePlugin( MObject obj )
{
    MStatus status;
    MFnPlugin plugin( obj, "Speak U 2", "4.0", "Any");
    status = plugin.registerCommand( "su2LipSynch", su2LipSynch::creator );
    if (!status) {
        status.perror("registerCommand");
        return status;
    }
    return status;
}
The first parameter in the plugin.registerCommand call is the name attributed to the new command, while the second is the name of another obligatory method of the MPxCommand class, from which the su2LipSynch class of the Speak U2 plug-in inherits, and which simply returns a new instance of the su2LipSynch class. The uninitializePlugin method is just the dual of initializePlugin and differs only in the call to the plugin.deregisterCommand method.
The most important method of the base class is the doIt method, which is a sort of main of a C program in the sense that, as its name suggests, it invokes all the other methods following the workflow of the program. Thus in the doIt method, besides the argument parsing that rebuilds the data structure, there is also the analysis of the text file resulting from the use of the interface and containing all data about the spoken words and their times of pronunciation. As seen before, all the information is collected in a data structure where each line refers to a single word; a slash separates the spoken word from its beginning and ending frame times, thus the decision was to generate three separate files, each one about a single datum. In this way the algorithm can work on a text file containing the complete list of spoken words, each one blank-separated from the others, another file containing the frame numbers at which each word begins, and another with the frames at which each word ends. These files are named respectively "intermedio_parole.txt", "intermedio_start.txt" and "intermedio_end.txt" and are opened in write mode in the doIt method, where they are generated, and then opened in read mode by the other methods that use their contents.
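The slash-separated line format can be illustrated with a small standalone parser; the names splitLine and WordTiming are assumptions for illustration, since the plug-in actually writes the three fields directly into the three intermediate files:

```cpp
#include <sstream>
#include <string>

// Hypothetical sketch: split one line of the analysed-speech file,
// "word/startFrame/endFrame", into the three fields that the doIt method
// routes to intermedio_parole.txt, intermedio_start.txt and
// intermedio_end.txt respectively.
struct WordTiming {
    std::string word;
    int start;
    int end;
};

WordTiming splitLine(const std::string& line) {
    WordTiming w;
    std::istringstream in(line);
    std::string field;
    std::getline(in, w.word, '/');   // spoken word, up to the first slash
    std::getline(in, field, '/');    // starting frame
    w.start = std::stoi(field);
    std::getline(in, field);         // ending frame, rest of the line
    w.end = std::stoi(field);
    return w;
}
```

A line such as "hello/10/25" thus decomposes into the word "hello", starting frame 10 and ending frame 25.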
4.3.1 Text to phonemes translation
The text-to-phoneme algorithm has a simple structure even if it is made up of many steps. The main process is to check, for each letter of the spoken word, which one comes before and which comes after, according to English grammar. Thus there was the need to create a file containing rules for the automatic translation of English into phonetics. Rules are made up of four parts: the left context, the text to match, the right context and finally the phonemes to substitute for the matched text.
The algorithm starts reading from "intermedio_parole.txt" until it finds the end of file EOF. For each word, it proceeds by separating each block of letters and adding a blank on each side. After selecting the word to work on, it gets the starting frame and computes the duration of the pronunciation by subtracting it from the ending frame value. This information is used in the second part of the algorithm, where it decides at which time to set the appropriate viseme keyframe. These actions of dividing each word, putting a starting and ending space, and determining the time and duration of the word are all included in the haveLetter method. This method is called any time the character read from the "intermedio_parole.txt" file is a letter; it prepares an array containing the word for the translation, terminating it with the symbol "\0", which is interpreted as a string end, to pass it to the translateWord method.
During the text-to-phonemes translation it is also necessary to count the number of phonemes that compose the analysed word, because this datum is also useful and indispensable in the second part of the algorithm. This task is assigned to the translateWord method, which has a counter that is incremented any time a new phoneme is translated from the word. In fact, within this method, while the terminating character is not found, each letter of the word is passed to the findRule method which, for each letter in the word, looks through the rules whose text to match starts with that letter. If the text to match is found and the right and left context patterns also match, it outputs the phonemes for that rule and skips to the next unmatched letter. Below is a rule example:
static Rule A_rules[] =
{
    {Anything, "A", Nothing, "AX"},
    {Nothing, "ARE", Nothing, "AAr"},
    {Nothing, "AR", "O", "AXr"},
    {Anything, "AR", "#", "EHr"},
    {"^", "AS", "#", "EYs"},
    {Anything, "A", "WA", "AX"},
    {Anything, "AW", Anything, "AO"},
    {" :", "ANY", Anything, "EHnIY"},
    {Anything, "A", "^+#", "EY"},
    {"#:", "ALLY", Anything, "AXlIY"},
    {Nothing, "AL", "#", "AXl"},
    {Anything, "AGAIN", Anything, "AXgEHn"},
    {"#:", "AG", "E", "IHj"},
    {Anything, "A", "^+:#", "AE"},
    {" :", "A", "^+ ", "EY"},
    {Anything, "A", "^%", "EY"},
    {Nothing, "ARR", Anything, "AXr"},
    {Anything, "ARR", Anything, "AEr"},
    {" :", "AR", Nothing, "AAr"},
    {Anything, "AR", Nothing, "ER"},
    {Anything, "AR", Anything, "AAr"},
    {Anything, "AIR", Anything, "EHr"},
    {Anything, "AI", Anything, "EY"},
    {Anything, "AY", Anything, "EY"},
    {Anything, "AU", Anything, "AO"},
    {"#:", "AL", Nothing, "AXl"},
    {"#:", "ALS", Nothing, "AXlz"},
    {Anything, "ALK", Anything, "AOk"},
    {Anything, "AL", "^", "AOl"},
    {" :", "ABLE", Anything, "EYbAXl"},
    {Anything, "ABLE", Anything, "AXbAXl"},
    {Anything, "ANG", "+", "EYnj"},
    {Anything, "A", Anything, "AE"},
    {Anything, 0, Anything, Silent},
};
The four elements in each row correspond respectively to the left part, the match part, the right part and the output part. The output part is exactly the one or more phonemes that translate that word or part of a word. As seen in the A-rules example, there are some special characters used by the algorithm to signal particular situations that can occur in words. These special characters are:
Symbol   Meaning
#        One or more vowels
:        Zero or more consonants
^        One consonant
.        B, D, V, G, J, L, M, N, R, W, Z
+        E, I or Y
%        Error: bad character
Table 4.2 : special characters used for the left match
Symbol   Meaning
#        One or more vowels
:        Zero or more consonants
^        One consonant
.        B, D, V, G, J, L, M, N, R, W, Z
+        E, I or Y
%        ER, E, ES, ED, ING, ELY
Table 4.3 : special characters used for the right match
The table below shows the 44 American English phonemes used for the
translation:
Phoneme Example Phoneme Example
IY bEEt IH bIt
EY gAte EH gEt
AE fAt AA fAther
AO lAWn OW lOne
UH fUll UW fOOl
ER mURdER AX About
AH bUt AY hIde
AW hOW OY tOY
p Pack b Back
t Time d Dime
k Coat g Goat
f Fault v Vault
TH eTHer DH eiTHer
s Sue z Zoo
SH leaSH ZH leiSure
HH How m suM
n suN NG suNG
l Laugh w Wear
y Young r Rate
CH CHar j Jar
WH WHere
Table 4.4 : American English phonemes
Using this data structure, which resides in a separate file named regole.cpp containing rules for every letter of the English alphabet, the findRule method figures out the right phoneme translation for the word, matching what precedes and follows the letter in the word against the rule for that letter and delegating this job to the special methods leftMatch and rightMatch. The method then returns the phonemes and passes them to the translateWord method, which is the right place to maintain the counter for the number of phonemes composing a word.
The leftMatch method, like the rightMatch method that is its dual for the right part of the match, is made up of a long switch structure that discerns groups of letters found on the left part of the match. The switch cases are the special characters such as + and # listed in table 4.2, because this allows a single character to stand for many letters. There is a counter that is initialised to the pattern length and decremented any time a good match with the context is found. The leftMatch method returns TRUE or FALSE depending on whether all the letters of the pattern match and, if successful, in the findRule method the right phoneme is obtained using the fourth field of the static rule structure: output = (*rule)[3];. At this point, within the findRule method, the algorithm has produced one of the many phonemes composing the entire word, thus it is necessary to store it in a list of that word's phonemes.
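A minimal sketch of the context-matching idea can be written standalone, handling only the '#' (one or more vowels) and '^' (one consonant) symbols; the name rightMatchSketch is hypothetical, and the real leftMatch/rightMatch methods cover the full symbol tables 4.2 and 4.3:

```cpp
#include <cstring>
#include <string>

// Hypothetical sketch of rightMatch: walk the pattern against the context
// to the right of the matched text. Only '#' (one or more vowels) and
// '^' (exactly one consonant) are handled here; literal letters must match.
static bool isVowel(char c) { return std::strchr("AEIOU", c) != nullptr; }

bool rightMatchSketch(const std::string& pattern, const std::string& context) {
    size_t ci = 0;
    for (char p : pattern) {
        if (p == '#') {                       // one or more vowels
            if (ci >= context.size() || !isVowel(context[ci])) return false;
            while (ci < context.size() && isVowel(context[ci])) ++ci;
        } else if (p == '^') {                // exactly one consonant
            if (ci >= context.size() || isVowel(context[ci])) return false;
            ++ci;
        } else {                              // literal character
            if (ci >= context.size() || context[ci] != p) return false;
            ++ci;
        }
    }
    return true;                              // the whole pattern matched
}
```

With pattern "#" and context "ELLO", the leading vowel E satisfies the match; with context "LLO" it fails, which is the behaviour the prose describes.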
This new task is assigned to the workPhoneme method, which tries to understand which phonemes were selected. Looking at the example rules for the letter A shown before, it is possible to see that sometimes the output part of that data structure is not made of a single phoneme. In fact, some outputs are single phonemes but others are sequences of them, and this difference is signalled by the fact that phonemes translating vowels are written in upper-case letters while consonant phonemes are written in lower-case. Moreover, vowel phonemes are all made up of two letters while consonant ones are made of a single letter.
Because of these features it is quite simple to discover the situations in which the given output is in reality more than one phoneme. The workPhoneme method determines these cases by analysing the length of the char variable resulting from the findRule procedure and containing the phonemes: if that length is less than three, then it is a single phoneme; otherwise it is a sequence of vowel and consonant phonemes that are discerned by considering their case. After incrementing a phoneme counter, all phonemes are put in a single cell of an array that is then passed to the findPhoneme method.
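The case-based splitting performed by workPhoneme can be sketched standalone; splitPhonemes is a hypothetical name, and the sketch follows exactly the convention stated in the text (two upper-case letters per vowel phoneme, one lower-case letter per consonant phoneme):

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical sketch of the workPhoneme splitting rule: vowel phonemes are
// written as two upper-case letters, consonant phonemes as a single
// lower-case letter, so a compound output such as "AXgEHn" decomposes by
// looking at the case of each character.
std::vector<std::string> splitPhonemes(const std::string& output) {
    std::vector<std::string> result;
    // an output shorter than three characters is already a single phoneme
    if (output.size() < 3) {
        result.push_back(output);
        return result;
    }
    for (size_t i = 0; i < output.size(); ) {
        if (std::isupper(static_cast<unsigned char>(output[i]))) {
            result.push_back(output.substr(i, 2));  // two-letter vowel
            i += 2;
        } else {
            result.push_back(output.substr(i, 1));  // one-letter consonant
            i += 1;
        }
    }
    return result;
}
```

The output "AXgEHn" of the AGAIN rule thus splits into AX, g, EH, n, matching the four phonemes the text describes.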
The findPhoneme method finally looks for each of the phonemes contained in the array received as argument in the main phoneme array initialised at the beginning in the doIt method while rebuilding the data structure passed by the MEL code. Processing each element of the phonemes array, which contains grouped phonemes separated by commas, the findPhoneme procedure searches for the matching phoneme and, after finding it, calls the findViseme method, which simply gets the name of the associated viseme at the same index of the visual phonemes array. The viseme name is then copied into another string array named wordViseme, at the position indicated by a counter that starts from 0 and is incremented within the findViseme procedure itself, in order to obtain a sequential memorization. This string array wordViseme is thus used to contain the visemes to be keyframed for the current word, in the order corresponding to the list of translated phonemes. This memorization is indispensable because at this time only a part of the word is translated, thus it is necessary to maintain the order of the visemes that will be keyframed later in the algorithm.
From findViseme and findPhoneme, the process returns to the workPhoneme procedure and then directly to the findRule method, which is called in a loop within the translateWord function until the string-terminating character.
At this point everything concerning a single word is known: not only its translation into phonemes but also their number and the visemes used to cover them; thus the algorithm enters its second phase to compute the times at which to generate keys for the animation and to set them.
In order to really understand what has happened so far during the process of converting letters to sounds, it can be useful to follow all the steps for a simple word.
Analysing the word "IN", for example, it is soon discovered that this word is in reality just one of those words whose translation into several phonemes is directly present within the rules of the corresponding alphabetical letter. The findRule procedure has to check the left and right parts just once, because the word is exactly " IN " and not a longer one containing it; thus the right and left matches are blank spaces, which correspond to a return condition within leftMatch or rightMatch because there is nothing to be converted. In fact, looking at the first few lines of the I_rules, it is possible to see that the word IN is present in its entirety among the rules:
static Rule I_rules[] =
{
    {Nothing, "IN", Anything, "IHn" },
thus this word directly has its sequence of phonemes, IHn, as its translation: the rule applies if the text is preceded and followed by a blank, which means it really is the word IN and not the sequence IN within a longer word.
To better follow the translation steps it can be interesting to analyse another word, such as "HELLO". In translateWord the letter whose rule has to be searched in the rules file is detected. The first letter is an "H", and then the findRule method is called on the H_rules, shown below:
static Rule H_rules[] =
{
    {Nothing, "HAV", Anything, "hAEv"},
    {Nothing, "HERE", Anything, "hIYr"},
    {Nothing, "HOUR", Anything, "AWER"},
    {Anything, "HOW", Anything, "hAW"},
    {Anything, "H", "#", "h"},
    […]
Within the findRule method the match variable gets each time the value of the match in the H_rules, that is the second column; thus the first time it gets "HAV". At this point it compares the word to be translated with this match, discovering that the A doesn't match the E of "HELLO". The algorithm goes on considering the next match of the rules and finds "HERE". The loop breaks because of the mismatch between the "R" and the "L" of "HELLO", and thus it goes on to the next match. Finally it reaches the match "H" and calls the leftMatch and rightMatch procedures on the "Anything" and "#" patterns. The leftMatch method returns almost immediately, because the "H" is the first letter of the word, so anything matches. The rightMatch method instead gets as context the word "ELLO" and as pattern the "#" symbol, which means one or more vowels as indicated in table 4.3. The "E" in "ELLO" satisfies the match, thus the context is advanced and the test is executed again; but no other vowels follow in the remaining context "LLO", thus rightMatch returns and the output variable gets the translated phoneme, which is "h". This is the process followed for each letter in the word. The findRule code is shown below:
int su2LipSynch::findRule( char word[], int index, Rule rules[])
{
    […]
    for (;;)
    {
        rule = rules++;
        match = (*rule)[1];
        […]
        for ( remainder = index; *match != '\0'; match++, remainder++)
        {
            if (*match != word[remainder])
                break;
        }
        if ( *match != '\0')
            continue;
        left = (*rule)[0];
        right = (*rule)[2];
        if ( !leftMatch( left, &word[index-1]))
            continue;
        if ( !rightMatch( right, &word[remainder]))
            continue;
        […]
This is part of the findRule code, where the innermost cycle performs the comparison between the word to be translated and the match found in the rule of the letter; then the left and right match procedures are called.
4.3.2 Keyframing
The last step of translateWord is to call the makeKeyframes method. It is a little complicated, because there are many operations to be performed, and in different ways according to the selected schema. The general purpose of this method is to set the key for the right visual phoneme at the right time and at the right value. As stated in the second chapter, setting a key means fixing the value of a precise attribute at a precise frame; thus in this case it means setting the blendshape's value for the interested visual phoneme, where a blendshape is that particular Maya deformation that morphs between two positions, creating mouth poses.
Like other methods seen before, makeKeyframes too has a dual structure for the ten and sixteen schemata, changing only the array dimension and the loop length. Thus studying the portion of the keyframing algorithm regarding the adoption of 10 visual phonemes is sufficient to understand the complete process.
This method makes large use of an API class, MGlobal, because this class implements a method called executeCommand which allows executing any MEL command from a C++ call. Obviously, even if this is easier thanks to the simple syntax of the MEL language, it also restricts the program to the possibilities of the MEL script itself; thus from the C++ code it is necessary to build up exactly the same command that would be given within MEL, but inside a constructed string variable, and then execute that string as a command. This often obliges concatenating strings with inverted commas, commas or points to obtain the precise sequence of a MEL command, as shown below:
MString cmdAttr = "setAttr \"" + par11 + ".";
MString cmdKey = "setKeyframe " + par11 + ".";
MString cmdTime = "currentTime -edit ";
cmdAttr2 = cmdAttr;
cmdKey2 = cmdKey;
cmdAttr2 = cmdAttr2 + parViseme[j] + "\" " + nostep;
MGlobal::executeCommand( cmdAttr2, true, false);
cmdKey2 = cmdKey2 + parViseme[j];
MGlobal::executeCommand( cmdKey2, true, false);
cmdTime2 = cmdTime + tempo;
MGlobal::executeCommand( cmdTime2, true, false);
These are the MEL commands initialised by the C++ code, where the par11 variable contains the name of the blendshape under which all visemes are created. In fact, the real MEL code is something like this:
setAttr "visemi.viseme1" 0.5;
setKeyframe visemi.viseme1;
The first line sets the attribute named viseme1 of the group of blendshapes named visemi to the value 0.5, while the second command creates the key for the viseme1 attribute, according to the value set before, at the current time in the timeslider. These command strings are passed as parameters to the MGlobal::executeCommand method in order to have Maya execute them.
The MString variables cmdAttr and cmdKey store the part of the command string that never changes; any time a new command is necessary, they are copied into the working strings cmdAttr2 and cmdKey2 to obtain the right command each time.
The same explanation applies to the cmdTime and cmdTime2 variables, which are used to set and change the time at which to save the key for the animation.
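The string-assembly pattern can be mirrored with std::string outside Maya; the blendshape and viseme names are examples, and in the plug-in the resulting strings are passed to MGlobal::executeCommand rather than returned:

```cpp
#include <string>

// Hypothetical sketch of the MEL command assembly: the invariant prefix is
// kept in one string, and a fresh command is produced by appending the
// viseme name and the value, quoting the attribute as MEL requires.
std::string buildSetAttr(const std::string& blendShape,
                         const std::string& viseme,
                         const std::string& value) {
    // e.g. setAttr "visemi.viseme1" 0.5
    return "setAttr \"" + blendShape + "." + viseme + "\" " + value;
}

std::string buildSetKeyframe(const std::string& blendShape,
                             const std::string& viseme) {
    // e.g. setKeyframe visemi.viseme1
    return "setKeyframe " + blendShape + "." + viseme;
}
```

Keeping the prefix separate from the per-command suffix is exactly the role played by the cmdAttr/cmdAttr2 pair in the plug-in.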
At the start of this part of the algorithm, the known data are the number of phonemes, and hence of visual phonemes to be keyed, and the duration, expressed in frames, of the word's pronunciation, representing the time within which all keys for the current word are to be set. Thus the first operation of this part of the lip synch process is to divide the duration time of the word by the number of visemes, and so assign a regular and uniform time of persistence to each viseme in the animation. Actually, the real study of phonetics suggests that not all phonemes have the same importance in the pronunciation of a word, such as the "t" or "d" sounds. Moreover, it is not true that these differences in sound dominance can be collected into a rigid rule, because not all words sound the same way even if the same phoneme composes them.
For this reason, Speak U2 makes a standard attribution of time, giving the same uniform time to each phoneme translating a word, and then invites the animator to correct or modify some keyframes' times where necessary.
Thinking about the synch of the first word in a speech, it is evident that every viseme value has to be keyed at zero because of Maya's own interpolation. In fact, Maya automatically interpolates during the animation when the same attribute has keys at different values in different frames; the gap between the two times is covered by Maya itself, which interpolates from the value of the first key to the second. Thus for the first word of a speech, or for a word pronounced after a pause, it is important to set all the viseme keys to zero; but for all other words during the speech it could be very dangerous to perform this action.
In fact, the makeKeyframes algorithm, knowing the number of phonemes composing a word and the total time of the word's duration, assigns the same number of frames to each phoneme; the zero key has to be set at the starting pronunciation frame minus this uniform quantity, because at the real beginning of the word the keys must already be at a value different from zero in order to get the mouth open. Thus, if this is not a word with a pause before it, setting keys to zero could damage the ending of the previous word.
Because of these motivations, the algorithm has two different behaviours according to the position of the word in the speech. This control is made by an if test that verifies whether the time at which it is necessary to put the first key falls before the end of the previous word. If this happens, keys are not set to zero; only the interested viseme key is set to a value that is half of the value at which the animation will be built. A half value allows obtaining a smooth transition between the two phonemes without closing the mouth. The code below makes this concept easier to understand:
parity = div( durata , total);
forcing = value - parity.quot;
Parity and forcing are two variables used for this control: parity stores the uniform number of frames assigned to the visemes, while forcing is the time at which it is necessary to put the keys at zero to obtain the interpolation. Last is the variable that stores the last frame at which the previous word ends.
As can be seen from the code below, if the forcing variable is greater than or equal to the last variable, it is possible to set keys at zero because this doesn't affect the previous word's keyframing, and so all visemes are keyed at zero at the time equal to the forcing variable's value:
if ( forcing > last || forcing == last)
{
    cmdTime2 = cmdTime + forcing;
    MGlobal::executeCommand( cmdTime2, true, false);
    cmdAttr2 = cmdAttr; /* to clear cmdAttr2 every time */
    cmdKey2 = cmdKey;
    […]
    if ( strcmp ( based, "dieci" ) == 0 )
    {
        for ( j = 0; j < 10; j++)
        {
            cmdAttr2 = cmdAttr2 + parViseme[j] + "\" " + nostep;
            MGlobal::executeCommand( cmdAttr2, true, false);
            cmdKey2 = cmdKey2 + parViseme[j];
            MGlobal::executeCommand( cmdKey2, true, false);
            cmdAttr2 = cmdAttr;
            cmdKey2 = cmdKey;
        }
On the contrary, if the word does not have sufficient space from the previous one, a correction is applied to the viseme corresponding to the first phoneme of that word:
if ( strcmp( wordViseme[0], parViseme[j]) == 0 )
{
    cmdAttr2 = cmdAttr2 + parViseme[j] + "\" ." + correction;
where correction is half of the blendshape value set for the animation.
Moreover, in case the division of the duration time by the number of phonemes is not exact, one more frame is assigned to the first viseme, because the start of a word is the most important part of the lip synch.
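The uniform time attribution, including the extra frames for the first viseme when the division is not exact, can be sketched as follows; framesPerViseme is a hypothetical helper around the same div call used in the plug-in, and giving the whole remainder to the first viseme is an assumption generalising the "one more frame" rule of the text:

```cpp
#include <cstdlib>   // div, div_t
#include <vector>

// Hypothetical sketch of the uniform timing rule: the word duration (in
// frames) is divided by the number of phonemes; each viseme gets the
// quotient, and any leftover frames go to the first viseme, since the
// start of a word matters most in the lip synch.
std::vector<int> framesPerViseme(int durata, int total) {
    div_t parity = div(durata, total);
    std::vector<int> frames(total, parity.quot);
    frames[0] += parity.rem;   // remainder frames to the first viseme
    return frames;
}
```

A 13-frame word with 4 phonemes thus yields 4 frames for the first viseme and 3 for each of the others.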
For the mapping of phonemes in intermediate positions, the algorithm gets the name of the next viseme to be keyed and looks for it in the visemes array; any time it encounters a visual phoneme with a different name, it means it is not the viseme to be keyed at that time and its value is set to zero; when it finds the viseme it is looking for, it makes a key for it at the value decided at the beginning through the blendshape dominance parameter. This is performed by this code:
for ( h = 0; h < total; h++)
{
    cmdAttr2 = cmdAttr;
    cmdKey2 = cmdKey;
    tempo = tempo + parity.quot;
    for ( x = 0; x < 10; x++)
    {
        if ( strcmp( wordViseme[h], parViseme[x]) == 0)
        {
            cmdAttr2 = cmdAttr2 + parViseme[x] + "\" " + step.quot + "." + step.rem;
            MGlobal::executeCommand( cmdAttr2, true, false);
            cmdKey2 = cmdKey2 + parViseme[x];
            MGlobal::executeCommand( cmdKey2, true, false);
            cmdAttr2 = cmdAttr;
            cmdKey2 = cmdKey;
        }
        else
        {
            cmdAttr2 = cmdAttr2 + parViseme[x] + "\" " + nostep;
            MGlobal::executeCommand( cmdAttr2, true, false);
            cmdKey2 = cmdKey2 + parViseme[x];
            MGlobal::executeCommand( cmdKey2, true, false);
            cmdAttr2 = cmdAttr;
            cmdKey2 = cmdKey;
        }
    }
    cmdTime2 = cmdTime + tempo;
    MGlobal::executeCommand( cmdTime2, true, false);
}
The outermost cycle loops over the number of phonemes into which the word was translated, while the internal for loop controls the value at which the blendshape has to be set, according to whether it is the viseme to be mapped at that time or not. These different values are set using the variable nostep, which stands for a zero key, or step.quot + "." + step.rem, which translates for example to 0.5.
This code is valid for the ten-viseme schema, but a specular version of it refers to the sixteen-based schema.
Chapter 5
Software description
The Speak U2 interface is the result of a careful analysis of the lip synch methodology and of the understanding of its fundamental steps. For this reason, using the MEL scripting language as stated in the previous chapter, the user interface was designed to reflect the workflow of the lip synch process, in order to itself be a guide to the animator in his work.
Speak U2 was developed with a specific target user in mind, such as an animator or a video postproduction firm, but in spite of this it was planned to be a complete software, thus it has its own online guide and a simple installation manual. The description of a software and of its use is a very important aspect even for an animation plug-in; the complete understanding of its use is fundamental, and in fact today there are many useful plug-ins that are not used during the production phase because they are too hard to understand.
5.1 Speak U2 installation
The Speak U2 plug-in has a very simple installation, like any other Maya plug-in. It is formed by three MEL files, named userSetup.mel, speakU2Register.mel and speakU2createUI.mel. All these files implement the user interface, while the compiled C++ file has the .mll extension, which is the specific extension for any Maya plug-in. This file is called SpeakU2.mll. It is important to know the names of these files because the installation of a plug-in normally consists of several copy-and-paste actions of files into specific directories.
In the case of Speak U2, the MEL files are to be copied into the user directory, following the path Maya\4.0\script. This is valid for Microsoft Windows 2000 and Windows NT 4.0 too, even if the user directory differs between the two Microsoft operating systems: in one case it is under Document and Settings\user\, while under NT 4.0 it is under winnt\profiles.
In case a userSetup.mel file is already present in that directory, it is important to select all the content of the file provided with Speak U2 and append it at the end of the preexisting one.
The speakU2.mll file is to be copied into the Maya installation directory AW\Maya4.0\bin\plug-ins, where all mll files are located. At the first launch of the Alias|Wavefront application it is necessary to go to the menu Window\Settings/Preferences\Plug-in Manager to activate the loading of the plug-in, and possibly check its autoload box in order to have Speak U2 always present at every Maya start-up.
5.2 Plug-in integration
Speak U2 is perfectly integrated into the Maya graphic interface. In fact, after its setup, any time the application is started it is possible to notice in the main Maya window, among all the menus in the menu bar, the presence of a menu called SpeakU2.
Moreover, that menu also appears in the hotbox, which is a particular Maya tool obtained by pressing the spacebar and which groups all of Maya's menus for quicker selection. The images below show this integration:
Figure 5.1 : Maya Main Window
The Speak U2 menu is present among the Maya menus and is located on the right part of the menu bar.
Figure 5.2 : Maya Hotbox
In the hotbox the Speak U2 menu appears in the top right position among the general menus, while the menus below refer to Maya configurations other than the default one, which is generally Animation. Thus the Speak U2 menu is always present in the hotbox, independently of Maya's working configuration, such as rendering or modeling.
5.3 User interface
Figure 5.3 : First tab of Speak U2 interface
The first tab of the user interface, as seen before, refers to the phase of building the data structure of phonemes and visemes. Figure 5.3 presents this first tab.
In the first line there is an option box that allows choosing between an animation based on ten or sixteen visemes covering the 44 American phonemes. In the "List of blendshapes present in the scene:" are listed all the blendshapes modelled in the scene, and among these blendshape names the user has to select the deformer linked with the visemes. Once something is selected in this list, all its attributes are displayed in the "List of blendshape attributes:" scrolling list.
All these attributes are grouped under the main one selected in the first list; according to the decisions of the modeller, it may happen that blendshapes not representing visual phonemes are grouped together with those corresponding to mouth shapes. Obviously it is not good practice to mix blendshapes referring to mouth positions with others regarding different facial animation aspects, but it can happen.
Because of this considerations, the user interface consider to allow the artist to
choose among this attributes blendshapes to select the only ones corresponding to
visual phonemes; selecting them and pressing the “Add Viseme” button, the selected
visemes are appended into the below scrolling list named “List of Viseme:” where the
animator will introduce ten or sixteen mouth’s shapes according to the schema he wants
to adopt for the animation.
The choice of schema determines how the "Grouped American English Phonemes"
list is filled with ten or sixteen groups of phonemes; if the visual
phonemes were entered following the order of this scrolling list, the mapping between the
grouped phonemes and the visemes becomes quite evident and simple. Moreover, the
"Delete Viseme" and "Delete All Visemes" buttons make it possible to correct the list of
visual phonemes to be mapped. The mapping itself is performed with the "Map between"
button, which connects the selected group of phonemes to the selected blendshape viseme.
The first time, each group of phonemes has to be mapped to the corresponding
viseme; the map can then be saved to, and reloaded from, a map10.txt or map16.txt file
according to the selected schema.
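The mapping this tab constructs can be pictured as a dictionary from a phoneme group to a viseme blendshape name, saved to and reloaded from a plain text file. The sketch below, in Python, uses hypothetical group and viseme names and an assumed one-entry-per-line layout, since the actual format of map10.txt/map16.txt is not documented here:

```python
# Sketch of the phoneme-group -> viseme map built with the "Map between"
# button. Group and viseme names are hypothetical examples.

def save_map(mapping, path):
    """Write one 'group=viseme' line per entry (assumed file format)."""
    with open(path, "w") as f:
        for group, viseme in mapping.items():
            f.write("%s=%s\n" % (group, viseme))

def load_map(path):
    """Rebuild the dictionary from a previously saved map file."""
    mapping = {}
    with open(path) as f:
        for line in f:
            group, _, viseme = line.strip().partition("=")
            if group:
                mapping[group] = viseme
    return mapping

viseme_map = {
    "AA AO OW": "viseme_AH",   # open-mouth group (example)
    "M B P":    "viseme_MBP",  # closed-lips group (example)
}
save_map(viseme_map, "map16.txt")
assert load_map("map16.txt") == viseme_map  # round-trip check
```

A real implementation inside Maya would be written in MEL, but the data flow of saving and reloading the map is the same.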
Figure 5.4 : Second tab of Speak U2 user interface
The second tab, shown in figure 5.4, is called "Working on speech"
because it is used to collect information about the words during the analysis. In the first line,
an option box selects the source of the audio file to be synchronized:
choosing the "From disk" option activates the "Browse" button, so that the
wav file can be picked from the hard disk. If the "From timeline" option is
selected, the audio file displayed within the Speak U2 interface is the
same one visible in the Maya timeline, and thus the file already present in the scene
is used. The audio file has to be a wave file sampled at 44 100 Hz (44.1 kHz),
as Maya requires; moreover, to play the sound back inside Maya, the playback
speed must be set so that the audio plays during the animation. Not every playback speed
performs this task: it has to be chosen according to the video format adopted for the
animation; for example, real-time 25 frames per second is the
standard for Italian movies.
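The correspondence between positions in the 44 100 Hz wave and animation frames at the chosen playback speed is a simple proportion; a minimal Python sketch, assuming the 25 fps rate mentioned above:

```python
SAMPLE_RATE = 44100  # Hz, the wave sampling rate Maya expects
FPS = 25             # real-time playback speed for PAL/Italian movies

def sample_to_frame(sample_index, rate=SAMPLE_RATE, fps=FPS):
    """Map a sample index in the wave file to an animation frame."""
    return int(sample_index / rate * fps)

def frame_to_seconds(frame, fps=FPS):
    """Time in seconds at which a given frame is played."""
    return frame / float(fps)

# Two seconds into the audio corresponds to frame 50 at 25 fps.
assert sample_to_frame(2 * SAMPLE_RATE) == 50
assert frame_to_seconds(50) == 2.0
```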
The sound time section displays the audio waveform and lets the user move
through the sound, listen to it, and determine the exact frame at which a word begins
or ends. He can also scrub through the audio, where scrubbing means dragging along the wave
file and listening to it frame by frame. The buttons below the waveform make
this navigation easier. To facilitate the task of writing down the spoken words and their
timing, an option box also allows the playback mode to be set to a
continuous loop instead of playing once. It remains true that Maya offers few
instruments for working with audio, so it is generally preferred to use an
external tool with richer features, such as Syntrillium Cool Edit Pro 2000 or Adobe
Premiere 6.0.
Once the artist is satisfied with the timing he has detected, he can insert all these
values into the corresponding fields below the waveform; "Word's
pronunciation beginning" and "Word's pronunciation ending" can be filled by
pressing the "Keep start frame" or "Keep end frames" button while positioned on the chosen
frame.
The emotion selection list was included in the user interface even though it is not
functional in this version of Speak U2. It would be used to add a particular emotion to the
spoken word, for example happiness or sadness, or to give the character an
interrogative expression. This addition requires further study to understand and plan how
to insert this information among the others and how to make the algorithm process it.
The last button is named "Save" and it has to be pressed every time the artist has
entered all the information about a single word; the data are then stored in the third tab, as
shown in figure 5.5.
Figure 5.5 : Third tab of Speak U2 user interface
The third tab was created to collect all the information inserted in the
previous tab. Each line of the "Word\Start frame\End frame\Emotion" scrolling list
refers to a single word and contains the four information fields its name suggests.
From this table it is possible to verify that all words have been entered, and
to delete or modify them. To change a value of a line, the user selects
the line and goes back to the second tab, where the data of the selected line are
automatically displayed; after the changes, the "Save" button overwrites the
previous values. To delete one line or all lines, it is enough to press the "Delete Line" or
"Delete ALL Line" button.
Moreover, a further button, "Insert Line", introduces a new line
before the one selected in the list; the new line is filled with default values,
xxx\0\1\None, ready to be modified.
In the last section of this tab there are two buttons, labelled "Load Work from a txt
file" and "Save Work on a txt file", which save all the information to a
txt file named after the scene currently open in Maya, or load the information back from a
saved file. This is an important feature because, with a long speech, it often
happens that not all the words are analysed at the same time. Moreover, analysing one
short part of a long speech at a time can be a normal way of working, in order to
verify the position of the words in the timeline and how the algorithm reacts when synchronizing
them.
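The word table and its save/load buttons can be sketched as a list of records serialized one per line; the backslash-separated layout below simply mirrors the header of the scrolling list, the real file format of Speak U2 being an assumption here:

```python
DEFAULT_LINE = ("xxx", 0, 1, "None")  # default values of a newly inserted line

def save_work(entries, path):
    """One 'word\\start\\end\\emotion' line per analysed word (assumed format)."""
    with open(path, "w") as f:
        for word, start, end, emotion in entries:
            f.write("%s\\%d\\%d\\%s\n" % (word, start, end, emotion))

def load_work(path):
    """Rebuild the word table from a previously saved work file."""
    entries = []
    with open(path) as f:
        for line in f:
            word, start, end, emotion = line.strip().split("\\")
            entries.append((word, int(start), int(end), emotion))
    return entries

work = [("Hi", 10, 18, "None"), ("myname's", 19, 40, "None")]
save_work(work, "scene_work.txt")
assert load_work("scene_work.txt") == work  # round-trip check
```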
After saving the work, the "Ok – Generate Keys" button becomes active; it
performs the last, longest and most complex task of the lip synch process,
starting the translation and keyframing algorithm described in the previous chapter.
Figure 5.6 shows the output window that appears during the algorithm
execution and makes it possible to follow the lip synch actions as they proceed.
This output window also allows the mapping between visemes and phonemes to be verified,
which is very important: a wrong correspondence means a completely unrealistic
animation, because a sound would be covered by a different mouth shape instead of the
appropriate one.
Figure 5.6 : Speak U2 output window
In the first part of the output window, the grouped phonemes and the name of the viseme
mapped onto them are shown, while in the second part the animator can observe the
words being translated at that moment.
From the programming point of view, the first three tabs, as analysed
in chapter 4, are the most complex and also the most important ones,
because they cover the three main steps of the lip synching process.
The fourth tab was conceived to preview the lip animation obtained with
Speak U2 in a smaller perspective view instead of the Maya one.
It is therefore not a fundamental tab of the user interface, in the sense that it reproduces
something already available in Maya. The animator can choose where to play the
preview, even though this task requires considerable computer resources, so that even
powerful PCs have trouble playing a complex animation within Maya. Because
of this, it is normally a good choice to perform a "playblast", that is, to
create an AVI file that reproduces exactly the animation as played
within Maya. The bottom part of the fourth tab therefore contains a button that opens
the Maya playblast options, allowing the artist to customize its resolution, size
and other preferences. Figure 5.7 focuses on this part of the graphic interface.
Figure 5.7 : Fourth tab of Speak U2 interface
Because of these features, the fourth tab was called "Preview Animation". The
last conceptual step of the lip synch process, after keyframing and preview,
is generally the retouching phase, in which possible imperfections of the mouth movements
or timing errors are corrected.
Figure 5.8 : Fifth tab of the Speak U2 user interface
As Figure 5.8 shows, one immediate way to correct wrong situations is
to use the Blendshape Editor, which is included in this tab of the user interface for
easier access, instead of having to locate it inside Maya. In the Blendshape Editor it is
possible to change a viseme value at any frame of the animation, and thus also to eliminate
visemes by setting their value to zero. Obviously, moving through the Maya
timeline and the channel box (the instrument that collects all the attributes of the
selected shape and their values), it is also possible to completely eliminate a key at a
certain frame. Another possible correction is shifting keys from one
frame to another, even though a good timing chart makes this kind of correction unnecessary.
In the bottom part of this tab the user finds two more links to useful Maya
tools: the Graph Editor, which can be used to control the animation tangents and thus
the influence of the interpolation performed by Maya, and the
Hypergraph, which can be used to inspect the structure of the shapes and select them
with more precision.
5.4 Tips & tricks
Like any software, Speak U2 hides some tricks that are useful to know,
above all for the animator. These tips come from a long period of testing the
plug-in and trying to synchronize many different speeches.
The most important thing to understand is that, as explained in the
previous chapters, the algorithm works on the written word as if it were the spoken one. In
reality, the actual pronunciation very often does not match the grammatically
correct written form: English and American speech in particular is rich in
abbreviations, and cutting away parts of words is usual behaviour.
Sometimes the quickest way to solve problems that emerge after using
this plug-in is to operate on the list of spoken words to be
translated. After keyframing, the animator may notice that some words,
especially the longer ones, generate a sort of lip-vibration effect. This is
common when a word is formed by a sequence of phonemes represented by an
alternation of the same few visemes: the lip movements are identical and they are
repeated within a very small number of frames, giving the impression of a vibration.
There is no error in this on the part of the algorithm, which is working exactly as
usual; but if a few mouth shapes cover many phonemes and a word is composed of
a repetition of the same visemes, no modification of the code can solve the problem.
One solution is to model more visual phonemes, so that fewer phonemes are grouped
under the same viseme; passing from the ten-viseme schema to the sixteen-viseme one
can therefore be a good idea.
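A word likely to produce the vibration effect can be recognized mechanically: its key sequence alternates between very few distinct visemes over many keys. A small diagnostic sketch, where the viseme names and the thresholds are illustrative rather than taken from Speak U2:

```python
def looks_vibrating(viseme_keys, max_distinct=2, min_keys=5):
    """Flag a word whose keys bounce between very few distinct visemes,
    which on screen reads as a lip vibration."""
    return (len(viseme_keys) >= min_keys
            and len(set(viseme_keys)) <= max_distinct)

# A long word alternating between two mouth shapes reads as a vibration...
assert looks_vibrating(["vis_E", "vis_T", "vis_E", "vis_T", "vis_E"])
# ...while a varied viseme sequence does not.
assert not looks_vibrating(["vis_M", "vis_AH", "vis_T", "vis_O", "vis_S"])
```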
As anticipated above, another important measure is to pay attention to
the written word to be translated, according to the rules presented in the first chapter.
Some letters, such as the nasals m and n, or p, often have to be dropped, but this
does not necessarily mean that the artist has to physically delete the corresponding keys
from the timeline: he can remove the letter directly from the written word and then apply the
algorithm. In many cases this method has generated a better lip movement.
It is also possible to use an apostrophe in place of some letters in a word, in order
to keep the viseme of the previous letter and avoid the viseme of
the substituted one.
Moreover, "t" and "d" are letters that should not simply be deleted, but they persist
very little during pronunciation, so they can sometimes be substituted or
erased; where they are fundamental, the animator can shift keys to assign them
fewer frames than the algorithm allocated.
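These letter-dropping and apostrophe tricks amount to a small text preprocessing step applied before the word is fed to the algorithm; a hypothetical helper (which letters to drop or to replace with an apostrophe remains the animator's choice):

```python
def preprocess(word, drop="", apostrophize=""):
    """Delete some letters entirely, and replace others with an apostrophe
    so that the previous letter's viseme is held in their place."""
    out = []
    for ch in word:
        if ch in drop:
            continue          # letter removed, as with weak nasals
        if ch in apostrophize:
            out.append("'")   # hold the previous viseme
        else:
            out.append(ch)
    return "".join(out)

# Dropping the final "n" yields "plugi", as entered in the demo speech.
assert preprocess("plugin", drop="n") == "plugi"
# Apostrophizing "v" yields "e'ery", in the spirit of the demo's "ev'ry".
assert preprocess("every", apostrophize="v") == "e'ery"
```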
Another important trick concerns the process of writing down the time table. Observing
a waveform, it is easy to notice points where the wave dies out and
then restarts because of a pause in the speech. Testing showed that when two or more
words are pronounced within the same wave, so that they do not generate separate waves but
share a single one, the best choice in Speak U2 is usually to join those words into
a single spoken word, even if it does not exist.
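Joining words that share a single continuous wave can be sketched in the same spirit: if the gap between the end of one word and the start of the next is below a few frames, the two entries are concatenated into one pseudo-word. The gap threshold below is illustrative:

```python
def merge_close_words(entries, max_gap=2):
    """entries: list of (word, start_frame, end_frame) tuples.
    Concatenate consecutive words separated by at most max_gap frames."""
    if not entries:
        return []
    merged = [list(entries[0])]
    for word, start, end in entries[1:]:
        if start - merged[-1][2] <= max_gap:
            merged[-1][0] += word      # join into one pseudo-word
            merged[-1][2] = end        # extend its end frame
        else:
            merged.append([word, start, end])
    return [tuple(e) for e in merged]

# "my" and "name's" share one wave, so they collapse into "myname's".
assert merge_close_words([("my", 10, 14), ("name's", 15, 25),
                          ("Anne", 40, 50)]) \
    == [("myname's", 10, 25), ("Anne", 40, 50)]
```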
5.5 The demo animation
At the beginning of the development of this Maya plug-in there was the idea
of making a demo animation, and this for several reasons. The first need was to test
Speak U2, and this was done using many audio files, with female and male voices
alike, moving simple mouth shapes or complete face models, to check not only
the quality of the algorithm and of the plug-in itself, but also to discover bugs and
shortcomings. This process allowed some parts of the Speak U2 graphic interface to be
improved, adding options and functions. Obviously this is only the first version of
Speak U2, implemented without a fully comprehensive view of Maya and all its
possibilities.
While this first reason is the most important one, it was also thought that a demo
animation would be useful to present Speak U2 itself; the animation thus became
part of the plug-in, allowing it to introduce itself to a potential buyer or at an occasion
such as the presentation of a thesis. Seeing a movie before trying Speak U2, or
before reading about it, can focus people's attention and suggest its potential.
5.5.1 Making the animation
The demo animation, entitled "She was speechless", was built ad hoc to
introduce Speak U2 as a lip synch plug-in for the English language, but the aim of this little
movie was also to present the result of the work carried out during these months.
To obtain an animation, it is always necessary to write a storyboard
collecting the information and the general schema the animation will follow. In this
case, the text of the speech was written first, because it was the most important
element of an animation meant to give prominence to the synchronism of the lips
during speaking. Moreover, beside the plain text, many annotations were added to
characterize the manner of speaking, such as facial expressions and emotions, or movements
of the head and eyes.
With this information, the audio track was recorded in a facility for
professional audio recording, and then the lip synch process was started. Having
the written text made the phase of writing down the spoken words unnecessary, and the speech
was listened to many times, moving through the sound waveform, to note the beginning
and ending frame of each word. In this task, all the tips and tricks became
important and were applied in finding the right timing table. The section below
shows the plain text spoken by the speaker, followed by the text that was
introduced, word by word, into Speak U2:
“Hi, my name is Anne. Wow! Finally I was born! And I speak too!
I can speak every language based on English grammar thanks to SpeakU2, the lip
synch plug-in for Maya four realized by Daniela Rosso and Logic Image, as thesis in
computer engineering for Politecnico University of Turin.
Yes, I speak! Look at my lips! They move, they create words without any troubles!
And what about my face?
It looks very nice, isn’t it?
Thank you! Thank you! Oh God! Thank all of you, thank you Logic Image, thank you
Mom, thank you Dad and thank you to all my family and my friends.”
In order to obtain a correct lip synch, as seen before, some of these words were linked
together or modified according to their pronunciation. Below are the words
entered in the Speak U2 interface:
“Hi myname's Anne wow finally iwasborn andi speak too I canspeak ev'ry 'aguage
based o'Eglishgrama thanks to speakUtwo the'ip synch plugi for Maya four realized by
Da'ie'a Roso an'Logic Image as tesis in copu'e egine'rig fo Po'i'ecnico u'ivesity of Turin
YesI Speak lookat mylips theyMove they create word without any troubles
Andwha aboumy face it look verynice isn'tit”
Moreover, this first part was produced by launching Speak U2 with the blendshape
dominance value set to 60%, so that keys are created for blendshapes posed at a value
of 0.6. Speak U2 was then run again on the same scene, keeping the keys created the
first time, to synchronize the second part of the speech with the blendshape dominance
fixed at 75%:
“Thankyou oh thankyou oh God thankallof you thanku Logic Image thanku Mom thaku
Dad and thanku to allmy fa'ily anmy frieds”
As can be seen, many words are linked together to create a single word,
while others have fewer letters or apostrophes substituting some of them; moreover,
some words use a single letter to create a sound instead of the correct spelling,
such as the "u" playing the role of the "thank you" sound.
All these tricks are really important because, with little work and
attention, they spare the animator from making changes and corrections after the lip synch
process by editing key times or values. In fact, to realize the demo animation, the animator
simply edited the blendshape value of the key corresponding to the word
"Wow", in order to obtain a wider open mouth suggesting a more expressive
exclamation. No other interventions were necessary to obtain a realistic lip synch
animation.
For the rest of the animation, the 3D artist added controls for the movements
of the eyes, the head and everything else concerning the characterization, in order to obtain
effects in line with the speech and the storyboard.
Figure 5.9 : Anne, character of “She was speechless”
Chapter 6
Conclusions
Speak U2 has achieved good results that were unexpected at the beginning of the
work. The obtained lip movement is very satisfactory: it is realistic
and rarely needs an intervention from the animator in order to improve some
movements.
Without a tool like Speak U2, an animator who wants to realize an animation including lip
synch can only put himself in front of a mirror and try speaking, observe his lip
movements, and then create the mouth shapes. Or, knowing the lip synch methodology
explained in this thesis, he can use visemes, but he has to translate the written
words into their phoneme sequences himself and, moreover, set all the animation keys
manually.
It is evident that both of these solutions are impracticable during a production,
because they take too long to perform. Without a precise methodology,
hours may be needed to synchronize a single word. The second approach relies on the
animator's ability to translate text into phonemes, which is a hard task even
for an expert in linguistics, and thus a job unsuited to a 3D artist without the
right know-how. Moreover, the keys to be set are innumerable, so a manual keyframing
effort would steal too many hours from the production.
A realistic estimation of Speak U2 performance leads to the following figures
for synching 15 seconds of audio: about 30 minutes for the artist to
recognize the spoken words and their start and end times; about 5 minutes of CPU time to
execute Speak U2 on a complex facial model with the 16-viseme
schema; and about 30 minutes to run Speak U2 again on adjusted
timing charts and word data structures.
In a little more than one hour, the animator can successfully synchronize 15
seconds of speech, with the possibility of improving the animation by testing different
configurations, and thus obtain a final result.
This means that in about four hours of work, an entire minute of speech is
synchronized, so roughly 2 minutes of animation are finished in a day of work. These are
very meaningful values, because they imply that a movie of the standard 90-minute
length can be synched in about one month and a half of work by a single
animator, and in a shorter time if several animators work on it.
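The production estimate above follows from simple arithmetic, reproduced here as a sanity check (the 8-hour working day is an assumption not stated in the text):

```python
# Time needed per 15 seconds of audio (in minutes of work):
recognize = 30     # writing down the spoken words and their start/end frames
cpu = 5            # Speak U2 execution, complex model, 16-viseme schema
retries = 30       # re-running Speak U2 on adjusted timing charts and words
per_15s = recognize + cpu + retries        # "a little more than one hour"

per_minute = 4 * per_15s                   # minutes of work per minute of speech
workday = 8 * 60                           # assumed 8-hour working day

assert per_15s == 65                       # ~65 minutes for 15 s of speech
assert per_minute / 60.0 < 4.5             # "about four hours" per minute
assert round(workday / float(per_minute), 1) == 1.8   # ~2 minutes per day
assert round(90 * per_minute / float(workday)) == 49  # ~1.5 months of workdays
```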
The quality reached by the animation is strictly connected to the quality of the
3D model, but in any case it is surely better than any result obtained with manual
work, because it achieves the precision of a linguistic translation and of scientific
calculation.
Obviously Speak U2, being at its first version, leaves room for improvements and
additions. A future development will surely include the option of saving all text files
to a path chosen by the user instead of a default one, and the possibility of applying
a different blendshape dominance to each single word instead of a single dominance
for the whole speech.
Moreover, to allow the characterization of a model with emotions, the use of the special
field named "emotions", already present in the user interface, is planned, in order
to set keys for other blendshapes corresponding to the positions of the eyebrows,
eyes and eyelids.
From the implementation point of view, more complex additions are possible,
such as support for the Italian language and the integration of an audio tool to
analyse the recorded speech directly, in order to compute the start and end
times of the spoken words automatically.
These two main additions require research work to study their algorithms,
so they will probably be included in future versions of Speak U2.