
Olga Krivnova, Nina Zinovieva, Leonid Zakharov, Grigoriy Strokin, Aleksey
Babkin

TTS SYNTHESIS FOR RUSSIAN LANGUAGE

Abstract
This paper describes the main principles of the Russian text-to-speech
synthesis system developed by the Speech Group of the Philological Department,
Moscow Lomonosov University, Russia. The system is organized as a combination
of two methods: concatenation on the segment level (linguistically motivated
units, allophones, are spliced together to form the synthesized speech wave)
and a rule-based method on the prosodic level (melodic and duration settings
are generated to modify the speech wave created from the allophones according
to the prosodic characteristics of the syntagma being synthesized).

1. General Architecture of the System
Our system consists of the following functional blocks (or modules):
Automatic transcriber
converts input texts into a sequence of phoneme symbols organized into
phrases or syntagmas, with attached special marks (rhythmical,
accentuational, intonational) for prosodic settings.
Allophonic coding module
converts the transcribed texts into the sequence of codes (names) of
phoneme-in-context (allophones) elements for concatenation.
The block of prosodic parametrization
assigns duration (in ms) and melodic (in semi-tones and Hertz) values
to the chosen allophones according to the intonation and prominence structure
of the phrase and to the phonetic characteristics of the allophones themselves.
The allophonic data-base (acoustic inventory)
contains a set of allophone files in wave format (prerecorded and
stored segments of natural speech).
The block of control file generation
forms the representation of the synthesized phrase as a sequence of
concatenation elements' code names with assigned duration and fundamental
frequency values.
The block of speech signal generation
extracts the chosen concatenation units from the allophonic data-base,
splices them together, smooths the junctures between these elements and
transforms them according to the control data.

The described system architecture is shown in Fig. 1. A more detailed
description of each of the blocks specified above is given below
(see also [1; 2]).
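
For illustration, the chain of blocks can be summarized by the following minimal Python skeleton. It is not part of the system; every function body is a trivial placeholder, and only the order of the calls reflects the architecture described above.

    # Hypothetical skeleton of the processing chain; all bodies are placeholders.
    from dataclasses import dataclass, field

    @dataclass
    class Phrase:
        phonemes: list = field(default_factory=list)         # output of TRANS
        allophone_codes: list = field(default_factory=list)  # allophonic coding
        durations: list = field(default_factory=list)        # prosodic parametrization
        f0_targets: list = field(default_factory=list)

    def transcribe(text):
        """Automatic transcriber: text to phrases with phonemes and prosodic marks."""
        return [Phrase(phonemes=list(text.lower()))]          # placeholder

    def code_allophones(phrase):
        """Allophonic coding: phonemes to phoneme-in-context code names."""
        phrase.allophone_codes = ["000000"] * len(phrase.phonemes)   # placeholder

    def parametrize_prosody(phrase):
        """Prosodic parametrization: assign duration and F0 values."""
        phrase.durations = [80] * len(phrase.allophone_codes)        # placeholder
        phrase.f0_targets = [(150, 150)] * len(phrase.allophone_codes)

    def build_control_file(phrase):
        """Control file generation: one line per allophone with duration and F0."""
        return "\n".join(f"{c} {d} {f0[0]} {f0[1]}" for c, d, f0 in
                         zip(phrase.allophone_codes, phrase.durations, phrase.f0_targets))

    def generate_signal(control_file):
        """Speech signal generation: splice database allophones (placeholder)."""
        return b""

    def synthesize(text):
        out = b""
        for phrase in transcribe(text):
            code_allophones(phrase)
            parametrize_prosody(phrase)
            out += generate_signal(build_control_file(phrase))
        return out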

2. Automatic Transcriber (TRANS)
Input text representation
TRANS takes as input any sequence of orthographic words separated by
spaces or punctuation marks. In each word the stressed vowel should be marked
(if the word contains any vowels).
For compound words it is allowed to have more than one stress mark, one
of which is taken as primary. Word stress placement can be done manually or
automatically. In the latter case the input text is first processed by a
morphological parser based on the dictionary presented in [3]. Besides stress
placement, this parser determines the grammatical features of the words
analyzed (syntactic class, number, gender, aspect, case, etc.). These
grammatical features are intended to be used for disambiguating grammatical
homograph forms (ru+ki ~ ruki+). We are currently working on this problem,
but it is a complex one and has not been completed yet.
[Figure 1: block diagram. Input text -> Text Preprocessing (Text
Normalization; Linguistic Analysis: syntactic and morphological parsing, with
a Lexicon) -> Automatic Accent-Intonation Transcription -> Automatic Phonemic
Transcription -> Allophonic Coding -> Prosodic Parametrization -> Control File
Generation -> Digital Signal Processing (Signal Generation / Synthesis,
drawing on the Allophonic Databases) -> Speech.]

Figure 1. Overall structure of the TTS system for Russian.

As far as the other problems of text preprocessing are concerned, we can
handle word strings with numbers and alphabetic abbreviations (by rule or on
the basis of a special dictionary of the most frequently used items with
their transcriptions), but these resources are not used in our system yet.

TRANS phoneme inventory
We use the following phoneme inventory of Russian (our transcription is
based on the Russian alphabet; for convenience, the Russian phoneme symbols
are replaced here by their Latin counterparts):
1. Stressed vowels: [A], [U], [I], [Y], [O], [E];
2. Unstressed vowels of the first degree of reduction: [a], [u], [i],
[y], [o], [e]. The last two unstressed vowels are not regularly used in
standard Russian, but sometimes they are pronounced in loan words;
3. Unstressed vowels of the second degree of reduction: [ax], [ix],
[ux] ;
4. Non-palatalized consonants: [p], [t], [k], [b], [d], [g], [f], [v],
[s], [sh], [z], [zh], [x], [h], [c], [dz], [m], [n], [r], [l];
5. Palatalized consonants: [p'], [t'], [k'], [b'], [d'], [g'], [f'],
[v'], [s'], [sh'], [z'], [zh'], [x'], [ch'], [dzh'], [m'], [n'], [r'],
[l'], [J'], [j'].
One can see that the phonemic inventory used in our system differs
slightly from that prevalent in Russian phonetic descriptions. This is
because, for the purpose of synthesis, we had to choose units that not only
represent the phonemic relationships but also have acoustic and perceptual
identity. This means that in some cases it is convenient to have different
symbols in the transcription even for phone pairs that are in no meaningful
contrast.
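
For convenience, the inventory listed above can be written down as a small Python structure; the symbols are exactly those given, while the group names are merely descriptive labels.

    # The TRANS phoneme inventory in Latin notation, as listed above.
    PHONEMES = {
        "stressed_vowels": {"A", "U", "I", "Y", "O", "E"},
        "reduced_1": {"a", "u", "i", "y", "o", "e"},   # first degree of reduction
        "reduced_2": {"ax", "ix", "ux"},               # second degree of reduction
        "hard_consonants": {"p", "t", "k", "b", "d", "g", "f", "v",
                            "s", "sh", "z", "zh", "x", "h", "c", "dz",
                            "m", "n", "r", "l"},
        "soft_consonants": {"p'", "t'", "k'", "b'", "d'", "g'", "f'", "v'",
                            "s'", "sh'", "z'", "zh'", "x'", "ch'", "dzh'",
                            "m'", "n'", "r'", "l'", "J'", "j'"},
    }

    # Sanity check of the list sizes given above.
    assert len(PHONEMES["hard_consonants"]) == 20
    assert len(PHONEMES["soft_consonants"]) == 21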

Phonological (phonemic) rules and special word lists
The standard phonological rules of TRANS implement the mappings "letter -
phoneme" and "phoneme - phoneme", which include such operations as
elimination of spelling fictions, handling the pronunciation of certain
consonant clusters, removal of the hard and soft signs of the spelling,
processing of vowel letters with the corresponding interpretation of
consonant hardness/softness, positional alternation of the voiced/unvoiced
and hard/soft features of consonants, vowel reduction, and so on. Processes
operating both within a word and across word boundaries are taken into account.
Irregular pronunciation of some word classes (e.g. loan words) and even
individual words is accounted for by using special word lists. There are 54
such lists in TRANS.

Rhythm and accentuation
TRANS assigns some degree of prominence to each vowel in the synthesized
phrase as its rhythmical feature. We distinguish three degrees of syllable
prominence within a word and four degrees of lexically stressed syllable
prominence within a phrase (1 for full clitics, 2 for functional words, 3
for non-nuclear meaningful words, 4 for the nuclear meaningful word). The
last meaningful word in the phrase is considered to carry the nuclear phrase
stress by default. Although we are able to synthesize phrases with different
focus accents, we have no rules to determine their location automatically:
it has to be done manually, by a special symbol (\) assigned to the lexically
stressed vowel (instead of the ordinary lexical stress marker).
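
A minimal sketch of how these prominence degrees could be assigned is given below; it assumes that the word class (full clitic, functional or meaningful) is already known and that a focus accent is marked with "\". The scheme of degrees follows the description above; everything else is illustrative.

    # Hypothetical sketch: assign phrase-level prominence (1-4) to words whose
    # class is already known; "\" in a word form marks a manually placed focus.
    def assign_prominence(words):
        """words: list of (wordform, word_class) pairs,
        word_class in {'clitic', 'functional', 'meaningful'}."""
        base = {"clitic": 1, "functional": 2, "meaningful": 3}
        prominence = [base[cls] for _, cls in words]
        focus = [i for i, (form, _) in enumerate(words) if "\\" in form]
        if focus:
            prominence[focus[-1]] = 4          # manually marked focus accent
        else:
            meaningful = [i for i, (_, cls) in enumerate(words) if cls == "meaningful"]
            if meaningful:
                prominence[meaningful[-1]] = 4  # last meaningful word is nuclear by default
        return prominence

    print(assign_prominence([("na", "clitic"), ("stole+", "meaningful"),
                             ("lezhi+t", "meaningful"), ("kni+ga", "meaningful")]))
    # prints [1, 3, 3, 4]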

Prosodic word grouping and phrasing
Within a phrase this process is closely connected with the rhythmization
of the word sequence and is realized in our system by a special feature,
"degree of prosodic break", assigned to the blanks between words. Here we
have three break levels: 0 - between a full clitic (e.g. a preposition) and
a meaningful word; 1 - between a functional word and a meaningful word; 2 -
between meaningful words. This information is taken into account by the
phonological rules when processing external phoneme sandhi and vowel
reduction. These types of breaks cannot be realized with a pause; they
reflect only the degree of word autonomy in the phrase sound pattern, which
is supposed to be coherent.
By contrast, prosodic phrasing is supposed to introduce a pause after
each phrase. We distinguish three degrees of pauses: short (about 250 ms),
moderate (about 400 ms) and long (about 800 ms). As far as the localization
of phrase boundaries is concerned, this problem is under investigation now;
at present their places are fully determined by punctuation marks.
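
The break degrees and pause lengths can be summarized as follows; the mapping from punctuation marks to pause classes is an illustrative assumption rather than the actual rule set.

    # Break degrees at word boundaries and phrase-final pause lengths as
    # described above; the punctuation-to-pause mapping is an assumption.
    BREAK_DEGREE = {
        ("clitic", "meaningful"): 0,
        ("functional", "meaningful"): 1,
        ("meaningful", "meaningful"): 2,
    }

    PAUSE_MS = {"short": 250, "moderate": 400, "long": 800}

    def pause_after_phrase(punct):
        """Hypothetical punctuation-driven choice of the phrase-final pause."""
        if punct in ".!?":
            return PAUSE_MS["long"]
        if punct in ";:":
            return PAUSE_MS["moderate"]
        return PAUSE_MS["short"]   # comma, dash, etc.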

Intonation
To each phrase TRANS assigns one of 7 intonation models. In our system
we use the following models: 2 models of finality; 2 models of non-finality
for affirmative sentences; 3 interrogative models (general question, special
question, comparative question); 1 model for exclamatory sentences. For all
models different positions of the melodic center in the phrase are possible
(in some cases its position is determined by the nuclear phrase stress, in
others by a focus accent mark). The choice of the appropriate model is based
on the punctuation mark and some lexical information; first of all, we
consider whether the phrase contains words of certain lexical classes (e.g.
interrogative pronouns). It is obvious that these cues are not sufficient,
and moreover the relation between punctuation and intonation models is a
rather complicated one, especially in Russian. This problem is also under
investigation, and it is clear to us that in the general case some semantic
and syntactic analysis will be needed to solve it.
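
The kind of punctuation-plus-lexical-cue decision described above can be sketched as follows; the model names, the pronoun list and the heuristics are illustrative placeholders only.

    # Hypothetical sketch of choosing an intonation model from punctuation and
    # simple lexical cues; names and word lists are placeholders.
    INTERROGATIVE_PRONOUNS = {"kto", "chto", "gde", "kogda", "pochemu", "kak"}

    def choose_model(words, punct, is_last_phrase):
        lowered = [w.lower() for w in words]
        if punct == "?":
            if any(w in INTERROGATIVE_PRONOUNS for w in lowered):
                return "special_question"
            if lowered and lowered[0] == "a":
                return "comparative_question"   # e.g. "A Masha?"
            return "general_question"
        if punct == "!":
            return "exclamation"
        return "finality_1" if is_last_phrase else "non_finality_1"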

3. Allophonic Coding Module
This module converts the phonemic symbols used in the transcribed texts
into a sequence of codes (names) of phoneme-in-context elements (allophones
in our system) for concatenation.
In defining the characteristics of the basic unit of concatenation we
proceed from the following three assumptions [7]:
1. the number of context-dependent variants is significantly larger for
vowels than for consonants;
2. different consonants are affected by contextual influence to different
degrees;
3. because of the prevalent CV-type of the Russian syllable, the left
context is more important for vowels while the right context is more
important for consonants.
According to these assumptions, and on the basis of a vast amount of
preliminary expert estimations of phoneme-size wave segments taken from
different contextual environments, we divided the set of phonemes into
classes according to the contextual susceptibility of the different phonemes.
The names of the allophones derived from the transcribed texts and the names
of the concatenation units in the database reflect these specifics. The names
of allophones are represented by six-figure codes organized in the following
way: the first figure refers to the contextual group of the coded phoneme,
the second figure gives the individual name of the phoneme within the group,
the third and fourth figures reflect the left significant context, and the
fifth and sixth the right significant context of the given phoneme.
Example: 811010 means that this allophone represents a phoneme of the
8th contextual group; the phoneme is a stressed /A/ in the position after
and before an alveolar consonant.
The allophone coding is the main but not the only procedure of the
module described. There are two more operations: splitting of some phonemes
(e.g. /d/ -> dPause + dBurst) and merging of some phonemes (e.g. /ix a/ ->
'A).
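
The six-figure code layout can be illustrated with a small decoding function; the field boundaries follow the description above, and the example reproduces the code 811010 discussed in the text.

    # Decode a six-figure allophone code: group, phoneme within the group,
    # left significant context, right significant context.
    from typing import NamedTuple

    class AllophoneCode(NamedTuple):
        group: int
        phoneme: int
        left_context: int
        right_context: int

    def decode(code):
        assert len(code) == 6 and code.isdigit()
        return AllophoneCode(int(code[0]), int(code[1]), int(code[2:4]), int(code[4:6]))

    print(decode("811010"))
    # AllophoneCode(group=8, phoneme=1, left_context=10, right_context=10),
    # i.e. a stressed /A/ of the 8th contextual group between alveolar consonants.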

4. The Block of Prosodic Parametrization
Duration
The duration rules were designed for and are applied separately to
vowels (on the basis of the quantitative model presented in [4; 5]) and
consonants. The durational patterns of vowels are formed in accordance with
their prominence levels and phonetic quality. Besides, for the stressed
vowel of the last meaningful word in a phrase, we also take into
consideration the number of syllables and the number of stressed vowels
preceding it. We also apply rules of vowel final lengthening (regardless of
the reduction level and vowel phonetic quality) before a pause. There are
also special duration rules to process sequences of vowels.
As far as the influence of consonants on vowel duration is concerned,
we account for it only in the most prominent cases, such as the position
before or after sonorants and unvoiced consonants.
The general rules for consonant duration are based on the following
factors: position of the consonant with respect to the phrase boundaries;
intervocalic vs. non-intervocalic position; position in a consonant cluster;
the prominence level of the following vowel; simple vs. complex structure of
the basic concatenation units used for the synthesis of the consonants. The
phonetic quality of consonants and coarticulation effects on duration in
clusters are also taken into account.
In our system it is possible to control the overall tempo of
pronunciation (evenly, only on consonants or only on vowels).
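
As a rough illustration of the multiplicative style of such duration adjustment, consider the sketch below; the base durations and all numeric coefficients are arbitrary placeholders, not the values used in the system.

    # Hypothetical vowel duration rule: base duration scaled by prominence,
    # final lengthening before a pause and a global tempo factor.
    PROMINENCE_FACTOR = {1: 0.7, 2: 0.85, 3: 1.0, 4: 1.25}   # placeholder values

    def vowel_duration(base_ms, prominence, before_pause, tempo=1.0):
        dur = base_ms * PROMINENCE_FACTOR[prominence]
        if before_pause:
            dur *= 1.3            # final lengthening before a pause (placeholder)
        return dur / tempo        # tempo above 1 speeds up, below 1 slows down

    print(round(vowel_duration(100, 4, before_pause=True), 1))   # 162.5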

Melodic and fundamental frequency contours
The modeling of melodic patterns is based on generating tone turning
points (or target points of tone inflection) and their parameters in the
frequency and time domains. In this respect our approach is close to the
so-called linear intonation models [6].
The rules for phrase melodic patterns usually assign two tonal values
(in semi-tones) to every allophone, as its starting and final points. If
necessary, a third value can be assigned to any point inside the allophone.
So, on the whole, any melodic contour is approximated by linear tonal
movements.
The assigned values are calculated from left to right in syllabic
cycles, that is, within the frame of a CnV sequence, where Cn is any number
of consonants (including zero) preceding the current vowel. The allophonic
decomposition of the same tone movement is more detailed for prominent
stressed syllables (especially nuclear and focused ones) and less detailed
for unstressed and weakly prominent stressed syllables.
To assign the tone values we consider the following factors: the type
of the intonation model; the position of the syllable with respect to the
melodic center (nucleus or focus) of the phrase (the center itself, to the
right of it, to the left of it); the prominence level of the vowel (for
stressed ones); the position of the syllable or syllable sequence with
respect to the phrase boundaries; the number and position of the syllable in
the syllable chain (for unstressed and atonic syllables); the phonetic
structure of the syllable; the position of the allophone with respect to the
beginning of the syllable and its vowel nucleus.
The melodic contour described in semi-tones is then transformed into Hz
values, taking into account the voiced/unvoiced feature of the consonants.
This procedure also takes into account the base tone value in Hz
characteristic of the speaker.
As far as the global tone parameters are concerned, we can control the
position of the contour within the whole voice range as well as the width of
the range.
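
The conversion between the semi-tone scale and Hz is the standard logarithmic one; a minimal sketch, assuming the speaker's base tone value in Hz is given:

    import math

    def semitones_to_hz(semitones, base_hz):
        """Tone value in semi-tones above the speaker's base tone, converted to Hz."""
        return base_hz * 2.0 ** (semitones / 12.0)

    def hz_to_semitones(f0_hz, base_hz):
        """Inverse conversion: Hz to semi-tones relative to the base tone."""
        return 12.0 * math.log2(f0_hz / base_hz)

    print(round(semitones_to_hz(5, 120), 1))   # about 160.2 Hz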

Energy
We can control the global and local trends in this parameter but the
rules for it are not incorporated in our system yet.

5. The Allophonic Data-Base (Acoustic Inventory)
The allophonic data-base is a set of allophone files in wave format,
each file named according to the allophonic coding conventions described
above. This acoustic inventory was derived from a Russian word list specially
constructed and recorded for this purpose. Each allophone wave was cut
manually from representative contextual surroundings which reflect the
influence of the contextual group in the most prominent way.
We use two acoustic inventories: one for a male voice (sampling rate
11025 Hz; 8-bit samples) with 158 consonant allophones and 530 vowels; the
other for a female voice (sampling rate 22050 Hz; 16-bit samples) with 200
consonant allophones and around 1000 vowels.
All vocal sounds in the data-bases are marked semi-automatically
according to their pitch periods; these pitch marks are used in generating
the output speech signal.
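
A sketch of how such an inventory can be accessed is given below; it assumes one WAV file per allophone, named by its six-figure code (e.g. 811010.wav), which is a simplification made for illustration.

    # Hypothetical loader for the acoustic inventory: one WAV file per allophone.
    import wave
    from pathlib import Path

    def load_inventory(root):
        """Map allophone code to raw PCM frames for every .wav file under root."""
        inventory = {}
        for path in Path(root).glob("*.wav"):
            with wave.open(str(path), "rb") as wf:
                inventory[path.stem] = wf.readframes(wf.getnframes())
        return inventory

    # inventory = load_inventory("allophones/female")   # e.g. 22050 Hz, 16-bit files
    # wave_811010 = inventory["811010"]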

6. Control File Generation
This module converts the transcribed phrase into a sequence of allophone
code names with assigned duration and fundamental frequency values. Here is
an example of the control file used to generate the phrase "Zdravstvuyte,
dorogie druz'ya!" (Hello, dear friends!) with the female voice.


|Allophone code |Duration (% cons. / ms vow.) |F0 at beginning (Hz) |F0 at end (Hz) |Energy |
|220301 |85 |150 |178 | |
|000100 |85 |178 |211 | |
|020001 |85 |0 |0 | |
|610101 |90 |0 |0 | |
|610707 |90 |0 |0 | |
|811510 |135 |251 |188 |f265 40 |
|210101 |55 |0 |0 | |
|100000 |60 |0 |0 | |
|120001 |60 |0 |0 | |
|510102 |65 |173 |165 | |
|991116 |45 |165 |158 | |
|710201 |60 |158 |157 | |
|100000 |65 |0 |0 | |
|150004 |65 |0 |0 | |
|981610 |45 |156 |154 | |
|000100 |70 |154 |153 | |
|020001 |70 |0 |0 | |
|971015 |85 |153 |152 | |
|610707 |80 |0 |0 | |
|911516 |112 |151 |150 | |
|000100 |80 |150 |150 | |
|060004 |90 |0 |0 | |
|831616 |130 |150 |150 | |
|911610 |105 |154 |158 | |
|000100 |70 |158 |155 | |
|020001 |70 |0 |0 | |
|610102 |75 |0 |0 | |
|610707 |75 |0 |0 | |
|921516 |97 |152 |150 | |
|320304 |80 |150 |150 | |
|710404 |95 |150 |150 | |
|811618 |227 |150 |133 | |
|100000 |800 |0 |0 | |
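
Rows of such a control file are easy to read back into a structured form; the sketch below assumes pipe-delimited fields in the column order shown above (code, duration, F0 at the beginning, F0 at the end, optional energy mark).

    # Parse one control-file row, following the column order of the table above.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ControlEntry:
        code: str                     # six-figure allophone code
        duration: int                 # % for consonants, ms for vowels
        f0_begin: int                 # F0 at the beginning (0 = unvoiced)
        f0_end: int                   # F0 at the end
        energy: Optional[str] = None  # optional energy mark

    def parse_row(row):
        fields = [f.strip() for f in row.strip().strip("|").split("|")]
        code, dur, f0b, f0e = fields[:4]
        energy = fields[4] if len(fields) > 4 and fields[4] else None
        return ControlEntry(code, int(dur), int(f0b), int(f0e), energy)

    print(parse_row("|811510 |135 |251 |188 |f265 40 |"))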


7. Speech Signal Generation
Signal generation is implemented according to the phrase control file,
whose structure was described above. The necessary allophones are extracted
from the database and spliced together.
To transform the base allophones to the duration and fundamental
frequency values given by the phrase control file, we use procedures that
are close to the PSOLA technique in the time domain [8].
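
As a rough illustration of the splicing and juncture-smoothing step only (the PSOLA-like pitch and duration modification itself is not reproduced here), the sketch below concatenates allophone waveforms with a short linear crossfade; the sample representation and crossfade length are assumptions.

    # Splice allophone waveforms with a short linear crossfade at each juncture.
    def splice(chunks, fade=64):
        """chunks: list of lists of float samples; returns one spliced list."""
        out = []
        for chunk in chunks:
            if out and fade and len(out) >= fade and len(chunk) >= fade:
                for i in range(fade):
                    w = i / fade   # weight goes from 0 to 1 across the overlap
                    out[-fade + i] = out[-fade + i] * (1.0 - w) + chunk[i] * w
                out.extend(chunk[fade:])
            else:
                out.extend(chunk)
        return out
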
Demo examples of Russian speech synthesized by our system (in wav
format) can be found on the Internet at
http://isabase.philol.msu.ru/SpeechGroup

References
1. N. V. Zinovieva, O. F. Krivnova. Lingvisticheskoe obespechenie
programmnogo sinteza rechi (Linguistic Support for Programmed Speech
Synthesis) // Vestnik MGU, Ser. 9, Filologia, No. 3. Moscow, 1994.
2. N. V. Zinovieva, O. F. Krivnova, L. M. Zakharov. Programmniy sintez
russkoy rechi (sintezator "Agafon") (Automatic Speech Synthesis for the
Russian Language: the "Agafon" Synthesizer) // Computational Linguistics and
its Applications. International Workshop "Dialogue'95". Kazan, May 31 - June
4, 1995.
3. A. A. Zaliznjak. Grammaticheskiy slovar' russkogo yazyka (Grammatical
Dictionary of the Russian Language). Moscow, Russkij Yazyk, 1977.
4. O. F. Krivnova. Kolichestvennaya ocenka vozdeystviya suprasegmentnih
faktorov na dlitel'nost' udarnih glasnih v sintagme (A Quantitative Model of
Stressed Vowel Duration under the Influence of Suprasegmental Factors) //
Proceedings of the 12th All-Union Seminar on Automatic Speech Recognition
and Synthesis. Novosibirsk, 1984.
5. O. F. Krivnova. Durational Patterns of Russian Syntagma: The Standard
Scheme and its Modifications // Proc. of the XIth International Congress of
Phonetic Sciences. Tallinn, 1987.
6. Session: The Structure of Intonation - Linear or Superpositional //
Proc. of the XIIIth International Congress of Phonetic Sciences. Stockholm,
1995.
7. N. V. Zinovieva. Phonetically Sufficient Allophonic Database for
Concatenation Synthesis of Russian Speech // Proc. of the XIIIth
International Congress of Phonetic Sciences. Stockholm, 1995.
8. F. Charpentier, E. Moulines. Pitch-synchronous waveform processing
techniques for text-to-speech synthesis using diphones // Eurospeech'89.
Vol. 2. Paris, 1989.