UDC 621.391

O. F. Krivnova, N. V. Zinovieva, L. M. Zaharov, G. S. Strokin, A. V. Babkin


TTS SYNTHESIS FOR THE RUSSIAN LANGUAGE
(development experience)


Lomonosov Moscow State University
Russia, 119899 Moscow, Vorobyovi Gori, 1-st Building of the
Humanities
Phone: (095) 939-26-01; Fax: (095) 939-55-96
E-mail: okri, leon, grg@philol.msu.ru, avb@science.park.ru


Abstract
This paper describes the main principles of the Russian text-to-speech
synthesis system developed by the Speech Group of the Philological
Department, Lomonosov Moscow State University, Russia. The system combines
two methods: concatenation at the segment level (linguistically motivated
units, allophones, are spliced together to form the synthesized speech wave)
and a rule-based method at the prosodic level, which modifies the speech
wave built from the allophones according to the prosodic characteristics of
the phrase being synthesized.

1. General architecture of the system.
Our system consists of the following functional blocks (or modules):
- Text normalization module
- Automatic transcriber
- Allophonic coding module
- The block of prosodic parametrization
- The allophonic database (acoustic inventory)
- The block of control file generation
- The block of speech signal generation
A more detailed description of each of these blocks is given below (see
also /1-7/).
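
As a rough illustration of how these blocks can be chained, the Python sketch below passes a phrase through placeholder stages in the order just listed; the names are purely illustrative and do not correspond to the actual implementation.

# A hypothetical pipeline skeleton mirroring the block structure above.
# Every stage here is a named placeholder; the real modules are described
# in the following sections.

from typing import Any, Callable, List, Tuple

Stage = Callable[[Any], Any]

PIPELINE: List[Tuple[str, Stage]] = [
    ("text normalization",        lambda d: d),  # orthographic text with stress marks
    ("accent-intonation (AITR)",  lambda d: d),  # breaks, prominence, intonation model
    ("phonemic transcription",    lambda d: d),  # phoneme string
    ("allophonic coding",         lambda d: d),  # phoneme-in-context code names
    ("prosodic parametrization",  lambda d: d),  # durations, F0 targets
    ("control file generation",   lambda d: d),  # per-allophone control records
    ("speech signal generation",  lambda d: d),  # concatenation + prosodic modification
]

def synthesize(text: str) -> Any:
    data: Any = text
    for name, stage in PIPELINE:
        data = stage(data)   # each block consumes the previous block's output
    return data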

2. Text Preprocessing.
The synthesis of an arbitrary phrase can be split into two main stages. In
the first stage, the task is to generate the phonetic transcription of the
phrase, including its intonational, accentual and rhythmical
characteristics. In the second stage, the required acoustic signal is
created on the basis of this phonetic representation.

2.1. Input text normalization.
Our system takes as input any sequence of orthographic words separated by
spaces or punctuation marks. To obtain an adequate transcription, the
stressed vowel should be marked in each word (if the word contains any
vowels). Compound words should carry more than one stress mark, one of
which is taken as primary. Word stress placement can be performed manually
or automatically. In the latter case the input text is first processed by a
morphological parser. Besides stress placement, this parser determines the
grammatical features of the analyzed words (syntactic class, number,
gender, aspect, case, etc.). These grammatical features will be used to
disambiguate grammatical homographs (ru+ki ~ ruki+).

As far as other problems of text preprocessing are concerned, we can handle
word strings containing numbers and alphabetic abbreviations (by rule or on
the basis of a special dictionary of the most frequently used items with
their transcriptions).
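
A minimal sketch of this kind of normalization is given below; the abbreviation and digit dictionaries are tiny illustrative stand-ins, and a real normalizer would expand multi-digit numerals as proper Russian number words.

ABBREVIATIONS = {"т.е.": "то есть", "и т.д.": "и так далее"}   # toy dictionary
DIGITS = {"0": "ноль", "1": "один", "2": "два", "3": "три", "4": "четыре",
          "5": "пять", "6": "шесть", "7": "семь", "8": "восемь", "9": "девять"}

def normalize(text: str) -> str:
    # replace known abbreviations from the dictionary
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # naive digit-by-digit expansion (illustration only)
    return "".join(DIGITS[ch] if ch.isdigit() else ch for ch in text)

print(normalize("в 3 часа"))   # -> "в три часа"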

2.2. Automatic accent-intonation transcriber (AITR).
Prosodic word grouping and phrasing.
Within a phrase, this process is closely connected with the rhythmization
of the word sequence and is realized in our system through a special
feature, the degree of prosodic break, assigned to the blanks between
words. We distinguish three break levels: 0 - between a full clitic (e.g. a
preposition) and a meaningful word; 1 - between a functional word and a
meaningful word; 2 - between meaningful words. This information is taken
into account in the phonological rules when processing external phoneme
sandhi and vowel reduction.
Prosodic phrasing implies a pause after each phrase. We distinguish three
degrees of pauses: short (about 250 ms), moderate (about 400 ms) and long
(about 800 ms). As for the localization of phrase boundaries, this problem
is still under investigation; at present phrase boundaries are determined
entirely by punctuation marks.
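
A hedged sketch of this labelling is shown below; the clitic and functional-word lists are illustrative assumptions, and a clitic is simply assumed to attach to the word on its right.

CLITICS = {"в", "на", "за", "по"}           # full clitics (prepositions), toy list
FUNCTION_WORDS = {"и", "но", "бы", "же"}    # other functional words, toy list
PAUSE_MS = {"short": 250, "moderate": 400, "long": 800}   # phrase-final pauses

def break_degree(left_word: str) -> int:
    """Degree of prosodic break after left_word:
    0 - full clitic + meaningful word, 1 - functional word + meaningful word,
    2 - between meaningful words."""
    if left_word in CLITICS:
        return 0
    if left_word in FUNCTION_WORDS:
        return 1
    return 2

words = "мы пошли в лес".split()
print([break_degree(w) for w in words[:-1]])   # -> [2, 2, 0]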
Intonation.
To each phrase AITR assigns one of 7 intonation models. We use the
following models: 2 models of finality and 2 models of non-finality for
affirmative sentences, 3 interrogative models (general question, special
question, comparative question), and 1 model for exclamatory sentences. All
models allow different positions of the melodic center within the phrase
(in some cases its position is determined by the nuclear phrase stress, in
others by a focus accent mark). The choice of the appropriate model is
based on the punctuation mark and some lexical information. These cues are
obviously not sufficient, and moreover the relation between punctuation and
intonation models is rather complex, especially in Russian.
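
The sketch below illustrates how such a choice could be driven by the final punctuation mark and a simple lexical cue. It covers only some of the 7 models; the question-word list and the mapping itself are illustrative assumptions, much cruder than the relation discussed above.

QUESTION_WORDS = {"кто", "что", "где", "когда", "почему", "как"}   # toy list

def choose_intonation_model(phrase: str, phrase_final: bool = True) -> str:
    text = phrase.strip()
    words = text.rstrip(".?!").lower().split()
    if text.endswith("?"):
        # a question word suggests the special-question model, otherwise the general one
        if any(w in QUESTION_WORDS for w in words):
            return "special question"
        return "general question"
    if text.endswith("!"):
        return "exclamation"
    # phrases before non-final punctuation get a non-finality model
    return "finality" if phrase_final else "non-finality"

print(choose_intonation_model("Кто пришел?"))   # -> special question
print(choose_intonation_model("Ты дома?"))      # -> general question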
Rhythm and accentuation.
AITR assigns a degree of prominence to each vowel in the synthesized phrase
as its rhythmical feature. We distinguish three degrees of syllable
prominence within a word and four degrees of prominence for lexically
stressed syllables within a phrase (1 for full clitics, 2 for functional
words, 3 for non-nuclear meaningful words, 4 for the nuclear meaningful
word). By default, the last meaningful word in the phrase is considered to
carry the nuclear phrase stress. Although we are able to synthesize phrases
with different focus accents, we have no rules yet to determine their
localization automatically: it has to be done manually.
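
A minimal sketch of these phrase-level degrees is given below; the word lists are illustrative, and the focus position, as noted above, has to be supplied manually.

CLITICS = {"в", "на", "за", "по"}          # toy lists, as before
FUNCTION_WORDS = {"и", "но", "бы", "же"}

def phrase_prominence(words, focus=None):
    """1 - full clitic, 2 - functional word, 3 - non-nuclear meaningful word,
    4 - nuclear word (last meaningful word by default, or the focused one)."""
    levels = [1 if w in CLITICS else 2 if w in FUNCTION_WORDS else 3 for w in words]
    meaningful = [i for i, lvl in enumerate(levels) if lvl == 3]
    nucleus = focus if focus is not None else (meaningful[-1] if meaningful else None)
    if nucleus is not None:
        levels[nucleus] = 4
    return levels

print(phrase_prominence("мы пошли в лес".split()))   # -> [3, 3, 1, 4]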

2.3. Automatic phonemic transcriber (PHTR).
The phoneme inventory includes 56 units. It differs slightly from the
inventory prevalent in Russian phonetic descriptions: in some cases it is
convenient to have different transcription symbols even for phone pairs
that are not in meaningful contrast.
The standard phonological rules of PHTR implement the "letter - phoneme"
and "phoneme - phoneme" mappings, which include such operations as the
elimination of spelling fictions, handling the pronunciation of certain
consonant clusters, removal of the hard and soft signs of the spelling, and
so on. Both word-internal and cross-word processes are taken into account.
Irregular pronunciation of some word classes (e.g. loan words) and even of
individual words is handled with special word lists. Currently there are 54
such lists in PHTR.
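
As a toy illustration of the rule types involved (not the actual PHTR rule set), the sketch below removes the hard and soft signs as separate symbols and applies word-final obstruent devoicing.

DEVOICE = {"б": "п", "д": "т", "г": "к", "в": "ф", "з": "с", "ж": "ш"}

def toy_transcribe(word: str) -> list:
    symbols = [ch for ch in word.lower() if ch not in "ъь"]   # drop hard/soft signs
    if symbols and symbols[-1] in DEVOICE:                    # word-final devoicing
        symbols[-1] = DEVOICE[symbols[-1]]
    return symbols

print(toy_transcribe("год"))   # -> ['г', 'о', 'т']
print(toy_transcribe("дуб"))   # -> ['д', 'у', 'п']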

2.4. The block of prosodic parametrization.
Duration.
The duration rules were designed for, and are applied separately to, vowels
and consonants. The durational patterns of vowels are formed in accordance
with their prominence levels, phonetic quality, position relative to phrase
boundaries, and some other factors. The general rules for consonant
duration are also based on several factors: position relative to the phrase
boundaries; intervocalic vs. non-intervocalic position; position within a
consonant cluster; the prominence level of the following vowel, and so on.
The system also makes it possible to control the overall tempo of
pronunciation (evenly, on consonants only, or on vowels only).
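
A hedged sketch of such factor-based vowel duration is shown below: a base value per vowel quality scaled by prominence level, phrase-final lengthening and an overall tempo factor. All numeric values are illustrative assumptions, not the system's actual coefficients.

BASE_MS = {"а": 90, "о": 85, "у": 80, "э": 85, "и": 75, "ы": 75}   # toy base durations
PROMINENCE_SCALE = {0: 0.6, 1: 0.75, 2: 0.85, 3: 1.0, 4: 1.2}      # toy scaling factors

def vowel_duration(vowel: str, prominence: int,
                   phrase_final: bool = False, tempo: float = 1.0) -> float:
    dur = BASE_MS[vowel] * PROMINENCE_SCALE[prominence]
    if phrase_final:
        dur *= 1.3            # pre-boundary lengthening
    return dur / tempo        # overall tempo control

print(vowel_duration("а", 4, phrase_final=True))   # nuclear, phrase-final vowel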
Melody and fundamental frequency.
Melodic patterns are modeled by generating tone turning points (target
points of tone inflection) and their parameters in the frequency and time
domains. The rules for phrase melodic patterns usually assign two tonal
values (in semitones) to every allophone, as its starting and final points.
If necessary, a third value can be assigned to any point inside the
allophone. Thus, on the whole, any melodic contour is approximated by
linear tonal movements. The melodic contour described in semitones is then
transformed into Hz values, taking into account the voiced/unvoiced feature
of the consonants. This procedure also takes into account the speaker's
base pitch value in Hz.
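
The conversion from semitones to Hz relative to the speaker's base pitch follows the standard semitone formula; below is a minimal sketch of that conversion and of the linear interpolation between two allophone targets (the base frequency value is an illustrative assumption).

def semitones_to_hz(st: float, base_hz: float = 110.0) -> float:
    """A tonal value of st semitones above the speaker's base pitch."""
    return base_hz * 2.0 ** (st / 12.0)

def interpolate_f0(st_start: float, st_end: float, n_points: int,
                   base_hz: float = 110.0):
    """Linear tonal movement between the starting and final targets."""
    return [semitones_to_hz(st_start + (st_end - st_start) * i / (n_points - 1), base_hz)
            for i in range(n_points)]

print(round(semitones_to_hz(12.0)))                     # one octave above base: 220 Hz
print([round(f) for f in interpolate_f0(0.0, 4.0, 5)])  # -> [110, 117, 123, 131, 139]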
As far as energy is concerned, we can control the global and local trends
in this parameter, but the corresponding rules are not yet incorporated in
our system.


2.5. Allophonic coding module.
This module converts the phonemic symbols of the transcribed text into a
sequence of codes (names) of phoneme-in-context elements (allophones in our
system) for concatenation.
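
A minimal sketch of phoneme-in-context coding is given below: each phoneme receives a code name that records the classes of its left and right neighbours. The naming scheme (phoneme plus V/C/# context classes) is an illustrative assumption, not the actual naming convention of the database.

VOWELS = set("аоуэиы")

def context_class(ph):
    if ph is None:
        return "#"                         # phrase boundary
    return "V" if ph in VOWELS else "C"    # vowel vs. consonant neighbour

def allophone_codes(phonemes):
    padded = [None] + list(phonemes) + [None]
    return [f"{ph}_{context_class(left)}{context_class(right)}"
            for left, ph, right in zip(padded, phonemes, padded[2:])]

print(allophone_codes(list("дом")))   # -> ['д_#V', 'о_CC', 'м_V#']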

3. Acoustic signal generation.
3.1. Allophonic database.
The allophonic database is a set of allophone files in wave format, each
file named according to the allophone itself and its phonetic context. We
use two acoustic inventories: one for a male voice (sampling rate 11025 Hz,
8-bit samples) with 158 consonant allophones and 530 vowel allophones; the
other for a female voice (sampling rate 22050 Hz, 16-bit samples) with 200
consonant allophones and around 1000 vowel allophones. All voiced sounds in
the databases are marked semi-automatically according to their pitch
periods in order to generate the output speech signal.

3.2. Generation of control file.
This module converts the transcribed phrase into a sequence of allophone
code names with assigned duration, energy and fundamental frequency values.
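
One record of such a control file might look like the sketch below; the field layout is an illustrative assumption based on the parameters named above.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ControlRecord:
    allophone: str                  # code name in the acoustic inventory
    duration_ms: float              # target duration from the duration rules
    f0_targets_hz: List[float] = field(default_factory=list)   # start/(mid)/end targets
    energy: float = 1.0             # relative energy (energy rules not yet incorporated)

control_file = [
    ControlRecord("д_#V", 55.0,  [120.0, 118.0]),
    ControlRecord("о_CC", 110.0, [118.0, 105.0]),
    ControlRecord("м_V#", 70.0,  [105.0, 95.0]),
]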


3.3. Speech signal generation.
Signal generation is carried out according to the phrase control file,
whose structure was described above. The required allophones are extracted
from the database and spliced together. To adjust the base allophones to
the duration and fundamental frequency values given by the phrase control
file, we use procedures close to the PSOLA technique in the time domain
/8/.
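
A very reduced sketch of the time-domain, PSOLA-like idea is given below: pitch-synchronous frames are cut around the stored pitch marks, windowed, and overlap-added at a new spacing to reach the target F0, with frames repeated or dropped to reach the target duration. This is only an illustration of the principle; the actual procedures (and PSOLA proper, see /8/) take more care with window length and energy.

import numpy as np

def psola_like(signal: np.ndarray, pitch_marks, target_period: int,
               target_length: int) -> np.ndarray:
    """Crude pitch/duration modification by pitch-synchronous overlap-add."""
    out = np.zeros(target_length)
    src_periods = len(pitch_marks) - 1
    n_out = max(1, target_length // target_period)
    for k in range(n_out):
        # pick the source period whose relative position matches the output one
        i = min(int(k * src_periods / n_out), src_periods - 1)
        frame = signal[pitch_marks[i]:pitch_marks[i + 1]].astype(float)
        frame = frame * np.hanning(len(frame))        # taper before overlap-add
        pos = k * target_period
        end = min(pos + len(frame), target_length)
        out[pos:end] += frame[:end - pos]
    return out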

Demo examples of Russian speech synthesized by our system (in wav format)
can be found on the Internet at http://isabase.philol.msu.ru/SpeechGroup.


REFERENCES

1. A.V.Babkin. Avtomaticheskiy sintez rechi - problemi i metodi generacii
rechevogo signala // Computational Linguistics and its Applications.
International Workshop "Dialogue98". M., 1998.
2. N.V.Zinovieva, O.F.Krivnova, L.M.Zaharov. Programmniy sintez russkoy
rechi (sintezator "Agafon") // Computational Linguistics and its
Applications. International Workshop "Dialogue95". Kazan, 1995.
3. L.M.Zaharov. Transkripciya tekstov pri sinteze i analize russkoy rechi //
Computational Linguistics and its Applications. International Workshop
"Dialogue96". Kazan, 1996.
4. L.M.Zaharov. Transkripciya tekstov pri sinteze i analize russkoy rechi:
netrivial'nie sluchai // Computational Linguistics and its Applications.
International Workshop "Dialogue97". M., 1997.
5. O.F.Krivnova. Modelirovanie i sintez frazovoy intonacii na osnove osobih
tochek tonal'nogo kontura // Computational Linguistics and its
Applications. International Workshop "Dialogue97". M., 1997.
6. O.F.Krivnova. Avtomaticheskiy sintez russkoy rechi po proizvol'nomu
tekstu (vtoraya versiya c zhenskim golosom) // Computational Linguistics
and its Applications. International Workshop "Dialogue98". M., 1998.
7. G.S.Strokin. Instrumentariy dl'a razrabotki sistemi sinteza rechi //
Computational Linguistics and its Applications. International Workshop
"Dialogue98". M., 1998.
8. T.Dutoit. An Introduction to Text-to-Speech Synthesis. Dordrecht-Boston-
London. 1997.