
Creation of Russian Speech Databases: Design, Processing,
Development Tools

Vladimir L. Arlazarov (1), Dimitri S. Bogdanov (1), Olga F. Krivnova (2),
Aleksandr Ya. Podrabinovitch (1)

(1) Institute for System Analysis, Russian Academy of Sciences, 9 prospect
60-letiya Oktyabrya, Moscow, Russia
bogdanov@cs.isa.ru
(2) M.V. Lomonosov Moscow State University, Faculty of Philology, Leninskiye
Gory, Moscow, 119992, Russia
okri@philol.msu.ru


Abstract

This paper addresses several aspects of the process of creating Russian
speech databases. Problems of phonetic notation are discussed, and the
procedure for selecting text material with the required phonetic
characteristics is described. Two Russian speech corpora are also described.

1. Introduction

When doing speech research or developing components of speech technologies
such as text-to-speech or speech recognition systems, the researcher needs
access to large sets of annotated and labeled speech data. The quality of
speech recognition systems based on modern statistical algorithms depends
directly on the size and phonetic richness of such sets. A researcher
following the so-called engineering approach to speech research needs to
study the fine structure of the speech signal using a large amount of labeled
speech containing a variety of speech events. The modern approach to building
text-to-speech systems, based on the concatenation of speech fragments, also
demands the availability of a large speech corpus.
The importance of access to a large amount of correctly annotated speech
data is recognized not only among people working in the field of speech
recognition, but throughout the speech research community as a whole.
For this reason, an increasing number of professionals engaged in speech
research are now involved in projects aimed at the creation of large-scale
speech databases.
It is impossible to imagine the development of modern technologies for
automatic speech recognition and synthesis without the use of extensive
speech databases. Speech corpora are important not only for speech
technology. The problems of describing and modeling the acoustic side of
speech, taking into account its variability across speech situations, are of
scientific interest in themselves and often arise in phonetic research
dealing with the analysis of oral speech. Experts point out that it is
precisely the problems of creating and using high-quality speech corpora that
bring together theoretical phonetic research (including speech acoustics) and
applied projects, two fields that often do not share common interests.
This paper summarizes the authors' experience in creating Russian speech
databases for different purposes. We refer here mainly to our most
significant work in this field, the creation of the large-scale Russian
speech corpus RuSpeech. This corpus was produced for speech recognition
projects in cooperation with the speech group of Intel Corporation.

2. Speech Databases: structure and classification


Speech Fragment

Speech data in a corpus are usually presented as a number of speech
fragments. We define a speech fragment as a fragment of oral speech,
represented as a digitally recorded sound wave and accompanied by additional
information (its "annotation"). The minimal necessary information about a
fragment is its orthographic form (spelling) and a phonetic transcription
that shows how the fragment sounds. Often, though not always, the annotation
also contains acoustic labeling of the speech fragment, i.e. data that show
the time localization of additional acoustic events within the fragment.
These acoustic events include boundaries between sounds, changes of tonal
movement, physical pauses, etc. The choice of phonetic events to be
registered in a speech corpus depends on the aim of the research for which
the corpus is being created. This choice is usually made in advance, when the
corpus is being designed.
We keep a speech fragment in storage as a pair of files. The first one
contains the digital representation of the recorded speech. The second is a
text file with the annotation of the recorded speech (including text,
pronunciation, type of recording and labeling, speaker personal data, etc.).
This file is named the "info file". It generally has a keyword structure and
consists of strings holding the values of keyword parameters.
The info file of a speech fragment can include the following data:
. text of recorded utterance

. expected transcription of pronounced text

. real transcription of recorded utterance

. boundaries of phonetic and/or acoustic segments

. speaker's personal data (name, age, gender, accent, etc.)

. information on recording environment (microphone type, type of sound
card, studio characteristics, etc.)

. prosodic annotation

. other specific characteristics of speech and/or speaker


Structure of Speech Database

Usually a speech database consists of a number of sets collected for
different purposes, such as training of algorithms, testing of systems,
working with phonetically rich or phonetically representative collections
of speech, etc.
While developing a Russian speech database for a speech recognition
project, we suggest that it consist of four sets:
. Train. A set for training of speech recognition algorithms.

. Develop. A set for checking the results of algorithm training. This set
is intended to be used by the developers of the algorithms.

. Test. A set to be used for testing and evaluating the quality of the
entire system.

. Peculiarity. A set of utterances which cannot be considered close to
normal pronunciation. Here we collect, for example, recordings of speakers
with congenital speech defects as well as individual utterances of regular
speakers pronounced with poor articulation.

Each set may consist of a number of subsets containing collections of
utterances pronounced by particular speakers. The typical layout of a speech
database is shown below.

Speech Corpus
  Set 1 ... Set q
    speaker 1 ... speaker m
      speech fragment 1 ... speech fragment n
        wave file
        info file
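
Such a layout maps naturally onto a directory tree. The following minimal
Python sketch assumes a hypothetical layout corpus/<set>/<speaker>/<fragment>.wav
with the info file stored alongside as <fragment>.txt; the naming scheme of a
real corpus may differ.

# A minimal sketch assuming the hypothetical directory layout described above.
from pathlib import Path

def iter_fragments(corpus_root):
    """Yield (set_name, speaker_id, wav_path, info_path) for every fragment."""
    for set_dir in sorted(p for p in Path(corpus_root).iterdir() if p.is_dir()):
        for speaker_dir in sorted(p for p in set_dir.iterdir() if p.is_dir()):
            for wav_path in sorted(speaker_dir.glob("*.wav")):
                info_path = wav_path.with_suffix(".txt")  # info file assumed alongside
                if info_path.exists():
                    yield set_dir.name, speaker_dir.name, wav_path, info_path

# Example: count fragments per set
counts = {}
for set_name, _, _, _ in iter_fragments("corpus"):
    counts[set_name] = counts.get(set_name, 0) + 1
print(counts)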

Classification of Speech Databases

Speech databases can be classified according to the following features:
. by intended usage: special, common or representative, for educational
purposes;

. by type of speech: discrete speech, continuous speech, spontaneous
speech, special dialogs;

. by type of speech signal: laboratory speech, office speech, phone
speech, mobile phone speech;

. by type of annotation: spelling, phonemic/phonetic transcription,
prosodic transcription, acoustic/phonetic segmentation of signal, other
types of linguistic annotation or comments;

. by type of signal information included besides speech signal: simple,
multimodal, special;

. by type of balance of phonetic/acoustic units: natural distribution,
uniform distribution, others.


Access to speech databases

In the 1990s, special coordination centers were organized in the USA and
Europe with the aim of gathering, storing, distributing and creating
standardized speech and language resources open to general use. Among them
are the LDC (Linguistic Data Consortium), CSLU (Center for Spoken Language
Understanding, Oregon Graduate Institute), ELRA (European Language Resources
Association), the SpeechDat project and others.
It should be mentioned that the presence of Russian speech resources at
these centers is incidental rather than systematic.
Several vital issues should be mentioned when considering the creation
and usability of speech databases:
. development requires significant financial expense;

. cooperative efforts of various specialists in linguistics, mathematics,
acoustics, speech processing, etc. are necessary;

. speech databases should be accessible and suitable for multiple purposes;

. standardization of formats improves the usability of speech databases;

. the availability of easy-to-use software for recording, processing and
verification of utterance items reduces expenses and increases the quality
of a speech corpus.

Nowadays the problem of creating large-scale, diverse, multilayer,
phonetically rich Russian speech databases is being brought to the forefront
of speech science. We also need convenient and effective tools for collecting
speech databases and for their development and use in speech research and in
voice-driven software.

3. Technology for creating Russian speech databases

The creation of a speech corpus can be considered a special technological
workflow [1] which consists of the following stages: design of the speech
corpus, preparation of text material, development of software tools,
recording of speakers, checking the technical quality of the speech
recordings, annotation of utterance items, verification of the annotation,
and final production.

Choosing speech corpus characteristics

The first stage in the creation of a speech corpus is its design. At this
stage the following essential questions should be answered:
. speakers' casting criteria (number of speakers in each set, gender and
age distribution of speakers, presence of various dialects, educational
level, social status, professions, and so on)

. choice of text material characteristics: representative or special texts,
themes, style of the texts

. type of speech: pronunciation of keyword commands, isolated speech,
continuous speech, reading, spontaneous speech, dialog

. type of distribution of acoustic/phonetic items in the speech database:
naturally representative, proportionally representative, richness of item
combinations

. distribution of text material per set and per speaker

. number of recording sessions per speaker

. distribution of text material among the train, test and other parts of the
speech corpus

. type of phonetic and linguistic annotation

Once these questions have been answered, the following tasks should be
accomplished:
. preparation of a convenient phonetic support system (phonetic notation,
detailed instructions on phonetic labeling and/or transcribing);

. selection of text material for all sets of the corpus with the expected
phonetic characteristics;

. recruiting of speakers.


Phonetic notation

First we should select a transcription code system which allows
transcribing all sentences from the text material into normal (expected)
phoneme sequences. Fortunately, in Russian the construction of pronunciation
from spelling uses a limited set of rules. Thanks to this, the transcription
from text to phoneme sequence can be automated. We use special linguistic
software which automatically transcribes Russian texts. The automatic
transcriber was developed by the speech group of the Faculty of Philology of
Moscow State University [2, 8]. The availability of such software is very
important for the creation of large-scale speech corpora. Using it, we can
estimate in advance the expected phonetic characteristics of the speech
corpus we are working on.
In our work we use the Russian transcription code system of Professor
R. I. Avanesov [3]. This phonetic code is well known and quite familiar to
specialists in linguistics and phonetics. It should be pointed out that we
considered several alternative phonetic notations for the transcription and
labeling of speech fragments. We could have accepted a simpler phonetic
notation, and such a notation might be more suitable for the development of a
speech recognition system. However, a reduced phonetic alphabet is not
convenient in linguistic research. Speech research, as well as the
development of automatic text-to-speech systems, demands a more detailed
differentiation of phonemes. Therefore we decided to use a rather detailed
phoneme code system. Depending on the aims of using the speech corpora in
speech recognition projects, the transcription code system can be simplified
by reducing the number of distinguished phonemes, and the corpora can then be
automatically decoded into the new alphabet, as sketched below. Results of
such a simplification were discussed in [4].
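
As an illustration of such decoding, the following minimal Python sketch maps
a detailed phoneme alphabet onto a reduced one with a lookup table. The table
shown is purely hypothetical; the real correspondence between the detailed
code and a reduced alphabet has to be defined by phoneticians for the given
project.

# A purely illustrative mapping from a detailed phoneme code to a reduced one.
REDUCTION_MAP = {
    "a0": "a", "a1": "a", "a2": "a",   # hypothetical stressed/unstressed vowel variants
    "o0": "o", "o1": "o", "o2": "o",
}

def reduce_transcription(phonemes, mapping=REDUCTION_MAP):
    """Map each detailed phoneme symbol to its reduced-alphabet equivalent."""
    return [mapping.get(p, p) for p in phonemes]

# Example: reduce_transcription(["p", "a1", "k", "o0", "j"]) -> ["p", "a", "k", "o", "j"]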

Requirements on phonetic coverage and distribution

Depending on the use of speech corpora in particular speech research or
speech technology projects, various requirements on the text material can be
set. Let us consider some frequently occurring requirements.
The first is the requirement of phonetically rich lexical material. For
example, the transcription of the texts should include all phonemes of the
transcription code, and each phoneme must be represented more than a given
number of times. Another demand for phonetic richness can be formulated as
full allophone coverage. In this case we expect each allophone (a phoneme
with its left and right context) to be represented in the speech corpus more
than three times. These two requirements were set up in the development of
the large-scale Russian speech corpus RuSpeech, which was developed by the
Institute for System Analysis for Intel Corporation.
To meet the requirements mentioned above, a special automatic iterative
procedure was applied for the selection of text material. The scheme of this
procedure is shown below.
[Figure 1 block diagram: the text flow passes through automatic transcribing
and analysis of phonetic structure; accumulated phonetic statistics drive a
phonetic surfeit filter that separates selected texts from rejected
sentences.]

Figure 1. Automatic selection of texts with phonetically rich
characteristics.
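
The following minimal Python sketch illustrates the kind of greedy filtering
step shown in Figure 1: a sentence is kept only if its transcription still
contributes under-represented allophones. Transcriptions are assumed to be
lists of phoneme symbols produced by the automatic transcriber; the actual
RuSpeech selection procedure may differ in detail.

# A minimal sketch of a greedy text-selection step for phonetic richness.
from collections import Counter

def triphones(phonemes):
    """Allophones modeled, as an assumption, as phonemes with left/right context."""
    padded = ["#"] + list(phonemes) + ["#"]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

def select_sentences(sentences, target_count=3):
    """sentences: iterable of (text, phoneme_list) pairs."""
    counts, selected = Counter(), []
    for text, phonemes in sentences:
        # keep the sentence only if it adds allophones below the target count
        if any(counts[t] < target_count for t in triphones(phonemes)):
            selected.append(text)
            counts.update(triphones(phonemes))
    return selected, counts

Iterating such a filter over a large text flow, with the phonetic statistics
updated after each pass, gradually accumulates a compact set of phonetically
rich sentences.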

The requirements on the phonetic distribution in the recorded speech can be
fixed according to the research purposes. Let us qualify text material as
phonetically representative if the distribution of phonemes and other
phonetic units in it is close to the theoretically natural distribution,
understood as the frequencies of language units estimated statistically on a
large set of samples.
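
One simple way to quantify this closeness is to compare the relative phoneme
frequencies of the candidate texts with a reference distribution, as in the
Python sketch below; the reference frequencies are placeholders and would in
practice be estimated from a large independent sample of transcribed text.

# A minimal sketch for measuring closeness to a reference phoneme distribution.
from collections import Counter

def relative_frequencies(phonemes):
    counts = Counter(phonemes)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def total_variation_distance(observed, reference):
    """0.0 means identical distributions; values near 1.0 mean very different."""
    keys = set(observed) | set(reference)
    return 0.5 * sum(abs(observed.get(k, 0.0) - reference.get(k, 0.0)) for k in keys)

# Example (hypothetical data): compare the corpus frequencies with a reference
# dictionary such as {"a": 0.08, "o": 0.05, ...} built from a large text sample.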

4. Software tools for speech corpus development

We developed several special software tools to automate some phases of the
speech corpus development process.

Speaker Recording Software

This tool allows batch recording of speakers. The operator fills out a form
in the start window of the program with the speaker's name, age, sex, place
of birth, place of residence and accent type, and chooses the education level
from a drop-down menu. A unique speaker identifier is then formed. The
program automatically forms a special list of sentences for each new speaker
and starts recording in batch mode. The technical quality of the recorded
signals is also automatically controlled by the program, for example as
sketched below.
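
A minimal Python example of such an automatic technical check is given here;
it flags clipped (cut-off) waveforms and recordings with too low a signal
level. The thresholds are illustrative only and are not the values used in
the actual tool.

# A minimal sketch of an automatic technical check on a fresh recording.
import array
import wave

def check_recording(path, clip_ratio=0.001, min_peak=0.05):
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "16-bit PCM expected"
        samples = array.array("h", w.readframes(w.getnframes()))
    if not samples:
        return "empty recording"
    peak = max(abs(s) for s in samples) / 32768.0
    clipped = sum(1 for s in samples if abs(s) >= 32767) / len(samples)
    if clipped > clip_ratio:
        return "clipping detected"      # cut-off waveform
    if peak < min_peak:
        return "signal too quiet"
    return "ok"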

Software for transcription verification

This interactive program was created for the specialists in phonetics
(experts) who carry out the verification and correction of the actual
transcription of the pronounced sentences. The user is given the opportunity
to listen to the utterance, to see the spelling and the expected canonical
pronunciation, to inspect the digital signal in a wave editor and to edit the
actually pronounced phoneme sequence.

Utilities for calculation of phoneme and allophone statistics

These programs calculate the occurrences of phonemes and allophones in the
actual and expected phonetic transcriptions over the whole speech corpus or
over its subsets, as sketched below.
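
A minimal Python sketch of such a counting utility follows; allophones are
modeled here, as an assumption, as phonemes with their immediate left and
right context.

# A minimal sketch of a phoneme/allophone counting utility.
from collections import Counter

def allophones(phonemes):
    """An allophone modeled, as an assumption, as a phoneme with left/right context."""
    padded = ["#"] + list(phonemes) + ["#"]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

def corpus_statistics(transcriptions):
    """Count phoneme and allophone occurrences over a list of transcriptions,
    each transcription being a list of phoneme symbols."""
    phoneme_counts, allophone_counts = Counter(), Counter()
    for phonemes in transcriptions:
        phoneme_counts.update(phonemes)
        allophone_counts.update(allophones(phonemes))
    return phoneme_counts, allophone_counts

# Example: run separately over the expected and the actual transcriptions of a
# subset and compare the two phoneme counters to see where they diverge.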

5. Russian speech database ISABASE

Initially we started recording Russian speech for laboratory purposes.
Later, the demand for well-annotated speech led us to the creation of our
first Russian speech database, ISABASE, which was used in general speech
research and in the creation of a Russian speech recognition system.
The database contains isolated Russian speech segmented into phonetic
units. The info file contains the speaker's personal data, the spelling of
the pronounced sentence, its pronunciation, and the boundaries of the
segments corresponding to words and phonemes.
A semi-automatic segmentation procedure was implemented. We developed
segmentation software based on word extraction, pitch detection and pitch-
synchronous analysis [5]. The results of the automatic segmentation were
shown to experts using a special wave editor. The experts made the required
corrections of segment boundaries and annotated each segment.
The ISABASE speech database consists of 4653 speech fragments. According to
different phonetic requirements, it is divided into two separate sets with
different phonetic characteristics:
. Phonetically balanced set. The collection of phonemes in this set has an
even distribution. The text material of 500 short sentences was taken from
the materials of the all-Union State Standard (GOST), which defines
requirements on speech intelligibility in the transmission of speech over
radio and phone lines [6]. The texts were read by 5 males and 4 females.
This set contains 1863 speech fragments.

. Phonetically representative set. The collection of phonemes in this set
has a distribution close to the theoretically natural one. The text
material was selected from literary texts, with some complex syntactic
constructions simplified. It includes statements, interrogative sentences
and elements of direct speech and dialog. This set contains 3280 speech
fragments pronounced by 15 males and 15 females.

The lexicon has 3713 entries. There were no professional announcers,
readers or actors among the speakers. All of them were native speakers of the
Moscow dialect of Russian. The texts were read in the mode of isolated
(discrete) speech, with short, clearly detached pauses between the words.
This style of pronunciation simplifies the task of automatic signal
segmentation and reduces co-articulation effects between words.

6. Russian large-scale speech corpus RuSpeech

The Russian speech corpus RuSpeech was developed in 2000-2001 by ISA RAS
and Cognitive Technologies, Ltd. for a Russian speech recognition project of
Intel Corporation. The corpus contains more than 50 hours of recorded and
transcribed continuous Russian speech. As a result of the project, an
established technology for the creation of speech databases was achieved. It
includes several software tools developed in the framework of this project
[7].

General description

The transcription code system includes 114 phonemes. One of the main
requirements for the speech corpus was to provide full phoneme coverage per
speaker and full allophone coverage over the main sets of the corpus. In
addition, the corpus should represent the theoretically natural distribution
of phoneme units.
The speech corpus RuSpeech consists of 3 main sets and one additional
set:
. Train - a set of elements designed for training a speech recognition
system;

. Test - a set of elements designed for testing of a speech recognition
system;

. Develop - a set of elements designed for development of a speech
recognition system;

. Bad - a set of utterances which cannot be considered close enough to
normal pronunciation but might be useful for speech researchers or for
speech software debugging purposes.

The Train set includes sentences spoken by 203 speakers (111 male and 92
female). Each speaker read 250 sentences. A constant part of 70 sentences,
which provides full phoneme coverage, was read by every speaker. The other
180 sentences (the speaker-variable part) were taken consecutively for each
speaker from the pool of sentences selected to provide full allophone
coverage in the corpus; one possible assignment scheme is sketched after this
paragraph. On average, each sentence from the allophone-rich pool was read by
14 speakers.
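
The following Python sketch shows one hypothetical way such an assignment
could be organized (the constant part plus a consecutive, wrap-around slice
of the allophone-rich pool per speaker); it is only an illustration, not the
actual procedure used for RuSpeech.

# A hypothetical sentence-assignment sketch, not the actual RuSpeech procedure.
from itertools import cycle

def assign_sentences(constant_part, pool, speaker_ids, variable_per_speaker=180):
    pooled = cycle(pool)  # pool assumed non-empty; slices wrap around its end
    return {
        spk: list(constant_part) + [next(pooled) for _ in range(variable_per_speaker)]
        for spk in speaker_ids
    }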
The Test and Develop sections were constructed in the same manner. Each of
these two sets contains a total of 1000 elements. The sentences of each
section were read by 10 speakers (5 male and 5 female). Each speaker read
exactly 100 sentences.
Besides the three main sections described above, RuSpeech includes a set
"Bad" which contains elements with singularities, such as technical recording
flaws (for example, elements with a cut-off waveform), very poor
pronunciation or a strong regional accent. This section also contains
elements read by speakers who for some reason read only part of the offered
sentences, as well as elements corresponding to sentences read incorrectly by
the speaker (e.g. the speaker substituted or missed a word). Note that a
wrong stress is not considered a mistake.
The distribution of elements of the speech corpus over its sets is as
follows:
. Train - 50278 elements;

. Test - 1000 elements;

. Develop - 1000 elements;

. Bad - 2962 elements.


Speech fragments

An element of the corpus is the pair of files called a speech fragment.
This pair consists of a digitized waveform (wav file), i.e. the speech signal
representing a sentence spoken by a speaker in Russian, and an associated
information file containing additional information on this waveform.
Speech signals were recorded in the Microsoft Windows RIFF format (Resource
Interchange File Format) in mono mode with a sample rate of 22050 Hz and 16
bits per sample; a simple format check is sketched below.
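
A minimal Python check of these format parameters might look as follows
(the file name in the example is hypothetical):

# A minimal sketch verifying the corpus recording format: mono, 22050 Hz, 16-bit.
import wave

def has_corpus_format(path):
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1
                and w.getframerate() == 22050
                and w.getsampwidth() == 2)

# Example: print(has_corpus_format("frag0001.wav"))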
All recordings were made with a Plantronics "PC Headset" model SR1
microphone and an "SB Live!" sound card.
The information file contains two groups of parameters and annotation data.
The first group is entered into the information file automatically while a
sentence spoken by a speaker is being recorded. This group consists of:
. SetID - identifier of the set of speech corpus which contains this speech
element;

. Text - spelling of pronounced sentence;

. PronunciationExpect - expected pronunciation, i.e. standard transcription
of the sentence;

. RecordDate - date of sentence recording;

. RecordPlace - place of recording, i.e. a city/town where the recording
was performed;

. MicrophoneModel - the model of microphone which was used for recording;

. SoundCard - the sound card used;

. SpeakerName - the name of the speaker;

. SpeakerID - speaker's identifier;

. SpeakerSex - speaker's sex;

. SpeakerAge - speaker's age;

. SpeakerEducation - speaker's education level;

. SpeakerBirthplace - speaker's place of birth;

. SpeakerResidence - speaker's place of residence;

. SpeakerAccent - the type of pronunciation of the speaker.

At the verification stage, when an expert (phonetician) verifies the
recording, the following fields are added to the information file:
. PronunciationActual - annotation of the actual pronunciation, i.e. the real
phoneme sequence of the utterance, entered by the expert during verification;

. Comment - the expert's commentary on speaker's reading of the sentence;

. ExpertID - the expert's name;

. VerifyDate - the date of verification.
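
Assuming the keyword structure takes the form of "Key = value" lines in a
plain-text file (the exact syntax and encoding of the real info files may
differ), a minimal Python reader might look as follows:

# A minimal sketch of reading a keyword-structured info file into a dictionary.
def read_info_file(path, encoding="utf-8"):
    fields = {}
    with open(path, encoding=encoding) as f:
        for line in f:
            if "=" in line:
                key, value = line.split("=", 1)  # split only on the first "="
                fields[key.strip()] = value.strip()
    return fields

# Example (hypothetical file name and field access):
# info = read_info_file("frag0001.txt")
# print(info.get("SpeakerID"), info.get("PronunciationActual"))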


Text Material Composition

The text material used in the speech database includes the following four
separate sets of sentences:
. A set of 70 sentences which provides full phoneme coverage, with each
phoneme represented by at least 3 samples. The sentences of this set were
pronounced by every speaker in the Train set of the corpus. They are
intended for training the speech recognition software. The set was
constructed by linguists to ensure its phonetic completeness.

. A set of 3060 sentences also designed for training the speech recognition
software. It provides full allophone coverage (each allophone represented
more than twice). These sentences were evenly distributed among the
speakers of the Train set.

. Two sets (each containing 1000 sentences) designed for the development of
speech recognition software components and for testing of the entire
system. The sentences of these groups were pronounced by the speakers of
the corresponding "Develop" and "Test" sets of the corpus.

The last three sets of sentences were made up of texts from various
Russian newspapers and Internet news sites. The stories were selected so as
to cover various subjects in politics, economy, culture, art, medicine,
sports, etc. Some sentences were taken as they occurred in the text, while
others had to be slightly changed so that, as a rule, they were no longer
than 9-10 words. The sentences were selected with the automatic filtering
procedure described above to meet the requirement of sufficient
representation of allophones. Since the phoneticians did not specially select
words for otherwise missing phonetic characteristics, these sets also meet
the requirement of a theoretically natural phoneme distribution.

Speakers

237 speakers aged 18 to 65 (127 male and 110 female) took part in
pronouncing the phrases included in the selected text material. There were no
professional speakers among them, and none had any previous experience of
reading speech aloud for recording.
Speakers' age distribution is as follows:
. 18-20 years - 56 speakers;

. 21-30 years - 101 speakers;

. 31-40 years - 33 speakers;

. 41-50 years - 25 speakers;

. 51-60 years - 16 speakers;

. 61-65 years - 6 speakers.

All speakers are residents of Moscow, and most of them show Moscow
pronunciation. The speakers identified their pronunciation type themselves.
However, in some cases the phoneticians acting as experts at the verification
stage pointed out in their commentary a discrepancy between the real type of
pronunciation and the one indicated by the speaker at registration. Most of
the speakers (158) named Moscow as their place of birth, but there were also
natives of many regions of Russia and of former USSR republics among the
speakers.

7. Conclusion

Summarizing our experience in the creation of speech databases, we should
point out that, besides the main goal of providing instruments for applied
speech technology development, and in particular increasing the quality of
speech recognition or synthesis systems, such projects have a great influence
on fundamental phonetic science itself.
On the other hand, success in fundamental science, together with technical
progress in hardware, gives us new technological solutions, which in turn
influence fundamental science by providing it with new tools for research.
Finally, we note that the creation of high-quality, large-scale speech
corpora requires considerable financial and human resources. It is becoming
an important branch of speech technology, and the specialists who have been
working on it have gained valuable experience.

8. References

[1] Bogdanov, D. S., Brughtij, A. V., Krivnova, O. F., Podrabinovich, A. Ya.,
Strokin, G. S., "Technology for Creation of Speech Corpora", Administration
and Artificial Intelligence (in Russian), Moscow: URSS, 2003, pp. 239-259.
[2] Krivnova, O. F., "Phonetic Provisions for a Speech Corpus", Proceedings
of the XIII Session of the Russian Acoustical Society, p. 535, Moscow,
Russia, 2003.
[3] Avanesov, R. I., Russian Literary Pronunciation (in Russian), Moscow,
1972.
[4] Kibkalo, A. A. and Lotkov, M. M., "Choice of Phonetic Alphabet for
Russian LVCSR System", Proceedings of the Workshop SPECOM'03, Moscow, 2003.
[5] Arlazarov, V. L., Bogdanov, D. S., Rozanov, A. O., Finkelshtein, Yu. L.,
"Methods for Pitch Extraction in Speech Signal", Cognitive Technologies for
Information Input and Processing (in Russian), Moscow: URSS, 1998.
[6] GOST 16600-72, Moscow, 1973.
[7] Arlazarov, V. V., Bogdanov, D. S., Brughtij, A. V., Podrabinovich, A.
Ya., "Software for Creation of Speech Corpora", Administration and Artificial
Intelligence (in Russian), Moscow: URSS, 2003, pp. 259-267.
[8] Strokin, G. S., "On Using a Problem-Oriented Language for the Development
of Linguistic Algorithms", Administration and Artificial Intelligence (in
Russian), Moscow: URSS, 2003, pp. 267-290.
[9] Galounov, V. I., van den Heuvel, H., Kochanina, J. L., Ostroukhov, A. V.,
Tropf, H., Vorontsova, A. V., "Speech Database for the Russian Language",
Proceedings of the Workshop SPECOM'98, St. Petersburg, Russia, 1998.
[10] Kouznetsov, V., Chuchupal, V., Makovkin, K., Chichagov, A., "Design and
Implementation of a Russian Telephone Speech Database", Proceedings of the
Workshop SPECOM'99, Moscow, Russia, 1999.
