Olaf Jan Schmidt wrote:

>> Or more generally a sequence of audio samples. Motivation: I think most software synthesizers we are likely to support perform processing of the whole text in several steps, only the last of them being the writing of the whole produced audio sample somewhere. When synthesizing long texts, it is desirable to allow the synthesizer to split the input into several pieces so that we don't wait too long for the first audio data.
>
> KTTSD already does this, and I think it would be duplication of work to do it in every driver if the higher speech system can take care of this. Doing it before sending the phrases to the engines allows interrupting a longer text with warnings, etc.

But this isn't always what you want.

>> OJS> 2.b) For hardware speech: possibility to set markers and to get feedback whenever a marker has been reached.
>>
>> Markers should be available for both software and hardware synthesis. But they differ in their form: while with hardware synthesis feedback should be received whenever the marker is reached in the audio output, with software synthesis the positions of the markers in the returned audio sample should be returned. Or the audio sample can be returned in several pieces as described above; in particular, it can be split on marker positions and the returned list could contain not only the audio samples but also the reached markers.

I think you probably do not want to return audio samples from the TTS driver API in most cases. It's better to have some API for connecting the driver with an audio sink.

> Is there any advantage to sending the whole text at once to the drivers, rather than sending it in smaller pieces, each of which returns an audio stream?

Yes; some drivers do a lot of semantic/contextual processing, which can't be done properly with smaller text snippets. Again, there is a tradeoff between size/latency and quality - but it's important to allow the client to do this both ways. The client can then decide whether to send small chunks or large ones.

The callback API must allow for sending big chunks, and getting finer-grained notification before the whole request has completed. Of course different TTS engines will have different marker capabilities (as was noted above).

> If sending it in a bigger piece avoids lags, then it might perhaps be worth the bigger complexity in the API, but if the lags would be small anyway, then I would suggest to keep the API simpler.
>
>> Good remark. But if I understand it correctly, this doesn't concern the TTS API directly, it can just receive and process the pieces separately, one by one, so there's no need for the drivers to be able to process a list of strings?
>
> If you have markup within a phrase, then we cannot pass parts of the phrase independently of each other. So we would need a string list in this case.
>
> A driver can easily turn the string list back into a string; it would only help those drivers that parse the string for tags rather than passing it on to an XML-supporting engine.
>
>> I'd suggest using SSML instead of VoiceXML. If I'm not mistaken, SSML is what is aimed at TTS, while the purpose of VoiceXML is different.

There are some licensing issues to be careful of here - we must use an unencumbered XML markup flavor.

> I thought that the GSAPI used some extension of VoiceXML, but maybe I am misinformed here.

The proposed "GSAPI 1.0" called for some XML markup; I think it's a good idea. I will re-check my notes to make sure which version we proposed; it was at the time the clear winner based on licensing issues and end-user adoption.

> We should use the same syntax in any case. We can discuss the different possibilities on the list once it has been set up.
>
>> I'm not sure values other than languages are needed (except for the purpose of configuration as described in C. below). The application can decide in which language to send the text depending on the available languages, but could available voice names or genders influence the application behavior in any significant way?

I think the voice name should be determined at the higher-level API, and the drivers should operate on a "voice" or "speaker". I think that changing speaker within a single marked-up string is an unusual case.

> KTTSD allows the user to select the preferred voices by name, and it needs to know which languages and genders are supported by the engines to switch to the correct driver if several are installed. Using different voices for different purposes (long texts, messages, navigation feedback) is also only possible if it is known which voices exist and which driver must be used to use them.
>
>> 5. Other features needed (some of them are included and can be expressed in SSML):
>>
>> - Enabling/disabling spelling mode.

Not sure this makes sense at the low level.

>> - Switching punctuation and capital character signalling modes.
>
> I am not sure what exactly you mean by these two.
>
>> - Setting rate and pitch.
>
> There are XML tags for this, but there should be a way to set a default.

I don't think we should rely _solely_ on XML for this, so I agree with you. There should be a way to set the "base" or "current" parameters on a given voice or speaker (if the voice/speaker supports this).

>> - Reading single characters and key names.
>
> Would this make more sense on the driver level, or should the higher speech system deal with this to have this consistent for all drivers?

Probably should be the job of the higher speech system.

>> OJS> We could either add these functions to the driver API, or we could define a standard API for driver configuration libraries.
>>
>> This functionality would be nice, but it should be optional, so as not to put more burden on the drivers than absolutely needed.
>
> Sure, if a driver has no configuration options to be shown in the kttsd configuration module, then this is not needed. I only want to avoid that kttsd, gnome-speech, SpeechDispatcher etc. all have to write their own configuration functions for the same drivers.
>
>> First we should agree on the form of the drivers. Do we want just some code base providing the defined features, or do we want to define some form of a particular API, possibly to be used by alternative APIs?
>
> Could you explain the differences between the two options a bit?
>
> Olaf
>
> --
> Olaf Jan Schmidt, KDE Accessibility Project
> KDEAP co-maintainer, maintainer of http://accessibility.kde.org
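To make the callback discussion above more concrete, here is a minimal C sketch of the kind of driver interface Bill describes: the client may hand the driver a large chunk of (possibly marked-up) text, the produced audio is pushed to a sink rather than returned as one big sample, and a second callback delivers finer-grained notifications before the whole request completes. Every name in it (tts_driver, tts_driver_say, and so on) is an assumption invented for this example, not an existing or agreed API.

/* Hypothetical driver interface -- illustrative only.  The client
 * submits a whole (possibly large, possibly marked-up) text and two
 * callbacks: one receives audio data as it is produced, the other
 * receives finer-grained events (markers reached, request finished),
 * so big chunks can be sent without losing feedback. */

#include <stddef.h>

typedef struct tts_driver tts_driver;        /* opaque, engine specific */

typedef enum {
    TTS_EVENT_MARKER,   /* a named marker in the input was reached */
    TTS_EVENT_DONE,     /* the whole request has been synthesized  */
    TTS_EVENT_ERROR
} tts_event_type;

/* Audio callback: the driver pushes produced samples to the client's
 * sink instead of returning one big audio sample at the end. */
typedef void (*tts_audio_cb)(const void *samples, size_t bytes,
                             void *client_data);

/* Event callback: notification before the whole request has completed. */
typedef void (*tts_event_cb)(tts_event_type type, const char *marker_name,
                             void *client_data);

/* Submit a request.  `text' may be a small snippet or a large document;
 * the driver decides internally how to chunk it for its engine. */
int tts_driver_say(tts_driver *drv, const char *text,
                   tts_audio_cb audio_cb, tts_event_cb event_cb,
                   void *client_data);

/* Stop whatever is currently being synthesized. */
int tts_driver_cancel(tts_driver *drv);

Whether the audio callback then writes to an audio device, a file or a network stream would be entirely the client's business.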
Thanks, Olaf and Bill, for your participation in the idea and for keeping it moving forward!

>>>>> "OJS" == Olaf Jan Schmidt <ojschmidt@kde.org> writes:
>>>>> "BH" == Bill Haneman <Bill.Haneman@Sun.COM> writes:

OJS> I have just asked David Stone when we can start using the list.

Thanks.

>> Or more generally a sequence of audio samples. Motivation: I think most software synthesizers we are likely to support perform processing of the whole text in several steps, only the last of them being the writing of the whole produced audio sample somewhere. When synthesizing long texts, it is desirable to allow the synthesizer to split the input into several pieces so that we don't wait too long for the first audio data.

OJS> KTTSD already does this, and I think it would be duplication of work to do it in every driver if the higher speech system can take care of this.

I think the higher-level speech system can't do this, since it requires utterance chunking, which is a typical TTS function. Utterance chunking must be performed by a low-level TTS system for two main reasons: 1. it is language dependent; 2. only the TTS system knows how large the pieces of text need to be to produce its speech output in the appropriate quality.

If you care about TTS systems which can't perform what we need here, then we can make this functionality optional in the drivers. If the TTS system/driver doesn't support it, the higher-level speech system can perform some heuristics, as KTTSD does, I guess. But from the point of view of duplicated effort I think it would still be better to do it in the drivers rather than in the higher-level speech system. If the drivers have a common code base, then the fallback utterance chunking can be just a library function shared by all the drivers which don't provide their own version, and there's no need to implement it in KTTSD, GNOME Speech, Speech Dispatcher, etc.

OJS> Doing it before sending the phrases to the engines allows interrupting a longer text with warnings, etc.

I don't understand what exactly you mean here.

>> OJS> 2.b) For hardware speech: possibility to set markers and to get feedback whenever a marker has been reached.
>>
>> Markers should be available for both software and hardware synthesis. But they differ in their form: while with hardware synthesis feedback should be received whenever the marker is reached in the audio output, with software synthesis the positions of the markers in the returned audio sample should be returned. Or the audio sample can be returned in several pieces as described above; in particular, it can be split on marker positions and the returned list could contain not only the audio samples but also the reached markers.

OJS> Is there any advantage to sending the whole text at once to the drivers, rather than sending it in smaller pieces, each of which returns an audio stream? If sending it in a bigger piece avoids lags, then it might perhaps be worth the bigger complexity in the API, but if the lags would be small anyway, then I would suggest to keep the API simpler.

BH> Yes; some drivers do a lot of semantic/contextual processing, which can't be done properly with smaller text snippets. Again, there is a tradeoff between size/latency and quality - but it's important to allow the client to do this both ways. The client can then decide whether to send small chunks or large ones.

BH> The callback API must allow for sending big chunks, and getting finer-grained notification before the whole request has completed. Of course different TTS engines will have different marker capabilities (as was noted above).

See above; I agree with what Bill writes here.

BH> I think you probably do not want to return audio samples from the TTS driver API in most cases. It's better to have some API for connecting the driver with an audio sink.

I agree the right way to return the produced audio data is to write it to a given stream. We could probably agree that the API shouldn't specify which kind of stream it is (whether some kind of audio sink, a file stream or any other kind of binary stream).

>> Good remark. But if I understand it correctly, this doesn't concern the TTS API directly, it can just receive and process the pieces separately, one by one, so there's no need for the drivers to be able to process a list of strings?

OJS> If you have markup within a phrase, then we cannot pass parts of the phrase independently of each other. So we would need a string list in this case.

I see. I'm not sure it is a clean technique, but I must think about it more before forming my opinion on it.

>> I'd suggest using SSML instead of VoiceXML. If I'm not mistaken, SSML is what is aimed at TTS, while the purpose of VoiceXML is different.

BH> There are some licensing issues to be careful of here - we must use an unencumbered XML markup flavor.

You probably mean patent issues? We should definitely avoid them. Are you aware of particular problems with SSML or its subset relevant to our API?

OJS> I thought that the GSAPI used some extension of VoiceXML, but maybe I am misinformed here. We should use the same syntax in any case. We can discuss the different possibilities on the list once it has been set up.

BH> The proposed "GSAPI 1.0" called for some XML markup; I think it's a good idea. I will re-check my notes to make sure which version we proposed; it was at the time the clear winner based on licensing issues and end-user adoption.

OK, thanks.

>> I'm not sure values other than languages are needed (except for the purpose of configuration as described in C. below). The application can decide in which language to send the text depending on the available languages, but could available voice names or genders influence the application behavior in any significant way?

OJS> KTTSD allows the user to select the preferred voices by name, and it needs to know which languages and genders are supported by the engines to switch to the correct driver if several are installed. Using different voices for different purposes (long texts, messages, navigation feedback) is also only possible if it is known which voices exist and which driver must be used to use them.

OK, I understand the purpose now. We could probably select the exact parameter set to be consistent with the voice selection features of the chosen markup?

BH> I think the voice name should be determined at the higher-level API, and the drivers should operate on a "voice" or "speaker".

I don't understand exactly what you mean here by "voice" or "speaker".

BH> I think that changing speaker within a single marked-up string is an unusual case.

I can imagine it very easily -- consider faces in Emacs. For instance, it may be very convenient to read a comment inside a line of source code with a different speaker.

>> 5. Other features needed (some of them are included and can be expressed in SSML):
>>
>> - Enabling/disabling spelling mode.

BH> Not sure this makes sense at the low level.

It does make sense, since it is language dependent.

>> - Switching punctuation and capital character signalling modes.

OJS> I am not sure what exactly you mean by these two.

In spelling mode, the given text is spelled out. Punctuation modes handle the reading of punctuation in different ways: e.g. there may be modes for explicitly reading all punctuation characters, for not reading any punctuation characters, or for reading punctuation characters as they would likely be read by a human reader. Capital character signalling mode signals capital characters within the text, e.g. by beeping before each of them.

>> - Setting rate and pitch.

OJS> There are XML tags for this, but there should be a way to set a default.

BH> I don't think we should rely _solely_ on XML for this, so I agree with you. There should be a way to set the "base" or "current" parameters on a given voice or speaker (if the voice/speaker supports this).

Agreed.

>> - Reading single characters and key names.

OJS> Would this make more sense on the driver level, or should the higher speech system deal with this to have this consistent for all drivers?

BH> Probably should be the job of the higher speech system.

Again, it is language dependent. Moreover, there can be ambiguities between texts, characters, and keys (e.g. `a' may be a word or a character in English). Maybe it could be technically solved in some way on the higher level using language-dependent tables or so, but I'd prefer not to mess with any lower-level TTS functionality in the higher-level speech systems -- let them just express clearly what they want to synthesize and let the synthesizer do it.

OJS> Sure, if a driver has no configuration options to be shown in the kttsd configuration module,

... or if the driver author doesn't want to spend his expensive time designing and writing such functionality ...

OJS> then this is not needed. I only want to avoid that kttsd, gnome-speech, SpeechDispatcher etc. all have to write their own configuration functions for the same drivers.

Agreed.

>> First we should agree on the form of the drivers. Do we want just some code base providing the defined features, or do we want to define some form of a particular API, possibly to be used by alternative APIs?

OJS> Could you explain the differences between the two options a bit?

Maybe there's actually none. :-) But we should agree on the kind of interface anyway. Shared library?

Regards,

Milan Zamazal

--
If we are going to start removing packages because of the quality of the software, wonderful.  I move to remove all traces of the travesty of editors, vi, from Debian, since obviously as editors they are less than alpha quality software.
                                    -- Manoj Srivastava in debian-devel
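As an aside to Milan's point about a shared fallback: the sketch below shows, very naively, what such a library-level utterance chunker might look like in C, splitting only on sentence-ending punctuation. It is merely an illustration of the idea under the assumption of a common driver code base; real chunking is language dependent, and the function name is invented for the example.

#include <stdlib.h>
#include <string.h>

/* Naive fallback utterance chunker: split `text' after '.', '!' or '?'
 * followed by whitespace (or end of text).  Real chunking is language
 * dependent and better done by the TTS engine itself; this is only a
 * last resort for engines without their own chunking.
 * Returns a NULL-terminated array of newly allocated strings. */
char **tts_fallback_chunk(const char *text)
{
    size_t cap = 8, n = 0;
    char **chunks = malloc(cap * sizeof *chunks);
    const char *start = text;
    const char *p = text;

    while (*p) {
        int end_of_sentence = (*p == '.' || *p == '!' || *p == '?') &&
                              (p[1] == ' ' || p[1] == '\n' || p[1] == '\0');
        p++;
        if (end_of_sentence || *p == '\0') {
            size_t len = (size_t)(p - start);
            if (n + 2 > cap)                    /* room for chunk + NULL */
                chunks = realloc(chunks, (cap *= 2) * sizeof *chunks);
            chunks[n] = malloc(len + 1);
            memcpy(chunks[n], start, len);
            chunks[n][len] = '\0';
            n++;
            while (*p == ' ' || *p == '\n')     /* skip inter-sentence gap */
                p++;
            start = p;
        }
    }
    chunks[n] = NULL;
    return chunks;
}

A driver without engine-level chunking could feed these pieces to its engine one by one, so the first audio data arrives after the first sentence rather than after the whole text.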
Hi Milan!

Thanks for your comments on the requirement list.

[Milan Zamazal, Tuesday, 26 October 2004 20:58]

> [Since the mailing list apparently hasn't been created yet, I continue in private so as not to freeze the discussion for too long.]

I have just asked David Stone when we can start using the list.

> BTW, this might be the subject of another standardization step. I'd like to look at kttsd features -- is there some reasonable description or documentation of kttsd available?

http://accessibility.kde.org/developer/kttsd/

> Or more generally a sequence of audio samples. Motivation: I think most software synthesizers we are likely to support perform processing of the whole text in several steps, only the last of them being the writing of the whole produced audio sample somewhere. When synthesizing long texts, it is desirable to allow the synthesizer to split the input into several pieces so that we don't wait too long for the first audio data.

KTTSD already does this, and I think it would be duplication of work to do it in every driver if the higher speech system can take care of this. Doing it before sending the phrases to the engines allows interrupting a longer text with warnings, etc.

> OJS> 2.b) For hardware speech: possibility to set markers and to get feedback whenever a marker has been reached.
>
> Markers should be available for both software and hardware synthesis. But they differ in their form: while with hardware synthesis feedback should be received whenever the marker is reached in the audio output, with software synthesis the positions of the markers in the returned audio sample should be returned. Or the audio sample can be returned in several pieces as described above; in particular, it can be split on marker positions and the returned list could contain not only the audio samples but also the reached markers.

Is there any advantage to sending the whole text at once to the drivers, rather than sending it in smaller pieces, each of which returns an audio stream? If sending it in a bigger piece avoids lags, then it might perhaps be worth the bigger complexity in the API, but if the lags would be small anyway, then I would suggest to keep the API simpler.

> Good remark. But if I understand it correctly, this doesn't concern the TTS API directly, it can just receive and process the pieces separately, one by one, so there's no need for the drivers to be able to process a list of strings?

If you have markup within a phrase, then we cannot pass parts of the phrase independently of each other. So we would need a string list in this case.

A driver can easily turn the string list back into a string; it would only help those drivers that parse the string for tags rather than passing it on to an XML-supporting engine.

> I'd suggest using SSML instead of VoiceXML. If I'm not mistaken, SSML is what is aimed at TTS, while the purpose of VoiceXML is different.

I thought that the GSAPI used some extension of VoiceXML, but maybe I am misinformed here. We should use the same syntax in any case. We can discuss the different possibilities on the list once it has been set up.

> I'm not sure values other than languages are needed (except for the purpose of configuration as described in C. below). The application can decide in which language to send the text depending on the available languages, but could available voice names or genders influence the application behavior in any significant way?

KTTSD allows the user to select the preferred voices by name, and it needs to know which languages and genders are supported by the engines to switch to the correct driver if several are installed. Using different voices for different purposes (long texts, messages, navigation feedback) is also only possible if it is known which voices exist and which driver must be used to use them.

> 5. Other features needed (some of them are included and can be expressed in SSML):
>
> - Enabling/disabling spelling mode.
>
> - Switching punctuation and capital character signalling modes.

I am not sure what exactly you mean by these two.

> - Setting rate and pitch.

There are XML tags for this, but there should be a way to set a default.

> - Reading single characters and key names.

Would this make more sense on the driver level, or should the higher speech system deal with this to have this consistent for all drivers?

> OJS> We could either add these functions to the driver API, or we could define a standard API for driver configuration libraries.
>
> This functionality would be nice, but it should be optional, so as not to put more burden on the drivers than absolutely needed.

Sure, if a driver has no configuration options to be shown in the kttsd configuration module, then this is not needed. I only want to avoid that kttsd, gnome-speech, SpeechDispatcher etc. all have to write their own configuration functions for the same drivers.

> First we should agree on the form of the drivers. Do we want just some code base providing the defined features, or do we want to define some form of a particular API, possibly to be used by alternative APIs?

Could you explain the differences between the two options a bit?

Olaf

--
Olaf Jan Schmidt, KDE Accessibility Project
KDEAP co-maintainer, maintainer of http://accessibility.kde.org
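To illustrate the kind of voice, language and gender information Olaf describes KTTSD as needing, a driver could expose something like the following listing call. Again this is only a sketch with made-up names (tts_voice, tts_driver_list_voices), not a proposal text.

/* Illustrative voice description, so that a higher-level system such
 * as KTTSD can pick the right driver and voice for a given purpose
 * (long texts, messages, navigation feedback, ...). */

struct tts_driver;                       /* opaque driver handle */

typedef enum {
    TTS_GENDER_UNKNOWN,
    TTS_GENDER_FEMALE,
    TTS_GENDER_MALE
} tts_gender;

typedef struct {
    const char *name;        /* engine-specific voice name, e.g. "kal16" */
    const char *language;    /* language code, e.g. "en", "de", "cs"     */
    tts_gender  gender;
} tts_voice;

/* Return an array of voice descriptions terminated by an entry whose
 * `name' is NULL; the array is owned by the driver. */
const tts_voice *tts_driver_list_voices(struct tts_driver *drv);

A configuration front end or speech daemon could then walk this list to decide, for instance, which installed driver offers a female German voice, without hard-coding knowledge about individual engines.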
Wow! The number of people on this list is sure growing!

>> Or more generally a sequence of audio samples. Motivation: I think most software synthesizers we are likely to support perform processing of the whole text in several steps, only the last of them being the writing of the whole produced audio sample somewhere. When synthesizing long texts, it is desirable to allow the synthesizer to split the input into several pieces so that we don't wait too long for the first audio data.
>
> KTTSD already does this, and I think it would be duplication of work to do it in every driver if the higher speech system can take care of this. Doing it before sending the phrases to the engines allows interrupting a longer text with warnings, etc.

Is this level of detail something that needs to be exposed at the driver level? How the driver chooses to handle lengthy input text seems like it needs to be more of an implementation detail than a driver interface specification. I think the overall requirement here is that the time to first sample be "very short" and the time to cancel a request in process should also be "very short". (You guys define what "very short" means.)

The other level of detail that needs to be worked out is whether a stream of audio data is returned to the app or whether the app supplies the driver with a place to send the audio to. IMO, this appears to be a bit of a stylistic thing and I don't see strong benefits or drawbacks one way or the other. If someone gives you audio, you can send it to a sink. If someone allows you to give the driver a sink, you can write your sink to give you audio. In either case, the functions of pause/resume/cancel/ff/rev all introduce a fair amount of complexity, especially when the driver is trying to multithread things to give you its best performance possible.

In FreeTTS, we chose the latter (i.e., give FreeTTS a sink to send the audio to). It works OK, but is a little unintuitive and our implementation kind of puts the app at the mercy of FreeTTS when it comes to the timing. For the former (i.e., have the driver give you audio), I'm not sure how many other engines out there really support this level of flexibility.

BTW, an overlying problem that needs to be worked out across the whole OS is the notion of managing contention for the audio output device. For example, do you queue, mix, cancel, etc. multiple audio requests? This situation will happen very frequently in the case of an OS that plays audio when windows appear and disappear - this behavior causes contention with a screen reader that wants to also say the name of the window that was just shown.

>> Markers should be available for both software and hardware synthesis. But they differ in their form: while with hardware synthesis feedback should be received whenever the marker is reached in the audio output, with software synthesis the positions of the markers in the returned audio sample should be returned. Or the audio sample can be returned in several pieces as described above; in particular, it can be split on marker positions and the returned list could contain not only the audio samples but also the reached markers.

I think the main thing to think about here is how the app is to get the events. A few methods:

1) Sending an audio stream and a marker index to the client. This gives the client more control and allows it to manage its own destiny. Adds a bit of complexity to the client, though.

2) Sending a linear sequence of audio data and marker data. Similar to #1, but I'm not so sure you're going to find a synthesizer that implements things this way.

3) The MRCP way (I think), which is to have separate things for playing and handling events. The synthesizer will spew data to the audio sink and events to the clients. The timing issues here are a bit odd to me, because one can never be sure the client receives the event the moment (or even near the moment) the audio is played. In any case, this is most similar to what the hardware synthesizers are doing.

>> I'd suggest using SSML instead of VoiceXML. If I'm not mistaken, SSML is what is aimed at TTS, while the purpose of VoiceXML is different.

Just as a clarification: SSML is a sub-spec of the VoiceXML effort. It is based upon JSML, the Java Speech API Markup Language, which was created by my group here at Sun.

Will
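To tie the marker discussion to the SSML question: SSML's <mark> element is one concrete way for a client to name positions in the text and have the driver report an event as each mark is reached (roughly Will's method 3, or method 1 if the positions are returned alongside the audio). The C fragment below reuses the hypothetical callback types sketched earlier in this thread; only the SSML elements themselves come from the SSML specification, everything else is invented for the example.

#include <stdio.h>

/* Sketch: the client embeds SSML <mark> elements in the text and
 * registers an event callback; the driver sends audio to its sink and
 * reports marker events separately.  API names are hypothetical. */

static const char *ssml_text =
    "<speak version=\"1.0\" xml:lang=\"en\">"
    "  Loading your document.  <mark name=\"halfway\"/>"
    "  <prosody rate=\"slow\">Almost finished.</prosody>"
    "  <mark name=\"end-of-message\"/>"
    "</speak>";

static void on_event(tts_event_type type, const char *marker_name,
                     void *client_data)
{
    (void) client_data;
    if (type == TTS_EVENT_MARKER)
        /* e.g. update a progress display or queue the next chunk */
        printf("reached marker: %s\n", marker_name);
}

/* ... later: tts_driver_say(drv, ssml_text, audio_cb, on_event, NULL); ... */

With method 3 the timing caveat Will mentions still applies: the event tells the client the marker has been synthesized, not necessarily that the corresponding audio has already been heard.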