Olaf Jan Schmidt wrote:

>> Or more generally a sequence of audio samples. Motivation: I think most software synthesizers we are likely to support perform processing of the whole text in several steps, only the last of them being the writing of the whole produced audio sample somewhere. When synthesizing long texts, it is desirable to allow the synthesizer to split the input into several pieces so that we don't wait too long for the first audio data.
>
> KTTSD already does this, and I think it would be duplication of work to do it in every driver if the higher speech system can take care of this. Doing it before sending the phrases to the engines allows interrupting a longer text with warnings, etc.

But this isn't always what you want.

>> OJS> 2.b) For hardware speech: possibility to set markers and to get feedback whenever a marker has been reached.
>>
>> Markers should be available for both software and hardware synthesis. But they differ in their form: while with hardware synthesis feedback should be received whenever the marker is reached in the audio output, with software synthesis the positions of the markers in the returned audio sample should be returned. Or the audio sample can be returned in several pieces as described above; in particular, it can be split on marker positions and the returned list could contain not only the audio samples but also the reached markers.

I think you probably do not want to return audio samples from the TTS driver API in most cases. It's better to have some API for connecting the driver with an audio sink.

> Is there any advantage to sending the whole text at once to the drivers, rather than sending it in smaller pieces, each of which returns an audio stream?

Yes; some drivers do a lot of semantic/contextual processing, which can't be done properly with smaller text snippets. Again, there is a tradeoff between size/latency and quality - but it's important to allow the client to do this both ways. The client can then decide whether to send small chunks or large ones.

The callback API must allow for sending big chunks, and getting finer-grained notification before the whole request has completed. Of course different TTS engines will have different marker capabilities (as was noted above).

> If sending it in a bigger piece avoids lags, then it might perhaps be worth the bigger complexity in the API, but if the lags would be small anyway, then I would suggest to keep the API simpler.
>
>> Good remark. But if I understand it correctly, this doesn't concern the TTS API directly, it can just receive and process the pieces separately, one by one, so there's no need for the drivers to be able to process a list of strings?
>
> If you have markup within a phrase, then we cannot pass parts of the phrase independently of each other. So we would need a string list in this case.
>
> A driver can easily turn the string list back into a string; it would only help those drivers that parse the string for tags rather than passing it on to an XML-supporting engine.
>
>> I'd suggest using SSML instead of VoiceXML. If I'm not mistaken, SSML is what is aimed at TTS, while the purpose of VoiceXML is different.

There are some licensing issues to be careful of here - we must use an unencumbered XML markup flavor.

> I thought that the GSAPI used some extension of VoiceXML, but maybe I am misinformed here.

The proposed "GSAPI 1.0" called for some XML markup; I think it's a good idea. I will re-check my notes to make sure which version we proposed; it was at the time the clear winner based on licensing issues and end-user adoption.

> We should use the same syntax in any case. We can discuss the different possibilities on the list once it has been set up.
>
>> I'm not sure values other than languages are needed (except for the purpose of configuration as described in C. below). The application can decide in which language to send the text depending on the available languages, but could available voice names or genders influence the application behavior in any significant way?

I think the voice name should be determined at the higher-level API, and the drivers should operate on a "voice" or "speaker". I think that changing speaker within a single marked-up string is an unusual case.

> KTTSD allows the user to select the preferred voices by name, and it needs to know which languages and genders are supported by the engines to switch to the correct driver if several are installed. Using different voices for different purposes (long texts, messages, navigation feedback) is also only possible if it is known which voices exist and which driver must be used to use them.
>
>> 5. Other features needed (some of them are included and can be expressed in SSML):
>>
>> - Enabling/disabling spelling mode.

Not sure this makes sense at the low level.

>> - Switching punctuation and capital character signalling modes.
>
> I am not sure what exactly you mean by these two.
>
>> - Setting rate and pitch.
>
> There are XML tags for this, but there should be a way to set a default.

I don't think we should rely _solely_ on XML for this, so I agree with you. There should be a way to set the "base" or "current" parameters on a given voice or speaker (if the voice/speaker supports this).

>> - Reading single characters and key names.
>
> Would this make more sense on the driver level, or should the higher speech system deal with this to have this consistent for all drivers?

Probably should be the job of the higher speech system.

>> OJS> We could either add these functions to the driver API, or we could define a standard API for driver configuration libraries.
>>
>> This functionality would be nice, but it should be optional, so as not to put more burden on the drivers than absolutely needed.
>
> Sure, if a driver has no configuration options to be shown in the kttsd configuration module, then this is not needed. I only want to avoid that kttsd, gnome-speech, SpeechDispatcher etc. all have to write their own configuration functions for the same drivers.
>
>> First we should agree on the form of the drivers. Do we want just some code base providing the defined features, or do we want to define some form of a particular API, possibly to be used by alternative APIs?
>
> Could you explain the differences between the two options a bit?
>
> Olaf
>
> --
> Olaf Jan Schmidt, KDE Accessibility Project
> KDEAP co-maintainer, maintainer of http://accessibility.kde.org
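To make the callback discussion above more concrete, here is a minimal C sketch of the kind of driver interface Bill describes: the client may hand the driver a large chunk of (possibly marked-up) text, the produced audio is pushed to a sink rather than returned as one big sample, and a second callback delivers finer-grained notifications before the whole request completes. Every name in it (tts_driver, tts_driver_say, and so on) is an assumption invented for this example, not an existing or agreed API.

/* Hypothetical driver interface -- illustrative only.  The client
 * submits a whole (possibly large, possibly marked-up) text and two
 * callbacks: one receives audio data as it is produced, the other
 * receives finer-grained events (markers reached, request finished),
 * so big chunks can be sent without losing feedback. */

#include <stddef.h>

typedef struct tts_driver tts_driver;        /* opaque, engine specific */

typedef enum {
    TTS_EVENT_MARKER,   /* a named marker in the input was reached */
    TTS_EVENT_DONE,     /* the whole request has been synthesized  */
    TTS_EVENT_ERROR
} tts_event_type;

/* Audio callback: the driver pushes produced samples to the client's
 * sink instead of returning one big audio sample at the end. */
typedef void (*tts_audio_cb)(const void *samples, size_t bytes,
                             void *client_data);

/* Event callback: notification before the whole request has completed. */
typedef void (*tts_event_cb)(tts_event_type type, const char *marker_name,
                             void *client_data);

/* Submit a request.  `text' may be a small snippet or a large document;
 * the driver decides internally how to chunk it for its engine. */
int tts_driver_say(tts_driver *drv, const char *text,
                   tts_audio_cb audio_cb, tts_event_cb event_cb,
                   void *client_data);

/* Stop whatever is currently being synthesized. */
int tts_driver_cancel(tts_driver *drv);

Whether the audio callback then writes to an audio device, a file or a network stream would be entirely the client's business.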
Thanks, Olaf and Bill, for your participation in the idea and for keeping it moving forward!

>>>>> "OJS" == Olaf Jan Schmidt <ojschmidt@kde.org> writes:
>>>>> "BH" == Bill Haneman <Bill.Haneman@Sun.COM> writes:

OJS> I have just asked David Stone when we can start using the list.

Thanks.

>> Or more generally a sequence of audio samples. Motivation: I think most software synthesizers we are likely to support perform processing of the whole text in several steps, only the last of them being the writing of the whole produced audio sample somewhere. When synthesizing long texts, it is desirable to allow the synthesizer to split the input into several pieces so that we don't wait too long for the first audio data.

OJS> KTTSD already does this, and I think it would be duplication of work to do it in every driver if the higher speech system can take care of this.

I think the higher-level speech system can't do this, since it requires utterance chunking, which is a typical TTS function. Utterance chunking must be performed by a low-level TTS system for two main reasons: 1. it is language dependent; 2. only the TTS system knows how large the pieces of text need to be to produce its speech output in the appropriate quality.

If you care about TTS systems which can't perform what we need here, then we can make this functionality optional in the drivers. If the TTS system/driver doesn't support it, the higher-level speech system can perform some heuristics, as KTTSD does, I guess. But from the point of view of duplicated effort I think it would still be better to do it in the drivers rather than in the higher-level speech system. If the drivers have a common code base, then the fallback utterance chunking can be just a library function shared by all the drivers which don't provide their own version, and there's no need to implement it in KTTSD, GNOME Speech, Speech Dispatcher, etc.

OJS> Doing it before sending the phrases to the engines allows interrupting a longer text with warnings, etc.

I don't understand what exactly you mean here.

>> OJS> 2.b) For hardware speech: possibility to set markers and to get feedback whenever a marker has been reached.
>>
>> Markers should be available for both software and hardware synthesis. But they differ in their form: while with hardware synthesis feedback should be received whenever the marker is reached in the audio output, with software synthesis the positions of the markers in the returned audio sample should be returned. Or the audio sample can be returned in several pieces as described above; in particular, it can be split on marker positions and the returned list could contain not only the audio samples but also the reached markers.

OJS> Is there any advantage to sending the whole text at once to the drivers, rather than sending it in smaller pieces, each of which returns an audio stream? If sending it in a bigger piece avoids lags, then it might perhaps be worth the bigger complexity in the API, but if the lags would be small anyway, then I would suggest to keep the API simpler.

BH> Yes; some drivers do a lot of semantic/contextual processing, which can't be done properly with smaller text snippets. Again, there is a tradeoff between size/latency and quality - but it's important to allow the client to do this both ways. The client can then decide whether to send small chunks or large ones.

BH> The callback API must allow for sending big chunks, and getting finer-grained notification before the whole request has completed. Of course different TTS engines will have different marker capabilities (as was noted above).

See above; I agree with what Bill writes here.

BH> I think you probably do not want to return audio samples from the TTS driver API in most cases. It's better to have some API for connecting the driver with an audio sink.

I agree the right way to return the produced audio data is to write it to a given stream. We could probably agree that the API shouldn't specify which kind of stream it is (whether some kind of audio sink, a file stream or any other kind of binary stream).

>> Good remark. But if I understand it correctly, this doesn't concern the TTS API directly, it can just receive and process the pieces separately, one by one, so there's no need for the drivers to be able to process a list of strings?

OJS> If you have markup within a phrase, then we cannot pass parts of the phrase independently of each other. So we would need a string list in this case.

I see. I'm not sure it is a clean technique, but I must think about it more before forming my opinion on it.

>> I'd suggest using SSML instead of VoiceXML. If I'm not mistaken, SSML is what is aimed at TTS, while the purpose of VoiceXML is different.

BH> There are some licensing issues to be careful of here - we must use an unencumbered XML markup flavor.

You probably mean patent issues? We should definitely avoid them. Are you aware of particular problems with SSML or its subset relevant to our API?

OJS> I thought that the GSAPI used some extension of VoiceXML, but maybe I am misinformed here. We should use the same syntax in any case. We can discuss the different possibilities on the list once it has been set up.

BH> The proposed "GSAPI 1.0" called for some XML markup; I think it's a good idea. I will re-check my notes to make sure which version we proposed; it was at the time the clear winner based on licensing issues and end-user adoption.

OK, thanks.

>> I'm not sure values other than languages are needed (except for the purpose of configuration as described in C. below). The application can decide in which language to send the text depending on the available languages, but could available voice names or genders influence the application behavior in any significant way?

OJS> KTTSD allows the user to select the preferred voices by name, and it needs to know which languages and genders are supported by the engines to switch to the correct driver if several are installed. Using different voices for different purposes (long texts, messages, navigation feedback) is also only possible if it is known which voices exist and which driver must be used to use them.

OK, I understand the purpose now. We could probably select the exact parameter set to be consistent with the voice selection features of the chosen markup?

BH> I think the voice name should be determined at the higher-level API, and the drivers should operate on a "voice" or "speaker".

I don't understand exactly what you mean here by "voice" or "speaker".

BH> I think that changing speaker within a single marked-up string is an unusual case.

I can imagine it very easily -- consider faces in Emacs. For instance, it may be very convenient to read a comment inside a line of source code with a different speaker.

>> 5. Other features needed (some of them are included and can be expressed in SSML):
>>
>> - Enabling/disabling spelling mode.

BH> Not sure this makes sense at the low level.

It does make sense, since it is language dependent.

>> - Switching punctuation and capital character signalling modes.

OJS> I am not sure what exactly you mean by these two.

In spelling mode, the given text is spelled out. Punctuation modes handle the reading of punctuation in different ways: e.g. there may be modes for explicitly reading all punctuation characters, for not reading any punctuation characters, or for reading punctuation characters as they would likely be read by a human reader. Capital character signalling mode signals capital characters within the text, e.g. by beeping before each of them.

>> - Setting rate and pitch.

OJS> There are XML tags for this, but there should be a way to set a default.

BH> I don't think we should rely _solely_ on XML for this, so I agree with you. There should be a way to set the "base" or "current" parameters on a given voice or speaker (if the voice/speaker supports this).

Agreed.

>> - Reading single characters and key names.

OJS> Would this make more sense on the driver level, or should the higher speech system deal with this to have this consistent for all drivers?

BH> Probably should be the job of the higher speech system.

Again, it is language dependent. Moreover, there can be ambiguities between texts, characters, and keys (e.g. `a' may be a word or a character in English). Maybe it could be technically solved in some way on the higher level using language-dependent tables or so, but I'd prefer not to mess with any lower-level TTS functionality in the higher-level speech systems -- let them just express clearly what they want to synthesize and let the synthesizer do it.

OJS> Sure, if a driver has no configuration options to be shown in the kttsd configuration module,

... or if the driver author doesn't want to spend his expensive time designing and writing such functionality ...

OJS> then this is not needed. I only want to avoid that kttsd, gnome-speech, SpeechDispatcher etc. all have to write their own configuration functions for the same drivers.

Agreed.

>> First we should agree on the form of the drivers. Do we want just some code base providing the defined features, or do we want to define some form of a particular API, possibly to be used by alternative APIs?

OJS> Could you explain the differences between the two options a bit?

Maybe there's actually none. :-) But we should agree on the kind of interface anyway. Shared library?

Regards,

Milan Zamazal

--
If we are going to start removing packages because of the quality of the software, wonderful.  I move to remove all traces of the travesty of editors, vi, from Debian, since obviously as editors they are less than alpha quality software.
                                    -- Manoj Srivastava in debian-devel
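As an aside to Milan's point about a shared fallback: the sketch below shows, very naively, what such a library-level utterance chunker might look like in C, splitting only on sentence-ending punctuation. It is merely an illustration of the idea under the assumption of a common driver code base; real chunking is language dependent, and the function name is invented for the example.

#include <stdlib.h>
#include <string.h>

/* Naive fallback utterance chunker: split `text' after '.', '!' or '?'
 * followed by whitespace (or end of text).  Real chunking is language
 * dependent and better done by the TTS engine itself; this is only a
 * last resort for engines without their own chunking.
 * Returns a NULL-terminated array of newly allocated strings. */
char **tts_fallback_chunk(const char *text)
{
    size_t cap = 8, n = 0;
    char **chunks = malloc(cap * sizeof *chunks);
    const char *start = text;
    const char *p = text;

    while (*p) {
        int end_of_sentence = (*p == '.' || *p == '!' || *p == '?') &&
                              (p[1] == ' ' || p[1] == '\n' || p[1] == '\0');
        p++;
        if (end_of_sentence || *p == '\0') {
            size_t len = (size_t)(p - start);
            if (n + 2 > cap)                    /* room for chunk + NULL */
                chunks = realloc(chunks, (cap *= 2) * sizeof *chunks);
            chunks[n] = malloc(len + 1);
            memcpy(chunks[n], start, len);
            chunks[n][len] = '\0';
            n++;
            while (*p == ' ' || *p == '\n')     /* skip inter-sentence gap */
                p++;
            start = p;
        }
    }
    chunks[n] = NULL;
    return chunks;
}

A driver without engine-level chunking could feed these pieces to its engine one by one, so the first audio data arrives after the first sentence rather than after the whole text.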
Hi Milan!

Thanks for your comments on the requirement list.

[Milan Zamazal, Tuesday, 26 October 2004 20:58]

> [Since the mailing list apparently hasn't been created yet, I continue in private so as not to freeze the discussion for too long.]

I have just asked David Stone when we can start using the list.

> BTW, this might be the subject of another standardization step. I'd like to look at kttsd features -- is there some reasonable description or documentation of kttsd available?

http://accessibility.kde.org/developer/kttsd/

> Or more generally a sequence of audio samples. Motivation: I think most software synthesizers we are likely to support perform processing of the whole text in several steps, only the last of them being the writing of the whole produced audio sample somewhere. When synthesizing long texts, it is desirable to allow the synthesizer to split the input into several pieces so that we don't wait too long for the first audio data.

KTTSD already does this, and I think it would be duplication of work to do it in every driver if the higher speech system can take care of this. Doing it before sending the phrases to the engines allows interrupting a longer text with warnings, etc.

> OJS> 2.b) For hardware speech: possibility to set markers and to get feedback whenever a marker has been reached.
>
> Markers should be available for both software and hardware synthesis. But they differ in their form: while with hardware synthesis feedback should be received whenever the marker is reached in the audio output, with software synthesis the positions of the markers in the returned audio sample should be returned. Or the audio sample can be returned in several pieces as described above; in particular, it can be split on marker positions and the returned list could contain not only the audio samples but also the reached markers.

Is there any advantage to sending the whole text at once to the drivers, rather than sending it in smaller pieces, each of which returns an audio stream? If sending it in a bigger piece avoids lags, then it might perhaps be worth the bigger complexity in the API, but if the lags would be small anyway, then I would suggest to keep the API simpler.

> Good remark. But if I understand it correctly, this doesn't concern the TTS API directly, it can just receive and process the pieces separately, one by one, so there's no need for the drivers to be able to process a list of strings?

If you have markup within a phrase, then we cannot pass parts of the phrase independently of each other. So we would need a string list in this case.

A driver can easily turn the string list back into a string; it would only help those drivers that parse the string for tags rather than passing it on to an XML-supporting engine.

> I'd suggest using SSML instead of VoiceXML. If I'm not mistaken, SSML is what is aimed at TTS, while the purpose of VoiceXML is different.

I thought that the GSAPI used some extension of VoiceXML, but maybe I am misinformed here. We should use the same syntax in any case. We can discuss the different possibilities on the list once it has been set up.

> I'm not sure values other than languages are needed (except for the purpose of configuration as described in C. below). The application can decide in which language to send the text depending on the available languages, but could available voice names or genders influence the application behavior in any significant way?

KTTSD allows the user to select the preferred voices by name, and it needs to know which languages and genders are supported by the engines to switch to the correct driver if several are installed. Using different voices for different purposes (long texts, messages, navigation feedback) is also only possible if it is known which voices exist and which driver must be used to use them.

> 5. Other features needed (some of them are included and can be expressed in SSML):
>
> - Enabling/disabling spelling mode.
>
> - Switching punctuation and capital character signalling modes.

I am not sure what exactly you mean by these two.

> - Setting rate and pitch.

There are XML tags for this, but there should be a way to set a default.

> - Reading single characters and key names.

Would this make more sense on the driver level, or should the higher speech system deal with this to have this consistent for all drivers?

> OJS> We could either add these functions to the driver API, or we could define a standard API for driver configuration libraries.
>
> This functionality would be nice, but it should be optional, so as not to put more burden on the drivers than absolutely needed.

Sure, if a driver has no configuration options to be shown in the kttsd configuration module, then this is not needed. I only want to avoid that kttsd, gnome-speech, SpeechDispatcher etc. all have to write their own configuration functions for the same drivers.

> First we should agree on the form of the drivers. Do we want just some code base providing the defined features, or do we want to define some form of a particular API, possibly to be used by alternative APIs?

Could you explain the differences between the two options a bit?

Olaf

--
Olaf Jan Schmidt, KDE Accessibility Project
KDEAP co-maintainer, maintainer of http://accessibility.kde.org
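To illustrate the kind of voice, language and gender information Olaf describes KTTSD as needing, a driver could expose something like the following listing call. Again this is only a sketch with made-up names (tts_voice, tts_driver_list_voices), not a proposal text.

/* Illustrative voice description, so that a higher-level system such
 * as KTTSD can pick the right driver and voice for a given purpose
 * (long texts, messages, navigation feedback, ...). */

struct tts_driver;                       /* opaque driver handle */

typedef enum {
    TTS_GENDER_UNKNOWN,
    TTS_GENDER_FEMALE,
    TTS_GENDER_MALE
} tts_gender;

typedef struct {
    const char *name;        /* engine-specific voice name, e.g. "kal16" */
    const char *language;    /* language code, e.g. "en", "de", "cs"     */
    tts_gender  gender;
} tts_voice;

/* Return an array of voice descriptions terminated by an entry whose
 * `name' is NULL; the array is owned by the driver. */
const tts_voice *tts_driver_list_voices(struct tts_driver *drv);

A configuration front end or speech daemon could then walk this list to decide, for instance, which installed driver offers a female German voice, without hard-coding knowledge about individual engines.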
Wow! The number of people on this list is sure growing!

>> Or more generally a sequence of audio samples. Motivation: I think most software synthesizers we are likely to support perform processing of the whole text in several steps, only the last of them being the writing of the whole produced audio sample somewhere. When synthesizing long texts, it is desirable to allow the synthesizer to split the input into several pieces so that we don't wait too long for the first audio data.
>
> KTTSD already does this, and I think it would be duplication of work to do it in every driver if the higher speech system can take care of this. Doing it before sending the phrases to the engines allows interrupting a longer text with warnings, etc.

Is this level of detail something that needs to be exposed at the driver level? How the driver chooses to handle lengthy input text seems like it needs to be more of an implementation detail than a driver interface specification. I think the overall requirement here is that the time to first sample be "very short" and the time to cancel a request in process should also be "very short". (You guys define what "very short" means.)

The other level of detail that needs to be worked out is whether a stream of audio data is returned to the app or whether the app supplies the driver with a place to send the audio to. IMO, this appears to be a bit of a stylistic thing and I don't see strong benefits or drawbacks one way or the other. If someone gives you audio, you can send it to a sink. If someone allows you to give the driver a sink, you can write your sink to give you audio. In either case, the functions of pause/resume/cancel/ff/rev all introduce a fair amount of complexity, especially when the driver is trying to multithread things to give you its best performance possible.

In FreeTTS, we chose the latter (i.e., give FreeTTS a sink to send the audio to). It works OK, but is a little unintuitive and our implementation kind of puts the app at the mercy of FreeTTS when it comes to the timing. For the former (i.e., have the driver give you audio), I'm not sure how many other engines out there really support this level of flexibility.

BTW, an overlying problem that needs to be worked out across the whole OS is the notion of managing contention for the audio output device. For example, do you queue, mix, cancel, etc. multiple audio requests? This situation will happen very frequently in the case of an OS that plays audio when windows appear and disappear - this behavior causes contention with a screen reader that wants to also say the name of the window that was just shown.

>> Markers should be available for both software and hardware synthesis. But they differ in their form: while with hardware synthesis feedback should be received whenever the marker is reached in the audio output, with software synthesis the positions of the markers in the returned audio sample should be returned. Or the audio sample can be returned in several pieces as described above; in particular, it can be split on marker positions and the returned list could contain not only the audio samples but also the reached markers.

I think the main thing to think about here is how the app is to get the events. A few methods:

1) Sending an audio stream and a marker index to the client. This gives the client more control and allows it to manage its own destiny. Adds a bit of complexity to the client, though.

2) Sending a linear sequence of audio data and marker data. Similar to #1, but I'm not so sure you're going to find a synthesizer that implements things this way.

3) The MRCP way (I think), which is to have separate things for playing and handling events. The synthesizer will spew data to the audio sink and events to the clients. The timing issues here are a bit odd to me, because one can never be sure the client receives the event the moment (or even near the moment) the audio is played. In any case, this is most similar to what the hardware synthesizers are doing.

>> I'd suggest using SSML instead of VoiceXML. If I'm not mistaken, SSML is what is aimed at TTS, while the purpose of VoiceXML is different.

Just as a clarification: SSML is a sub-spec of the VoiceXML effort. It is based upon JSML, the Java Speech API Markup Language, which was created by my group here at Sun.

Will
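To tie the marker discussion to the SSML question: SSML's <mark> element is one concrete way for a client to name positions in the text and have the driver report an event as each mark is reached (roughly Will's method 3, or method 1 if the positions are returned alongside the audio). The C fragment below reuses the hypothetical callback types sketched earlier in this thread; only the SSML elements themselves come from the SSML specification, everything else is invented for the example.

#include <stdio.h>

/* Sketch: the client embeds SSML <mark> elements in the text and
 * registers an event callback; the driver sends audio to its sink and
 * reports marker events separately.  API names are hypothetical. */

static const char *ssml_text =
    "<speak version=\"1.0\" xml:lang=\"en\">"
    "  Loading your document.  <mark name=\"halfway\"/>"
    "  <prosody rate=\"slow\">Almost finished.</prosody>"
    "  <mark name=\"end-of-message\"/>"
    "</speak>";

static void on_event(tts_event_type type, const char *marker_name,
                     void *client_data)
{
    (void) client_data;
    if (type == TTS_EVENT_MARKER)
        /* e.g. update a progress display or queue the next chunk */
        printf("reached marker: %s\n", marker_name);
}

/* ... later: tts_driver_say(drv, ssml_text, audio_cb, on_event, NULL); ... */

With method 3 the timing caveat Will mentions still applies: the event tells the client the marker has been synthesized, not necessarily that the corresponding audio has already been heard.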