misc/RFC: TTS+MIDI, phonetics

well, I am posting this where I think it may be relevant...

basically, this was part of a misc idea that came up, and I went and beat 
together the code for it (AKA: I don't expect it to amount to much).


the idea was that I would combine a speech synthesizer/TTS engine and a 
MIDI synth, and see if I could get anything "interesting" out of it (such as 
combining music and a synth'ed voice, singing TTS, ...).


in general, it was created by mashing together 2 pieces of code I had 
written before, for which I had noticed some internal similarity:
a TTS engine / speech synth (where mostly I had used diphone synthesis, but 
had experimented some with formant synthesis);
a MIDI synth, where in my case I had used wavetable synthesis.


the TTS engine had had some of the usual front-end machinery, such as text 
normalization, phonetic dictionary handling/lookup, ..., so I kept this.

the MIDI synth is, well, a MIDI synth...

combining them, however, forced a good deal of alteration to the machinery 
for both.
particularly, many pieces of functionality from the TTS engine (such as 
"voices") were absorbed into the MIDI synth, and wavetables/patches are 
essentially defined relative to the voice, ...

however, the MIDI synth still plays MIDI files, as before.
as-is, the voice patches override GM patches, but I am likely to move the 
voice patches to bank 2 (banks 0 and 1 being GM and GM2).


the TTS frontend has been reworked mostly so that it produces short MIDI 
fragments, which basically translate the phonetic information into a stream 
of MIDI commands (the frontend has control over matters such as voice 
frequency and timing, ...).

these commands mostly work in terms of a voice-derived wavetable, and AFAIK 
the process is a variant of formant synthesis, although I don't actually 
simulate the voice signal (mostly I use loops derived from various vowel and 
consonant sounds, as well as a few non-looping patches).

mostly this is because it is a lot easier to get a convincing 'ah' or 'eh' 
by deriving it from an actual voice, and by using several recorded 
frequencies in an attempt to cover the vocal range (similar to how multiple 
recordings of an instrument at different notes are used internally in the 
wavetable...).
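
for illustration, the sample-selection step is roughly like this (just a 
sketch with made-up names; the real wavetable code is a bit more involved):

#include <math.h>

/* sketch: pick the recorded vowel loop nearest the target pitch, then
   compute the resampling step needed to hit the exact frequency.
   the struct/field names here are hypothetical. */
typedef struct {
    float base_freq;    /* frequency the loop was recorded at (Hz) */
    short *samples;     /* looped waveform data */
    int length;         /* loop length in samples */
} VoiceLoop;

static int pick_nearest_loop(VoiceLoop *loops, int n, float target_freq)
{
    int i, best = 0;
    float d, bestd = 1e30f;
    for (i = 0; i < n; i++) {
        d = fabsf(loops[i].base_freq - target_freq);
        if (d < bestd) { bestd = d; best = i; }
    }
    return best;
}

/* step>1.0 plays the loop faster (higher pitch), step<1.0 slower */
static float resample_step(VoiceLoop *loop, float target_freq)
{
    return target_freq / loop->base_freq;
}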

nothing prevents me from using purely synthetic voices, only that I don't 
see as much need at present...

diphthongs are currently synthesized, but this doesn't sound very good, and 
I have doubts about using recorded diphthongs (mostly timing/frequency 
issues...). however, I am not sure of a good mechanism to synthesize them 
(simply blending between the adjacent sounds is not very good...).
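
for reference, by "simply blending" I mean just a linear crossfade between 
the two adjacent vowel loops over the duration of the glide, roughly:

#include <stddef.h>

/* naive diphthong blend: linear crossfade from vowel A to vowel B.
   this is the approach that doesn't sound very good, shown only to
   illustrate what "simply blending" means here. */
void blend_diphthong(const float *a, const float *b, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        float t = (n > 1) ? (float)i / (float)(n - 1) : 0.0f;
        out[i] = (1.0f - t) * a[i] + t * b[i];
    }
}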

at this point, mostly still battling basic comprehensibility issues...


otherwise, it may be worth noting that for composing the MIDI, I am using a 
textual representation of the command-stream, mostly as this is a little 
easier to compose (via sprintf/...) than would be a binary representation.
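
so composing a fragment looks roughly like this (the mnemonics here are 
made up for illustration; my actual textual format differs some):

#include <stdio.h>

/* sketch: compose a textual MIDI fragment for one phoneme via sprintf.
   "prgc"/"non"/"dly"/"noff" are placeholder command names. */
void emit_phoneme(char *buf, size_t bufsz,
                  int chan, int patch, int note, int ticks)
{
    snprintf(buf, bufsz,
        "prgc %d %d\n"      /* select the voice patch for this phoneme */
        "non %d %d 100\n"   /* note-on: which pitch to "speak" at */
        "dly %d\n"          /* hold for the phoneme's duration (ticks) */
        "noff %d %d\n",     /* note-off */
        chan, patch, chan, note, ticks, chan, note);
}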


an issue though is that of how to best represent a combination of text and 
MIDI information (for the input).

one possibility is to use an odd syntax to just sort of "stuff in" MIDI 
commands, but this seems not very good. another uncertainty is how to best 
represent commands to the voice (such as "speak in this particular note", 
"speak at this rate", ...).

I guess another point of uncertainty is the issue commonly seen in singing, 
where people will sing part of a word at one note, and then sing another 
part at another note, ...

as-is, breaking up a word like this would confuse the dictionary, and to 
address this would require representing the words in phonetic form, ...


another issue:
for the phonetic form, is the IPA really necessary?... (internally, I don't 
use IPA, rather a customized ASCII-based notation, vaguely similar to SAMPA 
but currently without non-letter chars, and in many places different as I 
didn't know about SAMPA originally...).

actually, personally I would rather change the notation some (reorganizing 
some of the letters, ...), but the main issue I guess is that I would have 
to rework my dictionary (may be a worthwhile tradeoff, in the past it would 
have been more difficult), ...


I guess a partial issue is what the most "ideal" notation for phonetic 
transcriptions would be?...
(part of my "ideal" I guess is the avoidance of non-ASCII characters, and 
preferably avoidance of any special characters as well...).

current thinking:
a-z: typical "base sounds"
A-Z: typical "alternate sounds"
ax-zx (excluding xx): additional alternate sounds, or, as a case-insensitive 
alternate to upper-case forms (for example, in filenames, ...).
Ax-Zx: yet more alternate sounds
aX-zX: yet more
AX-ZX: yet more

this allows 155 sounds (6x26, minus the excluded 'xx'), although... as-is I 
had yet to exceed the prior limit of 52 (lower+upper case), though this is 
probably because I am generally being far less precise than the IPA?...

52 sounds could be done with:
a-z
A-Z | ax-zx


this would be in contrast to my current notation, which uses 'x' as a prefix 
(for a similar purpose):
xa-xz, xA-xZ, ... (and in which xa!=A, as is, I have to use an alternate 
notation in filenames, ...).
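
splitting a transcription under the proposed scheme would be fairly 
mechanical, something like (a sketch, ignoring the 'a'+'x' vs 'ax' 
ambiguity):

#include <ctype.h>

/* sketch: split a transcription into phoneme tokens under the proposed
   scheme: one letter (a-z or A-Z), optionally followed by an 'x' or 'X'
   modifier (so "ax", "AX", ... read as single phonemes).
   returns the number of tokens written. */
int split_phonemes(const char *s, char toks[][3], int max)
{
    int n = 0;
    while (*s && n < max) {
        if (!isalpha((unsigned char)*s)) { s++; continue; }
        toks[n][0] = *s++;
        toks[n][1] = 0;
        toks[n][2] = 0;
        if (*s == 'x' || *s == 'X') toks[n][1] = *s++;
        n++;
    }
    return n;
}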


the other major changes would be reorganizing some of the letter assignments 
(from my current notation) to be more "traditional"...

(actually, I may use SAMPA partly as a template, trying mostly to add an 
alternate notation, AKA: without special symbols and more flexible WRT case, 
mostly so that it is safer to mix with file names, and with other syntactic 
elements which may also need to use these non-letter characters...).

it is uncertain if it should remain as a mixed-case notation, or be forced 
into being a case-insensitive notation. my current bias is to keep it as 
case-sensitive, but allow certain alternate forms, mostly for file-naming 
(forcing a full case-insensitive notation is likely to just make things 
ugly...).


or such...


-- 
BGB: Hobbyist Programmer (Specialty: 3D, Compilers, VMs)
Site: http://cr88192.dyndns.org/


cr88192
7/4/2009 5:27:28 PM

On Jul 4, 12:27 pm, "cr88192" <cr88...@hotmail.com> wrote:

> I guess another point of uncertainty is the issue commonly seen in singing,
> where people will sing part of a word at one note, and then sing another
> part at another note, ...

Could you use the bender for this?

--
lxt
luserXtrog
7/5/2009 7:31:04 AM
"cr88192" <cr88192@hotmail.com> wrote in message
news:h2o3e2$f08$1@news.albasani.net...
> another issue:
> for the phonetic form, is the IPA really necessary?... (internally, I don't
> use IPA, rather a customized ASCII-based notation, vaguely similar to SAMPA
> but currently without non-letter chars, and in many places different as I
> didn't know about SAMPA originally...).

If you ever want to distribute this as a useful application (who knows!), proper
IPA support would be nice. SAMPA is a poor man's approximation of the IPA set,
and it has a fair set of strange decisions ... however, it /can/ be entered with
any keyboard.
And, as always, you are free to decide for a scheme for yourself.
The con is that you cannot /mix/ these approaches -- your own scheme could
suddenly pop up and mess up an IPA phrase. Perhaps you could prefix each phrase
with a unique identifier:
"=hElo world"
where the '=' indicates using your private system.

As for needing more than the standard set of a..z/A..Z, SAMPA proves (for me :-)
that throwing in even more ASCII characters for each unique sound doesn't really
help. Perhaps you can get by with multi-character strings, although you should
try to avoid 'ax', 'ex', 'ox' for sounds that have nothing to do with 'echh' -- 
"th" is easier to parse as a soft theta than "tx". All you need to do is finding
a way to incorporate multi-character phonemes /without/ having them pop up
unadvertently :-) -- bracket them? (F.e., "[th]eta" wise) Do you have a list of
problematic phonemes?

Interesting project!

[Jw]


Jongware
7/5/2009 10:32:42 AM
On 2009-07-05, [Jongware] <IdontWantSpam@hotmail.com> wrote:
> "cr88192" <cr88192@hotmail.com> wrote:
>> for the phonetic form, is the IPA really necessary?... (internally,
>> I don't use IPA, rather a customized ASCII-based notation, vaguely
>> similar to SAMPA but currently without non-letter chars, and in
>> many places different as I didn't know about SAMPA originally...).
>
> If you ever want to distribute this as a useful application
> (who knows!), proper IPA support would be nice. SAMPA is a poor
> man's approximation of the IPA set, and it has a fair set of
> strange decisions... however, it /can/ be entered with any keyboard.
....
>> an issue though is that of how to best represent a combination
>> of text and MIDI information (for the input).

Whatever else it can handle, surely it must be able to handle the
text simply in MIDI-lyric-events ?

(If your application can't handle plain old MIDI-lyric-events,
then someone is still going to have to write a
MIDI-with-lyric-events-to-your-application-input-format translator.)

> breaking up a word like this would confuse the dictionary, and to 
> address this would require representing the words in phonetic form,

If the text is in MIDI-lyric-events, it can be reassembled into words
ready for your dictionary, can't it?  I mean, if the lyric-text ends
with a space, that means end-of-word...

If the text is in *phonetic* MIDI-lyric-events (I think they have
to be in 7-bit ASCII, don't they?) then that means SAMPA, I guess.
But if it comes from a MusicXML file, (translation not *too* hard,
see http://www.pjb.com.au/midi/musicxml2mid.html ) then it could
be Unicode.

>> actually, personally I would rather change the notation some
>> (reorganizing some of the letters, ...), but the main issue
>> I guess is that I would have to rework my dictionary

You shouldn't have to rework the dictionary, can't you just translate
SAMPA or IPA or English or whatever (if you can do English, you can
do any language, English is the most irregular) into your dictionary
format ?

>> how to best represent commands to the voice (such as "speak in
>> this particular note", "speak at this rate", ...).

If you want plain spoken TTS text within a MIDI file, I guess
you can either (1) use something weirder than lyric-events, or
(2) define a male-spoken-voice patch and a female-spoken-voice
and any other voices you want in a bank somewhere.
(I'm not completely sure I understand the question...)

> Interesting project!

Absolutely...  synth'ed voice and singing TTS I'd just love to see!
Can your MIDI synth use the same kind of soundfonts and stuff that
timidity can be configured to use, I mean for the non-voice stuff ?

Very interesting project!
Could it replace both timidity and festival ?

Regards,  Peter

-- 
Peter Billam       www.pjb.com.au    www.pjb.com.au/comp/contact.html
Peter
7/5/2009 11:45:28 AM
"luserXtrog" <mijoryx@yahoo.com> wrote in message 
news:cad33091-64c4-4c64-bccc-5f5f4d73ae3e@s31g2000yqs.googlegroups.com...
On Jul 4, 12:27 pm, "cr88192" <cr88...@hotmail.com> wrote:

>> I guess of uncertainty is the issue commonly seen in singing things, 
>> where
>> people will sing part of a word at one note, and then sing another part 
>> at
>> another note, ...
>
> Could you use the bender for this?
>

bender?...

not sure which feature this is exactly (not sure of any MIDI command with 
that name...).

part of the problem though is that commands tend to be represented 
sequentially, and it would be problematic to represent a command in the 
middle of the word without breaking up the word.


I guess potentially a kind of prefix command could be used, but then the 
question would be "where in the word to change the note?".

one idea I partly thought up is this:
a word ends with '-', which indicates a word break.

^C4 merr- ^D4 ily  ^E4 they ^A3 went ^G4 a- ^F4 long ^C4 their ^E4 way

so, then it can join and look up the word, and try to guess where to break 
it again in the phonetic transcription...

^C4 *mer ^D4 *ily ...
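
the join step itself is not too bad, something like this (a sketch; the 
dictionary lookup and the re-splitting of the transcription are left out):

#include <stdio.h>
#include <string.h>

/* sketch: join hyphen-broken fragments ("merr-", "ily") into a whole
   word for dictionary lookup, remembering where the break was so the
   phonetic transcription can later be re-split near the same point. */
void join_for_lookup(const char *frag1, const char *frag2,
                     char *word, size_t wordsz, size_t *split)
{
    *split = strlen(frag1) - 1;            /* drop the trailing '-' */
    snprintf(word, wordsz, "%.*s%s", (int)*split, frag1, frag2);
}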


this transformation could be a little awkward though, as my TTS frontend is 
essentially structured around a stack machine...



cr88192
7/5/2009 3:23:58 PM
"[Jongware]" <IdontWantSpam@hotmail.com> wrote in message 
news:23d84$4a508143$3ec348e5$25955@news.chello.nl...
> "cr88192" <cr88192@hotmail.com> wrote in message
> news:h2o3e2$f08$1@news.albasani.net...
>> another issue:
>> for the phonetic form, is the IPA really necessary?... (internally, I 
>> don't
>> use IPA, rather a customized ASCII-based notation, vaguely similar to 
>> SAMPA
>> but currently without non-letter chars, and in many places different as I
>> didn't know about SAMPA originally...).
>
> If you ever want to distribute this as a useful application (who knows!), 
> proper
> IPA support would be nice. SAMPA is a poor man's approximation of the IPA 
> set,
> and it has a fair set of strange decisions ... however, it /can/ be 
> entered with
> any keyboard.
> And, as always, you are free to decide for a scheme for yourself.
> The con is that you cannot /mix/ these approaches -- your own scheme could
> suddenly pop up and mess up an IPA phrase. Perhaps you could prefix each 
> phrase
> with a unique identifier:
> "=hElo world"
> where the '=' indicates using your private system.
>

ok.
typically I have used '*' for phonetic fragments, maybe I could use '*' for 
my notation, and '[...]' for SAMPA?...

*helowerld
*DIsIzqfrexz

of course, this would mean either supporting both in my backend (duplicated 
code/effort), or doing a transcription...

(however, a transcription approach could also be made to handle IPA, where 
it would be transcribed...).


note that, without brackets, my TTS engine tends to assume it is a normal 
word, either looking it up in the dictionary or trying to invoke phonics 
magic...
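
so the front-end dispatch would be basically (a sketch; the SAMPA 
transcription itself would be a separate table-driven step):

/* sketch: decide how to treat an input token.
   '*'    -> already in my phonetic notation, pass straight through
   '[..]' -> SAMPA, transcribe into my notation first
   else   -> ordinary word: dictionary lookup, else phonics magic */
enum tok_kind { TOK_PHONETIC, TOK_SAMPA, TOK_WORD };

enum tok_kind classify_token(const char *s)
{
    if (s[0] == '*') return TOK_PHONETIC;
    if (s[0] == '[') return TOK_SAMPA;     /* expect a matching ']' */
    return TOK_WORD;
}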


> As for needing more than the standard set of a..z/A..Z, SAMPA proves (for 
> me :-)
> that throwing in even more ASCII characters for each unique sound doesn't 
> really
> help. Perhaps you can get by with multi-character strings, although you 
> should
> try to avoid 'ax', 'ex', 'ox' for sounds that have nothing to do with 
> 'echh' -- 
> "th" is easier to parse as a soft theta than "tx". All you need to do is 
> finding
> a way to incorporate multi-character phonemes /without/ having them pop up
> unadvertently :-) -- bracket them? (F.e., "[th]eta" wise) Do you have a 
> list of
> problematic phonemes?
>

ok, in my newer notation ax/ex/ox/... ended up being assigned to diphthongs 
(I freed up A/E/I/O/U for use as vowels, which had before contained both 
vowels and diphthongs).

'Ax'/'Ex'/... could be used for diphthongs instead, but I had used 'ax'/... 
for this.

at first, I figured I could make diphthongs implicit, but then realized a 
bigger problem:
I would need a notation to indicate when not to use diphthongs.


q/Q is "redefined" in my notation as a vowel (allowing 12 base vowels, as 
well as 12 "extended" vowels, several of which are used as diphthongs).

as is, I currently have about 10 base vowels (I started with 8, but with 
more thought came up with 2 more...).


'x' (in SAMPA) has been moved to 'K'.

under the current scheme, "soft theta" (I assume 'voiced th' is meant by 
this) is 'D'.


in my case, words are either pure phonetic or pure textual. partial 
bracketing is not done as this would confuse the current processing 
machinery...


I decided on keeping the system proper as case-sensitive, and essentially 
use a mangling hack to map it to a case-insensitive form.

I guess the major alternative is to continue using my prior notation 
externally (essentially, a variant of the cmudict/Festival notation...).


> Interesting project!
>

maybe, just something random in my case...



cr88192
7/5/2009 4:11:21 PM
"Peter Billam" <peter@www.pjb.com.au> wrote in message 
news:slrnh514ip.v4h.peter@box8.pjb.com.au...
> On 2009-07-05, [Jongware] <IdontWantSpam@hotmail.com> wrote:
>> "cr88192" <cr88192@hotmail.com> wrote:
>>> for the phonetic form, is the IPA really necessary?... (internally,
>>> I don't use IPA, rather a customized ASCII-based notation, vaguely
>>> similar to SAMPA but currently without non-letter chars, and in
>>> many places different as I didn't know about SAMPA originally...).
>>
>> If you ever want to distribute this as a useful application
>> (who knows!), proper IPA support would be nice. SAMPA is a poor
>> man's approximation of the IPA set, and it has a fair set of
>> strange decisions... however, it /can/ be entered with any keyboard.
> ...
>>> an issue though is that of how to best represent a combination
>>> of text and MIDI information (for the input).
>
> Whatever else it can handle, surely it must be able to handle the
> text simply in MIDI-lyric-events ?
>
> (If your application can't handle plain old MIDI-lyric-events,
> then someone is still going to have to write a
> MIDI-with-lyric-events-to-your-application-input-format translator.)
>

I had not thought of this...

MIDI is primarily used by the synth backend, whereas TTS would require 
running it through the frontend, which had thus far assumed a sort of 
annotated text input...

I will have to look into lyric events, and see if hopefully there is some 
good way to key the lyrics to the music (such as to particular MIDI channel 
or whatever).


>> breaking up a word like this would confuse the dictionary, and to
>> address this would require representing the words in phonetic form,
>
> If the text is in MIDI-lyric-events, it can be reassembled into words
> ready for your dictionary, can't it?  I mean, if the lyric-text ends
> with a space, that means end-of-word...
>

I will have to look more. the issue is how closely the lyric events can be 
keyed to the notes. unlike a human, the TTS engine is much less capable at 
figuring these things out from context.


> If the text is in *phonetic* MIDI-lyric-events (I think they have
> to be in 7-bit ASCII, don't they?) then that means SAMPA, I guess.
> But if it comes from a MusicXML file, (translation not *too* hard,
> see http://www.pjb.com.au/midi/musicxml2mid.html ) then it could
> be Unicode.
>

will have to look at this...

didn't really know about MusicXML...

basically, music is not really my strong area, and I initially approached it 
more for pragmatic reasons, so I am not all that familiar with the field in 
general...


>>> actually, personally I would rather change the notation some
>>> (reorganizing some of the letters, ...), but the main issue
>>> I guess is that I would have to rework my dictionary
>
> You shouldn't have to rework the dictionary, can't you just translate
> SAMPA or IPA or English or whatever (if you can do English, you can
> do any language, English is the most irregular) into your dictionary
> format ?
>

well, the issue was mostly one of changed notation...

however, since my dictionaries are initially translated from a different 
notation (the CMU dictionary notation), I modified my conversion tool and 
re-converted them (figuring this is both easier and less lossy than it would 
be to make another tool to convert from my old notation to my new notation).

this also required some further tweaking of the phonics machinery, which did 
not perfectly handle the transition.


but, yes, the dictionary approach "should" be able to more-or-less handle 
whatever language is used.
however, the text normalization code/... would likely need to be adjusted 
(for example, providing alternate functions for things like how to read out 
numbers, alternate phonics rules for unhandled cases, ...).

it would mean though providing various per-language dictionaries, ...

so, currently, my main focus is English...


>>> how to best represent commands to the voice (such as "speak in
>>> this particular note", "speak at this rate", ...).
>
> If you want plain spoken TTS text within a MIDI file, I guess
> you can either (1) use something weirder than lyric-events, or
> (2) define a male-spoken-voice patch and a female-spoken-voice
> and any other voices you want in a bank somewhere.
> (I'm not completely sure I understand the question...)
>

basically, I have been assuming an annotated text input (vs a binary MIDI 
input).

binary MIDI + lyric events is an interesting idea, just I had not thought of 
it...


in the case of annotated text, it would be mostly an issue of deciding the 
exact notation for the annotations. the issue then becomes with "pure text" 
input, how to avoid accidentally interpreting unintended things as commands, 
....

as is though, the current default input is a text+commands format.
just I will need to define a few more commands.


FWIW, I currently don't really distinguish much between the male voice and 
the female voice, apart from maybe based on frequency, so am currently 
internally assuming more of an "androgynous" voice...

partly though this is because I am lazy (combining together male and female 
derived vocal samples), and with this level of DSP, it is kind of hard to 
tell anyways, apart from based on frequency...

if a female were singing at 100Hz, we can just make the simplifying 
assumption that she would sound like a male, and for a male at 200Hz, the 
assumption could be a more female-sounding voice (AFAICT, this seems to be 
the primary difference anyways, as otherwise the acoustics don't seem too 
much different...).


>> Interesting project!
>
> Absolutely...  synth'ed voice and singing TTS I'd just love to see!
> Can your MIDI synth use the same kind of soundfonts and stuff that
> timidity can be configured to use, I mean for the non-voice stuff ?
>

originally, yes...

this is where I grabbed my original MIDI patches from, although I converted 
them, and since dropped the relevant code for the TTS effort (mostly in 
favor of wav's and text files). I could re-add the code from my original 
MIDI synth if needed though...


> Very interesting project!
> Could it replace both timidity and festival ?
>

I have not really compared it with either of them; it is more just something 
I had done for my own purposes.

however, I had ripped off some data from both projects in my 2 source 
projects.


current TTS with this synth is not as good as Festival, mostly because 
Festival uses diphone synthesis (as did my older synth), or potentially 
unit-selection (which further increases the "naturalness" of the speech), 
whereas for this one I had switched to a wavetable+formant approach (mostly 
because diphones are not very "flexible": the DSP required to bend their 
pitch or tempo somewhat drastically reduces quality, and it is inherently 
difficult to keep the pitch well-controlled, ... whereas with a more 
synthetic approach, one has more fine-grained control from the start, and 
how flexible it is depends mostly on the code...).

dropping diphones is not free though, as doing so suddenly makes it sound a 
good deal more like "MS Sam" and friends (although the voice itself is more 
natural-sounding, the timing/intonation/... is clearly fake, and it has some 
similar acoustic artifacts to "MS Sam" and similar...).


sadly, *lots* of additional and detailed fiddling would be required if the 
quality is to be any good... (or, even if a good level of comprehensibility 
is to be achieved...).


I guess there is also "Flinger", which I guess uses unit selection + 
post-filtering, but I have not checked what the output is like...


> Regards,  Peter
>
> -- 
> Peter Billam       www.pjb.com.au    www.pjb.com.au/comp/contact.html 


cr88192
7/5/2009 5:09:25 PM
Not really competent to comment on most of this (except to agree that
it's interesting stuff...!), but one point I can correct:

In article <slrnh514ip.v4h.peter@box8.pjb.com.au>,
Peter Billam  <contact.html@www.pjb.com.au> wrote:
>> "cr88192" <cr88192@hotmail.com> wrote:
>>> an issue though is that of how to best represent a combination
>>> of text and MIDI information (for the input).
>
>Whatever else it can handle, surely it must be able to handle the
>text simply in MIDI-lyric-events ?
This was my initial thought also.
>
> [....]
>If the text is in *phonetic* MIDI-lyric-events (I think they have
>to be in 7-bit ASCII, don't they?) then that means SAMPA, I guess.
Actually they don't.  The data inside a midifile metaevent (including
lyric events) can be *any* sequence of bytes, because the length is
specified in the prefix.  So you'd be free to use any phonetic scheme
you like.

[Oh... responding to another post in the thread, the suggestion
to use a "bender" was I'm sure referring to pitch-bend events.
Because events in a midifile are timed, you could emit pitchbends
to shift the note after the phoneme info had been sent, and still
have it happen somewhere in the middle of the sound.]
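
(For illustration, a pitch-bend message is just three bytes: the status 
byte 0xE0 or'd with the channel, then the 14-bit bend value split into two 
7-bit bytes, LSB first, with 8192 meaning "no bend".  A rough sketch:)

/* sketch: build a MIDI pitch-bend message. */
void midi_pitch_bend(unsigned char out[3], int channel, int bend /* 0..16383 */)
{
    out[0] = (unsigned char)(0xE0 | (channel & 0x0F));
    out[1] = (unsigned char)(bend & 0x7F);         /* low 7 bits */
    out[2] = (unsigned char)((bend >> 7) & 0x7F);  /* high 7 bits */
}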

Cheers,
			-- Pete --

neverland
7/5/2009 10:07:13 PM
In article <h2qmo6$r26$1@news.albasani.net>,
cr88192 <cr88192@hotmail.com> wrote:
>
>"Peter Billam" <peter@www.pjb.com.au> wrote in message 
>news:slrnh514ip.v4h.peter@box8.pjb.com.au...
>>
>> Whatever else it can handle, surely it must be able to handle the
>> text simply in MIDI-lyric-events ?
>>
>I had not thought of this...
>
>MIDI is primarily used by the synth backend, whereas TTS would require 
>running it through the frontend, which had thus far assumed a sort of 
>annotated text input...
>
>I will have to look into lyric events, and see if hopefully there is some 
>good way to key the lyrics to the music (such as to particular MIDI channel 
>or whatever).

Lyric events are timed just like any other, and can contain as little
of the song text as you need (often just a syllable), so you can cue them
exactly to the note timing.  However, they don't include channel information
directly, so you'd have to fudge something for that (maybe just adding the
channel number to the text in some defined way).  In theory you could put
the lyric event in any track of a multitrack midifile (to connect with
the notes in that track) but most sequencers and such assume that metaevents
will be in track 0, so that might cause trouble.
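
(For reference, a lyric event in a midifile is just a meta-event of type 
0x05: a delta-time, then FF 05, a length, and the text bytes.  A rough 
sketch, assuming the delta-time and length each fit in a single byte:)

#include <string.h>

/* sketch: write one lyric meta-event.  layout:
     <delta-time> FF 05 <length> <text bytes>
   (delta-time and length are really variable-length quantities;
   both are assumed to fit in one byte here.) */
int write_lyric_event(unsigned char *out, unsigned delta, const char *text)
{
    size_t len = strlen(text);
    int n = 0;
    out[n++] = (unsigned char)(delta & 0x7F);
    out[n++] = 0xFF;                  /* meta-event */
    out[n++] = 0x05;                  /* type: lyric */
    out[n++] = (unsigned char)len;    /* length, assumed < 128 */
    memcpy(out + n, text, len);
    return n + (int)len;
}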

Cheers,
			-- Pete --

neverland
7/5/2009 10:20:31 PM
"Pete" <neverland@GOODEVEca.net> wrote in message 
news:3_CdnbSQL_-yuszXnZ2dnUVZ_s6dnZ2d@lmi.net...
> In article <h2qmo6$r26$1@news.albasani.net>,
> cr88192 <cr88192@hotmail.com> wrote:
>>
>>"Peter Billam" <peter@www.pjb.com.au> wrote in message
>>news:slrnh514ip.v4h.peter@box8.pjb.com.au...
>>>
>>> Whatever else it can handle, surely it must be able to handle the
>>> text simply in MIDI-lyric-events ?
>>>
>>I had not thought of this...
>>
>>MIDI is primarily used by the synth backend, whereas TTS would require
>>running it through the frontend, which had thus far assumed a sort of
>>annotated text input...
>>
>>I will have to look into lyric events, and see if hopefully there is some
>>good way to key the lyrics to the music (such as to particular MIDI 
>>channel
>>or whatever).
>
> Lyric events are timed just like any other, and can contain as little
> of the song text as you need (often just a syllable), so you can cue them
> exactly to the note timing.  However, they don't include channel 
> information
> directly, so you'd have to fudge something for that (maybe just adding the
> channel number to the text in some defined way).  In theory you could put
> the lyric event in any track of a multitrack midifile (to connect with
> the notes in that track) but most sequencers and such assume that 
> metaevents
> will be in track 0, so that might cause trouble.
>

yep, otherwise it would just be monotone spoken lyrics, which would sort of 
be pointless...

it is worth noting that the way my synth works, 1-3 channels are needed just 
to do the speech synth (typically only 1 is used, but 1-2 additional 
channels are needed for some constructions).

this would mean a channel layout something like:
channels 1-9: free for music
channel 10: drum track
11-13: free for music (except with GM2, where 11 may be another drum track)
14-16: reserved for TTS

luckily, I don't think MIDI files typically use this many channels...


I guess it would also likely involve producing a secondary MIDI stream, and 
then merging it back into the first.

I guess, if a channel is to be used for keying the voice to, another 
question is whether to leave this track in place, or strip the track prior 
to merging.

all this would require re-architecting the process slightly.

another alternative would be to decompose the MIDI stream, and force the 
entire thing through the TTS engine, with it composing a new stream for the 
output, which would simply require providing a way of feeding MIDI commands 
through the TTS engine. I guess, either way, I might need some mechanism to 
support out-of-order events.

both options seem awkward...

a slightly less awkward approach could be an async write-joiner (type 
thingy...), where code would produce multiple streams, and then 
asynchronously write them into the joiner, which would then go about 
producing a new output stream with the various input streams merged 
together. potentially, it could also include a "channel allocator" (sort of 
like a register allocator...) such that the input streams need not worry as 
much about conflicting channels.

(actually, the above is likely to be necessary as well for the construction 
of "drum machines" and other similar features, which would otherwise be 
difficult to produce...).
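
roughly, the joiner would work something like this (a sketch; events carry 
absolute ticks, and the channel-allocator part and delta-time recomputation 
are left out):

#include <stdlib.h>

/* sketch of the write-joiner: each input stream hands over events
   stamped with an absolute tick; the joiner merge-sorts them into one
   stream (delta-times would be recomputed when writing it back out). */
typedef struct {
    unsigned tick;           /* absolute time */
    unsigned char msg[3];    /* the MIDI message bytes */
} Event;

static int cmp_event(const void *a, const void *b)
{
    const Event *ea = a, *eb = b;
    return (ea->tick > eb->tick) - (ea->tick < eb->tick);
}

/* merge two already-collected streams into 'out' (caller sizes it) */
int join_streams(const Event *s1, int n1,
                 const Event *s2, int n2, Event *out)
{
    int i;
    for (i = 0; i < n1; i++) out[i] = s1[i];
    for (i = 0; i < n2; i++) out[n1 + i] = s2[i];
    qsort(out, (size_t)(n1 + n2), sizeof(Event), cmp_event);
    return n1 + n2;
}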


it may seem odd, but I primarily use single-track midi streams (multi-track 
MIDI files are typically merged on read...).


> Cheers,
> -- Pete --
> 


cr88192
7/6/2009 12:34:36 AM
"Pete" <neverland@GOODEVEca.net> wrote in message 
news:IvydncqMKcCMuczXnZ2dnUVZ_tSdnZ2d@lmi.net...
> Not really competent to comment on most of this (except to agree that
> it's interesting stuff...!), but one point I can correct:
>
> In article <slrnh514ip.v4h.peter@box8.pjb.com.au>,
> Peter Billam  <contact.html@www.pjb.com.au> wrote:
>>> "cr88192" <cr88192@hotmail.com> wrote:
>>>> an issue though is that of how to best represent a combination
>>>> of text and MIDI information (for the input).
>>
>>Whatever else it can handle, surely it must be able to handle the
>>text simply in MIDI-lyric-events ?
> This was my initial thought also.
>>
>> [....]
>>If the text is in *phonetic* MIDI-lyric-events (I think they have
>>to be in 7-bit ASCII, don't they?) then that means SAMPA, I guess.
> Actually they don't.  The data inside a midifile metaevent (including
> lyric events) can be *any* sequence of bytes, because the length is
> specified in the prefix.  So you'd be free to use any phonetic scheme
> you like.
>
> [Oh... responding to another post in the thread, the suggestion
> to use a "bender" was I'm sure referring to pitch-bend events.
> Because events in a midifile are timed, you could emit pitchbends
> to shift the note after the phoneme info had been sent, and still
> have it happen somewhere in the middle of the sound.]
>

oh, ok.

this would be an odd way to do it though, and would be an ugly 
post-processing hack IMO. better I think would be to figure out the correct 
note to sing up-front...


> Cheers,
> -- Pete --
> 


cr88192
7/6/2009 12:37:46 AM
cr88192 said:
> "luserXtrog" <mijoryx@yahoo.com> wrote...
> "cr88192" wrote:
> 
>>> I guess of uncertainty is the issue commonly seen in singing
>>> things, where
>>> people will sing part of a word at one note, and then sing
>>> another part at
>>> another note, ...
>>
>> Could you use the bender for this?
> 
> bender?...
> 
> not sure which feature this is exactly (not sure of any MIDI
> command with that name...).

Presumably he is referring to pitch-bend. But I don't think that's 
your problem. It seems to me that your problem is one of clean 
syntax design, or at least you hint as much in your OP. Pitch-bend 
may or may not be helpful as a solution to the problem of 
representing variable intonation, but it isn't going to solve your 
syntax problem for you.

<snip>

-- 
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Forged article? See 
http://www.cpax.org.uk/prg/usenet/comp.lang.c/msgauth.php
"Usenet is a strange place" - dmr 29 July 1999
Richard
7/6/2009 4:47:12 AM
"Richard Heathfield" <rjh@see.sig.invalid> wrote in message 
news:FcednfaRaruhHczXnZ2dnUVZ8hJi4p2d@bt.com...
> cr88192 said:
>> "luserXtrog" <mijoryx@yahoo.com> wrote...
>> "cr88192" wrote:
>>
>>>> I guess of uncertainty is the issue commonly seen in singing
>>>> things, where
>>>> people will sing part of a word at one note, and then sing
>>>> another part at
>>>> another note, ...
>>>
>>> Could you use the bender for this?
>>
>> bender?...
>>
>> not sure which feature this is exactly (not sure of any MIDI
>> command with that name...).
>
> Presumably he is referring to pitch-bend. But I don't think that's
> your problem. It seems to me that your problem is one of clean
> syntax design, or at least you hint as much in your OP. Pitch-bend
> may or may not be helpful as a solution to the problem of
> representing variable intonation, but it isn't going to solve your
> syntax problem for you.
>

yeah, pretty much...

there was the idea of using MIDI lyric events (and a binary MIDI input), but 
the problem here is how to key the lyrics to the music (apart from assuming 
extra data be included, but this would IMO defeat the point of lyric 
events...).


however, one syntax idea that came to mind is to allow using '-' as a word 
break, such that a word break could be given, and notes changed.

^C4 merr- ^D4 ily ...

another issue I realize now is one of timing:
not only does one care about the rate and frequency of the words, but also 
about when the words are said.

this opens up yet another set of awkward design issues (such as the possible 
need for timestamps, ...).


so, yes, the "combined whole" is starting to look a little more complex than 
either MIDI or TTS by themselves...


some of the issues could be addressed with certain features I had thought 
up, such as asynchronous MIDI-stream joiners, but timestamps are an issue in 
their own right.

a very simple trick though could be to add explicit breaks along 
quarter-note boundaries, where a command is given that serves to re-align 
the TTS engine to the next note.

^| and then ^| I went ^| down the street ^| to find ^| my homies ^| on the 
beat ^| ...


however, this leaves an issue of what to do if/when a synthed fragment goes 
over a note, where likely having it take 2-notes would not be the intended 
result (potentially throwing the lyrics out of sync with the beat, ...).

(it probably doesn't help much that I don't really know "music theory" 
either...).


and so on...


> <snip>
>
> -- 
> Richard Heathfield <http://www.cpax.org.uk>
> Email: -http://www. +rjh@
> Forged article? See
> http://www.cpax.org.uk/prg/usenet/comp.lang.c/msgauth.php
> "Usenet is a strange place" - dmr 29 July 1999 


cr88192
7/6/2009 5:12:06 AM
In article <h2rgqu$uvf$1@news.albasani.net>,
cr88192 <cr88192@hotmail.com> wrote:
>
>"Pete" <neverland@GOODEVEca.net> wrote in message 
>news:3_CdnbSQL_-yuszXnZ2dnUVZ_s6dnZ2d@lmi.net...
>> In article <h2qmo6$r26$1@news.albasani.net>,
>> cr88192 <cr88192@hotmail.com> wrote:
>>>
>>>MIDI is primarily used by the synth backend, whereas TTS would require
>>>running it through the frontend, which had thus far assumed a sort of
>>>annotated text input...
>>>
>>>I will have to look into lyric events, and see if hopefully there is some
>>>good way to key the lyrics to the music (such as to particular MIDI 
>>>channel or whatever).
>>
>> Lyric events are timed just like any other, and can contain as little
>> of the song text as you need (often just a syllable), so you can cue them
>> exactly to the note timing.  However, they don't include channel 
>> information
>> directly, so you'd have to fudge something for that (maybe just adding the
>> channel number to the text in some defined way).  In theory you could put
>> the lyric event in any track of a multitrack midifile (to connect with
>> the notes in that track) but most sequencers and such assume that 
>> metaevents will be in track 0, so that might cause trouble.
>>
>yep, otherwise it would just be monotone spoken lyrics, which would sort of 
>be pointless...
I was assuming (initially at least) that there would only be one 'voice',
so lyric events could be associated with a particular melody channel.
Only if you wanted vocal harmony would you need some way of tagging them
to different channels.  (And note the distinction between "tracks" and
"channels"...)
>
>it is worth noting that the way my synth works, 1-3 channels are needed just 
>to do the speech synth (typically only 1 is used, but 1-2 additional 
>channels are needed for some constructions).
>
Thinking a bit more about all this, though, it strikes me that there's
been a bit too much concentration on "MIDI".  It only comes in because
your original synth code happens to use it.  MIDI is good for its original
purpose -- sending notes and other musical data down a wire -- but gets
a bit contorted when you want to do much more.  Midifiles can hold a
few other things, like lyrics, but the MIDI protocol itself can't transmit
these anywhere outside the computer actually processing the file.

The only feature of pure MIDI that could transmit the extra information
is the System Exclusive message (which seems to have escaped mention so
far!), so I suppose if sticking with MIDI is important for some reason,
you could use that.

It would seem much better to adapt the way your synth is driven, as
things like pitch and loudness could be supplied in any form (as the text
you originally suggested for example).

If eventually separating data source and sound generation into different
machines is desirable, you might want to look at the "OSC" (Open Sound
Control) protocol, which is intended as an open-ended successor to MIDI
(and other things).  It seems to have been -- and is being -- adopted by
quite a few projects, both open-source and commercial.  Its structure is
such that you can transmit any kind of data over as many "channels" as
you like.
   < http://opensoundcontrol.org/introduction-osc >

Cheers,
			-- Pete --

neverland
7/6/2009 10:43:17 PM
"Pete" <neverland@GOODEVEca.net> wrote in message 
news:v7WdndW9PdWY48_XnZ2dnUVZ_v2dnZ2d@lmi.net...
> In article <h2rgqu$uvf$1@news.albasani.net>,
> cr88192 <cr88192@hotmail.com> wrote:
>>
>>"Pete" <neverland@GOODEVEca.net> wrote in message
>>news:3_CdnbSQL_-yuszXnZ2dnUVZ_s6dnZ2d@lmi.net...
>>> In article <h2qmo6$r26$1@news.albasani.net>,
>>> cr88192 <cr88192@hotmail.com> wrote:
>>>>
>>>>MIDI is primarily used by the synth backend, whereas TTS would require
>>>>running it through the frontend, which had thus far assumed a sort of
>>>>annotated text input...
>>>>
>>>>I will have to look into lyric events, and see if hopefully there is 
>>>>some
>>>>good way to key the lyrics to the music (such as to particular MIDI
>>>>channel or whatever).
>>>
>>> Lyric events are timed just like any other, and can contain as little
>>> of the song text as you need (often just a syllable), so you can cue 
>>> them
>>> exactly to the note timing.  However, they don't include channel
>>> information
>>> directly, so you'd have to fudge something for that (maybe just adding 
>>> the
>>> channel number to the text in some defined way).  In theory you could 
>>> put
>>> the lyric event in any track of a multitrack midifile (to connect with
>>> the notes in that track) but most sequencers and such assume that
>>> metaevents will be in track 0, so that might cause trouble.
>>>
>>yep, otherwise it would just be monotone spoken lyrics, which would sort 
>>of
>>be pointless...
> I was assuming (initially at least) that there would only be one 'voice',
> so lyric events could be associated with a particular melody channel.
> Only if you wanted vocal harmony would you need some way of tagging them
> to different channels.  (And note the distinction between "tracks" and
> "channels"...)

ok...

I had assumed something more like:
music plays in the background;
singing goes over the top of the music...


>>
>>it is worth noting that the way my synth works, 1-3 channels are needed 
>>just
>>to do the speech synth (typically only 1 is used, but 1-2 additional
>>channels are needed for some constructions).
>>
> Thinking a bit more about all this, though, it strikes me that there's
> been a bit too much concentration on "MIDI".  It only comes in because
> your original synth code happens to use it.  MIDI is good for its original
> purpose -- sending notes and other musical data down a wire -- but gets
> a bit contorted when you want to do much more.  Midifiles can hold a
> few other things, like lyrics, but the MIDI protocol itself can't transmit
> these anywhere outside the computer actually processing the file.
>

yeah...

actually, thus far I am only using a basic set of the built in features:
turning on and off notes, along with program change...

the TTS system basically works by sending program-change messages and 
turning on and off notes (corresponding to various voice-related patches).


however, the synth itself was modified, mostly in terms of somewhat 
increasing the complexity of the 'wavetable' system (vs what was being used 
in the original synth).


> The only feature of pure MIDI that could transmit the extra information
> is the System Exclusive message (which seems to have escaped mention so
> far!), so I suppose if sticking with MIDI is important for some reason,
> you could use that.
>

well, my other main option would be to direct-drive the synth, which is 
mostly a whole bunch of signal processing and mixing code...


> It would seem much better to adapt the way your synth is driven, as
> things like pitch and loudness could be supplied in any form (as the text
> you originally suggested for example).
>

yes, the text is currently the main input...

the MIDI stage is essentially internal (what connects the frontend to the 
backend...).


> If eventually separating data source and sound generation into different
> machines is desirable, you might want to look at the "OSC" (Open Sound
> Control) protocol, which is intended as an open-ended successor to MIDI
> (and other things).  It seems to have been -- and is being -- adopted by
> quite a few projects, both open-source and commercial.  Its structure is
> such that you can transmit any kind of data over as many "channels" as
> you like.
>   < http://opensoundcontrol.org/introduction-osc >
>

well, thus far, MIDI is working...


I don't really need to send any additional info to the synth; it's just that 
people had suggested playing MIDI files with lyrics (which I had not 
considered originally). but, supporting MIDI (files) at the same time as TTS 
adds a lot of complexity (vs just using it for the synth stage of the TTS 
engine).

so, it is not a synth which was itself extended to do voice, rather voice is 
being done via the synth (vs other possible approaches, such as via diphones 
or unit selection...).

I don't actually have to send any additional info at present, FWIW...


> Cheers,
> -- Pete --
> 


cr88192
7/7/2009 12:36:27 AM
>>>"Pete" <neverland@GOODEVEca.net> wrote in message
>>>> Lyric events are timed just like any other, and can contain as little
>>>> of the song text as you need (often just a syllable), so you can cue 
>>>> them exactly to the note timing.  However, they don't include channel
>>>> information directly, so you'd have to fudge something for that
>>>> (maybe just adding the channel number to the text in some defined way).

It's true: or use sysex as also already suggested.

Presumably if MIDI input did contain plain old lyric-events, those
words ought to get applied to all voice-like channels, and there
is a fair amount of music for which this would be adequate, e.g.
  http://www.pjb.com.au/muscript/samples/ich_fahr.pdf
  http://www.pjb.com.au/muscript/samples/ich_fahr.mid

>>>> In theory you could put the lyric event in any track of a
>>>> multitrack midifile (to connect with the notes in that track)
>>>> but most sequencers and such assume that metaevents will be
>>>> in track 0, so that might cause trouble.

If this process is going to involve a custom input format, which
it is (e.g. sysex), then the fact that other synths don't do
metaevents on tracks>0 doesn't matter - they don't sing anyway.

On 2009-07-07, cr88192 <cr88192@hotmail.com> wrote:
> the MIDI stage is essentially internal (what connects the
> frontend to the backend...).  ....
> so, it is not a synth which was itself extended to do voice,
> rather voice is being done via the synth (vs other possible
> approaches, such as via diphones or unit selection...).

OK.  OTOH I have a lot of MIDI files, like choir things where the
different voices on their different channels usually don't sing
the same word at the same time, or poppier things where there's a
lead singer and some "doo-waa" backing vocals, or plain old folk-
tune and guitar, and I'd just love to be able to hear these things
being *sung* on a synth somehow.  We've had TTS for decades, and
singing should actually be easier because the frequency is taken
care of, and your project is tantalisingly close to doing the job...

If some way could be standardised of getting lyrics-per-channel into
a MIDI file (sysex or an extended lyric-event or multiple tracks),
or some other input format (like MusicXML, which is a pigsty of a
format but then MIDI also has its problems) then I've got a lot of
music hanging out for some synth-extended-to-do-voice.

Timidity doesn't do it, and though Festival with a sufficiently
assertive pre-processor might set the frequencies right,
it doesn't do rhythms, or other patches...

Anyway, all the best with your most interesting project,
Regards,  Peter

-- 
Peter Billam       www.pjb.com.au    www.pjb.com.au/comp/contact.html
Peter
7/7/2009 5:57:50 AM
"Peter Billam" <peter@www.pjb.com.au> wrote in message 
news:slrnh55ouu.1ni.peter@box8.pjb.com.au...
>>>>"Pete" <neverland@GOODEVEca.net> wrote in message
>>>>> Lyric events are timed just like any other, and can contain as little
>>>>> of the song text as you need (often just a syllable), so you can cue
>>>>> them exactly to the note timing.  However, they don't include channel
>>>>> information directly, so you'd have to fudge something for that
>>>>> (maybe just adding the channel number to the text in some defined 
>>>>> way).
>
> It's true: or use sysex as also already suggested.
>
> Presumably if MIDI input did contain plain old lyric-events, those
> words ought to get applied to all voice-like channels, and there
> is a fair amount of music for which this would be adequate, e.g.
>  http://www.pjb.com.au/muscript/samples/ich_fahr.pdf
>  http://www.pjb.com.au/muscript/samples/ich_fahr.mid
>

yep.

I guess, if one were to "customize" the MIDI some (as opposed to stock MIDI 
files), one possible option could be to use a sort of "magic patch", which 
would basically cause a channel to not be played (in itself), but would 
indicate that said channel is used mostly for coordinating the vocals.

the issue then would be that of figuring a good way to attach the TTS to 
this channel.


I guess one possible (somewhat different) way would be, rather than storing 
the lyrics linearly, to store them in a table. then, a little additional 
trickery would be needed to figure out how to key each table entry to the 
particular notes being played.

a tweaky hack could be to actually merge this table with the patches, such 
that a combination of ProgramChange and NoteOn would indicate to play a 
particular piece of lyric, with the NoteOn event effectively encoding which 
note to sing...

banks 96-127 could be used for this purpose (or 32-63...).


granted though, something like this would not be a backwards-compatible 
extension (I guess I have little idea what the defined behavior is for 
missing patches and unknown events).
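
on the playback side, the lookup would be something like this (a sketch; 
the table contents and everything else here are purely illustrative):

#include <stdio.h>

/* sketch of the "magic patch" lyric-table idea: a ProgramChange on the
   vocal channel selects an entry in a lyric table, and the following
   NoteOn says which pitch to sing it at. */
static const char *lyric_table[128] = {
    "merr-", "ily", "they", "went", "a-", "long"  /* filled from the song */
};
static int cur_program = 0;

void on_program_change(int program) { cur_program = program & 127; }

void on_note_on(int note, int velocity)
{
    const char *syl = lyric_table[cur_program];
    if (velocity > 0 && syl)
        printf("sing \"%s\" at MIDI note %d\n", syl, note);
}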


>>>>> In theory you could put the lyric event in any track of a
>>>>> multitrack midifile (to connect with the notes in that track)
>>>>> but most sequencers and such assume that metaevents will be
>>>>> in track 0, so that might cause trouble.
>
> If this process is going to involve a custom input format, which
> it is (e.g. sysex), then the fact that other synths don't do
> metaevents on tracks>0 doesn't matter - they don't sing anyway.
>

yep...

meanwhile I tend to always just use single-track MIDI anyways (merging 
multi-track files into a single track).


> On 2009-07-07, cr88192 <cr88192@hotmail.com> wrote:
>> the MIDI stage is essentially internal (what connects the
>> frontend to the backend...).  ....
>> so, it is not a synth which was itself extended to do voice,
>> rather voice is being done via the synth (vs other possible
>> approaches, such as via diphones or unit selection...).
>
> OK.  OTOH I have a lot of MIDI files, like choir things where the
> the different voices on their different channels usually don't sing
> the same word at the same time, or poppier things where there's a
> lead singer and some "doo-waa" backing vocals, or plain old folk-
> tune and guitar, and I'd just love to be able to hear these things
> being *sung* on a synth somehow.  We've had TTS for decades, and
> singing should actually be easier because the frequency is taken
> care of, and your project is tantalisingly close to doing the job...
>

partly though, it is about the synthesis as well...
diphone synth or unit selection are unlikely to really be able to do 
singing.

formant-like approaches are a lot more capable, albeit with an inherently 
reduced "naturalness"...


> If some way could be standardised of getting lyrics-per-channel into
> a MIDI file (sysex or an extended lyric-event or multiple tracks),
> or some other input format (like MusicXML, which is a pigsty of a
> format but then MIDI also has its problems) then I've got a lot of
> music hanging out for some synth-extended-to-do-voice.
>

yep.
hacked-over MIDI is one option.

I guess another could be to make a somewhat modified/extended file-format, 
which could change a few things:
potentially, a different coding for the command streams;
if multi-tracks are used, they actually mean something (just splitting 
individual channels into their own tracks seems a little pointless IMO...);
ability to send-in custom patches (sort of like mod and s3m...);
....

but, then the question is whether something like this would actually be 
worthwhile.


example of a revised command stream:
all inputs are VLIs (although, I would probably switch to a more 
matroska-like form);
a few of the LSB bits would define the value type (delay, opcode, event, 
....);
numeric fields are also VLIs.

potentially, the stream could be made self-contained, such that an external 
container format is largely unnecessary (the file begins with a special sync 
opcode, and the whole rest of the file is a raw command stream).
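
for example, writing a VLI along the MIDI/VLQ lines would be something like 
(a sketch; a matroska-like form would put the length bits up front instead):

/* sketch: write an unsigned value as a VLI, 7 bits per byte with a
   continuation bit in the MSB (same idea as MIDI's variable-length
   quantities).  returns the number of bytes written. */
int write_vli(unsigned char *out, unsigned long v)
{
    unsigned char tmp[10];
    int i = 0, n = 0;
    do { tmp[i++] = (unsigned char)(v & 0x7F); v >>= 7; } while (v);
    while (i > 0) {
        unsigned char b = tmp[--i];
        out[n++] = (unsigned char)(i ? (b | 0x80) : b);
    }
    return n;
}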


then again, I think such a format already exists, would just have to go look 
into it.


> Timidity doesn't do it, and though Festival with a sufficiently
> assertive pre-processor might set the frequencies right,
> it doesn't do rhythms, or other patches...
>

yep.


> Anyway, all the best with your most interesting project,
> Regards,  Peter
>

at this stage, I guess it is probably more of an "experiment" though...


> -- 
> Peter Billam       www.pjb.com.au    www.pjb.com.au/comp/contact.html 


cr88192
7/7/2009 12:48:57 PM
"cr88192" <cr88192@hotmail.com> wrote in message
news:h2o3e2$f08$1@news.albasani.net...
> well, I am posting this where I think it may be relevant...
>
> basically, this was part of a misc idea that came up, and I went and beat
> together the code for it (AKA: I don't expect it to amount to much).
>
>
> the idea was that I would combine together a speech synthesizer/TTS engine
> and a MIDI synth, and see if I could get much "interesting" from it (such
as
> combining music and a synth'ed voice, singing TTS, ...).
>
>
It may put you off perhaps, but you could model your MIDI plus text input on
what these guys are doing commercially:
http://www.soundsonline-europe.com/Symphonic-Choirs-PLAY-Edition-pr-EW-182.html
This is MIDI notes plus a special text input program. It works well.
There are also three tutorials on YouTube that show how to use it.

SysExJohn.


Fred
7/31/2009 11:36:30 AM