Clarence J. LeBel Professor Emeritus of Electrical Engineering and Computer Science at MIT
RLE was saddened to hear the news of Ken Stevens' passing. Professor Stevens was an important part of the RLE community, contributing many works to his field of research and inspiring many students, collaborators, and colleagues alike. He will be greatly missed.
The following is an excerpt from a 1987 article for RLE Currents:
Professor Ken Stevens, a Toronto native, came to MIT in 1948 as a Teaching Assistant in the Electrical Engineering Department after receiving his Master's in Engineering Physics at the University of Toronto. Since joining the RLE faculty in 1958, he has been central to the development of speech communication research at the laboratory.
What was the focus of your graduate research at MIT?
At first, I worked in the MIT Acoustics Lab. The speech work in the Acoustics Lab started in 1948 with Leo Beranek, who had an Air Force contract to study problems related to the intelligibility of processed speech. He worked on that project with some students over the years, and I became involved in that work as a graduate student in 1951.
At that time (about 1949 or 1950), Gunnar Fant visited MIT to study the acoustics of speech production. I became interested in the perception side of speech and worked with Leo Beranek and J.C.R. Licklider on the perception of speech-like sounds. I wrote my doctoral thesis on the perception of sounds that had speech-like characteristics. Beranek’s work, combined with Gunnar Fant's studies on the acoustics of speech production and my research on the perception of speech-like sounds, and some additional work on the intelligibility of speech, formed the beginning of speech work at the Acoustics Lab.
Did you have a mentor?
I would say that it was Leo Beranek, who was one of the directors of the Acoustics Lab. He taught courses in acoustics, and one of his interests was speech. When I first came to MIT, I hadn't thought about going into acoustics, but Beranek needed teaching assistants in his acoustics course. My background was originally in engineering physics, not so much in acoustics.
Why did you choose to teach?
I really liked doing research with the graduate students here at MIT and so the teaching fit in with that. It was a good place to do research, and so I did some teaching.
What was the nature of your research as a Guggenheim Fellow from 1962-1963?
I worked in Gunnar Fant’s laboratory at the Royal Institute of Technology in Stockholm. One of the things that I studied was speech movements with cineradiographic (x-ray) motion pictures. Recently, we haven't collaborated, but he did visit here in 1982, and we do keep in touch with each other.
In the early days of RLE's speech communication research, what was the focus of its investigations?
Some of the work in speech at the Acoustics Lab became part of RLE. The Acoustics Lab had disbanded, and I remember talking to Professor Wiesner at the time about the possibility of this small group of researchers working in speech coming under the umbrella of RLE. He thought that it was in line with the other communications work that was already going on at RLE. There was already some speech work being conducted at RLE, and a group of people met regularly to talk about the problems of speech. One of the individuals in this group was the Director of the Modern Languages Department, William Locke. Bob Fano was also part of this group.
If you look back at early RLE reports, you might find a section on linguistics with Noam Chomsky, and Morris Halle was there too. We've always had interaction with Morris Halle, and I guess our work could be characterized by trying to find or quantify more closely the relations between the acoustic and articulatory events in speech and the linguistic descriptions that underlie speech events. Morris Halle had a strong influence on the early directions of the speech group, although his interests centered on the phonological aspects of speech. Morris Halle has always had a strong influence on my own thinking, as has Gunnar Fant.
Even in those early days, we were interested in speech synthesis. So, apart from understanding the fundamental aspects of speech production and perception (which we are still doing), the application of speech synthesis was an early activity, even when Gunnar Fant visited in the early 1950s. That developed even further with Jonathan Allen’s and Dennis Klatt’s interests in speech synthesis. Allen and Klatt, together with the RLE students, brought the speech synthesis work to a culmination with some practical results. Then, within the last five years, there has been an increasing interest in speech recognition and the application of speech to computers. So, this brings to bear much of the basic information that has accumulated in various places over the years to the practical problem of speech recognition.
As people here work on the problems of synthesis and recognition, we realize that there are still some basic aspects of speech production and perception that we still don't understand. An example is the recent work of Dennis Klatt. He found that although he could get reasonable naturalness in the synthesis of male voices, it was a problem to achieve good naturalness for female voices. So, it was necessary to go back and study in greater detail the properties of sounds that are generated by females. Then, that basic information could be used to improve the synthesis of female voices. Similar things have happened in speech recognition.
Also, as the speech recognition work continues, we realize that we must rely heavily on what the linguists are able to come up with - phonological representations of speech that bring to light, in a natural way, some of the modifications that occur when we speak in conversation: what happens when you put speech into context, and other kinds of modifications that are made in the sounds when speech occurs in a natural context.
How would you characterize your research in the acoustical aspects of speech production in contrast to other RLE research groups (auditory physiology, sensory communication, and digital signal processing)? What is the nature of your interaction with these different groups?
We began to look at how sounds were generated in the vocal tract and the actual acoustic mechanisms of sound production, and in fact, we are still continuing that work. We are interested in the link between what happens in the sounds and the underlying linguistic descriptions in terms of phonemes and features. Our goal has been to join the understanding of the sound and the linguistic description. One of the big influences over the years in this area has been the people in linguistics, particularly Morris Halle and Jay Keyser.
In relation to auditory physiology, we are interested in the stages in processing of the sound, leading ultimately to a linguistic description. One of the stages through which sounds must pass is the ear, obviously. The shaping of sounds in the auditory periphery could form an initial step in the chain of processes that produce a description in categorical terms. Our concern with auditory physiology is to keep in touch with what the investigators are doing, and, where possible, to incorporate their research into our models.
In terms of digital signal processing, the speech signal has to be processed initially by digital methods. In fact, when Alan Oppenheim started on the faculty, he was in the speech communications group. Then, he branched out into digital signal processing, and it became an important field in its own right.
How would you characterize the diverse background of investigators who are attracted to the field of acoustic phonetics?
Many linguists are not concerned with the actual details of sound. Phonologists think of speech as being a sequence of sounds, and do not go beyond this characterization. They address the different kinds of regularities and constraints on patterns of sound; how a language is described in terms of constraints on the sequences of speech sounds that are allowed; and, how these sequences are changed when you place the words into context.
But, more recently, there is a group of phonologists who are becoming interested in phonetics. They are trying to explain some of these phonological regularities in terms of constraints on either the listener or the speaker, and the constraints of the actual mechanics of how these sounds are generated.
For example, certain sounds influence others. A classic example is "did" followed by "you" becomes "didju." Phonologists would simply say that there's a rule that says /d/ plus /y/ will change to /j/. Now, people are trying to explain these changes in terms of the mechanics of the ear and the vocal tract. So, there has been a coming together of people who work in the speech area and those individuals who work in that part of linguistics.
Your ongoing research involves acoustic variability and invariance in speech production. Can you explain the nature of this investigation?
When different people say a particular sound, or when one individual says the same sound in different words or sentences, it appears as though the sound undergoes a lot of change from one person to another, and from one context to another. We are interested in exploring what is common between all those productions of the sound. In spite of the variability, there are some attributes that remain invariant. That’s what we pick up on when we listen to each other. It doesn't matter who says the sound, it doesn't matter what word the sound appears in, we still hear the same sound.
Our approach is to categorize these sounds by certain properties or features, and to discover what those properties are. We believe there is an inventory of properties or features that is an integral part of the human speech production and processing system. Different combinations of properties are used in different languages, but there is a fixed inventory of properties.
Can you describe the research that you and Dennis Klatt have conducted on vocal tract modeling?
There are two sides to vocal tract modeling. One question that we are trying to answer is: by developing complex models of the vocal tract itself (including the properties of the vocal tract walls and the properties of the vocal cords), can we further understand the mechanisms of the generation of individual speech sounds?
Then, there is the broader aspect of speech modeling (you might call it speech synthesis). How can we build a device that will take printed words as input and turn them into speech? Not only do you have to know how to produce the individual sounds, but you also have to put these sounds together with the right sense of timing and intonation. That’s a problem that Jonathan Allen and Dennis Klatt have worked on for the last twenty years with some success.
Does your research also include the study of speaker verification and recognition?
It automatically comes out of some of our work. If you're looking for the invariants, you're also studying variability when you examine how one speaker differs from another. Over the years, I've had one or two thesis students in this area, but I haven't delved into it very much. This whole business of speaker verification using spectrograms, or by some other method, is a difficult area, and I’m not certain these methods will lead to reliable identification of speakers.
Data collection in the field of speech processing is a tremendously labor-intensive and time-consuming task. What are some of the scientific tools that help you to collect and analyze this large body of data?
With the capability to store large amounts of data in computers, it has been possible to record a database with large numbers of talkers and lots of sentences, and then label all of the sounds in that database. As a result, it is possible to access that database, request a specific sound, and perform some statistical analysis of the properties of that particular sound. Victor Zue and his group have assembled a large database for that purpose.
SPIRE is a basic tool that enables us to look at individual speech sounds in many ways - spectra and spectrograms, for example. The SEARCH program is an extension of SPIRE. It allows us to search a large database and plot distributions of different acoustic properties for speech sounds in different phonetic contexts.
Does your research involve speech aids for the handicapped?
I have worked on speech training aids for the handicapped, especially for children who must learn to speak but cannot hear. One approach is to provide them with some type of feedback of their speech patterns by abstracting and displaying information from the spectrogram so that they can see when they speak properly. In my consultancy with BBN, we didn't use spectrograms because the technology wasn't available to generate them quickly at the time. So, we displayed simpler patterns, like the pitch and timing of speech.
What is the nature of your consultancy at Bolt, Beranek, and Newman?
My more recent work with BBN was to develop methods for measuring people's hearing at very high frequencies, far beyond what is needed for speech. It is important to be able to do this because there are some things that can damage hearing. High-intensity noise, or large doses of certain drugs like aspirin, can affect some people's hearing. In some cases, hearing is affected first at the very high frequencies; then the effect gradually spreads down into the lower frequencies. So, it is important to be able to measure those effects on hearing at very high frequencies.
Are you excited about a current project that you're working on?
In the past, we have tried to examine speech sounds and their properties as they occur in simple utterances (consonants, vowels, syllables, etc.). We are now interested in moving toward more natural types of speech, looking at similar properties to understand this whole process of how sounds become modified within natural speech. That's the thrust now, both in recognition and synthesis.
I'm enthusiastic about "rounding off" our previous work. We've learned a lot about individual speech sounds and how they are produced and perceived. There are still many loose ends to pull together before we move on to the next stage. At this moment, I'm interested in pulling together those loose ends and putting them in a book. Then, I would like to move on to the study of speech in a conversational context.
How do you measure success in terms of testing and developing your ideas?
One measure of success is whether the applications in speech synthesis or speech recognition can actually work and be used by people. In the case of synthesis, there has been some reasonable success. In speech recognition, perhaps not so much. Another measure of success is that you understand the concept of how this whole speech process works, and you fill in the gaps of your knowledge of the process; gradually piecing together this jig-saw puzzle. Whether or not it leads to an application is not the point, but rather, if all pieces of the puzzle fit together.
So, you could say that one measure of success is if all of these different pieces of information - whether they be from speech physiology, speech acoustics, speech perception, or phonology - fit together into a coherent picture. Obviously, we are still trying to build that picture, and I believe that it’s beginning to fit together. To some extent, we are happy about that, and to some extent we are frustrated because there’s still so much to learn.
What has been the most challenging project that you've worked on?
One of the most challenging things is to try to uncover the basic invariant properties from the speech signal, in spite of all of its variability. Particularly for some sounds, it’s been a real challenge. For example, what distinguishes a /p/ from a /t/ from a /k/? It’s the kind of question that we still don’t have a good answer for.
During your professional career, what do you consider to be the major breakthroughs or milestones that have significantly contributed to or changed the field of acoustic phonetic research?
There is no question that the ability to use the computer to look at data conveniently and quickly, and to perform signal processing, is a major breakthrough. The computers give us access to larger databases, and allow us to test hypotheses with a much faster turnaround time. The disadvantage is that it is too easy to test ideas, and we don't spend enough time thinking about them before they are implemented, because they are so easy to implement.
More broadly, I would say that Gunnar Fant's work on acoustics and the insights of Roman Jakobson into the linguistic description of sounds have represented major milestones. In the past decades, researchers have been trying to build on these ideas.
What do you see as the direction of future research in acoustic phonetics, or speech processing in general?
In the next decade or so, we will have to understand more about these phonological/phonetic changes that occur when we speak in conversational speech. We are getting to the point where we have exhausted the study of individual speech sounds or simple utterances. We now want to move into more conversational speech, where the sounds that we generate and the ones that we hear in normal conversation have been modified quite a bit. In other words, the listener perceives only a fragment of the original sounds, which occur rapidly. What the listener picks up on is only some of the sound, because of redundancy, and because the listener might know something about the topic that is being discussed. It is this area that we will have to work on. Up until now, acoustics and signal processing people have been the major components in speech research. To proceed further, we have to involve people from other disciplines more than we have in the past.
What do you like most about RLE?
The thing I like most is the proximity of colleagues who are in fields related to mine, and who are among the very best in the world - people who really understand hearing, people who understand linguistics and acoustics - and to interact with those people and with such very good students. That’s what makes the place exciting.