Can you teach a computer to talk? With Kelly Davis
Updated: Dec 12, 2021
In this episode I chat with the head of the Mozilla Common Voice project, Kelly Davis. Kelly is a trained physicist who studied string theory before getting hooked on artificial intelligence in the late ‘90s. We talk about AI art collectives, his fascination with creating an intelligent agent, project Common Voice, and a little about the future of humanity.
Dustin Driver: Hey Siri, what’s the weather like?
Siri: It’s currently raining and 44 degrees in Portland. Temperatures will be fairly consistent, averaging about 44 degrees.
Dustin Driver: Who’s the tallest man in the world?
Siri: Check the display on your iPhone for the answer.
Dustin Driver: Okay, it says the tallest living man, at 8’3” tall, is Sultan Kösen of Turkey. He was born in 1982. That’s pretty good. Just six years ago I couldn’t have done that. There was no Siri, no Alexa, no Cortana. Now voice recognition is everywhere, and it’s all thanks to AI, but AI doesn’t learn language the same way we do. A baby can learn a new word after hearing it just a few times, but AI needs to hear a word hundreds of times, spoken in different ways by different people, before it can really recognize that word.
Dustin Driver: Big companies have recorded millions of hours of audio to get their voice systems to work. If you want to build your own voice system, you either have to buy it from somewhere, which is expensive, or record it yourself, which can take forever. That’s why Mozilla launched Project Common Voice.
Dustin Driver: Mozilla Common Voice is a project to open up voice recognition data to everyone. For those who don’t know, Mozilla is a nonprofit behind the Firefox browser and, full disclosure, also my employer. But here’s how Common Voice works.
Dustin Driver: People from all over the world read text at the Common Voice website, which records the voice data. It’s all volunteer, and it’s all open source. Mozilla is also using that data to train its own voice recognition AI, which will also be open source and free for anyone to use.
Dustin Driver: In this episode, I talk with head of Common Voice project Kelly Davis. Kelly is a trained physicist who studied string theory before getting hooked on artificial intelligence in the late ‘90s. We talked about AI art collectives, his fascination with creating an intelligent agent, Project Common Voice, and a little about the future of humanity.
Dustin Driver: I apologize for the sound quality. Kelly is in Germany, and our conversation was recorded with a conference room mic, so there’s a bit of an echo. Without further ado, here’s my interview with Kelly Davis of Mozilla.
Kelly Davis: I’m Kelly Davis. I’m the manager/technical lead of the machine learning group of Mozilla. I’m working on various different projects including Deep Speech and speech recognition, Common Voice, open data sets of speech, and also other projects in terms of speech synthesis and natural language understanding, in particular automatic summarization and things like this.
Dustin Driver: What inspired you to get into artificial intelligence and voice recognition?
Kelly Davis: Strangely enough, I think it first started around the 1990s when I was working at startups in Washington DC. At this time, myself and a few other of my colleagues didn’t really realize that a lot was changing in the realm of computers understanding what people can say, and being able to transcribe that into text and/or heard speech. We didn’t really realize that at this point. The technology was there but not mature, I’d say. Instead of really trying to recognize this, instead of actually trying to found a startup, we decided to create an art collective, which is sort of a strange, backward way of doing things.
Kelly Davis: We created this group called Sentient Art, and what we did is create installations that used machine learning and artificial intelligence, in particular neural networks, to actually interact with the gallery goers. That was the first real exposure to neural networks and machine learning that I had. It was sort of a backwards way of doing things and a roundabout, strange way of approaching the topic, doing installation art. That was the first project I had with machine learning and speech recognition and various other things in this realm.
Dustin Driver: You were studying string theory when you first ran into this, and the origins of the universe itself. What was it about voice recognition and AI that pulled you away from studying physics and string theory?
Kelly Davis: I don’t know if it pulled me away from it. It was a secondary interest, I’d say. Other than being fascinated with the universe and with our place in the universe and how the universe works, in some sense of the word “work,” it’s fascinating to me to think about actually creating -- I’d say an “agent” is a very non-biased way of saying a sort of entity that’s able to converse and think and talk -- it’s very fascinating to think of this.
Kelly Davis: One of the things that probably motivated me, at the time and even now, is a book by Richard Powers called Galatea 2.2. The title comes from a myth. There’s a sculptor, and he created a sculpture that was actually brought to life by the gods. It’s a similar thing if you think about creating an agent that is able to converse and talk with you, bringing something to life. The book really covers, to some extent, the trials and tribulations involved in creating an agent that’s able to converse, and how this agent learns of our world and learns of our foibles, when coming to life.
Dustin Driver: So, your first encounter with something like this was the art collective?
Kelly Davis: Yeah, strangely enough. It’s a strange way to approach it. At the time though, in the late ‘90s, artificial intelligence and neural networks and deep learning -- I guess deep learning wasn’t even really in existence as a term at that time -- there was a lot of promise there for what it could do. However, it wasn’t delivering on all this promise at the time. It was very much a nascent technology then, and it wasn’t to the state where it is now. Speech recognition in particular now is not solved, but it’s more of a technology problem as opposed to a research problem. In the last intervening twenty years, a lot of work has gone into improving speech recognition in particular.
Kelly Davis: Now there’s still open research problems, in particular around understanding language. It’s much more complicated, I’d say, than understanding speech. It isn’t a relatively simple problem where you can have audio and translate this audio into text. Understanding what someone says is -- I laugh, because, I think, even if I’m talking to someone in person, how do you know that they’ve understood what I’ve said? Even codifying that is a hard thing to do, and trying to do that with a computer, which adds another level of difficulty, is a much, much harder thing to do.
Dustin Driver: Yeah. I think that gets to the root of one of the questions that I posed, which is the challenges with voice recognition. Tell me about what was going on in this. I hate to dwell on this, but the art collective, I imagine you were experimenting with speech-to-text. What exactly was taking place? What was the art project?
Kelly Davis: We had several different things we were doing. I’ll give two examples. At the time, we were thinking about privacy and how privacy is being breached a lot by technology, so one of the pieces we did was essentially this installation that allowed you to -- you’d come there, it would ask you simple questions about your name, where you live, and things like this. Then it would go on the web and actually create a profile of more information it would glean from the web about you, just given your name and approximately where you live.
Kelly Davis: The thing was, it wasn’t always accurate, and that was part of the point of it, the idea that this entity could be obtaining information about you that wasn’t always accurate. And these particular decisions, [bank decisions, loan decisions], things like this, could be made on such incorrect information. That’s one piece we were doing.
Kelly Davis: Another piece we were working on came down to creating a kind of agent, but at the time you couldn’t really do this, so the idea was you would be talking to a child, and talking to the child about experiences the child had. The child was a survivor of World War II. This part of the installation was basically creating a neural network, which was something like a sequence-to-sequence model at the time. Basically creating a neural network that you could actually talk to, but because the technology wasn’t really mature, one of the reasons we chose to simulate a child was that you could get away with having not-perfect speech, get away with having slightly incorrect answers, because it was a child and you would expect that for a child. Whereas now most agents, if they were to incorrectly understand what you said or give the incorrect answer, it would be viewed as a failure as opposed to, oh, that’s what one expects from this type of entity.
Dustin Driver: It sounds like you were way ahead of the time. Back then, I remember the only speech recognition or speech-to-text game in town was Dragon NaturallySpeaking, which I remember being not very good and extremely expensive. The only people I knew who really used it were lawyers who had to dictate a lot, and they had limited, hit-or-miss success with it.
Dustin Driver: But now something like Dragon is not even on the radar. We’ve come so far. It’s insane to think of a world that didn’t have such great speech-to-text, even though that world was only a few years ago.
Kelly Davis: It’s amazing the amount of progress that has occurred in the intervening twenty or so years. One of the things I think is interesting is that, especially within the last ten-ish years, the progress is compounding. You can think about it in terms of compound interest in that progress begets progress begets progress, and the rate of progress is accelerating. That makes it hard to predict what will come in five years or ten years. Things are accelerating to the extent where twenty years ago predicting five years in the future was a much easier task, whereas now, five years in the future is much further away, if that makes any sense.
Dustin Driver: It’s the exponential growth. So, you became fascinated with creating this agent, this entity that you could talk to and could respond to you into the computer. From there, you went on to found another startup that was along the same lines, and eventually working now with Mozilla on the Project Common Voice.
Dustin Driver: Tell me about that startup. I believe it was called 42. I’m sure anyone who’s going to be listening to this knows the significance of that number as being the answer to everything -- Life, the Universe, and Everything. Obviously you’re a big Douglas Adams fan, which is fantastic. So, tell me about that. Is that another step in creating that agent?
Kelly Davis: It is, in a way. The time we started the startup was around 2011. There was work that IBM was doing, in particular work on the original Watson -- this computer that was able to answer general knowledge questions and actually got a place on Jeopardy, which is a TV show in America where people are asked questions about relatively obscure or not obscure topics and win money as a result of this. IBM’s Watson was on this TV show and actually won against various other champions, who had won against other humans previously.
Kelly Davis: Because of the architecture and because of the research that IBM did around Watson, we were motivated at the time to actually use or re-use or build upon this research, to create an agent that was able to answer general knowledge questions, but it would base these answers on web data.
Kelly Davis: Basically, you’d ask it a question. This could be, “what’s the national bird of the US?” Essentially, what it would amount to is it would do a Google search, Bing search, or whatever search of the web, and find web pages which had relevant information, whatever that might mean. It would maybe look up the terms “America,” “bird,” “national,” and would find various different web pages which had these words, and it would scan these webpages to find out possible answers for the question.
Kelly Davis: It would do this in parallel in various different ways, so it would have sub-agents that would actually try to pull out answers in various different ways. The way it worked, more or less, is it had fifty to a hundred different sub-agents that would each be able to find answers of particular types, and then, at the very end, it would essentially vote.
Kelly Davis: So, these fifty or hundred agents would actually read these webpages that it found, and one would say, “oh yeah, it’s an eagle,” and another would say, “no, it’s the Baltimore Oriole.” At the very end of this process, these fifty or a hundred agents would vote on what the correct answer was, and it would return answers based upon this voting of these various different agents.
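The voting step described here can be sketched in a few lines of Python. This is only an illustration of majority voting, not the startup’s actual code: the `vote` function, the lowercase normalization, and the list of sub-agent answers are all assumptions made up for the example.

```python
from collections import Counter


def vote(candidate_answers):
    """Return the most common answer and the share of votes it received."""
    if not candidate_answers:
        return None, 0.0
    # Normalize case and whitespace so "Bald eagle" and "bald eagle" count as one answer.
    tally = Counter(answer.strip().lower() for answer in candidate_answers)
    winner, count = tally.most_common(1)[0]
    return winner, count / len(candidate_answers)


# Hypothetical outputs from five sub-agents that each read different web pages.
sub_agent_answers = [
    "Bald eagle",
    "bald eagle",
    "Baltimore oriole",
    "bald eagle",
    "turkey",
]

winner, share = vote(sub_agent_answers)
print(winner, share)
```

A plain majority vote treats every sub-agent as equally trustworthy. A system like the one described could just as well weight each vote by how reliable that sub-agent has been, but the interview does not say which approach was used.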
Kelly Davis: The agents themselves were kind of individually dumb, but their collective intelligence made the whole thing relatively smart and able to answer questions which are surprising in some ways. To give a concrete example, one of the stupid questions we would ask, just in terms of testing, we would ask it things like, “who’s the tallest man in the world?” Things like this.
Kelly Davis: At one point, I remember it giving the answer. I don’t remember the guy’s name now, but there’s a band called The Tallest Man in the World, or something like this. The band consists of one member, and it gave this guy’s name. The first time I saw that, I thought, what’s wrong with this thing? Why is it saying this? I think it’s a Turkish guy who’s actually the tallest guy in the world, but when you realize the logic of why it did that, it made sense. The answer was correct in some sense of the word, but it’s sort of surprising. It was interesting things like that which surprise you but also entertain, in a way.
Dustin Driver: There’s a sort of childlike response that sometimes the early AI would give you in return. That leads into what we were saying earlier about the challenges of understanding human voice, in that for human beings, even, it’s hard to explain to someone else that you understand them, and yet for the human mind it’s very easy to understand, especially your native language. It’s possible to learn and understand other languages, and it seems to be something that we do very naturally and very quickly, but a computer, even a network of computers, there’s no context for it. It’s completely foreign. Can you explain the main challenges to teach a computer how to understand what we’re saying?
Kelly Davis: I don’t even know if I can do that. Having a computer understand is an unsolved problem, and I don’t think people really know how to do that.
Dustin Driver: Maybe not understand in the sense of ‘I understand what you’re telling me,’ or having a consciousness, but just the mechanics of being able to recognize a word when it’s spoken with different accents and spoken by different people with different pitch levels of voice. It’s something we can pick up very easily. If I say “mountain,” and someone else says “mountain,” it could sound almost like a completely different word, nonsensical if you were to graph the sound wave. But for a person it’s very easy to understand that’s the word “mountain.”
Dustin Driver: That’s kind of what Project Common Voice is getting at, is providing that language set for the computer to be able to understand.
Kelly Davis: The word “understand” is always a slippery slope for me. I’d say the use of a very concrete -- let’s say, teaching a computer how to translate speech audio into text. That’s a much more concrete problem. The core of any algorithm that translates speech audio to text is basically learning by seeing a whole bunch of examples. The core of what one does, the first thing one does, is you have to collect a lot of data. By a lot of data, I mean somewhere in the order of a year of continuous audio of people speaking and associated transcripts of that.
Kelly Davis: What one does is, assuming one has the right algorithm for the moment, you feed this speech data to the correct algorithm. You basically say, here’s some audio and here’s the transcripts for the audio, and the machine initially just initializes some kind of random state, and it slowly learns by looking at hundreds of thousands of examples. It slowly learns that these particular audio features may correspond to the word “mountain”.
Kelly Davis: I don’t claim to know, and I don’t think anyone claims to really know at this point, what features the neural networks pull out of there. It has to learn from these examples what the word “mountain” actually looks like in terms of audio features. Basically, it’s given the audio, it may take something like a Fourier transform of the audio, may do slight variants on this Fourier transform, and then from this information, it learns what features to look for in this feature space that correspond to the word “mountain,” or “tree,” or “water,” or “sea.” And it only learns that through repetition and repetition and repetition.
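One common way to turn raw audio into features a network can learn from is a short-time Fourier transform, which measures how much energy each frequency carries in each short slice of the signal. The sketch below is a minimal pure-Python illustration under stated assumptions: the frame size, hop, sample rate, and the synthetic 1000 Hz tone standing in for recorded speech are all invented for the example, and it is not a description of Deep Speech’s actual feature pipeline.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of each frequency bin for one frame (naive O(n^2) DFT)."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes


def spectrogram(samples, frame_size=64, hop=32):
    """Slice the signal into overlapping frames and transform each one."""
    return [
        dft_magnitudes(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop)
    ]


# Synthetic stand-in for speech: a pure 1000 Hz tone sampled at 8000 Hz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
features = spectrogram(tone)

# Each bin spans SAMPLE_RATE / frame_size = 125 Hz, so a 1000 Hz tone peaks in bin 8.
loudest_bin = max(range(len(features[0])), key=lambda k: features[0][k])
print(len(features), len(features[0]), loudest_bin)
```

Real speech produces a much messier picture than a single tone, with energy spread over many bins that shift from frame to frame. Grids of numbers like these are what the network sees in place of the raw waveform.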
Kelly Davis: For example, for our speech recognition engine, our current models are trained on about three thousand hours of audio, and we have to run that through the system continually for about two weeks before we get a reasonable model out of it. The key part of it in terms of the algorithmic part of it, and also in terms of the data part of it, is repetition of audio examples. And that’s one of the important reasons Common Voice needs lots of data, is because the system itself needs to see a variety of examples so it can tease out exactly what you were talking about in terms of what characterizes the word “mountain” when twenty different people speak, or a thousand different people say it. There’s some kind of core thing which is there, and it’s hard for us, even as people. I know I can hear a person say “mountain” and know what it is. I can hear a man say it, a woman say it, a child say it, and I know the word. I know they said “mountain.” But if you ask me how do I do that, I couldn’t answer that.
Kelly Davis: It’s strange, because when we’re teaching these computers to actually understand the word “mountain” or “tree” or “fish” or “water,” when the computer is done, we’re kind of in the same state that we are as humans. You could ask me, how did it understand, what did it see in this audio that it knew that is the word “mountain?” I’m not sure. I know it knows that information, because you can test it. You can give it different people saying the word “mountain,” and it gets it correct, but what is it pulling out from this audio? What features it sees in the audio that characterize the word “mountain,” I don’t think I know.
Kelly Davis: Strangely enough, I think that’s only recently become a research topic that people are concerned with: explainability of machine learning. This is only really in the last couple of months that people have really started looking at this. I think it’s because machine learning is becoming more and more prevalent in decision making. For example, automated driving systems, where you may need to know, why did this thing decide to turn right? Or it took a left here, why did it swerve left instead of right here? We need to know these kind of things for, if nothing else, legal reasons. People are starting to research that now, but I think it’s still very much a new field, explainability in machine learning.
Dustin Driver: It’s the black box problem. What’s going on inside the black box? An interesting point that you’re talking about is all the different data that a voice recognition system needs to work, and traditionally, a lot of that data has been proprietary. There are large data sets, millions of hours of recordings, but it’s owned by specific companies, so Project Common Voice is an effort to replicate that data but make it all open source so anyone can use it in their own voice recognition applications.
Kelly Davis: Yeah, that’s one of the big problems in open-source speech recognition. The big problem, I’d say, is a lack of open data that allows you to train your own speech recognition engine. Right now, in terms of speech recognition in particular, there’s a handful of companies that actually control all speech recognition engines that are production-level quality, and there’s no open alternative. In particular, the biggest part is there’s no open data sets that are alternatives to that. Common Voice is aiming to change the state of affairs.
Kelly Davis: Right now, with English you can get some open data, but once you try any other language than English, you can forget it. Even with English, the year of audio you need to train your production system isn’t open. There’s no way to get that much data in an open way, and once you try something like Mandarin, you can completely forget it. You’re not going to be able to create an open speech recognition engine.
Kelly Davis: That’s really what we’re trying to address with Common Voice. We’re trying to open the speech recognition world to the open source world, allowing people to create speech recognition engines in their languages of choice, and create data sets for their languages of choice, and open these data sets to the world.
Kelly Davis: One can think of the case of the web twenty years ago or whatever. Imagine what would’ve happened to the web if a handful of companies held the keys to this gate, if you had to get permission from one of these companies to create a website, and if they gave you permission, then you had to pay one of these companies every time someone visits one of your websites. It sounds absurd, and it would’ve really destroyed the web. The web would not be as prevalent as it is today if that were the case. However, that’s exactly the situation we’re in with speech recognition.
Kelly Davis: Speech recognition is becoming more and more prevalent and more and more relevant to the world, but at the same time, speech recognition is controlled by only a handful of companies. It’s a weird situation we’re in, but Common Voice is trying to address that by creating open data sets that anyone can use, and use to create speech recognition engines.
Kelly Davis: Using Common Voice is actually really simple. You just go to voice.mozilla.org, and you can just read a sentence and record yourself reading the sentence, and that’s contributing to our data set. Alternatively, you can verify someone else’s read sentence. Basically, you see the sentence that they were asked to read, and you hear the audio of them reading it and say, yes, they said this sentence correctly or no, they didn’t say this sentence correctly. Just something as simple as that. You don’t need any huge technical expertise to contribute, but when you contribute you’ll be helping open the speech ecosystem to the world.
Dustin Driver: And opening up innovation as well, like you were saying. Ironically, the web is consolidating right now, but twenty years ago it was extremely open. There was a tremendous amount of innovation, so we want to try to bring that same spirit into voice recognition.
Dustin Driver: We were talking about that black box, and that even after a voice recognition system can, not necessarily understand, but can recognize the particular word, you don’t know how it got there. To me, that’s almost like magic in a lot of ways. You’re feeding this very seemingly simple neural network a ton of information, and it eventually learns, and you don’t know how. That sends a chill down my spine for a number of reasons. It’s not that it’s scary. It’s just amazing.
Dustin Driver: My question would be, how does that work? What does that feel like? The first time you teach a computer how to recognize a word, what does that feel like, and what goes through your head when that happens?
Kelly Davis: When we were first starting Deep Speech, we used a single example sentence and a single example transcription of the sentence. It’s an example sentence from this TIMIT speech corpus. I remember the first time we realized, this is actually working. It was kind of overfitting and not perfect. We were more testing the code than anything else, but at the same time, we realized after five minutes of training, ten minutes of training, it actually recognized what this crazy audio was, and was able to transcribe this into text.
Kelly Davis: It was magic, in a way. Even though intellectually you know, okay, it’s backpropagation there, and it’s propagating this derivative through, and you’re looking at the difference between the desired output and the real output. In some sense of the word, you know the steps involved, in some very low-level way, but then this emergent ability arises from these very low-level derivatives, backpropagation, and things like this.
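The low-level steps mentioned here, comparing the real output with the desired output and propagating a derivative back through the network, can be shown on a toy scale. The sketch below deliberately overfits a tiny network to one made-up training pair, in the spirit of the single-sentence test described above. The network size, learning rate, input values, and target are all assumptions chosen for illustration, and it has nothing to do with the real Deep Speech architecture.

```python
import math
import random

random.seed(0)

# One made-up training pair: three input features and the desired output.
x = [0.5, -1.0, 0.25]
target = 0.8

HIDDEN = 4
LEARNING_RATE = 0.1

# Start from a random state: small random weights for both layers.
w1 = [[random.uniform(-0.5, 0.5) for _ in x] for _ in range(HIDDEN)]
w2 = [random.uniform(-0.5, 0.5) for _ in range(HIDDEN)]


def forward(inputs):
    """Forward pass: tanh hidden layer followed by a linear output."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in w1]
    output = sum(w * h for w, h in zip(w2, hidden))
    return hidden, output


losses = []
for step in range(200):
    hidden, output = forward(x)
    error = output - target  # real output minus desired output
    losses.append(0.5 * error ** 2)

    # Backward pass: chain rule from the output back to every weight.
    for j in range(HIDDEN):
        grad_w2 = error * hidden[j]
        grad_hidden = error * w2[j]
        grad_pre = grad_hidden * (1.0 - hidden[j] ** 2)  # derivative of tanh
        for i in range(len(x)):
            w1[j][i] -= LEARNING_RATE * grad_pre * x[i]
        w2[j] -= LEARNING_RATE * grad_w2

print(losses[0], losses[-1])
```

The loss falls steadily because the network only has to memorize one example. That makes a run like this useful for testing that the training code works, and useless as evidence that the model would generalize to new audio.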
Kelly Davis: From this low-level thing, there’s this emergent behavior. It’s still slightly magical when it first starts happening, even though you intellectually know what’s going on, to some extent. At the same time, the details that emerge, and the abilities that emerge out of these low-level constructs, is still slightly magical, at least the first time it happens. Now, after a year of doing it, you’re kind of like, okay, it’s going to learn, I’m not really worried about it. But the first time it happened, it was very much like, wow, we just built something that actually recognizes speech. This is pretty amazing.
Dustin Driver: I think it does get back to the black box, and even our own minds are sort of a black box. We’re in this strange place where we’re able to create machines that can learn, but we’re not even really positive how we learn. It’s very strange. Ironically, too, there’s talk of actually using the AI that we create to help us understand how our brains work. We’re creating this bizarre loop. We’re going to create this entity that we don’t fully understand to help us understand ourselves.
Kelly Davis: That reminds me, there was some research I only glanced at, not read in any kind of detail. People were looking at brain scans, I think MRIs, of various different people as they look at pictures. Basically, they’d show a person a picture and record the outputs of electroactivity in their brain. They’d show another picture and record the electroactivity, etc. And they built up a data set, where this picture causes this kind of electrical activity, this other picture causes this electrical activity. Once you have this data set of these pairs, you can, in a relatively straightforward way, train a neural network to go the other way, to go from some particular electrical activity to a picture this person would see.
Kelly Davis: What they would do is, with this information, they could then, in some primitive sense of the word, read people’s minds. You could have a person look at a picture, and from this electrical activity that you see as a result of them looking at the picture, you could guess what picture they were looking at. It was still definitely in a primitive state and produced pictures that were blurry and what have you, but it’s strange how, at least in this case, the neural network technology is really not that complicated. All the pieces separately are not complicated, but once you put them together, you have this strange, slightly worrying ability that arises: the fact that, in a very real sense, you can read a person’s mind, which is strange.
Dustin Driver: It’s insane. If you can figure out how to decode the electrical activity of thought, then brain-to-brain communication becomes possible.
Kelly Davis: Yeah. It was strange too, because I’ve been thinking about the lack of interpretability of the neural network itself. To some extent, even with this sort of mind reading ability, it still doesn’t elucidate what’s really going on in our brains. We only know this strange, random pattern corresponds to this image. Why it does, or anything like that, I don’t know.
Dustin Driver: And who knows if we’ll ever truly know. You bring up the picture recognition problem. I know that Google spent an unimaginable amount of time and data teaching its system how to recognize pictures. I don’t remember the exact amount, but to get it to recognize a cat, it had to see millions of pictures of a cat.
Dustin Driver: On the other hand, a two year old, or even a one-year-old person for that matter, you could show them two pictures of a cat, and the child would recognize a cat in any picture you show it from that point on, for the rest of its life. It only takes two, maybe even one picture. And we don’t understand how that works, but there’s obviously some sort of very efficient, fast, amazing feedback loop that’s happening in the brain that allows people to learn so quickly, and with such a limited amount of data, when compared to these systems that we’re building.
Kelly Davis: I completely agree. I don’t think any researcher really thinks this at this point. I don’t think current deep learning/neural network architectures are evolved to the extent that they’re going to evolve. There’s a whole undiscovered territory in the world of deep learning -- many undiscovered territories. Having a single one is probably understating the problem.
Kelly Davis: There are various different undiscovered territories in the world of deep learning and neural networks that have yet to be explored, and things like this, in terms of what we were talking about with natural language understanding -- that’s a very darkly understood area right now. It’s dimly understood as to what one needs to do to understand, in the deepest understanding of “understand” -- which is circular, but -- to understand what language means, there’s this whole undiscovered territory as to how to actually do things like this, and I think researchers know this. The progress one makes has to be, to some extent, incremental. People are not worried they’re going to get to this point where you can say, okay, here’s one example, a cat, and then you can understand the concept “cat” in any image or text you see or hear. But I don’t think we’re there yet. I don’t think we’re to the point where we have brain power computers that actually understand and see the world as a human does.
Dustin Driver: I’ve heard it described as the human mind is very broad. A simple word like “cat” carries with it an image, it carries all the experiences you’ve ever had with a cat, stories about cats, facts about cats, whereas a computer, right now, is very narrow but extremely fast. It’s so much faster than we can imagine, than the human brain can even conceive of, but it’s super, super narrow.
Kelly Davis: One of the things people postulate -- again, I don’t know if it’s true or not -- is that a human-level intelligence requires embodiment in some sense of the word. If they’re going to perform at that level, this agent or computer or what have you has to experience the world as we do, has to live in the world and know water is wet, whatever that means. You can learn that as an abstract.
Kelly Davis: There’s this relatively famous project called Cyc, from the mid ‘80s until now. I think Douglas Lenat is his name. He’s basically been cataloging common knowledge about the world. The amount of time to catalog this common knowledge into an ontology is exorbitant, I would say, because he’s been doing it for the last thirty or so years. However, one could think of a kid, a two or three year old, knows these things just from the ability to exist in this world and test the world out. You spill a glass of water on the floor and you recognize, if I spill this glass of water on the floor, the floor is going to be wet. So, we’re codifying that in some sort of logical, computable way of thinking about it that’s super hard and, in some sense, unnatural.
Kelly Davis: There’s this train of thought that assumes or postulates that to understand the world in a humanlike way, one needs to actually embody the agent. The agent needs to be embodied in a robot, whatever that might mean, that actually explores the world as a kid would.
Dustin Driver: It would need the same amount of inputs as, say, the human mind would have.
Kelly Davis: Yeah.
Dustin Driver: Otherwise, it wouldn’t be able to fully understand. A lot of people are worrying, even Elon Musk, that we could potentially create artificial intelligence that would destroy us all or take all of our jobs. There’s always doom and gloom around AI. On the other hand, there’s this glimmering future view of AI, where people like Ray Kurzweil are hoping that AI will be able to solve all of mankind’s problems, and we’ll be able to create these godlike entities.
Dustin Driver: Where do you stand on that spectrum, or is it too early to tell? I look at a lot of this stuff, and I feel like maybe it’s just not even possible to create a brain in a computer.
Kelly Davis: I guess I’m somewhere in between. I think it’s probably possible to create an intelligence which matches or probably exceeds ours. Is it going to lead to some AI apocalypse? I don’t really think that’s the case. At the same time, I’m not really at one extreme or the other. I’m very much more practical. I try and think like, okay, what are we doing today? What can we expect in six months? Maybe a year is kind of pushing it. Looking at a more limited sense of view.
Kelly Davis: I agree with a lot of people that we should start thinking about the implications of AI and these emerging nonhuman intelligences. It’s definitely worth thinking about the implications that they would have and do have in the world. However, I think there’s no immediate need to worry that AIs are going to take over the world or anything like that. You’ll see that coming before it happens. At the same time, it’s prudent and logical and makes a lot of sense to start thinking about the implications that such things would have in the world. I think we should cheer on anyone that’s thinking hard about this, and thinking about the implications that will occur.
Dustin Driver: I know there’s surveys done of people working in the field, and the vast majority think that it is possible to create something that has humanlike levels of intelligence in a machine. You’re with them on that? You think it’s possible?
Kelly Davis: Yeah.
Dustin Driver: And to have the same sort of general intelligence that a person might have, and the ability to tackle multiple different kinds of problems.
Kelly Davis: That’s an assumption to say that to have general intelligence we need to tackle multiple different types of problems. It could be that we need to tackle multiple different types of problems, it could be there’s some underlying algorithm that has yet to be found -- that the algorithm itself can learn or teach itself to tackle these multiple different problems. The answer is I don’t claim to know. Probably no one knows right now. If they do, they’re keeping it a good secret.
Dustin Driver: It’s possible, which is just fascinating. There’s a new method of machine learning, in that you’re actually having one machine check another machine’s work, and they bounce back and forth between one another and, between the two of them, can create something that’s better than either one of them alone. The rate at which they learn is exponential.
Kelly Davis: There’s an interesting sort of dovetailing on that. There’s an easier way that can be realized within the framework of games, and I think that’s why a lot of people are working on games and Go and chess and things like this, because playing Go or chess or what have you is a complicated problem, whereas the objective function as to what’s good play or what’s bad play is relatively well defined. If this program can beat this program, it’s better, in some sense of the word. And because it’s got a concrete reality check that’s built into the system, one could have a system where you have two different players, they play against one another maybe a hundred games, and if one of them wins, say, 55 percent of the games, then we know this one’s better.
Kelly Davis: Then we can build all future systems off of this better player, create slightly different versions of it, have them play against one another, and have this exact bootstrapping process that you were talking about, that allows you to create a Go-playing agent that can beat any human or any other computer on the planet in a matter of a few days. It can go from absolutely no knowledge of Go to the best Go player on the planet in a matter of a few days. It’s not an exaggeration at all. It took about that long. It’s kind of amazing.
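The tournament-and-replace loop described above can be sketched in a few lines of Python. This is a toy illustration only, not the method any Go-playing system actually used: each "player" is reduced to a single made-up strength number, `play_game` is a hypothetical stand-in for a real game, and the 100-game match and 55 percent threshold are simply the figures mentioned in the conversation.

```python
import random

def play_game(strength_a, strength_b, rng):
    """Toy stand-in for a real game: return True if player A wins.

    Assumption: A wins with probability proportional to its strength.
    """
    return rng.random() < strength_a / (strength_a + strength_b)

def win_rate(challenger, champion, games, rng):
    """Fraction of `games` that the challenger wins against the champion."""
    wins = sum(play_game(challenger, champion, rng) for _ in range(games))
    return wins / games

def bootstrap(generations, games=100, threshold=0.55, seed=0):
    """Repeatedly build a slightly different player and keep it only if it
    beats the current best often enough. Returns the champion's strength
    after each generation."""
    rng = random.Random(seed)
    champion = 1.0
    history = [champion]
    for _ in range(generations):
        # A "slightly different version" of the current best player.
        challenger = max(0.01, champion * rng.uniform(0.7, 1.5))
        if win_rate(challenger, champion, games, rng) >= threshold:
            champion = challenger  # all future versions build on the winner
        history.append(champion)
    return history

if __name__ == "__main__":
    strengths = bootstrap(generations=50)
    print(f"start: {strengths[0]:.2f}, end: {strengths[-1]:.2f}")
```

Because a 100-game match is noisy, a weaker challenger can occasionally clear the threshold by luck in this sketch, which is one reason a real system would want a clear margin rather than a bare majority.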
Kelly Davis: You were saying earlier that maybe we have to solve multiple problems to obtain general levels of intelligence. I mean, maybe we don’t, maybe we do. I really don’t know the answer to that, but if one could formulate some kind of path to general intelligence in such a constrained way -- okay, now these agents can play against one another, and we know this one’s better, because it exhibits some better knowledge of general intelligence, whatever that might mean. I have no idea what that means. But if you could formulate the problem in that way, then it would be conceivable that one could quickly evolve, teach, create, an agent that has high levels of general intelligence, but at the same time I don’t know what this sort of competition would look like. I don’t know what that would really mean in any sense of the word. Maybe someone knows, but I don’t claim to.
Dustin Driver: It’s an evolutionary process, I think, as well. That’s a lot of the processes that lead to these algorithms or these systems like AlphaGo. And AlphaGo is a unique one. As far as games, I understand that, famously, the world’s best chess player was beaten by a computer long ago. With chess, the way that worked is it was brute forcing it, right? Basically, every time the human player would make a move, the computer would calculate every single possible move that could be made into the future, which you can do with chess. And it brute forces its way into victory.
Dustin Driver: But with Go, there’s a difference in that the game has so many possible moves that it’s impractical to use that approach, so it had to develop a sort of intuition. Is that correct?
Kelly Davis: Yeah, I think that’s a reasonable reading of how they approached the problem. For Go, I think there’s around two hundred different moves you can make, and just as you said, the techniques people used for chess simply didn’t work. It’s called a branching factor. When the branching factor is two hundred, when the branching factor is so high, this old technique that people used for chess simply didn’t work for Go. They had to essentially create new techniques, and these new techniques are really interesting, in a way.
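To see why the branching factor matters, here is a rough back-of-the-envelope calculation in Python. The number of move sequences grows as the branching factor raised to the look-ahead depth. The Go figure of 200 is the one mentioned above; the chess figure of about 35 moves per turn is a commonly quoted average and is an assumption here, not something from the conversation.

```python
def tree_size(branching_factor, depth):
    """Number of move sequences of length `depth` when every position
    offers `branching_factor` choices."""
    return branching_factor ** depth

CHESS_BRANCHING = 35   # assumption: commonly quoted average, not from the interview
GO_BRANCHING = 200     # "around two hundred" per the interview

if __name__ == "__main__":
    for depth in (2, 4, 6):
        chess = tree_size(CHESS_BRANCHING, depth)
        go = tree_size(GO_BRANCHING, depth)
        print(f"depth {depth}: chess ~{chess:.2e}, Go ~{go:.2e}, "
              f"ratio ~{go / chess:.0f}x")
```

Under these assumed numbers the gap between the two games widens with every extra move of look-ahead, which is the sense in which exhaustive search becomes impractical for Go.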
Kelly Davis: What they did was train a neural network to look at the Go board as like an image. They used a lot of the same neural network architecture that they use to look at images. They used these to have the neural network look at the board of Go, and from this it would calculate possible future moves. This move is good with this probability, this move is good with this probability. It wasn’t doing it in a kind of, “I’m going to look ten moves ahead and figure out if I do this, if I do that, then this happens” -- the core of the algorithm wasn’t really doing that. The core of the algorithm was going from a picture of the board to “I think this is a good move.” In terms of thinking moves ahead, they didn’t do that in the core neural network architecture.
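The idea of "this move is good with this probability" can be illustrated with a softmax, a standard way of turning raw scores into probabilities. The sketch below is an assumption-laden toy: the move names and scores are invented, and a real network would produce the scores from the board image rather than from a hand-written dictionary.

```python
import math

def softmax(scores):
    """Turn raw move scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {move: math.exp(s - peak) for move, s in scores.items()}
    total = sum(exps.values())
    return {move: e / total for move, e in exps.items()}

# Hypothetical raw scores for three candidate moves (made up for illustration).
raw_scores = {"D4": 2.0, "Q16": 1.0, "K10": -1.0}
move_probs = softmax(raw_scores)
best = max(move_probs, key=move_probs.get)

if __name__ == "__main__":
    for move, p in move_probs.items():
        print(f"{move}: {p:.2f}")
    print("highest-probability move:", best)
```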
Kelly Davis: What they did is stick another layer on top of that, which is a variant of a Monte Carlo tree search. That did the look ahead, but it really wasn’t a look ahead as deep as what the IBM team did for chess. It was a much more shallow look ahead, and the reason they could be much more shallow is because the core neural network had this intuition of board positions.
Kelly Davis: It learned this intuition more or less from self-play. What it would do is play itself at various sorts of games, and it would store these games that it played. Because of this Monte Carlo tree search algorithm sitting on top of this neural network, the games that were played had stronger play than could be given by only using the neural network alone. They could use these saved played games to train the neural network itself and make the neural network itself much stronger.
Kelly Davis: When they took this neural network itself and plugged it into the Monte Carlo tree search, this combination was stronger than the previous versions of the system. They could essentially bootstrap the system, where it would learn from itself and learn from itself and learn from itself. Each time it played a game, it would be stronger, because it would essentially have a stronger neural network, and a stronger neural network could be made even stronger by using this Monte Carlo tree search. And then the saved games would be better than the games the neural network could play. It bootstrapped itself to pull itself up from essentially nothing, essentially random initialization, to this ability to play better than any human has ever played, within days.
Dustin Driver: That’s really incredible. I do have to go back -- you called it a Monte Carlo tree?
Kelly Davis: A Monte Carlo tree search. It’s a way of looking at possible moves, and looking at possible counter-moves, and counter-moves to those counter-moves. What it does is allow you to do this in a way where you don’t have to necessarily explore all possible moves. You can explore some subset of moves, and in exploring the subset of moves you realize which moves are better moves. Better moves are explored in more detail. In the process of exploring these moves, it kind of learns to ignore the ones that are not possible or not really good moves, and it learns to concentrate its effort on these other moves that have a larger payoff.
Kelly Davis: It learns that in the process of doing this tree search, and as a result of that, it doesn’t have to explore all possible moves in a dumb way -- not a dumb way. “A dumb way” is overstating it. In a way that, say, the IBM chess computer would have to explore more systematically all moves.
Kelly Davis: It gets this kind of intuition -- is not even too strong of a word to use -- intuition as to what set of moves are very good moves, and then it explores those in more detail and then ignores the ones it finds are not profitable. Because it can concentrate its computational effort on profitable moves, it’s able to more quickly learn and more quickly become a really strong player.
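For readers who want to see the shape of a Monte Carlo tree search, here is a minimal sketch in Python. It is not the Go system discussed above: it has no neural network, uses plain random playouts, and plays an assumed toy game (a single pile of stones, take 1 to 3 per turn, whoever takes the last stone wins). The names `Node`, `rollout`, and `mcts_best_move` are hypothetical, chosen for this illustration.

```python
import math
import random

MAX_TAKE = 3  # toy rule: remove 1 to 3 stones; taking the last stone wins

def legal_moves(stones):
    return list(range(1, min(MAX_TAKE, stones) + 1))

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones              # stones left for the player about to move
        self.parent = parent
        self.move = move                  # move that led here from the parent
        self.children = []
        self.untried = legal_moves(stones)
        self.visits = 0
        self.wins = 0.0                   # wins for the player who moved INTO this node

    def best_child(self, c=1.4):
        # UCB1: average payoff plus a bonus for rarely explored moves.
        return max(
            self.children,
            key=lambda ch: ch.wins / ch.visits
            + c * math.sqrt(math.log(self.visits) / ch.visits),
        )

def rollout(stones, rng):
    """Random playout. Return True if the player to move at `stones` wins."""
    to_move_wins = False        # with 0 stones left, the player to move has lost
    start_players_turn = True
    while stones > 0:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            to_move_wins = start_players_turn
        start_players_turn = not start_players_turn
    return to_move_wins

def mcts_best_move(stones, iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        node = root
        # 1. Selection: follow promising moves through fully expanded nodes.
        while not node.untried and node.children:
            node = node.best_child()
        # 2. Expansion: add one move that has not been tried yet.
        if node.untried:
            move = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: the player who moved into `node` wins
        #    exactly when the player now to move loses the playout.
        mover_wins = not rollout(node.stones, rng)
        # 4. Backpropagation: flip the result at each level up the tree.
        while node is not None:
            node.visits += 1
            if mover_wins:
                node.wins += 1
            mover_wins = not mover_wins
            node = node.parent
    # The most-visited move is the one the search concentrated on.
    return max(root.children, key=lambda ch: ch.visits).move

if __name__ == "__main__":
    print("suggested move from 10 stones:", mcts_best_move(10))
```

The exploration bonus in `best_child` is what lets the search spend most of its effort on moves with a larger payoff while still occasionally checking the others. In this toy game the known winning strategy is to leave the opponent a multiple of four stones, so one would expect the search to drift toward such moves as the iteration count grows.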
Dustin Driver: It seems like a much more natural way to learn and play a game, kind of close to the way a person would learn.
Kelly Davis: I agree.
Dustin Driver: That’s very fascinating. So, what does your idea of a successful agent look like? Does it look like Jarvis from Iron Man? We already have the breadth of human knowledge in our pockets, in the cell phone, but you can’t talk to it. You can’t ask it to do things. What is your dream of the perfect agent? Is it the sci-fi dream of, you know, “hey, make my coffee and then make sure that my trip is booked on time.” This sort of agent running around in the computer world doing things for us.
Kelly Davis: Understandably so, in terms of technology and the limits of current technology, current agents are really limited in a way. They’re more geared towards very limited actions like, “hey, what was the baseball score last night?” It could tell you this. Whereas my view as to where agents can be, they should be able to do the current things that agents are doing in this very limited realm of, “hey, what’s the baseball score?” But also they should be, for lack of a better term, kind of human. You should be able to hold conversations with them and actually have them -- again, to use this word we’ve been using -- understand what you’re saying. It could be understanding in the sense of, I would think, “oh, this thing understands me in the same way that a human understands me.” I think getting there -- I don’t know if anyone really knows how to do that right now.
Kelly Davis: My ideal agent should be able to be a companion too, that you could actually talk to as you would talk to a human, and you would treat this agent, in any real sense of the word, as a human, at least in the way you would talk to it and expect it to treat you. That’s where I would like things to be.
Kelly Davis: I don’t know if you’ve seen the movie Her, but the way the agent -- I guess it’s an OS in the movie -- interacts with people. It interacts, at least initially, on a very human level. It’s very human. Maybe it doesn’t embody the experiences of a human, but it is human in a real sense of the word.
Dustin Driver: She has a wonderful sense of humor, she gets him right off the bat. That’s a great movie. I love that movie Her. I think it explores AI in a really great way, and I think it makes an excellent point, too. The way the movie ends is poignant.
Dustin Driver: I think we’ve gone overtime. I know you have a busy day ahead of you. Thank you. This is really fun, and I love being able to talk to an expert who’s been in it since the very beginning. It’s a treat. It’s a pleasure.
Dustin Driver: The thing to understand is that AI learns at an exponential rate, so while six years ago there was no Siri, tomorrow Siri could be as smart as me, probably even smarter given my background. A question remains, how smart will AI get? And will AI take all of our jobs? I don’t know. Even if AI does take all of our jobs, is that a big deal?
Dustin Driver: I have a sneaking suspicion that if we do create anything that’s smarter than us, it’ll quickly realize that most of our jobs are pretty pointless anyway and tell us to just go play in a field. That’s my hope anyway.
Dustin Driver: Thanks for listening. I really appreciate it. I’m constantly looking for new people to chat with about science and technology. I’d appreciate it, if you know anybody who wants to have an engaging and fun conversation, just send them my way. My email is firstname.lastname@example.org. You can learn more about me at my website, dustindriver.com.
Dustin Driver: You can learn more about Project Common Voice at voice.mozilla.org. I’ll put a link in the show notes, but you can pop over there and contribute your own voice to Project Common Voice. Go ahead, help the future of humanity or humanity’s descendants learn how to recognize voice. Why not, right?