Episode 8

34 minutes 47 seconds Jun 18th, 2024

Defining Intelligence

AI pioneer Mustafa Suleyman has been at the forefront of the technology through several major leaps forward. As the co-founder of DeepMind, Google’s VP of AI Policy and Products, the co-founder of Inflection, and now the CEO of Microsoft AI, he’s seen AI’s evolution into the transformative tech it is today. Now, how do you train AI to behave like humans want it to?

Mustafa Suleyman

Microsoft

 

Seth Rosenberg:

Hi, I'm Seth Rosenberg. I'm a partner at Greylock and the host of Product-led AI, a series exploring the opportunities at the application layer of AI. Mustafa, thanks so much for joining.

Mustafa Suleyman:

Hey Seth, good to see you again. Thanks for having me.

Seth Rosenberg:

I'm very excited to have Mustafa here. As many people in the audience will know, Mustafa is one of the leading pioneers in AI. He's currently the CEO of Microsoft AI and a venture partner at Greylock, former co-founder of Inflection AI, as well as co-founder of DeepMind, which was of course acquired by Google, where he became VP of AI.

So Mustafa, I feel very privileged to be able to work with you through Greylock, but also to have you here to share your take on the current space, which is evolving so quickly.

So maybe to kick it off, I'm curious to hear more about how you decided to focus your career on AI before it was obvious.

Mustafa Suleyman:

It's kind of strange looking back on it. I had to write my TED talk recently and I reflected on how crazy things are 15 years later.

When we first started talking about founding DeepMind in 2010, it's hard to overstate how weird we were. Often people say being an entrepreneur is choosing to obsessively work on something that is really contrarian and attempting something that everybody else thinks is impossible. I think in our case people didn't just think it was impossible – they thought it was completely absurd. I honestly am not quite sure how we came to have so much faith in our ability to try to do something so out of distribution and exceptional.

Seth Rosenberg:

What was the initial insight around DeepMind?

Mustafa Suleyman:

I mean, we didn't just start working on AI or machine learning. We were fully committed to working on artificial general intelligence, producing a system that could exceed human capabilities and knowledge at all levels. And the reason we were motivated to do that is because we genuinely wanted to use AI to solve other problems and to make the world a better place.

There was no academic lab or setting that could accommodate the scale of investment that we thought was necessary at that time. In academic labs, there wasn't a focus on large-scale engineering. There certainly wasn't a focus on products, and even among the kind of big national project investments that you get from working in a government, there was just nothing that was a technical effort to try to understand what intelligence was at scale and deploy it against important problems. And so it was really only startups that were as brave and courageous, I think, as was necessary to be successful in this mission. By that point, this was my third effort at a company, and it just felt like the only vehicle, because you learn so much along the way by screwing it up and figuring out how to do it right.

I had come off the back of a bunch of years working in the nonprofit sector, in governments, in conflict resolution and facilitation, and had started two small companies – actually, one was focused on selling networking equipment and electronic point-of-sale systems to restaurants. But this was way before that was possible, and that was a failure.

And the thing I had realized is that we need more knowledge and insight in our world to help us address the overwhelming complexity of our systems. It's so difficult to make an intervention into a complex social system today like an economy or a food production system or the financial system and be confident that that intervention is going to make the impact that you think it is, and that's really, I think, one of the reasons why we need amazing AI. We need to be able to make good predictions about complexity in our world in order to create value and to change that world to help people live healthier and better lives. It sounds super cheesy, but that is really what was motivating us back then and still does.

Seth Rosenberg:

I'm curious, so you had this mission that unlocking abundant intelligence for the world would actually solve problems that matter. How do you actually define intelligence in this context?

Mustafa Suleyman:

The thing that gave me confidence that we might actually be able to make progress towards inventing intelligence was our third co-founder, Shane Legg, who had spent his entire PhD researching various definitions of intelligence and trying to aggregate them into a single metric – one that we could use to turn the science of intelligence and the neuroscience of biological intelligence into an engineering effort, and really make it a measurable, quantifiable exercise. And the definition that he came up with was the ability to perform well across a wide range of environments.

So again, emphasizing generality, and that was major. Now everyone takes the AGI part for granted, as though G is the central part of intelligence, but that's an assumption. Generality happens to be one of the characteristics of intelligence, but it's not the only important characteristic, and it also turns out to be very hard to measure and to boil down to something that you can really grasp.


Whereas another sort of definition was the Turing test, of course: that a system would be intelligent if it could deceive a human into thinking that it was itself a human during natural conversation. And in some ways we've crossed that threshold. We have systems now that are really very good at conversation (and at least for a few turns certainly better than humans in many respects). You can still tell that it's an AI or a chatbot and not a human, but in a few years' time you really won't be able to tell. And yet that doesn't really tell us anything about whether these systems are actually intelligent. Every time we cross a threshold in terms of a benchmark or a milestone in AI, you then sort of turn round and go, okay, well here are all the problems with that mechanism of measurement, and here's the next thing that we need to measure.

Seth Rosenberg:

I've heard Reid say that AGI is the AI that we don't yet have. It's kind of always pushed further into the future.

Mustafa Suleyman:

Exactly, and it's like the constant carrot that we dangle ahead of ourselves to chase.

So, another measure that I have proposed is that we should focus more on the capabilities of the system – the actions, the things that it can do, the things that we can observe, the impact it can have in some environment – rather than this kind of abstract idea of, ‘is it general,’ or ‘is it good at having a conversation?’ That would basically be: can it produce human-quality labor in a practical environment and actually earn money for that, or could it write software, for example? It's a very measurable thing. I sort of called that a modern Turing test and said that in the next five years a system would be able to take a very abstract goal, like go create a new product, get that designed, manufactured, drop-shipped, and then distributed and marketed, and try to earn a profit from it – and then you could measure that profit in terms of making a million dollars or something like that. I think there's going to be a system that could actually do that, certainly, before 2030.

Seth Rosenberg:

Yeah, that's amazing. And it's possible. Do you expect that that type of system would actually trade off on generalizability and actually be built for a specific use case?

Mustafa Suleyman:

Yeah, I think it's more likely that we'll have really powerful systems that are specialized for specific use cases, with real deep domain expertise, than that we'll have this sort of very general-purpose system that can switch from being a marketer to a clinician to a lawyer, whatever. I mean, clearly the general case is going to come afterwards.

Seth Rosenberg:

So I want to spend a moment just getting your take on the state of large models today. Maybe you can start with some level-setting for the audience: what was the inflection point that led us to the current state of the GPT-3 and GPT-4 style models, the Inflection-2.5 style models – this combination of the transformer architecture and scaled compute?

Mustafa Suleyman:

I think that this revolution has been driven by deep learning. We are still building deep learning models, albeit with a slightly different flavor now.

With the transformer architecture from 2017, we're now turning those models into composable units that are essentially going to act like parts of our software development ecosystem. You're just going to turn to your AI and it's going to actually generate code for you. We're already seeing that in GitHub Copilot, and we're seeing it as a member of your team, able to take a natural language instruction and just act in unison with you.

And I think that what people don't quite realize is that these models are not going to be large forever. In the history of all technologies that have been valuable, anything that is significant gets cheaper and easier to use over time, and that curve has been double exponential over the last couple of years. It's incredible. I mean, Microsoft AI just released Phi-3, fully open source. It is close to, but not quite at, GPT-4 level, and it's fully open source. It's 3.8 billion parameters, so that's more than 100X smaller in terms of inference compute than basically the absolute frontier of models today. Like I said, it's not quite as good, but it's certainly as good as or better than GPT-3.5. That's mind-blowing. I mean, that's something that can fit on your laptop, or in the future on a phone. And so we should expect that trajectory to continue. I think open source models are going to be very close behind the closed source proprietary API models – months, or maybe a year to a year and a half – and that's just going to change the whole creation landscape, basically.


Seth Rosenberg:

What enabled this model to be almost equal in performance while also being much smaller?

Mustafa Suleyman:

For the last couple of years, everybody has been focused on reinforcement learning from human feedback, where in the final stages of training – at the stage of fine-tuning, or post-training as people say – you have a bunch of trained raters or judges that compare two possible responses or completions from a model, and that pairwise comparison provides large-scale feedback for the kinds of behaviors that you want your model to exhibit. Everyone's familiar with that now. But what we've been focused on (as soon as that was showing promising signs, so for the last 18-24 months) is reinforcement learning from AI feedback, where we want very smart and capable models to do that pairwise comparison. Because we can automate that process, we can produce an even larger number of supervised fine-tuning labels to give more feedback to the pre-trained model, across a wider range of experiences and moments that might be in tension with one another if you only have a small number of samples that come from expensive, highly trained humans.
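To make that pairwise-comparison step concrete, here is a minimal sketch of RLAIF-style preference labeling. The `judge` and `policy_model` objects and their `complete` method are hypothetical stand-ins, not any particular vendor's API.

```python
# Minimal sketch of RLAIF-style preference labeling (all interfaces
# hypothetical): a strong "judge" model compares two candidate completions,
# and the winner/loser pair becomes a label for preference training.

from dataclasses import dataclass

@dataclass
class PreferenceLabel:
    prompt: str
    chosen: str    # completion judged more aligned with the behavior policy
    rejected: str  # completion judged less aligned

JUDGE_TEMPLATE = """You are grading two answers against this behavior policy:
{policy}

Prompt: {prompt}
Answer A: {a}
Answer B: {b}

Reply with exactly "A" or "B" for the answer that better follows the policy."""

def label_pair(judge, policy, prompt, a, b):
    """Ask the judge model which completion better follows the policy."""
    verdict = judge.complete(JUDGE_TEMPLATE.format(
        policy=policy, prompt=prompt, a=a, b=b))
    if verdict.strip().upper().startswith("A"):
        return PreferenceLabel(prompt, chosen=a, rejected=b)
    return PreferenceLabel(prompt, chosen=b, rejected=a)

def build_preference_dataset(judge, policy, policy_model, prompts):
    """Sample two completions per prompt and let the judge pick a winner."""
    labels = []
    for prompt in prompts:
        a = policy_model.complete(prompt)  # two independent samples from
        b = policy_model.complete(prompt)  # the model being trained
        labels.append(label_pair(judge, policy, prompt, a, b))
    return labels  # feeds a reward model or a DPO-style trainer
```

Swapping the human rater for a judge model is what makes the labeling cheap enough to run over many more prompts than human raters could ever cover.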

So that was one method: reinforcement learning from AI feedback. And then the second is generating training data from these models. Sometimes people refer to that as distillation, where you're trying to absorb as much of the best bits of a big, powerful model as possible, and then you are using that to post-train or align your smaller model. Parameter count is no longer the primary proxy for capability; high-quality data is the real valuable asset here, in addition to the architectures. So for the last six to 12 months, everyone's been focused on compute, compute, compute – can I get compute? – and obviously that's important for large models, but really it's about investing in high-quality data.
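A minimal sketch of the distillation idea, assuming hypothetical `teacher.complete`, `teacher.score`, and `student.fine_tune` interfaces: the big model writes completions, the weak ones are filtered out, and the rest become supervised fine-tuning data for the small model.

```python
# Sketch of distillation-by-generation (hypothetical model interfaces):
# a large teacher writes high-quality completions, and the filtered
# (prompt, completion) pairs post-train a much smaller student.

def generate_distillation_set(teacher, prompts, min_score=0.8):
    dataset = []
    for prompt in prompts:
        completion = teacher.complete(prompt, temperature=0.7)
        # Quality filtering matters more than raw volume: keep only
        # completions the teacher itself rates highly.
        if teacher.score(prompt, completion) >= min_score:
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

def distill(student, teacher, prompts):
    sft_data = generate_distillation_set(teacher, prompts)
    student.fine_tune(sft_data)  # standard supervised fine-tuning
    return student
```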

From a startup perspective, I think the real trick is finding existing data sources or, more importantly, creating a UI that allows you to collect high-quality data from an interaction with a product domain that you think is valuable – one that is producing a moat of highly valuable data that you can then use to post-train and fine-tune your model again and get into that feedback loop. That is a path to creating an enormous amount of value, and it does not require you to depend on the large-scale model providers, which I think is great. That's why it's such a creative time in entrepreneur land.

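One way to read that point in code – a hypothetical sketch of a UI feedback loop, where every interaction is logged with a lightweight user signal and the endorsed ones are exported as fine-tuning data:

```python
# Hypothetical sketch of a product data flywheel: the UI logs every
# interaction with a lightweight user signal (accepted, edited,
# regenerated, thumbs-down), and endorsed turns become fine-tuning data.

import json
import time

def log_interaction(store, prompt, response, signal):
    """signal: 'accepted' | 'edited' | 'regenerated' | 'thumbs_down'"""
    store.append({"ts": time.time(), "prompt": prompt,
                  "response": response, "signal": signal})

def export_finetune_set(store, path="finetune.jsonl"):
    """Keep only the interactions the user implicitly endorsed."""
    with open(path, "w") as f:
        for row in store:
            if row["signal"] == "accepted":
                f.write(json.dumps({"prompt": row["prompt"],
                                    "completion": row["response"]}) + "\n")
```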

Seth Rosenberg:

As a startup, you're competing with incumbents that have access to large sets of data. So I'm curious if you could share anything more about the nuances of opportunities that might exist for startups to get specific types of data that's maybe more valuable than others.

Mustafa Suleyman:

Okay, so how do you go about collecting high-quality data? In pre-training it's about volume of tokens, and there the hyperscalers will have a longstanding advantage because they already own search engines or YouTube or whatever it is. Whereas in post-training, you need a small number of very high-quality tokens to align the model to the behavior that you want for your product, and you can collect that from scratch.

When we built Pi – which I'd say remains, even today, the most high-quality, human-like conversational AI with the best EQ in the market – we didn't use any data from big providers. We collected all of it ourselves from scratch by training paid teachers. We call them AI teachers; some people call them raters. But the crucial thing for a startup is you have to really, really, really pay attention to training those teachers. You have to pay them a lot of money.

I'll just tell you from our perspective: we selected people who had an undergraduate education, nothing less than that, who largely spoke English as a first language (with some exceptions), and who had domain expertise that we thought was valuable. Maybe they said they were very passionate about history, or they had good cultural knowledge, or they were movie buffs, or whatever it is. They had to pass 20 hours of training and testing by us, so we would give them reading comprehension exams. We would give them multiple-choice questions. They would have to do sentence completions; they'd have to spot the differences. They would have to do really quite hard analytical tasks. And in order to keep my team super humble about how valuable this task was, I would also have all of my team go through the same training and take the same tests. And I can tell you, not even a majority of people passed.

Seth Rosenberg:

Yeah, I was about to say I'm nervous. I thought you were going to ask to send over the test!

Mustafa Suleyman:

Yeah, it's actually not easy. It's a pretty tough thing to do, because if you think about the task, you are asking a human to read through two 10-turn conversations, look at the proposed answer from one model and from another model, and then absorb a huge behavior policy – very detailed, line by line: “The AI should do X, shouldn't do Y in this situation, it should do this.” And then remember the training from the AI teacher that covers all kinds of subtle exceptions and stylistic tones and brand things and backstory things and capability awareness – and then you have to find the correct intersection of all of those to decide: is this paragraph more in line with the behavior policy, or is this one? Yeah, it's a painful task.

Seth Rosenberg:

That's super interesting. I'm curious how you view this evolving, because you mentioned we're moving from reinforcement learning from humans alone to reinforcement learning from AI as well. So how do you view application-layer startups: how vertically integrated should they be, and which parts of the stack do they really need to be experts in?

Mustafa Suleyman:

Experts? Yeah, that's a good question. I mean, I think you have to be very principled in answering that question, and that's where the bet of your startup is. You have to decide: which part of it am I going to bet on? Obviously a bunch of people are building tooling and infrastructure, and that's fine – we're all familiar with that kind of strategy. I'm a great believer in building and owning your own product and, as much as possible, controlling the key bit of the value there, which in my opinion is the LLM; everything around that is secondary. The words that come out of the LLM are what you have to focus on, and that means I think it's reasonable to break off the pre-trained model and get that from somebody else. That's a good approach. But I think you need to own your fine-tuning stack, and I would not give the fine-tuning stuff to somebody else. You have to train your teachers, because that's not going to go away anytime soon.

We're not going to have GPT-5 tomorrow, and suddenly GPT-5 replaces all humans as the ultimate judge or teacher. I think it's quite unlikely that it's going to be a lot better than GPT-4. Even for the people that have tried to do RLAIF with GPT-4-quality models... it's okay, and it is impressive. It's very cool, but it isn't on the verge of replacing humans entirely. You can get an 80% prototype going and it can look good, but a real consumer experience requires you to nail the 99th-percentile experience. It has to be really high quality, consistently, and as soon as the AI breaks out of character, gets something wrong, hallucinates – whatever you want to call it – that destroys the illusion and it breaks trust. So then you lose your consumer. And so I think that the key thing for startups, at least for the next year, is to get really good at data collection and data filtering and data quality.

Seth Rosenberg:

That makes sense. I'm curious, what are the different UIs that you think AI-first companies are going to be building (whether it's a chatbot, an agent, just regular SaaS) that's kind of enabled by AI?

Mustafa Suleyman:

In my opinion, the UI needs to get out of the way, especially for the consumer. Obviously for SaaS, you can have all the bells and whistles and all the developer features, blah, blah, blah. But for a consumer, the goal is to get the UI out of the way. So we created a very pared-back, quiet, soft, calming and, I think, quite distinct-looking UI with very few buttons, but we also had one of the best voices in the world. In the end I think we had nine or 10 voices, and they were really high quality. Pi's still alive and people should try it out, but I think voice-first is a big part of the future UI.


Seth Rosenberg:

Yeah, I like how you have to choose the voice you like as part of the onboarding experience with Pi.

Mustafa Suleyman:

Yeah, because it's about connecting with your AI – it's a core personalization moment. 30% of all our conversations took place on voice, and voice users were by far the longest, most engaged, most retained users. People should keep that in mind. I think that's a very important insight.

Seth Rosenberg:

I've heard you talk about how AIs have IQ and EQ, and then you've talked about AQ – action quotient. I feel like this is a very topical thing at the moment, where people are talking about autonomous agents for certain use cases, which includes reasoning and planning. How far away are we from the kind of chatbots most people have experienced today to the fully autonomous agent that you described at the beginning, capable of executing an end-to-end task? What's missing between where we are today and that vision in, let's say, the next six months to three years?

Mustafa Suleyman:

First of all, I don't think we're on a path towards fully autonomous, and I think that's actually quite undesirable. I think fully autonomous is quite dangerous, and I got a lot of stick after my TED talk because I said that the autonomous capability was dangerous and so on, and that it's one that should be regulated – and I don't really care. I still think that if you have an agent that can formulate its own plans, come up with its own goals, and acquire its own resources completely independently of humans, then objectively speaking, that is going to be more potentially risky than not.

So I think about it as these narrow veins of autonomy, where you give it a specific goal and it has limited degrees of freedom to go off and act in some specific environment: automatically calling some API to check some registry, to collect some information, to observe some state; maybe writing something into a third-party API that is not yours, but is again restricted to some specific degrees of freedom. I think the security risks here are significant, so I just think we should tread carefully on the autonomous piece.
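As a sketch of what a "narrow vein of autonomy" might look like mechanically – an agent restricted to an explicit allowlist of tools, with write-capable tools gated separately. All of the tools here are hypothetical stubs.

```python
# Hypothetical sketch: the agent can only invoke allowlisted tools, and
# write-capable tools are gated separately, bounding its degrees of freedom.

def check_registry(query):
    """Read-only: observe some state (stubbed)."""
    return {"query": query, "status": "ok"}

def fetch_record(record_id):
    """Read-only: collect some information (stubbed)."""
    return {"id": record_id, "data": "..."}

def post_to_partner_api(payload):
    """Write: restricted third-party side effect (stubbed)."""
    return {"written": True, "payload": payload}

READ_ONLY_TOOLS = {"check_registry": check_registry,
                   "fetch_record": fetch_record}
WRITE_TOOLS = {"post_to_partner_api": post_to_partner_api}

def execute(tool_name, allow_writes=False, **args):
    """Run a tool call only if it falls inside the agent's allowlist."""
    if tool_name in READ_ONLY_TOOLS:
        return READ_ONLY_TOOLS[tool_name](**args)
    if allow_writes and tool_name in WRITE_TOOLS:
        return WRITE_TOOLS[tool_name](**args)
    raise PermissionError(f"'{tool_name}' is outside this agent's allowlist")
```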

But in terms of the actions piece, it's still pretty hard to get these models to follow instructions with subtlety and nuance over extended periods of time. I think that they can do it, and there are a lot of cherry-picked examples that are impressive on Twitter and stuff like that, but to get them to consistently do it in novel environments is pretty hard. And I think that it's going to take not one but two orders of magnitude more computation in training the models – so not GPT-5, but more like GPT-6 scale models. So I think we're talking about two years before we have systems that can really take actions.


Seth Rosenberg:

That makes sense. And if you were to break down I guess some of the unsolved research or technical problems to get there?

Mustafa Suleyman:

Well, an action is no different, really, to predicting a sequence of words. So when you ask a model to complete a sequence of actions – let's say it's a few steps, basically, to book a restaurant that you and I can go to on a certain day. The first action would be to check the availability in both of our calendars – so that's one correct function call. Reconcile the correct moment – that's the second action. Make sure that it's a restaurant that has availability – that check is another one. And then sign in so that you can use the correct tool to book the right restaurant at the right time and put your credit card details down, having obviously also checked that it's a restaurant that we both like, et cetera. So it's four or five or six different steps – sub-components – just to produce that one “action”.

So in order to get that right, you are basically saying that the model has to produce a perfect function call for each element, and do so in sequence. It can't just be arbitrary; it has to be in sequence. And that's like saying it has to write a four-page document in response to one question that is exactly that document – it can't be something that is approximate or similar to that document.
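Spelled out as code, the booking example is a fixed pipeline of function calls that must all be exactly right, in order. Every function below is a hypothetical stub, not a real API.

```python
# The booking "action" decomposed into exact, ordered function calls.
# One wrong call anywhere – wrong function, wrong argument, wrong order –
# and the whole action fails, unlike free-form text where many different
# outputs would all be acceptable.

def find_mutual_availability(cal_a, cal_b):
    return sorted(set(cal_a) & set(cal_b))

def search_restaurants(open_at):
    return [{"name": "Trattoria", "open_at": open_at, "cuisine": "italian"}]

def book_table(restaurant, when, card):
    return {"confirmed": True, "restaurant": restaurant["name"], "when": when}

def book_dinner(cal_a, cal_b, preferences, card):
    """One abstract 'action' = five exact function calls, in sequence."""
    slots = find_mutual_availability(cal_a, cal_b)  # 1. check both calendars
    when = slots[0]                                 # 2. reconcile a moment
    options = search_restaurants(open_at=when)      # 3. availability check
    choice = next(r for r in options
                  if r["cuisine"] in preferences)   # 4. one we both like
    return book_table(choice, when, card)           # 5. book and pay
```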

So we all think that these models are magic at the moment – they write beautiful poetry and creative copy and text and give you good answers, and sometimes they're grounded and blah, blah, blah. But for each one of those answers, there's a wide range of correct answers that it could have picked – tens, hundreds, thousands maybe. It isn't producing one specific, perfect output where every single token is the correct answer. It's not there yet. So to get that level of precision, we have to scale up these two orders of magnitude, and that's what's happened so far: over the last five orders of magnitude of scaling transformers, with every 10X of compute and data, we get more precision. The “emergent capabilities” framing is wrong. People say, “Ohh, it's surprising we had these emergent capabilities.” That's an anthropomorphic projection. They're not surprising emergent capabilities; it's just more precise attention to the correct mapping between a prompt and an output. So you're just homing in on something more specific.

Seth Rosenberg:

Do you think we'll be able to get narrow forms of actions in specific domains before we get to GPT-6?

Mustafa Suleyman:

Yeah, definitely. I mean, there are some good actions today. You can definitely see these orchestrators making good API calls at the right time. The question is: can it do it with 99% accuracy? Because if it does it 80% of the time, then one in five times it's getting it wrong, and that's not usable for a consumer. So you either have to constrain the action space – so that every time you ask your model to take an action, it's only got five options to pick from and the consequences of getting it wrong are low – or you have to find a problem domain where four-out-of-five accuracy is acceptable.

Seth Rosenberg:

Right. And if you think about the architecture of building one of these agents, are there any differences to what we had just talked about (in terms of focusing on post-training, getting the UI out of the way), in terms of the architecture of building one of these kinds of narrow, autonomous action agents?

Mustafa Suleyman:

I mean, one thing for people to keep in mind, I think, is that there's a whole range of tools in the toolbox now. And so the art is in designing a router, or a classifier, that can take some given input – either contextual information, metadata, or of course the incoming query from the user – and redirect that to a model that is appropriate to that context. That's important for inference budget management: you can redirect the query to smaller, cheaper models, or higher-quality models, or models that have specialized in particular domains – that have been fine-tuned for a particular area of expertise, or that have certain capabilities. They might be good at retrieval, retrieving from some knowledge base or even from the open web, or they might trigger a model that has been fine-tuned for a voice response, because obviously the length and style of a voice response would be quite different from one producing paragraphs and paragraphs in the traditional sense. So the router is a critical part of the architecture as well.
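The router can be as simple as a classifier sitting in front of a pool of models. The model names and routing heuristics below are purely illustrative; in a real system the routing decision would itself typically be a trained classifier.

```python
# Illustrative router sketch: classify the incoming query plus metadata
# and dispatch to the cheapest model that can handle it.

MODEL_POOL = {
    "small_cheap":  "simple chat turns at low inference cost",
    "domain_tuned": "fine-tuned for a particular area of expertise",
    "retrieval":    "grounds answers in a knowledge base or the open web",
    "voice_tuned":  "short, spoken-style responses",
    "frontier":     "expensive fallback for hard queries",
}

def needs_fresh_facts(query):
    return any(w in query.lower() for w in ("today", "latest", "news"))

def estimated_difficulty(query):
    return min(len(query.split()) / 100, 1.0)  # crude stand-in heuristic

def route(query, metadata):
    """Map (query, context) to a model key, cheapest-first."""
    if metadata.get("channel") == "voice":
        return "voice_tuned"   # spoken replies need a different length/style
    if needs_fresh_facts(query):
        return "retrieval"
    if metadata.get("domain") in {"legal", "medical"}:
        return "domain_tuned"
    if estimated_difficulty(query) > 0.8:
        return "frontier"
    return "small_cheap"       # default keeps the inference budget down
```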

Seth Rosenberg:

So, Mustafa, I’m curious. If you were starting your first company today, given everything that exists and the crazy rate of change, where would you focus your attention?

Mustafa Suleyman:

I would look for problem domains that make a virtue out of imprecision. So what is a problem where, if you solve it, ambiguity is part of the valuable contribution you've made – where imprecision and multiple possible answers are the key thing? If you pick a problem domain where the consequences are really high if you get it wrong, and where there are really only one or two correct answers, then your model is going to struggle. Right? That's the first thing I would say: look for more of those things.

Seth Rosenberg:

And just on that point, there is a ton of activity in domains that do require more precision like law or accounting or tax. Do you think that that is a futile effort at the moment?

Mustafa Suleyman:

Actually, law doesn't require as much precision as taking actions. Even in law, most of the applications are retrieving similar cases or giving you summaries of cases – where there are five possible summaries, all of which could be correct, right? Or where you are retrieving one case over another, you don't have to be exactly right. So law is actually a high-stakes domain, because the consequences if it gets it wrong are really bad – unlike generating marketing copy, where there are a lot of right answers. The medical domain is much harder. Clearly there are fewer right answers in medicine, and the consequences are really high, so that's a pretty difficult domain. A bunch of my really good friends from DeepMind Health, who are now at Google, just released a paper last week – unbelievable work that they have done – showing that they can basically provide an amazing reasoning engine for clinicians, and I think in the future for patients too. That's coming soon.

Seth Rosenberg:

What other kinds of factors would you think about?

Mustafa Suleyman:

I would say: look for where you can design an interface that naturally collects valuable labeled data for fine-tuning, by virtue of the interface. That's really important, because if you are successful, you want to be able to compound that success: the more users you get, the more data you get; the higher quality the data, the higher quality the model you can produce, and then you get that virtuous cycle. So that's a really important part of it.

And then it sounds obvious, but I think a domain where you can monetize much faster than you might think because you need to get people to pay for it pretty fast, because GPUs are wildly expensive as everyone knows.

Seth Rosenberg:

What's an example of that that comes to mind for you?

Mustafa Suleyman:

I think of the companies that are doing specialist services for – not quite 10,000 fans, but in the “10,000 true fans” sense – people that really do need that kind of niche, highly adapted expert system in their pocket. I don't know whether it's a mechanic or a dentist or someone who's passionate about a certain hobby or a chunk of IP. I think there's value in those sorts of specialist use cases that people will be prepared to pay for.

Seth Rosenberg:

Yeah, that makes sense. I'm curious to hear just a little bit about the AI products you're working on at Microsoft, what the portfolio looks like.

Mustafa Suleyman:

I'm responsible for Bing, the Edge browser, and all of Copilot, which is deployed in basically every Microsoft surface now, and it's been an impressive arrival, actually. The quality of the products and their scale and reach is much greater than you might think as a kind of default Silicon Valley person who grew up in Google.

Seth Rosenberg:

You don't become a $3 trillion company for nothing.

Mustafa Suleyman:

But the rep that we give it in Silicon Valley, relative to what it actually has, I think needs a rethink – and it also has just huge scale and distribution.

My main goal is to uplevel the quality of Copilot. And so we're rapidly building some of the best models in the world, partnering very closely with OpenAI, building on top of all of the OpenAI models and infrastructure, fine-tuning their models. And the next phase is that we're really going to start focusing on memory and personalization. I mean, your AI should remember everything about you – your context, your personal data, everything that you've said – and be there to support you and be your sidekick throughout your life. So that's what we're going to be focused on next.

Seth Rosenberg:

That's fascinating. I'm curious, what do you think about the constraint of the existing applications Microsoft Office versus the ideal version of Copilot?

Mustafa Suleyman:

Yeah, that's a good question. I mean, people have often said that AI subsumes all other interfaces and surfaces, and I think that probably overstates it, but it's the right direction. I think there'll come a time, in a few years, when the first thing you do is just say, “Hey, Copilot, can you take care of this for me? What's the answer to that? Where do I find this? Can you book this? Remember that, buy this, do this.” You're just going to have this ever-present aide in your life that is going to change what it means to use a keyboard. It's going to change what it feels like to have apps. It'll move us way beyond the search engine and the browser, and you're certainly not going to think, I need to go write a document or send a message in the traditional way. You'll still have those things, but your AI will just manage a canvas of activity across your entire life, largely coordinating with other AIs and other services and collecting information for you.


Seth Rosenberg:

Okay. Well, let's leave it at that.

Mustafa, I really appreciate you spending some time today and always enjoy catching up with you.

Mustafa Suleyman:

Yeah, great to see you, Seth. Thanks a lot. It was fun.

Seth Rosenberg:

Thanks.

Mustafa Suleyman:

Ciao.