Finding Nemotron
Welcome to the Practical AI podcast, where we break down the real world applications of artificial intelligence and how it's shaping the way we live, work, and create. Our goal is to help make AI technology practical, productive, and accessible to everyone. Whether you're a developer, business leader, or just curious about the tech behind the buzz, you're in the right place. Be sure to connect with us on LinkedIn, X, or Bluesky to stay up to date with episode drops, behind the scenes content, and AI insights. You can learn more at practicalai.fm.
Jerod:Now, onto the show.
Chris:Welcome to another episode of the Practical AI Podcast. I am your host, Chris Benson, and today we have a wonderful guest from NVIDIA. We've had some other guests along the way, as everyone knows. And today I would like to introduce Joey Conway, who is the Senior Director of Product Management for AI Models at NVIDIA. Welcome to the show, Joey.
Joey:Yeah, thanks, Chris. Good to be here.
Chris:I'm looking forward to it. I know we're gonna talk about a couple of recently announced models that you guys have put out there. But before we do that, I always like to get a sense of kind of, like, you know, your own background, how you came to NVIDIA, and specifically this particular area of work. I'd love to know how you got into this and what that special sauce is for yourself in what you do.
Joey:Yeah. I think from my background, I have done some software development in the past and also done some product management in the past. And I think in looking at opportunities, say maybe ten years back, at exciting things in the future, one thing I was personally excited about was machine learning and AI. And I think looking at opportunities, NVIDIA, and this is almost a decade back, was at a great spot: they were involved in many things and things were just getting started.
Joey:So I had a great opportunity to join NVIDIA. And then being here, the company works on all sorts of amazing technologies. I think one space that our team has focused on has been essentially the non-vision workloads. And so we started many years back with things like BERT and NLP and maybe more simple types of language models that could do classification of intent and those types of things. I think we've been on the journey for a while and we've been excited that there's been great research and breakthroughs the last, say, five years that I think have made, we'll say, exponential improvements and brought it to a much more mainstream type of use case.
Joey:And so I think the background there on my side is being familiar with software development and kinda comfortable with new technologies. And then with the excitement of new opportunities and places to grow, NVIDIA has been very well positioned for that. So I think it's been kind of a few factors coming together at the same time. And if you had asked me maybe five, six years ago when we first started on some of this journey, I probably wouldn't have guessed we'd be at such a great inflection point as we are now. But I think we're very excited to be here and there's a lot of fun stuff happening we can talk about.
Chris:Gotcha. So I know today we're gonna dive in. I'd like you to introduce to the audience the two models that were announced. But if you could frame them a little bit in kind of the current landscape of open foundation models and kind of where AI research is at this point, and why NVIDIA is putting these models out at this time. What is it about them that's different from all the other stuff out there? And why have you made some of the choices in terms of open versus closed, things like that?
Chris:So if you would tell us about these models.
Joey:Yeah. And I'm happy to start with kind of the landscape or where the world is at, and I can give a little bit of context there too. So on the NVIDIA side, we've been working on publishing models and kinda open weight checkpoints and to some degree datasets for many years now. It's been quite a while, five, six, seven years, probably even longer. And we've trained many large language models as well.
Joey:I think the first one, I'm trying to remember the formal name, I think it was Megatron-LM or Megatron NLG. There were a few variations of it, but that was probably four or five years ago. And we do it for kind of a few reasons. One is we want to understand how to take the best advantage of our infrastructure, so from compute and storage and networking. We also wanna prove out the software stack and make sure the software runs great.
Joey:And so we do that ourselves. We learn a lot along the way and we can make improvements. And then we also do that because we want the community to benefit and learn. And so we publish all that software, those techniques, the papers, and we do that so everyone else has higher confidence and can start from a better beginning spot than we did. We've been doing that for many years in many different domains, so things like speech or transcription and large language models and even simpler, smaller language models like BERT.
Joey:So we've been doing that for quite a while. I think in parallel, there's lots of companies in the space and they all have different business models, and our goal is to support them. And so I think there's been a few big moments in kind of the language model space. I'd probably say BERT was a big one quite a few years back now. That was where we kind of had an inflection point where language models could do, essentially, classification tasks that previously we weren't able to do.
Joey:And so being able to parse out language from people typing or speaking and being able to help understand what they want, what they're looking for, what types of actions they're asking for, that was a great breakthrough moment. We're very happy with that. We've published lots of software to help support it and make sure it runs efficiently on our infrastructure and people can benefit from it. I think probably another big moment for the world was ChatGPT. I think we were super excited to see all that happen, and OpenAI is a wonderful partner.
Joey:And so when that happened, it was an inflection point where many people started to realize the capabilities of what was possible. And there was amazing research that went in behind that. And so that was kind of another big milestone that happened along the way. And so as each of these kind of occur, we're always asking how we can help. So how we can help more people take advantage of the technologies and benefit from them.
Joey:And so as kind of that ChatGPT moment happened, many companies were starting to ask how they can take advantage of these technologies. And as we spent time working in that space, and we love to support our partners and we think it's great for companies to use them, we started finding that there were some scenarios where not every company could use some of the solutions out there. So scenarios where, say, a company has proprietary intellectual property that they don't want to leave their premises, that they need to keep on premise. There might be scenarios where they want control over the model, the model architecture. They want control over the data that goes into it as well as what they fine-tune it on.
Joey:And so in those scenarios, there were a lot of open source contributions a few years back. Many companies were training foundation models, and we were really excited. And so we did our best to kind of support that, both in the software we published to make sure it runs well and all of these partners can build the best models. We also do some of that ourselves too, to make sure that we're not just publishing things for others, but we use it ourselves to make sure it runs well too. We're always trying to stretch the scale of things going forward, so we're always trying to push the limits of what's possible.
Joey:And so in kind of that broader effort of pushing the limits, what we found is that there's opportunities for us to contribute that out. So say new infrastructure comes, we can sometimes be the first to show people how to do that. In terms of the large scale contributions we've been making over time, that's one of the incentives and reasons we have to keep participating in the space. And so going forward from the moments of, say, two years back when lots of companies and partners were publishing open models to where we are today, the biggest breakthrough we've seen happen was probably around January, February in terms of open weight models now supporting reasoning capabilities. And this came through DeepSeek as kind of one of the leaders in this space of being able to add reasoning capabilities, meaning we can take complex queries and we can now start to break them down and think through them and come up with answers that previously we couldn't.
Joey:Previously, we often just had one question in and one answer back out, and we had to be fast at it. And now with reasoning, the models can take some time and think about it. And so that's probably the next big milestone that we're really excited about seeing, and one of the main reasons that we're publishing models at this kind of juncture, in terms of wanting to help move that technology forward.
Chris:I'm curious. That was a fantastic answer, by the way, and there was so much there that I'd like to dive into with you. I think the very first thing is we're hearing about reasoning a lot now, you know, from various organizations. Glad to hear it from you. But I think that it's one of those phrases that, in the context of generative models, what does reasoning mean?
Chris:Could you talk a little bit about what the word reasoning means in this context from NVIDIA's standpoint? And how does reasoning in that context differentiate from some of these really powerful models we're seeing from NVIDIA and other organizations, models that have been able to do amazing things but weren't necessarily classified as reasoning models?
Joey:Yeah. And I can give a few answers here. I think there is a little bit of variation in definitions among the community, but I'll try and share where I see the most consensus. So maybe I'll go back a few stages.
Chris:Sure.
Joey:Say going back four or five years, when we had these models, we'll say like GPT-type architectures that were autoregressive, meaning that they would go through a loop. And so they generate one word of a sentence and then they feed that back in, generate the next word and the next word. And this was kind of the technique they used to be able to generate paragraphs and this long generative content that we hadn't seen before. And this allowed them to write sentences, to write stories. And at that kind of original juncture, the challenge we had is it would just do, say, this next word prediction.
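For listeners who want to see the loop Joey is describing in code, here is a minimal greedy next-token decoding sketch using the Hugging Face Transformers library; the model name is just a small placeholder checkpoint, and the loop is deliberately simplified (no sampling or KV-cache handling).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works the same way for this sketch.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Start from a prompt and repeatedly feed the growing sequence back in.
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
for _ in range(20):  # generate 20 tokens, one at a time
    with torch.no_grad():
        logits = model(input_ids).logits          # [batch, seq_len, vocab]
    next_token = logits[:, -1, :].argmax(dim=-1)  # greedy: pick the most likely next token
    input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```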
Joey:We had struggles knowing how to control it, how to direct it, how to guide it, and how to keep those answers with a high accuracy. And so one of the great breakthroughs that OpenAI achieved with ChatGPT that the world got to experience was you could tell the model or give it better guidance and directions and it would adhere to it. And so everyone was very impressed and excited about some of these techniques like alignment and reinforcement learning. It was a wonderful breakthrough. And I think we've all benefited from that technology.
Joey:And that next stage then allowed us to take these models and, instead of just doing the next token, they now would actually stay on topic of what we asked. And so they could follow directions. If you said, now take your story and do it in a bullet point format, or do an intro, a body, and a conclusion, it would now actually do that, instead of just giving you the next word and giving a big paragraph. And so that was one of the big breakthroughs along the way. As of this year, the big breakthrough with reasoning, the way we think about this, is that there's kind of these sets of questions or challenges that models have been able to solve up to today.
Joey:What we see with reasoning is there's a whole other set of questions and challenges that we couldn't previously solve. And kind of the rationale behind it, and then I'll go into some examples: previously, when we would interact with a model, we would either give it a prompt, so you give it a question, or you could give it a few examples in the prompt and then a question. You could say, like, I want to do math. Here's some examples of how math works, and now here's a math problem.
Joey:And the models were pretty good at that. What we've seen though is that the more complex the questions are, the more difficult it was for the model to solve it in the first pass. And so often people would just give it these complex queries, say like a word problem, the classic of two trains coming at each other with different speeds. And the reasoning you have to walk through, of the first train is at this speed, the second train is at that speed, what are their directions, what is their rate? The ability to then walk through and ask four or five sub-questions that are needed to answer that question, the models weren't very good at doing that.
Joey:And often what we would have to do is either manually, ourselves, or by using another model, break down the question into sub-questions and then try and find ways to answer each of those sub-questions with different models. And there was just a lot of manual stitching and kind of ad hoc work that had to happen. With reasoning, the big breakthrough is we're now able to train the models, at training time, around this skill: we show them here's a question and then here are the different ways to break it down. These are sometimes called reasoning traces, where we show there are multiple ways to solve it, but we give all those examples in there and then we give the answer. Previously, it was very much focused on here's the question, here's the answer, and that's how we teach.
Joey:But it kind of makes sense if you think about how people learn. When you're doing math problems, it's always good to get the right answer, but sometimes it's even better just to understand how to solve it as opposed to the right answer. And so that's been the big breakthrough with reasoning is we now can teach the models, here's a way to think through complex problems and be able to not just give the right answer, but give all the supporting thought and process you use to reach that right answer. And that applies to like scientific domains, say biology, chemistry, physics, applies to math, applies to software development. It should apply to the majority of domains.
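As a rough illustration of the "question, reasoning trace, answer" idea, here is a hypothetical training record; the field names and formatting are made up for this sketch and are not the actual Nemotron dataset schema.

```python
# Hypothetical reasoning-trace record; field names are illustrative only.
trace_example = {
    "question": (
        "Two trains are 300 km apart and travel toward each other at "
        "60 km/h and 90 km/h. How long until they meet?"
    ),
    "reasoning_trace": [
        "The trains close the gap at a combined speed of 60 + 90 = 150 km/h.",
        "Time to close 300 km at 150 km/h is 300 / 150 = 2 hours.",
    ],
    "answer": "2 hours",
}

# At training time the trace and the answer are typically concatenated into the
# target text, so the model learns to emit the intermediate steps, not just
# the final answer.
target_text = (
    "\n".join(trace_example["reasoning_trace"])
    + "\nAnswer: " + trace_example["answer"]
)
print(target_text)
```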
Joey:And it's kind of the next tier of challenging problems that we haven't been able to solve very well. And in the open space, these are a big breakthrough in terms of both the data and the techniques of how to teach the model as well as just the model capabilities themselves and the final checkpoint that people can download and use.
Chris:So that was pretty fascinating from my standpoint to kinda hear that laid out. I think that's the best explanation I've ever heard in terms of what reasoning is in this modern context. Could you kind of dive in a little bit to introducing the actual models themselves? And for each of them, kind of describe how they fit into the ecosystem, what they're trying to solve. I know at least one is a reasoning model, and kind of talk about why them versus some of the others within their kind of subgenres, ecosystem-wise.
Chris:Could you go ahead and introduce the models themselves?
Joey:Yeah. In terms of what we've worked on at NVIDIA and kind of where we wanted to contribute, there's a great, thriving community of open weight models. And there's many great partners out there, from China to the US to Europe and, I think, other parts of Asia. We're very excited to see these ecosystems growing.
Joey:And where we wanted to focus at NVIDIA was on some of the harder problems that we knew it would be difficult for people to solve for, and do it in such a way that we could benefit all of the community. And so in kind of this path of seeing the growing capabilities of open weight models, we tried to think through what all the techniques and skills could be to create even better reasoning models. Our focus was that we wanted to be able to take the best of what's open and make it better. And so one of the themes you'll see as we go forward is we're constantly evaluating what's the best in the open community and how do we improve it. And so kinda leading up to how we decided to publish these models, I'll share maybe at a high level some of the key techniques that went into thinking about where we could contribute, and then I'll explain the models that we published and why we did that.
Chris:That'd be perfect.
Joey:Yeah. Great. So in thinking through what's out there in the community, what we realized is that while there are some great open weight models, often the datasets aren't necessarily all open and the tooling isn't necessarily all there and the techniques aren't necessarily as transparent or published so everyone can reproduce them. And so in kind of thinking through these challenges, we took the life cycle of creating a model. Say you start from some data, you start from some architecture, and then at the end you produce a checkpoint that people can go deploy and use and gain value from in a business setting.
Joey:And so kind of walking along that journey, we saw things on the pre-training side, where it's usually a large set of unlabeled data where we're teaching the model just kind of general knowledge and skills of the world and languages and content. On that side, there's not as much open data there, and the techniques of how to do that aren't necessarily as public or as published as we'd like. That was one place we thought about. The kind of next stage we thought about was, once you have this base model that has some set of knowledge of the world and capabilities, often people then take it and fine-tune it or make it kind of an expert at how to either interact with people or how to solve a certain domain of problems. And so we wanted to focus heavily on what we felt was important to enterprise companies. And so some of the skills that we decided would be really helpful for the community and for enterprise companies, which is kind of our focus in this space, on enterprise adoption and growth, were things like being able to work through scientific question answering problems, things like math, things like coding, things like tool calling or instruction following, and then being conversational.
Joey:Those are some of the key places that we felt enterprises would benefit the most from. And kind of the flip side of those challenges is that in the enterprise setting, they're very much looking forward to having the most correct answer. They wanna avoid hallucinations or incorrect answers. They want the models to follow directions. If I'm asking for three bullets, I want it in three bullets, not five and not a paragraph.
Joey:And then also on the scientific question answering side, there's a whole domain of companies who are working on things like, say, drug discovery or other kinds of domains that are quite technical, where they have complex problems that can benefit from the reasoning capabilities of the model, being able to think through and have more time to run these kinds of inference calls and reflect and progress through complex challenges. So those are kind of, on the accuracy side, the capabilities and skills we wanted to make more available to the community. And then on the infrastructure side, we knew that these models are super capable. And with reasoning, another challenge we introduced is it requires more compute and more iteration. So every time a token is generated, it takes some compute.
Joey:And when the model thinks, it generates more tokens. And so the challenge here is that the more the model thinks, the more compute and potentially more expense there is. But the upside and breakthrough is we can now answer more difficult problems we couldn't before. And so we wanted to think through how to optimize the model to be more efficient on the compute side, so that as we spend more time reasoning, we don't actually grow all the expense for the end customer when they want to solve more complex problems. And so those were kind of the key challenge sets we were thinking through.
Joey:And so as we went on this journey with the Nemotron family of models, what we published, and what we started publishing back in March, kind of celebrating the beginning of this venture, is what we're calling Llama Nemotron, meaning that we started from a base Llama model and then we used the best of the open models and datasets in the community. So we pull data from many of the public models, things like Mistral, things like our own Nemotron, as well as things like DeepSeek and Qwen, where there's amazing breakthroughs in the open community. And we used those to gather the best data and the best knowledge and then took some of the state of the art training techniques in our software stack that's open and available, called NeMo Framework. And we were able to take the Llama models and improve their capabilities and skills for reasoning and be able to publish and win many of the leaderboards in those domains. And along the way, some of the other work we did was shrinking the model architecture.
Joey:So what we call kind of neural architecture search: being able to take what Llama did as an amazing and quite common and popular transformer architecture, there were ways that we were able to essentially shrink that model architecture while keeping the accuracy the same. And that allowed us to reduce the cost and the compute footprint a bit as well. Because at the same time, introducing reasoning makes the model more capable, but it also slows it down a bit.
Joey:And so we were able to shrink the model architecture to try and keep that speed as quick as we could. And at the end, we published kind of a family of three models. We have what we call Nano, generally a quite small model that would fit on maybe a smaller data center GPU. And then we have the Super, which fits in the middle, and that fits on one more common large scale data center GPU, like, say, an H100 or an A100. And then we have the Ultra, the third of the family.
Joey:The Ultra fits within one node, so eight of the H100s or eight of the A100 GPUs. And the Ultra is often the model that shows the best capabilities, the state of the art accuracy. And then the Nano and the Super are often where we see most people begin and start and put into production and build and fine-tune on top of. And so as we publish these three models in this family, we also publish the data we used to post-train them.
Joey:All of that data we made open source and available, and that includes all the math and the scientific question answering, the chat, the instruction following, reasoning and non-reasoning. One clever thing we set out to do here: prior to this, the models that were open were either reasoning or non-reasoning, and they were separate models. And we could kind of empathize with enterprises that deploying two models is twice as much work as deploying one. And so one thing we did when we first published these was put them into one model. And so that way you can ask the model, can you reason through this?
Joey:This is more complicated. I'm willing to spend the time and wait and invest in the answer. Or this is a super simple question, like what's two plus two? You don't need to reason, just give me the answer and don't spend the compute on it. And so we published the datasets to support that capability as well as the model checkpoints.
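As a sketch of how a single combined model can be steered per request, here are two illustrative chat payloads. The "detailed thinking on/off" system-prompt toggle follows the pattern described in the Llama Nemotron model cards, but the exact strings should be verified against the card for the checkpoint you deploy; these payloads would typically be sent to an OpenAI-compatible endpoint.

```python
# Illustrative chat payloads for one model that supports both modes.
# System-prompt strings are an assumption based on the Llama Nemotron model
# cards ("detailed thinking on/off"); check the card you actually deploy.

reasoning_request = {
    "messages": [
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "A factory's output grows 8% per year. "
                                     "Roughly how long until it doubles?"},
    ]
}

fast_request = {
    "messages": [
        {"role": "system", "content": "detailed thinking off"},
        {"role": "user", "content": "What is 2 + 2?"},
    ]
}
```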
Joey:And then some of the software we used inside NeMo Framework, things like NeMo RL, there's training techniques inside there we also published as well. So all of this makes up the family of models and data and tools that we published under the umbrella we're calling NVIDIA Nemotron.
Chris:Gotcha. And just for reference, to give people a sense of size and what GPUs are needed, we'll often talk about parameter counts. Could you give, for each of the three versions, the parameter counts? Are we talking like an 8 billion for Nano or something like that? I'll let you run with that.
Joey:Yeah. And I'll tell you where we're at today and a little bit about where we see things going. So where we're at today is, for the Nano, it's an 8 billion parameter model. We do have a smaller 4 billion variant we just published, but we likely expect to stay at the 8 billion parameter size when it's a dense model architecture. And kind of the rationale there is we're targeting, say, like a 24 gigabyte NVIDIA GPU, roughly, memory capacity wise.
Joey:And in that size range, we wanna maximize the accuracy capabilities. And so likely around 8 billion dense is probably where we're gonna stay there. Yep. On the Super side, we're targeting the more common and larger data center GPUs, so like the H100 with 80 gigs of capacity, or an A100 80 gig.
Joey:And so in that space, we expect probably around 50 billion parameters of a dense model will be the best fit, and we published a 49B. So we'll likely stay in that ballpark going forward. Then there's the Ultra side, and these are all, I should mention, variants of Llama. So, like, the Nano is an 8B. We distilled down to a 4B, but we realized, with the capabilities of reasoning at a small scale,
Joey:there are some challenges there, and so the 8B does do quite well. On the Super side, we started from the Llama 70B, which was a great size, but we wanted it to fit in one GPU, and so we distilled that down to the 49B. And on the Ultra side, we started from Llama's 405B from last summer, which, running at, say, FP8 precision, does roughly fit within one node. But our goal was to see if we could shrink it and maintain the accuracy, because one node is still quite a large deployment footprint. And so with our Ultra, we have 253 billion parameters on the dense side.
Joey:And so that fits in roughly four GPUs, so about half a node. And so we're excited about those breakthroughs, because it does kinda relate to the cost that it takes to run the model, and we're achieving the same, if not better, accuracy from what we built on. I think going forward, there'll likely be some changes in this space. I think there's work that NVIDIA's published on the research side recently around hybrid dense architectures, where there's some techniques around, say, SSMs or Mamba-style architectures, where we can make the output generation much more efficient. And we expect that with reasoning, the longer generations of reasoning traces and the ability to think, that output generation will continue to be more of a challenge.
Joey:And so I think we'll likely expect to see on our side, say, a 10 to 15% throughput speed up on the output generation going forward, in kind of newer iterations of this using some of the latest research. And then the other big exciting thing we're looking forward to is on the mixture of experts side. We expect that at the very large scale, so likely around, say, the Ultra size range, where we've seen a lot of the community, say, Llama 4 is there and DeepSeek and Qwen, they all have mixture of experts, especially at the large scale. We expect that will be a new trend going forward. And we think we'll probably also be participating in that space too, where at the very large scale, mixture of experts allows us to get great accuracy and also allows us to be more inference efficient at that larger scale too.
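As a back-of-the-envelope check on the deployment footprints Joey quotes, here is the weights-only arithmetic (parameters times bytes per parameter); it ignores KV cache, activations, and framework overhead, so real deployments need headroom beyond these numbers.

```python
# Weights-only memory estimate: params * bytes_per_param. Real deployments also
# need room for KV cache, activations, and framework overhead.
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9  # gigabytes

print(weight_gb(8, 2))    # Nano 8B at FP16/BF16   -> ~16 GB, fits a 24 GB GPU
print(weight_gb(49, 1))   # Super 49B at FP8       -> ~49 GB, fits an 80 GB H100
print(weight_gb(253, 1))  # Ultra 253B at FP8      -> ~253 GB, roughly 4 x 80 GB GPUs
print(weight_gb(405, 1))  # Llama 405B at FP8      -> ~405 GB, needs about a full node
```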
Chris:I'm curious, as you've talked about kind of building off of that Llama 3.1 base as you go, are you and the Meta teams that produce Llama targeting the same types of features going forward and the same performance metrics? Because there's so many different places to allocate, you know, the effort. Are you very much in alignment, or do you find yourselves deviating a bit from Meta? You know, as two large corporations that are partners and working together and both producing, you know, the same line of open source things, at least at a base level? How does that work?
Chris:And is there any collaboration with Meta, or do you guys each just kinda say, I'm gonna go do my own thing and build off? Because, you know, they had built the best base so far for what you guys wanted to build off of next.
Joey:Yeah. Meta is a great partner. And so we do work really closely with them in lots of different ways. And so we've been very excited about all the Llama work, and they did have a conference, LlamaCon, probably a month and a half ago now. And we're very supportive.
Joey:I think in their keynote there, you'll see there is a slide on Llama Nemotron, kind of celebrating some of the collaboration and achievements. And so I think there's definitely overlap, and those are the places where we try and collaborate as much as we can. And I think that they're also very focused on some of these challenges like reasoning and some of these enterprise use cases. And so we're always excited to see the next iteration of Llama, because it gives us an even better starting point for us to think about where else to contribute. So I think going forward, I expect that will continue to be a great collaboration.
Joey:I think we're always excited for the next versions of their models to come out, and we celebrate them both in our software stack, making sure they run efficiently and we can help enterprises deploy them directly. And then we try on the Nemotron side to see what else we can contribute from the rest of the community and some techniques and what kind of breakthroughs we can make. So I think some places where we might see differences going forward could be perhaps in the model architectures. I think those could be places where there's different research breakthroughs that come at different points in time. And so I think there might be timing differences there.
Joey:In terms of, I think, like accuracy or capabilities, generally speaking, we're looking at very similar types of achievements. And so I think that will feel more like incremental growth, say every few months. And so I think that'll be a place where we publish all the data, and we make it in such a way that everyone can benefit from it. And so I expect going forward, we should see more achievements.
Joey:Beyond Llama, I think about part of our broader effort: we did have a conference last week in Europe, in Paris, and there we announced partnerships with a handful of model builders over in Europe, a little bit over 10. And so our goal over there is also to try and enable a similar ecosystem, where there's many different languages and cultures and histories in Europe. And so what we'd like to be able to see happen, and what our partners over there are super excited to be able to invest in and do, is take some of these models and these techniques and datasets and, say, bring reasoning to Polish or to different languages in the regions there, where some of these are more nuanced and complicated. They have the history and the culture, and we have kind of the general skills. And so I think going forward, we expect to see a lot more of that out in the community, where people in certain countries, certain languages and cultures can benefit from a lot of the breakthroughs that happen in English first, in such a way that they can bring those skills over.
Joey:There are some things that are generally transferable: math, generally speaking, is pretty consistent across languages. Software development is another one of those. We're pretty optimistic that the work that's happened in English and the datasets we publish should be able to help, say, bootstrap, so to speak, other languages and get them up and going. Each of those countries and domains have points that they can celebrate and places that they can adopt, and different challenges or obstacles, say scientific question answering in Polish, that they're trying to work through, for example. So I think that'll be the other place we expect to see a bunch of growth, and we're excited about that.
Chris:Alright. So, Joey, that was a great introduction to the models and laying them out. And to build on that a little bit as we get a little bit more in-depth on them at this point: I think it is often cast in the industry as, and maybe depending on the organization, maybe it is, you know, competition. But there's also, as you've laid out very well, a clear sense of partnership across organizations here. So if you're someone listening to this right now, and you're very interested in Nemotron, maybe you already have Llama 3.1 deployed in your organization. And you may have the proprietary ones.
Chris:You may have, you know, models from Gemini or ChatGPT or whatever deployed as well. With the model that you have produced here, how should people think about that, in the sense that there is obviously progress that keeps being made and models build on each other? And so I think everyone's quite used to the fact that you're iterating on the models that are in your deployment in your organization. But now, you know, as you are looking at Nemotron, you may have the Llama model. Where should they be thinking about Llama?
Chris:Where should they be thinking about Nemotron? Where might they think about other things? How do you fit into someone's business today when they have all these different proprietary and open options available? What kind of guidance would you give on that?
Joey:Yeah. I'll give two answers. I think one, I'll talk about generally how we think of just evaluating models and understanding capabilities. Then second, I'll answer specifically for Nemotron. I think generally the mental model we encourage people to have is to think about models as, say, something like a digital employee.
Joey:Like there's a set of skills and capabilities that they were taught, that they were trained on, and things that they're really good at. And so those could be from, say, OpenAI or Gemini or Claude. There's amazing models out there. They could be from Llama, they could be from Mistral, Qwen, DeepSeek. There's a whole variety of options.
Joey:And I think the way we think about it internally, and where we encourage our customers to think about it, is all these models were trained on different datasets, different sets of skills. There are things that their publishers are proud of and excited about. And the main challenge often is for companies to understand where these models are great and then match them up with where their internal opportunities are to use them. I think that's kind of the bigger exercise: knowing these iterations will keep happening, we really want enterprises to get comfortable with kind of this discovery and opportunity fitting process. So to do that, we have software called NeMo Microservices we've been publishing, where there's some evaluation techniques and tools in there and some ways for enterprises to take internal data and create evaluation sets out of it.
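Here is a minimal sketch of that "interview the model" idea: build a tiny evaluation set from internal data and score candidate models against it. This is generic Python, not the NeMo Microservices API, and ask_model is a placeholder for however you call each model.

```python
# Minimal "interview the model" harness. The eval items and keyword checks are
# toy examples; in practice you would build these from your own internal data.
eval_set = [
    {"prompt": "Summarize ticket #4521 in one sentence.", "expected_keyword": "refund"},
    {"prompt": "Which API endpoint returns warehouse inventory?", "expected_keyword": "/v1/inventory"},
]

def ask_model(model_name: str, prompt: str) -> str:
    # Placeholder: wire this up to your serving stack (OpenAI-compatible API, local NIM, etc.).
    return ""

def score(model_name: str) -> float:
    hits = 0
    for item in eval_set:
        answer = ask_model(model_name, item["prompt"])
        if item["expected_keyword"].lower() in answer.lower():
            hits += 1
    return hits / len(eval_set)

for candidate in ["model-a", "model-b"]:
    print(candidate, score(candidate))
```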
Joey:And so I think that's a great place that we hope to see more people be able to invest in, because just like you interview an employee, where you're looking for a set of skills and capabilities, you should be able to interview models. And so we're hoping that's something that people will become more and more comfortable with over time. And then the second piece there, to talk about Nemotron: the places that we're really excited about Nemotron are gonna be around enterprise agentic tasks. And so if there are scenarios where you're trying to look at things like complex tool calling, or there are scenarios where you have more complex queries that will benefit from the ability to reason through, meaning you have a query that might require answering from different data sources, or from using, say, a calculator plus a search plus a data retrieval.
Joey:In those more complex scenarios, I think we're very excited that Nemotron should be one of the best models out there to work with. The other things we would encourage people to think through are where you're going to deploy it. If you have constraints around the data or constraints around your compute, maybe it has to be on premise or it has to be in a certain geographic region, or if there's regulatory constraints, I think the Nemotron family of models gives a lot of flexibility of being able to move where you need them, whether that's on prem or across different types of cloud deployments in different regions. And so those are probably the two key places where we would encourage people to think through using them. I think there often are places where we see many enterprises using multiple models.
Joey:And I think that's often the way we encourage people to think about it because generally people think, oh, I'm using OpenAI, I'm all set. And then they don't realize that there is maybe a different set of problems or different set of challenges that there could be another solution to use in addition to. And so our kind of view is we expect the use cases and opportunities to grow. We don't view this as a kind of a fixed pie. Like every day we see more and more places that models can solve for and more and more opportunities to grow.
Joey:And so we expect kind of in the end, there'll be a world where there are many different models all working together on different tasks, and enterprises can find the models that work best for them. They might even take, say, a Nemotron model and fine-tune it. They might say, hey, here's a task that it's really good at, say tool calling, but I actually have all of my own internal APIs, my own internal tools inside my company. I need it to be an expert at those. And so they can take some of the dataset we published, mix it with the dataset they can create using some of the NeMo software, and then fine-tune it.
Joey:And then this variation of Nemotron becomes their expert at tool calling across their domain of tools internally. And they still could even use that in a workflow with, say, OpenAI or Gemini. So I think we see a world where all of these models get used together to help solve business problems and outcomes.
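A small sketch of the data-mixing step Joey describes, combining a published tool-calling dataset with internal examples before fine-tuning; the file names and mixing ratio are hypothetical, and the actual fine-tuning would then run through NeMo or a similar framework.

```python
import json
import random

# Hypothetical file names: the published post-training data and your internal
# tool-calling examples, each stored as JSONL records (e.g. prompt/response pairs).
def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

public_tool_calls = load_jsonl("nemotron_tool_calling.jsonl")    # published data
internal_tool_calls = load_jsonl("internal_api_examples.jsonl")  # your own APIs

# Simple mix: keep all internal examples, subsample the public data so it does
# not drown out the company-specific tools (ratio here is arbitrary).
mixed = internal_tool_calls + random.sample(
    public_tool_calls, k=min(len(public_tool_calls), 5 * len(internal_tool_calls))
)
random.shuffle(mixed)

with open("finetune_mix.jsonl", "w") as f:
    for record in mixed:
        f.write(json.dumps(record) + "\n")
```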
Chris:I love that. I think that's great. I think that is where we're going. But I think a lot of organizations that aren't global AI leaders, organizations like NVIDIA and such, are trying to find their way into that. They've kind of gotten into using a model or maybe a couple of models, and they're working on that kind of AI maturity level of how they get their internal processes aligned with this multi-model future that we have.
Chris:So I think there's a lot of stories unfolding in that arena. One of the things I wanted to bring up real quick, not to deviate you necessarily off of Nemotron, but also, I know you guys have a new speech model called Parakeet. And I was wondering if you'd talk a little bit about that as well and kind of share what that is and where that fits in.
Joey:Yeah. Thank you. We do quite a bit of work, and there's a lot of research that comes out of NVIDIA, and it varies across model architectures, types, use cases, datasets. And on the speech side, we've been working there for quite a long time as well. And in the transcription domain, the challenges have often been: can we transcribe the audio accurately across different accents and dialects and across different languages, and can we do that very fast and efficiently?
Joey:And so in terms of what we've been publishing, we've been on that journey for many years, and there's a great leaderboard on Hugging Face, I think called Open ASR, where it's an English dataset, an English use case. And we've been working very diligently over time to keep improving the models that we publish there. And so I think you'll usually see us in the majority of the top 10 with different variations of models. And often we get to trade first place with other companies, and we're happy to see kind of the community pushing things forward, and we're going to keep working on that. But I think one of the latest breakthroughs we've had in that space that we've been excited about is, on the Parakeet side, there are some architectural improvements that have made a significant, kind of, leap forward for us, so to speak. And to talk a minute about those, I can go into a little bit of technical depth here.
Joey:On the Parakeet side, essentially it's based off of a Fast Conformer architecture, which improves on the original Conformer from Google. What we're excited about is that in terms of the model architecture, and you'll see us doing this with LLMs too, we often explore model architecture spaces in terms of what's the most compute efficient on the GPU. And so on the Parakeet side, there's changes we made to the way we do depthwise separable convolutional downsampling, essentially meaning, at the start of the input, there are clever ways to shrink that input so we can cut some of the computational cost as longer audio segments get streamed in, and we can keep the memory down. So in doing that, we're able to see roughly, in aggregate, a two to three x speed up in inference, meaning we can ingest two to three x more audio in the same amount of time and transcribe it without reducing the quality. And then there's other work we've done in there.
Joey:There's a whole bunch of clever work in there around things like changing the attention window to make that more global. And then there's work we've done around some of the frame-by-frame processing in there, so being able to chunk audio and properly chunk up that audio. I have a long list of great things we've done in there. I'll mention a few other things too.
Joey:There's been some work we've done in terms of the decoder part of the model architecture. There's a set of software we call CUDA Graphs, where we're able to take smaller kernels and more efficiently schedule those on the GPU. That gives us another roughly 3x boost in speed as well. And so I think at the end of this, you'll notice, especially on that Open ASR leaderboard, the RTF factor there, of real-time audio, is quite high, especially compared to the alternatives up there. And that's because we spend a lot of time and have a lot of insight into how to do that on the GPU.
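To give a flavor of what CUDA graph capture buys you (recording many small kernel launches once and replaying them with minimal launch overhead), here is a generic PyTorch sketch; it is not the Parakeet decoder itself, just the standard capture-and-replay pattern, and it requires a CUDA-capable GPU.

```python
import torch

# Generic CUDA graph capture/replay pattern: many small ops are recorded once,
# then replayed later with almost no per-kernel launch overhead.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256)
).cuda().eval()

static_input = torch.randn(32, 256, device="cuda")

# Warm up on a side stream so capture sees steady-state memory allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        with torch.no_grad():
            _ = model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

# Later: copy new data into the captured input tensor and replay the graph.
static_input.copy_(torch.randn(32, 256, device="cuda"))
graph.replay()
print(static_output.shape)
```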
Joey:We try and do that in such a way that we can open it and publish it so ideally other companies and partners can adopt some of those technologies and pull them into the models that they build and release as well too.
Chris:Fascinating. Well, I appreciate that. Thanks for kind of laying that out. As we are starting to wind things up, I know that we have already delved a little bit into kind of the future and where things are going and stuff. But I'm wondering, you know, from your chair, as you're sitting there driving these efforts forward at NVIDIA and you're looking ahead. I mean, this is probably the most fascinating time in history, in my view, when you think about it. I mean, there's all sorts of things going on in the world.
Chris:But in the technology space, the development of AI and related technologies here is just going faster and faster, broader and broader. And as you are thinking about the future, I often say, kind of like when you're going to bed or you're taking a shower at the end of the day and you're kind of relaxing from all the things you've been doing, where does your mind go on this? What are the possibilities that you're excited about over the next few years? And what do you think might be possible that isn't today? If you just kind of share as a final thought kind of your aspirations in this space, I'd really appreciate it.
Joey:Yeah. And I think I'll probably go up a little bit in abstraction level and then tie it back here. I think going forward, what we're really excited about is the idea of having a digital set of employees or a digital workforce to help the current workforce. And so we view going forward the idea that we continue to have people doing great work at great companies, and then augmenting and improving that work with digital employees. And so in kind of that future view of the world, where, say, we interact with these digital employees either for simple things like retrieving information from complex systems across a company, or just doing simple data analytics, to maybe more complex things of being able to do, say, forecasting or helping predict things coming up in the future, I think there'll be a whole massive space around having these digital employees solve more complex tasks and being able to either hire them or rent them across companies.
Joey:You can imagine there are certain industries where people are experts in their domain. They might rent out digital employees to other companies who are building products with them as a dependency or with them as a partner. And so in kind of that future world of having all of these digital employees or agents working together, we view backing into things like Nemotron, the idea of being able to improve the capabilities across single models, across many models, and across the ecosystem. All of that in the end helps us be able to get these more accurate and more productive digital employees. And there's a whole set of software that goes around not just the model, but having multiple models work together. There's a whole other set of challenges of, as you have these digital employees that are based on these models, how do you keep them up to date?
Joey:How do you ensure they stay current? That they know the latest information about your business, if your supply chain changes or if your inventory changes. And so there's opportunities there we're looking at around data flywheels, where we have a set of software we published a month back called NeMo Microservices, to help people take these digital employees and keep them current and recent on interactions, enterprise knowledge, and data changes over time. But I think going forward, we're really excited for that space, because often there's a lot of difficult or mundane types of challenges and tasks today that prevent us from getting to the things we're more excited about or where we add more value. And I think we all can kind of relate to that in our day to day.
Joey:So I think going forward, I expect that these digital agents or employees will be able to help us significantly get past a lot of the mundane, repetitive things that we end up having to do because systems are hard or technology is hard or things haven't been built as well as they could be, and then focus more on the more exciting places where we can move efforts forward, move businesses forward, and contribute much more to the community and the economy.
Chris:That's an amazing vision you have there. I love that. Thank you for sharing that. You've given me yet again some more things to be thinking about as we finish up here. So I just wanted to thank you very much, Joey, for coming on to the show, sharing your insight and telling us about the new models that you've got here.
Chris:And I hope that you will come back when you have the next things that you might wanna share, and share them with our audience. Thank you very much.
Joey:Yeah. Sounds good. Thanks for having me.
Jerod:All right. That's our show for this week. If you haven't checked out our website, head to practicalai.fm, and be sure to connect with us on LinkedIn, X, or Bluesky. You'll see us posting insights related to the latest AI developments, and we would love for you to join the conversation. Thanks to our partner Prediction Guard for providing operational support for the show.
Jerod:Check them out at predictionguard.com. Also, thanks to Breakmaster Cylinder for the Beats, and to you for listening. That's all for now, but you'll hear from us again next week.