While loops with tool calls

Jerod:

Welcome to the Practical AI podcast, where we break down the real world applications of artificial intelligence and how it's shaping the way we live, work, and create. Our goal is to help make AI technology practical, productive, and accessible to everyone. Whether you're a developer, business leader, or just curious about the tech behind the buzz, you're in the right place. Be sure to connect with us on LinkedIn, X, or Blue Sky to stay up to date with episode drops, behind the scenes content, and AI insights. You can learn more at practicalai.fm.

Jerod:

Now, onto the show.

Daniel:

Welcome to another episode of the Practical AI podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, and I'm joined as always by my cohost, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?

Chris:

I'm doing just dandy today, Daniel. Getting ready as we record this. We're approaching Halloween, so I'm getting in the spirit.

Daniel:

All sorts of sweet treats.

Chris:

Thank goodness the listeners can't see me. But yeah, I'm busy growing out my costume in this week beforehand, so I don't look so good. So, yeah, glad this is audio only. Thank goodness.

Daniel:

Well, you're in between. Yeah. You're getting ready for the festive season. That's okay.

Chris:

That's right.

Daniel:

Not quite to No Shave November, I guess, but you could stretch it out through November and

Chris:

I'm starting to become the monster under the bed maybe, you know?

Daniel:

There you go. Speaking of sweet treats, our guest today has me thinking about cake because, a little while ago, we had Jared Zoneraich, cofounder and CEO at PromptLayer, join us. And, of course, I'm always reminded of their nice cake logo, which is making me hungry right now. I'm wanting some dinner. But welcome back to the show, Jared.

Daniel:

It's great to have you.

Jared:

Thank you. Thank you. I'm excited to be back. And a fun note for you. I'll tell you, we're designing our booths to go to conferences now, and I think we're gonna bring real cakes to the booths.

Daniel:

That's awesome. Oh, nice. You should. Definitely. Nice.

Daniel:

You should.

Jared:

Into it.

Daniel:

Yeah. Yeah. I see at the airport occasionally, they have those little vending machines where you can get cake out of the vending machine, and it's in a little plastic thing you pull open and eat while you're at the airport. So, yeah, I think that's a great idea. But I'm just looking.

Daniel:

We last talked to you in March 2024, episode 261. Folks can go ahead and loop back and hear from Jared when we had that initial discussion, which was great. At the time, Jared, I think we were still in the days of everyone talking about prompt engineers, and prompting going crazy. Certainly people are still talking about prompting, but maybe the focus has shifted in some ways. As you've been at the center of some of the discussions around prompting and how people are engaging with AI, from your perspective, what has the year been like?

Daniel:

How have things changed in people's perceptions of prompting AI systems, or maybe even your own thoughts around prompting AI systems? And, of course, we can get into agent stuff later, but, yeah, any thoughts?

Jared:

Well, we're still called PromptLayer, so we haven't changed that yet. I don't think we will. But, yeah. So March 2024, that's almost a year and a half ago. It would be hard to even list all the big AI things that have happened since then.

Jared:

I mean, the one that immediately comes to mind, though, is reasoning models. So reasoning models, I think OpenAI released o1, and I think I saw a tweet that it was about a year ago last week or something like that. And that was, I think, the first next generation of prompting. Meaning before, if you recall our last talk, we talked about chain of thought.

Jared:

And now, I guess, the reasoning model does the chain of thought for you, and the models have just been getting better. But at the core, the core way to think about LLM applications, as an input to an output, meaning the prompt and the model and then the output, that's all the same. I think it's gotten much easier. The models have gotten much better. They're easier to steer.

Jared:

And some of the weirdness of how you persuade the model and yell at the model has gone away, and it's gotten a little bit more straightforward. But I think, yeah, I think prompting is still at the core of everything. And I will have to say the new word people love to say is context engineering, of course. And, to me, they're the same, but I think the main reason people like this word, context engineering, is it's not just the prompt. It's not just the text.

Jared:

It's how much are you putting in the text. Are you putting too much? Is the model getting distracted? Are you using RAG? Are you not using RAG?

Jared:

Are you throwing in a blob? Are you using multiple models? And I think my high-level, one-sentence summary of what's changed would be: we have way more tools at our disposal now, and we do way more mixing and matching.

Daniel:

Interesting. Yeah. And I guess, how would you distinguish, or how has your thinking shifted around context engineering, now that that term is coming up? What do you think is just jargon changing, and what do you think is a more substantive part of what that means, and maybe why people are shifting language, if there is a reason?

Jared:

Right. I think the key around context engineering and the reason, well, we're a little financially invested in prompt engineering because we made hats that say prompt engineer on them. Maybe

Daniel:

You gotta

Jared:

that's where my bias comes from. Yeah. But maybe we'll make both hats. But I think context engineering really gets at this: well, one, context is much longer. So you can send so much more to a model than a year and a half ago.

Jared:

It's almost unlimited from a user perspective. And the question is, how do you send it? How do you squeeze the juice out of the model? Models, I like to think, get distracted like us humans do. So how do you not lead it down the wrong path?

Jared:

And the core difference, though, in how we build stuff is, I think, because models have gotten better and because we can fit more in the context, we've moved, and we've seen this with our customers, from complex DAGs or complex workflows, where you say this prompt goes to this prompt and this prompt goes to that prompt, to something a little more autonomous, a little more agentic, and a little more of just a while loop with tool calls. Everything's a while loop with tool calls now. But that's a whole different topic. Let me leave it at that.

Chris:

I'm curious, just to kind of pick up on that point, though. I'm guessing a lot of our listeners can really feel an association with that evolution you're just talking about. I know that in my own job, it's not a primary responsibility, but as an ancillary one, I often get asked to talk to groups in our company about how to prompt. And some of the internal videos I put out a year and a half ago are completely obsolete. So I'll come back to a group and I'll be like, that whole thing I told you before, don't do it that way anymore. It's gotten a little bit more capable, a little bit more sophisticated, drop the structured framework a bit.

Chris:

I'm curious, you know, so I'm kind of backing out of what I told them a year and a half ago at this point. And I'm sure a year and a half from now, I'll be doing the same thing with what I'm doing today. So how do you address, when you're dealing with end users, how to evolve with the capabilities that the models and the infrastructure are providing these days? Because it's really hard to track that fast, quite honestly.

Jared:

It takes me many hours a day of scrolling on Twitter to keep in touch with what's going on. I would say, so I guess for context, for everybody: we're a platform for building and testing LLM applications. And people are using us to version their prompts, version their agents, evaluate them, log them, and generally just build an AI system that's working. So we get asked this question all the time: how do we track it? What should we do today?

Jared:

And we have the same problem. We have YouTube videos or interviews where maybe I rewatch one and I'm like, oh, well, that's not true anymore. Now you could just use this.

Chris:

Thank goodness it's not just me.

Jared:

Yeah. I mean, it's everyone. It's OpenAI too. It's Anthropic too. It's Gemini. I think the answer is to lock into the core truths of building LLM systems.

Jared:

And how we look at it is, it's a philosophy thing. I guess there's two competing ideas here, and I'm curious also to hear your opinions on this. There's the academic idea of how do you understand LLMs, how do you understand the context and how it works. And then there's the tinkerer, the builder philosophy, which we push people towards, which is: it's a black box. I don't need to understand it.

Jared:

I just need my input to match to what I want in the output. And I usually give a really annoying answer to teams who use us, because we get on a lot of calls with our customers. We're not a consulting product. We're software, but we love to help our customers, and we like to share knowledge. So we get on a lot of these calls, and almost always my answer to a specific question of, should I use GPT-5 or should I use Sonnet 4.5 or should I use Gemini, is that I don't know the answer.

Jared:

I don't think OpenAI knows the answer. I don't think Anthropic knows the answer. The only person who can know the answer is you for your own use case by building tests and checking and seeing what works. Basically, it's a black box. You just have to try it out and see if it works.

Jared:

And that's the only repeatable motion here. I know it's an annoying answer though.

Chris:

No, it's good. And I think it's an answer that flows with the times, because anything any of us say today is obsolete tomorrow, with the way this thing accelerates. So, yeah, it's not as annoying as you might think.

Jared:

It's a framework for figuring out the answer instead of relying on us three. And it's, how do you think about testing? Because testing these things is really hard, and figuring out if your output is good and if the prompt works is really hard. So how do you build heuristics, how do you build evals, or how do you even skip the eval? Just how do you approach this is the interesting thing, at least to me.

Daniel:

Yeah. And I think I agree with you, Jared, in terms of the spectrum that you were talking about. The reality is a lot of people don't need to know what, say, self-attention means to build cool AI automations. On the flip side of that, I guess you need enough of a mental model to understand the coverage of, to your point, I think this is partially why the testing is so hard, because sometimes it may not be clear to you: oh, if I, for example, change the formatting of my prompt without changing any of the words, why would that change anything? And these things around prompt sensitivity and other things, right?

Daniel:

So some of that can be built up through tinkering, but maybe it also comes with this kind of mental model, maybe not so much of what the models are, but of how they operate in generating output. I find that's often a trigger for people to know, like, oh, well, now it makes sense why things could go off the rails with high temperature or other things like that. So yeah, I don't know if you have any thoughts on that, but that's kind of my thought: there's a different kind of mental model or intuition that you need, less so the academic kind and more so how things operate.

Jared:

Yeah. 100% agree. I think you could have too much academic knowledge of how LLMs work, and that actually might hurt you here, because you try to understand what you're doing, whereas a lot of these things are pretty hard to understand, and maybe people don't need to; it's in the neural net somewhere. But I think what you're referring to, I like to refer to it as an LLM idiom.

Jared:

So how do you understand the language of LLMs? I think the example you gave was good about formatting. Like, JSON formatting has been a big topic people talk about: can you just give JSON to the model instead of a prompt, and how will that work? And I think it's a really illustrative example, because asking the model to return JSON or a structured format will work really well if you wanna return precise numbers or precise values.

Jared:

But if you wanna write a love note, it's probably not gonna work as well, because the model is now in "I'm coding" gear. I always compare it, kind of, to the human. I don't think AI is human, but I think there are some things we can use as metaphors. So it's like if you go into an engineering school and you just give someone a paper and say, write a love note, they're thinking about code. They're not in the brain space of poetry. Maybe they are good at it, maybe they can do it, but you kind of have to step back and say, okay.

Jared:

This is what we're thinking about, in the same way as if you're asking the model to output JSON and output a key and curly braces. It's probably not gonna write as good creative language in one of those values. And I can't write an academic proof for you on why this is true, but it's like an idiom. It's an intuition. It's a way to talk to the LLM without using so many words, I think, is where I'm kind of settling.

Chris:

And you may have just identified why my last anniversary with my wife just went off the rails. You know, the JSON output for the love note. I thought I had it, but, you know, clearly not.

Jared:

Yeah. You gotta be careful with it. But it's really an important thing. I mean, when we're talking as humans, we have idioms where we can say one word, and you're gonna think of a whole different thing. If I mention Beethoven or if I mention Kanye West, you're not thinking of just the person. You're thinking of everything surrounding them.

Jared:

And I think it's the same thing here. You're putting it in a probability space, and if you wanna be in this part of the probability space, it's gonna be a little bit more challenging.

Daniel:

So, Jared, I do wanna dig into something that you said a little bit earlier, which is there has been this progression maybe from static workflows or prompt chaining, where you have a DAG, which, for people who aren't familiar, is sort of a directed graph of the logic that the chain of calls to the AI system is going through. You mentioned that shifting more to a while loop with tool calls. So could you break that down? Maybe first, for those that aren't familiar, what you mean by tool calling, and then what it means to have a while loop of tool calls.

Jared:

Yeah. So maybe the best way to start is to paint the evolution here. You just mentioned DAGs, these basically graphs that are just a bunch of nodes saying this node goes into this node, this node goes into this node. The reason we started with that is because models were a little unpredictable because of hallucinations. If you were at the beginning of this LLM craze, let's say two years ago, three years ago, maybe three years ago is too much, maybe it didn't exist.

Jared:

No. Yeah. Something like three years ago was ChatGPT. If you're United Airlines and you're making a customer support chatbot, you don't want to accidentally give people free flights. And to avoid doing that, the best way would be to build these structured paths that the LLM can go down.

Jared:

So the first question would decide what the user was asking. And if the user was asking for a refund, it would go to the refund prompt. And this is kind of how, let's call them prompt engineers, context engineers, agent engineers, whatever you wanna call it, stopped LLMs from going off the rails. Now what's changed is two big, let's call them, innovations. One, models are better at following instructions, and hallucinations really don't happen that much anymore.

Jared:

And the second thing is models have gotten much better at structured outputs. So before, it was kind of hacky to get the model to return in a way that code can process. Now tool calling, which I'll explain in a second, is baked into all the main models. So what tool calling is, is basically you're giving the instructions for the prompt, but you're also telling the prompt it has access to a few different functions. So in the United Airlines example, maybe one of the functions is issue refund.

Jared:

Maybe there's another function, check user status. When you use chatgpt.com, it has access to search the web. When you're generating an image, it probably has a generate-image type tool. So as we've built more tool calls, a lot of the models have been built around tool calls and have gotten really good at interacting with them, interpreting their response, sending another message to them. And that's why you see so many more autonomous agents.
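
To make the tool-calling idea concrete, here is a minimal sketch using OpenAI-style function tools for the United Airlines example from the conversation. The function names issue_refund and check_user_status are just the hypothetical examples mentioned above, not a real airline API, and the model name is only a placeholder.

```python
# Minimal sketch of tool calling with OpenAI-style function definitions.
# The tools (issue_refund, check_user_status) are hypothetical examples from
# the conversation, not a real API; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "issue_refund",
            "description": "Issue a refund for a specific booking.",
            "parameters": {
                "type": "object",
                "properties": {
                    "booking_id": {"type": "string"},
                    "reason": {"type": "string"},
                },
                "required": ["booking_id", "reason"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "check_user_status",
            "description": "Look up a customer's loyalty status.",
            "parameters": {
                "type": "object",
                "properties": {"user_id": {"type": "string"}},
                "required": ["user_id"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I want a refund for flight UA123."}],
    tools=tools,
)

# The model either answers in plain text or asks to call one of the tools;
# your code runs the tool and sends the result back in a follow-up message.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```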

Jared:

Like if you look at Claude Code or you look at Codex. The reason coding agents are actually good now is because of this paradigm and because they've actually simplified everything and said, instead of this complex DAG where we think through every single step, we're gonna give the model a little bit more free rein, let the model run things, see if it works, and actually fix it, because models turned out to be really good at fixing their own mistakes. And what that has unlocked is a lot more flexibility for the model. So now if you use Claude Code, or Codex, or any of these coding agents in the command line, they only have access to the terminal. We wrote a few blog posts on how it works under the hood. But the simple way to explain it is it's one loop that says: continue until the AI is done, and then ask the user for input.

Jared:

So I'll say, make my application work. And then it'll start the loop, and it has access to just write in the terminal like a human. So it'll write something, it'll get the output, and then it'll decide: do I wait for a user response, or do I wanna run another tool call?

Jared:

And this simple loop is much easier to develop, much easier to debug, and kind of just the way everybody's gone. Of course, it has disadvantages. But that's kind of the way I see it, in terms of where we've gone.
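
In code, the loop Jared is describing, "continue until the AI is done, then ask the user for input," looks roughly like the sketch below. It is a sketch of the pattern, not any particular product's implementation; call_llm and run_tool are stand-ins for whatever model client and tool executor you use.

```python
# Sketch of the "while loop with tool calls" agent pattern.
# call_llm and run_tool are passed in as stand-ins for your model client and
# tool executor; this shows the shape of the loop, not a specific library's API.

def agent_loop(call_llm, run_tool, messages, tools):
    while True:
        reply = call_llm(messages, tools)       # one model call per iteration
        messages.append(reply)

        if reply.get("tool_calls"):
            # The model wants to act: run each tool, feed the result back,
            # then loop again so the model can see what happened.
            for call in reply["tool_calls"]:
                result = run_tool(call["name"], call["arguments"])
                messages.append({"role": "tool", "name": call["name"],
                                 "content": str(result)})
            continue

        # No tool calls: the model considers itself done, so ask the user.
        print(reply.get("content", ""))
        user_input = input("> ")
        if user_input.strip().lower() in {"exit", "quit"}:
            return messages
        messages.append({"role": "user", "content": user_input})
```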

Chris:

I'm curious, as we talk through this and as we've transitioned into this agentic world, especially since we talked to you last time. Over the last year and a half, it's really come on strong versus where our conversation was back then, when that was a very new thing. What kind of responsibilities does this whole new set of capabilities bring to the user? We've talked a little bit already in this conversation about the evolution of prompting and context and the rapidity of that. But as we move into this agentic world, any thoughts around what the user's new responsibilities are to be effective in it?

Jared:

Yeah, the user as in the builder of an AI application or

Chris:

the user in this case is the human who's listening to our podcast right now and is going to turn to their system at the end of the show and go, I'm gonna go try that.

Jared:

Totally. Totally. So a little of both, maybe. I think if you're just a user on chatgpt.com or you're using an AI application that's available, Codex, Cursor, whatever, it's capable of doing much more, working for a much longer time, and staying on track more. And AI can now do a better job of figuring it out, let's call it.

Jared:

So if you want to do a general task or do an exploration, because of this new concept that people are using to build AI applications, these AI agents are able to try something, and if it doesn't work, try something else, and do the exploration that humans would do. And we're really just trying to make them act more like a human. And the DAG way, the old way that we're talking about, where you have a bunch of nodes and you have a structured path, that's not how humans work.

Jared:

When you give an intern a task, you're not giving them the exact flowchart of how to solve it. You're telling them generally what tools they have at their disposal, and they're gonna keep working and using the tools until they figure it out. Now for the builder behind these applications, it makes it a little bit harder to test, it makes it a little bit harder to keep things on the rails, but it makes it much quicker to build something that succeeds. We built an agent using Claude Code that updates our docs every day.

Jared:

Basically, it looks at all the code our team has written over the last twenty-four hours, decides if it should be in the docs, and then updates them. Took me two hours to build, because all I said was: download these repos, read the commits, and then check our docs and see if anything should change. And then it gets to figure it out. Now, is this gonna be good for a production system? Maybe it needs a little bit more work, but for something simple, it opens a lot of use cases.
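
As a rough idea of what that kind of headless, scheduled use can look like, the sketch below shells out to the Claude Code CLI in non-interactive print mode (the -p flag), which is an assumption about how you'd wire it up; the workspace path and prompt wording are placeholders, not PromptLayer's actual setup.

```python
# Rough sketch of a scheduled "docs updater" run as a headless coding agent.
# Assumes the Claude Code CLI's non-interactive print mode (claude -p);
# the workspace path and prompt wording are placeholders for illustration.
import subprocess

PROMPT = (
    "Pull these repos, read the commits from the last 24 hours, "
    "then check the docs repo and update any pages that should change."
)

result = subprocess.run(
    ["claude", "-p", PROMPT],
    cwd="/path/to/workspace",   # directory containing the repos (placeholder)
    capture_output=True,
    text=True,
)
print(result.stdout)
```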

Daniel:

And what would you say, I guess? As you mentioned, you're focused on a platform for building, testing, versioning, and improving AI workflows or agents. With these kind of while-loop tool-calling things, if you have a variety of tools that could be called under the hood, part of the complexity of the system is in the tools that you can call. How does that influence the way that you should version or test these systems, or does it even? Whereas before, it felt like there was a single function that I'm calling into with a prompt, right?

Daniel:

And that function produces output which may or may not be useful. Here, you have this recursive function that feeds into itself and could call any number of things. And so, when you look back, if you're just looking at the input and output, I guess is what I'm saying, then any number of things could have gone wrong in that recursive loop, or in that while loop that you're talking about. So how does that influence the prompts that you would version, how you iterate on those, and how you test and improve the systems, from your perspective?

Jared:

Totally. It makes it interesting. So at the end of the day, the tool calls that are being run, the functions, are still those input-output things that can be unit tested, that can be really thoroughly and rigorously tested. The hard part is this while loop. So I think the core master prompt, the one that runs the loop, that could be tested in a lot of ways.

Jared:

You can run sanity checks. You can test against old data. You could see how things changed. But what you're getting at is this really interesting problem that's kind of been created, which is: how do you test a flexible agent when flexibility is one of the keys to what makes it good? The heuristic I've developed is something I want to call "agent smell."

Jared:

Maybe I'll write a blog post about it. So if you run an agent, what sanity checks can you look at to see if it smells a little funky? Is it raising any red flags? I'll give you some examples. So say I was building an agent to fix errors in my application.

Jared:

So, like, if I had a database error and I wanted to build an agent that would go and fix the code automatically, what I would test for is how many tool calls there are. First, I just wanna surface these statistics. How many tool calls is it running? How many times is it retrying the tool calls? How long does the agent take?

Jared:

And these are kind of surface-level things. They're not the end-all-be-all, but you first want to start simple. So you first want some sort of smell test where you can say, hey, this new version is behaving very differently than my old version. Maybe better, but maybe worse. And then that's when you go in and break it down, like a state-by-state test.

Jared:

So the most useful tests we see our users doing are individual states or full conversation simulation. Individual states would basically be saying: here's the conversation history, now run it, and what's the next step? And we're just checking if the next step is the same. Of course, that's only one part of the picture.

Jared:

The other part of the picture is: here are the initial instructions, now simulate the whole conversation and see if the final output's correct. And then combine that with the smell test. It doesn't give you a full picture, but I think the core learning we've had, at least over the past year and a half, is that you don't need 100% coverage when you're evaluating these things. If anything, if you're trying to make a perfect test for your agents, you're probably never gonna ship. You're just never gonna do it.

Jared:

The better thing is to make it good and have heuristics of figuring out when it's regressed before it does.
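
A sketch of what those "agent smell" statistics and a single-state check might look like. The trace format (a list of step dicts), the thresholds, and the agent interface are all made up for illustration, not a prescribed eval setup.

```python
# Hedged sketch: surface-level "agent smell" stats plus a single-state check.
# The trace format, thresholds, and agent interface are illustrative only.

def smell_stats(trace):
    """Surface statistics for one agent run (trace = list of step dicts)."""
    tool_calls = [s for s in trace if s["type"] == "tool_call"]
    return {
        "num_tool_calls": len(tool_calls),
        "num_retries": sum(1 for s in tool_calls if s.get("retry")),
        "duration_s": trace[-1]["t"] - trace[0]["t"] if trace else 0.0,
    }

def smells_funky(new, baseline, tolerance=2.0):
    """Flag a new version whose behavior diverges sharply from the baseline."""
    return any(
        baseline[k] and new[k] > tolerance * baseline[k]
        for k in ("num_tool_calls", "num_retries", "duration_s")
    )

def next_step_matches(agent, history, expected_tool):
    """Single-state test: given a fixed conversation history, does the agent
    choose the same next tool? (agent.next_step is a hypothetical interface.)"""
    step = agent.next_step(history)
    return step["tool_name"] == expected_tool
```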

Daniel:

So Jared, we talked through a little bit of this agentic stuff, and I know you set up some of the conversation around how you, as a company, are enabling non-technical users to work on prompts, embed their domain knowledge, and have that non-technical connection to the prompts under the hood, which are maybe embedded in various systems or tools. I'm wondering about your perspective on, obviously, one of the things that's been talked about a lot in recent months: this 95% of AI pilots failing, the report from MIT, which we've talked about on the show and gave some thoughts on. I'm wondering how you think that intersects with the tools that people are using to manage their prompting and their AI systems, maybe the rigor that needs to be there that's not, or maybe there's another side of this where some of those engineering principles need to be brought into the picture. What is your thought, from working with a lot of non-technical users on a platform like this?

Daniel:

What have you learned over time to be those key pieces of making sure that people coming to a problem don't end up just wasting a lot of their time working on something that doesn't actually get results?

Jared:

Right. So maybe 95% of AI pilots fail, but they're not using PromptLayer. That's why.

Daniel:

Yeah. Exactly. That was the setup for the question. Yeah.

Jared:

Zero percent of AI pilots on PromptLayer fail.

Daniel:

Well, yeah. Exactly. Same with Prediction Guard. Yeah. Exactly.

Daniel:

Exactly. Exactly. Yeah. They're not using

Jared:

the right tools. I would say, so as a platform, how we look at it at PromptLayer is: we have a large diversity of teams, ranging from super technical and all engineers, to basically no engineers, and everything in the middle. And what we are trying to do is build a rigorous process for building these applications and expose it to the people who know if the outputs are correct or not. So what I mean by this is rigor in terms of versioning and knowing which versions are working.

Jared:

Rigor in terms of being able to test, like we were just talking about, testing agents, testing prompts, and also rigor in terms of logging and seeing what's happening in production, what's going on. Because you can only test so much in development in AI, and you kind of need to expand the surface area of how people use your product. The reason we focus a lot on getting these domain experts involved in the process is because we believe, actually, from a business perspective, that's how you win as an AI company. If you're building legal AI, you win by having a lawyer involved. The example I always give, and I love going back to it: I come from a family of psychologists.

Jared:

It skipped me. I'm an engineer, but I have some familiarity with it. And if you wanna go to

Daniel:

But now you're just psychoanalyzing the language that's going into models. So I guess your family can be proud.

Jared:

I hope so. I hope so. I'm working on it. But, if you wanna see a shrink, there's like six on the block. Right?

Jared:

I live in New York City. There's a lot of them. And the assumption I'm making here is that there's no global maximum. There's no one correct answer to psychology. You know, you have different methods.

Jared:

You have CBT. You have Ayahuasca retreats. You have a lot of different ways to treat people. And same with medical doctors, same with education, same with a lot of things. And what's the core differentiator between the different psychologists on the block that my office is on?

Jared:

The taste and how they choose to practice their field. They all went through the same education. Maybe some have a little bit more knowledge than others, but it's how they implement their product, let's say. And in the same way, if you're building an AI therapist, how you win as a business is the non-engineering taste that's been put into the AI product, the way it's using the context that you provided, and what you've told it to do. And going back to your question of how you take that knowledge of needing the non-technical engineer, like, we think an AI engineer should be non-technical.

Jared:

But how do you bring in those engineering principles so the pilot doesn't fail? A lot of times we see engineering actually owns the product, the AI product. So we're usually talking to VPs of engineering or CTOs or something like that to get PromptLayer installed, because we're all engineers on our team. We're bringing in these principles. We think even if it's a non-technical expertise, you have to do it in an organized and systematic way.

Jared:

And that's why you see the skill of prompting, I almost think, is not quite the same Venn diagram as the skill of coding, because it's really a skill of tinkering. And not all coders are tinkerers, but not all writers are tinkerers either. So there's some new type of algorithmic thinking that overlaps very highly.

Daniel:

Yeah. It's almost like being a negotiator. Almost. Exactly. Yeah.

Chris:

It is. But also, you know, maybe you have organizations that are leaning one way or the other. You've just described that spectrum of skill and expertise that apply, and so possibly, for whatever problem your organization is trying to solve, it's about finding the right place on that spectrum to bring the right resources together. And even aside from what we're talking about here, that's an easy place for businesses to fall down anyway. So in a sense, it may not be that different from many other business problems that companies are trying to face.

Jared:

Yeah, totally. It's like, what can we learn from the non-AI world to ship these things better? To us, the big mistake people make is they try to boil the whole ocean at the beginning, and they try to do too much. And really, you wanna do the whole crawl, walk, run when you build these systems. And maybe instead of the pilot being, hey,

Jared:

We're gonna add a billion dollars of revenue with our AI product, you wanna say, alright, we're gonna make a beta version. Maybe we'll only release it internally. Maybe we'll do this. Maybe it won't do everything.

Jared:

Maybe it won't be the while loop with all the tool calls. Maybe it'll just be one tool call. And that's also true with how you test these things and how you build them. So it's not just what the product does. A lot of teams get stuck at trying to build tests because they try to build perfect tests. And like we're saying, it's hard and maybe even impossible.

Chris:

And there's a learning process, which crawl, walk, run kind of implies. It gives companies a chance to not crater too hard when they're first starting out, as they're trying to get something done. Keep it small enough in scope so that they can actually achieve something small but positive, learn from that, and kind of build up toward what their true aspirations might be.

Jared:

Exactly. I mean, we're all learning.

Daniel:

Yeah. And I don't know, when you said crawl, walk, run, Chris, it made me think some of the problem might be that people don't understand what is crawling, what is walking, what is running. It all just looks like AI tasks. It's a big soup of things that you can sort of, quote, do with AI, and it's unclear how do I pick out the crawling task, because I don't know which of them it is.

Daniel:

I see that kind of paralysis a little bit. I don't know if you see that as well, Jared, or have any suggestions for how people can think about picking it apart. Because you mentioned those domain experts who are coming into PromptLayer, and you're connecting those people with the business knowledge into PromptLayer. How is it that they come upon the knowledge to know what is a crawl task, or a feasible task to start with and play around with? Any thoughts?

Jared:

Yeah. It's a good point. I think the most successful AI teams I've seen work in collaboration between engineering and domain experts. So if you're just domain experts or you're just engineers, you can succeed, and I've seen it succeed. But the most common way, and what we recommend, is it should be a joint effort.

Jared:

The engineers often know how to ship a product and how to do agile or iterative design, and the non-technical folks understand what makes it good. And you need both of these, I think. As for how to break it down into a crawl task, you almost have to just step back and ask what your heuristic is. Let's talk about testing: what's the crawl task in testing of our AI application? The hard examples are something like summaries, where there is no ground truth.

Jared:

So what is the crawl task of evaluating an AI notetaker? Well, you kind of have to step back and say, as a human, what is a good summary? Alright, what's the simplest thing? Maybe the simplest thing is just asking, is the summary in English?

Jared:

Maybe doing another LLM-as-judge check where we say, does it use markdown? And then maybe another one that says, is it less than a page? And these are all obviously not end-all-be-all tests, but it's the crawl. And then once that's working, you can check for hallucinations. And then maybe you can check for style.

Jared:

But it's very use case dependent. So it's hard to give a one size fits all there.
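
As one concrete reading of that crawl level, here is a sketch of the simple checks for an AI notetaker's summaries: two plain heuristics plus one LLM-as-judge call. The thresholds, judge prompt, and model name are illustrative assumptions, not a recommended eval suite.

```python
# Sketch of "crawl"-level checks for an AI notetaker's summaries.
# Thresholds, judge prompt, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def is_roughly_english(summary: str) -> bool:
    # Crude heuristic: mostly ASCII characters; swap in a real language detector later.
    return sum(c.isascii() for c in summary) / max(len(summary), 1) > 0.9

def is_under_a_page(summary: str, max_words: int = 500) -> bool:
    return len(summary.split()) <= max_words

def uses_markdown(summary: str) -> bool:
    # Simple LLM-as-judge check; the prompt wording is just an example.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Answer YES or NO only: is the following summary "
                       f"formatted with markdown?\n\n{summary}",
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def crawl_checks(summary: str) -> dict:
    return {
        "english": is_roughly_english(summary),
        "under_a_page": is_under_a_page(summary),
        "uses_markdown": uses_markdown(summary),
    }
```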

Chris:

So I want to turn things, as we're starting to get closer toward the end, and ask a kind of fun question. I know you like Claude Code. So I wanted to ask about some of the things that you're playing around with, what you're doing, and what's got you excited about it.

Jared:

Yeah. And I will say, I like Claude Code, I also like Codex. I also like Cursor. I also like Amp. Amp is doing really cool stuff with the free coding agent too.

Jared:

So I switch between all of them. I think I give Claude Code a lot of credit for being the first really useful coding agent that I've used. What am I doing? So, we redid our whole engineering philosophy around these coding agents at PromptLayer. Basically, the hard part about building a platform, as you guys likely know, is all these little things and the death by a thousand cuts.

Jared:

So we have to work on big features, but also the UX: this button here doesn't have a loading state, this isn't draggable there. You could fail if you don't fix those, and that list piles up. So now we have a new rule in our company: if it takes less than two hours to do using Claude Code or Codex or something, just do it. Don't ask anyone. Don't prioritize it.

Jared:

And it's helped a lot. Honestly, our customers have literally told us, like, wow, you guys are shipping so much faster now. So I think everyone who says, oh, it actually makes you a slower coder, is just full of it. It's so good. And I use it for non-technical things too, like if I wanna go through a CSV.

Jared:

I'll tell you one thing I did recently that is pretty interesting and non-technical. I went to an event, and I won't say which one, because then people will be like, oh, that's why I'm getting spammed. But I went to an event, and everyone there was pretty interesting. I basically copied and pasted the list of attendees. Actually, I gave the HTML to Claude Code.

Jared:

I said, make it into a CSV of all the people there who clicked going and whatever social media they have. And then I put it into PromptLayer, and I actually added new columns. So I did a batch of, like, find where they work, find their whatever. And then I went back to Claude Code and did some data processing, asking, like, who should I contact?

Jared:

And just doing random tasks like that and batch prompting. I combine PromptLayer with it, of course, but random things like sending emails, creating random UIs to understand a company. I use it for everything. I'm constantly on it. I constantly vibe code these days, and we sat down with everybody on our team and Claude coded with them just so they can see how good it is, because a lot of people are skeptical, and it's great.

Daniel:

That's great, Jared. I love the tie-in to even this sort of combination of things. It's like when Anthropic released Claude Code, I don't know that they imagined all of these trickle-down effects, the ways people are combining it with other things, even the non-technical things that you mentioned. So it's really cool to see how that plays out. As you look toward the future, as we get to wrapping up here, tell us a little bit about what excites you moving into this next year. When we talk again in a year and a half or whenever it is, what are you excited about during that period of time, related to PromptLayer and to things in general, and how the ecosystem's evolving?

Jared:

I'm excited about a lot. The simplest one is what we're talking about: these coding agents as a headless tool. Using them in your workflows to run things, that's exciting. I'm excited especially about non-technical uses of these things, I think. Right now they're in a terminal, and you're gonna be using them for so much more.

Jared:

And then, Claude Code aside, I'm very excited about how the whole space is evolving to a place where you have a lot of different tools at your disposal. Some models are really good at writing. Some models are really good at coding. And the consumer just has more options than ever for building their product. I think we're not in a world where one model rules them all.

Jared:

There's a lot of ways to solve a problem. There's a lot of variability in how you build your product. And I think that's a good thing. I think the future is great. I'm excited for AI to take over.

Jared:

I'm not worried about it at all. And honestly, I know this is a little bit of a shill because this is what our company does, but I'm very excited about unlocking AI engineering for people who didn't study computer science. This has been something people have talked about for so long: how do we democratize coding and get more people coding? And maybe people aren't going to be coding anymore, but people have expertise, and now they're going to be able to build AI products around it and do AI engineering around it. Really, anybody's gonna be able to distribute their work to almost infinite levels.

Jared:

So that's what keeps me up at night in a good way.

Daniel:

That's awesome. And yeah, of course, we'll include links to PromptLayer in the show notes. But also, as Jared mentioned, they have a great blog. They've released some excellent articles, and they have great learning resources out there. So check out everything that they're doing.

Daniel:

Really appreciate the way that you all are contributing to the ecosystem, Jared. Definitely keep up the good work, and we'll look forward to talking with you again next time you're on the show.

Jared:

Amazing. Amazing. Thanks for having me. And anyone can reach out on Twitter or on email or sign up for PromptLayer, and get started for free. So, excited to see what people build.

Daniel:

Yeah. Definitely. Alright. Talk to you soon.

Jared:

Thank you.

Jerod:

All right. That's our show for this week. If you haven't checked out our website, head to practicalai.fm, and be sure to connect with us on LinkedIn, X, or Blue Sky. You'll see us posting insights related to the latest AI developments, and we would love for you to join the conversation. Thanks to our partner Prediction Guard for providing operational support for the show.

Jerod:

Check them out at predictionguard.com. Also, thanks to Breakmaster Cylinder for the beats and to you for listening. That's all for now, but you'll hear from us again next week.
