Box’s route to its IPO, ten years ago this month, was a difficult one: the company first released an S-1 in March 2014, and potential investors were aghast at the company’s mounting losses; the company took a down round and, eight months later, released an updated S-1 that created the template for money-losing SaaS businesses to explain themselves going forward:
Our business model focuses on maximizing the lifetime value of a customer relationship. We make significant investments in acquiring new customers and believe that we will be able to achieve a positive return on these investments by retaining customers and expanding the size of our deployments within our customer base over time…
We experience a range of profitability with our customers depending in large part upon what stage of the customer phase they are in. We generally incur higher sales and marketing expenses for new customers and existing customers who are still in an expanding stage…For typical customers who are renewing their Box subscriptions, our associated sales and marketing expenses are significantly less than the revenue we recognize from those customers.
This was the justification for those top-line losses; I wrote in an Update at the time:
That right there is the SaaS business model: you’re not so much selling a product as you are creating annuities with a lifetime value that far exceeds whatever you paid to acquire them. Moreover, if the model is working — and in retrospect, we know it has for that 2010 cohort — then I as an investor absolutely would want Box to spend even more on customer acquisition, which, of course, Box has done. The 2011 cohort is bigger than 2010, the 2012 cohort bigger than 2011, etc. This, though, has meant that the aggregate losses have been very large, which looks bad, but, counterintuitively, is a good thing.
Numerous SaaS businesses would include some version of this cohort chart in their S-1s, each of them a manifestation of what I’ve long considered tech’s sixth giant: Apple, Amazon, Google, Meta, Microsoft, and what I call “Silicon Valley Inc.”, the pipeline of SaaS companies that styled themselves as world-changing startups but which were, in fact, color-by-numbers business model disruptions enabled by cloud computing and a dramatically expanded venture capital ecosystem that increasingly accepted relatively low returns in exchange for massively reduced risk profiles.
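To make the cohort math concrete, here is a minimal sketch of the model Box was describing; the acquisition cost, revenue, margin, and retention figures below are illustrative assumptions of mine, not Box’s actual numbers:

```python
# A minimal sketch of the cohort economics described above; the numbers
# are illustrative assumptions, not Box's actual figures.

def cohort_cumulative_profit(cac: float, annual_revenue: float,
                             gross_margin: float, retention: float,
                             years: int) -> list[float]:
    """Cumulative per-customer profit for one acquisition cohort.

    Year 0 books the full acquisition cost; each subsequent year adds
    gross profit from the fraction of the cohort that renewed.
    """
    cumulative = -cac
    surviving = 1.0  # fraction of the cohort still subscribed
    profits = []
    for _ in range(years):
        cumulative += surviving * annual_revenue * gross_margin
        profits.append(cumulative)
        surviving *= retention
    return profits

# Assumed: $1,500 to acquire a customer paying $1,000/year at 75% gross
# margin with 90% annual retention. The cohort is underwater for two
# years, then turns profitable; layering ever-larger new cohorts on top
# is exactly why aggregate losses grew even as the model worked.
print(cohort_cumulative_profit(cac=1500, annual_revenue=1000,
                               gross_margin=0.75, retention=0.90, years=6))
# -> [-750.0, -75.0, 532.5, 1079.25, 1571.33, 2014.19] (approximately)
```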
This is not, to be clear, an Article about Box, or any one SaaS company in particular; it is, though, an exploration of how an era that opened — at least in terms of IPOs — a decade ago is both doomed in the long run and yet might have more staying power than you expect.
Digital Advertising Differences
John Wanamaker, a department store founder and advertising pioneer, famously said, “Half the money I spend on advertising is wasted; the trouble is I don’t know which half.” That, though, was the late 19th century; the last two decades have seen the rise of digital advertising, the defining characteristic of which is knowledge about who is being targeted, and whether or not they converted. The specifics of how this works have shifted over time, particularly with the crackdown on cookies and Apple’s App Tracking Transparency initiative, which made digital advertising less deterministic and more probabilistic; the probabilities at play, though, are a lot closer to 100% than they are to a coin flip.
What is interesting is that this advertising approach hasn’t always worked for everything, most notably some of the most advertising-centric businesses in the world. Back in 2016 Procter & Gamble announced they were scaling back targeted Facebook ads; from the Wall Street Journal:
Procter & Gamble Co., the biggest advertising spender in the world, will move away from ads on Facebook that target specific consumers, concluding that the practice has limited effectiveness. Facebook Inc. has spent years developing its ability to zero in on consumers based on demographics, shopping habits and life milestones. P&G, the maker of myriad household goods including Tide and Pampers, initially jumped at the opportunity to market directly to subsets of shoppers, from teenage shavers to first-time homeowners.
Marc Pritchard, P&G’s chief marketing officer, said the company has realized it took the strategy too far. “We targeted too much, and we went too narrow,” he said in an interview, “and now we’re looking at: What is the best way to get the most reach but also the right precision?”…On a broader scale, P&G’s shift highlights the limits of such targeting for big brands, one of the cornerstones of Facebook’s ad business. The social network is able to command higher prices for its targeted marketing; the narrower the targeting the more expensive the ad.
P&G is a consumer packaged goods (CPG) company, and what mattered most for CPG companies was shelf space. Consumers would become aware of a brand through advertising and be motivated to buy through things like coupons, and the payoff came when they were in the store and chose one of the CPG brands off the shelf; of course CPG companies paid for that shelf space, particularly coveted end-caps that made it more likely consumers saw the brands they were familiar with through advertising. There were returns to scale as well: manufacturing is a big one; the more advertising you bought, the less you paid per ad; more importantly, the more shelf space you had, the more room you had to expand your product lines and crowd out competitors.
The advertising component specifically was usually outsourced to ad agencies, for reasons I explained in a 2017 Article:
Few advertisers actually buy ads, at least not directly. Way back in 1841, Volney B. Palmer, the first ad agency, was opened in Philadelphia. In place of having to take out ads with multiple newspapers, an advertiser could deal directly with the ad agency, vastly simplifying the process of taking out ads. The ad agency, meanwhile, could leverage its relationships with all of those newspapers by serving multiple clients.
It’s a classic example of how being in the middle can be a really great business opportunity, and the utility of ad agencies only increased as more advertising formats like radio and TV became available. Particularly in the case of TV, advertisers not only needed to place ads, but also needed a lot more help in making ads; ad agencies invested in ad-making expertise because they could scale said expertise across multiple clients.
At the same time, the advertisers were rapidly expanding their geographic footprints, particularly after the Second World War; naturally, ad agencies increased their footprint at the same time, often through M&A. The overarching business opportunity, though, was the same: give advertisers a one-stop shop for all of their advertising needs.
The Internet provided two big challenges to this approach. First, the primary conversion point changed from the cash register to the check-out page; the products that benefited the most were either purely digital (like apps) or — at least in the earlier days of e-commerce — spur-of-the-moment purchases without major time pressure. CPG products didn’t really fall in either bucket.
Second, these types of purchases aligned well with the organizing principle of digital advertising, which is the individual consumer. What Facebook — now Meta — is better at than anyone in the world is understanding consumers not as members of a cohort or demographic group but rather as individuals, and serving them ads that are uniquely interesting to them.
Notice, though, that nothing in the traditional advertiser model was concerned with the individual: brands are created for cohorts or demographic groups, because they need to be manufactured at scale; then, ad agencies would advertise at scale — making money along the way — and the purchase would be consummated in physical stores at some later point in time, constrained (and propelled) by limited shelf space. Thus P&G’s pullback — and thus the opportunity for an entirely new wave of companies that were built around digital advertising and its deep personalization from the get-go.
This bifurcation manifested itself most starkly in the summer of 2020, when large advertisers boycotted Facebook over the company’s refusal to censor then-President Trump; Facebook was barely affected. I wrote in Apple and Facebook:
This is a very different picture from Facebook, where as of Q1 2019 the top 100 advertisers made up less than 20% of the company’s ad revenue; most of the $69.7 billion the company brought in last year came from its long tail of 8 million advertisers…
This explains why the news about large CPG companies boycotting Facebook is, from a financial perspective, simply not a big deal. Unilever’s $11.8 million in U.S. ad spend, to take one example, is replaced with the same automated efficiency that Facebook’s timeline ensures you never run out of content. Moreover, while Facebook loses some top-line revenue — in an auction-based system, less demand corresponds to lower prices — the companies that are the most likely to take advantage of those lower prices are those that would not exist without Facebook, like the direct-to-consumer companies trying to steal customers from massive conglomerates like Unilever.
In this way Facebook has a degree of anti-fragility that even Google lacks: so much of its business comes from the long tail of Internet-native companies that are built around Facebook from first principles, that any disruption to traditional advertisers — like the coronavirus crisis or the current boycotts — actually serves to strengthen the Facebook ecosystem at the expense of the TV-centric ecosystem of which these CPG companies are a part.
It has been nine years since that P&G pullback I referenced above, and one of the big changes P&G has made in that timeframe is to take most of its ad-buying in-house. This was in the long run inevitable, as the Internet ate everything, including traditional TV viewing, and as the rise of Aggregation platforms meant that the number of places you needed to actually buy an ad to reach everyone decreased even as potential reach increased. Those platforms also got better: programmatic platforms achieve P&G’s goal of mass reach in a way that actually increases efficiency instead of over-spending to over-target; programmatic advertising also covers more platforms now, including TV.
o3 Ammunition
Late last month OpenAI announced its o3 model, validating its initial o1 release and the returns that come from test-time scaling; I explained in an Update when o1 was released:
There has been a lot of talk about the importance of scale in terms of LLM performance; for auto-regressive LLMs that has meant training scale. The more parameters you have, the larger the infrastructure you need, but the payoff is greater accuracy because the model is incorporating that much more information. That certainly still applies to o1, as the chart on the left indicates.
It’s the chart on the right that is the bigger deal: o1 gets more accurate the more time it spends on compute at inference time. This makes sense intuitively given what I laid out above: the more time spent on compute, the more time o1 can spend spinning up multiple chains-of-thought, checking its answers, and iterating through different approaches and solutions.

It’s also a big departure from how we have thought about LLMs to date: one of the “benefits” of auto-regressive LLMs is that you’re only generating one answer in a serial manner. Yes, you can get that answer faster with beefier hardware, but that is another way of saying that the pay-off from more inference compute is getting the answer faster; the accuracy of the answer is a function of the underlying model, not the amount of compute brought to bear. Another way to think about it is that the more important question for inference is how much memory is available; the more memory there is, the larger the model, and therefore, the greater amount of accuracy.
In this way, o1 represents a new inference paradigm: yes, you need memory to load the model, but given the same model, answer quality does improve with more compute. The way that I am thinking about it is that more compute is kind of like having more branch predictors, which mean more registers, which require more cache, etc.; this isn’t a perfect analogy, but it is interesting to think about inference compute as being a sort of dynamic memory architecture for LLMs that lets them explore latent space for the best answer.
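One simple way to make that dynamic concrete is best-of-N sampling against a verifier, a generic sketch of converting extra inference compute into better answers. To be clear, this illustrates the general technique, not how o1 or o3 actually work; `generate_candidate` and `score_candidate` are hypothetical stand-ins for a model’s sampler and a learned verifier.

```python
import random

# A generic illustration of test-time scaling: draw N candidate answers
# and keep the one a verifier scores highest. This is NOT how o1/o3 are
# actually implemented; generate_candidate and score_candidate are
# hypothetical stand-ins for a model's sampler and a learned verifier.

def generate_candidate(rng: random.Random) -> float:
    """Stand-in for sampling one chain-of-thought; returns its true quality."""
    return rng.random()

def score_candidate(quality: float, rng: random.Random) -> float:
    """Stand-in for a verifier: a noisy estimate of an answer's quality."""
    return quality + rng.gauss(0, 0.1)

def best_of_n(n: int, rng: random.Random) -> float:
    """Spend more compute (larger n), keep the best-scoring candidate."""
    candidates = [generate_candidate(rng) for _ in range(n)]
    return max(candidates, key=lambda q: score_candidate(q, rng))

# More samples (more inference compute) yield better selected answers,
# with diminishing returns: roughly the curve described above.
for n in (1, 4, 16, 64):
    rng = random.Random(0)
    avg = sum(best_of_n(n, rng) for _ in range(500)) / 500
    print(f"N={n:>2}: average quality of selected answer = {avg:.2f}")
```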
o3 significantly outperforms o1, and the extent of that outperformance is dictated by how much compute is allocated to the problem at hand. One of the starkest examples was o3’s performance on the ARC Prize, a visual puzzle test that is designed to be easy for humans but hard for LLMs:
OpenAI’s new o3 system – trained on the ARC-AGI-1 Public Training set – has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%.
This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models. For context, ARC-AGI-1 took 4 years to go from 0% with GPT-3 in 2020 to 5% in 2024 with GPT-4o. All intuition about AI capabilities will need to get updated for o3…
Despite the significant cost per task, these numbers aren’t just the result of applying brute force compute to the benchmark. OpenAI’s new o3 model represents a significant leap forward in AI’s ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.
Of course, such generality comes at a steep cost, and wouldn’t quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode. But cost-performance will likely improve quite dramatically over the next few months and years, so you should plan for these capabilities to become competitive with human work within a fairly short timeline.
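A quick back-of-the-envelope on those quoted numbers; the linear cost-to-compute scaling and the 50%-per-year cost decline below are illustrative assumptions of mine, not ARC Prize figures:

```python
# Back-of-the-envelope on the quoted ARC Prize numbers. Assumption (mine,
# not theirs): cost scales roughly linearly with compute.
human_cost = 5.0        # dollars per task, quoted above
o3_low_cost = 18.5      # midpoint of the quoted $17-20 per task
multiplier = 172        # high-compute vs. low-compute configuration

print(f"Implied high-compute cost: ~${o3_low_cost * multiplier:,.0f}/task")

# If inference cost halves every year (purely illustrative), the
# low-compute mode undercuts the $5 human within two years:
cost, years = o3_low_cost, 0
while cost > human_cost:
    cost /= 2
    years += 1
print(f"Low-compute mode cheaper than a human after ~{years} years")
```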
I don’t believe that o3 and inference-time scaling will displace traditional LLMs, which will remain both faster and cheaper; indeed, inference-time scaling models will likely make traditional LLMs better through their ability to generate synthetic data for further scaling of pre-training. There remains a large product overhang for traditional LLMs — the technology is far more capable than the products that have been developed to date — but even the current dominant product, chatbots, is better experienced with a traditional LLM.
That very use case, however, gets at traditional LLM limitations: because they lack the ability to think, decide, and verify, they are best thought of as tools for humans to leverage. Indeed, while conventional wisdom about these models is that they allow anyone to generate good-enough writing and research, the biggest returns come to those with the most expertise and agency, who are able to use their own knowledge and judgment to reap efficiency gains while managing hallucinations and mistakes.
What o3 and inference-time scaling point to is something different: AIs that can actually be given tasks and trusted to complete them. This, by extension, looks a lot more like an independent worker than an assistant — ammunition, rather than a rifle sight. That may seem an odd analogy, but it comes from a talk Keith Rabois gave at Stanford:
So I like this idea of barrels and ammunition. Most companies, once they get into hiring mode…just hire a lot of people, you expect that when you add more people your horsepower or your velocity of shipping things is going to increase. Turns out it doesn’t work that way. When you hire more engineers you don’t get that much more done. You actually sometimes get less done. You hire more designers, you definitely don’t get more done, you get less done in a day.
The reason why is because most great people actually are ammunition. But what you need in your company are barrels. And you can only shoot through the number of unique barrels that you have. That’s how the velocity of your company improves is adding barrels. Then you stock them with ammunition, then you can do a lot. You go from one barrel company, which is mostly how you start, to a two barrel company, suddenly you get twice as many things done in a day, per week, per quarter. If you go to three barrels, great. If you go to four barrels, awesome. Barrels are very difficult to find. But when you have them, give them lots of equity. Promote them, take them to dinner every week, because they are virtually irreplaceable. They are also very culturally specific. So a barrel at one company may not be a barrel at another company because one of the ways, the definition of a barrel is, they can take an idea from conception and take it all the way to shipping and bring people with them. And that’s a very cultural skill set.
The promise of AI generally, and inference-time scaling models in particular, is that they can be ammunition; in this context, the costs — even marginal ones — will in the long run be immaterial compared to the costs of people, particularly once you factor in non-salary costs like coordination and motivation.
The Uneven AI Arrival
There is a long way to go to realize this vision technically, although the arrival of first o1 and then o3 signals that the future is arriving more quickly than most people realize. OpenAI CEO Sam Altman wrote on his blog:
We are now confident we know how to build AGI as we have traditionally understood it. We believe that, in 2025, we may see the first AI agents “join the workforce” and materially change the output of companies. We continue to believe that iteratively putting great tools in the hands of people leads to great, broadly-distributed outcomes.
I grant the technical optimism; my definition of AGI is that it can be ammunition, i.e. it can be given a task and trusted to complete it at a good-enough rate (my definition of Artificial Super Intelligence (ASI) is the ability to come up with the tasks in the first place). The reason for the extended digression on advertising, however, is to explain why I’m skeptical about AI “materially chang[ing] the output of companies”, at least in 2025.
In this analogy CPG companies stand in for the corporate world generally. What will become clear once AI ammunition becomes available is just how unsuited most companies are for high-precision agents, just as P&G was unsuited for highly targeted advertising. No matter how well-documented a company’s processes might be, it will turn out that there are massive gaps that were filled through experience and tacit knowledge by the human ammunition.
SaaS companies, meanwhile, are the ad agencies. The ad agencies had value by providing a means for advertisers to scale to all sorts of media across geographies; SaaS companies have value by giving human ammunition software to do their jobs. Ad agencies made money by charging a commission on the advertising they bought; SaaS companies make money by charging a per-seat licensing fee. Look again at that S-1 excerpt I opened with:
Our business model focuses on maximizing the lifetime value of a customer relationship. We make significant investments in acquiring new customers and believe that we will be able to achieve a positive return on these investments by retaining customers and expanding the size of our deployments within our customer base over time…
The positive return on investment comes from retaining and increasing seat licenses; those seats, however, are proxies for actually getting work done, just as advertising was a proxy for actually selling something. Part of what made direct response digital advertising fundamentally different is that it was tied to actually making a sale, as opposed to lifting brand awareness, which is a proxy for the ultimate goal of increasing revenue. To that end, AI — particularly AIs like o3 that scale with compute — will be priced according to the value of the task they complete; the amount that companies will pay for inference-time compute will be a function of how much the task is worth. This is analogous to digital ads that are priced by conversion, not CPM.
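A minimal sketch of that pricing logic, assuming an illustrative diminishing-returns curve relating inference spend to success probability (the curve itself is my assumption, not anyone’s published pricing): the rational compute budget scales with what the task is worth, just as conversion-priced ads command spend proportional to the sale they drive rather than the impression they serve.

```python
import math

# A sketch of value-based pricing for inference compute. The accuracy
# curve is an illustrative assumption: success probability rises with
# compute spend, with diminishing returns.

def success_probability(compute_dollars: float) -> float:
    return 1 - math.exp(-compute_dollars / 10)

def best_compute_budget(task_value: float) -> tuple[float, float]:
    """Grid-search the spend that maximizes expected net value of the task."""
    best_budget, best_net = 0.0, 0.0
    for step in range(0, 1001):
        budget = step * 0.5  # search $0 to $500 in $0.50 increments
        net = task_value * success_probability(budget) - budget
        if net > best_net:
            best_budget, best_net = budget, net
    return best_budget, best_net

# The same model rationally commands wildly different spend depending on
# what the task is worth: the analogue of ads priced by conversion
# rather than by impression (CPM).
for task_value in (20, 200, 2000):
    budget, net = best_compute_budget(task_value)
    print(f"task worth ${task_value:>4}: spend ~${budget:.1f} on compute "
          f"(expected net ${net:.0f})")
```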
The companies that actually leveraged that capability, however, were not, at least for a good long while, the companies that dominated the old advertising paradigm. Facebook became a juggernaut by creating its own customer base, not by being the advertising platform of choice for companies like P&G; meanwhile, TV and the economy built on it stayed relevant far longer than anyone expected. And, by the time TV truly collapsed, both the old guard and digital advertising had evolved to the point that they could work together.
If something similar plays out with AI agents, then the most important AI customers will primarily be new companies, and probably a lot of them will be long-tail entities that take the barrel-and-ammunition analogy to its logical extreme. Traditional companies, meanwhile, will struggle to incorporate AI (outside of wholesale job replacement à la the mainframe); the true AI takeover of enterprises that retain real-world differentiation will likely take years.
None of this is to diminish what is coming with AI; rather, as the saying goes, the future may arrive but be unevenly distributed, and, contrary to what you might think, the larger and more successful a company is, the less it may benefit in the short term. Everything that makes a company work today is about harnessing people — and the entire SaaS ecosystem is predicated on monetizing this reality; the entities that will truly leverage AI, however, will not be the ones that replace people with it, but the ones that start without them.