
Co-Adapting Human Interfaces and LMs
As the world adapts to LMs, the line between agent and environment begins to blur
November 12, 2024
With the explosion of computer-use agents like Claude Computer Use, and the search wars between Perplexity, SearchGPT, and Gemini, it seems inevitable that AI will change how we access information. So far, we've focused on building agents that understand the world better. But at this point, it's worth realizing that the world is adapting to LMs as well. Code libraries, websites, and documents were designed for humans, but once LMs become "users" too, they start to look different.
In this article, I'll look at some early signs of what that future might hold: how has the world adapted already, and what might the web eventually look like? For researchers and developers, this raises interesting questions. Once we recognize that the digital world is fundamentally malleable, the line between "agent" and "environment" begins to blur: rather than just building better models, what end-to-end systems should we build?
When I code side by side with GitHub Copilot, I notice my behavior changing in subtle ways to adapt to the tool. Code completion is a natural fit for "docstring-first programming": if the model first gets some description of what you want to do, you're much more likely to get useful code back. So it's natural to write a comment or a descriptive name first, then pause and see whether the right snippet comes back:
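Here's a minimal sketch of that workflow (the function name and task are made up for illustration): the descriptive name and docstring go in first, and the body below is the kind of snippet a completion model tends to fill in once the intent is spelled out.

```python
# Illustrative "docstring-first" pattern: write the descriptive name and
# docstring, pause, and let the completion model propose the body.

def count_words_by_frequency(text: str) -> dict[str, int]:
    """Return a mapping from each word in `text` to how often it appears,
    case-insensitive, ignoring punctuation."""
    # Everything below is roughly what a completion model suggests once the
    # intent above is explicit.
    import re
    from collections import Counter

    words = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(words))
```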
In short, to make Copilot work better, you make your code look a certain way, and the codebase ends up with more annotations than it would otherwise have. Code written for human readability looks different from code written by a programmer micro-optimizing for an LM: the environment adapts to the tool.
Even more interesting is that this adaptation happens beyond the code we write ourselves. As a library/framework/tool developer in 2024, you can bet that a large portion of your human users will interact with your framework through their coding assistants. Being "developer-friendly", i.e. building for the workflows users actually have, now also means building for their LMs. For example, Jeremy Howard's FastHTML ships llm-ctx documentation: documentation intended for LM consumption rather than for humans. It's an early prototype of a parallel world constructed for LMs.
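In practice, the pattern is simple: fetch the machine-oriented docs and put them into the model's context. Below is a rough sketch under stated assumptions; the URL is a placeholder (not FastHTML's actual file location) and the prompt format is generic, not any particular provider's API.

```python
# Sketch of the "docs written for LMs" pattern: fetch a machine-oriented
# context file and prepend it to the prompt, so the assistant answers
# framework questions using docs the authors wrote for LM consumption.
import urllib.request

DOCS_URL = "https://example.com/llms-ctx.txt"  # placeholder, not the real path

def build_prompt(question: str) -> str:
    with urllib.request.urlopen(DOCS_URL) as resp:
        llm_docs = resp.read().decode("utf-8")
    return (
        "You are helping a developer use this framework.\n"
        "Reference documentation (written for LMs):\n"
        f"{llm_docs}\n\n"
        f"Question: {question}\n"
    )
```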
The same principle shows up across applications: to make an LM agent work better, give it the right context in a format it can understand. We want LMs to operate over websites, GUIs, documents, spreadsheets, and so on, and in every one of these environments, a hidden part of the modeling work is actually tuning the environment, i.e. adjusting the inputs so they're easier for the LM to understand.
SWE-agent was one of the first to make this explicit: by building an "agent-computer interface" (a layer on top of file viewers, editors, terminals, and so on) for agents that use computers, you get large performance improvements. Short of end-to-end pixel-based control, these design decisions exist in every agent. For example, even for multimodal agents, it helps to augment the UI with high-level interactive elements and actions, so that a web agent can output "click [36] <a> outdoor table lamp" (instead of predicting raw pixel coordinates).
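To make the idea concrete, here's a minimal sketch of that kind of augmentation; it's the general pattern, not SWE-agent's or any specific framework's implementation, and the element/page names are invented. Interactive elements are numbered so the agent can act symbolically and the interface layer maps the action back to a concrete target.

```python
# Sketch: flatten a page's interactive elements into numbered entries so the
# agent can output symbolic actions like 'click [36]' instead of coordinates.
from dataclasses import dataclass

@dataclass
class Element:
    tag: str    # e.g. "a", "button", "input"
    text: str   # visible label or accessible name

def render_for_agent(elements: list[Element]) -> str:
    """Produce the text observation shown to the agent."""
    return "\n".join(f"[{i}] <{el.tag}> {el.text}" for i, el in enumerate(elements))

def resolve_action(action: str, elements: list[Element]) -> Element:
    """Map an agent action like 'click [0]' back to a concrete element."""
    index = int(action.split("[")[1].split("]")[0])
    return elements[index]

page = [Element("a", "outdoor table lamp"), Element("button", "Add to cart")]
print(render_for_agent(page))          # [0] <a> outdoor table lamp ...
print(resolve_action("click [0]", page))
```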
The reason we have to do this is that UIs are designed for human users. We design around visual hierarchy or "F-shaped pattern" reading because humans have common biases in how they process visual content. LMs don't come with those priors, so we end up baking the same inductive biases back into the model, teaching it from lots of data that, for example, an icon sitting to the left of a form field is probably semantically related to it.
When we take a step back, this feels roundabout: I want to show the user some underlying data, so I design an interface that communicates that data to them visually. Now, instead of a human looking at the UI, we spend a lot of effort building models that parse the visual presentation back into the underlying data or API requests. (Hmm… 🤔)
We have increasingly sophisticated models and tools for this "post-processing" of human interfaces for LMs, e.g. converting HTML pages to Markdown or encoding spreadsheets so they're easier for models to digest. This will certainly be useful in the short term, since most websites won't natively adapt to LMs for a long time. But I'm excited to push further on the alternative: if the underlying material in a spreadsheet is at some point manipulated more by language models than by humans, what if we designed the spreadsheet from scratch for the language model? Or the computer operating system? And once we hand the low-level operations to machines, what will the human-facing interface look like?
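To make the "post-processing" step above concrete, here's a minimal sketch of HTML-to-Markdown conversion using the third-party html2text package (`pip install html2text`); the example page is made up, and other converters would work just as well.

```python
# Strip a page down to Markdown before handing it to the model.
import html2text

def page_to_markdown(html: str) -> str:
    converter = html2text.HTML2Text()
    converter.ignore_images = True  # drop content the model can't use as text
    converter.body_width = 0        # don't hard-wrap lines
    return converter.handle(html)

html = "<h1>Lamps</h1><ul><li><a href='/p/36'>Outdoor table lamp</a></li></ul>"
print(page_to_markdown(html))
# prints roughly: a "# Lamps" heading followed by a bulleted Markdown link
```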
In other words, there are two ways to make LMs better at using interfaces:
- Make LMs smarter (e.g. better reasoning, multimodal understanding)
- Make interfaces easier for LMs to understand
I think these are two different bets: the first, that a sufficiently smart LM can be dropped in wherever humans are, using the same interfaces; the second, that LMs will be "different", with their own ecosystem and comparative advantages. Which approach wins in the long run?
From a research perspective, (1) is attractive because it's the more fundamental bet. If you care about artificial general intelligence, "using a computer" is just one more testbed for intelligence.
If we care about useful agents, though, (2) feels underexplored. LM interfaces may seem ad hoc today, but if we make bet (2) (e.g. building an operating system for LMs), there's no reason they can't become as versatile as human interfaces. And there are clear benefits: efficiency (e.g. function calls vs. multimodal pixel control), safety (limiting the set of operations exposed through the LM interface), and a wider design space (parallelism without spinning up 500 virtual machines). Unlike the physical world, which we have to model with general-purpose methods because nature simply is the way it is, we build the digital world ourselves; we can construct it differently.
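As one illustration of the safety point about limiting the operation set: instead of exposing raw clicks and keystrokes, an LM-native interface can offer a small whitelist of typed operations. The sketch below uses JSON-schema-style tool definitions of the kind common in function-calling APIs; the tool names and fields are invented for illustration.

```python
# A restricted action space: the agent can only call whitelisted, typed tools,
# not arbitrary UI operations. Names and schemas here are hypothetical.
ALLOWED_TOOLS = [
    {
        "name": "search_products",
        "description": "Search the catalog; read-only.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "add_to_cart",
        "description": "Add one item to the cart; never charges the card.",
        "parameters": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string"},
                "quantity": {"type": "integer", "minimum": 1},
            },
            "required": ["product_id", "quantity"],
        },
    },
]

def is_allowed(tool_call: dict) -> bool:
    """Reject anything outside the whitelist before it touches the environment."""
    return tool_call.get("name") in {t["name"] for t in ALLOWED_TOOLS}
```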
When new technologies emerge, individuals are usually the first to change their behavior. Many of the examples above come from independent researchers and engineering teams hacking together prompting fixes or API/agent interfaces to make their LMs work better.
Then companies and providers start to adapt: for example, as more people interact with your developer tools through language models, it makes sense to write documentation for LMs. We're not there yet, but it's not hard to imagine that if web agents become a widely used utility, it would make sense for websites to make their UIs slightly easier to parse so they work better with language models, and then perhaps to build full-fledged products designed for compatibility with LM agents. Before you know it, the web starts to look different.
The story should sound familiar: search engines are one of the primary interfaces to the web today, and they have literally changed the shape of the web since the 1990s and 2000s. PageRank was an effective algorithm because it identified a useful heuristic for finding high-quality websites at the time, mainly that "more important sites are likely to receive more links from other sites." But as Google became more popular, it wasn't just the search engine adapting to the way the web was; sites adapted themselves to Google. The formats and content that PageRank reads as "high quality" got produced more often, website templates became more standardized, people put entire articles and FAQs before the actual recipe, and so on, giving us the SEO-optimized web we have today.
What will the web look like when LMs become one of its standard interfaces, whether through chatbots like ChatGPT, AI search engines like Perplexity, or web agents? I think it's an interesting future to speculate about, but more importantly, our expectations shape what we build today. People building agents may treat the web "environment" as fixed, when in fact the whole digital world evolves along with the tools we build. Given that agent-computer interfaces appear to be an underused lever (compared with the popular direction of general-purpose agents operating human interfaces), there's a lot of unexplored territory. For one, we don't actually know which interfaces work best for language models. I'm excited about research that tries to understand and distill the "design principles" that dramatically improve the performance and usefulness of today's models. This is a systems view of AI research: rather than only improving models in isolation, we can zoom out and improve the systems they operate in as a whole.
Thinking about how the world will adapt to language models may also change which applications are worth building. LMs don't just provide a window onto the web; in some cases they may "replace" parts of it entirely. If I have a personal AI agent, do I still need to go to OpenTable to make a reservation, or fill out that form ever again? Once I get a synthesized answer from an LM that has read the web for me, do I still need to visit the web myself?
I don't think humans will get all of their content through LMs, which makes the interesting question which parts of the web end up used primarily by LMs and which by humans.
It does seem like a lot of apps are optimized for the near future without taking a stance on this question. For example, people today are obsessed with voice interfaces (perhaps partly because they're "universal"), and with applications like enterprise agents that answer customer calls or handle customer-service requests on the fly. But voice is a great interface because it works well for humans. When you realize we're heading toward a future where customers can have their own Google Assistant call the business, which then has to handle all of that robustly… we may be making the problem harder than it needs to be.
In the long run, we should ask what humans will even need to do. I hope that in 10 years I won't have to call anyone to reschedule a package delivery, and if that's the case, maybe we don't need to keep building around the underlying human interface (voice) to solve this problem. The thing I want to do isn't "call the doctor's office"; it's "make an appointment", and if technology can coordinate the appointment, no one needs to answer a phone. We build LMs to help people do things, but a lot of what we do today we only do because we didn't have AI.
It's not just about the UX or visual design of interfaces; it's fundamentally about which jobs get done by LMs and which by humans. If LMs really can take on much of the low-level operation and information synthesis we do today, we'll need new interfaces for humans to do other things: maybe high-level management, specifying what we want, or other fundamentally human communication. Sometimes, rather than having an LM just tell me the answer, I want to read about the experiences and see the pictures of people who have actually done the thing in the real world. Once LMs handle the machine-like tasks, perhaps new tasks and new content for humans will emerge.