In the coming years, agents are expected to take over more and more tasks on behalf of humans, including using computers and smartphones. For now, however, they are too error-prone to be very useful.
A new agent called S2, created by the startup Simular AI, combines frontier models with models specialized for using computers. The agent achieves state-of-the-art performance on tasks such as using applications and manipulating files, and suggests that turning to different models in different situations can help agents advance.
“Computer-using agents are different from large language models and different from coding,” says Ang Li, cofounder and CEO of Simular. “It’s a different type of problem.”
In Simular’s approach, a powerful general-purpose AI model, such as OpenAI’s GPT-4o or Anthropic’s Claude 3.7, is used to reason about the best way to complete the task at hand, while smaller open source models step in for tasks such as interpreting web pages.
Li, who was a researcher at Google DeepMind before founding Simular in 2023, explains that large language models excel at planning but are not as good at recognizing the elements of a graphical user interface.
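The division of labor described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Simular’s code: the names `Step`, `stub_planner`, `stub_grounder`, and `run_agent` are hypothetical, and simple stub functions stand in for the general-purpose planning model and the smaller interface-recognition model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Screen = Dict[str, Tuple[int, int]]  # hypothetical: element description -> (x, y)


@dataclass
class Step:
    kind: str        # "click" or "type"
    target: str      # natural-language description of a UI element
    text: str = ""   # text to enter, used only for "type" steps


def stub_planner(goal: str) -> List[Step]:
    """Stand-in for a general-purpose model that plans the task."""
    return [
        Step("click", "search box"),
        Step("type", "search box", goal),
        Step("click", "search button"),
    ]


def stub_grounder(target: str, screen: Screen) -> Tuple[int, int]:
    """Stand-in for a smaller specialist model that locates UI elements."""
    return screen[target]


def run_agent(
    goal: str,
    screen: Screen,
    planner: Callable[[str], List[Step]] = stub_planner,
    grounder: Callable[[str, Screen], Tuple[int, int]] = stub_grounder,
) -> List[str]:
    """Ask the planner what to do, then ask the grounder where to do it."""
    actions = []
    for step in planner(goal):
        x, y = grounder(step.target, screen)
        if step.kind == "type":
            actions.append(f"type {step.text!r} at ({x}, {y})")
        else:
            actions.append(f"{step.kind} at ({x}, {y})")
    return actions


screen = {"search box": (120, 40), "search button": (300, 40)}
actions = run_agent("cheap flights", screen)
```

Keeping the planner and the grounder behind separate function arguments mirrors the idea that each job can go to whichever model handles it best.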
S2 is designed to learn from experience with an external memory module that records actions and user feedback and uses those recordings to improve future actions.
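A rough sketch of what such a memory module might look like follows. It is an assumption-laden toy, not a description of how S2 works: the `Episode` and `ExperienceMemory` names, the "good"/"bad" feedback labels, and the word-overlap similarity measure are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Episode:
    task: str
    actions: List[str]
    feedback: str  # hypothetical label supplied by the user: "good" or "bad"


@dataclass
class ExperienceMemory:
    episodes: List[Episode] = field(default_factory=list)

    def record(self, task: str, actions: List[str], feedback: str) -> None:
        """Store what the agent did and how the user rated it."""
        self.episodes.append(Episode(task, actions, feedback))

    def recall(self, task: str) -> List[str]:
        """Return the actions of the most similar well-rated past task, or []."""
        words = set(task.lower().split())
        best, best_overlap = None, 0
        for episode in self.episodes:
            if episode.feedback != "good":
                continue  # reuse only approaches the user approved of
            overlap = len(words & set(episode.task.lower().split()))
            if overlap > best_overlap:
                best, best_overlap = episode, overlap
        return list(best.actions) if best else []


memory = ExperienceMemory()
memory.record("rename file report.txt", ["open file manager", "press F2"], "good")
memory.record("rename file notes.txt", ["delete file"], "bad")
suggestion = memory.recall("rename file draft.txt")
```

Here a new renaming task retrieves the earlier approach that earned positive feedback and ignores the one that did not.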
On particularly complex tasks, S2 performs better than any other model on OSWorld, a benchmark that measures an agent’s ability to use a computer operating system.
For example, S2 can complete 34.5 percent of tasks that involve 50 steps, beating OpenAI’s Operator, which can complete 32 percent. Similarly, S2 scores 50 percent on AndroidWorld, a benchmark for smartphone-using agents, while the next best agent scores 46 percent.
Victor Zhong, a computer scientist at the University of Waterloo in Canada and one of the creators of OSWorld, believes that future AI models may incorporate training data that helps them understand the visual world and make sense of graphical user interfaces.
“This will help agents navigate GUIs with much higher accuracy,” says Zhong. “I think, in the meantime, before these fundamental advances, state-of-the-art systems will resemble Simular in that they combine multiple models to patch the limitations of single models.”
To prepare for this column, I used Simular to book flights and scour Amazon for deals, and it seemed better than some of the open source agents I tried last year, including AutoGen and vimGPT.
But even the smartest AI agents are still troubled by edge cases and sometimes behave oddly. In one case, when I asked S2 to help find contact information for the researchers behind OSWorld, the agent got stuck in a loop, hopping between the project page and the login for OSWorld’s Discord.
OSWorld’s benchmarks show why agents are still more hype than reality at the moment. While humans can complete 72 percent of OSWorld’s tasks, agents succeed only 38 percent of the time on complex tasks. That said, when the benchmark was introduced in April 2024, the best agent could complete only 12 percent of the tasks.