Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following.
Web crawls are a primary source of pretraining data for large language models (LLMs), but the majority of crawled web pages are discarded during pretraining due to low data quality.
We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts.
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length.
The recent success of large vision-language models shows great potential for driving agent systems that operate on user interfaces.
Ranked #10 on Natural Language Visual Grounding on ScreenSpot.
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI).
Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications.
Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions.
To create rich visualizations, data analysts often need to iterate back and forth between data processing and chart specification to achieve their goals.
We present Magma, a foundation model for multimodal AI agentic tasks in both the digital and physical worlds.