Trending Research

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

stepfun-ai/step-audio · 17 Feb 2025

Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following.

Instruction Following, Voice Cloning

3,036

6.18 stars / hour

Craw4LLM: Efficient Web Crawling for LLM Pretraining

cxcscmu/crawl4llm · 19 Feb 2025

Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality.

398

4.83 stars / hour

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

openai/swelancer-benchmark · 17 Feb 2025

We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts.

1,059

4.71 stars / hour

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

stepfun-ai/step-video-t2v · 14 Feb 2025

We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length.

Video Generation, Video Reconstruction

1,947

3.67 stars / hour

OmniParser for Pure Vision Based GUI Agent

microsoft/omniparser · 1 Aug 2024

The recent success of large vision language models shows great potential in driving the agent system operating on user interfaces.

Ranked #10 on Natural Language Visual Grounding on ScreenSpot

Natural Language Visual Grounding

15,985

3.48 stars / hour

MoBA: Mixture of Block Attention for Long-Context LLMs

moonshotai/moba · 18 Feb 2025

Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI).

1,354

2.96 stars / hour

PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation

microsoft/pike-rag · 20 Jan 2025

Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications.

Language Modeling, Language Modelling, +3

773

2.13 stars / hour

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

aslp-lab/osum · 23 Jan 2025

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions.

Event Detection, Gender Classification, +3

268

1.79 stars / hour

Data Formulator 2: Iteratively Creating Rich Visualizations with AI

microsoft/data-formulator · 28 Aug 2024

To create rich visualizations, data analysts often need to iterate back and forth among data processing and chart specification to achieve their goals.

Code Generation, Navigate

8,022

1.41 stars / hour

Magma: A Foundation Model for Multimodal AI Agents

microsoft/Magma · 18 Feb 2025

We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds.

Robot Manipulation

437

1.40 stars / hour
