Mark Ibrahim

I’m interested in how we can build reliable multimodal AI systems that can discover and compose the right first principles — from foundation-model training and evaluation to the computer-use agents they power.

SELECTED RESEARCH

Latest in Google Scholar

🗞️ News & Updates

May 2026 I'll be in Rio at ICLR 2026 presenting an oral (top 1%) on OpenApps. Reach out if you'll be there! 🇧🇷
2026 Learning to Reason in 13 Parameters (TinyLoRA) hit the front page of Hacker News and was implemented in Hugging Face PEFT. ⚡
2026 OpenApps is now integrated into OpenEnv (the Hugging Face / PyTorch RL environment) and BrowserGym. 🧩
Dec 2025 I'll be at NeurIPS 2025 sharing some of our latest work: Common-O (a new multimodal reasoning benchmark, 30k+ downloads), AbstentionBench (measuring LLMs' ability to recognize when they don't know), Verbatim Memorization in LLMs, and a study of JEPA architectures (spotlight award). Reach out if you'll be in San Diego! ☀️

🖱️ Computer-Use Agents

OpenApps: simulating app variations for computer-use agent reliability

★ ICLR 2026 Oral · top 1% ★ Oral · NE Agents Day

We open-source OpenApps, a Python research environment that generates endless versions of six real apps — with ground-truth state and rewards, on a single CPU — to train and evaluate computer-use agents across the variations they break on in deployment.

Integrated into OpenEnv (the Hugging Face / PyTorch RL environment) and BrowserGym · oral at NE Agents Day

📣 code release, 📃 paper, and 🎬 video tutorial

Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim

🖼️ Multimodal Training & Evaluation

Common-O: Hallucination in Visual Reasoning Across Scenes

NeurIPS 2025

Multimodal models can perceive objects but hallucinate when reasoning across scenes. Common-O is a decontaminated multi-scene reasoning benchmark on which today's best models score under 25%.

🔥 30k+ Hugging Face downloads

> paper + data

Candace Ross, Florian Bordes, Adina Williams, Polina Kirichenko, Mark Ibrahim

LLIP: Latent-Language Image Pretraining

ICML 2024

A state-of-the-art open-weight vision encoder (ViT-G) with optimized visual cross-attention. Scaled to 5B samples, LLIP outperforms MetaCLIP by an average of 2.9% across 22 zero-shot benchmarks and 6% R@1 on COCO retrieval.

> paper + weights

Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mahmoud Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

NeurIPS 2024

A 50+ benchmark suite of vision-language capabilities revealing that scaling alone doesn't improve visual reasoning — evaluate a model across 7 capability types on 1 GPU in minutes.

> paper

Haider Al-Tahan, Quentin Garrido, Randall Balestriero, Diane Bouchacourt, Caner Hazirbas, Mark Ibrahim

\(\mathbb{X}\)-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs

ICLR 2025

A graph-based contrastive loss that explicitly encodes relationships across samples during training, improving efficiency and robustness.

> paper

Vlad Sobal, Mark Ibrahim, Randall Balestriero, Vivien Cabannes, Diane Bouchacourt, Pietro Astolfi, Kyunghyun Cho, Yann LeCun
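
As a rough illustration of the idea summarized above, one way to encode relationships across samples is to replace the one-hot targets of a standard contrastive loss with soft targets derived from a similarity graph. The sketch below is an assumption about the general technique, not the paper's exact formulation; the function names, the temperatures, and the toy graph are all hypothetical.

```python
import math


def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_target_contrastive_loss(embeddings, graph, tau=0.1, tau_graph=0.1):
    """Cross-entropy between graph-derived soft targets and predicted similarities.

    embeddings: list of N vectors (hypothetical model outputs)
    graph: N x N matrix of sample-to-sample similarities (hypothetical input)
    """
    n = len(embeddings)
    loss = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Predicted distribution over the other samples, from embedding similarity.
        predicted = softmax([cosine(embeddings[i], embeddings[j]) for j in others], tau)
        # Target distribution over the other samples, from the similarity graph.
        target = softmax([graph[i][j] for j in others], tau_graph)
        loss -= sum(t * math.log(p) for t, p in zip(target, predicted))
    return loss / n


# Toy data: samples 0 and 1 point one way, samples 2 and 3 another.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_graph = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
mismatched_graph = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]

aligned_loss = soft_target_contrastive_loss(embeddings, aligned_graph)
mismatched_loss = soft_target_contrastive_loss(embeddings, mismatched_graph)
print(aligned_loss < mismatched_loss)
```

In this toy setup the loss is lower when the graph agrees with the embedding geometry. With a graph that marks only one positive per sample, the targets become nearly one-hot and the loss approaches a standard contrastive objective.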

🧭 Alignment & Reasoning

Learning to Reason in 13 Parameters (TinyLoRA)

🔥 Front page of Hacker News

We show GRPO needs to update as few as 13 parameters (26 bytes) to bring Qwen2.5-8B within 5% of full-finetuning GSM8K performance — recovering 90% of reasoning gains while training 1000× fewer parameters.

Implemented in Hugging Face PEFT (TinyLoRA) · reached the front page of Hacker News

> paper

John X. Morris, Niloofar Mireshghallah, Mark Ibrahim, Saeed Mahloujifar
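
The 26-byte figure above is consistent with storing each of the 13 trained parameters in a 16-bit format such as bf16. The precision is an assumption here, since the summary does not state it, and `adapter_size_bytes` is a hypothetical helper used only for the arithmetic.

```python
def adapter_size_bytes(num_params: int, bits_per_param: int = 16) -> int:
    """Storage needed for the trained parameters, in bytes."""
    return num_params * bits_per_param // 8


# Assumed 16-bit storage (for example bf16): 13 * 16 / 8 bytes.
size_16bit = adapter_size_bytes(13)
# For comparison, the same 13 parameters stored in 32-bit floats.
size_32bit = adapter_size_bytes(13, bits_per_param=32)
print(size_16bit, size_32bit)
```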

AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions

NeurIPS 2025

We evaluate LLMs' capacity for abstention — the skill of knowing when NOT to answer. We find reasoning LLMs struggle with unanswerable questions and hallucinate.

Used by OpenAI in the GPT-5 system card · adopted by the UK AI Security Institute's Inspect Evals · cited in the MuseSpark Preparedness Report

> paper + code + data

Polina Kirichenko*, Samuel J. Bell*, Kamalika Chaudhuri, Mark Ibrahim*

The Factorization Curse: Which Tokens You Predict Underlie the Reversal Curse and More

NeurIPS 2024

We show that training transformers to predict multiple tokens ahead and back (instead of just the single next token) improves models' ability to retrieve knowledge.

> paper

Ouail Kitouni, Niklas Nolte, Diane Bouchacourt, Adina Williams, Mike Rabbat, Mark Ibrahim

In a follow-up, the same objective improves a transformer's ability to plan in maze navigation (MLM-U), converging 2× faster in GPU hours.

🧪 Self-Supervised Learning & Generalization

Discovering Environments with XRM

★ ICML 2024 · top 1%

A method to automatically discover the spurious environments that break out-of-distribution generalization — without human annotations.

> paper

Mohammad Pezeshki, Diane Bouchacourt, Mark Ibrahim, Nicolas Ballas, Pascal Vincent, David Lopez-Paz

ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations

★ ICLR 2023 Spotlight · top 5%

We find surprisingly similar strengths and vulnerabilities across more than 2,200 deep learning models.

> paper + website

Badr Youbi Idrissi, Diane Bouchacourt, Randall Balestriero, Ivan Evtimov, Caner Hazirbas, Nicolas Ballas, Pascal Vincent, Michal Drozdzal, David Lopez-Paz, Mark Ibrahim

Shortcuts Come in Multiples Where Mitigating One Amplifies Others

CVPR 2023

A study of how deep learning techniques cope with multiple shortcuts (Whac-A-Mole), along with a method for handling them.

> paper + code

Zhiheng Li, Ivan Evtimov*, Albert Gordo, Caner Hazirbas, Tal Hassner, Cristian Canton Ferrer, Chenliang Xu, Mark Ibrahim*

Does Progress on Object Recognition Benchmarks Improve Real-World Generalization?

ICLR 2024

Progress on standard benchmarks fails to improve — and can worsen — geographic disparities in today's best models.

> paper

Megan Richards, Polina Kirichenko, Diane Bouchacourt, Mark Ibrahim

A Cookbook of Self-Supervised Learning

ICML Tutorial 2023

Recipes for training and evaluating self-supervised learning systems — co-authored with Randall Balestriero, Yann LeCun, and many others.

Featured on the Meta AI blog

Earlier work

Global Explanations for Neural Networks: Mapping the Landscape of Predictions

AAAI 2019

> paper + open source library + blog post

Mark Ibrahim, Melissa Louie, Ceena Modarres, John Paisley (Columbia University)

Talks

ICLR Oral Presentation (May 2026)
OpenApps: simulating app variations for computer-use agent reliability

World Modeling Workshop at Mila (Feb 2026)
OpenApps: World Models for Computer-Use

NeurIPS Highlights (Dec 2025)
Slides

Self Supervised Learning: The Final Frontier of AI at the Simons Flatiron Institute (April 2025)
Lightning Talk on Latent Space Prediction
organized by Randall Balestriero, Yann LeCun, Alberto Bietti, and Shirley Ho

NeurIPS Self-Supervised Learning Workshop Oral (Dec 2024)
Occam's Razor: What's sufficient for learning good self-supervised representations?

Brown University talk on Robust Representation Learning (2024)
From Vision to Multimodal Self-Supervised Models

ICML Tutorial on Self-Supervised Learning (2023) (400+ researchers attended)
From Research Advances to Best Practices (slides and recording)

Georgia Tech's Deep Learning Course Instructor (2022) (10k+ online students)
Lecture on "Feed Forward Neural Networks"

PyCon US 2020 (Python Conference)
Talk on "Machine Learning on Encrypted Data with CrypTen"

NeurIPS 2018 FEAP Workshop Spotlight Talk (Dec 2018)
"Towards Explainable Deep Learning for Credit Lending"

New York Python Meetup (Dec 2018)
Data Science Talk: "Explaining Deep Learning Models"

Applied Machine Learning Tom Tom Conference (April 2018)
"Explainable AI: Key Techniques and Societal Implications"

George Washington University, Data Driven Conference (Dec 2017)
"Understanding the Predictions of Deep Neural Networks"

NYC Data Wranglers Meetup (Aug 2016)
Data Science in Practice: "Building a Graph-Based Search Engine"

Advising & Mentorship

Research Internship Advising: Ouail Kitouni (MIT PhD, now Anthropic Research Scientist), Mazda Moayeri (U Maryland PhD student), Cian Eastwood (now Senior Research Scientist at Valence Labs), Karsten Roth (PhD student at Google DeepMind/Tübingen).

AI Residents: Megan Richards (incoming NYU PhD student with Prof. Kyunghyun Cho), Haider Al-Tahan (incoming Georgia Tech PhD student).

Industry PhD Advisor: Polina Kirichenko (NYU PhD, now Research Scientist at FAIR, Meta AI).

Courses Taught at the University of Vermont

Calculus I 71 eager minds,

Calculus II 38 students, and

College Algebra 42 students.