Muyu He
- Graduated from Penn Engineering. Machine learning engineer.
- Philadelphia, Pennsylvania
About Me
I’m a CS graduate of the University of Pennsylvania's School of Engineering and a researcher with PennNLP, Penn Medicine, and Drexel University.
I am interested in the long-horizon reasoning abilities of multi-modal models (LLMs and diffusion models), as manifested in deductive reasoning, commonsense, formalizing natural language, and more. My research focuses on developing frameworks and benchmarks to improve these abilities, which are then applied to downstream tasks such as embodied AI and realistic image generation.
Before studying at Penn, I graduated summa cum laude in philosophy from the University of California, Los Angeles, ranked among the school's top 3 philosophy undergraduates in 2022.
In my free time, I run two content-creator accounts on RedNote, China's biggest social media platform: a philosophy educator (25k followers) and a music producer (15k).
Work Experience
SenseTime
Apr 2024 - July 2024
Machine Learning Engineer Intern | Research on the VLM-based GUI agent
- Made key contributions to training vision language models to solve real-world GUI agent tasks such as in-app search and web browsing, achieving 90%+ accuracy on 5+ metrics, including question answering and button clicking, across 40+ apps
- Independently implemented supervised finetuning on three in-house InternVL2 models (2B, 4B, and 7B) with 5 million GUI agent data samples over 10+ training rounds, improving general-purpose vision metrics, including grounding accuracy, by 80%
- Took full ownership of optimizing the distributed training pipeline on 8-32 A100/H100 GPUs, employing key techniques such as DeepSpeed weight sharding, FlashAttention, loss scaling, and mixed precision to improve speed by 100%+ over V100 baselines
- Implemented a synthetic data pipeline that created 10k high-quality screen-navigation samples, leveraging the Android SDK to perform depth-first search over app screens and graph algorithms to synthesize optimal long-horizon plans of 5-8 steps
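The plan-synthesis step above can be sketched as a shortest-path search over a screen-transition graph. This is a minimal illustration, not the production pipeline: the graph, screen names, and actions are hypothetical, and BFS stands in for whichever graph algorithm was actually used.

```python
from collections import deque

def shortest_plan(graph, start, goal):
    """BFS over a screen-transition graph to find the shortest
    action sequence (plan) from the start screen to the goal screen.
    `graph` maps screen -> {action: next_screen}."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, plan = queue.popleft()
        if screen == goal:
            return plan
        for action, nxt in graph.get(screen, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [action]))
    return None  # goal unreachable from start

# Toy screen graph (illustrative): home -> settings -> wifi
graph = {
    "home": {"open_settings": "settings"},
    "settings": {"open_wifi": "wifi", "back": "home"},
}
print(shortest_plan(graph, "home", "wifi"))  # ['open_settings', 'open_wifi']
```

In the real pipeline, the graph itself would be discovered by depth-first exploration of app screens via the Android SDK before plans are synthesized.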
- Built a second synthetic data pipeline that created 6k home-screen samples with ground-truth coordinates, using graphics tools such as OpenCV to generate parametric layouts with 5+ layers that resemble real-world Android systems
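The core idea, sampling a parametric layout whose ground-truth coordinates are known by construction, can be sketched with the standard library alone. The grid dimensions, jitter ranges, and naming below are illustrative assumptions; the actual pipeline rendered multi-layer images with OpenCV.

```python
import random

def sample_layout(rows=4, cols=3, w=360, h=640, seed=0):
    """Sample a parametric grid layout resembling an Android home
    screen; returns ground-truth (label, x1, y1, x2, y2) boxes."""
    rng = random.Random(seed)
    cell_w, cell_h = w // cols, h // rows
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if rng.random() < 0.3:  # leave some cells empty, like real screens
                continue
            pad = rng.randint(8, 20)  # jitter icon size per cell
            boxes.append((f"icon_{r}_{c}",
                          c * cell_w + pad, r * cell_h + pad,
                          (c + 1) * cell_w - pad, (r + 1) * cell_h - pad))
    return boxes

boxes = sample_layout()
```

Because every box is generated rather than detected, the coordinate labels are exact, which is what makes this kind of synthetic data useful for grounding tasks.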
- Maintained the compatibility of 3 vision language models across 3 independent GPU clusters
Research
LLM Deductive Benchmark
- Led 6 researchers to build a comprehensive reasoning benchmark with 300+ entries, an average of 200+ possible answers, and an average of 25k+ context tokens per case, testing LLMs' ability to play 10 detective games using deductive skills
- Annotated tasks by 7 reasoning types, 7 answer-space sizes, and 6 reasoning-step counts to evaluate LLMs' performance thoroughly, showing that the number of reasoning steps correlates more strongly with performance than answer-space size
- Orchestrated evaluations of 12 frontier LLMs ranging from 8B to 671B parameters under 4 prompt configurations, yielding 8+ surprising insights, such as the inverse relationship between the number of reasoning tokens and reasoning accuracy, the opposite effects of long context on large and small models, and the ineffectiveness of CoT on deductive tasks
Diffusion Commonsense Benchmark
COLM 2024
- Coordinated 5 researchers to compile the first benchmark testing diffusion models' commonsense abilities during image generation, spearheading the creation of a dataset spanning 5 commonsense categories, 300 prompts, and 1.2k example images
- Single-handedly developed an automated evaluation pipeline that uses 2 vision language models (CLIP, GPT-4V) to compute similarity scores between generated images and ground-truth text, agreeing with human evaluations on 95% of all 150 tasks
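The comparison step in such a pipeline reduces to cosine similarity between an image embedding and a text embedding. The sketch below assumes embeddings have already been produced (e.g., by CLIP); the vectors and the pass threshold are placeholders, not values from the actual evaluation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def image_matches_text(image_emb, text_emb, threshold=0.25):
    """Mark a generated image as consistent with the ground-truth text
    when embedding similarity clears a threshold (illustrative value)."""
    return cosine_similarity(image_emb, text_emb) >= threshold

# Toy 2-d embeddings: identical vectors score 1.0, orthogonal score 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Scores produced this way can then be thresholded or rank-compared against human judgments, which is how agreement rates like the 95% figure above are measured.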
- Led the discovery of 5+ novel findings about 4 SOTA model pipelines (Stable Diffusion XL, DALL-E 3, etc.), including the inability to generate atypical physical conditions and to distinguish between different uses of a word across contexts
Extracurricular activities
REDPhilosophy content creator
- Created popular philosophy content for the general public, gaining 25,000 followers within 5 months and surpassing 93% of channels in growth speed
- Produced animations, articles, and cartoons every two days, averaging over 1,000 interactions per post across social media