Muyu He
- Graduated from Penn Engineering. Machine learning engineer.
- Philadelphia, Pennsylvania
About Me
I’m a CS graduate of the University of Pennsylvania's School of Engineering and a researcher with PennNLP, Penn Medicine, and Drexel University.
I am interested in the long-horizon reasoning abilities of multi-modal models (LLMs and diffusion models), as manifested in deductive reasoning, commonsense, formalizing natural language, and more. My research focuses on developing frameworks and benchmarks to improve these abilities, which are then applied to downstream tasks such as embodied AI and realistic image generation.
Before studying at Penn, I graduated summa cum laude in philosophy from the University of California, Los Angeles, ranked among the school's top 3 philosophy undergraduates in 2022.
In my free time, I run two content-creator accounts on RedNote, China's biggest social media platform: a philosophy educator (25k followers) and a music producer (15k).
Work Experience
SenseTime
Apr 2024 - July 2024
Machine Learning Engineer Intern | Research on the VLM-based GUI agent
- Made key contributions to training vision language models to solve real-world GUI agent tasks such as in-app search and web browsing, achieving 90%+ accuracy on 5+ metrics, including question answering and button clicking, across 40+ apps
- Independently implemented supervised finetuning on three in-house InternVL2 models (2B, 4B, and 7B) with 5 million GUI agent data samples over 10+ training rounds, improving general-purpose vision metrics, including grounding accuracy, by 80%
- Took full ownership of optimizing the distributed training pipeline on 8-32 A100/H100 GPUs, employing key techniques such as DeepSpeed weight sharding, FlashAttention, loss scaling, and mixed precision to improve speed by 100%+ over V100 baselines
- Implemented a synthetic data pipeline that created 10k high-quality screen-navigation samples, leveraging the Android SDK to perform depth-first search over app screens and graph algorithms to synthesize optimal long-horizon plans of 5-8 steps
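The plan-synthesis step above can be sketched as a shortest-path search over a screen-transition graph. This is a minimal illustration, not the production pipeline: the graph, screen names, and actions are hypothetical, and BFS stands in for whichever graph algorithm was actually used.

```python
from collections import deque

def shortest_plan(graph, start, goal):
    """BFS over a screen-transition graph to find the shortest
    action sequence (plan) from the start screen to the goal screen.
    `graph` maps screen -> {action: next_screen}."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, plan = queue.popleft()
        if screen == goal:
            return plan
        for action, nxt in graph.get(screen, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [action]))
    return None  # goal unreachable from start

# Toy screen graph (illustrative): home -> settings -> wifi
graph = {
    "home": {"open_settings": "settings"},
    "settings": {"open_wifi": "wifi", "back": "home"},
}
print(shortest_plan(graph, "home", "wifi"))  # ['open_settings', 'open_wifi']
```

In the real pipeline, the graph itself would be discovered by depth-first exploration of app screens via the Android SDK before plans are synthesized.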
- Built a second synthetic data pipeline that created 6k home-screen samples with ground-truth coordinates, using graphics tools such as OpenCV to generate parametric layouts with 5+ layers that resemble real-world Android systems
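The core idea, sampling a parametric layout whose ground-truth coordinates are known by construction, can be sketched with the standard library alone. The grid dimensions, jitter ranges, and naming below are illustrative assumptions; the actual pipeline rendered multi-layer images with OpenCV.

```python
import random

def sample_layout(rows=4, cols=3, w=360, h=640, seed=0):
    """Sample a parametric grid layout resembling an Android home
    screen; returns ground-truth (label, x1, y1, x2, y2) boxes."""
    rng = random.Random(seed)
    cell_w, cell_h = w // cols, h // rows
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if rng.random() < 0.3:  # leave some cells empty, like real screens
                continue
            pad = rng.randint(8, 20)  # jitter icon size per cell
            boxes.append((f"icon_{r}_{c}",
                          c * cell_w + pad, r * cell_h + pad,
                          (c + 1) * cell_w - pad, (r + 1) * cell_h - pad))
    return boxes

boxes = sample_layout()
```

Because every box is generated rather than detected, the coordinate labels are exact, which is what makes this kind of synthetic data useful for grounding tasks.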
- Maintained the compatibility of 3 vision language models across 3 independent GPU clusters
Research
LLM Deductive Benchmark
- Led 6 researchers to build a comprehensive reasoning benchmark with 300+ entries, an average of 200+ possible answers, and an average of 25k+ context tokens per case, testing LLMs' ability to play 10 detective games using deductive skills
- Annotated tasks by 7 reasoning types, 7 answer-space sizes, and 6 reasoning-step counts to evaluate LLMs' performance thoroughly, showing that the number of reasoning steps correlates more strongly with performance than answer-space size
- Orchestrated evaluations of 12 frontier LLMs ranging from 8B to 671B parameters under 4 prompt configurations, yielding 8+ surprising insights, such as the inverse relationship between the number of reasoning tokens and reasoning accuracy, the opposite effects of long context on large and small models, and the ineffectiveness of CoT on deductive tasks
Diffusion Commonsense Benchmark
COLM 2024
- Coordinated 5 researchers to compile the first benchmark testing diffusion models' commonsense abilities during image generation, spearheading the creation of a dataset spanning 5 commonsense categories, 300 prompts, and 1.2k example images
- Single-handedly developed an automated evaluation pipeline that uses 2 vision language models (CLIP, GPT-4V) to compute similarity scores between generated images and ground-truth text, agreeing with human evaluations on 95% of all 150 tasks
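The comparison step in such a pipeline reduces to cosine similarity between an image embedding and a text embedding. The sketch below assumes embeddings have already been produced (e.g., by CLIP); the vectors and the pass threshold are placeholders, not values from the actual evaluation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def image_matches_text(image_emb, text_emb, threshold=0.25):
    """Mark a generated image as consistent with the ground-truth text
    when embedding similarity clears a threshold (illustrative value)."""
    return cosine_similarity(image_emb, text_emb) >= threshold

# Toy 2-d embeddings: identical vectors score 1.0, orthogonal score 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Scores produced this way can then be thresholded or rank-compared against human judgments, which is how agreement rates like the 95% figure above are measured.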
- Led the discovery of 5+ novel findings about 4 SOTA model pipelines (Stable Diffusion XL, DALL-E 3, etc.), including the inability to generate atypical physical conditions and to distinguish between different uses of a word across contexts
Extracurricular activities
REDPhilosophy content creator
- Created popular philosophy content for the general public, gaining 25,000 followers within 5 months and surpassing 93% of channels in growth speed
- Produced animations, articles, and cartoons every two days, averaging over 1,000 interactions per post across social media