Brian R.Y. Huang

I’m a recent graduate from MIT with Bachelor’s degrees in Math & CS as well as a Master’s in CS. I’m broadly interested in understanding failure modes and weaknesses in frontier AI systems in order to make their real-world deployment more robust.

At MIT, I was fortunate to conduct research on adversarial robustness of deep learning models, advised by Hadi Salman and Aleksander Mądry. I was also a teaching assistant for MIT’s flagship graduate-level machine learning class (6.867, now 6.7900) and for the statistical data analysis class (6.3720/6.3722). In past internships and research experiences, I’ve worked on causal intervention methods for mechanistic interpretability at Redwood Research; done quantitative research at JPMorgan Chase and WorldQuant; and investigated differential-geometric properties of black hole formation in general relativity with Marcus Khuri at Stony Brook University. Most recently, I was an AI resident at Haize Labs, doing research and engineering for red-teaming, automated evals, and adversarial robustness.

selected works

  1. Endless Jailbreaks with Bijection Learning
    Brian RY Huang, Maximilian Li, and Leonard Tang
    ICLR, 2025
    Work done as part of Haize Labs.
    We devise a "bijection attack," an encoding scheme taught to a language model in-context that bypasses model alignment and constitutes a highly effective jailbreak. We modulate the complexity of our bijection scheme across models and derive a quadratic scaling law, finding, curiously, that the attack is stronger against more capable models.
  2. Adversarial Learned Soups: Neural Network Averaging for Joint Clean and Robust Performance
    Brian RY Huang
    Master’s Thesis, 2023
    Supervised by Hadi Salman and Aleksander Mądry.
    We introduce weight-space interpolation methods to the adversarial robustness regime, devising a wrapper architecture to optimize the interpolation coefficients of a "model soup" via adversarial training. Varying the intensity of adversarial training (perturbation distance, TRADES weightings, etc.) leads to a smooth tradeoff between the resulting clean and robust accuracy of the interpolated model.
  3. Does It Know?: Probing and Benchmarking Uncertainty in Language Model Latent Beliefs
    Brian RY Huang and Joe Kwon
    ATTRIB Workshop @ NeurIPS, 2023
    We extend Contrast-Consistent Search (Burns et al., 2023) to detect uncertainty in the factual beliefs of language models. We create a toy dataset of timestamped news factoids as a true/false/uncertain classification benchmark for LLMs with a known training cutoff date.
  4. On Sufficient Conditions for Trapped Surfaces in Spherically Symmetric Spacetimes
    Brian RY Huang
    Siemens Competition, 2017
    Some differential geometry / general relativity research on black hole formation that I was fortunate to conduct during high school!