I’m a fifth-year PhD student in the Stanford CS department, where I draw on economics to make AI compatible with humans. I was supported by a Facebook Fellowship and an NSERC PGS-D.

Classical economic theory framed humans as homo economicus: perfectly rational agents that are narrowly self-interested and always act optimally. In the real world, however, humans are complex and often "uneconomic", and the last few decades have seen economic models evolve to better capture this reality.

My work argues that the methods currently used to develop, evaluate, and deploy AI make many of the same assumptions about humans that classical economics did. These false assumptions have led to a host of adverse outcomes: datasets much simpler than the tasks they purport to reflect, a culture of hill-climbing on a single performance metric, and models that fail to evolve with their users, among other problems. I borrow from more recent advances in economics to create datasets, models, and evaluation protocols that are compatible with real humans, not just idealized ones.

Recent Work (full list)

People often train their models on suboptimal data that is much simpler than the task they actually want to solve. Why? Not because they are acting against their self-interest, but because determining the value of data is non-trivial. To help create more useful datasets, I proposed an information-theoretic framework for understanding dataset difficulty and then used this framework to create the largest public dataset of human preferences over text (SHP). SHP is used by Amazon AWS (for reranking generations), Microsoft DeepSpeed Chat (to train LLMs), StableVicuna (the first open-source RLHF chatbot), and Llama-2 (one of the most widely used LLMs).
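
In rough terms (the notation here is a sketch, not the exact formulation in the paper), the framework asks how much usable information a dataset's inputs X give a model family V about the labels Y:

    I_V(X → Y) = H_V(Y) − H_V(Y | X)

where H_V(Y | X) is the best expected log-loss a model in V can achieve when it sees the inputs, and H_V(Y) is the same quantity with the inputs withheld. If the gap is small, the inputs add little beyond the label prior, which is one precise sense in which a dataset can be simpler than the task it purports to reflect; the pointwise version of this gap also surfaces which individual examples are hard for that model family.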

User utility is much more complex than a single performance metric, yet AI evaluation often fails to go beyond accuracy on a test set. I introduced the notion of utility-driven evaluation, which measures not only the upside from using a model but also its costs (latency, memory, etc.). Working with researchers at Meta, we developed Dynaboard, a holistic evaluation-as-a-service platform for hosting benchmarks. Dynaboard has been used to host many challenges, including DADC (Dynamic Adversarial Data Collection), DataPerf, BabyLM, and Flores. The concept of utility-driven evaluation has since gained wide acceptance and underlies many benchmarks, such as Stanford's HELM.
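
As a toy illustration (this is not Dynaboard's exact scoring rule), a utility-driven leaderboard might rank models by something like

    utility(m) = accuracy(m) − w_lat · latency(m) − w_mem · memory(m)

where the weights encode how much accuracy a user would trade away for a faster or smaller model. Two models with identical accuracy can then land far apart in the ranking once their costs are priced in, which is exactly the information a single test-set number hides.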

In generative AI, we often ask people to rate model generations; these ratings are then aggregated across individuals and used to align models with human preferences. I showed that this presumes humans are von Neumann-Morgenstern rational, which they rarely are in practice: human preferences are non-static, incomplete, and often intransitive. Not taking this into account can lead to counter-intuitive conclusions, like ChatGPT being no better than models that are 10x smaller. My current work uses more recent developments in modeling preferences, such as prospect theory, to deal with this complexity.
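
Prospect theory, for example, drops the expected-utility assumption and instead posits that people evaluate outcomes relative to a reference point, with losses looming larger than gains; one standard parameterization of its value function is

    v(z) = z^α              if z ≥ 0
    v(z) = −λ · (−z)^β      if z < 0,   with λ > 1 capturing loss aversion.

Treating human ratings of generations as coming from a value function like this, rather than from a perfectly rational utility maximizer, changes which models the aggregated preferences actually favor.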