Hi! 👋🏻 I’m Yuxuan (Leo) Lu, a Ph.D. student at Northeastern University, advised by Prof. Dakuo Wang. I’m currently working as an Applied Scientist intern at Amazon.
My current research focuses on developing and leveraging Large Language Model Agents (LLM Agents) to simulate human behaviors, including using LLM Agents in usability testing [1], A/B testing [2], training LLM Agents toward accurate simulation of human behaviors [3], and more.
Before starting my Ph.D. program, I received my B.E. in Computer Science and Technology, graduating with honors from Beijing University of Technology. In the past, I worked as a Machine Learning Researcher in a joint program between LinkedIn and Microsoft Research Asia. I’ve also worked as an intern research assistant at the THUNLP lab, supervised by Prof. Zhiyuan Liu (刘知远).
Picture of me, taken at Sayram Lake (赛里木湖)
Education
I’m currently pursuing my Ph.D. in Computer Science at Khoury College of Computer Sciences, Northeastern University, advised by Prof. Dakuo Wang.
I received my B.E. in Computer Science and Technology, graduating with honors from Beijing University of Technology. Before that, I attended junior and senior high school at Beijing National Day School (北京市十一学校).
Preprints
2025
UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents
Apr 2025
Usability testing is a fundamental research method that user experience (UX) researchers use to evaluate and iterate a web design, but how to evaluate and iterate the usability testing study design itself? Recent advances in Large Language Model-simulated Agent (LLM Agent) research inspired us to design UXAgent to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human-subject study. Our system features a Persona Generator module, an LLM Agent module, and a Universal Browser Connector module to automatically generate thousands of simulated users to interactively test the target website. The system also provides an Agent Interview Interface and a Video Replay Interface so that UX researchers can easily review and analyze the generated qualitative and quantitative log data. Through a heuristic evaluation, five UX researcher participants praised the innovation of our system but also expressed concerns about the future of LLM Agent usage in UX studies.
AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents
Dakuo Wang, Ting-Yao Hsu,
Yuxuan Lu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headean,
Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, and Jessie Wang
Apr 2025
A/B testing is a widely adopted method for evaluating UI/UX design decisions in modern web applications. Yet, traditional A/B testing remains constrained by its dependence on large-scale, live traffic of human participants and the long wait for testing results. Through formative interviews with six experienced industry practitioners, we identified critical bottlenecks in current A/B testing workflows. In response, we present AgentA/B, a novel system that leverages Large Language Model-based autonomous agents (LLM Agents) to automatically simulate user interaction behaviors with real webpages. AgentA/B enables scalable deployment of LLM agents with diverse personas, each capable of navigating the dynamic webpage and interactively executing multi-step interactions like search, clicking, filtering, and purchasing. In a demonstrative controlled experiment, we employ AgentA/B to simulate a between-subject A/B test with 1,000 LLM agents on Amazon.com, and compare agent behaviors with real human shopping behaviors at scale. Our findings suggest AgentA/B can emulate human-like behavior patterns.
RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care
Ziqi Yang,
Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao,
Dakuo Wang,
Bingsheng Yao, and Nawar Shara
Feb 2025
Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group of cancers that account for more than 35% of cancer-related deaths worldwide, but postoperative complications are unpredictable and can be life-threatening. In this paper, we investigate how recent advancements in large language models (LLMs) can benefit remote patient monitoring (RPM) systems through clinical integration by designing RECOVER, an LLM-powered RPM system for postoperative GI cancer care. To closely engage stakeholders in the design process, we first conducted seven participatory design sessions with five clinical staff and interviewed five cancer patients to derive six major design strategies for integrating clinical guidelines and information needs into LLM-based RPM systems. We then designed and implemented RECOVER, which features an LLM-powered conversational agent for cancer patients and an interactive dashboard for clinical staff to enable efficient postoperative RPM. Finally, we used RECOVER as a pilot system to assess the implementation of our design strategies with four clinical staff and five patients, providing design implications by identifying crucial design elements, offering insights on responsible AI, and outlining opportunities for future LLM-powered RPM systems.
LLM Agents That Act Like Us: Accurate Human Behavior Simulation with Real-World Data
Yuxuan Lu, Jing Huang, Yan Han, Bennet Bei, Yaochen Xie,
Dakuo Wang, Jessie Wang, and Qi He
Apr 2025
Recent research shows that LLMs can simulate “believable” human behaviors to power LLM agents via prompt-only methods. In this work, we focus on evaluating and improving LLM’s objective “accuracy” rather than the subjective “believability” in the web action generation task, leveraging a large-scale, real-world dataset collected from online shopping human actions. We present the first comprehensive quantitative evaluation of state-of-the-art LLMs (e.g., DeepSeek-R1, Llama, and Claude) on the task of web action generation. Our results show that fine-tuning LLMs on real-world behavioral data substantially improves their ability to generate actions compared to prompt-only methods. Furthermore, incorporating synthesized reasoning traces into model training leads to additional performance gains, demonstrating the value of explicit rationale in behavior modeling. This work establishes a new benchmark for evaluating LLMs in behavior simulation and offers actionable insights into how real-world action data and reasoning augmentation can enhance the fidelity of LLM agents.
2024
From Dark Data to Open Data: Challenges and Practices for Data Integrators of Data-Driven Open Science Projects in Geoscience
In Submission to CSCW 2025, Apr 2024
Exploring Domain Adaptation with LLMs for Real-World Augmented Question Answer Generation (RA-QAG) in Children Storytelling
In Submission to EMNLP 2024, Apr 2024
ALERTS: Active Learning and Ensemble LLM Real-Time Switch for Real-World Data Drift Challenges
In Submission to EMNLP 2024, Apr 2024
2023
Human Still Wins over LLM: An Empirical Study of Active Learning on Domain-Specific Annotation Tasks
arXiv preprint arXiv:2311.09825, Apr 2023
Large Language Models (LLMs) have demonstrated considerable advances, and several claims have been made about their exceeding human performance. However, in real-world tasks, domain knowledge is often required. Low-resource learning methods like Active Learning (AL) have been proposed to tackle the cost of domain expert annotation, raising this question: Can LLMs surpass compact models trained with expert annotations in domain-specific tasks? In this work, we conduct an empirical experiment on four datasets from three different domains, comparing SOTA LLMs with small models trained on expert annotations with AL. We found that small models can outperform GPT-3.5 with a few hundred labeled data points, and that they achieve higher or similar performance to GPT-4 despite being hundreds of times smaller. Based on these findings, we posit that LLM predictions can be used as a warmup method in real-world applications, and that human experts remain indispensable in tasks involving data annotation driven by domain-specific knowledge.
Publications
2025
Characterizing LLM-Empowered Personalized Story Reading and Interaction for Children: Insights From Multi-Stakeholder Perspectives
In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Apr 2025
Personalized interaction is highly valued by parents in their story-reading activities with children. While AI-empowered story-reading tools have been increasingly used, their abilities to support personalized interaction with children are still limited. Recent advances in large language models (LLMs) show promise in facilitating personalized interactions, but little is known about how to effectively and appropriately use LLMs to enhance children’s personalized story-reading experiences. This work explores this question through a design-based study. Drawing on a formative study, we designed and developed StoryMate, an LLM-empowered personalized interactive story-reading tool for children, followed by an empirical study with children, parents, and education experts. Our participants valued the personalized features in StoryMate, and also highlighted the need to support personalized content, guiding mechanisms, reading context variations, and interactive interfaces. Based on these findings, we propose a series of design recommendations for better using LLMs to empower children’s personalized story reading and interaction.
UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design
In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Apr 2025
Usability testing is a fundamental yet challenging research method for user experience (UX) researchers to evaluate a web design. Recent advances in Large Language Model-simulated Agent (LLM Agent) research inspired us to design UXAgent to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human-subject study. Our system features an LLM Agent module and a universal browser connector module so that UX researchers can automatically generate thousands of simulated users to test the target website. The system can generate UX study results in qualitative (e.g., interviewing how an agent thinks), quantitative (e.g., # of actions), and video recording formats for UX researchers to analyze. Through a heuristic user evaluation with five UX researchers, participants praised the innovation of our system but also expressed concerns about the future of UX study with LLM Agents.
2024
More Samples or More Prompt Inputs? Exploring Effective In-Context Sampling for LLM Few-Shot Prompt Engineering
In Findings of the Association for Computational Linguistics: NAACL 2024, Apr 2024
While most existing works on LLM prompt-engineering focus only on how to select a better set of data samples inside one single prompt input (In-Context Learning or ICL), why can’t we design and leverage multiple prompt inputs together to further improve the LLM performance? In this work, we propose In-Context Sampling (ICS), a low-resource LLM prompt-engineering technique to produce the most confident prediction results by optimizing the construction of multiple ICL prompt inputs. Extensive experiments with two SOTA LLMs (FlanT5-XL and Mistral-7B) on three NLI datasets (e-SNLI, Multi-NLI, and ANLI) illustrate that ICS can consistently enhance LLM’s prediction performance and confidence. An ablation study suggests that a diversity-based ICS strategy may further improve LLM’s performance, which sheds light on a new yet promising future research direction.
Professional Network Matters: Connections Empower Person-Job Fit
Hao Chen,
Lun Du,
Yuxuan Lu, Qiang Fu, Xu Chen, Shi Han, Yanbin Kang, Guangming Lu, and Zi Li
In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Apr 2024
Online recruitment platforms typically employ Person-Job Fit models in the core service that automatically match suitable job seekers with appropriate job positions. While existing works leverage historical or contextual information, they often disregard a crucial aspect: job seekers’ social relationships in professional networks. This paper emphasizes the importance of incorporating professional networks into the Person-Job Fit model. Our innovative approach consists of two stages: (1) defining a Workplace Heterogeneous Information Network (WHIN) to capture heterogeneous knowledge, including professional connections and pre-training representations of various entities using a heterogeneous graph neural network; (2) designing a Contextual Social Attention Graph Neural Network (CSAGNN) that supplements users’ missing information with professional connections’ contextual information. We introduce a job-specific attention mechanism in CSAGNN to handle noisy professional networks, leveraging pre-trained entity representations from WHIN. We demonstrate the effectiveness of our approach through experimental evaluations conducted across three real-world recruitment datasets from LinkedIn, showing superior performance compared to baseline models.
Rethinking Human-AI Collaboration in Complex Medical Decision Making: A Case Study in Sepsis Diagnosis
Shao Zhang, Jianing Yu,
Xuhai Xu, Changchang Yin,
Yuxuan Lu,
Bingsheng Yao, Melanie Tory, Lace M. Padilla, Jeffrey Caterino, Ping Zhang, and
Dakuo Wang
In Proceedings of the CHI Conference on Human Factors in Computing Systems, Apr 2024
Today’s AI systems for medical decision support often succeed on benchmark datasets in research papers but fail in real-world deployment. This work focuses on the decision making of sepsis, an acute life-threatening systemic infection that requires an early diagnosis with high uncertainty from the clinician. Our aim is to explore the design requirements for AI systems that can support clinical experts in making better decisions for the early diagnosis of sepsis. The study begins with a formative study investigating why clinical experts abandon an existing AI-powered sepsis predictive module in their electronic health record (EHR) system. We argue that a human-centered AI system needs to support human experts in the intermediate stages of a medical decision-making process (e.g., generating hypotheses or gathering data), instead of focusing only on the final decision. Therefore, we build SepsisLab based on a state-of-the-art AI algorithm and extend it to predict the future projection of sepsis development, visualize the prediction uncertainty, and propose actionable suggestions (i.e., which additional laboratory tests can be collected) to reduce such uncertainty. Through heuristic evaluation with six clinicians using our prototype system, we demonstrate that SepsisLab enables a promising human-AI collaboration paradigm for the future of AI-assisted sepsis diagnosis and other high-stakes medical decision making.
StorySpark: Expert-Annotated QA Pairs with Real-World Knowledge for Children Storytelling
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Apr 2024
Interactive story reading is a common parent-child activity, where parents expect to teach both language skills and real-world knowledge beyond the story. While increasing numbers of storytelling and reading systems have been developed for this activity, they often fail to infuse real-world knowledge into the conversation. This limitation can be attributed to the existing question-answering (QA) datasets used for children’s education, upon which the systems are built, failing to capture the nuances of how education experts think when conducting interactive story reading activities. To bridge this gap, we design an annotation framework, empowered by an existing knowledge graph, to capture experts’ annotations and thinking process, and leverage this framework to construct the StorySparkQA dataset, which comprises 5,868 expert-annotated QA pairs with real-world knowledge. We conduct automated and human expert evaluations across various QA pair generation settings to demonstrate that our StorySparkQA can effectively support models in generating QA pairs that target real-world knowledge beyond story content. StorySparkQA is available at https://huggingface.co/datasets/NEU-HAI/StorySparkQA.
2023
Beyond Labels: Empowering Human Annotators with Natural Language Explanations through a Novel Active-Learning Architecture
Bingsheng Yao,
Ishan Jindal,
Lucian Popa,
Yannis Katsis, Sayan Ghosh, Lihong He,
Yuxuan Lu, Shashank Srivastava, Yunyao Li,
James Hendler, and
Dakuo Wang
In Findings of the Association for Computational Linguistics: EMNLP 2023, Dec 2023
Real-world domain experts (e.g., doctors) rarely annotate only a decision label in their day-to-day workflow without providing explanations. Yet, existing low-resource learning techniques, such as Active Learning (AL), that aim to support human annotators mostly focus on the label while neglecting the natural language explanation of a data point. This work proposes a novel AL architecture to support experts’ real-world need for label and explanation annotations in low-resource scenarios. Our AL architecture leverages an explanation-generation model to produce explanations guided by human explanations, a prediction model that utilizes generated explanations toward prediction faithfully, and a novel data diversity-based AL sampling strategy that benefits from the explanation annotations. Automated and human evaluations demonstrate the effectiveness of incorporating explanations into AL sampling and the improved human annotation efficiency and trustworthiness with our AL architecture. Additional ablation studies illustrate the potential of our AL architecture for transfer learning, generalizability, and integration with large language models (LLMs). While LLMs exhibit exceptional explanation-generation capabilities for relatively simple tasks, their effectiveness in complex real-world tasks warrants further in-depth study.
Improving Biomedical Question Answering by Data Augmentation and Model Weighting
Yongping Du, Jingya Yan, Yuxuan Lu, Yiliang Zhao, and Xingnan Jin
IEEE/ACM Transactions on Computational Biology and Bioinformatics, Dec 2023
2022
Contextual Embedding and Model Weighting by Fusing Domain Knowledge on Biomedical Question Answering
Yuxuan Lu, Jingya Yan, Zhixuan Qi, Zhongzheng Ge, and Yongping Du
In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Dec 2022
Biomedical Question Answering aims to obtain an answer to a given question from the biomedical domain. Due to its high requirement for biomedical domain knowledge, it is difficult for a model to learn domain knowledge from limited training data. We propose a contextual embedding method that combines the open-domain QA model AoA Reader and the BioBERT model pre-trained on biomedical domain data. We adopt unsupervised pre-training on a large biomedical corpus and supervised fine-tuning on a biomedical question answering dataset. Additionally, we adopt an MLP-based model weighting layer to automatically exploit the advantages of the two models to provide the correct answer. The public dataset BIOMRC, constructed from the PubMed corpus, is used to evaluate our method. Experimental results show that our model outperforms the state-of-the-art system by a large margin.
2021
Dual Model Weighting Strategy and Data Augmentation in Biomedical Question Answering
Yongping Du, Jingya Yan, Yiliang Zhao, Yuxuan Lu, and Xingnan Jin
In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2021
Research Experience
My current research fields include human-AI collaboration and interaction, especially in the area of Large Language Models (LLMs).
I’m currently working as an Applied Scientist intern at Amazon.
Before that, I worked as a Machine Learning Researcher in a joint program between LinkedIn and Microsoft Research Asia, where I studied LinkedIn’s social network data. I’ve also worked as an intern research assistant at the THUNLP lab, supervised by Prof. Zhiyuan Liu (刘知远), where my research focused on knowledge embedding.
Open source communities
I’ve participated in many open-source communities. I’m the maintainer of the VSCode extension LaTeX-Utilities, and I’m the founder and maintainer of the EduOJ project. Furthermore, I’ve contributed to many open-source projects, such as GitLab, UniversalOJ, OI-Wiki, nix, and others.
I participated as a mentor and community leader in the Open Source Promotion Plan (OSPP) 2021, where all three of my students successfully finished their projects. I also participated as a student in OSPP 2020 in the UniversalOJ community and successfully finished my project.
Learn more about my open-source experience here.