Uncategorized – Cristina Garbacea

Personalized Benchmarking: Evaluating LLMs by Individual Preferences

With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using Elo ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual rankings of LLMs diverge dramatically from aggregate LLM rankings, with Bradley-Terry correlations averaging only ρ = 0.04 (57% of users show near-zero or negative correlation) and Elo ratings showing moderate correlation (ρ = 0.43). Through topic modeling and style analysis, we find users exhibit substantial heterogeneity in topical interests and communication styles, influencing their model preferences. We further show that a compact combination of topic and style features provides a useful feature space for predicting user-specific model rankings. Our results provide strong quantitative evidence that aggregate benchmarks fail to capture individual preferences for most users, and highlight the importance of developing personalized benchmarks that rank LLMs according to individual user preferences.
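To make the comparison between personal and aggregate rankings concrete, here is a minimal sketch of the idea, not the pipeline from the paper. It assumes pairwise battles stored as (model_a, model_b, winner) tuples, a standard sequential Elo update with K = 32, no tied votes, and Spearman rank correlation as the agreement measure. The function names and the toy data are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def elo_ratings(battles, k=32.0, base=1000.0):
    """Sequential Elo update over (model_a, model_b, winner) records.

    Assumption: every battle has a winner (ties are not handled).
    """
    ratings = defaultdict(lambda: base)
    for a, b, winner in battles:
        # Expected score of model a under the Elo logistic curve.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if winner == a else 0.0
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * (expected_a - score_a)
    return dict(ratings)


def spearman(x, y):
    """Spearman rank correlation for equal-length lists without tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order):
            result[index] = rank
        return result

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy data (made up for illustration): two users with opposite preferences.
all_battles = {
    "user_1": [("m1", "m2", "m1"), ("m2", "m3", "m2"), ("m1", "m3", "m1")],
    "user_2": [("m1", "m2", "m2"), ("m2", "m3", "m3"), ("m1", "m3", "m3")],
}

# Aggregate ratings pool every user's battles; personal ratings use one user's.
aggregate = elo_ratings([b for bs in all_battles.values() for b in bs])
models = sorted(aggregate)
for user, user_battles in all_battles.items():
    personal = elo_ratings(user_battles)
    rho = spearman([personal[m] for m in models], [aggregate[m] for m in models])
    print(user, round(rho, 2))
```

A per-user correlation near zero or below would indicate that the aggregate leaderboard says little about that user's own ordering of the models.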

For more details, please check our paper.

HyPerAlign: Interpretable Personalized LLM Alignment via Hypothesis Generation

Alignment algorithms are widely used to align large language models (LLMs) to human users based on preference annotations. Typically, these (often divergent) preferences are aggregated over a diverse set of users, resulting in fine-tuned models that are aligned to the “average-user” preference. Nevertheless, current models are used by individual users in very specific contexts and situations, emphasizing the need for user-dependent preference control. In this work we address the problem of personalizing LLM outputs to their users. We aim to generate customized responses tailored to specific individuals instead of generic outputs that emulate the collective voices of diverse populations. We propose HyPerAlign, an interpretable and sample-efficient hypothesis-driven personalization approach for LLMs. Given few-shot examples written by a particular user, we first infer hypotheses about their communication strategies, personality, and writing style, then prompt LLMs with these hypotheses and user-specific attributes to generate customized outputs. We conduct experiments on two different personalization tasks, namely authorship attribution and deliberative alignment, with datasets from diverse domains (news articles, blog posts, emails, jailbreaking benchmarks). Results demonstrate the superiority of hypothesis-driven LLM personalization compared to preference-based fine-tuning methods. For authorship attribution, HyPerAlign generations have consistently high win-rates (commonly > 90%) against state-of-the-art preference fine-tuning approaches across diverse user profiles and LLMs. For deliberative alignment, the helpfulness of LLMs is improved by up to 70% on average. The inferred hypotheses are of high quality and generalize across models and to out-of-distribution datasets. Overall, HyPerAlign represents an interpretable and sample-efficient strategy for the personalization of LLMs to individual users.
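The two-step flow described above (infer hypotheses from a few user-written examples, then condition generation on them) can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the HyPerAlign implementation: the prompt wording is invented, and `llm` is a hypothetical caller-supplied function that maps a prompt string to a completion string.

```python
from typing import Callable, List


def hypothesis_prompt(user_texts: List[str]) -> str:
    """Step 1 prompt: ask the model to describe how this user writes."""
    examples = "\n\n".join(
        f"Example {i + 1}:\n{text}" for i, text in enumerate(user_texts)
    )
    return (
        "The following texts were written by the same person.\n\n"
        f"{examples}\n\n"
        "List hypotheses about this person's communication strategies, "
        "personality, and writing style."
    )


def personalized_prompt(task: str, hypotheses: str) -> str:
    """Step 2 prompt: condition the task on the inferred hypotheses."""
    return (
        "Hypotheses about the user:\n"
        f"{hypotheses}\n\n"
        f"Task: {task}\n"
        "Respond in a way that is consistent with the hypotheses above."
    )


def personalize(task: str, user_texts: List[str], llm: Callable[[str], str]) -> str:
    """Run both steps with a caller-supplied (hypothetical) LLM function."""
    hypotheses = llm(hypothesis_prompt(user_texts))
    return llm(personalized_prompt(task, hypotheses))
```

Because the hypotheses are plain text, they can be inspected or edited before the second call, which is one way to read the interpretability claim in the abstract.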

For more details, please check our paper.

Evaluating the Goal-Directedness of Large Language Models

To what extent do LLMs use their capabilities towards their given goal? We take this as a measure of their goal-directedness. We evaluate goal-directedness on tasks that require information gathering, cognitive effort, and plan execution, where we use subtasks to infer each model’s relevant capabilities. Our evaluations of LLMs from Google DeepMind, OpenAI, and Anthropic show that goal-directedness is relatively consistent across tasks, differs from task performance, and is only moderately sensitive to motivational prompts. Notably, most models are not fully goal-directed. We hope our goal-directedness evaluations will enable better monitoring of LLM progress, and enable more deliberate design choices of agentic properties in LLMs.

For more details, please see our paper.

Why is constrained neural language generation particularly challenging?

Recent advances in deep neural language models combined with the capacity of large-scale datasets have accelerated the development of natural language generation systems that produce fluent and coherent texts (to various degrees of success) in a multitude of tasks and application contexts. However, controlling the output of these models for specific user and task needs is still an open challenge. This is crucial not only to customizing the content and style of the generated language, but also to their safe and reliable deployment in the real world. We present an extensive survey on the emerging topic of constrained neural language generation in which we formally define and categorize the problems of natural language generation by distinguishing between conditions and constraints (the latter being testable conditions on the output text instead of the input), present constrained text generation tasks, and review existing methods and evaluation metrics for constrained text generation. Our aim is to highlight recent progress and trends in this emerging field, informing on the most promising directions and limitations towards advancing the state-of-the-art of constrained neural language generation research.

For more details, please see our Transactions on Machine Learning Research (TMLR) 2025 paper: https://openreview.net/pdf?id=Vwgjk5ysWn

PhD Thesis is Publicly Available!

I am incredibly excited to share that I have recently completed my PhD degree in Computer Science and Engineering at the University of Michigan! I am immensely grateful to my PhD advisor Prof. Qiaozhu Mei for his unwavering support and wise advice throughout the years, as well as to my amazing committee members Prof. Joyce Chai, Prof. Emily Mower Provost, Prof. Kevyn Collins-Thompson and Prof. Lu Wang for their advice and feedback! My PhD thesis titled “Neural Language Generation for Content Adaptation: Explainable, Efficient Low-Resource Text Simplification and Evaluation” is publicly available.

I will be continuing my research as a Postdoctoral Scholar at the University of Chicago, Data Science Institute. Looking forward to the journey ahead!

Adapting Pre-trained Language Models to Low-Resource Text Simplification: The Path Matters @CoLLAs 2022

Our long paper “Adapting Pre-trained Language Models to Low-Resource Text Simplification: The Path Matters” by Cristina Garbacea and Qiaozhu Mei has been accepted at the 1st Conference on Lifelong Learning Agents (CoLLAs), which is held in Montreal, Canada between 18th–23rd August 2022. If you are attending the conference, please stop by on Thursday August 18th, 11 am – 2 pm to learn more about our work. Please see the abstract of the paper below:

“We frame the problem of text simplification from a task and domain adaptation perspective, where neural language models are pre-trained on large-scale corpora and then adapted to new tasks in different domains through limited training examples. We investigate the performance of two popular vehicles of task and domain adaptation: meta-learning and transfer learning (in particular fine-tuning), in the context of low-resource text simplification that involves a diversity of tasks and domains. We find that when directly adapting a Web-scale pre-trained language model to low-resource text simplification tasks, fine-tuning based methods present a competitive advantage over meta-learning approaches. Surprisingly, adding an intermediate stop in the adaptation path between the source and target, an auxiliary dataset and task that allow for the decomposition of the adaptation process into multiple steps, significantly increases the performance of the target task. The performance is however sensitive to the selection and ordering of the adaptation strategy (task adaptation vs. domain adaptation) in the two steps. When such an intermediate dataset is not available, one can build a “pseudostop” using the target domain/task itself. Our extensive analysis serves as a preliminary step towards bridging these two popular paradigms of few-shot adaptive learning and towards developing more structured solutions to task/domain adaptation in a novel setting.”

For more details please see our paper, talk, slides and poster.

Explainable Prediction of Text Complexity: The Missing Preliminaries for Text Simplification @ACL-IJCNLP 2021

Our long paper “Explainable Prediction of Text Complexity: The Missing Preliminaries for Text Simplification” by Cristina Garbacea, Mengtian Guo, Samuel Carton and Qiaozhu Mei has been accepted at the ACL-IJCNLP 2021 main conference, which is held in Bangkok, Thailand, during August 1-6, 2021. If you are attending the conference, please stop by “Session 1H: Machine Learning for NLP” on August 2nd. Please see the abstract of our paper below:

“Text simplification reduces the language complexity of professional content for accessibility purposes. End-to-end neural network models have been widely adopted to directly generate the simplified version of input text, usually functioning as a blackbox. We show that text simplification can be decomposed into a compact pipeline of tasks to ensure the transparency and explainability of the process. The first two steps in this pipeline are often neglected: 1) to predict whether a given piece of text needs to be simplified, and 2) if yes, to identify complex parts of the text. The two tasks can be solved separately using either lexical or deep learning methods, or solved jointly. Simply applying explainable complexity prediction as a preliminary step, the out-of-sample text simplification performance of the state-of-the-art, black-box simplification models can be improved by a large margin.”
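The two preliminary steps named in the abstract can be illustrated with a toy lexical sketch. The heuristics here (word length as a proxy for complexity, a fixed ratio threshold) and all function names are assumptions made for illustration only; they are not the lexical or deep learning methods evaluated in the paper, and `simplifier` stands in for any downstream simplification model.

```python
import re


def complex_words(text, max_len=8):
    """Step 2 (toy version): flag words longer than max_len as complex."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if len(w) > max_len]


def needs_simplification(text, max_len=8, min_ratio=0.2):
    """Step 1 (toy version): text is complex if enough of its words are long."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return False
    return len(complex_words(text, max_len)) / len(words) >= min_ratio


def simplify_pipeline(text, simplifier):
    """Only call the (hypothetical) simplifier when step 1 says it is needed."""
    if not needs_simplification(text):
        return text  # Already simple: leave the input untouched.
    return simplifier(text, complex_words(text))
```

Gating the simplifier this way keeps already-simple inputs unchanged, and the flagged words show why a given text was sent for simplification.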

For more details please check our paper, poster, slides and longer / shorter talk.

Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation @EMNLP-IJCNLP 2019

Our long paper “Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation” by Cristina Garbacea, Samuel Carton, Shiyan Yan and Qiaozhu Mei will be presented at the upcoming EMNLP-IJCNLP 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing taking place November 3-7 in Hong Kong, China. Please see the abstract of our paper below:

“Recent advances in deep learning have resulted in a resurgence in the popularity of natural language generation (NLG). Many deep learning based models, including recurrent neural networks and generative adversarial networks, have been proposed and applied to generating various types of text. Despite the fast development of methods, how to better evaluate the quality of these natural language generators remains a significant challenge. We conduct an in-depth empirical study to evaluate the existing evaluation methods for natural language generation. We compare human-based evaluators with a variety of automated evaluation procedures, including discriminative evaluators that measure how well the generated text can be distinguished from human-written text, as well as text overlap metrics that measure how similar the generated text is to human-written references. We measure to what extent these different evaluators agree on the ranking of a dozen state-of-the-art generators for online product reviews. We find that human evaluators do not correlate well with discriminative evaluators, leaving a bigger question of whether adversarial accuracy is the correct objective for natural language generation. In general, distinguishing machine-generated text is a challenging task even for human evaluators, and their decisions tend to correlate better with text overlap metrics. We also find that diversity is an intriguing metric that is indicative of the assessments of different evaluators.”

If you are attending the conference, do not miss the Machine Learning session on Wednesday November 6th to learn more about our large-scale study on the evaluation of neural language models. The poster is also available online at this location.
Feel free to get in touch with any questions!

Low Bit-rate Speech Coding With VQ-VAE and a WaveNet Decoder

“Low Bit-rate Speech Coding With VQ-VAE and a WaveNet Decoder” by Cristina Garbacea, Aaron van den Oord, Yazhe Li, Felicia S C Lim, Alejandro Luebs, Oriol Vinyals and Thomas C Walters has been accepted at ICASSP 2019 and will be presented this week at the conference in Brighton, UK. The work was carried out during my internship with Google DeepMind. I am posting the abstract of the paper below:

In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits perceptual quality which is around halfway between the MELP codec at 2.4 kbps and AMR-WB codec at 23.05 kbps. In addition, when training on high-quality recorded speech with the test speaker included in the training set, a model coding speech at 1.6 kbps produces output of similar perceptual quality to that generated by AMR-WB at 23.05 kbps.

For more details please check the paper and the poster.

Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

“Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation” by Cristina Garbacea, Samuel Carton, Shiyan Yan and Qiaozhu Mei is available online now at this location.

Recent advances in deep learning have resulted in a resurgence in the popularity of natural language generation (NLG). Many deep learning based models, including recurrent neural networks and generative adversarial networks, have been proposed and applied to generating various types of text. Despite the fast development of methods, how to better evaluate the quality of these natural language generators remains a significant challenge. We conduct an in-depth empirical study to evaluate the existing evaluation methods for natural language generation. We compare human-based evaluators with a variety of automated evaluation procedures, including discriminative evaluators that measure how well the generated text can be distinguished from human-written text, as well as text overlap metrics that measure how similar the generated text is to human-written references. We measure to what extent these different evaluators agree on the ranking of a dozen state-of-the-art generators for online product reviews. We find that human evaluators do not correlate well with discriminative evaluators, leaving a bigger question of whether adversarial accuracy is the correct objective for natural language generation. In general, distinguishing machine-generated text is a challenging task even for human evaluators, and their decisions tend to correlate better with text overlap metrics. We also find that diversity is an intriguing metric that is indicative of the assessments of different evaluators.

For more details please check the paper.