Hello! I’m Ivi, a third-year PhD candidate at the Max Planck Institute for Software Systems, under the supervision of Dr. Manuel Gomez-Rodriguez. My research interests lie broadly in the area of Human-Centered Machine Learning. Currently, I am working on LLM evaluation and uncertainty.
Before joining MPI-SWS, I received my MEng in Electrical and Computer Engineering from the National Technical University of Athens, where I did my diploma thesis with Prof. Symeon Papavassiliou. During my undergraduate studies, I was fortunate to briefly work at ETH Zürich under Dr. Fanny Yang, and at the National Observatory of Athens, where I contributed to the development of the Flash Detection Software.
My Research: I study how the inherent stochasticity in LLMs can lead to evaluation instability and fairness concerns, and how we can account for and control this stochasticity. For example, inference randomness can skew model rankings [1], obfuscate model biases [2], and cause identical outputs to have arbitrarily different tokenizations and costs [5]. To address such challenges, I develop statistical and causal methods for LLM evaluation and oversight, proposing how to reliably use LLM-as-a-judge [3], speed up evaluation pipelines [1], determine if LLM agents’ tool calls comply with policy manuals [4], and ensure deterministic tokenization and pricing [5]. Overall, my goal is to make LLM systems and evaluations more predictable and trustworthy.
Recent news
July 2026: Our preprint MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents will be presented at the Failure Modes in Agentic AI workshop at ICML 2026.
May 2026: Our paper Evaluation of Large Language Models via Coupled Token Generation was presented at AISTATS 2026.
May 2026: In our new preprint, we show how to use SMT validation to reliably generate policy compliance benchmarks for LLM agents.
Publications
Evaluation of Large Language Models via Coupled Token Generation
AISTATS 2026
also presented at the Building Trust in LLMs and LLM Applications workshop at ICLR 2025
Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez-Rodriguez
[arxiv] [pdf] [code] [poster]
Counterfactual Token Generation in Large Language Models
CLeaR 2025
also presented at the Causality and Large Models workshop at NeurIPS 2024
Ivi Chatzi*, Nina Corvelo Benz*, Eleni Straitouri*, Stratis Tsirtsis*, Manuel Gomez-Rodriguez
[arxiv] [pdf] [code] [poster]
Prediction-Powered Ranking of Large Language Models
NeurIPS 2024
also presented at the Human-centered Evaluation and Auditing of Language Models workshop at CHI 2024
Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, Manuel Gomez-Rodriguez
[arxiv] [pdf] [code] [poster]
MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
Failure Modes in Agentic AI workshop at ICML 2026
Ashwani Anand, Ivi Chatzi, Ritam Raha, Anne-Kathrin Schmuck
[arxiv] [pdf]
Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service
Tokenization workshop at ICML 2025
Ivi Chatzi, Nina Corvelo Benz, Stratis Tsirtsis, Manuel Gomez-Rodriguez
[arxiv] [pdf] [code] [data] [poster]
