02/05/2026 | Press release | Distributed by Public on 02/05/2026 11:03
Recent research suggests that generic large language models (LLMs) can match the accuracy of traditional methods when forecasting macroeconomic variables in pseudo out-of-sample settings generated via prompts. This paper assesses the out-of-sample forecasting accuracy of LLMs by eliciting real-time forecasts of U.S. inflation from ChatGPT. We find that out-of-sample predictions are largely inaccurate and stale, even though forecasts generated in pseudo out-of-sample environments are comparable to existing benchmarks. Our results underscore the importance of out-of-sample benchmarking for LLM predictions.
Suggested citation:
Alam, M. Jahangir, Shane Boyle, Huiyu Li, and Tatevik Sekhposyan. 2026. "ChatMacro: Evaluating Inflation Forecasts of Generative AI." Federal Reserve Bank of San Francisco Working Paper 2026-04. https://doi.org/10.24148/wp2026-04