
Applied ML

NLP for Economic Text Analysis

Text-as-data methods used in economics and finance: dictionary scoring of central-bank statements, topic models on Fed minutes, and transformer embeddings for financial sentiment, with the measurement-validity caveats that determine whether the proxy is interpretable.

Advanced · Tier 3 · ~15 min

Why This Matters

Central-bank statements, earnings calls, 10-K filings, and policy minutes are high-stakes text. A small change in wording in an FOMC statement moves trillions of dollars in fixed-income markets within minutes. Quantifying the content of that text is now a standard step in macro and finance research, and most of the early methods were built before transformers existed. The contemporary stack mixes 1940s-style dictionary scoring with 2020s-style fine-tuned transformers, often in the same paper, because each method buys different things.

The hard part is rarely the model. It is the measurement-validity argument that connects "score produced by model X on document Y" to "the economic quantity I care about." Without that argument, an impressive R² in a sentiment-vs-returns regression may be measuring document length, sector composition, or boilerplate copy instead.

Core Ideas

Dictionary methods. A dictionary assigns each word in a curated list a sign or weight, and the document score is a linear function of word counts. The Loughran-McDonald financial dictionary (Loughran and McDonald 2011, Journal of Finance 66) replaced the Harvard-IV General Inquirer for finance text after showing that "liability," "tax," and "cost" carry different sentiment in 10-Ks than in general English. Hansen, McMahon, and Prat (2018, QJE 133) score FOMC statements with policy-domain dictionaries to recover hawkish-dovish positions. The strengths: interpretability, determinism, auditability. The weakness: insensitivity to negation, syntax, and context.
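The scoring rule above can be sketched in a few lines. This is a minimal illustration of the linear-in-word-counts idea, not the actual Loughran-McDonald implementation; the word lists here are tiny hypothetical stand-ins for the real LM lists.

```python
# Minimal sketch of dictionary sentiment scoring, Loughran-McDonald style.
# NEGATIVE and POSITIVE are illustrative stand-ins, not the real LM lists.
import re

NEGATIVE = {"loss", "impairment", "litigation", "adverse"}
POSITIVE = {"growth", "profit", "improvement"}

def dictionary_score(text: str) -> float:
    """Net tone: (positive count - negative count) / total word count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

print(dictionary_score("Revenue growth and profit improvement"))  # 0.6
```

Every input-output pair can be audited by hand, which is exactly the property the dictionary approach buys.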

Topic models on policy text. Latent Dirichlet Allocation on Fed minutes recovers latent themes (financial conditions, labor markets, foreign sector) and tracks their prevalence over time. Hansen and McMahon (2016, Journal of International Economics 99) use LDA to separate economic-condition content from forward-guidance content in FOMC communications, finding that forward-guidance shocks have larger effects on real variables than economic-condition shocks. LDA is unsupervised, and because estimation is stochastic, topic stability across runs requires careful seeding and robustness checks.

Transformer embeddings and FinBERT. FinBERT (Araci 2019, arXiv 1908.10063) is BERT further pre-trained on a financial corpus and fine-tuned for sentiment classification on the Financial PhraseBank. Compared to dictionary methods, FinBERT handles negation, sarcasm, and context, and typically improves classification accuracy by 5 to 15 points on financial sentiment benchmarks. The cost is opacity: a sentiment score from a transformer cannot be audited word by word the way a dictionary score can.
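The negation gap between the two approaches can be made concrete with a toy bag-of-words scorer. The word list is hypothetical; the point is only that counting "loss" ignores the "did not" that precedes it, which is the context a sequence model like FinBERT can use.

```python
# Why bag-of-words scoring misreads negation: "loss" counts as negative
# even when the sentence says no loss occurred. Word list is illustrative.
NEGATIVE = {"loss", "decline", "impairment"}

def bag_of_words_tone(text: str) -> int:
    tokens = text.lower().replace(".", "").split()
    return -sum(t in NEGATIVE for t in tokens)

s = "The company did not report a loss this quarter."
print(bag_of_words_tone(s))  # -1: scored negative despite the negation
```

A transformer classifier conditions on the whole token sequence, so "did not report a loss" and "reported a loss" receive different representations; the dictionary scorer sees the same count either way.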

Measurement validity is the bottleneck. A sentiment score is only useful to an economist if it maps onto an interpretable construct. Two threats recur. First, document length and boilerplate dominate raw scores; standard practice is to control for these explicitly. Second, the proxy may correlate with the outcome of interest through a confounding channel: a "hawkish" score on FOMC text may be capturing macroeconomic conditions already known to the market rather than the surprise component of the statement. The credibility of any text-as-data result rests on the identification argument that ties the score to the construct.
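One standard response to the second threat is to residualize the text score on macro conditions already known to the market, keeping only the orthogonal "surprise" component. A sketch with simulated data; all variable names and the data-generating process are hypothetical, and in practice the controls come from real-time data vintages.

```python
# Sketch: extract the "surprise" component of a hawkishness score by
# residualizing on macro conditions known before the statement.
# Data are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200
macro = rng.normal(size=(n, 2))                    # e.g. inflation, unemployment
surprise = rng.normal(size=n)                      # unobserved true surprise
score = macro @ np.array([0.8, -0.5]) + surprise   # raw text score, confounded

X = np.column_stack([np.ones(n), macro])           # add intercept
beta, *_ = np.linalg.lstsq(X, score, rcond=None)   # OLS of score on controls
residual = score - X @ beta                        # orthogonal to macro controls

# The residual, not the raw score, is the candidate policy surprise
print(np.corrcoef(residual, surprise)[0, 1])       # close to 1 by construction
```

The regression of returns on `residual` then carries the identification argument; the same regression on the raw `score` would partly recover the market's reaction to already-known conditions.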

Common Confusions

Watch Out

Higher transformer accuracy is not better economics

FinBERT can beat Loughran-McDonald on sentiment classification while being worse for an economic application. If the construct of interest is "tone the central bank intended to project," interpretability and stability across training runs may matter more than two extra points of F1. Pick the model that supports the inference, not the leaderboard.

Watch Out

Dictionary scores need length and boilerplate controls

A 10-K with twice as many words as the prior year will mechanically have more positive and negative words. Comparing raw counts across documents or across years without normalization measures filing length, not sentiment. The standard fix is per-word frequencies plus year and firm fixed effects.
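The per-word normalization can be sketched directly: two filings with identical negative-word counts but very different lengths get the same raw count and very different rates. The word list is again an illustrative stand-in.

```python
# Sketch of length normalization: compare negative-word frequency per word,
# not raw counts, across filings of different lengths.
# NEGATIVE is an illustrative stand-in for the real dictionary.
import re

NEGATIVE = {"loss", "impairment", "adverse"}

def neg_rate(text: str) -> float:
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t in NEGATIVE for t in tokens) / max(len(tokens), 1)

short = "adverse loss"                         # 2 words, 2 negative hits
long = "adverse loss " + "boilerplate " * 98   # 100 words, same 2 hits

print(neg_rate(short), neg_rate(long))  # 1.0 vs 0.02: same raw count, very different tone
```

Year and firm fixed effects then absorb the remaining level differences in disclosure style that per-word rates alone do not remove.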


Last reviewed: April 18, 2026
