Building Jurors That Read as Real People: Notes From the Persona Creation Process

TL;DR: Creating juror personas that look, act and talk like real human jurors is hard but interesting.

There's a version of this where I built a few hundred good juror profiles, copied most of the same attributes but changed out the front-facing demographics (the info that you’d find on the jury questionnaire form), and shipped a "mock jury simulation." It would have worked, sort of. An attorney would run a case, see 12 distinct jurors, get responses, and conclude something. The conclusion would be both flat and wrong, because the 12 jurors would be variations of the same person.

That's not what plaintiff PI attorneys need. They need a panel that feels like the panel they'll actually face: a retired postal carrier from New Bedford who carries his father's politics, a Lowell DNP whose career didn't become what she expected, a 70-year-old Indo-Caribbean Hindu in Pittsfield whose family history is older than the town. People that specific exist in real jury pools. The product only works if the synthetic versions do too.

This post is about the work of getting that right. It's the kind of work that lives behind the scenes, but it's the heart (or brain?) of the project.

The persona database — 105 columns of psychometric coherence

Each MockJuror is described by 105 attributes spanning demographics (age, race, education, income, county, occupation), personality (HEXACO), moral foundations (MFT), legal authoritarianism (RLAQ), narrative identity orientation (redemption / contamination / neutral / mixed, after McAdams), need for cognition, just-world belief, regulatory focus, life event history, communication style markers, and more.

None of these are sampled independently. The whole point of psychometric research is that these dimensions correlate in patterned ways. Political stance correlates with moral foundations. Income correlates with current financial stress, which correlates with locus of control. Education correlates with need for cognition, which correlates with anchor-sensitivity and story coherence weighting. HEXACO Honesty-Humility correlates with MFT Fairness emphasis. The pipeline that generates these values has to enforce all of those correlations simultaneously and against real census data, without disturbing the correlations already in place.
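One way to picture non-independent sampling is the sketch below. It is a minimal illustration under stated assumptions, not the pipeline's actual code: the field names `political_stance` and `mft_score`, the target correlation of 0.5, and the use of standardized normal scores are all placeholders chosen for the example.

```python
import random


def sample_conditioned(parent: float, target_r: float, rng: random.Random) -> float:
    """Draw a child score that correlates with a standard-normal parent score.

    Mixing the parent with independent noise in these proportions gives a
    child with unit variance and an expected correlation of target_r.
    """
    noise = rng.gauss(0.0, 1.0)
    return target_r * parent + (1.0 - target_r ** 2) ** 0.5 * noise


def pearson(xs, ys) -> float:
    """Plain Pearson correlation, written out to keep the sketch stdlib-only."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def generate_personas(n: int, target_r: float = 0.5, seed: int = 0):
    """Generate n toy personas with one hub field and one dependent field."""
    rng = random.Random(seed)
    personas = []
    for _ in range(n):
        stance = rng.gauss(0.0, 1.0)  # hypothetical hub attribute
        personas.append({
            "political_stance": stance,
            "mft_score": sample_conditioned(stance, target_r, rng),
        })
    return personas
```

Checking `pearson` over the generated columns is a quick way to confirm that the realized correlation lands near the target. A real pipeline would have to do this for many fields at once and condition on census anchors as well, which this sketch does not attempt.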

The current persona-creation pipeline is an 11-stage Python project, with no LLM involved in the psychometric layer. It draws on a bunch of reference data from public sources. Stage 0 normalizes demographics (survey questionnaire stuff). Stage 1 regenerates county-conditioned anchors against Census ACS 2022 data for every county in the US. Stage 1.5 places education and income within those anchors. Stage 1.7 establishes political stance, which becomes the hub feeding roughly 20 downstream fields. Stages 2 through 7 layer in HEXACO, MFT, locus of control, just-world belief, need for cognition, situation tags, numeracy, regulatory focus, narrative identity, attitudes, voter status, community involvement and life event vignettes. Lots of iterations to get here, and more to go.
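The ordering constraint in that staged design (a hub field such as political stance has to exist before anything downstream reads it) can be sketched as follows. This is a hypothetical skeleton, not the project's code: the stage functions, the field names, and the `run_pipeline` helper are all invented for illustration, and the stage bodies are placeholders.

```python
from typing import Callable, Dict, List, Tuple

Persona = Dict[str, object]
# A stage is (label, fields it needs, function returning new fields).
Stage = Tuple[str, List[str], Callable[[Persona], Persona]]


def run_pipeline(persona: Persona, stages: List[Stage]) -> Persona:
    """Apply stages in order, refusing to run a stage whose inputs are missing."""
    result = dict(persona)
    for name, requires, fn in stages:
        missing = [field for field in requires if field not in result]
        if missing:
            raise KeyError(f"stage {name} is missing inputs: {missing}")
        result.update(fn(result))
    return result


def normalize_demographics(p: Persona) -> Persona:
    # Placeholder for Stage 0: tidy up a raw county string.
    return {"county": str(p["county"]).strip().title()}


def set_political_stance(p: Persona) -> Persona:
    # Placeholder for Stage 1.7: a real stage would condition on demographics.
    return {"political_stance": "moderate"}


def layer_moral_foundations(p: Persona) -> Persona:
    # Placeholder for a downstream stage that reads the hub field.
    return {"mft_profile": f"derived-from-{p['political_stance']}"}


STAGES: List[Stage] = [
    ("0", ["county"], normalize_demographics),
    ("1.7", ["county"], set_political_stance),
    ("2", ["political_stance"], layer_moral_foundations),
]
```

Declaring each stage's required inputs makes a mis-ordered run fail loudly instead of silently producing a field that was never conditioned on its hub.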

Vignettes — the bridge from data to voice

A 105-column attribute profile doesn't generate a voice on its own. To make a juror feel like a real person to the LLM that will eventually run the panel, we generate three short biographical vignettes per persona: first-person, past-tense passages of roughly 250 words each, chosen from an 18-theme menu (addiction and recovery, faith and meaning, career arc, family built, geographic identity, relationship to authority, defining hardship, and so on).

The vignettes are not user-facing output. They're intermediate context: grounding that the LLM uses to voice a juror during simulation. But because they shape every response the juror will make, their quality matters. A flat, generic vignette produces a flat, generic juror. A vignette that captures the specific shape of someone's life produces a juror who responds in character.

Writing the handful of training examples that teach the model what good vignettes look like took multiple sessions with three anchor personas. Fernando, 59, is a widowed Portuguese-American father of three in New Bedford with a Bachelor's degree and an under-$25K income: Evangelical Christian, in 12-step recovery, redemption-oriented. Yolanda, 39, is a Jewish DNP in Lowell who scaled back to part-time after her second kid: moderate-conservative, contamination-oriented, contemplating whether to ever go back full-time. Kevin, 70, is an Indo-Caribbean Hindu in Pittsfield with less than a high school education and five adult children: moderate-liberal, with a neutral narrative orientation.

A passage from Fernando's third vignette, in his own voice:

The church we belong to is in a strip mall off Route 18. There's a nail salon on one side and a place that used to be a Subway on the other. The pastor's name is Carlos and he was a roofer for twenty-six years before he got called. He preaches in English and Portuguese on alternating Sundays, and most of us understand both but prefer one, and we sort ourselves accordingly.

That's the bar — texture from inside the persona's life, not anthropology from outside. Real Portuguese-American working-class Evangelical congregations in New Bedford do meet in strip malls off Route 18. Pastors do come from the trades. The narrator isn't explaining anything to anyone; he's just telling us where he goes on Sundays.

QA cycles — what the iteration actually looked like

The prompt template that produces production vignettes went through three full iteration cycles on a fixed 32-persona stratified sample before I trusted it to create them all.

Iteration 1 (v1.0) surfaced two distinct convergence problems we hadn't anticipated. Model-default phrases ("what I didn't expect was how much...") appeared in 28% of vignettes. Deliberation-register language ("I think we can agree," "I understand where you're coming from") appeared 12 times across the 96-vignette batch. Both are subtle. Either one would have made jurors feel templated at panel scale.
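The two measurements in that paragraph are different kinds of counts: a share of vignettes containing a phrase, and a raw number of occurrences across the batch. A minimal sketch of both, assuming vignettes are available as plain strings (the function names are hypothetical, and simple case-insensitive substring matching is an assumption about how the check was done):

```python
def phrase_rate(vignettes, phrases):
    """Share of vignettes that contain at least one of the phrases."""
    if not vignettes:
        return 0.0
    lowered = [p.lower() for p in phrases]
    hits = sum(
        1 for text in vignettes
        if any(p in text.lower() for p in lowered)
    )
    return hits / len(vignettes)


def phrase_occurrences(vignettes, phrases):
    """Total number of times any of the phrases appears across the batch."""
    lowered = [p.lower() for p in phrases]
    return sum(text.lower().count(p) for text in vignettes for p in lowered)


# Phrases quoted in the text above; the batch itself is not reproduced here.
MODEL_DEFAULT = ["what I didn't expect was how much"]
DELIBERATION = ["I think we can agree", "I understand where you're coming from"]
```

Running `phrase_rate(batch, MODEL_DEFAULT)` and `phrase_occurrences(batch, DELIBERATION)` on the same fixed sample after each template revision gives comparable numbers from one iteration to the next.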

Iteration 2 (v1.0.1 if we're being nerdy) added two revisions targeting the convergence. One revision instructed the model to avoid using exemplar tics. The other instructed active diversification across the three vignettes per persona and explicitly forbade deliberation register, since vignettes are biographical rather than persuasive. The diversification revision worked dramatically: deliberation register fell to near zero, and the model-default crutch fell from 28% to 6%. The anti-exemplar-tic revision barely moved the needle.

We dug into why. Cross-referencing the flagged phrases against each persona's attribute profile revealed something important: phrases like "plain and simple" appeared in nine of 32 personas because those nine personas had "plain and simple" listed as their assigned communication tic in their attribute profile. The model wasn't borrowing exemplar tics; it was correctly honoring each persona's own database-specified voice markers at 100% fidelity. The "amplification" I had diagnosed was a measurement artifact. The instruction targeting it was solving a non-problem.
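That cross-reference amounts to splitting the personas whose vignettes use a phrase into two groups: those for whom the phrase is an assigned tic, and those for whom it is unexplained. A rough sketch, assuming each persona is a dict with hypothetical `id`, `assigned_tics`, and `vignettes` keys (the actual schema is not shown in this post):

```python
def split_by_assigned_tic(personas, phrase):
    """Return (explained_ids, unexplained_ids) for personas using the phrase.

    A use is "explained" when the phrase is one of the persona's own
    assigned communication tics, and "unexplained" otherwise.
    """
    needle = phrase.lower()
    explained, unexplained = [], []
    for persona in personas:
        uses_phrase = any(needle in v.lower() for v in persona["vignettes"])
        if not uses_phrase:
            continue
        tics = {t.lower() for t in persona["assigned_tics"]}
        if needle in tics:
            explained.append(persona["id"])
        else:
            unexplained.append(persona["id"])
    return explained, unexplained
```

If the unexplained list comes back empty, as it effectively did here, the pattern is the model following the data rather than a defect in the template.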

Iteration 3 (v1.0.2) removed that revision, added an off-menu theme guard for a different bug, and re-ran. All four comparison metrics held cleanly. v1.0.2 became the locked production template.

That entire arc (discovering a problem that wasn't a problem, learning that in-context demonstrations beat instructions when they conflict, removing an instruction targeting a non-problem) took just a little more time than I wanted to devote on a Saturday, but it was raining anyway, so I pushed through. Worth every minute. The principle generalized: throughout the rest of the project, we treated unexpected pattern observations as hypotheses to verify against base rates before treating them as defects to fix. A good process to build into future sessions.

Scaling, and the audit chain that interrupted us

Phase 2 generated 4,821 vignettes for all 1,607 Massachusetts personas at v1.0.2 (three per persona), production-quality across mechanical validation, statistical distribution checks, regional voice modulation, and rare attribute combination handling. The voice-convergence fix held at scale. The schema-edge handling held across rare configurations like a 70-year-old Asian-Jewish Cape plumber and an 18-year-old high earner. Note: the spelling and grammar are excellent, but I kind of wish you could hear the Boston accents.
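For a sense of what a mechanical validation pass might check, here is a minimal sketch built only from constraints stated in this post: three vignettes per persona, roughly 250 words each, themes drawn from the menu. The 40% word-count tolerance, the distinct-theme rule, the dict keys, and the function name are assumptions, and the theme menu is passed in rather than hard-coded because the full 18-theme list isn't reproduced here.

```python
def validate_vignettes(vignettes, theme_menu, expected_count=3,
                       target_words=250, tolerance=0.4):
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []
    if len(vignettes) != expected_count:
        problems.append(
            f"expected {expected_count} vignettes, found {len(vignettes)}"
        )

    themes = [v["theme"] for v in vignettes]
    if len(set(themes)) != len(themes):
        problems.append("duplicate theme within one persona")  # assumed rule

    low = target_words * (1 - tolerance)
    high = target_words * (1 + tolerance)
    for v in vignettes:
        if v["theme"] not in theme_menu:
            problems.append(f"off-menu theme: {v['theme']}")
        n_words = len(v["text"].split())
        if not low <= n_words <= high:
            problems.append(
                f"word count {n_words} outside {low:.0f}-{high:.0f}"
            )
    return problems
```

Checks like these catch structural failures cheaply; they say nothing about voice quality, which is what the distribution checks and manual review are for.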

Phase 3 added Rhode Island (1,089 vignettes), Connecticut (2,769 vignettes), and New Hampshire (2,943 vignettes). Each state's run stopped for review. The run/stop/verify/run process worked well here too.

The current state of the persona database is v23: 299,672 rows, 105 columns, full ZIP-City-State-County coherence at the data layer, with 138,919 additional ZCTA-anchored personas filling previously-uncovered neighborhoods nationally. I also have a note file with way too many follow-up ideas and improvements for the next rainy day.

What the first wave actually looks like

8,461 personas across Massachusetts, Rhode Island, Connecticut, New Hampshire, and Pennsylvania. 25,383 vignettes. Nominal spend in API costs across all of it, including iteration cycles, audit work, a Haiku-versus-Sonnet model comparison, and an independent Codex review. Not bad for a couple of hours on a rainy Saturday.

What's still being built

Several pieces of work remain. ZCTA-level enrichment data is ready but not yet joined to vignette prompts. That's a deliberate later decision, once we know which fields carry weight worth showing to the model and which are better left at the data layer. There's actually a bunch of ZIP-code-related work that brings me back to my days working on USPS Change of Address: a Virginia independent-city issue, PO-Box-only towns, and a few others. A persona plausibility threshold review will tighten the first-cut sanity filters that catch attribute combinations like a 92-year-old warehouse manager with a JD/MD. So many questions about that guy's career!
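A first-cut sanity filter of that kind could look something like the sketch below. Everything in it is illustrative: the thresholds, the field names (`age`, `education`, `employed`), and the flag labels are assumptions, not the project's actual rules.

```python
# Hypothetical thresholds; real values would come out of the planned review.
MAX_EMPLOYED_AGE = 85
MIN_AGE_BY_EDUCATION = {
    "Bachelor's": 20,
    "JD": 24,
    "MD": 25,
    "JD/MD": 28,
}


def plausibility_flags(persona):
    """Return a list of flag labels for implausible attribute combinations."""
    flags = []
    age = persona["age"]

    # Flag degrees that the persona is too young to plausibly hold.
    min_age = MIN_AGE_BY_EDUCATION.get(persona.get("education"))
    if min_age is not None and age < min_age:
        flags.append("too_young_for_education")

    # Flag personas still listed as employed at a very advanced age.
    if persona.get("employed") and age > MAX_EMPLOYED_AGE:
        flags.append("employed_past_max_age")

    return flags
```

Returning flags rather than a yes/no verdict keeps rare-but-real combinations reviewable instead of silently dropped, which matters when the goal is a panel that includes unusual people.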

A closing thought

There's a reason this work has taken the shape it has: research-grounded, iteration-disciplined, audit-tolerant, willing to deprioritize fast outputs for foundations that hold up. The vision makes a serious promise to plaintiff attorneys: that the panel they test their case against is structurally similar to the real one they'll face at trial. Making good on that promise takes the audit chains, the rollback discipline, and the willingness to spend a rainy Saturday afternoon investigating why all the jurors in Fairfield County were from Bridgeport.

We're not there yet on every dimension. The fidelity of pass-through fields on the augmentation personas needs another refinement pass. The neighborhood-level enrichment is queued. Then comes testing, and more testing. Then building, then more testing. But the foundation is in materially better shape than it was last month, and the discipline that got it here will carry whatever comes next.