Analysis¶
The analysis/ module provides single-run diagnostics for a generated or
imported SIM-PANEL run.
It reads one run directory, loads event rows and metadata, computes summaries and metrics, optionally saves plots, and can run questionnaire-aware regression models on the analyzed event data.
Analysis is for inspecting one run. Multi-run comparison and synthetic-vs-real benchmarking live in Comparison.
Purpose¶
Use analysis when you want to answer questions such as:
- Did the run produce the expected number of events?
- Which outcome fields are present, and how much missingness do they have?
- How are ratings or other outcomes distributed?
- Do panelists or products differ in their observed outcome profiles?
- In self-selection runs, which products were requested or executed most often?
- Can simple regression models explain an outcome using available event features?
Input directory¶
A typical analysis input is a SIM-PANEL run directory:
outputs/run_001/
  events.jsonl
  metadata.json
  data_dictionary.json
The analysis loader primarily uses:
| File | Role |
|---|---|
| `events.jsonl` | Main event-level dataset. |
| `metadata.json` | Run provenance and configuration snapshot. |
Linked persona and product files are optional. Many summaries can be computed
directly from events.jsonl because event rows already contain
panelist_features, product_features, outcomes, and traces.
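As a minimal sketch of what reading these two files involves, the following uses only the Python standard library. The function name `load_run` is hypothetical and is not the project's loader; the sketch assumes `events.jsonl` holds one JSON object per line and `metadata.json` is a single JSON document.

```python
import json
from pathlib import Path


def load_run(run_dir):
    """Load event rows and metadata from a run directory (illustrative only)."""
    run_path = Path(run_dir)
    events = []
    # events.jsonl: one JSON object per line; blank lines are skipped.
    with (run_path / "events.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                events.append(json.loads(line))
    # metadata.json: a single JSON document.
    with (run_path / "metadata.json").open(encoding="utf-8") as handle:
        metadata = json.load(handle)
    return events, metadata
```

Calling `load_run("outputs/run_001")` would return a list of event dicts and the metadata dict, which is enough input for most of the summaries described below.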
Minimal config¶
A minimal analysis config contains a run directory and an output directory:
run_dir: outputs/run_001
output_dir: outputs/run_001/analysis
Run analysis with the analysis CLI or runner used by the current project setup.
Full config shape¶
The analysis YAML supports the following sections:
run_dir: outputs/run_001
output_dir: outputs/run_001/analysis
load:
  resolve_sources: true
  prefer_extra_paths: true
  strict_source_resolution: false
summaries:
  run: true
  outcomes: true
  traces: true
  selections: true
metrics:
  quality: true
  diversity: true
  persona: true
  selection: false
plots:
  outcome_distributions:
    enabled: true
    normalize_to_share: false
    fields: null
    figsize: [7, 4.5]
  panelist_summary:
    enabled: false
    outcome_field: rating
    metrics: [mean, variance]
    max_items: 30
    sort_by: label_asc
    horizontal: false
  product_summary:
    enabled: false
    outcome_field: rating
    metrics: [mean, variance]
    max_items: 30
    sort_by: label_asc
    horizontal: false
  selection_concentration:
    enabled: false
    modes: [executed, requested]
    top_k: 15
    horizontal: true
export:
  csv: true
  json: true
  markdown: true
  overwrite: true
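One way a minimal config could be expanded into the full shape is a recursive merge of user values over a defaults dictionary. The sketch below is an assumption about how such a merge might look, not the project's implementation: `DEFAULTS` and `deep_merge` are hypothetical names, and the default values are copied from part of the config shape shown above.

```python
import copy

# Hypothetical defaults, copied from two sections of the config shape.
DEFAULTS = {
    "load": {
        "resolve_sources": True,
        "prefer_extra_paths": True,
        "strict_source_resolution": False,
    },
    "export": {"csv": True, "json": True, "markdown": True, "overwrite": True},
}


def deep_merge(base, override):
    """Return a new dict with override applied on top of base, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A minimal user config that overrides a single export flag.
user_config = {
    "run_dir": "outputs/run_001",
    "output_dir": "outputs/run_001/analysis",
    "export": {"csv": False},
}
config = deep_merge(DEFAULTS, user_config)
```

Merging recursively keeps sibling keys intact, so disabling `export.csv` does not silently drop `export.json` or `export.markdown`.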
Load options¶
The load section controls how much upstream source information analysis tries
to resolve.
| Field | Default | Description |
|---|---|---|
| `resolve_sources` | `true` | Try to resolve linked persona/product source files from metadata. |
| `prefer_extra_paths` | `true` | Prefer explicit extra paths when multiple source paths are available. |
| `strict_source_resolution` | `false` | If true, fail when linked sources cannot be resolved. |
Source resolution is optional. Analysis should remain useful even when only
events.jsonl and metadata.json are available.
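The interaction between the three load flags can be sketched as follows. Both helper functions are hypothetical and only illustrate the described behavior: explicit extra paths are tried first when preferred, and a missing source raises only in strict mode.

```python
from pathlib import Path


def order_candidates(metadata_paths, extra_paths, prefer_extra_paths=True):
    """Order candidate paths so explicit extra paths come first when preferred."""
    if prefer_extra_paths:
        return list(extra_paths) + list(metadata_paths)
    return list(metadata_paths) + list(extra_paths)


def resolve_source(candidates, strict=False):
    """Return the first existing candidate path, or None if none exists.

    With strict=True a missing source raises instead of returning None.
    """
    candidates = list(candidates)
    for candidate in candidates:
        path = Path(candidate)
        if path.exists():
            return path
    if strict:
        raise FileNotFoundError(f"no linked source found among: {candidates}")
    return None
```

With `strict=False`, a caller can treat `None` as "source not available" and continue with the summaries that only need event rows.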
Summaries¶
The summaries section controls human-readable structured summaries.
| Summary | Description |
|---|---|
| `run` | Row counts, event type counts, schema/policy/seed/backend metadata, observed panelist/product/period counts, and provenance paths. |
| `outcomes` | Per-outcome observed counts, missingness, and lightweight numeric or categorical summaries. |
| `traces` | Trace-field presence and text-oriented diagnostics. |
| `selections` | Selection-event summaries for self-selection runs. |
These summaries are intended for quick inspection and report generation.
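To make the outcome summary concrete, here is a rough sketch of per-outcome observed counts and missingness computed from event rows. It assumes each event carries an `outcomes` dict, as described under Input directory; the function name `summarize_outcomes`, the output keys, and the rule that a null or absent field counts as missing are illustrative assumptions.

```python
from collections import Counter


def summarize_outcomes(events):
    """Return observed/missing counts and a light summary per outcome field."""
    total = len(events)
    observed = {}
    for event in events:
        for field, value in (event.get("outcomes") or {}).items():
            observed.setdefault(field, [])
            if value is not None:
                observed[field].append(value)

    summary = {}
    for field, values in observed.items():
        missing = total - len(values)
        entry = {
            "observed": len(values),
            "missing": missing,
            "missing_share": missing / total if total else 0.0,
        }
        # Treat a field as numeric only if every observed value is a number.
        numeric = [
            v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if values and len(numeric) == len(values):
            entry["mean"] = sum(numeric) / len(numeric)
        else:
            entry["top_values"] = Counter(map(str, values)).most_common(3)
        summary[field] = entry
    return summary
```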
Metrics¶
The metrics section controls reusable analytic diagnostics.
| Metric group | Description |
|---|---|
| `quality` | Coverage, missingness, and basic linking checks. |
| `diversity` | Outcome diversity diagnostics. |
| `persona` | Panelist/product differentiation metrics. |
| `selection` | Selection concentration and entropy-style diagnostics. |
Summaries and metrics are separate by design: summaries are reader-facing tables, while metrics are reusable machine-readable diagnostics.
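As an illustration of what an entropy-style selection diagnostic can look like, the sketch below computes Shannon entropy and a top-k share over selected product identifiers. The exact metrics the module computes are not specified here, so the function name, the output keys, and the choice of these two measures are assumptions.

```python
import math
from collections import Counter


def selection_concentration(product_ids, top_k=15):
    """Return entropy and top-k share for a sequence of selected product ids."""
    counts = Counter(product_ids)
    total = sum(counts.values())
    if total == 0:
        return {
            "n_selections": 0,
            "entropy": 0.0,
            "normalized_entropy": 0.0,
            "top_k_share": 0.0,
        }
    shares = [count / total for count in counts.values()]
    # Shannon entropy in nats; 0 when every selection hits the same product.
    entropy = -sum(p * math.log(p) for p in shares)
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 0.0
    top = sum(count for _, count in counts.most_common(top_k))
    return {
        "n_selections": total,
        "entropy": entropy,
        "normalized_entropy": entropy / max_entropy if max_entropy else 0.0,
        "top_k_share": top / total,
    }
```

A normalized entropy near 1 indicates selections spread evenly over products, while a value near 0 indicates that a few products dominate.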
Plots¶
The plots section controls optional single-run diagnostic figures.
Current plot families include:
| Plot family | Description |
|---|---|
| `outcome_distributions` | Per-outcome distributions. |
| `panelist_summary` | Bar summaries over panelists for a chosen outcome. |
| `product_summary` | Bar summaries over products for a chosen outcome. |
| `selection_concentration` | Requested/executed product concentration in self-selection runs. |
If JSON export is enabled, analysis also writes a plot index mapping plot names to saved image paths.
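A plot index of this kind can be sketched as a JSON object keyed by plot name. The helper below is hypothetical: it assumes the plot name is the image file stem and that figures are saved with common image suffixes, neither of which is confirmed by the project.

```python
import json
from pathlib import Path


def write_plot_index(plots_dir, image_suffixes=(".png", ".svg", ".pdf")):
    """Write plot_index.json mapping plot names (file stems) to image paths."""
    plots_path = Path(plots_dir)
    index = {
        path.stem: str(path)
        for path in sorted(plots_path.iterdir())
        if path.suffix.lower() in image_suffixes
    }
    index_path = plots_path / "plot_index.json"
    index_path.write_text(json.dumps(index, indent=2), encoding="utf-8")
    return index
```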
Exported artifacts¶
The analysis runner writes structured artifacts under the configured
output_dir.
Typical layout:
analysis/
  summary/
  metrics/
  plots/
  report/
summary/¶
Contains structured single-run summaries such as:
- run_summary
- outcome_summary
- trace_summary
- selection_summary
Depending on export settings, summaries may be written as JSON and/or CSV.
metrics/¶
Contains machine-readable metric artifacts such as:
- quality_metrics
- diversity_metrics
- persona_metrics
- selection_metrics
Depending on export settings, metrics may be written as JSON and/or CSV.
plots/¶
Contains saved diagnostic figures. When JSON export is enabled, this directory also includes:
plot_index.json
report/¶
Contains a lightweight markdown report:
report.md
The report is intentionally concise. It gives a run overview, surfaces selected summary and metric values, and points to saved artifacts.
Export controls¶
Export behavior is controlled by the export section:
export:
  csv: true
  json: true
  markdown: true
  overwrite: true
| Field | Description |
|---|---|
| `csv` | Write row-oriented CSV summaries and selected metric tables. |
| `json` | Write machine-readable JSON outputs and plot index. |
| `markdown` | Write `report.md`. |
| `overwrite` | Allow existing artifact files to be replaced. |
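The `overwrite` flag can be pictured as a guard in front of every artifact write. The sketch below is an assumption about one reasonable behavior (raising when the file exists and overwriting is disabled); the project may instead skip the file or warn. `write_json_artifact` is a hypothetical helper name.

```python
import json
from pathlib import Path


def write_json_artifact(path, payload, overwrite=True):
    """Write payload as JSON, refusing to replace an existing file unless allowed."""
    target = Path(path)
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} exists and overwrite is disabled")
    # Create summary/, metrics/, etc. on demand.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target
```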
Regression analysis¶
Regression is an optional submodule of single-run analysis.
It is integrated into the analysis workflow rather than exposed as a separate top-level CLI command. When enabled, regression models are fit from the analyzed event data and written under a regression subdirectory.
Enable regression with:
regression:
  enabled: true
  save_results: true
  output_subdir: regression
  options:
    drop_missing: true
    standardize_numeric: false
    add_intercept: true
    max_iter: 200
    include_inference: true
    confidence_level: 0.95
    covariance_type: nonrobust
  specs:
    - family: ols
      design: product_features + panelist_features
      outcome_field: rating
Supported model families include:
| Family | Outcome type |
|---|---|
| `ols` | Continuous outcomes. |
|  | Binary outcomes. |
|  | Binary outcomes. |
|  | Nominal categorical outcomes. |
|  | Ordinal categorical outcomes. |
|  | Ordinal categorical outcomes. |
Regression is questionnaire-aware. The selected model family should be compatible with the declared analysis type of the requested outcome field.
Regression options¶
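| Field | Default | Description |
|---|---|---|
| `drop_missing` | `true` | Drop rows with missing target or regressors before fitting. |
| `standardize_numeric` | `false` | Standardize numeric regressors before fitting. |
| `add_intercept` | `true` | Add an intercept where appropriate. |
| `max_iter` | `200` | Maximum optimizer iterations for nonlinear models. |
| `include_inference` | `true` | Request coefficient-level inference outputs. |
| `confidence_level` | `0.95` | Confidence level for intervals. |
| `covariance_type` | `nonrobust` | Covariance estimator. Supported values include `nonrobust`. |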
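The preprocessing options can be illustrated with a small design-matrix builder. This is a sketch under several assumptions, not the module's code: `build_design` is a hypothetical name, the design `product_features + panelist_features` is read as "use all features from both groups", all features are taken to be numeric, and standardization uses the population standard deviation. Model fitting itself is left out.

```python
import math


def build_design(events, outcome_field, drop_missing=True,
                 standardize_numeric=False, add_intercept=True):
    """Return (feature_names, X, y) built from event rows."""
    rows = []
    for event in events:
        features = {}
        # Prefix names so identically named features in both groups stay distinct.
        for group in ("product_features", "panelist_features"):
            for name, value in (event.get(group) or {}).items():
                features[f"{group}.{name}"] = value
        target = (event.get("outcomes") or {}).get(outcome_field)
        rows.append((features, target))

    names = sorted({name for features, _ in rows for name in features})

    def is_complete(features, target):
        return target is not None and all(features.get(n) is not None for n in names)

    if drop_missing:
        rows = [row for row in rows if is_complete(*row)]
    elif not all(is_complete(*row) for row in rows):
        raise ValueError("missing values present; enable drop_missing")

    X = [[float(features[n]) for n in names] for features, _ in rows]
    y = [float(target) for _, target in rows]

    if standardize_numeric and X:
        for j in range(len(names)):
            column = [row[j] for row in X]
            mean = sum(column) / len(column)
            std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
            for row in X:
                # Constant columns are mapped to 0 to avoid division by zero.
                row[j] = (row[j] - mean) / std if std > 0 else 0.0

    if add_intercept:
        names = ["intercept"] + names
        X = [[1.0] + row for row in X]
    return names, X, y
```

The returned `X` and `y` could then be passed to whichever estimator matches the chosen family.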
Analysis versus comparison¶
Analysis and comparison serve different roles.
| Layer | Scope | Typical input | Typical output |
|---|---|---|---|
| Analysis | One run | One run directory | Summaries, metrics, plots, report, optional regression. |
| Comparison | Multiple conditions or synthetic-vs-reference | Multiple run/reference directories | Comparison metrics, tables, diagnostics, report. |
Use analysis first to inspect whether individual runs are healthy. Use comparison afterwards to evaluate differences across conditions or against a frozen benchmark reference.