Comparison¶
The analysis/compare/ module provides multi-condition comparison workflows for
SIM-PANEL.
It is separate from single-run Analysis. Analysis inspects one run; comparison evaluates relationships across runs or against a real reference.
Purpose¶
Use comparison when you want to:
compare prompting strategies;
compare different models or policies;
compare synthetic outputs against a frozen real-data reference;
produce compact metrics, tables, diagnostics, and markdown reports.
The comparison layer currently supports two modes:
| Mode | Description |
|---|---|
| cross | Compare multiple synthetic conditions against one another. |
| benchmark | Compare synthetic conditions against exactly one real reference condition. |
The mode is resolved automatically from the condition list.
CLI usage¶
The comparison pipeline has its own CLI command:
sim-panel compare --config path/to/compare.yaml
The CLI loads the compare config, runs the comparison pipeline, and writes artifacts to the configured output directory.
Config structure¶
A compare config requires:
output_dir;
a non-empty conditions list.
It also accepts:
outcome_field;
rating_scale;
compare-mode-specific options supported by the current implementation.
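The two required keys can be checked before any event files are read. The following is a minimal sketch, assuming the config has already been parsed into a plain dictionary; `validate_compare_config` is a hypothetical helper name, not part of the documented API.

```python
from typing import Any, Mapping


def validate_compare_config(config: Mapping[str, Any]) -> None:
    """Raise ValueError if a required compare-config key is missing or empty.

    Hypothetical helper for illustration; the real loader may validate differently.
    """
    output_dir = config.get("output_dir")
    if not isinstance(output_dir, str) or not output_dir.strip():
        raise ValueError("compare config requires 'output_dir'")

    conditions = config.get("conditions")
    if not isinstance(conditions, list) or not conditions:
        raise ValueError("compare config requires a non-empty 'conditions' list")


# A config with both required keys passes silently.
validate_compare_config(
    {
        "output_dir": "outputs/compare_beer_strategies",
        "conditions": [{"run_dir": "outputs/beer_persona"}],
    }
)
```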
Cross-condition config¶
Use cross mode when all conditions are synthetic runs.
output_dir: outputs/compare_beer_strategies
outcome_field: rating
conditions:
  - label: beer_persona
    model: gemma3:12b
    strategy: persona
    run_dir: outputs/beer_persona
    condition_type: synthetic
    events_filename: events.jsonl
  - label: beer_persona_cot
    model: gemma3:12b
    strategy: persona_cot
    run_dir: outputs/beer_persona_cot
    condition_type: synthetic
    events_filename: events.jsonl
If condition_type is omitted, it defaults to synthetic.
If events_filename is omitted, it defaults to events.jsonl.
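The two defaults above can be pictured as a dictionary merge in which explicit values win. This is only a sketch: `CONDITION_DEFAULTS` and `with_condition_defaults` are hypothetical names, and the real loader may apply defaults in a different way.

```python
from typing import Any, Dict, Mapping

# Defaults stated in this section; any other defaults are not assumed here.
CONDITION_DEFAULTS: Dict[str, Any] = {
    "condition_type": "synthetic",
    "events_filename": "events.jsonl",
}


def with_condition_defaults(condition: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of the condition with documented defaults filled in."""
    if "run_dir" not in condition:
        # run_dir is the one required field of a condition.
        raise ValueError("condition requires 'run_dir'")
    # Later keys override earlier ones, so explicit values beat defaults.
    return {**CONDITION_DEFAULTS, **condition}


resolved = with_condition_defaults({"run_dir": "outputs/beer_persona"})
```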
Benchmark config¶
Use benchmark mode when the condition list contains exactly one real reference condition.
output_dir: outputs/compare_amazon_benchmark
outcome_field: rating
benchmark_top_k_products: 20
conditions:
  - label: amazon_real
    model: real
    strategy: reference
    run_dir: outputs/benchmarks/amazon_grocery_subset
    condition_type: real
    events_filename: events.jsonl
  - label: amazon_self_selection
    model: gemma3:12b
    strategy: persona
    run_dir: outputs/amazon_self_selection
    condition_type: synthetic
    events_filename: events.jsonl
  - label: amazon_self_selection_cot
    model: gemma3:12b
    strategy: persona_cot
    run_dir: outputs/amazon_self_selection_cot
    condition_type: synthetic
    events_filename: events.jsonl
Benchmark mode is intended for synthetic-vs-reference evaluation. The reference condition is usually produced by the Benchmarks module from imported real-data artifacts.
Condition fields¶
Each condition defines one run or reference dataset.
| Field | Required | Default | Description |
|---|---|---|---|
| label | No | | Human-readable condition label. |
| model | No | | Model identifier or descriptive label. |
| strategy | No | | Prompting or generation strategy label. |
| run_dir | Yes | None | Directory containing event artifacts. |
| condition_type | No | synthetic | Either synthetic or real. |
| events_filename | No | events.jsonl | Event file name inside run_dir. |
The comparison loader reads event rows from each condition directory and filters to evaluation events.
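A loader of this kind might look like the sketch below. It assumes the event file is JSON Lines (one JSON object per line, as the .jsonl extension suggests) and that each row carries an event_type key; `load_evaluation_events` is a hypothetical name, not the module's actual function.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_evaluation_events(
    run_dir: str, events_filename: str = "events.jsonl"
) -> List[Dict[str, Any]]:
    """Read a JSON Lines event file and keep only evaluation events."""
    path = Path(run_dir) / events_filename
    rows: List[Dict[str, Any]] = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            if row.get("event_type") == "evaluation":
                rows.append(row)
    return rows
```

Rows of any other event type are dropped at load time, so later metric code only ever sees evaluation rows.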
Mode resolution¶
Comparison mode is resolved from the condition types.
| Condition mix | Mode |
|---|---|
| No real conditions | cross |
| Exactly one real condition | benchmark |
| More than one real condition | Invalid in the current implementation. |
This fail-fast behavior keeps the interpretation of benchmark results clear.
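The resolution rule can be expressed as a count of real conditions. The sketch below is illustrative only: `resolve_compare_mode` is a hypothetical name, and the string values "cross" and "benchmark" are assumed from the mode names used in this page.

```python
from typing import Any, Iterable, Mapping


def resolve_compare_mode(conditions: Iterable[Mapping[str, Any]]) -> str:
    """Resolve the compare mode from the number of real conditions."""
    real_count = sum(
        1
        for condition in conditions
        # condition_type defaults to synthetic when omitted.
        if condition.get("condition_type", "synthetic") == "real"
    )
    if real_count == 0:
        return "cross"
    if real_count == 1:
        return "benchmark"
    # Fail fast: more than one real reference is not supported.
    raise ValueError(
        f"expected at most one real condition, found {real_count}"
    )


mode = resolve_compare_mode(
    [
        {"label": "amazon_real", "condition_type": "real"},
        {"label": "amazon_self_selection"},
    ]
)
```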
Runtime flow¶
At a high level, the comparison runner:
resolves compare mode from the condition list;
loads event rows from each condition directory;
restricts rows to event_type == "evaluation";
computes shared per-condition metrics;
dispatches to cross-mode or benchmark-mode artifact builders;
writes tables, JSON artifacts, plots where applicable, and a markdown report.
Cross mode¶
Cross mode compares synthetic conditions against one another.
Typical outputs include:
condition_metrics.json
condition_metrics.csv
pivot_tables.json
js_divergence_matrix.json
pairwise_rmse_matrix.json
comparison_report.md
Cross mode is useful for comparing:
prompting strategies;
model choices;
assignment policies;
self-selection variants;
ablation settings.
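To make the js_divergence_matrix.json artifact more concrete, here is a sketch of a pairwise Jensen-Shannon divergence computation over per-condition rating distributions. It assumes base-2 logarithms and no smoothing; the source does not state which base or smoothing the implementation uses, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two distributions, in bits."""

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a == 0 contribute nothing; b is the mixture, so it is
        # strictly positive wherever a is.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def js_divergence_matrix(
    distributions: Dict[str, Sequence[float]]
) -> Dict[str, Dict[str, float]]:
    """Pairwise divergence between every pair of labelled distributions."""
    return {
        a: {b: js_divergence(pa, pb) for b, pb in distributions.items()}
        for a, pa in distributions.items()
    }


matrix = js_divergence_matrix(
    {
        "beer_persona": [0.1, 0.1, 0.2, 0.3, 0.3],
        "beer_persona_cot": [0.05, 0.15, 0.3, 0.3, 0.2],
    }
)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), which makes matrix entries easy to compare across condition pairs.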
Benchmark mode¶
Benchmark mode compares synthetic conditions against one real reference condition.
Typical outputs include:
condition_metrics.json
condition_metrics.csv
benchmark_summary.json
benchmark_summary.csv
benchmark_product_diagnostics_topk.json
benchmark_product_diagnostics_topk.csv
pivot_tables.json
benchmark_rating_bar_charts.png
comparison_report.md
Benchmark calculations are restricted to shared products where appropriate. This keeps synthetic-vs-reference comparisons from being driven by products that exist on only one side of the comparison.
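The shared-product restriction amounts to a set intersection on product identifiers. The sketch below assumes each event row carries a product_id key; that field name and the helper name are hypothetical and may differ from the actual event schema.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def restrict_to_shared_products(
    real_rows: List[Row],
    synthetic_rows: List[Row],
    product_key: str = "product_id",  # assumed field name
) -> Tuple[List[Row], List[Row]]:
    """Keep only rows whose product appears in both conditions."""
    shared = {row[product_key] for row in real_rows} & {
        row[product_key] for row in synthetic_rows
    }
    real_kept = [row for row in real_rows if row[product_key] in shared]
    synthetic_kept = [row for row in synthetic_rows if row[product_key] in shared]
    return real_kept, synthetic_kept
```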
Benchmark diagnostics¶
Benchmark mode exports compact product-level diagnostics for the best and worst matching products, controlled by:
benchmark_top_k_products: 20
These diagnostics are intended for quick inspection, not as a complete error analysis framework.
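One way to picture such diagnostics is to rank shared products by the absolute gap between real and synthetic mean ratings and keep k products from each end. This is an assumption-laden sketch: the source does not say which matching criterion is used or exactly how benchmark_top_k_products is applied, and the product_id and rating field names and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List

Row = Dict[str, Any]


def product_mean_ratings(rows: List[Row]) -> Dict[str, float]:
    """Mean rating per product (assumed keys: product_id, rating)."""
    by_product: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        by_product[row["product_id"]].append(row["rating"])
    return {product: mean(values) for product, values in by_product.items()}


def top_k_product_gaps(
    real_rows: List[Row], synthetic_rows: List[Row], k: int = 20
) -> Dict[str, List[Row]]:
    """Best- and worst-matching shared products by absolute mean-rating gap."""
    real = product_mean_ratings(real_rows)
    synthetic = product_mean_ratings(synthetic_rows)
    # Only products present on both sides are ranked.
    gaps = sorted(
        (abs(real[p] - synthetic[p]), p) for p in real.keys() & synthetic.keys()
    )
    ranked = [{"product_id": p, "abs_mean_gap": gap} for gap, p in gaps]
    return {"best": ranked[:k], "worst": ranked[::-1][:k]}
```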
Rating scale¶
If the outcome is a rating, you may provide a rating scale:
rating_scale: [1, 2, 3, 4, 5]
The rating scale helps normalize rating-distribution comparisons and keeps report tables consistent.
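As an illustration of why a fixed scale helps, the sketch below turns raw ratings into a distribution with one entry per scale value, so that every condition yields a vector of the same length even when some values never occur. Ignoring ratings outside the scale is an assumption of this sketch, and `rating_distribution` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def rating_distribution(
    ratings: Iterable[int], rating_scale: Sequence[int]
) -> Dict[int, float]:
    """Share of ratings at each scale value; unseen values get 0.0."""
    counts = Counter(ratings)
    # Only ratings that fall on the scale are counted (sketch assumption).
    total = sum(counts[value] for value in rating_scale)
    if total == 0:
        return {value: 0.0 for value in rating_scale}
    return {value: counts[value] / total for value in rating_scale}


distribution = rating_distribution([5, 5, 4, 1], [1, 2, 3, 4, 5])
```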
Output reports¶
Both comparison modes write:
comparison_report.md
The report summarizes conditions, core metrics, and mode-specific diagnostics. In benchmark mode, the markdown report may reference the saved rating bar-chart figure through a relative image path.
Relationship to sources and benchmarks¶
Comparison is downstream of source ingestion and benchmark subsetting.
Typical real-data workflow:
sources/
    raw Amazon Reviews'23 files
    -> imported events/products/personas
benchmarks/
    imported events/products
    -> frozen reference subset
analysis/compare/
    frozen reference subset + synthetic runs
    -> comparison metrics and reports
Sources ingest. Benchmarks freeze. Comparison evaluates.
Relationship to single-run analysis¶
Run single-run analysis before comparison when debugging output quality.
Analysis can reveal missing outcomes, malformed traces, sparse products, or unexpected selection patterns before those issues propagate into multi-run comparison.
Comparison assumes the input conditions are already valid enough to compare.
Current limitations¶
The current comparison layer is intentionally compact.
Important limitations:
benchmark mode supports exactly one real reference condition;
benchmark comparison is product-overlap-aware but not a full causal evaluation;
reports are lightweight markdown artifacts;
richer multi-run visual diagnostics are still evolving.
The goal is to provide reproducible comparison scaffolding without hiding the underlying event data or metric assumptions.