Comparison¶

The analysis/compare/ module provides multi-condition comparison workflows for SIM-PANEL.

It is separate from single-run Analysis. Analysis inspects one run; comparison evaluates relationships across runs or against a real reference.

Purpose¶

Use comparison when you want to:

compare prompting strategies;
compare different models or policies;
compare synthetic outputs against a frozen real-data reference;
produce compact metrics, tables, diagnostics, and markdown reports.

The comparison layer currently supports two modes:

Mode	Description
`cross`	Compare multiple synthetic conditions against one another.
`benchmark`	Compare synthetic conditions against exactly one real reference condition.

The mode is resolved automatically from the condition list.

CLI usage¶

The comparison pipeline has its own CLI command:

sim-panel compare --config path/to/compare.yaml

The CLI loads the compare config, runs the comparison pipeline, and writes artifacts to the configured output directory.

Config structure¶

A compare config requires:

output_dir
a non-empty conditions list

It also accepts:

outcome_field
rating_scale
mode-specific options supported by the current implementation, such as benchmark_top_k_products in benchmark mode
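
The two required keys listed above can be checked before any run directory is touched. The following is a minimal sketch, not the module's actual validation code: the helper name `validate_compare_config` is hypothetical, and it assumes the YAML file has already been parsed into a plain dictionary.

```python
from typing import Any


def validate_compare_config(config: dict[str, Any]) -> None:
    """Raise ValueError if a required compare-config key is missing or empty.

    Hypothetical helper for illustration; the real loader may validate differently.
    """
    if not config.get("output_dir"):
        raise ValueError("compare config requires 'output_dir'")
    conditions = config.get("conditions")
    if not isinstance(conditions, list) or len(conditions) == 0:
        raise ValueError("compare config requires a non-empty 'conditions' list")


# A config with both required keys passes silently.
validate_compare_config(
    {"output_dir": "outputs/compare_demo", "conditions": [{"run_dir": "outputs/run_a"}]}
)
```

Optional keys such as outcome_field and rating_scale are deliberately left unchecked in this sketch.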

Cross-condition config¶

Use cross mode when all conditions are synthetic runs.

output_dir: outputs/compare_beer_strategies
outcome_field: rating

conditions:
  - label: beer_persona
    model: gemma3:12b
    strategy: persona
    run_dir: outputs/beer_persona
    condition_type: synthetic
    events_filename: events.jsonl

  - label: beer_persona_cot
    model: gemma3:12b
    strategy: persona_cot
    run_dir: outputs/beer_persona_cot
    condition_type: synthetic
    events_filename: events.jsonl

If condition_type is omitted, it defaults to synthetic.

If events_filename is omitted, it defaults to events.jsonl.

Benchmark config¶

Use benchmark mode when the condition list contains exactly one real reference condition.

output_dir: outputs/compare_amazon_benchmark
outcome_field: rating
benchmark_top_k_products: 20

conditions:
  - label: amazon_real
    model: real
    strategy: reference
    run_dir: outputs/benchmarks/amazon_grocery_subset
    condition_type: real
    events_filename: events.jsonl

  - label: amazon_self_selection
    model: gemma3:12b
    strategy: persona
    run_dir: outputs/amazon_self_selection
    condition_type: synthetic
    events_filename: events.jsonl

  - label: amazon_self_selection_cot
    model: gemma3:12b
    strategy: persona_cot
    run_dir: outputs/amazon_self_selection_cot
    condition_type: synthetic
    events_filename: events.jsonl

Benchmark mode is intended for synthetic-vs-reference evaluation. The reference condition is usually produced by the Benchmarks module from imported real-data artifacts.

Condition fields¶

Each condition defines one run or reference dataset.

Field	Required	Default	Description
`label`	No	`cond_i`	Human-readable condition label.
`model`	No	`""`	Model identifier or descriptive label.
`strategy`	No	`""`	Prompting or generation strategy label.
`run_dir`	Yes	None	Directory containing event artifacts.
`condition_type`	No	`synthetic`	Either `synthetic` or `real`.
`events_filename`	No	`events.jsonl`	Event file name inside `run_dir`.
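
The defaults in the table can be expressed as a small dataclass. This is a sketch under stated assumptions rather than the module's own data model: the names `Condition` and `parse_conditions` are hypothetical, and the `cond_i` default is read here as "`cond_` plus the zero-based position in the list", which the table does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    """One entry of the conditions list, with the documented defaults."""

    run_dir: str
    label: str = ""
    model: str = ""
    strategy: str = ""
    condition_type: str = "synthetic"
    events_filename: str = "events.jsonl"


def parse_conditions(raw_conditions: list[dict]) -> list[Condition]:
    """Apply defaults and basic checks to raw condition dictionaries."""
    conditions = []
    for i, raw in enumerate(raw_conditions):
        if "run_dir" not in raw:
            raise ValueError(f"condition {i} is missing required 'run_dir'")
        cond = Condition(**raw)
        if not cond.label:
            # Assumption: the documented `cond_i` default uses the list index.
            cond.label = f"cond_{i}"
        if cond.condition_type not in ("synthetic", "real"):
            raise ValueError(f"condition {i} has unknown condition_type")
        conditions.append(cond)
    return conditions
```

In this sketch an unrecognized key raises a TypeError from the dataclass constructor; whether the real loader rejects or ignores unknown keys is not stated here.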

The comparison loader reads event rows from each condition directory and filters to evaluation events.

Mode resolution¶

Comparison mode is resolved from the condition types.

Condition mix	Mode
No real conditions	`cross`
Exactly one real condition	`benchmark`
More than one real condition	Invalid in the current implementation.

A condition list with more than one real condition is rejected rather than resolved to a mode. This fail-fast behavior keeps the interpretation of benchmark results clear.
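
The resolution table maps directly onto a count of real conditions. A minimal sketch, assuming condition types arrive as plain strings; the function name `resolve_mode` and the error message are illustrative, not taken from the implementation.

```python
def resolve_mode(condition_types: list[str]) -> str:
    """Return 'cross' or 'benchmark' based on how many conditions are real."""
    n_real = sum(1 for ctype in condition_types if ctype == "real")
    if n_real == 0:
        return "cross"
    if n_real == 1:
        return "benchmark"
    # More than one real condition is invalid, so fail before loading any data.
    raise ValueError(f"expected at most one real condition, got {n_real}")


print(resolve_mode(["synthetic", "synthetic"]))
print(resolve_mode(["real", "synthetic", "synthetic"]))
```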

Runtime flow¶

At a high level, the comparison runner:

resolves compare mode from the condition list;
loads event rows from each condition directory;
restricts rows to event_type == "evaluation";
computes shared per-condition metrics;
dispatches to cross-mode or benchmark-mode artifact builders;
writes tables, JSON artifacts, plots where applicable, and a markdown report.
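
The loading and filtering steps above can be sketched as follows. This is an illustration only: `load_evaluation_rows` is a hypothetical name, and the sketch assumes each line of the events file is one JSON object carrying an `event_type` key.

```python
import json
from pathlib import Path


def load_evaluation_rows(run_dir: str, events_filename: str = "events.jsonl") -> list[dict]:
    """Read a JSONL events file and keep only rows with event_type == 'evaluation'."""
    rows = []
    with open(Path(run_dir) / events_filename, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            row = json.loads(line)
            if row.get("event_type") == "evaluation":
                rows.append(row)
    return rows
```

Filtering at load time means every later step, in either mode, sees the same evaluation-only rows.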

Shared per-condition metrics¶

The comparison layer computes descriptive metrics for each condition once and reuses them across reports and tables.

Typical metrics include:

number of evaluation rows;
number of panelists;
number of products;
observed outcome coverage;
rating or outcome distribution summaries.

Exact metric fields may evolve as the comparison module develops.
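
As a rough illustration of the metrics listed above, the sketch below computes them from a list of evaluation rows. The row keys `panelist_id` and `product_id`, the output key names, and the function name are all assumptions made for this example; the actual metric fields may differ.

```python
from collections import Counter


def condition_metrics(
    rows: list[dict],
    outcome_field: str = "rating",
    panelist_key: str = "panelist_id",  # assumed key name
    product_key: str = "product_id",  # assumed key name
) -> dict:
    """Compute descriptive metrics for one condition's evaluation rows."""
    observed = [row[outcome_field] for row in rows if row.get(outcome_field) is not None]
    panelists = {row[panelist_key] for row in rows if row.get(panelist_key) is not None}
    products = {row[product_key] for row in rows if row.get(product_key) is not None}
    return {
        "n_rows": len(rows),
        "n_panelists": len(panelists),
        "n_products": len(products),
        # Share of rows whose outcome is present; 0.0 for an empty condition.
        "outcome_coverage": len(observed) / len(rows) if rows else 0.0,
        "outcome_counts": dict(Counter(observed)),
    }
```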

Cross mode¶

Cross mode compares synthetic conditions against one another.

Typical outputs include:

condition_metrics.json
condition_metrics.csv
pivot_tables.json
js_divergence_matrix.json
pairwise_rmse_matrix.json
comparison_report.md

Cross mode is useful for comparing:

prompting strategies;
model choices;
assignment policies;
self-selection variants;
ablation settings.
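
To make the js_divergence_matrix.json artifact more concrete, here is a sketch of a pairwise Jensen-Shannon divergence matrix over per-condition rating distributions. It assumes each distribution is a list of probabilities over the same ordered rating levels and uses base-2 logarithms, so values fall between 0 and 1; the base and the exact inputs used by the module are not specified in this document.

```python
import math


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")

    def kl(a: list[float], b: list[float]) -> float:
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def js_divergence_matrix(distributions: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Pairwise divergences keyed by condition label."""
    labels = list(distributions)
    return {
        a: {b: js_divergence(distributions[a], distributions[b]) for b in labels}
        for a in labels
    }
```

The matrix is symmetric with zeros on the diagonal, so identical conditions are easy to spot.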

Benchmark mode¶

Benchmark mode compares synthetic conditions against one real reference condition.

Typical outputs include:

condition_metrics.json
condition_metrics.csv
benchmark_summary.json
benchmark_summary.csv
benchmark_product_diagnostics_topk.json
benchmark_product_diagnostics_topk.csv
pivot_tables.json
benchmark_rating_bar_charts.png
comparison_report.md

Benchmark calculations are restricted to shared products where appropriate. This keeps synthetic-vs-reference comparisons from being driven by products that exist on only one side of the comparison.
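
The shared-product restriction amounts to a set intersection over product identifiers. A minimal sketch, assuming rows carry a `product_id` key (an assumed name) and using the hypothetical helper `restrict_to_shared_products`:

```python
def restrict_to_shared_products(
    reference_rows: list[dict],
    synthetic_rows: list[dict],
    product_key: str = "product_id",  # assumed key name
) -> tuple[list[dict], list[dict], set]:
    """Keep only rows whose product appears in both the reference and the synthetic run."""
    shared = {row[product_key] for row in reference_rows} & {
        row[product_key] for row in synthetic_rows
    }
    reference_kept = [row for row in reference_rows if row[product_key] in shared]
    synthetic_kept = [row for row in synthetic_rows if row[product_key] in shared]
    return reference_kept, synthetic_kept, shared
```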

Benchmark diagnostics¶

Benchmark mode exports compact product-level diagnostics for the best and worst matching products, controlled by:

benchmark_top_k_products: 20

These diagnostics are intended for quick inspection, not as a complete error analysis framework.
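
One way to pick best and worst matching products is to rank shared products by how far the synthetic mean rating sits from the reference mean. The sketch below does that; the matching criterion (absolute difference of means), the function name, and the input shape are assumptions for illustration, since the actual ranking rule is not described here.

```python
def top_k_product_diagnostics(
    reference_means: dict[str, float],
    synthetic_means: dict[str, float],
    k: int = 20,
) -> dict[str, list[str]]:
    """Return the k closest and k farthest shared products by absolute mean-rating gap."""
    shared = set(reference_means) & set(synthetic_means)
    # Sort by gap, breaking ties on product id so the output is deterministic.
    ranked = sorted(
        shared,
        key=lambda pid: (abs(synthetic_means[pid] - reference_means[pid]), pid),
    )
    return {"best": ranked[:k], "worst": ranked[::-1][:k]}
```

When fewer than 2k products are shared, the two lists overlap in this sketch.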

Rating scale¶

If the outcome is a rating, you may provide a rating scale:

rating_scale: [1, 2, 3, 4, 5]

The rating scale helps normalize rating-distribution comparisons and report consistent tables.
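
A fixed scale matters because a condition that never produces a given rating would otherwise be missing that level entirely. The sketch below builds a distribution over every level of the scale, filling unobserved levels with zero. The function name is hypothetical, and dropping ratings that fall outside the scale is an assumption of this example.

```python
from collections import Counter


def rating_distribution(ratings: list, rating_scale: list) -> dict:
    """Relative frequency of each scale level, with zeros for unobserved levels."""
    # Assumption: ratings not on the scale are ignored rather than raising an error.
    counts = Counter(r for r in ratings if r in rating_scale)
    total = sum(counts.values())
    if total == 0:
        return {level: 0.0 for level in rating_scale}
    return {level: counts[level] / total for level in rating_scale}


print(rating_distribution([5, 5, 4, 1], [1, 2, 3, 4, 5]))
```

Because every condition yields the same keys in the same order, the resulting distributions can be compared level by level.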

Output reports¶

Both comparison modes write:

comparison_report.md

The report summarizes conditions, core metrics, and mode-specific diagnostics. In benchmark mode, the markdown report may reference the saved rating bar-chart figure through a relative image path.

Relationship to sources and benchmarks¶

Comparison is downstream of source ingestion and benchmark subsetting.

Typical real-data workflow:

sources/
  raw Amazon Reviews'23 files
  -> imported events/products/personas

benchmarks/
  imported events/products
  -> frozen reference subset

analysis/compare/
  frozen reference subset + synthetic runs
  -> comparison metrics and reports

Sources ingest. Benchmarks freeze. Comparison evaluates.

Relationship to single-run analysis¶

Run single-run analysis before comparison when debugging output quality.

Analysis can reveal missing outcomes, malformed traces, sparse products, or unexpected selection patterns before those issues propagate into multi-run comparison.

Comparison assumes the input conditions are already valid enough to compare.

Current limitations¶

The current comparison layer is intentionally compact.

Important limitations:

benchmark mode supports exactly one real reference condition;
benchmark comparison is product-overlap-aware but not a full causal evaluation;
reports are lightweight markdown artifacts;
richer multi-run visual diagnostics are still evolving.

The goal is to provide reproducible comparison scaffolding without hiding the underlying event data or metric assumptions.