Quickstart¶
This page walks through a minimal SIM-PANEL run: install the package, generate a small synthetic panel dataset, validate the output, and inspect a few rows.
SIM-PANEL is designed to be configuration-driven. A typical run starts from a YAML file and writes JSONL artifacts plus metadata into an output directory.
Installation¶
From the repository root, install the package in editable mode:
pip install -e .
For development, install optional development dependencies if they are defined in the project configuration:
pip install -e ".[dev]"
Check that the command-line interface is available:
sim-panel --help
Generate a small dataset¶
Run the generator with an example configuration:
sim-panel generate \
--config examples/configs/minimal.yaml \
--out outputs/run_001
This command reads the YAML configuration, loads panelists and products, applies
the configured assignment or selection policy, generates event-level outcomes,
and writes artifacts to outputs/run_001.
A typical run directory contains:
outputs/run_001/
events.jsonl
metadata.json
data_dictionary.json
If CSV export is enabled by the command or writer settings, the run directory may also include:
events.csv
events.jsonl is the primary dataset output. events.csv is a convenience
export; nested structures are JSON-serialized into string cells.
Validate the output¶
Validate the generated event file against the SIM-PANEL schema:
sim-panel validate \
--data outputs/run_001/events.jsonl \
--meta outputs/run_001/metadata.json
Validation checks that generated rows follow the expected event schema. For self-selection runs, validation may also check that evaluation events are linked correctly to their corresponding selection events.
Inspect sample rows¶
Print a few generated events:
sim-panel sample \
--data outputs/run_001/events.jsonl \
--n 5
Each row represents an event-level observation. A typical evaluation event contains fields such as:
event_id
schema_version
event_type
panelist_id
product_id
t
policy
outcomes
traces
panelist_features
product_features
For random and manual assignment runs, most rows are evaluation events. For
self-selection runs, the output may contain both selection events and linked
evaluation events.
Minimal configuration¶
A minimal generation config has three required top-level sections:
panelistsproductspolicy
Most useful runs also define generator, outcomes_model, questionnaire, and
output_dir.
A small deterministic random-assignment config looks like this:
panelists:
source: examples/data/panelists.jsonl
variant: default
products:
source: examples/data/products.jsonl
variant: default
policy:
name: random
generator:
schema_version: "0.1.0"
seed: 42
n_periods: 3
validate_on_finish: true
max_errors: 50
event_namespace: quickstart
max_workers: 1
prompting_strategy: persona
outcomes_model:
name: deterministic
questionnaire:
outcomes:
fields:
rating:
type: int
choices: [1, 2, 3, 4, 5]
question: "Overall, how much do you like this product?"
purchase_intent:
type: categorical
choices: ["no", "maybe", "yes"]
question: "How likely are you to purchase in 30 days?"
traces:
fields:
review_text:
type: text
question: "Write a short review in 2–4 sentences."
output_dir: outputs/quickstart
The deterministic outcome model fills the questionnaire using stable hashes of
panelist_id, product_id, and t. This makes it useful for local tests,
schema checks, CI, and CPU-only pipeline debugging.
What to inspect first¶
After your first run, check the following files:
File |
Purpose |
|---|---|
|
Primary event-level dataset. |
|
Lightweight run bookkeeping: generation time, schema version, seed, row counts, panelist/product/period counts, policy name, optional config snapshot, and config hash. |
|
Dataset legend/contract: schema version plus JSONable snapshots of key configs and specs, including generator config, policy config, selection config, execution rules, and outcome config. |
|
Optional convenience export. Nested structures are JSON-serialized into string cells. |
The most important file is events.jsonl. It is the canonical output format for
validation, downstream analysis, and benchmarking.
Reproducibility¶
SIM-PANEL uses deterministic seeding for non-LLM components such as exposure sampling, assignment shuffles, and stable event-id generation. Re-running the same configuration with the same seed should produce the same non-LLM outputs.
LLM-backed enrichment, selection, or outcome generation may not be perfectly deterministic unless the backend, model version, prompts, decoding parameters, and runtime behavior are also controlled. For reproducible local tests, prefer deterministic or mock components.
Next steps¶
After completing the quickstart, the most useful pages are:
Configuration for YAML-driven experiment setup.
Schema for event-level fields, artifacts, and validation rules.
Generation for the synthetic panel generation pipeline.
Sources for real-data adapters such as Amazon Reviews’23.
Benchmarks for freezing benchmark-ready real-data subsets.