Sources¶
SIM-PANEL sources convert external datasets into canonical internal artifacts: event rows, product records, persona records, metadata, data dictionaries, and import statistics.
The source layer is for data ingestion and conversion. It is separate from synthetic generation, benchmark subset construction, and comparison.
sources/
    raw external data -> canonical imported artifacts
benchmarks/
    imported artifacts -> frozen real-data subset
analysis/compare/
    synthetic outputs + reference subset -> comparison metrics and reports
Source-layer contract¶
All source importers inherit from BaseSource.
A source is responsible for:
- validating its configuration;
- loading raw source artifacts;
- transforming raw rows into SIM-PANEL records;
- exporting materialized bundles;
- optionally supporting a streaming import/export path.
The default in-memory flow is:
validate config
-> load raw artifacts
-> transform into canonical records
-> return SourceExportBundle
-> export artifacts
Streaming sources may bypass raw-bundle materialization and write artifacts incrementally to disk.
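The contract above can be sketched as a small template-method class. This is a minimal illustration, not the project's BaseSource: the class name BaseSourceSketch, the method names (validate_config, load_raw, transform, export, run), and the use of a plain dict in place of SourceExportBundle are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseSourceSketch(ABC):
    """Illustrative stand-in for BaseSource; every method name is assumed."""

    def __init__(self, config: dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def validate_config(self) -> None:
        """Raise ValueError if the config is unusable."""

    @abstractmethod
    def load_raw(self) -> dict[str, list[Any]]:
        """Load raw source artifacts into memory."""

    @abstractmethod
    def transform(self, raw: dict[str, list[Any]]) -> dict[str, Any]:
        """Turn raw rows into canonical records (a dict stands in for the bundle)."""

    @abstractmethod
    def export(self, bundle: dict[str, Any]) -> None:
        """Persist the bundle."""

    def run(self) -> dict[str, Any]:
        # Default in-memory flow: validate -> load -> transform -> export.
        self.validate_config()
        raw = self.load_raw()
        bundle = self.transform(raw)
        self.export(bundle)
        return bundle


class ToySource(BaseSourceSketch):
    """Tiny concrete source used only to exercise the flow."""

    def validate_config(self) -> None:
        if "rows" not in self.config:
            raise ValueError("config must contain 'rows'")

    def load_raw(self) -> dict[str, list[Any]]:
        return {"rows": list(self.config["rows"])}

    def transform(self, raw: dict[str, list[Any]]) -> dict[str, Any]:
        events = [{"t": i, "value": row} for i, row in enumerate(raw["rows"])]
        return {"events": events, "stats": {"n_events": len(events)}}

    def export(self, bundle: dict[str, Any]) -> None:
        # A real source would write files; the sketch just keeps a reference.
        self.exported = bundle


bundle = ToySource({"rows": ["a", "b"]}).run()
```

A streaming source would likely override run so that transform and export are interleaved row by row instead of materializing the raw bundle first.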
Source registry¶
SIM-PANEL uses a source registry so orchestration code does not need to hard-code source classes.
A source config declares the source name:
source:
  name: amazon_reviews_2023
  reviews_path: data/raw/All_Beauty.jsonl.gz
  metadata_path: data/raw/meta_All_Beauty.jsonl.gz
  output_dir: outputs/imports/all_beauty
The currently implemented source is:
| Source name | Description |
|---|---|
| amazon_reviews_2023 | Imports Amazon Reviews’23 review and metadata files into SIM-PANEL artifacts. |
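One common way to implement such a registry is a name-to-class dictionary populated by a decorator. The sketch below is an assumption about the general pattern, not SIM-PANEL's actual code: SOURCE_REGISTRY, register_source, build_source, and AmazonReviews2023SourceSketch are hypothetical names.

```python
from typing import Any, Callable

# Hypothetical registry: maps the `source.name` config value to a class.
SOURCE_REGISTRY: dict[str, type] = {}


def register_source(name: str) -> Callable[[type], type]:
    """Class decorator that records a source class under `name`."""

    def decorator(cls: type) -> type:
        if name in SOURCE_REGISTRY:
            raise ValueError(f"source {name!r} is already registered")
        SOURCE_REGISTRY[name] = cls
        return cls

    return decorator


@register_source("amazon_reviews_2023")
class AmazonReviews2023SourceSketch:
    """Placeholder class; the real importer does far more than store config."""

    def __init__(self, config: dict[str, Any]) -> None:
        self.config = config


def build_source(config: dict[str, Any]) -> Any:
    """Look up the class named in config['source']['name'] and instantiate it."""
    source_cfg = config["source"]
    name = source_cfg["name"]
    if name not in SOURCE_REGISTRY:
        known = ", ".join(sorted(SOURCE_REGISTRY))
        raise KeyError(f"unknown source {name!r}; registered sources: {known}")
    return SOURCE_REGISTRY[name](source_cfg)


source = build_source(
    {"source": {"name": "amazon_reviews_2023", "output_dir": "outputs/imports/all_beauty"}}
)
```

With this pattern, orchestration code only calls build_source and never imports a concrete source class directly.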
Export bundle¶
A source import produces a SourceExportBundle containing:
| Artifact | Description |
|---|---|
| Event rows | Schema-valid event rows. |
| Product records | Product records. |
| Persona records | Persona records. |
| Metadata | Source-level import metadata. |
| Data dictionary | Field-level source/export dictionary. |
| Import statistics | Lightweight import statistics. |
For streaming imports, large artifacts are written directly to disk. The returned bundle may contain empty row lists while still carrying metadata and stats.
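As a rough picture of the bundle's shape, the sketch below models it as a dataclass. The class name SourceExportBundleSketch, the attribute names (chosen to mirror the exported file names), and the example metadata and stats keys are assumptions, not the real SourceExportBundle definition.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SourceExportBundleSketch:
    """Hypothetical bundle layout; attribute names are assumed."""

    events: list[dict[str, Any]] = field(default_factory=list)
    products: list[dict[str, Any]] = field(default_factory=list)
    personas: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
    data_dictionary: dict[str, Any] = field(default_factory=dict)
    stats: dict[str, Any] = field(default_factory=dict)


# A streaming import may hand back empty row lists: the rows already live on
# disk, while metadata and stats are still small enough to return in memory.
streamed = SourceExportBundleSketch(
    metadata={"import_mode": "streaming"},  # example key, assumed
    stats={"n_events": 120_000},            # example key and value, assumed
)
```

Callers that need the rows after a streaming import would read them back from the output directory rather than from the bundle.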
Amazon Reviews’23 background¶
Amazon Reviews’23 is a large-scale review dataset released by McAuley Lab. It contains user reviews, item metadata, and user-item interaction structure across multiple product categories.
SIM-PANEL does not redistribute the raw dataset. The amazon_reviews_2023
adapter assumes users have downloaded review and metadata files locally, then
converts those local files into SIM-PANEL-compatible artifacts.
Users should consult the original dataset page for download links, citation information, licensing terms, and data-use conditions.
Amazon Reviews’23 source¶
The Amazon importer expects two local files:
- a review JSONL or JSONL.GZ file;
- a metadata JSONL or JSONL.GZ file.
Example config:
source:
  name: amazon_reviews_2023
  reviews_path: data/raw/All_Beauty.jsonl.gz
  metadata_path: data/raw/meta_All_Beauty.jsonl.gz
  category: All_Beauty
  output_dir: outputs/imports/all_beauty
  import_mode: in_memory
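Reading both plain and gzip-compressed JSONL can be handled by one small generator. The helper below is a sketch using only the Python standard library; the name iter_jsonl and its max_rows parameter are hypothetical and only loosely mirror the max_reviews and max_metadata_rows caps.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Iterator, Optional, Union


def iter_jsonl(
    path: Union[str, Path], max_rows: Optional[int] = None
) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a .jsonl or .jsonl.gz file."""
    path = Path(path)
    # Choose the opener from the final suffix; both support text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        count = 0
        for line in handle:
            if max_rows is not None and count >= max_rows:
                break
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            yield json.loads(line)
            count += 1
```

Because it is a generator, the same helper can feed either an in-memory list (`list(iter_jsonl(path))`) or a row-by-row streaming pipeline.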
Amazon config fields¶
| Field | Default | Description |
|---|---|---|
| reviews_path | Required | Path to raw review JSONL or JSONL.GZ file. |
| metadata_path | Required | Path to raw metadata JSONL or JSONL.GZ file. |
| category | | Optional category/domain label. |
| output_dir | | Output directory. |
| import_mode | | Either in_memory or streaming. |
| require_metadata_match_for_events | false | Drop review events without matching product metadata if true. |
| trace_field_map | title/text mapping | Maps review fields into event traces. |
| time_index_mode | panelist_sequence | How the event period index t is assigned. |
| | | Minimum review count required to emit a persona record. |
| max_reviews | | Optional cap on raw review rows. |
| max_metadata_rows | | Optional cap on raw metadata rows. |
Default trace mapping:
trace_field_map:
  title: review_title
  text: review_text
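Applying the default mapping can be sketched as a dictionary comprehension. This assumes that keys of trace_field_map are raw review fields and values are trace names, which is an interpretation of the mapping above; build_traces is a hypothetical helper, not the importer's API.

```python
from typing import Any

# Default mapping from the docs: raw review field -> trace key (direction assumed).
DEFAULT_TRACE_FIELD_MAP = {"title": "review_title", "text": "review_text"}


def build_traces(
    review: dict[str, Any],
    trace_field_map: dict[str, str] = DEFAULT_TRACE_FIELD_MAP,
) -> dict[str, Any]:
    """Copy mapped review fields into a traces dict, skipping missing values."""
    return {
        trace_key: review[source_field]
        for source_field, trace_key in trace_field_map.items()
        if review.get(source_field) is not None
    }


traces = build_traces({"title": "Great", "text": "Works well", "rating": 5.0})
```

Fields that are not listed in the mapping, such as rating in this example, are left out of the traces.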
Import mapping¶
The Amazon importer converts source files as follows:
| SIM-PANEL artifact | Built from |
|---|---|
| Product records | Item metadata rows. |
| Persona records | User review histories. |
| Event rows | Review rows. |
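Building persona records from user review histories, with a minimum review count, might look like the sketch below. The function name build_personas, the min_reviews parameter, and the persona fields persona_id and n_reviews are assumptions; only the user_id review field comes from the Amazon Reviews’23 format.

```python
from collections import Counter
from typing import Any, Iterable


def build_personas(
    reviews: Iterable[dict[str, Any]], min_reviews: int = 1
) -> list[dict[str, Any]]:
    """Emit one persona per user who has at least `min_reviews` reviews."""
    counts = Counter(review["user_id"] for review in reviews)
    return [
        {"persona_id": user_id, "n_reviews": n}
        for user_id, n in sorted(counts.items())
        if n >= min_reviews
    ]


personas = build_personas(
    [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u1"}],
    min_reviews=2,
)
```

Raising the threshold trades persona coverage for richer per-persona histories.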
SIM-PANEL uses parent_asin as the canonical product_id. Child asin values
are retained as source provenance under event traces.
Imported review rows become schema-valid evaluation events with:
| Event field | Source |
|---|---|
| | |
| | Amazon |
| | Amazon |
| | Product display text or display name. |
| | Review |
| | Review |
| | Review |
| | Review |
| | Review |
Imported real events use policy: manual because they are observational records,
not randomized synthetic assignments.
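The identifier and policy rules above can be illustrated with a small transform. The function review_to_event and the output keys panelist_id, traces, and source_asin are hypothetical; the source only states that parent_asin becomes product_id, that the child asin is kept under event traces, and that policy is manual.

```python
from typing import Any


def review_to_event(review: dict[str, Any], t: int) -> dict[str, Any]:
    """Map one raw Amazon review row to a simplified event dict (sketch only)."""
    return {
        "panelist_id": review["user_id"],      # output key name is assumed
        "product_id": review["parent_asin"],   # canonical product identifier
        "t": t,
        "policy": "manual",                    # observational, not randomized
        "traces": {
            # Child ASIN kept only as provenance; trace key name is assumed.
            "source_asin": review.get("asin"),
        },
    }


event = review_to_event(
    {"user_id": "U1", "asin": "B00CHILD", "parent_asin": "B00PARENT"}, t=0
)
```

Keying products on parent_asin means that reviews of different variants of the same item collapse onto one product record.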
Time index t¶
The Amazon importer supports three time-index modes:
| Mode | Description | Streaming support |
|---|---|---|
| panelist_sequence | Sort each user’s reviews chronologically and assign a per-user index. | Yes |
| raw_timestamp | Use the source timestamp directly. | Yes |
| global_sequence | Sort the full corpus chronologically and assign a corpus-wide index. | In-memory only |
Default:
time_index_mode: panelist_sequence
panelist_sequence is recommended for panel-style simulation because it gives
each panelist an individual exposure history.
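The difference between the per-user and corpus-wide modes is easiest to see in code. The sketch below assumes zero-based indices and the user_id and timestamp review fields; the function names are hypothetical and the real importer may number periods differently.

```python
from collections import defaultdict
from typing import Any


def assign_panelist_sequence(reviews: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Give each user's reviews t = 0, 1, 2, ... in chronological order."""
    by_user: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for review in reviews:
        by_user[review["user_id"]].append(review)
    indexed = []
    for rows in by_user.values():
        rows.sort(key=lambda row: row["timestamp"])
        for t, row in enumerate(rows):
            indexed.append({**row, "t": t})  # copy so inputs stay unmodified
    return indexed


def assign_global_sequence(reviews: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Give the whole corpus one shared chronological index."""
    ordered = sorted(reviews, key=lambda row: row["timestamp"])
    return [{**row, "t": t} for t, row in enumerate(ordered)]


reviews = [
    {"user_id": "u1", "timestamp": 30},
    {"user_id": "u2", "timestamp": 10},
    {"user_id": "u1", "timestamp": 20},
]
per_user = assign_panelist_sequence(reviews)
corpus_wide = assign_global_sequence(reviews)
```

The corpus-wide variant needs every review sorted together, which is consistent with it being limited to the in-memory path.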
Import modes¶
In-memory¶
The in-memory path materializes raw review and metadata rows before transforming them.
Use it for:
- small categories;
- capped development runs;
- transformation debugging;
- semantic tests.
Example:
source:
  name: amazon_reviews_2023
  reviews_path: data/raw/All_Beauty.jsonl.gz
  metadata_path: data/raw/meta_All_Beauty.jsonl.gz
  output_dir: outputs/imports/all_beauty_small
  import_mode: in_memory
  max_reviews: 10000
  max_metadata_rows: 5000
Streaming¶
The streaming path writes large artifacts incrementally and avoids constructing the full event table in memory.
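Use it for larger imports, such as uncapped runs over large categories, where materializing every raw row in memory is impractical.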
Example:
source:
  name: amazon_reviews_2023
  reviews_path: data/raw/All_Beauty.jsonl.gz
  metadata_path: data/raw/meta_All_Beauty.jsonl.gz
  output_dir: outputs/imports/all_beauty_streaming
  import_mode: streaming
  time_index_mode: panelist_sequence
Streaming supports panelist_sequence and raw_timestamp. It does not support
global_sequence.
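Two pieces of the streaming path lend themselves to a short sketch: rejecting an unsupported mode combination up front, and writing rows to disk one at a time. The names check_streaming_config and write_jsonl_stream are hypothetical, and the real importer's validation and writer may differ.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Union

# Modes the docs list as supported by the streaming path.
STREAMING_TIME_INDEX_MODES = {"panelist_sequence", "raw_timestamp"}


def check_streaming_config(import_mode: str, time_index_mode: str) -> None:
    """Fail early when streaming is combined with an in-memory-only mode."""
    if import_mode == "streaming" and time_index_mode not in STREAMING_TIME_INDEX_MODES:
        raise ValueError(
            f"time_index_mode={time_index_mode!r} is not supported with "
            "import_mode='streaming'"
        )


def write_jsonl_stream(rows: Iterable[dict[str, Any]], path: Union[str, Path]) -> int:
    """Write rows one per line without building a list; return the row count."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Passing a generator as rows keeps at most one row in memory at a time, and the returned count can be recorded in the import statistics.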
Exported files¶
The Amazon source export writes:
events.jsonl
products.jsonl
personas.jsonl
metadata.json
data_dictionary.json
stats.json
| File | Description |
|---|---|
| events.jsonl | Schema-valid evaluation rows derived from reviews. |
| products.jsonl | Product records derived from metadata. |
| personas.jsonl | Persona records derived from user review histories. |
| metadata.json | Source-level import metadata. |
| data_dictionary.json | Source export dictionary. |
| stats.json | Import counts and source-specific statistics. |
Missing product metadata¶
Amazon review rows may reference products whose metadata is absent from the loaded metadata file.
By default:
require_metadata_match_for_events: false
With this setting, review events may still be emitted even when product metadata
is missing. Missing metadata counts are recorded in stats.json.
To keep only metadata-backed events:
require_metadata_match_for_events: true
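The effect of the flag can be sketched as a filter that always counts unmatched events and drops them only when asked. The function name filter_events_by_metadata and the stats key events_missing_metadata are assumptions; the actual key written to stats.json is not specified here.

```python
from typing import Any


def filter_events_by_metadata(
    events: list[dict[str, Any]],
    known_product_ids: set[str],
    require_match: bool = False,
) -> tuple[list[dict[str, Any]], dict[str, int]]:
    """Return (kept events, stats); drop unmatched events only if require_match."""
    kept = []
    missing = 0
    for event in events:
        has_metadata = event["product_id"] in known_product_ids
        if not has_metadata:
            missing += 1  # counted in both modes
        if has_metadata or not require_match:
            kept.append(event)
    return kept, {"events_missing_metadata": missing}


events = [{"product_id": "A"}, {"product_id": "B"}, {"product_id": "A"}]
kept_default, stats_default = filter_events_by_metadata(events, {"A"})
kept_strict, stats_strict = filter_events_by_metadata(events, {"A"}, require_match=True)
```

Counting the misses in both modes makes it possible to judge from the statistics alone how much data the strict setting would discard.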
Capped development imports¶
For development, use caps:
source:
  name: amazon_reviews_2023
  reviews_path: data/raw/All_Beauty.jsonl.gz
  metadata_path: data/raw/meta_All_Beauty.jsonl.gz
  output_dir: outputs/imports/all_beauty_dev
  import_mode: in_memory
  max_reviews: 1000
  max_metadata_rows: 1000
Independent caps can reduce review/metadata overlap because the first N review
rows and first M metadata rows may not refer to the same products.
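To see how much overlap a given pair of caps leaves, one can measure the share of capped reviews whose product also appears in the capped metadata. The helper capped_overlap below is a hypothetical diagnostic, not part of SIM-PANEL; it assumes both row types carry a parent_asin field.

```python
from itertools import islice
from typing import Any, Iterable, Optional


def capped_overlap(
    review_rows: Iterable[dict[str, Any]],
    metadata_rows: Iterable[dict[str, Any]],
    max_reviews: Optional[int] = None,
    max_metadata_rows: Optional[int] = None,
) -> float:
    """Fraction of the first N reviews whose product is in the first M metadata rows."""
    reviews = list(islice(review_rows, max_reviews))
    product_ids = {
        row["parent_asin"] for row in islice(metadata_rows, max_metadata_rows)
    }
    if not reviews:
        return 0.0
    matched = sum(1 for review in reviews if review["parent_asin"] in product_ids)
    return matched / len(reviews)


review_rows = [{"parent_asin": p} for p in ["A", "B", "C", "A"]]
metadata_rows = [{"parent_asin": p} for p in ["A", "D", "B"]]
capped = capped_overlap(review_rows, metadata_rows, max_reviews=4, max_metadata_rows=2)
uncapped = capped_overlap(review_rows, metadata_rows)
```

In this toy data the metadata cap of 2 hides product B, so the match rate falls from 0.75 to 0.5; a low rate on real data suggests raising max_metadata_rows relative to max_reviews.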
Downstream use¶
After source import, the benchmarks/ module can freeze a smaller
benchmark-ready subset. The comparison layer can then compare synthetic outputs
against that frozen reference.
Sources ingest. Benchmarks freeze. Comparison evaluates.