Sources

SIM-PANEL sources convert external datasets into canonical internal artifacts: event rows, product records, persona records, metadata, data dictionaries, and import statistics.

The source layer is for data ingestion and conversion. It is separate from synthetic generation, benchmark subset construction, and comparison.

sources/
  raw external data -> canonical imported artifacts

benchmarks/
  imported artifacts -> frozen real-data subset

analysis/compare/
  synthetic outputs + reference subset -> comparison metrics and reports

Source-layer contract

All source importers inherit from BaseSource.

A source is responsible for:

  • validating its configuration;

  • loading raw source artifacts;

  • transforming raw rows into SIM-PANEL records;

  • exporting materialized bundles;

  • optionally supporting a streaming import/export path.

The default in-memory flow is:

validate config
  -> load raw artifacts
  -> transform into canonical records
  -> return SourceExportBundle
  -> export artifacts

Streaming sources may bypass raw-bundle materialization and write artifacts incrementally to disk.
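
The contract above can be sketched as an abstract base class. This is an illustration only: the method names (validate_config, load_raw, transform, export, run) and the dict used in place of SourceExportBundle are assumptions, not the actual BaseSource API.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class BaseSource(ABC):
    """Illustrative skeleton only; method names are assumptions."""

    def __init__(self, config: dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def validate_config(self) -> None:
        """Raise if the config is unusable."""

    @abstractmethod
    def load_raw(self) -> Any:
        """Read raw source artifacts."""

    @abstractmethod
    def transform(self, raw: Any) -> dict[str, Any]:
        """Turn raw rows into canonical records (stand-in for SourceExportBundle)."""

    @abstractmethod
    def export(self, bundle: dict[str, Any]) -> None:
        """Write the bundle's artifacts."""

    def run(self) -> dict[str, Any]:
        # Default in-memory flow: validate, load, transform, export.
        self.validate_config()
        raw = self.load_raw()
        bundle = self.transform(raw)
        self.export(bundle)
        return bundle


class ToySource(BaseSource):
    """Tiny in-memory source used only to exercise the flow."""

    def __init__(self, config):
        super().__init__(config)
        self.calls = []
        self.exported = None

    def validate_config(self):
        if "rows" not in self.config:
            raise ValueError("config must contain 'rows'")
        self.calls.append("validate")

    def load_raw(self):
        self.calls.append("load")
        return list(self.config["rows"])

    def transform(self, raw):
        self.calls.append("transform")
        return {"events": raw, "stats": {"n_events": len(raw)}}

    def export(self, bundle):
        self.calls.append("export")
        self.exported = bundle
```

Calling ToySource({"rows": [...]}).run() walks the four steps in order. A streaming source would presumably override run so that export happens while rows are being read.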

Source registry

SIM-PANEL uses a source registry so orchestration code does not need to hard-code source classes.

A source config declares the source name:

source:
  name: amazon_reviews_2023
  reviews_path: data/raw/All_Beauty.jsonl.gz
  metadata_path: data/raw/meta_All_Beauty.jsonl.gz
  output_dir: outputs/imports/all_beauty

The currently implemented source is:

| Source name | Description |
| --- | --- |
| amazon_reviews_2023 | Imports Amazon Reviews’23 review and metadata files into SIM-PANEL artifacts. |
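
A name-based registry of this kind is often built from a dictionary and a decorator. The sketch below shows the general pattern under assumed names: register_source, build_source, and the AmazonReviews2023Source class are hypothetical and may not match the SIM-PANEL code.

```python
from typing import Any, Callable, Dict

# Maps the config's source.name to the class that implements it.
_SOURCES: Dict[str, type] = {}


def register_source(name: str) -> Callable[[type], type]:
    """Class decorator that records a source class under a name."""

    def decorator(cls: type) -> type:
        if name in _SOURCES:
            raise ValueError(f"source {name!r} is already registered")
        _SOURCES[name] = cls
        return cls

    return decorator


def build_source(config: Dict[str, Any]) -> Any:
    """Instantiate the source named in config['source']['name']."""
    source_cfg = config["source"]
    name = source_cfg["name"]
    if name not in _SOURCES:
        known = ", ".join(sorted(_SOURCES)) or "none"
        raise KeyError(f"unknown source {name!r}; registered: {known}")
    return _SOURCES[name](source_cfg)


@register_source("amazon_reviews_2023")
class AmazonReviews2023Source:
    """Placeholder class; the real importer does much more."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config
```

With this pattern, orchestration code only calls build_source(config) and never imports a concrete source class.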

Export bundle

A source import produces a SourceExportBundle containing:

| Artifact | Description |
| --- | --- |
| events | Schema-valid event rows. |
| products | Product records. |
| personas | Persona records. |
| metadata | Source-level import metadata. |
| data_dictionary | Field-level source/export dictionary. |
| stats | Lightweight import statistics. |

For streaming imports, large artifacts are written directly to disk. The returned bundle may contain empty row lists while still carrying metadata and stats.
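
One way to picture the bundle is as a dataclass with one attribute per artifact. The attribute names below come from the table above; the types, the defaults, and the n_events stats key are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class SourceExportBundle:
    """Sketch of the bundle shape; types and defaults are assumed."""

    events: list[dict[str, Any]] = field(default_factory=list)
    products: list[dict[str, Any]] = field(default_factory=list)
    personas: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
    data_dictionary: dict[str, Any] = field(default_factory=dict)
    stats: dict[str, Any] = field(default_factory=dict)


# A streaming import could return a bundle like this: the row lists stay
# empty because rows were already written to disk, while metadata and
# stats are still populated. "n_events" is a hypothetical stats key.
streaming_bundle = SourceExportBundle(
    metadata={"source": "amazon_reviews_2023"},
    stats={"n_events": 3},
)
```

Callers that need the rows after a streaming import would then read the exported files instead of the bundle's lists.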

Amazon Reviews’23 background

Amazon Reviews’23 is a large-scale review dataset released by McAuley Lab. It contains user reviews, item metadata, and user-item interaction structure across multiple product categories.

SIM-PANEL does not redistribute the raw dataset. The amazon_reviews_2023 adapter assumes users have downloaded review and metadata files locally, then converts those local files into SIM-PANEL-compatible artifacts.

Users should consult the original dataset page for download links, citation information, licensing terms, and data-use conditions.

Amazon Reviews’23 source

The Amazon importer expects two local files:

  • a review JSONL or JSONL.GZ file;

  • a metadata JSONL or JSONL.GZ file.

Example config:

source:
  name: amazon_reviews_2023
  reviews_path: data/raw/All_Beauty.jsonl.gz
  metadata_path: data/raw/meta_All_Beauty.jsonl.gz
  category: All_Beauty
  output_dir: outputs/imports/all_beauty
  import_mode: in_memory

Amazon config fields

| Field | Default | Description |
| --- | --- | --- |
| reviews_path | Required | Path to raw review JSONL or JSONL.GZ file. |
| metadata_path | Required | Path to raw metadata JSONL or JSONL.GZ file. |
| category | null | Optional category/domain label. |
| output_dir | null | Output directory. |
| import_mode | in_memory | Either in_memory or streaming. |
| require_metadata_match_for_events | false | Drop review events without matching product metadata if true. |
| trace_field_map | title/text mapping | Maps review fields into event traces. |
| time_index_mode | panelist_sequence | How event period index t is derived. |
| min_reviews_per_persona | 1 | Minimum review count required to emit a persona record. |
| max_reviews | null | Optional cap on raw review rows. |
| max_metadata_rows | null | Optional cap on raw metadata rows. |

Default trace mapping:

trace_field_map:
  title: review_title
  text: review_text
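
The fields and defaults listed in this section could be captured in a validated config object along the following lines. The class name AmazonSourceConfig and the specific checks are assumptions; the field names, the defaults, and the rule that streaming does not support global_sequence are taken from this document.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

IMPORT_MODES = {"in_memory", "streaming"}
TIME_INDEX_MODES = {"panelist_sequence", "raw_timestamp", "global_sequence"}


@dataclass
class AmazonSourceConfig:
    """Hypothetical config object mirroring the documented fields."""

    reviews_path: str
    metadata_path: str
    category: Optional[str] = None
    output_dir: Optional[str] = None
    import_mode: str = "in_memory"
    require_metadata_match_for_events: bool = False
    trace_field_map: Dict[str, str] = field(
        default_factory=lambda: {"title": "review_title", "text": "review_text"}
    )
    time_index_mode: str = "panelist_sequence"
    min_reviews_per_persona: int = 1
    max_reviews: Optional[int] = None
    max_metadata_rows: Optional[int] = None

    def __post_init__(self) -> None:
        if self.import_mode not in IMPORT_MODES:
            raise ValueError(f"unknown import_mode: {self.import_mode!r}")
        if self.time_index_mode not in TIME_INDEX_MODES:
            raise ValueError(f"unknown time_index_mode: {self.time_index_mode!r}")
        # global_sequence needs the full corpus, so streaming cannot use it.
        if self.import_mode == "streaming" and self.time_index_mode == "global_sequence":
            raise ValueError("global_sequence is only available with import_mode: in_memory")
```

Validating at construction time means an unsupported mode combination fails before any raw file is opened.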

Import mapping

The Amazon importer converts source files as follows:

| SIM-PANEL artifact | Built from |
| --- | --- |
| products.jsonl | Item metadata rows. |
| personas.jsonl | User review histories. |
| events.jsonl | Review rows. |

SIM-PANEL uses parent_asin as the canonical product_id. Child asin values are retained as source provenance under event traces.

Imported review rows become schema-valid evaluation events with:

| Event field | Source |
| --- | --- |
| policy | "manual" |
| panelist_id | Amazon user_id |
| product_id | Amazon parent_asin |
| product_display | Product display text or display name. |
| outcomes.rating | Review rating. |
| outcomes.verified_purchase | Review verified_purchase. |
| outcomes.helpful_vote | Review helpful_vote or helpful_votes. |
| traces.review_title | Review title, by default. |
| traces.review_text | Review text, by default. |

Imported real events use policy: manual because they are observational records, not randomized synthetic assignments.
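
As a rough sketch of this mapping, the function below turns one raw review row into an event dict. The function name review_to_event, the nested outcomes/traces layout, and the source_asin trace key are assumptions for illustration; the raw field names (user_id, parent_asin, asin, rating, title, text, verified_purchase, helpful_vote) are those of the review files.

```python
from typing import Any, Dict, Optional

DEFAULT_TRACE_FIELD_MAP = {"title": "review_title", "text": "review_text"}


def review_to_event(
    review: Dict[str, Any],
    product_display: Optional[str] = None,
    trace_field_map: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Convert one raw review row into an event dict (illustrative only)."""
    mapping = trace_field_map or DEFAULT_TRACE_FIELD_MAP

    # Accept either spelling of the helpfulness counter.
    helpful = review.get("helpful_vote")
    if helpful is None:
        helpful = review.get("helpful_votes")

    # Copy the configured review fields into traces under their mapped names.
    traces = {dst: review.get(src) for src, dst in mapping.items()}
    # Keep the child ASIN as provenance; the key name is an assumption.
    traces["source_asin"] = review.get("asin")

    return {
        "policy": "manual",  # observational record, not a randomized assignment
        "panelist_id": review["user_id"],
        "product_id": review["parent_asin"],
        "product_display": product_display,
        "outcomes": {
            "rating": review.get("rating"),
            "verified_purchase": review.get("verified_purchase"),
            "helpful_vote": helpful,
        },
        "traces": traces,
    }
```

The period index t is left out here because it depends on the configured time-index mode.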

Time index t

The Amazon importer supports three time-index modes:

| Mode | Description | Streaming support |
| --- | --- | --- |
| panelist_sequence | Sort each user’s reviews chronologically and assign t = 0, 1, ... within user. | Yes |
| raw_timestamp | Use the source timestamp directly. | Yes |
| global_sequence | Sort the full corpus chronologically and assign a corpus-wide index. | In-memory only |

Default:

time_index_mode: panelist_sequence

panelist_sequence is recommended for panel-style simulation because it gives each panelist an individual exposure history.
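
The panelist_sequence rule can be illustrated with a small in-memory function. This is a sketch of the indexing logic only: the function name is hypothetical, tie-breaking by input order is an assumption, and it does not show how the streaming path achieves the same result.

```python
from collections import defaultdict
from typing import Any, Dict, List


def assign_panelist_sequence(reviews: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the rows with a per-user period index t."""
    by_user: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in reviews:
        by_user[row["user_id"]].append(row)

    indexed = []
    for rows in by_user.values():
        # Stable sort: rows with equal timestamps keep their input order.
        rows.sort(key=lambda r: r["timestamp"])
        for t, row in enumerate(rows):
            indexed.append({**row, "t": t})
    return indexed


reviews = [
    {"user_id": "u1", "timestamp": 300},
    {"user_id": "u2", "timestamp": 100},
    {"user_id": "u1", "timestamp": 200},
]
indexed = assign_panelist_sequence(reviews)
```

Here user u1's earlier review (timestamp 200) gets t = 0 and the later one gets t = 1, while u2's single review gets t = 0, independent of u1.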

Import modes

In-memory

The in-memory path materializes raw review and metadata rows before transforming them.

Use it for:

  • small categories;

  • capped development runs;

  • transformation debugging;

  • semantic tests.

Example:

source:
  name: amazon_reviews_2023
  reviews_path: data/raw/All_Beauty.jsonl.gz
  metadata_path: data/raw/meta_All_Beauty.jsonl.gz
  output_dir: outputs/imports/all_beauty_small
  import_mode: in_memory
  max_reviews: 10000
  max_metadata_rows: 5000

Streaming

The streaming path writes large artifacts incrementally and avoids constructing the full event table in memory.

Use it for imports that are too large to materialize in memory.

Example:

source:
  name: amazon_reviews_2023
  reviews_path: data/raw/All_Beauty.jsonl.gz
  metadata_path: data/raw/meta_All_Beauty.jsonl.gz
  output_dir: outputs/imports/all_beauty_streaming
  import_mode: streaming
  time_index_mode: panelist_sequence

Streaming supports panelist_sequence and raw_timestamp. It does not support global_sequence.
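
The core idea of an incremental path is to read one row, transform it, and write it out before reading the next. The sketch below shows that pattern for JSONL and JSONL.GZ inputs using only the standard library; iter_jsonl and stream_convert are hypothetical helper names, not SIM-PANEL functions.

```python
import gzip
import json
from typing import Any, Callable, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed row at a time from a .jsonl or .jsonl.gz file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def stream_convert(
    src_path: str,
    dst_path: str,
    transform: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> int:
    """Transform rows one by one and append them to dst_path; return the row count."""
    count = 0
    with open(dst_path, "w", encoding="utf-8") as out:
        for row in iter_jsonl(src_path):
            out.write(json.dumps(transform(row)) + "\n")
            count += 1
    return count
```

Only one row is held in memory at a time, which is also why a corpus-wide sort such as global_sequence does not fit this path.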

Exported files

The Amazon source export writes:

events.jsonl
products.jsonl
personas.jsonl
metadata.json
data_dictionary.json
stats.json

| File | Description |
| --- | --- |
| events.jsonl | Schema-valid evaluation rows derived from reviews. |
| products.jsonl | Product records derived from metadata. |
| personas.jsonl | Persona records derived from user review histories. |
| metadata.json | Source-level import metadata. |
| data_dictionary.json | Source export dictionary. |
| stats.json | Import counts and source-specific statistics. |

Missing product metadata

Amazon review rows may reference products whose metadata is absent from the loaded metadata file.

By default:

require_metadata_match_for_events: false

With this setting, review events may still be emitted even when product metadata is missing. Missing metadata counts are recorded in stats.json.

To keep only metadata-backed events:

require_metadata_match_for_events: true
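
The effect of the flag can be sketched as a filter that always counts unmatched events and drops them only when a match is required. The function name and the events_missing_product_metadata stats key are assumptions; the actual key names in stats.json may differ.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple


def filter_events_by_metadata(
    events: Iterable[Dict[str, Any]],
    known_product_ids: Set[str],
    require_match: bool = False,
) -> Tuple[List[Dict[str, Any]], Dict[str, int]]:
    """Keep or drop events whose product_id has no loaded metadata."""
    kept: List[Dict[str, Any]] = []
    missing = 0
    for event in events:
        if event["product_id"] not in known_product_ids:
            missing += 1  # counted in both modes
            if require_match:
                continue  # dropped only when a match is required
        kept.append(event)
    # Hypothetical stats key, used here for illustration.
    return kept, {"events_missing_product_metadata": missing}
```

Either way, the missing count is reported, so the size of the metadata gap stays visible even when unmatched events are kept.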

Capped development imports

For development runs, cap the number of raw rows read with max_reviews and max_metadata_rows:

source:
  name: amazon_reviews_2023
  reviews_path: data/raw/All_Beauty.jsonl.gz
  metadata_path: data/raw/meta_All_Beauty.jsonl.gz
  output_dir: outputs/imports/all_beauty_dev
  import_mode: in_memory
  max_reviews: 1000
  max_metadata_rows: 1000

Independent caps can reduce review/metadata overlap because the first N review rows and first M metadata rows may not refer to the same products.
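
To see how much overlap a pair of caps leaves, the share of reviewed products that also appear in the capped metadata can be estimated directly from the row prefixes. The helper below is a hypothetical diagnostic, not part of SIM-PANEL, and it assumes both files carry parent_asin on every row.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Optional


def capped_overlap(
    reviews: Iterable[Dict[str, Any]],
    metadata: Iterable[Dict[str, Any]],
    max_reviews: Optional[int] = None,
    max_metadata_rows: Optional[int] = None,
) -> float:
    """Fraction of reviewed products that have metadata after applying both caps."""
    # islice with a stop of None reads everything, which matches a null cap.
    review_ids = {r["parent_asin"] for r in islice(reviews, max_reviews)}
    meta_ids = {m["parent_asin"] for m in islice(metadata, max_metadata_rows)}
    if not review_ids:
        return 0.0
    return len(review_ids & meta_ids) / len(review_ids)


reviews = [{"parent_asin": "A"}, {"parent_asin": "B"}, {"parent_asin": "C"}]
metadata = [{"parent_asin": "C"}, {"parent_asin": "B"}, {"parent_asin": "A"}]
```

With these toy rows, no caps give full overlap, while max_reviews of 2 and max_metadata_rows of 1 give none, because the first two reviews cover A and B and the first metadata row covers only C.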

Downstream use

After source import, the benchmarks/ module can freeze a smaller benchmark-ready subset. The comparison layer can then compare synthetic outputs against that frozen reference.

Sources ingest. Benchmarks freeze. Comparison evaluates.