MedWeight Clinical Assessment System
Instrument Definition Testing Report · Version 2.0 — Full 19-Instrument Suite · Prepared for Dr. Michael Lyon, Clinical Director, Obesity Medicine and Diabetes Institute · Report Date: March 29, 2026
100%
Pass Rate
4,087 tests · 0 failures
4,087
Automated Tests
Test Suite Version 2.0
19
Instruments
28 test groups
745
Clinical Items
Across all instruments
Section 1
Executive Summary
This report documents the comprehensive automated testing of all 19 clinical assessment instrument definitions deployed in the MedWeight patient engagement platform. Each instrument is defined as a structured JSON file that drives the assessment engine, scoring engine, phenotype classification, coaching session check-ins, and cross-instrument trigger logic.
The test suite executes 4,087 discrete automated assertions across 28 test groups, covering structural integrity, scoring correctness, phenotype classification, trigger chain validation, dynamic prompt generation, clinical flag evaluation, special-case instrument logic, and synthetic patient profile simulations that validate the full scoring-to-phenotype-to-trigger pipeline against hand-calculated expected outputs. All 19 instruments pass with a 100% pass rate.

This represents the completion of the full instrument library originally scoped during the clinical design phase. The first 11 instruments were validated in Test Suite v1.0 (1,714 tests, March 26, 2026). The eight additional instruments (TNSDA, RRVA, WTPTHA, BIWSSA, SSSHEA, NRAF-EF, BLPA, MRCA) have now been built, uploaded, and subjected to the same testing standard, with the suite expanded to 28 test groups and 4,087 tests to accommodate the new instruments' unique architectural features.
Section 1.1
What This Test Suite Validates
The test suite serves as a structural and clinical logic audit of every instrument definition before it reaches patients or clinicians. It verifies that:
Schema & Structure
Every JSON definition conforms to the required schema and contains all mandatory fields for the assessment engine to render, score, classify, and act on patient responses.
Item Numbering
Item numbering is sequential, correctly prefixed, and free of duplicates, gaps, or orphaned references.
Reverse Scoring
Reverse-scored items are consistent between inline markers and scoring block declarations, with verified symmetric reversal maps.
Domain Alignment
Scoring domains align perfectly between section definitions and scoring block declarations, with correct separation of non-scorable sections (patterns, context, risk flags).
Severity Thresholds
Severity thresholds cover the full measurement scale with no gaps, starting at zero and spanning through the maximum possible score.
Phenotype Classification
Phenotype classification rules reference only domains that exist in their instrument, have complete coaching implications, and use the correct operator type (threshold-based or highest-section).
Trigger Graph Integrity
Cross-instrument trigger rules form a valid directed graph: every referenced module exists, every referenced domain exists within its source module, and every referenced risk flag key exists in the source instrument.
Sentinel Items
Short-form sentinel items for coaching session micro-assessments exist in the instrument, have MI-reworded conversational versions, and cover a representative sample of the instrument's domains.
Scoring Simulation
The scoring engine produces mathematically correct domain means, global scores, and severity classifications for synthetic patients at floor, midpoint, and ceiling response values, including correct handling of reverse-scored items.
Trigger Cascade
The cross-instrument trigger cascade fires correctly under high-severity, zero-severity, and selective-elevation conditions, with full second-level chain verification.
Special-Case Logic
Special-case scoring systems (EBCA sum-based grading, CANLA knowledge scoring, CEFRA trigger load separation, FSMCA inverted thresholds, BLPA profile interpretation, WTPTHA composite indices, SACA apnea domain separation, DOMM PHQ-9 integration, CRSEM coaching stance safety rules) all function as designed.
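As one representative example, the severity-threshold coverage rule above (no gaps, starting at zero, spanning through the maximum score) can be sketched as a single boolean assertion. The band layout here (`label`/`min`/`max` fields, 0.1 step between bands) is an illustrative assumption, not the deployed schema:

```python
def check_threshold_coverage(bands, scale_max, step=0.1):
    """Return (passed, message) for one instrument's severity bands.

    Asserts the bands start at zero, are contiguous (no gaps or overlaps
    at the given step), and reach the maximum possible score. The
    label/min/max layout is illustrative, not the production schema.
    """
    ordered = sorted(bands, key=lambda b: b["min"])
    if ordered[0]["min"] != 0:
        return False, "bands do not start at zero"
    for prev, nxt in zip(ordered, ordered[1:]):
        if round(nxt["min"] - prev["max"], 2) != step:
            return False, f"gap/overlap between {prev['label']} and {nxt['label']}"
    if ordered[-1]["max"] < scale_max:
        return False, "bands do not span the maximum score"
    return True, "ok"
```

Each such check returns a pass/fail flag plus a diagnostic message, which is the assertion shape the suite uses throughout.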
Section 1.2
Why This Matters Clinically
These instrument definitions are not static documents. They are the executable clinical logic that determines how patients are assessed, how severity is classified, which deeper assessments are triggered, what phenotype labels drive coaching adaptation, and what clinical flags reach the clinician dashboard.
The Risk of a Malformed Definition
A single malformed definition could silently:
  • Misclassify severity
  • Fail to trigger a warranted downstream assessment
  • Assign the wrong coaching stance
  • Miss a clinical flag that requires clinician review
The Assurance Automated Testing Provides
Automated testing at this granularity provides assurance that every instrument deployed to the live server will behave exactly as designed by the clinical team, with no silent structural errors propagating into patient care.
Section 2
Instruments Tested
The following 19 instruments were defined and tested. Total item count across all instruments: 745 items. Instruments span seven clinical categories: Psychology (4), Eating Behavior (3), Skills (2), Physiology (2), Change Process (3), Environment (2), and Adaptation (3).
Column key: Secs = sections, Rev = reverse-scored items, Phen = phenotype rules, Trig = trigger rules, Sent = sentinel items for coaching micro-assessment.
Section 3
Testing Methodology
The test suite (test_all_instruments.py) loads all 19 JSON instrument definitions from the server's assessment_definitions directory and runs 28 test groups. Each test is a discrete boolean assertion that either passes or fails with a specific diagnostic message. Tests are designed to catch both structural errors (malformed JSON, missing fields, broken references) and clinical logic errors (incorrect scoring, wrong phenotype classification, failed trigger chains, missing clinical flags).
The suite is designed to be run after any instrument definition change, JSON upload, or database reload. It requires no database connection, no server access, and no patient data. It operates entirely on the JSON definitions as the single authoritative source of clinical logic.
No DB Required
Operates on JSON definitions only
Boolean Assertions
Pass or fail with diagnostic messages
Dual Coverage
Structural errors + clinical logic errors
Run Anytime
After any definition change or upload
Section 3.1
Test Groups (G1–G10)
Section 3.1 (continued)
Test Groups (G11–G28)
Section 4
Scoring Architecture Coverage
The 19 instruments use five distinct scoring architectures. The test suite validates each type with dedicated test groups:
4.1 Domain Mean Scoring (16 instruments)
The primary scoring method used by BIWSSA, BLPA, CRSEM, DOMM, FSMCA, LOCEA, MCAA, MDOA, MRCA, NRAF-EF, RRVA, SACA, SEIM, SSSHEA, TNSDA, and WTPTHA. Items within each domain are averaged to produce a domain mean score (0.0–4.0 scale). The global score is the mean of all domain scores. Reverse-scored items are remapped before averaging. The test suite validates the mathematical correctness of this computation at floor (0), midpoint (2), and ceiling (4) values, including the expected non-zero domain means that arise when reverse-scored items are present at floor scoring.
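A minimal sketch of this computation follows; the `domains`/`reverse_scored_items` layout is assumed for illustration and does not mirror the actual JSON schema:

```python
def score_domain_means(instrument, responses):
    """Compute domain means and the global score on the 0.0-4.0 scale.

    Reverse-scored items are remapped (v -> 4 - v) before averaging,
    which is why floor responses yield non-zero means in domains that
    contain reverse-scored items.
    """
    reverse = set(instrument.get("reverse_scored_items", []))
    domain_scores = {}
    for domain, item_ids in instrument["domains"].items():
        values = [(4 - responses[i]) if i in reverse else responses[i]
                  for i in item_ids]
        domain_scores[domain] = round(sum(values) / len(values), 2)
    global_score = round(sum(domain_scores.values()) / len(domain_scores), 2)
    return domain_scores, global_score
```

Scoring a hypothetical two-domain instrument at the floor (all 0s) with one reverse-scored item produces a 2.0 mean in that domain, matching the expected non-zero floor behavior validated by the suite.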
4.2 Sum-Based Graded Severity (EBCA)
The EBCA uses a sum score (0–30) across 10 binge severity items with four severity thresholds. The test suite validates the sum method, max score, item range, and four-level classification at boundary values. It also verifies the presence of separate bulimia and anorexia screening criteria.
4.3 Knowledge Scoring (CANLA)
The CANLA uses a sum-correct method where each of 60 knowledge items has a correct_answer that must match one of the defined option labels. The test suite validates every item has a correct_answer and options field, confirms correct_answer appears in the option labels, and tests the three-tier threshold classification (low/moderate/high) at boundary values.
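The per-item validation can be sketched as follows; the `correct_answer` and `options` field names follow this report's description, while the list-of-strings option shape is an assumption:

```python
def check_knowledge_items(items):
    """Return diagnostic messages for malformed knowledge items.

    Every item must carry both fields, and the correct_answer must be
    one of the defined option labels.
    """
    errors = []
    for item in items:
        if "options" not in item or "correct_answer" not in item:
            errors.append(f"{item.get('id', '?')}: missing options or correct_answer")
        elif item["correct_answer"] not in item["options"]:
            errors.append(f"{item['id']}: correct_answer not among option labels")
    return errors
```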
4.4 Core Domains + Trigger Load (CEFRA)
The CEFRA uses a dual scoring path: 6 core symptom domains scored as domain means, plus a separate trigger_load score (items 33–42) scored independently. The clinical_patterns section (items 43–48) contains both yes/no screening items and categorical pattern items. The test suite validates this separation, the trigger_load item range, and the clinical flag definitions for probable and strongly probable addictive patterns.
4.5 Profile Interpretation (BLPA)
The BLPA is a non-severity profiler that maps domain scores to coaching style recommendations. The test suite validates the profile_interpretation keys, domain-level thresholds (low/moderate/high rather than minimal/mild/moderate/severe), and correct instrument_type designation as 'profile'.
Section 5
Synthetic Patient Profile Testing
Test groups G25 through G28 score clinically realistic synthetic patient profiles through the full assessment pipeline and validate every output dimension — domain scores, global scores, severity classification, phenotype assignment, trigger activation, and coaching stance constraints — against hand-calculated expected values.
This is the most clinically meaningful layer of testing: it confirms that the instruments produce correct clinical decisions for real-world patient presentations, not merely that their JSON structures are well-formed.
G25
MDOA Profiles
32 tests · 5 synthetic patients
G26
Downstream Instruments
21 tests · 7 instruments
G27
New 8 Instruments
31 tests · 8 instruments
G28
Special Scoring
22 tests · EBCA & CANLA
Section 5.1
MDOA Patient Profiles (G25 — 32 tests)
Five synthetic patients are scored through the MDOA with distinct clinical presentations:
1
Mood-Driven
mood=3.0, all other domains=1.0. Validates: mood_driven phenotype assigned, DOMM and SEIM triggered, CEFRA/LOCEA/NRAF-EF/BIWSSA not triggered, global=1.33, severity=Mild.
2
High Complexity
All 6 domains=3.0, 3 risk flags endorsed. Validates: high_complexity and mixed_high_complexity phenotypes both assigned, all 6 downstream modules triggered, global=3.0, severity=Severe.
3
Subclinical
Domains mixed at 1–2 (global=1.5). Validates: no single-domain phenotype fires (no domain ≥2.5), severity=Mild.
4
Reward + LOC Dual Elevation
reward=3.0, LOC=3.0, others=1.0. Validates: both reward_dominant and loc_compulsive phenotypes assigned, CEFRA and LOCEA triggered, DOMM not triggered.
5
Boundary Case
global=1.67, no domain ≥2.5. Validates: correct absence of any phenotype classification. This is the documented boundary behavior from the clinical design phase — phenotype rules require domain scores ≥2.5, so a globally mild patient correctly receives no phenotype label.
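The threshold-based phenotype behavior exercised by these five profiles can be sketched as a rule scan; the rule layout is an illustrative assumption, while the ≥2.5 domain cutoff is the documented boundary:

```python
def assign_phenotypes(domain_scores, rules):
    """Return every phenotype whose source domain meets its cutoff.

    A globally mild patient with no domain >= 2.5 correctly receives no
    label, as in the boundary-case profile above.
    """
    return [rule["phenotype"] for rule in rules
            if domain_scores.get(rule["domain"], 0.0) >= rule["threshold"]]
```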
Section 5.2 & 5.3
Downstream & New 8 Instrument Profiles
G26 — Downstream Instrument Profiles (21 tests)
Synthetic patients are scored through 7 downstream instruments with clinically meaningful profiles. Each test validates domain scores, global scores, and severity classification:
  • SEIM: Moderate stress-eater (emotional_trigger=3.0, others=2.0) → global=2.25, severity=Moderate.
  • DOMM: Severe depression-obesity profile (all domains=3.0) → global=3.0, severity=Severe. Also tests a mild profile (all=1.0) → global=1.0, severity=Mild, confirming the presentation is not clinically significant at this level.
  • CEFRA: Probable addictive pattern (harm_impairment=3.0 with 3 core domains=3.0) → global=2.33, severity=Moderate. Also tests low-severity (mostly 0s) → severity=Minimal.
  • FSMCA: Strong competence (all domains=3.0) → severity=Strong (inverted threshold validation). Severe impairment (all domains=0) → severity=Severe Impairment.
  • SACA: Core domains at 2.0 → global=2.0. Validates apnea_risk domain is excluded from domain_scores (scored separately).
G27 — New 8 Instrument Profiles (31 tests)
Each of the 8 newly deployed instruments receives at least one synthetic patient with a clinically distinct profile. For instruments that use the highest-section phenotype operator, the test validates that the highest-scoring domain is correctly identified:
  • TNSDA: Hyperarousal-dominant (hyperarousal=3.0, others=1–2) → validates hyperarousal is highest domain. Also tests pan-severe (all=3.0) → severity=Severe.
  • BIWSSA: Shame-dominant (shame=3.0) → validates shame is highest domain.
  • RRVA: Cognitive rigidity profile (cognitive_rigidity=3.0) with correct recovery_capacity scoring including reverse items.
  • NRAF-EF: Impulsivity-dominant (impulsivity=3.0) → validates impulsivity is highest domain.
  • MRCA: Evening hyperphagia (evening_eating=3.0) with correct reverse-scored item handling across 5 domains.
  • SSSHEA: Household conflict + low agency (both=3.0) with correct reverse-scored support and low_agency domain validation.
  • WTPTHA: Cycling + biological resistance (both=3.0) → global=2.5, severity=Moderate.
  • BLPA: Structure-dependent learner (structure_preference=3.0, accountability=3.0) → qualifies for high_structure profile interpretation.
  • CRSEM: Two profiles tested — (a) low-readiness/high-ambivalence (readiness=1.0, ambivalence=3.0) validates the coaching safety rule is in effect; (b) high-readiness/low-ambivalence (readiness=3.0, ambivalence=1.0) validates action coaching is appropriate.
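The highest-section operator used by several of these instruments can be sketched as an argmax over domain means. The minimum floor shown is an illustrative assumption borrowed from the MDOA cutoff; each instrument defines its own:

```python
def highest_section(domain_scores, floor=2.5):
    """Return the dominant domain for highest-section phenotype rules,
    or None when no domain clears the floor (so mild patients get no label).
    The 2.5 floor is an assumed default, not a documented constant.
    """
    domain, score = max(domain_scores.items(), key=lambda kv: kv[1])
    return domain if score >= floor else None
```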
Section 5.4
Special Scoring Profiles (G28 — 22 tests)
Tests the two instruments with non-standard scoring architectures using boundary-value analysis:
EBCA Sum-Based Grading
5 synthetic patients at boundary scores:
  • score=0 → minimal (scale floor)
  • score=9 → minimal (upper boundary of the band)
  • score=10 → mild-to-moderate (lower boundary of the band)
  • score=15 → mild-to-moderate (mid-band)
  • score=28 → severe
Validates that threshold boundaries produce the correct classifications.
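The sum-based grading can be sketched consistently with these boundary cases. Only the 9/10 boundary and the classifications of scores 0, 15, and 28 come from this report; the two upper cutoffs (17/18 and 24/25) are hypothetical placeholders:

```python
# Band cutoffs: the minimal / mild-to-moderate boundary (9/10) is from the
# report; the two upper cutoffs are hypothetical placeholders.
EBCA_BANDS = [
    (0, 9, "minimal"),
    (10, 17, "mild_to_moderate"),
    (18, 24, "moderate_to_severe"),
    (25, 30, "severe"),
]

def classify_ebca(sum_score):
    """Map an EBCA sum score (0-30) to its severity grade."""
    for lo, hi, label in EBCA_BANDS:
        if lo <= sum_score <= hi:
            return label
    raise ValueError(f"score {sum_score} outside the 0-30 scale")
```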
CANLA Knowledge Scoring
3 synthetic patients at:
  • score=10 → low
  • score=30 → moderate
  • score=60 → high/perfect
Validates correct_answer alignment with option labels for a sample of items across all 6 knowledge domains.
Section 6
Cross-Instrument Trigger Chain Validation
The assessment system uses a directed acyclic graph of trigger rules to cascade patients from the foundational MDOA through progressively deeper instruments based on domain-level severity. The test suite validates the complete trigger chain at three levels:
01
6.1 First-Level Triggers (MDOA → Downstream)
MDOA domain scores trigger six downstream instruments: mood ≥ 2.0 triggers DOMM and SEIM; LOC ≥ 2.0 triggers LOCEA; reward ≥ 2.0 triggers CEFRA; executive ≥ 2.5 triggers NRAF-EF; shame ≥ 2.5 triggers BIWSSA. The test suite verifies all six fire under high-severity conditions, zero fire under no-severity conditions, and only the mood-linked instruments fire when only mood is elevated.
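These first-level rules can be sketched as a trigger table. The thresholds and target instruments are transcribed from this section; the rule layout itself is an illustrative assumption:

```python
# Trigger table transcribed from Section 6.1; the dict layout is assumed.
MDOA_TRIGGERS = [
    {"domain": "mood",      "threshold": 2.0, "targets": ["DOMM", "SEIM"]},
    {"domain": "loc",       "threshold": 2.0, "targets": ["LOCEA"]},
    {"domain": "reward",    "threshold": 2.0, "targets": ["CEFRA"]},
    {"domain": "executive", "threshold": 2.5, "targets": ["NRAF-EF"]},
    {"domain": "shame",     "threshold": 2.5, "targets": ["BIWSSA"]},
]

def fire_triggers(domain_scores, rules=MDOA_TRIGGERS):
    """Return the downstream instruments activated by the given domain scores."""
    triggered = []
    for rule in rules:
        if domain_scores.get(rule["domain"], 0.0) >= rule["threshold"]:
            triggered.extend(rule["targets"])
    return triggered
```

Under this table, a pan-severe patient activates all six downstream instruments, a zero-severity patient activates none, and a mood-only elevation activates only DOMM and SEIM, which is exactly the three-condition behavior the suite verifies.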
02
6.2 Second-Level Triggers
Downstream instruments trigger further assessment: SEIM emotional_trigger ≥ 2.5 triggers TNSDA; CRSEM recovery ≤ 1.5 triggers RRVA; RRVA biological_rebound ≥ 2.0 triggers WTPTHA; SACA circadian ≥ 2.0 triggers MRCA; FSMCA adherence_barriers ≥ 2.5 triggers SSSHEA. The test suite verifies each of these source_module references exists and the trigger chain connectivity is complete.
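The connectivity check behind these chains can be sketched as a reference walk over the instrument set. The `source_module`/domain vocabulary follows this report; the exact JSON layout is assumed:

```python
def check_trigger_graph(instruments):
    """Return diagnostics for dangling trigger references.

    `instruments` maps module name -> definition with `domains` (a set of
    domain names) and `trigger_rules` (each naming a source_module and a
    domain within it). This layout is illustrative, not the real schema.
    """
    errors = []
    for name, inst in instruments.items():
        for rule in inst.get("trigger_rules", []):
            src = rule["source_module"]
            if src not in instruments:
                errors.append(f"{name}: unknown source_module '{src}'")
            elif rule["domain"] not in instruments[src]["domains"]:
                errors.append(f"{name}: domain '{rule['domain']}' missing in {src}")
    return errors
```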
03
6.3 Clinical Signal Triggers
Several instruments also support clinical_signal triggers (e.g., BLPA triggered by 'coaching_style_optimization', TNSDA by 'dysregulation_or_trauma_history') that allow clinician-initiated assessment assignment outside the automated cascade. These are validated as part of the trigger rule structure tests.
The trigger graph ensures patients are routed to progressively deeper assessments only when domain-level severity warrants it, preventing unnecessary assessment burden while ensuring no clinically significant pathway is missed.
Section 7
Special-Case Instrument Validation
7.1 FSMCA Inverted Thresholds
The FSMCA measures food skill competence, where higher scores indicate greater capability. Unlike all other severity instruments, its thresholds run in the opposite direction: 'strong' (3.0–4.0) at the top, 'severe_impairment' (0.0–0.9) at the bottom. The test suite confirms this inversion is correctly structured.
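The inversion can be sketched directly from the two endpoint bands stated above; the middle bands are hypothetical placeholders:

```python
# Higher scores mean stronger competence. The 'strong' and
# 'severe_impairment' bands are from the report; the middle two are
# hypothetical placeholders.
FSMCA_BANDS = [
    (3.0, 4.0, "strong"),
    (2.0, 2.9, "adequate"),
    (1.0, 1.9, "limited"),
    (0.0, 0.9, "severe_impairment"),
]

def classify_fsmca(global_score):
    """Classify an FSMCA global score with inverted (competence) bands."""
    for lo, hi, label in FSMCA_BANDS:
        if lo <= round(global_score, 1) <= hi:
            return label
    raise ValueError(f"score {global_score} outside the 0.0-4.0 scale")
```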
7.2 SACA Apnea Domain Separation
The SACA's apnea_risk domain is scored separately from the five core sleep domains and excluded from the global score. It has dedicated clinical flags (apnea_high_suspicion, apnea_screen_further) that route to clinician review. The test suite validates this separation.
7.3 DOMM PHQ-9 Integration
The DOMM includes a phq9_integration block that defines three clinical interpretation scenarios where DOMM scores are cross-referenced with PHQ-9 results. The test suite validates all three scenarios exist and that a PHQ-9 companion instrument trigger is defined in the trigger_rules.
7.4 CRSEM Coaching Safety Rule
The CRSEM defines a coaching_rule that explicitly prohibits action-heavy coaching for low-readiness, high-ambivalence patients. The test suite validates this safety constraint is present in the scoring block.
7.5 WTPTHA Composite Indices
The WTPTHA computes three composite indices (cycling severity, maintenance fragility, biological defense) that aggregate domain scores for clinical decision-making. Each uses the mean method across its constituent items. The test suite validates all three indices and their associated clinical flags.
7.6 Reverse Scoring Validation
Twelve of 19 instruments contain reverse-scored items (59 total across all instruments). The test suite validates three layers of reverse-scoring integrity: the scoring block's reverse_scored_items list references only items that exist, inline reverse_scored markers on individual items match the scoring block list, and reverse_scoring_map entries produce correct symmetric reversals (0→4, 1→3, 2→2, 3→1, 4→0). The scoring simulation confirms that scoring all items at the floor value (0) produces the mathematically correct non-zero domain means for domains containing reverse-scored items.
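The symmetry requirement can be sketched as an involution test over the full scale; the int-to-int map layout is an assumption:

```python
def check_reverse_map(reverse_map, scale_max=4):
    """Verify the reversal map is the symmetric involution v -> scale_max - v.

    Every scale point must be present, map to its mirror, and round-trip
    (reversing a reversed value restores the original).
    """
    for v in range(scale_max + 1):
        if reverse_map.get(v) != scale_max - v:
            return False
        if reverse_map.get(scale_max - v) != v:  # round-trip symmetry
            return False
    return True
```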
Section 8
Results
19
Instruments Tested
745
Total Clinical Items
28
Test Groups
4,087
Total Automated Tests
4,087
Tests Passed
0
Tests Failed
100%
Pass Rate
Zero failures across all 4,087 tests
All 19 instrument definitions pass comprehensive automated testing with zero failures. The JSON structures correctly support scoring, reverse scoring, phenotype classification, cross-instrument trigger logic, clinical flag evaluation, coaching stance selection, PHQ-9 integration triage, apnea escalation, dynamic MI-based prompt generation, composite index computation, profile interpretation, knowledge assessment, and the full assessment cascade from MDOA through all downstream instruments.
Section 8.1
Changes from Test Suite v1.0
Test Suite v1.0 (March 26, 2026) tested 11 instruments with 1,714 tests across 16 test groups. Test Suite v2.0 expands coverage to all 19 instruments with 4,087 tests across 28 test groups.
v1.0 — March 26, 2026
11
Instruments
1,714
Tests
16
Test Groups
v2.0 — March 29, 2026
19
Instruments
4,087
Tests
28
Test Groups
The twelve new test groups include: BLPA profile interpretation, WTPTHA composite indices, highest-section phenotype logic (used by 7 newer instruments), clinical flag evaluation across all flagged instruments, reverse scoring map symmetry validation, SACA apnea domain separation, DOMM PHQ-9 integration, profile mapping validation, and four synthetic patient profile groups that score clinically realistic patient scenarios through the complete scoring-phenotype-trigger pipeline with hand-verified expected outputs.
Section 9
Conclusion
All 19 deployed instrument definitions pass comprehensive automated testing with a 100% pass rate across 4,087 tests. The instrument library is complete as originally scoped during the clinical design phase and is ready for full integration with the assessment engine (assessment_engine.py), the web-form rendering system, and the conversational coaching delivery pipeline.

The definitions are the authoritative representation of each instrument. Word documents produced for the clinical team are abbreviated reference sheets only. Any future instrument modifications must be made in the JSON definitions and re-validated against this test suite before deployment to the live server.
Prepared by
Bert, CTO
Test Suite
test_all_instruments.py v2.0
Execution Date
March 29, 2026
Prepared For
Dr. Michael Lyon, Clinical Director