Factory Scenario Data Modeling Guide

This document is intended for industrial engineers, process technicians, and data analysts. It systematically explains how to identify valuable scenarios, collect data, classify variables, and build effective production models in a factory environment.

1. Scenario Identification and Value Assessment

1.1 What is a "Scenario"?

In industrial data modeling, a scenario refers to a complete production process unit, including:

Clear inputs (raw materials, parameter settings)
Observable process states (temperature, pressure, flow rate, etc.)
Quantifiable output results (product quality, yield, energy consumption, etc.)

Example Scenarios:

Scenario Type	Description	Typical Industry
Reactor Batch Control	Batch reaction process in chemical/pharmaceuticals	Chemical, Pharmaceutical
Fermentation Process Optimization	Temperature, pH, and dissolved oxygen control in microbial fermentation	Food, Biological
Extrusion Molding Process	Extrusion temperature, pressure, and speed control for plastics/rubber	Material Processing
Drying Process Control	Temperature, humidity, and time control in hot air drying	Food, Agricultural Products
Batch Mixing System	Proportioning accuracy control of multi-component raw materials	Food, Feed

1.2 How to Judge the Core Value of a Scenario?

Not all scenarios are worth modeling. Use the following evaluation framework:

Value Assessment Matrix

                        Data Availability
                    High              Low
               ┌───────────────┬───────────────┐
   High        │  Prioritize   │  Strategic    │
   Business    │  Modeling     │  Reserve      │
   Value       │  (Act Now)    │  (Long-term)  │
               ├───────────────┼───────────────┤
   Low         │  Quick POC    │  Shelve       │
   Business    │  (Pilot)      │  (Wait)       │
   Value       │               │               │
               └───────────────┴───────────────┘

Value Assessment Checklist

Business Value Dimension (1-5 points each):

Evaluation Item	Scoring Criteria	Score
Quality Issue Frequency	5=Multiple times a month, 1=Rarely occurs	___
Quality Loss Amount	5=Annual loss > 1M, 1=< 100k	___
Process Optimization Space	5=Obvious room for optimization, 1=Already very mature	___
Replicability	5=Applicable to multiple lines, 1=Specific to a single unit	___
Management Attention	5=Senior management priority, 1=Shop-floor initiative only	___

Data Availability Dimension (1-5 points each):

Evaluation Item	Scoring Criteria	Score
Historical Data Volume	5=>1 year complete data, 1=Almost no data	___
Data Quality	5=Complete and accurate, 1=Massive missing/errors	___
Collection Automation	5=Fully automatic, 1=Fully manual recording	___
Key Variable Measurability	5=All online measurable, 1=Mostly offline testing	___
IT System Support	5=Has MES/SCADA, 1=No IT system	___

Score Interpretation (each dimension totals 5-25 points):

Business Value ≥ 20 AND Data Availability ≥ 20: Prioritize Modeling
Business Value ≥ 20 BUT Data Availability < 20: Improve Data Collection First
Business Value 15-19: Low Priority, Consider When Resources Allow
Business Value < 15: Temporarily Shelve, Seek Higher Value Scenarios

1.3 Scenario Priority Ranking Example

Scenario Assessment for a Food Processing Plant:

Scenario	Business Value	Data Availability	Priority	Action Recommendation
Sterilization Temp Control	25 pts	20 pts	⭐⭐⭐⭐⭐	Start Immediately
Batching Accuracy Opt.	22 pts	18 pts	⭐⭐⭐⭐	Start after adding sensors
Packaging Seal Inspection	15 pts	22 pts	⭐⭐⭐	Low priority, consider when resources are ample
Raw Material Inbound Insp.	12 pts	15 pts	⭐⭐	Temporarily Shelve
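
The scoring rules can be expressed as a small function. The sketch below is illustrative only: the class and function names are hypothetical, and the 15-19 business-value band is handled as low priority, following the Packaging Seal Inspection row in the table above.

```python
from dataclasses import dataclass


@dataclass
class ScenarioScore:
    name: str
    business_value: int      # sum of five 1-5 items, so 5..25
    data_availability: int   # sum of five 1-5 items, so 5..25


def recommend(score: ScenarioScore) -> str:
    """Map the two totals to an action using the thresholds from Section 1.2."""
    if score.business_value < 15:
        return "shelve"
    if score.business_value < 20:
        # 15-19 band: treated as low priority, as in the example table.
        return "low priority"
    if score.data_availability >= 20:
        return "prioritize modeling"
    return "improve data collection first"


plant = [
    ScenarioScore("Sterilization Temp Control", 25, 20),
    ScenarioScore("Batching Accuracy Opt.", 22, 18),
    ScenarioScore("Packaging Seal Inspection", 15, 22),
    ScenarioScore("Raw Material Inbound Insp.", 12, 15),
]
actions = {s.name: recommend(s) for s in plant}
for name, action in actions.items():
    print(f"{name}: {action}")
```

Keeping the thresholds in one function makes it easy to re-rank all scenarios when a score changes, for example after new sensors raise a data availability total.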

2. Data Collection Strategy

2.1 Four Levels of Data Collection

Level 1: Manual Recording
    └── Paper records, Excel manual entry
    └── Suitable for: Initial exploration, no automation system
    └── Disadvantages: Error-prone, low frequency, hard to trace

Level 2: Semi-Automatic Collection
    └── Instrument data export + manual sorting
    └── Suitable for: Key equipment exists but no system integration
    └── Improvement: Establish standardized export templates

Level 3: Automatic Collection
    └── PLC/SCADA automatic recording
    └── Suitable for: Automated control systems exist
    └── Advantages: High frequency, accurate, traceable

Level 4: Integrated Platform
    └── MES/ERP/Data Lake integration
    └── Suitable for: Highly digitalized factories
    └── Advantages: Data correlation, full-link traceability

2.2 Data Collection Planning Template

Create a data collection plan for each scenario:

## Scenario Name: [Fill in]

### 1. Controlled Variables (Y)
| Variable Name | Measurement Method | Frequency | Data Location | Notes |
|-------|---------|---------|---------|------|
| Product Quality Index | Lab testing | Per batch | LIMS | 2-hour testing cycle |
| Product Yield | Auto statistics | Real-time | MES | - |

### 2. Feature Variables (X)
| Variable Name | Variable Type | Measurement Method | Frequency | Data Location |
|-------|---------|---------|---------|---------|
| Reaction Temp | Set/Manipulated Variable | Temp Sensor | 1 min | SCADA |
| Material Batch | Disturbance Variable | Barcode Scan | Per batch | ERP |
| Ambient Temp | Disturbance Variable | Temp/Humidity Meter | 1 hour | Manual |

### 3. Data Collection Cycle
- Historical Data Traceback: [ ] months
- New Data Collection: Starting from [Date]
- Target Sample Size: At least [ ] batches/cycles

### 4. Data Quality Assurance
- [ ] Sensor calibration plan
- [ ] Outlier handling rules
- [ ] Missing value imputation strategy
- [ ] Data review process

2.3 Data Collection Best Practices

DO:

✅ Record complete batch information (time, operator, material batch)
✅ Collect both normal and abnormal condition data
✅ Annotate known abnormal events (equipment failure, material change, etc.)
✅ Maintain timestamp consistency
✅ Regularly backup raw data

DON'T:

❌ Only collect "good" data and discard "bad" data
❌ Have inconsistent timestamps from different sources
❌ Manually transcribe without keeping original records
❌ Use a sampling frequency that is too low (cannot capture dynamics)
❌ Use a sampling frequency that is too high (generates massive redundant data)
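
Timestamp consistency matters most when sources with different sampling rates are joined, for example 1-minute SCADA readings and per-batch lab results. The following is a minimal sketch of an "as-of" join in plain Python. It assumes both sources already use the same clock and time zone; the function name, the 5-minute tolerance, and the sample values are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def asof_join(sensor, lab, max_age=timedelta(minutes=5)):
    """For each lab sample, attach the latest sensor reading taken at or
    before the sample time, provided it is no older than max_age."""
    sensor = sorted(sensor)                      # (timestamp, value) pairs
    times = [t for t, _ in sensor]
    joined = []
    for lab_time, lab_value in sorted(lab):
        i = bisect_right(times, lab_time) - 1    # last reading <= lab_time
        if i >= 0 and lab_time - times[i] <= max_age:
            joined.append((lab_time, sensor[i][1], lab_value))
        else:
            joined.append((lab_time, None, lab_value))  # keep the gap visible
    return joined


start = datetime(2024, 1, 1, 8, 0)
# Hypothetical 1-minute temperature readings and two lab results.
sensor = [(start + timedelta(minutes=m), 80.0 + m) for m in range(10)]
lab = [
    (start + timedelta(minutes=3, seconds=30), 0.95),
    (start + timedelta(minutes=30), 0.97),   # no recent sensor reading
]
rows = asof_join(sensor, lab)
for row in rows:
    print(row)
```

Returning `None` instead of silently reusing a stale reading keeps misaligned records visible, so they can be handled by the missing-value rules rather than contaminating the model.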

3. Variable Classification System

3.1 Core Variable Definitions (Industry Standard)

In the field of industrial modeling and control, we follow these standard variable definitions:

Abbreviation	Full Name	Chinese	Description
SV	Set Value	设定值	Target value set for the manipulated variable, modifiable on DCS
MV	Manipulated Variable	操纵变量	Valves, pumps, etc. that operators / APC can directly adjust
DV	Disturbance Variable	扰动变量	Uncontrollable, unadjustable disturbance factors
CV	Controlled Variable	被控变量	Core target to be controlled and optimized
PV	Process Value	过程测量值	Actual values measured by instruments / sensors

When modeling data, we map these variables to the model's inputs (X) and outputs (Y):

┌────────────────────────────────────────────────────────────────────────┐
│                    Variable Modeling Mapping System                    │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│    ┌────────────────┐      ┌────────────────┐      ┌────────────────┐  │
│    │Set/Manipulated │      │  Disturbance   │      │   Controlled   │  │
│    │    (SV/MV)     │      │      (DV)      │      │      (CV)      │  │
│    └───────┬────────┘      └───────┬────────┘      └───────┬────────┘  │
│            │                       │                       │           │
│            ▼                       ▼                       ▼           │
│  Parameters we can       Uncontrolled factors    Targets we want to    │
│  actively adjust (X)     that affect output (X)  predict/optimize (Y)  │
│                                                                        │
│  Example: reaction temp  Example: ambient temp,  Example: product      │
│  setpoint (SV), valve    material fluctuation    purity (CV), usually  │
│  opening (MV)                                    shown as a PV         │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

3.2 Detailed Variable Types

Set Value/Manipulated Variable (SV/MV)

Definition: Input parameters that operators or control systems can adjust directly. In practice, operators usually change Set Values (SV) and let the control loop drive the actuators, while Manipulated Variables (MV) are typically the automatic outputs of the underlying PID loops or control system.

Characteristics:

Can be actively changed (mainly modifying SV)
Usually have clear operation ranges
Main focus for process optimization

Common Examples:

Industry	Set/Manipulated Variable Examples
Chemical	Reaction temperature, pressure, stirring speed, catalyst dosage
Food	Sterilization temp, holding time, ingredient ratio, drying wind speed
Pharmaceutical	Heating rate, holding time, cooling rate, pH setpoint
Metallurgy	Heating power, cooling water flow, rolling speed

Role in Modeling:

As the core component of X variables
Key focus of VIP (Variable Importance in Projection) analysis
Direct operation targets for process optimization

Disturbance Variables (DV)

Definition: Variables that affect process output but cannot (or are difficult to) be actively controlled.

Characteristics:

Exist independently of the operator and are hard to intervene in
May change over time
Factors to consider for model robustness

Common Examples:

Type	Disturbance Variable Examples	Coping Strategy
Material	Batch differences, moisture fluctuation, impurity content	Inbound inspection, feedforward control
Environment	Ambient temp, humidity, atmospheric pressure	Environmental compensation, air-conditioning control
Equipment	Equipment wear, catalyst decay, heat exchanger fouling	Regular maintenance, online correction
Operation	Operator differences, shift handover impact	SOP standardization, training

Role in Modeling:

As a supplement to X variables
Help explain model residuals
Identify sources of "uncontrollable" variation

Controlled Variable (CV)

Definition: Process outputs or quality indicators that we want to control within target ranges.

Characteristics:

Results of the process
Usually have clear quality standards
Targets for model prediction (Y)

Common Examples:

Industry	Controlled Variable Examples
Chemical	Product purity, conversion rate, selectivity, byproduct content
Food	Moisture content, color, taste score, microbial indicators
Pharmaceutical	Active ingredient content, dissolution rate, impurity profile
Material	Tensile strength, hardness, surface finish

Role in Modeling:

As Y variables (Controlled variables)
Objects for model prediction and optimization
Core indicators for evaluating model performance

Process Value (PV)

Definition: Process values actually measured by instruments or sensors.

Characteristics:

True reflection of physical or chemical states
The basis for calculating or evaluating CV
May contain measurement noise or errors

Common Examples:

Industry	Process Value Examples
Chemical	Actual temperature measured by thermocouple, flow meter reading
Food	Online moisture meter reading, actual pH value
Pharmaceutical	Stirring motor current fed back by sensor

Role in Modeling:

Used to characterize controlled variables (Y)
Feed back the current system state for optimization and control

3.3 Variable Classification Decision Tree

3.4 Variable Classification Example

Scenario: Chemical Reactor Batch Control

Variable Name	Variable Type	Classification Reason	Modeling Role
Reaction Temp Setpoint	Set/Manipulated Variable	Adjustable via DCS	X
Reaction Pressure Setpoint	Set/Manipulated Variable	Adjustable via valves	X
Actual Reaction Pressure	Process Value	Pressure sensor feedback	X
Stirring Speed	Set/Manipulated Variable	Inverter control	X
Catalyst Dosage	Set/Manipulated Variable	Determined at batching	X
Material Batch	Disturbance Variable	Determined by procurement	X
Material Moisture	Disturbance Variable	Natural fluctuation	X
Ambient Temp	Disturbance Variable	Uncontrollable	X
Product Conversion Rate	Controlled Variable	Process result	Y
Product Selectivity	Controlled Variable	Quality indicator	Y
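
The classification in the table above can be sketched as a short sequence of yes/no questions. This is one possible ordering, built from the definitions in Section 3.2; the function name and its three boolean inputs are hypothetical, and borderline cases still need process knowledge.

```python
from enum import Enum


class VarType(Enum):
    SV_MV = "Set/Manipulated Variable"
    DV = "Disturbance Variable"
    CV = "Controlled Variable"
    PV = "Process Value"


def classify(is_target: bool, is_adjustable: bool, is_sensor_feedback: bool) -> VarType:
    """Ask the questions in order; the first 'yes' decides the type."""
    if is_target:               # a result we want to predict or optimize
        return VarType.CV
    if is_adjustable:           # operators or the control system can set it
        return VarType.SV_MV
    if is_sensor_feedback:      # measured state of an adjustable quantity
        return VarType.PV
    return VarType.DV           # affects the process but cannot be adjusted


def modeling_role(var_type: VarType) -> str:
    """CVs are model outputs; everything else enters as an input here."""
    return "Y" if var_type is VarType.CV else "X"


examples = {
    "Reaction Temp Setpoint": classify(False, True, False),
    "Actual Reaction Pressure": classify(False, False, True),
    "Ambient Temp": classify(False, False, False),
    "Product Conversion Rate": classify(True, False, False),
}
for name, var_type in examples.items():
    print(f"{name}: {var_type.value} -> {modeling_role(var_type)}")
```

A Process Value that represents the target itself would be passed with `is_target=True` and therefore land in Y, which matches the "X or Y" role given to PVs elsewhere in this guide.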

4. Scenario Modeling Practice

4.1 Pre-Modeling Preparation

Data Preparation Checklist

## Pre-Modeling Checklist

### Data Integrity
- [ ] Sample size ≥ 30 (rule-of-thumb minimum for PLS)
- [ ] Number of X variables < Sample size/2 (reduces overfitting risk)
- [ ] No severe missing values (<10%)
- [ ] Timestamps correctly aligned

### Variable Confirmation
- [ ] Set/Manipulated variables (SV/MV) identified and marked
- [ ] Disturbance variables identified and marked
- [ ] Process values (PV) identified and evaluated
- [ ] Controlled variables (CV) clarified
- [ ] Variable units unified

### Business Understanding
- [ ] Understand normal operating ranges
- [ ] Understand common abnormal patterns
- [ ] Clarify modeling goals (Prediction/Optimization/Monitoring)
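
The Data Integrity items of this checklist can be automated before data is imported. The sketch below assumes the data is held as a list of dictionaries with `None` for missing values; the function name and column names are hypothetical, and timestamp alignment is not checked here.

```python
def premodeling_issues(rows, x_names, min_samples=30, max_missing=0.10):
    """Return the Data Integrity checklist items that the dataset fails.

    rows: list of dicts (one per batch); None marks a missing value.
    """
    n = len(rows)
    issues = []
    if n < min_samples:
        issues.append(f"sample size {n} < {min_samples}")
    if len(x_names) >= n / 2:
        issues.append(f"{len(x_names)} X variables is not below sample size/2")
    for name in x_names:
        missing = sum(r.get(name) is None for r in rows) / n if n else 1.0
        if missing >= max_missing:
            issues.append(f"{name}: {missing:.0%} missing")
    return issues


x_names = ["temp_sv", "pressure_sv", "ambient_temp"]   # hypothetical columns
rows = [{"temp_sv": 80.0, "pressure_sv": 2.1, "ambient_temp": 22.0}
        for _ in range(40)]
for r in rows[:10]:
    r["ambient_temp"] = None        # 10 of 40 values missing
issues = premodeling_issues(rows, x_names)
print(issues)
```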

4.2 Modeling Workflow

4.3 Step-by-Step Modeling Guide

Step 1: Data Import and Configuration

Import Data: Import the prepared Excel data into the platform
Set Header Row: Mark the variable name row
Configure X Variables:
- Select all Set/Manipulated variables (SV/MV)
- Select important Disturbance variables (DV)
- (Optional) If concerned with process states, introduce Process Values (PV)
Configure Y Variables: Select Controlled variables (CV)

Step 2: Exploratory Analysis (PCA)

Purpose: Understand data structure, identify abnormal samples

Operations:

Create a PCA model using only X variables
View the Score Plot
Identify outliers far from the main cluster
View T² and SPE plots, mark statistical anomalies

Interpretation:

Normal batches should cluster in the core area of the principal component space
Points far from the cluster need investigation for causes
Combine business knowledge to decide whether to exclude
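
For readers who want to see what lies behind the Score, T² and SPE plots, the following sketch computes the three quantities with NumPy. It is not the platform's implementation: the data is synthetic, the function name is hypothetical, and no control limits are calculated.

```python
import numpy as np


def pca_monitoring_stats(X, n_components):
    """Return (scores, T2, SPE) per row of X for a PCA model fitted on X."""
    X = np.asarray(X, dtype=float)
    # Autoscale: centre each column and scale it to unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the scaled data; rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                        # loadings, shape (p, A)
    T = Z @ P                                      # scores, shape (n, A)
    score_var = s[:n_components] ** 2 / (len(Z) - 1)
    t2 = np.sum(T ** 2 / score_var, axis=1)        # Hotelling's T² per sample
    residual = Z - T @ P.T                         # part the model leaves out
    spe = np.sum(residual ** 2, axis=1)            # squared prediction error
    return T, t2, spe


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))       # synthetic "batches", not real plant data
X[5] = [10.0, 10.0, 10.0, 10.0]    # one deliberately abnormal batch
scores, t2, spe = pca_monitoring_stats(X, n_components=2)
suspect = int(np.argmax(t2))
print("Batch with the largest T²:", suspect)
```

T² measures how far a sample lies from the centre within the model plane, while SPE measures how far it lies off that plane, so a batch can be unusual in one statistic and normal in the other.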

Step 3: Build PLS Regression Model

Operations:

Configure X (Set/Manipulated + Disturbance) and Y (Controlled)
Click "Fit" to train the model
View model indicators:
- R²Y: Goodness of fit (share of Y variance explained on the training data)
- Q²Y: Predictive ability estimated by cross-validation (>0.5 acceptable, >0.9 excellent)

Diagnostics:

If Q²Y < 0.5: Check variable selection, increase sample size
If R²Y is high but Q²Y is low: Overfitting, reduce the number of latent variables
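
To make the two indicators concrete, here is a minimal sketch of a single-response PLS fit (NIPALS) with R²Y computed on the training data and Q²Y from cross-validation. The platform's own calculation may differ; the function names, the interleaved 7-fold split, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit PLS for one Y variable with NIPALS; returns what predict needs."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    y_mean = y.mean()
    E = (X - x_mean) / x_std          # autoscaled X, deflated in the loop
    f = y - y_mean                    # centred y, deflated in the loop
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)        # unit-length weight vector
        t = E @ w                     # scores for this latent variable
        tt = t @ t
        p = E.T @ t / tt              # X loadings
        q_a = f @ t / tt              # y loading
        E = E - np.outer(t, p)
        f = f - q_a * t
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector, scaled X
    return x_mean, x_std, y_mean, coef


def pls1_predict(model, X):
    x_mean, x_std, y_mean, coef = model
    return ((X - x_mean) / x_std) @ coef + y_mean


def r2y_q2y(X, y, n_components, n_folds=7):
    ss_total = np.sum((y - y.mean()) ** 2)
    fitted = pls1_predict(pls1_fit(X, y, n_components), X)
    r2y = 1 - np.sum((y - fitted) ** 2) / ss_total
    press = 0.0                                # prediction error sum of squares
    folds = np.arange(len(y)) % n_folds        # simple interleaved folds
    for k in range(n_folds):
        test = folds == k
        model = pls1_fit(X[~test], y[~test], n_components)
        press += np.sum((y[test] - pls1_predict(model, X[test])) ** 2)
    return r2y, 1 - press / ss_total


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))                   # synthetic data only
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.2 * rng.normal(size=40)
r2y, q2y = r2y_q2y(X, y, n_components=2)
print(f"R2Y = {r2y:.2f}, Q2Y = {q2y:.2f}")
```

Because Q²Y is computed only on samples the model has not seen, adding latent variables can keep raising R²Y while Q²Y stalls or falls, which is the overfitting pattern described in the diagnostics above.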

Step 4: VIP Analysis and Variable Selection

Purpose: Find the X variables that have the greatest impact on Y

Operations:

View the VIP plot
Identify key variables with VIP > 1
Consider excluding variables with VIP < 0.5
Remodel and validate

Business Interpretation:

Set/Manipulated variables with high VIP are the focus of process optimization
Disturbance variables with high VIP require enhanced monitoring
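
As a rough illustration of where VIP values come from, the sketch below applies the commonly used VIP formula for a single-response PLS model. The function name is hypothetical and the weights, scores and loadings are made-up numbers, not output from any real model.

```python
import numpy as np


def vip_scores(W, T, q):
    """VIP per X variable for a single-response PLS model.

    W: (p, A) X weights, T: (n, A) scores, q: (A,) y loadings.
    """
    W, T, q = (np.asarray(a, dtype=float) for a in (W, T, q))
    p = W.shape[0]
    ssy = q ** 2 * np.sum(T ** 2, axis=0)             # Y variance per component
    w_norm_sq = (W / np.linalg.norm(W, axis=0)) ** 2  # normalised, squared weights
    return np.sqrt(p * (w_norm_sq @ ssy) / ssy.sum())


# Made-up model with 4 variables and 2 latent variables.
W = [[0.8, 0.1], [0.5, -0.2], [0.3, 0.9], [0.1, 0.4]]
T = [[2, 1], [-1, -1], [1, 0], [-2, 1], [0, -1]]
q = [1.2, 0.3]
vip = vip_scores(W, T, q)
for name, value in zip(["x1", "x2", "x3", "x4"], vip):
    print(f"{name}: VIP = {value:.2f}")
```

The squared VIP values always sum to the number of X variables, so their average is 1. That is why VIP > 1 is read as "more important than average".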

Step 5: Model Validation

Internal Validation:

Cross-validation Q²Y
Check residual distribution

External Validation (If conditions permit):

Test with newly collected data
Compare predicted values with actual values

4.4 Typical Scenario Modeling Case

Case: Fermentation Process Optimization

Scenario Description:

Product: A certain amino acid fermentation
Goal: Increase product concentration (Y)
Cycle: 48-hour batch

Variable Classification:

Type (Algorithm)	Control System Term	Variable Examples	Description
Set/Manipulated Variables (X)	SV / MV (Set Value/Manipulated Variable)	Temperature setpoint(SV), pH setpoint(SV), stirring speed(MV), aeration rate(MV)	Parameters actively adjusted by operators or APC in DCS (Usually modifying SV)
Disturbance Variables (X)	DV (Disturbance Variable)	Seed batch, medium batch, ambient temperature	External factors that affect the system but cannot be controlled by operators
Controlled Variables (Y)	CV (Controlled Variable)	Product concentration, conversion rate	Actual detection results from offline testing or online instruments
Process Values (X or Y)	PV (Process Value), often the measured form of a CV	Actual temperature (PV), actual pH (PV)	Process states fed back by sensors

Modeling Results:

R²Y = 0.92, Q²Y = 0.85
Key variables with VIP > 1: pH setpoint, aeration rate, temperature setpoint
Finding: pH control accuracy has the greatest impact on product concentration

Optimization Suggestions:

Upgrade the pH control system to improve control accuracy
Establish a pH feedforward compensation model
Expected to increase product concentration by 8-12%

5. Tool-Assisted Variable Selection

5.1 Variable Selection Toolbox

In the platform, you can use the following tools to assist in selecting variables:

Tool 1: Correlation Analysis

Purpose: Identify collinearity among X variables

Operations:

Use a heatmap to view correlations between X variables
Identify highly correlated variable pairs with |r| > 0.8
Keep one of them and exclude the redundant variable

Example:

If "Reaction Temp" and "Reactor Wall Temp" have a correlation of 0.95
→ Only keep "Reaction Temp" (more directly controllable)
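
The same screening can be scripted outside the platform. The sketch below is a simple greedy filter, assuming the columns are listed in order of preference (most directly controllable first); the function names and data values are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def drop_collinear(columns, threshold=0.8):
    """Keep a column only if |r| <= threshold with every column kept so far.

    columns: dict of name -> values, ordered from most to least preferred.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


columns = {
    "Reaction Temp": [80.0, 82.0, 85.0, 83.0, 88.0, 90.0],
    "Reactor Wall Temp": [78.5, 80.2, 83.4, 81.1, 86.3, 88.0],
    "Stirring Speed": [300.0, 310.0, 295.0, 305.0, 290.0, 315.0],
}
kept = drop_collinear(columns)
print(kept)
```

Ordering the columns by preference is what encodes the business rule: when two variables carry the same information, the one listed first survives.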

Tool 2: PCA Loading Analysis

Purpose: Understand the internal structure among variables

Operations:

View the PCA Loading Plot
Identify groups of variables clustered together (representing similar information)
Choose the most representative variable from each group

Tool 3: VIP Iterative Selection

Purpose: Gradually optimize the variable set

Workflow:

Round 1: All variables → Calculate VIP
Round 2: Exclude variables with VIP<0.5 → Remodel
Round 3: Check Q²Y change
      ↓
   If Q²Y drops <5%: Accept simplified model
   If Q²Y drops >10%: Restore some excluded variables
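
The acceptance rule in Round 3 can be written as a small helper. This sketch assumes the percentages refer to the relative drop in Q²Y; the source does not say what to do between 5% and 10%, so that band is returned as a case for manual review. The function name is hypothetical.

```python
def review_simplified_model(q2_full: float, q2_reduced: float) -> str:
    """Compare Q2Y before and after excluding low-VIP variables."""
    if q2_full <= 0:
        raise ValueError("the full model needs a positive Q2Y to compare against")
    drop = (q2_full - q2_reduced) / q2_full     # relative drop (assumption)
    if drop < 0.05:
        return "accept simplified model"
    if drop > 0.10:
        return "restore some excluded variables"
    return "review manually"                    # band not covered by the rule


print(review_simplified_model(0.85, 0.83))
print(review_simplified_model(0.85, 0.70))
print(review_simplified_model(0.80, 0.74))
```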

Tool 4: Variable Importance Ranking Table

Comprehensive Evaluation Framework:

Variable	VIP	Controllability	Measurement Cost	Comp. Score	Suggestion
Temp	1.8	High	Low	⭐⭐⭐⭐⭐	Keep
Pressure	1.5	High	Low	⭐⭐⭐⭐⭐	Keep
Material Batch	0.3	Low	Medium	⭐⭐	Exclude
Ambient Humidity	0.4	Low	High	⭐	Exclude

5.2 Variable Selection Decision Process

5.3 Variable Selection Best Practices

DO:

✅ Prioritize retaining Set/Manipulated variables (optimizable)
✅ Retain variables with high VIP and easy measurability
✅ Retain variables that are "important by common sense" in business
✅ Use cross-validation to test the simplified model

DON'T:

❌ Only look at VIP and completely ignore business knowledge
❌ Exclude too many variables at once
❌ Exclude low-VIP variables that cost little to keep measuring
❌ Over-screen when the sample size is very small

6. FAQs and Best Practices

6.1 Frequently Asked Questions

Q1: What if the sample size is insufficient?

A:
Minimum requirement: Sample size > Number of X variables
Ideal situation: Sample size ≥ 3 × Number of X variables
If insufficient:
- Reduce X variables (prioritize excluding those with low VIP)
- Extend the data collection cycle
- Consider using PCA for dimensionality reduction first

Q2: How to handle missing values?

A:
Missing <5%: Impute with mean/median
Missing 5-20%: Impute with interpolation or regression prediction
Missing >20%: Consider excluding the variable or sample
The platform supports multiple missing value handling strategies
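
Outside the platform, the same thresholds can be applied with a few lines of code. The sketch below picks a strategy from the share of missing values and shows simple linear interpolation; the function names are hypothetical, `None` marks a missing value, and gaps at the start or end of a series are deliberately left unfilled.

```python
def missing_strategy(values):
    """Pick a handling strategy from the share of missing (None) entries."""
    share = sum(v is None for v in values) / len(values)
    if share < 0.05:
        return "impute with mean/median"
    if share <= 0.20:
        return "interpolate or predict by regression"
    return "consider excluding"


def interpolate_linear(values):
    """Fill interior gaps on a straight line between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


series = [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
strategy = missing_strategy(series)     # 1 of 10 missing
filled = interpolate_linear(series)
print(strategy, filled)
```

Interpolation only makes sense for values ordered in time within one batch or run; for per-batch variables with no natural ordering, mean/median or regression-based imputation is the safer choice.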

Q3: What if the boundary between Set/Manipulated variables and Disturbance variables is blurred?

A:
Judgment criterion: Can it be actively adjusted under current technology/cost conditions?
Example: Ambient temperature is theoretically controllable (air conditioning), but the cost is too high → treated as a disturbance
Both are X in the model; the difference lies only in the optimization strategy

Q4: What if the model performs poorly on new data?

A:
Check if the new data is within the range of the training data (extrapolation risk)
Check if new disturbance factors have emerged
Consider model updates (incremental learning or retraining)
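
The first check, whether new data falls inside the training range, is easy to automate. The sketch below compares each variable of a new sample with the minimum and maximum seen during training; the function names and values are hypothetical. A per-variable range check is only a first screen, since a sample can sit inside every individual range and still be an unusual combination.

```python
def training_ranges(rows):
    """Min and max per variable over the training rows (list of dicts)."""
    return {
        name: (min(r[name] for r in rows), max(r[name] for r in rows))
        for name in rows[0]
    }


def extrapolated_variables(sample, ranges):
    """Names of variables whose value lies outside the training range."""
    return [
        name for name, (low, high) in ranges.items()
        if not low <= sample[name] <= high
    ]


training = [
    {"temp_sv": 78.0, "ph_sv": 6.8},
    {"temp_sv": 82.0, "ph_sv": 7.0},
    {"temp_sv": 85.0, "ph_sv": 7.2},
]
ranges = training_ranges(training)
flagged = extrapolated_variables({"temp_sv": 90.0, "ph_sv": 7.1}, ranges)
print(flagged)
```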

Q5: How to report modeling results to management?

A:
Avoid technical jargon, focus on business value
Use specific numbers: "Expected to increase yield by X% after optimization"
Display visualizations: Score plot, VIP plot
Provide clear action recommendations

6.2 Modeling Success Checklist

## Project Delivery Checklist

### Model Quality
- [ ] Q²Y > 0.5 (Minimum threshold)
- [ ] R²Y - Q²Y < 0.2 (Avoid overfitting)
- [ ] No obvious patterns in residuals
- [ ] VIP of key variables > 1

### Business Validation
- [ ] Key variables conform to process common sense
- [ ] Abnormal samples have reasonable explanations
- [ ] Model prediction error is within an acceptable range
- [ ] Validated with at least one independent batch of data

### Document Completeness
- [ ] Variable classification list
- [ ] Data collection method description
- [ ] Model performance report
- [ ] Application suggestions and risk warnings
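
The two numeric Model Quality thresholds lend themselves to an automatic gate before delivery. The following sketch uses a hypothetical function name and checks only Q²Y and the R²Y/Q²Y gap; residual patterns and VIP values still need to be reviewed by a person.

```python
def model_quality_issues(r2y: float, q2y: float) -> list:
    """Return the numeric Model Quality checks that fail."""
    issues = []
    if q2y <= 0.5:
        issues.append("Q2Y is not above 0.5")
    if r2y - q2y >= 0.2:
        issues.append("R2Y - Q2Y is not below 0.2 (possible overfitting)")
    return issues


# Values reported for the fermentation case in Section 4.4.
case_issues = model_quality_issues(r2y=0.92, q2y=0.85)
# A made-up overfitted model for comparison.
overfit_issues = model_quality_issues(r2y=0.95, q2y=0.60)
print(case_issues)
print(overfit_issues)
```

With R²Y = 0.92 and Q²Y = 0.85, the fermentation case has a gap of 0.07 and passes both checks.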

6.3 Continuous Improvement Suggestions

Model Lifecycle Management:

Months 1-2: Model Building and Validation
    └── Collect data, build initial model
    └── Internal validation, parameter tuning

Months 3-6: Trial Run and Optimization
    └── Small-scale trial
    └── Collect feedback, correct issues

Months 7-12: Official Deployment
    └── Full application
    └── Establish monitoring mechanism

After 12 Months: Regular Maintenance
    └── Evaluate model performance quarterly
    └── Data drift detection
    └── Retrain when necessary

Appendix: Quick Reference Cards

Variable Classification Quick Reference

Question	Set/Manipulated Variable (SV/MV)	Disturbance Variable (DV)	Controlled Variable (CV)	Process Value (PV)
Actively adjustable?	✅ Yes (mainly modifying SV)	❌ No	N/A (It's a result)	N/A (It's a measurement)
Role in model	X	X	Y	X/Y
Optimization value	High (direct operation)	Medium (monitoring and early warning)	Target	State feedback
Example	Temperature setpoint	Ambient temperature	Product purity	Actual temp reading

Model Selection Quick Reference

Scenario	Recommended Model	Key Indicators
Only X, explore structure	PCA	R²X, Score Plot
X→Y Prediction (Continuous)	PLS	R²Y, Q²Y, VIP
X→Y Classification (Discrete)	PLS-DA	Accuracy, F1, AUC

VIP Interpretation Quick Reference

VIP Value	Importance	Suggestion
> 1.5	Very Important	Focus
1.0-1.5	Important	Keep
0.5-1.0	General	Can keep
< 0.5	Unimportant	Consider excluding

This document is a companion guide for the Data Insight Platform, combining actual industrial scenarios to help users systematically conduct data modeling work.
