Factory Scene Data Modeling Guide
This document is intended for industrial engineers, process technicians, and data analysts. It systematically explains how to identify valuable scenarios, collect data, classify variables, and build effective production models in a factory environment.
1. Scenario Identification and Value Assessment
1.1 What is a "Scenario"?
In industrial data modeling, a scenario refers to a complete production process unit, including:
- Clear inputs (raw materials, parameter settings)
- Observable process states (temperature, pressure, flow rate, etc.)
- Quantifiable output results (product quality, yield, energy consumption, etc.)
Example Scenarios:
| Scenario Type | Description | Typical Industry |
|---|---|---|
| Reactor Batch Control | Batch reaction process in chemical/pharmaceuticals | Chemical, Pharmaceutical |
| Fermentation Process Optimization | Temperature, pH, and dissolved oxygen control in microbial fermentation | Food, Biological |
| Extrusion Molding Process | Extrusion temperature, pressure, and speed control for plastics/rubber | Material Processing |
| Drying Process Control | Temperature, humidity, and time control in hot air drying | Food, Agricultural Products |
| Batch Mixing System | Proportioning accuracy control of multi-component raw materials | Food, Feed |
1.2 How to Judge the Core Value of a Scenario?
Not all scenarios are worth modeling. Use the following evaluation framework:
Value Assessment Matrix
| | High Data Availability | Low Data Availability |
|---|---|---|
| High Business Value | Prioritize Modeling (Act Now) | Strategic Reserve (Long-term) |
| Low Business Value | Quick POC (Pilot) | Shelve (Wait) |
Value Assessment Checklist
Business Value Dimension (1-5 points each):
| Evaluation Item | Scoring Criteria | Score |
|---|---|---|
| Quality Issue Frequency | 5=Multiple times a month, 1=Rarely occurs | ___ |
| Quality Loss Amount | 5=Annual loss > 1M, 1=< 100k | ___ |
| Process Optimization Space | 5=Obvious room for optimization, 1=Already very mature | ___ |
| Replicability | 5=Applicable to multiple lines, 1=Specific to a single line | ___ |
| Management Attention | 5=Senior management focus, 1=Shop-floor initiative only | ___ |
Data Availability Dimension (1-5 points each):
| Evaluation Item | Scoring Criteria | Score |
|---|---|---|
| Historical Data Volume | 5=>1 year complete data, 1=Almost no data | ___ |
| Data Quality | 5=Complete and accurate, 1=Massive missing/errors | ___ |
| Collection Automation | 5=Fully automatic, 1=Fully manual recording | ___ |
| Key Variable Measurability | 5=All online measurable, 1=Mostly offline testing | ___ |
| IT System Support | 5=Has MES/SCADA, 1=No IT system | ___ |
Score Interpretation:
- Business Value ≥ 20 AND Data Availability ≥ 20: Prioritize Modeling
- Business Value ≥ 20 BUT Data Availability < 20: Improve Data Collection First
- Business Value < 15: Temporarily Shelve, Seek Higher Value Scenarios
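The thresholds above can be applied mechanically once the ten items are scored. The following is a minimal sketch, not part of the platform: the function name `interpret_scores` is hypothetical, and the "Case-by-case review" result for business value between 15 and 19 is an assumption, because the guide gives no rule for that band.

```python
from typing import Sequence


def interpret_scores(business: Sequence[int], data: Sequence[int]) -> str:
    """Sum five 1-5 item scores per dimension and apply the guide's thresholds."""
    for scores in (business, data):
        if len(scores) != 5 or any(not 1 <= s <= 5 for s in scores):
            raise ValueError("each dimension needs five scores between 1 and 5")
    value, availability = sum(business), sum(data)
    if value >= 20 and availability >= 20:
        return "Prioritize Modeling"
    if value >= 20:
        return "Improve Data Collection First"
    if value < 15:
        return "Temporarily Shelve"
    # Assumption: the guide defines no rule for business value 15-19.
    return "Case-by-case review"


# Hypothetical item scores that add up to 25 and 20 points.
print(interpret_scores([5, 5, 5, 5, 5], [4, 4, 4, 4, 4]))
```

Keeping the undefined band as an explicit result makes the gap visible instead of silently assigning such scenarios to one of the documented actions.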
1.3 Scenario Priority Ranking Example
Scenario Assessment for a Food Processing Plant:
| Scenario | Business Value | Data Availability | Priority | Action Recommendation |
|---|---|---|---|---|
| Sterilization Temp Control | 25 pts | 20 pts | ⭐⭐⭐⭐⭐ | Start Immediately |
| Batching Accuracy Opt. | 22 pts | 18 pts | ⭐⭐⭐⭐ | Start after adding sensors |
| Packaging Seal Inspection | 15 pts | 22 pts | ⭐⭐⭐ | Low priority, consider when resources are ample |
| Raw Material Inbound Insp. | 12 pts | 15 pts | ⭐⭐ | Temporarily Shelve |
2. Data Collection Strategy
2.1 Four Levels of Data Collection
Level 1: Manual Recording
├── Paper records, Excel manual entry
├── Suitable for: Initial exploration, no automation system
└── Disadvantages: Error-prone, low frequency, hard to trace
Level 2: Semi-Automatic Collection
├── Instrument data export + manual sorting
├── Suitable for: Key equipment exists but no system integration
└── Improvement: Establish standardized export templates
Level 3: Automatic Collection
├── PLC/SCADA automatic recording
├── Suitable for: Automated control systems exist
└── Advantages: High frequency, accurate, traceable
Level 4: Integrated Platform
├── MES/ERP/Data Lake integration
├── Suitable for: Highly digitalized factories
└── Advantages: Data correlation, full-link traceability
2.2 Data Collection Planning Template
Create a data collection plan for each scenario:
## Scenario Name: [Fill in]
### 1. Controlled Variables (Y)
| Variable Name | Measurement Method | Frequency | Data Location | Notes |
|-------|---------|---------|---------|------|
| Product Quality Index | Lab testing | Per batch | LIMS | 2-hour testing cycle |
| Product Yield | Auto statistics | Real-time | MES | - |
### 2. Feature Variables (X)
| Variable Name | Variable Type | Measurement Method | Frequency | Data Location |
|-------|---------|---------|---------|---------|
| Reaction Temp | Set/Manipulated Variable | Temp Sensor | 1 min | SCADA |
| Material Batch | Disturbance Variable | Barcode Scan | Per batch | ERP |
| Ambient Temp | Disturbance Variable | Temp/Humidity Meter | 1 hour | Manual |
### 3. Data Collection Cycle
- Historical Data Traceback: [ ] months
- New Data Collection: Starting from [Date]
- Target Sample Size: At least [ ] batches/cycles
### 4. Data Quality Assurance
- [ ] Sensor calibration plan
- [ ] Outlier handling rules
- [ ] Missing value imputation strategy
- [ ] Data review process
2.3 Data Collection Best Practices
DO:
- ✅ Record complete batch information (time, operator, material batch)
- ✅ Collect both normal and abnormal condition data
- ✅ Annotate known abnormal events (equipment failure, material change, etc.)
- ✅ Maintain timestamp consistency
- ✅ Regularly back up raw data
DON'T:
- ❌ Only collect "good" data and discard "bad" data
- ❌ Have inconsistent timestamps from different sources
- ❌ Manually transcribe without keeping original records
- ❌ Use a sampling frequency that is too low (cannot capture dynamics)
- ❌ Use a sampling frequency that is too high (generates massive redundant data)
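Timestamp consistency across sources is the item most easily checked in code. The sketch below assumes every source exports ISO 8601 timestamps with an explicit UTC offset; the helper names `to_utc` and `latest_reading_before`, the 5-minute staleness limit, and the sample readings are all hypothetical. It converts everything to UTC and then attaches the most recent sensor reading to a lab sample time.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone


def to_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp without offset: {ts}")
    return parsed.astimezone(timezone.utc)


def latest_reading_before(readings, when, max_age=timedelta(minutes=5)):
    """Return the last sensor value at or before `when`, or None if it is too old.

    `readings` is a list of (utc_time, value) tuples sorted by time.
    """
    times = [t for t, _ in readings]
    i = bisect_right(times, when)
    if i == 0 or when - times[i - 1] > max_age:
        return None
    return readings[i - 1][1]


# Hypothetical 1-minute SCADA readings logged in local time (UTC+8).
scada = [
    (to_utc("2024-01-01T08:00:00+08:00"), 80.1),
    (to_utc("2024-01-01T08:01:00+08:00"), 80.4),
    (to_utc("2024-01-01T08:02:00+08:00"), 80.9),
]
# Hypothetical lab sample time recorded in UTC by another system.
lab_time = to_utc("2024-01-01T00:01:30+00:00")
print(latest_reading_before(scada, lab_time))
```

Rejecting timestamps without an offset is a deliberate choice here: a silent guess about the time zone is exactly the kind of inconsistency the list warns against.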
3. Variable Classification System
3.1 Core Variable Definitions (Industry Standard)
In the field of industrial modeling and control, we follow these standard variable definitions:
| Abbreviation | Full Name | Chinese | Description |
|---|---|---|---|
| SV | Set Value | 设定值 | Target value set for the manipulated variable, modifiable on DCS |
| MV | Manipulated Variable | 操纵变量 | Valves, pumps, etc. that operators / APC can directly adjust |
| DV | Disturbance Variable | 扰动变量 | Uncontrollable, unadjustable disturbance factors |
| CV | Controlled Variable | 被控变量 | Core target to be controlled and optimized |
| PV | Process Value | 过程测量值 | Actual values measured by instruments / sensors |
When modeling data, we map these variables to the model's inputs (X) and outputs (Y):
Variable Modeling Mapping System
| Variable Group | Meaning | Modeling Role | Examples |
|---|---|---|---|
| Set/Manipulated (SV/MV) | Parameters we can actively adjust | X | Reaction temp setpoint (SV), valve opening (MV) |
| Disturbance (DV) | Variables we cannot control but that affect the process | X | Ambient temperature, material fluctuation |
| Controlled (CV) | Targets we want to predict/optimize | Y | Product purity (CV), usually shown as PV |
3.2 Detailed Variable Types
Set Value/Manipulated Variable (SV/MV)
Definition: Input parameters that operators or control systems can directly adjust. In actual industrial production, in most cases, operators adjust Set Values (SV) to indirectly control actuators, while Manipulated Variables (MV) are typically automatic outputs from low-level PID or control systems.
Characteristics:
- Can be actively changed (mainly modifying SV)
- Usually have clear operation ranges
- Main focus for process optimization
Common Examples:
| Industry | Set/Manipulated Variable Examples |
|---|---|
| Chemical | Reaction temperature, pressure, stirring speed, catalyst dosage |
| Food | Sterilization temp, holding time, ingredient ratio, drying wind speed |
| Pharmaceutical | Heating rate, holding time, cooling rate, pH setpoint |
| Metallurgy | Heating power, cooling water flow, rolling speed |
Role in Modeling:
- As the core component of X variables
- Key focus of VIP (Variable Importance in Projection) analysis
- Direct operation targets for process optimization
Disturbance Variables (DV)
Definition: Variables that affect process output but cannot (or are difficult to) be actively controlled.
Characteristics:
- Exist objectively and are hard to influence deliberately
- May change over time
- Factors to consider for model robustness
Common Examples:
| Type | Disturbance Variable Examples | Coping Strategy |
|---|---|---|
| Material | Batch differences, moisture fluctuation, impurity content | Inbound inspection, feedforward control |
| Environment | Ambient temp, humidity, atmospheric pressure | Environmental compensation, AC control |
| Equipment | Equipment wear, catalyst decay, heat exchanger fouling | Regular maintenance, online correction |
| Operation | Operator differences, shift handover impact | SOP standardization, training |
Role in Modeling:
- As a supplement to X variables
- Help explain model residuals
- Identify sources of "uncontrollable" variation
Controlled Variable (CV)
Definition: Process outputs or quality indicators that we want to control within target ranges.
Characteristics:
- Results of the process
- Usually have clear quality standards
- Targets for model prediction (Y)
Common Examples:
| Industry | Controlled Variable Examples |
|---|---|
| Chemical | Product purity, conversion rate, selectivity, byproduct content |
| Food | Moisture content, color, taste score, microbial indicators |
| Pharmaceutical | Active ingredient content, dissolution rate, impurity profile |
| Material | Tensile strength, hardness, surface finish |
Role in Modeling:
- As Y variables (Controlled variables)
- Objects for model prediction and optimization
- Core indicators for evaluating model performance
Process Value (PV)
Definition: Process values actually measured by instruments or sensors.
Characteristics:
- True reflection of physical or chemical states
- The basis for calculating or evaluating CV
- May contain measurement noise or errors
Common Examples:
| Industry | Process Value Examples |
|---|---|
| Chemical | Actual temperature measured by thermocouple, flow meter reading |
| Food | Online moisture meter reading, actual pH value |
| Pharmaceutical | Stirring motor current fed back by sensor |
Role in Modeling:
- Used to characterize controlled variables (Y)
- Feed back the current system state for optimization control
3.3 Variable Classification Decision Tree
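One possible decision sequence, inferred from the definitions in 3.2 and the quick reference card in the appendix, is sketched below. The order of the questions, the function name `classify_variable`, and its parameter names are assumptions rather than the guide's own tree.

```python
def classify_variable(is_target: bool, adjustable: bool, is_sensor_feedback: bool):
    """Return (variable type, modeling role) for one process variable.

    is_target: a result we want to predict or optimize.
    adjustable: operators or the control system can change it directly.
    is_sensor_feedback: an instrument reading of a process state.
    """
    if is_target:
        return ("CV", "Y")
    if adjustable:
        return ("SV/MV", "X")
    if is_sensor_feedback:
        return ("PV", "X/Y")
    # Affects the process but cannot be adjusted: treat as a disturbance.
    return ("DV", "X")


# Examples taken from the reactor scenario in this guide.
print(classify_variable(False, True, False))   # Reaction Temp Setpoint
print(classify_variable(False, False, True))   # Actual Reaction Pressure
print(classify_variable(False, False, False))  # Ambient Temp
print(classify_variable(True, False, False))   # Product Conversion Rate
```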
3.4 Variable Classification Example
Scenario: Chemical Reactor Batch Control
| Variable Name | Variable Type | Classification Reason | Modeling Role |
|---|---|---|---|
| Reaction Temp Setpoint | Set/Manipulated Variable | Adjustable via DCS | X |
| Reaction Pressure | Set/Manipulated Variable | Adjustable via valves | X |
| Actual Reaction Pressure | Process Value | Pressure sensor feedback | X |
| Stirring Speed | Set/Manipulated Variable | Inverter control | X |
| Catalyst Dosage | Set/Manipulated Variable | Determined at batching | X |
| Material Batch | Disturbance Variable | Determined by procurement | X |
| Material Moisture | Disturbance Variable | Natural fluctuation | X |
| Ambient Temp | Disturbance Variable | Uncontrollable | X |
| Product Conversion Rate | Controlled Variable | Process result | Y |
| Product Selectivity | Controlled Variable | Quality indicator | Y |
4. Scenario Modeling Practice
4.1 Pre-Modeling Preparation
Data Sorting Checklist
## Pre-Modeling Checklist
### Data Integrity
- [ ] Sample size ≥ 30 (PLS minimum requirement)
- [ ] Number of X variables < Sample size/2 (Avoid overfitting)
- [ ] No severe missing values (<10%)
- [ ] Timestamps correctly aligned
### Variable Confirmation
- [ ] Set/Manipulated variables (SV/MV) identified and marked
- [ ] Disturbance variables identified and marked
- [ ] Process values (PV) identified and evaluated
- [ ] Controlled variables (CV) clarified
- [ ] Variable units unified
### Business Understanding
- [ ] Understand normal operating ranges
- [ ] Understand common abnormal patterns
- [ ] Clarify modeling goals (Prediction/Optimization/Monitoring)
4.2 Modeling Workflow
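A possible first gate in the workflow is an automated check of the Data Integrity items from the pre-modeling checklist. This is a minimal sketch: `check_data_integrity` is a hypothetical helper, missing cells are assumed to be stored as `None` or NaN, and the 10% limit is applied to the whole table, which is one reading of the checklist item.

```python
import math


def check_data_integrity(rows, n_x):
    """Check sample size, X-variable count and missing fraction.

    rows: list of samples, each a list of values (None or NaN = missing).
    n_x: number of X variables planned for the model.
    """
    n = len(rows)
    cells = [v for row in rows for v in row]
    missing = sum(
        1 for v in cells
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    missing_fraction = missing / len(cells) if cells else 1.0
    return {
        "sample_size_ok": n >= 30,           # Sample size >= 30
        "x_count_ok": n_x < n / 2,           # X variables < sample size / 2
        "missing_ok": missing_fraction < 0.10,
        "missing_fraction": missing_fraction,
    }


# Hypothetical table: 40 batches, 3 variables, 6 missing cells.
rows = [[1.0, 2.0, 3.0] for _ in range(40)]
for i in range(6):
    rows[i][0] = None
print(check_data_integrity(rows, n_x=3))
```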
4.3 Step-by-Step Modeling Guide
Step 1: Data Import and Configuration
- Import Data: Import the sorted Excel data into the platform
- Set Header Row: Mark the variable name row
- Configure X Variables:
- Select all Set/Manipulated variables (SV/MV)
- Select important Disturbance variables (DV)
- (Optional) If concerned with process states, introduce Process Values (PV)
- Configure Y Variables: Select Controlled variables (CV)
Step 2: Exploratory Analysis (PCA)
Purpose: Understand data structure, identify abnormal samples
Operations:
- Create a PCA model using only X variables
- View the Score Plot
- Identify outliers far from the main cluster
- View T² and SPE plots, mark statistical anomalies
Interpretation:
- Normal batches should cluster in the core area of the principal component space
- Points far from the cluster need investigation for causes
- Combine business knowledge to decide whether to exclude
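To make the T² and SPE statistics concrete, here is a minimal pure-Python sketch of a one-component PCA fitted with NIPALS on autoscaled data. It is an illustration, not the platform's implementation: the function names and the batch data are hypothetical, constant columns are assumed absent, and no statistical control limits are computed.

```python
from statistics import mean, stdev


def autoscale(X):
    """Centre each column and scale it to unit variance (rows are samples)."""
    cols = list(zip(*X))
    mu = [mean(c) for c in cols]
    sd = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mu, sd)] for row in X]


def first_component(Z, iters=200):
    """NIPALS for the first principal component: scores t and unit loading p."""
    t = [row[0] for row in Z]
    for _ in range(iters):
        tt = sum(v * v for v in t)
        p = [sum(row[j] * ti for row, ti in zip(Z, t)) / tt
             for j in range(len(Z[0]))]
        norm = sum(v * v for v in p) ** 0.5
        p = [v / norm for v in p]
        t = [sum(a * b for a, b in zip(row, p)) for row in Z]
    return t, p


def t2_and_spe(X):
    """Hotelling T² (distance inside the model) and SPE (distance to the model)."""
    Z = autoscale(X)
    t, p = first_component(Z)
    var_t = sum(v * v for v in t) / (len(t) - 1)
    t2 = [v * v / var_t for v in t]
    spe = [sum((z - ti * pj) ** 2 for z, pj in zip(row, p))
           for row, ti in zip(Z, t)]
    return t2, spe


# Hypothetical batches: the second variable tracks the first, except in the last row.
batches = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [5.0, 10.0],
           [6.0, 12.0], [7.0, 14.0], [8.0, 16.0], [4.5, 16.0]]
t2, spe = t2_and_spe(batches)
print("sample with largest SPE:", spe.index(max(spe)))
```

A large T² marks a sample that is extreme along the main direction of variation, while a large SPE marks a sample that breaks the usual relationship between variables, as the last row does here.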
Step 3: Build PLS Regression Model
Operations:
- Configure X (Set/Manipulated + Disturbance) and Y (Controlled)
- Click "Fit" to train the model
- View model indicators:
- R²Y: Goodness of fit
- Q²Y: Predictive ability (>0.5 acceptable, >0.9 excellent)
Diagnostics:
- If Q²Y < 0.5: Check variable selection, increase sample size
- If R²Y is high but Q²Y is low: Overfitting, reduce the number of latent variables
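The difference between R²Y and Q²Y is easiest to see in code. The sketch below uses a one-variable least-squares line as a stand-in for the PLS fit (an assumption made to keep the example short) and leave-one-out cross-validation; Q² is taken as 1 - PRESS / SS_tot, one common convention. The data and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (stand-in for the PLS fit)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b


def r2_and_q2(x, y):
    """R² from the full fit, Q² from leave-one-out predictions."""
    my = sum(y) / len(y)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    a, b = fit_line(x, y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    press = 0.0
    for i in range(len(x)):
        # Refit without sample i, then predict sample i.
        a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a_i + b_i * x[i])) ** 2
    return 1 - ss_res / ss_tot, 1 - press / ss_tot


# Hypothetical setpoint values and quality results.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
r2, q2 = r2_and_q2(x, y)
print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}")
```

Because each held-out sample is predicted by a model that never saw it, Q² cannot exceed R² for this model; a wide gap between the two is the overfitting signal described in the diagnostics.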
Step 4: VIP Analysis and Variable Selection
Purpose: Find the X variables that have the greatest impact on Y
Operations:
- View the VIP plot
- Identify key variables with VIP > 1
- Consider excluding variables with VIP < 0.5
- Remodel and validate
Business Interpretation:
- Set/Manipulated variables with high VIP are the focus of process optimization
- Disturbance variables with high VIP require enhanced monitoring
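For readers who want to see where VIP values come from, the following is a minimal pure-Python sketch of PLS with a single Y (NIPALS PLS1) and the standard VIP formula, in which each variable's squared weights are averaged over components in proportion to the Y variance each component explains. It is not the platform's code; the function names and the data are hypothetical, and columns are assumed to be non-constant.

```python
from statistics import mean, stdev


def autoscale(col):
    """Centre one column and scale it to unit variance."""
    m, s = mean(col), stdev(col)
    return [(v - m) / s for v in col]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def pls1_vip(x_cols, y, n_components=1):
    """NIPALS PLS1 on autoscaled data; returns one VIP value per X column."""
    X = [autoscale(c) for c in x_cols]
    y_res = autoscale(y)
    weights, explained = [], []
    for _ in range(n_components):
        w = [dot(col, y_res) for col in X]            # weights from X'y
        norm = dot(w, w) ** 0.5
        w = [v / norm for v in w]
        t = [sum(wj * col[i] for wj, col in zip(w, X)) for i in range(len(y))]
        tt = dot(t, t)
        q = dot(y_res, t) / tt                        # Y loading
        deflated = []
        for col in X:                                 # remove this component from X
            p_j = dot(col, t) / tt
            deflated.append([v - ti * p_j for v, ti in zip(col, t)])
        X = deflated
        y_res = [v - q * ti for v, ti in zip(y_res, t)]
        weights.append(w)
        explained.append(q * q * tt)                  # Y sum of squares explained
    total = sum(explained)
    p = len(x_cols)
    return [(p * sum(ss * w[j] ** 2 for ss, w in zip(explained, weights)) / total) ** 0.5
            for j in range(p)]


# Hypothetical data: a temperature setpoint that drives Y and an ambient reading that does not.
temp_sp = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0, 72.0, 74.0]
ambient = [20.0, 25.0, 19.0, 24.0, 21.0, 26.0, 18.0, 23.0]
purity = [0.50, 0.55, 0.58, 0.64, 0.67, 0.73, 0.75, 0.81]
vip = pls1_vip([temp_sp, ambient], purity)
print([round(v, 2) for v in vip])
```

Since the squared VIP values always average to 1 across the X variables, the VIP > 1 rule simply picks out variables that contribute more than the average.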
Step 5: Model Validation
Internal Validation:
- Cross-validation Q²Y
- Check residual distribution
External Validation (If conditions permit):
- Test with newly collected data
- Compare predicted values with actual values
4.4 Typical Scenario Modeling Case
Case: Fermentation Process Optimization
Scenario Description:
- Product: A certain amino acid fermentation
- Goal: Increase product concentration (Y)
- Cycle: 48-hour batch
Variable Classification:
| Type (Algorithm) | Control System Term | Variable Examples | Description |
|---|---|---|---|
| Set/Manipulated Variables (X) | SV / MV (Set Value/Manipulated Variable) | Temperature setpoint(SV), pH setpoint(SV), stirring speed(MV), aeration rate(MV) | Parameters actively adjusted by operators or APC in DCS (Usually modifying SV) |
| Disturbance Variables (X) | DV (Disturbance Variable) | Seed batch, medium batch, ambient temperature | Objective factors that affect the system but cannot be deliberately controlled |
| Controlled Variables (Y) | CV (Controlled Variable) | Product concentration, conversion rate | Actual detection results from offline testing or online instruments |
| Process Values (X or Y) | PV (Process Value), often the measured representation of a CV | Actual temperature (PV), actual pH (PV) | Process states fed back by sensors |
Modeling Results:
- R²Y = 0.92, Q²Y = 0.85
- Key variables with VIP > 1: pH setpoint, aeration rate, temperature setpoint
- Finding: pH control accuracy has the greatest impact on product concentration
Optimization Suggestions:
- Upgrade the pH control system to improve control accuracy
- Establish a pH feedforward compensation model
- Expected to increase product concentration by 8-12%
5. Tool-Assisted Variable Selection
5.1 Variable Selection Toolbox
In the platform, you can use the following tools to assist in selecting variables:
Tool 1: Correlation Analysis
Purpose: Identify collinearity among X variables
Operations:
- Use a heatmap to view correlations between X variables
- Identify highly correlated variable pairs with |r| > 0.8
- Keep one of them and exclude the redundant variable
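Outside the platform, the same screening can be sketched in a few lines. The helper names `pearson` and `collinear_pairs` and the sample readings are hypothetical; the 0.8 threshold is the one given above.

```python
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sum((u - ma) ** 2 for u in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    return cov / (sa * sb)


def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson(columns[name_a], columns[name_b])
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical readings for three X variables.
columns = {
    "reaction_temp": [80.0, 81.0, 82.5, 84.0, 85.0],
    "wall_temp": [78.2, 79.1, 80.8, 82.1, 83.2],
    "stirring_speed": [120.0, 118.0, 121.0, 119.0, 120.0],
}
for name_a, name_b, r in collinear_pairs(columns):
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

The function only reports the pairs; which member of a pair to keep remains a process decision.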
Example:
If "Reaction Temp" and "Reactor Wall Temp" have a correlation of 0.95
→ Only keep "Reaction Temp" (more directly controllable)
Tool 2: PCA Loading Analysis
Purpose: Understand the internal structure among variables
Operations:
- View the PCA Loading Plot
- Identify groups of variables clustered together (representing similar information)
- Choose the most representative variable from each group
Tool 3: VIP Iterative Selection
Purpose: Gradually optimize the variable set
Workflow:
Round 1: All variables → Calculate VIP
Round 2: Exclude variables with VIP<0.5 → Remodel
Round 3: Check Q²Y change
↓
If Q²Y drops <5%: Accept simplified model
If Q²Y drops >10%: Restore some excluded variables
Tool 4: Variable Importance Ranking Table
Comprehensive Evaluation Framework:
| Variable | VIP | Controllability | Measurement Cost | Comp. Score | Suggestion |
|---|---|---|---|---|---|
| Temp | 1.8 | High | Low | ⭐⭐⭐⭐⭐ | Keep |
| Pressure | 1.5 | High | Low | ⭐⭐⭐⭐⭐ | Keep |
| Material Batch | 0.3 | Low | Medium | ⭐⭐ | Exclude |
| Ambient Humidity | 0.4 | Low | High | ⭐ | Exclude |
5.2 Variable Selection Decision Process
5.3 Variable Selection Best Practices
DO:
- ✅ Prioritize retaining Set/Manipulated variables (optimizable)
- ✅ Retain variables with high VIP and easy measurability
- ✅ Retain variables that are "important by common sense" in business
- ✅ Use cross-validation to test the simplified model
DON'T:
- ❌ Only look at VIP and completely ignore business knowledge
- ❌ Exclude too many variables at once
- ❌ Exclude variables with low VIP but also low cost
- ❌ Over-screen when the sample size is very small
6. FAQs and Best Practices
6.1 Frequently Asked Questions
Q1: What if the sample size is insufficient?
A:
- Minimum requirement: Sample size > Number of X variables
- Ideal situation: Sample size ≥ 3 × Number of X variables
- If insufficient:
- Reduce X variables (prioritize excluding those with low VIP)
- Extend the data collection cycle
- Consider using PCA for dimensionality reduction first
Q2: How to handle missing values?
A:
- Missing <5%: Impute with mean/median
- Missing 5-20%: Impute with interpolation or regression prediction
- Missing >20%: Consider excluding the variable or sample
- The platform supports multiple missing value handling strategies
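The thresholds in this answer can be expressed as a small per-variable routine. This is a sketch under stated assumptions: missing values are stored as `None`, the median is chosen over the mean for the low band, linear interpolation over the sample index is chosen for the middle band, and the function names are hypothetical rather than platform features.

```python
from statistics import median


def choose_strategy(values):
    """Pick a strategy from the missing fraction of one variable."""
    fraction = sum(v is None for v in values) / len(values)
    if fraction < 0.05:
        return "median"
    if fraction <= 0.20:
        return "interpolate"
    return "exclude"


def interpolate(values):
    """Fill gaps linearly between neighbours; edge gaps copy the nearest value."""
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out


def impute(values):
    """Return the filled series, or None when the variable should be excluded."""
    strategy = choose_strategy(values)
    if strategy == "exclude":
        return None
    if strategy == "median":
        fill = median(v for v in values if v is not None)
        return [fill if v is None else v for v in values]
    return interpolate(values)


# Hypothetical series with 10% missing values.
series = [1.0, None, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(impute(series))
```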
Q3: What if the boundary between Set/Manipulated variables and Disturbance variables is blurred?
A:
- Judgment criterion: Can it be actively adjusted under current technology/cost conditions?
- Example: Ambient temperature is theoretically controllable (AC), but the cost is too high → treated as a disturbance
- Both are X in the model; the difference lies only in the optimization strategy
Q4: What if the model performs poorly on new data?
A:
- Check if the new data is within the range of the training data (extrapolation risk)
- Check if new disturbance factors have emerged
- Consider model updates (incremental learning or retraining)
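The first check, whether new data lies inside the training range, can be sketched as follows. The function name `out_of_range` and the example values are hypothetical, and a simple per-variable min/max comparison is assumed; it does not detect unusual combinations of variables that are each within range.

```python
def out_of_range(training, sample):
    """Return the variables whose new value lies outside the training min/max."""
    return [
        name for name, value in sample.items()
        if not min(training[name]) <= value <= max(training[name])
    ]


# Hypothetical training ranges and one new batch.
training = {
    "temp_setpoint": [60.0, 65.0, 70.0, 74.0],
    "ph_setpoint": [6.8, 7.0, 7.2],
}
new_batch = {"temp_setpoint": 78.0, "ph_setpoint": 7.1}
print(out_of_range(training, new_batch))
```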
Q5: How to report modeling results to management?
A:
- Avoid technical jargon, focus on business value
- Use specific numbers: "Expected to increase yield by X% after optimization"
- Display visualizations: Score plot, VIP plot
- Provide clear action recommendations
6.2 Modeling Success Checklist
## Project Delivery Checklist
### Model Quality
- [ ] Q²Y > 0.5 (Minimum threshold)
- [ ] R²Y - Q²Y < 0.2 (Avoid overfitting)
- [ ] No obvious patterns in residuals
- [ ] VIP of key variables > 1
### Business Validation
- [ ] Key variables conform to process common sense
- [ ] Abnormal samples have reasonable explanations
- [ ] Model prediction error is within an acceptable range
- [ ] Validated with at least one independent batch of data
### Document Completeness
- [ ] Variable classification list
- [ ] Data collection method description
- [ ] Model performance report
- [ ] Application suggestions and risk warnings
6.3 Continuous Improvement Suggestions
Model Lifecycle Management:
Months 1-2: Model Building and Validation
├── Collect data, build initial model
└── Internal validation, parameter tuning
Months 3-6: Trial Run and Optimization
├── Small-scale trial
└── Collect feedback, correct issues
Months 6-12: Official Deployment
├── Full application
└── Establish monitoring mechanism
After 12 Months: Regular Maintenance
├── Evaluate model performance quarterly
├── Data drift detection
└── Retrain when necessary
Appendix: Quick Reference Cards
Variable Classification Quick Reference
| Question | Set/Manipulated Variable (SV/MV) | Disturbance Variable (DV) | Controlled Variable (CV) | Process Value (PV) |
|---|---|---|---|---|
| Actively adjustable? | ✅ Yes (mainly modifying SV) | ❌ No | N/A (It's a result) | N/A (It's a result) |
| Role in model | X | X | Y | X/Y |
| Optimization value | High (direct operation) | Medium (monitoring and early warning) | Target | State feedback |
| Example | Temperature setpoint | Ambient temperature | Product purity | Actual temp reading |
Model Selection Quick Reference
| Scenario | Recommended Model | Key Indicators |
|---|---|---|
| Only X, explore structure | PCA | R²X, Score Plot |
| X→Y Prediction (Continuous) | PLS | R²Y, Q²Y, VIP |
| X→Y Classification (Discrete) | PLS-DA | Accuracy, F1, AUC |
VIP Interpretation Quick Reference
| VIP Value | Importance | Suggestion |
|---|---|---|
| > 1.5 | Very Important | Focus |
| 1.0-1.5 | Important | Keep |
| 0.5-1.0 | General | Can keep |
| < 0.5 | Unimportant | Consider excluding |
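For scripted reports, the table can be encoded as a lookup. The function name `vip_suggestion` is hypothetical, and treating the lower bound of each band as inclusive is an assumption, since the table does not say which band the values 1.0 and 0.5 belong to.

```python
def vip_suggestion(vip: float):
    """Map a VIP value to the (importance, suggestion) pair from the table."""
    if vip > 1.5:
        return ("Very Important", "Focus")
    if vip >= 1.0:  # assumption: 1.0 belongs to the 1.0-1.5 band
        return ("Important", "Keep")
    if vip >= 0.5:  # assumption: 0.5 belongs to the 0.5-1.0 band
        return ("General", "Can keep")
    return ("Unimportant", "Consider excluding")


for value in (1.8, 1.2, 0.7, 0.3):
    print(value, vip_suggestion(value))
```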
This document is a companion guide for the Data Insight Platform, combining actual industrial scenarios to help users systematically conduct data modeling work.