
Core Model Algorithm Explanation

The core capabilities of StarWay Data Insight are built on three classic multivariate statistical models: PCA (Principal Component Analysis), PLS (Partial Least Squares Regression), and PLS-DA (Partial Least Squares Discriminant Analysis).

This chapter explains the principles, mathematical essence, application scenarios, and platform-specific usage of these three algorithms. Understanding the models will help you interpret analysis results and make more accurate data-driven decisions.


📊 Model Family Overview

Before diving into details, the diagram below shows how the three models relate to each other:

┌─────────────────────────────────────────────────────────────┐
│           Multivariate Data Analysis Model Family           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌──────────────┐        ┌──────────────┐                  │
│   │ Unsupervised │        │  Supervised  │                  │
│   └──────┬───────┘        └──────┬───────┘                  │
│          │                       │                          │
│          ▼                       ▼                          │
│   ┌──────────────┐        ┌──────────────┐                  │
│   │     PCA      │        │     PLS      │                  │
│   │   Explore    │        │  Regression  │                  │
│   │  Structure   │        │  Prediction  │                  │
│   └──────────────┘        └──────┬───────┘                  │
│                                  │                          │
│                                  ▼                          │
│                           ┌──────────────┐                  │
│                           │    PLS-DA    │                  │
│                           │Classification│                  │
│                           │  Y is Label  │                  │
│                           └──────────────┘                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

One-sentence summary:

  • PCA: "What does the data look like?" → Explore internal structure
  • PLS: "How does X affect Y?" → Establish predictive relationships
  • PLS-DA: "Which category does it belong to?" → Perform classification

🔬 PCA (Principal Component Analysis)

What is PCA?

PCA (Principal Component Analysis) is an unsupervised dimensionality reduction technique. Its core idea is to use a small number of new variables (principal components) that retain as much information from the original data as possible.

Imagine you have a set of 3D data points. PCA finds the 2D plane onto which the projected points keep the largest possible "spread" (variance), which minimizes the information lost in the projection.

PCA Principle Diagram

Interpretation of the diagram:

  • Blue points represent original high-dimensional (3D) data
  • Red plane is the optimal projection plane found by PCA (spanned by PC1 and PC2)
  • Green dashed lines show the projection process of data points onto the plane
  • Projected red crosses preserve information in the direction of maximum variance

Core Principles

1. Variance is Information

PCA assumes that the directions in which the data vary most carry the most information.

  • If a variable is nearly identical across all samples (small variance), it carries little information
  • If a variable differs significantly between samples (large variance), it carries important information

2. Construction of Principal Components

PCA transforms the original correlated variables into uncorrelated new variables (principal components) through a linear transformation:

$$t_i = p_{1i} x_1 + p_{2i} x_2 + \cdots + p_{mi} x_m, \quad i = 1, \dots, k$$

Where:

  • $t_i$ are called Principal Components
  • $p_{ji}$ are Loadings, representing the contribution of original variable $x_j$ to new component $t_i$
  • Principal components are uncorrelated (orthogonal)

3. Characteristics of Principal Components

  • First Principal Component (PC1): Direction that explains the maximum variance in the data
  • Second Principal Component (PC2): Direction that explains the maximum remaining variance, subject to being orthogonal to PC1
  • And so on...

Mathematical Essence (Simplified)

The mathematical essence of PCA is eigenvalue decomposition of the covariance matrix:

  1. Data Centering: Subtract the mean of each variable
  2. Calculate Covariance Matrix: $C = \frac{1}{n-1} X^{\top} X$, where $X$ is the centered $n \times m$ data matrix
  3. Eigenvalue Decomposition: $C p_i = \lambda_i p_i$
    • $\lambda_i$ (eigenvalue): Represents the variance explained by this principal component
    • $p_i$ (eigenvector): Represents the direction of the principal component (i.e., loading)
  4. Select Principal Components: Sort by eigenvalues in descending order, take the top $k$
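
The four steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the platform's implementation: the function name `pca_eig` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_eig(X, k):
    """PCA by eigendecomposition of the covariance matrix (illustrative sketch)."""
    Xc = X - X.mean(axis=0)                   # 1. center each variable
    C = Xc.T @ Xc / (X.shape[0] - 1)          # 2. covariance matrix (m x m)
    eigvals, eigvecs = np.linalg.eigh(C)      # 3. eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]         # 4. sort by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    P = eigvecs[:, :k]                        # loadings (directions)
    T = Xc @ P                                # scores (sample coordinates)
    explained = eigvals[:k] / eigvals.sum()   # variance share per component
    return T, P, explained

# Synthetic data: two strongly correlated columns plus one independent column
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([base, 2 * base, rng.normal(size=(50, 1))])
X = X + 0.05 * rng.normal(size=X.shape)

T, P, explained = pca_eig(X, k=2)
```

Here `T` holds the scores used for score plots, `P` the loadings, and `explained.sum()` plays the role of the cumulative contribution rate.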

Application in the Platform

Application Scenarios

  • Data Exploration: Get an initial understanding of the overall structure and distribution of the data
  • Anomaly Detection: Find abnormal samples through T² and SPE statistics
  • Dimensionality Reduction Visualization: Project high-dimensional data into 2D/3D space for inspection
  • Denoising: Remove noise components and retain the main signal

PCA Score Plot Example

The figure below shows a typical PCA score plot. Each point represents a sample, so the distribution pattern and any outliers are easy to see:

PCA Score Plot Example

How to interpret:

  • Each gray point represents a sample, positioned by its scores on PC1 and PC2
  • The elliptical area represents the confidence region (usually 95%); points outside it may be outliers
  • Clustered points indicate similar samples, scattered points indicate significant differences

Platform Automatic Trigger Condition

When you configure only X variables (no Y variables), the platform automatically uses the PCA model:

Only X columns → Automatically select PCA → Explore data structure

Key Output Indicators

| Indicator | Meaning | Interpretation |
| --- | --- | --- |
| R²X | Cumulative explanation rate of X | How much of the information in X the model explains; closer to 1 is better |
| Cumulative Contribution Rate | Proportion explained by the top k principal components | Usually 80%~95% is acceptable |
| Loading | Relationship between variables and principal components | Shows which variables dominate a component |
| Score | Coordinates of samples in the new coordinate system | Used to draw scatter plots of the sample distribution |

Selection of Number of Principal Components

The platform automatically selects the optimal number of principal components through cross-validation, but you can adjust it manually via C+1/C-1:

  • Too few: Underfitting, serious information loss
  • Too many: Overfitting, noise is modeled as signal
  • Rule of thumb: Consider stopping when a new component adds less than 5% to R²X
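
The 5% rule of thumb can be written as a small helper. This is a minimal sketch of the heuristic only: the platform itself selects components by cross-validation, and the function name and the cumulative R²X values below are made up for illustration.

```python
def components_by_increment(cumulative_r2x, min_gain=0.05):
    """Count components until the next one adds less than min_gain to R2X."""
    kept = 0
    previous = 0.0
    for r2 in cumulative_r2x:
        if r2 - previous < min_gain:
            break                      # this component adds too little
        kept += 1
        previous = r2
    return max(kept, 1)                # always keep at least one component

# Hypothetical cumulative R2X after 1, 2, 3, 4 and 5 components
cumulative = [0.52, 0.74, 0.85, 0.88, 0.89]
n_components = components_by_increment(cumulative)
print(n_components)  # 3: the fourth component adds only about 0.03
```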

Advantages and Disadvantages of PCA

✅ Advantages:

  • Unsupervised, no need for labeled data
  • Efficient computation, interpretable results
  • Effective dimensionality reduction, removes correlations between variables
  • Good visualization effect

⚠️ Limitations:

  • Only focuses on the variance structure of X, ignores Y
  • Sensitive to outliers
  • Principal components are linear combinations of variables and may lack a direct business meaning
  • Forces components to be orthogonal, while the real sources of variation in the data may not be

🔗 PLS (Partial Least Squares Regression)

What is PLS?

PLS (Partial Least Squares Regression) is a supervised regression method. Unlike PCA, PLS considers both X (features) and Y (target) when modeling, finding a latent variable space that best explains the relationship between them.

Simply put: PCA asks "How does X change", PLS asks "How does X affect Y".

PLS Principle Diagram

Interpretation of the diagram:

  • Left blue box is X variables (X1-X5), right green box is Y variables (target)
  • Middle yellow box is extracted latent variables (LV1, LV2), explaining both X and Y
  • PLS aims to maximize covariance between X latent variables and Y latent variables
  • Establishes X → Y prediction relationship through latent variables

Core Principles

1. Simultaneous Decomposition of X and Y

PLS decomposes both X and Y simultaneously, requiring their latent variables to have maximal covariance:

$$X = T P^{\top} + E, \qquad Y = U Q^{\top} + F$$

Where:

  • $T$: Score matrix of X (similar to PCA scores)
  • $P$: Loading matrix of X
  • $U$: Score matrix of Y
  • $Q$: Loading matrix of Y
  • $E$, $F$: Residual matrices

2. Maximize Covariance

The core optimization goal of PLS is to find latent variables of X and Y that maximize their covariance:

$$\max_{\lVert w \rVert = 1,\ \lVert q \rVert = 1} \operatorname{cov}(Xw,\ Yq)$$

This means the components extracted by PLS must both represent the variation in X and be closely related to Y.

3. Iterative Component Extraction

PLS extracts latent variables one by one through iterative algorithms (such as NIPALS):

  1. Find the direction with maximum covariance between X and Y as the first pair of latent variables
  2. Subtract the explained part from X and Y (deflation)
  3. Repeat until enough components are extracted

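Mathematical Essence (Simplified)

The mathematical core of PLS is covariance maximization. For each component:

  1. Initialization: Start from a column of $Y$ or a random vector as $u$
  2. Iterative Optimization (repeat until the scores stop changing):
    • $w = X^{\top} u / (u^{\top} u)$, then normalize $w \leftarrow w / \lVert w \rVert$ (find X weights from Y scores)
    • $t = X w$ (calculate X scores)
    • $q = Y^{\top} t / (t^{\top} t)$ (find Y weights from X scores)
    • $u = Y q / (q^{\top} q)$ (calculate Y scores)
  3. After Convergence: Calculate the X loading $p = X^{\top} t / (t^{\top} t)$
  4. Deflation (removing the explained part): $X \leftarrow X - t p^{\top}$, $Y \leftarrow Y - t q^{\top}$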
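
The iteration above can be sketched with NumPy as follows. This is a minimal textbook-style NIPALS loop under stated assumptions (centered data, convergence checked on the Y scores, synthetic data driven by two hidden factors), not the platform's implementation. The name `nipals_pls` is hypothetical.

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-10, max_iter=500):
    """Extract PLS components with NIPALS. X is (n, m), Y is (n, r)."""
    X = X - X.mean(axis=0)                    # center both blocks
    Y = Y - Y.mean(axis=0)
    T, W, P, Q = [], [], [], []
    for _component in range(n_components):
        u = Y[:, 0].copy()                    # 1. start from a column of Y
        for _iteration in range(max_iter):    # 2. iterate until u is stable
            w = X.T @ u / (u @ u)             #    X weights from Y scores
            w /= np.linalg.norm(w)
            t = X @ w                         #    X scores
            q = Y.T @ t / (t @ t)             #    Y weights from X scores
            u_new = Y @ q / (q @ q)           #    Y scores
            converged = np.linalg.norm(u_new - u) <= tol * np.linalg.norm(u_new)
            u = u_new
            if converged:
                break
        p = X.T @ t / (t @ t)                 # 3. X loading
        X = X - np.outer(t, p)                # 4. remove the explained part
        Y = Y - np.outer(t, q)
        T.append(t); W.append(w); P.append(p); Q.append(q)
    return tuple(np.column_stack(M) for M in (T, W, P, Q))

# Synthetic data: X and Y are both driven by two hidden factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(40, 5))
Y = latent @ np.array([[1.0], [-1.0]]) + 0.01 * rng.normal(size=(40, 1))

T, W, P, Q = nipals_pls(X, Y, n_components=2)
Yc = Y - Y.mean(axis=0)
r2y = 1 - np.sum((Yc - T @ Q.T) ** 2) / np.sum(Yc ** 2)   # share of Y explained
```

Because X is deflated with its own scores, successive score vectors in `T` come out orthogonal, which is what makes the per-component contributions additive.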

Application in the Platform

Application Scenarios

  • Regression Prediction: Establish X → Y prediction models
  • Variable Selection: Find X variables that have the greatest impact on Y through VIP
  • Multi-response Problems: Y can be multiple columns (multi-response variables)
  • Collinearity Handling: Stable even when X variables are highly correlated

Platform Automatic Trigger Condition

When you configure both X variables and Y variables, and Y is continuous numerical:

X + Y (continuous) → Automatically select PLS → Establish regression model

Key Output Indicators

| Indicator | Meaning | Interpretation |
| --- | --- | --- |
| R²X | Cumulative explanation rate of X | Proportion of X variance captured by the model |
| R²Y | Cumulative explanation rate of Y | Proportion of Y variation explained by the model; higher is better |
| Q²Y | Cross-validated predictive ability for Y | Most critical! Reflects generalization ability; > 0.5 acceptable, > 0.9 excellent |
| RMSE | Root mean squared error | Typical deviation between predicted and actual values; smaller is better |
| VIP | Variable Importance in Projection | > 1 indicates important variables, < 0.5 can be ignored |

Selection of Number of Latent Variables

The platform automatically selects the optimal number of latent variables through cross-validation, based on the following principles:

  • The optimum is where Q²Y reaches its peak
  • If R²Y is high but Q²Y is low → Overfitting, reduce the number of components
  • If both are low → Underfitting, consider adding components or checking the data
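
As a rough sketch of the peak rule, Q² can be computed from cross-validated predictions as 1 - PRESS/TSS, and the component count is then taken at its maximum. The helper names and the Q² curve below are hypothetical, and the platform's exact cross-validation bookkeeping may differ from this simplified form.

```python
def q2(y_true, y_cv_pred):
    """Q2 = 1 - PRESS / TSS, where y_cv_pred comes from cross-validation."""
    mean_y = sum(y_true) / len(y_true)
    press = sum((y - p) ** 2 for y, p in zip(y_true, y_cv_pred))
    tss = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - press / tss

def best_component_count(q2_by_count):
    """Return the 1-based component count with the highest Q2 (first on ties)."""
    best_index = max(range(len(q2_by_count)), key=lambda i: q2_by_count[i])
    return best_index + 1

# Hypothetical Q2Y values for models with 1 to 5 latent variables
q2_curve = [0.41, 0.63, 0.71, 0.69, 0.60]
best = best_component_count(q2_curve)
print(best)  # 3: Q2Y peaks at the third latent variable
```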

VIP Analysis

VIP (Variable Importance in Projection) is an important output of PLS, telling you which X variables are most important for predicting Y.

The figure below shows typical results of VIP analysis, where taller bars indicate greater impact of that variable on Y:

VIP Variable Importance Chart

How to interpret:

  • Red reference line (VIP = 1) is the threshold for important variables
  • Variables above the red line (such as x3, x5 in the example) have significant contribution to Y
  • Variables below the red line can be considered for removal to simplify the model

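VIP calculation formula:

$$\mathrm{VIP}_j = \sqrt{\frac{m \sum_{a=1}^{A} \mathrm{SSY}_a \left( w_{ja} / \lVert w_a \rVert \right)^2}{\sum_{a=1}^{A} \mathrm{SSY}_a}}$$

where $m$ is the number of X variables, $A$ the number of latent variables, $w_{ja}$ the weight of variable $j$ on component $a$, and $\mathrm{SSY}_a$ the sum of squares of Y explained by component $a$.

Interpretation criteria:

  • VIP > 1: Important variables, significant contribution to Y
  • 0.5 ≤ VIP ≤ 1: Moderately important
  • VIP < 0.5: Negligible, consider removal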
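
A common way to compute VIP from a fitted PLS model weights each variable's squared, normalized weight by the Y variation its component explains. The sketch below assumes that standard definition. The function name and the small `T`, `W`, `Q` matrices are hypothetical values chosen for illustration.

```python
import numpy as np

def vip_scores(T, W, Q):
    """VIP per X variable from scores T (n, A), weights W (m, A), Y loadings Q (r, A)."""
    m = W.shape[0]
    # Y sum of squares explained by each component: (t't) * (q'q)
    ssy = np.sum(T ** 2, axis=0) * np.sum(Q ** 2, axis=0)
    # Squared weights after scaling every weight vector to unit length
    w_unit_sq = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(m * (w_unit_sq @ ssy) / ssy.sum())

# Hypothetical two-component model with four X variables and one Y column
T = np.array([[1.0, 0.2], [-1.0, 0.1], [0.5, -0.3], [-0.5, 0.0]])
W = np.array([[0.9, 0.1], [0.4, 0.2], [0.1, 0.9], [0.0, 0.3]])
Q = np.array([[0.8, 0.3]])

vip = vip_scores(T, W, Q)
important = [j for j, v in enumerate(vip) if v > 1.0]   # variables above VIP = 1
```

With this definition the squared VIP values average to 1 across variables, which is why 1 is a natural reference line.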

Advantages and Disadvantages of PLS

✅ Advantages:

  • Handles X and Y simultaneously, strong predictive ability
  • Effectively solves multicollinearity problems
  • Supports multi-response variables
  • Provides VIP for variable selection
  • Works well when sample size < number of variables

⚠️ Limitations:

  • Requires labeled data (Y)
  • Model interpretation is more complex than PCA
  • Limited ability to model nonlinear relationships
  • Sensitive to outliers

🎯 PLS-DA (Partial Least Squares Discriminant Analysis)

What is PLS-DA?

PLS-DA (Partial Least Squares Discriminant Analysis) is an extension of PLS, specifically designed for classification problems. When Y is category labels (such as "pass/fail", "Class A/Class B/Class C"), PLS-DA is your choice.

Simply put: PLS predicts numerical values, PLS-DA predicts categories.

PLS-DA Classification Principle

Interpretation of the diagram:

  • Blue dots represent Class A, orange squares represent Class B
  • Two types of samples form clearly separated clusters in latent variable space (LV1-LV2)
  • Green dashed line is decision boundary, used to distinguish between two classes
  • Shaded ellipses represent the confidence region of each class; less overlap means better class separation

Core Principles

1. Convert Classification Problem to Regression Problem

The key idea of PLS-DA is to convert category labels to dummy variables, then use PLS for regression.

For example, a three-class problem (A, B, C) is converted to:

| Sample | Original Label | Y1(A) | Y2(B) | Y3(C) |
| --- | --- | --- | --- | --- |
| 1 | A | 1 | 0 | 0 |
| 2 | B | 0 | 1 | 0 |
| 3 | C | 0 | 0 | 1 |

Then perform standard PLS on this multi-response Y matrix.
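
The dummy-variable conversion in the table can be sketched as a small helper. The function name `one_hot` is an assumption for illustration, and the classes are ordered alphabetically here, which may differ from the platform's ordering.

```python
def one_hot(labels):
    """Convert category labels to an indicator matrix with one column per class."""
    classes = sorted(set(labels))                      # fixed column order
    column = {c: j for j, c in enumerate(classes)}
    rows = [[1 if column[label] == j else 0 for j in range(len(classes))]
            for label in labels]
    return classes, rows

classes, Y = one_hot(["A", "B", "C"])
print(classes)  # ['A', 'B', 'C']
print(Y)        # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```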

2. Discrimination Rule

During prediction, PLS-DA outputs a "score" for each category, and each sample is assigned to the category with the highest score:

$$\hat{c} = \arg\max_{k} \hat{y}_k$$

where $\hat{y}_k$ is the predicted response for class $k$.

3. Visualization Advantages

PLS-DA score plots are naturally suitable for showing classification effects:

  • Samples of different categories should form separated clusters in the plot
  • First latent variable is usually most correlated with inter-group differences
  • Second latent variable shows intra-group variation

Mathematical Essence

The mathematics of PLS-DA is almost the same as PLS, with the difference in Y matrix construction:

  1. Encoding: Convert category labels to indicator matrix
  2. PLS Regression: Perform standard PLS on X and encoded Y
  3. Discrimination: Select the category with the largest response value during prediction
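
The discrimination step amounts to picking the column with the largest predicted response. The sketch below uses hypothetical predicted responses. In practice they would come from a PLS regression on the indicator matrix.

```python
def predict_classes(y_hat, classes):
    """Assign each sample to the class whose predicted response is largest."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in y_hat]

# Hypothetical predicted responses for three samples and classes A, B, C
y_hat = [
    [0.82, 0.15, 0.03],
    [0.10, 0.35, 0.55],
    [0.40, 0.45, 0.15],
]
predicted = predict_classes(y_hat, ["A", "B", "C"])
print(predicted)  # ['A', 'C', 'B']
```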

Category encoding methods:

  • Binary classification: Y = 0/1 or -1/+1
  • Multi-class classification: One-hot encoding (one column per class)

Application in the Platform

Application Scenarios

  • Binary Classification: Pass/fail, positive/negative, normal/abnormal
  • Multi-class Classification: Raw material grading, product classification, variety identification
  • Feature Selection: Find key variables that distinguish different categories
  • Biomarker Discovery: Medical, omics data analysis

Platform Automatic Trigger Condition

When you configure X variables and Y variables, and Y is category labels (text or discrete values):

X + Y (category labels) → Automatically select PLS-DA → Establish classification model

Key Output Indicators

| Indicator | Meaning | Interpretation |
| --- | --- | --- |
| R²X | Cumulative explanation rate of X | X variance captured by the model |
| Accuracy | Classification accuracy | Proportion of correct predictions; beware of class imbalance pitfalls |
| F1 Score | Harmonic mean of precision and recall | More reliable than Accuracy in class imbalance situations |
| AUC | Area under ROC curve | Discrimination ability; 0.5 random, 1.0 perfect, > 0.8 good |
| Confusion Matrix | Prediction vs actual classification table | Shows which classes are easily confused |
| VIP | Variable importance | Find key variables that distinguish categories |

Classification Performance Evaluation

Confusion Matrix Interpretation:

The figure below shows the confusion matrix of a PLS-DA model, intuitively displaying the model's predictive performance across categories:

PLS-DA Confusion Matrix

How to interpret:

  • Values on the diagonal (dark color) represent the number of correctly classified samples
  • Off-diagonal values represent misclassified samples
  • Ideally, all samples should be concentrated on the diagonal

Confusion Matrix Table Form:

                          Prediction
                   Positive            Negative
              ┌───────────────────┬───────────────────┐
  Actual      │        TP         │        FN         │
  Positive    │  (True Positive)  │ (False Negative)  │
              ├───────────────────┼───────────────────┤
  Actual      │        FP         │        TN         │
  Negative    │ (False Positive)  │  (True Negative)  │
              └───────────────────┴───────────────────┘

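Derived Indicators:

  • Precision: $\frac{TP}{TP + FP}$, i.e. how many predicted positives are actually positive
  • Recall: $\frac{TP}{TP + FN}$, i.e. how many actual positives are found
  • F1 Score: $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$, a combined indicator (harmonic mean of the two)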
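
These indicators follow directly from the confusion matrix counts. A minimal sketch with made-up counts (the function name is hypothetical):

```python
def precision_recall_f1(tp, fp, fn):
    """Derive precision, recall and F1 from confusion matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up counts: 40 true positives, 10 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```

The zero checks avoid division errors when a class is never predicted or never present.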

Pitfalls in Class Imbalance:

If 95% of samples are Class A and 5% are Class B:

  • Even if the model predicts all as A, Accuracy is still 95%
  • But this is completely ineffective for Class B!
  • Solution: Look at F1 Score, AUC, or adjust class weights
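
The 95%/5% case can be checked numerically. The snippet below builds exactly that label distribution and a degenerate model that always predicts Class A. The data are constructed for illustration only.

```python
actual = ["A"] * 95 + ["B"] * 5
predicted = ["A"] * 100            # a model that always predicts Class A

pairs = list(zip(actual, predicted))
accuracy = sum(a == p for a, p in pairs) / len(pairs)

# Recall for Class B: how many of the real B samples were found
tp_b = sum(a == "B" and p == "B" for a, p in pairs)
fn_b = sum(a == "B" and p != "B" for a, p in pairs)
recall_b = tp_b / (tp_b + fn_b)

print(accuracy, recall_b)  # 0.95 0.0
```

Accuracy looks excellent while recall for Class B is zero, which is why per-class metrics matter for imbalanced data.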

ROC and AUC

ROC Curve: True positive rate vs false positive rate under different thresholds

ROC Curve Example

How to interpret:

  • The closer the curve is to the upper left corner, the better the model
  • Diagonal = random guess

AUC Evaluation Criteria:

  • AUC = 0.5: Random level (no discrimination ability)
  • 0.7 ≤ AUC < 0.8: Acceptable
  • 0.8 ≤ AUC < 0.9: Good
  • AUC ≥ 0.9: Excellent
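
AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it from that pairwise definition for a binary problem. The labels and scores are made up, and the function assumes both classes are present.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
value = auc(labels, scores)
print(round(value, 3))  # 0.889: 8 of the 9 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the sample count, which is fine for illustration but slow for large data sets.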

Advantages and Disadvantages of PLS-DA

✅ Advantages:

  • Suitable for high-dimensional small sample data (number of variables > number of samples)
  • Handles multicollinearity
  • Provides visualization (score plots show class separation)
  • Gives VIP for screening discriminant variables
  • More robust than LDA (Linear Discriminant Analysis) when variables are highly correlated or outnumber the samples

⚠️ Limitations:

  • Assumes linear boundaries between classes
  • Sensitive to class imbalance
  • Overfitting risk (when too many components)
  • Requires strict evaluation through cross-validation

🔄 Comparison and Selection of the Three Models

Quick Selection Guide

                   Start
                     │
                     ▼
           ┌──────────────────┐
           │ Have Y variable? │
           └─────────┬────────┘
                     │
           ┌─────────┴─────────┐
           ▼                   ▼
          No                  Yes
           │                   │
           ▼                   ▼
    ┌─────────────┐   ┌─────────────────┐
    │     PCA     │   │ What type is Y? │
    │   Explore   │   └────────┬────────┘
    │  Structure  │            │
    └─────────────┘  ┌─────────┴─────────┐
                     ▼                   ▼
                Continuous           Category
                     │                   │
                     ▼                   ▼
            ┌────────────────┐  ┌────────────────┐
            │      PLS       │  │     PLS-DA     │
            │   Regression   │  │ Classification │
            │     X → Y      │  │    Category    │
            └────────────────┘  └────────────────┘
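
The selection guide can be expressed as a small function. This is a sketch of the decision logic only. How the platform actually detects the type of Y is not described here, so the `treat_numeric_as_category` flag for numeric class codes is an assumption.

```python
def select_model(y_values=None, treat_numeric_as_category=False):
    """No Y -> PCA, continuous Y -> PLS, categorical Y -> PLS-DA."""
    if not y_values:
        return "PCA"
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in y_values
    )
    if numeric and not treat_numeric_as_category:
        return "PLS"
    return "PLS-DA"

print(select_model())                                        # PCA
print(select_model([12.5, 13.1, 11.8]))                      # PLS
print(select_model(["pass", "fail", "pass"]))                # PLS-DA
print(select_model([0, 1, 1], treat_numeric_as_category=True))  # PLS-DA
```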

Detailed Comparison Table

| Feature | PCA | PLS | PLS-DA |
| --- | --- | --- | --- |
| Learning Type | Unsupervised | Supervised | Supervised |
| Y Variable | Not needed | Continuous values | Category labels |
| Main Purpose | Dimensionality reduction, exploration | Regression prediction | Classification |
| Optimization Goal | Maximize X variance | Maximize X-Y covariance | Maximize covariance between X and the class indicator matrix (class separation) |
| Output | Principal components, scores | Predicted values, VIP | Class scores and predicted labels, VIP |
| Key Indicators | R²X | R²Y, Q²Y, RMSE | Accuracy, F1, AUC |
| Visualization | Score plots, loading plots | Prediction plots, VIP plots | ROC curves, confusion matrices |
| Sample Requirements | No restrictions | Sample size > number of variables is better | Balanced samples across classes is better |

Combined Usage Strategy

In actual projects, the three models are often used in combination:

Scenario 1: Exploration first, then modeling

1. PCA explores data structure → discovers anomalies, understands distribution
2. Data cleaning → removes abnormal samples
3. PLS/PLS-DA modeling → establishes prediction/classification models

Scenario 2: Model diagnosis

1. PLS trains model
2. Uses PCA idea to view score plots → checks sample distribution
3. Combines with T²/SPE → identifies abnormal samples

Scenario 3: Variable selection

1. PLS/PLS-DA calculates VIP
2. Removes variables with VIP < 0.5
3. Remodels → simplifies model, improves generalization

🛠️ Modeling Practice in the Platform

Modeling Process

Parameter Tuning Tips

Selection of Number of Components/Latent Variables

The platform provides C+1/C-1 buttons for manual adjustment:

| Phenomenon | Cause | Solution |
| --- | --- | --- |
| High R², low Q² | Overfitting | Reduce components |
| Both R² and Q² are low | Underfitting | Increase components |
| Q² decreases as components increase | Added components model noise | Select the component number at the Q² peak |

Cross-Validation Settings

  • K-Fold: Used when sample size is large (e.g., 5-fold, 10-fold)
  • Leave-One-Out (LOO): Used when sample size is small
  • Random Seed: Fix seed to ensure reproducible results
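
The three settings fit together as follows: a fixed seed makes the shuffle reproducible, and Leave-One-Out is simply K-Fold with as many folds as samples. The helper below is a plain-Python sketch of that idea and does not reflect the platform's internal splitter.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Return k (train, test) index splits from a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # fixed seed -> same splits
    folds = [indices[i::k] for i in range(k)]     # deal samples into k folds
    return [
        (sorted(set(indices) - set(fold)), sorted(fold))
        for fold in folds
    ]

splits = k_fold_indices(10, k=5, seed=42)   # 5-fold on 10 samples
loo = k_fold_indices(4, k=4)                # Leave-One-Out: k equals n_samples

for train, test in splits:
    print(train, test)
```

Every sample lands in exactly one test fold, so each is predicted once by a model that never saw it.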

Model Diagnosis Checklist

After training the model, work through the following checklist:

General Checks (All Models):

  • [ ] Is R²X reasonable (> 0.5 is usually acceptable)?
  • [ ] Are there obvious outliers in the score plots?
  • [ ] Do any T²/SPE values exceed their limits?

PLS Specific Checks:

  • [ ] Is Q²Y > 0.5 (minimum threshold)?
  • [ ] Is the gap between R²Y and Q²Y < 0.2 (guards against overfitting)?
  • [ ] Do the high-VIP variables agree with business common sense?
  • [ ] Do points in the prediction scatter plot fall along the diagonal?

PLS-DA Specific Checks:

  • [ ] Is Accuracy > 0.8 (depends on task difficulty)?
  • [ ] Is the F1 Score reasonable (must be checked under class imbalance)?
  • [ ] Is AUC > 0.8?
  • [ ] Is any class predicted particularly poorly in the confusion matrix?
  • [ ] Are the categories clearly separated on the score plot?

📚 Further Reading

If you want to understand these algorithms more deeply, the following resources are recommended:

Classic Literature:

  • Wold, S. et al. (2001). PLS-regression: a basic tool of chemometrics
  • Trygg, J. & Wold, S. (2002). Orthogonal projections to latent structures (O-PLS)

Platform-Related Charts:

  • Model Summary: View overall model indicators
  • Score Plot: Sample distribution visualization
  • Loading Plot: Variable contribution analysis
  • VIP Plot: Variable importance ranking
  • T² Plot: In-model anomaly detection
  • SPE Plot: Out-of-model anomaly detection
  • ROC Curve: Classification model evaluation

💡 Summary

| Model | One-sentence understanding | When to use |
| --- | --- | --- |
| PCA | What does the data look like? | Explore structure, dimensionality reduction, anomaly detection |
| PLS | How does X predict Y? | Regression problems, establishing prediction equations |
| PLS-DA | Which category does it belong to? | Classification problems, discriminant analysis |

Mastering these three models means mastering the core of StarWay Data Insight. Remember: Models are tools, business understanding is the soul. Good analysis = correct model + clean data + deep domain knowledge.

Wishing you a smooth data exploration journey! 🚀

Let data speak, make decisions simpler.