Multiple Linear Regression







































Multiple linear regression is a statistical technique used to model the relationship between a dependent variable and multiple independent variables. It is an extension of simple linear regression, which models the relationship between a single dependent variable and a single independent variable.


Model Formulation:


The general form of a multiple linear regression model is:


Y = β0 + β1X1 + β2X2 + ... + βnXn + ε



where:


 Y is the dependent variable

 X1, X2, ..., Xn are independent variables

 β0 is the intercept (the value of Y when all independent variables are zero)

 β1, β2, ..., βn are regression coefficients that represent the effect of each independent variable on Y

 ε is the error term, which represents the unexplained variance in Y


Assumptions:


Multiple linear regression assumes that the following conditions are met:


 The relationship between Y and the independent variables is linear.

 The observations, and hence the error terms, are independent of one another.

 The independent variables are not highly correlated with one another (no severe multicollinearity).

 The error term is normally distributed with mean 0 and constant variance (homoscedasticity).

 There are no extreme outliers or unduly influential observations.


Interpretation of Regression Coefficients:


The regression coefficients (β1, β2, ..., βn) indicate the change in Y associated with a one-unit increase in the corresponding independent variable, holding all other variables constant.


 A positive coefficient indicates a positive relationship between the independent variable and Y.

 A negative coefficient indicates a negative relationship.

 The magnitude of the coefficient reflects the strength of the relationship.


Model Estimation and Evaluation:


Multiple linear regression models are estimated by ordinary least squares (OLS): the estimated coefficients are the values that minimize the sum of squared differences between the observed values of Y and the values predicted by the model.
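As a sketch of how OLS estimation works, the snippet below forms the normal equations (XᵀX)β = XᵀY and solves them for a small two-predictor dataset. All data values are invented for illustration; real analyses would use a statistics package, but the estimator is the same.

```python
# A sketch of ordinary least squares (OLS) for Y = b0 + b1*X1 + b2*X2.
# It forms the normal equations (X'X) b = X'Y and solves them with
# Gaussian elimination. All data values are invented for illustration.

def ols_fit(predictors, y):
    # Prepend a constant 1 to each row so b0 (the intercept) is estimated too
    X = [[1.0] + list(row) for row in predictors]
    n, k = len(X), len(X[0])
    # Build X'X (k x k) and X'Y (k x 1)
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    # Forward elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back-substitution
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c]
                                for c in range(r + 1, k))) / xtx[r][r]
    return beta

# Noise-free data generated from Y = 2 + 3*X1 - 1*X2,
# so the fit recovers those coefficients exactly.
predictors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (2, 3)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in predictors]
b0, b1, b2 = ols_fit(predictors, y)
```

Because the toy data contain no noise, the recovered coefficients equal the generating values (2, 3, -1); with real data the estimates only approximate the underlying relationship.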


The model is then evaluated based on its goodness-of-fit, which measures how well the model fits the data. Common goodness-of-fit measures include:


 R-squared: The proportion of variance in Y that is explained by the model.

 Adjusted R-squared: The R-squared value adjusted for the number of independent variables.

 Root mean squared error (RMSE): The square root of the average squared difference between the predicted and observed values of Y.
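The three measures above can be computed directly from observed and predicted values. The numbers below are invented purely to illustrate the formulas:

```python
import math

# R-squared, adjusted R-squared, and RMSE computed from observed values
# and model predictions. The data are made-up numbers for illustration.

def goodness_of_fit(y_obs, y_pred, n_predictors):
    n = len(y_obs)
    mean_y = sum(y_obs) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_obs)              # total sum of squares
    ss_res = sum((y - p) ** 2 for y, p in zip(y_obs, y_pred))   # residual sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes the model for extra predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    return r2, adj_r2, rmse

y_obs = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [3.2, 4.8, 7.1, 8.9, 11.0]
r2, adj_r2, rmse = goodness_of_fit(y_obs, y_pred, n_predictors=2)
```

Here SStot = 40 and SSres = 0.10, giving R² = 0.9975, adjusted R² = 0.995, and RMSE ≈ 0.141.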


Applications:


Multiple linear regression is used in a wide range of applications, including:


 Forecasting future trends

 Identifying factors associated with a particular outcome

 Building predictive models

 Assessing the impact of interventions


 Example 1:

Study Title: Association of Cardiovascular Risk Factors with Cardiovascular Events in a Large Cohort Study


Independent Variables (Predictors):


 Age

 Sex

 Body mass index (BMI)

 Smoking status

 Systolic blood pressure

 Total cholesterol

 HDL cholesterol

 LDL cholesterol

 Triglycerides

 Fasting blood glucose

 Estimated glomerular filtration rate (eGFR)


Dependent Variable (Outcome):


 Cardiovascular events (e.g., heart attack, stroke)


Model:


Cardiovascular events = β0 + β1 × Age + β2 × Sex + ... + β11 × eGFR + ε



where:


 β0 is the intercept

 β1 to β11 are the regression coefficients

 ε is the error term


Interpretation:


Because the outcome (whether a cardiovascular event occurs) is binary, this model is in practice fitted as a multiple logistic regression, the analogue of multiple linear regression for binary outcomes. The regression coefficients (β1 to β11) then represent the change in the log-odds of a cardiovascular event for a one-unit increase in the corresponding independent variable, holding all other variables constant.


Example Results:


Variable          | Regression Coefficient (β) | p-value
----------------- | -------------------------- | --------
Age               | 0.012                      | <0.001
Sex               | 0.154                      | 0.003
BMI               | 0.021                      | <0.001
Smoking           | 0.286                      | <0.001
Systolic BP       | 0.014                      | <0.001
Total cholesterol | 0.003                      | 0.007
HDL cholesterol   | -0.011                     | 0.002
LDL cholesterol   | 0.009                      | 0.004
Triglycerides     | 0.002                      | 0.015
Fasting glucose   | 0.023                      | <0.001
eGFR              | -0.005                     | 0.008



Interpretation:


 For every 1-year increase in age, the log-odds of cardiovascular events increases by 0.012; equivalently, the odds increase by about 1.2% (e^0.012 ≈ 1.012).

 Men have higher log-odds of cardiovascular events than women (β = 0.154).

 A one-unit increase in BMI is associated with an increase of 0.021 in the log-odds, or about a 2.1% increase in the odds of cardiovascular events.

 Smoking is associated with an increase of 0.286 in the log-odds, or about 33% higher odds (e^0.286 ≈ 1.33) compared with nonsmoking.

 Each 1-mmHg increase in systolic blood pressure is associated with an increase of 0.014 in the log-odds, or about a 1.4% increase in the odds.

 Lower HDL cholesterol and higher LDL cholesterol are associated with increased log-odds of cardiovascular events.

 Higher fasting blood glucose and lower eGFR are associated with increased log-odds of cardiovascular events.


Clinical Significance:


The multiple linear regression model provides a comprehensive assessment of the associations between multiple cardiovascular risk factors and cardiovascular events. This information can aid in patient risk stratification, guiding preventive interventions and treatment decisions to reduce the risk of future cardiovascular events.

 Example 2:

Multiple Linear Regression Model for Lung Cancer Risk

Variables:

Dependent Variable: Lung cancer risk (binary: 0 = no cancer, 1 = cancer present)

Independent Variables:

 Age
 Sex (binary: 0 = female, 1 = male)
 Smoking status (binary: 0 = never smoker, 1 = current or former smoker)
 Pack-years of smoking (continuous)
 Radon exposure (binary: 0 = no exposure, 1 = exposure)
 Family history of lung cancer (binary: 0 = no history, 1 = history)
 Body mass index (BMI, continuous)
 Education level (categorical: low, medium, high)

Model Equation:

Lung cancer risk = β0 + β1 × Age + β2 × Sex + β3 × Smoking status + β4 × Pack-years of smoking + β5 × Radon exposure + β6 × Family history of lung cancer + β7 × BMI + β8 × Education level + ε


where:

 β0 is the intercept
 β1-β8 are the regression coefficients for each variable
 ε is the error term

Interpretation:

Because the outcome here is binary, an ordinary linear regression of it is known as a linear probability model: each regression coefficient (β1-β8) represents the change in the predicted probability of lung cancer associated with a one-unit change in the corresponding independent variable, while holding all other variables constant. (Logistic regression is the more common choice for binary outcomes, but the linear form is simpler to interpret.)

For example:

 If the regression coefficient for age is β1 = 0.01, then each additional year of age increases the predicted probability of lung cancer by 0.01, i.e., 1 percentage point.
 If the regression coefficient for smoking status is β3 = 0.5, then current or former smokers have a predicted probability of lung cancer 0.5 (50 percentage points) higher than never smokers.
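The percentage-point arithmetic above can be sketched in code. The coefficient values echo the ones used in this example, but the intercept and the patient's covariate values are assumptions invented for the sketch:

```python
# Linear probability model sketch: with a 0/1 outcome, each coefficient
# shifts the predicted probability by that many units (percentage points).
# Intercept and covariate values are assumed for illustration only.

coef = {
    "intercept": -0.60,        # assumed baseline term (not from the example)
    "age": 0.01,
    "smoking_status": 0.5,
    "pack_years": 0.02,
}

def predicted_risk(age, smoker, pack_years):
    p = (coef["intercept"] + coef["age"] * age
         + coef["smoking_status"] * smoker + coef["pack_years"] * pack_years)
    # A linear probability model can predict outside [0, 1], so clamp it
    return min(1.0, max(0.0, p))

risk_smoker = predicted_risk(age=50, smoker=1, pack_years=10)
risk_never = predicted_risk(age=50, smoker=0, pack_years=10)
```

Holding age and pack-years fixed, the two predictions differ by exactly the smoking coefficient, 0.5, i.e., 50 percentage points.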

Statistical Significance:

The statistical significance of each regression coefficient is tested using a t-test. If the p-value for a coefficient is less than 0.05, then the variable is considered to be a significant predictor of lung cancer risk.
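Since t = β / SE(β), and in large samples |t| ≥ 1.96 corresponds roughly to a two-sided p < 0.05, a simple significance screen can be sketched as below. The standard errors are back-calculated assumptions chosen to reproduce plausible t-values; they are not from a real analysis:

```python
# Large-sample significance screening: t = beta / SE(beta), and
# |t| >= 1.96 is the approximate two-sided 5% threshold.
# Standard errors are assumed values for illustration.

coefficients = {
    "age": (0.01, 0.004),            # (beta, assumed standard error)
    "smoking_status": (0.5, 0.083),
    "bmi": (0.005, 0.0033),
}

significant = {}
for name, (beta, se) in coefficients.items():
    t = beta / se
    significant[name] = abs(t) >= 1.96   # approx. p < 0.05, two-sided
```

With these assumed standard errors, age (t = 2.5) and smoking status (t ≈ 6.0) pass the threshold while BMI (t ≈ 1.5) does not, mirroring the pattern in the example results.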

Model Fit:

The model's fit can be assessed using measures such as the R-squared value and the adjusted R-squared value. The R-squared value represents the proportion of variance in the lung cancer risk that is explained by the independent variables. The adjusted R-squared value adjusts for the number of independent variables in the model.

Example:

Suppose a multiple linear regression analysis of lung cancer risk in 1000 individuals yields the following results:

| Variable | β Coefficient | t-value | p-value |
|---|---|---|---|
| Age | 0.01 | 2.5 | 0.01 |
| Sex | 0.2 | 3.0 | 0.001 |
| Smoking status | 0.5 | 6.0 | < 0.001 |
| Pack-years of smoking | 0.02 | 4.0 | < 0.001 |
| Radon exposure | 0.1 | 2.0 | 0.05 |
| Family history of lung cancer | 0.3 | 4.5 | < 0.001 |
| BMI | 0.005 | 1.5 | 0.1 |
| Education level | -0.1 | -2.0 | 0.05 |

Interpretation:

 Age, sex, smoking status, pack-years of smoking, and family history of lung cancer are significant predictors of lung cancer risk (p < 0.05).
 Radon exposure and education level are borderline predictors (p = 0.05, right at the conventional cutoff).
 BMI is not a significant predictor of lung cancer risk in this model (p = 0.1).
 The model has an R-squared value of 0.45, meaning it explains 45% of the variance in lung cancer risk; the adjusted R-squared of 0.42 corrects that figure for the number of predictors in the model.


Example 3: 

Multiple Linear Regression Model for Hypertension

Dependent Variable: Blood Pressure (systolic or diastolic)

Independent Variables:

 Age
 Gender
 Body mass index (BMI)
 Smoking status (current smoker, former smoker, never smoker)
 Alcohol consumption (drinks per week)
 Physical activity level (hours per week)
 Family history of hypertension
 Diet (sodium intake, potassium intake, etc.)
 Medications (antihypertensive drugs)

Model Equation:

Blood Pressure = β0 + β1 × Age + β2 × Gender + β3 × BMI + β4 × Smoking Status + β5 × Alcohol Consumption + β6 × Physical Activity Level + β7 × Family History + β8 × Diet + β9 × Medications + ε

where:

 β0 is the intercept
 β1-β9 are the regression coefficients for each independent variable
 ε is the error term

Interpretation:

The regression coefficients (β1-β9) indicate the change in blood pressure associated with a one-unit increase in the corresponding independent variable, while holding all other variables constant. For example, β3 represents the change in blood pressure for a one-unit increase in BMI.

A positive coefficient indicates that the independent variable is associated with a higher blood pressure, while a negative coefficient indicates an inverse relationship. The magnitude of the coefficient reflects the strength of the association.

Significance Testing:

Statistical significance tests can be performed to determine whether the independent variables are significantly associated with blood pressure. This involves comparing the observed regression coefficients to a null hypothesis of no effect. A significant p-value indicates that the corresponding independent variable is making a significant contribution to the model.

Example:

A study found the following regression equation for systolic blood pressure:

Systolic BP = 120 + 1.2 × Age + 5.6 × BMI + 10.3 × Smoking Status + 4.2 × Alcohol Consumption - 2.7 × Physical Activity Level + 6.8 × Family History + 1.9 × Sodium Intake + 7.4 × Antihypertensive Medications

This equation suggests that:

 Age, BMI, smoking status, alcohol consumption, family history, and sodium intake are positively associated with systolic blood pressure.
 Physical activity level is inversely associated with systolic blood pressure (coefficient -2.7).
 The positive coefficient for antihypertensive medications (+7.4) should not be read as the drugs raising blood pressure; it more likely reflects confounding by indication, since people with high blood pressure are the ones prescribed these drugs.
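As a sketch, the fitted equation above can be evaluated for a hypothetical patient. The 0/1 coding of the categorical terms and all of the patient values are assumptions of this illustration:

```python
# Plugging hypothetical patient values into the fitted equation from the
# example. The 0/1 coding of smoking, family history, and medication use
# is an assumption of this sketch.

def systolic_bp(age, bmi, smoker, drinks_per_week, activity_hours,
                family_history, sodium_intake, on_antihypertensives):
    return (120 + 1.2 * age + 5.6 * bmi + 10.3 * smoker
            + 4.2 * drinks_per_week - 2.7 * activity_hours
            + 6.8 * family_history + 1.9 * sodium_intake
            + 7.4 * on_antihypertensives)

# Two hypothetical patients identical except for one extra hour of
# weekly physical activity
base = systolic_bp(50, 25, 0, 0, 5, 0, 2, 0)
more_active = systolic_bp(50, 25, 0, 0, 6, 0, 2, 0)
```

Holding everything else fixed, the extra hour of activity lowers the predicted systolic pressure by exactly the coefficient, 2.7 mmHg, which is what "holding all other variables constant" means in practice.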


Example 4:


Multiple Linear Regression Model for Diabetes


Overview:


Multiple linear regression is a statistical technique used to predict a continuous outcome (dependent variable) based on multiple independent variables (predictors). In the context of diabetes, it can be employed to identify the factors associated with the risk or severity of diabetes.


Independent Variables:


 Age

 Sex

 Body mass index (BMI)

 Fasting glucose levels

 Systolic blood pressure

 Diastolic blood pressure

 Total cholesterol

 HDL cholesterol

 Triglycerides



Dependent Variable:


 HbA1c level (%), a continuous measure of long-term glycemic control (a continuous outcome such as this, rather than a binary diabetes status, is what multiple linear regression requires)


Model:


HbA1c = β0 + β1 × Age + β2 × Sex + β3 × BMI + β4 × Fasting Glucose + β5 × Systolic Blood Pressure + β6 × Diastolic Blood Pressure + β7 × Total Cholesterol + β8 × HDL Cholesterol + β9 × Triglycerides



where:


 β0 is the intercept

 β1-β9 are the regression coefficients for each independent variable


Interpretation:


 The intercept (β0) represents the predicted HbA1c level when all independent variables are equal to zero.

 Each regression coefficient (β1-β9) estimates the change in HbA1c for a one-unit increase in the corresponding independent variable, holding other variables constant.

 A positive coefficient indicates a positive association between the independent variable and HbA1c (i.e., an increase in the independent variable is associated with an increase in HbA1c).

 A negative coefficient indicates a negative association (i.e., an increase in the independent variable is associated with a decrease in HbA1c).


Example:


A study of 500 individuals with diabetes reveals the following multiple linear regression model:


HbA1c = 6.5 + 0.07 × Age - 0.2 × Sex + 0.5 × BMI + 0.2 × Fasting Glucose + 0.1 × Systolic Blood Pressure + 0.1 × Diastolic Blood Pressure + 0.05 × Total Cholesterol - 0.1 × HDL Cholesterol + 0.02 × Triglycerides



Interpretation:


(HbA1c is itself reported as a percentage, so the changes below are in percentage points of HbA1c.)

 For every 1-year increase in age, predicted HbA1c increases by 0.07 points.

 Being female is associated with a 0.2-point lower predicted HbA1c.

 For every 1 kg/m² increase in BMI, predicted HbA1c increases by 0.5 points.

 For every 1 mg/dL increase in fasting glucose, predicted HbA1c increases by 0.2 points.

 For every 1 mmHg increase in systolic blood pressure, predicted HbA1c increases by 0.1 points.

 For every 1 mmHg increase in diastolic blood pressure, predicted HbA1c increases by 0.1 points.

 For every 1 mg/dL increase in total cholesterol, predicted HbA1c increases by 0.05 points.

 For every 1 mg/dL increase in HDL cholesterol, predicted HbA1c decreases by 0.1 points.

 For every 1 mg/dL increase in triglycerides, predicted HbA1c increases by 0.02 points.


The statement "being female is associated with a 0.2-point lower HbA1c" follows directly from the fitted equation:


HbA1c = 6.5 + 0.07 × Age - 0.2 × Sex + 0.5 × BMI + 0.2 × Fasting Glucose + 0.1 × Systolic Blood Pressure + 0.1 × Diastolic Blood Pressure + 0.05 × Total Cholesterol - 0.1 × HDL Cholesterol + 0.02 × Triglycerides


In this equation, the coefficient of the Sex variable is -0.2. With female coded as 1 and male coded as 0, moving from male to female lowers the predicted HbA1c by 0.2 percentage points, holding all other variables constant.


Therefore, being female (Sex = 1) is associated with a 0.2-point lower HbA1c compared with being male (Sex = 0).
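This holding-everything-else-constant comparison is easy to check numerically. The sketch below evaluates the fitted equation for two hypothetical patients who are identical except for sex (coded 0 = male, 1 = female, as in the text); all covariate values are invented:

```python
# The fitted HbA1c equation from the example, evaluated for two
# hypothetical patients who differ only in sex (0 = male, 1 = female).
# All covariate values are invented for illustration.

def hba1c(age, sex, bmi, fasting_glucose, sbp, dbp,
          total_chol, hdl, triglycerides):
    return (6.5 + 0.07 * age - 0.2 * sex + 0.5 * bmi
            + 0.2 * fasting_glucose + 0.1 * sbp + 0.1 * dbp
            + 0.05 * total_chol - 0.1 * hdl + 0.02 * triglycerides)

male = hba1c(55, 0, 27, 110, 130, 85, 190, 50, 150)
female = hba1c(55, 1, 27, 110, 130, 85, 190, 50, 150)
```

Whatever the other covariates are, the two predictions always differ by exactly the sex coefficient, 0.2 percentage points, with the female prediction lower.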

Regression Coefficient

In a linear regression model, the regression coefficient is the numerical value that represents the change in the dependent variable for each unit change in the independent variable. It measures the strength and direction of the relationship between the variables.

Types of Regression Coefficients:

 β0 (Intercept): The regression coefficient for the constant term in the model. It represents the value of the dependent variable when all independent variables are equal to zero.
 β1, β2, ..., βn (Slope): The regression coefficients for the independent variables. Each coefficient indicates the slope of the regression line, which represents the change in the dependent variable for each unit change in the corresponding independent variable.

Interpretation:

The sign of the regression coefficient indicates the direction of the relationship:

 Positive: As the independent variable increases, the dependent variable also increases.
 Negative: As the independent variable increases, the dependent variable decreases.

The magnitude of the coefficient represents the strength of the relationship. A larger coefficient indicates a stronger relationship between the independent and dependent variables.

Example:

Consider a simple linear regression model where Income (Y) is the dependent variable and Age (X) is the independent variable:

Y = β0 + β1X

If the regression coefficient for Age (β1) is 0.5, it means that for every one-year increase in age, predicted income increases by 0.5 units.
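The slope can be computed directly with the textbook formula β1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², with the intercept as β0 = ȳ - β1·x̄. The toy income-versus-age data below are constructed so the slope comes out to exactly 0.5:

```python
# Simple-regression slope via the closed-form formula
#   beta1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
# and intercept beta0 = y_mean - beta1 * x_mean.
# Toy data constructed so the slope is exactly 0.5.

ages = [25, 30, 35, 40, 45]
incomes = [20.0, 22.5, 25.0, 27.5, 30.0]   # e.g., in thousands

x_mean = sum(ages) / len(ages)
y_mean = sum(incomes) / len(incomes)
beta1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(ages, incomes))
         / sum((x - x_mean) ** 2 for x in ages))
beta0 = y_mean - beta1 * x_mean
```

Here β1 = 125 / 250 = 0.5 and β0 = 25 - 0.5 × 35 = 7.5, so each additional year of age raises predicted income by 0.5 units.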

Importance:

Regression coefficients are used to:

 Predict the value of the dependent variable for given values of the independent variables.
 Assess the strength and direction of the relationship between variables.
 Test the significance of the independent variables in explaining the variation in the dependent variable.

R-squared (R²), also known as the coefficient of determination, is a statistical measure that quantifies the proportion of variance in the dependent variable that is explained by the independent variable(s) in a regression model.

Formula:

R² = SSreg / SStot

Where:

 SSreg is the regression (explained) sum of squares: the variation of the fitted values around the mean of Y
 SStot is the total sum of squares: the variation of the observed values of Y around their mean

Interpretation:

R² indicates the proportion of variability in the dependent variable that is accounted for by the regression model. It ranges from 0 to 1, with the following interpretations:

 R² = 0: The regression model explains none of the variance in the dependent variable.
 R² = 0.5: The regression model explains 50% of the variance in the dependent variable.
 R² = 1: The regression model explains 100% of the variance in the dependent variable.
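For least squares with an intercept, SStot = SSreg + SSres, so R² = SSreg/SStot and the equivalent form R² = 1 - SSres/SStot give the same number. The sketch below checks this on a small invented dataset:

```python
# Verifying that the two common R-squared formulas agree for an OLS fit
# with an intercept: SSreg / SStot  ==  1 - SSres / SStot.
# Data values are invented for illustration.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

# Exact simple-regression fit (slope and intercept in closed form)
x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)
b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
b0 = y_mean - b1 * x_mean
fitted = [b0 + b1 * xi for xi in x]

ss_tot = sum((yi - y_mean) ** 2 for yi in y)                # total variation
ss_reg = sum((fi - y_mean) ** 2 for fi in fitted)           # explained variation
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual variation

r2_a = ss_reg / ss_tot
r2_b = 1 - ss_res / ss_tot
```

The decomposition SStot = SSreg + SSres holds only for least-squares fits that include an intercept; for other estimators the two formulas can disagree, and the 1 - SSres/SStot form is the safer definition.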

Advantages:

 Provides a measure of goodness of fit for a regression model.
 Helps determine how well the independent variables predict the dependent variable.
 Can be used to compare different regression models or models with varying independent variables.

Limitations:

 Does not indicate the statistical significance of the relationship between variables.
 Can be misleading if the sample size is small.
 Does not adjust for the number of independent variables in the model.

Applications:

R² is widely used in various fields, including:

 Data analysis
 Statistical modeling
 Machine learning
 Regression analysis
 Hypothesis testing


Regression Coefficient vs Coefficient of Determination (R-squared):


Regression Coefficient

 Definition: Measures the change in the dependent variable for a one-unit change in the independent variable, holding all other variables constant.
 Formula: β = (Σ(x - x̄)(y - ȳ)) / Σ(x - x̄)²
 Interpretation:
     Positive coefficient: Increase in independent variable leads to an increase in dependent variable.
     Negative coefficient: Increase in independent variable leads to a decrease in dependent variable.
 Units: Units of the dependent variable per one unit of the independent variable.

Coefficient of Determination (R-squared)

 Definition: Measures the proportion of variance in the dependent variable that is explained by the independent variable(s).
 Formula: R² = 1 - (SSresidual / SStotal)
 Interpretation:
     Value between 0 and 1.
     0: No linear relationship between variables.
     1: Perfect linear relationship between variables.
 Units: Unitless; commonly reported as a percentage.

Key Differences:

 Focus: Regression coefficient measures the individual effect of each independent variable, while R-squared measures the overall goodness of fit of the regression model.
 Formula: The regression coefficient uses only the independent and dependent variables, while R-squared also considers the residuals (unexplained variance).
 Interpretation: The regression coefficient indicates the direction and magnitude of the relationship, while R-squared provides information about the model's predictive power.
 Units: The regression coefficient has units, while R-squared is unitless.

Relationship:

R-squared is affected by the regression coefficients, but it is not a weighted average of them. A high R-squared indicates that the model fits the data well overall; it does not by itself establish that the individual coefficients are statistically significant.









