Multiple Linear Regression







































Multiple linear regression is a statistical technique used to model the relationship between a dependent variable and multiple independent variables. It is an extension of simple linear regression, which models the relationship between a single dependent variable and a single independent variable.


Model Formulation:


The general form of a multiple linear regression model is:


Y = β0 + β1X1 + β2X2 + ... + βnXn + ε



where:


 Y is the dependent variable

 X1, X2, ..., Xn are independent variables

 β0 is the intercept (the value of Y when all independent variables are zero)

 β1, β2, ..., βn are regression coefficients that represent the effect of each independent variable on Y

 ε is the error term, which represents the unexplained variance in Y


Assumptions:


Multiple linear regression assumes that the following conditions are met:


 The relationship between Y and the independent variables is linear.

 The observations, and hence the error terms, are independent of one another.

 The independent variables are not highly correlated with one another (no severe multicollinearity).

 The error term is normally distributed with mean 0 and constant variance (homoscedasticity).

 There are no extreme outliers or unduly influential observations.


Interpretation of Regression Coefficients:


The regression coefficients (β1, β2, ..., βn) indicate the change in Y associated with a one-unit increase in the corresponding independent variable, holding all other variables constant.


 A positive coefficient indicates a positive relationship between the independent variable and Y.

 A negative coefficient indicates a negative relationship.

 The magnitude of the coefficient reflects the strength of the relationship.


Model Estimation and Evaluation:


Multiple linear regression models are estimated by ordinary least squares (OLS): the estimated coefficients are the values that minimize the sum of squared differences between the observed values of Y and the values predicted by the model.
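As a sketch of how OLS estimation works, the snippet below forms the normal equations (XᵀX)β = XᵀY and solves them for a small two-predictor dataset. All data values are invented for illustration; real analyses would use a statistics package, but the estimator is the same.

```python
# A sketch of ordinary least squares (OLS) for Y = b0 + b1*X1 + b2*X2.
# It forms the normal equations (X'X) b = X'Y and solves them with
# Gaussian elimination. All data values are invented for illustration.

def ols_fit(predictors, y):
    # Prepend a constant 1 to each row so b0 (the intercept) is estimated too
    X = [[1.0] + list(row) for row in predictors]
    n, k = len(X), len(X[0])
    # Build X'X (k x k) and X'Y (k x 1)
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    # Forward elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back-substitution
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c]
                                for c in range(r + 1, k))) / xtx[r][r]
    return beta

# Noise-free data generated from Y = 2 + 3*X1 - 1*X2,
# so the fit recovers those coefficients exactly.
predictors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (2, 3)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in predictors]
b0, b1, b2 = ols_fit(predictors, y)
```

Because the toy data contain no noise, the recovered coefficients equal the generating values (2, 3, -1); with real data the estimates only approximate the underlying relationship.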


The model is then evaluated based on its goodness-of-fit, which measures how well the model fits the data. Common goodness-of-fit measures include:


 R-squared: The proportion of variance in Y that is explained by the model.

 Adjusted R-squared: The R-squared value adjusted for the number of independent variables.

 Root mean squared error (RMSE): The square root of the average squared difference between the predicted and observed values of Y.
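The three measures above can be computed directly from observed and predicted values. The numbers below are invented purely to illustrate the formulas:

```python
import math

# R-squared, adjusted R-squared, and RMSE computed from observed values
# and model predictions. The data are made-up numbers for illustration.

def goodness_of_fit(y_obs, y_pred, n_predictors):
    n = len(y_obs)
    mean_y = sum(y_obs) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_obs)              # total sum of squares
    ss_res = sum((y - p) ** 2 for y, p in zip(y_obs, y_pred))   # residual sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes the model for extra predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    return r2, adj_r2, rmse

y_obs = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [3.2, 4.8, 7.1, 8.9, 11.0]
r2, adj_r2, rmse = goodness_of_fit(y_obs, y_pred, n_predictors=2)
```

Here SStot = 40 and SSres = 0.10, giving R² = 0.9975, adjusted R² = 0.995, and RMSE ≈ 0.141.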


Applications:


Multiple linear regression is used in a wide range of applications, including:


 Forecasting future trends

 Identifying factors associated with a particular outcome

 Building predictive models

 Assessing the impact of interventions


 Example 1:

Study Title: Association of Cardiovascular Risk Factors with Cardiovascular Events in a Large Cohort Study


Independent Variables (Predictors):


 Age

 Sex

 Body mass index (BMI)

 Smoking status

 Systolic blood pressure

 Total cholesterol

 HDL cholesterol

 LDL cholesterol

 Triglycerides

 Fasting blood glucose

 Estimated glomerular filtration rate (eGFR)


Dependent Variable (Outcome):


 Cardiovascular events (e.g., heart attack, stroke)


Model:


Cardiovascular events = β0 + β1 × Age + β2 × Sex + ... + β11 × eGFR + ε



where:


 β0 is the intercept

 β1 to β11 are the regression coefficients

 ε is the error term


Interpretation:


Because the outcome (whether a cardiovascular event occurs) is binary, this model is in practice fitted as a multiple logistic regression, the analogue of multiple linear regression for binary outcomes. The regression coefficients (β1 to β11) then represent the change in the log-odds of a cardiovascular event for a one-unit increase in the corresponding independent variable, holding all other variables constant.


Example Results:


Variable          | Regression Coefficient (β) | p-value
----------------- | -------------------------- | --------
Age               | 0.012                      | <0.001
Sex               | 0.154                      | 0.003
BMI               | 0.021                      | <0.001
Smoking           | 0.286                      | <0.001
Systolic BP       | 0.014                      | <0.001
Total cholesterol | 0.003                      | 0.007
HDL cholesterol   | -0.011                     | 0.002
LDL cholesterol   | 0.009                      | 0.004
Triglycerides     | 0.002                      | 0.015
Fasting glucose   | 0.023                      | <0.001
eGFR              | -0.005                     | 0.008



Interpretation:


 For every 1-year increase in age, the log-odds of cardiovascular events increases by 0.012; equivalently, the odds increase by about 1.2% (e^0.012 ≈ 1.012).

 Men have higher log-odds of cardiovascular events than women (β = 0.154).

 A one-unit increase in BMI is associated with an increase of 0.021 in the log-odds, or about a 2.1% increase in the odds of cardiovascular events.

 Smoking is associated with an increase of 0.286 in the log-odds, or about 33% higher odds (e^0.286 ≈ 1.33) compared with nonsmoking.

 Each 1-mmHg increase in systolic blood pressure is associated with an increase of 0.014 in the log-odds, or about a 1.4% increase in the odds.

 Lower HDL cholesterol and higher LDL cholesterol are associated with increased log-odds of cardiovascular events.

 Higher fasting blood glucose and lower eGFR are associated with increased log-odds of cardiovascular events.


Clinical Significance:


The multiple linear regression model provides a comprehensive assessment of the associations between multiple cardiovascular risk factors and cardiovascular events. This information can aid in patient risk stratification, guiding preventive interventions and treatment decisions to reduce the risk of future cardiovascular events.

 Example 2:

Multiple Linear Regression Model for Lung Cancer Risk

Variables:

Dependent Variable: Lung cancer risk (binary: 0 = no cancer, 1 = cancer present)

Independent Variables:

 Age
 Sex (binary: 0 = female, 1 = male)
 Smoking status (binary: 0 = never smoker, 1 = current or former smoker)
 Pack-years of smoking (continuous)
 Radon exposure (binary: 0 = no exposure, 1 = exposure)
 Family history of lung cancer (binary: 0 = no history, 1 = history)
 Body mass index (BMI, continuous)
 Education level (categorical: low, medium, high)

Model Equation:

Lung cancer risk = β0 + β1 × Age + β2 × Sex + β3 × Smoking status + β4 × Pack-years of smoking + β5 × Radon exposure + β6 × Family history of lung cancer + β7 × BMI + β8 × Education level + ε


where:

 β0 is the intercept
 β1-β8 are the regression coefficients for each variable
 ε is the error term

Interpretation:

Because the outcome here is binary, an ordinary linear regression of it is known as a linear probability model: each regression coefficient (β1-β8) represents the change in the predicted probability of lung cancer associated with a one-unit change in the corresponding independent variable, while holding all other variables constant. (Logistic regression is the more common choice for binary outcomes, but the linear form is simpler to interpret.)

For example:

 If the regression coefficient for age is β1 = 0.01, then each additional year of age increases the predicted probability of lung cancer by 0.01, i.e., 1 percentage point.
 If the regression coefficient for smoking status is β3 = 0.5, then current or former smokers have a predicted probability of lung cancer 0.5 (50 percentage points) higher than never smokers.
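The percentage-point arithmetic above can be sketched in code. The coefficient values echo the ones used in this example, but the intercept and the patient's covariate values are assumptions invented for the sketch:

```python
# Linear probability model sketch: with a 0/1 outcome, each coefficient
# shifts the predicted probability by that many units (percentage points).
# Intercept and covariate values are assumed for illustration only.

coef = {
    "intercept": -0.60,        # assumed baseline term (not from the example)
    "age": 0.01,
    "smoking_status": 0.5,
    "pack_years": 0.02,
}

def predicted_risk(age, smoker, pack_years):
    p = (coef["intercept"] + coef["age"] * age
         + coef["smoking_status"] * smoker + coef["pack_years"] * pack_years)
    # A linear probability model can predict outside [0, 1], so clamp it
    return min(1.0, max(0.0, p))

risk_smoker = predicted_risk(age=50, smoker=1, pack_years=10)
risk_never = predicted_risk(age=50, smoker=0, pack_years=10)
```

Holding age and pack-years fixed, the two predictions differ by exactly the smoking coefficient, 0.5, i.e., 50 percentage points.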

Statistical Significance:

The statistical significance of each regression coefficient is tested using a t-test. If the p-value for a coefficient is less than 0.05, then the variable is considered to be a significant predictor of lung cancer risk.
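Since t = β / SE(β), and in large samples |t| ≥ 1.96 corresponds roughly to a two-sided p < 0.05, a simple significance screen can be sketched as below. The standard errors are back-calculated assumptions chosen to reproduce plausible t-values; they are not from a real analysis:

```python
# Large-sample significance screening: t = beta / SE(beta), and
# |t| >= 1.96 is the approximate two-sided 5% threshold.
# Standard errors are assumed values for illustration.

coefficients = {
    "age": (0.01, 0.004),            # (beta, assumed standard error)
    "smoking_status": (0.5, 0.083),
    "bmi": (0.005, 0.0033),
}

significant = {}
for name, (beta, se) in coefficients.items():
    t = beta / se
    significant[name] = abs(t) >= 1.96   # approx. p < 0.05, two-sided
```

With these assumed standard errors, age (t = 2.5) and smoking status (t ≈ 6.0) pass the threshold while BMI (t ≈ 1.5) does not, mirroring the pattern in the example results.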

Model Fit:

The model's fit can be assessed using measures such as the R-squared value and the adjusted R-squared value. The R-squared value represents the proportion of variance in the lung cancer risk that is explained by the independent variables. The adjusted R-squared value adjusts for the number of independent variables in the model.

Example:

Suppose a multiple linear regression analysis of lung cancer risk in 1000 individuals yields the following results:

| Variable | β Coefficient | t-value | p-value |
|---|---|---|---|
| Age | 0.01 | 2.5 | 0.01 |
| Sex | 0.2 | 3.0 | 0.001 |
| Smoking status | 0.5 | 6.0 | < 0.001 |
| Pack-years of smoking | 0.02 | 4.0 | < 0.001 |
| Radon exposure | 0.1 | 2.0 | 0.05 |
| Family history of lung cancer | 0.3 | 4.5 | < 0.001 |
| BMI | 0.005 | 1.5 | 0.1 |
| Education level | -0.1 | -2.0 | 0.05 |

Interpretation:

 Age, sex, smoking status, pack-years of smoking, and family history of lung cancer are significant predictors of lung cancer risk (p < 0.05).
 Radon exposure and education level are borderline predictors (p = 0.05, right at the conventional cutoff).
 BMI is not a significant predictor of lung cancer risk in this model (p = 0.1).
 The model has an R-squared value of 0.45, meaning it explains 45% of the variance in lung cancer risk; the adjusted R-squared of 0.42 corrects that figure for the number of predictors in the model.


Example 3: 

Multiple Linear Regression Model for Hypertension

Dependent Variable: Blood Pressure (systolic or diastolic)

Independent Variables:

 Age
 Gender
 Body mass index (BMI)
 Smoking status (current smoker, former smoker, never smoker)
 Alcohol consumption (drinks per week)
 Physical activity level (hours per week)
 Family history of hypertension
 Diet (sodium intake, potassium intake, etc.)
 Medications (antihypertensive drugs)

Model Equation:

Blood Pressure = β0 + β1 × Age + β2 × Gender + β3 × BMI + β4 × Smoking Status + β5 × Alcohol Consumption + β6 × Physical Activity Level + β7 × Family History + β8 × Diet + β9 × Medications + ε

where:

 β0 is the intercept
 β1-β9 are the regression coefficients for each independent variable
 ε is the error term

Interpretation:

The regression coefficients (β1-β9) indicate the change in blood pressure associated with a one-unit increase in the corresponding independent variable, while holding all other variables constant. For example, β3 represents the change in blood pressure for a one-unit increase in BMI.

A positive coefficient indicates that the independent variable is associated with a higher blood pressure, while a negative coefficient indicates an inverse relationship. The magnitude of the coefficient reflects the strength of the association.

Significance Testing:

Statistical significance tests can be performed to determine whether the independent variables are significantly associated with blood pressure. This involves comparing the observed regression coefficients to a null hypothesis of no effect. A significant p-value indicates that the corresponding independent variable is making a significant contribution to the model.

Example:

A study found the following regression equation for systolic blood pressure:

Systolic BP = 120 + 1.2 × Age + 5.6 × BMI + 10.3 × Smoking Status + 4.2 × Alcohol Consumption - 2.7 × Physical Activity Level + 6.8 × Family History + 1.9 × Sodium Intake + 7.4 × Antihypertensive Medications

This equation suggests that:

 Age, BMI, smoking status, alcohol consumption, family history, and sodium intake are positively associated with systolic blood pressure.
 Physical activity level is inversely associated with systolic blood pressure (coefficient -2.7).
 The positive coefficient for antihypertensive medications (+7.4) should not be read as the drugs raising blood pressure; it more likely reflects confounding by indication, since people with high blood pressure are the ones prescribed these drugs.
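As a sketch, the fitted equation above can be evaluated for a hypothetical patient. The 0/1 coding of the categorical terms and all of the patient values are assumptions of this illustration:

```python
# Plugging hypothetical patient values into the fitted equation from the
# example. The 0/1 coding of smoking, family history, and medication use
# is an assumption of this sketch.

def systolic_bp(age, bmi, smoker, drinks_per_week, activity_hours,
                family_history, sodium_intake, on_antihypertensives):
    return (120 + 1.2 * age + 5.6 * bmi + 10.3 * smoker
            + 4.2 * drinks_per_week - 2.7 * activity_hours
            + 6.8 * family_history + 1.9 * sodium_intake
            + 7.4 * on_antihypertensives)

# Two hypothetical patients identical except for one extra hour of
# weekly physical activity
base = systolic_bp(50, 25, 0, 0, 5, 0, 2, 0)
more_active = systolic_bp(50, 25, 0, 0, 6, 0, 2, 0)
```

Holding everything else fixed, the extra hour of activity lowers the predicted systolic pressure by exactly the coefficient, 2.7 mmHg, which is what "holding all other variables constant" means in practice.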


Example 4:


Multiple Linear Regression Model for Diabetes


Overview:


Multiple linear regression is a statistical technique used to predict a continuous outcome (dependent variable) based on multiple independent variables (predictors). In the context of diabetes, it can be employed to identify the factors associated with the risk or severity of diabetes.


Independent Variables:


 Age

 Sex

 Body mass index (BMI)

 Fasting glucose levels

 Systolic blood pressure

 Diastolic blood pressure

 Total cholesterol

 HDL cholesterol

 Triglycerides



Dependent Variable:


 HbA1c level (%), a continuous measure of long-term glycemic control (a continuous outcome such as this, rather than a binary diabetes status, is what multiple linear regression requires)


Model:


HbA1c = β0 + β1 × Age + β2 × Sex + β3 × BMI + β4 × Fasting Glucose + β5 × Systolic Blood Pressure + β6 × Diastolic Blood Pressure + β7 × Total Cholesterol + β8 × HDL Cholesterol + β9 × Triglycerides



where:


 β0 is the intercept

 β1-β9 are the regression coefficients for each independent variable


Interpretation:


 The intercept (β0) represents the predicted HbA1c level when all independent variables are equal to zero.

 Each regression coefficient (β1-β9) estimates the change in HbA1c for a one-unit increase in the corresponding independent variable, holding other variables constant.

 A positive coefficient indicates a positive association between the independent variable and HbA1c (i.e., an increase in the independent variable is associated with an increase in HbA1c).

 A negative coefficient indicates a negative association (i.e., an increase in the independent variable is associated with a decrease in HbA1c).


Example:


A study of 500 individuals with diabetes reveals the following multiple linear regression model:


HbA1c = 6.5 + 0.07 × Age - 0.2 × Sex + 0.5 × BMI + 0.2 × Fasting Glucose + 0.1 × Systolic Blood Pressure + 0.1 × Diastolic Blood Pressure + 0.05 × Total Cholesterol - 0.1 × HDL Cholesterol + 0.02 × Triglycerides



Interpretation:


(HbA1c is itself reported as a percentage, so the changes below are in percentage points of HbA1c.)

 For every 1-year increase in age, predicted HbA1c increases by 0.07 points.

 Being female is associated with a 0.2-point lower predicted HbA1c.

 For every 1 kg/m² increase in BMI, predicted HbA1c increases by 0.5 points.

 For every 1 mg/dL increase in fasting glucose, predicted HbA1c increases by 0.2 points.

 For every 1 mmHg increase in systolic blood pressure, predicted HbA1c increases by 0.1 points.

 For every 1 mmHg increase in diastolic blood pressure, predicted HbA1c increases by 0.1 points.

 For every 1 mg/dL increase in total cholesterol, predicted HbA1c increases by 0.05 points.

 For every 1 mg/dL increase in HDL cholesterol, predicted HbA1c decreases by 0.1 points.

 For every 1 mg/dL increase in triglycerides, predicted HbA1c increases by 0.02 points.


The statement "being female is associated with a 0.2-point lower HbA1c" follows directly from the fitted equation:


HbA1c = 6.5 + 0.07 × Age - 0.2 × Sex + 0.5 × BMI + 0.2 × Fasting Glucose + 0.1 × Systolic Blood Pressure + 0.1 × Diastolic Blood Pressure + 0.05 × Total Cholesterol - 0.1 × HDL Cholesterol + 0.02 × Triglycerides


In this equation, the coefficient of the Sex variable is -0.2. With female coded as 1 and male coded as 0, moving from male to female lowers the predicted HbA1c by 0.2 percentage points, holding all other variables constant.


Therefore, being female (Sex = 1) is associated with a 0.2-point lower HbA1c compared with being male (Sex = 0).
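This holding-everything-else-constant comparison is easy to check numerically. The sketch below evaluates the fitted equation for two hypothetical patients who are identical except for sex (coded 0 = male, 1 = female, as in the text); all covariate values are invented:

```python
# The fitted HbA1c equation from the example, evaluated for two
# hypothetical patients who differ only in sex (0 = male, 1 = female).
# All covariate values are invented for illustration.

def hba1c(age, sex, bmi, fasting_glucose, sbp, dbp,
          total_chol, hdl, triglycerides):
    return (6.5 + 0.07 * age - 0.2 * sex + 0.5 * bmi
            + 0.2 * fasting_glucose + 0.1 * sbp + 0.1 * dbp
            + 0.05 * total_chol - 0.1 * hdl + 0.02 * triglycerides)

male = hba1c(55, 0, 27, 110, 130, 85, 190, 50, 150)
female = hba1c(55, 1, 27, 110, 130, 85, 190, 50, 150)
```

Whatever the other covariates are, the two predictions always differ by exactly the sex coefficient, 0.2 percentage points, with the female prediction lower.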

Regression Coefficient

In a linear regression model, the regression coefficient is the numerical value that represents the change in the dependent variable for each unit change in the independent variable. It measures the strength and direction of the relationship between the variables.

Types of Regression Coefficients:

 β0 (Intercept): The regression coefficient for the constant term in the model. It represents the value of the dependent variable when all independent variables are equal to zero.
 β1, β2, ..., βn (Slope): The regression coefficients for the independent variables. Each coefficient indicates the slope of the regression line, which represents the change in the dependent variable for each unit change in the corresponding independent variable.

Interpretation:

The sign of the regression coefficient indicates the direction of the relationship:

 Positive: As the independent variable increases, the dependent variable also increases.
 Negative: As the independent variable increases, the dependent variable decreases.

The magnitude of the coefficient represents the strength of the relationship. A larger coefficient indicates a stronger relationship between the independent and dependent variables.

Example:

Consider a simple linear regression model where Income (Y) is the dependent variable and Age (X) is the independent variable:

Y = β0 + β1X

If the regression coefficient for Age (β1) is 0.5, it means that for every one-year increase in age, predicted income increases by 0.5 units.
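The slope can be computed directly with the textbook formula β1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², with the intercept as β0 = ȳ - β1·x̄. The toy income-versus-age data below are constructed so the slope comes out to exactly 0.5:

```python
# Simple-regression slope via the closed-form formula
#   beta1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
# and intercept beta0 = y_mean - beta1 * x_mean.
# Toy data constructed so the slope is exactly 0.5.

ages = [25, 30, 35, 40, 45]
incomes = [20.0, 22.5, 25.0, 27.5, 30.0]   # e.g., in thousands

x_mean = sum(ages) / len(ages)
y_mean = sum(incomes) / len(incomes)
beta1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(ages, incomes))
         / sum((x - x_mean) ** 2 for x in ages))
beta0 = y_mean - beta1 * x_mean
```

Here β1 = 125 / 250 = 0.5 and β0 = 25 - 0.5 × 35 = 7.5, so each additional year of age raises predicted income by 0.5 units.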

Importance:

Regression coefficients are used to:

 Predict the value of the dependent variable for given values of the independent variables.
 Assess the strength and direction of the relationship between variables.
 Test the significance of the independent variables in explaining the variation in the dependent variable.

R-squared (R²), also known as the coefficient of determination, is a statistical measure that quantifies the proportion of variance in the dependent variable that is explained by the independent variable(s) in a regression model.

Formula:

R² = SSreg / SStot

Where:

 SSreg is the regression (explained) sum of squares: the variation of the fitted values around the mean of Y
 SStot is the total sum of squares: the variation of the observed values of Y around their mean

Interpretation:

R² indicates the proportion of variability in the dependent variable that is accounted for by the regression model. It ranges from 0 to 1, with the following interpretations:

 R² = 0: The regression model explains none of the variance in the dependent variable.
 R² = 0.5: The regression model explains 50% of the variance in the dependent variable.
 R² = 1: The regression model explains 100% of the variance in the dependent variable.
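For least squares with an intercept, SStot = SSreg + SSres, so R² = SSreg/SStot and the equivalent form R² = 1 - SSres/SStot give the same number. The sketch below checks this on a small invented dataset:

```python
# Verifying that the two common R-squared formulas agree for an OLS fit
# with an intercept: SSreg / SStot  ==  1 - SSres / SStot.
# Data values are invented for illustration.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

# Exact simple-regression fit (slope and intercept in closed form)
x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)
b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
b0 = y_mean - b1 * x_mean
fitted = [b0 + b1 * xi for xi in x]

ss_tot = sum((yi - y_mean) ** 2 for yi in y)                # total variation
ss_reg = sum((fi - y_mean) ** 2 for fi in fitted)           # explained variation
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual variation

r2_a = ss_reg / ss_tot
r2_b = 1 - ss_res / ss_tot
```

The decomposition SStot = SSreg + SSres holds only for least-squares fits that include an intercept; for other estimators the two formulas can disagree, and the 1 - SSres/SStot form is the safer definition.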

Advantages:

 Provides a measure of goodness of fit for a regression model.
 Helps determine how well the independent variables predict the dependent variable.
 Can be used to compare different regression models or models with varying independent variables.

Limitations:

 Does not indicate the statistical significance of the relationship between variables.
 Can be misleading if the sample size is small.
 Does not adjust for the number of independent variables in the model.

Applications:

R² is widely used in various fields, including:

 Data analysis
 Statistical modeling
 Machine learning
 Regression analysis
 Hypothesis testing


Regression Coefficient vs Coefficient of Determination (R-squared):


Regression Coefficient

 Definition: Measures the change in the dependent variable for a one-unit change in the independent variable, holding all other variables constant.
 Formula: β = (Σ(x - x̄)(y - ȳ)) / Σ(x - x̄)²
 Interpretation:
     Positive coefficient: Increase in independent variable leads to an increase in dependent variable.
     Negative coefficient: Increase in independent variable leads to a decrease in dependent variable.
 Units: Units of the dependent variable per one unit of the independent variable.

Coefficient of Determination (R-squared)

 Definition: Measures the proportion of variance in the dependent variable that is explained by the independent variable(s).
 Formula: R² = 1 - (SSresidual / SStotal)
 Interpretation:
     Value between 0 and 1.
     0: No linear relationship between variables.
     1: Perfect linear relationship between variables.
 Units: Unitless; commonly reported as a percentage.

Key Differences:

 Focus: Regression coefficient measures the individual effect of each independent variable, while R-squared measures the overall goodness of fit of the regression model.
 Formula: The regression coefficient uses only the independent and dependent variables, while R-squared also considers the residuals (unexplained variance).
 Interpretation: The regression coefficient indicates the direction and magnitude of the relationship, while R-squared provides information about the model's predictive power.
 Units: The regression coefficient has units, while R-squared is unitless.

Relationship:

R-squared is affected by the regression coefficients, but it is not a weighted average of them. A high R-squared indicates that the model fits the data well overall; it does not by itself establish that the individual coefficients are statistically significant.









