This post introduces dummy coding for categorical variables.
In most situations, a categorical variable cannot be entered directly into a regression model and be meaningfully interpreted. A common way of handling categorical variables in regression is therefore dummy coding: the process of coding a categorical variable into dichotomous variables (Wikiversity).
For example, given a categorical variable with three classes, “faculty”, “staff”, and “student”, dummy variables are created as follows:
|  | dv_1 | dv_2 | dv_3 |
|---|---|---|---|
| faculty | 1 | 0 | 0 | 
| staff | 0 | 1 | 0 | 
| student | 0 | 0 | 1 | 
The categorical variable is dummy coded as three dummy variables: dv_1, dv_2, and dv_3.
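As a quick illustration of that mapping, here is a minimal plain-Python sketch (the `dummy_code` helper is hypothetical, for exposition only):

```python
categories = ['faculty', 'staff', 'student']

def dummy_code(value):
    # one indicator per class: 1 where the value matches, 0 elsewhere
    return [1 if value == c else 0 for c in categories]

for v in categories:
    print(v, dummy_code(v))
# faculty [1, 0, 0]
# staff [0, 1, 0]
# student [0, 0, 1]
```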
Usually, one category is selected as the reference category in the regression to avoid rank deficiency. For example, if “faculty” is chosen as the reference category, the dummy coded variables become:
|  | dv_1 | dv_2 |
|---|---|---|
| faculty | 0 | 0 | 
| staff | 1 | 0 | 
| student | 0 | 1 | 
In pandas, dummy coding is done with `pd.get_dummies`:

```python
# Create a dataframe with a categorical variable ["status": faculty, staff, student]
import pandas as pd
data = pd.DataFrame({'status':['faculty','staff','student']})
dv1 = pd.get_dummies(data)
print(dv1)
# With another categorical variable ["gender": M, F], both columns are dummy coded
data = pd.DataFrame({'gender':['M','F', 'M'], 'status':['faculty','staff','student']})
dv2 = pd.get_dummies(data)
print(dv2)
```
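By default `get_dummies` keeps every level. To obtain the reference-category coding from the second table above, pandas offers a `drop_first` option; a minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({'status': ['faculty', 'staff', 'student']})

# drop_first=True drops the first level ('faculty'), making it the
# reference category and avoiding rank deficiency in a regression
dv = pd.get_dummies(data, drop_first=True, dtype=int)
print(dv)
#    status_staff  status_student
# 0             0               0
# 1             1               0
# 2             0               1
```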
In MATLAB, the built-in `dummyvar` function produces the same coding:

```matlab
% Create a categorical variable ["status": faculty, staff, student]
status = categorical({'faculty'; 'staff'; 'student'});
dv_status = dummyvar(status)
% With another categorical variable ["gender": M, F]
gender = categorical({'M'; 'F'; 'M'});
dv_gender_status = [dummyvar(gender) dummyvar(status)]
```

Running this prints:

```
dv_status =

     1     0     0
     0     1     0
     0     0     1

dv_gender_status =

     0     1     1     0     0
     1     0     0     1     0
     0     1     0     0     1
```

Note that `dummyvar(gender)` orders its columns by category level, so the first two columns of `dv_gender_status` correspond to F and M.
Categorical regression using dummy coding can be done either manually or automatically in MATLAB. Both approaches, shown below, generate the same fitting results.
```matlab
% 4.A Manual dummy coding
load('carsmall')
cars = table(MPG, Weight, Model_Year);
cars.Model_Year = nominal(cars.Model_Year);
dv = dummyvar(cars.Model_Year);
% Keep the dummies for years 76 and 82; year 70 (column 1) is the reference
Model_Year2 = dv(:, 2); Model_Year3 = dv(:, 3);
cars = table(MPG, Weight, Model_Year2, Model_Year3);
fit = fitlm(cars, 'MPG~Weight*Model_Year2 + Weight*Model_Year3')
```

which outputs:

```
fit =

Linear regression model:
    MPG ~ 1 + Weight*Model_Year2 + Weight*Model_Year3

Estimated Coefficients:
                           Estimate          SE         tStat        pValue  
                          ___________    __________    ________    __________
    (Intercept)                37.399        2.1466      17.423    2.8607e-30
    Weight                 -0.0058437    0.00061765     -9.4612    4.6077e-15
    Model_Year2                4.6903        2.8538      1.6435       0.10384
    Model_Year3                21.051         4.157      5.0641    2.2364e-06
    Weight:Model_Year2    -0.00082009    0.00085468    -0.95953       0.33992
    Weight:Model_Year3     -0.0050551     0.0015636     -3.2329     0.0017256
```
```matlab
% 4.B Automatic dummy coding via MATLAB's built-in fitlm
load('carsmall')
cars = table(MPG, Weight, Model_Year);
cars.Model_Year = nominal(cars.Model_Year);
fit = fitlm(cars, 'MPG~Weight*Model_Year')
```

which outputs:

```
fit =

Linear regression model:
    MPG ~ 1 + Weight*Model_Year

Estimated Coefficients:
                             Estimate          SE         tStat        pValue  
                            ___________    __________    ________    __________
    (Intercept)                  37.399        2.1466      17.423    2.8607e-30
    Weight                   -0.0058437    0.00061765     -9.4612    4.6077e-15
    Model_Year_76                4.6903        2.8538      1.6435       0.10384
    Model_Year_82                21.051         4.157      5.0641    2.2364e-06
    Weight:Model_Year_76    -0.00082009    0.00085468    -0.95953       0.33992
    Weight:Model_Year_82     -0.0050551     0.0015636     -3.2329     0.0017256
```
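For comparison, the same automatic dummy coding is available in Python via statsmodels' formula interface. The sketch below uses a small made-up stand-in for the carsmall data, so the numbers are illustrative only:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data standing in for MATLAB's carsmall dataset
cars = pd.DataFrame({
    'MPG':        [18, 15, 18, 34, 28, 24, 36, 31, 30],
    'Weight':     [3504, 3693, 3436, 2130, 2515, 2965, 1980, 2190, 2380],
    'Model_Year': [70, 70, 70, 76, 76, 76, 82, 82, 82],
})

# C() marks Model_Year as categorical; the first level (70) becomes the
# reference category, mirroring fitlm's automatic dummy coding
fit = smf.ols('MPG ~ Weight * C(Model_Year)', data=cars).fit()
print(fit.params)
```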