Meanimputing¶

Missing Data - Mean Imputing¶

It is recommened to read determineMissingValues before starting.

Methods for Handling Missing Values: Mean Imputing¶

The Idea here is to get the average value of certain column to fill the missing values with it.

df['A'] = df.A.fillna(df.A.mean())

But In case that, the average value is depending on other certain categorical values, So it is recommended to calculate each mean for each unique categorical value

Step#1: Calculate the average value per categorical feature¶

# Calculate the average value for feature numeical_col 'A' based on each categorical_col unique value 'B'.
fill_value = (
    df.groupby('B')
    .agg({'A':'mean'})
    # In case, it's required to have integer values only
    .round()
    # Convert the dataframe into dictionary with list values
    .T.to_dict('list')
)

display(fill_value)

The output should be like:

{'Apartment': [965.0],
 'Cabin': [688.0],
 'Chalet': [1207.0],
 'Clinic': [1159.0],
 'Duplex': [1127.0],
 'Family House': [228.0],
 'Office': [1121.0],
 'Penthouse': [989.0],
 'Retail': [1023.0],
 'Serviced Apartment': [823.0],
 'Studio': [887.0],
 'Townhouse': [1201.0],
 'Twinhouse': [1151.0],
 'Villa': [1093.0]}

Step#2: Impute the missing values based on that dictionary¶

df['A'] = (
    df.A
    .fillna(
        df.B.apply(lambda x: fill_value.get(x)[0])
        )
)