Why does feature selection with the Spearman correlation coefficient perform worse than expected?
When I filter the features using the Spearman correlation coefficient and keep only those with correlation above 0.3, the result is worse than expected: MAPE = 0.0975. But if I use all the features, the model reaches MAPE = 0.06. I would like to ask the teacher why this happens, or whether there is anything I should improve.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import HalvingRandomSearchCV as HRSCV
from sklearn.ensemble import RandomForestRegressor
from eda_module import data_cleaning
from eda_module import feature_engineering
from eda_module import k_means_binning
from eda_module import regression_report
from eda_module import view_miss_data
from eda_module import view_discrete_data
from eda_module import view_continual_data
from eda_module import dection_datatype
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings("ignore")

# Load data, then clean and engineer features
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
total = pd.concat([train, test], axis=0)
total = data_cleaning(train, test)
total = feature_engineering(total)

# Split columns by dtype
int_features = []
float_features = []
object_features = []
for dtype, feature in zip(total.dtypes, total.columns):
    if dtype == 'float64':
        float_features.append(feature)
    elif dtype == 'int64':
        int_features.append(feature)
    else:  # dtype == 'object'
        object_features.append(feature)

# Keep numeric features whose Spearman correlation with SalePrice exceeds |0.3|
c_col = int_features + float_features
corr_df = total[c_col].corr(method='spearman')['SalePrice'].reset_index()
fs_col = [i for i in corr_df[(corr_df['SalePrice'] > 0.3) |
                             (corr_df['SalePrice'] < -0.3)]['index'].values
          if i != 'SalePrice']

# Add a KMeans cluster label as an extra feature
km = KMeans(n_clusters=10)
y_pred = km.fit_predict(total)
total['data_kmean'] = y_pred

data_y = total['SalePrice']
ids = total['Id']
total.drop(columns=['Id', 'SalePrice'], inplace=True)

final_cols = fs_col + object_features
fs_total = total[final_cols]
# fs_total = total

# First 1095 rows are the training set, the rest are held out for validation
X_train = fs_total[:1095]
y_train = data_y[:1095]
X_test = fs_total[1095:]
y_test = data_y[1095:]

RF = RandomForestRegressor()
RF.fit(X_train, y_train)
val_pred = RF.predict(X_test)
report_number_list = regression_report(y_test, val_pred, True)
```
Answers
- 2021/08/27 01:31 PM · 王健安
Hi Ting,

Let me start with the Spearman correlation coefficient. Like the Pearson correlation coefficient, Spearman measures how strongly two variables are related, but Spearman is a rank-based measure suited to ordinal data, for example rankings, while Pearson measures linear correlation between continuous variables, for example sales amounts. Since SalePrice and most of your numeric features are continuous, I would suggest using the Pearson correlation coefficient when choosing a correlation method:

```python
corr_df = total[c_col].corr(method='pearson')['SalePrice'].reset_index()
```

Also, the more features the model sees, the more information it has for identifying the target, so performance will generally be better; the only question is by how much. In practice, however, you need to balance model performance against computation time, which is why we pick out the variables most directly related to the target, and why various feature selection methods are used to find the features most correlated with the target variable. In addition, some features have no linear correlation with the target but still follow a non-linear pattern, for example a parabolic one; such features can also be considered highly related and are worth keeping.

Feel free to reply if you have further questions, thanks.
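On the non-linear point, here is a minimal sketch (not from the original answer) of ranking features by mutual information, which also picks up non-linear patterns such as parabolic ones. It uses scikit-learn's `mutual_info_regression`, reuses `total`, `c_col`, and the 1095-row training split from the question's code, and the 0.05 cutoff is only an illustrative assumption, not a recommended value.

```python
from sklearn.feature_selection import mutual_info_regression
import pandas as pd

# Use only the rows that still have a SalePrice label (the 1095-row split
# follows the question's code).
train_rows = total.iloc[:1095]
X_num = train_rows[[c for c in c_col if c != 'SalePrice']].fillna(0)
y = train_rows['SalePrice']

# Mutual information captures any dependence, linear or not
mi = mutual_info_regression(X_num, y, random_state=0)
mi_scores = pd.Series(mi, index=X_num.columns).sort_values(ascending=False)
print(mi_scores.head(10))

# Keep features above an (assumed) cutoff instead of a correlation threshold
mi_col = mi_scores[mi_scores > 0.05].index.tolist()
```

You could compare the features selected this way against `fs_col` to see whether the correlation filter is dropping features that carry non-linear information about SalePrice.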