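Over the past few days I took part in Alibaba's Tianchi big data competition. The preliminary round has ended, and I missed the next round by one percentage point, which is a pity. Still, for a first attempt at this kind of competition, finishing 144th out of 2522 seems acceptable to me. More importantly, I learned a great deal in the 4-5 days I spent on it, so here is a record of how it went.

Competition: 天池精準醫療大賽 (Tianchi Precision Medicine Competition), AI-assisted genetic risk prediction for diabetes

1. Data processing

The most immediate thing you get from the competition page is the test set provided by the organizers. Looking through it, I found a very large number of missing values. I did the data processing in *python*, mainly using *pandas* to inspect the data.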

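    import pandas as pd

    train = pd.read_csv(".......csv")  # training set
    test = pd.read_csv(".......csv")   # test set
    print(train.columns)

    data = pd.concat([train, test], axis=0)
    print(data.isnull().sum() / len(data))  # fraction of missing values per column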
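
To make the missing-value check concrete, here is a minimal sketch on a made-up toy table. The column names `a`, `b`, `c` and their values are purely illustrative and are not the competition columns. It computes the missing fraction per column, sorts it so the worst columns come first, and then fills the gaps with the column median, which is the same imputation the baseline code later in this post applies.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the competition table (hypothetical columns).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 8.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values per column, largest first.
missing_ratio = (toy.isnull().sum() / len(toy)).sort_values(ascending=False)
print(missing_ratio)

# Fill every remaining gap with the median of its own column.
filled = toy.fillna(toy.median(axis=0, numeric_only=True))
print(filled)
```

Sorting the ratios makes it easy to see at a glance which columns are mostly empty before deciding how to treat them.
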
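2. Model selection

For this competition I used the baseline LightGBM. I first tried a neural network (NN), but the results were not very good. The code is as follows:

```python
import datetime
import time

import lightgbm as lgb
import numpy as np
import pandas as pd
from dateutil.parser import parse
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# `train` and `test` are the DataFrames loaded in the data processing step.


def make_feat(train, test):
    """Build features on the concatenated train and test sets."""
    train_id = train.id.values.copy()
    test_id = test.id.values.copy()
    data = pd.concat([train, test])

    # Encode sex ('性别') as 0/1 and the exam date ('体检日期') as a day offset.
    data['性别'] = data['性别'].map({'男': 1, '女': 0, '??': 1})
    data['体检日期'] = (pd.to_datetime(data['体检日期']) - parse('2017-10-09')).dt.days

    # Fill missing values with the column median.
    data = data.fillna(data.median(axis=0, numeric_only=True))

    train_feat = data[data.id.isin(train_id)]
    test_feat = data[data.id.isin(test_id)]
    return train_feat, test_feat


train_feat, test_feat = make_feat(train, test)

# Every column except the target '血糖' (blood glucose) is a predictor.
predictors = [f for f in test_feat.columns if f not in ['血糖']]


def evalerror(pred, df):
    """Competition metric: half of the mean squared error (lower is better)."""
    label = df.get_label()
    score = mean_squared_error(label, pred) * 0.5
    return '0.5mse', score, False


print('Start training...')
params = {
    'learning_rate': 0.01,
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mse',
    'sub_feature': 0.7,
    'num_leaves': 60,
    'colsample_bytree': 0.7,
    'min_data': 100,
    'min_hessian': 1,
    'verbose': -1,
}

print('Start 5-fold CV training...')
t0 = time.time()
train_preds = np.zeros(train_feat.shape[0])
test_preds = np.zeros((test_feat.shape[0], 5))
kf = KFold(n_splits=5, shuffle=True, random_state=520)
for i, (train_index, test_index) in enumerate(kf.split(train_feat)):
    print('Training fold {}...'.format(i))
    train_feat1 = train_feat.iloc[train_index]
    train_feat2 = train_feat.iloc[test_index]
    lgb_train1 = lgb.Dataset(train_feat1[predictors], train_feat1['血糖'])  # , categorical_feature=['性别'])
    lgb_train2 = lgb.Dataset(train_feat2[predictors], train_feat2['血糖'])
    gbm = lgb.train(
        params,
        lgb_train1,
        num_boost_round=3000,
        valid_sets=[lgb_train2],
        feval=evalerror,
        # Stop after 100 rounds without improvement; log every 100 rounds.
        callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)],
    )
    feat_imp = pd.Series(gbm.feature_importance(), index=predictors).sort_values(ascending=False)
    # Out-of-fold predictions for the offline score; one test prediction per fold.
    train_preds[test_index] += gbm.predict(train_feat2[predictors])
    test_preds[:, i] = gbm.predict(test_feat[predictors])

print('Offline score: {}'.format(mean_squared_error(train_feat['血糖'], train_preds) * 0.5))
print('CV training took {} seconds'.format(time.time() - t0))

# Average the five fold models and write the submission file.
submission = pd.DataFrame({'pred': test_preds.mean(axis=1)})
submission.to_csv(
    r'sub{}.csv'.format(datetime.datetime.now().strftime('%Y%m%d%H%M%S')),
    header=False, index=False, float_format='%.4f',
)
```

3. Outlook

I now have some understanding of missing-data imputation and feature extraction. Not making the semifinal is frustrating, though, so I will keep working at it and try again next time.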
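
As a side note on the cross-validation loop in the training code above, the out-of-fold bookkeeping can be isolated in a few lines. The sketch below is an illustration under stated assumptions, not the competition pipeline: the data is synthetic, a plain least-squares fit (`np.linalg.lstsq`) stands in for LightGBM, and the fold split is done by hand with NumPy. It shows that every training row is predicted exactly once by a model that never saw it, and that the test prediction is the mean over the fold models.

```python
import numpy as np

rng = np.random.default_rng(520)
n_train, n_test, n_folds = 20, 5, 5

# Synthetic regression data (assumption: purely illustrative).
X_train = rng.normal(size=(n_train, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n_train)
X_test = rng.normal(size=(n_test, 3))

# Shuffle the row indices once and cut them into n_folds validation folds.
folds = np.array_split(rng.permutation(n_train), n_folds)

oof_preds = np.zeros(n_train)               # one out-of-fold prediction per training row
test_preds = np.zeros((n_test, n_folds))    # one column of test predictions per fold
times_predicted = np.zeros(n_train, dtype=int)

for i, valid_idx in enumerate(folds):
    train_idx = np.setdiff1d(np.arange(n_train), valid_idx)
    # Stand-in model: ordinary least squares fitted on the other folds.
    coef, *_ = np.linalg.lstsq(X_train[train_idx], y_train[train_idx], rcond=None)
    oof_preds[valid_idx] = X_train[valid_idx] @ coef
    times_predicted[valid_idx] += 1
    test_preds[:, i] = X_test @ coef

# Offline score in the same form as the competition metric: 0.5 * MSE.
offline_score = 0.5 * np.mean((y_train - oof_preds) ** 2)
final_test_pred = test_preds.mean(axis=1)
print(offline_score, final_test_pred)
```

Because the offline score is computed only from out-of-fold predictions, it gives a less optimistic estimate than scoring a model on the rows it was trained on.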