
Analyzing the Titanic Data - 3 (Modeling and Evaluation)

Stat_in_KNU 2020. 6. 14. 22:29

[Preparation for the ADP exam and data-analysis presentation interviews]

 

The Titanic dataset, which I have run into countless times since my undergraduate days. Instead of analyzing it haphazardly, let's follow a Kaggle kernel step by step.

 

Reference kernels

- EDA To Prediction (DieTanic): https://www.kaggle.com/ash316/eda-to-prediction-dietanic

- Titanic Data Science Solutions: https://www.kaggle.com/startupsci/titanic-data-science-solutions

Dataset overview

The sinking of the Titanic is one of the most infamous disasters in history. On April 15, 1912, during her maiden voyage, the Titanic collided with an iceberg and 1,502 of the 2,224 people aboard were killed.

Building the Titanic cost about 7.5 million dollars, an enormous sum whether measured at the time or converted to today's value. Nowadays the Titanic dataset is widely used by data analysts and data scientists as a machine learning classification dataset.

 

 

Variable descriptions

survival Survival (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Variables that are a bit harder to interpret

Sibsp : the number of siblings/spouses the passenger had aboard

Parch : the number of parents/children the passenger had aboard

Cabin : cabin number

Embarked : port of embarkation

 

Analysis process

1. EDA

2. Data preprocessing (Feature Engineering, Data Cleaning)

3. Data modeling (ML modeling)

4. Model evaluation

 


3. Data Modeling

The baseline models we'll use

 

1) Logistic Regression

2) SVM(linear, rbf kernel)

3) Random Forest

4) KNN

5) Naive Bayes

6) Decision Tree

 

Import the required packages

from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix
from sklearn.metrics import accuracy_score

Splitting the data

train_X, test_X, train_Y, test_Y = train_test_split(data.drop('Survived', axis=1), data.Survived,
                                                    test_size=0.3, random_state=0, stratify=data['Survived'])

Something I didn't know to watch out for

When you pass both the features and the labels to train_test_split, it returns four objects (train_X, test_X, train_Y, test_Y),

but when you pass only the features it returns just two (train and test).
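
To make this concrete, here is a small sketch on a toy array (not the Titanic data) showing both return signatures:

import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_toy = np.array([0, 1] * 5)           # toy binary labels

# Features and labels together -> four return values
tr_X, te_X, tr_y, te_y = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# Features only -> two return values
tr, te = train_test_split(X_toy, test_size=0.3, random_state=0)

print(tr_X.shape, te_X.shape, tr_y.shape, te_y.shape)  # (7, 2) (3, 2) (7,) (3,)
print(tr.shape, te.shape)                              # (7, 2) (3, 2)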

 

Let's also prepare the full feature matrix and label vector in advance for use in cross validation later.

X = data.drop('Survived', axis = 1)
Y = data['Survived']

 

Now let's fit each of the ML models.

 

Radial SVM

model = svm.SVC(kernel = 'rbf', C=1, gamma = 0.1)
model.fit(train_X, train_Y)
prediction1 = model.predict(test_X)
print('Accuracy for rbf SVM is %.2f%%'%(accuracy_score(prediction1, test_Y)*100))

# Accuracy for rbf SVM is 83.58%

 

Linear SVM

model = svm.SVC(kernel='linear', C = 0.1, gamma = 0.1)
model.fit(train_X, train_Y)
prediction2 = model.predict(test_X)
print("Accuracy for linear SVM is %.2f%%"%((accuracy_score(prediction2, test_Y))*100))

# Accuracy for linear SVM is 81.72%

 

Logistic Regression

model = LogisticRegression()
model.fit(train_X, train_Y)
prediction3 = model.predict(test_X)
print("Accuracy for Logistic Regression is %.2f%%"%((accuracy_score(prediction3, test_Y))*100))

# Accuracy for Logistic Regression is 81.72%

 

Decision Tree

model = DecisionTreeClassifier()
model.fit(train_X, train_Y)
prediction4 = model.predict(test_X)
print("Accuracy for Decision Tree is %.2f%%"%((accuracy_score(prediction4, test_Y))*100))

# Accuracy for Decision Tree is 80.22%

 

K - Nearest Neighbors

 

model = KNeighborsClassifier()  # default n_neighbors = 5
model.fit(train_X, train_Y)
prediction5 = model.predict(test_X)
print("Accuracy for KNN is %.2f%%"%((accuracy_score(prediction5, test_Y))*100))

# Accuracy for KNN is 83.21%

 

Let's see how the accuracy of the KNN model changes as the parameter k varies.

 

a_index = list(range(1, 11))
accuracies = []
for i in a_index:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(train_X, train_Y)
    prediction = model.predict(test_X)
    accuracies.append(accuracy_score(prediction, test_Y))
a = pd.Series(accuracies, index=a_index)
plt.plot(a_index, a)
plt.xticks(range(0, 11))
plt.show()
print('Accuracies for different values of n are : ', a.values, 'with the max value as', a.values.max())

#Accuracies for different values of n are : [0.75746269 0.79104478 0.80970149 0.80223881 0.83208955 0.81716418 0.82835821 0.83208955 0.8358209 0.83208955] with the max value as 0.835820895522388

 

Accuracy is highest when K = 9.

 

Gaussian Naive Bayes

model = GaussianNB()
model.fit(train_X, train_Y)
prediction6 = model.predict(test_X)
print("Accuracy for Gaussian Naive Bayes is %.2f%%"%((accuracy_score(prediction6, test_Y))*100))

# Accuracy for Gaussian Naive Bayes is 81.34%

 

Random Forests

model = RandomForestClassifier()
model.fit(train_X, train_Y)
prediction7 = model.predict(test_X)
print("Accuracy for Random Forests is %.2f%%"%((accuracy_score(prediction7, test_Y))*100))

# Accuracy for Random Forests is 82.09%

 

 

Is the analysis finished once we've fit the models like this? No.

The model's performance should not swing heavily depending on how the train/test split is made (the bias/variance trade-off).

In other words, the model needs to generalize!
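
As a quick illustration (not in the original post), we can refit the same rbf SVM on a few different random splits of the preprocessed `data` and watch how much the test accuracy moves around; the exact numbers depend on your data:

# Minimal sketch: how much does the rbf-SVM test accuracy vary across different splits?
# (Illustrative only; assumes the preprocessed `data` DataFrame from the previous post.)
import numpy as np

scores = []
for seed in range(5):
    tr_X, te_X, tr_Y, te_Y = train_test_split(
        data.drop('Survived', axis=1), data.Survived,
        test_size=0.3, random_state=seed, stratify=data['Survived'])
    m = svm.SVC(kernel='rbf', C=1, gamma=0.1)
    m.fit(tr_X, tr_Y)
    scores.append(accuracy_score(m.predict(te_X), te_Y))
print('Accuracy over 5 splits: mean %.3f, std %.3f' % (np.mean(scores), np.std(scores)))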

To evaluate this more systematically, let's use cross validation.


Cross Validation

 

In classification problems we often run into class imbalance.

So the data used for training and testing should cover every part of the dataset.

With cross validation we can use the entire dataset and average the resulting accuracies. (A stratified variant is sketched after the list below.)

 

1. K-Fold CV works by first dividing the dataset into k subsets.

2. Say we split with k = 5. Then one fold is held out as the test set while the other four folds are used for training.

3. We continue the process by rotating which fold is used for testing; the accuracies and errors can then be averaged over all the runs.

4. An algorithm may underfit some training sets and overfit others, so with CV we can obtain a more generalized model.
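
Since class imbalance was mentioned above, one variant worth knowing (the code below simply uses KFold, as in the reference kernel) is StratifiedKFold, which keeps the ratio of Survived roughly the same in every fold. A minimal sketch, assuming the X and Y defined earlier:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# StratifiedKFold preserves the class ratio of `Survived` in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=22)
stratified_scores = cross_val_score(svm.SVC(kernel='rbf'), X, Y, cv=skf, scoring='accuracy')
print('Stratified 10-fold mean accuracy: %.4f' % stratified_scores.mean())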

 

from sklearn.model_selection import KFold # for K-fold cross validation
from sklearn.model_selection import cross_val_score # score evaluation
from sklearn.model_selection import cross_val_predict # prediction
kfold = KFold(n_splits=10, shuffle=True, random_state=22) # k = 10; recent scikit-learn requires shuffle=True when random_state is set
xyz = []
accuracy = []
std = []
classifiers = ['Linear Svm', 'Radial Svm', 'Logistic Regression',
               'KNN', 'Decision Tree', 'Naive Bayes', 'Random Forest']
models = [svm.SVC(kernel='linear'), svm.SVC(kernel='rbf'), LogisticRegression(),
          KNeighborsClassifier(n_neighbors=9), DecisionTreeClassifier(), GaussianNB(),
          RandomForestClassifier(n_estimators=100)]
for model in models:
    cv_result = cross_val_score(model, data.drop('Survived', axis=1), data.Survived, cv=kfold, scoring='accuracy')
    xyz.append(cv_result.mean())
    std.append(cv_result.std())
    accuracy.append(cv_result)
new_models_dataframe2 = pd.DataFrame({'CV Mean': xyz, 'Std': std}, index=classifiers)
new_models_dataframe2

 

Mean and standard deviation of the cross-validated accuracy for each algorithm

plt.subplots(figsize=(12,6))
box = pd.DataFrame(accuracy, index = [classifiers])
box.T.boxplot()

 

new_models_dataframe2['CV Mean'].plot.barh(width=0.8)
plt.title('Average CV Mean Accuracy')
fig = plt.gcf()
fig.set_size_inches(8,5)
plt.show()

 

 

# Confusion matrices for each model, built from 10-fold cross-validated predictions on the full data
f,ax=plt.subplots(3,3,figsize=(12,10))
y_pred = cross_val_predict(svm.SVC(kernel='rbf'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,0],annot=True,fmt='2.0f')
ax[0,0].set_title('Matrix for rbf-SVM')
y_pred = cross_val_predict(svm.SVC(kernel='linear'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,1],annot=True,fmt='2.0f')
ax[0,1].set_title('Matrix for Linear-SVM')
y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,2],annot=True,fmt='2.0f')
ax[0,2].set_title('Matrix for KNN')
y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,0],annot=True,fmt='2.0f')
ax[1,0].set_title('Matrix for Random-Forests')
y_pred = cross_val_predict(LogisticRegression(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,1],annot=True,fmt='2.0f')
ax[1,1].set_title('Matrix for Logistic Regression')
y_pred = cross_val_predict(DecisionTreeClassifier(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,2],annot=True,fmt='2.0f')
ax[1,2].set_title('Matrix for Decision Tree')
y_pred = cross_val_predict(GaussianNB(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[2,0],annot=True,fmt='2.0f')
ax[2,0].set_title('Matrix for Naive Bayes')
plt.subplots_adjust(hspace=0.2,wspace=0.2)
plt.show()

Looking at the results, the radial SVM and the random forest appear to perform best.

Let's run hyperparameter tuning on these two models.

 


Hyper-Parameter Tuning

 

Using GridSearchCV, let's find the optimal parameters and print them together with the corresponding accuracy.

SVM

from sklearn.model_selection import GridSearchCV
C=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
gamma = [round(0.1*i,1) for i in range(1, 11)]

kernel = ['rbf', 'linear']
hyper = {'kernel':kernel, 'C':C, 'gamma':gamma}
gd = GridSearchCV(estimator=svm.SVC(), param_grid=hyper, verbose = True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)

# 0.8282828282828283

# SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

 

 

Random Forest

n_estimators = range(100, 1000, 100)
hyper = {'n_estimators' : n_estimators}
gd = GridSearchCV(estimator=RandomForestClassifier(random_state = 2020), param_grid = hyper, verbose = True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)

# 0.8159371492704826

# RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=None, oob_score=False, random_state=2020, verbose=0, warm_start=False)

 

 


Ensembling

 

1. Voting Classifier

2. Bagging

3. Boosting

 

Voting

from sklearn.ensemble import VotingClassifier
ensemble_lin_rbf = VotingClassifier(
    estimators=[('KNN', KNeighborsClassifier(n_neighbors=10)),
                ('RBF', svm.SVC(kernel='rbf', probability=True, C=0.5, gamma=0.1)),
                ('RF', RandomForestClassifier(n_estimators=800, random_state=0)),
                ('LR', LogisticRegression(C=0.05)),
                ('DT', DecisionTreeClassifier(random_state=2020)),
                ('NB', GaussianNB()),
                ('svm', svm.SVC(kernel='linear', probability=True))],
    voting='soft').fit(train_X, train_Y)


print('The accuracy of ensembled model is : ', ensemble_lin_rbf.score(test_X, test_Y))
cross = cross_val_score(ensemble_lin_rbf, X, Y, cv = 10, scoring = 'accuracy')
print('The cross validated score is', cross.mean())

# The accuracy of ensembled model is : 0.8246268656716418

# The cross validated score is 0.8237535467029848

The ensemble gives a very stable and high accuracy.
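
The ensemble above uses voting='soft', which averages the estimators' predicted class probabilities (hence probability=True on the SVMs). For comparison, here is a sketch of the 'hard' majority-vote variant; its score was not reported in the original post and may differ slightly:

# Minimal sketch: same estimators, but 'hard' (majority-vote) voting.
# probability=True is not needed for hard voting.
ensemble_hard = VotingClassifier(
    estimators=[('KNN', KNeighborsClassifier(n_neighbors=10)),
                ('RBF', svm.SVC(kernel='rbf', C=0.5, gamma=0.1)),
                ('RF', RandomForestClassifier(n_estimators=800, random_state=0)),
                ('LR', LogisticRegression(C=0.05)),
                ('DT', DecisionTreeClassifier(random_state=2020)),
                ('NB', GaussianNB()),
                ('svm', svm.SVC(kernel='linear'))],
    voting='hard').fit(train_X, train_Y)
print('The accuracy of the hard-voting ensemble is :', ensemble_hard.score(test_X, test_Y))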

 

Bagging

 

- KNN + Bagging

from sklearn.ensemble import BaggingClassifier
model = BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors = 3), random_state=2020, n_estimators=700)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged KNN is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged KNN is:',result.mean())

# The accuracy for bagged KNN is: 0.8246268656716418

# The cross validated score for bagged KNN is: 0.813765747361253

 

- Decision Tree + Bagging

model = BaggingClassifier(base_estimator=DecisionTreeClassifier(), random_state=2020, n_estimators = 100)
model.fit(train_X, train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged Decision Tree is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged Decision Tree is:',result.mean())

# The accuracy for bagged Decision Tree is: 0.8208955223880597

# The cross validated score for bagged Decision Tree is: 0.8148771422086029

 

 

Boosting

 

1. AdaBoost

from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.1)
result=cross_val_score(ada,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for AdaBoost is:',result.mean())

# The cross validated score for AdaBoost is: 0.8249526160481218

 

2. Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier
grad=GradientBoostingClassifier(n_estimators=500,random_state=0,learning_rate=0.1)
result=cross_val_score(grad,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for Gradient Boosting is:',result.mean())

# The cross validated score for Gradient Boosting is: 0.8182862331176939

 

3. XGBoost

import xgboost as xg
xgboost=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
result=cross_val_score(xgboost,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for XGBoost is:',result.mean())

# The cross validated score for XGBoost is: 0.8104710021563954

 

- Hyperparameter tuning for AdaBoost, which performed best even without any separate parameter tuning

n_estimators=list(range(100,1100,100))
learn_rate=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
hyper={'n_estimators':n_estimators,'learning_rate':learn_rate}
gd=GridSearchCV(estimator=AdaBoostClassifier(),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)

# 0.8316498316498316

# AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=0.05, n_estimators=200, random_state=None)

We choose AdaBoost with the parameters above as our final model!

 

Confusion matrix of the final model

ada = AdaBoostClassifier(n_estimators=200, random_state=0, learning_rate=0.05)
result = cross_val_predict(ada, X, Y, cv = 10)
sns.heatmap(confusion_matrix(Y,result), cmap = 'winter', annot = True, fmt = '2.0f')
plt.show()
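
Accuracy alone does not show how the errors are split between the two classes. As an extra (optional) evaluation step not in the original post, we can turn the same cross-validated predictions into per-class precision, recall, and F1:

from sklearn.metrics import classification_report

# Per-class precision / recall / F1 from the same 10-fold cross-validated predictions.
print(classification_report(Y, result, target_names=['Not Survived', 'Survived']))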

Let's draw feature importance plots with the models that can expose feature importances (RF, AdaBoost, Gradient Boosting, XGBoost).

# Feature importance bar plots for the four models, each fit on the full data
f,ax=plt.subplots(2,2,figsize=(15,12))
model=RandomForestClassifier(n_estimators=500,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,0])
ax[0,0].set_title('Feature Importance in Random Forests')
model=AdaBoostClassifier(n_estimators=200,learning_rate=0.05,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,1],color='#ddff11')
ax[0,1].set_title('Feature Importance in AdaBoost')
model=GradientBoostingClassifier(n_estimators=500,learning_rate=0.1,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,0],cmap='RdYlGn_r')
ax[1,0].set_title('Feature Importance in Gradient Boosting')
model=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,1],color='#FD0F00')
ax[1,1].set_title('Feature Importance in XgBoost')
plt.show()

 

 

- The features that generally look important are Initial, Fare_cat, Pclass, and Family_Size.

 

- Unlike in the EDA, the Sex column shows very low feature importance in every model except the random forest; this appears to be due to multicollinearity between Sex and the Initial column.

 

- Likewise, Pclass and Fare_cat seem to show low feature importance because of multicollinearity with variables such as SibSp, Parch, and Alone.
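
To sanity-check these multicollinearity claims, a quick sketch is to look at the pairwise correlations of the encoded features; this assumes every column of `data` is already numeric, as it was after the preprocessing post:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlation of the encoded features (assumes all columns of `data` are numeric).
plt.figure(figsize=(12, 10))
sns.heatmap(data.corr(), annot=True, cmap='RdYlGn', linewidths=0.2, fmt='.2f')
plt.title('Correlation between features')
plt.show()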