[AI School] Week 2 - Learning Data Analysis Through Kaggle Problems
GitHub code
Data Description
- Survived: 0 = Dead, 1 = Survived
- pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- cabin: Cabin number
- embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
In [1]:
from IPython.display import display, HTML
# Widen the notebook cells to 90% of the window
display(HTML("<style>.container {width:90% !important;}</style>"))
In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from scipy import stats
from matplotlib import rc
# Use a Korean font in matplotlib
plt.rcParams["font.family"] = 'NanumGothic'
# Render plots inline in the notebook
%matplotlib inline
# ggplot style: the grid makes value ranges easier to read
plt.style.use("ggplot")
# Prevent minus signs from rendering as broken glyphs with the Korean font
mpl.rcParams["axes.unicode_minus"] = False
In [3]:
train = pd.read_csv("./data/train.csv")
In [4]:
test = pd.read_csv("./data/test.csv")
In [5]:
train.head()
Out[5]:
1) Identifying Missing Values
In [6]:
train.info()
In [7]:
train.isnull().sum()
Out[7]:
In [8]:
msno.matrix(train, figsize=(12,6))
Out[8]:
In [9]:
def get_probability(feature):
    # Survival/death rate for each value of a categorical feature
    survived = train[train["Survived"]==1][feature].value_counts()
    dead = train[train["Survived"]==0][feature].value_counts()
    total = survived + dead
    print("Survival rate: \n{} \nDeath rate: \n{}".format(survived/total, dead/total))
In [10]:
features = ['Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']
for feature in features:
    get_probability(feature)
(2) Visualization: bar chart, count plot, FacetGrid
In [11]:
def bar_chart(feature, ax=None):
    survived = train[train['Survived']==1][feature].value_counts()
    dead = train[train['Survived']==0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True, ax=ax)
In [12]:
figure, ((ax1,ax2,ax3), (ax4,ax5,ax6)) = plt.subplots(nrows=2, ncols=3)
figure.set_size_inches(18,12)
bar_chart('Sex', ax1)
bar_chart('Pclass', ax2)
bar_chart('SibSp', ax3)
bar_chart('Parch', ax4)
bar_chart('Embarked', ax5)
ax1.set(title="Survival by sex")
ax2.set(title="Ticket class")
ax3.set(title="Number of siblings/spouses")
ax4.set(title="Number of parents/children")
ax5.set(title="Port of embarkation")
Out[12]:
In [13]:
def count_plot(column, ax):
    sns.countplot(x=column, hue='Survived', data=train, ax=ax)
In [14]:
figure, ((ax1,ax2,ax3), (ax4,ax5,ax6)) = plt.subplots(nrows=2, ncols=3)
figure.set_size_inches(18,12)
count_plot('Sex', ax1)
count_plot('Pclass', ax2)
count_plot('SibSp', ax3)
count_plot('Parch', ax4)
count_plot('Embarked', ax5)
ax1.set(title="Survival by sex")
ax2.set(title="Ticket class")
ax3.set(title="Number of siblings/spouses")
ax4.set(title="Number of parents/children")
ax5.set(title="Port of embarkation")
Out[14]:
In [15]:
# Visualize a continuous column as a KDE plot split by Survived
def draw_facetgrid(feature):
    facet = sns.FacetGrid(train, hue="Survived", aspect=5)
    facet.map(sns.kdeplot, feature, shade=True)
    facet.set(xlim=(0, train[feature].max()))
    # Show the Survived legend
    facet.add_legend()
    plt.show()
In [16]:
draw_facetgrid("Age")
In [17]:
draw_facetgrid("Fare")
Analysis Results from the Visualizations
1. Sex
-> Men clearly died at a much higher rate.
2. Pclass
-> The better the ticket class (1st > 2nd > 3rd), the higher the survival rate.
3, 4. SibSp, Parch
-> Passengers with no siblings/spouses or parents/children aboard were more likely to die.
5. Embarked
-> A larger share of the passengers who embarked at C (Cherbourg) survived.
6. Age
-> Children aged 10 and under were likely to survive, while passengers between their mid-teens and 30 were more likely to die. (Other age ranges show no clear pattern.)
7. Fare
-> The higher the fare, the higher the survival rate.
These rates can also be verified numerically, as in the sketch below.
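A quick check, not part of the original notebook, assuming the train DataFrame loaded above (Sex has not been recoded yet): the mean of the 0/1 Survived column within each group is exactly that group's survival rate.
# Survival rate per group: the mean of a 0/1 column is the share of survivors
print(train.groupby('Sex')['Survived'].mean())
print(train.groupby('Pclass')['Survived'].mean())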
In [18]:
def drop_columns(feature):
    train.drop(feature, axis=1, inplace=True)
    test.drop(feature, axis=1, inplace=True)
In [19]:
drop_columns('Cabin')
In [20]:
drop_columns('Ticket')
2) Feature Generation
(1) Sex
In [21]:
train_test_data = [train, test]
In [22]:
sex_mapping = {"male": 0, "female": 1}
In [23]:
for dataset in train_test_data:
    dataset['Sex'] = dataset['Sex'].map(sex_mapping)
(2) Title
In [24]:
for dataset in train_test_data:
    # Extract the title (the word ending with a period, e.g. "Mr.", "Miss.")
    dataset['Title'] = dataset['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
In [25]:
train["Title"].value_counts()
Out[25]:
In [26]:
drop_columns("Name")
In [27]:
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2,
                 "Master": 0, "Dr": 3, "Rev": 3, "Col": 3, "Major": 3, "Mlle": 3, "Countess": 3,
                 "Ms": 2, "Lady": 2, "Jonkheer": 1, "Don": 3, "Dona": 3, "Mme": 3, "Capt": 3, "Sir": 0}
In [28]:
for dataset in train_test_data:
    dataset['Title'] = dataset['Title'].map(title_mapping)
In [29]:
train.head(1)
Out[29]:
(3) Age
In [31]:
train["Age"].fillna(train.groupby("Title")["Age"].transform("median"), inplace=True)
test["Age"].fillna(test.groupby("Title")["Age"].transform("median"), inplace=True)
In [32]:
train['AgeBand'] = pd.cut(train['Age'], 5)
train[['AgeBand', 'Survived']].groupby('AgeBand', as_index=False).mean().sort_values(by='AgeBand', ascending=True)
Out[32]:
In [33]:
for dataset in train_test_data:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 0.5
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 1.5
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 2
In [34]:
train.drop("AgeBand", axis=1, inplace=True)
In [35]:
train.head(1)
Out[35]:
In [36]:
test.head(1)
Out[36]:
(4) Embarked
In [37]:
Pclass1 = train[train['Pclass']==1]['Embarked'].value_counts()
Pclass2 = train[train['Pclass']==2]['Embarked'].value_counts()
Pclass3 = train[train['Pclass']==3]['Embarked'].value_counts()
df = pd.DataFrame([Pclass1, Pclass2, Pclass3])
df.index = ['1st class', '2nd class', '3rd class']
df.plot(kind='bar', stacked=True, figsize=(10,5))
Out[37]:
In [38]:
for dataset in train_test_data:
    # Fill the missing Embarked values with the most common port, 'S'
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
In [39]:
mapping_data = {"S": 0, "Q": 1, "C": 2}
In [40]:
for dataset in train_test_data:
    dataset["Embarked"] = dataset["Embarked"].map(mapping_data)
In [41]:
train.head(1)
Out[41]:
In [42]:
test.head(1)
Out[42]:
(5) Fare
In [43]:
train["FareBand"] = pd.cut(train["Fare"], 5)
train[["FareBand", "Survived"]].groupby("FareBand", as_index=False).mean().sort_values(by='FareBand', ascending=True)
Out[43]:
In [44]:
for dataset in train_test_data:
    dataset.loc[ dataset['Fare'] <= 102, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 102) & (dataset['Fare'] <= 204), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 204) & (dataset['Fare'] <= 307), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 307, 'Fare'] = 3
In [45]:
train.drop("FareBand", axis=1, inplace=True)
In [46]:
test.isnull().sum()
Out[46]:
In [47]:
test["Fare"].fillna(0.0, inplace=True)
In [48]:
train.head(1)
Out[48]:
In [49]:
test.head(1)
Out[49]:
(6) SibSp, Parch
In [50]:
train["FamilySize"] = train["SibSp"] + train["Parch"] +1
test["FamilySize"] = test["SibSp"] + test["Parch"] +1
facet = sns.FacetGrid(train, hue="Survived", aspect=4)
facet.map(sns.kdeplot,'FamilySize',shade= True)
facet.set(xlim=(0, train['FamilySize'].max()))
facet.add_legend()
plt.xlim(0)
Out[50]:
In [51]:
train["FamilySize"].value_counts()
Out[51]:
In [52]:
test["FamilySize"].value_counts()
Out[52]:
In [53]:
family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
In [54]:
for dataset in train_test_data:
    dataset['FamilySize'] = dataset['FamilySize'].map(family_mapping)
In [55]:
drop_columns("SibSp")
drop_columns("Parch")
In [56]:
train.head(1)
Out[56]:
In [57]:
test.head(1)
Out[57]:
In [58]:
dropped_columns = ["Survived", "PassengerId"]
train_data = train.drop(dropped_columns, axis=1)
Cross Validation (K-fold)
A validation method that splits the data into k parts.
e.g. if k is 10, the model is trained on 9/10 of the data and validated on the remaining 1/10; this is repeated 10 times and the scores are averaged. The toy sketch below illustrates the splits.
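A minimal sketch, not part of the original notebook: across the folds, each sample lands in the validation split exactly once.
# Illustrative only: split 10 toy samples into 5 folds and show the index sets
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, val_idx) in enumerate(kf.split(toy)):
    print("fold", i, "train:", train_idx, "val:", val_idx)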
In [59]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score
In [60]:
target = train["Survived"]
In [61]:
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
1) k-NN
In [62]:
clf = KNeighborsClassifier(n_neighbors=11)
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
print(round(np.mean(score)*100,2))
2) Decision Tree
In [63]:
clf = DecisionTreeClassifier()
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
print(round(np.mean(score)*100,2))
3) Naive Bayes
In [64]:
clf = GaussianNB()
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
print(round(np.mean(score)*100, 2))
4) SVM
In [65]:
clf = SVC()
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
round(np.mean(score)*100,2)
Out[65]:
5) Random Forest
In [66]:
clf = RandomForestClassifier(n_estimators=200)
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
round(np.mean(score)*100,2)
Out[66]:
4. Testing
In [67]:
clf = SVC()
clf.fit(train_data, target)
test_data = test.drop("PassengerId", axis=1).copy()
prediction = clf.predict(test_data)
In [68]:
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": prediction
})
submission.to_csv('./data/submission.csv', index=False)