Titanic : Machine Learning from Disaster

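Link: GitHub source code

Question

  • You are asked to build a predictive model that answers the question "What sorts of people were more likely to survive?" using passenger data (name, age, sex, socio-economic class, etc.).

1. Importing packages and datasets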

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
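  • Important: in a Kaggle notebook, remove the leading '.' from the path passed to pd.read_csv, i.e. use "/kaggle/input/titanic/train.csv" instead of "./kaggle/input/titanic/train.csv"
  • Both the training set and the test set need to be loaded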
train = pd.read_csv("./kaggle/input/titanic/train.csv")
test = pd.read_csv("./kaggle/input/titanic/test.csv")
allData = pd.concat([train, test], ignore_index=True)
# dataNum = train.shape[0]
# featureNum = train.shape[1]
train.info()
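
To make the merge step concrete, here is a minimal sketch with two tiny made-up frames (toy_train and toy_test are hypothetical stand-ins for the Kaggle CSV files). It shows what pd.concat with ignore_index=True produces: the index is renumbered, and the test rows get NaN in Survived, which is what later allows the combined frame to be split again.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv
toy_train = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Age": [22.0, 38.0, 26.0]}
)
toy_test = pd.DataFrame({"PassengerId": [4, 5], "Age": [34.5, 47.0]})

# Stack the two frames; ignore_index=True renumbers the rows 0..n-1
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)

# Rows that came from the test frame have no Survived value
n_missing = int(toy_all["Survived"].isnull().sum())
recovered_test = toy_all[toy_all["Survived"].isnull()].reset_index(drop=True)

print(toy_all)
print("rows without a label:", n_missing)
```

Because of the NaN values, the Survived column of the combined frame is stored as float rather than int.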

2. Data overview

Summary

  • Running train.info() prints overall information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
  • Running train.head() shows sample rows

Features

| Variable | Definition | Key |
| :-: | :-: | :-: |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class (cabin class) | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic (collateral relatives) | |
| parch | # of parents / children aboard the Titanic (direct relatives) | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

3. Visual data analysis

Sex feature: Sex

  • Women survived at a far higher rate than men
# Sex
sns.countplot(x='Sex', hue='Survived', data=train)
plt.show()

Passenger class feature: Pclass

  • The higher the passenger class, the higher the survival rate
# Pclass
sns.barplot(x='Pclass', y="Survived", data=train)
plt.show()

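Family size feature

  • FamilySize = SibSp + Parch + 1 (the relatives aboard plus the passenger)

  • Passengers with a moderate family size have a higher survival rate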
# FamilySize = SibSp + Parch + 1
allData['FamilySize'] = allData['SibSp'] + allData['Parch'] + 1
sns.barplot(x='FamilySize', y='Survived', data=allData)
plt.show()
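
By default sns.barplot draws the mean of y for each x group, so with a 0/1 Survived column each bar height is that group's survival rate. The sketch below reproduces that calculation in plain Python on a handful of made-up records (the records list is purely illustrative and is not taken from the dataset).

```python
from collections import defaultdict

# Made-up (SibSp, Parch, Survived) records, for illustration only
records = [
    (0, 0, 0), (0, 0, 1), (1, 0, 1), (1, 1, 1),
    (1, 2, 1), (4, 2, 0), (0, 0, 0), (1, 0, 0),
]

survived_by_size = defaultdict(list)
for sibsp, parch, survived in records:
    family_size = sibsp + parch + 1  # the passenger counts as one member
    survived_by_size[family_size].append(survived)

# Mean of the 0/1 labels per group = survival rate per family size
survival_rate = {
    size: sum(labels) / len(labels)
    for size, labels in sorted(survived_by_size.items())
}
print(survival_rate)
```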

Port of embarkation feature: Embarked

  • Survival rate differs by port of embarkation
# Embarked
sns.countplot(x='Embarked', hue='Survived', data=train)
plt.show()

Age feature: Age

  • Young children and adults in their prime have a higher survival rate
# Age
sns.stripplot(x="Survived", y="Age", data=train, jitter=True)
plt.show()

  • Survival density by age
facet = sns.FacetGrid(train, hue="Survived",aspect=2)
facet.map(sns.kdeplot,'Age',fill=True)  # fill replaces the deprecated shade argument
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.xlabel('Age') 
plt.ylabel('density') 
plt.show()

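  • Children have a markedly different survival rate from the other age groups
  • The author treats passengers aged 10 and under as children and gives them a separate label

Fare feature: Fare

  • The higher the fare, the higher the survival rate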
# Fare
sns.stripplot(x="Survived", y="Fare", data=train, jitter=True)
plt.show()

Name feature: Name

Title feature: Title

  • Titles are taken from the honorific in each name and grouped into categories
# Name
allData['Title'] = allData['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
pd.crosstab(allData['Title'], allData['Sex'])
  • Count the passengers in each group
TitleClassification = {
    'Officer': ['Capt', 'Col', 'Major', 'Dr', 'Rev'],
    'Royalty': ['Don', 'Sir', 'the Countess', 'Dona', 'Lady'],
    'Mrs': ['Mme', 'Ms', 'Mrs'],
    'Miss': ['Mlle', 'Miss'],
    'Mr': ['Mr'],
    'Master': ['Master', 'Jonkheer'],
}
titleCnt = allData.groupby(['Title']).size()  # passengers per raw title
for title in TitleClassification.keys():
    cnt = 0
    for name in TitleClassification[title]:
        cnt += titleCnt[name]
    print(title, ':', cnt)
  • Assign the labels
TitleMap = {}
for title in TitleClassification.keys():
    TitleMap.update(dict.fromkeys(TitleClassification[title], title))
allData['Title'] = allData['Title'].map(TitleMap)
  • Survival rate differs by title
sns.barplot(x="Title", y="Survived", data=allData)
plt.show()
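
The title extraction relies on names following the "Surname, Title. Given names" pattern. The sketch below runs the same split logic on three sample names in that format and then applies the grouping; extract_title, title_groups and title_map are illustrative names, not part of the original notebook.

```python
def extract_title(name):
    """Return the honorific between the comma and the first period."""
    return name.split(",")[1].split(".")[0].strip()


# Same grouping as TitleClassification in the text
title_groups = {
    "Officer": ["Capt", "Col", "Major", "Dr", "Rev"],
    "Royalty": ["Don", "Sir", "the Countess", "Dona", "Lady"],
    "Mrs": ["Mme", "Ms", "Mrs"],
    "Miss": ["Mlle", "Miss"],
    "Mr": ["Mr"],
    "Master": ["Master", "Jonkheer"],
}
# Invert the grouping: raw title -> group label
title_map = {raw: group for group, raws in title_groups.items() for raw in raws}

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
raw_titles = [extract_title(n) for n in names]
grouped = [title_map.get(t) for t in raw_titles]  # None if a title is not listed
print(raw_titles, grouped)
```

In the pandas version, a title missing from the dictionary would come out of Series.map as NaN, so it is worth checking that the mapped column has no missing values.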

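Ticket number feature: Ticket

  • Passengers who share a ticket number with a few others (i.e. travelled as a group) have a higher survival rate
# Ticket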
TicketCnt = allData.groupby(['Ticket']).size()
allData['SameTicketNum'] = allData['Ticket'].apply(lambda x:TicketCnt[x])
sns.barplot(x='SameTicketNum', y='Survived', data=allData)
plt.show()
# allData['SameTicketNum']
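
SameTicketNum is simply, for every passenger, the number of rows that carry the same ticket string. A minimal plain-Python sketch of that counting step, using a made-up ticket list:

```python
from collections import Counter

# Made-up ticket numbers; repeated values mean passengers travelling on one ticket
tickets = ["A/5 21171", "PC 17599", "113803", "113803", "347742", "347742", "347742"]

ticket_count = Counter(tickets)                       # ticket -> number of holders
same_ticket_num = [ticket_count[t] for t in tickets]  # one value per passenger
print(same_ticket_num)
```

In pandas, a vectorized way to get the same column would likely be allData.groupby('Ticket')['Ticket'].transform('count'), which avoids the per-row lambda lookup.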

Two-dimensional / multi-dimensional analysis

  • Any two or more features can be analyzed together

Two-dimensional analysis: Pclass & Age

# Pclass & Age
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=train, split=True)
plt.show()

Two-dimensional analysis: Age & Sex

# Age & Sex
sns.swarmplot(x='Age', y="Sex", data=train, hue='Survived')
plt.show()

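4. Data cleaning & handling anomalies

Discrete data

Labels available --> One-Hot encoding

  • Sex, Pclass and Embarked already carry usable labels (int, float, string, etc.), so get_dummies can be applied to them directly, splitting each into several indicator columns and increasing the feature dimension
  • Embarked has a few missing values, which are filled with an estimate based on an analysis of the whole dataset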
# Sex
allData = allData.join(pd.get_dummies(allData['Sex'], prefix="Sex"))
# Pclass
allData = allData.join(pd.get_dummies(allData['Pclass'], prefix="Pclass"))
# Embarked
allData[allData['Embarked'].isnull()] # inspect the rows with a missing value
allData.groupby(by=['Pclass','Embarked']).Fare.median() # Pclass=1, Embarked=C: median ≈ 76
allData['Embarked'] = allData['Embarked'].fillna('C')
allData = allData.join(pd.get_dummies(allData['Embarked'], prefix="Embarked"))
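
As a small illustration of the fill-then-encode step, the sketch below applies the same two calls to a four-row toy frame (the toy data is made up). Each category becomes its own indicator column, and every row has exactly one indicator set.

```python
import pandas as pd

# Toy frame with one missing port, for illustration only
toy = pd.DataFrame({"Embarked": ["S", "C", None, "Q"]})

toy["Embarked"] = toy["Embarked"].fillna("C")            # fill the gap first
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")
toy = toy.join(dummies)                                   # append indicator columns

print(toy)
```

Depending on the pandas version the indicator columns may be boolean rather than 0/1 integers; casting with astype(int) gives a uniform numeric form if that matters downstream.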

No labels available --> design labels --> One-Hot

  • FamilySize, Name and Ticket first have to be processed over the whole dataset and are then labelled
# FamilySize
def FamilyLabel(s):
    if (s == 4):
        return 4
    elif (s == 2 or s == 3):
        return 3
    elif (s == 1 or s == 7):
        return 2
    elif (s == 5 or s == 6):
        return 1
    elif (s < 1 or s > 7):
        return 0
allData['FamilyLabel'] = allData['FamilySize'].apply(FamilyLabel)
allData = allData.join(pd.get_dummies(allData['FamilyLabel'], prefix="Fam"))

# Name
TitleLabelMap = {'Mr':1.0,'Mrs':5.0,'Miss':4.5,'Master':2.5,'Royalty':3.5,'Officer':2.0}
def TitleLabel(s):
    return TitleLabelMap[s]
# allData['TitleLabel'] = allData['Title'].apply(TitleLabel)
allData = allData.join(pd.get_dummies(allData['Title'], prefix="Title"))

# Ticket
def TicketLabel(s):
    if (s == 3 or s == 4):
        return 3
    elif (s == 2 or s == 8):
        return 2
    elif (s == 1 or s == 5 or s == 6 or s == 7):
        return 1
    elif (s < 1 or s > 8):
        return 0
allData['TicketLabel'] = allData['SameTicketNum'].apply(TicketLabel)
allData = allData.join(pd.get_dummies(allData['TicketLabel'], prefix="TicNum"))
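
The if/elif chains above are lookup tables in disguise. As one possible alternative, the sketch below expresses the family-size binning as a dictionary with a default; FAMILY_LABELS and family_label are illustrative names, and the equivalence is only claimed for integer sizes.

```python
# Bin boundaries copied from FamilyLabel in the text
FAMILY_LABELS = {4: 4, 2: 3, 3: 3, 1: 2, 7: 2, 5: 1, 6: 1}


def family_label(size):
    """Map an integer family size to its label; sizes outside 1..7 get 0."""
    return FAMILY_LABELS.get(size, 0)


def family_label_original(s):
    """Transcription of the if/elif version, kept here for comparison."""
    if s == 4:
        return 4
    elif s == 2 or s == 3:
        return 3
    elif s == 1 or s == 7:
        return 2
    elif s == 5 or s == 6:
        return 1
    elif s < 1 or s > 7:
        return 0


labels = [family_label(size) for size in range(0, 12)]
print(labels)
```

With pandas the dictionary form could be applied as allData['FamilySize'].map(FAMILY_LABELS).fillna(0), under the same integer-size assumption.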

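Continuous data

Age & Fare

  • Standardize these features to narrow their range and speed up gradient descent (relevant for gradient-based models; the random forest used below does not depend on feature scale)
# Age
allData['Child'] = allData['Age'].apply(lambda x:1 if x <= 10 else 0) # child flag (missing ages get 0)
allData['Age'] = (allData['Age']-allData['Age'].mean())/allData['Age'].std() # standardize
allData['Age'] = allData['Age'].fillna(0) # fill missing values with 0, i.e. the mean after standardization
# Fare
allData['Fare'] = allData['Fare'].fillna(25) # fill the missing value
# Cap extreme fares in the training rows; .loc is needed because chained indexing assigns to a copy
isTrain = allData['Survived'].notnull()
allData.loc[isTrain & (allData['Fare'] > 500), 'Fare'] = 300.0
allData['Fare'] = (allData['Fare']-allData['Fare'].mean())/allData['Fare'].std() # standardize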
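
Standardization here means subtracting the mean and dividing by the standard deviation, z = (x - mean) / std. The plain-Python sketch below applies that to a short made-up age list with one missing value, and shows why filling missing standardized values with 0 amounts to mean imputation. pandas' Series.std uses the sample standard deviation (n - 1) by default, which is what statistics.stdev computes.

```python
import math
import statistics

# Made-up ages with one missing entry, for illustration only
ages = [22.0, 38.0, float("nan"), 35.0, 4.0]

known = [a for a in ages if not math.isnan(a)]
mean = statistics.mean(known)   # missing values are skipped, as pandas does
std = statistics.stdev(known)   # sample standard deviation (n - 1)

# z-score for known ages; a missing age becomes 0, i.e. "exactly average"
standardized = [0.0 if math.isnan(a) else (a - mean) / std for a in ages]
print([round(z, 3) for z in standardized])
```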

Removing unused features

  • Drop the features that are no longer needed to reduce the complexity of the algorithm
# drop unused features
allData.drop(['Cabin', 'PassengerId', 'Ticket', 'Name', 'Title', 'Sex', 'SibSp', 'Parch', 'FamilySize', 'Embarked', 'Pclass', 'FamilyLabel', 'SameTicketNum', 'TicketLabel'], axis=1, inplace=True)

Splitting back into training and test sets

  • For convenience the author merged the training and test sets at the start; they are now separated again according to whether Survived is missing
# split the dataset again
train_data = allData[allData['Survived'].notnull()]
test_data  = allData[allData['Survived'].isnull()]
test_data = test_data.reset_index(drop=True)

xTrain = train_data.drop(['Survived'], axis=1)
yTrain = train_data['Survived']
xTest  = test_data.drop(['Survived'], axis=1)

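Feature correlation analysis

  • After the features have been selected, this step gives the programmer feedback on whether they are effective and whether they overlap
  • If problems show up, the earlier feature design can be revised
# correlation between features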
Correlation = pd.DataFrame(allData[allData.columns.to_list()])
colormap = plt.cm.viridis
plt.figure(figsize=(24,22))
sns.heatmap(Correlation.astype(float).corr(), linewidths=0.1, vmax=1.0, cmap=colormap, linecolor='white', annot=True, square=True)
plt.show()
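
Each cell of the heatmap is a Pearson correlation coefficient (the default of DataFrame.corr). The sketch below computes it by hand on made-up columns to show what "overlap" looks like: the two one-hot columns of a binary feature are perfectly anti-correlated, so one of them carries no extra information. The pearson helper and the sample lists are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up columns: two one-hot indicators of one binary feature, plus a label
sex_female = [1, 0, 1, 0, 1, 0]
sex_male = [1 - v for v in sex_female]
survived = [1, 0, 1, 0, 0, 1]

r_overlap = pearson(sex_female, sex_male)
r_label = pearson(sex_female, survived)
print(r_overlap, r_label)
```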

5. Model building & parameter tuning

Importing the model packages

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
  • The author chooses a random forest classifier

Tuning parameters with grid search

pipe = Pipeline([('select', SelectKBest(k=10)),('classify', RandomForestClassifier(random_state = 10, max_features = 'sqrt'))])
param_test = {'classify__n_estimators':list(range(20,100,5)),'classify__max_depth'   :list(range(3,10,1))}
gsearch = GridSearchCV(estimator=pipe, param_grid=param_test, scoring='roc_auc', cv=10)
gsearch.fit(xTrain, yTrain)
print (gsearch.best_params_, gsearch.best_score_)
  • This takes a long time to run; when it finishes it prints:
{'classify__max_depth': 6, 'classify__n_estimators': 70} 0.8790924679681529
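
The long run time follows directly from the size of the grid. This short sketch counts the candidate combinations and cross-validation fits implied by param_test and cv=10; it only does the arithmetic and trains nothing.

```python
from itertools import product

# Same ranges as param_test in the text
n_estimators = list(range(20, 100, 5))
max_depth = list(range(3, 10, 1))
cv = 10

combos = list(product(n_estimators, max_depth))  # every parameter pair tried
total_fits = len(combos) * cv                    # one fit per pair per fold

print(len(n_estimators), len(max_depth), len(combos), total_fits)
```

So the search trains 1120 forests during cross-validation, plus one final refit on the full training set when GridSearchCV's default refit=True is left in place.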

Building the model

  • Plug the parameters found above into the model
  • Train
rfc = RandomForestClassifier(n_estimators=70, max_depth=6, random_state=10, max_features='sqrt')
rfc.fit(xTrain, yTrain)
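
Note that the grid search scored a pipeline that includes SelectKBest(k=10), whereas the model above is fit on all columns. One possible way to keep the two consistent is to reuse the search object's best_estimator_, which is the whole pipeline refit with the best parameters. The sketch below shows this on small synthetic data with a deliberately tiny grid; the data, the k=5 setting and the grid values are assumptions made for the example and are not the article's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for xTrain / yTrain (illustration only)
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 12))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe = Pipeline([
    ("select", SelectKBest(k=5)),
    ("classify", RandomForestClassifier(random_state=10, max_features="sqrt")),
])
grid = {"classify__n_estimators": [10, 20], "classify__max_depth": [3, 4]}

search = GridSearchCV(estimator=pipe, param_grid=grid, scoring="roc_auc", cv=3)
search.fit(X, y)

# With the default refit=True, best_estimator_ is the full pipeline
# (feature selection + forest) retrained on all of X, y
best = search.best_estimator_
pred = best.predict(X[:5])
print(search.best_params_, pred)
```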

Exporting the results

predictions = rfc.predict(xTest)
output = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':predictions.astype('int64')})
output.to_csv('my_submission.csv', index=False)
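
The submission file is a two-column CSV with a header row. The cast to int64 matters because Survived became a float column once the unlabeled test rows were concatenated in, so the classifier predicts 0.0 / 1.0. The sketch below builds the same file layout in memory with the standard library; the IDs and predictions are made-up values.

```python
import csv
import io

# Made-up IDs and float predictions, for illustration only
passenger_ids = [892, 893, 894]
predictions = [0.0, 1.0, 0.0]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["PassengerId", "Survived"])
for pid, pred in zip(passenger_ids, predictions):
    writer.writerow([pid, int(pred)])  # write 0/1 rather than 0.0/1.0

submission_text = buffer.getvalue()
print(submission_text)
```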

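6. Submitting for scoring

  • Officially recommended tutorial

Appendix: complete code

  • Exported from Jupyter Notebook as a Python script; for the ipynb version, follow the link
  • GitHub source code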

# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'

# %%
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# %% [markdown]
# # Features
# Variable | Definition | Key
# :-:|:-:|:-:
# survival | Survival | 0 = No, 1 = Yes
# pclass | Ticket class (cabin class) | 1 = 1st, 2 = 2nd, 3 = 3rd
# sex | Sex
# Age | Age in years
# sibsp | # of siblings / spouses aboard the Titanic (collateral relatives)
# parch | # of parents / children aboard the Titanic (direct relatives)
# ticket | Ticket number
# fare | Passenger fare
# cabin | Cabin number
# embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

# %%
train = pd.read_csv("./kaggle/input/titanic/train.csv")
test = pd.read_csv("./kaggle/input/titanic/test.csv")
allData = pd.concat([train, test], ignore_index=True)
# dataNum = train.shape[0]
# featureNum = train.shape[1]
train.head()

# %%
# Sex
sns.countplot(x="Sex", hue="Survived", data=train)
plt.show()

# %%
# Pclass
sns.barplot(x="Pclass", y="Survived", data=train)
plt.show()
# Pclass & Age
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=train, split=True)
plt.show()

# %%
# FamilySize = SibSp + Parch + 1
allData["FamilySize"] = allData["SibSp"] + allData["Parch"] + 1
sns.barplot(x="FamilySize", y="Survived", data=allData)
plt.show()

# %%
# Embarked
sns.countplot(x="Embarked", hue="Survived", data=train)
plt.show()

# %%
# Age
sns.stripplot(x="Survived", y="Age", data=train, jitter=True)
plt.show()
facet = sns.FacetGrid(train, hue="Survived", aspect=2)
facet.map(sns.kdeplot, "Age", fill=True)  # fill replaces the deprecated shade argument
facet.set(xlim=(0, train["Age"].max()))
facet.add_legend()
plt.xlabel("Age")
plt.ylabel("density")
plt.show()
# Age & Sex
sns.swarmplot(x="Age", y="Sex", data=train, hue="Survived")
plt.show()

# %%
# Fare
sns.stripplot(x="Survived", y="Fare", data=train, jitter=True)
plt.show()

# %%
# Name
# allData['Title'] = allData['Name'].str.extract('([A-Za-z]+)\.', expand=False) # not sure what str.extract does here
allData["Title"] = allData["Name"].apply(
    lambda x: x.split(",")[1].split(".")[0].strip()
)
# pd.crosstab(allData['Title'], allData['Sex'])
TitleClassification = {
    "Officer": ["Capt", "Col", "Major", "Dr", "Rev"],
    "Royalty": ["Don", "Sir", "the Countess", "Dona", "Lady"],
    "Mrs": ["Mme", "Ms", "Mrs"],
    "Miss": ["Mlle", "Miss"],
    "Mr": ["Mr"],
    "Master": ["Master", "Jonkheer"],
}
TitleMap = {}
for title in TitleClassification.keys():
    TitleMap.update(dict.fromkeys(TitleClassification[title], title))
    """
    # cnt = 0
    for name in TitleClassification[title]:
        cnt += allData.groupby(['Title']).size()[name]
    # print (title,':',cnt)
    """
allData["Title"] = allData["Title"].map(TitleMap)
sns.barplot(x="Title", y="Survived", data=allData)
plt.show()

# %%
# Ticket
TicketCnt = allData.groupby(["Ticket"]).size()
allData["SameTicketNum"] = allData["Ticket"].apply(lambda x: TicketCnt[x])
sns.barplot(x="SameTicketNum", y="Survived", data=allData)
plt.show()
# allData['SameTicketNum']

# %% [markdown]
# # Data cleaning
# - Sex & Pclass & Embarked --> One-Hot
# - Age & Fare --> Standardize
# - FamilySize & Name & Ticket --> ints --> One-Hot

# %%
# Sex
allData = allData.join(pd.get_dummies(allData["Sex"], prefix="Sex"))
# Pclass
allData = allData.join(pd.get_dummies(allData["Pclass"], prefix="Pclass"))
# Embarked
allData[allData["Embarked"].isnull()]  # inspect the rows with a missing value
allData.groupby(by=["Pclass", "Embarked"]).Fare.median()  # Pclass=1, Embarked=C: median ≈ 76
allData["Embarked"] = allData["Embarked"].fillna("C")
allData = allData.join(pd.get_dummies(allData["Embarked"], prefix="Embarked"))

# %%
# Age
allData["Child"] = allData["Age"].apply(lambda x: 1 if x <= 10 else 0)  # child flag
allData["Age"] = (allData["Age"] - allData["Age"].mean()) / allData["Age"].std()  # standardize
allData["Age"] = allData["Age"].fillna(0)  # fill missing values (0 = mean after standardization)
# Fare
allData["Fare"] = allData["Fare"].fillna(25)  # fill the missing value
# Cap extreme fares in the training rows; .loc is needed because chained indexing assigns to a copy
isTrain = allData["Survived"].notnull()
allData.loc[isTrain & (allData["Fare"] > 500), "Fare"] = 300.0
allData["Fare"] = (allData["Fare"] - allData["Fare"].mean()) / allData["Fare"].std()  # standardize

# %%
# FamilySize
def FamilyLabel(s):
    if s == 4:
        return 4
    elif s == 2 or s == 3:
        return 3
    elif s == 1 or s == 7:
        return 2
    elif s == 5 or s == 6:
        return 1
    elif s < 1 or s > 7:
        return 0


allData["FamilyLabel"] = allData["FamilySize"].apply(FamilyLabel)
allData = allData.join(pd.get_dummies(allData["FamilyLabel"], prefix="Fam"))

# Name
TitleLabelMap = {
    "Mr": 1.0,
    "Mrs": 5.0,
    "Miss": 4.5,
    "Master": 2.5,
    "Royalty": 3.5,
    "Officer": 2.0,
}


def TitleLabel(s):
    return TitleLabelMap[s]


# allData['TitleLabel'] = allData['Title'].apply(TitleLabel)
allData = allData.join(pd.get_dummies(allData["Title"], prefix="Title"))

# Ticket
def TicketLabel(s):
    if s == 3 or s == 4:
        return 3
    elif s == 2 or s == 8:
        return 2
    elif s == 1 or s == 5 or s == 6 or s == 7:
        return 1
    elif s < 1 or s > 8:
        return 0


allData["TicketLabel"] = allData["SameTicketNum"].apply(TicketLabel)
allData = allData.join(pd.get_dummies(allData["TicketLabel"], prefix="TicNum"))

# %%
# drop unused features
allData.drop(
    [
        "Cabin",
        "PassengerId",
        "Ticket",
        "Name",
        "Title",
        "Sex",
        "SibSp",
        "Parch",
        "FamilySize",
        "Embarked",
        "Pclass",
        "FamilyLabel",
        "SameTicketNum",
        "TicketLabel",
    ],
    axis=1,
    inplace=True,
)

# split the dataset again
train_data = allData[allData["Survived"].notnull()]
test_data = allData[allData["Survived"].isnull()]
test_data = test_data.reset_index(drop=True)

xTrain = train_data.drop(["Survived"], axis=1)
yTrain = train_data["Survived"]
xTest = test_data.drop(["Survived"], axis=1)

# allData.columns.to_list()

# %%
# correlation between features
Correlation = pd.DataFrame(allData[allData.columns.to_list()])
colormap = plt.cm.viridis
plt.figure(figsize=(24, 22))
sns.heatmap(
    Correlation.astype(float).corr(),
    linewidths=0.1,
    vmax=1.0,
    cmap=colormap,
    linecolor="white",
    annot=True,
    square=True,
)
plt.show()

# %% [markdown]
# # Grid search over random forest parameters
# - n_estimator
# - max_depth

# %%
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest

# %%
pipe = Pipeline(
    [
        ("select", SelectKBest(k=10)),
        ("classify", RandomForestClassifier(random_state=10, max_features="sqrt")),
    ]
)
param_test = {
    "classify__n_estimators": list(range(20, 100, 5)),
    "classify__max_depth": list(range(3, 10, 1)),
}
gsearch = GridSearchCV(estimator=pipe, param_grid=param_test, scoring="roc_auc", cv=10)
gsearch.fit(xTrain, yTrain)
print(gsearch.best_params_, gsearch.best_score_)

# %%
rfc = RandomForestClassifier(
    n_estimators=70, max_depth=6, random_state=10, max_features="sqrt"
)
rfc.fit(xTrain, yTrain)
predictions = rfc.predict(xTest)

output = pd.DataFrame(
    {"PassengerId": test["PassengerId"], "Survived": predictions.astype("int64")}
)
output.to_csv("my_submission.csv", index=False)